Machine Learning

1. Tmp
2. Supervised Learning
3. Support Vector Machine
4. Unsupervised Learning
- 4.1. K-means
- 4.2. Hierarchical Clustering

All models are wrong, but some are useful.

1 Tmp

false positive: error, output true, but supposed to output false
false negative: error, output false
true positive: correct
true negative: correct

2 Supervised Learning

The input is also called predictors, independent variables, or features. The output is also called response or dependent variables.

The quantitative measurement means some measurements are bigger than others, and they are close in value means they are close in nature as well. Qualitative variables are also refereed to as categorical variable or discrete variable. We call a problem regression when we predict quantitative outputs, and classification when we predict qualitative outputs.

Parameter based approaches, like linear models, make huge assumptions about the structure, thus are stable (the decision boundary is very smooth) but inaccurate. Non-parametric methods, like k-nearest neighbors, makes very mild structural assumptions, thus often accurate but not stable. Since it makes very mild assumption, it can adapt to many situations.

2.1 General Concepts

Euclidean distance
reduce the fitting problem to estimating a set of coefficients. But it will be imprecise if the model choose-d is very different from the true model
a very large number of observations is required.

2.2 Linear Regression

Least square method is the one that train the model by minimizing the residual sum of square (RSS), defined as where \(\beta\) is the coefficients:

\[RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2\]

It can also be represented as vector product:

\[RSS(\beta) = (y - X\beta)^T (y - X\beta)\]

Differentiating with respect to \(\beta\):

\[\frac{\partial RSS}{\partial{\beta}} = -2X^T (y-X\beta)\] \[\frac{\partial^2 RSS}{\partial \beta \partial \beta^T} = 2X^TX\]

Setting the first derivative to zero, we obtain the result \[X^T(y-X\beta) = 0\] \[\hat{\beta} = (X^TX)^{-1}X^Ty\]

Further we can have the prediction of y being

\[\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy\]

Where \(H=X(X^TX)^{-1}X^T\) is called hat matrix because it puts hat on \(y\). On the geometrical representation, we can see actually \(H\) computes the orthogonal projection of y onto the space created by the input vector \(x\). Thus it is also called projection matrix.

To test the hypothesis that a particular coefficient \(\beta_j=0\), we need Z-score. To test for a group of coefficients simultaneously, we use F-statistic.

The Gauss-Markov theorem states that the least squares estimates of the parameters \(\beta\) have the smallest variance among all linear unbiased estimates. Formally, if we have any other linear estimator \(\theta=c^Ty\) that is unbiased for \(a^T\beta\), i.e. \(E(c&Ty)=a^T\beta\), then

\[Var(a^T\hat{\beta}) \le Var(c^Ty)\]

This theorem implies that the least square estimator has the smallest mean square error of all linear estimators with no bias.

In terms of multiple variables, the model is called multiple linear regression model. The one variable version is thus called univariate linear model. Gram-Schmidt procedure is used for multiple regression. In case of multiple outputs, they do not affect one another's least square estimate.

2.3 Linear Classification

This kind of problem estimate \(Pr(G=k|X=x)\) where G is the categories.

Bayes classifier: classify to the most probable class, using conditional distribution. The error rate of Bayes classifier is called the Bayes rate.

Bayes Classifier: assign each observation to the most likely class, given its predictor values.
Bayes Error Rate: Bayes classifier produces the lowest possible test error rate. The Bayes error rate is analogous to the irreducible error. This means it is the optimal value. So Bayes classifier serves as an unattainable gold standard against which to compare other methods.
Bayes Decision Boundary: determine the prediction

Linear Discriminant Analysis (LDA) solves the case when we assume the classes have a common covariance matrix. If this assumption is not correct, we use Quandratic Discriminant Analysis (QDA). There is a method Regularized Discriminant Analysis (RDA) that compromises between LDA and QDA, and allows one to shrink the separate covariances of QDA toward a common covariance as in LDA, very similar to ridge regression.

How to choose from LDA or QDA depends on the bias-variance trade-off. LDA is better when:

LDA is a much less flexible classifier than QDA.
So it has a much lower variance.
It improve the prediction performance.

QDA is better when:

if the assumption of common covariance matrix is bad, LDA is bias
if the training set is very large, the variance of the classifier is not a major concern.

The posterior probabilities of K classes are not sum to 1. But this is often not convenient. So logistic regression is invented. It models the posterior probabilities of K classes via linear functions in \(x\), while at the same time ensuring that they sum to one and remain in [0,1]. The model has the form of K-1 log-odds, or logic transformation:

\begin{eqnarray} log \frac{Pr(G=1|X=x)}{Pr(G=K|X=x)} & = & \beta_{10} + \beta_1^Tx\\ log \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)} & = & \beta_{20} + \beta_2^Tx\\ \cdots\\ log \frac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)} & = & \beta_{(K-1)0} + \beta_(K-1)^Tx\\ \end{eqnarray}

The logistic regression model are usually fit by maximum likelihood, using the conditional likelihood of G given X. The log-likelihood for N observations is:

\[\ell(\theta) = \sum_{i=1}^N log p_{g_i} (x_i;\theta)\]

Where \(p_k(x_i;\theta) = Pr(G=k|X=x_i;\theta)\), and this can be obtained from the model definition. Take the derivation and set to zero, and take the second-order derivation (i.e. Hessian matrix), we can get the solution to the optimization problem.

Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome.

The LDA and logistic regression appear very similar. The difference lies in the way the linear coefficients are estimated. The logistic regression is more general in that it makes less assumptions. The logistic regression estimate the parameter by maximizing the conditional likelihood \(Pr(G=k|X)\), while the LDA maximize the full log-likelihood \(Pr(X,G=k)\).

2.4 Basis Expansion and Regularization

These linear model can be extended to non-linear through simple transformations.

Basis expansion applies a function transformation on the input, to transform it and use the result as the new input. For example, we can use \(x_i^2\) as the input, and thus linear regression can have non-linear (in this case polynomial) results. Specifically the model has the form:

\[f(X) = \sum_{m=1}^M \beta_m h_m(X)\]

Where \(h_m\) is the transformation function.

Polynomial regression is just replace standard linear model to higher dimension ones (typically less than 4). The one with \(X,X^2,X^3\) is called cubic regression.

The first kind of transformation is to divide the input X into continuous intervals, and represent each interval a polynomial function. This is called piecewise polynomial or spline.

Also called piecewise constant regression. It actually piecewise the data, and do linear regression. The linear model is

\begin{eqnarray} y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + ... + \beta_K C_K(x_i) + \epsilon_i \end{eqnarray}

Given a value X, there's at most one of \(C_i\) can be non-zero.

regression spline is piecewise polynomial. But it ensures the smooth at the knots. We have K knots, and fit a cubic regression. At the knots, we need to ensure the 0,1,2 deviation is the same.

Adaptively chosen basis function methods are known as dictionary methods.

Regularization method is also called penalty function. It expresses our prior belief that the type of functions we seek exhibit a certain type of smooth behavior. This is realized through introducing a penalizing function to RSS:

\[PRSS(f;\lambda) = RSS(f) + \lambda J(f)\]

The \(\lambda\) is called smoothing parameter, and is fixed. \(J(f)\) should be chosen such that it will be large for functions f that vary too rapidly over small regions of input space. For example, for cubic smoothing spline, it is

\[PRSS(f;\lambda) = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2dx\]

The smoothing parameter encodes the degree of the spline, and the number and placement of knots, and can be automatically selected.

Smoothing spline is a different approach, but also produces a spline. Instead of making RSS minimal, we make the following minimal

\begin{eqnarray} RSS = \sum_{i=1}^n (y_i - g(x_i))^2 \end{eqnarray}

We need to find a \(g\). If we do not put any constraints, we can simply let \(g\) equal to \(y_i\). But this is overfitting. We need some constraints on \(g\). We want to find the \(g\) that minimizes:

\begin{eqnarray} \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2dt \end{eqnarray}

The function \(g\) that minimizes it is a smoothing spline.

The first term is a loss function, nd second is a penalty term.

2.5 KNN

In KNN, all data are the model. The prediction is made by averaging the nearest k observations. In particular, the output is computed by:

\[\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i\]

Where the \(N_k(x)\) is the k closest points of \(x\) in the training sample.

2.6 Subset selection

For two reasons we might want to use subset of input:

prediction accuracy: the least square estimates often have low bias but large variance.
interpretation: we would like to determine a smaller subset that exhibit the strongest effects.

The best-subset selection explore all possible subset and select the one with the smallest residual sum of square (RSS). However, this is of course expensive. Instead, we can use a search. The forward step-wise selection starts with the intercept, then sequentially adds into the model the predictor that improves the fit the most. Clearly this is a greedy algorithm. The backward step-wise selection starts with the full model, and sequentially deletes the predictor that has the least impact to the fit. The candidate for dropping is the one with the smallest Z-score. There is also another called forward stage-wise regression (FS), which is more constrained. At each step, it identifies the variable most correlated with the current residual, then computes the simple linear regression coefficient on this chosen variable, then adds it to the corrent coefficient for that variable. This process continue until none of the variable have correlation with the residual, i..e the least square fit. Unlike the forward step-wise regression, none of hte other variables are adjusted when a term is added to the model. As a result, it may take more steps. However, this "slow fitting" seems to be more effective in high-dimensional problems. There is also a hybrid approach. After adding each new variable, the method may also remove any variables that no longer provide an improvement.

Shrinkage method is a generalize of the subset selection. Subset selection remove a variable at one time, in another word, it is discrete process. Thus it exhibits high variance.

Ridge regression shrinks the regression coefficients by imposing a penalty on their size. Ridge regression protects against the potentially high variance of gradients estimated in the short direction. The assumption is that the response will tend to vary most in the directions of high variance of the input. Thus the name "ridge".

The ordinary least squares minimize:

\begin{eqnarray} RSS = \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij})^2 \end{eqnarray}

and ridge regression introduce a shrinkage penalty:

\begin{eqnarray} RSS + \lambda \sum_{j=1}^p \beta^2_j \end{eqnarray}

When \(\beta_j\) is small toward 0, the above added penalty will be small. So it will make the \(\beta_j\) smaller, i.e. shrinkage. The parameter \(\lambda\) is critical for the influence of the penalty.

Actually it uses the l₂ norm.

\begin{eqnarray} ||\beta||_2 = \sqrt{\sum_{j=1}^p} \beta^2_j \end{eqnarray}

Principal Components Regression (PCR) and Partial Least Square (PLS) use derived input. PCR uses all the input, but PLS also uses \(y\) in addition to \(x\). Both are similar to Ridge regression because they shrinks smoothly.

Lasso falls somewhere between ridge regression and best subset regression.

Ridge regression cannot remove any features, unless \(\lambda = \infty\). This may not be a problem for prediction accuracy but it can create a challenge in model interpretation.

The Lasso uses l₁ norm penalty.

\begin{eqnarray} ||\beta||_1 = \sum |\beta_j| \end{eqnarray}

\(l_1\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to 0 when the tuning parameter \(\lambda\) is sufficiently large.

It is much easier to interpret, it yields sparse model, i.e. models that involve only a subset of the variables.

As a conclusion, ridge regression does a proportional shrinkage. Lasso translate each coefficient by a constant factor, truncating at zero. Best-subset selection drops all variables with coefficients smaller than the M-th largest.

2.7 Kernel Methods and local regression

Kernel means local. Kernel method is generally applicable, e.g. to KNN. KNN is discontinuous, but this is not necessary. The idea is to use only observations close to target point to make fit the model. This is achieved by assigning a kernel function \(K_\lambfda(x_0,x)\) which assigns weights to the points in the region near \(x_0\) (according to distance). The model, however, is the entire training data. As a result, the RSS to minimize becomes:

\[RSS(f_\theta, x_0) = \sum_{i=1}^N K_\lambda (x_0, x_i)(y_i - f_\theta (x_i))^2\]

The \(\lambda\) is a parameter that control the width of the kernel. However, local regression becomes less useful in dimensions much higher than 2. It is impossible to simultaneously maintain localness (low bias) and a sizable sample in the neighborhood (low variance) as the dimension increases, without the total number of sample increasing exponentially. This can not be solved unless we make some structural assumption about the model.

2.8 High Dimension

It seems that for KNN, if given more data, we can make more precise predictions. However, this intuition does not work when the dimension is high. This is known as curse of dimensionality. This is because if the dimension is high, the KNN need not to be close to the target point. In general, any method that attempts to produce local varying functions in small isotropic neighborhoods will run into problems in high dimensions.

Dimension Reduction techniques are discussed here.

linear combination of the predictors into M new predictors.

\begin{eqnarray} Z_m = \sum_{j=1}^p \phi_{jm} X_j \end{eqnarray}

Principal Component Analysis (PCA): The following are some criteria for the direction selection, they all talk about the same thing:

The first principal component direction is that along which the observation vary the most.
This also yields the highest variance.
It also defines the line that is as close as possible to the data.
projected observations are as close as possible to the original observations.

The second principal component \(Z_2\) is a linear combination of the variables that is uncorrelated with \(Z_1\), and has largest variance subject to this constraint. Actually \(Z_1\) and \(Z_2\) are always orthogonal.

Principal Component Regression (PCR): Construct the first M principal components, and do linear regression on the new predictors.

Partial Least Squares (PLS): The directions identified by PCA is in an unsupervised way, i.e. it does not use response Y.

Set each \(\phi_{j1}\) equal to the coefficient from teh simple linear regression of Y onto X_j. Intuitively PLS places the highest weight on the variables that are most strongly related to the response.

Second PLS direction is by

adjust each of the variables for Z₁, by regressing each variable on Z₁ and taking residuals This captures the remaining information that has not been explained by the first PLS direction
use this orthogonalized data in exactly the same fashion as Z₁.
Repeat M times.

2.9 Model Assessment

Test error is also called generalization error, prediction error, and is the error over test sample. Training error is the error on training sample.

mean square error (MSE) is defined as:

\begin{eqnarray} MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \end{eqnarray}

The bias-variance trade-off:

variance: the amount by which \(\hat{f}\) would change if we estimated it using a different training data set. More flexible statistical methods have higher variance.
bias: the error that is introduced by approximating a real-life problem. E.g. it is unlikely that any real-life problem has simple linear relationship. More flexible methods result in less bias.
95% confidence interval: a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.

The prediction error can be assessed by some criterion. Akaike Information Criterion (AIC) is defined as

\[AIC = - \frac{2}{N} loglik + 2 \frac{d}{N}\]

Bayesian Information Criterion (BIC), also known as Schwarz criterion, is similar to AIC, and defined as:

\[BIC = -2 loglik + (log N) d\]

Choosing the model with minimum BIC is equivalent to choosing the model with largest (approximate) posterior probability, thus the name. The Minimum Description Length (MDL) is formally identical to BIC, but from different point of view.

AS the model complexity increases, the variance tends to increase and the squared bias tends to decrease. Vapnik-Chervonenkis Dimension (VC-Dimension) provides the measurement of model complexity.

In K-fold cross validation, we split the data into K roughly equal-sized parts. For the k-th part, we fit the model using other data, and use it as the test data. The prediction error is the average of the K experiments. When K=N, it is called leave-one-out cross-validation. In this case, it is unbiased, but has high variance.

The bootstrap method is also for assessing the accuracy, as cross validation. It randomly draw-with-replacement from the training data. This is done B times, and producing B bootstrap datasets. The model is fit on each of the data set, and all the prediction errors are examined.

2.10 Model Inference

prediction: \(\hat{f}\) is treated as black box
inference: understand the relationship between X and Y. \(\hat{f}\) cannot be treated as black box.

Model inference is the problem of estimate the model parameters.

The above bootstrap can be used for model inference. This is called non-parametric bootstrap. The parametric bootstrap does not use raw data, but simulate the new response by adding Gaussian noise to the predicted values.

In essence, bootstrap is a computer implementation of nonparametric or parametric maximum likelihood. Also, bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for the parameter. But this bootstrap distribution is obtained painlessly, without having to specify a prior, and without having to sample from the posterior distribution. Thus, the bootstrap distribution is also called "poor man's" Bayes posterior.

In Bayesian model, we need to draw samples from the resulting posterior distribution. Typically the Markov Chain Monte Carlo (MCMC) method is used. Gibbs is one kind of MCMC.

Bagging is also called bootstrap aggregation, it is a method to use bootstrap not to access the accuracy, but to improve the prediction itself. It makes the prediction over a collection of bootstrap samples, thus reducing the variance. The prediction result is defined as the averaging of all predictors for all samples.

The decision tree suffers from high variance. If we split the training data into two parts at random, the result two trees can be very different. Bagging can reduce the variance. It is related to bootstrap.

Bagging involves

creating multiple copies of the original training data set using the bootstrap,
fitting a separate decision tree to each copy,
and then combining all of the trees in order to create a single predictive model.

Each tree is built on a bootstrap data set, independent of the other trees. The key idea is averaging a set of observations reduces variance.

Random Forrest is a substantial modification of bagging that builds a collection of trees and then averages them. The essential idea in bagging is to average many noisy but approximately unbiased models, and hence reduce the variance. Trees are ideal candidates for bagging because they can capture complex interaction structure in the data, and if grown sufficiently deep, have relatively low bias. However, the size of the correlation of bagged trees limits the benefit of averaging. The random forest is to improve the variance reduction of bagging by reducing the correlation between trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables. Specifically:

Before each split, select \(m\le p\) of the input variables at random as candidates for splitting.

Bumping is also based on bootstrap, but is a stochastic process: it randomly walk through all the models in the bootstrap samples. This helps the fitting to avoid getting stuck in (some?) local minima.

2.11 Maximum Likelihood Inference

Given a probability density or probability mass function for the observation:

\[z_i \sim g_\theta(z)\]

where the \(\theta\) is unknown parameter. The maximum likelihood function is:

\[L(\theta;Z) = \prod_{i=1}^N g_\theta(z_i)\]

This is the probability of the observed data under the model \(g_\theta\). The log of this likelihood, is called the log-likelihood:

\[\ell(\theta;Z) = \sum_{i=1}^N \ell(\theta;z_i) = \sum_{i=1}^N log g_\theta(z_i)\]

Each value in the summary is called log-likelihood component. The method of maximum likelihood chooses the value of \(\theta\) to maximize \(\ell(\theta;Z)\).

2.12 Tree-based method

Tree-based methods partition the feature space into a set of rectangles, then fit a model in each one.

regression tree:

divide the predictor space into J distinct and non-overlapping regions \(R_1,...,R_J\).
for each observation fail into R_j, make the prediction using the mean in R_j.

To get the regions, use recursive binary splitting, a top-down, greedy approach.

From the root
every split choose the best split that leads to the greatest possible reduction of RSS

It is likely to overfit the data. So we can grow a very large tree, and then prune it back in order to obtain a subtree.

The whole algorithm goes here:

recursive binary splitting to grow a large tree
apply cost complexity pruning
use K-fold cross-validation

Patient Rule Induction Method (PRIM) is also a tree-based method, but with different splitting approach. It seeks for boxes in which response average is high. Hence, it looks for maxima in the target function. It works from the top down, starting with a box containing all the data. It repeatedly peel off data.

Hierarchical Mixture of Experts (HME) is a variant of tree-based method. The tree split not by hard decision, but by probability. The observation goes left or right with probability depending on its input value. The terminal node is called expert, the non-terminal node is called gating networks. The idea is that each expert provides an opinion (prediction) about the response, and these are combined together by the gating network.

2.13 Additive Model

Generalized Additive Model (GAM) has the form

\[E(Y|X_1,X_2,...,X_p) = \alpha + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)\]

where the \(f_i\) are unspecified smooth (nonparametric) functions. It is called additive model because we calculate a separate \(f_j\) for each \(X_j\), and add together all of their contributions. Additive models provide a useful extension of linear models, making them more flexible while retaining much of their interpretability.

Boosting method combines the output of many weak classifier to produce a powerful committee, through a weighted majority vote, with the assumption that they do not make the same mistake.

Boosting is another approach for improving the prediction results from a decision tree. The different from bagging is,

the trees are grown sequentially: each tree is grown using information from previously grown trees.
Boosting does not involve bootstrap sampling. Each tree is fit on a modified version of the original data set.

Ensemble learning builds a prediciton model by combining the strengths of a collection of simpler base models. Bagging and random forest are ensemble methods for classification. The Importance Sampled Learning Ensemble (ISLE) is one ensemble-generation method.

2.14 Other

Multivariate Adaptive Regression Splines (MARS) is an adaptive procedure for regression, and well suited for high-dimensional problems. It can be viewed as generalization of step-wise linear regression.

3 Support Vector Machine

A linear hyperplane can separate classes, but there exists an infinite number of such hyperplanes. There is a classifier called Maximal Margin Classifier, also known as optimal separating hyperplane. The separating hyperplane that is farthest from the training observations. That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest—that is, it is the hyperplane that has the farthest minimum dis- tance to the training observations. The closest observations are support vectors. they “support” the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well.

However, the above classifier is not stable, and more importantly, it is still linear decision boundary, and cannot be applied in non-separable cases. Thus the Support Vector Classifier (SVM), also known as soft margin classifier, is introduce. It the best “out of the box” classifiers. It allows some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.

Some observations:

only observations that either lie on the margin or that violate the margin will affect the hyperplane
an observation that lies strictly on the correct side of the margin does not affect the support vector classifier
Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.
When the tuning parameter C is large, then the margin is wide

The support vector machine is a linear classifier, but it can be extended into non-linear through using non-linear kernels. Kernel trick is basis expansion.

linear kernel:

\begin{eqnarray} K(x_i, x_{i'}) = \sum_{j=1}^p x_{ij} x_{i'j} \end{eqnarray}

polynomial kernel:

\begin{eqnarray} K(x_i, x_{i'}) = (1 + sum_{j=1}^p x_{ij} x_{i'j})^d \end{eqnarray}

Radial kernel:

\begin{eqnarray} K(x_i, x_{i'}) = exp(-\gamma \sum_{j=1}^p (x_{ij} - x{i'j})^2) \end{eqnarray}

SVMs can be extended easily to more than 2 classes. One-versus-one classification computes all pairs SVMs, while one-versus-all classification computes all one versus all other SVMs.

4 Unsupervised Learning

Unsupervised learning is typically used for Clustering.

4.1 K-means

K-means method ensures the total within-cluster variation, summed over all K clusters, is as small as possible. It defines the within-cluster variation. The formula for it is: the sum of all of the pairwise squared Euclidean distances between the observations in the kth cluster. W(C_k) depicts the amount by which the observations within a cluster differ from each other

\begin{eqnarray} W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 \end{eqnarray}

where \(|C_k|\) denotes the number of observations in the kth cluster.

The algorithm:

select a number K, randomly assign a clustering from 1 to K for each observation
iterate until cluster assignments stop changing
1. for each cluster, compute centroid: the vector of the p features means for the observations in the kth cluster.
2. assign each observation to the cluster whose centroid is closest.

This algorithm guarantee to decrease the objective formula above.

4.2 Hierarchical Clustering

The downside of K-mean is the need for predefined number of clusters. Hierarchical Clustering does not predefine the number of clusters. The result is called a dendrogram, a tree-based representation of the observations.

It is constructed bottom-up. The tree node means a fusion. The height of the fusion indicates how different the two observations are. Never compare the horizontal distance. Construction algorithm: examine all pairwise inter-cluster dissimilarities among all clusters. Fuse the most similar ones.

The four most commonly used types of linkage:

complete: maximal intercluster dissimilarity
single: minimal intercluster dissimilarity
average: mean intercluster dissimilarity
centroid: dissimilarity between the centroid of cluster A and B