# Machine Learning

## Table of Contents

- 1. Supervised Learning
- 1.1. General Concepts
- 1.2. Linear Regression
- 1.3. Linear Classification
- 1.4. Basis Expansion and Regularization
- 1.5. KNN
- 1.6. Subset selection
- 1.7. Kernel Methods and local regression
- 1.8. High Dimension
- 1.9. Model Assessment
- 1.10. Model Inference
- 1.11. Maximum Likelihood Inference
- 1.12. Tree-based method
- 1.13. Additive Model
- 1.14. Other

- 2. Support Vector Machine
- 3. Unsupervised Learning

All models are wrong, but some are useful.

## 1 Supervised Learning

The *input* is also called *predictors*, *independent variables*, or
*features*. The *output* is also called *response* or *dependent
variables*.

A *quantitative* measurement means that some measurements are bigger than
others, and measurements that are close in value are also close in nature.
*Qualitative* variables are also referred to as *categorical* or *discrete*
variables. We call a problem *regression* when we predict quantitative
outputs, and *classification* when we predict qualitative outputs.

Parametric approaches, like linear models, make strong assumptions about the structure, and are thus stable (the decision boundary is very smooth) but possibly inaccurate. Non-parametric methods, like k-nearest neighbors, make very mild structural assumptions, and are thus often accurate but not stable. Because the assumptions are mild, they can adapt to many situations.

### 1.1 General Concepts

- Euclidean distance is the usual measure of closeness between input points.
- Parametric methods reduce the fitting problem to estimating a set of coefficients, but the fit will be imprecise if the chosen model is very different from the true model.
- Non-parametric methods instead require a very large number of observations.

### 1.2 Linear Regression

The *least squares* method trains the model by minimizing the
*residual sum of squares (RSS)*, defined as follows, where \(\beta\) is the
coefficient vector:

\[RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2\]

It can also be written as a vector product:

\[RSS(\beta) = (y - X\beta)^T (y - X\beta)\]

Differentiating with respect to \(\beta\):

\[\frac{\partial RSS}{\partial{\beta}} = -2X^T (y-X\beta)\] \[\frac{\partial^2 RSS}{\partial \beta \partial \beta^T} = 2X^TX\]

Setting the first derivative to zero, we obtain the result \[X^T(y-X\beta) = 0\] \[\hat{\beta} = (X^TX)^{-1}X^Ty\]

Further we can have the prediction of y being

\[\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy\]

Where \(H=X(X^TX)^{-1}X^T\) is called the *hat matrix* because it puts a hat
on \(y\). Geometrically, \(H\) computes the orthogonal projection of \(y\) onto
the subspace spanned by the input vectors, so it is also called the
*projection matrix*.
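
To make the formulas concrete, here is a minimal numpy sketch (synthetic, illustrative data) of the normal-equation solution \(\hat{\beta} = (X^TX)^{-1}X^Ty\) and the hat matrix:

```python
# A minimal sketch of least squares via the normal equations (numpy only);
# the data below are synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept column + inputs
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=N)

# beta_hat = (X^T X)^{-1} X^T y  (np.linalg.solve is preferred over an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat (projection) matrix H = X (X^T X)^{-1} X^T, so that y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(beta_hat)                           # close to true_beta
print(np.allclose(y_hat, X @ beta_hat))   # True: both are the projection of y
```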

To test the hypothesis that a particular coefficient \(\beta_j=0\), we
need *Z-score*. To test for a group of coefficients simultaneously, we
use *F-statistic*.

The *Gauss-Markov theorem* states that the least squares estimates of
the parameters \(\beta\) have the smallest variance among all linear
unbiased estimates. Formally, if we have any other linear estimator
\(\theta=c^Ty\) that is unbiased for \(a^T\beta\),
i.e. \(E(c^Ty)=a^T\beta\), then

\[Var(a^T\hat{\beta}) \le Var(c^Ty)\]

This theorem implies that the least squares estimator has the smallest mean squared error among all linear estimators with no bias.

With more than one input variable, the model is called a *multiple linear
regression model*; the one-variable version is called a *univariate linear
model*. The *Gram-Schmidt procedure* (successive orthogonalization) can be
used to compute the multiple regression estimates. In the case of multiple
outputs, the outputs do not affect one another's least squares estimates.

### 1.3 Linear Classification

This kind of problem estimates \(Pr(G=k|X=x)\), where \(G\) is the categorical output.

*Bayes classifier*: classify to the most probable class, using the
conditional distribution. The error rate of the Bayes classifier is called
*the Bayes rate*.

- Bayes classifier: assign each observation to the most likely class, given its predictor values.
- Bayes error rate: the error rate of the Bayes classifier, which is the lowest possible test error rate. It is analogous to the irreducible error, so the Bayes classifier serves as an unattainable gold standard against which to compare other methods.
- Bayes decision boundary: the boundary, induced by the classifier, that determines the prediction.

*Linear Discriminant Analysis (LDA)* covers the case where we assume the
classes share a common covariance matrix. If this assumption is not
correct, we use *Quadratic Discriminant Analysis (QDA)*. *Regularized
Discriminant Analysis (RDA)* compromises between LDA and QDA: it allows
one to shrink the separate covariances of QDA toward a common covariance
as in LDA, very much in the spirit of ridge regression.

The choice between LDA and QDA comes down to the bias-variance trade-off. LDA is better when:

- LDA is a much less flexible classifier than QDA,
- so it has much lower variance,
- which can improve prediction performance.

QDA is better when:

- if the common covariance matrix assumption is badly off, LDA is biased;
- if the training set is very large, the variance of the classifier is not a major concern.

If we model the posterior probabilities of the K classes directly with
linear functions, they are not guaranteed to sum to 1 or stay in [0,1],
which is inconvenient. *Logistic regression* models the posterior
probabilities of the K classes via linear functions in \(x\), while at the
same time ensuring that they **sum to one and remain in [0,1]**. The model
has the form of K-1 *log-odds*, or *logit transformations*:

\[\log \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)} = \beta_{k0} + \beta_k^T x, \qquad k = 1, \dots, K-1\]

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of G given X. The log-likelihood for N observations is:

\[\ell(\theta) = \sum_{i=1}^N \log p_{g_i} (x_i;\theta)\]

Where \(p_k(x_i;\theta) = Pr(G=k|X=x_i;\theta)\), which can be obtained from the model definition. Taking the derivative and setting it to zero, and using the second derivative (the Hessian matrix), we can solve the optimization problem numerically, typically with Newton-Raphson steps.
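
Below is a minimal numpy sketch (synthetic data, two classes only) of this maximum likelihood fit using Newton-Raphson updates built from the gradient and Hessian of the log-likelihood:

```python
# A minimal sketch of fitting two-class logistic regression by maximum likelihood,
# using Newton-Raphson steps (gradient and Hessian of the log-likelihood).
# Synthetic data; no regularization or convergence safeguards.
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # intercept + 2 features
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))           # fitted probabilities
    grad = X.T @ (y - p)                      # gradient of the log-likelihood
    W = p * (1 - p)                           # diagonal of the weight matrix
    hess = -(X * W[:, None]).T @ X            # Hessian of the log-likelihood
    beta = beta - np.linalg.solve(hess, grad) # Newton-Raphson update

print(beta)  # roughly recovers true_beta
```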

Logistic regression models are used mostly as a data analysis and
inference tool, where the goal is to understand the role of the input
variables in *explaining* the outcome.

LDA and logistic regression appear very similar; the difference lies in the way the linear coefficients are estimated. Logistic regression is more general in that it makes fewer assumptions: it estimates the parameters by maximizing the conditional likelihood \(Pr(G=k|X)\), while LDA maximizes the full likelihood based on the joint density \(Pr(X,G=k)\).

### 1.4 Basis Expansion and Regularization

These linear models can be extended to non-linear ones through simple transformations.

*Basis expansion* applies a function transformation on the input, to
transform it and use the result as the new input. For example, we can
use \(x_i^2\) as the input, and thus linear regression can have
non-linear (in this case polynomial) results. Specifically the model
has the form:

\[f(X) = \sum_{m=1}^M \beta_m h_m(X)\]

Where \(h_m\) is the transformation function.

Polynomial regression simply replaces the standard linear terms with higher-degree ones (typically of degree less than 4). The model with \(X,X^2,X^3\) is called cubic regression.
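
As a small illustration of basis expansion (synthetic data; the basis \(h_m(x)=x^m\) is just one possible choice), a cubic fit is still linear least squares in the expanded inputs:

```python
# A minimal sketch of basis expansion: fit a cubic polynomial with ordinary
# least squares by expanding x into the basis h(x) = (1, x, x^2, x^3).
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-2, 2, size=80))
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.size)  # a non-linear target

H = np.column_stack([x**m for m in range(4)])  # basis functions h_m(x) = x^m
beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # linear least squares in the new basis

y_hat = H @ beta  # non-linear in x, but linear in the expanded basis
print(beta)
```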

The first kind of transformation divides the input X into contiguous
intervals and fits a polynomial within each interval. This is called a
*piecewise polynomial*, or *spline*.

The simplest case is *piecewise constant regression* (step functions): cut
X into intervals, build an indicator basis function \(C_k(X)\) for each
interval, and fit a linear model on these indicators:

\[y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i\]

For a given value of X, at most one of the \(C_k\) can be non-zero.

A *regression spline* is a piecewise polynomial that is constrained to be
smooth at the knots. With K knots we fit a cubic polynomial in each
interval, and at each knot we require the function value and the first and
second derivatives to be continuous.

Adaptively chosen basis function methods are known as *dictionary
methods*.

*Regularization* (or *penalization*) methods express our prior belief that
the functions we seek exhibit a certain type of smooth behavior. This is
realized by adding a penalty function to the RSS:

\[PRSS(f;\lambda) = RSS(f) + \lambda J(f)\]

The \(\lambda\) is called the *smoothing parameter* and is held fixed. \(J(f)\)
should be chosen such that it is large for functions f that vary
too rapidly over small regions of input space. For example, for the *cubic
smoothing spline*, it is

\[PRSS(f;\lambda) = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2dx\]

The smoothing parameter encodes the degree of the spline, and the
number and placement of *knots*, and can be automatically selected.

*Smoothing splines* take a different approach but also produce a spline.
We need to find a function \(g\); if we put no constraints on it, we could
simply interpolate the data (\(g(x_i) = y_i\)), which overfits. Instead we
look for the \(g\) that minimizes

\begin{eqnarray} \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2dt \end{eqnarray}

The first term is a *loss function*, and the second is a *penalty term*.
The function \(g\) that minimizes this criterion is a smoothing spline.
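
As a quick sketch (assuming scipy is available; note that `UnivariateSpline` controls smoothness with a residual bound `s` rather than with \(\lambda\) directly, so this only approximates the criterion above):

```python
# A sketch of smoothing-spline fitting with scipy. UnivariateSpline controls
# smoothness through `s` (a bound on the residual sum of squares), not through
# the penalty weight lambda, so this only approximates the formulation above.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

smooth = UnivariateSpline(x, y, k=3, s=len(x) * 0.3**2)  # cubic smoothing spline
interp = UnivariateSpline(x, y, k=3, s=0)                # s=0: interpolates (overfits)

grid = np.linspace(0, 10, 200)
print(smooth(grid)[:5], interp(grid)[:5])
```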

### 1.5 KNN

In KNN, all data are the model. The prediction is made by averaging the nearest k observations. In particular, the output is computed by:

\[\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i\]

Where the \(N_k(x)\) is the k closest points of \(x\) in the training sample.
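
A minimal numpy sketch of this averaging rule, on synthetic data:

```python
# A minimal sketch of k-nearest-neighbor regression: predict by averaging the
# responses of the k training points closest (in Euclidean distance) to x.
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return y_train[nearest].mean()                # average their responses

rng = np.random.default_rng(4)
X_train = rng.normal(size=(200, 2))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

print(knn_predict(X_train, y_train, np.array([0.5, -0.2]), k=5))
```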

### 1.6 Subset selection

There are two reasons we might want to use only a subset of the inputs:

- prediction accuracy: the least square estimates often have low bias but large variance.
- interpretation: we would like to determine a smaller subset that exhibit the strongest effects.

*Best-subset selection* explores all possible subsets and selects the one
with the smallest residual sum of squares (RSS). This is of course
expensive, so instead we can use a search. *Forward step-wise selection*
starts with the intercept, then sequentially adds to the model the
predictor that most improves the fit; clearly this is a greedy algorithm.
*Backward step-wise selection* starts with the full model and sequentially
deletes the predictor that has the least impact on the fit; the candidate
for dropping is the one with the smallest Z-score.

There is also *forward stage-wise regression (FS)*, which is more
constrained. At each step it identifies the variable most correlated with
the current residual, computes the simple linear regression coefficient of
the residual on that variable, and adds that amount to the current
coefficient for the variable. This continues until none of the variables
have any correlation with the residual, i.e. the least squares fit. Unlike
forward step-wise regression, none of the other variables are adjusted
when a term is added to the model. As a result, it may take more steps,
but this "slow fitting" tends to be more effective in high-dimensional
problems.

There is also a hybrid approach: after adding each new variable, the
method may also remove any variables that no longer provide an
improvement.
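
A minimal sketch of the greedy forward step-wise idea (numpy only; synthetic data, model size fixed by hand rather than chosen by a criterion):

```python
# A minimal sketch of forward step-wise selection: greedily add the predictor
# that gives the largest drop in RSS. Synthetic data; 3 steps, chosen by hand.
import numpy as np

def rss_of(X, y, cols):
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return resid @ resid

rng = np.random.default_rng(5)
N, p = 150, 8
X = rng.normal(size=(N, p))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(scale=0.5, size=N)  # only 1 and 4 matter

selected, remaining = [], list(range(p))
for _ in range(3):                                   # grow the model 3 steps
    best = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
    selected.append(best)
    remaining.remove(best)
    print(selected, round(rss_of(X, y, selected), 2))
```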

*Shrinkage methods* offer a more continuous alternative to subset
selection. Subset selection keeps or removes each variable outright; in
other words, it is a discrete process, and thus exhibits high variance.

*Ridge regression* shrinks the regression coefficients by imposing a
penalty on their size. Ridge regression protects against the
potentially high variance of gradients estimated in the short
direction. The assumption is that the response will tend to vary most
in the directions of high variance of the input. Thus the name
"ridge".

Ordinary least squares minimizes:

\begin{eqnarray} RSS = \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij})^2 \end{eqnarray}

and ridge regression adds a *shrinkage penalty*:

\begin{eqnarray} \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij})^2 + \lambda \sum_{j=1}^p \beta_j^2 \end{eqnarray}

When the \(\beta_j\) are close to 0, the added penalty is small, so minimizing the penalized criterion pushes the \(\beta_j\) toward 0, i.e. shrinkage. The parameter \(\lambda\) controls the influence of the penalty.

The ridge penalty uses the \(\ell_2\) norm.
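
A minimal numpy sketch of the ridge closed form \(\hat\beta = (X^TX + \lambda I)^{-1}X^Ty\) on standardized, synthetic data (intercept handled by centering):

```python
# A minimal sketch of ridge regression via its closed form:
# beta_ridge = (X^T X + lambda * I)^{-1} X^T y, on standardized inputs.
import numpy as np

rng = np.random.default_rng(6)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=N)

Xc = (X - X.mean(0)) / X.std(0)   # standardize inputs
yc = y - y.mean()                 # center the response (intercept handled separately)

for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    print(lam, np.round(beta, 3))  # coefficients shrink smoothly toward 0 as lam grows
```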

*Principal Components Regression (PCR)* and *Partial Least Squares (PLS)*
use derived inputs. PCR derives them from the inputs alone, while PLS also
uses \(y\) in addition to \(x\). Both are similar to ridge regression in that
they shrink smoothly.

*Lasso* falls somewhere between ridge regression and best subset
regression.

Ridge regression cannot remove any features, unless \(\lambda = \infty\). This may not be a problem for prediction accuracy but it can create a challenge in model interpretation.

The lasso uses the \(\ell_1\) norm penalty.

The \(\ell_1\) penalty has the effect of forcing some of the coefficient
estimates to be *exactly* equal to 0 when the tuning parameter
\(\lambda\) is sufficiently large.

This makes the lasso much easier to interpret: it yields *sparse* models,
i.e. models that involve only a subset of the variables.

In summary, ridge regression does a proportional shrinkage; the lasso translates each coefficient toward zero by a constant amount, truncating at zero (soft thresholding); and best-subset selection drops all variables with coefficients smaller than the M-th largest.

### 1.7 Kernel Methods and local regression

Here, kernel means local. Kernel methods are generally applicable, e.g. to KNN. KNN is discontinuous, but this need not be the case. The idea is to use only observations close to the target point to fit the model. This is achieved by a kernel function \(K_\lambda(x_0,x)\) that assigns weights to the points in the region near \(x_0\) according to their distance. The model, however, uses the entire training data. As a result, the RSS to minimize becomes:

\[RSS(f_\theta, x_0) = \sum_{i=1}^N K_\lambda (x_0, x_i)(y_i - f_\theta (x_i))^2\]

The \(\lambda\) parameter controls the width of the kernel. However, local regression becomes less useful in dimensions much higher than 2: it is impossible to simultaneously maintain localness (low bias) and a sizable sample in the neighborhood (low variance) as the dimension increases, unless the total sample size grows exponentially. This cannot be overcome unless we make some structural assumptions about the model.
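
The simplest instance is a locally constant fit, i.e. a kernel-weighted average; a minimal numpy sketch on synthetic 1-D data:

```python
# A minimal sketch of kernel-weighted local fitting in 1-D, using a Gaussian
# kernel K_lambda(x0, x) to weight nearby observations (a locally constant fit,
# i.e. a kernel-weighted average).
import numpy as np

def kernel_smooth(x_train, y_train, x0, lam=0.3):
    w = np.exp(-0.5 * ((x_train - x0) / lam) ** 2)  # kernel weights near x0
    return (w * y_train).sum() / w.sum()            # locally weighted average

rng = np.random.default_rng(7)
x_train = np.sort(rng.uniform(0, 4, size=120))
y_train = np.sin(2 * x_train) + rng.normal(scale=0.2, size=x_train.size)

for x0 in [0.5, 1.0, 2.0]:
    print(x0, round(kernel_smooth(x_train, y_train, x0), 3))
```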

### 1.8 High Dimension

It seems that, given more data, KNN should make more precise predictions.
However, this intuition breaks down when the dimension is high; this is
known as the *curse of dimensionality*: in high dimensions the k nearest
neighbors need not be close to the target point. In general, any method
that attempts to produce locally varying functions in small isotropic
neighborhoods will run into problems in high dimensions.

Dimension reduction techniques address this. They form M new predictors
\(Z_m\) as linear combinations of the original p predictors:

\begin{eqnarray} Z_m = \sum_{j=1}^p \phi_{jm} X_j \end{eqnarray}
*Principal Component Analysis (PCA)*: The following are some criteria
for the direction selection, they all talk about the same thing:

- The first principal component direction is the one along which the observations vary the most.
- Equivalently, the projections onto it have the highest variance.
- It also defines the line that is as close as possible to the data.
- That is, the projected observations are as close as possible to the original observations.

The second principal component \(Z_2\) is a linear combination of the variables that is uncorrelated with \(Z_1\), and has largest variance subject to this constraint. Actually \(Z_1\) and \(Z_2\) are always orthogonal.
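
A minimal numpy sketch of PCA via the SVD of the centered data matrix (synthetic, correlated 2-D data); the scores \(Z_1, Z_2\) come out uncorrelated, as described above:

```python
# A minimal sketch of PCA via the SVD of the centered data matrix: the right
# singular vectors are the principal component directions; the scores are Z = Xc V.
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

Xc = X - X.mean(axis=0)                    # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
directions = Vt                            # rows: principal component directions
Z = Xc @ Vt.T                              # principal component scores Z_1, Z_2

print(directions)
print(np.round(Z.var(axis=0), 3))          # variance is largest along Z_1
print(round(float(Z[:, 0] @ Z[:, 1]), 6))  # ~0: Z_1 and Z_2 are uncorrelated
```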

*Principal Component Regression (PCR)*: Construct the first M
principal components, and do linear regression on the new predictors.

*Partial Least Squares (PLS)*: the directions identified by PCA are chosen
in an unsupervised way, i.e. without using the response Y; PLS instead
uses the response. For the first PLS direction, set each \(\phi_{j1}\) equal
to the coefficient from the simple linear regression of Y onto \(X_j\).
Intuitively, PLS places the highest weight on the variables that are most
strongly related to the response.

The second PLS direction is found by:

- adjusting each of the variables for \(Z_1\), by regressing each variable on \(Z_1\) and taking the residuals; these capture the remaining information not explained by the first PLS direction,
- using this orthogonalized data in exactly the same fashion as for \(Z_1\),
- repeating M times.

### 1.9 Model Assessment

*Test error*, also called *generalization error* or *prediction error*, is
the error over an independent test sample. *Training error* is the error
on the training sample.

The *mean squared error (MSE)* is defined as:

\[MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{f}(x_i))^2\]

The bias-variance trade-off:

- variance: the amount by which \(\hat{f}\) would change if we estimated it using a different training data set. *More flexible statistical methods have higher variance.*
- bias: the error that is introduced by approximating a real-life problem; e.g. it is unlikely that any real-life problem has a simple linear relationship. *More flexible methods result in less bias.*
- 95% confidence interval: a range of values such that with 95% probability the range will contain the true unknown value of the parameter.

The prediction error can be assessed by some criterion. *Akaike
Information Criterion (AIC)* is defined as

\[AIC = - \frac{2}{N} loglik + 2 \frac{d}{N}\]

*Bayesian Information Criterion (BIC)*, also known as *Schwarz
criterion*, is similar to AIC, and defined as:

\[BIC = -2 loglik + (log N) d\]

Choosing the model with minimum BIC is equivalent to choosing the
model with largest (approximate) posterior probability, thus the name.
The *Minimum Description Length (MDL)* is formally identical to BIC,
but from different point of view.

As the *model complexity* increases, the variance tends to increase and
the squared bias tends to decrease. The *Vapnik-Chervonenkis dimension
(VC dimension)* provides a measure of model complexity.

In *K-fold cross-validation*, we split the data into K roughly
equal-sized parts. For the k-th part, we fit the model using the other K-1
parts and use the k-th part as test data. The prediction error is the
average over the K experiments. When K=N, this is called *leave-one-out
cross-validation*; in that case the estimate is approximately unbiased,
but has high variance.
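
A minimal numpy sketch of K-fold cross-validation for a least squares fit (synthetic data, K=5):

```python
# A minimal sketch of K-fold cross-validation for a least-squares model,
# estimating prediction error as the average MSE over the K held-out folds.
import numpy as np

rng = np.random.default_rng(9)
N, K = 120, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=N)

folds = np.array_split(rng.permutation(N), K)   # K roughly equal-sized parts
errors = []
for k in range(K):
    test = folds[k]
    train = np.setdiff1d(np.arange(N), test)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit on K-1 folds
    errors.append(np.mean((y[test] - X[test] @ beta) ** 2))     # test MSE on fold k

print(np.round(errors, 3), round(float(np.mean(errors)), 3))
```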

The *bootstrap method*, like cross-validation, can also be used to assess
accuracy. It randomly draws, with replacement, from the training data.
This is done B times, producing B bootstrap datasets. The model is fit on
each dataset, and all the prediction errors are examined.
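
A minimal numpy sketch of the bootstrap, here used to look at the spread of a refitted slope coefficient (synthetic data; B=500 resamples):

```python
# A minimal sketch of the bootstrap: draw B datasets with replacement, refit the
# model on each, and look at the spread of the refitted coefficients.
import numpy as np

rng = np.random.default_rng(10)
N, B = 100, 500
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.7, size=N)

boot_slopes = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)          # sample N indices with replacement
    xb, yb = x[idx], y[idx]
    slope = np.polyfit(xb, yb, deg=1)[0]      # refit the model on the bootstrap set
    boot_slopes.append(slope)

print(round(float(np.mean(boot_slopes)), 3), round(float(np.std(boot_slopes)), 3))
```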

### 1.10 Model Inference

- prediction: \(\hat{f}\) is treated as a black box.
- inference: understand the relationship between X and Y; \(\hat{f}\) cannot be treated as a black box.

Model inference is the problem of estimating the model parameters.

The above bootstrap can be used for model inference. This is called
*non-parametric bootstrap*. The *parametric bootstrap* does not use
raw data, but simulate the new response by adding Gaussian noise to
the predicted values.

In essence, bootstrap is a computer implementation of nonparametric or parametric maximum likelihood. Also, bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for the parameter. But this bootstrap distribution is obtained painlessly, without having to specify a prior, and without having to sample from the posterior distribution. Thus, the bootstrap distribution is also called "poor man's" Bayes posterior.

In Bayesian models, we need to draw samples from the resulting posterior
distribution. Typically the *Markov Chain Monte Carlo (MCMC)* method is
used; Gibbs sampling is one kind of MCMC.

*Bagging*, also called *bootstrap aggregation*, is a method that uses the
bootstrap not to assess accuracy, but to improve the prediction itself. It
makes predictions over a collection of bootstrap samples, thus reducing
the variance. The prediction is defined as the average of the predictors
fit on all the bootstrap samples.

Decision trees suffer from high variance: if we split the training data into two parts at random, the resulting two trees can be very different. Bagging can reduce this variance, and it is based on the bootstrap.

Bagging involves

- creating multiple copies of the original training data set using the bootstrap,
- fitting a separate decision tree to each copy,
- and then combining all of the trees in order to create a single predictive model.

Each tree is built on a bootstrap data set, independently of the other trees. The key idea is that averaging a set of observations reduces variance.
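
A sketch of bagging for regression (this assumes scikit-learn is installed and uses `DecisionTreeRegressor` as the base learner; the data are synthetic):

```python
# A sketch of bagging regression trees: fit one deep tree per bootstrap dataset
# and average their predictions (assumes scikit-learn is available).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(11)
N, B = 200, 50
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=N)

trees = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)                 # bootstrap sample
    tree = DecisionTreeRegressor()                   # deep tree: low bias, high variance
    tree.fit(X[idx], y[idx])
    trees.append(tree)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
bagged = np.mean([t.predict(X_test) for t in trees], axis=0)  # averaging reduces variance
print(np.round(bagged, 3))
```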

*Random Forest* is a substantial modification of bagging that builds
a collection of trees and then averages them. The essential idea in
bagging is to average many noisy but approximately unbiased models,
and hence reduce the variance. Trees are ideal candidates for bagging
because they can capture complex interaction structure in the data,
and if grown sufficiently deep, have relatively low bias. However, the
size of the correlation of bagged trees limits the benefit of
averaging. The random forest is to improve the variance reduction of
bagging by reducing the correlation between trees, without increasing
the variance too much. This is achieved in the tree-growing process
through random selection of the input variables. Specifically:

Before each split, select \(m\le p\) of the input variables at random as candidates for splitting.

*Bumping* is also based on the bootstrap, but uses it as a stochastic
search through model space: it fits the model to bootstrap samples and
keeps the fit that works best on the original data. This helps the fitting
avoid getting stuck in poor local minima.

### 1.11 Maximum Likelihood Inference

Given a probability density or probability mass function for the observation:

\[z_i \sim g_\theta(z)\]

where \(\theta\) is an unknown parameter. The likelihood function is:

\[L(\theta;Z) = \prod_{i=1}^N g_\theta(z_i)\]

This is **the probability of the observed data under the model
\(g_\theta\)**. The log of this likelihood is called the log-likelihood:

\[\ell(\theta;Z) = \sum_{i=1}^N \ell(\theta;z_i) = \sum_{i=1}^N log g_\theta(z_i)\]

Each term in the sum is called a log-likelihood component. The method of maximum likelihood chooses the value of \(\theta\) that maximizes \(\ell(\theta;Z)\).

### 1.12 Tree-based method

*Tree-based methods* partition the feature space into a set of
rectangles, then fit a model in each one.

*regression tree*:

- divide the predictor space into J distinct and non-overlapping regions \(R_1,...,R_J\);
- for each observation falling into \(R_j\), make the prediction using the mean response in \(R_j\).

To get the regions, use *recursive binary splitting*, a top-down,
greedy approach.

- start from the root;
- at every step, choose the split that leads to the greatest possible reduction in RSS.

It is likely to overfit the data. So we can grow a very large tree,
and then *prune* it back in order to obtain a subtree.

The whole algorithm:

- use recursive binary splitting to grow a large tree,
- apply *cost complexity pruning*,
- use *K-fold cross-validation* to choose the amount of pruning.
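
A minimal numpy sketch of the core greedy step of recursive binary splitting: scan predictors and cut points and keep the split that most reduces RSS (a full tree repeats this recursively and then prunes; the data here are synthetic):

```python
# A minimal sketch of one greedy step of recursive binary splitting: scan every
# predictor j and cut point s, and pick the split minimizing the total RSS of
# the two resulting regions.
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(12)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] > 0.6, 3.0, 0.0) + rng.normal(scale=0.2, size=100)
print(best_split(X, y))   # should split on predictor 0 near 0.6
```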

*Patient Rule Induction Method (PRIM)* is also a tree-based method,
but with different splitting approach. It seeks for boxes in which
response average is high. Hence, it looks for maxima in the target
function. It works from the top down, starting with a box containing
all the data, and repeatedly peels off data points.

*Hierarchical Mixture of Experts (HME)* is a variant of tree-based
method. The tree split not by hard decision, but by probability. The
observation goes left or right with probability depending on its input
value. The terminal node is called *expert*, the non-terminal node is
called *gating networks*. The idea is that each expert provides an
opinion (prediction) about the response, and these are combined
together by the gating network.

### 1.13 Additive Model

*Generalized Additive Model (GAM)* has the form

\[E(Y|X_1,X_2,...,X_p) = \alpha + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)\]

where the \(f_i\) are unspecified smooth (nonparametric) functions. It is called additive model because we calculate a separate \(f_j\) for each \(X_j\), and add together all of their contributions. Additive models provide a useful extension of linear models, making them more flexible while retaining much of their interpretability.

The *boosting method* combines the outputs of many weak classifiers to
produce a powerful committee, through a weighted majority vote, with the
assumption that they do not all make the same mistakes.

*Boosting* is another approach for improving the predictions from a
decision tree. The differences from bagging are:

- the trees are grown sequentially: each tree is grown using information from previously grown trees.
- Boosting does not involve bootstrap sampling. Each tree is fit on a modified version of the original data set.

*Ensemble learning* builds a prediction model by combining the strengths
of a collection of simpler base models; bagging and random forests are
examples of ensemble methods. The *Importance Sampled Learning Ensemble
(ISLE)* is one ensemble-generation method.

### 1.14 Other

*Multivariate Adaptive Regression Splines (MARS)* is an adaptive
procedure for regression, and well suited for high-dimensional
problems. It can be viewed as generalization of step-wise linear
regression.

## 2 Support Vector Machine

A linear hyperplane can separate the classes, but there exist an infinite
number of such hyperplanes. The *maximal margin classifier*, also known as
the *optimal separating hyperplane*, chooses the separating hyperplane
that is farthest from the training observations. That is, we can compute
the (perpendicular) distance from each training observation to a given
separating hyperplane; the smallest such distance is the minimal distance
from the observations to the hyperplane, and is known as the *margin*. The
*maximal margin hyperplane* is the separating hyperplane for which the
margin is largest, that is, the hyperplane that has the farthest minimum
distance to the training observations. The closest observations are the
*support vectors*: they "support" the maximal margin hyperplane in the
sense that if these points were moved slightly, the maximal margin
hyperplane would move as well.

However, the above classifier is not stable, and more importantly, it
still has a linear decision boundary and cannot be applied to
non-separable cases. Thus the *support vector classifier*, also known as
the *soft margin classifier*, is introduced. It allows some observations
to be on the incorrect side of the margin, or even the incorrect side of
the hyperplane. This family of methods is regarded as one of the best
"out of the box" classifiers.

Some observations:

- only observations that either lie on the margin or that violate the margin will affect the hyperplane
- an observation that lies strictly on the correct side of the margin does not affect the support vector classifier
- Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.
- When the tuning parameter C is large, then the margin is wide

The support vector machine is a linear classifier, but it can be extended to non-linear boundaries by using non-linear kernels; the kernel trick amounts to an implicit basis expansion.

linear kernel:

\begin{eqnarray} K(x_i, x_{i'}) = \sum_{j=1}^p x_{ij} x_{i'j} \end{eqnarray}

polynomial kernel:

\begin{eqnarray} K(x_i, x_{i'}) = (1 + \sum_{j=1}^p x_{ij} x_{i'j})^d \end{eqnarray}

radial kernel:

\begin{eqnarray} K(x_i, x_{i'}) = \exp(-\gamma \sum_{j=1}^p (x_{ij} - x_{i'j})^2) \end{eqnarray}
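
These kernels are easy to write down directly; a minimal numpy sketch (for two observation vectors, with illustrative values of d and \(\gamma\)):

```python
# A minimal sketch of the three kernels above as plain numpy functions; each
# computes K(x_i, x_i') for two observation vectors.
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, d=3):
    return (1 + x @ xp) ** d

def radial_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0, 0.5]), np.array([0.0, 1.0, 1.5])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), radial_kernel(x, xp))
```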
SVMs can be extended to more than 2 classes. *One-versus-one
classification* fits an SVM for each pair of classes, while
*one-versus-all classification* fits one SVM per class against all of the
others.

## 3 Unsupervised Learning

*Unsupervised learning* is typically used for *Clustering*.

### 3.1 K-means

The *K-means* method seeks a clustering such that the total within-cluster
variation, summed over all K clusters, is as small as possible. The
within-cluster variation \(W(C_k)\) measures the amount by which the
observations within a cluster differ from each other; it is defined as the
sum of all pairwise squared Euclidean distances between the observations
in the kth cluster, divided by the cluster size:

\[W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2\]

where \(|C_k|\) denotes the number of observations in the kth cluster.

The algorithm:

- select a number K, and randomly assign each observation to a cluster from 1 to K;
- iterate until the cluster assignments stop changing:
  - for each cluster, compute the *centroid*: the vector of the p feature means for the observations in that cluster;
  - assign each observation to the cluster whose centroid is closest.

This algorithm is guaranteed to decrease the objective above at every step.
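
A minimal numpy sketch of this algorithm (synthetic data with two well-separated groups; the guard against empty clusters is a practical detail, not part of the textbook description):

```python
# A minimal sketch of K-means: random initial assignment, then alternate
# centroid computation and reassignment until the labels stop changing.
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))               # random initial clustering
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                   # assign to closest centroid
        if np.array_equal(new_labels, labels):
            break                                           # assignments stopped changing
        labels = new_labels
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])  # keep old if empty
    return labels, centroids

rng = np.random.default_rng(13)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)   # roughly (0, 0) and (4, 4)
```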

### 3.2 Hierarchical Clustering

The downside of K-means is the need for a predefined number of clusters.
*Hierarchical clustering* does not require the number of clusters in
advance. The result is a *dendrogram*, a tree-based representation of the
observations.

It is constructed bottom-up. Each node of the tree represents a fusion; the height of the fusion indicates how different the two observations (or clusters) are. Never compare the horizontal distance. Construction algorithm: examine all pairwise inter-cluster dissimilarities among all clusters, and fuse the two most similar clusters.

The four most commonly used types of linkage:

- complete: maximal intercluster dissimilarity
- single: minimal intercluster dissimilarity
- average: mean intercluster dissimilarity
- centroid: dissimilarity between the centroids of clusters A and B
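
A short sketch of agglomerative clustering, assuming scipy is available; `linkage` implements the bottom-up fusion with a chosen linkage, and `fcluster` cuts the dendrogram:

```python
# A sketch of bottom-up (agglomerative) hierarchical clustering with scipy:
# `linkage` repeatedly fuses the most similar clusters under the chosen linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(14)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method='complete')                 # complete linkage: maximal dissimilarity
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
print(labels)
```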