Three years into my pure math studies at UMass Amherst, I had my first experience in machine learning modeling at a summer research program at SDSU. The program was in mathematics, although the ML frenzy had already started spreading mercilessly across adjacent fields, and I was paired with more senior statisticians to work out a model to predict the locations of zeros of the Riemann zeta function on the critical line. To start, being an undereducated pure math geek, I instinctively wanted to first understand
what a "model" is in the most general sense and give it a rigorous definition for my own peace of mind. Of course, this is a rather philosophical feat, which was hard for me to accept back then, since in my world everything had to have a precise definition. This search for a formal framework of a "model" at the beginning of this supposedly practical and fun project wasted quite a bit of my time and didn't lead to a satisfactory answer. Most explanations spoke of models as "sets of equations", "rules",
"relationships between quantities", etc. While conceptually all of this makes sense and is true to some approximation, it just wasn't what I was looking for, and eventually I moved on. Believe it or not, it took me four years, three courses in probability and statistics, and half of a PhD to formulate an answer to that question that I think is reasonably valid, general, and mathematically sound. And it's not even a product of my philosophical contemplations: I just randomly found it in the first pages
of an introductory textbook on statistics. The simplicity of it left me wishing that all statistics textbooks started with it rather than with repeatedly tossing fair coins or rolling dice.
A statistical model is a collection of distributions. A parametric model \(\mathcal{P}\) is a collection of distributions \(\mathcal{P}=\{P_{\theta}\colon \theta\in\Theta\}\) parameterized by a finite number of parameters \(\theta\in\Theta\subseteq\mathbb{R}^d\). A Bayesian model adds to such a collection a prior distribution over the parameters. This may seem like a fancier way of saying that a model "describes relationships between variables", but I find it way more satisfying
and telling, and, looking back at my experiences with ML models, it definitely helps me understand what they are and how they differ. For example, linear regression corresponds to a set of conditional distributions \(Y|X=x\in\{\mathcal{N}(w^{\top}x, \sigma^2)\colon w\in\mathbb{R}^d, \sigma>0\}\) (note how the homoscedasticity assumption is engraved in this), and logistic regression corresponds to \(Y|X=x\in\{\text{Bern}([1+e^{-w^{\top}x}]^{-1})\colon w\in\mathbb{R}^d\}\). Neural
networks are also parametric models, with the associated distributions depending on the architecture.
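To make the "collection of distributions" view concrete, here is a minimal Python sketch (assuming only numpy; the particular \(w\), \(\sigma\), and \(x\) below are made up for illustration): fixing the parameters picks one member of the collection, and the data is then a sample from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (made up for this sketch): a weight vector and noise scale.
w = np.array([1.5, -2.0, 0.5])
sigma = 0.3

def sample_linear_regression(x, w, sigma, rng):
    """Draw y ~ N(w^T x, sigma^2): one member of the linear-regression model."""
    return rng.normal(loc=w @ x, scale=sigma)

def sample_logistic_regression(x, w, rng):
    """Draw y ~ Bern(1 / (1 + exp(-w^T x))): one member of the logistic-regression model."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return rng.binomial(1, p)

x = np.array([0.2, 1.0, -0.7])
print(sample_linear_regression(x, w, sigma, rng))   # a draw from N(w^T x, sigma^2)
print(sample_logistic_regression(x, w, rng))        # a draw from Bern(sigmoid(w^T x))
```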
This section is largely based on an excellent lecture by Kilian Weinberger, courtesy of Cornell University.
Under the assumption that the data \((x_i,y_i)_{i=1}^{n}\) is linearly separable, the perceptron algorithm is
guaranteed to converge to an affine separator \((w_{*},b_{*})\) such that \(y_i({w_{*}}^{\top}x_{i}+b_{*})>0\) for all \(i\).
However, there are infinitely many such hyperplanes, and some are more reasonable and useful from a practical perspective
than others. Having the privilege to choose one, we prefer a separating hyperplane that is as far from all
datapoints as possible. To make this intuition precise, we wish to maximize the minimum distance to any sample,
\(\gamma(w,b) = \lVert w\rVert^{-1}\min_{i}|w^{\top}x_i+b|\), while obviously making the correct predictions, so \(y_i(w^{\top}x_i+b)\geq 0\) for all \(i\in[n]\).
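Both claims, that the perceptron finds some separator and that separators differ in their margin \(\gamma(w,b)\), are easy to poke at numerically. A minimal numpy sketch (the toy dataset and iteration cap are made up; this is not a tuned implementation):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Classic perceptron updates on (X, y) with labels in {-1, +1}.
    Returns some affine separator (w, b) if the data is linearly separable."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def margin(w, b, X, y):
    """gamma(w, b) = min_i |w^T x_i + b| / ||w||, provided all predictions are correct."""
    assert np.all(y * (X @ w + b) > 0), "not a separator"
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# A made-up linearly separable toy dataset.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(w, b, margin(w, b, X, y))
```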
What is the distance from a point to a hyperplane?
Given a point \(x\) and a hyperplane \(\{y\colon w^{\top}y+b=0\}\), denote the orthogonal projection of \(x\) onto the hyperplane by \(x_p\).
Then define the difference \(d=x-x_p\), and note that \(d=\alpha w\) for some constant \(\alpha\in\mathbb{R}\) (both \(d\) and \(w\) lie in the one-dimensional orthogonal complement of the hyperplane's direction subspace) and that \(\lVert d\rVert\) is the distance we are looking for. Since \(x_p\) lies on the hyperplane, we can write
\(0=w^{\top}x_p+b=w^{\top}(x-\alpha w)+b\), so that \(\alpha=\lVert w\rVert^{-2}(w^{\top}x+b)\). It finally follows that \(\lVert d\rVert=|\alpha|\lVert w\rVert=\lVert w\rVert^{-1}|w^{\top}x+b|.\)
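A quick numerical sanity check of this derivation (numpy only; the particular \(w\), \(b\), and \(x\) are arbitrary choices for illustration):

```python
import numpy as np

# Arbitrary hyperplane {y : w^T y + b = 0} and point x, chosen just for this check.
w = np.array([3.0, -1.0, 2.0])
b = 0.5
x = np.array([1.0, 2.0, -1.0])

alpha = (w @ x + b) / np.dot(w, w)   # alpha = ||w||^{-2} (w^T x + b)
x_p = x - alpha * w                  # orthogonal projection of x onto the hyperplane

print(np.isclose(w @ x_p + b, 0.0))                      # x_p lies on the hyperplane
print(np.isclose(np.linalg.norm(x - x_p),
                 abs(w @ x + b) / np.linalg.norm(w)))    # ||d|| = |w^T x + b| / ||w||
```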
Above we used geometric intuition to formulate the following optimization problem:
\[\max_{w,b}\frac{1}{\lVert w\rVert}\min_i|w^{\top}x_i+b|\quad \text{subject to } y_i(w^{\top}x_i+b)\geq 0\quad\forall i\in[n].\]
Formally speaking, the same hyperplane can be realized by infinitely many pairs \((w,b)\): scaling both \(w\) and \(b\) by the same positive factor leaves the hyperplane unchanged. As a sanity check,
observe that the above problem is invariant under such scaling, i.e., the objective function has the same value at \((\alpha w,\alpha b)\) for all \(\alpha>0\).
This is good news: for each hyperplane \(\{(\alpha w, \alpha b)\colon \alpha>0\}\) we can choose a representative \((w_0, b_0)\) such that the inner minimization is
always \(1\) and narrow down the outer optimization space to these representatives. Formally, we simplify the objective function to just \(\max\lVert w\rVert^{-1}\)
at the expense of adding a new constraint \(\min_i|w^{\top}x_i+b|=1\):
\[\max_{w,b}\frac{1}{\lVert w\rVert}\quad\text{subject to }\begin{cases}y_i(w^{\top}x_i+b) \geq 0\quad\forall i\in[n],\\\min_i|w^{\top}x_i+b|=1.\end{cases}\]
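Concretely, the representative of each hyperplane can be written down explicitly: for any pair \((w,b)\) that separates the data, set \(m=\min_i|w^{\top}x_i+b|>0\) and take
\[(w_0,b_0)=\Big(\frac{w}{m},\,\frac{b}{m}\Big),\qquad\text{so that}\qquad \min_i|w_0^{\top}x_i+b_0|=1,\]
which defines the same hyperplane as \((w,b)\).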
Merging these constraints with the existing ones, it is now
tempting to write \[\max_{w,b} \frac{1}{\lVert w\rVert}\quad\text{subject to } y_i(w^{\top}x_i+b)\geq 1\quad\forall i\in[n]\] and claim that this problem is equivalent to the one above.
While the statement is true, the justification is a bit more subtle than it seems. In particular, the feasible sets of these two problems are actually different (the new one is larger).
Assume that \((w^{\star},b^{\star})\) is an optimal solution to the new problem. The first set of constraints of the old formulation is clearly satisfied, but what if
\(\min_i|{w^{\star}}^{\top}x_i+b^{\star}|>1\)? This, however, contradicts optimality, because we could scale both \(w^{\star}\) and \(b^{\star}\) down, improve the objective, and still satisfy the constraints of
the new problem. Hence, \((w^{\star},b^{\star})\) must satisfy the constraints of the original problem, and, since the feasible set of the new problem contains that of the old one, it also achieves the old problem's optimal value. The other direction
is straightforward now that we know the optimal values of the two problems agree.
This new formulation is exactly the SVM quadratic problem (sometimes written with the equivalent objective \(\min\lVert w\rVert^2\)). The above discussion is rather didactic: it shows that, for the optimal hyperplane \((w^{\star},b^{\star})\), there must be at least
one datapoint \(x\) with an active constraint \(y({w^{\star}}^{\top}x+b^{\star})=1\), for otherwise the margin could be increased. In fact, more is true: both classes must have datapoints with active constraints,
for the same reason. While the geometric intuition is more obvious, we can show it algebraically just as well. The datapoints with active constraints are called
support vectors
because they determine the optimal hyperplane. Other datapoints can be just as well removed with no consequences for the optimization result (unless more data is added later). Let \(x\) be a support vector.
The distance from \(x\) to the SVM hyperplane \((w^{\star},b^{\star})\) is then the optimal margin \(\gamma(w^{\star}, b^{\star})\), which is \(\lVert w^{\star}\rVert^{-1}|{w^{\star}}^{\top}x+b^{\star}|=\lVert w^{\star}\rVert^{-1}\).
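To sketch the algebraic version of that argument: suppose, toward a contradiction, that only the class \(y=+1\) has active constraints, so \(\min_{i\colon y_i=+1}({w^{\star}}^{\top}x_i+b^{\star})=1\) while \(m:=\min_{i\colon y_i=-1}\big[-({w^{\star}}^{\top}x_i+b^{\star})\big]>1\). Shifting the bias to \(b'=b^{\star}+\frac{m-1}{2}\) gives
\[y_i({w^{\star}}^{\top}x_i+b')\geq \frac{1+m}{2}=:c>1\quad\forall i\in[n],\]
so the rescaled pair \((w^{\star},b')/c\) still satisfies all constraints while having a strictly smaller norm, contradicting the optimality of \((w^{\star},b^{\star})\). The case where only the negative class has active constraints is symmetric.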
The KKT conditions can potentially reveal some insights about the solution to this convex problem. Let the dual variables be \(\{\lambda_i\}_{i=1}^n\). Since the Lagrangian
\(L(w,b,\lambda)=\lVert w\rVert^2+\sum_{i=1}^n\lambda_i(1-y_i(w^{\top}x_i+b))\) is convex and differentiable in \((w,b)\), it follows that \(\partial_{w,b}L(w,b,\lambda)=\{\nabla_{w,b}L(w,b,\lambda)\}\) for any feasible \(\lambda\).
The stationarity condition, necessarily satisfied at the solution \((w^{\star},b^{\star},\lambda^{\star})\), is then
\[\begin{align}0&=\nabla_w L(w^{\star},b^{\star},\lambda^{\star})=2w^{\star}-\sum_{i=1}^n\lambda^{\star}_i y_ix_i,\\0&=\nabla_b L(w^{\star},b^{\star},\lambda^{\star})=-\sum_{i=1}^n\lambda^{\star}_iy_i.\end{align}\]
Therefore, the optimal triplet \((w^{\star},b^{\star},\lambda^{\star})\) must satisfy \(w^{\star}=\frac{1}{2}\sum_{i=1}^n\lambda^{\star}_iy_ix_i\). By complementary slackness, \(\lambda^{\star}_i(1-y_i({w^{\star}}^{\top}x_i+b^{\star}))=0\) for all \(i\in[n]\),
so that \(\lambda_i^{\star}>0\) only if \(x_i\) is a support vector. This shows that \(w^{\star}\) is a weighted combination of the support vectors, which verifies our geometric intuition that the other (
nuisance) samples can actually be removed from the dataset
with no harm to the algorithm. Under these relations,
\[\begin{align}L(w,b,\lambda) &= \underbrace{\frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_i^{\top}x_j}_{\lVert w\rVert^2}+\underbrace{\sum_{i=1}^n\lambda_i-b\sum_{i=1}^n\lambda_iy_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_i^{\top}x_j}_{\sum_{i=1}^n\lambda_i(1-y_i(w^{\top}x_i+b))}\\&=-\frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_i^{\top}x_j+\sum_{i=1}^n\lambda_i.\end{align}\]
We can then equivalently formulate the dual problem as
\[\max_{\lambda}\left[-\frac{1}{4}\sum_{i=1}^n\sum_{j=1}^n\lambda_i\lambda_jy_iy_jx_i^{\top}x_j+\sum_{i=1}^n\lambda_i\right]\quad\text{subject to }\begin{cases}\sum_{i=1}^n\lambda_iy_i=0,&\\\lambda_i\geq 0&\forall i\in[n].\end{cases}\]
How important is \(\frac{1}{4}\)?
Different texts will have a different constant in front of the term corresponding to \(\lVert w\rVert^2\) in the new dual formulation.
The discrepancy comes from different preferences for the primal objective function (\(\max \lVert w\rVert^{-1}\), \(\min\lVert w\rVert\), or \(\min\lVert w\rVert^2\)); all of these objectives are equivalent for the primal problem, as they have equal solution sets (although different optimal values).
It turns out that this choice does not really matter for the dual either: multiplying the primal objective by a constant \(K>0\) just rescales the optimal dual variables and has no effect on the primal
solution \((w^{\star},b^{\star})\). Hence, perhaps counterintuitively, we can safely ignore the factor \(1/4\) in the above problem.
Where did \(b\) go?
The bias term disappeared from the objective because the stationarity condition \(\sum_{i=1}^n\lambda_iy_i=0\) conveniently eliminates it from the Lagrangian.
While it won't be missed during optimization, we still need to compute it for inference. Once the optimal \(\lambda^{\star}\)
are found, the corresponding \(w^{\star}=\frac{1}{2}\sum_{i=1}^n\lambda^{\star}_iy_ix_i\) (or \(w^{\star}=\sum_{i=1}^n\lambda^{\star}_iy_ix_i\) in the conventional formulation with primal objective \(\frac12\lVert w\rVert^2\)). We don't have a similar explicit formula for \(b^{\star}\),
but note that, for any support vector \(x_s\), we have \(y_s({w^{\star}}^{\top}x_s+b^{\star})=1\), so that \(b^{\star}=y_s-{w^{\star}}^{\top}x_s\).
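To see the dual in action, here is a minimal numerical sketch using `scipy.optimize.minimize` with SLSQP (a general-purpose solver standing in for a proper QP solver; the toy data and starting point are made up, and the quadratic term is taken with the conventional factor \(\frac12\), so that \(w^{\star}=\sum_i\lambda^{\star}_iy_ix_i\) holds exactly, per the discussion above):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Matrix Q_ij = y_i y_j x_i^T x_j appearing in the quadratic term of the dual.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(lam):
    """Negated dual objective (the conventional 1/2 factor, i.e. primal (1/2)||w||^2)."""
    return 0.5 * lam @ Q @ lam - lam.sum()

constraints = [{"type": "eq", "fun": lambda lam: lam @ y}]   # sum_i lambda_i y_i = 0
bounds = [(0, None)] * n                                     # lambda_i >= 0

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=bounds, constraints=constraints)
lam = res.x

w = (lam * y) @ X          # w* = sum_i lambda_i y_i x_i
s = np.argmax(lam)         # index of a support vector (largest multiplier)
b = y[s] - w @ X[s]        # b* = y_s - w*^T x_s
print(w, b, np.sign(X @ w + b))
```

On separable data, the multipliers of non-support vectors should come out numerically close to zero, and \((w,b)\) is recovered exactly as described above.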
With this dual problem, we are now in a position to introduce the
kernel trick, a feature that makes linear methods extremely expressive.
It is often covered together with the SVM algorithm although it more generally applies to any linear learning algorithm with a decision rule \(\text{sgn}(w^{\top}x+b)\).
Note that the most recent optimization
problem depends not on the individual datapoints \(x_i\) but rather on their dot products \(x_i^{\top}x_j\). For starters, precomputing these dot products
will save us quite some runtime. Most importantly, however, we can now choose a different (higher-dimensional or even infinite-dimensional) optimization space where we embed our samples
using a potentially non-linear mapping \(\phi\), allowing us to capture non-linear patterns in the data or separate
originally linearly inseparable datasets. To this end, we just need to replace the original \(x_i^{\top}x_j\) with \(\phi(x_i)^{\top}\phi(x_j)\). And there is more! We don't actually need to handcraft the mapping \(\phi\):
all we need is a valid
kernel function \(K\colon\mathbb{R}^{d}\times\mathbb{R}^d\rightarrow\mathbb{R}\) where \(K(x_i,x_j)=\phi(x_i)^{\top}\phi(x_j)\) for some \(\phi\).
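For instance, the RBF kernel \(K(x,x')=\exp(-\gamma\lVert x-x'\rVert^2)\) is valid, and its feature map \(\phi\) is infinite-dimensional, yet the dual only ever needs the Gram matrix of kernel values. A small numpy sketch (the bandwidth parameter and data are made up for illustration):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2) for the RBF kernel."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
K = rbf_kernel(X, X)          # replaces the matrix of dot products x_i^T x_j in the dual
print(K.shape, np.allclose(K, K.T))
```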
How do we do inference?
Before all this madness we could simply compute \(w^{\star}=\sum_{i=1}^n\lambda^{\star}_iy_ix_i\), \(b^{\star}=y_s-{w^{\star}}^{\top}x_s\).
There is no hope we can similarly compute \(\sum_{i=1}^n\lambda_i^{\star}y_i\phi(x_i)\) for an infinite-dimensional \(\phi\) or when
\(\phi\) is not explicitly defined. Here is an alternative approach: given a test vector \(x_t\), \[{w^{\star}}^{\top}x_t = \sum_{i=1}^n\lambda_i^{\star}y_i\phi(x_i)^{\top}\phi(x_t)=\sum_{i=1}^n\lambda^{\star}_iy_iK(x_i,x_t),\]
so computing \({w^{\star}}^{\top}x_t\) amounts to evaluating the kernel \(K(x_s,x_t)\) for all support vectors \(x_s\). The bias term \(b^{\star}\) can be computed as before, with \({w^{\star}}^{\top}x_s\) itself evaluated through the kernel.
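Putting this together, a minimal prediction routine for a kernel SVM, assuming the multipliers `lam` and bias `b` have already been obtained from the kernelized dual (e.g., by swapping the kernel Gram matrix into the solver sketch above):

```python
import numpy as np

def decision_function(x_t, X, y, lam, b, kernel):
    """w*^T x_t + b* computed purely through kernel evaluations:
    sum_i lambda_i y_i K(x_i, x_t) + b."""
    k = np.array([kernel(x_i, x_t) for x_i in X])
    return np.sum(lam * y * k) + b

def predict(x_t, X, y, lam, b, kernel):
    return np.sign(decision_function(x_t, X, y, lam, b, kernel))

# Example kernel: RBF with a made-up bandwidth.
rbf = lambda u, v, gamma=0.5: np.exp(-gamma * np.sum((u - v) ** 2))

# The bias, as in the text: b* = y_s - w*^T x_s for any support vector x_s,
# where w*^T x_s is itself a kernel sum.
def bias_from_support_vector(s, X, y, lam, kernel):
    return y[s] - np.sum(lam * y * np.array([kernel(x_i, X[s]) for x_i in X]))
```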
Kernel SVM allows us to deal with linearly inseparable data by mapping it to a higher dimensional space and separating it there.
Oftentimes, however, we actually do not need perfect separation and would tolerate some mistakes for the sake of a simpler model, whether in the original or a higher-dimensional space.
To this end, we relax the separability constraints in the original primal problem using
slack variables and incorporate a penalty for making mistakes
into the objective function: \[\min_{w,b,\xi}\left[\lVert w\rVert^2+C\sum_{i=1}^n\xi_i\right]\quad\text{subject to }\begin{cases}y_i(w^{\top}x_i+b)\geq 1-\xi_i\quad &\forall i\in[n],\\\xi_i\geq 0\quad &\forall i\in[n],\end{cases}\]
where \(C\) is a hyperparameter and is not optimized. As in many other machine learning algorithms, this parameter essentially controls the amount of regularization (unlike in most algorithms, though,
higher values of \(C\) correspond to less regularization and thus to potentially overfitting solutions). The above optimization problem still remains quadratic with linear constraints, and we call it
soft-margin SVM
(as opposed to the original
hard-margin SVM), since we now allow violations of the margin constraints. In soft-margin SVM, we naturally define support vectors to be the offenders of the constraints, i.e.,
the datapoints with \(\xi_i>0\). By treating support and nuisance vectors separately, we can write the optimal value of each slack variable, and hence the penalty term in the objective, explicitly:
\[\xi_i = \begin{cases}1-y_i(w^{\top}x_i+b)&\text{if } y_i(w^{\top}x_i+b)<1,\\0&\text{otherwise.}\end{cases}\]
Thus, we can rewrite the objective as \[C\sum_{i=1}^{n}\max[0,1-y_i(w^{\top}x_i+b)]+\lVert w\rVert^2.\]
The term \(\max[0,1-y(w^{\top}x+b)]\) is called
the hinge loss, and it majorizes the 0-1 loss. This is now awfully similar to
a standard regularized loss-minimization setup with \(\ell_2\) regularization.
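Viewed this way, soft-margin SVM can be trained by plain subgradient descent on the regularized hinge loss. A minimal numpy sketch (the learning rate, step count, value of \(C\), and toy data are arbitrary illustrative choices, not recommendations):

```python
import numpy as np

def train_softmargin_svm(X, y, C=1.0, lr=0.01, steps=2000):
    """Subgradient descent on  ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                       # samples with nonzero hinge loss
        # Subgradient of the objective w.r.t. w and b.
        grad_w = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up toy data with small margins.
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.5, 0.4], [-1.0, -1.0], [-2.0, -0.5], [-0.3, -0.2]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, b = train_softmargin_svm(X, y)
print(np.sign(X @ w + b))
```

This is only a didactic stand-in for solving the quadratic program, but it makes the connection to regularized loss minimization explicit.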