Contents

Introduction
Definition [convex set]. A set \(C\subseteq\mathbb{R}^n\) is convex if for any \(x,y\in C\) and \(\lambda\in[0,1],\) we have \(\lambda x+(1-\lambda)y\in C\).

Subgradients
Definition [subgradient]. Consider any function \(f\colon D\rightarrow\mathbb{R}\) with domain \(D\subseteq\mathbb{R}^n\). We say that \(g\in\mathbb{R}^n\) is a subgradient of \(f\) at \(x\) if, for any \(z\in D\), we have\(f(z)\geq f(x)+g^{\top}(z-x)\). The collection of all subgradeints of \(f\) at \(x\) is called subdifferential and is denoted by \(\partial f(x)\). The function \(f\) is subdifferentiable at \(x\) if \(\partial f(x)\) is nonempty. The function \(f\) is called subdifferentiable if it is subdifferentiable at all points from its domain.
Proposition. For \(x\in D\), the subdifferential \(\partial f(x)\) contains zero if and only if \(x=\min f(x)\).
Proof.
    If \(0\in \partial f(x)\) then, for all \(y\in D\), \(f(y)\geq f(x)+0^{\top}(y-x)=f(x)\) so \(x\) is the global minimizer. Conversely, if \(x=\min f(x)\), then \(f(y)\geq f(x) + 0^{\top}(y-x)\) for all \(y\in D\).

Proposition. If a convex function \(f\) is differentiable at \(x\in D\), then \(\partial f(x)=\{\nabla f(x)\}\).
Proof.
    Coming soon.

Duality
Consider a possibly non-convex optimization problem: \[\min f_0(x)\quad \text{subject to } \begin{cases} f_i(x)\leq 0 & \text{ for } i\in[M]\\ h_j(x)=0 &\text{ for } j\in[P],\end{cases}\] with domain \(\mathcal{D}\) and \(\mathcal{X}\subseteq\mathcal{D}\) a nonempty feasible set. Furthermore, let the optimal value of this problem be \(p^{\star}\). Note that \(P\) equality constraints just correspond to \(Ax=b\) for some \(A\in\mathbb{R}^P\times\mathbb{R}^{\text{dim}(\mathcal{D})}\) and \(b\in\mathbb{R}^P\). Now, define a Lagrangian function \(L\colon\mathcal{D}\times\mathbb{R}^{M}\times\mathbb{R}^P\rightarrow\mathbb{R}\) and the Lagrange dual function \(g\colon\mathbb{R}^{M}\times\mathbb{R}^P\rightarrow\mathbb{R}\), resectively, as \[\begin{align}L(x,\lambda,\nu)&=f_0(x)+\sum_{i=1}^M\lambda_if_i(x)+\sum_{j=1}^P\nu_jh_j(x),\\g(\lambda,\nu)&=\inf_{x\in\mathcal{D}}L(x,\lambda,\nu).\end{align}\] First, note that \(g\) is concave as a pointwise infimum of affine functions \(L(\cdot,\lambda,\nu)\). Note that \(g(\lambda,\nu)\leq p^{\star}\) provided that \(\lambda\geq 0\). Indeed, since we assumed \(\mathcal{X}\) to be nonempty, let \(\tilde{x}\in\mathcal{X}\) be a feasible point. Then, \[f_0(x)\geq f_0(\tilde{x})\geq L(\tilde{x},\lambda,\nu)\geq\inf_{x\in\mathcal{D}}L(x,\lambda,\nu)=g(\lambda,\nu).\] Now taking the infimum over the feasible set \(\mathcal{X}\) yields the desired inequality. Hence, we define the dual problem to be \[\max g(\lambda,\nu)\quad\text{subject to }\lambda_i\geq0\text{ for all } i\in[P]\] with the optimal value \(d^{\star}\).
Example [Standard LP]. Recall the standard formulation of a linear program: \[\min c^{\top}x\quad\text{subject to }\begin{cases} Ax=b\\x\geq 0.\end{cases}\] Based on the definitions above, the corresponding Lagrangian function is \[\begin{align}L(x,\lambda,\nu)&=c^{\top}x-\lambda^{\top}x+\nu^{\top}(Ax-b)\\&=(c+A^{\top}\nu -\lambda)^{\top}x-b^{\top}\nu\end{align}\] The Lagrangian dual function is clearly unbounded below if \(c+A^{\top}\nu-\lambda\ne 0\). Otherwise, \(g(\lambda,\nu) = -b^{\top}\nu\). Overall, the dual problem becomes \[\max -b^{\top}\nu\quad\text{subject to }\begin{cases} c+A^{\top}\nu-\lambda=0\\\lambda\geq 0.\end{cases}\] After some cosmetic substitutions, it is equivalent to \[\max -b^{\top}\nu\quad\text{subject to }c+A^{\top}\nu\geq0,\] as expected.
How does scaling of the primal objective affect the dual?
    Formally speaking, we have much freedom in choosing a convenient primal objective function without changing the problem (i.e., its solution set). For example, we might prefer \(\lVert x\rVert^2\) to \(\lVert x\rVert\) for ease of manipulation or \(\max(0,x)^2\) to \(\max(0,x)\) because the former is differentiable. However, it is not immediately clear how this change affects the Lagrangian and the dual problem. For example, consider scaling the primal objective by a positive constant \(K\); the new Lagrangian \(L_K\) mixes the objective and the constraints in different proportions: \[\begin{align} L(x,\lambda) &= f_0(x)+\textstyle\sum_{i=1}^K\lambda_i f_i(x),\\L_K(x,\lambda) &= Kf_0(x)+\textstyle\sum_{i=1}^K\lambda_i f_i(x).\end{align}\] What does it mean for the dual problem? Well, conveniently, we have \(g_K(K\lambda) = Kg(\lambda)\), so the optimal dual variables, like the corresponding dual optimal value just get scaled as well: \(\lambda_K^{\star}=K\lambda^{\star}\) and \(d_K^{\star}=Kd^{\star}\).

By focusing on the dual problem we traded our original primal problem to a potentially easier dual problem (the objective function is concave and the constraints are linear) that lower bounds the original optimum: \(d^{\star}\leq p^{\star}\). This relationship is called weak duality, and the difference \(p^{\star}-d^{\star}\) is called duality gap. It turns out that strong duality \(p^{\star}=d^{\star}\) sometimes holds, and especially often for convex problems.
Theorem [Slater's conditions]. A strictly feasible convex primal problem exhibits strong duality. That is, if the functions \(f_0, f_i,\ldots f_M\) are convex, \(h_1, h_2,\ldots h_P\) are affine, and there exists \(\tilde{x}\in\mathcal{X}\) so that \(f_i(\tilde{x})<0\) for all \(i\in[M]\), the problem is strongly dual.
Proof.
    See Boyd & Vandenberghe (p.234).
Strong duality reveals an interesting property between the primal and the dual constraints called complementary slackness. Assume that strong duality holds and, furthermore, the optimal values are attainable: \(p^{\star} = f_0(x^{\star})=g(\lambda^{\star},\nu^{\star}) = d^{\star}\). Then, \[\begin{align}g(\lambda^{\star},\nu^{\star}) &= \inf_{x\in\mathcal{D}}\left[f_0(x)+\sum_{i=1}^M\lambda_i^{\star}f_i(x)+\sum_{j=1}^P\nu^{\star}h_j(x)\right]\\&\leq f_0(x^{\star})+\sum_{i=1}^M\lambda_i^{\star}f_i(x^{\star})+\sum_{j=1}^P\nu^{\star}h_j(x^{\star})\leq f_0(x^{\star})\end{align}\] where the last inequality follows from the feasibility \(\lambda_i^{\star}\geq 0\) and \(f_i(x^{\star})\leq 0\) for all \(i\in[M]\). By strong duality, all inequalities above must be equalities, which in particular implies \(\lambda_i^{\star}f_{i}(x^{\star})=0\) for all \(i\in[M]\). The term complementary slackness refers to this exact condition: at least one (primal or dual) inequality constraint in each pair of \(M\) inequality constraints must be active.
Theorem [KKT (Karush-Kuhn-Tucker) conditions]. A triplet \((x,\lambda,\nu)\in\mathbb{R}^{|\mathcal{D}|}\times\mathbb{R}^M\times\mathbb{R}^P\) is primal and dual optimal, \(p^{\star}=f_0(x)\) and \(d^{\star}=g(\lambda,\nu)\), if and only if the following conditions are satisfied:
  1. Primal feasibility: \(f_i(x)\leq 0\) for all \(i\in[M]\) and \(h_j(x)=0\) for all \(j\in[P]\),
  2. Dual feasibility: \(\lambda_i\geq 0\) for all \(i\in[M]\),
  3. Complementary slackness: \(\lambda_if_i(x)=0\) for all \(i\in[M]\),
  4. Stationary point: \(0\in\partial\left[f(x)+\sum_{i=1}^M\lambda_if_i(x)+\sum_{j=1}^P\nu_jh_j(x)\right]\).
Proof.
    Assuming that \(x_0\) and \((\lambda_0,\nu_0)\) are solutions of the primal and the dual problems, respectively, the first three conditons are established. The last property holds by the basic property of subgradients provided that \(x_0\) minimizes the Lagrangian \(L(x, \lambda_0,\nu_0)\), which we demonstrated above. Conversely, let \((x_0,\lambda_0,\nu_0)\) satisfy the KKT conditions. The last condition implies \(g(\lambda_0,\nu_0)=\inf_{x}L(x,\lambda_0,\nu_0)=L(x_0,\lambda_0,\nu_0)\), which equals \(f(x_0)\) by complementary slackness. Hence, our triplet ensures that primal and dual objectives are equal, which can only occur with strong duality: \(g(\lambda_0,\nu_0)\leq d^{\star}\leq p^{\star}\leq f(x_0)\) so all inequalities are in fact equalities.