Abstract: Exactly what the title says. Written to consolidate my own understanding, based mostly on the great answer by Alebu at StackExchange.
1. Introduction
We consider two cumulative distribution functions (CDFs), one for the negative class and one for the positive class:
$$ F_0(t) := P(S < t\,\vert\, \text{negative}), $$ $$ F_1(t) := P(S < t \,\vert\, \text{positive}). $$
To make this clearer, think of t as a red line dividing two teams (a threshold, if you will). The decision rule classifies a realization of the random variable S as positive when it exceeds the threshold t, and as negative when it falls below. The notation can seem unintuitive at first: F_0(t) conditions on the negative class, which might read as "the probability of being negative given that you are negative", trivially a hundred percent. The way out is to separate the conditioning from the event: the conditioning fixes which class the score S was drawn from, while the event \(S < t\) asks where that score lands relative to the threshold. So \(F_0(t)\) is the probability that a score drawn from the negative class falls below t. Viewed through the decision rule (and, in practice, through empirical rather than theoretical distributions), \(F_0(t)\) is the proportion of the negative class correctly classified as negative by the decision rule, and \(F_1(t)\) is the proportion of the positive class incorrectly classified as negative.
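To ground this, here is a minimal Python sketch (the `scores`/`labels` arrays and the toy Gaussian data are illustrative assumptions, not part of the derivation) computing the empirical \(F_0(t)\) and \(F_1(t)\) as class-wise proportions below a threshold:

```python
import numpy as np

def empirical_cdfs(scores, labels, t):
    """Empirical F0(t) and F1(t): proportion of each class scoring below t."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    F0 = np.mean(scores[labels == 0] < t)  # negatives below t: true negative rate
    F1 = np.mean(scores[labels == 1] < t)  # positives below t: false negative rate
    return F0, F1

# toy data: negatives centered at 0, positives at 1 (illustrative only)
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])
labels = np.concatenate([np.zeros(500), np.ones(500)])
print(empirical_cdfs(scores, labels, t=0.5))
```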
2. Defining FPR and TPR
$$ \text{FPR}(t) = x(t) = P(S > t \mid \text{negative}) = 1 - F_0(t), $$ $$ \text{TPR}(t) = y(t) = P(S > t \mid \text{positive}) = 1 - F_1(t). $$
Hence, the ROC curve is the set of points \(\bigl(\text{FPR}(t),\ \text{TPR}(t)\bigr)\equiv \bigl(x(t),\ y(t)\bigr)\) traced out as the threshold \(t\) varies. Since \(x(t)\) is monotone in \(t\), we may regard the TPR as a function of the FPR and write the curve as \(y(x)\), with \(y(x(t)) = y(t)\).
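Tracing these points over a sweep of thresholds gives the empirical ROC curve; a small sketch, reusing the `scores` and `labels` from above (scikit-learn's `roc_curve` does the same job):

```python
import numpy as np

def roc_points(scores, labels):
    """Trace (FPR(t), TPR(t)) as the threshold t sweeps over the observed scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg, pos = scores[labels == 0], scores[labels == 1]
    thresholds = np.sort(np.unique(scores))[::-1]           # high t first: curve starts near (0, 0)
    fpr = np.array([(neg > t).mean() for t in thresholds])  # x(t) = 1 - F0(t)
    tpr = np.array([(pos > t).mean() for t in thresholds])  # y(t) = 1 - F1(t)
    return fpr, tpr

fpr, tpr = roc_points(scores, labels)
```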
3. ROC and AUC
(Here we see the integral definition of AUC in terms of \(y\) and \(x\): the area under the ROC curve is the integral of the TPR viewed as a function of the FPR.)
$$ \text{AUC} = \int_{0}^{1} y(x)\,dx, $$
where \(x(t) = 1 - F_0(t)\) and \(y(t) = 1 - F_1(t)\). Taking derivatives w.r.t. \(t\):
$$ \frac{dx}{dt} = -f_0(t), \qquad \frac{dy}{dt} = -f_1(t), $$ $$ dx = -f_0(t)\,dt. $$ Substituting into the area integral, and noting that \(x = 0\) corresponds to \(t = +\infty\) while \(x = 1\) corresponds to \(t = -\infty\): $$ \int_{0}^{1} y\,dx = \int_{t = +\infty}^{t = -\infty} \bigl[\,1 - F_1(t)\bigr]\bigl[-f_0(t)\bigr]\,dt. $$
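Numerically, \(\int_0^1 y\,dx\) is just the trapezoidal rule applied to the empirical ROC points; a quick sketch, assuming `fpr` and `tpr` from `roc_points` above:

```python
import numpy as np

# reorder so fpr increases from 0 towards 1, then apply the trapezoidal rule
order = np.argsort(fpr)
x, y = fpr[order], tpr[order]
auc_trapezoid = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2)  # ∫ y dx
print(auc_trapezoid)
```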
4. Behavior of \(x(t)\) as \(t\) Varies
(Key insight: \(x(t)\) decreases from 1 to 0 as \(t\) goes from \(-\infty\) to \(+\infty\).)
In other words:
where x(t) goes from 1 to 0 as t goes from -∞ to +∞:

x(t) | F₀(t) |  t
-----+-------+-----
  0  |   1   | +∞
  1  |   0   | -∞

F₀(t), being a CDF, goes from 0 to 1 as t goes from -∞ to +∞, so x(t) = 1 - F₀(t) goes from 1 to 0.
5. Full AUC Derivation
(The integral for AUC in terms of the density \(f_0\) and the CDF \(F_1\).)
$$ \text{AUC} = \int_{-\infty}^{\infty} \bigl[1 - F_1(t)\bigr]\ f_0(t)\ dt. $$
The negative sign disappears once we flip the integration limits back to \(-\infty \to +\infty\). Recognizing that \(1 - F_1(t) = P(S_1 > t)\), which ranges between \(\mathbf{P}(\Omega)=1\) (as \(t \to -\infty\)) and \(\mathbf{P}(\emptyset)=0\) (as \(t \to +\infty\)), and reading \(f_0(t)\,dt\) as the probability of drawing a negative score near \(t\), an equivalent probabilistic interpretation emerges:
$$ \text{AUC} = E_{S_0, S_1}\bigl[\mathbf{1}(S_1 > S_0)\bigr] = \mathbb{E}_{\,S_0 \sim f_0}\bigl[P(S_1 > S_0)\bigr]. $$
We interpret the integral as the probability that a randomly selected positive score \(S_1\) exceeds a randomly selected negative score \(S_0\). In essence, the AUC is the probability that a member of the positive class out-scores a member of the negative class.
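A quick numerical check of this equivalence, using Gaussian score distributions purely for illustration: with \(S_0 \sim N(0,1)\) and \(S_1 \sim N(\mu,1)\), the closed form is \(\Phi(\mu/\sqrt{2})\), and the integral, the pairwise probability, and the closed form should all agree:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu = 1.0  # assumed separation between the two Gaussian score distributions

# AUC as the integral: ∫ [1 - F1(t)] f0(t) dt
integral, _ = quad(lambda t: (1 - norm.cdf(t, loc=mu)) * norm.pdf(t), -np.inf, np.inf)

# AUC as P(S1 > S0), estimated by Monte Carlo
rng = np.random.default_rng(0)
s0, s1 = rng.normal(0.0, 1.0, 200_000), rng.normal(mu, 1.0, 200_000)
pairwise = np.mean(s1 > s0)

closed_form = norm.cdf(mu / np.sqrt(2))  # Phi(mu / sqrt(2)) for unit-variance Gaussians
print(integral, pairwise, closed_form)   # all three ≈ 0.76
```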
6. Conclusion
The derivation starts from \(F_0(t)\) and \(F_1(t)\), defines FPR and TPR, and shows how the area under the ROC curve can also be seen as the probability that a random positive sample out-scores a random negative sample.
7. GINI
Assume \(AUC = \frac{1}{2}\), the value for a classifier no better than random. Then \(2 \cdot AUC - 1 = 0\). This motivates writing $$ GINI = 2 \cdot AUC - 1. $$ Thus, a GINI of 0 means your predictive power is no better than random; 1 would be perfect discrimination; and -1 would mean the labels are flipped.
Intuitively, GINI recenters the AUC at zero and rescales it so that chance maps to 0 and perfect discrimination maps to 1. For a random positive/negative pair \((S_1, S_0)\) with continuous scores, \(GINI = P(S_1 > S_0) - P(S_1 < S_0)\): the excess of correctly ranked pairs over incorrectly ranked ones, i.e., how much better your model does compared to chance.
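In code this is a one-liner on top of any AUC routine; a sketch with scikit-learn's `roc_auc_score`, assuming the `labels` and `scores` arrays from the earlier sketches:

```python
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(labels, scores)
gini = 2 * auc - 1  # 0 = chance, 1 = perfect, -1 = flipped labels
print(auc, gini)
```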
8. Alternatively, a Quicker Way
$$ \text{AUC} = \int_{-\infty}^{\infty} \text{TPR}(t)\, d\bigl(\text{FPR}(t)\bigr) = \int_{0}^{1} \text{TPR}\bigl(\text{FPR}^{-1}(p)\bigr)\, dp. $$
where \(\text{FPR}^{-1}(p)\) is the threshold \(t\) s.t. \(\text{FPR}(t) = p\) (with the orientation chosen so that the FPR runs from 0 to 1, as in Section 4, which absorbs the sign). Equivalently, using an indicator-function approach:
$$ \text{AUC} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbf{1}(x_1 > x_2)\, f_{X_1}(x_1)\, f_{X_2}(x_2)\, dx_1\, dx_2. $$
This double integral is simply \(P(X_1 > X_2)\) written out explicitly, i.e., \(P(X_1 - X_2 > 0)\); since the density of the difference \(X_1 - X_2\) is the convolution of \(f_{X_1}\) with the density of \(-X_2\), this can be viewed as an application of convolution.
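The empirical counterpart of this double integral is the pairwise comparison of all positive/negative score pairs (the Mann–Whitney U statistic normalized by \(n_1 n_0\)); a minimal sketch under the same assumed arrays:

```python
import numpy as np

def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    This is the empirical version of the double integral above, and equals
    the Mann-Whitney U statistic divided by n1 * n0 (ties counted as half).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

print(auc_pairwise(scores, labels))
```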