Backtesting VaR

An interactive exploration of VaR backtesting using the Christoffersen tests (unconditional coverage, independence, and conditional coverage) and a binomial coverage analysis based on the Kupiec test.

Risk models promise specific probabilistic guarantees: a 1% VaR should be exceeded only 1% of the time. Backtesting checks whether the model delivers on that promise by comparing forecasts against realized outcomes (Christoffersen 2012, chap. 13; Hull 2023, sec. 11.10).

VaR Backtesting

Backtesting compares ex ante VaR forecasts with ex post realized returns. Whenever the loss on a given day exceeds the VaR, we record a violation (or hit):

\[ I_{t+1} = \begin{cases} 1, & \text{if}\; R_{PF,t+1} < -VaR_{t+1}^p \\ 0, & \text{otherwise} \end{cases} \]
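
As a minimal sketch of this definition, the hit indicator can be computed in Python as follows. The function name `hit_sequence` and the sample numbers are illustrative assumptions; VaR is taken to be reported as a positive number, as in the definition above.

```python
def hit_sequence(returns, var_forecasts):
    """Return 1 where the realized return falls below -VaR, else 0.

    VaR forecasts are positive numbers (loss magnitudes), matching the
    sign convention of the indicator definition.
    """
    if len(returns) != len(var_forecasts):
        raise ValueError("returns and var_forecasts must have equal length")
    return [1 if r < -v else 0 for r, v in zip(returns, var_forecasts)]


# Hypothetical data: five daily returns against a constant 2% VaR.
returns = [0.004, -0.025, 0.011, -0.020, -0.031]
var_forecasts = [0.02] * 5
hits = hit_sequence(returns, var_forecasts)
print(hits)  # [0, 1, 0, 0, 1]
```

The strict inequality means that a loss exactly equal to the VaR (the fourth day here) does not count as a violation.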

We construct the hit sequence \(\{I_{t+1}\}_{t=1}^T\) across \(T\) days. If the VaR model is correctly specified, this sequence should be unpredictable:

\[ H_0: I_{t+1} \sim \text{i.i.d. Bernoulli}(p) \]

This null hypothesis implies two properties: (1) the average violation rate equals \(p\) (unconditional coverage), and (2) violations are randomly scattered over time (independence).

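The unconditional coverage test checks whether the observed violation rate \(\hat{\pi} = T_1/T\), where \(T_1\) is the number of violations in \(T\) days, differs from \(p\). Writing \(T_0 = T - T_1\), the likelihood of an i.i.d. Bernoulli(\(\pi\)) hit sequence is \(L(\pi) = (1-\pi)^{T_0}\pi^{T_1}\), and the test compares its value at \(p\) with its value at \(\hat{\pi}\):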

\[ LR_{uc} = -2\ln\left[\frac{L(p)}{L(\hat{\pi})}\right] \sim \chi_1^2 \]
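
The statistic can be evaluated directly from a hit sequence. The sketch below assumes the Bernoulli likelihood \(L(\pi) = (1-\pi)^{T_0}\pi^{T_1}\) with \(T_0 = T - T_1\); the function names and the sample data are hypothetical. The p-value uses the identity that a \(\chi_1^2\) variable is a squared standard normal, so only the standard library is needed.

```python
import math


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def lr_uc(hits, p):
    """Unconditional coverage statistic and its asymptotic chi-square(1) p-value."""
    T = len(hits)
    T1 = sum(hits)          # number of violations
    T0 = T - T1             # number of non-violations
    pi_hat = T1 / T
    log_l_null = _xlogy(T0, 1 - p) + _xlogy(T1, p)
    log_l_alt = _xlogy(T0, 1 - pi_hat) + _xlogy(T1, pi_hat)
    stat = -2.0 * (log_l_null - log_l_alt)
    # P(chi2_1 > x) = erfc(sqrt(x / 2)); max() guards against tiny negative rounding.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical example: 6 violations in 250 days for a 1% VaR.
hits = [1] * 6 + [0] * 244
stat, p_value = lr_uc(hits, p=0.01)
print(f"LR_uc = {stat:.3f}, p-value = {p_value:.3f}")
```

The order of the hits does not matter for this test, which is why a sequence with all violations placed at the start gives the same statistic as any other arrangement of six violations.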

The independence test models the hit sequence as a first-order Markov chain and tests whether the probability of a violation depends on yesterday’s outcome. Define \(\pi_{01} = \Pr(I_{t+1}=1 \mid I_t=0)\) and \(\pi_{11} = \Pr(I_{t+1}=1 \mid I_t=1)\). Under independence, \(\pi_{01} = \pi_{11}\):

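\[ LR_{ind} = -2\ln\left[\frac{L(\hat{\pi})}{L(\hat{\Pi}_1)}\right] \sim \chi_1^2 \]

Here \(L(\hat{\pi})\) is the likelihood under independence, evaluated at the observed violation rate, and \(L(\hat{\Pi}_1)\) is the likelihood of the first-order Markov chain evaluated at the estimated transition probabilities \(\hat{\pi}_{01}\) and \(\hat{\pi}_{11}\).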
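
A sketch of the independence test, built from the transition counts of the hit sequence. Two choices here are assumptions rather than part of the text above: the pooled violation rate under independence is estimated from the \(T-1\) observed transitions (it differs from \(T_1/T\) only through the first observation), and zero counts are handled with the convention \(0 \cdot \ln 0 = 0\). Function names and data are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def lr_ind(hits):
    """Independence statistic and its asymptotic chi-square(1) p-value."""
    if len(hits) < 2:
        raise ValueError("need at least two observations")
    # n[i][j] counts transitions from state i on one day to state j on the next.
    n = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        n[prev][curr] += 1
    (n00, n01), (n10, n11) = n
    pi01 = n01 / (n00 + n01) if (n00 + n01) else 0.0
    pi11 = n11 / (n10 + n11) if (n10 + n11) else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)  # pooled rate (assumption: transition-based)
    log_l_indep = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    log_l_markov = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
                    + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    stat = -2.0 * (log_l_indep - log_l_markov)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Two hypothetical hit sequences with the same six violations in 250 days.
T = 250
clustered = [1 if 100 <= t < 106 else 0 for t in range(T)]   # six days in a row
scattered = [1 if t % 40 == 20 else 0 for t in range(T)]     # one every 40 days
stat_clustered, p_clustered = lr_ind(clustered)
stat_scattered, p_scattered = lr_ind(scattered)
print(f"clustered: LR_ind = {stat_clustered:.2f}, p-value = {p_clustered:.4f}")
print(f"scattered: LR_ind = {stat_scattered:.2f}, p-value = {p_scattered:.4f}")
```

Both sequences have the same violation rate, so an unconditional coverage test cannot tell them apart; only the independence statistic separates them.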

The conditional coverage test combines both:

\[ LR_{cc} = LR_{uc} + LR_{ind} \sim \chi_2^2 \]
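
For two degrees of freedom the chi-square survival function has the closed form \(e^{-x/2}\), so the conditional coverage p-value needs no statistical library. A minimal sketch, with a hypothetical function name and hypothetical input values:

```python
import math


def conditional_coverage(lr_uc, lr_ind):
    """Return (LR_cc, asymptotic p-value) from the two component statistics."""
    lr_cc = lr_uc + lr_ind
    # For 2 degrees of freedom, P(chi2_2 > x) = exp(-x / 2).
    p_value = math.exp(-lr_cc / 2.0)
    return lr_cc, p_value


# Hypothetical inputs: a borderline coverage statistic and a small independence statistic.
lr_cc, p_value = conditional_coverage(3.56, 0.30)
print(f"LR_cc = {lr_cc:.2f}, p-value = {p_value:.3f}")
```

Because the joint test spends two degrees of freedom, a coverage statistic that is borderline on its own can look unremarkable once it is combined with a small independence statistic.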

Note

Why clustering matters. Even with correct average coverage, clustered violations are dangerous. If the violations concentrate in a short period, the losses arrive back to back and the risk of bankruptcy is much higher than if the violations are scattered randomly over time. Historical evidence shows that commercial bank VaRs, particularly those based on Historical Simulation, tend to produce exactly this pattern.

Note

Simulation setup. Returns are simulated from a GARCH(1,1) data-generating process: \(R_t = \sigma_t z_t\) with \(z_t \sim N(0,1)\) and \(\sigma^2_{t+1} = \omega + \alpha R_t^2 + \beta \sigma^2_t\). The three VaR methods differ in what they know about this process:

  • Normal (constant): estimates a single standard deviation from the full sample and assumes constant volatility. This is misspecified because the true volatility varies over time.
  • Historical Simulation: uses a rolling window of past raw returns to compute the VaR percentile. Also misspecified, as it adapts slowly to volatility changes.
  • GARCH(1,1): uses the true conditional volatility \(\sigma_t\) from the simulation. This is correctly specified and should produce well-behaved violations.
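
The setup can be reproduced offline with a short script. This is a sketch under stated assumptions, not the code behind the interactive page: the parameter values, the 250-day window, the seed, and the empirical-quantile convention used for Historical Simulation are all illustrative choices, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, pstdev


def simulate_garch(T, omega, alpha, beta, seed=0):
    """Simulate R_t = sigma_t * z_t with GARCH(1,1) variance dynamics."""
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    returns, sigmas = [], []
    for _ in range(T):
        sigma = sigma2 ** 0.5
        r = sigma * rng.gauss(0.0, 1.0)
        returns.append(r)
        sigmas.append(sigma)
        sigma2 = omega + alpha * r * r + beta * sigma2
    return returns, sigmas


def var_forecasts(returns, sigmas, p=0.01, window=250):
    """VaR forecasts (positive numbers) for days window..T-1 under three methods."""
    z = -NormalDist().inv_cdf(p)   # about 2.33 for p = 1%
    const_sigma = pstdev(returns)  # full-sample estimate, as in the Normal method
    k = int(p * window)            # index of the empirical quantile (one simple convention)
    out = {"normal": [], "hs": [], "garch": []}
    for t in range(window, len(returns)):
        out["normal"].append(z * const_sigma)
        out["hs"].append(-sorted(returns[t - window:t])[k])
        out["garch"].append(z * sigmas[t])  # sigmas[t] depends only on data up to t-1
    return out


# Hypothetical parameter values; alpha + beta < 1 keeps the variance stationary.
T, window, p = 5250, 250, 0.01
returns, sigmas = simulate_garch(T, omega=1e-6, alpha=0.10, beta=0.85, seed=42)
forecasts = var_forecasts(returns, sigmas, p=p, window=window)

realized = returns[window:]
rates = {}
for name, var in forecasts.items():
    hits = [1 if r < -v else 0 for r, v in zip(realized, var)]
    rates[name] = sum(hits) / len(hits)
    print(f"{name:>6}: violation rate = {rates[name]:.4f}")
```

The violation rate alone does not reveal clustering; feeding each method's hit sequence into the independence test is what separates the methods.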

Tip

How to experiment

Try the Normal (constant) method first: it assumes constant volatility and will produce clustered violations when the true volatility spikes. Then switch to GARCH(1,1): because it tracks the true volatility dynamics, violations should be scattered randomly. Compare the test statistics across methods. Increase \(\alpha\) so that volatility reacts more strongly to recent shocks, and observe how the Normal and HS methods deteriorate.

Binomial Coverage Test

Under a correctly specified VaR model, the number of violations \(M\) in \(n\) days follows a binomial distribution: \(M \sim \text{Binomial}(n, p)\). The Kupiec likelihood ratio test checks whether the observed number of violations \(m\) is consistent with the promised coverage rate \(p\). It is the unconditional coverage test from the previous section, written with \(m = T_1\) and \(n = T\):

\[ LR_{Kupiec} = -2\ln\left[(1-p)^{n-m}\,p^m\right] + 2\ln\left[\left(1-\frac{m}{n}\right)^{n-m}\left(\frac{m}{n}\right)^m\right] \sim \chi_1^2 \]
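
One way to read this statistic is to ask which violation counts would not be rejected. The sketch below evaluates the formula for every possible \(m\) and keeps those at or below the 5% critical value of \(\chi_1^2\), obtained as the squared normal quantile. The function names and the 5% test size are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_lr(m, n, p):
    """Kupiec likelihood ratio statistic for m violations in n days."""
    log_l_null = _xlogy(n - m, 1 - p) + _xlogy(m, p)
    log_l_alt = _xlogy(n - m, 1 - m / n) + _xlogy(m, m / n)
    return -2.0 * log_l_null + 2.0 * log_l_alt


def non_rejection_counts(n, p, test_size=0.05):
    """All violation counts m for which the test does not reject."""
    # A chi-square(1) quantile is the square of a two-sided normal quantile.
    critical = NormalDist().inv_cdf(1 - test_size / 2) ** 2
    return [m for m in range(n + 1) if kupiec_lr(m, n, p) <= critical]


accepted = non_rejection_counts(n=250, p=0.01)
print(accepted[0], accepted[-1])  # smallest and largest non-rejected counts
```

Note that the test is two-sided: observing no violations at all in 250 days is also evidence against a 1% VaR, because the model is then too conservative.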

A critical challenge is the low power of backtests at high confidence levels with limited data. At a 99% VaR with 250 trading days, we expect only 2.5 violations, making it difficult to distinguish a correct model from an incorrect one.
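
The low power can be quantified exactly, since the number of violations is binomial under any true violation rate. The sketch below sums the binomial probabilities of all counts that the Kupiec test rejects. The 5% test size, the true rate of 3%, and the function names are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_lr(m, n, p):
    """Kupiec likelihood ratio statistic for m violations in n days."""
    log_l_null = _xlogy(n - m, 1 - p) + _xlogy(m, p)
    log_l_alt = _xlogy(n - m, 1 - m / n) + _xlogy(m, m / n)
    return -2.0 * log_l_null + 2.0 * log_l_alt


def binom_pmf(m, n, rate):
    """Binomial probability, computed in logs to avoid overflow (0 < rate < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
               + m * math.log(rate) + (n - m) * math.log(1 - rate))
    return math.exp(log_pmf)


def kupiec_power(n, p, true_rate, test_size=0.05):
    """Probability that the test rejects when violations occur at true_rate."""
    critical = NormalDist().inv_cdf(1 - test_size / 2) ** 2
    return sum(binom_pmf(m, n, true_rate)
               for m in range(n + 1)
               if kupiec_lr(m, n, p) > critical)


power_250 = kupiec_power(n=250, p=0.01, true_rate=0.03)
power_1000 = kupiec_power(n=1000, p=0.01, true_rate=0.03)
print(f"n = 250:  power = {power_250:.3f}")
print(f"n = 1000: power = {power_1000:.3f}")
```

Under these assumptions the power for \(n = 250\) comes out at roughly 0.6, so a model with three times the promised violation rate escapes rejection in a sizeable share of samples, while four years of data make rejection very likely.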

Tip

How to experiment

Compare the power curve for \(n = 250\) versus \(n = 1000\) at the 99% confidence level. With fewer observations the power curve is much flatter, meaning the test struggles to distinguish between models with very different true violation rates. A model with a true violation rate of 3% (three times the promised 1%) may still not be rejected.

References

Christoffersen, Peter F. 2012. Elements of Financial Risk Management. 2nd ed. Academic Press.
Hull, John. 2023. Risk Management and Financial Institutions. 6th ed. John Wiley & Sons.