Last But Not Least:
Boundary Attention CalibratiON
for Multimodal KV Cache Compression

¹KAIST   ²National University of Singapore   ³The Chinese University of Hong Kong   ⁴Zhejiang Laboratory
Corresponding authors
Visual importance estimation under low budget on SnapKV: Observation window vs. BACON

Under aggressive KV-cache budgets, observation-window attention dilutes sparse visual evidence and discards answer-critical tokens. BACON recovers this boundary-emergent evidence by calibrating the window score with last-query attention under intra-layer coherence and inter-layer persistence — without changing the compression operator or budget allocation.

Abstract

Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation-window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. We identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention.

We propose BACON, a plug-and-play method that calibrates observation-window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%.

Motivation

Most attention-based KV cache compression methods rank cached tokens by an observation-window score — the average attention each cached token receives from the last few prompt queries. While this averaging stabilizes ranking, it can also obscure sparse but answer-critical visual evidence.
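The dilution effect described above can be illustrated with a toy example. This is a minimal sketch, not the authors' code: the attention values are made up, a single head is considered, causal masking is ignored, and the function names `observation_window_score` and `last_query_score` are hypothetical.

```python
import numpy as np

def observation_window_score(attn, window):
    """Mean attention each cached token receives from the last `window` queries.

    attn: (num_queries, num_keys) post-softmax attention for a single head.
    """
    return attn[-window:].mean(axis=0)

def last_query_score(attn):
    """Attention each cached token receives from the final prompt query only."""
    return attn[-1]

# Toy head: 8 prompt queries over 6 cached tokens (illustrative values only).
attn = np.full((8, 6), 0.02)
attn[:, 0] = 0.70                                   # every query favours token 0
attn[:, 1] = 0.22                                   # ... and, more weakly, token 1
attn[-1] = [0.30, 0.04, 0.02, 0.02, 0.60, 0.02]     # only the last query finds token 4

window_score = observation_window_score(attn, window=4)
last_score = last_query_score(attn)

# Tokens kept under a budget of 2, ranked by each score.
top2_window = set(np.argsort(-window_score)[:2].tolist())
top2_last = set(np.argsort(-last_score)[:2].tolist())
print("kept by window score:", sorted(top2_window))
print("kept by last-query score:", sorted(top2_last))
```

In this toy case, averaging over four queries cuts token 4's score from 0.60 to 0.165, so it falls out of the top 2 under the window score even though the last query ranks it first.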

Observation-window aggregation can dilute sparse visual evidence; the last query recovers it. (a) Under aggressive compression, SparseMM discards answer-critical visual tokens. (b) The last prompt query is more sensitive than earlier queries to answer-relevant tokens. (c) Last-query attention better matches the answer's attention distribution over visual tokens than the window-averaged attention.


Why last-query attention needs calibration. (a) Only ~9% of last-query high-attention tokens are true evidence; ~70% are answer-irrelevant noise. (b) True evidence appears as a coherent local region; noise often appears as an isolated spike. (c) True evidence remains salient across adjacent layers, while noise is less consistent across layers.

The last query is not least for token retention: it reveals visual evidence missed by observation-window aggregation, but this evidence should be calibrated against the window signal before guiding token retention.

Method

Overview of BACON. Boundary evidence is extracted from observation-window and last-query attention, calibrated with intra-layer coherence and inter-layer persistence, and combined into an evidence-aware score for head-wise KV cache compression.
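This page does not give BACON's formulas, so the following is only one possible reading of the overview, written as a sketch under stated assumptions. It assumes that boundary evidence is the positive excess of last-query attention over the window score, that coherence means a token needs a salient immediate neighbour in the same layer, that persistence means it also needs support in an adjacent layer, and that the result is added to the window score with a weight `lam`. All function names, the 1D token layout, and `lam` are hypothetical.

```python
import numpy as np

def neighbor_max(x, axis):
    """Max over the two immediate neighbours along `axis` (zero-padded at the ends)."""
    pad = [(0, 0)] * x.ndim
    pad[axis] = (1, 1)
    padded = np.pad(x, pad)                      # constant zero padding by default
    n = x.shape[axis]
    left = np.take(padded, np.arange(0, n), axis=axis)
    right = np.take(padded, np.arange(2, n + 2), axis=axis)
    return np.maximum(left, right)

def calibrated_score(window, last, lam=1.0):
    """window, last: (num_layers, num_tokens) scores for one head index.

    Assumed recipe (not taken from the paper):
      1. evidence   = what the last query sees beyond the window score
      2. coherent   = evidence that has a salient neighbour in the same layer
      3. persistent = coherent evidence also supported by an adjacent layer
      4. score      = window + lam * persistent
    """
    evidence = np.maximum(last - window, 0.0)
    coherent = np.minimum(evidence, neighbor_max(evidence, axis=1))
    persistent = np.minimum(coherent, neighbor_max(coherent, axis=0))
    return window + lam * persistent

# Toy input: 3 layers, 8 tokens (illustrative values only).
window = np.full((3, 8), 0.05)
last = window.copy()
last[:, 2:4] = 0.40      # coherent region, present in every layer
last[1, 6] = 0.50        # isolated spike in a single layer

score = calibrated_score(window, last)
print(np.round(score[1], 2))
```

Under these assumptions the two-token region is promoted in every layer, while the isolated spike at layer 1, token 6 falls back to its window score. The ranking changes, but nothing about how many tokens are kept.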

Results

  • +7.5%: average gain under the most aggressive budget
  • +30.9%: peak improvement
  • +18.7 points: DocVQA, Qwen2-VL-7B, PyramidKV @ budget 64
  • +12.8 points: DocVQA, Qwen3-VL-30B-A3B, PyramidKV @ budget 64
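The two DocVQA highlights are absolute gains in score points. The short sketch below recomputes them from the Base and +BACON values reported in the cross-model table (60.5 to 79.2 and 74.4 to 87.2) and also prints the relative gain. The helper name `gains` is hypothetical.

```python
def gains(base, new):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = new - base
    relative = 100.0 * absolute / base
    return round(absolute, 1), round(relative, 1)

# DocVQA, PyramidKV, budget 64: Base -> +BACON
qwen2_vl_7b = gains(base=60.5, new=79.2)
qwen3_vl_30b = gains(base=74.4, new=87.2)

print("Qwen2-VL-7B      (points, %):", qwen2_vl_7b)
print("Qwen3-VL-30B-A3B (points, %):", qwen3_vl_30b)
```

For the Qwen2-VL-7B cell the relative gain comes out at about 30.9%, the same number as the headline peak improvement. The page does not state which cell that headline refers to, so treat the match as an inference.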

Broad evaluation

  • Models: LLaVA-NeXT-Mistral-7B, InternVL3-8B, Qwen2-VL-7B-Instruct, Qwen3-VL-30B-A3B-Instruct, Mistral-7B-Instruct-v0.2, Llama-3.1-8B-Instruct.
  • Compression backbones: SnapKV, PyramidKV, AdaKV, SparseMM.
  • Tasks: image understanding (DocVQA, TextVQA, ChartQA, MMMU, TextCaps), video understanding (VATEX, NextQA), GUI grounding (ScreenSpot), and long-context text (LongBench).
  • Budgets: 256 / 128 / 64. BACON's gains are largest at the tightest budget, where sparse visual evidence is most vulnerable.

Efficiency

BACON only refines within-head token scores during prefill; the compressed cache size and the decoding procedure are unchanged. Accuracy gains come with negligible runtime and memory overhead over the original compression method.
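The point that only the scores change can be sketched as follows. The example assumes a simplified head-wise top-k retention step standing in for the compression operator. Real backbones such as SnapKV add further details (for example, always keeping the observation window) that are omitted here, and `keep_indices` is a hypothetical helper.

```python
import numpy as np

def keep_indices(scores, budget):
    """Indices of the `budget` highest-scoring tokens per head, in position order.

    scores: (num_heads, num_tokens); requires budget <= num_tokens.
    """
    top = np.argpartition(-scores, budget - 1, axis=1)[:, :budget]
    return np.sort(top, axis=1)

rng = np.random.default_rng(0)
num_heads, num_tokens, budget = 4, 32, 8

base_scores = rng.random((num_heads, num_tokens))        # stand-in for window scores
refined_scores = base_scores + 0.5 * rng.random((num_heads, num_tokens))  # stand-in for refined scores

kept_base = keep_indices(base_scores, budget)
kept_refined = keep_indices(refined_scores, budget)

# Same operator, same budget: the cache keeps `budget` tokens per head either way.
print(kept_base.shape, kept_refined.shape)
```

Swapping the score changes which tokens are kept, not how many, which is consistent with the statement that the compressed cache size and the decoding procedure are unchanged.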

Experiment Tables

Across image, video, GUI, and long-context text benchmarks — spanning 4 model families, 4 compression backbones, and 3 cache budgets — BACON consistently improves compressed inference.

Image Understanding — Qwen2-VL-7B

Signed values on +BACON rows are the change (Δ) over Base. Full-KV reference: DocVQA 93.7, ChartQA 71.3, MMMU 49.9, TextCaps 1.473.

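| Method | Variant | DocVQA @256 | DocVQA @128 | DocVQA @64 | TextVQA @256 | TextVQA @128 | TextVQA @64 | ChartQA @256 | ChartQA @128 | ChartQA @64 | MMMU @256 | MMMU @128 | MMMU @64 | TextCaps @256 | TextCaps @128 | TextCaps @64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SnapKV | Base | 88.6 | 82.1 | 70.1 | 80.6 | 77.0 | 70.3 | 70.0 | 69.6 | 66.2 | 49.9 | 49.8 | 49.6 | 1.361 | 1.141 | 0.787 |
| SnapKV | +MixKV | 91.8 | 83.9 | 70.9 | 82.6 | 81.0 | 73.5 | 70.0 | 70.2 | 67.2 | 49.9 | 49.9 | 49.7 | 1.470 | 1.332 | 0.919 |
| SnapKV | +BACON | 93.1 (+4.5) | 91.5 (+9.4) | 85.5 (+15.4) | 83.1 (+2.5) | 82.6 (+5.6) | 78.2 (+7.9) | 70.2 (+0.2) | 70.2 (+0.6) | 69.6 (+3.4) | 49.9 | 49.9 (+0.1) | 50.0 (+0.4) | 1.488 (+0.13) | 1.426 (+0.29) | 1.178 (+0.39) |
| PyramidKV | Base | 83.4 | 75.6 | 60.5 | 77.7 | 74.9 | 66.8 | 71.1 | 68.9 | 65.2 | 49.9 | 49.8 | 49.6 | 1.147 | 0.993 | 0.600 |
| PyramidKV | +MixKV | 85.0 | 77.5 | 61.3 | 80.9 | 77.2 | 69.4 | 71.0 | 71.1 | 66.4 | 49.9 | 49.8 | 49.6 | 1.383 | 1.145 | 0.662 |
| PyramidKV | +BACON | 92.3 (+8.9) | 89.3 (+13.7) | 79.2 (+18.7) | 82.5 (+4.8) | 80.0 (+5.1) | 75.1 (+8.3) | 71.2 (+0.1) | 70.6 (+1.7) | 69.2 (+4.0) | 49.9 | 49.9 (+0.1) | 49.8 (+0.2) | 1.461 (+0.31) | 1.344 (+0.35) | 1.073 (+0.47) |
| AdaKV | Base | 88.4 | 81.3 | 69.4 | 80.5 | 75.9 | 70.8 | 69.8 | 69.6 | 66.6 | 49.9 | 49.7 | 49.6 | 1.300 | 1.099 | 0.771 |
| AdaKV | +MixKV | 91.4 | 82.7 | 70.7 | 82.5 | 79.1 | 72.6 | 70.2 | 70.2 | 67.8 | 49.9 | 49.9 | 49.6 | 1.454 | 1.271 | 0.874 |
| AdaKV | +BACON | 93.0 (+4.6) | 91.1 (+9.8) | 86.2 (+16.8) | 82.8 (+2.3) | 81.2 (+5.3) | 78.5 (+7.7) | 70.2 (+0.4) | 70.2 (+0.6) | 69.6 (+3.0) | 49.9 | 49.8 (+0.1) | 49.8 (+0.2) | 1.473 (+0.17) | 1.371 (+0.27) | 1.144 (+0.37) |
| SparseMM | Base | 93.1 | 91.4 | 87.3 | 82.6 | 82.1 | 76.9 | 70.2 | 70.0 | 69.6 | 49.8 | 49.8 | 49.6 | 1.481 | 1.427 | 1.044 |
| SparseMM | +MixKV | 93.9 | 92.9 | 88.6 | 82.5 | 82.5 | 80.9 | 69.6 | 69.8 | 70.8 | 49.8 | 49.8 | 49.7 | 1.480 | 1.456 | 1.303 |
| SparseMM | +BACON | 93.8 (+0.7) | 93.2 (+1.8) | 92.0 (+4.7) | 82.6 | 82.4 (+0.3) | 81.6 (+4.7) | 70.6 (+0.4) | 70.4 (+0.4) | 70.2 (+0.6) | 49.9 (+0.1) | 49.8 | 49.8 (+0.2) | 1.506 (+0.03) | 1.511 (+0.08) | 1.431 (+0.39) |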

Cross-Model — PyramidKV at budget 64

BACON generalizes from 7B MLLMs to InternVL3-8B and the 30B-scale MoE Qwen3-VL-30B-A3B.

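| Model | Variant | DocVQA | TextVQA | ChartQA | MMMU | TextCaps |
|---|---|---|---|---|---|---|
| LLaVA-NeXT-Mistral-7B | Base | 43.8 | 55.7 | 38.6 | 34.7 | 0.436 |
| LLaVA-NeXT-Mistral-7B | +MixKV | 45.6 | 57.8 | 39.1 | 34.7 | 0.505 |
| LLaVA-NeXT-Mistral-7B | +BACON | 52.4 (+8.6) | 59.8 (+4.1) | 39.9 (+1.3) | 34.8 (+0.1) | 0.505 (+0.07) |
| Qwen2-VL-7B | Base | 60.5 | 66.8 | 65.2 | 49.6 | 0.600 |
| Qwen2-VL-7B | +MixKV | 61.3 | 69.4 | 66.4 | 49.6 | 0.662 |
| Qwen2-VL-7B | +BACON | 79.2 (+18.7) | 75.1 (+8.3) | 69.2 (+4.0) | 49.8 (+0.2) | 1.073 (+0.47) |
| InternVL3-8B | Base | 69.2 | 68.8 | 71.1 | 55.1 | 0.688 |
| InternVL3-8B | +MixKV | 69.4 | 68.9 | 71.7 | 55.2 | 0.719 |
| InternVL3-8B | +BACON | 80.7 (+11.5) | 74.4 (+5.6) | 74.3 (+3.2) | 55.3 (+0.2) | 0.782 (+0.09) |
| Qwen3-VL-30B-A3B | Base | 74.4 | 73.2 | 68.5 | 51.78 | 0.266 |
| Qwen3-VL-30B-A3B | +MixKV | 76.3 | 77.2 | 71.8 | 52.11 | 0.370 |
| Qwen3-VL-30B-A3B | +BACON | 87.2 (+12.8) | 79.3 (+6.1) | 73.2 (+4.7) | 52.00 (+0.22) | 0.387 (+0.12) |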

Video Understanding — Qwen2-VL-7B

VATEX captioning (CIDEr / BLEU-4) and NextQA (WUPS) at budgets 512 / 256 / 128.

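| Method | Variant | VATEX CIDEr @512 | VATEX CIDEr @256 | VATEX CIDEr @128 | VATEX BLEU-4 @512 | VATEX BLEU-4 @256 | VATEX BLEU-4 @128 | NextQA WUPS @512 | NextQA WUPS @256 | NextQA WUPS @128 |
|---|---|---|---|---|---|---|---|---|---|---|
| SnapKV | Base | 48.14 | 46.51 | 46.27 | 21.22 | 20.60 | 20.26 | 25.97 | 25.69 | 25.84 |
| SnapKV | +MixKV | 48.55 | 47.68 | 45.95 | 21.34 | 20.82 | 20.32 | 25.99 | 25.93 | 25.60 |
| SnapKV | +BACON | 48.67 | 48.02 | 46.02 | 21.38 | 21.26 | 20.38 | 26.15 | 25.94 | 26.02 |
| AdaKV | Base | 48.20 | 46.20 | 45.37 | 21.55 | 20.43 | 20.28 | 25.72 | 25.70 | 25.65 |
| AdaKV | +MixKV | 48.68 | 47.46 | 45.39 | 21.22 | 20.78 | 20.34 | 25.99 | 26.01 | 25.86 |
| AdaKV | +BACON | 48.82 | 47.85 | 45.33 | 21.55 | 21.24 | 20.54 | 25.96 | 25.95 | 25.86 |
| PyramidKV | Base | 46.09 | 46.03 | 44.36 | 20.58 | 20.29 | 19.26 | 25.88 | 25.69 | 25.63 |
| PyramidKV | +MixKV | 46.13 | 45.67 | 44.50 | 20.48 | 20.29 | 19.55 | 25.87 | 25.72 | 25.52 |
| PyramidKV | +BACON | 46.57 | 46.22 | 45.70 | 20.81 | 20.25 | 19.70 | 26.04 | 25.80 | 25.65 |
| SparseMM | Base | 47.91 | 47.16 | 46.12 | 20.79 | 20.24 | 19.77 | 26.37 | 25.94 | 25.71 |
| SparseMM | +MixKV | 48.65 | 47.44 | 45.99 | 20.99 | 20.54 | 19.82 | 26.13 | 25.98 | 25.83 |
| SparseMM | +BACON | 48.90 | 47.57 | 46.38 | 21.28 | 20.81 | 20.03 | 26.33 | 26.17 | 26.07 |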

GUI Grounding — ScreenSpot (Qwen2-VL-7B)

Accuracy on the mobile, desktop, and web text and icon subsets, plus the average, at budgets 128 and 64.

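| Method | Variant | Mobile Text @128 | Mobile Text @64 | Mobile Icon @128 | Mobile Icon @64 | Desktop Text @128 | Desktop Text @64 | Desktop Icon @128 | Desktop Icon @64 | Web Text @128 | Web Text @64 | Web Icon @128 | Web Icon @64 | Average @128 | Average @64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SnapKV | Base | 24.5 | 15.4 | 10.5 | 4.8 | 19.1 | 9.3 | 4.3 | 4.3 | 5.7 | 6.5 | 6.8 | 5.8 | 11.8 | 7.7 |
| SnapKV | +MixKV | 24.9 | 15.8 | 10.8 | 4.8 | 19.6 | 10.8 | 4.3 | 4.3 | 6.1 | 6.5 | 6.8 | 6.3 | 12.1 | 8.1 |
| SnapKV | +BACON | 24.9 | 15.8 | 10.7 | 5.7 | 21.1 | 13.7 | 4.3 | 5.7 | 7.0 | 6.4 | 8.2 | 6.3 | 12.7 | 9.0 |
| AdaKV | Base | 26.1 | 15.4 | 11.8 | 4.2 | 19.1 | 11.3 | 5.7 | 5.0 | 4.8 | 6.5 | 6.3 | 5.8 | 12.3 | 8.0 |
| AdaKV | +MixKV | 26.6 | 14.6 | 11.3 | 4.8 | 20.1 | 11.3 | 5.7 | 5.0 | 5.7 | 5.7 | 6.3 | 6.3 | 12.6 | 8.0 |
| AdaKV | +BACON | 26.6 | 15.4 | 12.0 | 5.0 | 19.5 | 13.9 | 5.7 | 5.0 | 5.7 | 6.7 | 6.8 | 6.3 | 12.7 | 8.7 |
| PyramidKV | Base | 24.5 | 13.9 | 8.2 | 6.6 | 20.1 | 9.8 | 3.6 | 5.0 | 6.1 | 6.1 | 6.2 | 5.3 | 11.5 | 7.8 |
| PyramidKV | +MixKV | 24.5 | 13.2 | 8.7 | 7.0 | 19.1 | 10.3 | 3.6 | 5.0 | 6.5 | 6.1 | 6.6 | 5.8 | 11.5 | 7.9 |
| PyramidKV | +BACON | 24.8 | 13.5 | 8.9 | 6.7 | 20.1 | 10.3 | 4.3 | 5.7 | 7.8 | 6.2 | 6.3 | 5.8 | 12.0 | 8.0 |
| SparseMM | Base | 22.7 | 15.8 | 9.2 | 4.4 | 18.6 | 9.8 | 4.6 | 4.3 | 7.0 | 4.5 | 8.7 | 5.8 | 11.9 | 7.4 |
| SparseMM | +MixKV | 21.5 | 14.6 | 10.2 | 4.4 | 17.0 | 10.3 | 2.9 | 4.3 | 7.4 | 3.9 | 8.2 | 5.8 | 11.2 | 7.2 |
| SparseMM | +BACON | 22.7 | 16.9 | 10.2 | 4.5 | 19.6 | 10.3 | 4.9 | 5.0 | 7.6 | 5.2 | 8.3 | 8.2 | 12.2 | 8.4 |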

Long-Context Text — LongBench (Mistral-7B-Instruct-v0.2)

Category averages and overall average; Full-KV reference 41.76.

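| Budget | Method | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot | Synthetic | Code | Avg. |
|---|---|---|---|---|---|---|---|---|
| 1024 | SnapKV (Base) | 34.38 | 28.22 | 25.21 | 64.94 | 45.23 | 46.11 | 40.06 |
| 1024 | SnapKV +MixKV | 34.41 | 28.22 | 25.34 | 66.17 | 44.53 | 46.30 | 40.25 |
| 1024 | SnapKV +BACON | 35.49 (+1.11) | 28.56 (+0.34) | 25.69 (+0.48) | 67.00 (+2.06) | 45.29 (+0.06) | 46.44 (+0.33) | 40.85 (+0.79) |
| 1024 | AdaKV (Base) | 34.94 | 28.94 | 25.10 | 65.74 | 45.52 | 46.47 | 40.51 |
| 1024 | AdaKV +MixKV | 35.11 | 28.84 | 25.47 | 66.21 | 44.92 | 46.40 | 40.60 |
| 1024 | AdaKV +BACON | 35.67 (+0.73) | 29.33 (+0.39) | 25.71 (+0.61) | 66.94 (+1.20) | 45.63 (+0.11) | 46.79 (+0.32) | 41.11 (+0.60) |
| 1024 | PyramidKV (Base) | 33.52 | 28.13 | 24.87 | 65.20 | 44.70 | 45.95 | 39.78 |
| 1024 | PyramidKV +MixKV | 33.90 | 28.41 | 25.58 | 66.21 | 43.59 | 45.80 | 40.07 |
| 1024 | PyramidKV +BACON | 34.85 (+1.33) | 28.92 (+0.79) | 26.02 (+1.15) | 66.71 (+1.51) | 44.75 (+0.05) | 46.43 (+0.48) | 40.74 (+0.96) |
| 512 | SnapKV (Base) | 33.25 | 26.82 | 23.65 | 64.48 | 45.15 | 45.41 | 39.11 |
| 512 | SnapKV +MixKV | 33.08 | 27.12 | 24.40 | 65.19 | 45.15 | 45.61 | 39.43 |
| 512 | SnapKV +BACON | 34.06 (+0.81) | 27.50 (+0.68) | 24.54 (+0.89) | 66.09 (+1.61) | 45.32 (+0.17) | 45.91 (+0.50) | 39.94 (+0.83) |
| 512 | AdaKV (Base) | 33.33 | 27.23 | 23.80 | 64.70 | 45.31 | 45.86 | 39.35 |
| 512 | AdaKV +MixKV | 33.52 | 27.48 | 24.27 | 65.30 | 45.06 | 46.15 | 39.63 |
| 512 | AdaKV +BACON | 34.06 (+0.73) | 27.54 (+0.31) | 24.60 (+0.80) | 66.55 (+1.85) | 45.39 (+0.08) | 45.89 (+0.03) | 40.05 (+0.70) |
| 512 | PyramidKV (Base) | 31.92 | 26.95 | 23.39 | 64.26 | 44.58 | 44.45 | 38.60 |
| 512 | PyramidKV +MixKV | 32.09 | 26.69 | 23.89 | 64.88 | 44.80 | 44.80 | 38.86 |
| 512 | PyramidKV +BACON | 32.83 (+0.91) | 27.62 (+0.67) | 23.96 (+0.57) | 65.97 (+1.71) | 45.07 (+0.49) | 45.14 (+0.69) | 39.47 (+0.87) |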

Category averages were computed from the per-task numbers in the paper; the final column matches the table average.

BibTeX

@article{chen2026bacon,
  title  = {Last But Not Least: Boundary Attention Calibration for Multimodal KV Cache Compression},
  author = {Chen, Tianhao and Wu, Yuheng and Yao, Kelu and Xu, Xiaogang and Hu, Xiaobin and Lee, Dongman},
  year   = {2026},
  eprint = {2606.14782},
  archivePrefix = {arXiv}
}