“Optimally Combining Probe Monitors and Black Box Monitors” by Tim Hua, jamesbaskerville, BionicD0LPH1N, Mia Hopman, Aryan Bhatt, Tyler Tracy

Author: LessWrong
Published: Mon 28 Jul 2025
Episode Link: https://www.lesswrong.com/posts/FhixwyymPxF8TZX39/optimally-combining-probe-monitors-and-black-box-monitors

Link to our arXiv paper: https://arxiv.org/abs/2507.15886

TL;DR: We study how to efficiently combine multiple monitors with different performance and cost profiles into a single protocol. When the number of monitors is small (2-4), one should simply brute-force search over all monitor invocation strategies (i.e., which monitor to call first, and when to call an additional monitor). For every strategy, apply the Neyman-Pearson lemma and audit the regions with the highest likelihood ratios $\ell(m) = \frac{P(M = m \mid \text{Output is misaligned})}{P(M = m \mid \text{Output is aligned})}$. Measure the recall of each strategy and pick the best one.
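The auditing step can be illustrated with a minimal sketch. It assumes the monitor score has been discretized into bins and that the per-bin probabilities under aligned and misaligned outputs are already estimated. The function name `neyman_pearson_audit`, the binning, and the `audit_budget` parameter (the fraction of aligned outputs one can afford to audit) are illustrative assumptions, not taken from the post or the paper. Monitor costs are ignored here, and the sketch stops at the first bin that does not fit the budget rather than randomizing on the boundary as a strict Neyman-Pearson test would.

```python
def neyman_pearson_audit(p_misaligned, p_aligned, audit_budget):
    """Pick score bins to audit, highest likelihood ratio first.

    p_misaligned[i], p_aligned[i]: probability that the monitor score lands in
    bin i for misaligned / aligned outputs (hypothetical, pre-estimated inputs).
    audit_budget: maximum fraction of aligned outputs we can afford to audit.
    Returns (audited_bins, recall).
    """
    # Likelihood ratio of each bin: P(bin | misaligned) / P(bin | aligned).
    ratios = []
    for i, (pm, pa) in enumerate(zip(p_misaligned, p_aligned)):
        ratio = float("inf") if pa == 0 else pm / pa
        ratios.append((ratio, i))
    ratios.sort(reverse=True)

    audited, spent, recall = [], 0.0, 0.0
    for _, i in ratios:
        # Stop once the next bin would exceed the budget (small float tolerance).
        if spent + p_aligned[i] > audit_budget + 1e-12:
            break
        audited.append(i)
        spent += p_aligned[i]      # budget used on aligned outputs
        recall += p_misaligned[i]  # misaligned outputs caught
    return audited, recall


# Toy distributions over three score bins (made-up numbers).
bins, recall = neyman_pearson_audit([0.1, 0.2, 0.7], [0.7, 0.25, 0.05], 0.30)
```

In a brute-force search, a routine like this would be run once per candidate invocation strategy, and the strategy with the highest recall kept.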

Example of an optimal way of combining monitors given some costs and budget

In this example, you call your black box monitor only if the probe monitor score is between -0.01 and 2.18, and you audit everywhere with a log likelihood ratio above -0.32 (above the green line). This set of parameters achieves the [...]
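The rule described above can be written as a small two-stage decision function. This is a sketch only: the three thresholds come from the example, but the function names, the callable `log_likelihood_ratio` (which would have to be fitted from data), and the choice of inclusive bounds for "between" are assumptions made for illustration.

```python
PROBE_LOW, PROBE_HIGH = -0.01, 2.18  # call the black box monitor inside this range
LLR_THRESHOLD = -0.32                # audit when the log likelihood ratio exceeds this


def should_audit(probe_score, run_black_box, log_likelihood_ratio):
    """Return (audit, called_black_box) for one output.

    run_black_box: zero-argument callable returning the black box score
        (hypothetical; it is only invoked when the probe score is in range).
    log_likelihood_ratio: callable (probe_score, black_box_score_or_None) -> float
        (hypothetical; stands in for a ratio estimated from labeled data).
    """
    black_box_score = None
    # Assumption: "between" is treated as inclusive on both ends.
    if PROBE_LOW <= probe_score <= PROBE_HIGH:
        black_box_score = run_black_box()
    llr = log_likelihood_ratio(probe_score, black_box_score)
    return llr > LLR_THRESHOLD, black_box_score is not None
```

Passing the black box monitor as a callable keeps the expensive call lazy, so its cost is only paid for probe scores in the uncertain middle range.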

---

Outline:

(01:16) Introduction

(02:49) Two example monitors

(04:26) Optimization problem with two monitors

(05:21) Picking where to audit given monitor calls

(08:59) Just brute force where to call monitors

(10:27) Summary:

(11:14) Some more examples

(12:10) Pareto Frontiers with two monitors v. only one monitor

---

First published:

July 27th, 2025