Remark: None of the ideas in this note are original. They are mostly extremely basic statistics.
Here’s a simple model for an evilness detector: the detector outputs a real-valued score, with one score distribution when the AI is benign and a different one (shifted toward higher values, say) when it’s evil.
We can get a classifier by fixing any threshold and flagging “evil” whenever the score exceeds it; sweeping the threshold over all values traces out an ROC curve.
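Here’s a minimal numerical sketch of this, assuming (purely for illustration) unit-variance Gaussian scores with benign mean 0 and evil mean 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: benign scores ~ N(0, 1), evil scores ~ N(2, 1).
benign = rng.normal(0.0, 1.0, size=100_000)
evil = rng.normal(2.0, 1.0, size=100_000)

# Fix a threshold t and classify "evil" iff score > t; each t gives one
# (FPR, TPR) point, and sweeping t traces out the ROC curve.
for t in np.linspace(-1.0, 4.0, 6):
    fpr = (benign > t).mean()  # false positive rate
    tpr = (evil > t).mean()    # true positive rate
    print(f"t={t:+.1f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```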
Here’s a slightly more complex scenario:
You have several metrics rather than one, and under the benign and evil states the vector of metric values is drawn from one of two (spherical, say) Gaussians with different means.
In this case, the family of classifiers that gives the ROC curve is (again) given by choosing a hyperplane whose normal vector is the line connecting the centers of the evil and benign distributions, and sweeping the hyperplane’s offset.
If you have correlated metrics (a shared but non-spherical covariance $\Sigma$), the best classifiers are still hyperplanes, but the normal vector becomes the whitened mean difference $\Sigma^{-1}(\mu_{\text{evil}} - \mu_{\text{benign}})$.
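Here’s a sketch of the multi-metric case with made-up Gaussian parameters: project onto the (whitened) line connecting the two centers and sweep the hyperplane’s offset:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # number of metrics (made up)

# Made-up means and a shared, correlated covariance.
mu_benign, mu_evil = np.zeros(d), np.full(d, 1.0)
cov = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))

benign = rng.multivariate_normal(mu_benign, cov, size=50_000)
evil = rng.multivariate_normal(mu_evil, cov, size=50_000)

# Normal vector of the separating hyperplanes: the covariance-whitened
# difference of the two centers.
w = np.linalg.solve(cov, mu_evil - mu_benign)
score_benign, score_evil = benign @ w, evil @ w

# Sweep the hyperplane's offset to trace out the ROC curve.
for t in np.quantile(score_benign, [0.5, 0.9, 0.99, 0.999]):
    print(f"FPR={(score_benign > t).mean():.3f}  TPR={(score_evil > t).mean():.3f}")
```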
More generally, you might imagine that you know the benign and evil distributions in closed form but they aren’t Gaussian. Then the right thing to do is to classify based on the likelihood ratio (there’s a sketch after this list):
- Sweep the likelihood-ratio threshold that you care about.
- Analytically compute each point’s likelihood ratio to decide which side of the threshold it falls on.
- This gives you your ROC curve.
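For instance, here’s a sketch with two made-up (non-Gaussian) score distributions whose densities we can evaluate analytically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up example: benign scores ~ Exponential(1), evil scores ~ Gamma(shape=3).
p_benign, p_evil = stats.expon(), stats.gamma(3)
benign = p_benign.rvs(100_000, random_state=rng)
evil = p_evil.rvs(100_000, random_state=rng)

def likelihood_ratio(x):
    return p_evil.pdf(x) / p_benign.pdf(x)

lr_benign, lr_evil = likelihood_ratio(benign), likelihood_ratio(evil)

# Sweep the likelihood-ratio threshold to trace out the ROC curve.
for c in [1, 2, 5, 10, 50]:
    print(f"LR>{c:>2}:  FPR={(lr_benign > c).mean():.3f}  TPR={(lr_evil > c).mean():.3f}")
```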
Even more generally, you might assume that you can’t write down either distribution analytically, but you can get a bunch of samples from the detectors under both the benign and evil states. Then you can train a classifier on those samples (which, if it’s good, approximates the likelihood ratio) and sweep a threshold on its output.
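Here’s a sketch of the sample-based version, using logistic regression as a stand-in classifier (any classifier with a continuous score would do; its score plays the role of the likelihood ratio):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 5

# Made-up samples: in reality these would come from running the detectors
# in (simulated) benign and evil states.
X_benign = rng.normal(0.0, 1.0, size=(20_000, d))
X_evil = rng.normal(0.8, 1.2, size=(20_000, d))

X = np.vstack([X_benign, X_evil])
y = np.concatenate([np.zeros(len(X_benign)), np.ones(len(X_evil))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# The trained classifier's score stands in for the (unknown) likelihood ratio.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)  # sweeping the score threshold
for f, t in zip(fpr[:: len(fpr) // 5], tpr[:: len(tpr) // 5]):
    print(f"FPR={f:.3f}  TPR={t:.3f}")
```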
Distribution shift
The assumption that we can get a bunch of samples from the detectors under both the benign and evil states is pretty questionable.
Here’s an attempt at making a more realistic setup. It will still be extremely toy.
Suppose you have bits $x_1, \ldots, x_n$ (the outputs of $n$ detectors), with one known joint distribution in the benign state and another in the evil state.
But in actuality, there is some subset of the detectors (say 20% of them) whose values the AI gets to set, rather than having them sampled from the evil distribution.
We now have a crisp theoretical question:
The AI then chooses a strategy $\sigma$ for how it’ll set the subset of the detectors that it has control over. Then, the human chooses an evil vs benign classifier $C$. We’ll assume that the AI doesn’t know $C$ (we are assuming that the AI can’t change $\sigma$ based on $C$). Then, the human is either handed $x_1, \ldots, x_n$ sampled from the benign distribution, OR the human is handed $x_1, \ldots, x_n$ where 80% of these values are sampled from the evil distribution, and the remaining 20% of values are set by $\sigma$ (you can decide whether or not $\sigma$ gets to see the rest of the bits, I think not seeing them is maybe more natural). The human then runs their classifier $C$ on the bits and outputs “benign” or “evil”. For every value of $\sigma$, you’d get an ROC curve. You could take the worst case over the values of $\sigma$ to compute your “guaranteed ROC curve.”
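To make the game concrete, here’s a toy Monte Carlo sketch with entirely made-up numbers (50 binary detectors, benign fire rate 5%, evil fire rate 40%, a count-based classifier, and one crude adversarial strategy that zeroes out the 20% of detectors it controls); the “guaranteed” curve would take the worst case over all such strategies:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_controlled = 50, 10          # 20% of detectors under AI control (made up)
p_benign, p_evil = 0.05, 0.40     # per-detector fire rates (made up)
trials = 50_000
controlled = np.arange(n_controlled)  # the fixed subset the AI controls

benign_bits = rng.random((trials, n)) < p_benign

evil_bits = rng.random((trials, n)) < p_evil
# One crude strategy sigma: force the controlled detectors to look benign.
evil_bits[:, controlled] = False

# Classifier: flag "evil" if more than t detectors fired; sweep t for the ROC.
benign_counts, evil_counts = benign_bits.sum(axis=1), evil_bits.sum(axis=1)
for t in range(2, 12, 2):
    fpr = (benign_counts > t).mean()
    tpr = (evil_counts > t).mean()  # TPR against this one strategy only;
    # the guaranteed curve takes the minimum over all strategies sigma.
    print(f"t={t:2d}  FPR={fpr:.4f}  TPR={tpr:.4f}")
```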
Unfortunately, this seems like probably the wrong question, and not very productive to think about.
A less pessimistic assumption that you could make is some kind of non-adversarial distribution shift. If you make that assumption then it seems like a reasonable idea to just flag things which are far OOD from the benign distribution, even if they’re closer to the benign distribution than the evil distribution.
You shouldn’t spend too much of your FPR budget on this, but it’s plausibly worth spending some of it. Like 1%.
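Here’s a sketch of what spending 1% of a 5% FPR budget on an OOD flag could look like (all distributions and numbers are made up); the shifted “evil” scores in this example are closer to the benign center than to the expected evil center, so only the OOD flag catches them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up 1-D scores: benign ~ N(0,1), "expected" evil ~ N(3,1), but the evil
# behavior actually seen at deployment is shifted somewhere else entirely.
p_benign, p_evil_expected = stats.norm(0, 1), stats.norm(3, 1)
benign = p_benign.rvs(200_000, random_state=rng)
shifted_evil = stats.norm(-4, 1).rvs(10_000, random_state=rng)

def lr(x):  # likelihood ratio against the *expected* evil distribution
    return p_evil_expected.pdf(x) / p_benign.pdf(x)

# Flag 1: likelihood-ratio flag, calibrated to 4% FPR on benign data.
lr_threshold = np.quantile(lr(benign), 0.96)
# Flag 2: OOD flag, calibrated to 1% FPR: anything with very low benign density.
density_threshold = np.quantile(p_benign.pdf(benign), 0.01)

def flag(x):
    return (lr(x) > lr_threshold) | (p_benign.pdf(x) < density_threshold)

print("total FPR:", flag(benign).mean())  # at most 5%; the two flags overlap a bit
print("catch rate on shifted evil:", flag(shifted_evil).mean())
```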