Why Identity-Aware Negative Sampling Matters

In contrastive learning, how good a representation a model learns depends heavily on the quality of the negative examples that show it "what to tell apart from what." In multimodal deepfake detection this becomes even more critical: distinguishing a person's genuine video from a manipulated video of the same person is far harder — and far more instructive — than separating two random people. In this post I describe how a seemingly innocent design choice — the way batches are assembled — silently determines the quality of the negatives a model sees, and what we gain once we notice and fix it.

The equation that frames the problem

The identity-aware negative term in our contrastive loss (Equation 5) is only meaningful under a specific condition: real and fake examples of the same identity must appear together in the same batch. When a batch is filled completely at random, the probability of that pairing is much lower than you'd guess. With $N_{\text{id}}$ identities and a batch size of $B$, and assuming the other $B - 1$ slots are filled roughly uniformly over identities, the chance that a given identity's counterpart lands in the batch at least once is roughly:

$$P(\text{match}) \approx 1 - \left(1 - \frac{1}{N_{\text{id}}}\right)^{B-1}$$

For a typical setup ($N_{\text{id}}$ in the hundreds, $B = 24$) this comes out surprisingly small. In practice, the relevant negative was absent from the batch in roughly 73% of cases. The loss's most instructive term was being skipped, silently and without error, in the great majority of steps. The model appeared to be training with an "identity-aware" loss; in reality that term sat idle most of the time.
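To get a feel for the formula, here is a minimal sketch that evaluates it for a few identity counts. The function name `match_probability` and the values tried are illustrative assumptions, not part of our training code, and the uniform-sampling assumption is the same one the approximation itself makes.

```python
def match_probability(n_id: int, batch_size: int) -> float:
    """Approximate chance that a sample's same-identity counterpart is in the batch.

    Assumes the other batch_size - 1 slots are filled independently and
    uniformly over n_id identities (the approximation used in the text).
    """
    if n_id < 1 or batch_size < 1:
        raise ValueError("n_id and batch_size must be positive")
    return 1.0 - (1.0 - 1.0 / n_id) ** (batch_size - 1)


if __name__ == "__main__":
    # Illustrative identity counts only; plug in your own dataset's numbers.
    for n_id in (50, 100, 200, 500):
        p = match_probability(n_id, 24)
        print(f"N_id={n_id:4d}  B=24  P(match) ~ {p:.3f}")
```

The probability shrinks quickly as the number of identities grows relative to the batch size, which is why random batching starves the identity-aware term.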

The fix: intervene at the sampling layer

Because the root of the problem lay not in the loss itself but in the sampling layer that feeds it, that is where the fix belonged. In place of the standard random sampler, we designed one that groups batches by identity — filling each batch so that it contains both the real and the fake examples of the identities it draws:

from collections import defaultdict

from torch.utils.data import Sampler  # assumed: the PyTorch Sampler base class


class IdentityGroupedBatchSampler(Sampler):
    """Puts the real + fake examples of the same identity in the same batch."""

    def __init__(self, identities, batch_size, group_size=2):
        # Map each identity to the dataset indices that belong to it.
        self.groups = defaultdict(list)
        for idx, ident in enumerate(identities):
            self.groups[ident].append(idx)
        self.batch_size = batch_size
        self.group_size = group_size

    def __iter__(self):
        # fill each batch by drawing group_size samples per identity
        ...

With this change the match rate rose to roughly 1.0, so the identity-aware negative term now engages on nearly every step. The detail worth dwelling on: we changed nothing in the model architecture, the loss function, or the hyperparameters. The only thing we touched was how batches were built.
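The `__iter__` body in the snippet above is elided, so the following is only a sketch of one way such a sampler could work, not the implementation we used. It assumes a hypothetical per-sample `is_fake` list (which the original constructor does not take) and builds every batch from (real, fake) pairs of the same identity. It is written in plain Python; any iterable of index lists can be passed to PyTorch's `DataLoader` through its `batch_sampler` argument.

```python
import random
from collections import defaultdict


class PairedIdentityBatchSampler:
    """Yields index batches made of (real, fake) pairs sharing an identity.

    Hypothetical sketch: `is_fake` is an assumed per-sample label list.
    """

    def __init__(self, identities, is_fake, batch_size, seed=0):
        if batch_size < 2 or batch_size % 2:
            raise ValueError("batch_size must be an even number >= 2")
        self.real = defaultdict(list)
        self.fake = defaultdict(list)
        for idx, (ident, fake) in enumerate(zip(identities, is_fake)):
            (self.fake if fake else self.real)[ident].append(idx)
        # Only identities with both a real and a fake sample can form a pair.
        self.paired_ids = [i for i in self.real if i in self.fake]
        self.pairs_per_batch = batch_size // 2
        self.rng = random.Random(seed)

    def __iter__(self):
        ids = list(self.paired_ids)
        self.rng.shuffle(ids)
        step = self.pairs_per_batch
        # Drop the last incomplete batch so every batch has the same size.
        for start in range(0, len(ids) - step + 1, step):
            batch = []
            for ident in ids[start:start + step]:
                batch.append(self.rng.choice(self.real[ident]))
                batch.append(self.rng.choice(self.fake[ident]))
            yield batch

    def __len__(self):
        return len(self.paired_ids) // self.pairs_per_batch
```

A design note on this sketch: each pass visits every paired identity at most once and draws a single real and a single fake sample for it, so it trades full dataset coverage per epoch for a guaranteed match in every batch. Identities that lack either a real or a fake example are skipped entirely.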

Reading the results

| Sampler | Match rate | Δ Accuracy |
| --- | --- | --- |
| Random ($B = 24$) | ~0.27 | baseline |
| Identity-grouped | ~1.00 | +X.XX |

The real lesson of the table goes beyond any single accuracy figure. Between "adding a loss term" and "that term actually working" there is often a gap that goes unnoticed, and that gap frequently hides in the data-sampling layer. Having a loss in your code does not mean it contributes to the gradient at every step; batch composition sits in between as a silent but decisive intermediary. So before assuming a component is "on," measure how often it actually activates. In contrastive setups that habit almost always pays off.
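One way to check how often the identity-aware term can activate is to count, over the batches a sampler produces, the share of samples whose batch also contains a same-identity sample with the opposite label. The sketch below assumes that per-sample definition of "match rate" and a hypothetical `is_fake` label list; the exact definition behind the table is not spelled out here, so treat this as an approximation of the idea rather than the measurement we ran.

```python
from collections import defaultdict


def match_rate(batches, identities, is_fake):
    """Fraction of batch samples whose batch also holds a same-identity
    sample with the opposite real/fake label."""
    matched = 0
    total = 0
    for batch in batches:
        # For each identity in this batch, record which labels are present.
        labels_seen = defaultdict(set)
        for idx in batch:
            labels_seen[identities[idx]].add(bool(is_fake[idx]))
        for idx in batch:
            total += 1
            opposite = not bool(is_fake[idx])
            if opposite in labels_seen[identities[idx]]:
                matched += 1
    return matched / total if total else 0.0
```

Running this once over an epoch of batches from each sampler gives a direct before-and-after comparison that does not depend on the model or the loss.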

What's next

In a following post I'll turn to another quietly overlooked inconsistency — the mismatch between the reference distributions used at training and evaluation time — where, once again, the root cause surfaces not in the model itself but in the data flow that feeds it.
