intermediate probabilistic-ir 40 min read

BM25 and the Binary Independence Model

How probability theory turns term counts into a ranking function — and why saturation and length normalization are the two ideas that matter

Prerequisites: The Retrieval Problem: Relevance, Similarity, and the Geometry of Scores(coming soon)The Vector Space Model and TF-IDF(coming soon)The Probability Ranking Principle(coming soon)

Overview & Motivation

The retrieval problem asks us to score documents against a query by relevance. The vector space model gave us a first answer — represent both as weighted term vectors and rank by cosine similarity — but raw term frequency has a pathology: a long, verbose document can dominate the ranking simply by repeating a query term many times, even when a shorter, more focused document is more relevant.

BM25 fixes this. It is the strongest lexical retrieval baseline in information retrieval, the default in Lucene, Elasticsearch, and Pyserini, and the lexical half of nearly every production hybrid retriever. Crucially, it is not an arbitrary formula. Two geometric ideas govern it, and both fall out of probability theory:

  1. Saturation — the marginal value of seeing a term one more time diminishes. The tenth occurrence of “interest” tells you much less than the first.
  2. Length normalization — a long document has more chances to contain a term by accident, so its term counts should be discounted relative to the average document length.

Before any algebra, drag the two governing parameters and watch both ideas at work:

A. tf saturation (dl = avgdl) — solid: BM25, dashed: raw tf, dotted: log(1+tf)
B. length normalization B(dl) = 1 − b + b·dl/avgdl
C. live ranking for query “interest rate exposure” — drag b to 0 to watch the padded transcript hijack the top spot
  1. 1.10-K · net interest margin sensitivity1.24
  2. 2.10-K · boilerplate legal1.13
  3. 3.10-K · foreign-exchange risk0.95
  4. 4.Earnings call · long Q&A (padded)0.57
  5. 5.News · Fed rate decision0.49
  6. 6.Earnings call · brief update0.49

Make the pathology concrete. A bank’s quarterly earnings call runs ninety minutes; transcribed, it is a few hundred words of prepared remarks and Q&A in which an executive says rate and interest a dozen times each, mostly as conversational filler — “in this rate environment,” “rates being what they are.” A single line buried in the same company’s 10-K states the fact a risk analyst actually wants: that its net interest margin is sensitive to rate moves — its real rate exposure. Ask raw term frequency which document is about interest-rate exposure and it answers with the transcript every time, not because the call is more relevant but because it is longer and so accumulates more chances to repeat the query words. The focused disclosure, which answers the question in one sentence, sinks beneath it.

This is the length-hijack failure, and removing it is exactly what BM25 is for. Two ideas do the work, and the laboratory above lets you watch each in isolation: saturation, which caps the reward for repeating a term so the call’s verbal tics stop paying off, and length normalization, which discounts a document’s term counts in proportion to how far its length runs past the collection average. The two governing parameters are k1k_1, which sets how quickly repetition saturates, and bb, which sets how aggressively length is penalized. Drag k1k_1 toward zero and the saturation curve in panel A collapses to a step; drag bb to zero and the padded transcript climbs back to the top of panel C — the very failure we just described, reconstructed on demand.

What we cover

  1. From the Probability Ranking Principle to the Binary Independence Model.
  2. The Robertson–Spärck-Jones weight, and how smoothed IDF emerges from a Jeffreys prior.
  3. The 2-Poisson eliteness model and why term frequency should saturate.
  4. The BM25 scoring function, assembled factor by factor.
  5. Limit theorems: what k₁ and b do at their extremes.
  6. The geometric and information-theoretic reading.
  7. A finance case study, and the honest caveats.

From the Probability Ranking Principle to the BIM

We begin where the PRP leaves off: rank documents by the probability of relevance P(R=1d,q)P(R=1 \mid d, q). Because ranking is invariant under any strictly monotonic transform, we may rank by the odds instead, and then by their logarithm:

score(d,q)  =  logP(R=1d,q)P(R=0d,q)  =Bayes  logP(dR=1,q)P(dR=0,q)+logP(R=1q)P(R=0q)constant in d.\text{score}(d, q) \;=\; \log \frac{P(R=1 \mid d, q)}{P(R=0 \mid d, q)} \;\overset{\text{Bayes}}{=}\; \log \frac{P(d \mid R=1, q)}{P(d \mid R=0, q)} + \underbrace{\log \frac{P(R=1\mid q)}{P(R=0\mid q)}}_{\text{constant in } d}.

The Binary Independence Model makes two modeling assumptions: a document is a binary vector of term presence/absence xt{0,1}x_t \in \{0,1\}, and terms are conditionally independent given relevance.

Definition 1 (Binary Independence Model).

Let pt=P(xt=1R=1)p_t = P(x_t = 1 \mid R=1) and ut=P(xt=1R=0)u_t = P(x_t = 1 \mid R=0) be the probabilities that term tt is present in a relevant and a non-relevant document, respectively. Under conditional independence,

logP(dR=1,q)P(dR=0,q)=t[xtlogptut+(1xt)log1pt1ut].\log \frac{P(d \mid R=1, q)}{P(d \mid R=0, q)} = \sum_{t} \left[ x_t \log\frac{p_t}{u_t} + (1-x_t)\log\frac{1-p_t}{1-u_t}\right].

Dropping terms constant in dd (those with xt=0x_t = 0 for query-absent terms collapse into the document-independent baseline), the document-dependent score reduces to a sum over the query terms present in the document.

To see why, regroup each per-term contribution from Definition 1 so the part that depends on the document separates cleanly from the part that does not. For a single term,

xtlogptut+(1xt)log1pt1ut  =  xtlogpt(1ut)ut(1pt)depends on xt  +  log1pt1utconstant in d.x_t \log\frac{p_t}{u_t} + (1-x_t)\log\frac{1-p_t}{1-u_t} \;=\; x_t \underbrace{\log\frac{p_t\,(1-u_t)}{u_t\,(1-p_t)}}_{\text{depends on } x_t} \;+\; \underbrace{\log\frac{1-p_t}{1-u_t}}_{\text{constant in } d}.

The rewrite is pure bookkeeping: expand the two logarithms, collect the coefficients of xtx_t, and what is left over is the xtx_t-free remainder log1pt1ut\log\frac{1-p_t}{1-u_t}. Summed over all terms, that remainder depends only on the collection statistics pt,utp_t, u_t — never on which document we are scoring — so under monotone ranking it is an additive constant we may discard. The first piece survives only when xt=1x_t = 1, that is, when the term is present. Finally, a term absent from the query carries no power to discriminate relevant from non-relevant, so we set pt=utp_t = u_t there and its weight vanishes. What remains is one weight per query term present in dd:

score(d,q)  =rank  tq:xt=1logpt(1ut)ut(1pt).\text{score}(d, q) \;\overset{\text{rank}}{=}\; \sum_{t \,\in\, q \,:\, x_t = 1} \log\frac{p_t\,(1-u_t)}{u_t\,(1-p_t)}.

This is the dividend the independence assumption pays: a document’s relevance score collapses to a sum of independent per-term verdicts, read off only the query terms it actually contains.


The Robertson–Spärck-Jones weight and the emergence of IDF

Theorem 1 (RSJ relevance weight).

Ranking under the BIM is equivalent to summing, over query terms present in the document, the weight

wtRSJ=logpt(1ut)ut(1pt).w_t^{\text{RSJ}} = \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}.

The payoff is what happens with no relevance data. Then ptp_t is unknown; a symmetric choice is pt=12p_t = \tfrac12, and utu_t — the chance a non-relevant document contains tt — is estimated by the collection frequency dft/Ndf_t / N. Estimating pt,utp_t, u_t by maximum likelihood with a Beta(12,12)\text{Beta}(\tfrac12, \tfrac12) Jeffreys prior (the +0.5+0.5 continuity correction added to both counts) gives

wtRSJ    logNdft+0.5dft+0.5  =  IDFt.w_t^{\text{RSJ}} \;\longrightarrow\; \log \frac{N - df_t + 0.5}{df_t + 0.5} \;=\; \text{IDF}_t.

So inverse document frequency is not a heuristic — it is the RSJ weight of the Binary Independence Model under an uninformative prior. This is the cross-link into the statistics of maximum-likelihood estimation and prior selection.

Let us earn that limit rather than assert it. Estimating ptp_t and utu_t means counting how often term tt appears among relevant and non-relevant documents — but with no relevance judgments, both samples are empty, and a raw maximum-likelihood estimate of 0/00/0 is undefined. The remedy is a prior. The Jeffreys prior for a Bernoulli probability is Beta(12,12)\text{Beta}(\tfrac12, \tfrac12); using its posterior mean turns aa successes in nn trials into θ^=(a+12)/(n+1)\hat\theta = (a + \tfrac12)/(n + 1) — the +12+\tfrac12 continuity correction that keeps every estimate strictly inside (0,1)(0,1).

Apply it twice. With no relevant documents observed (a=0a = 0, n=0n = 0), the relevance probability is pinned at the prior mean,

p^t=0+120+1=12.\hat p_t = \frac{0 + \tfrac12}{0 + 1} = \tfrac12 .

For the non-relevance probability, we lean on the fact that in a large collection almost every document is non-relevant to any one query, so the whole collection stands in for the non-relevant sample: term tt appears in dftdf_t of the NN documents, giving

u^t=dft+12N+1.\hat u_t = \frac{df_t + \tfrac12}{N + 1} .

Now substitute into the RSJ weight. Because p^t=12\hat p_t = \tfrac12 makes p^t/(1p^t)=1\hat p_t / (1 - \hat p_t) = 1, the relevance half cancels and the weight is governed by non-relevance alone:

wtRSJ=logp^t(1u^t)u^t(1p^t)=log1u^tu^t=logNdft+0.5dft+0.5=IDFt.w_t^{\text{RSJ}} = \log \frac{\hat p_t\,(1 - \hat u_t)}{\hat u_t\,(1 - \hat p_t)} = \log \frac{1 - \hat u_t}{\hat u_t} = \log \frac{N - df_t + 0.5}{df_t + 0.5} = \text{IDF}_t .

So smoothed inverse document frequency is not a weighting trick bolted onto term counts after the fact — it is the exact relevance weight the Binary Independence Model assigns a term when we are honest about knowing nothing. A common term (dftdf_t near NN) drives the ratio toward zero and the log toward -\infty; a rare one (dftdf_t small) sends it large and positive. One caveat for anyone matching this against a running system: Lucene and the companion implementation here add a +1+1 inside the logarithm, scoring log ⁣(Ndft+0.5dft+0.5+1)\log\!\left(\frac{N - df_t + 0.5}{df_t + 0.5} + 1\right), which floors the weight at zero so that a term appearing in more than half the collection can never contribute a negative score. That floor is an engineering choice for non-negativity, not a step in the derivation.


Eliteness and the 2-Poisson model: why term frequency saturates

The BIM uses only presence/absence and throws away term frequency. But intuitively, a document mentioning “interest” five times is more about interest than one mentioning it once — though not five times as much. The 2-Poisson eliteness model formalizes this: a term is either elite (the document is genuinely about its concept) or not, and within-document counts follow a mixture of two Poisson distributions with different rates.

Proposition 1 (Saturating shape of the eliteness log-odds).

Under the 2-Poisson model, the log-odds that a term is elite given its observed frequency tf\text{tf} is increasing and concave in tf\text{tf}, approaching a finite asymptote. BM25 approximates this curve with the rational saturation function

tftf+k1,\frac{\text{tf}}{\text{tf} + k_1},

which is 00 at tf=0\text{tf}=0, increasing, concave, and 1\to 1 as tf\text{tf}\to\infty.

This connects to the statistics of discrete distributions — the Poisson mixture is the generative story.

Why a mixture, and not a single Poisson? Picture the histogram of how many times one content word — say interest — appears across every document in a collection. It has two features at once: a tall spike at zero and very small counts, because the vast majority of documents are not about interest at all, and a long right tail of documents that use the word many times because they genuinely are. A single Poisson cannot draw both shapes. Its mean and variance are forced to be equal, E[X]=Var(X)=λ\mathbb{E}[X] = \mathrm{Var}(X) = \lambda, and its tail decays factorially fast, like λk/k!\lambda^k / k!; tuned low enough to reproduce the pile of zeros, it makes high counts astronomically unlikely, and tuned high enough to admit the tail, it loses the zeros. Real term counts are overdispersed — their variance exceeds their mean — which a one-parameter Poisson structurally cannot be. The 2-Poisson model resolves this by positing a hidden binary cause: a document is either elite for the term, about its concept and drawing counts from a high-rate Poisson, or not, a low rate that is mostly zeros. The resulting mixture is overdispersed and bimodal by construction, and the quantity BM25 actually needs — the log-odds that a document is elite given the count it shows — is precisely the increasing, concave, saturating curve of Proposition 1. Eliteness is the latent variable the saturation transform is quietly estimating.


The BM25 scoring function

Assembling the IDF weight with the saturating term-frequency transform and a document-length correction gives the standard form:

Definition 2 (BM25).

score(q,d)=tqIDFttft,d(k1+1)tft,d+k1(1b+bdldavgdl),\text{score}(q, d) = \sum_{t \in q} \text{IDF}_t \cdot \frac{\text{tf}_{t,d}\,(k_1 + 1)}{\text{tf}_{t,d} + k_1\left(1 - b + b\,\dfrac{\text{dl}_d}{\text{avgdl}}\right)},

where dld\text{dl}_d is the document length, avgdl\text{avgdl} the mean document length, k1>0k_1 > 0 governs saturation, and b[0,1]b \in [0,1] governs length normalization.

Each factor has a job: IDFt\text{IDF}_t down-weights common terms (Section 3), the (k1+1)(k_1+1) numerator scales the saturating transform so a single occurrence contributes one full unit (at average document length), and the (1b+bdl/avgdl)\left(1 - b + b\,\text{dl}/\text{avgdl}\right) denominator term stretches the saturation point for long documents.

Take the three in turn, each tied to exactly one of the ideas we have built:

  • IDFt\text{IDF}_t is the relevance weight of Section 3 — a rare query term is strong evidence of aboutness and a near-ubiquitous one is almost none — so it scales every term’s contribution by how surprising its presence is.
  • The numerator tft,d(k1+1)\text{tf}_{t,d}\,(k_1 + 1) carries saturation: the (k1+1)(k_1 + 1) factor rescales the curve tf/(tf+k1)\text{tf}/(\text{tf} + k_1) so it rises to the ceiling k1+1k_1 + 1 and, at average document length, a single occurrence contributes exactly one unit (f(1)f(0)=1f(1) - f(0) = 1) — the first occurrence of a term counts fully while the tenth barely moves the score.
  • The denominator term k1 ⁣(1b+bdld/avgdl)k_1\!\left(1 - b + b\,\text{dl}_d/\text{avgdl}\right) carries length normalization: the bracket stretches the saturation point in proportion to how far the document’s length dld\text{dl}_d runs past the average, so a long document must repeat a term more to earn the same credit, with bb dialing the effect from off (b=0b = 0) to full (b=1b = 1).

For documents with internal structure, BM25F refines the length term. A 10-K is not a flat bag of words but a sequence of fields — item titles, the MD&A, the risk-factors section, financial-statement footnotes — that differ in length and in how much a match there should count. BM25F length-normalizes and field-weights each field’s term frequencies first, sums them into a single combined count, and only then applies the saturation and IDF (Robertson, Zaragoza, and Taylor, 2004). Saturating after the sum rather than within each field is the decisive design choice: it lets a hit in a short, high-value field like the MD&A outweigh a long boilerplate section, instead of letting that boilerplate saturate on its own and dilute the signal.


Limit theorems

These are the claims the accompanying notebook verifies numerically, and they are what make BM25 interpretable.

Theorem 2 (Limit behavior of BM25).

Fix a document and term with frequency tf>0\text{tf} > 0.

  1. Saturation ceiling. As tf\text{tf} \to \infty, the term factor k1+1\to k_1 + 1 (bounded).
  2. Binary limit. As k10k_1 \to 0, the term factor 1\to 1 for any tf>0\text{tf} > 0 — BM25 collapses to the binary independence model (presence/absence only).
  3. Raw-tf limit. As k1k_1 \to \infty, the term factor tf/(1b+bdl/avgdl)\to \text{tf} / \left(1 - b + b\,\text{dl}/\text{avgdl}\right) — length-normalized raw term frequency. This is not cosine-normalized TF-IDF.
  4. Length normalization. At b=0b = 0 the document-length term vanishes (no normalization); at b=1b = 1 it is fully applied.

In the laboratory above, set k1k_1 near 00 and watch panel A flatten to a step; push k1k_1 high and watch it straighten toward the dashed raw-tf line; drag bb to 00 and watch the padded transcript hijack the top of panel C.

Proof.

Write the term factor as f(tf)=tf(k1+1)tf+k1Bf(\text{tf}) = \dfrac{\text{tf}\,(k_1 + 1)}{\text{tf} + k_1 B}, where B=1b+bdl/avgdl>0B = 1 - b + b\,\text{dl}/\text{avgdl} > 0 collects the length normalization into a single positive constant.

(1) Saturation ceiling. Divide numerator and denominator by tf\text{tf}:

f(tf)=k1+11+k1B/tf    k1+11=k1+1(tf).f(\text{tf}) = \frac{k_1 + 1}{1 + k_1 B / \text{tf}} \;\longrightarrow\; \frac{k_1 + 1}{1} = k_1 + 1 \qquad (\text{tf} \to \infty).

The factor is bounded above by k1+1k_1 + 1 no matter how often the term repeats — the analytic content of “diminishing returns.”

(2) Binary limit. As k10k_1 \to 0 the numerator tf\to \text{tf} and the denominator tf\to \text{tf}, so f1f \to 1 for every tf>0\text{tf} > 0 (and f=0f = 0 when tf=0\text{tf} = 0). Every present term contributes its IDFt\text{IDF}_t exactly once and repetition is ignored: BM25 collapses to the presence/absence Binary Independence Model.

(3) Raw-tf limit. Divide numerator and denominator by k1k_1 and let k1k_1 \to \infty:

f(tf)=tf(1+1/k1)tf/k1+B    tfB=tf1b+bdl/avgdl(k1).f(\text{tf}) = \frac{\text{tf}\,(1 + 1/k_1)}{\text{tf}/k_1 + B} \;\longrightarrow\; \frac{\text{tf}}{B} = \frac{\text{tf}}{1 - b + b\,\text{dl}/\text{avgdl}} \qquad (k_1 \to \infty).

This is length-normalized raw term frequency, and it is worth pausing on what it is not: it is not cosine-normalized TF-IDF. Cosine normalization divides a document’s term-weight vector by its Euclidean norm twt,d2\sqrt{\sum_t w_{t,d}^2}, a quantity that depends on every term in the document. Here each term frequency is divided by the single scalar BB, which depends only on the document’s length, and IDFt\text{IDF}_t enters separately as a multiplicative per-term weight. BM25 is not “TF-IDF with a saturation knob” — even in the limit where saturation is switched off, its normalization is structurally different.

(4) Length normalization. At b=0b = 0 we have B=1B = 1, independent of dl\text{dl}, so document length drops out of the score entirely. At b=1b = 1 we have B=dl/avgdlB = \text{dl}/\text{avgdl}, the full ratio, so a document twice the average length needs twice the term count to reach the same factor. Intermediate bb interpolates linearly between the two.

Each limit is checked numerically in the companion harness’s test_limits(): it asserts f1f \to 1 as k10k_1 \to 0, ftf/Bf \to \text{tf}/B as k1k_1 \to \infty, the ceiling fk1+1f \to k_1 + 1 (e.g. 2.52.5 at k1=1.5k_1 = 1.5), and the bb-toggle of length dependence — the theorem and the code are two views of the same four facts.


Finance case study

In the production retriever this case study is drawn from, BM25 does not score documents as flat text; it runs as BM25F over the fields that filings and transcripts naturally expose. A 10-K or 10-Q is split into the item heading, the Management’s Discussion and Analysis, the risk-factors section, and the financial-statement line items and footnotes; an earnings-call transcript is split into prepared remarks and Q&A, segmented by speaker turn. The heading and MD&A fields carry the highest field weights and the boilerplate legal language the lowest, so a one-line MD&A disclosure about rate sensitivity is not buried under a long risk-factors recital of the same words — the length-hijack fix from above, now applied at the granularity of fields rather than whole documents.

The lexical leg earns its keep precisely where the dense leg is weakest: exact tokens. Tickers and CUSIP/ISIN identifiers, GAAP line-item names, defined terms from a filing’s glossary, dollar and basis-point figures, and reporting dates are the strings a paraphrase-trained embedding tends to smear into near-synonyms — “AAPL,” “Apple Inc.,” and “the Company” collapse to nearly the same vector, and a query for one will cheerfully return the others. BM25 keeps them distinct because it matches the literal term. That complementarity is the reason the two legs are fused rather than chosen between: reciprocal rank fusion combines their rankings so the dense leg supplies semantic recall while the lexical leg supplies exact-match precision on the identifiers a financial query most often hinges on.


Honest caveats


Implementation

The companion notebook (notebookPath) builds an inverted index, implements the RSJ-derived IDF alongside textbook log(N/df)\log(N/df) so the Jeffreys smoothing is visible, scores from scratch, and — most importantly — includes a verification harness that asserts the four limit theorems above and cross-checks the ranking against rank_bm25. A grid sweep over (k1,b)(k_1, b) reports the NDCG@10 surface, reproducing “tuning matters.”

Running the harness prints the topic’s claims back as passing assertions. The four limit theorems check first — f1f \to 1 as k10k_1 \to 0, ftf/Bf \to \text{tf}/B as k1k_1 \to \infty, the ceiling fk1+1f \to k_1 + 1 (the printed 2.52.5 at k1=1.5k_1 = 1.5), and the bb-toggle — followed by monotonicity of the saturation curve, so nothing argued above is asserted that the code does not also confirm. Then comes the length-hijack flip on the toy corpus (N=6N = 6 documents, avgdl=52.7\text{avgdl} = 52.7 tokens) for the query “interest rate exposure”: at b=0b = 0 the padded transcript ranks first and NDCG@10 is 0.7160.716; at b=0.75b = 0.75 the concise on-point filing surfaces to first and NDCG@10 rises to 0.9630.963. That ranking is then cross-checked against rank_bm25’s BM25Okapi under a matched IDF variant, so any divergence would flag a genuine scoring bug rather than a known formula difference.

The closing grid sweep evaluates NDCG@10 across k1{0.5,1,1.5,2}k_1 \in \{0.5, 1, 1.5, 2\} and b{0,0.25,0.5,0.75,1}b \in \{0, 0.25, 0.5, 0.75, 1\} and reports its best cell at (k1,b)=(1.5,1.0)(k_1, b) = (1.5, 1.0) with NDCG@10 =0.977= 0.977 — only a hair above the 0.9630.963 at the canonical b=0.75b = 0.75. That near-tie is the honest lesson, not a contradiction: on a six-document toy corpus the optimum drifts to full length normalization, whereas the b0.75b \approx 0.75 that ships in Lucene and Elasticsearch is the value that holds up across collections of TREC scale. The parameters are tuned, the rigor flag is right to say so, and the surface is flat enough near the optimum that switching length normalization on matters far more than pinning down its exact setting.

Connections

  • BM25 shares the IDF factor and a term-frequency transform with TF-IDF, but contains no cosine normalization — it is not the k₁-limit of cosine-normalized TF-IDF vector-space-model-tfidf
  • the BIM derivation begins from the PRP: rank by decreasing probability of relevance, which the exchange argument shows is optimal probability-ranking-principle
  • an alternative probabilistic retrieval model that ranks by P(query | document language model); contrasted with BM25 via the KL-divergence view query-likelihood-language-models
  • BM25 scores are accumulated over the inverted index, and WAND/BlockMax-WAND prune documents that cannot enter the top-k inverted-index-dynamic-pruning
  • BM25 is the lexical leg of hybrid retrieval; reciprocal rank fusion combines its ranking with dense retrieval rank-fusion-rrf
  • SPLADE generalizes BM25's bag-of-weighted-terms to learned sparse expansions living in the same inverted-index world late-interaction-learned-sparse

References & Further Reading