METHODOLOGY · DATASET v2026.05 · UPDATED May 2026

How the data is made.

Latent aggregates cold-case records from official law-enforcement sources, normalizes them into a research-grade schema, and runs local LLM extraction over the narrative text to surface structured modus-operandi fields. Every algorithm is documented here, with explicit transparency about confidence, provenance, and limits.

Nothing here is a suspect identification. Every correlation and extracted field is a lead for human review, never a conclusion.

Cases indexed

6,760

1911–2025, across 15 active sources.

Geocoded coverage

97.3%

6,580 cases plot to coordinates; the remainder are pending geographic resolution.

Extraction precision

100 / 96.0%

Weapon and location-type classification, re-measured on qwen3:14b against the hand-labeled validation set.

Inference locality

100% local

Qwen3:14b runs on local infrastructure. No case data leaves Latent hardware.

Contents

§ 01Preamble
§ 02Data sources
§ 03Modus operandi
§ 04Correlation
§ 05Geocoding precision
§ 06Bias & selection effects
§ 07Limits & open questions
§ 08Citation & reuse
§ 09Revision history

§ 01Preamble

standing

Who this is for, and what we promise.

Latent is a public-facing analytics platform for serious non-professional investigators: citizen detectives, journalists, students, data scientists, and genetic genealogists working cases that law-enforcement tools wouldn't give them access to. The promise is simple: every transformation we apply to a case is documented, every extracted field carries a verifiable link back to its source phrase, and every limitation we know about is surfaced rather than buried.

The framing matters. Latent surfaces structure from a corpus too large to read by hand, but structure is not a verdict. A shared weapon and a 30-mile radius make two cases worth comparing, not the same case. The methodology below exists so a careful reader can audit the substrate of any claim the interface makes, and decide for themselves what the signal is worth.

§ 02Data sources

15 active

Coverage and per-source field completeness.

Latent ingests cold-case data from official law-enforcement sources only. Coverage today spans the states colored below; expand any state for the underlying sources, refresh cadence, geocoding strategy, and the caveats that travel with each one.

Coverage at a glance

10 states · 15 sources · 6,760 cases

CoveredPartial (county-level)Not yet

State

Method

Cases

Years

Geocoded

OhioOH

browser-assisted index · web scrape · polite

2,796

1943–2024

100%

BCIAttorney General's BCI · Statewide

browser-assisted index · web scrape · polite

Ohio Attorney General — Statewide Unsolved Homicides

Cases

2,796 · 100% geocoded

Year range

1943–2024

Geocoding

Census street → ZIP centroid → county centroid.

Cases submitted by Ohio's county sheriffs and municipal police departments. The AG hosts and publishes; the named local agency owns each investigation. ASP.NET WebForms index is harvested via a browser userscript; detail pages are fetched programmatically.

www.ohioattorneygeneral.gov/ColdCase →

ColoradoCO

web scrape · structured fields · polite

1,400

1911–2025

100%

CBIBureau of Investigation · Statewide

web scrape · structured fields · polite

Colorado Bureau of Investigation — Cold Case Database

Cases

1,400 · 100% geocoded

Year range

1911–2025

Geocoding

Census street → ZIP centroid → county centroid.

State compendium from 86 county and municipal agencies. Detail pages publish structured first-party demographics plus federal cross-references to NamUs + NCMEC. Re-runs reflect current state (cases solved between runs drop out automatically).

apps.colorado.gov/apps/coldcase/search.html →

District of ColumbiaDC

web scrape · PDF parsing · polite

703

1979–2020

99%

MPDCMetropolitan police · Washington, D.C.

web scrape · PDF parsing · polite

Washington D.C. Metropolitan Police — Unsolved Homicides

Cases

703 · 99% geocoded

Year range

1979–2020

Geocoding

Census street → ZIP centroid → DC-district centroid (FIPS 11001).

Gender from honorifics only

MPDC bulletins consistently use Mr./Mrs./Ms./Miss; gender is extracted only from these explicit narrative signals, never from photos or first names.

DC is a federal district, not a state

Latent treats DC alongside states for filtering and display; the badge reads 'DC' and FIPS 11001 is the federal-district code.

mpdc.dc.gov/page/unsolved-homicides →

FLFL

—

681

1955–2020

76%

FDLE_FL— · —

—

Florida Department of Law Enforcement - Unsolved Cases

Cases

681 · 76% geocoded

Year range

1955–2020

Geocoding

—

VirginiaVA

CSV · manual refresh

525

1961–2023

100%

VSPCCDState police · Statewide

CSV · manual refresh

Virginia VSPCCD Cold Case Database

Cases

525 · 100% geocoded

Year range

1961–2023

Geocoding

Source-provided coordinates (geocoding ladder skipped).

vspccd.dcjs.virginia.gov →

UtahUT

web scrape · structured fields · polite

207

1949–2021

99%

BCIBureau of Criminal Identification · Statewide

web scrape · structured fields · polite

Utah Bureau of Criminal Identification — Cold Case Database

Cases

207 · 99% geocoded

Year range

1949–2021

Geocoding

Census street → ZIP centroid → county centroid.

Statutory compendium — Utah SB 160 (2018) mandates agency submission, making this a stronger model than voluntary programs.

bci.utah.gov/coldcases →

MarylandMD

(6 of 24 jurisdictions)

6 sources

205

1970–2021

98%

Anne ArundelCounty police · Anne Arundel County

web scrape · polite

Anne Arundel County, MD Police Department — Cold Cases

Cases

75 · 99% geocoded

Year range

1970–2020

Geocoding

Census street → ZIP centroid → county centroid (FIPS 24003).

Co-killed victims modeled as separate cases with a victim-name suffix to keep IDs unique.

www.aacounty.org/police-department/crime-information/cold-cases →

Baltimore CountyCounty police · Baltimore County

web scrape · polite

Baltimore County, MD Police Department — Unsolved Homicides

Cases

75 · 96% geocoded

Year range

1970–2015

Geocoding

Census street → ZIP centroid → county centroid.

Race not in narratives

Baltimore County narratives never state race. Cases show race as 'not provided' rather than imputing from photographs or names.

www.baltimorecountymd.gov/departments/police/unsolved/homicides →

HowardCounty police · Howard County

web scrape · polite

Howard County, MD Police Department — Cold Cases

Cases

23 · 100% geocoded

Year range

1971–2017

Geocoding

Census street → ZIP centroid → county centroid (FIPS 24027).

Includes IDENTIFIED-prefix cases — unsolved homicides where the previously-unknown victim has been DNA-named.

www.howardcountymd.gov/police/cold-cases →

HarfordSheriff's office · Harford County

web scrape · polite

Harford County, MD Sheriff's Office — Cold Cases

Cases

21 · 100% geocoded

Year range

1981–2021

Geocoding

Census street → ZIP centroid → county centroid (FIPS 24025).

harfordsheriff.org/wanted/cold-cases →

WicomicoSheriff's office · Wicomico County

web scrape · polite

Wicomico County, MD Sheriff's Office — Cold Cases (Homicides)

Cases

6 · 100% geocoded

Year range

1990–2019

Geocoding

Census street → ZIP centroid → county centroid (FIPS 24045).

Paginated card-tile index. Cases titled 'Missing Person -' are filtered; homicides only.

www.wicomicosheriff.com/coldCases →

CharlesSheriff's office · Charles County

web scrape · polite

Charles County, MD Sheriff's Office — Cold Cases (Homicides)

Cases

5 · 100% geocoded

Year range

1980–2008

Geocoding

Census street → ZIP centroid → county centroid (FIPS 24017).

Index page groups cases by category; we ingest only the Homicides section. ARREST MADE entries are filtered out.

www.ccso.us/crime/cold-cases →

Maryland — sub-state coverage

6 of 24 jurisdictions

Latent ingests only first-party county or municipal sources. Where a Maryland jurisdiction has no public cold-case roster, the cases aren't here — by source-integrity choice, not omission.

Covered (6)

Baltimore County · Anne Arundel · Howard · Harford · Charles · Wicomico

No public roster (17)

Allegany · Baltimore City · Calvert · Caroline · Carroll · Cecil · Dorchester · Frederick · Garrett · Kent · Montgomery · Prince George's · Queen Anne's · St. Mary's · Talbot · Washington · Worcester

State-handled (1)

Somerset (Maryland State Police)

Queen Anne's County hosts a 2-case page (1 homicide, 1 missing person); below the threshold for a dedicated adapter. Revisit when the page grows.

New HampshireNH

JSON API + detail scrape · hybrid

115

1960–2015

100%

NH DOJAttorney General's Cold Case Unit · Statewide

JSON API + detail scrape · hybrid

New Hampshire DOJ — Cold Case Unit Victim List

Cases

115 · 100% geocoded

Year range

1960–2015

Geocoding

Census street → ZIP centroid → county centroid.

Race not published

NH DOJ does not publish race and Latent does not infer it. The correlation engine treats race as unknown for these cases (it doesn't contribute to scoring).

Two case_type values

NH publishes both unsolved homicides and suspicious deaths. We preserve the distinction (`case_type=suspicious_death` for deaths where a homicide determination hasn't been made).

www.doj.nh.gov/bureaus/new-hampshire-cold-case-unit/victim-list →

MNMN

browser-assisted harvest · PDF parse · polite

1964–2021

100%

MN BCABureau of Criminal Apprehension · Statewide

browser-assisted harvest · PDF parse · polite

Minnesota BCA — Unsolved Cases Database

Cases

96 · 100% geocoded

Year range

1964–2021

Geocoding

Census street → ZIP centroid → county centroid.

WAF-gated case list

The BCA case list sits behind an Imperva web application firewall. Latent harvests the list out-of-band via a one-time browser console request against the SharePoint API; per-case Summary-of-Case PDFs are static assets fetched programmatically.

Narrative-derived modus operandi

Weapon and location type are not published as structured fields. They are extracted from the SOC PDF narrative by the Tier-1/Tier-2 LLM pipeline and surfaced as a reviewable overlay, not source-of-truth.

portal.dps.mn.gov/bca/unsolved-cases/Pages/database.aspx →

WyomingWY

web scrape · structured fields · polite

1974–2018

100%

DCIDivision of Criminal Investigation · Statewide

web scrape · structured fields · polite

Wyoming Division of Criminal Investigation — Cold Case Homicides

Cases

32 · 100% geocoded

Year range

1974–2018

Geocoding

Census street → ZIP centroid → county centroid.

Submit-from-counties compendium. The roster mixes homicide and standalone sexual-assault cases; Latent ingests homicides only. Weapon, scene type, and demographics are structured at source, so no LLM extraction is required.

wyomingdci.wyo.gov/operations/cold-cases →

Per-source field completeness

what each source provides

Source	Name	Date	Addr.	City	State	Gender	Race	Age	Weapon	Loc.	Narr.	Photo
OHBCI	●	●	●	●	●	●	●	●	◐	◐	●	●
COCBI	●	●	●	●	●	●	●	●	◐	◐	●	●
DCMPDC	●	●	●	●	●	◐	×	×	◐	◐	●	●
FLFDLE_FL	—	—	—	—	—	—	—	—	—	—	—	—
VAVSPCCD	●	●	●	●	●	●	●	●	◐	◐	●	—
UTBCI	●	●	●	●	●	●	●	●	◐	◐	●	●
NHNH DOJ	●	◐	◐	●	●	◐	×	●	◐	◐	●	—
MNMN BCA	●	●	◐	●	●	◐	◐	●	◐	◐	●	●
MDAnne Arundel	●	◐	◐	◑	●	◐	×	◐	◐	◐	●	●
MDBaltimore County	●	◐	◐	◑	●	◐	×	◐	◐	◐	●	●
WYDCI	●	●	●	●	●	●	●	●	●	●	●	●
MDHoward	●	◐	◐	◑	●	◐	×	◐	◐	◐	●	●
MDHarford	●	◐	◐	◑	●	◐	×	◐	◐	◐	●	●
MDWicomico	●	◐	◐	◑	●	◐	×	◐	◐	◐	●	●
MDCharles	●	◐	◐	◑	●	◐	×	◐	◐	◐	●	●

●Source-provided◐LLM-extracted◑County-level only×Not in source—Not collected

Cells where a field is structural (source-provided) carry the strongest provenance. Cells marked LLM-extracted point at a verbatim source phrase via the case detail page's Source ▸ reveal. Cells marked not in source are honest absences. Latent does not infer demographics from photographs or names.

Victim race is recorded exactly as the source agency reported it and shown that way on each case. For filtering only, the many source spellings of one category (Black, BLACK or AFRICAN AMERICAN, African American) are grouped into standard FBI/NIBRS categories so the filter does not fragment; the original value is never overwritten. Hispanic or Latino is an ethnicity rather than a race, but some agencies record it in the race field, so it is carried as a category here. This is a fixed grouping of source-provided values, not inference.

§ 03Modus operandi extraction

Tier 1 + Tier 2

What gets pulled from each narrative.

Local LLM extraction surfaces structured modus-operandi fields from case narratives. How much runs depends on the source: weapon and location type are extracted wherever a narrative exists; dates, addresses, and demographics are extracted only where the source publishes prose instead of structured fields, and sources that already publish structured records need no extraction at all. The §02 completeness matrix shows exactly which fields are extracted per source.

The model that produces these classifications is Qwen3:14b, an open-weights model run locally on Latent infrastructure. No case data is sent to third-party AI services. Privacy and local execution are the trust signals that matter for investigative work; running everything on local hardware means a case you research here doesn't leak to OpenAI, Anthropic, or any other vendor.

Where a source already publishes a field as structured data ( Colorado, Utah, and Ohio ship victim demographics directly; Wyoming ships weapon and scene type too), Latent trusts the source value and does not re-derive it with the model. Extraction fills gaps; it never overrides first-party data.

Each classification carries one of two confidence states, plus a separate unknownfor cases where the narrative simply doesn't say.

Locality

·Model: Qwen3:14b (open weights)
·Inference: local hardware only
·Egress: zero, no third-party API calls
·Determinism: temperature pinned to 0.0

Confidence states

Confident

Classification produced and anchored to a verbatim phrase in the narrative. The detail page's Source ▸ button reveals the exact span.

Uncertain

Classification stands but the model paraphrased (“at her place” vs “at her apartment”). The audit trail does not.

Unknown

The narrative doesn't specify the attribute. Excluded from scoring; surfaced as “not provided” on the detail page.

Source-span anchoring · live example

On October 10, 2010 at approximately 9:30 p.m., officers found the victim suffering from a gunshot wound on the ground. He was taken to the hospital where he later died of his injuries.

extractedweapon = firearmraw phrase“gunshot wound”spanchars 92–105confidenceconfident · anchored

Validation

Weapon · precision / recall

100 / 100%

Hand-labeled validation set, re-scored on every prompt or model change (current: qwen3:14b).

Location type · P/R

96.0 / 96.0%

Same validation set. Most failures are paraphrase, not misclassification.

Uncertain rate · location

17.9%

Defensible classification, paraphrased anchor. Known v1 limitation; will narrow with prompt iteration.

Audit

For data-quality audits, the table cold_case.case_extractions_current is the authoritative latest extraction per case. A LEFT JOIN against cold_case.cases reveals any cases missing an extraction (typically transient between ingestion and worker drain).

§ 04Correlation engine

5 dimensions · floor 1.5 / 3.5

How "possibly related cases" gets ranked.

On each cold-case detail page, a small panel surfaces other cases that share enough structural similarity to be worth an investigator's attention. Two hard filters (50 mi / 20 yr) precede a five-dimension weighted score. The ranking is intentionally conservative: high-signal candidates only, not every plausible match.

Geographic filter

≤ 50miles

Cross-metro pairs in serial-pattern data are usually unrelated. Geography stays bounded.

Temporal filter

≤ 20years

Time runs wide because cold-case patterns frequently span dormant periods.

Worked example

Dimension

Case A

Case B

Match

Score

Weapon

firearm

✓

+1.0

Location type

public-outdoor

✓

+1.0

Victim gender

male

✓

+0.5

Victim race

black

not recorded

—

0.0

Victim age band

30–39

✓

+0.5

Score floor 1.5 · maximum 3.5 · ties broken by closer distance3.0 / 3.5

Why the weights are what they are

The two modus-operandi dimensions (weapon and location type) carry a weight of 1.0 each. The three victim demographics (gender, race, age band) carry 0.5. MO describes what the offender did: it is the discriminating signal for linking cases. Demographics describe who was harmed, which clusters for reasons that have nothing to do with a shared offender. A neighborhood's population is not a pattern. Weighting MO at twice demographics keeps a coincidental demographic overlap from outscoring a genuine behavioral one.

The score floor of 1.5 is the lowest total worth a reviewer's attention: it takes either one MO match plus one demographic, or all three demographics together. A pair that agrees on demographics alone just barely clears, and is shown as the weakest tier. The maximum of 3.5 is all five dimensions in agreement. The correlation engine reads these five weights from a single module, correlation/config.py, so the scoring you see described here is the scoring the engineactually runs.

What doesn't contribute

A classification of unknown on either case excludes that dimension from scoring. Two cases both classified weapon = unknown aren't matching on weapon; they're both cases where the narrative didn't specify a weapon. Same logic applies to missing demographics: if either case has no recorded victim gender, gender doesn't contribute.

A low-confidence MO classification (paraphrased anchor) still counts as a match for scoring purposes. The panel displays the “uncertain” qualifier alongside the matched value so users see the weakened evidence without clicking through.

Scope. Correlation compares cold cases to other cold cases within the same dataset. Cross-corpus correlation against other record types is out of scope for this version: the data shapes differ, the join is more expensive, and the framing of “potential leads” needs separate methodological care before it's sound.

§ 05Geocoding precision

tiers 1–4

Four precision tiers, one map.

Every case on the map carries one of four precision tiers, signaling how exactly the published source data lets us place the incident. Pins are styled differently per tier; click a county-precision pin to see the county boundary the case falls within.

Tier	Source	What it means
street	Census or Google ROOFTOP	Exact address
landmark	Google GEOMETRIC_CENTER / APPROXIMATE	Road, neighborhood, or city centroid
zip	ZCTA centroid	Within this ZIP code area
county	County centroid	Somewhere in this county (click pin to see boundary)

The tier assigned to any case is determined by the best match returned from a three-step geocoding ladder. First, we try the Census Geocoding Service (free, maintained by EODC) against the published address; if that fails, we step down to Google's Geocoding API; if that also fails, we assign a ZIP centroid (ZCTA from the published postal code if available), then fall back to the county centroid where the case occurred.

County-precision pinsare clickable on the map detail. Clicking zooms to show the full county boundary, a transparent wash that clarifies the uncertainty radius. Many cases from smaller counties or jurisdictions with incomplete narratives end up in tier 4; that's an honest representation of what the source told us, not a flaw in the geocoding.

§ 06Bias & selection effects

read before analyzing

What the data is not.

Every dataset assembled from public records carries selection effects. Latent's are worth naming up front: an analyst who knows the shape of the bias can correct for it; one who doesn't will read artifacts as findings.

Latent is a corpus of cold cases that an official publisher chose to make public and that Latent has so far ingested. It is not a census of cold cases, and aggregate statistics drawn from it inherit every gap below. None of this makes the data unusable; it makes the data something you read with the following in hand.

Selection effect

Publication bias

Only cases an agency chose to post appear here. Agencies publish selectively: high-profile cases, cases where public tips help, cases that show the unit is active. A jurisdiction with no public roster contributes zero cases regardless of how many cold cases it carries.

Selection effect

Geographic coverage follows publishers, not need

Coverage tracks wherever an official source posts a usable roster, not population, not case density, not where help is most needed. Maryland appears county-by-county; most states aren't in yet. Cross-state comparison reflects ingestion order as much as anything real. See the §02 map for current shape.

Selection effect

The corpus skews to still-unsolved

Several sources are re-ingested live; cases solved between runs drop out. That keeps the data current but means Latent trends toward cases that remain open; it is not a historical record of every case that was ever cold.

Selection effect

Narrative framing carries through

Extracted MO fields come from narratives written by law enforcement, in the language and emphasis the agency chose. 'Location type' reflects what a narrative foregrounded; a detail the writer omitted is a detail Latent cannot recover.

Selection effect

Demographic completeness varies by source

Race is a structured first-party field in Colorado and Utah, absent from Maryland and DC narratives entirely. Any aggregate demographic figure is shaped by which states happen to be ingested, not a population-level measurement.

Selection effect

Sources span different record-keeping eras

Year ranges differ widely: Colorado reaches back to 1911, other sources are recent. Time-based analysis mixes decades of changing investigative practice, evidence standards, and what agencies recorded at all.

§ 07Limits & open questions

v1 known issues

What we know we don't know.

Documented uncertainty is more useful than hidden uncertainty. The following are known v1 limitations, each with a route to resolution.

Known limitation

Paraphrased location extractions

On the hand-labeled validation set, roughly one location classification in six resolves to a defensible paraphrase rather than a verbatim source span: the classification stands, the audit trail weakens. This is a validation-set rate, not a measured corpus-wide figure; it narrows with prompt iteration.

Known limitation

Manual refresh cadence on most sources

Most sources (Virginia VSPCCD among them) are re-ingested by hand when the publisher posts an update. There is no scheduled refresh today, so drift between a live site and our snapshot is bounded but real. The hero eyebrow shows the dataset's last ingestion date.

Known limitation

Cross-source correlation is deferred

Only cold-case ↔ cold-case in this version. Joining other record types needs methodological care before pairs are trustworthy.

Known limitation

Multi-victim semantics use primary victim

Demographic match against multi-victim cases pins to one victim per case. Rule is locked even at low incidence so behavior stays consistent as more multi-victim cases land.

Known limitation

Race is never inferred from photographs

An ethical guardrail, not a technical limitation. Cases without a source-stated race show as 'not provided' rather than imputing a value from imagery or names; the correlation engine treats race as unknown for those cases.

Known limitation

Single primary photo per case

Schema supports multiple photos per case but only the primary is rendered today. SHA256 captured for future dedup work.

§ 08Citation & reuse

Take what you need; cite the substrate.

Underlying source data is published as public records by the originating agencies. Latent aggregates, structures, and annotates. Researchers and journalists are welcome to use whatever they find; a citation back to this dataset is appreciated and lets readers trace provenance themselves.

Suggested citation

Latent Cold-Case Dataset. Dataset v2026.05, revised 2026-05-29.
Retrieved from https://latenttrace.org/methodology.

If you're considering re-using extracted fields in a downstream system: please retain the source-span provenance so your readers can audit the inference chain back to the original narrative. The whole point of anchoring extractions to verbatim phrases is that the trust signal travels with the data. Tip-line and corrections: courtney.e.hammond@gmail.com.

§ 09Revision history

4 entries

What changed, and when.

This page is versioned. Substantive changes to how the data is built or documented are logged here, so a claim you cite today can be checked against the methodology as it stood when you cited it.

2026-05-16
Methodology page rebuilt: coverage map, per-source field-completeness matrix, dedicated bias & selection-effects section, and grounded correlation weighting.
2026-05-14
New Hampshire DOJ ingested; suspicious_death added as a distinct case type for deaths not yet ruled homicides.
2026-05-12
Google geocoding tier added between Census and ZIP-centroid fallback; four-tier precision visualization introduced.
2026-04-21
Initial methodology published alongside the first security-audited dataset.