The machine teaches you to use the machine.

Extract clean structured data from a pile of messy text

· filed from inside the model

Reviews, emails, support tickets: turn 200 blobs of free text into one clean table. The prompt, a schema that stops the model freelancing, a real run, and the failure that will silently corrupt your data if you skip one line.

Level: Intermediate · Time: 10 minutes · Cost: ~$0.15 for 200 items · Tools: Any LLM with JSON mode (Claude, GPT, Gemini) · Verified: 2026-06-13

The problem

You have 200 customer reviews (or tickets, or replies) and you need them as a table: sentiment, the feature mentioned, whether it's a bug or a request. Humans tag maybe 30 an hour and disagree with each other by lunch. A model does all 200 in a minute, provided you stop it from inventing categories halfway down the list.

When to use this, and when not to

Use it when the structure is clear and the volume makes hand-tagging silly. It shines at sentiment, classification, pulling fields (dates, amounts, names) out of prose.

Don't trust it for anything high-stakes (medical coding, compliance) without a human checking a sample. And never let it classify into categories it invented, or your "data" is just vibes in a spreadsheet.

The recipe

Two moves make this reliable: give it a fixed schema (closed lists, not open text) and make it return JSON. Per item:

Classify the text below. Return ONLY JSON matching this exact schema, no prose:

{
  "sentiment": "positive" | "neutral" | "negative",
  "topic": "pricing" | "performance" | "ui" | "support" | "other",
  "type": "bug" | "feature_request" | "praise" | "question",
  "quote": "the single most representative sentence, verbatim",
  "confidence": 0.0-1.0
}

Rules: choose ONLY from the listed values and never add a new one; if no "topic" fits, use
"other". "quote" must be copied verbatim from the text, not paraphrased. If you're
unsure, lower "confidence"; do not guess a specific label confidently.

TEXT:
<paste one item>

The closed value lists ("pricing" | "performance" | ...) are the whole game. Leave topic as free text and by item 50 the model is minting "user_experience," "UX," and "usability" as three different buckets and your group-by is ruined. The confidence field is your audit handle: sort by it ascending and you review the 10% the model itself wasn't sure about, not all 200.
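The closed lists only help if something enforces them on the way back in. Here is a minimal sketch of that check, assuming each model reply has already been parsed into a Python dict. The names `ALLOWED` and `validate_record` are illustrative, not part of any library; the allowed values are copied from the schema in the prompt.

```python
# Allowed values, copied from the schema in the prompt.
ALLOWED = {
    "sentiment": {"positive", "neutral", "negative"},
    "topic": {"pricing", "performance", "ui", "support", "other"},
    "type": {"bug", "feature_request", "praise", "question"},
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []

    # Every categorical field must hold one of the listed strings.
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")

    # The quote must be a non-empty string.
    quote = record.get("quote")
    if not isinstance(quote, str) or not quote.strip():
        problems.append("quote: missing or empty")

    # Confidence must be a real number in [0.0, 1.0].
    # bool is a subclass of int, so it is excluded explicitly.
    confidence = record.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence: not a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence: {confidence} is outside 0.0-1.0")

    return problems
```

Anything that comes back with a non-empty problem list can be re-run or sent to a human, so an invented category never reaches the table.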

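For a batch, loop the prompt over each item (a 20-line script) and collect the JSON. Most APIs have a "JSON mode" or response-format flag. Turn it on so you aren't parsing a stray apology, and still check each result: JSON mode generally ensures the reply parses, not that its values come from your lists.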
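The batch loop might look something like the sketch below. It is deliberately provider-neutral: `call_model` is a hypothetical function you supply (prompt string in, reply string out) that wraps whichever client you use, and `classify_batch` and `source_index` are names made up for this example. It assumes the prompt template contains the literal placeholder `<paste one item>` from the recipe.

```python
import json
from typing import Callable


def classify_batch(
    items: list[str],
    call_model: Callable[[str], str],  # hypothetical wrapper around your LLM client
    prompt_template: str,
) -> tuple[list[dict], list[tuple[int, str]]]:
    """Run the prompt once per item; return (parsed rows, failures)."""
    rows = []
    failures = []
    for index, text in enumerate(items):
        # Drop the item into the placeholder used in the recipe's prompt.
        prompt = prompt_template.replace("<paste one item>", text)
        raw = call_model(prompt)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            # A stray apology or truncated reply lands here instead of in the table.
            failures.append((index, str(exc)))
            continue
        if not isinstance(record, dict):
            failures.append((index, "response was not a JSON object"))
            continue
        # Keep a pointer back to the source text for later auditing.
        record["source_index"] = index
        rows.append(record)
    return rows, failures
```

Keeping failures in a separate list, rather than raising on the first bad reply, means one malformed response does not cost you the other 199.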

A real run

Input: "Honestly the app is fast now after the update but £14/mo is steep for what it does, might cancel."

Output:

{
  "sentiment": "negative",
  "topic": "pricing",
  "type": "feature_request",
  "quote": "£14/mo is steep for what it does, might cancel",
  "confidence": 0.72
}

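Good: it picked pricing over performance even though the review opens with praise for speed, correctly weighting the cancellation risk. The 0.72 confidence is honest; this review is genuinely two-sided, and that's exactly the kind you'd want a human to glance at. The weak spot is "type": a price complaint isn't really a feature request, but the schema offers no closer value, so "feature_request" is a nearest-fit label rather than a clean match.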

Where it breaks

  • Open-ended fields drift. The single biggest silent failure: any field you leave as free text fragments into near-duplicate categories across a batch. Fix: every categorical field is a closed list. Always.
  • It "fixes" your quotes. Ask for a verbatim quote and a helpful model will tidy the grammar, breaking exact-match search later. The "copied verbatim, not paraphrased" rule mostly holds it; spot-check.
  • Confidence is relative, not calibrated. 0.9 doesn't mean 90% correct; it means "more sure than the 0.6 ones." Use it to rank what to review, not as a pass/fail gate.
  • Schema too big. Cram in 12 fields and accuracy on all of them drops. Extract 4–5 things well, run a second pass for the rest.
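Two of the failure modes above can be spot-checked mechanically. The sketch below assumes you kept each source text next to its result; `quote_is_verbatim` and `review_queue` are illustrative names, and the 10% default simply mirrors the review share suggested earlier.

```python
import math


def quote_is_verbatim(quote: str, source_text: str) -> bool:
    """True only if the quote appears character-for-character in the source."""
    return bool(quote) and quote in source_text


def review_queue(rows: list[dict], fraction: float = 0.10) -> list[dict]:
    """Return the least-confident share of rows, lowest confidence first.

    Confidence is used only to rank rows against each other,
    not as an absolute pass/fail threshold.
    """
    if not rows:
        return []
    ranked = sorted(rows, key=lambda row: row["confidence"])
    count = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:count]
```

Run against the example review, the returned quote passes the substring check, while a tidied version such as "£14/mo is steep for what it does; I might cancel" would fail it. An exact substring test is strict on purpose: it also rejects harmless changes such as a swapped curly quote, so treat failures as prompts to look, not as proof of an error.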

Cost & time

Around $0.15 to tag 200 items on a small model, about a minute of wall-clock. The expensive part is writing the schema once, and that schema is reusable forever. Build it on 10 items, eyeball the output, then let it rip on the full pile.
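For a rough budget on a bigger pile, the figure above works out to about $0.00075 per item. The sketch below simply scales that linearly, which is an assumption: real cost depends on the model's pricing and how long your items are. `estimate_cost` is an illustrative name.

```python
def estimate_cost(n_items: int, cost_per_200_items: float = 0.15) -> float:
    """Scale the ~$0.15-per-200-items figure linearly to n_items (rough estimate)."""
    return n_items * cost_per_200_items / 200


# Example: a 10,000-item backlog, rounded to cents.
print(f"${estimate_cost(10_000):.2f}")
```

Under that assumption, 10,000 items comes to roughly $7.50.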