Extract clean structured data from a pile of messy text

Reviews, emails, support tickets: turn 200 blobs of free text into one clean table. The prompt, a schema that stops the model freelancing, a worked example, and the failure that will silently corrupt your data if you skip one line.

The problem

You have 200 customer reviews (or tickets, or replies) and you need them as a table: sentiment, the feature mentioned, whether it's a bug or a request. Humans tag maybe 30 an hour and disagree with each other by lunch. A model does all 200 in a minute — if you stop it from inventing categories halfway down the list.

When to use this — and when not to

Use it when the structure is clear and the volume makes hand-tagging silly. It shines at sentiment, classification, pulling fields (dates, amounts, names) out of prose.

Don't trust it for anything high-stakes (medical coding, compliance) unless a human audits a sample of the output, and never let it classify into categories it invented, or your "data" is just vibes in a spreadsheet.

The recipe

Two moves make this reliable: give it a fixed schema (closed lists, not open text) and make it return JSON. Per item:

Classify the text below. Return ONLY JSON matching this exact schema — no prose:

{
  "sentiment": "positive" | "neutral" | "negative",
  "topic": "pricing" | "performance" | "ui" | "support" | "other",
  "type": "bug" | "feature_request" | "praise" | "question",
  "quote": "the single most representative sentence, verbatim",
  "confidence": 0.0-1.0
}

Rules: choose ONLY from the listed values and never add a new one; if no "topic" value
fits, use "other". "quote" must be copied verbatim from the text, not paraphrased. If
you're unsure, lower "confidence" instead of guessing a specific label confidently.

TEXT:
<paste one item>

The closed value lists ("pricing" | "performance" | ...) are the whole game. Leave topic as free text and by item 50 the model is minting "user_experience," "UX," and "usability" as three different buckets and your group-by is ruined. The confidence field is your audit handle: sort by it ascending and you review the 10% the model itself wasn't sure about, not all 200.
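The closed lists only protect you if something checks them. Here is a minimal sketch of that check in Python, assuming each model reply has already been parsed into a dict. The names ALLOWED and schema_errors are made up for illustration; the value lists are copied from the schema above.

```python
# Closed value lists, copied from the schema in the prompt.
ALLOWED = {
    "sentiment": {"positive", "neutral", "negative"},
    "topic": {"pricing", "performance", "ui", "support", "other"},
    "type": {"bug", "feature_request", "praise", "question"},
}


def schema_errors(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    errors = []

    # Every categorical field must be a string drawn from its closed list.
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field}: {value!r} is not one of {sorted(allowed)}")

    # The quote must be a non-empty string.
    quote = record.get("quote")
    if not isinstance(quote, str) or not quote.strip():
        errors.append("quote: missing or empty")

    # Confidence must be a number in [0.0, 1.0]; bool is an int subclass, so exclude it.
    conf = record.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        errors.append(f"confidence: {conf!r} is not a number in [0.0, 1.0]")

    return errors
```

Any record that comes back with a non-empty error list is one to re-run or tag by hand rather than let into the table.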

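For a batch, loop the prompt over each item (a 20-line script) and collect the JSON. Many model APIs offer a "JSON mode" or response-format option; turn it on so replies parse as JSON instead of arriving wrapped in a stray apology. A plain JSON mode does not necessarily enforce your value lists, so still check every record against the schema.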
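The batch loop might look like the sketch below. It deliberately avoids any specific vendor API: ask_model is a hypothetical function you supply that takes a prompt string and returns the model's reply as a string. PROMPT_HEADER is a placeholder for the full prompt above, and fake_model exists only so the sketch runs offline.

```python
import json
from typing import Callable

# Placeholder: paste the full prompt from this section here, ending with "TEXT:".
PROMPT_HEADER = "Classify the text below. Return ONLY JSON ...\n\nTEXT:\n"


def tag_batch(
    items: list[str], ask_model: Callable[[str], str]
) -> tuple[list[dict], list[dict]]:
    """Run the prompt once per item; return (parsed records, failures)."""
    records, failures = [], []
    for index, text in enumerate(items):
        reply = ask_model(PROMPT_HEADER + text)
        try:
            record = json.loads(reply)
        except json.JSONDecodeError as exc:
            # Keep the raw reply so a failed item can be inspected or retried.
            failures.append({"index": index, "reply": reply, "error": str(exc)})
            continue
        if not isinstance(record, dict):
            failures.append(
                {"index": index, "reply": reply, "error": "reply is not a JSON object"}
            )
            continue
        record["index"] = index  # pointer back to the source item
        records.append(record)
    return records, failures


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call (an assumption, not a real client)."""
    if "refund" in prompt:
        return "Sorry, I can't help with that."
    return (
        '{"sentiment": "positive", "topic": "ui", "type": "praise", '
        '"quote": "love the new layout", "confidence": 0.9}'
    )


records, failures = tag_batch(["love the new layout", "where is my refund"], fake_model)
```

Collecting failures separately, instead of raising on the first bad reply, means one unparseable item does not cost you the rest of the batch.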

A worked example

Composed to show the format, not a captured run; see the note in the meeting-notes playbook on why some examples carry this label and others carry proof.

Input: "Honestly the app is fast now after the update but £14/mo is steep for what it does, might cancel."

Output:

{
  "sentiment": "negative",
  "topic": "pricing",
  "type": "feature_request",
  "quote": "£14/mo is steep for what it does, might cancel",
  "confidence": 0.72
}

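Good: it picked pricing over performance even though the review opens with praise for speed, because the cancellation risk hangs on the price. The 0.72 confidence is honest; this review is genuinely two-sided, and that's exactly the kind you'd want a human to glance at. The weak spot is "type": none of the four values describes a price complaint, so "feature_request" is a forced fit. If you see that often, add a value to the list yourself rather than letting the model stretch one.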
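Records shaped like the output above drop straight into a table. A small sketch using only the Python standard library follows. The first record is the worked example; the second is invented here purely so the counts have something to count. The names to_csv and count_by are illustrative.

```python
import csv
import io
from collections import Counter

FIELDS = ["sentiment", "topic", "type", "quote", "confidence"]


def to_csv(records: list[dict]) -> str:
    """Serialise records to CSV text, one row per item, ignoring extra keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def count_by(records: list[dict], field: str) -> Counter:
    """The group-by that free-text categories would ruin."""
    return Counter(record[field] for record in records)


records = [
    {  # the worked example above
        "sentiment": "negative",
        "topic": "pricing",
        "type": "feature_request",
        "quote": "£14/mo is steep for what it does, might cancel",
        "confidence": 0.72,
    },
    {  # invented for illustration only
        "sentiment": "positive",
        "topic": "performance",
        "type": "praise",
        "quote": "the app is fast now",
        "confidence": 0.9,
    },
]

table = to_csv(records)
topic_counts = count_by(records, "topic")
```

The csv module quotes the first record's quote field automatically because it contains a comma, which is one reason to prefer it over joining strings by hand.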

Where it breaks

Open-ended fields drift. The single biggest silent failure: any field you leave as free text fragments into near-duplicate categories across a batch. Fix: every categorical field is a closed list. Always.
It "fixes" your quotes. Ask for a verbatim quote and a helpful model will tidy the grammar, breaking exact-match search later. The "copied verbatim, not paraphrased" rule mostly holds it; spot-check.
Confidence is relative, not calibrated. 0.9 doesn't mean 90% correct — it means "more sure than the 0.6 ones." Use it to rank what to review, not as a pass/fail gate.
Schema too big. Cram in 12 fields and accuracy on all of them drops. Extract 4–5 things well, run a second pass for the rest.
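Two of the failures above can be caught mechanically. The sketch below assumes you still have each item's source text alongside its record. It checks that a quote really is a verbatim substring, and it builds a review pile from the least-confident records, using confidence only to rank and never as a pass/fail threshold. The function names and the 10% default are illustrative choices.

```python
import math


def quote_is_verbatim(record: dict, source_text: str) -> bool:
    """True only if the quote appears character-for-character in the source."""
    quote = record.get("quote")
    return isinstance(quote, str) and quote != "" and quote in source_text


def review_queue(records: list[dict], fraction: float = 0.10) -> list[dict]:
    """Return the least-confident share of records, lowest confidence first."""
    if not records:
        return []
    ranked = sorted(records, key=lambda record: record["confidence"])
    size = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:size]
```

An exact substring test is strict on purpose: a quote with tidied grammar or changed punctuation fails it, which is the same property that exact-match search depends on later.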

Cost & time

Around $0.15 to tag 200 items on a small model, about a minute of wall-clock; the exact figure depends on the model's token pricing and how long your items are. The expensive part is writing the schema once, and that schema is reusable for every later batch of the same kind. Build it on 10 items, eyeball the output, then let it rip on the full pile.