The problem
You have 200 customer reviews (or tickets, or replies) and you need them as a table: sentiment, the feature mentioned, whether it's a bug or a request. Humans tag maybe 30 an hour and disagree with each other by lunch. A model does all 200 in a minute, if you stop it from inventing categories halfway down the list.
When to use this, and when not to
Use it when the structure is clear and the volume makes hand-tagging silly. It shines at sentiment, classification, pulling fields (dates, amounts, names) out of prose.
Don't trust it for anything high-stakes (medical coding, compliance) without a human checking a sample, and never let it classify into categories it invented, or your "data" is just vibes in a spreadsheet.
The recipe
Two moves make this reliable: give it a fixed schema (closed lists, not open text) and make it return JSON. Per item:
Classify the text below. Return ONLY JSON matching this exact schema, no prose:
{
"sentiment": "positive" | "neutral" | "negative",
"topic": "pricing" | "performance" | "ui" | "support" | "other",
"type": "bug" | "feature_request" | "praise" | "question",
"quote": "the single most representative sentence, verbatim",
"confidence": 0.0-1.0
}
Rules: choose ONLY from the listed values and never add a new one; if nothing fits, use
"other". "quote" must be copied verbatim from the text, not paraphrased. If you're
unsure, lower "confidence"; do not guess a specific label confidently.
TEXT:
<paste one item>
The closed value lists ("pricing" | "performance" | ...) are the whole game. Leave topic as free text and by item 50 the model is minting "user_experience," "UX," and "usability" as three different buckets and your group-by is ruined. The confidence field is your audit handle: sort by it ascending and you review the 10% the model itself wasn't sure about, not all 200.
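A closed list only helps if something enforces it on the way back in. Below is a minimal sketch of that check, assuming each reply arrives as a JSON string; `validate_record` and `ALLOWED` are hypothetical names for illustration, not part of any library.

```python
import json

# Allowed values copied from the schema in the prompt above.
ALLOWED = {
    "sentiment": {"positive", "neutral", "negative"},
    "topic": {"pricing", "performance", "ui", "support", "other"},
    "type": {"bug", "feature_request", "praise", "question"},
}


def validate_record(raw: str) -> dict:
    """Parse one model reply and reject anything outside the closed lists.

    Hypothetical helper: raises ValueError on any problem, returns the dict otherwise.
    """
    record = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        # Check the type first so an unhashable value (e.g. a list) fails cleanly.
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{field}={value!r} is not in the closed list")
    quote = record.get("quote")
    if not isinstance(quote, str) or not quote:
        raise ValueError("quote must be a non-empty string")
    confidence = record.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    return record
```

Rejecting an invented label such as "UX" outright, rather than mapping it to a nearby bucket, keeps the group-by honest; a rejected item can simply be re-run or reviewed by hand.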
For a batch, loop the prompt over each item (a 20-line script) and collect the JSON. Most APIs have a "JSON mode" or response-format flag; turn it on so you never parse a stray apology.
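The batch loop might look roughly like this. It is a sketch under assumptions: `call_model` stands for whatever function wraps your provider's client (prompt string in, reply string out), `PROMPT_HEADER` is a stand-in for the full prompt, and `fake_model` exists only so the loop can be dry-run without an API key. None of these names come from a real library.

```python
import json
from typing import Callable

# Stand-in for the full prompt (schema and rules), ending at "TEXT:".
PROMPT_HEADER = (
    "Classify the text below. Return ONLY JSON matching this exact schema, no prose:\n"
    "...schema and rules go here...\n"
    "TEXT:\n"
)


def tag_batch(items: list[str], call_model: Callable[[str], str]):
    """Send one prompt per item; return (parsed records, failures to retry)."""
    records, failures = [], []
    for index, item in enumerate(items):
        reply = call_model(PROMPT_HEADER + item)
        try:
            record = json.loads(reply)
        except json.JSONDecodeError:
            # A stray apology or other prose lands here instead of crashing the run.
            failures.append((index, reply))
            continue
        if not isinstance(record, dict):
            failures.append((index, reply))
            continue
        record["source_index"] = index  # lets you join results back to the input
        records.append(record)
    return records, failures


def fake_model(prompt: str) -> str:
    """Canned replies for a dry run; swap in a real client call here."""
    if "refund" in prompt:
        return "Sorry, I can't help with that."
    return ('{"sentiment": "positive", "topic": "ui", "type": "praise", '
            '"quote": "Love the new layout.", "confidence": 0.9}')


if __name__ == "__main__":
    done, failed = tag_batch(["Love the new layout.", "Where is my refund?"], fake_model)
    print(len(done), "tagged;", len(failed), "to retry")
```

Collecting failures separately, instead of raising on the first bad reply, means one malformed answer does not cost you the other 199.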
A real run
Input: "Honestly the app is fast now after the update but £14/mo is steep for what it does, might cancel."
Output:
{
"sentiment": "negative",
"topic": "pricing",
"type": "feature_request",
"quote": "£14/mo is steep for what it does, might cancel",
"confidence": 0.72
}
Good: it picked pricing over performance even though the review opens with praise for speed, correctly weighting the cancellation risk. The 0.72 confidence is honest; this review is genuinely two-sided, and that's exactly the kind you'd want a human to glance at.
Where it breaks
- Open-ended fields drift. The single biggest silent failure: any field you leave as free text fragments into near-duplicate categories across a batch. Fix: every categorical field is a closed list. Always.
- It "fixes" your quotes. Ask for a verbatim quote and a helpful model will tidy the grammar, breaking exact-match search later. The "copied verbatim, not paraphrased" rule mostly holds it; spot-check.
- Confidence is relative, not calibrated. 0.9 doesn't mean 90% correct; it means "more sure than the 0.6 ones." Use it to rank what to review, not as a pass/fail gate.
- Schema too big. Cram in 12 fields and accuracy on all of them drops. Extract 4-5 things well, run a second pass for the rest.
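Two of the failure modes above can be caught mechanically. The sketch below assumes records are dicts shaped like the schema; `quote_is_verbatim` and `review_queue` are hypothetical helper names, and the 10% default simply mirrors the review share mentioned earlier.

```python
def quote_is_verbatim(quote: str, source_text: str) -> bool:
    """True only if the quote appears character-for-character in the source."""
    return quote in source_text


def review_queue(records: list[dict], percent: int = 10) -> list[dict]:
    """Return the least-confident share of records, lowest confidence first.

    Confidence is used only to rank, never as a pass/fail threshold.
    """
    ranked = sorted(records, key=lambda record: record["confidence"])
    # Integer ceiling of len * percent / 100, with at least one slot.
    count = max(1, (len(ranked) * percent + 99) // 100)
    return ranked[:count]


if __name__ == "__main__":
    source = ("Honestly the app is fast now after the update but "
              "£14/mo is steep for what it does, might cancel.")
    print(quote_is_verbatim("£14/mo is steep for what it does, might cancel", source))
    print(quote_is_verbatim("£14/month is steep for what it does.", source))
```

An exact substring test is deliberately strict: a quote with tidied grammar fails it, which is the point if you plan to search for the quote in the original text later.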
Cost & time
Around $0.15 to tag 200 items on a small model, about a minute of wall-clock. The expensive part is writing the schema once, and that schema is reusable forever. Build it on 10 items, eyeball the output, then let it rip on the full pile.