Andrey Kozichev

10 AI Anti-Patterns in Data Pipelines

10 AI Anti-Patterns in Data Pipelines

Most AI failures in data pipelines aren't model failures. They're architecture failures. The model does roughly what you asked — the problem is where you put it and what you trusted it to do.

The one rule under all of this: use AI to build the solution, not to be the solution. AI belongs at the fuzzy edge — interpreting natural language, extracting intent, classifying messy unstructured input. The deterministic core of your pipeline is not its job.

Here are the anti-patterns I keep seeing.

1. Using AI to do the task instead of to build the solution

You put an LLM in the runtime path to transform every record — when what you actually needed was for AI to help you write the transform once. Let it generate the parser, the mapping, the regex, the code. Then run code. Cheap, fast, deterministic, debuggable.

2. Reaching for AI where cheaper, deterministic tools already win

Dates, joins, math, dedup, type coercion, validation — all deterministic. An LLM doing them is slower, pricier, and non-deterministic, and it will eventually get one wrong in a way you can't reproduce. If a unit test could cover it, it shouldn't be a prompt.

The same mistake shows up in matching: pick the tool by the kind of similarity you need, not by what's trendy. Fixed shape (emails, SKUs, error codes, dates)? Regex or rules — exact and free. The same words with noise (typos, OCR errors, "St" vs "Street")? Fuzzy matching, right inside your database (pg_trgm, fuzzystrmatch). Only when you need different words, same meaning ("reset my password" vs "I forgot my login") do you reach for embeddings — which carry real cost: a model, a vector store, a tuned threshold, and a full re-embed every time you change the model. Using vectors to match an exact SKU is slower, costlier, and harder to explain than the = it replaced; in production the honest answer is usually hybrid (keyword + vectors, fused).

3. Putting a non-deterministic call in a path that must be reproducible

Same input should yield the same output — that's what makes backfills match, incremental re-runs safe, and audits possible. An LLM breaks it: even at temperature 0 it isn't guaranteed bit-identical, and the model version can change underneath you, so re-running last month's job today gives different answers. If a model call truly must live in that path, defend it in three layers:

Pin it — lock the exact model version (the dated snapshot, not a floating latest alias), the prompt version, temperature=0, and a seed where supported. This kills the drift you control. It won't make output perfectly deterministic, which is why you also need the next two.
Cache it — keep a lookup keyed by (normalized input hash + model version + prompt version + params) → output, and on a re-run serve the stored answer instead of calling the model again. This is what actually guarantees idempotency, and it's the same hash as #7 doing double duty (cost there, reproducibility here). When the model or prompt changes the key misses and you recompute — which is what you want.
Snapshot the output — persist the result as a first-class, versioned dataset with its provenance (model, prompt, params, timestamp). Unlike a cache, you don't evict it: downstream rebuilds from the snapshot rather than a fresh call, so the pipeline stays reproducible even after the vendor deprecates that model — and you keep an audit trail of what the model said, under exactly what conditions.

In practice the cache and the snapshot are often the same materialized layer wearing two hats.

4. Trusting the model's input and output without checks

Don't trust the model on either end. On the way in, gate with cheap non-LLM checks before you pay for inference — right size, non-empty, the field you need is present, the format is plausible, the language is right. A run over 1,000 records where 30% are junk means full token price to extract nothing from a third of them, plus the downstream work of finding and discarding those results. On the way out, demand structured, typed output (JSON schema), validate it, and route anything malformed to a dead-letter / quarantine path. Filter before, validate after, and send only the survivors to the model.

5. Prompts that are untracked and untested

A prompt is production logic. If it lives unversioned inside a notebook, you've got an un-rollback-able deploy and no diff when it breaks. Worse: you tweak it and have no idea whether it got better or worse. Version prompts like code, and put a regression/eval harness around your most brittle component.

6. Defaulting to the biggest model for everything

Right-size it. Small, cheap models handle classification, extraction, and routing fine. Reserve the frontier model for genuinely hard reasoning. Reaching for the largest model by reflex is just burning money and latency on tasks that don't need it.

7. Wasting money at the API

A few cheap habits cost you constantly. Cache-busting: injecting timestamps, UUIDs, or row IDs into the prompt body kills prompt caching and re-bills you for identical context every call — keep the static part static and pass volatile data through separate fields. Unbatched, unmetered calls: loop an API request over millions of records with no batching, concurrency cap, or budget guard and watch the bill and wall-clock explode — measure tokens and cost per stage, or you can't catch a runaway or justify an optimization. Reprocessing what hasn't changed: store a hash of each input and skip the call when it matches, but hash the normalized text (trim whitespace, collapse formatting) so you only re-infer on a genuine content change, not a cosmetic one. And synchronous calls for bulk work: most vendors offer an async Batch API at roughly half the price of the real-time endpoint — fine to start synchronous for simplicity, but once you're hammering records, batch always wins, and designing for it from the start beats retrofitting later.

8. Hardcoding one vendor — and pinning to "latest"

Two distinct traps. Hardcoding OpenAI's or Anthropic's SDK straight into your DAG means a deprecation, rate-limit change, or repricing tomorrow is now a rewrite — put an abstraction layer (LiteLLM or similar) in between. And pinning to a floating latest alias means a vendor update silently changes your output. Pin the exact model version and upgrade deliberately, behind your eval.

9. Ignoring the prompt boundary

Data crossing the prompt is a risk in both directions. Outbound: raw, un-redacted records — PII, sensitive columns, no DPA — leaving your boundary for a third-party endpoint is a governance incident waiting to happen, so know what's in the payload before it leaves. Inbound: untrusted pipeline data flowing into a prompt can hijack the model (prompt injection), so keep data and instructions structurally separated and treat every input as hostile.

10. Outsourcing the thinking

Using an LLM to write the prompts another LLM runs stacks non-determinism you can't debug — write your instructions deliberately. And "the system will self-learn over time" is not an architecture; without explicit feedback capture, evals, and retraining loops, "self-learning" is just undocumented drift with a nicer name.

—

None of these rules are really about AI. They're the same data engineering you already practice — determinism, idempotency, observability, cost control, governance — applied to a component that happens to be probabilistic and metered by the token. The mistake is treating "AI" as a category that earns an exception from all of it.

So the bar is simple: keep the pipeline deterministic, observable, and version-pinned, and let the model do the one thing it's genuinely good at — turning messy language into structured intent or just use AI to build the solution. Then get it out of the hot path.

Use AI to build the solution, not to be the solution. Then get it out of the hot path.

Andrey Kozichev

Subscribe for the latest blogs and news updates!

governance

May 19, 2026

Where does the AI productivity surplus go?"

AI gave you a productivity surplus. You have four options for what to do with it — and two of them happen without you choosing. Here's the decision.

Sep 17, 2025

From PDF Chaos to AI-Powered Clarity

What if you could drop any PDF into a folder and get back a clean, structured summary — powered entirely by open-source tools and your own local AI? This workflow turns document chaos into automated clarity.

10 AI Anti-Patterns in Data Pipelines

10 AI Anti-Patterns in Data Pipelines

1. Using AI to do the task instead of to build the solution

2. Reaching for AI where cheaper, deterministic tools already win

3. Putting a non-deterministic call in a path that must be reproducible

4. Trusting the model's input and output without checks

5. Prompts that are untracked and untested

6. Defaulting to the biggest model for everything

7. Wasting money at the API

8. Hardcoding one vendor — and pinning to "latest"

9. Ignoring the prompt boundary

10. Outsourcing the thinking

Andrey Kozichev

Related Posts

Where does the AI productivity surplus go?"

From PDF Chaos to AI-Powered Clarity