Same Input, Different Output: How LLM Drift Silently Broke Our Production Pipeline

A user asked our assistant:

“List all customers who placed orders in the last 30 days.”

The backend used GPT to generate the SQL:

SELECT * FROM orders WHERE order_date >= '2024-04-01';

It worked.

The next day, the same prompt returned:

SELECT * FROM orders WHERE order_date >= '2024-04-01' AND status = 'shipped';

No warning. No error. Just a new condition the user never asked for.

The dashboard started showing fewer rows. Nobody noticed — until someone downstream questioned why monthly revenue looked off.

It was the same input. But GPT made a different decision.

This wasn’t a bug. It was the model behaving as expected: making plausible guesses based on training, context, and randomness.

And unless your system is explicitly engineered for consistency, this kind of drift — silent, confident, invisible — will leak into production.

1. Inconsistent Q&A Output Even With the Same Prompt

Let’s say your assistant supports engineers with technical answers.

Here’s a prompt:

What are the assumptions behind the incompressible Navier–Stokes equations?

Seems deterministic, right?

On first run, GPT responds:

  • Newtonian fluid
  • Constant viscosity
  • Incompressibility
  • Continuum hypothesis
  • Isotropic stress tensor

But in a separate session, same model, same prompt:

  • Newtonian fluid
  • Thermodynamic equilibrium
  • Laminar flow
  • No body forces
  • Smooth fields

What changed?

Nothing visible. No prompt mutation. No context shift. Just an ambiguous request.

The word “assumptions” isn’t formally defined in GPT’s world. The model sees related words in its training corpus — consequences, constraints, flow regimes — and merges them.

Even at temperature=0, near-ties in token ranking and non-determinism in inference leave enough room for subtle semantic drift.

In our systems, the same drift appeared in other factual queries:

  • “List assumptions of linear elasticity” → sometimes gave small strain, sometimes missed isotropy
  • “What is the divergence theorem?” → sometimes returned surface integrals, sometimes flipped the directionality

Only after restructuring prompts like this:

List exactly 5 **core assumptions** behind the **incompressible** Navier–Stokes equations.
Do not include derived consequences.
Each point must be under 15 words.

…did we achieve consistent responses.

What stabilized it wasn’t magic — it was:

  • Explicit scope narrowing
  • Format constraints
  • Token space limitation

In LLMs, open-ended prompts yield generative behavior — not database behavior. The more formal your request, the more reproducible the outcome.
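That contract is also easy to check mechanically. Here is a minimal sketch of such a format check, assuming the answer comes back as a plain bullet list; the helper name is illustrative, not part of the original pipeline:

def check_constrained_answer(answer: str) -> list[str]:
    # Illustrative validator for the constrained prompt above:
    # exactly 5 points, each under 15 words.
    problems = []
    bullets = [
        line.lstrip("-*• ").strip()
        for line in answer.splitlines()
        if line.strip().startswith(("-", "*", "•"))
    ]
    if len(bullets) != 5:
        problems.append(f"expected exactly 5 points, got {len(bullets)}")
    for point in bullets:
        if len(point.split()) >= 15:
            problems.append(f"point is not under 15 words: {point!r}")
    return problems

A non-empty result means the contract was broken, and that response gets retried or flagged instead of passed downstream.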

2. Code Editing Often Rewrites More Than You Asked

We had a function:

def process_data():
    response = requests.post(url, json=payload)
    if response.status_code == 200:
        return response.json()
    return None

User prompt:

“Add a logging line after the API call.”

In one session, GPT added:

logging.info("POST request sent.")

But in another:

  • It renamed response to result
  • Changed the error return from None to None, None
  • Removed an unrelated comment line

Even with temperature = 0, GPT doesn’t “insert” — it regenerates.

LLMs don’t think like patch editors. Unless you explicitly constrain their editing behavior, they reconstruct entire blocks, interpreting your instruction semantically instead of surgically.

This was dangerous for us in real pipelines. One generation silently removed a retry block. Another disabled an exception handler.

We made 3 architectural changes:

  • Extract only the function-level scope, not the full file
  • Mark exact anchor points using inline comments like # INSERT LOG HERE
  • Request output in diff format (only changed lines)

Once the prompt was transformed from:

Add logging to this function.

…to:

Only insert one line after `requests.post(...)`. Do not change anything else. Return only inserted lines.

…we achieved stable behavior across sessions.
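Concretely, the model no longer saw the file at all, only the function with its anchor marked. A reconstruction of that scoped input, based on the snippet above:

def process_data():
    response = requests.post(url, json=payload)
    # INSERT LOG HERE
    if response.status_code == 200:
        return response.json()
    return None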

And to protect against GPT still getting creative, we validated diffs using difflib and flagged any token change outside the marked scope.
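A stripped-down version of that check, using only the standard library. The rule that the only acceptable addition is a logging line is an assumption for this example:

import difflib

def diff_violations(original: str, edited: str) -> list[str]:
    # Reject anything that is not a pure insertion of a logging line.
    violations = []
    diff = difflib.unified_diff(
        original.splitlines(), edited.splitlines(), lineterm=""
    )
    for line in diff:
        if line.startswith(("---", "+++", "@@")):
            continue  # file headers and hunk markers
        if line.startswith("-"):
            violations.append(f"unexpected removal: {line[1:].strip()}")
        elif line.startswith("+") and "logging." not in line:
            violations.append(f"unexpected addition: {line[1:].strip()}")
    return violations

A non-empty result is reason enough not to apply the edit automatically.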

3. Generated SQL is Unstable Without Output Normalization

Auto-generating SQL is a common GPT use case.

A user prompt:

Show me all orders from customer 12345 in the last 60 days.

We expect:

SELECT * FROM orders WHERE customer_id = 12345 AND order_date >= '2024-03-01';

But GPT sometimes gave:

  • BETWEEN '2024-03-01' AND CURRENT_DATE
  • Added AND status = 'shipped'
  • Used LIMIT 100
  • Reordered the predicates inside the WHERE clause

Each variation still ran. But:

  • Some returned fewer records
  • Others failed downstream joins
  • A few broke test case expectations in CI

We traced this to model behavior:

  • Inferring extra filters from similar queries in its training data
  • Using date logic variations it learned from informal datasets
  • Structuring clauses differently based on hidden positional biases

What worked consistently:

  • Feeding GPT a strict SQL contract template with hardcoded table, column, and clause positions
  • Predefining allowed logic insertions and constraints
  • Parsing generated output via sqlparse and normalizing clause order
  • Saving a hash of every unique query output and alerting on new ones

By reducing GPT’s job to filling in safe blanks — not generating full queries — and validating outputs post-generation, we made SQL predictable and auditable.
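A sketch of the normalization and hashing step. sqlparse handles casing, whitespace, and comments here; true clause reordering and the persistent fingerprint store are simplified away:

import hashlib
import sqlparse

seen_fingerprints = set()  # simplified; in production this lives in a datastore

def normalize_sql(sql: str) -> str:
    # Canonicalize keyword casing and layout, and strip comments, so that
    # cosmetically different versions of the same query hash identically.
    return sqlparse.format(
        sql, keyword_case="upper", reindent=True, strip_comments=True
    ).strip()

def is_new_query_shape(sql: str) -> bool:
    # True the first time a never-seen-before query shape appears,
    # which is the event to alert on.
    fingerprint = hashlib.sha256(normalize_sql(sql).encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return False
    seen_fingerprints.add(fingerprint)
    return True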

We also added semantic unit tests to check:

  • Are all filters present?
  • Is the customer ID unmodified?
  • Are extra clauses (LIMIT, JOIN, GROUP BY) added without being asked?
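Those checks reduce to plain assertions over the generated text. A minimal sketch for the customer-orders example, where the required filters and the forbidden clause list are illustrative assumptions:

import re

FORBIDDEN_CLAUSES = ("LIMIT", "JOIN", "GROUP BY")  # nothing the user asked for

def semantic_issues(sql: str, customer_id: int, required_filters: tuple) -> list[str]:
    issues = []
    upper = sql.upper()
    for f in required_filters:                       # e.g. ("order_date >=",)
        if f.upper() not in upper:
            issues.append(f"missing filter: {f}")
    if not re.search(rf"customer_id\s*=\s*{customer_id}\b", sql, re.IGNORECASE):
        issues.append("customer ID missing or modified")
    for clause in FORBIDDEN_CLAUSES:
        if clause in upper:
            issues.append(f"unrequested clause: {clause}")
    return issues

# Example: semantic_issues(generated_sql, 12345, ("order_date >=",))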

This made the difference between a “chatty assistant” and a reliable code generation engine.

4. Chat Memory Causes Cross-Turn Leakage

A session starts:

Explain Galerkin projection in finite elements.

GPT correctly describes:

  • Weighted residuals
  • Test functions
  • Integral formulation

A few turns later:

Can this be used in nonlinear systems?

GPT starts referencing:

  • Neural networks
  • Variational autoencoders
  • Auto-diff frameworks

None of which were mentioned.

This isn’t hallucination — it’s context blending.

In memory-enabled chat sessions, everything already in the context window shifts which continuations GPT considers likely. Without scope boundaries, domain contamination creeps in.

In our knowledge assistant, this caused:

  • Context leakage from earlier unrelated questions
  • Irrelevant academic concepts being added
  • Mixing of FEM and ML vocabularies

We restructured factual flows as:

  • Stateless LLM calls with no prior turns
  • Embedding-matched semantic retrieval (RAG) using FAISS
  • Injection of only 2-3 chunks of verified text into the prompt
  • Instruction: “Use only the provided context. Do not reference previous conversation unless explicitly instructed.”

This gave us traceable, scope-bound answers, where every token had a source.
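A minimal sketch of that stateless, retrieval-grounded call. Only the faiss calls are real API here; embed(), llm(), chunks, and chunk_vectors stand in for our embedding model, chat wrapper, and verified corpus:

import faiss  # plus numpy vectors prepared offline

# Assumptions: chunks is a list of verified text passages, chunk_vectors is a
# float32 (n_chunks, d) matrix produced by embed(), and llm() wraps a single
# stateless chat-completion call. All of these are placeholders.
index = faiss.IndexFlatIP(chunk_vectors.shape[1])  # inner product over normalized vectors
index.add(chunk_vectors)

def grounded_answer(question: str, k: int = 3) -> str:
    _, ids = index.search(embed(question), k)       # embed() returns a (1, d) float32 array
    context = "\n\n".join(chunks[i] for i in ids[0])
    prompt = (
        "Use only the provided context. Do not reference previous "
        "conversation unless explicitly instructed.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                              # stateless: no prior turns are sent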

5. Logic Insertion Creeps In Uninvited

A user asks:

Generate code to read a CSV and call an API for each row.

GPT responds with:

  • try/except blocks
  • Retry logic with exponential backoff
  • Hardcoded headers
  • Logging configuration
  • Progress bar via tqdm

The output is correct, but it includes business logic no one asked for.

If your system executes, validates, or audits GPT-generated code, these “enhancements” break alignment.

This was especially critical for us in compliance workflows. One generation added:

verify=False

…to a requests.post() call to “fix” a certificate error it hallucinated.

We mitigated this by:

  • Wrapping all GPT code generation in template scaffolds
  • Annotating safe insertion zones via comments
  • Asking for outputs without enhancements unless explicitly required
  • Running all code through black, flake8, and a set of custom linters for unexpected imports or logic blocks

GPT’s job shifted from “write code” to “complete this pre-defined code frame, only inside marked blocks.”

This eliminated drift without sacrificing the model’s utility.
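One of those custom linters, roughly: a single ast pass over the generated module that flags imports outside an allowlist and any verify=False keyword. The allowlist itself is an assumption tied to the CSV-plus-API task:

import ast

ALLOWED_IMPORTS = {"csv", "logging", "requests"}  # only what the task needs

def lint_generated_code(source: str) -> list[str]:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    findings.append(f"unexpected import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                findings.append(f"unexpected import: from {node.module}")
        elif isinstance(node, ast.keyword) and node.arg == "verify":
            # Catch the verify=False "fix" before it reaches a compliance review.
            if isinstance(node.value, ast.Constant) and node.value.value is False:
                findings.append("verify=False passed to a call")
    return findings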

Closing Thoughts

Large language models are powerful — but power isn’t the problem.

Behavioral consistency is.

What we learned is simple: GPT isn’t unreliable — it’s unconstrained.

If you let it choose the structure, scope, and tone of the output, it will. If you define the rails it must follow, it often complies.

Every inconsistency we experienced — from hallucinated filters to overwritten code — was a signal that we hadn’t designed the system boundary tightly enough.

In production, consistency is not an AI feature. It’s a system-level guarantee.

And the way to get there isn’t better prompting. It’s:

  • Structured input contracts
  • Canonicalized, testable output
  • Strict model roles and scopes
  • Downstream validators and drift monitors
  • Retrieval-injected grounding when memory isn’t enough

You don’t need to fine-tune GPT. You need to wrap it in guardrails.

Only then does it behave like a dependable part of your stack.


Note:
This post was developed with structured human domain expertise, supported by AI for clarity and organization.
Despite careful construction, subtle errors or gaps may exist. If you spot anything unclear or have suggestions, please reach out via our members-only chat — your feedback helps us make this resource even better for everyone!