LLMs don’t have a well understood place in deterministic settings.
Three months ago I put code in production that uses the gpt-4o and/or gpt-4o-mini models to analyze feedback about businesses and categorize it. The prompts instruct the models to identify categories of feedback and, in a second phase, extract some examples of what people said.
This is a simplification, but it took very little effort to craft some prompts that enabled even the meager gpt-4o-mini model to do exactly that. Given how little effort a working solution took, it didn’t feel like a stretch to assume that this use case was well within ChatGPT’s limits. The results were genuinely useful, and the effort was low. It seemed like an obvious win.
The code ran every two weeks, and the models had done an admirable job every time. Before putting results in front of users, I let the code run silently, only visible to a select few within our company. Very subjectively, we were satisfied with ChatGPT’s results and figured it was time to put them in front of users. So I threw some basic integration tests around the feature, gave it the old go test -count 1000 ..., and released it to the world.
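The tests themselves aren't shown here. As a rough illustration of the kind of structural assertion an integration test around LLM output might make, here is a minimal sketch. It assumes, hypothetically, that the prompt asks the model for a JSON array of categories, each with a name and some example quotes. The Category type, its field names, and ValidateCategories are invented for this example and are not the production code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Category is a hypothetical shape for one feedback category returned by the
// model. The field names are assumptions made for this sketch.
type Category struct {
	Name     string   `json:"name"`
	Examples []string `json:"examples"`
}

// ValidateCategories parses raw model output and checks structural invariants
// only. It does not judge whether the categories are any good.
func ValidateCategories(raw string) ([]Category, error) {
	var cats []Category
	if err := json.Unmarshal([]byte(raw), &cats); err != nil {
		return nil, fmt.Errorf("output is not a JSON array of categories: %w", err)
	}
	if len(cats) == 0 {
		return nil, errors.New("no categories returned")
	}
	seen := make(map[string]bool)
	for i, c := range cats {
		// Compare names case-insensitively and ignore surrounding whitespace.
		name := strings.ToLower(strings.TrimSpace(c.Name))
		if name == "" {
			return nil, fmt.Errorf("category %d has an empty name", i)
		}
		if seen[name] {
			return nil, fmt.Errorf("duplicate category %q", c.Name)
		}
		seen[name] = true
		if len(c.Examples) == 0 {
			return nil, fmt.Errorf("category %q has no examples", c.Name)
		}
	}
	return cats, nil
}
```

A test would feed the model's real response into a function like this and fail on any error. Checks of this kind catch empty or malformed output, but they say nothing about quality.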
Like most software engineers, I’m all too familiar with flaky tests, and I doubted that tests around LLM output would be predictable enough to stay stable. But running a test suite with go test -count 1000 without failure is pretty confidence-inspiring. And it turns out that my hesitation was not terribly well-founded.
The tests simply never failed for three months. I was so skeptical that a few times over the last month, I cracked the tests open and made sure we weren’t simply getting false negatives. We weren’t, and tests were running multiple times a day on local development machines and during both staging and production deployments. The tests were running about 50 times per week and remained rock solid.
That is, until 11/13/2024. The conversation in my head following that test failure went something like this:
I KNEW IT. The test has gone flaky; LLMs obviously can’t be treated like deterministic machines, this was bound to go flaky. I’ll treat it like any other non-deterministic thing and just make the test easier to pass for now until I can re-focus on crafting better prompts.
I guess you could say that in one way I was right. LLMs don’t have a well understood place in deterministic settings. But I was definitely wrong in another: I assumed that making the test easier to pass would let me sweep this thing under the rug and forget about it for a while.
11/15 rolled around, and this feature failed so spectacularly that I’m still blushing with embarrassment from what it put in front of users. Luckily, it failed so spectacularly that very few of them saw anything at all because nearly every prompt failed to produce results. In my test environment, the models were presented with a relatively small amount of mock data. In production, the models were presented with far more data, and as it turns out, they perform much worse when presented with more data.
What I should have done on 11/13 was go back to my original, more rigorous test (go test -count 1000 ...) and check whether ChatGPT itself had regressed. Had I done so, the embarrassment of 11/15 would never have occurred, because I would have seen the tests fail all 1000 times. In other words, it had become a deterministic failure.
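The distinction that matters here is between a check that fails some of the time and one that fails every time. A minimal sketch of that triage, assuming the check can be wrapped as a function returning an error (Classify, Verdict, and ErrCheckFailed are names invented for this example):

```go
package main

import "errors"

// ErrCheckFailed is a placeholder error for checks used with Classify.
var ErrCheckFailed = errors.New("check failed")

// Verdict describes how a check behaved across repeated runs.
type Verdict string

const (
	Stable        Verdict = "stable"        // never failed
	Flaky         Verdict = "flaky"         // failed on some runs, passed on others
	Deterministic Verdict = "deterministic" // failed on every run
)

// Classify runs check n times and reports the number of failures along with
// a verdict. With n <= 0 nothing runs and the result is (0, Stable).
func Classify(n int, check func() error) (int, Verdict) {
	failures := 0
	for i := 0; i < n; i++ {
		if err := check(); err != nil {
			failures++
		}
	}
	switch {
	case failures == 0:
		return failures, Stable
	case failures == n:
		return failures, Deterministic
	default:
		return failures, Flaky
	}
}
```

A flaky verdict suggests ordinary non-determinism worth tolerating or tuning around. A deterministic verdict on a test that passed for months suggests that something upstream changed, and loosening the assertion will only hide it.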
This isn’t your average regression. And I don’t intend to imply that ChatGPT is failing catastrophically for many users, because I’ve yet to find anyone else talking about similar regressions. But I’m certain that something fundamental changed in gpt-4o on Wednesday, 11/13/2024. Maybe it’s a single parameter or weight that affects only a subset of users. Maybe that parameter or weight, or whatever it is, affects only me. But what is certain is that something changed, and it changed without any fanfare: no announcement, no blog post, no tweet.
Take this post as a reminder that these models can change on a whim, despite what is published at https://platform.openai.com/docs/models.
ChatGPT may not be slipping for everyone, but it’s certainly slipping for me.
The first thing I did was disable the feature. I’ll begin testing other models from other providers, but ultimately, I think this should be the beginning of a conversation about self-hosting and local LLMs. I can’t in good conscience continue building on a platform that can change on a whim without any public acknowledgement of what is changing.
There’s already a slow movement toward bringing LLM functionality in-house rather than relying on yet another third-party vendor like OpenAI, Google, or Anthropic. A few things that have my wheels turning are:
Langchaingo (https://github.com/tmc/langchaingo): Abstract away the providers with a common interface. This is not to say the providers and models are fungible; they’re not. But langchaingo makes dropping different models into applications pretty seamless.
Go blog’s write-up (https://go.dev/blog/llmpowered): This was a great overview of how one might begin the process of making more LLM functionality local.
Ollama (https://ollama.com/): Local-first LLMs.
ML in Go with a Python sidecar (https://eli.thegreenplace.net/2024/ml-in-go-with-a-python-sidecar/): The title says it all. I’m not in love with “sidecar” architecture, but it’s sensible in some settings.
Have you seen any similar declines in ChatGPT quality? I’d love to hear from anyone else: [email protected] or through the Mastodon comments below.
I decided to go back to the original test today and re-run it. It passed on the first try, and on every subsequent try (it’s passed dozens of times now). So what was a 100% failure rate days ago is once again a 100% success rate. It seems like OpenAI is comfortable taking API users for a ride. I assumed models would fluctuate regularly on the web frontend, and naively thought the API would have more stability. Learn from my naivety.
This article was originally published here.