Concerns on agentic development

I feel the weight of this one, given how enthusiastically I promoted agentic development in previous posts.

Starting with the ground truth: agents and LLMs hallucinate, a lot. They often get things wrong on the first try, and need either systematic validation (e.g. tests) or a human to push back on what they’re doing.

What I’ve discovered is that it’s all too easy to push aside the latter if you have the former. Tests can create a false sense of security. Often they can’t cover everything because of a technical limitation: on iOS, for example, you can’t run tests against the real Screen Time API in the simulator. Mocks were invented for just this purpose. But a mock also obscures how an external API actually works, and an agent reasoning from the mock can reach the wrong conclusion about the real thing.

To give an example, a feature I recently added to OpenAppLock was push notifications. One case where a notification should be delivered is when you’ve used an app for a certain amount of time. On a 30-minute time limit, the notification is designed to fire in a background event around 25 minutes into that limit. Separately, there’s a usage counter in the app which only updates in the foreground. Skipping some details, Claude read a log of the time limit event firing in the background, decided that was a bug because it didn’t update the foreground usage counter continuously, and also concluded that the notification would get batched in with that event.¹
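To make the mock problem concrete, here is a minimal sketch in Python. It is not OpenAppLock’s code and it doesn’t model Apple’s API: `MockUsageMonitor`, `delivery_delay_minutes`, and `notified_before_limit` are all hypothetical names I made up for illustration. The point is only that the same check passes or fails depending on a timing assumption that lives entirely inside the mock.

```python
class MockUsageMonitor:
    """Hypothetical stand-in for a usage-threshold API (not a real framework)."""

    def __init__(self, threshold_minutes, delivery_delay_minutes=0):
        self.threshold_minutes = threshold_minutes
        # Assumption baked into the mock: how late the callback may arrive.
        # A default of 0 means "fires the instant the threshold is crossed".
        self.delivery_delay_minutes = delivery_delay_minutes
        self.notified_at = None

    def advance_to(self, minutes_used):
        # Record the first minute at which the (possibly delayed) callback fires.
        due = self.threshold_minutes + self.delivery_delay_minutes
        if self.notified_at is None and minutes_used >= due:
            self.notified_at = minutes_used


def notified_before_limit(monitor, limit_minutes):
    """Simulate usage minute by minute, then check the warning beat the limit."""
    for minute in range(limit_minutes + 1):
        monitor.advance_to(minute)
    return monitor.notified_at is not None and monitor.notified_at < limit_minutes


# With the optimistic mock, the 25-minute warning arrives before the 30-minute limit.
optimistic = notified_before_limit(MockUsageMonitor(25), limit_minutes=30)

# If delivery could lag by 10 minutes (an invented figure), the same check fails.
delayed = notified_before_limit(
    MockUsageMonitor(25, delivery_delay_minutes=10), limit_minutes=30
)
```

A test suite built on the first mock goes green without saying anything about how the real API delivers events. Which of the two assumptions is closer to the truth is something only a run on a real device can tell you.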

It feels like every engineer uncovers a similar experience: they code something with an LLM, attempt to fix a behavior they notice, and the LLM trips over itself trying to “reason” about a fix. Maybe it gets misled by an incorrect doc comment written by a previous agent. Even when it correctly explains how an API is supposed to work, it stumbles when interacting with that API itself. More often than not, it’s when you start looking closely at the code that something just seems off about the way an LLM is doing something.

I say this after implementing adversarial review loops, test-driven development loops, and more in my workflow. I thought that software was different in that you can verify correctness. That’s why I happily embraced agents strapped with a software testing loop, while being skeptical of LLM usage in other fields. Perhaps I haven’t done enough to improve the correctness of these systems. Still, even the smallest errors and confusions compound continuously at the pace of agentic development. Without tacit knowledge of a codebase and immersion in it, it’s way too easy to miss the issues that crop up.

I’m not saying no to agents, because there are still many cases where they’re useful. I recognize that agents will continue to stick around in software. But I really, really don’t want them to make me dumber as an engineer anymore.

¹ By the way, Apple’s Screen Time APIs are notoriously finicky. It’s 100% possible that a notification gets batched like this, and the only way to verify it doesn’t is manual testing. The issue I take here is with the “reasoning” that Claude used to reach this conclusion.