The Illusion of Perfect LLM Code
I recently tested several LLMs by tasking them with implementing a simple authentication feature for a web app. Almost all of them proved excellent at following a structured blueprint. The real differences appeared only when I looked under the hood at the security of the generated code.
In my testing, I compared Opus 4.8, Gemini 3.5 Flash, Sonnet 4.6, Kimi 2.6, and DeepSeek V4 Flash. My goal was to see how well they handle real-world coding tasks, execution plans, and security audits. Given a specific instruction file such as a PLAN.md, they all performed remarkably well. Whether it was an expensive flagship model like Opus or an ultra-affordable option like DeepSeek, each one could follow the step-by-step instructions and generate working code.

However, when it came to security, including the models’ ability to assess their own work, the results started to diverge. Premium models like Opus and Gemini showed great strength in conducting security audits and catching hidden flaws. The other models were very hit-or-miss.
This creates a serious hidden danger for what people now call the vibe coder. A vibe coder is someone who trusts the LLM completely, writing code purely by judging the general vibe or flow of the project. If the application runs fine on the screen and the features work, the vibe coder assumes everything is perfect. They feel successful simply because the LLM followed the PLAN.md flawlessly.
But this is an illusion. Just because a piece of software works on the outside does not mean it is safe on the inside. When an LLM fails to catch flaws during its own security audit, the vulnerabilities it introduced stay in your application unnoticed. If you rely entirely on the vibe without reviewing the code yourself, you are unknowingly putting your entire system at risk.
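To make the gap between "works" and "safe" concrete, here is a hypothetical illustration in Python. It is not taken from any model's output in the tests above; the function names and the iteration count are assumptions chosen for the sketch. Both versions pass a basic login check, so they look identical from the outside, but only the second uses a random salt, a deliberately slow key-derivation function, and a constant-time comparison.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; tune it for your own hardware.
ITERATIONS = 600_000
SALT_BYTES = 16


def weak_hash(password: str) -> str:
    # Works, but is unsalted and fast: identical passwords give identical hashes.
    return hashlib.sha256(password.encode()).hexdigest()


def weak_verify(password: str, stored: str) -> bool:
    # Plain == comparison is not constant-time.
    return weak_hash(password) == stored


def strong_hash(password: str) -> bytes:
    # A fresh random salt per password, prepended to the derived key.
    salt = os.urandom(SALT_BYTES)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt + key


def strong_verify(password: str, stored: bytes) -> bool:
    # Split the stored value back into salt and key, then re-derive and compare.
    salt, key = stored[:SALT_BYTES], stored[SALT_BYTES:]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, key)
```

A login form backed by either pair behaves the same on screen, which is exactly why a purely functional check would not tell them apart.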
We cannot always rely on public benchmarks to judge an LLM. Efficiency, speed, and low costs are great, but they should not come at the expense of safety. As developers, we must stay hands-on. The best approach to evaluating these models is to craft a truly representative test of your own, and always double-check the security of the code before it goes live.
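One way to start on such a test is a small check that probes the generated code for properties a functional test never looks at. The sketch below is a minimal, hypothetical example: the name audit_password_hasher, the two sample hashers, and the wording of the findings are all assumptions for illustration, and it presumes the generated code exposes a function that turns a password into the value stored in the database.

```python
import hashlib
import os
from typing import Callable, List, Union

Stored = Union[str, bytes]


def audit_password_hasher(hash_password: Callable[[str], Stored]) -> List[str]:
    """Return a list of findings; an empty list means neither check fired."""
    findings: List[str] = []
    probe = "correct horse battery staple"
    first = hash_password(probe)
    second = hash_password(probe)

    # Check 1: the plaintext must never appear inside the stored value.
    as_bytes = first if isinstance(first, bytes) else first.encode()
    if probe.encode() in as_bytes:
        findings.append("stored value contains the plaintext password")

    # Check 2: hashing the same password twice should differ if a salt is used.
    if first == second:
        findings.append("same password gives the same stored value (no salt)")
    return findings


# Two hypothetical candidates, standing in for LLM-generated code.
def unsalted(password: str) -> str:
    return hashlib.sha256(password.encode()).hexdigest()


def salted(password: str) -> bytes:
    salt = os.urandom(16)
    return salt + hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)


print(audit_password_hasher(unsalted))
print(audit_password_hasher(salted))
```

An empty result only means these two checks passed. It does not prove the code is secure, so it complements a manual review rather than replacing it.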
Perhaps in the future, models will be advanced enough to carry out much better self-audits. Coding harnesses will likely improve over time, too. Until then, blindly rolling out LLM-generated code to production is simply irresponsible.