AI Coding Tools: An Honest Verdict
After months of daily use across Copilot, Cursor, Codeium, and Claude Code, here is what AI coding assistants actually do well — and where they still fail.
The pitch for AI coding assistants has stayed remarkably consistent since GitHub Copilot launched in 2021: write less boilerplate, ship faster, spend more time on problems that matter. After months of daily use across four tools — GitHub Copilot, Cursor, Codeium, and Claude Code — I can tell you the pitch is partly true, often oversold, and occasionally backwards. Here is the honest version.
The Tools Under Test
Each tool targets a slightly different workflow:
GitHub Copilot integrates directly into VS Code and JetBrains IDEs via extension. It is the most widely deployed AI tool in enterprise environments, partly due to Microsoft’s distribution muscle and partly because it was first.
Cursor is a full fork of VS Code with AI baked into the editor at a deeper level. It supports multi-file context, inline edits, and a chat interface that understands your entire codebase — not just the open file.
Codeium is a free alternative to Copilot with a similar inline completion model. Its paid tier adds more context and a chat interface. Worth considering for solo developers who find Copilot’s pricing hard to justify.
Claude Code is a terminal-first agent from Anthropic. It operates in your shell, reads files, runs commands, and executes multi-step tasks. The mental model is closer to pairing with a colleague than using an autocomplete engine.
How I Tested
I used each tool on the same categories of real work over several weeks: writing new features, debugging production issues, refactoring legacy code, generating tests, and reviewing PRs. I deliberately avoided synthetic benchmarks — those tell you almost nothing about daily-driver experience.
The “Impressive Demo” Problem
Every AI coding tool looks astonishing in a ten-minute demo. The presenter opens a blank file, types a comment, and watches the model generate a complete, working module. Jaws drop. The demo is real — all four tools can do this.
The gap between demo and daily driver is not a lie, exactly. It is a context problem. Demos use greenfield code, well-scoped problems, and familiar patterns. Real codebases are none of those things.
What Actually Happens at Scale
When you work in a 200,000-line codebase with ten years of accumulated decisions, AI suggestions start failing in subtle ways. The model generates code that looks correct but violates project conventions, uses deprecated internal APIs, or duplicates logic that already exists three files away. You spend time reviewing and fixing suggestions that a senior colleague would have gotten right on the first try.
This is not a dealbreaker — it is a calibration. The question is not “does the AI make mistakes?” (it does). The question is “does it still save time despite the mistakes?” The answer depends heavily on the task.
Where AI Genuinely Helps
Boilerplate and Scaffolding
This is the unambiguous win. Generating CRUD endpoints, writing Zod schemas from TypeScript interfaces, scaffolding test files, translating REST responses into typed models — all of these are mechanical tasks where AI is faster than a human and the correctness bar is easy to verify. I estimate a 60–70% time reduction on this category of work.
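To make that concrete, here is a sketch of the interface-to-schema translation I mean, using a hypothetical User type rather than anything from a real codebase:

import { z } from "zod";

// Hypothetical source interface.
interface User {
  id: string;
  email: string;
  age: number;
  tags: string[];
}

// The Zod schema an assistant typically generates from it. Correctness is
// easy to verify because the fields map one-to-one to the interface.
const UserSchema = z.object({
  id: z.string(),
  email: z.string(),
  age: z.number(),
  tags: z.array(z.string()),
});

// You can then derive the static type from the schema instead of maintaining both.
type UserFromSchema = z.infer<typeof UserSchema>;

Typing that out by hand is not hard, just slow and tedious, which is exactly the kind of work worth delegating.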
Test Generation
Feeding a function to an AI and asking for a comprehensive test suite is one of the highest-ROI uses of these tools. The model covers edge cases you might have missed, writes readable describe/it blocks, and handles mock setup correctly most of the time. Tests are also easy to review — if a generated test is wrong, it either fails or a quick read catches it.
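For a sense of what that output looks like, here is the shape of suite these tools tend to produce. The slugify function is a made-up example, not code from any project I tested on:

// Hypothetical target function: collapse a title into a URL slug.
function slugify(title: string): string {
  return title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Representative generated suite (Jest/Vitest style): readable blocks,
// edge cases a human might skip, and easy to verify by reading.
describe("slugify", () => {
  it("lowercases and hyphenates words", () => {
    expect(slugify("Hello World")).toBe("hello-world");
  });

  it("strips punctuation and collapses repeated separators", () => {
    expect(slugify("AI -- Coding!! Tools")).toBe("ai-coding-tools");
  });

  it("trims leading and trailing separators", () => {
    expect(slugify("  --Edge Case--  ")).toBe("edge-case");
  });

  it("returns an empty string for whitespace-only input", () => {
    expect(slugify("   ")).toBe("");
  });
});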
Refactoring Known Patterns
Renaming a prop across 30 files, converting a callback-based API to async/await, migrating from one library to another with a known API mapping — these are tasks where AI tools, especially Cursor with multi-file context, perform very well.
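The callback-to-async conversion is a good representative. A hypothetical before-and-after (loadConfig is illustrative, not from a real project):

import { readFile } from "node:fs";
import { readFile as readFileAsync } from "node:fs/promises";

// Before: callback-based.
function loadConfig(path: string, cb: (err: Error | null, data?: string) => void) {
  readFile(path, "utf8", (err, data) => {
    if (err) return cb(err);
    cb(null, data);
  });
}

// After: the mechanical async/await rewrite these tools handle well,
// because the mapping between the two styles is well known.
async function loadConfigAsync(path: string): Promise<string> {
  return readFileAsync(path, "utf8");
}

The pattern is known, the transformation is mechanical, and the diff is easy to review. That combination is the sweet spot.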
Where It Still Fails
Complex Architecture Decisions
Ask an AI to design the state management architecture for a new feature that touches five existing systems, and you will get a plausible-looking answer that misses crucial constraints. The model does not know why your codebase is structured the way it is. It cannot weigh the political cost of a large refactor against the technical debt of a smaller hack. These decisions require human judgment, and AI suggestions in this space are more likely to mislead than help.
Subtle Bugs
Here is a real example. I asked Copilot to fix a race condition in an async job queue:
// AI-generated "fix" — looks reasonable, still broken
async function processQueue(queue: Job[]) {
  const results = await Promise.all(queue.map(job => processJob(job)));
  return results;
}
The suggestion replaced sequential processing with Promise.all, which made the race condition worse by running all jobs concurrently without any rate limiting or error isolation. The original bug was real; this “fix” introduced a new one. The model pattern-matched on “async + loop = Promise.all” without understanding why the original code was sequential.
Subtle, state-dependent bugs require understanding the system’s invariants. Current models are not reliable here.
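For contrast, and purely as an illustration: if the original code was sequential on purpose and the real goal was to isolate per-job failures, a safer change might look like the sketch below. The actual fix depends on invariants only someone who knows the system can state.

// Job and processJob stand in for the types from the example above.
type Job = unknown;
declare function processJob(job: Job): Promise<unknown>;

// Illustrative only: keeps the original sequential ordering (assuming order
// and bounded concurrency were the point) and adds per-job error isolation
// so one failing job does not abort the rest of the queue.
async function processQueueSequential(queue: Job[]): Promise<unknown[]> {
  const results: unknown[] = [];
  for (const job of queue) {
    try {
      results.push(await processJob(job));
    } catch (err) {
      results.push(err instanceof Error ? err : new Error(String(err)));
    }
  }
  return results;
}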
The Ethics Question
This deserves more than a paragraph, but it gets less than it should in most tool reviews. These models were trained on public code — including code under copyleft licenses, code with attribution requirements, and code written by developers who were never asked for consent.
The legal landscape is unsettled. Several class actions are pending. Using AI-generated code in a commercial product is a risk your legal team should be aware of. For open-source projects, the question of how to attribute AI contributions is genuinely unresolved.
None of this means you should not use these tools. It means you should use them with eyes open and advocate for clearer industry standards around training data and attribution.
Verdict and Recommendation
After extended daily use, here is where I land:
Use AI tools for: boilerplate, test generation, refactoring with clear scope, documentation, and any task where the output is easy to verify.
Do not rely on AI for: architectural decisions, debugging complex system interactions, security-sensitive code, or anything where a plausible-looking wrong answer is worse than no answer.
Of the four tools: Cursor is the best daily driver for complex codebases because of its multi-file context. Claude Code is the best for agentic tasks — running commands, editing multiple files, working through a problem end-to-end. Copilot wins on ubiquity and IDE integration. Codeium is the right choice if cost is a constraint.
The honest summary: these tools are genuinely useful, meaningfully less capable than their marketing implies, and getting better every six months. Adopt them with calibrated expectations and you will come out ahead.