Naive BDD: The Tests Ran, the Tests Passed, the App Didn't Work
13 days, 51 stories, 50+ BDD specs. Most passed. The integrations didn't work. A Potemkin village of green tests over broken functionality.
Insights on structured AI development, avoiding LLM-generated technical debt, and building production-ready Phoenix applications.
AI coding agents contradict themselves on long tasks for a mechanical reason. Every new instruction deprioritizes every prior one. Here's the math.
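For intuition on that claim: under standard softmax attention (a simplified model; the article's own derivation may differ), appending k tokens with scores comparable to the existing n shrinks every prior token's weight without deleting any of them:

```latex
% Softmax weight on context token i, before and after appending k
% instruction tokens with comparable scores s_j:
\[
  \alpha_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}
  \qquad\longrightarrow\qquad
  \alpha_i' = \frac{e^{s_i}}{\sum_{j=1}^{n+k} e^{s_j}}
  \;\approx\; \frac{n}{n+k}\,\alpha_i
\]
```

No instruction is dropped; each one just loses share of a normalized budget, which is why late instructions quietly win.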
Module spec files gave me and the model the same definition of done. The unit tests passed. The features were still broken. Why.
Every long AI coding session collapses for the same mechanical reason. Here's the five-step workflow that holds against it.
BDD's Three Amigos applied to AI coding agents: confirm scenario titles before any code is generated, plus a sealed-boundary spec module that can't cheat.
Get the rules out of the chat and into files the model re-reads at position zero of every session. The first workflow that held against attention drift.
MemPalace, mem0, Letta, vendor memory features. Why contamination, opacity, and lock-in outweigh the convenience for coding work. What actually breaks.
Zep, Graphiti, mem0 graph mode, and the self-evolving knowledge base trap that exploded after Karpathy's LLM-Wiki gist. When graph memory pays off.
The most-starred memory system in AI coding has 87k+ stars and isn't actually memory. Why the CLAUDE.md conflation costs you wrong tool choices.
Markdown files in your repo beat MemPalace, mem0, and every dedicated memory system. Cline Memory Bank, Doug's journal, Claude Code auto-memory.
RAG is over-applied to coding. Why retrieval breaks for most coding work, and the failure modes nobody mentions: wrong retrieval, context rot.
Process Claude Code transcripts into durable memory. session-kit, claude-mem, claude-memory-compiler, autoDream. The under-discussed third leg.
My harness was too module-spec heavy and too light on product management. Now prototyping a Three Amigos process for better BDD specs.
I got specs wrong twice before getting them right. The journey from module specs to BDD specs to executable boundary testing for AI-generated code.
Spec means 13 different things in software. If you're doing spec-driven development with AI, most definitions are wrong. BDD specs are the one that verifies.
Claude Opus 4.7 migration guide: three breaking API changes, a stealth 35% cost increase via tokenizer, and what's actually better.
Five levels from prompt engineering to platform engineering. Where most developers are, where the leverage is, and how to level up.
Pull docs from compiled BEAM files, embed locally with Ortex, search with sqlite_vec, serve through MCP. No API calls. No network.
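The extraction stage is standard Elixir; only the embedding (Ortex) and search (sqlite_vec) stages need extra libraries. A minimal sketch of that first stage, where DocsExtractor is a hypothetical module name:

```elixir
# A minimal sketch of the extraction stage, assuming the target modules are
# compiled with docs. Only Code.fetch_docs/1 here is standard Elixir.
defmodule DocsExtractor do
  # Pull docstrings straight out of a compiled BEAM file - no network calls.
  def extract(module) do
    {:docs_v1, _anno, :elixir, _format, _moduledoc, _meta, entries} =
      Code.fetch_docs(module)

    # Keep only entries that carry an English docstring; :hidden and :none
    # entries fail the pattern match and are skipped.
    for {{kind, name, arity}, _anno, _sig, %{"en" => doc}, _meta} <- entries do
      %{kind: kind, name: "#{name}/#{arity}", doc: doc}
    end
  end
end

# DocsExtractor.extract(Enum) |> Enum.take(2)
```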
One-click screenshot capture in a LiveView feedback widget using html-to-image, colocated hooks, and presigned S3 uploads.
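On the server side, the presigned upload can be a single controller action. A hedged sketch assuming ExAws is installed and configured; MyAppWeb.FeedbackController, the bucket name, and the route are hypothetical:

```elixir
defmodule MyAppWeb.FeedbackController do
  use MyAppWeb, :controller

  # Hand the client a short-lived PUT URL so the html-to-image PNG goes
  # straight to S3 and never touches our server.
  def presign(conn, %{"filename" => filename}) do
    config = ExAws.Config.new(:s3)
    key = "feedback/#{System.unique_integer([:positive])}-#{filename}"

    {:ok, url} =
      ExAws.S3.presigned_url(config, :put, "feedback-screenshots", key,
        expires_in: 300
      )

    json(conn, %{upload_url: url, key: key})
  end
end
```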
The client isn't just a browser. Browser apps, PWAs, mobile apps, desktop apps - what they are, what they can do, and why it matters.
Does the toilet belong in the bathroom or the living room? A guide to putting the right parts of your app in the right place.
A plain-English guide to what code is, what JavaScript, React, Supabase, and Vercel actually are, and why your AI picked them.
A server is just a computer. Every service your AI signs you up for is someone else's computer. Here's why you're paying $50/month and how to pay $4.
Your app runs in two places. If you don't know which is which, your data will leak. A plain-English guide for vibe coders.
AI agents drift when they don't know what decisions you've already made. ADRs fix that. Here's how to write them for agent consumption.
Anthropic launched Managed Agents in public beta. They're literally selling the harness now. Here's what it is, what it costs, and why it matters.
Anthropic's next model leaked before they were ready. 93.9% SWE-bench claimed. Here's what's confirmed, what's speculation, and why the harness still matters.
Cursor 3 demoted the IDE in favor of an agent switchboard with 8 parallel workers, mobile control, and Pro at $20/mo. What works and what's marketing.
Progressive disclosure is a 30-year-old UX principle that solves AI agent context bloat. Practical patterns for skills, CLAUDE.md, and MCP tools.
A 26-point quality gap between AI-only code and human-guided architecture. Here's which patterns produce the best agent output, ranked.
Devs are 19% slower with AI but perceive themselves as 20% faster. Vibe coding has 2.74x more security vulns. Here's what the implementation phase actually looks like.
Bug fixes, dependency updates, security patches, tech debt. Maintenance is 60-80% of software cost and it's where agents deliver the most proven value.
96% of developers don't trust AI code but commit it anyway. The verification gap is the central problem. Here's how to close it.
The bottleneck moved from writing code to knowing what to build. Here's how AI is changing requirements gathering and why bad specs kill agent output.
Most teams use AI agents for one phase of development. Here's what the full lifecycle looks like across all eight phases, with data.
Refactoring dropped 60% and duplication rose 48% after AI adoption. The fix is spec-driven development. Here's how it works.
When the same AI writes your code and your tests, you don't have tests. You have a mirror. Here's how to break the loop.
AI agents can write code but can't deploy it. I close the gap with 4 markdown files instead of giving my agent cloud credentials.
Marketing isn't a content problem. It's a system problem. The Claude Code loop I built with Reddit MCP, GA4, Search Console, and 30 minutes a day.
Most developers treat their AI coding tool as one thing. It's five layers. Here's the framework that changes how you evaluate and build with them.
The agent loop is a while loop that changed software. Here's how tool use, context management, and ReAct turn a token predictor into a coding tool.
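Stripped of vendor detail, the loop fits in a dozen lines. A conceptual Elixir sketch, not any vendor's API; call_model/1 and run_tool/2 are hypothetical stubs standing in for a real model call and tool dispatch:

```elixir
defmodule AgentLoop do
  # Send context to the model; if it requests a tool, run it, append the
  # result to the context, and go around again until it answers.
  def run(messages) do
    case call_model(messages) do
      {:tool_call, name, args} ->
        result = run_tool(name, args)
        run(messages ++ [{:tool_result, name, result}])

      {:final, answer} ->
        answer
    end
  end

  # Stubs so the sketch runs: "call a tool" a few times, then finish.
  defp call_model(messages) when length(messages) > 3, do: {:final, "done"}
  defp call_model(_messages), do: {:tool_call, "read_file", %{path: "mix.exs"}}
  defp run_tool(_name, _args), do: "tool output"
end
```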
CLI, IDE, or cloud? Sandboxed or wide open? The environment determines what your AI coding agent can do. Here's why it matters more than you think.
OpenAI shipped 1M lines with zero manually written source. The secret wasn't the model. It was the harness - constraints, verification, lifecycle.
The model didn't write your code. It predicted tokens. Everything else is the harness. Here's why that matters more than benchmarks.
One agent hitting its ceiling? Multi-agent coordination is the next frontier. Here's what works, what doesn't, and why the demo-to-production gap is wide.
GitHub Copilot deep dive: $10/mo Pro tier, Coding Agent, 60M+ code reviews, Copilot Memory, and what Reddit developers actually think.
Aider deep dive: 50+ model support, 4.2x token efficiency vs Claude Code, best-in-class git integration, and what Reddit developers actually think.
Gemini CLI deep dive: 1,000 free requests/day, improving quality with 3.1 Pro, Jules async agents, and what Reddit developers actually think.
Codex CLI deep dive: open source Rust CLI, 2-3x token efficiency, 9,000+ plugins, and what Reddit devs actually think. Pricing, strengths, weaknesses.
Cursor deep dive: $2B ARR, Background Agents, MCP Apps, credit-based billing, and what Reddit devs actually think. Features, pricing, and assessment.
Your CLAUDE.md is settings. Your skills are libraries. Your hooks are middleware. Two activities, one progression.
Claude Code deep dive: highest-rated for code quality, Agent Teams, MCP ecosystem, and what Reddit developers actually think. Pricing and weaknesses.
The most-loved tool (Claude Code) is fully closed. The most-starred (OpenCode, 117K) is fully open. Analysis of 21 tools shows when to choose which.
Supermaven was acquired. Aide is sunsetting. Void went silent. Why AI coding tools die, what patterns predict failure, and which tools are at risk today.
A web server returning navigable markdown replaces CLAUDE.md stuffing, MCP proliferation, and filesystem sync problems.
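The server itself can be tiny. A minimal sketch using Plug.Router, with hypothetical routes and content; the point is that the agent follows links on demand instead of loading everything up front:

```elixir
defmodule DocsServer do
  use Plug.Router

  plug :match
  plug :dispatch

  # Serve plain markdown with links the agent can navigate.
  get "/" do
    body = """
    # Project docs
    - [Architecture](/architecture.md)
    - [Decisions](/decisions.md)
    """

    conn
    |> put_resp_content_type("text/markdown")
    |> send_resp(200, body)
  end

  match _ do
    send_resp(conn, 404, "not found")
  end
end
```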
Amazon's Kiro uses EARS notation. CodeMySpec uses BDD. Both bet that specs before code = better AI output. We tested both approaches.
Model Context Protocol is USB for AI agents. 1,000+ servers, adopted by Anthropic, OpenAI, Google. What MCP is, who supports it, and what it enables.
9 free and open-source AI coding tools compared. Gemini CLI is truly free. Aider and Cline match paid tools. When is BYOK cheaper than subscriptions?
Cursor 3 Glass, Windsurf's price hike, Zed's 1M context BYOK, Kiro's AWS Transform. Updated April 2026 comparison of pricing, philosophy, and fit.
How to write Claude Code skills that actually work. Real examples, common mistakes, and how skills differ from prompts, MCP servers, and hooks.
Claude Code accounts for 4% of GitHub commits. Gemini CLI hit 90K stars. The terminal won the AI coding war nobody expected. Here's why.
6 CLI coding agents compared: independent testing, pricing, and community sentiment. Refreshed April 2026 with Opus 4.7, Codex $100 tier, Gemini restrictions.
From autocomplete to fully autonomous development. A framework for understanding where you are with AI coding tools and where the real leverage is.
Unit tests and BDD specs verify pieces. QA verifies the running application - story QA, journey QA, and automated issue filing by AI agents.
Unit tests verify your code works. BDD specs verify your app does what users actually want. One scenario per acceptance criterion, traced to user stories.
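In ExUnit terms, the shape is one test per acceptance criterion under a describe block per story. A hedged sketch; MyApp.Checkout and its API are hypothetical:

```elixir
defmodule MyApp.CheckoutSpecTest do
  use ExUnit.Case, async: true

  # Story: "As a shopper, I can apply a discount code at checkout."
  describe "apply discount code" do
    test "AC1: a valid code reduces the order total" do
      assert {:ok, order} = MyApp.Checkout.apply_code(order_fixture(), "SAVE10")
      assert order.total < order_fixture().total
    end

    test "AC2: an expired code is rejected with a reason" do
      assert {:error, :expired} =
               MyApp.Checkout.apply_code(order_fixture(), "OLD10")
    end
  end

  defp order_fixture, do: %{total: 100}
end
```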
55 commits, 100K+ lines, 100+ QA issues caught in 5 active development days. How BDD specs and agentic QA verified a fuel card management platform.
How CodeMySpec verifies AI-generated code with a 7-stage validation pipeline, dirty tracking, BDD specs, and end-to-end QA journeys.
How we built a full-stack permission approval system for Claude Code that lets you approve tool calls from your phone with Web Push and Phoenix Channels.
Learn the architecture, planning, and process iteration that keep LLMs on track.
Write one design doc per code file to prevent architectural drift and keep LLMs on track.
Learn to design Phoenix contexts and vertical slice architecture to keep AI-generated code consistent.
Phoenix contexts provide self-contained modules, consistent patterns, and built-in testability that make them ideal for AI-assisted development.
A practical approach to using user stories for AI code generation. Keep LLMs focused on requirements, maintain living documentation, and avoid technical debt.
The best way to get reliable code from an LLM is tighter control and enforcement: predefined workflows, validation, and test-driven development.
Design-driven development adds explicit, reviewable designs that define component architecture before implementation begins.