"Just ask ChatGPT to write the tests." Easy headline—messy reality. We benchmarked three GenAI engines (GPT-4o, Claude 3, Gemini 1.5) on an eight-service Node + Go platform. Verdict: 70–86 % line coverage in one work-day is real, but only after you automate prompts, deduplicate snapshots, and gate flake-rate. This post walks through the exact prompts, the GitHub Action, and the coverage delta, plus the seven cleanup steps that turned AI noise into shift-left value.
Unit tests guard refactors, but writing them by hand stalls when deadlines loom. GenAI promises to "write the boring 80 %"; if true, that changes the economics of testing. But inflated marketing claims abound, so we ran a controlled experiment to separate signal from sizzle.
| Parameter | Details |
|---|---|
| Codebase | 8 microservices (5 Node TS, 3 Go), 42 K LoC |
| Existing tests | 28 % line coverage, 150 hand-written tests |
| GenAI engines | GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, each via API |
| Prompt driver | Custom CLI: `gen-test <file>` inserts the test next to the code |
| Timebox | One engineer, 7.5 h work-day |
| Acceptance | Coverage by nyc (Node) & `go test` (Go); flake-rate < 3 % over 5 runs |
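The flake-rate acceptance check can be sketched as follows. This is an illustrative definition and helper (not part of our tooling): a test counts as flaky if its pass/fail outcome is not identical across all five runs.

```typescript
// Illustrative helper: runs[i][t] is the pass/fail outcome of test t on run i.
// A test is flaky when its outcome differs between any two runs;
// flake rate = flaky tests / total tests.
function flakeRate(runs: boolean[][]): number {
  const numTests = runs[0].length;
  let flaky = 0;
  for (let t = 0; t < numTests; t++) {
    if (runs.some((run) => run[t] !== runs[0][t])) flaky++;
  }
  return flaky / numTests;
}
```

With five recorded runs, `flakeRate(runs) < 0.03` is the gate we enforce.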
````text
You are TestWriterGPT. Write a COMPLETE unit test for the
following source file in <LANG>. Use <FRAMEWORK>.
Constraints:
1. Cover every branch & error path.
2. Mock external deps, NO network calls.
3. Fail test immediately if unhandled promise / panic.
Return ONLY the code in a markdown ``` block.
````
Variables:

| File type | `<LANG>` | `<FRAMEWORK>` |
|---|---|---|
| `.ts` | TypeScript | Jest |
| `.go` | Go | Testify + httptest |
Automation tip: the CLI passes the file path and inserts the resulting snippet into `<file>.gen.test.ts|go`.
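The prompt-filling step of such a driver might look like this. A minimal sketch: the names (`buildPrompt`, `FRAMEWORKS`, `outputPath`) are illustrative, not the actual `gen-test` internals.

```typescript
// Sketch of a gen-test-style CLI's prompt-filling step (illustrative names).
const TEMPLATE = `You are TestWriterGPT. Write a COMPLETE unit test for the
following source file in <LANG>. Use <FRAMEWORK>.`;

// Table-driven mapping mirroring the Variables table above.
const FRAMEWORKS: Record<string, { lang: string; framework: string }> = {
  ".ts": { lang: "TypeScript", framework: "Jest" },
  ".go": { lang: "Go", framework: "Testify + httptest" },
};

// Fill the <LANG>/<FRAMEWORK> placeholders from the file extension.
function buildPrompt(filePath: string): string {
  const ext = filePath.slice(filePath.lastIndexOf("."));
  const entry = FRAMEWORKS[ext];
  if (!entry) throw new Error(`Unsupported file type: ${ext}`);
  return TEMPLATE.replace("<LANG>", entry.lang).replace(
    "<FRAMEWORK>",
    entry.framework,
  );
}

// Where the generated test lands, e.g. src/util.ts -> src/util.gen.test.ts
function outputPath(filePath: string): string {
  const dot = filePath.lastIndexOf(".");
  return `${filePath.slice(0, dot)}.gen.test${filePath.slice(dot)}`;
}
```

The full source file is then appended after the filled template before the request is sent to the engine's API.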
| Engine | Coverage Δ | Tests Added | Flake Rate |
|---|---|---|---|
| GPT-4o | +58 pp → 86 % | 312 | 2.1 % |
| Claude 3 Opus | +54 pp → 82 % | 298 | 1.6 % |
| Gemini 1.5 | +42 pp → 70 % | 265 | 4.8 % |
pp = percentage-point rise over baseline.

Takeaway: with GPT-4o, our single engineer hit 86 % coverage in ~7 h—headline achieved.
Deterministic Stubs

AI sometimes mocks `Date.now()` with the real time, which yields flaky tests. Pin it to a fixed epoch:

```ts
jest.spyOn(Date, 'now').mockReturnValue(1700000000000);
```
GenTest Marker Header

Each generated file starts with:

```ts
// Generated by GenAI – Edit cautiously
```
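The marker also lets tooling tell generated tests apart from hand-written ones, e.g. in a lint rule or a review bot. A sketch of such a check (our own helper, not part of `gen-test`):

```typescript
// Detect AI-generated test files by their marker header (illustrative helper).
const GEN_MARKER = "// Generated by GenAI";

function isGenerated(source: string): boolean {
  // Only trust a marker on the first non-blank line, to avoid false
  // positives from the string appearing inside test fixtures.
  return source.trimStart().startsWith(GEN_MARKER);
}
```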
Net time for cleanup: 2 h 10 m out of 7.5 h; still faster than manual.
```yaml
jobs:
  ai-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install deps
        run: npm ci
      - name: Run generated tests 5 times (fail on flake)
        run: |
          # Any failing repeat is treated as flake; a non-zero exit fails the job.
          for i in {1..5}; do npm test -- run || exit 1; done
      - name: Coverage Gate
        run: npm run coverage
      - name: Enforce budget
        run: node scripts/check-budget.js 80
```
`check-budget.js` reads global coverage and fails the PR if it is below 80 %. Median Action runtime with GPT-4o tests: 3 min 40 s.
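A plausible sketch of `scripts/check-budget.js`, assuming nyc runs with its `json-summary` reporter (which writes `coverage/coverage-summary.json` containing a `total.lines.pct` field); the real script may differ:

```typescript
// Sketch of scripts/check-budget.js: fail the PR when line coverage
// drops below the budget passed as the first CLI argument.
import { existsSync, readFileSync } from "node:fs";

type Summary = { total: { lines: { pct: number } } };

// Pure check, unit-testable without a real coverage file.
function meetsBudget(summary: Summary, budgetPct: number): boolean {
  return summary.total.lines.pct >= budgetPct;
}

function main(): void {
  const budget = Number(process.argv[2] ?? 80);
  const summary: Summary = JSON.parse(
    readFileSync("coverage/coverage-summary.json", "utf8"),
  );
  if (!meetsBudget(summary, budget)) {
    console.error(`Coverage ${summary.total.lines.pct}% < budget ${budget}%`);
    process.exit(1); // non-zero exit fails the PR check
  }
}

// Invoked as: node scripts/check-budget.js 80
if (existsSync("coverage/coverage-summary.json")) main();
```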
| Provider | Tokens Used | API Cost |
|---|---|---|
| GPT-4o | 1.8 M | $9.00 |
| Claude 3 Opus | 1.6 M | $12.80 |
| Gemini 1.5 | 1.9 M | $5.70 |
Cost per LoC covered: $9 / ≈24 K LoC ≈ $0.00037, i.e. under four cents per hundred newly covered lines.
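For reproducibility, the arithmetic behind that figure (the ≈24 K covered-LoC number is the +58 pp GPT-4o delta applied to the 42 K-LoC codebase):

```typescript
// Cost per newly covered line of code for the GPT-4o run.
const apiCostUsd = 9.0;     // from the cost table above
const totalLoc = 42_000;    // codebase size
const coverageDeltaPp = 58; // +58 pp coverage gain with GPT-4o

const coveredLoc = totalLoc * (coverageDeltaPp / 100); // 24,360 lines
const costPerLoc = apiCostUsd / coveredLoc;            // ≈ $0.00037
const costPerHundredLoc = costPerLoc * 100;            // ≈ $0.037
```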
A handful of edge cases we still hand-write. We tag such files with `// @skip-genai` so the CLI skips them.
| Week | Milestone |
|---|---|
| 1 | Install the CLI, generate tests for the `utils/` directory |
| 2 | Expand to services with < 500 LoC |
| 3 | Raise the coverage budget gate to 70 % |
| 4 | Apply to the full repo: budget 80 %, flake gate < 3 % |
By Week 4, most squads report a 30–40 % drop in escaped defects.
Tag critical files with `// @skip-genai`.