It all started shortly after I switched my editor to Cursor. At the time, I was researching how to remove EmotionJS from a legacy project with 10K+ lines of code. Editing all of that by hand seemed impossible, so I tried Cursor's Composer.
It didn’t go as I expected. The tool kept generating incorrect code that I had to fix manually. That’s when I wondered: what if there was an AI that could take more time but generate accurate code? And that thought sparked everything.
A Surprisingly Smooth Start
As one of the founding engineers at Airbridge, I had experience building an analytics SDK. SDK integration is supposed to be simple, but for many clients, it delayed onboarding by weeks or even months. Some would even churn before they were fully set up. I often visited their offices in person to help speed up the process.
That earlier experience, combined with my idea for an AI CodeGen tool, felt like a perfect match: a codegen AI that helped with SDK integration and migration could cut customer support costs for SDK companies!
In just a week, using some code from a previous AI project, I had a working prototype. I even secured a meeting with a potential enterprise client! The accuracy of the prototype was shockingly good for something built in a week.
I felt like I had nailed it. I was certain that revenue would follow quickly.
Harsh Feedback and… the Prompt Engineering Hell
But the feedback we got from our users was unexpectedly harsh:
“Isn’t this just copying and pasting from the docs?”
“The accuracy is terrible. This could cause errors.”
“That’s not how you should handle migrations...” (even though no migration guide existed!)
I was stunned. Something I thought was brilliant was dismissed as a "copy-paste" job. It was disappointing, but I realized we had to face reality and fix the problems.
The doubts about accuracy sent me into deep thought. I kept tweaking prompts and refining our system to increase accuracy, but hallucinations were inevitable. It felt like a losing battle: the more I tried to fix it, the more the system overfit to specific use cases, and evaluation time ballooned until it was hard to even tell whether accuracy had improved.
Customers remained unimpressed. After a month wasted on prompt engineering, their feedback still wasn’t getting better.
The Trap of Building
Then came a day when I couldn’t focus on development. My motivation was low, and I was moving slower than usual. That’s usually a sign of one of two things: you’re sick, or your confidence is crumbling. This time, it was the latter.
While reflecting on why I lost confidence, I stumbled upon a quote on the YC website: “Make something people want.” Who exactly were we building this for? I realized we had been so obsessed with the product itself that we lost sight of who the users were.
People often talk about how Superhuman spent a year building its product. But what they forget is that before all that development, they spent a long time gathering customer insights and validating their direction. We skipped that step.
Realizing this, we scrambled to reach out to SaaS companies. The responses were lukewarm. Looking back, this was the moment I regret most. I had foolishly believed that if we built a great product, customers would naturally come. Resources are limited, time is unforgiving, and no news isn’t good news.
What We Did Wrong
One-directional user flow makes both you and your users unhappy
From a UX perspective, we had a problem. We built the tool so that users would connect their codebase, provide some basic info like an SDK key, and immediately get a finished pull request. The problem was, users had no way to convey their intent, so the output was often wrong. And they had no means to give feedback, so they assumed the tool was inaccurate.
We later switched to a conversational UX. Instead of providing a “final answer” right away, we focused on producing a good first draft and making it easy for users to provide feedback. As a result, complaints about accuracy started to drop.
We had believed a good system prompt could solve all user complaints. But sometimes, customer problems can be solved in simpler, more effective ways.
Don’t believe praise from people who didn’t pay
Praise from people who haven’t paid for your product is meaningless. During user interviews, I heard a lot of nice feedback, and it led me down the wrong path. People don’t always know what their problems are. When they see a solution, they say “looks good,” but that feedback is worthless unless it’s backed by action. You can’t jump to conclusions in business. The comfort of easy answers leads to disaster.
Slowly Changing Direction
In the end, we stopped everything and started rethinking. What were we building, and who were we building it for? While the answer wasn’t perfectly clear, we eventually settled on “developers.” SDK installation and migration are long, tedious tasks that require humans to make countless code changes. What if we could generalize that?
There are two kinds of software development: building something from scratch and building on top of what’s already there. Most code generation tools focus on the former. But in the real world, both are necessary.
Seeing that gap, we made a small pivot. We shifted our focus to precise tools for large-scale migrations, like swapping Emotion for Tailwind or converting Next.js Pages Router to App Router. This was the moment we became customers of our own product.
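To make that concrete, here’s a purely illustrative sketch of a single Emotion-to-Tailwind conversion. The component and the Tailwind class mapping are made up for this post, not taken from a real customer codebase, but this is the kind of edit that has to be repeated across hundreds of files:

```tsx
/** @jsxImportSource @emotion/react */
// Before: a component styled with Emotion's css prop (illustrative only)
import { css } from "@emotion/react";

export function Badge({ label }: { label: string }) {
  return (
    <span
      css={css`
        display: inline-block;
        padding: 4px 8px;
        border-radius: 9999px;
        background: #eef2ff;
      `}
    >
      {label}
    </span>
  );
}

// After: the same component rewritten with Tailwind utility classes
export function BadgeAfter({ label }: { label: string }) {
  return (
    <span className="inline-block px-2 py-1 rounded-full bg-indigo-50">
      {label}
    </span>
  );
}
```

Each individual change is trivial; the pain is doing it consistently and correctly across an entire codebase.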
Self-Validation with o1
Around that time, OpenAI launched its o1 model, which has Chain-of-Thought reasoning baked in.
I say this was perfect timing because we had just switched to a conversational UX and received our first real feedback. The feedback was less about accuracy and more about the importance of producing a reasonable first result. Was I going to end up in another battle over accuracy?
But this time, it didn’t feel as hopeless. The o1 model’s ability to reason through tasks gave me hope. If AI could plan, identify errors, and improve over time, maybe it would work. So we built a simple planning/validation agent and applied it to our tool. Suddenly, the AI was avoiding the kinds of mistakes users had previously flagged, producing much more accurate results.
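A minimal sketch of what such a loop can look like (the names here, including the callModel helper and the prompts, are placeholders for illustration, not our actual implementation):

```ts
// Rough sketch of a plan -> generate -> validate -> revise loop.
// `callModel` is a placeholder for whatever LLM client you use.
async function callModel(prompt: string): Promise<string> {
  throw new Error("wire this up to your model provider");
}

async function migrateFile(task: string, sourceFile: string): Promise<string> {
  // 1. Plan: have the model spell out the steps before touching any code.
  const plan = await callModel(
    `Break this migration task into numbered steps:\n${task}`
  );

  // 2. Generate: produce a first draft of the updated file, guided by the plan.
  let draft = await callModel(
    `Apply this plan to the file below and return the full updated file.\n\nPlan:\n${plan}\n\nFile:\n${sourceFile}`
  );

  // 3. Validate and revise: let the model critique its own draft a few times.
  for (let attempt = 0; attempt < 3; attempt++) {
    const critique = await callModel(
      `List concrete problems with this migrated file, or reply "OK" if there are none:\n${draft}`
    );
    if (critique.trim() === "OK") break;
    draft = await callModel(
      `Fix these problems and return the full updated file.\n\nProblems:\n${critique}\n\nFile:\n${draft}`
    );
  }

  return draft;
}
```

The point isn’t the prompts themselves but the structure: the model plans, drafts, and critiques its own work before a human ever sees the result.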
We had solved the issue of code quality, and now all that was left was launching.
The Lingering Fear
Now, as we prepare for our beta launch, I’ve rewritten the landing page three times in a week. It’s a clear sign that I still lack confidence. That lack of confidence is the price I pay for not having spent the last two months gathering more user insights.
There’s still fear. What if no one cares about our product? What if all this time and effort was wasted? But I’m not giving up until we get a definitive answer.
A friend of mine, Chanhee, once said something that stuck with me:
If things aren’t going well and there’s no way to recover, it’s a failure. But if you can recover, it’s leverage.
Leverage always comes with risk, and I’m willing to take it. If we fail, at least we’ll walk away with lessons that will help us next time.
Catch
Our AI CodeGen tool is called Catch. Yes, like in try-catch.
Catch handles large-scale tasks while you sleep. Unlike Cursor or Copilot, it’s designed to generate precise, large-scale code modifications. If you’re dealing with refactoring or migrating a project, book a call! Let me help you pay off your technical debt.