
Breaking the “me too” drug cycle in biotech

A significant limitation of AI-driven biotech has been the training data. With so many models drawing on the same pool of public data, most AI outputs have failed to deliver on the promise of cutting-edge, novel drugs. Biotechs with a stronger emphasis on new data generation (and proper context capture) are well-poised to challenge this trend and establish a large competitive advantage.

A few weeks ago, we spoke with an ML engineer who’s worked at multiple AI-focused biotech companies. They’ve built models, evaluated platforms, and talked to teams across the industry. When we asked them what’s most valuable to work on in AI x Life Sciences right now, they said:

“It completely comes down to how much data and context you can capture. Without that, these models have little edge because everyone is basically training on the same publicly available data.”

It’s a simple statement, but it captures a truth that’s become hard to ignore.

The public data ceiling

Over the last decade, hundreds of AI drug discovery companies have launched. Based on 2024 data, only a few dozen have reached Phase 1 trials. None have an FDA-approved drug. The problem isn’t a lack of ambition or technical skill – it’s that most of these models are learning from the same raw material.

Public datasets are, by definition, a record of what’s already been explored and published. Train a model on that, and it will naturally point you back to the same well-known targets, the ones with the most data behind them. That’s why so many AI-designed drugs fall into the “me too” category: new molecules aimed at the same pathways and mechanisms that have been drugged successfully for years (or pre-existing drugs on already known targets).

The economics of this are predictable. When you’re the seventh entrant on a target, you’re not competing on breakthrough efficacy. You’re competing on small gains in safety, manufacturability, or delivery, fighting over slivers of market share. The market prices that reality in quickly, which is why so many companies that raised massive rounds on the promise of platform-generated breakthroughs saw valuations fall when their lead assets turned out to be incremental improvements.

How do you break the ceiling?

Better models won’t fix this. The most advanced architecture in the world can’t produce genuinely novel insights from data that only reflects established, accessible knowledge. The only way to break that ceiling is to train on information no one else has – and to capture it with the context intact so it can actually inform the next generation of work (we’ve previously written on why data, not models, win big).

Some companies are already orienting around this. Teams like Peptone, Noetik, and New Limit are explicit about generating large volumes of proprietary, context-rich experimental data as their primary competitive advantage. They understand that in this space, the dataset is the product.

At the same time, large model players like OpenAI, Google, and others are pushing to connect models to more and more external data sources – not because their underlying models are suddenly “smarter,” but because the usefulness of any model is capped by the environment and context it can access.

The principle is the same whether you’re training your own model or using someone else’s: without rich, connected, proprietary data, you can’t get beyond what everyone else already knows.

This is where most companies hit a wall. Even the ones doing novel science often fail to capture their own data in a way that’s useful for training models. Lab notebooks, scattered spreadsheets, and siloed systems mean experimental results and the reasoning behind them are fragmented. By the time anyone tries to connect it all, the decision-making trail is gone.

We see the difference it makes when this gap is closed. One of our customers runs sophisticated physics-based simulations to generate thousands of potential compounds. Before Kaleidoscope, deciding which ones to synthesize was a slow, manual process that left much of the reasoning undocumented. Now, simulation results flow directly into the platform, the team can prioritize collaboratively with their rationale captured alongside, and incoming lab results are connected to the original hypotheses. What they’ve built is a living dataset of what was tried, why, and what happened: exactly the kind of resource a model can learn from to improve the next iteration.
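To make that concrete, here is a minimal sketch of what one entry in such a dataset might look like, assuming a simple record structure. The names here (CompoundRecord, Decision, and the example values) are our own illustrative choices, not Kaleidoscope’s actual data model:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Decision:
    """A prioritization call, captured with its reasoning at the time it was made."""
    author: str
    rationale: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class CompoundRecord:
    """Links a simulated compound to the hypothesis behind it, the decisions
    made about it, and the lab result that eventually comes back."""
    compound_id: str
    hypothesis: str                      # what the simulation was meant to test
    simulation_score: float              # e.g. a predicted binding affinity
    decisions: list[Decision] = field(default_factory=list)
    assay_result: float | None = None    # filled in when lab data returns


# One compound's trail: simulated, prioritized with rationale, then measured.
record = CompoundRecord("cmpd-0042", "pocket B tolerates bulkier substituents", 0.91)
record.decisions.append(Decision("chemist_a", "high score and novel scaffold; synthesize"))
record.assay_result = 0.73  # hypothetical lab readout closing the loop
```

The point isn’t the schema; it’s that the rationale travels with the data, so a future model can see why a compound was chosen, not just that it was.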

Another partner is engineering a novel process for tissue preservation. In Kaleidoscope, experiment designs are sent programmatically to the machines used to interrogate the process, results flow back automatically, and every iteration is linked to the decision logic that produced it. They're aiming to create a fully closed loop where design, execution, and analysis all feed into the same connected record. That record becomes the foundation for future decisions, whether those are made by people or by models trained on the data.
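Again purely as an illustration, here is the shape of such a closed loop, with the instrument call and the decision step stubbed out (run_on_instrument, propose_next, and the design parameters are hypothetical placeholders, not a real machine API):

```python
import json
from datetime import datetime, timezone


def run_on_instrument(design: dict) -> dict:
    """Stub for sending a design to the machine and collecting results."""
    return {"viability": 0.8}  # placeholder readout


def propose_next(design: dict, result: dict) -> tuple[dict, str]:
    """Stub for the analysis step: returns the next design and the reasoning."""
    next_design = {**design, "hold_time_min": design["hold_time_min"] + 5}
    return next_design, "viability below target; extending hold time"


def run_iteration(design: dict, log_path: str = "loop_log.jsonl") -> dict:
    """One design-execute-analyze pass, appended to a single connected record."""
    result = run_on_instrument(design)
    next_design, rationale = propose_next(design, result)
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "design": design,
            "result": result,
            "rationale": rationale,  # decision logic stays linked to the data
            "next_design": next_design,
        }) + "\n")
    return next_design


design = {"temp_c": 4, "hold_time_min": 30}
for _ in range(3):  # each iteration is linked to the one before it
    design = run_iteration(design)
```

Each line of that log is a complete design-result-decision triple: exactly the connected record that future decisions, human or model-driven, can build on.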

In both cases, the advantage isn’t the model itself. 

It’s the dataset that emerges when novel experimental work is captured systematically. Proprietary data in itself isn’t rare – every biotech generates it to some degree. What’s rare is proprietary data that’s complete, well-structured, and rich with the context behind each decision. Without that context, even a proprietary dataset becomes just another collection of results. Models trained on it won’t be able to tell the difference between a lucky hit and a carefully reasoned breakthrough.

It’s worth noting that this kind of dataset is also incredibly hard to replicate. Competitors can license the same public databases you started with, hire similar talent, and even buy comparable lab equipment. What they can’t do is recreate years of tightly coupled experimental results and reasoning that only your team has generated. The advantage compounds over time: every design cycle, every experiment, every iteration adds another layer to the moat. The more you build, the harder it is for anyone else to catch up.

In AI drug discovery, where the core limitation is the sameness of the data most companies rely on, that moat is everything. Because if your data looks like everyone else’s, your AI will too. And in this field, we really don’t need another incremental improvement. We need a step change.


Kaleidoscope is a software platform for biotechs to robustly manage their R&D operations. With Kaleidoscope, teams can plan, monitor, and de-risk their programs with confidence, ensuring that they hit key milestones on time and on budget. By connecting projects, critical decisions, and underlying data in one spot, Kaleidoscope enables biotech start-ups to save months each year in their path to market.