
Federated data lakes in biopharma: why “just put everything in one place” doesn’t work.

In most biopharma organizations, the data story is the same.

You have information scattered across every imaginable system: LIMS, ELNs, file shares, Box, SharePoint, internal databases, Slack, Teams, email. Experimental results live in a dozen places, the protocols that produced them in another five, and the decision-making context somewhere else entirely. Even within the same company, two teams can be running related experiments with no practical way to see the full picture of each other’s work.

The common advice you’ll hear: consolidate everything into one central repository. On paper, it sounds efficient. In reality, when you’re already operating at scale, it’s close to impossible. You’re not going to convince hundreds or thousands of scientists to abandon the systems they’ve been using for years. Even if you could, the transition would take years – during which R&D keeps moving forward, generating more data in even more places.

The result is predictable: teams invest heavily in AI initiatives, only to realize their models can’t access, or can’t make sense of, most of their own data.

This is where the idea of a federated data lake comes in. The principle is straightforward: don’t try to move all the data. Instead, create a layer that can reach into each source, pull out the relevant data, and preserve the context that makes it meaningful – this part is key. A lab result without the hypothesis behind it is just a number. A simulation output without the prioritization discussion that followed is just a file.

This is the role that Kaleidoscope's data platform plays.

Instead of requiring teams to change the way they work, Kaleidoscope connects to the tools and systems they already use. Data from those systems flows into Kaleidoscope and is structured into programs, projects, and experiments, with the links between them preserved. That means a compound generated in a physics-based simulation can be tied directly to the decision that advanced it, the assays that tested it, and the results those assays produced.
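One way to picture that linked structure: a minimal sketch in Python of records whose provenance links travel with them. The record and relation names here are illustrative assumptions, not Kaleidoscope’s actual data model or API.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """A federated record: the data stays in its source system; the layer
    keeps a reference to it plus the links that give it context."""
    id: str
    kind: str      # e.g. "simulation", "decision", "assay"
    source: str    # the system where the data still lives
    links: dict = field(default_factory=dict)  # relation -> record id

# A compound's journey, reconstructed across systems:
sim = Record("sim-001", "simulation", "in-house-model")
decision = Record("dec-007", "decision", "slack",
                  links={"advanced_from": sim.id})
assay = Record("assay-042", "assay", "lims",
               links={"tests": decision.id})

# Walk backwards from the assay result to the simulation that started it.
index = {r.id: r for r in (sim, decision, assay)}
trail = [assay.id]
cur = assay
while cur.links:
    cur = index[next(iter(cur.links.values()))]
    trail.append(cur.id)
print(trail)  # ['assay-042', 'dec-007', 'sim-001']
```

The point of the sketch is the `links` field: because each record carries its relationships, a model (or a scientist) can traverse from any result back to the hypothesis and decision that produced it, without any of the underlying data having moved.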

For one customer, this means simulation outputs from their in-house models are automatically ingested, reviewed, and prioritized inside Kaleidoscope, with every decision and rationale attached. When lab results come back, they’re automatically linked to those original hypotheses. The result is a living dataset that shows not just what was tried, but why — the kind of dataset a model can actually learn from.

Another partner uses Kaleidoscope as the hub for their automated experiments. Experiment designs are sent programmatically from Kaleidoscope to custom lab hardware. When the machines finish, the results flow back into Kaleidoscope and are linked to the decision logic that generated them. Every iteration, from design and execution to analysis, lives in one connected record without a single manual data handoff.
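As a rough sketch of that closed loop, here is one iteration in Python: a design goes out to hardware, and the result comes back already attached to the rationale that generated it. The function and field names are hypothetical stand-ins, not the partner’s actual integration.

```python
import json

def run_automated_cycle(design: dict, run_hardware) -> dict:
    """One closed-loop iteration: send a design out, then attach the
    result to the design and rationale that produced it. `run_hardware`
    stands in for whatever interface the lab hardware exposes."""
    payload = json.dumps(design)    # what gets sent over the wire
    result = run_hardware(payload)  # execution happens off-platform
    # Link the result back to its originating design and rationale, so
    # the record stays connected with no manual data handoff.
    return {
        "design_id": design["id"],
        "rationale": design["rationale"],
        "result": result,
    }

# A stand-in for the instrument side (assumed, for illustration):
def fake_hardware(payload: str) -> dict:
    design = json.loads(payload)
    return {"status": "complete", "readout": len(design["conditions"])}

record = run_automated_cycle(
    {"id": "exp-101",
     "rationale": "test top-ranked compound at three doses",
     "conditions": ["1uM", "10uM", "100uM"]},
    fake_hardware,
)
print(record["result"]["readout"])  # 3
```

The design choice worth noting is that the result never exists as a free-floating file: it is returned inside a record that already names the design and the rationale, which is what makes the connected history queryable later.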

Because all of this is federated, teams keep their existing systems and workflows. The benefit comes from the layer Kaleidoscope creates on top – the single place where all of that distributed data and context can be seen, navigated, and pointed at a model. Federation also works because it meets teams where they are. It doesn’t fight the complexity of biopharma R&D; it routes around it. Once that bridge exists, every experiment, every result, and every decision enriches the same connected environment, making it easier to ask better questions and get better answers.

Outside of biopharma, the big model players are thinking in similar terms. Google’s Agentspace, for example, is designed to let models work across multiple sources, such as PDFs, databases, and software tools, rather than relying on a single store of truth. The difference is that in biopharma the stakes are higher: if your AI can’t access and interpret the full history of your R&D work, it can’t help you make better scientific decisions.

Right now, most companies treat federation as background infrastructure – less exciting than the models and dashboards everyone talks about. The irony is that it’s the part that decides whether those models work at all. Once a team starts training on complete, context-rich datasets, every cycle improves the model and the dataset at the same time. That feedback loop is hard to match, because recreating years of connected history after the fact is close to impossible. The companies that understand this first will set the pace for the next decade of biotech. We’re building Kaleidoscope to make sure they can.


Kaleidoscope is a software platform for biotechs to robustly manage their R&D operations. With Kaleidoscope, teams can plan, monitor, and de-risk their programs with confidence, ensuring that they hit key milestones on time and on budget. By connecting projects, critical decisions, and underlying data in one spot, Kaleidoscope enables biotech start-ups to save months each year in their path to market.