Jul 18, 2023

Data hygiene and organization: why you should (proactively) care

Proper data hygiene and organization are critical in Life Science R&D, where work is dynamic, collaborative, and growing in volume/scale. They’re also necessary precursors to being able to do exciting things with that data over time. In this Kaleidoscope blog post, we cover why data hygiene and organization matter so much. As usual, if anything here resonates with you, please reach out!

Summary points

Costs of getting it wrong are high. Whether the issue is poor experimental reproducibility or a regulatory nightmare, data hygiene and organization are critically important. Being proactive can mean anything from saving months of work or millions of dollars to avoiding a company shutdown.
Laying the groundwork for scientific leaps. It's not just about saving time and money – it's also about leveraging data to its fullest potential. Scalability, automation, novel discovery, machine-driven insights: all of these are underpinned by getting the fundamentals right.
Team-wide mindset, as early as possible. Changing ingrained behaviors is hard, and problems resulting from poor hygiene and organization can permeate a company. Aligning early and encouraging everyone to do their part will pay massive dividends.

Computation is increasingly important in biotech. For anyone familiar with the space, this shouldn't come as a surprise. Most biotechs today, whether they’re enterprise companies or nimble startups, are dealing with data in some way. Often, they’re tasked with analyzing and coordinating large, complicated data sets across teams and roles — bench scientists, bioinformaticians, data scientists, computational biologists, CTOs.

In practice, this means companies are hiring people with data or software backgrounds, and scientists are picking up basic technical skills to grapple with the huge amounts of data being generated. The 2022 State of Tech Bio Survey reported that 84% of respondents are comfortable writing code, while 75% of respondents do data analysis on a weekly basis. (This trend doesn’t just exist within the self-nominated tech-bio population: there are more references to software and data repositories than ever before in published research.) More surprising, though, is how the majority (75%) of respondents learned to code: by just doing it.

The result? Code and data analysis are quickly becoming more integral to science, but most scientists simply haven’t been taught how to organize, maintain, and leverage the massive amounts of data they’re generating. This skill is separate from programming and knowing how to code. Unfortunately, simply hiring software engineers or data scientists and expecting them to figure it out isn’t enough. To get the most out of the data they’re generating, scientific teams need to ask themselves a more fundamental, infrastructural question: how can they create a systemized, scalable, and disciplined approach to handling data?

Reproducibility, scalability, automation, novel (machine-driven) insights — all of these are underpinned by and contingent on getting the fundamentals, data hygiene and data organization, sorted out.

What are the fundamentals?

Before going any further, it helps to clear up the difference between data hygiene and data organization.

Data hygiene means having routine checks and balances in place to clean and correct data to make sure it’s accurate, consistent, and up-to-date. Data organization means having some kind of systemized and structured approach to handling data in a way that aligns with how users need to access and interact with it. In short, hygiene is about having good quality of data. Organization is about arranging that data in a way that’s easy to use. We’ve previously touched on these concepts and how they relate to metadata as well.
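To make "routine checks" a little more concrete, here is a minimal sketch of what an automated hygiene pass over a handful of records might look like. Everything in it is an assumption made for illustration: the field names (`sample_id`, `concentration_um`), the sample values, and the three rules being checked are hypothetical, and a real pipeline would need checks that reflect its own assays and naming conventions.

```python
# Hypothetical assay records; field names and values are illustrative only.
records = [
    {"sample_id": "S-001", "concentration_um": 10.0},
    {"sample_id": "S-002", "concentration_um": -5.0},
    {"sample_id": "S-001", "concentration_um": 12.5},
    {"sample_id": "", "concentration_um": 3.2},
]


def hygiene_issues(rows):
    """Return (row_index, problem) pairs for a few basic hygiene rules."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(rows):
        sample_id = row["sample_id"].strip()
        if not sample_id:
            # A record that cannot be traced back to a sample is unusable.
            issues.append((i, "missing sample_id"))
        elif sample_id in seen_ids:
            # Repeated IDs make it unclear which measurement is current.
            issues.append((i, "duplicate sample_id"))
        else:
            seen_ids.add(sample_id)
        if row["concentration_um"] < 0:
            # A concentration below zero is physically impossible.
            issues.append((i, "negative concentration"))
    return issues


issues = hygiene_issues(records)
for index, problem in issues:
    print(f"row {index}: {problem}")
```

The point is less the specific rules than the habit: checks like these run every time new data arrives, so problems surface when they are cheap to fix rather than months later.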

For biotechs with poor data hygiene, poor organization, or both, the consequences can be tough: it becomes hard to analyze data in bulk or replicate experiments, the lack of reproducibility dents credibility, code is less likely to work and is harder to scale, and regulatory woes from bad data management abound.

On a good day, these problems can significantly delay your progress. On a bad day, they can lead to the failure of your company.

And out of all of the possible reasons a biotech can fail (not enough funding, scientific unknowns, being too slow to go to market), failing as a result of poor data hygiene and organization is the biggest shame. Unlike other ways to fail, these issues are fully within your control. They stem from how you choose to manage your data.

This isn’t a thinly-veiled attempt at reprimanding scientists for not doing a better job. For the most part, scientists are not at fault here. Science is iterative and chaotic, especially in the early stages, and it’s easy to let infrastructural things like data hygiene slip in favor of solving more hair-on-fire problems. We also have to remember that, until fairly recently, data and wet-lab didn’t go together as intimately as they do now. The increasing rates of data generation, heterogeneity of backgrounds of people doing science, and increased rates of distributed work all necessitate new systems and ways to support the work being done (and are some of the major reasons why we started Kaleidoscope in the first place).

Then, there’s the lack of useful tools. Many scientists have lagged in adopting better working practices simply because the software supporting the science has traditionally been poorly designed when it comes to ease of use and simplicity. Instead of helping, most of these tools cause headaches and create friction, to the point where scientists largely avoid using them whenever possible. Greg Wilson has a similar theory: “from object-oriented languages to today’s craze for ‘agile’ programming, scientists have suffered through one fad after another without their lives becoming noticeably better.” What often ends up happening is that teams either default to the rudimentary tools they’re familiar with (Excel, PowerPoint, etc.), which can actually exacerbate the underlying problems, or try to build their own internal software, which is a really expensive decision to make.

A lot of these fads and software tools create a pressure-cooker environment that saddles biotech companies with the unrealistic expectation that they must find ways to handle their data flawlessly overnight. This binary narrative, that if you’re not doing it flawlessly you may as well not do it at all, simply isn’t true. The reality is that companies don’t have to establish the perfect infrastructure from day one; just getting better at data hygiene and organization can make a big difference.

Why be proactive?

There are many reasons for people in biotech to set good data hygiene and organization as a goal today, instead of kicking the can down the road and shrugging it off as tomorrow’s problem.

Perhaps the most obvious: if you’re in biotech, regulations matter! You’ll almost inevitably come into contact with the FDA, and when it comes to data management, the FDA cares about rigor and process. Last year, the FDA issued 161 warning letters to companies that had no established written procedures and 56 violations to companies that failed to back up data. Just last month, the FDA also announced new guidelines on clinical trial modernization that focus on data integrity. Getting your team to align on best practices around data organization and hygiene early is easier than expecting behavior change many months or years down the line. And this upfront work will pay large dividends later, when you’re able to iterate quickly and reliably while the stakes are significantly higher.

The FDA has good reason to care about how you’re thinking about your data. If you’re in biotech, you probably saw the recent viral story The Boston Globe broke uncovering data integrity issues at Laronde. If you didn’t, here’s a recap: Laronde had what looked like promising results for a circular RNA technology. The company spoke of strong results in animal model experiments – but no one in the company could reproduce these experiments. This led to concerns about the lack of raw data and discrepancies in results, which spiraled into the darker realization that a single scientist was presenting preclinical data that was simply too good to be true. Eventually, Laronde shelved the project.

Biotech is a competitive environment with high expectations. It’s not entirely crazy that a scientist might, in a moment of weakness, succumb to the pressure to produce perfect results and manipulate data — but it is crazy when there isn’t a system in place to more easily and reliably catch this. Having checks and balances on data hygiene prevents situations like this from happening – if the right systems are in place, a single scientist should never be able to cause data integrity issues on that scale.

The consequences of poor data management don’t just crop up where there’s unethical behavior. Even the most well-intentioned scientists can fall (and have fallen) victim to clumsy data hygiene and organization. Geoffrey Chang, a young scientist, experienced the ultimate “scientist’s nightmare” when a high-profile paper he published had to be retracted. The homemade data-analysis program he used had flipped two columns of data, resulting in a domino effect of retractions in papers that cited Chang’s original work. The analysis program was inherited from another lab with no automated tests or sample data, which is a fundamental flaw in data hygiene.
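For a sense of what "automated tests and sample data" can mean in practice, here is a minimal sketch. It is not a reconstruction of the program in the story above; the loader `load_measurements`, the column names (`dose_um`, `response`), and the sample values are all hypothetical. It pairs a tiny, hand-checked sample file with two checks: one confirming the known answer, and one confirming that reordering the columns in the file cannot silently swap the fields.

```python
import csv
import io

# Tiny hand-checked sample file; column names and values are illustrative only.
SAMPLE_CSV = """dose_um,response
1.0,0.12
10.0,0.58
"""


def load_measurements(text):
    """Parse rows by column name, so column order in the file does not matter."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["dose_um"]), float(row["response"])) for row in reader]


def test_known_sample():
    # The expected values were checked by hand against the sample file.
    assert load_measurements(SAMPLE_CSV) == [(1.0, 0.12), (10.0, 0.58)]


def test_column_order_does_not_matter():
    # Same data with the two columns swapped in the file.
    reordered = "response,dose_um\n0.12,1.0\n0.58,10.0\n"
    assert load_measurements(reordered) == load_measurements(SAMPLE_CSV)


test_known_sample()
test_column_order_does_not_matter()
print("sample-data checks passed")
```

Reading columns by name rather than by position is one design choice that removes a whole class of swap errors; the sample-data test is what tells you when an inherited script stops behaving the way you think it does.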

That said, it’s not just about avoiding Laronde- or Chang-like problems. It’s also about achieving your full potential in an endeavor that really matters to society. With well-managed and organized data, people working in biotech can lay the critical groundwork to achieve what they entered this industry to do: help patients sooner, iterate faster, explore bigger questions, and understand and improve the value of their IP. It’s worth noting that caring about data is not and should not be a concern reserved for those working in the AI/ML bio space. Even if you’re someone who handles and generates far less data, being thoughtful in how you approach data hygiene and organization gives you the power to leverage that data better.

Ultimately, by being proactive about data hygiene and organization, you can get better at extracting insights that are novel, reliable, and reproducible, and help the stakeholders who collaborate with you get the best use of that data as it flows through different teams in your company.

If you want to chat more about anything we wrote, or you’re interested in finding a way to work together, let us know!