FAIR Data Infrastructure in Life Sciences R&D
Navigating the complexities and challenges in preclinical drug discovery (Part 1 of 3)
“We are surrounded by data but starved for insights” - Jay Baer
Introduction
Investing in data has always been crucial for the life sciences industry. Recent improvements in machine learning (ML) have shown a path to using software across all parts of the drug discovery process, and scientific breakthroughs are increasingly powered by software tools. Many of the most impactful discoveries of the last several years generate huge amounts of data that require dedicated software to make sense of them. 250 petabytes of biomedical data are generated every year, yet most of that data lies unused. Working with this data is also expensive: America’s biopharmaceutical R&D engine consumes over $100 billion every year, in part due to the challenges of using it.
To solve this, the data tech stack should be governed by the FAIR principles (Findable, Accessible, Interoperable & Reusable), ensuring that relevant data can be used for downstream analysis. FAIR data helps prevent wasted experiments, enables collaboration across disciplines, and speeds up insight generation, while a lack of adherence to these guiding principles slows down scientific discovery.
In this post we will lay out the areas that preclinical teams work on and which aspects can be “FAIRified”. Our subsequent posts will cover what a FAIR tech stack for preclinical R&D looks like and the state of the industry today.
Stages in preclinical R&D
Every scientific organization has its own research goals and ways of operating, but we think the following stages are broadly applicable. Let’s start by listing them, then examine the problems that arise at each stage.
1. Hypothesis Generation
2. Experiment Design
3. Data Capture
4. Data Storage/Management
5. Data Analysis (Pipelines and Processing)
6. Insight Generation
Problems at each stage of preclinical R&D
1. Hypothesis Generation
The first stage of research is finding and accessing the data needed to inform a hypothesis. At this stage, the struggle is not to generate more data, but to make use of existing data. The relevant data is scattered across public data repositories, electronic laboratory notebooks (ELNs), published literature, laboratory information management systems (LIMS), slide decks and more.
2. Experiment Design
Once a hypothesis has been generated, the next step is to design a set of experiments to test it. This is the core of the scientific method and is essential to any life sciences company or research group. The goal is to design experiments that shed light on the validity of the proposed hypothesis. Crucially, this requires reproducibility: it is only useful to run these experiments if you are confident that they can be reproduced. Unfortunately, inconsistencies in how designs are specified at this stage make automated downstream analysis pipelines difficult.
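To make this concrete, here is a minimal sketch in Python of what a machine-readable experiment design record could look like. The ExperimentDesign fields, identifiers, and values are entirely hypothetical; the point is that a structured design travels with the data, which keeps downstream analysis consistent across runs.

```python
from dataclasses import dataclass, field, asdict
import json

# A hypothetical, minimal experiment-design record. Keeping the design in a
# structured, machine-readable form (rather than free text in a notebook)
# lets downstream pipelines treat every run of the experiment the same way.
@dataclass
class ExperimentDesign:
    experiment_id: str
    hypothesis: str                 # the hypothesis this experiment tests
    assay_type: str                 # e.g. "dose-response", "qPCR"
    replicates: int                 # biological replicates per condition
    conditions: list[str] = field(default_factory=list)  # treatments tested
    controls: list[str] = field(default_factory=list)    # positive/negative controls

design = ExperimentDesign(
    experiment_id="EXP-0042",
    hypothesis="Compound X inhibits kinase Y in cell line Z",
    assay_type="dose-response",
    replicates=3,
    conditions=["0.1 uM", "1 uM", "10 uM"],
    controls=["DMSO vehicle", "known inhibitor"],
)

# Serialize to JSON so the design travels with the raw data and can be read
# by analysis code without manual transcription from slides or notebooks.
with open(f"{design.experiment_id}_design.json", "w") as fh:
    json.dump(asdict(design), fh, indent=2)
```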
3. Data Capture
Once the experiment has been designed, the next step is to execute it. To determine what happened in the experiment, we need to ensure its data is captured. Much lab work is done on proprietary instruments, so data often arrives in a custom format or schema rather than in a clean, structured form. This creates interoperability challenges, and teams end up writing a large number of custom software connectors.
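As an illustration, here is a sketch of the kind of connector teams end up writing. It assumes a hypothetical plate-reader CSV export with columns such as "Well Position", "Sample Name", and "OD600"; both the export format and the shared schema are invented for the example.

```python
import csv
from pathlib import Path

# Shared schema that every instrument connector maps into.
COMMON_FIELDS = ["instrument", "well", "sample_id", "measurement", "unit"]

def parse_vendor_export(path: Path) -> list[dict]:
    """Parse a (hypothetical) vendor-specific plate-reader CSV export."""
    rows = []
    with open(path, newline="") as fh:
        for raw in csv.DictReader(fh):
            # Map the vendor's column names onto the common schema.
            rows.append({
                "instrument": "platereader-01",
                "well": raw["Well Position"],
                "sample_id": raw["Sample Name"],
                "measurement": float(raw["OD600"]),
                "unit": "OD600",
            })
    return rows

def write_common_format(rows: list[dict], out_path: Path) -> None:
    """Write normalized rows so downstream tools see one consistent format."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=COMMON_FIELDS)
        writer.writeheader()
        writer.writerows(rows)

# Example:
# write_common_format(parse_vendor_export(Path("raw_export.csv")),
#                     Path("normalized.csv"))
```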
4. Data Storage/Management
The generated data needs to be stored somewhere it can be used for further processing and analysis. Without a well-structured data management solution, many research organizations have trouble finding and accessing their data: data gets lost and knowledge stays siloed. This leads to huge inefficiencies, especially in larger organizations, and slows down the research process.
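One lightweight way to keep stored data findable is to register every dataset in a catalog with a stable identifier and basic provenance. The sketch below assumes a flat JSON-lines catalog file; the record fields and the register_dataset helper are illustrative, not a reference to any particular product.

```python
import json
import uuid
from datetime import datetime, timezone

def register_dataset(catalog_path, storage_uri, experiment_id, description,
                     file_format, license_name="CC-BY-4.0"):
    """Append a minimal FAIR-style metadata record to a JSON-lines catalog."""
    record = {
        "dataset_id": str(uuid.uuid4()),      # stable identifier (Findable)
        "storage_uri": storage_uri,           # where the bytes live (Accessible)
        "experiment_id": experiment_id,       # provenance link to the design
        "description": description,
        "format": file_format,                # open formats aid Interoperability
        "license": license_name,              # reuse terms (Reusable)
        "registered_on": datetime.now(timezone.utc).isoformat(),
    }
    # A real system would use a database or a dedicated data catalog;
    # a flat file is enough to show the idea.
    with open(catalog_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record["dataset_id"]

# Example:
# register_dataset("catalog.jsonl", "s3://lab-data/EXP-0042/normalized.csv",
#                  "EXP-0042", "Dose-response OD600 readings", "csv")
```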
5. Data Analysis (Pipelines and Processing)
Data analysis transforms raw data into something researchers can interpret. This encompasses primary analysis, secondary analysis and ML models. Much of this work involves processing raw data from different sources in a consistent manner, yet as an industry we lack interoperable tools that facilitate it; every company recreates its own parsing and normalization scripts for the same data. Additionally, the lack of version control for pipelines and of standardized development environments poses a challenge to the reproducibility of analyses.
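A small step toward reproducible analyses is to write a provenance stamp alongside every pipeline output: which code version, which interpreter, which key packages, and a checksum of the input. The sketch below assumes the pipeline code lives in a git repository; the package names are placeholders.

```python
import hashlib
import json
import subprocess
import sys
from importlib import metadata

def _installed_version(pkg: str):
    """Return the installed version of a package, or None if absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

def provenance_stamp(input_path: str, packages=("numpy", "pandas")) -> dict:
    """Record enough context to re-run an analysis step later."""
    with open(input_path, "rb") as fh:
        input_hash = hashlib.sha256(fh.read()).hexdigest()
    try:
        commit = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True,
                                check=True).stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"   # not a git checkout, or git not installed
    return {
        "input_sha256": input_hash,
        "git_commit": commit,
        "python": sys.version.split()[0],
        "packages": {p: v for p in packages if (v := _installed_version(p))},
    }

# Example:
# with open("results_provenance.json", "w") as fh:
#     json.dump(provenance_stamp("normalized.csv"), fh, indent=2)
```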
6. Insight Generation
Generating insights from the data collected in these experiments requires collaboration with stakeholders across disciplines and teams, which in turn requires collaborative tools and software. This is typically done with a combination of plotting and visualization tools, slide decks, and static reports. But unless these outputs are linked back to the data that generated them, interoperability and reproducibility of the results will always be a barrier. Linking insights back to their source data, so the data can be reused and institutional knowledge retained, is a tedious process, yet it is crucially important for future users of the data and the decisions built on it.
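One simple pattern for retaining that link is to log every insight together with the identifiers of the datasets and the code version that produced it. The record fields and JSON-lines log below are assumptions for illustration, not a specific tool.

```python
import json
from datetime import datetime, timezone

def record_insight(log_path, title, summary, dataset_ids, pipeline_commit,
                   figure_path=None):
    """Log an insight along with the data and code it was derived from."""
    entry = {
        "title": title,
        "summary": summary,
        "dataset_ids": dataset_ids,          # IDs from the data catalog
        "pipeline_commit": pipeline_commit,  # code version that produced it
        "figure": figure_path,               # optional rendered artifact
        "recorded_on": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry

# Example:
# record_insight("insights.jsonl", "Compound X IC50 below 1 uM",
#                "Dose-response fit across three replicates",
#                dataset_ids=["<dataset_id from catalog>"],
#                pipeline_commit="<git commit hash>",
#                figure_path="figures/exp0042_dose_response.png")
```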
We hope this provides an overview of the various places where software comes into play in preclinical R&D, and highlights the common issues that arise from not adhering to FAIR principles.
In Part 2 of this series, we will walk through the types of software commonly used to solve these problems in industry.