This is reposted from a Q&A we did in October 2022. Check out our Slack for more.
See Elucidata’s write-ups (here and here) of this blog post as well.
We had the wonderful opportunity to interview Abhishek Jha, Co-Founder and CEO of Elucidata. Below we discuss the importance of clean data for AI, FAIR data standards, and much more.
Nicholas: Welcome, @abhishek! I'm excited to talk about Elucidata, and congrats on your recent raise. I'll kick it off with some questions, but I encourage the community to jump in.
Abhishek: Hey @nicholas, glad to be here!
Question #1:
Nicholas: What does elucidata do?
Abhishek: We clean and link biomedical data at scale. We advocate a data-centric approach to AI, which in very simple terms argues that clean data is more valuable than more data. Of course, in an ideal world you would want both. But we live in the real world, and often the data we use (public and/or proprietary) is not structured and harmonized. Our technology solves that problem for biopharma companies.
Nicholas: This is a really interesting discussion topic. We've seen huge amounts of unstructured, messy data lead to amazing results in other domains (I'm thinking GPT-3 and DALL-E). What about biology makes you think that's the wrong approach here?
Abhishek: This is very interesting. We argue and talk about it a lot, and we also learn a lot from such models. We will talk about it at our event (DataFAIR 2022). Plug! :wink:
We and others have seen that we can outperform large models (with billions of parameters) if the training is on the relevant dataset. This is the core promise of data-centric AI. Happy to get into more details if anyone is interested. More specifically, we outperform BERT quite significantly.
Very exciting time to be dealing with such challenges. So much is happening, and so fast!
Nicholas: This is very interesting to me -- I'd love to see more details on how you outperform BERT. What tasks is that on? How big is the model that outperforms it?
Abhishek: Happy to share some data on it here later. I will make a note of it.
Question #2:
Nicholas: How did it get started? How did you meet your co-founders?
Abhishek: We founded Elucidata in 2015, right after my 5.5 years at Agios Pharmaceuticals. It has been 7 years and a lot of fun for me.
I met my co-founders Swetabh and Dick during the course of my work at Agios. Swetabh and I go a long way back. He is my nephew. :slightly_smiling_face:
Swetabh has prior experience in building scalable tech and organizations. Dick is an MD/PhD professor at Yale.
Nicholas: Can you talk more about how your experiences at Agios motivated you to start Elucidata? What were the problems you saw there?
Abhishek: I can talk about it for hours. :slightly_smiling_face:
I would analyze omics data. Most of my time went to cleaning the files and putting them in the right structure: R objects, dataframes, Excel pivots, you name it. A lot of scrubbing column and row names. But what I would present to my colleagues was analysis, something simple like PCA, or more sophisticated like a classification model (which cell lines respond vs. which do not). That is what was valuable for my team. That is what brought "glory" to me.
What was under-recognized was that most of my time was going to cleaning data, and I was doing it in a highly non-scalable fashion.
This was a big problem that I experienced first-hand, and we are taking a shot at solving it at Elucidata.
Question #3:
Nicholas: Your background seems to be more of a wet-lab scientist than a computational one. How did you get into the software side of things?
Abhishek: Actually, my training is as a computational biologist. I did my PhD at UChicago and a postdoc at MIT, all of it focused on building computational models to study proteins and, later, systems-level models of the immune system.
At Agios, I continued computational work, there more focused on integrating different types of data (omics + non-omics) to understand the clinical phenotype.
In some ways I was always involved with the software side of things in some shape or form: writing it, improving it, or using it.
More fundamentally, I see myself as a consumer of the technology and services that Elucidata is building. I would have benefited a lot from this at Agios. :slightly_smiling_face:
Abhishek: More about my Agios days and the early days that led to the Elucidata of today:
I would be asked, "Oh, just upload and clean some data useful for our experiments," but there was so much work to wrangle and clean just one set. And if there were many useful datasets, it would be unimaginable to add platoons of data scientists to grind through cleaning all of them. So I became determined to build a tool to fix this problem, applying all the machine learning and natural language processing emerging in the 20-teens to simplify the task of data wrangling. All that ML/NLP work turned into Elucidata.
Question #4:
Nicholas: Where do you see the major applications of ML in drug discovery? How about major challenges?
Abhishek: ML applications span the broad spectrum of drug discovery, from hypothesis testing to drug-property prediction. The major challenge is sparsely annotated data, which makes it difficult for ML models to learn; ML teams spend a major chunk of their time on data cleaning and wrangling. Another challenge is that the hits output by ML models rarely make for exciting avenues for chemists/biologists.
Unlike internet companies (like Grammarly), drug discovery does not have access to billions of data points. Besides, any event of interest is a rare event. Therefore it becomes even more critical to have clean and linked data. This is one of the major challenges that can compromise the promise of ML and AI in drug discovery.
Question #5:
Nicholas: What about beyond drug discovery?
Abhishek: I think the opportunity is immense. A lot of verticals (manufacturing, etc.) suffer from a similar problem. One challenge for us (since we are small) is to remain focused and deliver value for our customers. But there is a whole world beyond drug discovery where our technology will be helpful!
Question #6:
Maximilian: How do you actually make the data clean/structured? (Writing data-conversion pipelines? Adding GUIs for metadata enrichment during experimentation?)
Abhishek: Responded to it in another thread. Happy to give you a demo later.
In a nutshell, we have a curation app that uses active learning. We have 30-50 curators (human experts) who use the app to build training data for our models. The models continue to improve as curators use the app more. We track the performance of our models; often they are as accurate as, or more accurate than, human curators.
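As a rough illustration of the active-learning loop described above (entirely a sketch; the classifier, uncertainty-sampling rule, and all names are our assumptions, not Elucidata's actual system):

```python
# Illustrative active-learning round: fit on what curators have labeled so far,
# then route the pool samples the model is least sure about back to the curators.
# Hypothetical sketch -- not Elucidata's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(model, X_train, y_train, X_pool, y_pool, batch_size=10):
    """One round: train, score the unlabeled pool, pull the most uncertain samples."""
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)        # low max-probability = ambiguous
    ask = np.argsort(uncertainty)[-batch_size:]  # these would go to human curators
    X_train = np.vstack([X_train, X_pool[ask]])  # curator labels join the training set
    y_train = np.concatenate([y_train, y_pool[ask]])
    keep = np.setdiff1d(np.arange(len(X_pool)), ask)
    return model, X_train, y_train, X_pool[keep], y_pool[keep]

# Toy demo with synthetic "curated" labels standing in for the human experts
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
model, X_tr, y_tr, X_po, y_po = active_learning_round(
    LogisticRegression(), X[:20], y[:20], X[20:], y[20:], batch_size=10
)
print(len(X_tr), len(X_po))  # 30 70
```

The point of this design is that each round spends curator effort where the model is weakest, which is one reason such a system keeps improving the more curators use it.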
Tess: @lilly
Lilly: Thanks! Would love a demo!
Question #7:
Nicholas: ^As a follow-up to this, what do you define as "clean" data? How do you assess that?
Abhishek: We have automated the cleaning process. Human experts are involved to build the training datasets and continuously monitor the performance of our NLP models. Here are some details; happy to talk in more depth if there is interest.
Ingestion: data standardization + metadata harmonization
Data engineering: We ingest processed data from customers in variable tabular file formats (TSV, matrix file, VCF, RDS), depending on the data type. The data is ingested from the customer's storage and transformed into a consistent standard tabular schema, usually GCT, or h5ad for single-cell data. This conversion of variable file formats to a consistent file format is the main data-standardization piece.
Metadata enrichment: These datasets are mapped with relevant metadata about the experiment (the drug used, tissue, cell lines, disease condition, etc.) at 3 levels: the overall dataset, the samples used in the experiments, or the feature level. We also ensure consistent vocabularies/ontologies are used in the metadata-annotation process, and each dataset is processed with uniform molecular identifiers.
Consumption
The datasets, along with their metadata files, are packaged, analysis-ready, and stored in our OmixAtlas product. From there, users can either search/filter for relevant datasets on the OmixAtlas UI or query programmatically using curated metadata fields. They can also start analyzing these datasets through 3rd-party integrations, or use them to train models; e.g., a patient-stratification model can be trained quickly using this data.
Abhishek: Some additional notes: We validate that labels/annotations are normalized and comply with ontologies.
We make sure the platforms and pipelines used to generate the data are curated for base-to-base comparisons.
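The two ingestion steps described above (schema standardization plus vocabulary harmonization) can be sketched roughly like this; the column map and tiny "ontology" are illustrative stand-ins, not Elucidata's actual schema or ontologies:

```python
# Toy version of ingestion: rename source-specific columns to one standard
# schema, then normalize free-text disease labels against a controlled vocabulary.
import pandas as pd

ONTOLOGY = {  # stand-in vocabulary; real pipelines map to ontologies such as MeSH/EFO
    "breast carcinoma": "breast cancer",
    "breast ca": "breast cancer",
    "nsclc": "lung cancer",
}

def standardize(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    """Map variable input columns onto the standard schema and harmonize labels."""
    out = df.rename(columns=column_map)
    out["disease"] = (
        out["disease"].str.strip().str.lower().map(lambda d: ONTOLOGY.get(d, d))
    )
    return out[["sample_id", "disease", "value"]]

# Two "customer" files with different column names and messy labels
raw = pd.DataFrame({"Sample": ["s1", "s2"], "Dx": ["Breast CA ", "NSCLC"], "expr": [1.2, 3.4]})
clean = standardize(raw, {"Sample": "sample_id", "Dx": "disease", "expr": "value"})
print(clean["disease"].tolist())  # ['breast cancer', 'lung cancer']
```

Once every source lands in the same schema with the same vocabulary, datasets from different labs become directly comparable, which is the point of the harmonization step.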
Question #8:
Maximilian: I see on the website that you provide clinical data from sources like "PPMI". Do you see any efforts to use such data to inform pre-clinical research / drug discovery and improve prediction of "clinical success"?
Abhishek: Yes. One of our customers had access to PPMI and used it quite effectively. We believe it is valuable.
We cleaned it and linked it to other datasets to make PPMI more valuable and usable.
Providing PPMI out of the box on Polly is tricky because of the usual constraints, but we can talk more about it should you be interested.
Question #9:
Yohann: Hi @abhishek, do you see value in working with pre-competitive players like the Pistoia Alliance and others on data standardization and harmonization in the life-science space?
Abhishek: Hey @yohann, we are working with Pistoia; we are a member too.
A funny thing about "standards" is that everyone has one, and that undermines the whole promise of them. We are big believers in using the community to converge on broadly agreed-upon standards.
Question #10:
Yohann: Do you have to specialize your work for research, manufacturing, clinical, natural-history studies, etc., or can your platform work in different settings?
Abhishek: Yes. We build a specific NLP model for each entity. For example, disease would have one model, cell lines another, and so on. Similarly for clinical annotation models.
Abhishek: Drugs, dosage, strength, and frequency (for example) would be 4 different models.
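The "one model per entity" pattern can be pictured as a registry of independent annotators, one per entity type; the regexes below are toy stand-ins for trained NLP models, and all names are illustrative:

```python
# Each entity type gets its own annotator; results are merged per document.
# Regexes here are placeholders for per-entity NLP models (hypothetical sketch).
import re

ENTITY_MODELS = {
    "drug": re.compile(r"\b(aspirin|metformin|imatinib)\b", re.I),
    "dosage": re.compile(r"\b\d+\s?(mg|ml|ug)\b", re.I),
    "frequency": re.compile(r"\b(daily|weekly|bid|tid)\b", re.I),
}

def annotate(text: str) -> dict:
    """Run every entity-specific model independently and collect its mentions."""
    return {entity: [m.group(0) for m in pattern.finditer(text)]
            for entity, pattern in ENTITY_MODELS.items()}

print(annotate("Patients received metformin 500 mg daily."))
# {'drug': ['metformin'], 'dosage': ['500 mg'], 'frequency': ['daily']}
```

One upside of this design is that a new entity type can be added, or an existing one retrained, without touching any of the others.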
Question #11:
Nicholas: Can you talk more about Polly? What is it, and how would a user interact with it?
Abhishek: Polly is our cloud-based platform that provides clean and linked biomedical data for consumption. We have a GUI, but we have been deliberate about being a code-first platform. I understand that is a loaded term in this Slack community. :slightly_smiling_face: The code-first approach allows us to integrate with tools like Spotfire, SageMaker, etc., as well as run powerful queries on the data (based on samples, features, and datasets).
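For a flavor of what code-first access to curated metadata can look like, here is a generic pandas illustration; the fields and values are invented, and this is not Polly's actual client:

```python
# Because every dataset carries the same curated metadata fields, filters can be
# expressed as code instead of clicks. Invented example data, not Polly's API.
import pandas as pd

metadata = pd.DataFrame({
    "dataset_id": ["GSE1", "GSE2", "GSE3"],
    "disease": ["breast cancer", "lung cancer", "breast cancer"],
    "tissue": ["breast", "lung", "breast"],
    "n_samples": [48, 120, 12],
})

# Find reasonably sized breast-cancer datasets with one composable query
hits = metadata.query("disease == 'breast cancer' and n_samples >= 20")
print(hits["dataset_id"].tolist())  # ['GSE1']
```

Queries like this compose with notebooks and downstream ML code, which is what a GUI-only interface cannot offer.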
Question #12:
Nicholas: There’s often a dichotomy drawn between traditional bioinformatics and ML. How do you think about using omics data to develop ML models for drug discovery?
Abhishek: I appreciate the question. It is aligned with my own experience at Agios and also our experience at Elucidata.
Most (not all) of bioinformatics is what I would call BI: analyzing 1-5 datasets at a time, looking at differential expression, pathways, and such. Very helpful for driving programs.
But increasingly we are seeing applications (with our customers and also outside) that rely on large numbers of datasets (hundreds) to pick up classifications. We know of customers who have used this to answer very specific questions around patient segmentation for a target they had validated. Some of the papers being published are even more audacious. So this list is growing as we speak.
It is important for predictive ML models to have a narrow question. At the same time, we see such models as beyond our scope; our customers develop them, and we feed ML-ready datasets into them. Getting the data ready for ML is our core focus.
Question #13:
Nicholas: How should organizations decide which biological problems could be solved using machine learning? What are the factors you consider?
Abhishek: 1. How narrowly can it be defined?
2. How critical is it to the business?
3. What kind of underlying data can be used to train and test the model?
4. Quality of data.
Abhishek: This list is not prescriptive, but often we see folks dive straight into the model, and that has not served anyone well. It requires a lot of planning and thought to invest in data quality and usability before diving into the models.
Question #14:
Nicholas: Your upcoming event is called DataFAIR -- can you talk more about FAIR data standards and why you see those as especially important in bio?
Abhishek: The FAIR principles put the onus on organizations that own and publish data to make it "machine-actionable", i.e., a machine can read the metadata that describes the data, which enables the machine to access and utilize the data for various applications.
Currently, for most organizations, data generation, storage, analysis, and insight derivation are owned by different stakeholders, and a significant bottleneck is the disconnect between them. FAIRly stored, managed, and shared data facilitates data reuse and enables verification of the credibility and accuracy of the data and the insights derived from it. Further, it enables interdisciplinary collaboration and innovation, accelerating drug discovery.
Abhishek: We want the event to take a community-based approach to discussing the challenges of making data FAIR and the promise of it.
Question #15:
Maximilian: What is the long-term vision for Elucidata? What are some of the areas you see the company expanding into?
Abhishek: Shameless plug: we just announced our Series A, led by F-Prime and Eight Roads. That puts us in a very good position to double down on our technology and expand. We continuously hear about similar problems in cleaning and linking pharmacology data, clinical-trial data, manufacturing data, etc.
We have just started to scratch the surface. R&D data (more specifically, tabular and text data) at large has been underserved. We are hoping to serve that community in the months and years to come.
Question #16:
Maximilian: Do you spend a lot of time consulting your clients on the potential use cases / business analysis for an ML approach (and potential model design), or do you try to stay at the pure data-service level?
Abhishek: It is hard to be a purist. We do consult and provide services too. But we are very clear about what we are 10x better at: cleaning and linking biomedical data. We do at times create data-processing pipelines, custom tools, etc., because it helps our customers. But a continuous challenge is to focus as narrowly as possible so that we can create something highly differentiated. We have been reasonably good at it so far, but it is a journey! :)
Maximilian: hehe yeah it’s a constant struggle
Question #17:
Nicholas: Can Elucidata work with companies that are just beginning their data science/ML journey or is your product better served for advanced ML teams/deep learning teams?
Abhishek: We work with large pharma companies. But we love working with companies that are very young: stealth mode, day 0, seed/Series A companies, companies that do not have a name yet. They have been a big part of our story and traction. We have a very strong services team, which enables us to do it. Happy to talk more.
Tim: What about academia?
Abhishek: We have a number of academic customers too.
Tim: But do you love working with them? :wink: Do you have any referrals? Feel free to DM.
Abhishek: We work with academia mostly for strategic reasons! It has been worthwhile for both parties. :smile: Will do.
Question #18:
Maximilian: Are certain settings more challenging for your NLP models? (e.g. language in clinical trials vs. ELN entries)
Abhishek: Yes. We try to put guardrails around what we can do, for example: digital text, English. And it is worth stressing that NLP is just a part of it; we also do data engineering at scale. That has led to some interesting discoveries. Folks will cut and paste Excel sheets into their doc files. We can't do much there. :smile: It is a huge challenge. But that is what gets us excited!! :smile:
Shashank: ELNs are far tougher than clinical-trial text, primarily because ELNs are written in a hurry.
Question #19:
Tess: Thanks! This chain has been really interesting. What you are building solves a massive problem in the industry. I am curious how you think about integrations as this field matures: how do you see Elucidata fitting into the wider ecosystem and workflows of your customers?
Abhishek: The ecosystem is complex and it will only get more complex in years to come. Integrations are a critical element of our roadmap. We are working with a few companies already to integrate with their solution. Spotfire, Tableau, LatchBio to name a few.
Tess: Also curious, given that you have been in the industry for 7 years and made a pivot that paid off really well: could you share what the major learnings have been on the market needs? How has the industry changed over time and what are some of the main trends you are equipping yourself for?
Abhishek: The industry has changed and continues to change quite dramatically. You see more companies invest (resources and expertise) to take care of their data needs (post data generation). Some needs and solutions are more established than others, LIMS for example. But the overwhelming takeaway for us has been that we are still in the very early days of adoption of contemporary data practices and technology. That is where we see a place for Elucidata and a number of other exciting companies!
Nicholas: Huge thanks to @abhishek for taking the time to answer all of these questions! Great to hear about what Elucidata is working on. If people have more questions, feel free to continue to add them here (but the hour we've allotted is just about up).
Abhishek: Thank you so much! @Nicholas greatly appreciate this opportunity. And glad to be part of this vibrant and growing community! Three cheers for Bits in Bio!!