Q&A with Watershed
An AMA-style interview with Jonathan Wang, Co-Founder and CEO, and Mark Kalinich, Co-Founder and CSO, of Watershed
This is reposted from a Q&A we did in April 2022. Check out our Slack for more.
We had the chance to interview Jonathan Wang, Co-Founder and CEO, and Mark Kalinich, Co-Founder and CSO, from Watershed Informatics! Below we discuss how Watershed plans to improve the bioinformatics workflow for its users, and much more!
Nicholas: I’m super excited to introduce @mark and @jonathan of watershed! as usual, i’ll seed the discussion with some questions, but you’re highly encouraged to jump in with questions of your own
Question #1:
Nicholas: Tell us more about what watershed does. who are your users? are you focused on academic users or industry? How did you two meet? what brought you to the intersection of sw and bio? How do you think about pre-set analyses vs customizable analysis? What does “multi-omics” mean to you? why the focus here?
Mark: @nicholas, first off, thanks so much for inviting jonathan and me to come chat with the bib community; we’re very eager to meet more of the folks here doing awesome work at the interface of tech and bio.
Ryan: @mark @jonathan is your platform geared toward homo sapiens or could people extend this into microbiology and plant science as well?
Mark: Hi ryan, thanks for coming to the ama! the platform is purpose-built to be extensible into any biological analysis; we can run microbiology and plant-based analyses today :slightly_smiling_face:. if you know anyone in agricultural biotech, we’d love to speak with them!
Jonathan: In terms of pre-set analyses vs. customizable analysis, we think the answer is customizable analysis with flexible building blocks. we often see discussions around how many pre-set analyses/pipelines a platform offers, and over the years, as that platform grows, the number often grows to thousands or even tens of thousands, with no end in sight. with customizable analysis, you effectively enable an infinite number of analyses to be developed. of course, the challenge then becomes how to offer a sweet spot of ease of use and customizability.
Nicholas: When you say “flexible building blocks”, what does that mean? what are the abstractions that you’re working with here? is it a pipeline? an aligner? a bar chart?
Jonathan: Closer to level of aligner and bar chart, while offering more built in functionality in each operation such as batch computation, data provenance, and input/output file management
I think on a high level, just writing python and r gets the job done, and those languages are obviously very flexible. however, there's often not good code reuse from project to project, but at the same time, simply developing a new standard library is just going to be yet another competing standard that a group of people use.
So our philosophy in developing our api (i.e. building blocks) is to build useful features into the abstractions themselves, such that there's real motivation for bioinformaticians to use it, instead of just reducing lines of code for the sake of brevity.
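To make the "building block with built-in features" idea concrete, here is a purely hypothetical sketch in plain python: a `block` decorator (my invention, not Watershed's actual api) that adds two of the features jonathan mentions — batch computation and data provenance — to an ordinary analysis function, so there's a real reason to use the abstraction beyond brevity:

```python
import functools
import hashlib
from datetime import datetime, timezone

# In a real system this would be a durable store, not an in-memory list.
PROVENANCE_LOG = []

def block(fn):
    """Hypothetical building-block wrapper: every call records provenance,
    and a list of inputs is automatically fanned out as a batch."""
    @functools.wraps(fn)
    def wrapper(inputs, **params):
        # batch computation: a list of inputs becomes one call per element
        if isinstance(inputs, list) and inputs and isinstance(inputs[0], list):
            return [wrapper(i, **params) for i in inputs]
        result = fn(inputs, **params)
        # data provenance: hash the inputs and record params + timestamp
        PROVENANCE_LOG.append({
            "step": fn.__name__,
            "input_digest": hashlib.sha256(repr(inputs).encode()).hexdigest()[:12],
            "params": params,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

@block
def normalize(counts, scale=1.0):
    """Toy analysis step: scale counts to fractions of their total."""
    total = sum(counts)
    return [scale * c / total for c in counts]

# one call, two samples: the wrapper fans out and logs each run
batched = normalize([[1, 2, 1], [4, 4]], scale=100.0)
```

The point of the sketch is the philosophy, not the implementation: the bioinformatician writes `normalize` exactly as they would anyway, and batch execution and run tracking come for free from the abstraction.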
Question #2:
Mark:
Re: Tell us more about what watershed does.
watershed is an end-to-end bio-it cloud data lab (which is admittedly a mouthful). what that means is that we solve 3 critical challenges for biomedical researchers trying to transform raw data into insight:
1. accessible and scalable computational infrastructure
2. solved bioinformatics tooling environments
3. accessible and scalable domain expertise
we do (1) and (2) via our platform; our users can access 1000's of cores and tbs of ram without having to worry about what an ec2 instance is or what the appropriate ram:core ratio might be for their application. we also have 1000's of tools pre-loaded on the platform that all play nice with each other - instead of fighting with incompatible packages requiring different versions of the same tool, scientists can focus on generating insight from their data. for (3) we have a world-class team of phd-level bioinformaticians trained at harvard, mit, ucsd, etc. who can solve anything from a simple code change to building an entirely bespoke analysis.
Question #3:
re: who are your users? are you focused on academic users or industry?
Mark: our users are scientists - biologists and bioinformaticians who need to rapidly build, run, and customize bioinformatics workflows. we have users across both academia and industry, especially biotech startups.
Nicholas: There are several other companies in this space (some of which are active bits in bio contributors!) — what do you feel sets watershed apart?
Mark: Great question! it’s been exciting to see the bio-it space heat up over the past 12-18 months (expansion of companies like dnanexus, growth of newer players like watershed, ganymede, latch, basepair, etc.). speaking in very general terms, there are 3 classes of solutions in the space:
1. no-code first platforms
2. high-code first platforms
3. bio-it consulting shops
the no-code first platforms (dnanexus, biobox, latch, etc; note that these platforms have some high-code capabilities as well, i’m painting with a very broad brush here) are fantastic at providing speedy analyses - you can go from 0 to gene set enrichment over lunch for your rnaseq experiment. the problem arises when you want to tweak the pipeline - swap out an aligner, include a variable that doesn’t have a radio button in the no-code gui, etc. at this point, you may have to activate a full-stack engineering team in order to build out the back- and front-end resources required for that change. even though 80-90% of a given analysis may be the same, that last 10-20% is absolutely critical for providing the actionable insights today’s scientists need.
the high-code first solutions (nextflow deployed on top of vanilla cloud (aws, gcp, azure); databricks, seqera labs, etc) have all the flexibility the no-code options lack - you can build literally any analysis you could dream of and scale it to infinity and beyond. the problem is finding the unicorn systems administrator - software engineer - bioinformatician who can build and configure your cloud instance (including appropriate security settings - don’t pull a pfizer), and there’s no chance your biologist is going to want to muck around with any of the above. these tools lack the speed of the no-code first solutions, and are inaccessible to non-experts.
another option is the consultant approach - there are several exemplary bioinformatics consultant shops around (we’ve personally been impressed with bioteam, bridge, the bioinformatics cro, and diamond age data science). the one indisputable advantage to using a consultant to perform your organization’s analysis is that the analysis will actually be completed, and in the context of a trusted relationship the quality of the results can be reliable. unfortunately, the operational cost of such an approach is prohibitive: each new question adds linearly to the bill, and additional iterative analyses are typically undertaken from scratch. timelines can be lengthy, with significant lead times for accessing limited bioinformatics bandwidth. additionally, the local fund of knowledge within your organization is not expanded - nothing is learned to make the next similar analysis any better or faster.
Mark: Watershed’s technology gives us the speed of no-code first platforms and the flexibility of high-code first platforms, which enables our users to get insights from their custom analyses 10x faster than previously possible (per them - we still need to run the randomized experiment on that one).
Question #4:
Nicholas: This brings up a question that i often think about w.r.t infrastructure vs. applications. does it make sense for vertical software companies (such as watershed) to have such a focus on the scalable infrastructure? why isn’t this something that is solved by a general software provider (e.g a cloud company)?
Jonathan: Does it make sense for vertical software companies (such as watershed) to have such a focus on the scalable infrastructure?
yes, because of both price/performance and user experience.
Regarding what mark just mentioned:
the size and variety of multi-omics data has exploded in the past 10 years, and continues to grow exponentially
the amount of storage and compute needed in this domain actually exceeds what is needed in other industries (e.g. tech, web applications). a vertical approach with both software and hardware allows users to access the performance they need, without having to worry about optimizing the hardware configuration themselves.
A general cloud company is often very good at providing pure infrastructure-as-a-service. e.g. they give you a virtual machine and virtual networking capabilities, and you set up the os and connectivity yourself. the benefit is a high degree of hardware flexibility, but the downside is needing to spend more time on low-level, complex infrastructure tasks. even starting with simpler offerings (e.g. aws lightsail) doesn't bypass that problem. as the team/company grows, someone on the team eventually needs to figure out data backups, network security, cost management, user roles, and general scalability issues (using aws doesn't mean you don't have to worry about scalability).
So we believe there's room for a more specialized and vertical solution to help accelerate the research efforts of both biologists and bioinformaticians on this front.
Nicholas: Cloud providers are just one part of the stack, though. there’s a large number of data pipeline and ml infra companies, both of which seem to have a large amount of overlap with bioinformatics workflows. your earlier answer to my question (https://bitsinbio.slack.com/archives/c02sav8du2j/p1657570472566619?thread_ts=1657570148.666779&cid=c02sav8du2j) mentioned batch computation, data provenance, and file tracking — all of which are offered by existing pipeline solutions.
Jonathan: We think the key is that there are advantages when you fully integrate the entire stack. we like to think about things in terms of number of action points needed to complete a workflow. if a user needs to go to aws to configure their s3 bucket, go to the ec2 console to configure an instance (or set of instances/autoscaling cluster), prepare a workflow in nextflow, and then monitor/manage it in slurm, that introduces a lot of touch points and decision fatigue. we believe that if you can reduce the number of overall tools/environments, you can increase the efficiency of the end users.
It's not that users can't figure out how to use each of those pieces individually, but whether it starts adding up to becoming a task that's orthogonal to the core work that they'd like to do.
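As a toy illustration of the "action points" framing, here's a hypothetical sketch (the `WorkflowSpec` and `run` names are invented for this example, not a real product api) of collapsing the s3/ec2/nextflow/slurm sequence into one declarative spec, with each formerly separate touch point handled internally:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowSpec:
    """One declarative spec instead of four separate consoles/tools.
    Illustrative only - not Watershed's actual interface."""
    name: str
    input_files: list = field(default_factory=list)
    cores: int = 4
    memory_gb: int = 16

def run(spec: WorkflowSpec) -> list:
    # each step below is a 'touch point' the user would otherwise
    # handle in a different environment (s3, ec2, nextflow, slurm)
    return [
        f"stage {len(spec.input_files)} input file(s) to object storage",
        f"provision compute: {spec.cores} cores / {spec.memory_gb} gb ram",
        f"execute workflow '{spec.name}'",
        "monitor jobs and collect outputs",
    ]

plan = run(WorkflowSpec(name="rnaseq-qc", input_files=["a.fastq", "b.fastq"]))
```

The design point is that the user expresses intent once; the four environment switches (and the decision fatigue that comes with them) become internal implementation details.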
Question #5:
Re: What does “multi-omics” mean to you? why the focus here?
Mark: shameless plug for the talk i gave at egnyte’s life sciences summit on the topic; recording available at this link. i think of multi-omics very broadly: any analysis that attempts to integrate 2+ single-omics datasets to better understand a biological system of interest counts as a multi-omics analysis to me.
the size and variety of multi-omics data has exploded in the past 10 years, and continues to grow exponentially — at this year’s bio-it world conference, george church presented an estimate that we’ll be at ~40+ exabytes of big biology data by ?2026 [i need to go back and fact-check the exact year].
this explosion in -omics data has not been met by a commensurate maturation of biological big data analysis infrastructure and tooling. today’s scientists can (relatively) easily generate data, but wait weeks, and sometimes months, to successfully analyze and interpret it. the root causes of the problem (in our opinion) are that life sciences discovery organizations lack the required infrastructure expertise, and that today’s bioinformatics tooling is fragmented and fragile.
watershed is in a unique position to leverage jonathan’s expertise in high-throughput computational systems and computational research tools and my expertise in bioinformatics / biomedical research to deliver the purpose-built, fully verticalized hardware and software solution required to truly solve the above problem.
Nicholas: The root causes of the problem (in our opinion) are that life sciences discovery organizations lack the required infrastructure expertise
the emphasis on infrastructure here implies that you think it’s a computational problem as opposed to a knowledge problem. is the problem that a company has a bioinformatician who knows how to run pipelines, but it’s too hard to run them on 100 gb of data, or that the company has a bunch of scientists who don’t know how to run pipelines?
Mark: Thanks for the clarification question (and apologies for my slow typing, i’m a few questions behind).
we think it’s both an infrastructure and knowledge problem - there’s a massive shortage of bioinformaticians in the market (i have a blog post cooking that should be ready in 1-2 weeks on this), which is made much, much worse by the currently highly-over-subscribed bioinformaticians being forced to become a swiss army knife of systems admin, software engineer, and bioinformatics guru. building vertical-specific infra frees up bioinformaticians to operate at the top of their degree - focusing on bespoke bioinformatics workflows instead of making sure they remembered to turn the lights off on their monster ec2 instance before the long weekend.
Question #6:
Nicholas: I loved your answer about low-code solutions. how do you think about targeting tools towards bioinformaticians vs biologists? why not build a fully no-code solution?
Mark: thanks! this gets at the very heart of watershed’s mission - we exist to enable research teams to transform raw data into insight. it’s trite, but science is fundamentally a team sport - i can’t tell you how many times over the years what seemed like a terribly hard problem was solved by a 90-second conversation with the right person (e.g. a biologist instantly knowing what went wrong with a computational experiment from the outputs, or a bioinformatician being able to turn a weeks-long excel slog into a 5-minute python script).
these interactions are entirely missed on any platform that doesn’t simultaneously offer the flexibility programmers need to quickly build a fully bespoke workflow (see my response on the current company landscape) and the usability a non-coding researcher requires to interact with the data. this need is why we spent the better part of 3 years perfecting the technology needed to enable that interaction.
Question #7:
Nicholas: :stuck_out_tongue: r or python? why?
Mark: My entire phd was in r; jonathan and alvin (our vp of engineering) dragged me (almost literally) kicking and screaming into python. now that i’m here, i do love the broad utility of python (it’s pretty okay at almost anything and pretty great for data science), but r is built by statisticians for statisticians (and has many beautiful visualization tools that lack python equivalents). watershed offers both so we don’t have to settle the debate ourselves :stuck_out_tongue: .
Benji: There is one and only one reason to use r. https://ggplot2.tidyverse.org/
Nicholas: Thoughts on python ports like: https://plotnine.readthedocs.io/en/stable/
Mark: Now they have it in python! https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129
Austin: Altair python ggplot clones: https://github.com/altair-viz/altair
I still love ggplot2. the python clones just end up in the uncanny valley
Question #8:
Nicholas: You’ve mentioned `watershed's technology` a few times. can you elaborate on what the tricky part of building this was? is it the infra, the apis, plugging in existing tools, etc.
Mark: Defer to @jonathan’s expertise on this one :slightly_smiling_face:
Jonathan: The infrastructure presents its own set of challenges for sure, but i would actually say the most difficult part is really getting the user experience right on the workflow api. there's a constant push and pull between trying to make it simpler and making it more flexible at the same time. we think we're at a pretty good sweet spot right now, but it's a constant challenge to keep improving the api's ability to deliver useful scientific results while maintaining that flexibility and simplicity.
Mark: Thanks so much for all of the questions! it’s been fantastic getting to chat with everyone, and looking forward to meeting more folks in person! i do have to run to my next meeting, but if anyone has any questions for me don’t hesitate to reach out at mark@watershed.ai or on this slack (although i can sometimes be a bit tardy re: responses here)
Nicholas: Thanks @mark and @jonathan — this was great!