(This Q&A is from January 2022)
At the beginning of this year, we started doing AMA-style Q&A sessions in the Bits in Bio Slack group, where we interview our members about what they’re working on and about topics at the intersection of software and science. These virtual gatherings have been great learning experiences, and we want to share them with the broader biotech community, so we will be releasing our previous Q&As over the next several weeks.
Our first interview (detailed below) is with Alfredo Andere, Co-Founder & CEO, and Kenny Workman, Co-founder & CTO, of LatchBio. If you are interested in participating in future sessions, please join our Slack channel!
LatchBio Q&A
Nicholas: Welcome to our first Q&A! Thanks again to @alfredo and @kenny for agreeing to be our guinea pigs. Everyone should feel free to ask questions — I’ll add a few to kick it off.
Question 1:
Nicholas: What problem(s) is Latch focused on solving?
Kenny: We are “building and disseminating the data infrastructure of the biocomputing revolution,” though that’s quite vague and grandiose, ain’t it? Concretely, Latch is constructing data infrastructure that lives in your browser and is tailor-made to store, visualize, and process biological data.
Everyone here is probably well aware of how complex biological data is to interpret and how large and hard to wrangle it is. The recent advent of iterative DBTL-style discovery plus high-throughput, multiplexed assays has compounded the problem. Biotech startups, mature biopharma, and even some well-funded labs have invested enormously in expensive software engineering talent to rebuild the same “pipeline running” platforms again and again as a means to categorize, visualize, and process these enormous data footprints.
Latch is stepping in to build a single, generalizable platform to serve all of these organizations, allowing us to build a better platform than any single company could in-house and letting companies and labs focus on research rather than software engineering.
Nicholas: First of all, I love this! I definitely think this idea of every biotech building its own software stack to solve problem X is a suboptimal solution. Horizontal SaaS, let’s go. Curious who you think the other competitors in the space are? Where do their existing solutions fall short?
Alfredo: I see two categories of competitors: traditional “top-down” sales companies like DNAnexus and Seven Bridges, and freemium “bottom-up” companies such as Seqera Labs. I’ll address them separately.
Top-down (DNAnexus, Seven Bridges, etc.)
As I explained in this answer, there is no way to cover the whole range of bioinformatics workflows through a one-sided marketplace; the space is far too big and growing faster every day. DNAnexus and Seven Bridges are hoping to do exactly that, and switching away from that would go completely against their business model. Imagine trying to convince a bioinformatics researcher that instead of publishing the code from their paper on GitHub, they should publish it on a platform that costs thousands of dollars to access. Good luck selling them on that.
Bottom-up (Seqera Labs, …?)
Seqera Labs, made by the creators of Nextflow, are doing awesome work. We are big fans of what they have built and of the community around it. Tons of respect for them.
Our main bets against them are that:
a) This need is big enough to require many solutions rather than a winner-take-all outcome.
b) They need to build faster. They have been building Nextflow Tower for many years now, and after six months of building we are projected to reach feature parity soon and then outpace their features in the next couple of months. There is so much to build in this space, and we need everyone going full steam ahead.
c) They are too married to Nextflow. It’s a cool project, but bioinformatics is much broader than a single workflow language, and they need to build accordingly.
Question 2:
Nicholas: How did you two meet? And your third co-founder?
Alfredo: The three of us met early on at UC Berkeley before dropping out to start LatchBio. I was mainly working on software and machine learning, and Kenny was working on software and biotech. We started by hacking together on different projects: everything from EEG brain-wave emotion prediction, to visualizations of scRNA data, to micropayments, to distribution of physical goods. We flirted with the idea of biotech software many times. Between my data infrastructure experience at Google and Facebook and Kenny’s biotech software experience at JBEI, Serotiny, and Asimov, it was clear that a lot of tooling was missing and badly needed in the space. But biotech is a scary field that people warn you not to go into without a PhD. After dropping out to keep working together and realizing we were going to work on this for the next 20 years, biotech became more and more the clear choice.
Question 3:
Nicholas: Many companies in the SaaS space force potential customers to call to set up a demo. You’ve chosen to let people “try for free now.” Can you explain the motivation behind that decision?
Alfredo: Great question. In a traditional top-down “book a demo” sales motion, the buyer is not the end user but rather the manager or executive decision-maker, so the incentive is to create the product the manager likes most, not necessarily the one the end user likes most. These misaligned incentives often lead to products that demo well to executives but are actually annoying for the scientists who use them every day.
The rise of Slack, Twilio, Benchling, and many other freemium, bottom-up companies has shown that it is becoming more typical for the end users, in this case biologists and bioinformaticians, to be the ones telling their manager what product to buy.
So by letting end users try the product first, you can actually align the incentives and build the best product for them, and the best product overall.
Clint: Another great thread on this is here: https://kellblog.com/2021/12/15/demo-is-not-a-sales-process-stage-sorry/
Clint: It’s also a reason why everyone hates things like applicant tracking systems, time-tracking portals, HR information systems, or LIMS… what causes a purchaser to buy a system like that often has little to do with what the end user wants.
Question 4:
TJ: What’s the biggest challenge you’re facing?
Alfredo: Convincing scientists, bioinformaticians, and biologists that they should buy a bioinformatics platform rather than build one.
Biotechs are allergic to buying; they tend to believe that building everything in-house is a competitive advantage. That is true for certain things particular to a company, but bioinformatics is becoming so necessary to everyone that economies of scale kick in, and building in-house no longer holds up.
The other thing is that the need for a robust bioinformatics platform creeps up on companies rather than arriving all at once. First, you just need one pipeline that you run a few times a week. As the experiment count increases, you might be running it many times a week. Suddenly you need a way to expose it to biologists and non-computational people. Slowly you realize that it is really easy to mess up the inputs, so building the right error handling takes time and patience. A month later you realize that it is no longer one pipeline but three, and they need different amounts of compute. At this point, you are a year and four software engineers into building in-house, and it is very hard to switch to a new solution.
Question 5:
Thomas: How do you plan on letting users customize the pipelines? As in, if users want to add a new step in the middle of a pipeline that you provide?
Kenny: Internally, we have burned a lot of midnight oil building a toolchain that compiles any arbitrary program or library into a fully fledged browser interface, plus companion logic to containerize and schedule that program as a serverless workflow on our standing cluster.
We are going to release these devtools publicly and open source over the next few months to let users directly customize and upload workflows to our platform, hopefully creating a marketplace dynamic where every piece of logic is guaranteed to have:
• a well-designed UI/UX
• strong versioning
• support for arbitrary resource consumption (up to 200 cores, 2 TB RAM, 16 TB file input)
• parallel execution support via CSVs
Happy to talk about the SDK in more detail, as it’s my primary focus at the moment :wink:
Thomas: I think you are using Flyte as the backend for execution?
Kenny: Yes @thomas, we are using a modified version of Flyte with an expanded biological type system, a domain language to construct full web interfaces from a typed function header, and plugins to interact with our managed file system.
Thomas: We at Freenome are also building on Flyte for a custom ML SDK, so I’m looking forward to seeing what you did.
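To make the “web interfaces from a typed function header” idea concrete, here is a minimal sketch in plain Python. The `workflow` decorator, registry, and form-spec format are hypothetical illustrations, not the actual Latch or Flyte API:

```python
from typing import get_type_hints

# Hypothetical sketch: derive a browser form spec from a typed function
# header. The decorator and registry are illustrative, not Latch's real SDK.
REGISTRY = {}

def workflow(fn):
    """Register a function and build a form description from its type hints."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    REGISTRY[fn.__name__] = {
        "version": "0.0.1",
        "form": [{"name": n, "type": t.__name__} for n, t in hints.items()],
    }
    return fn

@workflow
def align_reads(reads_path: str, min_quality: int) -> str:
    """Placeholder task body; the real logic would run in its own container."""
    return f"aligned:{reads_path}:q{min_quality}"

# Each typed parameter becomes a form field the UI can render.
print(REGISTRY["align_reads"]["form"])
```

The point of the sketch is that a richer type system (files, FASTQ, sample sheets) gives the frontend enough information to render an input form without any UI code from the workflow author.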
Question 6:
Apoorva: What are the steps in CRISPR bioinformatics? How did you build, test, and validate your tool?
Kenny: “CRISPR bioinformatics” could describe many different workflows, but I’ll go ahead and enumerate the steps most conserved among the companies and labs we have worked with.
Kenny: Workflow 1 - visualization of edit behavior
Premise: some Cas derivative (Cas9, prime editor, base editor, etc.) is used to induce an edit at a well-characterized locus.
1. Sequencing reads are usually provided from a PCR amplicon reaction. Primers are designed so that the flanking locus is ~200 bp, and the files are usually moderate and manageable (~35-50 MB).
2. Sequencing reads are run through a QC tool (e.g. MultiQC) to check for erroneous overrepresentation, which would be a negative sequencing artifact.
3. Adapters and primers are trimmed from the reads.
4. Cleaned reads are aligned against the unedited locus.
5. A distribution of edit behavior is visualized by observing deviations between the assembled reads and the amplicon. Often one pays special attention to the basepairs immediately flanking the cleavage site, giving a higher-resolution “edit efficiency” score.
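Step 5 can be illustrated with a toy computation. This sketch assumes the reads are already aligned and gap-padded to the reference coordinates (which a real pipeline gets from the aligner); the function and parameter names are invented for this example:

```python
# Toy sketch of the "edit efficiency" idea from step 5: count the fraction
# of aligned reads that deviate from the unedited amplicon in a small
# window flanking the cleavage site. Illustrative only.

def edit_efficiency(reads, reference, cut_site, flank=3):
    """Fraction of reads whose window around the cut site differs from the reference."""
    lo, hi = cut_site - flank, cut_site + flank
    edited = sum(1 for read in reads if read[lo:hi] != reference[lo:hi])
    return edited / len(reads)

reference = "ACGTACGTACGTACGT"
reads = [
    "ACGTACGTACGTACGT",  # unedited
    "ACGTACTTACGTACGT",  # substitution near the cut site
    "ACGTAC--ACGTACGT",  # deletion (gap characters from the alignment)
]
print(edit_efficiency(reads, reference, cut_site=7))  # 2 of 3 reads edited
```

Real tools in this niche do far more (indel classification, quality weighting, frameshift calling), but the core comparison is this simple.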
Kenny: Workflow 2 - large-scale CRISPR sgRNA screen analysis
Premise: one wants to understand the relationship between genes and a phenotype by perturbing many genes and introducing a positive or negative selection.
1. Here we usually get a set of pooled reads for samples under different treatment conditions, plus a library of the spacers used in the screen.
2. We first clean the reads and construct a counts matrix by assigning reads to the spacers from which they were derived.
3. We can then use the vectors of counts across all spacers to see which ones deviated significantly between conditions and find out which spacer was likely implicated in a condition change.
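Steps 2 and 3 above can be sketched in a few lines. Real screens use dedicated tools (e.g. MAGeCK) with proper normalization and statistics; the spacer sequences and exact-prefix matching here are simplifications for illustration:

```python
from collections import Counter

# Toy counts matrix for an sgRNA screen: assign each read to the spacer it
# starts with, then compare counts between conditions. Illustrative only.
SPACERS = {"GENE_A": "ACGTACGTACGTACGTACGT", "GENE_B": "TTTTCCCCGGGGAAAATTTT"}

def count_spacers(reads):
    """One column of the counts matrix: reads assigned by exact spacer prefix."""
    counts = Counter()
    for read in reads:
        for gene, spacer in SPACERS.items():
            if read.startswith(spacer):
                counts[gene] += 1
    return counts

control = count_spacers(8 * ["ACGTACGTACGTACGTACGTNNN"] + 8 * ["TTTTCCCCGGGGAAAATTTTNNN"])
treated = count_spacers(1 * ["ACGTACGTACGTACGTACGTNNN"] + 8 * ["TTTTCCCCGGGGAAAATTTTNNN"])

# GENE_A drops out under treatment, so its spacer is likely implicated.
for gene in SPACERS:
    print(gene, control[gene], treated[gene])
```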
Kenny: There are many, many tools one could use for each step in these two general scenarios, which points to a broader philosophy we have at Latch: we err on the side of wrapping rather than building. Although we have built tools to address unsolved needs (quality control for sgRNA libraries is top of mind), we would much rather build on and disseminate the great academic work being done all over the country than reinvent the wheel.
Kenny: Rather than building and testing the logic of the pipelines themselves, we are much more focused on building a reliable interface and infrastructure that lets them execute consistently.
Apoorva: Thanks so much for taking the time to answer; this was very insightful. Can’t wait to see your company grow!
Question 7:
Michael: Do you know why biotech companies often invent their own pipeline-running platforms? It seems there have been many companies offering this; DNAnexus and Seven Bridges are two that come to mind, yet neither seems to have taken market share. What do you think these companies have been missing?
Kenny:
Re: Do you know why biotech companies often invent their own pipeline-running platforms?
To be blunt, it is likely a mix of ego and a lack of good options. Ego insofar as many modern platform-biology companies aspire to build monolithic, biofoundry-like platforms (with lots of “AI”!), for which a complementary, homegrown software engine, and the ability to tweak any knob one would like, is very appealing.
The recent successes of Insitro, Zymergen, Recursion, etc. have probably done much to induce this sentiment, and it is certainly something we have noticed among larger companies (Series C and beyond) while traversing the market.
Yes, DNAnexus and Seven Bridges have existed, but I would describe them as “not great options.” Both sport archaic pipeline execution engines (a single standing server for the entire execution), have poor user interfaces, lack platform visibility, and have proven very brittle when integrating new or bespoke scripts and pipelines.
Re: What do you think these companies have been missing?
I’ll outline a few concrete things that we believe have been slept on.
1. The ability to quickly provide platform-wide support for new libraries and pipelines, in any language, in under 24 hours.
We have invested heavily in a toolchain to do just this and will release it quite soon. Giving a seasoned bioinformatician the ability to spread and expose programmatic logic to biologists and others without coding experience, that quickly, has been an enormous in-house software engineering effort, and it has allowed us to move incredibly quickly with existing customers and labs. We are very excited to release some of these tools and catalyze a marketplace for others to create executable interfaces for their beloved scripts (WDL and Galaxy just have not cut it!).
2. Our focus on building in public and removing barriers to platform usage.
It is very hard to actually try out the platforms you mentioned; for almost a decade, both existed without a “try it now” button. While building Latch over the past year, the limelight of public scrutiny has forced us to make many rapid UX, design, and feature tweaks simply by watching users fumble with a platform we initially thought was quite intuitive. Building for the public user first and foremost will produce a more beautiful and usable platform than has existed before.
3. A modern workflow execution and orchestration engine.
As mentioned previously, we are leveraging a Kubernetes-native workflow execution system that gives each node/task within a workflow its own versioned container on completely heterogeneous compute environments (think: read assembly on a heavily multiprocessed HPC node, then ML inference in the next task on a 4-GPU machine). By amortizing workflow execution over a growing body of users, we can charge individuals for runtime compute alone. Additionally, we can use spot/spare computing capacity as fodder for our clusters, yielding similar amortization efficiency and additional cost savings.
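The heterogeneous-compute idea amounts to each task carrying its own container image and resource request, so a scheduler places tasks, not whole workflows, onto machines. The names below are illustrative, not Latch's or Flyte's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch: per-task resource requests allow an HPC-style
# assembly step and a GPU inference step to land on different machines.

@dataclass
class TaskSpec:
    name: str
    image: str   # versioned container, e.g. "assembler:1.2.0"
    cpus: int
    ram_gb: int
    gpus: int = 0

TASKS = [
    TaskSpec("assembly", "assembler:1.2.0", cpus=96, ram_gb=500),
    TaskSpec("inference", "dnabert:0.3.1", cpus=8, ram_gb=64, gpus=4),
]

def placement(task):
    """Toy scheduling decision made per task rather than per workflow."""
    return "gpu-pool" if task.gpus else "cpu-pool"

for task in TASKS:
    print(task.name, "->", placement(task))
```

In a single-server engine, both tasks would be forced onto one machine sized for the worst case; per-task specs let each step pay only for what it needs.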
4. A focus on data.
We have a fully in-browser “biological file system” that feels much like macOS’s Finder, except that each file can be multiple GBs and lives in a very high-performance in-VPC network environment (100 Gb/s copy speed!). This means you can play with huge genomic data as if it were a text file on your laptop.
Question 8:
Jesse: How are you planning to approach tradeoffs between standardization and flexibility as you expand the number of pipelines you handle?
Alfredo: We are not. The broadness of the field means there are far more workflows than we could ever hope to cover, and that number is increasing every day; the rate at which it is increasing is also increasing! So the only way to really cover the whole range is to act as a central hub (AWS + GitHub) where bioinformaticians can upload their workflows through an easy-to-use SDK/API, biologists can use those workflows in a no-code way, and other bioinformaticians can use them through an easy-to-use SDK/API that takes care of cloud infrastructure for them.
The reason we are currently uploading all the workflows ourselves is that, as with any marketplace dynamic, it is very hard to create both sides of the marketplace at the same time. So initially we are acting as one side of the marketplace ourselves by uploading all the workflows for biologists to use. Once we have enough users on the platform, there will be clear incentives for bioinformaticians to put their workflows on Latch instead of GitHub or a custom React frontend. At that point, which will hopefully be in a few months, we’ll release the SDK for them to do that and focus more on building the best underlying infrastructure.
Jesse: That makes sense. Thanks!
Question 9:
Yohann: Reproducibility being an issue, how do you make sure the versions of analytical packages used by customers provide comparable results?
Kenny: Each workflow registered with Latch is versioned, and each constituent task receives its own container (which can differ between tasks in theory, though in practice most share the same compute environment). When a workflow (what you would call an analytical package) is invoked, every user is guaranteed to receive exactly the same behavior thanks to this rigorous containerization, and users can easily navigate different versions of locked behavior by switching the workflow version.
This level of rigor will be enforced when we open the platform to the public with the SDK, guaranteeing consistent and stable performance alike for the expert bioinformatician serving a whole company of needy biologists and for the lone academic trying to share new software in the supplementary material of Nature.
We are hoping to move toward bioinformatics workflows as a stable collection of executable primitives that can be relied upon as much as any programming construct.
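The versioning guarantee described here boils down to mapping each (workflow, version) pair to an immutable container identity. A minimal sketch, with invented names and digests:

```python
# Minimal sketch of version-locked resolution: re-running a given workflow
# version always resolves to the same pinned image, so behavior is stable.
REGISTRY = {
    ("rnaseq_qc", "1.0.0"): "sha256:aaaa",
    ("rnaseq_qc", "1.1.0"): "sha256:bbbb",
}

def resolve(name, version):
    """Return the pinned image digest for a workflow version, or fail loudly."""
    try:
        return REGISTRY[(name, version)]
    except KeyError:
        raise ValueError(f"unknown workflow {name}@{version}")

# Switching the version string switches to a different locked environment.
print(resolve("rnaseq_qc", "1.0.0"))
print(resolve("rnaseq_qc", "1.1.0"))
```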
Yohann: Thank you. I like the container approach; it reminds me of Code Ocean and their capsules, but an SDK would make Latch very powerful :+1:
Question 10:
Nicholas: What advice would you give someone who’s interested in starting a company in this space but is only at the idea stage? How did this go from “oh cool, we see this problem” to Latch?
Alfredo: We are very early on, so take all this with many grains of salt, but my advice would be to just do it and then take on the challenges vigorously.
There will never be a right time, and you will never feel prepared; we never did. I would argue we will never be prepared, so we just had to trust that we could overcome any challenge thrown at us and go for it.
Other advice, especially for scientists and engineers (a mistake we made early on and then fixed): do not start with a solution and then look for a problem. Make sure the problem you want to solve is actually a big pain point. You can do this by starting with many hypotheses and then talking to hundreds of people as you validate or negate them and develop new ones. A good rule of thumb is to not build anything unless you and your friends would use it, or you have ~5 people telling you they will pay for what you are making. This does not contradict my first piece of advice to “just go for it,” because finding this can be many months of a full-time job by itself. The Mom Test is a great resource on a lot of this.
Third, and this especially applies to young people with few responsibilities: if you take the fear of failure away, what is really the downside? You will learn so much by building a company that even if you completely fail, you will come out the other side with a lot of professional value to apply elsewhere. You can definitely avoid many mistakes by reading vigorously about what other founders have messed up; there is so much documentation on this, and I feel it has helped us a ton. Some off the top of my head: Sam Altman’s How to Start a Startup Stanford class (including all the readings), every video in the YC library, Paul Graham’s essays, Chris Dixon, The Lean Startup, Zero to One. Within biotech, Genentech and The Billion-Dollar Molecule are great reads.
Lastly, something that has been amazing to us is how open and willing to help people are, in tech and Silicon Valley generally but especially in biotech. The advice is to reach out to people, put yourself out there, and ask for help.
With that said, please DM me or email me at alfredo@latch.bio if you think I can ever help with anything, whether you are thinking about starting a company and want to brainstorm, you are currently building one and need help, or you just want to chat. I would love to help.
Question 11:
Isaac: What are the most important features right now that would resolve the majority of problems for customers?
Alfredo: I don’t think there is one answer to this, but a really huge and somewhat unsexy feature is workflow parameter inputs. On one side, there needs to be a lot of flexibility to support any possible workflow; on the other, figuring out how to make an input form that users cannot mess up (and thereby cause an error in the workflow) is a whole field of UI/UX research in itself. Trust me: if there is any way for a user to mess up, they will. It really appalls me that teams try to solve this in-house for 3-10 users; the time investment makes no sense unless there are hundreds of users. It’s a very interesting problem for us, though, because we get to nerd out about it extensively, and the time investment pays off by saving hundreds of biologists many more hours. Very similar problems arise around CSV importing when filling out multiple forms at the same time.
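A tiny example of the kind of guardrail such an input form needs: validate every parameter against a schema before launch, so failures surface in the form rather than mid-run. The schema format and names here are invented for illustration:

```python
# Toy illustration of workflow parameter validation: check inputs against a
# small schema before launch instead of failing mid-run. Illustrative only.
SCHEMA = {
    "reads": {"type": str, "check": lambda v: v.endswith((".fastq", ".fastq.gz"))},
    "min_quality": {"type": int, "check": lambda v: 0 <= v <= 60},
}

def validate(params):
    """Return a list of human-readable errors; an empty list means safe to launch."""
    errors = []
    for name, rule in SCHEMA.items():
        if name not in params:
            errors.append(f"missing parameter: {name}")
        elif not isinstance(params[name], rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif not rule["check"](params[name]):
            errors.append(f"{name}: value fails constraint")
    return errors

print(validate({"reads": "sample.fastq", "min_quality": 30}))  # []
print(validate({"reads": "sample.bam", "min_quality": 90}))    # two errors
```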
Question 12:
Elizabeth: Is the primary/initial focus the development of the initial workflows, the no/low-code interface for biologists, or the aforementioned underlying infrastructure (and what falls into the 'underlying infrastructure' bucket here)? Also, you mentioned at the top 'data infrastructure that lives in your browser'; can you expand on that?
Kenny: At the moment we are maintaining a healthy split between building for the no/low-code users (those who interact with the browser application) and building the underlying infrastructure and database systems that will scale to absorb its load. 'Data infrastructure that lives in your browser' refers to our browser application as a “window” into a much larger distributed system cranking away in datacenters across the globe. Through your browser, you can reorganize a filesystem of TB-scale genomic objects as easily as you would locally, and you can launch dozens of simultaneous assemblies, each requiring the resources of 5x your 2022 MacBook, as easily as you would from a local terminal. You can do all of this without a line of code and in collaboration with your whole team, spanning various degrees of programming literacy. Happy to dig in anywhere else you want as well.
Elizabeth: So basically, you just mean 'there's a GUI in front of the cloud-based resources versus having to access them via the command line, and it's slightly easier for newbies to use than something like the AWS GUI'; is that a fair paraphrase?
Kenny: Yes and no; this is mostly a fair characterization of the “file moving” example I gave (where a similar operation can be found in AWS).
To show where even seasoned software veterans could benefit: how would you begin to design and build a system to launch 12x assembly pipelines, each requiring 120 CPUs and 500 GB of RAM, ingesting 500 GB files apiece, and each with a different set of custom parameters? Sure, newbies would benefit from a nice interface wrapping this operation, but so would absolutely anyone. So more than a GUI, there is a lot of infrastructure engineering going on to make this possible for arbitrary logic.
*In the above example, now take the same system you built and make it run 12x inference on DNABERT in parallel without ripping out a single service in your architecture. Nothing can be hardcoded!
Nicholas: I’ve been digesting your answers, and your focus on infrastructure and pipelines has me thinking a lot about MLOps platforms. You’ve compared yourselves to DNAnexus and Seven Bridges, but what about something like Kubeflow or Algorithmia (or any of the other 100 platforms out there)? One major difference I see is providing bioinformatics-specific modules, but the underlying infrastructure (containerized pipelines that scale dynamically) seems quite similar.
Thank you Alfredo and Kenny for taking the time to chat with us! Stay tuned for more interviews with scientific leaders from BigHat Biosciences, scVerse, Patch Biosciences, and more coming soon.