This is reposted from a Q&A we did in April 2023. Check out our Slack for more.
We were lucky to have the opportunity to interview Josh Nicholson, co-founder & CEO of scite! scite helps researchers better discover and understand research articles through Smart Citations.
Read more to learn about:
Marrying ChatGPT with scientific literature through scite assistant
Redefining scientific citations through:
Contrast citations
Micropublications vs. full publications
Publications of null results
Toughest software challenges involved in making a research search engine
and more!
Nicholas: Welcome @josh! Very excited to have you discussing modern ways to consume research. I’ll get us started with a few questions, and hopefully the community will jump in as well!
Question 1:
Nicholas: How did you get interested in science and the application of software to the research domain?
Nicholas: @josh tell us about scite: what is the problem you solve, who is the target audience, and what is the product?
Josh: Scite is a next-generation citation index that shows the context of citations and allows users to see how any topic, article, research, drug, etc. has been cited, and whether it has been supported or contrasted in the literature.
Initially, scite was started in response to growing concerns around reproducibility in cancer research. Towards the end of my PhD, this became a big issue. From reading the literature religiously, I knew certain papers that had all the hallmarks of quality (a Cell paper, 300+ citations, authors from MIT) but were shown to be wrong by four papers citing them. We wanted to make it easy to see these debates and to see whether claims had been challenged or supported.
We’re used by students from the undergraduate level up through grad students, professors, consultants, and really anyone interested in research.
Question 2:
Nicholas: You are focused on making research more accessible. What are the biggest barriers/problems you see there?
Josh: It is extremely hard to get your hands on the full text of scientific papers without paying millions and millions of dollars. Even OpenAI and other groups like that don’t have access to all scientific papers.
We’ve done a good job of partnering with publishers to overcome this hurdle, giving them things they care about in exchange for access to their full corpus. This has taken years to achieve.
In addition to access in the sense of “I can read it,” it is also hard to digest all this material. I think accessibility in terms of understanding and synthesizing information is the other side of the access coin.
Nicholas: This is an eternal pain point for anyone working in this field. Do you see publishers changing their attitudes towards open access any time soon?
Josh: I think the world is moving towards open access as funders around the world start to mandate it. The publishers have also done well financially in the new open access world, so I think it will become the norm over time, and they will be happy about it, as they are making money from it. Indeed, a lot of the M&A activity in academic publishing is the acquisition of open access publishers.
Question 3:
Nicholas: LLMs have taken the world by storm recently and are really intriguing tools to apply to the scientific domain. How do you see them changing the way we perform and consume scientific research?
Josh: I wrote a paper on this that gives a good overview of my thoughts: https://future.com/how-to-build-gpt-3-for-science/
At scite.ai, we just integrated with ChatGPT last week. Here we are, a small group of researchers who have focused on building “next generation citations” for five years. In come LLMs, which are extremely easy to use and very powerful, but also potentially very misleading and wrong.
We’ve launched something we call Assistant, which allows us to validate ChatGPT output against research articles: https://scite.ai/assistant. I can imagine similar use cases with molecules, proteins, figures from papers, and more.
Nicholas: I think this is a really important topic, since LLMs are so accessible and have proven to be very powerful in certain domains (e.g. coding). What advice do you have for people looking to use them in their research? Things to avoid? Prompts that have been useful?
Josh: I think it’s still really early, so we are all still learning about the pitfalls and use cases. I don’t have any suggestions on prompts really, but I do think most people using these tools know that not everything generated is true. So, to me, we have already learned a bit about their limits and are writing prompts with that in mind.
I think it will become increasingly important to have LLMs applied to different datasets (scientific papers, images, etc.). We are marrying ChatGPT with the scientific literature, giving real references for the output. I can see this use case being adopted on a lot of other datasets, where the data is the verifier, or fact checker if you will.
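The “data as the verifier” idea can be sketched very crudely: check each generated sentence against retrieved source passages and flag anything without support. The bag-of-words overlap and the 0.5 threshold below are toy stand-ins for a real entailment or citation model; none of this is scite’s actual method.

```python
import re

def content_words(text):
    # Crude notion of "content": lowercase words longer than 3 letters.
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def unsupported_sentences(generated, passages, threshold=0.5):
    """Flag generated sentences whose content words barely overlap the evidence."""
    evidence = set()
    for p in passages:
        evidence |= content_words(p)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", generated.strip()):
        words = content_words(sentence)
        if words and len(words & evidence) / len(words) < threshold:
            flagged.append(sentence)
    return flagged

passages = ["Smoking is an established risk factor for lung cancer."]
out = unsupported_sentences(
    "Smoking is a risk factor for lung cancer. Coffee cures cancer.",
    passages,
)
# The second sentence has almost no overlap with the evidence and gets flagged.
```

A production system would replace the overlap heuristic with a trained model and attach the supporting passage as a citation, but the shape of the check is the same: the source text, not the generator, decides what survives.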
Luis: Interesting, are you thinking that ChatGPT and something like your platform can be married in a way that informs the creation of papers? It may be able to point out missed citations (eyerolls) or other things. A copilot for writing papers.
Josh: Yes, essentially. You can already tell Assistant to write short paragraphs, summaries, etc. I think the key thing is that we need to validate each sentence against the literature to eliminate wildly wrong assertions. You can try a few prompts around that here if you like: https://scite.ai/assistant
Question 4:
Nicholas: You’ve taken different approaches towards consuming research, text vs. audio. How did you choose your modality and why?
Josh: The scientific paper, whether we like it or not, is the currency of scientific communication and has been for a while. It’s generally how we consume scientific knowledge, how we measure it, etc. We focused on papers because that is where the focus has been for hundreds of years.
With that said, I think it is extremely interesting to move beyond that and start to capture “tacit knowledge,” or knowledge that is out there but is not explicitly published. How we get researchers to contribute this information in a digestible way is something I think about a lot, and I think many groups are now starting to capture seminars, talks, conferences, etc.
Nicholas: Do you see the unit of research changing? For instance, certain types of research might lend themselves better to a blog post. We’ve seen a few attempts here (Distill, Arcadia, etc.). What do you think needs to change if we want to move away from the scientific paper as the only way to communicate research?
Josh: I think we’ve seen a shift to preprints, and that has been a really big change in biomedical publishing. I think micropublications are a really interesting idea, but I don’t think anyone has answered why someone should publish a single-figure result.
I would like to experiment with this at scite. Maybe our AI and citation classifications can prime the pump for humans to publish short “scites.” We would be answering the “why,” because these short micropublications would directly influence the citation record. To that end, it’s important we work with publishers to index their content and display it. Smart Citations from scite are now live on ACS, PNAS, Wiley, APA, Royal Society, arXiv, and many other venues. A micropublication indexed by scite or published by scite could then influence how people see the literature.
This is tricky, though, and I think it requires a lot of moderation, but it would be interesting to explore: kind of like a Glassdoor for science, but based on findings, not salaries or anything.
Question 5:
Nicholas: @josh one of the big focuses of scite (as far as I understand it) is the idea of putting the citation at the forefront. Literature is most helpful when it’s placed within context. At the same time, we see a bit of a backlash against using citation counts as a proxy for impact, because they come with a lot of biases. How do you think about the utility of citations?
Josh: Yeah, I personally think how they have been used for decades is pretty dumb. We treat all citations as if they are the same, and more is better. However, there are many reasons to cite an article, book chapter, etc., and many ways! You can cite a paper in your methods section because you used the method. You can cite another paper in the introduction as background, or you can cite a paper in the results section to compare your findings with others’. There are dozens of reasons to cite a paper, but we lose all of that with traditional citations and reduce them down to a number.
I think reducing citations to superficial metrics is why they are not liked! Hopefully, as we show the context of citations, the community will realize that they are valuable beyond metrics and measuring, as they are a rich source of opinions, criticisms, etc.
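The three-way distinction Josh describes (supporting, contrasting, or merely mentioning) can be illustrated with a crude keyword heuristic. scite actually uses a deep-learning classifier trained on citation contexts; the cue lists below are my own toy assumptions, only meant to show the shape of the task.

```python
# Toy citation-context classifier. Real systems must handle negation,
# hedging, and paraphrase, which is why a cue list like this is far
# too brittle for production use.
SUPPORT_CUES = ("consistent with", "confirm", "in agreement with", "replicate")
CONTRAST_CUES = ("in contrast to", "contradict", "fail to replicate", "challenge")

def classify_citation(context):
    """Label a citing sentence as supporting, contrasting, or mentioning."""
    text = context.lower()
    # Check contrast first so "fail to replicate" is not misread as support.
    if any(cue in text for cue in CONTRAST_CUES):
        return "contrasting"
    if any(cue in text for cue in SUPPORT_CUES):
        return "supporting"
    return "mentioning"
```

Most citation contexts fall into the "mentioning" bucket, which is exactly the information a bare citation count throws away.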
None: I was suddenly reminded of the PageRank algorithm behind Google search while reading your response, Josh: who cites a paper should be a factor in that paper’s citation score, just like which websites link to a page influences its rank.
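The PageRank analogy can be made concrete in a few lines: treat papers as nodes and citations as links, so a citation from a highly ranked paper counts for more. This is a toy version of the classic algorithm, not anything scite has said it uses.

```python
def pagerank(citations, damping=0.85, iters=100):
    """Power-iteration PageRank. citations maps paper -> list of papers it cites."""
    nodes = set(citations)
    for targets in citations.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in citations.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node (cites nothing): spread its rank evenly.
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

# Hypothetical citation graph: everyone cites "classic", and the
# well-cited "review" also cites it, boosting it further.
graph = {
    "review": ["classic"],
    "paper_a": ["classic", "review"],
    "paper_b": ["classic", "review"],
    "classic": [],
}
ranks = pagerank(graph)
```

On this graph, "classic" outranks "review", which outranks the uncited papers, because its citations come from a paper that is itself well cited.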
Question 6:
Nicholas: I’m also interested in your thoughts on the replication crisis; this is clearly a big issue across the scientific fields. How do you see software playing a role in helping solve this (if you do at all)?
Josh: I think people need to be rewarded for reproducible research, not just flashy findings. We show “supporting citations” as a way to identify whether a paper has been supported in a broad sense. We hope that helps researchers shift their focus to doing solid work rather than cutting corners to try to get into selective journals.
I think beyond publishing, though, there are many things we need to change in science to reduce irreproducible studies.
Nicholas: Say more! What else should we be changing?
Josh: Protocols are not shared openly, we don’t publish null results, cell line models can be really imperfect for a lot of things… Doing science is hard and messy, I would say, and we all just want a nice, clean, perfect story.
To me, as soon as we drop the chase for that perfect story, I think we will get more real results that can be reproduced by others.
I personally think it would be interesting to have some papers published under pseudonyms. I think this would open up a lot more honest publishing, criticism, etc.
Jacob: Publishing under pseudonyms ... i like it!
Jacob: Satoshi vibes
Question 7:
Nicholas: I’d be remiss if I didn’t bring up the rise in preprints! How have you seen the role of arXiv/bioRxiv change in the past few years, and where do you see it going as we move forward into a new LLM-powered world?
Josh: The conversation among papers is getting a lot quicker! Preprints are citing each other sometimes days or weeks apart. I also think people are now publishing smaller papers, and that is speeding things up as well.
Question 8:
Imran: Q) Hi Josh, thanks for doing this. Just catching up with your replies, and the scite.ai Assistant (w/ ChatGPT) seems exciting. Let me get right to it: given a publication (or even a scientific consensus based on multiple publications), is there a way to explore the “dissenting view”? The latter could be sourced either from other publications or even Twitter.
Josh: I think this is served by our “contrasting citations.” Using our normal search, you can search any topic and filter by citation type, or filter to see papers that have received a contrasting citation.
For example: https://scite.ai/search?citationtypes[0]=contrasting&mode=citations&q=%22spatial%20transcriptomics%22
Josh: I think looking at social media would be interesting as well, but most of that is opinion. We’re surfacing supporting or contrasting evidence or analyses of claims.
Imran: The reason I ask is, when encountering a new technology (for example, spatial transcriptomics), I want a cogent argument summarised from all the “naysayers” (but substantiated by publications).
Question 9:
Nicholas: Let’s get technical: what are some of the toughest software challenges involved in making a research search engine? I’m assuming you don’t just throw everything into a really big Elasticsearch instance :smile:
Josh: Step 1: get access to the papers beyond just titles and abstracts. Extremely hard unless you are willing to pay $50M for it.
Step 2: most content is still in PDF, especially pre-2000. This needs to be converted, and the relevant information needs to be extracted. There are thousands of reference styles and many PDF layouts. Extracting that information reliably from millions of different-looking PDFs is hard!
Step 3: you need the right metadata: author information, paper information, paper types, etc. Thankfully, Crossref gives you a lot of this pretty easily, but not all of it.
Also… we do use Elastic ; )
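The difficulty of the reference-extraction step is easy to demonstrate: a parser written for one citation style breaks on the next. Below is a toy regex for a single APA-like style; it is purely illustrative (not scite’s pipeline), and the sample reference is made up.

```python
import re

# Matches ONE reference style: "Authors (Year). Title. Venue, ..."
# Thousands of other styles (Vancouver, IEEE, Chicago, ...) would each
# need their own handling, which is why rule-based extraction is fragile.
REF = re.compile(
    r"(?P<authors>.+?)\s\((?P<year>\d{4})\)\.\s"  # authors, (year).
    r"(?P<title>[^.]+)\.\s"                        # title.
    r"(?P<venue>[^,]+),"                           # venue,
)

def parse_reference(ref):
    """Return fields for an APA-like reference string, or None if unrecognized."""
    m = REF.match(ref)
    return m.groupdict() if m else None

parsed = parse_reference(
    "Smith, J., & Doe, A. (2019). A study of citation contexts. Scientometrics, 120(3), 45-67."
)
```

Feed the same function a Vancouver-style string ("Smith J, Doe A. A study. Scientometrics. 2019;120:45-67.") and it returns None, which is the whole problem in miniature.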
Josh: We wrote a paper detailing how we built scite here: https://direct.mit.edu/qss/article/2/3/882/102990/scite-a-smart-citation-index-that-displays-the
Nicholas: Have you looked into using LLMs to extract data from PDFs? Maybe not the most cost-efficient solution, but as a fallback?
Josh: We haven’t, but I think some have, and they have found some improvements. You’re right though, analyzing 60M+ papers with LLMs is not cheap (we have analyzed something like that to date).
Question 10:
Nicholas: You’ve mentioned a few times the importance of getting access to the full text of the papers. Can you elaborate on what that unlocks for you?
Josh: There is a lot of valuable information within a paper. If you analyze only the abstract and title, you’re missing out on a ton of it.
Google Books based just on book jackets would be pretty bad; showing the information you’re after directly within the books is the difference (if that analogy helps).
Nicholas: So given that the LLMs aren’t trained on the full text but still seem to be pretty good at making scientific inferences, it seems like there could be a big performance gain there. Have you all thought about training your own LLM here?
Josh: Yeah, I think that would be interesting. I believe some groups have already done it, though, using content from Sci-Hub!
Nicholas: Thoughts on Sci-Hub? :stuck_out_tongue:
Josh: I think that has been one of the biggest contributions to all kinds of human learning ever.
Question 11:
Nicholas: One thing that’s always frustrating is reading a paper and seeing a lack of code/data (it may be “available on request,” but we all know how that works out). Do you have any features to help filter/search by that? Or plans to integrate with something like Papers with Code?
Luis: Autogenerate the email request when “available upon request” pops up :joy:
Josh: I think we should run an analysis like this on our papers and let people filter by that: https://arxiv.org/abs/2209.00693
Josh: So you could see which papers use which software.
Josh: (We don’t do a good job of it currently, besides searching the names directly or GitHub URLs.)
Question 12:
Nicholas: What’s been the hardest part of starting scite? unexpected challenges?
Josh: It’s really hard to raise money and make money in this space. Everyone thinks everything should be free, VCs think it’s too small a market, etc.
We just reached profitability yesterday, though, after building/working on scite for almost five years!
Nicholas: Congrats!!! That’s incredibly exciting.
Josh: Also, people have been broaching the idea behind scite since at least the 1960s. Lawyers have something called Shepardizing, which allows you to see whether a law is still good law or not. This is effectively Shepardizing for science.
The technology wasn’t there until recently to analyze and classify citations. We are riding the wave of deep learning, and the timing with validating ChatGPT is not something we had anticipated.
Question 13:
Lee: Are either of the tools introduced during this chat capable of acting like a knowledge base? For example, a common use case for translational scientists the world over is to focus in on a gene or pathway in a common cell type (or model) and try to summarize what is known about it:
• what are the major regulatory genes?
• what diseases are implicated if the gene/pathway is dysregulated?
• what secondary effects might follow from modulation of protein expression?
Some of these questions are likely only to be answered experimentally, but it would be incredibly valuable for an agent to be able to answer some of them, or even to summarize what’s publicly known (with attribution). Do any of these questions sound like something either scite or paperplayer can do?
Josh: Yeah, do you have a specific gene or pathway in mind? I can share an example of how it could work, and you could tell me whether it is what you’re after or not.
Josh: Here is one example.
Josh: https://scite.ai/assistant (you can try it out yourself too)
Josh: Another, based on your second bullet point.
Josh: @lee is this what you had in mind? Any feedback on it would be great.
Nicholas: Huge thanks to @josh for the Q&A. This was really cool to hear about everything that scite is doing. Josh has been active in the bib community, so I’m sure you’ll have other opportunities to reach out to him as well.
Josh: Thanks for all the questions! If others have them, I will try to answer more later. Appreciate this community and appreciate you!