Show HN: Talk to any ArXiv paper just by changing the URL (github.com/evanhu1)
194 points by evanhu_ on Dec 21, 2023 | 73 comments
Hello HN, Talk2Arxiv is a small open-source RAG application I've been building for a few weeks. To use it, just prepend 'talk2' to any arxiv.org link to load the paper into a responsive RAG chat application (e.g. www.arxiv.org/abs/1706.03762 -> www.talk2arxiv.org/abs/1706.03762).

All implementation details are in the GitHub repo. Currently, because I've opted to extract text from the paper's PDF rather than read the LaTeX source (since I wanted to build a more generic PDF RAG in the process), it struggles with symbolic text/mathematics and sometimes fails to retrieve the correct context. I appreciate any feedback and hope people find it useful!

Currently the backend PDF processing server is single-threaded, so if embedding takes a while, please be patient!




Idk where this changing-the-URL thing started, but I really like it.


The oldest instance of it that I know of is putting something like 'download' before or after the youtube domain. This must have been 2008±2. I very much doubt that's the first instance ever, but I wasn't around online in the 90s (aside from on my grandma's machine; she didn't know her computer had a web browser, but that wasn't very conscious, just a neural net (young me) clicking buttons to see the effect).


ss before youtube.com brings you to a download page (i.e. ssyoutube.com)


Subject: Discontinuation of Service in the Great Britain

Blocked in the UK :(


or youtube5s


It works well with bookmarklets. This swizzles between arxiv.org <-> www.talk2arxiv.org. I've now added it to my Favorites bar, next to arxiv.org/abs <-> arxiv.org/pdf and twitter.com <-> nitter.net. Thanks for the service!

javascript:((u,a,b,c)=>window.location.href=u.match(a)?u.replace(a,b):u.startsWith(b)?u.replace(b,c):u)(window.location.href,/https:\/\/arxiv\.org\/(abs|pdf)\//,'https://www.talk2arxiv.org/pdf/','https://arxiv.org/pdf/')


It bugs me because it's kinda true but kinda misleading; I don't know if casual web users realize it's a whole different domain. Sometimes it's not important, sometimes it is.


Yeah, it doesn't help that some browsers hide the full URL.


I use this on mitta.us/ to save pages. Got the idea from saved.io/, which does something similar.


CERN circa 1994.


Tell me more?



And where does this mention a constant mutation of URLs as a mnemonic system, as a substitute for a form submission, to provide a service in 1-1 correspondence to another service?


URLs can be manipulated. It’s even encouraged.


Really? I thought it was recommended for URLs to stay static, so as not to break old links.


URLs can be manipulated by the user. If they’re human-readable we can explore a site without links.

Consider something like phpbb. There are thousands of instances in the world, all with similar URLs. You might have some management script or bookmarks that rely on this URL structure and you can just change the domain name to use them on a new site.

Ditto for discovering historical blog posts on a WordPress site. Or interacting with a Stack Overflow-style site.


nyud.net (Coral CDN) was my first experience with this pattern... back from the Slashdot days in the '00s.


You might be able to drop the PDF backend since they're close to getting HTML running well: https://news.ycombinator.com/item?id=38713215

Using that might be easier than a multi-modal approach. Bonus points for:

* Multiple papers at once

* Comparing PDF and HTML output with the LLM, as input for it to correct similar converter code


Definitely, I'll move to the LaTeX source code instead of a PDF backend, since that allows better support for non-textual data that gets poorly scraped by GROBID. That is a really cool development I didn't know about; there's also https://ar5iv.labs.arxiv.org/ which already has most arXiv papers as HTML documents. I chose GROBID because it not only parses the PDF but organizes the text into logical sections for me (intro, abstract, references), which I didn't want to do manually with heuristics that I'd have to devise.
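For reference, a rough sketch of pulling logical sections out of ar5iv's HTML (assuming LaTeXML's "ltx_section" and "ltx_title" classes, which the papers I spot-checked use; worth verifying against more output):

    import requests
    from bs4 import BeautifulSoup

    # Rough sketch: fetch a paper's ar5iv HTML and split it into
    # logical sections keyed by their headings.
    def fetch_sections(arxiv_id):
        url = f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        sections = {}
        for sec in soup.find_all("section", class_="ltx_section"):
            title = sec.find(class_="ltx_title")
            name = title.get_text(strip=True) if title else "untitled"
            sections[name] = sec.get_text(" ", strip=True)
        return sections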


Maybe you could use: https://github.com/facebookresearch/nougat/tree/main or https://github.com/VikParuchuri/marker

Both are tools to convert PDFs into LaTeX or Markdown with LaTeX formulas. Maybe that helps.


Reading the motivation for the second:

"Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages."

When these outputs are fed into the next level as inputs, isn't it all the more likely to get even more hallucinations/repetitions?


Yes, using the LaTeX source code (or HTML, once that becomes reliable and widely used) should be much more robust than PDF parsing.


That is a very good resource to save.


What's the easiest way to adapt this to local LLMs like Ollama or llama.cpp?



LiteLLM looks like a tool that wraps various providers into an OpenAI API format, or is there more to it? What if I'm more interested in the PDF indexing(?) using Ollama? Do you know of any tools that let me upload/include PDFs in my Ollama chats?


There are a bunch of little projects that do this. It's on the roadmap for ollama-webui (but not implemented yet), and Ollama published a guest blog post with a simple implementation that can be cloned and run very easily:

https://ollama.ai/blog/building-llm-powered-web-apps

There's also Cheshire Cat, which is a framework for building chat assistants that use a set of documents as a knowledge base:

https://github.com/cheshire-cat-ai/core
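If you'd rather skip frameworks entirely, here's a minimal local RAG sketch against Ollama's HTTP API (assumes a local server with the llama2 model pulled, endpoint shapes per Ollama's API docs, and text already extracted from the PDF, e.g. with pdftotext):

    import requests

    OLLAMA = "http://localhost:11434"

    def embed(text):
        r = requests.post(f"{OLLAMA}/api/embeddings",
                          json={"model": "llama2", "prompt": text})
        return r.json()["embedding"]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    def ask(chunks, question):
        # Embed every chunk, pick the one closest to the question,
        # and stuff it into the prompt as context.
        vectors = [embed(c) for c in chunks]
        qv = embed(question)
        best = max(range(len(chunks)),
                   key=lambda i: cosine(vectors[i], qv))
        prompt = f"Context:\n{chunks[best]}\n\nQuestion: {question}"
        r = requests.post(f"{OLLAMA}/api/generate",
                          json={"model": "llama2", "prompt": prompt,
                                "stream": False})
        return r.json()["response"]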


pdftotext?


thanks!


While it certainly seems cool, as with most other AI tools I'm struggling to see how I'd use it. That is, I can't think of anything I'd want to ask.

I assume I'm just getting old and have a limited imagination when it comes to these new AI things.

Anyone got any good examples on how to effectively use this?


Presumably you're reading the paper for a reason. Take why you're reading the paper and form it into a question.


I struggled with the same problem. What's interesting with LLMs is that they will augment the data in the paper with their general knowledge. So you can talk to them like you would with a colleague or mentor (but without the shame of asking dumb questions).


But are there currently any good local models that can provide good conversation/support for a recent paper? I feel the level of hallucination will be huge. Even GPT-4 with pdf-to-text extraction tends to hallucinate a lot, and GPT-4 is the best LLM so far.


"What's the gist of this thing?"

"I'm building xyz app, how could this be applicable?"

"I want to know if x or y, what does this paper say, or does it even apply?"

"Implement the pseudocode algorithm in python"

"Can you help me understand section X, the wording is tricky"

"Doesn't this suffer from {flaw}?"

"How does the paper address x?"

No, an AI will not give 100% perfect answers all of the time. You know who else doesn't? Humans. You already have mechanisms to deal with unreliability, so please save yourself time and use AI to be more efficient.


The questions would come from reading a paper in particular. Have a question? Ask away. That's how I'd use it personally.


What I'm trying to say is that I'm not used to thinking that way. I can't think of a question to ask.


You never have any doubts or uncertainties while reading a paper? Was that always a thing for you or did that grow with experience?


Of course, but then I just go back and reread the relevant section(s).

The things I might be able to phrase as queries are not things I trust the AI to be able to explain, like the downsides of the proposed algorithm, for example (unless explicitly mentioned, which is seldom).

I guess I'll have to try to keep this project in mind next time I read an arXiv article, and give it a spin.


Very nice; it appears to work well. Just an FYI that I did get a couple of errors where the max context length was exceeded, one using the demo summarization task as the first query. I was using my own API key when the error occurred.


Thank you, and thanks for pointing that out! Since the underlying RAG is rather naive (a simple embedding cosine-similarity lookup, as opposed to knowledge-graph or other advanced techniques), I opted to embed both "small" chunks (512 characters and below) and entire-section chunks (e.g. embedding the whole introduction) in order to support questions such as "Please summarize the introduction". Since I also use 5 chunks for each context, I suspect this can add up to a massive amount on papers with huge sections.
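One way to avoid the overflow would be to cap the retrieved context at a token budget before building the prompt. A minimal sketch (the 4-characters-per-token ratio is a crude heuristic, not a real tokenizer):

    # Keep the most relevant chunks that fit a token budget, so
    # whole-section chunks can't blow past the model's context window.
    def fit_to_budget(chunks, max_tokens=3000):
        kept, used = [], 0
        for chunk in chunks:  # assumed sorted by relevance, best first
            cost = len(chunk) // 4  # crude token estimate
            if used + cost > max_tokens:
                continue  # oversized section chunk; try smaller ones
            kept.append(chunk)
            used += cost
        return kept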


This is the paper that would reliably trigger context overflows (it otherwise did an admirable job on this brainbender): https://arxiv.org/abs/1811.03116


This is the first time I have seen someone use GROBID. It seems like an incredibly cool solution.


I spent forever looking at various PDF parsing solutions like Unstructured, and eventually stumbled across GROBID, which was a perfect fit since it's made entirely for scientific papers and has header/section-level segmentation capabilities (splitting the paper into Abstract, Introduction, References, etc.). It's lightweight and fast too!
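Using it is a single REST call against a locally running server (default port 8070); processFulltextDocument returns TEI XML with the header metadata, abstract, and body sections already separated:

    import requests

    # Post a PDF to a local GROBID server and get back TEI XML.
    with open("paper.pdf", "rb") as f:
        tei_xml = requests.post(
            "http://localhost:8070/api/processFulltextDocument",
            files={"input": f},
        ).text
    print(tei_xml[:500])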


I've spent the last couple of weeks diving into various PDF parsing solutions for scientific documents. GROBID is pretty cool, but it made some mistakes when parsing (I think arXiv) papers, which dropped some of the text.

Even though it gave a lot of great structured output, missing even a single sentence was unforgivable to me. I went with Nougat instead, for arXiv papers.

(Also check out Marker (mentioned on HN in the last month) for pretty high-fidelity paper conversion to Markdown. It does a reasonable job with equations too.)


Google's Document AI does a good job, but I'll need to test the equation handling again to be sure.


Did you try Apache Tika?


I wonder if they knew that they could get HTML versions of the papers by just changing 'arxiv' to 'ar5iv' in the link.


I did try that at first, but it was hard to parse through the HTML, organize it into logical sections (authors, references, abstract), and then clean up the text to prepare it optimally for chunking and embedding. Once I found GROBID I just went with that route, because it handled all of that for me.


I used the example on the GitHub page, but found that I had to wait for embedding. It would reduce latency and save API costs if there were a shared cache.


There is a cache! You hit a new PDF, but at least you won't have to wait for that one again ;)
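A sketch of the general idea (illustrative only, not the actual Talk2Arxiv code): persist each paper's chunk embeddings keyed by arXiv ID, so the slow, paid embedding step runs once per paper.

    import json, os

    CACHE_DIR = "embedding_cache"

    # Illustrative only: disk cache of chunk embeddings per paper.
    def cached_embeddings(arxiv_id, chunks, embed_fn):
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, f"{arxiv_id}.json")
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        vectors = [embed_fn(c) for c in chunks]  # the slow, paid step
        with open(path, "w") as f:
            json.dump(vectors, f)
        return vectors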


This is great! Awesome work! We could also point it at https://biorxiv.org/


Thank you so much! Yes, I will have that up soon as well.


Looks great! It would be very interesting to understand a bit of the why/how of some of the steps, such as the reranking and how you arrived at your chunking algo.


Thank you :). I updated the README to have some more explanation of the steps.

The chunking algorithm chunks by logical section (intro, abstract, authors, etc.) and also utilizes recursive subdivision chunking (chunk at 512 characters, then 256, then 128...). It is still quite naive, but it works OK for now. An improvement would perhaps involve more advanced techniques like knowledge-graph precomputation.
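A simplified sketch of that chunking scheme (embedding each logical section whole, plus fixed-size windows at each granularity):

    # Simplified sketch: one chunk for the whole section, plus
    # windows at 512, 256, and 128 characters, so both broad and
    # narrow queries have something to match.
    def subdivide(section, sizes=(512, 256, 128)):
        chunks = [section]
        for size in sizes:
            chunks += [section[i:i + size]
                       for i in range(0, len(section), size)]
        return chunks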

Reranking works like this: instead of embedding each text chunk as a vector and performing cosine-similarity nearest-neighbor search, you use a cross-encoder model that compares two texts directly and outputs a similarity score. Specifically, I chose Cohere's Reranker, which specializes in comparing query/answer chunk pairs.
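The rerank step itself is a small call. A sketch against Cohere's Python SDK (model name as of late 2023; check their docs for current ones):

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder key

    # Score candidate chunks against the query with the cross-encoder
    # reranker and keep the top k.
    def rerank(query, candidates, k=5):
        resp = co.rerank(model="rerank-english-v2.0", query=query,
                         documents=candidates, top_n=k)
        return [candidates[r.index] for r in resp.results]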


Awesome project. What metric and dimension size did you set on Pinecone?


Ah, I figured the dimension was 1024. Assuming cosine for the metric.


This could be generalized to any URL, right? Maybe with special rules so that you know how to get to the PDF for arXiv.


If you want a generalized version of something similar, try this: https://github.com/MittaAI/mitta-community/tree/main/cookboo...

The query pipeline isn't that sophisticated, but it could be altered to do page references and to filter by keyterms first, instead of doing vector similarity on all the data.
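For the keyterm-first idea, something like this could slot in before the similarity search (illustrative only, not MittaAI's actual pipeline):

    # Cheaply narrow the candidate set by keyterm overlap before
    # paying for vector similarity on every chunk.
    def keyterm_filter(chunks, query, max_candidates=50):
        terms = {t.lower() for t in query.split() if len(t) > 3}
        ranked = sorted(chunks,
                        key=lambda c: sum(t in c.lower() for t in terms),
                        reverse=True)
        return ranked[:max_candidates]  # embed only these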

One thing with MittaAI is that it doesn't do UI interfaces. It expects you to handle those bits.


Is it able to answer questions about references?


Any plans for bioRxiv?


Yes! I'll set up talk2biorxiv.org very soon, as it would be simple to port over. I also plan on making the underlying research-PDF RAG framework available as an independent module.


Will it become normal to paste your OpenAI keys into a website? Will it be the new curl-to-sudo-bash, or checking cloud credentials into GitHub?


Interesting thought. I then wondered if perhaps we could introduce some kind of mechanism like a token that is authorized to spend a certain amount, and then wondered if we could take that concept to the extreme and have one key with some money loaded that users could put into websites. It was at this point I realised I'd just invented the credit card.


Such a service already exists: https://openrouter.ai. They're basically an aggregator of a lot of LLM models and APIs with unified billing (both Stripe and crypto). And the part most relevant to your comment is that on OpenRouter you can create capped API keys with a specified limit in credits.


Azure has SAS tokens, which give time-limited, scoped access derived from an access key:

https://learn.microsoft.com/en-us/azure/ai-services/document...

I used them to scope/isolate databases storing backups to a shared storage account.


Sounds more like a Visa "gift" debit card.


Pretty cool that the inventor of the credit card is here on HN!


Not to be that guy, but this would be an ideal use case for crypto.


All while calling the solution open source.


Ideally we do something like OAuth eventually.


I thought this would be for contacting authors or chatting about the paper with other readers, but apparently RAG here is a new important TLA to take note of, meaning chatbot. You need to enter an API key from "Open"AI to use the service, and it's about it answering your questions about the paper.


Oops, sorry for the miscommunication! Actually, you don't need to enter an API key for now. Feel free to just try it out!


I haven't looked at the code, but I wanted to ask in advance whether it is possible to incorporate Lean 4's formal mathematical capabilities into the current architecture to obtain more precise answers when processing mathematical PDF documents. For example, to implement something similar to the functionality described in terrytao.wordpress.com/2023/02/18/would-it-be-possible-to-create-a-tool-to-automatically-diagram-papers/.
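For concreteness, the kind of formal statement such a tool might attach to a result in a paper (a toy Lean 4 example; the lemma and its name are hypothetical):

    -- Toy example: a paper's "Lemma 2.1 (addition commutes)"
    -- formalized, so answers about it could be checked against the
    -- proof assistant rather than trusted from the LLM.
    theorem lemma_2_1 (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b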



