Show HN: Talk to any ArXiv paper just by changing the URL

katella · on Dec 21, 2023

Idk where this changing the url thing started but I really like it.

Aachen · on Dec 21, 2023

The oldest instance of it that I know is putting something like download before or after the youtube domain. This must have been 2008±2. I very much doubt that's the first instance ever but I wasn't around online in the 90s (aside from on my grandma's machine who didn't know her computer had a web browser, but that wasn't very conscious, just a neutral net (young me) clicking buttons to see the effect)

theolivenbaum · on Dec 21, 2023

ss before youtube.com brings you to a download page (i.e. ssyoutubecom)

aembleton · on Dec 21, 2023

Subject: Discontinuation of Service in the Great Britain

Blocked in the UK :(

KomoD · on Dec 21, 2023

or youtube5s

osteele · on Dec 21, 2023

It works well with bookmarklets. This swizzles between arxiv.org <-> www.talk2arxiv.org. I've now added it to my Favorites bar, next to arxiv.org/abs <-> arxiv.org/pdf and twitter.org <-> nitter.net. Thanks for the service!

javascript:((u,a,b,c)=%3Ewindow.location.href=u.match(a)?u.replace(a,b):u.startsWith(b)?u.replace(b,c):u)(window.location.href,/https:\/\/arxiv\.org\/(abs|pdf)\//,'https://www.talk2arxiv.org/pdf/','https://arxiv.org/pdf/')

ReactiveJelly · on Dec 21, 2023

It bugs me cause it's kinda true but kinda misleading, I don't know if casual web users realize it's a whole different domain. Sometimes it's not important, sometimes it is.

katella · on Dec 27, 2023

Yeah doesn't help that some browsers hide the full url.

kordlessagain · on Dec 21, 2023

I use this on mitta.us/ to save pages. Got the idea from saved.io/, which does something similar.

mulmen · on Dec 21, 2023

CERN circa 1994.

quickthrower2 · on Dec 21, 2023

Tell me more?

mulmen · on Dec 21, 2023

https://en.m.wikipedia.org/wiki/URL

quickthrower2 · on Dec 21, 2023

And where does this mention a constant mutation of urls as a mnemonic system as a substitute for a form submission to provide a service in 1-1 correspondence to another service?

mulmen · on Dec 21, 2023

URLs can be manipulated. It’s even encouraged.

tempaccount420 · on Dec 21, 2023

Really? I thought it was recommended for URLs to stay static, as to not break old links.

mulmen · on Dec 24, 2023

URLs can be manipulated by the user. If they’re human-readable we can explore a site without links.

Consider something like phpbb. There are thousands of instances in the world, all with similar URLs. You might have some management script or bookmarks that rely on this URL structure and you can just change the domain name to use them on a new site.

Ditto for discovering historical blog posts on a wordpress site. Or interacting with a Stack Overflow.

tinix · on Dec 21, 2023

nyud.net (coral cdn) was my first experience with this pattern... back from the slashdot days in the 00s.

pushfoo · on Dec 21, 2023

You might be able to drop the PDF backend since they're close to getting HTML running well: https://news.ycombinator.com/item?id=38713215

Using that might be easier than a multi-modal approach. Bonus points for:

* Multiple papers at once

* Comparing PDF and HTML output with the LLM as input for it correcting similar converter code

evanhu_ · on Dec 21, 2023

Definitely I'll move to the LaTeX source code instead of a PDF backend since that allows better support for non textual data that gets poorly scraped by GROBID. That is a really cool development I didn't know about, also theres https://ar5iv.labs.arxiv.org/ which already has most arXiv papers as HTML documents. I chose GROBID because they not only parse the PDF but organize the text into logical sections for me (Intro, abstract, references) which I didn't want to manually do with heuristics that I'd have to devise.

Koaisu · on Dec 21, 2023

Maybe you could use: https://github.com/facebookresearch/nougat/tree/main or https://github.com/VikParuchuri/marker

Both are tools to convert pdfs into Latex or Markup with latex formulas. Maybe that helps

acqq · on Dec 21, 2023

Reading the motivation for the second:

"Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages."

When these are fed in the next levels as inputs, isn't it even less surprising to get even more hallucinations/repetitions?

Mandelmus · on Dec 21, 2023

Yes, using the LaTeX source code (or HTML, once that becomes reliable and widely used) should be much more robust than PDF parsing.

skeptrune · on Dec 21, 2023

That is a very good resource to save

killingtime74 · on Dec 21, 2023

What's the easiest way to adapt this to local LLMs like Ollama or Lama.cpp

chpatrick · on Dec 21, 2023

Probably with https://github.com/BerriAI/litellm

addandsubtract · on Dec 21, 2023

liteLLM looks like a tool that wraps various providers into an OpenAI API format, or is there more to it? What if I'm more interested in the PDF indexing(?) using ollama? Do you know of any tools that allow me to upload/include PDFs in my ollama chats?

lolinder · on Dec 21, 2023

There are a bunch of little projects that do this. It's on the roadmap for ollama-webui (but not implemented yet), and Ollama published a guest blog post with a simple implementation that can be cloned and run very easily:

https://ollama.ai/blog/building-llm-powered-web-apps

There's also Cheshire Cat, which is a framework for building chat assistants that use a set of documents as a knowledge base:

https://github.com/cheshire-cat-ai/core

chpatrick · on Dec 21, 2023

pdftotext?

killingtime74 · on Dec 21, 2023

thanks!

magicalhippo · on Dec 21, 2023

While it certainly seems cool, as with most other AI tools I'm struggling to see how I'd use it. That is, I can't think of anything I'd want to ask.

I assume I'm just getting old and have a limited imagination when it comes to these new AI things.

Anyone got any good examples on how to effectively use this?

xboxnolifes · on Dec 21, 2023

Presumably you're reading the paper for a reason. Take why you're reading the paper and form it into a question.

addandsubtract · on Dec 21, 2023

I struggled with the same problem. What's interesting with LLMs, is that they will augment the data in the paper with their general knowledge. So you can talk to them like you would with a colleague or mentor (but without the shame of asking dumb questions).

3abiton · on Dec 21, 2023

But are there currently any good local models that can provide good conversation/support of a recent paper? I feel the level of hallucinations will be huge. Even with gpt4 with pdf2text, it tends to hallucinate a lot. And Gpt4 is the best LLM model so far.

bruce343434 · on Dec 21, 2023

"What's the gist of this thing?"

"I'm building xyz app, how could this be applicable?"

"I want to know if x or y, what does this paper say, or does it even apply?"

"Implement the pseudocode algorithm in python"

"Can you help me understand section X, the wording is tricky"

"Doesn't this suffer from {flaw}?"

"How does the paper address x?"

No, an AI will not give 100% perfect answers all of the time. You know who else? Humans. You already have mechanisms to deal with unreliability, so please save yourself time and use AI to be more efficient.

cristeigabriel · on Dec 21, 2023

The questions would come from reading a paper in particular. Have a question? Ask away. That's how I'd use it personally.

magicalhippo · on Dec 21, 2023

What I'm trying to say is I'm not used to think that way. I can't think of a question to ask.

bruce343434 · on Dec 21, 2023

You never have any doubts or uncertainties while reading a paper? Was that always a thing for you or did that grow with experience?

magicalhippo · on Dec 22, 2023

Of course, but then I just go back and reread the relevan section(s).

The things I might be able to phrase into query is not something I trust the AI to be able to explain, like what are the downsides of the proposed algorithm for example (unless explicitly mentioned, which is seldom).

I guess I'll have to try to keep this project in mind next time I read an arXiv article, and give it a spin.

gorkish · on Dec 21, 2023

Very nice; appears to work well. Just an FYI that I did get a couple errors where the max context length was exceeded, one using the demo summarization task as the first query. I was using my own API key when the error occured.

evanhu_ · on Dec 21, 2023

Thank you! Thanks for pointing that out, since the underlying RAG is rather naive (simple embedding cosine similarity lookup, as opposed to knowledge graph / advanced techniques), I opted to embed both "small" (512 character and below) chunks as well as entire section chunks (embedding the entire introduction) in order to support questions such as "Please summarize the introduction". Since I also use 5 chunks for each context, I suspect this can add up to a massive amount on papers with huge sections.

gorkish · on Dec 22, 2023

This is the paper that would reliably trigger context overflows. https://arxiv.org/abs/1811.03116 It otherwise did an admirable job on this brainbender.

skeptrune · on Dec 21, 2023

This is the first time I have seen someone use GROBID. It seems like an incredibly cool solution

evanhu_ · on Dec 21, 2023

I spent forever looking at various PDF parsing solutions like Unstructured, and eventually stumbled across GROBID, which was an absolute perfect fit since it's entirely made for scientific papers and has header/section level segmentation capabilities (splitting the paper into Abstract, Introduction, References, etc.) It's lightweight and fast too!

pugio · on Dec 21, 2023

I've spent the last couple weeks diving into various PDF parsing solutions for scientific documents. GROBID is pretty cool, but it made some mistakes when trying to parse (I think arxiv) papers which removed some of the text.

Even though it gave a lot of great structured options, missing even a single sentence was unforgivable to me. I went with Nougat instead, for arxiv papers.

(Also check out Marker (mentioned on hn in the last month) for pretty high fidelity paper conversion to markdown. Does reasonable job with equations too.)

kordlessagain · on Dec 21, 2023

Google's Document AI does a good job, but I'll need to test the equation handling again to be sure.

skeptrune · on Dec 21, 2023

Did you try Apache Tika?

arbitrandomuser · on Dec 21, 2023

I wonder if they knew that they could get html versions of the paper by just changing the link from ...arxiv.. to ar5iv..

evanhu_ · on Dec 21, 2023

I did try that at first, it was hard to parse through the HTML code and organize into logical sections (authors, references, abstract) and then clean up the text to prepare it optimally for chunking and embedding. Once I found GROBID I just went with that route because it handled all that for me.

schneehertz · on Dec 21, 2023

I used the example on the GitHub page, but found that I need to wait for Embedding. It would be beneficial to reduce latency and save API costs if there is a common cache.

evanhu_ · on Dec 21, 2023

There is a cache! You hit a new PDF but at least you will not have to wait for that one again ;)

gtsnexp · on Dec 21, 2023

This is great! We could also point it to https://biorxiv.org/ Awesome work!

evanhu_ · on Dec 21, 2023

Thank you so much, yes I will have that up soon as well

zzleeper · on Dec 21, 2023

Looks great! It would be very interesting to understand a bit they why/how of some of the steps, such as the reranking and how you arrived at your chunking algo.

evanhu_ · on Dec 21, 2023

Thank you :). I updated the README to have some more explanation of the steps.

The chunking algorithm chunks by logical section (intro, abstract, authors, etc.) and also utilizes recursive subdivision chunking (chunk at 512 characters, then 256, then 128...). It is quite naive still but it works OK for now. An improvement would perhaps involve more advanced techniques like knowledge graph precomputation.

Reranking works by instead of embedding each text chunk as a vector and performing cosine similarity nearest neighbor search, you use a Cross-Encoder model that compares two texts and outputs a similarity score. Specifically, I chose Cohere's Reranker that specializes in comparing Query and Answer chunk pairs.

soliax · on Dec 25, 2023

Awesome project. What metric and dimension size did you set on Pinecone?

soliax · on Dec 25, 2023

Ah, I figured the dimension was 1024. Assuming cosine for the metric.

quickthrower2 · on Dec 21, 2023

This could be generalized to any url right? With maybe special rules so that you know how to get to the pdf for arxiv.

kordlessagain · on Dec 21, 2023

If you want a generalized version of something similar, try this: https://github.com/MittaAI/mitta-community/tree/main/cookboo...

The query pipeline isn't that sophisticated, but it could be altered to do page reference and use keyterms first to filter, instead of doing the vector similarity on all data.

One thing with MittaAI is that it doesn't do UI interfaces. It expects you to handle those bits.

friend_and_foe · on Dec 21, 2023

Is it able to answer questions about references?

aendruk · on Dec 21, 2023

Any plans for bioRxiv?

evanhu_ · on Dec 21, 2023

Yes! I'll set up talk2biorxiv.org very soon as it would be simple to port over. I also plan on making the underlying research PDF RAG framework available as an independent module

willvarfar · on Dec 21, 2023

Will it become normal to paste your OpenAI keys into a website and will it become the new curl-sudo-bash or checking cloud credentials into github?

kn100 · on Dec 21, 2023

Interesting thought. I then wondered if perhaps we could introduce some kind of mechanism like a token that is authorized to spend a certain amount, and then wondered if we could take that concept to the extreme and have one key with some money loaded that users could put into websites. it was at this point I realised I'd just invented the credit card.

Tiberium · on Dec 21, 2023

Such a service already exists - https://openrouter.ai. They're basically an aggregator of a lot of LLM models and APIs with unified billing (with both Stripe and crypto). And the main part related to your comment is that on OpenRouter you can create capped API keys with a specified limit in credits.

__s · on Dec 21, 2023

Azure has SAS tokens which give time limited subpermission access derived from access key

https://learn.microsoft.com/en-us/azure/ai-services/document...

Used it to scope/isolate databases storing backups to a shared storage account

reidjs · on Dec 21, 2023

Sounds more like a visa “gift” debit card

SequoiaHope · on Dec 21, 2023

Pretty cool that the inventor of the credit card is here on HN!

addandsubtract · on Dec 21, 2023

Not to be that guy, but this would be an ideal use case for crypto.

imjonse · on Dec 21, 2023

All while calling the solution open source.

tehlike · on Dec 21, 2023

ideally we do something like Oauth eventually.

Aachen · on Dec 21, 2023

I thought this would be for contacting authors or chatting about the paper with other readers, but apparently RAG here is a new important TLA to take note of, meaning chat bot. You need to enter an API key from "Open"AI to use the service and it's about it answering your questions about the paper

evanhu_ · on Dec 21, 2023

Oops sorry for the miscommunication, actually you don't need to enter an API key for now. Feel free to just try it out!

hxypqr · on Dec 21, 2023

I haven't looked at the code, but I wanted to ask in advance if it is possible to incorporate lean4's formal mathematical capabilities into the current architecture to obtain more precise answers when processing mathematical PDF documents. For example, to implement something similar to the functionality described in terrytao.wordpress.com/2023/02/18/would-it-be-possible-to-create-a-tool-to-automatically-diagram-papers/.