Ask HN: I have many PDFs – what is the best local way to leverage AI for search?
257 points by phodo 8 months ago | 90 comments
As the title says, I have many PDFs - mostly scans via Scansnap - but also non-scans. These are sensitive in nature, e.g. bills, documents, etc. I would like a local-first AI solution that allows me to say things like: "show me all tax documents for August 2023" or "show my home title". Ideally it is Mac software that can access iCloud too, since that's where I store it all. I would prefer to not do any tagging. I would like to optimize for recall over precision, so false positives in the search results are OK. What are modern approaches to do this, without hacking one up on my own?



You don't. You use a full-text indexer and normal search tools. A chatbot is only going to decrease the integrity of query results.


I found that grep actually outperformed vector search for many queries. The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).

Do keyword search systems have workarounds for this? My own idea was for each keyword to generate a list of neighbor keywords in semantic space. I figured with such a dataset, I'd get something approximating vector search for free.

I made some attempts at that (found neighbors by their proximity in text), but I ended up with a lot of noise (words that often go together without having the same meaning). So I'd probably have to use actual embeddings instead.
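
Roughly what I have in mind, as a sketch: embed a vocabulary pulled from the indexed documents and take the nearest neighbors of each query keyword. This assumes the sentence-transformers package; the model name is just a common default and the whole thing is untested.

  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")

  def expand_query(keyword: str, vocabulary: list[str], top_k: int = 5) -> list[str]:
      """Return the vocabulary terms closest to `keyword` in embedding space."""
      vecs = model.encode([keyword] + vocabulary, normalize_embeddings=True)
      query_vec, vocab_vecs = vecs[0], vecs[1:]
      scores = vocab_vecs @ query_vec  # cosine similarity, since the vectors are normalized
      best = np.argsort(scores)[::-1][:top_k]
      return [vocabulary[i] for i in best]

  # e.g. expand_query("invoice", vocab) might come back with "bill", "receipt", "statement"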

More generally, any suggestions for full-text indexing? Elasticsearch seems like overkill. I built my own keyword search in Python (simple tf-idf) which was surprisingly easy. (Long-term project is to have an offline copy of a useful/interesting subset of the internet. Acquiring the datasets is also an open question. Common Crawl is mostly random blogs and forum arguments...)
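
For anyone curious, the tf-idf index really is only a few lines if you lean on scikit-learn. This is a sketch of the same idea (mine is hand-rolled, but the shape is identical):

  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = ["text of first file ...", "text of second file ..."]  # one string per document
  vectorizer = TfidfVectorizer(stop_words="english")
  matrix = vectorizer.fit_transform(docs)  # rows are documents, columns are terms

  def search(query: str, k: int = 10) -> list[int]:
      """Return indices of the k best-matching documents by cosine similarity."""
      scores = (matrix @ vectorizer.transform([query]).T).toarray().ravel()
      return scores.argsort()[::-1][:k].tolist()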


> The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).

I think that's the only thing GUI (or TUI) directories have over the CLI. I remember having Wikipedia locally (English texts, back in 2010) and the portals were surprisingly useful. They act like the semantic space in case you can't find an article for your exact word. So Literature > Fiction > Fantasy > Epic Fantasy will probably land you somewhere close to "The Lord of the Rings".


Do you know of any way to build a fast index you can run grep against? Would love to have something as instantaneous as "Everything" on Windows for full text on Linux so I can just dump everything in a directory


Have you tried the more modern solutions like ripgrep, ack, etc.?

Or for something more comprehensive (to also search PDF, docx, etc.) there is ripgrep-all:

https://github.com/phiresky/ripgrep-all


As others have said, ripgrep et al. are faster than regular grep. You would probably also get much faster results with an alias that excludes directories you don't expect results in (e.g. I don't normally grep in /var at all).

I have seen some recommendations for recoll, but I haven't used it so can't comment. Anecdotally, I normally just use ripgrep in my home directory (it's almost always in ~ if I don't remember where it is). It's fast enough as long as my homedir is local (i.e. not on NFS).


Tracker is an open source project for that. It has been around for some 10+ years now. https://tracker.gnome.org/overview/


Try ripgrep.


The point of vector search is to support semantic search. It makes sense that grep will outperform if you're just looking for verbatim occurrences of a string.


A combination of both could help!


Most developers are going to outperform vector search. We “get” how computers do lookups so we build our queries appropriately.

Vector search is amazing for using layman concepts.


> decrease the integrity of query results

What does that even mean? When you know the exact keywords, you use full-text search.

When you don't know them, other tools can be helpful.


It means you'd end up using the chatbot for everything because it's more convenient, and get worse results than you would from the other tool.


Because they're two different tools for two different tasks. If you expect to always know the exact phrase then, yes, grep will be better. But if you search a semantically similar phrase you will get nothing.


You wouldn't use a chatbot for the same query you'd use normal search tools for (and, as a side note, your answer would be much more useful with an example of what those tools would be; as it stands it's not really actionable). A vague natural-language question over data whose structure you haven't fully understood, using terms that might be inexact, is not as likely to give good results with normal search tools as with an LLM-based tool.


> your answer would be much more useful with an example of what those tools would be

Paperless, DevonThink, even Calibre (the ebook manager) can do it.

You only need a day or two to categorize the documents. No need for huge amounts of RAM, or privacy concerns, or hallucinated answers.


  > You only need a day or two
For some of us, for some types of data, huge amounts of RAM, or even privacy concerns, or even the occasional hallucinated answer, is an easier pill to swallow.

A recent example, maybe not the best example but recent, was the query "What do the three-headed dog from the Harry Potter books and the cat from Alien have in common?"


  They are fictional.


I never want to categorize stuff. I want it done for me.


Another (ugly but works nicely): https://www.recoll.org/pics/index.html

Open source, local, yada yada, almost zero configuration (just add folders, run the indexer, wait).


Paperless-ngx set up using docker compose is good for this use case.


Hi bastien,

Could you expand on the answer? Thanks!


The RAG CLI from LlamaIndex allows you to do it 100% locally when used with Ollama or llama.cpp instead of OpenAI.

https://docs.llamaindex.ai/en/stable/getting_started/starter...
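
If you'd rather stay in Python than use the CLI, the same pipeline is only a few lines. This is a sketch assuming the llama-index, llama-index-llms-ollama and llama-index-embeddings-huggingface packages; LlamaIndex's API changes often, so treat the exact names as approximate.

  from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
  from llama_index.llms.ollama import Ollama
  from llama_index.embeddings.huggingface import HuggingFaceEmbedding

  # Everything runs locally: Ollama serves the LLM, the embedding model runs in-process.
  Settings.llm = Ollama(model="llama3", request_timeout=120.0)
  Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

  documents = SimpleDirectoryReader("pdfs/").load_data()  # parses the PDFs into text
  index = VectorStoreIndex.from_documents(documents)

  print(index.as_query_engine().query("show me all tax documents for August 2023"))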


and at some point (https://github.com/ggerganov/llama.cpp/issues/7444) you will be able to use Phi-3-vision https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

but for now you will have to use Python.

You can try it here https://ai.azure.com/explore/models/Phi-3-vision-128k-instru... to get an idea of its OCR + QA abilities


Does the llamaindex PDF indexer correctly deal with multi-column PDFs? Most I've seen don't, and you get very odd results because of this.


I've made quite good conversions from PDF to Markdown with https://github.com/VikParuchuri/marker . It's slow but worth a shot. Markdown should be easily parseable by a RAG pipeline.

I'm trying to get a similar system set up on my computer.


This looks worth exploring, so thanks. The author has done a bunch of work beyond what PyMuPDF does on multicolumn layouts.


Locally you can choose pypdf or MuPDF, which are good but not perfect. If you can send your data online, LlamaParse is quite good.
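
For the local route, a minimal sketch with pypdf (note it only works on PDFs that already have a text layer; scanned images need OCR first):

  from pypdf import PdfReader

  def pdf_to_text(path: str) -> str:
      reader = PdfReader(path)
      # extract_text() can return None on image-only pages, hence the "or ''"
      return "\n".join(page.extract_text() or "" for page in reader.pages)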


Pulling the text out of the PDFs correctly, as its own independent step, is the right approach.



https://milvus.io/docs/integrate_with_llamaindex.md

Pretty easy to run locally and lightweight with Milvus Lite and LlamaIndex.


LlamaIndex has a horrible API, very poor docs, and is constantly changing. I do not recommend it.


Any alternative?


Vanilla Python
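
For example, something along these lines: talk to a local Ollama server over its REST API and do the retrieval math yourself. The endpoints and model tags ("nomic-embed-text", "llama3") are assumptions, check them against your Ollama version; chunking and PDF extraction are left out.

  import requests

  OLLAMA = "http://localhost:11434"

  def embed(text: str) -> list[float]:
      r = requests.post(f"{OLLAMA}/api/embeddings",
                        json={"model": "nomic-embed-text", "prompt": text})
      return r.json()["embedding"]

  def generate(prompt: str) -> str:
      r = requests.post(f"{OLLAMA}/api/generate",
                        json={"model": "llama3", "prompt": prompt, "stream": False})
      return r.json()["response"]

  def cosine(a: list[float], b: list[float]) -> float:
      dot = sum(x * y for x, y in zip(a, b))
      return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

  def answer(question: str, chunks: list[str], k: int = 5) -> str:
      qv = embed(question)
      # naive: re-embeds every chunk per query; cache the chunk vectors in practice
      best = sorted(chunks, key=lambda c: cosine(embed(c), qv), reverse=True)[:k]
      context = "\n\n".join(best)
      return generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")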


So your solution to “I don’t like flying [specific airline]” would be “how about a big pile of aluminum and some jet fuel”?


LOL `papichulo`? What a tiger!?


Haha, the first word that came to my mind.


Paperless supports OCR + full text indexing: https://docs.paperless-ngx.com/

As far as AI goes, not sure.


You can use GPT4All with LocalDocs to analyze the folder where you store the output of paperless-ngx.


I am a medical student with thousands and thousands of PDFs and was unsatisfied with RAG tools, so I made my own. It can consume basically any type of content (PDF, EPUB, YouTube playlist, Anki database, MP3, you name it) and does a multi-step RAG: first using embeddings, then filtering with a smaller LLM, then answering by feeding each remaining document to the strong LLM, then combining those answers.

It supports virtually all LLMs and embeddings, including local LLMs and local embeddings. It scales surprisingly well and I have tons of improvements to come, when I have some free time or procrastinate. Don't hesitate to ask for features!

Here's the link: https://github.com/thiswillbeyourgithub/DocToolsLLM/


Nvidia's 'Chat with RTX' can do this as well https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/

You do need a beefy GPU to run the local LLM, but I think it's a similar requirement for running any LLM on your machine.


I am deeply unsatisfied with how most RAG systems handle questions, chunking, embeddings, and storage, and even the ones used for summaries are usually rubbish. That's why I created my own tool. Check it out; I've updated it a lot! It supports Ollama too for private use.


The primary challenge is not just about harnessing AI for search; it's about preparing complex documents of various formats, structures, designs, scans, multi-layout tables, and even poorly captured images for LLM consumption. This is a crucial issue.

There is a 20 min read on why parsing PDFs is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

To parse PDFs for RAG applications, you'll need tools like LLMwhisperer[1] or unstructured.io[2].

Now back to your problem:

This solution might be overkill for your requirement, but you can try the following:

To set things up quickly, try Unstract[3], an open-source document processing tool. You can set this up and bring your own LLM models; it also supports local models. It has a GUI to write prompts to get insights from your documents.[4]

[1] https://unstract.com/llmwhisperer/ [2] https://unstructured.io/ [3] https://github.com/Zipstack/unstract [4] https://github.com/Zipstack/unstract/blob/main/docs/assets/p...


Apache Tika could help extract the relevant bits of PDFs, couldn't it?

https://tika.apache.org/
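
It could. A sketch with the tika Python bindings (they spin up a local Tika server under the hood, so a Java runtime is required):

  from tika import parser  # pip install tika; needs a local Java runtime

  parsed = parser.from_file("statement.pdf")
  text = parsed["content"]        # extracted text, None if nothing was found
  metadata = parsed["metadata"]   # author, dates, producer, etc.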


Modern LLMs are good enough at treating PDFs as images and grokking the context.

Well, Claude and GPT-4 seem to be.


For macOS, there's this: https://pdfsearch.app/

Without AI, but searching the PDF content, I use Recoll (https://www.recoll.org/) or ripgrep-all (https://github.com/phiresky/ripgrep-all)


The best indexer for macOS, bar none, is Foxtrot Professional. https://foxtrot-search.com/foxtrot-professional.html Very sophisticated searching, including regex, its own query language, and proximity searches - x within z words of y - which for me is the biggest win. I have 2TB of files indexed with this.


You have to find a good OCR tool that you can run locally on your hardware. RAG depends on your doc processing pipeline.

It’s not local, but the Azure Document Intelligence OCR service has a number of prebuilt models. The “prebuilt-read” model is $1.50/1k pages. Once you OCR your docs, you’ll have a JSON of all the text AND you get breakdowns by page/word/paragraph/tables/figures, all with bounding boxes.

Forget the Lang/Llama/Chain-theory. You can do it all in vanilla Python.
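For example, the OCR call itself is only a few lines with the azure-ai-formrecognizer SDK (the newer azure-ai-documentintelligence package has a very similar shape); the endpoint and key below are placeholders and this is a sketch, not production code.

  from azure.ai.formrecognizer import DocumentAnalysisClient
  from azure.core.credentials import AzureKeyCredential

  client = DocumentAnalysisClient(
      endpoint="https://<your-resource>.cognitiveservices.azure.com/",
      credential=AzureKeyCredential("<your-key>"),
  )

  with open("scan.pdf", "rb") as f:
      poller = client.begin_analyze_document("prebuilt-read", document=f)
  result = poller.result()

  print(result.content)        # the full extracted text
  for page in result.pages:    # per-page words with bounding polygons
      for word in page.words:
          print(word.content, word.polygon)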



rga, aka ripgrep-all, is my go-to for this. I suppose grep is a form of AI -- or, at least, an advanced intelligence that's wiser than it looks. ;)

https://github.com/phiresky/ripgrep-all


+1 for this. I use rga all the time. It's a "simple" solution but often enough for what I actually need.


If you haven’t given some serious thought to getting rid of most of the documents then consider it. There is very little need to keep most routine documents for more than a few years. If you think you need your electric bill for March 2006 at your fingertips, why?


I was hoping someone would make this point. A lot of digital archiving is just delaying tossing things - a hard drive is easier to deal with than boxes of paper. The contents can still be useless.

When it comes to a search solution - what kind of searches have you done in the past? What kind of problems did you come across? If the answer to either is "none" you are planning on building a useless system.


You never know when you will need a 10-year-old doc. Audits and disputes, for example. In addition, I suspect keeping all the docs uses 1% of the space of the photos people back up anyway.

I agree that search is overkill - just drudge manually or use grep when the time comes to dig.


You can use Microlink to turn a PDF into HTML, and combine it with another service for processing the text data.

Here's an example turning an arXiv paper into real text:

https://api.microlink.io/?data.html.selector=html&embed=html...

It looks like a PDF, but if you open devtools you can see it's actually a very precise HTML representation.


If you're looking for something local, we develop an app for macOS and Windows that lets you search and talk to local files and data from cloud apps (https://curiosity.ai). For the AI features, you can use OpenAI or local models (the app uses llama.cpp in the background, it ships with llama3 and a few other models, and we're soon going to let you use any .gguf model).


Like many others have suggested, local indexing is what I use for this, although some more natural interface may be better for structured search and querying.

What I haven't seen suggested though, is the built-in spotlight. Press CMD+Space, type some unique words that might appear in the document, and spotlight will search it. This also works surprisingly well for non-OCRd images of text, anything inside a zip file, an email, etc..
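
Spotlight is scriptable too: mdfind is the command-line front end to the same index, so you can drive it from Python if you want to post-process the hits. A small sketch (macOS only):

  import subprocess

  def spotlight_search(query: str, folder: str) -> list[str]:
      """Return paths of files whose indexed content matches the query."""
      out = subprocess.run(["mdfind", "-onlyin", folder, query],
                           capture_output=True, text=True, check=True)
      return out.stdout.splitlines()

  # e.g. spotlight_search("home title", "/Users/me/Documents")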


PrivateGPT is a great starting point for using a local model and RAG. text-generation-webui (oobabooga) with superbooga V2 is very nice and more customizable.

I’ve used both for sensitive internal SOPs, and both work quite well. PrivateGPT excels at ingesting many separate documents; the other excels at customization. Both are totally offline and can use mostly whatever models you want.


This could be humor or a real hack.

Get a Copilot+ PC with Recall enabled and quickly scan through the documents by opening them in Adobe Acrobat Reader. Voilà! You will have an SQLite DB that holds your index. A few days later, Adobe could have your data in their LLM.



Try https://github.com/phiresky/ripgrep-all before going down the rabbit hole of AI and advanced indexers. Quick to set up and undo if that's not what you want, but I'm pretty sure you'll be surprised how far this can get you.


If you want to run locally you can look into this https://github.com/PaddlePaddle/PaddleOCR

https://andrejusb.blogspot.com/2024/03/optimizing-receipt-pr...

But I suggest that you just skip that and use gpt-4o. They aren't actually going to steal your data.

Sort through it ahead of time to find anything with a credit card number or other obviously sensitive details (see the sketch at the end of this comment).

Or you could look into InternVL..

Or a combination of PaddleOCR first and then use a strong LLM via API, like gpt-4o or llama3 70b via together.ai

If you truly must do it locally, and you have two 3090s or 4090s, it might work out. Otherwise the LLMs may not be smart enough to give good results.

Leaving out the details of your hardware makes it impossible to give good advice about running locally, other than to say it's not really necessary.
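
For the pre-filtering step mentioned above, a rough sketch: flag any text whose digit runs pass a Luhn check before it ever leaves the machine. The regex and length range are just illustrative.

  import re

  CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

  def luhn_ok(candidate: str) -> bool:
      digits = [int(d) for d in re.sub(r"\D", "", candidate)]
      checksum = 0
      for i, d in enumerate(reversed(digits)):
          if i % 2 == 1:
              d *= 2
              if d > 9:
                  d -= 9
          checksum += d
      return checksum % 10 == 0

  def looks_sensitive(text: str) -> bool:
      """True if the text contains something that validates as a card number."""
      return any(luhn_ok(m.group()) for m in CARD_RE.finditer(text))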


> But I suggest that you just skip that and use gpt-4o. They aren't actually going to steal your data.

Why do you have this confidence? Is it based on reading their TOS, and assuming they'll follow it?


I looked into this for sensitive material recently. In the end I got a purpose-built local system built and am having it remotely maintained. Cost: around 5k a year. I used http://www.skunkwerx.ai, who are US based.

The result is a huge step up from 'full text search' solutions, for my use case. I can have conversations with decades of documents, and it's incredibly helpful. The support scheme keeps my original documents unconnected from the machine, which I own, while updates are done over a remote link. It's great, and I feel safe.

Things change so fast in this space that there did not seem to be a cheap, stable, local alternative. I honestly doubt one is coming. This is not a one-size-fits-all problem.


Google Drive. It doesn't fulfill the "local" criterion, but it works for us (small engineering firm). We synchronize our local file server with GD nightly and use it only for searching. Google is just good when it comes to search.


Thank you all for the comments. Got a lot of good input and ways to think thru the tried and true tools (enjoying ripgrep-all + fzf) plus the standard ai/rag-style tools. I do think there is room for a bridge or an integrated way to pipe in similarity / embedding into the ripgreps of the world. Maybe something close to fzf’s piping model. Will explore if I have some time.


Use Recoll on Linux or File Locator Lite on Windows to do RegEx searches. Design the RegEx searches with GPT or llama running locally (or write them yourself).


> Ask HN: I have many PDFs – what is the best local way to leverage AI for search?

Adobe Reader can search all PDFs in a directory. They hide this function though.


Honestly?

ocrmypdf + ripgrep-all, or recoll (a GUI+CLI Xapian wrapper) if you prefer an indexed version. For plain full-text search, currently nothing gives better results. Semantic search is still not there, and Paperless-ngx, TagSpaces and so on demand way too much time per document added to be useful at a certain scale.
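
The ocrmypdf step is scriptable as well. A minimal sketch using its Python API (it wraps Tesseract), adding a text layer to every scan so ripgrep-all or recoll can see it; paths are just examples:

  from pathlib import Path
  import ocrmypdf  # pip install ocrmypdf; Tesseract must be installed

  out_dir = Path("ocr")
  out_dir.mkdir(exist_ok=True)
  for pdf in Path("scans").glob("*.pdf"):
      # skip_text leaves pages that already contain text untouched
      ocrmypdf.ocr(pdf, out_dir / pdf.name, skip_text=True)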

My own personal version is org-mode: I keep all my stuff org-attached, so instead of searching the PDFs I search my notes linking to them, a kind of metadata-rich, taggable, quick full-text search. Even though org-ql is there, I almost never use it, just org-roam-node-find and counsel-rg on the notes. Once set up, this allows quick enough manual and variously automated archiving, but doing it on a large home directory is very long and tedious manual work. For me it's worth doing since I keep adding documents and using them, but it took more than a year to be "almost done enough" and it's still unfinished after 4 years.


On MacOS, use HoudahSpot. It’s awesome. Not AI, but as others have said, you likely want plain text search, not “AI” or a chatbot, for something like this.

If you’re having trouble thinking of search terms to plug into HoudahSpot (or grep etc.) then I suppose you could ask a chatbot to assist your brainstorming, and then plug those terms into HoudahSpot/grep/etc.


A cheap but full-featured solution for batch AI processing of PDF documents on your local machine is the Aspose.PDF ChatGPT plugin:

https://products.aspose.org/pdf/net/chat-gpt/


I tried Google's NotebookLM for this use case and was very pleased with the experience.

If you trust Google that is.


NotebookLM is currently US only, limited to 20 documents (sorry, 'sources') per notebook, and only works with Google Drive.


Not offline. I do not trust anyone with some of this data, because I have contractually promised not to share it.


The best tool I found for a similar goal was Devonthink. I've been using it for many years and am quite happy with it.

There is no AI or any other modern fad, but full-text search (including OCR for image files inside PDFs) works great.


Devonthink could do this with a tiny model to translate your natural-language search prompts into its syntax and your folder/tag tree.

If you're okay with some false positives, Devonthink would work as is, actually.


I used to use this, but the LLM approach allows for much deeper interactions. Not "find all times I've typed X" but

"act as an expert in Y, looking across all times I've typed X, summarize my changing position over thee years, and suggest other terms that have a similar pattern of change, in a list."

The kind of thing I used to give to an intern over a month, with results that are not far off what that intern produced...


I’d love to see that built in as well.

The devonthink crab-bucket community is hostile to any use of LLMs but I don’t think they understand how the app would structure and augment the input and output to keep it from returning fanciful output.


Use Python to dump the PDF to text, then use Llama 3 (8B) to parse it.


The "Using python to dump the PDF to text" dramatically underestimates how hard this is.

Tables and especially multi-column PDFs often need one-off handling and - worse - you don't know when one is being misparsed until you start getting weird search results. At that point you need to debug your entire search pipeline, which isn't fun!


Tangentially related, but you can try https://macro.com/ for reading your PDFs.


Check out my app "Chofane", it does exactly that: local batch OCR for PDFs and PNG files. I am just launching it. You can export results to JSON and CSV and do text-based search on the results. https://chofane-landing.pages.dev/



You can tabulate the info; 90% of it will be from a single source. There are online tools that sort Costco and Walmart bills!


I use Curiosity AI. Good interface.


You use a tool intended for accurate searching, which is not AI-based.


OCR and pattern matching on text are computationally cheap and incredibly easy to do. For example, tax documents often bear the name of your government's tax authority, which presumably you are familiar with and can search for. They also tend to have years on them.
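
As a rough sketch of that kind of pattern matching over extracted text; the category names and patterns below ("IRS", "kWh", and so on) are purely illustrative and would need tuning for your own documents:

  import re

  RULES = {
      "tax": re.compile(r"\b(?:IRS|Internal Revenue|Form 1040|W-2)\b", re.I),
      "utility bill": re.compile(r"\b(?:kWh|electricity|billing period)\b", re.I),
  }
  YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

  def classify(text: str) -> tuple[list[str], list[str]]:
      """Return (matching categories, years mentioned) for one document's text."""
      labels = [name for name, rx in RULES.items() if rx.search(text)]
      years = sorted(set(YEAR_RE.findall(text)))
      return labels, years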


This.

I wanted to convert some equations from a maths textbook back into LaTeX, and I found that taking a screenshot and feeding the image into an LLM service that supports images was a good way to do that.


getcody.ai


The OP wanted a local method, and this does not seem to be local.





