I have found that embeddings + LLM is very successful. I'm going to make the words up so as not to reveal my work publicly, but I had to classify something into 3 categories. I asked a simple LLM to label it and it was 95% accurate. Taking the min distance from the word embeddings to the mean category embeddings was about 96%. When I gave the LLM the embedding prediction, the LLM was 98% accurate.
There were issues an embedding model might not do well on, whereas the LLM could handle them. For example: these were camelCase words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data).
WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated the embedding values, but these were supposed to go into 2 different categories.
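In case it helps anyone, here's a rough sketch of that two-step setup. `embed` and `llm` are placeholders for whatever embedding API and chat model you're using, and the prompt wording is just illustrative, not what I actually ran:

```python
import numpy as np

def build_centroids(labeled_examples, embed):
    """labeled_examples: dict mapping category name -> list of example strings."""
    return {cat: np.mean([embed(x) for x in examples], axis=0)
            for cat, examples in labeled_examples.items()}

def nearest_centroid_label(text, embed, centroids):
    """Pick the category whose mean embedding is closest (min Euclidean distance)."""
    v = embed(text)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

def classify_with_hint(text, embed, centroids, llm):
    """Give the LLM the embedding-based prediction as a hint, then let it decide."""
    hint = nearest_centroid_label(text, embed, centroids)
    prompt = (
        f"Classify '{text}' into one of: {', '.join(centroids)}.\n"
        f"A nearest-centroid embedding model predicts: {hint}.\n"
        "Answer with the category name only."
    )
    return llm(prompt).strip()
```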
> word Wood dominated the embedding values, but these were supposed to go into 2 different categories
When faced with a similar challenge we developed a custom tokenizer, pretrained a BERT base model[0], and finally trained a SPLADE-esque sparse embedding model[1] on top of that.
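Roughly: a SPLADE-style embedding runs text through a masked-LM head, applies log(1 + ReLU) to the logits, and max-pools over tokens so each vocabulary term gets a weight (most end up zero). A minimal sketch of that pooling, using bert-base-uncased as a stand-in for our custom-pretrained model, which I can't share:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def splade_embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    # Log-saturated ReLU, then max-pool over the sequence: one weight per vocab term.
    weights = torch.log1p(torch.relu(logits))
    return weights.max(dim=1).values.squeeze(0)  # (vocab_size,), mostly zeros
```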
Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I have been working on embeddings for a while.
For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?
> Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I can share what I'm able to publicly. The first thing we ever do is develop benchmarks, given the uniqueness of the nuclear energy space and our application. In this case that's FermiBench[0].
When working with operating nuclear power plants there are some fairly unique challenges:
1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...
2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient, especially in terms of RAM and storage, which matters when factoring in budgetary constraints. We're already dropping in eight H100s (minimum), so the hardware bill creeps up fast...
3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.
4. Interpretability. It's nice to be able to peek at the embedding and get something out of it at a glance (quick example below).
So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.
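As a concrete example of point 4, reading off the top-weighted vocabulary terms of a sparse embedding is usually enough to see what it latched onto. Again just a sketch, reusing the `splade_embed`/`tokenizer` stand-ins from my earlier comment:

```python
import torch  # continuing the splade_embed / tokenizer sketch above

def top_terms(sparse_vec: torch.Tensor, tokenizer, k: int = 10):
    """Return the k highest-weighted vocabulary terms in a sparse embedding."""
    values, indices = torch.topk(sparse_vec, k)
    return [(tokenizer.convert_ids_to_tokens(int(i)), round(float(v), 3))
            for i, v in zip(indices, values) if v > 0]

# top_terms(splade_embed("reactor coolant pump seal leakage"), tokenizer)
# -> e.g. [('pump', 2.1), ('seal', 1.9), ('coolant', 1.8), ...] (illustrative numbers)
```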
I should also add that some aspects of this (like the BERT pretraining) are fairly compute-intensive. Fortunately we work with the Department of Energy's Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).