I've had good results summarizing my documents with a large-context model, then embedding those summaries with a standard embedding model (e.g. e5).
This way I can tune what aspects of the doc I want to focus retrieval on, it's easier to determine when there are any data quality issues that need to be fixed, and the summaries have turned out to be useful for other use cases in the company.
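A rough sketch of that summarize-then-embed flow, under assumptions: `summarize` and `embed` are hypothetical callables you'd back with whatever LLM and embedding model (e.g. e5) you actually use, and `build_index` / `search` are names made up for illustration. The similarity search is plain brute-force cosine, just to show the shape.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(
    docs: dict[str, str],
    summarize: Callable[[str], str],  # hypothetical: wraps a large-context LLM call
    embed: Callable[[str], Vector],   # hypothetical: wraps an embedding model
) -> list[tuple[str, str, Vector]]:
    """Summarize each document, then embed the summary rather than the raw text."""
    index = []
    for doc_id, text in docs.items():
        summary = summarize(text)
        # Keeping the summary around makes data quality issues easy to eyeball.
        index.append((doc_id, summary, embed(summary)))
    return index


def search(index, query: str, embed: Callable[[str], Vector], k: int = 3):
    """Return the k (doc_id, summary) pairs whose summaries best match the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(doc_id, summary) for doc_id, summary, _ in ranked[:k]]
```

The prompt inside `summarize` is where you'd steer which aspects of the doc retrieval focuses on, and the stored summaries can be reused elsewhere.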
Agreed. Esp. if you're gonna call an API anyway, you can call something cheaper than this embeddings model, like 4o-mini, to summarize, then use a small embeddings model fine-tuned for your needs locally.
I was critical about these guys before (not about their quality of work but rather about building a business around embeddings). This work though seems interesting and I might even give it a try, esp if they provide a fine-tuning API (is that on the roadmap?)
I was trying to do this recently for web page summarization. As said below, the token counts would end up over the context length, so I trimmed the HTML to fit just to see what would happen.
I found that the LLM was able to extract information, but it would very often start trying to continue the HTML blocks that had been left open in the trimmed input. Presumably this is due to instruction tuning on coding tasks.
I'd love to figure out a way to do it though. It seems to me that there's a bunch of rich description of the website in the HTML.
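One thing that might be worth trying (a sketch, not something tested against real pages): pull the visible text out with the standard library's `html.parser` before trimming, so the model never sees a dangling open tag. `VisibleText` and `html_to_text` are made-up names, and the budget here is in characters rather than tokens, which is an assumption to keep the example dependency-free. It does throw away attributes and structure, so some of that rich description is lost.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping script/style/noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Collapse runs of whitespace inside each text node.
            self.parts.append(" ".join(data.split()))


def html_to_text(html: str, max_chars: int) -> str:
    """Extract visible text first, then trim, so no tag is left half-open."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.parts)[:max_chars]
```

Trimming after extraction means the cut lands in plain text, so there is no markup for the model to try to complete.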
Can second Statistical Rethinking though, if you have the basics of stats and want to learn it again from a very different, more causal/Bayesian point of view.
The book Linchpin talks about this and calls it "the resistance": the feeling of avoidance you get when you're close to shipping something, or even when sitting down to start an ambitious project.
I've found it useful to have a name for it so I can recognize when I'm falling prey to "the resistance" and get myself to stop procrastinating or to "just ship it".
I suspect Linchpin / Godin has "borrowed" the idea of Resistance from Steven Pressfield's excellent book The War Of Art from 2002. That book is aimed at writers, but applies to anyone creative.
Right, I was confused when the article said that the parameters other than the moon and sun ones track other astronomical variables. Surely they are also (or potentially primarily) modeling geological and hydrological variables.
There are two intertwined "sets" of effects to be modeled:
Primarily Earth + Moon, with a secondary twist of Sun, and a layered decline of precession | orbit wobble, lesser effects (the astro forces),
And then the ground effects; shaping around headlands, sloping of seafloors, funnelling through channels, etc. with a rinse and repeat cycle for sea areas that are "chained" backwards from the main flux via multiple bays and estuaries (internal bodies of water large enough to have their own tides via moon gravity while also connected with a delay to an outer ocean via a long channel, etc.).
Fun stuff - I primarily worked with exploration geophysics but dabbled a little in tides and ocean levels across Australia.
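As a toy illustration of how those two sets combine, here is the usual harmonic picture: sea level as a sum of cosines, one per astronomical constituent, with the local ground effects folded into each constituent's amplitude and phase. The M2 and S2 periods are the standard values; the amplitudes and phases below are made up for illustration and are not from any real site.

```python
import math

# (name, period in hours, amplitude in metres, phase lag in radians)
# Periods are the standard astronomical values; amplitudes and phases
# are illustrative placeholders that a real site would fit from observations.
CONSTITUENTS = [
    ("M2", 12.4206012, 1.00, 0.0),  # principal lunar semidiurnal
    ("S2", 12.0, 0.46, 0.0),        # principal solar semidiurnal
]


def tide_height(t_hours, constituents=CONSTITUENTS, mean_level=0.0):
    """Sea level at time t as a sum of harmonic constituents."""
    return mean_level + sum(
        amp * math.cos(2 * math.pi * t_hours / period - phase)
        for _name, period, amp, phase in constituents
    )
```

With both phases at zero, the two terms line up at t = 0 (a spring tide) and drift into opposition roughly a week later (a neap). A bay chained off the open ocean through a long channel would show up as a smaller or larger amplitude and a bigger phase lag for the same periods.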
Slack groups have filled in the meetup space in my life; mlops.community and Locally Optimistic are two of the best for what it sounds like you're looking for.
One downside of Milvus is that version 1 doesn't do filtering (necessary for most search applications) and version 2 is significantly slower.
Google's vector nearest neighbors offering, Weaviate, and Vespa are much better options if you're expecting to extend to more realistic workloads.
From an accessibility perspective it's a disaster too. My granny is visually impaired, but loves watching TV. She's been able to get by for decades by memorizing the commands needed to get where she wants to go (she had a list of Nintendo cheat code-like instructions on how to get to various channels she wanted on the satellite, e.g. down-down-left-left-enter gets her to her show).
We tried to get her into Netflix, but the menus change far too often for this strategy to be of any use. Of course you can use the screen reader/narrator, but you can imagine how frustrating it is trying to find the carousel you're looking for by waiting for the screen reader to tell you. It's hard enough doing it visually!!
Interestingly, Leitrim and the other counties in the north west he mentions have been high on the list of places people have escaped to from Dublin during the pandemic. Many people are now living there and working remotely for Dublin companies, and there are far too _few_ houses there now, and house prices have skyrocketed.