I’ve got this bookmarked and use it to find hyper-niche material on numerical modelling. What it finds on solvers, mesh generation, and optimization methods is so much better than anything I could ever find on Google. Stuff from the 80s and 90s. I’ve found sites written by professionals that I would never find on Google. As someone who doesn’t just take the commercial package off the shelf, getting to that knowledge, maybe even finding Fortran code examples, is extremely valuable.
Search "numerical solver" in any of the major search engines. What do you see?
Now do that on Marginalia: you find https://www.scilab.org within the first 10 results, which is open-source numerical solver software, which gets me to code, which gets me to examples to use.
To be nuanced, I could change my search to "open-source Runge-Kutta numerical solver examples" or something better, but why? Give me the weird, deeply technical stuff first.
Maybe more of an HN example: when I wanted to learn about load balancers, I just searched "load balancers."
Google: lots of SEO crap (with soooo many ads), YouTube videos?, and AWS commercials at the top. No idea where I'm going.
Marginalia: a Linux wiki, nginx (official documentation), a couple of blogs by professionals on the topic. Yeah, there are some things in here that ain't great.
But if I compare apples to apples, the first ten results are just so much better. I'd say 8/10 in Marginalia for this example are great to good, while the first 10 on Google (things I could click on) are by companies that don't teach me anything or have articles full of ads.
I don't think your examples are very good. These are very generic search terms, and whether the results are good or bad very much depends on the person searching and what they are looking for.
Underspecified queries ("generic search terms") are actually one of the tricky problems in search. The way Marginalia Search deals with them caters to a particular type of audience and set of use cases. I don't think that's wrong. Seems silly to try to cater to everyone. In that scenario you're more likely to not cater to anyone.
I disagree with this feature. Besides the predictable privacy argument, having a search engine transparently serve results according to your tastes makes it really difficult to find things that are new and outside of your existing preferences. It drains the web of serendipity, makes every website feel the same.
Yep. Curating search for an individual makes results worse for a certain class of queries. Opposite of what GP was advocating for, without realising Google already does this.
Yeah. Some of them are a bit rough still, but that's the general idea. Instead of trying to guess what sort of content the user wants, it seems like it makes sense to just give them the option to express that.
The recipe filter is approaching something I'd want to explore further, to be able to provide contextual information outside of the search query.
They're actually perfect examples for this thread.
We've already constrained "what we're looking for" to be "niche expert pages" further up thread. If we're seeing niche expert pages even for generic search results, that's probably a good indication that the search engine behaves the way RandomWorker is describing.
yup, this is definitely a lot like how it used to be.
@unpopularop can't find "all quiet on the western front book movie differences". well, you couldn't do that with AltaVista either in 1998.
however if you just type "all quiet on the western front" you get a ton of niche obscure sites talking about it. literally someone's personal blog page.
type in 'polytopes' and you get a bunch of university papers and code sites.
"rust generics" - again, it's a bunch of mailing list discussions, blogs, rust discussion groups, personal websites, obscure professional discussions.
this IS how it was back in the day.
my only question is how could this possibly be sustainable financially in the long run.
> my only question is how could this possibly be sustainable financially in the long run.
For now I'm funded by grants and donations, got a few years runway that way.
The actual operational cost is like $100/month for colocation + personal expenses, so what money comes in lasts a surprisingly long time. In the future, we'll see. There do seem to be a lot of people that want this type of thing to exist though, so the hope is that if I polish it even more, further funding will become available from likeminded people, possibly from selling API access to other search engines.
Search is notoriously hard to make money from (outside of ads), though not having a lot of expenses seems like a reasonable path to go.
It sounds like you only need one person (not as deep pocketed as Andrew Carnegie but who has read "gospel of wealth" and agrees with it) to have support for decades if not perpetuity.
Universities traditionally have done this sort of thing by playing golf and naming buildings, but I'm sure in the 21st century there are other models. (Fwiw $2k/yr is below a typical golf membership)
I think as long as you're not setting out to start a tech company with thousands of employees, or branch out into a sector with the word "cloud" in it, you'll be fine. Only unreasonably big ambitions cost billions.
A project is usually on the road to success when it starts with a disclaimer like "just a hobby, won't be big and professional like gnu".
I think a larger concern is how you'll address the Bus Factor going forward.
I recently put some effort into making it possible to run and host the system fairly easily[1]. That said, serving basic search data and operating a search engine are two different things. To do more than index a couple of blogs you inevitably need a fairly deep understanding of the system, probably decent hardware, and so on.
But the long term goal is that this is something that's relatively easy to operate and extend.
I just looked up "transformers intuition" and the results blew my mind. In comparison, Google's results led me to SEO'd websites (mostly Medium) and fancy-looking sites with inferior content. Awesome work Marginalia!
It’s proving a bit harder than anticipated, not because the software can’t handle it, but because the signal to noise ratio of the web isn’t very good; a huge reason why the search engine works relatively well is because of what it doesn’t index.
Congrats on the progress. I don't use Marginalia as much as I should because I'm so used to relying on Google. It's a wonderful project though, and I'll prob use it more since spammy SEO sites and AI-generated answers seem to be getting more prevalent.
Probably some ways away from daily driver material. Optimistically sometime this summer when I'm done with the query and execution stuff it'll start approaching that territory.
Viktor, I'm curious as to whether Common Crawl [0] would be useful to you. It's currently around 100TB and 3.35 billion pages, so it's going to be a long download unless you process it in place on S3. I have no idea what its signal/noise ratio is.
> india test cricket lowest total
> None of the results are good or giving an answer, straight up wrong sites.
The search engine has no ambitions to provide a knowledge graph at this point. It's for finding documents on the internet, rather than answering questions. Answering questions is definitely something one might want, but it often comes at the expense of finding documents.
> raid calculator
> The results are OK but you still have random noise like a Pokemon save/cheat editor page?
The pokemon result was discussing an application called "raidcalc". Seems like a good match, given the search engine does not profile you at all and has no clue about what your interests are.
> all quiet on the western front movie book differences
Hmm, I think there's an upper bound on the query length you hit. Could probably remove this, it's pretty old, an artifact from when the query execution didn't deal with long queries well.
--edit--
Hmm, I increased the limit but the results are still kinda not very good. Although this is definitely squarely within the realm of what I'm working on next, which is query understanding and execution.
Right now the search engine doesn't really know how to group the terms. Like a human being can see that you'd want |all quiet on the western front| in a sequence, preferably in the title or appearing a few times, and 'movie', 'book', and 'differences' should be important to the document, but not necessarily appear in that exact order.
The search engine currently looks for either documents where they all appear in proximity, or all individual words have high tf-idf relevance markers. Not great for this query.
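To make those two strategies concrete, here is a rough Python sketch. It is not Marginalia's actual code: the function names, the whitespace tokenisation, and the smoothed idf formula are all assumptions picked for illustration.

```python
import math
from collections import Counter


def tf_idf(term, doc, docs):
    """Smoothed tf-idf of `term` in `doc` (a token list), given all `docs`.

    The smoothing (+1 terms) is one common convention, assumed here.
    """
    if not doc:
        return 0.0
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log((1 + len(docs)) / (1 + df)) + 1
    return tf * idf


def min_window(terms, doc):
    """Length of the shortest span of `doc` containing every term, else None."""
    needed = set(terms)
    last_seen = {}  # term -> most recent position
    best = None
    for i, tok in enumerate(doc):
        if tok in needed:
            last_seen[tok] = i
            if len(last_seen) == len(needed):
                # Shortest window ending at i starts at the oldest "last seen".
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


doc = "all quiet on the western front book and movie differences".split()
print(min_window(["book", "movie", "differences"], doc))
```

Neither signal knows that the first six words belong together as a title, which is the gap described above: a tight window or high per-word scores can both be satisfied by a page that never mentions the novel.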
I hope this tone comes across correctly as just a suggestion: I get a lot of mileage out of the "Send Feedback" option in DDG, which they claim actual humans do read. It can help move bug reports out of these HN threads into a more context-aware flow, and also makes me feel like any bad outcome has the possibility of improving, unlike systems that don't provide an "I feel bad about this experience" button.
Not yet, the support for long quoted sentences is a bit sketchy. Also within the wheelhouse of what's up next though. Having solid support for manual grouping is pretty much a prerequisite for automatic grouping anyway.
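As a sketch of what manual grouping could look like from the outside (purely illustrative, with made-up helper names `parse_query`, `contains_phrase` and `matches`, and not how the engine is implemented): quoted parts become phrases that must appear in sequence, and everything else is a loose term that can appear anywhere.

```python
import re


def parse_query(query):
    """Split a query into quoted phrases (token lists) and loose terms."""
    phrases, terms = [], []
    for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', query):
        if quoted:
            phrases.append(quoted.lower().split())
        else:
            terms.append(bare.lower())
    return phrases, terms


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def matches(query, text):
    """Phrases must appear in sequence; loose terms may appear anywhere."""
    tokens = re.findall(r"\w+", text.lower())
    phrases, terms = parse_query(query)
    return (all(contains_phrase(tokens, p) for p in phrases)
            and all(t in tokens for t in terms))


q = '"all quiet on the western front" movie book differences'
print(parse_query(q))
```

Automatic grouping would then amount to guessing where the quotes should go, which is why the manual version has to work first.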
This is neat - I found a random website [0] where someone binary patched C&C Tiberian Sun to have IPv6 support, just because. It feels so much like the old web I miss.
For some reason all this makes me reminisce about Fravia's Searchlores [1] which always felt a bit like if Umberto Eco was interested in computers. And the site felt a bit like the library labyrinth from The Name of The Rose where you'd turn some random corner and find something incredible, only to lose it forever later on :D
The problem is specifically that it's sampling uniformly, rather than doing a 'spotify shuffle' that actively eliminates repetition.
That, and the set isn't very large, just a few thousand. If you randomly pick 25 items from a bag of ~3000 total a bunch of times, chances are relatively high that you're going to see repetition.
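A quick back-of-the-envelope sketch of both points, using the 25 and ~3000 figures from above. The `ShuffleBag` class is a hypothetical name for the deal-from-a-shuffled-deck idea, not anything from the actual site.

```python
import random
from math import comb


def p_overlap(total, seen, k):
    """Chance that a uniform sample of k items (no replacement) from `total`
    contains at least one of `seen` items that were shown earlier."""
    return 1 - comb(total - seen, k) / comb(total, k)


class ShuffleBag:
    """Deal items from a shuffled deck; reshuffle only once it runs out."""

    def __init__(self, items, rng=None):
        self._items = list(items)
        self._rng = rng or random.Random()
        self._deck = []

    def draw(self, k):
        out = []
        while len(out) < k:
            if not self._deck:
                # New pass over the whole set. A draw that straddles two
                # passes could still contain a repeat; ignored in this sketch.
                self._deck = self._items[:]
                self._rng.shuffle(self._deck)
            out.append(self._deck.pop())
        return out


# Chance that the second page of 25 repeats something from the first page.
print(round(p_overlap(3000, 25, 25), 2))
```

With uniform sampling the second page already shares an item with the first roughly one time in five, and it only gets worse as more pages have been seen. The shuffle-bag version shows all ~3000 items once before anything comes back.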
Thanks, I'm building a website focused on Metroidvanias. I liked the results so I was thinking I may use it to offer some interesting results on the various game pages.