Hacker News
Marginalia: 3 Years (marginalia.nu)
274 points by latexr 11 months ago | 44 comments



I’ve got this bookmarked and use it to find hyper niche materials on numerical modelling. The stuff it finds on solvers, mesh generation, and optimization methods is so much better than anything I could ever find on Google. Stuff from the 80s and 90s. I’ve found sites written by professionals that I would never find on Google. As someone who doesn’t just take the commercial package off the shelf, getting to that knowledge, and maybe finding Fortran code examples, is extremely valuable.


Do you have an example of a niche expert page you found useful, which is easier to find on Marginalia than Google?


Search "numerical solver" in any of the major search engines. What do you see?

Now do that on Marginalia; within the first 10 results you find https://www.scilab.org, an open-source numerical solver, which gets me to code, which gets me to examples to use.

To be nuanced, I could change my search to "open-source Runge-Kutta numerical solver examples" or something better, but why? Give me the weird deeply technical stuff first.

Maybe more of an HN example: when I wanted to learn about load balancers, I just searched "load balancers."

Google; lots of SEO crap (with soooo many ads), youtube videos?, and AWS commercials at the top. No idea where I'm going.

Marginalia; a linux wiki, nginx (official documentation), a couple of blogs by professionals on the topic. Yeah, there are some things in here that ain't great.

But if I compare apples to apples, the first ten results are just so much better. I'd say 8/10 in Marginalia for this example are great to good, while the first 10 on Google (things I could click on) are by companies that don't teach me anything or have articles full of ads.


Kagi gives me scilab as result #13, and that's probably because I raise arxiv and stackoverflow results, which then rank high.


I don't think your examples are very good. These are very generic search terms, and whether the results are good or bad depends very much on the person searching and what they are looking for.


Underspecified queries ("generic search terms") are actually one of the tricky problems in search. The way Marginalia Search deals with them caters to a particular type of audience and set of usecases. I don't think that's wrong. Seems silly to try to cater to everyone. In that scenario you're more likely to not cater to anyone.


> particular type of audience and set of usecases

For signed-in accounts (which is pretty much the ~3bn Android and/or Chrome users), Google can predict what the user might prefer and yet...


I disagree with this feature. Besides the predictable privacy argument, having a search engine transparently serve results according to your tastes makes it really difficult to find things that are new and outside of your existing preferences. It drains the web of serendipity and makes every website feel the same.


Yep. Curating search for an individual makes results worse for a certain class of queries. Opposite of what GP was advocating for, without realising Google already does this.


Ah, this helped clarify my doubts. Thanks.


Your "domain" filters seem like a good solution to this?


Yeah. Some of them are a bit rough still, but that's the general idea. Instead of trying to guess what sort of content the user wants, it seems like it makes sense to just give them the option to express that.

The recipe filter is approaching something I'd want to explore further, to be able to provide contextual information outside of the search query.

https://search.marginalia.nu/search?query=cookie+recipe


They're actually perfect examples for this thread.

We've already constrained "what we're looking for" to be "niche expert pages" further up thread. If we're seeing niche expert pages even for generic search terms, that's probably a good indication that the search engine behaves the way RandomWorker is describing.


Exactly: marginalia results are good for geeks; I search for me, not for J Random Consumer.


It’s result 10 on Kagi.


The quoted phrase finds nothing, and unquoted there's also no scilab. Then I realized I'd made a typo (it's numerical, not numeric), and then I get it

Google was ~ top 60, which for such a generic term seems fine, not much scrolling down


yup this is definitely a lot like how it used to be.

@unpopularop can't find "all quiet on the western front book movie differences". well you couldn't do that with AltaVista either in 1998.

however if you just type "all quiet on the western front" you get a ton of niche obscure sites talking about it. literally someone's personal blog page.

type in 'polytopes' you get a bunch of university papers and code sites.

"rust generics" - again, it's a bunch of mailing list discussions, blogs, rust discussion groups, personal websites, obscure professional discussions.

this IS how it was back in the day.

my only question is how could this possibly be sustainable financially in the long run.


> my only question is how could this possibly be sustainable financially in the long run.

For now I'm funded by grants and donations, got a few years runway that way.

The actual operational cost is like $100/month for colocation + personal expenses, so what money comes in lasts a surprisingly long time. In the future, we'll see. There do seem to be a lot of people that want this type of thing to exist though, so the hope is if I polish it even more, further funding will become available from likeminded people, possibly selling API access to other search engines.

Search is notoriously hard to make money from (outside of ads), though not having a lot of expenses seems like a reasonable path to go.


It sounds like you only need one person (not as deep pocketed as Andrew Carnegie but who has read "gospel of wealth" and agrees with it) to have support for decades if not perpetuity.

Universities traditionally have done this sort of thing by playing golf and naming buildings, but I'm sure in the 21st century there are other models. (Fwiw $2k/yr is below a typical golf membership)


I think as long as you're not setting out to start a tech company with thousands of employees, or branch out into a sector with the word "cloud" in it, you'll be fine. Only unreasonably big ambitions cost billions.

A project is usually on the road to success when it starts with a disclaimer like "just a hobby, won't be big and professional like gnu".

I think a larger concern is how you'll address the Bus Factor going forward.


> I think a larger concern is how you'll address the Bus Factor going forward

I can't speak to how much energy it is to go from code to serving requests, but FWIW the code is AGPLv3 and seems to be updated regularly https://github.com/MarginaliaSearch/MarginaliaSearch/blob/v2...


I recently put some effort into making it possible to run and host the system fairly easily[1]. That said, serving basic search data and operating a search engine are two different things. To do more than index a couple of blogs you inevitably need a fairly deep understanding of the system, probably decent hardware, and so on.

But the long term goal is that this is something that's relatively easy to operate and extend.

[1] https://www.youtube.com/watch?v=PNwMkenQQ24 (quick install and demo)


I just looked up "transformers intuition" and the results blew my mind. In comparison, Google's results led me to SEO'd websites (mostly Medium) and fancy-looking sites with inferior content. Awesome work Marginalia!


Throwback that gives some indication of how it worked both well and, at the same time, questionably a mere 6 months in: https://news.ycombinator.com/item?id=28550764

Though I think now there's a bit too much reddit and stackexchange and wikipedia stuff in the default filter.


Most important lines for me.

It’s proving a bit harder than anticipated, not because the software can’t handle it, but because the signal to noise ratio of the web isn’t very good; a huge reason why the search engine works relatively well is because of what it doesn’t index.


Congrats on the progress. I don't use marginalia as much as I should because I'm so used to relying on Google. It's a wonderful project though and I'll prob use it more since spammy SEO sites and AI generated answers seem to be getting more prevalent.


Probably some ways away from daily driver material. Optimistically sometime this summer when I'm done with the query and execution stuff it'll start approaching that territory.


Viktor- I'm curious as to whether Common Crawl [0] would be useful to you. It's currently around 100TB and 3.35 billion pages, so it's going to be a long download unless you process it in place on S3. I have no idea what its signal/noise ratio is.

[0] https://commoncrawl.org/overview


Cool engine. Going to check out source soon but "ROME2D16-2T" returned relevant results from esoteric sources. Useful.


Tried my last 3 Google searches

india test cricket lowest total > None of the results are good or giving an answer

raid calculator > The results are OK but you still have random noise like a Pokemon save/cheat editor page because it contains the word raid

all quiet on the western front movie book differences > 0 results. Like straight up no hits, an empty page


> india test cricket lowest total > None of the results are good or giving an answer, straight up wrong sites.

The search engine has no ambitions to provide a knowledge graph at this point. It's for finding documents on the internet, rather than answering questions. Answering questions is definitely something one might want, but it often comes at the expense of finding documents.

> raid calculator > The results are OK but you still have random noise like a Pokemon save/cheat editor page?

The pokemon result was discussing an application called "raidcalc". Seems like a good match, given the search engine does not profile you at all and has no clue about what your interests are.

> all quiet on the western front movie book differences

Hmm, I think there's an upper bound on the query length you hit. Could probably remove this; it's pretty old, an artifact from when the query execution didn't deal with long queries well.

--edit--

Hmm, I increased the limit but the results are still kinda not very good. Although this is definitely squarely within the realm of what I'm working on next, which is query understanding and execution.

Right now the search engine doesn't really know how to group the terms. Like a human being can see that you'd want

|all quiet on the western front| in a sequence, preferably in the title or appearing a few times, and 'movie', 'book', and 'differences' should be important to the document, but not necessarily appear in that exact order.

The search engine currently looks for either documents where they all appear in proximity, or all individual words have high tf-idf relevance markers. Not great for this query.
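
For anyone who wants the two signals spelled out, here is a toy sketch. It is not Marginalia's actual ranking code: the function names `tf_idf` and `min_window`, the whitespace tokenization, and the two-document corpus are all made up for illustration. It only shows a textbook tf-idf score and a simple "how close together do all the terms appear" measure.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: term frequency in one document times log(N / df)."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def min_window(terms, doc_tokens):
    """Length of the shortest token window containing every term, or None."""
    needed = set(terms)
    if not needed.issubset(doc_tokens):
        return None
    counts = Counter()
    have = 0      # how many distinct needed terms are inside the window
    left = 0
    best = None
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] += 1
            if counts[tok] == 1:
                have += 1
        # Shrink from the left while the window still covers all terms.
        while have == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            ltok = doc_tokens[left]
            if ltok in needed:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    have -= 1
            left += 1
    return best


if __name__ == "__main__":
    doc = "all quiet on the western front is a book and a movie".split()
    other = "rust generics mailing list".split()
    print(min_window(["western", "front"], doc))
    print(tf_idf("book", doc, [doc, other]))
```

In this toy version, a long query only matches on proximity if every word lands in one tight window, which hints at why a title phrase plus three loosely related words is an awkward fit for either signal on its own.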


I hope this tone comes across correctly as just a suggestion: I get a lot of mileage out of the "Send Feedback" option in DDG, which they claim actual humans do read. It can help move bug reports out of these HN threads into a more context-aware flow, and also makes me feel like any bad outcome has the possibility of improving, unlike systems that don't provide a "I feel bad about this experience" button

If you were thus inclined, https://gitlab.com/glitchtip/glitchtip#glitchtip is the actual open source Sentry implementation which (as far as I know) would enable gluing https://docs.sentry.io/platforms/javascript/user-feedback/#u... to the search results page (that client-side library is still MIT: https://github.com/getsentry/sentry-javascript/blob/7.102.1/... )


Is it possible to just quote the title of the book, old-school style, so it becomes a single phrase?

It is arguably a better UI than handing a barrage of words and hoping the engine does the sense-making.


Not yet, the support for long quoted sentences is a bit sketchy. Also within the wheelhouse of what's up next though. Having solid support for manual grouping is pretty much a prerequisite for automatic grouping anyway.


This is neat - I found a random website [0] where someone binary patched C&C Tiberian Sun to have IPv6 support, just because. It feels so much like the old web I miss.

For some reason all this makes me reminisce about Fravia's Searchlores [1] which always felt a bit like if Umberto Eco was interested in computers. And the site felt a bit like the library labyrinth from The Name of The Rose where you'd turn some random corner and find something incredible, only to lose it forever later on :D

[0] http://ts.sesse.net/ [1] https://www.biostatisticien.eu/www.searchlores.org/indexo.ht...


I've been impressed by the results I see there. And you've chosen a sick name for it.


What's up with the "random site" function? I would expect it to sample uniformly, but it seems to return certain sites over and over.


The problem is specifically that it's sampling uniformly, rather than doing a 'spotify shuffle' that actively eliminates repetition.

That, and the set isn't very large, just a few thousand. If you randomly pick 25 items from a bag of ~3000 total a bunch of times, chances are relatively high that you're going to see repetition.
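
A rough way to check that intuition, as a sketch rather than anything from the actual site: `p_overlap` computes the chance that two independent uniform draws share at least one item, and `shuffled_deck` is one hypothetical way to get a shuffle that avoids repeats until the pool is used up. The numbers 3000 and 25 are the ones mentioned above; the function names are made up for illustration.

```python
import random
from math import comb


def p_overlap(pool_size, sample_size):
    """Chance that two independent uniform draws (each without replacement
    within the draw) have at least one item in common."""
    no_overlap = comb(pool_size - sample_size, sample_size) / comb(pool_size, sample_size)
    return 1 - no_overlap


def shuffled_deck(items, batch_size, rng=random):
    """Yield batches from a shuffled copy of items, reshuffling only when
    fewer than batch_size items are left (leftovers are discarded)."""
    deck = []
    while True:
        if len(deck) < batch_size:
            deck = list(items)
            rng.shuffle(deck)
        batch, deck = deck[:batch_size], deck[batch_size:]
        yield batch


if __name__ == "__main__":
    # Two page loads of 25 sites each out of ~3000.
    print(f"{p_overlap(3000, 25):.3f}")
    batches = shuffled_deck(range(3000), 25)
    print(next(batches)[:5])
```

With uniform sampling, every extra page load adds more chances to hit something already seen, so repeats show up quickly; the deck approach trades that for having to keep some state between draws.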


I love Marginalia!


did a search and first results were from "stack exchange sci fi", i was expecting something more nostalgic



Do you offer an API?


https://api.marginalia.nu/ :-)

Demo key is always under siege though.


Thanks, I'm building a website focused on Metroidvanias. I liked the results so I was thinking I may use it to offer some interesting results on the various game pages.



