Show HN: My recommendation engine for Hacker News (julienc.me)
314 points by julien040 on June 19, 2023 | 58 comments
Hi! I’m Julien and I built a recommendation engine for Hacker News.

I feel like this website is a gold mine. Every day, I find some very interesting stories about a topic. And sometimes, I want to find other stories covering that same topic but I can’t.

Hacker News has years of awesome discussions and resources. Unfortunately, I think HN Algolia isn’t helpful for searching these old threads. As a student, I want to learn a lot from this website.

This is why I created HN Recommend. Input a sentence or the URL of an article, and get the most popular and similar posts from Hacker News.

As for the technical details, I've computed the embeddings of over 100,000 articles from HN and indexed them using Faiss. I wrote a blog post with a deeper explanation.

Source code: https://github.com/julien040/hn-recommendation-api

Article: https://julienc.me/articles/Extract_embeddings_Hacker_News_a...

Project: https://hn-recommend.julienc.me
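
To make the embed-then-index idea concrete, here is a minimal sketch of nearest-neighbour search over embeddings. It is a brute-force stand-in written in plain Python, not Faiss and not the project's actual code; the story IDs and the 3-dimensional vectors are made up for illustration. It ranks by squared Euclidean distance, which is the distance the API is said to return later in the thread.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(index: Dict[str, Vector], query: Vector, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k entries closest to `query`, nearest first."""
    scored = [(story_id, squared_l2(vec, query)) for story_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical toy "embeddings"; real ones have far more dimensions.
toy_index = {
    "memory-allocation": [0.9, 0.1, 0.0],
    "garbage-collection": [0.8, 0.2, 0.1],
    "sourdough-recipes": [0.0, 0.1, 0.9],
}

results = nearest(toy_index, query=[1.0, 0.0, 0.0], k=2)
for story_id, distance in results:
    print(f"{story_id}: {distance:.2f}")
```

A library such as Faiss exists because this linear scan gets slow as the corpus grows; the ranking logic stays the same.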




Aww, thank you for using my Memory Allocation post as the placeholder text. <3


I often wish I could sort Hacker News into two categories. Actual software/tech/STEM and everything else. I think both are interesting, but often, the niche tech stuff gets drowned out fast. So this is great for that :-)


I just released a new update, thanks to everyone's feedback. Now, you can sort results by relevancy, age, or score using the dropdown.


This is a joy to use, and it also fits very nicely with the other highly ranked post by Nielsen group. Kudos!


Which other post? There is so much churn on HN that it’s hard to know which post you are referring to



Yes that was it, sorry, should have included the link


This is great. I often come across some HN post on a topic I am interested in and then want to go look at other posts in the same topic cluster to expand my exposure. This looks awesome for that.

I don't know if it would be useful or even work, but is it possible to let the user adjust the vector distance threshold and then apply the other sorting parameters to the results? Eg. if I want to go broader, but then sort by high score or something so I see popular posts within an expanded (but still relevant) cluster?
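
A rough sketch of that idea, assuming each result is a dict with hypothetical `distance` and `score` fields (the real API's field names may differ): first widen or narrow the cluster with a distance cutoff, then sort whatever survives by another field.

```python
from typing import Dict, List


def broaden_then_sort(results: List[Dict], max_distance: float,
                      sort_key: str = "score") -> List[Dict]:
    """Keep results within max_distance, then order them by sort_key, highest first."""
    kept = [r for r in results if r["distance"] <= max_distance]
    return sorted(kept, key=lambda r: r[sort_key], reverse=True)


# Made-up sample data for illustration only.
sample = [
    {"title": "A", "distance": 0.20, "score": 150},
    {"title": "B", "distance": 0.35, "score": 900},
    {"title": "C", "distance": 0.60, "score": 2000},
]

narrow = broaden_then_sort(sample, max_distance=0.25)  # tight cluster
broad = broaden_then_sort(sample, max_distance=0.40)   # expanded cluster

print([r["title"] for r in narrow])
print([r["title"] for r in broad])
```

With the wider cutoff, the popular post "B" enters the cluster and rises to the top, while the very distant "C" stays out regardless of its score.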


Check out https://askhn.ai

The content is ranked by how people discuss the topics and who discusses them

If you just do embeddings on posts, you might miss relevant content. When people who have knowledge of AMD discuss Intel and believe that content is relevant to AMD, the content will be ranked.


I thought about an algorithm with weights adjustable by the user. Currently, the API returns a field with the distance between the post and the query (the square of the Euclidean distance). The interface uses it to rank results by relevance.

Perhaps I can compute a score for each story, where each field has a weight, and rank the results using this score. For example, the score could be 0.2 x score + 0.1 x comments + 1/distance - timestamp/10^9. The stories with the highest score would be shown first, and the weights (0.2, 0.1, 10^9) could be adjusted by the user, as some might prefer recency while others prefer popularity.
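
Here is a minimal sketch of that formula with user-adjustable weights. The `Story` class, the function names, and the `eps` guard against division by zero are assumptions added for illustration; only the formula itself comes from the comment above. Note that, as written, the `- timestamp/10^9` term lowers the score of newer stories, so a recency-minded user would probably want to flip its sign.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Story:
    title: str
    score: int        # HN points
    comments: int     # number of comments
    distance: float   # squared Euclidean distance to the query
    timestamp: int    # Unix time in seconds


def combined_score(s: Story, w_score: float = 0.2, w_comments: float = 0.1,
                   time_scale: float = 1e9, eps: float = 1e-9) -> float:
    """0.2 x score + 0.1 x comments + 1/distance - timestamp/10^9, with tunable weights."""
    return (w_score * s.score
            + w_comments * s.comments
            + 1.0 / (s.distance + eps)   # eps (an assumption) avoids dividing by zero
            - s.timestamp / time_scale)


def rank(stories: List[Story], **weights: float) -> List[Story]:
    """Sort stories by combined score, highest first."""
    return sorted(stories, key=lambda s: combined_score(s, **weights), reverse=True)


# Made-up stories for illustration only.
stories = [
    Story("popular but distant", score=500, comments=200, distance=0.50, timestamp=1_500_000_000),
    Story("close but quiet", score=120, comments=10, distance=0.10, timestamp=1_600_000_000),
]

ranked = rank(stories)
popularity_blind = rank(stories, w_score=0.0, w_comments=0.0)

print([s.title for s in ranked])
print([s.title for s in popularity_blind])
```

With the default weights, popularity dominates; zeroing the popularity weights lets the closer match win, which is the kind of preference a user could express through the interface.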


It might be useful to pose this problem in terms of a precision vs. recall curve.


Hmm I tried searching "elixir" and found nothing related to the language. HN Algolia gives me exactly what I want. On what basis do you say it's "not helpful"?


Yes, the search doesn't work very well for a single word. Try entering a URL about Elixir, like this: https://hn-recommend.julienc.me/?q=https%3A%2F%2Fnews.ycombi...

I may have used the incorrect term. HN Algolia is effective for searching for a particular story. However, I am unable to utilize it to find related posts on the same topic that do not contain the same words.


Out of curiosity about the word vectorization algorithm: why does one word not perform as well? What's the cause or rationale?


It's pure speculation, but article embeddings are computed using 512 tokens, which is roughly equivalent to 400 words. I think that a single word does not give the model enough context to understand the query.


Hey Julien. I love the product, but the search doesn't seem to work well for me. For example, I looked up Tailwind and got plenty of results, but none of them actually involved Tailwind.

Maybe a tagging solution is the way? If you determine a set of popular keywords for a topic and filter around those, you can offer more relevant results. With some sort of public tagging system, you can also have SEO-friendly pages around tags and get people browsing stuff they wouldn't normally search for.


At first, the website concept focused on getting posts similar to a URL. Querying with text didn't yield relevant results.

Your solution appears better suited for this use case. Thank you.


What I really need for HN (and any other news feed, for that matter) is something like Google Discover, i.e. a content-based recommendation system with some sort of feedback mechanism.

That way I would get information relevant to me (which I can skip, visit, like, or dislike) whether or not it's popular. That last point is important because the HN home page doesn't give you that, and most posts can get lost in oblivion just because the first few folks did not find them interesting.


HN needs a simple feature: a weekly digest view that shows the top 30 most commented posts (it should completely ignore flags and votes).


You mean like the one that's emailed to me every week?

https://hackernewsletter.com


Thanks, I was considering something like this, as I used IFTTT to send me weekly top threads from certain subreddits, but now with Reddit going south…


Please sort by recency. Otherwise you see 13-year-old articles, most of them obsolete or irrelevant to the current situation.


I was worried that sorting by recency would give less relevant results. Perhaps I should add a threshold to exclude posts that are too old.


You can now sort by recency. I hope this helps.


Very fast turnaround. Kudos! Works very nicely now.

Try this: https://hn-recommend.julienc.me/?q=sf%20crime with the Newest filter vs. the Oldest filter. (BTW, the default Relevance filter gives only tangentially relevant results for this query, whereas Newest and Oldest are on point.)


Love it.

This response is very reactive heavy, whereas it’s Elixir I’m more interested in.

But well done on the execution. It does exactly what it states.

I’ve bookmarked.

I often search HN for additional articles and discussions based on something I’ve just read. Next time I’ll use this tool.


Great project. I learned about the Faiss library. Out of curiosity, did you also try it with Doc2Vec?


I didn't try Doc2Vec. I wanted a hosted solution because I wouldn't have been able to compute all this locally (more than 100,000 posts).

If you tried it, did you get good results with it? I may use it in future projects.


Yes, I am using it on a not so small dataset (roughly 1 million docs) and the output is a fairly efficient model. I am using gensim with pre-trained word vectors. New docs can be inferred via .infer_vector().

Overall my approach is less automated than what I have seen in your codebase so it’s likely a bigger investment. I am happy to share more.


It's very interesting. I may try it in the future.


The blog post linked on GitHub was a nice walkthrough of your method. I'm curious what you think the hit rate was for getting usable text for embeddings from TFA links. 100K is a good-sized corpus, but I wonder how many stories got skipped due to paywalls, 404 links, or other problems?


Thank you for reading it.

The hit rate is low. I've only tried to get embeddings for stories with a score greater than 100. The SQL query "SELECT count(*) FROM story WHERE score > 100;" gives me 155,228 stories, and the corpus size is 108,477 stories.

108,477 / 155,228 = 0.6988236658 (about 70%)

The main problems were 404 links and posts that weren't articles (such as tweets).
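
For anyone who wants to reproduce this kind of figure, here is a small sketch using an in-memory SQLite database. The `story` table and `score` column come from the query above; the `has_embedding` column and the sample rows are hypothetical, since the real schema isn't shown in this thread.

```python
import sqlite3

# In-memory stand-in for the real database; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE story (id INTEGER PRIMARY KEY, score INTEGER, has_embedding INTEGER)"
)
conn.executemany(
    "INSERT INTO story (id, score, has_embedding) VALUES (?, ?, ?)",
    [(1, 250, 1), (2, 130, 0), (3, 101, 1), (4, 40, 0), (5, 999, 1), (6, 100, 0)],
)


def hit_rate(connection: sqlite3.Connection) -> float:
    """Share of stories with score > 100 for which an embedding was obtained."""
    eligible = connection.execute(
        "SELECT count(*) FROM story WHERE score > 100;"
    ).fetchone()[0]
    embedded = connection.execute(
        "SELECT count(*) FROM story WHERE score > 100 AND has_embedding = 1;"
    ).fetchone()[0]
    return embedded / eligible if eligible else 0.0


toy_rate = hit_rate(conn)            # rate on the made-up rows
thread_rate = 108_477 / 155_228      # the figures quoted above

print(f"toy data: {toy_rate:.2%}, thread figures: {thread_rate:.2%}")
```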


A comment about search results: "design system" is related to design, "system design" relates to computing

It seems search takes the two inputs as the same.

Also, search doesn't seem to work when using just 1 word.


Yes, it's an issue. Sadly, I can't fix it. I'm using the closed-source "text-embedding-ada-002" model from OpenAI.

From what I can see, the longer the input, the more accurate the results. Perhaps you can try something longer, like "What is a design system for UI?"


Yes, adding context helps.

Thanks!


This is amazing, thank you for this. Makes finding stuff a lot easier


i like the idea of this but won't remember it because my muscle memory is tuned to news.ycombinator.com. perhaps i can recommend a chrome extension instead of a website?


Thank you for suggesting this.

The API is already made and can be found at https://github.com/julien040/hn-recommendation-api. I don't think it would be too difficult to build a Chrome extension that fetches it.


An iOS share widget would be cool too. Since you support putting the input text in the URL, then maybe someone can make a Workflow for it and share it here.


Are iOS share widgets using Apple Shortcuts? I wish to learn more about this technology, so it would be a pleasure to try building it.


Yeah, my mistake, the app is called "Shortcuts" now. I get confused because it was an app called Workflow that was acquired by Apple [0].

You can use the app itself to make some surprisingly powerful shortcuts, and then share them in some kind of text based serialized form (don't remember the details). I'm sure there are also ways to make them programmatically, but I doubt it would be necessary for this use case.

Seems like you basically want to extract the URL of the HN page, store it in a variable and then append that variable to the URL of your recommendation engine. There are probably more fancy variants of "extract text" that you could use, too - I'm not sure of the details.

[0] https://en.wikipedia.org/wiki/Shortcuts_(app)
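
The "append that variable to the URL" step boils down to percent-encoding the page URL into the `q` parameter, as the links elsewhere in this thread do. A quick sketch in Python rather than Shortcuts; the function name and the example item URL are made up:

```python
from urllib.parse import quote

BASE_URL = "https://hn-recommend.julienc.me/"


def recommend_url(query: str) -> str:
    """Build a recommendation link with the query percent-encoded into ?q=."""
    # safe="" also encodes ":" and "/", so a full URL survives as one parameter.
    return f"{BASE_URL}?q={quote(query, safe='')}"


print(recommend_url("sf crime"))
print(recommend_url("https://news.ycombinator.com/item?id=1"))  # hypothetical item
```

In Shortcuts, a "URL Encode" action on the shared URL should play the same role as `quote` here.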


Here is the v1 of the shortcut: https://www.icloud.com/shortcuts/a7e9d236b35342c5aed1d022801...

For now, it only pushes the shared URL to the recommendation engine. If I have more time, I'll try to find a way to extract the URL from the HN page.


Wow, very fast turnaround time!! I just added it and it works :) Nice job!



Oops, on the API side, there is a check to ensure the text is long enough (5 characters), but I forgot to add this check client-side. Thank you for pointing out the issue.

Try this https://hn-recommend.julienc.me/?q=Golang if you want stories related to Go.

Edit: add link
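
For reference, the missing check is roughly the following, sketched in Python for brevity (the front end would need the equivalent in its own language). It assumes "5 characters" means at least 5, and that surrounding whitespace shouldn't count; both are assumptions, not the API's confirmed behaviour.

```python
MIN_QUERY_LENGTH = 5  # assumed to mean "at least 5 characters"


def is_long_enough(query: str) -> bool:
    """Return True if the trimmed query meets the minimum length."""
    return len(query.strip()) >= MIN_QUERY_LENGTH


for text in ("Go", "Golang", "   Go   "):
    print(repr(text), is_long_enough(text))
```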


I didn't expect embeddings to have such a simple yet useful application, thanks!


One feature I would like a recommender system to have is an explicit ability to jump in and out of filter bubbles or research rabbit holes. Another would be to put yourself in the shoes of someone else, e.g. what content is liked by game developers generally: apart from general gamedev content, what do they like, where do they take inspiration from, etc.

I remember there was a project built on Instagram which allowed a person to view Instagram as it looked to a particular celebrity.


I'm a bit divided on this feature. On one hand, I would like to have it; it would be awesome to see the recommendations of people in different jobs. On the other hand, I'm a bit concerned about privacy. The system must ensure that each group is big enough that no individual's recommendations can leak. I don't want anyone to know exactly what I'm liking and what I'm watching.

If I recall correctly, myCANAL (the French Netflix) used to have a similar feature. You could access the recommendations of personalities of the channel, but it was curated manually.


I searched for a URL I know was posted and it doesn't show it. It shows unrelated articles.


The data is a few weeks old. Do you know when the URL was published?


It's 10 years old.

This search query https://hn-recommend.julienc.me/?q=paul%20graham returns articles that are missing both words of the query.


The website features only stories with a score greater than 100 but I don't think that is the problem.

Unlike HN Algolia, it doesn't match words; it uses embeddings, so stories are matched by similar meaning rather than similar words. To find it, you might try to be more specific, such as "Paul Graham Y Combinator <facts of the article>". I'm sorry HN Recommend doesn't match your use case.


Nit:

> Resources to learn about distributed systems

I thought Murat Buffalo's blog would come up at the top. It's gold, and I'm confident that it was shared on HN as well (maybe a year or two back).

Otherwise neat and useful!


The layout is currently buggy on Firefox.


Hi, are you talking about a problem like this one? https://cln.sh/MFG3DPZn+


Yeah, when there’s no thumbnail.


It's fixed now. Thank you for reporting it.


A time filter is needed



