Show HN: AskHN (patterns.app)
612 points by kvh on Feb 22, 2023 | 127 comments



As is often true of GPT responses, there's some nonsense interspersed here, e.g. the claim that R has "a more mature package universe" than Python.

I think this is false, but if you're reading quickly, it sounds cogent enough. As Sarah Constantin observed about GPT-2 [0]:

> if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot...The mental motion of “I didn’t really parse that paragraph, but sure, whatever, I’ll take the author’s word for it” is, in my introspective experience, absolutely identical to “I didn’t really parse that paragraph because it was bot-generated and didn’t make any sense so I couldn’t possibly have parsed it”, except that in the first case, I assume that the error lies with me rather than the text. This is not a safe assumption in a post-GPT2 world. Instead of “default to humility” (assume that when you don’t understand a passage, the passage is true and you’re just missing something) the ideal mental action in a world full of bots is “default to null” (if you don’t understand a passage, assume you’re in the same epistemic state as if you’d never read it at all.)

[0] https://www.skynettoday.com/editorials/humans-not-concentrat...


> there's some nonsense interspersed here, e.g. the claim that R has "a more mature package universe" than Python.

As a programmer, I find R hard to use and not very well designed, so I can see why you'd call that nonsense.

But when I was a math student, I found that in some ways R does have "a more mature package universe". There were many math algorithms that I could find packages for in R and not in Python, even as a mere grad student.


Absolutely, for statistics and visualization I think R and its packages are (sometimes) superior. But GPT responses don't generally offer those kinds of nuances; the claim is that the packages are "more mature," period. And it's for good reason that the _most_ mature Python packages, e.g. numpy and pandas, are used by data scientists in production pretty much everywhere.


Amazingly, your comment will eventually be added to the ChatGPT corpus and at some point down the line may be used to add the nuance that's currently lacking :)


Assuming it's not a GPT response


Regarding numpy/pandas: What's the reason outside of them being _in Python_?


I wonder if the "default to humility" heuristic does more harm than good on net, because the people who heed it probably shouldn't, and the ones who should won't.


Default to humility. Do not assume you're so smart that you can skim the text and understand it correctly. Read every word, don't assume that the author is so predictable that you can guess correctly.

Why, it does not sound too arrogant to me.


I think it's important to remember that humans who are not too smart can also sound coherent, yet babble complete nonsense.

My experience with ChatGPT thus far is that it is as intelligent as a very broadly read person who just doesn't reeeally get the complex or nuanced aspects of the content it reads - much like many real humans.


Robin Hanson makes this point in "Better Babblers": http://www.overcomingbias.com/2017/03/better-babblers.html

"After eighteen years of being a professor, I’ve graded many student essays. And while I usually try to teach a deep structure of concepts, what the median student actually learns seems to mostly be a set of low order correlations. They know what words to use, which words tend to go together, which combinations tend to have positive associations, and so on. But if you ask an exam question where the deep structure answer differs from answer you’d guess looking at low order correlations, most students usually give the wrong answer."


Reminds me how when people get criticized on Twitter now, they just assume it’s a bot


[flagged]


Giving credit to someone for the quotation is called a citation, and it's encouraged.


A simple Google search shows that she's likely an expert whose opinion can reasonably serve as a "citation" in one's comment:

> Brief bio: I started out studying math (Princeton AB ‘10, Yale PhD ‘15, focusing on applied harmonic analysis) and then spent some time in the world of data science and machine learning (Palantir, Recursion Pharmaceuticals, Starsky Robotics.)

from: https://srconstantin.github.io/about/


Her essay is from 2019, and I quoted it mainly to say that I am not making an original point, this has been a known problem with LLMs for a while (and I presume it will continue to be).


Or, maybe they want to give credit where it’s due and not plagiarize?


It seems to write in the generic "style" of GPT, instead of in the style I would recognise as a HN poster. Is that because of something baked into how the training process works? It lacks a sort of casualness or air of superiority ;)


There was no training process, this is just running GPT with relevant HN comments as part of the prompt.

If he wanted it to replicate that classic HN feel, he would either have to extend the prompt with additional examples or, better yet, use finetuning.

I guess he could also just randomly sprinkle in some terms like 'stochastic parrot' and find a way to shoehorn Tesla FSD into every conversation about AI.


> “AskHN” is a GPT-3 bot I trained on a corpus of over 6.5 million Hacker News comments to represent the collective wisdom of the HN community in a single bot.

First sentence of the first paragraph on OP's page

EDIT: it's a bit misleading; further down they describe what looks like a semantic-search approach


Scroll a bit further down and you will see

> 7. Put top matching content into a prompt and ask GPT-3 to summarize

> 8. Return summary along with direct links to comments back to Discord user


Ah got it. Perhaps they should edit the intro then, it's misleading.


I agree, that language could be much improved. This is not a GPT-like LLM whose training corpus is HN comments, which I found to be an extremely interesting idea. Instead, it looks like it finds relevant HN threads and tells GPT-3 (the existing model) to summarize them.

To be clear, I think this is still very cool, just misleading.


Soon we will see language style transfer vectors, akin to the image style transfer at the peak of the ML craze 5-10 years ago -- so you will be able to take an HN snark vector and apply it to regular text. You heard it here first ;)


Joking aside, that does seem like it would be very useful. Kind of reminds me of the analogies that were common in early semantic vector research - the whole "king - man + woman = queen" thing. Presumably that sort of vector arithmetic is still valid on these new LLM embeddings? Though it would still only be finding the closest embedding in your dataset; it wouldn't be generating text guided by the target embedding vector. I wonder if that would be possible somehow?
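
For what it's worth, a minimal sketch of that classic analogy arithmetic, assuming you already have word vectors (word2vec/GloVe style) in a dict of numpy arrays - whether the same trick transfers cleanly to modern LLM embeddings is exactly the open question:

    import numpy as np

    def analogy(vecs, a, b, c, topn=1):
        # solve a - b + c ~= ?, e.g. king - man + woman ~= queen
        target = vecs[a] - vecs[b] + vecs[c]
        target = target / np.linalg.norm(target)
        sims = {
            w: float(v @ target) / np.linalg.norm(v)  # cosine similarity
            for w, v in vecs.items()
            if w not in (a, b, c)  # conventionally exclude the query words
        }
        return sorted(sims, key=sims.get, reverse=True)[:topn]

    # analogy(vecs, "king", "man", "woman")  ->  hopefully ["queen"]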


Hmm. If you're willing to be stuck in time at 2016, there's https://zenodo.org/record/45901

Build a model off of that?


Last year (pre the ChatGPT bonanza) I was using GPT-3 to generate some content about attribution bias, and the responses got much spicier once the prompt started including typical HN poster lingo, like "10x developer":

https://sonnet.io/posts/emotive-conjugation/#:~:text=I%27m%2...

My conclusion was that you can use LLMs to automate and scale attribution bias.

We did it guys!


To truly capture the HN experience, the user should provide a parameter for the number of "well actually"s they want to receive. So the initial response should demonstrate clear expertise and make a great, concise point in response to the question, and then start the cascade of silly nitpicking.


I think you'll find "I think you'll find" trumps "well actually".

;)


I wish the results were reversed, so I could "well actually" your comment, but 'site:news.ycombinator.com "well actually"' gives ca. 4k results in Google and 'site:news.ycombinator.com "I think you'll find"' gives close to 17k results, so you appear to be right.


Well, "it turns out that" beats both, with about 26k results ;)


site:news.ycombinator.com "in my experience" 120K results


IANAL: unfortunately only 10.6k results, thought I had a winner for a second.


I am mildly disappointed that none of the phrase pitches in this thread were phrased with the given pitch.


> ii. Compute embeddings and similarity and choose top K comments closest to question

> iii. Put top matching comments into a prompt and ask GPT-3 to answer the question using the context

It depends on the prompt used to ask GPT the question. A prompt that instructs GPT to write like an HN poster should fix that.
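
Something like this hypothetical prefix, for instance (wording entirely my own, just to illustrate):

    You are a veteran Hacker News commenter. Using only the comments
    below, answer the question. Be terse, lightly contrarian, and feel
    free to question the premise of the question itself.

    Comments:
    {top_k_comments}

    Question: {question}
    Answer: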


There also needs to be at least one question mark at the end of a statement?


Now that you've said it, it will train itself on that while it learns from your comments ;-)


Am I correct in understanding that this doesn't actually generate answers based on HN, but instead finds semantically-near comments and sends them verbatim to GPT to summarize? Seems like a good enough hack, though I'd love a detailed writeup of how to actually specialize an existing LLM with additional training data (like HN).


Technically it does give a specific answer to the question, but it is based on the semantically similar comments (and the question).

The thing people don't realize is that right now there is a very large gap between the capabilities of a few models, including OpenAI's most recent ones, and most other LLMs. So while there are several options for actually training or fine-tuning open models, none of them have language understanding and generation capabilities at the level of those new OpenAI models.

As far as I know.


The summary itself is still generated, but the prompt contains all the context needed for summarization.

It's very difficult to otherwise finetune existing LLMs. GPT itself is closed-source and doesn't allow for finetuning (except via an opaque API and with limited amounts of data). Other open models are either very difficult to load in memory and/or simply not as expressive as GPT.


Agreed. I think the better approach is to do some custom tuning, but that becomes cost-prohibitive very quickly. It's not really much different from Algolia with a minor GPT-3 integration, but it's a neat project regardless.


You can literally finetune these OpenAI models using their API. In this case it probably wasn't done because the author found it too much work and/or too expensive.
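
For reference, a sketch of the fine-tuning flow as it worked at the time (the legacy prompt/completion API; names below are placeholders and the details have since changed, so double-check the current docs):

    # train.jsonl: one JSON object per line, e.g.
    # {"prompt": "Ask HN: Python or R?\n\n###\n\n", "completion": " Neither, use Julia.\n"}
    openai api fine_tunes.create -t train.jsonl -m davinci

    # once the job finishes, query the resulting model by the name it reports
    openai api completions.create -m davinci:ft-yourorg-2023-02-22-12-00-00 -p "Ask HN: ..."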


>the author found it too much work and/or too expensive.

Military budgets and manpower would make sock puppets interesting. https://www.theguardian.com/technology/2011/mar/17/us-spy-op...

Does anyone find it curious that Edward Snowden is silent on the military's use of this technology?

How many people would be surprised to learn they are communicating with a program on websites like this?


I'm quite pleasantly surprised by this response:

Question: what are the biggest fallacies of people that post on hackernews?

Answer: One of the biggest fallacies of people that post on Hacker News is the nirvana fallacy, which is when someone jumps from "thing x is imperfect" to "thing x is worthless" without considering any of the benefits. People also tend to be familiar with certain technologies and will discount the flaws of existing technology vs. the flaws of new technology. Another fallacy is the assumption that there is one best tool for doing things - the one that supports the largest amount of use cases. Additionally, people may use appeals to authority, invent their own solutions rather than use existing ones, and assume that the approaches they first learned are most correct. Finally, people may assume open source software is a gift and that dependencies don't need to be tested.


> People also tend to be familiar with certain technologies and will discount the flaws of existing technology vs. the flaws of new technology.

Is that really a problem? It's only natural to get accustomed to old problems caused by old tech (where workarounds probably already exist or limitations are understood and accepted), while it makes sense to hold new tech to a higher standard, since it has the benefit of learning from past mistakes and should avoid introducing obvious problems for people in the future to deal with. Our future selves will thank us for being more critical of flawed new tech than we are of old tech.


It kinda enumerated all possible characteristics of people and their approaches to technical issues, not something unique to HN.


I'm a little surprised that Hacker News comments weren't already in the GPT-3 training set. I just assumed that OpenAI had vacuumed up most of the web already.


I am guessing they already were? But this is 100% pure, concentrated HN not contaminated with nonsense from the rest of the web :)


If it's really trained exclusively off of HN comments, I expect most of the bot's responses will evade the actual question but spend several paragraphs debating the factual specifics of every possible related tangential point, followed by a thinly-veiled insult questioning the user's true motivations.


In no way does a typical HN comment debate every possible related tangential point. Do we expect a modicum of intellectual rigor? Yes. But to say every tangent is followed and scrutinized is simply factually untrue.

And several paragraphs? I challenge you to show even a large minority of argumentative responses that veer into "several" paragraphs. You characterize this as "most of the ... responses" but I think that's unfair.

One wonders why you'd resort to such hyperbole unless you were deliberately attempting to undermine the value of the site.


This is my favorite type of humour.


If you're not arguing over the semantics, rather than OP's clear-enough intent, are you really on HN?


That had me laughing! Case in point, from a few days ago: https://news.ycombinator.com/item?id=34855372


It's not trained at all. The bot finds relevant comments and then uses OpenAI's API to summarize them.


Is it exclusively HN comments and nothing else? How does a model like that know how to speak English (noun/verb and all that) if you are starting from scratch and feeding it nothing but HN comments?


I'm sorry to be THAT GUY, but it is addressed in the article :)

>GPT embeddings

> To index these stories, I loaded up to 2000 tokens worth of comment text (ordered by score, max 2000 characters per comment) and the title of the article for each story and sent them to OpenAI's embedding endpoint, using the standard text-embedding-ada-002 model, this endpoint accepts bulk uploads and is fast but all 160k+ documents still took over two hours to create embeddings. Total cost for this part was around $70.
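
A rough sketch of what that bulk step presumably looks like with the 2023-era openai Python client (batching and rate-limit handling omitted; treat as illustrative):

    import openai

    def embed_batch(texts):
        # the endpoint accepts a list of inputs, which is what makes bulk upload fast;
        # text-embedding-ada-002 returns a 1536-dimensional vector per input
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return [d["embedding"] for d in resp["data"]]

    # story_vectors = embed_batch(["<title + top comment text for each story>", ...])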


In a nutshell, this is using OpenAI's API to generate embeddings for top comments on HN, then also generating an embedding for the search term. It can then find the closest related comments for the given question by comparing the embeddings, and then send the actual text to GPT-3 to summarize. It's a pretty clever way to do it.
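
The comparison step is just nearest-neighbor search; Pinecone does it at scale here, but conceptually it's something like this (a sketch, assuming the vectors sit in a numpy matrix):

    import numpy as np

    def top_k(question_vec, comment_vecs, k=5):
        # ada-002 embeddings come back unit-normalized, so a plain dot
        # product is (effectively) cosine similarity
        sims = comment_vecs @ question_vec
        return np.argsort(sims)[::-1][:k]  # indices of the k closest comments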


> How does a model like that know how to speak English

Mimicry.


I have to assume that targeted/curated LLM training sets will have a tendency to be less accurate than very general ones, just by the nature of how they work.

(edited for clarity)


I know it's not quite analogous, but I fine-tuned GPT-3 on a small (200 examples) data set and it performed extremely poorly compared to the untrained version.

This surprised me. I thought it wouldn't do much better, but I wasn't expecting that specializing it on my target data would reduce performance! I had fewer examples than the minimum OpenAI recommends, so maybe it was a case of overfitting or something like that.


Nice! We built something very similar recently; it is more like "Ask your documentation", but the implementation is very similar otherwise.

See a demo on the huggingface transformers documentation: https://huggingface.co/spaces/jerpint/buster

code: https://github.com/jerpint/buster


Starred! We've been looking to build something similar so I appreciate you sharing this here.

The only other project that I've seen that's doing something close to this is this one: https://github.com/getbuff/Buff

It's a bit more similar to the OP's bot (it's a Discord bot).

Cool to see momentum in this space!



For those who are wondering,

HN data is indexed with embeddings for semantic search. When queried, it finds the closest article and top comments and summarizes them with GPT-3.

GPT-3 serves as a rendering tool for compressed comments.


My own experiments made me think that the impact of finetuning is comparable to that of a molecule in a drop in a bucket.

> “AskHN” is a GPT-3 bot I trained on a corpus of over 6.5 million Hacker News comments to represent the collective wisdom of the HN community in a single bot.

I'm assuming you used the OpenAI fine-tuning pathway to make a custom model?

Have you tested the responses on vanilla GPT3 vs your custom model?

I'd be curious to see the comparison.


From the article, they did not use fine-tuning. This is semantic search + GPT-3 to provide human-like answers.


Thanks! I missed that part.

The semantic search approach seems to focus the answers better than fine-tuning, at the cost of preloading the prompt with a lot of tokens, but with the benefit of a more constrained response.


Yeah, to me it looks like the learning rate was way too low to make a difference.

I don't see any of the sublime and succinct snark.


Yeah. Also full of GPT-3isms like "ultimately the choice ... comes down to the specific project and its ... requirements", and not nearly contrarian enough.

A bot focused on the output of HNers would insist on providing arguments against going through Google's interview process in the first place, suggest that the correct answer to "Python or R" should be Haskell or Julia, and would never suggest prioritising emotional vulnerability or being a happy person!


Thank you for the laffs =)


This might be a dumb question, but is this really based on the collective wisdom of HN? I would say that the collective wisdom is just as much in the interaction of the comments and the ranking of those comments as it is in the comments themselves. If you just ingest all the comments wholesale, aren't you rather getting the average wisdom of HN?


I believe it's always going to be an average. The more interesting question is: how is the average weighted?


Let's admit that HN's culture is that many of us are confidently wrong, which we cover up with impressive technical jargon. As such, any wrong answer in this AI is in fact correct.


> confidently wrong, which we cover up with impressive technical jargon

I get the feeling this comment is self-referential/self-parodying.


Well spotted ;)


I love this! I used to append "reddit" to my Google search queries to get the best results, but the quality of dialog over there has really dropped in recent years. These days I've switched to appending "hackernews", but this is even better.


Same. I have “site:news.ycombinator.com” as a keyboard shortcut on my phone. Use it all the time.


Nice work! Been playing with LangChain and was not aware of patterns.app.

This whole space is moving so fast it's hard to keep up for someone whose immediate day job doesn't revolve around it. Congrats.


Is there a way to opt out of one's comments being used for this?


Nah, it's no big deal; it's not like Cambridge Analytica will happen again. They're just using your data to train AI. Who knows, maybe based on the way you comment you'll get suggestions on which medication you need, or whether it's time for that Red Bull or Starbucks coffee. Nah, all is good. Nothing bad will happen in allowing companies to scrape comments and build models. They're very ethical. In fact, people here are suddenly not so concerned that the model is not open and there is no oversight on how the data is being used. They're just proud to get answers from a text generator.


The BIG DEAL is not THAT specific instance but the fact that the ML crowd think it's OK to take everything without even asking permission.


> The BIG DEAL is...the fact that the ML crowd think it's OK to take everything without even asking permission

Everything they take was freely given. Thrown into the void. Screamed into the wind. It's weird that people are perfectly fine if someone happens to read their words (at all) and fine if some of those who do read them manage to find something in them that is in any way helpful or useful, but the moment they think someone else might make money as a result of something gained from exposure to those same words it's somehow offensive and everyone starts demanding a cut of (usually non-existent) profit.

The "ML" crowd has just as much a right to read and learn from the words I enter on social media platforms as anyone else. I'm not charging any kind of fee for the words of debatable wisdom, fact checking, or shitposting I "contribute". I didn't ask permission before replying to your comment. Why should anyone feel like they should ask for permission from me to read it? What exactly is "taken" from me beyond the time I voluntarily spent participating in online discourse?


I think I should've put an /s at the end. It's kind of strange that I see constant discussions here, and people harassing small apps/libraries, about how their error collection is not OPT-IN - the whole Audacity debacle - but data collection for training ML models is perfectly fine, because we sure do know how the companies who fund the research will get an ROI.


Just post a healthy amount of random nonsense along with any of your actual posts to dilute the effects

Banana Sebastian housewares fly swimmingly under terrestrial Zruodroru'th Memphis Steve Jobs archipelagos


> Banana Sebastian housewares fly swimmingly under terrestrial Zruodroru'th Memphis Steve Jobs archipelagos

It's actually more likely to require a bathtub to increase the volume of the reticulated lorries, so I really don't think a farmer's market is the ideal place.


Yes, don’t post on online forums.


That's how I decided to opt out of Reddit after 16 years.


Why would you want to? Genuinely wondering.

I for one am oh so proud that my valuable ramblings contributed to this majestic machinery.


I agree: when I signed up, I never agreed to let anybody use what I write to do anything they want! I only agreed to let everybody read, understand, and interact with what I wrote.

Actually, it makes me feel as bad as knowing that CAPTCHA were used to train image recognition models...

I think it could be a good time to reconsider the question of consent. I may agree that my words are used to train some AI... but 1) I must be asked (kindly) first, and 2) it won't be free!!! (it may be paid to me or to the service provider like HN... but it's NOT unpaid work ;-) )


If you're willing to pay for the retraining? ;)


Hi, thanks for the interesting article. I have a question about Pinecone. What is the cost of storing all these vectors?


Anyone here know how to generate subtitles automatically using AI when a video is playing on the web?

Was planning to see how I can build something like these sites but without the need to regularly update the subtitles catalog: https://subscene.be https://subtitlecat.com https://subtitletrans.com https://my-subs.co


Is there any LLM that can be self-hosted and fed a corpus of data to ingest for question answering? The part I find difficult is how to feed (not train) the open LLM models with an entire dataset that isn't available to the public.


The hack to solve this is to embed each paragraph in your large corpus. Find the paragraphs most similar to the user query using embeddings. Put those paragraphs and the raw user query into a prompt template. Send the final generated prompt to GPT-3.

This actually works surprisingly well.

Check out the OpenAI cookbook for examples.
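
A minimal sketch of that template step (the wording is my own, not the cookbook's):

    import openai

    def build_prompt(question, paragraphs):
        # paragraphs: the most-similar chunks found via embedding search
        context = "\n\n".join(paragraphs)
        return ("Answer the question using only the context below. "
                "If the answer isn't there, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

    # answer = openai.Completion.create(model="text-davinci-003",
    #                                   prompt=build_prompt(q, chunks), max_tokens=300)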


ChatGPT and friends always talk like those Microsoft and Apple forum responders with 100k reputation.

I see that you are asking about "How to get a job at Google". I will help you with "How to get a job at Google". In order to solve the problem of "How to get a job at Google" please follow the following steps first:

- rewrite your resume in Google Docs

- reinstall Chrome

- apply to the job

Let me know if I can help further with "How to get a job at Google".

I like using it, but I have to tune my prompts to make sure that they don't bullshit me before getting to the point.


I like the project. Had been wanting to do this myself for a long time, because HN has become the first place I go to nowadays for answers, and I value the intelligence and experience distilled in the comments here.

I do not like that it seems to be effectively an ad.

> Embedding every single one of the 6.5 eligible comments was prohibitively time-consuming and expensive (12 hours and ~$2,000).

Does anybody understand what he's talking about here? Assuming 6.5 million comments and an average length of 70 tokens, we'd be looking at about $180 ($0.0004 / 1K tokens).
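
For reference, the arithmetic behind that estimate (taking the 70-token average as an assumption):

    comments = 6_500_000
    avg_tokens = 70
    price_per_1k = 0.0004                      # ada-002 embedding price, per the parent
    total_tokens = comments * avg_tokens       # 455,000,000
    cost = total_tokens / 1000 * price_per_1k  # ~= $182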


Nice. I just sort of assumed early on my comments were training some future AI, and I hope that in some small way I have been able to moderate some of its stupider urges.

A version where you can turn knobs of flavored contributors would be pretty funny. I know my comment style is easily identifiable and reproducible, and it encodes a certain type of logical conjugation, albeit biased with some principles and trigger topics, and I think there is enough material on HN that there may be such a thing as a distinct, motohagiographic lens. :)


Some day I will sue people like OP (if they're monetizing it) and OpenAI for monetizing my public posts. You can use, reuse, and alter public speech, but when you earn ad dollars... yeah, part of that is mine if your model used my public content. I probably won't actually sue, but someone will.

I am not a lawyer but there has to be a jurisdiction where I can establish standing at least.


Love that it includes sources — this makes it much more valuable because you can tell if it's giving useful information or just blowing smoke.


> 4. Index the embeddings in a database

If OP is reading: I'm curious about the database you are using to store the embeddings. Pinecone, Weaviate...?


From the article:

> The embeddings were then indexed with Pinecone.


Related question: I've written probably a million words over my lifetime.

Is there an easy way to load up GPT with my thoughts to have it be a fake me?


This, I think, would be a great little SaaS idea to make some money. I keep seeing more and more people asking how they can transform their data into an interactive archive that responds as chat, or with voice.


Sort of. Look into gpt-index/LangChain.
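
Roughly, the gpt-index (now LlamaIndex) flow looked like this in early 2023 - the API has churned a lot since, so treat the names here as approximate:

    from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

    # point it at a folder containing your million words
    documents = SimpleDirectoryReader("my_writing/").load_data()
    index = GPTSimpleVectorIndex(documents)  # embeds and indexes the corpus
    print(index.query("What would I say about static typing?"))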


> The methodology I used here is a generic, scalable solution for distilling a knowledge corpus into an embodied intelligence

The methodology used here is a generic solution for distilling a non-generic corpus of utterances into a generic platitude machine.


I have an experiment that uses the embeddings to visualize clusterings of HN comments (using t-SNE). Not super useful, but it's interesting to view the comments in 3D and see how similar ones cluster together into mostly relevant themes.
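
For anyone curious, the core of that is only a few lines - a sketch assuming the embeddings are already saved as a numpy array (the filename is made up):

    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    embeddings = np.load("hn_comment_embeddings.npy")  # shape (n_comments, 1536)

    # project to 3D; perplexity is worth tuning for your corpus size
    coords = TSNE(n_components=3, perplexity=30).fit_transform(embeddings)

    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], s=2)
    plt.show()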


Hmm. I thought perhaps he was going to take the questions from Ask HN and the top upvoted comments and fine-tune a model with those as prompt/reply pairs.

Curious how that would differ, but it would be an expensive endeavour.


Seeing a ton of projects utilizing ChatGPT nowadays. Are the project owners basically paying the API costs out of pocket? I think it would add up pretty quickly, especially if you hit the front page of HN.


Rather than a summarizing tool, this bot is really useful if you want to search for related HN posts based on an abstract idea, imo.


Just to be sure: this is NOT a finetuned GPT model, but rather the standard GPT-3 API, used to summarize search results from an HN comments DB based on user input. Right?


Did you also ingest dead comments into the corpus?

I would very much like to see the ghost of Terry pop up from time to time, to offer his wisdom and unique style of response.


> I trained on a corpus of over 6.5 million Hacker News comments

How long did it take to scrape them and "train" on this corpus?



That was mentioned in the article, in the « Ingesting and filtering HN corpus » section... 30 min.


I didn't know the API supported downloading all of its database. Are you the reason HN has sporadic downtime lately? ;)


Ask HN: here is my idea, can I build this in a weekend

AI: of course .. here is your bash script (220 lines long)


"He only went and did it... " !


This is nice! The official Algolia search is useless.

OTOH, did I miss something, or is it only on Discord?


I really like Algolia. I usually use it to see if a particular link has been submitted. Other times I use it to find relevant comments or posts.


How to get a job at Google? Oh, that's easy, just get a PhD.

Thanks bottie, very use, much helpful.


Now that we have this bot to answer questions for us, I think we can all go home!


The first thing I saw was my answers to someone's question.

Can you cut me a distro of myself?


Amazing, an AI that is incapable of picking up on jokes or sarcasm!


I thought ChatGPT may have already used Hacker News (and Reddit) to train?


> the collective wisdom of the HN community

Made me smile


Could you do this for medical journal articles?


You'd probably need to prepend a prompt that told the bot how to analyze experiment design. Maybe have it read a book or 10 on experiment design. Also a few books on social networks, financial motivations and other human factors in science. Then let it take a look at journal articles and their metadata. In short, you need a way to vet for quality.


[flagged]


It looks interesting, but posting it on random threads of HN will make users flag your post and mods ban your account.

The post definitely needs more info! Who are you? How do you pick the kids? Are you the "teacher", an "organizer", or just someone enthusiastic who is related to the project? Programming language? Age of the kids? Have you done something similar before? Length of the course? Why do you need money?

Try to write a long post answering all those questions and perhaps a few more, but not too long. Make a new post, then add a comment explaining you are the [teacher or whatever], and be ready to reply to the comments in the thread.

Some official suggestions in https://news.ycombinator.com/newswelcome.html


Can anyone point me to some tutorials on using a GPT-3 model with a custom dataset? I am a Python programmer.



