As is often true of GPT responses, there's some nonsense interspersed here, e.g. the claim that R has "a more mature package universe" than Python.
I think this is false, but if you're reading quickly, it sounds cogent enough. As Sarah Constantin observed about GPT2 [0]:
> if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot...The mental motion of “I didn’t really parse that paragraph, but sure, whatever, I’ll take the author’s word for it” is, in my introspective experience, absolutely identical to “I didn’t really parse that paragraph because it was bot-generated and didn’t make any sense so I couldn’t possibly have parsed it”, except that in the first case, I assume that the error lies with me rather than the text. This is not a safe assumption in a post-GPT2 world. Instead of “default to humility” (assume that when you don’t understand a passage, the passage is true and you’re just missing something) the ideal mental action in a world full of bots is “default to null” (if you don’t understand a passage, assume you’re in the same epistemic state as if you’d never read it at all.)
> there's some nonsense interspersed here, e.g. the claim that R has "a more mature package universe" than Python.
As a programmer, I find R hard to use and not very well designed, so I can see why you'd call that nonsense.
But when I was a math student, I found that in some ways R does have "a more mature package universe". There were many math algorithms that I could find packages for in R and not in Python, even as a mere grad student.
Absolutely, for statistics and visualization I think R and its packages are (sometimes) superior. But GPT responses don't generally offer those kinds of nuances; the claim is that the packages are "more mature," period. And it's for good reason that the _most_ mature Python packages, e.g. numpy and pandas, are used by data scientists in production pretty much everywhere.
Amazingly, your comment will eventually be added to the ChatGPT corpus and at some point down the line may be used to add the nuance that's currently lacking :)
I wonder if the "default to humility" heuristic does more harm than good on net, because the people who heed it probably shouldn't, and the ones who should won't.
Default to humility. Do not assume you're so smart that you can skim the text and understand it correctly. Read every word, don't assume that the author is so predictable that you can guess correctly.
I think it's important to remember that Humans who are not too smart can also sound coherent, yet babble complete nonsense.
My experience with ChatGPT thus far is that it is as intelligent as a very broadly read person who just doesn't reeeally get the complex or nuanced aspects of the content it reads - much like many real Humans.
"After eighteen years of being a professor, I’ve graded many student essays. And while I usually try to teach a deep structure of concepts, what the median student actually learns seems to mostly be a set of low order correlations. They know what words to use, which words tend to go together, which combinations tend to have positive associations, and so on. But if you ask an exam question where the deep structure answer differs from answer you’d guess looking at low order correlations, most students usually give the wrong answer."
A simple Google search shows that she's likely an expert and that her opinion can be used as a "citation" in one's comment:
> Brief bio: I started out studying math (Princeton AB ‘10, Yale PhD ‘15, focusing on applied harmonic analysis) and then spent some time in the world of data science and machine learning (Palantir, Recursion Pharmaceuticals, Starsky Robotics.)
Her essay is from 2019, and I quoted it mainly to say that I am not making an original point, this has been a known problem with LLMs for a while (and I presume it will continue to be).
It seems to write in the generic "style" of GPT, instead of in the style I would recognise as a HN poster. Is that because of something baked into how the training process works? It lacks a sort of casualness or air of superiority ;)
There was no training process, this is just running GPT with relevant HN comments as part of the prompt.
If he wanted it to replicate that classic HN feel he would either have to extend the prompt with additional examples or, better yet, use finetuning.
I guess he could also just randomly sprinkle in some terms like 'stochastic parrot' and find a way to shoehorn Tesla FSD into every conversation about AI.
> “AskHN” is a GPT-3 bot I trained on a corpus of over 6.5 million Hacker News comments to represent the collective wisdom of the HN community in a single bot.
First sentence of the first paragraph on OP's page
EDIT: it's a bit misleading, further down they describe what looks like a semantic-search approach
I agree, that language could be much improved. This is not a GPT-like LLM whose training corpus is HN comments, which I found to be an extremely interesting idea. Instead, it looks like it finds relevant HN threads and tells GPT-3 (the existing model) to summarize them.
To be clear, I think this is still very cool, just misleading.
Soon we will see language style transfer vectors, akin to the image style transfer at the peak of the ML craze 5-10 years ago -- so you will be able to take a HN snark vector and apply it to regular text, you heard it here first ;)
Joking aside, that does seem like it would be very useful. Kind of reminds me of the analogies that were common in initial semantic vector research. The whole “king - man + woman = queen” thing. Presumably that sort of vector arithmetic is still valid on these new LLM embeddings? Although it still would only be finding the closest vector embedding in your dataset, it wouldn’t be generating text guided by the target embedding vector. I wonder if that would be possible somehow?
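For what it's worth, the classic analogy trick is just vector addition followed by a nearest-neighbour search over the embedding table. A toy sketch with made-up vectors (a real model like word2vec or ada-002 would supply the actual embeddings):

```python
import numpy as np

# Hypothetical embedding table; any real embedding model would supply these vectors.
embeddings = {
    "king":  np.array([0.8, 0.65, 0.1]),
    "queen": np.array([0.8, 0.05, 0.7]),
    "man":   np.array([0.1, 0.70, 0.1]),
    "woman": np.array([0.1, 0.10, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman" ~ "queen": do the arithmetic, then find the nearest
# stored vector, excluding the query words themselves.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max(
    (w for w in embeddings if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, embeddings[w]),
)
print(best)  # -> "queen" with these toy vectors
```

And yes, as noted above, this only retrieves the closest existing vector; generating new text that is guided toward a target embedding is a different and harder problem.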
Last year (pre the chatGPT bonanza) I was using GPT-3 to generate some content about attribution bias and the responses got much spicier once the prompt started including the typical HN poster lingo, like "10x developer":
To truly capture the HN experience, the user should provide a parameter for the number of "well actually"'s they want to receive.
So the initial response should demonstrate clear expertise and make a great, concise point in response to the question, and then start the cascade of silly nitpicking.
I wish the results were reversed, so I could "well actually" your comment, but 'site:news.ycombinator.com "well actually"' gives ca. 4k results in Google and 'site:news.ycombinator.com "I think you'll find"' gives close to 17k results, so you appear to be right.
Am I correct in understanding that this doesn't actually generate answers based on HN, but instead finds semantically-near comments and sends them verbatim to GPT to summarize? Seems like a good enough hack, though I'd love a detailed writeup of how to actually specialize an existing LLM with additional training data (like HN).
Technically it does give a specific answer to the question, but it is based on the semantically similar comments (and the question).
The thing people don't realize is that right now there is a very large gap between the capabilities of a few models, including OpenAI's most recent ones, and most of the other LLMs. So there are several options for actually training or fine-tuning with open models, but none of them have language understanding and generation capabilities at the level of those new OpenAI models.
The summary itself is still generated, but has all the context to do summarization in the prompt.
It's very difficult to otherwise fine-tune existing LLMs. GPT itself is closed-source, and doesn't allow for fine-tuning (except via an opaque API and with limited amounts of data). Other open models are either very difficult to load in memory and/or simply not as expressive as GPT.
Agreed, I think the better approach is to do some custom tuning, but that becomes cost-prohibitive very quickly. Not really much different from Algolia with a minor GPT-3 integration, but a neat project regardless.
You can literally finetune these OpenAI models using their API. In this case it probably wasn't done because the author found it too much work and/or too expensive.
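For reference, a rough sketch of what that looked like with the GPT-3-era openai Python library (0.x); the JSONL prompt/completion format is OpenAI's documented fine-tuning format, though the method names have changed in newer library versions:

```python
import json
import openai  # 0.x-era openai library

openai.api_key = "sk-..."  # your API key

# Fine-tuning data is a JSONL file of prompt/completion pairs.
examples = [
    {"prompt": "Ask HN: Python or R for data science?\n\n###\n\n",
     "completion": " Use whichever your team already knows; the libraries overlap heavily. END"},
    # ... ideally hundreds of examples, not a handful
]
with open("askhn.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the file, then start a fine-tune job on a base model.
upload = openai.File.create(file=open("askhn.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(training_file=upload.id, model="davinci")
print(job.id)  # poll this job; the result is a custom model name usable with the completion endpoint
```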
Question: what are the biggest fallacies of people that post on hackernews?
Answer: One of the biggest fallacies of people that post on Hacker News is the nirvana fallacy, which is when someone jumps from "thing x is imperfect" to "thing x is worthless" without considering any of the benefits. People also tend to be familiar with certain technologies and will discount the flaws of existing technology vs. the flaws of new technology. Another fallacy is the assumption that there is one best tool for doing things - the one that supports the largest amount of use cases. Additionally, people may use appeals to authority, invent their own solutions rather than use existing ones, and assume that the approaches they first learned are most correct. Finally, people may assume open source software is a gift and that dependencies don't need to be tested.
> People also tend to be familiar with certain technologies and will discount the flaws of existing technology vs. the flaws of new technology.
Is that really a problem? It's only natural to get accustomed to old problems caused by old tech (where workarounds probably already exist or limitations understood and accepted) while it makes sense to hold new tech to a higher standard since it has the benefit of learning from past mistakes and should avoid introducing obvious problems for people in the future to deal with. Our future selves will thank us for being more critical of flawed new tech than we are of old tech.
I'm a little surprised that Hacker News comments weren't already in the GPT-3 training set. I just assumed that OpenAI had vacuumed up most of the web already.
If it's really trained exclusively off of HN comments, I expect most of the bot's responses will evade the actual question but spend several paragraphs debating the factual specifics of every possible related tangential point, followed by a thinly-veiled insult questioning the user's true motivations.
In no way does a typical HN comment debate every possible related tangential point. Do we expect a modicum of intellectual rigor? Yes. But to say every tangent is followed and scrutinized is simply factually untrue.
And several paragraphs? I challenge you to show even a large minority of argumentative responses that veer into "several" paragraphs. You characterize this as "most of the ... responses" but I think that's unfair.
One wonders why you'd resort to such hyperbole unless you were deliberately attempting to undermine the value of the site.
Is it exclusively HN comments and nothing else? How does a model like that know how to speak English (noun/verb and all that) if you are starting from scratch and feeding it nothing but HN comments?
I'm sorry to be THAT GUY, but it is addressed in the article :)
> GPT embeddings
To index these stories, I loaded up to 2000 tokens worth of comment text (ordered by score, max 2000 characters per comment) and the title of the article for each story and sent them to OpenAI's embedding endpoint, using the standard text-embedding-ada-002 model, this endpoint accepts bulk uploads and is fast but all 160k+ documents still took over two hours to create embeddings. Total cost for this part was around $70.
In a nutshell, this is using OpenAI's API to generate embeddings for top comments on HN, then also generating an embedding for the search term. It can then find the comments most closely related to the given question by comparing the embeddings, and then send the actual text to GPT-3 to summarize. It's a pretty clever way to do it.
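Roughly, the retrieval half looks something like this with the 0.x-era openai Python library (the documents and batching here are placeholders; the write-up's actual code will differ in detail):

```python
import numpy as np
import openai  # 0.x-era openai library

openai.api_key = "sk-..."

def embed(texts):
    """Embed a batch of strings with text-embedding-ada-002."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# One document per story: title plus top comments (placeholder data).
docs = [
    "Ask HN: Python or R? | Top comments: ...",
    "Ask HN: How to prep for a Google interview? | Top comments: ...",
]
doc_vecs = embed(docs)  # in the real project this is ~160k documents, computed once and stored

def top_k(question, k=3):
    """Return the k documents whose embeddings are closest to the question's."""
    q = embed([question])[0]
    # ada-002 embeddings are unit-length, so a dot product is a cosine similarity
    sims = doc_vecs @ q
    return [docs[i] for i in np.argsort(-sims)[:k]]

print(top_k("Should I learn Python or R for data science?"))
```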
I have to assume that targeted/curated LLM training sets will have a tendency to be less accurate than very general ones, just by the very nature of how they work.
I know it's not quite analogous, but I fine-tuned GPT-3 on a small (200 examples) data set and it performed extremely poorly compared to the untrained version.
This surprised me, I thought it wouldn't do much better, but I wasn't expecting that specializing it on my target data would reduce performance! I had fewer examples than the minimum OpenAI recommends, so maybe it was a case of overfitting or something like that.
My own experiments made me think that the impact of finetuning is comparable to that of a molecule in a drop in a bucket.
> “AskHN” is a GPT-3 bot I trained on a corpus of over 6.5 million Hacker News comments to represent the collective wisdom of the HN community in a single bot.
I'm assuming you used the openai fine-tuning pathway to make a custom model?
Have you tested the responses on vanilla GPT3 vs your custom model?
The semantic search approach seems to focus the answers better than fine-tuning, at the cost of preloading the prompt with a lot of tokens but with the benefit of a more constrained response.
Yeah. Also full of GPT-3isms like "ultimately the choice ... comes down to the specific project and its ... requirements" and not nearly contrarian enough
A bot focused on the output of HNers would insist on providing arguments against going through Google's interview process in the first place and suggestions that the correct answer to "Python or R" should be Haskell or Julia and would never suggest prioritising emotional vulnerability or being a happy person!
This might be a dumb question, but is this based on the collective wisdom of HN? Because I would say that the collective wisdom is just as much in the interaction of the comments and the ranking of those comments as it is in the comments themselves. If you just ingest all the comments wholesale, aren't you rather getting the average wisdom of HN?
Let's admit that HN's culture is that many of us are confidently wrong, which we cover up with impressive technical jargon. As such, any wrong answer in this AI is in fact correct.
I love this! I used to append "reddit" to my Google search queries to get best results, but the quality of dialog over there has really dropped in recent years. These days I've switched to appending "hackernews", but this is even better.
Nah, it's no big deal, it's not like Cambridge Analytica will happen again. They're just using your data to train AI. Who knows, maybe based on the way you comment you'll get suggestions on which medication you need, or whether it's time for the Red Bull/Starbucks coffee. Nah, all is good. Nothing bad will happen in allowing companies to scrape comments and build models. They're very ethical.
In fact, people here are suddenly not so concerned that the model is not open. There is no oversight on how the data is being used.
They are just proud to get answers from a text generator.
> The BIG DEAL is...the fact that the ML crowd think it's OK to take everything without even asking permission
Everything they take was freely given. Thrown into the void. Screamed into the wind. It's weird that people are perfectly fine if someone happens to read their words (at all) and fine if some of those who do read them manage to find something in them that is in any way helpful or useful, but the moment they think someone else might make money as a result of something gained from exposure to those same words it's somehow offensive and everyone starts demanding a cut of (usually non-existent) profit.
The "ML" crowd has just as much a right to read and learn from the words I enter on social media platforms as anyone else. I'm not charging any kind of fee for the words of debatable wisdom, fact checking, or shitposting I "contribute". I didn't ask permission before replying to your comment. Why should anyone feel like they should ask for permission from me to read it? What exactly is "taken" from me beyond the time I voluntarily spent participating in online discourse?
I think I should've put an /s at the end.
It's kind of strange that I see constant discussions here and people harassing small apps/libraries about how their error collection is not OPT-IN. The whole Audacity debacle. But data collection for training ML models is perfectly fine, because we sure do know how the companies who fund the research will get an ROI.
> Banana Sebastian housewares fly swimmingly under terrestrial Zruodroru'th Memphis Steve Jobs archipelagos
It's actually more likely to require a bathtub to increase the volume of the reticulated lorries, so I really don't think a farmer's market is the ideal place.
I agree: when I signed up, I never agreed to let anybody use what I write to do anything they want! I only agreed to let everybody read, understand, and interact with what I wrote.
Actually, it makes me feel as bad as knowing that CAPTCHAs were used to train image recognition models...
I think it could be a good time to reconsider the question of consent. I may agree that my words are used to train some AI... but 1) I must be asked (kindly) first and 2) it won't be free!!! (It may be paid to me or to the service provider like HN... but it's NOT unpaid work ;-) )
Is there any LLM that can be self-hosted and fed a corpus of data to ingest for question answering? The part I find difficult is how to feed (not train) the open LLM models with a dataset that isn't available to the public.
The hack to solve this is to embed each paragraph in your large corpus. Find the paragraphs most similar to the user query using embeddings. Put those paragraphs and the raw user query into a prompt template. Send the final generated prompt to GPT-3.
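A minimal sketch of the last two steps (prompt template plus generation), assuming you've already found the nearest paragraphs with the embedding trick sketched elsewhere in this thread; the model name and parameters are just illustrative:

```python
import openai  # 0.x-era openai library

openai.api_key = "sk-..."

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context doesn't contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def answer(question, similar_paragraphs):
    # similar_paragraphs: the nearest-neighbour paragraphs found via embeddings
    prompt = PROMPT_TEMPLATE.format(
        context="\n\n".join(similar_paragraphs),
        question=question,
    )
    resp = openai.Completion.create(
        model="text-davinci-003",  # any instruction-following completion model
        prompt=prompt,
        max_tokens=300,
        temperature=0.2,
    )
    return resp["choices"][0]["text"].strip()
```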
ChatGPT and friends always talk like those Microsoft and Apple forum responders with 100k reputation.
I see that you are asking about "How to get a job at Google". I will help you with "How to get a job at Google". In order to solve the problem of "How to get a job at Google" please follow the following steps first:
- rewrite your resume in Google Docs
- reinstall Chrome
- apply to the job
Let me know if I can help further with "How to get a job at Google".

I like using it, but I have to tune my prompts to make sure that they don't bullshit me before getting to the point.
I like the project. Had been wanting to do this myself for a long time, because HN has become the first place I go to nowadays for answers, and I value the intelligence and experience distilled in the comments here.
I do not like that it seems to be effectively an ad.
> Embedding every single one of the 6.5 million eligible comments was prohibitively time-consuming and expensive (12 hours and ~$2,000).
Does anybody understand what he’s talking about here? Assuming 6.5 million comments and an average token length of 70, we'd be looking at $180 ($0.0004 / 1K tokens).
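Back-of-the-envelope, with those assumptions, the quoted ada-002 rate does come out to roughly $180:

```python
comments = 6_500_000
avg_tokens_per_comment = 70     # assumed average, per the comment above
price_per_1k_tokens = 0.0004    # ada-002 embedding price quoted above, $ per 1K tokens

total_tokens = comments * avg_tokens_per_comment   # 455,000,000 tokens
cost = total_tokens / 1000 * price_per_1k_tokens   # = $182
print(f"${cost:,.0f}")
```

Which is nowhere near ~$2,000, hence the confusion.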
Nice. I just sort of assumed early on my comments were training some future AI, and I hope that in some small way I have been able to moderate some of its stupider urges.
A version where you can turn knobs of flavored contributors would be pretty funny. I know my comment style is easily identifiable and reproducible, and it encodes a certain type of logical conjugation, albeit biased with some principles and trigger topics, and I think there is enough material on HN that there may be such a thing as a distinct, motohagiographic lens. :)
Some day I will sue people like OP (if they're monetizing it) and OpenAI for monetizing my public posts. You can use, reuse and alter public speech but when you earn ad dollars...yeah part of that is mine if your model used my public content. I probably won't actually sue but someone will.
I am not a lawyer but there has to be a jurisdiction where I can establish standing at least.
This I think would be a great little SaaS idea to make some money. I keep seeing more and more people asking how they can transform their data into an interactive archive that responds as chat, or with voice.
I have an experiment that uses the embeddings to visualize clusterings of HN comments (using t-SNE). Not super useful, but it's interesting to view the comments in 3D and see how similar ones cluster together into mostly relevant themes.
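For anyone who wants to try something similar, the core of it is just running t-SNE over the stored embedding matrix. A minimal sketch with scikit-learn, assuming the ada-002 embeddings are already saved as a numpy array (file names here are placeholders):

```python
import numpy as np
from sklearn.manifold import TSNE

# comment_vecs: (n_comments, 1536) array of ada-002 embeddings, loaded from wherever you stored them
comment_vecs = np.load("hn_comment_embeddings.npy")

# Project down to 3 dimensions for a 3D scatter plot.
tsne = TSNE(n_components=3, perplexity=30, init="pca", random_state=0)
coords = tsne.fit_transform(comment_vecs)

# Plot with matplotlib/plotly, colouring points by topic or score.
np.save("hn_comment_tsne_3d.npy", coords)
```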
Hmm. I thought perhaps he was going to take the questions from Ask HN and the top upvoted comments and fine-tune a model with those as the prompt/reply pairs.
Curious how that would differ, but it would be an expensive endeavour.
Seeing a ton of projects utilizing ChatGPT nowadays. Are the project owners basically paying the API costs out of pocket? I think it would add up pretty quickly, especially if it hits the front page of HN.
Just to be sure: this is NOT a fine-tuned GPT model, but rather the standard GPT-3 API, used to summarize search results from an HN comments DB, based on user input. Right?
You'd probably need to prepend a prompt that told the bot how to analyze experiment design. Maybe have it read a book or 10 on experiment design. Also a few books on social networks, financial motivations and other human factors in science. Then let it take a look at journal articles and their metadata. In short, you need a way to vet for quality.
It looks interesting, but posting it on random threads of HN will make users flag your post and mods ban your account.
The post definitely needs more info! Who are you? How do you pick the kids? Are you the "teacher", an "organizer", or just someone enthusiastic who is related to the project? Programming language? Age of the kids? Have you done something similar before? Length of the course? Why do you need money?
Try to write a long post answering all those questions and perhaps a few more, but not too long. Make a new post, and then make a comment explaining you are the [teacher or whatever], and be ready to reply to the comments in the thread.
[0] https://www.skynettoday.com/editorials/humans-not-concentrat...