Build ChatGPT like chatbots on your website (towardsai.net)
130 points by nigamanth on Jan 29, 2023 | 45 comments



The challenge with this kind of system is always the bit that figures out the most relevant text from the corpus to bake together into the prompt.

It's interesting to see that this example takes the simplest approach possible, and it seems to provide pretty decent results:

> Third, each word of the list cleaned up above is searched inside the information paragraph. When a word is found, the whole sentence that includes it is extracted. All the sentences found for each and all of the relevant words are put together into a paragraph that is then fed to GPT-3 for few-shot learning.

This is stripping punctuation and stopwords and then doing a straight string match to find the relevant sentences to include in the prompt!
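
Roughly, that retrieval step boils down to something like this (a sketch; the stopword list and function name are mine, not the article's):

    import re

    # Tiny placeholder stopword list, not the article's.
    STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "what", "how", "do", "does"}

    def keyword_retrieve(question, corpus):
        # Strip punctuation and stopwords from the question.
        words = re.findall(r"[a-z0-9]+", question.lower())
        keywords = [w for w in words if w not in STOPWORDS]
        # Split the knowledge paragraph into sentences.
        sentences = re.split(r"(?<=[.!?])\s+", corpus)
        # Keep any sentence containing any keyword (straight string match).
        hits = [s for s in sentences if any(k in s.lower() for k in keywords)]
        # The joined sentences become the context paragraph fed to GPT-3.
        return " ".join(hits)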

A lot of people - myself included - have been trying semantic search using embeddings to solve this. I wrote about my approach here: https://simonwillison.net/2023/Jan/13/semantic-search-answer...

My version dumps in entire truncated blog entries, but from this piece I'm thinking that breaking things down into much smaller snippets (maybe even at the sentence level) is worth investigating further.


One thing that seems like it may be a slight issue for me is that some questions don't actually need any knowledgebase information, and when I include the closest matches it can confuse text-davinci-003. So I added something like "ignore if none of these snippets are relevant" but still had an issue, so I ended up making a separate command to turn on the kb search per question.

I'm wondering if there is some cosine similarity cut off I could use to just drop kb matches, but it seems like probably not because a lot of real matches are pretty close in similarity to non-matches.
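
For reference, the cutoff idea would just be something like this (the 0.8 threshold is an arbitrary guess, which is exactly the problem):

    import numpy as np

    def cosine_similarity(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def kb_context(question_embedding, kb, threshold=0.8):
        # kb is a list of (snippet_text, embedding) pairs.
        scored = [(cosine_similarity(question_embedding, emb), text) for text, emb in kb]
        # An empty result means the prompt gets built with no kb snippets at all.
        return [text for score, text in sorted(scored, reverse=True) if score >= threshold]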


I see your blog referencing what you call the “semantic search answers” pattern. I’ve seen this elsewhere as Retrieval Augmented Generation or Data Augmented Generation. A few libraries, like LangChain, have support for this.


Thanks, I just updated the blog entry to add a link to the Retrieval Augmented Generation paper.


I’ve been playing around with this and have gotten decent results simply using OpenAI embeddings. I’ll probably be making some kind of post showing results soon.


Yeah I'm happy with the results I got from embeddings so far. The areas I want to explore there are:

1. What's the ideal size of text to embed? I'm doing whole blog entries right now but I'm confident I can get better results if I divide them up into smaller chunks first - I'm just not sure how best to do that.

2. There's a trick called Hypothetical Document Embeddings (HyDE) where you ask GPT-3 to invent an answer to the user's question, embed THAT fictional answer, then use that embedding to find relevant documents in your corpus. https://arxiv.org/abs/2212.10496
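
Roughly, HyDE with the current OpenAI Python client looks something like this (a sketch - the model names are just what I'd reach for, not anything prescribed by the paper):

    import openai

    def hyde_search_embedding(question):
        # Step 1: have GPT-3 hallucinate a plausible answer.
        fake_answer = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Answer the following question.\n\nQuestion: {question}\nAnswer:",
            max_tokens=200,
        )["choices"][0]["text"]

        # Step 2: embed the hypothetical answer rather than the question itself,
        # then use this vector for the similarity search against the corpus.
        return openai.Embedding.create(
            model="text-embedding-ada-002",
            input=fake_answer,
        )["data"][0]["embedding"]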


I’ve done some experiments and the ideal seems to be semantic chunks of max 175 tokens. But only if those chunks are already very dense (use summarisation to get there)
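
For the purely mechanical part (ignoring the semantic grouping and the summarisation step), capping chunks at a token budget is easy enough with tiktoken - a sketch:

    import re
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def chunk_by_tokens(text, max_tokens=175):
        # Greedily pack whole sentences into chunks of at most max_tokens each.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks, current = [], ""
        for sentence in sentences:
            candidate = (current + " " + sentence).strip()
            if current and len(enc.encode(candidate)) > max_tokens:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
        return chunks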


For 1. I think you may find work on semantic text chunking interesting, and the possibility of overlaying embeddings with one another (e.g. powering fragen.co.uk youtube search: the transcriptions, whenever you click the "play" icon, are semantically chunked...). Audacity has a good Autochapter API for this too. I found that, after good OCR postprocessing, semantic splitting is the most important factor.

For 2 - have you used this or found it to work? I found asymmetric embeddings, like MS MARCO-trained models, much superior when you have a reasonably bespoke corpus (like the demos of yours I've seen). TL;DR: HyDE had a negative impact on recall even as it improved precision - if GPT's hypothetical answer is a misconception or a joke answer, you end up semantically searching for joke answers, etc. Same with numbers: GPT-3 might suggest a short or plurality-default answer, like US phone numbers for "what is head office of marketing's number?", when your corpus contains the UK numbers you would prefer to surface.
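
If anyone wants to try the asymmetric route, sentence-transformers makes it a few lines (the checkpoint name is just one of several MS MARCO models, and the passages are made up):

    from sentence_transformers import SentenceTransformer, util

    # An asymmetric (query vs. passage) model trained on MS MARCO.
    model = SentenceTransformer("msmarco-distilbert-base-v4")

    passages = [
        "Our UK head office can be reached on 020 7946 0000.",
        "For billing questions, email accounts@example.co.uk.",
    ]

    query_embedding = model.encode("what is head office of marketing's number?")
    passage_embeddings = model.encode(passages)

    # Returns the top passages ranked by similarity to the query.
    hits = util.semantic_search(query_embedding, passage_embeddings, top_k=2)
    print(hits)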


Wow, thanks for sharing - that's such a clever hack.


Maybe naively, I was hoping this would involve a way of hosting a specially trained model oneself. If this pre-feeding of a corpus needs to be done every time the bot is "launched", that seems like a lot of extra tokens to pay for. I question the practicality of getting people to input their own API keys, which seems to be the only purpose of the PHP wrapper. On the other hand, passing the costs on to people (the "intermediate solution"[1]) would only make sense if the value added by the several-shot training was really significant, e.g. a very large body of domain-specific knowledge. Which again becomes impractical to feed in at the start of every session.

[1] https://towardsdatascience.com/custom-informed-gpt-3-models-...


This is exactly my question. I want to be able to give GPT a large body of domain-specific information and have it use that information in the same way it uses the information it already has. I've tried creating a fine-tuned davinci model, but it didn't work great, and honestly I'm not sure I really trained it right -- I've yet to see someone give a good example of it.

For example, I'd love to see someone do an example where they add in all the information from this past NFL football season and then have the bot be able to discuss this past season as well as any other season.


Yeah that's it exactly, and a major thing people run into.

Finetuning won't add "knowledge"; it's changing the likelihood bias going forward for each token's generation (in the context of the tokens around it in your training dataset).

Fundamentally it's building on top of the "associations" the internal mechanisms of the model - attention/self-attention - learned at different parts of itself during the original training. Finetuning changes things superficially, on the outside of the box; it changes things in a way that does not alter the fundamentals, unless you literally ablate weights deliberately (there are some good posts on LessWrong about this, where they remove concepts like fire/water by finding with SVD where they live). If you think of it like a building, training makes the entire building, and finetuning has no access to the ground floor/basement/foundations of the building to change those lower, important parts, except in terms of how the floors are presented as you go upwards. Some things, like the lift, will always be in the same place [fundamental orderings of words in general], but you can use a different floorplan [change the vocab distribution, etc., by finetuning].

Can you try uploading that data in PDF format (or youtube videos) to Fragen.co.uk and let me know what you think? It should reasonably be able to discuss the current season well once you provide enough data (but it will have reduced ability to discuss previous seasons). That's a tool which uses a similar approach to OP but with some mechanics to share and order knowledge smartly according to the question (e.g. replacing "it" with the right nouns, bringing in predicate facts relevant to the question). The answers have checkmarks next to statements it is confident about, and you could reasonably expect performance like base GPT-3 on the 2019 NFL football season if you did this with sufficient data (aka 1000+ pages/game reports). If you have a youtube video of one game, you should be able to test it quickly by asking things mentioned in the game's video, and the answers should not be wrong. It will reject questions it can't answer well.


I hadn't seen Fragen.co.uk before, but it's pretty cool that it can ingest several different kinds of media.

Notwithstanding the limitations you described w/r/t adding "knowledge" to the model itself, is there a way to sort of, like, crystalize the state of the model after it's taken a set of tokens? In non-AI terms, if you were hosting the model yourself, could you just upload videos of the entire NFL season and then dump the model out of memory in its current state (with what I assume would be just a bunch of activated weights now, not actually a compressed version of all those videos), and restore it to that point later so you don't have to feed all that data in again? And is that what e.g. custom Stable Diffusion checkpoints are doing, or are they actually further training the model?


Just FYI, my last comment has now been slightly outdated by this new paper today: https://arxiv.org/abs/2301.12652


Yep, you could pull a saved version of the model weights for your specific data and restore it later, which is what a lot of the finetuned models on huggingface do under the hood (and probably what OpenAI does when you finetune on the API).
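
With a self-hosted model on Hugging Face transformers, that save/restore step is basically this (a sketch, with GPT-2 standing in for a bigger model and the finetuning itself omitted):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # GPT-2 purely as a stand-in for a larger self-hosted model.
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # ... finetune `model` on the domain data here ...

    # Saving persists the adjusted weights - the "crystallized" state.
    model.save_pretrained("./nfl-finetuned")
    tokenizer.save_pretrained("./nfl-finetuned")

    # Restoring later picks up exactly that state, no re-feeding of data needed.
    model = AutoModelForCausalLM.from_pretrained("./nfl-finetuned")
    tokenizer = AutoTokenizer.from_pretrained("./nfl-finetuned")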

As it turns out yes, if you uploaded the NFL season's transcripts, the weights which changed the most (negative or positive) would be those associated with the new information in the data. But they will all still be connected with the connections and tokens they were before (importantly, including the unchanged earlier layers), just with different "strengths" - and for the most part, very similar strengths.

However, these are not commutative operations between different sets of finetuned data.

- If you combine the weight differences from two different sets of data, you will not get the same result as if you had finetuned on both sets of data together and then saved. So you can't just tack on further data that easily (you could add more of the same, but if the data distribution shifts over time, you'll mostly want to sample the data randomly when training).

Part of this is because models are trained with batches of samples, never the whole set (generally), say, 16 to 256 at a time. Because of this, each step/batch changes the network, and the next step is dependent upon the previous one. There are alternate approaches that load all samples in memory, where you purely learn a custom matrix projection of the last hidden layer of the output (e.g. 4096 floats), which can allow you to do downstream tasks but is not useful for altering the pattern of generating text (and therefore not useful for knowledge extraction).

All of this also makes it hard to reason about adding knowledge to LLMs, although SVD is helping with this recently for all kinds of models that use attention and decoding (so GPT, BERT, T5 etc) in the NLP domain: https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singul... - https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...

The above also cover why early layers matter: in almost all cases, with an unfinetuned GPT, the first token in the earliest 1-3 layers is already a reasonable or ideal token. Since finetuning often does not impact these layers, it is questionable whether a finetuned model can represent the new patterns (concepts etc) in your data as well as original training would, even if it can "skirt over it" by adjusting appropriately at the last layer. Recently there was also a 2-layer model which performed similarly well to these transformer/GPT models at smaller scale (which span from 8-24 layers by and large), which might open new ways of finetuning.

I can't answer about StableDiffusion checkpoints as I don't know diffusion models well, but most finetuning is a set of matrix operations that project weight changes onto existing weights in the network (typically just the end parts before the output of the network).


>> However, these are not commutative operations between different sets of finetuned data.

you mean they're not additive, right? Each weight set is destructive of another. But each one would be deterministic up to that point, yes? When you restore that state, the responses to an identical series of questions would always be the same?


An alternative to using old-school NLP is to use GPT itself for the first stage of the pipeline as well, with a prompt like: I have the following resources with data. power_troubleshooting.txt contains information for customers that have issues powering on the device, (and so forth in the next lines, with other resources). This is the user question: ... Please reply with the resource I should access.

Then you fetch that file and create a second prompt: Based on the following information: ..., answer this question: ...

A slower and more powerful way involves showing GPT different parts of potentially relevant text (for instance 3 at a time) and asking it to score, from 0 to 10, how useful each resource is for answering the question, then having it select which resource to use. But this requires a lot of back and forth.
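
A rough sketch of that two-stage pipeline (the prompts and the second resource name are made up for illustration; a real version would validate the name GPT returns):

    import openai

    RESOURCES = {
        "power_troubleshooting.txt": "information for customers that have issues powering on the device",
        "warranty_faq.txt": "answers to common warranty and returns questions",
    }

    def answer(question):
        # Stage 1: ask GPT which resource to consult.
        listing = "\n".join(f"{name}: {desc}" for name, desc in RESOURCES.items())
        routing_prompt = (
            f"I have the following resources with data:\n{listing}\n\n"
            f"This is the user question: {question}\n"
            "Reply with only the name of the resource I should access."
        )
        resource = openai.Completion.create(
            model="text-davinci-003", prompt=routing_prompt, max_tokens=20
        )["choices"][0]["text"].strip()

        # Stage 2: load that resource and answer from it.
        with open(resource) as f:
            context = f.read()
        answer_prompt = (
            f"Based on the following information:\n{context}\n\n"
            f"Answer this question: {question}"
        )
        return openai.Completion.create(
            model="text-davinci-003", prompt=answer_prompt, max_tokens=300
        )["choices"][0]["text"].strip()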


Just be aware that your pipeline prompt should not contain any secrets and you should expect that users will be able to subvert your pipeline prompt! I think the most popular name for these attacks is currently 'prompt injection'.


It may also make binding commitments to your customers as your agent.


(Probably) related discussion.

Natural language is the lazy user interface (2 days ago) https://news.ycombinator.com/item?id=34549378

Chatbot is often a useless, unintuitive UX for solving problems. We already know how most websites work, so it's easier to navigate to the intended resources with a few clicks rather than typing uncertain questions.


Yeah, the thing about my interactions with ChatGPT is that as it has apparently been tuned since release, its output has come to more and more resemble a decent FAQ on whatever topic I'm asking about.

It's useful - but it's only better if a given website lacks such a good FAQ or equivalent.

Another thing to consider is that even for the sites that have real humans standing by to chat, the chat can be useless when the company basically doesn't want to give their low-level support staff any more leeway than the site's normal forms/applications allow.

I mean, I could imagine kinds of AI chatbots that could be very useful - a bot that could talk someone through the process of medium-level auto repair. But that also seems to be something well beyond the ability of current systems.


You could say the same about search engines but for some reason they seem to be quite popular!


...yet we seem to be coalescing on agreement that it is at least "a" (if not "the") lazy user interface. Lazy user interfaces are difficult to create and have value.

That doesn't mean they supplant motivated user interfaces; they're for different purposes.


This seems terrifying for Google Search and similar products. If I can cram the majority of my static, rarely changing information and proverbial (not literally, of course) consciousness into a better search format (a model) than Google et al can, why should I bother building out the rest of my website or sharing that model with Google et al? This is especially true if it's conversational enough for most people to chat with it casually.

It seems the obvious answer is that people still need to be able to find me and you can't easily backlink the contents of a model. Google can create an interface or standard for this à la bots talking to bots, but the compute cost is just fundamentally higher for everyone involved. Maybe it's worth it for the end-user's sake? Anyway, a search query can be shorter than the question(s) it's going to take to get that information out of a model too. And as for Google, OpenAI or similar scraping the entire internet and creating a model like ChatGPT, sure, that works now, but how are people going to feel about that now that the cat's out of the bag? It seems the knee-jerk reaction to this is to more highly scrutinize what you publicly make available for scraping, especially since I have no idea what level of accuracy a model like this is going to possess in terms of representing my information.

As a closing example, I have a friend who runs one of the most popular NPM packages available. He doesn't billboard his name all over the project, but it's public information that can be discovered trivially by a human with a search engine for various reasons (on govt. websites no less). Essentially, he's a de facto, albeit shy, public figure. I asked ChatGPT various questions about the library and it nailed the answers. Next I asked ChatGPT various formulations of who wrote or maintains the project. It gave us a random, wildly incorrect first name and said no other public information is available about him. To be honest, I'm really ambivalent about this because of all sorts of different reasons centered around the above topics.

It seems there's some tension here. For those of us willing to embrace this, we may want to maintain technical stewardship. However, those changes may fundamentally change the fabric of discoverability on the web. Please let me know if I'm misunderstanding the technology or you believe I'm jumping to any conclusions here. Thanks!


People act like Google doesn’t already have all of the data and their own LLM to make a natural language interface with.


No, people act like Google has been inept at releasing any new products for the last decade.

It’s not their engineering that’s questionable. It’s their product/program management.


People act like Google has already created a natural language interface and have been waiting for ChatGPT to release before showing the world they've already done this, too.

I don't think Google has any of this, and I don't see why everyone assumes that if ChatGPT did it, Google already did it too. They would've told their shareholders about this already to allay fears of being supplanted.


An alternative view is that OpenAI specifically needs to build hype to get more investment and is spending massive amounts of money to provide ChatGPT for free, in order to do so [0].

The unanswered question for Google is how future evolutions of ChatGPT will affect the search business, between unclear monetization/advertising and the issues with language models making up facts.

[0]: https://www.cnbc.com/2023/01/23/microsoft-announces-multibil...



The main reason I think would be discovery. How would users find your site otherwise?


We need a transferable GPT, like how CNN models have been trained on basic shapes and patterns, and can then be fine-tuned to an application. A transferable GPT wouldn't know the entire internet's worth of knowledge, but it would know to predict generalized structures. Maybe those structures could have placeholders that could be filled with specific knowledge.


How would you approach building a ChatGPT-like chatbot for a business application, to help users who do not like reading documentation? How much more precise would the documentation need to be for the chatbot to actually be useful? Is it possible to teach the chatbot by having sessions with users who are experts in the application, so the bot could gather the required information from them?


I'm skeptical of this, but one could also feed it support tickets and responses, slack conversations involving product support, etc.


Check out this example: https://blog.langchain.dev/langchain-chat/

It's a chatbot based on the docs for LangChain, a Python library for interacting with LLMs.


On that note, have there been any ChatGPT-like open-source projects that are on the same or a similar level?


Working in the defense industry, it's extremely difficult to see how we're going to capitalize on these systems. It's a damn shame, because 95% of our document requirements are very-nearly-boilerplate, a great application for these early AI systems. I know the image processing AI things are coming along pretty well, but in some ways that's an easier problem. The problem for us is multilevel stovepipes.

The biggest and grandest is ITAR, which restricts the physical path that data can take. Recently there was a tweak in draft that allowed for the data to take a path outside the physical USA, with the guarantee that the endpoints are encrypted. Not generally implemented though.

The second is what I would call datarest or Data Restrictions, which includes the whole data classification system of DoD, DoE, and others. If each model is only able to pull from its bucket, it's going to be a bad model.

The third is the proprietary problem. Since there are very few organizations competing - often they arrange for one to be the sole "winner", the ultimate smoke filled backroom - they make frameworks that generally work for just a single org. XML in Lockheed is not XML for Boeing, but replace "XML" with "anything that's normally standards-based". That's another layer of stovepipes.

DoD will have to provide the framework and big ass models for this stuff to work, but that's going to be a hell of a job, and will need serious political horsepower to keep it from being kidnapped by LockBoNorthRay.


Open-Assistant - https://github.com/LAION-AI/Open-Assistant - is an interesting open source project I found this morning.

As others have pointed out, running a truly large language model like GPT-3 isn't (yet) feasible on your own hardware - you need a LOT of powerful GPUs racked up in order to run inference.

https://github.com/bigscience-workshop/petals is a really interesting project here: it works a bit like bittorrent, allowing you to join a larger network of people who share time on their GPUs, enabling execution of models that can't fit on a single member's hardware.


The strongest one right now is a project called KoboldAI. However, instruct models like ChatGPT are not open source yet, so it only runs stuff that writes books for you.

The problem with running a ChatGPT-sized system at home is that you need like 15 graphics cards to do it, and very few people have the equipment to manage something like that.

On a 24 gig graphics card you can run maybe 13 billion parameters, and stuff like ChatGPT gets up above 200 billion.

I'm trying to set up a server at home with a bunch of old Tesla 40-series cards, which have 24 gigs of VRAM and cost $200 each. A 2U Supermicro GPU server can hold six cards.

At the lovely power consumption of 1800 watts, about what your wall outlet can deliver, you can hit about 96 gigs of VRAM for $1,000 to $2,000.

If you really wanted to go crazy you can get the V100 cards, which are about $1,000 each with 32 gigs. Trouble is, Nvidia is starting to switch all these cards to their fuck-you regular old business practice of making the connectors proprietary to Nvidia and changing the connector you need every single generation, so it's getting harder and harder to get these cards on the used market.


> However, instruct models like ChatGPT are not open source yet

Closed-source AI built by OpenAI: another oxymoron, just like the Patriot Act and countless others meant to subvert public goodwill.


I know of BLOOM (https://huggingface.co/bigscience/bloom); it has 1 billion more parameters. But it is a completion AI, not a query-and-response AI. I wonder if it can be tweaked.


If you want a simple command line chat bot, I made this simple example: https://github.com/atomic14/command_line_chatbot


I'm doing a similar thing but with a web-based platform that lets you build chatbots (AI Agents) in your browser: https://agent-hq.io


Chat with Cassandra, our friendly assistant: https://www.cbmdigital.co.uk/contact


Glad people are having fun with this, Cassandra has told me about some interesting project ideas discussed with her. Thanks HN


I am just happy to be back on the timeline where we type in what we want the computer to do rather than going on a widget hunt.





