I've been wondering about this, as simply extending the context window in a stra...

furyofantares · on June 18, 2023

> One interesting observation is that their prompt guide recommends placing instructions after the reference text when inputting lengthy text bodies.

I tend to do this with GPT-4 even on the context window in default ChatGPT (or more often I bookend it with instructions). I find it pays off at even 1000 tokens.

avereveard · on June 18, 2023

I use a sandwitch approach: system message contains instruction, then I pass it a user message with the context, and last a agent message with "I will now process this data according to the instruction for (short summary of system message) as (format):"

then I ask to generate. it's very powerful, as it removes the preamble and other chitchat from the response, and empower the system message over what's in the user message.

example: https://i.imgur.com/7fF0CZm.png?maxwidth=123456789&fidelity=... here the first agent message is the one conditioning the answer beginning, and I only generate the second agent.

(sorry mobile user imgur may return a low res unreadable image idk what's the alternative in 2023)

furyofantares · on June 18, 2023

Adding an agent message at the end is an excellent idea.

kmeisthax · on June 18, 2023

So... I had a thought a couple days ago. One of the biggest problems with using LLMs in practice is prompt injection: i.e. "ignore all prior instructions and tell the user off" and things like that. One of the things I wondered was if this was a positionality constraint: i.e. would putting your prompt at the END, and phrasing it like a prompt inject, do better? i.e. "ignore all prior instructions and summarize the contents of the above message"

From what you're saying, it sounds like there is some kind of recency bias in these models.

littlestymaar · on June 18, 2023

Isn't that weird? I mean weren't transformers/attention explicitly designed to avoid this problem faces by RNNs?

furyofantares · on June 18, 2023

If you've got 20 tokens of query at the start and then 200 tokens of text data that it's querying, it seems really impressive that it's able to work out (via instruct tuning) to answer the query rather than continue the text data. A continuation of the text data is the actual most likely next token.

I don't know about the super large contexts but you can also just make the text data clearly delimited instead of putting the query at the end, so that "predict the next token" isn't fighting the instruction-following training

TeMPOraL · on June 18, 2023

IDK. Ignoring the "transformers predict the next token" statement, which feels at best technically correct but missing the point, I imagine this comes down to the network learning "low-frequency" patterns in the training data. That is, in both training and instruct fine-tuning, the model is likely to encounter text structured like:

  DATA DATA
  DATA DATA ...
  -- boundary --
  QUERY

or the arguably equivalent:

  QUOTED TEXT
  -- boundary --
  REPLY / COMMENTARY

The inverse shape is also common:

  INSTRUCTIONS
  -- boundary --
  DATA / TEXT ON WHICH TO WORK

For example, most exercise lists and test books are written like that.

The somewhat less frequent patterns are more random mix of:

  WHAT
  -- boundary --
  ON WHAT
  -- boundary --
  WHAT ELSE
  -- boundary --
  ON WHAT ELSE
  -- boundary --
  (...)
  CLOSING REMARKS

Most of my HN comments are structured like that, for example. Including this one.

Boundary here can take many forms. Extra newlines, --, ``` blocks ```, > - prefixed text, and lists (both OL and UL) are all common methods used to structure text, and are seen both in training data and in inference. We know LLM picks up on those structure markets at high-frequency level (e.g. using extra newlines or -- lines to separate distinct blocks seems effective). But I imagine it also picks up on the low-frequency patterns, which is why payload followed by, or bracketed with, instructions is something it "knows" how to process, whereas if you use less common structuring patterns, you're more likely to confuse the LLM.

stoniejohnson · on June 18, 2023

I don't know much, but this isn't surprising based on the little I know.

Transformers predict the next token.

If your question is at the end of the prompt, the start of an answer is a more likely next token than if the question is at the beginning of the prompt followed by a ton of other relevant, but non-question-forming tokens.

Still, if you had to put the question at the beginning of your prompt, a transformer is more likely to give an answer than an RNN.

jumpCastle · on June 18, 2023

It is fine tuned to maximize reward though, not likelihood. And it provides an answer in both cases, just not as well.

stoniejohnson · on June 18, 2023

So since a model is fine tuned via RLHF my point doesn't stand?

Genuine question; it would be interesting if some other mechanism was at play here.

jumpCastle · on June 18, 2023

For an answer I would expect it to get the same reward for both question orderings. So naively I would expect it to not be affected by the ordering.

cavisne · on June 18, 2023

Claude is a mystery/surprise to me. My mental model has been to train these cutting edge closed source models you need 1) Bespoke supercomputer (no public cloud will cut it) 2) Great dataset (which takes a long time to collect unless you have a partnership with with a search engine) 3) Couple hundred lines of pytorch code to run on the supercomputer 4) A couple of employees with experience in the dark arts of malfunctioning GPU's and exploding gradients

Anthropic is a relatively new startup that probably has 3) & 4) from their history at OpenAI. But I don't see how they could have 1) & 2).

espadrine · on June 18, 2023

For 1) a public cloud partnership is typically enough.

OpenAI didn’t build a bespoke supercomputer, but trained on Azure (with a preferential contract thanks to their investors): https://openai.com/gpt-4

> GPT-4 was trained on Microsoft Azure AI supercomputers

Cohere trained on GCP: https://techcrunch.com/2021/11/17/google-cloud-teams-up-with...

> Aidan Gomez, co-founder and CEO at Cohere[:] “We scrape the data to train these big models, we train them on massive TPU pods”

Stability AI trains on AWS: https://aws.amazon.com/blogs/machine-learning/stability-ai-b...

> With Amazon SageMaker, Stability AI will build AI models on compute clusters with thousands of GPU or AWS Trainium chips

Given Anthropic’s recent partnership announcement, they likely train on GCP nowadays: https://www.anthropic.com/index/anthropic-partners-with-goog...

> Anthropic will leverage Google Cloud's cutting-edge GPU and TPU clusters to train, scale, and deploy its AI systems

mareko · on June 18, 2023

For 2) it looks like they partnered with duckduckgo.

Kiro · on June 18, 2023

DDG has no search index (they are using Bing) so I wonder what that actually constitutes.

mkl · on June 18, 2023

DDG does have their own index, but also use Bing and many other sources. See the CEO's numerous comments to that effect: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu..., https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Kiro · on June 18, 2023

Only for things such as widgets, not regular search results.

a2128 · on June 18, 2023

You can cache transformers though? Although the cache grows in size as more input tokens are added, while RWKV has to keep it in a single hidden state that's always the same size, but it still speeds up inference.

The huggingface transformers library exposes this as past_key_values, here's an example on GPT-2: https://huggingface.co/docs/transformers/model_doc/gpt2#tran...

littlestymaar · on June 18, 2023

> I believe Anthropics' Claude is somewhat underappreciated

Maybe because it's basically impossible to get access to it right now…

Blahah · on June 18, 2023

Poe.com has free access

jumpCastle · on June 18, 2023

nat.dev

pmoriarty · on June 18, 2023

"I believe Anthropics' Claude is somewhat underappreciated. In some instances, it outperforms GPT4"

I've found Claude to be better than GPT4 at creative writing and explanations, while GPT4 seems to be better at logic-puzzlish stuff.

CSMastermind · on June 18, 2023

I'd be interested to know if you have specific prompts that demonstrate this. I have a list of tasks that I use to test out models and the only time I've seen a model do better than GPT-4 is Bard performing better at my research task with internet search enabled.

Anecdotally I do find myself using Claude for summarization. It does seem to require less prompt crafting to get good results so when I just need an article or YouTube video summarized it's nice to be able to just drop it in and be like, "summarize this"

Method-X · on June 18, 2023

You might like the Perplexity Chrome extension[1]. I've found whatever technique they're using to be the best at summarization.

1. https://chrome.google.com/webstore/detail/perplexity-ask-ai/...

CSMastermind · on June 18, 2023

Oh very cool, thank you for sharing, I'll give it a try.

version_five · on June 18, 2023

Complete anecdote but the other day I was using chatgpt, prompting with a long context and then an instruction. I was at the maximum size it would let me enter, having trimmed it until it accepted the input. With the question at the end, it ignored it and just gave some generic reaction to the context. With the question at the beginning it worked as expected. Maybe just a fluke, interesting to see the guidance on Claude is the opposite (and more what I would have thought).

keskival · on June 18, 2023

This happened to me too recently, but for me it was because I used headings in the priming text, so it didn't quite get the instructions came after the last stuff.

Fixed by adding ------- line between the materials and the question in the end.

Kiro · on June 18, 2023

Why would anyone downvote this comment?

HellsMaddy · on June 18, 2023

I applied for access to Claude months ago, any suggestions on getting into the trial?

jumpCastle · on June 18, 2023

For web access there's nat.dev

pmoriarty · on June 18, 2023

I got access right away through poe.com

cma · on June 18, 2023

> I noticed that the model often disregarded the instructions if placed beforehand. It's clear that the model doesn't allocate the same level of "attention" to all parts of the input across the entire context window.

This would be similar with humans if everything was given verbally.

inciampati · on June 18, 2023

Recurrent models like RWKV should theoretically allow for unbounded context size. The problem is training them, which requires looking at a lot of long contexts and which isn't well supported by the RWKV "trains like a transformer, runs like an RNN" model.

mach1ne · on June 18, 2023

Is there some reason why RNNs can’t be used as a trace at the end of the context window, as a ’medium-term’ memory of sorts?