March 20 ChatGPT outage: Here’s what happened (openai.com)
359 points by zerojames on March 24, 2023 | 259 comments



The disclosure provides valuable information, but the introduction suggests someone else, or «open-source», is to blame:

>We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history.

Blaming an open-source library for a fault in a closed-source product is simply unfair. The MIT-licensed dependency explicitly comes without any warranties. After all, the bug went unnoticed until ChatGPT put it under pressure, and it was ChatGPT that failed to rule out the bug in their release QA.


They are not suing anybody, and just because it’s open source doesn’t mean talking about bugs is taboo. It reads to me like raw information, which is very good. They said they contacted the authors and are helping with the upstream bug fix. If anything, it’s a great example of how to deal with this kind of problem.


I agree it's a good example – if not great. But considering OpenAI itself is as closed source as can be, mentioning open-source software as the cause of their outage in the very first line of their statement seems somewhat out of place. I don't think it's a coincidence that the open-source mention comes in the opening line while the acknowledgement of the Redis team comes at the very end. Many outside of software engineering might read this as «open source caused OpenAI to crash».


Ironically, this sort of insane take, which is all over the thread, is causing more harm to the image of open-source developers than OpenAI or anyone else could ever do with their press release. It's important to mention which open-source library caused a bug, because that bug could be present in many other applications using it. That's basically one of the main points of open source.


What would your opening sentence have been?

It’s a great opening sentence. Short, factual, explanatory. You would have to be hypersensitive to object.

As far as I can see, the whole thing is a reaction to OpenAI not being sufficiently open - which is a fine argument to have with them, but shouldn’t cloud judgement on the quality of this write-up


i'm not the person you replied to, but i'm guessing maybe they'd have preferred:

"We took ChatGPT offline earlier this week due to a bug in an external library which allowed some users to see titles from another active user’s chat history."


Thanks for that reply - yeh, that's a reasonable alternative.


But the bug resided in an open source library, not their own code. What else could they say?

Bugs will always exist whether you have a QA dept or not.


Shifting the blame to open source isn't a good look. I like what Bryan Cantrill had to say about it: https://twitter.com/bcantrill/status/1638707484620902401


The blame wasn’t “shifted”. The source of the bug was reported



Spot on


They're not blaming anyone. Objectively there was a bug in that library that caused the problem.


They're not blaming, legally speaking. But they're communicating that open source software caused their outage. OpenAI chose to use software that explicitly came without warranties, and are legally solely responsible for problems caused by open-source libraries they choose to include in their product.


I understand where you're coming from. But there is certainly an audience for this kind of post that would like to read about the objective source of the bug without connecting it to some expectations around responsibility.

I read this post, and I don't see it assigning blame to anyone other than themselves. See this bit copied from the post:

> Everyone at OpenAI is committed to protecting our users’ privacy and keeping their data safe. It’s a responsibility we take incredibly seriously. Unfortunately, this week we fell short of that commitment, and of our users’ expectations. We apologize again to our users and to the entire ChatGPT community and will work diligently to rebuild trust.


The questionable part is saying "open source". They refrained from naming the library, which is good; but then why bring up that one fact about it right after?


The facts could be more clearly stated by refraining from stating that open-source software was the cause.


Why mention it is open source then? What does that add?


"there was a bug in an outside library that we used" does not mention open source but has the same meaning, and would probably provoke the same complaints ("they're trying to blame somebody else for their problem").

In that case, though, they could say "look, we used a popular open source library because we had more faith that it would be better tested and correct" which would be a compliment to open source. That's essentially the information that we have.

In today’s world, who builds anything from anywhere close to scratch? Embedded developers probably come closest. It’s no worse or better to say "there was a bug that our release uncovered." If they continue to announce as many details as possible, we as the audience can develop a sense of whether they’re creating bugs or just uncovering bugs we’re glad to know about.


I think it's fair to mention the bug originated in redis-py, but I don't find it relevant at all to mention «open-source» in the opening line of the public statement about the outage. Or «outside library» for that matter. It was ChatGPT release QA that failed, and then they failed to admit it.


"given enough eyeballs, all bugs are shallow" is a compliment to open source, and telling your users that you use open source is a positive reassurance on other dimensions as well; it's relevant, and the truth is always relevant; there's no harm in mentioning it.

"Release early, release often" and "move fast and break things" are respected ideas that diminish the importance of QA. The more effective and efficient any QA you do is, yes, very valuable, and QA that finds broken things, all the better. But don't move slow is an OK compromise.


It might have said redis-py originally, and someone in public relations advised replacing it with something more general so that it would sound nicer.


On the other hand, why overthink it?


why not mention it? What does mentioning it subtract?


«Open source» was in the opening line of the statement as if it was something to blame.


If someone asked ChatGPT to generate some code, then copy and pasted it mindlessly into their project, I wonder what OpenAI would think about the claim:

"The bug was a result of faulty code produced by ChatGPT."

Using an open source library is like copy and pasting code into your project. You assume responsibility for any pitfalls.


Come to think of it, I think that's realistically going to happen a lot.

It's going to be recursive blaming all the way down.


I mean they’re pretty clear with their warnings about accuracy.


Just like any author of MIT licensed open source software is clear about their license which clearly states they are not responsible in any way whatsoever for the shortcomings of their licensees.


According to this, it's not fixed yet... https://github.com/redis/redis-py/issues/2624


I don't recall anyone advocating omitting the name log4j when that security bug dropped in 2021. How is that situation materially different from this? redis-py fixed the behavior so it's definitely not the case of working-as-intended.

To be clear, I would be up in arms if OpenAI was trying to hold redis potentially legally responsible.


Damn open source! Those guys aren't living up to their paychecks, I foresee staff cuts...


I understand and agree that framing open source as the issue is ridiculous. They may be redeemed though through the likely scenario that they didn't write this - GPT probably did. Perhaps their GPT doesn't like open source. :D


I am curious how you would have phrased it? Is there a way to accurately convey the root cause of the problem (the redis-py library) without sounding like trying to shift blame?


there was a bug on our codebase...


But then you don't get the goodwill of helping the open-source library fix the bug and contributing back to the community.


Did you read the technical details? This was a bug in the way Redis-py handled requests that were cancelled before a response was returned to the user.


So? You own the stack. All that had to be said was there was a bug in the way requests were handled in a library. The open source part is not even needed.


I think the mention of exactly where is useful to this audience. Other people using the same open-source library might realize --- oh, we've run into that bug but never realized it. Low probability, sure, but nonetheless.


OpenAI could have simply named the library up front instead of stressing that it is open source. This would have alerted people who use the library while at the same time avoiding the suggestion that it being open source was part of the problem.


They tweeted a few days back that there was a bug in an open source library. That's the main issue.


Yes, and this is still the sole responsibility of OpenAI. They chose to include redis-py without warranties and are responsible for their own QA on any product built on it.


I reject the premise that this is a bug. The author picked this implementation, with whatever side effects. The author might want some requests to cancel in some situations. It's up to OpenAI to decide if the implementation works for their setup.


Maybe they should let ChatGPT craft the postmortem


Maybe they already did


I prefer accuracy over placating your public relations ego

The only counterpoint I would take is one related to the accuracy of what they wrote


This shows that LLM AI, as dangerous as OpenAI says it is, cannot be entrusted to OpenAI for safeguarding its use.


The phrase you quoted doesn't blame anyone, it just shows causation. Causation and moral responsibility are not the same thing.


Why did it take them 9 hours to notice? The problem was immediately obvious to anyone who used the web interface, as evidenced by the many threads on Reddit and HN.

> between 1 a.m. and 10 a.m. Pacific time.

Oh... so it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation? Given the size of their funding, and the number of users they have, there is really no excuse not to at least have some basic monitoring system in place for this (although it's true that, ironically, this particular class of bug is difficult to detect in a monitoring system that doesn't explicitly check for it, despite being immediately obvious to a human observer).

Perhaps they should consider opening an office in Europe, or hiring remotely, at least for security roles. Or maybe they could have GPT-4 keep an eye on the site!


Staffing an actual 24x7 rotation of SREs costs about a million dollars a year in base salary as a floor and there are few SREs for hire. A metrics-based monitor probably would have triggered on the increased error rate but it wouldn’t have been immediately obvious that there was also a leaking cache. The most plausible way to detect the problem from the user perspective would be a synthetic test running some affected workflow, built to check that the data coming back matches specific, expected strings (not just well-formed). All possible but none of this sounds easy to me. Absolutely none of this is plausible when your startup business is at the top of the news cycle every single day for the past several months.


"there are few SREs for hire"

How do you figure? If you mean there are few SREs with several years of experience, you might be right. SRE is a fairly new title so that's not too surprising.

However, my experience with a recent job search is that most companies aren't hiring SREs right now because they consider reliability a luxury. In fact, I was in search of a new SRE position because I was laid off for that very reason.


You don't even need an SRE to have an on-call rotation; you could ping a software engineer who could at least recognize the problem and either push a temporary fix, or try to wake someone else to put a mitigation in place (e.g. disabling the history API, which is what they eventually did).

However, I think the GP's point about this class of bug being difficult to detect in a monitoring system is the more salient issue.


Well hang on! Your question was why was the time to detect so high and you specifically mentioned 24x7 SRE so I thought that’s what we were talking about ;)

And I do think the answer is that monitoring is easy but good monitoring takes a whole lot of work. Devops teams tend to get to sufficient observability, whereas an SRE team should be dedicating its time to engineering great observability, because the SRE team is not being pushed by product to deliver features. A functional org will protect SRE teams from that pressure, a great one will allow the SRE team to apply counter-pressure from the reliability and non-functional perspective to the product perspective. This equilibrium is ideal because it allows speed but keeps a tight leash on tech debt by developing rigor around what is too fast or too many errors or whatever your relevant metrics are.


I’ve anecdotally observed the opposite. I have noticed SRE jobs remain posted, even by companies laying off or announcing some kind of hiring slowdown over the last quarter or so. More generally, businesses that have decided that they need SRE are often building out from some kind of devops baseline that has become unsustainable for the dev team. When you hit that limit and need to split out a dedicated team, there aren’t a ton of alternatives to getting an SRE or two in and shuffling some prod-oriented devs to the new SRE team (or building a full team from scratch, which is what the $$ was estimating above). Among other things, the SRE bailiwick includes capacity planning and resource efficiency; SRE will save you money in the long term.

On a personal note, I am sorry to hear that your job search has not yet been fruitful. Presumably I am interested in different criteria from you; I have found several postings that are quite appealing, to the point where I am updating my CV and applying, despite being weakly motivated at the moment.


My search was fruitful. I'm doing regular SWE work now. Market sucks though.


Every system failure prompts people to exclaim "why aren't there safeguards?". Every time. Well guess what: if we try to do new stuff, we will run into new problems.


There is nothing new about using redis for cache, or returning a list for a user.


Are you trying to say cache invalidation in a distributed system is a trivial problem?


I'm not disagreeing with you, and I'm not the commenter you're replying to, but it's worth noting that cache leakage and cache invalidation are two different problems.


You're right. Thanks for pointing that out. My original point still stands, distributed systems are hard and people demanding zero failures are setting an impossible standard.


It's non-trivial but it's also not that hard, there are well known strategies for achieving it; especially if you relax guarantees and only promise eventual consistency then it becomes fairly trivial - we do this for example and have little problems with it.


This wasn’t a cache invalidation problem. It was a cache corruption error.


I'm saying there is nothing new about it.


Probably the cheapest solution would be letting GPT monitor user feedback from various social media channels and alert human engineers to check on the summarized problem. GPT could even engage with users to request more details or reproducible cases ;)


that's abusable, as you can manipulate gpt however you like.


Since it now handles visual inputs, I wonder how hard it'd be to get GPT to monitor itself. Have it constantly observe a set of screenshares of automated processes starting and repeating ChatGPT sessions on prod, alert the on-call when it notices something "weird."


RUM monitoring does 99% of what you want already. Anomaly detection is the hard part. IMO too early to say whether gpt will be good at that specific task but I agree that a LLM will be in the loop on critical production alerts in 2023 in some fashion.


They raised a billion dollars.


How much have they spent?


You don't necessarily need a full team of SREs- you can also have a lightly staffed ops center with escalation paths.


I don’t think that model has the properties you think it does. Someone still has to take call to back the operators. Someone has to build the signals that the ops folks watch. Someone has to write criteria for what should and should not be escalated, and in a larger org they will also need to know which escalation path is correct. And on and on — the work has to get done somewhere!


The way those criteria usually get written in a startup with mission-critical customer-facing stuff (like this privacy issue) is that first the person watching Twitter and email and whatever else pages the engineers, and then there's a retro on whether or not that particular one was necessary, lather, rinse, repeat.

All you need on day 1 is someone to watch the (metaphorical) phones + a way to page an engineer. Don't start by spending a million bucks a year, start by having a first aid kit at the ready.

Perhaps they could also help this person out by looking into some sort of fancy software to automatically summarize messages that were being sent to them, or their mentions on Reddit, or something, even?


Yup, twitter monitoring is a thing that I have seen implemented. We did not allow it to page us, however. As you say, some of the barriers around that are low or gone as of late. I wonder if someone has already secured seed funding for social media monitoring as a service. The feature set you can build on a LLM is orders of magnitude better than what was practical before.

Looking at my post up-thread, I wish I had emphasized the time aspect more - of course all of these problems are solvable but it takes both time and money. They have the money now but two months ago the parts of this incident were in place but the scale was so small that it never actually leaked data. Or maybe a handful of early adopters saw some weird shit but we’re all well-trained to just hit refresh these days. Hiring even one operator and getting them spun up takes calendar time that simply has not existed yet. I assume someone over there is panicking about this and trying to get someone hired to make sure they look better prepared next time, because there will be a next time, and if they’re even half as successful as the early hype leads me to believe, I expect they are going to have a lot more incidents as they scale. One in a million is eight and a half times per day at 100 rps.


> early adopters saw some weird shit

Since I wrote this, I have seen several anecdotes that support this guess. This is a classic scaling problem. One or two users saw it, and one even says they reported it, but at small scale with immature tools and processes getting to the actual software bug is a major effort that has to be balanced around other priorities like making excessive amounts of money.


> […] it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation?

OpenAI is hiring Site Reliability Engineers (SRE) in case you, or anyone you know, is interested in working for them: https://openai.com/careers/it-engineer-sre . Unfortunately, the job is an onsite role that requires 5 days a week in their San Francisco office, so they do not appear to be planning to have a 24/7 on-call rotation any time soon.

Too bad because I could support them in APAC (from Japan).

Over 10 years of industry experience, if anyone is interested.


I had forgotten that I looked at this and came to the same conclusion as you. I’d happily discuss a remote SRE position but on-site is a non-starter for me, and most of SRE, if I am reading the room correctly.

Edit to add: they’re also paying in line with or below industry, and the role description reads like a technical project manager, not an SRE. I imagine people are banging down the door because of the brand, but personally that’s a lot of red flags before I even submit an application.


that is quite low for a FAANG-level SRE/SWE.


Also, I heard their interviews (for any technical position) are very tough.


nobody qualified wants the 24/7 SRE job unless it pays an enormous amount of money. i wouldn't do it for less than 500 grand cash. getting woken up at 3am constantly or working 3rd shift is the kind of thing you do with a specific monetary goal in mind (i.e., early retirement) or else it's absolute hell.

combine that with ludicrous requirements (the same as a senior software engineer) and you get gaps in coverage. ask yourself what senior software engineer on earth would tolerate getting called CONSTANTLY at 3am, or working 3rd shift.

the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.


Timezones are a thing - your 3am is someone's 9am and may be a significant part of your customer base.

Being paged constantly is a sign of bad alerts or bad systems IMO - either adjust the alert to accept the current reality or improve the system


spinning up a subsidiary in another country (especially one with very strict labor laws, like in European countries) is not as easy as "find some guy on the internet and pay him to watch your dashboard". And then you have to give him root so he can actually fix stuff without calling your domestic team, which would defeat the whole purpose.

also, even getting paged ONCE a month at 3am will fuck up an entire week at a time if you have a family. if it happens twice a month, that person is going to quit unless they're young and need the experience.


It's really not that difficult, and there are providers like Deel who can manage it all for you, to the point you just ACH them every month.

Source: co-founder of a remote startup with employees in five countries


like you said, timezones are a thing. now you're managing a global team.


That sounds harder than it is, especially if you already allow remote work. It mostly just forces you to have better docs.


Sorry to be clear I was replying to this part of your comment

> the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.

I agree that the stakes are lower in terms of harm, but was trying to express that whilst it might not be life and death, it might be hindering someone being able to do their job / use your product - eg: it still impacts customer experience and your (business) reputation.

False pages for transient errors are bad - ideally you only get paged if human intervention is required, and this should form a feedback cycle to determine how to avoid it in future. If all the pages are genuine problems requiring human action then this should feed into tickets to improve things


Not only that, but you probably need to follow the sun if you want <30 minute response time.

Given a system that collects minute-based metrics, it generally takes around 5-10 minutes to generate an alert. Another 5-10 minutes for the person to get to their computer unless it's already in their hand (what if you get unlucky and on-call was taking a shower or using the toilet?). After that, another 5-10 minutes to see what's going on with the system.

After all that, it usually takes some more minutes to actually fix the problem.

Dropbox has a nice article on all the changes they made to streamline incidence response https://dropbox.tech/infrastructure/lessons-learned-in-incid...


I've worked two SRE (or SRE-adjacent) jobs with oncall duty (some unicorn and a FAANG). Neither has been remotely as bad as what you're saying. (Only one was actually 24/7 for a week shift.)

The whole point is that before you join, the team has done sufficient work to not make it hell, and your work during business hours makes sure it stays that way. Are there a couple bad weeks throughout the year? Sure, but it's far, far from the norm.


Constantly? It's one wakeup in 4 months.


I did that for a few years, and wasn't on 500k a year, but I'm also the company co-founder, so you could argue that a "specific monetary goal" was applicable.


You don't need 24/7 SREs, you could do it with 24/7 first-line customer support staff monitoring Twitter, Reddit, and official lines of comms that have the ability to page the regular engineering team.

That's a lot easier to hire, and lower cost. More training required of what is worth waking people up over; way less in terms of how to fix database/cache bugs.


Support engineers and an official bug reporting channel would help. I noticed and reported the issue on their official forums on 16 March, but got no response.

https://community.openai.com/t/bug-incorrect-chatgpt-chat-se...

I only reported it on the forums because there didn't seem to be an official bug reporting channel, just a heavyweight security reporting process.

As well as the actions they took to fix this specific bug, another useful action would be to have a documented and monitored bug reporting channel.


Probably because they launched ChatGPT as an experiment and didn't think it would blow up, needing full time SRE etc. I don't think it was designed for scale and reliability when they launched.


Do events like this cause them to lose enough revenue that it would make sense to hire a bunch of SRE's?


Probably the real reason. I assume they intend to make money off enterprise contracts which would include SLAs. Then they'd set their support based off that


Given the Microsoft partnership, they might not even need to manage any real infrastructure. Just hand it off to Azure and let them handle the details.


Just add metrics for the number of times "ChatGPT" and "OpenAI" appeared in tweets, reddit posts, and HN comments in the last (rolling) five minutes, put them on a dashboard alongside all your other monitoring, and have a threshold where they page the oncall to review what's being said. It doesn't even have to be an SRE in this case; it could be just about anyone.
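As a toy sketch of that threshold idea (the class and all the numbers here are made up for illustration; real anomaly detection is harder than this):

  from collections import deque

  class MentionMonitor:
      """Rolling-window check: record per-minute mention counts of a keyword
      ("ChatGPT", "OpenAI", ...) and page a human when the last few minutes
      spike well above an assumed baseline."""

      def __init__(self, window_minutes=5, baseline_per_minute=20, spike_factor=5):
          self.window = deque(maxlen=window_minutes)
          self.threshold = baseline_per_minute * window_minutes * spike_factor

      def record_minute(self, mention_count):
          self.window.append(mention_count)

      def should_page(self):
          # True once the rolling window sums to more than spike_factor times
          # the expected baseline for that window.
          return sum(self.window) > self.threshold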


I managed to manually produce this bug 2 months ago. As they don't have any bug bounty, I didn't submit it. By starting a conversation and refreshing before ChatGPT had time to answer, I managed to reproduce this bug 2-3 times in January.


did you reach out via https://openai.com/security.txt?


No, as I said, their disclosure page says there is no and I'm not a professional security researcher so I was not very interested in helping them.

I even find it funny now: they could have written that they would provide free API creds, but now they've had a very bad moment due to their greed.


*there is no bug bounty

Missed a word.


You won't fire off a quick email or warn others because there's no bug bounty?


Not everyone has the privilege of working for free


As I understand, the "work" was already done, the only thing missing was sending a heads-up email with "hey, this seems iffy, maybe you ought to look into it".

I dunno, I generally report issues I find in software, paid or not, as I've always done. It usually takes ~10 minutes, and 1% of the time they ask for more details and I spend maybe 20 minutes more filling them out.

Never been paid for it ever; the most I've gotten was a free yearly subscription. But in general I do it because I want what I use to be less buggy.


Sending an email is work too.


I reported this race condition via ChatGPT's internal feedback system after I saw other users' chat titles loading in my sidebar a couple of times (around 7-8 weeks ago). Didn't get a response, so I assumed it was fixed...

Hopefully they'll start a bug bounty program soon, and prioritise bug reports over features.


The explanation at the time was that unavailable chat data (due to, e.g. high load) resulted in a null input sometimes being presented to the chat summary system, which in turn caused the system to hallucinate believable chat titles. It's possible that they misdiagnosed the issue or that both bugs were present and they caught the benign one before the serious one.


Yeah, I was surprised that the bug appeared simply from using the app normally. My first thought was that it was data from another user loading, so I immediately reported that it looked like a race condition. But maybe it was this other bug you mention.


Same for me. Actually, only the summary of the history was from a different user; the content itself was mine.


The claim made at the time was that the titles were not from other people and were in fact caused by the model hallucinating after the input query timed out (or something like that). Obviously that sounds a little suspect now, but it might be true.


That's a lie if so. If you look at the Reddit threads, there's no way those were not specific other users' histories, as they read like logically coherent browsing histories. E.g., one I saw had stuff like "what is X", then the next would be "how to X" or something. Some were all in Japanese, others all in Chinese. If it was random you wouldn't see clear logical consistency across the list.


> In the hours before we took ChatGPT offline on Monday, it was possible for some users to see another active user’s first and last name, email address, payment address, the last four digits (only) of a credit card number, and credit card expiration date

This is a lot of sensitive data. It says 1.2% of ChatGPT Plus subscribers active during a 9 hour window, which considering their user base must be a lot.


It’s a bit unclear whether this means 1.2% of all ChatGPT Plus subscribers, or 1.2% of the subscribers who were active during that 9-hour window.


The original issue report is here: https://github.com/redis/redis-py/issues/2624

This bit is particularly interesting:

> I am asking for this ticket to be re-oped, since I can still reproduce the problem in the latest 4.5.3. version

Sounds like the bug has not actually been fixed, per drago-balto.



.... "I am asking for this ticket to be re-opened, since I can still reproduce the problem in the latest 4.5.3. version"


The PR: https://github.com/redis/redis-py/pull/2641

According to the latest comments there, the bug is only partially fixed.


> "If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection."

The OpenAI API was incredibly slow for some days, and lots of requests probably got cancelled (I certainly was doing that). I imagine someone could write a whole blog post about how that worked; it would be interesting reading.
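For anyone trying to picture that failure mode, here's a toy asyncio sketch (a made-up connection class, not redis-py's actual internals): cancelling a caller between sending its request and reading its reply leaves a stale reply on the connection for the next caller.

  import asyncio
  from collections import deque

  class FakeConnection:
      """Toy stand-in for a pooled connection: replies come back strictly in
      the order requests were sent."""

      def __init__(self):
          self.replies = deque()  # replies queued on the wire, in order

      async def send(self, key):
          # Pretend the server immediately queues its reply for this request.
          self.replies.append(f"value-for-{key}")
          await asyncio.sleep(0.01)  # simulated network latency

      async def recv(self):
          await asyncio.sleep(0.01)  # simulated network latency
          return self.replies.popleft()

  async def get(conn, key):
      await conn.send(key)
      return await conn.recv()

  async def main():
      conn = FakeConnection()

      # Request A is cancelled after send() but before recv(), so its reply
      # is left sitting on the connection.
      task_a = asyncio.create_task(get(conn, "user-A-chat-titles"))
      await asyncio.sleep(0.015)
      task_a.cancel()
      try:
          await task_a
      except asyncio.CancelledError:
          pass

      # Request B reuses the same connection and reads A's stale reply.
      print(await get(conn, "user-B-chat-titles"))  # -> value-for-user-A-chat-titles

  asyncio.run(main())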


Was this written by ChatGPT? Maybe it found the bug as well, who knows.


There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.


… in this case this variant seems more appropriate:

  There are 3 hard problems in Computer Science:
  1. naming things
  2. cache invalidation
  3. 4. off-by-one errors
  concurrency


It's surprising that openai seems to be the only one being affected. If the issue is with redis-py reusing connections then wouldn't more companies/products be affected by this?


their description of the problem seemed kind of obtuse. In practice, these connection-pool-related issues have to do with: 1. request is interrupted, 2. exception is thrown, 3. exception is caught, connection is returned to the pool, move on. The thing that has to be implemented is 2a: clean up the state of the connection when the interruption exception is caught, and only then return it to the pool.

that is, this seems like a very basic programming mistake and not some deep issue in Redis. the strange way it was described makes it seem like they're trying to conceal that a bit.
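To make 2a concrete, the safe pattern looks roughly like this (hypothetical pool/connection API, purely for illustration; not redis-py's actual interface):

  async def execute(pool, command):
      # `pool`, `acquire`, `discard`, and `release` are hypothetical names.
      conn = await pool.acquire()
      try:
          await conn.send(command)
          response = await conn.read_response()
      except BaseException:
          # Step 2a: the request may have been interrupted with a reply still
          # in flight, so the connection's state is unknown. Discard it instead
          # of returning it to the pool. (asyncio.CancelledError derives from
          # BaseException, so this covers cancellation as well as errors.)
          await pool.discard(conn)
          raise
      else:
          # Clean request/response cycle: safe to return the connection.
          await pool.release(conn)
          return response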


It's an open source library, I assume that logic is abstracted within it and that the "basic mistake" was one of the maintainers'.


I think most apps using redis-py rarely cancel async redis commands. Python async web frameworks are gaining popularity, but the majority of people using Python for their web applications are not using an async framework. And of those who do, not many cancel async redis requests often enough to trigger the bug.


There is a 1 year old autoclosed issue which is very similar to OpenAI's issue: https://github.com/redis/redis-py/issues/2028


Serious question: Why do people feel it's necessary to use a redis cluster?

I understand in early 2000s we were using spinning disks and it was the only way. Well, we don't use spinning disks any more, do we?

A modern server can easily have terabytes of RAM and petabytes of NVMe, so what's stopping people from just using postgres?

A cluster of radishes is an anti-pattern.


1. Redis can handle a lot more connections, more quickly, than a database can. 2. It's still faster than a database, especially a database that's busy.

#2 is an interesting point. When you benchmark, the normal process is to just set up a database then run a shitload of queries against it. I don't think a lot of people put actual production load on the database then run the same set of queries against it...usually because you don't have a production load in the prototyping phase.

However, load does make a difference. It made more of a difference in the HDD era, but it still makes a difference today.

I mean, redis is a cache, and you do need to ensure that stuff works if you purge redis (ie: be sure the rebuild process works), etc, etc.

But just because it's old doesn't mean it's bad. OS/390 and AS/400 boxes are still out there doing their jobs.


A pretty small Redis server can handle 10k clients and saturate a 1Gbps NIC. You'd need a pretty heavy duty Postgres database and definitely need a connection pooler to come anywhere close.


I agree that redis can handle some query volumes and client counts that postgres can't.

But FWIW I can easily saturate a 10GBit ethernet link with primary key-lookup read-only queries, without the results being ridiculously wide or anything.

Because it didn't need any setup, I just used:

  SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;

I don't immediately have access to a faster network, but connecting via TCP to localhost and using some moderate pipelining (common in the redis world afaik), I get up to 19GB/s on my workstation.


> SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;

This selects every column (*) from every table (ObjectID is of type regclass)?


Sorry, I should have used something more standard - but it was what I had ready...

It just selects every column from a single table, pg_class. Which is where postgres stores information about relations that exist in the current database.


and those have reliable backup/restore infrastructure. Using redis as a cache is fine, just don't use it as your primary DB.


I'm confused about why there's a need to complicate something as seemingly straightforward as a KV store into a series of queues that can get all mixed up. I asked ChatGPT to explain it though, and it sounds like the justification for its existence is that it doesn't "block the event loop" while a request is "waiting for a response from Redis."

Last time I checked, Redis doesn't take that long to provide a response. And if your Redis servers actually are that overloaded that you're seeing latency in your requests, it seems like simple key-based sharding would allow horizontally scaling your Redis cluster.

Disclaimer: I am probably less smart than most people who work at OpenAI so I'm sure I'm missing some details. Also this is apparently a Python thing and I don't know it beyond surface familiarity.


Redis latency is around 1ms including network round trip for most operations. In a single threaded context, waiting on that would limit you to around 1000 operations per second. Redis clients improve throughput by doing pipelining, so a bunch of calls are batched up to minimize network roundtrips. This becomes more complicated in the context of redis-cluster, because calls targeting different keys are dispatched to different cache nodes and will complete in an unpredictable order, and additional client-side logic is needed to accumulate the responses and dispatch them back to the appropriate caller.
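For what it's worth, basic (non-cluster) pipelining in redis-py looks like this; the key names are made up for the example:

  import redis

  r = redis.Redis(host="localhost", port=6379)

  # Without a pipeline, each GET pays a full network round trip (~1ms each).
  # With one, the commands are buffered client-side and flushed in a single
  # batch; execute() returns the replies in the same order they were queued.
  pipe = r.pipeline()
  for i in range(100):
      pipe.get(f"conversation:{i}:title")  # hypothetical key naming
  titles = pipe.execute()  # one round trip, 100 replies, in request order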


I'm not familiar with the Python client specifically, but Redis clients generally multiplex concurrent requests onto a single connection per Redis server. That necessitates some queueing.


Yes! I have been spending the last couple months pulling out completely unnecessary redis caching from some of our internal web servers.

The only loss here is network latency, which is negligible when you're colocated in AWS.

Postgres's caches end up pulling a lot more weight too when the db isn't only being hit on a cache miss from the web server.


For caching the query results you get from your database. Also it's easier to spin up Redis and replicate it closer to your user than doing that with your main database. From my experience anyway.


I think the idea is that if your db can hold the working set in RAM and you're using a good db + prepared queries, you can just let it absorb the full workload because the act of fetching the data from the db is nearly as cheap as fetching it from redis.


> For caching the query results you get from your database.

This only makes sense if queries are computationally intensive. If you're fetching a single row by index you aren't winning much (or anything).


Of course? I'm not really sure what the original question actually is if you know that users benefit from caching the results of computationally intensive queries.


OpenAI uses redis to store pieces of text. Fetching pieces of text is not computationally intensive.


Most likely they have them in an rdbms, so it's more like joining a forum thread together. Not expensive, but why not prebuild and store it instead?


> This only makes sense if queries are computationally intensive.

Or if the link to your DB is higher latency than you're comfortable with.


Better concurrency (10k vs ~200 max connections for postgres). ~20x faster than Postgres at key-value read/write operations. (Mostly) single threaded, so atomicity is achieved without the synchronization overhead found in an RDBMS.

Thus, it's much cheaper to run at massive scale like OpenAI's for certain workloads, including KV caching

also:

- robust, flexible data structures and atomic APIs to manipulate them are available out-of-the box

- large and supportive community + tooling


My redis clusters are 10x more cost effective than my postgresdb in handling load.


For caching somewhat larger objects based on ETag?


People know it, that's all.


Do you want to ask why we use caching instead of main db in RAM? Or why we use redis instead of postgres for caching?


Nice writeup, it's fair in the content presented to us.

Yet I'm wondering why there is no check that the response actually belongs to the issued query.

The client issuing a query could pass a token and verify, upon receiving the answer, that the answer contains that token.

TBH, as a user of the client I would kind of expect the library to have this feature built-in, and if I'm starting to use the library to solve a problem, handling this edge case would be of somewhat low priority to me if the library didn't implement it, probably because I'm lazy.

I hope that the fix they offered to Redis Labs does contain a solution to this problem and that every one of us using this library will be able to profit from the effort put into resolving the issue.

It doesn't [0], so the burden is still on the developer using the library.

[0] https://github.com/redis/redis-py/commit/66a4d6b2a493dd3a20c...

---

Edit: Now I'm confused, this issue [1] was raised on March 17 and fixed on March 22, was this a regression? Or did OpenAI start using this library on March 19-20?

Interesing comment:

> drago-balto commented 3 hours ago

> Yep, that's the one, and the #2641 has not fixed it fully, as I already commented here: #2641 (comment)

> I am asking for this ticket to be re-oped, since I can still reproduce the problem in the latest 4.5.3. version

[1] https://github.com/redis/redis-py/issues/2624#issue-16293351...


That sounds more like a hindsight thing. In most systems authorization doesn't happen at the storage layer. Most queries fetch data by an identifier which is only assumed to be valid based on authorization that typically happens at the edge and then everything below relies on that result.

It's not the safest design but I wouldn't say the client should be expected to implement it. That security concern is at the application layer and the actual needs of the implementation can be wildly different depending on the application. You can imagine use cases for redis where this isn't even relevant, like if it's being used to store price data for stocks that update every 30 seconds. There's no private data involved there. It's out of scope for a storage client to implement.


I've long thought that it is often better to return a bit of extra data in internal API responses to validate that the response matches the request sent. That can be fairly simple, like parroting a request ID, or including some extra metadata (e.g. part of the request) to confirm the response is valid. It's not the most efficient, but it can save your bacon sometimes. Mixing up deployment stacks (e.g. thinking you are talking to staging but actually it's prod) and mixing user data are pretty scary, so any defense in depth seems useful.
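A minimal sketch of that idea, assuming a cache client with get()/set() like redis-py (the key/envelope scheme here is just illustrative):

  import json

  def cache_store(cache, key, value):
      # Store the key alongside the value so a reader can check it later.
      cache.set(key, json.dumps({"key": key, "value": value}))

  def cache_fetch(cache, key):
      raw = cache.get(key)
      if raw is None:
          return None  # ordinary cache miss
      entry = json.loads(raw)
      if entry.get("key") != key:
          # The reply doesn't match the request we made: treat it as a
          # corruption signal rather than silently serving someone else's data.
          raise RuntimeError(
              f"cache returned data for {entry.get('key')!r}, expected {key!r}"
          )
      return entry["value"]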


This is more a data leak than an outage…


It was down for quite a while, so I would call it an outage.


If you're subscribed to their status page, you'll know it's actually unusual for a day to go by without an outage alert from OpenAI. They don't usually write them up like this, but I guess this counts as a PII leak disclosure for them? For having raised billions of dollars, they are comically immature from a reliability and support perspective.


To be fair, they accidentally made a game-changing breakthrough that gained millions of users overnight, and I don't think they were ready for it.

Before chatgpt, most normal people had never heard of OpenAI. Their flagship product was basically an API that only programmers could make useful.

Team leaders at OpenAI have stated that they were not expecting the success, let alone the highest adoption rate for any product in history. In their minds, it was just a cleaned-up version of a 2-year old product. It was billed as a research preview.

So, all of a sudden you go from hiring mostly researchers because you only have to maintain an API and some mid-traffic web infra, to suddenly having the fastest growing web product in history and having to scale up as fast as you can. Keep in mind that they didn't get backing from Microsoft until January 23, 2023-- that was only 2 months ago.

I'd say we should cut them some slack.


These problems predate ChatGPT. Their API has been on the market for nearly 3 years. And they raised their first $1B in 2019. That's plenty of money and time to hire capable leadership.


Yeah but again, this is the fastest growing app in history and it uses way more compute than your standard webapp, and basically delivers all functionality from a single service that handles that load. I can see why there would be some growing pains.


Not sure why no one is talking about the serious data breach of personal and credit card information in this case. Meanwhile, everyone is very concerned about the compromise of GitHub's SSH key in another thread.


Does anyone else find it a bit off-putting how much emphasis they keep putting on "open source library"? I don't think I've read about this without the word open source appearing more than once in their own messaging about it. Why is it so important to emphasize that the library with the bug is open source?

The cynic in me wants to believe that it's a way of deflecting blame somehow, to make it seem like they did their due diligence but were thwarted by something outside of their control. I don't think it holds. If you use an open source library with no warranty, you are responsible (legally and otherwise) to ensure that it is sufficient. For example, if you break HIPAA compliance due to an open source library, it is still you who is responsible for that.

But of course, they're not claiming it's anyone else's fault anywhere explicitly, so it's uncharitable to just assume that's what they meant. Still, it rubs me the wrong way. I can't fight the feeling that it's a wink wink nudge nudge to give them more slack than they'd otherwise get. It feels like it's inviting you to just criticize redis-py and give them a break.

The open postmortem and whatnot is appreciated and everything, but sometimes it's important to be mindful of what you emphasize in your postmortems. People read things even if you don't write them, sometimes.


I noticed it too, but it doesn't necessarily bother me. Possibly they're just trying to say, "This incident may have made us look like we're complete amateurs who don't have any clue about security, but it wasn't like that."

Using someone else's library doesn't absolve you of responsibility, but failing to be vigilant at thoroughly vetting and testing external dependencies is a different kind of mistake than creating a terrible security bug yourself because your engineers don't know how to code or everyone is in too much of a rush to care about anything.


Yes, I agree with that sentiment, and I thought precisely the same. I know as an engineer that I would feel compelled to mention that it was an obscure bug in an open source library, if that was the case. Not to excuse myself of responsibility, but because I would feel so ashamed if I myself introduced such an obvious security flaw. I would still of course consider myself responsible for what happened.

A lot of the time when people make mistakes, they explain themselves because they are afraid of being perceived as completely stupid or incompetent for making that mistake, not to excuse themselves from taking responsibility, even though people frequently think that an excuse or explanation means you are trying to absolve yourself of what you did.

There's a huge difference to me between hitting an obscure bug like this and introducing that type of security issue because you couldn't logically consider it. The first one can be resolved in the future by introducing processes and making sure all open source libraries are from trusted sources, but the second one implies that you are fundamentally unable to think it through, and therefore probably also unable to improve on that.


Why?

The result for the end consumer is identical whether they have their PII leaked from "an external library" vs a vendor's own home-baked solution.

It's not really a different kind of mistake, it's exactly the same kind of mistake, because it is exactly the same mistake! This is talking the talk, and not walking the walk, when it comes to security.

Publishing a writeup that passes the buck to some (unnamed) overworked and underpaid open source maintainer is worse, not better!


Right.

The dev had such a big ego that they didn't want to say "I was dumb and left open a bug", so the dev says "I was so dumb that I left open a bug in software I was also too dumb or lazy to write or even read". It's not better.


I agree, it is a different kind of mistake; it is immensely worse than creating a terrible security bug yourself.

Outsourcing your development work without acceptance criteria and without validation of fitness for purpose is complete, abject engineering incompetence. Do you think bridge builders look at the rivets in the design and then just waltz over to Home Depot and pick out one that looks kind of like the right size? No, they have exact specifications and it is their job to source rivets that meet those specifications. They then either validate the rivets themselves or contract with a reputable organization that legally guarantees they meet the specifications, and it might be prudent to validate it again anyways just to be sure.

The fact that, in software, not validating your dependencies, i.e. the things your system depends on, is viewed as not so bad is a major reason why software security is such an utter joke and why everybody keeps making such utterly egregious security errors. If one of the worst engineering practices is viewed as normal and not so bad, it is no wonder the entire thing is utterly rotten.


I do not believe it's necessarily nefarious in nature, but maybe more specifically it feels kind of like they're implying that this is actually a valid escape hatch: "Sorry, we can't possibly audit this code because who audits all of their open source deps, amirite?"

But the truth is that actually, maybe that hints at a deeper problem. It was a direct dependency to their application code in a critical path. I mean, don't get me wrong, I don't think everyone can be expected to audit or fund auditing for every single line of code that they wind up running in production, and frankly even doing that might not be good enough to prevent most bugs anyways. Like clearly, every startup fully auditing the Linux kernel before using it to run some HTTP server is just not sustainable. But let's take it back a step: if the point of a postmortem is to analyze what went wrong to prevent it in the future, then this analysis has failed. It almost reads as "Bug in an open source project screwed us over, sorry. It will happen again." I realize that's not the most charitable reading, but the one takeaway I had is this: They don't actually know how to prevent this from happening again.

Open source software helps all of us by providing us a wealth of powerful libraries that we can use to build solutions, be we hobbyists, employees, entrepreneurs, etc. There are many wrinkles to the way this all works, including obviously discussions regarding sustainability, but I think there is more room for improvement to be had. Wouldn't it be nice if we periodically had actual security audits on even just the most popular libraries people use in their service code? Nobody in particular has an impetus to fund such a thing, but in a sense, everyone has an impetus to fund such work, and everyone stands to gain from it, too. Today it's not the norm, but perhaps it could become the norm some day in the future?

Still, in any case... I don't really mean to imply that they're being nefarious with it, but I do feel it comes off as at best a bit tacky.


I mean, if there were ever a company in a position to figure out a scalable way to audit OSS before usage, it'd be OpenAI, right?


It's only a 100x capped profit multi billion dollar company. How could they afford to read the code they ship?


They really skirt around the fact that they apparently introduced a bug which quite consistently initiated redis requests and terminated the connection before receiving the result.


Doesn't bother me either. All the car companies issue recalls regularly, sometimes an issue only shows up when the system hits capacity or you run into an edge case.


The gaping hole in this write-up goes something like:

"In order to prevent a bug like this from happening in the future, we have stepped up our review process for external dependencies. In addition, we are conducting audits around code that involves sensitive information."

Of course, we all know what actually happened here:

- we did no auditing;

- because our audit process consists of "blame someone else when our consumers are harmed";

- because we would rather not waste dev time on making sure our consumers are not harmed

If you want to know why no software "engineering" is happening here, this is your answer. Can you imagine if a bridge collapsed, and the builder of the bridge said, "iunno, it's the truck's fault for driving over the bridge."


Are you confident that an audit would have uncovered this bug? I’d be surprised if audits are effective at finding subtle bugs and race conditions, but I could be wrong.


Depends on the type of audit. Subtle bugs are often uncovered by fuzzing, for example, but a race condition might not be found without substantial load.


When a bridge collapses, everyone does try to shift blame onto someone else :)


I think you’re reading too much into it. Being an open source library is relevant because it means it’s third party and doesn’t come with a support agreement, so fixing a bug is a somewhat different process than if it were in your own code or from a proprietary vendor.

Yes, it’s technically up to you to vet all your dependencies, but in practice, often it doesn’t happen, people make assumptions that the code works, and that’s relevant too.


Also, vetting a dependency != auditing and testing every line of code to find all possible bugs.

If this bug was an open issue in the project's repo, that might be concerning and indicate that proper vetting wasn't done. Ditto if the project is old and unmaintained, doesn't have tests, etc. But if they were the first to trigger the bug and it only occurs under heavy load in production conditions, well, running into some of those occasionally is inevitable. The alternative is not using any dependencies, in which case you'd just be introducing these bugs yourself instead. Even with very thorough testing and QA, you're never going to perfectly mimic high load production conditions.


Open source can be fixed as if it was your own code. (And that is a strong tenet of free/open source software.)

Not only do most open/free source libraries come without support agreements: they come with the broadest possible limitation of warranties. (As they should)

So the company, knowing that what they are using comes without any warranty either of quality or fitness to the use-case, have a very strong burden of due diligence / vetting.


> in practice, often it doesn’t happen, people make assumptions that the code works

True, but that's an inexcusable practice and always has been. We as an industry need to stop accepting it.


What do you mean by "stop accepting it?"

All of us rely on millions of lines of code that we have not personally audited every single day. Have you audited every framework you use? Your kernel? Drivers? Your compiler? Your CPU microcode? Your bootrom? The firmware in every gizmo you own?

If "Reflections on Trusting Trust" has taught us anything, it's turtles all the way down. At some point, you have to either trust something, or abandon all hope and trust nothing.


> Have you audited every framework you use? Your compiler? Your CPU microcode? Your bootrom?

Of course not. I exclude the CPU microcode, bootrom, and the like from the discussion because that's not part of the product being shipped.

But it's also true that I don't do a deep dive analyzing every library I use, etc. I'm not saying that we should have to.

What I'm saying is that when a bug pops up, that's on us as developers even when the bug is in a library, the compiler, etc. A lot of developers seem to think that just because the bug was in code they didn't personally write, that means that their hands are clean.

That's just not a viable stance to take. The bug should have been caught in testing, after all.

If your car breaks down because of a design failure in a component the auto manufacturer bought from another supplier, you'll still (rightfully) hold the auto manufacturer responsible.


> when a bug pops up

That’s reacting to a bug you know about. Do you mean to talk about how developers aren’t good enough at reacting to bugs found in third party libraries, or how they should do more prevention?

In this case, it seems like OpenAI reacted fairly appropriately, though perhaps they could have caught it sooner since people reported it privately.

“Holding someone responsible” is somewhat ambiguous about what you expect. It seems reasonable that a car manufacturer should be prepared to do a recall and to pay damages without saying that they should be perfect and recalls should never happen.


> Do you mean to talk about how developers aren’t good enough at reacting to bugs found in third party libraries, or how they should do more prevention?

My point was neither of these. My point is very simple: the developers of a product are responsible for how that product behaves.

I'm not saying developers have to be perfect, I'm just saying that there appears to be a tendency, when something goes wrong because of external code, to deflect blame and responsibility away from them and onto the external code.

I think this is an unseemly thing. If I ship a product and it malfunctions, that's on me. The customer will rightly blame me, and it's up to me to fix the problem.

Whether the bug was in code I wrote or in a library I used isn't relevant to that point.


I've also noticed it, and I can't help but interpret it as their way of shifting blame. Which is irresponsible. It's their product, and they need to take accountability for the bug occurring.

It's a serious bug, but in the grand scheme of things not earth-shattering, and not something that I think would discourage usage of their product. But their treatment of the bug causes more concern than the bug itself. They are shifting the blame onto the library rather than onto the process by which that library made it into their product, and I don't understand how they can't see how that reflects poorly on them as an AI company.

What I find so confusing is that, at the end of the day, OpenAI's core product is a good process for creating value out of a massive amount of data and building a good API on top of it. The open-source library is effectively something they processed into their product and built an API on top of. So it creates (for me) some doubt about how they will react when faced with similar challenges to their core product. How will they behave when the data they consume impacts their product negatively? Based on this limited experience, they'll shift the blame to the data, not their process, and move on.

It seems likely that this is only the beginning of OpenAI having a large customer base, with a high impact on many products. This is a disappointing result on their first test of how they'll manage issues and bugs in their products.


I don't find it over-emphasized. Many in the Twitter-sphere are acting as if they aren't being appreciative of open source software and I don't see it that way.

The technical root cause was in the open source library. There's a patch available and more likely than not OpenAI will continue to use the library.

Being overly sensitive to blame would be distracting to the technical issue at hand. It's great they are posting this post-mortem to raise awareness that the libraries you use can have bugs and to consider that risk when building systems.


A root cause analysis would likely also flag the lack of threat modeling / security evaluation of their dependencies.

It would likely also question the lack of resources allocated to these open-source projects by companies that profit, in part, from using them.


I half agree, but I also half-sympathize with them, because it really wasn't their fault -- it was a quite-bad bug in a very fundamental library.

Bugs happen, though. Especially in Python.


Instead of spending engineering time, they used a free and open-source library to do less work.

The license they agreed to in order to use this library has this in capital letters. [THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND].

After agreeing to this license and using the library for free, they charged people money and sold them a service. And when that library they got for free, which they read and agreed that had no warranty of any kind, had a software bug, they wrote a blog post and blamed the outage of their paid service on this free library.

This is not another open-source project, or a small business. This is a company that got billions of dollars in investment, and a lot of income by selling services to businesses and individuals. They don't get to use free, no-warranty code written by others to save their own money, and then blame it and complain about it loudly for bugs.


> it really wasn't their fault -- it was a quite-bad bug in a very fundamental library.

It's still their fault. When you ship code, you are responsible for how that code behaves regardless of where the code came from.


Only for some incredibly broad definition of fault that almost no one uses.

How many people make sure all of the open source libraries they're using are bug free?

Anyone besides maybe NASA?


I've never cared per se that a library was bug free but I've put a lot of effort/$ into making sure the features that used the libraries in my product were bug free (with the amount of effort depending on the sensitivity of the feature, data, etc).

Usually "fix the original library" wasn't as easy or immediate a fix as "hack around it", which is sad for the overall OSS ecosystem, but it's still the responsibility of the person releasing the product.

Unfortunately these sorts of bugs are wildly difficult to predict. Yet it's also a wildly common architecture. That's what's sad for all of us as engineers as a whole. But "caching credit card details and home addresses", for instance, is... particularly dicey. That's very sensitive, and you're tossing it into more DBs, without good access control restrictions?


> Only for some incredibly broad definition of fault that almost no one uses.

It's a definition most laypeople use. It's developers who tend to use a very narrow definition.

I don't think it should be controversial to say that when you ship a product, you are responsible for how that product behaves.


Anywhere you handle payment-related or other PII data, transitive dependencies, framework and language choices, memory sharing, and other risks have to be taken into account as something that you, as the party developing and operating the service, are solely responsible for.


Anyone who has to pay out of their own pocket when things go wrong: consulting warranties, liability for security exploits, ...


There were several reports of this issue in February/early March on the r/ChatGPT subreddit - OpenAI could have known about it if they had listened to the community.

Alternatively, they knew about it, and didn't fix the bug until it bit them


> Especially in Python.

as opposed to...?


Go, for one.

In my experience errors are more common (for both cultural and technological reasons) in Python than in Go.

I would guess something similar applies to Rust, though I don't have personal experience.

There's wide variation in C, but with careful discrimination, you can find very high-quality libraries or software (redis itself being an excellent example).

I don't have rigorous data to back this up, but I'm pretty convinced it's true, based on my own experience.


As opposed to not in Python.


… like JavaScript? Bash? C? PHP?

Certainly none of those are widely used and have a reputation for making it easy to keep the gun aimed squarely at the foot.


Those would be roughly similar. The main difference, I'd guess, is between dynamically typed interpreted languages and statically typed compiled ones. At least I think I make fewer mistakes when the compiler literally tells me what's wrong before I even run the thing. It's awful and slow to develop that way, but it is more reliable when that's a requirement.

So compared to ones like Kotlin or Rust.


it really was their fault. they chose to ship the bug. it doesn't matter in the least that someone else previously published the code under a license with no warranty whatsoever.


I was upvoting you, but then reading

> Especially in Python.

made me unvote.


This doesn't come across this way to me at all. They just described what happened. Do you expect them to jump in front of a bus for the library they're using, and beg for forgiveness for not ensuring the widely used libraries they're leveraging are bug free?

There are very few companies that couldn't get caught by this type of bug.


Basically agree -- feels off-putting, but not technically a wrong detail to add. An additional reason it rubs me the wrong way, however, is that I believe open-source software code is especially critical to ChatGPT family's capabilities. Not just for code-related queries, but for everything! (e.g. see this "lineage-tracing" blog post: https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...)

Thus, I honestly think firms operating generative AI should be walking on eggshells to avoid placing blame on "open-source". Rather, they really should be going out of their way to channel as much positive energy towards it as possible.

Still, I agree the charitable interpretation is that this is just purely descriptive.


The library is provided by the Redis team themselves, and the bug is awful [1]. I know it's not Redis's fault, but this bug could hit anyone: connections may be left in a dirty state where they return data from the previous request to the following one.

[1] https://github.com/redis/redis-py/issues/2624
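To make the failure mode concrete, here is a toy model of it (not redis-py's actual internals, just a sketch of the general pattern): a request on a shared connection is cancelled after the command goes out but before the reply is read, so the leftover reply gets served to the next caller.

    import asyncio

    class FakeConnection:
        """Toy stand-in for a pooled connection: replies queue up in send order."""
        def __init__(self):
            self._replies = asyncio.Queue()

        async def send(self, request):
            async def server():
                await asyncio.sleep(0.05)                    # simulated server latency
                await self._replies.put(f"reply-to:{request}")
            asyncio.create_task(server())

        async def recv(self):
            return await self._replies.get()

    async def do_request(conn, request):
        await conn.send(request)
        return await conn.recv()                             # cancelled here => reply stays buffered

    async def main():
        conn = FakeConnection()                              # the shared, "pooled" connection

        # User A's request is cancelled after send() but before recv():
        task_a = asyncio.create_task(do_request(conn, "user-A"))
        await asyncio.sleep(0.01)
        task_a.cancel()

        # User B reuses the same connection and receives A's leftover reply.
        print(await do_request(conn, "user-B"))              # prints: reply-to:user-A

        await asyncio.sleep(0.1)                             # let the stray reply land before shutdown

    asyncio.run(main())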


I think you are reading a bit between the lines. I didn't feel they were blaming the library so much as stating that the bug happened because of an issue in the library. Maybe they could have sugarcoated it under 10 layers of corporate jargon, but I'd rather take this over that.


The emphasis could also have been meant to educate folks using this combination to check their setup.

Though a version reference should also have been included in the postmortem, as general guidance to their readers; at least a quick Google search leads you to it.

https://github.com/redis/redis-py/releases/tag/v4.5.3

For anyone reading this and using a combination of asyncio and redis-py, please bump your versions.

I've encountered similar issues in the past with asyncio Python and Postgres when trying to pool connections. They're really not easy to debug either.


Not surprising from a company that calls itself OpenAI. The "open source" keyword stuffing is so people associate the "open" in OpenAI with open source. Psyops, I mean marketing, 101.


I think you’re overreacting. What bothered me is that they didn’t link to the actual bug or provide a reference ID.


I don't know. To me it's simply an explanation of what has happened. I think its exactly what I would have written if I was in their position. And show me the one company that has audited all source code of all used open source projects, at least in a way that is able to rule out complex bugs like this. I have once found a memory corruption bug in Berkeley DB wrecking our huge production database, which I would have never found in any pre-emptive source code audit, however detailed.

Edit: On second thought, maybe they could have just written "external library" instead of "open source library".


> The cynic in me wants to believe that it's a way of deflecting blame somehow

That's how it reads to me as well.

Of course, it doesn't deflect blame at all. Any time you include code in your project, no matter where the code came from, you are responsible for the behavior of that code.


Personally, I think it was partially a virtue signal to show that they use open source software and collaborate with the maintainers.


Was the postmortem generated by ChatGPT?


If the FTC had teeth and good judgement, they'd force OpenAI to rename themselves.


This is a common bug with a lot of software. For example some HTTP clients that do pooling won’t invalidate the connection after timing out waiting for the response.
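A defensive pattern that avoids this (a minimal sketch, not any particular client's API): treat any connection whose request timed out or errored as tainted, close it, and put a fresh one back in the pool.

    import queue

    class ConnectionPool:
        """Sketch: a connection that timed out mid-request must be discarded,
        never returned to the pool, because its late reply would otherwise be
        read by the next borrower."""

        def __init__(self, create_conn, size=4):
            self._create = create_conn             # e.g. lambda: socket.create_connection(addr)
            self._pool = queue.LifoQueue()
            for _ in range(size):
                self._pool.put(create_conn())

        def request(self, send_and_recv):
            conn = self._pool.get()
            try:
                result = send_and_recv(conn)       # may raise e.g. socket.timeout
            except Exception:
                conn.close()                       # the old reply may still be in flight
                self._pool.put(self._create())     # replace with a fresh connection
                raise
            self._pool.put(conn)                   # only clean connections are reused
            return result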


I suspect the fatal error OpenAI saw was an "SSL error: decryption failed or bad mac"? Or if SSL were disabled, they likely would see the sort of parsing error vaguely described as an "unrecoverable server error" due to streams of data from one request being swapped with another, with incorrect data structures, bad alignment, etc. I can see how if the stars aligned and SSL were disabled this data race would manifest in viewing another user's request, so long as the sockets were receiving similar responses when they were swapped. I suspect the issue is deeper than the bug recently fixed in just redis-asyncio.

Libraries written prior to asyncio/green threads often have that functionality enabled by means of monkey patches or shims that juggle file handles or other shared state, and there are data races. When SSL is used hopefully those data races reading from sockets result in MAC errors and the connection is terminated.

It's easy to imagine the same mistake happening one level up, in message passing code that manages queues or other shared data structures.

Search for "python django OR celery OR redis OR postgres OR psycopg2 decryption failed or bad mac" and you'll see the scale of the issue, it's fairly widespread.

I don't have a high degree of confidence in this ecosystem. I've written about this before on Hacker News[1], and I'm not confident in the handling of shared data structures in Python libraries. I don't think I can blame any maintainers here, there's a huge number of people asking for these concurrency features and the way that it's often implemented - monkey patching especially - makes it extremely difficult to do correctly.

[1] https://news.ycombinator.com/item?id=31065472


> Actions we’ve taken
>
> - [testing, assertions, log alerting, debugging]

I reported this bug in the forums (no response), because I couldn't find an official way to report bugs - at least it wasn't documented anywhere that I could find. The actions they've taken fix that bug, but not the next one. OpenAI should take an action to improve their bug reporting channels, so the next bug gets found more quickly.

https://community.openai.com/t/bug-incorrect-chatgpt-chat-se...

Other messages indicate issues with bug reporting too:

* https://news.ycombinator.com/item?id=35291943
* https://news.ycombinator.com/item?id=35295747


The key part:

If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection.
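Conceptually (a hedged sketch, not redis-py's real code; the method names on `conn` are made up), the fix is to treat cancellation between "command sent" and "response read" as poisoning the connection:

    import asyncio

    async def execute(conn, *command):
        # `conn` is assumed to expose send_command(), read_response() and
        # disconnect() coroutines (hypothetical names for this sketch).
        try:
            await conn.send_command(*command)
            return await conn.read_response()
        except asyncio.CancelledError:
            # The reply for this command is still owed on the wire; if this
            # connection went back to the pool, the next request would read it.
            await conn.disconnect()
            raise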



Using ChatGPT to summarize:

This section summarizes the issue with the ChatGPT service which caused some users to be able to see another user's chat history and payment information. OpenAI took the ChatGPT service offline and published technical details of the problem. The bug was found in the Redis client open-source library and has been patched. OpenAI has reached out to affected users and improved their systems to prevent this from happening again. They have also expressed their commitment to protecting user privacy and their dedication to the open-source community.


Interesting. I pasted in the outage report, and then asked ChatGPT to give me a single sentence summary and it gave me an answer way longer than a sentence. I pointed out that it wasn't a single sentence and it gave me:

> ChatGPT experienced an outage due to a bug in an open-source library that allowed some users to see titles from another active user's chat history and payment-related information of 1.2% of ChatGPT Plus subscribers was exposed; the bug has since been patched and actions have been taken to prevent future incidents.

I then asked it why there was the outage and it spit out:

> The exact cause of the Facebook outage is unknown, but it is believed to be related to a configuration change in the company's routers.

It's likely because I ran out of tokens because the OpenAI outage report is long. Pasting in the text of the outage report, and then re-asking about why, it was able to give a much better answer:

> There was an outage due to a bug in an open-source library that allowed some users to see titles from another active user's chat history and also unintentionally exposed payment-related information of 1.2% of ChatGPT Plus subscribers who were active during a specific nine-hour window.

Querying it further, again having to repeat the whole OpenAI outage report, and asking it a few different ways I eventually managed to get this succinct answer:

> The bug was caused by the redis-py library's shared pool of connections becoming corrupted and returning cached data belonging to another user when a request was cancelled before the corresponding response was received, due to a spike in Redis request cancellations caused by a server change on March 20.

It did take me more than a few minutes to get to there, so just actually reading the report would have been faster, and I ended up having to read the report to verify that answer was correct and not a hallucination anyway, so our jobs are safe for now.


Try with GPT 4. The token window is quadruple.


I decided to see how Bing Chat would do on this. I opened the page in Edge and I was given a summary automatically when I clicked the Discover button:

---

Welcome back! Here are some takeaways from this page.

> ChatGPT was offline due to a bug in redis-py that caused some users to see other users’ chat history and payment information.

> The bug was patched and the service was restored, except for a few hours of chat history.

> The bug affected 1.2% of ChatGPT Plus subscribers who were active during a nine-hour window on March 20.

> Full credit card numbers were not exposed at any time.

> OpenAI apologized to the users and the ChatGPT community and took steps to prevent such incidents in the future.

---


Funnily enough, I've had a very similar bug occur in an entirely separate Redis library. It was a pretty troubling failure mode to suddenly start getting back unrelated data.


It boggles my mind how they're not absolutely checking the user & conversation id for EVERY message in the queue given the possible sensitivity of the requests. How is this even remotely acceptable?

In the Reddit post that first surfaced this, the user saw conversations related to politics in China and other topics sensitive to the CCP.

This can absolutely get people hurt, and they absolutely must take it seriously.
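For what it's worth, the belt-and-braces check being asked for here is cheap: store the owner's ID inside the cached value and verify it against the requesting user on the way out. A rough sketch (the cache object and key scheme are made up):

    import json

    def cache_user_payload(cache, user_id, key, payload):
        # `cache` is anything with get/set, e.g. a Redis client.
        record = {"owner": user_id, "data": payload}
        cache.set(f"user:{user_id}:{key}", json.dumps(record))

    def read_user_payload(cache, user_id, key):
        raw = cache.get(f"user:{user_id}:{key}")
        if raw is None:
            return None
        record = json.loads(raw)
        if record["owner"] != user_id:
            # A mixed-up response (corrupted connection, wrong key, ...) fails
            # closed instead of leaking someone else's data.
            raise RuntimeError("cache returned data for a different user")
        return record["data"]

It wouldn't prevent the connection-level mix-up, but it would turn "show another user's data" into a server error.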


It doesn’t boggle my mind at all. Session data appears and is used to render the page. Do you verify the actual cookie every time and go back to the DB to see what user it points to?

No, everyone assumes their session object is instantiated with the right values at that level of the code.


That is a pretty good disclosure that creates trust.


Is it only me who hasn't seen any chat history since yesterday? Chat generally doesn't work either: you can type a message, but clicking the button or hitting Enter / Ctrl+Enter has no effect.

In the chat history there is a "retry" button, but if you click it and inspect the result, you see "internal server error".


I've had the exact same issue since last week. It never fixed itself, if that's what you're waiting for. I had to resort to creating a new account with a different email to get access again. I contacted support but have yet to hear back. Not sure if it was the cause, but the issue started when I tried to buy Plus and the payment failed.


I found it was an issue with Firefox and its default privacy settings. Clicking the shield icon in the Firefox address bar and turning this privacy guard off made chat work again.


Sounds like they need to use a Lua script to ensure that the pop and push operations remain atomic within the Redis instance?

For example, a combined spop + sadd:

    local src_list = ARGV[1]
    local dst_list = ARGV[2]
    local value = redis.call('spop', src_list)
    if value then -- avoid pushing nils
        redis.call('sadd', dst_list, value)
    end
    return value
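A script like that can be loaded and called from redis-py roughly like this (a sketch; the set names are made up, and Redis actually recommends passing key names via KEYS rather than ARGV):

    import redis

    r = redis.Redis()                    # assumes a local Redis
    spopsadd = r.register_script("""
        local src_list = ARGV[1]
        local dst_list = ARGV[2]
        local value = redis.call('spop', src_list)
        if value then -- avoid pushing nils
            redis.call('sadd', dst_list, value)
        end
        return value
    """)

    # Atomically move one member from the "pending" set to the "in_progress" set:
    moved = spopsadd(args=["pending", "in_progress"])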


First GitHub, then OpenAI. Two of Microsoft's finest(!) (majority owned and acquired, respectively) companies at the top of HN announcing serious security incidents.

It's quite unsettling to see this leak of highly sensitive information and a private key exposure as well. Doesn't look good and seems like they don't take security seriously.


In the case of OpenAI, the product is more of a research demo that had to be drastically scaled up, though. From an operations point of view it’s more like a startup.


Nobody cares, and this is yet another case study of HN being out of touch with reality.


> "Nobody cares"

Yet another case study of absolutism, which can be simply dismissed.

People paying for ChatGPT care when it goes down and when their details and chats get leaked - and that is certainly true outside of HN. Same with GitHub. The two have ~100M users between them.

That's the reality.


I'm paying for ChatGPT and I don't care about this any more than the many, many other services I use that have at some point had an embarrassing security issue.


I'm paying and I don't care. If I wrote perfect bug-free code, led a perfect life, and lived in a perfect world, I'd be upset.

But, I know that shit happens and the reliability meter should be flexible for different things (bridges, heart surgery and chat agent).

If I trained my brain to bitch, whine, and moan about everything, I wouldn't have the resources to care about the really important things.


> I'm paying and I don't care.

"Yeah". Two people out of the majority of commenters not caring when it was unavailable and having their chat history leaked and as well as GitHub going down again. [0] Surely that alone is "Nobody" caring. /s

This is the whole problem with you being absolutist. Seems to have aged like milk.

[0] https://news.ycombinator.com/item?id=35295216


Well - they have had more bugs and will have more bugs to worry about.

https://twitter.com/naglinagli/status/1639343866313601024


I wonder how much time passed between the first cases of corruption leading to exceptions (which they ignored as "eh, not great, not terrible, we'll look at it later") and users reporting seeing other users' data?


Am I the only one terribly bored by the onslaught of trivial AI news these last months?

Every fart some AI-related person makes becomes huge news, followed by tens of random blog posts, all submitted to HN.


At least it isn't about the Rust language this time grumbles


For some reason I liked reading about Rust (or any other technology) a lot more than about AI.

Part of it is that the average engineer could understand and grok what those articles were talking about, and I could appreciate, relate to, and, if applicable, criticize them.

The AI news just seems to swing between hype and doomsday prophecies, and little discussion about the technical aspects of it.

Obviously OpenAI choosing to keep it closed source makes any in-depth discussion close to impossible, but also some of this is so beyond the capabilities of an average engineer with a laptop. It can be frustrating.


Because Rust hasn't conquered AI the way it conquered crypto.

But we will see AI stuff rewritten in Rust quite soon.


I bet against it.


This reminds me of a comment I made 1.5 months ago [0]:

I was logging in during heavy load, and after typing the question I started getting responses to questions which I didn't ask.

gdb answered on that comment "these are not actually messages from other users, but instead the model generating something ~random due to hitting a bug on our backend where, rather than submitting your question, we submitted an empty query to the model."

I wonder if it was the same redis-py issue back then, but just at another point in the backend. His answer didn't really convince me back then.

[0] https://news.ycombinator.com/item?id=34614796&p=2#34615875


I'd put money on it.



Unbelievable. How many times have you seen AWS explain an outage (or a PII leak like this one) as an open-source library bug? Have they asked the 5 whys? Why was the bug deployed into production? Why was there not enough testing before deployment? Why was integration testing only done with low concurrency? Why is a standard release-testing procedure missing? Why are there no synthetic-traffic tests and gates before rolling out to 100% in production?


I don’t fault them for having an outage. It’s hard to think of any other recent site with a comparable popularity spike.


They were/are storing payment data in redis? LOL!


The postmortem doesn’t say that. It just says they were caching “user information”. Maybe that includes a Stripe customer or subscription ID that they look up before sending an email, for example.


Yeah, probably the session ID - and when the wrong session ID is returned, other operations like "get user details" would pull that user's data from relational storage.


Yes they were. It says they stored billing address and some pieces of credit card data.


Chance this was not 90% written by ChatGPT is 10%


Did they use ChatGPT to fix the bug?


That sounds like the kind of bug that could be prevented by modeling with TLA+.


maybe they've just scrolled over issue lists of popular tech stacks and cherry-picked the most compelling one to bury the dirt.


It's interesting (read: wrong) for an AI company to bother writing the user interface for their web application.

This was a failure of integration testing and defensive design, whether the component was open-source or not. There's no reason to believe that an AI company would have the diligence and experience to do the grunt work of hardening a site.

But management obviously understood the level and character of interest. Actual users include probably 10,000 curiosity seekers for every actual AI researcher, with 1,000 of those being commercial prospects -- people who might buy their service.

This is a clear sign that the managers who've made technical breakthroughs in AI are not capable even of deploying the service at scale -- much less of managing the societal consequences of AI.

The difficulty with the board getting adults in the room is that leaders today give the appearance of humility and cooperation, with transparent disclosures and incorporation of influencers into advisory committees. The leaders may believe their own abilities because their underlings don't challenge them. So there's no obvious domineering friction, but the risk is still there, because of inability to manage.

Delegation is the key to scaling, code and organizations. "Know thyself" is about knowing your limits, and having the humility to get help instead of basking in the puffery of being in control.

This isn't a PR problem. It's the Achilles' heel of capitalism, and the capitalists on OpenAI's board should nip this incipient Musk in the bud or risk losing 2-3 orders of magnitude of return on their investment.


It sounds like their Redis key was not unique enough and yada yada yada it returned sensitive info to the wrong people.


Did you read the article? That’s not at all what happened.



