March 20 ChatGPT outage: Here’s what happened (openai.com)
359 points by zerojames on March 24, 2023 | 259 comments



The disclosure provides valuable information, but the introduction suggests someone else, or «open-source», is to blame:

>We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history.

Blaming an open-source library for a fault in a closed-source product is simply unfair. The MIT-licensed dependency explicitly comes without any warranties. After all, the bug went unnoticed until ChatGPT put it under pressure, and it was ChatGPT that failed to rule out the bug in their release QA.


They are not suing anybody, and just because it’s open source doesn’t mean talking about bugs is taboo. It reads to me like raw information, which is very good. They said they contacted the authors and are helping with the upstream bug fix. If anything, it’s a great example of how to deal with this kind of problem.


I agree it's a good example – if not great. But considering OpenAI itself is as closed source as can be, mentioning open-source software as the cause of their outage in the very first line of their statement seems somewhat out of place. I don't think it's a coincidence that the open-source mention comes in the opening line while the acknowledgement of the Redis team comes at the very end. Many outside of software engineering might read this as «open source caused OpenAI to crash».


Ironically, this sort of insane take, which is all over the thread, is causing more harm to the image of open-source developers than OpenAI or anyone else could ever do with their press release. It's important to mention which open-source library caused a bug, because that bug could be present in many other applications using it. That's basically one of the main points of open source.


What would your opening sentence have been?

It’s a great opening sentence. Short, factual, explanatory. You would have to be hypersensitive to object.

As far as I can see, the whole thing is a reaction to OpenAI not being sufficiently open - which is a fine argument to have with them, but shouldn’t cloud judgement on the quality of this write-up


i'm not the person you replied to, but i'm guessing maybe they'd have preferred:

"We took ChatGPT offline earlier this week due to a bug in an external library which allowed some users to see titles from another active user’s chat history."


Thanks for that reply - yeh, that's a reasonable alternative.


But the bug resided in an open source library, not their own code. What else could they say?

Bugs will always exist whether you have a QA dept or not.


Shifting the blame to open source isn't a good look. I like what Bryan Cantrill had to say about it: https://twitter.com/bcantrill/status/1638707484620902401


The blame wasn’t “shifted”. The source of the bug was reported



Spot on


They're not blaming anyone. Objectively there was a bug in that library that caused the problem.


They're not blaming, legally speaking. But they're communicating that open source software caused their outage. OpenAI chose to use software that explicitly came without warranties, and are legally solely responsible for problems caused by open-source libraries they choose to include in their product.


I understand where you're coming from. But there is certainly an audience for this kind of post that would like to read about the objective source of the bug without connecting it to some expectations around responsibility.

I read this post, and I don't see it assigning blame to anyone other than themselves. See this bit copied from the post:

> Everyone at OpenAI is committed to protecting our users’ privacy and keeping their data safe. It’s a responsibility we take incredibly seriously. Unfortunately, this week we fell short of that commitment, and of our users’ expectations. We apologize again to our users and to the entire ChatGPT community and will work diligently to rebuild trust.


The questionable part is saying "open source". They refrained from naming the library, which is good; but then why bring up that one fact about it right after?


The facts could be more clearly stated by refraining from stating that open-source software was the cause.


Why mention it is open source then? What does that add?


"there was a bug in an outside library that we used" does not mention open source but has the same meaning, and would probably provoke the same complaints ("they're trying to blame somebody else for their problem").

In that case, though, they could say "look, we used a popular open source library because we had more faith that it would be better tested and correct" which would be a compliment to open source. That's essentially the information that we have.

In today’s world, who builds anything from anywhere close to scratch? Embedded developers probably come closest. It’s no worse or better to say "there was a bug that our release uncovered." If they continue to announce as many details as possible, we as the audience can develop a sense of whether they’re creating bugs or just uncovering bugs we’re glad to know about.


I think it's fair to mention the bug originated in redis-py, but I don't find it relevant at all to mention «open-source» in the opening line of the public statement about the outage. Or «outside library» for that matter. It was ChatGPT release QA that failed, and then they failed to admit it.


"given enough eyeballs, all bugs are shallow" is a compliment to open source, and telling your users that you use open source is a positive reassurance on other dimensions as well; it's relevant, and the truth is always relevant; there's no harm in mentioning it.

"Release early, release often" and "move fast and break things" are respected ideas that diminish the importance of QA. The more effective and efficient any QA you do is, yes, very valuable, and QA that finds broken things, all the better. But don't move slow is an OK compromise.


It might have said redis-py originally, and someone in public relations advised replacing it with something more general so that it would sound nicer.


On the other hand, why overthink it?


why not mention it? What does mentioning it subtract?


«Open source» was in the opening line of the statement as if it was something to blame.


If someone asked ChatGPT to generate some code, then copy and pasted it mindlessly into their project, I wonder what OpenAI would think about the claim:

"The bug was a result of faulty code produced by ChatGPT."

Using an open source library is like copy and pasting code into your project. You assume responsibility for any pitfalls.


Come to think of it, I think that's realistically going to happen a lot.

It's going to be recursive blaming all the way down.


I mean they’re pretty clear with their warnings about accuracy.


Just like any author of MIT licensed open source software is clear about their license which clearly states they are not responsible in any way whatsoever for the shortcomings of their licensees.


According to this, it's not fixed yet... https://github.com/redis/redis-py/issues/2624


I don't recall anyone advocating omitting the name log4j when that security bug dropped in 2021. How is that situation materially different from this? redis-py fixed the behavior so it's definitely not the case of working-as-intended.

To be clear, I would be up in arms if OpenAI was trying to hold redis potentially legally responsible.


Damn open source! Those guys aren't living up to their paychecks, I foresee staff cuts...


I understand and agree that framing open source as the issue is ridiculous. They may be redeemed though through the likely scenario that they didn't write this - GPT probably did. Perhaps their GPT doesn't like open source. :D


I am curious how you would have phrased it? Is there a way to accurately convey the root cause of the problem (the redis-py library) without sounding like trying to shift blame?


there was a bug on our codebase...


But then you don't get the goodwill of helping the open-source library fix the bug and contributing back to the community.


Did you read the technical details? This was a bug in the way Redis-py handled requests that were cancelled before a response was returned to the user.


So? You own the stack. All that had to be said was there was a bug in the way requests were handled in a library. The open source part is not even needed.


I think the mention of exactly where is useful to this audience. Other people using the same open-source library might realize --- oh, we've run into that bug but never realized it. Low probability, sure, but nonetheless.


OpenAI could have simply named the library up front instead of stressing that it is open source. This would have alerted people who use the library while at the same time avoiding the suggestion that it being open source was part of the problem.


They tweeted a few days back that there was a bug in an open source library. That's the main issue.


Yes, and this is still the sole responsibility of OpenAI. They chose to include redis-py without warranties and are responsible for their own QA on any product built on it.


I reject the premise that this is a bug. The author picked this implementation, with whatever side effects. The author might want some requests to cancel in some situations. It's up to OpenAI to decide if the implementation works for their setup.


Maybe they should let ChatGPT craft the postmortem


Maybe they already did


I prefer accuracy over placating your public relations ego

The only counterpoint I would take is one related to the accuracy of what they wrote


This shows that LLM AI, as dangerous as OpenAI says it is, cannot be entrusted to OpenAI for safeguarding its use.


The phrase you quoted doesn't blame anyone, it just shows causation. Causation and moral responsibility are not the same thing.


Why did it take them 9 hours to notice? The problem was immediately obvious to anyone who used the web interface, as evidenced by the many threads on Reddit and HN.

> between 1 a.m. and 10 a.m. Pacific time.

Oh... so it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation? Given the size of their funding, and the number of users they have, there is really no excuse not to at least have some basic monitoring system in place for this (although it's true that, ironically, this particular class of bug is difficult to detect in a monitoring system that doesn't explicitly check for it, despite being immediately obvious to a human observer).

Perhaps they should consider opening an office in Europe, or hiring remotely, at least for security roles. Or maybe they could have GPT-4 keep an eye on the site!


Staffing an actual 24x7 rotation of SREs costs about a million dollars a year in base salary as a floor and there are few SREs for hire. A metrics-based monitor probably would have triggered on the increased error rate but it wouldn’t have been immediately obvious that there was also a leaking cache. The most plausible way to detect the problem from the user perspective would be a synthetic test running some affected workflow, built to check that the data coming back matches specific, expected strings (not just well-formed). All possible but none of this sounds easy to me. Absolutely none of this is plausible when your startup business is at the top of the news cycle every single day for the past several months.


"there are few SREs for hire"

How do you figure? If you mean there are few SREs with several years of experience, you might be right. SRE is a fairly new title so that's not too surprising.

However, my experience with a recent job search is that most companies aren't hiring SREs right now because they consider reliability a luxury. In fact, I was in search of a new SRE position because I was laid off for that very reason.


You don't even need an SRE to have an on-call rotation; you could ping a software engineer who could at least recognize the problem and either push a temporary fix, or try to wake someone else to put a mitigation in place (e.g. disabling the history API, which is what they eventually did).

However, I think the GP's point about this class of bug being difficult to detect in a monitoring system is the more salient issue.


Well hang on! Your question was why was the time to detect so high and you specifically mentioned 24x7 SRE so I thought that’s what we were talking about ;)

And I do think the answer is that monitoring is easy but good monitoring takes a whole lot of work. Devops teams tend to get to sufficient observability, whereas an SRE team should be dedicating its time to engineering great observability, because the SRE team is not being pushed by product to deliver features. A functional org will protect SRE teams from that pressure, a great one will allow the SRE team to apply counter-pressure from the reliability and non-functional perspective to the product perspective. This equilibrium is ideal because it allows speed but keeps a tight leash on tech debt by developing rigor around what is too fast or too many errors or whatever your relevant metrics are.


I’ve anecdotally observed the opposite. I have noticed SRE jobs remain posted, even by companies laying off or announcing some kind of hiring slowdown over the last quarter or so. More generally, businesses that have decided that they need SRE are often building out from some kind of devops baseline that has become unsustainable for the dev team. When you hit that limit and need to split out a dedicated team, there aren’t a ton of alternatives to getting an SRE or two in and shuffling some prod-oriented devs to the new SRE team (or building a full team from scratch, which is what the $$ was estimating above). Among other things, the SRE bailiwick includes capacity planning and resource efficiency; SRE will save you money in the long term.

On a personal note, I am sorry to hear that your job search has not yet been fruitful. Presumably I am interested in different criteria from you; I have found several postings that are quite appealing, to the point where I am updating my CV and applying, despite being weakly motivated at the moment.


My search was fruitful. I'm doing regular SWE work now. Market sucks though.


Every system failure prompts people to exclaim "why aren't there safeguards?". Every time. Well guess what: if we try to do new stuff, we will run into new problems.


There is nothing new about using redis for cache, or returning a list for a user.


Are you trying to say cache invalidation in a distributed system is a trivial problem?


I'm not disagreeing with you, and I'm not the commenter you're replying to, but it's worth noting that cache leakage and cache invalidation are two different problems.


You're right. Thanks for pointing that out. My original point still stands, distributed systems are hard and people demanding zero failures are setting an impossible standard.


It's non-trivial but it's also not that hard, there are well known strategies for achieving it; especially if you relax guarantees and only promise eventual consistency then it becomes fairly trivial - we do this for example and have little problems with it.


This wasn’t a cache invalidation problem. It was a cache corruption error.


I'm saying there is nothing new about it.


Probably the cheapest solution would be letting GPT monitor user feedback from various social media channels and alert human engineers to check on the summarized problem. GPT could even engage with users to request more details or reproducible cases ;)


that's abusable, as you can manipulate gpt however you like.


Since it now handles visual inputs, I wonder how hard it'd be to get GPT to monitor itself. Have it constantly observe a set of screenshares of automated processes starting and repeating ChatGPT sessions on prod, alert the on-call when it notices something "weird."


RUM monitoring does 99% of what you want already. Anomaly detection is the hard part. IMO too early to say whether gpt will be good at that specific task but I agree that a LLM will be in the loop on critical production alerts in 2023 in some fashion.


They raised a billion dollars.


How much have they spent?


You don't necessarily need a full team of SREs- you can also have a lightly staffed ops center with escalation paths.


I don’t think that model has the properties you think it does. Someone still has to take call to back the operators. Someone has to build the signals that the ops folks watch. Someone has to write criteria for what should and should not be escalated, and in a larger org they will also need to know which escalation path is correct. And on and on — the work has to get done somewhere!


The way those criteria usually get written in a startup with mission-critical customer-facing stuff (like this privacy issue) is that first the person watching Twitter and email and whatever else pages the engineers, and then there's a retro on whether or not that particular one was necessary, lather, rinse, repeat.

All you need on day 1 is someone to watch the (metaphorical) phones + a way to page an engineer. Don't start by spending a million bucks a year, start by having a first aid kit at the ready.

Perhaps they could also help this person out by looking into some sort of fancy software to automatically summarize messages that were being sent to them, or their mentions on Reddit, or something, even?


Yup, twitter monitoring is a thing that I have seen implemented. We did not allow it to page us, however. As you say, some of the barriers around that are low or gone as of late. I wonder if someone has already secured seed funding for social media monitoring as a service. The feature set you can build on a LLM is orders of magnitude better than what was practical before.

Looking at my post up-thread, I wish I had emphasized the time aspect more - of course all of these problems are solvable but it takes both time and money. They have the money now but two months ago the parts of this incident were in place but the scale was so small that it never actually leaked data. Or maybe a handful of early adopters saw some weird shit but we’re all well-trained to just hit refresh these days. Hiring even one operator and getting them spun up takes calendar time that simply has not existed yet. I assume someone over there is panicking about this and trying to get someone hired to make sure they look better prepared next time, because there will be a next time, and if they’re even half as successful as the early hype leads me to believe, I expect they are going to have a lot more incidents as they scale. One in a million is eight and a half times per day at 100 rps.


> early adopters saw some weird shit

Since I wrote this, I have seen several anecdotes that support this guess. This is a classic scaling problem. One or two users saw it, and one even says they reported it, but at small scale with immature tools and processes getting to the actual software bug is a major effort that has to be balanced around other priorities like making excessive amounts of money.


> […] it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation?

OpenAI is hiring Site Reliability Engineers (SRE) in case you, or anyone you know, is interested in working for them: https://openai.com/careers/it-engineer-sre . Unfortunately, the job is an onsite role that requires 5 days a week in their San Francisco office, so they do not appear to be planning to have a 24/7 on-call rotation any time soon.

Too bad because I could support them in APAC (from Japan).

Over 10 years of industry experience, if anyone is interested.


I had forgotten that I looked at this and came to the same conclusion as you. I’d happily discuss a remote SRE position but on-site is a non-starter for me, and most of SRE, if I am reading the room correctly.

Edit to add: they’re also paying in line with or below industry, and the role description reads like a technical project manager, not an SRE. I imagine people are banging down the door because of the brand, but personally that’s a lot of red flags before I even submit an application.


that is quite low for a FAANG-level SRE/SWE.


Also, I heard their interviews (for any technical position) are very tough.


nobody qualified wants the 24/7 SRE job unless it pays an enormous amount of money. i wouldn't do it for less than 500 grand cash. getting woken up at 3am constantly or working 3rd shift is the kind of thing you do with a specific monetary goal in mind (i.e., early retirement) or else it's absolute hell.

combine that with ludicrous requirements (the same as a senior software engineer) and you get gaps in coverage. ask yourself what senior software engineer on earth would tolerate getting called CONSTANTLY at 3am, or working 3rd shift.

the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.


Timezones are a thing - your 3am is someone's 9am and may be a significant part of your customer base.

Being paged constantly is a sign of bad alerts or bad systems IMO - either adjust the alert to accept the current reality or improve the system


spinning up a subsidiary in another country (especially one with very strict labor laws, like in European countries) is not as easy as "find some guy on the internet and pay him to watch your dashboard". And then you have to give him root so he can actually fix stuff without calling your domestic team, which would defeat the whole purpose.

also, even getting paged ONCE a month at 3am will fuck up an entire week at a time if you have a family. if it happens twice a month, that person is going to quit unless they're young and need the experience.


It's really not that difficult, and there are providers like Deel who can manage it all for you, to the point you just ACH them every month.

Source: co-founder of a remote startup with employees in five countries


like you said, timezones are a thing. now you're managing a global team.


That sounds harder than it is, especially if you already allow remote work. It mostly just forces you to have better docs.


Sorry to be clear I was replying to this part of your comment

> the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.

I agree that the stakes are lower in terms of harm, but was trying to express that whilst it might not be life and death, it might be hindering someone being able to do their job / use your product - eg: it still impacts customer experience and your (business) reputation.

False pages for transient errors are bad - ideally you only get paged if human intervention is required, and this should form a feedback cycle to determine how to avoid it in future. If all the pages are genuine problems requiring human action then this should feed into tickets to improve things


Not only that, but you probably need to follow the sun if you want <30 minute response time.

Given a system that collects minute-based metrics, it generally takes around 5-10 minutes to generate an alert. Another 5-10 minutes for the person to get to their computer unless it's already in their hand (what if you get unlucky and on-call was taking a shower or using the toilet?). After that, another 5-10 minutes to see what's going on with the system.

After all that, it usually takes some more minutes to actually fix the problem.

Dropbox has a nice article on all the changes they made to streamline incidence response https://dropbox.tech/infrastructure/lessons-learned-in-incid...


I've worked two SRE (or SRE-adjacent) jobs with oncall duty (some unicorn and a FAANG). Neither has been remotely as bad as what you're saying. (Only one was actually 24/7 for a week shift.)

The whole point is that before you join, the team has done sufficient work to not make it hell, and your work during business hours makes sure it stays that way. Are there a couple bad weeks throughout the year? Sure, but it's far, far from the norm.


Constantly? It's one wakeup in 4 months.


I did that for a few years, and wasn't on 500k a year, but I'm also the company co-founder, so you could argue that a "specific monetary goal" was applicable.


You don't need 24/7 SREs, you could do it with 24/7 first-line customer support staff monitoring Twitter, Reddit, and official lines of comms that have the ability to page the regular engineering team.

That's a lot easier to hire, and lower cost. More training required of what is worth waking people up over; way less in terms of how to fix database/cache bugs.


Support engineers and an official bug reporting channel would help. I noticed and reported the issue on their official forums on 16 March, but got no response.

https://community.openai.com/t/bug-incorrect-chatgpt-chat-se...

I only reported it on the forums because there didn't seem to be an official bug reporting channel, just a heavyweight security reporting process.

As well as the actions they took to fix this specific bug, another useful action would be to have a documented and monitored bug reporting channel.


Probably because they launched ChatGPT as an experiment and didn't think it would blow up, needing full time SRE etc. I don't think it was designed for scale and reliability when they launched.


Do events like this cause them to lose enough revenue that it would make sense to hire a bunch of SRE's?


Probably the real reason. I assume they intend to make money off enterprise contracts which would include SLAs. Then they'd set their support based off that


Given the Microsoft partnership, they might not even need to manage any real infrastructure. Just hand it off to Azure and let them handle the details.


Just add metrics for the number of times "ChatGPT" and "OpenAI" appeared in tweets, reddit posts, and HN comments in the last (rolling) five minutes, put them on a dashboard alongside all your other monitoring, and have a threshold where they page the oncall to review what's being said. It doesn't even have to be an SRE in this case; it could be just about anyone.
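As a toy sketch of that threshold idea (the class and all the numbers here are made up for illustration; real anomaly detection is harder than this):

  from collections import deque

  class MentionMonitor:
      """Rolling-window check: record per-minute mention counts of a keyword
      ("ChatGPT", "OpenAI", ...) and page a human when the last few minutes
      spike well above an assumed baseline."""

      def __init__(self, window_minutes=5, baseline_per_minute=20, spike_factor=5):
          self.window = deque(maxlen=window_minutes)
          self.threshold = baseline_per_minute * window_minutes * spike_factor

      def record_minute(self, mention_count):
          self.window.append(mention_count)

      def should_page(self):
          # True once the rolling window sums to more than spike_factor times
          # the expected baseline for that window.
          return sum(self.window) > self.threshold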


I managed to manually produce this bug 2 months ago. As they don't have any bug bounty, I didn't submit it. By starting a conversation and refreshing before ChatGPT had time to answer, I managed to reproduce this bug 2-3 times in January.


did you reach out via https://openai.com/security.txt?


No, as I said, their disclosure page says there is no and I'm not a professional security researcher so I was not very interested in helping them.

I even find it funny now: they could have written that they would provide free API creds, but now they've had a very bad moment due to their greed.


*there is no bug bounty

Missed a word.


You won't fire off a quick email or warn others because there's no bug bounty?


Not everyone has the privilege of working for free


As I understand, the "work" was already done, the only thing missing was sending a heads-up email with "hey, this seems iffy, maybe you ought to look into it".

I dunno, I generally report issues I find in software, paid or not, as I've always done. It usually takes ~10 minutes, and 1% of the time they ask for more details and I spend maybe 20 minutes more filling them out.

Never been paid for it ever; the most I've gotten was a free yearly subscription. But in general I do it because I want what I use to be less buggy.


Sending an email is work too.


I reported this race condition via ChatGPT's internal feedback system after I saw other users' chat titles loading in my sidebar a couple of times (around 7-8 weeks ago). Didn't get a response, so I assumed it was fixed...

Hopefully they'll start a bug bounty program soon, and prioritise bug reports over features.


The explanation at the time was that unavailable chat data (due to, e.g. high load) resulted in a null input sometimes being presented to the chat summary system, which in turn caused the system to hallucinate believable chat titles. It's possible that they misdiagnosed the issue or that both bugs were present and they caught the benign one before the serious one.


Yeah, I was surprised that the bug appeared simply from using the app normally. My first thought was that it was data from another user loading, so I immediately reported that it looked like a race condition. But maybe it was this other bug you mention.


Same for me. Actually, only the summary of the history was from a different user; the content itself was mine.


The claim made at the time was that the titles were not from other people and were in fact caused by the model hallucinating after the input query timed out (or something like that). Obviously that sounds a little suspect now, but it might be true.


That's a lie if so. If you look at the Reddit threads, there's no way those were not specific other users' histories, as they read like logically coherent browsing histories. E.g., one I saw had stuff like "what is X", then the next would be "how to X" or something. Some were all in Japanese, others all in Chinese. If it was random you wouldn't see clear logical consistency across the list.


> In the hours before we took ChatGPT offline on Monday, it was possible for some users to see another active user’s first and last name, email address, payment address, the last four digits (only) of a credit card number, and credit card expiration date

This is a lot of sensitive data. It says 1.2% of ChatGPT Plus subscribers active during a 9 hour window, which considering their user base must be a lot.


It’s a bit unclear whether this means 1.2% of all ChatGPT Plus subscribers, or 1.2% of the subscribers who were active during that 9-hour window.


The original issue report is here: https://github.com/redis/redis-py/issues/2624

This bit is particularly interesting:

> I am asking for this ticket to be re-oped, since I can still reproduce the problem in the latest 4.5.3. version

Sounds like the bug has not actually been fixed, per drago-balto.



.... "I am asking for this ticket to be re-opened, since I can still reproduce the problem in the latest 4.5.3. version"


The PR: https://github.com/redis/redis-py/pull/2641

According to the latest comments there, the bug is only partially fixed.


> "If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection."

The OpenAI API was incredibly slow for some days, and lots of requests probably got cancelled (I certainly was doing that). I imagine someone could write a whole blog post about how that worked; it would be interesting reading.
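For anyone trying to picture that failure mode, here's a toy asyncio sketch (a made-up connection class, not redis-py's actual internals): cancelling a caller between sending its request and reading its reply leaves a stale reply on the connection for the next caller.

  import asyncio
  from collections import deque

  class FakeConnection:
      """Toy stand-in for a pooled connection: replies come back strictly in
      the order requests were sent."""

      def __init__(self):
          self.replies = deque()  # replies queued on the wire, in order

      async def send(self, key):
          # Pretend the server immediately queues its reply for this request.
          self.replies.append(f"value-for-{key}")
          await asyncio.sleep(0.01)  # simulated network latency

      async def recv(self):
          await asyncio.sleep(0.01)  # simulated network latency
          return self.replies.popleft()

  async def get(conn, key):
      await conn.send(key)
      return await conn.recv()

  async def main():
      conn = FakeConnection()

      # Request A is cancelled after send() but before recv(), so its reply
      # is left sitting on the connection.
      task_a = asyncio.create_task(get(conn, "user-A-chat-titles"))
      await asyncio.sleep(0.015)
      task_a.cancel()
      try:
          await task_a
      except asyncio.CancelledError:
          pass

      # Request B reuses the same connection and reads A's stale reply.
      print(await get(conn, "user-B-chat-titles"))  # -> value-for-user-A-chat-titles

  asyncio.run(main())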


Was this written by ChatGPT? Maybe it found the bug as well, who knows.


There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.


… in this case this variant seems more appropriate:

  There are 3 hard problems in Computer Science:
  1. naming things
  2. cache invalidation
  3. 4. off-by-one errors
  concurrency


It's surprising that openai seems to be the only one being affected. If the issue is with redis-py reusing connections then wouldn't more companies/products be affected by this?


their description of the problem seemed kind of obtuse. In practice, these connection-pool-related issues have to do with: 1. request is interrupted, 2. exception is thrown, 3. exception is caught, connection is returned to the pool, move on. The thing that has to be implemented is 2a: clean up the state of the connection when the interruption exception is caught, and only then return it to the pool.

that is, this seems like a very basic programming mistake and not some deep issue in Redis. the strange way it was described makes it seem like they're trying to conceal that a bit.
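To make 2a concrete, the safe pattern looks roughly like this (hypothetical pool/connection API, purely for illustration; not redis-py's actual interface):

  async def execute(pool, command):
      # `pool`, `acquire`, `discard`, and `release` are hypothetical names.
      conn = await pool.acquire()
      try:
          await conn.send(command)
          response = await conn.read_response()
      except BaseException:
          # Step 2a: the request may have been interrupted with a reply still
          # in flight, so the connection's state is unknown. Discard it instead
          # of returning it to the pool. (asyncio.CancelledError derives from
          # BaseException, so this covers cancellation as well as errors.)
          await pool.discard(conn)
          raise
      else:
          # Clean request/response cycle: safe to return the connection.
          await pool.release(conn)
          return response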


It's an open source library, I assume that logic is abstracted within it and that the "basic mistake" was one of the maintainers'.


I think most apps using redis-py rarely cancel async redis commands. Python async web frameworks are gaining popularity, but the majority of people using Python for their web applications are not using an async framework. And of those who do, not many cancel async redis requests often enough to trigger the bug.


There is a 1 year old autoclosed issue which is very similar to OpenAI's issue: https://github.com/redis/redis-py/issues/2028


Serious question: Why do people feel it's necessary to use a redis cluster?

I understand in early 2000s we were using spinning disks and it was the only way. Well, we don't use spinning disks any more, do we?

A modern server can easily have terabytes of RAM and petabytes of NVMe, so what's stopping people from just using postgres?

A cluster of radishes is an anti-pattern.


1. Redis can handle a lot more connections, more quickly, than a database can. 2. It's still faster than a database, especially a database that's busy.

#2 is an interesting point. When you benchmark, the normal process is to just set up a database then run a shitload of queries against it. I don't think a lot of people put actual production load on the database then run the same set of queries against it...usually because you don't have a production load in the prototyping phase.

However, load does make a difference. It made more of a difference in the HDD era, but it still makes a difference today.

I mean, redis is a cache, and you do need to ensure that stuff works if you purge redis (ie: be sure the rebuild process works), etc, etc.

But just because it's old doesn't mean it's bad. OS/390 and AS/400 boxes are still out there doing their jobs.


A pretty small Redis server can handle 10k clients and saturate a 1Gbps NIC. You'd need a pretty heavy duty Postgres database and definitely need a connection pooler to come anywhere close.


I agree that redis can handle some query volumes and client counts that postgres can't.

But FWIW I can easily saturate a 10GBit ethernet link with primary key-lookup read-only queries, without the results being ridiculously wide or anything.

Because it didn't need any setup, I just used:

  SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;

I don't immediately have access to a faster network, but connecting via TCP to localhost and using some moderate pipelining (common in the redis world afaik), I get up to 19GB/s on my workstation.


> SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;

This selects every column (*) from every table (ObjectID is of type regclass)?


Sorry, I should have used something more standard - but it was what I had ready...

It just selects every column from a single table, pg_class. Which is where postgres stores information about relations that exist in the current database.


and those have reliable backup/restore infrastructure. Using redis as a cache is fine, just don't use it as your primary DB.


I'm confused about why there's a need to complicate something as seemingly straightforward as a KV store into a series of queues that can get all mixed up. I asked ChatGPT to explain it though, and it sounds like the justification for its existence is that it doesn't "block the event loop" while a request is "waiting for a response from Redis."

Last time I checked, Redis doesn't take that long to provide a response. And if your Redis servers actually are that overloaded that you're seeing latency in your requests, it seems like simple key-based sharding would allow horizontally scaling your Redis cluster.

Disclaimer: I am probably less smart than most people who work at OpenAI so I'm sure I'm missing some details. Also this is apparently a Python thing and I don't know it beyond surface familiarity.


Redis latency is around 1ms including network round trip for most operations. In a single threaded context, waiting on that would limit you to around 1000 operations per second. Redis clients improve throughput by doing pipelining, so a bunch of calls are batched up to minimize network roundtrips. This becomes more complicated in the context of redis-cluster, because calls targeting different keys are dispatched to different cache nodes and will complete in an unpredictable order, and additional client-side logic is needed to accumulate the responses and dispatch them back to the appropriate caller.
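For what it's worth, basic (non-cluster) pipelining in redis-py looks like this; the key names are made up for the example:

  import redis

  r = redis.Redis(host="localhost", port=6379)

  # Without a pipeline, each GET pays a full network round trip (~1ms each).
  # With one, the commands are buffered client-side and flushed in a single
  # batch; execute() returns the replies in the same order they were queued.
  pipe = r.pipeline()
  for i in range(100):
      pipe.get(f"conversation:{i}:title")  # hypothetical key naming
  titles = pipe.execute()  # one round trip, 100 replies, in request order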


I'm not familiar with the Python client specifically, but Redis clients generally multiplex concurrent requests onto a single connection per Redis server. That necessitates some queueing.


Yes! I have been spending the last couple months pulling out completely unnecessary redis caching from some of our internal web servers.

The only loss here is network latency, which is negligible when you're colocated in AWS.

Postgres's caches end up pulling a lot more weight too when the db isn't only being hit on a cache miss from the web server.


For caching the query results you get from your database. Also it's easier to spin up Redis and replicate it closer to your user than doing that with your main database. From my experience anyway.


I think the idea is that if your db can hold the working set in RAM and you're using a good db + prepared queries, you can just let it absorb the full workload because the act of fetching the data from the db is nearly as cheap as fetching it from redis.


> For caching the query results you get from your database.

This only makes sense if queries are computationally intensive. If you're fetching a single row by index you aren't winning much (or anything).


Of course? I'm not really sure what the original question actually is if you know that users benefit from caching the results of computationally intensive queries.


OpenAI uses redis to store pieces of text. Fetching pieces of text is not computationally intensive.


Most likely they have them in an rdbms, so it's more like joining a forum thread together. Not expensive, but why not prebuild and store it instead?


> This only makes sense if queries are computationally intensive.

Or if the link to your DB is higher latency than you're comfortable with.


Better concurrency (10k vs ~200 max connections for postgres). ~20x faster than Postgres at key-value read/write operations. (Mostly) single threaded, so atomicity is achieved without the synchronization overhead found in an RDBMS.

Thus, it's much cheaper to run at massive scale like OpenAI's for certain workloads, including KV caching

also:

- robust, flexible data structures and atomic APIs to manipulate them are available out-of-the box

- large and supportive community + tooling


My redis clusters are 10x more cost effective than my postgresdb in handling load.


For caching somewhat larger objects based on ETag?


People know it, that's all.


Do you want to ask why we use caching instead of main db in RAM? Or why we use redis instead of postgres for caching?


Nice writeup, it's fair in the content presented to us.

Yet I'm wondering why there is no check that the response actually belongs to the issued query.

The client issuing a query could pass a token and verify, upon receiving the answer, that the answer contains that token.

TBH, as a user of the client I would kind of expect the library to have this feature built-in, and if I'm starting to use the library to solve a problem, handling this edge case would be of somewhat low priority to me if the library didn't implement it, probably because I'm lazy.

I hope that the fix they offered to Redis Labs does contain a solution to this problem and that every one of us using this library will be able to profit from the effort put into resolving the issue.

It doesn't [0], so the burden is still on the developer using the library.

[0] https://github.com/redis/redis-py/commit/66a4d6b2a493dd3a20c...

---

Edit: Now I'm confused, this issue [1] was raised on March 17 and fixed on March 22, was this a regression? Or did OpenAI start using this library on March 19-20?

Interesing comment:

> drago-balto commented 3 hours ago

> Yep, that's the one, and the #2641 has not fixed it fully, as I already commented here: #2641 (comment)

> I am asking for this ticket to be re-oped, since I can still reproduce the problem in the latest 4.5.3. version

[1] https://github.com/redis/redis-py/issues/2624#issue-16293351...


That sounds more like a hindsight thing. In most systems authorization doesn't happen at the storage layer. Most queries fetch data by an identifier which is only assumed to be valid based on authorization that typically happens at the edge and then everything below relies on that result.

It's not the safest design but I wouldn't say the client should be expected to implement it. That security concern is at the application layer and the actual needs of the implementation can be wildly different depending on the application. You can imagine use cases for redis where this isn't even relevant, like if it's being used to store price data for stocks that update every 30 seconds. There's no private data involved there. It's out of scope for a storage client to implement.


I've long thought that it is often better to return a bit of extra data in internal API responses to validate that the response matches the request sent. That can be fairly simple, like parroting a request ID, or including some extra metadata (e.g. part of the request) to confirm the response is valid. It's not the most efficient, but it can save your bacon sometimes. Mixing up deployment stacks (e.g. thinking you are talking to staging but actually it's prod) and mixing user data are pretty scary, so any defense in depth seems useful.
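A minimal sketch of that idea, assuming a cache client with get()/set() like redis-py (the key/envelope scheme here is just illustrative):

  import json

  def cache_store(cache, key, value):
      # Store the key alongside the value so a reader can check it later.
      cache.set(key, json.dumps({"key": key, "value": value}))

  def cache_fetch(cache, key):
      raw = cache.get(key)
      if raw is None:
          return None  # ordinary cache miss
      entry = json.loads(raw)
      if entry.get("key") != key:
          # The reply doesn't match the request we made: treat it as a
          # corruption signal rather than silently serving someone else's data.
          raise RuntimeError(
              f"cache returned data for {entry.get('key')!r}, expected {key!r}"
          )
      return entry["value"]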


This is more a data leak than an outage…


It was down for quite a while, so I would call it an outage.


If you're subscribed to their status page, you'll know it's actually unusual for a day to go by without an outage alert from OpenAI. They don't usually write them up like this, but I guess this counts as a PII leak disclosure for them? For having raised billions of dollars, they are comically immature from a reliability and support perspective.


To be fair, they accidentally made a game-changing breakthrough that gained millions of users overnight, and I don't think they were ready for it.

Before chatgpt, most normal people had never heard of OpenAI. Their flagship product was basically an API that only programmers could make useful.

Team leaders at OpenAI have stated that they were not expecting the success, let alone the highest adoption rate for any product in history. In their minds, it was just a cleaned-up version of a 2-year old product. It was billed as a research preview.

So, all of a sudden you go from hiring mostly researchers because you only have to maintain an API and some mid-traffic web infra, to suddenly having the fastest growing web product in history and having to scale up as fast as you can. Keep in mind that they didn't get backing from Microsoft until January 23, 2023-- that was only 2 months ago.

I'd say we should cut them some slack.


These problems predate ChatGPT. Their API has been on the market for nearly 3 years. And they raised their first $1B in 2019. That's plenty of money and time to hire capable leadership.


Yeah but again, this is the fastest growing app in history and it uses way more compute than your standard webapp, and basically delivers all functionality from a single service that handles that load. I can see why there would be some growing pains.


Not sure why no one is talking about the serious data breach of personal and credit card information in this case. Meanwhile, everyone is very concerned about the compromise of GitHub's SSH key in another thread.


Does anyone else find it a bit off-putting how much emphasis they keep putting on "open source library"? I don't think I've read about this without the word open source appearing more than once in their own messaging about it. Why is it so important to emphasize that the library with the bug is open source?

The cynic in me wants to believe that it's a way of deflecting blame somehow, to make it seem like they did their due diligence but were thwarted by something outside of their control. I don't think it holds. If you use an open source library with no warranty, you are responsible (legally and otherwise) to ensure that it is sufficient. For example, if you break HIPAA compliance due to an open source library, it is still you who is responsible for that.

But of course, they're not claiming it's anyone else's fault anywhere explicitly, so it's uncharitable to just assume that's what they meant. Still, it rubs me the wrong way. I can't fight the feeling that it's a wink wink nudge nudge to give them more slack than they'd otherwise get. It feels like it's inviting you to just criticize redis-py and give them a break.

The open postmortem and whatnot is appreciated and everything, but sometimes it's important to be mindful of what you emphasize in your postmortems. People read things even if you don't write them, sometimes.


I noticed it too, but it doesn't necessarily bother me. Possibly they're just trying to say, "This incident may have made us look like we're complete amateurs who don't have any clue about security, but it wasn't like that."

Using someone else's library doesn't absolve you of responsibility, but failing to be vigilant at thoroughly vetting and testing external dependencies is a different kind of mistake than creating a terrible security bug yourself because your engineers don't know how to code or everyone is in too much of a rush to care about anything.


Yes, I agree with that sentiment, and I thought precisely the same. I know as an engineer that I would feel compelled to mention that it was an obscure bug in an open source library, if that was the case. Not to excuse myself of responsibility, but because I would feel so ashamed if I myself introduced such an obvious security flaw. I would still of course consider myself responsible for what happened.

A lot of the time when people make mistakes, they explain themselves because they are afraid of being perceived as completely stupid or incompetent for making that mistake, not to excuse themselves from taking responsibility, even though people frequently think that an excuse or explanation means you are trying to absolve yourself of what you did.

There's a huge difference to me between hitting an obscure bug like this and introducing that type of security issue because you couldn't logically consider it. The first one can be resolved in the future by introducing processes and making sure all open source libraries are from trusted sources, but the second one implies that you are fundamentally unable to think it through, and therefore probably also unable to improve on that.


Why?

The result for the end consumer is identical whether they have their PII leaked from "an external library" vs a vendor's own home-baked solution.

It's not really a different kind of mistake, it's exactly the same kind of mistake, because it is exactly the same mistake! This is talking the talk, and not walking the walk, when it comes to security.

Publishing a writeup that passes the buck to some (unnamed) overworked and underpaid open source maintainer is worse, not better!


Right.

The dev had such a big ego that they didn't want to say "I was dumb and left open a bug", so the dev says "I was so dumb that I left open a bug in software I was also too dumb or lazy to write or even read". It's not better.


I agree, it is a different kind of mistake; it is immensely worse than creating a terrible security bug yourself.

Outsourcing your development work without acceptance criteria and without validation of fitness for purpose is complete, abject engineering incompetence. Do you think bridge builders look at the rivets in the design and then just waltz over to Home Depot and pick out one that looks kind of like the right size? No, they have exact specifications and it is their job to source rivets that meet those specifications. They then either validate the rivets themselves or contract with a reputable organization that legally guarantees they meet the specifications, and it might be prudent to validate it again anyways just to be sure.

The fact that, in software, not validating your dependencies, i.e. the things your system depends on, is viewed as not so bad is a major reason why software security is such an utter joke and why everybody keeps making such utterly egregious security errors. If one of the worst engineering practices is viewed as normal and not so bad, it is no wonder the entire thing is utterly rotten.


I do not believe it's necessarily nefarious in nature, but maybe more specifically it feels kind of like they're implying that this is actually a valid escape hatch: "Sorry, we can't possibly audit this code because who audits all of their open source deps, amirite?"

But the truth is that actually, maybe that hints at a deeper problem. It was a direct dependency to their application code in a critical path. I mean, don't get me wrong, I don't think everyone can be expected to audit or fund auditing for every single line of code that they wind up running in production, and frankly even doing that might not be good enough to prevent most bugs anyways. Like clearly, every startup fully auditing the Linux kernel before using it to run some HTTP server is just not sustainable. But let's take it back a step: if the point of a postmortem is to analyze what went wrong to prevent it in the future, then this analysis has failed. It almost reads as "Bug in an open source project screwed us over, sorry. It will happen again." I realize that's not the most charitable reading, but the one takeaway I had is this: They don't actually know how to prevent this from happening again.

Open source software helps all of us by providing us a wealth of powerful libraries that we can use to build solutions, be we hobbyists, employees, entrepreneurs, etc. There are many wrinkles to the way this all works, including obviously discussions regarding sustainability, but I think there is more room for improvement to be had. Wouldn't it be nice if we periodically had actual security audits on even just the most popular libraries people use in their service code? Nobody in particular has an impetus to fund such a thing, but in a sense, everyone has an impetus to fund such work, and everyone stands to gain from it, too. Today it's not the norm, but perhaps it could become the norm some day in the future?

Still, in any case... I don't really mean to imply that they're being nefarious with it, but I do feel it comes off as at best a bit tacky.


I mean, if there were ever a company in a position to figure out a scalable way to audit OSS before usage, it'd be OpenAI, right?


It's only a 100x capped profit multi billion dollar company. How could they afford to read the code they ship?


They really skirt around the fact that they apparently introduced a bug which quite consistently initiated redis requests and terminated the connection before receiving the result.


Doesn't bother me either. All the car companies issue recalls regularly, sometimes an issue only shows up when the system hits capacity or you run into an edge case.


The gaping hole in this write-up goes something like:

"In order to prevent a bug like this from happening in the future, we have stepped up our review process for external dependencies. In addition, we are conducting audits around code that involves sensitive information."

Of course, we all know what actually happened here:

- we did no auditing;

- because our audit process consists of "blame someone else when our consumers are harmed";

- because we would rather not waste dev time on making sure our consumers are not harmed

If you want to know why no software "engineering" is happening here, this is your answer. Can you imagine if a bridge collapsed, and the builder of the bridge said, "iunno, it's the truck's fault for driving over the bridge."


Are you confident that an audit would have uncovered this bug? I’d be surprised if audits are effective at finding subtle bugs and race conditions, but I could be wrong.


Depends on the type of audit. Subtle bugs are often uncovered by fuzzing, for example, but a race condition might not be found without substantial load.


When a bridge collapses, everyone does try to shift blame onto someone else :)


I think you’re reading too much into it. Being an open source library is relevant because it means it’s third party and doesn’t come with a support agreement, so fixing a bug is a somewhat different process than if it were in your own code or from a proprietary vendor.

Yes, it’s technically up to you to vet all your dependencies, but in practice, often it doesn’t happen, people make assumptions that the code works, and that’s relevant too.


Also, vetting a dependency != auditing and testing every line of code to find all possible bugs.

If this bug was an open issue in the project's repo, that might be concerning and indicate that proper vetting wasn't done. Ditto if the project is old and unmaintained, doesn't have tests, etc. But if they were the first to trigger the bug and it only occurs under heavy load in production conditions, well, running into some of those occasionally is inevitable. The alternative is not using any dependencies, in which case you'd just be introducing these bugs yourself instead. Even with very thorough testing and QA, you're never going to perfectly mimic high load production conditions.


Open source can be fixed as if it was your own code. (And that is a strong tenet of free/open source software.)

Not only do most open/free source libraries come without support agreements: they come with the broadest possible limitation of warranties. (As they should)

So the company, knowing that what they are using comes without any warranty either of quality or fitness to the use-case, have a very strong burden of due diligence / vetting.


> in practice, often it doesn’t happen, people make assumptions that the code works

True, but that's an inexcusable practice and always has been. We as an industry need to stop accepting it.


What do you mean by "stop accepting it?"

All of us rely on millions of lines of code that we have not personally audited every single day. Have you audited every framework you use? Your kernel? Drivers? Your compiler? Your CPU microcode? Your bootrom? The firmware in every gizmo you own?

If "Reflections on Trusting Trust" has taught us anything, it's turtles all the way down. At some point, you have to either trust something, or abandon all hope and trust nothing.


> Have you audited every framework you use? Your compiler? Your CPU microcode? Your bootrom?

Of course not. I exclude the CPU microcode, bootrom, and the like from the discussion because that's not part of the product being shipped.

But it's also true that I don't do a deep dive analyzing every library I use, etc. I'm not saying that we should have to.

What I'm saying is that when a bug pops up, that's on us as developers even when the bug is in a library, the compiler, etc. A lot of developers seem to think that just because the bug was in code they didn't personally write, that means that their hands are clean.

That's just not a viable stance to take. The bug should have been caught in testing, after all.

If your car breaks down because of a design failure in a component the auto manufacturer bought from another supplier, you'll still (rightfully) hold the auto manufacturer responsible.


> when a bug pops up

That’s reacting to a bug you know about. Do you mean to talk about how developers aren’t good enough at reacting to bugs found in third party libraries, or how they should do more prevention?

In this case, it seems like OpenAI reacted fairly appropriately, though perhaps they could have caught it sooner since people reported it privately.

“Holding someone responsible” is somewhat ambiguous about what you expect. It seems reasonable that a car manufacturer should be prepared to do a recall and to pay damages without saying that they should be perfect and recalls should never happen.


> Do you mean to talk about how developers aren’t good enough at reacting to bugs found in third party libraries, or how they should do more prevention?

My point was neither of these. My point is very simple: the developers of a product are responsible for how that product behaves.

I'm not saying developers have to be perfect, I'm just saying that there appears to be a tendency, when something goes wrong because of external code, to deflect blame and responsibility away from them and onto the external code.

I think this is an unseemly thing. If I ship a product and it malfunctions, that's on me. The customer will rightly blame me, and it's up to me to fix the problem.

Whether the bug was in code I wrote or in a library I used isn't relevant to that point.


I've also noticed it, and I can't help but interpret it as their way of shifting blame. Which is irresponsible. It's their product, and they need to take accountability for the bug occurring.

It's a serious bug, but in the grand scheme of things not earth-shattering, and not something that I think would discourage usage of their product. But their treatment of the bug causes more concern than the bug itself. They are shifting the blame onto the library rather than onto the process by which that library made it into their product, and I don't understand how they can't see how that reflects poorly on them as an AI company.

What I find so confusing is that, at the end of the day, OpenAI's core product is a good process for creating value out of a massive amount of data and building a good API on top of it. The open-source library is effectively something they processed into their product and built an API on top of. So it creates (for me) some doubt about how they will react when faced with similar challenges to their core product. How will they behave when the data they consume impacts their product negatively? Based on this limited experience, they'll shift the blame to the data, not their process, and move on.

It seems likely that this is only the beginning of OpenAI having a large customer base, with a high impact on many products. This is a disappointing result on their first test of how they'll manage issues and bugs in their products.


I don't find it over-emphasized. Many in the Twitter-sphere are acting as if they aren't being appreciative of open source software and I don't see it that way.

The technical root cause was in the open source library. There's a patch available and more likely than not OpenAI will continue to use the library.

Being overly sensitive to blame would be distracting to the technical issue at hand. It's great they are posting this post-mortem to raise awareness that the libraries you use can have bugs and to consider that risk when building systems.


A root cause analysis would likely also flag the lack of threat modeling / security evaluation of their dependencies.

It would likely also question the lack of resources allocated to these open-source projects by companies that profit, in part, from using them.


I half agree, but I also half-sympathize with them, because it really wasn't their fault -- it was a quite-bad bug in a very fundamental library.

Bugs happen, though. Especially in Python.


Instead of spending engineering time, they used a free and open-source library to do less work.

The license they agreed to in order to use this library has this in capital letters. [THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND].

After agreeing to this license and using the library for free, they charged people money and sold them a service. And when that library they got for free, which they read and agreed that had no warranty of any kind, had a software bug, they wrote a blog post and blamed the outage of their paid service on this free library.

This is not another open-source project, or a small business. This is a company that got billions of dollars in investment, and a lot of income by selling services to businesses and individuals. They don't get to use free, no-warranty code written by others to save their own money, and then blame it and complain about it loudly for bugs.


> it really wasn't their fault -- it was a quite-bad bug in a very fundamental library.

It's still their fault. When you ship code, you are responsible for how that code behaves regardless of where the code came from.


Only for some incredibly broad definition of fault that almost no one uses.

How many people make sure all of the open source libraries they're using are bug free?

Anyone besides maybe NASA?


I've never cared per se that a library was bug free but I've put a lot of effort/$ into making sure the features that used the libraries in my product were bug free (with the amount of effort depending on the sensitivity of the feature, data, etc).

Usually "fix the original library" wasn't as easy or immediate a fix as "hack around it", which is sad for the overall OSS ecosystem, but it's still the responsibility of the person releasing the product.

Unfortunately these sorts of bugs are wildly difficult to predict. Yet it's also a wildly common architecture. That's what's sad for all of us as engineers as a whole. But "caching credit card details and home addresses", for instance, is... particularly dicey. That's very sensitive, and you're tossing it into more DBs, without good access control restrictions?


> Only for some incredibly broad definition of fault that almost no one uses.

It's a definition most laypeople use. It's developers who tend to use a very narrow definition.

I don't think it should be controversial to say that when you ship a product, you are responsible for how that product behaves.


Anywhere you handle payment-related or other PII data, transitive dependencies, framework and language choices, memory sharing, and other risks have to be taken into account as something that you, as the party developing and operating the service, are solely responsible for.


Anyone who has to pay out of their own pocket when things go wrong: consulting warranties, liability for security exploits, ...


There were several reports of this issue in February/early March on the r/ChatGPT subreddit - OpenAI could have known about it if they had listened to the community.

Alternatively, they knew about it, and didn't fix the bug until it bit them


> Especially in Python.

as opposed to...?


Go, for one.

In my experience errors are more common (for both cultural and technological reasons) in Python than in Go.

I would guess something similar applies to Rust, though I don't have personal experience.

There's wide variation in C, but with careful discrimination, you can find very high-quality libraries or software (redis itself being an excellent example).

I don't have rigorous data to back this up, but I'm pretty convinced it's true, based on my own experience.


As opposed to not in Python.


… like JavaScript? Bash? C? PHP?

Certainly none of those are widely used and have a reputation for making it easy to keep the gun aimed squarely at the foot.


Those would be roughly similar. The main difference, I'd guess, is between dynamically typed interpreted languages and statically typed compiled ones. At least I think I make fewer mistakes when the compiler literally tells me what's wrong before I even run the thing. It's awful and slow to develop that way, but it is more reliable when that's a requirement.

So compared to ones like Kotlin or Rust.


it really was their fault. they chose to ship the bug. it doesn't matter in the least that someone else previously published the code under a license with no warranty whatsoever.


I was upvoting you, but then reading

> Especially in Python.

made me unvote.


This doesn't come across this way to me at all. They just described what happened. Do you expect them to jump in front of a bus for the library they're using, and beg for forgiveness for not ensuring the widely used libraries they're leveraging are bug free?

There are very few companies that couldn't get caught by this type of bug.


Basically agree -- feels off-putting, but not technically a wrong detail to add. An additional reason it rubs me the wrong way, however, is that I believe open-source software code is especially critical to ChatGPT family's capabilities. Not just for code-related queries, but for everything! (e.g. see this "lineage-tracing" blog post: https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...)

Thus, I honestly think firms operating generative AI should be walking on eggshells to avoid placing blame on "open-source". Rather, they really should be going out of their way to channel as much positive energy towards it as possible.

Still, I agree the charitable interpretation is that this is just purely descriptive.


The library is provided by the Redis team themselves, and the bug is awful [1]. I know it's not Redis's fault, but this bug could hit anyone: connections may be left in a dirty state where they return data from the previous request to the following one.

[1] https://github.com/redis/redis-py/issues/2624
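To make the failure mode concrete, here is a toy model of it (not redis-py's actual internals, just a sketch of the general pattern): a request on a shared connection is cancelled after the command goes out but before the reply is read, so the leftover reply gets served to the next caller.

    import asyncio

    class FakeConnection:
        """Toy stand-in for a pooled connection: replies queue up in send order."""
        def __init__(self):
            self._replies = asyncio.Queue()

        async def send(self, request):
            async def server():
                await asyncio.sleep(0.05)                    # simulated server latency
                await self._replies.put(f"reply-to:{request}")
            asyncio.create_task(server())

        async def recv(self):
            return await self._replies.get()

    async def do_request(conn, request):
        await conn.send(request)
        return await conn.recv()                             # cancelled here => reply stays buffered

    async def main():
        conn = FakeConnection()                              # the shared, "pooled" connection

        # User A's request is cancelled after send() but before recv():
        task_a = asyncio.create_task(do_request(conn, "user-A"))
        await asyncio.sleep(0.01)
        task_a.cancel()

        # User B reuses the same connection and receives A's leftover reply.
        print(await do_request(conn, "user-B"))              # prints: reply-to:user-A

        await asyncio.sleep(0.1)                             # let the stray reply land before shutdown

    asyncio.run(main())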


I think you are reading a bit between the lines. I didn't feel they were blaming the library so much as stating that the bug happened because of an issue in the library. Maybe they could have sugarcoated it under 10 layers of corporate jargon, but I'd rather take this over that.


The emphasis could also have been meant to educate folks using this combination to check their setup.

Though a version reference should also have been included in the postmortem, as general guidance to their readers; at least a quick Google search leads you to it.

https://github.com/redis/redis-py/releases/tag/v4.5.3

For anyone reading this and using a combination of asyncio and redis-py, please bump your versions.

I've encountered similar issues in the past with asyncio Python and Postgres when trying to pool connections. They're really not easy to debug either.


Not surprising from a company that calls itself OpenAI. The "open source" keyword stuffing is so people associate the "open" in OpenAI with open source. Psyops, I mean marketing, 101.


I think you’re overreacting. What bothered me is that they didn’t link to the actual bug or provide a reference ID.


I don't know. To me it's simply an explanation of what has happened. I think its exactly what I would have written if I was in their position. And show me the one company that has audited all source code of all used open source projects, at least in a way that is able to rule out complex bugs like this. I have once found a memory corruption bug in Berkeley DB wrecking our huge production database, which I would have never found in any pre-emptive source code audit, however detailed.

Edit: On second thought, maybe they could have just written "external library" instead of "open source library".


> The cynic in me wants to believe that it's a way of deflecting blame somehow

That's how it reads to me as well.

Of course, it doesn't deflect blame at all. Any time you include code in your project, no matter where the code came from, you are responsible for the behavior of that code.


Personally, I think it was partially a virtue signal to show that they use open source software and collaborate with the maintainers.


Was the postmortem generated by ChatGPT?


If the FTC had teeth and good judgement, they'd force OpenAI to rename themselves.


This is a common bug with a lot of software. For example some HTTP clients that do pooling won’t invalidate the connection after timing out waiting for the response.
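A defensive pattern that avoids this (a minimal sketch, not any particular client's API): treat any connection whose request timed out or errored as tainted, close it, and put a fresh one back in the pool.

    import queue

    class ConnectionPool:
        """Sketch: a connection that timed out mid-request must be discarded,
        never returned to the pool, because its late reply would otherwise be
        read by the next borrower."""

        def __init__(self, create_conn, size=4):
            self._create = create_conn             # e.g. lambda: socket.create_connection(addr)
            self._pool = queue.LifoQueue()
            for _ in range(size):
                self._pool.put(create_conn())

        def request(self, send_and_recv):
            conn = self._pool.get()
            try:
                result = send_and_recv(conn)       # may raise e.g. socket.timeout
            except Exception:
                conn.close()                       # the old reply may still be in flight
                self._pool.put(self._create())     # replace with a fresh connection
                raise
            self._pool.put(conn)                   # only clean connections are reused
            return result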


I suspect the fatal error OpenAI saw was an "SSL error: decryption failed or bad mac"? Or if SSL were disabled, they likely would see the sort of parsing error vaguely described as an "unrecoverable server error" due to streams of data from one request being swapped with another, with incorrect data structures, bad alignment, etc. I can see how if the stars aligned and SSL were disabled this data race would manifest in viewing another user's request, so long as the sockets were receiving similar responses when they were swapped. I suspect the issue is deeper than the bug recently fixed in just redis-asyncio.

Libraries written prior to asyncio/green threads often have that functionality enabled by means of monkey patches or shims that juggle file handles or other shared state, and there are data races. When SSL is used hopefully those data races reading from sockets result in MAC errors and the connection is terminated.

It's easy to imagine the same mistake happening one level up, in message passing code that manages queues or other shared data structures.

Search for "python django OR celery OR redis OR postgres OR psycopg2 decryption failed or bad mac" and you'll see the scale of the issue, it's fairly widespread.

I don't have a high degree of confidence in this ecosystem. I've written about this before on Hacker News[1], and I'm not confident in the handling of shared data structures in Python libraries. I don't think I can blame any maintainers here, there's a huge number of people asking for these concurrency features and the way that it's often implemented - monkey patching especially - makes it extremely difficult to do correctly.

[1] https://news.ycombinator.com/item?id=31065472


> Actions we’ve taken
>
> - [testing, assertions, log alerting, debugging]

I reported this bug in the forums (no response), because I couldn't find an official way to report bugs - at least it wasn't documented anywhere that I could find. The actions they've taken fix that bug, but not the next one. OpenAI should take an action to improve their bug reporting channels, so the next bug gets found more quickly.

https://community.openai.com/t/bug-incorrect-chatgpt-chat-se...

Other messages indicate issues with bug reporting too:

* https://news.ycombinator.com/item?id=35291943
* https://news.ycombinator.com/item?id=35295747


The key part:

If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection.
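Conceptually (a hedged sketch, not redis-py's real code; the method names on `conn` are made up), the fix is to treat cancellation between "command sent" and "response read" as poisoning the connection:

    import asyncio

    async def execute(conn, *command):
        # `conn` is assumed to expose send_command(), read_response() and
        # disconnect() coroutines (hypothetical names for this sketch).
        try:
            await conn.send_command(*command)
            return await conn.read_response()
        except asyncio.CancelledError:
            # The reply for this command is still owed on the wire; if this
            # connection went back to the pool, the next request would read it.
            await conn.disconnect()
            raise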



Using ChatGPT to summarize:

This section summarizes the issue with the ChatGPT service which caused some users to be able to see another user's chat history and payment information. OpenAI took the ChatGPT service offline and published technical details of the problem. The bug was found in the Redis client open-source library and has been patched. OpenAI has reached out to affected users and improved their systems to prevent this from happening again. They have also expressed their commitment to protecting user privacy and their dedication to the open-source community.


Interesting. I pasted in the outage report, and then asked ChatGPT to give me a single sentence summary and it gave me an answer way longer than a sentence. I pointed out that it wasn't a single sentence and it gave me:

> ChatGPT experienced an outage due to a bug in an open-source library that allowed some users to see titles from another active user's chat history and payment-related information of 1.2% of ChatGPT Plus subscribers was exposed; the bug has since been patched and actions have been taken to prevent future incidents.

I then asked it why there was the outage and it spit out:

> The exact cause of the Facebook outage is unknown, but it is believed to be related to a configuration change in the company's routers.

It's likely because I ran out of tokens because the OpenAI outage report is long. Pasting in the text of the outage report, and then re-asking about why, it was able to give a much better answer:

> There was an outage due to a bug in an open-source library that allowed some users to see titles from another active user's chat history and also unintentionally exposed payment-related information of 1.2% of ChatGPT Plus subscribers who were active during a specific nine-hour window.

Querying it further, again having to repeat the whole OpenAI outage report, and asking it a few different ways I eventually managed to get this succinct answer:

> The bug was caused by the redis-py library's shared pool of connections becoming corrupted and returning cached data belonging to another user when a request was cancelled before the corresponding response was received, due to a spike in Redis request cancellations caused by a server change on March 20.

It did take me more than a few minutes to get to there, so just actually reading the report would have been faster, and I ended up having to read the report to verify that answer was correct and not a hallucination anyway, so our jobs are safe for now.


Try with GPT 4. The token window is quadruple.


I decided to see how Bing Chat would do on this. I opened the page in Edge and I was given a summary automatically when I clicked the Discover button:

---

Welcome back! Here are some takeaways from this page.

> ChatGPT was offline due to a bug in redis-py that caused some users to see other users’ chat history and payment information.

> The bug was patched and the service was restored, except for a few hours of chat history.

> The bug affected 1.2% of ChatGPT Plus subscribers who were active during a nine-hour window on March 20.

> Full credit card numbers were not exposed at any time.

> OpenAI apologized to the users and the ChatGPT community and took steps to prevent such incidents in the future.

---


Funnily enough, I've had a very similar bug occur in an entirely separate Redis library. It was a pretty troubling failure mode to suddenly start getting back unrelated data.


It boggles my mind how they're not absolutely checking the user & conversation id for EVERY message in the queue given the possible sensitivity of the requests. How is this even remotely acceptable?

In the Reddit post that first surfaced this, the user saw conversations related to politics in China and other topics sensitive to the CCP.

This can absolutely get people hurt, and they absolutely must take it seriously.
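For what it's worth, the belt-and-braces check being asked for here is cheap: store the owner's ID inside the cached value and verify it against the requesting user on the way out. A rough sketch (the cache object and key scheme are made up):

    import json

    def cache_user_payload(cache, user_id, key, payload):
        # `cache` is anything with get/set, e.g. a Redis client.
        record = {"owner": user_id, "data": payload}
        cache.set(f"user:{user_id}:{key}", json.dumps(record))

    def read_user_payload(cache, user_id, key):
        raw = cache.get(f"user:{user_id}:{key}")
        if raw is None:
            return None
        record = json.loads(raw)
        if record["owner"] != user_id:
            # A mixed-up response (corrupted connection, wrong key, ...) fails
            # closed instead of leaking someone else's data.
            raise RuntimeError("cache returned data for a different user")
        return record["data"]

It wouldn't prevent the connection-level mix-up, but it would turn "show another user's data" into a server error.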


It doesn’t boggle my mind at all. Session data appears and is used to render the page. Do you verify the actual cookie every time and go back to the DB to see what user it points to?

No, everyone assumes their session object is instantiated with the right values at that level of the code.


That is a pretty good disclosure that creates trust.


Is it only me who hasn't seen any chat history since yesterday? Chat generally doesn't work either: you can type a message, but clicking the button or hitting Enter / Ctrl+Enter has no effect.

In the chat history there is a "retry" button, but if you click it and inspect the result, you see "internal server error".


I've had the exact same issue since last week. It never fixed itself, if that's what you're waiting for. I had to resort to creating a new account with a different email to get access again. I contacted support but have yet to hear back. Not sure if it was the cause, but the issue started when I tried to buy Plus and the payment failed.


I found it was an issue with Firefox and its default privacy settings. Clicking the shield icon in the Firefox address bar and turning this privacy guard off made chat work again.


Sounds like they need to use a Lua script to ensure that the pop and push operations remain atomic within the Redis instance?

For example, a combined spop + sadd:

    local src_list = ARGV[1]
    local dst_list = ARGV[2]
    local value = redis.call('spop', src_list)
    if value then -- avoid pushing nils
        redis.call('sadd', dst_list, value)
    end
    return value
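A script like that can be loaded and called from redis-py roughly like this (a sketch; the set names are made up, and Redis actually recommends passing key names via KEYS rather than ARGV):

    import redis

    r = redis.Redis()                    # assumes a local Redis
    spopsadd = r.register_script("""
        local src_list = ARGV[1]
        local dst_list = ARGV[2]
        local value = redis.call('spop', src_list)
        if value then -- avoid pushing nils
            redis.call('sadd', dst_list, value)
        end
        return value
    """)

    # Atomically move one member from the "pending" set to the "in_progress" set:
    moved = spopsadd(args=["pending", "in_progress"])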


First GitHub, then OpenAI. Two of Microsoft's finest(!) (majority owned and acquired, respectively) companies at the top of HN announcing serious security incidents.

It's quite unsettling to see this leak of highly sensitive information and a private key exposure as well. Doesn't look good and seems like they don't take security seriously.


In the case of OpenAI, the product is more of a research demo that had to be drastically scaled up, though. From an operations point of view it’s more like a startup.


Nobody cares, and this is yet another case study of HN being out of touch with reality.


> "Nobody cares"

Yet another case study of absolutism, which can be simply dismissed.

People paying for ChatGPT care when it goes down and when their details and chats get leaked - and that is certainly true outside of HN. Same with GitHub. The two have ~100M users between them.

That's the reality.


I'm paying for ChatGPT and I don't care about this any more than the many, many other services I use that have at some point had an embarrassing security issue.


I'm paying and I don't care. If I wrote perfect bug-free code, led a perfect life, and lived in a perfect world, I'd be upset.

But, I know that shit happens and the reliability meter should be flexible for different things (bridges, heart surgery and chat agent).

If I trained my brain to bitch, whine, and moan about everything, I wouldn't have the resources to care about the really important things.


> I'm paying and I don't care.

"Yeah". Two people out of the majority of commenters not caring when it was unavailable and having their chat history leaked and as well as GitHub going down again. [0] Surely that alone is "Nobody" caring. /s

This is the whole problem with you being absolutist. Seems to have aged like milk.

[0] https://news.ycombinator.com/item?id=35295216


Well - they have had more bugs and will have more bugs to worry about.

https://twitter.com/naglinagli/status/1639343866313601024


I wonder how much time passed between the first cases of corruption leading to exceptions (which they ignored as "eh, not great, not terrible, we'll look at it later") and users reporting seeing other users' data?


Am I the only one terribly bored by the onslaught of trivial AI news these last months?

Every fart some AI-related person makes becomes huge news, followed by tens of random blog posts, all submitted to HN.


At least it isn't about the Rust language this time grumbles


For some reason I liked reading about Rust (or any other technology) a lot more than about AI.

Part of it is that the average engineer could understand and grok what those articles were talking about, and I could appreciate, relate to, and, if applicable, criticize them.

The AI news just seems to swing between hype and doomsday prophecies, and little discussion about the technical aspects of it.

Obviously OpenAI choosing to keep it closed source makes any in-depth discussion close to impossible, but also some of this is so beyond the capabilities of an average engineer with a laptop. It can be frustrating.


Because Rust hasn't conquered AI the way it conquered crypto.

But we will see AI stuff rewritten in Rust quite soon.


I bet against it.


This reminds me of a comment I made 1.5 months ago [0]:

I was logging in during heavy load, and after typing the question I started getting responses to questions which I didn't ask.

gdb answered on that comment "these are not actually messages from other users, but instead the model generating something ~random due to hitting a bug on our backend where, rather than submitting your question, we submitted an empty query to the model."

I wonder if it was the same redis-py issue back then, but just at another point in the backend. His answer didn't really convince me back then.

[0] https://news.ycombinator.com/item?id=34614796&p=2#34615875


I'd put money on it.



Unbelievable. How many times have you seen AWS explain an outage (or a PII leak like this one) as an open-source library bug? Have they asked the 5 whys? Why was the bug deployed into production? Why was there not enough testing before deployment? Why was integration testing only done with low concurrency? Why is a standard release-testing procedure missing? Why are there no synthetic-traffic tests and gates before rolling out to 100% in production?


I don’t fault them for having an outage. It’s hard to think of any other recent site with a comparable popularity spike.


They were/are storing payment data in redis? LOL!


The postmortem doesn’t say that. It just says they were caching “user information”. Maybe that includes a Stripe customer or subscription ID that they look up before sending an email, for example.


Yeah, probably the session ID - and when the wrong session ID is returned, other operations like "get user details" would pull that user's data from relational storage.


Yes they were. It says they stored billing address and some pieces of credit card data.


Chance this was not 90% written by ChatGPT is 10%


Did they use ChatGPT to fix the bug?


That sounds like the kind of bug that could be prevented by modeling with TLA+.


maybe they've just scrolled over issue lists of popular tech stacks and cherry-picked the most compelling one to bury the dirt.


It's interesting (read: wrong) for an AI company to bother writing the user interface for their web application.

This was a failure of integration testing and defensive design, whether the component was open-source or not. There's no reason to believe that an AI company would have the diligence and experience to do the grunt work of hardening a site.

But management obviously understood the level and character of interest. Actual users include probably 10,000 curiosity seekers for every actual AI researcher, with 1,000 of those being commercial prospects -- people who might buy their service.

This is a clear sign that the managers who've made technical breakthroughs in AI are not capable even of deploying the service at scale -- much less of managing the societal consequences of AI.

The difficulty with the board getting adults in the room is that leaders today give the appearance of humility and cooperation, with transparent disclosures and incorporation of influencers into advisory committees. The leaders may believe their own abilities because their underlings don't challenge them. So there's no obvious domineering friction, but the risk is still there, because of inability to manage.

Delegation is the key to scaling, code and organizations. "Know thyself" is about knowing your limits, and having the humility to get help instead of basking in the puffery of being in control.

This isn't a PR problem. It's the Achilles' heel of capitalism, and the capitalists on OpenAI's board should nip this incipient Musk in the bud or risk losing 2-3 orders of magnitude of return on their investment.


It sounds like their Redis key was not unique enough and yada yada yada it returned sensitive info to the wrong people.


Did you read the article? That’s not at all what happened.



