The disclosure provides valuable information, but the introduction suggests someone else or «open-source» is to blame:
>We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history.
Blaming an open-source library for a fault in a closed-source product is simply unfair. The MIT-licensed dependency explicitly comes without any warranties. After all, the bug went unnoticed until ChatGPT put it under pressure, and it was ChatGPT that failed to rule out the bug in their release QA.
They are not suing anybody, and just because it’s open source doesn’t mean talking about bugs is taboo. It reads to me like raw information, which is very good. They said they contacted the authors and are helping with the upstream bug fix. If anything, it’s a great example of how to deal with this kind of problem.
I agree it's a good example – if not great. But considering OpenAI itself is as closed source as can be, mentioning open-source software as the cause of their outage in the very first line of their statement seems somewhat out of place. I don't think it's a coincidence that the open-source mention comes in the opening line while the acknowledgement of the Redis team is at the very end. Many outside of software engineering might read this as «open source caused OpenAI to crash».
Ironically, these sorts of insane takes all over the thread are causing more harm to the image of open-source developers than OpenAI or anyone else could ever do with their press release. It's important to mention which open-source library caused a bug, because the same bug could be lurking in many other applications that use it. That's basically one of the main points of open source.
It’s a great opening sentence. Short, factual, explanatory. You would have to be hypersensitive to object.
As far as I can see, the whole thing is a reaction to OpenAI not being sufficiently open - which is a fine argument to have with them, but shouldn’t cloud judgement on the quality of this write-up
i'm not the person you replied to, but i'm guessing maybe they'd have preferred:
"We took ChatGPT offline earlier this week due to a bug in an external library which allowed some users to see titles from another active user’s chat history."
They're not blaming, legally speaking. But they're communicating that open source software caused their outage. OpenAI chose to use software that explicitly came without warranties, and are legally solely responsible for problems caused by open-source libraries they choose to include in their product.
I understand where you're coming from. But there is certainly an audience for this kind of post that would like to read about the objective source of the bug without connecting it to some expectations around responsibility.
I read this post, and I don't see it assigning blame to anyone other than themselves. See this bit copied from the post:
> Everyone at OpenAI is committed to protecting our users’ privacy and keeping their data safe. It’s a responsibility we take incredibly seriously. Unfortunately, this week we fell short of that commitment, and of our users’ expectations. We apologize again to our users and to the entire ChatGPT community and will work diligently to rebuild trust.
The questionable part is saying "open source". They refrained from naming the library, which is good; but then why bring up that one fact about it right after?
"there was a bug in an outside library that we used" does not mention open source but has the same meaning, and would probably provoke the same complaints ("they're trying to blame somebody else for their problem").
In that case, though, they could say "look, we used a popular open source library because we had more faith that it would be better tested and correct" which would be a compliment to open source. That's essentially the information that we have.
In today's world, who builds anything from anything close to scratch? Embedded developers probably come closest. It's no worse or better to say "there was a bug that our release uncovered." If they continue to announce as many details as possible, we as the audience can develop a sense of whether they're creating bugs or just uncovering bugs we're glad to know about.
I think it's fair to mention the bug originated in redis-py, but I don't find it relevant at all to mention «open-source» in the opening line of the public statement about the outage. Or «outside library» for that matter. It was ChatGPT release QA that failed, and then they failed to admit it.
"given enough eyeballs, all bugs are shallow" is a compliment to open source, and telling your users that you use open source is a positive reassurance on other dimensions as well; it's relevant, and the truth is always relevant; there's no harm in mentioning it.
"Release early, release often" and "move fast and break things" are respected ideas that diminish the importance of QA. The more effective and efficient any QA you do is, yes, very valuable, and QA that finds broken things, all the better. But don't move slow is an OK compromise.
If someone asked ChatGPT to generate some code, then copy and pasted it mindlessly into their project, I wonder what OpenAI would think about the claim:
"The bug was a result of faulty code produced by ChatGPT."
Using an open source library is like copy and pasting code into your project. You assume responsibility for any pitfalls.
Just like any author of MIT-licensed open source software is clear about their license, which clearly states they are not responsible in any way whatsoever for the shortcomings of their licensees.
I don't recall anyone advocating omitting the name log4j when that security bug dropped in 2021. How is that situation materially different from this? redis-py fixed the behavior so it's definitely not the case of working-as-intended.
To be clear, I would be up in arms if OpenAI was trying to hold redis potentially legally responsible.
I understand and agree that framing open source as the issue is ridiculous.
They may be redeemed though through the likely scenario that they didn't write this - GPT probably did. Perhaps their GPT doesn't like open source. :D
I am curious how you would have phrased it. Is there a way to accurately convey the root cause of the problem (the redis-py library) without sounding like you're trying to shift blame?
Did you read the technical details? This was a bug in the way Redis-py handled requests that were cancelled before a response was returned to the user.
So? You own the stack. All that had to be said was there was a bug in the way requests were handled in a library. The open source part is not even needed.
I think the mention of exactly where is useful to this audience. Other people using the same open-source library might realize --- oh, we've run into that bug but never realized it. Low probability, sure, but nonetheless.
OpenAI could have simply named the library up front instead of stressing that it is open source. This would have alerted people who use the library while at the same time avoiding the suggestion that it being open source was part of the problem.
Yes, and this is still the sole responsibility of OpenAI. They chose to include redis-py without warranties and are responsible for their own QA on any product built on it.
I reject the premise that this is a bug. The author picked this implementation, with whatever side effects it has. The author might want some requests to cancel in some situations. It's up to OpenAI to decide if the implementation works for their setup.
Why did it take them 9 hours to notice? The problem was immediately obvious to anyone who used the web interface, as evidenced by the many threads on Reddit and HN.
> between 1 a.m. and 10 a.m. Pacific time.
Oh... so it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation? Given the size of their funding, and the number of users they have, there is really no excuse not to at least have some basic monitoring system in place for this (although it's true that, ironically, this particular class of bug is difficult to detect in a monitoring system that doesn't explicitly check for it, despite being immediately obvious to a human observer).
Perhaps they should consider opening an office in Europe, or hiring remotely, at least for security roles. Or maybe they could have GPT-4 keep an eye on the site!
Staffing an actual 24x7 rotation of SREs costs about a million dollars a year in base salary as a floor and there are few SREs for hire. A metrics-based monitor probably would have triggered on the increased error rate but it wouldn’t have been immediately obvious that there was also a leaking cache. The most plausible way to detect the problem from the user perspective would be a synthetic test running some affected workflow, built to check that the data coming back matches specific, expected strings (not just well-formed). All possible but none of this sounds easy to me. Absolutely none of this is plausible when your startup business is at the top of the news cycle every single day for the past several months.
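To make the synthetic-test idea concrete, here is a minimal sketch of such a canary check in Python. Everything in it is an assumption for illustration: the endpoint, the token, and the canary marker are placeholders, not OpenAI's actual API.

    # Hedged sketch of a synthetic canary check; endpoint, token, and marker
    # strings are hypothetical placeholders, not OpenAI's real API.
    import requests

    ENDPOINT = "https://example.invalid/api/conversations"   # hypothetical
    TOKEN = "canary-account-token"                            # hypothetical
    CANARY_MARKER = "synthetic-canary-7f3a"                   # title seeded by our own test account

    def run_canary() -> None:
        resp = requests.get(ENDPOINT,
                            headers={"Authorization": f"Bearer {TOKEN}"},
                            timeout=10)
        resp.raise_for_status()
        titles = [c["title"] for c in resp.json()]
        # Check for specific expected strings, not just a well-formed payload.
        assert CANARY_MARKER in titles, "expected canary title missing"
        assert all(t.startswith("synthetic-canary") for t in titles), \
            "unexpected titles returned; possible cross-user leak, page the on-call"

    if __name__ == "__main__":
        run_canary()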
How do you figure? If you mean there are few SREs with several years of experience, you might be right. SRE is a fairly new title, so that's not too surprising.
However, my experience with a recent job search is that most companies aren't hiring SREs right now because they consider reliability a luxury. In fact, I was in search of a new SRE position because I was laid off for that very reason.
You don't even need an SRE to have an on-call rotation; you could ping a software engineer who could at least recognize the problem and either push a temporary fix, or try to wake someone else to put a mitigation in place (e.g. disabling the history API, which is what they eventually did).
However, I think the GP's point about this class of bug being difficult to detect in a monitoring system is the more salient issue.
Well hang on! Your question was why was the time to detect so high and you specifically mentioned 24x7 SRE so I thought that’s what we were talking about ;)
And I do think the answer is that monitoring is easy but good monitoring takes a whole lot of work. Devops teams tend to stop at sufficient observability, whereas an SRE team should be dedicating its time to engineering great observability, because the SRE team is not being pushed by product to deliver features. A functional org will protect SRE teams from that pressure; a great one will allow the SRE team to apply counter-pressure from the reliability and non-functional perspective to the product perspective. This equilibrium is ideal because it allows speed but keeps a tight leash on tech debt by developing rigor around what is too fast, or too many errors, or whatever your relevant metrics are.
I’ve anecdotally observed the opposite. I have noticed SRE jobs remain posted, even by companies laying off or announcing some kind of hiring slowdown over the last quarter or so. More generally, businesses that have decided that they need SRE are often building out from some kind of devops baseline that has become unsustainable for the dev team. When you hit that limit and need to split out a dedicated team, there aren’t a ton of alternatives to getting an SRE or two in and shuffling some prod-oriented devs to the new SRE team (or building a full team from scratch, which is what the $$ estimate above was about). Among other things, the SRE bailiwick includes capacity planning and resource efficiency; SRE will save you money in the long term.
On a personal note, I am sorry to hear that your job search has not yet been fruitful. Presumably I am interested in different criteria than you; I have found several postings that are quite appealing, to the point where I am updating my CV and applying, despite being weakly motivated at the moment.
Every system failure prompts people to exclaim "why aren't there safeguards?". Every time. Well, guess what: if we try to do new stuff, we will run into new problems.
I'm not disagreeing with you, and I'm not the commenter you're replying to, but it's worth noting that cache leakage and cache invalidation are two different problems.
You're right. Thanks for pointing that out. My original point still stands, distributed systems are hard and people demanding zero failures are setting an impossible standard.
It's non-trivial, but it's also not that hard; there are well-known strategies for achieving it. Especially if you relax guarantees and only promise eventual consistency, it becomes fairly trivial - we do this, for example, and have few problems with it.
Probably the cheapest solution would be letting GPT monitor user feedback from various social media channels and alert human engineers to check for the reported problem. GPT could even engage with users to request more details or reproducible cases ;)
Since it now handles visual inputs, I wonder how hard it'd be to get GPT to monitor itself. Have it constantly observe a set of screenshares of automated processes starting and repeating ChatGPT sessions on prod, alert the on-call when it notices something "weird."
RUM monitoring does 99% of what you want already. Anomaly detection is the hard part. IMO too early to say whether gpt will be good at that specific task but I agree that a LLM will be in the loop on critical production alerts in 2023 in some fashion.
I don’t think that model has the properties you think it does. Someone still has to take call to back the operators. Someone has to build the signals that the ops folks watch. Someone has to write criteria for what should and should not be escalated, and in a larger org they will also need to know which escalation path is correct. And on and on — the work has to get done somewhere!
The way those criteria usually get written in a startup with mission-critical customer-facing stuff (like this privacy issue) is that first the person watching Twitter and email and whatever else pages the engineers, and then there's a retro on whether or not that particular one was necessary, lather, rinse, repeat.
All you need on day 1 is someone to watch the (metaphorical) phones + a way to page an engineer. Don't start by spending a million bucks a year, start by having a first aid kit at the ready.
Perhaps they could also help this person out by looking into some sort of fancy software to automatically summarize messages that were being sent to them, or their mentions on Reddit, or something, even?
Yup, twitter monitoring is a thing that I have seen implemented. We did not allow it to page us, however. As you say, some of the barriers around that are low or gone as of late. I wonder if someone has already secured seed funding for social media monitoring as a service. The feature set you can build on a LLM is orders of magnitude better than what was practical before.
Looking at my post up-thread, I wish I had emphasized the time aspect more - of course all of these problems are solvable but it takes both time and money. They have the money now but two months ago the parts of this incident were in place but the scale was so small that it never actually leaked data. Or maybe a handful of early adopters saw some weird shit but we’re all well-trained to just hit refresh these days. Hiring even one operator and getting them spun up takes calendar time that simply has not existed yet. I assume someone over there is panicking about this and trying to get someone hired to make sure they look better prepared next time, because there will be a next time, and if they’re even half as successful as the early hype leads me to believe, I expect they are going to have a lot more incidents as they scale. One in a million is eight and a half times per day at 100 rps.
Since I wrote this, I have seen several anecdotes that support this guess. This is a classic scaling problem. One or two users saw it, and one even says they reported it, but at small scale with immature tools and processes getting to the actual software bug is a major effort that has to be balanced around other priorities like making excessive amounts of money.
> […] it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation?
OpenAI is hiring Site Reliability Engineers (SRE) in case you, or anyone you know, is interested in working for them: https://openai.com/careers/it-engineer-sre . Unfortunately, the job is an onsite role that requires 5 days a week in their San Francisco office, so they do not appear to be planning to have a 24/7 on-call rotation any time soon.
Too bad because I could support them in APAC (from Japan).
Over 10 years of industry experience, if anyone is interested.
I had forgotten that I looked at this and came to the same conclusion as you. I’d happily discuss a remote SRE position but on-site is a non-starter for me, and most of SRE, if I am reading the room correctly.
Edit to add: they’re also paying in line with or below industry, and the role description reads like a technical project manager, not an SRE. I imagine people are banging down the door because of the brand, but personally that’s a lot of red flags before I even submit an application.
nobody qualified wants the 24/7 SRE job unless it pays an enormous amount of money. i wouldn't do it for less than 500 grand cash. getting woken up at 3am constantly or working 3rd shift is the kind of thing you do with a specific monetary goal in mind (i.e., early retirement) or else it's absolute hell.
combine that with ludicrous requirements (the same as a senior software engineer) and you get gaps in coverage. ask yourself what senior software engineer on earth would tolerate getting called CONSTANTLY at 3am, or working 3rd shift.
the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.
spinning up a subsidiary in another country (especially one with very strict labor laws, like in european countries) is not as easy as "find some guy on the internet and pay him to watch your dashboard." and then you have to give him root so he can actually fix stuff without calling your domestic team, which would defeat the whole purpose.
also, even getting paged ONCE a month at 3am will fuck up an entire week at a time if you have a family. if it happens twice a month, that person is going to quit unless they're young and need the experience.
Sorry, to be clear, I was replying to this part of your comment:
> the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.
I agree that the stakes are lower in terms of harm, but was trying to express that whilst it might not be life and death, it might be hindering someone being able to do their job / use your product - eg: it still impacts customer experience and your (business) reputation.
False pages for transient errors are bad - ideally you only get paged if human intervention is required, and this should form a feedback cycle to determine how to avoid it in future. If all the pages are genuine problems requiring human action then this should feed into tickets to improve things
Not only that, but you probably need follow-the-sun coverage if you want <30 minute response time.
Given a system that collects minute-based metrics, it generally takes around 5-10 minutes to generate an alert. Another 5-10 minutes for the person to get to their computer unless it's already in their hand (what if you get unlucky and on-call was taking a shower or using the toilet?). After that, another 5-10 minutes to see what's going on with the system.
After all that, it usually takes some more minutes to actually fix the problem.
I've worked two SRE (or SRE-adjacent) jobs with on-call duty (some unicorn and a FAANG). Neither has been remotely as bad as what you're describing. (Only one was actually 24/7 for a week-long shift.)
The whole point is that before you join, the team has done sufficient work to not make it hell, and your work during business hours makes sure it stays that way. Are there a couple bad weeks throughout the year? Sure, but it's far, far from the norm.
I did that for a few years, and wasn't on 500k a year, but I'm also the company co-founder, so you could argue that a "specific monetary goal" was applicable.
You don't need 24/7 SREs, you could do it with 24/7 first-line customer support staff monitoring Twitter, Reddit, and official lines of comms that have the ability to page the regular engineering team.
That's a lot easier to hire, and lower cost. More training required of what is worth waking people up over; way less in terms of how to fix database/cache bugs.
Support engineers and an official bug reporting channel would help. I noticed and reported the issue on their official forums on 16 March, but got no response.
Probably because they launched ChatGPT as an experiment and didn't think it would blow up, needing full time SRE etc. I don't think it was designed for scale and reliability when they launched.
Probably the real reason. I assume they intend to make money off enterprise contracts which would include SLAs. Then they'd set their support based off that
Given the Microsoft partnership, they might not even need to manage any real infrastructure. Just hand it off to Azure and let them handle the details.
Just add metrics for the number of times "ChatGPT" and "OpenAI" appeared in tweets, reddit posts, and HN comments in the last (rolling) five minutes, put them on a dashboard alongside all your other monitoring, and have a threshold where they page the oncall to review what's being said. It doesn't even have to be an SRE in this case; it could be just about anyone.
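For what it's worth, the rolling-window part is only a few lines. A toy sketch, assuming some scraper elsewhere feeds it mention timestamps and with page_oncall as a placeholder:

    # Toy sketch: rolling five-minute mention counter with a paging threshold.
    # Assumes some other process calls record_mention(); page_oncall is a stub.
    from collections import deque

    WINDOW_SECONDS = 5 * 60
    THRESHOLD = 200            # tune against the normal baseline

    mentions = deque()         # timestamps of recent "ChatGPT"/"OpenAI" mentions

    def record_mention(ts: float) -> None:
        mentions.append(ts)

    def check_and_maybe_page(now: float) -> None:
        while mentions and now - mentions[0] > WINDOW_SECONDS:
            mentions.popleft()
        if len(mentions) > THRESHOLD:
            page_oncall(f"{len(mentions)} mentions in the last 5 minutes")

    def page_oncall(msg: str) -> None:
        print("PAGE:", msg)    # stand-in for PagerDuty/Slack/etc.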
I managed to manually produce this bug 2 months ago.
As they don't have any bug bounty, I didn't submit it.
By starting a conversation and refreshing before ChatGPT has time to answer, I managed to reproduce this bug 2-3 times in January.
As I understand, the "work" was already done, the only thing missing was sending a heads-up email with "hey, this seems iffy, maybe you ought to look into it".
I dunno, I generally report issues I find in software, paid or not, as I've always done. Takes usually ~10 minutes and 1% of the time, they ask for more details and I spend maybe 20 minutes more to fill out some more details.
Never been paid for it ever; the most I've gotten was a free yearly subscription. But in general I do it because I want what I use to be less buggy.
I reported this race condition via ChatGPT's internal feedback system after I saw other users' chat titles loading in my sidebar a couple of times (around 7-8 weeks ago). Didn't get a response, so I assumed it was fixed...
Hopefully they'll start a bug bounty program soon, and prioritise bug reports over features.
The explanation at the time was that unavailable chat data (due to, e.g. high load) resulted in a null input sometimes being presented to the chat summary system, which in turn caused the system to hallucinate believable chat titles. It's possible that they misdiagnosed the issue or that both bugs were present and they caught the benign one before the serious one.
Yeah, I was surprised that the bug appeared simply from using the app normally. My first thought was that it was data from another user loading, so I immediately reported that it looked like a race condition. But maybe it was this other bug you mention.
The claim made at the time was that the titles were not from other people and were in fact caused by the model hallucinating after the input query timed out (or something like that). Obviously that sounds a little suspect now, but it might be true.
That's a lie, if so. If you look at the Reddit threads, there's no way those were not other users' specific histories, as they read like coherent browsing histories. E.g., one I saw had stuff like "what is X", then the next would be "how to X" or something. Some were all in Japanese, others all in Chinese. If it were random you wouldn't see clear logical consistency across the list.
> In the hours before we took ChatGPT offline on Monday, it was possible for some users to see another active user’s first and last name, email address, payment address, the last four digits (only) of a credit card number, and credit card expiration date
This is a lot of sensitive data. It says 1.2% of ChatGPT Plus subscribers active during a 9-hour window, which, considering their user base, must be a lot of people.
> "If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection."
The OpenAI API was incredibly slow and lots of requests probably got cancelled (I certainly was doing that) for some days. I imagine someone could write a whole blog post about how that worked, it would be interesting reading.
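Roughly, the triggering pattern looks like the following. This is a hedged sketch, not OpenAI's code: it assumes an affected redis-py (redis.asyncio) version, where a command cancelled after being written but before its reply is read leaves that reply on a connection that goes back into the shared pool.

    # Hedged sketch of the trigger, assuming an affected redis-py version;
    # patched versions discard the dirty connection instead of reusing it.
    import asyncio
    from redis import asyncio as aioredis

    async def main() -> None:
        r = aioredis.Redis()          # shared connection pool
        try:
            # wait_for cancels the in-flight GET when the timeout expires,
            # i.e. after the command was sent but before the reply was read.
            await asyncio.wait_for(r.get("user:123:history"), timeout=0.001)
        except asyncio.TimeoutError:
            pass                      # request abandoned mid-flight
        # In affected versions, this unrelated command can be handed the stale
        # reply left behind on the reused connection.
        print(await r.get("user:456:history"))

    asyncio.run(main())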
It's surprising that openai seems to be the only one being affected. If the issue is with redis-py reusing connections then wouldn't more companies/products be affected by this?
their description of the problem seemed kind of obtuse. in practice, these connection-pool-related issues come down to: 1. the request is interrupted, 2. an exception is thrown, 3. the exception is caught, the connection is returned to the pool, and we move on. the thing that has to be implemented is 2a: clean up the state of the connection when the interruption exception is caught, then return it to the pool.
that is, this seems like a very basic programming mistake and not some deep issue in Redis. the strange way it was described makes it seem like they're trying to conceal that a bit.
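For comparison, the "2a" step described above looks roughly like this. The pool/connection API here is made up for illustration; it is not redis-py's internals.

    # Generic sketch of step 2a; acquire/send/read_response/discard/release
    # are hypothetical names on a hypothetical pool, not a real library API.
    import asyncio

    async def run_command(pool, command):
        conn = await pool.acquire()
        dirty = False
        try:
            await conn.send(command)
            return await conn.read_response()
        except asyncio.CancelledError:
            dirty = True              # a reply may still be in flight on this socket
            raise
        finally:
            if dirty:
                await conn.disconnect()
                pool.discard(conn)    # never reuse a possibly-corrupted connection
            else:
                pool.release(conn)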
I think most apps using redis-py rarely cancel async redis commands. Python async web frameworks are gaining popularity, but the majority of people using Python for their web applications are not using an async framework. And of those who do, not many cancel async redis requests often enough to trigger the bug.
1. Redis can handle a lot more connections, more quickly, than a database can.
2. It's still faster than a database, especially a database that's busy.
#2 is an interesting point. When you benchmark, the normal process is to just set up a database then run a shitload of queries against it. I don't think a lot of people put actual production load on the database then run the same set of queries against it...usually because you don't have a production load in the prototyping phase.
However, load does make a difference. It made more of a difference in the HDD era, but it still makes a difference today.
I mean, redis is a cache, and you do need to ensure that stuff works if you purge redis (i.e. be sure the rebuild process works), etc., etc.
But just because it's old doesn't mean it's bad. OS/390 and AS/400 boxes are still out there doing their jobs.
A pretty small Redis server can handle 10k clients and saturate a 1Gbps NIC. You'd need a pretty heavy duty Postgres database and definitely need a connection pooler to come anywhere close.
I agree that redis can handle some query volumes and client counts that postgres can't.
But FWIW I can easily saturate a 10GBit ethernet link with primary key-lookup read-only queries, without the results being ridiculously wide or anything.
Because it didn't need any setup, I just used:
SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;
I don't immediately have access to a faster network, but connecting via TCP to localhost and using some moderate pipelining (common in the redis world afaik), I get up to 19GB/s on my workstation.
Sorry, I should have used something more standard - but it was what I had ready...
It just selects every column from a single table, pg_class. Which is where postgres stores information about relations that exist in the current database.
I'm confused on why the need to complicate something as seemingly-straightforward as a KV store into a series of queues that can get all mixed up. I asked ChatGPT to explain it though, and it sounds like the justification for its existence is that it doesn't "block the event loop" while a request is "waiting for a response from Redis."
Last time I checked, Redis doesn't take that long to provide a response. And if your Redis servers actually are that overloaded that you're seeing latency in your requests, it seems like simple key-based sharding would allow horizontally scaling your Redis cluster.
Disclaimer: I am probably less smart than most people who work at OpenAI so I'm sure I'm missing some details. Also this is apparently a Python thing and I don't know it beyond surface familiarity.
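For what it's worth, the "don't block the event loop" argument is about concurrency rather than Redis being slow: while one GET is waiting on its round trip, the loop can service other work. A small sketch with redis-py's asyncio client; the key names are illustrative.

    # Sketch: issue many lookups concurrently instead of one round trip at a time.
    import asyncio
    from redis import asyncio as aioredis

    async def main() -> None:
        r = aioredis.Redis()
        keys = [f"chat:{i}:title" for i in range(100)]      # illustrative keys
        titles = await asyncio.gather(*(r.get(k) for k in keys))
        print(sum(t is not None for t in titles), "hits")

    asyncio.run(main())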
Redis latency is around 1ms including network round trip for most operations. In a single-threaded context, waiting on that would limit you to around 1000 operations per second. Redis clients improve throughput by doing pipelining, so a bunch of calls are batched up to minimize network round trips. This becomes more complicated in the context of redis-cluster, because calls targeting different keys are dispatched to different cache nodes and will complete in an unpredictable order, and additional client-side logic is needed to accumulate the responses and dispatch them back to the appropriate caller.
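A minimal pipelining example with the synchronous redis-py client, to make the batching point concrete (keys are illustrative):

    # Sketch: batch several commands into one network round trip with a pipeline.
    import redis

    r = redis.Redis()
    pipe = r.pipeline()
    for i in range(100):
        pipe.get(f"chat:{i}:title")      # queued client-side, not sent yet
    results = pipe.execute()             # one round trip for all 100 commands
    print(len(results), "responses")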
I'm not familiar with the Python client specifically, but Redis clients generally multiplex concurrent requests onto a single connection per Redis server. That necessitates some queueing.
For caching the query results you get from your database. Also it's easier to spin up Redis and replicate it closer to your user than doing that with your main database. From my experience anyway.
I think the idea is that if your db can hold the working set in RAM and you're using a good db + prepared queries, you can just let it absorb the full workload because the act of fetching the data from the db is nearly as cheap as fetching it from redis.
Of course? I'm not really sure what the original question actually is if you know that users benefit from caching the results of computationally intensive queries.
Better concurrency (10k max connections vs ~200 for Postgres). ~20x faster than Postgres at key-value read/write operations. (Mostly) single-threaded, so atomicity is achieved without the synchronization overhead found in an RDBMS.
Thus, it's much cheaper to run at massive scale like OpenAI's for certain workloads, including KV caching
also:
- robust, flexible data structures and atomic APIs to manipulate them are available out-of-the box
Nice writeup, it's fair in the content presented to us.
Yet I'm wondering why there is no checking if the response does actually belong to the issued query.
The client issuing a query can pass a token and verify upon answer that this answer contains the token.
TBH, as a user of the client I would kind of expect the library to have this feature built in, and if I'm starting to use the library to solve a problem, handling this edge case would be a somewhat low priority to me if the library didn't implement it, probably because I'm lazy.
I hope that the fix they offered to Redis Labs does contain a solution to this problem and that every one of us using this library will be able to profit from the effort put into resolving the issue.
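One way to bolt the check on at the application layer, given that the client does not do it for you. This is a hedged sketch; the key-embedding convention is my own, not anything redis-py provides.

    # Sketch: embed the key in every cached payload and verify it on read.
    import json
    import redis

    r = redis.Redis()

    def cache_set(key: str, value: dict) -> None:
        r.set(key, json.dumps({"key": key, "value": value}))

    def cache_get(key: str):
        raw = r.get(key)
        if raw is None:
            return None
        payload = json.loads(raw)
        if payload.get("key") != key:
            # The response does not belong to this query: treat it as a miss
            # (and log loudly) rather than serving someone else's data.
            return None
        return payload["value"]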
It doesn't [0], so the burden is still on the developer using the library.
Edit: Now I'm confused, this issue [1] was raised on March 17 and fixed on March 22, was this a regression? Or did OpenAI start using this library on March 19-20?
Interesting comment:
> drago-balto commented 3 hours ago
> Yep, that's the one, and the #2641 has not fixed it fully, as I already commented here: #2641 (comment)
> I am asking for this ticket to be re-oped, since I can still reproduce the problem in the latest 4.5.3. version
That sounds more like a hindsight thing. In most systems authorization doesn't happen at the storage layer. Most queries fetch data by an identifier which is only assumed to be valid based on authorization that typically happens at the edge and then everything below relies on that result.
It's not the safest design but I wouldn't say the client should be expected to implement it. That security concern is at the application layer and the actual needs of the implementation can be wildly different depending on the application. You can imagine use cases for redis where this isn't even relevant, like if it's being used to store price data for stocks that update every 30 seconds. There's no private data involved there. It's out of scope for a storage client to implement.
I've long thought that it is often better to return a bit of extra data in internal API responses to validate that the response matches the request sent. That can be fairly simple, like parroting a request ID, or including some extra metadata (e.g. part of the request) to validate that the response is valid. It's not the most efficient, but it can save your bacon sometimes. Mixing up deployment stacks (e.g. thinking you are talking to staging when actually it's prod) and mixing user data are pretty scary, so any defense in depth seems useful.
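A minimal version of the request-ID parroting idea for an internal HTTP API. The X-Request-ID header and the request_id field in the response body are assumptions about a hypothetical service, not any specific API.

    # Sketch: send a request ID and refuse responses that don't echo it back.
    import uuid
    import requests

    def call_internal_api(url: str, params: dict) -> dict:
        request_id = str(uuid.uuid4())
        resp = requests.get(url, params=params,
                            headers={"X-Request-ID": request_id}, timeout=5)
        resp.raise_for_status()
        body = resp.json()
        if body.get("request_id") != request_id:
            # Crossed wires (wrong stack, mixed-up response): discard it.
            raise RuntimeError("response does not match request")
        return body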
If you're subscribed to their status page, you'll know it's actually unusual for a day to go by without an outage alert from OpenAI. They don't usually write them up like this but I guess this counts as PII leak disclosure for them?
For having raised billions of dollars, they are comically immature from a reliability and support perspective.
To be fair, they accidentally made a game-changing breakthrough that gained millions of users overnight, and I don't think they were ready for it.
Before chatgpt, most normal people had never heard of OpenAI. Their flagship product was basically an API that only programmers could make useful.
Team leaders at OpenAI have stated that they were not expecting the success, let alone the highest adoption rate of any product in history. In their minds, it was just a cleaned-up version of a 2-year-old product. It was billed as a research preview.
So, all of a sudden you go from hiring mostly researchers, because you only have to maintain an API and some mid-traffic web infra, to having the fastest-growing web product in history and having to scale up as fast as you can. Keep in mind that they didn't get backing from Microsoft until January 23, 2023; that was only 2 months ago.
These problems predate ChatGPT. Their API has been on the market for nearly 3 years. And they raised their first $1B in 2019. That's plenty of money and time to hire capable leadership.
Yeah but again, this is the fastest growing app in history and it uses way more compute than your standard webapp, and basically delivers all functionality from a single service that handles that load. I can see why there would be some growing pains.
Not sure why no one is talking about the serious breach of personal and credit card information in this case. On the contrary, everyone is very concerned about the compromise of a GitHub SSH key in another thread.
Does anyone else find it a bit off-putting how much emphasis they keep putting on "open source library"? I don't think I've read about this without the word open source appearing more than once in their own messaging about it. Why is it so important to emphasize that the library with the bug is open source?
The cynic in me wants to believe that it's a way of deflecting blame somehow, to make it seem like they did their due diligence but were thwarted by something outside of their control. I don't think it holds. If you use an open source library with no warranty, you are responsible (legally and otherwise) to ensure that it is sufficient. For example, if you break HIPAA compliance due to an open source library, it is still you who is responsible for that.
But of course, they're not claiming it's anyone else's fault anywhere explicitly, so it's uncharitable to just assume that's what they meant. Still, it rubs me the wrong way. I can't fight the feeling that it's a wink wink nudge nudge to give them more slack than they'd otherwise get. It feels like it's inviting you to just criticize redis-py and give them a break.
The open postmortem and whatnot is appreciated and everything, but sometimes it's important to be mindful of what you emphasize in your postmortems. People read things even if you don't write them, sometimes.
I noticed it too, but it doesn't necessarily bother me. Possibly they're just trying to say, "This incident may have made us look like we're complete amateurs who don't have any clue about security, but it wasn't like that."
Using someone else's library doesn't absolve you of responsibility, but failing to be vigilant at thoroughly vetting and testing external dependencies is a different kind of mistake than creating a terrible security bug yourself because your engineers don't know how to code or everyone is in too much of a rush to care about anything.
Yes, I agree with that sentiment, and I thought precisely the same. I know as an engineer that I would feel compelled to mention that it was an obscure bug in an open source library, if that was the case. Not to excuse myself of responsibility, but because I would feel so ashamed if I myself introduced such an obvious security flaw. I would still of course consider myself responsible for what happened.
A lot of the time when people make mistakes, they explain themselves because they are afraid of being perceived as completely stupid or incompetent for making that mistake, not to excuse themselves from taking responsibility, even though people frequently think that an excuse or explanation means you are trying to absolve yourself of what you did.
There's a huge difference to me between having an obscure bug like this and introducing that type of security issue because you couldn't reason it through. The first can be addressed in the future by introducing processes and making sure all open source libraries come from trusted sources, but the second implies that you are fundamentally unable to reason about it, and therefore probably unable to improve either.
The result for the end consumer is identical whether they have their PII leaked from "an external library" vs a vendor's own home-baked solution.
It's not really a different kind of mistake, it's exactly the same kind of mistake, because it is exactly the same mistake! This is talking the talk, and not walking the walk, when it comes to security.
Publishing a writeup that passes the buck to some (unnamed) overworked and underpaid open source maintainer is worse, not better!
The dev had such a big ego that they didn't want to say "I was dumb and left open a bug", so the dev says "I was so dumb that I left open a bug in software I was also too dumb or lazy to write or even read".
It's not better.
I agree, it is a different kind of mistake; it is immensely worse than creating a terrible security bug yourself.
Outsourcing your development work without acceptance criteria and without validation of fitness for purpose is complete, abject engineering incompetence. Do you think bridge builders look at the rivets in the design and then just waltz over to Home Depot and pick out one that looks kind of like the right size? No, they have exact specifications, and it is their job to source rivets that meet those specifications. They then either validate the rivets themselves or contract with a reputable organization that legally guarantees they meet the specifications, and it might be prudent to validate them again anyway just to be sure.
The fact that, in software, not validating your dependencies, i.e. the things your system depends on, is viewed as not so bad is a major reason why software security is such an utter joke and why everybody keeps making such egregious security errors. If one of the worst engineering practices is viewed as normal and not so bad, it is no wonder the entire thing is utterly rotten.
I do not believe it's necessarily nefarious in nature, but maybe more specifically it feels kind of like they're implying that this is actually a valid escape hatch: "Sorry, we can't possibly audit this code because who audits all of their open source deps, amirite?"
But the truth is that actually, maybe that hints at a deeper problem. It was a direct dependency to their application code in a critical path. I mean, don't get me wrong, I don't think everyone can be expected to audit or fund auditing for every single line of code that they wind up running in production, and frankly even doing that might not be good enough to prevent most bugs anyways. Like clearly, every startup fully auditing the Linux kernel before using it to run some HTTP server is just not sustainable. But let's take it back a step: if the point of a postmortem is to analyze what went wrong to prevent it in the future, then this analysis has failed. It almost reads as "Bug in an open source project screwed us over, sorry. It will happen again." I realize that's not the most charitable reading, but the one takeaway I had is this: They don't actually know how to prevent this from happening again.
Open source software helps all of us by providing us a wealth of powerful libraries that we can use to build solutions, be we hobbyists, employees, entrepreneurs, etc. There are many wrinkles to the way this all works, including obviously discussions regarding sustainability, but I think there is more room for improvement to be had. Wouldn't it be nice if we periodically had actual security audits on even just the most popular libraries people use in their service code? Nobody in particular has an impetus to fund such a thing, but in a sense, everyone has an impetus to fund such work, and everyone stands to gain from it, too. Today it's not the norm, but perhaps it could become the norm some day in the future?
Still, in any case... I don't really mean to imply that they're being nefarious with it, but I do feel it comes off as at best a bit tacky.
They really skirt around the fact that they apparently introduced a bug which quite consistently initiated redis requests and terminated the connection before receiving the result.
Doesn't bother me either. All the car companies issue recalls regularly, sometimes an issue only shows up when the system hits capacity or you run into an edge case.
The gaping hole in this write-up goes something like:
"In order to prevent a bug like this from happening in the future, we have stepped up our review process for external dependencies. In addition, we are conducting audits around code that involves sensitive information."
Of course, we all know what actually happened here:
- we did no auditing;
- because our audit process consists of "blame someone else when our consumers are harmed";
- because we would rather not waste dev time on making sure our consumers are not harmed
If you want to know why no software "engineering" is happening here, this is your answer. Can you imagine if a bridge collapsed, and the builder of the bridge said, "iunno, it's the truck's fault for driving over the bridge."
Are you confident that an audit would have uncovered this bug? I’d be surprised if audits are effective at finding subtle bugs and race conditions, but I could be wrong.
Depends on the type of audit. Subtle bugs are often uncovered by fuzzing, for example, but a race condition might not be found without substantial load.
I think you’re reading too much into it. Being an open source library is relevant because it means it’s third party and doesn’t come with a support agreement, so fixing a bug is a somewhat different process than if it were in your own code or from a proprietary vendor.
Yes, it’s technically up to you to vet all your dependencies, but in practice, often it doesn’t happen, people make assumptions that the code works, and that’s relevant too.
Also, vetting a dependency != auditing and testing every line of code to find all possible bugs.
If this bug was an open issue in the project's repo, that might be concerning and indicate that proper vetting wasn't done. Ditto if the project is old and unmaintained, doesn't have tests, etc. But if they were the first to trigger the bug and it only occurs under heavy load in production conditions, well, running into some of those occasionally is inevitable. The alternative is not using any dependencies, in which case you'd just be introducing these bugs yourself instead. Even with very thorough testing and QA, you're never going to perfectly mimic high load production conditions.
Open source can be fixed as if it were your own code. (And that is a strong tenet of free/open source software.)
Not only do most open/free source libraries come without support agreements: they come with the broadest possible limitation of warranties. (As they should)
So the company, knowing that what they are using comes without any warranty either of quality or fitness to the use-case, have a very strong burden of due diligence / vetting.
All of us rely on millions of lines of code that we have not personally audited every single day. Have you audited every framework you use? Your kernel? Drivers? Your compiler? Your CPU microcode? Your bootrom? The firmware in every gizmo you own?
If "Reflections on Trusting Trust" has taught us anything, it's turtles all the way down. At some point, you have to either trust something, or abandon all hope and trust nothing.
> Have you audited every framework you use? Your compiler? Your CPU microcode? Your bootrom?
Of course not. I exclude the CPU microcode, bootrom, and the like from the discussion because that's not part of the product being shipped.
But it's also true that I don't do a deep dive analyzing every library I use, etc. I'm not saying that we should have to.
What I'm saying is that when a bug pops up, that's on us as developers even when the bug is in a library, the compiler, etc. A lot of developers seem to think that just because the bug was in code they didn't personally write, that means that their hands are clean.
That's just not a viable stance to take. The bug should have been caught in testing, after all.
If your car breaks down because of a design failure in a component the auto manufacturer bought from another supplier, you'll still (rightfully) hold the auto manufacturer responsible.
That’s reacting to a bug you know about. Do you mean to talk about how developers aren’t good enough at reacting to bugs found in third party libraries, or how they should do more prevention?
In this case, it seems like OpenAI reacted fairly appropriately, though perhaps they could have caught it sooner since people reported it privately.
“Holding someone responsible” is somewhat ambiguous about what you expect. It seems reasonable that a car manufacturer should be prepared to do a recall and to pay damages without saying that they should be perfect and recalls should never happen.
> Do you mean to talk about how developers aren’t good enough at reacting to bugs found in third party libraries, or how they should do more prevention?
My point was neither of these. My point is very simple: the developers of a product are responsible for how that product behaves.
I'm not saying developers have to be perfect, I'm just saying that there appears to be a tendency, when something goes wrong because of external code, to deflect blame and responsibility away from them and onto the external code.
I think this is an unseemly thing. If I ship a product and it malfunctions, that's on me. The customer will rightly blame me, and it's up to me to fix the problem.
Whether the bug was in code I wrote or in a library I used isn't relevant to that point.
I've also noticed it, and I can't help but interpret it as their way of shifting blame. Which is irresponsible. It's their product, and they need to take accountability for the bug occurring.
It's a serious bug, but in the grand scheme of things, not earth-shattering, and not something that I think would discourage usage of their product. But their treatment of the bug causes more concerns than the bug itself. They are shifting the blame onto the library with the bug, rather than onto their own process by which that library made it into their product. And I don't understand how they can't see how that reflects poorly on them as an AI company.
I find it so confusing that at the end of the day, OpenAI's biggest product is having created a good process by which to create value out of a massive amount of data, and build a good API on top of it. And the open source library is effectively something they processed into their product and built an API based off of it. So it creates (to me) some amount of doubt about how they will react when faced with similar challenges to their core product. How will they behave when the data they consume impacts their product negatively? From limited experience, they'll shift the blame to the data, not their process, and keep it pushing.
It seems likely that this is only the beginning of OpenAI having a large customer base, with a high impact on many products. This is a disappointing result on their first test on how they'll manage issues and bugs with their products.
I don't find it over-emphasized. Many in the Twitter-sphere are acting as if they aren't being appreciative of open source software and I don't see it that way.
The technical root cause was in the open source library. There's a patch available and more likely than not OpenAI will continue to use the library.
Being overly sensitive to blame would distract from the technical issue at hand. It's great they are posting this post-mortem to raise awareness that the libraries you use can have bugs and that you should consider that risk when building systems.
Root cause analysis would likely also include the lack of threat modeling / security evaluation of their dependencies
It would likely also question the lack of resources allocated to these open source projects by companies that profit, in part, from using those open source projects.
Instead of spending engineering time, they used a free and open-source library to do less work.
The license they agreed to in order to use this library says this in capital letters: [THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND].
After agreeing to this license and using the library for free, they charged people money and sold them a service. And when that library they got for free, which they read and agreed that had no warranty of any kind, had a software bug, they wrote a blog post and blamed the outage of their paid service on this free library.
This is not another open-source project, or a small business. This is a company that got billions of dollars in investment, and a lot of income by selling services to businesses and individuals. They don't get to use free, no-warranty code written by others to save their own money, and then blame it and complain about it loudly for bugs.
I've never cared per se that a library was bug free but I've put a lot of effort/$ into making sure the features that used the libraries in my product were bug free (with the amount of effort depending on the sensitivity of the feature, data, etc).
Usually "fix the original library" wasn't as easy or immediate a fix as "hack around it" which is sad just re: the overall OSS ecosystem but still the person releasing a product's responsibility.
Unfortunately these sorts of bugs are wildly difficult to predict. Yet it's also a wildly common architecture. That's what's sad for all of us as engineers as a whole. But "caching credit card details and home addresses", for instance, is... particularly dicey. That's very sensitive, and you're tossing it into more DBs, without good access control restrictions?
Anywhere you have payment-related or any other PII data, transitive dependencies, framework and language choices, memory sharing, and other risks have to be taken into account as something that you, as the one developing and operating the service, are solely responsible for.
In my experience errors are more common (for both cultural and technological reasons) in Python than in Go.
I would guess something similar applies to Rust, though I don't have personal experience.
There's wide variation in C, but with careful discrimination, you can find very high-quality libraries or software (redis itself being an excellent example).
I don't have rigorous data to back this up, but I'm pretty convinced it's true, based on my own experience.
Those would be roughly similar. The main difference would be between dynamically typed interpreted languages and statically typed compiled ones, I guess. At least I think I make fewer mistakes when the compiler literally tells me what's wrong before I even run the thing. It's awful and slow to develop that way, but it is more reliable for when that's a requirement.
it really was their fault. they chose to ship the bug. it doesn't matter in the least that someone else previously published the code under a license with no warranty whatsoever.
This doesn't come across this way to me at all. They just described what happened. Do you expect them to jump in front of a bus for the library they're using, and beg for forgiveness for not ensuring the widely used libraries they're leveraging are bug free?
There are very few companies that couldn't get caught by this type of bug.
Basically agree -- feels off-putting, but not technically a wrong detail to add. An additional reason it rubs me the wrong way, however, is that I believe open-source software code is especially critical to ChatGPT family's capabilities. Not just for code-related queries, but for everything! (e.g. see this "lineage-tracing" blog post: https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...)
Thus, I honestly think firms operating generative AI should be walking on eggshells to avoid placing blame on "open-source". Rather, they really should going out of their way to channel as much positive energy towards it as possible.
Still, I agree the charitable interpretation is that this is just purely descriptive.
the library is provided by the redis team themselves and the bug is awful [1]. I know it's not redis' fault but this bug could hit anyone. Connections may be left in a dirty state where they return data from the previous request in the following one.
I think you are reading a bit between the lines, and didn't feel them blaming the library as much as stating that the bug happened because of an issue in the library. Maybe they could have sugarcoated it between 10 layers of corporate jargon but I'd rather take this over that
The emphasis could also have been done to educate folks using this combination to check their setup.
Though a version reference in the postmortem should also have been posted, as general guidance to their readers, but at least a quick Google search leads you to it.
For anyone reading this and using a combination of asyncio and py-redis, please bump your versions.
I've encountered similar issues with asyncio Python and Postgres in the past when trying to pool connections. They're really not easy to debug either.
Not surprising from a company that calls itself openai. The “open source” keyword stuffing is so people associate the open from openai with open source. Psyops I mean marketing 101.
I don't know. To me it's simply an explanation of what has happened. I think its exactly what I would have written if I was in their position. And show me the one company that has audited all source code of all used open source projects, at least in a way that is able to rule out complex bugs like this. I have once found a memory corruption bug in Berkeley DB wrecking our huge production database, which I would have never found in any pre-emptive source code audit, however detailed.
Edit: On second thought, maybe they could have just written "external library" instead of "open source library".
> The cynic in me wants to believe that it's a way of deflecting blame somehow
That's how it reads to me as well.
Of course, it doesn't deflect blame at all. Any time you include code in your project, no matter where the code came from, you are responsible for the behavior of that code.
This is a common bug with a lot of software. For example some HTTP clients that do pooling won’t invalidate the connection after timing out waiting for the response.
I suspect the fatal error OpenAI saw was an "SSL error: decryption failed or bad mac"? Or if SSL were disabled, they likely would see the sort of parsing error vaguely described as an "unrecoverable server error" due to streams of data from one request being swapped with another, with incorrect data structures, bad alignment, etc. I can see how if the stars aligned and SSL were disabled this data race would manifest in viewing another user's request, so long as the sockets were receiving similar responses when they were swapped. I suspect the issue is deeper than the bug recently fixed in just redis-asyncio.
Libraries written prior to asyncio/green threads often have that functionality enabled by means of monkey patches or shims that juggle file handles or other shared state, and there are data races. When SSL is used hopefully those data races reading from sockets result in MAC errors and the connection is terminated.
It's easy to imagine the same mistake happening one level up, in message passing code that manages queues or other shared data structures.
Search for "python django OR celery OR redis OR postgres OR psycopg2 decryption failed or bad mac" and you'll see the scale of the issue, it's fairly widespread.
I don't have a high degree of confidence in this ecosystem. I've written about this before on Hacker News[1], and I'm not confident in the handling of shared data structures in Python libraries. I don't think I can blame any maintainers here, there's a huge number of people asking for these concurrency features and the way that it's often implemented - monkey patching especially - makes it extremely difficult to do correctly.
I reported this bug in the forums (no response), because I couldn't find an official way to report bugs - at least it wasn't documented anywhere that I could find. The actions they've taken fix that bug, but not the next one. OpenAI should take an action to improve their bug reporting channels, so the next bug gets found more quickly.
If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection.
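To make that concrete, here's a stripped-down sketch of that failure mode. None of this is redis-py's actual code: the Connection, Pool, and execute names are invented and the "server" is faked with a timer, but the shape of the race is the same.

    # A cancelled caller leaves its reply unread on a pooled connection,
    # and the next caller on that connection receives the stale reply.
    import asyncio

    class Connection:
        def __init__(self):
            # Stands in for the socket's read buffer / reply stream.
            self._replies = asyncio.Queue()

        async def send_command(self, cmd):
            # Pretend the server answers about 10 ms later.
            asyncio.get_running_loop().call_later(
                0.01, self._replies.put_nowait, "reply-to:" + cmd
            )

        async def read_reply(self):
            return await self._replies.get()

    class Pool:
        def __init__(self):
            self._free = [Connection()]

        def acquire(self):
            return self._free.pop()

        def release(self, conn):
            # The bug: the connection goes back even with a reply in flight.
            self._free.append(conn)

    async def execute(pool, cmd):
        conn = pool.acquire()
        try:
            await conn.send_command(cmd)
            return await conn.read_reply()  # a cancellation can land here...
        finally:
            pool.release(conn)              # ...and the dirty connection is reused

    async def main():
        pool = Pool()
        task = asyncio.create_task(execute(pool, "GET user:alice"))
        await asyncio.sleep(0.001)  # let the command reach the "server"
        task.cancel()               # caller gives up before reading the reply
        try:
            await task
        except asyncio.CancelledError:
            pass
        await asyncio.sleep(0.02)   # alice's reply lands on the pooled connection
        # bob's request reuses the dirty connection and reads alice's reply:
        print(await execute(pool, "GET user:bob"))

    asyncio.run(main())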
This section summarizes the issue with the ChatGPT service which caused some users to be able to see another user's chat history and payment information. OpenAI took the ChatGPT service offline and published technical details of the problem. The bug was found in the Redis client open-source library and has been patched. OpenAI has reached out to affected users and improved their systems to prevent this from happening again. They have also expressed their commitment to protecting user privacy and their dedication to the open-source community.
Interesting. I pasted in the outage report, and then asked ChatGPT to give me a single sentence summary and it gave me an answer way longer than a sentence. I pointed out that it wasn't a single sentence and it gave me:
> ChatGPT experienced an outage due to a bug in an open-source library that allowed some users to see titles from another active user's chat history and payment-related information of 1.2% of ChatGPT Plus subscribers was exposed; the bug has since been patched and actions have been taken to prevent future incidents.
I then asked it why there was the outage and it spit out:
> The exact cause of the Facebook outage is unknown, but it is believed to be related to a configuration change in the company's routers.
That's likely because I ran out of tokens, since the OpenAI outage report is long. After pasting in the text of the outage report again and re-asking why, it gave a much better answer:
> There was an outage due to a bug in an open-source library that allowed some users to see titles from another active user's chat history and also unintentionally exposed payment-related information of 1.2% of ChatGPT Plus subscribers who were active during a specific nine-hour window.
Querying it further, again having to repeat the whole OpenAI outage report, and asking it a few different ways I eventually managed to get this succinct answer:
> The bug was caused by the redis-py library's shared pool of connections becoming corrupted and returning cached data belonging to another user when a request was cancelled before the corresponding response was received, due to a spike in Redis request cancellations caused by a server change on March 20.
It did take me more than a few minutes to get there, so just reading the report would have been faster, and I ended up having to read the report anyway to verify the answer was correct and not a hallucination, so our jobs are safe for now.
Funnily enough, I've had a very similar bug occur in an entirely separate Redis library. It was a pretty troubling failure mode to suddenly start getting back unrelated data.
It boggles my mind that they're not checking the user and conversation ID for EVERY message in the queue (a sketch of what I mean is below), given the possible sensitivity of the requests. How is this even remotely acceptable?
In the Reddit post that first surfaced this, the user saw conversations related to politics in China and other rather sensitive topics related to the CCP.
This can absolutely get people hurt, and they absolutely must take it seriously.
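Something like this is what I have in mind; the names are made up, not OpenAI's code, but the idea is to tag every cached or queued response with the owning user and refuse to render it for anyone else.

    # Hypothetical defensive check; all names here are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CachedResponse:
        user_id: str
        conversation_id: str
        payload: dict

    class OwnershipMismatch(Exception):
        pass

    def serve_from_cache(entry: CachedResponse, requesting_user_id: str) -> dict:
        # Belt and braces: the cache key should already be scoped per user,
        # but verify ownership again right before rendering.
        if entry.user_id != requesting_user_id:
            raise OwnershipMismatch(
                f"cached entry belongs to {entry.user_id!r}, "
                f"not {requesting_user_id!r}; dropping it"
            )
        return entry.payload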
It doesn't boggle my mind at all. Session data appears and is used to render the page. Do you verify the actual cookie every time and go back to the DB to see which user it points to?
No, everyone assumes their session object is instantiated with the right values at that level of the code.
Is it only me who doesn't see any chat history since yesterday? Chat generally doesn't work either: you can type a message, but clicking the button or hitting Enter / Ctrl+Enter has no effect.
In the chat history there is a "retry" button, but clicking it and inspecting the result shows "internal server error".
I've had the exact same issue since last week; it never fixed itself, if that's what you're waiting for. I had to resort to creating a new account with a different email to get access again. I contacted support but have yet to hear back. Not sure if it was the cause, but the issue started when I tried to buy Plus and the payment failed.
I found it was an issue with Firefox and its default privacy settings. Clicking the shield icon in the Firefox address bar and turning that privacy guard off made chat work again.
Sounds like they need to use a Lua script to ensure that the pop and push operations remain atomic within the Redis instance? For example, an SPOP + SADD script:
    -- Atomically move one member from a source set to a destination set.
    -- Keys are passed via KEYS (not ARGV) so Redis can track them properly.
    local src_set = KEYS[1]
    local dst_set = KEYS[2]
    local value = redis.call('SPOP', src_set)
    if value then -- SPOP returns false on an empty set; avoid pushing nils
        redis.call('SADD', dst_set, value)
    end
    return value
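If it helps, here's roughly how you'd register and invoke such a script from redis-py; this is a sketch, and the key names ('pending', 'processing') are just examples:

    # Sketch: registering and calling the script above via redis-py.
    # The key names are illustrative only.
    import redis

    r = redis.Redis()

    move_member = r.register_script("""
    local value = redis.call('SPOP', KEYS[1])
    if value then
        redis.call('SADD', KEYS[2], value)
    end
    return value
    """)

    # Atomically move one random member from 'pending' to 'processing'.
    member = move_member(keys=["pending", "processing"])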
First GitHub, then OpenAI. Two of Microsoft's finest(!) companies (one acquired, one majority-owned) at the top of HN announcing serious security incidents.
It's quite unsettling to see this leak of highly sensitive information and a private key exposure as well. Doesn't look good and seems like they don't take security seriously.
In the case of OpenAI, the product is more of a research demo that had to be drastically scaled up, though. From an operations point of view it’s more like a startup.
Yet another case study of absolutism, which can be simply dismissed.
People paying for ChatGPT do care when it goes down and when their details and chats get leaked, and that's certainly true outside of HN as well. Same with GitHub. The two have ~100M users between them.
I'm paying for ChatGPT and I don't care about this any more than about the many, many other services I use that have at some point had an embarrassing security issue.
"Yeah". Two people, out of the majority of commenters, not caring that it was unavailable and that their chat history leaked, as well as GitHub going down again. [0] Surely that alone means "nobody" cares. /s
This is the whole problem with you being an absolutist. It seems to have aged like milk.
I wonder how much time passed between the first cases of corruption leading to exceptions (which they ignored as "eh, not great, not terrible, we'll look at it later") and users reporting seeing other users' data?
For some reason I liked reading about Rust (or any other technology) a lot more than about AI.
Part of it is that the average engineer could understand and grok what those articles were talking about, and I could appreciate, relate to, and, if applicable, criticize them.
The AI news just seems to swing between hype and doomsday prophecies, and little discussion about the technical aspects of it.
Obviously OpenAI choosing to keep it closed source makes any in-depth discussion close to impossible, but also some of this is so beyond the capabilities of an average engineer with a laptop. It can be frustrating.
This reminds me of a comment I made 1.5 months ago [0]:
I was logging in during heavy load, and after typing the question I started getting responses to questions which I didn't ask.
gdb answered on that comment "these are not actually messages from other users, but instead the model generating something ~random due to hitting a bug on our backend where, rather than submitting your question, we submitted an empty query to the model."
I wonder if it was the same redis-py issue back then, but just at another point in the backend. His answer didn't really convince me back then.
Unbelievable. How many times have you seen AWS explain an outage (or a PII leak like this) as an open-source library bug? Have they asked the 5 whys?
Why was the bug deployed to production?
Why was there not enough testing before deployment?
Why was integration testing only done with low concurrency?
Why is a standard release-testing procedure missing?
Why is there no synthetic-traffic testing, with gates, before rolling out to 100% in production?
The postmortem doesn’t say that. It just says they were caching “user information”. Maybe that includes a Stripe customer or subscription ID that they look up before sending an email, for example.
Yeah, probably the session ID; when the wrong session ID is returned, other operations like fetching the user's details would pull that user's data from relational storage.
It's interesting (read: wrong) for an AI company to bother writing the user interface for their web application.
This was a failure of integration testing and defensive design, whether the component was open-source or not. There's no reason to believe that an AI company would have the diligence and experience to do the grunt work of hardening a site.
But management obviously understood the level and character of interest. Actual users include probably 10,000 curiosity seekers for every actual AI researcher, with 1,000 of those being commercial prospects -- people who might buy their service.
This is a clear sign that the managers who've made technical breakthroughs in AI are not capable even of deploying the service at scale, let alone managing the societal consequences of AI.
The difficulty with the board getting adults in the room is that leaders today give the appearance of humility and cooperation, with transparent disclosures and the incorporation of influencers into advisory committees. The leaders may believe in their own abilities because their underlings don't challenge them. So there's no obvious domineering friction, but the risk is still there, because of the inability to manage.
Delegation is the key to scaling, code and organizations. "Know thyself" is about knowing your limits, and having the humility to get help instead of basking in the puffery of being in control.
This isn't a PR problem. It's the Achilles' heel of capitalism, and the capitalists on OpenAI's board should nip this incipient Musk in the bud or risk losing 2-3 orders of magnitude of return on their investment.