This is also a security vulnerability because MySQL/MariaDB silently truncate strings at the first invalid character. This can result in data manipulation attacks where a higher level layer validates the complete input as UTF-8 for insertion into the DB, but the database only stores half the string.
Silent truncation only occurs if strict sql_mode is disabled. Strict sql_mode has been the default since MySQL 5.7, released over 6 years ago. All prior versions have hit EOL support, so there is literally no supported version of MySQL which has this behavior out-of-the-box.
That's good to know, however I was encountering this MySQL truncation just last year in COTS software, so clearly there's a long tail. As a related example, log4j 1.x has been deprecated for years, but 100% of atlassian uses it with their own maintained version.
In fact, when the log4j thing came up, we audited all our existing COTS/internal and virtually all of it was still on the "EOL" log4j 1.x.
Since you're familiar with this. What happens with an existing upgrade - is strict mode enabled by a typical distro package upgrade or does it only impact new installs?
Wouldn't it be a nice world where you can just declare something EOL and it disappears and somehow magically it's not a problem anymore for anyone end-to-end? Thing's just don't work that way. Like, suddenly, a product that was working perfectly fine for who knows how long suddenly needs to modified because some ass-hats thinks that EOL isn't suitable for production.
There are a number of reasons to upgrade a dependency or a product; an EOL label shouldn't be a part of the calculus. EOL may imply certain conditions that may influence the decision to upgrade but for the sake of EOL you can eat a bag of dicks. Think harder why you're proposing an upgrade is needed and how it'll add value to the business. EOL software is perfectly fine especially because it's been battle tested.
We also use log4j 1.x all over the place and I feel zero need to change that.
It is perfectly reasonable to continue to use EOL software, but you have to assume maintenance costs in that case. It would be irresponsible to continue to use EOL software as is. EOL is a perfectly natural concept. Nothing in this world lasts forever, including our machines, electronics, cars, batteries, lives, and yes, software.
For the end user, not the maintainers. Who's gonna sign up to keep that ENIAC, PDP-11, ALGOL 58, python2.3, win3.1, etc, code running, stable, and not a source of exploits?
Or do you mean in the abstract, fairy-tale ending sense?
Ah makes sense. I should've looked it up and double checked before commenting, my bad.
Honestly, in that case I tend to agree with GP. As long as you recognize EOL means you are in charge of supporting it and possibly backporting security fixes and you're okay with that - why not use it?
I agree EOL is too often interpreted to mean 'Insecure' - in the case of closed solutions this is absolutely the case (presumably, it is often difficult to audit closed and obfuscated closed source solutions).
In many cases it is not, it is purely a 'ticking time bomb' - if new CVEs or requirements arise which cause issues in EOL versions, you may not even become aware of the vuln.
In counterbalance, new versions of code have new, unknown CVEs, new bugs, etc.
The issue is that of support, the fallacy and process issue is that 'current' support is of greater quality than 'historical' support, or that old software has a disproportionately higher support costs than 'current' software.
The true issue is overdependence on FOSS software and unwillingess to contribute - all software has costs, major vendors are 'coasting' on work provided by free volunteers and panic when those volunteers decide to no longer work for free.
If vendors were doing the legwork necessary to properly inhouse their dependencies, I would wager that much less software would end up discarded in 'fear', for a lack of a better word than this - modern development is supposedly built on features and patches, any codebase should be maintainable for a much longer period than its supposed EOL.
For sure, there's a huge long tail. However, if a company is using 6+ year old EOL software that has flaws, that's outside of the vendor's control.
> What happens with an existing upgrade
sql_mode is a setting, so it depends on whether the default is being overridden: e.g. does the my.cnf config file specify a value explicitly? Does your application's connection logic set a session-level override for each connection? DBaaS vendors add an additional wrinkle since they tend to have custom UIs and APIs for global settings.
> typical distro package upgrade
Many Linux distros made it so that attempting to install "mysql" actually installs MariaDB instead, so when going this route, it's unfortunately always very hard to predict what you'll get or what will happen!
> Many Linux distros made it so that attempting to install "mysql" actually installs MariaDB instead, so when going this route, it's unfortunately always very hard to predict what you'll get or what will happen!
MariaDB is indeed pretty pleasant to use and for most use cases can be a drop in replacement for MySQL, should you not want to use MySQL, but also wouldn't want/need/be able to use PostgreSQL or something else: https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/
I might be in the minority here, but i actually rather enjoy MySQL/MariaDB, since MySQL Workbench is still the best DBA software that i've used, especially because of the ER modelling and schema sync functionality, and their approach to schemas/users within a DB feels more reasonable than that of Oracle/PostgreSQL.
That said, both PL/SQL and PL/pgSQL feel more powerful than whatever MySQL/MariaDB has. Also, in regards to MariaDB in particular, their versions have diverged from MySQL somewhat, so you might still want to look into the differences, if migrating over: https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/
As for predictability - nowadays i just run specific versions in containers (e.g. 10.7.1 or maybe 10.7 which corresponds to the patest PATCH release), to sidestep the distro differences altogether. It also lets me easily launch new instances, give resource limits to them, have multiple versions in parallel and mount the data directories wherever i need them, for easier backups etc.
Same. If I'm building and managing any special instances/VMs these days I avoid installing any packages but those required to run containers. There's certainly tradeoffs but it makes software management so much easier. Upgrading is a matter of stopping and starting a container using an upgraded image and if it's supported rolling back is just a matter of stopping the newer container and restarting the old. A container hosting service/platform and it's abstractions are better but if you have to run some one offs outside of that it's so much nicer to use containers. I call these pet cattle.
I still bind mount directories though. Docker volumes seem really convenient but a few years back I dealt with helping out a coworker with some data loss (pre-volumes) in a container's filesystem so I have some trust issues with how they implement storage.
> I still bind mount directories though. Docker volumes seem really convenient but a few years back I dealt with helping out a coworker with some data loss (pre-volumes) in a container's filesystem so I have some trust issues with how they implement storage.
This mirrors my experience and despite how Docker seems to suggest that non-abstracted storage is a liability and we should use volumes instead in many cases, i disagree: https://docs.docker.com/storage/bind-mounts/ In my eyes, the current implementation for volumes is for data that i don't really care much about, but would still consider extracting/moving later and have it be persistent, like when trying out new software.
Personally, bind mounts also make it extremely easy to operate with whole installs of software - for example, should i want to upgrade something like Nextcloud (and have enough space), i can just stop the containers, make a copy at /docker/nextcloud_stack/data/nextcloud_mysql_old and run a new container version against /docker/nextcloud_stack/data/nextcloud_mysql and see if it all works.
If the update has corrupted everything, i can just delete the broken folder, rename my backup and proceed with restoring the older version. If i want to mess around with the broken version, i can do it with 0 worries about what else could break. It's amazing, way better than relying on the POSIX approach to a file system, where your software, its configuration and its persistent data is all over the place.
I once had to manage a Tomcat install which had the webapps, config, logs directories in their idiomatic locations rather than one folder, which was pretty horrible from a management perspective. I had to constantly jump around and wasn't entirely sure how it all fits together (probably lots of symlinks).
It's curious that some/many Kubernetes distros lock down the local storage driver, though - there's a preset data directory, where all of your mounted data ends up, which feels like a step backwards from a customization perspective. I had to change the data directory for the entire Kubernetes install to be able to store my data in a painfully obvious /app or /data mount point. Then again, their ability to limit the sizes of these persistent volumes is pretty cool.
If you don't have an explicit value set in your my.cnf config and the client sends no explicit value upon connecting you get the current built in default of the version you are using. If you set something somewhere it will be respected.
Sure, but that's not MySQL's fault. MySQL/Oracle has no control over what ill-conceived setting overrides are used by DBaaS providers or shared hosting providers.
There are similarly dumb misconfigurations that can theoretically be made in most hosted open source software.
Implementing silent string truncation in the first place (by default(!)) is an excellent indication of how fast and loose and out of touch the original MySQL developers were with the database world. It is the worst database "feature" I have ever seen.
It made more sense at the time. Earlier days of the web, dynamic scripting languages used everywhere, etc. No emoji, barely any use of 4-byte characters at all. Large corps weren't using open source DBs, or sometimes open source anything. Simpler times!
And for sake of comparison -- if I understand correctly, prior to just 2 months ago, sqlite didn't even have the option for strict type enforcement [1]. Do you judge sqlite as harshly, or think its creator is out of touch with the database world?
Relational databases have been used in production since the late 1970s. You would be hard pressed to find any that silently corrupted user data like that, ever, until MySQL came along.
Sqlite is more of an embedded database, but if it silently truncated character data under some conditions that would make it equally unusable. Better to have no type enforcement than defective type enforcement.
I think you misunderstood my point. Yes, relational databases have been around a long time. But in the early days of MySQL, most of the companies using MySQL differed quite a bit from the companies typically using DB2, Sybase, Oracle, etc. I've been using MySQL professionally since 2003 and am speaking from first-hand experience here: the web development world was largely separate from the traditional enterprise world at that time. It was considered a different use case, it's not like today where the biggest companies in the world are tech companies built on open source software.
Regarding "silent", MySQL at least emits a warning when this happens. Still not great / obviously wrong in a modern context, but just for clarity's sake it wasn't completely silent.
As for "defective type enforcement", that's a subjective opinion. In C, atoi("123hello456") returns 123 -- once you hit an invalid character, everything from that point is truncated. Is C atoi defective? Perhaps, but it's an opinion.
With strict mode disabled, four-byte characters worked the same way in MySQL utf8mb3: four-byte characters are invalid in this data type, and everything gets truncated at the first invalid character. Would simply stripping the invalid characters, but keeping valid chars after them, necessarily be more "correct"? No, it's an implementation decision, especially considering that emoji did not exist at the time and barely any non-BMP (4-byte) utf8 characters were used anywhere.
It is certainly true that most users of relational databases before MySQL used them for applications where throwing away data is a potential catastrophe, and most uses of MySQL were for toy applications where if that happened no big deal. It was about a decade before MySQL even supported transactions.
atoi is deficient by modern standards, but at least it clearly documents what the behavior is and always has. It doesn't truncate anything. For a database, truncating data is unforgivable. That is what errors are for, so your database doesn't act as a black hole where your data may silently disappear forever.
And yes, if you can't return an error for some reason, you should preserve as much data as possible. Throwing away hundreds or thousands of characters after an initial anomaly is insane. But a database should never do that anyway, it should either store what you give it, or return an error. Same with a filesystem or any other trustworthy means of data storage. Sorry we were too lazy to tell you we threw your data away isn't acceptable, nor is relying on anyone to check for warnings that should be hard errors.
I would argue yes, that atoi is defective. Stripping characters is equally defective.
A QA walks into a bar and orders "123hello456" beers. The correct answer is not for the barkeep to give them 123 beers, its to ask "I beg your pardon?" So the truncation in general is the problem. If the operation fails - too many bytes, wrong encoding, string too long - the operation should fail. It's not the db's job, that is the business logic's job.
All competent software tokenizes a input string before calling atoi on an extracted piece of it. Otherwise you can't even tell the difference between "0" and an empty string.
On error resume next was the idea about error resilience back then. MSSQL has a similar feature: when you run a script and an error happens it just keeps running statements. If you want to interrupt it, you should wrap the whole thing in the try/catch statement and also roolback a transaction if any or it will commit a partial result.
Well, the things it would have been nice for MySQL to have done would have been not implementing it this way in the first place (yes, that's probably not Oracle's fault), and moving utf8 to utf8mb4 a bit more quickly. (good to know that's finally happening).
Also, I gotta say, now I have to look into this strict mode and what else it breaks, when in this particular case a "don't silently truncate strings" flag would be preferable and probably lower impact, since "strict" vs "security flaw mitigation" is not quite the same thing (does such a more granular flag exist?).
... unless that's all the strict mode does.
Hm, and since you brought this up elsewhere. Is this "strict by default" in MariaDB - which is what most (linux) people are (probably) thinking of.
> now I have to look into this strict mode and what else it breaks
It's a mode to ensure nothing happens silently: illegal data type conversions, attempting to insert invalid values for the data type (too large, too long, invalid characters, etc), attempting to do an INSERT which doesn't specify some columns that lack defaults, etc. With strict mode enabled, all of these things throw hard errors, like in most other relational databases. Generally, it covers the entire class of things that are typically the most common complaints about MySQL on HN.
Interesting wording re: "breaks" -- for practical purposes this is a real concern for existing applications, which may be doing bad things which previously didn't throw errors with strict mode off. Do you want your DB to suddenly fail loudly for these things (strict mode), or do you want your existing application to continue working in these cases? Can't have it both ways.
> Is this "strict by default" in MariaDB - which is what most (linux) people are (probably) thinking of.
MariaDB made it default in 10.2, ~5 years ago.
I don't agree that most Linux people conflate MySQL and MariaDB though. At large US tech companies that use MySQL, they're almost all literally using Oracle MySQL, or a patch-set like Percona Server which directly tracks MySQL, or an internal patch-set that does the same.
In recent years, MySQL and MariaDB have diverged a bit on feature set and focus; each has some interesting features that the other lacks. They are not drop-in replacements for each other.
So... while I'm broadly in agreement with you, and a huge fan of strict typing, I will note people use languages like Javascript and usually are ok with things like string to integer coercion.
It seems to be that there's a world of difference between "silently truncate string" and "type coercion" even if I readily agree both are sources of security vulnerabilities.
Thanks for the replies though, even though I clearly annoyed at least 2 people due to the downvotes.
I enabled strict sql mode in MariaDB for an website just last year, after noticing the database was silently corrupting data. What made the transition much less painful was noticing that when strict sql mode is disabled, MariaDB produces a warning for the queries that corrupted the data, but you must run SHOW WARNINGS; to see the warnings after running such a query. Knowing that, I made the website run SHOW WARNINGS; after every database query and logging if a warning occurred. This allowed me to deal with the most of the warnings before they were promoted to errors by enabling strict sql mode.
Pretty sure the name "utf8" isn't metaphysically wired to a defective physical representation. All they have to do is wait until the next minor release, add a compatibility name, make the generic name refer to a non-defective representation, and update the documentation and release notes accordingly.
If they were really conservative, they could wait until the next major release to repoint the name like that. Haven't there been four major releases since 2003?
This would be problematic for existing applications. Even if you leave existing binary data as-is (in existing tables utf8 becomes utf8mb3), changing the meaning of "utf8" in only new tables is still very problematic. Think about logical dumps, e.g. CREATE TABLE statements which use the "utf8" alias. For example, any self-host application which runs a .sql file in its install process. Or spinning up a dev environment for a new hire. Basically, anything that contains a CREATE TABLE with the utf8 alias now has a landmine:
In older versions of MySQL, InnoDB indexes were limited to 767 bytes. Meanwhile, VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case; but with utf8mb4 ("proper" utf8), it's now 4N. This meant that VARCHAR(255) CHARACTER SET utf8mb4 could not be indexed (in its entirety) in these older versions; the CREATE TABLE would fail if you try.
When you have a massive installed base of users, and those users have lots of dumps / install scripts containing "VARCHAR(255) CHARACTER SET utf8", you simply cannot change the meaning of "utf8" in a minor release.
I fully agree that MySQL could have made this transition faster/earlier than they did, but I disagree that it could be done trivially or terribly quickly.
Advertising the ability to store UTF-8 characters when you can do no such thing is a serious problem. Obviously you have to do whatever it takes to fix it, including fixing a deficient indexing implementation if necessary.
I believe the limitation you mention was fixed in MySQL 5.7, which would have been a reasonable opportunity to make "utf8" live up to its name.
Keep in mind that prior to 2010, MySQL and InnoDB were owned by different companies. Unilaterally deciding to increase the InnoDB index size limitation to permit utf8mb4 varchar(255) was not physically possible for the MySQL team prior to that point.
Anyway, I fully agree that the utf8 alias swap should have been done in MySQL 5.7, or perhaps even 5.6. I'm just disagreeing with your assertions that this would have been a trivial change.
Nor do I think it should have been a "stop the world" high-priority change prior to emoji existing in unicode (~late 2010). Non-BMP characters were very rare before that point. Hard to justify a business need for a complex migration just to store archaic hieroglyphs. For another point of comparison, MS SQL Server only added support for non-BMP characters in 2012, and only added support for utf8 encoding in 2019!
It's less a question of fault and more a question of whether developers need to be aware and look out for those gotchas. A security problem doesn't stop being a problem because your hosting provider did something vs. MySQL doing it by default.
I do find this behaviour rather incomprehensible actually. I understand that if the "utf8" is BMP it can't have invalid characters in it. What puzzles me is that the approach they went for seems to be almost the worst possible one from a security and data integrity perspective.
Assuming a SQL error can't be thrown for some legacy reason (and you'd think a flag to opt into that would be an option), why not replace the invalid parse with U+FFFD like most things do? Or at a bare minimum, only strip the invalid character which at least makes attacks a bit harder and results in a little less data loss.
Just wanted to reply to myself that I just learned from evanelias' comment a "strict mode" flag is indeed an option. I suppose it's time to audit all my mysqls to see which ones have this enabled, and what other mysql-isms this flag might break.
I know it's definitely not enabled on one updated and maintained BSD - this could be because they didn't want to break existing systems during upgrade.
I suspect this problem is still extremely common. But good to know there's some form of defense.
Another fun one that came up at my first job: If you run up against the size limit of a text column in MySQL, it silently truncates.
This came up when a developer used YAML for some configuration file. There was all sorts of confounding behaviour because YAML is still valid when truncated.
Every database has limits, but at least alert the user! I’d rather a query fail than alter data.
Edit:
This is something that might be useful in CS classes, to reflect what being a real-world engineer can often entail. Set an assignment with some silly nonsense like this and students have to debug it.
There's a setting for this, but like many settings MySQL has (along with InnoDB, the storage engine) historically chosen exceptionally bad defaults. It's possible to tune the thing to be screaming fast and pretty robust, but it takes a lot of expertise and TBH these days most people would be much better off going either NoSQL or Postgres instead.
iirc it was to allow customization of an e-commerce item. Basically a column as a document (in the sense of a document database) with custom attributes that could be…large.
It wasn’t my implementation. This was another team
It's really weird that they've chosen YAML for it, since its sole (questionable) benefit over the alternatives is that it's geared more towards being edited directly by humans... which is generally not something that you need to do with a database field. JSON is more usual for scenarios like this, not the least because most databases actually provide tools to parse and query it.
It's pretty weird to put any configuration-languages in a relational database in general. Large variable-length objects is not what relational databases are good at. You're mostly getting the drawbacks, and very few of the benefits in that use case. Even a filesystem is probably more suitable.
Drupal seems keep a large amount of PHP serialisation of configuration in its database (a JSONish thing), probably because it's the standard data store for the site. Noticed this in fixing a dump one time. Got to use vim's function syntax for regex to ensure the character length count of the string matched the value of the new string.
I think virtually everything generically configurable in the UI that isn't part of a small core with dedicated fields (or in the sitewide PHP config include) is handled that way.
Most databases have some kind of special support for XML and/or JSON these days. And sometimes you just need to associate structured hierarchical data with each record without knowing the precise schema in advance. If it's non-hierarchical, you can always use a separate key/value table; but tree queries are not exactly convenient (or fast) in SQL.
Maybe also do a checksum or something, seriously it could just be CRC32. I get that it's not something you instinctively think to do. For configurations that are most likely parsed directly by something else though, may check that nothing broke in transit.
What percentage of CS graduates go onto become computer scientists? Seems like vast majority of CS degree holders aren't practicing "pure" computer science, as you would seemingly hope.
Probably a very low amount. When I was studying I participated as a student representative at the CS programme I was attending and this was something that was actively being addressed.
The problem was multi-pronged. The courses had become increasingly geared toward industry, and students that had an interest weren't really fit for continuing their studies.
In the end, the longer 5 year program (MSE) was allowed to remain more vocational while those seeking a bachelor's were brought along a much more theoretical path so they could then do a master's followed by a PhD.
The day I realized that is the day I stopped using MySQL. It was a very long time ago, I was burned by this exact issue. I don't know what's more fundamental to a database than storing the data you ask it to. If it's storing something else, then it failed at its job spectacularly. At the time I experienced this bug I think it didn't even give a warning, but I think that's actually worse with one : it knows it's storing incorrect data but does so anyways.
And it still has not been fixed after all those years. I don't know what use is a character encoding that partially supports Unicode (you just have to try to find out in what way) and is used by nobody else. I don't really care that at some point in the future utf8 is going to become an alias to utf8mb4, a DB capable of this behaviour cannot be trusted, and it allowed me to discover postgres which seems superior in just about every way, so I don't need to go back.
Any data in your database that can come from external input should be treated as untrusted and validated before it's used. Otherwise validation bugs or bypasses will result in bad data and exploits that persist beyond the fix.
Edit: I’m not arguing against the need to use utf8mb4
While that’s true, I trust PostgreSQL to store exactly what I’ve asked it to store. At some point, you have to trust something to do its job, or else everything built on top of it is a castle of sand.
Imagine a bug like this in ext4. No one would reasonably contend that the layers on top of it should be validating that the files you write out are the ones you’ll read back in. We write unit tests for all kinds of stuff, but we’re not that thorough.
Agreed. The confusing part here as I see it is where validation layer A (correctly) asserts the data is valid UTF-8 and safe then assumes the database persists what it passes to it, since no error is reported.
Then, subsystem B trusts reading the database field (since it passed validation layer A).
Obviously more validation layers can be added, but at this point validation layer C called by subsystem B needs to know what the initial input from layer A is in order to differentiate it from the db value which was manipulated - a rather tricky thing to do sometimes. (I guess you could add a hash to the db to check the db is storing your strings, but really.. come on)
Upgrading to utf8mb4 is probably safer than hoping enough validation layers thrown at it solves the problem.
Yes. I’ve been burned by this exact issue in the past, not realizing that utf8 was not really utf8 in mysql-land. That is a major, major WTF IMO.
But I’ve also seen people relying on validation at the time of insertion so many times that I wanted to warn against that, too. Not in argument against the need for utf8mb4.
Sure, but the issue here is that the default mysql "utf8" encoding is not actually utf8. You can write as many validation layers as you want, but if they are assuming that utf8 actually means utf8, they won't help, and mysql will potentially screw it up when it gets stored.
https://bugzilla.mozilla.org/show_bug.cgi?id=1253201