Hacker News
ZA̡͊͠͝LGΌ causes "Invalid MD5 checksum on messages" (github.com/aws)
178 points by paulddraper on May 8, 2021 | hide | past | favorite | 68 comments


For anyone who doesn't get the reference: https://stackoverflow.com/questions/1732348/regex-match-open...


Worth mentioning that Zalgo predates the StackOverflow post (a̷nd̡ t̕h͟i̡s u̕niver͠s͠e̵).

https://en.wikipedia.org/wiki/Zalgo_text


Which is a fun reference, but this isn't exactly parsing, and I think using one regex to process escapes would actually fix the problem.

Edit: The function in the fix does in fact appear to be a straightforward implementation based around a regex. https://github.com/mathiasbynens/he/blob/master/src/he.js


I think the Zalgo answer is a wonderful example of people thinking they are smarter than they really are.

Spot the error:

1) HTML is not a regular language

2) Regular expressions can only parse regular languages

3) Regular expressions cannot be used to parse arbitrary HTML documents

4) If you try to use regular expressions to parse HTML documents, you are doing it wrong

The error is between steps 3 and 4: it's true that you can't write a regex that parses arbitrary HTML.

However, very often we do not have to deal with arbitrary HTML. Very often we have to deal with documents that only use a subset of HTML and they can be parsed by regular expressions just fine.
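For instance (a minimal sketch; the report format and function name here are hypothetical, not from the thread): if a machine-generated page always writes its links exactly as `<a href="...">`, a one-line regex extracts them reliably, because that fixed subset is effectively a regular language.

```javascript
// Sketch of the "known subset" case: a machine-generated report where
// every link is written exactly as <a href="...">. For that one fixed
// emitter a regex is fine; this is not a general HTML parser.
const hrefs = (html) =>
  [...html.matchAll(/<a href="([^"]*)">/g)].map((m) => m[1]);

hrefs('<a href="/a">A</a> <a href="/b">B</a>'); // ["/a", "/b"]
```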


> Very often we have to deal with documents that only use a subset of HTML and they can be parsed by regular expressions just fine.

I agree, assuming the emitter of the document to be processed is a) correct (as in bug free, and as in knows the expected HTML subset and the corner cases of its ad hoc parsing method, and will never make a mistake nor assume the other end has an actual HTML parser) and b) has no ill intent.

In practice, it is neither.

> people thinking they are smarter than they really are

I genuinely think it's the quippy expression of someone who has been burned so badly by the practical side of this that they would rather frantically laugh at themselves as much as at the issue, out of despair at people trying to be smart with regexes. They chose not to be smart at all, and just use an HTML parser to parse HTML documents.


One of the situations I've seen is roughly: system up, it returns XML or JSON or pretty-format-of-the-week; system down, it returns an HTML IIS error page.

In the second case, all you might want is to extract the content of the first <h1> tag out of that error page. That's a predictable enough task that a regex might be able to handle it, especially if at that point you've already given up on full success and you're just salvaging a prettier error message than "system error".
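That salvage step can be sketched in a few lines (the function name is illustrative, not from any SDK):

```javascript
// If the body is an IIS error page, pull the text of the first
// <h1>; otherwise fall back to a generic message.
function salvageErrorTitle(body) {
  // Non-greedy match of the first <h1>...</h1> pair; tolerates
  // attributes on the opening tag. Good enough for one known
  // emitter, not for arbitrary HTML.
  const m = /<h1[^>]*>([\s\S]*?)<\/h1>/i.exec(body);
  return m ? m[1].trim() : "system error";
}
```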


Exactly. In cases like this using an HTML/XML parser instead of a regular expression won't make a difference at all.


I've always taken it as a response to people trying to force regex solutions for all text parsing problems beyond string.split(), perhaps making the unfair assumption that the asker wasn't already aware of XML parsing libraries and just wanted to try it with regex...


> However, very often we do not have to deal with arbitrary HTML.

That really depends on who "we" are and what you mean by "very often".

I used to develop web crawlers, HTML parsers, document analysis infrastructure and various other things that come into contact with "content" for web crawlers at various search engine companies. If you assume people can produce valid, or even half way sane HTML, you'll be disappointed. As for how you parse insane HTML: with difficulty.


There might be a few narrow cases where using regex to parse XML/HTML makes sense, but even in the majority of cases where it might conceivably work, it doesn't make sense: XML/HTML parsing libraries exist, and the time and effort you spend building, debugging, and maintaining your brittle, limited, ad hoc parser is wasted effort that increases both the time to deliver and the maintenance cost of the project it is part of.


The question is not about parsing at all. The question is about tokenization of XHTML. You can tokenize XHTML with a regular expression.
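As a minimal sketch of that claim (the tokenizer here is illustrative and ignores comments, CDATA, and DOCTYPE): splitting markup into tag and text tokens requires no counting of nesting depth, so a single regex suffices.

```javascript
// Split XHTML into tag tokens and text tokens. Matching nested
// structure is non-regular, but this flat tokenization is not.
function tokenize(xhtml) {
  // Each match is either a tag (starts with "<") or a run of text.
  return xhtml.match(/<[^>]+>|[^<]+/g) || [];
}

tokenize("<p>hi <b>there</b></p>");
// → ["<p>", "hi ", "<b>", "there", "</b>", "</p>"]
```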


all the "string theory" means is that you cannot parse html completely or check its validity with a SINGLE regex based on a standard nfa or dfa.

the reason why is that standard nfas and dfas cannot implement arbitrary counters for the purposes of keeping track of recursion depth or "how many open parens are currently open right now in the parsing?"
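A tiny illustration of that counter (a hypothetical helper, not from the thread): one explicit depth variable is all it takes to recognize balanced parentheses, a job no plain NFA/DFA-backed regex can do in general.

```javascript
// Single pass with an explicit depth counter: exactly the kind of
// state a pure finite automaton cannot keep unbounded track of.
function isBalanced(s) {
  let depth = 0;
  for (const ch of s) {
    if (ch === "(") depth++;
    else if (ch === ")" && --depth < 0) return false; // closed too early
  }
  return depth === 0; // everything opened was also closed
}
```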

there are a few subtle points here:

1) you CAN parse and validate non-regular languages by using regular expressions for subsets of them that ARE regular and then validating the whole. this is exactly what most parser generators generate.

2) you CAN use regular expressions to manipulate text in regular languages if you are careful. however, you should be extremely cautious, as non-regular parts of languages are often those that are used to support escaping or quoting or recursive sections of languages and subtle bugs in these sorts of things can sometimes have unintended consequences in terms of security.

3) you CAN use regular expressions for simple search and replace but see #2. if the language allows escaping or quoting, your regexes will not respect it on their own. this may have the unintended consequence of things you expect to have escaped not be escaped or vice versa. depending on other assertions in your project, this may or may not matter. in the worst case, it could result in a hard to track down or security critical bug.

4) regular expressions are implemented by simulating nfas and dfas. sometimes when implementing these simulations, programmers take advantage of the fact that they're programming and just slap counters in there giving their implementations the ability to support some non-regular languages.

as a rule of thumb, if i'm dealing with untrusted input, i'll use a proper parser.

so yes, you can use regular expressions, but you should understand their limitations if you do so. (which also means understanding your input language)


> 2) Regular expressions can only parse regular languages

This is false, regular expressions as popularly implemented can parse non-regular languages.

For example, /^(.*)\1$/ matches a non-regular language.
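A quick demonstration, runnable in any JS engine: the backreference makes this pattern accept exactly the "copy language" of strings of the form ww, a textbook non-regular language.

```javascript
// \1 must repeat whatever group 1 captured, so the pattern matches
// a string followed by an exact copy of itself.
const doubled = /^(.*)\1$/;

doubled.test("abcabc"); // true: "abc" + "abc"
doubled.test("abcab");  // false: no prefix repeats to cover the whole string
```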


Strictly (pedantically) speaking, using a backreference turns it from a "regular expression" into a "regex", regexes being a superset of regular expressions that are no longer strictly limited to being regular.

Isn't terminology great?


It's true, not false.

You're referring to a misleading overloading of the term: "regular expressions with backreference memories" require stack machines to realize them, which expands the language class they correspond to.

Citation: https://www.amazon.com/Introduction-Automata-Languages-Compu...


Context matters, in the context of the original post that was not how they used the term.


So, so glad to see someone cite the regular-language issue amid that naivete.


There exist regexps that match invalid HTML, e.g. /.*/


> a straightforward implementation based around a regex

A "straightforward regex" is so large/complicated the solution actually uses a JS code generator.

Yes, decoding is only part of XML parsing, and you technically can use a regex for the decoding.

But 9 times out of 10, when someone rolls a regex to help parse XML, they fail miserably. Including AWS engineers.


It uses a code generator but that seems to be concatenating some constants together to allow for prettier organization and reuse between encoding and decoding. The core of what it has to match against to handle escapes isn't very complicated.



I somewhat disagree that the famous answer is an instance of cargo-culting. Even setting aside the entertainment value, it's probably more valuable than literally answering the question posed. Let's say you now have a method that extracts tags in the manner prescribed... now what? I'd bet the OP was only trying to write that method as a subprocedure for some larger task much more efficiently dealt with in another way. In fact, if you look at that poster's question history around that time, it's apparent that they were trying to render potentially-malformed XHTML.


Regardless of whether it's cargo-culting or not, the rebuttal explains why the famous answer is utterly wrong: the problem can be solved by regular expressions, and it cannot be solved using an XML parser. While the answer is funny and probably good general advice, it is incorrect.


You can't really argue with a mathematical theorem.


But you can argue against someone misunderstanding what a theorem says and applying it in the wrong context.

The question is about tokenization, not parsing.


They aren’t asking to parse all valid xml. They wanted to parse a small known subset. Regex can be an excellent tool in that situation.


> They aren’t asking to parse all valid xml. They wanted to parse a small known subset. Regex can be an excellent tool in that situation.

Rarely better than using an existing general XML parser and then validating the additional constraints, especially in terms of implementation and maintenance cost.


The question is not about parsing, it is about tokenization.


> regex will consume all living tissue (except for HTML which it cannot, as previously prophesied)

This part gets me every time!


> The <center> cannot hold it is too late.

Never fails to make me chuckle


Nice reference to Yeats’ poem, “The Second Coming”. I guess the infamous Stack Overflow answer is a modern form of poetry.


Where I work, we have this post as part of our software developer on-boarding experience. It's required reading.


Have any of the developers noticed the fundamental fallacy in the answer? The question is not about parsing, it is about recognizing a token in a string, which is exactly what regular expressions are designed for.


Oh you beat me to it.


The offending code for those interested:

   const decodeEscapedXML = (str: string) =>
      str
        .replace(/&amp;/g, "&")
        .replace(/&apos;/g, "'")
        .replace(/&quot;/g, '"')
        .replace(/&gt;/g, ">")
        .replace(/&lt;/g, "<");

Seems like it's in multiple places in the code base too, I think all those clients are automatically generated.

https://github.com/aws/aws-sdk-js-v3/search?p=1&q=const+deco...

So that's not even a good fix; I think he should have fixed

aws-sdk-js-v3/codegen/smithy-aws-typescript-codegen/src/main/resources/software/amazon/smithy/aws/typescript/codegen/decodeEscapedXML.ts


Very true. AWS doesn't pay me enough to spend time on a comprehensive fix, with tests.

Hopefully they will pay someone to do that.

At least I can read SQS messages now.


I've become a lot more sympathetic to this attitude lately.

I kind of enjoy working with Rust, and have written a few minor utilities to interact with AWS in it for work. I found and fixed a few bugs in the library for it, Rusoto. I observed along the way that the project is sorely in need of people to spend time on management and maintenance. I could do that, but...

I can't ignore that AWS makes some absurdly huge amount of money for Amazon. I don't begrudge them that, but I get paid pretty well at my day job already. Why do more of that work for the benefit of AWS for no money? They ought to pay me for it.

Hell, they have more than enough money to hire a team of professional experts in every language under the sun to maintain their AWS libs. Especially considering that the net effect of higher-quality AWS libs in more languages will result in more money being spent on AWS services and more lock-in. There's no excuse for having such terrible code in a mainstream language like Javascript.


By the way, in case you missed it yesterday: https://news.ycombinator.com/item?id=27080859


Yeah, but you've submitted a PR. It's not a fix, it's broken code. As soon as they run their code generator, it'll disappear.


Well it fixed their problem enough, and filing a PR was the level of effort they're willing to put in. It might help someone fix it properly, and it might help someone fix it enough to get moving, so it's potentially good for the origin and good for the users and if not, the origin can close the PR.


I'm getting very confused by this attitude.

It's broken code. It's bad code.

Never, ever submit bad code. You're just making even more work for everyone.

It's far worse than doing nothing at all. And if you want to highlight the problem, just raise an issue and point to the problem.


That's dreadful code.


Yeah, and unpacking &amp; first is asking for trouble with, say, "&amp;amp;lt;" in your input.
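A minimal sketch of the hazard and of a single-pass alternative (the entity table and function name here are illustrative, not the SDK's): chained replaces that handle &amp; first can manufacture new entities out of already-decoded text, whereas one regex pass with a callback decodes each entity exactly once.

```javascript
const XML_ENTITIES = { amp: "&", apos: "'", quot: '"', lt: "<", gt: ">" };

// One pass over the input: each "&name;" is decoded exactly once,
// so decoded output can never be picked up by a later replacement.
const decodeXmlEntities = (str) =>
  str.replace(/&(amp|apos|quot|lt|gt);/g, (_, name) => XML_ENTITIES[name]);

// The chained-.replace version in the snippet above turns "&amp;lt;"
// into "<" (decoded twice); a single pass yields the correct result.
decodeXmlEntities("&amp;lt;"); // "&lt;"
```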


My god. I wish I could find a linter to ban startsWith and regex from my codebases. I swear 99% of their uses are buggy hacky shit


You may find semgrep handy for that, although I don't offhand know how to express "any use of regex" in its language

https://github.com/returntocorp/semgrep


Can't tell you how often I have to use StartsWith/Contains in some interface when connecting two applications together, because of the missing ability to extract proper unique/primary keys for objects from any of the systems.

So please don't, I need these buggy hacks ;)


For those unaware of that folklore, the github link contains reference to one of the most famous StackOverflow answer of all times:

https://stackoverflow.com/questions/1732348/regex-match-open...


Thanks for sharing, I hadn't seen that. Sadly, I feel like it would get removed as an answer or edited to the SO standards if that was posted as an answer today.


Unquestionably.

SO's sense of humor died around 2014.


The node SDKs for AWS really shock me.

The `aws-sdk` v2 API is huge. It and its transitive deps are 64MB. In one TS project that I worked on, adding it added 1.5 seconds of transpilation time.

I thought v3 was supposed to be a modular improvement, but even just the `@aws-sdk/client-s3` library is 24MB.

Then I see bugs like `v3 has a dependency on react-native` and I really wonder how the process works internally to release these things.


The Java SDKs are bigger, and what's IMO worse: in Lambda the Python and JavaScript SDKs are automatically installed but the Java SDK is not automagically available on the classpath. Since Lambda is an environment where both artifact size and startup latency matters a lot, this kind of hobbles Java Lambdas.

(I work around this mostly by using native-image.)


As I found out in this bug, v3, despite being released last year, is unusable.

You can't even stream an object to S3. https://github.com/aws/aws-sdk-js-v3/issues/2348


I was pretty shocked by the bad quality of the Cognito JS library (which is now superseded by the Amplify JS library, afaik). The API design itself was really bad, and error handling was based on parsing the textual messages of the HTTP responses (or at least something along those lines). In general I really do like AWS, but it made me reluctant to suggest Cognito to any of our clients (as opposed to, say, Firebase Auth, which has a nice API and just works).


> I was pretty shocked by the bad quality of the Cognito JS library (which is now superseded by the Amplify JS library, afaik).

Subsumed within rather than superseded, AFAICT; in fact, I've had to use notionally outdated documentation (I think it was actually Stack Overflow answers, but...) for the former to fill gaps in that for the latter, so I'm relatively certain of the relationship.

AWS SDK APIs tend to mix poor design with poor documentation, often being extremely leaky abstractions on top of the HTTP APIs, which themselves aren’t masterpieces of either design or documentation.


The PHP SDK is so bad, https://github.com/async-aws/ exists and one of the stated goals is "One should be able to read the code and follow the logic". https://bref.sh/ which allows using PHP to write AWS Lambda functions easily uses async-aws, for example.


This problem is what led me to build out my own S3-compatible implementation. Sure, it doesn't have all the bells and whistles that the official library has, but it does what I need and is under 10kb in size.


"Thanks to (1), all my HL7 messages in my SQS FIFO queue won't process because carriage returns broke the client."

Did my time with HL7. One of those cases where a standard didn't serve any real purpose. It gave the appearance that it might work, but then bred a whole industry of HL7 <-> HL7 translators because nobody's HL7 is the same. If you need a translator/hub, everyone might as well have their own protocol.


Yeah.

I also work with X12 which is far more consistent (because if it isn't, you don't get paid).

But the format itself is -- believe it or not -- even worse.


(Possibly) interestingly, I pasted the link to this issue into a Slack, which very helpfully got one of GitHub's new preview links.

The image it produces can't render Zalgo either https://imgur.com/a/pOUu7m1


That pull request needs tests.


I love how it just blithely takes on another dependency too. This is all too common in JS dev unfortunately… never implement something yourself if you can take on another useless dependency instead.


Conversely, you don’t have to reinvent the wheel of parsing XML and all the nuances that come with that if you use a tried and tested library.

I see your point and agree with the general sentiment, but this is no leftpad.


Strictly speaking, "he@1.2.0" was already present, dragged in transitively by something else; that's why the yarn.lock folded the older version into the existing checksum.

I came very close to commenting on the PR, but since it was pointed out elsewhere that the fix was to the wrong file, I suspected it was going to be closed wontfix anyway


> never implement something yourself if you can take on another useless dependency instead.

So...your argument is that XML parsing/decoding is useless/too easy and every project should reimplement an XML parser/decoder?

If so, I have 0% trust in your judgement.


They're not doing arbitrary parsing, they just need the ability to un-escape XML characters for a particular case. I'd be shocked if (after all the abstraction) the `he` library didn't just have a single function which does that. I'd prefer just seeing how `he` does it and implementing it myself.

Moreover, they're already using a `fast-xml-parser` for doing XML parsing. Presumably it doesn't have an unescape function, so they're taking on a dependency on another XML parser (and keeping the old one!) just to get the one function.


he is for HTML(/XML) encoding.

It's kinda dumb that fast-xml-parser doesn't fully parse the XML and instead leaves content and attribute values in raw form.

The docs for fast-xml-parser show how to combine it and he.

Had they done that, they wouldn't have this bug that completely breaks the SQS client for me.


lol, not even a single test in the PR which supposedly fixes this.


Oh hey, another person making a pipeline for EHR records. I wonder how many of these have been remade by now.



