Hard to imagine the tradeoff of using a third-party binary library developed this year vs just using urllib.parse being worth it. Is this solving a real problem?
According to itself, it's solving the issue of parser differential vulnerabilities: urllib.parse is ad hoc and pretty crummy, and the headline function "urlparse" is literally the one you should not use under any circumstance: it follows RFC 1808 (maybe, anyway), which was obsoleted by RFC 2396 25 years ago.
The odds that any other parser uses the same broken semantics are basically nil.
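To make the RFC 1808 point concrete, here is a minimal sketch using only the stdlib. The URL is made up for illustration; the point is that urlparse peels a ";params" component off the last path segment, while urlsplit leaves the path alone.

```python
from urllib.parse import urlparse, urlsplit

# Made-up URL with semicolons in two path segments.
url = "http://example.com/a;b/c;d?q=1"

parsed = urlparse(url)  # RFC 1808-style: splits ";params" off the last segment
split = urlsplit(url)   # keeps the path intact, has no params field

print(repr(parsed.path), repr(parsed.params))
print(repr(split.path))
```

Only the last segment gets split, so the ";b" in the earlier segment stays in the path either way.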
I agree that the stdlib parser is a mess, but as an observation: replacing one use of it with a (better!) implementation introduces a potential parser differential where one didn’t exist before. I’ve seen this issue crop up multiple times in real Python codebases, where a well-intentioned developer adds a differential by incrementally replacing the old, bad implementation.
That’s the perverse nature of “wrong but ubiquitous” parsers: unless you’re confident that your replacement is complete, you can make the situation worse, not better.
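A rough sketch of how that kind of differential bites, with assumptions flagged: ALLOWED_HOSTS, is_allowed, and the hostnames are hypothetical, and the WHATWG behavior described in the comments is my reading of the URL Standard, not something this snippet runs.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist and check, for illustration only.
ALLOWED_HOSTS = {"good.example"}

def is_allowed(url: str) -> bool:
    """Validate the host with the stdlib parser."""
    return urlsplit(url).hostname in ALLOWED_HOSTS

# A backslash before the "@" is where the two parsers are expected to disagree.
tricky = "http://evil.example\\@good.example/"

# The stdlib treats everything before "@" as userinfo, so it sees good.example.
stdlib_host = urlsplit(tricky).hostname
print(stdlib_host, is_allowed(tricky))

# Assumption: a WHATWG-style parser treats "\" like "/" for http(s) URLs,
# so it would report evil.example as the host and "/@good.example/" as the path.
# If the check above uses one parser and the code making the request uses the
# other, the allowlist passes a URL that ends up going to a different host.
```

Neither parser has to be "wrong" by its own spec for this to be exploitable; the gap between them is the bug.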
> unless you’re confident that your replacement is complete
And that any 3rd party libs you use also don't ever call the stdlib parser internally because you do not want to debug why a URL works through some code paths but not others.
Turns out that URL parsing is a cross-cutting concern like logging, where libs should defer to the calling code's implementation, but the Python devs couldn't have known that when this module was written.
It seems unlikely that this C++ library written by a solo dev is somehow more secure than the Python standard library would be for such a security-sensitive task.
Hi, can_ada (but not ada!) dev here. Ada is over 20k lines of well-tested and fuzzed source by 25+ developers, along with an accompanying research paper. It is the parser used in node.js and parses billions of URLs a day.
can_ada is simply 60 lines of glue code and packaging that make it available to Python with low overhead.
Ah, that makes more sense -- it might be a good idea to integrate with the upstream library as a submodule rather than lifting the actual .cpp/.h files into the bindings repo. That way people know the upstream C++ code is from a much more active project.
Despite my snarky comments, thank you for contributing to the Python ecosystem, this does seem like a cool project for high performance URL parsing!
I guess you are right that there are 2 commits from a different dev, so it is technically not a solo project. I still wouldn't ever use this in production code.
Ada was developed at the end of 2022 and has been included in Node.js since March 2023. Since then, Ada has powered Node.js, Cloudflare Workers, Redpanda, ClickHouse and many more libraries.