Pydantic V2 leverages Rust's Superpowers [video] (fosdem.org)
122 points by BerislavLopac on April 23, 2023 | 49 comments



As someone who built a pure python validation library[0] that's much faster than pydantic (~1.5x - 12x depending on the benchmark), I have to say that this whole focus on Rust seems premature. There's clearly a lot of room for pydantic to optimize its Python implementation.

Beyond that, rust seems like a great fit for tooling (e.g. ruff), but as a library used at runtime, it seems a little odd to make a validation library (which can expect to receive any kind of legal python data) to also be constrained by a separate set of data types which are legal in rust.

[0]: https://github.com/keithasaurus/koda-validate


I agree that pydantic could have been faster while still being written in Python.

The argument for Rust: 1. If I'm going to rewrite - why not go the whole hog and do it in rust - and thereby get a 20x improvement, not 2x. 2. By using Rust we can add more customisation with virtually no performance impact; with Python that's not the case.

Of course we could make Pydantic faster by removing features, but that would be very disappointing for existing users.

As mentioned by other commenters, your comment about "constrained" does not apply.


> If I'm going to rewrite - why not go the whole hog and do it in rust

We use black at work. One of the challenges with it is that it doesn't play very nicely with pandas, specifically its abundant use of []. So we forked it and maintain a trivial patch that treats '[' the same as '.' and everybody's happy.
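
Purely as an illustration (this is not black's actual output, and the frame below is made up), treating '[' like '.' means a long chain can break before subscripts the same way it breaks before attribute access:

    import pandas as pd

    frame = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10, 20, 5]})

    # Illustrative layout only: with '[' treated like '.', the
    # subscript can sit on its own line just like the method calls
    totals = (
        frame
        .groupby("user_id")
        ["amount"]
        .sum()
    )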

What was maybe 15 minutes of work for me to get everybody's buy-in to use a formatter would not have been so quick or easy if it had been written in rust: either we'd maintain our own repo of binary wheels, or all our devs would need to include rust in their build tooling.

I'm not invested in the argument one way or the other, just wanted to note that having the stack be accessible to easy modification by any user is itself a feature and one some people (including me in general, not so much in this particular case) derive a lot of benefit from.

P.S. cheers and congratulations!


This is so on point but is already the case with any package written in C. I feel like there’s such a strong push towards having rust backends for python packages that you might have to learn it to become a decent python developer…and I think I might be ok with that. For the price of having 1 dev on your team understand rust, you can keep using python as a top performing language. We’ve got Ruff, Pydantic and Rye (experimental?) just off the top of my head being written in rust. It seems like that’s where we’re heading as a community.


I'm talking about the rest of my team being able to use the fork. A compiler and toolchain are much more likely to be already installed for c than for rust.


With Rust drivers in the Linux kernel, I don't think this will be the case for much longer.


Good point, and gcc is getting a Rust front end as well!


Because now you are on your own, watching the community from afar. You have taken the Drupal path, and the ones who could help you are themselves going unhelped down their own Rust-paved paths, so they are busy.

Strange how it turned out that way; at the last convention everyone agreed that the best tool for Python is Rust... The silent majority was not there.

At least do it in Nim; a Python dev can catch up quickly. Optimization kills resilience.


I really appreciate your transparency around: "I am the one who is writing this open source library, and I think it will be more fun to do it this way."

Have fun! I truly hope it pays the returns you hope it will as well.

Naysayers: you're welcome to fork the old python version. If the rust version is a nightmare for the ecosystem, I'm sure someone will do that.

I'm very excited to see the results!


While I agree that there are ways to write a faster validation library in python, there are also benefits to moving the logic to native code.

msgspec[1] is another parsing/validation library, written in C. It's on average 50-80x faster than pydantic for parsing and validating JSON [2]. This speedup is only possible because we make use of native code, letting us parse JSON directly and efficiently into the proper python types, removing any unnecessary allocations.

It's my understanding that pydantic V2 currently doesn't do this (they still have some unnecessary intermediate allocations during parsing), but having the validation logic already in compiled code makes integrating this with the parser theoretically possible later on. With the logic in python this efficiency gain wouldn't be possible.

[1]: https://github.com/jcrist/msgspec

[2]: https://jcristharif.com/msgspec/benchmarks.html#benchmark-sc...
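
For a sense of the API (a minimal sketch; the struct and its fields are made up for illustration):

    import msgspec

    class User(msgspec.Struct):
        name: str
        age: int

    # Parsing and validation happen in one native-code pass, going
    # straight from JSON bytes to a User instance; a type mismatch
    # raises msgspec.ValidationError
    user = msgspec.json.decode(b'{"name": "alice", "age": 30}', type=User)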


Definitely true. I've just soured on the POV that native code is the first thing one should reach for. I was surprised that it only took a few days of optimization to make my validation library significantly faster than pydantic, when pydantic was already largely compiled via cython.

If you're interested in both efficiency and maintainability, I think you need to start by optimizing the language of origin. It seems to me that with pydantic, the choice has consistently been to jump to compilation (cython, now rust) without much attempt at optimizing within Python.

I'm not super-familiar with how things are being done on an issue-to-issue / line-to-line basis, but I see this rust effort taking something like a year+, when my intuition is that some simpler speedups in python could have been done in a matter of days or weeks (which is not to say they would be of the same magnitude of performance gains).


Two things may preclude optimization in pure Python when producing a library for the general public. Having a nice / ergonomic interface is one. Keeping things backwards-compatible is another.


I see that msgspec also uses native code to achieve the speed.

But the fact that it's faster than orjson (another native-code implementation) is cool.


I also wrote a pure python validation library [0] that is much faster than pydantic. It also handles unions correctly (unlike pydantic).

Pydantic V2 is indeed much faster than any pure python implementation I've seen, but it also introduces some bugs. And on pypy it is as slow as it ever was, because it falls back to python code.

I wrote mine because nothing else existed at the time, but whenever I've had to use pydantic I've found it to be quirky and to have strange opinions about types that are not shared by type checkers. Using it with mypy (despite the extension) is neither easy nor especially useful.

[0]: https://ltworf.github.io/typedload/performance.html
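
For the curious, a minimal sketch of the union handling being referred to, using typedload's load() entry point (the dataclasses are made up):

    from dataclasses import dataclass
    from typing import Union

    from typedload import load

    @dataclass
    class Cat:
        meow: str

    @dataclass
    class Dog:
        bark: str

    # Each union member is tried against the actual data, so the
    # value comes back as the variant that genuinely matches
    assert load({"bark": "woof"}, Union[Cat, Dog]) == Dog(bark="woof")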


It's slow with pypy because of some problems with pyo3's interaction with pypy, not a pure python fallback.

I agree unions were wrong in V1, but smart unions helped, and unions are fixed properly in V2.


Eh, smart unions… you're welcome; that idea comes from my project :)

Of course there was an incompatible API change there, where the smart union parameter got removed and it's impossible to obtain the old (and completely wrong) behaviour. I'm sure someone relies on that.
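
For context, a sketch of the v1 behaviour under discussion (illustrative; exact coercion rules vary by pydantic version):

    from typing import Union

    from pydantic import BaseModel

    class Default(BaseModel):
        x: Union[int, str]

    class Smart(BaseModel):
        x: Union[int, str]

        class Config:
            smart_union = True  # v1 opt-in

    # Default v1 unions validate members left to right with coercion,
    # so a numeric-looking string comes out as an int
    assert Default(x="1").x == 1
    # With smart_union, the exact type match wins and "1" stays a str
    assert Smart(x="1").x == "1"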


> to also be constrained by a separate set of data types which are legal in rust.

This isn't really how writing rust/python interop works. You tend to have opaque handles you call python methods on. Here's a decent example I found skimming the code.

https://github.com/pydantic/pydantic-core/blob/main/src/inpu...


> it seems a little odd to make a validation library (which can expect to receive any kind of legal python data) to also be constrained by a separate set of data types which are legal in rust.

That... makes no sense? Rust can interact with Python objects, there is no "constrained".


In the sense of using escape hatches back to python, that's true. My main point is that, from a complexity standpoint, why do python -> rust -> python when there's still a lot of room to run in just python?


Because it's not python -> rust -> python, it's python -> rust -> python c api.


Personally, I think it's great to have many projects solving the same problem and pushing each other further. Although the differences between the faster validation libraries are small, the older ones were quite slow. This will save unnecessary CPU cycles, making it more eco-friendly. And now the bar will be even higher with a Rust version, which is really great.

Maat[0] is 2.5 times faster than Pydantic on their own benchmark, as stated in their readme.

[0]: https://github.com/Attumm/Maat


“To also be constrained by a separate set of data types”

This is nonsensical.


Unless your library is a drop-in replacement for Pydantic, I don't think it's fair to compare the performance of Pydantic and yours.


My personal take is that faster languages do not necessarily make you think more about algorithms and efficiency.

But people here just dislike algorithms and love to flag me. They are going to embrace their slow architecture in rust rather than do a better job in python.


Rust is the future of tooling [0]. While [0] is about JS tooling specifically, we're seeing the same effects in other languages as well. Turns out, you probably don't want to write infrastructure tooling in slow, dynamically typed languages when faster, safer languages exist. Python knows this already, with many of the scientific computing libraries being just wrappers over the core C++ codebases. JS is beginning to catch up as well, with swc (speedy web compiler), stc (speedy type checker), Turbopack (Webpack successor) and so on, with Vercel leading the charge mainly.

[0] https://leerob.io/blog/rust#the-future-of-javascript-tooling


I'd say that you don't want to write the second crop of the tooling in a language like JS or Python.

This is because the first crop, like mypy or babel or jslint, existed and showed the general direction. But for the first crop, a slow-running but fast-turnaround language was essential, to my mind. The first iteration had to move fast, and change direction fast, because it wasn't yet clear what direction was going to be right.


Even just knocking up a CLI tool in python, I'm seeing so many cases and states I could be in that python just isn't warning me about.


Someone here recommended msgspec as a Pydantic alternative for serialization/validation, and wow. It is fantastic. I really recommend it: https://github.com/jcrist/msgspec


Thanks, glad you like it!


Samuel also presented a slightly altered talk at PyCon US yesterday[0], which was awesome!

GitHub link to pydantic[1], and pydantic-core[2].

[0]: https://twitter.com/samuel_colvin/status/1649928041462915072 (slides in comments)

[1]: https://github.com/pydantic/pydantic

[2]: https://github.com/pydantic/pydantic-core


Pydantic is so cool and I am really excited to migrate to v2. My biggest pain point in Pydantic is that it has no abstract base class for a serializer/deserializer. There is just BaseModel. This means that there is no "Pydantic interface" to implement. If I need to serialize/deserialize something other than an object with attributes, I basically can't do it in Pydantic, or at least I couldn't figure out how to do it.

In contrast, when I designed Serialite [1], I designed it to have a Serializer base class with two abstract methods: to_data and from_data. Then I added a @serializable decorator that can be applied to any dataclass, which injects concrete implementations of to_data and from_data. The decorator is what I usually use, but I sometimes implement the base class directly when it gets too complicated.
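
To make that shape concrete, here is a minimal sketch of the kind of interface described above (illustrative only, not Serialite's actual code):

    from abc import ABC, abstractmethod
    from dataclasses import dataclass, fields, is_dataclass

    class Serializer(ABC):
        @abstractmethod
        def to_data(self):
            """Return plain JSON-compatible data for this object."""

        @classmethod
        @abstractmethod
        def from_data(cls, data):
            """Build an instance from plain JSON-compatible data."""

    def serializable(cls):
        # Inject naive field-by-field implementations into a dataclass
        assert is_dataclass(cls)

        def to_data(self):
            return {f.name: getattr(self, f.name) for f in fields(cls)}

        @classmethod
        def from_data(inner, data):
            return inner(**data)

        cls.to_data = to_data
        cls.from_data = from_data
        return cls

    @serializable
    @dataclass
    class Point:
        x: float
        y: float

    assert Point.from_data({"x": 1.0, "y": 2.0}).to_data() == {"x": 1.0, "y": 2.0}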

Making Serialite work with FastAPI was almost impossible because there is no interface to implement, duck type or otherwise. In the end, Serialite monkey patches some key FastAPI functions to make them understand Serialite Serializers and that seems to work.

[1] https://github.com/drhagen/serialite


Have not watched the video, but I did find the slides from a different version[0].

[0] https://slides.com/samuelcolvin/how-pydantic-v2-leverages-ru...

Edit: Realized that the linked slides were from a different iteration of the ~same content.


This is the one from the video: https://slides.com/samuelcolvin/deck-0e6306


PyO3 is the key actor in this Rust-binding trend.



Code style in Pydantic user code reminds me of Rust's Serde serialization lib. Not directly relevant to the point of this video, but it makes me curious if it was inspired by Serde.


I'm rather amazed at the sheer number of human hours invested (wasted?) in making terrible languages better just so things are slightly easier for beginners.


Assuming you're talking about Python, it's a product of its time. At the time it was released, it was very much following the newest "best practices" from "experts". You had lots of well-known people in the industry going to conferences and talking about how static typing was Actually Bad and unit testing was the one true way. (I won't bother explaining the retrospective idiocy of that.)

It's not a great language in a vacuum... but it is the most popular backend language in the world, by most estimations; there's a library for literally everything, and the entire data science ecosystem lives in the Python world. If you're on the management side, hiring a Python developer is orders of magnitude easier than finding a Rust developer.

So you've really only got a few options... 1) throw everything out to rewrite it in Rust (et al); 2) accept that things suck and do nothing about it; 3) accept that things suck, but make some tooling to make it suck less. The rewrite strategy is a nonstarter for a lot of massive legacy codebases, so 3 is the only option that makes sense.

So it's really not that amazing how much time people invest in improving Python, when you think about it. Similar situation with PHP and Javascript.


> (I won't bother explaining the retrospective idiocy of that.)

Good, because people have done actual surveys to show there's no real decrease in bugs, and development time suffers: https://arxiv.org/abs/2203.11115

It's a trade off like anything else, Python has a lot of scientists who probably love not having to specify types everywhere. Python's best advantage is ease of development - Rust is in _no way_ an alternative in that department. Go would be a better choice.

Its biggest issue imo is that it's slow, but since people aren't rushing to use mypy and friends, perhaps that's not important for most users. And it's had types for ages if you want them.


Even the article you linked flat-out says static typing is better:

> The analysis indicates that TS apps exhibit significantly better code quality and understandability than JS apps [...] Furthermore, reducing the usage of the `any` type in TS apps was significantly correlated with all metrics except bug proneness.

> Its biggest issue imo is that it's slow, but since people aren't rushing to use mypy and friends, perhaps that's not important for most users.

People are rushing to mypy. Every Python library now has type stubs available. That wasn't the case even 3-4 years ago. Similar story with the mass move to TypeScript, which currently is the fastest-growing language[0].

Dynamic typing[1] is a legacy of ignorance and immaturity in the industry, from the same era where it was common to put your php $_GET parameters directly into a SQL query or access your C array without bounds checking. For the most part, the industry is smarter than this now, thankfully.

[0] https://visualstudiomagazine.com/articles/2023/02/02/jetbrai...

[1] note I'm really getting at "dynamic by default" -- there's certainly a use case for objects whose structures are only known at runtime, but that should be the exception, not the rule.


The article says code quality is better but development time and bug fixing time are worse. You chose to simplify that to 'better'; I chose to see it as a tradeoff.

People are rushing to mypy for type checking, not speed improvements. I use CPython and VS Code still does type checking for me; I don't care what program is doing it.

My argument is that truly static typing is too extreme in the other direction. Many times you don't care about type checking - quick scripts, data science, handling large JSON blobs - and informing the compiler is just boilerplate work. So I think we agree there that a static language with an 'any' type, like TypeScript, is the happy medium here.


Python is not a terrible language; neither is JS, or Ruby, or Lisp, despite being highly dynamic.

Python is indispensable for interactive experimentation (see Jupyter), and is a good glue language (see everything from Pytorch to Blender). It's also indispensable for rapid prototyping.

Python is not a good high-performance language. If performance is something that limits you, you should write performance-critical parts in a language optimized for that. Rust is a fine choice, but many other options exist, from Java to Haskell, and Python-integrated solutions like Cython also help. Note that usually 80-90% of your code is not performance-critical.

Same applies to systems where you want to formally prove certain correctness properties; both Python and C would be terrible choices.

By the same token, Rust is not a terrible language, despite its complicated syntax, the constant struggle with lifetimes and the borrow checker, and long compilation times. It shines when you need performance and correctness. If you need easy experimentation in a REPL, use a different language.

Use the right tool for the job.


I’m someone who spent a lot of time focusing on finding the most efficient tools, writing the most elegant code, using the most “pure” frameworks, etc. and I realized that this did very little to help me obtain my goals.

Unless your objectives are not commercial or will never scale beyond the ability of a single person, the overhead of human communication and collaboration will be enormous compared to the inefficiencies of shuffling electrons and flipping bits with imperfect instructions. Sometimes this means getting the job done by using technically inferior tools well rather than using idealistic tools improperly.


You're being downvoted, but I sort of agree...

How in the world can I stop worrying about errors like using a nonexistent class member and getting a runtime exception? How can I refactor?

OK, the core devs added annotations, then type hints, to the language.

Some other people created MyPy; the JetBrains people use some other linting thing that works well but does not match MyPy. Then Pydantic appears and some people say it's awesome, but yet again it's different from the others...

And all that to enforce things we already had 20 years ago in Java 1.0

How about using dynamic typing for small script and prototyping and using better tools for bigger projects?


Java 1.0 didn't have generics, so collection elements were only type-checked at runtime, and Java still doesn't have null-safety in the type system. The Python type annotations handle both of these things.
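
A tiny illustration of both points in Python's annotation syntax:

    from typing import Optional

    # The element type is part of the annotation (unlike pre-generics
    # Java collections), and Optional makes possible-None explicit
    def first_name(names: list[str]) -> Optional[str]:
        return names[0] if names else None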


Yeah, why build better tooling or libraries for a language that lots of people use, right? That would be silly!


There is no shortage of programming languages to choose from. Python seems to make the right tradeoffs between accessibility and power. The market has spoken.


It's not just about the market speaking. There's path dependence: being easy to use for scripting and scientific things, plus numpy, made it natural for Python to supplant things like Matlab in teaching. And then basic knowledge translates to industry jobs.

It's not popular purely based on its merits as a language


Python is the keyboard of programming languages.



