It boggles my mind that the author wrote an entire long article based on this.
The rhetorical question suggesting that that weird refactor, merging two different functions into one and then calling the new, non-trivial function twice for no reason, surely shouldn't affect performance? He already lost me at the premise of the article; of course that affects performance (see the sketch below).
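To make the pattern concrete, here's a purely hypothetical Scala sketch of that kind of refactor; the names and types are invented and this is not the article's actual Spark code, just the shape of the problem:

    object DoubleCallSketch {
      // Before the refactor: two single-purpose helpers, each called once.
      def cheapPart(s: String): Int  = s.length
      def costlyPart(s: String): Int = { Thread.sleep(50); s.count(_ == 'a') } // stand-in for expensive work

      // After the refactor: one combined, non-trivial function...
      def combined(s: String): (Int, Int) = (cheapPart(s), costlyPart(s))

      def main(args: Array[String]): Unit = {
        val row = "banana"
        // ...called twice, discarding half of the result each time,
        // so the expensive work runs twice instead of once.
        val a = combined(row)._1
        val b = combined(row)._2
        println(s"$a $b")
      }
    }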
What is so hard to understand here? There is some library code you can't immediately change because it belongs to upstream Spark. To illustrate the problem, he simplifies that code into a minimal example of what goes wrong.
Then, he writes some code that works around the library bug by losslessly modifying the input into something the library processes more easily.
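For anyone unclear on what a "lossless input modification" means here, a completely made-up Scala sketch of the idea (none of this is the article's real code, and the "empty string" bug is invented for illustration): reshape the data so the buggy library path is never hit, then invert the reshaping afterwards.

    object LosslessWorkaroundSketch {
      // Pretend the library chokes on empty strings; encode them with a sentinel
      // that (we assume) never appears in real data, so the mapping is reversible.
      private val Sentinel = "\u0000<EMPTY>\u0000"

      def encode(v: String): String = if (v.isEmpty) Sentinel else v
      def decode(v: String): String = if (v == Sentinel) "" else v

      def main(args: Array[String]): Unit = {
        val original   = Seq("a", "", "b")
        val forLibrary = original.map(encode)   // hand the library the rewritten input
        val roundTrip  = forLibrary.map(decode) // recover the original values exactly
        assert(roundTrip == original)
        println(roundTrip)
      }
    }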
Finally, he patches the library bug and shares the patch.
All of this is also kinda fucking obvious, not just to me but to a lot of people, so I'm having a really hard time grasping whether you've mixed up the illustrative simplification with the actual code, whether you think the best engineering approach is to always patch your environment's bugs instead of modifying your input, or whether you just don't have a GitHub account or for some other reason can't read the patch.