It boggles my mind that the author wrote an entire long article based on this.
The rhetorical question suggesting that that weird refactor, merging two different functions into one and then calling the new, non-trivial function twice for no reason, surely shouldn't affect performance? He already lost me at the premise of the article; of course that affects performance (see the sketch below).
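To make the pattern concrete, here's a purely hypothetical Scala sketch of that kind of refactor; the names and types are invented and this is not the article's actual Spark code, just the shape of the problem:

    object DoubleCallSketch {
      // Before the refactor: two single-purpose helpers, each called once.
      def cheapPart(s: String): Int  = s.length
      def costlyPart(s: String): Int = { Thread.sleep(50); s.count(_ == 'a') } // stand-in for expensive work

      // After the refactor: one combined, non-trivial function...
      def combined(s: String): (Int, Int) = (cheapPart(s), costlyPart(s))

      def main(args: Array[String]): Unit = {
        val row = "banana"
        // ...called twice, discarding half of the result each time,
        // so the expensive work runs twice instead of once.
        val a = combined(row)._1
        val b = combined(row)._2
        println(s"$a $b")
      }
    }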
What is so hard to understand here? There is some library code you can't immediately change because it belongs to upstream Spark. To illustrate the problem, he simplifies that code into a minimal example of what goes wrong.
Then, he writes some code that works around the library bug by losslessly modifying the input into something the library processes more easily.
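For anyone unclear on what a "lossless input modification" means here, a completely made-up Scala sketch of the idea (none of this is the article's real code, and the "empty string" bug is invented for illustration): reshape the data so the buggy library path is never hit, then invert the reshaping afterwards.

    object LosslessWorkaroundSketch {
      // Pretend the library chokes on empty strings; encode them with a sentinel
      // that (we assume) never appears in real data, so the mapping is reversible.
      private val Sentinel = "\u0000<EMPTY>\u0000"

      def encode(v: String): String = if (v.isEmpty) Sentinel else v
      def decode(v: String): String = if (v == Sentinel) "" else v

      def main(args: Array[String]): Unit = {
        val original   = Seq("a", "", "b")
        val forLibrary = original.map(encode)   // hand the library the rewritten input
        val roundTrip  = forLibrary.map(decode) // recover the original values exactly
        assert(roundTrip == original)
        println(roundTrip)
      }
    }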
Finally, he patches the library bug and shares the patch.
All of this is also kinda fucking obvious, not just to me but to a lot of people, so I'm having a really hard time grasping whether you've mixed up the illustrative simplification with the actual code, whether you think the best engineering approach is to always patch your environment's bugs instead of modifying your input, or whether you just don't have a GitHub account or for some other reason can't read the patch.