Data Science at The New York Times

everybodyknows · on July 11, 2019

>Python has gotten sufficiently weapons grade that we don’t descend into R anymore.

>Hadoop is definitely happening but it’s Google’s problem because now after building our own Hadoop on iron solution, after dealing with Redshift for a while, we now just gave it all to BigQuery.

A tidy simplification of the technology stack.

nerdponx · on July 11, 2019

>Python has gotten sufficiently weapons grade that we don’t descend into R anymore.

I've experienced this in my own work as well. The extra verbosity of Pandas data frames compared to R data frames doesn't bother me anymore. Sometimes I miss the Lispy homoiconic magic, but not enough to make me want to use R at work.

I still use it once in a while for heavily "statistical" stuff that doesn't ever need to be "productionized", but for run-of-the-mill machine learning I see no reason to use it over Pandas.

mushufasa · on July 11, 2019

Would anyone recommend / warn against any of the Python tidyverse ports, like dplython (dplyr) or plotnine (ggplot2)?

I'd like to have my cake and eat it too, but I'm worried that's too good to be true.

mkay313 · on July 11, 2019

I gave plotnine a go in one of my personal Python projects (I'm a big fan of ggplot2 and tidyverse in general over pandas and seaborn) and after struggling for a while with a more complicated graph I went back to using seaborn.

Not to mention that writing R-like code in Python will prevent you from being immediately understood by both R and Python developers. It's just not worth it.

kyllo · on July 11, 2019

Those Python ports of dplyr and ggplot2 are cool, but the problem is that they're abandoned.

Angostura · on July 11, 2019

> Lispy homoiconic magic

I just want to acknowledge this fabulous phrase.

o10449366 · on July 11, 2019

I'd like to see more transparency from NYT on how they're actually collecting, retaining, and distributing user data given both their data science and privacy efforts.

PeterStuer · on July 11, 2019

Interesting how at 11:45 he skirts the whole privacy topic by just stating that linking all their data to an identified reader (the 'who' in the 'who what where' of reader behavior tracking) 'involves third party data'.

lordleft · on July 11, 2019

I’ve collaborated with Chris Wiggins at Columbia. He’s insanely hardworking and it’s impressive to see how he balances an academic life with the life of a working Data Scientist at the New York Times. Really inspiring guy to be around.

sonabinu · on July 11, 2019

Been a very long time since I heard anyone speak of Jeff Hammerbacher. His passion for data and engineering is amazing!

TallGuyShort · on July 11, 2019

And it seems like an understatement, when presenting his data science credentials in an article that mentions Hadoop this much, to not mention that the dude founded the first major Hadoop company between Facebook and "retirement". The guy really was an inspiration when I worked there in the early days.