Data Science at The New York Times (dominodatalab.com)
185 points by gk1 on July 10, 2019 | 11 comments



>Python has gotten sufficiently weapons grade that we don’t descend into R anymore.

>Hadoop is definitely happening but it’s Google’s problem because now after building our own Hadoop on iron solution, after dealing with Redshift for a while, we now just gave it all to BigQuery.

A tidy simplification of the technology stack.


>Python has gotten sufficiently weapons grade that we don’t descend into R anymore.

I've experienced this in my own work as well. The extra verbosity of Pandas data frames compared to R data frames doesn't bother me anymore. Sometimes I miss the Lispy homoiconic magic, but not enough to make me want to use R at work.

I still use it once in a while for heavily "statistical" stuff that doesn't ever need to be "productionized", but for run-of-the-mill machine learning I see no reason to use it over Pandas.


Would anyone recommend / warn against any of the Python tidyverse ports, like dplython (dplyr) or plotnine (ggplot2)?

I'd like to have my cake and eat it too, but I'm worried that's too good to be true.


I gave plotnine a go in one of my personal Python projects (I'm a big fan of ggplot2 and tidyverse in general over pandas and seaborn) and after struggling for a while with a more complicated graph I went back to using seaborn.

Not to mention writing R-like code in Python will prevent you from being immediately understood by both R and Python developers. It's just not worth it.


Those Python ports of dplyr and ggplot2 are cool, but the problem is that they're abandoned.


> Lispy homoiconic magic

I just want to acknowledge this fabulous phrase.


I'd like to see more transparency from NYT on how they're actually collecting, retaining, and distributing user data given both their data science and privacy efforts.


Interesting how at 11:45 he skirts the whole privacy topic by just stating that linking all their data to an identified reader (the 'who' in the 'who what where' of reader behavior tracking) 'involves third party data'.


I’ve collaborated with Chris Wiggins at Columbia. He’s insanely hardworking and it’s impressive to see how he balances an academic life with the life of a working Data Scientist at the New York Times. Really inspiring guy to be around.


Been a very long time since I heard anyone speak of Jeff Hammerbacher. His passion for data and engineering is amazing!


And it seems like an understatement, when presenting his data science credentials in an article that mentions Hadoop this much, to not mention that the dude founded the first major Hadoop company between Facebook and "retirement". The guy really was an inspiration when I worked there in the early days.



