Recommendations for becoming a data scientist? Sounds like a wonderful job.
What books would be on the required reading lists? What is the best set of tools to learn and know?
An alternative approach: how much of this could be automated and delivered as a web application/service? Provide easy ways for people to parse and upload their data, algorithms to run with good descriptions and recommendations for lay people, and well-designed visualization tools for presenting the output of the data. This would be a very valuable service, but would take a broad range of knowledge and expertise to implement well (experts in the CS, Statistics, and Visualization fields as described in the article).
It does take a wide variety of knowledge. But I don't think you need to be an expert in all these fields. You just have to have a passion for presenting information to others. As for what tools are needed, here's my take.
Computer Science (Data Mining, Machine Learning):
- A hard-science book such as PRML (Pattern Recognition and Machine Learning) by Chris Bishop or Neural Networks by Haykin
- Fun implementations from Programming Collective Intelligence by Toby Segaran
Don't forget the graphic design and HCI parts, and definitely do not underestimate them. I often find that people with a scientific/technical background find it hard to visualise and present data well because of their background - 'I love lots of information and data; therefore, everyone else must love lots of information and data'. Wrong. You can number crunch all day long, but if you can't translate the output into something that people can understand (i.e. turn information into knowledge) then there is no point.
What curriculum are you pursuing to become a data scientist? Is it an actual degree program, are you cobbling together your own set of courses, or is it a degree in one of CS/Stats/Visual Design with electives in the others?
I have decided to do a masters degree in computational biology (bioinformatics).
There are a few masters courses that could lead you to a career in data science, but it depends on your background. Most of the courses I found were specialisations, e.g. a masters in computer vision or HCI for those with a computer science background, or a masters in digital media or design and technology for those with an arts background. I had neither, so I thought I might have to do a conversion course into computer science or art and then specialise. However, when I read the article above, I realised I was more on the design side anyway (happier in Photoshop and Illustrator than a text editor) and that I needed to learn the computing and statistics side. This brought me to bioinformatics since I already have a background in biochemistry. The course will teach me programming, data mining and statistics. I also get to do three projects, which I am going to do in visualisation of life science data - which is why I got interested in this topic in the first place.
I think undergraduates have a greater opportunity to 'cobble' a course together tailored towards data science.
I found it useful to do a bit of research on anyone who does a visualisation that you like and see what their background is.
I'm coming at it as someone who was doing NLP a decade ago, and recently came back to it by taking courses part time here at Carnegie Mellon (where I am employed). So I now have courses under my belt covering the intersection of machine learning and NLP (which is pretty much all of NLP these days), and I am planning to take the Masters Machine Learning course this Fall.
The part I would need to add next, I guess, is the data visualization part.
For the large scale data part, there is this program:
Looks like an interesting course. And you couldn't be in a better place for it. My second choice of course was a masters in information science, which looks very similar to the vlis one.
I think data science is a very broad subject area that covers the science of organising information (information science and vlis), right through to information visualisation.
Information science would have given me a good grounding in the skills I need for my start-up (which is involved in organising research information), but I thought it would be best to build on my expertise and interests (design, HCI, UX, psychology) and team up with someone who knows a lot more about the technical side of building information systems - more than I could ever learn from a one year masters.
It sounds like we are starting at opposite ends of the data/information science spectrum; it will be interesting to see if we meet in the middle after our masters.
People definitely love Tufte's books about information graphic design.
Then of course for the actual data processing and visualization you'll need software such as MATLAB, R, Processing, or Python (with SciPy + Matplotlib).
As for your latter suggestion - I think it's unrealistic to build a general web-based solution. As you are describing it, you would essentially be hosting R, some visualization tool, and a scripting tool. That's absurd enough and doesn't even address the bandwidth issue - these datasets are usually huge.
You'd be better off writing enterprise software in this field, especially considering that anyone who would pay for data munging and visualization tools/services probably doesn't want their data out on the internet.
"I think it's unrealistic to build a general web-based solution"
Maybe not a perfectly "general" solution, but there are lots of domains where data is analyzed that would benefit from a powerful web-based analytic environment.
"these datasets are usually huge"
Not all interesting datasets are huge. Some large datasets can easily be shared among users: Amazon already hosts a lot of common datasets in EC2, for example. Given the steadily decreasing cost of bandwidth and the fact that you don't need interactive response (at least for the initial data load), I don't think it's that impractical for a pretty wide set of scenarios.
"As for your latter suggestion - I think it's unrealistic to build a general web-based solution."
It might be a dual product strategy: a public-facing web app with limits on data size for test drives, and an enterprise version installed at the customer site. Or maybe some solution that uses Amazon storage (in other words, outsource data storage and security issues to Amazon).
Of course, Google had the two-tier search idea (they wanted to monetize by selling custom search servers to businesses) but found the real money was to be made on the consumer-facing side. So you never know.
One of the things I am hoping to do with my start-up is to build simple tools to help wet-lab scientists access informatics solutions to analyse and visualise their data, and simple tools for informatics scientists to visualise their data so that a wet-lab scientist can understand it (more easily). There is a huge communication gap between (bio)wet-lab and (bio)informaticians, and I think 'data science' tools can help to bridge that gap.
Yes, this looks a lot like what I had in mind. Reading his blog entry, I like that he is getting something out there, even though it is not all of what he originally had in mind. If he keeps iterating, he should do well. I wonder if he has any thoughts on business models.