Data Analysis and Visualization Using R (2014) (varianceexplained.org)
95 points by michaelsbradley on Aug 10, 2016 | 24 comments



These tutorials are from 2014. While they provide a good overview of R syntax, a lot has been added to the R-verse such as dplyr, which the author primarily used for his Trump Tweets blog post yesterday.

If you are interested in learning R, you may want to read the R for Data Science book (http://r4ds.had.co.nz/) by dplyr (and ggplot2) author Hadley Wickham.

Relatedly, I have my own (slightly more complicated) notebooks using R/dplyr/ggplot2, open-sourced on GitHub, if you want further examples of real-world analysis with publicly available data along the lines of the Trump Tweet analysis:

Processing Stack Overflow Developer data: https://github.com/minimaxir/stack-overflow-survey/blob/mast...

Identifying related Reddit Subreddits: https://github.com/minimaxir/subreddit-related/blob/master/f...

Determining correlation between genders of lead actors of movies on box office revenue: https://github.com/minimaxir/movie-gender/blob/master/movie_...


While the tidy-verse and data.table are definitely game changers for R, it's still worth learning the basics. Often the packages make irritating tasks easy, though they rarely touch the tasks that are easy in base R. I've seen some pretty convoluted dplyr from newcomers that could have been achieved in a single line without loading any packages.
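A toy illustration of what I mean, with a hypothetical data frame df that has columns x and y:

    # the kind of pipeline I often see from newcomers
    library(dplyr)
    df %>%
      filter(x > 5) %>%
      summarise(mean_y = mean(y))

    # the same result in one line of base R, no packages loaded
    mean(df$y[df$x > 5])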


Course author here; I agree that most of the lessons have become outdated over the last two years, and that R for Data Science is a great modern source.

I'm working with DataCamp to develop an R course that covers dplyr, tidyr, and other newer additions to the R language.


Good to hear! :)


Took me a while to find the Trump Tweets blog you referred to. It is here for anyone trying to find it:

http://varianceexplained.org/r/trump-tweets

By the author of the tutorial, not the poster of the link.


I love R, but I have two problems with it that I would like suggestions to deal with.

1. Debugging seems way more primitive than in other languages; I get cryptic messages and really struggle to pinpoint what is happening. Debugging in (free) Shiny is even harder: the page just says the connection closed and I have to guess what happened.

2. Code structure. R is simply fantastic in REPL and/or RStudio mode for digging around in data, but longer programs remind me of COBOL (yes, I have programmed in COBOL), and longer programs written by other people remind me of the need to drink alcohol. Creating good code in R is vastly harder than in Julia: in Julia the challenge is not to create working, clean code - that comes naturally - but to create the best code possible. In R the challenge (for me) is to make it work and not end up with a plate of spaghetti.


Ad 1. See ?browser ?traceback ?debugger. You also have breakpoints.
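For example (a quick sketch with a made-up function f, using only base tools):

    f <- function(x) {
      browser()        # pause here and drop into the interactive debugger
      log(x)
    }
    f(-1)              # step with n, inspect variables, Q to quit

    traceback()        # right after an uncaught error: print the call stack

    options(error = recover)  # or stop on any error and pick a frame to inspect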

Ad 2. There are reference classes, which also provide limited type checking for fields. You can also encapsulate your code in environments, which is more R-style but doesn't work well with roxygen.
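E.g. a minimal reference class with typed fields (made-up Account class, just to show the field checking):

    Account <- setRefClass("Account",
      fields  = list(owner = "character", balance = "numeric"),
      methods = list(
        deposit = function(amount) {
          balance <<- balance + amount
        }
      )
    )

    a <- Account$new(owner = "alice", balance = 0)
    a$deposit(10)
    a$balance        # 10
    # a$owner <- 42  # errors: 'owner' must be of class "character"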


Sounds like you need some Visual Studio in your life.

And you probably need to be more assertive.

https://cran.r-project.org/web/packages/assertive/index.html

https://www.youtube.com/watch?v=JWjiMvlfCwk

(and you do use testthat for unit testing, right?)

Additionally you want to write more modular code. There is lots of infrastructure around that in R, but people just don't use it often enough because a lot of them aren't programmers.

mlr provides very convenient infrastructure for building data mining pipelines where you can fuse steps with each other.

http://mlr-org.github.io/mlr-tutorial/release/html/
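Roughly like this, on the built-in iris data (from memory of the mlr API, so double-check against the tutorial; the impute step is only there to show how wrappers fuse onto a learner):

    library(mlr)

    task <- makeClassifTask(data = iris, target = "Species")
    lrn  <- makeLearner("classif.rpart")

    # fuse a preprocessing step onto the learner
    lrn  <- makeImputeWrapper(lrn, classes = list(numeric = imputeMean()))

    mod  <- train(lrn, task)
    pred <- predict(mod, task = task)
    performance(pred, measures = acc)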

For non-model building activities, i.e. inference or exploratory analysis, mason is a great way to do it.

https://cran.r-project.org/web/packages/mason/vignettes/spec...


I use RStudio, but I'm not sure what Visual Studio will bring - will investigate.


Project management and devops tools.


For (2), you may want to look at a couple of module systems for R:

https://github.com/klmr/modules

https://github.com/wahani/modules

Neither is perfect, but I've found them helpful in my projects. They involve much less overhead than writing packages, especially when the modularization I'm trying to achieve is purely internal to my project and I don't intend to publish the code. At the same time, they provide much better encapsulation compared to `base::source`.
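As a taste, here's roughly what wahani/modules looks like (sketch from memory; the scale01 helper is made up):

    library(modules)  # the CRAN package from wahani

    helpers <- module({
      scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
    })

    helpers$scale01(c(1, 5, 10))
    # nothing leaks into the global environment the way base::source() would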


FYI, debugging in Shiny has gotten much better in 0.13.1 and later (you now get stack traces at the console, or in your log file if running on your own Shiny Server, or in your admin console if running on ShinyApps.io).


How? I have looked, but Shiny says "debugging is private, talk to your admin"... (I'm running Ubuntu on AWS)



Considering how R has exploded in recent years, I'm sure a more recent article could have been found. That being said, R is amazing, easily the best language/software for any sort of data analysis. And bonus points for easy Fortran/C++ interop, as well as easy multicore/cluster computing. Oh, and a shout out to RStudio, which is also amazing.
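To give a flavour of the interop and multicore bits (a small sketch; the C++ function is just a toy):

    library(Rcpp)

    # compile a C++ function and expose it to R in one call
    cppFunction("
      double sumSquares(NumericVector x) {
        double total = 0;
        for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
        return total;
      }
    ")
    sumSquares(c(1, 2, 3))   # 14

    # multicore apply from the base 'parallel' package
    library(parallel)
    mclapply(1:4, function(i) i^2, mc.cores = 2)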


I want to thank everyone for those links. I am learning R at the moment and I am finding them immensely helpful.


Since this link is from 2014, it doesn't mention rBokeh, which is a very powerful interactive viz library for R: http://hafen.github.io/rbokeh/
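If you want a quick taste, something along these lines (going from memory of the rbokeh API):

    library(rbokeh)

    figure() %>%
      ly_points(Sepal.Length, Sepal.Width, data = iris,
                color = Species, hover = list(Sepal.Length, Sepal.Width))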


How good is the support for R when data is large and does not fit in memory?


Very good, just look at sparklyr and the like. People seem to think there is some hang-up here. R has some of the best-developed packages for working out of memory.
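E.g. with sparklyr the data stays in Spark and only the small summarised result comes back into R (a sketch; the file path and column names are made up, and it assumes a local Spark install):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # read a big CSV straight into Spark rather than into R's memory
    flights_tbl <- spark_read_csv(sc, name = "flights", path = "flights.csv")

    flights_tbl %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay)) %>%
      collect()          # only this small summary lands in R

    spark_disconnect(sc)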


Plenty of options, depending on what you need.


I guess one has options in every language including Python. Does something make R stand out over Python for Big Data?


Everything that makes it stand out over Python for small data.


I use R multiple hours every day, I love it, I love the ecosystem, but I can't help thinking that it's showing its age. It does a great job on static analysis and visualization of smaller (< 1 gigabyte) data sets, but is seriously challenged by anything significantly larger, and is unfit for purpose if the data is changing rapidly (e.g. streaming).

I am unfortunately slowly coming to the conclusion that Spark- and Flink-style tools are where data science will be in a few years' time, and while I know you can use R as a layer on top of these, I think other aspects of R also hold you back - paradoxically, things like the base and ggplot graphics, which are rightly lauded as excellent but are very low-dimensional in a world where tensors increasingly rule.

I think R will remain hugely relevant for a long time - it is, as I tell people, like Excel^2 - but it's getting to the point where the world is moving on, and it will struggle if it's not rewritten from the ground up with a much faster, multicore, threaded, distributed implementation.


I genuinely think this type of thought comes from having extensive experience as a programmer, where you can conceive of building out systems that might reach the performance ceiling of R. For 99% of people, multicore/distributed architecture will never even be a consideration. But I'm with you, in that having these things from a systems engineer's perspective would be incredible. There are other implementations of R out there (not just GNU R); Hadley Wickham discusses them in Advanced R somewhere.



