It's worth noting that for journalists, analyzing data is only half the battle.
Sites like FiveThirtyEight and The Economist usually have separate graphics departments that use nonstatistical tools like Illustrator to annotate and apply custom theming. Good visualization is a huge part of a persuasive argument, so being able to do both is important (and languages like R have good native plotting as well).
Additionally, looking at the agate code Jupyter notebook, it appears that the processing syntax is very, very similar to pandas (despite the warning against it) aside from the print_bars method, so I'm confused about the specific utility of the module.
From the post comments, after someone else noted the similarities too:
> You're right, most of my problems with pandas are not in its interfaces. My problems there are with the overhead of the numpy dependency, its confusing handling of text, nulls, etc. (inherited from numpy) and its documentation aimed at advanced users rather than beginners.
Hello! Author of the library here. Just want to point out that I am a journalist and have been very active in the data journalism community for 6+ years. Both the sites you name-check have journalists who produce online graphics that don't go through the traditional Illustrator workflow, and news organizations are increasingly discarding that antiquated pattern. (I've made a hundred graphics for NPR and I don't even have a copy of Illustrator installed.)
To your second point, that's fine. The most common feedback I've gotten is "I don't see what purpose this serves that X doesn't already fulfill!" Well okay then, you don't gotta use it. But given the fact I've done this job for years, working with the very folks who it's targeted at, I think it's probably safe to assume I've got some reasons. (Which you will find enumerated in the blog post and documentation.)
I like pandas but I find it confusing to use. I wouldn't mind sacrificing some speed for ease of use. In particular, it would be great to find a library with something similar to pandas' MultiIndex, but with more intuitive behavior. However, from skimming agate's docs, it doesn't seem to have anything like that. Anyway, it's great that alternatives are being attempted.
I don't think the syntax is that much nicer than dplyr in R (thank you Based Hadley). But the approach (focusing on less-technical users and reducing headaches) is certainly good.
I do really like the graphs being printed in console. Is this common elsewhere?
Where I work, we do a lot of projects where we are replacing some aging and wacky system (e.g., FileMaker Pro, Access, old and ignored SQL Server 7, etc.). Our project managers might find this tool helpful, since doing the data analysis in the wacky system is pretty specialized. Dumping that data to CSV and looking at it through a tool like this seems like it'd be a big time saver.
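For anyone curious what the "dump to CSV, then eyeball it in the console" workflow looks like in its barest form, here is a rough sketch using only the Python standard library. To be clear, this is not agate's API or how print_bars is implemented; the sample data, the column names, and the helpers `count_values` and `text_bars` are all made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export from a legacy system; the columns are invented.
SAMPLE_CSV = """id,status
1,open
2,closed
3,open
4,
5,open
"""

def count_values(csv_text, column):
    """Count how often each value appears in one column; blanks become '(blank)'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row[column] or "").strip() or "(blank)" for row in reader)

def text_bars(counts, width=20):
    """Render counts as horizontal text bars, largest count first."""
    if not counts:
        return []
    longest = max(len(label) for label in counts)
    top = max(counts.values())
    lines = []
    for label, n in counts.most_common():
        # Scale each bar against the largest count; always draw at least one mark.
        bar = "#" * max(1, round(n / top * width))
        lines.append(f"{label.ljust(longest)} {str(n).rjust(3)} {bar}")
    return lines

counts = count_values(SAMPLE_CSV, "status")
print("\n".join(text_bars(counts)))
```

Even a toy like this surfaces the blank status in row 4, which is the kind of inconsistency you want to catch before a migration. A real library adds type inference, proper null handling and documentation on top.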
Journalists have some problems that tend to be somewhat peculiar to their jobs. Some examples:
* A mix of heterogeneous and often internally inconsistent data.
* A lot of data that is categorical, free text or otherwise non-numerical.
* A need for robust analysis that is not always accompanied by the time necessary to become an expert programmer.
I'm sure some other folks have these problems too, though I can't think of any other industry where folks would touch as diverse a range of data as we do.
If it works for other niches, great! But I'm a journalist and I had journalism problems in mind when I built it. I can't speak to the needs of folks in science, finance or what have you.
+1 for digital humanities folks. Your emphasis on well written documentation is a strong argument for agate over more powerful, but more confusing, data processing libraries. I'm already thinking about using agate in my digital humanities workshops!