Agreed. The sample size is also really small to be reading much into slower arrival times; if only 1 or 2 packages took a long time, it's likely they just hit an outlier.
According to http://www.atheistberlin.com/study they got a significant Wilcoxon Signed Rank Test at p < 0.01. The Wilcoxon Signed Rank Test is nonparametric (i.e. it sacrifices some statistical power in order to make fewer assumptions about the underlying distribution of the data), so the result is still meaningful even if there is a long tail of packages that took longer due to circumstances outside of the study variable.
Sample size has to be interpreted in light of statistical power. You can draw a solid conclusion from a very small sample if the true difference in arrival times is very large, provided the assumptions of the hypothesis test hold (the t-test's assumptions can be a little ridiculous sometimes).
In the original article, one of the footnotes mentioned that they tested the data using Wilcoxon's Signed-Rank test, which mitigates a lot of the impact of single outliers.
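To illustrate why the signed-rank test shrugs off a single extreme value, here is a rough pure-Python sketch of an exact two-sided signed-rank test. The differences are made up for illustration (they are not the study's data), and `signed_rank_p` is a hypothetical helper rather than a library function; it enumerates every sign assignment, so it is only practical for small samples.

```python
from itertools import product

def signed_rank_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value (small samples only)."""
    d = [x for x in diffs if x != 0]          # zero differences are dropped
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        # find the run of tied absolute values and give them the average rank
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    observed = abs(w_plus - total / 2)
    # null distribution: every rank is positive or negative with equal chance
    hits = 0
    for signs in product((0, 1), repeat=len(d)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            hits += 1
    return hits / 2 ** len(d)

# Hypothetical paired differences in days (assumed, NOT the study's data)
mild = [1.5, 2.0, -0.5, 3.0, 2.5, 1.0, 4.0, 20.0]
wild = [1.5, 2.0, -0.5, 3.0, 2.5, 1.0, 4.0, 200.0]  # same data, outlier 10x larger

print(signed_rank_p(mild))
print(signed_rank_p(wild))
```

Both calls print the same p-value, because the test only sees that the largest difference has the top rank, not how large it is.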
I'd love to see the raw data though, to try a method that's even less sensitive to outliers (the sign test). If the difference between the groups is as large as the article would lead us to believe, the loss of power should not present any problem.
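For what it's worth, the sign test is simple enough to sketch by hand. The snippet below is a minimal version, assuming paired differences in days; the numbers are invented (not the study's data), and `sign_test_p` and `sign_test_power` are hypothetical helper names. The power function assumes each pair independently shows a positive difference with probability `q`.

```python
from math import comb

def sign_test_p(pos, neg):
    """Exact two-sided sign test p-value from counts of +/- differences."""
    n = pos + neg
    k = min(pos, neg)
    # P(X <= k) under Binomial(n, 0.5), doubled for two sides, capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def sign_test_power(n, q, alpha=0.01):
    """Chance of rejecting at level alpha when P(difference > 0) = q."""
    return sum(comb(n, k) * q ** k * (1 - q) ** (n - k)
               for k in range(n + 1)
               if sign_test_p(k, n - k) <= alpha)

# Hypothetical paired differences in days (assumed, NOT the study's data)
diffs = [3, 2, 4, 1, 5, 2, 3, 30, 2, 1]
pos = sum(d > 0 for d in diffs)
neg = sum(d < 0 for d in diffs)

print(sign_test_p(pos, neg))        # only the signs matter, not the sizes
print(sign_test_power(10, 0.95))    # large effect
print(sign_test_power(10, 0.60))    # small effect
```

With all ten made-up differences positive, the p-value is 2/1024, about 0.002, and the 30-day outlier counts no more than any other positive difference. With 10 pairs the power at alpha = 0.01 is roughly 0.60 when 95% of pairs favor one side, but under 0.01 when only 60% do, which is the sense in which a small sample is fine only if the effect really is large.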
Yeah, that 3 days could either be damning, or absolutely nothing. Until we can examine the data this is a non-story despite how juicy we may want it to be.