
sorry -- please substitute "average" (the original was also in quotes) with "category" or "factor", and you still have the bin I am talking about. You can put any label on it you like, such as "commonality", as long as you remove the details, i.e. the other bins.

But as you say: your "aggregate analysis" NEEDS "many different samples from different stars". Commonality is the result of your analysis across those different samples. But since they are common, you can go and sample and have the result without doing mass surveillance on every star.
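To make this concrete, here is a toy sketch (made-up numbers, not real survey data): a small random sample recovers a property the population has in common, without measuring every member.

  import numpy as np

  rng = np.random.default_rng(7)

  # Hypothetical population: one "common" property shared across a
  # million members (e.g. a typical stellar property), with some scatter.
  population = rng.normal(loc=0.0, scale=0.1, size=1_000_000)

  # A small random sample already pins down the common value --
  # no need to measure every individual.
  sample = rng.choice(population, size=500, replace=False)
  print(f"population mean: {population.mean():+.5f}")
  print(f"sample mean:     {sample.mean():+.5f}"
        f" +/- {sample.std(ddof=1) / np.sqrt(len(sample)):.5f}")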

ps: I am fully aware of photo stacking, but also note that stars are not humans; see the context of privacy. Please look at argus or sdcMicroGUI from CRAN to get a feeling for the trade-off between data utility and re-identification risk.




> But since they are common, you can go and sample and have the result without doing mass surveillance on every star.

"Mass surveilance" reduces noise and lets you get more data in a shorter period of time (telescopes have large fields of view, but they can't make time pass faster). Stacking (which is what the technique is called in Astrophysics) is very useful in this case. Not to mention that you can also do individual analysis as well.

Actually, most interesting of all is that you can do this type of analysis on objects like neutron stars, which we can't observe directly because they're too faint. Because noise in telescopes can be modelled as a Poisson process, stacking increases the S/N in a way you couldn't otherwise achieve without building much bigger telescopes.
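As a toy sketch of that S/N claim (the photon rates are made-up numbers, not from any real instrument): with Poisson noise, stacking n frames improves S/N by roughly sqrt(n).

  import numpy as np

  rng = np.random.default_rng(42)

  SOURCE = 2.0        # photons/frame from the faint source (made-up rate)
  BACKGROUND = 100.0  # photons/frame of sky background (made-up rate)

  def per_frame_estimate(n_frames):
      """Stack n_frames of Poisson counts, subtract the known background,
      and return the estimated source rate per frame."""
      counts = rng.poisson(SOURCE + BACKGROUND, size=n_frames).sum()
      return (counts - BACKGROUND * n_frames) / n_frames

  for n in (1, 100, 10_000):
      estimates = [per_frame_estimate(n) for _ in range(1000)]
      # Empirical S/N of the stacked estimate; for Poisson noise it should
      # track the analytic sqrt(n) * S / sqrt(S + B).
      print(f"n={n:>6}: empirical S/N = {np.mean(estimates) / np.std(estimates):6.2f},"
            f" theory = {np.sqrt(n) * SOURCE / np.sqrt(SOURCE + BACKGROUND):6.2f}")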

PS. I'm not a statistician, so I can only speak to what I know. But my whole point is that researchers do know how to deal with noisy data, whether or not that noise is man-made. Interestingly enough, I found out recently that the NASA pipeline actually breaks certain data sets they have released (data sets that have papers written about them), so man-made noise is a problem whether or not it's intentional.


"Not to mention that you can also do individual analysis as well."

This is the key point to argue against in the context of people, privacy and mass surveillance.

It is the touchstone of privacy, anonymity and crowd protection.

Regarding noise suppression: yes, the more queries (available data, whether raw or extracted), the more you can filter (ask anyone who has studied Kalman filtering) to reduce your error bars and margins. This is one reason why differential privacy (DP) is overhyped. Also, if there are no differences between queries, then the data is redundant; see deduplication (databases) or scaling (measurement).
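As a toy sketch of that filtering point (TRUE_COUNT, EPSILON, etc. are made-up illustration values): if the same statistic is released many times with fresh noise and no budget accounting, plain averaging shrinks the error bars until the true value falls out.

  import numpy as np

  rng = np.random.default_rng(1)

  TRUE_COUNT = 1234   # the sensitive statistic (hypothetical value)
  SENSITIVITY = 1.0   # query sensitivity
  EPSILON = 0.1       # per-query privacy parameter (hypothetical)

  def noisy_answer():
      """One Laplace-mechanism release: true value + Lap(sensitivity/epsilon)."""
      return TRUE_COUNT + rng.laplace(scale=SENSITIVITY / EPSILON)

  for k in (1, 100, 10_000):
      answers = [noisy_answer() for _ in range(k)]
      # Averaging k independent releases shrinks the noise by ~sqrt(k);
      # this is exactly why DP has to account for composition across queries.
      print(f"{k:>6} queries: mean answer = {np.mean(answers):8.2f}"
            f" (true = {TRUE_COUNT})")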

About the analysis pipeline: this is why the mantra is "know your detector". Incidentally, this is also why releasing only the recorded datasets is next to useless for people outside the given research group: you would need to capture detailed knowledge of the data-taking operations and instruments, which happens rarely, if ever. Please cite a specific thing rather than "the NASA pipeline" -- perhaps you mean a given mission/experiment? In any case, detector recalibration is a routine, almost daily activity...


> Please cite a specific thing rather than "the NASA pipeline" -- perhaps you mean a given mission/experiment?

The specific pipeline I was referring to is the Kepler pipeline that NASA uses to turn raw pixel data into the photon counts that everyone uses for their research (this wasn't a detector issue; it was a software bug at the final stage of the data publishing process). The point was not the pipeline issue itself, it was that noise is everywhere.

But as to your point, yeah okay. Maybe I shouldn't talk about statistics when that's not my field. :D



