
Let's note that this thread has been shifting back and forth between information publicised through media and data, with the discussion focusing on the latter's use in research.

These aren't entirely dissimilar, but the differences matter as much as the similarities.

Data in research is used to confirm or deny models, that is, understandings of the world.

Data in operations is used to determine and shape actions (including possibly inaction), interacting with an environment.

Information in media ... shares some of this, but is more complex in that it both creates (or disproves) models, and has a very extensive behavioural component involving both individual and group psychology and sociology.

Media platform moderation plays several roles. In part, it operates in a context where the platforms are performing their own selection and amplification, and where there's now experimental evidence that even in the absence of any induced bias, disinformation tends to spread, especially in large and active social networks.

(See "Information Overload Helps Fake News Spread, and Social Media Knows It". (https://www.scientificamerican.com/article/information-overl...), discussed here https://news.ycombinator.com/item?id=28495912 and https://news.ycombinator.com/item?id=25153716)

The situation is made worse when there's both intrinsic tooling of the system to boost sensationalism (a/k/a "high engagement" content), and deliberate introduction of false or provocative information.
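(For flavour, here's a toy sketch of that dynamic in Python. It is not the model from the cited paper, just an illustration under simplified assumptions: agents have bounded feeds and reshare based on exposure rather than accuracy, and the most-shared items end up no better in quality than the average item.)

    # Toy simulation: bounded attention + exposure-driven resharing.
    # Not the model from the cited paper; an illustrative sketch only.
    import random
    from collections import defaultdict

    random.seed(1)
    N_AGENTS, FEED_SIZE, P_NEW, STEPS = 200, 5, 0.15, 20000

    follows = {a: random.sample(range(N_AGENTS), 10) for a in range(N_AGENTS)}
    followers = defaultdict(list)
    for a, outs in follows.items():
        for b in outs:
            followers[b].append(a)

    feeds = defaultdict(list)   # agent -> bounded list of (item_id, quality)
    quality, shares, next_id = {}, defaultdict(int), 0

    for _ in range(STEPS):
        a = random.randrange(N_AGENTS)
        if random.random() < P_NEW or not feeds[a]:
            item, q = next_id, random.random()   # post a new item of random quality
            quality[item] = q
            next_id += 1
        else:
            item, q = random.choice(feeds[a])    # reshare: choice is blind to quality
        shares[item] += 1
        for f in followers[a]:                   # push into followers' bounded feeds
            feeds[f] = ([(item, q)] + feeds[f])[:FEED_SIZE]

    top = sorted(shares, key=shares.get, reverse=True)[:10]
    print("mean quality, all items :", round(sum(quality.values()) / len(quality), 2))
    print("mean quality, top 10    :", round(sum(quality[i] for i in top) / 10, 2))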

TL;DR: moderation has to compensate for and overcome the inherent biases toward misinformation, and take into consideration both causal and resultant behaviours and effects. At the same time, moderation itself is subject to many of the same biases as the information network as a whole (false and inflammatory reports tend to draw more reports and quicker actions), as well as to spurious error rates (as I've described at length above).

All of which is to say that I don't find your own allegation of an intentional bias, offered without evidence or argument, credible.



An excellent distinction. In the world of data for research and operations, I only very rarely deal with data that is intentionally biased; I could count the instances on the fingers of one hand. Cherry-picked data is more common, but data deliberately falsified to present things in a different light is rare.

Well, it's rare that I know of. The nature of things is that I might never know. But most people who don't work with data as a profession also don't know how to create convincingly fake data, or even how to cherry-pick without leaving obvious holes. Saying "Yeah, so I actually need all of the data" isn't too uncommon. Most of the time it's not even deliberate; people just don't understand that their definition of "relevant data" isn't applicable, especially when I'm using the data to diagnose a problem with their organization/department/etc.

Propaganda... Well, as you said, there's some overlap in the principles. Though I still stand by my preference of #2 > #1 > #3, and #3 alone over #2 and #3 together.


Does your research data include moderator actions? I imagine such data may be difficult to gather. On reddit it's easy since most groups are public and someone's already collected components for extracting such data [1].

I show some aggregated moderation history on reveddit.com e.g. r/worldnews [2]. Since moderators can remove things without users knowing [3], there is little oversight and bias naturally grows. I think there is less bias when users can more easily review the moderation. And, there is research that suggests if moderators provide removal explanations, it reduces the likelihood of that user having a post removed in the future [4]. Such research may have encouraged reddit to display post removal details [5] with some exceptions [6]. As far as I know, such research has not yet been published on comment removals.

[1] https://www.reddit.com/r/pushshift/

[2] https://www.reveddit.com/v/worldnews/history/

[3] https://www.reveddit.com/about/faq/#need

[4] https://www.reddit.com/r/science/comments/duwdco/should_mode...

[5] https://www.reddit.com/r/changelog/comments/e66fql/post_remo...

[6] https://www.reveddit.com/about/faq/#reddit-does-not-say-post...
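To make [1] and [3] above concrete: here is a rough sketch, in Python, of the comparison such tooling relies on. It pulls archived comment IDs from Pushshift, then checks what reddit's public /api/info endpoint currently returns for them. Pushshift's access policy has changed over time, so endpoint availability is an assumption and this is illustrative only.

    # Sketch: flag comments present in the Pushshift archive but showing
    # "[removed]" on reddit (moderator/automod removal) vs "[deleted]" (by the author).
    # Assumes the Pushshift search endpoint is reachable; treat as illustrative.
    import requests

    UA = {"User-Agent": "removal-check-sketch/0.1"}

    def archived_comments(subreddit, size=50):
        r = requests.get("https://api.pushshift.io/reddit/search/comment/",
                         params={"subreddit": subreddit, "size": size},
                         headers=UA, timeout=30)
        r.raise_for_status()
        return r.json()["data"]                  # each item has "id" and archived "body"

    def live_bodies(comment_ids):
        # reddit's /api/info accepts comma-separated fullnames like t1_<id>
        fullnames = ",".join("t1_" + cid for cid in comment_ids)
        r = requests.get("https://api.reddit.com/api/info/",
                         params={"id": fullnames}, headers=UA, timeout=30)
        r.raise_for_status()
        return {c["data"]["id"]: c["data"].get("body", "")
                for c in r.json()["data"]["children"]}

    archived = archived_comments("worldnews")
    live = live_bodies([c["id"] for c in archived])
    for c in archived:
        if live.get(c["id"]) == "[removed]":
            print("removed by mods:", c["id"], "archived text:", c["body"][:60])
        elif live.get(c["id"]) == "[deleted]":
            print("deleted by author:", c["id"])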


Data reliability is highly dependent on the type of data you're working with, and the procedures, processes, and checks on that.

I've worked with scientific, engineering, survey, business, medical, financial, government, internet ("web traffic" and equivalents), and behavioural data (e.g., measured experiences / behaviour, not self-reported). Each has ... its interesting quirks.

Self-reported survey data is notoriously bad, and there's a huge set of tricks and assumptions that are used to scrub that. Those insisting on "uncensored" data would likely scream.

(TL;DR: multiple views on the same underlying phenomenon help a lot --- not necessarily from the same source. Some will lie, but they'll tend to lie differently and in somewhat predictable ways.)
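(A minimal sketch of that multiple-views idea, with hypothetical column names: compare a self-reported figure against an independently measured one and flag respondents whose answers diverge badly, rather than silently trusting either source.)

    # Cross-check self-reported values against an independent measurement.
    # Column names, figures, and the flagging threshold are hypothetical.
    import pandas as pd

    survey = pd.DataFrame({"respondent": [1, 2, 3, 4],
                           "reported_weekly_hours": [40, 60, 35, 20]})   # what people say
    logs = pd.DataFrame({"respondent": [1, 2, 3, 4],
                         "logged_weekly_hours": [38, 41, 36, 34]})       # what was recorded

    merged = survey.merge(logs, on="respondent")
    merged["gap"] = merged["reported_weekly_hours"] - merged["logged_weekly_hours"]

    # A consistent sign in the gap suggests systematic over/under-reporting;
    # large individual gaps get flagged for follow-up.
    flagged = merged[merged["gap"].abs() > 2 * merged["gap"].abs().median()]
    print(merged)
    print("flagged for review:")
    print(flagged)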

Engineering and science data tend to suffer from pre-measurement assumptions (e.g., what you instrumented for vs. what you actually got). "Not great. Not terrible" from the series Chernobyl is a brilliant example of this: the instruments simply couldn't read the actual amount of radiation.

In online data, distinguishing "authentic" from all other traffic (users vs. bots) is the challenge. And that involves numerous dark arts.
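(A naive first pass at that filtering, with hypothetical log fields and arbitrary thresholds; the real dark arts go much further.)

    # Crude bot filter over web logs. Field names and cutoffs are hypothetical;
    # real traffic classification is far more involved than this.
    from collections import Counter

    requests_log = [
        # (client_id, user_agent, path)
        ("10.0.0.1", "Mozilla/5.0 (Windows NT 10.0)", "/article/1"),
        ("10.0.0.2", "python-requests/2.31", "/article/1"),
        ("10.0.0.2", "python-requests/2.31", "/article/2"),
        # ... thousands more in practice
    ]

    BOT_UA_HINTS = ("bot", "spider", "crawl", "python-requests", "curl")
    REQS_PER_CLIENT_CUTOFF = 1000        # arbitrary rate threshold for this sketch

    per_client = Counter(client for client, _, _ in requests_log)

    def looks_automated(client, user_agent):
        ua = user_agent.lower()
        return any(h in ua for h in BOT_UA_HINTS) or per_client[client] > REQS_PER_CLIENT_CUTOFF

    human_like = [r for r in requests_log if not looks_automated(r[0], r[1])]
    print(f"{len(human_like)} of {len(requests_log)} requests look human-ish")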

Financial data tends to come with strong incentives to report something, but also with strong incentives to game the system.

I've seen field data where the interests of the field reporters outweighed the subsequent interest of analysts, resulting in wonderfully-specified databases with very little useful data.

Experiential data are great, but you're limited, again, to what you can quantify and measure (as well as having major privacy and surveillance concerns, and often other ethical considerations).

Government data are often quite excellent, at least within competent organisations. For some flavour of just how widely standards can vary, though, look at reports of Covid cases, hospitalisations, recoveries, and deaths from different jurisdictions. Some measures (especially excess deaths) are far more robust, though they also lag considerably from direct experience. (Cost, lag, number of datapoints, sampling concerns, etc., all become considerations.)
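(A sketch of why excess deaths travel better across jurisdictions: the calculation needs only all-cause death counts and a pre-pandemic baseline, not a shared case definition. The figures below are hypothetical.)

    # Excess deaths = observed all-cause deaths minus an expected baseline,
    # here the prior-years average for the same week. Figures are hypothetical.
    observed_2020 = {12: 1450, 13: 1610, 14: 1890, 15: 2100}          # week -> deaths
    baseline_years = {
        2017: {12: 1180, 13: 1200, 14: 1190, 15: 1210},
        2018: {12: 1220, 13: 1230, 14: 1250, 15: 1240},
        2019: {12: 1200, 13: 1210, 14: 1230, 15: 1220},
    }

    for week, observed in observed_2020.items():
        expected = sum(y[week] for y in baseline_years.values()) / len(baseline_years)
        print(f"week {week}: observed {observed}, expected {expected:.0f}, "
              f"excess {observed - expected:+.0f}")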

It's complicated.


I've worked with a decent variety as well, though nothing close to engineering.

>Self-reported survey data is notoriously bad

This is my least favorite type of data to work with. It can be incorrect either deliberately or through poor survey design. When I have to work with surveys I insist that they tell me what they want to know, and I design the survey. Sometimes people come to me when they already have survey results, and sometimes I have to tell them there's nothing reliable I can do with them. When I'm involved from the beginning, I have final veto. Even then I don't like it. Even a well-designed survey with proper phrasing, unbiased Likert scales, etc. can have issues. Many things don't collapse nicely to a one-dimensional scale. Then there is the selection bias inherent when, by definition, you only receive responses from people willing to fill out the survey. There are ways to deal with that, but they're far from perfect.
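(One of those imperfect ways is post-stratification: reweight respondents so the sample's composition matches known population shares. A minimal sketch with hypothetical strata follows; it corrects only for the variables you stratify on, not for unobserved differences between responders and non-responders.)

    # Post-stratification sketch: weight each respondent by
    # (population share of their stratum) / (sample share of their stratum).
    # Strata, shares, and responses are hypothetical.
    from collections import Counter

    respondents = [
        {"age_band": "18-34", "satisfied": 1},
        {"age_band": "18-34", "satisfied": 1},
        {"age_band": "18-34", "satisfied": 0},
        {"age_band": "35-64", "satisfied": 0},
        {"age_band": "65+",   "satisfied": 0},
    ]
    population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

    n = len(respondents)
    counts = Counter(r["age_band"] for r in respondents)
    sample_share = {band: c / n for band, c in counts.items()}
    weights = [population_share[r["age_band"]] / sample_share[r["age_band"]] for r in respondents]

    raw = sum(r["satisfied"] for r in respondents) / n
    weighted = sum(w * r["satisfied"] for w, r in zip(weights, respondents)) / sum(weights)
    print(f"raw satisfaction: {raw:.2f}, weighted: {weighted:.2f}")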


Q: What's the most glaring sign of a failed survey analysis project?

A: "I've conducted a survey and need a statistician to analyse it for me."

(I've seen this many, many, many times. I've never seen it not be the sign of a completely flawed approach.)



