
Let's note that this thread has been shifting back and forth between information publicised through media and data, with the discussion focusing on the latter's use in research.

These aren't entirely dissimilar, but the differences matter as much as the similarities.

Data in research is used to confirm or deny models, that is, understandings of the world.

Data in operations is used to determine and shape actions (including possibly inaction), interacting with an environment.

Information in media ... shares some of this, but is more complex in that it both creates (or disproves) models, and has a very extensive behavioural component involving both individual and group psychology and sociology.

Media platform moderation plays several roles. In part, it operates in a context where the platforms are performing their own selection and amplification, and where there's now experimental evidence that even in the absence of any induced bias, disinformation tends to spread, especially in large and active social networks.

(See "Information Overload Helps Fake News Spread, and Social Media Knows It". (https://www.scientificamerican.com/article/information-overl...), discussed here https://news.ycombinator.com/item?id=28495912 and https://news.ycombinator.com/item?id=25153716)

The situation is made worse when there's both intrinsic tooling of the system to boost sensationalism (a/k/a "high engagement" content), and deliberate introduction of false or provocative information.
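(For flavour, here's a toy sketch of that dynamic in Python. It is not the model from the cited paper, just an illustration under simplified assumptions: agents have bounded feeds and reshare based on exposure rather than accuracy, and the most-shared items end up no better in quality than the average item.)

    # Toy simulation: bounded attention + exposure-driven resharing.
    # Not the model from the cited paper; an illustrative sketch only.
    import random
    from collections import defaultdict

    random.seed(1)
    N_AGENTS, FEED_SIZE, P_NEW, STEPS = 200, 5, 0.15, 20000

    follows = {a: random.sample(range(N_AGENTS), 10) for a in range(N_AGENTS)}
    followers = defaultdict(list)
    for a, outs in follows.items():
        for b in outs:
            followers[b].append(a)

    feeds = defaultdict(list)   # agent -> bounded list of (item_id, quality)
    quality, shares, next_id = {}, defaultdict(int), 0

    for _ in range(STEPS):
        a = random.randrange(N_AGENTS)
        if random.random() < P_NEW or not feeds[a]:
            item, q = next_id, random.random()   # post a new item of random quality
            quality[item] = q
            next_id += 1
        else:
            item, q = random.choice(feeds[a])    # reshare: choice is blind to quality
        shares[item] += 1
        for f in followers[a]:                   # push into followers' bounded feeds
            feeds[f] = ([(item, q)] + feeds[f])[:FEED_SIZE]

    top = sorted(shares, key=shares.get, reverse=True)[:10]
    print("mean quality, all items :", round(sum(quality.values()) / len(quality), 2))
    print("mean quality, top 10    :", round(sum(quality[i] for i in top) / 10, 2))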

TL;DR: moderation has to compensate for and overcome the inherent biases toward misinformation, and take into consideration both causal and resultant behaviours and effects. At the same time, moderation itself is subject to many of the same biases as the information network as a whole (false and inflammatory reports tend to draw more reports and quicker actions), as well as to spurious error rates (as I've described at length above).

All of which is to say that I don't find your own allegation of an intentional bias, offered without evidence or argument, credible.



An excellent distinction. In the world of data for research and operations, I only very rarely deal with data that is intentionally biased; I could count the instances on the fingers of one hand. Cherry-picked data is more common, but data deliberately falsified to present things in a different light is rare.

Well, it's rare that I know of. The nature of things is that I might never know. But most people who don't work with data as a profession also don't know how to create convincingly fake data, or even how to cherry-pick without leaving obvious holes. Saying "Yeah, so I actually need all of the data" isn't too uncommon. Most of the time it's not even deliberate; people just don't understand that their definition of "relevant data" isn't applicable, especially when I'm using the data to diagnose a problem with their organization/department/etc.

Propaganda... Well, as you said, there's some overlap in the principles. Though I still stand by my preference of #2 > #1 > #3, and #3 alone over #2 and #3 together.


Does your research data include moderator actions? I imagine such data may be difficult to gather. On reddit it's easy since most groups are public and someone's already collected components for extracting such data [1].

I show some aggregated moderation history on reveddit.com e.g. r/worldnews [2]. Since moderators can remove things without users knowing [3], there is little oversight and bias naturally grows. I think there is less bias when users can more easily review the moderation. And, there is research that suggests if moderators provide removal explanations, it reduces the likelihood of that user having a post removed in the future [4]. Such research may have encouraged reddit to display post removal details [5] with some exceptions [6]. As far as I know, such research has not yet been published on comment removals.

[1] https://www.reddit.com/r/pushshift/

[2] https://www.reveddit.com/v/worldnews/history/

[3] https://www.reveddit.com/about/faq/#need

[4] https://www.reddit.com/r/science/comments/duwdco/should_mode...

[5] https://www.reddit.com/r/changelog/comments/e66fql/post_remo...

[6] https://www.reveddit.com/about/faq/#reddit-does-not-say-post...
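To make [1] and [3] above concrete: here is a rough sketch, in Python, of the comparison such tooling relies on. It pulls archived comment IDs from Pushshift, then checks what reddit's public /api/info endpoint currently returns for them. Pushshift's access policy has changed over time, so endpoint availability is an assumption and this is illustrative only.

    # Sketch: flag comments present in the Pushshift archive but showing
    # "[removed]" on reddit (moderator/automod removal) vs "[deleted]" (by the author).
    # Assumes the Pushshift search endpoint is reachable; treat as illustrative.
    import requests

    UA = {"User-Agent": "removal-check-sketch/0.1"}

    def archived_comments(subreddit, size=50):
        r = requests.get("https://api.pushshift.io/reddit/search/comment/",
                         params={"subreddit": subreddit, "size": size},
                         headers=UA, timeout=30)
        r.raise_for_status()
        return r.json()["data"]                  # each item has "id" and archived "body"

    def live_bodies(comment_ids):
        # reddit's /api/info accepts comma-separated fullnames like t1_<id>
        fullnames = ",".join("t1_" + cid for cid in comment_ids)
        r = requests.get("https://api.reddit.com/api/info/",
                         params={"id": fullnames}, headers=UA, timeout=30)
        r.raise_for_status()
        return {c["data"]["id"]: c["data"].get("body", "")
                for c in r.json()["data"]["children"]}

    archived = archived_comments("worldnews")
    live = live_bodies([c["id"] for c in archived])
    for c in archived:
        if live.get(c["id"]) == "[removed]":
            print("removed by mods:", c["id"], "archived text:", c["body"][:60])
        elif live.get(c["id"]) == "[deleted]":
            print("deleted by author:", c["id"])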


Data reliability is highly dependent on the type of data you're working with, and the procedures, processes, and checks on that.

I've worked with scientific, engineering, survey, business, medical, financial, government, internet ("web traffic" and equivalents), and behavioural data (e.g., measured experiences / behaviour, not self-reported). Each has ... its interesting quirks.

Self-reported survey data is notoriously bad, and there's a huge set of tricks and assumptions that are used to scrub that. Those insisting on "uncensored" data would likely scream.

(TL;DR: multiple views on the same underlying phenomenon help a lot --- not necessarily from the same source. Some will lie, but they'll tend to lie differently and in somewhat predictable ways.)
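(A minimal sketch of that multiple-views idea, with hypothetical column names: compare a self-reported figure against an independently measured one and flag respondents whose answers diverge badly, rather than silently trusting either source.)

    # Cross-check self-reported values against an independent measurement.
    # Column names, figures, and the flagging threshold are hypothetical.
    import pandas as pd

    survey = pd.DataFrame({"respondent": [1, 2, 3, 4],
                           "reported_weekly_hours": [40, 60, 35, 20]})   # what people say
    logs = pd.DataFrame({"respondent": [1, 2, 3, 4],
                         "logged_weekly_hours": [38, 41, 36, 34]})       # what was recorded

    merged = survey.merge(logs, on="respondent")
    merged["gap"] = merged["reported_weekly_hours"] - merged["logged_weekly_hours"]

    # A consistent sign in the gap suggests systematic over/under-reporting;
    # large individual gaps get flagged for follow-up.
    flagged = merged[merged["gap"].abs() > 2 * merged["gap"].abs().median()]
    print(merged)
    print("flagged for review:")
    print(flagged)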

Engineering and science data tend to suffer from pre-measurement assumptions (e.g., what you instrumented for vs. what you actually got). "Not great. Not terrible" from the series Chernobyl is a brilliant example of this: the instruments simply couldn't read the actual amount of radiation.

In online data, distinguishing "authentic" from all other traffic (users vs. bots) is the challenge. And that involves numerous dark arts.
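(A naive first pass at that filtering, with hypothetical log fields and arbitrary thresholds; the real dark arts go much further.)

    # Crude bot filter over web logs. Field names and cutoffs are hypothetical;
    # real traffic classification is far more involved than this.
    from collections import Counter

    requests_log = [
        # (client_id, user_agent, path)
        ("10.0.0.1", "Mozilla/5.0 (Windows NT 10.0)", "/article/1"),
        ("10.0.0.2", "python-requests/2.31", "/article/1"),
        ("10.0.0.2", "python-requests/2.31", "/article/2"),
        # ... thousands more in practice
    ]

    BOT_UA_HINTS = ("bot", "spider", "crawl", "python-requests", "curl")
    REQS_PER_CLIENT_CUTOFF = 1000        # arbitrary rate threshold for this sketch

    per_client = Counter(client for client, _, _ in requests_log)

    def looks_automated(client, user_agent):
        ua = user_agent.lower()
        return any(h in ua for h in BOT_UA_HINTS) or per_client[client] > REQS_PER_CLIENT_CUTOFF

    human_like = [r for r in requests_log if not looks_automated(r[0], r[1])]
    print(f"{len(human_like)} of {len(requests_log)} requests look human-ish")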

Financial data tends to come with strong incentives to report something, but also with strong incentives to game the system.

I've seen field data where the interests of the field reporters outweighed the subsequent interest of analysts, resulting in wonderfully-specified databases with very little useful data.

Experiential data are great, but you're limited, again, to what you can quantify and measure (as well as having major privacy and surveillance concerns, and often other ethical considerations).

Government data are often quite excellent, at least within competent organisations. For some flavour of just how widely standards can vary, though, look at reports of Covid cases, hospitalisations, recoveries, and deaths from different jurisdictions. Some measures (especially excess deaths) are far more robust, though they also lag considerably from direct experience. (Cost, lag, number of datapoints, sampling concerns, etc., all become considerations.)
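(A sketch of why excess deaths travel better across jurisdictions: the calculation needs only all-cause death counts and a pre-pandemic baseline, not a shared case definition. The figures below are hypothetical.)

    # Excess deaths = observed all-cause deaths minus an expected baseline,
    # here the prior-years average for the same week. Figures are hypothetical.
    observed_2020 = {12: 1450, 13: 1610, 14: 1890, 15: 2100}          # week -> deaths
    baseline_years = {
        2017: {12: 1180, 13: 1200, 14: 1190, 15: 1210},
        2018: {12: 1220, 13: 1230, 14: 1250, 15: 1240},
        2019: {12: 1200, 13: 1210, 14: 1230, 15: 1220},
    }

    for week, observed in observed_2020.items():
        expected = sum(y[week] for y in baseline_years.values()) / len(baseline_years)
        print(f"week {week}: observed {observed}, expected {expected:.0f}, "
              f"excess {observed - expected:+.0f}")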

It's complicated.


I've worked with a decent variety as well, though nothing close to engineering.

>Self-reported survey data is notoriously bad

This is my least favorite type of data to work with. It can be incorrect either deliberately or through poor survey design. When I have to work with surveys I insist that they tell me what they want to know, and I design the survey. Sometimes people come to me when they already have survey results, and sometimes I have to tell them there's nothing reliable I can do with them. When I'm involved from the beginning, I have final veto. Even then I don't like it. Even a well-designed survey with proper phrasing, unbiased Likert scales, etc. can have issues. Many things don't collapse nicely to a one-dimensional scale. Then there is the selection bias inherent when, by definition, you only receive responses from people willing to fill out the survey. There are ways to deal with that, but they're far from perfect.
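(One of those imperfect ways is post-stratification: reweight respondents so the sample's composition matches known population shares. A minimal sketch with hypothetical strata follows; it corrects only for the variables you stratify on, not for unobserved differences between responders and non-responders.)

    # Post-stratification sketch: weight each respondent by
    # (population share of their stratum) / (sample share of their stratum).
    # Strata, shares, and responses are hypothetical.
    from collections import Counter

    respondents = [
        {"age_band": "18-34", "satisfied": 1},
        {"age_band": "18-34", "satisfied": 1},
        {"age_band": "18-34", "satisfied": 0},
        {"age_band": "35-64", "satisfied": 0},
        {"age_band": "65+",   "satisfied": 0},
    ]
    population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

    n = len(respondents)
    counts = Counter(r["age_band"] for r in respondents)
    sample_share = {band: c / n for band, c in counts.items()}
    weights = [population_share[r["age_band"]] / sample_share[r["age_band"]] for r in respondents]

    raw = sum(r["satisfied"] for r in respondents) / n
    weighted = sum(w * r["satisfied"] for w, r in zip(weights, respondents)) / sum(weights)
    print(f"raw satisfaction: {raw:.2f}, weighted: {weighted:.2f}")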


Q: What's the most glaring sign of a failed survey analysis project?

A: "I've conducted a survey and need a statistician to analyse it for me."

(I've seen this many, many, many times. I've never seen it not be the sign of a completely flawed approach.)



