Hacker News | compute_me's comments

Making public data open by default can arguably be an important step towards fostering societal equity. However, it needs to be not only "open", which typically means stashed away in some corner as a spreadsheet or database file, but accessible and useful to people. The UK has been pushing open data for years now and more and more institutions are now realizing this. Shameless plug for a research project that is aiming to make open data more accessible and to democratize data-science: https://data-in.place/ ...


In case you haven't seen it, I think https://www.data.gov/ is an attempt to answer your point about making it "accessible and useful to people." There's room for improvement, but it's a start.


I thought there would be a flood of projects analysing the data when it came out, but it seems like an idea that everyone applauded and not much came of. Steve Ballmer's http://usafacts.org seems like the first real attempt though.


Is there a HN thread discussing these reports? I'd love to hear folks' opinions about them... Interesting facts to me in the latest report [1]:

- Page 44: "Our economy has grown at a steady rate despite changes in economic policy". I expected to see a lot more fluctuation on this data.

- Page 30: "There have been more suicide gun deaths than homicide gun deaths every year since 1981". This is crazy to me given how much we hear about gun homicide being a problem in this country.

- Page 29: Crime rate has declined but "The number of incarcerated persons has increased by 330% since 1980".

[1] https://static.usafacts.org/public/resources/USAFactsReport2...


>Page 44: "Our economy has grown at a steady rate despite changes in economic policy". I expected to see a lot more fluctuation on this data.

Barring some calamity, this is pretty much as expected. Tariffs hit a very small percentage of the economy hard and impose a small overhead on the rest, much like a Fed interest rate hike. Even with this hawkish Fed, there hasn't been anything overly harmful to the economy from a policy perspective.

> This is crazy to me given how much we hear about gun homicide being a problem in this country.

Anything that's politicized gets this special treatment. Kid kills brother with car, blurb in local newspaper. Kid kills brother with gun, national news and Tweets from Presidential candidates.

Texting and driving is as bad as drinking and driving when it comes to number of deaths (dwarfing gun deaths as well), yet people do it like it's no big deal. Most cities still only levy small fines (in comparison to DUIs) for doing it. Not all accidental death is created equal.

>Page 29: Crime rate has declined but "The number of incarcerated persons has increased by 330% since 1980"

Welcome to the US where the war on drugs gives us authoritarian level incarceration rates.


Every election cycle I look for the candidate with the balls to say the War on Drugs is over, let’s wind this crap down, change our laws to reflect this fact, change some sentences ex post facto to reflect this and get on with our lives.

Every election cycle, I continue to be disappointed. Even if they were crazy in every other regard, I would probably still vote for them. It’s like, Step 1 towards doing anything meaningful in regards to poverty, education, criminal justice reform, et cetera.


+1 for this. This site is beautifully designed and loads very fast on mobile.


That's pretty interesting, thanks for mentioning it.


https://public.enigma.com is another good way to browse public government data.


data.gov is great but it seems like the most recent activities/updates were prior to 2016...



USDS/18F have done such incredible work and I'm glad the Trump admin hasn't killed them off completely. Given Trump's total-war attitude toward all Obama initiatives I'm somewhat surprised they still exist and haven't been gutted like CFPB.


Honestly, with this administration's turnover and record number of still-unfilled positions, I'd bet there's a very real chance that it still exists because it hasn't been noticed yet. It sounds exactly like the kind of thing the Trump admin would hate, seeing as how they've already pulled a lot of data out of the public eye, like climate change reports, White House visitor logs, etc.


nih.gov is also quite good for health data.


As long as those spreadsheets/database files are accessible to someone with technical skill, people can pull in the data and use tools to make it more accessible and useful. Ideally, yes, the data is useful to begin with, but as long as it's available, there's nothing stopping individuals with the skills from making it useful.

Of course, there are exceptions: the PDFs that are often provided by the prosecution as part of the discovery process are prohibitively difficult to deal with, and should be considered a violation of Brady v. Maryland, IMO.


I've spent a great deal of time parsing data out of government PDFs that isn't attainable by any other means as a part of my job. In the process I've learned how difficult this information can be to access even for people who don't require it to be in a machine readable format. It certainly has been an interesting exercise in how far simple web scraping tools can be pushed, though.


Amazon Textract was recently announced, sounds like it might be good for that. Haven't tried it myself.

https://aws.amazon.com/about-aws/whats-new/2018/11/introduci...


I applied to the beta, but they never got back to me :\


Have you tried Apache Tika? It's pretty decent.


Nope, I'll have to give it a spin. Thanks for the recommendation!


Do you have any tool suggestions or general advice for someone trying to do this? A while back I was trying to extract text from some government PDFs in order to make the information more accessible for others, but I became a bit overwhelmed when I started reading up on PDFs.


Sure! In terms of raw text extraction (for documents that don't require OCR), the most useful tools I've worked with have been pdftotext [0] and PyMuPDF [1]. For extracting useful details, really, my best advice is to make sure that your regex skills are sharp. I've been meaning to explore the possibility of using NLP tools for named entity recognition, but unfortunately I don't have much of a background there.
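To make the regex advice concrete, here is a minimal sketch in Python. It assumes the `pdftotext` command-line tool is installed and on the PATH; the two patterns, the function names, and the sample text are hypothetical and only for illustration, since real documents will need their own patterns.

```python
import re
import subprocess

# Hypothetical patterns for illustration; real documents need their own.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
AMOUNT_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")


def pdf_to_text(path):
    """Run the pdftotext CLI and return the extracted text.

    Assumes pdftotext is installed; "-" as the output file means stdout,
    and -layout tries to preserve the physical layout of the page.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_fields(text):
    """Pull ISO-style dates and dollar amounts out of already-extracted text."""
    dates = ["-".join(m.groups()) for m in DATE_RE.finditer(text)]
    amounts = [float(m.group(1).replace(",", "")) for m in AMOUNT_RE.finditer(text)]
    return {"dates": dates, "amounts": amounts}


if __name__ == "__main__":
    # Made-up sample standing in for text returned by pdf_to_text().
    sample = (
        "Award issued 2018-11-28   Total: $1,250,000.00\n"
        "Amended 2019-01-15   Fee: $75.50"
    )
    print(extract_fields(sample))
```

Keeping the extraction step (`pdf_to_text`) separate from the pattern matching (`extract_fields`) means the regexes can be tested on plain strings without needing a PDF on hand.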

The rest of it kind of just comes down to using good software engineering practices to help keep yourself sane. Find useful abstractions for common tasks you need to perform and build a library around them, make sure that your data processing pipeline is designed with enough flexibility to handle inputs in different formats so that adding or modifying parsing logic becomes trivial, etc.
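One way to read the point about a flexible pipeline is a small parser registry, sketched below in Python. This is only an assumption about what such a design could look like, not a description of any particular codebase; the names `PARSERS`, `register`, and `parse` are hypothetical.

```python
import csv
import io
import json

# Registry mapping a format name to a parsing function (names are hypothetical).
PARSERS = {}


def register(fmt):
    """Decorator that adds a parser for one input format to the registry."""
    def wrap(func):
        PARSERS[fmt] = func
        return func
    return wrap


@register("csv")
def parse_csv(raw):
    # Each row becomes a dict keyed by the header line.
    return list(csv.DictReader(io.StringIO(raw)))


@register("json")
def parse_json(raw):
    return json.loads(raw)


def parse(raw, fmt):
    """Dispatch to the registered parser, failing loudly on unknown formats."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
    return parser(raw)
```

With this shape, supporting a new input format means adding one decorated function, and the rest of the pipeline only ever calls `parse(raw, fmt)`.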

[0] https://www.xpdfreader.com/pdftotext-man.html [1] https://pymupdf.readthedocs.io/en/latest/


pdfminer is another good library (Python).


Exactly. Accessible and machine readable are necessary but not sufficient. Thankfully, civil society can reasonably pick up the slack.

In regards to modern day transparency requirements, it seems like laws should include a reasonableness clause.

Making records available to the public but requiring them to be hand photocopied vs. making them available in electronic form in a custom format.

Both open. But two very different magnitudes of effort.


>Making public data open by default can arguably be an important step towards fostering societal equity

I think this was one of my biggest shocks doing work for the government, collecting public data, paid by tax-funded grants. Public data isn't for the public.

We went into this project with all these starry eyed dreams of making a public online database and freely posting everything we collected, with maps and interactive tools, status reports. It was part of our grant proposal.

Then reality came and we found out public data meant a government password-protected database with access fees, where our data would be available to people willing to pay for it, or we'd lose our funding. The data were for companies or individuals willing to pay the government, not for the public.

This still doesn't sit well with me nearly 6 years later. That was never what we wanted out of that project and it wasn't what was planned or accepted when we wrote our proposal.


How much did access cost?

Ideally, I'd prefer the data be free, but if the fee was (mostly) nominal, I'd consider that almost as good...


Simply put, this government isn't ours. It belongs to the corporate heads and the monied elite and the lobbyists who write the laws that are usually summarily passed by Congress.


The FAIR principles make a lot of sense to me: Findable, Accessible, Interoperable and Reusable.

https://en.wikipedia.org/wiki/FAIR_data


I was kind of concerned right off the bat with that numbered list:

1. public information should be open by default to the public in a machine-readable format, where such publication doesn’t harm privacy or security

I'm sure literally everything that they wish to keep opaque will be declared to be covered under one or both of these incredibly vague categories and nothing will significantly change. Is there any elaboration in the bill that defines what they can call a matter of privacy or security? Even if there is, it wouldn't matter much because how are people going to tell if they keep it locked down in the first place? And they would not risk any sort of real blowback for abusing this and getting caught. Tell me there have not been far, far worse scandals that resulted in no consequences for the perps and cowed silence from the public. I don't think they're hiding the X-Files in there or anything, but this won't magically cause a more transparent, just, or equitable government unless it has serious teeth and tight language.

And 2. federal agencies should use evidence when they make public policy

Somehow I wonder if the data from the Kansas experiment will be taken into consideration and turned into public policy by this current administration, or if they will cherry-pick evidence selectively to justify only wildly unpopular legislation because someone (possibly an industry with a conflict of interest?) contrives some p-hacked research to back it up. Just because something is scientific doesn't necessarily mean it's good government. It is often so, but I'm always very wary when they trot out a bill with lots of bold language touting justice and democracy, truth, stuff like that. If the US legislature passes a bill called "protect innocent puppies from being kicked in the name of god and freedom" you can be 95% sure that this bill will enable a great wave of puppy-kicking despite its holy name.


This is a cool project. I was trying to find the source for the project on the page. There’s an oss page [0] but that’s about the software used.

Is this an open source project? Or what’s the way for licensing to use with US data?

[0] https://data-in.place/open-source


>However, it needs to be not only "open", which typically means stashed away in some corner as a spreadsheet or database file, but accessible and useful to people

True, I guess HGTG applies well here:

“But the plans were on display…” “On display? I eventually had to go down to the cellar to find them.” “That’s the display department.” “With a flashlight.” “Ah, well, the lights had probably gone.” “So had the stairs.” “But look, you found the notice, didn’t you?” “Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.’”


It would be extremely helpful if the government data used the same Solid protocol (Linked Data Platform) which will likely be getting widespread use starting next year. (See Tim Berners-Lee's Inrupt.com Solid startup)


What are the signs that make you think Solid will see widespread use, and by whom? I'm genuinely curious.


I agree that the government will need to work on a transparent front-end to make this data universally accessible. Nonprofits without the budgets for advanced tech workers and without volunteers will need clearly organized links to download. They may not know how to do shell scripting.

More data that everyone can use, rather than data that can be owned and commoditized or utilized only by specialists, is a good thing.


I would also mention Data USA by the MIT Media Lab's Collective Learning group, an effort to combine multiple sources into single geographic profiles rich with visualizations and data sources: https://datausa.io


>Making public data open by default can arguably be an important step towards fostering societal equity.

Define equity here?


Not the parent commenter, but open data policies could contribute to leveling the playing field for society. The vast majority of records requests are currently made by corporations and other business interests.


Having accessible data isn't enough. The average person also needs the tools to process it in a meaningful way. This is where corporations are ahead - with relatively unlimited resources to turn that government data into insights that influence business decisions.


I'd actually rather have the raw data accessible instead of a doctored version.


Why not both?

Make the raw data available for those of us who want to write machine-parsing algorithms, and also make it available in human-readable and easily digestible form for the broader public.


> Making public data open by default can arguably be an important step towards fostering societal equity.

Out of curiosity, what’s the argument here for how public data being open by default is an important step toward fostering societal equity?


Here's a paper by the Sunlight Foundation that goes into this a bit:

https://www.scribd.com/doc/263776138/The-Social-Impact-of-Op...


I disagree - that's not a link to a paper by the Sunlight Foundation.

It's a link to a site that wants you to agree to try a monthly subscription before you can download anything at all.

Especially in the context of a discussion about public data, that's an important distinction.

If you can, please provide a URL to the actual content?


The article linked above, hosted on Scribd, is actually one of the official distributions provided by the Sunlight Foundation. The Scribd user who uploaded it was one of the paper's authors.

Having said that, here's a link to the version hosted on the Sunlight Foundation's domain: http://assets.sunlightfoundation.com.s3.amazonaws.com/policy...

You can also verify what I said about the Scribd version by checking out the original press release/announcement here: https://sunlightfoundation.com/2015/05/05/a-new-approach-to-...


Thanks!


What do you mean by "democratize data-science"?


I prefer when it is available as a spreadsheet or database file. The crucial element is my having access to the data. After that, any differential amount of information I have over everyone else is an improvement.

For instance, consider the statistics on accidents between cars and bikes in California. You can get the numbers yourself from the government. When someone says that more accidents are adjudged to be because the bicyclist is at fault, you can reference the truth and ruin the credibility of the person making the assertion, thereby allowing political advancement of your own cause. No one can use the technique against you because you are capable of acquiring the knowledge and won't make wrong assertions. Only other people will make them.

Having verifiable true information over someone is power. It's better non-democratized so long as I fall within the circle of power.


I think it needs to be both. Accessible in its raw format as well as approachable for non-technical users to interact with public information.


I second this. Great for self-hosting on weaker machines or virtual servers and very simple setup in comparison. Such a sweet project!


Totally what I thought too after seeing this: https://twitter.com/smeddinck/status/1032970885148364800 And, yes of course, the models should easily run in the cloud. Could be a whole application series of "make your friends do X", where X is a hilariously remapped activity ... bonus here: it probably does not hurt if the results are somewhat crappy at times.


Not affiliated with the team, but I feel compelled to plug https://gitea.io/en-US/ here ... such a lean alternative to GitLab, I was really happy to find it. Runs just fine for a couple of dozen users on a rather wimpy virtual server.


Wow that looks like a copyright infringement spectacular.


This looks amazing! Added to my to-watch-list. Also, if you are interested in this kind of neuro- / cogsci- stuff with a tech-twist you might want to consider attending this wonderful spring school for an incredibly immersive learning experience: https://interdisciplinary-college.de/


This looks nice, but one can only attend in person, am I right? Or are there any online materials available?



Let's hope that this was a genuine iPhone 1 moment! :)


I've been saying from day one of the "debacle" (release day) that it would likely lead to AWESOME mods if the engine were opened to the community for tweaking and integration with richer game mechanics, with a good chance of something big like Counter-Strike (originally a Half-Life mod) emerging... A universe builder builder, so to say. I'd love that!



Here is a little bit of background on why hydro can be seen as less carbon neutral than, say, wind or solar: http://www.ecowatch.com/the-hydropower-methane-bomb-no-one-w...


It might be, but is still more practical. Both solar and wind also rely on mining, something environmentalists also oppose.

