The Unitedstates Project (theunitedstates.io)
166 points by _pius on May 30, 2014 | hide | past | favorite | 33 comments



Speaking as someone who contributed a few scrapers to the inspectors general project (https://github.com/unitedstates/inspectors-general), I think this is a great and worthwhile effort. It's actually not that hard to contribute a scraper if you know a little Python (and maybe a way to learn a little Python if you don't).
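For a sense of what "a little Python" means here: most of these scrapers boil down to fetching a listing page and pulling out report links. A toy sketch of the idea (the page markup and field names below are made up for illustration; the real project has its own structure and helpers):

```python
import re

# Hypothetical HTML from an inspector general's report listing page.
SAMPLE_PAGE = """
<ul>
  <li><a href="/reports/2014-05-audit.pdf">Audit of Grant Program</a></li>
  <li><a href="/reports/2014-04-review.pdf">Review of IT Security</a></li>
</ul>
"""

def extract_reports(html):
    """Pull (url, title) pairs for PDF report links out of a listing page."""
    pattern = re.compile(r'<a href="([^"]+\.pdf)">([^<]+)</a>')
    return [{"url": url, "title": title} for url, title in pattern.findall(html)]

reports = extract_reports(SAMPLE_PAGE)
```

A real scraper would use a proper HTML parser and follow pagination, but the core job is that small.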

One thing that my friend who works in Open Data has told me is that it's important for websites like this to exist, to be able to point non-technical people at them and say "SEE. THIS is why you can't just publish everything as a PDF".


Wow. If this can aggregate some of the secondary sources that are the main source of Lexis/Westlaw's power, it would be fantastic.


The main source of their power is every single court case, and more importantly, tracking which ones interact with each other and how, such as being overturned.

This not only needs access to PACER, but also a good algorithm and a huge staff to catch up to Westlaw/Lexis.


That's right. I once wrote the code for the now-defunct Courtbot.com, which crawled, stored, and indexed the majority of available U.S. court decisions as they were published. This was years before Google Scholar started indexing opinions.

But even a completely functional Courtbot-style site is really only a competitor to something like Findlaw.com, not Westlaw or LexisNexis. That's because those companies have features that would be difficult to replicate:

- Very complex and comprehensive search options and parameters. This is possible to do but time-consuming and tricky.

- LexisNexis has 15,000 employees, and I suspect a significant number are involved in reviewing cases, summarizing, noting authorities and conflicts, etc. It's not yet possible to replace a trained lawyer reviewing an opinion with a regular expression. :)

- A subscription to LN/WL typically gets you, depending on your package and how much you're paying, far more than just court opinions. You get news articles, journal articles, congressional transcripts, and a slew of databases that can be used to look up info on people, locate assets, etc. A lot of this means licensing deals, and LN/WL effectively gives you a one-stop shop for a wealth of data. Some of this is coming online and is becoming searchable, but not enough to make a real dent.

The one thing that challengers have in their favor is that Lexis and Westlaw are expensive. I've had free accounts because of faculty affiliations or a newsroom subscription, which is grand, but it's cost-prohibitive for many people and businesses. The ABA has published a list of alternatives; note the majority are actually still owned by Lexis and Westlaw: http://www.americanbar.org/groups/departments_offices/legal_...


We're actually working on solving some of these problems at my non-profit, http://freelawproject.org:

- We've built https://www.courtlistener.com to provide a powerful search system with millions of opinions and scrapers for lots of jurisdictions.

- We have the RECAP project that ingests content from PACER: https://www.recapthelaw.org

- We'll start collecting an archive of oral arguments soon (just got funding for that, if all goes smoothly).

Everything we do is open source and open access, so hopefully if we fail, people will take our code and content and keep it alive.

You're right that it's a big challenge, but I think we're making some headway.


Although many folks from the Sunlight Foundation support the project, it has relatively decentralized control:

> This is an unusual, and occasionally chaotic, model for an open data project. the /unitedstates project is a neutral space; GitHub's permissions system allows many of us to share the keys, so no one person or institution controls it. What this means is that while we all benefit from each other's work, no one is dependent or "downstream" from anyone else. It's a shared commons in the public domain.

From http://sunlightfoundation.com/blog/2013/08/20/a-modern-appro...


Just remember to mirror it locally.


For any Malaysian readers out there, here's our version:

http://www.sinarproject.org/


The UK doesn't have a single source, but the two main sources are:

http://data.gov.uk/

http://alphagov.github.io/



Awesome list of resources! I'm currently working on a text-based Twilio app that simplifies updates on how their Senator/Representative votes on major legislation. Further down the line I'd like to tie in direct communication with Senators/Reps where they give a statement on why they voted the way they did, updates on when they're in their local offices, etc.
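The SMS side of an app like that is pleasantly small: Twilio replies to an incoming text with whatever TwiML XML your webhook returns. A minimal sketch of building that response with just the standard library (the legislator/bill fields are illustrative, not a real API):

```python
from xml.etree import ElementTree as ET

def vote_update_twiml(legislator, bill, position):
    """Build a TwiML <Response><Message> body for an SMS vote update."""
    response = ET.Element("Response")
    message = ET.SubElement(response, "Message")
    message.text = "{0} voted {1} on {2}.".format(legislator, position, bill)
    return ET.tostring(response, encoding="unicode")

twiml = vote_update_twiml("Sen. Example", "H.R. 1234", "Yea")
```

In practice you'd serve this from a small web handler wired up as the Twilio messaging webhook, with the vote data coming from one of the project's datasets.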


May I suggest you include committee votes where you can?

I've done a bunch of technology voters guides for Wired and CNET by crawling House/Senate records (what a pain) and that's one thing I always thought would be useful. Not enough attention is paid to them, and many bills don't get to the floor. There were plenty of SOPA committee votes on amendments, but the legislation never made it to the floor.


This is awesome - though I'm amused that the site that clearly represents US data is hosted on an overseas domain... (.io = British Indian Ocean Territory).

Edit: All snark aside though, this really is awesome. I can imagine all kinds of useful things that come out of this sort of structured data, including just interesting information (like demographic patterns of various politicians, etc).


Most of the people on the BIOT islands are American, though. The natives were expelled to build a US military base.


This is a fascinating and shockingly current story. Thank you for the pointer.

"The depopulation of Chagossians from the Chagos Archipelago, that is, the compelled expulsion of the indigenous inhabitants of the island of Diego Garcia and the other islands of the British Indian Ocean Territory (BIOT) by the United Kingdom, at the request of the United States of America, began in 1968 and concluded on 27 April 1973 with the evacuation of Peros Banhos atoll.

...

On April 1, 2010, the British Cabinet announced the creation of the world’s largest Marine Protected Area (MPA) which consists of most of the Chagos Archipelago, homeland of the Chagossians. The MPA will prohibit extractive industry of all kinds, including commercial fishing and oil and gas exploration. Some Chagossians have claimed that this MPA was created to prevent the islanders from returning to the islands.

On December 1, 2010, a leaked US Embassy London diplomatic cable exposed British and US communications in creating the marine nature reserve. The cable relays exchanges between US Political Counselor Richard Mills and British Director of the Foreign and Commonwealth Office Colin Roberts, in which Roberts 'asserted that establishing a marine park would, in effect, put paid to resettlement claims of the archipelago’s former residents'. The cable (reference ID '09LONDON1156')[citation needed] was classified as confidential and 'no foreigners', and leaked as part of the Cablegate cache."

http://en.wikipedia.org/wiki/Depopulation_of_Chagossians_fro...


Didn't the .io domain get repurposed for general use, which is why so many projects and companies have started using it lately?


Not exactly; Google just recategorised it in their indexing system as a generic ccTLD (a technically-country-specific TLD treated as though it weren't).


It's surprising that in this day and age there's no easily obtainable digital data on US demographics.

Yeah, it would be really awesome to just click a few times to see the makeup of a politician's district.

Hope the project does well.


You mean something that would allow "access [to] selected statistics about your Congressional district"? With an informative name like "My Congressional District"? Easily found by navigating to the tools and data section of the government entity that collects statistics about the population?

My Congressional District: https://www.census.gov/mycd/

Other tools from census.gov: https://www.census.gov/data/data-tools.html


That tool helps with the parent's particular complaint, but I think the broader point is accurate. It is definitely too hard to find useful raw data, and it is even harder to find raw data that is already in a useful format. Specifically talking about the census data, their format is custom and complex [1]. They do have an API [2] which makes it easier, but I still have to write code to download a version of the census data that is in a useful format. Why can't I just have a download link to a SQL script, JSON file, or a tarball with a bunch of CSV files?

I have the same question for the United States project. Why YAML for congress-legislators? It is certainly better than creating their own custom format, but I still have to do work if I want to import the data into a database or Excel.

1. http://www2.census.gov/census_2000/datasets/

2. http://www.census.gov/data/developers/data-sets/decennial-ce...
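To be fair, once PyYAML has parsed congress-legislators, flattening it for Excel is only a few lines. A sketch assuming the YAML has already been loaded into Python dicts (the records below are abbreviated, made-up examples of the dataset's nested shape, not real entries):

```python
import csv
import io

# Roughly what yaml.safe_load() would hand back for congress-legislators
# (abbreviated, hypothetical records; real entries carry many more fields).
legislators = [
    {"name": {"first": "Jane", "last": "Doe"},
     "terms": [{"type": "sen", "state": "CA", "party": "Democrat"}]},
    {"name": {"first": "John", "last": "Roe"},
     "terms": [{"type": "rep", "state": "TX", "party": "Republican"}]},
]

def flatten(record):
    """Flatten one nested legislator record into a single CSV row,
    keeping only the most recent term."""
    term = record["terms"][-1]
    return {"first": record["name"]["first"],
            "last": record["name"]["last"],
            "type": term["type"],
            "state": term["state"],
            "party": term["party"]}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["first", "last", "type", "state", "party"])
writer.writeheader()
writer.writerows(flatten(r) for r in legislators)
csv_text = buf.getvalue()
```

The catch, as the parent says, is that every consumer has to write this shim themselves; shipping the CSV alongside the YAML would save everyone the step.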


People are lazy. Anything that works to circumvent that fact should be applauded.


I do not understand your comment. OP lamented the lack of easy access to demographic data, commenting that "it would be really awesome to just click a few times to see make up of a politician's district" and I gave a link to exactly what OP was longing for. The only remaining laziness circumvention is an application that reads your mind. You think bothering to look before complaining is too much to ask of individuals who comment on HN posts?


My point is this: how many people are there as motivated as op? Motivated enough to comment on an obscure website in hopes of being pointed in the right direction?

Perhaps the disconnect is that I -- and maybe op as well -- am thinking in the context of people as a whole, and you are thinking in the context of people as members of HN.


Let's talk about "people as a whole" who are interested in "a few clicks access to congressional district demographics." You think it is too much to ask to have them type "congressional district demographics" into a search box? If you put this search into Google, "My Congressional District" is the first result. I don't know how you make that any easier to find short of creating a mind-reading application.

https://www.google.com/search?q=congressional+district+demog...


>You think it is too much to ask to have them type "congressional district demographics"..

For the majority of the (US at least) population? Absolutely yes. Hence the term "circumvention". No way this fight is won, at least at this point, over a battle of logic.

In this day and age, instigating change needs to be as simple as possible.

Maybe what I'm saying doesn't make sense. If so, apologies.


We use the GovTrack APIs and some from the Sunlight Foundation for http://PlaceAVote.com. They are pretty awesome and well written.


The project I am most excited about is the citation extractor: https://github.com/unitedstates/citation
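(For anyone curious what citation extraction involves: at its simplest it's pattern matching over legal text. A toy Python illustration of the idea -- not the linked project's actual code, which is JavaScript and handles far more citation forms:)

```python
import re

# Matches simple U.S. Code citations like "42 U.S.C. 1983" or "5 U.S.C. § 552".
USC_PATTERN = re.compile(r"(\d+)\s+U\.S\.C\.\s+(?:§\s*)?(\d+\w*)")

def find_usc_citations(text):
    """Return (title, section) pairs for U.S. Code citations found in text."""
    return USC_PATTERN.findall(text)

cites = find_usc_citations(
    "Claims under 42 U.S.C. § 1983 and FOIA requests under 5 U.S.C. 552 ..."
)
```

The hard part, and what makes a shared extractor valuable, is covering the long tail of formats: statutes at large, the Federal Register, CFR sections, and all their abbreviation variants.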


I'm so glad you noticed it :) Do you work in the field? What caught your notice?


I apologize I did not see your reply earlier. I have a project that I am working on that will really benefit from the citation extraction. I am tired of waiting for GPO/CRS to release the Annotated Constitution in XML format. I have been slowly working on getting it into markdown format so that it can be made into epub/html/etc. I have been planning to get in touch with you, but I do not have enough completed yet.


Doesn't www.enigma.io do this?


Both have to do with open data, but otherwise, there are significant differences.

The GitHub @unitedstates project is an open, relatively decentralized directory for finding tools and data related to the United States. Based on the organizations involved in its birth, I'd say its ethos is, broadly, about civic-minded issues. The tools mentioned vary and have different user experiences.

Enigma is a login-required, commercial offering (with a free option, at least for the time being) providing a web application interface to public data, worldwide. It is, at its core, a search engine that lets you drill down into data rows from a common user interface. Its ethos seems to be "find the data you are looking for, whatever your purpose: academic research, business analysis, civics, etc."


Excellent leadership from sinak!


Thanks Craig, but I was barely involved: all the credit should go to Sunlight Foundation and their partners (Govtrack, NY Times) who started the project and did the painstaking work to build the datasets over the course of 2 years.

I helped with a tiny tiny piece (the contact-congress repo), and even that was worked on for months before by the folks at Sunlight (in particular Dan Drinkard and Eric Mill).



