This is a very small detail in that post, but it captures quite well what officialdom means to me and what separates GSA and 18F from other digital efforts: the inclusion of the “tribal” level in the list of levels of authority. 18F builds things so that many people can use the Internet, including, explicitly, the administrations of First Nations.
I’ve complained a lot about how US-based companies do not think about non-US users enough (that common rant obviously doesn’t apply to GSA, although Americans abroad, immigrants, and foreign visitors probably qualify), but in that rant I had forgotten the original Americans. Shame on me.
I have never heard of any start-up asking, “What about First Nations? Do we support the Cherokee alphabet? Is there a Sioux exception to the law we are enforcing in this form?”
pshtt (the HTTPS scanning tool) also powers the results for Freedom of the Press Foundation's recently launched Secure The News project: https://securethe.news. (Full disclosure: I work for FPF, and worked on Secure the News).
One thing I noticed going through the list linked on the page is that many of these .gov sites serve _both_ the www and the no-www versions, effectively making them two different websites with the same content. Example: http://abilityone.gov/ and http://www.abilityone.gov/. It looks like clear guidelines around this are missing. I know of countries whose .gov domains are almost 99% www and don’t serve the no-www version at all.
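(A rough way to eyeball this for any given domain is to hit both hostnames and see whether either one canonicalizes to the other. The snippet below is just an illustrative check I'd run by hand, not anything Pulse does; the domain is the example from above.)

    # Does the apex or the www hostname redirect to the other, or do both
    # answer independently (i.e. two duplicate sites)?
    for host in abilityone.gov www.abilityone.gov; do
        # Print the status code and any redirect target for a HEAD request
        curl -sI -o /dev/null -w "${host} -> %{http_code} %{redirect_url}\n" "http://${host}/"
    done

If both come back 200 with no redirect target, you're probably looking at the duplicate-content situation described above.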
Thanks for the work you're doing on this and for answering questions. I had never seen many of the neat things mentioned in the blog post.
While the article did a good job explaining how pshtt works and how it generates data for the reporting, it didn't dive much into the scanning itself. Since this is posted on Hacker News, I'd love to hear more about the nitty-gritty of the data collection itself.
Can you talk about what sort of setup you run, and what sort of technical and interdepartmental challenges you run into scanning, storing, and obtaining data for 1,143 government websites?
Hi there. First, you've got to begin with the understanding that no one maintains a holistic list of federal .gov websites (or at least not one I can get hold of). So, before scanning, we pull from several public datasets to gather potential .gov hostnames. This was recently described in depth by 18F [https://18f.gsa.gov/2017/01/04/tracking-the-us-governments-p...]. In addition to Censys, GSA's DAP, and the End of Term Web Archive data, our team performs authorized scans of federal agency networks [https://www.whitehouse.gov/sites/default/files/omb/memoranda...], so we mine that data too. This currently nets ~90k hostnames, of which only about a third are responsive.
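(For anyone curious, the merge step can be approximated with plain shell. This is only a minimal sketch; the sources/*.csv layout and file names here are illustrative, not our actual pipeline code.)

    # Merge candidate hostnames (first CSV column) from all source datasets,
    # lowercase them, and deduplicate.
    cut -d, -f1 sources/*.csv \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u > candidate-hostnames.txt

    wc -l candidate-hostnames.txt   # on the order of the ~90k figure mentioned above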
For both hostname gathering and HTTPS scanning, we use 18F's domain-scan [https://github.com/18F/domain-scan], which orchestrates the scan and provides parallelization. We use the pshtt scanner to probe each hostname at the root and www, over both HTTP and HTTPS; a full pass typically takes 36-48 hours to burn through. Once the scanning is finished, we load the CSV output into MongoDB, then generate the report via LaTeX. The trickiest part is probably report delivery, which is a mostly manual process for Very Government reasons.
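(Roughly, the moving parts look something like the following. The exact flags and the database/collection/file names below are illustrative rather than our exact commands; see the domain-scan README for real usage.)

    # 1. Orchestrate pshtt across the candidate hostnames with 18F's domain-scan,
    #    which handles the parallelization and writes CSV results.
    ./scan candidate-hostnames.txt --scan=pshtt

    # 2. Load the resulting CSV into MongoDB for report generation
    #    (database/collection/file names are made up here).
    mongoimport --db https_scans --collection pshtt \
        --type csv --headerline --file results/pshtt.csv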
Most of the bureaucratic challenge is overcome because we've already been doing scans against these executive branch agencies for the past several years, so we're a known quantity, though we do modify our user-agent to clearly point back to us. On the whole, agencies have been very supportive-- the data on Pulse bears that out. Agencies really do want to do the right thing for citizens.
I appreciate you taking the time for an insightful and detailed response. The link you provided, "Tracking the U.S. government's progress on moving to HTTPS[1]" gave a lot of the details I was looking for.
You might consider mentioning it in this blog post as it does offer interesting background information and technical details.
As a specific example, the actual Python scripts used to generate the data[2] and the data itself[3] give a great deal of insight into the question I had.
As it happened, we were migrating production infrastructure to a new service tonight, and had a few minutes of time where the cert was invalid. Sorry about that.
No, the code for report generation hasn't been opened up yet, mostly because it won't work without dependencies that aren't yet public. I think that will change in the next few months; open-sourcing is definitely an intention. It will live at https://github.com/dhs-ncats when released.
This combines some really important checks. I might be able to remove my .bashrc hack ...
function certchain() {
    # Usage: certchain <ip|domain[:port]>
    # Display the PKI chain of trust for a given host
    # GistID: https://gist.github.com/joshenders/cda916797665de69ebcd
    if [[ "$#" -ne 1 ]]; then
        echo "Usage: ${FUNCNAME} <ip|domain[:port]>"
        return 1
    fi

    local host_port="$1"
    if [[ "$1" != *:* ]]; then
        host_port="${1}:443"
    fi

    # Print only the subject (s:) and issuer (i:) lines of each cert in the chain
    openssl s_client -connect "${host_port}" </dev/null 2>/dev/null | grep -E ' (s|i):'
}
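For example (just to show the call shape; the chain you get back obviously depends on the site and your openssl version):

    certchain pulse.cio.gov
    certchain pulse.cio.gov:443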
We (18F/GSA) have been using DHS's tool in production for a few months now, and have fixed various bugs as they've come up.
Before that, pshtt's methodology was replicated in a Ruby tool (site-inspector) that we grafted HTTPS/HSTS detection logic onto, and had that running in production for a year or so.
So in terms of business logic, I think it's pretty mature. If you mean things like having it formally audited or having a dedicated development team, it hasn't gotten there yet. But the more people that use it, the more mature it will get.