Hi there. First, you've got to begin with the understanding that no one is maintaining a holistic list of federal .gov websites (or at least not one I can get hold of). So, before scanning, we source several public datasets to gather potential .gov hostnames. This was recently described in depth by 18F [https://18f.gsa.gov/2017/01/04/tracking-the-us-governments-p...]. In addition to Censys, GSA's DAP, and the End of Term Web Archive data, our team performs authorized scans of federal agency networks [https://www.whitehouse.gov/sites/default/files/omb/memoranda...] and so we mine that data too. This currently nets ~90k hostnames, of which only about a third are responsive.
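The gathering step described above amounts to merging and deduplicating hostname lists from several sources. A minimal sketch of that, with made-up dataset contents (the normalization rules are my assumption, not the team's actual code):

```python
# Hypothetical sketch: merge candidate hostnames from several public
# datasets, normalize them, and keep only deduplicated .gov hosts.
def normalize(host):
    """Lowercase, strip whitespace and any trailing dot."""
    return host.strip().lower().rstrip(".")

def merge_sources(*sources):
    """Union several iterables of hostnames into one sorted .gov list."""
    merged = {normalize(h) for src in sources for h in src if h.strip()}
    return sorted(h for h in merged if h.endswith(".gov"))

# Illustrative inputs standing in for Censys, DAP, EOT, etc.
censys = ["GSA.gov", "cisa.gov."]
dap = ["gsa.gov", "nasa.gov"]
hosts = merge_sources(censys, dap)
# hosts == ["cisa.gov", "gsa.gov", "nasa.gov"]
```

The real pipeline has more sources and messier data, but the shape is the same: union everything, normalize, filter to .gov.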
For both hostname gathering and HTTPS scanning, we use 18F's domain-scan [https://github.com/18F/domain-scan], which orchestrates the scan and provides parallelization. We use the pshtt scanner to probe each hostname at the root and www, over both http and https-- this typically takes 36-48 hours to burn through. Once the scanning is finished, we load the CSV data into MongoDB, then generate the report via LaTeX. The trickiest part is probably report delivery, which is a mostly manual process for Very Government reasons.
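As described, each hostname gets checked at the root and www over both schemes-- four endpoints per host. A sketch of that expansion (this mirrors the description above, not pshtt's actual internals):

```python
# Expand a hostname into the four endpoints the scan checks:
# root and www, each over http and https.
def endpoints(hostname):
    hosts = [hostname]
    if not hostname.startswith("www."):
        hosts.append("www." + hostname)
    return [f"{scheme}://{h}" for scheme in ("http", "https") for h in hosts]

# endpoints("gsa.gov") ->
# ['http://gsa.gov', 'http://www.gsa.gov',
#  'https://gsa.gov', 'https://www.gsa.gov']
```

With ~30k responsive hostnames that's on the order of 120k requests, which is why the run takes a day or two even with parallelization.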
Most of the bureaucratic challenge is overcome because we've already been doing scans against these executive branch agencies for the past several years, so we're a known quantity, though we do modify our user-agent to clearly point back to us. On the whole, agencies have been very supportive-- the data on Pulse bears that out. Agencies really do want to do the right thing for citizens.
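The user-agent tweak mentioned above is simple to do; here's a hedged sketch using Python's standard library (the header value and URL are illustrative, not the ones actually used):

```python
# Tag scan requests with a user-agent that points back to the scanning
# team, so agency operators can identify and verify the traffic.
import urllib.request

req = urllib.request.Request(
    "https://example.gov/",
    headers={"User-Agent": "agency-https-scan (contact: scans@example.gov)"},
)
# urllib.request.urlopen(req)  # would perform the request
```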
I appreciate you taking the time to write an insightful and detailed response. The link you provided, "Tracking the U.S. government's progress on moving to HTTPS"[1], gave a lot of the details I was looking for.
You might consider mentioning it in this blog post as it does offer interesting background information and technical details.
As a specific example, the actual Python scripts used to generate the data[2], and the data itself[3], give a great deal of insight into the question I had.
No, the code for report generation hasn't been opened up yet, mostly because it won't work without dependencies that aren't yet public. I think that will change in the next few months; open-sourcing is definitely an intention. It will live at https://github.com/dhs-ncats when released.