Hacker News | robustcollector's comments

Elsewhere in this thread I posted a detailed commentary on what the torrent contains.


Perhaps HN readers would appreciate a detailed account of what the NPD torrents contain.

The torrents deliver two files:

  NPD202401.7z  33,456,912,010 bytes (32GB)
  NPD202402.7z  20,548,499,322 bytes (20GB)

Uncompressing NPD202401.7z results in:

  ssn.txt 176,806,109,779 bytes (165GB)
  wc -l ssn.txt ==>> 1,698,302,005 lines

Uncompressing NPD202402.7z results in:

  ssn2.txt 120,722,361,611 bytes (113GB)
  wc -l ssn2.txt ==>> 997,379,508 lines

This is a total of 1,698,302,005 + 997,379,508 = 2,695,681,513 lines.

Each line is a comma separated record with these fields:

ID,firstname,lastname,middlename,name_suff,dob,address,city,county_name,st,zip,phone1,aka1fullname,aka2fullname,aka3fullname,StartDat,alt1DOB,alt2DOB,alt3DOB,ssn

Generally records have ID, firstname, lastname, middlename, address, city, county_name, st, zip, and ssn. Most records do not have the fields for name_suff (name suffix), phone1, aka1fullname, aka2fullname, aka3fullname, StartDat, alt1DOB, alt2DOB, and alt3DOB.
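For anyone who wants to put numbers on "most records do not have" a field, here is a minimal sketch that measures how often each field is non-empty. It assumes the lines parse as ordinary CSV and that there is no header row, neither of which is confirmed above; `fill_rates` and the sample rows are hypothetical, and the rows are made-up placeholders rather than data from the files.

```python
import csv
import io
from collections import Counter

# Field order as listed in the comment above.
FIELDS = (
    "ID,firstname,lastname,middlename,name_suff,dob,address,city,"
    "county_name,st,zip,phone1,aka1fullname,aka2fullname,aka3fullname,"
    "StartDat,alt1DOB,alt2DOB,alt3DOB,ssn"
).split(",")


def fill_rates(lines):
    """Return {field: fraction of records in which the field is non-empty}.

    `lines` is any iterable of text lines (an open file, a StringIO, ...).
    Assumption: standard CSV quoting and no header row.
    """
    filled = Counter()
    total = 0
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        total += 1
        for name in FIELDS:
            # Short rows leave missing fields as None, so guard with `or ""`.
            if (row.get(name) or "").strip():
                filled[name] += 1
    return {name: (filled[name] / total if total else 0.0) for name in FIELDS}


# Two made-up placeholder records: 11 leading fields, 8 empty optional
# fields (phone1 through alt3DOB), then a dummy value in the ssn column.
row1 = ["1", "JANE", "DOE", "Q", "", "19800700", "1 EXAMPLE ST",
        "SPRINGFIELD", "EXAMPLE", "ZZ", "00000"] + [""] * 8 + ["000000000"]
row2 = ["2", "JANE", "DOE", "", "", "19800704", "2 SAMPLE AVE",
        "SPRINGFIELD", "EXAMPLE", "ZZ", "00000"] + [""] * 8 + ["000000000"]
SAMPLE = io.StringIO(",".join(row1) + "\n" + ",".join(row2) + "\n")

rates = fill_rates(SAMPLE)
for name in FIELDS:
    print(f"{name:14s} {rates[name]:.0%}")
```

Because the reader streams one line at a time, the same function should work on a file of any size without loading it into memory.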

There are no emails at all. There is no "@" in the files anywhere. Phone numbers are very rare.

I don't know what the ID number at the head of each line represents. I presume it is an internal index used by the organization that compiled the data. The SSN is at the end of each line.

The files have U.S. addresses only as far as I can tell. Nothing from Mexico, Canada, or other foreign countries.

Many of the lines (records) concern the same person at various addresses. Of 7 random people I personally know and checked on, all had entries. There were between 3 and 20 lines (records) for each of these 7 persons, averaging about 10. The records for a given person usually differed only in the address field. Going by an estimate of 10 records per person, the 2.7 billion lines represent about 2695681513/10 = 269,568,151 distinct persons in the U.S.

The U.S. population is about 337M, of which about 78% is over 18 years of age. In other words, 337000000*0.78 = 262,860,000 Americans are adults. This is pretty close to my estimate of 269,568,151 distinct individuals in the NPD data files.
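The arithmetic behind that comparison can be redone in a few lines. This is only a sketch of the back-of-the-envelope estimate: the line counts, the 10-records-per-person average, and the 337M / 78% population figures are taken from the comment, while the variable names and the low/high sensitivity check are additions for illustration.

```python
# Line counts reported by `wc -l` for the two files.
lines_file1 = 1_698_302_005
lines_file2 = 997_379_508
total_lines = lines_file1 + lines_file2

# Rough average from the 7 people checked (observed range: 3 to 20).
records_per_person = 10
est_persons = total_lines // records_per_person

# Population figures quoted in the comment.
us_population = 337_000_000
adult_share = 0.78
est_adults = round(us_population * adult_share)

ratio = est_persons / est_adults

# Sensitivity: the same division using the extremes of the observed range.
est_low = total_lines // 20
est_high = total_lines // 3

print(f"total lines:        {total_lines:,}")
print(f"estimated persons:  {est_persons:,}")
print(f"estimated adults:   {est_adults:,}")
print(f"persons / adults:   {ratio:.3f}")
print(f"range (20 to 3 records per person): {est_low:,} to {est_high:,}")
```

The estimate lands within a few percent of the adult population, but it depends heavily on the records-per-person figure, which here comes from a sample of only 7 people.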

Of the 7 persons I checked on, the names were spelled correctly, although the middle name was sometimes just an initial. I searched each person by multiple methods (address, last name, birth date) so I believe I would have detected names that were spelled slightly wrong.

The addresses appeared correct, but there was no way to tell which one was current or in what order the person had lived at them. There is a StartDat field but it was almost never filled in. The last entry for a person was not always the most current address. In a couple of cases, the current address, where the person has been living for several years, was absent.

The birth dates were correct in a couple of cases, were abbreviated in three cases (that is, instead of showing 19800704, meaning July 4 1980, the field showed 19800700, meaning July 1980 without an exact day), and were wrong for one person by a wide margin.
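Anyone parsing the dob field needs to handle the abbreviated form, since a standard date parser will reject day 00. Below is a minimal sketch of one way to do that. `PartialDate` and `parse_dob` are hypothetical names, and treating a value with month 00 and day 00 as "year only" is an assumption; only the day-00 form is described above.

```python
from datetime import date
from typing import NamedTuple, Optional


class PartialDate(NamedTuple):
    year: int
    month: Optional[int]  # None when the month is unknown
    day: Optional[int]    # None when the day is unknown


def parse_dob(raw: str) -> Optional[PartialDate]:
    """Parse a YYYYMMDD string in which day 00 means 'day unknown'.

    Returns None for anything that is not 8 ASCII digits or not a
    plausible (possibly partial) calendar date.
    """
    raw = raw.strip()
    if len(raw) != 8 or not (raw.isascii() and raw.isdigit()):
        return None
    year, month, day = int(raw[:4]), int(raw[4:6]), int(raw[6:])
    if month == 0:
        # Assumption: 'YYYY0000' means only the year is known.
        return PartialDate(year, None, None) if day == 0 else None
    if day == 0:
        return PartialDate(year, month, None) if 1 <= month <= 12 else None
    try:
        date(year, month, day)  # rejects impossible dates such as Feb 30
    except ValueError:
        return None
    return PartialDate(year, month, day)


print(parse_dob("19800704"))  # full date
print(parse_dob("19800700"))  # month known, day unknown
print(parse_dob("19801332"))  # not a valid date
```

Keeping the unknown day as `None` rather than silently substituting the 1st avoids presenting a partial date as an exact one.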

All 7 persons I checked had SSNs. I could confirm the SSN was correct for 1 person, but I don't know for the other 6. The SSNs were consistent for each of the 7 persons I checked on. By this I mean that no person had more than 1 SSN, at least among the 7 persons I checked on.

