
OP attempted this using Python.

What would be the fastest way using *nix commands? A naive solution would be something like:

  echo -n password | sha1sum | cut -d ' ' -f 1 | xargs -I hash grep hash pwned.txt



Use look:

  look $(echo -n password | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned.txt

from man page:

NAME

look - display lines beginning with a given string

DESCRIPTION

The look utility displays any lines in file which contain string as a prefix. As look performs a binary search, the lines in file must be sorted (where sort(1) was given the same options -d and/or -f that look is invoked with).

example:

  justin@box:~/data$ time look $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned-passwords-sha1-ordered-by-hash-v6.txt 
  F2B14F68EB995FACB3A1C35287B778D5BD785511:17384

  real 0m0.212s
  user 0m0.005s
  sys 0m0.001s

  justin@box:~/data$ time look $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned-passwords-sha1-ordered-by-hash-v6.txt 
  F2B14F68EB995FACB3A1C35287B778D5BD785511:17384

  real 0m0.002s
  user 0m0.003s
  sys 0m0.001s
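
For anyone who wants to see the same prefix lookup spelled out, here is a minimal Python sketch of what the pipeline plus look amounts to: hash the password, uppercase the hex digest, then binary-search a sorted list for lines starting with it. The names `sha1_upper`, `look_prefix` and `sample` are made up for illustration, and the sample counts are invented; the real file is far too large to load into a list like this.

```python
import bisect
import hashlib


def sha1_upper(password: str) -> str:
    """Uppercase hex SHA-1, like `echo -n pw | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z`."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def look_prefix(sorted_lines, prefix):
    """Return every line that starts with prefix; sorted_lines must already be sorted."""
    # bisect_left finds the first line >= prefix in O(log n) comparisons.
    i = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    while i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        matches.append(sorted_lines[i])
        i += 1
    return matches


# Tiny stand-in for the real file; the counts after the colon are hypothetical.
sample = sorted([
    sha1_upper("password") + ":3",
    sha1_upper("secret123") + ":2",
    sha1_upper("hunter2") + ":1",
])

print(look_prefix(sample, sha1_upper("secret123")))
```

A password that is not in the list simply returns an empty list, which mirrors look printing nothing.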


Hey, I didn't know about this command, neat!

On my laptop, look `time`s at ~10 ms (for comparison, the Python "binary search" script `time`s at ~50 ms).


You can make Python binary search super fast if you use mmap. Here's a version of that I had lying around; it's probably correct.

  import os
  import mmap
  
  def do_mmap(f):
      fd = os.open(f, os.O_RDONLY)
      size = os.lseek(fd, 0, 2)
      os.lseek(fd, 0, 0)
      m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
      return m, size, fd
  
  SEEK_SET = 0
  SEEK_CUR = 1
  
  class Searcher:
      def __init__(self, file):
          self.file = file
          self.map, self.size, self.fd = do_mmap(file)
  
      def close(self):
          self.map.close()
          os.close(self.fd)
  
      def find_newline(self):
          self.map.readline()
          return self.map.tell()
  
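      def binary_search(self, q):
          # Find the smallest offset lo such that the first full line after
          # lo is >= q (or there is no such line). The range [lo, hi) shrinks
          # on every pass, so the loop cannot get stuck.
          lo, hi = 0, self.size
          while lo < hi:
              mid = lo + (hi - lo) // 2
              self.map.seek(mid)
              self.find_newline()          # skip the rest of the line containing mid
              line = self.map.readline()   # b"" at end of file
              if line and line < q:
                  lo = mid + 1
              else:
                  hi = mid
  
          # A probe always skips the line containing its offset, so the very
          # first line of the file has to be checked separately.
          self.map.seek(lo)
          first = self.map.readline()
          if lo == 0 and first >= q:
              self.map.seek(0)
  
          # Yield every consecutive line that starts with the query.
          while True:
              line = self.map.readline()
              if not line or not line.startswith(q): break
              yield line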
  
  if __name__ == "__main__":
      import sys
      q = sys.argv[1]
      s = Searcher("pwned-passwords-sha1-ordered-by-hash-v6.txt")
      import time
      ss = time.perf_counter()
      res = s.binary_search(q.upper().encode())
      for x in res:
          print(x)
      ee = time.perf_counter()
      print(ee-ss)


I did try mmap, both with the plaintext binary search, and with the binary file (you can find a note about it in the HTML source :)

I ended up not mentioning it because for some reason, it was ~twice as slow on my mac... I'm now curious to try it on a decent Linux machine.


Make sure you put a space at the beginning of your command, so you don't leave your password sitting in plaintext in your bash history.


If you're using bash, you'll need to set the HISTIGNORE or HISTCONTROL variable for this to work.


If you're using bash, you can just leave a space before the command, like the other commenter said.


That's true if HISTCONTROL is set to `ignorespace` or `ignoreboth`.

https://www.gnu.org/software/bash/manual/html_node/Bash-Vari...


  read -s -r MY_PASSWORD

Then, after typing your password, you can safely use the $MY_PASSWORD variable.
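
If you are using one of the Python scripts from this thread instead of a shell pipeline, the same idea can be sketched with the standard `getpass` module, which reads the password without echoing it and never touches shell history. The function name `prompt_for_hash` and its `prompt_fn` parameter are assumptions added here for illustration (the parameter only exists so the function can be exercised without a terminal).

```python
import getpass
import hashlib


def prompt_for_hash(prompt_fn=getpass.getpass):
    """Read a password without echoing it and return its uppercase SHA-1 hex digest."""
    password = prompt_fn("Password: ")
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


# Interactive use (blocks until a password is typed):
#   print(prompt_for_hash())
```

The resulting digest can then be passed to look or to a binary search script, so the plaintext never appears on the command line.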


Oh I just learned something, thank you.


Perhaps there's a way to insert GNU Parallel in there to do parallel search of different chunks?

Or just use ripgrep, which integrates multi-core.


That is already doable with xargs itself:

  xargs -P maxprocs

Parallel mode: run at most maxprocs invocations of utility at once. If maxprocs is set to 0, xargs will run as many processes as possible.


GNU parallel gives you some extra features like a runtime log and resumable operation.


If you have 10 computers, give each one its own file containing every 10th line. If each file is 1000 lines, put line 500 at the start, then line 250, then line 750, then lines 125, 375, etc.
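
A rough Python sketch of that layout, as I read the comment: `shard` deals lines out round-robin to the machines, and `bfs_order` lists the indices of one shard in the order a binary search would probe them, level by level (500, 250, 750, 125, 375, ...), presumably so the first few probes all land near the start of the file. Both function names are made up for illustration, and indices are 0-based.

```python
from collections import deque


def shard(lines, n_machines):
    """Give machine i the lines i, i + n_machines, i + 2 * n_machines, ..."""
    return [lines[i::n_machines] for i in range(n_machines)]


def bfs_order(n):
    """Indices 0..n-1 in breadth-first order of the implicit binary search tree."""
    order = []
    queue = deque([(0, n)])  # half-open ranges [lo, hi) still to be split
    while queue:
        lo, hi = queue.popleft()
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        order.append(mid)             # the probe a binary search would make here
        queue.append((lo, mid))       # left half is visited on the next level
        queue.append((mid + 1, hi))   # then the right half
    return order


print(bfs_order(1000)[:5])
```

Writing a shard's lines out in `bfs_order` means the search no longer bisects byte offsets; it would have to walk the implicit tree instead, so this only pays off with a reader written for that layout.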


Assuming you have enough RAM, I wonder how much putting this file into a ramdisk would help speed things up.


Perhaps pushing the definition of a *nix command slightly, but I’d be interested in the performance of https://sgrep.sourceforge.net/



