These days you can simply load it into a database (ClickHouse). It will be compressed to about the same size as the 7z file, and the search will take 3 ms:

https://github.com/ClickHouse/ClickHouse/issues/42363
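
Roughly, the idea looks like this (the exact schema and commands are in the issue above; the column names, types and settings here are illustrative, e.g. for clickhouse-local):

  CREATE TABLE passwords (hash String, count UInt32)
  ENGINE = MergeTree ORDER BY hash;

  INSERT INTO passwords
  SELECT * FROM file('pwned-passwords-sha1-ordered-by-hash-v8.txt', 'CSV', 'hash String, count UInt32')
  SETTINGS format_csv_delimiter = ':';

  SELECT * FROM passwords WHERE hash = upper('5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8');

The ORDER BY key gives you a sparse primary index, so a point lookup only has to read a few granules.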




But the point was to make it take less than 1ms. How would 3ms help?


DuckDB is also great for stuff like this. You can replace a MapReduce cluster with a single SQL query.
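
For example, a classic "count by key over a directory of files" job becomes one statement (the shards/ path is made up for illustration):

  SELECT Hash, sum(Count) AS total
  FROM read_csv('shards/*.txt', delim=':', header=False,
                columns={'Hash': 'VARCHAR', 'Count': 'INT'})
  GROUP BY Hash
  ORDER BY total DESC
  LIMIT 10;

DuckDB expands the glob, scans the files and streams the aggregation, which covers a lot of what people reach for MapReduce for.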


I checked DuckDB and your statement appears to be untrue.

    >>> con.execute("CREATE TABLE passwords (hash TEXT, count INT)")
    <duckdb.DuckDBPyConnection object at 0x7fc7bceb55f0>
    >>> con.execute("CREATE INDEX ix_hash ON passwords (hash)")
    <duckdb.DuckDBPyConnection object at 0x7fc7bceb55f0>
    >>> con.execute("COPY passwords FROM 'pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':')")
    100%
It froze while trying to load the data. Nothing happens after it displays 100%.


CREATE INDEX currently has the restriction that the index must fit in memory [1]. As the data is already sorted, creating an index is not necessary anyway. The min/max indexes created automatically by the system are sufficient to complete the query in a few milliseconds.

  D CREATE TABLE passwords (hash TEXT, count INT);
  D COPY passwords FROM '~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':');
  D .timer on
  D SELECT * FROM passwords WHERE hash=upper('5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8');
  ┌──────────────────────────────────────────┬─────────┐
  │                   hash                   │  count  │
  │                 varchar                  │  int32  │
  ├──────────────────────────────────────────┼─────────┤
  │ 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 │ 9545824 │
  └──────────────────────────────────────────┴─────────┘
  Run Time (s): real 0.005 user 0.007455 sys 0.000584
[1] https://duckdb.org/docs/sql/indexes


Based on the headline, it must be under 1 ms. Love the table.


I cannot even ssh into the server after trying to use DuckDB. It is completely dead (with all the ducks, what a misery).

The reason is probably that it builds a full index, in contrast to the sparse index in ClickHouse, and maybe it tries to build it in memory and goes to swap (the server has 32 GB of memory).


Interesting - the database file looks ok, but the data is lost (the table is empty):

  ubuntu@ip-172-31-3-138:~$ ls -l
  total 69561648
  -rw-rw-r-- 1 ubuntu ubuntu 17631031296 Dec 16 23:57 my-db.duckdb
  -rw-rw-r-- 1 ubuntu ubuntu         326 Dec 16 23:53 my-db.duckdb.wal
  -rw-rw-r-- 1 ubuntu ubuntu 16257755606 Jan 21  2022 pwned-passwords-sha1-ordered-by-hash-v8.7z
  -rw-rw-r-- 1 ubuntu ubuntu 37342268646 Dec  2  2021 pwned-passwords-sha1-ordered-by-hash-v8.txt
  ubuntu@ip-172-31-3-138:~$ python3
  Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import duckdb
  >>> con = duckdb.connect(database='my-db.duckdb')
  >>> con.execute("SELECT count(*) FROM passwords").fetchall()
  [(0,)]


Because DuckDB provides ACID guarantees [1], data is loaded in an all-or-nothing manner. As the load was interrupted by the system running out of memory, the table is expected to be empty.

[1] https://en.wikipedia.org/wiki/ACID
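
You can see the same all-or-nothing behaviour with an explicit transaction (using a small sample file here for illustration):

  D CREATE TABLE passwords (hash TEXT, count INT);
  D BEGIN TRANSACTION;
  D COPY passwords FROM 'sample.txt' (SEPARATOR ':');
  D ROLLBACK;                          -- stands in for the interrupted load
  D SELECT count(*) FROM passwords;    -- back to 0 rows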


Curious: Are you affiliated with ClickHouse or any other Columnar DB project in any way? If so, you may want to add that as a disclosure.


Yes, I'm working on ClickHouse, here is my GitHub profile: https://github.com/alexey-milovidov

I'm also trying to follow every existing technology in the data engineering space :)


If you load the data properly (creating the index after insertion, which is definitely preferable in this case), it will load fine and lookups take just milliseconds.
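
Something along these lines (same table as above; the index step is optional here since the data is already sorted):

  D CREATE TABLE passwords (hash TEXT, count INT);
  D COPY passwords FROM 'pwned-passwords-sha1-ordered-by-hash-v8.txt' (SEPARATOR ':');
  D CREATE INDEX ix_hash ON passwords (hash);   -- build the index after the load, not before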

You should also disclose your relationship with a competing project. For the record, I use DuckDB in personal projects and love it. You seem to be misusing it. :)


I have tested DuckDB v0.6.0 (2213f9c946):

  4e17b76fc101c9db7222e0cd8d6f5eee  pwned-passwords-sha1-ordered-by-hash-v8.txt

  select count(*) from read_csv('pwned-passwords-sha1-ordered-by-hash-v8.txt', delim=':', header=False, columns={'Hash': 'VARCHAR', 'Count': 'INT'});
60.32s, 847223402 rows

  create table hashes as select * from ...
OOM :( so I set PRAGMA temp_directory and retried:

  create table ...
144.92s (83.19s on BATCH CREATE, 61.53s on READ CSV)

  select * from hashes where Hash = 'F2B14F68EB995FACB3A1C35287B778D5BD785511'; -- secret123

  0.0269s -- 1st
  0.0043s -- 2nd
  0.0026s -- 3rd
  0.0062s -- 4th
  0.0047s -- 5th
edits: attempt to fix formatting


“These days”? A 32-bit Intel box running MySQL back in 2000 would probably have laughed at this query.

Granted, I’d expect one or two disk seeks, at ~10 ms each. I imagine on modern hardware it would be in the hundreds-of-microseconds range. (Assuming you limited it to 2-4 GB of RAM.)
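
Back-of-envelope, using the numbers from this thread (page size, fan-out and drive latency are my assumptions):

  ~847M rows x ~44 bytes/row        ≈ 37 GB of leaf data
  4 KB pages / ~44-byte entries     ≈ 90-way B-tree fan-out
  internal levels: 37 GB/90 ≈ 420 MB, /90 again ≈ 5 MB  ->  they fit in 2-4 GB of RAM
  so each lookup misses cache on ~1 leaf page: ~10 ms on a 2000-era disk, ~100 usec on NVMe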


Lots of great solutions in the in-memory columnar data space, what a time to be alive.


ClickHouse is not in-memory.


ClickHouse has a table engine for keeping data in memory. It can be used like this:

  create table tableName (.....) Engine = Memory();


But it's not necessary to keep data in Memory, because MergeTree works perfectly.

Here is a slide deck showing why MergeTree is faster than in-memory tables: https://presentations.clickhouse.com/meetup53/optimizations/
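
Concretely, the MergeTree counterpart of the Memory table above would be something like this (schema is illustrative):

  CREATE TABLE passwords (hash String, count UInt32)
  ENGINE = MergeTree
  ORDER BY hash;   -- the sorting key doubles as a sparse primary index

The data lives on disk in sorted, compressed parts, and the hot parts end up in the OS page cache anyway.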


The comparison has the in-memory table doing a full scan while MergeTree uses an index. Kind of a meaningless comparison.


Soooo did they just implement the in-memory store worse?


It's a database



