The idea is to shard the data by logical domain and by time segment, so that queries only apply to relatively small and efficiently-read data, and to exploit the embarrassingly-parallel nature of the problem.
I'm implementing the segmenting based on time plus a sharding key that groups records from the same domain onto the same shard. Sharing a shard across several domains keeps the number of segments per time interval from getting too large, so each segment can be bigger and get better index density.
That's an important factor in my indexer design, which amortizes the cost of the indices over a large number of rows. A rough sketch of the routing idea is below.
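Here's a minimal sketch of how I think about routing a record to a (time segment, shard) pair. The specifics are illustrative, not the real implementation: hourly segments, a fixed shard count, and a hash of the domain as the sharding key are all assumptions on my part.

    # Sketch only: hypothetical segment width, shard count, and hash choice.
    import hashlib
    from datetime import datetime, timezone

    NUM_SHARDS = 16          # hypothetical shard count per time segment
    SEGMENT_SECONDS = 3600   # hypothetical segment width: one hour

    def segment_for(domain: str, ts: datetime) -> tuple[int, int]:
        """Map a record to (time_segment, shard) so records from the same
        domain and hour land together, while many domains share a shard."""
        epoch = int(ts.replace(tzinfo=timezone.utc).timestamp())
        time_segment = epoch // SEGMENT_SECONDS
        shard = int.from_bytes(hashlib.sha1(domain.encode()).digest()[:4], "big") % NUM_SHARDS
        return time_segment, shard

    # A query for one domain over a time range only touches the matching
    # (segment, shard) pairs, and different pairs can be scanned in parallel.
    print(segment_for("checkout-service", datetime(2024, 5, 1, 12, 30)))

The point is just that a query scoped to a domain and a time range only has to open the segments it hashes to, and the rest of the dataset never gets read.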
Storing logs in sqlite is definitely a neat way to go for smaller scale stuff, hope you make something cool out of it.
For me, though, I've hit this logging problem at very large scale numerous times in my career and have tried all manner of commercial and OSS solutions without ever being satisfied, so my project is definitely geared toward the problems you have when everything else either stops working or costs more to run than your actual app.
Not saying it won't scale down well. I think my software on a single machine should easily handle at least 20k logs/second (the prototype is much faster at the moment, but lots of features still need to be added) and serve queries concurrently with that on just a few cores and ~8-16 GB of RAM for, say, a 200-400 GB dataset.
I think the three deployment sizes I will optimize for are: single node for demo and benchmark purposes, 3 nodes for a realistic small deployment, and 6-10 nodes for high-volume logging environments like my $DAY_JOB.