Unless you compile SQLite yourself, you'll find the maximum mmap size is 2GB, i.e. even with your pragma above, only the first 2GB of the database are memory mapped. It is set by the SQLITE_MAX_MMAP_SIZE compile-time constant, and you can use pragma compile_options to see what value is in effect.
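As a minimal sketch (using Python's standard sqlite3 module; the database name is a placeholder) of checking the compile options and the mmap size actually in effect:

    import sqlite3

    con = sqlite3.connect("example.db")

    # If no MAX_MMAP_SIZE entry shows up, the 2GB default applies.
    for (option,) in con.execute("PRAGMA compile_options"):
        if "MMAP" in option:
            print(option)

    # Request 16GB of mmap, then read back the value actually in effect;
    # it is capped at the compile-time SQLITE_MAX_MMAP_SIZE.
    con.execute("PRAGMA mmap_size = {}".format(16 * 1024**3))
    print(con.execute("PRAGMA mmap_size").fetchone()[0])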
SQLite has 32-bit limits. For example, the largest string or blob it can store is 2GB, and raising that would require an incompatible file format change. Many APIs also use int in places, again making the limits 32-bit, although there is also a smattering of 64-bit APIs.
Changing this default would require knowing it is a 64-bit platform when the C preprocessor runs, and would surprise anyone who was fine with the 2GB value.
There are two downsides to mmap: I/O errors can't be caught and handled by SQLite code, and stray writes by buggy code elsewhere in the process could corrupt the database.
It is best practice to include the SQLite amalgamation directly in your own project, which lets you control version updates and configuration.
The general test suite is not proprietary, and is a standard part of the code; you can run make test. It uses TCL to drive the tests, and covers virtually everything.
There is a separate TH3 test suite which is proprietary. It generates C code for the tests, so the testing can be run in embedded and similar environments, and it also covers more obscure test cases.
This isn't an issue, as SQLite doesn't accept contributions: they don't want to risk someone submitting proprietary code and lying about its origin.
I've never understood why other large open-source projects are just willing to accept contributions from anyone. What's the plan when someone copy-pastes code from some proprietary codebase and the rights holder finds it?
If the rights holder is particularly litigious, then I could see them suing even if you agreed to take out their code, under the argument that you've distributed it and profited from it. I don't know if there have been any cases of this historically, but I'd be surprised if there haven't been.
I'd love to see an analysis of byte ordering impact on CPU implementation. Does little vs big endian make any difference to the complexity of the algorithms and circuits?
This is how Apple addresses audio hardware, for both speakers and microphones. Instead of trying to make speakers that have the desired frequency response, or microphones that produce the desired signal, they let the analog hardware do whatever it does.
Then, in software, they use digital signal processing: for speakers they modify what gets sent to the hardware so that the actual output does match the desired frequency response, and for microphones they do similar work to extract the desired signal.
If Linux addressed the speakers as is, you would get unpleasant sound, and if it read the microphones as is, it would get a lot of noise. That is why Asahi had to add digital signal processing to the audio input and output, to get the "correct" audio.
It does mean the processing is specific to the analogue audio hardware in each of the different Mac models.
The processing could be done in additional hardware, but why bother when you have a very good CPU that can do the work?
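To make the correction idea concrete, here is a hedged sketch in Python with numpy/scipy; the measured response values below are invented for illustration, and real correction relies on per-model calibration data:

    import numpy as np
    from scipy import signal

    samplerate = 48000

    # Hypothetical measured speaker response: a -6 dB dip around 1 kHz and
    # a +3 dB bump around 8 kHz (values made up for illustration).
    freqs_hz = np.array([0, 500, 1000, 2000, 8000, 16000, 24000])
    measured_db = np.array([0.0, 0.0, -6.0, 0.0, 3.0, 0.0, 0.0])

    # The correction is the inverse of the measurement; in practice deep
    # nulls are clamped so they are not boosted by absurd amounts.
    correction_gain = 10 ** (-measured_db / 20)

    # Design an FIR filter with that response (frequencies normalised to
    # Nyquist) and run the audio through it before it reaches the hardware.
    fir = signal.firwin2(513, freqs_hz / (samplerate / 2), correction_gain)
    audio = np.random.randn(samplerate)            # stand-in for real audio
    corrected = signal.lfilter(fir, [1.0], audio)

The real pipeline is considerably more involved than this; the sketch only illustrates the frequency-response correction part.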
> For speakers they modify what gets sent to the hardware so that the actual output then does match the frequency response
As I understand it, this is not a magic pill: it probably won't help to pull up frequencies that are suppressed by 30-40 dB, and I assume that if the frequency response graph is too wavy (lots of narrow peaks and dips), it won't help either.
Also, you need to have calibration files to use this method, right?
Yes, you need calibration files for supported models. You can see the details and explanation at the asahi-audio repository. They also criticize the macOS curves, and point out how some Windows vendors are doing the same DSP approach.
By the way, I now realize that simply adding an equalizer before the amp might not be enough; speakers typically produce different sound in different directions, so for perfect sound you would need to somehow track the location of the listener's head and adjust the filter curves.
SQLite has a session extension that can record changes on a local database into a changeset, and you can replay those changes on another SQLite instance. Note that it replays what the changes were, not the queries that resulted in the changes. When applying changes you provide a conflict handler. (You can also invert changesets, making for a handy undo/redo feature.)
You can save conflicts to another changeset. There is also a rebaser to help deal with multi-way syncing.
There's also a CRDT version of this, which allows two databases to be synced with each other in real time (i.e. updates to one will eventually make it to the other, and both databases will eventually contain the same data).
It's https://vlcn.io/docs/cr-sqlite/intro, and I find it amazing that this is doable in SQLite. It is perfect for small-scale collaboration IMHO, but it also works to sync between a local client and a remote server (for a single-database-per-user scenario).
Interesting link; it'd be great if their solution meets expectations.
Right now, the proof of concept they've provided seems simplistic. Their focus seems to have shifted from cr-sqlite to "Zero" instead. I'm guessing it has something to do with CRDTs being quite app-specific and hard to generalize.
I would want to see this library used in production before hyping it.
In a sense it is quite specific. In a different sense, this is as generic a CRDT as you can get: it's a CRDT on table(s). There's no merging of rows IIRC (unless you write a custom merge, which is supported but probably needs some tweaking and could lead to poor results?).
My sweet spot is working as a developer in a small team that has to do all the work to ship the product, using Python for practical, productive development and as glue; C for performance and the lower levels; and domain-specific tools and languages when necessary.
Where I worked we got an early system as we made graphical desktops for workstations. IBM required that we have a separate locked room for the system!
Despite us having over 20 different brands of Unix systems, AIX was the only one where we had to pay a third-party consultant to come in and do installations and upgrades. That was back in the days of tapes. It used some convoluted volume management system where the tape contents got spooled onto the disk and then the installation absorbed that.
smitty always felt so slow. There was a release where it was sped up. They didn't actually make it do the work any quicker - they just doubled the animation speed of the running man that was shown.
You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.
You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.
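As a rough sketch of the overall flow, using only the standard library's sqlite3 module (with FTS5 compiled in) and its built-in tokenizer rather than the Python tokenizer chaining described above; the file names are placeholders:

    import sqlite3
    from pypdf import PdfReader

    con = sqlite3.connect("docs.db")
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(name, content)")

    # Extract the text of each page and index the whole document.
    reader = PdfReader("paper.pdf")
    text = "\n".join(page.extract_text() for page in reader.pages)
    con.execute("INSERT INTO docs VALUES (?, ?)", ("paper.pdf", text))
    con.commit()

    # Query, ranked by FTS5's built-in BM25 (most relevant rows first).
    query = "sqlite"
    for name, snippet in con.execute(
        "SELECT name, snippet(docs, 1, '[', ']', '...', 8) "
        "FROM docs WHERE docs MATCH ? ORDER BY rank",
        (query,),
    ):
        print(name, snippet)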
Thank you, your project meets my requirements. I want to build a long-memory RAG system for my personal data. I like the commercial offerings like Google Gemini integrated with Workspace data, but I think I would be happier with my own system.
Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers (e.g. for synonyms or Snowball stemming), but written in C?
The text is in UTF-8 bytes, so any C code would have to deal with that and with mapping to Unicode codepoints, plus lots of other text processing, so some kind of library would also be needed. I don't know of any.
My technique is to give an estimate with error bars, something like 6 weeks plus or minus 2. That then leads into a discussion of what is unknown or undefined and causing the uncertainty. Sometimes the error bars are larger than the estimate, because I know there will be endless feature creep and UI revisions for something like "analytics". And sometimes the lower bound can even be negative, like 6 weeks plus or minus 8: that is when the functionality already exists (e.g. the data is already there; you can already load it into Excel and a pivot table) and could be sufficient.
At my very first job out of college,¹ the VP who led the division I was part of invited me into his office (I think there was someone else from the team there too, but this was 34 years ago, so who knows) and talked about a little game he had encountered about estimates. He gave us a list of 20 things to estimate with 95% confidence (the only one I remember was the weight of a 747), each answer given as a range. If we did it right, we would get 19 of the 20 correct. The point was to set the range large enough that you could be confident about the answer.
I kept that as a practice when I did freelancing work, always giving my estimates as a range rather than a single number. It would be nice if agile(ish) practices incorporated this in the estimation process.
⸻
1. Technically, my second job since I had a short temp job doing some cataloging work at a local used bookstore between leaving school and starting at that company.
https://sqlite.org/compile.html#max_mmap_size
Ubuntu system pragma compile_options: