Hacker News | rogerbinns's comments

Unless you compile SQLite yourself, you'll find the maximum mmap size is 2GB, i.e. even with your pragma above, only the first 2GB of the database are memory mapped. The limit is defined by the SQLITE_MAX_MMAP_SIZE compile-time constant; you can use pragma compile_options to see what the value is.

https://sqlite.org/compile.html#max_mmap_size

Ubuntu system pragma compile_options:

    MAX_MMAP_SIZE=0x7fff0000
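If you want to check it programmatically, the compile-time options are also exposed through the C API. A minimal sketch (the database filename is just a placeholder) that lists them and reports the effective mmap_size:

    /* Sketch: list SQLite compile-time options and the effective mmap_size.
       Build against your system SQLite, e.g. cc check.c -lsqlite3 */
    #include <stdio.h>
    #include <sqlite3.h>

    int main(void) {
        /* Compile-time options, including MAX_MMAP_SIZE if it was set */
        for (int i = 0; sqlite3_compileoption_get(i) != NULL; i++)
            printf("%s\n", sqlite3_compileoption_get(i));

        sqlite3 *db;
        if (sqlite3_open("test.db", &db) != SQLITE_OK) return 1;  /* placeholder filename */

        /* PRAGMA mmap_size with no argument reports the current limit, which is
           capped at SQLITE_MAX_MMAP_SIZE no matter what you request */
        sqlite3_stmt *stmt;
        if (sqlite3_prepare_v2(db, "PRAGMA mmap_size", -1, &stmt, NULL) == SQLITE_OK) {
            if (sqlite3_step(stmt) == SQLITE_ROW)
                printf("mmap_size = %lld\n", (long long)sqlite3_column_int64(stmt, 0));
            sqlite3_finalize(stmt);
        }
        sqlite3_close(db);
        return 0;
    }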


That seems like a holdover from 32-bit days. I wonder why this is still the default.


SQLite has 32-bit limits. For example, the largest string or blob it can store is 2GB; raising that would require an incompatible file format change. Many APIs also use int in places, again making the limits 32-bit, although there is also a smattering of 64-bit APIs.

Changing this default requires knowing it is a 64-bit platform when the C preprocessor runs, and it would surprise anyone who was happy with the 2GB value.

There are two downsides of mmap - I/O errors can't be caught and handled by SQLite code, and buggy stray writes by other code in the process could corrupt the database.

It is best practice to include the SQLite amalgamation directly in your own project, which lets you control version updates and configuration.
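As a sketch of that pattern: a single wrapper file can set the compile-time options and then pull in the amalgamation, so the configuration lives in your source tree. The values below are illustrative assumptions, not recommendations.

    /* sqlite3_wrapper.c - hypothetical wrapper around the amalgamation.
       Most SQLITE_* options are guarded by #ifndef in sqlite3.c, so defines
       made here (or passed as -D flags) take effect. */

    /* Raise the mmap ceiling on a 64-bit build (illustrative value, 1TB) */
    #define SQLITE_MAX_MMAP_SIZE 0x10000000000LL

    /* Typical build knobs - see https://sqlite.org/compile.html */
    #define SQLITE_THREADSAFE 1
    #define SQLITE_ENABLE_FTS5 1
    #define SQLITE_ENABLE_SESSION 1
    #define SQLITE_ENABLE_PREUPDATE_HOOK 1   /* required by the session extension */

    #include "sqlite3.c"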


>There are two downsides of mmap - I/O errors can't be caught and handled by SQLite code,

True. https://www.sqlite.org/mmap.html lists 3 other issues as well.

> and buggy stray writes by other code in the process could corrupt the database.

Not true: "SQLite uses a read-only memory map to prevent stray pointers in the application from overwriting and corrupting the database file."


All great points. Thank you!


The general test suite is not proprietary and is a standard part of the code; you can run make test. It uses Tcl to drive the tests and covers virtually everything.

There is a separate TH3 test suite which is proprietary. It generates C code for the tests so you can run them in embedded and similar environments, and it covers more obscure test cases.

https://sqlite.org/th3.html


Why is that? Surely that leads to conversations with open source contributors like 'this fails the test suite, but I can't show you, please fix it'?


This isn't an issue: SQLite doesn't accept contributions, because they don't want to risk someone submitting proprietary code and lying about its origin.

I've never understood why other large open-source projects are just willing to accept contributions from anyone. What's the plan when someone copy-pastes code from some proprietary codebase and the rights holder finds it?


Partly why they have CLAs I suppose?

If someone sells me something they stole, I'm not on the hook for the theft.


The "plan" is to take out the contaminated code and rewrite it.


If the rights holder is particularly litigious then I could see them suing even if you agreed to take out their code, under the argument that you've distributed it and profited from it. I don't know if there have been any cases of this historically, but I'd be surprised if there haven't been.


Every open source project has the possibility of litigation. Can't always live in fear of the bogeyman.


The same issue is present with the use of LLMs. Are you absolutely sure it didn't just repeat some copyrighted code at you?


SQLite doesn’t accept contributions


I'd love to see an analysis of byte ordering impact on CPU implementation. Does little vs big endian make any difference to the complexity of the algorithms and circuits?


This is how Apple addresses audio hardware, and they do something similar for the speakers. Instead of trying to make speakers that have the desired frequency response, or microphones that produce the desired signal, they let the analog hardware do whatever it does.

Then in software they use digital signal processing. For the speakers they modify what gets sent to the hardware so that the actual output does match the desired frequency response, and for the microphones they do similar work to extract the desired signal.

If Linux addressed the speakers as is, you would get unpleasant sound, and if it read the microphones as is, it would get a lot of noise. That is why Asahi had to add digital signal processing to the audio input and output, to get the "correct" audio.

It does mean the processing is specific to the analogue audio hardware in each of the different Mac models.

The processing could be done in additional hardware, but why bother when you have a very good CPU that can do the work.
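As a rough illustration of the idea (not Apple's or Asahi's actual pipeline, which chains several filters per speaker): a calibration-derived FIR filter is convolved with the samples on their way to the hardware. The taps below are made up purely for the sketch.

    /* Rough illustration of frequency-response correction via an FIR filter.
       The taps would come from per-model calibration data; these numbers are
       invented for the sketch. */
    #include <stddef.h>

    #define NUM_TAPS 4
    static const float correction_taps[NUM_TAPS] = {0.9f, 0.12f, -0.05f, 0.02f};

    /* Convolve the input samples with the correction filter.
       out must have room for n samples; history before the buffer is treated as silence. */
    static void apply_correction(const float *in, float *out, size_t n) {
        for (size_t i = 0; i < n; i++) {
            float acc = 0.0f;
            for (size_t k = 0; k < NUM_TAPS && k <= i; k++)
                acc += correction_taps[k] * in[i - k];
            out[i] = acc;
        }
    }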


> For the speakers they modify what gets sent to the hardware so that the actual output does match the desired frequency response

As I understand, this is not a magic pill: it probably won't help to recover frequencies which are suppressed by 30-40 dB, and I assume that if the frequency response graph is too wavy (lots of narrow peaks and dips), it won't help either.

Also, you need to have calibration files to use this method, right?


Yes, you need calibration files for the supported models. You can see the details and explanation at the asahi-audio repository. They also criticize the macOS curves, and point out how some Windows vendors take the same DSP approach.

https://github.com/AsahiLinux/asahi-audio


By the way, I now realize that simply adding an equalizer before the amp might not be enough; speakers typically produce different sound in different directions, so for perfect sound you would need to somehow track the location of the listener's head and adjust the filter curves.


Interesting, does that mean Mac speakers may be great for certain sounds, but not others?


I mean, Apple uses high quality speakers to begin with, as far as laptops go. I'm sure they're not making 40 dB corrections, that would be ginormous.

Yes, I would be very surprised if they weren't using specific calibrations for each model. That's pretty basic.


SQLite has a session extension that can record changes on a local database into a changeset, and you can replay those changes on another SQLite instance. Note that it replays what the changes were, not the queries that produced them. When applying changes you provide a conflict handler. (You can also invert changesets, making for a handy undo/redo feature.)

You can save conflicts to another changeset, and there is also a rebaser to help deal with multi-way syncing. A rough sketch of the C calls is below the links.

https://www.sqlite.org/sessionintro.html - overview

https://www.sqlite.org/session/sqlite3changeset_apply.html - conflict information

https://www.sqlite.org/session/rebaser.html - rebaser
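Roughly, assuming a build with SQLITE_ENABLE_SESSION and SQLITE_ENABLE_PREUPDATE_HOOK (error handling trimmed; "main" and the absence of a table filter are placeholder choices):

    #include <stdio.h>
    #include <sqlite3.h>

    /* Conflict handler: called by sqlite3changeset_apply for each conflicting change.
       Here we just keep whatever is already in the destination database. */
    static int on_conflict(void *ctx, int reason, sqlite3_changeset_iter *iter) {
        (void)ctx; (void)iter;
        printf("conflict type %d\n", reason);
        return SQLITE_CHANGESET_OMIT;   /* or SQLITE_CHANGESET_REPLACE / ABORT */
    }

    /* Record changes made to src and replay them on dst. */
    static int sync_once(sqlite3 *src, sqlite3 *dst) {
        sqlite3_session *session;
        int rc = sqlite3session_create(src, "main", &session);
        if (rc != SQLITE_OK) return rc;
        sqlite3session_attach(session, NULL);   /* NULL = record changes to all tables */

        /* ... make changes on src here, e.g. via sqlite3_exec ... */

        int size = 0;
        void *changeset = NULL;
        rc = sqlite3session_changeset(session, &size, &changeset);
        sqlite3session_delete(session);
        if (rc != SQLITE_OK) return rc;

        rc = sqlite3changeset_apply(dst, size, changeset,
                                    NULL /* no table filter */, on_conflict, NULL);
        sqlite3_free(changeset);
        return rc;
    }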


There's also a CRDT version of this, which allows two databases to be synced to each other in real time (aka, updates to one will eventually make it to the other, and both databases would eventually contain the same data).

It's https://vlcn.io/docs/cr-sqlite/intro , and I find it amazing that this is doable in SQLite. It is perfect for small-scale collaboration IMHO, but it also works to sync between a local client and a remote server (for a single-db-per-user scenario).


Interesting link, it'd be great if their solution meets expectations.

Right now, the proof-of-concept they've provided seems simplistic. Their progress seems to have shifted from cr-sqlite to "Zero" instead. I'm guessing it has something to do with CRDTs being quite app-specific and hard to generalize.

I would want to see this library used in production before hyping it.


In a sense it is quite specific. In a different sense, this is as generic a CRDT as you can get - it's a CRDT on table(s). There's no merging of rows IIRC (unless you write a custom merge, which is supported but probably needs some tweaking and could lead to poor results?).


> in real time (aka, updates to one will eventually make it to the other

The term you're looking for is "eventual consistency".


This is just clobbering one of the divergent copies with per-field granularity.


Rump kernel/anykernel is the concept. The drivers can run in user space with minimal underlying support.

https://en.wikipedia.org/wiki/Rump_kernel


Location: Santa Cruz, California

Remote: Yes

Willing to relocate: No

Technologies: Python (including its C API), C, SQLite (C API), Linux, deep techie generalist

Résumé/CV: https://www.rogerbinns.com/rogerbresume.html

Email: rogerb@rogerbinns.com

My sweet spot is working as a developer in a small team that has to do all the work to ship the product, using Python for practical and productive development and as glue, C for performance and lower levels, and domain-specific tools and languages when necessary.


Where I worked we got an early system as we made graphical desktops for workstations. IBM required that we have a separate locked room for the system!

Despite us having over 20 different brands of Unix systems, AIX was the only one where we had to pay a third-party consultant to come in and do installations and upgrades. That was back in the days of tapes. It used some convoluted volume management system where the tape contents got spooled onto the disk and then the installation absorbed that.

smitty always felt so slow. There was a release where it was sped up. They didn't actually make it do the work any quicker - they just doubled the animation speed of the running man that was shown.


My project APSW may have exactly what you need. It wraps SQLite providing a Python API, and that includes the FTS5 full-text search functionality. https://rogerbinns.github.io/apsw/textsearch.html

You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.

You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.

There is a lot more useful functionality, all done from Python. You can see code in action in the example/tour at https://rogerbinns.github.io/apsw/example-fts.html


Thank you, your project meets my requirements. I want to build a long-memory RAG system for my personal data. I like the commercial offerings like Google Gemini integrated with Workspace data, but I think I would be happier with my own system.


Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers (e.g. for synonyms or Snowball stemming), but written in C?


SQLite itself is in C so you can use the API directly: https://www.sqlite.org/fts5.html#custom_tokenizers

The text is in UTF-8 bytes, so any C code would have to deal with that and with mapping to Unicode codepoints, plus lots of other text processing, so some kind of library would also be needed. I don't know of any such projects.
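If you do go the C route, the first hurdle is getting hold of the fts5_api pointer; the sketch below follows the pattern in the FTS5 documentation (the tokenizer itself then supplies xCreate/xDelete/xTokenize callbacks, which are omitted here).

    #include <sqlite3.h>
    #include "fts5.h"   /* ships with the FTS5 sources; declares fts5_api and fts5_tokenizer */

    /* Fetch the fts5_api pointer from a database connection, per the FTS5 docs. */
    static fts5_api *fts5_api_from_db(sqlite3 *db) {
        fts5_api *api = NULL;
        sqlite3_stmt *stmt = NULL;
        if (sqlite3_prepare_v2(db, "SELECT fts5(?1)", -1, &stmt, NULL) == SQLITE_OK) {
            sqlite3_bind_pointer(stmt, 1, (void *)&api, "fts5_api_ptr", NULL);
            sqlite3_step(stmt);
        }
        sqlite3_finalize(stmt);
        return api;
    }

    /* Registering a custom tokenizer is then roughly: fill in an fts5_tokenizer
       struct with your xCreate/xDelete/xTokenize functions (where the UTF-8 and
       Unicode handling mentioned above has to happen) and call:

           api->xCreateTokenizer(api, "mytokenizer", pUserData, &tokenizer, xDestroy);
    */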


My technique is to give an estimate with error bars. Something like 6 weeks plus or minus 2. That then leads into a discussion of what is unknown/undefined and causing the uncertainty. Sometimes the error bars are larger than the estimate, because I know there will be endless feature creep and UI revisions for something like "analytics". And sometimes the number could be negative, like 6 weeks plus or minus 8. That is when the functionality already exists (e.g. the data is already there - you can already load it into Excel and a pivot table) and could be sufficient.


At my very first job out of college,¹ the VP who led the division that I was part of invited me into his office (I think there was someone else from the team, but this is 34 years ago so who knows) and talked about a little game he had encountered about estimates. He gave us a list of 20 things to estimate with 95% confidence (the only one I remember was the weight of a 747), all given as a range. If we did it right, we would get 19 of the 20 correct. The point was to set the range large enough that you could be confident about the answer.

I kept that as a practice when I did freelancing work, always giving my estimates as a range rather than a single number. It would be nice if agile(ish) practices incorporated this in the estimation process.

1. Technically, my second job since I had a short temp job doing some cataloging work at a local used bookstore between leaving school and starting at that company.

