Hacker News new | past | comments | ask | show | jobs | submit login
Datasette is my data hammer (jeremiak.com)
299 points by Tomte on Jan 18, 2023 | hide | past | favorite | 69 comments



Two things I love about this post:

1. It's by a data journalist. Datasette was originally designed for data journalism (though it's useful for all sorts of other things) so it's really great to see it being enthusiastically used in this way

2. Jeremia talks at length about Datastte's URLs! This is a key design principle of Datasette: anything you see (a filtered table, SQL query results, a configure chart visualization) should be reflected in the URL of the page, such that you can bookmark it and share it with others. And any time you can see data you should be able to add .json to that URL (or .csv or, via plugins, other extensions like .atom or .geojson) to get that data back in a useful format.


As for 2:

I think this is quite generally a quality feature for web apps and dynamic web interfaces, that should always be considered.

If they get too large/unwieldy there is always the option to use base64 encoding or other formats and tricks. Human readable URLs are always nice but I rather have something ugly than no URL state at all in many cases.

If the output changes frequently over time, so it might get stale, then one can still add a timestamp and an additional warning in the UI with a link to recent data.

What I'm saying is that there is rarely a good excuse to not do this.

In fact I think when I haven't been following this principle it was typically due to self-inflicted, bad decisions paired with time constraints.


On 2., Tim Bray's write-up on uri/urls as the integration point powering AWS is a good write up on the virtues of urls at scale, as a universal integration point.

https://www.tbray.org/ongoing/When/202x/2021/10/26/URI-based...


Datasette is such a neat project & tool. Simon, its developer, has built an entire ecosystem of tools around sqlite [1], Datasette [2], and automation through GH Actions, like git-scraping [3]. Also, he has been actively blogging for 20+ years [4]. I particularly enjoy his short TIL posts [5].

[1] https://github.com/simonw?tab=repositories&q=sqlite&sort=sta...

[2] https://github.com/simonw?tab=repositories&q=datasette&sort=...

[3] https://simonwillison.net/2020/Oct/9/git-scraping/

[4] https://simonwillison.net/

[5] https://til.simonwillison.net/


Simon also did some brilliant stuff with Datasette based on a 2-hour long mp3 of train announcements that ScotRail released as part of a Freedom of Information reponse. A journalist called Jon Brady found it initially on the ScotRail website and tweeted about it, I chopped it up into individual announcements and crowdsourced the transcription, and Simon built a dashboard to explore them all in Datasette.

Simon's blog: https://simonwillison.net/2022/Aug/21/scotrail/

The Datasette dashboard: https://scotrail.datasette.io/


He's also the co-creator of django!


I tried datasette for the first time not long ago. To me it wasn't obvious what it gave me compared to my local Postgres setup.

I have a single docker-compose file with Postgres + pgAdmin and a host mount which I use for data exploration and ad-hoc analysis. I get a nice Web UI and can ingest quickly using the awesome Postgres FDWs.

Compared to datasette, it allows me to also transform the data by creating derived views and tables. The UI of pgAdmin also is much more fully-featured, but depending on the user that could well be a disadvantage. I guess the other thing datasette might be better at is the sharing use case by somehow exposing your instance and sharing links, I can do that with pgAdmin as well though.

Overall I'd probably need to try it out some more but currently lack the motivation because I can't think of any big advantage over my old setup.


One big advantage is in publishing data online: that as one of Datasette's key original features, and it's something I use constantly:

    datasette publish vercel til.db --app simon-tils
It can deploy databases directly to Cloud Run, Heroku, Fly and Vercel. It can cost pennies per month depending on database size and traffic.

My https://til.simonwillison.net site (and many other sites I run) is deployed like that, using GitHub Actions: https://github.com/simonw/til/blob/81fdedb8e9b047cb059ef22c6...

See also the baked data pattern where I talk about this kind of deployment in a lot of detail: https://simonwillison.net/2021/Jul/28/baked-data/

For ad-hoc data analysis - assuming you already know your way around PostgreSQL - your pgAdmin in Docker solution is hard to beat.

How often do you create a new database?

My favourite thing about working with SQLite is that the overhead of new database creation is equivalent to the overhead of creating a file. In a given day I might casually create dozens of new databases, then throw away most of them. I never found myself doing that with MySQL or PostgreSQL.


That publish command looks sweet! For confidential data I'd have to setup some integration with our infra to get the required access control though. Maybe rather go with the package option using docker and a pipeline that deploys to our k8s where I could handle auth using an existing proxy...

My current workflow doesn't really involve creating databases, I mostly just create foreign tables, normal tables and views in a default DB. From time to time I then just dump-backup everything away and start fresh.

Thanks for the response and your tools in general though, I liked how everything just worked when I tried it :-)


For confidential data you could use one of the plugins that add authentication - datasette-auth0 or datasette-auth-github or datasette-auth-passwords are options there - but yeah, you're probably better off deploying to internet hosting if you already have an auth proxy setup somewhere.


Oh wow thx for the pointer, I overlooked those. For business I would need a more generic oauth- or saml-based one to work with our Keycloak, but for quick private/hobby sharing the password one looks perfect.


Meant to say internal hosting, not internet hosting.


Your preaching to the choir Simon.

...But the tenor ...whose pitch is a little off... up the back, Is listening even though he's heard it before :-)

I think the ability to so quickly create and consume an SQLite DB is one of the big advantages here. I need to use bigger DBs like PostGres and MSSQL (in previous roles) for bigger tasks, but SQLite is down there with fopen in terms of time to create a DB.

I find myself imagining schemas for problems that I'd like to put into SQLite and possibly DataSette.

Thanks for making DataSette Simon.


I do not think Datasette has much to offer the savvy analyst. The base experience is more-or-less what you can get from DBeaver. However, the plug-in system does dramatically change this: check out Simon’s example with the train announcement audio clips or any of the mapping examples.

Instead, I have used Datasette as a way to distribute data to less technical groups with a minimum of fuss. Without dedicated infrastructure, sharing data bigger than an Excel file can require quite a bit of work, requiring building some kind of custom query/search/visualization mechanism. With Datasette, after I have completed an analysis, I can dump as much data as I want into some (fairly wide) tables, give a three minute tutorial (This is like a web version of Excel, and this is how you filter rows), and leave them to explore. This has allowed me to offer alternatives reports by just creating a new view into the data. As the underlying technology is SQLite, it can handle large datasets without problem.


Thanks, I read this as a vote for the sharing use case. Actually the UI immediately reminded me of phpmyadmin, but trimmed down to the bare necessities. I'll be watching out for a chance to try it again when I have a share-with-less-technical-people use case :-)


Question: Did you add plugins?

Datasette is basically entirely plugins, such that you can create radically different "sites" and tools out of different combinations of plugins. Datasette without any plugins is useful, but where it's functionality shines is entirely in plugins.


I did not, just had a quick glance at what was available.

What is your use case and which plugins would you recommend? Providing some kind of bare bones report/dashboard with the visualization plugin?


In my case, I log the GPS data from my car, but wanted a private way to explore that (obviously sensitive) data.

https://datasette.io/plugins/datasette-cluster-map is a really neat example of a "just works" plugin: If you have latitude and longitude columns, it will plot them!

But there are numerous plugins for certain types of visualizations, there's a bunch for handling editing, inserts, uploads, etc.

Some of the examples are pretty neat examples of what you can do with a bit of tuning: https://global-power-plants.datasettes.com/global-power-plan... is pretty neat.


Will datasette automatically provide useful graph views of time series? Or is there a good plug-in of that?

One thing I’m trying to get better at is resource tracking (mostly energy) but it’s pretty clunky using a spreadsheet (likely because I’m not super knowledgeable and really don’t care).


https://datasette.io/plugins/datasette-vega is the original visualization plugin, built on Vega. Example here: https://datasette.io/content/stats?_facet=package&package=da...

I've been hoping to build a new one around Observable Plot for a while now.

This plugin still requires you to pick what columns you would like to see plotted. Ideally I'd like it to take an informed guess based on the data and maybe offer you some thumbnails of suggested charts which you could then click to view larger.


Does pgadmin have drag and drop to import files? So often I have a few parquet or CSVs from various sources (e.g. S3) and need to quickly summarize or join them. Quick imports and summarization is one thing I've always found painful in most tools.


If it has I haven't used it yet - I just have a collection of snippets to create foreign tables (e.g. one based on a CSV I put into the host mount before).


Do you have a solution (with your go-to-tools) to put data with coordinates directly into an interactive map, for example leaflet? That is something I really like (and use extensivly) about datasette respectivly the plugins.


I'm showing my age, but the first thing I thought of when reading the headline was the Commodore Datasette.

https://en.m.wikipedia.org/wiki/Commodore_Datasette


Datasette was named after that disk drive, because the first code I ever wrote was saved to a floppy disk using a C64.

I had also assumed it would be a good unique name that I could subscribe to search results for... only to find that the Commodore 64 vintage computing community is still way more active than I had ever imagined!


I immediately thought of the musician Datassette - proponent of the micro-genre Business Funk.

https://www.youtube.com/watch?v=zgWa7NoJGWc


I did, too :)

For those that don't know: Datasette is the author/curator of https://musicforprogramming.net


...and he contributed music and SFX to Space Rubbish, an awesome bullet-hell Asteroids clone. https://vimeo.com/18472447


Clicked because I thought at first this was a retro computing post about a Commodore cassette tape drive. Thanks for the intro to this useful tool.


Interesting project. I would've liked to have heard a brief comparison with Excel - what makes Datasette something that one should reach for over that? Is it just the ability to use SQL in-place?


When I've talked to startups in the past I've always warned them to be careful about accidentally building a product which competes with a potential customer's existing good Excel spreadsheet... so it's ironic that I've ended up myself focusing on something that falls into that bucket!

Right now I'd say the big things are:

1. Size of data. Datasette is a great tool for when your data is too big for Excel but not so big that you need a data warehouse - think 100MB to 10GB range

2. Interoperability and automation. Building tools in other languages that interoperate with an Excel spreadsheet - a web crawler that updates it constantly over time for example - is possible but in my experience pretty hard. Writing to a SQLite file is usually much easier in comparison

3. Publishing data. This is a core concept in Datasette: the ability to publish your data online so other people can explore and interact with it.

4. Plugins. Writing plugins for Datasette is pretty easy, and there are over 100 now: https://datasette.io/plugins - so if it doesn't do something you need, you can add that feature independently of the core project

5. SQLite generally. Some projects are a better fit for a database than a spreadsheet, and SQLite is a /really/ good (often underestimated) database.


Hi Simon, thanks so much for taking the time. Been following your excellent work with Django for many years.

We took a look at DS several months ago (I believe it popped up here on HN) but I confess that until I read the linked article I hadn’t fully understood that it was a “hammer”, and a good one at that.

At that time we set it aside in favour of AirTable which for our purposes - to relate this to one of your other comments here - is sort of Access with a web interface plus authentication and (frankly not very granular) authorisation.

The main reason was because it seemed to us that Datasette didn’t have first class support for auth - that it was sort of positioned as a community or internal tool, rather than something we could build out a client interface on. I just had a look at the plugins and noted that there is something for auth, although it too looks rather to be for internal tooling. Is that a fair characterisation, or do we have it wrong? Because we love everything else about DS - open source, built on SQL and SQLite, etc. - and would vastly prefer to use it over AirTable.


Airtable is a fantastic piece of software. I'd love Datasette to work as an alternative for it some day but there is a whole lot of stuff to build to get there.

The most relevant plugin for what you're describing at the moment is probably the Auth0 one: https://datasette.io/plugins/datasette-auth0

Quite a lot of assembly is required at the moment if you want to do interesting custom things with authentication and permissions - the documentation here is the best starting point: https://docs.datasette.io/en/stable/authentication.html


Fantastic - thanks Simon, much appreciated.

Yes we generally like the AirTable UX, too, and it’s a good fit for client-facing stuff in addition to internal tooling. It is slow, though, and there are no native desktop or mobile clients. Mainly, though: building a business on it comes with all the usual pitfalls of VC-backed SaaS.

It’s very helpful to hear your thoughts on the project’s trajectory. We’ll take another look at DS and the auth0 module, thanks. It looks very promising and we don’t mind a bit of assembly.


Replying to myself to say that - with hilarious timing - one of our business-critical AirTable bases is suddenly throwing an error and has become completely inaccessible right when we need it. AirTable support does its best to try to funnel you into the "helpful community".

Cancelling our subscription today and migrating to Datasette, hell or high water. Glad this happened now and not 2 years down the track.


Some things which stand out to me at a glance:

  - it’s open source
  - it allows publishing to the web
  - it provides handy web-based tooling, like turning your data into a JSON api
  - it provides a user-friendly UI but also lets you dig deeper with plain old SQL (pro to some, con to others)
  - apparently has an extensible core with a rich plugin ecosystem which allows it to meet a wide range of needs
I’m sure there’s more, but these are all really interesting to me.


Just taking the first point, being open source means that I don't need a licence to use Datasette and that I can run it on all three OSs I have to work with (Linux, macOS and Windows).


It’s got plugins for some stuff Excel don’t. I’ve used one for map/gis data to look at lat/long data on a map. Never seen that in Excel. The dataset was too big for Excel, too.


I think the best comparison would be with Access, seems to be the all-in-one data tool that tried to be.


Yeah, I'm constantly surprised at Microsoft's lack of investment in Access. The world clearly needs a really great, easy to use personal database.

Datasette isn't there yet but it's a useful ambition for the future.


FYI: I’m currently working on a full-fledged SQLite hosting platform with similar features as Datasette. Check it out at https://airsequel.com. So far we provide an instant GraphQL API, an SQLite workbench, a spreadsheet UI, and dashboards.


Only tangentially related: what is the best general-use single hand tool? My feeling is that it would be some kind of small sturdy knife.

"When all you have is a knife, a lot of stuff looks passably doable"


Context is everything. Outdoors? Probably a very sharp machete. Working in a genius bar? Probably that weird-ass hextool screwdriver thing.


Maybe a billhook over a machete if you're going to be splitting logs for kindling. Bit more versatile.


For general house projects, I’d say a Visegrip. I’m not saying it can do everything well, but it could be used as a hammer, pliers, adjustable wrench, and I’ve even used it as a screwdriver in a couple of situations. It’s not the best at any of those things, but it can do all of them if necessary.


Have you used it as a screwdriver by holding a screwdriver bit or using the grip directly on the screw? Or something else I'm not imagining?


I knew an outdoor person that did almost everything with an Opinel knife when being on tour.


Indeed a blade is one of the most useful things that you can’t replicate with naturally occurring items (a log or a rock can be used to hammer things, for example, or vines can be twisted into rope).

Opinel knives in particular are quite handy, thanks to the good quality of the blade. They can take a beating and keep their edge, plus they fold and fit in the pocket (up to No. 9 with ease), so they can fulfill the functions of fancier “survival knives”.

Add a ferrocerium rod for easy fire starting, and a good book for downtime by the fire, and you’re golden.


You should definitely check out knapping if you're interested in how people used to produce blades solely from natural materials. Knapped stones can even be superior to modern knives in certain respects (sharpness, shape for purpose, weight, etc).


It's hard to argue that a utility knife is one of the top tools. When I'm doing DIY and woodworking, the tools I'm most likely to have in my pocket with me are:

- Tape measure. - Utility knife. - Mechanical pencil. - Distant 4th: Screwdriver. - Even more Distant 5th: Linesman's Pliers (but I'm doing a lot of electrical work right now)

In particular, a few years ago I switched to the Milwaukee FASTBACK utility knife. They have a belt/pocket clip, storage for extra blades, and different models have additional features like: scoring, gut hook, wire stripper, bottle opener, and they have a new one with a bit driver built in, and they're cheapish at $10-$20. Early ones had a far-too-easy to press blade removal button, but more recent versions have added a metal tab to protect it.

On the screwdriver front, I went on a bit of a deep dive when Linus Tech Tips released their own $70 driver, which does look awesome, but in the end I decided I'm just not that picky and the one I usually reach for is the cheap "Amazon Basics 12-in-1". Fairly affordable at $13, and works fairly well.

I had started off the attempt to replace my ~20 year old multi-bit ratcheting driver (because the plastic was starting to break down) with a ChannelLock that Wirecutter fawned over, but I ended up hating that one, largely because it used ball bearings for retaining the bits and 80% of the time I don't get the bit fully seated and then the bit falls out. I also tried the Williams WRS-1 at the recommendation of Project Farm, and it indeed does have a much better ratcheting mechanism, more akin to the LTT, but in the end I decided things like a little bit of wobble and slop don't really matter to me when I'm using it. Having 4-6 Amazons that are just laying around, vs. 2 Williams or 1 LTT for the same price is probably more valuable to me.


I see there is a Mac desktop app, is there plans for similar for Linux or Windows? I'm assuming I can just install an run similar to jupyter notebooks in any case.


Yes, if you know how to use pip you can run "pip install datasette" and use it that way.

I'd love to get the desktop app working on Linux and Windows.

I did manage to get a prototype working on Windows, despite having VERY little experience working on that platform: https://github.com/simonw/datasette-app/issues/71

The bit I'm stuck on is how to turn that prototype into an application with an installer that's signed so people can download and run it.


Looking through some of the examples[1], it seems the WebUI is pretty underwhelming and demands much optimization and modernization. Are there alternative UIs available?

[1] https://datasette.io/examples


I'm definitely keen on suggestions for improvements I can make to the default UI.

Datasette provides both a JSON API (easily enabled for CORS access) and supports custom templates, so it's possible to customize the UI any way you like.

So far I've not seen many examples of extensive customization. I use the custom templates a lot myself - these four sites are all just regular Datasette with some custom templates:

- https://datasette.io/

- https://til.simonwillison.net/

- https://www.niche-museums.com/

- https://www.rockybeaches.com/us/pillar-point

Source code is on GitHub for all four.


> I'm definitely keen on suggestions for improvements I can make to the default UI.

Have you looked at other modern data-apps? AirTable? Notion? Jupyter Notebooks at least?

> Datasette provides both a JSON API (easily enabled for CORS access) and supports custom templates, so it's possible to customize the UI any way you like.

Seems promising. But how much does the JSON API support interactive work with data? The UI I've seen just reloaded the whole dataset it seems? There seems to be no way to specify queries and views from the UI. Would the API have means to manage such side-data?


What specifically would you want to be different?

It seems to me to serve the purpose well:

https://global-power-plants.datasettes.com/global-power-plan...


> What specifically would you want to be different?

It's sold for data exploration, so i except some useful data-app with an up to date interface, dedicated for interactive navigation and presentation of data.

> It seems to me to serve the purpose well:

Not really. This looks like something that could even run on Internet Explorer 2. It has significant flaws in terms of UI and UX, and lacking nearly everything I except here.

It's incredibly wasteful on space. I must scroll down two screens to even see some data. For some reason the map is high jacking the scroll down, and the map is so broken that I end with multiple earths displayed, with only one having actual data. Changes to the query are not in-place, the whole side is reloads every time. Adding a new filter only seems to work after a reload. And all information are really sparse and scattered randomly around. And finally, there seems to be no way to shape the actual output, which would be the main job of a data explorer.


Can Datasette be used with a Postgres RDS database with hundred GBs? Feels like it's mostly designed for smaller things? Also if it can, how would I go about it and if it can't, any similar tools I can use? This looks really cool!!


I have a sister project to Datasette called Django SQL Dashboard which works against PostgreSQL databases: https://django-sql-dashboard.datasette.io

It used Django for the authentication layer but can otherwise work against any PostgreSQL database.

I partly built it to help explore what Datasette could look like if it expanded to work with more databases than SQLite. That's still something I'm considering doing in the future, via a plugin hook, but it's not on my short-term roadmap.


You can easily put a hundred GB or two into a SQLite database


This is a self-plug but shapedata.io does similar things on top of Postgres -- it is much younger than Datasette, so has far fewer features, but should give you a similar exploration experience!


I'm keen to check this out, I tried to sign up but got a 403 during the google auth flow.

Along with the message: The developer hasn’t given you access to this app. It’s currently being tested and it hasn’t been verified by Google. If you think you should have access, contact the developer


Really sorry about that -- I should've turned off test mode now for Google OAuth. Login work ok now?


Yep, in now. Will play with it tomorrow. Thanks, looks like a great project.


Will give it a try! Want to use it on top of replicas of our work production DB like in an exploration day setting and see what comes out of it, looks awesome!!


It's just for sqlite databases.


Is there something equivalent to Datasette but with duckdb as the sql engine?


I've been contemplating adding a plugin hook to Datasette so that it can work against engines other than SQLite, and my first two goals if I get that working are DuckDB and PostgreSQL.

It's not on my immediate roadmap though.


I'm disappointed. I came here hoping for this: https://www.c64-wiki.com/wiki/Datassette




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: