We're happy to introduce Dropbase 2.0! It's a tool that helps you bring offline files, such as CSV, Excel, and JSON files, into a Postgres database. You can also process your data before uploading it, using a spreadsheet-like interface or by writing a custom Python script. Once your data is in the database, you can query it using any third-party tool (credentials will be provided). You can also access your data via a REST API (powered by PostgREST) or create custom endpoints to serve a more specific use case.
A bit about the tech:
Currently, we support .csv, .json, .xls, and .xlsx files. For data processing, we use Pandas, so if you are comfortable with Python, you can write your own custom functions to process the data. We also give you a free shared Postgres database to test the tool with (your data is isolated and hidden from others). Each of these databases comes with an instance of PostgREST preinstalled, so you can query your database through a REST API (http://postgrest.org/en/v7.0.0/). You can also generate an access token with an expiry date to share your data with others.
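As a rough sketch of what querying through PostgREST looks like (the host, table name, and token below are placeholders, not real credentials):

```python
import requests

# Placeholders only: substitute the API host, table name, and access token from your project.
BASE_URL = "https://your-dropbase-db.example.com"
TOKEN = "<your-access-token>"

# Standard PostgREST query syntax: fetch rows from the "sales" table where amount > 100,
# returning only the region and amount columns.
resp = requests.get(
    f"{BASE_URL}/sales",
    params={"amount": "gt.100", "select": "region,amount"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # a JSON array of matching rows
```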
There are many more features baked into the product. Come check it out; it's open to the HN community.
"However, by posting Content using Service you grant us the right and license to use, modify, publicly perform, publicly display, reproduce, and distribute such Content on and through Service. You agree that this license includes the right for us to make your Content available to other users of Service, who may also use your Content subject to these Terms"
Does this mean I should have no expectation of privacy or control over anything I upload?
Your data is private and you own all of your data. We do not and will not share your data with anybody else unless you share it yourself through the sharing of projects, pipelines, endpoints, or exports. We do however store and process your data. We also let you generate endpoints so we need some wording to cover these cases. We'll double check our terms to make this point clearer, but we added this because you can generate live endpoints and you can share those.
Unfortunately IANAL, and the formulation of the ToS/PP in yours, and in those of most other online service providers, always gives me that nagging feeling that the legalese leaves so many loopholes and texts open to different interpretations that effectively - even though it may seem so - I have no privacy guarantees whatsoever. That might be entirely unwarranted of me, but the feeling is there. Unease.
> even though it may seem so - I have no privacy guarantees whatsoever.
I mean, fundamentally, really consistent security is hard; and the best you can reasonably expect from someone you're not paying is "best effort". For them to make real promises about security opens them up to being sued if they fail; it's not really reasonable to ask someone to do that unless you're paying them a reasonable chunk of cash to offset that risk.
Sorry, but this almost feels like a GPT-3 response to me.
I don't see what security, paid vs. free or best-effort has got to do with my argument, which is that the loopholes in legalese are so hard to spot for anyone but a lawyer, that effectively my data might still be used in any way and possibly against my wishes or expectations (but which becomes legal when I consent to the PP and ToC).
- How do you handle incremental loads from files (or even google sheets)? Am I able to only load the diff, load full snapshots and get bi-temporality, etc?
- Are you supporting PostgREST as sponsors in any way? It's one of the most solid tools I've ever used, and I love to see that companies build great products on top of it!
- We use pandas to process data and load it to Postgres using .to_sql (https://pandas.pydata.org/pandas-docs/stable/reference/api/p...). For incremental loads, we set "if_exists" to "append" (see the sketch after this list). We're working to add more flexibility to the load function, so you can specify how to load your data, handle conflicts, and so on. We're open to suggestions.
- PostgREST is great. We are not sponsors at the moment, but looking to do this when we can!
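To make that load step concrete, here's a rough sketch of an incremental append with pandas and SQLAlchemy; the connection string, file name, and table name are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: use the Postgres credentials for your own database.
engine = create_engine("postgresql://user:password@host:5432/dbname")

# Read the new batch of rows from a file.
df = pd.read_csv("new_rows.csv")

# if_exists="append" adds rows to the existing table instead of replacing it,
# which is how incremental loads currently behave.
df.to_sql("my_table", engine, if_exists="append", index=False)
```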
I've used Pandas recently, not sure if this will ever help you, but dictionaries are much faster if you continually add rows.
pandas.concat and similar functions for appending to a Pandas dataframe can be quite slow. Just mentioning this in case you ever run into it; maybe you never will. In my case I switched to dictionaries for the important parts and execution time went from 2 minutes to 5 seconds.
However, in my case, I have to change logical row structure and not just read in rows as is.
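A minimal sketch of the pattern I mean (the column names and row counts are made up):

```python
import pandas as pd

incoming = [{"id": i, "value": i * 2} for i in range(100_000)]  # stand-in for incoming rows

# Slow: growing a DataFrame one row at a time copies the whole frame on every append.
df_slow = pd.DataFrame(columns=["id", "value"])
for row in incoming[:1000]:  # only a slice, or this loop would take ages
    df_slow = pd.concat([df_slow, pd.DataFrame([row])], ignore_index=True)

# Fast: accumulate plain dicts in a list, then build the DataFrame once at the end.
rows = []
for row in incoming:
    rows.append(row)  # any per-row restructuring would happen here
df_fast = pd.DataFrame(rows)
```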
Good point. We can probably work on clarifying our value prop. You could do something like what you described with a local database. We could add that by building some desktop code. At the moment we are only focused on the cloud part. That way we can get data, let you process it, and also easily share as APIs or endpoints.
On a somewhat unrelated note, the design of this landing page is fantastic. It is exactly how I like to have new tools presented to me. All the fundamental competencies of the tool displayed on one page, with sufficient, but not verbose, technical detail. Kudos.
Yeah, although in my case it would have been nice to have a 'light' version, as I have some trouble reading the dark grey text on a black background in broad daylight.
I too very much like this landing page. Is it a custom one? And if so, are there any services that would provide a template like this one? I would like to have one like this, but unfortunately, I'm far from a CSS expert.
TailwindUI is fine so long as you don't want to customise the blocks they give you. As soon as you want to do that you're in a world of pain.
Example: I wanted to use their "Hero Sections - with angled image on right" without the navigation where shown (was going to add a standard top navbar). As soon as you delete it, the angled bar doesn't reach to the top. I ended up keeping an empty navbar in that place to keep the block looking right.
Is there a light mode option for the site? Or is iOS just not selecting it for some reason?
My astigmatism makes reading dark UIs migraine inducing, so as cool as this sounds I unfortunately can’t read more about it without triggering a migraine. x_x
(Maybe still default to the dark UI, but if the user has light mode enabled it uses a light UI?)
Oh, so that's what makes dark themes so hard to read for me. Unfortunately there's no easy way out for me, since my eyes are photosensitive due to a separate complication. Between a rock and a hard place :p
I'm with you in the sense that I just learned through these posts that astigmatism and website dark modes don't go well together.
I wonder if lighter colors and soft grey palettes would work for your case. Have you experimented with colors that are easier on your eyes, given your complications?
I haven't actually. I always just used dark mode and assumed the additional difficulty was a drawback everyone experienced and learned to live with it. Now that I know that's not the case I'll see if I can find a color scheme that works well, like you suggested :)
Anyway, +1 for astigmatism and extreme sensitivity to glare. The rise of dark themes is something of a curse for my ability to interact with interfaces these days.
I don't like reading pages with dark background either. I just do ctrl+A on such pages so that the text background becomes blue, making it a little better to read.
I also have astigmatism which makes all dark mode websites and apps difficult to read. As "cool" as this trend and previously "hacker" color scheme has gotten, don't remove light modes!
I’m used to getting “most of our users don’t have an issue, so we don’t care” responses (Robinhood did this for a while before they finally added a toggle, and Spotify straight up doesn’t care), so having a company actually note they’ll look into doing it if possible is really appreciated by me.
This is very cool. I think there's a lot of room to grow this space: local "folders" that do some "magic" in the cloud.
Obviously, sync (Dropbox) is just the beginning, and Dropbase takes it a step further. There have been times when I had a (big-ish) CSV and wanted to run a few tests/queries on it. Auto-importing it into some database and being able to run SQL/Python on the dataset (without bootstrapping that locally) would've been a godsend.
Thank you! Please give it a try and let us know if you have any feedback.
One of the features we added is to make the "magic" replicable (or deployable in production). So we keep track of processing steps and let you export Python code that applies those same processing steps. This could be used to run the exact same steps on a larger version of that dataset later.
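Purely as an illustration (this isn't the actual exported code), the idea is that each UI step maps to one reproducible pandas operation, something in the spirit of:

```python
import pandas as pd

# Illustrative only: the real script is generated by Dropbase from your recorded steps.
df = pd.read_csv("full_dataset.csv")            # later, point this at the larger file
df = df.rename(columns={"Amt": "amount"})       # step 1: rename a column
df = df[df["amount"] > 0]                       # step 2: filter out invalid rows
df["amount"] = df["amount"].round(2)            # step 3: normalize values
df.to_csv("processed.csv", index=False)
```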
If you're open to PHP, I found the perfect tool for myself. The Laravel framework has an ORM called Eloquent. Usually it maps models to database tables (hence ORM), but you can also have it read its rows from a CSV, which it then maps into an in-memory SQLite database.
You can then work with the data through the ORM's methods, with regular SQL, or with external tools like Tinkerwell that display it in a tabular fashion.
This reminds me of @BrandonM's famous reply to Drew Houston[1] :) Of course there are ways of doing it. But sticking something in a folder and stuff just automagically "working" is a much more pleasant workflow -- and more importantly, how you create value. Jimmy, I'd say you're in good company!
Please don't forget about BrandonM's follow up comment though!
I've seen the thread linked as an example of 'tech people don't appreciate simplicity', but he actually acknowledges Dropbox could be very useful and wishes it success.
The other criticisms were also very valid at the time, and were acted upon.
Gitlab has a similar project that will load google sheets as a dataframe.[1]
At my current company, we load google sheets into s3, then mount those files as external tables. There has not been a commit in years, meaning it has worked out well for us.
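For simple cases you don't even need extra tooling: a link-shared sheet can be read straight into pandas through its CSV export URL. A rough sketch (the sheet ID is a placeholder):

```python
import pandas as pd

# Placeholder ID: this works for sheets shared as "anyone with the link can view".
SHEET_ID = "your-sheet-id"
url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

df = pd.read_csv(url)  # pandas reads the CSV export directly from the URL
print(df.head())
```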
What seems to be missing in these solutions, and what Dropbase provides, is a UI to guide users through the process.
Thanks. That's a useful project! And yes, we aim to make data processing easy (through UI, low code) and easy to reuse/export (by converting UI steps 1:1 to code)
I tend to think of software in terms of composable units, so Unix-like utilities are very attractive in my workflow, and Datasette just fits right into that model. Datasette is easy to deploy and does one thing well. I can use it on my little single-board computer I use for hobby projects and allow other machines on my network to have an API to view a database a daemon is populating there. But it works just as well to share larger, static data sets on the internet. It's just a tool that fits right into its niche in the stack and does its job really well (much like sqlite).
As engineers, we also tend to think in terms of modularity and control - call this "tool flexibility."
With Dropbase, we're balancing that flexibility against the goal of creating an experience that also works for users who can't work with these composable units directly.
The way we balance experience and flexibility is by giving users full control of the database and of the processing steps (we even let you export Python code you can run anywhere else).
We found this is the right balance for the use cases we're targeting, although we're still doing a lot of research to figure it out, and that balance might also evolve over time.
If you do this, please offer nonprofit and student/academic plans! Many people in social sciences and the nonprofit world don't have the engineering resources to build data pipelines, nor the budgets for a $250/mo plan. But they're spending every day slicing Excel files of surveys and risk assessments and potential donors and the now-departed intern's messy list of average flight speeds of unladen swallows.
In all seriousness, this product could bring leverage to those in society who could have the most impact. Design is brilliant, the pipeline idea is brilliant, I can see this really gaining traction.
I really like this. I could see myself using this in the future for some personal projects or for prototyping.
What I would really love though is something a little more similar to Dropbox, with tight integration to the user's filesystem, and keeping the spreadsheet as the source of the data.
Spreadsheet view for all your data is on our roadmap. Integrating into the user's file system is something we'll definitely explore; it sounds quite interesting.
These are great suggestions, thank you!
This is just excellent. I do a lot of ETL work and need to build custom workflows for it; this is exactly what I have been looking for. Good job, team!
Since you brought up ETL, you may also be interested in Meltano (https://meltano.com), an open source ETL tool we've been working on at GitLab for a few years now!
Thanks! Give it a go and let us know if there's anything we can do to make this better. We let you export Python code that maps 1-to-1 to any processing step you take on the UI.
Some of the ideas are good, but it would be more interesting if it processed files in place like Redshift Spectrum does, versus loading them into Postgres first. I know you are targeting smallish datasets, but eventually data size will go up and loading everything into PG could become a scaling problem.
This reminds me of https://www.visidata.org/ which is a terminal based application with similar purpose - loading tabular data from various sources, and exploring and processing it in a visual way.
Thanks. Yes, in some cases where you're working with regulated data you'd need a self-hosted version. We're working on an enterprise version that allows this.
Would you be able to describe your use case and the kind of data you're using?
If you're looking for open source, self-hosted ELT, I suggest you check out Meltano (https://meltano.com), which we've been working on at GitLab since 2018!
Meltano uses open source Singer taps and targets (https://singer.io) as its extractors and loaders, so to put together something similar to DropBase (which looks amazing, by the way), you could pair a file-based extractor (tap) with a Postgres loader (target).
For transformation, Meltano currently supports only dbt (https://www.getdbt.com/), which means that unlike DropBase, it's built for ELT rather than ETL, since transformation takes place inside the loading database, rather than in between the E and L steps.
I'm very interested in exploring the ETL direction more, though, because as DropBase clearly shows, there are still a lot of companies and people who may not be experts on SQL, but would benefit tremendously from sturdy ETL with an accessible interface and flexible integration points.
As I just wrote on our Slack workspace (there's a link on https://meltano.com):
> I’d love to see Meltano UI develop into that direction for simple transformations over Singer tap stream/entity schema and record JSON, so that we can do ETL as well as dbt-backed ELT.
> We’d probably start with a way of specifying transformations in `meltano.yml`, similar in spirit to the `select`, `schema`, and `metadata` extra’s (https://meltano.com/docs/command-line-interface.html#extract...), and/or by pointing at a Python file that can process each Singer message. Building a DropBase-style UI over that would be on the horizon too, once we’ve brought the Entity Selection interface back (https://gitlab.com/meltano/meltano/-/issues/2002) and add interfaces for metadata rules and schema overrides.
I'll create some more issues around this potential direction tomorrow :-)
---
If you or anyone else interested in open source, self-hosted ETL/ELT end up giving it a try, I'd love to hear what you think, so that we can figure out how to build it into this direction together!
This is very cool! I was building something similar but with a different crowd in mind, sitting in between business and ops people as a "data sanity gatekeeper".
I was debating going with a similar stack (postgres(t)) but am currently playing around with sqlitebiter. Cool to see a similar product like this!
How does it work for deep JSON? Does it import it as JSON and keep the depth inside each row? Or is there an option to flatten the data, or spread across related tables?
Hey, I'm one of the engineers behind Dropbase. Currently all imported data needs to be structured, which means the data needs to be formatted in one of the values, records, index, or columns orientations (see https://pandas.pydata.org/pandas-docs/stable/reference/api/p...). Right now we can only auto-detect between those formats, but in the future we're looking to accept unstructured data as well.
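To make the orientations concrete, here's a small pandas sketch (not Dropbase code, just the underlying idea); json_normalize is one way to flatten deep JSON yourself before importing:

```python
import io
import pandas as pd

# "records" orientation: a JSON array of flat objects, one per row.
records = '[{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.25}]'
df = pd.read_json(io.StringIO(records), orient="records")

# Deeply nested JSON needs flattening first; pandas' json_normalize spreads
# nested fields into dotted columns instead of keeping dicts inside cells.
nested = [{"id": 1, "customer": {"name": "Ada", "address": {"city": "Toronto"}}}]
flat = pd.json_normalize(nested)
print(flat.columns.tolist())  # ['id', 'customer.name', 'customer.address.city']
```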
Reading the headline I clicked this thinking it might give you an API for your file system and I had thoughts of managing / viewing my files via an API.
We could work on the wording. Let me know if you have any suggestions.
We don't let you manage your file system through API but we offer access to your database tables through REST APIs. It's not the same but you could hack it to work that way.
Perhaps change "offline files" to "offline data (files)", since just "offline files" is pretty ambiguous about what is supported (and it's how I arrived at the file system idea).
That's a great suggestion. We were considering something like this for business or enterprise versions. It would also allow you to connect Dropbase pipelines to multiple files or entire folders in local or cloud storage.
Yes, we have more database types in our roadmap. At the moment we are starting out with Postgres. Are you trying to get data to a new sqlite or an existing one?