Meta comment: The confusion we see in this thread is what happens when a startup tries to create a new category -- a strategy known as "category creation." Founders imagine everyone will jump on board with the new category and flock to them as its de facto leader.
In reality, it just creates another point of confusion. Whereas before you had to explain just one thing, now you have to explain two things: What your product does, and what the category means.
There are many existing subcategories within the "database" category of products. Pick the subcategory where you want to compete, and market your innovation to stand out and win over new or existing customers in that category. Timeseries, in-memory, data lake, RDBMS, ... Lots to choose from.
This goes for TileDB and any startup founder reading this who's about to launch something like an "Intra-Terrestrial Data Pipeline Miracle" or "Middle-Out Machine Learning Capacitor" or "The First Cloud Fog Edge Dew Platform" or whatever.
It’s difficult to tell from the landing page, but what exactly is this? It says a lot about what it’s not, but that leaves it sounding like an object store with a tightly coupled map-reduce framework providing the “pluggable compute layer”.
The foundational invention is the TileDB universal storage engine, based on dense and sparse multi-dimensional arrays. Genomic variants are 2D sparse arrays. Images and video are dense arrays. Key-value data, interestingly, forms sparse arrays as well [1]. TileDB is the first solution that can efficiently model all of this data with a single powerful data structure. The TileDB universal storage engine natively supports cloud object stores as well as local filesystems, and we natively handle their eventual consistency through our MVCC design.
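To make the "single data structure" claim concrete, here is a minimal pure-Python sketch (nothing here is TileDB's actual format; the example data is invented) of how dense arrays, sparse arrays, and key-value data all reduce to the same coordinates-to-values model:

```python
# Dense 2-D array (e.g. an image tile): values stored in row-major
# order; coordinates are implicit from position in the buffer.
shape = (3, 4)
dense = [float(i) for i in range(shape[0] * shape[1])]

def dense_get(r: int, c: int) -> float:
    # Coordinates index directly into the buffer.
    return dense[r * shape[1] + c]

# Sparse 2-D array (e.g. genomic variants: sample x genome position):
# only non-empty cells exist, as explicit coordinate/value pairs.
sparse = {(0, 101): "A>T", (0, 7042): "C>G", (3, 101): "A>T"}

# A key-value store is then just a sparse 1-D array over the key space.
kv = {("user:42",): b"payload"}

# Slicing is uniform: dense slices index the buffer, sparse slices
# filter the stored coordinates.
row0 = {coords: v for coords, v in sparse.items() if coords[0] == 0}
assert dense_get(1, 2) == 6.0 and len(row0) == 2
```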
On top of the TileDB storage engine we have built integrations with a range of existing computation tools and frameworks. Our goal is to make the computation layer pluggable and to provide fast, efficient integrations (zero-copy or Apache Arrow where possible), so that you can continue to use your favorite computation tools in your favorite language. A few examples are our integrations with NumPy/pandas, Spark, MariaDB, GDAL and more.
In addition, we built TileDB Cloud on our storage engine, which has two more innovations: (1) easy data sharing and logging at global scale (beyond organizations), and (2) a complete serverless infrastructure for scaling out compute, very similar to the DAG approach of dask.delayed.
> The foundational invention is the TileDB universal storage engine based on dense and sparse multi-dimensional arrays.
Want some advice? Find a way to explain TileDB that aims not to sound super intelligent, but to help readers understand. You will be much more successful if you do just that.
Added bonus: start with what problem TileDB solves, that current solutions don't solve.
Edit: 1) scalability for complex data; and 2) deployment; seem to be the problems you solve. Is that correct?
If you don't understand the value of TileDB, you probably haven't dealt with the data that it is meant to model.
Using something like MongoDB for genomic or sparse geospatial data is incredibly cumbersome.
The access patterns for analytical applications making use of the above data types are spatially collocated ... the subsequent implication of this is that data needs to be stored in ways that are geometrically optimized on the file-system.
Consider a dataset which assembles information regarding terrain ... you collect a measurement using a laser. Some areas you use a laser scanner that is very dense (high number of samples per square km) ... in another you sparsely sample it. This results in a huge multi-dimensional array... this is well understood by people collecting the data. TileDB is built for this type of use case... "complex" data is ambiguous.
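The "geometrically optimized on the file-system" idea can be illustrated with a space-filling curve. This is a generic sketch rather than TileDB's actual layout (TileDB's tile and cell orders are configurable): a Z-order (Morton) index interleaves coordinate bits so that cells near each other in 2-D tend to be near each other in the 1-D storage order, and thus in the same disk blocks.

```python
def morton_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order index.
    Spatially close cells get numerically close indices, so a
    spatially collocated read touches contiguous storage."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bits come from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bits come from y
    return z

# The four cells of a 2x2 block stay contiguous in the linear order:
cells = [(0, 0), (1, 0), (0, 1), (1, 1)]
order = sorted(cells, key=lambda c: morton_2d(*c))
assert order == [(0, 0), (1, 0), (0, 1), (1, 1)]
```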
No doubt, but people need constraints. So they'll cobble together ad hoc constraints in their applications to prevent their stuff from crashing or being wrong.
A DBMS's raison d'être is to take those ad hoc code patterns and formalize them.
Yes, but you're talking about a different kind of database. This kind of big-data store is for specialty use cases with different kinds of constraints. Foreign keys are pretty low on the list of important things here, and when they exist they likely point to an external relational database anyway.
He's saying they're uniquely capable of storing both dense and sparse data efficiently, which means they can store literally anything, hence the "universal storage engine".
Not making a value judgement, btw. I'm not very at home in the field, so I wouldn't know if any other database is capable of this, nor if storing dense and sparse information in the same database is even something anyone wants or needs.
Nearly every analytic RDBMS can model sparse arrays. A sparse array is modeled by defining a clustered index on the table columns that act as the array dimensions, plus a uniqueness constraint on that clustered index. This works well with columnar storage because the data needs to have (and is assumed to naturally have) a total sort order on the dimensions. Vertica, ClickHouse, BigQuery... all allow you to do this. TileDB allows for efficient range queries through an R-tree-like index on the specified dimensions.
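The clustered-index modeling described above can be sketched in a few lines of Python. `SparseArray` here is a toy illustration of the idea (a sorted row list standing in for a clustered index), not any real engine's implementation:

```python
import bisect

class SparseArray:
    """A sparse array as a table with a clustered (sorted) index on
    the dimension columns and a uniqueness constraint on them."""

    def __init__(self):
        self._rows = []  # (coords, value), kept sorted by coords

    def insert(self, coords, value):
        i = bisect.bisect_left(self._rows, (coords,))
        # The uniqueness constraint: one value per coordinate tuple.
        if i < len(self._rows) and self._rows[i][0] == coords:
            raise ValueError(f"duplicate coordinates {coords}")
        self._rows.insert(i, (coords, value))

    def leading_dim_range(self, lo, hi):
        # The sort order turns a range query on the leading dimension
        # into two binary searches plus a contiguous scan.
        i = bisect.bisect_left(self._rows, ((lo,),))
        j = bisect.bisect_left(self._rows, ((hi + 1,),))
        return self._rows[i:j]

a = SparseArray()
a.insert((2, 5), "x")
a.insert((0, 1), "y")
a.insert((2, 9), "z")
assert [v for _, v in a.leading_dim_range(2, 2)] == ["x", "z"]
```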
Most real world data though is messy and defining a uniqueness constraint upfront (upon ingestion) is often limiting, so for practical use cases this gets relaxed to a multi-set rather than sparse array model for storage, and uniqueness imposed in some way after the fact (if required).
I agree that in many use cases for sparse data, uniqueness of the dimensions can't be guaranteed, or you might not want to enforce it. With the recent TileDB 2.0 release we introduced support for duplicates in sparse arrays, which adds support for multi-sets [1].
It really isn't clear to me from your description why I might find TileDB useful or how it differs from using Redis or s3 directly. And I say this as a DL practitioner who would be very interested in a tensor database.
Between the TileDB splash page and your comment I don't understand the use cases where TileDB is compelling nor do I see a lightweight way to try it out and determine the use case for myself.
Seth from TileDB here. There are several differences from Redis or S3.
Redis is primarily an in-memory key-value store. There is an option for persistence, but persisting data is not what it's primarily designed for; TileDB is designed first and foremost for persisting data to disk or to a cloud object store. Redis does not directly/natively support persisting to cloud object stores, as that's not really among its design goals. Another difference is that TileDB is a column store designed around dimensions (~the primary index) and attributes (non-indexed columns), whereas Redis is a key-value store with several data types but effectively a single value per key (that value might be a list, but you can't map one key to a structure with multiple datatypes). You must serialize your data structures to fit into the key-value model, or keep track of multiple keys and indexes yourself if you want to approximate columnar storage using lists. Depending on your application and use case, Redis and TileDB are more likely to complement each other than to be alternatives you choose between.
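To illustrate the serialization point: a plain KV store forces the application, not the engine, to own the schema. A toy sketch (the key-naming scheme here is invented for illustration, not a Redis or TileDB convention):

```python
import json

kv = {}  # stands in for any key-value store

# Option 1: serialize the whole structure under one key.
kv["sensor:42"] = json.dumps({"temp": 21.5, "humidity": 0.4})

# Option 2: fan out one key per "column" and maintain your own
# index of which keys belong together.
kv["sensor:42:temp"] = "21.5"
kv["sensor:42:humidity"] = "0.4"
kv["index:sensor:42"] = json.dumps(["temp", "humidity"])

# Either way the application owns the schema and the bookkeeping;
# a columnar engine tracks dimensions and attributes natively.
cols = json.loads(kv["index:sensor:42"])
assert [kv[f"sensor:42:{c}"] for c in cols] == ["21.5", "0.4"]
```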
Using S3 directly, either with Parquet or flat files, is an option, and a widely used one. The problem we see with this is that neither Parquet nor flat files are designed for cloud object stores, and they leave a lot up to the application to implement, since there is no single storage engine designed around these formats. This is why over the last few years we've seen additional frameworks emerge in the community, such as Delta Lake, Iceberg, Hudi and others. These systems are built to help handle the eventual consistency of cloud object stores and the multi-writer/multi-reader problem, and to support operations like updates and deletes. By contrast, TileDB has many of these features built directly into the format design and the storage engine. TileDB's MVCC design, based on many immutable fragments rather than a single file, allows it to natively handle updates and insertions on cloud object stores. We designed it with eventual consistency in mind, and it is safe from corrupt reads or writes without the need for a central locking or transaction system.
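The fragment-based MVCC idea can be sketched in miniature. This is a toy illustration of the general technique described above (immutable, uniquely named write fragments merged at read time), not TileDB's actual on-disk format:

```python
import itertools

class FragmentStore:
    """Each write lands as a new immutable, uniquely named fragment;
    readers merge fragments in version order, so they never observe
    a half-written object and no central lock is needed."""

    _versions = itertools.count(1)  # stands in for unique timestamps

    def __init__(self):
        self.objects = {}  # object name -> fragment (stands in for S3)

    def write(self, cells: dict):
        v = next(self._versions)
        # Unique names mean concurrent writers never clobber each other.
        self.objects[f"__fragment_{v}"] = (v, dict(cells))

    def read(self, at=None):
        merged = {}
        # Apply fragments oldest-first; later writes win, which is how
        # updates work without rewriting old data. Passing `at` gives
        # time travel: fragments newer than `at` are ignored.
        for v, cells in sorted(self.objects.values()):
            if at is None or v <= at:
                merged.update(cells)
        return merged

store = FragmentStore()
store.write({(0, 0): 1.0, (0, 1): 2.0})
store.write({(0, 0): 9.0})          # an update: the newer fragment wins
assert store.read() == {(0, 0): 9.0, (0, 1): 2.0}
assert store.read(at=1) == {(0, 0): 1.0, (0, 1): 2.0}
```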
In addition to the advantages we believe the TileDB storage engine has on S3, I also want to mention that one of the main issues we wanted to solve with TileDB Cloud was sharing and access control of S3 datasets. S3 access policies do exist and can be used to share objects with other users/accounts or the public. However, anyone who has dealt with S3 policies has seen that they quickly grow unwieldy as you try to restrict different prefixes. All too often, companies end up making a bucket public to share the data instead of managing the access policies, and we all see the data leaks that happen as a result. With TileDB Cloud we offer very easy and simple sharing capabilities: from our web console we aim to make it trivial to select an array and share it with another user or organization, or even make it public. [1]
For some quick examples, we have example Jupyter notebooks for running Python examples [2]. These are also available on TileDB Cloud; if you sign up, we are giving $10 of free credit, and you can launch a JupyterLab instance and see these examples preloaded. We also have quickstart examples in different languages if you prefer something other than Python [3]. The quickstarts are not full use cases, but they do give an overview of basic API usage. I'd love to get feedback from you on what kind of lightweight examples you are looking for. We are always aiming to improve our documentation and make it easier to learn about TileDB.
Thank you for that, it was very helpful. Those notebooks and quickstart examples are exactly what I was thinking of, I just hadn't found them.
To be honest, the splash page doesn't help me understand which of my problems the technology would solve and so I didn't dig deep enough to find those examples. I would say the splash page is a little heavy on what TileDB is and a little light on why I should care. But thank you for taking the time to explain it to me!
Wouldn't that take up a lot of storage space if there is no image or video compression, and you're storing it as raw pixel & frame arrays? Is the only option to apply file compression like GZIP on that data, or can I compress the data with image/video compression like JPEG/MPEG?
TileDB is designed around supporting multiple compressors and filters. We have architected the code so that we can add different filters as we find use cases or receive customer requests. Currently we support compressors such as gzip, bzip2, zstd and lz4, and also filters like double-delta, byteshuffle and RLE. For a complete list see our docs [1]. Adding support for specific image compressors such as JPEG, or video codecs, is on our roadmap.
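For intuition on why a filter like byteshuffle is chained in front of a generic compressor, here is a simplified stand-in (real byteshuffle implementations are SIMD-optimized; this just shows the idea on made-up data):

```python
import struct
import zlib

def byteshuffle(data: bytes, itemsize: int) -> bytes:
    """Gather the i-th byte of every item together. For slowly varying
    numeric data, the high-order bytes are nearly constant, so the
    shuffled stream has long runs a generic compressor handles well."""
    n = len(data) // itemsize
    return bytes(data[j * itemsize + i]
                 for i in range(itemsize) for j in range(n))

# A two-stage filter pipeline: byteshuffle, then deflate (gzip-style).
values = struct.pack("<200q", *range(200))   # 200 little-endian int64s
plain = zlib.compress(values)
shuffled = zlib.compress(byteshuffle(values, 8))
assert len(shuffled) < len(plain) < len(values)
```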
In genomics land it's less an object store and more a DB built to house giant but very sparse columnar data (think the position of genetic variants in hundreds of thousands of people), with support for interval operations and some other genomics things that traditional DBs don't support.
That said, we evaluated TileDB for genomics recently and found it lacking for our use case.
Take a look at Redis. With the modules API it's possible to load pre-existing C libraries and consume them through simple Redis commands and there's Redis clients for each and every language out there.
I did a GSoC project years ago and I hated having to use MongoDB for manipulating VCF files (it was the time when "NoSQL" was all the rage and all project ideas had to have Mongo somewhere). Redis Modules would have been the one thing that made sense (that API was introduced much later).
Stavros from TileDB here. Great description of the genomics use case for TileDB. We'd be interested in learning what limitations you've found. Happy to discuss over email as well (stavros@tiledb.com).
Hey Stavros. We were looking for a data-store to integrate into a clinical genomics LIMS that supports in-system analysis. We deal with de novo sequenced clinical samples (and not genotyped samples, which seems to be what TileDB-vcf had in mind?). There are some edge cases that TileDB-vcf explicitly disallows (updates/reinserts to the same sample, overlapping variants) that are not edge cases for us but rather common occurrences.
This is an API issue with TileDB-VCF. The core TileDB library supports inserts/appends/overwrites without issues and we just need to expose those operations in the TileDB-VCF APIs. Added to our backlog, thanks!
TileDB and Hail are rather complementary. We have customers that use TileDB to store and manage their variants, and Hail to perform GWAS (by exporting from TileDB to Hail format). We are currently designing a tighter integration with Hail. This expands on our vision for a universal data engine that integrates with pretty much everything out there and does not lock you in a single framework (e.g., Spark).
That was our feeling about the two products as well; the limitations with TileDB-vcf, though, sort of forced our hand. I was (and still am) of the opinion that TileDB would be a good variant store, since it does so many of the things we want and does them well.
Sure, but that’s a very specific use case and not a “universal database”. “Object store”, after all, is just cloud-speak for KV store, which is the foundation of data retrieval. So I’m still unclear on what this actually does. Has it invented something new, or is it tying together mature tech in a really powerful way?
I'm in the same camp. I'm quite interested since it mentions asset data as an example, but I have no idea what this does from looking at the landing page. Does someone have an end-to-end example? Since this stores arrays, is this kind of like Apache Arrow but with a persistence layer? Is this suited for large amounts (~1TB) of time series data?
Please see my comment above in the thread for a full description of TileDB[1].
Compared to Apache Arrow we have some similarities but also some significant differences. Arrow as a project has many components; the most directly comparable are the in-memory data structure and Parquet for on-disk storage. For the in-memory data structure, TileDB has similar goals of zero-copy data movement between libraries and applications. In fact, we even use Arrow in our TileDB-VCF genomics project [2] for zero-copy into Spark and Python (pyarrow). We are looking to expand support for Arrow into other integrations where appropriate.
For Parquet, a brief comparison is that Parquet is a one-dimensional columnar storage format, whereas TileDB is multi-dimensional. TileDB subsumes Parquet in that we include all of its functionality and more. TileDB natively handles the eventual consistency of cloud object stores, and natively handles updates through its MVCC design. TileDB is a complete storage engine, not only a file format. That said, Parquet does have some advantages; a primary one is its several-year head start on TileDB in being integrated into many tools, so it's better known.
> Is this suited for large amounts (~1TB) of time series data?
Yes, it's well suited for time series data. We natively support timestamp/datetime fields, in the core library and in many of our integrations (TileDB-Py, TileDB-R, Spark, MariaDB to name a few). We allow for fast sub-slicing on the dimension. You also have configurable tiling[3] so you can shape the array to fit your timestamp granularity and volume. The support for updates also can help if your timeseries data ever gets updated. Many timeseries databases don't recommend updates to records, or they recommend no primary keys and to have duplicates. TileDB supports fast and efficient updates (and duplicates) so you have full control of your design and implementation.
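The effect of configurable tiling on time-series slicing can be sketched with simple arithmetic. The one-day tile extent below is an arbitrary example, not a recommendation, and this is a model of the general idea rather than TileDB's actual read path:

```python
SECONDS_PER_DAY = 86_400

def tiles_for_range(t_start: int, t_end: int,
                    tile_extent: int = SECONDS_PER_DAY):
    """Return the tile indices a [t_start, t_end] slice must read
    when the time dimension is split into fixed-size tiles."""
    return list(range(t_start // tile_extent, t_end // tile_extent + 1))

# A 3-hour slice inside one day touches a single tile...
assert tiles_for_range(10_000, 20_800) == [0]
# ...while a slice spanning midnight touches two.
assert tiles_for_range(80_000, 90_000) == [0, 1]
```

Matching the tile extent to your query granularity means a typical slice reads only a handful of tiles instead of scanning the whole array.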
Without experience with the product, its core seems to be a wrapper around stored data (SQL, HDF, etc.) accessed over several protocols (MySQL, static files (S3), etc.).
I doubt that it is like an object store. I think that it is built around the notion of a multidimensional array (matrix), with a focus on data science tools. Yet they use the term "table", which is somewhat confusing because there are significant differences between tables and arrays. Maybe they use (sparse) arrays at the physical level and then support other data models at the level of operations. To me it is not clear.
Yet Another Database.... As Shelnutt2 states below, its success will depend on how quickly TileDB can be integrated into other tools.
However, I don't see any benefit in TileDB supporting fast and efficient updates (and duplicates) of time series data. Time series should be immutable, and only on rare occasions require updating.
Time series data can vary, and I'd agree that most time series data is immutable. There are, however, use cases in which updates happen. For instance, at my previous job before TileDB, Inc., we had a case where 99% of our data was immutable and never updated, but a very small amount could be updated if there were late-arriving parts. In order to get near-real-time data, we accepted that sometimes some columns might not be available within the window a datapoint represents. In that case the record might reappear at a later time with the complete and correct values.
Of course there are trade offs, we could have forced a longer waiting period until we were confident the data was finalized. We also could have ignored the updates. In the end we used a system of staging tables for loading the last 24 hours of data before merging out into a more finalized table. This kept the load of updating records in the database down, and still allowed us to achieve our goals.
At the time, several years ago, I was not aware of TileDB else we would have considered it instead of a more traditional database vendor.
I'm looking at some of the comments and still having a little bit of trouble understanding what makes the data engine "universal."
I see reference to "universal storage" in some areas, but keep landing on a multi-dimensional array structure for data storage - and this seems kind of at odds. Maybe I'm missing something, but isn't specifying the structure of the data inherently not universal?
I'm relatively shielded in my databases knowledge, though - having only worked with "traditional" tools. If I'm missing something definitely let me know!
Seeing the number of customer testimonials on their announcement makes me once again wonder how brand-new products get traction with customers (who really serve as guinea pigs). I'd love to hear (and learn) how startups have successfully gotten their foot in the door with technologies like this.
For context, I spent a bit shy of a year a while back developing a middleware idea, and severely struggled to get anyone to try it. Friends suggested open-sourcing it, but even that would have been a struggle, I'm sure.
I believe TileDB had some customers when it was framed as the product. According to its website: "TileDB, Inc. is a data management company spun out of Intel Labs and MIT"
I couldn't find a nice example of using TileDB together with data frames. From the documentation, I understand this is one of the uses of TileDB. Is someone aware of a performance oriented blog post about TileDB and data frames?
Can you elaborate? I can't figure out what you're referring to with just that, but I wonder whether it's a silly nickname for a common limitation of databases?
HDFS is the Hadoop distributed file system, but maybe they meant HDF5 files, which is a format the Pandas library saves in. That makes use of Numpy's n-dimensional arrays, so it's related to what these guys are working on.
Yeah, my typo. I was trying, on a limited time budget, to understand TileDB and their marketing pitch. They mentioned something like "the problem of knowing where to keep data was solved by traditional databases, now it's solved by us".
On a dumb level, I was just wondering if they are tracking the data sets people create. It does seem more like a data policy than a product. But I may not understand it well.
TileDB Embedded is a storage engine like HDF5, with the following differentiators: (1) it is cloud-native, (2) it also supports sparse arrays, (3) it offers rapid updates, and (4) it supports data versioning and time traveling built into its format. TileDB Cloud (our cloud SaaS solution) further allows you to see which arrays you own in the cloud and which you share with others, along with full access logs. You can also attach arbitrary descriptions and metadata that you can search on, and even find and access public datasets posted by you or others.
Although this comment is voted into oblivion, and I know the tone wasn't great, the poster is actually making an interesting point in one sense - the fact that technology often comes back around another time in a new form.
For example, along the lines of IBM's IMS, I also recall the MUMPS system and language [1], and I'd be interested to know whether the inventors of TileDB were familiar with the history of sparse-array interfaces to databases, whether it influenced their design in any way, or whether it was rediscovered.
Regarding MUMPS: "The MUMPS language provides a hierarchical database made up of persistent sparse arrays, which is implicitly "opened" for every MUMPS application. All variable names prefixed with the caret character ("^") use permanent (instead of RAM) storage, will maintain their values after the application exits, and will be visible to (and modifiable by) other running applications. Variables using this shared and permanent storage are called Globals in MUMPS, because the scoping of these variables is "globally available" to all jobs on the system. The more recent and more common use of the name "global variables" in other languages is a more limited scoping of names, coming from the fact that unscoped variables are "globally" available to any programs running in the same process, but not shared among multiple processes. The MUMPS Storage mode (i.e. Globals stored as persistent sparse arrays), gives the MUMPS database the characteristics of a document-oriented database."
> the fact that technology often comes back around another time in a new form.
In this case, it's failed technology. That's what the people who developed these "NoSQL" kluges and those who adopted them don't understand. Non-relational databases like IMS were once popular, but fell out of favor after the advent of RDBMSs, and for very good reasons, which the adopters of Mongo etc. had to rediscover.
> For example, along the lines of IBMs IMS, I also recall the MUMPS system and language [1] and I'd be interested to know if the inventors of TileDB were familiar with the history of sparse array interfaces to databases or not, if it influenced their design in any way or it was rediscovered.
The "inventors" of these rehashes are generally just as ignorant of history as the users who adopt their "solutions."
I'm going to guess that this technology is not exactly the same as something IBM did in 1966 :-)
Now, I understand a certain crankiness about MongoDB (which arguably was overhyped), but surely you don't think all database development or attempts at innovation should stop, right?
Good ideas from the past now have a place in a world with data centers, personal computers, and mobile devices that even a brilliant computer scientist would have struggled to imagine in 1966.