
What is the difference between a "Data Warehouse", a "Data Lake" and a plain old managed SQL Server instance I run on Azure?


A data warehouse is usually a relational database designed for large-scale OLAP analysis, with features like column-oriented storage, vectorized processing, and a distributed scale-out architecture. Since it's a database, the focus is on strong schemas and structured data, although all major systems also support JSON datatypes now.
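
To make "column-oriented storage" concrete, here is a minimal sketch in plain Python. The table, its column names, and both helper functions are made up for illustration and are not how any particular warehouse implements it; the point is only that an aggregate over one column does not need to touch the others when each column is stored contiguously.

```python
# Row-oriented layout: one tuple per record, roughly how an OLTP engine lays data out.
# (Hypothetical example data: order_id, customer, region, amount.)
rows = [
    (1, "alice", "EU", 120.0),
    (2, "bob", "US", 80.0),
    (3, "carol", "EU", 200.0),
]

# Column-oriented layout: one contiguous list per column.
columns = {
    "order_id": [r[0] for r in rows],
    "customer": [r[1] for r in rows],
    "region": [r[2] for r in rows],
    "amount": [r[3] for r in rows],
}


def total_row_store(rows):
    # Walks every full record even though only one field is needed.
    return sum(r[3] for r in rows)


def total_column_store(columns):
    # Reads only the "amount" column; the other columns are never touched.
    return sum(columns["amount"])


row_total = total_row_store(rows)
col_total = total_column_store(columns)
```

Both functions return the same total. On a wide table with billions of rows, reading one column instead of every record is where the analytical speedup comes from.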

A data lake is usually object storage or another large storage pool holding raw files. These can be in different formats like JSON, Avro, or Parquet, and can contain either strongly schematized or unstructured data. Processing can be done by engines like Spark, Presto, Drill, etc., which support less advanced SQL but more robust access across data files and storage locations. The point is to serve as a general dumping ground or "lake" for all the data and then manage it afterwards (including cleaning and moving important records to a data warehouse).
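
The "dump first, clean and move later" flow can be sketched on a laptop. This is an assumption-heavy toy: a temporary directory stands in for object storage, SQLite stands in for the warehouse, and the file names, the `orders` table, and `load_clean_records` are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def load_clean_records(lake_dir, conn):
    """Scan raw JSON-lines files and copy only well-formed records into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        " order_id INTEGER PRIMARY KEY,"
        " amount REAL NOT NULL)"
    )
    loaded, skipped = 0, 0
    for path in sorted(Path(lake_dir).glob("*.jsonl")):
        for line in path.read_text().splitlines():
            try:
                rec = json.loads(line)
                row = (int(rec["order_id"]), float(rec["amount"]))
            except (ValueError, KeyError, TypeError):
                # Malformed or incomplete records stay in the lake, untouched.
                skipped += 1
                continue
            conn.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)", row)
            loaded += 1
    conn.commit()
    return loaded, skipped


with tempfile.TemporaryDirectory() as lake:
    # The "lake": raw files written as-is, including junk.
    Path(lake, "2020-01-01.jsonl").write_text(
        '{"order_id": 1, "amount": "12.5"}\nnot json at all\n'
    )
    Path(lake, "2020-01-02.jsonl").write_text(
        '{"order_id": 2, "amount": 7}\n{"order_id": 3}\n'
    )
    conn = sqlite3.connect(":memory:")
    loaded, skipped = load_clean_records(lake, conn)
    total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

The raw files are never modified. Only the records that survive cleaning end up in the structured table, which is the part you would query for reporting.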

SQL Server is a single-node OLTP relational database, but most database engines are fast enough now that you can do everything you need up to hundreds of millions of rows. It offers the best SQL and feature support, with full update capabilities. Some DBs like SQL Server have also added OLAP features like columnstore tables to further delay or eliminate the need for a data warehouse.


Great answer.

On Data Lakes: I often use an S3 data lake construct as a staging area for my Snowflake data warehouse.


Mostly how much data there is, and how structured it is. Not really sure what the difference between a data lake and a warehouse is, but either of them will typically have less structured data and more of it than a SQL Server instance. We're talking petabyte-scale. Sure, you can get 16TB drives, but it's still a stretch to put it all on a single machine. Data should ideally be stored as Parquet or similar, but there's probably a lot of JSON out there. Couple it with something like Athena, and you can query in SQL. Spark for more complicated stuff.


A data lake is where all your messy, historical, timestamped, immutable data goes so it's not lost. A data warehouse is where you make sense of it. And your old SQL Server is just the current snapshot.
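
The "immutable history vs. current snapshot" distinction can be sketched in a few lines. The event shape, account names, and `snapshot` helper below are invented for illustration; the idea is that the append-only log can always rebuild the snapshot (or any past one), while the snapshot alone cannot rebuild the log.

```python
# Append-only event log: what the lake keeps. Nothing is updated or deleted.
events = [
    {"ts": "2020-01-01T09:00", "account": "A", "delta": 100},
    {"ts": "2020-01-01T10:00", "account": "B", "delta": 50},
    {"ts": "2020-01-02T09:00", "account": "A", "delta": -30},
]


def snapshot(events, as_of=None):
    """Fold events into per-account balances, optionally as of a timestamp."""
    balances = {}
    for e in events:
        # ISO-8601 timestamps in the same format compare correctly as strings.
        if as_of is not None and e["ts"] > as_of:
            continue
        balances[e["account"]] = balances.get(e["account"], 0) + e["delta"]
    return balances


# What the transactional database would hold right now.
current = snapshot(events)

# A question only the history can answer: balances at the end of day one.
historical = snapshot(events, as_of="2020-01-01T23:59")
```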


A SQL instance is fast and meant for transactional systems, like a stock exchange or purchasing something. The schema is enforced when data is written, which is called schema on write.

DW is for analytics and reporting.

A data lake is like many DWs together, plus other, often "garbage" data which "might" be useful in future analysis, ML and stuff. It's the unstructured graveyard of data (joking). The schema is defined on read.
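
A rough sketch of schema on write vs. schema on read, using SQLite and a list of JSON strings as stand-ins. The `payments` table, the `CHECK` constraint, and `read_amounts` are assumptions chosen for the demo, not how any specific product does it.

```python
import json
import sqlite3

# Schema on write: the database rejects a bad value at insert time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " id INTEGER PRIMARY KEY,"
    " amount REAL NOT NULL CHECK (typeof(amount) IN ('integer', 'real'))"
    ")"
)
conn.execute("INSERT INTO payments (id, amount) VALUES (1, 9.99)")
try:
    conn.execute("INSERT INTO payments (id, amount) VALUES (2, 'oops')")
    rejected_on_write = False
except sqlite3.IntegrityError:
    rejected_on_write = True
rows_in_db = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]

# Schema on read: everything is stored as-is; the reader decides what is valid.
raw_lines = ['{"id": 1, "amount": 9.99}', '{"id": 2, "amount": "oops"}']


def read_amounts(lines):
    """Apply the 'schema' (amount must be numeric) only when reading."""
    amounts = []
    for line in lines:
        rec = json.loads(line)
        try:
            amounts.append(float(rec["amount"]))
        except (KeyError, TypeError, ValueError):
            pass  # the bad record is still stored, just ignored by this reader
    return amounts


rows_in_lake = len(raw_lines)
amounts = read_amounts(raw_lines)
```

The database ends up with one row because the bad insert fails. The "lake" keeps both records, and a different reader with a different idea of the schema could still make use of the second one later.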


My impression? Data warehouses are more fully featured than a data lake, whereas a data lake implies primarily the storage, with other systems querying it. SQL Server is orthogonal in that you need neither if all your data fits in a single SQL database (or alternatively, SQL Server is a small-scale data warehouse).



