A quote from the article I would object to is "for large datasets and complex transformations this architecture is far from ideal. This is far from the world of open-source code on Git & CI/CD that data engineering offers - again locking you into proprietary formats, and archaic development processes."
No one is forcing you to use those tools on top of something like Snowflake (which is just a SQL interface). These days we have great open-source tools (such as https://www.getdbt.com/) which let you write plain SQL that you can then deploy to multiple environments, run through automated testing and deployment, and drive with fun scripting. At the same time, dealing with large datasets in the Spark world is full of lower-level details, whereas in a SQL database it's the exact same query you would run on a smaller dataset.
The reality is that the ETL model is fading in favour of ELT (load the data, then transform it in the warehouse) because maintaining complex data pipelines and Spark clusters makes little sense when you can spin up a cloud data warehouse. In this world we don't just need less developer time; those developers don't have to be engineers who can write and maintain Spark workloads/clusters. They can be analysts who do the transformations and get something valuable out to the business faster than the equivalent Spark data pipeline could even be built.
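To make the ELT idea concrete, here's a minimal sketch in Python, using sqlite3 purely as a stand-in for the warehouse (table and column names are made up). With dbt, the SELECT below would live in its own model file and be materialised per environment against Snowflake/BigQuery instead:

    import sqlite3

    # "L": load the raw data into the warehouse as-is.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, country TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(1, 10.0, "NZ"), (2, 25.5, "AU"), (3, 7.25, "NZ")],
    )

    # "T": the transformation is just SQL run inside the warehouse.
    conn.execute("""
        CREATE TABLE orders_by_country AS
        SELECT country, COUNT(*) AS order_count, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY country
    """)

    for row in conn.execute("SELECT * FROM orders_by_country ORDER BY country"):
        print(row)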
Very valid points:
1) Agree that Snowflake is far easier to use than Spark.
2) Agree that DBT is a great tool.
The context here is ETL workflows that routinely process tens of TBs and carry large, complex business logic. With Spark code, you can break your code down into smaller pieces, see the data flow across them, write unit tests, and still have the entire thing execute as a single SQL query.
Don't large SQL scripts become really gnarly for complex logic - nothing short of magical incantations? I can't see the data flowing out of a subquery for debugging without changing the code.
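For what it's worth, the decomposition I have in mind looks roughly like this in PySpark (illustrative function and column names, not anyone's production code) - each step is a plain function over a DataFrame that can be unit tested on a tiny input, and the composed pipeline still gets planned and executed as one job:

    from pyspark.sql import SparkSession, DataFrame
    from pyspark.sql import functions as F

    def clean_orders(orders: DataFrame) -> DataFrame:
        # Drop obviously bad rows; easy to unit test with a three-row DataFrame.
        return orders.filter(F.col("amount") > 0)

    def revenue_by_country(orders: DataFrame) -> DataFrame:
        return orders.groupBy("country").agg(F.sum("amount").alias("revenue"))

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("example").getOrCreate()
        orders = spark.createDataFrame(
            [(1, 10.0, "NZ"), (2, -1.0, "AU"), (3, 7.25, "NZ")],
            ["id", "amount", "country"],
        )
        # The composed pipeline is still optimised as a single plan.
        revenue_by_country(clean_orders(orders)).show()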
Prophecy as a company is focused on making Spark significantly easier to use!
SQL has definitely become the de facto tool for a lot of data processing. This model of working is generally referred to as ELT, as opposed to ETL.
For small/medium-scale environments, Fivetran/Stitch with Snowflake/BigQuery, using getdbt.com for modelling, is an insanely productive way to build an analytics stack. I consider this the default way of building a new data stack unless there's a very good reason not to.
For larger scales Facebook has Presto, Google has Dremel/Procella/others, and a lot of data processing is done using SQL as opposed to writing code.
The only real downside is that it tends to be fairly focussed on batch pipelines (which are fine for 95% of workloads). But even that is becoming less of an issue with Beam/Spark, so you can use SQL for both batch and streaming.
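As a rough illustration of that batch/streaming convergence, here's a PySpark Structured Streaming sketch (using the built-in rate source so it's self-contained) where the transformation is the same SQL you'd write against a batch table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-sql").getOrCreate()

    # Built-in test source that emits rows with `timestamp` and `value` columns.
    stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
    stream.createOrReplaceTempView("events")

    # The same SQL you'd run over a batch table, applied to a stream.
    counts = spark.sql("""
        SELECT window(timestamp, '10 seconds') AS win, COUNT(*) AS n
        FROM events
        GROUP BY window(timestamp, '10 seconds')
    """)

    # Runs until interrupted, printing updated counts to the console.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()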
Source: Solution Architect at an analytics consultancy.
I can highly recommend this talk by Respawn, Multiplay, and Google on how Titanfall 2 does multiplayer server management. It's geared more towards the infrastructure side than actual dev work, but it's worth a watch: https://www.youtube.com/watch?v=p72GaGq-B_0
While the move seems to be towards NewSQL databases (Spanner/CockroachDB), the answer to relations in NoSQL is to model the data differently. That can generally resolve a lot of the problems that create the need for relations in the first place.
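A toy example of what "model the data differently" means in practice (hypothetical schema, shown as plain Python dicts): instead of normalising orders into their own table and joining on user_id, you embed them in the document you actually read together:

    # Relational shape: two tables, joined at query time.
    users = [{"id": 1, "name": "Alice"}]
    orders = [
        {"id": 100, "user_id": 1, "amount": 10.0},
        {"id": 101, "user_id": 1, "amount": 25.5},
    ]

    # Document shape: embed what you read together, duplicate what you must.
    user_doc = {
        "id": 1,
        "name": "Alice",
        "orders": [
            {"id": 100, "amount": 10.0},
            {"id": 101, "amount": 25.5},
        ],
    }

    # One key lookup replaces the join for the common access pattern.
    print(sum(o["amount"] for o in user_doc["orders"]))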
Lambda will create a function instance per concurrent request, i.e. if 100 requests happen at the same time there will be 100 function instances. It will then, however, keep them "alive" for a few minutes, allowing already-running instances to be reused. Additionally, a function instance won't handle more than one request at a time.
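You can observe both behaviours with a tiny handler like this (an illustrative sketch, nothing AWS-specific beyond the handler signature) - module-level state survives between invocations that land on the same warm instance, while concurrent requests each get their own instance and therefore their own counter:

    import os

    # Module-level state: initialised once per function instance (cold start),
    # then reused for every invocation routed to that same warm instance.
    invocation_count = 0
    instance_id = os.urandom(4).hex()

    def handler(event, context):
        global invocation_count
        invocation_count += 1
        # Under concurrency you'll see different instance_ids, each with its
        # own counter, because one instance never handles two requests at once.
        return {"instance": instance_id, "count": invocation_count}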
Given this, using the BEAM is a bit of a waste in terms of individual instance scalability. That being said, you might be able to share actor pools (e.g. caches, etc.) across all your functions. I want to emphasise the "might", as rapidly adding and removing nodes from the BEAM cluster might not work well, or at all.
At the end of the day, the Lambda model kind of supplants the actor model as your unit of messaging and concurrency, so trying to mix the two isn't the best idea. If you want to use the BEAM on AWS, I'd recommend sticking to ECS/Fargate/EKS. That being said, Elixir might still be a nice match purely for the developer ergonomics - just don't expect to be able to drag and drop actor-reliant features.
In my experience the reason for one JSON object per line is that a tool can then split the entire file by newlines and have a list of objects to start parsing/processing in parallel. This avoids having to parse the entire file up front to get usable data, and lets the tool start processing some records while others are still being parsed.
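Concretely, that's why newline-delimited JSON fans out so easily - a rough Python sketch (records.jsonl and the "value" field are hypothetical):

    import json
    from concurrent.futures import ProcessPoolExecutor

    def process(line: str) -> int:
        # Each line is a complete JSON document, so it can be parsed
        # independently of every other line.
        record = json.loads(line)
        return record.get("value", 0)

    if __name__ == "__main__":
        with open("records.jsonl") as f:
            # Splitting on newlines is enough to get parallelisable units;
            # no need to parse the whole file up front.
            lines = [line for line in f if line.strip()]

        with ProcessPoolExecutor() as pool:
            results = list(pool.map(process, lines, chunksize=1000))

        print(sum(results))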
One of the major considerations for a language like Jai is that once you compile to C/C++, even if the generated code is fairly optimal, you still have to go through another, slower compiler. Jai's compiler is fairly optimised for speed, and I think that would be an unacceptable trade-off.
On the target platform side, a video game tends to have a fairly well-known set of platforms it will run on (new PCs, PS4, Xbox), so I think this is an acceptable trade-off.
And just a final note: the generated C/C++ would be very unreadable anyway, unless it was a direct transpiler, at which point you lose a lot of the benefits of having a new language.
The app can still use environment variables and be completely independent (12-factor), but the Dockerfile/Kubernetes config still needs to provide those values, so there is a clear distinction between ops and dev - they're just more intermingled than previously.
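In other words, the app side stays trivially portable - it just reads its environment (variable names below are hypothetical) - while the Dockerfile/Deployment decides what those values actually are per environment:

    import os

    # The app only knows it needs these; it has no idea whether they come from
    # a local .env file, a Dockerfile ENV, or a Kubernetes ConfigMap/Secret.
    DATABASE_URL = os.environ["DATABASE_URL"]
    LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")

    def main():
        print(f"connecting to {DATABASE_URL} with log level {LOG_LEVEL}")

    if __name__ == "__main__":
        main()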
There are generally two levels of autoscaling involved with Kubernetes.
Firstly, Kubernetes is able to create multiple instances of your app as load increases and scale them out over all the nodes.
Secondly, the nodes your Kubernetes cluster is running on can also autoscale. With Terraform you can, for example, set up an AWS autoscaling group to automatically increase the size of your cluster as load increases.
There doesn't seem to be any part of that package that requires ES6 to use it. You can use it as an ES6 module, but the CommonJS version should work fine as well.