pitah1's comments

This was also my philosophy behind creating insta-infra (https://github.com/data-catering/insta-infra). Single command to run any service. No additional thinking required.

Too many times I've been frustrated when an installation doesn't work the first time, or it has dependencies I haven't installed (or worse, I have a different version). Then you end up in a deep rabbit hole you can't dig out of. Now every tool I make must have a quick start with a single command.


The world of mock data generation is now flooded with ML/AI solutions, but this one understands it is better to generate metadata to help guide the data generation. I found this to be the case given that ML-based solutions rely on production data and retraining, are slow, require huge resources, cannot guarantee that sensitive data won't leak, and struggle to retain referential integrity.

As mentioned in the article, I think there is a lot of room for improvement in this area. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer), a metadata-driven data generator that can also validate based on the generated data, giving you full end-to-end testing with a single tool. There are also other metadata sources besides LLMs that can help drive these kinds of tools (e.g. data catalogs, data quality tools).
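
To illustrate what metadata-driven means here, a minimal Python sketch (the schema, field names and rules are hypothetical, not Data Caterer's actual API) where the same metadata drives both generation and validation:

    import random
    import uuid

    # Hypothetical schema metadata: one definition drives generation and validation.
    schema = {
        "account_id": {"type": "uuid"},
        "balance": {"type": "double", "min": 0, "max": 10_000},
        "status": {"type": "enum", "values": ["open", "closed", "suspended"]},
    }

    def generate_row(schema):
        row = {}
        for name, meta in schema.items():
            if meta["type"] == "uuid":
                row[name] = str(uuid.uuid4())
            elif meta["type"] == "double":
                row[name] = random.uniform(meta["min"], meta["max"])
            elif meta["type"] == "enum":
                row[name] = random.choice(meta["values"])
        return row

    def validate_row(row, schema):
        # Reuse the same metadata to check what a downstream job produced or consumed.
        for name, meta in schema.items():
            assert name in row, f"missing field {name}"
            if meta["type"] == "double":
                assert meta["min"] <= row[name] <= meta["max"]
            elif meta["type"] == "enum":
                assert row[name] in meta["values"]

    rows = [generate_row(schema) for _ in range(100)]
    for r in rows:
        validate_row(r, schema)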


I recently went down the rabbit hole of using PyScript for running a Python CLI app in the browser.

It felt hacky the whole time, especially when dependencies were involved. I had to create wrapper classes to work around Pydantic 2.x not being available. I tried to keep all logic in the Python files but found some gaps that I had to fill with JavaScript.
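
For reference, a simplified sketch of the kind of wrapper class I mean (hypothetical fields, not the actual playground code): a plain dataclass with manual validation standing in where a Pydantic BaseModel would normally go.

    from dataclasses import dataclass, field

    @dataclass
    class DataContract:
        # Plain dataclass used because Pydantic 2.x couldn't be installed in the browser runtime.
        name: str
        version: str = "0.0.1"
        fields: list = field(default_factory=list)

        def __post_init__(self):
            if not self.name:
                raise ValueError("name must not be empty")
            if not isinstance(self.fields, list):
                raise TypeError("fields must be a list")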

I think it could be a good fit when you want a simple UI with custom logic on top of your Python code, but maybe Streamlit or Gradio would be more suitable.

GitHub repo: https://github.com/data-catering/data-contract-playground

Website: https://data-catering.github.io/data-contract-playground/


The Rustification of a lot of Python projects is making it more difficult than necessary to use Python everywhere.


I've created a Docker image for it and onboarded it into my tool called insta-infra[1]. You should be able to run it via:

    ./run.sh maestro
[1] https://github.com/data-catering/insta-infra


Would love to hear what people think, or other approaches people have taken to quickly spin up tools on their laptops.


I've been keeping an eye on these kinds of Spark accelerator libraries for a while now.

How does it compare to Blaze[1] and Gluten[2]?

I'm interested in running some benchmarks soon against all three for my project to see how they compare (rough timing sketch below).

[1] https://github.com/kwai/blaze

[2] https://github.com/apache/incubator-gluten
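
The harness I have in mind is just timing the same query on the same data with each accelerator's plugin enabled or disabled. A minimal PySpark sketch (the parquet path and query are placeholders; each accelerator's plugin/config is project-specific and omitted):

    import time
    from pyspark.sql import SparkSession, functions as F

    # Comet, Blaze and Gluten are each enabled via their own Spark plugin/config
    # on the builder below; omitted here as it differs per accelerator.
    spark = SparkSession.builder.appName("accelerator-benchmark").getOrCreate()

    df = spark.read.parquet("/data/tpch/lineitem")  # placeholder dataset

    start = time.perf_counter()
    result = (df.groupBy("l_returnflag", "l_linestatus")
                .agg(F.sum("l_quantity"), F.avg("l_extendedprice"))
                .collect())
    print(f"query took {time.perf_counter() - start:.2f}s, {len(result)} rows")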


Apparently Blaze is also built on DataFusion.


Thanks for sharing. Happy to see another solution that doesn't just slap on AI/ML to try to solve it.

I am also among the many people who have created a similar solution[0] :). The approach I took, though, is to be metadata-driven (since most anonymisation solutions cannot guarantee sensitive data won't leak and also require opening network access from prod to test environments, security teams did not accept them whilst I was working at a bank), to offer the option to validate based on the generated data (i.e. check that your service or job has consumed the data correctly), and to be able to clean up the generated or consumed data.

Being metadata-driven opened up the possibility of linking to existing metadata services like data catalogs (OpenMetadata, Amundsen), data quality (Great Expectations, Soda), specification files (OpenAPI/Swagger), etc., which are often underutilized.

The other need I found whilst building and getting feedback from customers was referential integrity across data sources. For example, account-create events come through Kafka and are consumed and stored in Postgres, whilst at the end of the day a CSV file of the same accounts is also consumed by a batch job.
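
To make that concrete, here's a minimal stdlib-only sketch (hypothetical field names) of what keeping referential integrity means: generate the accounts once, then reuse the same keys for the Kafka events, the Postgres rows and the end-of-day CSV.

    import csv
    import json
    import random
    import uuid

    # Generate the accounts once; every downstream data source reuses the same keys.
    accounts = [{"account_id": str(uuid.uuid4()),
                 "balance": round(random.uniform(0, 10_000), 2)} for _ in range(10)]

    # 1. Kafka: account-create events (serialised payloads).
    kafka_events = [json.dumps({"event": "account_create", **a}) for a in accounts]

    # 2. Postgres: rows to insert, keyed by the same account_id.
    postgres_rows = [(a["account_id"], a["balance"]) for a in accounts]

    # 3. End-of-day CSV of the same accounts for the batch job.
    with open("accounts_eod.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["account_id", "balance"])
        writer.writeheader()
        writer.writerows(accounts)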

I'm wondering if you have come across similar thoughts or feedback from your users?

[0]: https://github.com/data-catering/data-caterer


Working on a data generation and validation tool called Data Caterer. Its focus is being data source agnostic, fast and simple. Just last week, I released a UI for it.

https://github.com/data-catering/data-caterer


I think they make the biggest difference when testing data pipelines (which have historically been difficult to test). You can now easily test compatibility between different database versions, verify data types, embed them as part of your build, etc.

I believe the next step, once you're using test containers, would be automating data generation and validation. Then you have an automated pipeline of integration tests that are independent, fast and reliable.
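
A rough sketch of what I mean, using testcontainers-python with a throwaway Postgres (the table and generated values are placeholders, and in practice the generation/validation would be driven by metadata rather than hard-coded):

    import random
    import uuid
    from sqlalchemy import create_engine, text
    from testcontainers.postgres import PostgresContainer

    with PostgresContainer("postgres:16") as pg:
        engine = create_engine(pg.get_connection_url())
        with engine.begin() as conn:
            conn.execute(text("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance NUMERIC)"))
            # Automated data generation: insert synthetic rows for the pipeline under test.
            for _ in range(100):
                conn.execute(text("INSERT INTO accounts VALUES (:id, :balance)"),
                             {"id": str(uuid.uuid4()), "balance": random.uniform(0, 10_000)})
            # Automated data validation: check the expected state after the pipeline runs.
            count = conn.execute(text("SELECT count(*) FROM accounts")).scalar_one()
            assert count == 100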


You can automate data validation with snapshot tests. I do it this way with a data pipeline: I have a function that queries the destination DBs and dumps the results to JSON, which is then written and validated against a snapshot.
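
For comparison, a minimal stdlib-only sketch of that approach (query and paths hypothetical): dump the destination table to JSON, write the snapshot on the first run, and fail on later runs if the output drifts.

    import json
    from pathlib import Path

    def snapshot_check(rows, snapshot_path="snapshots/accounts.json"):
        # rows would come from querying the destination DB after the pipeline runs,
        # e.g. SELECT * FROM accounts ORDER BY id, converted to a list of dicts.
        current = json.dumps(rows, indent=2, sort_keys=True, default=str)
        path = Path(snapshot_path)
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(current)  # first run: record the snapshot
            return
        assert current == path.read_text(), "pipeline output changed vs snapshot"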


This is what I am trying to solve by building Data Catering (https://data.catering/). It gives you the ability to generate data into any database (while maintaining any relationships between data) based on metadata that can be retrieved from a source database or from other metadata sources (e.g. OpenMetadata).
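
As a concrete example of metadata retrieved from a source database, a rough Python sketch (connection string, table name and type handling are simplified placeholders) that reads column metadata from information_schema and uses it to drive generation:

    import random
    import string
    import uuid
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/source_db")  # placeholder

    with engine.connect() as conn:
        columns = conn.execute(text(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_name = :t"), {"t": "accounts"}).all()

    def generate_row(columns):
        # Pick a generator per column purely from the retrieved metadata.
        generators = {
            "uuid": lambda: str(uuid.uuid4()),
            "numeric": lambda: round(random.uniform(0, 10_000), 2),
            "text": lambda: "".join(random.choices(string.ascii_lowercase, k=8)),
        }
        return {name: generators.get(dtype, lambda: None)() for name, dtype in columns}

    rows = [generate_row(columns) for _ in range(100)]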

