Building data infrastructure that will last

Terr_ · 2024-08-11T20:30:45 1723408245

I've come to believe the opposite, promoting it as "Design for Deletion."

I used to think I could make a wonderful work of art which everyone will appreciate for the ages, crafted so that every contingency is planned for, every need met... But nobody predicts future needs that well. Someday whatever I make is going to be That Stupid Thing to somebody, and they're going to be justified demolishing the whole mess, no matter how proud I may feel about it now.

So instead, put effort into making it easy to remove. This often ends up reducing coupling, but--crucially--it's not the same as some enthusiastic young developer trying to decouple all the things through a meta-configurable framework. Sometimes a tight coupling is better when it's easier to reason about.

The question isn't whether You Ain't Gonna Need It, the question is whether when you do need it will so much have changed that other design-aspects won't be valid anymore. It also means a level of trust towards (or helpless acceptance of) a future steward of your code.

twic · 2024-08-11T23:08:21 1723417701

A colleague of mine preaches this gospel. But we never actually delete anything. So now our life is lived on a shanty-town of under-engineered throwaway codebases.

I'm not saying that our life would be better if we gold-plated anything; it would probably just suck differently. But so far, IME, design for deletion hasn't really delivered.

Terr_ · 2024-08-12T07:15:01 1723446901

I prefer to think of it less as "do the short-term minimum" and more like a kind of humility when it comes to wiring things up.

cwalv · 2024-08-11T23:40:47 1723419647

> The question isn't whether You Ain't Gonna Need It, the question is whether when you do need it will so much have changed that other design-aspects won't be valid anymore.

This resonates with me. But IMO YAGNI isn't necessarily different from asking "whether when you do need it will so much have changed that other design-aspects won't be valid anymore." If "it" is reduced coupling (to make something easier to change or remove), or any other not-immediately-necessary abstraction, it's really the same question

beardedetim · 2024-08-11T20:37:26 1723408646

Cannot agree more to this sentiment. I call it "throw away code" and it's always seemed like the easiest to change in the future, and we all know everything is gonna change in the future.

mulmen · 2024-08-11T23:48:43 1723420123

Especially if you never solve anything properly.

Terr_ · 2024-08-12T04:39:30 1723437570

In a lot of business-software contexts, that's a given because stakeholders don't really know what they want/need either.

oftenwrong · 2024-08-12T18:03:57 1723485837

There is a Greg Young talk about this called 'The Art of Destroying Software'

weego · 2024-08-11T11:19:17 1723375157

How do you ensure the data infrastructure you’re building doesn’t get replaced as soon as you leave in the future?

If this is a core conceit of the thinking then my answer is who cares?

Why do you want to try and influence a situation you're not even involed in?

Taking it back to the best lesson I was ever given in software engineering "don't code for every future".

Do what you're asked to and not get caught up in projecting your own biases into trying to make a "solid base" for the future when you can't know the concerns of said future.

layer8 · 2024-08-11T12:07:38 1723378058

You should be interested in building something that others don't want to replace as soon as you leave. That doesn't require predicting the future.

It should be obvious that people not caring what comes after them is not a good thing.

ipaddr · 2024-08-11T12:47:10 1723380430

No matter what you build someone will come along later and try to rewrite. If it is built too well with too many future cases in mind it will be too complex. If you write something simple and basic someone will try to add their complexity. If you write in one language someone will try to use something different. Same goes for framework.

Write for your current requirements not some future state because people will say your work was subpar or overkill regardless because you are not around to defend your decisions and putting you down raises them up.

Things you learn after 25 years.

jvans · 2024-08-11T20:27:43 1723408063

The desire to rewrite something is the single biggest red flag for me that someone has questionable technical decision making skills. Yes there can be good reasons for it, but my priors shift dramatically once I hear someone suggest it

kdazzle · 2024-08-12T03:23:05 1723432985

I used to agree, but there’s so much software that is just so bad. Wrong DB, bad framework, crazy abstractions, black box magic, no security, behavior that is just wrong, etc.

The dev team going away for 3 months to rewrite is probably a bad idea, though. There are definitely good ways to rewrite and then not good ways.

gregors · 2024-08-12T20:28:35 1723494515

I'll agree with you if you mean - entirely en masse. I specialize in fixing legacy software systems that have ground to a halt development-wise. Blanket rewrites are almost never a good thing, but partial rewrites are a wonderful tool. Like most things in this industry there unfortunately isn't a hard and fast rule. Everything is extremely context-dependent.

d_sem · 2024-08-11T14:11:03 1723385463

Some data retention requirements are mandated by law and it is necessary to develop robust systems that can stand the test of time. I've seen 15 and 25 year retention periods for data in safety related applications.

Things my interns learned in the first month as part of new hire training.

My quip above is to illustrate that in a dynamic and complex field its important we don't over index on experience.

dataflow · 2024-08-11T18:20:00 1723400400

> No matter what you build someone will come along later and try to rewrite.

> If you write something simple and basic someone will try to add their complexity.

Note that extending != rewriting.

jvans · 2024-08-11T20:15:25 1723407325

They mean rewriting. People love rewriting stuff they didn't write

Closi · 2024-08-11T12:51:28 1723380688

Then let someone come along and try to rewrite and improve - but if your solution is so flimsy it forces a rewrite, it's just poorly made to start with.

boznz · 2024-08-12T21:14:10 1723497250

Or old. Not many technical decisions stand 15+ years in any organisation

Culonavirus · 2024-08-11T12:55:37 1723380937

> If this is a core conceit of the thinking then my answer is who cares?

Yep. At the end of the day, it's very simple:

People working for a company are not ants or bees. A company is not a hive and people are not going to put down their own interests to serve the hive. We are a bunch of cooperating, but ultimately independent agents, who act in their own benefit.

It is up to the business owner to keep their employee activity in check. Does that mean giving them work to do? Checking on the progress of their tasks? Checking on their methodology and software stack sustainability? Making sure there are no single points of failure for the business? Making sure the "IT know-how" of the business is preserved when a person leaves? ALL OF THE ABOVE!

When a business owner can't do these periodic checks themselves, they're free to hire someone that will do this for them.

But the idea that individual developers should care about what happens to the business after they leave is just preposterous.

Also, the entire "resume driven development" thing is absurd. This has always happened in software development. People care a lot about what their resume will look like in 5 years. It's perfectly normal and the business benefits too ("we use modern tools, come work for us"). It doesn't mean the business should allow needless "shiny new thing" syndrome to thrive, but you should watch out to not stomp out innovation or you might find yourself unable to hire talented devs because no one wants to work on your shitty "php with jquery" web app.

vladms · 2024-08-11T14:38:35 1723387115

> But the idea that individual developers should care about what happens to the business after they leave is just preposterous.

It's not about caring after you leave. It's while you stay caring enough to do useful things for the company. Sure, you can be like a consultant (require very specific requirements and not trying to understand or put things in perspective), but as an employer these are the first people that I will let go because they bring less value than someone that "cares" (again, while being there, not after they left)

neilv · 2024-08-11T18:47:15 1723402035

Yes. Put another way, this school of thought concerns professionalism while you're there, when you already know that what you do will still have effects after you're gone.

A different school of thought is that a job is about showing up and doing some interpretation of what your your manager tells you to do. This might not be very aligned, and much of the org chart might not be very aligned, so the priority tends to be appearances. Manager told you to make a Web site that does X, so you try to make a Web site that arguably does X. You don't tell the manager all the factors that in a better organization they should care about, and you maybe don't do a particularly good job of the site you do make, and you definitely don't base all your implementation decisions based on company needs rather than your own resume and political capital. But you're satisfied that you arguably did what you were told to do, and that's the transaction.

The latter school of thought is very common, and I think it's not really due to individual ICs. Rather, usually the organization is actually pushing people towards that thinking, because the org chart and practices are also full of that kind of thinking. A more conscientious professional would blow a gasket, due to the "preposterous" situation of a company of individual irresponsible mercenary behavior and collective dysfunction like that.

I naturally subscribe to the true alignment school of thought, and that's one of the appeals of being a startup founder: I can apply my experience (and, admittedly, just as much theories/guesses) towards building a company and team where things are aligned better. It's also one of the reasons I dread some aspects of founding, because I know that, no matter how good I am about hiring and onboarding into the aligned culture, we'll sometimes have to deal with very mis-aligned (even bad-faith) people from partners/customers/investors. Not only is that unpleasant, but there's the risk of infection.

seer · 2024-08-12T04:28:41 1723436921

Being an individual developer that does think about those things, even if they are not actually doing them but at least helping with all those checks, is a strong differentiator for promotion and higher pay.

As an employer you _want_ these kinds of people around, helping you with process and making sure if they leave things are still functioning, thus you have more insensitive to keep them around / pay them more.

So it is in the best interest of the individual developer to work with those goals in mind too. Yes it is not your responsibility, but it can be, and that can give you more leverage in salary negotiations.

So its always good to think about “cui bono” and be sure you’re on the right side of each advice :)

liveoneggs · 2024-08-11T12:37:18 1723379838

Resume-building job-hoppers are annoying and self absorbed, yes. The idiots who enable them are even worse.

mvkel · 2024-08-11T12:58:51 1723381131

SaaS was born from comments like this. Paying to keep the lights on, effectively ensuring that an employee quitting won't undermine the entire operation.

OutOfHere · 2024-08-11T19:47:34 1723405654

SaaS companies simply cannot be trusted to not leak customer data. They always will leak it to hackers. This is different from major clouds and self-hosted services which have different sets of security considerations. Snowflake validated this assertion this year with a major data leak.

Also, with SaaS, you pay 5-20x for everything. For example, you can self-host Airflow in a $20 USD/month VM, but any managed Airflow service is going to cost astronomically more.

itsoktocry · 2024-08-12T01:06:14 1723424774

These are age-old "build vs buy" questions. There are no right answers, you have to do the math.

But I can tell you there's more to hosting Airflow than the cost of the VM....

fuzzfactor · 2024-08-11T14:24:13 1723386253

A lot of the article relates to a key person dependency issue.

>Sure, they have built data infrastructure that works and solved the businesses current problems. They maintain it, and no one asks questions.

Probably all the "budget" has allowance for is current needs. Some of the key engineers may not even be paid very fairly considering the true magnitude of business problems being overcome currently. You can't really expect them to prepare for a longer future than they have already been fully staffed for, especially succession.

>Perhaps no one realizes that one of the team members has to wake up at 6 AM every morning to check and ensure all the reports and tables have been created.

>That all works until the day they leave.

If a talented engineer is regularly working overtime to get things going like infrastructure, or worse to keep things going, even worse to keep things from failing, then that engineer is definitely short two staff members. And has probably been short the entire time. Nothing less than an assistant engineer and a technical secretary if they want real documentation as they go along. Plus even more true investment if there's any need to make up for lost time.

If infrastructure is important, something like this is absolutely pure executive failure from someone who's just not in the proper league.

You can not paint a pretty picture, and it's reported to be very difficult to fix stupid.

Some people just should not be accepted as executives in technical endeavors.

It can be tough for lesser executives to accept a non-cutthroat non-business-ladder-climber as more of a "key person" than themselves, but it is far too often the case. Whether the bonehead executives realize it and shrewdly calculate how much more payroll would expand if there was to be better coverage, or are completely oblivious, as the article says about the overworked engineers:

>They maintain it, and no one asks questions.

Example article is from a more expert data "repairman" who knows better than to rely on a single company as an employer if it's got dingbat executives.

There's so many of the under-qualified executives to go around, he's got a lifetime of work ahead of him as a consultant fixing the lackadaisical way they let technical debt underlie a business to where it could topple unexpectedly.

halfcat · 2024-08-11T19:30:08 1723404608

You should care because, the vast majority of the time the person working with it after you, will be you.

But to your point, people think they want “flexibility” or some similar concept, and they end up adding immediate complexity that never pays off, or worse, they pick the wrong abstraction and have a mess that’s hard to undo later.

What they should be aiming for is simplicity. Instead of trying to predict future, keep it as simple as possible to give that future person a chance of tackling the future needs when they arise.

bborud · 2024-08-11T12:11:16 1723378276

You can view the question as a proxy for "how do you provide value for money?".

If you build something that then gets replaced a few years later, maybe you did something wrong. Ideally you make something that evolves, or even better, that acts as a foundation others can build on. If you get a lot of assumptions right and the implementation doesn't get in the way of what people do - or better yet, meaningfully enables them to get work done, you've succeeded.

Here are some things I've observed in the wild.

Data infrastructure projects often fail, not because the technology doesn't work, but because the solution does not enable _organizations_ to work with them. I've seen many companies invest millions in solutions that eventually turned out to be useless because they failed to help make data and results accessible to complex organizations with lots of internal boundaries.

Too much too soon and too complex. You try to address every possible need from the start and in order to make the feature list as long and impressive as possible, you introduce lots and lots of systems that are expensive and complex. Then to use the system, you unload a huge burden onto the users. They have to learn all of these systems and spend lots of time and money training people and adapting their systems so they can interoperate with the rest.

I've helped a few companies design their data infrastructure. I usually follow an extremely minimalist approach. Here's how I start.

1) your long term data store is flat files, 2) you make real-time data available over streaming protocols, 3) by default everyone (inside the company) has access - access limitations have to be justified, 4) you document formats and share code that is used to interpret, transform and process data so the consumer can interpret the data. 5) you give people access to resources where they can spin up databases and run stuff. Data producers and consumers decide how they want to create and process data. You focus on the interface where they exchange data.

(I left security as an exercise to the reader because a) it depends and b) how to secure these kinds of systems is an even longer post)

Points 1 and 2 are sufficient to bootstrap databases and analytic systems at any time. Including systems that receive live data. It makes it possible to both support systems that are supposed to be up permanently and systems that perhaps only load the data, do some progressing and then get nuked. 5 provides the resources to do so.

3 usually meets with resistance in some types of organizations, but is critical. I've seen companies invest millions in "data lakes" and whatnot ... and then piss away the value because only 2-3 people have access to the data and they ain't sharing. You need executive management to empower someone to put their foot down. (One way to make people share data is to use budgets. If you don't share data, your department pays for its storage. If it is shared, it is paid for by a central budget.)

Point 4 requires you to also educate people a bit on data exchange. For instance in many areas there exists exchange standards, but these are not necessarily very good. If you find yourself in a situation where you spend a lot of effort expressing the data in format X and then spend a lot of effort interpreting the data at the other end, you are wasting your time. Come up with something simpler. Not all standards are worth using. And not everything is worth standardizing - don't lose sight of actual goals.

Point 5 is where you grow new core services. Producers and consumers get to pick their own technologies and do whatever they want. When they can show that they've built something that makes life easier for other parts of the organization, you can consider moving it to the "core" but this only happens when something has shown that it works and improves productivity across internal boundaries.

mritchie712 · 2024-08-11T12:51:50 1723380710

(disclaimer: I'm a founder in this space)

> the project is either so incomplete or so lacking in a central design that the best thing to do is replace the old system

I put a lot of blame here on "the modern data stack". Hundreds[0] of point-solution data tools and very few of them achieve "business outcomes" on their own. You need to stitch together 5 of them to get dashboards, 7 of them to get real time analytics, etc.

We're going to see more products that achieve an outcome end-to-end. A lot of companies just want a few dashboards that give a 360 degree view of their data. They want all their data in one spot, an easy way to access it and don't want to spend fortune on it. That's what we're focused on at Definite[1].

We're built on the best open source data projects (e.g. DuckDB, Iceberge, Cube, etc.). If you decide to self host, you can use the same components, but it's generally cheaper to use us than manage all this stuff yourself.

0 - https://mattturck.com/landscape/mad2024.pdf

1 - https://www.definite.app/

2 - https://youtu.be/7FAJLc3k2Fo

steveBK123 · 2024-08-11T13:23:43 1723382623

The "you need 5-7 different tools glued together to solve anything" is the CORE problem of the "modern data stack". It also ties very closely with Resume Driven Development.

It leads to a lot of anti-patterns.

For example, the 5-7 different tools are constantly changing, so after hiring some proclaimed expert.. they end up re-inventing the wheel by choosing a new combination of tools than they've used in the past, hitting various unexpected issues as they go.

VERY rarely in this space do you see someone come in and go "I used these 5 tools in previous roles, they work great, and I'm going to build the best solution because I have done it before."

These guys always think they need to reinvent the wheel, and then end up wrecking the car with some combination of v0.1 untested FOSS, up&coming SaaS, and their own in-house DSL.

kwillets · 2024-08-11T21:00:37 1723410037

OMG I'm that guy -- 3 straight Vertica roles with $B annual revenue.

I did learn a lot from watching MDS people try to beat it (in the end I'm also looking for what should come next), but mostly it was confirming the article and RDD. What they didn't know about data warehousing they also didn't know about performance or price-performance or selecting tools or managing projects or vendors, so costs exploded.

These folks were hilarious because they kept insisting that Vertica is not "modern", while it beat the pants off them with basic columnstore stuff.

cmiles74 · 2024-08-11T14:09:12 1723385352

In my opinion there are many tools and products in this space and they all seem somewhat confused in their target audience (is it marketed to management, analysts producing reports or developers supporting the analysts?) The boundaries between these projects is often fuzzy and they are often complicated (does it include a scripting language?) When you are starting from close to scratch with an application and it's backing database and being asked to produce timely reports, I think these tools aren't the best place to start.

My process has been to talk to stakeholders and sketch out reports they find useful, preferably getting a set of data together that many people find useful. Running these reports against a read-only replica of production data is typically not a big lift. If dashboards are required, write this data out to another database to back the dashboard, probably on some set schedule. It hasn't been long but now we have the bones of an ETL process that is already returning value.

At that point I think these tools start to look more compelling. Now we have a handle on the source data, what it looks like and any places where we need to do something tricky to connect the dots to get our data out. We know what the reports and dashboards look like.

In short, we know what we need these tools to do and where they can help us.

bigger_cheese · 2024-08-12T01:07:40 1723424860

One thing I often see in data people (particularly consultants) is that they are very focused on building reports and dashboards in my experience (manufacturing plant) Analytics should be the focus - if you get the analytical part right - make the data simple/easy to interrogate then the reporting can evolve naturally as a consequence.

Unfortunately I think the tools for a lot of this (exploratory data analysis) are somewhat lacking. I think we are starting to see new tools emerge - especially around Timeseries data that are promising but I don't think things are at the level where a non expert user can quickly glean insights from data in a frictionless way.

itsoktocry · 2024-08-12T01:20:35 1723425635

I disagree. Reports and dashboards provide decision makers with information that they require to make decisions. This is always step one.

Randomly exploring data for "insights" is why so many companies are turning against "Data Science"; it rarely bares fruit. This work should be focused on dialing in business processes.

mritchie712 · 2024-08-12T11:11:33 1723461093

Agreed. A lot of my background is in banking / fintech. So much scrambling on "adhoc analytics" can be avoided if you have really well constructed, standard reports and dashboards.

The department I started my career at didn't have this and every exec request was a bespoke, adhoc request, tailored to answer that one question. I spent many late nights writing SQL and building excel files / powerpoints to answer a single question.

When I started at a fintech as the first data person, we built a few primary dashboards where you could drill down for detail. Those answered over 80% of the questions people asked.

bigger_cheese · 2024-08-12T05:29:39 1723440579

I don't think we disagree (Report and dashboards are useful decision making tools), but in practice hiring external people to come in and build them causes more problems then it solves. In my experience processes on the ground can and will change and all of these pre-written reports and dashboards quickly become ossified and outdated.

If you take steps to make sure the data is properly stored and queryable stakeholders can maintain and develop their own reporting. Make the data understandable and accessible and reporting will follow. Human friendly schemas and such help a lot.

itsoktocry · 2024-08-12T01:13:13 1723425193

Having lots of options isn't a bad thing. The tools you happen to have chosen for your product are not objectively "the best". The entire business model of SaaS is "it's cheaper for us to manage this than you".

I am also in this space. Personally, I like piecing together my tools, rather than using some black box. YMMV.

victor106 · 2024-08-11T13:27:12 1723382832

Definite - seems interesting.

But what is the Definite warehouse? Is it built on open standards?

mritchie712 · 2024-08-11T14:19:07 1723385947

Yes, our warehouse is built on DuckDB and Iceberg (https://iceberg.apache.org/). DuckDB is used as the query engine and storage or smaller or static data and Iceberg is used to store larger / more frequently updated data (e.g. CDC from Postgres).

tomrod · 2024-08-11T13:12:13 1723381933

I'm not a founder in this space like yourself, but I do a solid mix of consulting and building on the modern data stack for AI/ML and always appreciate a well constructed data stack.

cletus · 2024-08-11T14:41:50 1723387310

All technical problems are organizational problems. Put another way: any technical problem is a symptom of an organizational problem.

Even at Google, which has some truly amazing homegrown technical infrastructure, you see what I called Promotion Driven Development ("PDD"). I didn't see this when it came to core technical infrastructure (eg storage, networking) but at a higher level I saw many examples of something being replaced solely (ultimately) because someobody wanted to get promoted and you don't get promoted for maintaining the thing. You get promoted for replacing the thing.

The most egregious example was someone getting promoted to Principal Engineer (T8) for being the TL of something that was meant to replace existing core infrastructure before it had even shipped. In the end it didn't ship. The original thing is still there. But wait, "we learned a lot".

So this happens because the organization rewards the new thing.

So why is your data infrastructure being replaced? Probably because of an organizational failure and it'll have almost nothing to do with technical aspects of that infrastructure. This is true at least 90% of the time (IME).

Data infrastructure is particularly bad for this because in any sufficiently large organization you will completely underestimate the impact of changing data dependencies for metrics, dashboards, monitoring, ML training and so on. Those things can be hard to find and map out and generally you only find them when they break. Sometimes they can break for years before anyone notices even when the thing is used by live production systems.

yks · 2024-08-11T23:36:32 1723419392

If your system is not getting replaced after your departure, it is likely your domain within the organization is just not attracting ambitious people.

kerkeslager · 2024-08-11T11:12:19 1723374739

I work in a different domain (full stack development), but I think the principle here applies broadly.

I tend to favor tools that have been around for a long time. A lot of the sites I have built have deployment scripts written in bash with dependencies on apt packages and repos, git to pull application code, and rsync to copy over files. It would probably be okay to update to zsh at this point. ;) I'm constantly shocked by the complexity of deployment infrastructures when I get into new projects: I've spent plenty of time working with Docker and Kubernetes and I have yet to see a case where these simplified things. As a rule, I don't throw out existing infrastructure, but if I'm doing greenfield development I never introduce containers--they simply don't do anything that can't be done more explicitly in a few lines of Bash.

One of the sites I still maintain has been running for 15 years. I ported it from Fedora (not my choice) to Debian about 8 years ago, and the only thing that changed in the deployment scripts was the package manager. I switched to DigitalOcean 5 years ago and the deployment script didn't change, period. The deployment script is 81 lines of bash. git blame shows 64 of those lines are from the original commit of the file. The changes are primarily to add new packages and to change the firewall to UFW.

And critically: I didn't write this script. This was written by a guy before me who just happened to have a similar deploy philosophy to me. That's a much better maintainability story than having to hire someone with your specific deploy tools on their resume.

itsoktocry · 2024-08-12T01:27:17 1723426037

>I tend to favor tools that have been around for a long time

If there are no early adopters, then no tools will ever make it to the "been around for a long time" stage. Someone has to use it.

kerkeslager · 2024-08-12T08:51:37 1723452697

My clients don't pay me to do community service.

And frankly, in the vast majority of cases, someone doesn't have to use it, because we simply don't need new solutions to already-solved problems. The proliferation of new solutions made by people who don't understand the existing solutions is a barrier to progress, not a driver of progress. There is a shortage of people willing to do the boring work of iterative improvement to existing tools, not a shortage of people willing to try out the latest exciting new tool.

If a tool solves a problem that hasn't already been solved, then there isn't any "tool that has been around for a long time" to prefer. You use the new tool in that case because you don't have a choice.

And yes, sometimes there do emerge new solutions that solve the same problems better, but that's actually extremely rare, because if there were an easy, obvious way to do it better, the original tool's developers would have done that.

And then it has to be better enough to justify changing everything else that has been built around the existing solutions. This is why, incidentally, I generally support making rare breaking changes to existing tools: the cost of switching to a completely new tool is greater than the cost of making breaking changes to an existing tool.

banku_brougham · 2024-08-11T17:52:51 1723398771

This is a compelling testimonial. Someone recently posted a 'git ops' package here that was equally simple and bash based. Im interested to try, but im wondering what the rationale will be for avoiding github actions.

If you could share script or snippets I would be grateful.

One difficulty for me is my datastore and reporo g package is a python application using Prefect, and I manage dependency with Poetry.

The 'poetry install' phase isnt always ha ds off and my script fails.

kerkeslager · 2024-08-11T19:15:30 1723403730

> Im interested to try, but im wondering what the rationale will be for avoiding github actions.

I don't. Github actions are simple enough, and easily run the scripts I'm talking about.

> If you could share script or snippets I would be grateful.

Maybe I'll pull something together and post it if I get the chance this week.

> One difficulty for me is my datastore and reporo g package is a python application using Prefect, and I manage dependency with Poetry.

> The 'poetry install' phase isnt always ha ds off and my script fails.

I don't use poetry, I use pypi directly with pip, so not sure I can help you there. That sounds like a problem to debug. But notably, it's probably easier to debug than similar problems with a containerized system (which happen).

ambicapter · 2024-08-11T18:57:40 1723402660

> but im wondering what the rationale will be for avoiding github actions.

Having used them, I would say lack of test-ability and slightly unhelpful flow-control rules would be my guess. The former you'll find in any managed CI solution.

tazu · 2024-08-11T11:54:17 1723377257

For infrastructure, I try to keep the Lindy effect [1] in mind.

[1]: https://en.wikipedia.org/wiki/Lindy_effect

kerkeslager · 2024-08-11T20:47:46 1723409266

Exactly! Thanks, I didn't know there was a name for that.

moltar · 2024-08-11T16:45:51 1723394751

An easy to maintain stack from my experience that almost anyone can do:

- S3 for storage

- Glue catalog to describe / define source data shapes

- Athena to query the above

- dbt for business data modelling (has Athena and glue adapter)

The only difficult part I always struggle with is getting partitioning right.

hipadev23 · 2024-08-11T17:20:46 1723396846

Everytime I look at S3/Glue/Athena I can’t help but feeling like the Glue layer shouldn’t be necessary and it’s instead just part of athena’s ddl

ianburrell · 2024-08-11T18:04:12 1723399452

Athena is query engine and can use multiple catalogs. It forwards DDL queries to the catalog. Glue is the default catalog.

mulmen · 2024-08-13T01:36:45 1723513005

Glue catalogs can be used by other query engines as well. Separating schema from compute is the foundational concept behind a data lake.

fifilura · 2024-08-12T11:45:18 1723463118

Exactly, and same as my comment below (parquet+iceberg+s3).

And yes Athena is a part of that. And we also use dbt but mostly for a place to commit and push queries. And I agree with the other question about glue, it is the ugliest part.

I guess a +1 is not per hacker news standard, but i still want to give it some strength, given that we came up with the same solution independently.

OutOfHere · 2024-08-11T19:09:37 1723403377

Is DBT really necessary? (serious question) If so, why? What would go wrong by skipping it?

moltar · 2024-08-11T21:38:48 1723412328

No, not necessary at all. You can write queries and CTEs and create views in Athena/Glue by hand, if that’s what you prefer.

OutOfHere · 2024-08-11T21:47:35 1723412855

I mean what does DBT offer me here that makes it worthwhile?

moltar · 2024-08-12T06:25:00 1723443900

- an established convention for project organization

- a tool to run lots of SQL queries at scale

- a tool to create and update views in the correct graph order to avoid dependency issues (e.g. removing a column from child view that parent still depends on).

- SQL codegen / templating using Jinja

- an ecosystem of packages that provide useful utility macros. E.g. every project eventually needs a calendar. Just look at that SQL statement to generate one. It’s gnarly.

- a test runner on data to ensure quality and contract adherence to avoid breakage upstream.

itsoktocry · 2024-08-12T01:24:25 1723425865

It offers a well organized SQL project. You have to store the scripts somewhere.

You may not need it, I find it really useful.

alexpotato · 2024-08-12T00:39:55 1723423195

Here is the Big Investment Bank version of "resume driven development":

- "Legacy" trading system in place (old but battle tested)

- New head of technology for the business comes in

- They get the bright idea to retire the old system and roll out a new system

- They promise ridiculous deadlines

- Roll out a half baked new system and retire AT MOST 60% of the old system

- They bounce to the next role at another firm b/c they have "retired legacy trading system / rolled out new system" on their resume

Meanwhile, the new system ALWAYS has some giant outage or near miss due to the rushed deadlines.

mkl95 · 2024-08-11T15:41:07 1723390867

> The problem I find is that many data teams get thrown into having to design data infrastructure with little experience actually setting up one in the past. Don’t get me wrong, we all start with no experience. But it can be difficult to assess what all the nuances of different tooling and designs can be.

I've been there. Companies can be cheap about training and I was given none before building some sophisticated data stuff that surprisingly worked, but probably could have been simpler.

I got a much better job soon after, and hopefully my replacement got some training.

devjab · 2024-08-11T15:56:21 1723391781

I think that one of the biggest issues is that a lot of training is flat out horrible. The author preaches simplicity and working directly with the business, but mean while you have entire teams of developers being taught something like Clean Architecture, SOLID, DRY and all sorts of fancy things that when used poorly leads to extreme over-abstractions.

So even the most well meaning engineers with good training can go about building data structures which won’t last very long once the key people leave. Simply because what they were taught doesn’t work.

fifilura · 2024-08-11T17:39:52 1723397992

Parquet+iceberg stored on s3 as a base. That is solid enough.

After that comes various kinds of caches, maybe postgres for frontend? Or something streaming?

But once everything is stored as files you gain the freedom to experiment or refactor.

4WIW · 2024-08-21T01:07:03 1724202423

I did designs that overstayed their intended lifespan by 15 years. I did designs that were cancelled even before being fully implemented. Most however had their predictable lifespan of 3-6 years.

It seems to me that the key is to make useful product based on sound technical decisions; entropy (a thing you can't control) will handle the rest.

cgio · 2024-08-11T23:11:21 1723417881

Missing a most important part in my opinion, store data using open standards in an accessible platform to enable rather than anticipate evolution. Options would be e.g. Postgres, Parquet, Avro, Json, even csv. Storage is the foundational, absolutely infrastructural data infrastructure. No one cares if data pipelines infrastructure changes, but if it cannot be done, just because your data is hosed into a vendor locked-in platform, then that is the infrastructure failure you did not want in your conscience.

pphysch · 2024-08-11T17:02:36 1723395756

> Our really smart engineer working over time built amazing custom infrastructure

> They quit and no one knows how it works

Either the infrastructure wasn't "amazing" in the first place or clueless management is looking for a scapegoat.

"Amazing" is an interesting word choice because a non-technical manager will be amazed by any blob of code. "Amazing" doesn't mean a straightforward, robust solution to a difficult problem.

cwalv · 2024-08-11T23:46:48 1723420008

> They quit and no one knows how it works

Agreed, this is a very low bar for "amazing"

kkfx · 2024-08-11T11:13:21 1723374801

Hem... Sorry but... It seems more propaganda for proprietary cloud solutions than a personal statement and actually the conclusion "do not do it yourself" tend to be regularly denied by the facts...

Choosing third party, well known, FLOSS infra/open formats it's a thing, not developing their own infra with such tools in house is another.

yobbo · 2024-08-11T17:04:17 1723395857

Yes, but this article seems to be talking to a business that has no competence in "data" outside of a handful random engineers that have no stake in the business. The advice given amounts to "avoid bad things".

Resume-driven engineering etc is the result of engineers with no stake in the business future. Any solution must involve incentives against "bad things" and favour "good things".

pmx · 2024-08-11T11:50:40 1723377040

> In one example I came in and found a data vault project that wasn’t being used. It actually had pretty good documentation. However, the team had taken so long and hadn’t fully completed the project which led to their dismissal.

I feel like this is a major reason things don't get documentation. We don't get judged on it, nobody cares how good the docs are, but they DO care that we're shipping features.

cmiles74 · 2024-08-11T13:55:04 1723384504

The most praise I've ever received for my documentation was from the developers forced to read it after I moved on to a position with another company. It's a little unfortunate I didn't hear more about it as I was writing it, but then again it's the kind of stuff other developers find useful.

Still, in my opinion, spending the time writing documentation was a net positive: it didn't take me all that long to write up and it clearly made someone else's job easier, so much so that they mentioned it to me. And, of course, I've had to read my own docs more than once.

yourapostasy · 2024-08-12T03:29:00 1723433340

> It's a little unfortunate I didn't hear more about it as I was writing it...

This is usually because most people find it faster and easier to ping the random access memory that is you while you are there instead of serially reading the documentation. I have found a non-linear correlation going in the wrong direction between the comprehensiveness of the documentation written, and the amount of requests for live assistance on the very matters covered in the documentation, when the requestee is the author of the documentation.

This is where an effectively-segmented Table Of Contents comes in handy. As the author, instead of rewarding unwanted organizational behavior by giving the requestor the spoon-fed answer they seek even if the requestee knows it off the top of their head, encourage the organization a "teach them to fish" posture by pointing them in the right direction by the chapter section.

"It will be somewhere in Chapter Foo" gets the requestors to familiarize themselves with the documentation structure at least in piecemeal fashion.

Junior engineers who make a habit of asking questions already addressed in documentation and don't take the "LMGTFY"-type responses as a hint they should search first, ask later, I start helping by asking, "what keywords did you try searching upon in the documentation", and start a conversation about why they used only such keywords. I'm excited by the RAG-family generative AI I am able to start loading documentation into that will let me ask, "what prompts did you try upon the documentation fine-tuned AI", as these AI's are powerfully effective with functionally fuzzy keyword searches. I'll fine-tune the documentation when it looks like various terminology is used over and over by searchers, but in general this approach works great to teach juniors how to search for information.

Engineers who don't quickly learn and keep asking for spoon-fed answers, enter an endless loop where they might occasionally help fine-tune the documentation, but otherwise are only pointed to the general area to search, and delegated to their colleagues who have learned to search for peer-led instruction. In over 20 years of applying this strategy, I've yet to run into an engineer who would not eventually learn, though if I did, they probably self select out of this field.

In a dog fooding fashion, when I ask someone for help, after my specific ask of them, I tell them what documentation I tried availing myself of first, what terms I searched upon, how I reasoned through to the point where I'm stuck, and what my theories/models are of what I think is happening. This helps rapidly prune the tree of responses, and 80% of the time the response is along the lines of, "oh, yeah the official documentation is out of date, use this technical note instead".

devjab · 2024-08-11T16:02:40 1723392160

Should they care? I’ve come into organisations that had spent entire years worth of man hours on setting things up correctly so that they could potentially scale to millions of concurrent users. Organisations which would never reach more than 50.000 concurrent users in their wildest dreams.

On the flip side I’ve seen some extreme cowboy hacker man code run perfectly fine for its entire 10 year life cycles.

Now, I don’t think you should go completely cowboy, but I do think you should think about whether or not your “correctness” is getting in the way of your actual job. Which is typically as a service function where you’re supposed to deliver business value at a rapid pace. Obviously it depends on what you do. If you work in medical software you’re probably going to want to get things right, but just how much programming could be perfectly fine if it was just thrown together without any adherence to “correctness”? The theory tells you it’ll cost you down the line, and in some cases it will. In my anecdotal experience it’s not as often as we might like to think, and sometimes the cost is even worth it. In two decades I’ve only ever really seen two poorly build systems cost so much down the line that they would’ve been better off having been build better from the get go. In both cases they couldn’t have been build better upfront because the startups didn’t have the people to do so.

g4zj · 2024-08-11T11:53:45 1723377225

I think I write decent documentation, but my target audience is usually my future self. If I can rely on my own documentation to help quickly reacquaint myself with the project later on, I generally consider it sufficient.

mannyv · 2024-08-12T07:50:20 1723449020

Design so you can replace the design when needed.

fsndz · 2024-08-11T19:03:17 1723402997

The problem is consultants selling bullshit as expertise are more prevalent than honest consultants. And for these bullshit consultant, selling the most unnecessarily complex solution with all the trendy keywords and making beautiful slides is all that counts. And what is funny is that customers believe all those lies.

xiaodai · 2024-08-12T04:20:46 1723436446

whoever thought spark based on Scala is a good idea should get shot.