Building a Bank with Kubernetes [slides] (monzo.com)
375 points by renaudg on Oct 24, 2016 | 100 comments



Author of the talk here. Happy to answer any questions anyone has. This post also contains more info on how we build our systems: https://monzo.com/blog/2016/09/19/building-a-modern-bank-bac...

And as an aside, I'll also be giving a longer talk at Kubecon going into more detail on some of this stuff :-)


Not that I don't appreciate the subject matter; I wave the flag for K8S all the time and I've shipped bank product presentation stuff on it... But...

The fundamental challenges of building a bank are almost entirely orthogonal to things like distributed system uptime and resiliency (unless, I suppose, you could lose consistency during the types of service loss Kubernetes makes easy to ameliorate). Evidence for this abounds: nearly every major bank out there is at least 10 years behind the tech we're discussing here. Sure, banks are getting savvy to modern techniques for their presentation and API layers (my employer, Capital One, for example).

But the actual challenges are ensuring that data consistency, liveness, and auditability are preserved. I'm really curious whether you're using a novel technique to achieve this that Docker and self-managed micro-app swarms can deliver on better.

Because what really limits most financial institutions from embracing a lot more modern tech is their core systems of record AND the acceptance of said systems by their governing agencies.

So when we talk about basic datacenter ops, that's all great. But I don't think those are the things that make K8S great for a bank. I think most people in a position to evaluate a financial institution's roadmap would be unmoved by this deck.

Now, if you talk about other things we both know K8S is great for... Things like discovering the genesis of a database action by preserving it throughout the chain of responders, or being able to rapidly respond to site exploits with rolling restarts and a built-in mechanism for feature flagging, or having a really great way to offer data scientists the environments they need without risk of data theft, or being able to use traditional CI/CD methodologies but end up with a single deployable unit that is amenable to both automated and manual review and mechanical deployment regardless of the tooling used within.

Not that I think selling K8S is your job. But I thought I'd mention the perspective of someone doing infrastructure modernization and product work at a major bank.

And of course, as always: the opinions above are my own and not those of my employer or co-workers.


>what really limits most financial institutions from embracing a lot more modern tech is their core systems of record AND the acceptance of said systems by their governing agencies.

I just want to say that I think there is a huge amount of FUD about how you can and cannot build your technology as a regulated entity – and in particular as a bank. In reality, close to 100% of requirements from a regulator will tell you _what_ you must build, not _how_ you must build it. Even then, especially in terms of resilience and security, they are almost always a subset of our own requirements.

What can be more of a challenge is convincing an auditor that what you have done is acceptable, since it can be so different from what they may have seen before. Again, I don't think this is a reason to compromise. We see technology as a major competitive advantage, so it is worth the effort to find open-minded auditors, and spend time to explain and demonstrate how (and why) our software meets the requirements.

I don't think there's any way we could build a secure, resilient bank with the kind of product experience we want, AND do it on the budget of a startup if we approach technology through the same lens as existing banks.


> I just want to say that I think there is a huge amount of FUD about how you can and cannot build your technology as a regulated entity – and in particular as a bank. In reality, close to 100% of requirements from a regulator will tell you _what_ you must build, not _how_ you must build it. Even then, especially in terms of resilience and security, they are almost always a subset of our own requirements.

I'm not sure what your involvement here is, and mine is limited (thankfully) to a substantial distance. However, I think there is a rather big difference between different FIs' experiences, because it's a true game of politics.

When investigating if Level Money should be a bank (and learning that almost no one wants to be a bank; it's very hard to make money just being a direct-to-consumer deposit bank unless you're quite scaled), I had a surprisingly credulous audience, because the CFPB was basically willing to doorbuster anything they thought would spur the big banks to action. It was... very surprising to ultimately decide that it was impossible, for financial reasons, to actually succeed at being a bank.

But they will insist on things that every bank should have, like a credible way for analytics to run in an environment where the raw data is not subject to exfiltration by a compromised data scientist's machine without an audit trail a mile wide.

Good luck to you.


> I'm not sure what your involvement here is …

Parent poster is Head of Engineering at the bank (not CTO?), and author of the slides.


Yes but that may or may not mean direct involvement with regulators. For small companies it's really hard to tell where they are in the process.


We're _very_ directly involved with regulation, and with regulators :-) It's a very important part of our business, so we put a lot of effort into ensuring that we please regulators _and_ that we run our business in the way we want.


Just this morning I read an article in Bloomberg's news magazine about UK regulators being avant-garde and promoting competition in banking by allowing branchless/app-based banking. Monzo was referenced as well, as an example of a new-age bank.


> I just want to say that I think there is a huge amount of FUD about how you can and cannot build your technology as a regulated entity – and in particular as a bank. In reality, close to 100% of requirements from a regulator will tell you _what_ you must build, not _how_ you must build it. Even then, especially in terms of resilience and security, they are almost always a subset of our own requirements.

Having worked with banks and insurers solely in Java, connecting to legacy systems (mostly COBOL), I was surprised, in my current position, to see some companies doing their complete banking back end in PHP & MySQL. I knew the "how" is not part of the regulations, but I did expect the CTOs to pick the 'no one ever got fired for choosing' option.


It seems common knowledge these days among the slightly but not too technically inclined that any new major project should use a LAMP stack as its base.

People see Facebook, Amazon, and many others running PHP & MySQL on Linux at scale and they know it works reliably, so while it may not have the support of Cisco or Oracle, it is pretty close on the 'no one ever got fired for choosing' X scale, since you can point to every other major company using these building blocks reliably if your investors, CEO, board or auditors asked why you chose to use PHP & MySQL.

In summary, PHP & MySQL have become the modern equivalent of a "safe" choice for your stack to be built on. It's not necessarily a bad choice either: you get access to a large community of skilled people who can write PHP and SQL, and while everyone likes to hate on PHP, it isn't about to up and disappear any time in the next decade either (unlike COBOL).


I'd say the reliability a bank is looking for is much higher than what's acceptable for web companies (i.e. web companies are fine with eventual consistency, which is obviously unacceptable in a banking system outside of trivial non-core features).


I'd also suggest that it's not just scale; the kind of reliability Facebook needs is fundamentally different than what a bank needs. Broadly speaking, Facebook needs the site to keep working as well as possible even if some subservice fails, and a bank needs a subservice not to fail. I'm summarizing here and I know it; clearly neither of them is actually on the absolute extreme end, as Facebook needs authentication to work and a bank may not care if the interest rate display widget on their customer banking app fails to load a couple of times. But I'd still suggest there's enough difference between the requirements to be a fundamentally different domain.

Even in "the cloud" things differ between services. A social media app has very different reliability requirements than a backup cloud.


Well actually there are many sub services in a bank that can go down without major impacts. The two major banks I use have weekly planned outages of features like old statement retrieval, person to person payments, ACH transfers, etc. Basically everything in the web interface could experience outages without any major crisis.

As long as ATM requests always work, nobody really seems to care.


"Broadly speaking, Facebook needs the site to keep working as well as possible even if some subservice fails, and a bank needs a subservice not to fail."

One of the reasons many of them stick with mainframes, AS/400's, and NonStop systems for backends. ;)


Why would eventual consistency be unacceptable in a banking system? In my experience people interact with social media on far shorter time scales than their banks.

When they post a new Instagram photo, they expect that their friends will see it basically instantaneously.

In comparison, when people use their debit card at CVS, they're not expecting anyone to log into their bank account seconds later and see the charge show up.

I would think correctness is more important than speed in a retail consumer bank.

Or do I misunderstand what you mean by eventual consistency?


If your data is only eventually consistent, then DB node A can still have your bank balance at $x for some time while it is already $0 on node B. Then, if some operation (say a withdrawal) checks the balance with node A, you have a problem.
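
A tiny Go sketch of that race (purely illustrative, nothing to do with Monzo's actual setup): a withdrawal check that happens to be served by the stale replica approves money that node B already knows is gone.

    package main

    import "fmt"

    // Purely illustrative: node B has already applied a withdrawal that
    // node A, lagging behind on replication, has not yet seen.
    func main() {
        balances := map[string]int64{
            "nodeA": 10000, // stale view, in pence
            "nodeB": 0,     // up-to-date view: the money is already gone
        }

        requested := int64(5000)

        // The withdrawal check happens to hit the stale replica.
        if balances["nodeA"] >= requested {
            fmt.Println("withdrawal approved against nodeA's stale balance")
            fmt.Println("actual balance on nodeB:", balances["nodeB"])
        }
    }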


Yes, this is true of eventually consistent systems. The question is a) what does "eventually" mean (replication takes seconds, minutes, or hours?), b) what time delta do you expect for most transaction requests, and c) what is the risk of being temporarily wrong?

Seems to me that a bank could answer these questions as well as any other business, and build a system that works within the answers.


Banks are already eventually consistent. Balances are reconciled at COB (close of business).


You're actually somewhat right! ATMs are (sometimes) an example of eventual consistency. If an ATM is offline, it'll often allow you to make a withdrawal anyway and report back once it's on the network again. That could mean an overdraft for you. The caveat here is that these are often low-traffic ATMs on the periphery; ones in the city are usually calling home to check balances.

However, the buck (no pun intended) has to stop somewhere. Overdraft limits have to be consistently applied. Even that is somewhat up in the air. Take this with a grain of salt as it's second-hand information, but my wife works in fraud prevention at a smaller credit union. She says that transactions are collected throughout the day and overdrafts are only applied at the end of the day, to allow for bills to drain your account beyond its capacity and then payroll to land without applying overdrafts unless you're still in the red afterwards. In some sense, that's even "eventual consistency" on the scale of 24 hours.

The most important thing in banking is that at the end of the day, the balance sheet, well, balances. And banks limit their liability by preventing too much overdraft and applying daily limits to ATM withdrawals. I'd posit that general eventual consistency fits that pretty well, as long as "eventual" isn't "hours" for the most part.

A little more on eventual consistency in general, as I understand it: eventually consistent systems come in many forms. In a leader/follower setup (think MySQL w/ async replication), "important" calls are usually made to the leader in a consistent fashion and changes are asynchronously replicated to the followers for general read fanout. There are a lot of different kinds of systems with different guarantees. In a Dynamo-style system, writes/reads are usually done to a quorum of replicas (e.g. 2 of 3), and only if the reads from the two replicas disagree are the values on all three replicas "repaired" via last-write-wins. Facebook has a model they call causal consistency[1] which models causal relationships (e.g. B depends on A, therefore B isn't visible until A is also replicated).

You can consider any system with a queue or log in it that doesn't provide some token to check for operation completion to be eventual. For example, imagine you fronted DB writes with Kafka. Lag between writing to Kafka and commit into the DB may only be 100ms, but that's "eventual". However, if you provided back a "FYI, your write is offset 1234 on partition 5", you could use that as a part of a read pipeline that checked that the DB writer was beyond offset 1234 on partition 5 before allowing the read to proceed. That'd be consistent.

[1] http://queue.acm.org/detail.cfm?id=2610533
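
As a sketch of that offset-token idea in Go (the types and names are invented for illustration, not any particular Kafka client's API): the log append hands back a (partition, offset) token, and the read path blocks until the DB writer has consumed past it, giving read-your-writes on top of an otherwise eventual pipeline.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // WriteToken is handed back to the caller once a write is accepted into the log.
    type WriteToken struct {
        Partition int32
        Offset    int64
    }

    // Applier tracks how far the downstream DB writer has got per partition.
    type Applier struct {
        mu      sync.Mutex
        applied map[int32]int64
    }

    func (a *Applier) MarkApplied(p int32, off int64) {
        a.mu.Lock()
        defer a.mu.Unlock()
        if off > a.applied[p] {
            a.applied[p] = off
        }
    }

    // WaitFor blocks until the DB writer has consumed past the token (or times out).
    func (a *Applier) WaitFor(t WriteToken, timeout time.Duration) bool {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            a.mu.Lock()
            ok := a.applied[t.Partition] >= t.Offset
            a.mu.Unlock()
            if ok {
                return true
            }
            time.Sleep(10 * time.Millisecond)
        }
        return false
    }

    func main() {
        app := &Applier{applied: map[int32]int64{}}
        token := WriteToken{Partition: 5, Offset: 1234} // returned by the log append

        go func() { // DB writer catching up asynchronously
            time.Sleep(100 * time.Millisecond)
            app.MarkApplied(5, 1234)
        }()

        if app.WaitFor(token, time.Second) {
            fmt.Println("safe to read: offset 1234 on partition 5 is in the DB")
        }
    }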


I would assume it's pretty hard to make something ACID-style when doing stuff across microservices, which I'm assuming Monzo is doing.


That part is surprisingly easy if you architect it right. The core abstraction most banks use is your "available balance" and the fact that they can reconcile on a longer time period than seconds.
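
One way to read "architect it right" (my sketch, not a description of Monzo's design): keep a small, strongly consistent available-balance counter for authorisations, and let the full ledger reconcile on its own, slower schedule.

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    // Hypothetical sketch: the "available balance" is the only thing checked
    // synchronously; the ledger of settled transactions can catch up later.
    type Account struct {
        mu        sync.Mutex
        available int64   // pence; decremented at authorisation time
        pending   []int64 // holds awaiting settlement / end-of-day reconciliation
    }

    var ErrInsufficientFunds = errors.New("insufficient available balance")

    // Authorise places a hold against the available balance.
    func (a *Account) Authorise(amount int64) error {
        a.mu.Lock()
        defer a.mu.Unlock()
        if amount > a.available {
            return ErrInsufficientFunds
        }
        a.available -= amount
        a.pending = append(a.pending, amount)
        return nil
    }

    func main() {
        acct := &Account{available: 10000}
        fmt.Println(acct.Authorise(6000)) // <nil>
        fmt.Println(acct.Authorise(6000)) // insufficient available balance
    }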


Oracle will happily support MySQL. In fact, they own it.


> I was surprised, in my current position, to see some companies doing their complete banking back end in PHP & MySQL.

Glad to know! Now I need to find these folks so I can hide in a dumpster from them!


PHP in banking? That's scary. Assuming you can handle PHP development the way Facebook does looks like a bad judgment call from where I'm standing.

PHP seems like it is made for everything banks are not. I have the impression PHP is hard to do securely and invites bugs, but lets you get going quickly. This sounds like Lockheed deciding to use C++ for the F-35.


To be clear, we (Monzo) are not writing banking software in PHP, or using MySQL.


I work for a huge bank and the system of record thing is spot-on. I don't think there's any law that says we must operate systems of record in certain ways or keep records of every little thing but remember that this is banking. Keeping records of every little thing is practically the religion of banking!

So for a long time now the big banks have been trying (and often failing) to keep track--centrally--of every little server or device that pops up on their networks. So there was a big push recently (few years ago) at the big banks to improve their systems of record. The goal being mostly related to better financial (asset) tracking. So they can figure out which internal teams were using the most/least resources as well as figure out who's not upgrading their stuff on a regular basis (technical debt builders).

So they spent all this money improving their systems of record and in walks Docker. It practically turns the entire concept of having a central place to track "systems" on its head!

To give you an example of the difficulties: We have loads of policies that say things like, "all systems/applications must be registered in <system of record>." Sounds simple enough: Just make sure that wherever a Docker container comes up we create a new record in the system of record and remove it when it comes down.

Except it's not that simple, for many reasons, the most obviously problematic of which is that the "system of record" works in batch. As in, you submit your request to add a new record and then maybe 8 hours later it'll show up.

Did I mention that there are also policies that say you can't put any system into production until it shows up in the system of record? =)

That's just scratching the surface though. Because the system of record at most financial institutions doesn't just allow you to delete records. Once you create one it is there forever. It merely gets marked as "retired" (or similar) and most banks require phases as well. For example, before a production system can be marked as retired it must first go through a mandatory, "waning" period (what it's called depends on the bank) that can often be weeks.

I can go on and on about all the zillions of ways in which systems of record (and the policies that go with them) are anathema to usage of Docker but I think everyone reading should "get the picture" at this point. If not, just imagine the havoc that would entail when you have thousands of Docker containers coming up and down every second. Or the entire concept of a container only being up for a few seconds to perform a single batch operation (banks love batch, remember!).

If you think the system of record requirements make adoption of Docker difficult you should know that the security policies are worse! Imagine a policy that states that all systems must undergo a (excruciatingly slow) security scan before being put into production. That's just one of the headaches, sigh.


> what really limits most financial institutions from embracing a lot more modern tech is their core systems of record AND the acceptance of said systems by their governing agencies

I tend to think (from experience) that the reason banks don't embrace modern tech is inertia - they've been able to "milk the cow" of regulatory-body-sanctioned profits for so long that they're in a mindset of not wishing to introduce volatility into what has been a highly predictable revenue stream. They've been using garbage technology for so long while still making money that there has been no institutional impetus to innovate and embrace new tech. This is now coming to bite them in the ass as regulatory bodies are allowing new players into the market, eroding some of the guaranteed profits banks have enjoyed up until now.

I doubt most large capital markets institutions (on both the retail and IB sides) will be able to weather the current (and impending) storm of disruption and come out unscathed.


> I tend to think (from experience) that the reason banks don't embrace modern tech is inertia

A lot of inertia. A lot of new-hype-tech is unreliable undocumented shit not ready for production. A lot of new-tech-is-old-tech that's been done for many years but with a new name.

Whenever someone asks "why don't you use Docker for xxx?"

Reply with "When is the last time you had an issue with Docker? Tell me about it."

You'll LOVE the horror stories <3


"You'll LOVE the horror stories <3"

It's even better when you're on a site like Hacker News with stronger-than-average technical people. You get to read endorsements of the latest and greatest, with people asking why (insert failure of common action here) is happening... in the same thread. I'll just stick with a hardened, flexible configuration of classic architecture and components that work.


Yes I'd hesitate to use Docker in a system I wanted to maintain for 10 years. We can't get consistent interfaces for 2 months!


I'd say inertia is a large factor but also risk aversion. Bank systems have to be very available (or they attract large fines) and so there's a tendency to stick with things that are proven to be robust even when they have other problems.

The other major problem I've seen is that most banks don't think of themselves as technology companies, and treat IT as an overhead to be minimized, which is absolutely the wrong approach.

This tends to lead to things that save money in the short term (e.g. outsourcing deals) but could well cost money in the longer term, as they make replacing legacy systems harder.


This is nonsense. Not every system in banks needs to be highly available. That's just silly.

Like the print server on the 3rd floor needs an active standby! Haha.

No, just like any organization banks have "critical" systems and "everything else." Docker is mostly being "sold" as a means to replace and improve the non-critical stuff. Like that internal web app everyone uses to look up <whatever>. Or the system that generates daily reports on <whatever>.

Just like most organizations, banks have a few critical systems and everything else is less so (to varying degrees).


Yes, and the topic of discussion in this thread is... core banking systems... which do have to be highly available. If the topic had been bank print servers and I had made my comment, yours may have made more sense.

My comment didn't say "bank print servers need to be highly available" anywhere... at all...


>what really limits most financial institutions from embracing a lot more modern tech is their core systems of record AND the acceptance of said systems by their governing agencies.

Let me put this even more bluntly: if your bank's technology plan isn't 99.5% about dealing with government regulation and maintaining core record integrity and auditability, you are not a serious player in the space.


Matt Levine's comment always sticks in my head: 'I say sometimes that the tech industry is about moving fast and breaking things, while "finance is an industry of moving fast, breaking things, being mired in years of litigation, paying 10-digit fines, and ruefully promising to move slower and break fewer things in the future."'


That's a good comment. Applies to most banks. Then there's Goldman Sachs. ;)


Oh, what I said was coloured explicitly by my experiences working on regulatory development at Goldman.

Everything was filtered through the lens of "What would the regulators say? And how do we tell the regulators what we're doing?"


Man, it's like people think banks are special when it comes to IT. They're not!

You have lots of extra regulations for sure but most of them are about retention of financial records. If a system isn't processing/storing financial records or "privileged" information nobody gives a damn.

The "technology plan" is 99.5% about making or saving money. That remaining .5%? Yeah, that's compliance. Because that's all it costs. Unless you think central logging systems are going to take up some large percentage of a multi-billion dollar quarterly budget?

People love to complain about "the costs of regulation" but you know what? In finance it really doesn't amount to much in terms of "how much we spend." How much "it holds back the market" is a different debate entirely.

Aside: Without those regulations we'd just repeat all the same financial disasters throughout history.


You know, it's strange: when I worked for an insurance IT dept. we were informed that strict adherence to an ITIL-certified release process was essential to keep us "within compliance with FSA regulations - we need to do this in order to continue trading".

From experience, said process cost waaaay more than 0.5% of the budget. Time and cost overruns, massive overhead in personnel and a drain on mental resources which should have been spent on actual release quality rather than an audit trail meant to convey "Certified" quality. All in all, I'd say 50% of the costs of IT delivery were spent in plodding through the checkpoints with much of the other half being consumed by the interest on 20 years of technical debt accrued as a result of those very same resources being misdirected in such regulatory endeavours.

I recognise that I'm far too cynical to see regulation as anything other than a shield against liability. It's simply too obstructive to contribute to actual quality improvement. On the plus side, it does keep about 50% of IT personnel in a job.

So I guess you can count me in with the lovers :-)


If it's a computer in a bank and it touches risk, trading or treasury, it's fair game for the Fed auditor.

So you tell me: what computer system of any import in a bank doesn't touch one of these three things?


Actually all systems are fair game to the auditors. If an auditor wants to see something they get to see it 99% of the time. End of story.

They really don't care about systems that don't process financial information! They don't care about your dev or qa environments. They don't care about your DNS servers or your switches or much else for that matter.

Regulators are 100% laser-focused on financial information and transactions. They want to see ledgers and logs and they want to see evidence that your systems prevent tampering. That's it.

There are no financial regulators that actually audit IT stuff. We probably should have them, but we don't. The closest is the FFIEC, but they only publish non-binding guidelines.

If you think the PCI-DSS matters to banks you're mistaken. Every year we audit ourselves and put the results in a filing cabinet somewhere. We have no obligation to show it to anyone and no one would hold us accountable for failing to be PCI compliant anyway.


My department's annual budget was in the 100MM+ region -- and this doesn't count surge resourcing used to deal with capricious requests from the feds. I have been asked about my dev and qa environments (and how they are firewalled from production systems) repeatedly. And yes, I have been asked about network architecture too. Penalties for non-compliance came in the form of significant financial penalties. It only got worse once Dodd Frank hit. Once securities of any kind are involved, shit gets real fast.


As a Monzo customer I have to say that the product Monzo offers and the features that traditional banks want me to use are also somewhat orthogonal :-)

So long as they can satisfy the regulatory requirements, I'd rather they were using whatever else it took to build a great product.


What kind of resource usage do you see with linkerd?

It looks good, but I'm concerned about the resource overhead of a JVM based proxy on every node / pod.

edit: Ok answered myself - found https://blog.buoyant.io/2016/06/17/small-memory-jvm-techniqu....

So with some work it can be reduced from 500Mi->100Mi per instance. It'd be interesting to see the kind of CPU time it uses under load, though.


The most valuable things in the talk were linkerd and BGP (especially the configuration of Zebra and linkerd); unfortunately both were only glossed over briefly.

Has linkerd been ported over to SmartOS? (So far I haven't found anything saying one way or the other.)

For your next talks, please describe as many failure scenarios as you can think of, and what happens at the OS / application / network level. The more concrete failure examples you can provide, the better.

The why's are clear. The mechanisms are not.


This is great feedback – thank you.

A lot of the feedback from my blog post a month or so ago was that the "why" was not clear enough, so I guess I focussed more on this. Now that I've (hopefully!) covered that better, I agree that more depth on these topics would be helpful.


> Has linkerd been ported over to SmartOS? (So far I haven't found anything saying one way or the other.)

linkerd is a JVM application. It should run fine under SmartOS with a JVM. I run it successfully on a Raspberry Pi, OS X, Linux etc.


Hi,

I'd be very interested to hear if you've come across any good documentation on hardening Kubernetes?

From what I've seen so far there's very limited documentation on that when compared to other components like Docker, and the defaults can sometimes not be suitable for a high security environment (e.g. https://raesene.github.io/blog/2016/10/08/Kubernetes-From-Co... )


There is a solution to this coming in v1.5 - https://github.com/kubernetes/kubernetes/pull/32518


Yeah, I've got that noted in the blog post :) but it was really just an example. One of the things that's struck me about Docker/Kubernetes etc. is that they tend to be tuned for a general use case in terms of security and configuration. Choices that might improve security but restrict the usefulness of services are not usually defaults.

As such, a level of hardening beyond the out-of-the-box configuration is needed where they're being used in a high-security environment (e.g. banking).

For Docker we have resources like docker_bench and the CIS guide which provide a list of possible hardening steps, but I've not managed to find anything like that for Kubernetes, which is why I'm interested in how Monzo are addressing that issue.


Not sure if you are looking for a commercial solution, but we (Twistlock [1]) develop a security suite for enterprises working with Docker and / or Kubernetes. In fact, we are officially recommended by Google for working with GKE which is very much based on Kubernetes [2]. I'd be glad to elaborate if relevant.

[1] https://www.twistlock.com/

[2] https://cloudplatform.googleblog.com/2015/11/enhancements-to...


Take a look at OpenShift Origin for pointers: mandatory TLS at all levels, SELinux required with a policy to match, an OVS-based SDN with a multi-tenant plugin preventing k8s namespaces (projects in OpenShift speak) from talking to each other at the network layer, etc. Security is the primary feature of OpenShift.


I hadn't heard of linkerd before. How does it compare to etcd or Consul, and why did you choose it?


linkerd is not very similar to etcd or Consul. It's a proxy that sits between your microservices/apps to act as a middleman for RPC calls. Being a proxy, it can handle things like load balancing, throttling and failure handling, meaning you don't have to build that into each app.

When used together with Kubernetes, linkerd will use Kubernetes for discovery, so you don't need etcd or Consul directly (k8s itself relies on etcd, though).
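
To make the "middleman for RPC calls" bit concrete, here's a rough sketch of what a service call looks like from the application's side, assuming linkerd's default HTTP router on localhost:4140 routing on the Host header (those are linkerd defaults, not anything Monzo has confirmed, and the "payments" service is made up):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // Call the "payments" service through a local linkerd instance instead of
    // resolving its address ourselves. linkerd picks a healthy instance (via
    // Kubernetes service discovery when so configured) and handles load
    // balancing, retries, and failure accrual.
    func main() {
        req, err := http.NewRequest("GET", "http://localhost:4140/ping", nil)
        if err != nil {
            panic(err)
        }
        // With the default dtab, linkerd routes on the Host header: this
        // request goes to an instance of the hypothetical "payments" service.
        req.Host = "payments"

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Println("call failed:", err)
            return
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(resp.Status, string(body))
    }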


this is an excellent talk

a link to the video: https://skillsmatter.com/skillscasts/9146-building-a-bank-wi...

thanks for giving and posting it

i especially appreciate the effort because a month ago i posted this(o) comment:

> can anyone suggest reading to understand how contemporary banks function, where can i get an understanding of a bank or credit union from a software engineer's perspective: dependencies, steps to start, challenges of running, protections from common problems, interesting emerging disruptions;

best of luck with monzo, https://monzo.com/

(o) https://news.ycombinator.com/item?id=12567536


What are you using for container/pod logging?


Good question. We have a couple of approaches to this:

* Every request that comes into our system is assigned a unique ID, which is propagated on every downstream call and returned in a response header. When logs are emitted during request processing, they are tagged with this ID. A system we've built in-house indexes these log events against their trace ID in Cassandra (on a separate cluster). This lets us take a failing, slow, or otherwise interesting request and look up all the things that happened to it during processing. Events in this system are TTL'd according to their severity – so an event at critical or error severity is kept longer than one at debug severity.

* stdout/stderr from all our containers is forwarded to journald on each host. Logstash then pushes all these logs to Elastic (and also to permanent cold storage). This is useful to look at the "big picture" and means we can analyse all the logs in aggregate and makes it very obvious when something is _very_ wrong and causing a lot of requests to fail, but is less useful than slog for pinpointing a specific issue.
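
To make the first approach concrete, here's a rough gocql-flavoured sketch of a severity-TTL'd write indexed by trace ID. The table, TTL policy and cluster details are guesses for illustration, not Monzo's actual schema.

    package main

    import (
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    // Severity-dependent retention: errors are kept longer than debug chatter.
    // (The TTL values here are invented for illustration.)
    var ttlBySeverity = map[string]time.Duration{
        "debug":    2 * 24 * time.Hour,
        "info":     7 * 24 * time.Hour,
        "error":    30 * 24 * time.Hour,
        "critical": 30 * 24 * time.Hour,
    }

    // Hypothetical table, partitioned by an hourly time bucket plus the trace ID
    // (loosely following the partitioning described further down the thread):
    //
    //   CREATE TABLE trace_events (
    //     bucket text, trace_id text, ts timestamp, severity text, message text,
    //     PRIMARY KEY ((bucket, trace_id), ts));
    func writeEvent(session *gocql.Session, traceID, severity, message string) error {
        now := time.Now().UTC()
        return session.Query(
            `INSERT INTO trace_events (bucket, trace_id, ts, severity, message)
             VALUES (?, ?, ?, ?, ?) USING TTL ?`,
            now.Format("2006010215"), traceID, now, severity, message,
            int(ttlBySeverity[severity].Seconds()),
        ).Exec()
    }

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // the separate logging cluster
        cluster.Keyspace = "slog"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := writeEvent(session, "req-123", "error", "card authorisation failed"); err != nil {
            log.Fatal(err)
        }
    }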

It's worth also noting (since I find often a whole load of things can be mixed into logging) that we do not drive our monitoring off these logs, at least at the moment. We have separate systems for that.


How good is Cassandra at log-like data? Also, why the split between Cassandra and Logstash? Why not a single solution?


Because of its disk layout, Cassandra is truly excellent at time-series, append-heavy data. In our setup, the data is partitioned by time bucket and n number of labels (one of which is the request ID).

We may unify the two at some point, but there's no immediate need to do so. While the write use-case is quite similar across both, the read use-case is quite different: slog requires reasonably low latency reads soon after the data is written, data can age out after 2-30 days depending on severity, and sometimes dropping events is acceptable. It would be acceptable for reads from the "archival" system to take minutes or even hours, the data should be kept forever (or for a long time), and dropping events is never acceptable.


This is awesome! How do you do this? A new header with a unique ID... generated by something like Lua+nginx. But then how do you pass this ID from one service to another?


For us, an `X-Request-ID` header is generated by any app if it doesn't receive it from upstream -- but normally nginx or the CDN will generate it. There are a few nginx modules to do it; we use https://github.com/newobj/nginx-x-rid-header

Most languages/logging frameworks have some sort of per-thread context (e.g. Filters in Python, MDC in log4j, etc.) to tag log messages with. If you're using PostgreSQL, you can call `SET application_name='{requestID}';` and that can be output as part of its logs too.
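
As a sketch of what that looks like in a Go service (a hypothetical stdlib-only middleware, not any specific framework's API): reuse the incoming X-Request-ID if present, otherwise mint one, echo it on the response, and pass it along on downstream calls.

    package main

    import (
        "context"
        "crypto/rand"
        "encoding/hex"
        "fmt"
        "log"
        "net/http"
    )

    type ctxKey struct{}

    // withRequestID reuses an upstream X-Request-ID if present (e.g. one minted
    // by nginx or the CDN), otherwise generates one, and echoes it on the response.
    func withRequestID(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            id := r.Header.Get("X-Request-ID")
            if id == "" {
                buf := make([]byte, 16)
                rand.Read(buf)
                id = hex.EncodeToString(buf)
            }
            w.Header().Set("X-Request-ID", id)
            ctx := context.WithValue(r.Context(), ctxKey{}, id)
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        id, _ := r.Context().Value(ctxKey{}).(string)
        log.Printf("request_id=%s handling request", id) // tag every log line

        // Propagate the same ID on any downstream call (URL is made up).
        req, _ := http.NewRequest("GET", "http://localhost:8081/downstream", nil)
        req.Header.Set("X-Request-ID", id)
        // http.DefaultClient.Do(req) ...

        fmt.Fprintln(w, "ok")
    }

    func main() {
        http.Handle("/", withRequestID(http.HandlerFunc(handler)))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }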


There are quite a few monitoring products being built to solve this problem. Many of them are based around Zipkin (http://zipkin.io/)


I strongly suggest people look at the OpenTracing work too: http://opentracing.io/


We do the same in our setup. An Apache HTTPd assigns a unique ID (mod_unique_id) as an HTTP request and response header, so any downstream system gets the request header and can attach it to its logs. (In our case we write JSON logs and one field is the request ID.)


If K8S secrets management is storing them in plain text in etcd, how do you go about making it actually secure for a bank?


We don't store things that are actually secret, as k8s secrets ;-) Hashicorp Vault is quite good at this.
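
For anyone wondering what "use Vault instead" can look like in practice, here's a hedged sketch that reads a secret over Vault's HTTP API, assuming a KV v1 mount at secret/ and a token in VAULT_TOKEN; the path and field names are made up, and this says nothing about how Monzo actually integrates Vault.

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "os"
    )

    // Minimal Vault KV (v1) read over the HTTP API, instead of storing the value
    // as a Kubernetes Secret in etcd. Path and field are invented for illustration.
    func readSecret(addr, token, path string) (map[string]interface{}, error) {
        req, err := http.NewRequest("GET", addr+"/v1/"+path, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("X-Vault-Token", token)

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return nil, fmt.Errorf("vault returned %s", resp.Status)
        }

        var body struct {
            Data map[string]interface{} `json:"data"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
            return nil, err
        }
        return body.Data, nil
    }

    func main() {
        secret, err := readSecret(os.Getenv("VAULT_ADDR"), os.Getenv("VAULT_TOKEN"), "secret/myapp/db")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("db password length:", len(fmt.Sprint(secret["password"])))
    }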


Wahoo! Good to know. Definitely don't trust the security by default :)


Here is the video from the presentation. Love to see real world use cases for K8S.

https://skillsmatter.com/skillscasts/9146-building-a-microse...


As an aside, as a Monzo beta tester I can't say enough good things. I'm currently travelling around Asia and the card works almost everywhere; I pay 0 fees and get the exact MasterCard rate. Twice now I've landed with 0 cash in the airport and been fine. I've used the card in over 10 countries outside the UK with <1% issues, and often this is better than my Barclays or Amex.

This is just a small benefit, but when you put the whole product/experience together - the app + in-app customer service + the card + etc - it just works so well and really comes into a category of its own. A lot of people say "yeah but my bank does X too", and while true, the way I look at it is that Monzo is like the iPod: other MP3 players had the same functionality, but this one just works, and works damn well.


Simply using it as a bog standard card in the UK it is head and shoulders above other high street banks.

Kudos for having the brass balls to actually execute a conversation that has probably happened a million times in every City pub.


It could be called "How we built X with Kubernetes".

Really the only thing that is specific to a bank (as I see it) is that they use a separate linkerd in order to do the secure stuff. Which is essentially what banks have been doing for ages.

I commented before on how Kube has just taken over and beaten the Mesos/Marathon stack. This talk is an example of that. You can see how many people have jumped on the Kube stack and are running successful deployments on it.


Disclosure: I mainly use DC/OS Mesos myself. I've evaluated k8s for our use case and didn't find it was quite what we were looking for. Our customers and stack are mainly JVM-based. We do on-prem deployments, not cloud, where GCE is already doing pretty well. We also mainly work with the Microsoft side of things (Azure, enterprise stuff).

Not convinced of this. Direct Mesos and YARN integration with Spark (not to mention a lot of the software already built on top) is going to keep Mesos/Marathon relevant for a long time. It's definitely better for big data workloads.

I think a lot of startups will jump to this for sure. Many startups don't actually have big data stacks and prefer to use Go-based stuff (mainly because it's simpler). In that case k8s makes sense.

A lot of these companies will likely prefer DC/OS and Mesos/Marathon because they have in-house ZooKeeper expertise already. ZK is a dependency for much of the big data ecosystem, as well as Kafka and Mesos.

The synergy is a lot better.

That being said: Many here will disagree. I definitely think k8s is winning developer mindshare overall, but I don't think it will have 100% of the market.


We're working on directly-integrated Spark-on-Kubernetes right now and would love to get input from folks who are interested. The Github issue where we're discussing it is here: https://github.com/kubernetes/kubernetes/issues/34377

Comcast has a prototype of YARN on Kubernetes here: https://github.com/Comcast/kube-yarn

(Disclosure: I work on the Kubernetes project at Google.)


The YARN stuff is interesting. I would be curious to see something akin to: https://spark.apache.org/docs/1.6.1/running-on-mesos.html http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn...

My main area of interest with this is: we run a lot of Java Native Interface code. I don't want folks to have to worry about configuring library paths and the like. My support load gets messy quickly the second "C code" comes into the picture.

The ability to hook in to k8s to spin up executors would be pretty neat. The Mesos containerizer has that beat right now.

One of the main reasons I heavily prefer Mesos is also DC/OS. The `dcos package install` experience is head and shoulders above just having a Docker runtime.

As for YARN: a lot of my customers use YARN and don't know what k8s is. Or if they do have a cluster that uses Docker, they call it a "docker cluster" and it's often separate. It's not as hard of a sell for me to tell the YARN folks: "install my Docker daemon as an RPM or Cloudera parcel so you can run executors"

vs: "install k8s in your Docker cluster with all this extra stuff in there"

K8s for me is in this limbo of: not integrated enough, or too complex, for embedding in commercial use.

You guys have a great story going, especially on the UX side of things. I know a lot of folks in the k8s ecosystem, but many of them aren't focused on big data or the JVM. Mesos, on the other hand, was Spark's parent project; the level of integration shows there. I'll be keeping an eye on it :).

My inherent problem with a lot of this is that it's still prototyping. A lot of this support is still very greenfield, and I'm stoked you guys are working on it, if anything because more competition is always good.


Regarding packages, you might be interested in https://github.com/kubernetes/charts

Disclosure: I work on the Kubernetes project at Google.


Thanks! I'll take a look around. There's a lot more than just "apps" at stake here, but again: I'll keep an eye on the progress. We also do a lot with JVM-based microservices and the Lightbend/Play stack, which already standardized on DC/OS with ConductR. DataStax Enterprise is also of interest to us. A lot of this stuff is already "baked".

There were a lot of little things that made us pick the stack we did - a lot of this is highly specific to our use cases and customer base. We will likely have a k8s version at some point. When that time comes (or if enough customers ask for it) I'll re evaluate what we need on the stack.

Thanks for the links!


You'll be happy to know that linkerd works really well on DC/OS as well. :) https://blog.buoyant.io/2016/10/10/linkerd-on-dcos-for-servi...


> Many startups don't actually have big data stacks and prefer to use go based stuff (mainly because it's simpler). In that case k8s makes sense for that.

Something to keep in mind is that golang's simplicity and performance are going to drive an increase in big data tools written in Go.


If those tools are around in 10 years I will pay attention.

Go has momentum. Hadoop has lasted longer than most startups currently using go.


What about running K8s on top of DC/OS?

I find the idea of using Mesos as the resource scheduler much more interesting, especially for multi-tenancy, where each tenant launches their own k8s cluster on shared infrastructure.


Yeah, that is a great point and something I want to look into if k8s is supported by our customers. That's why I said I would be interested in seeing it evolve.


But what about "Big Data" workloads? Running Spark or Cassandra clusters say? My understanding is that having a custom scheduler makes mesos more attractive for those tasks? Has your experience been different?


Personally I always ran big data on the Amazon stack, so I never really needed that ability.


On-prem, aka "non-AWS", it does tend to matter, yes.


Well, it's not a bank, yet. Their banking license restrictions could be lifted next year.

That said, if they are successful it will probably go a long way towards disrupting the customer experience of retail banking. The banks will be able to compete with this, but having something like this to model their improved experience on is good for them and good for customers in general.

I am, however, a bit cynical about this. I think if Monzo takes off they will hit a hurdle or event along the way that can sink the company. Most fintech companies are overly vulnerable to making the same mistakes as most banks made many decades ago. In the end it will probably mean they will get salvaged by being bought by a bank. Then the customer experience won't improve so fast any more, while the back end and processes are being made robust to allow it to remain in business long term. The culture in the company will also change to be somewhat more, uhm, traditional and boring. The result will still be that the customer experience has been pushed forward in general and that is a good thing.


I was trying to find an analogy, and I think this company is doing something like making the TV guide more accessible without considering that in a few years TV will be obsolete. This will probably be a good business for the next decade, but it is not doing anything really disruptive.


But will it run FORTRAN or COBOL? ;)


And does it support decimal math or will it have floating point errors instead? ;)


Anyone here using k8s in prod? If so, what's your use case? What parts of the stack run inside k8s? Even state? What kind of org do you work for? Mission critical?


What are you using for persistence of user data?


Do you completely rely on AWS? I.e. if amazon goes bust your company also dies? Or are you just using AWS as a provider of VMs, and could move to rackspace or linode.


Kubernetes has tutorials for various other cloud providers http://kubernetes.io/docs/getting-started-guides#turn-key-cl...


some of those documents are quite outdated


Kubernetes provides a great deal of insulation for this exact problem.

It runs just fine on all the major clouds, bare metal, etc.


Is there a video available elsewhere? Last I checked, it wasn't possible to download those from SkillsMatter.


    % youtube-dl 'https://skillsmatter.com/skillscasts/9146-building-a-microservices-with-kubernetes'    
    [generic] 9146-building-a-microservices-with-kubernetes: Requesting header                         
    WARNING: Falling back on generic information extractor.                                            
    [generic] 9146-building-a-microservices-with-kubernetes: Downloading webpage                       
    [generic] 9146-building-a-microservices-with-kubernetes: Extracting information                    
    [vimeo] 188042022: Downloading webpage                                                             
    [vimeo] 188042022: Extracting information                                                          
    [vimeo] 188042022: Downloading JSON metadata                                                       
    [vimeo] 188042022: Downloading m3u8 information                                                    
    [download] Destination: Meetups_Oct19_19-43-44-188042022.mp4                                       
    [download]   7.9% of 152.75MiB at 82.49KiB/s ETA 29:05


Cool! Although, seems like "youtube-dl" may not be the right name for that app anymore.


youtube-dl is still plenty active, its last commit was just a few hours ago: https://github.com/rg3/youtube-dl/


How come this project doesn't get a takedown request?


No idea, but I'm not going to complain. It's incredibly useful.



