Event-Driven Architecture

jeremycw · on Feb 11, 2020

There's something about event driven architectures that captures our engineering minds and makes our imaginations run wild. At face value it seems like the Grand Unifying Theory of software engineering, a Grand Unifying Architecture. Everything is perfectly decoupled and only reads and writes to an event bus. Perfect uniformity, everything just reads and writes events. Perfect open-closed semantics, I can completely add new functionality without touching any existing components just by having it read from the event bus.

However, like the grand unifying theories of physics so far, it just doesn't match reality. I have yet to witness a system that fully embraces event driven architecture that isn't a complete nightmare to read, understand, debug and modify. Yet we seem unable to shake the idea that it is a panacea. It captures the minds of each new generation of engineers and gets implemented in the technology of the era. For me it was applying the observable pattern to everything in Java. Now it's setting up a bunch of microservices that only communicate through Kafka. Maybe I'm just not a true Scotsman but this pattern has anecdotally never worked well in anything I've had to work with. I would think very hard before applying it as a pattern in my code let alone as the driving force of my architecture. That's just my experience and 2 cents.

dkarl · on Feb 11, 2020

In my experience it's a bit the opposite; the domains I've worked in have always have events, whether given that name or not, and the systems for handling these events are usually coded much more optimistically and naively at first before people break down and start migrating parts of the system to run in event-driven style for performance and reliability reasons. People who are aware of event-driven architecture tend to make a smoother transition and create systems that make sense and are easier to work with. People who get there accidentally, forced every inch by successive performance and reliability bugs, end up with a hodge-podge that in retrospect is a poorly designed event-driven system.

But that's just what I've seen in my experience. I've seen damage from people being ignorant of event-driven architecture or being in denial about how their systems are evolving; you've seen damage from people being overeager to use it. Probably in your shoes I would have seen the same things you have.

One thing I've given up on seeing is the content of the linked article.

safety-third · on Feb 12, 2020

In my experience, much more damage has been done by people who implement EDA and fancy cloud architecture in general. Something like 95% of applications are simple CRUD with a minuscule amount of custom domain logic. Doing anything beyond a sanely laid out monolith for these is resume padding and potentially project killing. I think people just don't want to admit that their job is validation, presentation, and shuffling data in and out of a database.

dkarl · on Feb 12, 2020

Yeah, but very quickly in my experience the simple REST-based CRUD service starts waking up every 15 minutes to post a batch of rows from the database to a different REST-based CRUD service (or to itself if you have a monolith), and when that gets slow it gets scaled by increasing the wake-up frequency, and then someone starts writing batching logic and trying to scale it horizontally. Then external services start getting mixed in (and, if you're doing B2B, customer services) and you have to deal with their performance and reliability issues. Or maybe not; maybe you're Reddit and can get huge while still being essentially a CRUD app with a couple of UI front ends.

dryan76 · on Feb 12, 2020

I think this is the key: EDA should be used to communicate between different applications. If it is used within an application it points to over-engineering.

m463 · on Feb 11, 2020

you can try reading https://www.fullhn.com/ while it's on the front page

m463 · on Feb 11, 2020

The thing is -- computers ARE event driven at the low levels.

Problem is most systems are designed to completely hide the event-driven nature of things.

The "nightmare to read, understand, debug and modify" is happily abstracted away and we have a nice safe environment to work with.

... and it then becomes much harder to support parallelism, error handling and responsiveness.

I think there may be a more nuanced solution.

kazinator · on Feb 12, 2020

Not really.

Interrupts are really a polling mechanism: the CPU checks an interrupt line at convenient times and alters its control flow.

Most signaling depends on timing: based on what signals have been seen so far, some other signals must appear within a certain time frame. The new signals are blindly sampled.

For instance, if some device asserts some bus line that it's writing some data, then the data to be written is expected to then appear in some time window. The data lines will be sampled regardless of whether the device actually puts out that data. There is no "device has done it" event, and even the original write indication signal is basically polled.

If we look for a pure cause-and-effect relationship, it's hard to find. A device whose operation is altered by a signal is as much the cause for what happens as is the device which originates that signal.

delusional · on Feb 11, 2020

I'm curious here. In my mind the computer is basically imperitive. It processes a series of instructions. It then happens to process multiple of such streams at once, but it's still essentially serial.

How is it event driven?

fallous · on Feb 12, 2020

It's event driven in the manner of stimulus/response. A computer may process a series of instructions, but absent a loop it then finishes and does nothing without being told to process another series of instructions. Even within a listening loop it is essentially looking for "something to happen" and if nothing happens it NOPs, or does some basic house cleaning regarding the loop which is essentially the same. It is only when a stimulus occurs that the computer responds with a not-loop series of instructions.

iainmerrick · on Feb 11, 2020

I suspect they’re confusing “event driven” with “interrupt driven”.

m463 · on Feb 11, 2020

Yes, interrupts and events are distinct.

Would it be clearer to say sequential vs asynchronous?

Or to say that interrupts are one form of event, but not all events involve interrupts.

dalore · on Feb 11, 2020

Most computers have had more than one core and even additional cpu's like a gpu. Also things on the bus usually have their own processors. So things are happening at the same time, it might tick at the same rate that the bus allows but they communicate by events like interrupts. As far as I understand it.

m463 · on Feb 11, 2020

i/o, exceptions, system calls.. all these (usually) involve stopping the sequential flow of your program and doing something else.

shados · on Feb 11, 2020

If you do it at a small scale, where a team of a handful engineers is responsible for following/maintaining/debugging countless distinct systems that are only linked through Kafka, its insane.

if you have hundreds or thousands of engineers in more teams than you can count, and each team is responsible for a few event producers or consumers and have the time and resources to dedicate to be experts about their own systems and their interfaces/boundaries, it's an amazing way to scale your organization. It really does work.

kingdomcome50 · on Feb 11, 2020

I understand what you are saying here. But... I am having trouble determining the implication.

It seems like you are saying that an EDA is only useful when no one has to understand the system in its entirety? That can't possibly be useful can it?

The implication is that an EDA is inappropriate if I have a system with 1000 services and 5 engineers, but amazing if I have 500 engineers? Isn't the purpose of a technical architecture to optimize for the former (assuming engineering is a cost-center)?

The former (small-scale) version you describe is really just an optimization for the latter (large-scale) version you describe a la "We have simplified our systems so much that we no longer need as many engineers to maintain it".

In a way, your comment serves to substantiate the point of its parent in that it asserts that choosing an EDA requires more engineers per service.

I must be misunderstanding your point.

bcrosby95 · on Feb 11, 2020

> Isn't the purpose of a technical architecture to optimize for the former (assuming engineering is a cost-center)?

The companies that developed these things are employing hundreds or thousands of engineers. So, yeah, they are optimizing for engineering costs when you have thousands of engineers. They are building systems that cannot be fully understood by a single person. If you don't have thousands of engineers, their solutions to their problem won't necessarily be your solution to your problem.

Complex beasts like microservices and event driven architecture are about enabling huge engineering departments to make changes to huge systems without constantly stepping on eachothers toes.

Ignoring headcount, they are actually very inefficient because systems designed like this necessarily has tons of duplicated effort across the organization. Those inefficiencies are made up by reducing the amount of communication required between thousands of people and hundreds of teams to change and maintain the system. But if your department isn't huge then the communication overhead is less than the duplicated effort, which tends to make these designs a poor choice.

These sorts of things usually do not scale down. If you have 1,000 services and 5 engineers there's a 99.999% chance you're doing it wrong. 5 engineers is not even enough employees to manage, monitor, and build the requisite tooling to make efficient use of a system distributed across 1,000 services, much less build and understand the system as a whole.

shados · on Feb 12, 2020

No, if you can have 1000 services maintained by 5 eng, you're doing fine.

The problem organizations hit is such: you have 2-3 huge services maintained by 5 engineers. The company grows, the services grow. 5 eng become 10 eng become 100. It's still the same 2 or 3 services but much, much bigger. You get to a point where people are getting in each other's way, and every additional person you had provided less value than the people before them. They have to work within the boundaries of the system and coordinate with other people.

It's all in the same programming language so you are limited by one pool of potential hires. Tech debt affects everyone. Deploys affect everyone. Downtime affect everyone.

So you split it to solve some of these. Then split it again. And again. And again. You keep splitting until someone can reasonably wrap their head around a single atomic piece, from implementation to deployment. The piece can be rewritten from scratch at any time without impacting any other piece, in any language, on any infrastructure, with any framework, to optimize for the people working on it.

That's great, but that piece still has to be part of the whole. These event buses are an elegant way to keep them together without adding complexity at the individual piece level.

Yeah, it means people might not the understand the whole system anymore, but do they need to? It's like people organization. The CEO doesn't know the intricacies of the HR system. They know it at a macro level but defer to the HR manager for the details. Same thing here, but with software.

bob1029 · on Feb 12, 2020

Can confirm that message buses aren't just buzzwords and are literally the only feasible way to combine business systems in the most complex of enterprises. I have seen a messaging domain with over 1000 different event types to deal with. Gigabytes of messages per hour. 24/7/365. Tens of thousands of different system interfaces all talking together using synchronous RPC semantics without experiencing any errors for days at a time. Systems spread across the planet. It's the perfect solution if you have the resources to maintain it.

snovv_crash · on Feb 11, 2020

Event driven architecture is like GOTOs but with data instead of execution flow. Throw your data to the wind, someone somewhere will handle it (I hope).

ivix · on Feb 11, 2020

In general I tend to agree. However really the issue is one of correctly defining the granularity of events that the system should be aware of.

After all, even a completely synchronous webapp is event driven, it just so happens to only care about a single event, a request.

The problems come about when code which can all be synchronous becomes unnecessarily complicated and spread across multiple microservices.

Microservices are generally a solution to people scaling issues, not technical ones. If you have significantly more services than teams, you are probably doing something wrong.

chadcmulligan · on Feb 11, 2020

My experience is similar - the problem becomes when an event is triggered - where did it come from, oh it was from this event which was from this event and so on.

Where events shine in my experience is in GUI's - the onClick event then does some code. So my rule of thumb is only one level of events and it comes from an external source - user, network etc.

Arelius · on Feb 12, 2020

Ohh my god this. My current project has this highly overengineered (with C++ and templates no less) event system that persists. Despite being one of the biggest complaints of our customers. We continue to think that the event system is "just one more" small tweak away from solving all of our problems.

fallous · on Feb 12, 2020

I think the level of abstraction applied to "event" is key in determining whether the architecture is successful (or even appropriate).

If you attempt to implement at a granular level (got click on button!) then it turns into a giant mess because you're essentially reinventing the logical flow and event loop of programming via a message bus, which is a pretty terrible idea. If instead you operate at the business logic level of abstraction it can result in a much more coherent and easily-extended system, especially since that level also sees the most change from executives or operations. At that level it's also easier to integrate with third-party services.

It is certainly not a cure for all that ails you, despite the claims of some snake-oil salesmen to the contrary.

erik_landerholm · on Feb 11, 2020

I couldn't agree more. It's like the pot of gold at the end of a rainbow.

kazinator · on Feb 12, 2020

> Everything is perfectly decoupled

And every damned stack trace in the debugger goes a couple of frames up into a "receive_event" function, and then the trail goes cold.

patsplat · on Feb 12, 2020

It works extremely well for ux development ;-)

iamspoilt · on Feb 11, 2020

It seems like the post on event driven architecture couldn't handle the click event load from Hacker News and is throwing a "Database Error".

nawgszy · on Feb 11, 2020

While I definitely respect seizing the snark opportunity.. I feel like I see loads of websites crash from HN / Reddit hug of death regardless of architecture, so I don't think I would necessarily consider this a failure of the architecture

tjoff · on Feb 11, 2020

Maybe says more about current architectures.

It is hard to imagine any reason as to why a normal site/blog couldn't cope with this.

Now since whatever solution you pick is going to be an overengineered mess it's easy to understand each in isolation, most of us don't have the time/energy to optimize for this. And that is fine!

But it is kinda sad that the quick and easy approach is not a statically generated site. There is absolutely no reason why that should be any harder.

Yet for some reason we don't value that.

hombre_fatal · on Feb 12, 2020

We don't value it when 99% of blogs never reach a level of traffic, beyond rare spikes, where they need to introduce a cache (like pre-generated files) in front of their dynamically-generated website. The blog operator likely doesn't even notice it, and when they do + are savvy enough, they just install one of the various Wordpress caching/static plugins.

And we do value it when downtime loses us business, something that's not likely to apply to our personal blogs.

Maybe Wordpress should enable some low-risk caching by default, but maybe it's not worth it since most installs never get traffic and caching is confusing especially to the non-tech-savvy.

tjoff · on Feb 12, 2020

My point is that we probably should value simplicity itself. And as a bonus things like this wouldn't happen.

claytongulick · on Feb 11, 2020

Given what I've seen in practice with companies that try to follow this architecture, I can't say I'm very surprised.

This type of architecture is really only suitable for a narrow problem domain, imho. Most people that I've seen attempt it don't understand all of the edge cases that arise from it.

For example, not getting immediate success or failure status from an API call, and having to constantly poll to see what the status of a submitted message is.

nickserv · on Feb 11, 2020

I think an event based system can apply to a wide variety of domains, but that the consequences need to be understood up front. This means doing things differently than the traditional top down controller approach with synchronous responses.

For example, an asynchronous API should have a way of pushing notifications to the caller, and the caller then has a way of processing these returns.

dantodor · on Feb 11, 2020

Yup, the pipelines are down and CQRS is suffering :D

ak39 · on Feb 11, 2020

That’s a saga for another time.

erik_seaberg · on Feb 11, 2020

HN traffic is coming from the wrong side of the RPC/streaming boundary. Everyone could get by with a lot less hardware if HTTP GETs were queued instead of bursts of realtime load.

NicoJuicy · on Feb 11, 2020

Cached db query ? :P Totally fine in Infrastructure

Nuzzerino · on Feb 11, 2020

This is tangential but I'm reminded of the ontology people who constantly bash RDF but have trouble providing good arguments or links to documentation as to why it's a bad thing. Ontological tech helps with organization of conceptual data so it would be an easy ask for them.

Moral of the story is there's a lot of information being preached by those who either haven't or won't put that knowledge into practice first. It would have been nice to hear about any projects the author could point to and explain how the architecture helped it succeed, but that is absent from this post.

aledalgrande · on Feb 11, 2020

Totally agree, a lot of people preaching just after reading books and blog posts and running a minimal hobby project. But I am not surprised.

Nothing is inherently bad, unless you use it for the wrong purpose.

socceroos · on Feb 14, 2020

The banking system

organsnyder · on Feb 11, 2020

I realize this is mostly snark, but is there any evidence that the website is hosted on a stack that is event-driven? Most likely it's just hosted on WordPress or something.

_sbrk · on Feb 11, 2020

Heh heh heh .. yep:

"Error establishing a database connection"

ralphael · on Feb 11, 2020

shame as I like the content and this could distract from it

ch_sm · on Feb 11, 2020

Here is a cached version of the article: https://webcache.googleusercontent.com/search?q=cache:https:...

AndrewKemendo · on Feb 11, 2020

I'm consistently surprised at the negative comments on EDA on Hacker News because there are so many examples of major organizations successfully implementing and running EDA at scale. Here are a few examples:

Uber:

- https://eng.uber.com/ureplicator/

- https://eng.uber.com/reliable-reprocessing/

Google:

- https://cloud.google.com/blog/products/gcp/implementing-an-e...

Twilio:

- https://signal.twilio.com/2017/sf/sessions/18530/building-ro...

Stripe:

- https://stripe.com/blog/canonical-log-lines

My hypothesis about why this is, is that most organizations probably don't need EDA yet. They don't have that many data producers and consumers and don't have HA and other requirements that drive the need, so implementing it is overkill, and so their experiences have been bad.

mst · on Feb 12, 2020

"Trying to implement google-scale things when you're tiny" seems to be an extremely common and understandably tempting antipattern.

karatestomp · on Feb 12, 2020

The amount of horseshit that's exploded all over any simple-ass system that could fit comfortably on one real mid-range server, these days, is truly astounding. I thought I was used to churn and such in this field, but the last five or so years are really starting to strain me. The amount of junk one increasingly must know to work on these over-engineered systems—and they never have enough personnel to keep everyone from having to know & constantly work with a dozen different tools and interfaces just to get anything done, on top of what you need for the work of actually writing and working with code—is getting to be more than I can handle.

I only hope if anything good comes out of the next bubble burst it's that some of this piles-of-cash burning counterproductive insanity gets reigned in.

_jordan · on Feb 12, 2020

I think it has to do with this idea that you aren't an advanced engineer if you don't use words like 'protocol buffers' and 'event-driven architecture' and 'distributed systems engineering' in your day-to-day. talking in clear and plain terms means you don't know a lot of stuff to justify your generally very high salary. (or at least, that's the flawed mindset)

nostrebored · on Feb 12, 2020

But "trying to decouple, modularize, and choose the right tools for the job" for your applications seems applicable at any scale.

EDA requires a change in testing methodology, in software design, and a bit of reading, but calling it a "google-scale thing" is pretty fallacious. The idea that a monolith is easier to maintain or that synchronous inter-component communication is easier to reason about also seems fallacious.

I'd love to see a chart of developer's perceptions of event driven microservices and their personal operational expectations.

mst · on Feb 12, 2020

My point was meant to be more general, in that people try to implement the "google-scale" version of any given conceptual model even when they don't need to (i.e. this is about "choosing the right tools")

For example, one can do a CQRS style database setup just fine with a single API server, a single worker process system, and a single postgresql db - and given all changes are already driven by command objects, building out other datastores later if you need to tends to work out well.

Though I would also point out that on average, communication within a process is easier to reason about than communication outside because there are fewer failure modes, and sync is easier to reason about than async because there are fewer failure modes.

Admittedly, a monolith does make overly tight coupling between components easier to not notice as you're doing it ... but then again it's depressingly easy to accidentally end up with what's essentially a distributed monolith even with a theoretically-microservice-based design (googling "distributed monolith" will provide a bunch of articles with disagreeing definitions of the term ;)

AndrewKemendo · on Feb 12, 2020

Also worth noting that it is a new and different paradigm than most people are used to.

ak39 · on Feb 11, 2020

This is just an overly elaborate concept of a centralized (or commonly accessible table) of states that other independent systems can use for their own “triggers”.

What seems like a lifetime ago, we achieved this with IBM MQSeries allowing us to send events from a Microsoft transaction server (MTS) object to a COBOL program that listened for the events to insert records into an order entry system that was on an AS/400 DB2 db.

Maybe an Event Bus concept carries all the expected “good design aspects” of this pattern. But it is just another example of an integration concept - but please can we stop with evangelizing this as though it ought to be the linchpin or necessary aspect of a good modern system you are starting with!

Edit: this design concept 25 years ago was the best we had considering the zero alternatives for guaranteeing that events don’t get lost etc. IOW: this design itself was a compromise for absence of a cohesive and unified database (what the Event Bus crew and the Microservices crowd would now call a monolith)!

idiocratic · on Feb 11, 2020

I think this is just a good overview of the architecture and its characteristics. I'm not sure anyone is trying to evangelize anything.

kazinator · on Feb 12, 2020

Event-driven architecture can be nasty. Ideally, you want the point of origin of an action, and the point of its execution, to be connected by a chain of function calls which can be traced in a debugger.

Events are not easily traceable. An event is pulled from some event queue, and by the time that happens, the thing that put the event into the queue has long since buzzed off to do something else.

That's just one problem. The other problem is how events can be subject to a routing, translating, duplicating and splitting labyrinth. Events can bi{tri,quad,...}furcate. Just because your code is processing the event over here doesn't mean you were the first to do so, or the last.

I have a good idea! Let's process this event and re-inject it. Two years later, someone is debugging an event routing loop.

aabbcc1241 · on Feb 12, 2020

For people cannot view the article, here's an archive: https://web.archive.org/web/20200210123446/https://pradeeplo...

teddyh · on Feb 11, 2020

Ubuntu’s “Upstart” init system used an event-based design, which was criticized by the creator of systemd, thus:

—

[Upstart]'s main feature is its event-based approach: starting and stopping of processes is bound to "events" happening in the system, where an "event" can be a lot of different things, such as: a network interfaces becomes available or some other software has been started.

Upstart does service serialization via these events: if the syslog-started event is triggered this is used as an indication to start D-Bus since it can now make use of Syslog. And then, when dbus-started is triggered, NetworkManager is started, since it may now use D-Bus, and so on.

One could say that this way the actual logical dependency tree that exists and is understood by the admin or developer is translated and encoded into event and action rules: every logical "a needs b" rule that the administrator/developer is aware of becomes a "start a when b is started" plus "stop a when b is stopped". In some way this certainly is a simplification: especially for the code in Upstart itself. However I would argue that this simplification is actually detrimental. First of all, the logical dependency system does not go away, the person who is writing Upstart files must now translate the dependencies manually into these event/action rules (actually, two rules for each dependency). So, instead of letting the computer figure out what to do based on the dependencies, the user has to manually translate the dependencies into simple event/action rules. Also, because the dependency information has never been encoded it is not available at runtime, effectively meaning that an administrator who tries to figure our why something happened, i.e. why a is started when b is started, has no chance of finding that out.

Furthermore, the event logic turns around all dependencies, from the feet onto their head. Instead of minimizing the amount of work (which is something that a good init system should focus on, as pointed out in the beginning of this blog story), it actually maximizes the amount of work to do during operations. Or in other words, instead of having a clear goal and only doing the things it really needs to do to reach the goal, it does one step, and then after finishing it, it does all steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus is in no way an indication that NetworkManager should be started too (but this is what Upstart would do). It's right the other way round: when the user asks for NetworkManager, that is definitely an indication that D-Bus should be started too (which is certainly what most users would expect, right?).

A good init system should start only what is needed, and that on-demand. Either lazily or parallelized and in advance. However it should not start more than necessary, particularly not everything installed that could use that service.

Finally, I fail to see the actual usefulness of the event logic. It appears to me that most events that are exposed in Upstart actually are not punctual in nature, but have duration: a service starts, is running, and stops. A device is plugged in, is available, and is plugged out again. A mount point is in the process of being mounted, is fully mounted, or is being unmounted. A power plug is plugged in, the system runs on AC, and the power plug is pulled. Only a minority of the events an init system or process supervisor should handle are actually punctual, most of them are tuples of start, condition, and stop. This information is again not available in Upstart, because it focuses in singular events, and ignores durable dependencies.

[…]

— http://0pointer.net/blog/projects/systemd

rubiquity · on Feb 12, 2020

Web servers are event-driven architecture. They listen to events on the HTTP bus.

ykr1 · on Feb 11, 2020

Says database error. Was that an event?

stephenwilcock · on Feb 11, 2020

When I follow the link to the article I get an "Error establishing a database connection" message.

Delicious irony.

pradeepl · on Feb 15, 2020

yes, this is ironical. I was away and thanks for the HN hug of death my blog went down :-). My blog provider throttled it down severely and I was able to restore it from backups just now. I blog occasionally to put down notes on what ever I am working on or thinking through, in the hope that maybe it would help some one or myself in the future. Apologies if I wasted your time.