> Yet, the event producers shouldn’t assume getting a response. This is why we c...

mrkeen · on March 2, 2023

> Events are facts; they can (and should) have multiple receivers. Yet, the event producers shouldn’t assume getting a response.

What this means is: There may be many consumers listening to payment events - a PaymentHistory service, a CustomerFraudDetection service, a Settlement service, a Bookkeeping service.

If I publish the fact that {Customer x paid $3.40}, that event has been persisted, and is considered to have happened. What's to be done if something goes wrong downstream? From my point of view, I can't do any more. It is up to the downstream services to restore themselves to a good state. If I 'retry' that message, that's exactly the same as if the customer had paid twice.

phkahler · on March 2, 2023

>> If I 'retry' that message, that's exactly the same as if the customer had paid twice.

Transactions (events) and messages are not the same thing. There should be no problem resending a GUID identified (or maybe a timestamped) transaction.

mrkeen · on March 2, 2023

> Transactions (events) and messages are not the same thing

Correct. I 'publish events', I do not 'send messages'.

> There should be no problem resending a GUID identified (or maybe a timestamped) transaction.

Let's say that I do want to roll my own resending (instead of publishing once). Do I record that a GUID is processed before or after I respond to it?

lucasyvas · on March 2, 2023

Respond, meaning by sending a receipt back to the original sender?

Before. Because then the sender can assume it never arrived if it never receives a response and can try again. If the receiver has already processed/recorded it, it goes through the motions again, but in an idempotent way that doesn't break anything by performing the action twice (or more). Eventually, the sender will get the response that it doesn't need to try again and it will reach resolution.

This is an extension of the two generals problem and can be totally resolved by using an idempotent design. However, you can never guarantee a resolution is reached on any given attempt, and that is by design.

My personal favorite approach is not using timestamps. The sender attaches a unique GUID, and the receiver records that GUID along with the processed event record. If the receiver sees it 100 times, it doesn't matter. It will do nothing the next 99 times, and tell the sender that it performed the task even without actually repeating it.

Eventually the sender will record that and stop asking.
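
A minimal sketch of that GUID approach, in Python with the standard-library `sqlite3` module. The table layout and the names `make_db`, `handle_payment` and `total_paid` are made up for illustration and do not come from any system in this thread. The GUID is the primary key of the processed-event row, so recording the GUID and recording the effect are a single insert:

```python
import sqlite3


def make_db():
    # Hypothetical schema: the event GUID is the primary key of the
    # processed-event record, so GUID and effect are stored together.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE payments ("
        " event_id TEXT PRIMARY KEY,"
        " customer TEXT NOT NULL,"
        " amount_cents INTEGER NOT NULL)"
    )
    return db


def handle_payment(db, event_id, customer, amount_cents):
    """Apply the event at most once and acknowledge every delivery."""
    try:
        with db:  # one transaction: commit on success, roll back on error
            db.execute(
                "INSERT INTO payments VALUES (?, ?, ?)",
                (event_id, customer, amount_cents),
            )
    except sqlite3.IntegrityError:
        # GUID already recorded: do nothing, but still tell the sender
        # the task was performed so it can stop retrying.
        pass
    return "ack"


def total_paid(db, customer):
    (total,) = db.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM payments WHERE customer = ?",
        (customer,),
    ).fetchone()
    return total
```

Delivering the same GUID a hundred times leaves one row, and every delivery is acknowledged. A second purchase carries a new GUID and is recorded as a second payment.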

mrkeen · on March 2, 2023

My followup question seems to have arrived out of order with respect to your suggestion.

>> Do I record that a GUID is processed before or after I respond to it?

> My personal favorite approach is ... the receiver records that GUID along with the processed event record.

Well, does the receiver record that it has handled that GUID before or after it attempts to handle it?
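
To make the question concrete, here is a toy simulation of both orderings with a crash between the two steps. Everything in it is hypothetical: `seen` stands in for the GUID store, `ledger` for the side effect, and `Crash` for a process dying midway.

```python
class Crash(Exception):
    """Simulates the receiver dying between its two steps."""


def record_then_handle(seen, ledger, event_id, amount, crash=False):
    if event_id in seen:
        return  # treated as a duplicate
    seen.add(event_id)      # step 1: mark the GUID as processed
    if crash:
        raise Crash()
    ledger.append(amount)   # step 2: perform the side effect


def handle_then_record(seen, ledger, event_id, amount, crash=False):
    if event_id in seen:
        return  # treated as a duplicate
    ledger.append(amount)   # step 1: perform the side effect
    if crash:
        raise Crash()
    seen.add(event_id)      # step 2: mark the GUID as processed


def deliver_with_one_crash(receiver):
    """First delivery crashes midway; the sender then retries once."""
    seen, ledger = set(), []
    try:
        receiver(seen, ledger, "guid-1", 340, crash=True)
    except Crash:
        pass
    receiver(seen, ledger, "guid-1", 340)  # the retry
    return ledger
```

Recording first loses the payment after a crash, because the retry is ignored as a duplicate. Handling first applies it twice. Neither ordering is safe unless the marker and the effect are committed atomically, or the effect is itself idempotent.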

aynyc · on March 2, 2023

One word: Idempotent.

mrkeen · on March 2, 2023

You may need a few more words than that. My preferred definition of idempotency is that your actions take you to a desired state - if it didn't work, try again, and hopefully you'll get to the desired state this time.

The only time I've encountered an idempotency implementation in the wild, the team used the other interpretation of it, that is: the system will always respond the same for the same input. So if you try something and get an exception, they went to great lengths to ensure you would get an exception the second time as well.

Which do you prefer?

aynyc · on March 2, 2023

Your definition is generally in line with mine. The small exception is that you can retry as many times as you like; once the desired state is reached, the result is the same no matter how many more times you try.

Using the example given, I (as producer) can retry the message however many times, and the response I get should be the same. The implementation usually involves some kind of ID so downstream consumers know that they have already processed that ID.

I do not believe that if there is an exception, future retries must produce an exception as well. But perhaps I misread your example.

jonfw · on March 2, 2023

The entire point of idempotency is not to always respond the same every single time, it's that the user can always expect the same response every single time. If the user receives a response it does not expect (or no response, etc.) then they don't want the same response on every retry, they still want the expected response.

In a perfect world, an idempotent app responds the same every time. But in a perfect world you don't really need idempotency.

Izkata · on March 2, 2023

Cool, now I can buy hundreds of things that cost $3.40 and only pay once.

JustSomeNobody · on March 2, 2023

> Cool, now I can buy hundreds of things that cost $3.40

Why not?

> and only pay once.

Sounds like your rules engine is borked.

Izkata · on March 2, 2023

> Sounds like your rules engine is borked.

Well, that is what they suggested - it needs more work to prevent that situation.

marcosdumay · on March 2, 2023

Oh, the word "event" strongly implies that you can filter, reroute, multiplex, store, branch, or do whatever with them. You can't do that if you require a response.

If you create an architecture where your requests need a response, you'd better not call those "events" and ignore all the event-related literature. If you want events to work, you'd better architect your software to work without responses.

foobiekr · on March 2, 2023

We can't have nice things because in the real world most EDA designs are pretty naive and it's painful to look at them.

If your events are pure telemetry - samples of some data, some event that is standalone like a click, etc. - then life is easy. A missed event is no big deal most of the time, and if it is, you need some message-transactional, lossless event delivery mechanism, assuming you can pay the cost in performance or $. Both are pretty easy concepts, and people can mostly go from zero to understanding them in a day.

And here is where the trouble starts.

A lot of events in the real world are actually events that represent statefulness in the form of what are basically a log of state machine transitions, even if (because people don't really know how to build such systems) the modeling is not explicit.

In an ideal world with some publisher A and some end receiver C with perhaps some number of pipeline processors B, as long as the path from A to B to C is such that no drops or re-ordering can occur, things should be fine. A sends to B+, the last stage of B sends to C. All good.

But in the real world, flakiness exists - bugs are _common_ and non-telemetry events are a problem:

* conveyance mechanisms are not actually reliable - they tend to have bugs, and this includes favorites like Kafka
* A, B and C can each have bugs - they might drop a message, they might _do the wrong thing_, they might persist some incorrect state, etc.

This is how you get into the situation I dealt with last month, where the SRE team had to occasionally fix rows in the database: due to a bug that lasted for months, some rows would be wrong, and sticky wrong, because the system didn't have any kind of inconsistency-discovery mechanism even though the state data was always available.

Almost every event-driven system I've seen in the real world lacks the ability to detect this even when the source of truth is available (A, for example). Even a simple re-transmission or some digest of the state to detect inconsistency between the bottom and the top layers is routinely absent. It's pretty bad.

By analogy, anyone who has done low-level work will be familiar with the very different classes of events that are usually called "edge triggered (notified once, lost after)" or "level triggered (visible for as long as the state persists)"; more broadly there are stateful and non-stateful. Level triggered is _always_ easier to deal with because you can implement periodic self-check to recover in the event that an ephemeral bug has caused the event to be lost or mis-processed. There is incredible irony in the fact that big complex EDA systems have lost the thread on what is, in fact, the easy case.
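
A rough sketch of what such a periodic self-check could look like, assuming the source of truth (A) and the downstream copy (C) can both be read as plain dictionaries with string keys. The names `digest` and `reconcile` and the sample data are illustrative only. A cheap digest comparison detects drift, and a key-by-key pass repairs it:

```python
import hashlib
import json

_MISSING = object()  # sentinel for "key not present on this side"


def digest(state):
    """Stable fingerprint of a JSON-serializable state dict."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def reconcile(source, replica):
    """Make replica match source; return the sorted list of repaired keys."""
    if digest(source) == digest(replica):
        return []  # cheap path: nothing drifted
    repaired = []
    for key in sorted(source.keys() | replica.keys()):
        want = source.get(key, _MISSING)
        if replica.get(key, _MISSING) == want:
            continue
        if want is _MISSING:
            del replica[key]      # replica has a row the source never had
        else:
            replica[key] = want   # missed or mis-processed event
        repaired.append(key)
    return repaired


# Illustrative data: one wrong row, one missing row, one stray row.
source = {"order-1": "paid", "order-2": "shipped", "order-3": "paid"}
replica = {"order-1": "paid", "order-2": "paid", "order-4": "paid"}
fixed = reconcile(source, replica)
```

Run on a timer, a check like this turns a lost or mis-processed event into a temporary inconsistency instead of a sticky one. In a real system the digest would likely be computed per partition or time window rather than over the whole state.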