
Event Sourcing: What it is and why you should take an interest in software data modelling.

Data modelling as a separate concern has gone the way of the dodo with the rise of rapid prototyping tools and frameworks that encourage the quick generation of boilerplate systems for interacting with generic SQL or NoSQL databases, largely glossing over the Cambrian explosion of choice among modern databases.

With the “Agile” and “XP” (Extreme Programming) methodologies, looking ~~too far~~ into the future is shot down as an anti-practice. Big data has enjoyed a meteoric rise as a core concern for many companies, but is often hastily tacked on to existing processes.

In today’s competitive markets it’s simply too risky to let the engineering team decide what metrics and analytics to measure, and how and when to track them.

Data modelling has never been more important.

What is data modelling?

Data modelling, at its simplest, is the process of deciding what data are available when an interaction with one of your systems takes place, and discussing what to do with them.

At the most familiar level this involves discussing which fields and attributes are significant when setting up a user sign-up system, or what data to track for “products” in an e-commerce system.

More alien to the typical business person may be questions about whether different sizes and colours of jeans in a webshop are variants under a “parent” product, or separate products with some kind of grouping, all

When taking off-the-shelf software such as WordPress, decisions about the structure of your data have been made for you, informed by WordPress’ years of experience in this domain. WordPress and many others provide a way to build custom models from certain primitives; this is how, for example, WordPress can be extended into a web shop, forum, community or online newspaper.

When commissioning bespoke software there is much more freedom, but with a couple of significant caveats: the engineers building the software must be well briefed on why certain data are being stored and what purpose they are to serve in the long run, and the engineers’ decisions will be heavily skewed by their choice of language, framework and libraries.

Many peripheral systems offer drop-in solutions for tracking certain sets of data. Owing to the ubiquity of these systems, tracking that data internally within an application, bespoke or otherwise, has fallen by the wayside. Google Analytics is a prime example with which nearly everyone is familiar.

Consciously or unconsciously, a decision has been taken that user data (which devices people use, which country they reside in, and more) are all 2nd class citizens in our data model, and of so little business value that we store them with a 3rd party.

Something similar happens with performance monitoring, where software developers determine that this is also 2nd tier data, and can be offloaded to a 3rd party system.

Can the requirements for agility, accuracy and consistency be balanced?

If you ask any business person whether they’d like to have more data about their customers, to be better informed, and to stop shooting in the dark about how the business at large is performing, they’d scream “yes”. Businesses with better data perform better: think of app-powered Über outperforming taxis run on radio, notebook and pencil, or machine-learning-driven traders outperforming Wall Street’s hot-shots.

Whether building bespoke software or opting for off-the-shelf software, the decision about what data to keep and what to discard rarely, if ever, comes up. What if we could make collecting everything, all the time, the default, and could do it without making projects take longer or cost more?

It’s 100% realistic

Before I explore how it might be possible, let me ask you a question:

If you were a football or baseball team manager, and you were expected to coach your team to victory based only on the final result of each game, do you think you’d have enough data to do your job?

These are the constraints we put upon product and project managers by only storing the “final result” of users’ interactions with our software. When we simply throw a user record into a SQL database, or a booking into a NoSQL database, and fire off an email, we’re storing barely a shadow of the interaction that led to that result.

To put it another way, we often make the mistake of storing the state, not the transformation.

The end state of a football or baseball game is the result of hundreds of passes, hundreds of pitches, and a lot of luck and strategy.

Similarly, the end state of a user signing up, or placing an order in your software, is the result of potentially dozens of clicks, inputs into fields, and activations and deactivations of tabs and windows.

Many ancient fields rely on keeping accurate records of changes to the status quo. Banks never edit a transaction, but keep a ledger of all transfers in and out (even mistakes are reversed with a nullifying transaction, not by erasing the mistake). Legal contracts are never changed, only ever added to, and the same goes for ledgers of land ownership.

We can adopt this practice in software very easily; let’s talk about how.

Storing transformations, not state

It’s tricky to get much deeper without entering into some pseudo code: some form of notation that allows us to express an intent to change the state of something. The pseudo code below is just an arbitrary annotation of the time, context and intent.

Let’s work with the typical example of someone placing an order in an e-commerce system. In a typical software system, a user will click links, add products to their “basket”, maybe remove them, and eventually proceed to a checkout, where they’ll provide billing details and hopefully have an order shipped to them. Some retailers do after-sales follow-up and ask for feedback.

Depending heavily upon the actual technical implementation, the system may only talk with the server backend once or twice in the whole process, with the client (the web browser or app) managing the state of the basket and gathering all the order details in isolation, and sending the finished “order” through to the backend systems when it’s done. This is analogous to the football manager being told only the final result.

In reality most systems talk with the backend more frequently, constantly overwriting the state of the order with the most recent state from the client (web browser or app) as the user changes things on their journey towards finalizing the order.

In those kinds of systems, this is how a final order might look:

UPDATE ORDER SET FIELDS
 id [ 101012345 ]
 status [ pending shipping ]
 products [ 1× milk , 2× eggs , 10× chives ]
 price [ 54.00 USD ]
 shipping address [ example street 123 , exampleville , 90210, CA ]
 billing address [ the office 456 , worksville , 90210, NY ]
 contact details [ max exampleman , max@exampleman.com , 776-2323 ]

This contains probably just about everything we need to ship an order, and you might imagine how it grows or shrinks in complexity depending on exactly what is being sold. But it doesn’t tell you anything about what made the user choose milk, eggs and chives, or how they came to be in these quantities. It doesn’t tell you whether there was anything else in their basket before they finally checked out. And two weeks from now, when the customer calls to ask why the order hasn’t been delivered to the office, it can’t tell you what their shipping address was set to when they checked out, or whether it has been changed in the meantime.

The alternative is to store the interactions, the transitions. That looks something like the listing below; in this annotation the first column is the time elapsed since the interaction began, which we’ll need when discussing the example:

0 | s = begin_new_anonymous_session("Firefox, Windows, 1900x1280, 8.8.8.8")
 +02s | request_page_login_form(s)
 +15s | submit_page_login_form(s, "max@exampleman.com", "•••••••••")
 +17s | search(s, "pizza")
 +18s | search(s, "salmon")
 +22s | search(s, "omlette")
 +24s | add_product_to_basket(s, "milk", qty: 1)
 +26s | add_product_to_basket(s, "chives", qty: 10)
 +26s | add_product_to_basket(s, "butter", qty: 1)
 +26s | add_product_to_basket(s, "eggs", qty: 1)
 +30s | visit_checkout_page(s)
 +34s | remove_product_from_basket(s, "butter", qty: 1)
 +35s | set_billing_address(s, [the office 456 , worksville , 90210, NY])
 +36s | create_order_from_basket(s)

In this model we can correlate the user’s session (s, always given as the first input to each datum) with what happened, and because each interaction is logged with a timestamp we can determine the order in which things happened.

We can determine, for example, that the searches for pizza and salmon didn’t inspire the customer to put anything in their basket, and we might go and look into our records to find out what results were shown and how long they took. In an ideal world we’d be able to reproduce all the user’s interactions and “replay” them, so that we can see what went wrong for this user and determine whether we need to make changes to business processes.

Because we have this list of things that happened, it’s super simple to write a function (technically a “left fold”) over the list of “facts” (the journal, log or ledger of the user’s interactions) and generate the equivalent order “summary” as we had above. In this model, though, we have a wealth more information about their experience.

Later in the process your business’s backend systems might add their own events:
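To make the left fold concrete, here is a minimal sketch in Python. It assumes the basket events from the log above have been loaded as (event name, product, quantity) tuples; that tuple layout and the names apply_event and summarise_basket are assumptions made for illustration, not part of any library.

```python
from functools import reduce

# Basket events from the log above, as (event_name, product, quantity) tuples.
# The tuple layout is an assumption made for this sketch.
EVENTS = [
    ("add_product_to_basket", "milk", 1),
    ("add_product_to_basket", "chives", 10),
    ("add_product_to_basket", "butter", 1),
    ("add_product_to_basket", "eggs", 1),
    ("remove_product_from_basket", "butter", 1),
]


def apply_event(basket, event):
    """Return a new basket with one event applied; the input is not mutated."""
    name, product, qty = event
    updated = dict(basket)
    if name == "add_product_to_basket":
        updated[product] = updated.get(product, 0) + qty
    elif name == "remove_product_from_basket":
        remaining = updated.get(product, 0) - qty
        if remaining > 0:
            updated[product] = remaining
        else:
            updated.pop(product, None)
    # Event types this fold does not know about are simply ignored.
    return updated


def summarise_basket(events):
    """Left fold: start from an empty basket and apply each event in order."""
    return reduce(apply_event, events, {})


basket = summarise_basket(EVENTS)
print(basket)
```

The state (the basket) is never stored here; it is derived on demand, and the same log could be folded by a different function to answer a different question.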

+2h55m | set_order_status(101012345, "starting packing")
 +3h | set_order_status(101012345, "packed ready to ship")
 +3h30m | set_order_status(101012345, "shipping")
 +3h31m | set_order_shipping_partner(101012345, "fedex")
 +3h31m | set_order_shipping_tracking_code(101012345, "1.800.463.3339")

Automated systems that contact the user might be “listening” to the firehose of events that your software generates, looking for events such as create_order_from_basket and set_order_status, and sending the customer status emails or paging staff that there’s work to be done in the warehouse.
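One way to picture such listeners is a small in-process dispatcher. The sketch below is only an illustration: EventBus, subscribe and publish are hypothetical names, the payload fields are assumptions, and a real system would more likely sit on top of a message queue or streaming platform.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-memory stand-in for the firehose of events."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._handlers[event_name].append(handler)

    def publish(self, event_name, **payload):
        # Every handler registered for this event name sees the event.
        for handler in self._handlers[event_name]:
            handler(payload)


outbox = []  # status emails we would send to the customer
pager = []   # notifications we would send to warehouse staff

bus = EventBus()
bus.subscribe(
    "create_order_from_basket",
    lambda e: pager.append(f"new order {e['order_id']} to pack"),
)
bus.subscribe(
    "set_order_status",
    lambda e: outbox.append(f"order {e['order_id']} is now: {e['status']}"),
)

bus.publish("create_order_from_basket", order_id=101012345)
bus.publish("set_order_status", order_id=101012345, status="shipping")
bus.publish("search", term="pizza")  # nobody is listening, so nothing happens
```

The code that publishes events knows nothing about emails or pagers; new listeners can be added without touching it.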

When the customer logs in, and wants to see an order summary, they can see the current state of their order, and how long it took to arrive in each state.

When the customer calls your customer service department and asks whether it’s too late to change their shipping address, it becomes a very easy question to answer: as long as there’s no set_order_status(101012345, "shipping") in the history, we can still offer the user that option.
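That check is small enough to sketch directly. The snippet assumes the history is available as a list of (event name, arguments) pairs; that representation and the function name can_change_shipping_address are assumptions for illustration.

```python
def can_change_shipping_address(history, order_id):
    """True as long as no set_order_status(order_id, "shipping") is in the history."""
    return not any(
        name == "set_order_status" and args == (order_id, "shipping")
        for name, args in history
    )


history = [
    ("set_order_status", (101012345, "starting packing")),
    ("set_order_status", (101012345, "packed ready to ship")),
]
before_shipping = can_change_shipping_address(history, 101012345)

history.append(("set_order_status", (101012345, "shipping")))
after_shipping = can_change_shipping_address(history, 101012345)

print(before_shipping, after_shipping)
```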

When you’re conducting employee reviews and customer satisfaction surveys, or looking to optimize business processes in general, you may ask “Why did it take nearly three hours before this order was selected for packing?” or “Is five minutes a fair time to pack 12 items?”.

If your business intelligence people are really switched-on, they might think to examine this ledger for cases of people reducing product counts in their baskets within a few minutes of checking out, ask themselves why, and replay those cases to see whether the shipping and handling prices, or lead times, varied wildly.
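Questions like these reduce to subtracting timestamps. As a rough sketch, the snippet below takes the times from the two listings above; treating create_order_from_basket at +36s as the moment the order was created, and labelling it "order created", are my assumptions.

```python
from datetime import timedelta

# (time since the session began, status), taken from the listings above.
# "order created" stands in for the create_order_from_basket event at +36s.
timeline = [
    (timedelta(seconds=36), "order created"),
    (timedelta(hours=2, minutes=55), "starting packing"),
    (timedelta(hours=3), "packed ready to ship"),
    (timedelta(hours=3, minutes=30), "shipping"),
]


def time_to_reach_each_status(timeline):
    """Pair each status with how long the order sat in the previous one."""
    return [
        (status, at - previous_at)
        for (previous_at, _), (at, status) in zip(timeline, timeline[1:])
    ]


waits = time_to_reach_each_status(timeline)
for status, waited in waits:
    print(f"{status}: reached after {waited}")
```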
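A query of that kind might look like the sketch below. It assumes events are available as (session, seconds since session start, event name) tuples and that “a few minutes” means a five-minute window; the sample data, that window and the function name sessions_with_late_removals are all hypothetical.

```python
from collections import defaultdict


def sessions_with_late_removals(events, window_seconds=300):
    """Sessions where a product was removed shortly before the order was created."""
    order_time = {}
    removals = defaultdict(list)
    for session, at, name in events:
        if name == "create_order_from_basket":
            order_time[session] = at
        elif name == "remove_product_from_basket":
            removals[session].append(at)
    return sorted(
        session
        for session, ordered_at in order_time.items()
        if any(0 <= ordered_at - at <= window_seconds for at in removals[session])
    )


# Hypothetical sample data: (session, seconds since session start, event name).
events = [
    ("s1", 24, "add_product_to_basket"),
    ("s1", 34, "remove_product_from_basket"),
    ("s1", 36, "create_order_from_basket"),
    ("s2", 10, "add_product_to_basket"),
    ("s2", 50, "create_order_from_basket"),
    ("s3", 10, "remove_product_from_basket"),
    ("s3", 1000, "create_order_from_basket"),
]

flagged = sessions_with_late_removals(events)
print(flagged)
```

Only the first session is flagged: the second never removed anything, and the third removed something long before ordering.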

Wait, sounds like a lot of work?

It’s a different mindset which, whilst somewhat uncomfortable at first, yields massive benefits. So far, though, this article has skirted over how impractical this can be once the volume of data becomes large.

Above a few tens of thousands of “events” in the log or ledger, rendering pages and serving API calls by starting at t0 and processing each transition event to derive the current state and deliver it becomes simply too slow.

This is where some principles of good software come in: we can start to think about pre-generating things, so that when the user logs in and wants their order summary page, we’ve already generated it for them and don’t need to do it again. Because we know that there are only a few types of event that would ever make this page change, we can cache it for essentially forever, and have wholly predictable loading times (serving a static page) for any page in any application at any scale.
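A pre-generated view of this kind is often called a projection. The sketch below is a minimal in-memory version under stated assumptions: the class name OrderSummaryProjection and the field names are hypothetical, and a real system would persist the summaries somewhere rather than keep them in a dictionary.

```python
# Which events change the summary page, and which field each one sets.
# The field names are assumptions made for this sketch.
FIELD_FOR_EVENT = {
    "set_order_status": "status",
    "set_order_shipping_partner": "shipping_partner",
    "set_order_shipping_tracking_code": "tracking_code",
}


class OrderSummaryProjection:
    """Keeps a ready-made summary per order, updated as events arrive."""

    def __init__(self):
        self._summaries = {}

    def handle(self, name, order_id, value):
        field = FIELD_FOR_EVENT.get(name)
        if field is None:
            return  # this event never changes the summary page
        self._summaries.setdefault(order_id, {})[field] = value

    def summary(self, order_id):
        # Reading is a dictionary lookup; the log is not replayed.
        return dict(self._summaries.get(order_id, {}))


projection = OrderSummaryProjection()
for event in [
    ("set_order_status", 101012345, "starting packing"),
    ("set_order_status", 101012345, "packed ready to ship"),
    ("set_order_status", 101012345, "shipping"),
    ("set_order_shipping_partner", 101012345, "fedex"),
    ("set_order_shipping_tracking_code", 101012345, "1.800.463.3339"),
    ("search", None, "pizza"),  # irrelevant to this page, so ignored
]:
    projection.handle(*event)

print(projection.summary(101012345))
```

The cost of deriving state is paid once, when an event arrives, rather than on every page view.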

When pre-generating content to serve back to the user we also have incredible freedom to choose the best place and way to store it:

Some databases excel at near-instantly yielding the members common to two lists; this comes up a lot when checking membership of groups, or who is “friends” with another person.
Some databases excel at search; we might carbon-copy all make_product_available_for_sale type events to a projection that indexes them for search.
Some systems are designed for fraud detection; we might send some or all of our add_credit_card_to_account or settle_balance events to them, and use them to inject set_account_trust_level(+10, "known good credit record") or similar events into a user’s account stream.
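The last of these, a system that consumes events and injects new ones, can be sketched as a function from an account stream to the events it wants appended. The rule used here (three settled balances earn a one-off trust boost) is entirely hypothetical, as are the function name fraud_check and the (event name, arguments) representation.

```python
def fraud_check(account_stream, settled_needed=3):
    """Hypothetical rule: after enough settled balances, emit a one-off trust boost."""
    settled = sum(1 for name, _ in account_stream if name == "settle_balance")
    already_boosted = any(
        name == "set_account_trust_level" for name, _ in account_stream
    )
    if settled >= settled_needed and not already_boosted:
        return [("set_account_trust_level", (+10, "known good credit record"))]
    return []


# Hypothetical account stream; the empty tuples stand in for event arguments.
stream = [
    ("add_credit_card_to_account", ()),
    ("settle_balance", ()),
    ("settle_balance", ()),
    ("settle_balance", ()),
]

first_pass = fraud_check(stream)
stream.extend(first_pass)          # the injected event becomes part of the history
second_pass = fraud_check(stream)  # so the same boost is not granted twice

print(first_pass, second_pass)
```

Because the injected event lands in the same stream, the decision is itself recorded and can be audited or replayed like any other fact.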

All of this is possible in systems that are not built around a ledger of transactions, but there it’s nearly always a bolt-on, or an afterthought.

By starting with this system up-front, however, the initial investment isn’t that much higher, and the short, medium and long term benefits are evident throughout the lifetime of the product, tool, service, technology or company as a whole.

A user’s right to privacy

It would be irresponsible to talk about all of this without a footnote about privacy, and about awareness of the privacy implications of being active on the internet or using apps and smart devices.

If you continue not to make data modelling part of the discussion when building new products and software components, you’ll likely find yourself struggling to make sense of how and why something came to be, and struggling to fit your product to a market. At this point you’ll probably end up bolting on poorly thought-out strategies for collecting data in the future. Because you can’t really know what data is there to be collected in the first place, and because it’s not a core part of the operative data for your product, you’ll make a trade-off and collect some things, sometimes, maybe by introducing one or more third party tools and investing engineering effort to get them connected and synced. Then they’ll rot. Without constant maintenance these integrations will fall apart, and the inflated engineering effort to collect sub-optimal data will have become worthless again, demanding a re-investment.

The downside of having less data, less accurately logged, for a subset of interactions with your product or software, and managed by third parties, is a cocktail of bad business fundamentals. Somewhere in the world will be a competitor who decided to discuss data modelling up-front, and who has full control of their data, incredible insights into their relationships with their clients, and more palatable terms and conditions for their userbase.