All other things being equal, fresher data are better data. However, it is almost never the case that all other things are equal.
- Many data warehouses charge more for streaming updates than for batch updates, often under a separate pricing model (see BigQuery ingestion pricing as an example)
- Updates to some data may necessitate updates to other data, leading to a cascade of updates
- Fresh data may be unreliable in isolation (see “how to handle piecemeal data” for handling issues such as referential integrity)
- Initial trends may be misleading, such as in real-time web analytics
Every decision made at every company has some agent making it. Perhaps that agent is an artificial intelligence, perhaps a deterministic business process, and sometimes it is a human. Fundamentally, every decision is made at some level by a human, even one who has chosen to delegate it to a computer system. See Dr. Marshall’s book Data Conscience for the effects such delegation can have and how to think about them more carefully.
When deciding how fresh your data need to be, the key question to ask is this: “How quickly would new data change the actions of the decision-making agent?” This clarifies things immediately. As examples:
- A fraud-detection system should prevent fraudulent orders, but in order to do so it must act fast enough to prevent the order from completing
- Website personalisation should happen quickly enough to render the website
- Stock-level updates should happen quickly enough to prevent selling stock that can’t be fulfilled
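One lightweight way to make the key question concrete is to record a maximum acceptable latency per decision-making agent. The sketch below does this with a plain dictionary; the use-case names and thresholds are illustrative assumptions, not recommendations.

```python
from datetime import timedelta

# Maximum acceptable data age per decision-making agent.
# These thresholds are hypothetical examples, not prescriptions.
FRESHNESS_REQUIREMENTS = {
    "fraud_detection": timedelta(seconds=2),         # must act before the order completes
    "personalisation": timedelta(milliseconds=200),  # must render with the page
    "stock_levels": timedelta(minutes=1),            # must prevent overselling
    "weekly_sales_dashboard": timedelta(days=1),     # strategic: representativeness matters, not recency
}

def is_fresh_enough(use_case: str, data_age: timedelta) -> bool:
    """Return True if data of the given age can still drive this decision."""
    return data_age <= FRESHNESS_REQUIREMENTS[use_case]
```

Writing the requirement down per use case, rather than per pipeline, keeps the focus on how quickly new data would change the agent's actions.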
Strategic decisions – the kind that often rely on reports, dashboards, and other visualisations made by analysts using Business Intelligence software – are typically not made in seconds. Could new data arrive in the next five minutes that would change such decisions? In principle, yes – but the same is true of the five minutes after that, just as Zeno observed that we must always cross half the remaining distance before reaching a destination. The chance of such information arriving is small, and waiting “just in case” for perfect certainty will result in an infinite wait. We need not be flummoxed by Zeno’s paradoxes here, though they are some of the oldest recorded objections to real-time analytics.
Strategic decisions, then, can be made as effectively from old but representative data as from fresh but representative data. The important aspect of the data is that they are representative, not that they are arbitrarily new. Website conversion rates from two weeks ago may still be representative if nothing material has changed, and sales data from ten minutes ago may be unrepresentative if an item has gone out of stock. See our previous post “Keeping on-call calm” for applying this principle to operational issues.
Since all other things are rarely equal, decide the speed of data updates according to how those data will be used. Then, communicate the expected latencies in a place accessible to everyone who will rely on the data. Set alarms for critical conditions. Your decisions will be just as good, but your data/compute bills will be lower.
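Setting alarms for critical conditions can be as simple as comparing each dataset's last-updated timestamp against its documented expected latency. A minimal sketch, assuming you can read a last-updated timestamp per table; the table names and latency values are made-up examples.

```python
from datetime import datetime, timedelta, timezone

# Documented expected latencies, published where every data consumer can see them.
# The values here are illustrative assumptions.
EXPECTED_LATENCY = {
    "orders": timedelta(minutes=15),
    "stock_levels": timedelta(minutes=1),
}

def staleness_alerts(last_updated: dict, now: datetime) -> list:
    """Return the names of tables whose data are older than their documented latency."""
    return [
        table
        for table, updated in last_updated.items()
        if now - updated > EXPECTED_LATENCY[table]
    ]
```

In practice the alert list would feed a pager or a dashboard banner, so that consumers learn a dataset has missed its documented latency before they act on stale numbers.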