Assembling reliable records from semi-reliable systems
Good data quality is fundamental to effective data governance. This blog post describes one method of managing the effects of out-of-order arrival and latency, which any networked system can encounter.
Record-keeping in commerce is an old task. Some of the earliest written records were related to commerce, such as this stone payslip. These days, most companies store these records on paper, in spreadsheets, or in databases. A few companies have a cohesive data fabric that helps them both manage records and make effective use of them.
One of the technically trickier aspects of ensuring good data quality is that information may arrive from a set of outside sources, and those sources may supply information at different paces and different places wearing different faces. There are many ways to solve this problem. Here, we present one such method that provides good visibility into the current status of records and is robust against the order of arrival of information.
Let’s assume we have a very simple record with the following structure.
| Order ID | Product | Package tracking | Delivered on | Customer rating |
|---|---|---|---|---|
| 1 | Sling Mini MIRUM® Edition | | | |
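As a minimal sketch, the record above could be modelled with optional fields that stay empty until the corresponding information arrives. The class name, field names, and the `missing_fields` helper are assumptions made for illustration, not an actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class OrderRecord:
    """One row of the example table; fields stay None until their data arrives."""

    order_id: int
    product: str
    package_tracking: Optional[str] = None
    delivered_on: Optional[date] = None
    customer_rating: Optional[int] = None

    def missing_fields(self) -> List[str]:
        """Return the names of the fields that have not arrived yet."""
        return [name for name, value in vars(self).items() if value is None]


# The row shown in the table: only the order ID and product are known so far.
record = OrderRecord(order_id=1, product="Sling Mini MIRUM® Edition")
print(record.missing_fields())
# ['package_tracking', 'delivered_on', 'customer_rating']
```

Making absence explicit means a half-filled record can be inspected at any time, rather than only existing once every field is known.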
If everything goes exactly right, we will receive the package tracking information, then the delivery information, and then (we hope) the customer rating. Across millions of transactions, though, this process will probably break at least a few times.
Here’s one way it could break.
sequenceDiagram
    Bellroy->>Warehouse: Let's ship Order 1
    Warehouse->>Shipper: Here's Order 1 to ship
    Shipper->>Warehouse: Here's your tracking id
    Note right of Bellroy: Network error
    Customer->>Bellroy: I love it! 5 stars
    Bellroy->>Customer: We're glad you love it. Thanks for letting us know.
    Bellroy-->>Warehouse: We're missing tracking<br />for Order 1
    Warehouse->>Bellroy: Here it is
    Bellroy->>Shipper: What's the delivery status?
    Shipper->>Bellroy: We delivered last week
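The scenario in the diagram can be replayed in a few lines of Python to show what robustness against arrival order means. This is only a sketch: the tracking ID, dates, and function names are made up, and it assumes each message fills a different field of the record.

```python
from datetime import datetime
from itertools import permutations

# Hypothetical messages: (field name, value, time the message arrived).
EVENTS = [
    ("customer_rating", 5, datetime(2024, 1, 8, 9, 0)),
    ("package_tracking", "TRACK-0001", datetime(2024, 1, 9, 10, 0)),
    ("delivered_on", "2024-01-03", datetime(2024, 1, 9, 10, 5)),
]


def apply_event(record, event):
    """Return a copy of the record with one field filled in and its arrival time noted."""
    field, value, arrived_at = event
    updated = dict(record)
    updated[field] = value
    updated["arrived_at"] = {**record.get("arrived_at", {}), field: arrived_at}
    return updated


def assemble(events):
    """Build the record by applying the events in the order given."""
    record = {"order_id": 1, "product": "Sling Mini MIRUM® Edition"}
    for event in events:
        record = apply_event(record, event)
    return record


# Every possible arrival order produces the same final record.
results = [assemble(order) for order in permutations(EVENTS)]
print(all(result == results[0] for result in results))
# True
```

Because each message only fills in its own field and records when it arrived, the rating arriving before the tracking ID does no harm.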
Constructing this record requires information from multiple systems. That information may not arrive at all (e.g., due to a network error) or may arrive out of order (e.g., receiving feedback before the item is known to have been shipped). Any system that relies on 100% message arrival and 100% in-order arrival is likely to produce poor-quality data. Fortunately, we can fix this with a simple process.
- Document the ‘good’ states of a record.
- Document the acceptable latency for information arrival.
- Keep track of the timing of information arrival.
- Check that a record is in a ‘good’ state for a process before sending the record to that process.
- Re-request late information or otherwise mitigate the problem, and possibly alert someone. Alerts can be sent via email or most workplace collaboration tools.
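The steps above can be sketched in Python as follows. The process names, latency limits, and record layout are all hypothetical and chosen only to make the example runnable. Latency is assumed to be measured from the time the order was created.

```python
from datetime import datetime, timedelta

# Step 1: fields a record needs before each (hypothetical) downstream process.
GOOD_STATES = {
    "delivery_report": {"package_tracking", "delivered_on"},
    "review_digest": {"package_tracking", "delivered_on", "customer_rating"},
}

# Step 2: acceptable latency per field, from order creation (made-up values).
MAX_LATENCY = {
    "package_tracking": timedelta(days=2),
    "delivered_on": timedelta(days=14),
    "customer_rating": timedelta(days=60),
}


def record_arrival(record, field, value, now):
    """Step 3: store the value together with the time it arrived."""
    record["values"][field] = value
    record["arrived_at"][field] = now


def is_good_for(record, process):
    """Step 4: good means every field the process requires is present."""
    return GOOD_STATES[process].issubset(record["values"])


def overdue_fields(record, now):
    """Step 5 input: fields still missing after their acceptable latency."""
    return sorted(
        field
        for field, limit in MAX_LATENCY.items()
        if field not in record["values"] and now - record["created_at"] > limit
    )


created = datetime(2024, 1, 1, 9, 0)
record = {"order_id": 1, "created_at": created, "values": {}, "arrived_at": {}}

# The rating arrives first, as in the sequence diagram above.
record_arrival(record, "customer_rating", 5, created + timedelta(days=7))

now = created + timedelta(days=8)
print(is_good_for(record, "review_digest"))  # False
for field in overdue_fields(record, now):
    # Step 5: a real system would re-request the data and perhaps alert someone.
    print(f"Re-request {field} for order {record['order_id']}")
```

In this run only `package_tracking` is overdue: `delivered_on` is also missing, but it is still within its assumed 14-day window, so no action is needed yet.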
A typical toolchain for this will include a database, an ELT/ETL tool, an orchestration tool, and a messaging system. At present we use PostgreSQL as our transactional database, Apache NiFi for ELT and orchestration, Slack for many notifications, and Zabbix for performance monitoring. These communicate via webhooks, database protocols, and Apache Kafka. Except for Slack, all of these are open-source products.
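As an illustration of the webhook side, the sketch below builds a notification for a Slack incoming webhook, which accepts a JSON body with a `text` field. The function name, message wording, and placeholder URL are assumptions. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_slack_alert(webhook_url, order_id, field):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": f"Order {order_id}: '{field}' is overdue and has been re-requested."}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real one comes from the Slack app's incoming-webhook settings.
request = build_slack_alert("https://example.invalid/webhook", 1, "package_tracking")
print(request.get_method(), json.loads(request.data)["text"])

# Sending it would look like this:
# urllib.request.urlopen(request, timeout=10)
```

Separating message construction from sending keeps the alert easy to test without network access.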
We’ll write in future posts about our affinity for open-source software and the contributions we make to maintaining and expanding the availability of useful open-source products.