DATA ENGINEERING
A go-to guide for data engineers wading through the backfilling maze
Imagine starting a new data pipeline and ingesting data from a source you’ve never parsed before (e.g. pulling records from an API or an existing Hive table). Now you’re on a mission to make it look as if you had been collecting this data all along. That’s one example of what we call data backfilling in data engineering.
But it’s not just about starting a new data pipeline or table. You might have a table that’s been collecting data for a while, and suddenly you need to change the data (for example, because of a new metric definition) or add more data from a new data source. Or maybe there’s an awkward gap in your data, and you just want to patch it up. All these situations are examples of data backfilling. The common thread is going “back” in time and “filling” your table with historical data.
The following figure (Figure 1) shows a simple backfilling scenario. In this instance, a daily job retrieves data from two upstream sources (one for platform A and another for platform B). The dataset is partitioned first by ‘ds’ (the date) and then sub-partitioned by platform. Unfortunately, data for the period from 2023-10-03 to 2023-10-05 is missing due to certain issues. To address this gap, a backfilling operation was initiated (the backfilling job started on 2023-10-08).
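To make the scenario concrete, here is a minimal Python sketch of how a backfill might enumerate the partitions it has to rebuild. It is only an illustration under stated assumptions: the function names `missing_partitions` and `backfill_partition` are hypothetical, the `ds=.../platform=...` path layout is an assumed Hive-style convention, and the actual read from the upstream source is replaced by a placeholder.

```python
from datetime import date, timedelta


def missing_partitions(start, end, platforms):
    """Yield every (ds, platform) pair in the inclusive date range [start, end]."""
    day = start
    while day <= end:
        for platform in platforms:
            yield day.isoformat(), platform
        day += timedelta(days=1)


def backfill_partition(ds, platform):
    """Hypothetical placeholder for the real work.

    A real job would re-read the upstream source for this platform and day,
    then overwrite the matching partition. Here we only return the assumed
    Hive-style partition path that would be rewritten.
    """
    return f"ds={ds}/platform={platform}"


# The gap from the scenario: 2023-10-03 through 2023-10-05, platforms A and B.
targets = list(missing_partitions(date(2023, 10, 3), date(2023, 10, 5), ["A", "B"]))
written = [backfill_partition(ds, platform) for ds, platform in targets]

for path in written:
    print(path)
```

Three missing days across two platforms give six partitions to rebuild. Enumerating them explicitly, rather than rerunning the whole table, keeps the backfill limited to the gap and makes it easy to retry a single failed partition.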
A quick heads-up before proceeding further: in data engineering, we generally encounter two scenarios: “backfilling” a table or “restating” a table. These processes share some similarities but differ in subtle ways. Backfilling is about populating missing or incomplete data in a dataset. It is usually applied to update historical data or fill gaps. Conversely, restating a table involves making substantial…