
Overview of Bi-temporal Data Concepts – Part 1


Introduction

This post is the first in a series providing an overview of “bi-temporal data management” and contrasting it with the current standard approach in data warehousing to tracking historical changes in data (“temporal data management”), developed by Ralph Kimball.  Bi-temporal data management is a concept that has been under development since the 1980s, yet it is familiar to few data professionals even today.  Reasons for this include:

  • in many applications it just isn’t needed, or its complexity is not justified;
  • the Kimball methodology is predominant for temporal data management in the data warehouse context – and it does not handle or even consider bi-temporality other than via several half-baked patterns;
  • without built-in support in the major relational database engines, the SQL necessary to implement bi-temporality is devilishly complex; for all practical purposes it is not feasible from the perspectives of understandability, maintenance and performance, especially for non-trivial amounts of data.

To give this some context, let’s review what data warehouse designers already know – that standard practice for temporal – i.e. time history – data management generally takes the form of one of Kimball’s first four slowly-changing-dimension (SCD) patterns known as SCD0, SCD1, SCD2 and SCD3:

  • SCD0 = write once and never update
  • SCD1 = overwrite the old value with the changed value. When a dimension table has a mix of SCD1 and SCD2 fields, the SCD1 overwrites must be applied to all historical rows, not just the current one
  • SCD2 = create a new row containing the changed value, and “end” the prior row by updating its effective date (see the sketch below). Note that, at least in pure form, this pattern records changes only as time goes forward – changes are not recorded retroactively.
  • SCD3 = maintain N prior states of a particular attribute using N columns. Typically, N=2 and we maintain current/prior or current/original.

These patterns can be applied to whole dimension tables or (less commonly) applied in a mix-and-match fashion to individual columns.  Also, in a typical implementation, dimension tables of any of the above types use surrogate keys as their primary keys, and are joined to suitably-populated fact tables on these keys.
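
To make the mechanics concrete, here is a minimal sketch of an SCD2 change plus an SCD1 overwrite in a mixed dimension.  All table and column names are illustrative assumptions, not from any particular source:

    -- Assumed structure: DimCustomer(CustomerKey surrogate PK, CustomerId
    -- business key, City tracked as SCD2, Email tracked as SCD1,
    -- EffectiveFrom, EffectiveTo, IsCurrent)

    -- SCD2: "end" the current version row for the changed customer...
    UPDATE DimCustomer
    SET EffectiveTo = '2015-03-01', IsCurrent = 0
    WHERE CustomerId = 42 AND IsCurrent = 1;

    -- ...and insert a new version row carrying the changed value.
    INSERT INTO DimCustomer
        (CustomerId, City, Email, EffectiveFrom, EffectiveTo, IsCurrent)
    VALUES
        (42, 'Boston', 'pat@example.com', '2015-03-01', '9999-12-31', 1);

    -- SCD1 in a mixed dimension: the overwrite is applied to ALL version
    -- rows for the entity, not just the current one.
    UPDATE DimCustomer
    SET Email = 'pat.new@example.com'
    WHERE CustomerId = 42;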

All of this works fine as long as the nature of the temporal changes that must be recorded – i.e. the business rules being implemented – fit one of the SCD patterns.  In the majority of cases they do, accounting for why the Kimball patterns have become such industry standards.

However, there are important business use cases that no Kimball approach addresses naturally and cleanly.  These relate to distinguishing between error correction and natural change, across all time, not just going-forward time.

Bi-temporal data management fills this gap.  It enables recording, querying for, and even enforcing temporal referential integrity regarding:

  1. the currently most accurate state of data about the past, present and (potentially) future

  2. what we thought was the most accurate data about those periods at any point in the past. In other words, bi-temporal representation allows distinguishing between natural changes and corrections.  Both imply back-valued changes and are cases that no Kimball technique handles with generality (though some variations attempt it and may be adequate in particular cases).

Bi-temporal treatment means that data believed correct when first recorded and later recognized as erroneous is never overwritten or deleted.  This functionality comes at a high cost in complexity and query performance, certainly if not implemented in the database engine (and possibly even if it is), and thus, given the present state of the art, it should be applied only to entities where there is a strong business need for such capabilities.
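
As a hand-rolled sketch of what this looks like (the names and the closed/open interval convention here are assumptions for illustration), each row carries two independent periods – valid time, when the fact was true in the real world, and transaction time, when the database asserted it – and a query pins each period separately:

    CREATE TABLE PolicyBitemporal (
        PolicyId  INT          NOT NULL,
        Premium   DECIMAL(9,2) NOT NULL,
        ValidFrom DATE         NOT NULL,  -- valid time: true in the real world
        ValidTo   DATE         NOT NULL,  -- exclusive; '9999-12-31' = still true
        TxFrom    DATETIME2    NOT NULL,  -- transaction time: asserted by the DB
        TxTo      DATETIME2    NOT NULL   -- exclusive; a correction "closes" TxTo
                                          -- rather than deleting the row
    );

    -- "What did we believe, on 2014-06-15, the premium was on 2014-01-01?"
    SELECT PolicyId, Premium
    FROM PolicyBitemporal
    WHERE ValidFrom <= '2014-01-01' AND '2014-01-01' < ValidTo
      AND TxFrom    <= '2014-06-15' AND '2014-06-15' < TxTo;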

In the remainder of these posts, “temporal” and “bi-temporal” will be used synonymously, except where indicated otherwise.

Observations

That data changes over time is of course nothing remarkable.  Keeping track of (only) the current state of data, meaning it is applicable to all time, is common and easily accomplished in the relational model with standard SQL in any RDBMS.  Non-temporal relations of this nature are modeled by Kimball SCD1.

The need to record some kind of “history” of changes to data over time arises in virtually any application involving persistent data.  A relation in which some form of history for an entity is maintained is called temporal, and by definition in relational theory must contain more than one row per entity.  This multi-row representation is the source of all the complication that surrounds comprehensive history tracking, both in maintaining it and in querying it.  While relational purists like C.J. Date might dispute it, history management in an RDBMS seems to violate the relational principle of identity, because relational theory has no fundamental notion of time in relation to identity.  There is no native way to establish uniqueness at points in time, which in practical implementations leads to various “effective” timestamps and complex SQL assertions to approximate it.
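
A small illustration of that gap (again with assumed names): the obvious declarative constraint on an effective-dated table does not deliver point-in-time uniqueness, because it cannot see overlapping periods:

    -- Assumed table: CustomerVersion(CustomerId, City, ValidFrom, ValidTo)

    -- Legal DDL, but it only prevents two versions from *starting* on the
    -- same date; it says nothing about overlap.
    ALTER TABLE CustomerVersion
        ADD CONSTRAINT UQ_CustomerVersion UNIQUE (CustomerId, ValidFrom);

    -- Both rows are accepted, yet both claim to be the truth for 2015-02-01:
    INSERT INTO CustomerVersion VALUES (42, 'Albany', '2015-01-01', '2015-06-01');
    INSERT INTO CustomerVersion VALUES (42, 'Boston', '2015-02-01', '2015-09-01');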

The fact that temporal data is commonplace in no way means that representing it accurately in a database is easy.  Until fairly recently the state of the art in commercial RDBMS technology did not implement even relatively primitive temporal semantics such as INTERVAL that have been in the SQL standard since SQL-92.  SQL Server in particular supported no temporal semantics until SQL Server 2016, and even then not full bi-temporality.  DB2, Teradata, Oracle and PostgreSQL are more advanced in built-in temporal support, and have been for longer.
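
For reference, the SQL Server 2016 feature is the system-versioned temporal table, which tracks transaction (system) time only – one reason it falls short of full bi-temporality.  The syntax looks like this (table and column names assumed):

    CREATE TABLE dbo.Customer (
        CustomerId INT          NOT NULL PRIMARY KEY,
        City       NVARCHAR(50) NOT NULL,
        SysStart   DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
        SysEnd     DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (SysStart, SysEnd)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

    -- The engine maintains the history table automatically; point-in-time
    -- queries use FOR SYSTEM_TIME:
    SELECT * FROM dbo.Customer FOR SYSTEM_TIME AS OF '2017-01-01';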

When not supported by the DB engine, functional equivalents for maintaining, querying and enforcing temporal integrity must be carried out using vanilla SQL, with PK and FK constraints enforced via a combination of UNIQUE constraints, CHECK constraints, triggers and/or other methods.  Doing so correctly is difficult, complex and subtle.
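
As one sketch of what such hand-rolled enforcement looks like (SQL Server dialect; the trigger, table and column names are assumptions), a trigger can reject the overlapping validity periods that a UNIQUE constraint cannot express:

    CREATE TRIGGER trg_Policy_NoOverlap ON dbo.Policy
    AFTER INSERT, UPDATE
    AS
    BEGIN
        -- Reject the transaction if any two version rows of the same
        -- policy have intersecting [ValidFrom, ValidTo) periods.
        IF EXISTS (
            SELECT 1
            FROM dbo.Policy p
            JOIN inserted i
              ON  p.PolicyId = i.PolicyId
              AND p.RowId   <> i.RowId      -- assumed per-row identifier
              AND p.ValidFrom < i.ValidTo   -- classic interval-overlap test
              AND i.ValidFrom < p.ValidTo
        )
        BEGIN
            RAISERROR('Overlapping validity periods for the same policy.', 16, 1);
            ROLLBACK TRANSACTION;
        END
    END;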

Temporal data management refers to all the insights, techniques, and proposed methodologies for managing temporal data properly which are outgrowths of academic research over the last 30+ years (much of which was also concerned with proposed additions to the SQL standard).  “Properly” here means able to support all needed temporal scenarios, while enforcing referential and primary key integrity – at points in time.

The best known approach to doing so is called the “standard temporal model”, elaborated by Dr. Richard Snodgrass in the 1990s and documented in a now out-of-print 1999 book (see References at the end of this post).  This book forms the backbone of the ideas summarized here, and most likely of the temporal features that have been built into various RDBMSs.  The book is not quite a cookbook, because the techniques must be adapted to the capabilities and limitations of the target RDBMS platform – but much of the heavy lifting has been worked out.  Some business requirements really cannot be met unless temporal management is done properly, either by brute force or, where available, by the DB engine.

The word “history”, as typically used, is a convenient shorthand, but it would be far more general to use derivatives of “temporal”, since “history” generally implies only past states from the perspective of the present, while true temporal support is concerned with past, present and future states, from past and present perspectives.  For this reason “temporal” will be used instead of “history” in most of the following.

Temporality always implies a grain of time at which data changes are tracked.  This is completely application-specific, but for simplicity in this summary it will be assumed, with no loss of generality, that this grain is either daily or point-in-time (event-oriented).  There are also many subtleties and complications involving such things as time zones, UTC time and the maximum time resolution supported by a particular RDBMS.  Whether any of these are relevant is very application-specific.

References

  1. There exist many academic papers on the subject of temporal data, but they are generally far too theoretical to be actionable by mere practitioners. On the other hand, there are virtually no books on the subject.  Fortunately, Dr. Richard Snodgrass, the most famous academic in the field, published Developing Time-Oriented Database Applications in SQL in 1999.  It is not only approachable, but includes illustrative SQL to accomplish everything discussed, along with the principles needed to adapt it where necessary (any given RDBMS may not support some of the constructs he uses, so alternatives would need to be found).  The book is out of print, but fortunately the author has made it available online.
  2. In 2002, the famous relational theorist C.J. Date, et al., published Temporal Data & the Relational Model, which deals with relational representations of temporal data. While it, like all of Date’s technical writing, is extremely precise, it is also extremely abstract, theoretical and unapproachable for most practitioners.  It does not offer actual SQL solutions, and according to the following reference it really covers only valid-time modeling, not bi-temporal time.
  3. In 2010, Tom Johnston and Randall Weis published Managing Time in Relational Databases.  It again is not as actionable as Snodgrass, but is more so than Date.  In fact, the authors, recognizing the complexity of the SQL involved in updating, querying and maintaining integrity on temporal data, claim to have developed a middleware product designed to hide all the complexity and allow temporal operations to be expressed declaratively – generating the difficult SQL under the covers.  They even received a patent on this in 2012.  Once one becomes familiar with the issues of temporal data, one can greatly appreciate the usefulness of such a thing, even if the book is irritating in its frequent references to their nascent product.  While little actual SQL is shown, the book nonetheless contains many useful ideas and a good introduction to the history of the subject.  Their techniques are even more general than Snodgrass’s standard temporal model, in that they also support future-valued transaction time.  They refer to both their overall technique and their product as Asserted Versioning.  These authors also have an extensive article series on Information Management’s web site (registration required).
