Saving Data Historically with Temporal Tables: Part 1: Queries

Saving temporal tables in PostgreSQL
Comments 0

Share to social media

In this article we discuss how to store data temporally in PostgreSQL. Temporal database store data in a way that allows you to see not only what the value of any column is currently, but at any point in time that your system is created to handle. In this three part series, I will aim to define what kind of information can be obtained from temporal tables and how to express requests for this information as SQL queries.

In this part 1 we explain concepts of temporal databases in general and outline major types of queries you will need for general used of these tables. In part 2, we consider aggregations over temporal data in the part 2. We do not discuss how data can be modified in a temporal database, although some of the behavior of INSERT, UPDATE, or DELETE differs significantly from behavior of these commands in a non-temporal database. Part 3 in the series will look at modifying data in a temporal database.

To avoid confusion, we must emphasize the difference between temporary and temporal: the former means that the item (e.g. table) is kept for limited period of time (usually temporary tables are dropped at the end of a session), while the latter means that an item (table0) keeps versions of the data that were valid at different points in time in order to preserve history.

Temporal databases (databases supporting time travel) were studied since late 1970-es. Since 2011 temporal features are included into SQL standard. Many of these features are available in the database systems widely used in the industry. However, the temporal features are under-used due to insufficient knowledge of design and coding patterns, or weakly implemented/ lack of such patterns).

Defining a temporal database

The temporal database is a database that can keep information on the point in time when facts represented in the database were, are, or will be valid. There are several different temporal database models that relay on different ways to store time information. In this article a temporal database contains temporal tables and each row in a temporal table has an associated time period. In contrast with intervals (such as 2 years, 3 months, or just 14 milliseconds) a period is a range of time values (consisting of start time to end time). In our examples periods will be represented with a pair of timestamps.

Following the recommendation of the SQL standard, we adopt closed-open semantics for periods. In plain words this means that the start point is included in the period, but the end point is not. Consequently, if start point and end point coincide, then the period contains exactly one point in time.

The starting point for our examples is a table containing data about employees of a small completely imaginary start-up located in one of the European countries. The original (non-temporal) table (most likely, coming from entity-relationship model). This table is not used in our examples, so there is no need to create it, but the definition is shown below.

Each row of this table describes exactly one entity (that is, a person), and all values of attributes are valid at current time.

All SQL statements in this article can be executed on PostgreSQL. Note that starting from version 15, PostgreSQL discourages used of public schema. To create a separate schema for these examples, you can use the following psql commands or their equivalent if you are running another client application:

The database design of this imaginary company is to allow temporal support. So, in the following code sample, the above definition is converted into a temporal table that will be used on our examples.

A row of the temporal table stores the data related to an employee during certain time period. As the data may change, several rows might be needed for any employee. Therefore, the emp_no column cannot serve as a primary key anymore.

3 columns were added: a surrogate primary key (emp_row_id) and timestamps for the start and the end of a period. We’ll use suffixes _ts (start) and _te (end) for all columns and variables representing boundaries of a time period. PostgreSQL fans may immediately comment that we should use tstzrange type instead of two timestamps. We completely agree that this type is much better, but we vould like to give a chance to run our examples to readers who do not use PostgreSQL.

The recommended data type for timestamps in PostgreSQL is timestamp with time zone or timestamptz. This is not important for our examples as all values will be in the same time zone, but we observe this recommendation.

The column emp_no is still important as its values identify employees regardless of any changes in other columns. We call the primary keys of non-temporal versions of a table business keys in the temporal context. Please note that the term business key is not used in the SQL standard. The content of a temporal table must satisfy the temporal integrity constraint: the periods in the rows with same business key cannot overlap, that is, only one row is allowed at any point in time for each business key. Such constraints can be efficiently supported in PostgreSQL using EXCLUSION constraints supported with spatial indexes.

We are now ready to populate this table:

The data in this table shows that Timo moved from the project p15 to p20 on June 15, 2023 and his salary was changed from 4800 to 5000 on July 01, 2023. Similarly, the salary of Esa was changed on April 16 and he left the company on Sep. 01, 2023.

As the company is based in Europe, the salaries are per month. Readers who are more comfortable with annual salaries may multiply these values by 12. However, then some of the queries below will also require adjustment. Some of periods have a constant infinity available in PostgreSQL. You can replace it with a value in very far future if your DBMS does not understand what infinity means.

We need one more table containing definition of reporting periods:

We populate this table with quarters of 2023:

We are now ready to run and discuss queries.

Querying Temporal Databases

According to the SQL standard, queries to temporal tables that do not contain any indication of required point in time or period should return the same result as queries to non-temporal tables, that is, the output should be based on the current state of data.

To support this behavior, some systems use two related tables to represent a temporal table: one for current state and another for other version of rows. The latter is sometimes called history table. So, if a query refers to a temporal table but does not contain FOR SYSTEM TIME clause, the result will be based on the current state of table data. If temporal data are really needed, the user has a choice of using this clause or querying history table directly.

In this article, as stated, we would like to use temporal features on a system that does not directly implement temporal tables and does not support FOR SYSTEM TIME (for example, PostgreSQL).

In order to provide the standard behavior, we can create a view that selects the state of the data in our temporal table:

Any query to this view will return exactly same result as a query would to a table of current values (which we did not create). All queries in the remaining part of this article will use only our temporal tables (that are like history tables in other models).

The most straightforward way to use a temporal table is to explore previous states of the data. To select data as they were at certain time, we need to use point-in-time query. Any valid query to a non-temporal version of the database schema can be converted into a point-in-time query by adding a condition on period attributes for each table accessed by the query.

For example, to find out the state of our emp table on June 12, 2023 we can run the following query:

Note that inequalities in the condition are different because of closed-open semantics of periods. The reader may also notice that the view defined above is actually a point-in-time query where the time is current time.

The output of this query is shown below:

The reader is encouraged to steer the time machine (that is, to run the query above substituting different points in time).

The ability to query the database state at any point in time is an extremely valuable feature of temporal databases. However, the full power of relational theory and SQL relies on the fact that the result of any query is a table and can be used as an input for other, more complex queries.

The output of a point-in-time query is not a temporal table. To obtain the full power of a temporal database we need temporal queries that yield temporal tables.

For example, the following query selects rows related to one person for a limited period of time (from July 01, 2003 to the end of this year).

The Boolean operator overlaps is defined in SQL standard and is implemented in PostgreSQL. It is important to notice that this operator supports closed-open semantics. If your favorite database system does not provide this Boolean operator, you need to replace it with a condition expressed using inequalities.

The output is a temporal table containing one row because there were no changes for this person during this period. The period for the output row(s) is an intersection of period of the row and period of the query. We’ll need to intersect periods also in other queries: if any two items having validity periods are combined, the validity period for the combination is usually an intersection. In other words, a value combined from several sources must be valid only when all sources are valid.

A number and a line

Description automatically generated with medium confidence

This query can be viewed as an analog of relational selection.

Another type of temporal queries over a single temporal table is change history. It can be viewed as relational projection. In the example we also include selection condition to reduce the size of output and we do not specify the query period, thus requesting the the full history.

The output of this query contains 3 rows, but the first two contain the same value of salary because the changed attribute is project which is not included in the select list.

Rows containing the same values of non-temporal attributes and adjacent periods can be merged, like duplicate removal in SQL.

Finally, several temporal tables can be combined using temporal join operation. The difference between temporal join and non-temporal one is that matching rows are included in the output only if they are valid at the sema time. More precisely, the period for joined row is an intersection of periods of input rows, and rows are included into result only if this intersection is not empty.

An example of temporal join is shown below:

Like the temporal selection above, this query returns intersections of periods and checks that incoming periods overlap.

Actually, this query returns temporal cartesian product (also called a cross-join) because there is no condition on attribute values except period overlap.

Semantics of Time and Time Dimensions

Until now we have not explained how the start and end of the periods are defined. There are dozens of ways to interpret time (usually called time dimensions) studied in the literature. The most widely known are system (or transaction) time and valid time dimensions. If system time is in use, the start of a period is the time when the value was stored in the database and the end of period is the time when the value was replaced with another value or logically deleted.

Advantages of system time are:

  • Easy migration of applications from non-temporal database as the start and end of periods are obtained from system clock.
  • Any value stored in the database is never changed or deleted as only the ends of periods need to be changed.

Many implementations use two tables to represent temporal tables with system time: one table for current (latest) state and another for complete history.

However, if we consider the database as a model of a part of the real world, then use of system time implies that changed in the real world should be recorded to the database (almost) exactly at the same time when the changes occur in reality.

Such an assumption is acceptable for some applications (for example, data are coming from sensors directly to the database) but may be over-restrictive for other applications. For example, changes of salary in our table should happen at midnight.

One of alternatives to system time is called valid (or effective) time, which stores time when changes occurred. This time is much more useful from the modelling perspective but the values for period boundaries must be supplied from the application.

Some of the models support two-dimensional time. Such databases are called bitemporal.

The SQL standard defines system time dimension and provides basic support for one more dimension.

The difference between system and valid time will be important in part 2 of this article where we assume that the table contains valid time.

Conclusions

In this article we discuss how an application can obtain benefits from a temporal database. We do not include any information on several important topics related to temporal databases, such as data structures, modification (INSERT, UPDATE, DELETE), integrity constraints, and performance. Instead, we discuss what can be extracted from temporal databases. In part 1 (this part) we outlined different types of queries useful in any temporal database context. In part 2 we’ll discuss aggregation and grouping, and in part 3 we will discuss how to modify data.

Finally, although our examples use PostgreSQL-specific features, we tried to reduce dependency on the specific DBMS by creating our own temporal features, which you will note later how this helps when you want to vary from the way a feature specific to an RDBMS is implemented.

References

During the first decades of research on temporal database a complete bibliography was maintained. More information on this bibliography is available in [1]. The book [2] highlights the major outcome of that research. A systematic presentation of theoretical viewpoint on temporal databases can be found in [3].

An article [4] provides an overview of temporal features in SQL Stanard 2011 (that weren’t significantly changed in subsequent editions of the Standard). It also contains rationale for decisions made in the Standard.

One of several practical approaches to implementation of temporal features is described in [5]. The authors introduce asserted time dimensions and describe advantages of bi-temporal data model based on assured and effective time dimensions.

An article [6] introduces an alignment operation that provides an extension of relational algebra supporting temporal operations for one-dimensional time.

Finally, an emotionally rich annotated bibliography is available at [7].

  1. Michael D. Soo. 1991. Bibliography on temporal databases. SIGMOD Rec. 20, 1 (March 1991), 14–23. https://doi.org/10.1145/122050.122054
  2. Abdullah Tansel, James Clifford, Shashi Gadia, Sushil Jajodia, Arie Segev, and Richard T. Snodgrass (editors). Temporal Databases: Theory, Design, and Implementation. 1993. 
  3. C. J. Date, Hugh Darwen, Nikos Lorentzos. Time and Relational Theory, Second Edition: Temporal Databases in the Relational Model and SQL. 2nd edition, 2014.
  4. Krishna Kulkarni and Jan-Eike Michels. “Temporal Features in SQL:2011”. SIGMOD Record, September 2012
  5. Tom Johnston and Randall Weis. Managing Time in Relational Databases: How to Design, Update and Query Temporal Data. 2010.
  6. Anton Dignös, Michael H. Böhlen, and Johann Gamper. 2012. Temporal alignment. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). Association for Computing Machinery, New York, NY, USA, 433–444. https://doi.org/10.1145/2213836.2213886
  7. Temporal Databases Annotated Bibliography. https://illuminatedcomputing.com/posts/2017/12/temporal-databases-bibliography/

 

Article tags

Load comments

About the author

Boris Novikov

See Profile

Boris is a database expert. His experience includes several decades of both industrial and academic activities such as design, development, and tuning database, teaching at undergraduate and graduate levels, leading post-graduate research groups and advising dozens of PhD students. His research interests are in the broad area of information management and include design, development, and tuning of databases, applications, and database management systems (DBMSs). He also has interests in distributed scalable systems for stream processing and analytics. Boris is an author of 4 books and over 150 research papers.