Seminal Papers in Data Science: A Relational Model for Large Shared Data Banks

Even with the rising popularity of NoSQL, most companies are still using some form of SQL-based relational database management system. While SQL (then called SEQUEL) was first introduced by IBM’s Donald D. Chamberlin and Raymond F. Boyce in 1974, their work built on the ideas of Edgar F. Codd, another IBM computer scientist, who proposed a relational model for database management in 1970. In this post, I discuss some of the main takeaways from Codd’s influential paper and how his ideas relate to our modern use of SQL.

Relations

Codd uses the term relation to describe what is essentially the cornerstone of his model. The relation is formally described as follows:

“Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on. We shall refer to Sj as the jth domain of R. As defined above, R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.”

-Codd (1970)
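To make the definition concrete, here is a minimal Python sketch. The domains and tuples (supplier ids, part ids, quantities) are made-up illustrations, not data from Codd's paper, and `is_relation` is a hypothetical helper name. The idea is that a relation on S1, S2, S3 is any set of 3-tuples drawn from the Cartesian product of those sets.

```python
from itertools import product

# Hypothetical domains, chosen only for illustration.
S1 = {"s1", "s2"}          # supplier ids
S2 = {"p1", "p2", "p3"}    # part ids
S3 = {5, 12, 17}           # quantities

# A candidate relation: a set of 3-tuples.
R = {("s1", "p1", 5), ("s1", "p2", 12), ("s2", "p3", 17)}


def is_relation(r, domains):
    """True if r is a subset of the Cartesian product of the domains,
    i.e. each tuple takes its j-th element from the j-th domain."""
    return set(r) <= set(product(*domains))


# The degree of the relation is the number of domains it is defined on.
degree = len((S1, S2, S3))

print(is_relation(R, (S1, S2, S3)))                  # elements in the right order
print(is_relation({("p1", "s1", 5)}, (S1, S2, S3)))  # first two elements swapped
print(degree)
```

The second check fails because "p1" is not a member of S1: the position of each element matters, which is why Codd speaks of the jth domain.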

This definition may look completely foreign, but if you know SQL, Codd is describing something you already work with: a table. Codd proposes that a relation can be represented as an array with the following properties:

“An array which represents an n-ary relation R has the following properties:

(1) Each row represents an n-tuple of R.

(2) The ordering of rows is immaterial.

(3) All rows are distinct.

(4) The ordering of columns is significant — it corresponds to the ordering S1, S2, …, Sn of the domains on which R is defined (see, however, remarks below on domain-ordered and domain-unordered relations).

(5) The significance of each column is partially conveyed by labeling it with the name of the corresponding domain.”

-Codd (1970)
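The five properties can be checked with a small Python sketch. The column names and rows below are invented for illustration, and modelling the array as a `frozenset` of tuples is one possible choice rather than anything prescribed by the paper.

```python
# Hypothetical column labels (property 5) and rows (property 1: one row per tuple).
columns = ("supplier", "part", "quantity")
rows_a = [("s1", "p1", 5), ("s2", "p3", 17)]
# Same rows in a different order, with one row repeated.
rows_b = [("s2", "p3", 17), ("s1", "p1", 5), ("s1", "p1", 5)]

# Storing the rows in a set makes row order immaterial (property 2)
# and collapses duplicates so that all rows are distinct (property 3).
relation_a = frozenset(rows_a)
relation_b = frozenset(rows_b)
same_relation = relation_a == relation_b

# Column order is significant (property 4): swapping the first two
# columns produces a different set of tuples.
swapped = frozenset((part, supplier, qty) for supplier, part, qty in rows_a)
order_matters = swapped != relation_a

# Labeling each column with a name conveys what it means (property 5).
labeled = [dict(zip(columns, row)) for row in sorted(relation_a)]

print(same_relation, order_matters)
print(labeled[0])
```

SQL tables are looser than this: a table may contain duplicate rows unless a key or `DISTINCT` rules them out, and columns are normally referenced by name rather than by position.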
