Big Data for Engineers - Chapter 2: Lessons Learned from the past

jzfrank

jzfrank

Posted on March 5, 2023

Big Data for Engineers - Chapter 2: Lessons Learned from the past

Data independence

The fundamental socle on which modern data management is based (Edgar Codd 1950).

Data independence means the independence of the logical view on the data and physical storage.

A database management system stack is four-layer:

  • A logical query language
  • A logical model
  • A physical compute layer that processes the query
  • A physical storage layer where the data is physically stored

In the past two decades, the physical layer has changed: clusters with thousands of machines rather than a single machine. However, the look and feel of querying data remain the same to the end user.

Formal prerequisites

Sets and relations

  • set, inclusion: are trivial concepts

  • Cartesian product: AxB

  • relation: a relation R on a family of sets is a subset of their cartesian product.

  • partial function: A partial function p between two sets A and B is a relation is a relation that does not associate any element of A to more than an element of B

  • function: a partial function that is "injective"

Sets commonly used:

natural integers, relative integers, decimals, rational numbers, real numbers, a set of all values

Rational database management systems (RDBMS)

main concepts

RDBMS is based on a tabular data format. Its first class citizens:

  • table: a collection of records
  • attribute: a property that records can have
  • row: a record in a collection. Also called: record, entity, document, item, business object
  • primary key: attribute(s) tha uniquely identify a record

formalism behind relational model

Tables can be defined formally (as purely mathematical objects)

formalism of table

Relational integrity

A collection T fulfills relational integrity of all records have identical support.

Relational integrity

Domain integrity

The values associated with each attribute should be restricted to a domain.

Domain integrity

Not Domain integrity

Atomic integrity

values are only atomic values: the collections does not contain any nested collections or sets or lists or anything has a structure of its own.

Atomic integrity

Not Atomic integrity

If these constraints are relaxed, we enter the world of NoSQL databases:

SQL vs NoSQL

Relational Algebra

Relational Algebra

Selection

Selection

Projection

Projection

Grouping

Grouping

Renaming

Changes the name of a column

Renaming

Extended projections

Compute values and assign results to a new column.

Extended projections

Cartesian products

Cartesian products

Joins
Join is a "filtered Cartesian proect" in which we only combine directly related tuples and omit all other non-matching pairs.

Joins

Joins

Normal Forms

Anomalies might occur in poorly designed databases. To avoid anomalies, best practice is to follow normal forms.

First normal form: atomic integrity

not atomic integrity

atomic integrity

Second normal form: each column in a record contains information on the entire record. A table should be only about one type of thing. Each non key field must be about the same thing as the primary key.

Not Second normal form

Second normal form

Third normal form: forbids functional dependency on anything else than the primary key

Not Third normal form

Third normal form

SQL language

SQL is a declarative language. SQL is also a functional language (to some extent).

Example

SELECT c.century AS cent,
       COUNT(c.name) AS num_captains,
       SUM(s.id) AS ships
FROM captains c, ships s
WHERE c.id = s.captain
GROUP BY century
HAVING COUNT(c.name) >= 3
Enter fullscreen mode Exit fullscreen mode

Internally:

SQL internal

Languages

bigger picture

Transactions

ACID:

  • Atomicity: either an update is applied completely, or not at all
  • Consistency: Properties of database stays the same
  • Isolation: the system "feels like" the user is the only one using the system, however in fact thousands are using concurrently.
  • Durability: all data written to database is durably stored and will not be lost (e.g. electricity shortage or a disk crash)

Scaling up and out

If we have a lot of rows: object storage, column storage, distributed file systems, massive parallel processing.

If we have a lot of columns: column storage.

If we have a lot of nesting: syntax, data models, document stores, querying.

So, if you do not fully understand normal forms, 3 integrities, and ACID. It is OK for this course, we will very soon break these conventions and enter the Big Data world.

💖 💪 🙅 🚩
jzfrank
jzfrank

Posted on March 5, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related