Big Data for Engineers - Chapter 2: Lessons Learned from the past
jzfrank
Posted on March 5, 2023
Data independence
The fundamental socle on which modern data management is based (Edgar Codd 1950).
Data independence means the independence of the logical view on the data and physical storage.
A database management system stack is four-layer:
- A logical query language
- A logical model
- A physical compute layer that processes the query
- A physical storage layer where the data is physically stored
In the past two decades, the physical layer has changed: clusters with thousands of machines rather than a single machine. However, the look and feel of querying data remain the same to the end user.
Formal prerequisites
Sets and relations
set, inclusion: are trivial concepts
Cartesian product: AxB
relation: a relation R on a family of sets is a subset of their cartesian product.
partial function: A partial function p between two sets A and B is a relation is a relation that does not associate any element of A to more than an element of B
function: a partial function that is "injective"
Sets commonly used:
natural integers, relative integers, decimals, rational numbers, real numbers, a set of all values
Rational database management systems (RDBMS)
main concepts
RDBMS is based on a tabular data format. Its first class citizens:
- table: a collection of records
- attribute: a property that records can have
- row: a record in a collection. Also called: record, entity, document, item, business object
- primary key: attribute(s) tha uniquely identify a record
formalism behind relational model
Tables can be defined formally (as purely mathematical objects)
Relational integrity
A collection T fulfills relational integrity of all records have identical support.
Domain integrity
The values associated with each attribute should be restricted to a domain.
Atomic integrity
values are only atomic values: the collections does not contain any nested collections or sets or lists or anything has a structure of its own.
If these constraints are relaxed, we enter the world of NoSQL databases:
Relational Algebra
Selection
Projection
Grouping
Renaming
Changes the name of a column
Extended projections
Compute values and assign results to a new column.
Cartesian products
Joins
Join is a "filtered Cartesian proect" in which we only combine directly related tuples and omit all other non-matching pairs.
Normal Forms
Anomalies might occur in poorly designed databases. To avoid anomalies, best practice is to follow normal forms.
First normal form: atomic integrity
Second normal form: each column in a record contains information on the entire record. A table should be only about one type of thing. Each non key field must be about the same thing as the primary key.
Third normal form: forbids functional dependency on anything else than the primary key
SQL language
SQL is a declarative language. SQL is also a functional language (to some extent).
Example
SELECT c.century AS cent,
COUNT(c.name) AS num_captains,
SUM(s.id) AS ships
FROM captains c, ships s
WHERE c.id = s.captain
GROUP BY century
HAVING COUNT(c.name) >= 3
Internally:
Languages
Transactions
ACID:
- Atomicity: either an update is applied completely, or not at all
- Consistency: Properties of database stays the same
- Isolation: the system "feels like" the user is the only one using the system, however in fact thousands are using concurrently.
- Durability: all data written to database is durably stored and will not be lost (e.g. electricity shortage or a disk crash)
Scaling up and out
If we have a lot of rows: object storage, column storage, distributed file systems, massive parallel processing.
If we have a lot of columns: column storage.
If we have a lot of nesting: syntax, data models, document stores, querying.
So, if you do not fully understand normal forms, 3 integrities, and ACID. It is OK for this course, we will very soon break these conventions and enter the Big Data world.
Posted on March 5, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.