Essential SQL Commands for Data Science

Structured Query Language (SQL) is important for a data scientist because it is a powerful way to access, process, clean and analyze data stored in relational databases.

Understanding of essential SQL commands is therefore crucial to allow the data scientist perform efficiently in their role.

Categories of SQL commands

1. Data Definition Language (DDL)

These are set of SQL commands that can be used to define the database schema.

List of DDL commands:

CREATE - create the database or its objects (table, index, function, views, store procedure and triggers)
DROP - delete objects from the database.
ALTER - make changes to a table, view or entire database.
TRUNCATE - remove all records from a table without deleting the table structure.

2. Data Manipulation Language (DML)

A subset of SQL commands used to manipulate the data that is present in the database.

List of DDL commands:

INSERT - insert data into a table.
UPDATE - modify or change existing records in a table.
DELETE - delete records from a table.
SELECT - query data from a table(s)

SQL for Data Scientists

One is able to perform the following using SQL:

Basics
Joins
Aggregations
Subqueries & Temporary Tables
Data Cleaning
Window Functions

Lets delve deeper into these processes on SQL as well as the essential SQL commands that come in handy.

Basic SQL Commands

This comprises of CRUD (Create Read Update Delete) operations that is made possible by the use of the DDL and DML commands

SQL Joins

JOIN statements are used to combine two or more tables, based on a related column between them.

-JOIN - returns records that have matching values in both tables.

-LEFT JOIN - returns all records from the left table, and the matched records from the right table.

-RIGHT JOIN - returns all records from the right table, and the matched records from the left table

-FULL JOIN - returns all records when there is a match in either left or right table

SQL Aggregations

These are commands that are used to perform calculations on a set of rows in a table and return a single value as the result.

Here are the most common SQL aggregation functions:

SUM() - calculates the sum of values in a column.
AVG() -calculates the average value of a column.
MAX() - finds the maximum value in a column.
MIN() - finds the minimum value in a column.
COUNT() - counts the number of rows in a table or the number of rows that meet a certain condition.

These aggregation functions are very useful for data scientists as they enable them to perform calculations on large datasets quickly and efficiently. They can be used to generate summaries of data, identify trends, and make informed decisions based on the results.

SQL Subqueries & Temporary Tables

Data Cleaning

SQL string functions are used to clean data.

LEFT() - extracts characters from a string starting from the left.
RIGHT() - extracts characters from a string starting from the right.
SUBSTR() - extracts a substring from a string starting at any position.
CONCAT() - adds two or more strings together.
CAST() - converts a value of any type into a specific, different data type.
POSITION() - used to find the position of a substring within a string, starting from a specified position.
STRPOS() - finds the position of the first occurrence of a string inside another string.
COALESE() - returns the first non-null value in a list.

Subqueries & Temporary Tables

Window Functions

A window function allows one to compare one row to another without doing any joins. Window functions are effective while measuring trends over time or rank a specific column, and it retains the total number of records without collapsing or condensing any of the original datasets.

Aggregate Window Function
Aggregate functions such as SUM(), COUNT(), AVERAGE(), MAX(), MIN() applied over a particular window (set of rows) are called aggregate window functions.

Ranking Window Functions

RANK() - assigns ranks to all the rows within every partition. Rank is assigned such that rank 1 given to the first row and rows having same value are assigned same rank. For the next rank after two same rank values, one rank value will be skipped.
DENSE_RANK() - assigns rank to each row within partition.

The difference between RANK() and DENSE_RANK() is that in DENSE_RANK(), for the next rank after two same rank, consecutive integer is used, no rank is skipped.