Essential SQL Commands For Data Science
Yankho Chimpesa
Posted on March 12, 2023
Data is naturally at the heart of the job of a data scientist or data analyst. You can get your information from a variety of sources.
Because data is frequently stored in a SQL database, understanding SQL query commands is often required to perform this role successfully.
This article will introduce you to some of the more basic commands, as well as some of the more advanced operations that will be useful to you as a data analyst or data scientist.
The commands are classified based on multiple operations such as simple data retrieval, aggregations, joins and complex conditions.
The following are some of the essential SQL commands you should know as a data scientist:
SELECT
The SELECT command is used to retrieve data from a database. It specifies which columns to retrieve from a table; which rows are returned is controlled by clauses such as WHERE, covered below. Here is an example:
SELECT *
FROM
neighbourhoods
neighbourhood_id | neighbourhood |
---|---|
0 | Ashfield |
1 | Bankstown |
2 | Blacktown |
3 | Burwood |
4 | Botany Bay |
In this example, we are selecting all columns from a table called neighbourhoods.
The * wildcard selects all columns in a table.
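To make the difference between SELECT * and naming columns concrete, here is a minimal sketch that runs both against an in-memory SQLite database through Python's built-in sqlite3 module. The choice of SQLite and the three sample rows are assumptions made for illustration only; the ORDER BY is there just to make the row order predictable.

```python
import sqlite3

# In-memory database holding a small, illustrative copy of the neighbourhoods table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE neighbourhoods (neighbourhood_id INTEGER, neighbourhood TEXT)"
)
conn.executemany(
    "INSERT INTO neighbourhoods VALUES (?, ?)",
    [(0, "Ashfield"), (1, "Bankstown"), (2, "Blacktown")],
)

# SELECT * returns every column of every row.
all_columns = conn.execute(
    "SELECT * FROM neighbourhoods ORDER BY neighbourhood_id"
).fetchall()

# Naming a column returns only that column.
names_only = conn.execute(
    "SELECT neighbourhood FROM neighbourhoods ORDER BY neighbourhood_id"
).fetchall()

print(all_columns)
print(names_only)
```

Listing the columns you need, rather than using *, usually keeps results smaller and makes the query easier to read.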
FROM
The FROM command is used to specify the table or tables from which to retrieve data. In the example below, we are retrieving data from a table called names:
SELECT *
FROM
names
reg_id | name |
---|---|
0 | Astrid |
1 | Barin |
2 | Blaje |
3 | Brian |
4 | Cody |
If you need to retrieve data from multiple tables, you can use a JOIN statement. We will cover JOIN in more detail later in this article.
WHERE
The WHERE command is used to filter the data based on a specified condition. It is used to narrow down the results to only those rows that meet the specified condition.
Here is an example:
In this example, we answer the following question: how would you write a query that displays host_id and host, is restricted to a particular neighbourhood (say neighbourhood_id 35), and is sorted by host_id?
-- 1/ Fetch only host_id and host from the listings table
-- 2/ Filter the data to just neighbourhood_id = 35
-- 3/ Sort the output by host_id in descending order
SELECT host_id, host FROM listings
WHERE neighbourhood_id=35
ORDER BY host_id DESC
host_id | host |
---|---|
285488167 | Rick |
185783910 | Tiina |
109067745 | Annie |
41506490 | Andrew |
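The WHERE filter can be tried out end to end with a small sketch. It assumes Python's built-in sqlite3 module and a listings table reduced to three columns; the row in neighbourhood 27 is made up so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (host_id INTEGER, host TEXT, neighbourhood_id INTEGER)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        (41506490, "Andrew", 35),
        (285488167, "Rick", 35),
        (10873080, "Josh", 27),  # hypothetical row outside neighbourhood 35
        (109067745, "Annie", 35),
    ],
)

# The ? placeholder passes the neighbourhood id as a parameter
# instead of pasting it into the SQL string.
rows = conn.execute(
    "SELECT host_id, host FROM listings "
    "WHERE neighbourhood_id = ? "
    "ORDER BY host_id DESC",
    (35,),
).fetchall()

print(rows)
```

Only the three hosts in neighbourhood 35 come back, highest host_id first.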
GROUP BY
The GROUP BY command is used to group the data based on one or more columns. It is used to aggregate data based on the grouping columns.
GROUP BY is typically used together with aggregate functions, which compute one value per group:
COUNT: total number of rows
SUM: sum of all the values
MAX: maximum value
MIN: minimum value
AVG: average value
Here is an example:
We're now interested in tracking all neighbourhoods in which we are "over-represented". Let's first count all the occurrences of each neighbourhood in our listings table.
-- Instructions:
-- 1/ Fetch neighbourhood_id from the listings table
-- 2/ For the second column, get the number of listings in each neighbourhood
SELECT neighbourhood_id,
COUNT(neighbourhood_id)
FROM listings
GROUP BY neighbourhood_id
neighbourhood_id | COUNT(neighbourhood_id) |
---|---|
2 | 3 |
0 | 1 |
1 | 1 |
4 | 1 |
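As a rough, runnable illustration of grouping with more than one aggregate, the sketch below counts listings and averages a price per neighbourhood. It assumes Python's built-in sqlite3 module; the price column and all of its values are hypothetical and do not come from the dataset used in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (neighbourhood_id INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [(2, 100), (2, 150), (2, 200), (0, 80), (1, 120), (4, 90)],  # made-up prices
)

# One output row per neighbourhood_id, with two aggregates computed per group.
rows = conn.execute(
    "SELECT neighbourhood_id, COUNT(*) AS n_listings, AVG(price) AS avg_price "
    "FROM listings "
    "GROUP BY neighbourhood_id "
    "ORDER BY neighbourhood_id"
).fetchall()

for neighbourhood_id, n_listings, avg_price in rows:
    print(neighbourhood_id, n_listings, avg_price)
```

Every column in the SELECT list is either a grouping column or wrapped in an aggregate, which is the pattern most databases require.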
HAVING
The HAVING command is used to filter the data after it has been grouped. It is used to filter out groups that do not meet a specified condition. Here is an example:
SELECT listing_id, COUNT(host_id) AS count
FROM reviews
GROUP BY listing_id
HAVING COUNT(host_id) > 10;
In this example, we group the rows of reviews by listing_id and count the host_id values in each group. We then use the HAVING clause to filter the results so that only groups with a count greater than 10 are included.
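Here is a small sketch of the same idea that can be run locally. It assumes Python's built-in sqlite3 module and a reviews table with made-up rows; the threshold is lowered from 10 to 2 only so that a handful of rows is enough to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (listing_id INTEGER, host_id INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        (1, 10), (1, 11), (1, 12),            # listing 1: three rows
        (2, 10),                              # listing 2: one row
        (3, 10), (3, 11), (3, 12), (3, 13),   # listing 3: four rows
    ],
)

# HAVING is evaluated after grouping, so it can test the aggregate itself.
rows = conn.execute(
    "SELECT listing_id, COUNT(host_id) AS review_count "
    "FROM reviews "
    "GROUP BY listing_id "
    "HAVING COUNT(host_id) > 2 "
    "ORDER BY listing_id"
).fetchall()

print(rows)
```

Listing 2 is dropped because its group has only one row. A WHERE clause could not express this condition, since WHERE runs before the groups are formed.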
ORDER BY
The ORDER BY command is used to sort the data based on one or more columns. It is used to sort the data in ascending or descending order. Here is an example:
Find all the listings where neighbourhood_id is 27 and room_type is 'Private room', sorted by host_id from highest to lowest.
-- Instructions:
-- 1/ Fetch host_id and host from the listings table
-- 2/ Filter the data to just neighbourhood_id = 27 and room_type = 'Private room'
-- 3/ Sort the output by host_id in descending order
SELECT host_id, host
FROM listings
WHERE
neighbourhood_id=27 AND room_type='Private room'
ORDER BY host_id DESC
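Sorting on more than one column works the same way. The sketch below, which assumes Python's built-in sqlite3 module and entirely made-up rows, sorts by room_type in ascending order and then by host_id in descending order within each room type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (host_id INTEGER, host TEXT, room_type TEXT)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        (3, "Cody", "Private room"),
        (1, "Astrid", "Entire home"),
        (2, "Barin", "Private room"),
        (4, "Brian", "Entire home"),
    ],
)

# The first sort key is room_type (ascending); ties are broken by host_id (descending).
rows = conn.execute(
    "SELECT host_id, host, room_type FROM listings "
    "ORDER BY room_type ASC, host_id DESC"
).fetchall()

for row in rows:
    print(row)
```

ASC is the default direction, so it could be omitted; it is spelled out here to contrast with DESC.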
DISTINCT
In SQL, the DISTINCT keyword is used to select only unique values from a column or set of columns. Here is an example of how to use the DISTINCT keyword in SQL:
SELECT DISTINCT first_name
FROM Customers;
first_name |
---|
Edwin |
William |
Samuel |
Linda |
In this example, we are selecting only the distinct values of first_name column from the table. The resulting query will return a list of unique values of the first_name column.
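The sketch below shows DISTINCT on one column and on a pair of columns. It assumes Python's built-in sqlite3 module; the last_name column and all sample rows are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [
        ("Edwin", "Smith"),
        ("Edwin", "Jones"),
        ("Linda", "Smith"),
        ("Edwin", "Smith"),  # exact duplicate of the first row
    ],
)

# DISTINCT on a single column: each first name appears once.
unique_first_names = conn.execute(
    "SELECT DISTINCT first_name FROM Customers ORDER BY first_name"
).fetchall()

# DISTINCT on two columns: each (first_name, last_name) combination appears once.
unique_full_names = conn.execute(
    "SELECT DISTINCT first_name, last_name FROM Customers "
    "ORDER BY first_name, last_name"
).fetchall()

print(unique_first_names)
print(unique_full_names)
```

With several columns, DISTINCT compares whole rows, so two customers who share a first name but not a last name are both kept.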
AS
The AS command is used to make aliases or rename column names.
In the example below, we rename customer_id to ID and first_name to Name.
SELECT customer_id AS ID,
first_name AS Name
FROM Customers;
ID | Name |
---|---|
1 | Edwin |
2 | William |
3 | Samuel |
4 | Linda |
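One way to confirm that an alias really changes the column names of the result is to inspect the cursor metadata. This is a minimal sketch assuming Python's built-in sqlite3 module and two sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (customer_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Edwin"), (2, "William")],
)

cursor = conn.execute(
    "SELECT customer_id AS ID, first_name AS Name "
    "FROM Customers ORDER BY customer_id"
)

# cursor.description holds one entry per result column; the first item is its name.
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

The aliases only rename the columns in the query result; the table itself keeps its original column names.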
LIKE
The LIKE command is used for string filtering. You provide a pattern, and LIKE returns the rows whose values match it. In a pattern, % matches any sequence of characters and _ matches exactly one character.
Consider the following example:
-- Instructions:
-- 1/ Fetch all columns from the listings table
-- 2/ Filter the data to hosts whose names start with Jos
SELECT
*
FROM listings
WHERE host LIKE 'Jos%'
listing_id | listing | host_id |
---|---|---|
22296011 | Large private room on Camperdown park & Newtown | 10873080 |
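The two LIKE wildcards, % (any sequence of characters) and _ (exactly one character), can be compared side by side. The sketch assumes Python's built-in sqlite3 module and invented host names. Note that SQLite's LIKE ignores case for ASCII letters by default, while other databases may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (host TEXT)")
conn.executemany(
    "INSERT INTO listings VALUES (?)",
    [("Josh",), ("Joseph",), ("Jon",), ("Rose",), ("Jose",)],  # made-up hosts
)

# 'Jos%' matches any value that starts with "Jos".
starts_with_jos = conn.execute(
    "SELECT host FROM listings WHERE host LIKE 'Jos%' ORDER BY host"
).fetchall()

# 'Jo_' matches "Jo" followed by exactly one more character.
three_letters = conn.execute(
    "SELECT host FROM listings WHERE host LIKE 'Jo_' ORDER BY host"
).fetchall()

print(starts_with_jos)
print(three_letters)
```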
JOIN
In SQL, a JOIN statement is used to combine data from two or more tables based on a common column. Joining tables is a powerful way to retrieve data that is spread across multiple tables. There are several types of JOIN statements, including:
- INNER JOIN: An inner join returns only the rows that have matching values in both tables being joined. Here is an example:
SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;
In this example, we are selecting the order_id from the orders table and the customer_name from the customers table where the customer_id in both tables matches.
- LEFT JOIN: A left join returns all the rows from the left table (the table specified before the LEFT JOIN keyword) and the matching rows from the right table (the table specified after the LEFT JOIN keyword). If there are no matching rows in the right table, the result will contain NULL values for the right table columns. Here is an example:
SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;
In this example, we are selecting the customer_name from the customers table and the order_id from the orders table, matching rows on customer_id. Every customer appears in the result; if there are no matching orders for a customer, the result will contain NULL values for the order_id column.
- RIGHT JOIN: A right join is similar to a left join, but it returns all the rows from the right table and the matching rows from the left table. If there are no matching rows in the left table, the result will contain NULL values for the left table columns. Here is an example:
SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;
In this example, we are selecting the customer_name from the customers table and the order_id from the orders table, matching rows on customer_id. Every order appears in the result; if there are no matching customers for an order, the result will contain NULL values for the customer_name column.
- FULL OUTER JOIN: A full outer join returns all the rows from both tables and combines the matching rows from both tables. If there are no matching rows in one of the tables, the result will contain NULL values for the columns of the table that has no matching rows. Here is an example:
SELECT customers.customer_name, orders.order_id
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id;
In this example, we are selecting the customer_name from the customers table and the order_id from the orders table, matching rows on customer_id. Customers without orders and orders without customers both appear in the result, with NULL values in the columns that come from the other table. Note that not all database management systems support the FULL OUTER JOIN syntax.
These are the main types of JOIN statements in SQL. Understanding the different types of JOINs and when to use them is an important skill for data scientists who work with relational databases.
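To see how an INNER JOIN and a LEFT JOIN differ on the same data, here is a small sketch assuming Python's built-in sqlite3 module and made-up customers and orders. RIGHT JOIN and FULL OUTER JOIN are left out because SQLite only supports them in newer versions (3.39 and later), so they might not run on every installation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Edwin"), (2, "William"), (3, "Samuel")],  # Samuel has no orders
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(100, 1), (101, 1), (102, 2)],
)

# INNER JOIN: only customers that have at least one order appear.
inner_rows = conn.execute(
    "SELECT orders.order_id, customers.customer_name "
    "FROM orders "
    "INNER JOIN customers ON orders.customer_id = customers.customer_id "
    "ORDER BY orders.order_id"
).fetchall()

# LEFT JOIN: every customer appears; order_id is NULL (None in Python)
# when the customer has no matching order.
left_rows = conn.execute(
    "SELECT customers.customer_name, orders.order_id "
    "FROM customers "
    "LEFT JOIN orders ON customers.customer_id = orders.customer_id "
    "ORDER BY customers.customer_id, orders.order_id"
).fetchall()

print(inner_rows)
print(left_rows)
```

Samuel is missing from the inner join but shows up in the left join with None in place of an order_id.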
UNION
In SQL, the UNION operator is used to combine the results of two or more SELECT statements into a single result set. Here are some examples of how to use the UNION operator in SQL:
Simple UNION example:
SELECT host_id, host_name
FROM listings
UNION
SELECT reg_number, reg_name
FROM reviews;
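In this example, we are selecting columns from two different tables and combining the results using the UNION operator. Both SELECT statements must return the same number of columns with compatible types. The result contains every distinct (id, name) row from either table, with duplicate rows removed, and its columns take their names from the first SELECT (host_id and host_name).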
UNION with ORDER BY:
SELECT host_id, host_name
FROM listings
UNION
SELECT reg_number, reg_name
FROM reviews
ORDER BY host_id ASC;
In this example, we are using the UNION operator to combine the results of two SELECT statements, and we are also using the ORDER BY clause to sort the combined result in ascending order. The ORDER BY applies to the whole result, so in most databases it must refer to a column name from the first SELECT (here host_id) rather than reg_number. The query returns the distinct rows from both tables, sorted by host_id.
UNION with WHERE clause:
SELECT host_id, host_name
FROM listings
WHERE purchase > 10
UNION
SELECT reg_number, reg_name
FROM reviews
WHERE order_price < 5;
In this example, we use the UNION operator to combine the results of two SELECT statements, but we also use WHERE clauses to filter the results of each SELECT statement prior to combining them.
The result contains the distinct rows that satisfy the WHERE condition of their own SELECT statement.
The UNION operator is an extremely useful tool for combining the results of multiple SELECT statements into a single result set.
You can use the UNION operator to perform complex queries on your data and extract meaningful insights from it.
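The duplicate removal that UNION performs is easiest to see next to UNION ALL, a closely related operator that keeps duplicates. The following sketch assumes Python's built-in sqlite3 module and two tiny tables with made-up rows, one of which appears in both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (host_id INTEGER, host_name TEXT)")
conn.execute("CREATE TABLE reviews (reg_number INTEGER, reg_name TEXT)")
conn.executemany("INSERT INTO listings VALUES (?, ?)", [(1, "Rick"), (2, "Annie")])
conn.executemany("INSERT INTO reviews VALUES (?, ?)", [(2, "Annie"), (3, "Tiina")])

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT host_id, host_name FROM listings "
    "UNION "
    "SELECT reg_number, reg_name FROM reviews "
    "ORDER BY host_id"
).fetchall()

# UNION ALL keeps every row, including the repeated (2, 'Annie').
union_all_rows = conn.execute(
    "SELECT host_id, host_name FROM listings "
    "UNION ALL "
    "SELECT reg_number, reg_name FROM reviews "
    "ORDER BY host_id"
).fetchall()

print(union_rows)
print(union_all_rows)
```

The ORDER BY at the end sorts the combined result and uses host_id, the column name from the first SELECT.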
CASE
The CASE statement is a powerful tool that allows you to perform conditional logic within a SQL query. With it, you can evaluate an expression and return different values based on different conditions. Here is an example of how to use the CASE statement in MySQL:
SELECT item,
amount,
CASE
WHEN amount < 1000 THEN 'Low'
ELSE 'High'
END AS Priority
FROM Orders;
item | amount | Priority |
---|---|---|
Keyboard | 600 | Low |
Mouse | 200 | Low |
Monitor | 18000 | High |
Keyboard | 900 | Low |
Mousepad | 850 | Low |
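The same CASE expression can be run locally as a quick check. This sketch assumes Python's built-in sqlite3 module rather than MySQL, and uses three of the rows shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (item TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("Keyboard", 600), ("Mouse", 200), ("Monitor", 18000)],
)

# CASE is evaluated once per row; the first WHEN that is true decides the value,
# and ELSE covers every remaining row.
rows = conn.execute(
    "SELECT item, amount, "
    "CASE WHEN amount < 1000 THEN 'Low' ELSE 'High' END AS Priority "
    "FROM Orders "
    "ORDER BY amount"
).fetchall()

for row in rows:
    print(row)
```

More WHEN branches can be added before ELSE to split the amounts into more than two bands.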
Conclusion
In conclusion, SQL is a critical tool for any data scientist as it provides a powerful way to query, filter, and analyze data stored in relational databases. The ability to extract valuable insights from large datasets is a key component of data science, and SQL provides an efficient and effective way to accomplish this task.
In this article, we covered some essential SQL commands that every data scientist should know.
However, there are many other SQL commands and techniques that data scientists can use to enhance their data analysis skills. For instance, subqueries, combined with the joins and aggregations covered above, can help data scientists to analyze data more effectively. Additionally, using SQL with other tools such as Python, R, and visualization software can provide even more advanced capabilities in data analysis.
Finally, it's worth noting that while SQL is a powerful tool, it's not the only tool that data scientists should rely on. Other tools and techniques, such as machine learning, deep learning, and natural language processing, can also provide valuable insights into data. The key to successful data analysis is to use the right tools and techniques for the task at hand and to constantly learn and adapt as new technologies and methods emerge.
Mastering SQL commands is an essential skill for data scientists looking to extract valuable insights from large datasets.
By understanding how to retrieve, filter, group, join, and sort data, data scientists can effectively manipulate data and extract insights that will help them make informed business decisions. However, it's important to remember that SQL is just one tool in the data scientist's toolkit, and that the most successful data analysis requires a diverse set of skills and techniques.