# download Northwind SQLite DB
!wget https://tdmdal.github.io/mma-sql-2023/data/northwind.sqlite3
# install ipython-sql 0.5.0
# this Python package allows us to run SQL in Jupyter/IPython notebook via %sql and %%sql magics
# the ipython-sql already installed on Google Colab isn't up-to-date to work with the SQLAlchemy installed
# https://github.com/catherinedevlin/ipython-sql
!pip install ipython-sql==0.5.0 --quiet
# load the SQL magic extension
%load_ext sql
# now we can use the magic extension to connect to our SQLite DB
# use %sql to write an inline SQL command
# use %%sql to write SQL commands in a cell
%sql sqlite:///northwind.sqlite3
After connecting to the Northwind DB, let's first check what database engine we are using.
%%sql
SELECT sqlite_version();
We are using SQLite 3. It's a small and fast Relational Database engine. SQLite DB is self-contained: one database is just a file. Believe it or not, SQLite is the most used DB engine in the world. It's built into millions of mobile phones (e.g. Android or iOS phones) and it's used by many popular apps to store data.
A database may contain many tables. Let's see how many tables we have in the Northwind DB.
Every SQLite database has a special table called sqlite_master
. It contains a master listing of all database objects (tables, indexes, etc.) in the database and the SQL used to create each object. We can query this table to find out how many tables (excluding sqlite_master
) we have in our Northwind DB.
We will use SQL keywords SELECT...FROM...WHERE
. Don't worry about them for now. We will soon learn those keywords.
%%sql
SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%';
Note that special tables in SQLite start with sqlite_
. Those are reserved tables for SQLite engine system use. Above we filtered out those tables. In most cases, we shouldn't touch those special tables.
A table is made up of one or more columns/fields. Let's take a look a the Categories table (column names, types, etc.).
%sql PRAGMA table_info([Categories]);
We present the relationships between the tables using an Entity Relationship (ER) diagram.
In the above ER diagram, the tiny vertical key icon indicates a column is a primary key. A primary key is a column (or set of columns) whose values uniquely identify every row in a table. For example, OrderID
is the primary key in the Orders
table, and OrderID
and ProductID
(combined) is the primary key in the OrderDetails
table.
The relationship icon (a line with a horizontal key at one end and an infinite symbol at the other end) indicates a foreign key constraint and a one-to-many relationship. A foreign key is a column (or set of columns) in one table whose values uniquely identify a row of another table or the same table. A foreign key mostly refers to a primary key in another table. A foreign key constraint requires that the constrained column contain only values from the primary key column of the other table. For example CustomerID
in the Orders
table is a foreign key that refers to the CustomerID
primary key in the Customers
table, and it can only contain values that exist in the CustomerID
column of the Customers
table.
In addition, it happens that every foreign key constraint in the Northwind DB establishes a one-to-many relationship, i.e. a row from one table can have multiple matching rows in another table. For example, one row from the Customers
table can match multiple rows in the Orders
table (via CustomerID
). This makes sense as one customer can place more than one orders. (Another common relationship a foreign key constraint can establish is the one-to-one relationship.)
logo | meaning |
---|---|
primary key | |
one-to-many foreign key contraint |
Question: Why do we need foreign key constraints? (Discussion)
You can find out relationships between tables by querying the sqlite_master
table. See the code below.
Of course, querying the sqlite_master
table to find out relationships between tables is specific to SQLite. We would do it differently if we use other database management systems. For example, if we use MySQL system, we would query the key_column_usage
table in the information_schema
DB. See this stackoverflow question.
Alternatively, some SQL client tools can generate ER diagram for you. The above diagram is generated by SQL Server Management Studio (SSMS). The Northwind DB is a sample DB originally shipped with Microsoft SQL Server.
%%sql
-- find out relationships between tables using SQL command
SELECT sql
FROM sqlite_master
WHERE name = "Orders";
We will work on the problems with a * sign together. Other problems are left for homework.
We review
SELECT...FROM...
SELECT...FROM...ORDER BY...
SELECT...FROM...WHERE...
; IN
, NOT
, LIKE
and %
wildcardDATE()
, ||
)Display CategoryName
and Description
in the Categories
table.
%%sql
-- start your code here
# Just curious, what's in the picture column?
result = %sql SELECT CategoryName, Picture FROM Categories;
result.DataFrame().loc[0, 'Picture']
# still curious about the picture column?
# image is actually stored as a blob (Binary Large Object) in SQLite
from matplotlib.pyplot import imshow
from PIL import Image
from io import BytesIO
%matplotlib inline
row = 0
im = Image.open(BytesIO(result.DataFrame().loc[row, 'Picture'][78:])) # cut the first 78 bytes (OLE header)
print(im)
imshow(im)
%%sql
-- start your code here
The resulting table should display FirstName
, LastName
, HireDate
, Title
and Country
.
%%sql
-- take a look at the Employees table first
-- use LIMIT to limit the rows being displayed
SELECT FirstName, LastName, HireDate, Title, Country
FROM Employees
LIMIT 10;
%%sql
-- just curious, how many distinct coutries do we have?
SELECT DISTINCT Country
FROM Employees
%%sql
-- start your code here
Display supplierID
, ContactNames
and ContactTitle
for those whose ContactTitle
is not Marketing Manager.
Hint: NOT
operator.
%%sql
-- start your code here
Hint: LIKE
operator and %
wildcard.
%%sql
-- start your code here
Hint: IN
operator.
%%sql
-- start your code here
Display FirstName
, LastName
, Title
, and birth date, ordered by birth date. Show only the date portion of the BirthDate
column/field (YYYY-MM-DD).
%%sql
-- start your code here
Show the FirstName
and LastName
columns from the Employees
table, and then create a new column called FullName
, showing first name and last name joined together in one column in a format like "DAVOLIO, Nancy".
%%sql
-- start your code here
In the OrderDetails
table, we have the columns UnitPrice
and Quantity
. Create a new column, TotalPrice
, that multiplies these two (ignore the Discount
column). Display OrderID
, ProductID
, UnitPrice
, Quantity
and TotalPrice
. Order the result first by ProductID
(ascending) and then by TotalPrice
(descending).
Note that the question asks about order items (not orders). One order may contain multiple items/products. An order item is identified by the OrderID
and ProductID
columns.
%%sql
-- start your code here
A product need reordering if UnitsInStock + UnitsOnOrder <= ReorderLevel
and Discontinued = 0
. In your result, display ProductID
, ProductName
, UnitsInStock
, UnitsOnOrder
, ReorderLevel
, and Discontinued
.
%%sql
-- start your code here
COUNT()
, MIN()
, MAX()
, and AVG()
).SELECT...FROM...GROUP BY...HAVING...
.SELECT
clause ordering: SELECT...FROM...WHERE...GROUP BY...HAVING...ORDER BY...
.SELECT...FROM...WHERE...(SELECT...FROM...)
Note that the question asks about order items (not orders). One order may contain multiple items/products. An order item is identified by (OrderID + ProductID).
%%sql
-- start your code here
Order the result by the count of contact titles.
%%sql
-- start your code here
%%sql
-- start your code here
Ignore the Discount
. Contrast this question to Ex.9.
%%sql
-- start your code here
"The last 12 months" is with respect to the last OrderDate
in the Orders
table.
%%sql
-- start your code here
-- datetime(): https://www.sqlite.org/lang_datefunc.html
%%sql
-- start your code here
We can also use the (INNER) JOIN
keyword.
%%sql
-- start your code here
Display OrderID
, OrderDate
(date only), and CompanyName
of the Shipper, and sort by OrderID
. In addition, only display rows with OrderID < 10255
.
%%sql
-- start your code here
In your result, display CategoryName
and total number of product.
%%sql
-- start your code here
In your result, display OrderID
and total value of the order (ignore Discount
).
%%sql
-- start your code here
In your result, display EmployeeID
, FirstName
, LastName
, ProductID
and ProductName
.
%%sql
-- start your code here
%%sql
-- start your code here
Note that the Jupyter Notebook displays Null as None.
%%sql
-- start your code here
We define high-value customers as those who have made at least 1 order with a total value >= $10,000 (ignore discount). We only consider orders placed in year 2016.
%%sql
-- start your code here
This time, we define high-value customers as those who have made total order >= $15,000 (ignore discount). We only consider orders placed in year 2016.
%%sql
-- start your code here
If we don't have time to go through all the concepts below, I encourage you to learn them on your own.
CASE
keywordUNION
keywordMany SQL implementations support something called Common Table Expression (CTE).
A CTE is a named temporary result set that exists within the scope of a single statement and that can be referred to later within that statement, possibly multiple times.
CTE is like a subquery, but it's more flexible. In particular, it can be referred to many times in a query, as if we have a temporary table. CTE is created using the WITH
keyword. Below is a trivial example of CTE.
WITH
cte1 AS (SELECT a, b FROM table1),
cte2 AS (SELECT c, d FROM table2)
SELECT b, d FROM cte1
JOIN cte2 ON cte1.a = cte2.c;
You can also use a temporary table to store SELECT
result. Temporary table can be referred to multiple times in a single session. The syntax to create temporary table and store results from SELECT
is
CREATE TEMPORARY TABLE TableName AS SELECT...
%%sql
/* CTE (WITH keyword) */
-- start your code here
%%sql
/* Subquery */
-- start your code here
%%sql
/* Temporary Table */
-- start your code here
In your result, display EmployeeID
, LastName
, total number of orders, number of late orders, % of late orders (rounding to 2 decimal places).
Define late order as RequiredDate <= ShippedDate
.
Below are three simpler related questions. You may want to solve them first before you attempt Ex.26 itself.
Note: There are many ways to solve this exercise. One of them is to create temporary tables to help us store intermediate results.
%%sql
-- 1. Which orders were shipped late?
-- start your code here
%%sql
-- 2. Which employee has the most late orders?
-- start your code here
%%sql
-- 3. Compare each employee's number of orders and number of late orders
-- start your code here
%%sql
-- Solve Ex.26 (Find % of late orders for each employee)
-- start your code here
%%sql
-- start your code here
-- JOIN keyword
Group customers into four categories, very high (total order in 2016 > 10,000), high (5,000 < total order <= 10,000), medium (1,000 < total order <= 5,000), and low (0 < total order <= 1,000). When calculating total order value, you can ignore discount.
We will learn the CASE
keyword.
%%sql
-- step 1
DROP TABLE IF EXISTS TotalOrderAmountTpTable;
CREATE TEMPORARY TABLE TotalOrderAmountTpTable AS
SELECT Customers.CustomerID, CompanyName, SUM(UnitPrice * Quantity) AS TotalOrderAmount
FROM Customers
JOIN Orders ON Customers.CustomerID = Orders.CustomerID
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID
WHERE strftime('%Y', OrderDate) = "2016"
GROUP BY Customers.CustomerID, CompanyName
ORDER BY TotalOrderAmount DESC;
%%sql
-- step 2
SELECT CustomerID, CompanyName, totalOrderAmount,
CASE
WHEN TotalOrderAmount > 0 AND TotalOrderAmount <= 1000 THEN 'Low'
WHEN TotalOrderAmount > 1000 AND TotalOrderAmount <= 5000 THEN 'Medium'
WHEN TotalOrderAmount > 5000 AND TotalOrderAmount <= 10000 THEN 'High'
WHEN TotalOrderAmount > 10000 THEN 'Very High'
END AS CustomerCategory
FROM TotalOrderAmountTpTable
ORDER BY CustomerID;
Some Northwind employees are planning a business trip, and they would like to visit as many customers and suppliers as possible. They would like to see a list of all countries and their corresponding total number of customers and total number of suppliers.
We will use temporary tables.
%%sql
-- step 1: count total customers by country
DROP TABLE IF EXISTS TotalCustomersTpTable;
CREATE TEMPORARY TABLE TotalCustomersTpTable AS
SELECT Country, COUNT(CustomerID) AS TotalCustomers
FROM Customers
GROUP BY Country;
%%sql
-- step 2: count total suppliers by country
DROP TABLE IF EXISTS TotalSuppliersTpTable;
CREATE TEMPORARY TABLE TotalSuppliersTpTable AS
SELECT Country, COUNT(SupplierID) AS TotalSuppliers
FROM Suppliers
GROUP BY Country;
%%sql
-- step 3: join two intermediate tables (full outer join using union)
SELECT TotalCustomersTpTable.Country, TotalCustomers, TotalSuppliers
FROM TotalCustomersTpTable
LEFT JOIN TotalSuppliersTpTable
ON TotalCustomersTpTable.Country = TotalSuppliersTpTable.Country
UNION
SELECT TotalSuppliersTpTable.Country, TotalCustomers, TotalSuppliers
FROM TotalSuppliersTpTable
LEFT JOIN TotalCustomersTpTable
ON TotalCustomersTpTable.Country = TotalSuppliersTpTable.Country
ORDER BY TotalSuppliersTpTable.Country;
%%sql
-- clean up
DROP TABLE TotalCustomersTpTable;
DROP TABLE TotalSuppliersTpTable;
We find all distinct countries in Customers
and Supplier
tables, and combine them to be a derived table CountriesTable
. This table exists only during the execution of the query. We then LEFT JOIN
the table to Customers
and Supplier
tables, and use GROUP BY
to obtain the result.
%%sql
-- Display intermediate result to take a look (debug)
SELECT
CountriesTable.Country,
CustomerID,
SupplierID
FROM
(SELECT DISTINCT Country FROM Customers
UNION
SELECT DISTINCT Country FROM Suppliers
) AS CountriesTable -- We have a subquery under the FROM keyword. It gives a derived table, CountriesTable.
LEFT JOIN Customers ON CountriesTable.Country = Customers.Country
LEFT JOIN Suppliers ON CountriesTable.Country = Suppliers.Country;
%%sql
-- now, completely solve ex.29
SELECT
CountriesTable.Country,
Count(DISTINCT CustomerID) AS TotalCustomers,
Count(DISTINCT SupplierID) AS TotalSuppliers
FROM
(SELECT DISTINCT Country FROM Customers
UNION
SELECT DISTINCT Country FROM Suppliers
) AS CountriesTable -- We have a subquery under the FROM keyword. It gives a derived table, CountriesTable.
LEFT JOIN Customers ON CountriesTable.Country = Customers.Country
LEFT JOIN Suppliers ON CountriesTable.Country = Suppliers.Country
GROUP BY CountriesTable.Country
ORDER BY CountriesTable.Country;
We have covered quite a few basic SQL keywords. They are mostly about getting data from a DB and manipulate the retrieved data. There are much more to SQL. Take a look at this post to get an idea of what advanced SQL topics are about, and if you are interested, explore them on your own.
Most of the exercises in this notebook are drawn from this book by Sylvia Moestl Vasilik. The book is based on Microsoft SQL server and its T-SQL.
Another great SQL book for beginner is SQL in 10 Minute by Ben Forta.
Examples of various joins shown in the slides.
# create a new sqlite database
%sql sqlite:///join_playground.sqlite3
%%sql
-- create left table: table1
DROP TABLE IF EXISTS table1;
CREATE TABLE table1 (
pk INTEGER NOT NULL,
t1c1 TEXT NOT NULL
);
INSERT INTO table1
(pk, t1c1)
VALUES
(1,"a"),
(2,"b");
-- create right table: table2
DROP TABLE IF EXISTS table2;
CREATE TABLE table2 (
fk INTEGER NOT NULL,
t2c1 TEXT NOT NULL
);
INSERT INTO table2
(fk, t2c1)
VALUES
(1,"c"),
(1,"d"),
(3,"e");
%%sql
select * from table1;
%%sql
select * from table2;
%%sql
-- inner join
select * from table1 join table2 on table1.pk = table2.fk;
%%sql
-- left join
select *
from table1
left join table2 on table1.pk = table2.fk;
%%sql
-- left join with exclusion
select *
from table1
left join table2 on table1.pk = table2.fk
where table2.fk IS NULL;
%%sql
-- right join using LEFT JOIN key word
select *
from table2
left join table1 on table2.fk = table1.pk;
%%sql
-- right join with exclusion using LEFT JOIN and WHERE key word
select *
from table2
left join table1 on table2.fk = table1.pk
where table1.pk is NULL;
%%sql
-- fuller outer join
-- it is important to list all the column names (i.e. not using *; can use * if using MySQL)
-- https://stackoverflow.com/questions/53076981/sql-union-doesnt-check-for-matching-column-names-is-it-up-to-me
select pk, t1c1, fk, t2c1
from table1
left join table2 on table1.pk = table2.fk
union
select pk, t1c1, fk, t2c1
from table2
left join table1 on table2.fk = table1.pk;
%%sql
-- full outer join with exclusion
select pk, t1c1, fk, t2c1
from table1
left join table2 on table1.pk = table2.fk
where table2.fk is NULL
union
select pk, t1c1, fk, t2c1
from table2
left join table1 on table2.fk = table1.pk
where table1.pk is NULL;