MMA Bootcamp - Intro to SQL

1. Preparation

We start by setting up a connection to our sample database, Northwind DB.

After connecting to the Northwind DB, let's first check what database engine we are using.

We are using SQLite 3. It's a small and fast Relational Database engine. SQLite DB is self-contained: one database is just a file. Believe it or not, SQLite is the most used DB engine in the world. It's built into millions of mobile phones (e.g. Android or iOS phones) and it's used by many popular apps to store data.

2. A quick look at Northwind DB

How many tables does this DB have?

A database may contain many tables. Let's see how many tables we have in the Northwind DB.

Every SQLite database has a special table called sqlite_master. It contains a master listing of all database objects (tables, indexes, etc.) in the database and the SQL used to create each object. We can query this table to find out how many tables (excluding sqlite_master) we have in our Northwind DB.

We will use SQL keywords SELECT...FROM...WHERE. Don't worry about them for now. We will soon learn those keywords.

Note that special tables in SQLite start with sqlite_. Those are reserved tables for SQLite engine system use. Above we filtered out those tables. In most cases, we shouldn't touch those special tables.

How does the Categories table look like?

A table is made up of one or more columns/fields. Let's take a look a the Categories table (column names, types, etc.).

How about the relationships between the tables?

We present the relationships between the tables using an Entity Relationship (ER) diagram.

ER Diagram

In the above ER diagram, the tiny vertical key icon indicates a column is a primary key. A primary key is a column (or set of columns) whose values uniquely identify every row in a table. For example, OrderID is the primary key in the Orders table, and OrderID and ProductID (combined) is the primary key in the OrderDetails table.

The relationship icon (a line with a horizontal key at one end and an infinite symbol at the other end) indicates a foreign key constraint and a one-to-many relationship. A foreign key is a column (or set of columns) in one table whose values uniquely identify a row of another table or the same table. A foreign key mostly refers to a primary key in another table. A foreign key constraint requires that the constrained column contain only values from the primary key column of the other table. For example CustomerID in the Orders table is a foreign key that refers to the CustomerID primary key in the Customers table, and it can only contain values that exist in the CustomerID column of the Customers table.

In addition, it happens that every foreign key constraint in the Northwind DB establishes a one-to-many relationship, i.e. a row from one table can have multiple matching rows in another table. For example, one row from the Customers table can match multiple rows in the Orders table (via CustomerID). This makes sense as one customer can place more than one orders. (Another common relationship a foreign key constraint can establish is the one-to-one relationship.)

logo meaning
key logo primary key
foreign key constraint one-to-many foreign key contraint

Question: Why do we need foreign key constraints? (Discussion)

You can find out relationships between tables by querying the sqlite_master table. See the code below.

Of course, querying the sqlite_master table to find out relationships between tables is specific to SQLite. We would do it differently if we use other database management systems. For example, if we use MySQL system, we would query the key_column_usage table in the information_schema DB. See this stackoverflow question.

Alternatively, some SQL client tools can generate ER diagram for you. The above diagram is generated by SQL Server Management Studio (SSMS). The Northwind DB is a sample DB originally shipped with Microsoft SQL Server.

Exercises

We will work on the problems with a * sign together. Other problems are left for homework.

Part 1. Warm-up

We review

  1. Retrieve data: SELECT...FROM...
  2. Sort retrieved data: SELECT...FROM...ORDER BY...
  3. Filter data: SELECT...FROM...WHERE...; IN, NOT, LIKE and % wildcard
  4. Create calculated fields: mathematical calculations (e.g. +, -, *, /); data manipulation functions (e.g. DATE(), ||)

Ex1*. Which categories do we have?

Display CategoryName and Description in the Categories table.

Ex2. Which shippers do we have?

Ex3* Find names and hire dates of all US sales representatives

The resulting table should display FirstName, LastName, HireDate, Title and Country.

Ex4. Find supplier contacts who are not marketing managers

Display supplierID, ContactNames and ContactTitle for those whose ContactTitle is not Marketing Manager.

Hint: NOT operator.

Ex.5* Find products with "Chef" in its name

Hint: LIKE operator and % wildcard.

Ex.6 Find all orders shipping to Canada, Mexico or USA.

Hint: IN operator.

Ex.7* Who is the oldest employee?

Display FirstName, LastName, Title, and birth date, ordered by birth date. Show only the date portion of the BirthDate column/field (YYYY-MM-DD).

EX.8 Show employee full name

Show the FirstName and LastName columns from the Employees table, and then create a new column called FullName, showing first name and last name joined together in one column in a format like "DAVOLIO, Nancy".

Ex.9* Which order items have total sales above $12,000.

In the OrderDetails table, we have the columns UnitPrice and Quantity. Create a new column, TotalPrice, that multiplies these two (ignore the Discount column). Display OrderID, ProductID, UnitPrice, Quantity and TotalPrice. Order the result first by ProductID (ascending) and then by TotalPrice (descending).

Note that the question asks about order items (not orders). One order may contain multiple items/products. An order item is identified by the OrderID and ProductID columns.

Ex.10 Show all products that need reordering

A product need reordering if UnitsInStock + UnitsOnOrder <= ReorderLevel and Discontinued = 0. In your result, display ProductID, ProductName, UnitsInStock, UnitsOnOrder, ReorderLevel, and Discontinued.

Part 2. Summarize and Group Data

  1. Summarize data using aggregate functions (e.g. COUNT(), MIN(), MAX(), and AVG()).
  2. Group data and filter groups: SELECT...FROM...GROUP BY...HAVING....
  3. SELECT clause ordering: SELECT...FROM...WHERE...GROUP BY...HAVING...ORDER BY....
  4. Filter data by subquery: SELECT...FROM...WHERE...(SELECT...FROM...)

Ex.11* Count total number of order items. What's their minimum, maximum and average price?

Note that the question asks about order items (not orders). One order may contain multiple items/products. An order item is identified by (OrderID + ProductID).

Ex. 12* How many customers do we have for each contact title.

Order the result by the count of contact titles.

Ex.13 Find total customers per country and city

Ex.14* Show all orders with values greater than $12,000?

Ignore the Discount. Contrast this question to Ex.9.

Ex.15* Find top three countries with the highest average freight charges for the last 12 months.

"The last 12 months" is with respect to the last OrderDate in the Orders table.

Part 3. Join Tables

  1. Inner join: SELECT...FROM...INNER JOIN...ON...
  2. Left join: SELECT...FROM...LEFT JOIN...ON...
  3. Other joins.

Ex.16* Show products and their associated suppliers

Display the ProductID, ProductName, and the CompanyName of the Supplier. Sort by ProductID.

We can use the WHERE keyword.

We can also use the (INNER) JOIN keyword.

Ex.17 Show all orders and shippers with OrderID less than 10255.

Display OrderID, OrderDate (date only), and CompanyName of the Shipper, and sort by OrderID. In addition, only display rows with OrderID < 10255.

Ex.18* Find total number of products in each category

In your result, display CategoryName and total number of product.

Ex.19 Show all orders with values greater than $12,000 and are placed in 2016?

In your result, display OrderID and total value of the order (ignore Discount).

Ex.20* What products has Michael Suyama ever sold?

In your result, display EmployeeID, FirstName, LastName, ProductID and ProductName.

Ex.21* Find customers that never placed an order

Note that the Jupyter Notebook displays Null as None.

Ex.22* Find customers who never placed an order from Margaret Peacock (EmployeeID 4)

Ex.23* Find high-value customers - 1

We define high-value customers as those who have made at least 1 order with a total value >= $10,000 (ignore discount). We only consider orders placed in year 2016.

Ex.24 Find high-value customers - 2

This time, we define high-value customers as those who have made total order >= $15,000 (ignore discount). We only consider orders placed in year 2016.

Part 4. Others

If we don't have time to go through all the concepts below, I encourage you to learn them on your own.

  1. CTE and temporary table
  2. Self-join
  3. CASE keyword
  4. UNION keyword

Many SQL implementations support something called Common Table Expression (CTE).

A CTE is a named temporary result set that exists within the scope of a single statement and that can be referred to later within that statement, possibly multiple times.

CTE is like a subquery, but it's more flexible. In particular, it can be referred to many times in a query, as if we have a temporary table. CTE is created using the WITH keyword. Below is a trivial example of CTE.

WITH
  cte1 AS (SELECT a, b FROM table1),
  cte2 AS (SELECT c, d FROM table2)
SELECT b, d FROM cte1
  JOIN cte2 ON cte1.a = cte2.c;

You can also use a temporary table to store SELECT result. Temporary table can be referred to multiple times in a single session. The syntax to create temporary table and store results from SELECT is

CREATE TEMPORARY TABLE TableName AS SELECT...

Ex.25* Find the total number of orders fulfilled by the oldest employee (CTE, Subquery, and temporary table)

Ex.26* For each employee, find what percentage of his/her orders were shipped late?

In your result, display EmployeeID, LastName, total number of orders, number of late orders, % of late orders (rounding to 2 decimal places).

Define late order as RequiredDate <= ShipppedDate.

Below are three simpler related questions. You may want to solve them first before you attempt Ex.26 itself.

  1. Which orders were shipped late?
  2. Which employee has the most late orders?
  3. Compare each employee's number of orders and number of late orders.

Note: There are many ways to solve this exercise. One of them is to create temporary tables to help us store intermediate results.

Ex.27* Find customers with multiple orders in 5 day period

Ex.28 Group customers by its total order value in 2016 for a sales compaign.

Group customers into four categories, very high (total order in 2016 > 10,000), high (5,000 < total order <= 10,000), medium (1,000 < total order <= 5,000), and low (0 < total order <= 1,000). When calculating total order value, you can ignore discount.

We will learn the CASE keyword.

Ex. 29 Countries with suppliers and customers

Some Northwind employees are planning a business trip, and they would like to visit as many customers and suppliers as possible. They would like to see a list of all countries and their corresponding total number of customers and total number of suppliers.

Method 1. Use intermediate tables.

We will use temporary tables.

Method 2. Use subquery to get a derived table

We find all distinct countries in Customers and Supplier tables, and combine them to be a derived table CountriesTable. This table exists only during the execution of the query. We then LEFT JOIN the table to Customers and Supplier tables, and use GROUP BY to obtain the result.

Summary

We have covered quite a few basic SQL keywords. They are mostly about getting data from a DB and manipulate the retrieved data. There are much more to SQL. Take a look at this post to get an idea of what advanced SQL topics are about, and if you are interested, explore them on your own.

References

Most of the exercises in this notebook are drawn from this book by Sylvia Moestl Vasilik. The book is based on Microsoft SQL server and its T-SQL.

Another great SQL book for beginner is SQL in 10 Minute by Ben Forta.

Appendix

Examples of various joins shown in the slides.