SQL Join Tutorial
In this article, we will discuss JOINs in SQL (Structured Query Language), various types of JOIN
s, and their syntax. We will also run SELECT
queries using different types of JOIN
s on data in sample tables and view the result sets.
You can also check this tutorial in the following video:
1. Introduction
In real-world applications, data are stored in RDBMS
(Relational Database Management Systems) by removing redundancy and assigning primary or composite keys to identify them uniquely. This is part of normalization. Relationships of data stored in multiple tables are set by the primary key and foreign key mechanism. Thus, persisted data represents domain entity relationships like one-to-one and one-to-many. For example, let’s say we have an orders
table. The details of each order are stored in another table, say order_details
. Order details are linked to orders using the common column order_id
. The orders table is called the parent table and the order_details
table is called the child table.
When we want to retrieve data from multiple tables, JOIN
s are the mechanism to construct the condition to retrieve meaningful data.
2. What are Joins in SQL?
Joins extract information from multiple tables or views satisfying certain condition(s) and combine them into a result set. Typically primary and foreign keys are used to link tables in the join condition.
Different types of Joins that we will demonstrate in this article are:
3. Tables and Data
For running the queries of Inner and Outer joins, we will use a book_store
database consisting of three tables: locations
, publishers
and books
.
For the remaining three joins, viz Self Join
, Semi Join
and Anti Join
, we will use an hr database consisting of two tables: departments
and employees
.
The following diagram shows the structure of these tables.
4. Definitions, Syntax and Results
4.1 PART-I
In this section, we will discuss Inner Join
and the Outer Join
s (Left, Right and Full).
4.1.1 Inner Join
The Inner Join is used when you want rows that have exact matches for the join column values.
Suppose we want to know the books in our database and their publisher names. This data is retrieved with the following query:
SELECT * FROM books b INNER JOIN publishers p ON (b.publisher_id = p.publisher_id);
This returns books and publishers rows where publisher_id
is the same. A screenshot of the result set is given below:
In the SELECT
clause, instead of all columns (*), we can specify some columns only, say those with business values avoiding columns with id numbers. You can drop the keyword INNER
and use just the keyword JOIN
.
The modified query and the returned rows are given below:
SELECT b.title, b.author, p.name publisher FROM books b JOIN publishers p ON (b.publisher_id = p.publisher_id);
4.1.2 Left Outer Join
When you use LEFT OUTER JOIN
, the table “before” these keywords is the “left” table and is specified in the FROM clause. It will have all its rows in the result set.
In rows that match the join condition, there will be values from both tables. In the unmatched rows, the “left” table columns will have values from the database (except the join column), but the “right” table columns will have NULL
values.
Let’s say we want to retrieve all locations
and publishers
where ever they are present. The query for this requirement is:
SELECT * FROM locations l LEFT OUTER JOIN publishers p ON (l.location_id = p.location_id);
Result set for this query is as given below:
The keyword OUTER
is optional. LEFT JOIN
means a LEFT OUTER JOIN
. So, we drop the OUTER
keyword from our query. Let’s also say we do not want the publisher location ids, we want to title the publisher name column as publisher_name
, and want the data to be ordered by the location table’s location ids. The query then is:
SELECT l.*, p.name publisher_name FROM locations l LEFT JOIN publishers p ON (l.location_id = p.location_id) ORDER BY l.location_id;
The output of this query is given below:
We can extend the LEFT JOIN
to three tables too by joining the above query with the books table. The query will be:
SELECT l.*, p.name publisher_name, b.title, b.author FROM locations l LEFT JOIN publishers p ON (l.location_id = p.location_id) LEFT JOIN books b ON (p.publisher_id = b.publisher_id) ORDER BY l.location_id, p.publisher_id, b.book_id;
The result set returned:
4.1.3 Right Outer Join
The right outer join is the mirror image of the left outer join. The result set will have all rows from the “right” table.
In rows that match the join condition, there will be values from both tables. In the unmatched rows, the “right” table columns (except the join column) will have values from the database, but the “left” table columns will have NULL
values.
As an example, we want to show all the publishers in our database whether we have their books or not. The query is:
SELECT * FROM books b RIGHT OUTER JOIN publishers p ON (p.publisher_id = b.publisher_id);
The result is as follows:
The keyword OUTER
is optional. RIGHT JOIN
means RIGHT OUTER JOIN
. So, we drop the OUTER
keyword from our query. Also, we want to drop the book_id
and publisher_id
columns and title the publisher name column as publisher_name
. Then the query is
SELECT b.title, b.author, p.name publisher_name FROM books b RIGHT JOIN publishers p ON (p.publisher_id = b.publisher_id);
The output is as follows:
We can extend the RIGHT JOIN
to three tables too by joining the above query with the locations table. The query will be:
SELECT b.title, b.author, p.name publisher_name, l.city FROM books b RIGHT JOIN publishers p ON (p.publisher_id = b.publisher_id) RIGHT JOIN locations l ON (l.location_id = p.location_id) ORDER BY l.location_id, p.publisher_id, b.book_id;
The output is as given below:
4.1.4 Full Outer Join
The full outer join combines the functionality of the left outer join and the right outer join. It returns a result set that includes rows from both left and right tables.
In “left” table rows that do not have matching “right” table rows, the “left” table columns will have values from the database but the “right” table values will have NULL values.
Similarly, in “right” table rows that do not have matching “left” table rows, the “right” table columns will have values from the database but the “left” table will columns will have NULL
values.
SELECT * FROM locations l FULL OUTER JOIN publishers p ON (l.location_id = p.location_id);
The result set is as given below:
As in left join and right join, the keyword OUTER
is optional here too. Also, we can extend the outer join to three tables. The query for this requirement is:
SELECT l.city, p.name publisher_name, b.title, b.author FROM locations l FULL JOIN publishers p ON (l.location_id = p.location_id) FULL JOIN books b ON (p.publisher_id = b.publisher_id) ORDER BY l.city;
The output is as given below:
MySQL does not support the FULL OUTER JOIN
keywords. The full outer join in MySQL is accomplished by a UNION
of the left outer join and right outer join. Therefore, the full outer join queries will have the following syntax:
SELECT * FROM locations l LEFT JOIN publishers p ON (l.location_id = p.location_id) UNION SELECT * FROM locations l RIGHT JOIN publishers p ON (l.location_id = p.location_id);
SELECT l.city, p.name publisher_name, b.title, b.author FROM locations l LEFT OUTER JOIN publishers p ON (l.location_id = p.location_id) LEFT OUTER JOIN books b ON (p.publisher_id = b.publisher_id) UNION SELECT l.city, p.name publisher_name, b.title, b.author FROM locations l RIGHT OUTER JOIN publishers p ON (l.location_id = p.location_id) RIGHT OUTER JOIN books b ON (p.publisher_id = b.publisher_id) ORDER BY city;
Output in MySQL will be the same as that shown in the previous two screenshots.
4.2 PART-II
In this section, we will cover Self Join
, Semi Join
and Anti Join
. For these, we will use the hr database of two tables — employees
and departments
.
4.2.1 Self Join
A self-join is a unary relation in which a table is joined with itself. Each row of the table is joined with all rows including itself, depending on some condition(s).
A real-world use case is of a table that stores events and we need to extract meaningful insights from that table itself. For example, the most delayed event in a table, where the delay is calculated by the delta of the timestamp of each event and its previous event.
The most common example is the reporting of employees and their managers’ names from the employees
table. You can write the query using either the WHERE
keyword or JOIN
/ ON
keywords, as shown below:
SELECT e1.first_name, e1.last_name, e2.first_name mgr_first_name, e2.last_name mgr_last_name FROM employees e1, employees e2 WHERE e1.manager_id = e2.employee_id;
SELECT e1.first_name, e1.last_name, e2.first_name mgr_first_name, e2.last_name mgr_last_name FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.employee_id;
The output for these is given in the following screenshot:
4.2.2 Semi Join
You use semi-join when you want to select rows from the first table only, but use the second table to decide which rows to return.
For example, let’s say you are asked to provide the list of all employees and their department names, but the department id should be between 2000 and 4000. As a programmer, your department (IT) id is 5000 and you feel excluded. You may be tempted to ask, why, what is so special about the other departments? You even suspect they are getting extra bonus. But then corporate career has taught that you never question your boss or senior management, so you write the query without any complaints and send them the output.
The query can be written with either the IN
keyword or the EXISTS
keyword, as shown below.
SELECT e.first_name, e.last_name, d.name department_name FROM employees e, departments d WHERE e.department_id = d.department_id AND d.department_id IN (SELECT d.department_id WHERE d.department_id BETWEEN 2000 AND 4000 );
SELECT e.first_name, e.last_name, d.name department_name FROM employees e, departments d WHERE e.department_id = d.department_id AND EXISTS (SELECT d.department_id WHERE d.department_id BETWEEN 2000 AND 4000 );
For performance, EXISTS
is preferred. Output of both queries is in the screenshot given below:
4.4.3 Anti Join
Anti-join is like the semi-join to select rows from the first table only, but using a negative condition on the second table to decide which rows to return.
In our example, it can be used to retrieve all employees and their department names, but the department id should not be between 2000 and 4000. If this task comes from the management, naturally you will be happy since the I.T. department of which you are a member has a department id that is not between 2000 and 4000.
The query and the output are given below:
SELECT e.first_name, e.last_name, d.name department_name FROM employees e, departments d WHERE e.department_id = d.department_id AND NOT EXISTS (SELECT d.department_id WHERE d.department_id BETWEEN 2000 AND 4000 );
5. Download the Source Code
This article was a tutorial on various types of SQL
JOIN
s.The SQL
source code for creating the tables, inserting the sample data, and all queries discussed in this article are available in a zip file. The zip file has separate scripts for PostgreSQL and MySQL.
You can download the full source code of this example here: SQL Join Tutorial
Last updated on Jan. 14th, 2022