sql

SQL Join Tutorial

In this article, we will discuss JOINs in SQL (Structured Query Language), various types of JOINs, and their syntax. We will also run SELECT queries using different types of JOINs on data in sample tables and view the result sets.

You can also check this tutorial in the following video:

SQL Join Tutorial – video

1. Introduction

In real-world applications, data are stored in RDBMS (Relational Database Management Systems) by removing redundancy and assigning primary or composite keys to identify them uniquely. This is part of normalization. Relationships of data stored in multiple tables are set by the primary key and foreign key mechanism. Thus, persisted data represents domain entity relationships like one-to-one and one-to-many. For example, let’s say we have an orders table. The details of each order are stored in another table, say order_details. Order details are linked to orders using the common column order_id. The orders table is called the parent table and the order_details table is called the child table.

When we want to retrieve data from multiple tables, JOINs are the mechanism to construct the condition to retrieve meaningful data.

2. What are Joins in SQL?

Joins extract information from multiple tables or views satisfying certain condition(s) and combine them into a result set. Typically primary and foreign keys are used to link tables in the join condition.

Different types of Joins that we will demonstrate in this article are:

sql join

3. Tables and Data

For running the queries of Inner and Outer joins, we will use a book_store database consisting of three tables: locations, publishers and books.

For the remaining three joins, viz Self Join, Semi Join and Anti Join, we will use an hr database consisting of two tables: departments and employees.

The following diagram shows the structure of these tables.

sql join

4. Definitions, Syntax and Results

4.1 PART-I

In this section, we will discuss Inner Join and the Outer Joins (Left, Right and Full).

4.1.1 Inner Join

The Inner Join is used when you want rows that have exact matches for the join column values.

Suppose we want to know the books in our database and their publisher names. This data is retrieved with the following query:

SELECT *
FROM books b
INNER JOIN publishers p
ON (b.publisher_id = p.publisher_id);

This returns books and publishers rows where publisher_id is the same. A screenshot of the result set is given below:

sql join

In the SELECT clause, instead of all columns (*), we can specify some columns only, say those with business values avoiding columns with id numbers. You can drop the keyword INNER and use just the keyword JOIN.

The modified query and the returned rows are given below:

SELECT b.title, b.author, p.name publisher
FROM books b
JOIN publishers p
ON (b.publisher_id = p.publisher_id);

sql join

4.1.2 Left Outer Join

When you use LEFT OUTER JOIN, the table “before” these keywords is the “left” table and is specified in the FROM clause. It will have all its rows in the result set.

In rows that match the join condition, there will be values from both tables. In the unmatched rows, the “left” table columns will have values from the database (except the join column), but the “right” table columns will have NULL values.

Let’s say we want to retrieve all locations and publishers where ever they are present. The query for this requirement is:

SELECT *
FROM locations l
LEFT OUTER JOIN publishers p
ON (l.location_id = p.location_id);

Result set for this query is as given below:

sql join

The keyword OUTER is optional. LEFT JOIN means a LEFT OUTER JOIN. So, we drop the OUTER keyword from our query. Let’s also say we do not want the publisher location ids, we want to title the publisher name column as publisher_name, and want the data to be ordered by the location table’s location ids. The query then is:

SELECT l.*, p.name publisher_name
FROM locations l
LEFT JOIN publishers p
ON (l.location_id = p.location_id)
ORDER BY l.location_id;

The output of this query is given below:

We can extend the LEFT JOIN to three tables too by joining the above query with the books table. The query will be:

SELECT l.*, p.name publisher_name, b.title, b.author
FROM locations l
LEFT JOIN publishers p
ON (l.location_id = p.location_id)
LEFT JOIN books b
ON (p.publisher_id = b.publisher_id)
ORDER BY l.location_id, p.publisher_id, b.book_id;

The result set returned:

4.1.3 Right Outer Join

The right outer join is the mirror image of the left outer join. The result set will have all rows from the “right” table.
In rows that match the join condition, there will be values from both tables. In the unmatched rows, the “right” table columns (except the join column) will have values from the database, but the “left” table columns will have NULL values.

As an example, we want to show all the publishers in our database whether we have their books or not. The query is:

SELECT *
FROM books b
RIGHT OUTER JOIN publishers p
ON (p.publisher_id = b.publisher_id);

The result is as follows:

The keyword OUTER is optional. RIGHT JOIN means RIGHT OUTER JOIN. So, we drop the OUTER keyword from our query. Also, we want to drop the book_id and publisher_id columns and title the publisher name column as publisher_name. Then the query is

SELECT b.title, b.author, p.name publisher_name
FROM books b
RIGHT JOIN publishers p
ON (p.publisher_id = b.publisher_id);

The output is as follows:

We can extend the RIGHT JOIN to three tables too by joining the above query with the locations table. The query will be:

SELECT b.title, b.author, p.name publisher_name, l.city
FROM books b
RIGHT JOIN publishers p
ON (p.publisher_id = b.publisher_id)
RIGHT JOIN locations l
ON (l.location_id = p.location_id)
ORDER BY l.location_id, p.publisher_id, b.book_id;

The output is as given below:

4.1.4 Full Outer Join

The full outer join combines the functionality of the left outer join and the right outer join. It returns a result set that includes rows from both left and right tables.
In “left” table rows that do not have matching “right” table rows, the “left” table columns will have values from the database but the “right” table values will have NULL values.
Similarly, in “right” table rows that do not have matching “left” table rows, the “right” table columns will have values from the database but the “left” table will columns will have NULL values.

SELECT *
FROM locations l
FULL OUTER JOIN publishers p
ON (l.location_id = p.location_id);

The result set is as given below:

As in left join and right join, the keyword OUTER is optional here too. Also, we can extend the outer join to three tables. The query for this requirement is:

SELECT l.city, p.name publisher_name, b.title, b.author
FROM locations l
FULL JOIN publishers p
ON (l.location_id = p.location_id)
FULL JOIN books b
ON (p.publisher_id = b.publisher_id)
ORDER BY l.city;

The output is as given below:

MySQL does not support the FULL OUTER JOIN keywords. The full outer join in MySQL is accomplished by a UNION of the left outer join and right outer join. Therefore, the full outer join queries will have the following syntax:

SELECT *
FROM locations l
LEFT JOIN publishers p
ON (l.location_id = p.location_id)
UNION
SELECT *
FROM locations l
RIGHT JOIN publishers p
ON (l.location_id = p.location_id);
SELECT l.city, p.name publisher_name, b.title, b.author
FROM locations l
LEFT OUTER JOIN publishers p
ON (l.location_id = p.location_id)
LEFT OUTER JOIN books b
ON (p.publisher_id = b.publisher_id)
UNION
SELECT l.city, p.name publisher_name, b.title, b.author
FROM locations l
RIGHT OUTER JOIN publishers p
ON (l.location_id = p.location_id)
RIGHT OUTER JOIN books b
ON (p.publisher_id = b.publisher_id)
ORDER BY city;

Output in MySQL will be the same as that shown in the previous two screenshots.

4.2 PART-II

In this section, we will cover Self Join, Semi Join and Anti Join. For these, we will use the hr database of two tables — employees and departments.

4.2.1 Self Join

A self-join is a unary relation in which a table is joined with itself. Each row of the table is joined with all rows including itself, depending on some condition(s).

A real-world use case is of a table that stores events and we need to extract meaningful insights from that table itself. For example, the most delayed event in a table, where the delay is calculated by the delta of the timestamp of each event and its previous event.

The most common example is the reporting of employees and their managers’ names from the employees table. You can write the query using either the WHERE keyword or JOIN / ON keywords, as shown below:

SELECT e1.first_name, e1.last_name, e2.first_name mgr_first_name, e2.last_name mgr_last_name
FROM employees e1, employees e2
WHERE e1.manager_id = e2.employee_id;
SELECT e1.first_name, e1.last_name, e2.first_name mgr_first_name, e2.last_name mgr_last_name
FROM employees e1
JOIN employees e2
ON e1.manager_id = e2.employee_id;

The output for these is given in the following screenshot:

4.2.2 Semi Join

You use semi-join when you want to select rows from the first table only, but use the second table to decide which rows to return.

For example, let’s say you are asked to provide the list of all employees and their department names, but the department id should be between 2000 and 4000. As a programmer, your department (IT) id is 5000 and you feel excluded. You may be tempted to ask, why, what is so special about the other departments? You even suspect they are getting extra bonus. But then corporate career has taught that you never question your boss or senior management, so you write the query without any complaints and send them the output.

The query can be written with either the IN keyword or the EXISTS keyword, as shown below.

SELECT e.first_name, e.last_name, d.name department_name
FROM employees e, departments d
WHERE e.department_id = d.department_id
AND d.department_id IN
(SELECT d.department_id
WHERE d.department_id BETWEEN 2000 AND 4000
);
SELECT e.first_name, e.last_name, d.name department_name
FROM employees e, departments d
WHERE e.department_id = d.department_id
AND EXISTS
(SELECT d.department_id
WHERE d.department_id BETWEEN 2000 AND 4000
);

For performance, EXISTS is preferred. Output of both queries is in the screenshot given below:

4.4.3 Anti Join

Anti-join is like the semi-join to select rows from the first table only, but using a negative condition on the second table to decide which rows to return.

In our example, it can be used to retrieve all employees and their department names, but the department id should not be between 2000 and 4000. If this task comes from the management, naturally you will be happy since the I.T. department of which you are a member has a department id that is not between 2000 and 4000.

The query and the output are given below:

SELECT e.first_name, e.last_name, d.name department_name
FROM employees e, departments d
WHERE e.department_id = d.department_id
AND NOT EXISTS
(SELECT d.department_id
WHERE d.department_id BETWEEN 2000 AND 4000
);

5. Download the Source Code

This article was a tutorial on various types of SQL JOINs.The SQL source code for creating the tables, inserting the sample data, and all queries discussed in this article are available in a zip file. The zip file has separate scripts for PostgreSQL and MySQL.

Download
You can download the full source code of this example here: SQL Join Tutorial

Last updated on Jan. 14th, 2022

Mahboob Hussain

Mahboob Hussain graduated in Engineering from NIT Nagpur, India and has an MBA from Webster University, USA. He has executed roles in various aspects of software development and technical governance. He started with FORTRAN and has programmed in a variety of languages in his career, the mainstay of which has been Java. He is an associate editor in our team and has his personal homepage at http://bit.ly/mahboob
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button