Menu Bar

Monday, October 31, 2011

Another way of looking at JOINs


"No one would surrender to the Dread Pirate Westley." - Westly, The Princess Bride
I recently had a call from a colleague asking me about how to get some information out of the database. Since the query was using an Oracle database, it was using the (+) outer join syntax on the predicate level which to me, has never been what you'd call intuitive, particularly when joining multiple tables together.

For those people not familiar with it, a simple left outer join would read as follows:

select a.col1, b.col2
from tab1 a, tab2 b
where a.col1 = b.col1(+)

The (+) syntax is used in the side of the predicate where NULL's would be expected in the event of an outer join. So this example would indicate a left outer join.

The query was more complicated than this simple example and I suggested using ANSI standard syntax to build the query instead.

In SQL Server, in order to perform an outer join, you need to use the ANSI standard syntax since it's the only way an outer join is supported. I suggested to my colleague that he learn it as well, for future benefit.

What is an outer join? An outer join is where you join two tables on a predicate and if the table being outer joined doesn't match, the row is returned anyway, but with NULL's in the result set in any columns from the joined table where a match was not found.

Going back to Elementary School math, think of a Venn Diagram as follows (We will call the intersection in the middle "C"):

An Inner Join is only going to return the data shown as "C" - those rows common to both "A" and "B" across the join predicate.

A Left Outer Join will show all rows from "A" as well as "C" (non matching data from "B" will return as NULL)
A Right Outer Join will show all rows from "B" as well as "C" (non matching data from "A" will return as NULL)

So, with that explained, on to the syntax. The Syntax of a join, when done using the ANSI standard syntax is:

SELECT <list of columns>
FROM <TABLE1>
[LEFT|RIGHT] [OUTER|INNER] JOIN <TABLE2>
ON <JOIN CONDITION>
WHERE <EXTRA PREDICATES>

So let's say we have a table with a purchase order header and a table with PO Items and we want a report of all of the PO's for a particular vendor. There are two ways we could right this code:

SELECT a.PO_NO, b.PO_ITEM_NO, b.QTY_ORDERED, b.UNIT_PRICE
FROM PO_HDR a, PO_ITEM b
WHERE a.PO_NO = b.PO_NO
AND a.VENDOR = 'ACME'
ORDER BY a.PO_NO, b.PO_ITEM_NO

The same query written using ANSI JOIN syntax would be like this:

SELECT a.PO_NO, b.PO_ITEM_NO, b.QTY_ORDERED, b.UNIT_PRICE
FROM PO_HDR a
JOIN PO_ITEM b
ON a.PO_NO = b.PO_NO
WHERE a.VENDOR = 'ACME'
ORDER BY a.PO_NO, b.PO_ITEM_NO

The two queries would generate the exact same output.

The above example would return only those records in the "C" subset on our Venn Diagram above.

Let's try something different. Let's say we have a table containing all Employees (one of the columns is the Department that the employee belongs to) and a table of all Departments. If you wanted to find all employees who had not yet been assigned a department (This example courtesy of Wikipedia):

SELECT a.EmployeeID, a.DepartmentID, b.DepartmentID, b.DepartmentName
FROM   employee a
LEFT OUTER JOIN department  b
ON a.DepartmentID = b.DepartmentID;

This would return a list of all employees and their departments. Any employee not assigned a department would show as NULL in columns 3 and 4. If the same query were executed as a Right Outer join, we'd get a list of Departments without Employees assigned and columns 1 and 2 would be null if no employees were found for any given department.

LEFT and RIGHT, in the case of an outer join, define which table of the join, the first (LEFT) or the second (RIGHT) should always return data. The other side returns the NULL in the case of something not matching.

The final part I want to talk about is the logical structure of building a query. If you ever find yourself writing a query that is joining multiple tables, I've sometimes found that making sure all the joins are correct can be burdensome at times.

I find that the easiest way to build a complex query is one table at a time. Write a simple select statement and build it up one table at a time. As you add in more tables, think of the whole query you currently have as the "table" you're joining your next table into, so you can consider you join conditions. This advice is probably more for those who are not deeply familiar with SQL, but I've found this invaluable when analyzing a complex query that is performing badly.

So we could start with a list of employees:

SELECT EmployeeID, EmployeeName
FROM EMPLOYEES

Then add in their departments, including employees not in a department at the moment

SELECT a.EmployeeID, a.DepartmentID, b.DepartmentID, b.DepartmentName
FROM   employee a
LEFT OUTER JOIN department  b
ON a.DepartmentID = b.DepartmentID;

Then add in the description of the pay group that each employee is in - only listing those that have a pay group

SELECT a.EmployeeID, a.DepartmentID, b.DepartmentID, b.DepartmentName, 
       a.PayGroup, c.PayGroupDesc
FROM   employee a
LEFT OUTER JOIN department  b
ON a.DepartmentID = b.DepartmentID
INNER JOIN paygroup c
ON a.PayGroup = c.PayGroup

And then finally add in their physical location description. As a final selection criteria, only those employees that are not terminated.

SELECT a.EmployeeID, a.DepartmentID, b.DepartmentID, b.DepartmentName, 
       a.PayGroup, c.PayGroupDesc, d.PhysicalLocDesc
FROM   employee a
LEFT OUTER JOIN department  b
ON a.DepartmentID = b.DepartmentID
INNER JOIN paygroup c
ON a.PayGroup = c.PayGroup
INNER JOIN physlocation d
ON a.PhysicalLoc = d.PhysicalLoc
WHERE a.terminated = 'N'

When looking at a complex query this way, it's easy to keep track of what your query is doing, rather than attempted to add every column you need to that select list right at the start and trying to build it all at once. The more tables you're trying to join together, the more likely you are to forget a join condition and end up with a nasty Cartesian join in there somewhere.