Friday, November 8, 2013

Joins, Indexes And Our Friend The Where Clause

This is a presentation I gave, just over a year ago, for one of our lunchtime “Byte Share” learning sessions for our developers, both Business Intelligence and Software, many of them new to writing queries. I have modified this from the original PowerPoint to something that is a bit more readable...I hope.



Joins, Indexes And Our Friend The Where Clause
(no relation to Santa)

Joins...Setting the Stage

Joins have the basic structure of-- FROM [Left_Table] JoinType [Right_Table] ON JoinCondition

There are three basic types of Joins: Inner Joins, Outer Joins and Cross Joins

We will be working with a very simple “notional” dataset



Let’s also take a moment to define a term, alias. This is used to reference database tables by short names or abbreviations within a query.

Inner Join

This type of join takes every record in the left table and looks for matches in the right table. For each match that is found you will have one row in the results set, if more than one match is found you will have multiple rows in the results set, if no match is found there will not be a row returned for that record.

SELECT REP.Name AS SalesRep, REG.RegionDesc AS Region
FROM SALES_REGION REG INNER JOIN SALES_REPRESENTATIVE REP ON REG.RegionID = REP.SalesRegion


This could also be written as:

SELECT REP.Name AS SalesRep, REG.RegionDesc AS Region
FROM SALES_REGION REG JOIN SALES_REPRESENTATIVE REP ON REG.RegionID = REP.SalesRegion


The INNER descriptor is optional but may add clarity when reading your query.

This query yields the following results from our notional dataset:



Notice there are no records for the North sales region since no sales person is assigned to that region.

Outer Join

Three variations:
Left Outer Join (aka Left Join)
Takes every record in the left table and looks for matches in the right table. For each match that is found you will have one row in the results set, if more than one match is found you will have multiple rows in the results set, if no match is found the column in the results set for the right table will have a NULL.
Right Outer Join (aka Right Join)
Same as a Left Outer Join but bases search on the right table rather than the left.
Full Outer Join (aka Full Join)
This is like a combination of Left and Right Outer Joins. All records from both tables are in the results set and matched where they meet the join condition and NULL in the results set columns where matches are not found.

SELECT REP.Name AS SalesRep, REG.RegionDesc AS Region
FROM SALES_REGION REG LEFT OUTER JOIN SALES_REPRESENTATIVE REP ON REG.RegionID = REP.SalesRegion




With our small minuscule notional dataset the results for a Right Outer Join are the same as for an Inner Join and the results for a Full Join are the same as the Left Outer Join.

Cross Join

Slightly different structure than an Inner or Outer join in that it has no ON condition. Instead it joins every row in the left table to every row in the right table creating a Cartesian product.

SELECT REP.Name AS SalesRep, REG.RegionDesc AS Region
FROM SALES_REGION REG CROSS JOIN SALES_REPRESENTATIVE REP




When would you use this? For this dataset the results don’t give us anything meaningful however there are circumstances that call for a Cross Join. For example, if you run an online business that prints books on order and you offer every title in a variety of languages doing a cross join of your TITLES table with your LANGUAGES table could give you all the product possibilities that you sell. Keep in mind that cross joins can be very dangerous if used improperly especially on tables with millions of rows. This would definitely impact performance if not completely bring the server to a screeching halt.

Now that we have covered the basics lets mix it up a bit with some complex Joins

Multiple Join Conditions

AND Join Conditions

FROM QUOTE Q JOIN EQUIPMENT E ON
Q.PROJECT_NO = E.PROJECT_NO AND
Q.QUOTE_NO = E.QUOTE_NO AND
Q.REV_NO = E.REV_NO


OR Join Conditions

FROM TABLE_A A JOIN TABLE_B B ON
A.ORDER_NO = B.CONTRACT_NO OR A.ORDER_NO = B.PURCHASEORDER_NO


Multiple tables, cross database mixing Inner and Outer joins

FROM QUOTE Q JOIN PROJECT P ON Q.PROJECT_NO = P.PROJECT_NO
LEFT JOIN EMPLOYEE.dbo.EMPLOYEE_INFO E ON P.SALESPERSON_ID = E.EMPLOYEE_ID

You can even use joins in an UPDATE query

UPDATE PROJECT
SET CONTRACTOR_ID = 12345
FROM PROJECT P JOIN PROJECT_STATUS S
ON P.STATUS_CODE = S.STATUS_CODE
WHERE S.STATUS_PROJECT_IS_ACTIVE = 1 AND P.SALESPERSON_ID = 54321

I bet you are wondering why is all the data split into all these tables that we need to join? This is done through a database design process called Normalization. Does this help with performance? No, it does not necessarily help with performance nor is it designed for that purpose. So why do we normalize the data? It is done to prevent data anomalies…but that is a discussion for another day.

Know thy data

Make sure that the joins you are creating make sense. You will typically want to make sure you are joining on columns that have a Primary Key—Foreign Key relationship. If you join on non-key values you could end up with…



Spurious: false, bogus, not valid.
Tuple: row in a result set.

Wait a minute!

You said that when databases are normalized it does not help with performance. It seems to me that if you split the data in to multiple tables that have to be reconnected it would actually be a performance hit. Is there any way to help speed things up? Yes there is….INDEXES.

SQL Server Indexes

There are two types of indexes:
Clustered, these are created automatically when creating Primary Key. There is only one per table…actually it is the table sorted on the Index Key.
Non-clustered, these have to be manually created and you may have multiple per table.

Clustered Indexes

These are kind of like a phone book, you seek based on the index key and your data is right there. Usually indexed on the Primary Key but can be changed if needed. It is possible to drop the Clustered Index on the Primary Key, this will also drop the Primary Key, a Clustered Index on a different key can then be created and the Primary Key re-added which will generate a unique non-clustered index on the Primary Key. There are some cases where an auto-generated identity column is used as the Primary Key rather than natural key. Typically this is done to maintain Primary/Foreign key relations when the data in the natural key is likely to change over time. Usually clustered indexes are created on columns that are used in other tables as foreign keys. This is done in improve query performance for joins.

Non-clustered Indexes

These act like the index in the back of a book, you search the index for the key value you are looking for and it directs you to the page where the data is stored. Non-clustered indexes can contain several columns, if the query can be completed using only the data in the columns included in the index it is called a covering index. These can be very useful for increasing performance of some high frequency queries. “So let’s just index all the columns.” Not a good idea. More indexes do not mean better performance. There is overhead for maintaining indexes. Whenever data is inserted, updated or deleted the index needs to be updated. Indexes take storage space; it is possible that you could create indexes that are eating up disk space that are not being used. Not all data should be indexed for example Yes/No data (low cardinality).

Our Friend the WHERE Clause

What is it used for?
It restricts the rows returned in the result set through the use of predicates; it is used to extract only those records that fulfill a specified criterion. Basically it acts as a filter for your data.

The WHERE clause is not just for SELECT queries!

Where should I use WHERE?
In SELECT queries to limit the size of your results set to just the relevant data.
In UPDATE queries; did you REALLY mean to change that value for all record in the table? Remember there is no “Undo” button.
In DELETE queries; typically I prefer to mark a row as inactive rather than delete but if you MUST delete make sure you just delete the appropriate records. Remember there is no “Undo” button.

Relational-Operator Conditions You can craft very simple to very complex WHERE conditions using the following operators:
=, >, <, <>, =>, <=, IN, NOT IN, BETWEEN, LIKE ‘Somethi%’, NOT LIKE ‘Everythi%’, IS NULL, IS NOT NULL, AND, OR …

These are the one you are most likely to use; there may be a few others that are not on this list.

Remember those indexes? One of the biggest benefit will be experienced due to the WHERE clause. Try to filter on Primary Keys or Foreign keys as much as possible. Integer and CHAR/VARCHAR are good for filtering and can be indexed. BLOBS (Binary Large OBjectS) cannot be indexed and should not be used for a filter. Yes, I realize that there are times you MUST filter on a TEXT datatype but keep in mind that it will be SLOW. Just please, please, please don’t do the following:

WHERE SomeValue LIKE ‘%string%’

Even if SomeValue is a Varchar(50) and indexed the leading wildcard will mean that the index won’t be used.

No comments:

Post a Comment