Words of a once misguided soul in SQL

Over these months I have noticed that a SQL developer must know these following things which gets unnoticed very easily:

  • The sole purpose of using OUTER keyword in FULL, RIGHT or LEFT join is to improve code readability. Period.
    Many times you have noticed that you can specify a LEFT/RIGHT/FULL in two ways :

    They will produce the same result in the same amount of time under similar circumstances. The only difference is code readability, in latter case its more explicit than in the former. Also OUTER joins collectively refers to cases where the resultset can contain values which are out of the scope of the conditions specified.
    Same thing applies to using INNER keyword in JOINS.
  • Interchanging conditions in WHERE clause with that in ON clause while using INNER JOINS won't effect the result set but in case of OUTER JOINS it will.
    Lets say you have a query like this :

    First of all if this query had INNER JOIN instead of LEFT OUTER JOIN then even if you interchange the conditions in WHERE clause and ON clause it won't change a thing because you want the resultset to satisfy both the conditions strictly.
    However, in the case of the given example you are saying that I want the resultset which satisfies the ON clause condition but I also want a partial resultset in cases where the ON clause condition is violated and I want the resultset to satisfy the condition specified in WHERE clause strictly. Phew!

    Now what actually the query is doing is that its selecting the employees whose managerid's are null, in other words employees without a manager.
  • No RBAR !
    In sql you generally do set based operation i.e. you say to the sql engine to take one column of one table and do something with it using another column of another table. This is not true for most of the programming language where this concept is exploited quite regularly and also is one of the good practices because of which its called RBR(row by row) operations.
    RBAR stands for Row By Agonizing Row, is generally achieved by cursors


    The above query simply multiplies the salaries of employees in department 'X' by a factor of 1.1. This can easily be done by update statement and should be avoided as far as possible.
    However, when you want to do something with rows on the basis of previous rows in that order then you might wanna use RBAR but it is highly advised to avoid it.
  • Indexing is not always the option.
    When I was working on ETL systems I was given a problem of reducing the amount of time required to join with a table containing about a billion records. So like any SQL rookie I created an index on it, and the first thing my reviewer said was 'ok, first of all you need to get rid of that index', ok wait what. In ETL systems when you are using about a dozen of tables you need be very careful when creating an index because it'll surely affect the upstream, and however you'll decrease the running time of the affected area the system as a whole will execute in the same time.
    So how did we fix it, lets look at that in the next but I'll suggest that indexing is like caching so if you are using the same table at the same column many times, then you can surely use indexing.
  • Incremental load vs. Reload
    So as discussed in the previous point, I dealt with the problem by doing an incremental load instead of reload.
    In ETL system you can do the last part in two ways, either you can reload the whole data again each day, this'll ensure reliability and easy recoverability however it'll take more time, or you can incrementally load the data in which case the time will be reduced drastically but if the system fails or corrupt is loaded then you need to recover it manually.
    We used this method to bring down the running time however we modified this approach to fallback to reloading the data when the table was not present that we could minimize the manual steps required for recovering from a system failure.
  • Exploiting indexes the wrong way.
    I have seen cases where the indexes are just used the wrong way, for eg. here is one query :


    This will do a full table scan as you are passing the column rows to a function so the query engine must apply the function to every row value and hence rendering the index useless. It is advise to use LIKE in these cases.

Comments