Thursday, September 29, 2016

Linear time, Logarithmic time, and Databases

Database tables are essentially arrays of rows, sometimes contiguous in memory and sometimes not. Scanning a table for matches on an arbitrary field is O(n) in the worst case, because tables aren't guaranteed to be stored in any particular order (storage is implementation-specific, as I understand it: so long as queries return correct results, the DBMS can lay the data out however it likes).

When you have 20 records, no big deal. However, when you have 20,000,000 records, this quickly becomes an issue. To fix this seeking issue, there is the concept of an index: a separate structure, keyed on the column being indexed, that makes queries on that column fast. The indexed values are kept sorted in a balanced-tree variant (in practice nearly every major DBMS uses a B-tree, which guarantees logarithmic worst-case lookups), with each entry pointing back to its row. So when the database executes a query, it looks at the WHERE clause to see if it has an index for that set of conditions. If it does, it searches the index to get a direct pointer to the row's location. Because the tree stays balanced, the search takes only O(log n) time.
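To see the difference concretely, here's a minimal sketch in Postgres (the table and data here are hypothetical, and on a tiny table the planner may choose a scan regardless):

    -- A hypothetical table with no index on val
    CREATE TABLE events (id serial PRIMARY KEY, val integer);

    -- Without an index, the planner has nothing better than a full scan
    EXPLAIN SELECT * FROM events WHERE val = 42;
    -- => Seq Scan on events ...

    -- Postgres indexes are B-trees unless you ask for something else
    CREATE INDEX events_val_idx ON events (val);

    -- With the index (and enough rows to make it worthwhile), the same
    -- query becomes an index lookup instead
    EXPLAIN SELECT * FROM events WHERE val = 42;
    -- => Index Scan using events_val_idx on events ...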

However, there are caveats here too. Database management systems (DBMSs) are amazingly fast at some things, but they do very little inference on their own. Take a users table as an example:

Users
------
id: integer
name: string


A common query we might run is SELECT * FROM users WHERE id = 5. Without an index, this query has to scan every row in users. With an index on the id column, it runs in logarithmic time. The same goes for name: with an index it's logarithmic, otherwise it's linear.
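In Postgres-style syntax, those single-column indexes look something like this (the index names are arbitrary):

    CREATE INDEX index_users_on_id ON users (id);
    CREATE INDEX index_users_on_name ON users (name);

Now let's look at a more complex query: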

SELECT * FROM users WHERE id < 5 AND name = 'matt';

An index on id alone only gets us partway here: it narrows the search to the rows where id < 5, but each of those rows still has to be fetched and checked against the name condition (an index on name alone has the mirror-image problem). So one thing we can do is create a multi-column index covering the pair <id, name>. That lets the database answer both conditions from the index itself, but it introduces a weird quirk of DB behavior.
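Creating that composite index, again in Postgres-style syntax:

    CREATE INDEX index_users_on_id_and_name ON users (id, name);

Say we have that index on the pair <id, name>. Now take the following query: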

SELECT * FROM users WHERE name = 'matt' AND id < 5;

You might expect the reordered WHERE clause to confuse the database, but it doesn't: the query planner normalizes the conditions, so these two queries are treated identically. What actually determines whether the <id, name> index gets used is the order of the columns in the index itself. A multi-column index is sorted by its first column, then by its second within each value of the first, and so on, which means the database can only seek on a leftmost prefix of the indexed columns. With <id, name>, it can seek to the range id < 5, but within that range the names are in no useful order, so name = 'matt' is reduced to a filter. In essence, the order of the columns in the index *must* line up with how the query constrains them: columns under equality conditions should lead, with range conditions last.
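For this particular query, that means putting name first. A sketch in Postgres-style syntax:

    -- Equality column first, range column last: the planner can seek straight
    -- to name = 'matt', then range-scan id < 5 within that block of the index
    CREATE INDEX index_users_on_name_and_id ON users (name, id);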

Something I found useful when creating database indices in Rails was the following code:

    # In a migration: create one index per ordering of the columns
    [:column_1, :column_2, :column_3].permutation.each do |combo|
      # The auto-generated index names get too long, so build a
      # shorter explicit name from the column ordering
      add_index :table, combo, name: combo.join("_")
    end

Array#permutation generates every ordering of the preceding array, making things like building a three-column index in all its variants a snap.
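For reference, that migration emits SQL roughly like the following six statements (Postgres-flavored; some_table stands in for the real table name):

    CREATE INDEX column_1_column_2_column_3 ON some_table (column_1, column_2, column_3);
    CREATE INDEX column_1_column_3_column_2 ON some_table (column_1, column_3, column_2);
    CREATE INDEX column_2_column_1_column_3 ON some_table (column_2, column_1, column_3);
    CREATE INDEX column_2_column_3_column_1 ON some_table (column_2, column_3, column_1);
    CREATE INDEX column_3_column_1_column_2 ON some_table (column_3, column_1, column_2);
    CREATE INDEX column_3_column_2_column_1 ON some_table (column_3, column_2, column_1);

In practice you rarely need all six: since the planner can use any leftmost prefix of an index, a few well-chosen orderings usually cover the query patterns you actually run.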

Note that this is only true for relational databases like SQL Server, Oracle, Postgres, MySQL, and so on. NoSQL databases use different data structures behind the scenes and are often optimized for text-based and document-based searches. Additionally, individual vendors improve on this behavior in various ways to boost performance, so while the above works as a general guide, it's usually a good idea to verify the behavior of whichever database system you pick.