Thursday, September 29, 2016

Linear time, Logarithmic time, and Databases

Database tables are usually just arrays of data, sometimes contiguous in memory, and sometimes not. To scan a table for the right answer on an arbitrary field would be O(n) in the worst case, since database tables are not guaranteed to be ordered (it's implementation specific, by my understanding, meaning that so long as the queries function properly the DBMS can store the data however it likes).

When you have 20 records, no big deal. However, when you have 20,000,000 records, this quickly becomes an issue. To fix this seeking issue, there is the concept of an index. This creates a meta-table with additional information on the table that makes querying easier. This meta table contains the column of data being indexed, and the indexed column is sorted and put into a binary tree variation (the most common is a red-black tree, because it offers guarantees on the worst case run time). So when the database executes its query, it looks at the where clause to see if it has an index for that set of conditions. If it does, it searches in the index to get the direct pointer to the row in memory. By virtue of using a pseudo-balanced binary tree, this means the search only takes O(log n) time.

However, there are caveats here too. Database Management Systems (DBMS) are amazingly fast at some things, but do very little inference on their own. Let's take an example of a users table:

Users
------
id: integer
name: string


A common query we might use is Select * from Users where id=5. Without an index, this query needs to scan every entry in users. With an index on the "ID" column, this runs in logarithmic time. Same thing with Name - if you have an index, it's logarithmic, otherwise it's linear. Now let's look at a more complex query:

Select * from users where id < 5 AND name="matt";

This can't make use of just an index on the column ID, or an index on the column name, because even if it finds the ID, it will still need to scan the table for the name. So one thing we can do is create a multi-column index - one on each pair of <id, name> from the table. This removes the table scan, but introduces a weird quirk of DB behavior. Say we have an index on the column pair <id, name>. Now take the following query:

Select * from Users where name = 'matt' and id < 5

You would think this query could use the <id, name> index, but what actually happens is that because there isn't an exact match the DB does a table scan instead. The reason is because the SQL interpreter doesn't want to handle the parsing complexity necessary to allow for arbitrary interpretation of indices. In essence, the order of the columns in the index *must* match the order of the columns in the where clause in order for that index to be used

Something I found useful when creating database indices in Rails was the following code:

    [:column_1, :column_2, :column_3].permutation.each do |combo|
      # Allowing the automated names results in names that are too long
      add_index :table , combo, name: combo.join("_")
    end

.permutation generates all the permutations of the preceding array, making doing things like 3-column indices with all variants a snap

Note that this is only true for relational databases like SQL Server, Oracle, Postgres, MySQL, and so on. NoSQL databases use different data structures behind the scenes, and are more optimized for text-base and document-based searches. Additionally, some providers may improve on this in various ways to enhance DB performance, so while you can use the above as a general guide, it's usually a good idea to verify this behavior when picking a new database system.

Monday, June 9, 2014

Georgia Tech - Two More Weeks Down!

So last night I finished module 3 in my AI for Robotics course that I am taking as a part of Georgia Tech's Online MS in Computer Science. Things have somewhat settled down into a routine, with my various classmates working through the course at their own speed. Due to a complete lack of time in pretty much my entire life right now I'm simply trying to keep abreast of the weeks as they are due according to the schedule provided at the start of the term. Problem sets are due each Wednesday, and are submitted through Udacity's software. An instructor then goes through and ports these into something called T-Square, which I understand to be Georgia Tech's grading system. I could be wrong, though.

Assumed background knowledge

One of the things I'm noticing is that the course is starting to assume some background knowledge that not a lot of students may have. This is unavoidable in pretty much any course, as I have found out too often during my stints as a college professor. However, there are some ways this could be mitigated. For example, this is my first course at Georgia Tech, and my understanding is that it is targeted towards the end of the program. I took it because I found the subject interesting, but I could see where this might be causing me some trouble.

One example is that the course makes some decent use of techniques from Linear Algebra. I have a healthy grounding in this thanks to my time at DePaul and a few years spent in the pixel mines, but more often than not this type of math is an elective - at best - for computer science undergraduates. The problem is that there wasn't much in the course description that indicated that this type of math would be used. Statistical analysis is also featured heavily in the course, and while there are ample resources available for coming up to speed most students have no idea that they will need to do so before the course starts. I had around a month before classes began where I could have been preparing myself, but as it was I kind of walked into the course blind.

Interesting Problem Sets

One of the things I do find interesting is that the problem sets at the end of each lecture are actually not a significant portion of the grade. In fact, they're worth 3 points each (compared to a total upwards of 70 points for the two projects), and are graded on a pass-fail basis. To quote the professor, they essentially serve as a kind of "attendance check." That's all well and good, but if I am putting forth effort towards these assignments I'd like to see just a little bit more credit given for my efforts. Not much I can do about that, though, and I can definitely see the instructor's point.

One thing I do like about these problem sets is the use of automated grading. At first it's used in a fragile manner, along the lines of "Does the program output include this string," but the most recent assignment used statistical analysis in its grading to determine if the provided code was correct. It's pretty rewarding to see code that you wrote indicated as correct against an arbitrary data set with around 95% of accuracy.

Moving Forward

One thing I'm kind of leery of is wondering when the lack of prerequisite knowledge is going to bite me. I'm worried about the week 4 subject material, as it depends a lot upon dynamic programming and some aspects of machine learning - to the point where I'm reading chapters from a ML textbook in order to increase my knowledge base. It leads to one of the things I've been most concerned about, which is adequately scheduling time to devote to my courses.

So far the coursework takes me between 2-4 hours a week - a surprisingly low amount. However, I know that this is not the case for other courses, such as Machine Learning which can demand upwards of 40 hours of additional time each week, and in many cases I'm waiting for the other shoe to drop. I suppose at this point all I can do is wait and see, and hope it doesn't get too bad.

I'll try to provide these posts more frequently but, as I mentioned above, life happens :-/

Thursday, May 22, 2014

More posts at Elite Gaming Computers!

This week I had a nice contrast in my articles. On the one hand I covered the ASUS Z97-WS, an absolute beast of a motherboard. On the other hand, I covered the Raspberry Pi for emulation and HTPC usage. The contrast pegged my chuckle meter, at least, as I found it interesting to go from "God of POWER!" to "Small but mighty!".

Anyway, check out the articles here:

http://elitegamingcomputers.com/raspberry-pi-htpc-emulator-on-a-budget/
http://elitegamingcomputers.com/the-asus-z97-ws-a-beast-of-a-board/

Georgia Tech OMSCS - Week 1 Recap

I know, it's only Thursday!

So I just finished my first "week" of class for Georgia Tech, and I wanted to get some of my thoughts down on virtual paper. I'd like to chronicle my adventure semi-frequently in this manner, so please bear with me if this isn't your cup of tea.

Some Background

So one of the things I'm willing to admit right off the bat is that I was pretty nervous. Given my prior trials and tribulations trying to even get into the program, I had no idea what to expect. I remember some of the experience of getting my MS the first time around, and some of the courses I took as an undergrad, but I wasn't sure how this experience would compare. What is an education like at a top-10 institution? Would I be able to measure up to the bar and finally put some of my constant inferiority complex to rest?

Well, we'll have to wait a while for the answer to that last one. I'm not very confident by nature, even though I know I have a modicum of programming and communication skills. I'm always looking for ways to "prove" myself, so to speak, and I tend to take on more than I can handle as a result. I am not sure that that personality trait will ever truly go away, and honestly this Georgia Tech experience stems from that whole complex - I had had a couple of bad interview experiences with companies that rhyme with Moogle and Bamazon, and was trying to figure out how to bolster my knowledge and not feel like I was incompetent. I guess this way I'll be able to point at a fancy piece of paper and say "Well, I was good enough for that".

Provided I am actually up to the task, that is.

The experience so far

Anyway, enough navel-gazing. I'm taking CS-8803 - AI for Robotics - and so far the experience has been pretty good. On the one hand the course is delivered almost entirely via lecture. I have mixed feelings about this, and feel another digression is in order. I know that there are different ways people learn, and that some people feel that they learn best visually. I also, though, have taught several classes where the student claims to only learn well visually but, if you really got down to it, what they really wanted was a step-by-step guide on how to do an assignment rather than a lecture that builds the foundations for a concept that they then apply themselves. In other words, they were (in my eyes) using the "I'm a visual learner" mantra as an excuse rather than a legitimate gripe. If you couple this with my NOT being a visual learner (I want nothing more than a big textbook to work through), I tend to be skeptical of the approach to begin with.

I have to say, though, I was pleasantly surprised. Aside from minor frustrations the lectures do a decent job of presenting the material. The way a module works is that a video of the lecture plays up to the point of a "quiz," which in a classroom setting would be the instructor asking a question of the class, and stops to allow you to input an answer into a form that is then checked for correctness. Once the answer is entered (or skipped, if you get stuck) the lecture continues until the next quiz. The lectures themselves vary widely, using video recordings in some cases, recorded whiteboard sessions in others, and still others with the professor simply talking to the students. Of most use to me were the intermittent programming quizzes - the instructor would explain a concept, then say "Ok, now you try to write a function that does this," and suddenly you're presented with a minimal Python IDE to develop code in, along with a test run and evaluation button to determine if your code is correct. It's obvious that a lot of thought went into these lectures, and it comes through in the design of the course.

Frustrations

My frustrations with the course so far are two-fold. First is that the course is programmed in Python - a language I had a passing familiarity with, but which is subject to its own quirks. If the course was in C++ or Ruby I'd have no trouble, but as it is I am learning the language as I go along. On the one hand that's good - I'll have another language under my belt when I finish the class - but it is also frustrating and increases the time I spend on assignments. I'm not sure there's any way around this, though.

The second frustration is with the video-based learning. It is extremely challenging to reference material in a video, particularly when programming. Programming is a lot about copy-and-paste, and you can't paste from a video. I have had this argument with any number of people, and I remain unconvinced - for these type of topics you can provide a video if you like, but the most useful thing to a professional programmer is a text version that allows you to 1: keep it open in a browser next to your IDE for reference, 2: allows you to review without having to hunt through a video, and 3: allows you to learn without needing to break your concentration for audio and visual stimuli. To me this is reminiscent of a problem I had at Enova, which is that for some reason the IT department thought it was a good idea to do all of their internal support tutorials as videos. Videos are an extremely poor choice when adding a VPN key, or when you want to save the information for future reference. If OMSCS is going to be mostly video-based, I suspect I'll be lamenting this for a while to come. But then again, I was able to complete the first week's assignment and understand the material (with one exception, as Bayes rule is still kicking my ass for some reason), so I guess the point is moot.

Communication

The course also has a separate message board on piazza.com, which is used to speak with your fellow classmates and ask questions of the instructor and TA. This is a great feature, I think, and has helped me a couple times. Most of the people in the course are active on the boards and very willing to help, and it is also helpful to see that some people have the same questions I do. Where I've had trouble in the course I've quickly been able to get a response on the message board, as both my fellow students and the instructors are active there.

Self-direction

One of the things I really like at the moment, but could see myself coming to dislike, is that the course is largely self-directed. If I had the time to spare, I could push through the entire course as quickly as possible. While the lectures are organized into "Lessons" that represent a week's worth of work, there is nothing preventing you from moving ahead as you work. On the plus side, this means that I was able to finish the first week's assignment well over a week before it was due, which is great as I maintain a far too busy schedule to begin with. However, I can see this causing problems near the end of the term as if we are looking at a time-based submission and grading process, I won't be receiving feedback on my submissions until they are officially "due". So for example, if I was to power through module 2 right now I wouldn't see any feedback until June 4 at the earliest, by which time the assignment will likely be out-of-mind.

I suppose time will tell on how successful this approach is for me. For now I like that I can work ahead when I have some spare time!

Evaluating my own teaching experience

Another thing the OMSCS experience is allowing me to do is evaluate my own teaching, and see what is effective. I teach college courses online for a couple universities, and seeing how this course is conducted provides some interesting contrasts. From the start, it's obvious that more effort went into the Udacity course design than goes into the courses I typically teach - which has been a constant frustration point for me as, with one exception, I often don't get to design my courses so that they have the material I feel is necessary. However, a lot of the assignments in the class I'm taking now are "did it or not" evaluations - you get points for completing the activity, but no partial points are awarded. I don't know what grading will be like yet (doesn't happen until next week), but I think "all or nothing" assignments are a disservice to the student and fail to recognize that it IS possible to half-understand something.

Conclusion

To sum up, I'm very much in a wait-and-see pattern right now. Some of my frustrations with the format may be exclusively due to this course's design - I've already heard that some of the other courses handle things a lot differently - while others may be due to the format. I guess my opinion will continue to evolve as I run through the experience. I know that's kind of a weak close to this lengthy post, but to be honest that's about how I feel right now - not very strongly on the "good" or "bad" spectrum, just "different."

I do think it will be interesting to come back and read this after I have a few more classes under my belt. So stay tuned, I guess.

Friday, May 9, 2014

Wednesday, May 7, 2014

Georgia Tech and Me

A few months ago I posted this after receiving a rejection from Georgia Tech's Online Masters in Computer Science program. Since then, after an email exchange with the dean of the school of computing, my rejection was reversed and I was admitted into the summer session, which begins very soon. I've been holding off posting about this, but wanted to give a brief touchbase. My previous post doesn't paint the program in a very good light. Some of that was due to my being hurt at an unjustified rejection, but some of my criticisms are still valid. As I get ready to register for my first course and begin taking classes online, here are a few of my observations on areas for improvement:
  • Process deadlines and dates - Once admitted to the program, there are still a lot of questions. For example - I was admitted to the summer term, but I have no idea exactly what the dates for this term are. I have a vague idea of May to July, but none of the admissions material indicates the timeframe for the work I will be performing. Furthermore, the registration dates are not too highly publicized - there is a process where you can check with the registrar via the student portal, but it wasn't even really working until mid last week due to the status of OMSCS applicants. A quick email with a few reminders (alongside the DAILY emails we get with somewhat relevant general Georgia Tech news) would greatly benefit the prospective and admitted students.
  • Value for money - I don't speak of the education, as I have yet to experience that (and regardless of my opinions, Georgia Tech's Computer Science program is ranked in the top ten in the nation), but rather during the application process. Form letters are inexcusable when someone has paid an application fee to be considered. This is endemic across academia - pay $50, get a form rejection. If I am paying $50, even if I don't get accepted I feel as though my purchase - which is between 5-10 hours of minimum wage earnings - warrants at least SOME individualized feedback. A personalized note. Tips for improving on the application. ANYTHING other than a generic, lazy form rejection. I understand the fears of providing a base for a discrimination suit, but in an objective measuring system designed to weed out applicants, there needs to be at least SOME room for objective feedback! I find this just as frustrating in the business world when applying for jobs (I am looking at YOU, Google and Amazon - give some freaking feedback and alleviate the single greatest complaint about your interview processes), but at least in those cases the only thing I lose is my time.
  • Samples of coursework - I have absolutely zero idea how to prepare for the forthcoming coursework. Any examples of the courses themselves may have been publicized, but the accepted applicants should be reminded of these examples rather than having to resort to google searches of mass media. You're working on developing a system that will serve thousands of students each term - these convenience elements will save thousands of man-hours in help desk questions, support tickets, and overall will reduce general frustration.
Have any of you been accepted to the program? What are your thoughts on the above?

Monday, May 5, 2014

New Freelance Gig

Hey All,

I know it's been pretty quiet here lately, but I've been keeping up with my writing. I lately picked up a new freelance writing gig. I'll be writing content for the blog at Elite Gaming Computers - http://www.elitegamingcomputers.com/blog

You can see my first article here: http://elitegamingcomputers.com/kingstons-hyperx-cloud-gaming-headset/