She Blinded Me With Science

It seems to me, that one of the most important aspects of software development is one that doesn't get a great amount of focus. Debugging. Sure, it's mentioned here and there, but, for example, first year students aren't even taught about the command line Java debugger.

So, I believe this video of Stuart Halloway, "Debugging with the Scientific Method" is required viewing. Of course, it's not just debugging, but any sort of performance or work on a website or application. Take stackoverflow for example, it's a popular site and hosted on their own servers. I have been reading lately on their setup and the monitoring they do, not only for uptime, but for performance.

For example, they use HAProxy to load balance to their web tier servers, obviously not unusual, that's what HAProxy is for. But, they also have these proxies capture and filter performance data from their application via headers in the HTTP response. It's probably something that everyone does, but to be honest, I've never come across any mention of this trick. (There's also their miniprofiler tool, which I'm using a variant of).

Given how little debugging is taught in university (well, my university) I can't judge on how common and detailed this sort of performance measurement is. I suspect that it might not be very common, so could be an interesting area for me to focus on.

I don't know SQL

I mentioned in the previous post that I'm not a database guru, luckily I haven't actually had to do a great deal of complex SQL queries. Which, is a shame in a way, because I was working on a particular SQL query this week just past, and it was interested. I learned a few things about PostgreSQL that I think make it the database to select when starting a project.

Of course I know that SQL Server and Oracle would have these features, but I would be honestly surprised if MySQL did. I'm constantly surprised by how many projects use/support MySQL when it really is the inferior database to PostgreSQL. Again, I would argue that it's because ORM frameworks abstract away useful distinguishing features of the underlying database. But, I could obviously be wrong… unless you're dealing with GIS data, such as I am with my recent project.

It's an application that reads and stores GPS data from a Tait DMR network. The specifics aren't too important, but, basically, every thirty seconds we get a GPS location for a fleet of vehicles that we store in a database for querying. You could just get the decimal latitude and longitudes and store them in any database, but then when you try to do something with the data, it can get difficult.

For example, we have a customer, who wants to track the time that his employees are onsite. This is so he can charge his customer the correct amount. The thing is, when you get a GPS reading, it may be different. If they drive off for lunch, then head back, it would be a fine trick for them to get the exact GPS reading again.

Thankfully, the PostGIS extension for PostgreSQL gives datatypes and functions to help with this.

First, the table definition for the GPS readings:

CREATE TABLE gps_readings (
  location GEOGRAPHY(POINT,4326) NOT NULL,
  speed integer NOT NULL,
  time_and_date timestamp NOT NULL,
  unit_id integer REFERENCES units(id) ON UPDATE CASCADE,
  PRIMARY KEY (time_and_date, unit_id)

Nothing really too surprising here, just the geography type that PostGIS gives you. With GIS geography data, there's different ways you can do data projections, since you're trying to map a coordinate system of a spheroid (the Earth). GPSs return data in decimal latitude and longitude (WGS84, which is the same system used by Google maps), this means you're trying to use a Cartesian system to map onto a sphere. This results in distortion, if you've ever seen a 2D map of the Earth, you can see how massive Greenland is, even though it's a tiny country. That's a result of the distortion.

PostGIS uses the geography type to keep track of what system you're using, in this case I'm using SRID 4326. Which I believe is the expected GPS coordinate system.

Anyway, basically, I have to get all the locations of the vehicle and group them into clusters, with the clusters being within a certain radius of each other. Actually, I'll just include the code here, since I already commented what it does…

--- The following is a pretty messy SQL query.
--- How it works as follows:
--- First, it takes all the distinct gps readings for a unit where the
--- unit hasn't been moving. It then does cluster analysis of these
--- groups to organise them into clusters.
--- Then, the gps readings for the units are taken, each reading being
--- compared to each cluster, basically associating each reading with
--- the corresponding cluster. With the clusters generalised to a
--- common point.
--- Then these results are parsed through using windows, with the
--- window, the current grouped_location is compared with the last
--- grouped_location. If they are different, it means that the vehicle
--- has moved out of a cluster, so the time of the current record must
--- be the starting time of movement into a new location.
--- Then end_time is then calculated by looking ahead for a change in
--- location (meaning the record is the last reading inside the current
--- location).
--- This gives us start and end times, but with a few readings when the
--- unit was inside the location. They are removed, and a final window
--- is used to put the end_time into the same row as the start_time (so
--- we just end up with a single grouped_location with a start and end
--- time). Rows with the same start and end time are removed, before we
--- finally return the grouped_location, start, end, and total times.
--- Because of the way the Clojure JDBC works, and because we use the
--- same parameters for two different subqueries, the same arguments
--- need to be substituted in twice.
--- They are, unit_id, start_date, end_date, unit_id, start_date, end_date.
SELECT ST_Y(grouped_location) AS latitude,
          ST_X(grouped_location) AS longitude,
          start_time, end_time, end_time - start_time AS total_time
    (SELECT grouped_location, start_time,
       CASE WHEN ST_Equals((lead(grouped_location) OVER tadw),grouped_location)
                 AND lead(end_time) OVER tadw IS NOT NULL
                 THEN lead(end_time) OVER tadw
            WHEN end_time IS NOT NULL THEN end_time
          END AS end_time
         (SELECT *,
              CASE WHEN lag(grouped_location) OVER tadw IS NULL
                        THEN time_and_date
                   WHEN ST_Equals((lag(grouped_location) OVER tadw),grouped_location)
                        THEN NULL
                   WHEN NOT ST_Equals((lag(grouped_location) OVER tadw),grouped_location)
                        THEN time_and_date
              END AS start_time,
              CASE WHEN NOT ST_Equals((lead(grouped_location) OVER tadw),grouped_location)
                        THEN time_and_date
                   ELSE NULL
              END AS end_time
              (SELECT ST_Centroid(UNNEST(ST_Clusterwithin(location::geometry, 0.01))) AS grouped_location
                   (SELECT DISTINCT location
                      FROM gps_readings
                     WHERE unit_id = ?
                       AND speed = 0) AS clus_loc_filter) AS clusters
INNER JOIN gps_readings
        ON (ST_DWithin(clusters.grouped_location, gps_readings.location::geometry, 0.01)
           AND unit_id = ? AND speed = 0)
          WINDOW tadw AS (ORDER BY time_and_date)) AS tbl_start_times
           WHERE ((start_time IS NOT NULL) OR (end_time IS NOT NULL))
    WINDOW tadw AS (ORDER BY time_and_date)) AS tbl_end_times
 WHERE (start_time IS NOT NULL AND (start_time <> end_time)
    OR end_time is null)
   AND (end_time - start_time) > interval '5 minutes';

The question marks aren't part of the SQL, since I'm using Clojure JDBC, they're where the unit id for the vehicle gets substituted in. Again, I think that there is room for improvement here. Simply because I haven't used enough SQL to learn the best way to approach this. As it stands, I'm pretty happy with it.

On the low-end VM (Linux, 2GB RAM) PostgreSQL is running on, it will get the result in about 1100ms, that's reading all the GPS position data for that unit in the system. It's been collecting the GPS data since the 11th of March 2016, which is about 20,000 rows.

Considering the ClusterWithin function it's running, that's not too bad, it was a bit slower than that without that DISTINCT that I mentioned in my last post. Also, it should get faster, since I haven't added date ranges to restrict the number of rows searched.

So, in summary, if you use MySQL, you should be using PostgreSQL and you should be taking advantage of the database features where possible, you can do some pretty cool stuff with those Window functions. I just wonder how many web applications using frameworks are missing out on easy performance gains because they've got sloppy SQL queries.

I don't like ORM

I'm not a database guru, and I don't have an extensive knowledge of SQL, and that's why I don't like ORM. I don't like ORM because it's an abstraction developers use because they don't want to go near SQL for various reasons. But! That doesn't mean you shouldn't be using ORM in some cases, for example, if you're writing a Rails application, or Django, and you're not doing (or going to eventually do) anything outside the typical scope of these frameworks.

The problem is, that these frameworks abstract away advanced features that a database offers (because they have give the same interface over a wide range of databases) and tend to end up treating the database on the lowest common denominator.

So, I write using all the features of the database I've selected, because I'm not planning on moving between databases. It'll be pain to move to say Oracle, but really, how often does that actually happen? In the particular example I'm going to get into, I'm using PostgreSQL 9.5, because I need PostGIS. PostGIS is an extensive available to PostgreSQL that gives a bunch of GIS datatypes and functions, and is a good thing if you're dealing with, say, GPS data.

Anyway, anyone who uses a database should know SQL, much like anyone who writes code should also know C. Even if you don't actually write software in it, you should know C, because it lets you understand what is going on under the covers.

I'm short on time now, but, I will present two SQL queries, one takes a few seconds to run, the other is two magnitudes faster (for obvious reasons) and I believe demonstrates why you need to know SQL rather than relying on an ORM (not that an ORM made the slower query, but in my experience, correct me if I'm wrong, ORMs tend to do this as they can't read your mind):

First, the slower query:

SELECT ST_Centroid(UNNEST(ST_Clusterwithin(location::geometry, 0.001))) AS grouped_location
  FROM gps_readings
 WHERE unit_id = 6
   AND speed < 10

Now the one that is two magnitudes faster:

SELECT ST_Centroid(UNNEST(ST_Clusterwithin(location::geometry, 0.01))) AS grouped_location
    (SELECT DISTINCT location
       FROM gps_readings
      WHERE unit_id = 3
        AND speed = 0) clus_loc_filter;

The Man Who Sold The World

I have a problem, I'm developing an application that uses a database (that's not the problem, databases are great, people should stop using them as stupid key/value stores), the problem I have is that I want to get continuous integration set up pretty early into the project. The earlier I get it organised, the better habits will flow from that (or that's the plan at least), and also the quicker it will be in the long run since it will help with the path to continuous deployment.

The problem I have, is that my CI server (TeamCity) is running under a virtual machine. Ok, that's not the problem exactly, but the problem is I want to have the state of my database in a known good state. Also, we need to make sure that it's documented how to build the system. For local development, I'm using Vagrant, but, since the CI server is running in a virtual machine itself, then Vagrant isn't really suitable. (Yes, you could run a VM in a VM, but lets just not.)

So, I either have to get a physical server running Linux (either migrate TeamCity to it, or have an agent run on it), or, figure out how to get a virtual machine that I can bring up and down via scripting. Luckily for me, it looks like someone already has:

The basic idea is to create a template on your ESX server, then use a plugin for bringing a machine up and down based on that template. This seems like something that could work. The reason I need to bring up a whole database is because I'm not using some sort of ORM to abstract away the database. There are a few reasons for this:

  1. I would lose the use of triggers, which are important for maintaining data integrity, also they save the round-trips you can get with ORM.
  2. I can use the database authentication system, there's no point rolling my own when the database can do it better and faster.
  3. I don't need to switch databases, and designing the system for that means you end up coding the for lowest common denominator.

I have read some other blogs commenting both sides of the ORM debate, and I should write a post on it in the future about it, motivation willing. Anyway, I haven't got the build system up and running just yet, but I'm hoping that I've made some progress by Friday so should be able to do a quick post on how that's working out.

Trigger on Habits

Damn it all, I missed another blog post again, I'll have to do two today. I missed out because I don't really have a good consistent trigger to get the habit of posting started. This is a problem if my routine changes, this happened yesterday and Monday. It's something I'll have to work on, maybe making it something I do first thing in the morning when I get into work, or into Uni.