Beep Beep Beep

I use ZFS, and I love it; I think it's the best filesystem out there. Its primary focus is integrity, which is the most important thing. Also important: backups. Even with the data integrity that ZFS offers (which far surpasses any hardware RAID), you still have to back up.

Again, with ZFS this is much easier than with other solutions (like Bacula, for example). Since we run Sun servers, we also run Solaris; when you run Solaris on Sun hardware, the licence is relatively cheap. As a result, I use the Time Slider service to automatically create snapshots (which, when you share a ZFS filesystem out via CIFS, show up in the Windows GUI as "previous versions").
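
If you want to see what Time Slider has been creating, the automatic snapshots are easy to spot by name:

zfs list -H -o name -t snapshot | grep zfs-auto-snap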

Because of this, I also use the "zfs-send" plugin, which basically backs up snapshots to a separate Solaris server. However, there are some gotchas which may catch you out if you had a working config, then changed things around, and found the zfs-send service failing.

First, zfs-send puts a hold on snapshots, so they don't get deleted before they're sent to the remote server. However, you may be in a situation where you need to clear all the snapshots (for example, you've moved, or changed which ZFS filesystems you want to back up). You'll find you can't delete them; what you have to do is "zfs release" the snapshots first.

Here is a little snippet that will do this (and delete ALL zfs-auto-snap snapshots on the system):

for snap in `zfs list -H -o name -t snapshot | grep @zfs-auto-snap`;
do zfs release org.opensolaris:time-slider-plugin:zfs-send $snap;
   zfs destroy $snap;
done
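
If you just want to check whether a particular snapshot is held (and by what tag) before releasing it, "zfs holds" will tell you (the snapshot name here is just an example):

zfs holds storage/shares@zfs-auto-snap_daily-2016-03-11-00h00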

Secondly, zfs-send stores the name of the previously sent snapshot as a property on the filesystem, so it knows it can use an incremental zfs send. However, if you have broken this sequence, or deleted the snapshots, the service will fail.

You can look for it with:

zfs get -r org.opensolaris:time-slider-plugin:zfs-send storage

Where "storage" can be replaced with your particular zpool name. To clear a property, you use "zfs inherit", like so:

zfs inherit org.opensolaris:time-slider-plugin:zfs-send storage/shares

Changing "storage/shares" to the particular ZFS file system you want to clear the property from. You can clear this property recursively by just adding the "-r" option:

zfs inherit -r org.opensolaris:time-slider-plugin:zfs-send storage/shares

Once you've done this, just enable the service (or clear it if it was forced into maintenance) and you should be golden.
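
For reference, on my system that last step goes through SMF, something like the following (the FMRI may differ on your install; check with "svcs -a | grep time-slider"):

# clearing is only needed if the service was forced into maintenance
svcadm clear svc:/application/time-slider/plugin:zfs-send
svcadm enable svc:/application/time-slider/plugin:zfs-send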

I'm not joking

This talk by Jonathan Blow on software quality is something you should watch if you're interested in writing software. I used to have an Amiga, and to be honest, it was far more responsive than my current beast of a PC.

She Blinded Me With Science

It seems to me that one of the most important aspects of software development is one that doesn't get a great amount of focus: debugging. Sure, it's mentioned here and there, but, for example, first year students aren't even taught about the command line Java debugger.

So, I believe this talk by Stuart Halloway, "Debugging with the Scientific Method", is required viewing. Of course, the approach isn't just for debugging; it applies to any sort of performance work on a website or application. Take Stack Overflow, for example: it's a popular site hosted on their own servers. I have been reading lately about their setup and the monitoring they do, not only for uptime, but for performance.

For example, they use HAProxy to load balance across their web tier servers, which is obviously not unusual; that's what HAProxy is for. But they also have these proxies capture and filter performance data from their application via headers in the HTTP response. It's probably something that everyone does, but to be honest, I've never come across any mention of this trick. (There's also their MiniProfiler tool, which I'm using a variant of.)
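
I haven't seen their actual configuration, but the capture side of that trick might look something like this sketch, assuming the application adds a timing header to each response (the header and backend names here are made up):

# haproxy.cfg (sketch): log an application-supplied timing header
frontend web
    bind *:80
    capture response header X-Page-Timings len 64
    default_backend web_tier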

Given how little debugging is taught at university (well, my university), I can't judge how common and detailed this sort of performance measurement is. I suspect that it might not be very common, so it could be an interesting area for me to focus on.

I don't know SQL

I mentioned in the previous post that I'm not a database guru; luckily, I haven't actually had to write a great deal of complex SQL queries. Which is a shame in a way, because I was working on a particular SQL query this past week, and it was interesting. I learned a few things about PostgreSQL that I think make it the database to select when starting a project.

Of course, I know that SQL Server and Oracle would have these features, but I would be honestly surprised if MySQL did. I'm constantly surprised by how many projects use/support MySQL when it really is inferior to PostgreSQL. Again, I would argue that it's because ORM frameworks abstract away the useful distinguishing features of the underlying database. But I could obviously be wrong… unless you're dealing with GIS data, as I am with my recent project.

It's an application that reads and stores GPS data from a Tait DMR network. The specifics aren't too important, but basically, every thirty seconds we get a GPS location for each vehicle in a fleet, which we store in a database for querying. You could just take the decimal latitudes and longitudes and store them in any database, but when you then try to do something with the data, it can get difficult.

For example, we have a customer who wants to track the time that his employees are onsite, so he can charge his customers the correct amount. The thing is, two GPS readings taken at the same physical location will rarely be identical. If a driver heads off for lunch and then comes back, it would be a fine trick for the unit to report the exact same GPS reading again.

Thankfully, the PostGIS extension for PostgreSQL provides data types and functions to help with this.

First, the table definition for the GPS readings:

CREATE TABLE gps_readings (
  location GEOGRAPHY(POINT,4326) NOT NULL,
  speed integer NOT NULL,
  time_and_date timestamp NOT NULL,
  unit_id integer REFERENCES units(id) ON UPDATE CASCADE,
  PRIMARY KEY (time_and_date, unit_id)
);

Nothing really too surprising here, just the geography type that PostGIS gives you. With GIS data, there are different ways you can do projections, since you're trying to map the coordinate system of a spheroid (the Earth). GPS units return data in decimal latitude and longitude (WGS84, the same system used by Google Maps), which means you're trying to use a Cartesian system to map a sphere. This results in distortion: if you've ever seen a 2D map of the Earth, you can see how massive Greenland looks, even though it's actually much smaller than it appears. That's a result of the distortion.
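
For illustration, inserting a reading looks something like this (the coordinates and unit id are made up; note that ST_MakePoint takes longitude first):

INSERT INTO gps_readings (location, speed, time_and_date, unit_id)
VALUES (ST_SetSRID(ST_MakePoint(115.86, -31.95), 4326)::geography,
        0, now(), 3);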

PostGIS uses the geography type to keep track of what system you're using; in this case I'm using SRID 4326, which I believe is the standard coordinate system for GPS (WGS84).
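
A nice consequence of the geography type is that measurement functions work in metres rather than degrees, so sanity checks are easy (the points here are illustrative):

-- distance in metres between two WGS84 points
SELECT ST_Distance('SRID=4326;POINT(115.86 -31.95)'::geography,
                   'SRID=4326;POINT(115.89 -31.96)'::geography);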

Anyway, basically, I have to get all the locations of the vehicle and group them into clusters, with the points in each cluster being within a certain radius of each other. Actually, I'll just include the code here, since I've already commented what it does…

--- The following is a pretty messy SQL query.
--- How it works is as follows:
--- First, it takes all the distinct gps readings for a unit where the
--- unit hasn't been moving. It then does cluster analysis on these
--- readings to organise them into clusters.
--- Then the gps readings for the unit are taken, with each reading
--- being compared against each cluster, basically associating each
--- reading with the corresponding cluster. The clusters are
--- generalised to a common point.
--- These results are then processed using windows. Within the window,
--- the current grouped_location is compared with the last
--- grouped_location. If they are different, it means that the vehicle
--- has moved out of a cluster, so the time of the current record must
--- be the starting time of movement into a new location.
--- The end_time is then calculated by looking ahead for a change in
--- location (meaning the record is the last reading inside the current
--- location).
--- This gives us start and end times, but with a few leftover readings
--- from when the unit was inside the location. Those are removed, and
--- a final window is used to put the end_time into the same row as the
--- start_time (so we just end up with a single grouped_location with a
--- start and end time). Rows with the same start and end time are
--- removed, before we finally return the grouped_location, start, end,
--- and total times.
--- Because of the way Clojure JDBC works, and because we use the same
--- parameter for two different subqueries, the same argument needs to
--- be substituted in twice. In order: unit_id, unit_id.
SELECT ST_Y(grouped_location) AS latitude,
       ST_X(grouped_location) AS longitude,
       start_time, end_time, end_time - start_time AS total_time
  FROM (SELECT grouped_location, start_time,
               CASE WHEN ST_Equals((lead(grouped_location) OVER tadw), grouped_location)
                         AND lead(end_time) OVER tadw IS NOT NULL
                         THEN lead(end_time) OVER tadw
                    WHEN end_time IS NOT NULL THEN end_time
               END AS end_time
          FROM (SELECT *,
                       CASE WHEN lag(grouped_location) OVER tadw IS NULL
                                 THEN time_and_date
                            WHEN ST_Equals((lag(grouped_location) OVER tadw), grouped_location)
                                 THEN NULL
                            WHEN NOT ST_Equals((lag(grouped_location) OVER tadw), grouped_location)
                                 THEN time_and_date
                       END AS start_time,
                       CASE WHEN NOT ST_Equals((lead(grouped_location) OVER tadw), grouped_location)
                                 THEN time_and_date
                            ELSE NULL
                       END AS end_time
                  FROM (SELECT ST_Centroid(UNNEST(ST_ClusterWithin(location::geometry, 0.01))) AS grouped_location
                          FROM (SELECT DISTINCT location
                                  FROM gps_readings
                                 WHERE unit_id = ?
                                   AND speed = 0) AS clus_loc_filter) AS clusters
            INNER JOIN gps_readings
                    ON (ST_DWithin(clusters.grouped_location, gps_readings.location::geometry, 0.01)
                        AND unit_id = ? AND speed = 0)
                WINDOW tadw AS (ORDER BY time_and_date)) AS tbl_start_times
         WHERE ((start_time IS NOT NULL) OR (end_time IS NOT NULL))
        WINDOW tadw AS (ORDER BY time_and_date)) AS tbl_end_times
 WHERE ((start_time IS NOT NULL AND start_time <> end_time)
        OR end_time IS NULL)
   AND (end_time - start_time) > interval '5 minutes';

The question marks aren't part of the SQL; since I'm using Clojure JDBC, they're where the unit id for the vehicle gets substituted in. Again, I think there is room for improvement here, simply because I haven't used enough SQL to learn the best way to approach this. As it stands, though, I'm pretty happy with it.
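
For the curious, the substitution itself is nothing special. With clojure.java.jdbc it's along these lines (a sketch; onsite-times-sql, db-spec and unit-id are placeholder names):

(require '[clojure.java.jdbc :as jdbc])

;; the positional parameters fill the two ?s in the query, in order
(jdbc/query db-spec [onsite-times-sql unit-id unit-id])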

On the low-end VM (Linux, 2GB RAM) that PostgreSQL is running on, it will get the result in about 1100ms, and that's reading all the GPS position data for that unit in the system. It's been collecting GPS data since the 11th of March 2016, which amounts to about 20,000 rows.

Considering the ST_ClusterWithin function it's running, that's not too bad; it was a bit slower without the DISTINCT that I mentioned in my last post. Also, it should get faster, since I haven't yet added date ranges to restrict the number of rows searched.
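
When I do, it should just be a matter of another predicate plus a supporting index, something like this (hypothetical; the index name is made up):

CREATE INDEX gps_readings_unit_time_idx
    ON gps_readings (unit_id, time_and_date);

-- then, wherever gps_readings is filtered:
--   AND time_and_date BETWEEN ? AND ?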

So, in summary: if you use MySQL, you should be using PostgreSQL instead, and you should be taking advantage of the database's features where possible; you can do some pretty cool stuff with those window functions. I just wonder how many web applications built on frameworks are missing out on easy performance gains because they've got sloppy SQL queries.

I don't like ORM

I'm not a database guru, and I don't have an extensive knowledge of SQL, and that's why I don't like ORM. I don't like ORM because it's an abstraction developers use so that they don't have to go near SQL, for various reasons. But! That doesn't mean you shouldn't use an ORM in some cases; for example, if you're writing a Rails or Django application and you're not doing (or ever going to do) anything outside the typical scope of those frameworks.

The problem is that these frameworks abstract away the advanced features a database offers (because they have to give the same interface over a wide range of databases) and tend to end up treating the database as the lowest common denominator.

So, I write using all the features of the database I've selected, because I'm not planning on moving between databases. It would be a pain to move to, say, Oracle, but really, how often does that actually happen? In the particular example I'm going to get into, I'm using PostgreSQL 9.5, because I need PostGIS. PostGIS is an extension available for PostgreSQL that gives you a bunch of GIS data types and functions, and it's a good thing to have if you're dealing with, say, GPS data.

Anyway, anyone who uses a database should know SQL, much like anyone who writes code should also know C. Even if you don't actually write software in it, knowing C lets you understand what is going on under the covers.

I'm short on time now, but I will present two SQL queries: one takes a few seconds to run, the other is two orders of magnitude faster (for reasons that will be obvious). I believe this demonstrates why you need to know SQL rather than relying on an ORM (not that an ORM produced the slower query, but in my experience, and correct me if I'm wrong, ORMs tend to produce queries like it, as they can't read your mind):

First, the slower query:

SELECT ST_Centroid(UNNEST(ST_ClusterWithin(location::geometry, 0.001))) AS grouped_location
  FROM gps_readings
 WHERE unit_id = 6
   AND speed < 10;

Now the one that is two magnitudes faster:

SELECT ST_Centroid(UNNEST(ST_ClusterWithin(location::geometry, 0.01))) AS grouped_location
  FROM (SELECT DISTINCT location
          FROM gps_readings
         WHERE unit_id = 3
           AND speed = 0) AS clus_loc_filter;
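
If you want to see exactly where the time goes in each case, prefix the query with EXPLAIN ANALYZE, which prints the plan along with the actual row counts and timings, for example:

EXPLAIN ANALYZE
SELECT ST_Centroid(UNNEST(ST_ClusterWithin(location::geometry, 0.01))) AS grouped_location
  FROM (SELECT DISTINCT location
          FROM gps_readings
         WHERE unit_id = 3
           AND speed = 0) AS clus_loc_filter;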