Managing migrations with sqitch

Migrating birds From Pexels Wendy Wei

I do have one little problem with sqitch. I am so used to typing a letter u after a letter q that I regularly mistype the command as squitch, and then end up staring at the terminal for so long I wonder if I should have typed sqint. Very minor naming issues aside, the project has a website at www.sqitch.org. I’ve found it a simple and specific database migration tool which does as much as I want and no more. It’s written in Perl and available directly from Ubuntu repositories with

sudo apt-get install sqitch

sqitch is designed to work very closely with git. Indeed, as with using git init to create a new repository, executing sqitch init in a git repository will initialize a data managed migration project. This blog post documents a couple of departures I made from the official tutorial. Optional arguments to the init call give the name of the project, a remote git repository address and the database engine of choice. A full call to init would therefore contain the following information:

sqitch init <MY PROJECT> --uri <MY REMOTE GIT REPO> --engine pg

This will generate two local files, sqitch.conf and sqitch.plan. These files need to be kept under version control, along with the sql files that will be generated under three folders, deploy, revert and verify. The manual recommends adding information on the location of the local psql as well as a user name and email address.

sqitch config --user engine.pg.client /usr/bin/psql
sqitch config --user.name 'Paul Hewson'
sqitch config --user.email '<MY EMAIL ADDRESS>

However, this has the effect of adding a “global” config file in ~/.sqitch/sqitch.conf. I find this annoying partly because I like the idea of keeping dotfiles under ~/.config. More importantly, I prefer keeping all config files for a specific project under the folders for that project and hence under the version control for that project. I would rather my project git history knew if I had swapped psql locations at some stage. One change I found I had to make manually was to specify the target databases. This means adding a section to the local sqitch.conf file. The name “official_surveys” becomes a proxy for the full database server specification. The preamble db:pg indicates that this is a PostgreSQL database. This makes is simple to specify test and production databases as necessary. Currently, my project is a research project working with nationally commissioned surveys and I only have one working database. I may in due course decide to run a “production” database in a container.

[target "official_surveys"]
        uri = db:pg://USERNAME@HOSTNAME:PORTNUMBER/DATABASE_NAME

Having set up the project, the first step is to add files locally specifying the DDL required for the first migration. sqitch add with an optional (short) comment sets up the required sql files locally and makes an adjustment to the sqitch.plan.

sqitch add first_migration -n 'My first migration'

This generates files in three folders, namely

deploy/first_migration.sql
revert/first_migration.sql
verify/first_migration.sql

Here’s where the fun starts. The next task is to write into these files all the DDL necessary to set up schema, tables, views and functions in the deploy script, all the corresponding DDL to remove these objects in the revert script. The verify script is intended to provide tests that the database structure is correct. In a complex migration scheme on a live database there are lots of ways this could go wrong. In my case, I am somewhat misusing this by writing anonymous scripts to ensure that data has been loaded correctly into the tables. This is an illustration from the National Travel Survey. Data is supplied in a tab separated value, with the column name HomeIUse_B01ID and the responses are coded as integers, including -8, -9 for missing data values. This is manipulated into an enum in the target table. There are therefore several steps where something could go wrong, and so a number of variables are chosen at random and checks are made that the data has been transformed correctly.

DO $$
DECLARE
    home_internet int;
BEGIN
   home_internet := (
       select count(*) 
         from dft.nts_individual_2002_2019 
        where internet_frequency_home::text = 'Several times a day'
        );
   ASSERT home_internet = 25388;
END $$;

Running the command

sqitch verify official_surveys

will execute these functions on the target database. As noted, this is an abuse of the design intentions of sqitch as the command could be executed along with the deploy script as

sqitch deploy --verify official_surveys

would both deploy a migration and verify the subsequent database structure. However, this slight misuse of sqitch serves a purpose at present until I find a way to more closely link data import steps to data verification steps.

Developing a new migration consists of repeatedly cyling through

sqitch deploy official_surveys

and

sqitch revert official_surveys --to @HEAD^ -y

until the migration looks acceptable, fixing any errors which arise. At that stage, I can import the data and verify the migration and data transformation. One point of note is that the --to @HEAD^ reverts the last migration only (i.e., the migration currently being worked on). It is possible to use alternatives such as @HEAD^^ and @HEAD^3 which remove all migrations up to the immediately previous one, and two migrations previous. @ROOT may be a great England cricketer, but reverting --to @ROOT will remove all migrations. It is possible to be more precise; by issuing:

sqitch log official_surveys

we obtain a log of all migrations applied to the target database, with the SHA for each migration and can use the SHA codes to limit the reversion.

Finally, two features I haven’t used yet. At a certain point, it is possible to tag a migration using

sqitch tag v1.0.0 -n 'A useful comment'

and there is also a sqitch bundle command which bundles all the migrations in a bundle folder ready for zipping and distributing. I can’t see my self every using this, as I can only imagine using sqitch from a git repository.

Anyway, I’m finding this an extremely useful utility when developing databases to handle official surveys which are large scale and messy. It is only by working through the surveys you discover features such as a survey boost and then have to make a decision whether to break the table into a main and boost table (because only a smaller sub-sample answered the boost questions and to include these in the main table would include a lot of empty rows). One example of this is the Health Survey for England where there is a specific battery of questions which require input from a nurse and are applied to a smaller subset of the whole. For this reason, the database migration is closely linked to the data import and verification steps. I would really like to split them out, partly to increase my skill level is using Great Expectations a suite of “unit” style tests intended for data rather than code.

Read More