[Cover image: Wendy Wei, from Pexels]
I do have one little problem with sqitch. I am so used to typing the letter u after the letter q that I regularly mistype the command as squitch, and then end up staring at the terminal for so long I wonder if I should have typed sqint. Very minor naming issues aside, the project has a website at www.sqitch.org. I’ve found it a simple and focused database migration tool which does as much as I want and no more. It’s written in Perl and available directly from the Ubuntu repositories with
sudo apt-get install sqitch
sqitch is designed to work very closely with git. Indeed, just as git init creates a new repository, executing sqitch init in a git repository will initialize a managed database migration project. This blog post documents a couple of departures I made from the official tutorial. Optional arguments to the init call give the name of the project, a remote git repository address and the database engine of choice. A full call to init would therefore contain the following information:
sqitch init <MY PROJECT> --uri <MY REMOTE GIT REPO> --engine pg
This will generate two local files, sqitch.conf and sqitch.plan. These files need to be kept under version control, along with the sql files that will be generated under three folders: deploy, revert and verify. The manual recommends adding information on the location of the local psql, as well as a user name and email address.
sqitch config --user engine.pg.client /usr/bin/psql
sqitch config --user user.name 'Paul Hewson'
sqitch config --user user.email '<MY EMAIL ADDRESS>'
However, this has the effect of adding a “global” config file in ~/.sqitch/sqitch.conf. I find this annoying, partly because I like the idea of keeping dotfiles under ~/.config. More importantly, I prefer keeping all config files for a specific project under the folders for that project, and hence under the version control for that project. I would rather my project git history knew if I had swapped psql locations at some stage. One change I found I had to make manually was to specify the target databases. This means adding a section to the local sqitch.conf file. The name “official_surveys” becomes a proxy for the full database server specification. The prefix db:pg indicates that this is a PostgreSQL database. This makes it simple to specify test and production databases as necessary. Currently, my project is a research project working with nationally commissioned surveys and I only have one working database. I may in due course decide to run a “production” database in a container.
[target "official_surveys"]
    uri = db:pg://USERNAME@HOSTNAME:PORTNUMBER/DATABASE_NAME
Having set up the project, the first step is to add files locally specifying the DDL required for the first migration.
sqitch add with an optional (short) comment sets up the required sql files locally and makes an adjustment to the sqitch.plan file:
sqitch add first_migration -n 'My first migration'
This generates files in three folders, namely
deploy/first_migration.sql revert/first_migration.sql verify/first_migration.sql
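As a rough sketch of what might go into the first two of these files (the schema and table names here are illustrative placeholders, not taken from my project), a simple migration could look like:

```sql
-- deploy/first_migration.sql
-- Create a schema and a minimal table inside a transaction,
-- as in the templates sqitch generates.
BEGIN;

CREATE SCHEMA surveys;

CREATE TABLE surveys.respondents (
    respondent_id integer PRIMARY KEY,
    survey_year   integer NOT NULL
);

COMMIT;

-- revert/first_migration.sql
-- Undo exactly what deploy created, in reverse order.
BEGIN;

DROP TABLE surveys.respondents;
DROP SCHEMA surveys;

COMMIT;
```

The discipline to aim for is that revert undoes exactly what deploy created, so that repeated deploy/revert cycles leave the database unchanged.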
Here’s where the fun starts. The next task is to write into these files all the DDL necessary to set up schemas, tables, views and functions in the deploy script, and all the corresponding DDL to remove these objects in the revert script. The verify script is intended to provide tests that the database structure is correct. In a complex migration scheme on a live database there are lots of ways this could go wrong. In my case, I am somewhat misusing this by writing anonymous code blocks to ensure that data has been loaded correctly into the tables. Here is an illustration from the National Travel Survey. Data is supplied as tab separated values, with a column named HomeIUse_B01ID whose responses are coded as integers, including -8 and -9 for missing data values. This is transformed into an enum in the target table. There are therefore several steps where something could go wrong, and so a number of variables are chosen at random and checks are made that the data has been transformed correctly.
DO $$
DECLARE
    home_internet int;
BEGIN
    home_internet := (
        select count(*)
        from dft.nts_individual_2002_2019
        where internet_frequency_home::text = 'Several times a day'
    );
    ASSERT home_internet = 25388;
END $$;
Running the command
sqitch verify official_surveys
will execute these checks on the target database. As noted, this is an abuse of the design intentions of sqitch. The verification can also be run as part of deployment:
sqitch deploy --verify official_surveys
will both deploy a migration and verify the resulting database structure. However, this slight misuse of sqitch serves a purpose at present, until I find a way to link data import steps more closely to data verification steps.
Developing a new migration consists of repeatedly cycling through
sqitch deploy official_surveys
sqitch revert official_surveys --to @HEAD^ -y
until the migration looks acceptable, fixing any errors which arise. At that stage, I can import the data and verify the migration and data transformation. One point of note is that
--to @HEAD^ reverts the last migration only (i.e., the migration currently being worked on). It is possible to use alternatives such as
@HEAD^3, which reverts to the change three migrations back, removing the last three deployed migrations.
@ROOT may be a great England cricketer, but reverting
--to @ROOT will remove all migrations. It is possible to be more precise; by issuing:
sqitch log official_surveys
we obtain a log of all migrations applied to the target database, with the SHA for each migration, and can use these SHA codes to limit the reversion.
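For example (with a placeholder standing in for a real change SHA taken from the log output):

```shell
# Revert everything deployed after the named change;
# <CHANGE SHA> is a placeholder, not a real identifier.
sqitch revert official_surveys --to <CHANGE SHA>
```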
Finally, two features I haven’t used yet. At a certain point, it is possible to tag a migration using
sqitch tag v1.0.0 -n 'A useful comment'
and there is also a sqitch bundle command which bundles all the migrations in a bundle folder, ready for zipping and distributing. I can’t see myself ever using this, as I can only imagine using sqitch from a git repository.
Anyway, I’m finding this an extremely useful utility when developing databases to handle official surveys, which are large scale and messy. It is only by working through the surveys that you discover features such as a survey boost, and then have to decide whether to break the table into a main table and a boost table (because only a smaller sub-sample answered the boost questions, and including these in the main table would introduce a lot of empty rows). One example of this is the Health Survey for England, where there is a specific battery of questions which require input from a nurse and are applied to a smaller subset of the whole. For this reason, the database migration is closely linked to the data import and verification steps. I would really like to split them out, partly to increase my skill level in using Great Expectations, a suite of “unit” style tests intended for data rather than code.