Using DVC

[Image: clocks. Photo by Kevin Ku, from Pexels]

I’ve been meaning to do this for a long time. I got my fingers burnt trying to update my database with the latest release of the stats19 data, and decided that was the hint I needed. One thing to note up front: I installed DVC from the Debian package, using the instructions here

So the first thing to do is enter the data storage folder and run:

git init
dvc init

I’m holding my “store” in a USB folder, so I ran

dvc remote add -d usb_remote /mnt/usb1/dvcstore

to get that set up.

Then, it’s just a case of doing

dvc add </path/to/file.csv>
git add </path/to/file.csv>.dvc .gitignore

(I don’t think you need to add .gitignore again if you are only updating a file that is already tracked.) After that, git needs a git commit and a git push (assuming we’ve set up a remote git repo as well).

I also used this opportunity to add some metadata to the .dvc file, to help me track a few details:

  meta:
    source_url: 
      https://www.gov.uk/government/statistical-data-sets/road-safety-open-data
    download_date: '2025-10-14'
    publisher: DfT
    license: OGL v3
    timeframe: 1979 to 2022
    format: csv
    row_count: 11845978
    column_count: 21

The key commands for getting the row and column counts are:

wc -l file.csv
head -n 1 file.csv | awk -F, '{print NF}'
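The two commands above can be wrapped into a small helper that prints the counts in the same shape as the meta block. This is only a sketch: csv_meta is a made-up name, not a DVC command, and it assumes a plain comma-separated file with no quoted commas in the first line.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): print a meta fragment for a CSV file.
# Usage: csv_meta <file>
csv_meta() {
    file="$1"
    # wc -l counts every line, so this includes the header row if there is one.
    rows=$(wc -l < "$file" | tr -d ' ')
    # Splitting on commas assumes the first line contains no quoted commas.
    cols=$(head -n 1 "$file" | awk -F, '{print NF}')
    printf '  meta:\n    format: csv\n    row_count: %s\n    column_count: %s\n' "$rows" "$cols"
}
```

Running csv_meta file.csv prints a fragment that can be pasted into the .dvc file; the remaining fields (source_url, download_date and so on) still have to be filled in by hand.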

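Then we just need to commit, tag the new commit, and push both git and DVC (tags are not pushed by a plain git push, hence the extra line):

git commit
git tag <data/years>
git push
git push --tags
dvc push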
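Since the order matters here (a tag points at an existing commit, so the commit has to come first), the release steps could be bundled into one function that stops at the first failure. This is a sketch under assumptions: release_data and DRY_RUN are names I made up, and "origin" is assumed to be the name of the git remote.

```shell
#!/bin/sh
# Hypothetical helper (not part of git or DVC): publish one data release.
# Usage: release_data <tag> <commit message>
# Set DRY_RUN=1 to print the commands instead of running them.
release_data() {
    tag="$1"
    msg="$2"
    run=""
    if [ -n "${DRY_RUN:-}" ]; then run="echo"; fi
    # Commit first so the tag lands on the new commit, then push the branch,
    # the tag (assumed remote name: origin) and finally the data itself.
    # The && chain stops at the first command that fails.
    $run git commit -m "$msg" &&
    $run git tag "$tag" &&
    $run git push &&
    $run git push origin "$tag" &&
    $run dvc push
}
```

With DRY_RUN=1 set, release_data <data/years> "some message" just lists the five commands, which is a cheap way to check the tag name before anything is pushed.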

So the cunning plan here is that I can run

git checkout <data/years>
dvc pull

And I should have the relevant data in my working area.
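One way to confirm that the pulled file really is the version described in the metadata is to compare its line count with the recorded row_count. The helper below is a hypothetical sketch, and it assumes row_count was recorded with wc -l, so any header row is included in the count.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's line count with a recorded row_count.
# Usage: check_row_count <file> <expected>
check_row_count() {
    file="$1"
    expected="$2"
    # Same measure as wc -l: every line, header included.
    actual=$(wc -l < "$file" | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "ok: $file has $actual lines"
    else
        echo "mismatch: $file has $actual lines, expected $expected" >&2
        return 1
    fi
}
```

For example, check_row_count </path/to/file.csv> 11845978 would return non-zero if the working copy does not match the count stored in the .dvc file.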

Finally, now that I’m using DVC, I can make the download itself part of the DVC process.

Because I’d been manually curating the files in the past, I had to force the system into an update:

dvc import-url --force https://stats19.gov.uk/location

At this point I can update my metadata, then git add, git commit and even git tag as before.

Then next year, I only need to run

dvc update collisions_latest.csv.dvc