I like to track my Tableau visualization projects in git, both to protect against screwups and out of curiosity about what previous versions looked like. Theoretically, it might help with collaborators working on the same workbook, but I’ve never actually done that or wanted to do that. I use live updates for my data sets, so the .twb files don’t contain sensitive data, and the xml is readable enough to get a sense of what changed. But Tableau stores thumbnail images of worksheets and dashboards as base64-encoded graphics inline in the xml. This makes the files really big, and I got tired of removing them manually, so I came up with this script to clean out the unneeded elements.

Why use git with tableau

I’m trying to figure out better ways to increase reuse of data visualizations at my employer, so I thought that if I stored my Tableau workbooks in git it would be easier to share techniques and data sources used. I’m not sure how useful that is, but it is convenient to version control my workbooks on the off chance that I screw up harder than command+Z will fix, or if I revisit an old project and want to see the versions that came before the final one. And in general, it covers all the basic development use cases where version control comes in handy. Perhaps one day I’ll live the dream of getting a Pull Request with new functionality added by some helpful contributor. But for now, I’ll be happy with being able to email links to projects instead of sending file attachments.

Tableau added versioning back in 2016’s 9.3 but that doesn’t help me if I accidentally delete the file or want to diff between versions.

I don’t check in twbx files, as they are huge and the data are typically not something I can share easily. So I’m happy with just the xml-version twb file checked in, and other users have to get the data themselves if they want to reproduce my work. All of my tableau work is internal, so space isn’t an issue for our internal GitLab servers, but it just seems wrong to waste that much space. I already feel bad about giant xml files, not to mention all my jupyter notebooks.

Finding a problem

A few months ago, I was about to check in a day’s work and I noticed my workbook was really big. Instead of 100KB, it was 9MB. I thought I had accidentally extracted the data instead of pulling directly from my data sources, but it was actually lots and lots of base64-encoded images stored inline in the xml, crammed into a <thumbnails> element with an individual <thumbnail> element for each worksheet. I had a lot of worksheets, maybe 30, so there was lots of xml. I tried deleting them from the workbook file and everything still opened up again; Tableau just regenerated the thumbnails. So they didn’t seem to be required unless I eventually published the workbook to [Tableau Public](https://public.tableau.com/en-us/s/) or to a server.

Here’s an excerpt from superstore.twb, but mine looked pretty similar…

<thumbnails>
    <thumbnail height='192' name='Commission Model' width='192'>
      iVBORw0KGgoAAAANSUhEUgAAAMAAAADACAYAAABS3GwHAAAACXBIWXMAAA7DAAAOwwHHb6hk
      AAAgAElEQVR4nO29yZMbeZbn9/EV7nCsDiCA2BkLd+bC3FOd1dMzUxqp1XPpvsxIpoM0d+kg
      M5l0bB1kMtOY/gSZTtMHmWlMKs1oeqY6LbMrK6syycqFyeQWEYwIRCCAwOLYHDt80SEyWcnk
      FgxGMEiGf040BPjwfoA/99/3997v/QTf930CAk4o4nE7EBBwnAQBEHCi2VcA+L6PZVlH7cux
      4Y77tDuD+15zBl2aP3st4NXjXgB4gya/+n//HVevfMn69u797/J9vrxy5YnG8ivX+fSzz7l6
      9SuG7sEc8lyH69/feOTfx50qN9aLT7QzqK/z3/2P/wtjH3xvzP/2P/333CraD33vsFXi5t37
      x9yvbvLdevnpnA946ZB//Ic77NIZy7z97vuIAlSKefI7VXpjjz/94B0AWuU8364W8RH403cv
      8PFv/kA8nuTye5dR8Pl+dZu/+Iv/DABv3OfvP/sSRRJYuPgOt/7w9yTTWUrlKpPZNGNBI+q2
      sAjTrpSYmpqk2ujwyz/9gF6vT+HuLfJVm6SZYSLsciNfI5lMcX4uyqA3ZDe/xt2ShevAh+9d
      4N/+20+YmU6jRHO8cX4RgA/eWOT6Sol5rcbMmbcAuH39a9p9BzEU4c2zs/z2i68J+V188wK3
      r39NqzciFEmxFHveP0XAcXDvCaDEp/mLf/Aav/30Y76+sYbggw9s3F279+a//+z3hHWV6uq3
...10000 lines...
    </thumbnail>
</thumbnails>

In the superstore workbook included in Tableau Desktop 2019.4, the thumbnails element runs from line 6452 to line 8996 and increases the size of the workbook on disk from 463KB to 607KB. And since any change to the workbook completely regenerates the thumbnail images, leaving them in bloats the repo quickly.
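To check how much of a given workbook is thumbnail data, a small awk filter can add up the bytes between the opening and closing tags. This is only a rough sketch: the function name thumbnail_bytes is made up for illustration, and it assumes the <thumbnails> tags sit on their own lines and that the content is plain ASCII (true for base64), so that characters and bytes are the same thing.

```shell
#!/bin/sh
# Hypothetical helper: print the approximate number of bytes taken up by
# the <thumbnails> element in one workbook.
thumbnail_bytes() {
    # For every line from <thumbnails> through </thumbnails>, add the line
    # length plus one for the newline. "n + 0" prints 0 when nothing matched.
    awk '/<thumbnails>/,/<\/thumbnails>/ { n += length($0) + 1 }
         END { print n + 0 }' "$1"
}
```

Running thumbnail_bytes on a workbook and comparing the result with the output of wc -c for the same file gives a quick sense of whether stripping is worth it.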

Looking for a solution

I don’t use server and rarely use public, so I was just happy with smaller git projects. I deleted the thumbnails, checked in my project and forgot about it until the next time I was working on a tableau project. Then I remembered it all again, and became sad at how boring it would be to do this forever.

I googled around a bit and read about others (like cmtoomey and sechilds) who were using tableau with GitHub and automatically stripping out the thumbnails element. These approaches used a python script run as a pre-commit hook. I set this up for my project and was happy. For a while.

The downside of this is that I had to have that python script included in every repo and I had to have an environment set up for python to run it. It wasn’t that big of a deal as most of my projects already have some python along with tableau, but I didn’t want to introduce python everywhere just to clean up these files.

An easier, quick hack

So today, I finally got around to writing a small shell script to strip out these elements. I do all my development either on a Mac or on Windows using the git bash shell that comes with git for windows. So a shell script would run even if I didn’t have python set up on that machine.

It’s a quick hack that uses awk instead of actually parsing the xml. I’m afraid it might break with future versions of Tableau, so I’m not comfortable setting it up as a pre-commit git hook. One day I’ll rewrite it, but until someone bugs me about it, this should be fine to run before I rebase and push to my org’s repos.

/bin/find . -name '*.twb' | while IFS= read -r fname; do awk '/<thumbnails>/,/<\/thumbnails>/{next}1' "$fname" > temp.tmp && mv temp.tmp "$fname"; done

It’s just a simple one-liner that cuts out the entire thumbnails element. It runs on every single workbook, but if a workbook has no thumbnails then the output is identical to the input, so git won’t register a change. Comically, I do have to adjust it between mac and windows because git bash finds the windows “find” command instead of the unix find.
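One way to avoid adjusting the command by hand might be to wrap the one-liner in a pair of functions. The sketch below is not what I actually run; the function names are invented, and it assumes that /usr/bin/find is the unix find on both macOS and git bash. It also uses mktemp instead of a fixed temp.tmp and only rewrites a workbook when the stripped output differs from the original.

```shell
#!/bin/sh
# Sketch only: function names and the find lookup are assumptions.

# Remove the <thumbnails> element from one workbook, rewriting the file
# only if stripping actually changed something.
strip_thumbnails() {
    workbook=$1
    tmp=$(mktemp) || return 1
    # Skip every line from <thumbnails> through </thumbnails>, print the rest.
    if awk '/<thumbnails>/,/<\/thumbnails>/{next}1' "$workbook" > "$tmp" &&
        ! cmp -s "$tmp" "$workbook"; then
        cat "$tmp" > "$workbook"
    fi
    rm -f "$tmp"
}

# Clean every .twb file under a directory (defaults to the current one).
strip_all_thumbnails() {
    find_cmd=find
    # Assumption: /usr/bin/find exists and is the unix find on macOS and
    # in git bash, so the windows find.exe is never picked up by accident.
    [ -x /usr/bin/find ] && find_cmd=/usr/bin/find
    "$find_cmd" "${1:-.}" -type f -name '*.twb' |
    while IFS= read -r fname; do
        strip_thumbnails "$fname"
    done
}
```

Calling strip_all_thumbnails with no arguments from the top of a project cleans everything below it. Like the one-liner, it still matches text rather than parsing xml, and it would trip over file names that contain newlines.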

Reusability

I’m not sure what the right solution is for having reusable visualizations that run reliably for multiple users. It’s interesting seeing software engineering concepts come into new worlds like data analysis and visualization design. I wish tool designers would think about this more when designing their tools. One of my pet peeves with Power BI is that everything is a giant binary workbook, so version tracking and sharing are pretty hard. I’m not sure what I’d do if I had to have my methods audited and I was using Power BI; it would be pretty hard to show what changed or didn’t change.

If I really want something to be reusable, or want to move it to production, I try to recreate it in html+javascript so it can be widely used and clearly show the line of sight between data, cleaning and processing, and visualization. But I still use tableau quite a bit for quick explorations. This script at least helps me organize projects a little better.

Future work

  • Do proper xml parsing
  • Write a git pre-commit hook to only run this on changed workbooks
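The second item could be sketched roughly as follows. This is untested against real projects and rests on a few assumptions: the function names are hypothetical, the awk filter is repeated here so the snippet stands on its own, and workbook paths are assumed to be free of the special characters that git would quote in its output. Note that re-adding a file stages the whole file, including any edits that had been left unstaged on purpose.

```shell
#!/bin/sh
# Sketch of a pre-commit hook body; names are hypothetical.

# Remove the <thumbnails> element from one workbook in place.
strip_thumbnails() {
    workbook=$1
    tmp=$(mktemp) || return 1
    if awk '/<thumbnails>/,/<\/thumbnails>/{next}1' "$workbook" > "$tmp"; then
        cat "$tmp" > "$workbook"
    fi
    rm -f "$tmp"
}

# Strip only the workbooks that are staged as added, copied or modified,
# then stage the cleaned version so the commit contains it.
strip_staged_workbooks() {
    git diff --cached --name-only --diff-filter=ACM -- '*.twb' |
    while IFS= read -r fname; do
        strip_thumbnails "$fname" && git add -- "$fname"
    done
}

# In .git/hooks/pre-commit (made executable), the script would end by
# calling the function:
#   strip_staged_workbooks
```

Because git lists only the staged workbooks, untouched files are never rewritten, which is the main gain over running the awk filter across the whole tree.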