It’s been a while since the last update on this blog that had anything to do with what I’m working on at the BDC, so here’s the big update.
It was a very well-read story, made better by the fact that we took the data Bacaj gathered and turned it into a nifty interactive database.
The health department provided Bacaj with almost 200 PDF documents generated by their inspectors, who use touchscreen tablets when out on the job. Jason meticulously read those PDFs and entered the data into a spreadsheet which he used as the basis for his article and which I used to power the database.
The data was cleaned up in Google Refine (of course), exported to CSV and then imported into Tableau Public. (Fortunately, our IT guy was able to give me a copy of VMware Fusion and a license for Windows 7 so that I could run Tableau on my iMac. I really wish that they would make a Mac version…)
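The kind of cleanup Refine does in a few clicks can be sketched in plain Python. The column names and values below are hypothetical, not the real fields from Bacaj's spreadsheet; the point is just the sort of fixes involved: stray whitespace, inconsistent casing, and the duplicate rows those inconsistencies create.

```python
import csv
import io

# A hypothetical slice of the inspection spreadsheet; the columns
# are illustrative, not the real ones from the health department data.
raw = """name,address,score
 Joe's Diner ,123 MAIN ST,92
Joe's Diner,123 Main St,92
Corner Cafe,45 oak ave ,88
"""

def clean_row(row):
    # Trim stray whitespace and normalize address casing, the sort
    # of transforms Google Refine applies in a couple of clicks.
    return {
        "name": row["name"].strip(),
        "address": row["address"].strip().title(),
        "score": int(row["score"]),
    }

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(raw))]

# Drop the exact duplicates that inconsistent data entry created.
seen, deduped = set(), []
for r in rows:
    key = (r["name"], r["address"], r["score"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)
```

After cleaning, the two spellings of the diner collapse into one row, which is exactly why this step has to happen before the data goes anywhere near a map.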
Already having Bacaj’s data in a tabular format was a huge help, and getting Tableau up and working on my new virtual machine was no sweat.
Geocoding the data for the database's map component was trickier than it ought to have been. I was used to the way Google Fusion Tables handles location data (it can even geocode addresses for you), so producing latitude and longitude columns that Tableau would read involved a lot of trial and error.
Eventually, I found GPS Visualizer, which has tools that plug into Yahoo and Google's maps APIs. I used their site to batch geocode a long text-based list of addresses.
With the latitude and longitude columns in place, it was no trouble to get Tableau to map the data.
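The join step above, taking the geocoder's output and attaching it back to the spreadsheet, can be sketched like this. The addresses, coordinates, and column layout are all hypothetical; GPS Visualizer's actual output format may differ, but the idea is an address-to-coordinates lookup that adds the two columns Tableau needs.

```python
import csv
import io

# Hypothetical geocoder output: each address paired with the
# coordinates the mapping API returned for it.
geocoded = """address,latitude,longitude
123 Main St,39.2847,-80.4612
45 Oak Ave,39.2901,-80.4555
"""

# Hypothetical slice of the cleaned inspection data.
inspections = """name,address
Joe's Diner,123 Main St
Corner Cafe,45 Oak Ave
"""

# Build an address -> (lat, lon) lookup from the geocoder's CSV.
lookup = {
    row["address"]: (row["latitude"], row["longitude"])
    for row in csv.DictReader(io.StringIO(geocoded))
}

# Append latitude/longitude columns so each row can be mapped.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["name", "address", "latitude", "longitude"])
for row in csv.DictReader(io.StringIO(inspections)):
    lat, lon = lookup.get(row["address"], ("", ""))
    writer.writerow([row["name"], row["address"], lat, lon])

merged = out.getvalue()
```

Rows whose addresses the geocoder couldn't resolve get blank coordinate fields here, which makes them easy to spot and fix by hand, which is roughly how my trial and error went.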
However, Tableau is a deep, deep tool, and it took a number of tries and lots of digging through its documentation to figure out exactly what I was doing; even so, I didn't tap into a tenth of what it can do. It's a great tool to have in the box.
I don’t consider the database finished. Bacaj’s approach to data entry wasn’t terribly efficient, and I am in the process of building a custom Google Form to input even more data from the PDFs, including links to the PDFs online where readers can see the actual reports, a crucial feature missing from version 1.
Would I rather use some sort of PDF-scraping software? Of course, but I don’t know how, and at this point I’m not concerned with learning another new thing for this project (which, according to my research, would all but require me to learn yet another new thing: the fearsome command line).
Today, I even went so far as to upload all the PDFs to DocumentCloud.
They aren’t particularly pretty or well-marked with metadata, but they’re a start.
We’ll produce more interactive databases in the future, just as soon as we can get ourselves some good local datasets to work with. If you have suggestions, of course, let me know in the comments.