Hi,
I think I found a way to raise the visibility of some pretty boring civic data and present it to the public in a more useful manner. Like most cities and towns, Watertown, MA keeps detailed meeting minutes for each Town Council meeting that takes place. Since 2006 those meeting notes have been placed in the Document Center as PDFs. I doubt that many people are reviewing those notes; I’d honestly be surprised if anyone but me was looking through them. If you do review them, please say hi in the comments! Check out He Said, She Said if you want to skip the details and get to the data extracted from 5 years of Town Council meetings.
Project Goal
For this project I wanted to see if I could take the very dry, hard-to-read minutes from the Town Council meetings and present them in a way that might make them more interesting to people in Watertown. I’ll consider this project a success if just a handful of people explore some of these meeting minutes.
Challenges
Once again my primary challenge is working with information that is locked up in PDF documents. So what could I do to take these vanilla, boring presentations of our civic employees and create a more engaging experience?
Data for this project
Tools used in this project
- We’ll need the trusty Typhoeus and nifty Nokogiri ruby gems to help us screen scrape and download the PDFs
- The wonderful pdftotext tool to liberate all that juicy information from the PDF
- Open Calais will help us find interesting people, quotes and businesses mentioned in the minutes, and the Calais Ruby gem will make it easier to talk to the service.
- A number of ruby scripts to use all these fine tools
- Ruby on Rails running on Heroku will serve up the now super awesome meeting notes
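To give a feel for how a small Ruby script can drive one of these tools, here is a minimal sketch of the PDF-to-text step. It is not the actual script from this project. It assumes pdftotext is installed and on your PATH, and the helper names text_path_for and convert_pdfs are made up for illustration.

```ruby
# Sketch: convert every PDF in a directory to plain text with pdftotext.
# ASSUMPTION: the pdftotext command-line tool is installed and on the PATH.

# Hypothetical helper: "TC Minutes.pdf" -> "TC Minutes.txt"
def text_path_for(pdf_path)
  pdf_path.sub(/\.pdf\z/i, ".txt")
end

# Hypothetical helper: runs pdftotext once per PDF in dir and returns the
# paths of the text files for which the command reported success.
def convert_pdfs(dir = ".")
  Dir.glob(File.join(dir, "*.pdf")).sort.filter_map do |pdf|
    txt = text_path_for(pdf)
    # Passing the arguments separately avoids shell quoting problems with
    # file names that contain spaces.
    txt if system("pdftotext", pdf, txt)
  end
end
```

Calling convert_pdfs(".") from the folder that holds the downloaded minutes should leave a .txt file next to each PDF. A file that pdftotext cannot read is simply skipped in the returned list.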
The Result?
I’m pretty happy with the result. I was able to pull down the PDFs, convert them to text, run them through the Open Calais web service and present them in a Rails app in a much more compelling way. The basic website that shows quotes from Town Council meetings is called He Said, She Said (thanks for the name, Kimmi!). I’ve found it a fun way to look at quotes and dive into older meetings. There is obviously much more that should be done, but I’d like to get it into the hands of more people in Watertown to collect feedback.
“Councilor Lawn indicated that a few business owners contacted him and stated that they would be willing to pay more for the service than lose the service altogether.”
Steps to reproduce this project in your town
- Locate your town meeting notes. Hopefully they are online and if you are lucky they are at least in a PDF format.
- Download one PDF and try to use the scripts and tools mentioned to extract the text. You may even want to copy the text from the PDF to try the Open Calais Viewer. Using their free viewer, you’ll get a sense of how useful their entity extraction is before you commit all the way.
- If Open Calais seems to provide useful results for your town meeting notes you can use, modify or improve the Ruby script (town-council-parse.rb) that I have provided, to screen scrape your town website and pull down the PDFs for you to locally work with them.
- After you have downloaded the PDFs using the script you’ll need to convert the PDFs to a plain text format so Open Calais can extract meaning from them. Take a look at the convert_to_pdf.rb script. It is very, very simple as it just converts all the PDFs it finds in your current directory into text files. If pdftotext can do batch jobs I didn’t find it right away. After running this script you should see a .txt file with the same name as the pdf.
- Once you have your text files, you’ll need to send the extracted text to the Open Calais web service so it can generate meaning and structure from your documents. The script extract-entities.rb uses the Calais Ruby gem to make our requests much easier. Expect this script to take a while to generate its output files from the plain text documents. I like working with the JSON data format, so that is what we ask Open Calais to return.
- JSON isn’t the most user-friendly format for people to look at, so we should do something about that. How about we convert it to the more Excel- and layperson-friendly format of Comma Separated Values (CSV)? Using the Siren Ruby gem, create-quotes-csv.rb parses the JSON returned by Open Calais and stores the quotes in a new CSV file.
- At this point you should have 3 files on disk for each meeting: the PDF, the .txt file and the .txt.json file. You should also have a new file called quotes.csv containing all of the quotes that Open Calais located in the meetings.
- Upload the PDFs to Scribd so people can more easily read them on the web. So far I have moved over 93 of the 126 documents, at which point I hit the Scribd rate limit. Hopefully I will remember to go back and add the remaining documents. As a reminder to myself, Scribd shut me off at TC%20Minutes%204.24.2007.pdf.
- You may find this is a good place to stop. Having the CSV file of quotes lets you load it into a relational database and run fun, interesting queries. I wanted to give this information a more visible face, so I used Heroku to run a very, very simple Rails server that currently shows only a single random quote from the meetings.
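The JSON-to-CSV step and the random quote can be sketched roughly as follows. This is not create-quotes-csv.rb: it swaps the Siren gem for Ruby's standard json and csv libraries so there is nothing extra to install. The shape of the Open Calais response used here (a "Quotation" entry with "quote" and "person" keys, where "person" points at another entry that has a "name") is an assumption, so check it against a real .txt.json file from your own run. The helper names are hypothetical too.

```ruby
require "json"
require "csv"

# ASSUMPTION: each quotation in the parsed Open Calais response is a hash
# with "_type" => "Quotation", a "quote" string, and a "person" key that
# names another top-level entry carrying the speaker's "name".
# Adjust these keys to match what your own responses actually contain.
def extract_quotes(calais)
  quotations = calais.values.select do |entry|
    entry.is_a?(Hash) && entry["_type"] == "Quotation"
  end
  quotations.map do |q|
    person = calais[q["person"]]
    speaker = person.is_a?(Hash) ? person["name"] : nil
    { speaker: speaker || "Unknown", quote: q["quote"] }
  end
end

# Hypothetical helper: writes one CSV row per quote, tagged with the name
# of the JSON file (and so the meeting) it came from.
def write_quotes_csv(json_paths, csv_path)
  CSV.open(csv_path, "w") do |csv|
    csv << %w[source speaker quote]
    json_paths.each do |path|
      extract_quotes(JSON.parse(File.read(path))).each do |row|
        csv << [File.basename(path), row[:speaker], row[:quote]]
      end
    end
  end
end

# Picks one random row from the CSV, much like the single random quote
# the Rails app shows. Returns nil if the file has no quotes.
def random_quote(csv_path)
  CSV.foreach(csv_path, headers: true).to_a.sample
end
```

With the files from the earlier steps in place, something like write_quotes_csv(Dir.glob("*.txt.json"), "quotes.csv") followed by random_quote("quotes.csv") would give you one row to display. Keeping the source file name in each row makes it easy to link a quote back to its meeting later.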
Next Steps
I have known about ScraperWiki for a while but haven’t really tried it out yet. I’d like to see if I can migrate the collection of scripts that I’ve been using over to that tool. I also need to finish uploading documents to Scribd.
Thanks,
Matt