Over the weekend, like many Americans, I did my taxes. I’m required to file a couple of different Schedule Cs for the various side businesses I have, and when it comes time to hunt for deductions, it sure is nice to be a programmer.
I had initially thought that I would be able to download the entire year’s bank transactions as a CSV file, but alas, my bank (like many others, I’m finding) only provides 45 days of history. It does, however, let you download PDFs of all of your statements, and so there I was, staring at thousands of transactions in PDFs and dreading retyping them into Excel.
Enter my skills as a Rubyist. I sought out the PDF::Reader library, which lets you hook a custom callback into its parsing engine and do what you want with the content it finds. It parsed the PDFs fine, but I had no context, no idea where I was in my statement, because there’s no callback for a “new line” character: it’s just a stream of words. I found that if I used Adobe Acrobat to save the files as text-accessible, I ended up with statements I could actually work with.
Now that I had ‘lines’, I was able to use the power of regular expressions to get the data I needed. The lines I was interested in started with a date and an amount, and the rest was just description for my transaction.
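To make the idea concrete, here is a minimal sketch of that line matching. It is not the original script: the line layout (an `MM/DD` date, then an amount, then the description), the `TRANSACTION_LINE` pattern, and the `parse_transaction` and `parse_statement` helpers are all assumptions for illustration, and a real statement will almost certainly need the pattern adjusted.

```ruby
require 'date'

# Assumed layout: "MM/DD  <amount>  <description>". This is a guess at a
# typical statement line, not the format of any particular bank.
TRANSACTION_LINE = %r{\A\s*(?<date>\d{1,2}/\d{1,2})\s+(?<amount>-?\$?[\d,]+\.\d{2})\s+(?<memo>.+?)\s*\z}

# Returns a hash for a transaction line, or nil for anything else
# (headers, page footers, blank lines).
def parse_transaction(line, year)
  match = TRANSACTION_LINE.match(line)
  return nil unless match

  month, day = match[:date].split('/').map(&:to_i)
  return nil unless Date.valid_date?(year, month, day)

  {
    date: Date.new(year, month, day),
    # Floats are good enough for a quick spreadsheet; use integer cents
    # if you need exact arithmetic.
    amount: match[:amount].delete('$,').to_f,
    memo: match[:memo]
  }
end

# Statements usually omit the year on each line, so it is passed in.
def parse_statement(text, year)
  text.each_line.map { |line| parse_transaction(line, year) }.compact
end
```

Every line that doesn’t look like a transaction is simply skipped, which is what makes this workable on messy extracted text. From there, each hash can be written out as a row of a CSV file and opened in Excel.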
Here’s the warts-and-all code I used to compile the year’s worth of spreadsheet data. It’s truly “quick-and-dirty”, but it saved me tons of time. The next step, of course, would be to implement regular expressions against the ‘memo’ field of these transactions, and pre-suppose categories and deductions based on these patterns. But then again, maybe it’s time to just use Quicken and stop waiting until the last minute.
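As for that next step, a rough sketch of guessing categories from the memo field might look like the following. The `CATEGORY_RULES` table, the merchant patterns, the category names, and the `guess_category` helper are all made up for illustration; you would fill in whatever shows up in your own statements.

```ruby
# Hypothetical pattern-to-category rules. Order matters: the first
# pattern that matches the memo wins.
CATEGORY_RULES = [
  [/office depot|staples/i, 'Office supplies'],
  [/web hosting|domain renewal/i, 'Software and hosting'],
  [/usps|fedex/i, 'Shipping']
].freeze

# Returns the category for the first matching rule, or 'Uncategorized'
# when nothing matches, so those rows can be reviewed by hand.
def guess_category(memo)
  _pattern, category = CATEGORY_RULES.find { |pattern, _| pattern.match?(memo) }
  category || 'Uncategorized'
end
```

Keeping the rules in a plain ordered list makes it easy to add a pattern each time an uncategorized transaction turns up.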