Scraping PDFs with Tabula

Jun 27, 2014
@manuelaristaran @jeremybmerrill (New York Times) @mtigas (ProPublica)

http://tabula.technology/ — @TabulaPDF

http://bit.ly/tabula-ire14

Tabula is a tool that lets you take data tables out of PDFs. (It’s not the same as Tableau, which is a data visualization tool.) It needs a “text-based” PDF: if you can select the text inside the file, you’re probably in good shape.

Tabula was built with support from ProPublica, La Nacion (Argentina) and the Knight-Mozilla OpenNews program.

We’re also lucky to have received a Knight Prototype grant to continue working on it this year, and we’re lucky to work in newsrooms that let us improve the program. Tabula is a tool built by journalists, originally for journalism problems.

PDF is the worst possible format for data exchange. “Portable Document Format” is electronic paper, meant to be rendered the same way regardless of the device. PDF cares about the form; we just care about the content.

Ever try to copy a table out of a PDF file?

Unfortunately, PDFs are regularly used for publishing important information.

Why can’t you just ask for Excel (or other raw data)? (Hint: you should!)

http://projects.propublica.org/docdollars/

Sometimes you get data from private organizations, and PDFs of data tables are all they provide. In this case, there’s no pathway to asking for raw data.

http://ijnet.org/blog/news-app-tracks-voting-records-argentina

In other countries, you might have issues like this too: Congressional voting records in Argentina were recently provided only as PDF reports.

Why is data published in PDF?

Ignorance

Malice

“The crime stats are subject to being corruptible in an excel sheet. They have been changed in the past by persons unknown and this affects the veracity of the original data posted. If stats are posted on-line in a PDF format, this reduces the risk of contamination. [...] Effective immediately the stats should just be posted in a PDF format.” -Minneapolis Police Department

http://www.minnpost.com/data/2013/09/update-minneapolis-police-department-restores-accessible-data-format

● “Text-based” PDFs
  ○ Can select text inside the file.
  ○ PDF stores exact positions of every character on the page.
  ○ Tabula can take this text information & return the data.
● Scanned PDFs
  ○ Just a collection of images.
  ○ PDF file doesn’t actually contain the text you’re seeing.
  ○ Tabula can’t do anything about these unless the file is processed with OCR software first. (Watch out for accuracy!)
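To make the "exact positions of every character" point concrete, here is a minimal Python sketch of how positioned characters could be turned back into table rows and cells. It is an illustration only and not Tabula's actual algorithm: the `Char` record, the `place` helper, the tolerance values, and the assumption that `y` is measured downward from the top of the page are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """One character as a text-based PDF might describe it (hypothetical record)."""
    text: str
    x: float      # left edge, in points
    y: float      # vertical position, assumed to grow downward from the page top
    width: float  # advance width, in points


def group_rows(chars, y_tolerance=2.0):
    """Group characters whose vertical positions are within y_tolerance."""
    rows = []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        if rows and abs(rows[-1][-1].y - ch.y) <= y_tolerance:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    return rows


def split_cells(row, gap=5.0):
    """Split one row into cells wherever the horizontal gap exceeds `gap`."""
    cells, current, prev_end = [], "", None
    for ch in sorted(row, key=lambda c: c.x):
        if prev_end is not None and ch.x - prev_end > gap:
            cells.append(current)
            current = ""
        current += ch.text
        prev_end = ch.x + ch.width
    if current:
        cells.append(current)
    return cells


def chars_to_table(chars):
    """Turn a flat list of positioned characters into a list of rows of cells."""
    return [split_cells(row) for row in group_rows(chars)]


def place(text, x, y, char_width=6.0):
    """Helper for the demo: lay out a string as evenly spaced characters."""
    return [Char(c, x + i * char_width, y, char_width) for i, c in enumerate(text)]


# Made-up sample data: a two-column, two-row table.
sample = (
    place("Name", 10, 10) + place("Total", 100, 10)
    + place("Ana", 10, 24) + place("42", 100, 24)
)
table = chars_to_table(sample)
print(table)
```

A scanned PDF has no such character records at all, only pixels, which is why OCR has to run first. Real tables are messier than this sample (wrapped cells, uneven spacing, ruling lines), so fixed thresholds like the ones above would likely need tuning per document.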

http://tabula.technology/
http://bit.ly/tabula-ire14

See also:
http://bit.ly/pdf-madness
http://bit.ly/docsdocsdocs