New Document Tools to Unearth Redacted Text, Personal Information, and More

1 year ago 52

data journalism extract DocumentCloud redaction

New features from DocumentCloud can help investigative journalists reveal or protect poorly-redacted text as well as quickly scrape personal information embedded in large files. Image: Shutterstock

One of the biggest technological leaps for investigative journalists in recent years has been the development of free tools to make large document bundles searchable and manageable for small teams.

In earlier years, reporters needed stacks of multi-colored sticky notes, data input volunteers, and lots of time to manage boxes of public records that arrive in every format, from longhand script to unstructured data tables and partially redacted reports. 

But tools powered by machine learning and the ingenuity of open source program developers have not only tamed giant leaks, but can also unearth hidden data in those bundles, and reduce the risk of inadvertently publishing sensitive information.

For instance, attendees at the 2022 Investigative Reporters & Editors conference were amazed to learn that the AI-powered Google Pinpoint tool — in addition to its many time-saving parsing functions — could also transcribe, and search, the tiny text on brass plaques in the distant background of photographs. Indeed, journalists at the environmental newsroom Floodlight were recently named finalists for the Goldsmith Investigative Reporting Prize after they used Pinpoint to auto-analyze thousands of pages of leaked documents to identify individuals allegedly behind a media corruption scandal.

DocumentCloud now boasts of many more cutting-edge functions.

There was a similarly enthusiastic response at the recent NICAR23 data journalism conference in Tennessee, when reporters learned about powerful new digging features on the open source DocumentCloud platform.

A service of the nonprofit MuckRock Foundation, the largely free DocumentCloud platform is already popular for its base document management features, which include easy upload of 70 formats, from PDFs to spreadsheets and graphics; annotating reports; and — its best-known feature — the ability to embed processed documents directly into your story. You can also keyword search its public database of some five million documents added by other researchers and reporters, using familiar Google-type syntax like “AND” and “OR.” And its embedding function is especially important in the current era of declining trust in media, as audiences can directly check your claim that you did indeed find X or Y from a report, effectively turning documents into on-the-record sources. 

But DocumentCloud now features many more cutting-edge functions — which include importing from programs like Google Drive to transcribing YouTube audio and even peering through weak blackout redactions (see the list below).

Tools to Address Real-World Data Challenges

In a single-speaker presentation at NICAR23, Sanjin Ibrahimovic, Open Source Fellow at the MuckRock Foundation, said add-ons to the core functions have been created by the DocumentCloud community — users, fellows, data science grantees, and journalists — to address problems and opportunities they encountered during live projects.

Document Cloud personal identifying information detector add-on data journalism

DocumentCloud’s PII Detector add-on feature can extract key information previously hidden in huge data files. Image: Screenshot, DocumentCloud

For instance, Ibrahimovic said users noticed that it took a long time to pick out personally identifiable information (PII) scattered throughout thick files, and that some could be missed in embedded information, like email addresses, social security numbers, ZIP codes, credit card numbers, and physical addresses in the small print.

So DocumentCloud has added a feature that automatically finds and highlights PII terms.  

Meanwhile, Ibrahimovic said users were also struck by the weakness and inaccuracy of redactions in documents from government agencies — where officials often use black highlighter pens or poor redaction software to conceal sensitive or secret information. This presents a risk for newsrooms seeking to embed documents, as sensitive information on victims, for instance, could be digitally extracted by bad actors. 

So DocumentCloud implemented a “Bad Redactions” add-on feature, which helps journalists in two crucial ways:

  • It automatically analyzes and surfaces all the supposedly redacted passages in a single spreadsheet, so you can sometimes reveal what the agency intended to conceal.
  • It gives you the option to complete the redaction job: to permanently scrub all the digital information beneath the blacked out sections, and fully redact them for public documents, or pages you embed. Ibrahimovic warned that reporters should think carefully before clicking on the “Confirm Redaction” button for passages they choose to redact — “because this is a permanent procedure — it’s not reversible.”

For his recent Organized Crime and Corruption Reporting Project (OCCRP) investigation into the trafficking of endangered brazilwood, Luiz Fernando Toledo used Bad Redactions to learn the names of small Brazilian companies fined for smuggling.

“Users don’t need any programmatic knowledge to run an Add-On.” — Sanjin Ibrahimovic, MuckRock Foundation open source fellow

His story involved obtaining hundreds of reports on environmental fines from government agencies and then organizing those documents, explained Toledo — who is also project coordinator for the Data Fixers environmental crimes nonprofit. “The Bad Redactions Add-On helped me to find out the names of several people and companies charged. The Import Documents function was also very important. It was easy to parse through so many documents and find the key parts that I needed. I also used DocumentCloud to fact check the whole project.”

User-Friendly Digging Features

Even though they are transparent and open source, Ibrahimovic acknowledged that Add-Ons require coding skills if you want to create one. They are built with platforms like the DocumentCloud application programming interface (API) and GitHub Actions. But he said Add-Ons are only accepted for the service if they are easy to use.

Document Cloud Bad Redactions add-on data journalism

The DocumentCloud Bad Redactions Add-On can both reveal poorly redacted info and help journalists protect confidential information. Image: Screenshot, DocumentCloud

“Users don’t need any programmatic knowledge to run an Add-On,” he pointed out. “The idea for data extraction and analysis procedures is so smaller newsrooms can use this without needing programming skills.”

Nevertheless, running Add-Ons may present technical challenges to non-data reporters — so users should check out MuckRock’s YouTube tutorial channel on the topic.

Access to DocumentCloud requires creating an account — ideally, using your institutional email address — which is followed by a quick verification step. Access to the growing library of new features involves clicking on “Add-Ons,” and then “Browse All Add-Ons.”

Ibrahimovic said some of the newer Add-On tools can:

  • Import documents from Google Drive, Dropbox, WeTransfer, and Mediafire.
  • Convert email files (EML and MSG formats) into PDFs.
  • Pull data from websites with its Scraper function, which can also automatically download and index newly uploaded documents from your target site.  
  • Detect and display poorly redacted text.
  • Back up projects to The Internet Archive.
  • Bulk-edit a large set of documents.
  • Transcribe audio files — including YouTube — and automatically upload transcriptions to your account.
  • Extract tables within PDFs using a Tabula-based tool.
  • Recognize and highlight PII terms, such as phone numbers, social security information, and physical addresses.

For several attendees, this latter function — the ‘PII Detector’ – was most exciting, partly because it can instantly provide a database of contact details for potential sources from a massive sheaf of court filings or audit reports.

“I was able to keyword search a decade’s worth of meeting notes to locate the information.” — The Macon Newsroom investigative reporter Laura Corley

Laura Corley, an investigative reporter at nonprofit The Macon Newsroom in the US state of Georgia, said new Add-Ons had already proved essential for her research into racial and economic equity at two local charter schools. Minutes of meetings posted by the school governing boards, she said, ran to hundreds of pages, and rarely listed the topics under discussion in headings.

“Without knowing specifically when certain business items were discussed, it could take hours or even days to find the right documents,” she explained. “The DocumentCloud Scraper Add-On allowed me to cull all of the meeting minutes from both websites within minutes. I was able to keyword search a decade’s worth of meeting notes to locate the information.”

“It yielded more results than expected, and gave me further context,” she added. 

Summing up, Ibrahimovic said: “Collectively, we think these features really lower the barrier of entry for deep document analysis for journalists and researchers with limited resources.”

Additional Resources

Free, Game-Changing Data Extraction Tools that Require No Coding Skills

Testing the Potential of Using ChatGPT to Extract Data from PDFs

New Investigative Tools for Monitoring Social Media Platforms


Rowan Philp, senior reporter, GIJNRowan Philp is a reporter for GIJN. He was formerly chief reporter for South Africa’s Sunday Times. As a foreign correspondent, he has reported on news, politics, corruption, and conflict from more than two dozen countries around the world.

The post New Document Tools to Unearth Redacted Text, Personal Information, and More appeared first on Global Investigative Journalism Network.

Read Entire Article