April | 2018 | Joseph Bugeja

Researchers often have to analyse data, and sometimes those data are contained in PDF files. While most commercial analysis tools (e.g., NVivo) support this file format, it is often better to convert the files to plaintext, especially if you need to do some preprocessing (e.g., stemming words, removing digits, or collapsing whitespace). While this is straightforward to do from most PDF editors, it becomes cumbersome and time-consuming when you are dealing with multiple files. On Linux/Mac, a quick way of solving this is to use the pdftotext command-line tool (shipped with Poppler and with Xpdf) as follows:

for file in *.pdf; do pdftotext -nopgbrk -eol unix "$file"; done

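This converts every PDF in the current directory to a text file with the same base name and a .txt extension. The -nopgbrk option suppresses the form-feed characters that pdftotext otherwise inserts between pages, and -eol unix forces Unix line endings.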
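Once the text files exist, some of the preprocessing mentioned earlier (removing digits, collapsing whitespace) can also be done from the shell. The following is only a minimal sketch: the function names clean_text and clean_all and the "cleaned" output directory are my own assumptions for illustration, and stemming is left out because it needs a dedicated tool.

```shell
#!/bin/sh
# clean_text: hypothetical helper (not part of pdftotext).
# Reads text on stdin, deletes digits, collapses runs of whitespace
# to a single space, and trims leading/trailing spaces on each line.
clean_text() {
    sed -e 's/[0-9]//g' \
        -e 's/[[:space:]]\{1,\}/ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}

# clean_all: hypothetical wrapper that runs clean_text over every
# .txt file in the current directory. Results go to a "cleaned"
# subdirectory (an assumed name) so the converted files stay untouched.
clean_all() {
    mkdir -p cleaned
    for file in *.txt; do
        [ -e "$file" ] || continue   # no .txt files: skip the literal pattern
        clean_text < "$file" > "cleaned/$file"
    done
}
```

After sourcing these definitions, calling clean_all in the directory that holds the converted files writes one cleaned copy per file. Writing to a separate directory keeps the step repeatable, since the loop never picks up its own output.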

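You can also use the pandoc package, which is handy for the reverse direction. This is a very powerful tool that can convert files between many formats, e.g., Markdown, LaTeX, EPUB, and many more. Note that pandoc can write PDF files but cannot read them, and that writing PDF requires a PDF engine to be installed (by default a LaTeX engine such as pdflatex). For example, the following converts all text files found in the current directory to PDF format: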

for file in *.txt; do pandoc "$file" -o "${file%.txt}.pdf"; done
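For larger batches it can help to wrap the pandoc loop in a small function that checks for pandoc first, skips files that were already converted, and reports failures instead of stopping silently. This is a sketch under my own assumptions: pdf_name and txt_to_pdf_all are hypothetical names, and the skip-if-exists behaviour is a design choice rather than anything pandoc does itself.

```shell
#!/bin/sh
# pdf_name: hypothetical helper; maps "notes.txt" to "notes.pdf"
# by stripping a trailing .txt before appending .pdf.
pdf_name() {
    printf '%s\n' "${1%.txt}.pdf"
}

# txt_to_pdf_all: hypothetical wrapper around the pandoc loop.
# Converts each .txt file in the current directory, skips files
# whose PDF already exists, and reports failed conversions on stderr.
txt_to_pdf_all() {
    command -v pandoc >/dev/null 2>&1 || {
        echo "pandoc not found" >&2
        return 1
    }
    for file in *.txt; do
        [ -e "$file" ] || continue        # no .txt files: nothing to do
        out=$(pdf_name "$file")
        [ -e "$out" ] && continue         # already converted
        pandoc "$file" -o "$out" || echo "failed: $file" >&2
    done
}
```

Deriving the output name in one place means the PDFs end up as notes.pdf rather than notes.txt.pdf, and rerunning the function after adding new files only converts the new ones.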

Hope you will find this useful!


Efficient Way to Convert Multiple PDFs to Plaintext Format
