Efficient Way to Convert Multiple PDFs to Plaintext Format

Researchers often have to analyse data, and sometimes those data are contained in PDF files. While most commercial analysis tools, e.g., NVivo, can work with this file format, it is often better to convert the files to plaintext, especially if you need to do some preprocessing (e.g., stemming words, removing digits, collapsing whitespace). While this is straightforward to do from most PDF editors, it becomes cumbersome and time-consuming when you are dealing with multiple files. On Linux/Mac, a quick way of solving this is to use the pdftotext command-line tool (shipped with Poppler and Xpdf) as follows:

for file in *.pdf; do pdftotext -nopgbrk -eol unix "$file"; done

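This converts every PDF found in the current directory to a text file with the same base name. The -nopgbrk option leaves out the page-break characters between pages, and -eol unix forces Unix line endings.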
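If you would rather keep the text files apart from the PDFs, the one-liner can be wrapped in a small function. The following is only a sketch under a few assumptions: the function name pdfs_to_text, its two arguments, and the default output directory "text" are made up for illustration and are not part of pdftotext. It relies on pdftotext accepting an explicit output file as its last argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pdfs_to_text and its arguments are illustrative names,
# not part of pdftotext itself.
# Usage: pdfs_to_text [source_dir] [output_dir]
pdfs_to_text() {
    local src_dir="${1:-.}"
    local out_dir="${2:-text}"
    local file base failed=0

    mkdir -p "$out_dir" || return 1

    for file in "$src_dir"/*.pdf; do
        # With no matches the glob stays literal, so skip anything that is not a real file.
        [ -f "$file" ] || continue
        base="$(basename "$file" .pdf)"
        # Same flags as the one-liner; the last argument names the output file explicitly.
        if ! pdftotext -nopgbrk -eol unix "$file" "$out_dir/$base.txt"; then
            echo "failed: $file" >&2
            failed=$((failed + 1))
        fi
    done

    # Return a non-zero status if any file could not be converted.
    [ "$failed" -eq 0 ]
}

# Example call (uncomment to run): convert ./papers into ./papers_txt
# pdfs_to_text ./papers ./papers_txt
```

Quoting "$file" everywhere keeps file names with spaces intact, and a failed conversion is reported on stderr without stopping the remaining files.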

Another tool worth knowing is pandoc. This is a very powerful tool that can convert files between many formats, e.g., Markdown, LaTeX, EPUB, and many more. Note that it works in the opposite direction here: pandoc can write PDF but cannot read it. E.g., below we convert all text files found in the current directory to PDF format:

for file in *.txt; do pandoc "$file" -o "${file%.txt}.pdf"; done
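As a rough sketch, the loop can also be wrapped in a function that first checks that pandoc is installed and names each output notes.pdf rather than notes.txt.pdf. The function name texts_to_pdf and its single argument are assumptions made for illustration. Keep in mind that pandoc produces PDF through an external engine (pdflatex by default), so a LaTeX installation is assumed as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: texts_to_pdf is an illustrative name, not a pandoc command.
# Usage: texts_to_pdf [source_dir]
texts_to_pdf() {
    local src_dir="${1:-.}"
    local file

    # Stop early with a clear message if pandoc is not available.
    if ! command -v pandoc >/dev/null 2>&1; then
        echo "pandoc not found" >&2
        return 1
    fi

    for file in "$src_dir"/*.txt; do
        # With no matches the glob stays literal, so skip anything that is not a real file.
        [ -f "$file" ] || continue
        # ${file%.txt} drops the extension before .pdf is appended.
        pandoc "$file" -o "${file%.txt}.pdf" || echo "failed: $file" >&2
    done
}

# Example call (uncomment to run): convert the text files in the current directory
# texts_to_pdf .
```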

Hope you will find this useful!

Creating a Hierarchical Taxonomy Through LaTeX

One of the things researchers occasionally have to develop is a taxonomy. Essentially, a taxonomy is a scheme that classifies concepts in a logical, usually hierarchical, manner.

There are many different tools and methods to help draw a taxonomy. But if you are working with LaTeX, you can easily do so through the “forest” package. Here is a simple example of how you can draw one to represent household appliances and kitchen aids in a smart home:

[Screenshot: LaTeX source for the taxonomy, using the forest package]

Compiling the above code produces the following graphic:

[Screenshot: the resulting taxonomy diagram]

Hope you will find this useful!