I am trying to read multiple Excel files located in different folders in R.
Here is my solution:
library(readxl)
setwd("D:/data")
filename <- list.files(getwd(),full.names = TRUE)
# Four folders "epdata1" "epdata2" "epdata3" "epdata4" were inside the folder "data"
dataname <- list.files(filename,pattern="*.xlsx$",full.names = TRUE)
# Every folder in the folder "data" contains five excel files
datalist <- lapply(dataname,read_xlsx)
Error: `path` does not exist:'D:/data/epidata1/出院舱随访1.xlsx'
But read_xlsx runs successfully when called on the file directly:
read_xlsx("D:/data/epidata1/出院舱随访1.xlsx")
All the file paths exist under the "data" folder, so why does R fail to read those Excel files?
Your help will be much appreciated!
I don't see any reason why your code shouldn't work. Make sure your folder names are correct: in your comments you write "epdata1", but your error says "epidata1".
I tried it with some csv and mixed xlsx files.
This is what I would come up with to track down the error/typo:
library(readxl)
pp <- function(...){print(paste(...))}
main <- function(){
# finding / setting up data main folder
# You may change this to your needs
main_dir <- paste0(getwd(), "/data/")
pp("Main data directory:", main_dir)
pp("Found the following folders:")
pp(list.files(main_dir, full.names = FALSE))
data_folders <- list.files(main_dir, full.names = TRUE)
pp("Found these files in the folders:", list.files(data_folders, full.names = TRUE))
pp("Filtering *.xlsx files:", list.files(data_folders, pattern = "\\.xlsx$", full.names = TRUE))
files <- list.files(data_folders, pattern = "\\.xlsx$", full.names = TRUE)
datalist <- lapply(files,read_xlsx)
print(datalist)
}
main()
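If the typo is still hard to spot, a small wrapper that reports which path fails can also help (a minimal sketch, assuming dataname holds the xlsx paths collected in the question and readxl is loaded):
datalist <- lapply(dataname, function(p) {
  tryCatch(read_xlsx(p),
           error = function(e) { message("Failed on: ", p); NULL })
})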
I have been trying to work this out but I have not been able to do it...
I want to create a data frame with four columns: country-number-year-(content of the .txt file)
There is a .zip file in the following URL:
https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT
The file contains a folder with 49 folders in it, and each of those contains roughly 150 .txt files.
I first tried to download the zip file with get_dataset, but that did not work:
if (!require("dataverse")) devtools::install_github("iqss/dataverse-client-r")
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu")
"Error in get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu") :
Not Found (HTTP 404)."
Then I tried
temp <- tempfile()
download.file("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT",temp)
UNGDC <-unzip(temp, "UNGDC+1970-2018.zip")
It worked up to a point... I downloaded the .zip file and then created UNGDC, but nothing happened, because it only contains the following information:
UNGDC
A connection with
description "/var/folders/nl/ss_qsy090l78_tyycy03x0yh0000gn/T//RtmpTc3lvX/fileab730f392b3:UNGDC+1970-2018.zip"
class "unz"
mode "r"
text "text"
opened "closed"
can read "yes"
can write "yes"
At this point I don't know what to do... I have not found relevant information on how to proceed... Can someone please give me some hints, or point me to a resource where I can learn how to do this?
Thanks for your attention and help!!!
How about this? I used the zip package to unzip, but possibly the base unzip might work as well.
library(zip)
dir.create(temp <- tempfile())
url<-'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT'
download.file(url, paste0(temp, '/PZUURT.zip'), mode = 'wb')
unzip(paste0(temp, '/PZUURT.zip'), exdir = temp)
Note in particular I had to set the mode = 'wb' as I'm on a Windows machine.
I then saw that the unzipped archive had a _MACOSX folder and a Converted sessions folder. Assuming I don't need the MACOSX stuff, I did the following to get just the files I'm interested in:
root_folder <- paste0(temp,'/Converted sessions/')
filelist <- list.files(path = root_folder, pattern = '*.txt', recursive = TRUE)
filenames <- basename(filelist)
'filelist' contains the path to each text file relative to root_folder, while 'filenames' has just each file name, which I'll then break up to get the country, the number and the year:
df <- data.frame(t(sapply(strsplit(filenames, '_'),
function(x) c(x[1], x[2], substr(x[3], 1, 4)))))
colnames(df) <- c('Country', 'Number', 'Year')
Finally, I can read the text from each of the files and stick it into the dataframe as a new Text field:
df$Text <- sapply(paste0(root_folder, filelist), function(x) readChar(x, file.info(x)$size))
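Assuming the filenames follow a Country_Number_Year... pattern (which is what the strsplit() call above relies on), a quick sanity check of the assembled data frame might be:
str(df)                                      # Country, Number, Year, Text
head(df[, c("Country", "Number", "Year")])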
I am trying to convert .pdf files (most of which are image-based) to .txt files in bulk. The program below successfully converts both text- and image-based pdfs to text files.
My problem is that there is a set of ~15 pdf files that take a really long time to convert. They aren't particularly large (between 10 and 600 pages each), but my program takes about 45 minutes to convert them.
Why is it taking so long, and how can I speed it up? I am using CRAN RGui (64-bit) and R version 3.5.0.
The .pdf files are in the following hierarchy:
My Directory->Sub-folder 1->abc.pdf
My Directory->Sub-folder 2->def.pdf
etc..
The code is as below:
library(pdftools)   # pdf_convert()
library(tesseract)  # ocr()
library(tools)      # file_path_sans_ext()

programdir<-"C:\\My directory"
# Delete all txt files in the path
file.remove(list.files(path=programdir, pattern = ".txt", recursive = T, full.names = T))
# Get list of sub folders in the main directory
mydir<-list.dirs(path=programdir,full.names = TRUE, recursive = TRUE)
# Loop through sub-folders, starting from 2 as 1 is the parent directory
for(i in 2:length(mydir)) {
# make a vector of PDF file names
myfiles <- list.files(path=mydir[i],pattern = ".pdf",
full.names = TRUE,recursive = TRUE)
# Loop through every file in the sub-directory
for(j in 1:length(myfiles)) {
# Render pdf pages to tiff images
img_file <- pdftools::pdf_convert(myfiles[j], format = 'tiff', dpi = 400)
# Extract text from the tiff images with OCR
pdftotext <- ocr(img_file)
# Ensure text files are named as per sub-directory name_pdf name.txt format
fname = paste(mydir[i],basename(file_path_sans_ext(myfiles[j])),sep="_")
# Save files to directory path
sink(file=paste(fname , ".txt", sep=''))
writeLines(unlist(lapply(pdftotext , paste, collapse=" ")))
sink()
j <- j + 1 # Next file in sub-directory
}
i <- i + 1 # Next sub-directory record
}
file.remove(list.files(pattern = ".tiff", recursive = TRUE, full.names = TRUE))
I have used the following code within R to convert a PDF file to a text file for future use with the tm package. I am using the downloaded "pdftotext.exe" file.
This code works properly and produces a "txt" file for every PDF in the directory.
myfiles <- list.files(path = dir04, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/xpdf/xpdfbin-win-3.04/bin64/pdftotext.exe"',paste0('"', i, '"')), wait = FALSE))
I am trying to figure out how to use "docx2txt" in a similar manner. However, those files are not .exe files. Can I use "docx2txt-1.4" or "docx2txt-1.4.tar" in the same manner? The following code produces an error for each file.
myfiles <- list.files(path = dir08, pattern = "docx", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/docx2txt/docx2txt-1.4.gz"',paste0('"', i, '"')), wait = FALSE))
Warning
running command '"C:/docx2txt/docx2txt-1.4.gz" "C:/....docx"' had status 127
The question "how do I create a corpus of *.docx files with tm?" doesn't have quite enough info.
You can convert a ".docx" file to ".txt" with the following code, which takes a different approach:
library(RDCOMClient)
path_Word <- "C:\\temp.docx"
path_TXT <- "C:\\temp.txt"
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_Word),
ConfirmConversions = FALSE)
doc$SaveAs(path_TXT, FileFormat = 4) # Converts word document to txt
text <- readLines(path_TXT)
text
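Depending on your workflow, you may also want to close the document and quit the Word instance once the conversion is done; a small follow-up sketch using the standard Word COM methods:
doc$Close()
wordApp$Quit()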
I need to automate R to read a csv data file that is inside a zip file.
For example, I would type:
read.zip(file = "myfile.zip")
And internally, what would be done is:
Unzip myfile.zip to a temporary folder
Read the only file contained in it using read.csv
If there is more than one file inside the zip file, an error is thrown.
My problem is getting the name of the file contained in the zip file, in order to provide it to the read.csv command. Does anyone know how to do it?
UPDATE
Here's the function I wrote based on @Paul's answer:
read.zip <- function(zipfile, row.names=NULL, dec=".") {
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
dir.create(zipdir)
# Unzip the file into the dir
unzip(zipfile, exdir=zipdir)
# Get the files into the dir
files <- list.files(zipdir)
# Throw an error if there's more than one
if(length(files)>1) stop("More than one data file inside zip")
# Get the full name of the file
file <- paste(zipdir, files[1], sep="/")
# Read the file
read.csv(file, row.names = row.names, dec = dec)
}
Since I'll be working with more files inside the tempdir(), I created a new dir inside it, so I don't get confused with the files. I hope it may be useful!
Another solution using unz:
read.zip <- function(file, ...) {
zipFileInfo <- unzip(file, list=TRUE)
if(nrow(zipFileInfo) > 1)
stop("More than one data file inside zip")
else
read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}
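Usage is then simply (the archive name is a placeholder), with any extra arguments passed through to read.csv:
dat <- read.zip("myfile.zip", stringsAsFactors = FALSE)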
You can use unzip to unzip the file. I just mention this as it is not clear from your question whether you knew that. In regard to reading the file: once you've extracted it to a temporary dir (?tempdir), just use list.files to find the files that were dumped into the temporary directory. In your case this is just one file, the file you need. Reading it with read.csv is then quite straightforward:
l <- list.files(temp_path, full.names = TRUE)
read.csv(l[1])
assuming your tempdir location is stored in temp_path.
I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:
read.csv.zip <- function(zipfile, ...) {
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
dir.create(zipdir)
# Unzip the file into the dir
unzip(zipfile, exdir=zipdir)
# Get a list of csv files in the dir
files <- list.files(zipdir)
files <- files[grep("\\.csv$", files)]
# Create a list of the imported csv files
csv.data <- sapply(files, function(f) {
fp <- file.path(zipdir, f)
return(read.csv(fp, ...))
}, simplify = FALSE)  # keep the result as a list of data frames
return(csv.data)}
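A hypothetical usage example (the archive name is a placeholder):
csvs <- read.csv.zip("data_bundle.zip")
names(csvs)       # one list element per csv found in the archive
head(csvs[[1]])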
If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin), you could also use:
zipfile<-"test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))
This solution also has the advantage that no temporary files are created.
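The same idea works for a comma-separated file; a minimal sketch, assuming test.zip holds a single csv:
myData <- read.csv(pipe(paste("zcat", zipfile)))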
Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:
My approach makes use of the data.table package's fread(), which can be fast (generally, if it's zipped, sizes might be large, so you stand to gain a lot of speed here!).
I also adjusted the output format so that it is a named list, where each element of the list is named after the file. For me, this was a very useful addition.
Instead of using regular expressions to sift through the files grabbed by list.files, I make use of list.files()'s pattern argument.
Finally, by relying on fread() and by making pattern an argument to which you can supply something like "" or NULL or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (if your .zip contains both .csv and .txt files and you want both, e.g.). If you only want some types of files, you can specify the pattern to select those, too.
Here is the actual function:
library(data.table)  # provides fread()

read.csv.zip <- function(zipfile, pattern="\\.csv$", ...){
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
dir.create(zipdir)
# Unzip the file into the dir
unzip(zipfile, exdir=zipdir)
# Get a list of csv files in the dir
files <- list.files(zipdir, recursive = TRUE, pattern = pattern)
# Create a list of the imported csv files
csv.data <- sapply(files,
function(f){
fp <- file.path(zipdir, f)
dat <- fread(fp, ...)
return(dat)
}
, simplify = FALSE)  # always return a list, even if all files have the same shape
# Use csv names to name list elements
names(csv.data) <- basename(files)
# Return data
return(csv.data)
}
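For example, a hypothetical call that reads every .csv and .txt file from an archive into one named list (the archive name is made up):
all_data <- read.csv.zip("archive.zip", pattern = "\\.(csv|txt)$")
names(all_data)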
The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided its first argument accepts a file path. E.g.
head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))
read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
zipfile <- tempfile()
download.file(url = url, destfile = zipfile, quiet = TRUE)
zipdir <- tempfile()
dir.create(zipdir)
unzip(zipfile, exdir = zipdir)  # no files argument given, so extract everything
files <- list.files(zipdir)
if (is.null(filename)) {
if (length(files) == 1) {
filename <- files
} else {
stop("multiple files in zip, but no filename specified: ", paste(files, collapse = ", "))
}
} else { # filename specified
stopifnot(length(filename) ==1)
stopifnot(filename %in% files)
}
do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}
Another approach that uses fread from the data.table package
library(data.table)  # provides fread()

fread.zip <- function(zipfile, ...) {
# Function reads data from a zipped csv file
# Uses fread from the data.table package
## Create the temporary directory or flush CSVs if it exists already
if (!file.exists(tempdir())) {dir.create(tempdir())
} else {file.remove(list.files(tempdir(), full = T, pattern = "*.csv"))
}
## Unzip the file into the dir
unzip(zipfile, exdir=tempdir())
## Get path to file
file <- list.files(tempdir(), pattern = "*.csv", full.names = T)
## Throw an error if there's more than one
if(length(file)>1) stop("More than one data file inside zip")
## Read the file
fread(file,
na.strings = c(""), # read empty strings as NA
...
)
}
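A hypothetical call, passing an extra argument through to fread() (the file and column names are made up):
dt <- fread.zip("myfile.zip", select = c("id", "value"))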
Based on the answer/update by @joão-daniel:

# unzipped file location
outDir <- "~/Documents/unzipFolder"

# get all the zip files
zipF <- list.files(path = "~/Documents/", pattern = "*.zip", full.names = TRUE)

# unzip all your files
purrr::map(.x = zipF, .f = unzip, exdir = outDir)
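Once everything is unzipped, the extracted files can be read in one go, e.g. (assuming the archives contained csv files):
csv_files <- list.files(outDir, pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
all_data <- lapply(csv_files, read.csv)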
I just wrote a function based on the top read.zip answer above that may help...
# catf() is not a base R function; define a small helper so the verbose messages below work
catf <- function(...) cat(...)

read.zip <- function(zipfile, internalfile=NA, read.function=read.delim, verbose=TRUE, ...) {
# function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r
# check the files within zip
unzfiles <- unzip(zipfile, list=TRUE)
if (is.na(internalfile) || is.numeric(internalfile)) {
internalfile <- unzfiles$Name[ifelse(is.na(internalfile),1,internalfile[1])]
}
# Create a name for the dir where we'll unzip
zipdir <- tempfile()
# Create the dir using that name
if (verbose) catf("Directory created:",zipdir,"\n")
dir.create(zipdir)
# Unzip the file into the dir
if (verbose) catf("Unzipping file:",internalfile,"...")
unzip(zipfile, file=internalfile, exdir=zipdir)
if (verbose) catf("Done!\n")
# Get the full name of the file
file <- paste(zipdir, internalfile, sep="/")
if (verbose)
on.exit({
catf("Done!\nRemoving temporary files:", file, "\n")
file.remove(file)
unlink(zipdir, recursive = TRUE)  # file.remove() cannot delete a directory
})
else
on.exit({file.remove(file); unlink(zipdir, recursive = TRUE)})
# Read the file
if (verbose) catf("Reading File...")
read.function(file, ...)
}
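A hypothetical call, reading the second file inside an archive as csv:
dat <- read.zip("myfile.zip", internalfile = 2, read.function = read.csv)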