Using docx2txt in R

I have used the following code within R to convert a PDF file to a text file for future use with the tm package. I am using the downloaded "pdftotext.exe" file.
This code works properly and produces a ".txt" file for every PDF in the directory.
myfiles <- list.files(path = dir04, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/xpdf/xpdfbin-win-3.04/bin64/pdftotext.exe"',paste0('"', i, '"')), wait = FALSE))
I am trying to figure out how to use "docx2txt" in a similar manner. However, its files are not .exe files. Can I use "docx2txt-1.4" or "docx2txt-1.4.tar" in the same manner? The following code produces an error for each file.
myfiles <- list.files(path = dir08, pattern = "docx", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/docx2txt/docx2txt-1.4.gz"',paste0('"', i, '"')), wait = FALSE))
Warning
running command '"C:/docx2txt/docx2txt-1.4.gz" "C:/....docx"' had status 127
The related question "how do I create a corpus of *.docx files with tm?" doesn't have quite enough info.
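For what it's worth, exit status 127 usually means the shell could not find or execute the command: docx2txt is distributed as a gzipped tar archive containing a Perl script, not a Windows executable, so the .gz file cannot be run directly. A hedged sketch, assuming Perl is installed and on the PATH and that the archive has been extracted so docx2txt.pl sits under C:/docx2txt/ (both paths are assumptions):

```r
myfiles <- list.files(path = dir08, pattern = "docx", full.names = TRUE)
lapply(myfiles, function(i)
  system(paste('perl "C:/docx2txt/docx2txt-1.4/docx2txt.pl"',
               paste0('"', i, '"')), wait = FALSE))
```

This mirrors the pdftotext.exe call above, with the Perl interpreter prepended; it is untested here since it needs Perl and .docx inputs.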

You can convert a ".docx" file to ".txt" with the following code, which takes a different approach:
library(RDCOMClient)
path_Word <- "C:\\temp.docx"
path_TXT <- "C:\\temp.txt"
wordApp <- COMCreate("Word.Application")  # start a Word instance through COM
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_Word),
                                   ConfirmConversions = FALSE)
doc$SaveAs(path_TXT, FileFormat = 4)  # 4 = wdFormatDOSText, i.e. plain text
text <- readLines(path_TXT)
text
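One caveat (an assumption about typical RDCOMClient/Word COM usage, Windows-only, so untested here): the Word process started by COMCreate keeps running until it is told to quit, so it is worth closing things explicitly when done:

```r
# Close the document without saving further changes, then quit the
# Word instance so no orphan WINWORD.EXE process is left behind
doc$Close(FALSE)
wordApp$Quit()
```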

Related

R - load multiple csv files and drop .csv from name

I have some files in a base directory that I use to house all my .csv files
base_dir <- file.path(path)
file_list <- list.files(path = base_dir, pattern = "*.csv")
I would like to load all of them at once:
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(paste(base_dir, file_list[i], sep = ""))
)}
However, this produces objects whose names still contain ".csv" in R.
What I would like to do is load all the files but drop the ".csv" from each name once it is loaded.
I have tried the following:
for (i in 1:length(file_list)){ assign(file_list[i],
read.csv(substr(paste(base_dir, file_list[i], sep = ""), 1,
nchar(file_list[i]) -4))
)}
But I received an error: No such file or directory
Is there a way to do this somewhat efficiently?
Normally one reads them into a list rather than having them as free objects floating around in the workspace. Use dir or Sys.glob to generate the full path names and then use read.csv to read each one in. The names of the resulting list L will be the path names, so reduce them to the basename and remove the .csv extension.
# paths <- dir(path = path, pattern = "\\.csv$", full = TRUE)
paths <- Sys.glob(sprintf("%s/*.csv", path))
L <- Map(read.csv, paths)
names(L) <- sub("\\.csv$", "", basename(names(L)))
If you really want them as free-floating objects anyway, then add this:
list2env(L, .GlobalEnv)
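A self-contained check of the Map/names idea, using temporary files (the names "a.csv" and "b.csv" are made up for the demo):

```r
path <- tempfile("csvdemo")
dir.create(path)
write.csv(data.frame(x = 1:3), file.path(path, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6), file.path(path, "b.csv"), row.names = FALSE)

paths <- Sys.glob(sprintf("%s/*.csv", path))
L <- Map(read.csv, paths)            # names(L) are the full paths
names(L) <- sub("\\.csv$", "", basename(names(L)))

names(L)     # "a" "b"
L[["a"]]$x   # 1 2 3
```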
We can use sub to remove the .csv at the end:
for (i in 1:length(file_list)){
assign(sub("\\.csv$", "", basename(file_list[i])),
read.csv(paste(base_dir, file_list[i], sep = ""))
)}
Or another option is tools::file_path_sans_ext:
for (i in 1:length(file_list)){
assign(tools::file_path_sans_ext(basename(file_list[i])),
read.csv(paste(base_dir, file_list[i], sep = ""))
)}
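As a quick illustration (the path is made up), file_path_sans_ext strips only the final extension:

```r
tools::file_path_sans_ext(basename("C:/data/report.2020.csv"))  # "report.2020"
```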
The error in the OP's code occurs because the substr is applied to the 'value' part, i.e. the path that goes into reading the file, instead of to the 'x' part, i.e. the object name. The corrected code would be
for(i in 1:length(file_list)){
assign(substr(file_list[i], 1, nchar(file_list[i]) - 4),
read.csv(paste(base_dir, file_list[i], sep = ""))
)}
Also, if the working directory is different, it may be better to specify full.names = TRUE:
file_list <- list.files(path = base_dir, pattern = "*\\.csv$", full.names = TRUE)

Looping multiple pdf and converting to multiple excel using R programming

I have a few PDF files in a folder. I am performing certain operations on them and converting them to Excel. Below is the code:
init <- dir(path = "C:/Users/sankirtanmoturi/Desktop/rloop", pattern = "\\.pdf$", all.files = TRUE, full.names = TRUE)
trans <- function(file){
try <- pdf_text(file)
try1 <- unlist(str_split(try,"[\\r\\n]+"))
try2 <- str_split_fixed(str_trim(try1), "\\s{1,}, 20")
write.xlsx(try2, sub("\\.xlsx$", "-UP.xlsx", file))
}
lapply(init, trans)
I am getting the error below:
Error in identical(n, Inf) : argument "n" is missing, with no default
I figured out that there's a problem with str_split or str_split_fixed.
But if I skip the loop and try a single file, it converts successfully.
Please help me run this for all PDF files in the folder.
There are mainly typos in your code: the "\\s{1,}" pattern and the 20 had been fused into a single string, and the output name must be built from the ".pdf" file name. The below code should work:
library(pdftools)   # pdf_text
library(stringr)    # str_split, str_split_fixed, str_trim
library(openxlsx)   # write.xlsx (or library(xlsx), whichever you use)

init <- dir(path = "C:/Users/sankirtanmoturi/Desktop/rloop", pattern = "\\.pdf$", all.files = TRUE, full.names = TRUE)
trans <- function(file){
  try <- pdf_text(file)
  try1 <- unlist(str_split(try, "[\\r\\n]+"))
  try2 <- str_split_fixed(str_trim(try1), "\\s{1,}", 20)
  write.xlsx(try2, sub("\\.pdf$", "-UP.xlsx", file))
}
lapply(init, trans)
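The output-name fix matters because sub only rewrites a suffix that actually matches: calling sub("\\.xlsx$", ...) on a ".pdf" input returns the name unchanged, so the Excel output would have been written over the original .pdf path. A small base-R illustration (the path is made up):

```r
f <- "C:/Users/sankirtanmoturi/Desktop/rloop/report.pdf"
sub("\\.xlsx$", "-UP.xlsx", f)  # no ".xlsx" suffix to match: f comes back unchanged
sub("\\.pdf$", "-UP.xlsx", f)   # ".../report-UP.xlsx"
```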

Combine csv files with common file identifier

I have a list of approximately 500 csv files, each with a filename that consists of a six-digit number followed by a year (ex. 123456_2015.csv). I would like to append together all files that have the same six-digit number. I tried to implement the code suggested in this question:
Import and rbind multiple csv files with common name in R, but I want the appended data to be saved as new csv files in the same directory as the original files. I have also tried to implement the code below, however the csv files it produces contain no data.
rm(list=ls())
filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test")
NAPS_ID <- gsub('.+?\\([0-9]{5,6}?)\\_.+?$', '\\1', filenames)
Unique_NAPS_ID <- unique(NAPS_ID)
n <- length(Unique_NAPS_ID)
for(j in 1:n){
curr_NAPS_ID <- as.character(Unique_NAPS_ID[j])
NAPS_ID_pattern <- paste(".+?\\_(", curr_NAPS_ID,"+?)\\_.+?$", sep = "" )
NAPS_filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test", pattern = NAPS_ID_pattern)
write.csv(do.call("rbind", lapply(NAPS_filenames, read.csv, header = TRUE)),file = paste("C:/Users/smithma/Desktop/PM25_test/MERGED", "MERGED_", Unique_NAPS_ID[j], ".csv", sep = ""), row.names=FALSE)
}
Any help would be greatly appreciated.
Because you're not doing any data manipulation, you don't need to treat the files like tabular data. You only need to copy the file contents.
filenames <- list.files("C:/Users/smithma/Desktop/PM25_test", full.names = TRUE)
NAPS_ID <- substr(basename(filenames), 1, 6)
Unique_NAPS_ID <- unique(NAPS_ID)
for(curr_NAPS_ID in Unique_NAPS_ID){
NAPS_filenames <- filenames[startsWith(basename(filenames), curr_NAPS_ID)]
output_file <- paste0(
"C:/Users/smithma/Desktop/PM25_test/MERGED_", curr_NAPS_ID, ".csv"
)
for (fname in NAPS_filenames) {
line_text <- readLines(fname)
# Write the header from the first file
if (fname == NAPS_filenames[1]) {
cat(line_text[1], '\n', sep = '', file = output_file)
}
# Append every line in the file except the header; the trailing "" makes
# cat end the output with a newline, so the next file's first line is not
# glued onto this file's last line
line_text <- line_text[-1]
cat(line_text, "", file = output_file, sep = '\n', append = TRUE)
}
}
My changes:
list.files(..., full.names = TRUE) is usually the best way to go.
Because the digits appear at the start of the filenames, I suggest substr. It's easier to get an idea of what's going on when skimming the code.
Instead of looping over the indices of a vector, loop over the values. It's more succinct and less likely to cause problems if the vector's empty.
startsWith and endsWith are relatively new functions, and they're great.
You only care about copying lines, so just use readLines to get them in and cat to get them out.
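To see the header-once/append pattern end to end, here is a self-contained run on temporary files with a made-up station ID; note the trailing "" in the appending cat call, which makes each file's output end with a newline so consecutive files don't run together:

```r
d <- tempfile("mergedemo")
dir.create(d)
writeLines(c("id,val", "123456,1"), file.path(d, "123456_2014.csv"))
writeLines(c("id,val", "123456,2"), file.path(d, "123456_2015.csv"))

files <- list.files(d, full.names = TRUE)
out <- file.path(d, "MERGED_123456.csv")
for (fname in files) {
  line_text <- readLines(fname)
  # header only from the first file
  if (fname == files[1]) cat(line_text[1], "\n", sep = "", file = out)
  # data lines from every file; trailing "" ends the output with "\n"
  cat(line_text[-1], "", file = out, sep = "\n", append = TRUE)
}

readLines(out)  # "id,val" "123456,1" "123456,2"
```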
You might consider something like this:
## take the first 6 characters of each file name
six.digit.filenames <- substr(filenames, 1, 6)
path <- "C:/Users/smithma/Desktop/PM25_test/"
unique.numbers <- unique(six.digit.filenames)
for (j in unique.numbers) {
  ## avoid calling this "sub", which would shadow base::sub
  group <- filenames[which(substr(filenames, 1, 6) == j)]
  data.for.output <- c()
  for (file in group) {
    ## now do your stuff with these files, including reading them in
    data <- read.csv(paste0(path, file))
    data.for.output <- rbind(data.for.output, data)
  }
  write.csv(data.for.output, paste0(path, j, ".csv"), row.names = FALSE)
}

Looping through files using dynamic name variable in R

I have a large number of files to import which are all saved as zip files.
From reading other posts it seems I need to pass the zip file name and then the name of the file I want to open. Since I have a lot of them I thought I could loop through all the files and import them one by one.
Is there a way to pass the name dynamically or is there an easier way to do this?
Here is what I have so far:
Temp_Data <- NULL
Master_Data <- NULL
file.names <- c("f1.zip", "f2.zip", "f3.zip", "f4.zip", "f5.zip")
for (i in 1:length(file.names)) {
zipFile <- file.names[i]
dataFile <- sub(".zip", ".csv", zipFile)
Temp_Data <- read.table(unz(zipFile,
dataFile), sep = ",")
Master_Data <- rbind(Master_Data, Temp_Data)
}
I get the following error:
In open.connection(file, "rt") :
I can import them manually using:
dt <- read.table(unz("D:/f1.zip", "f1.csv"), sep = ",")
I can create the string dynamically, but it feels long-winded and doesn't work when I wrap it with read.table(unz(...)). It seems it can't find the file name and so throws an error:
cat(paste(toString(shQuote(paste("D:/",zipFile, sep = ""))),",",
toString(shQuote(dataFile)), sep = ""), "\n")
But if I then print this to the console I get:
"D:/f1.zip","f1.csv"
I can then paste this into read.table(unz(....)) and it works, so I feel like I am close.
I've tagged in data.table since this is what I almost always use so if it can be done with 'fread' that would be great.
Any help is appreciated
You can use the list.files command here.
First, set your working directory to where all your files are stored:
setwd("C:/Users/...")
then
file.names = list.files(pattern = "*.zip", recursive = F)
then your for loop will be:
Master_Data <- NULL
for (i in 1:length(file.names)) {
  # open the files
  zipFile <- file.names[i]
  dataFile <- sub("\\.zip$", ".csv", zipFile)
  Temp_Data <- read.table(unz(zipFile, dataFile), sep = ",")
  # your function for the opened file
  Master_Data <- rbind(Master_Data, Temp_Data)
}
# finally, write the combined file once (write_delim comes from readr;
# writing inside the loop to the zip's own name, as originally posted,
# would overwrite the input archives)
library(readr)
write_delim(x = Master_Data, path = "Master_Data.txt", delim = "\t", col_names = TRUE)
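As a base-R variation, the zip-to-csv name mapping can live in a small helper so it is easy to test, with the reads bound once at the end; the "D:/" path comes from the question and is an assumption, so the reading lines are left commented out:

```r
# Map "f1.zip" -> "f1.csv", the name of the file inside the archive
zip_to_csv <- function(zipFile) sub("\\.zip$", ".csv", basename(zipFile))

# Read the csv stored inside a zip of the same name
read_zipped_csv <- function(zipFile) {
  read.table(unz(zipFile, zip_to_csv(zipFile)), sep = ",")
}

# file.names <- list.files("D:/", pattern = "\\.zip$", full.names = TRUE)
# Master_Data <- do.call(rbind, lapply(file.names, read_zipped_csv))
```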

Convert .pdf to .txt

The problem is not new on Stack Overflow, but I am pretty sure I am missing something obvious.
I am trying to convert a few .pdf files into .txt files in order to mine their text. I based my approach on this excellent script. The text in the .pdf files is not composed of images, hence no OCR is required.
# Load tm package
library(tm)
# The folder containing my PDFs
dest <- "./pdfs"
# Correctly installed xpdf from http://www.foolabs.com/xpdf/download.html
file.exists(Sys.which(c("pdfinfo", "pdftotext")))
[1] TRUE TRUE
# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
# Delete white spaces from pdfs' names
sapply(myfiles, FUN = function(i){
file.rename(from = i, to = paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})
# refresh the vector after renaming
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
paste0('"', i, '"')), wait = FALSE))
It should create a .txt copy of every .pdf file in the dest folder. I checked for issues with the path, for white spaces in the path, and for common xpdf installation issues, but nothing happens.
Here is the repository I am working on. If it can be useful, I can paste the SessionInfo. Thanks in advance.
Late answer, but I recently discovered that with the current version of tm (0.7-4) you can read PDFs directly into a corpus if you have pdftools installed (install.packages("pdftools")).
library(tm)
directory <- getwd() # change this to directory where pdf-files are located
# read the pdfs with readPDF, default engine used is pdftools see ?readPDF for more info
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
readerControl = list(reader = readPDF))
