Looping multiple pdf and converting to multiple excel using R programming - r

I have few PDF files in a folder. I am performing certain operations and converting them into excel. Below is the code,
init <- dir(path = "C:/Users/sankirtanmoturi/Desktop/rloop", pattern = "\\.pdf$", all.files = TRUE, full.names = TRUE)
trans <- function(file){
try <- pdf_text(file)
try1 <- unlist(str_split(try,"[\\r\\n]+"))
try2 <- str_split_fixed(str_trim(try1), "\\s{1,}, 20")
write.xlsx(try2, sub("\\.xlsx$", "-UP.xlsx", file))
}
lapply(init, trans)
I am getting the below error
Error in identical(n, Inf) : argument "n" is missing, with no default
I figured out that, there's problem with str_split or str_split_fixed.
But if I am not trying to loop and try for a single file, It is converting successfully
Please help me to run this for all pdf files in a folder

There are mainly typos in your question. The below code should work:
init <- dir(path = "C:/Users/sankirtanmoturi/Desktop/rloop", pattern = "\\.pdf$", all.files = TRUE, full.names = TRUE)
trans <- function(file){
try <- pdf_text(file)
try1 <- unlist(str_split(try,"[\\r\\n]+"))
try2 <- str_split_fixed(str_trim(try1), "\\s{1,}", 20)
write.xlsx(try2, sub("\\.pdf$", "-UP.xlsx", file))
}
lapply(init, trans)

Related

How to rbind similar csv files that are scattered in many different zip files, using a function?

Consider one file 'C:/ZFILE' that includes many zip files.
Now, consider that each of these zip includes many csv, among which one specific csv named 'NAME.CSV', all these scattered 'NAME.CSV' being similarly named and structured (i.e., same columns).
How to rbind all these scattered csv?
The script below allows that, but a function would be more appropriate.
How to do this?
Thanks
zfile <- "C:/ZFILE"
zlist <- list.files(path = zfile, pattern = "\\.zip$", recursive = FALSE, full.names = TRUE)
zlist # list all zip from the zfile file
zunzip <- lapply(zlist, unzip, exdir = zfile) # unzip all zip in the zfile file (may takes time depending on the number of zip)
library(data.table) # rbindlist & fread
csv_name <- "NAME.CSV"
csv_list <- list.files(path = zfile, pattern = paste0("\\", csv_name, "$"), recursive = TRUE, ignore.case = FALSE, full.names = TRUE)
csv_list # list all 'NAME.CSV' from the zfile file
csv_rbind <- rbindlist(sapply(csv_list, fread, simplify = FALSE), idcol = 'filename')
You can try this type of function ( you can pass the unzip call directly to the cmd param of data.table::fread())
get_zipped_csv <- function(path) {
fnames = list.files(path,full.names = T)
rbindlist(lapply(fnames, \(f) fread(cmd = paste0("unzip -p ",f))[,src:=f]))
}
Usage:
get_zipped_csv(path = "C:\ZFILE\")

Import multiple CSV files with Softball statistics and plot the progress [duplicate]

I have written the following function to combine 300 .csv files. My directory name is "specdata". I have done the following steps for execution,
x <- function(directory) {
dir <- directory
data_dir <- paste(getwd(),dir,sep = "/")
files <- list.files(data_dir,pattern = '\\.csv')
tables <- lapply(paste(data_dir,files,sep = "/"), read.csv, header = TRUE)
pollutantmean <- do.call(rbind , tables)
}
# Step 2: call the function
x("specdata")
# Step 3: inspect results
head(pollutantmean)
Error in head(pollutantmean) : object 'pollutantmean' not found
What is my mistake? Can anyone please explain?
There's a lot of unnecessary code in your function. You can simplify it to:
load_data <- function(path) {
files <- dir(path, pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
do.call(rbind, tables)
}
pollutantmean <- load_data("specdata")
Be aware that do.call + rbind is relatively slow. You might find dplyr::bind_rows or data.table::rbindlist to be substantially faster.
To update Prof. Wickham's answer above with code from the more recent purrr library which he coauthored with Lionel Henry:
Tbl <-
list.files(pattern="*.csv") %>%
map_df(~read_csv(.))
If the typecasting is being cheeky, you can force all the columns to be as characters with this.
Tbl <-
list.files(pattern="*.csv") %>%
map_df(~read_csv(., col_types = cols(.default = "c")))
If you are wanting to dip into subdirectories to construct your list of files to eventually bind, then be sure to include the path name, as well as register the files with their full names in your list. This will allow the binding work to go on outside of the current directory. (Thinking of the full pathnames as operating like passports to allow movement back across directory 'borders'.)
Tbl <-
list.files(path = "./subdirectory/",
pattern="*.csv",
full.names = T) %>%
map_df(~read_csv(., col_types = cols(.default = "c")))
As Prof. Wickham describes here (about halfway down):
map_df(x, f) is effectively the same as do.call("rbind", lapply(x, f)) but under the hood is much more efficient.
and a thank you to Jake Kaupp for introducing me to map_df() here.
This can be done very succinctly with dplyr and purrr from the tidyverse. Where x is a list of the names of your csv files you can simply use:
bind_rows(map(x, read.csv))
Mapping read.csv to x produces a list of dfs that bind_rows then neatly combines!
```{r echo = FALSE, warning = FALSE, message = FALSE}
setwd("~/Data/R/BacklogReporting/data/PastDue/global/") ## where file are located
path = "~/Data/R/BacklogReporting/data/PastDue/global/"
out.file <- ""
file.names <- dir(path, pattern = ".csv")
for(i in 1:length(file.names)){
file <- read.csv(file.names[i], header = TRUE, stringsAsFactors = FALSE)
out.file <- rbind(out.file, file)
}
write.csv(out.file, file = "~/Data/R/BacklogReporting/data/PastDue/global/global_stacked/past_due_global_stacked.csv", row.names = FALSE) ## directory to write stacked file to
past_due_global_stacked <- read.csv("C:/Users/E550143/Documents/Data/R/BacklogReporting/data/PastDue/global/global_stacked/past_due_global_stacked.csv", stringsAsFactors = FALSE)
files <- list.files(pattern = "\\.csv$") %>% t() %>% paste(collapse = ", ")
```
If your csv files are into an other directory, you could use something like this:
readFilesInDirectory <- function(directory, pattern){
files <- list.files(path = directory,pattern = pattern)
for (f in files){
file <- paste(directory,files, sep ="")
temp <- lapply(file, fread, sep=",")
data <- rbindlist( temp )
}
return(data)
}
In your current function pollutantmean is available only in the scope of the function x. Modify your function to this
x <- function(directory) {
dir <- directory
data_dir <- paste(getwd(),dir,sep = "/")
files <- list.files(data_dir,pattern = '\\.csv')
tables <- lapply(paste(data_dir,files,sep = "/"), read.csv, header = TRUE)
assign('pollutantmean',do.call(rbind , tables))
}
assign should put result of do.call(rbind, tables) into variable called pollutantmean in global environment.

R - load multiple csv files and drop .csv from name

I have some files in a base directory that I use to house all my .csv files
base_dir <- file.path(path)
file_list <- list.files(path = base_dir, pattern = "*.csv")
I would like to load all of them at once:
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(paste(base_dir, file_list[i], sep = ""))
)}
However, this produces files with ".csv" in the names in R.
What I would like to do is load all the files but drop the ".csv" from the name once they are loaded.
I have tried the following:
for (i in 1:length(file_list)){ assign(file_list[i],
read.csv(substr(paste(base_dir, file_list[i], sep = ""), 1,
nchar(file_list[i]) -4))
)}
But I received an error: No such file or directory
Is there a way to do this somewhat efficiently?
Normally one reads them into a list rather than having them as free objects floating around in the workspace. Use dir or Sys.glob to generate the full path names and then use read.csv to read each one in. L's names will be pathnames so reduce them to the basename and remove .csv .
# paths <- dir(path = path, pattern = "\\.csv$", full = TRUE)
paths <- Sys.glob(sprintf("%s/*.csv", path))
L <- Map(read.csv, paths)
names(L) <- sub("\\.csv$", "", basename(names(L)))
If you really want them as free floating objects anyways then add this:
list2env(L, .GlobalEnv)
We can use sub to remove the .csv at the end
for (i in 1:length(file_list)){
assign(sub("\\.csv$", "", basename(file_list[i])),
read.csv(paste(base_dir, file_list[i], sep = ""))
)}
Or another option is file_path_sans_ext
for (i in 1:length(file_list)){
assign(tools::file_path_sans_ext(basename(file_list[i])),
read.csv(paste(base_dir, file_list[i], sep = ""))
)}
The error produced in OP's code is because the substr is applied on the 'value' part i.e. the one goes into reading the file instead of the 'x' i.e. the corrected code would be
for(i in 1:length(file_list)){
assign(substr(paste(base_dir, file_list[i], sep = ""),
1, nchar(file_list[i]) -4),
read.csv(file_list[i])
)}
Also, if the working directory is different, it may be better to specify full.names = TRUE
file_list <- list.files(path = base_dir, pattern = "*\\.csv$", full.names = TRUE)

for Loop in R not binding files

I'm fairly new to R, so my apologies if this is a very basic question.
I'm trying to read two Excel files in, using the list.files(pattern) method, then using a for loop to bind the files and replace values in the bound file. However, the output that my script is producing is the output from only one file, meaning that it is not binding.
The file names are fact_import_2020 and fact_import_20182019.
FilePath <- "//srdceld2/project2/"
FileNames <- list.files(path = FilePath, pattern = "fact_import_20", all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
FileCount <- length(FileNames)
for(i in 1:FileCount){
MOH_TotalHC_1 <- read_excel(paste(FilePath, "/", FileNames[i], sep = ""), sheet = 1, range = cell_cols("A:I"))
MOH_TotalHC_2 <- read_excel(paste(FilePath, "/", FileNames[i], sep = ""), sheet = 1, range = cell_cols("A:I"))
MOH_TotalHC <- rbind(MOH_TotalHC_1, MOH_TotalHC_2)
MOH_TotalHC <- MOH_TotalHC[complete.cases(MOH_TotalHC), ]
use full.names = TRUE in list.files().
After this, make sure FileNames has full path of the files.
Then loop through the filenames, instead of filecount.
I think, you are trying to do this. I am guessing here. Please see below.
You are getting data from one file, because you are overwriting the data from file-2 with data from file-1. The for() loop is indicating it.
FileNames <- list.files(path = FilePath, pattern = "fact_import_20", all.files = FALSE,
full.names = TRUE, recursive = FALSE,
ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
# list of data from excell files
df_lst <- lapply(FileNames, function(fn){
read_excel(fn, sheet = 1, range = cell_cols("A:I"))
})
# combine both data
MOH_TotalHC <- do.call('rbind', df_lst)
# complete cases
MOH_TotalHC[complete.cases(MOH_TotalHC), ]
The potential solution is below. This solution is taken from here and seems like a
duplicate question.
Potential solution:
library(readxl)
library(data.table)
#Set your path here
FilePath <- "//srdceld2/project2/"
#Update the pattern to suit your needs. Currently, its just set for XLSX files
file.list <- list.files(path = FilePath, pattern = "*.xlsx", full.names = T)
df.list <- lapply(file.list, read_excel, sheet = 1, range = cell_cols("a:i"))
attr(df.list, "names") <- file.list
names(df.list) <- file.list
setattr(df.list, "names", file.list)
#final data frame is here
dfFinal <- rbindlist(df.list, use.names = TRUE, fill = TRUE)
Assumptions and call outs:
The files in the folder are similar file types. For example xlsx.
The files could have different set of columns and NULLs as well.
Note that the order of the columns matter and so if there are more columns in new file the number of output columns could be different.
Note: Like #Sathish, I am guessing what the input could look like

Using docx2txt in R

I have used the following code within R to convert a PDF file to a text file for future use of the tm package. I am using the downloaded "pdftotext.exe" file.
This code is working properly and produces a "txt" for every PDF in the directory.
myfiles <- list.files(path = dir04, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/xpdf/xpdfbin-win-3.04/bin64/pdftotext.exe"',paste0('"', i, '"')), wait = FALSE))
I am trying to figure out how to use "docx2txt" in a similar manner. However, the file formats are not .exe files. Can I use the "docx2txt-1.4" or "docx2txt-1.4.tar" in the same manner? The following code provides an error for each file.
myfiles <- list.files(path = dir08, pattern = "docx", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/docx2txt/docx2txt-1.4.gz"',paste0('"', i, '"')), wait = FALSE))
Warning
running command '"C:/docx2txt/docx2txt-1.4.gz" "C:/....docx"' had status 127
how do I create a corpus of *.docx files with tm? doesn't have quite enough info.
You can convert a ".docx" file to ".txt" with the following code which is a different approach :
library(RDCOMClient)
path_Word <- "C:\\temp.docx"
path_TXT <- "C:\\temp.txt"
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_Word),
ConfirmConversions = FALSE)
doc$SaveAs(path_TXT, FileFormat = 4) # Converts word document to txt
text <- readLines(path_TXT)
text

Resources