How to loop through multiple pdf files using R libraries

I have a task to read in multiple pdf files and extract the header and footer from each.
The code below gets the header and footer from one file without any issue, but I want to do the same for multiple files. Please advise.
library(pdftools)
library(tm)
library(stringr)  # provides str_split() and re-exports %>%
#Multiple files in a directory
files<- list.files(pattern='pdf$')
#File header and footer extraction
pdf_22 <- pdf_text("Test_List.pdf") %>% str_split("\n")
for (i in 1:35) {
yy <- pdf_22[[i]][-5:-24]
}

You can try with lapply:
library(pdftools)
files<- list.files(pattern='pdf$')
lapply(files, function(x) {
lapply(strsplit(pdf_text(x), "\n"), `[`, -5:-24)
}) -> result
result
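
A minimal sketch of the same idea with the per-page trimming pulled out into a helper, so it can be checked without any PDF at hand. The helper name keep_header_footer and the synthetic page text are made up for illustration; the dropped line range 5 to 24 is the one from the question.

```r
# Hypothetical helper: takes a character vector shaped like the output of
# pdf_text() (one string per page) and drops lines 5 to 24 of every page.
keep_header_footer <- function(pages, drop = 5:24) {
  lapply(strsplit(pages, "\n"), function(lines) lines[-drop])
}

# Synthetic stand-in for pdf_text(): two "pages" of 26 lines each
fake_pages <- c(
  paste(paste("page1 line", 1:26), collapse = "\n"),
  paste(paste("page2 line", 1:26), collapse = "\n")
)
trimmed <- keep_header_footer(fake_pages)

# With real files (requires pdftools and PDFs in the working directory):
# files  <- list.files(pattern = "pdf$")
# result <- setNames(lapply(files, function(f) keep_header_footer(pdf_text(f))), files)
```

Naming the outer list after the files makes it easier to tell afterwards which header and footer came from which pdf.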

Related

How to loop through folders and apply function on files in these folders

I have multiple similarly named but different folders, each containing similarly named but different csv files.
For example, I have three folders named "output", each containing "image.csv" and "cells.csv".
How do I loop through each "output" folder, read each csv file in the folder, and apply a function to these files?
Here's what I tried:
Firstly, I list the folders named "output":
dirs<-list.dirs()
dirs<-dirs[grepl("output",dirs)]
Then I want to set up a function to join both csv files, something like below (codes are incomplete though, please help to correct this):
object_extraction<-function(x){ image<-read.csv(image.csv, header=T, sep=",")
cells<-read.csv(cells.csv, header=T, sep=",")
object<-dplyr::inner_join(cells,image,by="ImageNumber")
return(object)}
Finally I want to loop the function above through the "output" folders
object<-list()
for(i in 1:length(dirs)){
object[[i]]<-object_extraction(dirs[i])
}
Thank you
Make the path passed to read.csv dynamic in your function:
object_extraction<-function(x){
image<-read.csv(paste0(x, '/image.csv'), header=T, sep=",")
#header = TRUE and sep = ',' are the defaults in read.csv, so this
#works without specifying them as well.
cells<-read.csv(paste0(x, '/cells.csv'))
object<-dplyr::inner_join(cells,image,by="ImageNumber")
return(object)
}
and then apply the function to each folder.
dirs <- list.dirs(recursive=FALSE)
dirs <- grep('output', dirs, value = TRUE)
result <- lapply(dirs, object_extraction)
Two errors I can spot in your code:
You need to use the directory name from the dirs variable, e.g.:
object_extraction<-function(x){
image<-read.csv(file.path(x, "image.csv"), header=T, sep=",")
cells<-read.csv(file.path(x, "cells.csv"), header=T, sep=",")
object<-dplyr::inner_join(cells,image,by="ImageNumber")
return(object)
}
And the file names should be strings, "image.csv" and "cells.csv"
HTH
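
A self-contained sketch of the folder loop, assuming a throwaway directory tree built under tempdir(). The folder names run1 to run3 and the columns Width and Area are invented for illustration, and base merge() stands in for dplyr::inner_join() so that no extra package is needed to run it.

```r
# Build a small directory tree so the sketch runs anywhere
root <- file.path(tempdir(), "loop_demo")
for (run in c("run1", "run2", "run3")) {
  out <- file.path(root, run, "output")
  dir.create(out, recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(ImageNumber = 1:2, Width = c(100, 200)),
            file.path(out, "image.csv"), row.names = FALSE)
  write.csv(data.frame(ImageNumber = c(1, 1, 2), Area = c(5, 6, 7)),
            file.path(out, "cells.csv"), row.names = FALSE)
}

# Join the two files found in one folder; merge() with its default
# arguments keeps only matching rows, like an inner join
object_extraction <- function(x) {
  image <- read.csv(file.path(x, "image.csv"))
  cells <- read.csv(file.path(x, "cells.csv"))
  merge(cells, image, by = "ImageNumber")
}

# Keep every folder whose last path component is exactly "output"
dirs <- list.dirs(root, recursive = TRUE)
dirs <- dirs[basename(dirs) == "output"]

# Name the results after the folders so they can be told apart later
result <- setNames(lapply(dirs, object_extraction), dirs)
```

Matching on basename(dirs) rather than grepping the whole path avoids picking up folders that merely contain "output" somewhere in a parent directory name.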

Loading .rds datafile within a function to the environment

I'm new to functions and I would like to load my data with a function.
The function appears to be correct but the file does not save as a dataframe to the environment, while this does happen when it's not within the function.
This is my script:
read_testdata <- function(file) {
Dataset_test <- read_rds(here("foldername", file))
}
read_testdata("filename")
Can someone spot my error?
After some thinking I spotted my problem, the correct code should be this:
read_testdata <- function(file) {
read_rds(here("foldername", file))
}
Dataset_test <- read_testdata("filename.rds")
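
A small sketch of why the first version left nothing in the environment: a variable assigned inside a function is local to that call. To keep it runnable without extra packages, base readRDS() and a file under tempdir() are used here in place of readr::read_rds() and here(); the name read_local is made up for the comparison.

```r
# Save a small data frame to a temporary .rds file so the sketch is runnable
path <- file.path(tempdir(), "filename.rds")
saveRDS(data.frame(id = 1:3, value = c("a", "b", "c")), path)

# The assignment inside the function creates a variable that only exists
# during the call; nothing named Dataset_local is left behind afterwards
read_local <- function(file) {
  Dataset_local <- readRDS(file)
}
read_local(path)
local_visible <- exists("Dataset_local")

# Returning the value and assigning it at the call site keeps the data
read_testdata <- function(file) {
  readRDS(file)
}
Dataset_test <- read_testdata(path)
```

The function itself does not need to know the name of the object; the caller decides that with the assignment.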

How to remove file name extensions from the global environment

With the code below, I have imported all .txt files from the working directory.
temp=list.files(pattern = "\\.txt$")
for (i in 1:length(temp)) { assign(temp[i], read.delim(temp[i])) }
But all of the resulting object names still include the .txt extension.
How can I remove the .txt extension from all of the data names?
You can rename the variables in the for loop itself:
for (i in 1:length(temp)) {assign(sub("\\.txt$", "", temp[i]), read.delim(temp[i]))}
Or, if you have already imported the variables, change their names afterwards:
vals <- ls(pattern = "\\.txt$")
for (i in vals) { assign(sub("\\.txt$", "", i), get(i)) }
and then clean up the old names:
rm(list = vals)
On a side note, using assign is considered bad practice because of its potential dangers and side effects.
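
One common alternative to assign is to keep all the tables in a single named list. The sketch below assumes two small tab-delimited files written to a temporary folder; the file names and contents are invented for illustration.

```r
# Create two small tab-delimited files in a temporary folder
demo_dir <- file.path(tempdir(), "txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(x = 1:2, y = 3:4), file.path(demo_dir, "first.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(x = 5:7, y = 8:10), file.path(demo_dir, "second.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# "\\.txt$" matches a literal dot followed by txt at the end of the name
temp <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read every file into one list and name each element after its file,
# with the directory and the extension stripped
tables <- lapply(temp, read.delim)
names(tables) <- tools::file_path_sans_ext(basename(temp))

names(tables)
```

Individual tables are then available as tables$first or tables[["second"]], and the whole set can be processed with lapply without touching the global environment.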

Loading .txt files using a function in R

The following code works:
Name<-"s1521r0000_rd2.txt"
OneFile<-read.table(file=Name, sep="", skip=35, fill=TRUE )
But, I am trying to write a function that will load one .txt file so that I can load whatever .txt file I want. I wrote the following function which is not working:
ReadOneFile<-function(Name="s1521r0000_rd2.txt"){
OneFile<-read.table(file=Name, sep="", skip=35, fill=TRUE )
}
It would be great if you could help me.
You need to return the data from the function and assign the result to an object. For instance:
func.readonefile <- function(Name) {
thefile <- read.table(file=Name,sep="",skip=35,fill=TRUE)
return(thefile)
}
a_file <- func.readonefile(Name="s1521r0000_rd2.txt")
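
A runnable sketch of the same pattern. Since the original s1521r0000_rd2.txt is not available, a stand-in file with 35 header lines followed by whitespace-separated values is assumed; the name demo_rd2.txt and its contents are invented.

```r
# Write a stand-in file: 35 header lines followed by three rows of data
demo_file <- file.path(tempdir(), "demo_rd2.txt")
writeLines(c(paste("header line", 1:35), "1 2 3", "4 5 6", "7 8 9"), demo_file)

# The last evaluated expression is the return value, so no assignment
# inside the function is needed
ReadOneFile <- function(Name) {
  read.table(file = Name, sep = "", skip = 35, fill = TRUE)
}

# The result only persists if it is assigned where the function is called
OneFile <- ReadOneFile(demo_file)
dim(OneFile)
```

Leaving out the default value for Name forces the caller to say which file to load, which fits the goal of loading any .txt file.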

Read files by folder in R

I was trying to read all files in a folder using R, but I always got an error like this:
>folder<-"/Volumes/cphg/projects/PROVIDE/freeze" #working directory
>filelist<-list.files(folder) #all files in the directory
>data<-vector("list", length(filelist)) #empty list
>names(data)<-filelist
>for (name in filelist) {
+ data[[name]]<-read.table(paste(folder, name, sep="/"), header=T)
+}
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Does anybody know what's wrong here and how to fix it?
You can use tryCatch and return NULL if reading the file fails. Then you can Filter the results to exclude the NULLs:
L <- setNames(lapply(filelist, function(x) {
tryCatch(read.table(file.path(folder, x)), error=function(e) NULL)
}), filelist)
data <- Filter(NROW, L)
Just to make it clear, and to close the question properly:
The problem is that at least one file is empty. Check which file name is being read when the error is thrown.
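
A sketch that both finds the empty files up front and reads the rest safely. It assumes a temporary folder with one readable file and one zero-byte file; the folder name freeze_demo and the file names are invented for illustration.

```r
# Set up a folder with one readable file and one empty file
folder <- file.path(tempdir(), "freeze_demo")
dir.create(folder, showWarnings = FALSE)
writeLines(c("a b", "1 2", "3 4"), file.path(folder, "good.txt"))
file.create(file.path(folder, "empty.txt"))

filelist <- list.files(folder)

# A zero-byte file is what triggers "no lines available in input"
sizes <- file.size(file.path(folder, filelist))
empty_files <- filelist[sizes == 0]

# Read everything, turning read errors into NULL instead of stopping
L <- setNames(lapply(filelist, function(x) {
  tryCatch(read.table(file.path(folder, x), header = TRUE),
           error = function(e) NULL)
}), filelist)

# NROW(NULL) is 0, so Filter() drops the files that failed to read
data <- Filter(NROW, L)
```

Checking file.size first reports the culprits by name, while the tryCatch wrapper also covers files that fail for other reasons.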
