Extracting data from multiple pdf files in R


Below is the code to create some test data:
df <- data.frame(page_id = c(3,3,3,3,3), element_id = c(19, 22, 26, 31, 31),
text = c("The Protected Percentage of your property value thats has been chosen is 0%",
"The Arrangement Fee payable at complettion: £50.00",
"Interest rate is fixed for the life of the period is: 5.40%",
"The Benchmark rate that will be used to calculate any early repayment 2.08%",
"The property value used in this scenario is 275,000.00"))
I have many pdf files from which I want to extract the same information using regular expressions. I have managed to extract all the information I require from 1 pdf file so far. Below is the code for it - with comments:
library("textreadr")
library("pdftools")
library("tidyverse")
library("tidytext")
library("tm")
# read in the PDF file
Off_let_data <- read_pdf("50045400_K021_2017-V001_300547.pdf")
# read all pdf file from a folder
files <- list.files(pattern = "pdf$")[1]
# extract the account number from the first pdf file
acc_num <- str_extract(files, "^\\d+")
# The RegEx's used to extract the relevant information
protec_per_reg <- "Protected\\sP\\w+\\sof"
Arr_Fee_reg <- "^The\\sArrangement\\sF\\w+"
Fix_inter_reg <- "Fixed\\sI\\w+\\sR\\w+"
Bench_rate_reg <- "Benchmark\\sR\\w+\\sthat"
# create a df that only includes the rows which match the above RegEx
Off_let <- Off_let_data %>% filter(page_id == 3, str_detect(Off_let_data$text, protec_per_reg)|
str_detect(Off_let_data$text, Arr_Fee_reg) | str_detect(Off_let_data$text, Fix_inter_reg) |
str_detect(Off_let_data$text, Bench_rate_reg))
# Now only extract the numbers from the above DF
off_let_num <- str_extract(Off_let$text, "\\d+\\.?\\d+")
# The first element is always a NA value - based on the structure of these PDF files
# replace the first element of this character vector with the below
off_let_num[is.na(off_let_num)] <- str_extract(Off_let$text, "\\d+%")[[1]]
off_let_num
The off_let_num variable is a vector with the 4 elements that are required from the pdf file.
Now I would like to apply all these steps to a folder that includes many pdf files. So far, I have managed to read all the pdf files into separate data frames; the code for this is below:
# read all pdf files into a list
file_list <- list.files(pattern = "\\.pdf$")
# Read in all the pdf files into separate data frames
for (file_name in file_list) {
assign(paste0("off", "_", sub("\\.pdf$", "", file_name)), read_pdf(file_name))
}
I now have many data frames in my workspace. I would like to apply the same process I applied to one pdf file at the start to all the data frames beginning with 'off'.
I guess the way to go would be to convert the initial process into a function and then call this function to be applied to all the data frames beginning with 'off'. The results should be appended to a data frame which should include all the elements (4) extracted from these pdf files.
I'm not sure how to achieve this. Please help!
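One way to sketch the "wrap it in a function" idea in base R is shown below. The helper names (extract_offer_values, extract_all), the simplified patterns and the demo data frame are assumptions made for illustration; the patterns are written against the test data above, not against the real PDF text, so they would probably need adjusting.

```r
# Made-up stand-in for one parsed PDF, shaped like the test data in the question
demo <- data.frame(
  page_id = c(3, 3, 3, 3, 3),
  text = c(
    "The Protected Percentage of your property value that has been chosen is 0%",
    "The Arrangement Fee payable at completion: \u00a350.00",
    "Interest rate is fixed for the life of the period is: 5.40%",
    "The Benchmark rate that will be used to calculate any early repayment 2.08%",
    "The property value used in this scenario is 275,000.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the four figures out of one parsed document.
# Assumes `doc` has the columns page_id and text.
extract_offer_values <- function(doc, page = 3) {
  patterns <- c(
    protected_pct   = "Protected\\s+Percentage",
    arrangement_fee = "Arrangement\\s+Fee",
    interest_rate   = "Interest\\s+rate",
    benchmark_rate  = "Benchmark\\s+rate"
  )
  page_text <- doc$text[doc$page_id == page]
  vapply(patterns, function(p) {
    hit <- grep(p, page_text, ignore.case = TRUE, perl = TRUE, value = TRUE)
    if (length(hit) == 0) return(NA_character_)
    # keep the last number on the first matching line
    nums <- regmatches(hit[1],
                       gregexpr("\\d[\\d,]*(?:\\.\\d+)?", hit[1], perl = TRUE))[[1]]
    if (length(nums) == 0) NA_character_ else nums[length(nums)]
  }, character(1))
}

# Hypothetical helper: one row per document, with the list names as account numbers
extract_all <- function(docs) {
  rows <- lapply(docs, extract_offer_values)
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  out <- cbind(account = names(docs), out, stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

result <- extract_all(list("50045400" = demo, "50045401" = demo))
print(result)
```

With real files, the list passed to extract_all could be built with lapply(file_list, read_pdf) and named with str_extract(file_list, "^\\d+"), which avoids creating one data frame per file with assign().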

Related

Reading in multiple srt files

I'd like to read in multiple srt files in R. I can read them into a list but I need to load them in sequentially by the way they were created in the file directory.
I'd also like to make a column to tell which file they come from. So I can tell which data came from file 1, file 2.. etc.
I can read them in as a list, but the files are named like "1 - FileTest", "2 - FileTest", "#10 FileTest", etc.
The list then loads in the order 1, 10, 11, etc., even though, for instance, file 11 was created after file 9 in my file directory. I just need a way to load them sequentially, so that when I put them in a dataframe they appear in chronological order.
list_of_files <- list.files(path=path,
pattern = "*.srt",
full.names = TRUE)
Files <- lapply(list_of_files, srt.read)
Files <- data.frame(matrix(unlist(Files), byrow=T),stringsAsFactors=FALSE)
The files load in, but not in chronological order, so it is difficult to tell which data is associated with which file.
I have approximately 150 files so being able to compile them into a single dataframe would be very helpful. Thanks!

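Consider extracting the files' metadata with file.info (includes created/modified time, file size, owner, group, etc.). Then order the resulting data frame by created date/time, and finally import the .srt files using the ordered list of files. Note that ctime is the creation time on Windows but the last status change on Unix-alikes, so mtime may be the better column to sort on there: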
raw_list_of_files <- list.files(path=path,
pattern = "\\.srt$",
full.names = TRUE)
# CREATE DATA FRAME OF FILE INFO
meta_df <- file.info(raw_list_of_files)
# SORT BY CREATED DATE/TIME
meta_df <- with(meta_df, meta_df[order(ctime),])
# IMPORT DATA FRAMES IN ORDERED FILES
srt_list <- lapply(row.names(meta_df), srt.read)
final_df <- data.frame(matrix(unlist(srt_list), byrow=TRUE),
stringsAsFactors=FALSE)
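If the numbers in the file names already reflect the order you want, sorting on them may be more robust than timestamps, and the same pass can add a column saying which file each row came from. A minimal sketch, assuming every file name contains a number; the file names and toy_reader (a stand-in for srt.read) are made up for the example:

```r
# Example file names (made up, following the pattern in the question)
files <- c("1 - FileTest.srt", "10 - FileTest.srt",
           "#11 FileTest.srt", "2 - FileTest.srt")

# Pull the first run of digits out of each name and sort numerically
file_num <- as.integer(sub("^\\D*(\\d+).*$", "\\1", basename(files), perl = TRUE))
ordered_files <- files[order(file_num)]

# Stand-in for srt.read: returns a one-row data frame per file
toy_reader <- function(path) {
  data.frame(text = paste("contents of", path), stringsAsFactors = FALSE)
}

# Read each file in order and tag its rows with a file id and file name
tagged <- Map(function(path, id) {
  d <- toy_reader(path)
  d$file_id <- id
  d$file_name <- basename(path)
  d
}, ordered_files, seq_along(ordered_files))

combined <- do.call(rbind, tagged)
print(combined[, c("file_id", "file_name")])
```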

how to use "for loop" to write multiple .csv file names?

Does anyone know the best way to carry out a "for loop" that would read in different subject id's and append them to the name of an exported csv?
As an example, I have multiple output files from an electrocardiogram software program (each file belongs to one individual). The files are named C800_HR.bdf.evt, C801_HR.bdf.evt, C802_HR.bdf.evt etc. Each file gets read into r and then has a script applied to calculate heart rate variability. At the end of the script, I need to add a loop that will extract the subject id (e.g., C800, C801, C802) and write a new file name for each individual so that it becomes C800_RtoR.csv. Essentially, I would like to avoid changing the syntax every time I read in and export a file name.
I am currently using the following syntax to read in multiple files:
setwd("/Users/kmpc/Downloads")
myhrvdata <- lapply(Sys.glob("C8**_HR.bdf.evt"), read.delim)

Try this out:
cardio_files <- list.files(pattern = "C8\\d{2}_HR\\.bdf\\.evt$")
subject_ids <- sub("^(C8\\d{2})_.*", "\\1", cardio_files)
myList <- lapply(cardio_files, read.delim)
## do calculations on the list
# myList has no names, so loop over positions rather than names(myList)
for (i in seq_along(myList)) {
write.csv(myList[[i]], paste0(subject_ids[i], "_RtoR.csv"))
}
The only thing is, you have to deal with using a list when doing your calculations. You could combine them to a single data.frame, but it would be best to leave it as a list to write the files at the end.

Consider generalizing your process by creating a function that: 1) reads in file, 2) processes data, 3) outputs to csv. Then have lapply call the defined method iteratively across all Sys.glob items and even return a list of calculated data frames.
proc_heart_rate <- function(f_name) {
# READ IN .evt FILE INTO df
df <- read.delim(f_name)
# CALCULATE HEART RATE VARIABILITY WITH df
...
# OUTPUT df TO CSV
subject_id <- gsub("\\_.*", "", f_name)
write.csv(df, paste0(subject_id, "_RtoR.csv"))
# RETURN df FOR OTHER USES
return(df)
}
# LIST OF DATA FRAMES WITH CALCULATIONS
myhrvdata_list <-lapply(Sys.glob("C8**_HR.bdf.evt"), proc_heart_rate)
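To try the read, process, export pattern without touching real data, here is a small self-contained sketch that works in a temporary folder. The function name export_subject, the demo column and the two fake input files are assumptions for illustration, and the heart rate variability step is left as a placeholder comment.

```r
# Work in a temporary folder so the sketch leaves the working directory alone
tmp <- tempfile("hrv_demo")
dir.create(tmp)

# Write two tiny fake input files named like the ones in the question
for (id in c("C800", "C801")) {
  write.table(data.frame(time = c(0.8, 1.6, 2.5)),
              file.path(tmp, paste0(id, "_HR.bdf.evt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Hypothetical helper: read one file, (process it), and export it under the subject id
export_subject <- function(path, out_dir = dirname(path)) {
  dat <- read.delim(path)
  # placeholder: the real heart rate variability calculation would go here
  subject_id <- sub("_.*$", "", basename(path))  # "C800_HR.bdf.evt" -> "C800"
  out_file <- file.path(out_dir, paste0(subject_id, "_RtoR.csv"))
  write.csv(dat, out_file, row.names = FALSE)
  out_file
}

inputs <- list.files(tmp, pattern = "^C8\\d{2}_HR\\.bdf\\.evt$", full.names = TRUE)
written <- vapply(inputs, export_subject, character(1), USE.NAMES = FALSE)
print(basename(written))
```

Taking the subject id from basename(path) rather than from the full path keeps the helper working when the files are listed with full.names = TRUE.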

Convert XML Doc to CSV using R - Or search items in XML file using R

I have a large, unorganized XML file that I need to search to determine whether certain ID numbers are in the file. I would like to use R to do so, but because of the format I am having trouble converting it to a data frame, or even a list, to export to a csv. I figured I could search easily if it were in csv format. So I need help understanding how to convert and extract it properly, or how to search the document for values using R. Below is the code I have used to try to convert the doc, but several errors occur with my various attempts.
## Method 1. I tried to convert to a data frame, but the columns are not all the same length.
require(XML)
require(plyr)
file<-"EJ.XML"
doc <- xmlParse(file,useInternalNodes = TRUE)
xL <- xmlToList(doc)
data <- ldply(xL, data.frame)
datanew <- read.table(data, header = FALSE, fill = TRUE)
## Method 2. I tried to convert it to a list and the file extracts but only lists 2 words on the file.
data<- xmlParse("EJ.XML")
print(data)
head(data)
xml_data<- xmlToList(data)
class(data)
topxml <- xmlRoot(data)
topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
row.names=NULL)
write.csv(xml_df, file = "MyData.csv",row.names=FALSE)
I am going to do some research on how to search within R as well, but I assume the file needs to be in a data frame or list to do so either way. Any help is appreciated! Attached is a screenshot of the data. I am interested in finding entity id numbers that match a list I have in an Excel doc.

R text mining documents from multiple txt files

I have multiple txt files, each referring to a different month of the year (for many years). How can I analyze each of these files separately (text mining) within a single corpus (or something similar) while keeping track of the month-year reference? Thank you.

Here is an example I programmed for Game of Thrones subtitles. The subtitles come as 60 text files, one file per episode with names of the form S01E01, and we wanted to keep the episode information.
The following code will read the files into a list, and will turn it into a dataframe with episode information and text. You will have to adapt it to your own problem.
library(plyr)
####### Read data ######
filenames <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=TRUE)
filenames_short <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=FALSE)
ldf <- alply(.data=filenames,.margins=1,.fun=scan,what = "character", quiet = T, quote = "")
names(ldf) <- filenames_short
# Loop over all filenames
# Turns list into two columns of a dataframe, episode and word
# create empty dataframe
df_got_subs <- as.data.frame(NULL)
for (i in 1:60) {
# extract listname
# vector with list name
listenname <- filenames_short[i]
vec_listenname <- rep.int(listenname,length(ldf[[i]]))
# Doublecheck
cat("listenname: ",listenname,"\n")
# turn list element into vector
vec_subs <- as.vector(ldf[[i]])
# create dataframe from vectors
df_subs <- cbind.data.frame(vec_listenname,vec_subs,stringsAsFactors=FALSE)
# attach to the "big" dataframe
df_got_subs <- rbind.data.frame(df_got_subs,df_subs)
}
# test datastructure
str(df_got_subs)
# change column names
colnames(df_got_subs) <- c("episode","subs")
We did the actual text mining with the tidytext package by Julia Silge. I didn't post the code because she gives much better examples in this post:
http://juliasilge.com/blog/Life-Changing-Magic/
I hope this helps with your problem.
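For the month-year case, the same idea can be sketched in base R without growing a data frame inside a loop. This is only an illustration: it assumes the files are named like "2015-01.txt" (a made-up convention) and creates two small temporary files to work on.

```r
# Create two small example files in a temporary folder (made-up content)
tmp <- tempfile("corpus")
dir.create(tmp)
writeLines("winter is coming", file.path(tmp, "2015-01.txt"))
writeLines("summer is here again", file.path(tmp, "2015-07.txt"))

paths <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# One data frame per file: the period taken from the file name, one word per row
words_by_file <- lapply(paths, function(p) {
  words <- scan(p, what = "character", quiet = TRUE, quote = "")
  data.frame(period = sub("\\.txt$", "", basename(p)),
             word = words,
             stringsAsFactors = FALSE)
})

# Stack everything into one long data frame
corpus_df <- do.call(rbind, words_by_file)
print(table(corpus_df$period))
```

A long data frame with one word per row and a period column is also the shape that tidytext-style analyses usually start from, so grouping by period keeps each month separate.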

Applying an R script to multiple files

I have an R script that reads a certain type of file (nexus files of phylogenetic trees), whose name ends in *.trees.txt. It then applies a number of functions from an R package called bGMYC, available here and creates 3 pdf files. I would like to know what I should do to make the script loop through the files for each of 14 species.
The input files are in a separate folder for each species, but I can put them all in one folder if that facilitates the task. Ideally, I would like to output the pdf files to a folder for each species, different from the one containing the input file.
Here's the script
# Call Tree file
trees <- read.nexus("L_boscai_1411_test2.trees.txt")
# To use with different species, substitute "L_boscai_1411_test2.trees.txt" by the path to each species tree
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
pdf('results_single_boscai.pdf')
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
pdf('results_boscai.pdf') # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
pdf('trees_boscai.pdf') # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
I've read several posts with a similar question, but they almost always involve .csv files, refer to multiple files in a single folder, have a simpler script or do not need to output files to separate folders, so I couldn't find a solution to my specific problem.
Should I use a for loop, or could I create a function out of this script and use lapply or another sort of apply? Could you provide me with sample code for your proposed solution or point me to a tutorial or another reference?
Thanks for your help.

It really depends on the way you want to run it.
If you are using Linux / command line job submission, it might be best to look at
How can I read command line parameters from an R script?
If you are using a GUI (RStudio...) you might not be familiar with this, so I would solve the problem as a function or a loop.
First, get all your file names.
files = list.files(path = "your/folder")
# Now you have list of your file name as files. Just call each name one at a time
# and use for loop or apply (anything of your choice)
And since you would need to name the pdf files, you can use your file name or an index (e.g. the loop counter) and append it to the desired file name (e.g. paste0("single_boscai", i)).
In your case,
files = list.files(path = "your/folder")
# Use pattern = "" if you want to do string matching, and extract
# only matching files from the source folder.
genPDF = function(input) {
# Read the file
trees <- read.nexus(input)
# Store the index (numeric)
index = which(files == input)
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
outname = paste('results_single_boscai', index, '.pdf', sep = "")
pdf(outname)
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
outname = paste('results_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
outname = paste('trees_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
}
for (i in 1:length(files)) {
genPDF(files[i])
}
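To get one output folder per species, as asked in the question, the output paths could be derived from the input file name instead of a numeric index. A minimal sketch, assuming the species label is everything before ".trees.txt" in the file name; species_outputs and the example paths are hypothetical, and no bGMYC functions are called here:

```r
# Hypothetical helper: build (and create) a per-species output folder and
# return the three pdf paths that the script writes
species_outputs <- function(tree_file, out_root) {
  species <- sub("\\.trees\\.txt$", "", basename(tree_file))
  out_dir <- file.path(out_root, species)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  c(single = file.path(out_dir, paste0("results_single_", species, ".pdf")),
    multi  = file.path(out_dir, paste0("results_", species, ".pdf")),
    trees  = file.path(out_dir, paste0("trees_", species, ".pdf")))
}

# Example with a made-up input path; output goes under the session temp folder
out <- species_outputs("input/L_boscai/L_boscai_1411_test2.trees.txt",
                       file.path(tempdir(), "bgmyc_out"))
print(basename(out))
```

Inside genPDF the three pdf() calls would then use out[["single"]], out[["multi"]] and out[["trees"]]. Input files spread over one folder per species can be collected with list.files(path, pattern = "\\.trees\\.txt$", recursive = TRUE, full.names = TRUE).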
