Converting pdf files into data.frames - r

I'm currently trying to create a function that will read many PDF files into a data frame. My ultimate goal is to have it read specific information from the PDF files and convert it into a data.frame with insurance plan names in each row and columns comprising the information I need, such as individual plan price, family plan price, etc. I have been following an answer given for a similar question in the past. However, I keep getting an error. Here is a link to two different files I am practicing on (1 and 2).
Here are my code and error below:
library(tm)  # readPDF() comes from the tm package

PDFtoDF = function(file) {
  dat = readPDF(control = list(text = "-layout"))(elem = list(uri = file),
                                                  language = "en", id = "id1")
  dat = c(as.character(dat))
  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)
  dat = gsub("(, HVOL )", "\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)
  excludeRows = lapply(gregexpr("\\|", dat), function(x) length(x)) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append = TRUE)
  dat = dat[!excludeRows]
  dat = read.table(text = dat, sep = "|", quote = "", stringsAsFactors = FALSE)
  names(dat) = c("Plan", "Individual", "Family")
  return(dat)
}
files <- list.files(pattern = "pdf$")
df = do.call("rbind", lapply(files, PDFtoDF))
Error in read.table(text = dat, sep = "", quote = "", stringsAsFactors =
FALSE) : no lines available in input
Before this approach, I was using the pdftools package and regular expressions. That approach worked, except that it was difficult to specify a pattern for some parts of the document, such as the plan name at the top. I was hoping the methodology I'm trying now will help, since it extracts the text into separate strings for me.
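The pipe-delimiting step in the function above can be tested in isolation on plain text, independent of PDF extraction. Here is a minimal sketch with invented sample lines (the plan names and prices are made up for illustration):

```r
# Invented sample lines resembling text extracted from a PDF with -layout:
# a plan name, then individual and family prices separated by wide gaps
lines <- c("Gold Plan      450.00      1200.00",
           "Silver Plan    300.00       800.00")

# Collapse runs of 2+ spaces into a pipe delimiter, as in the function above
piped <- gsub(" {2,}", "|", lines)

# Keep only rows with the expected number of fields (3 fields = 2 pipes)
ok <- lengths(gregexpr("\\|", piped)) == 2
dat <- read.table(text = piped[ok], sep = "|", quote = "",
                  stringsAsFactors = FALSE)
names(dat) <- c("Plan", "Individual", "Family")
```

One likely cause of the "no lines available in input" error is that every line fails the pipe-count filter and gets excluded, leaving read.table with empty input; inspecting rowsToCheck.txt should show which lines were dropped and why the pattern didn't match.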

Here's the best answer:
require(readtext)
df <- readtext("*.pdf")
Yes it's that simple, with the readtext package!

Related

Reduce unnecessary repeated reading from files in nested for loop R

I'm writing some R code to handle pairs of files: an Excel file and a csv (Imotions.txt). I need to extract a column from the Excel file and merge it into the csv, in pairs. Below is my abbreviated script. My script currently repeats the body of the nested for loop 4 times instead of just doing it once.
Basically, is there a general way to think about running some code over a paired set of files that I can translate to this and other languages?
excel_files <- list.files(pattern = ".xlsx", full.names = TRUE)
imotion_files <- list.files(pattern = "Imotions.txt", full.names = TRUE)

for (imotion_file in imotion_files) {
  for (excel_file in excel_files) {
    filename <- sub("_Imotions.txt", "", imotion_file)
    raw_data <- extract_raw_data(imotion_file)
    event_data <- extract_event_data(imotion_file)

    # convert times to milliseconds
    latency_ms <- as.data.frame(
      sapply(
        df_col_only_ones$latency,
        convert_to_ms,
        raw_data_first_timestamp = raw_data_first_timestamp
      )
    )

    # read in paradigm data
    paradigm_data <- read_excel(path = excel_file, range = "H30:H328")
    merged <- bind_cols(latency_ms, paradigm_data)

    print(paste("writing = ", filename))
    write.table(
      merged,
      file = paste(filename, "_EVENT", ".txt", sep = ""),
      sep = "\t",
      col.names = TRUE,
      row.names = FALSE,
      quote = FALSE
    )
  }
}
It is not entirely clear what some of the operations do. Here is an option in tidyverse:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

out <- crossing(excel_files, imotion_files) %>%
  mutate(filename = str_remove(imotion_files, "_Imotions.txt"),
         raw_data = map(imotion_files, extract_raw_data),
         event_data = map(imotion_files, extract_event_data),
         paradigm_data = map(excel_files, ~
           read_excel(.x, range = "H30:H328") %>%
             bind_cols(latency_ms, .)))
Based on the OP's code, latency_ms can be created once outside the loop and used while binding the columns.
Based on the naming of raw_data_first_timestamp, I'm assuming it's created by the extract_raw_data function - otherwise you can move latency_ms outside the loop entirely, as akrun mentioned.
If you don't want to use tidyverse, see the modified version of your code at the bottom. Notice that the loops have been broken apart to cut down on duplicated work.
Some general tips to improve efficiency when working with loops:
Before attempting to improve nested loop efficiencies, consider whether the loops can be broken out so that data from earlier loops is stored for usage in later loops. This can also be done with nested loops and variables tracking whether data has already been set, but it's usually simpler to break the loops out and negate the need for the tracking variables.
Create variables and call functions before the loop where possible. Depending on the language and/or compiler (if one is used), variable creation outside loops may not help with efficiency, but it's still good practice.
Variables and functions which must be created or called inside loops should be done in the highest scope - or the outermost loop - possible.
Disclaimer - I have never used R, so there may be syntax errors.
excel_files <- list.files(pattern = ".xlsx", full.names = TRUE)
imotion_files <- list.files(pattern = "Imotions.txt", full.names = TRUE)

# read in paradigm data once, outside the main loop
paradigm_data_list <- vector("list", length(excel_files))
for (i in seq_along(excel_files)) {
  paradigm_data_list[[i]] <- read_excel(path = excel_files[[i]], range = "H30:H328")
}

for (imotion_file in imotion_files) {
  filename <- sub("_Imotions.txt", "", imotion_file)
  raw_data <- extract_raw_data(imotion_file)
  event_data <- extract_event_data(imotion_file)

  # convert times to milliseconds
  latency_ms <- as.data.frame(
    sapply(
      df_col_only_ones$latency,
      convert_to_ms,
      raw_data_first_timestamp = raw_data_first_timestamp
    )
  )

  for (paradigm_data in paradigm_data_list) {
    merged <- bind_cols(latency_ms, paradigm_data)
    print(paste("writing = ", filename))
    write.table(
      merged,
      file = paste(filename, "_EVENT", ".txt", sep = ""),
      sep = "\t",
      col.names = TRUE,
      row.names = FALSE,
      quote = FALSE
    )
  }
}
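Whichever version you use, the underlying pairing question generalizes: derive a shared key from each filename, then match on the key rather than iterating over the full cross product. A base-R sketch of that idea (the filenames and suffixes are invented for illustration):

```r
# Invented filenames sharing a subject prefix
imotion_files <- c("subj1_Imotions.txt", "subj2_Imotions.txt")
excel_files   <- c("subj2_paradigm.xlsx", "subj1_paradigm.xlsx")

# Derive the pairing key by stripping each file type's suffix
imotion_keys <- sub("_Imotions\\.txt$", "", imotion_files)
excel_keys   <- sub("_paradigm\\.xlsx$", "", excel_files)

# For each imotion file, look up the Excel file with the same key
pairs <- data.frame(imotion = imotion_files,
                    excel   = excel_files[match(imotion_keys, excel_keys)],
                    stringsAsFactors = FALSE)
```

With the pairs table built, a single loop over its rows processes each matched pair exactly once, which also makes mismatches (NA in the excel column) easy to spot up front.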

Exporting two data frames into same file

I currently have two data frames: one contains around ~100,000 rows, while the other has only ~1,000. I can export either one using the write.table function shown below...
write.table(DF_1, file = "DF_one.csv",
            row.names = FALSE, col.names = TRUE, sep = ",")
This is easily opened by Excel and works well. The problem is that I need to include the other data frame in the very same Excel file, and I'm not sure how to do this or whether it is even possible.
I am open to any ideas, and have provided some example data to work with below.
# Example data for data frame one, length = 30
Dates <- c(Sys.Date() + 1:30)
Data1 <- c(1 + 1:30)

# Data frame one
Df1 <- data.frame(Dates, Data1)

# Example data for data frame two, length = 10
Letters <- c(letters[1:10])
Data2 <- c(1:10)

# Data frame two
Df2 <- data.frame(Letters, Data2)

# Now, is there a way we can export both to the same file?
# Here is the export for just data frame one
write.table(Df1, file = "DFone.csv",
            row.names = FALSE, col.names = TRUE, sep = ",")
Any ideas, including "stop being picky and just export 2 files and then merge in Excel", are appreciated.
Research Done:
I like this approach but would prefer a horizontal format instead of vertical
(I should probably just not be picky)
How to merge multiple data frame into one table and export to Excel?
How to write multiple tables, dataframes, regression results etc - to one excel file?
Thanks for all the help!
I have no idea if this preserves the information structure that you want, but if you are really intent on getting them into the same table you could do the following.
Both <- data.frame(Df1, Df2)
write.table(Both, file = "DF_Both.csv",
            row.names = FALSE, col.names = TRUE, sep = ",")
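One caveat: data.frame(Df1, Df2) relies on recycling, so it only works when the longer length is a multiple of the shorter one (30 and 10 here, so it happens to work). A more general horizontal layout pads the shorter frame with NA rows first. A base-R sketch:

```r
# Same example frames as above: 30 rows vs 10 rows
Df1 <- data.frame(Dates = Sys.Date() + 1:30, Data1 = 1 + 1:30)
Df2 <- data.frame(Letters = letters[1:10], Data2 = 1:10)

# Pad both frames with NA rows up to the longer length, then bind side by side;
# indexing a data frame beyond its last row yields all-NA rows
n <- max(nrow(Df1), nrow(Df2))
pad <- function(df, n) df[seq_len(n), , drop = FALSE]
Both <- cbind(pad(Df1, n), pad(Df2, n))
rownames(Both) <- NULL

write.table(Both, file = "DF_Both.csv",
            row.names = FALSE, col.names = TRUE, sep = ",")
```

This works for any pair of lengths, and the trailing NA cells simply show up as empty cells when the csv is opened in Excel.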
Because the first solution did not meet your requirements here is another one that saves data frames to multiple tabs of an excel spreadsheet.
install.packages("xlsx")
library(xlsx)
### Define the save.xlsx function
save.xlsx <- function(file, ...) {
  require(xlsx, quietly = TRUE)
  objects <- list(...)
  fargs <- as.list(match.call(expand.dots = TRUE))
  objnames <- as.character(fargs)[-c(1, 2)]
  nobjects <- length(objects)
  for (i in 1:nobjects) {
    if (i == 1)
      write.xlsx(objects[[i]], file, sheetName = objnames[i])
    else
      write.xlsx(objects[[i]], file, sheetName = objnames[i],
                 append = TRUE)
  }
  print(paste("Workbook", file, "has", nobjects, "worksheets."))
}
### Save the file to your working directory.
save.xlsx("WorkbookTitle.xlsx", Df1, Df2)
Full disclosure: this was adapted from another question on Stack Overflow, R dataframes to multi sheet Excel Work.

Creating many sorted data frames from a single larger frame

Current dilemma: I have a massive data frame that I am trying to break down into smaller files based on a partial string match in the column. I have made a script that works great for this:
library(dplyr)
df <- read.csv("file.csv", header = TRUE, sep = ",")
newdf <- select(df, matches("threshold1"))
write.csv(newdf, "threshold1.file.csv", row.names = FALSE)
The problem is that I have hundreds of thresholds to break apart into separate files. There must be a way I can loop this script to create all the files for me rather than manually editing the script to say threshold2, threshold3, etc.
You can try to solve it with lapply.
# Function that splits and saves the data frame
split_df <- function(threshold, df) {
  newdf <- select(df, matches(threshold))
  write.csv(newdf,
            paste(threshold, ".file.csv", sep = ""), row.names = FALSE)
  return(threshold)
}

df <- read.csv("file.csv", header = TRUE, sep = ",")

# Number of thresholds
N <- 100
threshold_l <- paste("threshold", 1:N, sep = "")
lapply(threshold_l, split_df, df = df)
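If you prefer to avoid the dplyr dependency, the same column selection can be done in base R with grepl on the column names. A minimal sketch on invented data (the column names here are made up for illustration):

```r
# Invented data: columns tagged with threshold identifiers
df <- data.frame(threshold1_a = 1:3, threshold1_b = 4:6, threshold2_a = 7:9)

# Base-R equivalent of select(df, matches(threshold))
split_df <- function(threshold, df) {
  newdf <- df[, grepl(threshold, names(df), fixed = TRUE), drop = FALSE]
  write.csv(newdf, paste0(threshold, ".file.csv"), row.names = FALSE)
  newdf
}

out <- split_df("threshold1", df)
```

Note that a plain substring match means "threshold1" would also match a column named "threshold10_a"; if that is a risk, anchor the pattern (for example grepl("^threshold1_", names(df))) instead of using fixed = TRUE.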

How do I loop through a number of variables and perform operations on them in R?

Suppose that I have 30 tsv files of Twitter data, say for Google, Facebook, LinkedIn, etc. I want to perform a set of operations on all of them, and was wondering if I can do so using a loop.
Specifically, I know that I can create variables using a loop, such as
index = c("fb", "goog", "lkdn")
for (i in 1:length(index)) {
  file_name = paste(index[i], ".data", sep = "")
  assign(file_name, read.delim(paste(index[i], "-tweets.tsv", sep = ""),
                               header = TRUE, stringsAsFactors = FALSE))
}
But how do I perform operations to all these data files in the loop? For example, if I want to order the datafiles using data[order(data[,4]), ], how do I make sure that the data file name is changed in each iteration of the loop? Thanks!
Build a function that does all of the operations you need, then create a loop that calls that function instead. If you insist on using assign to create lots of variables (not a great practice, for this very reason), then try something like:
files <- dir("path/to/files", pattern = "*.tsv")

fileFunction <- function(x) {
  df <- read.delim(x, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  df <- df[order(df[, 4]), ]
  return(df)
}

for (a in files) {
  assign(a, fileFunction(a))
}
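A more idiomatic alternative to assign is to keep the data frames in a single named list, which keeps them easy to iterate over afterwards. A sketch of the same idea, written so it runs end to end on two small invented .tsv files created in a temporary directory:

```r
# Create two small invented .tsv files for illustration
tmp <- tempdir()
f1 <- file.path(tmp, "fb.tsv")
f2 <- file.path(tmp, "goog.tsv")
write.table(data.frame(a = 1:3, b = c(2, 1, 3), c = 0, d = c(30, 10, 20)),
            f1, sep = "\t", row.names = FALSE)
write.table(data.frame(a = 4:6, b = 1, c = 0, d = c(5, 15, 10)),
            f2, sep = "\t", row.names = FALSE)

files <- c(f1, f2)

fileFunction <- function(x) {
  df <- read.delim(x, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  df[order(df[, 4]), ]   # sort by the fourth column
}

# One named list instead of many assign()ed variables
dataList <- setNames(lapply(files, fileFunction), basename(files))
```

Later operations then become another lapply over dataList, instead of having to reconstruct variable names with get().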

How can I write out multiple files with different filenames in R

I have one BIG file (>10,000 lines of data) and I want to write out a separate file by ID. I have 50 unique ID names and I want a separate text file for each one. Here's what I've got so far, and I keep getting errors. My ID is actually a character string, and it would be best if I could name each file after that character string.
for (i in 1:car$ID) {
  a <- data.frame(car[, i])
  carib <- car1[, c("x", "y", "time", "sd")]
  myfile <- gsub("( )", "", paste("C:/bridge", carib, "_", i, ".txt"))
  write.table(a, file = myfile,
              sep = "", row.names = FALSE, col.names = TRUE,
              quote = FALSE, append = FALSE)
}
One approach would be to use the plyr package and the d_ply() function. d_ply() expects a data.frame as input. You also provide the column(s) you want to slice that data.frame by, so each slice is operated on independently of the others. In this case, you have the column ID. This specific function does not return an object, and is thus useful for plotting, writing files iteratively, etc. Here's a small working example:
library(plyr)
dat <- data.frame(ID = rep(letters[1:3], 2), x = rnorm(6), y = rnorm(6))
d_ply(dat, "ID", function(x)
  write.table(x, file = paste(x$ID[1], "txt", sep = "."),
              sep = "\t", row.names = FALSE))
This will generate three tab-separated files with the ID column as the name of the files (a.txt, b.txt, c.txt).
EDIT - to address follow up question
You could always subset the columns you want before passing it into d_ply(). Alternatively, you can use/abuse the [ operator and select the columns you want within the call itself:
dat <- data.frame(ID = rep(letters[1:3], 2), x = rnorm(6), y = rnorm(6),
                  foo = rnorm(6))
d_ply(dat, "ID", function(x)
  write.table(x[, c("x", "foo")], file = paste(x$ID[1], "txt", sep = "."),
              sep = "\t", row.names = FALSE))
For the data frame called mtcars separated by mtcars$cyl:
lapply(split(mtcars, mtcars$cyl),
       function(x) write.table(x, file = paste(x$cyl[1], ".txt", sep = "")))
This produces "4.txt", "6.txt", "8.txt" with the corresponding data. This should be faster than looping/subsetting since the subsetting (splitting) is vectorized.
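To see what split() hands to the write function before any files are written, you can inspect its result interactively; mtcars is built into R, so this runs as-is:

```r
# split() returns a named list, one data frame per unique value of cyl
chunks <- split(mtcars, mtcars$cyl)
names(chunks)          # "4" "6" "8"
sapply(chunks, nrow)   # 11 7 14
```

The list names become the file names in the lapply call above, and the per-group row counts confirm every row of the original frame lands in exactly one output file.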
