Converting Multiple PDF files into Text (R Language) - r

I am using the "tesseract" library in R to convert "PDF files into text", like shown over here: https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
library(pdftools)
library(tesseract)
pngfile <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)
The above code works perfectly. Now, I am trying to "mass upload" a large number of PDF files and convert them into text. Currently, I have figured out how to do this manually:
#import and convert 1st file
pngfile_1 <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text_1 <- tesseract::ocr(pngfile_1)
#import and convert 2nd file (note: the files do not have the same naming convention)
pngfile_2 <- pdftools::pdf_convert('second_file.pdf', dpi = 600)
text_2 <- tesseract::ocr(pngfile_2)
etc
I copied/pasted the above code 50 times (while changing the "index", i.e. pngfile_i, text_i) and was able to accomplish what I wanted to do. However, I am looking for a somewhat "automatic" way to import and convert all the pdf files.
At the moment, all my pdf files are in the following folder:
"C:/Users/me/Documents/mypdfs"
I found the following code which can be used to "mass import" pdf files into R:
library(dplyr)
library(data.table)
tbl_fread <-
list.files(pattern = "*.pdf") %>%
map_df(~fread(.))
But I am not sure how to instruct this code to import all pdf's from the correct directory ("C:/Users/me/Documents/mypdfs"). I also don't know how to instruct R to "rename" each imported pdf as "pdf_1, pdf_2, etc."
If all the pdf files were correctly imported and created, I could then write a "loop" and execute the desired commands, e.g.
# "n" would be the total number of pdf files
for (i in 1:n)
{
pngfile_i <- pdftools::pdf_convert('myfile_i.pdf', dpi = 600)
text_i <- tesseract::ocr(pngfile_i)
}
Can someone please show me how to do this?
Thanks

You can add full.names = TRUE in your list.files-function, but this assumes that "C:/Users/me/Documents/mypdfs" is contained within your project.
Alternatively, you can use path = "Documents/mypdfs" with full.names = TRUE, which points list.files at mypdfs and returns the full path of each file.
list.files(
path = "Documents/mypdfs",
full.names = TRUE,
pattern = "\\.pdf$" # pattern is a regular expression, not a glob
)
To save them under names of the form pdf_n, you can use paste0 along with map. Here I used data.frames to provide an example, as I do not work with pdfs and have none in bulk that I am willing to process.
library(tidyverse)
1:length(tbl_fread) %>% map(
.f = function(i) {
# Your regular function
# related to PDF
# Saving according to desired names
write.table(
tbl_fread[[i]],
file = paste0("pdf_", i, ".csv")
)
}
)
To verify it works as intended, we can read it accordingly,
read.table(
file = "pdf_1.csv"
)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
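For the PDF case in the question, the same idea can be sketched as follows. This is a minimal sketch, not a tested pipeline: pdf_to_text and ocr_folder are hypothetical helper names, and it assumes that the pdftools and tesseract packages are installed and that every file ending in .pdf in the folder should be converted. The result is a named list (pdf_1, pdf_2, ...) with one character vector of page texts per file.

```r
# Hypothetical helper: render one PDF to PNG pages and OCR them.
# Needs the pdftools and tesseract packages; returns one string per page.
pdf_to_text <- function(path, dpi = 600) {
  png_files <- pdftools::pdf_convert(path, dpi = dpi)
  on.exit(unlink(png_files), add = TRUE)  # remove the intermediate PNGs
  tesseract::ocr(png_files)
}

# Hypothetical helper: apply `convert` to every PDF found in `dir`.
ocr_folder <- function(dir, convert = pdf_to_text) {
  paths <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  texts <- lapply(paths, convert)
  # Name the results pdf_1, pdf_2, ... in the order the files were listed
  names(texts) <- paste0("pdf_", seq_along(paths))
  texts
}

# Usage (not run here):
# texts <- ocr_folder("C:/Users/me/Documents/mypdfs")
# cat(texts$pdf_1)
```

Keeping the results in one list avoids creating 50 separate text_i variables; texts[[i]] or texts$pdf_i gives the text of the i-th file.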

Related

Using glue-like constructs on RHS in R/Tidyeval

I've spent hours trying to make glue work on the RHS of a formula and am out of ideas. Here is a simple reprex.
meta <- function(x, var, suffix){
x<- x %>% mutate("{{var}}_{suffix}":= 5)
x<- x %>% mutate("{{var}}_{suffix}_new":= {{var}} - "{{var}}_{suffix}")
}
x<- meta(mtcars, mpg, suf)
#Should be equivalent to
x<- mtcars %>% mutate(mpg_suf:= 5)
x<- x%>% mutate(mpg_suf_new:= mpg - mpg_suf)
#N: Tried https://stackoverflow.com/questions/70427403/how-to-correctly-glue-together-prefix-suffix-in-a-function-call-rhs but none of the methods in it worked, unfortunately
Meta function gives me "Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems? "
Went over all hits for the searchwords for it on SO but nothing worked at the moment.
Would really appreciate any insights. Thank you!
Here is a working version:
meta <- function(x, var, suffix){
new_name <- rlang::englue("{{ var }}_{{ suffix }}")
x %>%
mutate("{new_name}" := 5) %>%
mutate("{new_name}_new" := {{ var }} - .data[[new_name]])
}
names(meta(mtcars, mpg, suf))
#> [1] "mpg" "cyl" "disp" "hp"
#> [5] "drat" "wt" "qsec" "vs"
#> [9] "am" "gear" "carb" "mpg_suf"
#> [13] "mpg_suf_new"
To understand what is going on:
Learn about the difference between "{{ var }}" and "{var}" in tidyeval glue strings: https://rlang.r-lib.org/reference/glue-operators.html
Learn about englue() to create glue strings outside of the LHS of :=: https://rlang.r-lib.org/reference/englue.html. This part is not necessary but I thought it was nicer to create and reuse a variable.
Tricky part, you create a new column with a constructed name and then want to use the new column that this name refers to. You'll have to subset it with .data, see: https://rlang.r-lib.org/reference/dot-data.html
See also the general topic: https://rlang.r-lib.org/reference/topic-data-mask-programming.html
I think it's best if we define the pieces we need first, then we can use them as needed on the LHS or the RHS of the calculation. I will add that it doesn't make much sense to me to pass the suffix argument as a bare name. I think it would be a clearer choice to make it string only.
library(dplyr)
meta <- function(x, var, suffix) {
var <- rlang::as_name(enquo(var))
suffix <- rlang::as_name(enquo(suffix)) # Remove this to make "suffix" string only.
new_var <- glue::glue("{var}_{suffix}")
x %>%
mutate("{new_var}" := 5,
"{new_var}_new" := !!sym(var) - !!sym(new_var))
}
mtcars %>%
head() %>%
meta(mpg, suf)
mpg cyl disp hp drat wt qsec vs am gear carb mpg_suf mpg_suf_new
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5 16.0
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5 16.0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5 17.8
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5 16.4
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5 13.7
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 5 13.1
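As a sketch of the "string only" variant suggested above: var stays a bare column name, while suffix is an ordinary string interpolated with a single pair of braces. The name meta_str is hypothetical, and the sketch assumes dplyr and a version of rlang that provides englue().

```r
library(dplyr)

# Hypothetical variant: `var` is a bare column name, `suffix` is a string.
meta_str <- function(x, var, suffix) {
  stopifnot(is.character(suffix), length(suffix) == 1)
  # {{ var }} injects the column name, {suffix} injects the string value
  new_name <- rlang::englue("{{ var }}_{suffix}")
  x %>%
    mutate("{new_name}" := 5,
           # The new column only exists under a constructed name,
           # so it is looked up through the .data pronoun
           "{new_name}_new" := {{ var }} - .data[[new_name]])
}

out <- meta_str(head(mtcars), mpg, "suf")
names(out)
```

The call now reads meta_str(mtcars, mpg, "suf"), which makes it clear that suf is a label rather than an existing column.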

How to use a short script to eliminate all but one duplicate column variables based on the prefix of the colname

I want to know how to use a short script to eliminate all but one duplicate column variables based on the prefix of the colname without inputting the variables I want to remove by hand.
For example, I created repeats of the mtcars$am variables, called am1, am2, am3, and am4 in a data frame called mtcars_example_2. I removed the original am variable in the mtcars_example_2 data frame.
The code below keeps am1 and removes every other variable with the prefix "am", storing the result in a new data frame called mtcars_example_3, but it requires typing each variable to remove by hand:
## long way of removing all variable with am prefix that were not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
But this seems like the long way of doing this. Is there a faster way that does not require me to individually type in the name of each variable that I want to remove from the data?
Is this possible? If so, how can this be done?
Thanks ahead of time.
Here is the code for the example:
# example data
## loads packages
library(tidyverse)
## creates mtcars_example data
mtcars_example_1 <- data.frame(mtcars)
mtcars_example_2 <- data.frame(mtcars_example_1)
## creates duplicate variables, based on am variable
mtcars_example_2$am1 <- mtcars_example_1$am
mtcars_example_2$am2 <- mtcars_example_1$am
mtcars_example_2$am3 <- mtcars_example_1$am
mtcars_example_2$am4 <- mtcars_example_1$am
## removes original variable
mtcars_example_2 <-
mtcars_example_2 %>%
select(
-c(
"am"
)
)
## long way of removing all variable with am prefix that were not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
You can remove all the variables that start with am but keep am1 :
library(dplyr)
mtcars_example_2 %>% select(-starts_with('am'), am1) %>% head
# mpg cyl disp hp drat wt qsec vs gear carb am1
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 4 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 4 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 4 1 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 0
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 3 2 0
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1 0
Depending on your actual scenario you can also use regex to remove columns.
mtcars_example_2 %>% select(-matches('am[2-4]')) %>% head
We could also do
library(dplyr)
mtcars_example_2 %>%
select(-contains('am'), am1)
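If the column to keep should not be named explicitly either, the same result can be sketched in base R. The helper keep_first_with_prefix is a hypothetical name, and it assumes that the first matching column in column order is the one to keep.

```r
# Hypothetical helper: among columns whose names start with `prefix`,
# keep only the first one (in column order) and drop the rest.
keep_first_with_prefix <- function(df, prefix) {
  hits <- which(startsWith(names(df), prefix))
  if (length(hits) > 1) {
    df <- df[, -hits[-1], drop = FALSE]
  }
  df
}

# Rebuild the example data: am1..am4 are copies of am, am itself is removed
example <- mtcars
for (i in 1:4) example[[paste0("am", i)]] <- mtcars$am
example$am <- NULL

names(keep_first_with_prefix(example, "am"))
```

Note that the match is on the prefix only, so any unrelated column that happens to start with the same letters would be treated as a duplicate too.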

Read specific (non-consecutive) rows from csv

I have a large csv file and would like to read only certain lines, defined by a vector of row numbers to be read. Is there any way to read these rows without reading the whole csv into memory?
The only solutions I've found seem to allow reading consecutive lines (e.g. lines 2-100).
A simple example of how you might combine the sed approach I linked to into an R function:
read_rows <- function(file,rows,...){
tmp <- tempfile()
row_cmd <- paste(paste(rows,"p",sep = ""),collapse = ";")
cmd <- sprintf(paste0("sed -n '",row_cmd,"' %s > %s"),file,tmp)
system(command = cmd)
read.table(file = tmp,...)
}
write.csv(x = mtcars,file = "~/Desktop/scratch/mtcars.csv")
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(3,6,7),sep = ",",header = FALSE,row.names = 1)
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(1,5,9),sep = ",",header = TRUE,row.names = 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Note the difference with row 1 as the column headers.
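Where sed is not available (for example on a plain Windows setup), a similar effect can be sketched in base R by reading the file through a connection in chunks and keeping only the wanted lines. This is a different technique from the sed call above, not a drop-in equivalent: read_rows_base and chunk_size are hypothetical names, rows are physical line numbers (line 1 is the header line), and rows are always returned in file order.

```r
# Hypothetical helper: keep only the requested physical lines of a csv,
# reading `chunk_size` lines at a time so the whole file is never in memory.
read_rows_base <- function(file, rows, chunk_size = 10000L, ...) {
  rows <- sort(unique(rows))
  con <- file(file, open = "r")
  on.exit(close(con), add = TRUE)
  kept <- character(0)
  offset <- 0  # number of lines consumed so far
  while (offset < max(rows)) {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break  # reached end of file
    # Requested line numbers that fall inside the current chunk
    idx <- rows[rows > offset & rows <= offset + length(lines)] - offset
    kept <- c(kept, lines[idx])
    offset <- offset + length(lines)
  }
  # Parse only the kept lines; extra arguments go to read.csv
  read.csv(text = kept, ...)
}

csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file)
read_rows_base(csv_file, rows = c(3, 6, 7), header = FALSE, row.names = 1)
```

Reading stops as soon as the largest requested line has been seen, so the cost depends on how far into the file the last wanted row is.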
sqldf will read it into a database (which it will create and then delete for you) and then read only the rows you want into R. Assuming the csv file created in the Note at the end, define the desired Rows and then use read.csv.sql. We have used a temporary file for the database, but if the data is sufficiently small you can omit the dbname argument and it will use memory.
library(sqldf)
Rows <- c(3, 5, 10)
s <- toString(Rows)
fn$read.csv.sql("Letters.csv", "select * from file where rowid in ($s)",
dbname = tempfile())
giving:
X Letters
1 "3" "c"
2 "5" "e"
3 "10" "j"
If the number of rows desired is very large then rather than embedding the row numbers in the SQL statement create a data frame from them and join it:
library(sqldf)
Rows <- c(3, 5, 10)
RowsDF <- data.frame(Rows)
fn$read.csv.sql("Letters.csv",
"select file.* from file join RowsDF on file.rowid = RowsDF.Rows",
dbname = tempfile())
Note
Letters <- data.frame(Letters = letters, stringsAsFactors = FALSE)
write.csv(Letters, "Letters.csv")

Multiple plots in R on a pdf with a loop

I think I need help with the loop. How would you do multiple plots on separate pdf pages with the data below:
pdf page 1:
Mazda RX4
2 panel plot for mpg vs cyl and mpg vs vs
pdf page 2:
Hornet 4 D
2 panel plot for mpg vs cyl and mpg vs vs
and the same for Valiant.
model mpg cyl vs
Mazda RX4 21.0 6 0
Mazda RX4 21.0 6 0
Mazda RX4 22.8 4 1
Hornet 4 D 21.4 6 1
Hornet 4 D 18.7 8 0
Valiant 18.1 6 1
Valiant 21.4 6 1
Valiant 21.0 6 0
Valiant 22.8 6 0
Thanks.
What I do in this case is set up the plots I want on one page with gridExtra, save that as a PDF, and then concatenate all these PDFs with ghostscript.
In R:
library(gridExtra)
library(ggplot2)
plot_one <- ggplot() + geom_...
plot_two <- ggplot() + geom_...
# Arrange the two plots one per row.
# grid.arrange'd plots can be nested, too!
two_rows <- grid.arrange(plot_one, plot_two, nrow = 2)
ggsave("dataset_1.pdf", two_rows)
# repeat for second, third, etc datasets so you end up with dataset_2.pdf etc
These are then concatenated to one PDF with multiple pages with ghostscript:
gs -sDEVICE=pdfwrite \
-dNOPAUSE \
-dQUIET \
-dBATCH \
-sOutputFile=multipage.pdf \
dataset_1.pdf dataset_2.pdf
Derived from an example elsewhere (https://www.researchgate.net/post/How_to_save_the_graphics_in_several_separate_pages_with_R)
# Create pdf
pdf(...)
# Create different plots
plot1(...)
plot2(...)
plot3(...)
dev.off()
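Note: with the default onefile = TRUE, pdf() puts each new plot on its own page of a single file. Set onefile = FALSE (with a file name pattern such as "Rplot%03d.pdf") only if you want one file per page.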
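Applied to the data in the question, the single-device approach can be sketched with base graphics as below. The data frame name cars_by_model and the output path are assumptions made for illustration; the values are copied from the table in the question. Each model gets one page with two panels (mpg vs cyl and mpg vs vs).

```r
# Data from the question (the object name is an assumption)
cars_by_model <- data.frame(
  model = c(rep("Mazda RX4", 3), rep("Hornet 4 D", 2), rep("Valiant", 4)),
  mpg   = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 21.4, 21.0, 22.8),
  cyl   = c(6, 6, 4, 6, 8, 6, 6, 6, 6),
  vs    = c(0, 0, 1, 1, 0, 1, 1, 0, 0)
)

out_file <- file.path(tempdir(), "plots_by_model.pdf")
pdf(out_file, width = 8, height = 4)
# Two panels side by side, with room at the top for a page title
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
for (m in unique(cars_by_model$model)) {
  d <- cars_by_model[cars_by_model$model == m, ]
  plot(d$cyl, d$mpg, xlab = "cyl", ylab = "mpg")
  plot(d$vs, d$mpg, xlab = "vs", ylab = "mpg")
  mtext(m, outer = TRUE, cex = 1.2)  # model name as the page title
}
invisible(dev.off())
```

Because the page holds exactly two panels, the first plot of the next model automatically starts a new page.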

Apply variable function to columns in data.table

I'm wondering if there's a way to apply a function in a string variable to .SD cols in a data.table.
I can generalize all other parts of function calls using a data.table, including input and output columns, which I'm very happy about. But the final piece seems to be applying a variable function to a data.table, which is something I believe I've done before with dplyr and do.call.
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
This obviously works:
mtcars[,eval(returnNames) := myfunc(.SD),.SDcols = SDnames,by = cyl]
But if I want to apply a dynamic function, something like this does not work:
functionCall <- "myfunc"
mtcars[,eval(returnNames) := lapply(.SD,eval(functionCall)),.SDcols = SDnames,by = cyl]
I get this error:
Error in `[.data.table`(mtcars, , `:=`(eval(returnNames), lapply(.SD, : attempt to apply non-function
Is using "apply" with "eval" the right idea, or am I on the wrong track entirely?
You don't want lapply. Since myfunc takes a data.table with multiple columns, you just want to feed such a data table into the function as one object.
To get the function, you need get instead of eval.
On the left-hand side of :=, you can just put the character vector in parentheses; eval isn't needed.
mtcars[, (returnNames) := get(functionCall)(.SD)
, .SDcols = SDnames
, by = cyl]
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb calculatedColumn
# 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2310.0
# 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2310.0
# 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2120.4
# 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2354.0
# 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3272.5
# 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1900.5
The code above was run after the following code
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
functionCall <- "myfunc"
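The difference between eval and get on a function name can be seen without data.table. In the small base R sketch below, double_it and fn_name are hypothetical names: eval() of a character string returns the string unchanged, whereas get() looks the name up and returns the function object.

```r
double_it <- function(x) x * 2   # hypothetical function for illustration
fn_name <- "double_it"

# eval() of a character string is just the string, not a function
is.function(eval(fn_name))

# get() resolves the name; mode = "function" skips non-function objects
f <- get(fn_name, mode = "function")
is.function(f)

result <- f(21)
```

Restricting the lookup with mode = "function" is a small safeguard in case a non-function variable with the same name exists in the search path.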

Resources