Convert a batch of PDFs to text using pdftools in R

I'm trying to convert 1000 PDFs to text for data analysis. I'm using the pdftools package.
I have been able to convert 2 PDFs using the following code:
library(pdftools)
file_list <- list.files('pdf', full.names = TRUE, pattern = 'pdf')
for(i in 1:length(file_list)){
temp <- pdf_text(file_list[i])
temp <- tolower(temp)
file_name = paste(file_list[i], '.txt')
sink(file_name)
cat(temp)
sink()
}
But when I add more than 2, I get the following error:
"Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure."
Also, I would like the final text file to be named just "file_name.txt"; right now I'm getting "file_name.pdf .txt".
Thanks.

library(pdftools)
library(purrr)
setwd("/tmp/test")
file_list <- list.files(".", full.names = TRUE, pattern = '\\.pdf$') # escape the dot so only a literal ".pdf" extension matches
s_pdf_text <- safely(pdf_text) # helps catch errors
walk(file_list, ~{ # iterate over the files
res <- s_pdf_text(.x) # try to read it in
if (!is.null(res$result)) { # if successful
message(sprintf("Processing [%s]", .x))
txt_file <- sprintf("%stxt", sub("pdf$", "", .x)) # make a new filename
unlist(res$result) %>% # could be > 1 page (pdf_text gives one string per page)
tolower() %>%
paste0(collapse="\n") %>% # make one big text block with line breaks
cat(file=txt_file) # write it out
} else { # if not successful
message(sprintf("Failure converting [%s]", .x)) # show a message
}
})
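
On the file name part of the question: paste() joins its arguments with a space by default, which is where the stray space in "file_name.pdf .txt" comes from. A minimal sketch, using only base R and made-up file names, of two ways to build the output name:

```r
# Hypothetical input paths, for illustration only
file_list <- c("pdf/report_a.pdf", "pdf/report_b.pdf")

# paste() separates its arguments with " " by default
paste(file_list[1], ".txt")
# paste0() adds no separator, but the ".pdf" extension is still there
paste0(file_list[1], ".txt")

# Option 1: swap the extension with an anchored regular expression
txt_names <- sub("\\.pdf$", ".txt", file_list)

# Option 2: drop the extension with tools::file_path_sans_ext(), then append ".txt"
txt_names2 <- paste0(tools::file_path_sans_ext(file_list), ".txt")
```

Either option can replace the paste() call in the original loop; the regular expression version is what the answer above does in a slightly different form.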

Related

Get the number of rows and columns of a multiple CSV file

Is there any way to get the number of rows and columns of multiple CSV files in R and save that information in a CSV file? Here is my R code:
#Library
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("fs")) install.packages("fs")
#Mentioning Files Location
file_paths <- fs::dir_ls("C:\\Users\\Desktop\\FileCount\\Test")
file_paths[[2]]
#Reading Multiple CSV Files
file_paths %>%
map(function(path)
{
read_csv(path,col_names = FALSE)
})
#Counting Number of Rows
lapply(X = file_paths, FUN = function(x) {
length(count.fields(x))
})
#Counting Number of Columns
lapply(X = file_paths, FUN = function(x) {
length(ncol(x))
})
#Saving CSV File
write.csv(file_paths,"C:\\Users\\Desktop\\FileCount\\Test\\FileName.csv", row.names = FALSE)
A couple of things are not working:
Counting the number of columns of the CSV files.
When saving the file, I want to save the filename, number of rows, and number of columns.
Any help appreciated.

Welcome to SO! Using the tidyverse and data.table, here's a way to do it:
Note: All the .csv files are in my TestStack directory, but you can replace it with your own directory (C:/Users/Desktop/FileCount/Test).
Code:
library(tidyverse)
csv.file <- list.files("TestStack") # Directory with your .csv files
data.frame.output <- data.frame(number_of_cols = NA,
number_of_rows = NA,
name_of_csv = NA) #The df to be written
MyF <- function(x){
csv.read.file <- data.table::fread(
paste("TestStack", x, sep = "/")
)
number.of.cols <- ncol(csv.read.file)
number.of.rows <- nrow(csv.read.file)
data.frame.output <<- add_row(data.frame.output,
number_of_cols = number.of.cols,
number_of_rows = number.of.rows,
name_of_csv = str_remove(x, "\\.csv$")) %>% # escaped dot: strip only a trailing ".csv"
filter(!is.na(name_of_csv))
}
map(csv.file, MyF)
Output:
  number_of_cols number_of_rows name_of_csv
1              3           2150      CH_com
2              2          34968 epci_com_20
3              3            732        g1g4
4              7         161905          RP
I have this output because my TestStack had 4 files named CH_com.csv, epci_com_20.csv,...
You can then write the object data.frame.output to a .csv as you wanted: data.table::fwrite(data.frame.output, file = "Output.csv")

files_map <- "test"
files <- list.files(files_map)
library(data.table)
output <- data.table::rbindlist(
lapply(files, function(file) {
dt <- data.table::fread(paste(files_map, file, sep = "/"))
list("number_of_cols" = ncol(dt), "number_of_rows" = nrow(dt), "name_of_csv" = file)
})
)
data.table::fwrite(output, file = "Filename.csv")
Or with map and a separate function to do the work, without first creating an empty table and updating it through global assignment. I see that pattern a lot with apply-style functions, although it is not needed at all.
myF <- function(file) {
dt <- data.table::fread(paste(files_map, file, sep = "/"))
data.frame("number_of_cols" = ncol(dt), "number_of_rows" = nrow(dt), "name_of_csv" = file)
}
output <- do.call(rbind, purrr::map(files, myF)) # map() comes from purrr
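
As for why the column count in the question never worked: ncol() was applied to the file path rather than to the data, and ncol() of a character vector is NULL, so length(ncol(x)) is always 0. A minimal base-R sketch of the same idea as the answers above, assuming each CSV has a header row; the directory, file names, and the helper count_dims are made up for illustration:

```r
# Build a throwaway directory with two small CSV files (illustrative data only)
csv_dir <- file.path(tempdir(), "csv_count_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:3, b = 4:6), file.path(csv_dir, "first.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:5), file.path(csv_dir, "second.csv"), row.names = FALSE)

# ncol() of a file path (a character vector) is NULL, so this is 0
length(ncol(file.path(csv_dir, "first.csv")))

# Hypothetical helper: read one file, return its dimensions as a one-row data frame
count_dims <- function(path) {
  df <- read.csv(path)
  data.frame(name_of_csv = basename(path),
             number_of_rows = nrow(df),
             number_of_cols = ncol(df))
}

csv_paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
summary_df <- do.call(rbind, lapply(csv_paths, count_dims))

# Write the summary outside csv_dir so it is not picked up as input on a rerun
write.csv(summary_df, file.path(tempdir(), "FileName.csv"), row.names = FALSE)
```

If the files have no header row, as col_names = FALSE in the question suggests, read.csv(path, header = FALSE) would count the first line as data.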

R efficiently bind_rows over many dataframes stored on harddrive

I have roughly 50000 .rda files. Each contains a dataframe named results with exactly one row. I would like to append them all into one dataframe.
I tried the following, which works, but is slow:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
load(files[1])
results_table = results
rm(results)
for(i in c(2:length(files))) {
print(paste("We are at step ", i,sep=""))
load(files[i])
results_table= bind_rows(list(results_table, results))
rm(results)
}
Is there a more efficient way to do this?

Using .rds is a little bit easier. But if we are limited to .rda the following might be useful. I'm not certain if this is faster than what you have done:
library(purrr)
library(dplyr)
library(tidyr)
## make and write some sample data to .rda
x <- 1:10
fake_files <- function(x){
df <- tibble(x = x)
save(df, file = here::here(paste0(as.character(x),
".rda")))
return(NULL)
}
purrr::map(x,
~fake_files(x = .x))
## map and load the .rda files into a single tibble
load_rda <- function(file) {
foo <- load(file = file) # foo just provides the name of the objects loaded
return(df) # note df is the name of the rda returned object
}
rda_files <- tibble(files = list.files(path = here::here(""),
pattern = "\\.rda$", # list.files() takes a regular expression, not a glob
full.names = TRUE)) %>%
mutate(data = pmap(., ~load_rda(file = .x))) %>%
unnest(data)

This is untested code but should be pretty efficient:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
data_list <- lapply(files, function(f) { # iterate over every file, not a single hard-coded name
  message("loading file: ", f)
  name <- load(f) # load() returns the name(s) of the loaded object(s)
  return(get(name)) # returns the object whose name is stored in `name`
})
results_table <- data.table::rbindlist(data_list)
data.table::rbindlist is very similar to dplyr::bind_rows but a little faster.
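
The loop in the question is slow mainly because the growing data frame is copied on every iteration. A minimal base-R sketch of the alternative, collecting everything in a list and binding once; the sample files and the helper load_results are made up for illustration, and each file is loaded into its own environment so that `results` never lands in the workspace:

```r
# Create a few throwaway .rda files, each holding a one-row data frame named `results`
rda_dir <- file.path(tempdir(), "rda_demo")
dir.create(rda_dir, showWarnings = FALSE)
for (i in 1:3) {
  results <- data.frame(id = i, score = i * 10)
  save(results, file = file.path(rda_dir, sprintf("model_%d.rda", i)))
}
rm(results)

# Hypothetical helper: load one file into a private environment and return `results`
load_results <- function(path) {
  env <- new.env()
  load(path, envir = env)
  get("results", envir = env)
}

rda_files <- list.files(rda_dir, pattern = "\\.rda$", full.names = TRUE)

# Collect first, then bind once instead of growing a data frame inside the loop
results_list <- lapply(rda_files, load_results)
results_table <- do.call(rbind, results_list)
```

For around 50000 files, the final step could equally be data.table::rbindlist(results_list) or dplyr::bind_rows(results_list), both of which accept a list of data frames.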

How do I get my loop on pdf_text only to read all the files?

I have a series of 475 files that I need to convert to text. I have written the following code to do that:
files <- list.files(pattern = "pdf$")
for (i in 1:length(files)){
print(i)
files_pdfs <- pdf_text(files[i]) %>% tibble(txt = .) %>% unnest_tokens(word, txt)}
It appears to execute successfully but when I inspect the output, it has clearly only read the text from the final file. I tried breaking the corpus of PDFs up into smaller segments and I still get the same problem - always just the text from the final file. I'm sure it's a basic error in my code but I can't figure it out. Any ideas?

You are overwriting files_pdfs on every cycle. Try:
files <- list.files(pattern = "pdf$")
files_pdfs <- list()
for (i in 1:length(files))
{
print(i)
files_pdfs[[files[i]]] <- pdf_text(files[i]) %>%
tibble(txt = .) %>%
unnest_tokens(word, txt)
}
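
The same accumulation can be written without an explicit loop. A minimal sketch in base R; fake_pdf_text is a hypothetical stand-in for pdf_text() so that the example runs without any PDFs, and the file names are made up:

```r
# Stand-in for pdf_text(): returns one string per "page" (hypothetical, not part of pdftools)
fake_pdf_text <- function(path) {
  paste("page", 1:2, "of", path)
}

files <- c("a.pdf", "b.pdf", "c.pdf")  # made-up file names

# lapply() returns one element per file, so nothing gets overwritten
pages_by_file <- lapply(files, fake_pdf_text)
names(pages_by_file) <- files

# Stack into one data frame, keeping track of which file each page came from
all_pages <- do.call(rbind, lapply(files, function(f) {
  data.frame(file = f, txt = pages_by_file[[f]], stringsAsFactors = FALSE)
}))
```

With the real pdf_text() in place of the stand-in, all_pages could then be passed to unnest_tokens(word, txt) once, rather than once per file.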

How do I run a for loop on multiple word files in R

I have 44 doc files. From each file, I need to extract the customer name and amount. I am able to do this for one file using the read_document command and grep to extract the amount and customer name. When I do this for 44 files, I get an error. I am not sure where I am going wrong:
ls()
rm(list = ls())
files <- list.files("~/experiment", ".doc")
files
length(files)
for (i in length(files)){
library(textreadr)
read_document(files[i])
}
Here is the full code that I run on one file:
file <- "~/customer_full_file.docx"
library(textreadr)
full_customer_file <- read_document(file, skip = 0, remove.empty = TRUE, trim = TRUE)
#checking file is read correctly
head(full_customer_file)
tail(full_customer_file)
# Extracting Name
full_customer_file <- full_customer_file[c(1,4)]
amount_extract <- grep("Amount", full_customer_file, value = T)
library(tm)
require(stringr)
amount_extract_2 <- lapply(amount_extract, stripWhitespace)
amount_extract_2 <- str_remove(amount_extract_2, "Amount")
name_extract <- grep("Customer Name and ID: ", full_customer_file, value = T)
name_extract
name_extract_2 <- lapply(name_extract, stripWhitespace)
name_extract_2 <- str_remove(name_extract_2, "Customer Name and ID: ")
name_extract_2 <- as.data.frame(name_extract_2)
names(name_extract_2)[1] <- paste("customer_full_name")
amount_extract_2 <- as.data.frame(amount_extract_2)
names(amount_extract_2)[1] <- paste("amount")
amount_extract_2
customer_final_file <- cbind(name_extract_2, amount_extract_2)
write.table(customer_final_file, "~/customer_amount.csv", sep = ",", col.names = T, append = T)
Here is the code that I run on the 44 files:
ls()
rm(list = ls())
files <- list.files("~/experiment", ".doc")
files
length(files)
library(textreadr)
for (i in 1:length(files)){
read_document(files[i])
}
Here is the error that I am getting:
> library(textreadr)
> for (i in 1:length(files)){
+ read_document(files[i])
+ }
Warning messages:
1: In utils::unzip(file, exdir = tmp) :
error 1 in extracting from zip file
2: In utils::unzip(file, exdir = tmp) :
error 1 in extracting from zip file
3: In utils::unzip(file, exdir = tmp) :
error 1 in extracting from zip file
4: In utils::unzip(file, exdir = tmp) :
error 1 in extracting from zip file
5: In utils::unzip(file, exdir = tmp) :
error 1 in extracting from zip file

I can share the code I used to analyze several Word files with the sentimentr package in R. I think you can keep the same structure and just change the body of the for loop to run your extraction on every docx.
And this is the code:
library(sentimentr)
library(officer) # provides read_docx() and docx_summary()
folder_path <- "C:\\Users\\yourname\\Documents\\R\\"
# Get a list of all the docx files in the folder
docx_files <- list.files(path = folder_path, pattern = "\\.docx$", full.names = TRUE)
# Create an empty data frame to store the results
results <- data.frame(file = character(0), sentiment = numeric(0))
# Loop over the list of files
for (file in docx_files) {
# Read the docx file
sample_data <- read_docx(file)
# Extract the content and create a summary
content <- docx_summary(sample_data)
law <- content[sapply(strsplit(as.character(content$text),""),length)>5,]
# Calculate the sentiment of the summary (or in your case extraction)
sentiment <- sentiment_by(as.character(law$text))
# Add a row to the data frame with the results for this file
results <- rbind(results, data.frame(file = file, sentiment = sentiment$ave_sentiment))
}
# View the results data frame
View(results)
I hope that is close enough to your problem to solve it.
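
One possible contributor to the unzip warnings in the question (an assumption, since the actual files are not available): list.files() returns bare file names unless full.names = TRUE, so the names are looked up in the working directory rather than in ~/experiment, and the unanchored pattern ".doc" also matches legacy .doc files, which are not zip archives the way .docx files are. A minimal base-R sketch with empty placeholder files and made-up names:

```r
# Throwaway folder with one .docx-named and one .doc-named file (empty placeholders)
doc_dir <- file.path(tempdir(), "doc_demo")
dir.create(doc_dir, showWarnings = FALSE)
file.create(file.path(doc_dir, c("customer_1.docx", "customer_2.doc")))

# Without full.names, only bare names come back; they resolve against the
# working directory, not against doc_dir. The pattern ".doc" matches both files.
bare_names <- list.files(doc_dir, ".doc")
file.exists(bare_names)

# An anchored pattern plus full.names = TRUE gives usable paths to .docx files only
docx_paths <- list.files(doc_dir, pattern = "\\.docx$", full.names = TRUE)
file.exists(docx_paths)
```

Note also that the first loop in the question, for (i in length(files)), visits only the last file; 1:length(files) or seq_along(files) visits all of them.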

Function to extract date from jpg files in a directory

I have a large volume (approx. 10,000) of jpg files with a date written on each one. I wish to extract the date from each jpg and add it to a dataframe alongside the corresponding filename.
I have read this forum and beyond and I have tried to patch together a function in R which will perform the task but I cannot get it to work. I have used a loop to:
1) generate a list of image files in the chosen directory
2) create a dataframe for the results with a column for file path and a column
for date (extracted from the jpg)
3) loop through files in directory:
Resize,
Crop to portion of image showing date,
OCR the image,
Write date to dataframe - created in step 2
This seems to crash when I run the function and I am not really sure why. I am an R user but I have not written functions before (you can probably tell)
I am using R 3.6.0 and RStudio
library(tesseract)
library(magick)
library(tidyverse)
library(gsubfn)
get_jpeg_date <- function(folder) {
file_list <- list.files(path=folder, pattern="*.jpg", recursive = T)
image_dates <- as.data.frame(file_list)
image_dates $ ImageDate <- rep_len(x = NA, length.out = length(file_list))
eng <- tesseract("eng")
for (i in length(file_list) ) {
ImageDate <- image_read(paste(folder,"\\",file_list, sep = ""))%>%
image_resize("2000") %>%
image_crop("300x100+1800") %>%
tesseract::ocr(engine = eng) %>%
strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)%>%
image_dates[,i]
}
}
x <- get_jpeg_date(folder = folder)
folder <- "C:/file_path"
x <- get_jpeg_date(folder = folder)
The code in the loop works on single files but there is no output when I run the function on a small test sample of 3 jpg images.

Consider refactoring your function to run on a single jpg file, then assign the column with sapply or map. In R, a function returns the value of its last evaluated expression. A for loop evaluates to NULL, which is why the original function produced no output; because the version below ends with the pipeline instead of a loop, it returns the OCR'ed and regex-matched string vector.
get_jpeg_date <- function(pic) {
eng <- tesseract("eng")
image_read(pic) %>%
image_resize("2000") %>%
image_crop("300x100+1800") %>%
tesseract::ocr(engine = eng) %>%
strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)
}
file_list <- list.files(path=folder, pattern="\\.jpg$", full.names = TRUE, recursive = TRUE) # pattern is a regular expression, not a glob
# DATA FRAME BUILD
image_dates_df <- data.frame(img_path = file_list)
# COLUMN ASSIGNMENT
image_dates_df$img_date <- sapply(image_dates_df$img_path, get_jpeg_date)
# ALTERNATIVELY WITH dplyr::mutate() and purrr::map()
image_dates_df <- data.frame(img_path = file_list) %>%
mutate(img_date = map(img_path, get_jpeg_date))
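
Two details of the original function explain the missing output, and both can be checked in base R without any images. A minimal sketch with made-up file names; ends_with_loop and ends_with_value are hypothetical helpers for illustration:

```r
file_list <- c("a.jpg", "b.jpg", "c.jpg")  # made-up names

# for (i in length(x)) visits only the last index; seq_along(x) visits every index
visited_length <- c()
for (i in length(file_list)) visited_length <- c(visited_length, i)
visited_seq <- c()
for (i in seq_along(file_list)) visited_seq <- c(visited_seq, i)

# A function whose last expression is a for loop returns NULL
ends_with_loop <- function(x) {
  for (i in seq_along(x)) toupper(x[i])
}

# A function whose last expression is a value returns that value
ends_with_value <- function(x) {
  toupper(x)
}

loop_result <- ends_with_loop(file_list)
value_result <- ends_with_value(file_list)
```

Here visited_length holds a single index while visited_seq holds all three, and loop_result is NULL while value_result holds the transformed names.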
