I have 100 scanned PDF files that I need to convert into text files.
I have already converted them into PNG files (see the script below);
now I need help converting these 100 PNG files into 100 text files.
library(pdftools)
library(tesseract)

# location of the PDF files
dest <- "P:\\TEST\\images to text"

# list all PDF files in the directory
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)

# convert files to png
sapply(myfiles, function(x)
  pdf_convert(x, format = "png", pages = NULL,
              filenames = NULL, dpi = 600, opw = "", upw = "", verbose = TRUE))

# read files
cat(text)
I expect to have a text file for each png file:
From: file1.png, file2.png, file3.png...
To: file1.txt, file2.txt, file3.txt...
But the actual result is one text file containing the text of all the PNG files.
I guess you left out the PNG -> text step, but I assume you used library(tesseract).
You could do the following in your code:
library(tesseract)
eng <- tesseract("eng")

sapply(myfiles, function(x) {
  png_file <- gsub("\\.pdf$", ".png", x)
  txt_file <- gsub("\\.pdf$", ".txt", x)
  # render the first page of the PDF to a PNG
  pdf_convert(x, format = "png", pages = 1,
              filenames = png_file, dpi = 600, verbose = TRUE)
  # OCR the PNG and write the result to the matching .txt file
  text <- ocr(png_file, engine = eng)
  cat(text, file = txt_file)
  ## just return the text string for convenience;
  ## we are anyway more interested in the side effects
  text
})
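If some of the scans run to more than one page, pdf_convert() returns a character vector with one PNG filename per page, and ocr() is vectorised over files, so a variant along these lines (a sketch; joining the pages with "\n" is an arbitrary choice) writes all pages of each PDF into a single .txt file:
library(pdftools)
library(tesseract)
eng <- tesseract("eng")

sapply(myfiles, function(x) {
  txt_file <- gsub("\\.pdf$", ".txt", x)
  # pdf_convert() returns one PNG filename per page
  png_files <- pdf_convert(x, format = "png", dpi = 600, verbose = TRUE)
  # ocr() returns one text string per image; join the pages together
  text <- paste(ocr(png_files, engine = eng), collapse = "\n")
  cat(text, file = txt_file)
  text
})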
I am trying to plot multiple trackViewer lolliplot plots (created via lapply) and save them into a single PDF file in a grid layout.
The following code makes the plots and saves each one to a separate file:
library(trackViewer)     # lolliplot()
library(GenomicRanges)   # GRanges(), IRanges()
library(grid)            # gpar(), grid.text()

files <- list.files(path="/Users/myusername/lollyplot_data/", pattern="*.txt", full.names=TRUE, recursive=FALSE)
lapply(files, function(x) {
  file_base_name <- sub('\\..*$', '', basename(x))
  myfile <- read.csv(x, header=FALSE, sep = "\t")
  chrom_info <- strsplit(myfile$V1, ':')[[1]][1]
  sample.gr <- GRanges(chrom_info, IRanges(myfile$V2, myfile$V2, names=myfile$V1),
                       color=myfile$V3, score=myfile$V4)
  features <- GRanges(chrom_info, IRanges(myfile$V2, myfile$V2))
  sample.gr.rot <- sample.gr
  png(paste(file_base_name, "png", sep = "."), width = 600, height = 595)
  lolliplot(sample.gr.rot, features, cex = 1.2,
            yaxis.gp = gpar(fontsize=18, lwd=2), ylab = FALSE,
            xaxis.gp = gpar(fontsize=10))
  grid.text(strsplit(file_base_name, "_")[[1]][7], x=.5, y=.98, just="top",
            gp=gpar(cex=1.5, fontface="bold"))
  dev.off()
})
How do I transform the above code to save the plots into a single PDF file laid out in a grid? There are 20 plots in total, so the grid will have 5 rows and 4 columns. I tried using grid.arrange but that did not work.
I would really appreciate any input.
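One possible approach (a sketch, untested against trackViewer's internals; the output file name and page size are placeholders): grid.arrange() cannot arrange the plots directly because lolliplot() draws to the device rather than returning a grob, so first capture each plot with grid::grid.grabExpr() and then lay the captured grobs out with gridExtra::grid.arrange(), which accepts nrow and ncol:
library(trackViewer)
library(GenomicRanges)
library(grid)
library(gridExtra)

files <- list.files(path="/Users/myusername/lollyplot_data/", pattern="*.txt",
                    full.names=TRUE, recursive=FALSE)
plots <- lapply(files, function(x) {
  file_base_name <- sub('\\..*$', '', basename(x))
  myfile <- read.csv(x, header=FALSE, sep = "\t")
  chrom_info <- strsplit(myfile$V1, ':')[[1]][1]
  sample.gr <- GRanges(chrom_info, IRanges(myfile$V2, myfile$V2, names=myfile$V1),
                       color=myfile$V3, score=myfile$V4)
  features <- GRanges(chrom_info, IRanges(myfile$V2, myfile$V2))
  # capture the drawn plot as a grob instead of writing a PNG
  grid.grabExpr({
    lolliplot(sample.gr, features, cex = 1.2,
              yaxis.gp = gpar(fontsize=18, lwd=2), ylab = FALSE,
              xaxis.gp = gpar(fontsize=10))
    grid.text(strsplit(file_base_name, "_")[[1]][7], x=.5, y=.98, just="top",
              gp=gpar(cex=1.5, fontface="bold"))
  })
})

pdf("all_plots.pdf", width = 16, height = 20)
grid.arrange(grobs = plots, nrow = 5, ncol = 4)
dev.off()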
I am trying to convert a raster image in PBM format to a CSV file.
I have tried this:
setwd("~/Desktop/")
directory <- "test.pbm"
ndirectory <- "test.csv"
file_name <- list.files(directory, pattern = ".pbm")
files.to.read <- paste(directory, file_name)
files.to.write <- paste(ndirectory, paste(sub(".pbm","",
file_name),".csv"))
for (i in 1:length(files.to.read)) {
temp <- (read.csv(files.to.read[i], header = TRUE, skip = 11,
fill = TRUE))
write.csv(temp, file = files.to.write[i])
}
But I am getting the error "No such file or directory", even though the file is definitely inside my Desktop directory. Am I overcomplicating this, or does anyone have suggestions on how I could move forward?
You can get the source files' absolute paths by setting the path argument and the full.names flag,
and then substitute ".csv" for ".pbm" to get the destination file names easily.
Try this:
src_files <- list.files(path = "~/Desktop/", pattern = "\\.pbm$", full.names = TRUE)
dest_files <- sub("\\.pbm$", ".csv", src_files)
for (i in seq_along(src_files)) {
  temp <- read.csv(src_files[i], header = TRUE, skip = 11, fill = TRUE)
  write.csv(temp, file = dest_files[i])
}
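As an aside, if the .pbm files are true raster images, read.csv() will not parse the pixel data meaningfully. A hedged alternative is the pixmap package, assuming read.pnm() returns a pixmapGrey object whose pixel matrix is in the grey slot:
library(pixmap)

src_files <- list.files(path = "~/Desktop/", pattern = "\\.pbm$", full.names = TRUE)
for (f in src_files) {
  img <- read.pnm(f)  # read the PBM raster as a pixmap object
  # write the pixel matrix out with one CSV cell per pixel
  write.csv(img@grey, file = sub("\\.pbm$", ".csv", f), row.names = FALSE)
}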
I am trying to convert .pdf files (most of which are image-based) to .txt files in bulk. The program below successfully converts both text-based and image-based PDFs to text files.
My problem is that there is a set of ~15 PDF files that take a really long time to convert. They aren't particularly large (between 10 and 600 pages each), but my program takes about 45 minutes to convert them.
Why is it taking so long, and how can I speed it up? I am using CRAN RGui (64-bit) and R version 3.5.0.
The .pdf files are in the following hierarchy:
My Directory->Sub-folder 1->abc.pdf
My Directory->Sub-folder 2->def.pdf
etc..
The code is as follows:
library(pdftools)
library(tesseract)   # ocr()
library(tools)       # file_path_sans_ext()

programdir <- "C:\\My directory"
# Delete all txt files in the path
file.remove(list.files(path = programdir, pattern = ".txt", recursive = TRUE, full.names = TRUE))
# Get list of sub-folders in the main directory
mydir <- list.dirs(path = programdir, full.names = TRUE, recursive = TRUE)
# Loop through sub-folders, starting from 2 as 1 is the parent directory
for (i in 2:length(mydir)) {
  # make a vector of PDF file names
  myfiles <- list.files(path = mydir[i], pattern = ".pdf",
                        full.names = TRUE, recursive = TRUE)
  # Loop through every file in the sub-directory
  for (j in 1:length(myfiles)) {
    # Render each PDF page to a TIFF image
    img_file <- pdftools::pdf_convert(myfiles[j], format = "tiff", dpi = 400)
    # Extract text from the TIFF images
    pdftotext <- ocr(img_file)
    # Ensure text files are named as per sub-directory name_pdf name.txt format
    fname <- paste(mydir[i], basename(file_path_sans_ext(myfiles[j])), sep = "_")
    # Save files to directory path
    sink(file = paste(fname, ".txt", sep = ""))
    writeLines(unlist(lapply(pdftotext, paste, collapse = " ")))
    sink()
  }
}
file.remove(list.files(pattern = ".tiff", recursive = TRUE, full.names = TRUE))
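Rendering every page to a 400 dpi TIFF and then OCRing it is inherently slow: run time grows with page count and image resolution, so a 600-page PDF at 400 dpi is a lot of pixels for tesseract to chew through. Two levers that usually help, sketched below under assumptions (ocr_pdf is a hypothetical helper; 300 dpi is assumed to still be adequate for clean scans; parLapply() is used because mclapply() does not fork on Windows):
library(parallel)

ocr_pdf <- function(pdf) {
  # render at a lower dpi than 400; 300 is often enough for clean scans
  imgs <- pdftools::pdf_convert(pdf, format = "tiff", dpi = 300)
  text <- paste(tesseract::ocr(imgs), collapse = " ")
  writeLines(text, sub("\\.pdf$", ".txt", pdf))
  file.remove(imgs)  # clean up the intermediate TIFFs
}

pdfs <- list.files("C:\\My directory", pattern = "\\.pdf$",
                   recursive = TRUE, full.names = TRUE)
cl <- makeCluster(max(1, detectCores() - 1))
parLapply(cl, pdfs, ocr_pdf)
stopCluster(cl)
One caveat: pdf_convert() derives the image file names from the PDF name and writes them to the working directory, so identically named PDFs in different sub-folders could collide when run in parallel.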
I have used the following code within R to convert a PDF file to a text file for future use with the tm package. I am using the downloaded "pdftotext.exe" file.
This code works properly and produces a ".txt" file for every PDF in the directory.
myfiles <- list.files(path = dir04, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/xpdf/xpdfbin-win-3.04/bin64/pdftotext.exe"',paste0('"', i, '"')), wait = FALSE))
I am trying to figure out how to use "docx2txt" in a similar manner. However, docx2txt is not distributed as an .exe file. Can I use "docx2txt-1.4" or "docx2txt-1.4.tar" in the same manner? The following code produces an error for each file:
myfiles <- list.files(path = dir08, pattern = "docx", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/docx2txt/docx2txt-1.4.gz"',paste0('"', i, '"')), wait = FALSE))
Warning
running command '"C:/docx2txt/docx2txt-1.4.gz" "C:/....docx"' had status 127
The related question "how do I create a corpus of *.docx files with tm?" doesn't have quite enough info.
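For what it's worth, status 127 from system() typically means the command could not be executed at all: docx2txt is a Perl script distributed as a source archive, so the .gz/.tar file cannot be invoked like an .exe. Assuming Perl is installed and the archive has been extracted (the path to docx2txt.pl below is a placeholder), the call would go through the interpreter:
myfiles <- list.files(path = dir08, pattern = "docx", full.names = TRUE)
lapply(myfiles, function(i)
  system(paste('perl "C:/docx2txt/docx2txt.pl"', paste0('"', i, '"')),
         wait = FALSE))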
You can convert a ".docx" file to ".txt" with the following code, which takes a different approach:
library(RDCOMClient)
path_Word <- "C:\\temp.docx"
path_TXT <- "C:\\temp.txt"
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_Word),
ConfirmConversions = FALSE)
doc$SaveAs(path_TXT, FileFormat = 4) # Converts word document to txt
text <- readLines(path_TXT)
text
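To convert a whole folder rather than a single file, the same COM session can be reused in a loop; a sketch (dir08 as in the question, with a Quit() call so Word does not stay open in the background):
library(RDCOMClient)

myfiles <- list.files(path = dir08, pattern = "\\.docx$", full.names = TRUE)
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- FALSE
wordApp[["DisplayAlerts"]] <- FALSE
for (f in myfiles) {
  doc <- wordApp[["Documents"]]$Open(normalizePath(f),
                                     ConfirmConversions = FALSE)
  doc$SaveAs(sub("\\.docx$", ".txt", normalizePath(f)), FileFormat = 4)
  doc$Close()
}
wordApp$Quit()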
I am trying to convert all my .txt files to .csv, but I didn't manage to create the loop.
The current line for one file (which works perfectly) is the following:
tab = read.delim("name_file", header = TRUE, skip = 11)
write.table(tab, file="name_file.csv",sep=",",col.names=TRUE,row.names=FALSE)
I would like to do that for all the .txt files I have in the working directory.
Based on some research on the web I tried the following loop, but I am not sure it's the right one:
FILES = list.files(pattern = ".txt")
for (i in 1:length(FILES)) {
FILES = read.csv(file = FILES[i], header = TRUE, skip = 11, fill = TRUE)
write.csv(FILES, file = paste0(sub("folder_name", ".txt","", FILES[i]), ".csv"))
}
I'm on a Windows system.
I would appreciate some help... Thanks!
Hi, I had the same problem before, just like you, and now I've made it work. Try this:
directory <- "put_your_txt_directory_here"
ndirectory <- "put_your_csv_directory_here"
file_name <- list.files(directory, pattern = ".txt")
files.to.read <- paste(directory, file_name, sep="/")
files.to.write <- paste(ndirectory, paste0(sub(".txt","", file_name),".csv"), sep="/")
for (i in 1:length(files.to.read)) {
temp <- (read.csv(files.to.read[i], header = TRUE, skip = 11, fill = TRUE))
write.csv(temp, file = files.to.write[i])
}
You need to index the output inside the loop as well. Try this:
INFILES = list.files(pattern = ".txt")
# use a list, since read.csv() returns a data frame for each file
OUTFILES = vector(mode = "list", length = length(INFILES))
for (i in 1:length(INFILES)) {
  OUTFILES[[i]] = read.csv(file = INFILES[i], header = TRUE, skip = 11,
                           fill = TRUE)
  # "folder_name" is the output directory
  write.csv(OUTFILES[[i]],
            file = file.path("folder_name", paste0(sub(".txt", "", INFILES[i]), ".csv")))
}
Assuming that your input files always have at least 11 rows (since you skip the first 11 rows!) this should work:
filelist = list.files(pattern = ".txt")
for (i in 1:length(filelist)) {
  cur.input.file <- filelist[i]
  cur.output.file <- paste0(cur.input.file, ".csv")
  print(paste("Processing the file:", cur.input.file))
  # If the input file has fewer than 11 rows you will receive the error message:
  # "Error in read.table: no lines available in input"
  data = read.delim(cur.input.file, header = TRUE, skip = 11)
  write.table(data, file = cur.output.file, sep = ",", col.names = TRUE, row.names = FALSE)
}
If you receive any error during file conversion, it is caused by the content (e.g. an unequal number of rows per column, an unequal number of columns, etc.).
PS: Using a for loop is OK here since it does not limit the performance (there is no "vectorized" logic to read and write files).