I have JPEG files in a local directory. I want to extract the text from all the images one by one and record the values in separate cells. Can anyone please help me with the code? I have used the tesseract and magick packages to extract the text, but now I need to run it in a loop.
First of all, you have to know which files you want to read. Go to the directory where they are located and get their names with list.files.
old_dir <- getwd()
setwd('path/to/directory')
filenames <- list.files(pattern = '\\.jpe?g$', ignore.case = TRUE) # matches .jpg and .jpeg
Now the standard trick is to loop through the file names with one of the *apply functions. For the sake of simplicity, I will define a function that does the actual read and OCR text extraction operations.
library(magick)
library(tesseract)
# Read one image file and return the text found by OCR.
read_ocr_image <- function(file){
  img <- image_read(file)
  image_ocr(img)
}
# One element of OCR text per file.
text_list <- lapply(filenames, read_ocr_image)
names(text_list) <- filenames
And reset the working directory when done.
setwd(old_dir)
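If the goal is one row per image, the list can be flattened into a data frame. The following is only a sketch: `read_all` and its `reader` argument are hypothetical names introduced here, and it uses `full.names = TRUE` so that the working directory does not have to be changed.

```r
# Hypothetical helper (not part of magick or tesseract): apply a reader
# function to every matching file and return one row per file.
read_all <- function(dir, pattern, reader) {
  paths <- list.files(dir, pattern = pattern, full.names = TRUE,
                      ignore.case = TRUE)
  # Collapse whatever the reader returns into a single string per file.
  texts <- vapply(paths, function(p) paste(reader(p), collapse = "\n"),
                  character(1))
  data.frame(file = basename(paths), text = unname(texts),
             stringsAsFactors = FALSE)
}

# Usage with OCR (assumes magick and tesseract are installed):
# ocr_df <- read_all("path/to/directory", "\\.jpe?g$",
#                    function(p) magick::image_ocr(magick::image_read(p)))
# write.csv(ocr_df, "ocr_text.csv", row.names = FALSE)
```

Passing the reader as an argument keeps the file handling separate from the OCR call, so the same helper could be tried with a different extraction function.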
I have several PDF files in my directory. I have downloaded them previously, no big deal so far.
I want to read all those files in R. My idea was to use the "pdf_text" function from the "pdftools" package and write a formula like this:
mypdftext <- pdf_text(files)
Here "files" is an object that holds all the PDF file names, so that I don't have to type them manually. Because I have downloaded a lot of files, this would save me from writing:
mypdftext <- pdf_text("file1.pdf", "file2.pdf", and many more files...)
To create the object "files", I used "files <- list.files(pattern = "pdf$")".
The “files” vector contains all the PDF file names.
But "files" does not work with the pdf_text function, probably because it's a vector. What can I do instead?
Maybe this is not the best solution, but it works for me:
library(pdftools)
# Set your path here.
your_path = 'C:/Users/.../pdf_folder'
setwd(your_path)
getwd()
# List only the PDF files in the working directory.
lf = list.files(path=getwd(), pattern="\\.pdf$", all.files=FALSE,
full.names=FALSE)
#Create a list with one slot per file.
my_pdfs = vector("list", length(lf))
#Iterate. pdf_text returns one string per page, so store each result as a list element.
for (i in seq_along(lf)){my_pdfs[[i]] <- pdf_text(lf[i])}
#Get the text of the first pdf in the list.
my_pdfs[[1]]
Each element of the list then holds the text of one PDF, and you can write each one out to whatever file you want. Does this solve your problem?
You could try using lapply over the vector that contains the location of every PDF file (files). I would recommend using list.files(..., full.names = TRUE) to get the complete location of each PDF file. This should work.
mypdfs<-lapply(files, pdf_text)
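Since pdf_text returns one string per page, the result of lapply is a list of character vectors. As a rough sketch, that list could be turned into a long data frame with one row per page. `pages_to_df` is a hypothetical helper, and it assumes the list has been named with the file names.

```r
# Hypothetical helper: convert a named list of per-page text vectors
# (the shape returned by lapply(files, pdf_text)) into one data frame.
pages_to_df <- function(pdf_pages) {
  rows <- lapply(names(pdf_pages), function(nm) {
    pages <- pdf_pages[[nm]]
    data.frame(file = rep(nm, length(pages)),
               page = seq_along(pages),
               text = pages,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Usage (assumes pdftools is installed):
# files <- list.files(pattern = "\\.pdf$", full.names = TRUE)
# mypdfs <- lapply(files, pdftools::pdf_text)
# names(mypdfs) <- basename(files)
# pdf_df <- pages_to_df(mypdfs)
```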
I am trying to deal with extracting a subset from multiple .grb2 files in the same file path, and write them in a csv. I am able to do it for one (or a few) by using the following set of commands:
GRIB <- brick("tmp2m.1989102800.time.grb2")
GRIB <- as.array(GRIB)
readGDAL("tmp2m.1989102800.time.grb2")
tmp2m.6hr <- GRIB[51,27,c(261:1232)]
str(tmp2m.6hr)
tmp2m.data <- data.frame(tmp2m.6hr)
write.csv(tmp2m.data,"tmp1.csv")
The above commands extract temperature values for a specific latitude "51" and longitude "27", over a specific time range "c(261:1232)", and write them to a CSV file.
Now I have hundreds of these files (with different file names, of course) in the same directory and I want to do the same for all of them. I cannot do this one by one, changing the file name each time.
I have struggled a lot with this, but so far I have not managed to do it. Since I am new to R and my knowledge is limited, I would very much appreciate any possible help with this.
The simplest way would be to use a normal for loop:
path <- "your file path here"
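# full.names = TRUE so the files are found regardless of the working directory
input.file.names <- dir(path, pattern ="\\.grb2$", full.names = TRUE)
output.file.names <- paste0(tools::file_path_sans_ext(input.file.names),".csv")
for(i in seq_along(input.file.names)){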
GRIB <- brick(input.file.names[i])
GRIB <- as.array(GRIB)
readGDAL(input.file.names[i]) # edited line
tmp2m.6hr <- GRIB[51,27,c(261:1232)]
str(tmp2m.6hr)
tmp2m.data <- data.frame(tmp2m.6hr)
write.csv(tmp2m.data,output.file.names[i])
}
You could of course wrap the body of the for loop in a function and then use the standard lapply or the map function from purrr.
Note that this code writes a separate CSV file for each input file. If you want to append the data to a single file, check out write.table and its append argument.
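For the single-file variant, one possible sketch uses write.table with append = TRUE, writing the header only when the file does not exist yet. `append_csv` is a hypothetical helper name, and the column names in the usage comment are only illustrative.

```r
# Hypothetical helper: append a data frame to a CSV file, writing the
# header row only on the first call (when the file does not exist yet).
append_csv <- function(df, out_file) {
  first <- !file.exists(out_file)
  write.table(df, out_file, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
}

# Usage inside the loop, instead of write.csv:
# tmp2m.data <- data.frame(source = basename(input.file.names[i]),
#                          tmp2m.6hr = tmp2m.6hr)
# append_csv(tmp2m.data, "all_files.csv")
```

Adding a column with the source file name keeps the rows traceable once everything is in one file.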
I am trying to extract one text file from each of the zip files located in one folder. Then I want to combine those text files into one dataframe.
The folder has multiple Zip files:
pf_0915.zip
pf_0914.zip
pf_0913.zip
.....
Inside those zip files are multiple text files. I am only interested in the one called abc.txt. This is a fixed-width format file without a header. I have already set up a read for this file using read_fwf. Since all the extracted text files have the same name, it might be better to rename them according to the name of their archive, i.e. the abc.txt from pf_0915.zip could be called abc_0915.txt. Once they are all read, they should be combined into one large file called abcCombined.txt.
Or, as each new abc.txt file is read, we could add it to abcCombined.txt.
I have tried various versions of unzip() and unz() without much success, and that was without looping through all the zip files. Finally, this directory contains many zip files. Is there a way to read only some of them by using pattern matching like grep? I would, for example, be interested in reading only the September files, those .._09...txt.
Any hints would be appreciated.
The following:
Creates a vector of the files in a directory
Uses the list parameter to unzip() to see the metadata for the contents
Builds a regular expression to find only the target file (I did that in the event your use-case generalizes to a broader pattern)
Tests if any of the files meet your criteria
Keeps only those files into a resultant vector
Iterates over that vector and
Extracts only the target file into a temporary directory
Reads it into a data.frame
Ultimately binds the individual data.frames into one big one
You can write out the resultant combined data.frame however you wish.
library(purrr)
target_dir <- "so"
extract_file <- "abc.txt"
list.files(target_dir, full.names=TRUE) %>%
keep(~any(grepl(sprintf("^%s$", extract_file), unzip(., list=TRUE)$Name))) %>%
map_df(function(x) {
td <- tempdir()
read.fwf(unzip(x, extract_file, exdir=td), widths=c(4,1,4,2))
}) -> combined_df
The version below just expands some of the shortcuts in the one above:
only_files_with_this_name <- function(zip_path, name) {
zip_contents <- unzip(zip_path, list=TRUE)
look_for <- sprintf("^%s$", name)
any(grepl(look_for, zip_contents$Name))
}
list.files(target_dir, full.names=TRUE) %>%
keep(only_files_with_this_name, name=extract_file) %>%
map_df(function(x) {
td <- tempdir()
file_in_zip <- unzip(x, extract_file, exdir=td)
df <- read.fwf(file_in_zip, widths=c(4,1,4,2))
unlink(file_in_zip)
df # return the data, not the result of unlink()
}) -> combined_df
Can't comment because of my low reputation, so although this is a partial answer:
If you know the file name within the various zips the syntax to get just that file would be something like the following:
my_data<-read.csv(unz("pf_0915.zip","abc.txt"))
This is the code for a csv obviously, not a fixed width text, but if you already have that set up, it'll be something like
my_data<-read_fwf(unz("pf_0915.zip","abc.txt") ... )
with all your other parameters in the ...
You can do this in a loop if you have many zips, and accumulate them in a data frame, data table, whatever structure floats your boat...
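On the pattern-matching part of the question, here is a minimal sketch in base R. It assumes the archives are named pf_MMDD.zip, as in the listing above; `zips_for_month` and `renamed_target` are hypothetical helper names.

```r
# Hypothetical helper: list only the archives for one month, assuming
# the naming scheme pf_MMDD.zip.
zips_for_month <- function(dir, month = "09") {
  list.files(dir, pattern = sprintf("^pf_%s[0-9]{2}\\.zip$", month),
             full.names = TRUE)
}

# Hypothetical helper: build the per-archive name for the extracted
# file, e.g. abc.txt from pf_0915.zip becomes abc_0915.txt.
renamed_target <- function(zip_paths, target = "abc.txt") {
  tag <- sub("^pf_([0-9]{4})\\.zip$", "\\1", basename(zip_paths))
  paste0(tools::file_path_sans_ext(target), "_", tag, ".",
         tools::file_ext(target))
}

# Usage:
# sept <- zips_for_month("so", "09")
# renamed_target(sept)
```

The resulting vector of archive paths can then be fed to either of the approaches above in place of the full directory listing.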
I'm writing a loop script which involves reading a file from a workbook (using the package XLConnect). The challenge is that the file names contain characters (representing time) that I want to ignore.
For example, here are 3 paths to those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the time.
Is there a way to tell R to ignore some characters located in the file path? Here is how I was thinking of writing my script (the "???" is where I need help). I know a loop is probably not the most efficient way, but I'm open to suggestions, should you have any:
require(XLConnect)
path= "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
scheduleList = rep(list(matrix(1,1,1)),length(wd.seq))
for(i in 1:length(wd.seq)) {
wb = loadWorkbook(file= paste0(path,wd.seq[i],"???",".xlsx"))
scheduleList[[i]] = readWorksheet(wb,sheet='=SCHEDULE', header = TRUE)
}
Thanks for reading and suggestions, if any.
Mathieu
I don't know if this is helpful, but if you want to read all the files in a certain directory (which it seems to me is what you're after), you can read all the filenames into a list using the list.files() function, for example
fileList <- list.files("G://User//Documents//daily_data//")
And then load the xlsx files looping through the list with a for loop
for(i in fileList) {
loadWorkbook(file = i)
}
I haven't used the XLConnect package before, so that exact code probably doesn't work, but the loop will iterate through all the files in that directory, and you can construct your loading call using the i variable for the file name (it won't be an absolute path though, so you might need to use paste to add the first part of the file path).
I realize there might be other files in the directory that are not Excel files. You could use grepl to select only files containing "Op_Schedule_":
fileListClean <- fileList[grepl("Op_Schedule_",fileList)]
or perhaps only selecting .xlsx files in the directory:
fileListClean <- fileList[grepl("\\.xlsx$",fileList)]
Edit to fit your reply:
Since you need to fit it to a sequence, you can do it as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to pick only the files whose names contain one of those prefixes:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: I got the last part from this SO answer: grep using a character vector with multiple patterns
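An alternative to building one long alternation pattern is to pull the date out of each file name and compare it as a Date, which ignores the time part entirely. This is only a sketch: `schedule_files_between` is a hypothetical helper, and it assumes every file follows the Op_Schedule_YYYYMMDD_HHMMSS.xlsx scheme shown in the question.

```r
# Hypothetical helper: keep the schedule files whose embedded date lies
# within [from, to], ignoring the time stamp in the file name.
schedule_files_between <- function(paths, from, to) {
  pattern <- "^Op_Schedule_([0-9]{8})_[0-9]{6}\\.xlsx$"
  paths <- paths[grepl(pattern, basename(paths))]
  dates <- as.Date(sub(pattern, "\\1", basename(paths)), format = "%Y%m%d")
  keep <- !is.na(dates) & dates >= as.Date(from) & dates <= as.Date(to)
  # Return the matching paths in chronological order.
  paths[keep][order(dates[keep])]
}

# Usage:
# all_files <- list.files("G://User//Documents//daily_data//",
#                         full.names = TRUE)
# wanted <- schedule_files_between(all_files, "2014-01-01", "2016-12-31")
```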
I have 100 text files containing matrices that I want to open in R; the read.table() command can be used for that.
I can't figure out how to assign these files to separate variable names so that I can carry out operations on the matrices.
I am trying to use the for loop but keep getting error messages.
I hope somebody can help me out with this...
If you have 100 files, it may make more sense to simply keep them in one neat list.
# Get the list of files
#----------------------------#
folder <- "path/to/files"
fileList <- dir(folder, recursive=TRUE) # grep through these, if you are not loading them all
# use platform appropriate separator
files <- paste(folder, fileList, sep=.Platform$file.sep)
# Read them in
#----------------------------#
myMatrices <- lapply(files, read.table)
Then access via, eg, myMatrices[[37]] or using lapply
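Naming the list elements after the files gives most of the convenience of separate variables. The following is a sketch with a hypothetical helper `read_matrices`; the ".txt" extension is an assumption, since the question does not state one.

```r
# Hypothetical helper: read every matching file into a numeric matrix
# and name each list element after its file (without the extension).
read_matrices <- function(folder, pattern = "\\.txt$") {
  files <- list.files(folder, pattern = pattern, full.names = TRUE)
  mats <- lapply(files, function(f) as.matrix(read.table(f)))
  names(mats) <- tools::file_path_sans_ext(basename(files))
  mats
}

# Usage:
# myMatrices <- read_matrices("path/to/files")
# myMatrices$some_file_name       # access one matrix by name
# lapply(myMatrices, dim)         # apply an operation to all of them
```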
Would it be easier to just use list.files?
For example:
files <- list.files("directory/path", pattern= "regexp.if.needed")
And then you could access each element by calling files[1], files[2], etc. This would allow you to pull out either all the files in a directory, or just the ones that matched a regular expression.