Function to extract date from jpg files in a directory - r

I have a large volume (approx. 10,000) of jpg files with dates written on each one. I wish to extract the date from each jpg and add it to a dataframe with the corresponding filename.
I have read this forum and beyond, and I have tried to patch together a function in R to perform the task, but I cannot get it to work. I have used a loop to:
1) generate a list of image files in the chosen directory
2) create a dataframe for the results, with a column for the file path and a column for the date (extracted from the jpg)
3) loop through the files in the directory: resize, crop to the portion of the image showing the date, OCR the image, and write the date to the dataframe created in step 2
This seems to crash when I run the function, and I am not really sure why. I am an R user, but I have not written functions before (you can probably tell).
I am using R 3.6.0 and RStudio.
library(tesseract)
library(magick)
library(tidyverse)
library(gsubfn)
get_jpeg_date <- function(folder) {
  file_list <- list.files(path = folder, pattern = "*.jpg", recursive = TRUE)
  image_dates <- as.data.frame(file_list)
  image_dates$ImageDate <- rep_len(x = NA, length.out = length(file_list))
  eng <- tesseract("eng")
  for (i in length(file_list)) {
    ImageDate <- image_read(paste(folder, "\\", file_list, sep = "")) %>%
      image_resize("2000") %>%
      image_crop("300x100+1800") %>%
      tesseract::ocr(engine = eng) %>%
      strapplyc("\\d+/\\d+/\\d+", simplify = TRUE) %>%
      image_dates[,i]
  }
}
folder <- "C:/file_path"
x <- get_jpeg_date(folder = folder)
The code in the loop works on single files, but there is no output when I run the function on a small test sample of 3 jpg images.

Consider refactoring your function to run on a single jpg file, then build the column with sapply or map. In R, the last evaluated expression of a function is its return value; a for loop returns nothing, which is why your original function produces no output. In the refactored function below, the OCR'ed and regex-ed string is the last expression, so that is what gets returned.
get_jpeg_date <- function(pic) {
  eng <- tesseract("eng")
  image_read(pic) %>%
    image_resize("2000") %>%
    image_crop("300x100+1800") %>% # crop to the portion of the image showing the date
    tesseract::ocr(engine = eng) %>%
    strapplyc("\\d+/\\d+/\\d+", simplify = TRUE) # extract a date like 01/02/2019; this is the return value
}
file_list <- list.files(path = folder, pattern = "\\.jpg$", full.names = TRUE, recursive = TRUE)
# DATA FRAME BUILD
image_dates_df <- data.frame(img_path = file_list)
# COLUMN ASSIGNMENT
image_dates_df$img_date <- sapply(image_dates_df$img_path, get_jpeg_date)
# ALTERNATIVELY WITH dplyr::mutate() and purrr::map()
image_dates_df <- data.frame(img_path = file_list) %>%
  mutate(img_date = map(img_path, get_jpeg_date))
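Note that map() returns a list-column. If each image yields exactly one date string, purrr's map_chr() gives a plain character column instead, and it errors loudly when a file yields zero or several matches, which helps flag images the OCR could not read. A minimal sketch:
library(tidyverse)
# As above, but img_date is a character column rather than a list-column
image_dates_df <- data.frame(img_path = file_list) %>%
  mutate(img_date = map_chr(img_path, get_jpeg_date))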

Related

Adding a new column with filenames for the list of files in a for loop

I have time series data. The data are stored in txt files under daily subfolders within monthly folders.
setwd(".../2018/Jan")
parent.folder <-".../2018/Jan"
sub.folders <- list.dirs(parent.folder, recursive=TRUE)[-1] #To read the sub-folders under parent folder
r.scripts <- file.path(sub.folders)
A_2018 <- list()
for (j in seq_along(r.scripts)) {
A_2018[[j]] <- dir(r.scripts[j],"\\.txt$")}
Of these .txt files, I removed some that I don't want to use for further analysis, using the following code.
trim_to_two <- function(x) {
  runs = rle(gsub("^L1_\\d{4}_\\d{4}_", "", x))
  return(cumsum(runs$lengths)[which(runs$lengths > 2)] * -1)
}
A_2018_new <- list()
for (j in seq_along(A_2018)) {
  A_2018_new[[j]] <- A_2018[[j]][trim_to_two(A_2018[[j]])]
}
Then I want to row-bind all the .txt files with a for loop. Before that, I would like to remove some lines in each txt file and add a new column with the file name. The following is my code.
for (i in 1:length(A_2018_new)) {
  for (j in 1:length(A_2018_new[[i]])) {
    filename <- paste(str_sub(A_2018_new[[i]][j], 1, 14))
    assign(filename, read_tsv(complete_file_name, skip = 14, col_names = FALSE))
    Y <- r.scripts %>% str_sub(46, 49)
    MD <- r.scripts %>% str_sub(58, 61)
    HM <- filename %>% str_sub(9, 12)
    Turn <- filename %>% str_sub(14, 14)
    time_minute <- paste(Y, MD, HM, sep = "-")
    Map(cbind, filename, SampleID = names(filename))
  }
}
But I didn't get my desired output. I tried coding from other examples. Could anyone explain what my code is missing?
Your code seems overly complex for what it is doing. Your problem is, however, not 100% clear (e.g. what is the pattern in your file names that determines what to import and what not?). Here are some pointers that will greatly simplify the code and likely avoid the issue you are having.
Use lapply() or map() from the purrr package to iterate instead of a for loop. The benefit is that they place the resulting data frames in a list, so you don't need to assign multiple data frames into their own objects in the environment. Since you tagged the tidyverse, we'll use the purrr functions.
library(tidyverse)
You could for instance retrieve the txt file paths, using something like
txt_files <- list.files(path = 'data/folder/', pattern = "txt$", full.names = TRUE) # remove the files you don't want, with whatever logic applies
and then use map() with read_tsv() from readr like so:
mydata <- map(txt_files, read_tsv)
Then you can again use lapply() or map() to apply your manipulation to each data frame. The easiest way is to create a custom function and apply it to each one:
my_func <- function(df, filename) {
  df |>
    filter(...) |> # whatever logic applies here
    mutate(filename = filename)
}
and then use map2() to apply this function, iterating over the data and the filenames in parallel, followed by list_rbind() to bind the resulting data frames by row.
mydata_output <- map2(mydata, txt_files, my_func) |>
  list_rbind()
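Putting the pieces together, a minimal end-to-end sketch (the skip = 14 header and the basename() tagging are assumptions carried over from the code in the question; list_rbind() needs purrr >= 1.0):
library(tidyverse)
# Gather the txt files recursively; add your own exclusion logic here
txt_files <- list.files(path = ".../2018/Jan", pattern = "\\.txt$",
                        full.names = TRUE, recursive = TRUE)
# Tag each data frame with its source file name
add_filename <- function(df, path) {
  df |> mutate(filename = basename(path))
}
mydata_output <- txt_files |>
  map(read_tsv, skip = 14, col_names = FALSE) |>
  map2(txt_files, add_filename) |>
  list_rbind()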

How to iterate over excel-sheets with readxl

I am supposed to load the data for my master's thesis into an R dataframe; it is stored in 74 Excel workbooks. Every workbook has 4 worksheets, called animals, features, r_words, and verbs. All of the worksheets have the same 12 variables (starttime, word, endtime, ID, etc.). I want to concatenate every worksheet under the one before, so the resulting dataframe should have 12 columns, and the number of rows depends on how many answers the 74 subjects produced.
I want to use the readxl-package of the tidyverse and followed this article: https://readxl.tidyverse.org/articles/articles/readxl-workflows.html#csv-caching-and-iterating-over-sheets.
The first problem I face is how to read all 4 worksheets with read_excel(path, sheet = "animals", "features", "r_words", "verbs"). This only works for the first worksheet, so I tried to make a list with all the sheet names (object sheet). This is also not working. And when I try to use the following code with just one worksheet, the next line throws an error:
Error in basename(.) : a character vector argument expected
So, here is a part of my code, hopefully fulfilling the requirements:
filenames <- list.files("data", pattern = '\\.xlsm', full.names = TRUE)
# indices
subfile_nos <- 1:length(filenames)
# function to read all the sheets in at once and cache to csv
read_then_csv <- function(sheet, path) {
  for (i in 1:length(filenames)) {
    sheet <- excel_sheets(filenames[i])
    len.sheet <- 1:length(sheet)
    path <- read_excel(filenames[i], sheet = sheet[i]) # only reading in the first sheet
    pathbase <- path %>%
      basename() %>% # Error in basename(.) : a character vector argument expected
      tools::file_path_sans_ext()
    path %>%
      read_excel(sheet = sheet) %>%
      write_csv(paste0(pathbase, "-", sheet, ".csv"))
  }
}
You should do a double loop or a nested map, like so:
library(dplyr)
library(purrr)
library(readxl)
# I suggest looking at
?purrr::map_df
# Function to read all the sheets in at once and save as csv
read_then_csv <- function(input_filenames, output_file) {
  # Iterate over files and concatenate results
  map_df(input_filenames, function(f){
    # Iterate over sheets and concatenate results
    excel_sheets(f) %>%
      map_df(function(sh){
        read_excel(f, sh)
      })
  }) %>%
    # Write csv
    write_csv(output_file)
}
# Test function
filenames <- list.files("data", pattern = '\\.xlsm', full.names = TRUE)
read_then_csv(filenames, 'my_output.csv')
You say '...I want to concatenate every worksheet under the one before...'. The script below will combine all sheets from all files. Test it on a COPY of your data, in case it doesn't do what you want/need it to do.
# load names of excel files
files = list.files(path = "C:\\your_path_here\\", full.names = TRUE, pattern = "\\.xlsx$")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
  sheets <- readxl::excel_sheets(filename)
  sapply(sheets, function(s) {
    x <- readxl::read_excel(filename, sheet = s)
    if (tibble) x else as.data.frame(x) # return tibbles or plain data frames, per the tibble argument
  }, simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
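all_data is then a list (one element per workbook) of lists (one element per sheet). To get the single 12-column dataframe the question asks for, you could flatten it, for example like this (a sketch assuming every sheet really does have the same columns):
library(dplyr)
# Bind the sheets within each workbook, then bind the workbooks together
combined <- bind_rows(lapply(all_data, bind_rows))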

R - variable name in loop

I've got, let's say, 10 csv files with names such as
file_1_tail.csv
file_2_tail.csv
file_3_tail.csv
...
file_10_tail.csv
The only difference in the names is the number (1 to 10). Each file has the same structure: 1000 rows and 100 columns.
I need to read them into R, select specific columns, and write the result as a new file. My code for one file is below:
file_2_tail = read_csv("file_2_tail.csv")
file_2_tail_selected = file_2_tail[, c(1:7, 30)]
write.csv2(file_2_tail_selected, file = "file_2_selected.csv")
And now I want to use a loop to automate this for all ten files.
for (i in 1:10){
  file_"i"_tail = read_csv("file_"i"_tail.csv")
  file_"i"_tail_selected = file_"i"_tail[,c(1:7,30)]
  write.csv2(file_"i"_tail_selected, file = "file_"i"_selected.csv")
}
And of course it doesn't work: i is not interpreted inside the names in this notation. How should I fix this?
You can't assign read_csv results to a string like that. Instead, you can just store the result in a temporary variable tmp:
for (i in 1:10) {
  tmp <- read_csv(paste0("file_", i, "_tail.csv"))
  tmp <- tmp[, c(1:7, 30)]
  write.csv2(tmp, file = paste0("file_", i, "_selected.csv"))
}
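If you really do want ten separate objects in your environment, base R's assign() can build the name dynamically; a sketch (a list, or the single combined data frame below, is usually easier to work with):
library(readr)
for (i in 1:10) {
  tmp <- read_csv(paste0("file_", i, "_tail.csv"))
  # Creates e.g. `file_1_tail_selected` in the global environment
  assign(paste0("file_", i, "_tail_selected"), tmp[, c(1:7, 30)])
}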
By the way, this is probably a more efficient way to read multiple files:
library(tidyverse)
filePattern <- "\\.csv$"
fileList <- list.files(path = ".", recursive = FALSE,
                       pattern = filePattern, full.names = TRUE)
result <- fileList %>%
  purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  purrr::map_df(read_csv, .id = "FileName") %>%
  select(FileName, 2:8, 31) # .id puts FileName first, shifting the original columns 1:7 and 30 to 2:8 and 31
result

Read the file created/modified last in different directories in R

I want to read the CSV file modified (or created) most recently in each of several directories and then put them into a pre-existing single dataframe (df_total).
I have two kinds of directories to read. Some contain a single files.csv:
A:/LogIIS/FOLDER01/files.csv
Others have several subfolders, each with its own files.csv, as in the example below:
A:/LogIIS/FOLDER02/FOLDER_A/files.csv
A:/LogIIS/FOLDER02/FOLDER_B/files.csv
A:/LogIIS/FOLDER02/FOLDER_C/files.csv
A:/LogIIS/FOLDER03/FOLDER_A/files.csv
A:/LogIIS/FOLDER03/FOLDER_B/files.csv
A:/LogIIS/FOLDER03/FOLDER_C/files.csv
A:/LogIIS/FOLDER03/FOLDER_D/files.csv
Something like this...
# get a vector of all filenames
files <- list.files(path = "A:/LogIIS", pattern = "files.csv", full.names = TRUE, recursive = TRUE)
# get the directory names of these (for grouping)
dirs <- dirname(files)
# find the last file in each directory (i.e. latest modified time)
lastfiles <- tapply(files, dirs, function(v) v[which.max(file.mtime(v))])
You can then loop through these and read them in.
If you just want the latest file overall, this will be files[which.max(file.mtime(files))].
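For example, a minimal sketch of the read-in step (assuming df_total already exists and the CSVs share its columns):
# Read each latest file and append the rows to the pre-existing df_total
latest_data <- do.call(rbind, lapply(lastfiles, read.csv))
df_total <- rbind(df_total, latest_data)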
Here is a tidyverse-friendly solution:
list.files("data/",full.names = T) %>%
enframe(name = NULL) %>%
bind_cols(pmap_df(., file.info)) %>%
filter(mtime==max(mtime)) %>%
pull(value)
Consider creating a data frame of files, as file.info maintains OS file system metadata per path, such as created time:
setwd("A:/LogIIS")
files <- list.files(getwd(), full.names = TRUE, recursive = TRUE)
# DATAFRAME OF FILE, DIR, AND METADATA
filesdf <- cbind(file=files,
dir=dirname(files),
data.frame(file.info(files), row.names =NULL),
stringsAsFactors=FALSE)
# SORT BY DIR AND CREATED TIME (DESC)
filesdf <- with(filesdf, filesdf[order(dir, -xtfrm(ctime)),])
# AGGREGATE LATEST FILE PER DIR
latestfiles <- aggregate(.~dir, filesdf, FUN=function(i) head(i)[[1]])
# LOOP THROUGH LATEST FILE VECTOR FOR IMPORT
df_total <- do.call(rbind, lapply(latestfiles$file, read.csv))
Here is a pipe-friendly way to get the most recent file in a folder. It uses an anonymous function, which in my view is slightly more readable than a one-liner. file.mtime is faster than file.info(fpath)$ctime.
dir(path = "your_path_goes_here", full.names = TRUE) %>% # on Windows, use pattern = "^your_pattern"
  (function(fpath){
    ftime <- file.mtime(fpath) # file.info(fpath)$ctime for file CREATED time
    return(fpath[which.max(ftime)]) # returns the most recent file path
  })

save files into a specific subfolder in a loop in R

I feel I am very close to the solution, but at the moment I can't figure out how to get there.
I've got the following problem.
In my folder "Test" I've got stacked data files with the names M1_1, M1_2, M1_3 and so on: /Test/M1_1.dat, for example.
Now I want to separate the files, so that I get M1_1[1].dat, M1_1[2].dat, M1_1[3].dat and so on. I'd like to save these files in specific subfolders: Test/M1/M1_1[1], Test/M1/M1_1[2] and so on, and Test/M2/M1_2[1], Test/M2/M1_2[2] and so on.
I have already created the subfolders, and I've got the following command to split up the files, so that I get M1_1.dat[1] and so on:
for (e in dir(path = "Test/", pattern = ".dat", full.names = TRUE, recursive = TRUE)) {
  data <- read.table(e, header = TRUE)
  df <- data[-c(2)]
  out <- split(df, f = df$.imp)
  lapply(names(out), function(z) {
    write.table(out[[z]], paste0(e, "[", z, "].dat"),
                sep = "\t", row.names = FALSE, col.names = FALSE)
  })
}
The paste0 command gets me my desired split-up data (although it's M1_1.dat[1] instead of M1_1[1].dat), but I can't figure out how to get this data into my subfolders.
Maybe you've got an idea?
Thanks in advance.
I don't have any idea what your data looks like, so I am going to attempt to recreate the scenario with the gender datasets available at baby names.
Assuming all the files from the zip folder are stored in "inst/data", store all file paths in the all_fi variable:
all_fi <- list.files("inst/data",
full.names = TRUE,
recursive = TRUE,
pattern = "\\.txt$")
> head(all_fi, 3)
[1] "inst/data/yob1880.txt" "inst/data/yob1881.txt"
Define a function that will be applied to each file in the directory:
f.it <- function(f_in = NULL){
  # Requires dplyr (select/mutate), data.table (fread/fwrite),
  # tools (file_path_sans_ext) and jsonlite (rbind.pages)
  # Create the new folder based on the existing basename of the input file
  new_folder <- tools::file_path_sans_ext(f_in)
  dir.create(new_folder)
  data.table::fread(f_in) %>%
    select(name = 1, gender = 2, freq = 3) %>%
    mutate(
      gender = ifelse(grepl("F", gender), "female", "male")
    ) %>% (function(x){
      # Dataset contains names for males and females,
      # so that's what I'm using to mimic your split
      out <- split(x, x$gender)
      o <- jsonlite::rbind.pages(
        lapply(names(out), function(i){
          # New filename for each iteration of the split dataframes
          ###### THIS IS WHERE YOU NEED TO TWEAK FOR YOUR NEEDS
          new_dest_file <- sprintf("%s/%s.txt", new_folder, i)
          # Write the sub-data-frame to the new file
          data.table::fwrite(out[[i]], new_dest_file)
          # For our purposes, return a dataframe with file info on the new files
          data.frame(
            file_name = new_dest_file,
            file_size = file.size(new_dest_file),
            stringsAsFactors = FALSE)
        })
      )
      o
    })
}
Now we can just loop through:
NOTE: for my purposes I'm not going to spend time looping through each file, for your purposes this would apply to each of your initial files, or in my case all_fi rather than all_fi[2:5].
> jsonlite::rbind.pages(lapply(all_fi[2:5], f.it))
============================ =========
file_name file_size
============================ =========
inst/data/yob1881/female.txt 16476
inst/data/yob1881/male.txt 15306
inst/data/yob1882/female.txt 18109
inst/data/yob1882/male.txt 16923
inst/data/yob1883/female.txt 18537
inst/data/yob1883/male.txt 15861
inst/data/yob1884/female.txt 20641
inst/data/yob1884/male.txt 17300
============================ =========
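For the original question's layout specifically, here is a more direct sketch of the write step. It assumes the target subfolder is "M" followed by the number after the underscore (M1_1 → M1, M1_2 → M2, as in your example) and that the subfolders already exist:
for (e in dir(path = "Test/", pattern = "\\.dat$", full.names = TRUE)) {
  data <- read.table(e, header = TRUE)
  df <- data[-c(2)]
  out <- split(df, f = df$.imp)
  base <- tools::file_path_sans_ext(basename(e))  # e.g. "M1_1"
  subfolder <- paste0("M", sub("^.*_", "", base)) # e.g. "M1" (assumed naming rule)
  lapply(names(out), function(z) {
    # Build e.g. Test/M1/M1_1[1].dat and write the sub-data-frame there
    dest <- file.path("Test", subfolder, paste0(base, "[", z, "].dat"))
    write.table(out[[z]], dest, sep = "\t", row.names = FALSE, col.names = FALSE)
  })
}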
