Move files with specific name pattern to specific subfolders - r

I would like to move files into a specific folder based on a certain part of their name. My folder structure is shown below. The folders D1 and D2 each contain multiple files (two example names are shown for each) as well as the subfolders Brightfield and FITC. I would like to move each .TIF file into either Brightfield or FITC, depending on whether its name contains brightfield or FITC (see "What I would like").
Current situation:
main Directory
|
|___ Experiment
├── D1
├── Brightfield
│ └── FITC
|__ 20210205_DML_3_4_D1_PM_flow__w1brightfield 100 - CAM_s3_t5.TIF
|__ 20210205_DML_3_4_D1_PM_flow__w2FITC 100- CAM_s3_t5.TIF
└── D2
├── Brightfield
└── FITC
|__ 20210219_DML_3_4_D2_AM_flow__w1brightfield 100 - CAM_s4_t10.TIF
|__ 20210219_DML_3_4_D2_AM_flow__w2FITC 100- CAM_s4_t10.TIF
What I would like:
main Directory
|
|___ Experiment
├── D1
├── Brightfield
|__20210205_DML_3_4_D1_PM_flow__w1brightfield 100 - CAM_s3_t5.TIF
└── FITC
|__ 20210205_DML_3_4_D1_PM_flow__w2FITC 100- CAM_s3_t5.TIF
├── D2
├── Brightfield
|__20210219_DML_3_4_D2_AM_flow__w1brightfield 100 - CAM_s4_t10.TIF
└── FITC
|__20210219_DML_3_4_D2_AM_flow__w2FITC 100- CAM_s4_t10.TIF
In another Stack Overflow question I found some code that I thought I could adapt to my situation, but I get this error: Error in mapply(FUN = function (path, showWarnings = TRUE, recursive = FALSE, : zero-length inputs cannot be mixed with those of non-zero length. Apparently the data frame that should be built (in parts) contains only NA. The code that I used is below:
files <- c("20210205_DML_3_4_D0_PM_flow__w1brightfield 100 - CAM_s3_t5.TIF", "20210205_DML_3_4_D0_PM_flow__w2FITC 100- CAM_s3_t5.TIF",
"20210219_DML_3_4_D1_AM_flow__w1brightfield 100 - CAM_s4_t10.TIF", "20210219_DML_3_4_D1_AM_flow__w2FITC 100- CAM_s4_t10.TIF")
# write some temp (empty) files for copying
for (f in files) writeLines(character(0), f)
parts <- strcapture(".*_(D[01])_*_([brightfield]|[FITC])_.*", files, list(d="", tw=""))
parts
# d tw
# 1 D0 Brightfield
# 2 D0 FITC
# 3 D1 Brightfield
# 4 D1 FITC
dirs <- do.call(file.path, parts[complete.cases(parts),])
dirs
# [1] "D0/Brightfield" "D0/FITC" "D1/Brightfield" "D1/FITC"
### pre-condition, only files, no dir-structure
list.files(".", pattern = "D[0-9]", full.names = TRUE, recursive = TRUE)
# [1] "./20210205_DML_3_4_D0_PM_flow__w1brightfield 100 - CAM_s3_t5.TIF" "./20210205_DML_3_4_D0_PM_flow__w2FITC 100- CAM_s3_t5.TIF"
### create dirs, move files
Vectorize(dir.create)(unique(dirs), recursive = TRUE) # creates both D0 and D0/Brightfield, ...
# D0/Brightfield D0/FITC D1/Brightfield D1/FITC
# TRUE TRUE TRUE TRUE
file.rename(files, file.path(dirs, files))
# [1] TRUE TRUE TRUE TRUE
### post-condition, files in the correct locations
list.files(".", pattern = "D[0-9]", full.names = TRUE, recursive = TRUE)
Where is it going wrong?

The parts <- line is where it goes wrong; it should look like this:
parts <- strcapture(".*_(D[01])_.*(brightfield|FITC).*", files, list(d="", tw=""))
parts
Output:
> parts
d tw
1 D0 brightfield
2 D0 FITC
3 D1 brightfield
4 D1 FITC
There were a couple of errors:
You forgot a . in _*_; it should be _.*_.
Don't put [] around the words brightfield and FITC. Square brackets define a character class that matches a single character from the set, not a whole word. Use ( | ) for alternatives.
There aren't underscores around brightfield or FITC in your filenames, so don't put underscores around them in your regular expression.
May I recommend reading an introductory article or tutorial on regular expressions? It does not take much to learn what you need to overcome your problems here.
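To make the difference between [] and ( | ) concrete, here is a small base R sketch. The short string x and the folder lookup at the end are assumptions added for illustration (the lookup only exists to match the capitalised Brightfield folder from the desired layout); the file names and the corrected pattern are the ones from this thread.

```r
x <- "flow__w2FITC 100- CAM"

# [FITC] is a character class: it matches ONE character out of F, I, T, C
class_match <- regmatches(x, regexpr("[FITC]", x))

# (brightfield|FITC) is a group of alternatives: it matches a whole word
group_match <- regmatches(x, regexpr("(brightfield|FITC)", x))

files <- c("20210205_DML_3_4_D0_PM_flow__w1brightfield 100 - CAM_s3_t5.TIF",
           "20210205_DML_3_4_D0_PM_flow__w2FITC 100- CAM_s3_t5.TIF")

# corrected pattern from the answer above
parts <- strcapture(".*_(D[01])_.*(brightfield|FITC).*", files,
                    list(d = "", tw = ""))

# hypothetical lookup: captured token -> folder name used on disk
folder <- c(brightfield = "Brightfield", FITC = "FITC")
dirs <- file.path(parts$d, unname(folder[as.character(parts$tw)]))
dirs
```

With the lookup in place, dirs can be passed to dir.create() and file.rename() exactly as in the code from the question.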

Related

dir_tree() to HTML/Table

I have a very large folder with many subfolders and therefore a large number of files. I would like to create an HTML file that shows the folder structure, with dropdowns for the different levels as well as a search bar. I thought about reactable or a small shiny app, but maybe someone has a better idea. My first problem is getting the structure from fs::dir_tree into a suitable format.
Consider the following folder structure:
fs::dir_tree()
├── folder1
├── folder2
│ └── readme.R
├── folder3
│ ├── subfolder1
│ │ ├── example.R
│ │ └── example2.R
│ └── subfolder2
│ └── plot.R
You can use the jsTreeR package.
You don't need a Shiny app since a "jsTree" is an HTML widget, and you can save it as an HTML file with htmlwidgets::saveWidget.
Here is the folder example of this package:
library(jsTreeR)
# make the nodes list from a vector of file paths
makeNodes <- function(leaves){
dfs <- lapply(strsplit(leaves, "/"), function(s){
item <-
Reduce(function(a,b) paste0(a,"/",b), s[-1], s[1], accumulate = TRUE)
data.frame(
item = item,
parent = c("root", item[-length(item)]),
stringsAsFactors = FALSE
)
})
dat <- dfs[[1]]
for(i in 2:length(dfs)){
dat <- merge(dat, dfs[[i]], all = TRUE)
}
f <- function(parent){
i <- match(parent, dat$item)
item <- dat$item[i]
children <- dat$item[dat$parent==item]
label <- tail(strsplit(item, "/")[[1]], 1)
if(length(children)){
list(
text = label,
data = list(value = item),
children = lapply(children, f)
)
}else{
list(text = label, data = list(value = item))
}
}
lapply(dat$item[dat$parent == "root"], f)
}
folder <-
list.files(Sys.getenv("R_HOME"), recursive = TRUE)
nodes <- makeNodes(folder)
jstree(nodes, search = TRUE)
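If the Reduce(..., accumulate = TRUE) step inside makeNodes is hard to follow, this base R sketch runs it on a single path. The path is taken from the example tree in the question; the name edges is made up here for illustration.

```r
# one file path from the example tree, split into its components
s <- strsplit("folder3/subfolder1/example.R", "/")[[1]]

# accumulate = TRUE keeps every intermediate result: one path per level
item <- Reduce(function(a, b) paste0(a, "/", b), s[-1], s[1], accumulate = TRUE)

# pair each path with the path one level up; the top level gets "root"
edges <- data.frame(item = item,
                    parent = c("root", item[-length(item)]),
                    stringsAsFactors = FALSE)
edges
```

makeNodes builds one such table per file, merges them, and then walks from "root" downwards to produce the nested list that jstree() expects.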

R move files with specific name pattern to folder in different subdirectories

I would like to copy files to a specific folder based on a certain part of their name. Below you will find my folder structure and where the files are. Both the D0 and D1 folders contain files named like 20210308_DML_D0_Temp_s1_t1.txt or 20210308_DML_D1_weather_s3_t6.txt, where D0/D1 is the folder the file sits in, Temp/weather says whether it is a temperature or a weather file, s1/s3 is the location, and t1/t6 is the timepoint. The first thing I want to do is loop over the txt files in both the D0 and D1 folders and move the files that have Temp in their name to the temperature subfolder and the files that have weather in their name to the weather subfolder, in both D0 and D1.
main Directory
|
|___ weather_day
├── D0
├── temperature
│ └── weather
|__ 20210308_DML_D0_Temp_s1_t1.txt
|__ 20210308_DML_D1_weather_s3_t6.txt
└── D1
├── temperature
└── weather
|__ 20210308_DML_D0_Temp_s1_t1.txt
|__ 20210308_DML_D1_weather_s3_t6.txt
I tried to do it with a for loop such as:
wd = getwd() #set working directory to subfolder
pathway = paste0(wd,"/weather_day/")
for (i in pathway){
file.copy(i,"temperature")
file.copy(i,"weather")
}
In the end I want the txt files sorted into the folders according to whether they have Temp or weather in their name:
main Directory
|
|___ weather_day
├── D0
├── temperature
|__20210308_DML_D0_Temp_s1_t1.txt
└── weather
|__ 20210308_DML_D0_weather_s3_t6.txt
├── D1
├── temperature
|__20210308_DML_D1_Temp_s1_t1.txt
└── weather
|__20210308_DML_D1_weather_s3_t6.txt
However, it does not work for me. I think I have to use file.copy, but how can I use this function to move the file based on a certain name pattern of the file and can I use a for loop in a for loop to read over the folders D0 and D1 and then the txt files in these folders?
You didn't provide very much information to go off of. If I understand what you're asking, this should work.
library(tidyverse)
# collect a list of files with their paths
collector = list.files(paste0(getwd(), "/weather_day"),
full.names = TRUE, # capture the file names along with the full path
recursive = TRUE) # look in subfolders
# establish the new 'weather' path
weather = paste0(getwd(), "/weather/")
# establish the new 'temp' path
temp = paste0(getwd(), "/temp/")
collector = data.frame("files" = collector) %>% # original path
mutate(files2 = ifelse(str_detect(str_extract(files,
"([^\\/]+$)"),
"weath"), # if weather, make a new path
paste0(weather,
str_extract(files,
"([^\\/]+$)")
), # end paste0/ if true
ifelse(str_detect(str_extract(files,
"([^\\/]+$)"),
"[Tt]emp"), # if temp (the file names use "Temp"), make a new path
paste0(temp,
str_extract(files,
"([^\\/]+$)")
), # end paste0/ if true
files) # if not weather or temp, no change
) # end if
) # end mutate
dir.create(weather) # create directories
dir.create(temp)
# move the files
file.rename(from = collector[,1],
to = collector[,2])
# validate the change
list.files(weather) # see what's different
list.files(temp) # see what's different
Based on what @alexdegrote1995 added, how about this:
# collect a list of files with their paths
collector = list.files(paste0(getwd(), "/weather_day"),
full.names = TRUE, # capture the file names along with the full path
recursive = TRUE) # look in subfolders
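# establish the new 'weather' path (inside the existing weather_day/D0 folder)
weather = paste0(getwd(), "/weather_day/D0/weather/")
# establish the new 'temp' path (inside the existing weather_day/D0 folder)
temp = paste0(getwd(), "/weather_day/D0/temperature/")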
collector = data.frame("files" = collector) %>%
mutate(files2 = ifelse(str_detect(str_extract(files,
"([^\\/]+$)"),
"weath"),
paste0(weather,
str_extract(files,
"([^\\/]+$)")
), # end paste0/ if true
ifelse(str_detect(str_extract(files,
"([^\\/]+$)"),
"[Tt]emp"), # match "Temp" as well as "temp"
paste0(temp,
str_extract(files,
"([^\\/]+$)")
), # end paste0/ if true
files) # if not weather or temp, don't change
), # end if
filesD1 = paste0(gsub(pattern = "D0", # make a third column for the D1 folder
replacement = "D1",
x = files2))) # end mutate
file.rename(from = collector[,1], # move files to the D0 folder
to = collector[,2])
file.copy(from = collector[,2], # add copy to the D1 folder
to = collector[,3])
Edited to include more filenames, pre-conditions (no dir structure), and post-conditions. (Plus move instead of copy.)
files <- c("20210308_DML_D0_Temp_s1_t1.txt", "20210308_DML_D0_weather_s3_t6.txt",
"20210308_DML_D1_Temp_s1_t1.txt", "20210308_DML_D1_weather_s3_t6.txt")
# write some temp (empty) files for copying
for (f in files) writeLines(character(0), f)
parts <- strcapture(".*_(D[01])_([Tt]emp|[Ww]eather)_.*", files, list(d="", tw=""))
parts
# d tw
# 1 D0 Temp
# 2 D0 weather
# 3 D1 Temp
# 4 D1 weather
dirs <- do.call(file.path, parts[complete.cases(parts),])
dirs
# [1] "D0/Temp" "D0/weather" "D1/Temp" "D1/weather"
### pre-condition, only files, no dir-structure
list.files(".", pattern = "D[0-9]", full.names = TRUE, recursive = TRUE)
# [1] "./20210308_DML_D0_Temp_s1_t1.txt" "./20210308_DML_D0_weather_s3_t6.txt" "./20210308_DML_D1_Temp_s1_t1.txt"
# [4] "./20210308_DML_D1_weather_s3_t6.txt"
### create dirs, move files
Vectorize(dir.create)(unique(dirs), recursive = TRUE) # creates both D0 and D0/Temp, ...
# D0/Temp D0/weather D1/Temp D1/weather
# TRUE TRUE TRUE TRUE
file.rename(files, file.path(dirs, files))
# [1] TRUE TRUE TRUE TRUE
### post-condition, files in the correct locations
list.files(".", pattern = "D[0-9]", full.names = TRUE, recursive = TRUE)
# [1] "./D0/Temp/20210308_DML_D0_Temp_s1_t1.txt" "./D0/weather/20210308_DML_D0_weather_s3_t6.txt"
# [3] "./D1/Temp/20210308_DML_D1_Temp_s1_t1.txt" "./D1/weather/20210308_DML_D1_weather_s3_t6.txt"
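If the temperature and weather subfolders already exist inside D0 and D1, as in the question, the same idea can be adapted to move each file into the subfolder next to it. This is only a sketch in base R: it builds a throwaway copy of the layout in a temp directory, and the lookup vector that maps Temp to the temperature folder is an assumption made here for illustration.

```r
# build a throwaway copy of the layout from the question
root <- file.path(tempdir(), "weather_day")
unlink(root, recursive = TRUE)
for (d in c("D0", "D1")) {
  for (sub in c("temperature", "weather")) {
    dir.create(file.path(root, d, sub), recursive = TRUE)
  }
  file.create(file.path(root, d, paste0("20210308_DML_", d, "_Temp_s1_t1.txt")))
  file.create(file.path(root, d, paste0("20210308_DML_", d, "_weather_s3_t6.txt")))
}

# txt files sitting directly in D0 or D1 whose name carries a known token
files <- list.files(root, pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
files <- files[basename(dirname(files)) %in% c("D0", "D1")]
files <- files[grepl("_(temp|weather)_", basename(files), ignore.case = TRUE)]

# hypothetical lookup: token in the file name -> existing subfolder name
lookup <- c(temp = "temperature", weather = "weather")
token <- tolower(sub(".*_(temp|weather)_.*", "\\1", basename(files),
                     ignore.case = TRUE))
target <- file.path(dirname(files), unname(lookup[token]), basename(files))

# file.rename() moves the files; use file.copy() instead to keep the originals
moved <- file.rename(files, target)
list.files(root, recursive = TRUE)
```

No nested for loop is needed for the move itself because file.rename() is vectorised over both arguments.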

R Shiny not reading file path

Consider a shiny app with the following folder structure (github here):
├── data
│ └── import_finance.R
│ └── data.xlsx
├── dashboard.R
├── ui_elements
│ └── sidebar.R
│ └── body.R
The import_finance.R runs correctly when not run in the shiny app. However, when running the shiny app it fails to recognize the path.
# list of quarterly earnings worksheets
file_paths <- list.files('dashboard/data', pattern = 'xlsx', full.names = TRUE)
path <- file_paths[1]
# import all sheets from each workbook and merge to one df
import <-
path %>%
excel_sheets() %>%
set_names() %>%
map_df(xlsx_cells, path = path) %>%
mutate(workbook = word(path, -1, sep = '/')) # add column with file_name
Shiny says Error: path does not exist: ‘NA’. Please note that import_finance.R is called in dashboard.R via source("data/import_finance.R").
Even when specifying the full path the error persists. This thread claims the same error:
> system.file("/Users/pblack/Documents/Git Projects/opensource-freetoplay-economics/dashboard/data/dashboard/data/ATVI 12-Quarter Financial Model Q1 CY20a.xlsx")
[1] ""
Any idea of the mistake I'm making? It's strange the script runs fine, but just not when running as a shiny app.
The error here was in
# list of quarterly earnings worksheets
file_paths <- list.files('dashboard/data', pattern = 'xlsx', full.names = TRUE)
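When the Shiny app runs, the working directory is the app folder (dashboard), and source("data/import_finance.R") does not change it. Relative to that folder, 'dashboard/data' does not exist, so list.files() finds nothing and file_paths[1] is NA.
To fix this, simply use a path relative to the app folder: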
# list of quarterly earnings worksheets
file_paths <- list.files('data', pattern = 'xlsx', full.names = TRUE)
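The NA in the error message follows from how list.files() treats a folder it cannot find. A minimal base R sketch, using a made-up folder under tempdir() that is assumed not to exist:

```r
# a folder that does not exist (hypothetical path, for illustration only)
missing_dir <- file.path(tempdir(), "no_such_dashboard", "data")
file_paths <- list.files(missing_dir, pattern = "xlsx", full.names = TRUE)

# list.files() does not raise an error here, it returns character(0)
length(file_paths)

# indexing past the end of an empty vector yields NA
path <- file_paths[1]
is.na(path)

# a cheap guard that reports the working directory when nothing is found
if (length(file_paths) == 0) {
  message("No xlsx files found in ", missing_dir,
          "; working directory is ", getwd())
}
```

Printing getwd() from inside the sourced script is usually the quickest way to see which folder the app is really running from.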

R regex to list files that do not begin with eg `AA` or `BB`

Here's the reprex we'll need to create in our working directory:
library(tidyverse)
library(openxlsx)
library(readxl)
write.xlsx(list(iris), "AA-excel-file.xlsx")
write.xlsx(list(iris), "BB-excel-file.xlsx")
write.xlsx(list(iris), "CC-excel-file.xlsx")
write.xlsx(list(iris), "DD-excel-file.xlsx")
write.xlsx(list(iris), "EE-excel-file.xlsx")
And my working directory looks something like this:
C:
├── my-R-working-directory/
├── AA-excel-file.xlsx
├── BB-excel-file.xlsx
├── CC-excel-file.xlsx
├── DD-excel-file.xlsx
└── EE-excel-file.xlsx
I've crafted a regular expression (demo here) that "selects" any file that does not begin with either AA or BB:
^(?!AA|BB)\w+$
I want to use this regex with base R list.files() to list all files that do not begin with either AA or BB. Here's my attempt:
list.files("path/of/folder", pattern = "\\^(?!AA|BB)\w+$.xlsx$", full.names = TRUE)
#> Error: '\w' is an unrecognized escape in character string starting ""\\^(?!AA|BB)\w"
#> Error: unexpected ')' in " full.names = TRUE)"
I think my pattern argument is slightly off. This similar command does work fine, but doesn't exclude the AA and BB files:
list.files("path/of/folder", pattern = "\\.xlsx$", full.names = TRUE)
How do I properly write the pattern argument to exclude any files that start with AA or BB? And if you have the capability can you correct my regular expression? The regex only seems to work with "letters or numbers" characters. Any white space, dashes, dots, etc. break the regex (see demo).
You could use pattern to get all xlsx files then inverse grep those starting with AA or BB:
library(tidyverse)
library(openxlsx)
library(readxl)
write.xlsx(list(iris), "AA-excel-file.xlsx")
write.xlsx(list(iris), "BB-excel-file.xlsx")
write.xlsx(list(iris), "CC-excel-file.xlsx")
write.xlsx(list(iris), "DD-excel-file.xlsx")
write.xlsx(list(iris), "EE-excel-file.xlsx")
grep("^(AA|BB).*", list.files(pattern = "\\.xlsx$"), invert = TRUE, value = TRUE)
#> [1] "CC-excel-file.xlsx" "DD-excel-file.xlsx" "EE-excel-file.xlsx"
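If you would rather keep the lookahead from the question, one option is to list the files first and filter with grep(perl = TRUE), since list.files() itself has no perl switch. This is a sketch on a plain character vector; the extra "CC notes.txt" entry is made up here to show that other extensions are dropped.

```r
files <- c("AA-excel-file.xlsx", "BB-excel-file.xlsx", "CC-excel-file.xlsx",
           "DD-excel-file.xlsx", "EE-excel-file.xlsx", "CC notes.txt")

# (?!AA|BB) is a negative lookahead, which needs perl = TRUE;
# .* instead of \\w+ also accepts dashes, dots and spaces
keep <- grep("^(?!AA|BB).*\\.xlsx$", files, perl = TRUE, value = TRUE)
keep
```

In practice files would come from list.files("path/of/folder"), and file.path() can add the folder back if full paths are needed.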

Efficient strategy for recursive `list.files()` call in R function

I have a folder that will receive raw data from remote stations.
The data structure is mostly controlled by the acquisition date, with a general pattern given by:
root/unit_mac_address/date/data_here/
Here's an example of a limited tree call for one unit with some days of recording.
.
├── 2020-04-03
│   ├── 2020-04-03T11-01-31_capture.txt
│   ├── 2020-04-03T11-32-36_capture.txt
│   ├── 2020-04-03T14-58-43_capture.txt
│   ├── img
│   └── temperatures.csv
...
├── 2020-05-21
│   ├── 2020-05-21T11-10-55_capture.txt
│   ├── img
│   └── temperatures.csv
└── dc:a6:32:2d:b8:62_ip.txt
Inside each img folder, I have hundreds/thousands of images that are all timestamped with the datetime of acquisition.
My goal is to pool the data in temperatures.csv from all the units at a target_date.
My current approach is the following:
# from root dir, get all the temperatures.csv files
all_files <- list.files(pattern = "temperatures.csv",
full.names = TRUE,
# do it for all units
recursive = TRUE)
# subset the files from the list that contain the target_date
all_files <- all_files[str_detect(all_files, target_date)]
# read and bind into df
df <- lapply(all_files, function(tt) read_csv(tt)) %>%
bind_rows()
I chose to search for temperatures.csv because it's not timestamped, but I guess I am also going through all the imgs anyway. I don't think there's a way to limit list.files() to a certain level of recursion.
This works but, is it the best way to do it? What can be done to improve performance? Data comes in every day, so there is a growing number of files that the list.files() function will have to go through for each of the 10-20 units.
Would it be more efficient if the temperatures.csv files themselves carried the timestamp (2020-05-26_temperatures.csv)? I can ask for timestamps on the temperatures.csv files (not the current approach), but I feel I should be able to handle this on my side.
Would it be more efficient to only look for dirs that have target_date? and then build the paths so that it's only looking at the first level on each target_date dir? Any hints on doing this appreciated.
Using the comment as a good guide, here's the benchmarking for the alternative way to do this.
Here's the gist of the new function
all_units <- list.dirs(recursive = FALSE)
# mac address has ":", which is unlikely in another dir
all_units <- all_units[str_detect(all_units, ":")]
all_dirs <- lapply(all_units,
function(tt) list.dirs(path = tt, recursive = FALSE)) %>%
unlist()
# there shouldn't be children but
# we paste0 to only get the date folder and not the children of that folder
relevant_dirs <- all_dirs[str_detect(all_dirs, paste0(target_date, "$"))]
all_files <- lapply(relevant_dirs,
function(tt)
list.files(tt, pattern = "temperatures.csv", full.names = TRUE)) %>%
unlist()
df <- lapply(all_files, function(tt) read_csv(tt)) %>%
bind_rows()
Here's the actual benchmark for a target day with just 2 units getting data, I suspect this difference will only get bigger with time for the recursive option.
Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
   read_data 111.76401 124.17692 130.59572 127.84681 133.35566 317.6134  1000
 read_data_2  39.72021  46.58495  50.80255  49.05811  52.01692 141.2126  1000
If you are really worried about performance you might consider data.table, if only for its fast fread function. Here is some example code:
library(data.table)
target_date <- "2020-04-03"
all_units <- list.dirs(path=".", recursive = FALSE) # or insert your path
all_units <- grep(":..:..:", all_units, value = TRUE) # keep only the MAC-address folders
temp_files <- sapply(all_units,
function(x) file.path(x, paste0(target_date, "/temperatures.csv")),
USE.NAMES = FALSE)
idx <- file.exists(temp_files)
df <- setNames(lapply(temp_files[idx], fread),
paste(basename(all_units[idx]), target_date, sep="_"))
rbindlist(df, idcol = "ID")
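The same "build the path instead of searching for it" idea also works without data.table. Below is a base R sketch on a throwaway layout in a temp directory. The unit names unit_a and unit_b and the temp column are placeholders made up here (real units are named by MAC address), and the MAC-address filter is left out for that reason.

```r
# throwaway layout: two hypothetical units, only one has data on the target date
root <- file.path(tempdir(), "stations")
unlink(root, recursive = TRUE)
target_date <- "2020-04-03"
dir.create(file.path(root, "unit_a", target_date), recursive = TRUE)
dir.create(file.path(root, "unit_b", "2020-05-21"), recursive = TRUE)
write.csv(data.frame(temp = c(20.5, 21.0)),
          file.path(root, "unit_a", target_date, "temperatures.csv"),
          row.names = FALSE)

# one non-recursive listing for the units, then construct the paths directly
units <- list.dirs(root, recursive = FALSE)
candidates <- file.path(units, target_date, "temperatures.csv")
found <- candidates[file.exists(candidates)]

# read each file and tag its rows with the unit it came from
pieces <- lapply(found, function(f) {
  cbind(unit = basename(dirname(dirname(f))), read.csv(f))
})
df <- do.call(rbind, pieces)
df
```

Because the img folders are never listed, the cost depends on the number of units rather than on the number of images or days recorded.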
