Looping through subdirectories, reshaping a matrix and saving output in R

I have a project folder (project1) which contains data from 20 Subjects (Sub-001, Sub-002 etc.). Each Subject's folder contains a subfolder (named AD) which has an AD.csv correlation matrix inside.
The folder structure looks like this:
├── Sub-001
│   └── AD
│       └── AD.csv
└── Sub-002
    └── *
I want to write a for loop that iteratively goes into each Subject's AD folder, reshapes the AD.csv matrix from wide to long, and saves the reshaped matrix as a new subj_AD_reshaped.csv file, named after the subject's ID (e.g., Sub-001_AD_reshaped.csv), in the Subject's folder.
The code I have written:
files <- list.files("group/S/project1/")
subjects <- list.dirs(path = files, full.names = T, recursive = F)
subj.AD <- subjects/"AD"
for (subj.AD in 1:length(subjects)) {
  df <- read.csv("matrix.csv", sep = "", header = F)
  df.matrix <- as.matrix(df)
  df.matrix[lower.tri(df.matrix)] <- NA
  df.matrix.melt <- as.data.frame(reshape2::melt(df.matrix, na.rm = T))
  write.csv(df.matrix.melt, file = "subjects/Subject_AD_reshaped.csv")
}
This code doesn't work mainly because R doesn't seem to recognise the AD folder for both input and output. What's the best way to write this for-loop?

Maybe the following code will solve the problem.
base_path <- "group/S/project1"
subjects <- list.dirs(path = base_path, full.names = TRUE, recursive = FALSE)
for (s in subjects) {
  # read this subject's wide correlation matrix
  fl <- file.path(s, "AD", "AD.csv")
  df1 <- read.table(fl, sep = ",")
  df1.matrix <- as.matrix(df1)
  # blank the lower triangle so each pair is kept only once
  is.na(df1.matrix) <- lower.tri(df1.matrix)
  # melt the matrix directly: na.rm = TRUE drops the blanked lower triangle
  df2.melt <- reshape2::melt(df1.matrix, na.rm = TRUE)
  # e.g. "Sub-001_AD_reshaped.csv", written into the subject's folder
  out_file <- paste0(basename(s), "_AD_reshaped.csv")
  out_file <- file.path(s, out_file)
  write.table(df2.melt, out_file, sep = ",", row.names = FALSE)
}
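As a quick sanity check on the long format melt() produces for a matrix whose lower triangle has been blanked, here is a toy 3x3 example (not the subjects' data):
m <- matrix(1:9, nrow = 3)
is.na(m) <- lower.tri(m)
reshape2::melt(m, na.rm = TRUE)
# returns three columns (Var1 = row index, Var2 = column index, value),
# one row per upper-triangle cell:
# (1,1)=1, (1,2)=4, (2,2)=5, (1,3)=7, (2,3)=8, (3,3)=9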

Related

Efficient strategy for recursive `list.files()` call in R function

I have a folder that will receive raw data from remote stations.
The data structure is mostly controlled by the acquisition date, with a general pattern given by:
root/unit_mac_address/date/data_here/
Here's an example of a limited tree call for one unit with some days of recording.
.
├── 2020-04-03
│   ├── 2020-04-03T11-01-31_capture.txt
│   ├── 2020-04-03T11-32-36_capture.txt
│   ├── 2020-04-03T14-58-43_capture.txt
│   ├── img
│   └── temperatures.csv
...
├── 2020-05-21
│   ├── 2020-05-21T11-10-55_capture.txt
│   ├── img
│   └── temperatures.csv
└── dc:a6:32:2d:b8:62_ip.txt
Inside each img folder, I have hundreds/thousands of images that are all timestamped with the datetime of acquisition.
My goal is to pool the data in temperatures.csv from all the units at a target_date.
My current approach is the following:
# from root dir, get all the temperatures.csv files
all_files <- list.files(pattern = "temperatures.csv",
                        full.names = TRUE,
                        # do it for all units
                        recursive = TRUE)
# subset the files from the list that contain the target_date
all_files <- all_files[str_detect(all_files, target_date)]
# read and bind into df
df <- lapply(all_files, function(tt) read_csv(tt)) %>%
  bind_rows()
I chose to search for temperatures.csv because it's not timestamped, but I guess I am also going through all the img folders anyway. I don't think there's a way to limit list.files() to a certain level of recursion.
This works, but is it the best way to do it? What can be done to improve performance? Data comes in every day, so there is a growing number of files that list.files() will have to go through for each of the 10-20 units.
Would it be more efficient if the temperatures.csv files themselves carried the timestamp (2020-05-26_temperatures.csv)? I could ask for timestamps on the temperatures.csv files themselves (not the current approach), but I feel I should be able to handle this on my side.
Would it be more efficient to look only for dirs matching target_date, and then build the paths so that the search only looks at the first level of each target_date dir? Any hints on doing this appreciated.
Using the comment as a guide, here's the benchmark for the alternative way of doing this.
Here's the gist of the new function:
all_units <- list.dirs(recursive = FALSE)
# a mac address contains ":", which is unlikely in another dir name
all_units <- all_units[str_detect(all_units, ":")]
all_dirs <- lapply(all_units,
                   function(tt) list.dirs(path = tt, recursive = FALSE)) %>%
  unlist()
# anchor the regex with "$" so we match the date folder itself,
# not any children of that folder
relevant_dirs <- all_dirs[str_detect(all_dirs, paste0(target_date, "$"))]
all_files <- lapply(relevant_dirs,
                    function(tt)
                      list.files(tt, pattern = "temperatures.csv", full.names = TRUE)) %>%
  unlist()
df <- lapply(all_files, function(tt) read_csv(tt)) %>%
  bind_rows()
Here's the actual benchmark for a target day with just 2 units getting data; I suspect this difference will only get bigger over time for the recursive option.
Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
   read_data 111.76401 124.17692 130.59572 127.84681 133.35566 317.6134  1000
 read_data_2  39.72021  46.58495  50.80255  49.05811  52.01692 141.2126  1000
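For reference, a comparison like the one above can be reproduced with the microbenchmark package, assuming the two approaches are wrapped in functions read_data() and read_data_2() as the expr names suggest:
library(microbenchmark)
microbenchmark(
  read_data   = read_data(target_date),
  read_data_2 = read_data_2(target_date),
  times = 1000
)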
If you are really worried about performance, you might consider data.table, if only for its fast fread function. Here is some example code:
library(data.table)
target_date <- "2020-04-03"
all_units <- list.dirs(path = ".", recursive = FALSE) # or insert your path
# keep only the unit dirs, i.e. those named by a mac address containing ":"
all_units <- grep(":..:..:", all_units, value = TRUE)
# build the expected path for each unit directly; no recursive listing needed
temp_files <- sapply(all_units,
                     function(x) file.path(x, target_date, "temperatures.csv"),
                     USE.NAMES = FALSE)
# a unit may have no data for the target date, so keep only existing files
idx <- file.exists(temp_files)
df <- setNames(lapply(temp_files[idx], fread),
               paste(basename(all_units[idx]), target_date, sep = "_"))
rbindlist(df, idcol = "ID")

Trying to rename multiple .csv files using data contained within the file

A machine I use spits out .csv files named by the time. But I need them named after the plate they were read from, which is contained within the file.
I created a list of files:
files <- list.files(path="", pattern="*.csv")
I then tried using a for loop to first create a data frame from each file containing the first row only, then to create a variable from the relevant piece of data (the desired name), and then to rename the files.
for (x in files) {
  y <- read.csv(x, nrow = 1, header = FALSE, stringsAsFactors = TRUE)
  z <- y[2, 2]
  file.rename(x, z)
}
It didn't work. After 7 hours of trying (I'm new to R) I am here. Please give simple advice; I have basically zero R experience.
I believe the following for loop does what the question asks for if the new filename is the second column header value.
If it is not, change nmax to the appropriate column number.
fls <- list.files(pattern = '\\.csv')
for (f in fls) {
  # read just the first two comma-separated fields of the first line
  x <- scan(file = f, what = character(), nmax = 2, nlines = 1, sep = ',')
  # the new name is the second field, with the .csv extension restored
  g <- paste0(x[2], '.csv')
  file.rename(f, g)
}
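One caveat: depending on the platform, file.rename() can silently overwrite an existing file, so if two plates could end up with the same name, a small guard helps (a sketch; the numeric collision suffix is my own choice):
fls <- list.files(pattern = '\\.csv')
for (f in fls) {
  x <- scan(file = f, what = character(), nmax = 2, nlines = 1, sep = ',')
  g <- paste0(x[2], '.csv')
  # append a counter rather than clobber a file that is already there
  n <- 1
  while (file.exists(g)) {
    g <- paste0(x[2], '_', n, '.csv')
    n <- n + 1
  }
  file.rename(f, g)
}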

My R script not picking up all the files in the folder

My R script is trying to aggregate Excel spreadsheets that are in different folders within the Concerned Files folder (shown in the directory below) and put all the data into one master file. However, the script is randomly selecting files to copy information from, and when I run the code the following error shows, so I am assuming this is why it's not choosing every file in the folder:
all_some_data <- rbind(all_some_data, temp)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The whole code:
#list of people's names it has to search the folders for. For our purposes, I am only taking one name
managers <- c("Name")
#directory of all the files
directory = 'C:/Users/Username/OneDrive/Desktop/Testing/Concerned Files/'
#Create an empty dataframe
all_HR_data <-
  setNames(
    data.frame(matrix(ncol = 8, nrow = 0)),
    c("Employee", "ID", "Overtime", "Regular", "Total", "Start", "End", "Manager")
  )
str(files)
#loop through managers to get time sheets and then add each file to the combined dataframe
for (i in managers) {
  #a path to find all the extract files
  files <-
    list.files(
      path = paste(directory, i, "/", sep = ""),
      pattern = "*.xls",
      full.names = FALSE,
      recursive = FALSE
    )
  #for each file: get the start and end date of the period, remove unnecessary columns,
  #rename columns and add the manager's name
  for (j in files) {
    temp <- read_excel(paste(directory, i, "/", j, sep = ""), skip = 8)
    #a bunch of manipulations with the data being copied over. Code not relevant to the problem
    all_some_data <- rbind(all_some_data, temp)
  }
}
The most likely cause of your problem is an extra column in one or more of your files.
A potential solution, along with a performance improvement, is to use the bind_rows function from the dplyr package. This function is more fault-tolerant than base R's rbind.
Wrap your loop in an lapply statement and then use bind_rows on the entire list of dataframes in one go.
output <- lapply(files, function(j) {
  temp <- read_excel(paste(directory, i, "/", j, sep = ""), skip = 8)
  #a bunch of manipulations with the data being copied over.
  # Code not relevant to the problem
  temp #this is the returned value to the list
})
all_some_data <- dplyr::bind_rows(output)
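To find the offending file(s) directly, one option is to compare column counts per file before binding. A quick diagnostic sketch, assuming the sheets should all have the 8 columns set up in all_HR_data:
col_counts <- sapply(files, function(j) {
  ncol(read_excel(paste(directory, i, "/", j, sep = ""), skip = 8))
})
# any file whose count differs from the expected 8 is a culprit
col_counts[col_counts != 8]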

Problem using rbind inside a nested for loop

I am trying to open and read multiple NetCDF files and I want to save the result to one list of many data frames.
In my working directory, I have the folder "main_folder", which contains five folders (x1, x2, x3, x4 and x5). Each of these five contains a different number of subfolders; for example, the folder x1 contains subfolders "y1" to "y20". The folder y1 contains n1 NetCDF files, the folder y2 contains n2 NetCDF files, and so on; similarly for the other folders x2, x3, x4 and x5.
From the folder x1, I want to open, read and get the variables from all NetCDF files and combine them into one data frame, df1.
From the folder x2, I want to make the second data frame, df2, and so on.
At the end I will have five data frames, one per folder, and I then want to make a list of these five data frames.
I wrote code that works except for one problem: the second data frame in the list contains the data of df1 with the data of the second folder, df2, appended to it, and df5 contains the data of df1+df2+df3+df4+df5.
How can I solve this problem?
Here is my code:
setwd("E:/main_folder")
#1# list all folders in the main_folder
folders <- as.list(list.files("E:/main_folder"))
#2# make a list of subfolders
subfiles <- lapply(folders, function(x) as.list(list.files(paste("E:/main_folder", x, sep = "/"))))
#3# list the netcdf files from each subfolder
files1 <- lapply(subfiles[[1]], function(x) list.files(paste(folders[1], x, sep = "/"), pattern = '*.nc', full.names = TRUE))
files2 <- lapply(subfiles[[2]], function(x) list.files(paste(folders[2], x, sep = "/"), pattern = '*.nc', full.names = TRUE))
files3 <- lapply(subfiles[[3]], function(x) list.files(paste(folders[3], x, sep = "/"), pattern = '*.nc', full.names = TRUE))
files4 <- lapply(subfiles[[4]], function(x) list.files(paste(folders[4], x, sep = "/"), pattern = '*.nc', full.names = TRUE))
files5 <- lapply(subfiles[[5]], function(x) list.files(paste(folders[5], x, sep = "/"), pattern = '*.nc', full.names = TRUE))
#4# join all files in one list
filelist <- list(files1, files2, files3, files4, files5)
#5# read the NetCDF files and get the desired variables
df <- data.frame()
MissionsData <- list()
for (i in seq_along(filelist)) {
  n <- length(filelist[[i]])
  for (j in 1:n) {
    for (m in 1:length(filelist[[i]][[j]])) {
      nc <- nc_open(filelist[[i]][[j]][[m]])
      lat <- ncvar_get(nc, "glat.00")
      lon <- ncvar_get(nc, "glon.00")
      ssh <- ncvar_get(nc, "ssh.53")
      jdn <- ncvar_get(nc, "jday.00")
      df <- rbind(df, data.frame(lat, lon, ssh, jdn))
      nc_close(nc)
    }
  }
  MissionsData[[i]] <- df
}
In addition, can I do step #3# in one go instead of typing it out manually?
#3 Nesting the code inside another `lapply`, iterating over the folder index k so each set of subfolders is paired with its own parent folder, should do the job:
filelist = lapply(seq_along(folders), function(k){
  lapply(subfiles[[k]], function(x) list.files(paste(folders[[k]], x, sep = "/"),
                                               pattern = '*.nc', full.names = TRUE))
})
# This might work as #5. The root cause of your accumulation problem is that df is
# created once, outside the outer loop, and never reset, so each mission inherits the
# rows of all previous missions. The lapply below avoids that by building one fresh
# data frame per mission. It was written without reproducible data, so I didn't test it.
MissionsData = lapply(filelist, function(x){
  # I don't see the j and m indexes used for any purpose other than looping,
  # so I just unlist these files into a vector
  files_i = unlist(x, recursive = TRUE)
  df_list = lapply(files_i, function(file_i){
    nc = nc_open(file_i)
    lat = ncvar_get(nc, "glat.00")
    lon = ncvar_get(nc, "glon.00")
    ssh = ncvar_get(nc, "ssh.53")
    jdn = ncvar_get(nc, "jday.00")
    nc_close(nc)
    return(data.frame(lat, lon, ssh, jdn))
  })
  # one fresh data frame per mission: no carry-over between missions
  do.call(rbind, df_list)
})
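If you'd rather keep the original nested for loops, the minimal fix is to reset df at the top of the outer loop so each mission starts from an empty data frame:
MissionsData <- list()
for (i in seq_along(filelist)) {
  df <- data.frame() # reset here, so mission i doesn't inherit missions 1..(i-1)
  # ... inner j and m loops exactly as before ...
  MissionsData[[i]] <- df
}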

Save files into a specific subfolder in a loop in R

I feel I am very close to the solution, but at the moment I can't figure out how to get there.
I've got the following problem.
In my folder "Test" I've got stacked data files with the names M1_1, M1_2, M1_3 and so on: /Test/M1_1.dat, for example.
Now I want to separate the files, so that I get M1_1[1].dat, M1_1[2].dat, M1_1[3].dat and so on. I'd like to save these files in specific subfolders: Test/M1/M1_1[1], Test/M1/M1_1[2] and so on, and Test/M2/M1_2[1], Test/M2/M1_2[2] and so on.
I have already created the subfolders, and I've got the following command to split up the files so that I get M1_1.dat[1] and so on:
for (e in dir(path = "Test/", pattern = ".dat", full.names = TRUE, recursive = TRUE)) {
  data <- read.table(e, header = TRUE)
  df <- data[-c(2)]
  out <- split(df, f = df$.imp)
  lapply(names(out), function(z) {
    write.table(out[[z]], paste0(e, "[", z, "].dat"),
                sep = "\t", row.names = FALSE, col.names = FALSE)
  })
}
Now the paste0 command gets me my desired split-up data (although it's M1_1.dat[1] instead of M1_1[1].dat), but I can't figure out how to get this data into my subfolders.
Maybe you've got an idea?
Thanks in advance.
I don't have any idea what your data looks like, so I am going to attempt to recreate the scenario with the gender datasets available at baby names.
Assuming all the files from the zip folder are stored in "inst/data", store all the file paths in the all_fi variable:
all_fi <- list.files("inst/data",
                     full.names = TRUE,
                     recursive = TRUE,
                     pattern = "\\.txt$")
> head(all_fi, 3)
[1] "inst/data/yob1880.txt" "inst/data/yob1881.txt"
Set up a function that will be applied to each file in the directory:
library(dplyr)     # for %>%, select, mutate
library(tools)     # for file_path_sans_ext
library(jsonlite)  # for rbind.pages

f.it <- function(f_in = NULL){
  # Create the new folder based on the existing basename of the input file
  new_folder <- file_path_sans_ext(f_in)
  dir.create(new_folder)
  data.table::fread(f_in) %>%
    select(name = 1, gender = 2, freq = 3) %>%
    mutate(
      gender = ifelse(grepl("F", gender), "female", "male")
    ) %>% (function(x){
      # Dataset contains names for males and females,
      # so that's what I'm using to mimic your split
      out <- split(x, x$gender)
      o <- rbind.pages(
        lapply(names(out), function(i){
          # New filename for each iteration of the split dataframes
          ###### THIS IS WHERE YOU NEED TO TWEAK FOR YOUR NEEDS
          new_dest_file <- sprintf("%s/%s.txt", new_folder, i)
          # Write the sub-data-frame to the new file
          data.table::fwrite(out[[i]], new_dest_file)
          # For our purposes, return a dataframe with file info on the new files...
          data.frame(
            file_name = new_dest_file,
            file_size = file.size(new_dest_file),
            stringsAsFactors = FALSE)
        })
      )
      o
    })
}
Now we can just loop through.
NOTE: for my purposes I'm not going to spend time looping through each file; for your purposes this would apply to each of your initial files, i.e. in my case all_fi rather than all_fi[2:5].
> rbind.pages(lapply(all_fi[2:5], f.it))
                     file_name file_size
1 inst/data/yob1881/female.txt     16476
2   inst/data/yob1881/male.txt     15306
3 inst/data/yob1882/female.txt     18109
4   inst/data/yob1882/male.txt     16923
5 inst/data/yob1883/female.txt     18537
6   inst/data/yob1883/male.txt     15861
7 inst/data/yob1884/female.txt     20641
8   inst/data/yob1884/male.txt     17300
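For the original Test layout, a minimal adaptation of the question's own loop might look like the sketch below. It assumes the subfolders Test/M1, Test/M2, ... already exist and, following the examples in the question, that the target subfolder is keyed by the number after the underscore (so M1_2 goes into Test/M2):
for (e in dir(path = "Test/", pattern = "\\.dat$", full.names = TRUE, recursive = FALSE)) {
  data <- read.table(e, header = TRUE)
  df <- data[-c(2)]
  out <- split(df, f = df$.imp)
  base <- tools::file_path_sans_ext(basename(e))      # e.g. "M1_2"
  suffix <- sub("^.*_", "", base)                     # e.g. "2"
  subfolder <- file.path("Test", paste0("M", suffix)) # e.g. "Test/M2"
  lapply(names(out), function(z) {
    # put [z] before the extension and write into the subfolder
    write.table(out[[z]],
                file.path(subfolder, paste0(base, "[", z, "].dat")),
                sep = "\t", row.names = FALSE, col.names = FALSE)
  })
}
Note recursive = FALSE here: once the split files live in the subfolders, a recursive dir() would pick them up again on the next run.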
