I'm processing some .xlsx files. They are named like time1_drug1, time1_drug2, and so on, up to time6_drug5 (30 files in total). I want to load these .xlsx files into R and name the resulting datasets t1d1, t2d2, etc.
I tried to use sprintf, but I cannot figure out how to make it valid.
for(i in 1:6) {
  for(j in 1:5) {
    sprintf("time%i","drug%j,i,j)=read.xlsx("/Users/pathway/dataset/time_sprintf(%i,i)_drug(%j,j).xlsx", 1)}
    names(sprintf("t%i","d%j,i,j))=c("result", "testF","TestN")
    sprintf("t%i","d%j,i,j)$Discription[which(sprintf("t%i","d%j,i,j)$testF>=1&sprintf("t%i","d%j,i,j)$TestN>=2)]="High+High"
  }
}
I expect to get 30 datasets, t1d1 through t6d5.
You should (almost) never use assign. When reading multiple files into R you should (almost) always put them in a named list.
A rough outline of a much better approach is this:
# Put all the Excel files in one directory; this retrieves all their paths
f <- dir("/Users/pathway/dataset/", pattern = "\\.xlsx$", full.names = TRUE)
# Read all files into a list (read.xlsx from the openxlsx or xlsx package)
drug_time <- lapply(X = f, FUN = read.xlsx)
# Name each list element based on the file name
names(drug_time) <- gsub(pattern = ".xlsx", replacement = "", x = basename(f), fixed = TRUE)
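Once everything is in a named list, the per-dataset steps from the question (renaming the columns and flagging rows) can be applied to all 30 data frames at once. A sketch, assuming the three column names and the High+High rule from the question's code:

```r
# Apply the question's clean-up to every element of the named list.
# The column names and thresholds are taken from the question's code.
drug_time <- lapply(drug_time, function(d) {
  names(d) <- c("result", "testF", "TestN")
  d$Description <- ifelse(d$testF >= 1 & d$TestN >= 2, "High+High", NA)
  d
})

# Look up one dataset by name instead of a free-floating t1d1 variable:
drug_time[["time1_drug1"]]
```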
You can keep the for loop as you have it, but you need to use the assign function:
for(i in 1:6){
  for(j in 1:5){
    assign(paste0('t', i, 'd', j),
           read.xlsx(paste0("/Users/pathway/dataset/time", i, "_drug", j, ".xlsx"), 1))
  }
}
I have time series data, stored in .txt files under daily subfolders inside monthly folders.
setwd(".../2018/Jan")
parent.folder <- ".../2018/Jan"
# To read the sub-folders under the parent folder
sub.folders <- list.dirs(parent.folder, recursive = TRUE)[-1]
r.scripts <- file.path(sub.folders)
A_2018 <- list()
for (j in seq_along(r.scripts)) {
  A_2018[[j]] <- dir(r.scripts[j], "\\.txt$")
}
From these .txt files, I removed some that I don't want to use in the further analysis, using the following code.
trim_to_two <- function(x) {
  runs <- rle(gsub("^L1_\\d{4}_\\d{4}_", "", x))
  return(cumsum(runs$lengths)[which(runs$lengths > 2)] * -1)
}
A_2018_new <- list()
for (j in seq_along(A_2018)) {
  A_2018_new[[j]] <- A_2018[[j]][trim_to_two(A_2018[[j]])]
}
Then I want to row-bind all the .txt files with a for loop. Before that, I would like to remove some lines in each .txt file and add one new column containing the file name. The following is my code.
for (i in 1:length(A_2018_new)) {
  for (j in 1:length(A_2018_new[[i]])) {
    filename <- paste(str_sub(A_2018_new[[i]][j], 1, 14))
    assign(filename, read_tsv(complete_file_name, skip = 14, col_names = FALSE))
    Y <- r.scripts %>% str_sub(46, 49)
    MD <- r.scripts %>% str_sub(58, 61)
    HM <- filename %>% str_sub(9, 12)
    Turn <- filename %>% str_sub(14, 14)
    time_minute <- paste(Y, MD, HM, sep = "-")
    Map(cbind, filename, SampleID = names(filename))
  }
}
But I didn't get my desired output. I tried coding from other examples. Could anyone explain what my code is missing?
Your code seems overly complex for what it does. Your problem is, however, not 100% clear (e.g. what pattern in your file names determines what to import and what not?). Here are some pointers that would greatly simplify the code and likely avoid the issue you are having.
Use lapply() or map() from the purrr package to iterate instead of a for loop. The benefit is that it places the different data frames in a list and you don't need to assign multiple data frames into their own objects in the environment. Since you tagged the tidyverse, we'll use the purrr functions.
library(tidyverse)
You could for instance retrieve the txt file paths, using something like
txt_files <- list.files(path = 'data/folder/', pattern = "txt$", full.names = TRUE) # Then remove the files you don't want, with whatever logic applies
and then use map() with read_tsv() from readr like so:
mydata <- map(txt_files, read_tsv)
Then for your manipulation, you can again use lapply() or map() to apply that manipulation to each data frame. The easiest way is to create a custom function, and then apply it to each data frame:
my_func <- function(df, filename) {
df |>
filter(...) |> # Whatever logic applies here
mutate(filename = filename)
}
and then use map2() to apply this function, iterating through the data and filenames, and then list_rbind() to bind the data frames across the rows.
mydata_output <- map2(mydata, txt_files, my_func) |>
list_rbind()
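If you would rather stay in base R, the same read-then-bind pattern with a file-name column can be sketched with Map() and do.call(); txt_files is the path vector from above, and read.delim stands in for whatever reader fits your files:

```r
# Base-R equivalent of the map2()/list_rbind() pipeline: read each file,
# tag each data frame with its file name, then bind all rows together.
named <- setNames(lapply(txt_files, read.delim), basename(txt_files))
tagged <- Map(function(df, nm) transform(df, filename = nm), named, names(named))
mydata_output <- do.call(rbind, tagged)
```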
I have created an R script that analyses and manipulates two different data file extensions. For example, one task is to extract certain values from the data and export them as a .txt file. Here is the beginning of my script and the data files that I use:
setwd('C:\\Users\\Zack\\Documents\\RScripts\***')
heat_data="***.heat"
time ="***.timestamp"
ts_heat = read.table(heat_data)
ts_heat = ts_heat[-1,]
rownames(ts_heat) <- NULL
ts_time = read.table(time)
back_heat = subset(ts_heat, V3 == 'H')
back_time = ts_time$V1
library(data.table)
datDT[, newcol := fcoalesce(
nafill(fifelse(track == "H", back_time, NA_real_), type = "locf"),
0)]
last_heat = subset(ts_heat, V3 == 'H')
last_time = last_heat$newcol
x = back_time - last_time
dataestimation = data.frame(back_time, x)
write_tsv(dataestimation, file = "dataestimation.txt")
I then use those two files to calculate and extract specific values.
So can anyone please tell me how I can run this script on each and every .heat and .timestamp file?
My objective is to calculate and extract these values for each file. I note that each dataset comes as a
.heat and a .timestamp file. I note also that I am a Windows user.
Thank you for your help.
You can use list.files:
heat_data <- list.files(pattern = "\\.heat$")
time <- list.files(pattern = "\\.timestamp$")
and then process each file in a loop (or use lapply)
for (i in heat_data) {
  h <- read.table(i)
  # other code
}
for (j in time) {
  t <- read.table(j)
  # other code
}
You may also want to pass the path to list.files instead of using setwd:
heat_data <- list.files("your/path/", pattern = "\\.heat$")
After the question was edited:
Let's say you have 3 .heat files and 3 .timestamp files in your path named
1.heat
2.heat
3.heat
1.timestamp
2.timestamp
3.timestamp
so there is a correspondence between heat and timestamp (given by the file name).
You can read these files with
heat_data <- list.files("your/path/", pattern = "\\.heat$", full.names = TRUE)
time <- list.files("your/path/", pattern = "\\.timestamp$", full.names = TRUE)
At this point, create a function that does exactly what you want. This function takes as input an index and the two path vectors:
process_files <- function(i, heat_data, time) {
  ts_heat <- read.table(heat_data[i])
  ts_time <- read.table(time[i])
  #
  # other code
  #
  write_tsv(dataestimation, file = paste0("dataestimation_", i, ".txt"))
}
This way you will have files named dataestimation_1.txt, dataestimation_2.txt and dataestimation_3.txt.
Finally, use lapply to call the function for all files in the folder:
lapply(1:3, process_files, heat_data = heat_data, time = time)
This is just one of the possible ways to proceed.
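The index-based lapply can also be written as a single Map() over the two sorted path vectors, which avoids the manual indexing. A sketch, with the path and the middle of the function left as placeholders, as in the answer:

```r
# Pair each .heat file with its .timestamp file by position after sorting,
# so 1.heat lines up with 1.timestamp, and process every pair.
heat_data <- sort(list.files("your/path/", pattern = "\\.heat$", full.names = TRUE))
time      <- sort(list.files("your/path/", pattern = "\\.timestamp$", full.names = TRUE))
# Sanity check: the base names must match pairwise
stopifnot(sub("\\.heat$", "", basename(heat_data)) ==
          sub("\\.timestamp$", "", basename(time)))
results <- Map(function(h, t) {
  ts_heat <- read.table(h)
  ts_time <- read.table(t)
  # other code
}, heat_data, time)
```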
I had about 200 different files (all of them big matrices, 465x1080, which is huge for me). I then used cbind2 to combine them all into one bigger matrix (465x200000).
I did that because I needed to create one separate file for each row (465 files), and I thought it would be easier for R to load the data into memory only ONCE and then just read it line by line, creating a separate file for each row, instead of opening and closing 200 different files for every row.
Is this really the fastest way? (I am wondering because it is now taking quite a long time.) The Windows Task Manager shows the RAM used by R going from 700MB to 1GB and back to 700MB all the time (twice every second). It seems the main file wasn't loaded just once but is being loaded and erased from memory on every iteration (which could be why it is a bit slow?).
I am a beginner, so all of this might not make any sense.
Here is my code (those +1 and -1 are because the original data has 1 extra column that I don't need in the new files):
extractStationData <- function(OriginalData, OutputName = "BCN-St") {
  for (i in 1:nrow(OriginalData)) {
    OutputData <- matrix(NA, nrow = ncol(OriginalData) - 1, 3)
    colnames(OutputData) <- c("Time", "Bikes", "Slots")
    for (j in 1:(ncol(OriginalData) - 1)) {
      OutputData[j, 1] <- colnames(OriginalData)[j + 1]
      OutputData[j, 2] <- OriginalData[i, j + 1]
    }
    write.table(OutputData, file = paste(OutputName, i, ".txt", sep = ""))
    print(i)
  }
}
Any thoughts? Maybe I should just create an object (the huge file) before the first for loop, so it would be loaded just once?
Thanks in advance.
Let's assume you have already created the 465x200000 matrix and that only the extractStationData function is in question. Then we can modify it, for example, like this:
require(data.table)
extractStationData <- function(d, OutputName = "BCN-St") {
  d2 <- d[, -1] # remove the column you do not need
  # create the empty matrix outside the loop:
  emptyMat <- matrix(NA, nrow = ncol(d2), ncol = 3)
  colnames(emptyMat) <- c("Time", "Bikes", "Slots")
  emptyMat[, 1] <- colnames(d2)
  for (i in 1:nrow(d2)) {
    OutputData <- emptyMat
    OutputData[, 2] <- d2[i, ]
    fwrite(OutputData, file = paste(OutputName, i, ".txt", sep = "")) # use fwrite for speed
  }
}
V2:
If your OriginalData is in matrix format, this approach for creating the list of new data.tables looks quite fast:
extractStationData2 <- function(d, OutputName = "BCN-St") {
  d2 <- d[, -1] # remove the column you don't need
  ds <- split(d2, 1:nrow(d2))
  r <- lapply(ds, function(x) {
    k <- data.table(colnames(d2), x, NA)
    setnames(k, c("Time", "Bikes", "Slots"))
    k
  })
  r
}
dl <- extractStationData2(d) # list of new data objects
# write to files:
OutputName <- "BCN-St"
for (i in seq_along(dl)) {
  fwrite(dl[[i]], file = paste(OutputName, i, ".txt", sep = ""))
}
It should also work for a data.frame, with minor changes:
k <- data.table(colnames(d2), t(x), NA)
I have a number of TIFF files (each belonging to an image date) in one folder and want to make a list for each unique date, then populate those lists with the appropriate files. Ideally, I'd like a function where a user would only change the list of dates, but I haven't been able to write a function that loops through my list of dates. Instead, I tried to make a function and run it for each unique date.
dates <- list('20180420', '20180522', '20180623', '20180725', '20180810')
# Make a list of all files in the data directory
allFilesDataDirectory <- list.files(path = dataDirectory, pattern = 'TIF$')
# allFilesDataDirectory is a list of 60 TIFF files with the same naming convention along the lines of LC08_L1TP_038037_20180810_20180815_01_T1_B9
allDateLists <- NULL
for (d in dates){
  fileFolderDate <- NULL
  dynamicDateNames <- paste0('fileListL8', d)
  assign(dynamicDateNames, fileFolderDate)
  allDateLists <- c(allDateLists, dynamicDateNames)
}
myFunction <- function(date, fileNameList){
  # files first
  for (i in allFilesDataDirectory){
    # Create a list out of the file name by splitting up the name wherever there is a _ in the name
    splitFileName <- unlist(strsplit(i, "[_]"))
    if (grepl(splitFileName[4], date) & grepl('B', splitFileName[8])){
      fileNameList <- c(fileNameList, i)
      print(i)
    } else {
      print('no')
    }
  }
}
myFunction(date = '20180623', fileNameList = 'fileListL820180623')
The function runs, but fileListL820180623 is NULL.
When hard-coding this, everything works, and I am not sure what the difference is. I tried using assign() (not shown here), but it did nothing.
for (i in allFilesDataDirectory){
  # Create a list out of the file name by splitting up the name wherever there is a _ in the name
  splitFileName <- unlist(strsplit(i, "[_]"))
  if (grepl(splitFileName[4], '20180623') & grepl('B', splitFileName[8])){
    fileListL820180623 <<- c(fileListL820180623, i)
  } else {
    print('no')
  }
}
For some reason grepl wasn't working well in this case, but glob2rx worked great.
dates <- list('20180420', '20180522', '20180623', '20180725', '20180810')
for (d in dates){
  listLandsatFiles <- list.files(path = dataDirectory, pattern = glob2rx(paste0('*', d, '*B*TIF')))
  dynamicFileListName <- paste0('fileListL8', d)
  assign(dynamicFileListName, listLandsatFiles)
}
P.S. This might be helpful if you have multiple Landsat images saved in one directory and want to make lists of only the TIFF files by image date (and perhaps make a raster brick later on).
I am not exactly sure what you want to achieve, but it seems you are making it too difficult, and you are making poor choices with the shortcuts `<<-` and assign (there are very few cases where their use is warranted).
I would suggest something along these lines:
getTiffPattern <- function(pattern = '', folder = '.') {
  ff <- list.files(folder, pattern = pattern, full.names = TRUE)
  grep('\\.tif$', ff, ignore.case = TRUE, value = TRUE)
}
getTiffPattern('20180420')
Or for a vector of dates
dates <- list('20180420', '20180522', '20180623', '20180725', '20180810')
x <- lapply(dates, getTiffPattern)
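Naming the resulting list after the dates gives the same fileListL8... lookup the question was after, without assign. A small sketch building on getTiffPattern above:

```r
# One named list instead of five separate fileListL8* variables
dates <- c('20180420', '20180522', '20180623', '20180725', '20180810')
tiffsByDate <- setNames(lapply(dates, getTiffPattern), paste0('fileListL8', dates))
tiffsByDate[["fileListL820180623"]]  # the files for one date
```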
I am trying to analyze 10 sets of data, for which I have to import the data, remove some values, and plot histograms. I could do it individually but could naturally save a lot of time with a for loop. I know this code is not correct, but I have no idea how to specify the names of the input files or how to name each iterated variable in R.
par(mfrow = c(10, 1))
for (i in 1:10) {
  freqi <- read.delim("freqspeci.frq", sep = "\t", row.names = NULL)
  freqveci <- freqi$N_CHR
  freqveci <- freqveci[freqveci != 0 & freqveci != 1]
  hist(freqveci)
}
What I want is to have the counter number in place of every "i" in my code. Am I just approaching this the wrong way in R? I have read about the assign and paste functions, but honestly do not understand how to apply them properly to this particular problem.
You can do it in several ways:
Use list.files() to get all the files in a given directory. You can use a regular expression as well. See here.
If the names are consecutive, then you can use
for (i in 1:10) {
  filename <- sprintf("freqspec%s.frq", i)
  freqi <- read.delim(filename, sep = "\t", row.names = NULL)
  freqveci <- freqi$N_CHR
  freqveci <- freqveci[freqveci != 0 & freqveci != 1]
  hist(freqveci)
}
You can also use paste() to create the file names.
paste("filename", 1:10, sep='_')
You could just save all your data files into an otherwise empty folder, then get the file names like this:
filenames <- dir()
for (i in 1:length(filenames)){
  freqi <- read.delim(filenames[i], sep = "\t", row.names = NULL)
  # and here whatever else you want to do with these files
}