I'm reading .csv files from several different directories into a nested list. Along the lines of
filenames <- list(a = list.files("/some_dir_1", pattern = "*.csv"), # not a reproducible example but for demonstration purposes
b = list.files("/some_dir_2", pattern = "*.csv"),
c = list.files("/some_dir_3", pattern = "*.csv"))
# creates a nested list of file paths
dat.list <- lapply(filenames, lapply, read.csv)
# creates a nested list of dataframes, with the same structure as filenames
I'd like to name each element after its file path.
This could be done by naming them one by one, e.g.
names(dat.list[["a"]]) <- filenames[["a"]]
or by putting this in a for-loop, but is there a more versatile method? Preferably a tidyverse friendly solution, along the lines of...
filenames %>% lapply(., lapply, read_csv) %>% #some naming call#
Or am I going about this in the wrong way?
Any help would be greatly appreciated, thanks.
Based on the description, we can either use lapply to loop over the sequence of 'filenames', or use a for loop, to change the names of each of the dat.list[[i]] elements
lapply(seq_along(filenames), function(i) setNames(dat.list[[i]], filenames[[i]]))
Or with Map
Map(setNames, dat.list, filenames)
Or
for(i in seq_along(filenames)) names(dat.list[[i]]) <- filenames[[i]]
If we want to use the tidyverse, the purrr equivalent of base R Map would be
library(purrr)
map2(dat.list, filenames, setNames)
NOTE: The for loop modifies 'dat.list' in place, whereas lapply, Map and map2 return a new list, so their result has to be assigned back to 'dat.list' to update it.
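If the names are wanted straight away, they could also be attached at read time, since lapply() carries the names of its input over to its result. A minimal sketch, assuming two throwaway CSV files in a temporary directory (the directory and file names are made up for illustration):

```r
# Hypothetical demo files, written to a temporary directory
tmp <- tempfile("csv_demo_")
dir.create(tmp)
write.csv(data.frame(x = 1:2), file.path(tmp, "a1.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(tmp, "a2.csv"), row.names = FALSE)

# full.names = TRUE returns paths that read.csv() can open from any working directory
filenames <- list(a = list.files(tmp, pattern = "\\.csv$", full.names = TRUE))

# Name each vector of paths by itself; lapply() keeps those names on the result
dat.list <- lapply(filenames, function(f) lapply(setNames(f, f), read.csv))

names(dat.list[["a"]])  # the two file paths
```

With purrr, the inner step would be map(set_names(f), read_csv), which names an unnamed vector by its own values in the same way.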
data
filenames <- list(a = c('a1.csv', 'a2.csv'), b = c('b1.csv', 'b2.csv'))
set.seed(24)
dat.list <- lapply(1:2, function(i) replicate(2, as.data.frame(matrix(sample(1:5, 5*5,
replace = TRUE), 5, 5)), simplify = FALSE))
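As a quick check of the note on assigning back, here is a sketch that rebuilds the example data and applies the Map version. Setting the outer names is an extra step assumed here, because this example dat.list was built from 1:2 and so starts out unnamed.

```r
filenames <- list(a = c("a1.csv", "a2.csv"), b = c("b1.csv", "b2.csv"))
set.seed(24)
dat.list <- lapply(1:2, function(i) {
  replicate(2,
            as.data.frame(matrix(sample(1:5, 5 * 5, replace = TRUE), 5, 5)),
            simplify = FALSE)
})

# Map() returns a new list, so assign it back to update dat.list
dat.list <- Map(setNames, dat.list, filenames)

# Copy the outer names across so the elements can be addressed as "a" and "b"
names(dat.list) <- names(filenames)

names(dat.list[["a"]])
names(dat.list[["b"]])
```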
I have time series data, stored in .txt files under daily subfolders inside monthly folders.
setwd(".../2018/Jan")
parent.folder <-".../2018/Jan"
sub.folders <- list.dirs(parent.folder, recursive=TRUE)[-1] #To read the sub-folders under parent folder
r.scripts <- file.path(sub.folders)
A_2018 <- list()
for (j in seq_along(r.scripts)) {
A_2018[[j]] <- dir(r.scripts[j],"\\.txt$")}
From these .txt files, I removed the ones I don't want to use in further analysis, using the following code.
trim_to_two <- function(x) {
runs = rle(gsub("^L1_\\d{4}_\\d{4}_","",x))
return(cumsum(runs$lengths)[which(runs$lengths > 2)] * -1)
}
A_2018_new <- list()
for (j in seq_along(A_2018)) {
A_2018_new[[j]] <- A_2018[[j]][trim_to_two(A_2018[[j]])]
}
Then I want to row-bind all of the .txt files using a for loop. Before that, I would like to skip some lines in each .txt file and add a new column holding the file name. The following is my code.
for (i in 1:length(A_2018_new)) {
for (j in 1:length(A_2018_new[[i]])){
filename <- paste(str_sub(A_2018_new[[i]][j], 1, 14))
assign(filename, read_tsv(complete_file_name, skip = 14, col_names = FALSE),
)
Y <- r.scripts %>% str_sub(46, 49)
MD <- r.scripts %>% str_sub(58, 61)
HM <- filename %>% str_sub(9, 12)
Turn <- filename %>% str_sub(14, 14)
time_minute <- paste(Y, MD, HM, sep="-")
Map(cbind, filename, SampleID = names(filename))
}
}
But I didn't get my desired output, even after adapting code from other examples. Could anyone help explain what my code is missing?
Your code seems overly complex for what it is doing. However, your problem is not entirely clear (e.g. what is the pattern in your file names that determines what to import and what not?). Here are some pointers that would greatly simplify the code and likely avoid the issue you are having.
Use lapply() or map() from the purrr package to iterate instead of a for loop. The benefit is that it places the different data frames in a list and you don't need to assign multiple data frames into their own objects in the environment. Since you tagged the tidyverse, we'll use the purrr functions.
library(tidyverse)
You could for instance retrieve the txt file paths, using something like
txt_files <- list.files(path = 'data/folder/', pattern = "\\.txt$", full.names = TRUE) # Remove the files you don't want with whatever logic applies
and then use map() with read_tsv() from readr like so:
mydata <- map(txt_files, read_tsv)
Then for your manipulation, you can again use lapply() or map() to apply that manipulation to each data frame. The easiest way is to create a custom function, and then apply it to each data frame:
my_func <- function(df, filename) {
df |>
filter(...) |> # Whatever logic applies here
mutate(filename = filename)
}
and then use map2() to apply this function, iterating through the data and filenames, and then list_rbind() to bind the data frames across the rows.
mydata_output <- map2(mydata, txt_files, my_func) |>
list_rbind()
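As a possible shortcut, naming the vector of files before mapping lets list_rbind() add the filename column itself through its names_to argument. This is a sketch under assumptions: purrr >= 1.0.0 and readr >= 2.0.0 are installed, and two throwaway files stand in for the real data.

```r
library(purrr)
library(readr)

# Hypothetical demo files in a temporary directory
tmp <- tempfile("txt_demo_")
dir.create(tmp)
write_tsv(data.frame(v = 1:2), file.path(tmp, "f1.txt"))
write_tsv(data.frame(v = 3:4), file.path(tmp, "f2.txt"))

txt_files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Name each path by its base name, read every file, then bind the rows;
# names_to turns the list names into a "filename" column
mydata_output <- set_names(txt_files, basename(txt_files)) |>
  map(\(f) read_tsv(f, show_col_types = FALSE)) |>
  list_rbind(names_to = "filename")

mydata_output
```

Any per-file filtering can still go inside the function passed to map().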
I'm working on a project where I want to create a list of tibbles containing data that I read in from Excel. The idea will be to call on the columns of these different tibbles to perform analyses on them. But I'm stuck on how to name tibbles in a for loop with a name that changes based on the for loop variable. I'm not certain I'm going about this the correct way. Here is the code I've got so far.
filenames <- list.files(path = getwd(), pattern = "xlsx")
RawData <- list()
for(i in filenames) {
RawData <- list(i <- tibble(read_xlsx(path = i, col_names = c('time', 'intesity'))))
}
I've also got the issue that, right now, the for loop overwrites RawData on each iteration, but I think that is something I can remedy if I can get the naming convention to work. If there is another method or data structure that would better suit this task, I'm open to suggestions.
Cheers,
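Your code overwrites RawData in each iteration. To add the new tibble to the list instead, assign it by name, RawData[[i]] <- read_xlsx(...), or wrap it in list() when appending, RawData <- c(RawData, list(read_xlsx(...))); a bare c(RawData, read_xlsx(...)) would splice the tibble's columns into the list as separate elements.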
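One detail worth checking when growing a list in a loop: c() treats a data frame as a list of columns, so the tibble has to be wrapped in list() (or assigned by name) to stay in one piece. A minimal sketch, with plain data frames standing in for the read_xlsx() results and a hypothetical file name:

```r
# Stand-ins for what read_xlsx() would return
df1 <- data.frame(time = 1:2, intensity = c(0.1, 0.2))
df2 <- data.frame(time = 3:4, intensity = c(0.3, 0.4))

# Appending the data frame directly splices its columns into the list
spliced <- c(list(), df1)
length(spliced)  # one element per column, not one per data frame

# Wrapping in list() keeps the data frame as a single element
RawData <- list()
RawData <- c(RawData, list(df1))

# Assigning by name also works and labels the element in the same step
RawData[["file2.xlsx"]] <- df2  # hypothetical file name

length(RawData)
```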
A simpler way would be to use lapply instead of a for loop:
RawData <-
lapply(
filenames,
read_xlsx,
col_names = c('time', 'intesity')
)
Here is an approach with map from package purrr
library(tidyverse)
library(readxl) # read_xlsx() is not attached by library(tidyverse)
filenames <- list.files(path = getwd(), pattern = "xlsx")
mylist <- map(filenames, ~ read_xlsx(.x, col_names = c('time', 'intesity'))) %>%
set_names(filenames)
Similar to the answer by @py_b, but this adds a column with the original file name to each element of the list.
filenames <- list.files(path = getwd(), pattern = "xlsx")
Raw_Data <- lapply(filenames, function(x) {
out_tibble <- read_xlsx(path = x, col_names = c('time', 'intesity'))
out_tibble$source_file <- basename(x) # add a column with the excel file name
return(out_tibble)
})
If you want to merge the list of tibbles into one big one, you can use do.call('rbind', Raw_Data).
I'm just getting used to using lapply and I've been trying to figure out how I can use names from a vector to append within filenames I am calling, and naming new dataframes. I understand I can use paste to call the files of interest, but not sure I can create the new dataframes with the _var name appended.
site_list <- c("site_a","site_b", "site_c","site_d")
lapply(site_list,
function(var) {
all_var <- read.csv(paste("I:/Results/",var,"_all.csv"))
tbl_var <- read.csv(paste("I:/Results/",var,"_tbl.csv"))
rsid_var <- read.csv(paste("I:/Results/",var,"_rsid.csv"))
return(var)
})
Generally, with lapply it makes more sense to apply a function to each list element and return a list, in which your results are stored and can be named. Example (edit: use split to process each site's files together):
files <- list.files(path= "I:/Results/", pattern = "site_[abcd]_.*csv", full.names = TRUE)
files <- split(files, gsub(".*site_([abcd]).*", "\\1", files))
processFiles <- function(x){
all <- read.csv(x[grep("_all.csv", x)])
rsid <- read.csv(x[grep("_rsid.csv", x)])
tbl <- read.csv(x[grep("_tbl.csv", x)])
# do more stuff, generate df, return(df)
}
res <- lapply(files, processFiles)
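To see what the split() step produces before any file is read, here is a small sketch on a hand-written vector of paths. The paths are hypothetical and only mimic the naming scheme from the question.

```r
# Hypothetical paths following the site_<letter>_<type>.csv scheme
files <- c("I:/Results/site_a_all.csv",
           "I:/Results/site_a_rsid.csv",
           "I:/Results/site_a_tbl.csv",
           "I:/Results/site_b_all.csv",
           "I:/Results/site_b_rsid.csv",
           "I:/Results/site_b_tbl.csv")

# Extract the site letter from each path and group the paths by it
site <- gsub(".*site_([abcd]).*", "\\1", files)
groups <- split(files, site)

names(groups)   # "a" "b"
groups[["a"]]   # the three files belonging to site a
```

Because lapply() keeps these names, the elements of res end up named by site letter as well.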
I have a bunch of data frames that are named in the same pattern "dfX.csv", where X represents a number from 1 to 67. I loaded them into separate dataframes using the following piece of code:
folder <- mypath
file_list <- list.files(path=folder, pattern="*.csv")
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file.path(folder, file_list[i]), sep = ',', header = TRUE)
)}
What I'm trying to do is merge/rbind them into a single huge dataframe.
for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv)
}
However using that I'm getting an error:
Error: unexpected symbol in:
"for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv"
Any idea what might be causing the issue, and whether there's a simpler way of doing things?
If file_list is a character vector of filenames that have since been loaded into variables in the local environment, then perhaps one of
do.call(rbind.data.frame, mget(ls(pattern = "^df\\d+\\.csv$")))
do.call(rbind.data.frame, mget(paste0("df", seq_along(file_list), ".csv")))
The first assumes anything found (as df*.csv) in R's environment is appropriate to grab. It might not grab them in the correct order, so consider using sort or somehow ordering them yourself.
mget takes a string vector and retrieves the value of the object with each name from the given environment (current, by default), returning a list of values.
do.call(rbind.data.frame, ...) does one call to rbind, which is much faster than calling rbind iteratively.
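The ordering concern can be handled by sorting the object names on their numeric part before calling mget. A self-contained sketch, using a scratch environment and tiny made-up data frames in place of the 67 loaded files:

```r
# Scratch environment with objects named like the loaded files (made-up data)
env <- new.env()
for (i in c(1, 2, 10)) {
  assign(paste0("df", i, ".csv"), data.frame(id = i, value = i * 10), envir = env)
}

# ls() returns the names in alphabetical order, not numeric order
obj_names <- ls(env, pattern = "^df\\d+\\.csv$")

# Reorder by the number embedded in each name
obj_names <- obj_names[order(as.integer(gsub("\\D", "", obj_names)))]

# One rbind call over the whole list
df_main <- do.call(rbind.data.frame, mget(obj_names, envir = env))
df_main
```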
Here I use map() to iterate over your files, reading each one into a list of data frames, and bind_rows() to bind them all together
library(tidyverse)
df <- map(list.files(), read_csv) %>%
bind_rows()
If you have a lot of data (lot of rows), here's a data.table approach that works great:
library(data.table)
basedir <- choose.dir() # directory with all the csv files
file_names <- list.files(path = basedir, pattern= '*.csv', full.names = F, recursive = F)
big_list <- lapply(file_names, function(file_name){
dat <- fread(file = file.path(basedir, file_name), header = T)
# Add a 'filename' column to each data.table to back-track where it was read from
# this is why we set full.names = F in the list.files line above
dat$filename <- sub('\\.csv$', '', file_name) # strip only the literal ".csv" extension
return(dat)
})
big_data <- rbindlist(l = big_list, use.names = T, fill = T)
If you want to read only some of the columns, you can use the select argument of fread; this improves speed because the unselected columns are not read in. Similarly, skip lets you skip a number of rows at the top of the file.
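A minimal sketch of the select argument, assuming data.table is installed and using a small temporary file with made-up columns a, b and c:

```r
library(data.table)

# Hypothetical three-column file written to a temporary location
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(a = 1:3, b = letters[1:3], c = 4:6), tmp)

# Only columns a and c are read; b is never loaded
dat <- fread(file = tmp, select = c("a", "c"))

names(dat)
```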
I have to run a loop based on certain vector values. Some example code and data is shown below:
list_store <- list()
vec <- c(3,2,3)
data_list <- lapply(list(head(mtcars,10), head(mtcars,15), head(mtcars,20), head(mtcars, 9),
head(mtcars,14), head(mtcars,18), head(mtcars,20), head(mtcars,10)),
function(x) rownames_to_column(x))
data_list1 <- lapply(list(head(mtcars,7), head(mtcars,8), head(mtcars,10)), function(x) rownames_to_column(x))
result <- lapply(data_list, function(i){
list_store[[length(list_store) + 1]] <- merge(i, data_list1[[1]], all.y = TRUE)
})
What I want is to merge the first three elements of data_list with the first element of data_list1, the next two with the second, and the final three with the third. My code above merges every element of data_list with the first element of data_list1, but I want the element of data_list1 to change according to vec.
I can have a loop keeping track of i, j and so on to do all the process, but I want to know if there is any efficient way.
We replicate the sequence along 'vec' by the values of 'vec' (giving 1 1 1 2 2 3 3 3) and use that to split 'data_list' into three groups, each a list of data frames. Then Map passes each group together with the corresponding element of 'data_list1'; inside, lapply loops over the group and merges each data frame with that element. Finally, do.call(c, ...) flattens the nested result back into a flat list with the same structure as 'data_list'.
do.call(c,
Map(function(x,y) lapply(x, function(dat)
merge(dat, y, all.y = TRUE)),
split(data_list, rep(seq_along(vec), vec)),
data_list1))
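To make the grouping step easier to follow, here is the same idea on tiny made-up data frames (the key, x and y columns are illustrative stand-ins for the mtcars subsets):

```r
vec <- c(3, 2, 3)

# Group index: element k of vec says how many consecutive items belong to group k
grp <- rep(seq_along(vec), vec)
grp  # 1 1 1 2 2 3 3 3

# Made-up stand-ins: 8 data frames to merge, 3 data frames to merge them with
data_list  <- lapply(1:8, function(i) data.frame(key = 1:2, x = i))
data_list1 <- lapply(1:3, function(j) data.frame(key = 1:2, y = j * 100))

# Split into groups of sizes 3, 2 and 3
parts <- split(data_list, grp)

# Merge every data frame in group k with data_list1[[k]], then flatten
merged <- do.call(c, Map(function(x, y) lapply(x, merge, y = y, all.y = TRUE),
                         parts, data_list1))

length(merged)  # 8, one result per element of data_list
```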