Using lapply variable in read.csv - r

I'm just getting used to using lapply, and I've been trying to figure out how to use names from a vector both inside the filenames I read and in the names of the new dataframes. I understand I can use paste to build the filenames of interest, but I'm not sure how to create the new dataframes with the _var name appended.
site_list <- c("site_a","site_b", "site_c","site_d")
lapply(site_list,
function(var) {
all_var <- read.csv(paste0("I:/Results/",var,"_all.csv"))
tbl_var <- read.csv(paste0("I:/Results/",var,"_tbl.csv"))
rsid_var <- read.csv(paste0("I:/Results/",var,"_rsid.csv"))
return(var)
})

With lapply, it usually makes more sense to have the function return a value for each element than to create variables inside it. The results then come back as a list in which your data is stored and whose elements can be named. Example (edit: use split to process each site's files together):
files <- list.files(path= "I:/Results/", pattern = "site_[abcd]_.*csv", full.names = TRUE)
files <- split(files, gsub(".*site_([abcd]).*", "\\1", files))
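processFiles <- function(x){
  # x holds the file paths of one site; fixed = TRUE matches the suffix literally
  all <- read.csv(x[grep("_all.csv", x, fixed = TRUE)])
  rsid <- read.csv(x[grep("_rsid.csv", x, fixed = TRUE)])
  tbl <- read.csv(x[grep("_tbl.csv", x, fixed = TRUE)])
  # do more stuff, generate df, return(df)
  # until then, return the three data frames as a named list
  list(all = all, rsid = rsid, tbl = tbl)
}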
res <- lapply(files, processFiles)
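The same idea can also be driven directly from site_list. The following is only a minimal sketch, assuming the file naming from the question (site_a_all.csv and so on); the temporary directory, the toy data, and the helper name read_site are made up for illustration.

```r
# Hypothetical setup: write small stand-in files into a temporary directory
results_dir <- tempdir()
site_list <- c("site_a", "site_b")
kinds <- c("all", "tbl", "rsid")
for (site in site_list) {
  for (kind in kinds) {
    write.csv(data.frame(site = site, kind = kind, value = 1:3),
              file.path(results_dir, paste0(site, "_", kind, ".csv")),
              row.names = FALSE)
  }
}

# read_site() is an illustrative helper: it returns a named list holding
# the three data frames that belong to one site
read_site <- function(site) {
  paths <- file.path(results_dir, paste0(site, "_", kinds, ".csv"))
  setNames(lapply(paths, read.csv), kinds)
}

# Name the outer list by site so elements can be reached as res$site_a$all
res <- setNames(lapply(site_list, read_site), site_list)
str(res$site_a$all)
```

Keeping everything in one nested list avoids creating objects such as all_site_a by hand; the name lookup res[["site_a"]][["all"]] plays that role instead.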

Related

How to change column names of many dataframes in R?

I would like to make the same changes to the column names of many dataframes. Here's an example:
library(stringr) # provides str_replace_all()
ChangeNames <- function(x) {
colnames(x) <- toupper(colnames(x))
colnames(x) <- str_replace_all(colnames(x), pattern = "_", replacement = ".")
return(x)
}
files <- list(mtcars, nycflights13::flights, nycflights13::airports)
lapply(files, ChangeNames)
I know that lapply only changes a copy. How do I change the underlying dataframe? I want to still use each dataframe separately.
Create a named list, apply the function and use list2env to reflect those changes in the original dataframes.
library(nycflights13)
files <- dplyr::lst(mtcars, flights, airports)
result <- lapply(files, ChangeNames)
list2env(result, .GlobalEnv)
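To see what list2env does without touching the workspace, here is a small base-R sketch. The toy data frames, the scratch environment, and the function name change_names are assumptions for illustration; gsub stands in for str_replace_all so that no packages are needed.

```r
# Base-R variant of the renaming function (gsub instead of stringr)
change_names <- function(x) {
  colnames(x) <- gsub("_", ".", toupper(colnames(x)), fixed = TRUE)
  x
}

# Two toy data frames standing in for the real ones
df_one <- data.frame(col_a = 1:2, col_b = 3:4)
df_two <- data.frame(some_id = 1:3)

# A named list is required: list2env() uses the names as object names
dfs <- list(df_one = df_one, df_two = df_two)
renamed <- lapply(dfs, change_names)

# Write into a scratch environment here; passing .GlobalEnv instead
# would overwrite the original objects of the same names
scratch <- new.env()
list2env(renamed, envir = scratch)
names(scratch$df_one)
```

The originals df_one and df_two are left unchanged in this sketch, because the renamed copies are written to scratch rather than to the environment the originals live in.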

Create a list of tibbles with unique names using a for loop

I'm working on a project where I want to create a list of tibbles containing data that I read in from Excel. The idea will be to call on the columns of these different tibbles to perform analyses on them. But I'm stuck on how to name tibbles in a for loop with a name that changes based on the for loop variable. I'm not certain I'm going about this the correct way. Here is the code I've got so far.
filenames <- list.files(path = getwd(), pattern = "xlsx")
RawData <- list()
for(i in filenames) {
RawData <- list(i <- tibble(read_xlsx(path = i, col_names = c('time', 'intesity'))))
}
I've also got the issue where, right now, the for loop overwrites RawData with each turn of the loop, but I think that is something I can remedy if I can get the naming convention to work. If there is another method or data structure that would better suit this task, I'm open to suggestions.
Cheers,
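Your code overwrites RawData in each iteration. To add the new tibble to the list instead, append it as a single list element, for example RawData <- c(RawData, list(read_xlsx(...))) or RawData[[i]] <- read_xlsx(...). Without the list() wrapper, c() would splice the tibble's columns into RawData as separate elements.
A simpler way would be to use lapply instead of a for loop: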
RawData <-
lapply(
filenames,
read_xlsx,
col_names = c('time', 'intesity')
)
Here is an approach with map from the purrr package:
library(tidyverse)
library(readxl) # read_xlsx(); readxl is not attached by library(tidyverse)
filenames <- list.files(path = getwd(), pattern = "xlsx")
mylist <- map(filenames, ~ read_xlsx(.x, col_names = c('time', 'intesity'))) %>%
  set_names(filenames)
Similar to the answer by @py_b, but this adds a column with the original file name to each element of the list.
filenames <- list.files(path = getwd(), pattern = "xlsx")
Raw_Data <- lapply(filenames, function(x) {
out_tibble <- read_xlsx(path = x, col_names = c('time', 'intesity'))
out_tibble$source_file <- basename(x) # add a column with the excel file name
return(out_tibble)
})
If you want to merge the list of tibbles into one big one you can use do.call('rbind', Raw_Data)
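If you prefer to keep the for loop, the list elements can be named as they are added. This is a rough sketch only: CSV files written to a temporary folder stand in for the Excel files so that it runs without readxl, and the folder name, file names, and toy values are all made up.

```r
# Stand-in files: CSV is used here so the sketch needs no extra packages
data_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(data_dir, showWarnings = FALSE)
for (nm in c("run1", "run2")) {
  write.csv(data.frame(time = 1:3, intensity = c(0.1, 0.2, 0.3)),
            file.path(data_dir, paste0(nm, ".csv")), row.names = FALSE)
}

filenames <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

raw_data <- list()
for (f in filenames) {
  # Element name = file name without directory or extension
  key <- tools::file_path_sans_ext(basename(f))
  raw_data[[key]] <- read.csv(f)  # [[<- adds one named element per file
}

names(raw_data)
raw_data$run1$time
```

With real Excel files, read.csv(f) would be swapped for the read_xlsx call from the question; the naming logic stays the same.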

Rbind multiple Data Frames in a loop

I have a bunch of data frames that are named in the same pattern "dfX.csv", where X represents a number from 1 to 67. I loaded them into separate dataframes using the following piece of code:
folder <- mypath
file_list <- list.files(path=folder, pattern="*.csv")
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file.path(folder, file_list[i]), sep=',', header=TRUE)
)}
What I'm trying to do is merge/rbind them into a single huge dataframe.
for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv)
}
However using that I'm getting an error:
Error: unexpected symbol in:
"for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv"
Any idea what might be causing the issue, and whether there's a simpler way of doing things?
If file_list is a character vector of filenames that have since been loaded into variables in the local environment, then perhaps one of:
do.call(rbind.data.frame, mget(ls(pattern = "^df\\d+\\.csv")))
do.call(rbind.data.frame, mget(paste0("df", seq_along(file_list), ".csv")))
The first assumes anything found (as df*.csv) in R's environment is appropriate to grab. It might not grab them in the correct order, so consider using sort or somehow ordering them yourself.
mget takes a string vector and retrieves the value of the object with each name from the given environment (current, by default), returning a list of values.
do.call(rbind.data.frame, ...) does one call to rbind, which is much much faster than iteratively rbinding.
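The ordering caveat can be handled by sorting on the number embedded in each name. Below is a minimal sketch under assumed toy data: the scratch environment env and the three small data frames are hypothetical stand-ins for the objects created by assign in the question.

```r
# Toy objects named like the question's variables, kept in a scratch
# environment so the sketch does not touch the workspace
env <- new.env()
for (i in c(1, 2, 10)) {
  assign(paste0("df", i, ".csv"), data.frame(id = i, x = i * 2), envir = env)
}

# ls() sorts names as text, so "df10.csv" can end up ahead of "df2.csv"
obj_names <- ls(env, pattern = "^df\\d+\\.csv$")

# Reorder by the number embedded in each name
num <- as.integer(gsub("\\D", "", obj_names))
obj_names <- obj_names[order(num)]

# unname() keeps rbind from building row names out of the object names
df_main <- do.call(rbind, unname(mget(obj_names, envir = env)))
df_main$id
```

In the question's setting the objects live in the current environment, so the envir arguments would simply be dropped.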
Here I use map() to iterate over your files, reading each one into a list of data frames, and bind_rows() to bind them all together:
library(tidyverse)
df <- map(list.files(pattern = "\\.csv$"), read_csv) %>%
  bind_rows()
If you have a lot of data (lot of rows), here's a data.table approach that works great:
library(data.table)
basedir <- choose.dir() # directory with all the csv files (choose.dir() is Windows-only)
file_names <- list.files(path = basedir, pattern = '\\.csv$', full.names = FALSE, recursive = FALSE)
big_list <- lapply(file_names, function(file_name){
  dat <- fread(file = file.path(basedir, file_name), header = TRUE)
  # Add a 'filename' column to each data.table to back-track where it was read from
  # this is why we set full.names = FALSE in the list.files line above
  dat$filename <- sub('\\.csv$', '', file_name)
  return(dat)
})
big_data <- rbindlist(l = big_list, use.names = TRUE, fill = TRUE)
If you want to read only some columns and not all, you can use the select argument in fread. This helps improve speed, since the columns you leave out are not read in. Similarly, skip lets you skip reading in a bunch of rows.

read.csv into nested list and set element names

I'm reading .csv files from several different directories into a nested list. Along the lines of
filenames <- list(a = list.files("/some_dir_1", pattern = "*.csv"), # not a reproducible example but for demonstration purposes
b = list.files("/some_dir_2", pattern = "*.csv"),
c = list.files("/some_dir_3", pattern = "*.csv"))
# creates a nested of list of file paths
dat.list <- lapply(filenames, lapply, read.csv)
# creates a nested list of dataframes, with the same structure as filenames
I'd like to name each element with their file path.
This could be done by naming them one by one, e.g.
names(dat.list[["a"]]) <- filenames[["a"]]
or by putting this in a for-loop, but is there a more versatile method? Preferably a tidyverse friendly solution, along the lines of...
filenames %>% lapply(., lapply, read_csv) %>% #some naming call#
Or am I going about this in the wrong way?
Any help would be greatly appreciated, thanks.
Based on the description, we can either use lapply to loop over the sequence of 'filenames' or use a for loop to change the names of each of the dat.list[[i]] elements:
lapply(seq_along(filenames), function(i) setNames(dat.list[[i]], filenames[[i]]))
Or with Map
Map(setNames, dat.list, filenames)
Or
for(i in seq_along(filenames)) names(dat.list[[i]]) <- filenames[[i]]
If we want to use tidyverse, the equivalent option based on base R Map would be
library(purrr)
map2(dat.list, filenames, setNames)
NOTE: The for loop modifies the original 'dat.list' in place, whereas the output of lapply, Map, or map2 has to be assigned back to 'dat.list' to update it.
data
filenames <- list(a = c('a1.csv', 'a2.csv'), b = c('b1.csv', 'b2.csv'))
set.seed(24)
dat.list <- lapply(1:2, function(i) replicate(2, as.data.frame(matrix(sample(1:5, 5*5,
replace = TRUE), 5, 5)), simplify = FALSE))
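As a quick check of the Map approach, here is a self-contained sketch. The nested toy list is an assumption standing in for the read.csv output; unlike the data above, the outer list is named a and b so that the group names carry through.

```r
# Toy nested structure mirroring the question: file names per group ...
filenames <- list(a = c("a1.csv", "a2.csv"), b = c("b1.csv", "b2.csv"))

# ... and a same-shaped nested list of data frames (stand-ins for read.csv output)
dat.list <- lapply(filenames, function(paths) {
  lapply(paths, function(p) data.frame(source = p, value = 1:2))
})

# Map() walks both lists in parallel; the result must be assigned back
dat.list <- Map(setNames, dat.list, filenames)

names(dat.list)
names(dat.list$a)
```

Each data frame can then be reached by group and file name, for example dat.list$a[["a1.csv"]].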

Subset multiple dataframes in a loop in R

I am trying to drop columns from over 20 data frames that I have imported. However, I'm getting errors when I try to iterate through all of these files. I'm able to drop when I hard code the individual file name, but as soon as I try to loop through all of the files, I have errors. Here's the code:
path <- "C://Home/Data/"
files <- list.files(path=path, pattern="^file.*\\.csv$")
for(i in 1:length(files))
{
perpos <- which(strsplit(files[i], "")[[1]]==".")
assign(
gsub(" ","",substr(files[i], 1, perpos-1)),
read.csv(paste(path,files[i],sep="")))
}
mycols <- c("test", "trialruns", "practice")
`file01` = `file01`[,!(names(`file01`) %in% mycols)]
So, the above will work and drop those three columns from file01. However, I can't iterate through file02 to file20 and drop the columns from all of them. Any ideas? Thank you so much!
As @zx8754 mentions, consider lapply(), which keeps all dataframes in one compiled list instead of multiple objects in your environment (the code below also shows how to output individual dfs from the list):
path <- "C://Home/Data/"
files <- list.files(path=path, pattern="^file.*\\.csv$")
mycols <- c("test", "trialruns", "practice")
# READ IN ALL FILES AND DROP THE UNWANTED COLUMNS
dfList <- lapply(files, function(f) {
  df <- read.csv(paste0(path, f))
  df[, !(names(df) %in% mycols), drop = FALSE]
})
# SET NAMES TO EACH DF ELEMENT
dfList <- setNames(dfList, sub("\\.csv$", "", files))
# IN CASE YOU REALLY NEED INDIVIDUAL DFs
list2env(dfList, envir=.GlobalEnv)
# IN CASE YOU NEED TO APPEND ALL DFs
finaldf <- do.call(rbind, dfList)
# TO RETRIEVE FIRST DF
dfList[[1]] # OR dfList$file01
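The column-dropping step can be tried without any files on disk. The sketch below uses two made-up data frames in place of the imported ones; only mycols comes from the question.

```r
# Toy data frames standing in for the imported files
dfList <- list(
  file01 = data.frame(id = 1:2, test = 0, trialruns = 1, score = c(5, 6)),
  file02 = data.frame(id = 3:4, practice = 9, score = c(7, 8))
)
mycols <- c("test", "trialruns", "practice")

# Keep every column whose name is NOT in mycols; drop = FALSE guards
# against a single remaining column collapsing to a vector
dfList <- lapply(dfList, function(df) {
  df[, !(names(df) %in% mycols), drop = FALSE]
})

lapply(dfList, names)
```

Because the test uses %in%, a data frame that lacks some of the columns in mycols is handled without an error.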
