Using a function to repeat function over a collection of filenames - r

I would like to write a function to repeat a chunk of code over a collection of file names (names of files present in my working directory) in r. I would also like to save the outputs of each line as a global environment object if possible. General structure of what I am trying to do is given below. Function name is made up.
global_environment_object_1 <- anyfunction("filename and extension").
# Repeat this over a set of filenames in the working directory and save each as a separate
# global environment object with separate names.
A Real life example can be:
sds22 <- get_subdatasets("EVI_2017_June_2.hdf")
sds23 <- get_subdatasets("EVI_2016_June_1.hdf")
-where object names and file names are changing and the total number of files is 48.
Thanks for the help in advance!

Try using :
#Get all the filenames
all_files <- list.files(full.names = TRUE)
#Probably to be specific
#all_files <- list.files(patter = "\\.hdf$", full.names = TRUE)
#apply get_subdatasets to each one of them
all_data <- lapply(all_files, get_subdatasets)
#Assign name to list output
names(all_data) <- paste0('sds', seq_along(all_data))
#Get the data in global environment
list2env(all_data, .GlobalEnv)

Related

Best way to import multiple .csv files into separate data frames? lapply() [duplicate]

Suppose we have files file1.csv, file2.csv, ... , and file100.csv in directory C:\R\Data and we want to read them all into separate data frames (e.g. file1, file2, ... , and file100).
The reason for this is that, despite having similar names they have different file structures, so it is not that useful to have them in a list.
I could use lapply but that returns a single list containing 100 data frames. Instead I want these data frames in the Global Environment.
How do I read multiple files directly into the global environment? Or, alternatively, How do I unpack the contents of a list of data frames into it?
Thank you all for replying.
For completeness here is my final answer for loading any number of (tab) delimited files, in this case with 6 columns of data each where column 1 is characters, 2 is factor, and remainder numeric:
##Read files named xyz1111.csv, xyz2222.csv, etc.
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
##Create list of data frame names without the ".csv" part
names <-substr(filenames,1,7)
###Load all files
for(i in names){
filepath <- file.path("../Data/original_data/",paste(i,".csv",sep=""))
assign(i, read.delim(filepath,
colClasses=c("character","factor",rep("numeric",4)),
sep = "\t"))
}
Quick draft, untested:
Use list.files() aka dir() to dynamically generate your list of files.
This returns a vector, just run along the vector in a for loop.
Read the i-th file, then use assign() to place the content into a new variable file_i
That should do the trick for you.
Use assign with a character variable containing the desired name of your data frame.
for(i in 1:100)
{
oname = paste("file", i, sep="")
assign(oname, read.csv(paste(oname, ".txt", sep="")))
}
This answer is intended as a more useful complement to Hadley's answer.
While the OP specifically wanted each file read into their R workspace as a separate object, many other people naively landing on this question may think that that's what they want to do, when in fact they'd be better off reading the files into a single list of data frames.
So for the record, here's how you might do that.
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files("path/to/files")
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files,read.csv,...)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",
list.files("path/to/files",full.names = FALSE),
fixed = TRUE)
Now any of the files can be referred to by my_files[["filename"]], which really isn't much worse that just having separate filename variables in your workspace, and often it is much more convenient.
Here is a way to unpack a list of data.frames using just lapply
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
filelist <- lappy(filenames, read.csv)
#if necessary, assign names to data.frames
names(filelist) <- c("one","two","three")
#note the invisible function keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x,filelist[[x]],envir=.GlobalEnv)))
Reading all the CSV files from a folder and creating vactors same as the file names:
setwd("your path to folder where CSVs are")
filenames <- gsub("\\.csv$","", list.files(pattern="\\.csv$"))
for(i in filenames){
assign(i, read.csv(paste(i, ".csv", sep="")))
}
A simple way to access the elements of a list from the global environment is to attach the list. Note that this actually creates a new environment on the search path and copies the elements of your list into it, so you may want to remove the original list after attaching to prevent having two potentially different copies floating around.
I want to update the answer given by Joran:
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files(path="set your directory here", full.names=TRUE)
#full.names=TRUE is important to be added here
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files, read.csv)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",list.files("copy and paste your directory here",full.names = FALSE),fixed = TRUE)
#Now you can create a dataset based on each filename
df <- as.data.frame(all_csv$nameofyourfilename)
a simplified version, assuming your csv files are in the working directory:
listcsv <- list.files(pattern= "*.csv") #creates list from csv files
names <- substr(listcsv,1,nchar(listcsv)-4) #creates list of file names, no .csv
for (k in 1:length(listcsv)){
assign(names[[k]] , read.csv(listcsv[k]))
}
#cycles through the names and assigns each relevant dataframe using read.csv
#copy all the files you want to read in R in your working directory
a <- dir()
#using lapply to remove the".csv" from the filename
for(i in a){
list1 <- lapply(a, function(x) gsub(".csv","",x))
}
#Final step
for(i in list1){
filepath <- file.path("../Data/original_data/..",paste(i,".csv",sep=""))
assign(i, read.csv(filepath))
}
Use list.files and map_dfr to read many csv files
df <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
Reproducible example
First write sample csv files to a temporary directory.
It's more complicated than I thought it would be.
library(dplyr)
library(purrr)
library(purrrlyr)
library(readr)
data_folder <- file.path(tempdir(), "iris")
dir.create(data_folder)
iris %>%
# Keep the Species column in the output
# Create a new column that will be used as the grouping variable
mutate(species_group = Species) %>%
group_by(species_group) %>%
nest() %>%
by_row(~write.csv(.$data,
file = file.path(data_folder, paste0(.$species_group, ".csv")),
row.names = FALSE))
Read these csv files into one data frame.
Note the Species column has to be present in the csv files, otherwise we would loose that information.
iris_csv <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)

How to call several variables in a for loop in R?

I have several .csv files of data stored in a directory, and I need to import all of them into R.
Each .csv has two columns when imported into R. However, the 1001st row needs to be stored as a separate variable for each of the .csv files (it corresponds to an expected value which was stored here during the simulation; I want it to be outside of the main data).
So far I have the following code to import my .csv files as matrices.
#Load all .csv in directory into list
dataFiles <- list.files(pattern="*.csv")
for(i in dataFiles) {
#read all of the csv files
name <- gsub("-",".",i)
name <- gsub(".csv","",name)
i <- paste(".\\",i,sep="")
assign(name,read.csv(i, header=T))
}
This produces several matrices with the naming convention "sim_data_L_mu" where L and mu are parameters from the simulation. How can I remove the 1001st row (which has a number in the first column, and the second column is null) from each matrix and store it as a variable named "sim_data_L_mu_EV"? The main problem I have is that I do not know how to call all of the newly created matrices in my for loop.
Couldn't post long code in comments so am writing here:
# Use dialog to select folder
# Full names are required to access files that are not in the current working directory
file_list <- list.files(path = choose.dir(), pattern = "*.csv", full.names = T)
big_list <- lapply(file_list, function(z){
df <- read.csv(z)
scalar <- df[1000,1]
return(list(df, scalar))
})
To access the scalar value from the third file, you can use
big_list[[3]][2]
The elements in big_list follow the order of file_list so you always know which file the data comes from.
If you use data.table::fread() instead of read.csv, you can play around with assigning column names, selecting which rows/columns to read etc. It's also considerably faster for large datafiles.
Hope this helps!

Assigning Directory as a Variable in R

I need to create a function called PollutantMean with the following arguments: directory, pollutant, and id=1:332)
I have most of the code written but I can't figure out how to assign my directory as a variable. My current working directory is C:/Users/User/Documents. I tried writing the variable as:
directory <- "C:/Users/User/specdata" and that didn't work.
Next I tried the following:
directory <- list.files("specdata", full.names=TRUE) and that didn't work either.
Any ideas on how to change this?
If you are trying to assign the values in your current working directory to the variable "directory" Why not take the simple method and add:
directory <- getwd()
This should take the contents of the working directory and assign the values to the variable "directory".
I've already worker with directory as variables, I usually declare them like that
directory<-"C://Users//User//specdata//"
To take back your example.
Then, if I want to read a specific file in this directory, I will just go like :
read.table(paste(directory,"myfile.txt",sep=""),...)
It's the same process to write in a file
write.table(res,file=paste(directory,"myfile.txt",sep=""),...)
Is this helping ?
EDIT : you can then use read.csv and it will work fine
I think you are confused by the assignment operation in R. The following line
directory <- "C:/Users/User/specdata"
assigns a string to a new object that just happened to be called directory. It has the same effect on your working environment as
elephant <- "C:/Users/User/specdata"
To change where R reads its files, use the function setwd (short for set working directory):
setwd("C:/Users/User/specdata")
You can also specify full path names to functions that read in data (like read.table). For your specific problem,
# creates a list of all files ending with `csv` (i.e. all csv files)
all.specdata.files <- list.files(path = "C:/Users/User/specdata", pattern = "csv$")
# creates a list resulting from the application of `read.csv` to
# each of these files (which may be slow!!)
all.specdata.list <- lapply(all.specdata.files, read.csv)
Then we use dplyr::rbind_all to row-bind them into one file.
library(dplyr)
all.specdata <- rbind_all(all.specdata.list)
Then use colMeans to determine the grand means. Not sure how to do this without seeing the data.
Assuming that the columns in each of the 300+ csv files are the same, that is have column j contains the same type of data in all files, then the following example should be of use:
# let's use a temp directory for storing the files
tmpdr <- tempdir()
# Let's creat a large matrix of values and then split it into many different
# files
original_data <- data.frame(matrix(rnorm(10000L), nrow = 1000L))
# write each row to a file
for(i in seq(1, nrow(original_data), by = 1)) {
write.csv(original_data[i, ],
file = paste0(tmpdr, "/", formatC(i, format = "d", width = 4, flag = 0), ".csv"),
row.names = FALSE)
}
# get a character vector with the full path of each of the files
files <- list.files(path = tmpdr, pattern = "\\.csv$", full.names = TRUE)
# read each file into a list
read_data <- lapply(files, read.csv)
# bind the read_data into one data.frame,
read_data <- do.call(rbind, read_data)
# check that our two data.frames are the same.
all.equal(read_data, original_data)
# [1] TRUE

Combine multiple .RData files containing objects with the same name into one single .RData file

I have many many .RData files containing one dataframe that I had saved in a previous analysis and the data frame has the same name for each file loaded. So for example using load(file1.RData) I get a data frame called 'df', then using load(file2.RData) I get a data frame with the same name 'df'. I was wondering if it is at all possible to combine all these .RData files into one big .RData file since I need to load them all at once, with the name of each df equal to the file name so I can then use the different data frames.
I can do this using the code below, but it is very intricate, there must be a simpler way to do this… Thank you for your suggestions.
Say I have 3 .RData files and want to save all in a file called "main.RData" with their specific name (now they all come out as 'df'):
all.files = c("/Users/fra/file1.RData", "/Users/fra/file2.RData", "/Users/fra/file3.RData")
assign(gsub("/Users/fra/", "", all.files[1]), local(get(load(all.files[1]))))
rm(list= ls()[!(ls() %in% (ls(pattern = "file")))])
save.image(file="main.RData")
all.files = all.files = c("/Users/fra/file1.RData", "/Users/fra/file2.RData", "/Users/fra/file3.RData")
for (f in all.files[-1]) {
assign(gsub("/Users/fra/", "", f), local(get(load(f))))
rm(list= ls()[!(ls() %in% (ls(pattern = "file")))])
save.image(file="main.RData")
}
Here's an option that incorporates several existing posts
all.files = c("file1.RData", "file2.RData", "file3.RData")
Read multiple dataframes into a single named list (How can I load an object into a variable name that I specify from an R data file?)
mylist<- lapply(all.files, function(x) {
load(file = x)
get(ls()[ls()!= "filename"])
})
names(mylist) <- all.files #Note, the names here don't have to match the filenames
You can save the list, or transfer the dataframes into the global environment prior to saving (Unlist a list of dataframes)
list2env(mylist ,.GlobalEnv)
Alternatively, if the dataframes were identical and you wanted to create a single big dataframe, you could collapse the list and add a variable with names of contributing files (Dataframes in a list; adding a new variable with name of dataframe).
all <- do.call("rbind", mylist)
all$id <- rep(all.files, sapply(mylist, nrow))
I think the best answer I saw was the code below, which I copied from an SO answer which I can't track down right now. Apologies to the original author.
resave <- function(..., list = character(), file) {
previous <- load(file)
var.names <- c(list, as.character(substitute(list(...)))[-1L])
for (var in var.names) assign(var, get(var, envir = parent.frame()))
save(list = unique(c(previous, var.names)), file = file)
}
#I took advantage of the fact the load function
#returns the name of the loaded variables, so
#I could use the function's environment instead of creating one.
#And when using get, I was careful to only look in the
#environment from which the function is called, i.e. parent.frame()

R loop for anova on multiple files

I would like to execute anova on multiple datasets stored in my working directory. I have come up so far with:
files <- list.files(pattern = ".csv")
for (i in seq_along(files)) {
mydataset.i <- files[i]
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.i)
summary(AnovaModel.1)
}
As you can see I am very new to loops and cannot make this work. I also understand that I need to add a code to append all summary outputs in one file. I would appreciate any help you can provide to guide to the working loop that can execute anovas on multiple .csv files in the directory (same headers) and produce outputs for the record.
you might want to use list.files with full.names = TRUE in case you are not on the same path.
files <- list.files("path_to_my_dir", pattern="*.csv", full.names = T)
# use lapply to loop over all files
out <- lapply(1:length(files), function(idx) {
# read the file
this.data <- read.csv(files[idx], header = TRUE) # choose TRUE/FALSE accordingly
aov.mod <- aov(DES ~ DOSE, data = this.data)
# if you want just the summary as object of summary.aov class
summary(aov.mod)
# if you require it as a matrix, comment the previous line and uncomment the one below
# as.matrix(summary(aov.mod)[[1]])
})
head(out)
This should give you a list with each entry of the list having a summary matrix in the same order as the input file list.
Your error is that your loop is not loading your data. Your list of file names is in "files" then you start moving through that list and set mydataset.i equal to the name of the file that matches your itterator i... but then you try to run aov on the file name that is stored in mydataset.i!
The command you are looking for to redirect your output to a file is sink. Consider the following:
sink("FileOfResults.txt") #starting the redirect to the file
files <- list.files("path_to_my_dir", pattern="*.csv", full.names = T) #using the fuller code from Arun
for (i in seq_along(files)){
mydataset.i <- files[i]
mydataset.d <- read.csv(mydataset.i) #this line is new
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.d) #this line is modified
print(summary(AnovaModel.1))
}
sink() #ending the redirect to the file
I prefer this approach to Arun's because the results are stored directly to the file without jumping through a list and then having to figure out how to store the list to a file in a readable fashion.

Resources