I would like to execute anova on multiple datasets stored in my working directory. I have come up so far with:
files <- list.files(pattern = ".csv")
for (i in seq_along(files)) {
mydataset.i <- files[i]
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.i)
summary(AnovaModel.1)
}
As you can see I am very new to loops and cannot make this work. I also understand that I need to add a code to append all summary outputs in one file. I would appreciate any help you can provide to guide to the working loop that can execute anovas on multiple .csv files in the directory (same headers) and produce outputs for the record.
you might want to use list.files with full.names = TRUE in case you are not on the same path.
files <- list.files("path_to_my_dir", pattern="*.csv", full.names = T)
# use lapply to loop over all files
out <- lapply(1:length(files), function(idx) {
# read the file
this.data <- read.csv(files[idx], header = TRUE) # choose TRUE/FALSE accordingly
aov.mod <- aov(DES ~ DOSE, data = this.data)
# if you want just the summary as object of summary.aov class
summary(aov.mod)
# if you require it as a matrix, comment the previous line and uncomment the one below
# as.matrix(summary(aov.mod)[[1]])
})
head(out)
This should give you a list with each entry of the list having a summary matrix in the same order as the input file list.
Your error is that your loop is not loading your data. Your list of file names is in "files" then you start moving through that list and set mydataset.i equal to the name of the file that matches your itterator i... but then you try to run aov on the file name that is stored in mydataset.i!
The command you are looking for to redirect your output to a file is sink. Consider the following:
sink("FileOfResults.txt") #starting the redirect to the file
files <- list.files("path_to_my_dir", pattern="*.csv", full.names = T) #using the fuller code from Arun
for (i in seq_along(files)){
mydataset.i <- files[i]
mydataset.d <- read.csv(mydataset.i) #this line is new
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.d) #this line is modified
print(summary(AnovaModel.1))
}
sink() #ending the redirect to the file
I prefer this approach to Arun's because the results are stored directly to the file without jumping through a list and then having to figure out how to store the list to a file in a readable fashion.
Related
Suppose we have files file1.csv, file2.csv, ... , and file100.csv in directory C:\R\Data and we want to read them all into separate data frames (e.g. file1, file2, ... , and file100).
The reason for this is that, despite having similar names they have different file structures, so it is not that useful to have them in a list.
I could use lapply but that returns a single list containing 100 data frames. Instead I want these data frames in the Global Environment.
How do I read multiple files directly into the global environment? Or, alternatively, How do I unpack the contents of a list of data frames into it?
Thank you all for replying.
For completeness here is my final answer for loading any number of (tab) delimited files, in this case with 6 columns of data each where column 1 is characters, 2 is factor, and remainder numeric:
##Read files named xyz1111.csv, xyz2222.csv, etc.
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
##Create list of data frame names without the ".csv" part
names <-substr(filenames,1,7)
###Load all files
for(i in names){
filepath <- file.path("../Data/original_data/",paste(i,".csv",sep=""))
assign(i, read.delim(filepath,
colClasses=c("character","factor",rep("numeric",4)),
sep = "\t"))
}
Quick draft, untested:
Use list.files() aka dir() to dynamically generate your list of files.
This returns a vector, just run along the vector in a for loop.
Read the i-th file, then use assign() to place the content into a new variable file_i
That should do the trick for you.
Use assign with a character variable containing the desired name of your data frame.
for(i in 1:100)
{
oname = paste("file", i, sep="")
assign(oname, read.csv(paste(oname, ".txt", sep="")))
}
This answer is intended as a more useful complement to Hadley's answer.
While the OP specifically wanted each file read into their R workspace as a separate object, many other people naively landing on this question may think that that's what they want to do, when in fact they'd be better off reading the files into a single list of data frames.
So for the record, here's how you might do that.
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files("path/to/files")
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files,read.csv,...)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",
list.files("path/to/files",full.names = FALSE),
fixed = TRUE)
Now any of the files can be referred to by my_files[["filename"]], which really isn't much worse that just having separate filename variables in your workspace, and often it is much more convenient.
Here is a way to unpack a list of data.frames using just lapply
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
filelist <- lappy(filenames, read.csv)
#if necessary, assign names to data.frames
names(filelist) <- c("one","two","three")
#note the invisible function keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x,filelist[[x]],envir=.GlobalEnv)))
Reading all the CSV files from a folder and creating vactors same as the file names:
setwd("your path to folder where CSVs are")
filenames <- gsub("\\.csv$","", list.files(pattern="\\.csv$"))
for(i in filenames){
assign(i, read.csv(paste(i, ".csv", sep="")))
}
A simple way to access the elements of a list from the global environment is to attach the list. Note that this actually creates a new environment on the search path and copies the elements of your list into it, so you may want to remove the original list after attaching to prevent having two potentially different copies floating around.
I want to update the answer given by Joran:
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files(path="set your directory here", full.names=TRUE)
#full.names=TRUE is important to be added here
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files, read.csv)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",list.files("copy and paste your directory here",full.names = FALSE),fixed = TRUE)
#Now you can create a dataset based on each filename
df <- as.data.frame(all_csv$nameofyourfilename)
a simplified version, assuming your csv files are in the working directory:
listcsv <- list.files(pattern= "*.csv") #creates list from csv files
names <- substr(listcsv,1,nchar(listcsv)-4) #creates list of file names, no .csv
for (k in 1:length(listcsv)){
assign(names[[k]] , read.csv(listcsv[k]))
}
#cycles through the names and assigns each relevant dataframe using read.csv
#copy all the files you want to read in R in your working directory
a <- dir()
#using lapply to remove the".csv" from the filename
for(i in a){
list1 <- lapply(a, function(x) gsub(".csv","",x))
}
#Final step
for(i in list1){
filepath <- file.path("../Data/original_data/..",paste(i,".csv",sep=""))
assign(i, read.csv(filepath))
}
Use list.files and map_dfr to read many csv files
df <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
Reproducible example
First write sample csv files to a temporary directory.
It's more complicated than I thought it would be.
library(dplyr)
library(purrr)
library(purrrlyr)
library(readr)
data_folder <- file.path(tempdir(), "iris")
dir.create(data_folder)
iris %>%
# Keep the Species column in the output
# Create a new column that will be used as the grouping variable
mutate(species_group = Species) %>%
group_by(species_group) %>%
nest() %>%
by_row(~write.csv(.$data,
file = file.path(data_folder, paste0(.$species_group, ".csv")),
row.names = FALSE))
Read these csv files into one data frame.
Note the Species column has to be present in the csv files, otherwise we would loose that information.
iris_csv <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
I would like to write a function to repeat a chunk of code over a collection of file names (names of files present in my working directory) in r. I would also like to save the outputs of each line as a global environment object if possible. General structure of what I am trying to do is given below. Function name is made up.
global_environment_object_1 <- anyfunction("filename and extension").
# Repeat this over a set of filenames in the working directory and save each as a separate
# global environment object with separate names.
A Real life example can be:
sds22 <- get_subdatasets("EVI_2017_June_2.hdf")
sds23 <- get_subdatasets("EVI_2016_June_1.hdf")
-where object names and file names are changing and the total number of files is 48.
Thanks for the help in advance!
Try using :
#Get all the filenames
all_files <- list.files(full.names = TRUE)
#Probably to be specific
#all_files <- list.files(patter = "\\.hdf$", full.names = TRUE)
#apply get_subdatasets to each one of them
all_data <- lapply(all_files, get_subdatasets)
#Assign name to list output
names(all_data) <- paste0('sds', seq_along(all_data))
#Get the data in global environment
list2env(all_data, .GlobalEnv)
I am trying to loop through all the subfolders of my wd, list their names, open 'data.csv' in each of them and extract the second and last value from that csv file.
The df would look like this :
Name_folder_1 2nd value Last value
Name_folder_2 2nd value Last value
Name_folder_3 2nd value Last value
For now, I managed to list the subfolders and each of the file (thanks to this thread: read multiple text files from multiple folders) but I struggle to implement (what I'm guessing should be) a nested loop to read and extract data from the csv files.
parent.folder <- "C:/Users/Desktop/test"
setwd(parent.folder)
sub.folders1 <- list.dirs(parent.folder, recursive = FALSE)
r.scripts <- file.path(sub.folders1)
files.v <- list()
for (j in seq_along(r.scripts)) {
files.v[j] <- dir(r.scripts[j],"data$")
}
Any hints would be greatly appreciated !
EDIT :
I'm trying the solution detailed below but there must be something I'm missing as it runs smoothly but does not produce anything. It might be something very silly, I'm new to R and the learning curve is making me dizzy :p
lapply(files, function(f) {
dat <- fread(f) # faster
dat2 <- c(basename(dirname(f)), head(dat$time, 1), tail(dat$time, 1))
write.csv(dat2, file = "test.csv")
})
Not easy to reproduce but here is my suggestion:
library(data.table)
files <- list.files("PARENTDIR", full.names = T, recursive = T, pattern = ".*.csv")
lapply(files, function(f) {
dat <- fread(f) # faster
# Do whatever, get the subfolder name for example
basename(dirname(f))
})
You can simply look recursivly for all CSV files in your parent directory and still get their corresponding parent folder.
I have an R script that reads a certain type of file (nexus files of phylogenetic trees), whose name ends in *.trees.txt. It then applies a number of functions from an R package called bGMYC, available here and creates 3 pdf files. I would like to know what I should do to make the script loop through the files for each of 14 species.
The input files are in a separate folder for each species, but I can put them all in one folder if that facilitates the task. Ideally, I would like to output the pdf files to a folder for each species, different from the one containing the input file.
Here's the script
# Call Tree file
trees <- read.nexus("L_boscai_1411_test2.trees.txt")
# To use with different species, substitute "L_boscai_1411_test2.trees.txt" by the path to each species tree
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
pdf('results_single_boscai.pdf')
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
pdf('results_boscai.pdf') # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
pdf('trees_boscai.pdf') # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
I've read several posts with a similar question, but they almost always involve .csv files, refer to multiple files in a single folder, have a simpler script or do not need to output files to separate folders, so I couldn't find a solution to my specific problem.
Shsould I use a for loop or could I create a function out of this script and use lapply or another sort of apply? Could you provide me with sample code for your proposed solution or point me to a tutorial or another reference?
Thanks for your help.
It really depends on the way you want to run it.
If you are using linux / command line job submission, it might be best to look at
How can I read command line parameters from an R script?
If you are using GUI (Rstudio...) you might not be familiar with this, so I would solve the problem
as a function or a loop.
First, get all your file names.
files = list.files(path = "your/folder")
# Now you have list of your file name as files. Just call each name one at a time
# and use for loop or apply (anything of your choice)
And since you would need to name pdf files, you can use your file name or index (e.g loop counter) and append to the desired file name. (e.g. paste("single_boscai", "i"))
In your case,
files = list.files(path = "your/folder")
# Use pattern = "" if you want to do string matching, and extract
# only matching files from the source folder.
genPDF = function(input) {
# Read the file
trees <- read.nexus(input)
# Store the index (numeric)
index = which(files == input)
#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])
#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create the 1st pdf
outname = paste('results_single_boscai', index, '.pdf', sep = "")
pdf(outnam)
plot(results.single)
dev.off()
#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]
#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))
#Create 2nd pdf
outname = paste('results_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()
#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)
#Create 3rd pdf
outname = paste('trees_boscai', index, '.pdf', sep = "")
pdf(outname) # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()
}
for (i in 1:length(files)) {
genPDF(files[i])
}
I need to create a function called PollutantMean with the following arguments: directory, pollutant, and id=1:332)
I have most of the code written but I can't figure out how to assign my directory as a variable. My current working directory is C:/Users/User/Documents. I tried writing the variable as:
directory <- "C:/Users/User/specdata" and that didn't work.
Next I tried the following:
directory <- list.files("specdata", full.names=TRUE) and that didn't work either.
Any ideas on how to change this?
If you are trying to assign the values in your current working directory to the variable "directory" Why not take the simple method and add:
directory <- getwd()
This should take the contents of the working directory and assign the values to the variable "directory".
I've already worker with directory as variables, I usually declare them like that
directory<-"C://Users//User//specdata//"
To take back your example.
Then, if I want to read a specific file in this directory, I will just go like :
read.table(paste(directory,"myfile.txt",sep=""),...)
It's the same process to write in a file
write.table(res,file=paste(directory,"myfile.txt",sep=""),...)
Is this helping ?
EDIT : you can then use read.csv and it will work fine
I think you are confused by the assignment operation in R. The following line
directory <- "C:/Users/User/specdata"
assigns a string to a new object that just happened to be called directory. It has the same effect on your working environment as
elephant <- "C:/Users/User/specdata"
To change where R reads its files, use the function setwd (short for set working directory):
setwd("C:/Users/User/specdata")
You can also specify full path names to functions that read in data (like read.table). For your specific problem,
# creates a list of all files ending with `csv` (i.e. all csv files)
all.specdata.files <- list.files(path = "C:/Users/User/specdata", pattern = "csv$")
# creates a list resulting from the application of `read.csv` to
# each of these files (which may be slow!!)
all.specdata.list <- lapply(all.specdata.files, read.csv)
Then we use dplyr::rbind_all to row-bind them into one file.
library(dplyr)
all.specdata <- rbind_all(all.specdata.list)
Then use colMeans to determine the grand means. Not sure how to do this without seeing the data.
Assuming that the columns in each of the 300+ csv files are the same, that is have column j contains the same type of data in all files, then the following example should be of use:
# let's use a temp directory for storing the files
tmpdr <- tempdir()
# Let's creat a large matrix of values and then split it into many different
# files
original_data <- data.frame(matrix(rnorm(10000L), nrow = 1000L))
# write each row to a file
for(i in seq(1, nrow(original_data), by = 1)) {
write.csv(original_data[i, ],
file = paste0(tmpdr, "/", formatC(i, format = "d", width = 4, flag = 0), ".csv"),
row.names = FALSE)
}
# get a character vector with the full path of each of the files
files <- list.files(path = tmpdr, pattern = "\\.csv$", full.names = TRUE)
# read each file into a list
read_data <- lapply(files, read.csv)
# bind the read_data into one data.frame,
read_data <- do.call(rbind, read_data)
# check that our two data.frames are the same.
all.equal(read_data, original_data)
# [1] TRUE