R, rbind with multiple files defined by a variable - r

First off, this is related to a homework question for the Coursera R programming course. I have found other ways to do what I want to do but my research has led me to a question I'm curious about. I have a variable number of csv files that I need to pull data from and then take the mean of the "pollutant" column in said files. The files are listed in their directory with an id number. I put together the following code which works fine for a single csv file but doesn't work for multiple csv files:
pollutantmean <- function (directory, pollutant, id = 1:332) {
id <- formatC(id, width=3, flag="0")`
dataset<-read.csv(paste(directory, "/", id,".csv",sep=""),header=TRUE)`
mean(dataset[,pollutant], na.rm = TRUE)`
}
I also know how to rbind multiple csv files together if I know the ids when I am creating the function, but I am not sure how to assign rbind to a variable range of ids or if thats even possible. I found other ways to do it such as calling an lapply and the unlisting the data, just curious if there is an easier way.

Well, this uses an lapply, but it might be what you want.
file_list <- list.files("*your directory*", full.names = T)
combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))
This will turn all of your files into one large dataset, and from there it's easy to take the mean. Is that what you wanted?
An alternative way of doing this would be to step through file by file, taking sums and number of observations and then taking the mean afterwards, like so:
sums <- numeric()
n <- numeric()
i <- 1
for(file in file_list){
temp_df <- read.csv(file, header = T)
temp_mean <- mean(temp_df$pollutant)
sums[i] <- sum(temp_df$pollutant)
n[i] <- nrow(temp_df)
i <- i + 1
}
new_mean <- sum(sums)/sum(n)
Note that both of these methods require that only your desired csvs are in that folder. You can use a pattern argument in the list.files call if you have other files in there that you're not interested in.

A vector is not accepted for 'file' in read.csv(file, ...)
Below is a slight modification of yours. A vector of file paths are created and they are looped by sapply.
files <- paste("directory-name/",formatC(1:332, width=3, flag="0"),
".csv",sep="")
pollutantmean <- function(file, pollutant) {
dataset <- read.csv(file, header = TRUE)
mean(dataset[, pollutant], na.rm = TRUE)
}
sapply(files, pollutantmean)

Related

Errors in finding column mean of .csv file with NA cells in R

I have a folder with several .csv files containing raw data with multiple rows and 39 columns (x obs. of 39 variables), which have been read into R as follows:
# Name path containing .csv files as folder
folder = ("/users/.../");
# Find the number of files in the folder
file_list = list.files(path=folder, pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list))
{
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep='')))
}
I want to find the mean of a specific column in each of these .csv files and save it in a vector as follows:
for (i in 1:length(file_list))
{
clean = na.omit(file_list[i])
ColumnNameMean[i] = mean(clean["ColumnName"])
}
When I run the above fragment of code, I get the error "argument is not numeric or logical: returning NA". This happens in spite of attempting to remove the NA values using na.omit. Using complete.cases,
clean = file_list[i][complete.cases(file_list[i]), ]
I get the error: incorrect number of dimensions, even though the number of columns haven't been explicitly stated.
How do I fix this?
Edit: corrected clean[i] to clean (and vice versa). Ran code, same error.
Sample .csv file
There are several things wrong with your code.
folder = ("/users/.../"); You don't need the parenthesis and you definitely do not need the semi-colon. The semi-colon separates instructions, does not end them. So, this instruction is in fact two instructions, the assigment of a string to folder and between the ; and the newline the NULL instruction.
You are creating many objects in the global environment in the for loop where you assign the return value of read.csv. It is much better to read in the files into a list of data.frames.
na.omit can remove all rows from the data.frames. And there is no need to use it since mean has a na.rm argument.
You compute the mean values of each column of each data.frame. Though the data.frames are processed in a loop, the columns are not and R has a fast colMeans function.
You mistake [ for [[. The correct ways would be either clean[, "ColumnName"] or clean[["ColumnName"]].
Now the code, revised. I present several alternatives to compute the columns' means.
First, read all files in one go. I set the working directory before reading them and reset after.
folder <- "/users/.../"
file_list <- list.files(path = folder, pattern = "^muse.*\\.csv$")
old_dir <- setwd(folder)
df_list <- lapply(file_list, read.csv)
setwd(old_dir)
Now compute the means of three columns.
cols <- c("Delta_TP9", "Delta_AF7", "Theta_TP9")
All_Means <- lapply(df_list, function(DF) colMeans(DF[cols], na.rm = TRUE))
names(All_Means) <- file_list
Compute the means of all columns starting with Delta or Theta. Get those columns names with grep.
df_names <- names(df_list[[1]])
cols2 <- grep("^Delta", df_names, value = TRUE)
cols2 <- c(cols2, grep("^Theta", df_names, value = TRUE))
All_Means_2 <- lapply(df_list, function(DF) colMeans(DF[cols2], na.rm = TRUE))
names(All_Means_2) <- file_list
Finally, compute the means of all numeric columns. Note that this time the index vector cols3 is a logical vector.
cols3 <- sapply(df_list[[1]], is.numeric)
All_Means_3 <- lapply(df_list, function(DF) colMeans(DF[cols3], na.rm = TRUE))
names(All_Means_3) <- file_list
Try it like this:
setwd("U:/Playground/StackO/")
# Find the number of files in the folder
file_list = list.files(path=getwd(), pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file_list[i]))
}
ColumnNameMean <- rep(NULL, length(file_list))
for (i in 1:length(file_list)){
clean = get(file_list[i])
ColumnNameMean[i] = mean(clean[,"Delta_TP10"])
}
ColumnNameMean
#> [1] 1.286201
I used get to retrieve the data.frame otherwise file_list[i] just returns a string. I think this is an idiom used in other languages like python. I tried to stay true to the way you were using but there are easier way than indexing like this.
Maybe this:
lapply(list.files(path=getwd(), pattern="*.csv"), function(f){ dt <- read.csv(f); mean(dt[,"Delta_TP10"]) })
PS: Be careful with na.omit(), it removes ALL the rows with NA which in your case is your whole data.frame since Elements is only NA

Loop through subfolders and extract data from CSV files

I am trying to loop through all the subfolders of my wd, list their names, open 'data.csv' in each of them and extract the second and last value from that csv file.
The df would look like this :
Name_folder_1 2nd value Last value
Name_folder_2 2nd value Last value
Name_folder_3 2nd value Last value
For now, I managed to list the subfolders and each of the file (thanks to this thread: read multiple text files from multiple folders) but I struggle to implement (what I'm guessing should be) a nested loop to read and extract data from the csv files.
parent.folder <- "C:/Users/Desktop/test"
setwd(parent.folder)
sub.folders1 <- list.dirs(parent.folder, recursive = FALSE)
r.scripts <- file.path(sub.folders1)
files.v <- list()
for (j in seq_along(r.scripts)) {
files.v[j] <- dir(r.scripts[j],"data$")
}
Any hints would be greatly appreciated !
EDIT :
I'm trying the solution detailed below but there must be something I'm missing as it runs smoothly but does not produce anything. It might be something very silly, I'm new to R and the learning curve is making me dizzy :p
lapply(files, function(f) {
dat <- fread(f) # faster
dat2 <- c(basename(dirname(f)), head(dat$time, 1), tail(dat$time, 1))
write.csv(dat2, file = "test.csv")
})
Not easy to reproduce but here is my suggestion:
library(data.table)
files <- list.files("PARENTDIR", full.names = T, recursive = T, pattern = ".*.csv")
lapply(files, function(f) {
dat <- fread(f) # faster
# Do whatever, get the subfolder name for example
basename(dirname(f))
})
You can simply look recursivly for all CSV files in your parent directory and still get their corresponding parent folder.

R: save each loop result into one data frame

I have written a loop in R (still learning). My purpose is to pick the max AvgConc and max Roll_TotDep from each looping file, and then have two data frames that each contains all the max numbers picked from individual files. The code I wrote only save the last iteration results (for only one single file)... Can someone point me a right direction to revise my code, so I can append the result of each new iteration with previous ones? Thanks!
data.folder <- "D:\\20150804"
files <- list.files(path=data.folder)
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- sub[which.max(sub$AvgConc),]
maxETD <- sub[which.max(sub$Roll_TotDep),]
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The problem is that max1Conc and maxETD are not lists data.frames or vectors (or other types of object capable of storing more than one value).
To fix this:
maxETD<-vector()
max1Conc<-vector()
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- append(max1Conc,sub[which.max(sub$AvgConc),])
maxETD <- append(maxETD,sub[which.max(sub$Roll_TotDep),])
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The difference here is that I made the two variables you wish to write out empty vectors (max1Conc and maxETD), and then used the append command to add each successive value to the vectors.
There are more idiomatic R ways of accomplishing your goal; personally, I suggest you look into learning the apply family of functions. (http://adv-r.had.co.nz/Functionals.html)
I can't directly test the whole thing because I don't have a directory with files like yours, but I tested the parts, and I think this should work as an apply-driven alternative. It starts with a pair of functions, one to ingest a file from your directory and other to make a row out of the two max values from each of those files:
library(dplyr)
data.folder <- "D:\\20150804"
getfile <- function(filename) {
sub <- read.table(file.path(data.folder, filename), header=TRUE)
return(sub)
}
getmaxes <- function(df) {
rowi <- data.frame(AvConc.max = max(df[,"AvConc"]), ETD.max = max(df[,"ETD"]))
return(rowi)
}
Then it uses a couple of rounds of lapply --- embedded in piping courtesy ofdplyr --- to a) build a list with each data set as an item, b) build a second list of one-row data frames with the maxes from each item in the first list, c) rbind those rows into one big data frame, d) and then cbind the filenames to that data frame for reference.
dfmax <- lapply(as.list(list.files(path = data.folder)), getfiles) %>%
lapply(., getmaxes) %>%
Reduce(function(...) rbind(...), .) %>%
data.frame(file = list.files(path = data.folder), .)

Get summary / plot for a column out of a folder (~3000 csv files) in R

I'm a student from Germany. I want to create a summary (0.25 & 0.75 quantile, mean, min, max) and different plots for special columns (e.g. Inflow or Low).
The issue is that there is not only one .csv file, there are about 3200 files in that folder - different names (ISIN numbers of portfolios all starting with DE000LS9xxx).
After I looked through different platforms and this forum I tried different possibilities. My last try was to name every file 001.csv, 002.csv, etc. and use an answer out of this forum:
directory <- setwd("~/Desktop/Uni/paper/testdata/")
Inflowmean <- function(directory, Inflow, id = 1:3) {
filenames <- sprintf("%03d.csv", id)
filenames <- paste(directory, filenames, sep=";", dec=",")
ldf <- lapply(filenames, read.csv)
df=ldply(ldf)
summary(df[, Inflow], na.rm = TRUE)
}
I really hope that you can help me, cause I'm new and just started to learn commands in RStudio - seems that I'm not able to handle it, also tried different tutorials and the help function in the program...
Thank you so much!
It is rather unclear what your question actually is, but there are a number of problems with your code:
directory <- setwd("~/Desktop/Uni/paper/testdata/"): See ?setwd - it returns the current directory before changing the working directory, not ~/Desktop/Uni/paper/testdata/. You probably want
directory <- "~/Desktop/Uni/paper/testdata/"
setwd(directory)
filenames <- paste(directory, filenames, sep=";", dec=",") -- this will create filenames like "~/Desktop/Uni/paper/testdata/;001.csv;,". You probably want the separator to be / or .Platform$file.sep. I don't know why you have dec="," but that will just paste it onto the end. Try pasteing a few things together to see what gives you file names that make sense for your data.
Your ldply syntax is wrong: you probably want
ldply(ldf, function (x) summary(x[, Inflow], na.rm=T))
See ?ldply for more information. Also, to use ldply, you need library(plyr) somewhere. If you just want base R, you could try
do.call(rbind, lapply(x, function (x) summary(x[, Inflow], na.rm=T)))
Where the lapply applies your function (summary(x[, Inflow], na.rm=T)) to each of your dataframes, and do.call(rbind, ...) just joins all the summaries together into a single dataframe.
from
Using R to list all files with a specified extension
and
Opening all files in a folder, and applying a function
filenames <- list.files("~/Desktop/Uni/paper/testdata", pattern="*.csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv)
res <- lapply(ldf, summary)

Loop in R to read many files

I have been wondering if anybody knows a way to create a loop that loads files/databases in R.
Say i have some files like that: data1.csv, data2.csv,..., data100.csv.
In some programming languages you one can do something like this data +{ x }+ .csv the system recognizes it like datax.csv, and then you can apply the loop.
Any ideas?
Sys.glob() is another possibility - it's sole purpose is globbing or wildcard expansion.
dataFiles <- lapply(Sys.glob("data*.csv"), read.csv)
That will read all the files of the form data[x].csv into list dataFiles, where [x] is nothing or anything.
[Note this is a different pattern to that in #Joshua's Answer. There, list.files() takes a regular expression, whereas Sys.glob() just uses standard wildcards; which wildcards can be used is system dependent, details can be used can be found on the help page ?Sys.glob.]
See ?list.files.
myFiles <- list.files(pattern="data.*csv")
Then you can loop over myFiles.
I would put all the CSV files in a directory, create a list and do a loop to read all the csv files from the directory in the list.
setwd("~/Documents/")
ldf <- list() # creates a list
listcsv <- dir(pattern = "*.csv") # creates the list of all the csv files in the directory
for (k in 1:length(listcsv)){
ldf[[k]] <- read.csv(listcsv[k])
}
str(ldf[[1]])
Read the headers in a file so that we can use them for replacing in merged file
library(dplyr)
library(readr)
list_file <- list.files(pattern = "*.csv") %>%
lapply(read.csv, stringsAsFactors=F) %>%
bind_rows
fi <- list.files(directory_path,full.names=T)
dat <- lapply(fi,read.csv)
dat will contain the datasets in a list
Let's assume that your files have the file format that you mentioned in your question and that they are located in the working directory.
You can vectorise creation of the file names if they have a simple naming structure. Then apply a loading function on all the files (here I used purrr package, but you can also use lapply)
library(purrr)
c(1:100) %>% paste0("data", ., ".csv") %>% map(read.csv)
Here's another solution using a for loop. I like it better than the others because of its flexibility and because all dfs are directly stored in the global environment.
Assume you've already set your working directory, the algorithm will iteratively read all files and store them in the global environment with the name "datai".
list <- c(1:100)
for (i in list) {
filename <- paste0("data", i)
wd <- paste0("data", i, ".csv")
assign(filename, read.csv(wd))
}
First, set the working directory.
Find and store all the files ending with .csv.
Bind all of them row-wise.
Following is the code sample:
setwd("C:/yourpath")
temp <- list.files(pattern = "*.csv")
allData <- do.call("rbind",lapply(Sys.glob(temp), read.csv))
This may be helpful if you have datasets for participants as in psychology/sports/medicine etc.
setwd("C:/yourpath")
temp <- list.files(pattern = "*.sav")
#Maybe you want to unselect /delete IDs
DEL <- grep('ID(04|08|11|13|19).sav', temp)
temp2 <- temp[-DEL]
#Make a list of that contains all data
read.all <- lapply(temp2, read_sav)
#View(read.all[1])
#Option 1: put one under the next
df <- do.call("rbind", read.all)
Option 2: make something within each dataset (single IDs) e.g. get the mean of certain parts of each participant
mw_extraktion <- function(data_raw){
data_raw <- data.frame(data_raw)
#you may now calculate e.g. the mean for a certain variable for each ID
ID <- data_raw$ID[1]
data_OneID <- c(ID, Var2, Var3) #put your new variables (e.g. Means) here
} #end of function
data_combined <- t(data.frame(sapply(read.all, mw_extraktion) ) )

Resources