RMSE on a data frame of multiple files in R - r

My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#This calls all the files into a dataframe
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
dat <- read_excel(z, skip = 0, )
return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1 , sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.

Because splitByHUCs is a list, we can loop over it with sapply/lapply as in the OP's code. However, names$ is incorrect: the argument of the anonymous function is x, which stands for each element of the list in turn (i.e. a data.frame). Therefore, use x$ instead of names$:
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
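A minimal sketch of the same pattern on made-up data. The HUC, Predicted, and Actual values are invented for illustration, and rmse is defined locally as an assumed stand-in for Metrics::rmse so that the example runs without extra packages.

```r
# Stand-in for Metrics::rmse (assumption: square root of the mean squared error)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy data: two groups identified by a HUC column (values are invented)
bigdata <- data.frame(
  HUC       = c("A", "A", "B", "B"),
  Predicted = c(1, 2, 3, 5),
  Actual    = c(1, 4, 3, 5)
)

# split() returns a named list of data frames, one per HUC
splitByHUCs <- split(bigdata, bigdata$HUC)

# Inside the anonymous function, x is the data frame for one HUC
rMSEs <- sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
print(rMSEs)
```

The result is a named numeric vector with one RMSE per list element, so the names of the splits carry over to the output automatically.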

Related

Saving data frames using a for loop with file names corresponding to data frames

I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R that I am wanting to use elements being run through a loop within an argument when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
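lapply(DFs, \(x) save(list=x, file=paste0(x, ".Rda")))
Note that the file name has to be built from the variable x (here with paste0); writing "x.Rda" as a literal string uses the letter x, not its value. Likewise, save(list = x) saves the object whose name is stored in x rather than x itself.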
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
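The save/load round trip can be sketched as follows. The data frames are invented, and the files go to a temporary folder (out_dir is a name chosen for this illustration) instead of the working directory.

```r
# Toy data frames standing in for the real ones (contents are invented)
colors <- data.frame(id = 1:2, name = c("red", "blue"))
sets   <- data.frame(id = 1:3)
DFs <- c("colors", "sets")

# Write each object to <name>.Rda in a temporary folder
out_dir <- tempdir()
for (x in DFs) {
  save(list = x, file = file.path(out_dir, paste0(x, ".Rda")))
}

# Remove the originals, then load them back under their original names
rm(colors, sets)
for (x in DFs) {
  load(file.path(out_dir, paste0(x, ".Rda")))
}
str(colors)
```

Because save(list = x) stores each object under its own name, load restores colors and sets without any further assignment.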
To save you can do this:
DFs <- list(colors, sets, inventory)
names(DFs) <- c("colors", "sets", "inventory")
for (x in seq_along(DFs)) {
dx <- paste(names(DFs)[[x]], "Rda", sep = ".")
dfx <- DFs[[x]]
# note: each file stores its data frame under the object name dfx
save(dfx, file = dx)
}
To save to a specific folder, include the path when constructing dx, in the same way as the file path is built in the reading code.
To read:
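DFs <- c("colors", "sets", "inventory")
# or take the names from the folder, dropping the .csv extension
DFs <- sub("\\.csv$", "", dir("~/Documents/ME teaching/R notes/datasets/", pattern = "\\.csv$"))
data_list <- vector("list", length(DFs))
names(data_list) <- DFs
for (x in seq_along(DFs)) {
arq <- paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
data_list[[x]] <- read.csv(arq)
}
The data frames are stored in a list, so you can access each one with [[ ]] indexing.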
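As a hedged alternative sketch, the same read can be written with lapply. The folder and file contents below are created on the fly purely for illustration; with real data, folder would point at the existing datasets directory.

```r
# Create two small CSV files in a temporary folder (invented data)
folder <- file.path(tempdir(), "datasets")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(id = 1:2), file.path(folder, "sets.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3), file.path(folder, "colors.csv"), row.names = FALSE)

# Full paths of every .csv file in the folder
paths <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

# Read each file; name the list elements after the files, without the extension
data_list <- lapply(paths, read.csv)
names(data_list) <- tools::file_path_sans_ext(basename(paths))

print(names(data_list))
```

Naming the list elements after the files means a data frame can be retrieved by name, for example data_list[["colors"]].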

Errors in finding column mean of .csv file with NA cells in R

I have a folder with several .csv files containing raw data with multiple rows and 39 columns (x obs. of 39 variables), which have been read into R as follows:
# Name path containing .csv files as folder
folder = ("/users/.../");
# Find the number of files in the folder
file_list = list.files(path=folder, pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list))
{
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep='')))
}
I want to find the mean of a specific column in each of these .csv files and save it in a vector as follows:
for (i in 1:length(file_list))
{
clean = na.omit(file_list[i])
ColumnNameMean[i] = mean(clean["ColumnName"])
}
When I run the above fragment of code, I get the error "argument is not numeric or logical: returning NA". This happens in spite of attempting to remove the NA values using na.omit. Using complete.cases,
clean = file_list[i][complete.cases(file_list[i]), ]
I get the error "incorrect number of dimensions", even though the number of columns hasn't been explicitly stated.
How do I fix this?
Edit: corrected clean[i] to clean (and vice versa). Ran code, same error.
Sample .csv file
There are several things wrong with your code.
folder = ("/users/.../"); You don't need the parentheses, and you definitely do not need the semicolon. In R the semicolon separates instructions; it does not terminate them. So this line is in fact two instructions: the assignment of a string to folder and, between the ; and the newline, an empty (NULL) instruction.
You are creating many objects in the global environment in the for loop where you assign the return value of read.csv. It is much better to read in the files into a list of data.frames.
na.omit can remove all rows from the data.frames. And there is no need to use it since mean has a na.rm argument.
You compute the mean values of each column of each data.frame. Though the data.frames are processed in a loop, the columns are not and R has a fast colMeans function.
You mistake [ for [[. The correct ways would be either clean[, "ColumnName"] or clean[["ColumnName"]].
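The difference between [ and [[, and the effect of na.rm, can be seen on a small invented data frame (the column names below are placeholders, not the real data):

```r
# Invented data with a missing value
clean <- data.frame(ColumnName = c(1, NA, 3), Other = c("a", "b", "c"))

# Single brackets with one index return a one-column data frame, not a vector
print(class(clean["ColumnName"]))

# Double brackets (or [, "name"]) return the underlying numeric vector
print(class(clean[["ColumnName"]]))

# mean() needs the vector; na.rm = TRUE skips the missing value
col_mean <- mean(clean[["ColumnName"]], na.rm = TRUE)
print(col_mean)
```

Calling mean on clean["ColumnName"] passes a data frame, which is what triggers the "argument is not numeric or logical" warning.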
Now the code, revised. I present several alternatives to compute the columns' means.
First, read all files in one go. I set the working directory before reading them and reset after.
folder <- "/users/.../"
file_list <- list.files(path = folder, pattern = "^muse.*\\.csv$")
old_dir <- setwd(folder)
df_list <- lapply(file_list, read.csv)
setwd(old_dir)
Now compute the means of three columns.
cols <- c("Delta_TP9", "Delta_AF7", "Theta_TP9")
All_Means <- lapply(df_list, function(DF) colMeans(DF[cols], na.rm = TRUE))
names(All_Means) <- file_list
Compute the means of all columns starting with Delta or Theta. Get those columns names with grep.
df_names <- names(df_list[[1]])
cols2 <- grep("^Delta", df_names, value = TRUE)
cols2 <- c(cols2, grep("^Theta", df_names, value = TRUE))
All_Means_2 <- lapply(df_list, function(DF) colMeans(DF[cols2], na.rm = TRUE))
names(All_Means_2) <- file_list
Finally, compute the means of all numeric columns. Note that this time the index vector cols3 is a logical vector.
cols3 <- sapply(df_list[[1]], is.numeric)
All_Means_3 <- lapply(df_list, function(DF) colMeans(DF[cols3], na.rm = TRUE))
names(All_Means_3) <- file_list
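A small self-contained sketch of the last variant, using two invented data frames in place of the files. The final do.call(rbind, ...) step is an addition for illustration: it stacks the per-file vectors into one matrix.

```r
# Two invented data frames standing in for the files read with read.csv
df_list <- list(
  a.csv = data.frame(Delta_TP9 = c(1, NA, 3), Theta_TP9 = c(2, 4, 6),
                     id = c("x", "y", "z")),
  b.csv = data.frame(Delta_TP9 = c(10, 20, 30), Theta_TP9 = c(NA, 1, 3),
                     id = c("x", "y", "z"))
)

# Logical index of the numeric columns, taken from the first data frame
num_cols <- sapply(df_list[[1]], is.numeric)

# One named vector of column means per data frame
all_means <- lapply(df_list, function(DF) colMeans(DF[num_cols], na.rm = TRUE))

# Stack the vectors into a matrix: one row per file, one column per variable
means_mat <- do.call(rbind, all_means)
print(means_mat)
```

This assumes every file has the same columns as the first one, which is also what the code above relies on.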
Try it like this:
setwd("U:/Playground/StackO/")
# Find the number of files in the folder
file_list = list.files(path=getwd(), pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file_list[i]))
}
ColumnNameMean <- numeric(length(file_list))
for (i in 1:length(file_list)){
clean = get(file_list[i])
ColumnNameMean[i] = mean(clean[,"Delta_TP10"])
}
ColumnNameMean
#> [1] 1.286201
I used get to retrieve the data.frame; otherwise file_list[i] just returns a string (the file name), not the object of that name. I tried to stay true to the way you were doing it, but there are easier ways than indexing like this.
Maybe this:
lapply(list.files(path=getwd(), pattern="*.csv"), function(f){ dt <- read.csv(f); mean(dt[,"Delta_TP10"]) })
PS: Be careful with na.omit(): it removes every row that contains an NA, which in your case is the whole data.frame because the Elements column contains only NA.
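The effect described in the PS can be reproduced on invented data with an all-NA Elements column:

```r
# Invented data: the Elements column is entirely NA
dt <- data.frame(Delta_TP10 = c(1, 2, NA), Elements = c(NA, NA, NA))

# na.omit() drops every row that has an NA in any column, so nothing is left
print(nrow(na.omit(dt)))

# Ignoring NAs only in the column of interest keeps the usable values
m <- mean(dt[, "Delta_TP10"], na.rm = TRUE)
print(m)
```

Passing na.rm = TRUE to mean handles missing values per column, so unrelated columns cannot wipe out the data.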

R: save each loop result into one data frame

I have written a loop in R (still learning). My goal is to pick the max AvgConc and max Roll_TotDep from each file in the loop, and then build two data frames that each contain all the max values picked from the individual files. The code I wrote only saves the results of the last iteration (for a single file). Can someone point me in the right direction to revise my code, so that I can append the result of each new iteration to the previous ones? Thanks!
data.folder <- "D:\\20150804"
files <- list.files(path=data.folder)
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- sub[which.max(sub$AvgConc),]
maxETD <- sub[which.max(sub$Roll_TotDep),]
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The problem is that max1Conc and maxETD are overwritten on each iteration, so they only ever hold the result for one file; they need to accumulate one value per file.
To fix this:
maxETD <- data.frame()
max1Conc <- data.frame()
for (i in seq_along(files)) {
sub <- read.table(file.path(data.folder, files[i]), header = TRUE)
max1Conc <- rbind(max1Conc, sub[which.max(sub$AvgConc), ])
maxETD <- rbind(maxETD, sub[which.max(sub$Roll_TotDep), ])
}
# write once, after the loop: write.csv ignores append = TRUE
write.csv(max1Conc, file = "max1Conc.csv", row.names = FALSE)
write.csv(maxETD, file = "maxETD.csv", row.names = FALSE)
The difference here is that the two variables you wish to write out (max1Conc and maxETD) start as empty data frames, and rbind adds the row from each successive file to them. The files are written once, after the loop, because write.csv ignores append = TRUE (with a warning) and would otherwise overwrite the file on every iteration.
There are more idiomatic R ways of accomplishing your goal; personally, I suggest you look into learning the apply family of functions. (http://adv-r.had.co.nz/Functionals.html)
I can't directly test the whole thing because I don't have a directory with files like yours, but I tested the parts, and I think this should work as an apply-driven alternative. It starts with a pair of functions, one to ingest a file from your directory and the other to make a row out of the two max values from each of those files:
library(dplyr)
data.folder <- "D:\\20150804"
getfile <- function(filename) {
sub <- read.table(file.path(data.folder, filename), header=TRUE)
return(sub)
}
getmaxes <- function(df) {
rowi <- data.frame(AvgConc.max = max(df[,"AvgConc"]), Roll_TotDep.max = max(df[,"Roll_TotDep"]))
return(rowi)
}
Then it uses a couple of rounds of lapply, embedded in piping courtesy of dplyr, to a) build a list with each data set as an item, b) build a second list of one-row data frames with the maxes from each item in the first list, c) rbind those rows into one big data frame, and d) cbind the filenames to that data frame for reference.
dfmax <- lapply(as.list(list.files(path = data.folder)), getfile) %>%
lapply(., getmaxes) %>%
Reduce(function(...) rbind(...), .) %>%
data.frame(file = list.files(path = data.folder), .)
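For comparison, a base R sketch of the same idea without the pipe, using do.call(rbind, ...) in place of Reduce. The two input files are created in a temporary folder with invented values, and the column names are taken from the question; with real data, data.folder would point at the existing directory.

```r
# Create two small whitespace-delimited files in a temporary folder (invented values)
data.folder <- file.path(tempdir(), "maxdemo")
dir.create(data.folder, showWarnings = FALSE)
write.table(data.frame(AvgConc = c(1, 5, 3), Roll_TotDep = c(9, 2, 4)),
            file.path(data.folder, "f1.txt"), row.names = FALSE)
write.table(data.frame(AvgConc = c(7, 2), Roll_TotDep = c(1, 8)),
            file.path(data.folder, "f2.txt"), row.names = FALSE)

files <- list.files(data.folder)

# Build a one-row data frame of maxima for a single file
getmaxes <- function(filename) {
  sub <- read.table(file.path(data.folder, filename), header = TRUE)
  data.frame(file = filename,
             AvgConc.max = max(sub$AvgConc),
             Roll_TotDep.max = max(sub$Roll_TotDep))
}

# Bind the per-file rows into a single data frame
dfmax <- do.call(rbind, lapply(files, getmaxes))
print(dfmax)
```

Keeping the file name inside each row avoids a separate cbind step and guarantees that names and values stay aligned.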

How do I save individual species data downloaded via rgbif?

I have a list of species and I want to download occurrence data from them using rgbif. I'm trying out the code with just two species with the assumption that when I get it to work for two getting it to work for the actual (and much longer) list won't be a problem. Here's the code I'm using:
#Start
library(rgbif)
splist <- c('Acer platanoides','Acer pseudoplatanus')
keys <- sapply(splist, function(x) name_suggest(x)$key[1], USE.NAMES=FALSE)
OS1=occ_search(taxonKey=keys, fields=c('name','key','decimalLatitude','decimalLongitude','country','basisOfRecord','coordinateAccuracy','elevation','elevationAccuracy','year','month','day'), minimal=FALSE,limit=10, return='data')
OS1
#End
This bit works almost perfectly. I get data for both species divided by species. One species is missing some columns, but I'm assuming for now that's an issue with the data, not the code. The next line I tried -
write.csv(OS1, "os1.csv")
works fine when saving a single species but not for more than one. Can someone please help? How do I save data for each species as separate files, bearing in mind I also want the method to work for data for more than 2 species?
Thanks!
The result is a list, which means you can use R's functions to walk over each list element and save it. The following code extracts the species names (you might have these lying around somewhere already), then uses mapply to pair each species' data with a file name and save it as a .txt file.
filenames <- paste(sapply(sapply(OS1, FUN = "[[", "name", simplify = FALSE), unique), ".txt", sep = "")
mapply(OS1, filenames, FUN = function(x, y) write.table(x, file = y, row.names = FALSE))
This is akin to a for loop solution, but some might argue a more concise one.
for (i in 1:length(filenames)) {
write.table(OS1[[i]], file = filenames[i], row.names = FALSE)
}
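A self-contained sketch of writing one file per list element. The list below is an invented stand-in for the occ_search() result and is assumed to be named by species; the files go to a temporary folder (out_dir is a name chosen for this illustration).

```r
# A named list of data frames standing in for the occ_search() result (invented data)
OS1 <- list(
  "Acer platanoides"    = data.frame(name = "Acer platanoides", year = c(2001, 2005)),
  "Acer pseudoplatanus" = data.frame(name = "Acer pseudoplatanus", year = 2010)
)

# Build one file name per species; spaces are replaced to keep the names simple
out_dir <- tempdir()
filenames <- file.path(out_dir, paste0(gsub(" ", "_", names(OS1)), ".csv"))

# Write each list element to its own file
for (i in seq_along(OS1)) {
  write.csv(OS1[[i]], file = filenames[i], row.names = FALSE)
}

print(basename(filenames))
```

Because the file names are derived from the list itself, the same loop works unchanged for any number of species.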

R Programming: Difficulty removing NAs from frame when using lapply

Full disclosure: I am taking a Data Science course on Coursera. For this particular question, we need to calculate the mean of some pollutant data that is being read in from multiple files.
The main function I need help with also references a couple other functions that I wrote in the script. For brevity, I'm just going to list them and their purpose:
boundIDs: I use this to bound the input so that inputs won't be accepted that are out of range. (The range is 1:332, so if someone enters 1:400 this changes the range to 1:332.)
pollutantToCode: converts the pollutant string entered to that pollutant's column number in the data file.
fullFilePath: creates the file name and appends it to the full file path. So if someone states they need the file for ID 1 in directory "curse/your/sudden/but/inevitable/betrayal/", the function will return "curse/your/sudden/but/inevitable/betrayal/001.csv" to be added to the file list vector.
After all that, the main function I'm working with is:
pollutantmean <- function(directory = "", pollutant, id = 1:332){
id <- boundIDs(id)
pollutant <- pollutantToCode(pollutant)
numberOfIds <- length(id)
fileList <- character(numberOfIds)
for (i in 1:numberOfIds){
if (id[i] > 332){
next
}
fileList[i] <- fullFilePath(directory, id[i])
}
data <- lapply(fileList, read.csv)
print(data[[1]][[pollutant]])
}
Right now, I'm intentionally printing only the first frame of data to see what my output looks like. To remove the NAs I've tried using:
data <- lapply(fileList, read.csv)
data <- data[!is.na(data)]
But the NAs remained, so then I tried computing the mean directly and using the na.rm parameter:
print(mean(data[[1]][[pollutant]], na.rm = TRUE))
But the mean was still "NA". Then I tried na.omit:
data <- lapply(fileList, na.omit(read.csv))
...and unfortunately the problem persisted.
Can someone please help? :-/
(PS: Right now I'm just focusing on the first frame of whatever is read in, i.e. data[[1]], since I figure if I can't get it for the first frame there's no point in iterating over the rest.)
