Apply functions in different csv files in R with - r

Already made some research about loops in R and found that most of them focus only on how to change file or variable name with loops like Change variable name in for loop using R could find a good solution from other articles of bloggers....
The following is what I want to do with my original data(s1P...s9P) to get new data(s1Pm...s9Pm) by calculate their means according to the Datum column in (s1P...s9P).
Running the following lines one by one is okay, however, it seems like there should be a possibility to use loops to make it tidy.
Any suggestions would be appreciated. Have a nice weekend!
s1Pm = aggregate(s1P, list(s1P$Datum), mean)
s2Pm = aggregate(s2P, list(s2P$Datum), mean)
s3Pm = aggregate(s3P, list(s3P$Datum), mean)
s4Pm = aggregate(s4P, list(s4P$Datum), mean)
s5Pm = aggregate(s5P, list(s5P$Datum), mean)
s6Pm = aggregate(s6P, list(s6P$Datum), mean)
s7Pm = aggregate(s7P, list(s7P$Datum), mean)
s8Pm = aggregate(s8P, list(s8P$Datum), mean)
s9Pm = aggregate(s9P, list(s9P$Datum), mean)

We can load all the objects in a list with mget and then apply the aggregate by looping through the list
outLst <- lapply(mget(paste0("s", 1:9, "P")),
function(x) aggregate(x, list(x$Datum), mean))
names(outLst) <- paste0(names(outLst), "m")
It is better to keep the output in a list rather than creating multiple objects. But, it can be done as well
list2env(outLst, envir = .GlobalEnv)
though not recommended

Related

RSME on dataframe of multiple files in R

My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#This calls all the files into a dataframe
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
dat <- read_excel(z, skip = 0, )
return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1 , sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
As it is a list, we can loop over the list with sapply/lapply as in the OP's code, but the names$ is incorrect as the lambda function object is x which signifies each of the elements of the list (i.e. a data.frame). Therefore, instead of names$, use x$
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))

How to create a string that can be used as LHS and assigned a value to?

I feel stupid for asking such a simple question, but I am hitting my head in the wall.
Why does the paste0() create a string that cannot be not interpreted as name for an empty object ? Is there a different way of create the LHS that would be better?
As input I have a dataframe. As an output I want to have a new filtered dataframe. This works fine as long as I manually type all the code. However, I am trying to reduce repetition, and therefore I want to create a function that does the same thing, but then it is not working anymore.
library(magrittr)
df <- data.frame(
var_a = round(runif(20), digits = 1),
var_b = sample(letters, 20)
)
### Find duplicates
df$duplicate_num <- duplicated(df$var_a)
df$duplicate_txt <- duplicated(df$var_b)
df # a check
### Create two lists of duplicates
list_of_duplicate_num <-
df %>%
filter(duplicate_num)
list_of_duplicate_num # a check
list_of_duplicate_txt <-
df %>%
filter(duplicate_txt)
list_of_duplicate_txt # a check '
So far everything works as expected.
I would like to simplify the code and make this to a function that takes the arguments "num" or "txt". But I am having problems with creating the LHS.
The below should, in my mind, do the same as the code above.
paste0("list_of_duplicate_", "num") <-
df %>%
filter(duplicate_num)
I do get an error message:
Error in paste0("list_of_duplicate_", "num") <- df %>%
filter(duplicate_num) :
target of assignment expands to non-language object
My goal is to create a function with something like this:
make_list_of_duplicates <- function(criteria = "num") {
paste0("list_of_duplicate_", criteria) <-
df %>%
filter(paste0("duplicate_", criteria))
paste0("list_of_duplicate_", criteria) # a check
}
### Create two lists of duplicates
make_list_of_duplicates("num")
make_list_of_duplicates("txt")
and then continue with some joins etc.
I have been looking to tidy evaluation, assignments, rlang::enexpr(), base::substitute(), get(), mget() and many other things, but after two day of reading and trial and error, I am convinced that there must be a an other direction to look at that I am not seeing.
I am running MS Open R 4.0.2.
I am grateful for any suggestions.
Sincerely,
Eero
I found the solution to my question, when I understood that it was a case of indirection. Because I was on a wrong track, I created lots of complications and made it more difficult than necessary. Thanks to #r2evans who pointed me in the right direction. I have in the mean time decided that I will use loops, instead of functions, but here is the working function:
## Example of using paste inside a function to refer to an object.
library(magrittr)
library(dplyr)
df <- data.frame(
var_a = round(runif(20), digits = 1),
var_b = sample(letters, 20)
)
# Find duplicates
df$duplicate_num <- duplicated(df$var_a)
df$duplicate_txt <- duplicated(df$var_b)
# SEE https://dplyr.tidyverse.org/articles/programming.html#indirection-2
make_list_of_duplicates_f2 <- function(criteria = "num") {
df %>%
filter(.data[[paste0("duplicate_", {{criteria}})]])
}
# Create two lists of duplicates
list_of_duplicates_f2_num <-
make_list_of_duplicates_f2("num")
list_of_duplicates_f2_txt <-
make_list_of_duplicates_f2("txt")

Using a For-loop to create multiple objects with incremental suffixes, then reading in .csv file to each new object (also with incremental suffixes)

I've just started learning R so forgive me for my ignorance! I'm reading in lots of .csv files, each of which correlates to a different year (2010-2019). I then filter down the .csv files based on a variable within one of the columns (because the datasets are very large. Currently I am using the below code to do this and then repeating it for each year:
data_2010 <- data.table::fread("//Project/2010 data/2010 data.csv", select = c("date", "id", "type"))
data_b_2010 <- data_2010[which(data_2010$type=="ABC123")]
rm(data_2010)
What I would like to do is use a For-loop to create new object data_20xx for each year, and then read in the .csv files (and apply the filter of "type") for each year too.
I think I know how to create the objects in a For-loop but not entirely sure how I would also assign the .csv files and change the filepath string so it updates with each year (i.e. "//Project/2010 data/2010 data.csv" to "//Project/2011 data/2011 data.csv").
Any help would be greatly appreciated!
Next time please provide a repoducible example so we can help you.
I would use data.table which contains specialized functions to do what you want.
library(data.table)
setwd("Project")
allfiles <- list.files(recursive = T, full.names = T)
allcsv <- allfiles[grepl(".csv", allfiles)]
data_list <- list()
for(i in 1:length(allcsv)) {
print(paste(round(i/length(allcsv),2)))
data_list[i] <- fread(allcsv[i])
}
data_list_filtered <- lapply(data_list, function(x) {
y <- data.frame(x)
return(y[which(y["type"]=="ABC123",)])
})
result <- rbindlist(data_list_filtered)
First, list.files will tell you all the files contained in your working dir by default.
Second, read each csv file into the data_list list using the fast and efficient fread function.
Third, do the filtering within a loop, as requested.
Fourth, use rbindlist from data.table to rbind all of these data.table's.
Finally, if you are not familiar with the data.table syntax, you can run setDF(result) to convert your results back to a data.frame.
I strongly encourage you to learn the data.table syntax as it is quite powerful and efficient for tabular data manipulations. These vignettes will get you started.

R create a loop for a list of data frames

I currently have a list which is made up of around 80+ data frames, what I would like to do is to loop a chunk of code for each individual data frame within the list, without naming each one individually, or splitting them into individual data frames to work on.
Currently I split the list into each individual data frame using the below code:
dat5split <- setNames(split(dat5, dat5$CODE), paste0("df", unique(dat5$CODE)))
list2env(dat5split, globalenv())
I then work through each data frame individually:
# call in SPC function and write to 'results10000'
results10000<-SPC_XBAR(df10000,vol_n,seasonality)
results10000 = results10000 %>%
cbind(Spec = df10000$CODE) %>%
subset(`table_n` == 1)
results10000 <- results10000[order(results10000$tpd),]
results10000$Date <- as.Date(cbind(Date = df10000$CENSUS_DATE))
# call in SPC function and write to 'results10001'
results10001<-SPC_XBAR(df10001,vol_n,seasonality)
results10001 = results10001 %>%
cbind(Spec = df10001$CODE) %>%
subset(`table_n` == 1)
results10001 <- results10001[order(results10001$tpd),]
results10001$Date <- as.Date(cbind(Date = df10001$CENSUS_DATE))
Currently I call in the function 'SPC_XBAR' to where vol_n and seasonality are set earlier in the code. The script then passes the values to the function which then assigns the results to 'results10000, results10001' etc etc. Upon which I do a small bit of data wrangling on each newly created data frame before feeding the results back into sql server at the end.
As you can see each one is being individually hard coded which is not efficient.
What I would like to do is to loop a chunk of code for each individual data frame within the list, without naming each one individually.
I believe a loop would solve this issue but I am a little inexperienced when it comes to the ability to create a loop around it. Any advice would be much appreciated.
Cheers
Have you considered using lapply instead of a loop throughout the list? Check it here...
EDIT: I try to elaborate a bit more... What happens if you do this:
myFunction <- function(x) {
results<-SPC_XBAR(x,vol_n,seasonality)
results = results %>%
cbind(Spec = x$CODE) %>%
subset(`table_n` == 1)
results <- results[order(results$tpd),]
results$Date <- as.Date(cbind(Date = x$CENSUS_DATE))
results
}
lapply(dat5split, myFunction)
I would expect it to return a list of the resulting datasets

How do I save individual species data downloaded via rgbif?

I have a list of species and I want to download occurrence data from them using rgbif. I'm trying out the code with just two species with the assumption that when I get it to work for two getting it to work for the actual (and much longer) list won't be a problem. Here's the code I'm using:
#Start
library(rgbif)
splist <- c('Acer platanoides','Acer pseudoplatanus')
keys <- sapply(splist, function(x) name_suggest(x)$key[1], USE.NAMES=FALSE)
OS1=occ_search(taxonKey=keys, fields=c('name','key','decimalLatitude','decimalLongitude','country','basisOfRecord','coordinateAccuracy','elevation','elevationAccuracy','year','month','day'), minimal=FALSE,limit=10, return='data')
OS1
#End
This bit works almost perfectly. I get data for both species divided by species. One species is missing some columns, but I'm assuming for now that's an issue with the data, not the code. The next line I tried -
write.csv(OS1, "os1.csv")
works fine when saving a single species but not for more than one. Can someone please help? How do I save data for each species as separate files, bearing in mind I also want the method to work for data for more than 2 species?
Thanks!
The result is a list, which means you can use R's functions to climb each list element and save it. The following code extracts species names (you might have this laying around somewhere already) and uses mapply to pair species data and file name and use this to save a .txt file.
filenames <- paste(sapply(sapply(OS1, FUN = "[[", "name", simplify = FALSE), unique), ".txt", sep = "")
mapply(OS1, filenames, FUN = function(x, y) write.table(x, file = y, row.names = FALSE))
This is akin to a for loop solution, but some might argue a more concise one.
for (i in 1:length(filenames)) {
write.table(OS1[[i]], file = filenames[i], row.names = FALSE)
}

Resources