I am currently learning R, and I tried to change a for loop to use apply.
The context is a data frame galton with two variables, parent (height in inches) and child (height in inches). I want to sample from it repeatedly, fit a linear model (using lm) to each sample, and save each result in a list.
library(UsingR)
sampleLm <- vector(100,mode="list")
for(i in 1:100) {
sampleGalton <- galton[sample(1:length(galton$child),size=50,replace=F),]
sampleLm[[i]] <- lm(sampleGalton$child ~ sampleGalton$parent)
}
I tried this:
sampleLm <- vector(100,mode="list")
sapply(samples, function(x) {
sampleGalton <- galton[sample(1:length(galton$child),size=50,replace=F),]
x <- lm(sampleGalton$child ~ sampleGalton$parent)
})
The data are the Galton heights of children given their parents' heights.
You can get them from the UsingR package, which provides galton, but really it could be any regular data frame.
However, while the code executes properly, the sampleLm vector isn't updated and contains only NULL. I get the impression this is normal because of the "no side effect" rule I found in the R documentation.
There must be a way to reformulate this so the for is replaced with apply. The question is how?
The easiest way here is replicate:
sampleLm <- replicate(100, lm(child ~ parent, data = galton,
subset = sample(seq(nrow(galton)), size = 50)),
simplify = FALSE)
You don't need to preallocate sampleLm when using the *apply family. You just need to write the function so that it returns the result of interest, and then store the final result in a variable.
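# samples was never defined, so iterate over 1:100 instead;
# lapply() always returns a list, whereas sapply() may simplify the fits into a matrix
sampleLm <- lapply(1:100, function(i) {
  sampleGalton <- galton[sample(1:length(galton$child),size=50,replace=F),]
  lm(sampleGalton$child ~ sampleGalton$parent)
})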
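To make the contrast concrete, here is a minimal sketch that runs without UsingR. It assumes the built-in mtcars data as a stand-in for galton (with mpg and wt in place of child and parent) and a sample size of 20; those choices are illustrative only.

```r
# Sketch: mtcars stands in for galton (assumption, not the original data)
set.seed(1)  # only to make the resampling reproducible

# for-loop version: results are stored by assignment inside the loop
fits_loop <- vector("list", 100)
for (i in seq_along(fits_loop)) {
  rows <- sample(nrow(mtcars), size = 20)
  fits_loop[[i]] <- lm(mpg ~ wt, data = mtcars[rows, ])
}

# lapply version: each call returns a model and lapply collects them,
# so nothing has to be modified from inside the function
fits_apply <- lapply(seq_len(100), function(i) {
  rows <- sample(nrow(mtcars), size = 20)
  lm(mpg ~ wt, data = mtcars[rows, ])
})

# pull one number out of every fit, here the slope for wt
slopes <- vapply(fits_apply, function(fit) unname(coef(fit)["wt"]), numeric(1))
length(slopes)  # 100
```

Assigning to x inside the function, as in the question, only changes a local copy, which is why the preallocated list stayed empty.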
Context: I'd like to save the results of a Likelihood ratio test for a multinomial logistic regression in several dynamic variables, but I'm not sure how I could do that. This is what I've been trying:
library(lmtest)
indels = c("C.T","A.G","G.A","G.C","T.C","C.A","G.T","A.C","C.G","A.del","TAT.del","TCTGGTTTT.del","TACATG.del","GATTTC.del")
my_list = list()
for (i in 1:length(indels)) {
assign(paste0("lrtest_results_",indels[i]), my_list[[i]]) = lrtest(multinom_model_completo, indels[i])
}
I was basically trying to save each variable (with the name lrtest_results_ + the dynamic part of the variable name which depends on the vector indels) in a list using the assign method and paste0, but it doesn't seem to be working. Any help is very welcome!
The best way is to lapply the test function to each element of the vector indels and assign the names after.
my_list <- lapply(indels, \(x) lrtest(multinom_model_completo, x))
names(my_list) <- paste0("lrtest_results_", indels)
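As a self-contained illustration of the same pattern, the sketch below swaps in a hypothetical toy_test() for the lrtest() call (the fitted model is not available here) and shows how the named list replaces the dynamically named variables.

```r
# toy_test() is a hypothetical stand-in for lrtest(multinom_model_completo, x)
toy_test <- function(term) list(term = term, n_chars = nchar(term))

term_names <- c("C.T", "A.G", "TAT.del")

# one result per term, collected in a list
results <- lapply(term_names, toy_test)
names(results) <- paste0("lrtest_results_", term_names)

names(results)
# look up a single result by its constructed name
results[["lrtest_results_TAT.del"]]$n_chars  # 7
```

Each result is then reached with results[["lrtest_results_..."]] instead of get(), and the whole collection can be passed to lapply again for further processing.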
I am new to R and am working on writing some cool functions while I learn statistics in parallel. I'm trying to make a function that will take a numeric vector, perform the "root mean squared" operations and then have the output return essentially same vector with the possible outliers removed.
For example, if the vector is c(2,4,9,10,100) the resulting RMS would be about 37.
Therefore, I want the output to return the same vector with the possible outlier (in this case, 100) removed from the dataset. So the result would be 2, 4, 9, 10
I put my code below but the output isn't working. I tried it 2 different ways. Everything up to the line that says RMS final works. But below that it does not.
How can I modify this function so that it does what I want? Also, as a bonus, and this might be asking a lot but based on my coding below, any tips for a newbie on making functions would be something I'd be grateful for as well. Thanks so much!
RMS_x <- c(2,4,9,10,100)
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
for(i in length(RMS_x)){
if(abs(RMS_x[i]) > RMS_FINAL){
output <- RMS_x[i]}
else {NULL} }
return(output)
}
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
#output <- ifelse(abs(RMS_x) > RMS_FINAL,RMS_x, NULL)
return(RMS_FINAL)
}
Try the following as the first lines of the RMS function.
RMS <- function(RMS_x) {
# pass the data RMS_x here, not the function RMS itself
bp <- boxplot(RMS_x, plot = FALSE)
RMS_x <- RMS_x[!(RMS_x %in% bp$out)]
...
Now, you have RMS_x sans the outliers.
The boxplot function has a way of determining the outliers. Here, I am using that to remove them.
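A standalone sketch of the same idea, assuming you only want the filtered vector back. It uses boxplot.stats(), the helper that computes the statistics for boxplot(); the function name drop_boxplot_outliers is made up for this example.

```r
# drop_boxplot_outliers() is a hypothetical helper name
drop_boxplot_outliers <- function(x) {
  # values beyond 1.5 * IQR from the hinges are reported in $out
  out <- boxplot.stats(x)$out
  x[!(x %in% out)]
}

drop_boxplot_outliers(c(2, 4, 9, 10, 100))  # 2 4 9 10
```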
Since you are asking more specifically about R and R functions, I'll focus my response on that. I'll point out a couple of errors and then provide a few alternative solutions.
Your first function isn't producing the output you want for two reasons:
1. The logic instructs the function to return a single value rather than a vector. If you want to build up a vector inside your for loop (one without the outlier), make sure to initialize it before the loop: output <- vector() (note that my solution below does not require this). Also, the value it returns is just an element of RMS_x that is greater than the RMS rather than an identified outlier, in case that is not what you wanted.
2. There's an error or typo in your for loop argument. It's minor, but it means your for loop doesn't loop at all, which is the opposite of what you intended. A for loop needs a vector to iterate over, so the argument should be: for(i in 1:length(RMS_x))
In your code the loop jumps straight to i = 5 because that is the length of your vector (length(RMS_x) = 5). Given that the values in RMS_x were already in ascending order, your code happens to give the "right" answer, but only because of how you initially loaded the vector. This may have been a typo in your question, and it's a difference of only two characters, but it totally changes what the function looks for.
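The difference between the two loop arguments can be checked directly. This small sketch records which indices each loop visits and then shows the same comparison written without a loop; the variable names are illustrative.

```r
x <- c(2, 4, 9, 10, 100)

# length(x) is the single number 5, so the body runs once, with i = 5
visited_once <- integer(0)
for (i in length(x)) visited_once <- c(visited_once, i)
visited_once  # 5

# seq_along(x) is 1 2 3 4 5, so every element is visited
visited_all <- integer(0)
for (i in seq_along(x)) visited_all <- c(visited_all, i)
visited_all   # 1 2 3 4 5

# the comparison from the question, vectorised: no loop needed
rms <- sqrt(mean((x - mean(x))^2))
x[abs(x) > rms]  # 100
```

seq_along(x) is also safer than 1:length(x), because it yields an empty sequence when x has length zero.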
Solution:
To get what you are trying to accomplish, you need two functions: (1) one that defines what counts as an outlier in your data set and (2) one that strips out the outliers and calculates the RMS. From there, either keep the functions independent or nest them to pass variables (this also ties in with your bonus request, since it shows several ways of writing functions).
Function to identify outliers:
outlrs <- function(vec){
Q1 <- summary(vec)["1st Qu."]
Q3 <- summary(vec)["3rd Qu."]
# defining outliers can get complicated depending on your sample data but
# your data set is super simple so we'll keep it that way
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5*(IQR)
upper_bound <- Q3 + 1.5*(IQR)
bounds <- c(lower_bound, upper_bound)
# assign() creates an actual object called non_outlier_range in the global
# environment that you can access directly; it has to come before return(),
# because nothing after return() is executed
assign("non_outlier_range", bounds, envir = globalenv())
# return() hands the result back to the caller, so it is either printed to
# the console or stored in the variable you assign it to
return(bounds)
}
Now moving on to the second function, a few options here:
First Way: Input bounds argument into RMS_func()
RMS_func <- function(dat, bounds){
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call function from approach 1 - note that here the assign() in the
# definition of outlrs() would be required to refer to non_outlier_range:
RMS_func(dat = RMS_x, bounds = non_outlier_range)
Second Way: Call outlrs() inside the second function
RMS_func <- function(dat){
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call RMS_func - here the assign() in outlrs() is not needed because the
# output exists within the function's temporary environment and is passed
# to RMS_func
RMS_func(dat = RMS_x)
Third Way: Nest outlrs() definition within the RMS_Func - in this case you only need one nested function to accomplish your task
RMS_Func <- function(dat){
outlrs <- function(vec){
Q1 <- summary(vec)["1st Qu."]
Q3 <- summary(vec)["3rd Qu."]
# equivalent alternative using quantile():
#Q1 <- quantile(vec)["25%"]
#Q3 <- quantile(vec)["75%"]
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5*(IQR)
upper_bound <- Q3 + 1.5*(IQR)
bounds <- c(lower_bound, upper_bound)
return(bounds)
}
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
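For comparison, here is a more compact sketch of the same logic that returns both the filtered vector (what the question asked for) and the RMS. The function name rms_without_outliers and the parameter k are assumptions introduced for this example; the 1.5 * IQR rule is the one used above.

```r
# rms_without_outliers() and k are hypothetical names for illustration
rms_without_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartile
  fence <- k * (q[2] - q[1])                      # k times the IQR
  kept <- x[x >= q[1] - fence & x <= q[2] + fence]
  list(kept = kept, rms = sqrt(mean((kept - mean(kept))^2)))
}

res <- rms_without_outliers(c(2, 4, 9, 10, 100))
res$kept  # 2 4 9 10
res$rms
```

Returning a list keeps everything inside the function, so no assign() into the global environment is needed.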
P.S. Wrote this pretty quickly - will likely re-test and edit later. Hopefully for now this helps.
I want to generate a dataframe with summary statistics (AUC, Gini, RMSE, etc.) from validation of multiple models on multiple datasets.
I've got x number of models (classifiers - gbm, xgb, rf, etc. - all built in caret package) that are enclosed in ListOfModels, and y number of datasets (dataframes with identical variables over several data points) that are enclosed in ListOfDatasets.
I can create a short version of the desired dataframe by running a custom function fun_modelStats (that extracts model stats using model and dataset as arguments) inside ldply - but can do so only either over a ListOfModels and just one specific dataset or over a ListOfDatasets and just one specific model, like this:
modelStats_by_model <- ldply(ListOfModels, function(model) {
modelStats <- fun_modelStats(model, B97_2012SU_2013)
})
and
modelStats_by_dataset <- ldply(ListOfDatasets, function(dataset) {
modelStats <- fun_modelStats(gbmFit1, dataset)
})
The resulting dataframe with models' stats has either x or y number of rows, and I can't get my head around the way of building this dataframe with x*y rows, i.e. stats from all models validated on all datasets.
I did experiment with Map and mapply, and for loop, but to no avail.
Using Map I get weird incorrect output:
modelStats_all <- Map(fun_modelStats, ListOfModels, ListOfDatasets)
The for loop does generate the desired output with this code below, but only as plain text in console whereas I need it as a dataframe.
for(i in names(ListOfModels)) {
for(j in names(ListOfDatasets)) {
modelStats <- fun_modelStats(ListOfModels[[i]], ListOfDatasets[[j]])
print(modelStats)
}
}
Many thanks in advance for help!
P.S. Further searching on SO (for example, the post "How to write a function that takes a model as an argument in R") suggests that aggregate.formula, aggregate.data.frame, or rbind.data.frame could help, but I can't figure out how.
Here is the solution, in case anyone faces a similar problem:
fun_multiModelStats <- function(ListOfModels, ListOfDatasets) {
multiModelStats <- data.frame()
for(i in names(ListOfModels)) {
for(j in names(ListOfDatasets)) {
model <- ListOfModels[[i]]
dataset <- ListOfDatasets[[j]]
modelStats <- fun_modelStats(model, dataset)
modelName <- names(ListOfModels[i])
datasetName <- names(ListOfDatasets[j])
modelStats <- cbind(modelName, datasetName, modelStats)
multiModelStats <- rbind(multiModelStats, modelStats)
}
}
return(multiModelStats)
}
Yet, I would like to find a solution without double for loops but rather with something from the apply family of functions.
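One possible loop-free formulation is sketched below: build every model/dataset combination with expand.grid(), apply the stats function to each row of that grid, and bind the pieces together. The toy models, the two mtcars halves, and toy_stats() are stand-ins for ListOfModels, ListOfDatasets, and fun_modelStats, which are not available here.

```r
# Stand-ins (assumptions): two lm fits and two halves of mtcars
models <- list(small = lm(mpg ~ wt, data = mtcars),
               large = lm(mpg ~ wt + hp, data = mtcars))
datasets <- list(first = mtcars[1:16, ], second = mtcars[17:32, ])

# toy_stats() is a hypothetical replacement for fun_modelStats()
toy_stats <- function(model, dataset) {
  pred <- predict(model, newdata = dataset)
  data.frame(RMSE = sqrt(mean((dataset$mpg - pred)^2)))
}

# every combination of model name and dataset name: x * y rows
grid <- expand.grid(modelName = names(models),
                    datasetName = names(datasets),
                    stringsAsFactors = FALSE)

# one small data frame per combination
rows <- lapply(seq_len(nrow(grid)), function(k) {
  m <- grid$modelName[k]
  d <- grid$datasetName[k]
  data.frame(modelName = m, datasetName = d,
             toy_stats(models[[m]], datasets[[d]]))
})

all_stats <- do.call(rbind, rows)
nrow(all_stats)  # 4
```

Map(fun_modelStats, ListOfModels, ListOfDatasets) pairs the first model with the first dataset, the second with the second, and so on, which is why it does not produce all x*y combinations on its own.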
What happens if you modify your for loop as follows:
modelStats <- data.frame()
for(i in names(ListOfModels)) {
for(j in names(ListOfDatasets)) {
modelStats[i,j] <- fun_modelStats(ListOfModels[[i]], ListOfDatasets[[j]])
print(modelStats)
}
}
I'm working with a set of results from the INLA package in R. These results are stored in objects with meaningful names, so I have, for instance, model_a, model_b, ... in the current environment. For each of these models I'd like to do several processing tasks, including extracting the data into a separate data frame, which can then be merged with spatial data to create a map, etc.
Turning to a simpler, reproducible example, let's assume two results:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
model_a <- lm(weight ~ group)
model_b <- lm(weight ~ group - 1)
I can handle the steps for an individual model, for instance:
model_a_sum <- data.frame(var = character(1), model_a_value = numeric(1))
model_a_sum$var <- "Intercept"
model_a_sum$model_a_value <- model_a$coefficients[1]
png("model_a_plot.png")
plot(model_a, las = 1)
dev.off()
Now, I'd like to reuse this code for each of the models, essentially constructing the correct names depending on the model I'm using. I'm more of a Stata than an R person, and in Stata it would be trivial to take the stub of a name (model_a, or even just a) and construct a foreach loop that implements all the steps, adapting the names for each model.
In R, for loops have been bashed all over the internet so I presume I shouldn't attempt to venture into the territory of:
models <- c("model_a", "model_b", "model_c")
for (model in models) {
...
}
What would be a better solution for such a scenario?
Update 1: Since comments suggested that for might indeed be an option, I'm trying to put all the tasks into a loop. So far I have managed to name the data frame correctly using assign and to get the correct data plotted under the correct name using get:
models <- c("model_a", "model_b")
for (i in 1:length(models)) {
# create df
name.df <- paste0(models[i], "_sum")
assign(name.df, data.frame(var = character(1), value = numeric(1)))
# replace variables of df with results from the model
# plot and save
name.plot <- paste0(models[i], "_plot.png")
png(name.plot)
plot(get(models[i]), which = 1, las = 1)
dev.off()
}
Is this a reasonable approach? Are there any better solutions?
One thing I cannot solve is naming the second variable of the df according to the model (i.e. model_a_value instead of the current value). Any ideas how to solve that?
Some general tips/advice:
As mentioned in comments, don't believe much of the negativity about for loops in R. The issue is not that they are bad, but more that they are correlated with some bad code patterns that are inefficient.
More important is to use the right data organization. Don't keep each model in a separate object! Put them in a list:
l <- vector("list",3)
l[[1]] <- lm(...)
l[[2]] <- lm(...)
l[[3]] <- lm(...)
Then name the list:
names(l) <- paste0("model_",letters[1:3])
Now you can loop over the list without resorting to awkward and unnecessary tools like assign and get, and more importantly when you're ready to step up from for loops to tools like lapply you're all good to go.
I would use similar strategies for your data frames as well.
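A minimal sketch of that organization, using the two example models from the question. The summary data frames also live in a named list, and the value column is renamed per model; the object names models and sums are assumptions made for this example.

```r
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# keep the models in one named list instead of separate objects
models <- list(model_a = lm(weight ~ group),
               model_b = lm(weight ~ group - 1))

# one summary data frame per model, with a model-specific column name
sums <- lapply(names(models), function(nm) {
  out <- data.frame(var = "Intercept",
                    value = unname(coef(models[[nm]])[1]))
  names(out)[2] <- paste0(nm, "_value")
  out
})
names(sums) <- paste0(names(models), "_sum")

sums$model_a_sum
```

The plotting step can iterate over names(models) in the same way, building each file name with paste0(nm, "_plot.png").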
See @joran's answer. This one shows the use of assign and get, which should be avoided when possible.
I would go this way for the for loop:
for (model in models) {
m <- get(model) # to get the real model object
# create the model_?_sum dataframe
assign(paste0(model,"_sum"), data.frame(var = "Intercept", value = m$coefficients[1]))
# rename the value column (per comment; thanks to @Franck in chat for the guidance)
assign(paste0(model,"_sum"), setNames(get(paste0(model,"_sum")), c("var",paste0(model,"_value"))))
# paste0 to create the text
png(paste0(model,"_plot.png"))
plot(m, las = 1) # use the m object to graph
dev.off()
}
which gives the two images and this:
> model_a_sum
                  var model_a_value
(Intercept) Intercept         5.032
> model_b_sum
               var model_b_value
groupCtl Intercept         5.032
I'm unsure why you want this data frame, but I hope this gives some clues on how to construct variable names and how to access them.
I have a large data set and I want to perform several functions at once and extract for each a parameter.
The test dataset:
testdf <- data.frame(vy = rnorm(60), vx = rnorm(60) , gvar = rep(c("a","b"), each=30))
I first defined a list of functions:
require(fBasics)
normfuns <- list(jarqueberaTest=jarqueberaTest, shapiroTest=shapiroTest, lillieTest=lillieTest)
Then a function to perform the tests by the grouping variable
mynormtest <- function(d) {
norm_test <- res_reg <- list()
for (i in c("a","b")){
res_reg[[i]] <- residuals(lm(vy~vx, data=d[d$gvar==i,]))
norm_test[[i]] <- lapply(normfuns, function(f) f(res_reg[[i]]))
}
return(norm_test)
}
mynormtest(testdf)
I obtain a list of test summaries for each grouping variable.
However, I am interested in getting only the parameter "STATISTIC" and I did not manage to find out how to extract it.
You can obtain the value stored as "STATISTIC" in the output of the various tests with
res_list <- mynormtest(testdf)
res_list$a$shapiroTest@test$statistic
res_list$a$jarqueberaTest@test$statistic
res_list$a$lillieTest@test$statistic
And correspondingly for set b:
res_list$b$shapiroTest@test$statistic
res_list$b$jarqueberaTest@test$statistic
res_list$b$lillieTest@test$statistic
Hope this helps.
Concerning your function fgetparam I think that it is a nice starting point. Here's my suggestion with a few minor modifications:
getparams2 <- function(myp) {
m <- matrix(NA, nrow=length(myp), ncol=3)
for (i in seq_along(myp)){
m[i,] <- sapply(1:3,function(x) myp[[i]][[x]]@test$statistic)}
return(m)
}
This function is a minor generalization in the sense that it allows for an arbitrary number of groups, whereas in your case this was fixed to two, a and b. The code can certainly be shortened further, but it might then also become somewhat more cryptic. I believe that when developing code it helps to strike a compromise between efficiency and compactness on the one hand and readability on the other.
Edit
As pointed out by @akrun and @Roland, the function getparams2() can be written in a much shorter and more elegant form. One possibility is
getparams2 <- function(myp) {
matrix(unname(rapply(myp, function(x) x@test$statistic)),ncol=3)}
Another great alternative is
getparams2 <- function(myp){t(sapply(myp, sapply, function(x) x@test$statistic))}
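The access pattern can be tried without fBasics. The sketch below defines a hypothetical S4 class, toyTest, that only mimics the shape used above (a slot named test holding a list with a statistic element); the numbers are made up and are not real test results.

```r
library(methods)

# toyTest is a hypothetical class standing in for the fBasics test objects
setClass("toyTest", representation(test = "list"))
make_test <- function(stat) new("toyTest", test = list(statistic = stat))

# same nesting as the output of mynormtest(): group -> test -> S4 object
res <- list(
  a = list(jb = make_test(1.2), sw = make_test(0.97), lf = make_test(0.08)),
  b = list(jb = make_test(0.4), sw = make_test(0.99), lf = make_test(0.05))
)

# slots of an S4 object are reached with @, list elements with $
res$a$sw@test$statistic  # 0.97

# one row per group, one column per test
stat_matrix <- t(sapply(res, sapply, function(x) x@test$statistic))
dim(stat_matrix)  # 2 3
```

The inner sapply() walks over the tests within a group and the outer one over the groups, so the result keeps the group names as row names and the test names as column names.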