Regression in R using a function - r

I am trying to smooth out my data for each variable in the data frame. Lets say it looks like this:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
data$x <- 1:nrow(data)
I then specify my x and y variables as:
x <- data$x
y <- data$v1
I can fit the predicted line I want (and I am happy with the process):
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_v1 <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
I therefore end up with an appropriately titled new variable (the predicted values for v1). I now want to do this for each of the other 100 variables in my data frame (v2 and v3 in this example).
I tried using a function to do this but can't get it to work as intended. Here is my attempt:
myfunction <- function(xaxis,yaxis){
# Specfiy my "y" and "x"
x <- data$xaxis
y <- data$yaxis
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_yaxis <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
}
myfunction(x,v1)
myfunction(x,v2)
myfunction(x,v3)
Not only does the function not work as intended, I would like to avoid calling the function 100 times for each variable and instead somehow loop through it.
This is really simple to do in SAS using macros but I am struggling to get this to work in R.

You can model your data directly with the lm() function:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
x <- 1:nrow(data)
# initialize a list to store the models
models = vector("list", length = (ncol(data)))
# create a loop running over the columns of data
for (i in 1:(ncol(data))){
models[[i]] = lm(data[,i] ~ poly(x,2, raw = TRUE))}
You can also use lapply instead of the for-loop, as stated in the comments.
Use predict() to get the values of the models:
smoothed_v1 = predict(model[[1]], newdata=data.frame(x = x))
Edit:
Regarding your comment - you can store the new values in data with:
for (i in (length(models):1)){
data <- cbind(predict(models[[i]], newdata=data.frame(x = x)), data)
# set the name for the new column
names(data)[1] = paste("pred_v",i, sep ="")}

Related

How do I make variable weights dynamic in lmer for loop

I want to be able to input the variable name that I'll be using in the "weights" option in the lmer function. So then I can change the dataset, and cycle through the "weights" and pull the correct variable.
I want to pull the correct column for weights within the for loop.
So for y, the equation would be:
lmer(y~x+(1|study), weights = weight.var)
And y1:
lmer(y1~x+(1|study),weights = weight.var1)
So I named the weighting variables (weight.opt), then want to use them in the formula within the for loop. I can use "as.formula" to get the formula working and connected to the dataset, but I'm not sure how to do something similar with the weights.
x <- rnorm(300,0,1)
y <- x*rnorm(300,2,0.5)
y1 <- x*rnorm(300,0.1,0.1)
study <- rep(c("a","b","c"),each = 100)
weight.var <- rep(c(0.5,2,4),each = 100)
weight.var1 <- rep(c(0.1,.2,.15),each = 100)
library(lme4)
dataset <- data.frame(x,y,y1,study,weight.var,weight.var1)
resp1 <- c("y","y1")
weight.opt <- c("weight.var","weight.var1")
for(i in 1:2){
lmer(as.formula(paste(resp1[i],"~x+(1|study)")),weights = weight.opt[i],data = dataset)
}
This seems to work fine:
res_list <- list()
for(i in 1:2){
res_list[[i]] <- lmer(as.formula(paste(resp1[i],"~x+(1|study)")),
weights = dataset[[weight.opt[i]]],data = dataset)
}

Assigning a variable to pasted name of column in R

I have a few data frames with the names:
Meanplots1,
Meanplots2,
Meanplots3 etc.
I am trying to write a for loop to do a series of equations on each data frame.
I am attempting to use the paste0 function.
What I want to happen is for x to be a column of each data set. So the code should work like this line:
x <- Meanplots1$PAR
However, since I want to put this in a for loop I want to format it like this:
for (i in 1:3){
x <- paste0("Meanplots",i,"$PAR")
Dmodel <- nls(y ~ ((a*x)/(b + x )) - c, data = dat, start = list(a=a,b=b,c=c))
}
What this does is it assigns x to the list "Meanplots1$PAR" not the actual column. Any idea on how to fix this?
We can get all the data.frame in a list with mget
lst1 <- mget(ls(pattern = '^MeanPlots\\d+$'))
then loop over the list with lapply and apply the model
DmodelLst <- lapply(lst1, function(dat) nls(y ~ ((a* PAR)/(b + PAR )) - c,
data = dat, start = list(a=a,b=b,c=c)))
Replace 'x' with the column name 'PAR'.
In the OP's loop, create a NULL list to store the output ('Outlst'), get the value of the object from paste0, then apply the formula with the unquoted column name i.e. 'PAR'
Outlst <- vector("list", 3)
ndat <- data.frame(x = seq(0,2000,100))
for(i in 1:3) {
dat <- get(paste0("MeanPlots", i))
modeltmp <- nls(y ~ ((a*PAR)/(b + PAR )) - c,
data = dat, start = list(a=a,b=b,c=c))
MD <- data.frame(predict(modeltmp, newdata = ndat))
MD[,2] <- ndat$x
names(MD) <- c("Photo","PARi")
Outlst[[i]] <- MD
}
Now, we extract the output of each list element
Outlst[[1]]
Outlst[[2]]
instead of creating multiple objects in the global environment

Reqsubsets results differ with coef() for model with linear dependencies

while using Regsubsets from package leaps on data with linear dependencies, I found that results given by coef() and by summary()$which differs. It seems that, when linear dependencies are found, reordering changes position of coefficients and coef() returns wrong values.
I use mtcars just to "simulate" the problem I had with other data. In first example there is no issue of lin. dependencies and best given model by BIC is mpg~wt+cyl and both coef(),summary()$which gives the same result. In second example I add dummy variable so there is possibility of perfect multicollinearity, but variables in this order (dummy in last column) don't cause the problem. In last example after changing order of variables in dataset, the problem finally appears and coef(),summary()$which gives different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
The order of vars by summary.regsubsets and regsubsets are different. The generic function coef() of regsubsets calls those two in one function, and the results are in mess if you are trying to force.in or using formula with fixed order. Changing some lines in the coef() function might help. Try codes below, see if it works!
coef.regsubsets <- function (object, id, vcov = FALSE, ...)
{
s <- summary(object)
invars <- s$which[id, , drop = FALSE]
betas <- vector("list", length(id))
for (i in 1:length(id)) {
# added
var.name <- names(which(invars[i, ]))
thismodel <- which(object$xnames %in% var.name)
names(thismodel) <- var.name
# deleted
#thismodel <- which(invars[i, ])
qr <- .Fortran("REORDR", np = as.integer(object$np),
nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
d = as.double(object$d), rbar = as.double(object$rbar),
thetab = as.double(object$thetab), rss = as.double(object$rss),
tol = as.double(object$tol), list = as.integer(thismodel),
n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
tol = as.double(qr$tol), beta = numeric(length(thismodel)),
nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
names(beta) <- object$xnames[qr$vorder[1:qr$n]]
reorder <- order(qr$vorder[1:qr$n])
beta <- beta[reorder]
if (vcov) {
p <- length(thismodel)
R <- diag(qr$np)
R[row(R) > col(R)] <- qr$rbar
R <- t(R)
R <- sqrt(qr$d) * R
R <- R[1:p, 1:p, drop = FALSE]
R <- chol2inv(R)
dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
object$xnames[qr$vorder[1:p]])
V <- R * s$rss[id[i]]/(object$nn - p)
V <- V[reorder, reorder]
attr(beta, "vcov") <- V
}
betas[[i]] <- beta
}
if (length(id) == 1)
beta
else betas
}
Another solution that works for me is to randomize the order of the column(independent variables) in your dataset before running the regsubsets. The idea is that after reorder hopefully the highly correlated columns will be far apart from each other and will not trigger the reorder behavior in the regsubsets algorithm.

Obtain scaled predictor values when using `lm` in R

I'm using lm in R to do simple multilinear regression. Here's an example model:
m <- lm(formula = t ~ a + b + 0, data = df1)
where t, a and b are columns in df1. This model calculates 2 coefficients, let's call them a.coef and b.coef. If I then use this model to predict some other data, say in df2, I can get the predicted values like so:
predict(m, df2)
if I have the columns a and b in df2 as well. It essentially returns
df2$a * a.coef + df2$b * b.coef
What I'd like, however, are the columns df2$a * a.coef and df2$b * b.coef. R sums them and gives me the answer, but I'd like to see how the scaling affects these values.
Is there a convenient way to do this in R (esp in lm or predict.lm), or will I have to manually code this myself? I played with the terms argument in predict.lm, but I couldn't get anywhere.
Thanks for the help!
EDIT
I wrote this function:
scaled.fn <- function(dt, x, y, i) {
# dt is data.table
# x is dependent column (col name as str)
# y are predictor columns (col names as vector of str)
# i is name of column to multiply, as str
dep = paste(y, collapse = " + ")
my.formula = paste(x, " ~ ", dep, sep = "")
m = lm(formula = my.formula, data = dt)
# column names in dt are named in y
return(dt[, get(i) * coef(m)[i]])
}
Try this:
sweep(df2, MARGIN = 2, coef(m), '*')
EDIT: more specific solution:
sweep(df2[,c("a","b")], MARGIN = 2, coef(m), '*')

List Indexing in R over a loop

I'm new to using lists in R and am trying to run a loop over various data frames that stores multiple models for each frame. I would like the models that correspond to a given data frame within the first index of the list; e.g. [[i]][1], [[i]][2]. The following example overwrites the list:
f1 <- data.frame(x = seq(1:6), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = seq(6,11), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1,f2)
fit <- list()
for(i in 1:length(data.frames)){
fit[[i]] <- lm(y ~ x, data = data.frames[[i]])
fit[[i]] <- lm(y ~ x + z, data = data.frames[[i]])
}
Any idea how to set up the list or the indexing in the loop such that it generates an output that has the two models for the first frame referenced as [[1]][1] and [[1]][2] and the second frame as [[2]][1] and [[2]][2]? Thanks for any and all help.
Calculate both models in a single lapply call applied to each part of the data.frames list:
lapply(data.frames, function(i) {
list(lm(y ~ x, data = i),
lm(y ~ x + z, data=i))
})

Resources