I am trying to create a function that passes a parameter in as the dependent variable with the independent variables staying the same.
I have tried to use {{}} but see the problem as something like the below if select contains was possible.
test_func <- function(dataframe, dependent){
model <- tidy(lm({{ dependent }} ~ . - select(contains("x")), data = dataframe))
return(model)
}
test_func(datasets::anscombe, x1)
The function should pass as function(dataframe, dependent) with a single model.
Use reformulate().
f <- function(d, y) lm(reformulate(names(d)[grep("x", names(d))], response=y), data=d)
f(datasets::anscombe, "y1")
# Call:
# lm(formula = reformulate(names(d)[grep("x", names(d))], response = y),
# data = d)
#
# Coefficients:
# (Intercept) x1 x2 x3 x4
# 4.33291 0.45073 NA NA -0.09873
Related
Suppose I have a data frame in the environment, mydata, with three columns, A, B, C.
mydata = data.frame(A=c(1,2,3),
B=c(4,5,6),
C=c(7,8,9))
I can create a linear model with
lm(C ~ A, data=mydata)
I want a function to generalize this, to regress B or C on A, given just the name of the column, i.e.,
f = function(x){
lm(x ~ A, data=mydata)
}
f(B)
f(C)
or
g = function(x){
lm(mydata$x ~ mydata$A)
}
g(B)
g(C)
These solutions don't work. I know there is something wrong with the evaluation, and I have tried permutations of quo() and enquo() and !!, but no success.
This is a simplified example, but the idea is, when I have dozens of similar models to build, each fairly complicated, with only one variable changing, I want to do so without repeating the entire formula each time.
If we want to pass unquoted column name, and option is {{}} from tidyverse. With select, it can take both string and unquoted
library(dplyr)
printcol2 <- function(data, x) {
data %>%
select({{x}})
}
printcol2(mydata, A)
# A
#1 1
#2 2
#3 3
printcol2(mydata, 'A')
# A
#1 1
#2 2
#3 3
If the OP wanted to pass unquoted column name to be passed in lm
f1 <- function(x){
rsp <- deparse(substitute(x))
fmla <- reformulate("A", response = rsp)
out <- lm(fmla, data=mydata)
out$call <- as.symbol(paste0("lm(", deparse(fmla), ", data = mydata)"))
out
}
f1(B)
#Call:
#lm(B ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 3 1
f1(C)
#Call:
#lm(C ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 6 1
Maybe you are looking for deparse(substitute(.)). It accepts arguments quoted or not quoted.
f = function(x, data = mydata){
y <- deparse(substitute(x))
fmla <- paste(y, 'Species', sep = '~')
lm(as.formula(fmla), data = data)
}
mydata <- iris
f(Sepal.Length)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 5.006 0.930 1.582
f(Petal.Width)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 0.246 1.080 1.780
I think generally, you might be looking for:
printcol <- function(x){
print(x)
}
printcol(mydata$A)
This doesn't involve any fancy evaluation, you just need to specify the variable you'd like to subset in your function call.
This gives us:
[1] 1 2 3
Note that you're only printing the vector A, and not actually subsetting column A from mydata.
I'm a beginner and I'm trying to write a function that fits a model (univariate) one variable at a time. I want to able to enter all the dependent variables when calling the function and then make the function to fit one variable at a time. Something like:
f <-function(y, x1,x2,x3) {(as.formula(print()))...}
I tried to make a list of x1, x2, x3 inside the function but it didn't work.
Maybe someone here could help me with this or point me in the right direction.
Here's something that might get you going. You can put your x variables inside a list, and then fit once for each element inside the list.
fitVars <- function(y, xList) {
lapply(xList, function(x) lm(y ~ x))
}
y <- 1:10
set.seed(10)
xVals <- list(rnorm(10), rnorm(10), rnorm(10))
fitVars(y, xVals)
This returns one fit result for each element of xVals:
[[1]]
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
4.984 -1.051
[[2]]
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
5.986 -1.315
[[3]]
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
7.584 2.282
Another option is to use the ... holder to use an arbitrary number of arguments:
fitVars <- function(y, ...) {
xList <- list(...)
lapply(xList, function(x) lm(y ~ x))
}
set.seed(10)
fitVars(y, rnorm(10), rnorm(10), rnorm(10))
This gives the same result as above.
I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims<-numbersimsdesired
#Set up place to store data
saved.se<-matrix(NA,sims,numberofcolumnsdesired)
y<-matrix(NA,realdata.rownumber)
x1<-matrix(NA,realdata.rownumber)
x2<-matrix(NA,realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
fake.data<-sample(1:nrow(data5),nrow(data5),replace=TRUE)
#Define variables for GLM using fake data
y<-realdata$y[fake.data]
x1<-realdata$x1[fake.data]
x2<-realdata$x2[fake.data]
#Run GLM on fake data, extract SEs, save SE into matrix
glm.output<-glm(y ~ x1 + x2, family = "poisson", data = fake.data)
saved.se[sim,]<-summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (suppose instead of x1 and x2, we have x1...x10) the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SEs.
There isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)
glm1 <- glm(counts ~ x1 + x2, family = poisson(), data=DF)
summary(glm1)$coef
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
#x1 0.04838210 0.04370521 1.107010 2.682897e-01
#x2 0.09418791 0.04446747 2.118131 3.416400e-02
library(boot)
intercept.se <- function(d, i) {
glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data=d[i,])
summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R=999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace=TRUE), nrow=n)
boot.values <- apply(ind, 1, function(...) {
i <- c(...)
intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377
I'm trying to create a series of models based on subsets of different categories in my data. Instead of creating a bunch of individual model objects, I'm using lapply() to create a list of models based on subsets of every level of my category factor, like so:
test.data <- data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), category=rep(c("A", "B"), 2))
run.individual.models <- function(x) {
lm(y ~ x1 + x2, data=test.data, subset=(category==x))
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
# [[1]]
# Call:
# lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
# x))
# Coefficients:
# (Intercept) x1 x2
# 0.10852 -0.09329 0.11365
# ....
This works fantastically, except the model call shows subset = (category == x) instead of category == "A", etc. This makes it more difficult to use both for diagnostic purposes (it's hard to remember which model in the list corresponds to which category) and for functions like predict().
Is there a way to substitute the actual character value of x into the lm() call so that the model doesn't use the raw x in the call?
Along the lines of Explicit formula used in linear regression
Use bquote to construct the call
run.individual.models <- function(x) {
lmc <- bquote(lm(y ~ x1 + x2, data=test.data, subset=(category==.(x))))
eval(lmc)
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"A"))
Coefficients:
(Intercept) x1 x2
-0.08434 0.05881 0.07695
[[2]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"B"))
Coefficients:
(Intercept) x1 x2
0.1251 -0.1854 -0.1609
I'm looking for suggestions on how to deal with NA's in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the independent/explanatory variable in question from the model but I am looping through multiple regions and would prefer not to have a different functional forms for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <-melt(df, id=c("time"))
p <- ggplot(df.melt, aes(x=time, y=value)) +
geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient for the explanatory reflective of the lack of explanatory power (i.e. either zero or NA) from the variable in question?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA<-function(x){
if(all(is.na(x))){
rep(0,length(x))
} else x
}
lm(y~x1+x2+x3, data= data.frame(lapply(df,replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace the variables which contain only NA's with variable which contains only 0's. you get the coefficient value NA, but all the relevant parts of the model fits are same (expect qr decomposition, but if information about that is needed, it can be easily modified). Note that component summary(fit)$alias (see ?alias) might be useful.
This seems to relate your other question: Replace lm coefficients in [r]
You won't be able to include a column with all NA values. It does strange things to model.matrix
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programatically create the model formula based on the data.
Something like...
make_formula <- function(variables, data, response = 'y'){
if(missing(data)){stop('data not specified')}
using <- Filter(variables,f= function(i) !all(is.na(data[[i]])))
deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there is no environment issues from creating the formula within the function. lm can happily take a character string which is a valid formula.