Add a string as a formula with defined function - r

I want to define a function that takes a string as a covariate term, inserts it at a specific position, and converts the result to a formula. I know my code is incorrect, but I do not know how to write it.
What I want is that when I set covars <- "+s(time,bs= 'cr',fx=TRUE,k=7)", the function adds covars to the formula like this: gam.model <- gam(cvd ~ pm10 +s(time,bs= 'cr',fx=TRUE,k=7), data = chicagoNMMAPS , family =poisson, na.rm=T)
library(dlnm) # use chicagoNMMAPS data
library(mgcv)
# define myfun
myfun <- function(covars){
covars <- covars
gam.model <- gam(cvd ~ pm10 + covars, data = chicagoNMMAPS , family =poisson, na.rm=T)
summary(gam.model)
}
myfun("+s(time,bs= 'cr',fx=TRUE,k=7)")
myfun should do this:
gam.model <- gam(cvd ~ pm10 + covars, data = chicagoNMMAPS , family =poisson, na.rm=T)

Is this what you are looking for? I am not sure, but try as.formula with paste0:
myfunc_formula <- function(covars) {
  return(as.formula(paste0('cvd ~ pm10 ', covars)))
}
The result can then be passed to gam: gam(myfunc_formula(covars), data = chicagoNMMAPS, family = poisson, na.rm = TRUE)
## In case someone wants the summary of the fitted gam model returned
myfunc_formula_v1 <- function(covars) {
  gam1 <- gam(as.formula(paste0('cvd ~ pm10 ', covars)), data = chicagoNMMAPS, family = poisson, na.rm = TRUE)
  return(summary(gam1))
}
We can also make it more flexible by adding parameters, such as the name of the target variable.
For example, another version could be:
myfunc_formula_v2 <- function(covars, target = 'cvd') {
  return(as.formula(paste0(target, ' ~ pm10 ', covars)))
}
Output:
> myfunc_formula(covars)
cvd ~ pm10 + s(time, bs = "cr", fx = TRUE, k = 7)
given covars = "+s(time,bs= 'cr',fx=TRUE,k=7)"
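To try the paste0 approach without the dlnm data, here is a minimal sketch on the built-in mtcars data set, with lm standing in for gam. The helper name build_formula and the mtcars columns are assumptions for illustration only; they are not part of the question.

```r
# Hypothetical helper: glue a term string onto a fixed base formula.
build_formula <- function(covars, target = "mpg", base = "wt") {
  # covars is expected to start with "+", as in the question
  as.formula(paste0(target, " ~ ", base, " ", covars))
}

form <- build_formula("+ poly(hp, 2)")
print(form)

# lm stands in for gam here; any fitting function that accepts a formula works
fit <- lm(form, data = mtcars)
names(coef(fit))
```

Because the formula is assembled from text, a malformed string only fails when as.formula parses it, so it can help to print the formula before fitting.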

paste0 works, but reformulate is marginally more elegant:
myfun <- function(covars) {
  form <- reformulate(c("pm10", covars), response = "cvd")
  gam.model <- gam(form, data = chicagoNMMAPS, family = poisson, na.rm = TRUE)
  summary(gam.model)
}
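A small sketch of what reformulate builds, using only base R so it runs without mgcv or the Chicago data. It assumes the smooth term is supplied without the leading "+", since reformulate inserts the separators itself. s() is never evaluated here; the formula is only constructed.

```r
# Term labels are plain strings; reformulate joins them with "+"
smooth_term <- "s(time, bs = 'cr', fx = TRUE, k = 7)"
form <- reformulate(c("pm10", smooth_term), response = "cvd")

print(form)
all.vars(form)   # variables referenced by the formula
```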

Related

Dummies not included in summary

I want to create a function which will perform panel regression with 3-level dummies included.
Consider a within model with time effects:
library(plm)
fit_panel_lr <- function(y, x) {
  x[, length(x) + 1] <- y
  # adding dummies
  mtx <- matrix(0, nrow = nrow(x), ncol = 3)
  mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
  colnames(mtx) <- paste0("dummy_", 1:3)
  # converting to pdata.frame and adding dummy variables
  x <- pdata.frame(x)
  x <- cbind(x, mtx)
  # performing panel regression
  varnames <- names(x)[3:(length(x))]
  varnames <- varnames[!(varnames == names(y))]
  form <- paste0(varnames, collapse = "+")
  x_copy <- data.frame(x)
  form <- as.formula(paste0(names(y), "~", form, "-1"))
  params <- list(
    formula = form, data = x_copy, model = "within",
    effect = "time"
  )
  pglm_env <- list2env(params, envir = new.env())
  model_plm <- do.call("plm", params, envir = pglm_env)
  model_plm
}
However, if I use this data:
data("EmplUK", package="plm")
dep_var <- EmplUK['capital']
df1 <- EmplUK[-6]
I get this output:
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
Why are the dummies included in the formula but missing from the coefficients? Is there a rational explanation, or did I do something wrong?
You do not see the dummies in the output because they are linearly dependent on the other data after the fixed-effects time transformation. They are dropped, so only what is estimable is estimated and reported.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep, in package plm) has more examples of linear dependence before and after the FE transformation.
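Why the dummies vanish can be illustrated in base R. The dummies in the question appear to be built from the second column, which is year in EmplUK, so they are constant within each time period, and the time-effects within transformation subtracts the per-period mean from every column. The sketch below uses a made-up toy panel (all names and values are hypothetical) and ave() as a stand-in for that demeaning step; it is not plm's actual implementation.

```r
# Toy panel: 3 firms observed over 6 years (hypothetical data)
toy <- expand.grid(firm = 1:3, year = 1977:1982)
toy$emp <- c(5, 7, 9, 6, 8, 9, 4, 7, 10, 5, 9, 11, 6, 8, 12, 7, 9, 13)

# One of the three dummies, derived from year as in the question
toy$dummy_1 <- as.numeric((toy$year - min(toy$year)) %% 3 == 0)

# Stand-in for the time-effects within transformation: subtract per-year means
demean_by_year <- function(v) v - ave(v, toy$year)

range(demean_by_year(toy$dummy_1))  # all zeros: nothing left to estimate
range(demean_by_year(toy$emp))      # still varies across firms within a year
```

A column that is identically zero after the transformation carries no information, which is why such terms are dropped from the coefficient table.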
A suggestion:
As for constructing dummy variables, I suggest using an R factor with three levels rather than constructing the dummy matrix yourself. Using a factor is typically more convenient and less error-prone. It is converted to binary dummies (treatment coding) by any typical estimation function that uses the model.frame/model.matrix framework.
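A minimal sketch of the factor-based alternative, using made-up years. The variable name period and the modulo rule (copied from the question's dummy construction) are assumptions for illustration. model.matrix shows the treatment-coded dummies that a formula-based estimation function would derive from the factor.

```r
# Hypothetical year values; in the question these come from the panel's year column
year <- 1977:1982

# One factor with three levels replaces the hand-built dummy matrix
period <- factor(1 + (year - min(year)) %% 3, labels = paste0("p", 1:3))

# Treatment coding: the first level is the baseline, absorbed by the intercept
mm <- model.matrix(~ period)
colnames(mm)
```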

Dynamically create model formula in a loop [duplicate]

Suppose there is a data.frame foo_data_frame and one wants to regress the target column Y on some of the other columns. Usually a formula and a model are used for that purpose. For example:
linear_model <- lm(Y ~ FACTOR_NAME_1 + FACTOR_NAME_2, foo_data_frame)
That works well if the formula is coded statically. If one wants to iterate over several models with a constant number of predictor variables (say, 2), it can be handled like this:
for (i in seq_len(factor_number - 1)) {  # stop one short so that j never runs past the last column
  for (j in seq(i + 1, factor_number)) {
    linear_model <- lm(Y ~ F1 + F2, list(Y = foo_data_frame$Y,
                                         F1 = foo_data_frame[[i]],
                                         F2 = foo_data_frame[[j]]))
    # linear_model further analyzing...
  }
}
My question is how to achieve the same effect when the number of variables changes dynamically at run time.
for (number_of_factors in seq_len(5)) {
  # Then iterate over subsets with #number_of_factors cardinality.
  for (factors_subset in all_subsets_with_fixed_cardinality) {
    # Here I want to fit model with factors from factors_subset.
    linear_model <- lm(Does R provide smth to write here?)
  }
}
See ?as.formula, e.g.:
factors <- c("factor1", "factor2")
as.formula(paste("y~", paste(factors, collapse="+")))
# y ~ factor1 + factor2
where factors is a character vector containing the names of the factors you want to use in the model. You can pass the result to lm, e.g.:
set.seed(0)
y <- rnorm(100)
factor1 <- rep(1:2, each=50)
factor2 <- rep(3:4, 50)
lm(as.formula(paste("y~", paste(factors, collapse="+"))))
# Call:
# lm(formula = as.formula(paste("y~", paste(factors, collapse = "+"))))
# Coefficients:
# (Intercept) factor1 factor2
# 0.542471 -0.002525 -0.147433
An oft forgotten function is reformulate. From ?reformulate:
reformulate creates a formula from a character vector.
A simple example:
listoffactors <- c("factor1","factor2")
reformulate(termlabels = listoffactors, response = 'y')
will yield this formula:
y ~ factor1 + factor2
Although not explicitly documented, you can also add interaction terms:
listofintfactors <- c("(factor3","factor4)^2")
reformulate(termlabels = c(listoffactors, listofintfactors),
response = 'y')
will yield:
y ~ factor1 + factor2 + (factor3 + factor4)^2
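Putting reformulate into the loop from the question might look like the sketch below. It uses the built-in mtcars data as a stand-in; the response mpg, the candidate predictors, and the choice of AIC as the quantity to record are all assumptions for illustration. combn enumerates the subsets of a given size.

```r
candidates <- c("wt", "hp", "qsec", "drat")   # hypothetical predictor pool
results <- list()

for (k in seq_along(candidates)) {
  # With simplify = FALSE, combn returns a list with one character vector per subset of size k
  subsets <- combn(candidates, k, simplify = FALSE)
  for (vars in subsets) {
    form <- reformulate(vars, response = "mpg")
    fit <- lm(form, data = mtcars)
    results[[paste(vars, collapse = "+")]] <- AIC(fit)
  }
}

length(results)                       # 2^4 - 1 = 15 models
names(which.min(unlist(results)))     # subset with the lowest AIC
```

The number of subsets doubles with every extra candidate, so an exhaustive loop like this is only practical for small predictor pools.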
Another option could be to use a matrix in the formula:
Y = rnorm(10)
foo = matrix(rnorm(100),10,10)
factors=c(1,5,8)
lm(Y ~ foo[,factors])
You don't actually need a formula. This works:
lm(data_frame[c("Y", "factor1", "factor2")])
as does this:
v <- c("Y", "factor1", "factor2")
do.call("lm", list(bquote(data_frame[.(v)])))
I generally solve this by changing the name of my response column. It is easier to do dynamically, and possibly cleaner.
model_response <- "response_field_name"
setnames(model_data_train, c(model_response), "response") #if using data.table
model_gbm <- gbm(response ~ ., data=model_data_train, ...)
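The same renaming idea can be sketched in base R without data.table or gbm. Here lm stands in for gbm, mtcars stands in for the training data, and the column choice is hypothetical.

```r
model_response <- "mpg"                        # chosen at run time (hypothetical)
train <- mtcars[c("mpg", "wt", "hp")]          # stand-in training data

# Rename the response column so the formula can stay fixed
names(train)[names(train) == model_response] <- "response"

fit <- lm(response ~ ., data = train)
names(coef(fit))
```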

Using summary(glm-object) inside ldply() with the summarise() - function

How do I use the summary function inside an ldply()/summarise() call to extract p-values?
Example data:
(The data frame "Puromycin" ships with R.)
library(reshape2)
library(plyr)
Puromycin.m <- melt( Puromycin , id=c("state") )
Puro.models <- dlply( Puromycin.m , .(variable) , glm , formula = state ~ value ,
family = binomial )
I can construct this data frame with extracted results:
ldply( Puro.models , summarise , "n in each model" = length(fitted.values) ,
"Coefficients" = coefficients[2] )
But I can't extract the p-values in the same way. I thought this would work, but it does not:
ldply( Puro.models , summarise ,
"n in each model" = length(fitted.values) ,
"Coefficients" = coefficients[2],
"P-value" = function(x) summary(x)$coef[2,4] )
How can I extract the p-values into that data frame?
Why don't you get them directly?
library(reshape2)
library(plyr)
Puromycin.m <- melt( Puromycin , id=c("state") )
Puro.models <- ddply(Puromycin.m, .(variable), function(x) {
  # named fit rather than t, to avoid masking base::t
  fit <- glm(x, formula = state ~ value, family = "binomial")
  data.frame(n = length(fit$fitted.values),
             coef = coefficients(fit)[2],
             pval = summary(fit)$coef[2, 4])
})
> Puro.models
# variable n coef pval
# 1 conc 23 -0.55300908 0.6451550
# 2 rate 23 -0.01555023 0.1272184


Resources