Is there a way to create a StatsModels formula programmatically? For example, suppose I want to create a formula like @formula(y ~ x1 + x2) for any number of predictor terms. In other words, I'd like to create a function that behaves as follows:
julia> using StatsModels
julia> make_formula(3)
FormulaTerm
Response:
y(unknown)
Predictors:
x1(unknown)
x2(unknown)
x3(unknown)
julia> make_formula(3) == @formula(y ~ x1 + x2 + x3)
true
Is there a way to do this?
As of StatsModels version 0.6.x, you can create formula terms using term constructors such as Term and ConstantTerm, or using the generic function term, which works on both numbers and symbols. The terms can then be combined using formula operators such as +, &, and ~.
Here is how a simple formula can be constructed using Term constructors:
julia> f = Term(:y) ~ ConstantTerm(1) + Term(:x)
FormulaTerm
Response:
y(unknown)
Predictors:
1
x(unknown)
julia> f == @formula(y ~ 1 + x)
true
And here is how you could use the term function to programmatically create a formula:
julia> function make_formula(n)
term(:y) ~ sum(term.(Symbol.(string.("x", 1:n))))
end
make_formula (generic function with 1 method)
julia> make_formula(3)
FormulaTerm
Response:
y(unknown)
Predictors:
x1(unknown)
x2(unknown)
x3(unknown)
julia> make_formula(3) == @formula(y ~ x1 + x2 + x3)
true
For more details, see the StatsModels documentation.
Related
I have existing code for fitting a sigmoid curve to data in R. How can I use selfStart (or another method) to automatically find starting values for the regression?
sigmoid = function(params, x) {
params[1] / (1 + exp(-params[2] * (x - params[3])))
}
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
# fitting code
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# visualization code
# get the coefficients using the coef function
params=coef(fitmodel)
y2 <- sigmoid(params,x)
plot(y2,type="l")
points(y)
This is a common (and interesting) problem in non-linear curve fitting.
Background
We can find sensible starting values if we take a closer look at the sigmoid function

f(x) = a / (1 + exp(-b * (x - c)))

We first note that exp(-b * (x - c)) -> 0 as x -> Inf. So for large values of x, the function approaches a. In other words, as a starting value for a we may choose the value of y at the largest value of x. In R, this translates to y[which.max(x)].

Now that we have a starting value for a, we need to decide on starting values for b and c. To do that, we can make use of the geometric series

1 / (1 + z) = 1 - z + z^2 - z^3 + ...

and expand f(x) = y by keeping only the first two terms, with z = exp(-b * (x - c)):

y ≈ a * (1 - exp(-b * (x - c)))

We now set a = 1 (our starting value for a), re-arrange the equation and take the logarithm on both sides:

log(1 - y) ≈ -b * x + b * c

We can now fit a linear model of the form log(1 - y) ~ x to obtain estimates for the slope (-b) and offset (b * c), which in turn provide the starting values for b and c.
R implementation
Let's define a function that takes the values x and y as arguments and returns a list of parameter starting values:
start_val_sigmoid <- function(x, y) {
  # Linearised model from the derivation above: log(a - y) ~ x,
  # with a = y[which.max(x)]; the small offset avoids log(0)
  fit <- lm(log(y[which.max(x)] - y + 1e-6) ~ x)
  list(
    a = y[which.max(x)],                       # plateau: y at the largest x
    b = unname(-coef(fit)[2]),                 # minus the slope
    c = unname(-coef(fit)[1] / coef(fit)[2]))  # minus the offset divided by the slope
}
Based on the data for x and y you give, we obtain the following starting values
start_val_sigmoid(x, y)
#$a
#[1] 1
#
#$b
#[1] 0.2027444
#
#$c
#[1] 15.01613
Since start_val_sigmoid returns a list, we can use its output directly as the start argument in nls:
nls(y ~ a / ( 1 + exp(-b * (x - c))), start = start_val_sigmoid(x, y))
#Nonlinear regression model
# model: y ~ a/(1 + exp(-b * (x - c)))
# data: parent.frame()
# a b c
# 1.0395 0.1254 29.1725
# residual sum-of-squares: 0.2119
#
#Number of iterations to convergence: 9
#Achieved convergence tolerance: 9.373e-06
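As a quick visual check (not part of the original answer, just a sketch reusing the question's plotting approach), the fitted curve can be overlaid on the data:

fit_sig <- nls(y ~ a / (1 + exp(-b * (x - c))), start = start_val_sigmoid(x, y))
plot(x, y)                              # raw data
lines(x, fitted(fit_sig), col = "red")  # fitted sigmoid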
Sample data
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
I have two vectors (example):
x=c(100,98,60,30,28,30,20,10)
y=c(10,9.8,5,3,2,3.4,2.8,1)
I would like to fit them using this function:

y = a / (1 + exp(-b * (x - c))) + d

and get the fitting parameters a, b, c and d.
I used this:
m<-nls(x~a/1+e^(-b*(y-c)) + d)
but I got this error:
Error in y - c : non-numeric argument to binary operator
In the formula, x and y are reversed, and e^(...) should be exp(...). Also, I found that setting d to 0 helped.
d <- 0 # fix d at 0
st <- list(a = mean(y), b = 1/sd(x), c = mean(x))
fm <- nls(y ~ a/(1+exp(-b*(x-c))) + d, start = st)
fm
giving:
Nonlinear regression model
model: y ~ a/(1 + exp(-b * (x - c)))
data: parent.frame()
a b c
19.96517 0.02623 99.73842
residual sum-of-squares: 1.82
Number of iterations to convergence: 9
Achieved convergence tolerance: 9.023e-06
Plotting this, it seems to be a good fit visually:
plot(y ~ x)
lines(fitted(fm) ~ x, col = "red")
I think the reason is that c is treated as the combine function c(). Change it to another symbol (c1, for instance). Of course, you would also need to specify meaningful starting parameters, but I guess that was not your question.
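A minimal sketch of that renaming (the starting values here are rough guesses added purely for illustration, not part of the original suggestion; convergence still depends on choosing them sensibly):

# rename the parameter c to c1 so nls does not confuse it with the combine function c()
fm2 <- nls(y ~ a / (1 + exp(-b * (x - c1))) + d,
           start = list(a = max(y), b = 0.05, c1 = mean(x), d = 0))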
I'm going to simplify my problem as much as possible to prove that I'm not just throwing my assignment at you guys. I really want to learn how to get a loop to work with regressions.
Let's suppose I want to run two OLS regressions, but I don't want to type the same OLS command twice or add another series of commands to my script. This is because (a) I actually have way more than two regressions and (b) I want to code this as efficiently as I can (I have tried copying and pasting the same OLS commands). Also, I'm not just running a simple OLS: I'm also running a HAC estimator depending on the serial correlation and heteroskedasticity tests.
The code that I have come up with so far is:
Packages
if (!require("lmtest")) install.packages("lmtest")
library("lmtest")
if (!require("sandwich")) install.packages("sandwich")
library("sandwich")
Data
data<-read.csv(file.choose())
x1<-data$x1
x2<-data$x2
x3<-data$x3
x4<-data$x4
x5<-data$x5
x6<-data$x6
x7<-data$x7
y1<-data$y1
Regressions
reg1<-(y1 ~ x1 + x2 + x3 + x4)
reg2<-(y1 ~ x2 + x4 + x6 + x7)
p<-0.05
Loop
for (i in 1:2) {
#OLS#
ols[i]<-lm(reg[i])
#Breusch-Pagan Test#
bptest(ols[i],varformula = NULL, studentize = TRUE)
bpp<-bptest(ols[i])$p.value
if(bpp>p) hs<-F else hs<-T
#Breusch-Godfrey Serial Correlation Test#
bgtest(ols[i],order=2,order.by=NULL,type=c("Chisq"))
bgp<-bgtest(ols[i])$p.value
if(bgp>p) sc<-F else sc<-T
#HAC Estimator#
HAC<-vcovHAC(ols[i],order.by=NULL,prewhite=FALSE,adjust=TRUE,diagnostics=FALSE,sandwich = TRUE,ar.method = "ols")
if (sc==T|hs==T) coeftest(ols[i],vcov.=HAC) else ols[i]
if (sc==T|hs==T) write.csv(coeftest(ols[i],vcov.=HAC),file="ols[i]HAC.csv") else write.csv(summary(ols[i])$coefficient,file="ols1.csv")
}
When I run this I get
Error in stats::model.frame(formula = reg[i], drop.unused.levels = TRUE) : object 'reg' not found
I have also tried the above code with
for (i in reg[1]:reg[2]) {
}
but it only returned
Error: object 'reg' not found.
Where did I go wrong?
This is too long for a comment, so I post it as a partial answer.
The difference seems to be the formula, and you are asking for a way to make your code more efficient. One way is to store the formulae in a list and then loop over that list with lapply. For instance,
reg <- list(
reg1 = as.formula(y1 ~ x1 + x2 + x3 + x4),
reg2 = as.formula(y1 ~ x2 + x4 + x6 + x7)
)
ols <- lapply(reg, function(x) lm(x, data=data))
Here, ols is a list of two elements, each of which is a fitted regression corresponding to one of the formulae in the list. You can use the same principle for other functions, for instance:
bgtests <- lapply(ols, function(x)
bgtest(x,order=2,order.by=NULL,type=c("Chisq")))
This executes your bgtest function for each regression stored in ols. In a similar fashion, you can write it up so that it executes your heteroskedasticity corrections etc. The important point is this: you supply a list to lapply, and each element of that list is what is passed onto the function that you provide. The output of lapply is then a list with the output of that function.
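For instance, here is a rough sketch of how the rest of the pipeline from the question could be wrapped the same way (the threshold p and the test settings are taken from the question; this is just one possible arrangement, not the only one):

library(lmtest)
library(sandwich)

p <- 0.05
results <- lapply(ols, function(m) {
  hs <- bptest(m)$p.value <= p             # heteroskedasticity present?
  sc <- bgtest(m, order = 2)$p.value <= p  # serial correlation present?
  if (hs || sc) {
    coeftest(m, vcov. = vcovHAC(m))        # HAC-corrected inference
  } else {
    summary(m)$coefficients                # plain OLS inference
  }
})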
In case you don't want to use lapply, and to address your actual question: the problem in your code is that there is no object called reg. Subsetting a non-existent object such as reg[1] hence does not work. If you execute the first lines of my code above, reg[1] and reg[2] become defined, so your loop should work.
The 'get' function is what you want, in conjunction with 'paste'. Below I fit two regressions using the cars data in R. Then I write a loop that extracts their coefficients. The 'get' function goes and finds the object that matches the object name you specify.
> (reg1 <- lm(dist ~ speed, data = cars))
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
(Intercept) speed
-17.579 3.932
> (reg2 <- lm(dist ~ 1 + I(speed^2), data = cars))
Call:
lm(formula = dist ~ 1 + I(speed^2), data = cars)
Coefficients:
(Intercept) I(speed^2)
8.860 0.129
> coeff <- matrix(0, nrow = 2, ncol = 2)
> for (i in 1:2)
+ {
+
+ # Main step
+ model <- get(paste("reg", i, sep = ""))
+ coeff[i,] <- coefficients(model)
+ }
> coeff
[,1] [,2]
[1,] -17.579095 3.9324088
[2,] 8.860049 0.1289687
I've fit a parametric function using nls, and now I want to print out an expression of the function with the learned parameters substituted back in. For example:
x <- runif(100, 0, 100)
m <- 13 * exp(-0.05 * x^2) + 0.1 + runif(100,0,0.1)
mod <- nls(m ~ a*exp(-b*x^2)+c, start=list(a=10,b=0.1,c=0.1))
I can extract the formula and coefficients like so:
formula(mod)
# m ~ a * exp(-b * x^2) + c
coef(mod)
# a b c
# 13.00029360 0.04975388 0.14457936
But I don't see a way to substitute them back directly. The only thing I can seem to do involves writing out the formula again:
substitute(m ~ a * exp(-b * x^2) + c, as.list(round(coef(mod), 4)))
# m ~ 13.0003 * exp(-0.0498 * x^2) + 0.1446
My ultimate goal here is to read a fitted nls object from an RDS file on disk and show its functional expression in an org-mode document.
Is this what you're looking for?
do.call(substitute, args=list(formula(mod), as.list(round(coef(mod),4))))
# m ~ 13.0097 * exp(-0.0501 * x^2) + 0.1536
It works because do.call first evaluates both of the arguments in args and only then uses substitute() to substitute the coefficients into the formula expression. i.e., the expression that do.call() ultimately evaluates looks like this one, as desired:
as.call(list(substitute, formula(mod), as.list(round(coef(mod),4))))
# .Primitive("substitute")(m ~ a * exp(-b * x^2) + c, list(a = 13.0097,
# b = 0.0501, c = 0.1536))
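If the end goal is to show the expression in an org-mode document, one hypothetical follow-up (not part of the original answer) is to deparse the substituted call into a plain string:

expr <- do.call(substitute, args = list(formula(mod), as.list(round(coef(mod), 4))))
# turn the language object into a printable string for the document
paste(deparse(expr), collapse = " ")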
I'm trying to use linear regression to figure out the best weighting for 3 models to predict an outcome. So there are 3 variables (x1, x2, x3) that are the predictions of the dependent variable, y. My question is, how do I run a regression with the constraint that the coefficients sum to 1? For example:
this is good:
y = .2(x1) + .4(x2) + .4(x3)
since .2 + .4 + .4 = 1
this is no good:
y = 1.2(x1) + .4(x2) + .3(x3)
since 1.2 + .4 + .3 > 1
I'm looking to do this in R if possible. Thanks. Let me know if this needs to get moved to the stats area ('Cross-Validated').
EDIT:
The problem is to classify each row as 1 or 0. y is the actual value (0 or 1) from the training set, x1 is the predicted value from a kNN model, x2 from a random forest, and x3 from a gbm model. I'm trying to get the best weightings for each model, so each coefficient is <= 1 and the coefficients sum to 1.
It would look something like this:
y (actual value)   knnPred   RfPred   gbmPred
0                  .1111     .0546    .03325
1                  .7778     .6245    .60985
0                  .3354     .1293    .33255
0                  .2235     .9987    .10393
1                  .9888     .6753    .88933
...                ...       ...      ...
The measure for success is AUC. So I'm trying to set the coefficients to maximize AUC while making sure they sum to 1.
There's very likely a better way that someone else will share, but you're looking for two parameters such that
b1 * x1 + b2 * x2 + (1 - b1 - b2) * x3
is close to y. To do that, I'd write an error function to minimize
minimizeMe <- function(b, x, y) { ## Calculates MSE
mean((b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3] - y) ^ 2)
}
and throw it to optim
fit <- optim(par = c(.2, .4), fn = minimizeMe, x = cbind(x1, x2, x3), y = y)
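The third weight is implied by the first two, so, as a small follow-up sketch (not in the original answer), the full coefficient vector can be recovered from the optimiser output:

weights <- c(fit$par, 1 - sum(fit$par))  # b3 is 1 minus the other two
names(weights) <- c("b1", "b2", "b3")
weights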
No data to test on:
mod1 <- lm(y ~ 0+x1+x2+x3, data=dat)
mod2 <- lm(y/I(sum(coef(mod1))) ~ 0+x1+x2+x3, data=dat)
And now that I think about it some more, skip mod2, just:
coef(mod1)/sum(coef(mod1))
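A quick check of that normalisation on made-up data (purely illustrative; note it rescales the coefficients to sum to 1 but does not force them to be nonnegative):

set.seed(1)
dat <- data.frame(x1 = runif(50), x2 = runif(50), x3 = runif(50))
dat$y <- with(dat, 0.2 * x1 + 0.4 * x2 + 0.4 * x3 + rnorm(50, sd = 0.01))
mod1 <- lm(y ~ 0 + x1 + x2 + x3, data = dat)
coef(mod1) / sum(coef(mod1))  # approximately 0.2, 0.4, 0.4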
For the five rows shown, either round(knnPred) or round(gbmPred) gives perfect predictions, so there is some question whether more than one predictor is needed.
At any rate, to solve the question as stated, the following will give nonnegative coefficients that sum to 1 (except possibly for tiny differences due to computer arithmetic). a is a matrix of independent variables and b is the dependent variable. c and d define the equality constraint (the coefficients sum to 1) and e and f define the inequality constraints (the coefficients are nonnegative).
library(lsei)
a <- cbind(x1, x2, x3)      # matrix of independent variables
b <- y                      # dependent variable
c <- matrix(c(1, 1, 1), 1)  # equality constraint: 1*coef1 + 1*coef2 + 1*coef3 ...
d <- 1                      # ... equals 1
e <- diag(3)                # inequality constraints: each coefficient ...
f <- c(0, 0, 0)             # ... is >= 0
lsei(a, b, c, d, e, f)
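As a small sanity check (a hypothetical follow-up, not in the original answer), the returned coefficients can be verified against the constraints:

coefs <- lsei(a, b, c, d, e, f)
sum(coefs)            # should be 1, up to floating-point error
all(coefs >= -1e-8)   # nonnegative, allowing tiny numerical slack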