I am new to R and I'm working on a project where we have to experiment with generating linear regression models.
Here is my code:
# regression coefficients
beta1 = 1
beta2 = 0
beta3 = 5
beta4 = 1
beta5 = 1
# generated data
n <- 100   # sample size
df <- data.frame(y1 = rnorm(n, mean = 50, sd = 1),
                 sprem1 = rnorm(n, mean = 0, sd = 1),
                 sprem2 = rnorm(n, mean = 10, sd = 2),
                 sprem3 = rnorm(n, mean = 10, sd = 2),
                 sprem4 = rnorm(n, mean = 20, sd = 2),
                 sprem5 = rnorm(n, mean = 20, sd = 1))
fit.lm <- lm(formula = y1 ~ beta1 * sprem1 + beta2 * sprem2 + beta3 * sprem3 + beta4 * sprem4 + beta5 * sprem5, data = df)
But I get an error:
Error in model.frame.default(formula = y1 ~ beta1 * sprem1 + beta2 * sprem2 + : variable lengths differ (found for 'beta1')
Where is the problem?
Any help would be appreciated.
In the formula, * is interpreted as an interaction between each beta and the corresponding column from df. An interaction is akin to an element-wise product, but each beta is a scalar whereas each sprem column is a vector, so their lengths differ.
It's hard to tell what you want to do, but you're probably looking for something like this:
fit.lm <- lm(formula = y1 ~ I(beta1 * sprem1) + I(beta2 * sprem2) + I(beta3 * sprem3) + I(beta4 * sprem4) + I(beta5 * sprem5), data = df)
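If the underlying goal is to simulate a response from those known coefficients and then recover them with lm(), a more conventional setup would build y1 from the betas and fit a plain additive formula. Here is a sketch of that idea; the intercept of 50, the noise level, and the sample size are illustrative guesses, not taken from your code:
set.seed(1)
n <- 100
# simulate the predictors
sprem1 <- rnorm(n, mean = 0,  sd = 1)
sprem2 <- rnorm(n, mean = 10, sd = 2)
sprem3 <- rnorm(n, mean = 10, sd = 2)
sprem4 <- rnorm(n, mean = 20, sd = 2)
sprem5 <- rnorm(n, mean = 20, sd = 1)
# build the response from the known coefficients plus noise
y1 <- 50 + beta1 * sprem1 + beta2 * sprem2 + beta3 * sprem3 +
  beta4 * sprem4 + beta5 * sprem5 + rnorm(n, sd = 1)
df <- data.frame(y1, sprem1, sprem2, sprem3, sprem4, sprem5)
# the estimated coefficients should now be close to the betas above
fit.lm <- lm(y1 ~ sprem1 + sprem2 + sprem3 + sprem4 + sprem5, data = df)
coef(fit.lm)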
I have been playing with the bayesQR package, and want to apply it to an application that calls for variable selection using LASSO. As far as I understand, this is possible in bayesQR, but I haven't been able to get any variables dropped. My toy example is below, where the extraneous variables (c and d) are dropped by glmnet, but not by bayesQR.
Is there something fundamental that I am missing? How can I get model4 below to drop the extraneous variables?
library(data.table)
library(bayestestR)
library(bayesQR)
library(glmnet)
# Generate data
n = 250
seed = 22
noise_sd = 5
set.seed(seed)
dt = data.table(a = runif(n, min = 0, max = 10),
                b = runif(n, min = 0, max = 10),
                c = runif(n, min = 0, max = 10),
                d = rnorm(n, sd = .01 * noise_sd))
dt[, y := (a + rnorm(n, sd = 1 * noise_sd)) * 2 + (b + rnorm(n, sd = 2 * noise_sd)) + rnorm(n, sd = 2 * noise_sd)]
formula = y ~ a + b + c + d
# Just with GLM
model1 = glm(data = dt,
             formula = formula)
# LASSO with glmnet
x = as.matrix(dt[, .(a,b,c,d)])
y = dt$y
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min
model2 <- glmnet(x, y, alpha = 1, lambda = best_lambda)
print(coef(model2))
# Quantile regression with bayesQR
model3 = bayesQR(data = dt,
                 formula = formula,
                 ndraw = 5000,
                 seed = seed)
# Quantile regression with bayesQR using adaptive lasso
model4 = bayesQR(data = dt,
                 formula = formula,
                 ndraw = 5000,
                 alasso = TRUE,
                 seed = seed)
message('GLM')
print(summary(model1))
message('glmnet with LASSO')
print(coef(model2))
message('bayesQR')
print(summary(model3))
message('bayesQR with ALASSO')
print(summary(model4))
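For reference, one way to see explicitly which predictors glmnet keeps at the chosen lambda (a small sketch using model2 from the code above):
# coefficients at the cross-validated lambda; the LASSO sets some of them exactly to zero
lasso_coefs <- as.matrix(coef(model2))
# predictors that survive the penalty (non-zero rows, excluding the intercept)
rownames(lasso_coefs)[lasso_coefs[, 1] != 0 & rownames(lasso_coefs) != "(Intercept)"]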
I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:
n = 10000
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(x1, x2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)
modelNL <- flexmix(y ~ x, k = 2,
                   model = FLXMRnlm(formula = y ~ a*x/(b+x),
                                    family = "gaussian",
                                    start = list(start1, start2)))
plot(x, y, col = clusters(modelNL))
and before the plot, it gives me this error:
Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long
I checked Google for similar errors, but I don't quite understand what is wrong with my own code that produces this error.
As you can probably tell, I am very new to R, so please explain it in layman's terms as much as possible. Thank you in advance.
Ironically (given an error message complaining the data is "too long"), I think the proximate cause of that error is the missing data argument. If you supply the data as a data frame, you still get an error, but it is not the same one you are seeing. When you plot the data you get a rather bizarre set of values, at least from a statistical-distribution standpoint, and it is not clear why you are trying to model them with this formula. Nonetheless, with those starting values and a data frame passed to data, one gets results.
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x/(b+x),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
6664 13336
convergence after 20 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.436 6664 20000 0.333
Comp.2 0.564 13336 16306 0.818
'log Lik.' -91417.03 (df=7)
AIC: 182848.1 BIC: 182903.4
Most R regression functions first look for the names in a formula within the data= argument. Apparently this function fails when it needs to go out to the global environment to match formula tokens.
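The same lookup order can be illustrated with a plain lm(); this toy sketch (unrelated to flexmix) just shows where formula names get resolved:
set.seed(1)
d <- data.frame(w = rnorm(10), v = rnorm(10))
z <- rnorm(10)          # exists only in the global environment
lm(w ~ v, data = d)     # both names are found in d
lm(w ~ z, data = d)     # z is not in d, so it is looked up in the global environment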
I tried a formula suggested by the plot of the data and got convergent results:
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x*cos(x+b),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
9395 10605
convergence after 17 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.521 9395 18009 0.522
Comp.2 0.479 10605 13378 0.793
'log Lik.' -78659.85 (df=7)
AIC: 157333.7 BIC: 157389
The reduction in AIC is huge compared with the first formula.
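If both fits are kept under separate names, the criteria can also be extracted directly for comparison; a sketch, where modelNL1 and modelNL2 are hypothetical names for the two fits above and the AIC()/BIC() methods provided by flexmix are assumed:
# modelNL1: the saturating a*x/(b+x) fit, modelNL2: the cosine fit (hypothetical names)
AIC(modelNL1) - AIC(modelNL2)   # a positive difference favours the cosine formula
BIC(modelNL1) - BIC(modelNL2)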
I wrote some code to simulate a dataset in R to see how backward selection works in machine learning. I used the poly() function to build polynomial terms and then wanted to choose the best polynomial degree using Cp, BIC and adjusted R^2.
The code is:
###Generating dataset
set.seed(1)
X = rnorm(100)
eps = rnorm(100)
beta0 = 3
beta1 = 2
beta2 = -3
beta3 = 0.3
Y = beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps
library(leaps)
data.full = data.frame(y = Y, x = X)
mod.full = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10)
mod.summary = summary(mod.full)
### Find the model size with the best Cp, BIC and adjusted R^2
which.min(mod.summary$cp)
which.min(mod.summary$bic)
which.max(mod.summary$adjr2)
For Cp, BIC and adjusted R^2 I get the degree-3 polynomial model, as expected.
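To double-check which terms that best size-3 model actually contains, its coefficients can be pulled from the regsubsets fit (using mod.full from above):
# coefficients of the best model with 3 predictors (the size selected above)
coef(mod.full, id = 3)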
However, now I want to simulate 100 datasets and see in how many of them I get the right model. I simulated 100 datasets, but now I am not getting the degree-3 polynomial for each of the measures, and I don't quite understand what I'm doing wrong. My code for the simulation is:
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
###Using polynomial terms
mod.bwd = regsubsets(data.full$y.1 ~ poly(data.full$x.1, 10, raw = T),
                     data = data.full, nvmax = 10, method = "backward")
bwd.summary = summary(mod.bwd)
which.min(bwd.summary$cp)
which.min(bwd.summary$bic)
which.max(bwd.summary$adjr2)
For a given dataset, Cp, BIC and adjusted R^2 give me different results. For example, using y.1 and x.1 (the first dataset in the simulation) gives the following results:
which.min(bwd.summary$cp): 7
which.min(bwd.summary$bic): 4
which.max(bwd.summary$adjr2): 9
Can someone help me figure out what I'm doing wrong in simulating these 100 datasets?
If I've read your code correctly, you fit the model to only one of the simulated datasets instead of to all 100. This should do the trick:
set.seed(42)
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
res <- lapply(1:100, function(i){
  ### Using polynomial terms
  mod.bwd = regsubsets(data.full[[i]] ~ poly(data.full[[100 + i]], 10, raw = T),
                       data = data.full, nvmax = 10, method = "backward")
  bwd.summary = summary(mod.bwd)
  c(
    which.min(bwd.summary$cp),
    which.min(bwd.summary$bic),
    which.max(bwd.summary$adjr2)
  )
})
res <- do.call(rbind, res)
With this RNG seed, this gives some rows where all criteria select the correct model.
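To count how often each criterion picks the true degree-3 model across the 100 simulated datasets, the res matrix can be summarised, for example like this:
colnames(res) <- c("cp", "bic", "adjr2")
# proportion of the 100 datasets in which each criterion selects the 3-term model
colMeans(res == 3)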
I have a data frame created with the following sample data:
### create sample data
set.seed(1)
x = runif(n = 100, 0, 20)
beta1 = 2
beta2 = 1
beta3 = (1/2)
### Create sample data frame and vector
y = beta1 + beta2*x^(beta3) + rnorm(n = 100, 0, 0.01)
data = as.data.frame(cbind(y,x))
### Fit the data with the nls() function
fit1 = nls(
  y ~ vb1(x, beta1, beta2, beta3),
  data = data, start = list(beta1 = 0, beta2 = 1, beta3 = 1)
)
where
vb1 = function(x, beta1, beta2, beta3){
  beta1 + beta2*x^(beta3)
}
In the end I want to plot the output:
plot(y~x, col = 3)
nlsTracePlot(fit1,vb1(x,beta1,beta2,beta3),legend="bottomright")
However, it gives the following error:
3. stop(gettextf("'%s' is not a function, character or symbol", deparse(FUN)), domain = NA)
2. match.fun(fun)
1. nlsTracePlot(fit1, vb1(x, beta1, beta2, beta3), legend = "bottomright")
Everything works just fine, until I try to plot it with the above function.
Assuming you are using the FSA package, you need to pass the fun argument a bare function, not a function call, and the function needs to be specified in a particular way. See the documentation of nlsTracePlot for more information.
vb1 <- function(x, beta1, beta2, beta3) {
  # nlsTracePlot supplies the fitted coefficients as a single vector,
  # so unpack it when that happens
  if (length(beta1) == 3) {
    beta2 <- beta1[2]
    beta3 <- beta1[3]
    beta1 <- beta1[1]
  }
  beta1 + beta2 * x^(beta3)
}
plot(y ~ x, col = 3)
nlsTracePlot(fit1, vb1, legend = "bottomright")
I want to fit a 1 - exp(x) type function to a data set, giving higher weight to the first values. However, the following code is not working as intended:
x <-sqrt((0.05)^2+(0.05)^2)*seq(from = 1, to = 20, by = 1)
y <- c(11,20,27,32,35,36,36.5,25,16,9,4,1,7.87e-16,2.07e-15,-9.36e-16,1.61e-15,-3.81e-16,3.92e-16,7.65e-16,-8.26e-16)
temp <- data.frame(cbind(x,y))
we <- 1/(log1p(seq_along(x)))
# fit non-linear model
mod <- nls(y ~ (1 - exp(a + b * x)), data = temp, start = list(a = 0, b = 0), weights = we)
#add fitted curve
lines(temp$x, predict(mod, list(x = temp$x)))
Here is the output:
Your specification of weights is correct. The bad fit you obtained is due to a faulty model assumption. You assumed:
y ~ 1 - exp(a + b * x)
Note that exp() gives strictly positive values, so this y can never exceed 1. However, the y values in your data range up to 36.5.
My idea is not perfect, but it might give you a better starting point. Consider:
y ~ a * x * exp(b * x * x + c * x)
Using your data (with an extra (0, 0) point prepended):
x <- c(0, sqrt((0.05)^2+(0.05)^2)*seq(from = 1, to = 20, by = 1))
y <- c(0, 11,20,27,32,35,36,36.5,25,16,9,4,1,7.87e-16,2.07e-15,-9.36e-16,1.61e-15,-3.81e-16,3.92e-16,7.65e-16,-8.26e-16)
fit <- nls(y ~ a * x * exp(b * x * x + c * x), start = list(a = 30, b = -1, c = -1))
plot(x, y)
lines(x, predict(fit, list(x = x)))
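If you still want the down-weighting of later points from your original attempt, the same weights idea carries over to this model as well; a sketch, with the weights recomputed so their length matches the 21-point x above (the start values may need tweaking):
# weights as in the question, recomputed to match the length of the new x
we <- 1 / log1p(seq_along(x))
fit_w <- nls(y ~ a * x * exp(b * x * x + c * x),
             start = list(a = 30, b = -1, c = -1),
             weights = we)
lines(x, predict(fit_w, list(x = x)), lty = 2)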