Selecting multiple variables into model starting with common name in R - r

In SAS we can select multiple variables that share a common prefix by using the colon (:) operator after the prefix. I want to do the same in R for modeling purposes.
Any suggestions?

There are probably many ways to do this. Here is one with a regular expression that doesn't do exactly what you want, but might do the trick:
x1 = rnorm(100)
x2 = rnorm(100)
z = rnorm(100)
a = rnorm(100)
y = x1+x2+z
d = data.frame(x1,x2,z,y)
# keep only the columns whose names start with "x"
X = as.matrix(d[, grepl("^x", colnames(d))])
head(X)
m = lm(y~X+a)
summary(m)

myformula <- as.formula(paste("y ~", paste(names(mydata)[substr(names(mydata), 1, 1) == "x"], collapse = " + ")))
This gives a formula object, myformula, for a regression of y on all variables in the data frame mydata whose names begin with x. You can use it in models, e.g. lm(myformula, data = mydata). This way you are not subsetting the data frame, which can be a nuisance when it is big.
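The same idea can be sketched with grep() and reformulate() from base R. This is only an illustration: the data frame mydata and its columns are made-up example data, and the anchored pattern "^x" is an assumption about what "starting with x" should mean.

```r
# Hypothetical example data; only the x-prefixed columns should enter the model.
set.seed(1)
mydata <- data.frame(x1 = rnorm(50), x2 = rnorm(50), z = rnorm(50))
mydata$y <- mydata$x1 + mydata$x2 + rnorm(50)

# Names that start with "x" (the anchor ^ excludes names that merely contain x).
xvars <- grep("^x", names(mydata), value = TRUE)

# reformulate() builds y ~ x1 + x2 from the character vector.
myformula <- reformulate(xvars, response = "y")
fit <- lm(myformula, data = mydata)
coef(fit)
```

Because the formula is built from names, the full data frame can be passed to lm() unchanged.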

Related

How to use apply or lapply function

I have columns of data and would like to use WD as the independent variable and E1-E14 as the dependent variables, run a regression for each, and write the output to a CSV file. Please help.
This is what I did; however, it outputs the same results for all columns. I think the mod variable is being set incorrectly.
mod <- function(y) lm(E1 ~ WD , data = data)
lapply(data[,5:16], mod)
This is probably what you want, but it may require some changes to get to the full extent of what you need.
df <- data.frame(WD = c(1,1,0,0,0,1,1,1,0,0),
E1 = rnorm(10,0,1),
E2 = rnorm(10,0,1))
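mod <- function(y) {
  # regress each response column on WD (WD is the independent variable)
  lm(y ~ WD, data = df)
}
# lapply returns one fitted model per column other than WD
lapply(df[setdiff(names(df), "WD")], mod)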
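For the "write the output to a CSV file" part of the question, one possible sketch is below. It assumes a small made-up data frame with only E1 and E2, and writes to a temporary file; the column names in results (response, term, estimate, std_error, p_value) are illustrative choices, not something from the original post.

```r
set.seed(42)
# Hypothetical data: WD is the predictor, E1 and E2 are the responses.
df <- data.frame(WD = c(1, 1, 0, 0, 0, 1, 1, 1, 0, 0),
                 E1 = rnorm(10), E2 = rnorm(10))
responses <- setdiff(names(df), "WD")

# Fit one model per response and keep its coefficient table.
results <- do.call(rbind, lapply(responses, function(resp) {
  fit <- lm(reformulate("WD", response = resp), data = df)
  coefs <- summary(fit)$coefficients
  data.frame(response  = resp,
             term      = rownames(coefs),
             estimate  = unname(coefs[, "Estimate"]),
             std_error = unname(coefs[, "Std. Error"]),
             p_value   = unname(coefs[, "Pr(>|t|)"]),
             row.names = NULL)
}))

out_file <- tempfile(fileext = ".csv")  # replace with a real path
write.csv(results, out_file, row.names = FALSE)
```

Building the formula from the column name keeps the response name in each model, which makes the combined table easier to read than one built from anonymous vectors.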

Is there an R function that resolves a second-order linear model?

I'm a beginner in R and programming, and I'm struggling with what is probably a simple task.
I've written code that fits a second-order model, and I want to input variable values into this model and find the "Y value".
I've tried to use the predict function, but it is actually pretty complex and I couldn't get anywhere.
I did this so far:
modFOI <- rsm(Rendimento~FO(x1,x2,x3,x4)+TWI(x1,x2,x3,x4)+PQ(x1,x2,x3,x4),data=CR) # with interactions
summary(modFOI)
print(modFOI)
With that, I found the second-order (SO) model, but now I want to create variables like x1, x2, x3, input them into the model, and find Y. I would also like to find the optimum Y.
Simplest way to create a polynomial (2nd order) that I can think of is the following:
DF <- data.frame(x = runif(10,0,1),
y = runif(10,0,1) )
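mod <- lm(y ~ x + I(x^2), data = DF)
predict(mod, newdata = data.frame(x = c(1, 2, 3, 4, 5)))
NB. When using predict, newdata must be a data frame, and its column must have the same name as the variable in the model (here, x). The model must also be fitted with a formula that uses the column names together with data = DF; if the formula uses DF$x instead, predict() cannot match the new values to the model.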
Hope this helps
The optimum value is shown as the stationary point in the output of summary(modFOI). You may also run steepest(modFOI) to see a trace of the estimated values along the path of steepest ascent.
To predict, create a data frame with the desired sets of x values. For example,
testdat <- data.frame(x1 = -1:1, x2 = 0, x3 = 0, x4 = 1)
Then use the predict() function with this as newdata:
predict(modFOI, newdata = testdat)
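For a plain lm() fit with a single predictor (not the rsm model above), the optimum can also be sketched by hand from the fitted coefficients. The data here are simulated with a known maximum at x = 1 purely for illustration, and the names DF, fit, x_star and y_star are assumptions of this example.

```r
set.seed(10)
# Simulated data with a known maximum at x = 1 (y = 3 + 2x - x^2 + noise).
DF <- data.frame(x = seq(-2, 2, length.out = 50))
DF$y <- 3 + 2 * DF$x - DF$x^2 + rnorm(50, sd = 0.1)

fit <- lm(y ~ x + I(x^2), data = DF)
b <- unname(coef(fit))  # b[1] intercept, b[2] linear term, b[3] quadratic term

# The stationary point of b1 + b2*x + b3*x^2 is where the derivative
# b2 + 2*b3*x equals zero.
x_star <- -b[2] / (2 * b[3])

# Predicted response at the stationary point.
y_star <- predict(fit, newdata = data.frame(x = x_star))
```

The stationary point is a maximum only when the quadratic coefficient is negative; with several predictors the same idea requires solving a linear system, which is what the rsm summary reports.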

lm function gives estimate for the y-variable also

I am trying to run a simple lm model. I am using the following
library(data.table)
dt <- data.table(
y=rnorm(100,0,1),
x1=rnorm(100,0,1),
x2=rnorm(100,0,1),
x3=rnorm(100,0,1))
y_var2 <- names(dt)[names(dt)%like%"y"]
x_var2 <- names(dt)[names(dt)%like%"x"]
tmp2 <- summary(a <- lm(get(y_var2)~.,dt[,c(x_var2,y_var2),with=F]))
coefs2 <- as.data.table(tmp2$coefficients,keep.rownames = T)
So in the end, coefs2 should contain the estimates, p-values, etc. But in the last row of coefs2 I also see the y-variable.
But if I use
tmp2 <- summary(a <- lm(y~.,dt[,c(x_var2,y_var2),with=F]))
then this does not happen. Why is that?
This has to do with how R expands the dot in a formula. y_var2 is the character string "y", and get(y_var2) puts a function call, not the name y, on the left-hand side. When R expands . to all columns of the data that are not already in the formula, it does not recognize that get(y_var2) refers to the column y, so y is also included as a predictor. You have to give R the formula y ~ . itself, built from the string:
lm(formula(paste(y_var2, "~ .")), dt[, c(x_var2, y_var2), with = FALSE])
will do the trick. formula() constructs a formula object out of the string that paste() builds.
Actually it would probably be cleaner just to make the formula with reformulate() and the data= parameter of lm
tmp2 <- summary(a <- lm(reformulate(x_var2, y_var2), dt))
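The difference between the two formulas can be reproduced with a plain data frame. This is a small sketch on simulated data; the names d, y_var, x_vars, bad and good are made up for the illustration.

```r
set.seed(1)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
y_var <- "y"
x_vars <- grep("^x", names(d), value = TRUE)

# The left-hand side is the call get(y_var), so "." still expands to y, x1 and x2.
bad <- lm(get(y_var) ~ ., data = d)

# The left-hand side is the symbol y, so only x1 and x2 are predictors.
good <- lm(reformulate(x_vars, response = y_var), data = d)

names(coef(bad))
names(coef(good))
```

Comparing the two sets of coefficient names shows the extra y term only in the first model.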

R using rgp symbolicRegression for equation discovery

I am trying to use the package rgp for equations discovery
library(rgp)
x = c (1:100)
y = 5*x+3*sin(x)+4*x^2+75
data1 = data.frame(x,y)
newFuncSet <- functionSet("+","-","*")
result1 <- symbolicRegression(y ~ x, data = data1, functionSet = newFuncSet, stopCondition = makeStepsStopCondition(2000))
plot(data1$y, col=1, type="l"); points(predict(result1, newdata = data1), col=2, type="l")
model <- result1$population[[which.min(result1$fitnessValues)]]
However, I keep getting an error message. I would be grateful for your help in pointing out the errors I have made above.
Useful references (it would be great to have this in R):
https://www.researchgate.net/publication/237050734_Improving_Genetic_Programming_Based_Symbolic_Regression_Using_Deterministic_Machine_Learning
The problem is that R treats the x vector as integer, which causes type problems further on. Try converting x to numeric explicitly:
x <- as.numeric(1:100)
It worked for me.
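The type difference itself is easy to check in base R. This short sketch only shows the storage types; whether rgp requires a double vector is taken from the answer above and is not verified here.

```r
x_int <- 1:100              # the colon operator produces an integer vector
x_num <- as.numeric(1:100)  # explicit conversion to double

typeof(x_int)     # "integer"
typeof(c(1:100))  # "integer": wrapping in c() does not change the type
typeof(x_num)     # "double"
```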

Pass glm predictors from a list

I have a large set of model specifications to test, which share a dv but have unique IVs. In the following example
foo <- data.frame(dv = sample(c(0,1), 100, replace=T),
x1 = runif(100),
x2 = runif(100))
I want the first model to only include x1, the second x2, the third both, and the fourth their interaction. So I thought a sensible way would be to build a list of formula statements:
bar <- list("x1",
"x2",
"x1+x2",
"x1*x2")
which I would then use in a llply call from the plyr package to obtain a list of model objects.
require(plyr)
res <- llply(bar, function(i) glm(dv ~ i, data = foo, family = binomial()))
Unfortunately I'm told
Error in model.frame.default(formula = dv ~ i, data = foo, drop.unused.levels = TRUE):variable lengths differ (found for 'i')
Obviously I'm mixing up something fundamental--do I need to manipulate the original foo list in some fashion?
Your problem is with how you are specifying the formula, since inside the function i is a variable. This would work:
glm(paste("dv ~", i), data = foo, family = binomial())
The problem is that dv ~ i isn't a formula. i is (inside the anonymous function) simply a symbol that represents a variable containing a character value.
Try this:
bar <- list("dv~x1",
"dv~x2",
"dv~x1+x2",
"dv~x1*x2")
res <- llply(bar, function(i) glm(i, data = foo, family = binomial()))
But setting statistical issues aside, it might be easier to use something like step() (see ?step) or stepAIC() in the MASS package for tasks similar to this.
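A variant that keeps the original list of right-hand sides and builds each formula explicitly might look like the sketch below. It uses base lapply() instead of plyr::llply() to stay self-contained, and the names rhs and res are illustrative.

```r
set.seed(123)
foo <- data.frame(dv = sample(c(0, 1), 100, replace = TRUE),
                  x1 = runif(100),
                  x2 = runif(100))

# Right-hand sides only; the shared response is pasted on in one place.
rhs <- c("x1", "x2", "x1 + x2", "x1 * x2")

res <- lapply(rhs, function(r) {
  f <- as.formula(paste("dv ~", r))  # turn the string into a real formula
  glm(f, data = foo, family = binomial())
})
names(res) <- rhs

# Compare the fitted specifications, e.g. by AIC.
sapply(res, AIC)
```

Naming the list by specification makes it easy to pull out one model later, for example res[["x1 * x2"]].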
