How to make regression based on grouped rows and loop over columns? - r

What I want to do is perform a regression loop that always has the same predictor but loops over responses (here: y1, y2 and y3). The problem is that I also want it done for each category of a grouping variable. In the example data below, I want to fit the regression y_i ~ x for all three y variables, which would give three regressions. But I want this done separately for group=a, group=b and group=c, resulting in 9 different regressions (preferably stored as lists). I can't figure out how to do it. Does anyone have an idea how to do this?
My idea so far was to use a for loop or lapply combined with dplyr::group_by, but I can't get it to work.
Example data (I have a much larger data set for the actual analysis).
set.seed(123)
dat <- data.frame(group = c(rep("a", 10), rep("b", 10), rep("c", 10)),
                  x = rnorm(30), y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30))

1) Use lmList in nlme (which comes with R so you don't have to install it).
library(nlme)
regs <- lmList(cbind(y1, y2, y3) ~ x | group, dat)
giving an lmList object having a component for each group. We show the component for group a and the other groups are similar.
> regs$a
Call:
lm(formula = object, data = dat, na.action = na.action)
Coefficients:
              y1       y2       y3
(Intercept)   0.2943   0.1395   0.4539
x             0.3721  -0.2206  -0.2255
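Each component is an ordinary fitted model, so (as a small optional extra, not part of the original answer) you can pull out the coefficients per group directly:
# coefficient matrix for one group's fit, or for all groups at once
coef(regs$a)
lapply(regs, coef)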
2) Another approach is to perform one overall lm giving an lm object having the same coefficients as above.
lm(cbind(y1, y2, y3) ~ group + x:group + 0, dat)
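As a small check (not in the original answer), the coefficient matrix of this combined fit has one column per response, with the per-group intercepts and per-group slopes as rows:
fit_all <- lm(cbind(y1, y2, y3) ~ group + x:group + 0, dat)
coef(fit_all)   # columns y1, y2, y3; rows are the group intercepts and group:x slopes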
3) We could also use one of several list comprehension packages. This gives a list of 9 components. The names of the components identify the combination used, as does the call component (shown in the Call: line of the output) within each main component. Note that the current CRAN version is 0.1.0, but the code below relies on listcompr 0.1.1, which can be obtained from GitHub until it is put on CRAN.
# remotes::install_github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
regs <- gen.named.list("{y}.{g}",
  do.call("lm",
          list(reformulate("x", y), quote(dat),
               subset = bquote(dat$group == .(g)))),
  y = c("y1", "y2", "y3"), g = unique(dat$group))
If you don't mind that the Call: line in the output is less descriptive then it can be simplified to:
gen.named.list("{y}.{g}", lm(reformulate("x", y), dat, subset = group == g),
               y = c("y1", "y2", "y3"), g = unique(dat$group))
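If you would rather stay in base R (the for-loop/lapply idea from the question), here is a minimal sketch with no packages; it returns a nested list (responses by groups) of the nine lm fits:
ys <- c("y1", "y2", "y3")
regs2 <- lapply(setNames(ys, ys), function(y)
  lapply(split(dat, dat$group), function(d) lm(reformulate("x", y), data = d)))
# e.g. regs2$y1$a is the regression of y1 on x within group "a"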
Note
The input has been corrected from the question, which had two y2's.
set.seed(123)
dat <- data.frame(group = c(rep("a", 10), rep("b", 10), rep("c", 10)),
                  x = rnorm(30), y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30))

Related

Remove linear dependent variables while using the bife package

Some modelling functions in R, such as lm(), automatically remove linearly dependent variables from their regression output. With the bife package, this does not seem to be possible. As stated in the package documentation on CRAN (page 5):
If bife does not converge this is usually a sign of linear dependence between one or more regressors
and the fixed effects. In this case, you should carefully inspect your model specification.
Now, suppose the problem at hand involves running many regressions and one cannot adequately inspect each regression output -- one has to apply some rule of thumb to the regressors. What are some alternatives for removing linearly dependent regressors more or less automatically and arriving at an adequate model specification?
I set up some example code below:
# sample code
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a = rep(c(0, 1), times = 20), x = x, y = x, z = z,
                 ID = 1:40, date = 1, Region = rep(c(1, 2, 3, 4), 10))
df2 = data.frame(a = c(rep(c(1, 0), times = 15), rep(c(0, 1), times = 5)),
                 x = 1.4*x + 4, y = 1.4*x + 4, z = 1.2*z + 5,
                 ID = 1:40, date = 2, Region = rep(c(1, 2, 3, 4), 10))
df3 = rbind(df1, df2)

for (i in 1:4) {
  dat_i = df3[df3$Region == i, ]   # subset one region (renamed so the x vector is not overwritten)
  model = bife::bife(a ~ x + y + z | ID, data = dat_i)
  results = data.frame(Region = i)
  results$Model = list(model)
  if (i == 1) {
    df4 = results
    next
  }
  df4 = rbind(df4, results)
}
Error: Linear dependent terms detected!
Since you're only looking at linear dependencies, you could simply leverage methods that detect them, like for instance lm.
Here's an example of a solution with the fixest package:
library(bife)
library(fixest)
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a=rep(c(0,1),times=20), x=x, y=x, z=z, ID=c(1:40), date=1, Region=rep(c(1,2, 3, 4),10))
df2 = data.frame(a=c(rep(c(1,0),times=15),rep(c(0,1),times=5)), x=1.4*x+4, y=1.4*x+4, z=1.2*z+5, ID=c(1:40), date=2, Region=rep(c(1,2,3,4),10))
df3 = rbind(df1, df2)
vars = c("x", "y", "z")
res_all = list()
for (i in 1:4) {
  x = df3[df3$Region == i, ]
  coll_vars = feols(a ~ x + y + z | ID, x, notes = FALSE)$collin.var
  new_fml = xpd(a ~ ..vars | ID, ..vars = setdiff(vars, coll_vars))
  res_all[[i]] = bife::bife(new_fml, data = x)
}
# Display all results
for(i in 1:4) {
cat("\n#\n# Region: ", i, "\n#\n\n")
print(summary(res_all[[i]]))
}
The functions needed here are feols and xpd, the two are from fixest. Some explanations:
feols, like lm, removes variables on-the-fly when they are found to be collinear. It stores the names of the collinear variables in the slot $collin.var (if none is found, it's NULL).
Unlike lm, feols also allows fixed effects, so you can include them when you look for linear dependencies: this way you can spot complex linear dependencies that also involve the fixed effects.
I've set notes = FALSE otherwise feols would have prompted a note referring to collinearity.
feols is fast (actually faster than lm for large data sets) so won't be a strain on your analysis.
The function xpd expands the formula and replaces any variable name starting with two dots with the associated argument that the user provides.
When the arguments of xpd are vectors, the behavior is to coerce them with pluses, so if ..vars = c("x", "y") is provided, the formula a ~ ..vars | ID will become a ~ x + y | ID.
Here it replaces ..vars in the formula with setdiff(vars, coll_vars), which is the vector of variables that were not found to be collinear.
So you get an algorithm with automatic variable removal before performing bife estimations.
Finally, just a side comment: in general it's better to store results in lists since it avoids copies.
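If the xpd step looks opaque, here is a quick standalone illustration of the expansion (just to show what the formula becomes, not part of the estimation):
fixest::xpd(a ~ ..vars | ID, ..vars = c("x", "z"))
#> a ~ x + z | ID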
Update
I forgot, but if you don't need bias correction (bife::bias_corr), then you can directly use fixest::feglm which automatically removes collinear variables:
res_bife = bife::bife(a ~ x + z | ID, data = df3)
res_feglm = fixest::feglm(a ~ x + y + z | ID, df3, family = binomial)
rbind(coef(res_bife), coef(res_feglm))
#>                x          z
#> [1,] -0.02221848 0.03045968
#> [2,] -0.02221871 0.03045990

Is there an R function that solves a second-order linear model?

I'm a beginner in R and programming, and I'm struggling with what is probably a simple task.
I've written code that fits a second-order model, and I want to plug variables into this model and find the Y value.
I've tried to use the predict function, but it's actually pretty complex and I couldn't get anywhere.
I did this so far:
modFOI <- rsm(Rendimento~FO(x1,x2,x3,x4)+TWI(x1,x2,x3,x4)+PQ(x1,x2,x3,x4),data=CR) # with interactions
summary(modFOI)
print(modFOI)
With that, I found the second-order model, but now I want to create variables like x1, x2, x3, plug them into the model, and find Y. I would also like to find the optimum Y.
Simplest way to create a polynomial (2nd order) that I can think of is the following:
DF <- data.frame(x = runif(10, 0, 1),
                 y = runif(10, 0, 1))
mod <- lm(y ~ x + I(x^2), data = DF)
predict(mod, newdata = data.frame(x = c(1, 2, 3, 4, 5)))
NB: when using predict, newdata must be a data.frame, and its variables must have the same names as the variables in the model (here, x).
Hope this helps
The optimum value is shown as the stationary point in the output of summary(modFOI). You may also run steepest(modFOI) to see a trace of the estimated values along the path of steepest ascent.
To predict, create a data frame with the desired sets of x values. For example,
testdat <- data.frame(x1 = -1:1, x2 = 0, x3 = 0, x4 = 1)
Then use the predict() function with this as newdata:
predict(modFOI, newdata = testdat)
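Putting the two parts together, a small sketch (an assumption on my part, relying on the coded predictors being named x1 to x4 and on summary.rsm storing the canonical analysis in $canonical): evaluate the fitted surface at the stationary point.
sp <- summary(modFOI)$canonical$xs            # coordinates of the stationary point
predict(modFOI, newdata = as.data.frame(as.list(sp)))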

R - Using apply only on specific columns

I am using a package in R that fits a specific form of a regression model. However, unlike the base lm() function that permits the x and y to be separate objects, the function that I'm using requires them to be in the same dataframe.
My problem arises because I have a lot of variables that I want to regress on y independently. Therefore, I have a dataframe with 10 predictor variables (x1, x2... x10) and one criterion variable (y), 11 columns in total. I could use a for loop to run ten separate regressions, but I want to avoid it and use the apply function instead. However, if I call apply on my dataframe, in the last step it will regress y on y itself, and I want to avoid this. Is there a function similar to apply which I could run and tell that I only want it to run 10 times and not 11, or is there another workaround to this problem?
Here's a tidyverse solution:
library( tidyverse )
xx <- c("disp", "hp", "drat", "wt") # Names of predictor variables
y <- "mpg" # Name of response
str_c( y, xx, sep="~" ) %>%
  map( as.formula ) %>%        # Optional (see below)
  map( lm, data = mtcars )
str_c simply builds up formulas as strings (e.g., "mpg~disp"). While lm accepts strings directly, your particular regression model might not. If it requires an actual formula, you can convert strings to formulas using as.formula (thanks for the suggestion, @J.Doe!). Other than that, simply replace lm with your particular model and mtcars with your data frame.
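To make the "replace lm" point concrete, here is a hypothetical swap (gaussian glm used purely as a stand-in for whatever model function your package provides):
str_c( y, xx, sep="~" ) %>%
  map( as.formula ) %>%
  map( ~ glm(.x, data = mtcars, family = gaussian()) )   # swap glm for your model function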
Here's the same solution using base R without any additional packages:
strs <- paste( y, xx, sep="~" )
strs <- lapply( strs, as.formula ) # Optional
lapply( strs, lm, data=mtcars )
Using the builtin anscombe data frame having columns x1, x2, x3, x4, y1, y2, y3, y4 suppose we want to regress y1 on each of x1, x2, x3, x4 separately.
First create a character vector of the names of the independent variables, xnames, and then use lapply to run the run_lm function defined below over it. That function pastes together the required formula and runs lm, returning an "lm" object. L, the result, is a list of such objects, one for each regression.
No packages are used.
xnames <- names(anscombe)[1:4]
run_lm <- function(nm) lm(paste("y1 ~", nm), anscombe)
L <- lapply(xnames, run_lm)
Alternatively, this shorter version of run_lm would also work with the above lapply, but the Call: line in the output is not as nice:
run_lm <- function(nm) lm(anscombe[c("y1", nm)])
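As an optional follow-up (not in the original answer), you can name the list by predictor and collect the slope estimates in one go:
names(L) <- xnames
sapply(L, function(fit) coef(fit)[2])   # slope of y1 on each of x1..x4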

Plotting SVM Linear Separator in R

I'm trying to plot the 2-dimensional hyperplanes (lines) separating a 3-class problem with e1071's svm. I used the default method (so there is no formula involved) like so:
library('e1071')
## S3 method for class 'default':
machine <- svm(x, y, kernel="linear")
I cannot seem to plot it by using the plot.svm method:
plot(machine, x)
Error in plot.svm(machine, x) : missing formula.
But I did not use the formula method, I used the default one, and if I pass '~' or '~.' as a formula argument it'll complain about the matrix x not being a data.frame.
Is there a way of plotting the fitted separator/s for the 2D problem while using the default method?
How may I achieve this?
Thanks in advance.
It appears that although svm() allows you to specify your input using either the default or formula method, plot.svm() only allows a formula method. Also, by only giving x to plot.svm(), you are not giving it all the info it needs. It also needs y.
Try this:
library(e1071)
x <- prcomp(iris[,1:4])$x[,1:2]
y <- iris[,5]
df <- data.frame(x, y = y)   # keep y as a factor so svm() does classification
machine <- svm(y ~ PC1 + PC2, data=df)
plot(machine, data=df)
It appears that your x has more than two feature-variables or columns.
Since plot.svm() plots only 2-Dimensions at a time, you need to specify these dimensions explicitly by providing a formula argument.
Example (more than two variables: fix two dimensions):
data(iris)
m2 <- svm(Species ~ ., data = iris)
plot(m2, iris, Petal.Width ~ Petal.Length, slice = list(Sepal.Width = 3, Sepal.Length = 4))
In cases where the data frame has only two predictor dimensions, you can omit the formula argument.
Example (a simple two-dimensional case):
data(cats, package = "MASS")
m <- svm(Sex ~ ., data = cats)
plot(m, cats)
These details can be found in the plot.svm() documentation here: https://www.rdocumentation.org/packages/e1071/versions/1.7-3/topics/plot.svm

Pass glm predictors from a list

I have a large set of model specifications to test, which share a dv but have unique IVs. In the following example
foo <- data.frame(dv = sample(c(0, 1), 100, replace = TRUE),
                  x1 = runif(100),
                  x2 = runif(100))
I want the first model to only include x1, the second x2, the third both, and the fourth their interaction. So I thought a sensible way would be to build a list of formula statements:
bar <- list("x1",
"x2",
"x1+x2",
"x1*x2")
which I would then use in a llply call from the plyr package to obtain a list of model objects.
require(plyr)
res <- llply(bar, function(i) glm(dv ~ i, data = foo, family = binomial()))
Unfortunately I'm told
Error in model.frame.default(formula = dv ~ i, data = foo, drop.unused.levels = TRUE): variable lengths differ (found for 'i')
Obviously I'm mixing up something fundamental--do I need to manipulate the original foo list in some fashion?
Your problem is with how you are specifying the formula, since inside the function i is a variable. This would work:
glm(paste("dv ~", i), data = foo, family = binomial())
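For completeness, a short sketch of how that fix plugs into the original llply call (wrapping the string in as.formula, which is always safe even if the modelling function does not accept character formulas):
library(plyr)
res <- llply(bar, function(i)
  glm(as.formula(paste("dv ~", i)), data = foo, family = binomial()))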
The problem is that dv ~ i isn't a formula. i is (inside the anonymous function) simply a symbol that represents a variable containing a character value.
Try this:
bar <- list("dv~x1",
"dv~x2",
"dv~x1+x2",
"dv~x1*x2")
res <- llply(bar, function(i) glm(i, data = foo, family = binomial()))
Setting statistical issues aside, it might be easier to use something like ?step or ?stepAIC from the MASS package for tasks like this.
