Extract Table Outputs like Summary and ANOVA from a "by" Object - r

I have a regression model created with by. I know I can use sapply to extract specific parts of the model for each factor, but what if I wanted something like the whole summary, anova, etc.?
model <- with(data, by(data, factor, function(data) lm(y ~ x, data=data)))

sapply tries to simplify its result, so the output of summary.lm and anova.lm gets coerced to a matrix. You probably want lapply instead, which applies a function (here summary) to each element of the list produced by by and returns a list.
models <- by(warpbreaks, warpbreaks$tension, function(x) {
  lm(breaks ~ wool, data = x)
})
lapply(models, summary)
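The same pattern should carry over to other extractors. Here is a minimal sketch on the warpbreaks example: lapply with anova gives one ANOVA table per group, and coef(summary(m)) pulls out just the coefficient table. The names anova_tables and coef_tables are only illustrative.

```r
# Fit one model per tension level, as in the answer above
models <- by(warpbreaks, warpbreaks$tension, function(x) {
  lm(breaks ~ wool, data = x)
})

# One ANOVA table per group, kept as a list
anova_tables <- lapply(models, anova)

# Only the coefficient table (estimate, std. error, t value, p value) per group
coef_tables <- lapply(models, function(m) coef(summary(m)))

# Look up a single group's results by its factor level
anova_tables[["L"]]
coef_tables[["M"]]
```

Keeping the results in a list means each element keeps its own class, so the usual print methods still apply.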

Related

map with tidy eval with paste

I need to run lm on nested data using map. I wrapped lm in a function so that I can vary the set of independent variables in the model. The code looks like this:
predictors1 <- c("TWO_YEAR_PRIOR_SCALE_0", "PV1.Dim1")
student_progress_model <- function(df, predictors) {
  lm(
    as.formula(paste("SCALE_SCORE_0", paste(predictors, collapse = "+"), sep = "~")),
    data = df, na.action = na.exclude)
}
data_nested <- data_nested %>%
  mutate(
    model = map(data, student_progress_model(df = ., predictors = predictors1)),
    stand_resids = map2(data, model, rstandard)
  )
It fails with the error that object 'SCALE_SCORE_0' is not found.
How can I specify that my variables are in the data passed to the function? Does it have something to do with tidy eval?
I tried putting {{}} in my function, like this:
  lm(
    as.formula(paste("SCALE_SCORE_0", paste({{predictors}}, collapse = "+"), sep = "~")),
    data = df, na.action = na.exclude)
But it does not help...
Also, will rstandard give me standardised residuals this way?
Thank you
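A likely cause: map expects a function, but student_progress_model(df=., ...) is a call that gets evaluated right away, and inside the pipe the dot refers to the whole nested data frame, which has no SCALE_SCORE_0 column. Tidy eval is probably not needed here, since the formula is built from strings. Here is a minimal sketch of one way around it, using mtcars as a stand-in for the real data; fit_model and its response argument are hypothetical names, not from the question.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for student_progress_model, with the response as an argument
fit_model <- function(df, response, predictors) {
  lm(
    as.formula(paste(response, paste(predictors, collapse = "+"), sep = "~")),
    data = df, na.action = na.exclude)
}

predictors1 <- c("wt", "hp")

# Nested data: one row per cyl value, with a list column of data frames
pieces <- split(mtcars, mtcars$cyl)
data_nested <- tibble(cyl = names(pieces), data = unname(pieces))

data_nested <- data_nested %>%
  mutate(
    # The ~ turns the call into a function that map runs once per element (.x)
    model = map(data, ~ fit_model(.x, "mpg", predictors1)),
    # rstandard takes the fitted model as its first argument, so map over model only
    stand_resids = map(model, rstandard)
  )
```

Passing the extra argument through map, as in map(data, fit_model, response = "mpg", predictors = predictors1), should behave the same way.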

how to pass variable through model.matrix() in r

I'm creating a function that performs cross-validation and ridge regression to select predictors for a model. The inputs of my function are dataframe and the desired outcome variable outcome (what is being predicted). I'm using model.matrix() to create an x matrix that I will pass to glmnet(). My function uses outcome as the object argument in model.matrix(), but it looks like outcome is the wrong data type to pass through model.matrix(). Using model.matrix() normally, I would write something like model.matrix(Weight~.,dataframe). In this case, however, model.matrix won't work as model.matrix(outcome~.,dataframe) or model.matrix(dataframe$outcome~.,dataframe). Any ideas?
If 'outcome' is an object that stores the string "Weight", then we can build the formula with paste and formula:
model.matrix(formula(paste(outcome, "~ .")), dataframe)
A reproducible example with the 'iris' dataset:
data(iris)
outcome <- "Species"
m1 <- model.matrix(formula(paste(outcome, "~ .")), iris)
m2 <- model.matrix(Species ~ ., iris)
identical(m1, m2)
#[1] TRUE
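Since the question is about doing this inside a function, here is a minimal sketch of how the pieces might be wrapped up. The helper name make_xy is hypothetical, and dropping the intercept column is an assumption based on glmnet fitting its own intercept.

```r
# Hypothetical helper: build the x matrix and y vector for a given outcome name
make_xy <- function(dataframe, outcome) {
  f <- formula(paste(outcome, "~ ."))                   # e.g. Sepal.Length ~ .
  x <- model.matrix(f, dataframe)[, -1, drop = FALSE]   # drop the intercept column
  y <- dataframe[[outcome]]                             # [[ ]] works with a string
  list(x = x, y = y)
}

xy <- make_xy(iris, "Sepal.Length")
dim(xy$x)
head(xy$y)

# The result could then be passed on, for example (not run here):
# glmnet::cv.glmnet(xy$x, xy$y, alpha = 0)   # alpha = 0 requests ridge
```

Note that model.matrix drops rows containing NA by default, so with missing data x and y may end up with different lengths; iris has no missing values, so the sketch does not hit that case.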

Predict() + assign() loop across variables

I have estimated several models (a, b) and I want to calculate predicted probabilities for each model using a single data frame (df) and store the predicted probabilities of each model as new variables in that data frame. For example:
a <- lm(y ~ z, df) # estimate model a
b <- glm(w ~ x, df) # estimate model b
models <- c("a","b") # create vector of model objects
for (i in models) {
  assign(
    paste("df$", i, sep = ""),
    predict(i, df)
  )
}
I have tried the above but receive the error "no applicable method for 'predict' applied to an object of class "character"" with the last word changing as I change class of the predicted object, e.g. predict(as.numeric(i),df).
Any ideas? Ideally I could vectorize this as well.
You should rarely have to use assign(), and $ should not be used with variable names. The [[]] operator is better for dynamic subsetting than $. It would also be easier if you made a list of the models rather than just their names. Here's an example:
df <- data.frame(x=runif(30), y=runif(30), w=runif(30), z=runif(30))
a <- lm(y ~ z, df) # estimate model a
b <- lm(w ~ x, df) # estimate model b
models <- list(a=a, b=b) # create a named list of model objects
# 1) for loop
for (m in names(models)) {
  df[[m]] <- predict(models[[m]], df)
}
Or, rather than a for loop, you could generate all the values with Map and then append them with cbind afterward:
# 2) Map/cbind
df <- cbind(df, Map(function(m) predict(m,df), models))
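If you only have the model names as strings, as in the question, mget can turn them into a named list first. Here is a minimal sketch, assuming binary outcomes and logistic models (the question mentions predicted probabilities); the simulated data and the seed are made up for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(x = runif(30), z = runif(30))
df$y <- rbinom(30, 1, 0.5)  # made-up binary outcomes
df$w <- rbinom(30, 1, 0.5)

a <- glm(y ~ z, family = binomial, data = df)
b <- glm(w ~ x, family = binomial, data = df)

model_names <- c("a", "b")
models <- mget(model_names)  # look the objects up by name: list(a = a, b = b)

# type = "response" returns probabilities rather than values on the link scale
preds <- lapply(models, predict, newdata = df, type = "response")
df <- cbind(df, as.data.frame(preds))
head(df)
```

Without type = "response", predict on a binomial glm returns values on the logit scale, which are not probabilities.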

How to subset a range of values in lm()

The help file for lm() doesn't go into the syntax for the subset argument. I am not sure how to get it to find the line of best fit for only a portion of my data set. This question is similar, but I wasn't able to solve my particular problem using it. How does the subset argument work in the lm() function?
Here is my code:
with(dat[dat$SIZE < 7 & dat$SIZE > 0.8, ],
     plot(SP.RICH ~ SIZE, log = "x",
          xlim = c(1, 9), ylim = c(60, 180), ylab = "plant species richness",
          xlab = "log area (ha)", type = "n"))
with(dat[dat$SIZE < 7 & dat$SIZE > 0.8, ], points(SP.RICH ~ SIZE, pch = 20, cex = 1))
fit <- lm(SP.RICH ~ SIZE, subset = c(1:7))
I would like to make sure that the regression line is drawn only for the values that I subset above in the plot() and points() commands.
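The subset argument in lm() and other model-fitting functions takes an expression that is evaluated within the data frame and is normally a logical vector with one element per row. A numeric index vector is also accepted, which is why subset=c(1:7) picks rows 1 to 7 rather than a range of SIZE values. So, if I understand you correctly, I would use the following: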
fit <- lm(SP.RICH~SIZE, data=dat, subset=(SIZE>0.8 & SIZE<7))
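To check that subset does what you expect, you can compare it with a fit on the pre-filtered data frame. Here is a minimal sketch using mtcars as a stand-in for dat (mpg ~ wt in place of SP.RICH ~ SIZE); the cut-offs 2 and 4 are arbitrary.

```r
# Rows inside the range of interest (stand-in for SIZE > 0.8 & SIZE < 7)
in_range <- mtcars$wt > 2 & mtcars$wt < 4

# Same model fitted two ways: via subset, and on the filtered data frame
fit_subset   <- lm(mpg ~ wt, data = mtcars, subset = (wt > 2 & wt < 4))
fit_filtered <- lm(mpg ~ wt, data = mtcars[in_range, ])
all.equal(coef(fit_subset), coef(fit_filtered))

# End points of the fitted line, restricted to the subset's range of wt
xs <- range(mtcars$wt[in_range])
line_ends <- data.frame(wt = xs)
line_ends$mpg <- predict(fit_subset, newdata = line_ends)
line_ends

# After the plot() and points() calls, the segment could be drawn with:
# lines(line_ends$wt, line_ends$mpg)
```

Using lines() on the two end points keeps the segment inside the subset's range, whereas abline(fit) would extend across the whole plot.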
But the above solution does not help if you want to run one lm for each group in your data. Let's say that you have different countries as a column and you want to understand the relationship between richness and size within each country.
For that I recommend following the help page for the function by in R (http://astrostatistics.psu.edu/su07/R/html/base/html/by.html):
require(stats)
attach(warpbreaks)
by(warpbreaks[, 1:2], tension, summary)
by(warpbreaks[, 1], list(wool = wool, tension = tension), summary)
by(warpbreaks, tension, function(x) lm(breaks ~ wool, data = x))
## now suppose we want to extract the coefficients by group
tmp <- by(warpbreaks, tension, function(x) lm(breaks ~ wool, data = x))
sapply(tmp, coef)
From the list tmp you can extract any lm parameters you like.

Accessing fitted.values when using ddply

I am using ddply to execute glm on subsets of my data. I am having difficulty accessing the estimated Y values. I am able to get the model parameter estimates using the below code, but all the variations I've tried to get the fitted values have fallen short. The dependent and independent variables in the glm model are column vectors, as is the "Dmsa" variable used in the ddply operation.
Define the model:
Model <- function(df){coef(glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=df))}
Execute the model on subsets:
Modrpt <- ddply(msadata, "Dmsa", Model)
Printing Modrpt gives the model coefficients, but no Y estimates.
I know that if I weren't using ddply, I could access the glm estimated Y values with this code:
Model <- glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=msadata)
fits <- Model$fitted.values
I have tried both of the following to get the fitted values for the subsets, but no luck:
fits <- fitted.values(ddply(msadata, "Dmsa", Model))
fits <- ddply(msadata, "Dmsa", fitted.values(Model))
I'm sure this is very easy to code... unfortunately, I'm just learning R. Does anyone know where I am going wrong?
You can use an anonymous function in your call to ddply e.g.
require(plyr)
data(iris)
model <- function(df) {
  lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = df)
}
ddply(iris, "Species", function(x) fitted.values(model(x)))
This has the advantage that you can also, without rewriting your model function, get the coef values by doing
ddply(iris, "Species", function(x) coef(model(x)))
As @James points out, this will fall down if you have splits of unequal size; it is better to use dlply, which puts the result of each subset in its own list element.
(I make no claims for statistical relevance or correctness of the example model - it is just an example)
I'd recommend doing this in two steps:
library(plyr)
# First fit the models
models <- dlply(iris, "Species", lm,
formula = Petal.Length ~ Sepal.Length + Sepal.Width )
# Next, extract the fitted values
ldply(models, fitted.values)
# Or, in long format (one row per fitted value)
ldply(models, function(m) data.frame(fitted = fitted.values(m)))
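For groups of unequal size, a long format (one row per fitted value) avoids the problem mentioned above. Here is a minimal base R sketch of the same split, fit, and combine idea without plyr; iris_unbalanced is a made-up unbalanced version of iris, and the object names are only illustrative.

```r
# Made-up unbalanced data: drop 10 setosa rows so the groups differ in size
iris_unbalanced <- iris[-(1:10), ]

# Fit one model per species and return its fitted values as a small data frame
fits <- lapply(split(iris_unbalanced, iris_unbalanced$Species), function(d) {
  m <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = d)
  data.frame(Species = d$Species, fitted = fitted(m))
})

# Stack the per-group results into one long data frame
fitted_long <- do.call(rbind, fits)
table(fitted_long$Species)
head(fitted_long)
```

Because each group contributes rows rather than columns, the groups do not need to have the same number of observations.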
