I am using ddply to execute glm on subsets of my data. I am having difficulty accessing the estimated Y values. I am able to get the model parameter estimates using the below code, but all the variations I've tried to get the fitted values have fallen short. The dependent and independent variables in the glm model are column vectors, as is the "Dmsa" variable used in the ddply operation.
Define the model:
Model <- function(df){coef(glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=df))}
Execute the model on subsets:
Modrpt <- ddply(msadata, "Dmsa", Model)
Printing Modrpt gives the model coefficients, but no Y estimates.
I know that if I weren't using ddply, I could access the glm estimated Y values with this code:
Model <- glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=msadata)
fits <- Model$fitted.values
I have tried both of the following to get the fitted values for the subsets, but no luck:
fits <- fitted.values(ddply(msadata, "Dmsa", Model))
fits <- ddply(msadata, "Dmsa", fitted.values(Model))
I'm sure this is very easy to code...unfortunately, I'm just learning R. Does anyone know where I am going wrong?
You can use an anonymous function in your call to ddply, e.g.
require(plyr)
data(iris)
model <- function(df){
lm( Petal.Length ~ Sepal.Length + Sepal.Width , data = df )
}
ddply( iris , "Species" , function(x) fitted.values( model(x) ) )
This has the advantage that you can also, without rewriting your model function, get the coef values by doing
ddply( iris , "Species" , function(x) coef( model(x) ) )
As @James points out, this will fall down if you have splits of unequal size; it is better to use dlply, which puts the result for each subset in its own list element.
(I make no claims for statistical relevance or correctness of the example model - it is just an example)
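For unequal splits, a minimal sketch of that dlply approach (reusing the model function defined above; the fits name is just for illustration) might look like:
fits <- dlply(iris, "Species", model)   # one lm fit per group, kept in a list
lapply(fits, fitted.values)             # fitted values for each group
lapply(fits, coef)                      # coefficients for each group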
I'd recommend doing this in two steps:
library(plyr)
# First, fit the models
models <- dlply(iris, "Species", lm,
formula = Petal.Length ~ Sepal.Length + Sepal.Width )
# Next, extract the fitted values
ldply(models, fitted.values)
# Or maybe
ldply(models, function(mod) as.data.frame(fitted.values(mod)))
I am using lapply to perform several glm regressions, regressing one dependent variable on one independent variable at a time, but I'm not sure how to extract all the P values at once.
There are 200 features in my dataset, but the code below only gave me the P value of feature #1. How can I get a matrix of all P values for the 200 features?
valName<- as.data.frame(colnames(repeatData))
featureName<-valName[3,]
lapply(featureName,
function(var) {
formula <- as.formula(paste("outcome ~", var))
fit.logist <- glm(formula, data = repeatData, family = binomial)
summary(fit.logist)
Pvalue<-coef(summary(fit.logist))[,'Pr(>|z|)']
})
I simplified your code a little bit: (1) I used reformulate() (not really different, just prettier), and (2) I returned only the p-value for the focal variable (not the intercept p-value). (If you leave out the 2, you'll get a 2-row matrix with intercept and focal-variable p-values.)
My example uses the built-in mtcars data set, with an added (fake) binomial response.
repeatData <- data.frame(outcome=rbinom(nrow(mtcars), size=1, prob=0.5), mtcars)
ff <- function(var) {
formula <- reformulate(var, response="outcome")
fit.logist <- glm(formula, data = repeatData, family = binomial)
coef(summary(fit.logist))[2, 'Pr(>|z|)']
}
## skip first column (response variable).
sapply(names(repeatData)[-1], ff)
I'm creating a function that performs cross-validation and ridge regression to select predictors for a model. The inputs of my function are a data frame (dataframe) and the desired outcome variable (outcome, the variable being predicted). I'm using model.matrix() to create an x matrix that I will pass to glmnet(). My function uses outcome as the object argument in model.matrix(), but it looks like outcome is the wrong data type to pass to model.matrix(). Using model.matrix() normally, I would write something like model.matrix(Weight~., dataframe). In this case, however, model.matrix() won't work as model.matrix(outcome~., dataframe) or model.matrix(dataframe$outcome~., dataframe). Any ideas?
If 'outcome' is the object that stores the string "Weight", then we can paste it into a formula:
model.matrix(formula(paste(outcome, "~ .")), dataframe)
A reproducible example with 'iris' dataset
data(iris)
outcome <- "Species"
m1 <- model.matrix(formula(paste(outcome, "~ .")), iris)
m2 <- model.matrix(Species ~ ., iris)
identical(m1, m2)
#[1] TRUE
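If the goal is then to pass this to glmnet() for ridge regression, something along these lines should work (a sketch; it assumes the glmnet package and a numeric outcome such as Weight, and it drops the intercept column that model.matrix() adds, since glmnet fits its own intercept):
library(glmnet)
x <- model.matrix(formula(paste(outcome, "~ .")), dataframe)[, -1]  # drop the intercept column
y <- dataframe[[outcome]]
cvfit <- cv.glmnet(x, y, alpha = 0)  # alpha = 0 corresponds to ridge regression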
I have a large data set for which I need to run a linear model comparing groups.
I need to find the p-values for group comparisons using a linear model. There are four groups (so I need 1~2, 1~3, 1~4, 2~3, 2~4, 3~4) and there are 130 columns for which the data from these groups needs to be compared. Any help would be greatly appreciated!!
I have this, which gives me exactly what I need.
fit<-lm(variable~group, data=data)
summary(fit)
However, with all of the groups and columns, I have nearly 800 comparisons to make, so I want to avoid doing this manually. I tried writing a for loop, but it isn't working.
k<-data.frame()
for (i in 1:130){
[i,1]<-colnames(data)
fit<- lm(i~group, data=data)
[i,2] <- fit$p.value
}
But this has given me a variety of different errors. I really just need the p-values. Help would be greatly appreciated!! Thank you!
(2016-06-18) Your question is not completely answerable at this stage. In the following, I shall point out several problems.
How to get p-value properly
I assume you want the p-value of the F-statistic for the model, as an indication of goodness of fit. Supposing your fitted model is fit, we should do it this way:
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
As a demonstration, I will use the built-in dataset trees.
fit <- lm(Height ~ Girth, trees)
## truncated output of summary(fit)
# > summary(fit)
# Residual standard error: 5.538 on 29 degrees of freedom
# Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
# F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
## > p_value
# [1] 0.002757815
So, p_value agrees with the printed summary.
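As a small aside (not part of the original answer), asking pf() for the upper tail directly avoids losing precision when the p-value is extremely small:
p_value <- unname(pf(fstatistic[1], fstatistic[2], fstatistic[3], lower.tail = FALSE))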
Your loop
I suggest you use vectors rather than a data frame during the computation/update.
variable <- character(130)
p.value <- numeric(130)
You can combine the results at the end to a data frame via:
k <- data.frame(var = variable, p.value = p.value)
Why? Because this is memory efficient! Now, after those corrections, we arrive at:
variable <- character(130)
p.value <- numeric(130)
for (i in 1:130) {
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
Further problems
I still don't think the above code will work, because I am not sure whether the following is correct:
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
During the loop, data is not changed, so colnames(data) returns the full vector of column names; hence variable[i] <- colnames(data) will not work as intended (it assigns only the first column name, with a warning).
i~group looks odd. Do you have i in your data?
I can't help you solve these issues, as I have no idea what your data look like. But if you could post a subset of your data, that would help.
Follow-up (2016-06-19)
Thank you. This has been extremely helpful. I don't have "i" in my data, but I was hoping that I could use that to represent the different column names, so that it goes through all of them. Is there a way to assign column names numbers so that this would work?
Yes, but I need to know what you have for each column.
Column 1 has a group number. The following columns have data for different factors I am looking at.
OK, so I think ncol(data) = 131, where the first column is group, and the remaining 130 columns are what you will test. Then this should work:
variable <- colnames(data)[-1]
p.value <- numeric(130)
for (i in 1:130) {
fit <- lm(as.formula(paste(variable[i], "group", sep = "~")), data = data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
It is possible to use sapply() instead of the above for loop, but I think there is no performance difference, as the loop overhead is tiny compared with lm() and summary().
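For reference, a sapply() version might look like the sketch below (same assumption: the first column of data is group, and the remaining columns are the responses):
p.value <- sapply(colnames(data)[-1], function(v) {
  fit <- lm(as.formula(paste(v, "~ group")), data = data)
  fstatistic <- summary(fit)$fstatistic
  unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
})
k <- data.frame(var = names(p.value), p.value = p.value)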
I think this can get you started at least. It uses the dplyr and broom packages. The basic idea is to define all the formulas you want as character strings and then use lapply() to run them through lm().
library(dplyr)
library(broom)
# Generate a vector of wanted formulas
forms <- c("mpg ~ cyl", "mpg ~ wt")
# Function to apply formula
lmit <- function(form){
tidy(lm(as.formula(form), mtcars)) %>%
mutate(formula = form)
}
# Apply it and bind into a dataframe
results <- bind_rows(lapply(forms, lmit))
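To adapt this to the question's data, the formula vector can be built programmatically (a sketch; it assumes the grouping column is named group, the other columns are the responses, and lmit is pointed at data rather than mtcars):
lmit <- function(form){
  tidy(lm(as.formula(form), data)) %>%
    mutate(formula = form)
}
forms <- paste(colnames(data)[-1], "~ group")  # one formula per response column
results <- bind_rows(lapply(forms, lmit))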
The help file for lm() doesn't go into the syntax for the subset argument. I am not sure how to get it to find the line of best fit for only a portion of my data set. This question is similar, but I wasn't able to solve my particular problem using it. How does the subset argument work in the lm() function?
Here is my code:
with(dat[dat$SIZE <7 & dat$SIZE > 0.8 ,], plot(SP.RICH~SIZE, log="x",
xlim=c(1,9), ylim=c(60,180), ylab="plant species richness",
xlab="log area (ha)", type="n"))
with(dat[dat$SIZE <7 & dat$SIZE > 0.8 ,], points(SP.RICH~SIZE, pch=20, cex=1))
fit=lm(SP.RICH~SIZE, subset=c(1:7))
I would like to make sure that the regression line is drawn only for the values that I subset above in the plot() and points() commands.
The subset parameter in lm() and other model-fitting functions takes as its argument a logical vector of the same length as the dataframe, evaluated in the environment of the dataframe. So, if I understand you correctly, I would use the following:
fit <- lm(SP.RICH~SIZE, data=dat, subset=(SIZE>0.8 & SIZE<7))
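Since the plot uses a log x-axis, a safe way to draw that fit only over the plotted range is to predict over a grid of SIZE values and add the result with lines() (a sketch using the object names from the question):
ss <- seq(0.8, 7, length.out = 100)                         # SIZE values covering the plotted range
lines(ss, predict(fit, newdata = data.frame(SIZE = ss)))    # fitted line over the subset only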
But the above solution does not help if you want to run one lm for each group in your data - let's say that you have different countries as a column and you want to understand the relationship between richness and size within each country.
For that I recommend following the help for the by() function in R: http://astrostatistics.psu.edu/su07/R/html/base/html/by.html
require(stats)
attach(warpbreaks)
by(warpbreaks[, 1:2], tension, summary)
by(warpbreaks[, 1], list(wool = wool, tension = tension), summary)
by(warpbreaks, tension, function(x) lm(breaks ~ wool, data = x))
## now suppose we want to extract the coefficients by group
tmp <- by(warpbreaks, tension, function(x) lm(breaks ~ wool, data = x))
sapply(tmp, coef)
From the list tmp you can extract any lm parameters you like.
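For example, a couple of sketches of pulling other quantities out of tmp:
sapply(tmp, function(m) summary(m)$r.squared)              # R-squared for each tension group
sapply(tmp, function(m) coef(summary(m))[2, "Pr(>|t|)"])   # p-value of the wool effect per group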
I have multiple cox models (with one variable static in all models) and am trying to extract the coefficient for that variable.
In all models the coefficient is indexed the same way: for example, in model1 it is model1[[8]][1]; for model2 it is model2[[8]][1], etc. I attempted to create a for loop in R as shown below, but it's not working.
Could someone help me understand why I am getting an error when running the following code?
for (i in 1:5) {
coef[i] <- exp(summary(model[i])[[8]][1])
}
I get the following error: "object 'model' not found".
Many thanks in advance
Here is an example of what I meant in my comment
data(iris)
model1 <- lm(data = iris, Sepal.Length ~ Sepal.Width + Species)
model2 <- lm(data = iris, Sepal.Length ~ Sepal.Width)
You can do this so you don't have to type all the models.
model.list <- mget(grep("model[0-9]+$", ls(), value = TRUE))
ls() lists all the objects you have, and grep() picks out those whose names are "model" followed by a number.
coefs <- lapply(model.list, function(x) coef(x)[2])
unlist(coefs)
# Sepal.Width Sepal.Width
#   0.8035609  -0.2233611
Here's a generalized example:
model1 <- 1:5
model2 <- 2:6
I can execute a function like mean to find the average of each vector with a for loop:
for(i in 1:2) print(mean(get(paste0('model', i))))
#[1] 3
#[1] 4
It works, but a more standard approach is to use a list object. Then I can execute the desired function with built-in functions like sapply():
lst <- list(model1, model2)
sapply(lst, mean)
#[1] 3 4
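Applied back to the original question, the same pattern might look like this (a sketch; it assumes model1 through model5 exist in the workspace and are coxph fits from the survival package, for which coef() returns the log hazard ratios):
models <- mget(paste0("model", 1:5))            # collect model1 ... model5 into a list
sapply(models, function(m) exp(coef(m)[1]))     # hazard ratio for the first coefficient of each model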