Where does R store subset information in glm object? - r

I'm trying to do some post-processing of a large number of glm models that I am working with, but I need to extract information about the data subset from the glm objects.
As a toy example:
x <- rnorm(100)
y <- rnorm(100,x,0.5)
s <- sample(c(TRUE, FALSE), 100, replace = TRUE)
myGlm <- glm(y ~ x, subset = s)
From this, I need to know which of the 100 observations were used by getting the information out of myGlm. I thought that myGlm$data would have the subsetted data, but it actually has all 100 observations in it. I looked through str(myGlm) to no avail. However, it is quite clear that somewhere in the object, information about the subset s is stored.
This seems like it should be totally trivial!
Thanks in advance.

as.numeric(rownames(myGlm$model))
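To spell out why the one-liner above works: by default glm keeps the model frame in myGlm$model, and the row names of that frame are carried over from the original observations, so they identify the rows that survived the subset. A minimal sketch follows, reusing the toy example from the question. The set.seed call and the name used are additions for illustration only.

```r
set.seed(42)  # added only to make the sketch reproducible
x <- rnorm(100)
y <- rnorm(100, x, 0.5)
s <- sample(c(TRUE, FALSE), 100, replace = TRUE)
myGlm <- glm(y ~ x, subset = s)

# Row names of the stored model frame are the original observation indices
used <- as.integer(rownames(myGlm$model))

# Cross-check against the logical subset vector
identical(used, which(s))
```

Note that this relies on the model frame being stored, so it would not work for a model fitted with model = FALSE.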

Related

Removing Outliers 3SDs from the mean of a monoexponential function in R

I have a large data set that analyzes exercising subjects' oxygen consumption over time (x= Time, y = VO2). This data fits a monoexponential function.
Here is a brief, sample data frame:
'''
VO2 <- c(11.71,9.84,17.96,18.87,14.58,13.38,5.89,20.28,20.03,31.17,22.07,30.29,29.08,32.89,29.01,29.21,32.42,25.47,30.51,37.86,23.48,40.27,36.25,19.34,36.53,35.19,22.45,30.23,3,19.48,25.35,22.74)
Time <- c(0,2,27,29,31,33,39,77,80,94,99,131,133,134,135,149,167,170,177,178,183,184,192,222,239,241,244,245,250,251,255,256)
DF <- data.frame(VO2,Time)
'''
(Plot of the data omitted.) Note that this sample is much smaller than the full data set and therefore might not fit a function as well.
I am somewhat new to R and very much not a mathematical expert. I would appreciate your help with the two goals of this data set.
Based on typical conventions of the laboratory I work in, this data should be fit to a monoexponential function. I would love some insight into fitting data to a function such as this. Note that I have many similar data sets (for different subjects) and need to fit a monoexponential function to each of them. It would be best if the fit could be applied generically across my data sets.
Based on this monoexponential function, I would like to identify and remove any outlying points. Here I will define an outlier as any point >3 standard deviations from the mean of the monoexponential function.
So far, I have this (unsuccessful) code to fit a function to the above data. Not only does it not fit well, but I am also unable to create a smooth function.
'''
fit <- lm(VO2~poly(Time,2,raw=TRUE))
xx <- seq(1,250, length=32)
plot(Time,VO2,pch=19,ylim=c(0,50))+
lines(xx, predict(fit, data.frame(DF=xx)), col="red")
'''
Thank you to all the individuals who have commented and provided their valuable feedback. As I continue to learn and research, I will add to this post with successful/less successful attempts at the code for this process. Thank you for your knowledge, assistance and understanding.
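One possible way to approach both goals is nonlinear least squares with nls, then flagging points whose residual is more than 3 standard deviations of the residuals away from the fitted curve. This is only a sketch under several assumptions: the data below are simulated stand-ins (not the measurements above), the model form base + amp * (1 - exp(-Time / tau)) and the starting values are guesses, and "3 SDs from the mean of the function" is read as "3 SDs of the residuals from the fitted curve".

```r
set.seed(1)

# Simulated stand-in data (NOT the poster's measurements)
Time <- seq(0, 250, by = 5)
VO2 <- 10 + 25 * (1 - exp(-Time / 60)) + rnorm(length(Time), sd = 1.5)
VO2[40] <- 3  # inject one obvious outlier for demonstration
DF <- data.frame(VO2, Time)

# Monoexponential rise: baseline + amplitude * (1 - exp(-Time / tau))
# The starting values are assumptions and would need adjusting per data set
fit <- nls(VO2 ~ base + amp * (1 - exp(-Time / tau)),
           data = DF,
           start = list(base = 10, amp = 25, tau = 60))

# Flag points more than 3 residual SDs away from the fitted curve
res <- residuals(fit)
is_outlier <- abs(res) > 3 * sd(res)
DF_clean <- DF[!is_outlier, ]

# Evaluate the fit on a fine grid to draw a smooth curve
xx <- seq(min(DF$Time), max(DF$Time), length.out = 200)
plot(VO2 ~ Time, data = DF, pch = 19)
lines(xx, predict(fit, newdata = data.frame(Time = xx)), col = "red")
```

The newdata column has to be named Time, matching the variable in the formula, otherwise predict cannot evaluate the curve on the grid. To apply this across subjects, the fitting and flagging steps could be wrapped in a function and called once per data frame. nls may fail to converge on noisy data if the starting values are poor.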

How do I create a simple linear regression function in R that iterates over the entire dataframe?

I am working through ISLR and am stuck on a question. Basically, I am trying to create a function that iterates through an entire dataframe. It is question 3.7, 15a.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
So my thinking is like this:
y = Boston$crim
x = Boston[, -crim]
TestF1 = lm(y ~ x)
summary(TestF1)
But this is nowhere near the right answer. I was hoping to break it down by:
Iterate over the entire dataframe with crim as my response and the others as predictors
Extract the p values that are statistically significant (or extract the ones insignificant)
Move on to the next question (which is considerably easier)
But I am stuck. I've googled but can't find anything. I tried this combn(Boston) thing but it didn't work either. Please help, thank you.
If your problem is to iterate over a data frame, here is an example for mtcars (mpg is the target variable, and the rest are predictors, assuming models with a single predictor). The idea is to generate strings and convert them to formulas:
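# One list slot per predictor (every column except mpg)
lms <- vector(mode = "list", length = ncol(mtcars) - 1)
for (i in seq_along(lms)) {
  # Build a formula such as mpg ~ cyl from the i-th predictor name
  lms[[i]] <- lm(as.formula(paste0("mpg ~ ", names(mtcars)[-1][i])), data = mtcars)
}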
If you want to consider all the variables together instead, start from a model with all variables and then eliminate non-significant predictors to find the best model.
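For the "extract the p values" step in the question, a sketch along the same lines might look like this. It uses mtcars again as a stand-in for Boston, reformulate() rather than pasted strings, and a 0.05 cutoff, all of which are choices made here for illustration.

```r
# Every column except the response is treated as a single predictor
predictors <- setdiff(names(mtcars), "mpg")

# Fit one simple linear regression per predictor
fits <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "mpg"), data = mtcars)
})
names(fits) <- predictors

# Row 2 of the coefficient table is the slope; pull out its p-value
p_values <- sapply(fits, function(f) summary(f)$coefficients[2, "Pr(>|t|)"])

# Predictors whose slope is significant at the (assumed) 0.05 level
significant <- names(p_values)[p_values < 0.05]
```

For the Boston data the same pattern should apply with "crim" as the response and data = Boston.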

Why can't I find my factor names when I extract residuals?

I'm working with some election data trying to separate it by "state" and "election."
I ran a regression with fixed effects for state and year (as you'll see below), got my summary data, and have been trying to use the resid() function to extract the residuals.
m5 <- lm(demVote ~ state*year, data=presidentialElections)
plot(resid(m5) ~ fitted(m5))
resid.m5 <- resid(m5)
I think it all worked above just perfectly. However, here's where I'm lost - if I do summary(resid.m5) (where I put the extracted residuals, or so I thought), I can't seem to find my factor names anymore. If I want to see my residuals per state or per year (or an average of them by state/year, for example) then how do I access that with the resid() function? Thanks!
Just as was said in the comments before, you have to realize that the residuals that are being returned are in the same order as your observations in the data set.
Here is an example using the iris data set that comes with every R installation (and a probably quite nonsensical regression):
data(iris)
m5 <- lm(Sepal.Length ~ Species*Sepal.Width, data=iris)
resid.m5 <- resid(m5)
dta.complete <- data.frame(iris, r.m5=resid.m5)
Here, the residuals are combined with the original data. It is perhaps a little unorthodox, but why not keep things together. Now you can use all the classical subsetting as much as you like. For instance:
with(dta.complete, by(r.m5, Species, mean))
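The same idea can be checked and summarised without building the combined data frame first. In this sketch (the name by_species is introduced here for illustration), the residual vector carries the row names of the observations used in the fit, so it can be grouped directly by the matching factor column.

```r
m5 <- lm(Sepal.Length ~ Species * Sepal.Width, data = iris)
resid.m5 <- resid(m5)

# Residuals are named after the rows of the data used in the fit
identical(names(resid.m5), rownames(iris))

# Mean residual per group, computed directly from the factor column
by_species <- tapply(resid.m5, iris$Species, mean)
by_species
```

If the data contain missing values, rows are dropped during fitting and the residual vector becomes shorter than the data. Fitting with na.action = na.exclude is one way to keep the residuals aligned with the original rows.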
Good luck!

Calculate many AUCs in R

I am fairly new to R. I am using the ROCR package in R to calculate AUC, which I can do for one predictor just fine. What I am looking to do is perform many AUC calculations for 100 different variables.
What I have done so far is the following:
varlist <- names(mydata)[2:101]
formlist <- lapply(varlist, function(x) paste0("prediction(",x,"mydata$V1))
However then the formulas are in text format, and the as.formula is giving me an error. Any help appreciated! Thanks in advance!
The function inside your lapply looks like it is just outputting a statement like prediction(varmydata$V1). I am guessing you actually want to run that command. If so, you probably want something like
lapply(varlist, function(x) prediction(mydata[[x]], mydata$V1))
but it is hard to tell without a reproducible situation. Also, it looks like your code has a missing quote.
If I understand you correctly, you want to use the first column of mydata as predictions, and all other variables as labels, one after the other.
Is this the correct way to treat mydata? This way is rather uncommon. It is more common to have the same true labels for several different predictions (e.g. iterated cross validation, comparison of different classifiers).
However, to answer your original question:
predictions and labels need to have the same shape for ROCR::prediction, e.g.
either as matrices:
prediction(matrix(rep(mydata$V1, ncol(mydata) - 1), nrow = nrow(mydata)), mydata[, -1])
or as lists:
prediction(mydata[rep(1, ncol(mydata) - 1)], mydata[-1])
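As a cross-check that does not depend on ROCR, the AUC can also be computed in base R from ranks, since it equals the normalised Mann-Whitney statistic. The sketch below swaps in that technique rather than calling ROCR. It assumes V1 holds 0/1 labels and every other column is a predictor, as in the question's code. The data are simulated, and auc_rank, good and noise are hypothetical names.

```r
set.seed(1)

# Simulated stand-in for mydata: V1 is a 0/1 label, other columns are predictors
n <- 200
mydata <- data.frame(V1 = rbinom(n, 1, 0.5))
mydata$good  <- mydata$V1 + rnorm(n)  # informative predictor
mydata$noise <- rnorm(n)              # uninformative predictor

# AUC via the rank-sum (Mann-Whitney) identity; rank() uses midranks for ties
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One AUC per predictor column, returned as a named numeric vector
varlist <- names(mydata)[-1]
aucs <- sapply(varlist, function(v) auc_rank(mydata[[v]], mydata$V1))
aucs
```

With the real data, varlist would be names(mydata)[2:101] as in the question.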

Rpart Variables were specified with different types from the fit?

I make a classification tree using rpart. The data has 10 columns, all properly labeled. Five of these columns contain information such as the day of the week in the form of "Wed" and the other five contain numeric values.
I can successfully make a tree using Rpart, but when I try to run a test set of the data, or even the training set that made the tree, I get a bunch of warnings saying that the variables containing characters were changed to a factor, and then an error that says those same variables were specified with a different type from the fit.
Anyone know how to fix this?
My relevant code should be
library(rpart)
#read data into info
info <- data.frame(info)
set.seed(30198)
train_ind <- sample(1:2000, 1500)
training_data_info <- info[train_ind, ]
test_data_info <- info[-train_ind, ]
training_data_info <- data.frame(training_data_info)
test_data_info <- data.frame(test_data_info)
tree <- rpart(info ~ ., data = training_data_info, method = "class")
info.test.fit <- predict(tree, newdata=test_data_info) #this is where it goes wrong
You can't use character vectors in an rpart fit. You have to code them as factors. The code does this for you, but then you hit the problem that it is entirely possible for the test data to have a different set of levels from the training data used to fit the tree.
The error arises from the use of these two lines:
training_data_info <- data.frame(training_data_info)
test_data_info <- data.frame(test_data_info)
These lines are redundant, because the objects are already data frames. All they achieve is to drop those levels from the whole dataset that are missing in either the training or test datasets. And that is where the error comes from. Try without those two lines and you should be good to go.
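If the columns are still character when the data are split, one option is to convert them to factors once, on the full data set, before splitting, so the training and test sets share the same levels. A minimal sketch with made-up data follows. The columns day and value and the object names are hypothetical, and the rpart call itself is left out.

```r
set.seed(30198)

# Made-up stand-in for info: one character column, one numeric column
info <- data.frame(day = sample(c("Mon", "Tue", "Wed"), 20, replace = TRUE),
                   value = rnorm(20),
                   stringsAsFactors = FALSE)

# Convert every character column to a factor BEFORE splitting
is_chr <- vapply(info, is.character, logical(1))
info[is_chr] <- lapply(info[is_chr], factor)

# Row subsetting keeps the full set of factor levels in both pieces
train_ind <- sample(seq_len(nrow(info)), 15)
training_data_info <- info[train_ind, ]
test_data_info <- info[-train_ind, ]

identical(levels(training_data_info$day), levels(test_data_info$day))
```

Because the levels are fixed before the split, a level that happens to be absent from the test rows still exists as a level there, which should keep the column types consistent between fitting and prediction.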
