First of all, I use RStudio.
I have a data frame (APD) and I would like to fit it with respect to the factor Serial_number. The fit is an lm fit. Then I would like to use this fit to do a calibration (calibrate() from the investr package).
Currently I use the following lines to fit by Serial_number:
library(dplyr)
library(broom)
Coefficients <- APD %>%
  group_by(Serial_number) %>%
  do(tidy(fit <- lm(log(log(Amplification)) ~ Voltage_transformed, .)))
But here I cannot apply the calibrate() function: calibrate() needs an object that inherits from "lm", whereas tidy() only returns a data frame of coefficient summaries, not the fitted model objects themselves.
Do you have an idea?
In your posted code, you're trying to rbind the predicted values from each model, not the coefficients. The function for coefficients is just coefficients(object).
I would also suggest un-nesting your code, since nesting makes it hard to read and change later on. Here are two generalized functions (each makes assumptions, so edit as needed):
lm_by_variable <- function(data_, formula_, byvar) {
  by(
    data_,
    data_[[byvar]],
    FUN = lm,
    formula = formula_,
    simplify = FALSE
  )
}
combine_coefficients <- function(fit_list) {
  all_coefficients <- lapply(fit_list, coefficients)
  do.call('rbind', all_coefficients)
}
lm_by_variable(...) should be pretty self-evident: group by byvar, use lm with the given formula on each subset, and don't simplify the result. Simplifying results is really only useful for interactive work. In a script, it's better to know exactly what will be returned. In this case, a list.
The next function, combine_coefficients(...) returns a matrix of the fitted coefficients. It assumes every fitted model in fit_list has the same terms. We could add logic to make it more robust, but that doesn't seem necessary in this case.
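For the original question, usage would then look something like this (a sketch: it assumes APD has the columns from the posted code, and the calibrate() call is only indicated schematically):
fits <- lm_by_variable(APD, log(log(Amplification)) ~ Voltage_transformed, "Serial_number")
combine_coefficients(fits)  # one row of coefficients per serial number
# Each element of fits is a plain lm object, so investr can be used per group, e.g.:
# investr::calibrate(fits[[1]], ...)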
I am working through ISLR and am stuck on a question. Basically, I am trying to create a function that iterates through an entire dataframe. It is question 3.7, 15a.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
So my thinking is like this:
y = Boston$crim
x = Boston[, -crim]
TestF1 = lm(y ~ x)
summary(TestF1)
But this is nowhere near the right answer. I was hoping to break it down by:
Iterate over the entire dataframe with crim as my response and the others as predictors
Extract the p values that are statistically significant (or extract the ones insignificant)
Move on to the next question (which is considerably easier)
But I am stuck. I've googled but can't find anything. I tried this combn(Boston) thing but it didn't work either. Please help, thank you.
If your problem is to iterate over a data frame, here is an example for mtcars (mpg is the target variable, and the rest are predictors, assuming models with a single predictor). The idea is to generate formula strings and convert them to formulas:
lms <- vector(mode = "list", length = ncol(mtcars) - 1)
for (i in seq_along(lms)) {
  lms[[i]] <- lm(as.formula(paste0("mpg ~ ", names(mtcars)[-1][i])), data = mtcars)
}
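From there, the second step in the question (finding the statistically significant predictors) is just a matter of pulling the p-values out of each summary; a minimal sketch building on the lms list above:
# p-value of the single predictor in each model (second row of the coefficient table)
pvals <- sapply(lms, function(m) summary(m)$coefficients[2, "Pr(>|t|)"])
names(pvals) <- names(mtcars)[-1]
pvals[pvals < 0.05]  # predictors with a statistically significant association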
If you want to look at each and every variable combination, start with a model with all variables and then eliminate non-significant predictors to find the best model.
The toy model below stands in for one with a bunch more variables, transforms, lags, etc. Assume I got that stuff right.
My data is ordered in time, but is not formatted as an R time series, because I need to exclude certain periods, etc. I'd rather not make it a time series for this reason, because I think it would be easy to muck up, but if I need to, or if it greatly simplifies the estimation process, I'd like to just use an integer sequence, such as index. below, to represent time if that is allowed.
My problem is a simple one (I hope). I would like to use the first part of my data to estimate the coefficients of the model. Then I want to use those estimates, and not estimates from a sliding window, to do one-ahead forecasts for each of the remaining values of that data. The idea is that the formula is applied with a sliding window even though it is not estimated with one. Obviously I could retype the model with coefficients included and then get what I want in multiple ways, with base R sapply, with tidyverse dplyr::mutate or purrr::map_dbl, etc. But I am morally certain there is some standard way of pulling the formula out of the lm object and then wielding it as one desires, that I just haven't been able to find. Example:
library(dplyr)   # for lag()
library(tibble)  # for tibble()
set.seed(1)
x1 <- 1:20
y1 <- 2 + x1 + lag(x1) + rnorm(20)
index. <- x1
data. <- tibble(index., x1, y1)
mod_eq <- y1 ~ x1 + lag(x1)
lm_obj <- lm(mod_eq, data.[1:15, ])
and I want something along the lines of:
my_forecast_values <- apply_eq_to_data(eq = get_estimated_equation(lm_obj), my_data = data.[16:20])
and the lag shouldn't give me an error.
Also, this is not part of my question per se, but I could use a pointer to a nice tutorial on using R formulas and the standard estimation output objects produced by lm, glm, nls and the like. Not the statistics, just the programming.
For what it's worth, the common way to use the coefficients is by calling the predict(), coefficients(), or summary() function on the model object. You might try the ?predict.lm documentation for details on how the formula is handled.
A simple example:
data.$lagx <- dplyr::lag(data.$x1, 1)                # create lag variable
lm_obj1 <- lm(y1 ~ x1 + lagx, data = data.[2:15, ])  # create model object
data.$pred1 <- NA
data.$pred1[16:20] <- predict(lm_obj1, newdata = data.[16:20, ])  # predict new data; needs the same column names
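If you really do want to pull the estimated equation out of the lm object and apply it yourself rather than going through predict(), one way is via the model's terms and a model matrix; a sketch using the same objects as above (delete.response() and model.matrix() are standard stats functions):
cf <- coef(lm_obj1)  # (Intercept), x1, lagx
X <- model.matrix(delete.response(terms(lm_obj1)), data = data.[16:20, ])
drop(X %*% cf)       # the same values predict() returns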
I ran a three way repeated measures ANOVA with ezANOVA.
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
                   within = .(A, B, C), type = 3, detailed = TRUE)
I'm trying to see what's going on with the residuals via a qq plot, but I don't know how to get to them or if they're even there. With my lme models I simply extract them from the model
main_data$model_residuals <- as.numeric(residuals(model_1))
and plot them
residuals_qq <- ggplot(main_data, aes(sample = model_residuals)) +
  stat_qq(color = "black", alpha = 1, size = 2) +
  geom_abline(intercept = mean(main_data$model_residuals),
              slope = sd(main_data$model_residuals))
I'd like to use ggplot since I'd like to keep a sense of consistency in my graphing.
EDIT
Maybe I wasn't clear about what I'm trying to do. With lme models I can simply create a variable model_residuals in the main_data data.frame from the residuals object, which then contains the residuals I plot in ggplot. I want to know if something similar is possible for the residuals in ezANOVA, or if there is some other way I can get hold of the residuals for my ANOVA.
I had the same trouble with ezANOVA. The solution I went for was to switch to ez.glm (from the afex package). Both ezANOVA and ez.glm wrap a function from a different package, so you should get the same results.
This would look like this for your example:
anova_1 <- ez.glm("id", "rt", main_data, within = c("A", "B", "C"), return = "full")
nice.anova(anova_1$Anova)  # show the ANOVA table like ezANOVA does
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
Hope that helps.
Edit: a few changes to make it work with the latest version:
anova_1 <- aov_ez("id", "rt", main_data, within = c("A", "B", "C"))
print(anova_1)
print(anova_1$Anova)
summary(anova_1$Anova)
summary(anova_1)
Then you can pull out the lm object and get your residuals in the usual way:
residuals(anova_1$lm)
A quite old post, I know, but it's possible to use ggplot to plot the residuals after modeling your data with the ez package by using this function:
proj(ez_outcome$aov)[[3]][, "Residuals"]
then:
qplot(proj(ez_outcome$aov)[[3]][, "Residuals"])
Hope it helps.
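And if you want the qq-style plot from the question rather than a histogram, the same vector can be fed into stat_qq(); a small sketch reusing the ez_outcome object above:
library(ggplot2)
res_df <- data.frame(res = proj(ez_outcome$aov)[[3]][, "Residuals"])
ggplot(res_df, aes(sample = res)) + stat_qq()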
Also potentially adding to an old post, but I butted up against this problem as well, and since this is the first thing that pops up when searching for this question, I thought I might add how I got around it.
I found that if you include the return_aov = TRUE argument in the ezANOVA call, the residuals are in there, but ezANOVA partitions them up in the resulting list within each main and interaction effect, similar to what base aov() does if you include an Error term for subject ID, as in this case.
These can be pulled out into their own list with purrr by mapping the residuals function over this aov sublist of the ezANOVA output, rather than the main output. So, from the question example, it becomes:
anova_1 <- ezANOVA(data = main_data, dv = .(rt), wid = .(id),
                   within = .(A, B, C), type = 3, detailed = TRUE, return_aov = TRUE)
ezanova_residuals <- purrr::map(anova_1$aov, residuals)
This will produce a list where each entry holds the residuals from one part of the ezANOVA model for the effects and interactions, i.e. $`(Intercept)`, $id, $`id:A`, $`id:B`, $`id:A:B`, etc.
I found it useful to then stitch these together in a tibble using enframe and unnest (as the list components will probably be different lengths) to very quickly get them in a long format, that can then be plotted or tested:
library(tibble)  # for enframe()
library(tidyr)   # for unnest()
ezanova_residuals_tbl <- enframe(ezanova_residuals) %>% unnest()
hist(ezanova_residuals_tbl$value)
shapiro.test(ezanova_residuals_tbl$value)
I've not used this myself but the mapping idea also works for the coefficients and fitted.values functions to pull them out of the ezANOVA results, if needed. They might come out in some odd formats and need some extra manipulation afterwards though.
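For example, following the same pattern as for the residuals (untested, so treat it as a sketch):
ezanova_coefs <- purrr::map(anova_1$aov, coefficients)
ezanova_fitted <- purrr::map(anova_1$aov, fitted.values)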
Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time, I get a mysterious warning from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must also be present in the data passed to predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers/time, redundant, and, not least, is causing the problem you are hitting. When you now call predict, it will look for a variable in testset literally named trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict():
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them is legibility. Getting the object name out of the formula allows you to see more easily what the model is.
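A complete toy example of the correct pattern (the data here is made up purely for illustration; the row counts just mirror the warning message above):
set.seed(42)
trainingset <- data.frame(independent = rnorm(142))
trainingset$dependent <- 2 * trainingset$independent + rnorm(142)
testset <- data.frame(independent = rnorm(34))
c_lm <- lm(dependent ~ independent, data = trainingset)  # fit via the data argument
c_pred <- predict(c_lm, newdata = testset)               # no name clash, no warning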
I am trying to use the tapply() function to run models by several categories, without much success. My data has 20 clinics and I want to run the models BY each clinic.
Here's my model:
library(quantreg)
attach(qregdata)
rq(logA ~ dose + chtcm + cage + raceth + sex, tau = .9)
My data has a variable clinic (with values 1-20). Does anybody know how to run this model BY clinic in R, as in other statistical packages?
A very general way of accomplishing this is shown in the following. The ddply function runs a supplied function (in this case lm) for each clinic. You can also run it on more complex cross-sections of your data. E.g. .(clinic,level) would run a separate model on each combination of clinic and level. The term lm(y~x)$coef[1] gets the intercept of the linear model. I think there is no easy way to save all the output of each model fit at once.
n <- 10
clinic <- factor(rep(1:3, each = n))
x <- rep(0:(n - 1), 3)
y <- rnorm(3 * n) * x
d <- data.frame(clinic, x, y)
# plot data and linear fits
library(ggplot2)
ggplot(d, aes(x, y)) + geom_point() + facet_wrap(~ clinic) + stat_smooth(method = 'lm')
# run a separate model for each clinic
library(plyr)
ddply(d, .(clinic), summarize, intercept = lm(y ~ x)$coef[1], slope = lm(y ~ x)$coef[2])
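That said, if you would rather keep the whole fitted object for each clinic instead of picking out pieces, plyr's dlply() comes close; a short sketch on the same toy data:
# keep one full lm fit per clinic, as a named list
fits <- dlply(d, .(clinic), function(df) lm(y ~ x, data = df))
lapply(fits, summary)  # then inspect or extract from each fit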
You could use lapply() across the unique values of clinic, and then use subset() to extract the section of your dataset for each clinic. Then just fit the model to that subset.
This will return a list of models, which you can then process further.
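A minimal sketch of that idea with the rq() call from the question (it assumes qregdata contains the clinic variable):
library(quantreg)
clinics <- unique(qregdata$clinic)
fits <- lapply(clinics, function(cl) {
  rq(logA ~ dose + chtcm + cage + raceth + sex, tau = .9,
     data = subset(qregdata, clinic == cl))
})
names(fits) <- clinics
lapply(fits, summary)  # process the list of models further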
I had a similar issue to this recently and wanted to share a response in case someone is still interested in this topic; sorry to dredge up an old post.
tapply is very convenient to work with when the input object (the object being "split") is a vector. If the input object being split is a rectangular data set, it can be much simpler to use the (aptly named, in this case) by function, which is a convenient wrapper for tapply intended for data.frame objects. The return object of the by function is of class "by", which can be simplified to an array or a list using the argument simplify = TRUE.
Certainly there are more efficient ways to perform this operation, but if you are looking for a tapply-like solution - by is it.
Here's an example using lm to regress petal width on sepal width "by" species in the iris data set:
## Load iris data
data(iris)
## Fit a model to each species-specific subset of the data
fitBySpecies <- by(
data = iris,
INDICES = iris$Species,
FUN = function(speciesSubset)
lm(Petal.Width ~ Sepal.Width, data = speciesSubset)
)
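Each element of the result is an ordinary lm fit, so the usual extractors apply; for example, the coefficients can be collected into a matrix:
sapply(fitBySpecies, coef)  # one column of coefficient estimates per species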