Regression model BY categories using tapply() in R - r

I am trying to use the tapply() function to run models by several categories, without much success. My data has 20 clinics and I want to run the model separately for each clinic.
Here's my model:
attach(qregdata)
rq(logA~ dose+ chtcm + cage +raceth + sex,tau=.9)
My data has a variable clinic (with values 1-20). Does anybody know how to run this model BY clinic in R, as in other statistical packages?

A very general way of accomplishing this is shown in the following. The ddply function runs a supplied function (in this case lm) for each clinic. You can also run it on more complex cross-sections of your data. E.g. .(clinic,level) would run a separate model on each combination of clinic and level. The term lm(y~x)$coef[1] gets the intercept of the linear model. I think there is no easy way to save all the output of each model fit at once.
n <- 10
clinic <- factor(rep(1:3,each=n))
x <- rep(0:(n-1),3)
y <- rnorm(3*n)*x
d <- data.frame(clinic,x,y)
# plot data and linear fits
library(ggplot2)
ggplot(d,aes(x,y)) + geom_point() + facet_wrap(~clinic) + stat_smooth(method='lm')
# run a separate model for each clinic
library(plyr)
ddply(d,.(clinic),summarize,intercept=lm(y~x)$coef[1],slope=lm(y~x)$coef[2])

You could use lapply across the unique values of clinic, and then use subset to extract the part of your dataset for that clinic. Then just fit the model to the subset.
This will return a list of models, which you can then process further.
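As a minimal sketch of this approach: the data below are simulated, lm is used as a stand-in for the model, and the names d, clinic_ids, fits and coef_table are made up for illustration. Presumably rq from quantreg could be swapped in for lm at the same place.

```r
set.seed(1)

# Simulated stand-in data: 3 clinics with 10 observations each
d <- data.frame(clinic = factor(rep(1:3, each = 10)), x = rep(0:9, 3))
d$y <- 2 + 0.5 * d$x + rnorm(nrow(d))

# Fit one model per clinic; each fit sees only that clinic's rows
clinic_ids <- levels(d$clinic)
fits <- lapply(clinic_ids, function(id) {
  lm(y ~ x, data = subset(d, clinic == id))
})
names(fits) <- clinic_ids

# The result is a list of full model objects, so nothing is thrown away
coef_table <- t(sapply(fits, coef))  # one row per clinic
print(coef_table)
```

Because fits holds the complete model objects, summary(fits[["2"]]) or predict(fits[["2"]], ...) still work afterwards.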

I had a similar issue to this recently and wanted to share a response in case someone is still interested in this topic; sorry to dredge up an old post.
tapply is very convenient to work with when the input object (the object being "split") is a vector. If the input object being split is a rectangular data set, it can be much simpler to use the (aptly named, in this case) by function, which is a convenient wrapper for tapply intended for data.frame objects. The return object of the by function is of the class by which can be simplified to an array or a list using the argument simplify = TRUE.
Certainly there are more efficient ways to perform this operation, but if you are looking for a tapply-like solution - by is it.
Here's an example using lm to regress petal width on sepal width "by" species in the iris data set:
## Load iris data
data(iris)
## Fit a model to each species-specific subset of the data
fitBySpecies <- by(
  data = iris,
  INDICES = iris$Species,
  FUN = function(speciesSubset)
    lm(Petal.Width ~ Sepal.Width, data = speciesSubset)
)
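Once the models are fitted, the by object can be treated much like a list. The following sketch repeats the fit in compact form and collects the coefficients; the names fit_by_species and coef_matrix are just for illustration.

```r
# Same per-species fit as above, written compactly
fit_by_species <- by(
  iris,
  iris$Species,
  function(species_subset) lm(Petal.Width ~ Sepal.Width, data = species_subset)
)

# Each element is an ordinary lm object, so coef() applies to each in turn
coef_matrix <- sapply(fit_by_species, coef)  # 2 coefficients x 3 species
print(coef_matrix)

# Individual fits can be pulled out by species name
summary(fit_by_species[["setosa"]])
```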

Related

How to use Amelia package to get a best time series model in R

I'm trying to handle the missing data in a data frame using multiple imputation, and my professor advised me to use the Amelia package. I can build the time series model, but when I try to use the lapply function to run the time series model on each imputed dataset, I get an error from the function passed to lapply.
My data frame has three variables: date, pm25 and pm10. I can build an AR model for pm25.
And the imputation code is:
imp <- amelia(Exetertibble, m=50, ts = "date")
So I get 50 imputations, and the time series model would look like this:
model1 <- arima(imp$imputations$imp1$pm25, order = c(1,0,0))
Then I try to use lapply function:
extractcoefs <- lapply(imp$imputations, coef(model1))
There is an error saying that coef(model1) is not a function, character or symbol.
My aim is to combine the 50 imputations and get the best estimate of the coefficients of the time series model, but I don't know how to write a correct function there.
I also tried:
extractcoefs <- lapply(imp$imputations, coef(arima(order=c(1,0,0))))
and:
extractcoefs <- lapply(imp$imputations, arima(order=c(1,0,0)$coef))
I am not sure what you are trying to do.
Look at this example for lapply:
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
So you give lapply a list and it applies a function to each of the list elements. In this case the function is mean().
So for this example you will get the mean for a, beta and logic.
You are using lapply on imp$imputations.
You got imp from your call to the amelia() function, which gives you an instance of the S3 class "amelia". This instance includes several objects; one of them is the list imputations, whose elements are the imputed datasets (in your case 50).
So lapply(imp$imputations, coef(model1)) tries to apply its second argument to every imputed dataset. The problem is that the second argument isn't a function: coef(model1) is the result of a function call, not a function. Also, you can't apply coef() to the imputed datasets directly. You must apply coef() to a model object, because it returns the coefficients from the model.
I guess you want to do the following:
Generate your m=50 imputed datasets
Build an arima model for each dataset
Get the coefficients of each of these models
You could just use a for loop through the m=50 datasets for this.
Take this as an example:
library(Amelia)
data(africa)
imp <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", m = 5)
# fit one model per imputed dataset and print its coefficients
for (i in seq_along(imp$imputations)) {
  model <- arima(imp$imputations[[i]]$gdp_pc)
  coe <- coef(model)
  print(coe)
}
With m = 5, as in this example, this prints 5 sets of coefficients, one for each arima model built on an imputed dataset. With your m = 50 you would get 50 results.
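The same thing can be written with lapply by passing an actual function as the second argument. The sketch below does not call amelia(); instead it simulates a list of data frames (the hypothetical imputations object) as a stand-in for imp$imputations, each with a pm25 column. Averaging the per-imputation estimates is the usual way to combine point estimates across imputations; combining standard errors takes more work and is not shown.

```r
set.seed(42)

# Hypothetical stand-in for imp$imputations: 5 data frames with a pm25 column
imputations <- lapply(1:5, function(i) {
  data.frame(pm25 = 10 + as.numeric(arima.sim(list(ar = 0.5), n = 200)))
})
names(imputations) <- paste0("imp", 1:5)

# The second argument is a function of one dataset:
# fit the AR(1) model, then extract its coefficients
coef_list <- lapply(imputations, function(dat) {
  coef(arima(dat$pm25, order = c(1, 0, 0)))
})

# One row per imputed dataset, one column per coefficient
coef_matrix <- do.call(rbind, coef_list)

# Combined point estimates: the average over imputations
pooled <- colMeans(coef_matrix)
print(pooled)
```

With the real object, replacing imputations by imp$imputations should be the only change needed, assuming each imputed dataset has a pm25 column.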

How to predict dependent values using fitted model in r?

I am fitting a model with:
var4pca <- lm(lg[5:415,1] ~ pcalg1$x[, 1:8] + pcalg2$x[, 1:8] + pcalg3$x[, 1:8] + pcalg4$x[, 1:8])
I now want to predict values for a validation set(83 rows). How can I do this?
I am trying to use:
pred_pca<-predict(var4pca, va)
where va is my validation set. But this is returning me a vector with length 411, whereas I only want length 83
In my experience, lm is very fussy about prediction. It demands that the new data look exactly like the data used to fit the model; in particular, the column names have to match. What typically works is to put all the data in one data frame and then create df.train and df.test from the appropriate rows of that data frame. That should do the trick. As joran says, be careful with formulas. One advantage of putting all the data into a data frame with named columns is that you can then use the formula depvar ~ ., which is typically much easier to write.
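A minimal sketch of that workflow, with simulated data and made-up names (df_all, df_train, df_test, depvar): because the formula refers only to column names, predict() can find the same columns in newdata and returns one prediction per row of the test set.

```r
set.seed(1)

# All data in one data frame with named columns (simulated for illustration)
df_all <- data.frame(
  depvar = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100)
)

# Split by rows into training and test sets
df_train <- df_all[1:80, ]
df_test  <- df_all[81:100, ]

# The formula uses column names only, never vectors indexed in place
fit <- lm(depvar ~ ., data = df_train)

# newdata has the same column names, so predictions match its 20 rows
pred <- predict(fit, newdata = df_test)
length(pred)
```

In the question, the formula refers to objects such as pcalg1$x[, 1:8] directly, so predict() most likely could not find matching columns in va and fell back to the original 411 rows.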

How to run regTermTest on every coefficient in a regression model in R?

I want to run a Wald test to evaluate the statistical significance of each coefficient in the model using the regTermTest function of the survey package.
The syntax of regTermTest calls for the model followed by the test.terms, but if you list multiple test terms it seems to evaluate them all together rather than separately.
library(caret) # for the GermanCredit sample dataset
data(GermanCredit)
mod1 <- glm(Class ~ Age + as.factor(ForeignWorker) + Property.RealEstate + Housing.Own + CreditHistory.Critical, data = GermanCredit, family = binomial(link='logit'))
library(survey)
regTermTest(mod1, c("Age", "ForeignWorker", "Property.RealEstate", "Housing.Own", "CreditHistory.Critical"))
Of course, I could separate them out this way, but it's clunky and repetitive (i.e. the following code produces the desired result but is inefficient when dealing with lots of variables):
regTermTest(mod1, "Age")
regTermTest(mod1, "ForeignWorker")
regTermTest(mod1, "Property.RealEstate")
regTermTest(mod1, "Housing.Own")
regTermTest(mod1, "CreditHistory.Critical")
I've tried extracting the coefficient names into a vector and inserting it into a for loop, but it didn't work (it combines all the terms into one evaluation rather than separately estimating their importance):
vars <- names(mod1$coefficients)
vars <- vars[-1]
for (i in 1:length(vars)) {
  iv = vars[i]
  rtest <- regTermTest(mod1, iv)
}
How can I efficiently code this?
(Updated)
The *apply family can help, depending on how you want things to look.
lapply(names(mod1$model)[-1], function(x) regTermTest(mod1, x))
sapply(names(mod1$model)[-1], function(x) regTermTest(mod1, x))
You'll have a bit of work to do if you wanted to display the results in a nice way.
(Explanation of update).
The original solution just followed the questioner's idea to use names(mod1$coefficients). But that won't work if there is a factor variable, since mod1$coefficients will contain the name(s) of the variable concatenated with non-default values in the way R regression models always deal with categorical variables. That confuses regTermTest because it goes looking for a variable in the dataset that doesn't exist so it returns a baffling error message.
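To see the mismatch described above without needing the survey package, here is a small sketch on the built-in mtcars data (a different model from the one in the question; the object names are just for illustration). It compares the coefficient names with the term names.

```r
# A model with one numeric term and one factor term
mod <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Coefficient names: the factor term is expanded into one name per non-reference level
coef_names <- names(coef(mod))[-1]
print(coef_names)

# Term names: one entry per term in the formula, which is what a per-term test needs
term_labels <- attr(terms(mod), "term.labels")
print(term_labels)

# names(mod$model)[-1] gives the same term names in this case
model_frame_names <- names(mod$model)[-1]
print(model_frame_names)
```

Each element of term_labels is the kind of string one would pass to regTermTest in the lapply call above.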

Fit and calibrate data frame via factors

First of all, I use RStudio.
I have a data frame (APD) and I would like to fit it with respect to the factor Serial_number. The fit is an lm fit. Then I would like to use this fit to do a calibration (calibrate() from the investr package).
Currently I use following lines to fit via Serial_number:
library(dplyr)
library(broom)
Coefficients <- APD %>%
  group_by(Serial_number) %>%
  do(tidy(fit <- lm(log(log(Amplification)) ~ Voltage_transformed, .)))
But here I cannot apply the calibrate() function. The calibrate function needs an object that inherits from "lm", and tidy only works for S3/S4 objects.
Do you have an idea?
In your posted code, you're trying to rbind the predicted values from each model, not the coefficients. The function for coefficients is just coefficients(object).
I would also suggest un-nesting your code, since that makes it hard to read and change later on. Here are two generalized functions (each make assumptions, so edit as needed):
lm_by_variable <- function(data_, formula_, byvar) {
  by(
    data_,
    data_[[byvar]],
    FUN = lm,
    formula = formula_,
    simplify = FALSE
  )
}
combine_coefficients <- function(fit_list) {
  all_coefficients <- lapply(fit_list, coefficients)
  do.call('rbind', all_coefficients)
}
lm_by_variable(...) should be pretty self-evident: group by byvar, use lm with the given formula on each subset, and don't simplify the result. Simplifying results is really only useful for interactive work. In a script, it's better to know exactly what will be returned. In this case, a list.
The next function, combine_coefficients(...) returns a matrix of the fitted coefficients. It assumes every fitted model in fit_list has the same terms. We could add logic to make it more robust, but that doesn't seem necessary in this case.
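As a rough usage sketch of the same pattern on the built-in iris data (not the APD data from the question), the point is that every element of the result is still a full lm object, which is what a function expecting something that inherits from "lm" needs. The names fits, all_lm and coef_matrix are made up for illustration.

```r
# Fit one lm per group, keeping the result as a list-like by object
fits <- by(
  iris,
  iris[["Species"]],
  FUN = lm,
  formula = Petal.Width ~ Sepal.Width,
  simplify = FALSE
)

# Every element inherits from "lm", so it can be handed to functions that require that
all_lm <- all(vapply(fits, inherits, logical(1), what = "lm"))
print(all_lm)

# Coefficients combined into one matrix, one row per group
coef_matrix <- do.call(rbind, lapply(fits, coefficients))
print(coef_matrix)

# A single group's model, ready for further use
versicolor_fit <- fits[["versicolor"]]
```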

lmList diagnostic plots - is it possible to subset data during a procedure or do data frames have to be subset and then passed in?

I am new to R and am trying to produce a vast number of diagnostic plots for linear models for a huge data set.
I discovered the lmList function from the nlme package.
This works a treat but what I now need is a means of passing in a fraction of this data into the plot function so that the resulting plots are not minute and unreadable.
In the example below 27 plots are nicely displayed. I want to produce diagnostics for much more data.
Is it necessary to subset the data first? (presumably with loops) or is it possible to subset within the plotting function (presumably with some kind of loop) rather than create 270 data frames and pass them all in separately?
I'm sorry to say that my R is so basic that I do not even know how to pass variables into names and values together in for loops (I tried using the paste function but it failed).
The data and function for the example are below. I would be picking values of Subject by their row numbers within the data frame. I grant that the 27 plots here display nicely, but for the sake of example it would be nice to split them into, say, 3 sets of 9.
library(nlme)
fm1 <- lmList(distance ~ age | Subject, Orthodont)
# observed versus fitted values by Subject
plot(fm1, distance ~ fitted(.) | Subject, abline = c(0,1))
Examples from:
https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/plot.lmList.html
I would be most grateful for help and hope that my question isn't insulting to anyone's intelligence or otherwise annoying.
I can't see how to pass a subset to the plot.lmList function. But here is a way to do it using the standard split-apply-combine strategy. Here, the Subjects are just split into three arbitrary groups of 9, and lmList is applied to each group.
## Make 3 lmLists
fits <- lapply(split(unique(Orthodont$Subject), rep(1:3, each = 9)), function(x) {
  eval(substitute(
    lmList(distance ~ age | Subject,                      # fit the model to the subset
           data = Orthodont[Orthodont$Subject %in% x, ]), # use the subset
    list(x = x)))  # substitute the actual x values so the proper call gets stored
})
## Make plots
for (i in seq_along(fits)) {
  dev.new()
  print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0, 1)))
}
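For a larger data set, the grouping step can be generalised so that the number of groups follows from the chunk size. This is only a sketch of the splitting part, in base R; chunk_ids is a made-up helper and the subject ids are invented placeholders.

```r
# Hypothetical helper: split a vector of ids into consecutive chunks of at most `size`
chunk_ids <- function(ids, size) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Invented placeholder ids standing in for unique(data$Subject)
subject_ids <- sprintf("S%03d", 1:270)

# 270 subjects in chunks of 9 gives 30 groups, i.e. 30 readable plot pages
groups <- chunk_ids(subject_ids, 9)
length(groups)
lengths(groups)[1:3]
```

Each element of groups could then play the role of x in the lapply call above, so that one lmList is fitted and plotted per chunk.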
