I've built an R function that uses the same explanatory variables on a range of columns. I've used the glm function, but now I need to do the same with svyglm from the survey package. The main problem I'm having is that I can't build loops by using svyglm(Data[,i]~explanatoryVariables) as I do in glm, because it doesn't like column names (which are however very practical in loops).
For example if you try
library(survey)
data(api)
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
summary(svyglm(api00~ell+meals+mobility, design=dstrat))
everything is fine but if you want to loop through several dependent variables by using the column number (here 13), you get an error
summary(svyglm(apistrat[,13]~ell+meals+mobility,data=apistrat, design=dstrat))
Does anyone know how to get around this? To give a simple example (never mind the statistical accuracy or the link function) I need to achieve the equivalent of this in normal glm but using svyglm instead
for(i in (12:15)){
print(glm(apistrat[,i]~ ell+meals,data=apistrat)$aic)
}
You need to use as.formula to paste the appropriate columns for evaluation. I created a custom function for your case:
mysvy <- function(data, columns, ...) {
  model <- lapply(as.list(columns), function(x) {
    summary(svyglm(as.formula(paste0(names(data)[x], "~ell+meals+mobility")),
                   data = data, ...))
  })
  return(model)
}
Then you can run your desired columns through the function.
# To run columns 13 - 15 and get the results into a list
results <- mysvy(apistrat, 13:15, design = dstrat)
# should return a list of 3. results[[1]] to see the first
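As an alternative sketch of the same idea, the formulas can be built from the column names with reformulate() and then handed to the fitting function one at a time. The helper name make_formulas and the use of mtcars are illustrative assumptions, and glm stands in for svyglm so that the example runs with base R alone.

```r
# Hypothetical helper: one formula per response column name
make_formulas <- function(responses, predictors) {
  formulas <- lapply(responses, function(resp) {
    reformulate(predictors, response = resp)
  })
  setNames(formulas, responses)
}

# Columns 1:2 of mtcars (mpg, cyl) play the role of the dependent variables
formulas <- make_formulas(names(mtcars)[1:2], c("wt", "hp"))

# Fit one model per formula and collect the AIC of each
fits <- lapply(formulas, glm, data = mtcars)
aics <- sapply(fits, AIC)
print(aics)

# Assumption (not run here): with the survey package loaded, the same list
# of formulas could be passed on as lapply(formulas, svyglm, design = dstrat)
```

Working from names rather than from data[, i] keeps the response visible in each fitted model's formula, which also makes the printed output easier to read.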
I'm trying to handle the missing data in a data frame using multiple imputation, and my professor advised me to use the Amelia package. I can build the time series model, but when I try to use the lapply function to run the model repeatedly on each imputed dataset, I get an error from the function passed to lapply.
My data frame has three variables: date, pm25, and pm10. I can build an AR model for pm25.
And the imputation code is:
imp <- amelia(Exetertibble, m=50, ts = "date")
So I get 50 imputations, and the time series model looks like this:
model1 <- arima(imp$imputations$imp1$pm25, order = c(1,0,0))
Then I try to use lapply function:
extractcoefs <- lapply(imp$imputations, coef(model1))
This gives an error saying that coef(model1) is not a function, character, or symbol.
My aim is to combine the 50 imputations and get the best estimate of the coefficients of the time series model, but I don't know how to write a correct function for this.
I also tried:
extractcoefs <- lapply(imp$imputations, coef(arima(order=c(1,0,0))))
and:
extractcoefs <- lapply(imp$imputations, arima(order=c(1,0,0)$coef))
No idea what you are trying to do.
Look at this example for lapply:
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
So you give lapply a list and it applies a function to each of the list elements. In this case the function is mean().
So for this example you will get the mean for a, beta and logic.
You are using lapply on imp$imputations.
You got imp from your call to the amelia() function, which returns an instance of the S3 class "amelia". This instance contains several objects; one of them is the list imputations, whose elements are the imputed datasets (in your case 50).
So lapply(imp$imputations, coef(model1)) tries to apply its second argument to every imputed dataset. The problem is that the second argument isn't a function: coef(model1) is the result of a function call. Also, you can't apply coef() to the imputed datasets themselves. You must apply coef() to a model object, because it returns the coefficients from the model.
I guess you want to do the following:
Generate your m=50 imputed datasets
Build an arima model for each dataset
Get the coefficients of each of these models
You could just use a for loop through the m=50 datasets for this.
Take this as an example:
data(africa)
imp <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", m = 5)
for (i in 1:length(imp$imputations))
{
  model <- arima(imp$imputations[[i]]$gdp_pc)
  coe <- coef(model)
  print(coe)
}
This prints one set of coefficients per imputed dataset: 5 in this example (m = 5) and 50 in your case (m = 50), each from an arima model built on a different imputed dataset.
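The same loop can also be written with lapply by passing an actual function as the second argument. This is a minimal sketch: the list imputations is simulated here as a stand-in for imp$imputations, and the column name pm25 is taken from the question.

```r
set.seed(1)

# Stand-in for imp$imputations: a named list of data frames (simulated)
imputations <- lapply(1:5, function(i) {
  data.frame(pm25 = 10 + as.numeric(arima.sim(list(ar = 0.5), n = 200)))
})
names(imputations) <- paste0("imp", 1:5)

# The second argument is a function that takes ONE dataset,
# fits the AR(1) model on it, and returns the coefficients
coefs <- lapply(imputations, function(d) {
  coef(arima(d$pm25, order = c(1, 0, 0)))
})

# One row per imputed dataset, one column per coefficient
coef_matrix <- do.call(rbind, coefs)
print(coef_matrix)

# Averaging over the imputations gives the usual pooled point estimate
pooled <- colMeans(coef_matrix)
print(pooled)
```

Note that the column means only pool the point estimates; combining the standard errors across imputations takes more than this.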
I want to run a Wald test to evaluate the statistical significance of each coefficient in the model using the regTermTest function of the survey package (as described here).
The syntax of regTermTest calls for the model followed by the test.terms, but if you list multiple test terms it seems to evaluate them all together rather than separately.
library(caret) # for the GermanCredit sample dataset
data(GermanCredit)
mod1 <- glm(Class ~ Age + as.factor(ForeignWorker) + Property.RealEstate + Housing.Own + CreditHistory.Critical, data = GermanCredit, family = binomial(link='logit'))
library(survey)
regTermTest(mod1, c("Age", "ForeignWorker", "Property.RealEstate", "Housing.Own", "CreditHistory.Critical"))
Of course, I could separate them out this way, but it's clunky and repetitive (i.e. the following code produces the desired result but is inefficient when dealing with lots of variables):
regTermTest(mod1, "Age")
regTermTest(mod1, "ForeignWorker")
regTermTest(mod1, "Property.RealEstate")
regTermTest(mod1, "Housing.Own")
regTermTest(mod1, "CreditHistory.Critical")
I've tried extracting the coefficient names into a vector and inserting it into a for loop, but it didn't work (it combines all the terms into one evaluation rather than separately estimating their importance):
vars <- names(mod1$coefficients)
vars <- vars[-1]
for (i in 1:length(vars)) {
iv = vars[i]
rtest <- regTermTest(mod1, iv)
}
How can I efficiently code this?
(Updated)
The *apply family can help, depending on how you want things to look.
lapply(names(mod1$model)[-1], function(x) regTermTest(mod1, x))
sapply(names(mod1$model)[-1], function(x) regTermTest(mod1, x))
You'll have a bit of work to do if you wanted to display the results in a nice way.
(Explanation of update).
The original solution just followed the questioner's idea to use names(mod1$coefficients). But that won't work if there is a factor variable, since mod1$coefficients will contain the name(s) of the variable concatenated with non-default values in the way R regression models always deal with categorical variables. That confuses regTermTest because it goes looking for a variable in the dataset that doesn't exist so it returns a baffling error message.
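The difference between coefficient names and term names is easy to see on a small model. The sketch below uses mtcars and a Gaussian glm purely as an illustration (neither is from the question) and needs only base R.

```r
# A model with one numeric term and one factor term
mod <- glm(mpg ~ wt + factor(cyl), data = mtcars)

# Coefficient names: the factor is expanded into one name per non-reference level
coef_names <- names(coef(mod))[-1]
print(coef_names)

# Term labels: one entry per term, as written in the formula
term_names <- attr(terms(mod), "term.labels")
print(term_names)
```

The first vector contains "factor(cyl)6" and "factor(cyl)8", which are not terms of the model, while the second contains the single term "factor(cyl)". Looping over term names is therefore the safer choice whenever a factor is involved.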
First of all, I'm using RStudio.
I have a data frame (APD) and I would like to fit the data separately for each level of the factor Serial_number. The fit is an lm fit. Then I would like to use this fit to do a calibration (calibrate() from the investr package).
Here is an example picture of my data:
Here's the data: Data
Currently I use following lines to fit via Serial_number:
Coefficients<- APD %>%
group_by(Serial_number) %>%
do(tidy(fit<- lm(log(log(Amplification)) ~ Voltage_transformed, .)))
But here, I cannot apply the calibrate()-function. Calibrate function needs an object, that inherits from "lm". And tidy only works for S3/S4-objects.
Do you have an idea?
In your posted code, you're trying to rbind the predicted values from each model, not the coefficients. The function for coefficients is just coefficients(object).
I would also suggest un-nesting your code, since nesting makes it hard to read and change later on. Here are two generalized functions (each makes assumptions, so edit as needed):
lm_by_variable <- function(data_, formula_, byvar) {
  by(
    data_,
    data_[[byvar]],
    FUN = lm,
    formula = formula_,
    simplify = FALSE
  )
}
combine_coefficients <- function(fit_list) {
  all_coefficients <- lapply(fit_list, coefficients)
  do.call('rbind', all_coefficients)
}
lm_by_variable(...) should be pretty self-evident: group by byvar, use lm with the given formula on each subset, and don't simplify the result. Simplifying results is really only useful for interactive work. In a script, it's better to know exactly what will be returned. In this case, a list.
The next function, combine_coefficients(...) returns a matrix of the fitted coefficients. It assumes every fitted model in fit_list has the same terms. We could add logic to make it more robust, but that doesn't seem necessary in this case.
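To illustrate the idea of keeping one full lm object per group, here is a minimal sketch with split() and lapply(). The data frame is simulated; only the column names Serial_number, Voltage_transformed and Amplification come from the question, and the serial numbers and coefficients are made up.

```r
set.seed(42)

# Simulated stand-in for the APD data frame (values are hypothetical)
APD <- data.frame(
  Serial_number = rep(c("A1", "B2"), each = 20),
  Voltage_transformed = rep(seq(1, 2, length.out = 20), times = 2)
)
APD$Amplification <- exp(exp(0.2 + 0.5 * APD$Voltage_transformed +
                               rnorm(40, sd = 0.01)))

# One lm object per serial number, kept in a named list
fits <- lapply(split(APD, APD$Serial_number), function(d) {
  lm(log(log(Amplification)) ~ Voltage_transformed, data = d)
})

# Every element is a real "lm" object, not a tidied data frame
print(sapply(fits, inherits, what = "lm"))

# Coefficients of all fits as a matrix, one row per serial number
coefs <- do.call(rbind, lapply(fits, coefficients))
print(coefs)
```

Because the list holds the model objects themselves, any function that requires an object inheriting from "lm" can be applied to fits[["A1"]] and so on, instead of to the tidied output.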
As part of my data analysis (on time series), I am checking for correlation between log-returns and realized volatility.
My data consists of time series spanning several years for around a hundred different companies (large zoo object, ~2 MB filesize). To check for the above-mentioned correlation, I have used the following code to calculate several rolling variances (a.k.a. realized volatility):
rollvar5 <- sapply(returns, rollVar, n=5, na.rm=TRUE)
rollvar10 <- sapply(returns, rollVar, n=10, na.rm=TRUE)
using the simple fTrading function rollVar. I have then converted the rolling variances to zoo objects and added the date index (by exporting the results to csv files, manually adding the date, and then using read.zoo; not very sophisticated but it works just fine).
Now I wish to create around 100 linear regression models, each linking the log-returns of a company to the realized volatility of that company. On an individual basis, this would look like the following:
lm_rollvar5 <- lm(returns[5:1000,1] ~ rollvar5[5:1000,1])
lm_rollvar10 <- lm(returns[10:1000,1] ~ rollvar10[10:1000,1])
This works without problems.
Now I wish to extend this to automatically create the linear regression models for all 100 companies. What I've tried was a simple for-loop:
NC <- ncol(returns)
for(i in 1:NC){
  lm_rollvar5 <- lm(returns[5:1000,i] ~ rollvar5[5:1000,i])
  summary(lm_rollvar5)
  lm_rollvar10 <- lm(returns[10:1000,i] ~ rollvar10[10:1000,i])
  summary(lm_rollvar10)
}
Is there any way I could improve my approach, i.e. how could I save all regression results in a simple way? Right now the for-loop just outputs hundreds of regression results, which makes them very awkward to analyze.
I also tried to use the apply function but I am unsure how to use it in this case, since there are several timeseries objects (the returns and the rolling variances are saved in different objects as you can see).
As to your question how you could save all regression results in a simple way, this is a bit difficult to answer given that we don't know what you need to do, and what you consider "simple". However, you could define a list outside the loop and store each regression model in this list so that you can access the models without refitting them later. Try e.g.
NC <- ncol(returns)
lm_rollvar5 <- vector(mode="list", length=NC)
lm_rollvar10 <- vector(mode="list", length=NC)
for(i in 1:NC){
  lm_rollvar5[[i]] <- lm(returns[5:1000,i] ~ rollvar5[5:1000,i])
  lm_rollvar10[[i]] <- lm(returns[10:1000,i] ~ rollvar10[10:1000,i])
}
This gives you the fitted model for firm i at the i-th position in the list. In the same manner, you can also save the output of summary. Or you could do something like
my.summaries_5 <- lapply(lm_rollvar5, summary)
which gives you a list of summaries.
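Once the models are in a list, the quantities of interest can be pulled into one table. The following is a sketch on simulated data: the matrices returns and rollvar5 are random stand-ins for the real series, and the firm names are made up.

```r
set.seed(123)
n_obs <- 200
n_firms <- 4

# Simulated stand-ins for the real return and rolling-variance series
returns <- matrix(rnorm(n_obs * n_firms), nrow = n_obs,
                  dimnames = list(NULL, paste0("firm", 1:n_firms)))
rollvar5 <- matrix(rexp(n_obs * n_firms), nrow = n_obs,
                   dimnames = list(NULL, paste0("firm", 1:n_firms)))

# One regression per firm, stored in a named list
fits <- lapply(seq_len(n_firms), function(i) {
  lm(returns[, i] ~ rollvar5[, i])
})
names(fits) <- colnames(returns)

# One row per firm: intercept, slope and R squared
results <- t(sapply(fits, function(f) {
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    r_squared = summary(f)$r.squared)
}))
print(results)
```

A matrix like results can be sorted, filtered or written to a file, which is usually easier to work with than hundreds of printed summaries.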
I am trying to use the tapply() function to run models by several categories, without much success. My data has 20 clinics and I want to run the models BY each clinic.
Here's my model:
attach(qregdata)
rq(logA~ dose+ chtcm + cage +raceth + sex,tau=.9)
My data has a variable clinic (with values 1-20). Does anybody know how to run this model BY clinic in R as in other statistical packages?
A very general way of accomplishing this is shown in the following. The ddply function runs a supplied function (in this case lm) for each clinic. You can also run it on more complex cross-sections of your data. E.g. .(clinic,level) would run a separate model on each combination of clinic and level. The term lm(y~x)$coef[1] gets the intercept of the linear model. I think there is no easy way to save all the output of each model fit at once.
n <- 10
clinic <- factor(rep(1:3,each=n))
x <- rep(0:(n-1),3)
y <- rnorm(3*n)*x
d <- data.frame(clinic,x,y)
# plot data and linear fits
library(ggplot2)
ggplot(d,aes(x,y)) + geom_point() + facet_wrap(~clinic) + stat_smooth(method='lm')
# run a separate model for each clinic
library(plyr)
ddply(d,.(clinic),summarize,intercept=lm(y~x)$coef[1],slope=lm(y~x)$coef[2])
You could use 'lapply' across the unique values of clinic, and then use subset to extract the section of your dataset for that clinic. Then just fit the model to the subset.
This will return a list of models, which you can then further process.
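A minimal sketch of that approach, using simulated data and lm as a stand-in for the model in the question (swapping in another fitting function is assumed to follow the same pattern):

```r
set.seed(1)

# Simulated data: three clinics, ten observations each
d <- data.frame(clinic = rep(1:3, each = 10), x = rep(0:9, times = 3))
d$y <- 1 + 2 * d$x + rnorm(nrow(d))

# Fit one model per clinic on the matching subset of rows
clinics <- unique(d$clinic)
models <- lapply(clinics, function(cl) {
  lm(y ~ x, data = subset(d, clinic == cl))
})
names(models) <- paste0("clinic", clinics)

# Further processing, e.g. the slope estimated for each clinic
slopes <- sapply(models, function(m) coef(m)[["x"]])
print(slopes)
```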
I had a similar issue to this recently and wanted to share a response in case someone is still interested in this topic; sorry to dredge up an old post.
tapply is very convenient to work with when the input object (the object being "split") is a vector. If the input object being split is a rectangular data set, it can be much simpler to use the (aptly named, in this case) by function, which is a convenient wrapper for tapply intended for data.frame objects. The return object of the by function is of the class by which can be simplified to an array or a list using the argument simplify = TRUE.
Certainly there are more efficient ways to perform this operation, but if you are looking for a tapply-like solution - by is it.
Here's an example using lm to regress petal width on sepal width "by" species in the iris data set:
## Load iris data
data(iris)
## Fit a model to each species-specific subset of the data
fitBySpecies <- by(
data = iris,
INDICES = iris$Species,
FUN = function(speciesSubset)
lm(Petal.Width ~ Sepal.Width, data = speciesSubset)
)
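As a possible follow-up, the fitted models returned by by can be processed like a list. This sketch refits the same models so that it runs on its own, then stacks the coefficients into one matrix.

```r
# Same fit as in the example: one lm per species
fitBySpecies <- by(
  data = iris,
  INDICES = iris$Species,
  FUN = function(speciesSubset)
    lm(Petal.Width ~ Sepal.Width, data = speciesSubset)
)

# Extract the coefficients of each model and bind them row-wise
coefBySpecies <- do.call(rbind, lapply(fitBySpecies, coef))
print(coefBySpecies)
```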