I have a group of 51 variables to which I have applied Principal Component Analysis, and I selected six factors based on the Kaiser-Guttman criterion. I'm using R for my analysis and did this with the following code:
prca.searchwords <- prcomp(searchwords.ts, scale. = TRUE)  # PCA on the standardised variables
summary(prca.searchwords)                                  # proportion of variance explained
prca.searchwords$sdev^2                                    # eigenvalues, used for the Kaiser-Guttman criterion
Next I would like to use these six extracted factors as explanatory variables in a dynamic linear regression model, in groups of one, two, three and four, and choose the regression model that explains the most variation in the dependent variable. The six variables are the score columns prca.searchwords$x[,1] through prca.searchwords$x[,6].
I convert these to time series before using them in a regression:
prca.searchwords.1.ts <- ts(data=prca.searchwords$x[,1], freq=12, start=c(2004, 1))
prca.searchwords.2.ts <- ts(data=prca.searchwords$x[,2], freq=12, start=c(2004, 1))
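(Presumably all six components could also be converted in one go with something like the sketch below, though I have been creating them one at a time.)
## sketch: convert all six PC score columns to one multivariate monthly series at once
prca.searchwords.all.ts <- ts(prca.searchwords$x[, 1:6], freq = 12, start = c(2004, 1))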
I'm using the dynlm package in R for this (I chose to use dynamic regression because other regressions that I perform require lagged values of the independent variables).
For example, with the first two factors it would look like this:
private.consumption.searchwords.dynlm <- dynlm(monthly.privateconsumption.ts ~ prca.searchwords.1.ts + prca.searchwords.2.ts)
summary(private.consumption.searchwords.dynlm)
The problem I'm facing is that I would like to do this for all possible combinations of one, two, three and four factors out of the six factors that I have chosen to use. This would mean six regressions for one-variable groups, 15 for two variables, 20 for three variables and 15 for four variables. I would like to do this as efficiently as possible, without having to type 56 different regressions manually.
I'm a relatively new R user and therefore I still struggle with the general coding tricks that radically speed up this kind of analysis. Could someone please point me in the right direction?
Thank you!
You could build all the formulas you are interested in as strings using string-manipulation functions, convert those to proper formula objects, and then fit the whole list of models with lapply. For example:
vars <- paste0("prca.searchwords.", 1:6, ".ts")   # names of the six factor series created above
## all right-hand sides for groups of one to four factors (6 + 15 + 20 + 15 = 56 formulas)
resp <- unlist(lapply(1:4, function(i) apply(combn(vars, i), 2, paste, collapse = " + ")))
## fit each model; do.call() is used so the constructed formula is stored in each model's call
result <- lapply(resp, function(r) {
  do.call("dynlm", list(as.formula(paste0("monthly.privateconsumption.ts ~ ", r))))
})
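Assuming those fits run, you can then rank the candidate models by adjusted R-squared to pick the one that explains the most variation (a sketch using the result and resp objects from above):
adj.r2 <- sapply(result, function(m) summary(m)$adj.r.squared)  # adjusted R-squared of every model
resp[which.max(adj.r2)]                                         # right-hand side of the best-fitting model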
I am new to R and am trying to produce a vast number of diagnostic plots for linear models for a huge data set.
I discovered the lmList function from the nlme package.
This works a treat, but what I now need is a way of passing a fraction of this data into the plot function so that the resulting plots are not minute and unreadable.
In the example below 27 plots are nicely displayed. I want to produce diagnostics for much more data.
Is it necessary to subset the data first (presumably with loops), or is it possible to subset within the plotting function (again presumably with some kind of loop), rather than create 270 data frames and pass them all in separately?
I'm sorry to say that my R is so basic that I do not even know how to build variable names and values together in for loops (I tried using the paste function but it failed).
The data and function for the example are below; I would be picking values of Subject by their row numbers within the data frame. I grant that the 27 plots here display nicely, but for the sake of example it would be nice to split them into, say, 3 sets of 9.
library(nlme)  # lmList() comes from nlme
fm1 <- lmList(distance ~ age | Subject, Orthodont)
# observed versus fitted values by Subject
plot(fm1, distance ~ fitted(.) | Subject, abline = c(0,1))
Examples from:
https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/plot.lmList.html
I would be most grateful for help and hope that my question isn't insulting to anyone's intelligence or otherwise annoying.
I can't see how to pass a subset to the plot.lmList function, but here is a way to do it using the standard split-apply-combine strategy. The Subjects are split into three arbitrary groups of nine, and lmList is applied to each group.
## Make 3 lmLists, one per group of nine Subjects
fits <- lapply(split(unique(Orthodont$Subject), rep(1:3, each = 9)), function(x) {
  eval(substitute(
    lmList(distance ~ age | Subject,                      # fit the model to the subset
           data = Orthodont[Orthodont$Subject %in% x, ]), # use only this group's Subjects
    list(x = x)))  # substitute the actual x values so the proper call gets stored
})
## Make plots
for (i in seq_along(fits)) {
dev.new()
print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0,1)))
}
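If you would rather write each set of panels to a file than open a new graphics device every time, a possible variant (the file names are just illustrative):
## variant: save each group's diagnostic panels to its own PDF
for (i in seq_along(fits)) {
  pdf(sprintf("lmList_diagnostics_%d.pdf", i))
  print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0, 1)))
  dev.off()
}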
As part of my data analysis (on time series), I am checking for correlation between log-returns and realized volatility.
My data consists of time series spanning several years for around a hundred different companies (a large zoo object, ~2 MB file size). To check for the above-mentioned correlation, I have used the following code to calculate several rolling variances (a.k.a. realized volatility):
rollvar5 <- sapply(returns, rollVar, n=5, na.rm=TRUE)
rollvar10 <- sapply(returns, rollVar, n=10, na.rm=TRUE)
using the simple fTrading function rollVar. I then converted the rolling variances to zoo objects and added the date index (by exporting the results to CSV files, manually adding the dates, and then using read.zoo; not very sophisticated, but it works just fine).
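(Presumably something like zoo's rollapplyr could keep the date index directly and avoid the CSV round trip; a sketch of what I mean, assuming returns is a zoo object:)
library(zoo)
## trailing rolling variances that keep the zoo date index, padding the first
## width - 1 observations with NA instead of dropping them
rollvar5.zoo  <- rollapplyr(returns, width = 5,  FUN = var, na.rm = TRUE, fill = NA)
rollvar10.zoo <- rollapplyr(returns, width = 10, FUN = var, na.rm = TRUE, fill = NA)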
Now I wish to create around 100 linear regression models, each linking the log-returns of a company to the realized volatility of that company. On an individual basis, this would look like the following:
lm_rollvar5  <- lm(returns[5:1000, 1] ~ rollvar5[5:1000, 1])
lm_rollvar10 <- lm(returns[10:1000, 1] ~ rollvar10[10:1000, 1])
This works without problems.
Now I wish to extend this to automatically create the linear regression models for all 100 companies. What I've tried was a simple for-loop:
NC <- ncol(returns)
for(i in 1:NC){
  lm_rollvar5 <- lm(returns[5:1000, i] ~ rollvar5[5:1000, i])
  summary(lm_rollvar5)
  lm_rollvar10 <- lm(returns[10:1000, i] ~ rollvar10[10:1000, i])
  summary(lm_rollvar10)
}
Is there any way I could optimize my approach (i.e., how could I save all the regression results in a simple way)? As it stands, the for loop just outputs hundreds of regression results, which makes it quite ineffective to analyze them.
I also tried to use the apply function, but I am unsure how to use it in this case, since there are several time series objects (the returns and the rolling variances are saved in different objects, as you can see).
As to your question of how you could save all regression results in a simple way: this is a bit difficult to answer given that we don't know what you need to do and what you consider "simple". However, you could define a list outside the loop and store each regression model in this list, so that you can access the models without refitting them later. Try e.g.
NC <- ncol(returns)
lm_rollvar5 <- vector(mode="list", length=NC)
lm_rollvar10 <- vector(mode="list", length=NC)
for(i in 1:NC){
  lm_rollvar5[[i]]  <- lm(returns[5:1000, i] ~ rollvar5[5:1000, i])
  lm_rollvar10[[i]] <- lm(returns[10:1000, i] ~ rollvar10[10:1000, i])
}
This gives you the fitted model for firm i at the i-th position in the list. In the same manner, you can also save the output of summary, or do something like
my.summaries_5 <- lapply(lm_rollvar5, summary)
which gives you a list of summaries.
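If you then want a compact overview rather than hundreds of printed summaries, you can pull out just the numbers you need from those summaries; for example (a sketch, using the standard lm coefficient-table column names):
## slope estimate and p-value of the rolling-variance term for each firm
slopes5 <- t(sapply(my.summaries_5, function(s) coef(s)[2, c("Estimate", "Pr(>|t|)")]))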
In modeling it is helpful to run univariate regressions of a dependent variable on an independent variable in linear, quadratic, cubic and quartic forms to see which captures the basic shape of the data. I'm a fairly new R programmer and need some help.
Here's pseudocode:
for i in 1:ncol(data)
    data[, ncol(data) + i]   <- data[, i]^2   # create squared term
    data[, 2*ncol(data) + i] <- data[, i]^3   # create cubed term
    # ...and similarly for the fourth-power term
    # now run four regressions, starting with the linear model and adding one higher-order
    # term each time, and display for each i the form of regression with the highest adj. R^2
    lm(y ~ data[, i], ...)
    # retrieve adj. R^2 and save it, indexed by i, for the linear case
    lm(y ~ data[, i] + data[, ncol(data) + i], ...)
    # retrieve adj. R^2 and save...
The result would be a data frame indexed by i, containing the column name of the original x variable in data and the results of each of the four regressions (all run with an intercept term).
Ordinarily we do this by looking at plots, but when you have 800 variables that is not feasible.
If you really want to help out, write code to automatically insert the required number of exponentiated variables into data.
And this doesn't even take care of the kinky variables that come clumped up in a couple of clusters or are only relevant at one value, etc.
I'd say the best way to do this is by using the polynomial function in R, poly(). Imagine you have an independent numeric variable, x, and a numeric response variable, y.
models <- list()
for (i in 1:4) {
  models[[i]] <- lm(y ~ poly(x, i, raw = TRUE))  # degree-i polynomial fit on the raw scale
}
The raw = TRUE argument to poly() ensures that the model uses the raw polynomials rather than the orthogonal polynomials.
When you want to get one of the models, just type in models[[1]] or models[[2]], etc.
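To decide which form fits best you could, for instance, compare the adjusted R-squared values of the four fits (a sketch based on the list built above):
## adjusted R-squared of the linear, quadratic, cubic and quartic fits
sapply(models, function(m) summary(m)$adj.r.squared)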
I am currently using mitools to analyze 5 imputed datasets (I used Amelia). The unit of analysis is country-year. Each dataset has over 3600 observations. The dependent variable of interest for all analyses is a lagged count variable.
Given the nature of the data (panel & count), I decided to use the pglm package. I have been using the following code to attempt to estimate pglm models across all five imputed datasets, in order to combine those estimates with mitools:
setwd("C:/Desktop")
library(mitools)
library(pglm)
#The data.dir code below establishes where the files are located
data.dir <- "C:/Desktop"
#The imputationList function combines all 5 imputed gtot datasets into one object
#called allgtot
allcountdata <- imputationList(lapply(list.files(data.dir, pattern = "panelcount.\\.dta", full=TRUE),read.dta, warn.missing.labels=FALSE))
#To see if the combining was successful
allcountdata
#Lets see if all of the column names are still there
colnames(allcountdata)
count_mitools<-pglm(count_lag ~
HROsec
+ hrofilled
+ physint
+ democracy
+ log(PopTotal)
+ GDPpercapita
+ cell
+ gini
+ polity2
+ CivilWar
+ milper
+ elp
+ factor(year),
allcountdata,
family="negbin",
model="within",
print.level=0,
method="nr",
index=c("cowcode", "year"))
However, I get the following error:
Error in is.finite(x[[i]]) :
default method not implemented for type 'list'
I attempt to coerce the list object into a dataframe as follows:
as.data.frame(allcountdata)
However, R states that I cannot do so.
What can I do to make the pglm estimation work?
If I cannot use pglm, what are alternative models (and corresponding R packages) suitable for panel count-data regression analysis that are either (preferably both):
a) compatible with a mitools analysis
b) accepting of list objects for estimation
I am providing a link to a compressed zipfile containing all five imputed datasets I am working with here.
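(For reference, the pooling step I am aiming for afterwards would presumably follow the usual mitools pattern of with() over the imputationList followed by MIcombine(); the sketch below uses lm purely as a stand-in, since whether pglm can slot in here is exactly my question.)
## intended workflow (sketch): fit a model in each imputed dataset, then pool with Rubin's rules
fits <- with(allcountdata, lm(count_lag ~ physint + democracy + log(PopTotal)))
summary(MIcombine(fits))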
I am trying to use the tapply() function to run models by several categories, without much success. My data has 20 clinics and I want to run the models BY each clinic.
Here's my model:
library(quantreg)  # rq() comes from quantreg
attach(qregdata)
rq(logA ~ dose + chtcm + cage + raceth + sex, tau = 0.9)
My data has a variable clinic (with values 1-20). Does anybody know how to run this model BY clinic in R, as in other statistical packages?
A very general way of accomplishing this is shown below. The ddply function runs a supplied function (in this case lm) for each clinic. You can also run it on more complex cross-sections of your data; for example, .(clinic, level) would run a separate model on each combination of clinic and level. The term lm(y~x)$coef[1] gets the intercept of the linear model. I think there is no easy way to save all the output of each model fit at once.
n <- 10
clinic <- factor(rep(1:3,each=n))
x <- rep(0:(n-1),3)
y <- rnorm(3*n)*x
d <- data.frame(clinic,x,y)
# plot data and linear fits
library(ggplot2)
ggplot(d,aes(x,y)) + geom_point() + facet_wrap(~clinic) + stat_smooth(method='lm')
# run a separate model for each clinic
library(plyr)
ddply(d,.(clinic),summarize,intercept=lm(y~x)$coef[1],slope=lm(y~x)$coef[2])
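That said, if you do want to keep the complete model objects rather than just selected coefficients, plyr's dlply (which returns a list instead of a data frame) is one option; a sketch with the toy data above:
## keep the full lm fit for each clinic in a named list
fits <- dlply(d, .(clinic), function(df) lm(y ~ x, data = df))
lapply(fits, summary)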
You could use lapply across the unique values of clinic, and then use subset to extract the section of your data set for that clinic. Then just fit the model to the subset.
This will return a list of models, which you can then further process.
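A minimal sketch of that approach, assuming qregdata contains the variables used in the question and quantreg is loaded for rq():
library(quantreg)
## fit the 0.9-quantile regression separately for each clinic and keep the fits in a named list
clinics <- sort(unique(qregdata$clinic))
fits <- lapply(clinics, function(cl) {
  rq(logA ~ dose + chtcm + cage + raceth + sex, tau = 0.9,
     data = subset(qregdata, clinic == cl))
})
names(fits) <- clinics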
I had a similar issue to this recently and wanted to share a response in case someone is still interested in this topic; sorry to dredge up an old post.
tapply is very convenient to work with when the input object (the object being "split") is a vector. If the input object being split is a rectangular data set, it can be much simpler to use the (aptly named, in this case) by function, which is a convenient wrapper for tapply intended for data.frame objects. The return object of the by function is of class "by", which can be simplified to an array or a list using the argument simplify = TRUE.
Certainly there are more efficient ways to perform this operation, but if you are looking for a tapply-like solution, by is it.
Here's an example using lm to regress petal width on sepal width "by" species in the iris data set:
## Load iris data
data(iris)
## Fit a model to each species-specific subset of the data
fitBySpecies <- by(
data = iris,
INDICES = iris$Species,
FUN = function(speciesSubset)
lm(Petal.Width ~ Sepal.Width, data = speciesSubset)
)
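The result behaves like a list, so the usual apply-style tools work on it; for example, to collect the coefficients of each per-species fit:
## coefficient matrix with one column per species
sapply(fitBySpecies, coef)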