Writing a function that outputs several regression results - r

I have a large data frame containing monthly stock returns from January 1970 to December 2009 (rows) for 7 different countries including the US (columns). My task is to regress the stock returns of each country (dependent variable) on the US stock returns (independent variable) over 4 different time periods, namely the 70s, the 80s, the 90s and the 00s.
The data set (.csv) can be downloaded at:
https://docs.google.com/file/d/0BxaWFk-EO7tjbG43Yl9iQVlvazQ/edit
This means that I have 24 regressions to run separately and report the results, which I have already done using the lm() function. However, I am now trying to use R more effectively and write custom functions that will produce the 24 sets of results.
I have created sub data frames containing the observations clustered according to the time periods knowing that there are 120 months in a decade.
seventies = mydata[1:120, ] # 1970s (from Jan. 1970 to Dec. 1979)
eighties = mydata[121:240, ] # 1980s (from Jan. 1980 to Dec. 1989)
nineties = mydata[241:360, ] # 1990s (from Jan. 1990 to Dec. 1999)
twenties = mydata[361:480, ] # 2000s (from Jan. 2000 to Dec. 2009)
NB: Each of the newly created objects is a 120 x 7 data frame (120 observations across 7 countries).
Running the 24 regressions in a language like Java would require nested for loops.
Could anyone outline the steps I must take to write a function that arrives at the desired result? Some snippets of R code would also be appreciated. I suspect the mapply function will be involved.
Thank you and let me know if my post needs some editing.

try this:
install.packages('plyr')
library('plyr')
myfactors<-c(rep("seventies",120),rep("eighties",120),rep("nineties",120),rep("twenties",120))
tapply(y,myfactors,function(y,X){ fit<-lm(y~ << regressors go here>>); return (fit);},X=mydata)
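A related way to get one fit per decade is to split the rows by a decade factor and call lm on each block. This is only a minimal sketch under stated assumptions: the data are simulated, and the column names (USA, UK, Japan) are hypothetical stand-ins, not the names in the actual CSV.

```r
set.seed(1)
# Simulated stand-in for mydata: 480 monthly returns, hypothetical column names
mydata <- data.frame(USA = rnorm(480), UK = rnorm(480), Japan = rnorm(480))

# Decade label for every row (120 months per decade), kept in chronological order
decades <- c("seventies", "eighties", "nineties", "twenties")
decade <- factor(rep(decades, each = 120), levels = decades)

# Split rows by decade, then regress every non-US column on USA within each block
fits <- lapply(split(mydata, decade), function(d) {
  lm(as.matrix(d[, setdiff(names(d), "USA")]) ~ USA, data = d)
})

# One coefficient matrix (intercept and slope per country) per decade
coefs <- lapply(fits, coef)
```

With the real data, 6 non-US columns times 4 decades would give the 24 regressions; summary(fits$seventies) prints one summary per country for that decade.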

The lm function will accept a matrix as the response variable and compute separate regressions for each of the columns, so you can just combine (cbind) the different countries together for that part.
If you are willing to assume that the different decades have the same variance then you could fit the different decades using a dummy variable for decade (look at the gl function for a quick way to calculate a decade factor) and do everything in one call to lm. A simple example:
fit <- lm( cbind( Sepal.Width, Sepal.Length, Petal.Width ) ~ 0 + Species + Petal.Length:Species,
data=iris )
This will give the same coefficient estimates as the separate regressions; only the standard deviations and degrees of freedom (and therefore the tests and anything else that depends on those) will be different from running the regressions individually.
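Applied to the returns problem, the same idea might look like the sketch below. It is an illustration on simulated data, and the column names (USA, UK, Japan) and the decade labels are assumptions. It builds the decade factor with gl and compares the 1970s coefficients from the single call with a fit on the 1970s rows alone.

```r
set.seed(2)
# Simulated stand-in for the returns data (hypothetical column names)
ret <- data.frame(USA = rnorm(480), UK = rnorm(480), Japan = rnorm(480))

# gl(4, 120) yields 4 levels, each repeated for 120 consecutive rows
ret$decade <- gl(4, 120, labels = c("seventies", "eighties", "nineties", "twenties"))

# Separate intercept and slope per decade, all countries in one call
pooled <- lm(cbind(UK, Japan) ~ 0 + decade + USA:decade, data = ret)

# The same regression fitted on the 1970s block only
single <- lm(cbind(UK, Japan) ~ USA, data = ret, subset = decade == "seventies")

# Rows 1 and 5 of the pooled coefficients are the 1970s intercept and slope
pooled_70s <- coef(pooled)[c(1, 5), ]
single_70s <- coef(single)
```

The two coefficient matrices should agree up to numerical tolerance, while the standard errors reported by summary() will generally differ.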
If you need the standard deviations computed individually for each decade then you can use tapply or sapply (passing decade info into the subset argument of lm) or other apply functions.
For displaying the results from several different regression models the new stargazer package may be of interest.

Try using the 'stargazer' package for publication-quality text or LaTeX regression results tables.

Related

How to choose one row from each of seven data frames to form a new data frame?

I have seven data frames with the same structure: 41 rows and 12 columns each. I want to choose one row from each data frame to build a new data frame. In theory this gives 41^7 new data frames, and I could run a regression on each of them. My goal is to get a range for the coefficient of my most important independent variable, but the crucial step right now is building these 41^7 data frames. What should I do?
I have run a panel regression with 30 individuals and 12 half-year periods. My most important variable is partly missing, so I calculated it myself. Now I want to vary the method of calculation to get an interval for the coefficient, and this is where I am stuck.
FIT <- t(cbind(FITgan[1,],FITxin[1,],FITqing[1,],FITnei[1,],FIThe[1,],FITshan[1,],FITshann[1,]))
This is my attempt for one combination. The total number of combinations is 41^7 (7 data frames, each with 41 rows), and I would like to get a list with all these results. Thanks!!
I am not sure this question belongs here; it sounds more like a topic for Cross Validated. Since you probably should not run a regression on 41^7 data frames, I would suggest a bootstrapping approach:
1. Sample from your true observations WITH replacement 41 times.
2. Run your regression.
3. Save the coefficients of interest.
4. Repeat steps 1-3 maybe 10,000 times.
5. Estimate a simulated distribution from your coefficients of interest.
This will lead you to a better understanding of the underlying uncertainty of your parameters.
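The steps above can be sketched roughly as follows. This is a generic row-resampling illustration on simulated data, not the panel setup from the question: the variables x and y, the model y ~ x, and the smaller number of replicates are all assumptions made to keep the example short.

```r
set.seed(42)
# Simulated stand-in for the real observations (hypothetical variables x and y)
obs <- data.frame(x = rnorm(41))
obs$y <- 1 + 2 * obs$x + rnorm(41)

n_boot <- 2000  # fewer than the 10,000 mentioned above, just to keep it quick

boot_coef <- replicate(n_boot, {
  idx <- sample(nrow(obs), replace = TRUE)   # step 1: resample rows with replacement
  coef(lm(y ~ x, data = obs[idx, ]))[["x"]]  # steps 2-3: refit and keep the coefficient
})

# Step 5: summarise the simulated distribution, e.g. a 95% percentile interval
ci <- quantile(boot_coef, c(0.025, 0.975))
```

hist(boot_coef) gives a quick look at the simulated distribution of the coefficient.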

Run correlation between unequal sized blocks of data in 2 time series

I have two datasets and need to estimate the correlation between a series in each. For example, one series has length 189 and the other has length 192. The end points of the two series correspond to the same time period, i.e., Dec 2015; the difference is in the start points. I need to estimate the correlation over blocks of 12 data points in both series, starting from the last point. For example, the first block would be Jan 2015 to Dec 2015, the second block Jan 2014 to Dec 2014, and so on. Since the last block would have unequal lengths, the lengths can be equalized and the last block can be shorter than 12 months; in this example, the last block would be 9 months long. How do I create a loop to run this?
I tried the following. It gives me results, but I get the same correlation value for every loop iteration. I don't know where I am going wrong.
correl=data.frame(x=numeric(0))
r=nrow(US)
s=nrow(Argentina)
a=ifelse(r<s,r,s)
for (i in 1:(a%/%12)) {
if(i<a%/%12){
elmnt1= US[r-11:r,]$IIP
elmnt2= Argentina[s-11:s,]$IIP
} else {
elmnt1= US[1:r%%12,]$IIP
elmnt2=Argentina[1:s%%12,]$IIP
}
corr=cor(elmnt1, elmnt2)
correl$x[i,]=corr
r=r-12
s=s-12
}
You don't need to create a for loop solution for this. Not having matching observation lengths is a common problem that occurs in research, and there are answers built into the correlation functions to handle this. If you have two variables in the same data frame that are of different lengths, here are some options:
#Use cor.test(), which drops incomplete pairs (NAs); x and y must still have the same length:
cor.test(x,y)
#Or add the following argument to the cor() function for the same purpose:
cor(x,y,use='complete.obs')
As long as your x and y are in the same table, and are presumably matched by date in this case, then these options should solve the problem.
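If the block-by-block correlations are still wanted, one possible sketch is below. It uses simulated vectors in place of US$IIP and Argentina$IIP, so the numbers mean nothing; only the indexing matters. One thing that may be part of the problem in the loop from the question is operator precedence: in R, `:` binds tighter than `-`, so `r-11:r` is read as `r - (11:r)` and not `(r-11):r`.

```r
set.seed(3)
# Simulated stand-ins for US$IIP (189 values) and Argentina$IIP (192 values)
us  <- rnorm(189)
arg <- rnorm(192)

n <- min(length(us), length(arg))  # 189 common observations
us_t  <- tail(us, n)               # keep the last n points so both series end together
arg_t <- tail(arg, n)

# Block id counted from the end: 1 = last 12 points, 2 = the 12 before that, ...
block <- rev((seq_len(n) - 1) %/% 12 + 1)

# One correlation per block; the oldest block holds the 9 leftover points
correl <- sapply(split(seq_len(n), block), function(i) cor(us_t[i], arg_t[i]))
```

Here correl has 16 entries: 15 full blocks of 12 points and a final block of 9.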

Using acf function in r for time series data

I am new to time-series analysis and have a data set with a daily time step at 5 factor levels. My goal is to use the acf function in R to determine whether there is significant autocorrelation across the response variable of interest so that I can justify whether or not a time-series model is necessary.
I have sorted the dataset by Day, and am using the following code:
acf(DE_vec, lag.max=7)
The dataset has not been converted to a time-series object…it is a vector sorted by Day.
My first question is whether the dataframe should be converted to a time-series object, or if it is also correct to sort the vector by Day?
Second, if I have a variable repeated over the 5 levels for each Day, then should I construct 5 different acf plots for each level, or would it be ok to pool over stations as was done with the code above?
Thanks in advance,
Yes, acf() will work on a data.frame class, and yes, you should compute the ACF for each of the 5 levels separately. If you pass the entire df to acf(), it will return the ACF for each of the levels.
If you are curious about the relationship across levels, then you need to use ccf() or some mutual information metric like those in the entropy or infotheo packages.
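As a rough illustration of the per-level approach, the sketch below assumes a long-format data frame with hypothetical columns Day, Level and DE. The data are simulated white noise, so no autocorrelation is expected in the output.

```r
set.seed(4)
# Simulated long-format data: 100 days x 5 levels (column names are hypothetical)
dat <- expand.grid(Day = 1:100, Level = paste0("L", 1:5))
dat$DE <- rnorm(nrow(dat))

# Sort by Day within each level, then compute one ACF per level
dat <- dat[order(dat$Level, dat$Day), ]
acf_by_level <- lapply(split(dat$DE, dat$Level), acf, lag.max = 7, plot = FALSE)

# Cross-correlation between two levels, as a first look at dependence across levels
cc <- ccf(dat$DE[dat$Level == "L1"], dat$DE[dat$Level == "L2"],
          lag.max = 7, plot = FALSE)
```

Calling plot(acf_by_level$L1) draws the usual correlogram for a single level.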

R: Time Series Regression with NA and multiple dependent variables

I would like to run a time series regression with a list of dependent variables as the columns of a data frame. I would like to regress each column on a set of independent variables. I know you can just use
lm(dataframe~independent variables)
because if the dependent variable is a matrix, lm will fit each column in turn.
However, my dependent variables are information about stocks through time, and sometimes information is not available for every single stock at every time point, so I have some NA values. The problem is that lm has to omit the NA values, i.e. it removes the whole row when running the regression. This is fine if I only want to run a regression on one dependent variable, but I have a list (1000+) of dependent variables on which I would like to run my regression. Because my dataset covers only 15+ years, there are missing values at every single time point, so when I run lm I get an error because every single row has been removed. The only way I can think of to solve this problem is to write a for loop and run a separate regression for each stock, which I think will take a very long time to compute. For example, the following is an example of my data:
            135081(P)    135084(P)    135090(P)
1994-12-30         NA           NA           NA
1995-01-02         NA           NA           NA
1995-01-03   06864935           NA           NA
1995-01-04         NA           NA  -0.05474644
1995-01-05         NA           NA   0.20894900
1995-01-06         NA  -0.45672832  -0.02378632
So if I run a time series regression on this, I would get an error because the lm function would drop every single row.
So my question is, would there be another way to run a time series regression across a data frame with different DEPENDENT variables where the regression "skips" the NA for just the one particular dependent variable instead of skipping it for every other dependent variable as well?
I don't think using na.omit is correct because it removes the time series properties of my dataset and using na.action=NULL doesn't work because I have NA in my dataset.
Thank you a lot for your help.
You might want to employ a multiple imputation method using something like the Amelia 2 package on CRAN in order to properly account for increased uncertainty in your estimates due to missingness, and also to help minimize biases that result from case-wise deletion. See for example:
Honaker, J. and King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561–581.
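If imputation is not wanted, the per-stock loop mentioned in the question can be written compactly with lapply, so that each fit drops only the rows where that particular stock is NA. This is a minimal sketch on simulated data: the regressor x and the stock names A, B and C are made up, and it is not a substitute for handling the missingness properly.

```r
set.seed(5)
n <- 200
x <- rnorm(n)  # hypothetical regressor
stocks <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n))
stocks$A[1:50]    <- NA  # each stock is missing on different dates
stocks$B[120:200] <- NA

# One lm per column; each fit drops only the rows where that stock is NA
fits <- lapply(stocks, function(y) lm(y ~ x, na.action = na.exclude))

coefs <- t(sapply(fits, coef))     # one row of coefficients per stock
res   <- sapply(fits, residuals)   # n x 3 matrix, NA where the stock was missing
```

Because of na.exclude, the residuals are padded with NA and keep their original time alignment.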

Time Dummy Variables and Regressing columns of a dataframe as dependent variables

I tried searching for this question here but couldn't find anything, so sorry if it has already been answered. My dataset consists of daily information for a large number of stocks (1000+) over a 10 year period. I have read my dataset as a data frame time series where each column is a separate stock. I would like to regress each stock against month dummy variables to capture the seasonal variation and obtain the residuals. What I have done is the following:
for (i in 1:1000){
month.f<-factor(months(time(stockinfo[,i])))
dummy<-model.matrix(month.f)
residStock[,1]<-residuals(lm(stockinfo[,i]~dummy,na.action=na.exclude))
}
#Stockinfo is data.frame
Is this the correct way to do it?
Secondly, I would like to run a regression using the residuals as the dependent variable and other independent variables from another data frame. What would be the best way to do this? Would I have to use a for loop again?
Thank you a lot for your help.
You can create a list of stocks as follows and then use the Map function to avoid an R for loop (not tested, since you didn't provide sample data).
Assume your data is mydata with month coded as 1, 2, and so on; if there are 12 months, you use 11 of them as dummies.
mystock<-list("APP~","INTEL~","MICROSOFT~") # stocks with tilde sign
myresi<-Map(function(x) resid(lm(as.formula(paste(x,"factor(month)")),data=mydata)),mystock) # factor(month) creates the dummies; with 12 months, 11 are used and the first is the base month
Say your independent variables are indep1, indep2 and indep3 and the dependent variable is dep (assuming that dep and the indep variables are named the same for each stock):
myestimate<-Map(function(x)lm(dep~indep1+indep2+indep3,data=x),myresi)
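For a version that can be run end to end, here is a minimal two-stage sketch on simulated data. The stock names S1 and S2, the dates, and the second-stage regressor z are all made up for illustration. The month factor is built from the dates, and the second data frame is assumed to have one row per date in the same order.

```r
set.seed(6)
dates <- seq(as.Date("2010-01-01"), by = "day", length.out = 730)
stockinfo <- data.frame(S1 = rnorm(730), S2 = rnorm(730))  # hypothetical stocks
stockinfo$S1[10:20] <- NA

# 12 month levels; lm turns them into 11 dummies plus an intercept
month.f <- factor(format(dates, "%m"))

# Stage 1: seasonal residuals per stock; na.exclude keeps the original row alignment
residStock <- sapply(stockinfo, function(y) {
  residuals(lm(y ~ month.f, na.action = na.exclude))
})

# Stage 2: regress each residual series on regressors from another data frame
other <- data.frame(z = rnorm(730))  # made-up regressor, one row per date
stage2 <- lapply(as.data.frame(residStock), function(r) {
  lm(r ~ z, data = other, na.action = na.exclude)
})
```

Both stages avoid an explicit for loop, and lapply(stage2, coef) collects the second-stage estimates per stock.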
