I would like to run a time series regression where each column of a data frame is a separate dependent variable, regressing each column on the same set of independent variables. I know you can just use
lm(as.matrix(dataframe) ~ independent_variables)
because if the dependent variable is a matrix, lm fits a separate regression for each column.
However, my dependent variables are information about stocks through time, and sometimes information is not available for every single stock at every time point, so I have some NA values. The problem I am having is that if I use lm, the NA values have to be omitted, i.e. the lm function removes the whole row when running the regression. This is fine if I only want to run a regression on one dependent variable, but I have a list (1000+) of dependent variables to regress. Because my dataset is only 15+ years, there are missing values at every single time point, so when I run my lm regression I get an error, because the lm function has removed every single row. The only way I can think of to solve this is a for loop running a separate regression for each stock, which I think will take a very long time to compute. For example, here is a sample of my data:
            135081(P)   135084(P)   135090(P)
1994-12-30         NA          NA          NA
1995-01-02         NA          NA          NA
1995-01-03 0.06864935          NA          NA
1995-01-04         NA          NA -0.05474644
1995-01-05         NA          NA  0.20894900
1995-01-06         NA -0.45672832 -0.02378632
so if I run a time series regression on this, I get an error because the lm function skips every single row.
So my question is: would there be another way to run a time series regression across a data frame of different DEPENDENT variables, where the regression "skips" the NAs for just the one particular dependent variable, instead of skipping that row for every other dependent variable as well?
I don't think using na.omit is correct, because it removes the time series properties of my dataset, and using na.action = NULL doesn't work because I have NAs in my dataset.
Thank you a lot for your help.
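For reference, the per-stock loop described above can be written compactly with lapply rather than an explicit for loop; a minimal sketch, assuming the returns sit in a data frame ret shaped like the sample above, with hypothetical regressor columns x1 and x2 in a data frame X:

# one small regression per stock; lm() then drops only the rows that
# are NA for that particular column (ret, X, x1, x2 are assumptions)
fits <- lapply(ret, function(y) lm(y ~ x1 + x2, data = X, na.action = na.omit))
coefs <- sapply(fits, coef)   # coefficient matrix, one column per stock

Despite the 1000+ columns, each fit only uses a few hundred rows, so the loop is typically fast in practice.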
You might want to employ a multiple imputation method, using something like the Amelia II package on CRAN, in order to properly account for the increased uncertainty in your estimates due to missingness, and also to help minimize the biases that result from case-wise deletion. See for example:
Honaker, J. and King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561–581.
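A minimal sketch of that route, assuming the returns have been reshaped to long format as returns_long with a numeric time index date, a stock identifier stock, the return ret, and hypothetical regressors x1 and x2 (all of these names are assumptions):

library(Amelia)

# multiple imputation tailored to time-series cross-section data
a.out <- amelia(returns_long, m = 5, ts = "date", cs = "stock")

# fit the model on each completed dataset; the 5 sets of results
# would then be pooled rather than used individually
fits <- lapply(a.out$imputations, function(d) lm(ret ~ x1 + x2, data = d))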
The glm function seems to only work if NA values are removed. However, I do not want to remove any data. Is there any way to run a logistic regression with missing values, without changing the data?
Thank you in advance!
There are a few ways to deal with NA values if you do not want to delete the whole row.
It depends on why there is an NA value: in a survey, for example, an NA can be a non-answer, which is an answer in itself, so you could count the NA as its own category.
If the value is missing because of the data collection process, you need to do some data imputation. You could run a simple KNN if you don't want to take too much time creating a model for the imputation.
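A minimal sketch of both options, assuming a data frame df with a factor column answer and numeric columns age and income (all names hypothetical); VIM::kNN is one off-the-shelf KNN imputer:

# option 1: count NA as its own category
df$answer <- addNA(df$answer)   # NA becomes an explicit factor level

# option 2: KNN imputation for values missing from data collection
library(VIM)
df_imp <- kNN(df, variable = c("age", "income"), k = 5)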
As a new R user I'm having trouble understanding why the NA values in my dataframe keep changing. I'm running my code on Kaggle; maybe that's where my problem is arising from?
[screenshot: original dataframe titled "abc"]
There are multiple columns that have NA values, so I decided to try multiple imputation to handle them.
I created a new dataframe with just the columns that had NA values and began the imputation.
[screenshot: the new dataframe titled "abc1"]
library(dplyr)
library(mice)

# keep only the columns that contain NA values
abc1 <- select(abc, c(9, 10, 15, 16, 17, 18, 19, 25, 26))

# mice imputation: 5 imputed datasets, predictive mean matching
input_data <- abc1
my_imp <- mice(input_data, m = 5, method = "pmm", maxit = 20)

summary(input_data$m_0_9)   # the original column, still with NAs
my_imp$imp$m_0_9            # the 5 candidate imputations for its NAs
When the imputation runs, it creates 5 columns containing candidate values to fill in the NA values of column m_0_9, and I choose which of the 5 columns to use.
[screenshot: imputation of column "m_0_9"]
Then I run this code:
final_clean_abc1 <- complete(my_imp, 5)
This assigns the values from the 5th imputation (column 5 in the screenshot above) to the NA values in my "abc1" dataframe and saves the result as "final_clean_abc1".
Lastly I replace the columns from the original "abc" dataframe that had missing values with the new columns in "final_clean_abc1."
I know this probably isn't the cleanest:
abc$m_0_9 <- final_clean_abc1$m_0_9
abc$m_10_12 <- final_clean_abc1$m_10_12
abc$f_0_9 <- final_clean_abc1$f_0_9
abc$f_10_12 <- final_clean_abc1$f_10_12
abc$f_13_14 <- final_clean_abc1$f_13_14
abc$f_15 <- final_clean_abc1$f_15
abc$f_16 <- final_clean_abc1$f_16
abc$asian_pacific_islander <- final_clean_abc1$asian_pacific_islander
abc$american_indian <- final_clean_abc1$american_indian
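A more compact equivalent of the block above (same behavior), replacing all nine columns at once:

vars <- names(final_clean_abc1)
abc[vars] <- final_clean_abc1[vars]   # overwrite the imputed columns in abc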
Now that I have a dataframe "abc" with no missing values, this is where my problem arises. I should be seeing '162' in row 10 of the m_0_9 column, but when I save my code and view it on Kaggle I get the value '7' for that specific row and column, as shown in the screenshot below.
"abc" dataframe with no NA values
Hopefully this makes sense; I tried to be as specific as I could.
There are multiple stochastic processes going on inside mice when it generates the imputed values, so you should not expect the same result each time you run mice.
From the MICE documentation:
"In the first step, the dataset with missing values (i.e. the incomplete dataset) is copied several times. Then in the next step, the missing values are replaced with imputed values in each copy of the dataset. In each copy, slightly different values are imputed due to random variation. This results in multiple imputed datasets. In the third step, the imputed datasets are each analyzed and the study results are then pooled into the final study result. In this Chapter, the first phase in multiple imputation, the imputation step, is the main topic. In the next Chapter, the analysis and pooling phases are discussed."
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html
We have a wonderful series of vignettes that detail the use of mice. Part of that series covers the stochastic nature of the algorithm and how to fix it: setting mice(yourdata, seed = 123) will generate the same set of imputations every time.
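Applied to the code above, a minimal sketch:

library(mice)

# fixing the seed makes the imputation reproducible across runs
my_imp <- mice(abc1, m = 5, method = "pmm", maxit = 20, seed = 123)
final_clean_abc1 <- complete(my_imp, 5)   # identical completed data every run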
I'm working with a dataset that is comparing the abundance of certain species against environmental variables in various sampling sites.
For some of the sites, environmental variables could not be measured in the field. As a result, these values are written as "NA" in my dataset.
However, for the variables relating to species abundance, some values are zero simply because one or more species were not observed at that particular site.
I'm using the mice package to deal with these NA values through imputation. However, I also want to assess the proportion of missing values using md.pattern (from mice) and aggr (from the VIM package). The issue is that these functions treat not only the NA values as missing data but also the zero values. How can I make it so that only NA is detected as missing data, and not the values which are zero?
I am using the mice package to impute data, and I have read about post-processing to restrict imputed values. One of the variables I am imputing is a categorical variable with 10 different levels (a, b, c, d, e, f, g, h, i, j). The missing values can take any level as a value except a and d. I need to make it so that people imputed with category a or d end up as NA after the imputation, because right now people are imputed from all the available levels, and that is wrong.
I have also tried creating another binary variable, coded 0 and 1, to make it work, but it still imputed the wrong way.
Any ideas about post-processing this in mice in R?
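One possible sketch of the post mechanism, assuming a data frame dat with the factor column grp (both names hypothetical); rather than writing NA back into the sampler (which would feed missing values into later iterations), it redraws any imputed a or d from the allowed levels, following the imp[[j]][, i] pattern used in the mice documentation:

library(mice)

ini  <- mice(dat, maxit = 0)   # dry run to obtain the default settings
post <- ini$post

# evaluated by the sampler after each imputation of grp:
# redraw any disallowed 'a' or 'd' from the remaining levels
post["grp"] <- "
  bad <- imp[[j]][, i] %in% c('a', 'd')
  imp[[j]][bad, i] <- sample(c('b','c','e','f','g','h','i','j'), sum(bad), replace = TRUE)
"

my_imp <- mice(dat, post = post, m = 5, seed = 123)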
I have a mega data frame containing monthly stock returns from January 1970 to December 2009 (rows) for 7 different countries including the US (columns). My task is to regress the stock returns of each country (dependent variable) on the US stock returns (independent variable) using the values from 4 different time periods, namely the 70s, the 80s, the 90s and the 00s.
The data set (.csv) can be downloaded at:
https://docs.google.com/file/d/0BxaWFk-EO7tjbG43Yl9iQVlvazQ/edit
This means that I have 24 regressions to run separately and report the results for, which I have already done using the lm() function. However, I am currently attempting to use R smarter, creating custom functions that will achieve my purpose and produce the 24 sets of results.
I have created sub data frames containing the observations clustered according to the time periods knowing that there are 120 months in a decade.
seventies = mydata[1:120, ] # 1970s (from Jan. 1970 to Dec. 1979)
eighties = mydata[121:240, ] # 1980s (from Jan. 1980 to Dec. 1989)
nineties = mydata[241:360, ] # 1990s (from Jan. 1990 to Dec. 1999)
twenties = mydata[361:480, ] # 2000s (from Jan. 2000 to Dec. 2009)
NB: Each of the newly created variables is a 120 x 7 matrix: 120 monthly observations across the 7 countries.
Running the 24 regressions in Java would require nested for loops.
Could anyone provide the steps I must take to write a function that will arrive at the desired result? Some snippets of R code would also be appreciated. I am also thinking the mapply function could be used.
Thank you and let me know if my post needs some editing.
try this:
install.packages("plyr")
library(plyr)

# label each row with its decade (120 months per decade)
mydata$decade <- rep(c("seventies", "eighties", "nineties", "twenties"), each = 120)

# one lm() fit per decade; dlply() returns a named list of fitted models
fits <- dlply(mydata, "decade", function(d) lm(<<response>> ~ <<regressors go here>>, data = d))
The lm function will accept a matrix as the response variable and compute separate regressions for each of the columns, so you can just combine (cbind) the different countries together for that part.
If you are willing to assume that the different decades have the same variance then you could fit the different decades using a dummy variable for decade (look at the gl function for a quick way to calculate a decade factor) and do everything in one call to lm. A simple example:
fit <- lm(cbind(Sepal.Width, Sepal.Length, Petal.Width) ~ 0 + Species + Petal.Length:Species,
          data = iris)
This will give the same coefficient estimates as the separate regressions; only the standard deviations and degrees of freedom (and therefore the tests and anything else that depends on those) will be different from running the regressions individually.
If you need the standard deviations computed individually for each decade, then you can use tapply or sapply (passing the decade info into the subset argument of lm) or other apply functions.
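A minimal sketch of that per-decade route, with France, Germany and Japan as stand-ins for the actual country column names (assumptions; USA per the question):

# decade label for each of the 480 monthly rows
decade <- rep(c("seventies", "eighties", "nineties", "twenties"), each = 120)

# one multi-response fit per decade, so each decade gets its own
# residual variance; the result is a named list of four models
fits <- lapply(setNames(unique(decade), unique(decade)), function(d)
  lm(cbind(France, Germany, Japan) ~ USA, data = mydata, subset = decade == d))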
For displaying the results from several different regression models the new stargazer package may be of interest.
Try using the 'stargazer' package for publication-quality text or LaTeX regression results tables.
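For example, reusing the decade subframes defined in the question (France is a hypothetical country column):

library(stargazer)   # install.packages("stargazer")

f70 <- lm(France ~ USA, data = seventies)
f80 <- lm(France ~ USA, data = eighties)
f90 <- lm(France ~ USA, data = nineties)
f00 <- lm(France ~ USA, data = twenties)

# side-by-side text table; type = "latex" (the default) gives a
# publication-ready LaTeX table instead
stargazer(f70, f80, f90, f00, type = "text",
          column.labels = c("1970s", "1980s", "1990s", "2000s"))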