I searched for this question here but couldn't find anything, so apologies if it has already been answered. My dataset consists of daily information for a large number of stocks (1000+) over a 10-year period. I have read the dataset in as a time-series data frame where each column is a separate stock. I would like to regress each stock against monthly dummy variables to capture the seasonal variation and obtain the residuals. What I have done is the following:
for (i in 1:1000){
month.f<-factor(months(time(stockinfo[,i])))
dummy<-model.matrix(month.f)
residStock[,1]<-residuals(lm(stockinfo[,i]~dummy,na.action=na.exclude))
}
#Stockinfo is data.frame
Is this the correct way to do it?
Secondly, I would like to run a regression using the residuals as the dependent variable and independent variables from another data frame. What would be the best way to do this? Would I have to use a for loop again?
Thank you a lot for your help.
You can create a list of stocks as follows and then use the Map function to avoid an explicit R for loop (not tested, since you didn't provide sample data).
Assume your data is mydata, with month coded as 1, 2, and so on. With 12 months you use 11 month dummies.
mystock<-list("APP~","INTEL~","MICROSOFT~") # stocks with tilde sign
myresi<-Map(function(x) resid(lm(as.formula(paste(x,"factor(month)")),data=mydata)),mystock) # factor(month) expands to 11 dummies, with the first month as the base
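Say your independent variables are indep1, indep2 and indep3, the same for every stock, and (as an assumption here) stored in a data frame called indepdata whose rows line up with mydata. Each element of myresi is then the dependent variable:
myestimate<-Map(function(r) lm(r~indep1+indep2+indep3,data=indepdata),myresi)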
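For reference, here is a minimal sketch of both steps on simulated data. The stock names, dates and the indep data frame are made up for illustration, and the months are taken from a separate dates vector rather than from time(), so adapt it to how your dates are actually stored.

```r
set.seed(42)
# Simulated stand-in: two years of daily dates and three made-up stocks
dates <- seq(as.Date("2010-01-01"), by = "day", length.out = 730)
stockinfo <- data.frame(A = rnorm(730), B = rnorm(730), C = rnorm(730))
stockinfo$A[5] <- NA  # one missing value, to show that na.exclude keeps rows aligned

# One month factor shared by all stocks; lm() expands it into 11 dummies
month.f <- factor(format(dates, "%m"))

# Step 1: seasonal residuals, one column per stock
residStock <- as.data.frame(lapply(stockinfo, function(y) {
  residuals(lm(y ~ month.f, na.action = na.exclude))
}))

# Step 2: regress every residual series on regressors from another data frame
# (hypothetical variables x1 and x2). A matrix response fits one regression
# per column in a single call; rows with any NA are dropped for all columns.
indep <- data.frame(x1 = rnorm(730), x2 = rnorm(730))
second <- lm(as.matrix(residStock) ~ x1 + x2, data = indep)
coef(second)  # one column of coefficients per stock
```

With na.exclude the residual vector is padded with NA where the input was missing, so every column of residStock keeps the original number of rows.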
Related
I have a dataset in R and would like to find the average of a given variable for each year in it (here, 1871-2019). Not every year has the same number of entries, so I have run into two problems: first, how to find the average of the variable for each year, and second, how to add the column of averages to the dataset. I am unsure how to approach the first problem, but I attempted a version of the second by finding the sum for each group and then trying to add those values to the dataset for each entry of a given year with the code teams$SBtotal <- tapply(teams$SB, teams$yearID, FUN=sum). That code resulted in an error noting that the replacement has 149 rows while the data has 2925. I know this can be done more slowly in Excel, but I'm hoping to be able to use R to solve this problem.
tapply should work:
data(iris)
tapply(iris$Sepal.Length, iris$Species, FUN = sum)
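The error in the question comes from tapply returning one value per group rather than one per row. A small sketch of the difference, using base R's ave to get a column of group means that matches the number of rows (the column name SpeciesMean is just an illustration):

```r
data(iris)

# tapply returns one value per group (3 species here), so it cannot be
# assigned directly as a 150-row column
group_means <- tapply(iris$Sepal.Length, iris$Species, FUN = mean)

# ave returns one value per row, repeating each group's mean
iris$SpeciesMean <- ave(iris$Sepal.Length, iris$Species, FUN = mean)
head(iris[, c("Species", "Sepal.Length", "SpeciesMean")])
```

For the teams data, the same pattern would presumably be ave(teams$SB, teams$yearID, FUN = mean).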
I have a data frame consisting of three variables: momentum returns (numeric), volatility (factor) and market state (factor). Volatility and market state each have two levels: volatility is high or low, and market state is positive or negative. I want to make a two-way sorted table showing the mean momentum return in every case.
library(wakefield)
mom<-rnorm(30)
vol<-r_sample_factor(30,x=c("high","low"))
mar<-r_sample_factor(30,x=c("positive","negative"))
df<-data.frame(mom,vol,mar)
Based on the suggestion given by @r2evans, if you want the mean for every sorted case you can use the following code.
xtabs(mom~vol+mar,aggregate(mom~vol+mar,data=df,mean))
## If you want simple sum in every case
xtabs(mom~vol+mar,data=df)
You can also do this with the help of the data.table package, which is usually faster on large data.
library(data.table)
df<-as.data.table(df)
## if you want results in data frame format
df[,.(mean(mom)),by=.(vol,mar)]
## if you want the means as a simple vector
df[,mean(mom),by=.(vol,mar)]$V1
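If you would rather stay in base R, the two-way table of means can also be sketched with tapply and a list of two grouping factors. The data below is simulated with sample() instead of wakefield, only so the snippet runs without extra packages.

```r
set.seed(1)
# Simulated stand-in for the data frame in the question
df <- data.frame(
  mom = rnorm(30),
  vol = factor(sample(c("high", "low"), 30, replace = TRUE)),
  mar = factor(sample(c("positive", "negative"), 30, replace = TRUE))
)

# Rows are volatility levels, columns are market states, cells are mean returns
means <- with(df, tapply(mom, list(vol = vol, mar = mar), mean))
means
```

A cell is NA if that combination of levels never occurs in the data.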
I have a panel data set with return, ESG score and market value for a number of companies over 11 years. I need to extract data for all variables for one year at a time, to make yearly portfolios.
The data frame looks like this:
How can I extract one year at a time and then construct portfolios of high and low ESG score for each year?
Thanks in advance
Have you considered processing the data with Python and Pandas instead of R? The following solution should help to slice your data into different time intervals:
Slice JSON File into Different Time Intercepts with Python
In terms of sorting ESG scores, you can use the following command: df.sort_values('ESG')
Hope that helps and good luck with your dataset.
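If you prefer to stay in R, here is a rough sketch of the same idea. The column names (firm, year, ESG, ret) and the median split are assumptions, since the actual data frame isn't shown, and the portfolio returns here are equal-weighted rather than weighted by market value.

```r
set.seed(7)
# Simulated panel: 20 hypothetical firms over 3 years
panel <- expand.grid(firm = paste0("F", 1:20), year = 2010:2012)
panel$ESG <- runif(nrow(panel), 0, 100)
panel$ret <- rnorm(nrow(panel), 0.05, 0.2)

# One data frame per year
by_year <- split(panel, panel$year)

# Within each year, label firms above the median ESG score "high", the rest "low"
portfolios <- lapply(by_year, function(d) {
  d$group <- ifelse(d$ESG > median(d$ESG), "high", "low")
  d
})

# Equal-weighted mean return of each portfolio in each year
port_ret <- sapply(portfolios, function(d) tapply(d$ret, d$group, mean))
port_ret
```

by_year[["2010"]] then gives all variables for a single year, and port_ret has one row per portfolio and one column per year.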
I am new to time-series analysis and have a data set with a daily time step at 5 factor levels. My goal is to use the acf function in R to determine whether there is significant autocorrelation across the response variable of interest so that I can justify whether or not a time-series model is necessary.
I have sorted the dataset by Day, and am using the following code:
acf(DE_vec, lag.max=7)
The dataset has not been converted to a time-series object…it is a vector sorted by Day.
My first question is whether the dataframe should be converted to a time-series object, or if it is also correct to sort the vector by Day?
Second, if I have a variable repeated over the 5 levels for each Day, then should I construct 5 different acf plots for each level, or would it be ok to pool over stations as was done with the code above?
Thanks in advance,
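Yes, acf() will work on a numeric data frame (it is coerced to a matrix), so you don't have to convert to a time-series object as long as the rows are in time order. And yes, you should compute the ACF for each of the 5 levels separately. If your data frame has one column per level and you pass the entire df to acf(), it will return the ACF for each of the levels, along with the cross-correlations between them.
If you are curious about the relationship across levels, then you need to use ccf() or some mutual information metric like those in the entropy or infotheo packages.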
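A minimal sketch of the per-level approach, assuming the data is in long format with one row per day and level. The column names Day, Station and DE are made up here to mirror the question.

```r
set.seed(3)
# Simulated long-format data: 100 days x 5 hypothetical stations
dat <- data.frame(
  Day = rep(1:100, each = 5),
  Station = factor(rep(paste0("S", 1:5), times = 100)),
  DE = rnorm(500)
)

# Sort by day within each station so every series is in time order
dat <- dat[order(dat$Station, dat$Day), ]

# One ACF per station, up to lag 7, without drawing the plots
acfs <- lapply(split(dat$DE, dat$Station), function(x) {
  acf(x, lag.max = 7, plot = FALSE)
})

acfs[["S1"]]$acf[, 1, 1]  # autocorrelations at lags 0 to 7 for station S1
```

Calling plot(acfs[["S1"]]) draws the usual correlogram for a single station.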
I have a large data frame containing monthly stock returns from January 1970 to December 2009 (rows) for 7 different countries including the US (columns). My task is to regress the stock returns of each country (dependent variable) on the US stock returns (independent variable) using the values from 4 different time periods, namely the 70s, the 80s, the 90s and the 00s.
The data set (.csv) can be downloaded at:
https://docs.google.com/file/d/0BxaWFk-EO7tjbG43Yl9iQVlvazQ/edit
This means that I have 24 regressions to run separately and report the results, which I have already done using the lm() function. However, I am now trying to use R more intelligently and create custom functions that will achieve my purpose and produce the 24 sets of results.
I have created sub data frames containing the observations clustered according to the time periods knowing that there are 120 months in a decade.
seventies = mydata[1:120, ] # 1970s (from Jan. 1970 to Dec. 1979)
eighties = mydata[121:240, ] # 1980s (from Jan. 1980 to Dec. 1989)
nineties = mydata[241:360, ] # 1990s (from Jan. 1990 to Dec. 1999)
twenties = mydata[361:480, ] # 2000s (from Jan. 2000 to Dec. 2009)
NB: Each of the newly created variables is a 120 x 7 matrix, for 120 observations across 7 countries.
Running the 24 regressions using Java would require the use of nested for loops.
Could anyone provide the steps I must take to write a function that will arrive at the desired result? Some snippets of R code would also be appreciated. I am also thinking the mapply function could be used.
Thank you and let me know if my post needs some editing.
try this:
install.packages('plyr')
library('plyr')
myfactors<-rep(c("seventies","eighties","nineties","twenties"),each=120)
# fit one model per decade; replace y and the placeholder with your own column names
fits<-lapply(split(mydata,myfactors),function(d) lm(y~ << regressors go here >>,data=d))
The lm function will accept a matrix as the response variable and compute separate regressions for each of the columns, so you can just combine (cbind) the different countries together for that part.
If you are willing to assume that the different decades have the same variance then you could fit the different decades using a dummy variable for decade (look at the gl function for a quick way to calculate a decade factor) and do everything in one call to lm. A simple example:
fit <- lm( cbind( Sepal.Width, Sepal.Length, Petal.Width ) ~ 0 + Species + Petal.Length:Species,
data=iris )
This will give the same coefficient estimates as the separate regressions; only the standard deviations and degrees of freedom (and therefore the tests and anything else that depends on those) will be different from running the regressions individually.
If you need the standard deviations computed individually for each decade then you can use tapply or sapply (passing decade info into the subset argument of lm) or other apply functions.
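As a rough sketch of the individual-decade route on simulated data (the country column names are invented, since the real file's names aren't shown): split the rows by decade and fit one matrix-response lm per decade, which gives the 6 x 4 = 24 regressions.

```r
set.seed(1)
# Simulated stand-in for the real data: 480 months, 7 countries (made-up names)
mydata <- as.data.frame(matrix(rnorm(480 * 7), ncol = 7))
names(mydata) <- c("USA", "UK", "FRA", "GER", "JPN", "CAN", "ITA")

# gl() builds the decade factor: 4 levels, each repeated for 120 months
decade <- gl(4, 120, labels = c("1970s", "1980s", "1990s", "2000s"))
others <- setdiff(names(mydata), "USA")

# One lm per decade; the matrix response fits all 6 countries in each call
fits <- lapply(split(mydata, decade), function(d) {
  lm(as.matrix(d[others]) ~ USA, data = d)
})

coef(fits[["1970s"]])     # intercepts and slopes, one column per country
summary(fits[["1970s"]])  # a separate summary for each country
```

Because each decade is fitted on its own, the standard errors here are computed from that decade's data only.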
For displaying the results from several different regression models the new stargazer package may be of interest.
Try using the 'stargazer' package for publication-quality text or LaTeX regression results tables.