I need to evaluate post-split stock performance with the quantmod package in R for NYSE, AMEX, and NASDAQ. My problem is that I can only look up specific symbols (getSymbols()), but I need to separate my data into splitting and non-splitting firms in order to compare them. Does anyone have an idea how I can do this for the last 25 years?
Thanks
Since you're using the quantmod package, you can use the getSplits() function to find which splits a stock has had within a given timeframe. Since you're dealing with a large number of stocks, you can wrap it in a custom function to get what you want.
library(quantmod)
getSymbolSplit <- function(symbol, xts, date) {
  # getSplits() returns NA when no splits are found from `date` onward
  splitCheck <- getSplits(symbol, from = date)
  if (anyNA(splitCheck, recursive = FALSE)) {
    xts$Split <- 0
  } else {
    xts$Split <- 1
  }
  return(xts)
}
Once you have that function, you can quickly add a split to the existing stocks data. Example:
getSymbols('GOOG')
GOOG <- getSymbolSplit('GOOG',GOOG,'1993-01-01')
The getSymbols() function creates the xts named GOOG, so our function checks if any splits have happened since 1993-01-01 (yes) and adds a column Split with the value of 1.
getSymbols('REGN')
REGN <- getSymbolSplit('REGN',REGN,'1993-01-01')
Same deal, but REGN has had no splits since 1993 and the column has a value of 0.
Now you have a clear binary variable for grouping between firms that have had a split in your given timeframe.
As a warning, I encountered a problem with BRK-A. R does not normally permit names that include a '-', and the function breaks when trying to pass an xts named BRK-A. If you have stocks that use a - in their symbol, I recommend you rename them before using them. This function is not the only place where the dash could cause problems.
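For a long list of tickers, it may be easier to build the split flag as a named vector rather than as a column on each xts. The following is only a sketch: classify_splits and its fetch argument are hypothetical names introduced here, and it assumes (as the function above does) that getSplits() returns NA when there are no splits. Because the symbol is only ever handled as a string, tickers such as BRK-A never have to become R object names.

```r
# Hypothetical helper: flag each symbol as splitting (TRUE) or not (FALSE).
# `fetch` defaults to quantmod::getSplits but can be swapped for testing.
classify_splits <- function(symbols, from, fetch = quantmod::getSplits) {
  vapply(symbols, function(sym) {
    # A failed lookup (bad ticker, network error) becomes NULL
    res <- tryCatch(fetch(sym, from = from), error = function(e) NULL)
    if (is.null(res)) return(NA)   # unknown: lookup failed
    # Assumed convention: an all-NA result means "no splits found"
    !all(is.na(res))
  }, logical(1))
}

# Usage with real data (needs quantmod and a network connection):
# flags <- classify_splits(c("GOOG", "REGN", "BRK-A"), from = "1993-01-01")
# splitting     <- names(flags)[which(flags)]
# non_splitting <- names(flags)[which(!flags)]
```

Returning NA for failed lookups keeps "no splits" separate from "could not check", which matters when the symbol list is long.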
I'm using quantmod to work on multiple symbols in R. My instinct is to combine the symbols into a list of xts objects, then use lapply to do what I need to do. However, some of the things that make quantmod convenient seem (to this neophyte) not to play nicely with lists. An example:
> symbols <- c("SPY","GLD")
> getSymbols(symbols)
> prices.list <- mget(symbols)
> names(prices.list) <- symbols
> returns.list <- lapply(prices.list, monthlyReturn, leading = FALSE)
This works. But it's unclear to me which column of prices it is using. If I try to specify adjusted close, it throws an error:
> returns.list <- lapply(Ad(prices.list), monthlyReturn, leading = FALSE)
Error in Ad(prices.list) :
subscript out of bounds: no column name containing "Adjusted"
The help for Ad() confirms that it works on "a suitable OHLC object," not on a list of OHLC objects. In this particular case, how can I specify that lapply should apply the monthlyReturn function to the Adjusted column?
More generally, what is the best practice for working with multiple symbols in quantmod? Is it to use lists, or is another approach better suited?
Answer monthlyReturn:
All the **Return functions are based on periodReturn. By default, periodReturn first makes sure it is working with an xts object. If open and close prices are available, it takes the open price as the start value and the close price as the last value and calculates the return from those. If they are not available, it calculates the return from the first and last values of the time series, taking into account the requested time interval (day, month, year, etc.).
Answer for lapply:
You want to do two operations on each element of the list, so you should use a function inside the lapply:
lapply(prices.list, function(x) monthlyReturn(Ad(x), leading = FALSE))
This will get what you want.
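The same pattern (two steps wrapped in one anonymous function) can be tried without downloading anything. This is only an illustration: the prices are made up, and adjusted() and simple_return() are hypothetical stand-ins for Ad() and monthlyReturn(), not quantmod functions.

```r
# Made-up stand-ins for the xts objects returned by getSymbols()
prices.list <- list(
  SPY = data.frame(SPY.Close = c(100, 102, 101), SPY.Adjusted = c(98, 100, 99)),
  GLD = data.frame(GLD.Close = c(50, 51, 53),    GLD.Adjusted = c(50, 51, 53))
)

# Step 1: pick the column whose name contains "Adjusted" (roughly what Ad() does)
adjusted <- function(x) x[[grep("Adjusted", colnames(x))]]

# Step 2: simple period-over-period return
simple_return <- function(p) p[-1] / p[-length(p)] - 1

# Both steps are applied to each list element inside one anonymous function
returns.list <- lapply(prices.list, function(x) simple_return(adjusted(x)))
```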
Answer for multiple symbols:
Do what you are doing.
Run an lapply when getting the symbols:
stock_prices <- lapply(symbols, getSymbols, auto.assign = FALSE)
Use the tidyquant or BatchGetSymbols packages to get all the data in one big tibble.
... I probably forgot a few. There are multiple SO answers about this.
Suppose I have a dataset of the following form:
City=c(1,2,2,1)
Business=c(2,1,1,2)
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
My actual dataset contains about 200K rows. Furthermore, it contains information for over 100 cities.
Suppose, for each city (which I also call "Type"), I have the following functions which need to be applied:
#Writing the custom functions for the categories here
Type1=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
Type2=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
return(BusinessMax)
}
Once again, the above two functions are extremely simple ones that I use for illustration. The idea here is that for each City (or "Type") I need to run a different function for each row in my dataset. In the above two functions, I used rnorm in order to check that we are drawing different values for each row.
Now for the entire dataset, I want to first divide the observations by City (or "Type"). I can do this using (zz_new[["City"]]==1) [also see below], and then run the respective function for each class. However, when I run the code below, I get -Inf.
Can someone help me understand why this is happening?
For the example data, I would expect to obtain 20 plus 10 times some random value (for Type =1) and 35 minus 100 times some random value (for Type=2). The values should also be different for each row since I am drawing them from a random normal distribution.
library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
Thanks a lot in advance.
Let's take a look at your code.
I rewrote your code
library(dplyr)
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
to
zz_new %>%
mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new,zz_new),
City == 2 ~ Type2(zz_new,zz_new)))
since you are using dplyr but don't use the powerful tools provided by this package.
Besides the usage of mutate one key change is that I replaced zz_new[,] with zz_new. Now we see that both arguments of your Type-functions are the same dataframe.
Next step: Take a look at your function
Type1 <- function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
which is called by Type1(zz_new,zz_new). So the definition of NewSet gives us
NewSet=full_data[which(!full_data$City==observation$City),]
# replace the arguments
NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
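Thus NewSet is always a data frame with zero rows, because no row has a City that differs from itself. Applying max to an empty column of a data.frame yields -Inf (with a warning), and adding a finite random term to -Inf still gives -Inf.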
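One way to get the values the question expects is sketched below. It assumes my reading of the intent: for each row, take the maximum ExpectedRevenue over all rows of the other cities, then add a separate random draw per row. The helper max_other is a hypothetical name, and base R is used so the snippet runs without dplyr.

```r
set.seed(1)
zz <- data.frame(City = c(1, 2, 2, 1),
                 Business = c(2, 1, 1, 2),
                 ExpectedRevenue = c(35, 20, 15, 19))
zz_new <- do.call("rbind", replicate(30, zz, simplify = FALSE))

# Hypothetical helper: max revenue over all rows NOT in the given city
max_other <- function(city, data) max(data$ExpectedRevenue[data$City != city])

# Compute that maximum once per city instead of once per row
cities <- unique(zz_new$City)
other_max <- sapply(cities, max_other, data = zz_new)
names(other_max) <- cities

# Look up each row's base value, then draw one random number per row
base_value <- unname(other_max[as.character(zz_new$City)])
noise <- rnorm(nrow(zz_new))
zz_new$AdjustedRevenue <- ifelse(zz_new$City == 1,
                                 base_value + 10 * noise,    # Type1 rule
                                 base_value - 100 * noise)   # Type2 rule
```

Drawing rnorm(nrow(zz_new)) rather than rnorm(1) is what makes each row differ; a single draw would be recycled across every row of a city.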
My objective: read data files from yahoo then perform calculations on each xts using lists to create the names of xts and the names of columns to assign results to.
Why? I want to perform the same calculations for a large number of xts datasets without having to retype separate lines to perform the same calculations on each dataset.
First, get the datasets for 2 ETFs:
library(quantmod)
# get ETF data sets for example
startDate = as.Date("2013-12-15") #Specify period of time we are interested in
endDate = as.Date("2013-12-31")
etfList <- c("IEF","SPY")
getSymbols(etfList, src = "yahoo", from = startDate, to = endDate)
To simplify coding, remove the ETF prefix (e.g. "IEF.") from the yahoo column names
colnames(SPY) <- gsub("SPY.","", colnames(SPY))
colnames(IEF) <- gsub("IEF.","", colnames(IEF))
head(IEF,2)
# Open High Low Close Volume Adjusted
#2013-12-16 100.86 100.87 100.52 100.61 572400 98.36
#2013-12-17 100.60 100.93 100.60 100.93 694800 98.67
Creating new columns using the functions in quantmod is straightforward, e.g.,
SPY$logRtn <- periodReturn(Ad(SPY),period='daily',subset=NULL,type='log')
IEF$logRtn <- periodReturn(Ad(IEF),period='daily',subset=NULL,type='log')
head(IEF,2)
# Open High Low Close Volume Adjusted logRtn
#2013-12-16 100.86 100.87 100.52 100.61 572400 98.36 0.0000000
#2013-12-17 100.60 100.93 100.60 100.93 694800 98.67 0.0031467
but rather than creating a new statement to perform the calculation for each ETF, I want to use a list instead. Here's the general idea:
etfList
#[1] "IEF" "SPY"
etfColName = "logRtn"
for (etfName in etfList) {
newCol <- paste(etfName, etfColName, sep = "$")
newcol <- periodReturn(Ad(etfName),period='daily',subset=NULL,type='log')
}
Of course, using strings (obviously) doesn't work, because
typeof(newCol) # is [1] "character"
typeof(logRtn) # is [1] "double"
I've tried everything I can think of (at least twice) to coerce the character string etfName$etfColName into an object that I can assign calculations to.
I've looked at many variations that work with data.frames, e.g., mutate() from dplyr, but don't work on xts data files. I could convert datasets back/forth from xts to data.frames, but that's pretty kludgy (to say the least).
So, can anyone suggest an elegant and straightforward solution to this problem (i.e., in somewhat less than 25 lines of code)?
I shall be so grateful that, when I make enough to buy my own NFL team, you will always have a place of honor in the owner's box.
This type of task is a lot easier if you store your data in a new environment. Then you can use eapply to loop over all the objects in the environment and apply a function to them.
library(quantmod)
etfList <- c("IEF","SPY")
# new environment to store data
etfEnv <- new.env()
# use env arg to make getSymbols load the data to the new environment
getSymbols(etfList, from="2013-12-15", to="2013-12-31", env=etfEnv)
# function containing stuff you want to do to each instrument
etfTransform <- function(x, ...) {
# remove instrument name prefix from colnames
colnames(x) <- gsub(".*\\.", "", colnames(x))
# add return column
x$logRtn <- periodReturn(Ad(x), ...)
x
}
# use eapply to apply your function to each instrument
etfData <- eapply(etfEnv, etfTransform, period='daily', type='log')
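The environment-plus-eapply pattern can also be tried offline. The sketch below uses made-up data frames in place of downloaded xts objects, and transform_one is a hypothetical stand-in for etfTransform that computes log returns by hand instead of calling periodReturn.

```r
# Environment holding one made-up price table per instrument
etfEnv <- new.env()
assign("IEF", data.frame(IEF.Close = c(100, 101, 102)), envir = etfEnv)
assign("SPY", data.frame(SPY.Close = c(180, 182, 181)), envir = etfEnv)

# Hypothetical transform: strip the symbol prefix, add a log-return column
transform_one <- function(x) {
  colnames(x) <- gsub(".*\\.", "", colnames(x))
  x$logRtn <- c(0, diff(log(x$Close)))
  x
}

# eapply returns a named list; its order is not guaranteed, so sort by name
etfData <- eapply(etfEnv, transform_one)
etfData <- etfData[order(names(etfData))]

# Individual results are then reached by name, e.g. etfData$SPY or etfData[["IEF"]]
```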
(I didn't realize that you had posted a reproducible example.)
See if this is helpful:
etfColName = "logRtn"
for ( etfName in etfList ) {
newCol <- get(etfName)[ , etfColName]
assign(etfName, cbind( get(etfName),
periodReturn( Ad(get(etfName)),
period='daily',
subset=NULL,type='log')))
}
> names(SPY)
[1] "SPY.Open" "SPY.High" "SPY.Low" "SPY.Close"
[5] "SPY.Volume" "SPY.Adjusted" "logRtn" "daily.returns"
I'm not a quantmod user, and it's only from the behavior I see that I believe the Ad function returns a named vector. (So I did not need to do any naming.)
R is not a macro language, which means you cannot just string together character values and expect them to get executed as though you had typed them at the command line. The get and assign functions allow you to 'pull' and 'push' items from the data object environment on the basis of character values, but you should not use the $-function in conjunction with them.
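A stripped-down illustration of the pull/modify/push cycle with get and assign is sketched here. The two small data frames are made up, and the log return is computed by hand, so nothing from quantmod is needed.

```r
# Made-up objects whose names are known only as strings
IEF <- data.frame(Adjusted = c(98, 99))
SPY <- data.frame(Adjusted = c(180, 183))
obj_names <- c("IEF", "SPY")

for (nm in obj_names) {
  obj <- get(nm)                                  # pull the object by name
  obj$logRtn <- c(0, diff(log(obj$Adjusted)))     # modify the local copy with $
  assign(nm, obj)                                 # push it back under the same name
}
```

The $ operator is applied to the retrieved object, never to the character string, which is the distinction made above.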
I still do not see a connection between the creation of newCol and the actual new column that your code was attempting to create. They have different spellings so would have been different columns ... if I could have figured out what you were attempting.
I've got this dataset
install.packages("combinat")
install.packages("quantmod")
library(quantmod)
library(combinat)
library(utils)
getSymbols("AAPL",from="2012-01-01")
data<-AAPL
p1<-4
dO<-data[,1]
dC<-data[,4]
emaO<-EMA(dO,n=p1)
emaC<-EMA(dC,n=p1)
Pos_emaO_dO_UP<-emaO>dO
Pos_emaO_dO_D<-emaO<dO
Pos_emaC_dC_UP<-emaC>dC
Pos_emaC_dC_D<-emaC<dC
Pos_emaC_dO_D<-emaC<dO
Pos_emaC_dO_UP<-emaC>dO
Pos_emaO_dC_UP<-emaO>dC
Pos_emaO_dC_D<-emaO<dC
Profit_L_1<-((lag(dC,-1)-lag(dO,-1))/(lag(dO,-1)))*100
Profit_L_2<-(((lag(dC,-2)-lag(dO,-1))/(lag(dO,-1)))*100)/2
Profit_L_3<-(((lag(dC,-3)-lag(dO,-1))/(lag(dO,-1)))*100)/3
Profit_L_4<-(((lag(dC,-4)-lag(dO,-1))/(lag(dO,-1)))*100)/4
Profit_L_5<-(((lag(dC,-5)-lag(dO,-1))/(lag(dO,-1)))*100)/5
Profit_L_6<-(((lag(dC,-6)-lag(dO,-1))/(lag(dO,-1)))*100)/6
Profit_L_7<-(((lag(dC,-7)-lag(dO,-1))/(lag(dO,-1)))*100)/7
Profit_L_8<-(((lag(dC,-8)-lag(dO,-1))/(lag(dO,-1)))*100)/8
Profit_L_9<-(((lag(dC,-9)-lag(dO,-1))/(lag(dO,-1)))*100)/9
Profit_L_10<-(((lag(dC,-10)-lag(dO,-1))/(lag(dO,-1)))*100)/10
which are put into this data frame
frame<-data.frame(Pos_emaO_dO_UP,Pos_emaO_dO_D,Pos_emaC_dC_UP,Pos_emaC_dC_D,Pos_emaC_dO_D,Pos_emaC_dO_UP,Pos_emaO_dC_UP,Pos_emaO_dC_D,Profit_L_1,Profit_L_2,Profit_L_3,Profit_L_4,Profit_L_5,Profit_L_6,Profit_L_7,Profit_L_8,Profit_L_9,Profit_L_10)
colnames(frame)<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D","Profit_L_1","Profit_L_2","Profit_L_3","Profit_L_4","Profit_L_5","Profit_L_6","Profit_L_7","Profit_L_8","Profit_L_9","Profit_L_10")
Here is a vector of variable names for later use
vector<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D")
I made all possible combinations of 4 variables from the vector (it contains no dependent variables)
comb<-as.data.frame(combn(vector,4))
comb
and removed the "nonsense" combinations (those that contain both possible values of the same variable)
rc<-comb[!sapply(comb, function(x) any(duplicated(sub('_D|_UP', '', x))))]
rc
Then I prepare the first combination for later subsetting
var<-paste(rc[,1],collapse=" & ")
var
and subset the frame (with all DVs)
kr<-eval(parse(text=paste0('subset(frame,' , var,')' )))
kr
Now I have the data frame subsetted by the first combination of 4 variables.
Then I applied the evaluation function to it
evaluation<-function(x){
s_1<-nrow(x[x$Profit_L_1>0,])/nrow(x)
s_2<-nrow(x[x$Profit_L_2>0,])/nrow(x)
s_3<-nrow(x[x$Profit_L_3>0,])/nrow(x)
s_4<-nrow(x[x$Profit_L_4>0,])/nrow(x)
s_5<-nrow(x[x$Profit_L_5>0,])/nrow(x)
s_6<-nrow(x[x$Profit_L_6>0,])/nrow(x)
s_7<-nrow(x[x$Profit_L_7>0,])/nrow(x)
s_8<-nrow(x[x$Profit_L_8>0,])/nrow(x)
s_9<-nrow(x[x$Profit_L_9>0,])/nrow(x)
s_10<-nrow(x[x$Profit_L_10>0,])/nrow(x)
n_1<-nrow(x[x$Profit_L_1>0,])/nrow(frame)
n_2<-nrow(x[x$Profit_L_2>0,])/nrow(frame)
n_3<-nrow(x[x$Profit_L_3>0,])/nrow(frame)
n_4<-nrow(x[x$Profit_L_4>0,])/nrow(frame)
n_5<-nrow(x[x$Profit_L_5>0,])/nrow(frame)
n_6<-nrow(x[x$Profit_L_6>0,])/nrow(frame)
n_7<-nrow(x[x$Profit_L_7>0,])/nrow(frame)
n_8<-nrow(x[x$Profit_L_8>0,])/nrow(frame)
n_9<-nrow(x[x$Profit_L_9>0,])/nrow(frame)
n_10<-nrow(x[x$Profit_L_10>0,])/nrow(frame)
pr_1<-sum(kr[,"Profit_L_1"])/nrow(kr[,kr=="Profit_L_1"])
pr_2<-sum(kr[,"Profit_L_2"])/nrow(kr[,kr=="Profit_L_2"])
pr_3<-sum(kr[,"Profit_L_3"])/nrow(kr[,kr=="Profit_L_3"])
pr_4<-sum(kr[,"Profit_L_4"])/nrow(kr[,kr=="Profit_L_4"])
pr_5<-sum(kr[,"Profit_L_5"])/nrow(kr[,kr=="Profit_L_5"])
pr_6<-sum(kr[,"Profit_L_6"])/nrow(kr[,kr=="Profit_L_6"])
pr_7<-sum(kr[,"Profit_L_7"])/nrow(kr[,kr=="Profit_L_7"])
pr_8<-sum(kr[,"Profit_L_8"])/nrow(kr[,kr=="Profit_L_8"])
pr_9<-sum(kr[,"Profit_L_9"])/nrow(kr[,kr=="Profit_L_9"])
pr_10<-sum(kr[,"Profit_L_10"])/nrow(kr[,kr=="Profit_L_10"])
mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2,s_3,n_3,pr_3,s_4,n_4,pr_4,s_5,n_5,pr_5,s_6,n_6,pr_6,s_7,n_7,pr_7,s_8,n_8,pr_8,s_9,n_9,pr_9,s_10,n_10,pr_10),ncol=3,nrow=10,dimnames=list(c(1:10),c("s","n","pr")))
df<-as.data.frame(mat)
return(df)
}
result<-evaluation(kr)
result
I need help with several points.
1. In the evaluation function, the matrix is built the wrong way: s_1, n_1, pr_1 all end up in the first column, but I need the values to be filled in row by row.
2. I need some loop/lapply to go through all possible combinations, not only the first one as in this case (var<-paste(rc[,1],collapse=" & ")). The output should be readable: the evaluation function is applied to every combination, and I can see which combination of variables each evaluation belongs to, so that I can compare the results across combinations.
3. This is not the main point, but in general I want to evaluate all possible combinations (that is, for 2:n variables, and all combinations within each size) and then pick the best combination according to a specific DV (Profit_L_1, Profit_L_2, and so on). I am weak at looping right now, so if possible, please keep in mind what I am going to do with this later.
Thanks. Feel free to update, repair, or improve the question. If something could be done more easily or effectively, please do it; I am open to any sensible advice.
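A possible direction for points 1 and 2 is sketched below on a small made-up frame, so it runs without quantmod. Several things are assumptions: the variable names A_UP, A_D, B_UP, B_D are placeholders, evaluate() is a hypothetical rewrite of evaluation(), and pr is taken to mean the average profit within the subset. For point 1, note that matrix() fills column by column unless byrow = TRUE is given.

```r
set.seed(42)
n <- 50
frame <- data.frame(A_UP = runif(n) > 0.5, B_UP = runif(n) > 0.5,
                    Profit_L_1 = rnorm(n), Profit_L_2 = rnorm(n))
frame$A_D <- !frame$A_UP
frame$B_D <- !frame$B_UP

flags   <- c("A_UP", "A_D", "B_UP", "B_D")
profits <- c("Profit_L_1", "Profit_L_2")

# Point 1: byrow = TRUE fills the matrix one row at a time
m <- matrix(1:6, ncol = 3, byrow = TRUE, dimnames = list(NULL, c("s", "n", "pr")))

# All pairs of flags, minus pairs holding both directions of one variable
comb <- combn(flags, 2, simplify = FALSE)
rc <- Filter(function(x) !any(duplicated(sub("_D$|_UP$", "", x))), comb)
names(rc) <- sapply(rc, paste, collapse = " & ")   # label each result

# One row per profit column: s = win share in subset, n = wins / all rows,
# pr = mean profit in subset (assumed meaning)
evaluate <- function(sub, total_rows) {
  t(sapply(profits, function(p) {
    wins <- sum(sub[[p]] > 0, na.rm = TRUE)
    c(s = wins / nrow(sub), n = wins / total_rows,
      pr = mean(sub[[p]], na.rm = TRUE))
  }))
}

# Point 2: evaluate every combination; the list names say which one it was
results <- lapply(rc, function(vars) {
  keep <- Reduce(`&`, frame[vars])          # rows where all flags are TRUE
  evaluate(frame[keep, , drop = FALSE], nrow(frame))
})
```

Building the row filter with Reduce avoids eval(parse()), and because results is a named list, results[["A_UP & B_D"]] shows the table for that combination. For point 3, the same lapply could be wrapped in an outer loop over combination sizes.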
I'm quite new to R and I'm trying to write a function that normalizes my data in different dataframes.
The normalization process is quite easy: I just divide the numbers I want to normalize by the population size for each object (which is stored in the table population).
To know which objects relate to one another, I tried to use the IDs that are stored in the first column of each dataframe.
I chose this approach because some objects in the population dataframe have no corresponding objects in the dataframes to be normalized; that is to say, those dataframes sometimes have fewer objects.
Normally one would build a relational database (which I tried), but that didn't work out for me. So I tried to relate the objects within the function, but the function didn't work. Maybe one of you has experience with this and can help me.
So my attempt to write this function was:
# Load Tables
# Agriculture, Annual Crops
table.annual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Agriculture, Bianual and Perrenial Crops
table.bianual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Fishery
table.fishery <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Population per Municipality
table.population <-read.table ("C:\\Users\\etc", header=T,sep=";")
# attach data
attach(table.annual.crops)
attach(table.bianual.crops)
attach(table.fishery)
attach(table.population)
# Create a function to normalize data
# Objects should be related by their ID in the first column
# Values to be normalized and the population appear in the second column
funktion.norm.percapita<-function (x,y){if(x[,1]==y[,1]){x[,2]/y[,2]}else{return("0")}}
# execute the function
funktion.norm.percapita(table.annual.crops,table.population)
Let's start with the attach steps... why? It's usually unnecessary and can get you into trouble! Especially since both your population data.frame and your crops data.frame have Geocode as a column!
As suggested in the comments, you can use merge. By default it combines data.frames on the columns they have in common by name. You can specify which columns to merge on with the by parameters.
dat <- merge(table.annual.crops, table.population)
dat$crop.norm <- dat$CropValue / dat$Population
The reason your function isn't working? Look at the result of your if statement.
table.annual.crops[,1] == table.population[,1]
This gives a vector of booleans, recycling the shorter vector if the lengths differ, whereas if() expects a single TRUE or FALSE. If your data is quite large (on the order of millions of rows), the merge function can be slow. If this is the case, take a look at the data.table package and use its merge function instead.
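To make the merge approach concrete, here is a minimal sketch on made-up data. The numbers are invented, and the column names Geocode, CropValue and Population simply follow the ones used above; the real tables may be laid out differently.

```r
# Made-up tables: the population table has one municipality (102)
# that does not appear in the crops table
table.population <- data.frame(Geocode    = c(101, 102, 103, 104),
                               Population = c(1000, 2500, 400, 8000))
table.annual.crops <- data.frame(Geocode   = c(101, 103, 104),
                                 CropValue = c(50, 20, 1600))

# Match rows by ID; municipalities without crop data drop out by default
dat <- merge(table.annual.crops, table.population, by = "Geocode")

# Per-capita normalization, row by row
dat$crop.norm <- dat$CropValue / dat$Population
```

Passing all.y = TRUE to merge would instead keep municipality 102 with NA in the crop columns, if unmatched rows need to stay visible.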