Return dispersion calculation in R

I would like to calculate the return dispersion over a long period of time. The formula I use is the one for the equal-weighted standard deviation (see below). I tried to use the sd() and apply() functions, but it did not work.
Formula:

RD_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(r_{i,t} - R_{SMI,t}\right)^2}

where r_{i,t} are the individual stock returns at time t (here n = 4 stocks) and R_{SMI,t} is the index return at time t.
NOVARTIS R NESTLE R ROCHE UBS GROUP
2005-07-18 1.11200510 -0.14716706 -0.4210533 -0.28876340
2005-07-19 0.23668650 -0.22115748 -0.3623192 0.67176884
2005-07-20 0.07877117 -0.44378771 4.0313698 -0.47844392
2005-07-21 -0.55270571 -0.37133351 -0.8754068 0.28604262
2005-07-22 -0.23781224 -0.07443246 0.2926546 0.00000000
2005-07-25 0.23781224 0.74184316 0.4082829 -0.09525666
This is my index
SMI
2005-07-18 -0.01077012
2005-07-19 0.53767147
2005-07-20 -0.02208674
2005-07-21 -0.10192245
2005-07-22 0.01653908
2005-07-25 0.03050783
Now I want to calculate the RD for every time t, so that I get a time series of all RDs.
What functions, loops, or other techniques should I look at? I do not want to do it by hand, because the formula may be applied to bigger datasets.

I made up my own sample data because it was easier, but I think this is what you're after. It uses data.table and reshape2 for the heavy lifting.
library(data.table)
library(reshape2)
#make fake data
set.seed(100)
rit <- data.table(dATE=as.POSIXct('2005-07-18')+(60*60*24*0:5),
                  stock1=runif(6,-1,1),
                  stock2=runif(6,-1,1),
                  stock3=runif(6,-1,1),
                  stock4=runif(6,-1,1))
smi <- data.table(dATE=as.POSIXct('2005-07-18')+(60*60*24*0:5), smi=runif(6,-1,1))
#to convert from a matrix-like object
#(I can't quickly figure out how to pull POSIXct out of a ts object,
#so the dates are hard-coded, but it will still work)
rit<-data.table(your_rit_object)
rit[,dATE:=seq(from=as.POSIXct('2005-07-18'), to=as.POSIXct('2005-07-25'),by='days')]
smi<-data.table(your_smi_object)
smi[,dATE:=seq(from=as.POSIXct('2005-07-18'), to=as.POSIXct('2005-07-25'),by='days')]
#(note: the date sequence must have one date per row of your data)
#melt table from wide to long
ritmelt<-melt(rit,id.vars="dATE")
#combine with smi table
ritmeltsmi<-merge(ritmelt,smi,by='dATE')
#implement the formula: square root of the mean squared deviation from the index
ritmeltsmi[,sqrt(sum((value-smi)^2)/.N),by=dATE]
#if you want to name the new column you could do this instead
#ritmeltsmi[,list(RD=sqrt(sum((value-smi)^2)/.N)),by=dATE]
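Since you mentioned sd() and apply(), a base-R sketch of the same calculation on the original wide objects also works; rit_mat and smi_vec below are assumed stand-ins for your 4-column return matrix and index series, aligned by date:
#base-R sketch; 'rit_mat' and 'smi_vec' are assumed names, not from the question
dev <- sweep(as.matrix(rit_mat), 1, as.numeric(smi_vec))  #r(i,t) - R(SMI,t) for each row t
RD <- sqrt(rowMeans(dev^2))                               #one RD value per date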

Related

Why do the aggregate data frame and aggregate formula methods not return the same results?

I just started learning R last month and I am learning the aggregate functions.
To start off, I have a data called property and I am trying to get the mean price per city.
I first used the formula method of aggregate:
mean_price_per_city_1 <- aggregate(PRICE ~ PROPERTYCITY,
                                   property_data, mean)
The results are as follow (just the head):
  PROPERTYCITY     PRICE
1                   1.00
2 ALLISON PARK 193814.08
3     AMBRIDGE  62328.92
4    ASPINWALL 226505.50
5        BADEN 400657.52
6    BAIRDFORD  59337.37
Then I decided to try the data frame method:
mean_price_per_city_2 <- aggregate(list(property_data$PRICE),
                                   by = list(property_data$PROPERTYCITY),
                                   FUN = mean)
The results are as follow (just the head):
       Group.1 c.12000L..1783L..4643L..
1                                  1.00
2 ALLISON PARK                       NA
3     AMBRIDGE                 62328.92
4    ASPINWALL                226505.50
5        BADEN                400657.52
6    BAIRDFORD                 59337.37
I thought that the two methods would return the same results. However, I noticed that when I used the data frame method, there are NAs in the second column.
I tried checking if there are NAs in the PRICE column, but there are none. So I am lost as to why the two methods don't return the same values.
You have two issues. First, aggregate(list(property_data$PRICE), by = list(property_data$PROPERTYCITY), FUN = mean) should just have property_data$PRICE without the list(); only the by= argument must be a list. That is why your column name is so strange. Second, as documented in the manual page (?aggregate), the formula method has a default of na.action = na.omit, but the method for class data.frame does not. Since you have at least one missing value in the ALLISON PARK group, the formula call dropped that value, but the second call did not, so the result for ALLISON PARK is NA.
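For illustration, a corrected version of the data frame call (assuming the same property_data as above) could look like this; na.rm = TRUE is passed through to mean() so the ALLISON PARK group matches the formula method's na.omit behavior, and naming the by= element gives a readable column name:
mean_price_per_city_2 <- aggregate(property_data$PRICE,
                                   by = list(PROPERTYCITY = property_data$PROPERTYCITY),
                                   FUN = mean, na.rm = TRUE)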

RSI outputs in Technical Trading Rules (TTR) package

I'm learning to use R's capabilities in technical trading with the Technical Trading Rules (TTR) package. Assume a crypto portfolio with BTC as its reference currency. Historical hourly data (60 periods) is collected using the cryptocompare.com API and converted to a zoo object. The aim is to create a 14-period RSI for each crypto (and possibly visualize all of them in one canvas). For each crypto, I expect the RSI output to be 14 NAs followed by 46 calculated values, but I'm getting 360 outputs. What am I missing here?
require(jsonlite)
require(dplyr)
require(TTR)
portfolio <- c("ETH", "XMR", "IOT")
for (i in 1:length(portfolio)) {
  hour_data <- fromJSON(paste0("https://min-api.cryptocompare.com/data/histohour?fsym=", portfolio[i], "&tsym=BTC&limit=60", collapse = ""))
  read.zoo(hour_data$Data) %>%
    RSI(n = 14) %>%
    print()
}
Also, my time series data is in the following form (first column timestamp):
close high low open volumefrom volumeto
1506031200 261.20 264.97 259.78 262.74 4427.84 1162501.8
1506034800 258.80 261.20 255.68 261.20 2841.67 735725.4
Does TTR expect the more conventional OHLC (open, high, low, close) column order?
The RSI() function expects a univariate price series. You passed it an object with 6 columns, so it was converted to a univariate vector (hence 60 rows x 6 columns = 360 values). You need to subset the output of read.zoo() so that only the "close" column is passed to RSI().
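A minimal sketch of that fix inside your loop, assuming the column is named "close" as in the data you show:
read.zoo(hour_data$Data)[, "close"] %>%  #keep only the close prices
  RSI(n = 14) %>%
  print()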

HMM text recognition in R with depmixS4

I'm wondering how I would utilize the depmixS4 package for R to run an HMM on a dataset. What functions would I use to get a classification of a test data set?
I have a file of training data, a file of label data, and a file of test data.
The training data consists of 4620 rows. Each row has 1079 values: 83 windows with 13 values per window, so each row is made up of 83 states with 13 observations each. Each of these rows is one spoken word, so there are 4620 utterances. In total, though, the data contains only 7 distinct words; each distinct word has 660 different utterances, hence the 4620 rows.
So we have words 0-6.
The label file is a list where each row is labeled 0-6, corresponding to which word it is. For example, row 300 is labeled 2, row 450 is labeled 6, and row 520 is labeled 0.
The test file contains about 5000 rows, structured exactly like the training data except that there are no labels associated with it.
I want to use an HMM trained on the training data to classify the test data.
How would I use depmixS4 to output a classification of my test data?
I'm looking at :
depmix(response, data=NULL, nstates, transition=~1, family=gaussian(),
       prior=~1, initdata=NULL, respstart=NULL, trstart=NULL, instart=NULL,
       ntimes=NULL, ...)
but I don't know what response refers to or any of the other parameters.
Here's a quick, albeit incomplete, example to get you started, if only to familiarize you with the basic outline. Please note that this is a toy example and it merely scratches the surface of HMM design/analysis. The vignette for the depmixS4 package, for instance, offers quite a lot of context and examples. Meanwhile, here's a brief intro.
Let's say that you wanted to investigate whether industrial production offers clues about economic recessions. First, let's load the relevant packages and then download the data from the St. Louis Fed:
library(quantmod)
library(depmixS4)
library(TTR)
fred.tickers <-c("INDPRO")
getSymbols(fred.tickers,src="FRED")
Next, transform the data into rolling 1-year percentage changes to minimize noise, and convert the result to data.frame format for analysis in depmixS4:
indpro.1yr <- na.omit(ROC(INDPRO, 12))
indpro.1yr.df <- data.frame(indpro.1yr)
Now, let's run a simple HMM model and choose just 2 states: growth and contraction. Note that we're only using industrial production to search for signals:
model <- depmix(response = INDPRO ~ 1,
                family = gaussian(),
                nstates = 2,
                data = indpro.1yr.df,
                transition = ~1)
Now let's fit the resulting model, generate the posterior states for analysis, and estimate the probabilities of recession. We'll also bind the data with dates in xts format for easier viewing/analysis. (Note the use of set.seed(1), which creates a replicable starting value to launch the modeling.)
set.seed(1)
model.fit <- fit(model, verbose = FALSE)
model.prob <- posterior(model.fit)
prob.rec <- model.prob[,2]
prob.rec.dates <- xts(prob.rec, order.by=as.Date(index(indpro.1yr)))
Finally, let's review and ideally plot the data:
head(prob.rec.dates)
[,1]
1920-01-01 1.0000000
1920-02-01 1.0000000
1920-03-01 1.0000000
1920-04-01 0.9991880
1920-05-01 0.9999549
1920-06-01 0.9739622
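If you want the plot as well, a minimal sketch using the xts plot method (xts is loaded along with quantmod):
plot(prob.rec.dates, main="Probability of contraction state")  #one line per the posterior series above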
High values (above 0.80, say) suggest that the economy is in recession/contraction.
Again, a very, very basic introduction, perhaps too basic. Hope it helps.

ANOVA in R using summary data

Is it possible to run an ANOVA in R with only means, standard deviations, and n-values? Here is my data frame:
q2data.mean <- c(90,85,92,100,102,106)
q2data.sd <- c(9.035613,11.479667,9.760268,7.662572,9.830258,9.111457)
q2data.n <- c(9,9,9,9,9,9)
q2data.frame <- data.frame(q2data.mean, q2data.sd, q2data.n)
I am trying to find the mean square residual, so I want to take a look at the ANOVA table.
Any help would be really appreciated! :)
Here you go, using ind.oneway.second from the rpsychi package:
library(rpsychi)
with(q2data.frame, ind.oneway.second(q2data.mean,q2data.sd,q2data.n) )
#$anova.table
# SS df MS F
#Between (A) 2923.5 5 584.70 6.413
#Within 4376.4 48 91.18
#Total 7299.9 53
# etc etc
Update: the rpsychi package was archived in March 2022, but the function is still available here: http://github.com/cran/rpsychi/blob/master/R/ind.oneway.second.R (hat-tip to @jrcalabrese in the comments)
As an unrelated side note, your data could do with some renaming. q2data.frame is a data frame; no need to put that in the name. Also, there is no need to repeat the q2data prefix inside q2data.frame; surely mean would suffice. It just means you end up with verbose code like:
q2data.frame$q2data.mean
when:
q2$mean
would give you all the info you need.
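If you would rather not rely on an archived package, the same table can be reproduced by hand from the summary statistics, using the one-way ANOVA identities SS_between = sum of n_j*(m_j - grand mean)^2 and SS_within = sum of (n_j - 1)*s_j^2. A sketch with the question's numbers:
m <- c(90, 85, 92, 100, 102, 106)
s <- c(9.035613, 11.479667, 9.760268, 7.662572, 9.830258, 9.111457)
n <- c(9, 9, 9, 9, 9, 9)
grand <- sum(n * m) / sum(n)                          #grand mean
ss.between <- sum(n * (m - grand)^2)                  #2923.5
ss.within <- sum((n - 1) * s^2)                       #4376.4
ms.within <- ss.within / (sum(n) - length(m))         #91.18, the mean square residual
F.stat <- (ss.between / (length(m) - 1)) / ms.within  #6.413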

R - Efficiently create dataframe from large raster excluding NA values

Apologies for cross-posting something similar on the GIS stack.
I am looking for a more efficient way to create a frequency table based on a large raster in R.
Currently, I have a few dozen rasters, each with ~150 million cells, and I need to create a frequency table for each. These rasters are derived from masking a base raster with a few hundred small sampling locations*. Therefore the rasters I am creating the tables from contain ~99% NA values.
My current working approach is this:
sampling_site_raster <- raster("FILE")
base_raster <- raster("FILE")
sample_raster <- mask(base_raster, sampling_site_raster)
DF <- as.data.frame(freq(sample_raster, useNA='no', progress='text'))
### run time for the freq() process ###
   user  system elapsed
 162.60    4.85  168.40
This uses the freq() function from R's raster package. The useNA='no' argument drops the NA values.
My questions are:
1) Is there a more efficient way to create a frequency table from a large raster that is 99% NA values?
or
2) Is there a more efficient way to derive the values from the base raster than by using mask()? (Using the Mask GP function in ArcGIS is very fast, but the result still has the NA values and it is an extra step.)
*Additional info: the sample areas represented by sampling_site_raster are irregular shapes of various sizes spread randomly across the study area. In the sampling_site_raster the sampling sites are encoded as 1 and non-sampling areas as NA.
Thank you!
If you mask a raster with another raster, you will always get back another huge raster; I don't think that is the way to make things faster.
What I would do is mask by a polygon layer using extract():
res <- extract(raster, polygons)
Then you will have all the cell values for each polygon and can tabulate them (e.g. with table()).
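A minimal sketch, assuming the sampling sites are available as a polygon layer; sampling_sites_poly is a hypothetical name here, which could come from a shapefile or from rasterToPolygons(sampling_site_raster):
library(raster)
base_raster <- raster("FILE")
#list of cell values, one element per sampling polygon; the huge NA area never enters the result
vals <- extract(base_raster, sampling_sites_poly)
freq_table <- table(unlist(vals))  #frequency table; table() drops NA values by default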
