Rolling standard deviation for multiple firms with different time periods - r

I have a dataset of monthly stock returns for approximately 100 firms. The firms cover different time periods, corresponding to when they went on and off the stock exchange.
I have ordered my dataset by Company, Year and Month, and I want the standard deviation to account for this, so that for each firm it starts after 24 months and ends at that firm's last observation.
This means that the command has to be able to tell the difference between firms, so that the window doesn't transfer over to the next firm.
Year, Month, Company, Return
1990, 1, Company 1, -0,005
1990, 2, Company 1, 0,003
etc...
1990, 1, Company 2, ...
1990, 2, Company 2, ...
etc...
2017, 6, Company 50, ...
I have been trying this code, but it just keeps going when the next row contains a new firm, i.e. it computes a rolling standard deviation over the whole dataset:
rolling_sd <- rollapply(Dataset$RETURN, width = 24,
                        FUN = sd, fill = NA, align = "right")
It also does not align with the right date. With no align argument the first standard deviation should appear 24 rows down; with "right" it moves 12 rows down, but it is still not properly aligned.
How can I make it take the company name into account?

If you omit the align = "right" argument, the sd values are centered, as discussed in the question; but since the code shown does use right alignment, the sd values start in row 24. I suspect you are confusing runs made with and without the align = argument.
Using the data shown in the Note at the end, and changing 24 to 3 to demonstrate on this smaller dataset, we use ave to apply the rolling sd to each company separately. The r at the end of rollapplyr is a shorter way of specifying align = "right". With right alignment, the sd shown in row i is the sd of the width rows ending in row i, i.e. rows i-width+1 to i inclusive.
library(zoo)
# rolling sd over the last `width` rows, with leading NAs for short windows;
# ave() applies roll() to RETURN separately within each Company
roll <- function(x) rollapplyr(x, width = 3, FUN = sd, fill = NA)
transform(Dataset, sd = ave(RETURN, Company, FUN = roll))
giving:
Year Month Company RETURN sd
1 1 1 A -0.042484496 NA
2 1 2 A 0.057661027 NA
3 1 3 A -0.018204616 0.05224021
4 1 4 A 0.076603481 0.05017135
5 2 1 A 0.088093457 0.05833792
6 2 2 A -0.090888700 0.10018338
7 2 3 A 0.005621098 0.08958278
8 2 4 A 0.078483809 0.08496093
9 1 1 B -0.042484496 NA
10 1 2 B 0.057661027 NA
11 1 3 B -0.018204616 0.05224021
12 1 4 B 0.076603481 0.05017135
13 2 1 B 0.088093457 0.05833792
14 2 2 B -0.090888700 0.10018338
15 2 3 B 0.005621098 0.08958278
16 2 4 B 0.078483809 0.08496093
Note
Some data in reproducible form
set.seed(123)
tmp <- data.frame(Year = c(1, 1, 1, 1, 2, 2, 2, 2), Month = 1:4, Company = "A",
                  RETURN = runif(8, -.1, .1))
Dataset <- rbind(tmp, transform(tmp, Company = "B"))
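For the real dataset the same pattern applies with the width set back to 24; a sketch, assuming your data frame has the RETURN and Company columns shown in the question:
library(zoo)
# 24-month rolling sd computed within each firm: the first 23 rows of each
# firm are NA and the window never carries over into the next firm
roll24 <- function(x) rollapplyr(x, width = 24, FUN = sd, fill = NA)
Dataset$rolling_sd <- ave(Dataset$RETURN, Dataset$Company, FUN = roll24)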

Related

Boxplots from frequency columns in ggplot2

I have a dataframe such as the example below, which describes the number of students achieving specific scores (25-100) in each class (a, b, c).
df
# score class_a class_b class_c
# 1 25 0 10 5
# 2 50 5 3 7
# 3 75 2 2 2
# 4 100 0 6 4
I would like to create a box plot with class on the x-axis and score on the y-axis, in order to show the range of scores for each class.
But, I am really not sure how to do this with summarized data such as this. I have tried:
library(reshape2)
df1 <- melt(df, id.vars='score')
But I am not sure this is the right direction.
Data
df <- data.frame(score=c(25, 50, 75, 100), class_a=c(0, 5, 2, 0),
class_b=c(10, 3, 2, 6), class_c=c(5, 7, 2, 4))
You may repeat the scores according to the frequencies in each class and boxplot the list.
Map(rep.int, df[1], df[-1]) |> boxplot()
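If you specifically want this in ggplot2, as the title suggests, one sketch continues from the melt() you already tried: repeat each score by its class frequency, then box-plot by class (variable and value are reshape2's default column names):
library(reshape2)
library(ggplot2)
df1 <- melt(df, id.vars = "score")                 # columns: score, variable, value
df1 <- df1[rep(seq_len(nrow(df1)), df1$value), ]   # one row per student
ggplot(df1, aes(x = variable, y = score)) +
  geom_boxplot()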

Compute conditionally across rows in data.table in R

I have a data.table with three relevant columns: id, timepoint and metric (actual size is much larger).
I am trying to calculate the percent change between the metric values at timepoints A and D and use it to create a label (Good metric, Half-decent metric, Subpar metric).
The situation becomes more complicated because if the metric at timepoint D is less than or equal to 2, then the new column should report "Super metric!". If not, then the percent difference should be calculated and, based on the percent change, the id's classified as "Subpar metric" (< 30%), "Half-decent metric" (30-50%), or "Good metric" (> 50%).
If there is an NA value at timepoint A or D, then returning NA is okay. If timepoint A or D is missing entirely, also return NA.
My initial thought was that I could calculate this in data.table without creating unnecessary columns, but I haven't even managed the simpler approach where I do the calculations separately and then join them later.
# Example data
library(data.table)
dat <- data.table(
  id = c(1,1,1,1,2,2,3,3,3,3,4,4,4,6,6,10,10,10,11,11,12,12,14,14),
  timepoint = c("A","B","C","D","A","D","A","B","C","D","A","B","C","A","D",
                "A","B","D","A","D","A","D","A","D"),
  metric = c(NA, 3, 3, 4, 4, 2, 3, 3, 2, 1, 4, 3, NA, NA, 4, 1, 5, 2, 5, 3, 5, 5, 6, 3)
)
Partial solution: first identify the "Super metric" id's. I would like this to class all instances of a "Super metric" id as such (right now it returns "Super metric" only for timepoint D).
# Inefficient solution
# Step 1: Identify id's that need to be computed
dat1 <- dat[, Metric_score := fifelse(metric <= 2 & timepoint == "D",
                                      "Super metric", "Calc PC")]
# id timepoint metric Metric_score
# 1: 1 A NA Calc PC
# 2: 1 B 3 Calc PC
# 3: 1 C 3 Calc PC
# 4: 1 D 4 Calc PC
# 5: 2 A 4 Calc PC # Should be Super metric
# 6: 2 D 2 Super metric
Performing the calculation:
This calculates the percent change for all ID's, regardless of whether or not it needs to be calculated
# Step 2: Calculate percent change between timepoints D and A
# (if timepoint A or D is missing for an id, return NA rather than erroring)
dat[, col := if (all(c("A", "D") %in% timepoint))
               (metric[timepoint == "A"] - metric[timepoint == "D"]) /
               metric[timepoint == "A"] * 100 else NA_real_,
    by = id]
Desired output: class each id as "Super metric" when its final score (timepoint D) is <= 2; otherwise, calculate the percent change, ((metric at D - metric at A) / metric at A) * 100, and classify based on the result: "Subpar metric" (< 30%), "Half-decent metric" (30-50%), "Good metric" (> 50%).
id  timepoint  metric  metric_class
 1      A         NA   NA
 1      B          3   NA
 1      C          3   NA
 1      D          4   NA
 2      A          4   Super metric
 2      D          2   Super metric
 3      A          3   Super metric
 3      B          3   Super metric
 3      C          2   Super metric
 3      D          1   Super metric
 4      A          4   NA
 4      B          3   NA
 4      C         NA   NA
 6      A         NA   NA
 6      D          4   NA
10      A          1   Super metric
10      B          5   Super metric
10      D          2   Super metric
11      A          5   Half-decent metric
11      D          3   Half-decent metric
12      A          5   Subpar metric
12      D          5   Subpar metric
14      A          6   Good metric
14      D          3   Good metric
Using fcase should give you the desired result.
Note that 0.5 satisfies both the between(0.3, 0.5) and >= 0.5 conditions; fcase takes the first matching case in the list, which here is "Good metric". If you want that changed, simply reorder the conditions.
# reshape to one row per id, with columns A-D holding the metric values
metrics <- dcast.data.table(dat, id ~ timepoint, value.var = "metric")
metrics[, metric_class := fcase(D <= 2, "Super metric",
                                abs(D - A) / A < 0.3, "Subpar metric",
                                abs(D - A) / A >= 0.5, "Good metric",
                                between(abs(D - A) / A, 0.3, 0.5), "Half-decent metric")]
# attach the class to every row of the original data
dat <- merge(dat, metrics[, .(id, metric_class)], by = "id")
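For reference, the wide table produced by the dcast step (before metric_class is added) looks like this; id's missing a timepoint get NA there:
metrics[, .(id, A, B, C, D)]
#    id  A  B  C  D
# 1:  1 NA  3  3  4
# 2:  2  4 NA NA  2
# 3:  3  3  3  2  1
# 4:  4  4  3 NA NA
# 5:  6 NA NA NA  4
# 6: 10  1  5 NA  2
# 7: 11  5 NA NA  3
# 8: 12  5 NA NA  5
# 9: 14  6 NA NA  3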
Here is another approach that doesn't require dcast.
metric_class <- function(t, m) {
  # an id whose timepoint D metric is <= 2 is a "Super metric" throughout
  if ("D" %in% t && m[t == "D"] <= 2) return(rep("Super metric", length(t)))
  # indexing by a missing name yields NA, so absent timepoints fall through to NA
  mvals <- c("a" = m[t == "A"], "d" = m[t == "D"])
  val <- abs((mvals["d"] - mvals["a"]) / mvals["a"])
  rep(fcase(val < 0.3, "Subpar metric",
            val >= 0.5, "Good metric",
            val >= 0.3 & val < 0.5, "Half-decent metric"),
      length(t))
}
setDT(dat)[, metric_class := metric_class(timepoint, metric), by = id][]

Assigning values in a column to deciles when breaks are not unique

Assume that I have a vector with 1000 numbers in it. I want to obtain the deciles of this vector and then find the mean of each decile. However, there are 215+ zeros in the vector, which means the first and second breaks are both zero, so cut() fails with the error 'breaks' are not unique. What I want is to assign 100 zeros to the first decile, another 100 to the second decile, and the last 15 zeros to the third decile, such that the means of the first and second deciles will be zero. Here is a smaller reproducible example with the same problem:
v <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 6, 3, 7)
cut_q10 <- quantile(v, probs = seq(0, 1, 0.1))
v_q10 <- cut(v, breaks = cut_q10, labels = FALSE)
#Error in cut.default(v, breaks = cut_q10, labels = FALSE) :
# 'breaks' are not unique
What I would like to obtain is:
v_q10 = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,10,9,10)
or
v_q10 = c(2,2,1,1,3,4,4,3,5,5,6,6,7,7,8,8,9,10,9,10)
etc...
All of them are acceptable as long as there are two 0's in the first decile, two 0's in the second, two 1's in the third, two 1's in the fourth, etc., such that regardless of which v_q10 is obtained, when I find the means of each decile I get this:
library(dplyr)
merged <- as.data.frame(cbind(v, v_q10))
merged <- merged %>% group_by(v_q10) %>% summarise(means = mean(v))
v_q10 means
# <dbl> <dbl>
# 1 1 0
# 2 2 0
# 3 3 1
# 4 4 1
# 5 5 1
# 6 6 2
# 7 7 2
# 8 8 3
# 9 9 4
#10 10 6.5
I know that it is possible to achieve this by writing longer code, but I was wondering if there is a function, or a few lines of code, that can achieve it.
Thanks in advance.
Try this:
cut(rank(v, ties.method = "first"), 10, labels = FALSE)
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 10 9 10
Alternatives include using ties.method = "last", ties.method = "random", or order(order(v)) in place of rank(...).
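A quick check that the decile means come out as requested, using tapply on the result above:
v_q10 <- cut(rank(v, ties.method = "first"), 10, labels = FALSE)
tapply(v, v_q10, mean)
##   1   2   3   4   5   6   7   8   9  10
## 0.0 0.0 1.0 1.0 1.0 2.0 2.0 3.0 4.0 6.5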

r - Subsetting time-series data.frame based on time and threshold

How would you subset a time-series data.frame based on time and a threshold value?
I have this data:
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
df <- data.frame(year, v1)
That looks like this:
> df
year v1
1 2000 9
2 2001 4
3 2002 5
4 2003 4
5 2004 5
6 2005 3
7 2006 3
8 2007 3
9 2008 9
10 2009 6
I want to subset the data by groups of sequential years for which the summed score on v1 exceeds the value of 10.
On this example data, the first subset should hold the observations of the years 2000 and 2001, and the second subset the observations of 2002, 2003 and 2004.
The real data has about 8 million observations covering 120 years.
You can implement a customized cumsum using the Reduce function, resetting the sum when the total exceeds 10 and at the same time incrementing a counter that serves as the group variable:
library(data.table)
library(data.table)
# the accumulator holds two values: the running sum and a group counter;
# when the sum exceeds 10, start a new sum and increment the group counter
transpose(Reduce(function(x, y) if (x[1] > 10) c(y, x[2] + 1) else c(x[1] + y, x[2]),
                 df$v1, init = c(0, 1), accumulate = TRUE))[[2]][-1]
# [1] 1 1 2 2 2 3 3 3 3 4
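To recover the actual subsets rather than just the group labels, you can split df on the result (a small usage sketch):
grp <- transpose(Reduce(function(x, y) if (x[1] > 10) c(y, x[2] + 1) else c(x[1] + y, x[2]),
                        df$v1, init = c(0, 1), accumulate = TRUE))[[2]][-1]
split(df, grp)  # a list of data.frames, one per run of sequential years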
It takes around 20 seconds to run on a vector of 10 million observations:
v = sample(1:10, 10000000, replace = T)
system.time(transpose(Reduce(function(x, y) if(x[1] > 10) c(y, x[2]+1) else c(x[1] + y, x[2]), init = c(0, 1), v, accumulate = T))[[2]])
# user system elapsed
# 19.509 0.552 20.081

Count consecutive occurrences of a specific value in every row of a data frame in R

I've got a data.frame of monthly values of a variable for many locations (so many rows) and I want to count the numbers of consecutive months (i.e consecutive cells) that have a value of zero. This would be easy if it was just being read left to right, but the added complication is that the end of the year is consecutive to the start of the year.
For example, in the shortened example dataset below (with seasons instead of months), location 1 has 3 consecutive '0' values, location 2 has 2 (wrapping around the year), and location 3 has none.
df <- cbind(location = c(1, 2, 3),
            Winter = c(0, 0, 3),
            Spring = c(0, 2, 4),
            Summer = c(0, 2, 7),
            Autumn = c(3, 0, 4))
How can I count these consecutive zero values? I've looked at rle but I'm still none the wiser currently!
Many thanks for any help :)
You've identified the two cases that the longest run can take: (1) somewhere in the middle, or (2) split between the end and the beginning of each row. Hence you want to calculate each case and take the max, like so:
df <- cbind(Winter = c(0, 0, 3),
            Spring = c(0, 2, 4),
            Summer = c(0, 2, 7),
            Autumn = c(3, 0, 4))
#>      Winter Spring Summer Autumn
#> [1,]      0      0      0      3
#> [2,]      0      2      2      0
#> [3,]      3      4      7      4
# calculate the number of consecutive zeros at the start and end of each row
startZeros <- apply(df, 1, function(x) which.min(x == 0) - 1)
#> [1] 3 1 0
endZeros <- apply(df, 1, function(x) which.min(rev(x == 0)) - 1)
#> [1] 0 1 0
# calculate the longest run of zeros within each row
longestRun <- apply(df, 1, function(x) {
  y <- rle(x)
  max(y$lengths[y$values == 0], 0)
})
#> [1] 3 1 0
# take the max of the two cases
pmax(longestRun, startZeros + endZeros)
#> [1] 3 2 0
Of course an even easier solution is:
longestRun <- apply(cbind(df, df),  # wrap the zeros from the end of the row to the start
                    1,              # the margin over which to apply the summary function
                    function(x) {   # the summary function
                      y <- rle(x)
                      max(y$lengths[y$values == 0],
                          0)        # include zero in case there are no zeros in y$values
                    })
Note that the above solution works because my df does not include the location field (column).
Try this:
df <- data.frame(location = c(1, 2, 3),
                 Winter = c(0, 0, 3),
                 Spring = c(0, 2, 4),
                 Summer = c(0, 2, 7),
                 Autumn = c(3, 0, 4))
maxcumzero <- function(x) {
  l <- x == 0
  # cumsum(l) counts zeros seen so far; subtracting its value at the most
  # recent non-zero position leaves the length of the current run of zeros
  max(cumsum(l) - cummax(cumsum(l) * !l))
}
df$N.Consec <- apply(cbind(df[, -1], df[, -1]), 1, maxcumzero)
df
# location Winter Spring Summer Autumn N.Consec
# 1 1 0 0 0 3 3
# 2 2 0 2 2 0 2
# 3 3 3 4 7 4 0
This adds a column specifying the maximum number of consecutive zeros in each row. The data frame is column-bound to itself so that runs of zeroes wrapping from Autumn back into Winter are detected.
The method used here is based on that of Martin Morgan in his answer to this similar question.
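To see why maxcumzero works, here is the trick traced on a single short row (hypothetical values):
x <- c(0, 2, 2, 0)
l <- x == 0
cumsum(l)                                # 1 1 1 2 : zeros seen so far
cummax(cumsum(l) * !l)                   # 0 1 1 1 : zero count at the last non-zero
cumsum(l) - cummax(cumsum(l) * !l)       # 1 0 0 1 : length of the current run of zeros
max(cumsum(l) - cummax(cumsum(l) * !l))  # 1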
