Time-series average of cross-sectional correlations - r

I have a panel dataset looking like this:
head(panel_data)
date symbol close rv rv_plus rv_minus rskew rkurt Mkt.RF SMB HML
1 1999-11-19 a 25.4 19.3 6.76 12.6 -0.791 4.36 -0.11 0.35 -0.5
2 1999-11-22 a 26.8 10.1 6.44 3.69 0.675 5.38 0.02 0.22 -0.92
3 1999-11-23 a 25.2 8.97 2.56 6.41 -1.04 4.00 -1.29 0.08 0.3
4 1999-11-24 a 25.6 5.81 2.86 2.96 -0.505 5.45 0.87 0.08 -0.89
5 1999-11-26 a 25.6 2.78 1.53 1.25 0.617 5.60 0.23 0.92 -0.2
6 1999-11-29 a 26.1 5.07 2.76 2.30 -0.236 7.27 -0.6 0.570 -0.14
where the variable symbol identifies the different stocks. I want to calculate the time-series average of the cross-sectional correlation between the variables rskew and rkurt. That is, I need to compute the correlation between rskew and rkurt across all stocks at each point in time and then average those correlations over time.
I tried to do it with the rollapply function from the zoo package, but since the number of different stocks is not the same for all dates, I cannot simply define width as an integer. Here is what I tried for a sample width of 20:
panel_data <- panel_data %>%
  group_by(date) %>%
  mutate(cor_skew_kurt = rollapply(data = panel_data[7:8],
                                   width = 20,
                                   FUN = cor,
                                   align = "right",
                                   na.rm = TRUE,
                                   fill = NA)) %>%
  ungroup()
Is there a way to do this without having to define a fixed width for each date group?
Or should I maybe use a different approach to do this?

[Edited] Can you try running the code below? I have recreated an example emulating your issue. If I understood your problem correctly, this code should at least put you on the path to the right solution, as it solves the issue of unequal time-window lengths.
###################
# Recreating an example dataset with unequal dates across stocks
set.seed(1)
date6 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26','1999-11-29')
date5 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26')
date4 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24')
cor_skew_kurt <- c(rep(NaN,21))
symbol <- c(rep('a',6),rep('b',5),rep('c',4),rep('d',6))
rskew <- rnorm(21,mean=1, sd =1)
rkurt <- rnorm(21, mean=5, sd = 1)
panel_data <- cbind.data.frame(date = c(date6,date5,date4,date6), symbol = symbol, rskew = rskew, rkurt = rkurt, cor_skew_kurt = cor_skew_kurt )
panel_data$date <- as.Date(panel_data$date, '%Y-%m-%d')
# Computing cor_skew_kurt and filling the table <- ANSWER TO YOUR QUESTION
for (date in unique(panel_data$date)) {
  panel_data[panel_data$date == date, "cor_skew_kurt"] <-
    as.double(cor(panel_data[panel_data$date == date, "rskew"],
                  panel_data[panel_data$date == date, "rkurt"]))
}
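If you only need the time-series average at the end, a minimal dplyr sketch along the same lines might look like this (assuming your real panel_data keeps the rskew and rkurt columns shown above; cross_cor and ts_avg_cor are just illustrative names):
library(dplyr)
# Cross-sectional correlation per date; dates with too few stocks give NA
# (with a warning) and are dropped by na.rm below
cross_cor <- panel_data %>%
  group_by(date) %>%
  summarise(cor_skew_kurt = cor(rskew, rkurt, use = "complete.obs"))
# Time-series average of the cross-sectional correlations
ts_avg_cor <- mean(cross_cor$cor_skew_kurt, na.rm = TRUE)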

Related

Read functions as text and use for plotting

I have a set of 500 equations listed in a single column of a .csv file. The equations are written as text like this (for example):
15+6.2*A-4.3*B+3.7*C-7.9*B*C+2*D^2
(this is the "right" side of the equation, which equals "y", but the text "y=" does not appear in the .csv file)
These are general linear models that have been written to a .csv file by someone else. Not all models have the same number of variables.
I would like to read these functions into R and format them in a way that will allow for using them to (iteratively) make simple line plots (one for each n = 500 models) of "y" across a range of values for A (shown on the x-axis), given values of B, C, and D.
Does anyone have any suggestions for how to do this?
I thought of something based on this [post][1]; it is not the best solution, but it seems to work.
Equations
Created two equations for an example
models <- c("15+6.2*A-4.3*B+3.7*C-7.9*B*C+2*D^2","50+6.2*A-4.3*B+3.7*C-7.9*B*C+2*D^2")
models_names <- c("model1","model2")
Data
Random data as an example
library(tidyverse)
data <-
  tibble(
    A = rnorm(100),
    B = rnorm(100),
    C = rnorm(100),
    D = rnorm(100)
  )
Function
Then I created a function that turns each text equation into an R function and applies it to the data, returning the computed values:
text_model <- function(formula){
  # build a function f(A, B, C, D) from the equation text, then evaluate it on the data
  eval(parse(text = paste0('f <- function(A, B, C, D) { return(', formula, ') }')))
  out <- f(data$A, data$B, data$C, data$D)
  return(out)
}
Applied equations
Finally, I apply each equation to the data and bind the results to it:
data %>%
  bind_cols(
    map(.x = models, .f = text_model) %>%
      set_names(models_names) %>%
      bind_cols()
  )
# A tibble: 100 x 6
A B C D model1 model2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.0633 1.18 -0.409 2.01 9.52 54.9
2 -0.00207 1.35 1.28 1.59 9.16 40.3
3 0.798 -0.141 1.58 -0.123 20.6 63.2
4 -0.162 -0.0795 0.408 0.663 14.3 52.0
5 -1.11 0.788 -1.37 1.20 4.71 46.0
6 2.80 1.84 -0.850 0.161 24.4 68.7
7 1.03 0.550 0.907 -1.92 19.0 60.8
8 0.515 -0.179 -0.980 0.0437 19.0 48.9
9 -0.353 0.0643 1.39 1.30 12.5 55.3
10 -0.427 -1.01 -1.11 -0.547 16.7 39.3
# ... with 90 more rows
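To get the plots the question asks for (y over a range of A with B, C and D held fixed), a small follow-up sketch along the same lines could be (plot_grid and curves are my own names, and the fixed values of B, C and D are arbitrary):
library(tidyverse)
# Evaluate each text equation over a grid of A values, holding B, C and D constant
plot_grid <- tibble(A = seq(-3, 3, length.out = 100), B = 0, C = 0, D = 0)
curves <- map(models, function(formula) {
  eval(parse(text = paste0('f <- function(A, B, C, D) { return(', formula, ') }')))
  tibble(A = plot_grid$A, y = f(plot_grid$A, plot_grid$B, plot_grid$C, plot_grid$D))
}) %>%
  set_names(models_names) %>%
  bind_rows(.id = "model")
# One line per model, y against A
ggplot(curves, aes(x = A, y = y, colour = model)) +
  geom_line()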

Ramp up/down missing time-series data in R

I have a set of time-series data (GPS speed data, specifically) which includes gaps of missing values where the signal was lost. For missing periods of short duration I plan to fill the gaps simply using na.spline, but this is inappropriate for longer periods. For those I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.
#create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"
#set rate of acceleration for ramp
ramp <- 6
#set sampling rate of receiver
Hz <- 1/10
So for missing data the ramp would use the previous value and the rate of acceleration to get the next data point, until the speed reaches zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:
3.59
2.99
2.39
1.79
1.19
0.59
0
Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again.
Hope this is clear.
Cheers
It's not really elegant, but you can do it in a loop.
na.pos <- which(is.na(test$speed))
acc <- FALSE
for (i in na.pos) {
  if (acc) {
    speed <- test$speed[i-1] + (Hz * ramp)
  } else {
    speed <- test$speed[i-1] - (Hz * ramp)
    if (round(speed, 1) < 0) {
      acc <- TRUE
      speed <- test$speed[i-1] + (Hz * ramp)
    }
  }
  test[i, ] <- speed
}
The result is:
speed
1 6.00
2 5.70
3 5.40
4 5.14
5 4.89
6 4.64
7 4.41
8 4.19
9 3.59
10 2.99
11 2.39
12 1.79
13 1.19
14 0.59
15 -0.01
16 0.59
17 1.19
18 1.79
19 2.39
20 2.99
21 3.59
22 4.19
23 4.79
24 5.00
25 5.10
26 5.30
27 5.40
28 5.50
Note the '-0.01': 0.59 - (6 * 1/10) is -0.01, not 0. You can round it later; I decided not to.
When the question says "ramp the values from the last true value down to zero" in each run of NAs, I assume that any remaining NAs in the run after reaching zero are also to be replaced by zero.
Now, use rleid from data.table to create a grouping vector the same length as test$speed identifying each run in is.na(test$speed), and use ave to create sequence numbers within those groups, seqno. Then calculate the declining sequence, ramp_down, by combining na.locf(test$speed) and seqno. Finally, replace the NAs.
library(data.table)
library(zoo)
test_speed <- test$speed
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
result <- ifelse(is.na(test_speed), ramp_down, test_speed)
giving:
> result
[1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50
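The question also asks for the reverse ramp up from zero when the signal returns, which the code above does not handle. One possible extension along the same lines (my own untested sketch, not part of the answer; seqno_up, ramp_up and result2 are illustrative names) is to mirror the ramp down from the next known value backwards through each NA run and keep the larger of the two ramps:
# Sequence numbers counting backwards within each run of NAs
seqno_up <- ave(test_speed, rleid(is.na(test_speed)),
                FUN = function(x) rev(seq_along(x)))
# Ramp down backwards from the next observed speed, floored at zero
ramp_up <- pmax(na.locf(test_speed, fromLast = TRUE) - seqno_up * ramp * Hz, 0)
# Within each NA run, take whichever ramp is higher; keep observed values elsewhere
result2 <- ifelse(is.na(test_speed), pmax(ramp_down, ramp_up), test_speed)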

Biomass model with density

I am trying to develop a common biomass model for several species using wood density.
Here is my data set:
Species_Name DBH_cm Wood_Density Leaf_Biomass_kg
Aam 10.9 0.55 4.495175666
Aam 8.3 0.55 3.003987585
Aam 18.3 0.55 7.0453234
Akashmoni 26.6 0.68 8.68327883
Akashmoni 18 0.68 5.514198965
Akashmoni 20.6 0.68 7.140993296
Amloki 13.7 0.64 0.418757191
Amloki 14.6 0.64 0.348964326
Amra 19 0.29 0
Arjun 13.3 0.82 0
Bajna 13 0.70 0
Bel 19.6 0.83 0.458638794
Sal 14.40 0.82 0.996750392
Sal 12.20 0.82 0.644956136
Sal 10.00 0.82 0.947928706
Sal 14.20 0.82 0.767434214
Sal 11.50 0.82 0.636970398
Sal 13.20 0.82 0.445111844
Sal 13.30 0.82 0.706039477
Sal 10.70 0.82 0.475809213
I tried to set the missing values (recorded as 0) to NA using:
tree[which(tree$Leaf_Biomass_kg == 0),]$Leaf_Biomass_kg <- NA
My model code is:
library(nlme)
start <- coef(lm(log(Leaf_Biomass_kg) ~ log(DBH_cm) + log(Wood_Density), data = tree))
start[1] <- exp(start[1])
names(start) <- c("a", "b1", "b2")
m <- nlme(Leaf_Biomass_kg ~ a * DBH_cm^b1 * Wood_Density^b2,
          data = cbind(tree, g = "a"),
          fixed = a + b1 + b2 ~ 1,
          start = start,
          groups = ~g,
          weights = varPower(form = ~DBH_cm))
It gives:
Error in finiteDiffGrad(model, data, pars) :
NAs are not allowed in subscripted assignments
Can anyone help me in this regard?
I added na.action = na.exclude to my model, but the problem still exists:
m <- nlme(Leaf_Biomass_kg ~ a * DBH_cm^b1 * Wood_Density^b2,
          data = cbind(tree, g = "a"),
          fixed = a + b1 + b2 ~ 1,
          start = start,
          groups = ~g,
          weights = varPower(form = ~DBH_cm),
          na.action = na.exclude)
You have missing values (NA) in your data set. Try setting na.action = na.exclude, or remove the records with NA values. You may also want to impute the missing values. This question has been addressed here:
https://stats.stackexchange.com/questions/11000/how-does-r-handle-missing-values-in-lm
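A minimal sketch of the "remove the records" route, reusing the model call from the question (tree_complete is just an illustrative name, and convergence is not guaranteed on so few rows):
library(nlme)
# Drop the rows whose Leaf_Biomass_kg was set to NA
tree_complete <- tree[!is.na(tree$Leaf_Biomass_kg), ]
start <- coef(lm(log(Leaf_Biomass_kg) ~ log(DBH_cm) + log(Wood_Density),
                 data = tree_complete))
start[1] <- exp(start[1])
names(start) <- c("a", "b1", "b2")
m <- nlme(Leaf_Biomass_kg ~ a * DBH_cm^b1 * Wood_Density^b2,
          data = cbind(tree_complete, g = "a"),
          fixed = a + b1 + b2 ~ 1,
          start = start,
          groups = ~g,
          weights = varPower(form = ~DBH_cm))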

Trying to run many anovas and get an F value for each row

I'm working with a dataset that looks like what is shown below. I know that this is not the type of format that R likes. I know how to tidy up the data, but then I'm not sure what I'd do to obtain an F statistic for each unique_id, which is my goal. Is there an easy way to do that? Otherwise, is there a way I could use some type of apply function to tidy up each row independently, perform the ANOVA, and then add the F statistics as a new column?
unique_id heart heart heart kidney kidney kidney cortex cortex cortex
373020.8 1.39 1.18 1.30 2.71 2.96 2.52 1.97 1.67 1.44
371588.9 1.93 2.35 2.50 2.54 1.63 2.23 2.68 2.89 1.86
367772.8 0.42 0.51 0.97 1.02 0.03 0.82 0.01 0.90 1.01
I'm partial to data.tables and you can easily do this after melting your data. Here's my take, with dt as your data.table containing the info you provided.
library(data.table)
DT <- melt(dt,
           id.vars = c("unique_id"),
           measure.vars = c("heart", "cortex", "kidney"))
DT[, fstat := summary(aov(value ~ variable))[[1]][1, "F value"], by = unique_id]
This calculates the F-Stat by unique_id and should work.
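If you prefer the tidyverse, a rough equivalent (my own sketch; df stands for your wide data frame, and it assumes the duplicate column names were made unique as heart, heart.1, heart.2, ... when the file was read) could be:
library(tidyverse)
fstats <- df %>%
  pivot_longer(-unique_id, names_to = "tissue", values_to = "value") %>%
  mutate(tissue = str_remove(tissue, "\\.\\d+$")) %>%  # collapse heart.1, heart.2, ... back to heart
  group_by(unique_id) %>%
  group_modify(~ tibble(fstat = summary(aov(value ~ tissue, data = .x))[[1]][1, "F value"]))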

How to add shaded confidence intervals to line plot with specified values

I have a small table of summary data with the odds ratio and upper and lower confidence limits for four categories, with six levels within each category. I'd like to produce a chart using ggplot2 that looks similar to the usual one created when you specify an lm and its se, but I'd like R to just use the pre-specified values I have in my table. I've managed to create the line graph with error bars, but these overlap and make it unclear. The data look like this:
interval OR Drug lower upper
14 0.004 a 0.002 0.205
30 0.022 a 0.001 0.101
60 0.13 a 0.061 0.23
90 0.22 a 0.14 0.34
180 0.25 a 0.17 0.35
365 0.31 a 0.23 0.41
14 0.84 b 0.59 1.19
30 0.85 b 0.66 1.084
60 0.94 b 0.75 1.17
90 0.83 b 0.68 1.01
180 1.28 b 1.09 1.51
365 1.58 b 1.38 1.82
14 1.9 c 0.9 4.27
30 2.91 c 1.47 6.29
60 2.57 c 1.52 4.55
90 2.05 c 1.31 3.27
180 2.422 c 1.596 3.769
365 2.83 c 1.93 4.26
14 0.29 d 0.04 1.18
30 0.09 d 0.01 0.29
60 0.39 d 0.17 0.82
90 0.39 d 0.2 0.7
180 0.37 d 0.22 0.59
365 0.34 d 0.21 0.53
I have tried this:
limits <- aes(ymax = upper, ymin = lower)
dodge <- position_dodge(width = 0.9)
ggplot(data, aes(y = OR, x = interval, colour = Drug)) +
  geom_line(stat = "identity") +
  geom_errorbar(limits, position = dodge)
and searched for a suitable answer to create a pretty plot, but I'm flummoxed!
Any help greatly appreciated!
You need the following lines:
p <- ggplot(data = data, aes(x = interval, y = OR, colour = Drug)) +
  geom_point() +
  geom_line()
p <- p + geom_ribbon(aes(ymin = lower, ymax = upper), linetype = 2, alpha = 0.1)
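If you also want each drug's ribbon filled in its own colour, a small variation of the above (my own suggestion rather than part of the original answer) is to map fill as well and drop the ribbon outline:
ggplot(data, aes(x = interval, y = OR, colour = Drug, fill = Drug)) +
  geom_point() +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), colour = NA, alpha = 0.1)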
Here is a base R approach using polygon(), since @jmb requested a solution in the comments. Note that I have to define two sets of x-values and associated y-values for the polygon to plot; it works by plotting the outer perimeter of the polygon. I define plot type = 'n' and use points() separately to get the points on top of the polygon. My personal preference is the ggplot solution above when possible, since polygon() is pretty clunky.
library(tidyverse)
data('mtcars')  # built-in dataset
mean.mpg <- mtcars %>%
  group_by(cyl) %>%
  summarise(N = n(),
            avg.mpg = mean(mpg),
            SE.low = avg.mpg - (sd(mpg) / sqrt(N)),
            SE.high = avg.mpg + (sd(mpg) / sqrt(N)))
plot(avg.mpg ~ cyl, data = mean.mpg, ylim = c(10, 30), type = 'n')
# note I have defined c(x1, x2) and c(y1, y2)
polygon(c(mean.mpg$cyl, rev(mean.mpg$cyl)),
        c(mean.mpg$SE.low, rev(mean.mpg$SE.high)),
        density = 200, col = 'grey90')
points(avg.mpg ~ cyl, data = mean.mpg, pch = 19, col = 'firebrick')
