I have a daily database of returns from a portfolio. For a model I am replicating, the authors calculate, for each month, the realized variance RVt from the daily returns of the previous 21 sessions.
To do this, here is a small example of how I am trying to calculate it:
x <- rnorm(24795, 0, 0.2) #Generate random numbers to simulate my sample
x_2 <- x^2 # the model specifies working with squared returns
# I need the monthly sum of the squared returns. For this I create a matrix
# with length(x)/20 rows, because each month consists of 20 trading sessions
rv <- matrix(NA, nrow=(length(x_2)/20), ncol=1)
#I create the first step
rv[1] <- sum(x_2[1:20])
# I create a loop to sum x_2[21:40] and continue
# with these 20-observation sums
for (i in 2:1239){
rv[i] <- sum(x_2[i+20:i+39])
}
rv
The problem is that my loop is summing as:
x_2[21:40]
x_2[22:41]
x_2[23:42]
instead of
x_2[21:40]
x_2[41:60]
x_2[61:80]
Does anyone know what I am doing wrong?
Here is a picture of the formula from the paper:
[Formula image: RVt is the sum of squared daily returns over the window]
Thanks
Miguel
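For what it's worth, the immediate bug is operator precedence: x_2[i+20:i+39] is parsed as x_2[i + (20:i) + 39], not x_2[(i+20):(i+39)], and even the fully parenthesized version would slide the window by 1 rather than by 20. A minimal sketch of a corrected loop (it regroups from the start, so rv[1] no longer needs its own line):
n_months <- length(x_2) %/% 20
rv <- matrix(NA_real_, nrow = n_months, ncol = 1)
for (i in 1:n_months) {
  # group i covers observations ((i-1)*20 + 1) through (i*20), i.e. 1:20, 21:40, ...
  rv[i] <- sum(x_2[((i - 1) * 20 + 1):(i * 20)])
}
head(rv)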
We could use seq to build the group start and end indices:
i1 <- seq(21, length(x_2), by = 20)
i1 <- i1[-length(i1)]
i2 <- c(i1[-1] - 1, length(x_2))
head(i1)
#[1] 21 41 61 81 101 121
head(i2)
#[1] 40 60 80 100 120 140
rv[-1] <- unlist(Map(function(i, j) sum(x_2[i:j]), i1, i2))
Output:
> head(rv)
[,1]
[1,] 1.0533125
[2,] 1.0914327
[3,] 0.7530577
[4,] 1.0559202
[5,] 0.6579956
[6,] 0.9139404
> tail(rv)
[,1]
[1234,] 0.7115833
[1235,] 0.6104712
[1236,] 0.6161004
[1237,] 0.7440868
[1238,] 0.7284476
[1239,] 1.8718138
We can use tapply, dividing every 20 numbers into a group:
result <- tapply(x_2, ceiling(seq_along(x_2)/20), sum)
Verify the result:
head(result)
# 1 2 3 4 5 6
#1.0872762 0.4487953 1.1764887 0.8852306 0.8394201 1.0295633
sum(x_2[1:20])
#[1] 1.087276
sum(x_2[21:40])
#[1] 0.4487953
sum(x_2[41:60])
#[1] 1.176489
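If the tapply result is wanted back in the original one-column-matrix form, a one-liner along these lines should do (just a small addition, using result from above):
rv <- matrix(unname(result), ncol = 1)
head(rv)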
Context
I asked this question recently:
Comparing partitions from split() using a nested for loop containing an if statement
where I needed to compare partitions generated by split() from a distance matrix, using the code fix provided by @robertdj:
set.seed(1234) # set random seed for reproducibility
# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix
splt <- split(d, 1:5) # split data with 5 values in each partition
for (i in 1:length(splt)) {
for (j in 1:length(splt)) {
if (i != j) {
a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
}
}
}
I generated a MWE where each split contained the same number of elements. I did this just for illustrative purposes, fully knowing that this would not necessarily hold for real data.
As per @Robert Hacken's comment, if I instead do
d <- na.omit(d[lower.tri(d)])
I get partitions of unequal length.
Real Data
However, my real data does not have the "same size" property, and it contains many more partitions than the 5 in my MWE.
Here is my code
splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix)))
The distance matrix dist_matrix contains FASTA headers from which I extract the species names.
I then use splt above in the doubly nested loop.
For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.
splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667
splt[[10]]
[1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000
Expected Output
For my real problem, each partition corresponds to distances for a single species to all other unique species. So, if Species X has two DNA sequences representing it and there are 10 species in total, the partition for Species X should contain 20 distances. However, I don't want the partition to include the distance between the two sequences for Species X.
splt would thus contain 10 partitions (each not necessarily of the same length) for all species
The expected output of a and b is a number between 0 and 1 inclusive. I think these numbers should be small in my real example, but they are large when I try to run my code, which I think is a consequence of the warning.
What I've Done
I've read on SO that %in% is typically used to resolve the warning
In splt[[i]] == splt[[j]] :
longer object length is not a multiple of shorter object length
except in my case, I believe I would need `%notin%` <- Negate(`%in%`).
However, %notin% gives the error in my original post
the condition has length > 1
Question
How can my nested loop be altered to remove the warning?
I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark a conversation about which of my interpretations are correct (and which are incorrect).
Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,
set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)
We have:
splt <- list("1" = c(1.48449499149608, 2.62312694474001), "2" = c(2.29150692606848, 0.15169544670039), "3" = c(1.13863195324393, 3.43013887931241), "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693), "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388, -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057, -0.709440037512506, -0.501258060594761))
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389 0.06445882
# $`5`
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean "what is the ratio of splt[[i]] that is above the max value in splt[[j]]?". Since we're comparing (for example) splt[[1]] with all of splt[[2]] through splt[[5]] here, and likewise for the others, we're going to have a square matrix where the diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).
Some quick math so we know what we should end up with:
splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069
Since one of the two values in [[1]] is greater than [[2]]'s max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below 0.15, so we expect a 0 there.
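A quick check of that expectation, in the same style as the [[5]]-versus-[[4]] check below:
sum(splt[[1]] >= max(splt[[2]])) / length(splt[[1]])
# [1] 0.5
sum(splt[[1]] <= min(splt[[2]])) / length(splt[[1]])
# [1] 0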
Similarly, [[5]] over [[4]]:
splt[[5]]
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644 0.06445882
### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2
### 2 of 10 are less than the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2
We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anonymous function passed to it. I'll adapt your double-for-loop premise into nested sapply calls.
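(For reference, a rough sketch of what that outer/Vectorize route could look like; note the orientation is flipped relative to the sapply output below, with rows indexing the split being measured and columns the split supplying the max.)
gt_max <- Vectorize(function(i, j) sum(splt[[i]] >= max(splt[[j]])) / length(splt[[i]]))
outer(seq_along(splt), seq_along(splt), gt_max)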
Greater than the other's max
sapply(splt, function(y) sapply(setNames(splt, paste0("max", seq_along(splt))), function(z) sum(y >= max(z)) / length(y)))
# 1 2 3 4 5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1
Interpretation and subset validation:
1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).
5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.
Less than the other's min
sapply(splt, function(y) sapply(setNames(splt, paste0("min", seq_along(splt))), function(z) sum(y <= min(z)) / length(y)))
# 1 2 3 4 5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1
Same two pairs:
1 with min of 2 (row 2, column 1) is 0, as expected
5 with min of 4 (row 4, column 5) is 0.2, as expected
Edit: @compbiostats pointed out that while sum(..) should produce the same results as length(which(..)), the latter may be more robust to missing data (e.g., NA values; cf. Difference between sum(), length(which()), and nrow() in R). For sum(..) to share that resilience, we should add na.rm=TRUE to both sum(.) and min(.) in the above calls. Thanks @compbiostats!
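A one-line illustration of that na.rm variant, using the [[5]]-versus-[[4]] pair from above (there are no NAs here, so the value is unchanged):
sum(splt[[5]] >= max(splt[[4]], na.rm = TRUE), na.rm = TRUE) / length(splt[[5]])
# [1] 0.2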
I'm new to R, so I apologize if this is a straightforward question; however, I've done quite a bit of searching this evening and can't seem to figure it out. I've got a data frame with a whole slew of variables, and what I'd like to do is create a table of the correlations among a subset of these, basically the equivalent of "pwcorr" in Stata or "correlations" in SPSS. The one key to this is that I want not only the r but also the significance associated with that value.
Any ideas? This seems like it should be very simple, but I can't seem to figure out a good way.
Bill Venables offers this solution in an answer on the R mailing list, to which I've made some slight modifications:
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)            # upper triangle of the correlation matrix
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)   # replace the upper triangle with p-values
  cor.mat <- t(R)                     # transpose so the p-values sit in the lower triangle
  cor.mat[upper.tri(cor.mat)] <- NA   # blank out the (now upper-triangle) correlations
  cor.mat
}
So let's test it out:
set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 NA NA NA NA
[2,] 0.7005361 1.0000000 NA NA NA
[3,] 0.5990483 0.6816955 1.0000000 NA NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000 NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835 1
The lower triangle of the output contains the p-values (the upper triangle is masked with NA). Does that line up with cor.test?
cor.test(data[,2], data[,3])
Pearson's product-moment correlation
data: data[, 2] and data[, 3]
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3603246 0.5178982
sample estimates:
cor
0.09778865
Seems to work ok.
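As a small addition of my own, one way to line up the correlations and the p-values is to take the former from cor() and the latter from the lower triangle of cor.prob()'s output:
r <- cor(data)
p <- cor.prob(data)
idx <- which(lower.tri(r), arr.ind = TRUE)   # row/column of each pair, in column-major order
data.frame(row = idx[, 1], col = idx[, 2],
           r = round(r[lower.tri(r)], 4),
           p = round(p[lower.tri(p)], 4))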
Here is something that I just made. I stumbled on this post because I was looking for a way to take every pair of variables and get a tidy n x 4 data frame: columns 1 and 2 are the variable names, and columns 3 and 4 are the absolute and signed correlation. Just pass the function a data frame of numeric and integer values.
pairwiseCor <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify=FALSE)
  df <- data.frame(Variable1=rep(0,length(pairs)), Variable2=rep(0,length(pairs)),
                   AbsCor=rep(0,length(pairs)), Cor=rep(0,length(pairs)))
  for(i in 1:length(pairs)){
    df[i,1] <- pairs[[i]][1]
    df[i,2] <- pairs[[i]][2]
    df[i,3] <- round(abs(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]])),4)
    df[i,4] <- round(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]]),4)
  }
  pairwiseCorDF <- df
  pairwiseCorDF <- pairwiseCorDF[order(pairwiseCorDF$AbsCor, decreasing=TRUE),]
  row.names(pairwiseCorDF) <- 1:length(pairs)
  pairwiseCorDF <<- pairwiseCorDF  # also assigns the result to the global environment
  pairwiseCorDF
}
This is what the output is:
> head(pairwiseCorDF)
Variable1 Variable2 AbsCor Cor
1 roll_belt accel_belt_z 0.9920 -0.9920
2 gyros_dumbbell_x gyros_dumbbell_z 0.9839 -0.9839
3 roll_belt total_accel_belt 0.9811 0.9811
4 total_accel_belt accel_belt_z 0.9752 -0.9752
5 pitch_belt accel_belt_x 0.9658 -0.9658
6 gyros_dumbbell_z gyros_forearm_z 0.9491 0.9491
I've found that the R package picante does a nice job dealing with the problem that you have. You can easily pass your dataset to the cor.table function and get a table of correlations and p-values for all of your variables. You can specify Pearson's r or Spearman in the function. See this link for help:
http://www.inside-r.org/packages/cran/picante/docs/cor.table
Also remember to remove any non-numeric columns from your dataset prior to running the function. Here's an example piece of code:
install.packages("picante")
library(picante)
#Insert the name of your dataset in the code below
cor.table(dataset, cor.method="pearson")
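One common way to do the "numeric columns only" step mentioned above (just a sketch, assuming dataset is a data frame):
dataset <- dataset[sapply(dataset, is.numeric)]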
You can use the sjt.corr function of the sjPlot-package, which gives you a nicely formatted correlation table, ready for use in your Office application.
Simplest function call is just to pass the data frame:
sjt.corr(df)
See examples here.
I would like to generate 500 different combinations of a, b, and c meeting the following conditions:
a + b + c = 1 and
a < b < c
Here is a basic example of generating random numbers; however, I need to generate them subject to the aforementioned conditions.
Coeff = data.frame(a=runif(500, min = 0, max = 1),
b=runif(500, min = 0, max = 1),
c=runif(500, min = 0, max = 1))
One approach is to draw three uniform numbers per row, normalize them so they sum to 1, and then sort each row:
myrandom <- function(n) {
m <- matrix(runif(3*n), ncol=3)
m <- cbind(m, rowSums(m)) # rowSums is efficient
t(apply(m, 1, function(a) sort(a[1:3] / a[4])))
}
Demonstration:
set.seed(2)
(m <- myrandom(5))
# [,1] [,2] [,3]
# [1,] 0.1099815 0.3287708 0.5612477
# [2,] 0.1206611 0.2231769 0.6561620
# [3,] 0.2645362 0.3509054 0.3845583
# [4,] 0.2057215 0.2213517 0.5729268
# [5,] 0.2134069 0.2896015 0.4969916
all(abs(rowSums(m) - 1) < 1e-8) # CONSTRAINT 1: a+b+c = 1
# [1] TRUE
all(apply(m, 1, diff) > 0) # CONSTRAINT 2: a < b < c
# [1] TRUE
Note:
my test for "sums to 1" uses an inequality rather than ==1 because of IEEE-754 floating-point representation and R FAQ 7.31: any floating-point test should be an inequality rather than a test for exact equality. If you test for ==1, you will eventually find occurrences where it does not appear to be satisfied:
set.seed(2)
m <- myrandom(1e5)
head(which(rowSums(m) != 1))
# [1] 73 109 199 266 367 488
m[73,]
# [1] 0.05290744 0.24824770 0.69884486
sum(m[73,])
# [1] 1
sum(m[73,]) == 1
# [1] FALSE
abs(sum(m[73,]) - 1) < 1e-15
# [1] TRUE
max(abs(rowSums(m) - 1))
# [1] 1.110223e-16
I would like to point out that ANY distribution law (uniform, Gaussian, exponential, ...) will produce numbers a, b, and c meeting your conditions as soon as you normalize and sort them, so there should be some domain knowledge to prefer one over the other.
As an alternative, I would propose using the Dirichlet distribution, which produces numbers naturally satisfying your first condition, a+b+c=1. It has been applied to rainfall modelling as well, I believe (https://arxiv.org/pdf/1801.02962.pdf).
library(MCMCpack)
n <- 500
abc <- rdirichlet(n, c(1, 1, 1))
sum(abc) # should output n, since each of the n rows sums to 1
You could vary the concentration parameters to shape the data and, of course, sort each row to satisfy your second condition. For many cases it is easy to reason about your model's behavior if it uses a Dirichlet (the Dirichlet being the prior for the multinomial in the Bayesian approach, for example).
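A minimal end-to-end sketch of that idea (my own illustration; it sorts each row to impose the ordering, and with continuous draws the inequalities are strict with probability 1):
library(MCMCpack)
set.seed(1)
abc <- rdirichlet(500, c(1, 1, 1))   # 500 rows, each summing to 1
abc <- t(apply(abc, 1, sort))        # enforce a < b < c within each row
colnames(abc) <- c("a", "b", "c")
all(abs(rowSums(abc) - 1) < 1e-8)                 # constraint 1: a + b + c = 1
all(abc[, 1] < abc[, 2] & abc[, 2] < abc[, 3])    # constraint 2: a < b < c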
I'm trying to write a loop that will iterate through my vector of doubles and calculate the standard deviation of every group of 5 values. Below is the code I've written to do so; however, when I attempt to run it, it gives NA for the majority of my output, which isn't accurate.
data is a large matrix with 53412 elements; it should be approximately 1175 rows.
for(i in floor((nrow(data)/5)-5)){sd5[i] <-sd(data[seq((5*i) + 1,(5*i) + 5),6])}
I've attempted to iterate through it manually, just executing the following in the console:
sd(data[seq((5) + 1,(5) + 5),6])
sd(data[seq((10) + 1,(10) + 5),6])
sd(data[seq((15) + 1,(15) + 5),6])
Each of those operates properly; however, when I attempted to do it with the loop, it results in NA for the majority of my data, including the 2nd and 3rd lines in the code block above.
Here are a few lines from the CSV it is reading:
2016-04-01,108.779999,110.00,108.199997,109.989998,25626200,109.989998
2016-03-31,109.720001,109.900002,108.879997,108.989998,25685700,108.989998
2016-03-30,108.650002,110.419998,108.599998,109.559998,45159900,109.559998
2016-03-29,104.889999,107.790001,104.879997,107.68,30774100,107.68
2016-03-28,106.00,106.190002,105.059998,105.190002,19303600,105.190002
Just in case, I wanted to point out that I am grabbing the correct values from the CSV file, at least when I manually execute sd(), as I've compared the console output to the CSV file. However, that doesn't mean I'm not iterating incorrectly in a way that I just can't seem to find. I set the loop to round down to avoid any out-of-bounds errors.
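For what it's worth, the NA values most likely arise because for(i in floor((nrow(data)/5)-5)) loops over a single scalar value, so only one element of sd5 is ever assigned. A minimal sketch of a sequence-based loop (my own illustration; it groups from row 1, so shift the indices if the first group should start at row 6 as in the manual attempts above, and it assumes column 6 holds the values of interest):
n_groups <- floor(nrow(data) / 5)
sd5 <- numeric(n_groups)
for (i in 1:n_groups) {
  rows <- ((i - 1) * 5 + 1):(i * 5)   # 1:5, 6:10, 11:15, ...
  sd5[i] <- sd(data[rows, 6])
}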
No need for a for loop.
If the vector is stored as x, you could do:
NN <- length(x)
x <- x[1:(5*floor(length(x)/5))]
dim(x) <- c(5, length(x)/5)
apply(x, 2, sd)
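A quick, self-contained check of that reshape idea on simulated data (my own illustration; x here is just a random vector standing in for the real column):
set.seed(42)
x <- rnorm(23)                         # 23 values: 4 complete groups of 5, remainder dropped
x <- x[1:(5 * floor(length(x) / 5))]   # trim to a multiple of 5
dim(x) <- c(5, length(x) / 5)          # each column is one group of 5
apply(x, 2, sd)                        # group-wise standard deviations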
If it's in a data.frame, I'd use data.table (especially since sd is GForce-optimized in the current devel version):
library(data.table); setDT(data)
data[ , sd(x), by = .(grp = (0:(length(x) - 1) %/% 5))]
You can just recast the data into a 5-column (or 5-row) matrix and apply the function to the rows (or columns).
And since the matrix is so large, you can use the matrixStats library:
mm <- read.csv(header = FALSE, text = "2016-04-01,108.779999,110.00,108.199997,109.989998,25626200,109.989998
2016-03-31,109.720001,109.900002,108.879997,108.989998,25685700,108.989998
2016-03-30,108.650002,110.419998,108.599998,109.559998,45159900,109.559998
2016-03-29,104.889999,107.790001,104.879997,107.68,30774100,107.68
2016-03-28,106.00,106.190002,105.059998,105.190002,19303600,105.190002")
set.seed(1)
mm <- mm[, -1]
mm <- matrix(sample(unlist(mm), 1500 * 55000, TRUE), 1500)
# num [1:1500, 1:55000] 110 109 110 110 110 ...
m2 <- matrix(mm, ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 110.42 108.88 109.56 109.56 109.90
# [2,] 108.99 107.68 105.19 107.68 109.72
# [3,] 109.90 110.00 25626200.00 108.88 30774100.00
# [4,] 105.06 25685700.00 105.19 108.88 30774100.00
# [5,] 107.68 109.90 105.19 104.89 107.79
# [6,] 108.88 108.78 108.88 108.99 108.20
system.time({
sds <- apply(m2, 1, sd)
})
# user system elapsed
## a damn long time
# Timing stopped at: 114.028 0.81 115.398
library('matrixStats')
system.time({
sds <- rowSds(m2)
})
# user system elapsed
# 0.347 0.051 0.402
head(sds)
# [1] 5.620328e-01 1.726982e+00 1.555266e+07 1.556640e+07 2.072692e+00 3.141340e-01