Generate a summary table from bins of a plot in R

I have a dataset of the form:
d = data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,-0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) = c("X","Y")
ggplot(d, aes(x=X, y=Y)) + geom_line()
I am trying to generate a summary table that bins the Y variable into equal groups of 10% and gives the summary statistics of X for each bin. This is how I would like my result to look:
Y Group X Group
0-10% {Range1: 10-30%, mean1, median1, sd1} {Range2: 85-100%, mean2, median2, sd2}
10-20% ...
20-30% ...
30-40% ...
40-50% ...
There are not always two ranges of X: the 20-30% bin of Y has three X ranges, and the 40-50% bin has one.
I have many large datasets on which this has to be implemented; the data above is only for reproducing the problem. My actual data could have many inflection points, as this code has to run on many combinations of X and Y.

The output is not formatted exactly like yours, but here is a close solution that you can easily reformat to your liking. It seems you are binning Y into 10 groups; I'm not sure about X, so I am using 10 groups on X too.
d = data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,-0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) = c("X","Y")
library(dplyr)
d$x.decile <- ntile(d$X, 10)
d$y.decile <- ntile(d$Y, 10)
summary <- data.frame(d %>%
  group_by(y.decile, x.decile) %>%
  summarise(mean = mean(X), median = median(X),
            min = min(X), max = max(X), sd = sd(X)))
> summary
y.decile x.decile mean median min max sd
1 1 2 0.175 0.175 0.15 0.20 0.018708287
2 1 3 0.210 0.210 0.21 0.21 NaN
3 1 10 0.990 0.990 0.98 1.00 0.010000000
4 2 2 0.135 0.135 0.13 0.14 0.007071068
5 2 3 0.235 0.235 0.22 0.25 0.012909944
6 2 10 0.955 0.955 0.94 0.97 0.012909944
7 3 1 0.095 0.095 0.09 0.10 0.007071068

You can get the format you want with melt and dcast from the reshape2 package.
In the code below, I've cut the data into 10 Y groups and 2 X groups, just to keep the width of the output reasonable. Change 2 to 10 in the ntile function to get actual deciles for X. Also, I haven't included every summary item, but hopefully the code below will guide you in adding more.
library(dplyr)
library(reshape2)
sm = d %>%
  group_by(`Y decile` = ntile(Y, 10), X.decile = ntile(X, 2)) %>%
  summarise(`X decile` = paste0("{Count: ", n(), ", Range: ", min(X), "-", max(X),
                                ", Median: ", median(X), "}"))
sm %>%
  melt(id.var = c("Y decile", "X.decile")) %>%
  dcast(`Y decile` ~ variable + X.decile, value.var = "value", fill = "")
Y decile X decile_1 X decile_2
1 1 {Count: 7, Range: 0.15-0.21, Median: 0.18} {Count: 3, Range: 0.98-1, Median: 0.99}
2 2 {Count: 6, Range: 0.13-0.25, Median: 0.225} {Count: 4, Range: 0.94-0.97, Median: 0.955}
3 3 {Count: 7, Range: 0.09-0.28, Median: 0.12} {Count: 3, Range: 0.91-0.93, Median: 0.92}
4 4 {Count: 6, Range: 0.06-0.31, Median: 0.185} {Count: 4, Range: 0.87-0.9, Median: 0.885}
5 5 {Count: 8, Range: 0.02-0.35, Median: 0.185} {Count: 2, Range: 0.85-0.86, Median: 0.855}
6 6 {Count: 5, Range: 0.01-0.39, Median: 0.37} {Count: 5, Range: 0.8-0.84, Median: 0.82}
7 7 {Count: 5, Range: 0.4-0.44, Median: 0.42} {Count: 5, Range: 0.75-0.79, Median: 0.77}
8 8 {Count: 5, Range: 0.45-0.49, Median: 0.47} {Count: 5, Range: 0.7-0.74, Median: 0.72}
9 9 {Count: 1, Range: 0.5-0.5, Median: 0.5} {Count: 9, Range: 0.51-0.69, Median: 0.65}
10 10 {Count: 10, Range: 0.55-0.64, Median: 0.595}
melt isn't actually necessary here. You could do the following, where the extra line at the end is to get more explanatory names.
sm = d %>%
  group_by(`Y decile` = ntile(Y, 10), X.decile = ntile(X, 2)) %>%
  summarise(`X decile` = paste0("{N: ", n(), ", Range: ", min(X), "-", max(X),
                                ", Median: ", median(X), "}")) %>%
  dcast(`Y decile` ~ X.decile, value.var = "X decile", fill = "") %>%
  setNames(., c(names(.)[1], paste0("X decile ", names(.)[-1])))
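If you have a recent tidyverse installed, you can also build the wide table without reshape2 at all. Here is a minimal sketch using tidyr::pivot_wider (the names sm2 and stats are mine, and I'm assuming tidyr >= 1.0 and dplyr >= 1.0):
library(dplyr)
library(tidyr)
# Same summary as above, spread wide with pivot_wider instead of dcast;
# values_fill = "" blanks the missing cells and names_prefix labels the columns.
sm2 <- d %>%
  group_by(`Y decile` = ntile(Y, 10), X.decile = ntile(X, 2)) %>%
  summarise(stats = paste0("{N: ", n(), ", Range: ", min(X), "-", max(X),
                           ", Median: ", median(X), "}"),
            .groups = "drop") %>%
  pivot_wider(names_from = X.decile, values_from = stats,
              names_prefix = "X decile ", values_fill = "")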

The quantile and aggregate functions can help you.
# Create data frame
d <- data.frame(seq(0.01,1,by=0.01), c(seq(0.27,0.1,-0.01),seq(0.1,0.5,0.01),seq(0.5,0.1,-0.01)))
names(d) <- c("X","Y")
# Define bins
bins <- quantile(d$Y, seq(0.1,1,length.out=10))
# Create indicator variable for which bin each Y belongs in
ag <- c()
for (i in 1:nrow(d)) {ag[i] <- which(d$Y[i] < bins)[1]}
# Compute summary statistics
means <- aggregate(d$X, by=list(ag), mean)
medians <- aggregate(d$X, by=list(ag), median)
variances <- aggregate(d$X, by=list(ag), var)
# Put them all into a new data frame
data.frame(group=(1:10),mean=means[,2], median=medians[,2], variance=variances[,2])
## group mean median variance
##1 1 0.4533333 0.200 0.162250000
##2 2 0.4709091 0.240 0.148969091
##3 3 0.3990000 0.265 0.134543333
##4 4 0.4650000 0.305 0.139583333
##5 5 0.3525000 0.325 0.114278571
##6 6 0.4983333 0.385 0.097178788
##7 7 0.5950000 0.595 0.034250000
##8 8 0.5950000 0.595 0.017583333
##9 9 0.5950000 0.595 0.006472222
##10 10 0.5950000 0.595 0.001171429
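As an aside, the explicit loop for assigning bins can be replaced by a single vectorized call. A sketch using cut (note the boundary handling differs slightly from the loop: a Y exactly equal to a quantile falls into the lower bin here, and the maximum Y gets bin 10 instead of NA):
# Vectorized bin assignment: cut() against the breakpoints (-Inf, bins]
# returns the bin index for every Y at once.
ag <- cut(d$Y, breaks = c(-Inf, bins), labels = FALSE)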

Related

how to calculate standard deviation of values in 10 intervals?

I want to calculate standard deviations in steps of 10 in R. For a large number of values, I want the SD of the values in each of 10 intervals: 0-10, 10-20, 20-30, and so on.
Example: I have a vector:
exemple <- seq(0,100,10)
If I do sd(exemple), I get the standard deviation of all the values in exemple at once.
But instead of calculating the standard deviation over all these values, I want to calculate it between 0 and 10, between 10 and 20, between 20 and 30, and so on.
To be clear, there are values in each interval: 0-10, 10-20, etc. For example, in exemple2, from 0 to 10 we have the values 0.2, 0.3, 0.5, 0.7, 0.6, 0.7, 0.03, 0.09, 0.1, 0.05.
Can someone help me please ?
You may use cut or findInterval to divide the data into groups and take the sd of each group.
set.seed(123)
vec <- runif(100, max = 100)
tapply(vec, cut(vec, seq(0,100,10)), sd)
# (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
#3.438162 2.653866 2.876299 2.593230 2.353325 2.755474 2.454519 3.282779 3.658064 3.021508
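findInterval, mentioned above, gives the same kind of grouping with integer bin ids instead of interval labels; one difference is that its bins are closed on the left ([0,10), [10,20), ...) while cut's are closed on the right. A minimal sketch:
# Integer bin ids from findInterval in place of cut's interval labels
tapply(vec, findInterval(vec, seq(0, 100, 10)), sd)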
Here is a solution using dplyr:
library(dplyr)
## Create a data frame with a random variable: 1000 values between 1 and 100
df <- data.frame(x = runif(1000, 1, 100))
## Create a grouping variable, binning by 10
df$group <- findInterval(df$x, seq(10, 100, by=10))
## Calculate SD by group
df %>%
group_by(group) %>%
summarise(Std.dev = sd(x))
# A tibble: 10 x 2
group Std.dev
* <int> <dbl>
1 0 2.58
2 1 2.88
3 2 2.90
4 3 2.71
5 4 2.84
6 5 2.90
7 6 2.88
8 7 2.68
9 8 2.98
10 9 2.89

Wrong degrees of freedom in lsmeans and SE calculation in R

I have this sample data:
Sample Replication Days
1 1 10
1 1 14
1 1 13
1 1 14
2 1 NA
2 1 5
2 1 18
2 1 20
1 2 16
1 2 NA
1 2 18
1 2 21
2 2 15
2 2 7
2 2 12
2 2 14
I have four observations for each sample, with a total of 64 samples in each of the two replications; in total, I have 512 values across both replications. I also have some missing values, designated as NA. I performed an ANOVA on the mean values for each Sample in each Rep, which I generated using
library(tidyverse)
df <- Data %>% group_by(Sample, Rep) %>% summarise(Mean = mean(Days, na.rm = TRUE))
curve.anova <- aov(Mean~Rep+Sample, data=df)
Result of anova is:
> summary(curve.anova)
Df Sum Sq Mean Sq F value Pr(>F)
Rep 1 6.1 6.071 2.951 0.0915 .
Sample 63 1760.5 27.945 13.585 <2e-16 ***
Residuals 54 111.1 2.057
I created a table for mean and SE values,
ANOVA<-lsmeans(curve.anova, ~Sample)
ANOVA<-summary(ANOVA)
write.csv(ANOVA, file="Desktop/ANOVA.csv")
A few lines from the file are:
Sample lsmean SE df lower.CL upper.CL
1 24.875 1.014145417 54 22.84176086 26.90823914
2 25.5 1.014145417 54 23.46676086 27.53323914
3 31.32575758 1.440722628 54 28.43728262 34.21423253
4 26.375 1.014145417 54 24.34176086 28.40823914
5 26.42424242 1.440722628 54 23.53576747 29.31271738
6 25.5 1.014145417 54 23.46676086 27.53323914
7 28.375 1.014145417 54 26.34176086 30.40823914
8 24.875 1.014145417 54 22.84176086 26.90823914
9 21.16666667 1.014145417 54 19.13342752 23.19990581
10 23.875 1.014145417 54 21.84176086 25.90823914
The df for all 64 samples is 54, and the error bars in the ggplot are mostly equal across Samples. The SE values are larger than my manually calculated values. Based on the ANOVA results, df = 54 is for the residuals.
I want to double-check that the ANOVA results are correct and that I am correctly generating lsmeans and SE to plot a bar graph with confidence-interval error bars in ggplot.
I will appreciate any help. Thank you!
After reading your comments, I think your workflow has an issue. Basically, when you apply your ANOVA, you are applying it to the means of the different samples.
So, in your example, when you are doing :
curve.anova <- aov(Mean~Rep+Sample, data=df)
You are comparing these values:
> df
# A tibble: 4 x 3
# Groups: Sample [2]
Sample Replication Mean
<dbl> <dbl> <dbl>
1 1 1 12.8
2 1 2 18.3
3 2 1 14.3
4 2 2 12
So, basically, you are comparing two groups with two values per group.
And when you tried to remove the Replication grouping, you got an error, because the output of:
df = Data %>% group_by(Sample) %>% summarise(Mean = mean(Days, na.rm = TRUE))
is now:
# A tibble: 2 x 2
Sample Mean
<dbl> <dbl>
1 1 15.1
2 2 13
Applying an ANOVA to that dataset means comparing two groups with one value each, so you can't compute residuals or SE.
Instead, you should run it on the full dataset, without calculating the means first:
anova_data <- aov(Days~Sample+Replication, data=Data)
anova_data2 <- aov(Days~Sample, data=Data)
And their output are:
> summary(anova_data)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.071 0.713 0.416
Replication 1 9.05 9.054 0.402 0.539
Residuals 11 247.80 22.528
2 observations deleted due to missingness
> summary(anova_data2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.07 0.751 0.403
Residuals 12 256.86 21.41
2 observations deleted due to missingness
Now, you can apply lsmeans:
A_d = summary(lsmeans(anova_data, ~Sample))
A_d2 = summary(lsmeans(anova_data2, ~Sample))
> A_d
Sample lsmean SE df lower.CL upper.CL
1 15.3 1.8 11 11.29 19.2
2 12.9 1.8 11 8.91 16.9
Results are averaged over the levels of: Replication
Confidence level used: 0.95
> A_d2
Sample lsmean SE df lower.CL upper.CL
1 15.1 1.75 12 11.33 19.0
2 13.0 1.75 12 9.19 16.8
Confidence level used: 0.95
It does not change the means and SEs much (which is good: it means your replicates are consistent and do not vary too much between them), but it reduces the confidence interval.
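If you want to verify those intervals by hand, the confidence limits are just lsmean plus or minus the t quantile times SE, using the residual df. A quick sketch against the A_d summary above:
# Rebuild the 95% CI from lsmean, SE, and residual df; this should
# reproduce the lower.CL / upper.CL columns reported by lsmeans.
with(A_d, cbind(lower.CL = lsmean - qt(0.975, df) * SE,
                upper.CL = lsmean + qt(0.975, df) * SE))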
So, to plot it, you can:
library(ggplot2)
ggplot(A_d, aes(x=as.factor(Sample), y=lsmean)) +
geom_bar(stat="identity", colour="black") +
geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
Based on your initial question, if you want to check that the ANOVA output is correct, you can simulate fake data like this:
d2 <- data.frame(Sample = c(rep(1,10), rep(2,10)),
Days = c(rnorm(10, mean =3), rnorm(10, mean = 8)))
Then,
curve.d2 <- aov(Days ~ Sample, data = d2)
ANOVA2 <- lsmeans(curve.d2, ~Sample)
ANOVA2 <- summary(ANOVA2)
And you get the following output:
> summary(curve.d2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 139.32 139.32 167.7 1.47e-10 ***
Residuals 18 14.96 0.83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> ANOVA2
Sample lsmean SE df lower.CL upper.CL
1 2.62 0.288 18 2.02 3.23
2 7.90 0.288 18 7.29 8.51
Confidence level used: 0.95
And for the plot
ggplot(ANOVA2, aes(x=as.factor(Sample), y=lsmean)) +
geom_bar(stat="identity", colour="black") +
geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
As you can see, the lsmeans for d2 are close to the 3 and 8 we set in the first place, so I think your output is correct. Maybe your data simply do not show significant differences, and the SE computations come out the same because the distributions of your data are similar. It is what it is.
I hope this answer helps you.
Data
Data <- data.frame(Sample = c(rep(1,4), rep(2,4), rep(1,4), rep(2,4)),
                   Replication = c(rep(1,8), rep(2,8)),
                   Days = c(10,14,13,14,NA,5,18,20,16,NA,18,21,15,7,12,14))

95% winsorization by groups over multiple variables

In my real data, I have multiple outliers across multiple variables. My data looks something like the example below, but the numbers here are completely random. I would like to pull in all data points lying more than 2 SD from the mean, using a 95% winsorization.
df <- read.csv(header=TRUE, text="
id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
I am aware of the winsorize function in the robustHD package, but am not sure how to ensure the winsorization accounts for the two different groups, or how to include multiple variables in it.
I have tried the code below, but it is not complete:
library(robustHD)
library(dplyr)
new.df.wins = df %>%
group_by(group) %>%
mutate(measure_winsorized = winsorize(c(test1,test2)))
An error is returned indicating
Error: Column `measure_winsorized` must be length 45 (the group size) or one, not 90
I am open to other ideas too. Thanks!
Consider creating two new fields for each numeric field to be winsorized:
new.df.wins <- df %>%
group_by(group) %>%
mutate(measure_winsorized_test1 = winsorize(test1),
measure_winsorized_test2 = winsorize(test2))
Alternatively with base R's ave:
new.df.wins <- within(df, {
measure_winsorized_test2 <- ave(test2, group, FUN=winsorize)
measure_winsorized_test1 <- ave(test1, group, FUN=winsorize)
})
Should you want to winsorize both simultaneously, assign to two new columns at once:
# TIDYVERSE (dplyr)
new.df.wins <- df %>%
group_by(group) %>%
mutate_at(.funs = list(wins = winsorize), .vars = vars(test1:test2))
# TINYVERSE (I.E. BASE R)
df[c("test1_wins", "test2_wins")] <- with(df, ave(cbind(test1, test2),
group, FUN=winsorize))
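Note that mutate_at is superseded in current dplyr. If you are on dplyr >= 1.0, the same simultaneous winsorization can be written with across; a sketch (the _wins suffix is my choice):
library(dplyr)
library(robustHD)
# across() applies winsorize to each selected column within each group;
# .names keeps the originals and adds test1_wins / test2_wins.
new.df.wins <- df %>%
  group_by(group) %>%
  mutate(across(test1:test2, winsorize, .names = "{.col}_wins")) %>%
  ungroup()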
You can make a version of winsorize() that works on data frames, and use that with by()
# Example data
set.seed(1)
df2 <- round(matrix(rt(100, 4), 20), 3)
df2 <- data.frame(id=seq_len(nrow(df2)),
group=sort(rep(1:2, length=nrow(df2))),
test=df2)
df2[c(1:3, 11:13),]
# id group test.1 test.2 test.3 test.4 test.5
# 1 1 1 -0.673 -1.227 0.015 -0.831 0.024
# 2 2 1 -0.584 1.059 1.492 0.833 -0.377
# 3 3 1 0.572 0.613 -1.924 -0.672 1.184
# 11 11 2 0.054 0.020 2.241 -0.103 -0.047
# 12 12 2 1.746 -0.788 -0.268 -1.921 4.577
# 13 13 2 -0.472 -1.294 -0.258 0.795 -1.110
# data frame version of winsorize
winsorizedf <- function(x, ...) {
do.call(cbind, lapply(x, winsorize, ...))
}
# winsorize every column, except the two first ones, grouped by df2$group
w <- do.call(rbind,
by(df2[, -(1:2)], df2$group, winsorizedf))
# combine the winsorized columns with the original id and group columns
dfw <- data.frame(df2[, 1:2], round(w, 2))
dfw[c(1:3, 11:13),]
# id group test.1 test.2 test.3 test.4 test.5
# 1 1 1 -0.63 -1.23 0.02 -0.83 0.02
# 2 2 1 -0.58 1.06 1.49 0.26 -0.38
# 3 3 1 0.57 0.61 -1.60 -0.67 1.18
# 11 11 2 0.05 0.02 1.23 -0.10 -0.05
# 12 12 2 1.70 -0.79 -0.27 -1.92 4.58
# 13 13 2 -0.47 -1.07 -0.26 0.80 -1.11

R: not enough observations; arguments are treated as the container rather than the content

I am trying to run a Bartlett test (or any similar test) in R. It works fine with imported data:
data(foster, package = "HSAUR")
bartlett.test(weight ~ litgen,data = foster)
But not with my data:
mdat <- matrix(c(2.3,2.2,2.25, 2.2,2.1,2.2, 2.15, 2.15, 2.2, 2.25, 2.15, 2.25), nrow = 3, ncol = 4)
working_df = data.frame(mdat)
bartlett.test(X1 ~ X2, data = working_df)
Error in bartlett.test.default(c(2.3, 2.2, 2.25), c(2.2, 2.1, 2.2)) :
there must be at least 2 observations in each group
I have tried various functions and assignments, but the problem is that the arguments are treated as a single object rather than as their contents.
How can I run a Bartlett test with my data frames? How do I make the arguments be the contents rather than the container?
I don't know what you mean when you talk about "contents" and "container". The documentation at ?bartlett.test is pretty straightforward. You're trying to use a formula, so we'll look at the description of the formula argument:
formula a formula of the form lhs ~ rhs where lhs gives the data values and rhs the corresponding groups.
This matches with the structure of the foster data, where weight is numeric, and litgen is a categorical grouper.
head(foster)
litgen motgen weight
1 A A 61.5
2 A A 68.2
3 A A 64.0
4 A A 65.0
5 A A 59.7
6 A B 55.0
So, you need to put your data in that format.
your_data = data.frame(x = c(mdat), group = c(col(mdat)))
your_data
# x group
# 1 2.30 1
# 2 2.20 1
# 3 2.25 1
# 4 2.20 2
# 5 2.10 2
# 6 2.20 2
# 7 2.15 3
# 8 2.15 3
# 9 2.20 3
# 10 2.25 4
# 11 2.15 4
# 12 2.25 4
bartlett.test(x ~ group, data = your_data)
# Bartlett test of homogeneity of variances
#
# data: x by group
# Bartlett's K-squared = 0.86607, df = 3, p-value = 0.8336
That's all your groups at once. If you want to do pairwise comparisons, give subsets of your data to bartlett.test, as in the sketch below.
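A minimal sketch of such pairwise tests, reusing the your_data frame built above (the pairs matrix and the anonymous function are mine):
# Run bartlett.test on every pair of groups and collect the p-values
pairs <- combn(sort(unique(your_data$group)), 2)
apply(pairs, 2, function(p) {
  bartlett.test(x ~ group, data = subset(your_data, group %in% p))$p.value
})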

R: aggregating time series groups of irregular length

I think this is a split-apply-combine problem, but with a time series twist. My data consists of irregular counts, and I need to perform some summary statistics on each group of counts. Here is the data for your console:
library(xts)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
xtsData <- xts(cbind(returns,count,maxCount,sumCount),date)
I have no idea how to construct the max and cumSum columns, especially since each count series has an irregular length. Since I won't always know the start and end points of a count series, I'm lost trying to figure out the indices of these groups. Thanks for your help!
UPDATE: here is my for loop attempting to calculate cumSum. It doesn't compute the cumulative sum yet, just pulls out the needed returns; I'm still unsure how to apply functions to these ranges!
xtsData <- cbind(xtsData,mySumCount=NA)
# find groups of returns
for(i in 1:nrow(xtsData)){
if(is.na(xtsData[i,"count"]) == FALSE){
xtsData[i,"mySumCount"] <- xtsData[i,"returns"]
}
else{
xtsData[i,"mySumCount"] <- NA
}
}
UPDATE 2: thank you commenters!
# report returns when not NA count
x1 <- xtsData[!is.na(xtsData$count),"returns"]
# cum sum is close, but still need to exclude the first element
# -0.009 in the first series of counts and .027 in the second series of counts
x2 <- cumsum(xtsData[!is.na(xtsData$count),"returns"])
# this output is not accurate because 0.03 is displayed down the entire column, not just during periods when count != NA. Is this just a rounding error?
x3 <- max(xtsData[!is.na(xtsData$count),"returns"])
SOLUTION:
# function to pad a vector with a 0
lagpad <- function(x, k) {
c(rep(0, k), x)[1 : length(x)]
}
# group the counts
x1 <- na.omit(transform(xtsData, g = cumsum(c(0, diff(!is.na(count)) == 1))))
# cumulative sum of the count series
z1 <- transform(x1, cumsumRet = ave(returns, g, FUN =function(x) cumsum(replace(x, 1, 0))))
# max of the count series
z2 <- transform(x1, maxRet = ave(returns, g, FUN =function(x) max(lagpad(x,1))))
merge(xtsData,z1$cumsumRet,z2$maxRet)
The code shown is not consistent with the desired output, and no explanation is provided, so it's not clear what manipulations were wanted; however, the question did mention that the main problem is distinguishing the groups, so we will address that.
To do that we compute a new column g whose rows contain 1 for the first group, 2 for the second and so on. We also remove the NA rows since the g column is sufficient to distinguish groups.
The following code computes a vector the same length as count. First, !is.na(count) sets each NA position to FALSE and each non-NA position to TRUE. Differencing that vector against the prior position (which implicitly converts FALSE to 0 and TRUE to 1) yields 1 exactly where a run of non-NA values starts, and comparing with == 1 turns this into a logical vector. Since the first component has no prior position, we prepend a 0, which implicitly converts the TRUE and FALSE values back to 1 and 0. Taking the cumsum then fills the first group with 1, the second with 2, and so on. Finally, omit the NA rows:
# assuming x holds the question's data as a data frame, e.g.
# x <- data.frame(returns, count, maxCount, sumCount, row.names = date)
x <- na.omit(transform(x, g = cumsum(c(0, diff(!is.na(count)) == 1))))
giving:
> x
returns count maxCount sumCount g
2010-11-26 -0.009 1 0.030 0.000 1
2010-12-03 0.030 1 0.030 0.030 1
2010-12-10 0.013 2 0.030 0.042 1
2010-12-17 0.003 2 0.030 0.045 1
2010-12-24 0.010 3 0.030 0.056 1
2010-12-31 0.001 4 0.030 0.056 1
2011-01-07 0.011 5 0.030 0.067 1
2011-01-14 0.017 6 0.030 0.084 1
2011-01-21 -0.008 7 0.030 0.077 1
2011-01-28 -0.005 7 0.030 0.071 1
2011-02-04 0.027 7 0.030 0.098 1
2011-02-11 0.014 7 0.030 0.112 1
2011-02-18 0.010 7 0.030 0.123 1
2011-03-18 0.027 1 0.027 0.000 2
2011-03-25 -0.019 2 0.027 -0.019 2
attr(,"na.action")
2010-11-18 2010-11-19 2011-02-25 2011-03-04 2011-03-11 2011-03-26 2011-03-27
1 2 16 17 18 21 22
attr(,"class")
[1] "omit"
You can now use ave to perform any calculations you like. For example to take cumulative sums of returns by group:
transform(x, cumsumRet = ave(returns, g, FUN = cumsum))
Replace cumsum with any other function that is suitable for use with ave.
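For instance, the group-wise maximum of returns, which reproduces the maxCount column from the question (maxRet is a name I picked):
# ave() computes max(returns) within each g group and repeats it
# across that group's rows
transform(x, maxRet = ave(returns, g, FUN = max))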
Ah, so "count" defines the groups, and you want the cumsum per group and the max per group. I think in data.table, so here is how I would do it.
library(xts)
library(data.table)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
DT <- data.table(date, returns, count)
DT[!is.na(count),max:=max(returns),by=count]
DT[!is.na(count),cumSum:= cumsum(returns),by=count]
# if you need an xts object at the end:
xtsData <- xts(cbind(DT$returns,DT$count, DT$max,DT$cumSum),DT$date)
