Determine the length of a season with conditions (e.g. winter snow season) in R

I want to determine the length of the snow season in the following data frame:
DATE SNOW
1998-11-01 0
1998-11-02 0
1998-11-03 0.9
1998-11-04 1
1998-11-05 0
1998-11-06 1
1998-11-07 0.6
1998-11-08 1
1998-11-09 2
1998-11-10 2
1998-11-11 2.5
1998-11-12 3
1998-11-13 6.5
1999-01-01 15
1999-01-02 15
1999-01-03 19
1999-01-04 18
1999-01-05 17
1999-01-06 17
1999-01-07 17
1999-01-08 17
1999-01-09 16
1999-03-01 6
1999-03-02 5
1999-03-03 5
1999-03-04 5
1999-03-05 5
1999-03-06 2
1999-03-07 2
1999-03-08 1.6
1999-03-09 1.2
1999-03-10 1
1999-03-11 0.6
1999-03-12 0
1999-03-13 1
Snow season is defined by a snow depth (SNOW) of at least 1 cm for at least 10 consecutive days (so if there is snow one day in November but it then melts and the depth drops below 1 cm, we consider the season not started).
My idea would be to determine:
1) the date of snowpack establishment (in my example 1998-11-08)
2) the date of "disappearing" (here 1999-03-11)
3) calculate the length of the period (number of days between 1998-11-08 and 1999-03-11)
For the 3rd step I can easily get the number of days between 2 dates, for example:
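# plain Date subtraction gives a difftime in days
as.Date("1999-03-11") - as.Date("1998-11-08")
# Time difference of 123 days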
But how do I define these dates with conditions?

This is one way:
# copy data from the clipboard (readClipboard() is Windows-only; on other
# platforms read the data from a file instead)
d <- read.table(text=readClipboard(), header=TRUE)
# coerce DATE to Date type, add an event grouping variable that numbers the
# runs of SNOW >= 1 sequentially and is NA for rows not in an event
d <- transform(d, DATE=as.Date(DATE),
               event=with(rle(d$SNOW >= 1),
                          rep(replace(ave(values, values, FUN=seq), !values, NA), lengths)))
# aggregate event lengths in days
event.days <- aggregate(DATE ~ event, data=d, function(x) as.numeric(max(x) - min(x), units='days'))
# get those events greater than 10 days
subset(event.days, DATE > 10)
# event DATE
# 3 3 122
You can also use the event grouping variable to find the start dates:
starts <- aggregate(DATE ~ event, data=d, FUN=head, 1)
#   event       DATE
# 1     1 1998-11-04
# 2     2 1998-11-06
# 3     3 1998-11-08
# 4     4 1999-03-13
And then merge this with event.days:
merge(event.days, starts, by='event')
# event DATE.x DATE.y
# 1 1 0 1998-11-04
# 2 2 0 1998-11-06
# 3 3 122 1998-11-08
# 4 4 0 1999-03-13


Developing a row extraction rule

I want to develop a rule to extract certain rows from a matrix. I set up the example as follows:
mat1 = data.frame(matrix(nrow=508, ncol =5))
mat1[1:20,1] = rep(1,20)
mat1[1:20,2:5] = rnorm(20*4,0,1)
mat2 = data.frame(matrix(nrow=508, ncol =5))
seq1 <- seq(1,3,1)
mat2[1:27,1] = rep(seq1,9)
mat2[1:27,2:5] = rnorm(27*4,0,1)
mat3 = data.frame(matrix(nrow=508, ncol =5))
mat3[1:32,1] = rep(seq(1,4,1),8)
mat3[1:32,2:5] = rnorm(32*4,0,1)
colnames(mat1) = colnames(mat2) = colnames(mat3) = c("Cohort Number", "Alpha(t-1)", "date1", "date2", "date3")
mat.list <- list(mat1,mat2,mat3)
Example matrix
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -1.76745451 -1.3227308 2.7099501 -0.13797329
2 1 -0.72651808 -0.8714317 1.3200554 0.76964663
3 1 -0.50325892 0.0742336 -0.6460628 0.30148135
4 1 0.79592650 0.1353875 -0.5694022 -0.59019913
5 1 1.94064961 0.2255595 0.3156252 -0.90996475
6 1 0.27134932 0.3966957 -1.9198976 0.23998928
7 1 -1.13272507 -0.8603225 -1.2042036 0.06609958
8 1 -2.12392748 1.0905405 -0.3788234 0.92850110
9 1 0.22038996 0.4500683 -1.4617004 0.58498275
10 1 0.26348734 -0.8340913 1.2631368 -1.48490518
11 1 0.26931077 -0.5230622 -0.6615288 1.45668453
12 1 -2.03067695 -0.6432484 0.4801026 0.01808834
13 1 1.25915656 -0.1116544 -0.3004298 -1.04072722
14 1 -2.27894271 -2.1058424 -0.3351053 -1.04132045
15 1 0.47742052 2.1564274 -0.4733351 -0.53152019
16 1 -1.57680089 -0.1340645 -0.3134633 0.53223567
17 1 0.25245813 -0.8243152 0.5998211 -1.01892301
18 1 0.18391447 -1.3500645 1.6059798 1.43359399
19 1 -0.09602031 1.4921338 -0.6455687 0.66385823
20 1 -0.13613759 2.2474816 0.7311762 -2.46849071
mat2[1:27,]
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -0.76033920 1.317636591 -0.09684526 -0.08796725
2 2 0.05123185 -0.731591674 -0.37247406 0.04470346
3 3 -0.78460201 0.890336570 1.26737475 -0.39062992
4 1 -0.14111920 1.255008475 -0.32799815 -0.77277716
5 2 -0.46044451 1.175157970 0.82187906 0.54326905
6 3 -0.46804365 0.704203273 -2.04539007 -1.74782065
7 1 0.42009824 0.488807461 3.21093186 -0.13745029
8 2 1.27083389 -1.316989452 0.43565921 0.07870330
9 3 -0.16581119 1.872955624 -0.22399155 -0.79334562
10 1 -1.33436656 0.589311311 -1.03871415 -1.06221057
11 2 1.56584985 0.020699064 0.45691456 0.15858065
12 3 1.07756426 -0.045200151 0.05124461 -1.86633279
13 1 -1.01264994 -0.229406681 1.24954420 0.88846407
14 2 -0.09950713 -0.515798138 1.62560454 -0.20191909
15 3 -0.28319479 0.450854419 1.42963386 -1.11964154
16 1 0.51771608 -1.407248379 0.62626313 0.97775246
17 2 -0.43951262 -0.368739441 0.66564013 -0.79980882
18 3 -0.15865277 -0.231475146 0.37582330 0.93685867
19 1 -0.57758129 0.235550070 0.42480442 -0.14379249
20 2 -0.81726414 -1.207593079 -0.30000514 0.68967230
21 3 -0.72926703 -0.458849409 1.51162785 1.40921409
22 1 -0.32220454 0.334996561 1.26073381 -2.03405958
23 2 -0.51450039 -0.305634241 1.51021957 0.39775430
24 3 1.15476297 -1.040126709 -0.36192432 -0.37346894
25 1 -0.88053587 -0.006829769 -0.89855797 -0.39840858
26 2 -0.64435448 0.209561006 -0.13986834 -0.61308957
27 3 1.22492942 0.812693992 -1.32371617 -1.21852365
and
> mat3[1:32,]
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -0.7657871 -0.35390862 -0.23539987 -1.8365309
2 2 -0.6631690 1.36450837 0.78403072 -0.8344993
3 3 -1.0134022 -0.28380021 0.72149463 -0.7890273
4 4 2.6419455 0.26998803 2.03606725 0.8099134
5 1 -0.1383910 0.90845134 1.09273919 0.4651443
6 2 -0.7549340 -0.23185551 2.21119705 -0.1386960
7 3 0.7296121 -1.09145187 -1.18092505 0.1510642
8 4 -0.5583415 0.71988405 0.09454476 -0.8661514
9 1 -0.2420894 -0.03215026 -2.51249946 1.1659027
10 2 -0.6434337 -0.13910557 -1.10373674 1.2377968
11 3 -0.6297123 2.09797419 0.87128407 -0.1351845
12 4 0.6674166 0.48707847 0.36373509 1.0680623
13 1 0.6254708 -0.61311671 0.82542494 1.7320687
14 2 -2.4704173 0.98460064 -1.10416042 2.9627952
15 3 -0.2544887 0.63177246 -0.39138717 1.6942072
16 4 -0.9807623 1.11882794 -0.47669974 1.2383798
17 1 -0.6900549 1.68086482 -0.01405476 -1.3099288
18 2 1.4510505 -0.04752782 1.49735258 0.2963673
19 3 -1.1355194 -1.76263532 -1.49318214 1.3524114
20 4 0.7168833 -0.76833639 0.60752304 -1.0647885
21 1 2.0004745 2.13931057 -1.35036048 -0.7694501
22 2 2.0985591 0.01569677 0.33975952 -1.4979973
23 3 0.1703261 -1.47625208 -1.13228671 0.5686501
24 4 0.2632233 -0.55672667 0.33428217 0.5341078
25 1 -0.2741324 -1.61301237 0.78861248 0.4982554
26 2 -0.8793897 -1.07266362 -0.78158128 0.9127354
27 3 0.3920579 -0.59869834 -0.76775259 1.8137107
28 4 -1.4088488 -0.54954542 0.32421016 0.7284813
29 1 -1.2421837 0.50599077 1.62464999 0.6801672
30 2 -2.8980422 0.42197236 0.45243582 1.4939070
31 3 0.3965108 -1.35877353 1.52230797 -1.6552039
32 4 0.8112229 0.51970084 0.30830797 -2.0563928
What I want to do:
For every matrix in mat.list I want to extract 6 rows of data, according to certain criteria, and place these rows as a data.frame in a list labelled Output1. I want to store all remaining rows as a data.frame in Output2.
The process:
1) Group data by cohort number.
2a. If there is 1 group (Cohort Number can only = 1), move to column 2 and extract the 6 rows of the matrix with the highest values for "Alpha(t-1)". Store these rows as a data.frame in a list named "Output1". Store all remaining rows as a data.frame in a list named "Output2".
2b. If there are 2 groups (Cohort Number can = 1 or 2), move to column 2 and extract the 3 rows with the largest "Alpha(t-1)" corresponding to Cohort Number == 1 and the 3 rows with the largest "Alpha(t-1)" corresponding to Cohort Number == 2. Place the 6 extracted rows as a data.frame in a list named "Output1". Place all remaining rows as a data.frame in a list named "Output2".
2c. If there are 3 groups (Cohort Number can = 1, 2 or 3), move to column 2 and extract the 2 rows with the largest "Alpha(t-1)" for each of Cohort Number == 1, == 2 and == 3. Again store the 6 rows in "Output1" and the remainder in "Output2".
2d. If there are 4 groups (Cohort Number can = 1, 2, 3 or 4), move to column 2. Extract the 2 rows with the largest "Alpha(t-1)" corresponding to Cohort Number == 1, the 2 rows with the largest "Alpha(t-1)" corresponding to Cohort Number == 2, the 1 row with the largest "Alpha(t-1)" corresponding to Cohort Number == 3, and the 1 row with the largest "Alpha(t-1)" corresponding to Cohort Number == 4. Store the 6 key rows as a data.frame in Output1. Store all remaining rows as a data.frame in the list Output2. (The allocation rule is sketched in code right after this list.)
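In other words, the per-cohort allocation depends only on the number of distinct cohorts. A hypothetical lookup table for that rule could look like this:
# rows to take per cohort, indexed by the number of distinct cohorts
alloc <- list(`1` = 6, `2` = c(3, 3), `3` = c(2, 2, 2), `4` = c(2, 2, 1, 1))
alloc[["4"]]
# [1] 2 2 1 1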
Desired Output:
Output1 <- list()
Output2 <- list()
Output1[[1]] = mat1 %>% group_by(`Cohort Number`) %>% top_n(6, `Alpha(t-1)`)
Output1[[2]] = mat2 %>% group_by(`Cohort Number`) %>% top_n(2, `Alpha(t-1)`)
> Output1[[1]]
# A tibble: 6 x 5
# Groups: Cohort Number [1]
`Cohort Number` `Alpha(t-1)` date1 date2 date3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.796 0.135 -0.569 -0.590
2 1 1.94 0.226 0.316 -0.910
3 1 0.271 0.397 -1.92 0.240
4 1 0.269 -0.523 -0.662 1.46
5 1 1.26 -0.112 -0.300 -1.04
6 1 0.477 2.16 -0.473 -0.532
> Output1[[2]]
# A tibble: 6 x 5
# Groups: Cohort Number [3]
`Cohort Number` `Alpha(t-1)` date1 date2 date3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.420 0.489 3.21 -0.137
2 2 1.27 -1.32 0.436 0.0787
3 2 1.57 0.0207 0.457 0.159
4 1 0.518 -1.41 0.626 0.978
5 3 1.15 -1.04 -0.362 -0.373
6 3 1.22 0.813 -1.32 -1.22
Overall I need a function to do this because I have over 1000 matrices in my actual application and can't do this manually.
We can count the number of distinct values in Cohort Number and, based on that, select the value of n in top_n. When there are more than 3 distinct values, we create a vector of the number of rows to take in top_n for each Cohort Number.
library(tidyverse)
output1 <- map(mat.list, function(x) {
  dist <- n_distinct(x$`Cohort Number`, na.rm = TRUE)
  if (dist <= 3)
    x %>%
      group_by(`Cohort Number`) %>%
      top_n(6 / dist, `Alpha(t-1)`)
  else
    map2_df(list(2, 2, 1, 1),
            x %>% na.omit %>% group_split(`Cohort Number`),
            ~ .y %>% top_n(.x, `Alpha(t-1)`))
})
And for output2, we use map2 with anti_join:
output2 <- map2(mat.list, output1, anti_join)
Confirming the output
map_dbl(output1, nrow)
#[1] 6 6 6
map_dbl(output2, nrow)
#[1] 502 502 502
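As a side note, top_n() has been superseded by slice_max() since dplyr 1.0; a hedged sketch of the same idea for the 1-3 group cases (the 4-group split would still need the map2_df branch):
output1_alt <- map(mat.list, function(x) {
  dist <- n_distinct(x$`Cohort Number`, na.rm = TRUE)
  x %>%
    filter(!is.na(`Cohort Number`)) %>%
    group_by(`Cohort Number`) %>%
    slice_max(`Alpha(t-1)`, n = 6 %/% dist, with_ties = FALSE) %>%
    ungroup()
})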

When a variable switches from 1 to 2, delete some data from the other variables and average what's left?

I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that alternates between 1 and 2 every 10 data points. I want to write code that, whenever the "switch" column changes its value (from 1 to 2, or 2 to 1), deletes the first 5 rows of data after the switch (i.e. keeps only the last 5 data points for all 4 variables), averages the remaining data points for O2 and CO2, and puts them in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset comprises thousands of data points, and I would like R to do it automatically for me. Does anyone have any ideas that could help me?
Please find my edits, which should work for both regular and irregular switching:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
  # First, properly turn your switch column into a grouping column:
  # (1,2,1,2) -> (1,2,3,4)
  grouplength <- rle(Data[, "switch"])$lengths
  # mapply lets us feed vector arguments to functions that usually take a
  # single value; here it repeats each group number for the length of its run
  groups <- mapply(rep, seq_along(grouplength), each = grouplength)
  # if switching was irregular the result is a list, if regular a matrix;
  # flatten either one into a single vector
  if (is.list(groups)) {
    groups <- unlist(groups)
  } else {
    groups <- as.vector(groups)
  }
  Data$group <- groups
  # row index of the first row of each switch (including the very first row)
  switchRow <- c(0, which(diff(SwitchData) != 0)) + 1
  # "ToRemove" collects the row numbers of the first UnreliableRows rows
  # of every run ("as.vector" flattens the matrix output of mapply)
  ToRemove <- as.vector(mapply(seq, switchRow, switchRow + UnreliableRows - 1))
  # create the new data (in case you don't know: data[<ROW>, <COLUMN>])
  newdat <- Data[-ToRemove, ]
  # return the result
  newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try a range of UnreliableRows values with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
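The same idea can be sketched in dplyr (assuming dplyr >= 1.0, where out-of-range negative slice() indices are ignored): derive run ids from the switch column, drop the first five rows of each run, then attach the per-run means as the avg.O2/avg.CO2 columns the question asked for.
library(dplyr)
test.data %>%
  mutate(group = cumsum(switch != lag(switch, default = first(switch))) + 1) %>%
  group_by(group) %>%
  slice(-(1:5)) %>%
  mutate(avg.O2 = mean(O2), avg.CO2 = mean(CO2)) %>%
  ungroup()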
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows from each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for an arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
if (nrow(x) <= 5) return(NULL)
x <- x[-(1:5),]
d.avg <- aggregate(cbind(O2, CO2) ~ 1, data=x, FUN=mean)
rbind(x, cbind(date=NA, switch=-1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe

How to generalize this algorithm (sign pattern match counter)?

I have this code in R :
corr = function(x, y) {
sx = sign(x)
sy = sign(y)
cond_a = sx == sy && sx > 0 && sy >0
cond_b = sx < sy && sx < 0 && sy >0
cond_c = sx > sy && sx > 0 && sy <0
cond_d = sx == sy && sx < 0 && sy < 0
cond_e = sx == 0 || sy == 0
if(cond_a) return('a')
else if(cond_b) return('b')
else if(cond_c) return('c')
else if(cond_d) return('d')
else if(cond_e) return('e')
}
Its role is to be used in conjunction with the mapply function in R in order to count all the possible sign patterns present in a time series. In this case the pattern has a length of 2 and all the possible tuples are: (+,+), (+,-), (-,+), (-,-).
I use the corr function this way :
> with(dt['AAPL'], table(mapply(corr, Return[-1], Return[-length(Return)])) /length(Return)*100)
a b c d e
24.6129416 25.4466058 25.4863041 24.0174672 0.3969829
> dt["AAPL",list(date, Return)]
symbol date Return
1: AAPL 2014-08-29 -0.3499903
2: AAPL 2014-08-28 0.6496702
3: AAPL 2014-08-27 1.0987923
4: AAPL 2014-08-26 -0.5235654
5: AAPL 2014-08-25 -0.2456037
I would like to generalize the corr function to n arguments. This means that for every n I would have to write down all the conditions corresponding to all the possible n-tuples. Currently the best thing I can think of is to generate the code string with a Python script and loops, but there must be a way to do this properly. Do you have an idea how I could generalize this tedious condition writing? Maybe I could use expand.grid, but how would I do the matching then?
I think you're better off using rollapply(...) in the zoo package for this. Since you seem to be using quantmod anyway (which loads xts and zoo), here is a solution that does not use all those nested if(...) statements.
library(quantmod)
AAPL <- getSymbols("AAPL",auto.assign=FALSE)
AAPL <- AAPL["2007-08::2009-03"] # AAPL during the crash...
Returns <- dailyReturn(AAPL)
get.patterns <- function(ret, n) {
  f <- function(x) {  # identifies which row of `patterns` matches sign(x)
    which(apply(patterns, 1, function(row) all(row == sign(x))))
  }
  returns  <- na.omit(ret)
  patterns <- expand.grid(rep(list(c(-1, 1)), n))
  labels   <- apply(patterns, 1, function(row) paste0("(", paste(row, collapse = ","), ")"))
  result   <- rollapply(returns, width = n, f, align = "left")
  data.frame(100 * table(labels[result]) / (length(returns) - (n - 1)))
}
get.patterns(Returns,n=2)
# Var1 Freq
# 1 (-1,-1) 22.67303
# 2 (-1,1) 26.49165
# 3 (1,-1) 26.73031
# 4 (1,1) 23.15036
get.patterns(Returns,n=3)
# Var1 Freq
# 1 (-1,-1,-1) 9.090909
# 2 (-1,-1,1) 13.397129
# 3 (-1,1,-1) 14.593301
# 4 (-1,1,1) 11.722488
# 5 (1,-1,-1) 13.636364
# 6 (1,-1,1) 13.157895
# 7 (1,1,-1) 12.200957
# 8 (1,1,1) 10.765550
The basic idea is to create a patterns matrix with 2^n rows and n columns, where each row represents one of the possible patterns (e.g. (1,1), (-1,1), etc.). Then pass the daily returns to this function n-wise using rollapply(...) and identify which row in patterns matches sign(x) exactly. Then use this vector of row numbers as an index into labels, which contains a character representation of the patterns, and finally use table(...) as you did.
This is general for an n-day pattern, but it ignores situations where any return is exactly zero, so the Freq column does not add up to 100. As you can see, this doesn't happen very often.
It's interesting that even during the crash it was (very slightly) more likely to have two up days in succession, than two down days. If you look at plot(Cl(AAPL)) during this period, you can see that it was a pretty wild ride.
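If zero returns mattered, one hedged tweak inside get.patterns() would be to enumerate the zero sign as well, at the cost of 3^n patterns:
patterns <- expand.grid(rep(list(c(-1, 0, 1)), n))  # e.g. 9 patterns for n = 2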
This is a little different approach, but it may give you what you're looking for, and it allows any size of n-tuple. The basic approach is to find the signs of the adjacent changes for each sequential set of n returns, convert the n-length sign changes into n-tuples of 1's and 0's (where 0 = negative return and 1 = positive return), and then calculate the decimal value of each n-tuple taken as a binary number. These numbers will clearly be different for each distinct n-tuple. Using a zoo time series for these calculations provides several useful functions, including get.hist.quote() to retrieve stock prices, diff() to calculate returns, and rollapply() to calculate the n-tuples and their sums. The code below does these calculations, converts the sums of the sign changes back to n-tuples of binary digits, and collects the results in a data frame.
library(zoo)
library(tseries)
n <- 3 # set size of n-tuple
#
# get stock prices and compute % returns
#
dtz <- get.hist.quote("AAPL","2014-01-01","2014-10-01", quote="Close")
dtz <- merge(dtz, (diff(dtz, arithmetic=FALSE ) - 1)*100)
names(dtz) <- c("prices","returns")
#
# calculate the sum of the sign changes
#
dtz <- merge(dtz, rollapply(data = (sign(dtz$returns) + 1)/2, width = n,
                            FUN = function(x, y) sum(x*y), y = 2^(0:(n-1)),
                            align = "right"))
dtz <- fortify.zoo(dtz)
names(dtz) <- c("date","prices","returns", "sum_sgn_chg")
#
# convert the sum of the sign changes back to an n-tuple of binary digits
#
for( i in 1:nrow(dtz) )  # 0:(n-1), not a hardcoded 0:2, so any n works
  dtz$sign_chg[i] <- paste(((as.numeric(dtz$sum_sgn_chg[i]) %/% (2^(0:(n-1)))) %% 2), collapse="")
#
# report first part of result
#
head(dtz, 10)
#
# report count of changes by month and type
#
table(format(dtz$date,"%Y %m"), dtz$sign_chg)
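The row-by-row loop above can also be written as a single vapply() call (a sketch that is equivalent for this data, NA rows included, which become "NANANA" just as in the loop):
dtz$sign_chg <- vapply(dtz$sum_sgn_chg, function(s)
  paste((as.numeric(s) %/% 2^(0:(n-1))) %% 2, collapse = ""), character(1))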
An example of possible output is a table showing the count of changes by type for each month.
000 001 010 011 100 101 110 111 NANANA
2014 01 1 3 3 2 3 2 2 2 3
2014 02 1 2 4 2 2 3 2 3 0
2014 03 2 3 0 4 4 1 4 3 0
2014 04 2 3 2 3 3 2 3 3 0
2014 05 2 2 1 3 1 2 3 7 0
2014 06 3 4 3 2 4 1 1 3 0
2014 07 2 1 2 4 2 5 5 1 0
2014 08 2 2 1 3 1 2 2 8 0
2014 09 0 4 2 3 4 2 4 2 0
2014 10 0 0 1 0 0 0 0 0 0
So this would show that in month 1, January 2014, there was one set of three days with 000, indicating 3 down returns; 3 sets with the 001 change, indicating two down returns followed by one positive return; and so forth. Most months seem to have a fairly random distribution, but May and August show 7 and 8 sets of 3 days of positive returns, reflecting the fact that these were strong months for AAPL.

Subset first time of two hour time intervals

I have a problem subsetting times.
1) I would like to filter my data by time intervals, where one interval is around midnight and the other around midday.
2) And I need only the first time that occurs in each interval.
Data frame looks like this
DATE v
1 2007-07-28 00:41:00 1
2 2007-07-28 02:00:12 5
3 2007-07-28 02:01:19 3
4 2007-07-28 02:44:08 2
5 2007-07-28 04:02:18 3
6 2007-07-28 09:59:16 4
7 2007-07-28 11:21:32 8
8 2007-07-28 11:58:40 5
9 2007-07-28 13:20:52 4
10 2007-07-28 13:21:52 9
11 2007-07-28 14:41:32 3
12 2007-07-28 15:19:00 9
13 2007-07-29 01:01:48 2
14 2007-07-29 01:41:08 5
Result should look like this
DATE v
2 2007-07-28 02:00:12 5
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2
Reproducible code
DATE<-c("2007-07-28 00:41:00", "2007-07-28 02:00:12","2007-07-28 02:01:19", "2007-07-28 02:44:08", "2007-07-28 04:02:18","2007-07-28 09:59:16", "2007-07-28 11:21:32", "2007-07-28 11:58:40","2007-07-28 13:20:52", "2007-07-28 13:21:52", "2007-07-28 14:41:32","2007-07-28 15:19:00", "2007-07-29 01:01:48", "2007-07-29 01:41:08")
v<-c(1,5,3,2,3,4,8,5,4,9,3,9,2,5)
hyljes <- data.frame(DATE, v)
df <-subset(hyljes, format(as.POSIXct(DATE),"%H") %in% c ("01":"02","13":"14"))
There's a problem with making the intervals: it lets me subset hours "13":"14" but not "01":"02". Is there a reasonable explanation for that?
And I haven't found a way to get only the first element from each interval.
Any help is appreciated!
Try
hyljes[c(1, head(cumsum(rle(as.POSIXlt(hyljes$DATE)$hour < 13)$lengths) + 1, -1)), ]
## DATE v
## 1 2007-07-28 00:41:00 1
## 9 2007-07-28 13:20:52 4
## 13 2007-07-29 01:01:48 2
as.POSIXlt(hyljes$DATE)$hour < 13 gives you whether each time is before or after noon
rle(...)$lengths gives you the lengths of the runs of TRUEs and FALSEs
cumsum of the above + 1 gives you the indices of the first record in each run
head(..., -1) trims off the last element
c(1, ...) adds back the first index, which should always be included by definition
There are lots of little manipulations in here, but the end result gets you where you need to be:
library(plyr)
hyljes <- [YOUR DATA]
hyljes$DATE <- as.POSIXct(hyljes$DATE, format = "%Y-%m-%d %H:%M:%S")
hyljes$hour <- strftime(hyljes$DATE, '%H')
hyljes$date <- strftime(hyljes$DATE, '%Y-%m-%d')
hyljes$am_pm <- ifelse(hyljes$hour < 12, 'am', 'pm')
mins <- ddply(hyljes, .(date, am_pm), summarise, min = min(DATE))$min
hyljes[hyljes[, 1] %in% mins, 1:2]
DATE v
1 2007-07-28 00:41:00 1
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2
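For completeness, here is a sketch of why the question's own subset failed, with a direct fix (assuming the intended intervals are hours 01-02 and 13-14): "01":"02" collapses to the numeric range 1:2, which never matches the zero-padded strings returned by format(..., "%H"), so compare integers (or zero-padded strings) instead:
hyljes$DATE <- as.POSIXct(hyljes$DATE)
hr <- as.integer(format(hyljes$DATE, "%H"))
d <- hyljes[hr %in% c(1, 2, 13, 14), ]
# first row per calendar day and per interval (hours 01-02 vs 13-14)
key <- paste(format(d$DATE, "%Y-%m-%d"), as.integer(format(d$DATE, "%H")) < 13)
d[!duplicated(key), ]
#                   DATE v
# 2  2007-07-28 02:00:12 5
# 9  2007-07-28 13:20:52 4
# 13 2007-07-29 01:01:48 2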

Row Differences in Dataframe by Group

My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.
set.seed(314)
df = data.frame("group_id"=rep(c(1,2,3),3),
"date"=sample(seq(as.Date("1970-01-01"),Sys.Date(),by=1),9,replace=F),
"logical_value"=sample(c(T,F),9,replace=T),
"integer"=sample(1:100,9,replace=T),
"float"=runif(9))
df = df[order(df$group_id,df$date),]
I ordered it by group_id and date so that the diff function can find the sequential differences, which results in time-ordered differences of the logical, integer, and float variables. I could easily do some sort of apply(df, 2, diff), but I need it by group_id; applied to the whole frame, it produces extra, unneeded differences across group boundaries.
df
group_id date logical_value integer float
1 1 1974-05-13 FALSE 4 0.03472876
4 1 1979-12-02 TRUE 45 0.24493995
7 1 1980-08-18 TRUE 2 0.46662253
5 2 1978-12-08 TRUE 56 0.60039164
2 2 1981-12-26 TRUE 34 0.20081799
8 2 1986-05-19 FALSE 60 0.43928929
6 3 1983-05-22 FALSE 25 0.01792820
9 3 1994-04-20 FALSE 34 0.10905326
3 3 2003-11-04 TRUE 63 0.58365922
So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user defined function:
create_differences = function(data_group){
apply(data_group, 2, diff)
}
But I get errors using the code:
diff_df = lapply(split(df,df$group_id),create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
by(df,df$group_id,create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
As a side note, the data is nice, no NAs, nulls, blanks, and every group_id has at least 2 rows associated with it.
Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff).
Using this edit, I get a list of data frames (one list entry per group).
Edit 2:
The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should be like:
diff_df
group_id date logical_value integer float
[1,] 1 2029 1 41 0.2102112
[2,] 1 260 0 -43 0.2216826
[1,] 2 1114 0 -22 -0.3995737
[2,] 2 1605 -1 26 0.2384713
[1,] 3 3986 0 9 0.09112507
[2,] 3 3485 1 29 0.47460596
Given that you have millions of rows, I think you can move to data.table, which is well suited to by-group operations.
library(data.table)
DT <- as.data.table(df)
## this will order per group and per day
setkeyv(DT,c('group_id','date'))
## for all column apply diff
DT[,lapply(.SD,diff),group_id]
# group_id date logical_value integer float
# 1: 1 2029 days 1 41 0.21021119
# 2: 1 260 days 0 -43 0.22168257
# 3: 2 1114 days 0 -22 -0.39957366
# 4: 2 1604 days -1 26 0.23847130
# 5: 3 3987 days 0 9 0.09112507
# 6: 3 3485 days 1 29 0.47460596
It certainly won't be as quick as data.table, but below is an only slightly ugly base solution using aggregate:
result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]
# group_id date logical_value integer float
#1 1 2029 1 41 0.21021119
#4 1 260 0 -43 0.22168257
#2 2 1114 0 -22 -0.39957366
#5 2 1604 -1 26 0.23847130
#3 3 3987 0 9 0.09112507
#6 3 3485 1 29 0.47460596
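For completeness, a hedged dplyr sketch of the same operation (assuming dplyr >= 1.1, where reframe() allows multi-row results per group):
library(dplyr)
df %>%
  arrange(group_id, date) %>%
  group_by(group_id) %>%
  reframe(across(everything(), ~ as.numeric(diff(.x))))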
