How to compute the average for specific days in R?

Suppose we have daily data for three years:
dat <- c(1:1096)
I want to compute the averages this way:
[x1 + x366 (the first day of the second year) + x731 (the first day of the third year)] / 3
[x2 + x367 (the second day of the second year) + x732 (the second day of the third year)] / 3
and so on, up to day 365:
[x365 + x730 (the last day of the second year) + x1096 (the last day of the third year)] / 3
so that I end up with 365 values. Any idea how to do this?

data.table comes in quite handy here (even though a base R solution is perfectly doable):
> set.seed(1)
> dat <- data.table(date=seq(as.Date("2010-01-01"), as.Date("2012-12-31"), "days"),
+ var=rnorm(1096))
> dat
date var
1: 2010-01-01 -0.626453811
2: 2010-01-02 0.183643324
3: 2010-01-03 -0.835628612
4: 2010-01-04 1.595280802
5: 2010-01-05 0.329507772
---
1092: 2012-12-27 0.711213964
1093: 2012-12-28 -0.337691156
1094: 2012-12-29 -0.009148952
1095: 2012-12-30 -0.125309208
1096: 2012-12-31 -2.090846097
> dat[, mean(var), by=list(month=month(date), mday(date))]
month mday V1
1: 1 1 -0.16755484
2: 1 2 0.59942582
3: 1 3 -0.44336168
4: 1 4 0.01297244
5: 1 5 -0.20317854
---
362: 12 28 -0.18076284
363: 12 29 0.07302903
364: 12 30 -0.01790655
365: 12 31 -0.87164859
366: 2 29 -0.78859794
The 29th of February ends up at the end because, when [.data.table formed the groups, that day was the last unique combination of month(date) and mday(date) encountered: it first appears in 2012. Once you have your result you can set the keys, which sorts the table:
> result <- dat[, mean(var), by=list(month=month(date), mday(date))]
> setkey(result, month, mday)
> result
month mday V1
1: 1 1 -0.16755484
2: 1 2 0.59942582
3: 1 3 -0.44336168
4: 1 4 0.01297244
5: 1 5 -0.20317854
---
362: 12 27 -0.60348463
363: 12 28 -0.18076284
364: 12 29 0.07302903
365: 12 30 -0.01790655
366: 12 31 -0.87164859
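As a small addition (assuming a reasonably recent data.table version), `keyby` groups and sets the key in one step, so the separate `setkey()` call can be folded into the grouping. A sketch on the same seeded data:

```r
library(data.table)

set.seed(1)
dat <- data.table(date = seq(as.Date("2010-01-01"), as.Date("2012-12-31"), "days"),
                  var  = rnorm(1096))

# keyby both groups and sorts by the grouping columns, so no setkey() is needed
result <- dat[, .(V1 = mean(var)), keyby = .(month = month(date), mday = mday(date))]
```

The result has 366 rows (365 days plus 29 February), already sorted by month and day.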

Here is a base solution that accounts for leap years:
# First your data
set.seed(1)
dat <- rnorm(1096) #Value for each day
day <- seq(as.Date("2010-01-01"), as.Date("2012-12-31"), "days") #Corresponding days
sapply(split(dat,format(day,"%m-%d")),mean)
01-01 01-02 01-03 01-04 01-05 01-06 01-07 01-08 01-09
-0.167554841 0.599425816 -0.443361675 0.012972442 -0.203178536 -0.553501370 0.563475994 -0.094459075 0.567263811
01-10 01-11 01-12 01-13 01-14 01-15 01-16 01-17 01-18
-0.325835336 -0.247226807 -0.272224241 0.171886332 -0.562604980 0.640473418 -0.209380261 0.709635402 -0.263715734
01-19 01-20 01-21 01-22 01-23 01-24 01-25 01-26 01-27
0.929096171 1.173422823 -0.197411808 -0.730959553 -0.277022971 -1.075673025 -0.494038031 -0.255709319 0.827062779
01-28 01-29 01-30 01-31 02-01 02-02 02-03 02-04 02-05
0.208963353 0.215192803 -0.118735162 0.141028516 0.703267761 -0.282852177 -0.297731589 -0.112031601 0.784073396
02-06 02-07 02-08 02-09 02-10 02-11 02-12 02-13 02-14
0.714499179 0.206640777 0.283234842 -0.255182989 -0.293285997 -0.761585755 0.443379228 1.138436815 -0.483004921
02-15 02-16 02-17 02-18 02-19 02-20 02-21 02-22 02-23
-0.692188333 0.701422889 0.677544133 -0.423576371 0.498868978 0.053960271 0.518228979 -0.250840385 -0.722647734
02-24 02-25 02-26 02-27 02-28 02-29 03-01 03-02 03-03
1.344507325 0.693403586 -0.226489715 -0.406929668 -0.171335064 -0.788597935 0.115894011 1.798749522 -0.502676829
...
12-26 12-27 12-28 12-29 12-30 12-31
0.203570612 -0.603484627 -0.180762839 0.073029026 -0.017906554 -0.871648586
The idea is to split the data by day of the year (format "%m-%d") and take the mean of each subgroup.
EDIT - Michele (I thought it was better to add this improvement to the base-R answer rather than to my own):
If the vector above is used to build a data.frame, then this is a good alternative:
dat <- data.frame(date=day, var=dat)
> library(plyr)
> ddply(dat, .(day=format(date,"%m-%d")), summarise, result=mean(var))
day result
1 01-01 -0.167554841
2 01-02 0.599425816
3 01-03 -0.443361675
4 01-04 0.012972442
5 01-05 -0.203178536
6 01-06 -0.553501370
NB: sorry, this actually uses the plyr package, but it still works on a data.frame, and ddply could be replaced by by() from base R.

Perhaps like this? I tried it out on a slightly smaller example than your 1:1096 vector - I used 5 values per year instead.
# the data, here 3 years with 5 values per year.
dat <- 1:15
# put your vector in a matrix
# by default, the matrix is filled column-wise
# thus, each column corresponds to a year, and each row to day of year
mm <- matrix(dat, ncol = 3)
# calculate row means
mm <- cbind(mm, rowMeans(mm))
mm
# [,1] [,2] [,3] [,4]
# [1,] 1 6 11 6
# [2,] 2 7 12 7
# [3,] 3 8 13 8
# [4,] 4 9 14 9
# [5,] 5 10 15 10
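One caveat worth flagging: with real 3-year calendar data that includes a leap year there are 1096 days, which is not a multiple of 3, so the column-wise fill would recycle (with a warning) and the rows would drift out of alignment after 29 February. The matrix trick is only safe when every year has the same length:

```r
dat <- 1:1096
length(dat) %% 3        # 1, so matrix(dat, ncol = 3) would recycle with a warning

# with equal-length years, the row means line up with the question's formula
dat365 <- 1:(3 * 365)
mm <- matrix(dat365, ncol = 3)   # column i = year i, row j = day j of the year
means <- rowMeans(mm)
means[1]                         # (1 + 366 + 731) / 3 = 366
```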
Update
Another base alternative that accounts for leap years, using the same (set.seed(1)) 'full' data from Michele's answer:
df2 <- aggregate(var ~ format(date, "%m-%d"), data = dat, FUN = mean)
head(df2)
# format(date, "%m-%d") var
# 1 01-01 -0.16755484
# 2 01-02 0.59942582
# 3 01-03 -0.44336168
# 4 01-04 0.01297244
# 5 01-05 -0.20317854
# 6 01-06 -0.55350137

How to merge observations with close dates in r

I have a database where the animals in a herd are tested every 6 months (the number of animals can change over time). The issue is that the animals in a herd are not all tested on the same day, but within a period of about 2 months.
I would like to know how I can create a new column that merges all these close dates (grouping by herd), so I can count the number of times a herd has been tested.
For example, one herd in my data has been tested 8 times, but on scattered dates.
Here is an example of the data:
df <- data.frame(
animal = c("Animal1", "Animal2", "Animal3", "Animal4", "Animal5", "Animal6", "Animal1", "Animal2", "Animal3", "Animal4", "Animal5", "Animal6", "Animal7", "Animal8", "Animal9", "Animal10", "Animal11", "Animal12", "Animal7", "Animal8", "Animal9", "Animal10", "Animal11", "Animal12"),
herd = c("Herd1","Herd1","Herd1", "Herd1","Herd1","Herd1", "Herd1","Herd1","Herd1", "Herd1","Herd1","Herd1","Herd2","Herd2", "Herd2","Herd2","Herd2","Herd2", "Herd2","Herd2", "Herd2","Herd2","Herd2","Herd2"),
date = c("2017-01-01", "2017-01-01", "2017-01-17","2017-02-04", "2017-02-04", "2017-02-05", "2017-06-01" , "2017-06-03", "2017-07-01", "2017-06-21", "2017-06-01", "2017-06-15", "2017-02-01", "2017-02-01", "2017-02-15", "2017-02-21", "2017-03-05", "2017-03-01", "2017-07-01", "2017-07-01", "2017-07-15", "2017-07-21", "2017-08-05", "2017-08-01"))
So the desired outcome will be:
animal herd date testing
1 Animal1 Herd1 2017-01-01 1
2 Animal2 Herd1 2017-01-01 1
3 Animal3 Herd1 2017-01-17 1
4 Animal4 Herd1 2017-02-04 1
5 Animal5 Herd1 2017-02-04 1
6 Animal6 Herd1 2017-02-05 1
7 Animal1 Herd1 2017-06-01 2
8 Animal2 Herd1 2017-06-03 2
9 Animal3 Herd1 2017-07-01 2
10 Animal4 Herd1 2017-06-21 2
11 Animal5 Herd1 2017-06-01 2
12 Animal6 Herd1 2017-06-15 2
13 Animal7 Herd2 2017-02-01 1
14 Animal8 Herd2 2017-02-01 1
15 Animal9 Herd2 2017-02-15 1
16 Animal10 Herd2 2017-02-21 1
17 Animal11 Herd2 2017-03-05 1
18 Animal12 Herd2 2017-03-01 1
19 Animal7 Herd2 2017-07-01 2
20 Animal8 Herd2 2017-07-01 2
21 Animal9 Herd2 2017-07-15 2
22 Animal10 Herd2 2017-07-21 2
23 Animal11 Herd2 2017-08-05 2
24 Animal12 Herd2 2017-08-01 2
I would like to apply something like this but considering that dates closed to each other are the same testing
df %>%
group_by(herd) %>%
mutate(testing = dense_rank(date))
Thanks!
You can floor each date to a 5-month window and apply dense_rank. Since the smallest gap between two testing rounds of the same animal is about 5 months, the unit has to be 5 months.
library(dplyr)
library(lubridate)
df %>%
group_by(testing = dense_rank(floor_date(ymd(date), unit = "5 months")))
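If the fixed 5-month bin feels fragile (it breaks as soon as two testing rounds fall closer together, or a round straddles a bin edge), a gap-based rule is an alternative: within each herd, sort the dates and start a new testing number whenever the jump from the previous date exceeds some cutoff. A base-R sketch of that idea, where the 60-day cutoff is an assumption you would tune to your data (shown on a small stand-in frame, not the full example above):

```r
# small stand-in: two herds, two testing rounds each
df <- data.frame(
  herd = c("H1", "H1", "H1", "H1", "H2", "H2", "H2", "H2"),
  date = as.Date(c("2017-01-01", "2017-02-04", "2017-06-01", "2017-07-01",
                   "2017-02-01", "2017-03-05", "2017-07-01", "2017-08-05"))
)

new_testing <- function(d, gap = 60) {
  o <- order(d)
  grp <- cumsum(c(TRUE, diff(d[o]) > gap))  # new round when the gap exceeds `gap` days
  grp[order(o)]                             # map back to the original row order
}

df$testing <- unsplit(lapply(split(df$date, df$herd), new_testing), df$herd)
```

This numbers rounds per herd without assuming anything about how far apart the rounds are, only about how spread out a single round can be.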

read.csv in R reading dates differently

I have two very similar csv files. Stock prices for 2 different stocks downloaded from the same source in the same format. However, read.csv in R is reading them differently.
> tab1=read.csv(path1)
> tab2=read.csv(path2)
> head(tab1)
Date Open High Low Close Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
> head(tab2)
Date Open High Low Close Volume Adj.Close
1 12/1/2014 73.39 75.20 71.75 72.29 1561400 71.92211
2 11/3/2014 69.28 74.92 67.88 73.74 1421600 72.97650
3 10/1/2014 66.18 74.95 63.42 69.21 1775400 68.49341
4 9/2/2014 68.34 68.57 65.49 66.32 1249200 65.63333
5 8/1/2014 67.45 68.99 65.88 68.26 1655400 67.20743
6 7/1/2014 64.07 69.50 63.09 67.46 1733600 66.41976
If I try to use colClasses in read.csv then the dates for the second table are read incorrectly.
> tab1=read.csv(path1,colClasses=c("Date",rep("numeric",6)))
> tab2=read.csv(path2,colClasses=c("Date",rep("numeric",6)))
> head(tab1)
Date Open High Low Close Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
> head(tab2)
Date Open High Low Close Volume Adj.Close
1 0012-01-20 73.39 75.20 71.75 72.29 1561400 71.92211
2 0011-03-20 69.28 74.92 67.88 73.74 1421600 72.97650
3 0010-01-20 66.18 74.95 63.42 69.21 1775400 68.49341
4 0009-02-20 68.34 68.57 65.49 66.32 1249200 65.63333
5 0008-01-20 67.45 68.99 65.88 68.26 1655400 67.20743
6 0007-01-20 64.07 69.50 63.09 67.46 1733600 66.41976
I'm not sure how to make this issue reproducible without attaching the .csv files. Any help would be appreciated. Thanks!
This can be solved by reading in the dates as a character vector and then calling strptime() inside transform():
transform(read.csv(path2,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y')));
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:
readStockData <- function(path) {
tab <- read.csv(path,colClasses=c('character',rep('numeric',6)));
tab$Date <- as.Date(tab$Date,if (grepl('^\\d+/\\d+/\\d+$',tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
tab;
};
readStockData(path1);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
readStockData(path2);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]) can be used for the detection.
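As a related note (assuming R >= 3.5.0), as.Date() can try several formats itself via its tryFormats argument. Like the hand-rolled detection above, it decides based on the first parseable element, so the same caveats about ambiguous inputs apply:

```r
# a variant of readStockData() that leans on as.Date()'s built-in format trials
readStockData2 <- function(path) {
  tab <- read.csv(path, colClasses = c("character", rep("numeric", 6)))
  tab$Date <- as.Date(tab$Date, tryFormats = c("%Y-%m-%d", "%m/%d/%Y"))
  tab
}

# the detection in isolation:
as.Date("12/1/2014", tryFormats = c("%Y-%m-%d", "%m/%d/%Y"))  # "2014-12-01"
```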

create a new column based on the subtraction results from two columns

I have two large data sets like these:
df1=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'))
df2=data.frame(subject = c(rep(1, 10), rep(2, 10)), day=c(1,1,2,2,3,3,9,9,15,15,1,1,2,2,3,3,9,9,15,15),dtime=c('4/16/2012 6:15','4/16/2012 15:16','4/18/2012 7:15','4/18/2012 21:45','4/19/2012 7:05','4/19/2012 23:17','4/28/2012 7:15','4/28/2012 21:12','5/1/2012 7:15','5/1/2012 15:15','4/23/2012 6:45','4/23/2012 16:45','4/25/2012 6:45','4/25/2012 21:30','4/26/2012 6:45','4/26/2012 22:00','5/2/2012 7:00','5/2/2012 22:00','5/8/2012 6:45','5/8/2012 15:45'))
...
In df2, dtime contains two time points for each subject on each day. For each stime in df1 (per subject and day), I want to subtract the second time point in df2: if the result is positive, assign the second time point to that observation; otherwise assign the first. For example, for subject 1 on day 1: '4/16/2012 6:25' - '4/16/2012 15:16' < 0, so we assign the first time point, '4/16/2012 6:15'; '4/16/2012 17:22' - '4/16/2012 15:16' > 0, so we assign the second time point, '4/16/2012 15:16'. The expected output should look like this:
df3=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'), dtime=c('4/16/2012 6:15','4/16/2012 6:15','4/16/2012 15:16','4/16/2012 15:16','4/16/2012 15:16','4/18/2012 7:15','4/19/2012 7:05','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 15:15','5/1/2012 15:15','.','4/23/2012 6:45','4/23/2012 6:45','4/23/2012 16:45','4/23/2012 16:45','4/25/2012 6:45','4/26/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 15:45','5/8/2012 15:45'))
...
I used the code below to achieve this; however, because dtime is missing for day 19, R kept giving me this error:
df1$dtime <- apply(df1, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[2],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
Error in if (as.POSIXct(x["stime"], format = "%m/%d/%Y %H:%M") < as.POSIXct(choices[2], : missing value where TRUE/FALSE needed
Does anyone have idea how to solve this problem?
To start, I read the two data frames in to try things out. Here is what I am thinking in terms of a pseudo-code approach (I will leave you to finish the code). df1, once read in, looks like the following:
subject day stime
1 1 1 4/16/2012 6:25
2 1 1 4/16/2012 7:01
3 1 1 4/16/2012 17:22
4 1 1 4/16/2012 17:45
5 1 1 4/16/2012 18:13
6 1 2 4/18/2012 6:50
7 1 3 4/19/2012 6:55
8 1 15 5/1/2012 6:28
9 1 15 5/1/2012 7:00
10 1 15 5/1/2012 16:28
11 1 15 5/1/2012 17:00
12 2 1 4/23/2012 5:56
13 2 1 4/23/2012 6:30
14 2 1 4/23/2012 16:55
15 2 1 4/23/2012 17:20
16 2 2 4/25/2012 6:32
17 2 3 4/26/2012 6:28
18 2 15 5/8/2012 5:54
19 2 15 5/8/2012 6:30
20 2 15 5/8/2012 15:55
21 2 15 5/8/2012 16:30
Why not try the following:
First, write a simple loop that lets you step through each of the values in the stime column of df1 and the dtime column of df2. To make this easy, you could convert df1 and df2 into matrices (using as.matrix(), which is my preference).
Grab the first value in row 1, column 3 of df1, which is 4/16/2012 6:25, pull out the 6:25, and store it in a temporary variable; let's call it a.
Do the same for df2, grabbing the value from the relevant position, and store it in another temporary variable; let's call it b.
Subtract the two temporary variables (you may need some code to get the two parts into a form where a - b gives a numerical answer; I will leave that to you).
Check whether the answer is positive or negative with a simple conditional if statement.
Take the value of a or b depending on the outcome of the conditional check.
Add this new value to a new data frame, with the appropriate subject and day; you have called this df3.
I'm getting different answers than you. First I made a copy of df1 to work with:
df4 <- df1
df4$dtime <- apply(df4, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[1],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
#----------------------------------------------
subject day stime dtime
1 1 1 4/16/2012 6:25 4/16/2012 15:16
2 1 1 4/16/2012 7:01 4/16/2012 15:16
3 1 1 4/16/2012 17:22 4/16/2012 15:16
4 1 1 4/16/2012 17:45 4/16/2012 15:16
5 1 1 4/16/2012 18:13 4/16/2012 15:16
6 1 2 4/18/2012 6:50 4/18/2012 7:15
7 1 3 4/19/2012 6:55 4/19/2012 7:05
8 1 15 5/1/2012 6:28 5/1/2012 7:15
9 1 15 5/1/2012 7:00 5/1/2012 7:15
10 1 15 5/1/2012 16:28 5/1/2012 15:15
11 1 15 5/1/2012 17:00 5/1/2012 15:15
12 2 1 4/23/2012 5:56 4/23/2012 6:45
13 2 1 4/23/2012 6:30 4/23/2012 6:45
14 2 1 4/23/2012 16:55 4/23/2012 16:45
15 2 1 4/23/2012 17:20 4/23/2012 16:45
16 2 2 4/25/2012 6:32 4/25/2012 6:45
17 2 3 4/26/2012 6:28 4/26/2012 6:45
18 2 15 5/8/2012 5:54 5/8/2012 6:45
19 2 15 5/8/2012 6:30 5/8/2012 6:45
20 2 15 5/8/2012 15:55 5/8/2012 15:45
21 2 15 5/8/2012 16:30 5/8/2012 15:45
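For what it's worth, the row-wise apply() can be avoided entirely. Since df2 has exactly two dtime values per subject/day, you can spread them into two columns, merge onto df1 with all.x = TRUE (so day 19, which has no match, becomes NA instead of raising an error), and choose with ifelse(). A hedged sketch on a cut-down stand-in of the question's frames:

```r
fmt <- "%m/%d/%Y %H:%M"

# stand-in data: subject 1, day 1 (two candidate times) and day 19 (no match in df2)
df1 <- data.frame(subject = 1, day = c(1, 1, 19),
                  stime = c("4/16/2012 6:25", "4/16/2012 17:22", "5/5/2012 17:00"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(subject = 1, day = 1,
                  dtime = c("4/16/2012 6:15", "4/16/2012 15:16"),
                  stringsAsFactors = FALSE)

# one row per subject/day, with the two candidate times side by side
wide <- do.call(rbind, lapply(split(df2, list(df2$subject, df2$day), drop = TRUE),
  function(g) data.frame(subject = g$subject[1], day = g$day[1],
                         d1 = g$dtime[1], d2 = g$dtime[2],
                         stringsAsFactors = FALSE)))

m <- merge(df1, wide, by = c("subject", "day"), all.x = TRUE)
m$dtime <- ifelse(as.POSIXct(m$stime, format = fmt) < as.POSIXct(m$d2, format = fmt),
                  m$d1, m$d2)   # an NA comparison yields NA rather than an error
```

Day 19 ends up with NA in dtime, which you could then replace with '.' if you really want that placeholder.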

ggplot2 for intra-day financial data

I have intra-day trade data which I'm trying to plot using ggplot.
For a given day, the data looks like this (for example):
head(q1)
time last bid ask volume center
1 2014-03-19 09:30:00.480 63.74 63.39 63.74 200 11
2 2014-03-19 09:30:00.645 63.41 63.41 63.60 100 11
3 2014-03-19 09:30:00.645 63.48 63.41 63.60 100 11
4 2014-03-19 09:30:02.792 63.59 63.44 63.60 100 11
5 2014-03-19 09:30:03.023 63.74 63.44 63.75 100 12
6 2014-03-19 09:30:12.987 63.72 63.44 63.76 100 11
tail(q1)
time last bid ask volume center
2116 2014-03-19 15:59:56.266 61.68 61.67 61.74 168 12
2117 2014-03-19 15:59:58.515 61.68 61.68 61.73 100 28
2118 2014-03-19 15:59:59.109 61.69 61.68 61.73 500 11
2119 2014-03-19 16:00:00.411 61.72 61.69 61.73 100 11
2120 2014-03-19 16:00:00.411 61.72 61.69 61.73 200 11
2121 2014-03-19 16:00:00.411 61.72 61.69 61.73 351 11
It's easy to use ggplot to visualize a single day of data; where I'm having trouble is linking multiple days together on the same plot. If I have two consecutive days in the data frames q1 and q2, how can I plot them on a single plot without the time gap while the market is closed, and without lines linking the end of one day to the start of the next?
You could try creating a new transformation that transforms day time to a seamless scale of trading time:
9:30 -> 0
12:00 -> approx. 0.5
16:00 -> 1
9:30 next day -> 1
Something along the following lines could do (but I haven't tried myself):
library(scales)
trading_day_trans <- function() {
trans_new("trading_day", trans, inv,
pretty_breaks(), domain = domain)
}
ggplot(rbind.fill(q1, q2)) + ... + coord_trans(xtrans = "trading_day")
You need to provide trans (the transformation function, time -> linear), inv (the inverse transform, linear -> time) and domain (a time vector of length 2, min-max).
Adapted from http://blog.ggplot2.org/post/25938265813/defining-a-new-transformation-for-ggplot2-scales .
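Another common workaround, if a custom coord transformation feels heavyweight, is to plot against the observation index instead of the timestamp and relabel a few breaks with their times; the overnight gap disappears because closed hours simply have no observations. A sketch on made-up stand-in data (the mk() helper is purely illustrative; replace it with your real q1/q2 stacked and sorted by time):

```r
library(ggplot2)

# stand-in for q1/q2: two sessions of fake trades between 9:30 and 16:00
mk <- function(d, n = 50) data.frame(
  time = as.POSIXct(d) + sort(runif(n, 9.5 * 3600, 16 * 3600)),
  last = 63 + cumsum(rnorm(n)))
set.seed(1)
q <- rbind(mk("2014-03-19"), mk("2014-03-20"))

q$idx <- seq_len(nrow(q))                      # trade number, not clock time
brk   <- round(seq(1, nrow(q), length.out = 6))
p <- ggplot(q, aes(idx, last)) +
  geom_line() +
  scale_x_continuous(breaks = brk,
                     labels = format(q$time[brk], "%m-%d %H:%M"))
```

The x axis is then evenly spaced in trades rather than in time, which is often acceptable (and sometimes preferable) for intra-day charts.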

How can I filter specifically for certain months if the days are not the same in each year?

This is probably a very simple question that has been asked already, but...
I have a data frame constructed from a CSV file generated in Excel. The observations are not homogeneously sampled, i.e. they cover only the "On Peak" times of electricity usage, which means different days are excluded each year. I have 20 years of data (1993-2012) and am running both non-robust and robust LOESS to extract the seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <-
data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by="month"),
MaxLoad = floor(runif(25,2000,3000)), TMAX=runif(25,100,200))
## One option
df[months(df$Date) %in% month.name[6:9],]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more _generally_ useful than months()
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9,]
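A third base variant avoids string formatting entirely by reading the month slot of POSIXlt; note that mon there is 0-based (January is 0), hence the + 1. On the same reproducible frame as above:

```r
df <- data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by = "month"),
                 MaxLoad = floor(runif(25, 2000, 3000)), TMAX = runif(25, 100, 200))

summer <- df[(as.POSIXlt(df$Date)$mon + 1) %in% 6:9, ]  # June through September
```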
