I have a data set (x) that looks like this:
DATE WEEKDAY A B C D
2011-02-04 Friday 113 67 109 72
2011-02-05 Saturday 1 0 0 1
2011-02-06 Sunday 9 5 0 0
2011-02-07 Monday 154 48 85 60
str(x):
'data.frame': 4 obs. of 6 variables:
$ DATE : Date, format: "2011-02-04" "2011-02-05" "2011-02-06" "2011-02-07"
$ WEEKDAY: Factor w/ 7 levels "Friday","Monday",..: 1 3 4 2
$ A : num 113 1 9 154
$ B : num 67 0 5 48
$ C : num 109 0 0 85
$ D : num 72 1 0 60
Tuesday - Saturday values don't change, but I want Sunday to be the sum of Saturday and Sunday and Monday to be the sum of Saturday, Sunday, and Monday.
I tried shifting Saturday's and Sunday's dates to date + 2 and date + 1 respectively, then aggregating by date, but I lose the weekend records.
For my example, the correct results would be the following:
DATE WEEKDAY A B C D
2011-02-04 Friday 113 67 109 72
2011-02-05 Saturday 1 0 0 1
2011-02-06 Sunday 10 5 0 1
2011-02-07 Monday 164 53 85 61
How can I roll up weekend values into the next day?
Three weeks' worth of data:
DATE WEEKDAY A B C D
1 2011-01-02 Sunday 2 1 0 0
2 2011-01-03 Monday 153 51 7 1
3 2011-01-04 Tuesday 182 103 13 5
4 2011-01-05 Wednesday 192 102 14 12
5 2011-01-06 Thursday 160 67 50 20
6 2011-01-07 Friday 154 96 50 39
7 2011-01-09 Sunday 0 0 0 1
8 2011-01-10 Monday 195 94 48 39
9 2011-01-11 Tuesday 206 72 71 38
10 2011-01-12 Wednesday 232 94 96 52
11 2011-01-13 Thursday 178 113 93 52
12 2011-01-14 Friday 173 97 68 56
13 2011-01-15 Saturday 2 0 1 0
14 2011-01-17 Monday 170 91 66 52
15 2011-01-18 Tuesday 176 76 70 78
16 2011-01-19 Wednesday 164 159 117 37
17 2011-01-20 Thursday 198 87 95 111
18 2011-01-21 Friday 213 86 89 90
19 2011-01-24 Monday 195 73 102 52
20 2011-01-25 Tuesday 193 108 116 70
21 2011-01-26 Wednesday 193 102 118 63
Since you've provided only a small data set, I haven't been able to test this on larger data, but the idea is something like this. I'll use data.table, as I find it can be very efficient here.
The code:
require(data.table)
my_days <- c("Saturday", "Sunday", "Monday")
dt <- data.table(df)
dt[, `:=`(DATE = as.Date(DATE))]
setkey(dt, "DATE")
dt[WEEKDAY %in% my_days, `:=`(A = cumsum(A), B = cumsum(B),
C = cumsum(C), D = cumsum(D)), by = format(DATE-1, "%W")]
The idea:
First, convert the DATE column to an actual Date type using as.Date (line 4).
Second, ensure that the rows are sorted by DATE by setting the key column of dt to DATE (line 5).
Now, the last statement (line 6) is where all the magic happens and is the trickiest:
The first part of the expression, WEEKDAY %in% my_days, subsets the data.table dt to only the rows where the day is Saturday, Sunday or Monday.
The last part, by = format(DATE-1, "%W"), groups the rows by the week they belong to. "%W" counts weeks starting on Monday, so a Monday would fall into the next week; subtracting 1 from the date before taking the week number shifts it back, so that Tuesday through Monday share the same week.
The expression in the middle, `:=`(A = ..., D = ...), computes the cumsum and replaces just those values within each group by reference.
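To check the idea against the four-row example from the question, here is a minimal sketch. The names x, roll_days and value_cols are mine, not from the original code, and it uses lapply(.SD, cumsum) with .SDcols instead of spelling out each column, which should be equivalent here.

```r
library(data.table)

# Four-row example from the question (DATE as Date, values as numeric)
x <- data.table(
  DATE    = as.Date(c("2011-02-04", "2011-02-05", "2011-02-06", "2011-02-07")),
  WEEKDAY = c("Friday", "Saturday", "Sunday", "Monday"),
  A = c(113, 1, 9, 154),
  B = c(67, 0, 5, 48),
  C = c(109, 0, 0, 85),
  D = c(72, 1, 0, 60)
)
setkey(x, DATE)  # sort by date so cumsum runs Saturday -> Sunday -> Monday

roll_days  <- c("Saturday", "Sunday", "Monday")  # assumed name
value_cols <- c("A", "B", "C", "D")              # assumed name

# DATE - 1 moves each Monday into the same "%W" week as the weekend before it
x[WEEKDAY %in% roll_days,
  (value_cols) := lapply(.SD, cumsum),
  by = format(DATE - 1, "%W"),
  .SDcols = value_cols]

print(x)
```

Friday is left alone, Sunday becomes Saturday plus Sunday, and Monday becomes the sum of all three days. One thing to watch: the week number alone repeats every year, so for data spanning several years the grouping key would also need to identify the year (for example the ISO year-week format "%G-%V", where the platform supports it).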
For the new data you've posted, I get this as the result. Let me know if it's not what you seek.
# DATE WEEKDAY A B C D
# 1: 2011-01-02 Sunday 2 1 0 0
# 2: 2011-01-03 Monday 155 52 7 1
# 3: 2011-01-04 Tuesday 182 103 13 5
# 4: 2011-01-05 Wednesday 192 102 14 12
# 5: 2011-01-06 Thursday 160 67 50 20
# 6: 2011-01-07 Friday 154 96 50 39
# 7: 2011-01-09 Sunday 0 0 0 1
# 8: 2011-01-10 Monday 195 94 48 40
# 9: 2011-01-11 Tuesday 206 72 71 38
# 10: 2011-01-12 Wednesday 232 94 96 52
# 11: 2011-01-13 Thursday 178 113 93 52
# 12: 2011-01-14 Friday 173 97 68 56
# 13: 2011-01-15 Saturday 2 0 1 0
# 14: 2011-01-17 Monday 172 91 67 52
# 15: 2011-01-18 Tuesday 176 76 70 78
# 16: 2011-01-19 Wednesday 164 159 117 37
# 17: 2011-01-20 Thursday 198 87 95 111
# 18: 2011-01-21 Friday 213 86 89 90
# 19: 2011-01-24 Monday 195 73 102 52
# 20: 2011-01-25 Tuesday 193 108 116 70
# 21: 2011-01-26 Wednesday 193 102 118 63
# DATE WEEKDAY A B C D
I have downloaded a table of stream diversion data ("df_downloaded"). The column names of this table are primarily taken from the ID numbers of the gauging stations.
I want to conditionally replace the ID numbers that have been used for column names with the station names, which will make the data more readable when I'm sharing the results. I created a table ("stationIDs") with the ID numbers and station names to use as a reference for changing the column names of "df_downloaded".
I can replace the column names individually, but I want to write a loop of some kind that will go through all of the columns of "df_downloaded" and change the names of the columns referenced in the data frame "stationIDs".
An example of what I'm trying to do is below.
Downloaded Data ("df_downloaded")
A portion of the downloaded data is similar to this:
df_downloaded <- data.frame(Var1 = seq(as.Date("2012-01-01"),as.Date("2012-12-01"), by="month"),
Var2 = sample(50:150,12, replace =TRUE),
Var3 = sample(10:100,12, replace =TRUE),
Var4 = sample(15:45,12, replace =TRUE),
Var5 = sample(50:200,12, replace =TRUE),
Var6 = sample(15:100,12, replace =TRUE),
Var7 = c(rep(0,3),rep(13,6),rep(0,3)),
Var8 = rep(5,12))
colnames(df_downloaded) <- c("Diversion.Date","360410059","360410060",
"360410209","361000655","361000656","Irrigation","Seep")
df_downloaded # not run
#
# Diversion.Date 360410059 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
# 7 2012-07-01 86 77 20 130 63 13 5
# 8 2012-08-01 118 29 27 118 57 13 5
# 9 2012-09-01 142 18 45 116 27 13 5
# 10 2012-10-01 74 68 34 182 79 0 5
# 11 2012-11-01 106 48 27 95 74 0 5
# 12 2012-12-01 91 41 20 179 55 0 5
Reference Table ("stationIDs")
stationIDs <- data.frame(ID = c("360410059", "360410060", "360410209", "361000655", "361000656"),
Names = c("RimView", "IPCO", "WMA.Ditch", "RV.Bypass", "LowerFalls"))
stationIDs # not run
#
# ID Names
# 1 360410059 RimView
# 2 360410060 IPCO
# 3 360410209 WMA.Ditch
# 4 361000655 RV.Bypass
# 5 361000656 LowerFalls
I can replace the column names in "df_downloaded" using individual statements. I show the first three iterations below.
After three iterations "RimView", "IPCO", and "WMA.Ditch" have replaced their respective gauge ID numbers.
names(df_downloaded) <- gsub(stationIDs$ID[1],stationIDs$Name[1],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView 360410060 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[2],stationIDs$Name[2],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO 360410209 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
names(df_downloaded) <- gsub(stationIDs$ID[3],stationIDs$Name[3],names(df_downloaded))
# head(df_downloaded)
# Diversion.Date RimView IPCO WMA.Ditch 361000655 361000656 Irrigation Seep
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
If I try to do the renaming using a for loop, I end up with NAs for column names.
for(i in seq_along(names(df_downloaded))){
names(df_downloaded) <- gsub(stationIDs$ID[i],stationIDs$Name[i],names(df_downloaded))
}
# head(df_downloaded)
# NA NA NA NA NA NA NA NA
# 1 2012-01-01 93 57 28 101 16 0 5
# 2 2012-02-01 102 68 19 124 98 0 5
# 3 2012-03-01 124 93 36 109 56 0 5
# 4 2012-04-01 94 96 23 54 87 13 5
# 5 2012-05-01 83 70 43 119 15 13 5
# 6 2012-06-01 78 63 45 195 15 13 5
I really want to be able to change the names with a for loop or something similar, because the number of stations that I download data from changes depending on the years that I am analyzing.
Thanks for taking time to look at my question.
We can use match
#Convert factor columns to character
stationIDs[] <- lapply(stationIDs, as.character)
#Match names of df_downloaded with stationIDs$ID
inds <- match(names(df_downloaded), stationIDs$ID)
#Replace the matched name with corresponding Names from stationIDs
names(df_downloaded)[which(!is.na(inds))] <- stationIDs$Names[inds[!is.na(inds)]]
df_downloaded
# Diversion.Date RimView IPCO WMA.Ditch RV.Bypass LowerFalls Irrigation Seep
#1 2012-01-01 142 14 41 200 79 0 5
#2 2012-02-01 97 100 35 176 22 0 5
#3 2012-03-01 85 59 26 88 71 0 5
#4 2012-04-01 68 49 34 63 15 13 5
#5 2012-05-01 62 58 44 87 16 13 5
#6 2012-06-01 70 59 33 145 87 13 5
#7 2012-07-01 112 65 25 52 64 13 5
#8 2012-08-01 75 12 27 103 19 13 5
#9 2012-09-01 73 65 36 172 68 13 5
#10 2012-10-01 87 35 27 146 42 0 5
#11 2012-11-01 122 17 33 183 32 0 5
#12 2012-12-01 108 65 15 120 99 0 5
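As for why the loop in the question ends up with NA names: it iterates over the eight column names, but stationIDs has only five rows, so from the sixth pass on the pattern handed to gsub is NA and every name becomes NA. If a loop is preferred, a minimal sketch is to iterate over the rows of the reference table instead. The small data frame below is a hypothetical stand-in, and exact comparison is used in place of gsub so that an ID can never match part of a longer name.

```r
# Hypothetical cut-down versions of the two tables
df_downloaded <- data.frame(matrix(0, nrow = 2, ncol = 4))
names(df_downloaded) <- c("Diversion.Date", "360410059", "360410060", "Seep")

stationIDs <- data.frame(ID    = c("360410059", "360410060", "360410209"),
                         Names = c("RimView", "IPCO", "WMA.Ditch"),
                         stringsAsFactors = FALSE)

# Loop over the rows of the reference table, not the columns of the data
for (i in seq_len(nrow(stationIDs))) {
  hit <- names(df_downloaded) == stationIDs$ID[i]  # exact match, no regex
  names(df_downloaded)[hit] <- stationIDs$Names[i] # no-op when nothing matches
}

print(names(df_downloaded))
```

IDs in stationIDs that do not occur in the download (360410209 here) are simply skipped, and columns without a reference entry keep their names.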
You can do this with dplyr and tidyr. The idea is to make the data long so that the IDs are in a column, join that column against your reference table of IDs and names, and then make the data wide again.
library(dplyr)
library(tidyr)
df_downloaded %>%
gather(ID, value, -Diversion.Date, -Irrigation, -Seep) %>% # wide to long: one row per date and station ID
left_join(stationIDs, by = "ID") %>% # attach the station name for each ID
dplyr::select(-ID) %>%
spread(Names, value) # long back to wide, with station names as columns
I have two data tables. The first table holds coordinates and precipitation. It consists of four columns: latitude, longitude, precipitation and day of monitoring. An example of the table:
latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3
The second table consists of 5 columns: station number, latitude, longitude, day when monitoring began, and day when monitoring ended. It looks like:
station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93
I want to sum the precipitation from the first table, subject to two conditions:
day_begin < day_mon < day_end. The day of monitoring (day_mon in the first table) must be after day_begin and before day_end (from the second table).
Each point's precipitation is added to the station nearest to it: the distance between the monitoring point (longitude_1, latitude_1) and the station (longitude_2, latitude_2) must be the minimum. The distance is calculated by the formula:
R = 6400*arccos(sin(latitude_1)*sin(latitude_2)+cos(latitude_1)*cos(latitude_2)*cos(longitude_1-longitude_2))
Finally, I want to get the results as a table:
station latitude_2 longitude_2 day_begin day_end Sum
15 50 93.22 34 46 188
11 86.58 85.29 15 47 100
14 93.17 63.17 31 97 116
10 88.56 61.28 15 78 182
13 45.29 77.1 24 79 136
6 69.73 99.52 13 73 126
4 45.6 77.36 28 95 108
13 92.88 62.38 9 51 192
1 65.1 64.13 7 69 125
10 60.57 86.77 34 64 172
3 53.62 60.76 23 96 193
16 87.82 59.41 38 47 183
1 47.83 95.89 21 52 104
11 75.42 46.2 38 87 151
3 55.71 55.26 2 73 111
16 71.65 96.15 36 93 146
I know how to calculate it in C++. What function should I use in R?
Thank you for your help!
I'm not sure if I solved your problem correctly... but here it comes..
I used a data.table approach.
library( tidyverse )
library( data.table )
#step 1. join days as periods
#create dummy variables that turn each monitoring day in dt1 into a one-day period
dt1[, point_id := .I]
dt1[, day_begin := day_mon]
dt1[, day_end := day_mon]
setkey(dt2, day_begin, day_end)
#overlap join finding all stations for each point that overlap periods
dt <- foverlaps( dt1, dt2, type = "within" )
#step 2. calculate the distance to the station for each point, based on the TS-provided formula
dt[, distance := 6400 * acos( sin( latitude_1 ) * sin( latitude_2 ) + cos( latitude_1 ) * cos( latitude_2 ) ) * cos( longitude_1 - longitude_2 ) ]
#step 3. filter (absolute) minimal distance based on point_id
dt[ , .SD[which.min( abs( distance ) )], by = point_id ]
# point_id station latitude_2 longitude_2 day_begin day_end latitude_1 longitude_1 precipitation day_mon i.day_begin i.day_end distance
# 1: 1 1 47.83 95.89 21 52 54.17 62.15 5 34 34 34 -248.72398
# 2: 2 6 69.73 99.52 13 73 69.61 48.65 3 62 62 62 631.89228
# 3: 3 14 93.17 63.17 31 97 73.48 90.16 7 96 96 96 -1519.84886
# 4: 4 11 86.58 85.29 15 47 66.92 90.27 7 19 19 19 1371.54757
# 5: 5 11 86.58 85.29 15 47 56.19 96.46 9 25 25 25 1139.46849
# 6: 6 14 93.17 63.17 31 97 72.23 74.18 5 81 81 81 192.99264
# 7: 7 14 93.17 63.17 31 97 88.00 95.20 7 97 97 97 5822.81529
# 8: 8 3 55.71 55.26 2 73 92.44 44.41 6 18 18 18 -899.71206
# 9: 9 3 53.62 60.76 23 96 95.83 52.91 9 88 88 88 45.16237
# 10: 10 3 55.71 55.26 2 73 99.68 96.23 8 6 6 6 -78.04484
# 11: 11 14 93.17 63.17 31 97 81.91 48.32 8 96 96 96 -5467.77459
# 12: 12 3 53.62 60.76 23 96 54.66 52.70 0 62 62 62 -1361.57863
# 13: 13 11 75.42 46.20 38 87 95.31 91.82 2 84 84 84 -445.18765
# 14: 14 14 93.17 63.17 31 97 60.32 96.25 9 71 71 71 -854.86321
# 15: 15 3 53.62 60.76 23 96 97.39 47.91 7 76 76 76 1304.41634
# 16: 16 3 55.71 55.26 2 73 65.21 44.63 9 3 3 3 -7015.57516
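The distances printed above come from the formula exactly as it was written in the question, which is why some of them are negative. If the intended formula is the spherical law of cosines, the cos(longitude difference) term belongs inside the arccos, and coordinates given in degrees need converting to radians first. Below is a rough sketch under those assumptions that also carries on to the per-station sum. The toy tables, the helper great_circle_km and the column station_row are hypothetical and not part of the answer's code.

```r
library(data.table)

# Great-circle distance in km (spherical law of cosines).
# Assumes degrees as input and the question's radius of 6400 km.
great_circle_km <- function(lat1, lon1, lat2, lon2, radius = 6400) {
  to_rad <- pi / 180
  cos_angle <- sin(lat1 * to_rad) * sin(lat2 * to_rad) +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * cos((lon1 - lon2) * to_rad)
  # Clamp to [-1, 1] so rounding error cannot push acos() into NaN
  radius * acos(pmin(1, pmax(-1, cos_angle)))
}

# Hypothetical toy data: two stations, three monitoring points
stations <- data.table(station = c(1, 2),
                       latitude_2 = c(50, 60), longitude_2 = c(60, 90),
                       day_begin = c(10, 10), day_end = c(50, 50))
obs <- data.table(latitude_1 = c(51, 59, 52), longitude_1 = c(61, 89, 62),
                  precipitation = c(5, 3, 7), day_mon = c(20, 30, 60))

# Pair every point with every station, remembering both row numbers
idx   <- CJ(p = seq_len(nrow(obs)), s = seq_len(nrow(stations)))
pairs <- data.table(obs[idx$p], stations[idx$s])
pairs[, `:=`(point_id = idx$p, station_row = idx$s)]

# Condition 1: the monitoring day lies strictly inside the station's period
pairs <- pairs[day_begin < day_mon & day_mon < day_end]

# Condition 2: keep only the nearest qualifying station for each point
pairs[, distance := great_circle_km(latitude_1, longitude_1,
                                    latitude_2, longitude_2)]
nearest <- pairs[, .SD[which.min(distance)], by = point_id]

# Sum precipitation per station row; stations with no points keep 0
sums <- nearest[, .(Sum = sum(precipitation)), by = station_row]
stations[, Sum := 0]
stations[sums$station_row, Sum := sums$Sum]

print(stations)
```

Summing by row number rather than by station number matters for the question's data, where the same station number appears more than once with different coordinates and periods. The full cross join is fine at this size; for large tables the foverlaps join from the answer would be the cheaper way to apply the period condition.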
Sample data
dt1 <- read.table( text = "latitude_1 longitude_1 precipitation day_mon
54.17 62.15 5 34
69.61 48.65 3 62
73.48 90.16 7 96
66.92 90.27 7 19
56.19 96.46 9 25
72.23 74.18 5 81
88.00 95.20 7 97
92.44 44.41 6 18
95.83 52.91 9 88
99.68 96.23 8 6
81.91 48.32 8 96
54.66 52.70 0 62
95.31 91.82 2 84
60.32 96.25 9 71
97.39 47.91 7 76
65.21 44.63 9 3", header = TRUE ) %>%
setDT()
dt2 <- read.table( text = "station latitude_2 longitude_2 day_begin day_end
15 50.00 93.22 34 46
11 86.58 85.29 15 47
14 93.17 63.17 31 97
10 88.56 61.28 15 78
13 45.29 77.10 24 79
6 69.73 99.52 13 73
4 45.60 77.36 28 95
13 92.88 62.38 9 51
1 65.10 64.13 7 69
10 60.57 86.77 34 64
3 53.62 60.76 23 96
16 87.82 59.41 38 47
1 47.83 95.89 21 52
11 75.42 46.20 38 87
3 55.71 55.26 2 73
16 71.65 96.15 36 93", header = TRUE ) %>%
setDT()
I need some help for my work;
I have a dataset like this:
DATE COD QTA
2014-01-02 87 11
2014-01-05 87 5
2015-02-03 45 3
2015-06-21 45 92
2014-09-18 74 34
2015-04-21 74 27
I need to create, for each value of the variable COD, the sequence of all dates from the minimum date (for example, for COD 87 the minimum date is 2014-01-02) to Sys.Date(). The final result I would like is something like this:
DATE COD QTA
2014-01-02 87 11
2014-01-03 87 0
2014-01-04 87 0
2014-01-05 87 5
2014-01-06 87 0
... 87 ...
Sys.Date() 87 x
2015-02-03 45 3
2015-02-04 45 0
2015-02-05 45 0
... 45 ...
Sys.Date() 45 x
How can I do that? Thanks guys!
A data.table solution:
require(data.table)
dt<-as.data.table(df)
dt[dt[,list(DATE=seq(min(DATE),Sys.Date(),by="day")),by=COD],
on=c("COD","DATE")][,QTA:=ifelse(is.na(QTA),0,QTA)][]
# DATE COD QTA
# 1: 2014-01-02 87 11
# 2: 2014-01-03 87 0
# 3: 2014-01-04 87 0
# 4: 2014-01-05 87 5
# 5: 2014-01-06 87 0
# ---
#2601: 2016-12-19 74 0
#2602: 2016-12-20 74 0
#2603: 2016-12-21 74 0
#2604: 2016-12-22 74 0
#2605: 2016-12-23 74 0
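The one-liner above does two things at once, so here is the same idea split into steps, as a sketch on hypothetical toy data. It assumes DATE is already of class Date (seq() by "day" needs that), and it uses a fixed end_date in place of Sys.Date() so the result is reproducible.

```r
library(data.table)

# Hypothetical toy data; DATE must be a Date, not character or factor
df <- data.frame(
  DATE = as.Date(c("2014-01-02", "2014-01-05", "2014-01-04")),
  COD  = c(87, 87, 45),
  QTA  = c(11, 5, 3)
)
dt <- as.data.table(df)

end_date <- as.Date("2014-01-06")  # stand-in for Sys.Date()

# Step 1: one row per COD and calendar day, from that COD's first date onward
grid <- dt[, .(DATE = seq(min(DATE), end_date, by = "day")), by = COD]

# Step 2: join the observed rows onto the grid; unmatched days get NA
filled <- dt[grid, on = c("COD", "DATE")]

# Step 3: replace the NAs with 0 by reference
filled[is.na(QTA), QTA := 0]

print(filled)
```

Replacing the NAs with a subset assignment avoids building a full-length ifelse() result, though either form should give the same values.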
I have a dataframe with the following structure:
set.seed(12345)
df <- data.frame(cat1 = rep(1:4, each = 6),
cat2 = rep(1:2, each = 3,4),
day = rep(as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")),8),
x = sample(80:120,24),
y = sample(80:120,24))
cat1 cat2 day x y
1 1 1 01.01.2016 109 106
2 1 1 02.01.2016 115 95
3 1 1 03.01.2016 120 107
4 1 2 01.01.2016 113 100
5 1 2 02.01.2016 96 88
6 1 2 03.01.2016 85 97
7 2 1 01.01.2016 91 118
8 2 1 02.01.2016 97 80
9 2 1 03.01.2016 104 86
10 2 2 01.01.2016 111 101
11 2 2 02.01.2016 81 91
12 2 2 03.01.2016 84 90
13 3 1 01.01.2016 101 105
14 3 1 02.01.2016 80 108
15 3 1 03.01.2016 90 96
16 3 2 01.01.2016 92 83
17 3 2 02.01.2016 89 99
18 3 2 03.01.2016 112 109
19 4 1 01.01.2016 118 111
20 4 1 02.01.2016 100 115
21 4 1 03.01.2016 103 85
22 4 2 01.01.2016 86 112
23 4 2 02.01.2016 98 81
24 4 2 03.01.2016 105 113
I need to calculate an index from a fixed date within the dataset over a set of subgroups (cat1, cat2). My desired outcome when indexing on 02.01.2016 looks like this:
cat1 cat2 day x y xi yi
1 1 1 01.01.2016 109 106 0,94783 1,11579
2 1 1 02.01.2016 115 95 1,00000 1,00000
3 1 1 03.01.2016 120 107 1,04348 1,12632
4 1 2 01.01.2016 113 100 1,17708 1,13636
5 1 2 02.01.2016 96 88 1,00000 1,00000
6 1 2 03.01.2016 85 97 0,88542 1,10227
7 2 1 01.01.2016 91 118 0,93814 1,47500
8 2 1 02.01.2016 97 80 1,00000 1,00000
9 2 1 03.01.2016 104 86 1,07216 1,07500
10 2 2 01.01.2016 111 101 1,37037 1,10989
11 2 2 02.01.2016 81 91 1,00000 1,00000
12 2 2 03.01.2016 84 90 1,03704 0,98901
13 3 1 01.01.2016 101 105 1,26250 0,97222
14 3 1 02.01.2016 80 108 1,00000 1,00000
15 3 1 03.01.2016 90 96 1,12500 0,88889
16 3 2 01.01.2016 92 83 1,03371 0,83838
17 3 2 02.01.2016 89 99 1,00000 1,00000
18 3 2 03.01.2016 112 109 1,25843 1,10101
19 4 1 01.01.2016 118 111 1,18000 0,96522
20 4 1 02.01.2016 100 115 1,00000 1,00000
21 4 1 03.01.2016 103 85 1,03000 0,73913
22 4 2 01.01.2016 86 112 0,87755 1,38272
23 4 2 02.01.2016 98 81 1,00000 1,00000
24 4 2 03.01.2016 105 113 1,07143 1,39506
I tried extracting the reference dates for each group with data.table subsets and then using these extracts to calculate the indexes, but I haven't figured out how to do that properly.
Highly likely this has been answered before, but here are two options with dplyr and data.table:
library(dplyr)
df %>%
group_by(cat1, cat2) %>%
mutate(xi = x/x[day=='2016-01-02'],
yi = y/y[day=='2016-01-02'])
library(data.table)
setDT(df)[, `:=` (xi = x/x[day=='2016-01-02'],
yi = y/y[day=='2016-01-02']),
by = .(cat1, cat2)]
which results in:
cat1 cat2 day x y xi yi
1: 1 1 2016-01-01 109 106 0.9478261 1.1157895
2: 1 1 2016-01-02 115 95 1.0000000 1.0000000
3: 1 1 2016-01-03 120 107 1.0434783 1.1263158
4: 1 2 2016-01-01 113 100 1.1770833 1.1363636
5: 1 2 2016-01-02 96 88 1.0000000 1.0000000
6: 1 2 2016-01-03 85 97 0.8854167 1.1022727
7: 2 1 2016-01-01 91 118 0.9381443 1.4750000
8: 2 1 2016-01-02 97 80 1.0000000 1.0000000
9: 2 1 2016-01-03 104 86 1.0721649 1.0750000
10: 2 2 2016-01-01 111 101 1.3703704 1.1098901
11: 2 2 2016-01-02 81 91 1.0000000 1.0000000
12: 2 2 2016-01-03 84 90 1.0370370 0.9890110
13: 3 1 2016-01-01 101 105 1.2625000 0.9722222
14: 3 1 2016-01-02 80 108 1.0000000 1.0000000
15: 3 1 2016-01-03 90 96 1.1250000 0.8888889
16: 3 2 2016-01-01 92 83 1.0337079 0.8383838
17: 3 2 2016-01-02 89 99 1.0000000 1.0000000
18: 3 2 2016-01-03 112 109 1.2584270 1.1010101
19: 4 1 2016-01-01 118 111 1.1800000 0.9652174
20: 4 1 2016-01-02 100 115 1.0000000 1.0000000
21: 4 1 2016-01-03 103 85 1.0300000 0.7391304
22: 4 2 2016-01-01 86 112 0.8775510 1.3827160
23: 4 2 2016-01-02 98 81 1.0000000 1.0000000
24: 4 2 2016-01-03 105 113 1.0714286 1.3950617
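If there are more than a couple of value columns, writing each one out gets tedious. A possible variation, sketched on a hypothetical two-group excerpt of the data, is to loop over the columns with .SDcols. The names base_day, value_cols and index_cols are my own.

```r
library(data.table)

# Hypothetical two-group excerpt of the question's data
df <- data.table(
  cat1 = c(1, 1, 2, 2),
  cat2 = c(1, 1, 1, 1),
  day  = as.Date(c("2016-01-01", "2016-01-02", "2016-01-01", "2016-01-02")),
  x    = c(109, 115, 91, 97),
  y    = c(106, 95, 118, 80)
)

base_day   <- as.Date("2016-01-02")   # the date to index on
value_cols <- c("x", "y")
index_cols <- paste0(value_cols, "i") # "xi", "yi"

# Within each group, divide every value column by its value on base_day
df[, (index_cols) := lapply(.SD, function(v) v / v[day == base_day]),
   by = .(cat1, cat2), .SDcols = value_cols]

print(df)
```

This assumes every group has exactly one row on base_day; a group without one has no reference value and would need separate handling.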
I have this data frame, (df1):
Month index
1 2015-09-01 1.21418847
2 2015-08-01 -4.37919039
3 2015-07-01 -1.16004624
4 2015-06-01 -1.09754890
5 2015-05-01 -4.37919039
6 2015-04-01 -4.37919039
7 2015-03-01 4.37919039
8 2015-02-01 4.37919039
9 2015-01-01 -0.11285150
10 2014-12-01 0.45712044
11 2014-11-01 0.97597018
12 2014-10-01 0.87560496
13 2014-09-01 0.66278156
14 2014-08-01 4.37919039
15 2014-07-01 1.15440685
16 2014-06-01 1.38021497
17 2014-05-01 1.67663242
18 2014-04-01 2.08358406
19 2014-03-01 2.50222843
20 2014-02-01 2.71665822
21 2014-01-01 3.13692051
22 2013-12-01 2.91702023
23 2013-11-01 3.02603774
24 2013-10-01 2.55812363
25 2013-09-01 3.12586325
26 2013-08-01 3.26063617
27 2013-07-01 2.91702023
28 2013-06-01 3.15504505
29 2013-05-01 2.53958494
30 2013-04-01 2.61528861
31 2013-03-01 2.84742861
32 2013-02-01 2.82097624
33 2013-01-01 2.53196473
34 2012-12-01 2.35786991
35 2012-11-01 2.40611260
36 2012-10-01 2.42408844
37 2012-09-01 2.91702023
38 2012-08-01 2.33372249
39 2012-07-01 2.00140636
40 2012-06-01 2.24721387
41 2012-05-01 1.89189602
42 2012-04-01 1.98807663
43 2012-03-01 1.89563925
44 2012-02-01 1.19541625
45 2012-01-01 2.91702023
46 2011-12-01 0.29072412
47 2011-11-01 -2.91702023
48 2011-10-01 -2.91702023
49 2011-09-01 -0.36402331
50 2011-08-01 -0.55409805
51 2011-07-01 -0.05902839
52 2011-06-01 -0.03946940
53 2011-05-01 0.30898661
54 2011-04-01 2.91702023
55 2011-03-01 0.80556310
56 2011-02-01 1.07001901
57 2011-01-01 2.91702023
58 2010-12-01 1.34682208
59 2010-11-01 1.30446466
60 2010-10-01 0.97753435
61 2010-09-01 0.90434619
62 2010-08-01 0.80415571
63 2010-07-01 1.41129808
64 2010-06-01 2.03576435
65 2010-05-01 2.85757135
66 2010-04-01 2.91702023
67 2010-03-01 3.96563441
68 2010-02-01 4.37919039
69 2010-01-01 4.57358010
70 2009-12-01 4.63589893
71 2009-11-01 4.40042885
72 2009-10-01 4.21359930
73 2009-09-01 4.10739350
74 2009-08-01 2.91702023
75 2009-07-01 3.85460338
76 2009-06-01 3.07796824
77 2009-05-01 2.91702023
78 2009-04-01 1.90359672
79 2009-03-01 0.68355248
80 2009-02-01 0.36218125
81 2009-01-01 -0.50814101
82 2008-12-01 0.49310633
83 2008-11-01 2.98877210
84 2008-10-01 2.28716199
85 2008-09-01 0.61433048
86 2008-08-01 0.51258623
87 2008-07-01 1.74079440
88 2008-06-01 2.91702023
89 2008-05-01 1.60899848
90 2008-04-01 2.01574569
91 2008-03-01 1.81341196
92 2008-02-01 1.48482933
93 2008-01-01 1.89122725
94 2007-12-01 1.84400308
95 2007-11-01 1.23545695
96 2007-10-01 0.44341718
97 2007-09-01 0.55630846
98 2007-08-01 0.42806839
99 2007-07-01 -0.75234218
100 2007-06-01 -1.44397151
101 2007-05-01 -2.10673018
102 2007-04-01 -1.40817350
103 2007-03-01 -0.73608848
104 2007-02-01 -0.69200513
105 2007-01-01 -0.51056142
106 2006-12-01 -0.40504212
107 2006-11-01 -0.04161989
108 2006-10-01 -0.10478629
109 2006-09-01 0.07423530
110 2006-08-01 0.13076121
111 2006-07-01 2.91702023
112 2006-06-01 1.02865488
113 2006-05-01 -0.08979180
114 2006-04-01 -1.52792341
115 2006-03-01 -2.52839603
116 2006-02-01 -3.39026284
117 2006-01-01 -3.04045769
I want to calculate the quarterly mean for each year. This will result in a data.frame with 39 rows.
I wrote this code to compute the quarterly mean:
final<-df1[, mean(index), by = quarterly(Month)]
The error message is:
Error in `[.data.frame`(df1, , mean(index), :
unused argument (by = month(Month))
Information:
class(df1$index)
"numeric"
class(df1$Month)
"factor"
What did I do wrong?
Thanks
It seems you are trying to use data.table syntax on a data frame. So first do
library(data.table)
setDT(df1)
to load the data.table package and convert df1 to a data.table. Then you can do
final <- df1[, mean(index), keyby = .(year(Month), quarter(Month))]
str(final)
# Classes ‘data.table’ and 'data.frame': 39 obs. of 3 variables:
# $ year : int 2006 2006 2006 2006 2007 2007 2007 2007 2008 2008 ...
# $ quarter: int 1 2 3 4 1 2 3 4 1 2 ...
# $ V1 : num -2.986 -0.196 1.041 -0.184 -0.646 ...
# - attr(*, "sorted")= chr "year" "quarter"
# - attr(*, ".internal.selfref")=<externalptr>
This shows we have 39 rows in the result, as you desire. Some notes: The function is named quarter() not quarterly(), you needed a capital M in Month, and needed to group by year and quarter.
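Since Month is stored as a factor, it may be cleaner to convert it to a real Date before grouping and to give the result column a name. A minimal sketch on a hypothetical six-month excerpt (the index values are made up, and mean_index is my own column name):

```r
library(data.table)

# Hypothetical six-month excerpt; Month arrives as a factor, as in the question
df1 <- data.frame(
  Month = factor(c("2015-03-01", "2015-02-01", "2015-01-01",
                   "2014-12-01", "2014-11-01", "2014-10-01")),
  index = c(3, 2, 1, 6, 5, 4)
)
setDT(df1)

# Convert the factor to a Date before extracting year and quarter
df1[, Month := as.Date(as.character(Month))]

# keyby groups and sorts; naming the pieces avoids the default V1 column
final <- df1[, .(mean_index = mean(index)),
             keyby = .(year = year(Month), quarter = quarter(Month))]

print(final)
```

Here the three months of 2014 Q4 and the three months of 2015 Q1 each collapse into one row, sorted by year and quarter.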