Constructing moving average over a categorical variable in R - r

I'm looking to construct a moving average while aggregating a time-series dataset over two categorical variables. While I've seen a few other tutorials, none of them seems to capture the specific task I'd like to achieve.
My original dataset (df) has rows for each individual (id) for a series of dates ranging from 0-180 (Days). Individuals can be members of one of two subsets of data (Group).
I then aggregate this data frame to get a daily mean for the two groups.
library(plyr)
summary <- ddply(df, .(Group, Days), summarise,
                 DV = mean(variable), resp = length(unique(Id)))
The next step, however, is to construct a moving average within the two groups. In the sample dataframe below, I've just constructed a 5-day mean using the previous 5 days.
Group Days DV 5DayMA
exceeded 0 2859
exceeded 1 2948
exceeded 2 4412
exceeded 3 5074
exceeded 4 5098 4078
exceeded 5 5147 4536
exceeded 6 4459 4838
exceeded 7 4730 4902
exceeded 8 4643 4815
exceeded 9 4698 4735
exceeded 10 4818 4670
exceeded 11 4521 4682
othergroup 0 2859
othergroup 1 2948
othergroup 2 4412
othergroup 3 5074
othergroup 4 5098 4078
othergroup 5 5147 4536
othergroup 6 4459 4838
othergroup 7 4730 4902
othergroup 8 4643 4815
othergroup 9 4698 4735
othergroup 10 4818 4670
othergroup 11 4521 4682
Any thoughts on how to do this?

You could try zoo::rollmean
df <- structure(list(Group = c("exceeded", "exceeded", "exceeded",
"exceeded", "exceeded", "exceeded", "exceeded", "exceeded", "exceeded",
"exceeded", "exceeded", "exceeded", "othergroup", "othergroup",
"othergroup", "othergroup", "othergroup", "othergroup", "othergroup",
"othergroup", "othergroup", "othergroup", "othergroup", "othergroup"
), Days = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), DV = c(2859L,
2948L, 4412L, 5074L, 5098L, 5147L, 4459L, 4730L, 4643L, 4698L,
4818L, 4521L, 2859L, 2948L, 4412L, 5074L, 5098L, 5147L, 4459L,
4730L, 4643L, 4698L, 4818L, 4521L), X5DayMA = c(NA, NA, NA, NA,
4078L, 4536L, 4838L, 4902L, 4815L, 4735L, 4670L, 4682L, NA, NA,
NA, NA, 4078L, 4536L, 4838L, 4902L, 4815L, 4735L, 4670L, 4682L
)), .Names = c("Group", "Days", "DV", "X5DayMA"), class = "data.frame", row.names = c(NA,
-24L))
head(df)
Group Days DV X5DayMA
1 exceeded 0 2859 NA
2 exceeded 1 2948 NA
3 exceeded 2 4412 NA
4 exceeded 3 5074 NA
5 exceeded 4 5098 4078
6 exceeded 5 5147 4536
library(plyr)
library(zoo)
ddply(df, "Group", transform,
      `5daymean` = rollmean(DV, 5, align = "right", na.pad = TRUE))
Group Days DV X5DayMA 5daymean
1 exceeded 0 2859 NA NA
2 exceeded 1 2948 NA NA
3 exceeded 2 4412 NA NA
4 exceeded 3 5074 NA NA
5 exceeded 4 5098 4078 4078.2
6 exceeded 5 5147 4536 4535.8
7 exceeded 6 4459 4838 4838.0
8 exceeded 7 4730 4902 4901.6
9 exceeded 8 4643 4815 4815.4
10 exceeded 9 4698 4735 4735.4
11 exceeded 10 4818 4670 4669.6
12 exceeded 11 4521 4682 4682.0
13 othergroup 0 2859 NA NA
14 othergroup 1 2948 NA NA
15 othergroup 2 4412 NA NA
16 othergroup 3 5074 NA NA
17 othergroup 4 5098 4078 4078.2
18 othergroup 5 5147 4536 4535.8
19 othergroup 6 4459 4838 4838.0
20 othergroup 7 4730 4902 4901.6
21 othergroup 8 4643 4815 4815.4
22 othergroup 9 4698 4735 4735.4
23 othergroup 10 4818 4670 4669.6
24 othergroup 11 4521 4682 4682.0
or even faster with dplyr
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(`5daymean` = rollmean(DV, 5, align = "right", na.pad = TRUE))
or the super-fast data.table
library(data.table)
dft <- data.table(df)
dft[, `5daymean` := rollmean(DV, 5, align = "right", na.pad = TRUE), by = Group]

ave and stats::filter (qualified here so that dplyr's filter() cannot mask it):
with(df, ave(DV, Group, FUN = function(x) stats::filter(x, rep(1/5, 5), sides = 1)))
# [1] NA NA NA NA 4078.2 4535.8 4838.0 4901.6 4815.4 4735.4
#[11] 4669.6 4682.0 NA NA NA NA 4078.2 4535.8 4838.0 4901.6
#[21] 4815.4 4735.4 4669.6 4682.0
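For readers curious what filter(x, rep(1/5, 5), sides = 1) is actually computing, the same trailing 5-day mean can be sketched by hand with cumsum. This is a toy illustration, not part of the original answers:

```r
# Trailing k-period mean via cumulative sums; the first k-1 positions get NA,
# matching stats::filter(x, rep(1/k, k), sides = 1) within one group.
roll_mean_right <- function(x, k = 5) {
  cs <- cumsum(x)
  out <- (cs - c(rep(0, k), head(cs, -k))) / k
  out[seq_len(k - 1)] <- NA
  out
}

x <- c(2859, 2948, 4412, 5074, 5098, 5147)
roll_mean_right(x, 5)  # NA NA NA NA 4078.2 4535.8
```

Each output position subtracts the cumulative sum k steps back, so only the last k observations contribute to each mean.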

Related

Merge 2 data frames using common date, plus 2 rows before and n-1 rows after

So I need to merge two data frames:
The first data frame contains dates in YYYY-mm-dd format and event lengths:
datetime length
2003-06-03 1
2003-06-07 1
2003-06-13 1
2003-06-17 3
2003-06-28 5
2003-07-10 1
2003-07-23 1
...
The second data frame contains dates in the same format and discharge data:
datetime q
2003-05-29 36.2
2003-05-30 34.6
2003-05-31 33.1
2003-06-01 30.7
2003-06-02 30.0
2003-06-03 153.0
2003-06-04 69.0
...
The second data frame is much larger.
I want to merge/join only the following rows of the second data frame to the first:
all rows that have the same date as the first frame (I know this can be done with left_join(df1, df2, by = "datetime"))
two rows before that row
n-1 rows after that row, where n = "length" value of row in first data frame.
I would like to identify the rows belonging to the same event as well.
Ideally I would have the following output (notice the event from 2003-06-17):
EventDatesNancy length q event#
2003-06-03 1 153.0 1
2003-06-07 1 120.0 2
2003-06-13 1 45.3 3
2003-06-15 na 110.0 4
2003-06-16 na 53.1 4
2003-06-17 3 78.0 4
2003-06-18 na 167.0 4
2003-06-19 na 145.0 4
...
I hope this makes clear what I am trying to do.
This might be one approach using tidyverse and fuzzyjoin.
First, add event numbers to your first data frame, along with two columns for the start and end dates (the start date is 2 days before the date, and the end date is length - 1 days after it).
Then, use fuzzy_inner_join to pull the selected rows from the second data frame: keep the rows where the datetime in the second data frame falls on or after the start date and on or before the end date from the first data frame.
library(tidyverse)
library(fuzzyjoin)
df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2
df1$end_date <- df1$datetime + df1$length - 1

fuzzy_inner_join(
  df1,
  df2,
  by = c("start_date" = "datetime", "end_date" = "datetime"),
  match_fun = c(`<=`, `>=`)
) %>%
  select(datetime.y, length, q, event)
I tried this out with some made up data:
R> df1
datetime length
1 2003-06-03 1
2 2003-06-12 1
3 2003-06-21 1
4 2003-06-30 3
5 2003-07-09 5
6 2003-07-18 1
7 2003-07-27 1
8 2003-08-05 2
9 2003-08-14 1
10 2003-08-23 1
11 2003-09-01 3
R> df2
datetime q
1 2003-06-03 44
2 2003-06-04 52
3 2003-06-05 34
4 2003-06-06 20
5 2003-06-07 57
6 2003-06-08 67
7 2003-06-09 63
8 2003-06-10 51
9 2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16 6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03 9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70
This will give the following output:
datetime.y length q event
1 2003-06-03 1 44 1
2 2003-06-10 1 51 2
3 2003-06-11 1 56 2
4 2003-06-12 1 37 2
5 2003-06-19 1 61 3
6 2003-06-20 1 42 3
7 2003-06-21 1 28 3
8 2003-06-28 3 38 4
9 2003-06-29 3 27 4
10 2003-06-30 3 49 4
11 2003-07-01 3 10 4
12 2003-07-02 3 89 4
13 2003-07-07 5 31 5
14 2003-07-08 5 29 5
15 2003-07-09 5 84 5
16 2003-07-10 5 60 5
17 2003-07-11 5 19 5
18 2003-07-12 5 97 5
19 2003-07-13 5 35 5
If the output desired is different than above, please let me know what should be different so that I can correct it.
Data
df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233,
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"),
length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11,
start_date = structure(c(12204, 12213, 12222, 12231, 12240,
12249, 12258, 12267, 12276, 12285, 12294), class = "Date"),
end_date = structure(c(12206, 12215, 12224, 12235, 12246,
12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")
df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209,
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218,
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227,
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236,
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245,
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L,
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L,
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L,
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))
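The interval condition that fuzzy_inner_join evaluates above can also be written out directly in base R. A minimal sketch with one made-up event date (not the question's data):

```r
# For one event on 2003-06-17 with length 3, the window is
# [datetime - 2, datetime + length - 1] = [2003-06-15, 2003-06-19].
d1 <- as.Date("2003-06-17")
len <- 3
d2 <- as.Date("2003-06-14") + 0:6  # candidate dates 2003-06-14 .. 2003-06-20

keep <- d2 >= d1 - 2 & d2 <= d1 + len - 1
d2[keep]  # 2003-06-15 .. 2003-06-19
```

fuzzy_inner_join applies exactly this pair of comparisons, one per entry of match_fun, across all row combinations.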

Group data and assign group id based on time intervals in R

I am trying to figure out how to assign group id based on time intervals in R.
More context: I have merged GPS data (lat/lon data points, recorded in irregular intervals) with acceleration data (ACC "bursts" of 82 data points, recorded at the start of every minute - all 82 data points in one burst have the same timestamp).
As GPS points and ACC bursts were collected simultaneously, I now want to group GPS points with the associated ACC bursts: assign all GPS and ACC data that occur within the same minute a unique group id.
EDIT: Here are some sample data. I want to group the GPS point in row 8 to the ACC data within the same minute (in this case above the GPS point).
structure(list(X.1 = 1:11, timestamp = c("2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:47Z", "2019-01-26T16:26:00Z", "2019-01-26T16:26:00Z", "2019-01-26T16:26:00Z"), sensor.type = c("acceleration", "acceleration", "acceleration", "acceleration", "acceleration", "acceleration", "acceleration", "gps", "acceleration", "acceleration", "acceleration"), location.long = c(NA, NA, NA, NA, NA, NA, NA, 44.4777343, NA, NA, NA), location.lat = c(NA, NA, NA, NA, NA, NA, NA, -12.2839707, NA, NA, NA), annotation = c("Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing"), X = c(2219L, 1694L, 1976L, 1744L, 2014L, 2202L, 2269L, NA, 1874L, 2024L, 1990L), Y = c(1416L, 1581L, 1524L, 1620L, 1409L, 1545L, 1771L, NA, 1687L, 1773L, 1813L), Z = c(2189L, 2209L, 2121L, 2278L, 2003L, 2034L, 2060L, NA, 2431L, 2504L, 2428L)), class = "data.frame", row.names = c(NA, -11L))
X.1 timestamp sensor.type location.long location.lat annotation X Y Z
1 1 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2219 1416 2189
2 2 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 1694 1581 2209
3 3 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 1976 1524 2121
4 4 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 1744 1620 2278
5 5 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2014 1409 2003
6 6 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2202 1545 2034
7 7 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2269 1771 2060
8 8 2019-01-26T16:25:47Z gps 44.47773 -12.28397 Moving/Climbing NA NA NA
9 9 2019-01-26T16:26:00Z acceleration NA NA Moving/Climbing 1874 1687 2431
10 10 2019-01-26T16:26:00Z acceleration NA NA Moving/Climbing 2024 1773 2504
11 11 2019-01-26T16:26:00Z acceleration NA NA Moving/Climbing 1990 1813 2428
Does that make sense? I know lubridate can summarize data based on time intervals, but how do I add a new group id (variable) based on timestamps?
Here's a solution using dplyr and lubridate. We convert your timestamp column to a proper datetime class, add a new column rounding down to the nearest minute, and then create an ID based on the rounded timestamp:
library(dplyr)
library(lubridate)
dat %>%
  mutate(
    timestamp = ymd_hms(timestamp),
    minute = floor_date(timestamp, unit = "minute"),
    group_id = as.integer(factor(minute))
  )
# X.1 timestamp sensor.type location.long location.lat annotation X Y Z
# 1 1 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2219 1416 2189
# 2 2 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 1694 1581 2209
# 3 3 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 1976 1524 2121
# 4 4 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 1744 1620 2278
# 5 5 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2014 1409 2003
# 6 6 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2202 1545 2034
# 7 7 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2269 1771 2060
# 8 8 2019-01-26 16:25:47 gps 44.47773 -12.28397 Moving/Climbing NA NA NA
# 9 9 2019-01-26 16:26:00 acceleration NA NA Moving/Climbing 1874 1687 2431
# 10 10 2019-01-26 16:26:00 acceleration NA NA Moving/Climbing 2024 1773 2504
# 11 11 2019-01-26 16:26:00 acceleration NA NA Moving/Climbing 1990 1813 2428
# minute group_id
# 1 2019-01-26 16:25:00 1
# 2 2019-01-26 16:25:00 1
# 3 2019-01-26 16:25:00 1
# 4 2019-01-26 16:25:00 1
# 5 2019-01-26 16:25:00 1
# 6 2019-01-26 16:25:00 1
# 7 2019-01-26 16:25:00 1
# 8 2019-01-26 16:25:00 1
# 9 2019-01-26 16:26:00 2
# 10 2019-01-26 16:26:00 2
# 11 2019-01-26 16:26:00 2
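The same minute-based id can be built without lubridate. A base-R sketch on a few of the timestamps above:

```r
# Truncate ISO-8601 timestamps to the minute, then number the distinct
# minutes in order of first appearance.
ts <- c("2019-01-26T16:25:00Z", "2019-01-26T16:25:47Z", "2019-01-26T16:26:00Z")
tt <- as.POSIXct(ts, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
minute <- format(tt, "%Y-%m-%d %H:%M")
group_id <- match(minute, unique(minute))
group_id  # 1 1 2
```

match() against unique() plays the role of as.integer(factor(...)), but guarantees ids in order of first appearance rather than alphabetical order.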

Cross join two dataframes by key column using condition in R

I have two dataframes.
mydata1=structure(list(ID_WORKES = c(58005854L, 58005854L, 58002666L,
58002666L), ID_SP_NAR = c(463L, 1951L, 21L, 465L), KOD_DEPO = c(3786L,
3786L, 1439L, 1439L), KOD_DOR = c(58L, 58L, 92L, 92L), COLUMN_MASH = c(6L,
6L, 5L, 5L), prop_violations = structure(c(1L, 2L, 2L, 2L), .Label = c("0.2",
"1"), class = "factor"), mash_score = c(0L, 2L, 2L, 2L)), .Names = c("ID_WORKES",
"ID_SP_NAR", "KOD_DEPO", "KOD_DOR", "COLUMN_MASH", "prop_violations",
"mash_score"), class = "data.frame", row.names = c(NA, -4L))
mydata2=structure(list(ID_SP_NAR = c(463L, 1951L, 21L, 465L, 500L, 600L
)), .Names = "ID_SP_NAR", class = "data.frame", row.names = c(NA,
-6L))
I need to cross-join/merge these data frames by ID_SP_NAR. mydata2 contains only the key variable ID_SP_NAR.
I need to join them in such a way that if an ID_WORKES does not have some of the ID_SP_NAR codes from mydata2, those codes are still inserted into the dataset, but with zero values filled in for the prop_violations and mash_score variables.
I.e., ID_SP_NAR in mydata2 has these values:
ID_SP_NAR
463
1951
21
465
500
600
ID_WORKES = 58005854 has
463
1951
but not the others, and
ID_WORKES = 58002666 has 21 and 465 but not the others!
So desired output after cross join
ID_WORKES ID_SP_NAR KOD_DEPO KOD_DOR COLUMN_MASH prop_violations mash_score
1 58005854 463 3786 58 6 0.2 0
2 58005854 1951 3786 58 6 1 2
3 58005854 21 3786 58 6 0 0
4 58005854 465 3786 58 6 0 0
5 58005854 500 3786 58 6 0 0
6 58005854 600 3786 58 6 0 0
7 58002666 21 1439 92 5 1 2
8 58002666 465 1439 92 5 1 2
9 58002666 500 1439 92 5 0 0
10 58002666 600 1439 92 5 0 0
11 58002666 463 1439 92 5 0 0
12 58002666 1951 1439 92 5 0 0
KOD_DEPO, KOD_DOR and COLUMN_MASH have fixed values per worker; they must be kept too.
How can I do that?
merge(mydata1, mydata2, by = "ID_SP_NAR") is not working (I also tried a left join); it doesn't insert the zeros I want.
We could use complete from tidyr to expand the dataset based on 'ID_WORKES' and the values of 'ID_SP_NAR' in the second dataset:
library(tidyverse)
mydata1 %>%
  mutate_if(is.factor, as.character) %>%
  complete(ID_WORKES, ID_SP_NAR = mydata2$ID_SP_NAR,
           fill = list(prop_violations = '0', mash_score = 0)) %>%
  fill(3:5)
# A tibble: 12 x 7
# ID_WORKES ID_SP_NAR KOD_DEPO KOD_DOR COLUMN_MASH prop_violations mash_score
# <int> <int> <int> <int> <int> <chr> <dbl>
# 1 58002666 21 1439 92 5 1 2
# 2 58002666 463 1439 92 5 0 0
# 3 58002666 465 1439 92 5 1 2
# 4 58002666 500 1439 92 5 0 0
# 5 58002666 600 1439 92 5 0 0
# 6 58002666 1951 1439 92 5 0 0
# 7 58005854 21 1439 92 5 0 0
# 8 58005854 463 3786 58 6 0.2 0
# 9 58005854 465 3786 58 6 0 0
#10 58005854 500 3786 58 6 0 0
#11 58005854 600 3786 58 6 0 0
#12 58005854 1951 3786 58 6 1 2
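The effect of complete() can also be sketched in base R with expand.grid() plus merge(). A toy version using a subset of the question's values (only mash_score kept, for brevity):

```r
# Build the full worker x code grid, left-join the observed rows,
# then zero-fill the scores that were never observed.
obs <- data.frame(ID_WORKES = c(58005854, 58005854, 58002666),
                  ID_SP_NAR = c(463, 1951, 21),
                  mash_score = c(0, 2, 2))
codes <- c(463, 1951, 21, 465, 500, 600)

grid <- expand.grid(ID_WORKES = unique(obs$ID_WORKES), ID_SP_NAR = codes)
out <- merge(grid, obs, by = c("ID_WORKES", "ID_SP_NAR"), all.x = TRUE)
out$mash_score[is.na(out$mash_score)] <- 0
```

The all.x = TRUE keeps every grid row, so unmatched worker/code pairs arrive as NA and can be replaced with zeros.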

Combine some columns of two matrices but with common information transposed

I have the following two matrices:
matrix1 (first 10 rows and only some relevant columns):
Prod_Y2010 Prod_Y2011 Prod_Y2012 Prod_Y2013 Prod_Y2014 Place
1 6101 5733 5655 5803 5155 3
2 4614 4513 4322 5211 4397 1
3 5370 5295 4951 5145 4491 3
4 5689 5855 5600 5787 4848 1
5 3598 3491 3462 3765 3094 2
6 6367 6244 5838 6404 5466 7
7 2720 2635 2465 2917 2623 2
8 5077 5113 4456 5503 4749 8
9 5260 5055 4512 5691 4876 2
10 4771 4583 4202 5266 4422 2
where each column is grassland productivity from years 2010 to 2014, and the last column is the place where productivity was measured.
and matrix2:
Year Rain_Place1 Rain_Place2 Rain_Place3 Rain_Place7 Rain_Place8
11 2010 123.0 361.0 60.5 469.7 492.3
12 2011 45.5 404.4 224.8 395.4 417.3
13 2012 318.7 369.4 115.7 322.6 385.8
14 2013 93.2 378.4 155.5 398.2 413.1
15 2014 216.8 330.0 31.0 344.0 387.5
where for each of the same 5 years of matrix1 (which are the rows in matrix 2) I have data on the rainfall for each place.
I do not see how to proceed in R to join the information of the two matrices in such a way that my matrix1 has a series of additional columns intercalated (or interspersed) with the corresponding rain values matching the corresponding years and places. That is, what I need is a new matrix1 such as:
Prod_Y2010 Rain_Y2010 Prod_Y2011 Rain_Y2011 Prod_Y2012 Rain_Y2012 ... Place
1 6101 60.5 5733 224.8 5655 115.7 3
2 4614 123.0 4513 45.5 4322 318.7 1
3 5370 60.5 5295 224.8 4951 115.7 3
4 5689 123.0 5855 45.5 5600 318.7 1
5 3598 361.0 3491 404.4 3462 369.4 2
... ... ... ... ... ... ... ...
Of course the order is not important to me: if all the Rainfall columns are added as new columns at the right end of matrix1, that would be fine anyway.
Needless to say, my real matrices are several thousand rows long, and the number of years is 15.
I would second @jazzurro's comment: reformatting your data to long format would likely make it easier to work with for analysis etc. However, if you want to keep the wide format, here is a way that might work; it uses the reshape2 and plyr libraries.
Given these data frames (dput() output of your data frames above, only included for reproducibility):
m1<-structure(list(Prod_Y2010 = c(6101L, 4614L, 5370L, 5689L, 3598L,
6367L, 2720L, 5077L, 5260L, 4771L), Prod_Y2011 = c(5733L, 4513L,
5295L, 5855L, 3491L, 6244L, 2635L, 5113L, 5055L, 4583L), Prod_Y2012 = c(5655L,
4322L, 4951L, 5600L, 3462L, 5838L, 2465L, 4456L, 4512L, 4202L
), Prod_Y2013 = c(5803L, 5211L, 5145L, 5787L, 3765L, 6404L, 2917L,
5503L, 5691L, 5266L), Prod_Y2014 = c(5155L, 4397L, 4491L, 4848L,
3094L, 5466L, 2623L, 4749L, 4876L, 4422L), Place = c(3L, 1L,
3L, 1L, 2L, 7L, 2L, 8L, 2L, 2L)), .Names = c("Prod_Y2010", "Prod_Y2011",
"Prod_Y2012", "Prod_Y2013", "Prod_Y2014", "Place"), class = "data.frame", row.names = c(NA,
-10L))
m2<-structure(list(Year = 2010:2014, Rain_Place1 = c(123, 45.5, 318.7,
93.2, 216.8), Rain_Place2 = c(361, 404.4, 369.4, 378.4, 330),
Rain_Place3 = c(60.5, 224.8, 115.7, 155.5, 31), Rain_Place7 = c(469.7,
395.4, 322.6, 398.2, 344), Rain_Place8 = c(492.3, 417.3,
385.8, 413.1, 387.5)), .Names = c("Year", "Rain_Place1",
"Rain_Place2", "Rain_Place3", "Rain_Place7", "Rain_Place8"), class = "data.frame", row.names = c("11",
"12", "13", "14", "15"))
To get the place number from the column names in your rain data frame to use in a later join:
rename <- function(x) {
  y <- substr(x, nchar(x), nchar(x))
  return(y)
}
Edit: Here is a better rename function that should work with more than 9 places (modified from an answer here):
rename <- function(x) {
  y <- unlist(regmatches(x, gregexpr('\\(?[0-9,.]+', x)))
  return(y)
}
sapply(names(m2[2:ncol(m2)]), FUN = rename)
names(m2) <- c(names(m2)[1], sapply(names(m2[2:ncol(m2)]), FUN = rename))
> m2
Year 1 2 3 7 8
1 2010 123.0 361.0 60.5 469.7 492.3
2 2011 45.5 404.4 224.8 395.4 417.3
3 2012 318.7 369.4 115.7 322.6 385.8
4 2013 93.2 378.4 155.5 398.2 413.1
5 2014 216.8 330.0 31.0 344.0 387.5
Melt the rain data frame:
m3<-melt(m2, id.vars = "Year", variable.name = "Place", value.name = "Rain")
> head(m3)
Year Place Rain
1 2010 1 123.0
2 2011 1 45.5
3 2012 1 318.7
4 2013 1 93.2
5 2014 1 216.8
6 2010 2 361.0
Reshape the melted data frame to allow for a join by "Place", and treat "Place" as a character rather than a factor:
m4<-reshape(m3, idvar = "Place", timevar = "Year", direction = "wide")
m4$Place <- as.character(m4$Place)
> m4
Place Rain.2010 Rain.2011 Rain.2012 Rain.2013 Rain.2014
1 1 123.0 45.5 318.7 93.2 216.8
6 2 361.0 404.4 369.4 378.4 330.0
11 3 60.5 224.8 115.7 155.5 31.0
16 7 469.7 395.4 322.6 398.2 344.0
21 8 492.3 417.3 385.8 413.1 387.5
Finally, join this melted/reshaped data frame to your "Prod" data frame.
m5<-join(m1, m4, by = "Place")
> m5
Prod_Y2010 Prod_Y2011 Prod_Y2012 Prod_Y2013 Prod_Y2014 Place Rain.2010 Rain.2011 Rain.2012 Rain.2013 Rain.2014
1 6101 5733 5655 5803 5155 3 60.5 224.8 115.7 155.5 31.0
2 4614 4513 4322 5211 4397 1 123.0 45.5 318.7 93.2 216.8
3 5370 5295 4951 5145 4491 3 60.5 224.8 115.7 155.5 31.0
4 5689 5855 5600 5787 4848 1 123.0 45.5 318.7 93.2 216.8
5 3598 3491 3462 3765 3094 2 361.0 404.4 369.4 378.4 330.0
6 6367 6244 5838 6404 5466 7 469.7 395.4 322.6 398.2 344.0
7 2720 2635 2465 2917 2623 2 361.0 404.4 369.4 378.4 330.0
8 5077 5113 4456 5503 4749 8 492.3 417.3 385.8 413.1 387.5
9 5260 5055 4512 5691 4876 2 361.0 404.4 369.4 378.4 330.0
10 4771 4583 4202 5266 4422 2 361.0 404.4 369.4 378.4 330.0
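As the long-format suggestion at the top of this answer implies, the rain matrix can also be stacked with base reshape() before joining. A small sketch on just the first two years and places (toy frame, not the answer's m2):

```r
# Stack the Rain_Place columns into long (Year, Place, Rain) rows.
m2_small <- data.frame(Year = 2010:2011,
                       Rain_Place1 = c(123, 45.5),
                       Rain_Place2 = c(361, 404.4))
long <- reshape(m2_small, direction = "long",
                varying = c("Rain_Place1", "Rain_Place2"),
                v.names = "Rain", timevar = "Place", times = c(1, 2))
long[, c("Year", "Place", "Rain")]
```

From this long shape, a single merge by Year and Place (against a similarly melted productivity frame) replaces the per-column bookkeeping above.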

Difficult aggregation in R, grouping team basketball stats

Thanks in advance for any help or suggestions on this. Here is a shortened example of the dataframe I am working with.
boxscore_stats = structure(list(game_id = c(157046L, 157046L, 157046L, 157046L,
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L,
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L), team_id = c(116975, 116975, 116975, 116975, 116975,
116975, 116975, 116975, 116975, 120310, 120310, 120310, 120310,
120310, 120310, 120310, 120310, 120310, 121910, 121910, 121910,
121910, 121910, 121910, 121910, 121910, 121910, 121910, 122072,
122072, 122072, 122072, 122072, 122072, 122072, 122072, 122072,
122072, 122072, 122072), minutes_played = c(18.76, 14.63, 8,
16.69, 24.62, 32, 12.79, 5.28, 3.22, 24.35, 10.18, 20.65, 9.59,
25.08, 14.12, 17.46, 23.15, 15.43, 22.84, 19.27, 21.31, 6.41,
17.57, 17.4, 17.29, 7.22, 12.09, 17.25, 2.28, 16.87, 6.6, 19.73,
6.31, 13.25, 26.25, 6.08, 28.71, 11.2, 17.54, 5.17), fieldGoalsMade = c(1L,
1L, 4L, 1L, 2L, 7L, 1L, 1L, 1L, 4L, 0L, 3L, 1L, 3L, 0L, 6L, 7L,
1L, 7L, 4L, 5L, 1L, 2L, 6L, 2L, 0L, 1L, 3L, 0L, 1L, 1L, 3L, 0L,
1L, 11L, 2L, 5L, 1L, 2L, 1L), fieldGoalAttempts = c(8L, 6L, 7L,
2L, 9L, 16L, 3L, 1L, 2L, 12L, 4L, 12L, 3L, 11L, 4L, 9L, 13L,
6L, 12L, 10L, 14L, 2L, 6L, 11L, 6L, 2L, 2L, 6L, 0L, 5L, 3L, 10L,
2L, 3L, 21L, 3L, 17L, 4L, 9L, 2L)), .Names = c("game_id", "team_id",
"minutes_played", "fieldGoalsMade", "fieldGoalAttempts"), row.names = c(NA,
40L), class = "data.frame")
head(boxscore_stats)
game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1 157046 116975 18.76 1 8
2 157046 116975 14.63 1 6
3 157046 116975 8.00 4 7
4 157046 116975 16.69 1 2
5 157046 116975 24.62 2 9
6 157046 116975 32.00 7 16
7 157046 116975 12.79 1 3
8 157046 116975 5.28 1 1
9 157046 116975 3.22 1 2
10 157046 120310 24.35 4 12
11 157046 120310 10.18 0 4
12 157046 120310 20.65 3 12
13 157046 120310 9.59 1 3
14 157046 120310 25.08 3 11
15 157046 120310 14.12 0 4
16 157046 120310 17.46 6 9
17 157046 120310 23.15 7 13
18 157046 120310 15.43 1 6
19 159151 121910 22.84 7 12
20 159151 121910 19.27 4 10
21 159151 121910 21.31 5 14
22 159151 121910 6.41 1 2
23 159151 121910 17.57 2 6
24 159151 121910 17.40 6 11
25 159151 121910 17.29 2 6
26 159151 121910 7.22 0 2
27 159151 121910 12.09 1 2
28 159151 121910 17.25 3 6
29 159151 122072 2.28 0 0
30 159151 122072 16.87 1 5
31 159151 122072 6.60 1 3
32 159151 122072 19.73 3 10
33 159151 122072 6.31 0 2
34 159151 122072 13.25 1 3
35 159151 122072 26.25 11 21
36 159151 122072 6.08 2 3
37 159151 122072 28.71 5 17
38 159151 122072 11.20 1 4
39 159151 122072 17.54 2 9
40 159151 122072 5.17 1 2
The important things to note about this data frame: each game_id corresponds to two team_ids, for the two teams that played in the game, and each game_id is unique to one game of basketball. Each row holds the stats for one player on the team_id's team in that game. The example above has only two games / 4 teams / 40 players, but my full data frame has hundreds of games, with each team_id showing up many times.
My first aggregation, which I was able to do, was to aggregate everything by team_id. This code got the job done for me for the first aggregation:
boxscore_stats_aggregated = aggregate(boxscore_stats, by = list(boxscore_stats[, 2]), FUN = sum)
which was fairly straightforward. For any team_id, I had aggregated all of their minutes played, all of their fieldGoalsMade, etc. For my next aggregation, though, I need to aggregate by team_id again, but instead of aggregating a team's own rows / stats, I need to aggregate the rows / stats of their opponents. This answers the question "For any team, how many fieldGoalsMade did they allow in total to opponents?", etc. So in this case, for team_id = 116975, I would want to aggregate all the rows with team_id 120310. Of course, the next time team_id 116975 appears in my data frame in a new game, it is likely playing a different opponent, so this aggregation is not as simple as aggregating by team_id 120310.
I think I should be able to use the relationship between the two team_ids being unique to the unique game_ids to make this aggregation possible, but I am struggling with how it could be implemented.
Thanks!
Here is an approach using data.table:
(1) Read in the data:
# Load package
library(data.table)
# Load your data
boxscore_stats <- fread("row game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1 157046 116975 18.76 1 8
2 157046 116975 14.63 1 6
3 157046 116975 8.00 4 7
4 157046 116975 16.69 1 2
5 157046 116975 24.62 2 9
6 157046 116975 32.00 7 16
7 157046 116975 12.79 1 3
8 157046 116975 5.28 1 1
9 157046 116975 3.22 1 2
10 157046 120310 24.35 4 12
11 157046 120310 10.18 0 4
12 157046 120310 20.65 3 12
13 157046 120310 9.59 1 3
14 157046 120310 25.08 3 11
15 157046 120310 14.12 0 4
16 157046 120310 17.46 6 9
17 157046 120310 23.15 7 13
18 157046 120310 15.43 1 6
19 159151 121910 22.84 7 12
20 159151 121910 19.27 4 10
21 159151 121910 21.31 5 14
22 159151 121910 6.41 1 2
23 159151 121910 17.57 2 6
24 159151 121910 17.40 6 11
25 159151 121910 17.29 2 6
26 159151 121910 7.22 0 2
27 159151 121910 12.09 1 2
28 159151 121910 17.25 3 6
29 159151 122072 2.28 0 0
30 159151 122072 16.87 1 5
31 159151 122072 6.60 1 3
32 159151 122072 19.73 3 10
33 159151 122072 6.31 0 2
34 159151 122072 13.25 1 3
35 159151 122072 26.25 11 21
36 159151 122072 6.08 2 3
37 159151 122072 28.71 5 17
38 159151 122072 11.20 1 4
39 159151 122072 17.54 2 9
40 159151 122072 5.17 1 2
")
(2) Proceed with the actual calculations:
# Aggregate on team-and-game level (data.table style)
boxscore_stats_aggregated <- boxscore_stats[, lapply(.SD, sum), by = list(game_id, team_id)]

# Match EVERY team to its opponent, i.e. still two rows per game,
# but with columns for the opponent's performance added.
# Some teams drop out with the dummy data, as their opponent's data was missing.
merge(boxscore_stats_aggregated, boxscore_stats_aggregated,
      by = "game_id", suffixes = c("", ".opponent"))[team_id != team_id.opponent, ]
The output looks like this:
# > output
# game_id team_id row minutes_played fieldGoalsMade fieldGoalAttempts team_id.opponent row.opponent minutes_played.opponent fieldGoalsMade.opponent fieldGoalAttempts.opponent
# 1: 1413414 116975 45 135.99 19 54 120310 126 160.01 25 74
# 2: 1413414 120310 126 160.01 25 74 116975 45 135.99 19 54
And just in case, for the OP to consider or for future readers, below is a base R version with merge() for side-by-side aggregates of team and opponent by game_id. A temporary staging column, gamecount, is needed.
# TEAM AGGREGATION
aggdf <- aggregate(.~game_id + team_id, boxscore_stats, FUN = sum)
# GAME COUNT BY TEAM (TEMP COL USED FOR MERGE/FILTER)
aggdf$gamecount <- sapply(1:nrow(aggdf), function(i)
sum(aggdf[1:i, c("game_id")] == aggdf$game_id[i]))
# MERGE AND FILTER
mdf <- merge(aggdf, aggdf, by="game_id")
mdf <- mdf[mdf$team_id.x != mdf$team_id.y & mdf$gamecount.x == 1,]
mdf$gamecount.x <- mdf$gamecount.y <- NULL
# RENAME COL AND ROW NAMES
names(mdf)[grepl("\\.x", names(mdf))] <- gsub("\\.x", "",
names(mdf)[grepl("\\.x", names(mdf))])
names(mdf)[grepl("\\.y", names(mdf))] <- gsub("\\.y", ".opp",
names(mdf)[grepl("\\.y", names(mdf))])
rownames(mdf) <- 1:nrow(mdf)
# game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts team_id.opp
# 1 157046 116975 135.99 19 54 120310
# 2 159151 121910 158.65 31 71 122072
# minutes_played.opp fieldGoalsMade.opp fieldGoalAttempts.opp
# 1 160.01 25 74
# 2 159.99 28 79
If you want to isolate single team_ids, I would use the dplyr package.
For example if you wanted to know % of field goals per team I would write something like:
boxscore_stats %>%
  group_by(team_id) %>%
  summarize(perc_fg = sum(fieldGoalsMade) / sum(fieldGoalAttempts))
This would give you a new data.frame aggregated by team ID.
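Both answers above hinge on the same self-merge trick. Stripped to its core, with toy per-game aggregates taken from the base R answer's totals:

```r
# One aggregated row per (game, team). Merging the table to itself on
# game_id pairs every team with every team in that game; dropping the
# self-pairs leaves each team next to its opponent's totals.
agg <- data.frame(game_id = c(157046, 157046),
                  team_id = c(116975, 120310),
                  fieldGoalsMade = c(19, 25))
opp <- merge(agg, agg, by = "game_id", suffixes = c("", ".opp"))
opp <- opp[opp$team_id != opp$team_id.opp, ]
```

With suffixes = c("", ".opp"), the left copy keeps its original column names, so only the opponent columns gain a suffix.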
