Thanks in advance for any help or suggestions on this. Here is a shortened example of the dataframe I am working with.
boxscore_stats = structure(list(game_id = c(157046L, 157046L, 157046L, 157046L,
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L,
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L), team_id = c(116975, 116975, 116975, 116975, 116975,
116975, 116975, 116975, 116975, 120310, 120310, 120310, 120310,
120310, 120310, 120310, 120310, 120310, 121910, 121910, 121910,
121910, 121910, 121910, 121910, 121910, 121910, 121910, 122072,
122072, 122072, 122072, 122072, 122072, 122072, 122072, 122072,
122072, 122072, 122072), minutes_played = c(18.76, 14.63, 8,
16.69, 24.62, 32, 12.79, 5.28, 3.22, 24.35, 10.18, 20.65, 9.59,
25.08, 14.12, 17.46, 23.15, 15.43, 22.84, 19.27, 21.31, 6.41,
17.57, 17.4, 17.29, 7.22, 12.09, 17.25, 2.28, 16.87, 6.6, 19.73,
6.31, 13.25, 26.25, 6.08, 28.71, 11.2, 17.54, 5.17), fieldGoalsMade = c(1L,
1L, 4L, 1L, 2L, 7L, 1L, 1L, 1L, 4L, 0L, 3L, 1L, 3L, 0L, 6L, 7L,
1L, 7L, 4L, 5L, 1L, 2L, 6L, 2L, 0L, 1L, 3L, 0L, 1L, 1L, 3L, 0L,
1L, 11L, 2L, 5L, 1L, 2L, 1L), fieldGoalAttempts = c(8L, 6L, 7L,
2L, 9L, 16L, 3L, 1L, 2L, 12L, 4L, 12L, 3L, 11L, 4L, 9L, 13L,
6L, 12L, 10L, 14L, 2L, 6L, 11L, 6L, 2L, 2L, 6L, 0L, 5L, 3L, 10L,
2L, 3L, 21L, 3L, 17L, 4L, 9L, 2L)), .Names = c("game_id", "team_id",
"minutes_played", "fieldGoalsMade", "fieldGoalAttempts"), row.names = c(NA,
40L), class = "data.frame")
boxscore_stats
game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1 157046 116975 18.76 1 8
2 157046 116975 14.63 1 6
3 157046 116975 8.00 4 7
4 157046 116975 16.69 1 2
5 157046 116975 24.62 2 9
6 157046 116975 32.00 7 16
7 157046 116975 12.79 1 3
8 157046 116975 5.28 1 1
9 157046 116975 3.22 1 2
10 157046 120310 24.35 4 12
11 157046 120310 10.18 0 4
12 157046 120310 20.65 3 12
13 157046 120310 9.59 1 3
14 157046 120310 25.08 3 11
15 157046 120310 14.12 0 4
16 157046 120310 17.46 6 9
17 157046 120310 23.15 7 13
18 157046 120310 15.43 1 6
19 159151 121910 22.84 7 12
20 159151 121910 19.27 4 10
21 159151 121910 21.31 5 14
22 159151 121910 6.41 1 2
23 159151 121910 17.57 2 6
24 159151 121910 17.40 6 11
25 159151 121910 17.29 2 6
26 159151 121910 7.22 0 2
27 159151 121910 12.09 1 2
28 159151 121910 17.25 3 6
29 159151 122072 2.28 0 0
30 159151 122072 16.87 1 5
31 159151 122072 6.60 1 3
32 159151 122072 19.73 3 10
33 159151 122072 6.31 0 2
34 159151 122072 13.25 1 3
35 159151 122072 26.25 11 21
36 159151 122072 6.08 2 3
37 159151 122072 28.71 5 17
38 159151 122072 11.20 1 4
39 159151 122072 17.54 2 9
40 159151 122072 5.17 1 2
The important things to note about this dataframe are that each game_id corresponds to two team_ids, for the two teams that played in the game. Each game_id is unique to one game of basketball. Each row contains the stats for one player on that team_id's team in that game. The example above has only two games / 4 teams / 40 players, but my full dataframe has hundreds of games, with each team_id showing up many times.
My first aggregation, which I was able to do, was to aggregate everything by team_id. This code got the job done for me for the first aggregation:
boxscore_stats_aggregated = aggregate(boxscore_stats, by = list(boxscore_stats[, 2]), FUN = sum)
which was fairly straightforward. For any team_id, I had aggregated all of their minutes played, all of their fieldGoalsMade, etc. For my next aggregation, though, I need to aggregate by team_id again, but instead of aggregating a team's own rows / stats, I need to aggregate the rows / stats of their opponents. This answers the question "For any team, how many fieldGoalsMade did they allow in total to opponents?", etc. So in this case, for team_id = 116975, I would want to aggregate all the rows with team_id = 120310. Of course, the next time team_id 116975 appears in my dataframe in a new game, it is likely playing a different opponent, so this aggregation is not as simple as aggregating by team_id 120310.
I think I should be able to use the relationship between the two team_ids being unique to the unique game_ids to make this aggregation possible, but I am struggling with how it could be implemented.
Thanks!
Here is an approach using data.table:
(1) Read in the data:
# Load package
library(data.table)
# Load your data
boxscore_stats <- fread("row game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1 157046 116975 18.76 1 8
2 157046 116975 14.63 1 6
3 157046 116975 8.00 4 7
4 157046 116975 16.69 1 2
5 157046 116975 24.62 2 9
6 157046 116975 32.00 7 16
7 157046 116975 12.79 1 3
8 157046 116975 5.28 1 1
9 157046 116975 3.22 1 2
10 157046 120310 24.35 4 12
11 157046 120310 10.18 0 4
12 157046 120310 20.65 3 12
13 157046 120310 9.59 1 3
14 157046 120310 25.08 3 11
15 157046 120310 14.12 0 4
16 157046 120310 17.46 6 9
17 157046 120310 23.15 7 13
18 157046 120310 15.43 1 6
19 159151 121910 22.84 7 12
20 159151 121910 19.27 4 10
21 159151 121910 21.31 5 14
22 159151 121910 6.41 1 2
23 159151 121910 17.57 2 6
24 159151 121910 17.40 6 11
25 159151 121910 17.29 2 6
26 159151 121910 7.22 0 2
27 159151 121910 12.09 1 2
28 159151 121910 17.25 3 6
29 159151 122072 2.28 0 0
30 159151 122072 16.87 1 5
31 159151 122072 6.60 1 3
32 159151 122072 19.73 3 10
33 159151 122072 6.31 0 2
34 159151 122072 13.25 1 3
35 159151 122072 26.25 11 21
36 159151 122072 6.08 2 3
37 159151 122072 28.71 5 17
38 159151 122072 11.20 1 4
39 159151 122072 17.54 2 9
40 159151 122072 5.17 1 2
")
(2) Proceed with the actual calculations:
# Aggregate to the team-and-game level (data.table style)
boxscore_stats_aggregated <- boxscore_stats[, lapply(.SD, sum), by = list(game_id, team_id)]
# Match EVERY team to its opponent, i.e. still two rows per game,
# but with columns for the opponent's performance added.
# A team drops out of the result if its opponent's rows are missing from the data.
merge(boxscore_stats_aggregated, boxscore_stats_aggregated,
by="game_id", suffixes = c("", ".opponent"))[team_id!=team_id.opponent,]
The output looks like this (the rows for the first game are shown; the second game is analogous):
# > output
#    game_id team_id row minutes_played fieldGoalsMade fieldGoalAttempts team_id.opponent row.opponent minutes_played.opponent fieldGoalsMade.opponent fieldGoalAttempts.opponent
# 1:  157046  116975  45         135.99             19                54           120310          126                  160.01                      25                         74
# 2:  157046  120310 126         160.01             25                74           116975           45                  135.99                      19                         54
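If what you ultimately want is each team's totals of opponents' stats across all games (the question's stated goal), you can aggregate the merged table once more by team_id. A minimal sketch, storing the merged result from above in a variable (the output column names here are my own):
# Keep the merged team/opponent table, then sum the opponent columns per team
opp <- merge(boxscore_stats_aggregated, boxscore_stats_aggregated,
             by = "game_id", suffixes = c("", ".opponent"))[team_id != team_id.opponent, ]
opp[, .(fieldGoalsAllowed        = sum(fieldGoalsMade.opponent),
        fieldGoalAttemptsAllowed = sum(fieldGoalAttempts.opponent)),
    by = team_id]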
And just in case, for the OP or future readers to consider, below is a base R version using merge() for side-by-side aggregates of team and opponent by game_id. A temporary staging column, gamecount, is needed for the merge/filter step.
# TEAM AGGREGATION
aggdf <- aggregate(.~game_id + team_id, boxscore_stats, FUN = sum)
# GAME COUNT BY TEAM (TEMP COL USED FOR MERGE/FILTER)
aggdf$gamecount <- sapply(1:nrow(aggdf), function(i)
sum(aggdf[1:i, c("game_id")] == aggdf$game_id[i]))
# MERGE AND FILTER
mdf <- merge(aggdf, aggdf, by="game_id")
mdf <- mdf[mdf$team_id.x != mdf$team_id.y & mdf$gamecount.x == 1,]
mdf$gamecount.x <- mdf$gamecount.y <- NULL
# RENAME COL AND ROW NAMES
names(mdf)[grepl("\\.x", names(mdf))] <- gsub("\\.x", "",
names(mdf)[grepl("\\.x", names(mdf))])
names(mdf)[grepl("\\.y", names(mdf))] <- gsub("\\.y", ".opp",
names(mdf)[grepl("\\.y", names(mdf))])
rownames(mdf) <- 1:nrow(mdf)
# game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts team_id.opp
# 1 157046 116975 135.99 19 54 120310
# 2 159151 121910 158.65 31 71 122072
# minutes_played.opp fieldGoalsMade.opp fieldGoalAttempts.opp
# 1 160.01 25 74
# 2 159.99 28 79
If you want to isolate single team_ids, I would use the dplyr package.
For example, if you wanted to know each team's field-goal percentage, I would write something like:
library(dplyr)
boxscore_stats %>%
group_by(team_id) %>%
summarize(perc_fg = sum(fieldGoalsMade)/sum(fieldGoalAttempts))
This would give you a new data.frame aggregated by team ID.
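If you need the opponent-allowed version of such a summary in dplyr as well, here is a sketch along the same lines as the merge answers above (assuming dplyr >= 1.0 for across(); the .opponent suffix and output column names are my own):
library(dplyr)
# One row per team per game
team_totals <- boxscore_stats %>%
  group_by(game_id, team_id) %>%
  summarize(across(c(fieldGoalsMade, fieldGoalAttempts), sum), .groups = "drop")
# A self-join on game_id pairs each team with its opponent from the same game
team_totals %>%
  inner_join(team_totals, by = "game_id", suffix = c("", ".opponent")) %>%
  filter(team_id != team_id.opponent) %>%
  group_by(team_id) %>%
  summarize(fieldGoalsAllowed        = sum(fieldGoalsMade.opponent),
            fieldGoalAttemptsAllowed = sum(fieldGoalAttempts.opponent))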
Related
So I need to merge two data frames:
The first data frame contains dates in YYYY-mm-dd format and event lengths:
datetime length
2003-06-03 1
2003-06-07 1
2003-06-13 1
2003-06-17 3
2003-06-28 5
2003-07-10 1
2003-07-23 1
...
The second data frame contains dates in the same format and discharge data:
datetime q
2003-05-29 36.2
2003-05-30 34.6
2003-05-31 33.1
2003-06-01 30.7
2003-06-02 30.0
2003-06-03 153.0
2003-06-04 69.0
...
The second data frame is much larger.
I want to merge/join only the following rows of the second data frame to the first:
all rows that have the same date as the first frame (I know this can be done with left_join(df1, df2, by = c("datetime")))
two rows before that row
n-1 rows after that row, where n = the "length" value of that row in the first data frame.
I would like to identify the rows belonging to the same event as well.
Ideally I would have the following output (notice the event from 2003-06-17):
EventDatesNancy length q event#
2003-06-03 1 153.0 1
2003-06-07 1 120.0 2
2003-06-13 1 45.3 3
2003-06-15 na 110.0 4
2003-06-16 na 53.1 4
2003-06-17 3 78.0 4
2003-06-18 na 167.0 4
2003-06-19 na 145.0 4
...
I hope this makes clear what I am trying to do.
This might be one approach using tidyverse and fuzzyjoin.
First, indicate event numbers in your first data.frame, and add two columns for the start and end dates (the start date is 2 days before the event date, and the end date is length - 1 days after it).
Then, you can use fuzzy_inner_join to get the selected rows from the second data.frame. Here, you want the rows where datetime in the second data.frame falls between the start date and the end date from the first data.frame; the two match functions are applied pairwise to the by pairs, i.e. start_date <= datetime and end_date >= datetime.
library(tidyverse)
library(fuzzyjoin)
df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2
df1$end_date <- df1$datetime + df1$length - 1
fuzzy_inner_join(
df1,
df2,
by = c("start_date" = "datetime", "end_date" = "datetime"),
match_fun = c(`<=`, `>=`)
) %>%
select(datetime.y, length, q, event)
I tried this out with some made up data:
R> df1
datetime length
1 2003-06-03 1
2 2003-06-12 1
3 2003-06-21 1
4 2003-06-30 3
5 2003-07-09 5
6 2003-07-18 1
7 2003-07-27 1
8 2003-08-05 2
9 2003-08-14 1
10 2003-08-23 1
11 2003-09-01 3
R> df2
datetime q
1 2003-06-03 44
2 2003-06-04 52
3 2003-06-05 34
4 2003-06-06 20
5 2003-06-07 57
6 2003-06-08 67
7 2003-06-09 63
8 2003-06-10 51
9 2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16 6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03 9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70
This will give the following output:
datetime.y length q event
1 2003-06-03 1 44 1
2 2003-06-10 1 51 2
3 2003-06-11 1 56 2
4 2003-06-12 1 37 2
5 2003-06-19 1 61 3
6 2003-06-20 1 42 3
7 2003-06-21 1 28 3
8 2003-06-28 3 38 4
9 2003-06-29 3 27 4
10 2003-06-30 3 49 4
11 2003-07-01 3 10 4
12 2003-07-02 3 89 4
13 2003-07-07 5 31 5
14 2003-07-08 5 29 5
15 2003-07-09 5 84 5
16 2003-07-10 5 60 5
17 2003-07-11 5 19 5
18 2003-07-12 5 97 5
19 2003-07-13 5 35 5
If the desired output is different from the above, please let me know what should change so that I can correct it.
Data
df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233,
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"),
length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11,
start_date = structure(c(12204, 12213, 12222, 12231, 12240,
12249, 12258, 12267, 12276, 12285, 12294), class = "Date"),
end_date = structure(c(12206, 12215, 12224, 12235, 12246,
12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")
df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209,
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218,
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227,
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236,
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245,
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L,
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L,
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L,
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))
I have multi-column data as follows. I want to remove rows having duplicate values in the depth column.
Date Levels values depth
1 2005-12-31 1 182.80 0
2 2005-12-31 2 182.80 0
3 2005-12-31 5 182.80 2
4 2005-12-31 6 182.80 2
5 2005-12-31 7 182.80 2
6 2005-12-31 8 182.80 3
7 2005-12-31 9 182.80 4
8 2005-12-31 10 182.80 4
9 2005-12-31 11 182.80 5
10 2005-12-31 13 182.70 7
11 2005-12-31 14 182.70 8
12 2005-12-31 16 182.60 10
13 2005-12-31 17 182.50 12
14 2005-12-31 20 181.50 17
15 2005-12-31 23 177.50 23
16 2005-12-31 26 165.90 31
17 2005-12-31 28 155.00 36
18 2005-12-31 29 149.20 40
19 2005-12-31 31 136.90 46
20 2005-12-31 33 126.10 53
21 2005-12-31 35 112.70 60
22 2005-12-31 38 88.23 70
23 2005-12-31 41 67.99 79
24 2005-12-31 44 54.63 87
25 2005-12-31 49 45.40 98
26 2006-12-31 1 182.80 0
27 2006-12-31 2 182.80 0
28 2006-12-31 5 182.80 2
29 2006-12-31 6 182.80 2
30 2006-12-31 7 182.80 2
31 2006-12-31 8 182.80 3
32 2006-12-31 9 182.80 4
33 2006-12-31 10 182.80 4
34 2006-12-31 11 182.70 5
35 2006-12-31 13 182.70 7
36 2006-12-31 14 182.70 8
37 2006-12-31 16 182.60 10
38 2006-12-31 17 182.50 12
39 2006-12-31 20 181.50 17
40 2006-12-31 23 178.60 23
41 2006-12-31 26 168.70 31
42 2006-12-31 28 156.90 36
43 2006-12-31 29 150.40 40
44 2006-12-31 31 137.10 46
45 2006-12-31 33 126.00 53
46 2006-12-31 35 112.70 60
47 2006-12-31 38 91.80 70
48 2006-12-31 41 75.91 79
49 2006-12-31 44 65.17 87
50 2006-12-31 49 58.33 98
I know how to remove duplicates based on a column, as follows:
nodup<- distinct(df, column, .keep_all = TRUE)
But how can I apply this within every 25-row interval?
base R
do.call(rbind, by(dat, (seq_len(nrow(dat))-1) %/% 25,
function(z) z[!duplicated(z$depth),]))
# Date Levels values depth
# 0.1 2005-12-31 1 182.8 0
# 0.3 2005-12-31 5 182.8 2
# 0.6 2005-12-31 8 182.8 3
# 0.7 2005-12-31 9 182.8 4
# 0.9 2005-12-31 11 182.8 5
# 0.10 2005-12-31 13 182.7 7
# 0.11 2005-12-31 14 182.7 8
# 0.12 2005-12-31 16 182.6 10
# 0.13 2005-12-31 17 182.5 12
# 0.14 2005-12-31 20 181.5 17
# 0.15 2005-12-31 23 177.5 23
# 0.16 2005-12-31 26 165.9 31
# 0.17 2005-12-31 28 155.0 36
# 0.18 2005-12-31 29 149.2 40
# 0.19 2005-12-31 31 136.9 46
# 0.20 2005-12-31 33 126.1 53
# 0.21 2005-12-31 35 112.7 60
# 0.22 2005-12-31 38 88.2 70
# 0.23 2005-12-31 41 68.0 79
# 0.24 2005-12-31 44 54.6 87
# 0.25 2005-12-31 49 45.4 98
# 1.26 2006-12-31 1 182.8 0
# 1.28 2006-12-31 5 182.8 2
# 1.31 2006-12-31 8 182.8 3
# 1.32 2006-12-31 9 182.8 4
# 1.34 2006-12-31 11 182.7 5
# 1.35 2006-12-31 13 182.7 7
# 1.36 2006-12-31 14 182.7 8
# 1.37 2006-12-31 16 182.6 10
# 1.38 2006-12-31 17 182.5 12
# 1.39 2006-12-31 20 181.5 17
# 1.40 2006-12-31 23 178.6 23
# 1.41 2006-12-31 26 168.7 31
# 1.42 2006-12-31 28 156.9 36
# 1.43 2006-12-31 29 150.4 40
# 1.44 2006-12-31 31 137.1 46
# 1.45 2006-12-31 33 126.0 53
# 1.46 2006-12-31 35 112.7 60
# 1.47 2006-12-31 38 91.8 70
# 1.48 2006-12-31 41 75.9 79
# 1.49 2006-12-31 44 65.2 87
# 1.50 2006-12-31 49 58.3 98
or
dat[!ave(dat$depth, (seq_len(nrow(dat))-1) %/% 25, FUN = duplicated),]
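Both variants rely on the same grouping key: (seq_len(nrow(dat))-1) %/% 25 maps rows 1-25 to block 0, rows 26-50 to block 1, and so on, so duplicates are removed only within each 25-row chunk. A quick check of the pattern:
(seq_len(50) - 1) %/% 25
# 25 zeros followed by 25 ones: rows 1-25 -> block 0, rows 26-50 -> block 1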
dplyr
library(dplyr)
dat %>%
group_by(grp = (seq_len(n())-1) %/% 25) %>%
distinct(depth, .keep_all = TRUE) %>%
ungroup() %>%
select(-grp)
# # A tibble: 42 x 4
# Date Levels values depth
# <chr> <int> <dbl> <int>
# 1 2005-12-31 1 183. 0
# 2 2005-12-31 5 183. 2
# 3 2005-12-31 8 183. 3
# 4 2005-12-31 9 183. 4
# 5 2005-12-31 11 183. 5
# 6 2005-12-31 13 183. 7
# 7 2005-12-31 14 183. 8
# 8 2005-12-31 16 183. 10
# 9 2005-12-31 17 182. 12
# 10 2005-12-31 20 182. 17
# # ... with 32 more rows
data.table
library(data.table)
as.data.table(dat)[, .SD[!duplicated(depth),], by=.( (seq_len(nrow(dat))-1) %/% 25 ) ][,-1]
(The [,-1] on the end is because the by= grouping operation implicitly prepends the seq_len(.)... counter as its first column.)
(Notice a theme? :-)
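A data.table variant that avoids the appended group column entirely is to compute the surviving row indices with .I and subset once; a sketch using the same grouping key as above:
library(data.table)
DT <- as.data.table(dat)
# .I returns the original row numbers of the non-duplicated rows within each block
idx <- DT[, .I[!duplicated(depth)], by = .((seq_len(nrow(DT)) - 1) %/% 25)]$V1
DT[idx]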
Data
dat <- structure(list(Date = c("2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2005-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31", "2006-12-31"), Levels = c(1L, 2L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L, 14L, 16L, 17L, 20L, 23L, 26L, 28L, 29L, 31L, 33L, 35L, 38L, 41L, 44L, 49L, 1L, 2L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L, 14L, 16L, 17L, 20L, 23L, 26L, 28L, 29L, 31L, 33L, 35L, 38L, 41L, 44L, 49L), values = c(182.8, 182.8, 182.8, 182.8, 182.8, 182.8, 182.8, 182.8, 182.8, 182.7, 182.7, 182.6, 182.5, 181.5, 177.5, 165.9, 155, 149.2, 136.9, 126.1, 112.7, 88.23, 67.99, 54.63, 45.4, 182.8, 182.8, 182.8, 182.8, 182.8, 182.8, 182.8, 182.8, 182.7, 182.7, 182.7, 182.6, 182.5, 181.5, 178.6, 168.7, 156.9, 150.4, 137.1, 126, 112.7, 91.8, 75.91, 65.17, 58.33), depth = c(0L, 0L, 2L, 2L, 2L, 3L, 4L, 4L, 5L, 7L, 8L, 10L, 12L, 17L, 23L, 31L, 36L, 40L, 46L, 53L, 60L, 70L, 79L, 87L, 98L, 0L, 0L, 2L, 2L, 2L, 3L, 4L, 4L, 5L, 7L, 8L, 10L, 12L, 17L, 23L, 31L, 36L, 40L, 46L, 53L, 60L, 70L, 79L, 87L, 98L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50"))
We could use order and !duplicated:
df = df[order(df[,'depth']),]
df = df[!duplicated(df$depth),]
df
Date Levels values depth
<date> <dbl> <dbl> <dbl>
1 2005-12-31 1 183. 0
2 2005-12-31 5 183. 2
3 2005-12-31 8 183. 3
4 2005-12-31 9 183. 4
5 2005-12-31 11 183. 5
6 2005-12-31 13 183. 7
7 2005-12-31 14 183. 8
8 2005-12-31 16 183. 10
9 2005-12-31 17 182. 12
10 2005-12-31 20 182. 17
# … with 11 more rows
I'd like to plot a continuous vector as discrete values, so I'm trying to discretize it by turning it into factor ranges. The vector contains doubles between 0 and 1, and I'm using the cut function.
data:
structure(list(label = c("WP_078201646.1..87-312", "WP_077753210.1..91-300",
"WP_044287879.1..90-306", "WP_046711496.1..56-299", "WP_069060785.1..87-301",
"WP_011394873.1..91-301", "WP_015146987.1..159-358", "WP_085748967.1..86-314",
"NP_696283.1..85-318", "WP_011925568.1..89-315", "WP_013040867.1..89-307",
"WP_062116680.1..85-302", "WP_082057246.1..88-313", "WP_079078020.1..79-301",
"WP_043081767.1..100-292", "WP_085760186.1..96-309", "WP_052427986.1..92-305",
"WP_071039302.1..84-306", "WP_012939355.1..84-312", "WP_012630775.1..85-305"
), full = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), e15 = c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), e20 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), id_0cov_0.8evalue_0.001 = c(1L, 2L, 4L, 5L, 6L,
9L, 11L, 13L, 14L, 17L, 19L, 22L, 23L, 25L, 31L, 37L, 38L, 42L,
44L, 45L), `archConsensus1e-3` = c("LysR_substrate", "LysR_substrate",
"LysR_substrate", "LysR_substrate", "LysR_substrate", "LysR_substrate",
"PBP_like", "LysR_substrate", "LysR_substrate", "LysR_substrate",
"LysR_substrate", "LysR_substrate", "LysR_substrate", "LysR_substrate",
"LysR_substrate", "LysR_substrate", "LysR_substrate", "LysR_substrate",
"LysR_substrate", "LysR_substrate"), hhArch = c("LysR_substrate",
"LysR_substrate", "LysR_substrate", "LysR_substrate", "LysR_substrate",
"LysR_substrate", "PBP_like", "LysR_substrate", "LysR_substrate",
"LysR_substrate", "LysR_substrate", "LysR_substrate", "LysR_substrate",
"LysR_substrate", "LysR_substrate", "LysR_substrate", "LysR_substrate",
"LysR_substrate", "LysR_substrate", "LysR_substrate"), cache_rate = c(0.00383141762452107,
0, 0, 0.0123681338668607, 0.00512820512820513, 0.0254545454545455,
0.00940438871473354, 0, 0.0571428571428571, 0.00519930675909879,
0, 0.00363636363636364, 0.0357142857142857, 0, 0, 0, 0.0535714285714286,
0, 0.00393700787401575, 0), groupsize = c(261L, 28L, 351L, 2749L,
195L, 275L, 638L, 55L, 525L, 577L, 16L, 275L, 196L, 68L, 3L,
26L, 56L, 512L, 254L, 245L), `periprate1e-3` = c(0.0613026819923372,
0.285714285714286, 0.247863247863248, 0.182975627500909, 0.0358974358974359,
0.254545454545455, 0.0125391849529781, 0, 0.157794676806084,
0.131715771230503, 0.0625, 0.0654545454545455, 0.38265306122449,
0.0735294117647059, 0, 0.0384615384615385, 0.0535714285714286,
0.09765625, 0.259842519685039, 0.257142857142857)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The code I tried first was:
library(tidyverse)
data %>%
mutate(
cache_rate = cut(cache_rate, breaks = seq(0 , 1, by = 0.1)),
`periprate1e-3` = cut(`periprate1e-3`, breaks = seq(0 , 1, by = 0.1))
)
But it gives me some NA values:
# A tibble: 20 x 10
label full e15 e20 id_0cov_0.8evalue_0… `archConsensus1e… hhArch cache_rate groupsize `periprate1e-3`
<chr> <int> <int> <int> <int> <chr> <chr> <fct> <int> <fct>
1 WP_078201646.… 1 2 1 1 LysR_substrate LysR_subs… (0,0.1] 261 (0,0.1]
2 WP_077753210.… 1 2 1 2 LysR_substrate LysR_subs… NA 28 (0.2,0.3]
3 WP_044287879.… 1 2 1 4 LysR_substrate LysR_subs… NA 351 (0.2,0.3]
4 WP_046711496.… 1 2 1 5 LysR_substrate LysR_subs… (0,0.1] 2749 (0.1,0.2]
5 WP_069060785.… 1 2 1 6 LysR_substrate LysR_subs… (0,0.1] 195 (0,0.1]
6 WP_011394873.… 1 2 1 9 LysR_substrate LysR_subs… (0,0.1] 275 (0.2,0.3]
7 WP_015146987.… 1 2 1 11 PBP_like PBP_like (0,0.1] 638 (0,0.1]
8 WP_085748967.… 1 2 1 13 LysR_substrate LysR_subs… NA 55 NA
9 NP_696283.1..… 1 2 1 14 LysR_substrate LysR_subs… (0,0.1] 525 (0.1,0.2]
10 WP_011925568.… 1 2 1 17 LysR_substrate LysR_subs… (0,0.1] 577 (0.1,0.2]
11 WP_013040867.… 1 2 1 19 LysR_substrate LysR_subs… NA 16 (0,0.1]
12 WP_062116680.… 1 2 1 22 LysR_substrate LysR_subs… (0,0.1] 275 (0,0.1]
13 WP_082057246.… 1 2 1 23 LysR_substrate LysR_subs… (0,0.1] 196 (0.3,0.4]
14 WP_079078020.… 1 2 1 25 LysR_substrate LysR_subs… NA 68 (0,0.1]
15 WP_043081767.… 1 2 1 31 LysR_substrate LysR_subs… NA 3 NA
16 WP_085760186.… 1 2 1 37 LysR_substrate LysR_subs… NA 26 (0,0.1]
17 WP_052427986.… 1 2 1 38 LysR_substrate LysR_subs… (0,0.1] 56 (0,0.1]
18 WP_071039302.… 1 2 1 42 LysR_substrate LysR_subs… NA 512 (0,0.1]
19 WP_012939355.… 1 2 1 44 LysR_substrate LysR_subs… (0,0.1] 254 (0.2,0.3]
20 WP_012630775.… 1 2 1 45 LysR_substrate LysR_subs… NA 245 (0.2,0.3]
Then I tried to fix it by changing the range inside the cut function:
data %>%
mutate(
cache_rate = cut(cache_rate, breaks = seq(-0.9 , 1, by = 0.1)),
`periprate1e-3` = cut(`periprate1e-3`, breaks = seq(-0.9 , 1, by = 0.1))
)
But the result is not great to plot, given the negative interval labels:
# A tibble: 20 x 10
label full e15 e20 id_0cov_0.8evalue_0… `archConsensus1e… hhArch cache_rate groupsize `periprate1e-3`
<chr> <int> <int> <int> <int> <chr> <chr> <fct> <int> <fct>
1 WP_078201646.… 1 2 1 1 LysR_substrate LysR_subs… (0,0.1] 261 (0,0.1]
2 WP_077753210.… 1 2 1 2 LysR_substrate LysR_subs… (-0.1,0] 28 (0.2,0.3]
3 WP_044287879.… 1 2 1 4 LysR_substrate LysR_subs… (-0.1,0] 351 (0.2,0.3]
4 WP_046711496.… 1 2 1 5 LysR_substrate LysR_subs… (0,0.1] 2749 (0.1,0.2]
5 WP_069060785.… 1 2 1 6 LysR_substrate LysR_subs… (0,0.1] 195 (0,0.1]
6 WP_011394873.… 1 2 1 9 LysR_substrate LysR_subs… (0,0.1] 275 (0.2,0.3]
7 WP_015146987.… 1 2 1 11 PBP_like PBP_like (0,0.1] 638 (0,0.1]
8 WP_085748967.… 1 2 1 13 LysR_substrate LysR_subs… (-0.1,0] 55 (-0.1,0]
9 NP_696283.1..… 1 2 1 14 LysR_substrate LysR_subs… (0,0.1] 525 (0.1,0.2]
10 WP_011925568.… 1 2 1 17 LysR_substrate LysR_subs… (0,0.1] 577 (0.1,0.2]
11 WP_013040867.… 1 2 1 19 LysR_substrate LysR_subs… (-0.1,0] 16 (0,0.1]
12 WP_062116680.… 1 2 1 22 LysR_substrate LysR_subs… (0,0.1] 275 (0,0.1]
13 WP_082057246.… 1 2 1 23 LysR_substrate LysR_subs… (0,0.1] 196 (0.3,0.4]
14 WP_079078020.… 1 2 1 25 LysR_substrate LysR_subs… (-0.1,0] 68 (0,0.1]
15 WP_043081767.… 1 2 1 31 LysR_substrate LysR_subs… (-0.1,0] 3 (-0.1,0]
16 WP_085760186.… 1 2 1 37 LysR_substrate LysR_subs… (-0.1,0] 26 (0,0.1]
17 WP_052427986.… 1 2 1 38 LysR_substrate LysR_subs… (0,0.1] 56 (0,0.1]
18 WP_071039302.… 1 2 1 42 LysR_substrate LysR_subs… (-0.1,0] 512 (0,0.1]
19 WP_012939355.… 1 2 1 44 LysR_substrate LysR_subs… (0,0.1] 254 (0.2,0.3]
20 WP_012630775.… 1 2 1 45 LysR_substrate LysR_subs… (-0.1,0] 245 (0.2,0.3]
data %>%
mutate(
cache_rate2 = cut(cache_rate, breaks = seq(-0.9 , 1, by = 0.1)),
`periprate1e-3_2` = cut(`periprate1e-3`, breaks = seq(-0.9 , 1, by = 0.1))
) %>%
ggplot(aes(cache_rate, `periprate1e-3`, color = cache_rate2, shape = `periprate1e-3_2`)) +
geom_point()
How do I discretize this vector without a mutate filled with unwieldy case_when calls?
Thanks in advance
You are getting NA because cut by default creates intervals that are open on the left, so values exactly equal to the lowest break (here, 0) are excluded. If you add include.lowest = TRUE, your problem disappears:
data %>%
mutate(
cache_rate = cut(cache_rate, breaks = 0:10/10, include.lowest = TRUE),
`periprate1e-3` = cut(`periprate1e-3`, breaks = 0:10/10, include.lowest = TRUE)
)
#> # A tibble: 20 x 10
#> label full e15 e20 id_0cov_0.8eval~ `archConsensus1~ hhArch cache_rate
#> <chr> <int> <int> <int> <int> <chr> <chr> <fct>
#> 1 WP_0~ 1 2 1 1 LysR_substrate LysR_~ [0,0.1]
#> 2 WP_0~ 1 2 1 2 LysR_substrate LysR_~ [0,0.1]
#> 3 WP_0~ 1 2 1 4 LysR_substrate LysR_~ [0,0.1]
#> 4 WP_0~ 1 2 1 5 LysR_substrate LysR_~ [0,0.1]
#> 5 WP_0~ 1 2 1 6 LysR_substrate LysR_~ [0,0.1]
#> 6 WP_0~ 1 2 1 9 LysR_substrate LysR_~ [0,0.1]
#> 7 WP_0~ 1 2 1 11 PBP_like PBP_l~ [0,0.1]
#> 8 WP_0~ 1 2 1 13 LysR_substrate LysR_~ [0,0.1]
#> 9 NP_6~ 1 2 1 14 LysR_substrate LysR_~ [0,0.1]
#> 10 WP_0~ 1 2 1 17 LysR_substrate LysR_~ [0,0.1]
#> 11 WP_0~ 1 2 1 19 LysR_substrate LysR_~ [0,0.1]
#> 12 WP_0~ 1 2 1 22 LysR_substrate LysR_~ [0,0.1]
#> 13 WP_0~ 1 2 1 23 LysR_substrate LysR_~ [0,0.1]
#> 14 WP_0~ 1 2 1 25 LysR_substrate LysR_~ [0,0.1]
#> 15 WP_0~ 1 2 1 31 LysR_substrate LysR_~ [0,0.1]
#> 16 WP_0~ 1 2 1 37 LysR_substrate LysR_~ [0,0.1]
#> 17 WP_0~ 1 2 1 38 LysR_substrate LysR_~ [0,0.1]
#> 18 WP_0~ 1 2 1 42 LysR_substrate LysR_~ [0,0.1]
#> 19 WP_0~ 1 2 1 44 LysR_substrate LysR_~ [0,0.1]
#> 20 WP_0~ 1 2 1 45 LysR_substrate LysR_~ [0,0.1]
#> # ... with 2 more variables: groupsize <int>, `periprate1e-3` <fct>
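For the plotting step, you may prefer to keep the original numeric columns and add the binned versions alongside them, mirroring the ggplot call in the question (the *_bin names here are just for illustration; tidyverse is assumed loaded as in the question):
data %>%
  mutate(
    cache_rate_bin = cut(cache_rate, breaks = 0:10/10, include.lowest = TRUE),
    periprate_bin  = cut(`periprate1e-3`, breaks = 0:10/10, include.lowest = TRUE)
  ) %>%
  ggplot(aes(cache_rate, `periprate1e-3`, color = cache_rate_bin, shape = periprate_bin)) +
  geom_point()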
From the table below I need to combine the rows by calculating the average value for those rows with the same ID (column 2).
I was thinking of a plyr function, something like:
ddply(df, summarize, value = average(ID))
df:
miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G 134R 141G 141R 167G 167R 185G 185R
1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703 442 715 309 546 113 358 309
2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13 3 6 3 6 4 7 5
3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1 1 3 2 2 0 2 1
4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4 1 3 0 0 2 2 4
5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32 21 23 25 10 14 32 55
Summary of original data:
> str(ClusterMatrix)
'data.frame': 113 obs. of 98 variables:
$ miRNA: Factor w/ 202 levels "hsa-miR-106a",..: 1 3 4 6 8 8 14 15 15 16 ...
$ ID : Factor w/ 27 levels "ID1","ID10","ID11",..: 25 12 12 12 21 21 12 21 21 6 ...
$ 100G : Factor w/ 308 levels "-0.307749042739963",..: 279 11 3 3 101 42 139 158 215 222 ...
$ 100R : Factor w/ 316 levels "-0.138028803567403",..: 207 7 8 8 18 42 128 183 232 209 ...
$ 106G : Factor w/ 260 levels "-0.103556709881933",..: 171 4 1 3 7 258 95 110 149 162 ...
$ 106R : Factor w/ 300 levels "-0.141810346640204",..: 141 4 6 2 108 41 146 196 244 267 ...
$ 122G : Factor w/ 336 levels "-0.0409548922061764",..: 237 12 4 6 103 47 148 203 257 264 ...
$ 122R : Factor w/ 316 levels "-0.135708706475279",..: 177 1 8 6 36 44 131 192 239 244 ...
$ 124G : Factor w/ 267 levels "-0.348439853247856",..: 210 5 2 3 7 50 126 138 188 249 ...
$ 124R : Factor w/ 303 levels "-0.176414190219115",..: 193 3 7 3 21 52 167 200 238 239 ...
$ 126G : Factor w/ 307 levels "-0.227658806811544",..: 122 88 5 76 169 61 240 220 281 265 ...
$ 126R : Factor w/ 249 levels "-0.271925865853123",..: 119 1 2 3 11 247 78 110 151 193 ...
$ 134G : Factor w/ 344 levels "-0.106333543799583",..: 304 14 8 5 33 48 150 196 248 231 ...
$ 134R : Factor w/ 300 levels "-0.0997616469801097",..: 183 5 7 7 22 298 113 159 213 221 ...
$ 141G : Factor w/ 335 levels "-0.134429748398679",..: 253 7 3 3 24 29 142 137 223 302 ...
$ 141R : Factor w/ 314 levels "-0.143299688877927",..: 210 4 5 7 98 54 154 199 255 251 ...
$ 167G : Factor w/ 306 levels "-0.211181452126958",..: 222 7 4 6 11 292 91 101 175 226 ...
$ 167R : Factor w/ 282 levels "-0.0490740880560127",..: 130 2 6 4 15 282 110 146 196 197 ...
$ 185G : Factor w/ 317 levels "-0.0567841338235346",..: 218 2 7 7 33 34 130 194 227 259 ...
We can use dplyr. We group by 'ID' and use mutate_each to create columns holding the mean of '100G' through '185R', selecting those columns inside mutate_each with a regex pattern in matches. Then we cbind (bind_cols) the original dataset with the mean columns, convert to data.frame if needed, and rename the mean columns.
library(dplyr)
out <- df1 %>%
group_by(ID) %>%
mutate_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+')) %>%
setNames(., c(names(.)[1:2], paste0('Mean_', names(.)[3:ncol(.)]))) %>%
as.data.frame()
out1 <- bind_cols(df1, out[-(1:2)])
out1
# miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G
#1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703
#2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13
#3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1
#4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4
#5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32
# 134R 141G 141R 167G 167R 185G 185R Mean_100G Mean_100R Mean_106G
#1 442 715 309 546 113 358 309 1585.000000 423.0000000 180.000000
#2 3 6 3 6 4 7 5 4.666667 0.3333333 1.666667
#3 1 3 2 2 0 2 1 4.666667 0.3333333 1.666667
#4 1 3 0 0 2 2 4 4.666667 0.3333333 1.666667
#5 21 23 25 10 14 32 55 29.000000 17.0000000 6.000000
# Mean_106R Mean_122G Mean_122R Mean_124G Mean_124R Mean_126G Mean_126R
#1 113 598.000000 266 227.000000 242.000000 70 106
#2 3 6.333333 4 1.666667 2.666667 16 1
#3 3 6.333333 4 1.666667 2.666667 16 1
#4 3 6.333333 4 1.666667 2.666667 16 1
#5 55 40.000000 35 6.000000 20.000000 171 10
# Mean_134G Mean_134R Mean_141G Mean_141R Mean_167G Mean_167R Mean_185G
#1 2703 442.000000 715 309.000000 546.000000 113 358.000000
#2 6 1.666667 4 1.666667 2.666667 2 3.666667
#3 6 1.666667 4 1.666667 2.666667 2 3.666667
#4 6 1.666667 4 1.666667 2.666667 2 3.666667
#5 32 21.000000 23 25.000000 10.000000 14 32.000000
# Mean_185R
#1 309.000000
#2 3.333333
#3 3.333333
#4 3.333333
#5 55.000000
EDIT: If we need a single mean row for each 'ID' instead, we can use summarise_each:
df1 %>%
group_by(ID) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+'))
EDIT 2: Based on the OP's update, the columns of the original dataset ('ClusterMatrix') are all of factor class. We need to convert the columns to numeric before taking the mean. There are two options to convert a factor to numeric: 1) as.numeric(as.character(x)), which may be a bit slower, and 2) as.numeric(levels(x))[x], which is faster. Here I am using the first method, as it may be clearer.
ClusterMatrix %>%
group_by(ID) %>%
summarise_each(funs(mean= mean(as.numeric(as.character(.)),
na.rm=TRUE)), matches('^\\d+'))
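A note for readers on current dplyr (>= 1.0): mutate_each/summarise_each have since been superseded, and an equivalent of the last call written with across() might look like this (a sketch, not tested against the OP's full data):
library(dplyr)
ClusterMatrix %>%
  group_by(ID) %>%
  summarise(across(matches('^\\d+'),
                   ~ mean(as.numeric(as.character(.x)), na.rm = TRUE),
                   .names = "Mean_{.col}"))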
data
df1 <- structure(list(miRNA = c("hsa-miR-106a", "hsa-miR-1185-1",
"hsa-miR-1185-2",
"hsa-miR-1197", "hsa-miR-127"), ID = c("ID7", "ID2", "ID2", "ID2",
"ID3"), `100G` = c(1585L, 10L, 2L, 2L, 29L), `100R` = c(423L,
1L, 0L, 0L, 17L), `106G` = c(180L, 3L, 2L, 0L, 6L), `106R` = c(113L,
3L, 1L, 5L, 55L), `122G` = c(598L, 11L, 5L, 3L, 40L), `122R` = c(266L,
8L, 1L, 3L, 35L), `124G` = c(227L, 4L, 1L, 0L, 6L), `124R` = c(242L,
4L, 0L, 4L, 20L), `126G` = c(70L, 28L, 4L, 16L, 171L), `126R` = c(106L,
2L, 1L, 0L, 10L), `134G` = c(2703L, 13L, 1L, 4L, 32L), `134R` = c(442L,
3L, 1L, 1L, 21L), `141G` = c(715L, 6L, 3L, 3L, 23L), `141R` = c(309L,
3L, 2L, 0L, 25L), `167G` = c(546L, 6L, 2L, 0L, 10L), `167R` = c(113L,
4L, 0L, 2L, 14L), `185G` = c(358L, 7L, 2L, 2L, 32L), `185R` = c(309L,
5L, 1L, 4L, 55L)), .Names = c("miRNA", "ID", "100G", "100R",
"106G", "106R", "122G", "122R", "124G", "124R", "126G", "126R",
"134G", "134R", "141G", "141R", "167G", "167R", "185G", "185R"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
))
I'm looking to construct a moving average while aggregating a timeseries dataset over two categorical variables. While I've seen a few other tutorials, none of them seem to capture the specific task I'd like to achieve.
My original dataset (df) has rows for each individual (id) for a series of dates ranging from 0-180 (Days). Individuals can be members of one of two subsets of data (Group).
I then aggregate this data frame to get a daily mean for the two groups.
library(plyr)
summary <- ddply(df, .(Group,Days), summarise,
DV = mean(variable), resp=length(unique(Id)))
The next step, however, is to construct a moving average within the two groups. In the sample dataframe below, I've constructed a 5-day mean from the current day and the previous four days.
Group Days DV 5DayMA
exceeded 0 2859
exceeded 1 2948
exceeded 2 4412
exceeded 3 5074
exceeded 4 5098 4078
exceeded 5 5147 4536
exceeded 6 4459 4838
exceeded 7 4730 4902
exceeded 8 4643 4815
exceeded 9 4698 4735
exceeded 10 4818 4670
exceeded 11 4521 4682
othergroup 0 2859
othergroup 1 2948
othergroup 2 4412
othergroup 3 5074
othergroup 4 5098 4078
othergroup 5 5147 4536
othergroup 6 4459 4838
othergroup 7 4730 4902
othergroup 8 4643 4815
othergroup 9 4698 4735
othergroup 10 4818 4670
othergroup 11 4521 4682
Any thoughts on how to do this?
You could try zoo::rollmean
df <- structure(list(Group = c("exceeded", "exceeded", "exceeded",
"exceeded", "exceeded", "exceeded", "exceeded", "exceeded", "exceeded",
"exceeded", "exceeded", "exceeded", "othergroup", "othergroup",
"othergroup", "othergroup", "othergroup", "othergroup", "othergroup",
"othergroup", "othergroup", "othergroup", "othergroup", "othergroup"
), Days = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), DV = c(2859L,
2948L, 4412L, 5074L, 5098L, 5147L, 4459L, 4730L, 4643L, 4698L,
4818L, 4521L, 2859L, 2948L, 4412L, 5074L, 5098L, 5147L, 4459L,
4730L, 4643L, 4698L, 4818L, 4521L), X5DayMA = c(NA, NA, NA, NA,
4078L, 4536L, 4838L, 4902L, 4815L, 4735L, 4670L, 4682L, NA, NA,
NA, NA, 4078L, 4536L, 4838L, 4902L, 4815L, 4735L, 4670L, 4682L
)), .Names = c("Group", "Days", "DV", "X5DayMA"), class = "data.frame", row.names = c(NA,
-24L))
head(df)
Group Days DV X5DayMA
1 exceeded 0 2859 NA
2 exceeded 1 2948 NA
3 exceeded 2 4412 NA
4 exceeded 3 5074 NA
5 exceeded 4 5098 4078
6 exceeded 5 5147 4536
library(plyr)
library(zoo)
ddply(
df, "Group",
transform,
`5daymean` = rollmean(DV, 5, align="right", na.pad=TRUE ))
Group Days DV X5DayMA 5daymean
1 exceeded 0 2859 NA NA
2 exceeded 1 2948 NA NA
3 exceeded 2 4412 NA NA
4 exceeded 3 5074 NA NA
5 exceeded 4 5098 4078 4078.2
6 exceeded 5 5147 4536 4535.8
7 exceeded 6 4459 4838 4838.0
8 exceeded 7 4730 4902 4901.6
9 exceeded 8 4643 4815 4815.4
10 exceeded 9 4698 4735 4735.4
11 exceeded 10 4818 4670 4669.6
12 exceeded 11 4521 4682 4682.0
13 othergroup 0 2859 NA NA
14 othergroup 1 2948 NA NA
15 othergroup 2 4412 NA NA
16 othergroup 3 5074 NA NA
17 othergroup 4 5098 4078 4078.2
18 othergroup 5 5147 4536 4535.8
19 othergroup 6 4459 4838 4838.0
20 othergroup 7 4730 4902 4901.6
21 othergroup 8 4643 4815 4815.4
22 othergroup 9 4698 4735 4735.4
23 othergroup 10 4818 4670 4669.6
24 othergroup 11 4521 4682 4682.0
or even faster with dplyr (shown with %>%; the %.% chaining operator from very old dplyr versions is deprecated):
library(dplyr)
df %>%
group_by(Group) %>%
mutate(`5daymean` = rollmean(DV, 5, align="right", na.pad=TRUE ))
OR the super fast data.table
library(data.table)
dft <- data.table(df)
dft[ , `:=` ('5daymean' = rollmean(DV, 5, align="right", na.pad=TRUE )) , by=Group ]
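Recent data.table versions (>= 1.12.0) also provide their own rolling mean, frollmean, which removes the zoo dependency; a minimal sketch (the fiveday column name is my own):
library(data.table)
dft <- data.table(df)
# frollmean is right-aligned by default and fills the first n-1 values of each group with NA
dft[ , fiveday := frollmean(DV, 5), by = Group ]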
ave and stats::filter (qualified with stats:: so that dplyr's filter cannot mask it; rep(1/5, 5) with sides = 1 gives a trailing 5-term mean):
with(df, ave(DV, Group, FUN=function(x) stats::filter(x, rep(1/5,5), sides=1)))
# [1] NA NA NA NA 4078.2 4535.8 4838.0 4901.6 4815.4 4735.4
#[11] 4669.6 4682.0 NA NA NA NA 4078.2 4535.8 4838.0 4901.6
#[21] 4815.4 4735.4 4669.6 4682.0