My dataset has as features: players IDs, weeks and points.
I want to calculate the mean of points for previous weeks, but not all past weeks, just to the last 5 or less (if the current week is smaller than 5).
Example: For player_id = 5, week = 7, the result will be the average of POINTS for player_id = 5 and weeks 2, 3, 4, 5 and 6.
The following code already does the average for all previous week, so I need an adaptation to make it for just 5 previous week.
player_id<-c(rep(1,30),rep(2,30),rep(3,30),rep(4,30),rep(5,30))
week<-1:30
points<-round(runif(150,1,10),0)
mydata<- data.frame(player_id=player_id,week=rep(week,5),points)
mydata<-mydata %>%
group_by(player_id) %>% # the group to perform the stat on
arrange(week) %>% # order the weeks within each group
mutate(previous_mean = cummean(points) ) %>% # for each week get the
cumulative mean
mutate(previous_mean = lag(previous_mean) ) %>% # shift cumulative
mean back one week
arrange(player_id) # sort by player_id
HAVB's approach is great, but depending on what you want, here is another. This approach is adapted from this answer to a different question, but changed for your circumstances:
library(dplyr)
library(zoo)
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,30),rep(2,30),rep(3,30),rep(4,30),rep(5,30))
week<-1:30
points<-round(runif(150,1,10),0)
mydata<- data.frame(player_id=player_id,week=rep(week,5),points)
roll_mean <- function(x, k) {
result <- rollapplyr(x, k, mean, partial=TRUE, na.rm=TRUE)
result[is.nan(result)] <- NA
return( result )
}
mydata<- data.frame(player_id=player_id,week=rep(week,5),points)
mydata<-mydata %>%
group_by(player_id) %>%
arrange(week) %>%
mutate(rolling_mean = roll_mean(x=lag(points), k=5) ) %>%
arrange(player_id)
Then we can look at a subset to show it worked:
mydata[mydata$player_id %in% 1:2 & mydata$week %in% 1:6, ]
# A tibble: 12 x 4
# Groups: player_id [2]
player_id week points rolling_mean
<dbl> <int> <dbl> <dbl>
1 1 1 4 NA
2 1 2 8 4.000000
3 1 3 5 6.000000
4 1 4 9 5.666667
5 1 5 9 6.500000
6 1 6 1 7.000000
7 2 1 10 NA
8 2 2 9 10.000000
9 2 3 7 9.500000
10 2 4 8 8.666667
11 2 5 1 8.500000
12 2 6 5 7.000000
So we can see at each time t, rolling_mean for player i will be the mean of the points observations for player i at times {t - 1, ..., min(1, t - 5)}.
You can use slice to select just the last 5 weeks for each group. Try this:
player_id<-c(rep(1,30),rep(2,30),rep(3,30),rep(4,30),rep(5,30))
week<-1:30
points<-round(runif(150,1,10),0)
mydata<- data.frame(player_id=player_id,week=rep(week,5),points)
library(dplyr)
mydata <- mydata %>%
group_by(player_id) %>% # the group to perform the stat on
arrange(week) %>% # order the weeks within each group
slice( (n()-4):n() ) %>% # "slice" the last 5 rows (weeks) of every group
mutate(previous_mean = cummean(points) ) %>% # for each week get the cumulative mean
mutate(previous_mean = lag(previous_mean) ) %>% # shift cumulative mean back one week
arrange(player_id) # sort by player_id
The line
slice( (n()-4):n() )
selects rows within the range [(last row - 4) : last row], for each group
EDIT: To avoid trouble when the current week is less than 5, use an ifelse statement to validate:
mydata %>%
group_by(player_id) %>% # the group to perform the stat on
arrange(week) %>% # order the weeks within each group
slice(ifelse(n() < 5, 1:n(), n()-4):n()) %>% # "slice" the last 5 rows (weeks) of every group
mutate(previous_mean = cummean(points) ) %>% # for each week get the cumulative mean
mutate(previous_mean = lag(previous_mean) ) %>% # shift cumulative mean back one week
arrange(player_id) # sort by player_id
Related
I have a file that contains multiple individuals and multiple values for the same individual.
I need to remove the first 10 and last 10 values of each individual, putting all the leftover values in a new table.
This is what my data kinda looks like:
Cow Data
NL123456 123
NL123456 456
I tried doing a for-loop, counting per individual how many values there were (but I think, I already got stuck there, because I am not using the right command I think? All variables in Cow are a factor).
I figured removing the first and last had to be something like this:
data1[c(11: n-10),]
If you know you always have more than 20 datapoints by cow you can do the following, illustrated on the iris dataset :
library(dplyr)
dim(iris)
# [1] 150 5
iris_trimmed <-
iris %>%
group_by(Species) %>%
slice(11:(n()-10)) %>%
ungroup()
dim(iris_trimmed)
# [1] 90 5
On your data :
res <-
your_data %>%
group_by(Cow) %>%
slice(11:(n()-10)) %>%
ungroup()
In base R you can do :
iris_trimmed <- do.call(
rbind,
lapply(split(iris, iris$Species),
function(x) head(tail(x,-10),-10)))
dim(iris_trimmed)
# [1] 90 5
Using data.table:
library(data.table)
idt <- as.data.table(iris)
idt[, .SD[11:(.N-10)], Species]
Same logic in base R:
do.call(
rbind,
lapply(
split(iris, iris[["Species"]]),
function(x) x[11:(nrow(x)-10), ]
)
)
Here a solution with dplyr.
In my example I cut only the first and last values. (you can adapt it by changing 2 with any number in filter).
The idea is to add after you group_by id the number of row per each observation starting from the top (n) and in reverse from the bottom (n1), then you simply filter out.
library(dplyr)
data %>%
group_by(id) %>%
mutate(n=1:n(),
n1 = n():1) %>% # n and n1 are the row numbers
filter(n >= 2,n1 >= 2) %>% # change 2 with 10, or whatever
# filter() keeps only the rows that you want
select(-n, -n1) %>%
ungroup()
# # A tibble: 4 x 2
# id value
# <dbl> <int>
# 1 1 6
# 2 1 8
# 3 2 1
# 4 2 2
Data:
set.seed(123)
data <- data.frame(id = c(rep(1,4), rep(2,4)), value=sample(8))
data
# id value
# 1 1 3
# 2 1 6
# 3 1 8
# 4 1 5
# 5 2 4
# 6 2 1
# 7 2 2
# 8 2 7
I'm trying to modify a solution posted here Create cohort dropout rate table from raw data
I'd like to create a CUMULATIVE dropout rate table using these data.
DT<-data.table(
id =c (1,2,3,4,5,6,7,8,9,10,
11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
year =c (2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
cohort =c(1,1,1,1,1,1,1,1,1,1,
2,2,2,1,1,2,1,2,1,2,
1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
So far, I've been able to get to this point
library(tidyverse)
DT %>%
group_by(year) %>%
count(cohort) %>%
ungroup() %>%
spread(year, n) %>%
mutate(y2014_2015_dropouts = (`2014` - `2015`),
y2015_2016_dropouts = (`2015` - `2016`)) %>%
mutate(y2014_2015_cumulative =y2014_2015_dropouts/`2014`,
y2015_2016_cumulative =y2015_2016_dropouts/`2014`+y2014_2015_cumulative)%>%
replace_na(list(y2014_2015_dropouts = 0.0,
y2015_2016_dropouts = 0.0)) %>%
select(cohort, y2014_2015_dropouts, y2015_2016_dropouts, y2014_2015_cumulative,y2015_2016_cumulative )
A cumulative dropout rate table reflects the proportion of students within a class who dropped out of school across years.
# A tibble: 3 x 5
cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 2 0.6 0.8
2 2 0 2 NA NA
3 3 0 0 NA NA
>
The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.
I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.
Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:
Organize and order by cohort and year:
DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]
Assign the year range:
DT2[, year := paste(lag(year), year, sep = "_"),]
Get dropouts per year
DT2[, dropouts := ifelse(!is.na(lag(N)), lag(N) - N, 0), , cohort, ]
Get the cumulative sum of proportion dropped out each year per cohort:
DT2[, cumul := cumsum(dropouts) / max(N), cohort]
Output:
> DT2
cohort year N dropouts cumul
1: 1 NA_2014 10 0 0.0000000
2: 1 2014_2015 4 6 0.6000000
3: 1 2015_2016 2 2 0.8000000
4: 2 2016_2015 6 0 0.0000000
5: 2 2015_2016 4 2 0.3333333
6: 3 2016_2016 9 0 0.0000000
Because you spread your data by year early in your pipe and your 2014 columns have NA values for everything related to cohort 2, you need to coalesce the denominator in your calculation for y2015_2016_cumulative. If you replace the definition for that variable from the current
y2015_2016_cumulative =y2015_2016_dropouts/`2014`+y2014_2015_cumulative
to
y2015_2016_cumulative =y2015_2016_dropouts/coalesce(`2014`, `2015`) +
coalesce(y2014_2015_cumulative, 0)
you should be good to go. The coalesce function tries the first argument, but inputs the second argument if the first is NA. That being said, this current method isn't extremely scalable. You would have to add additional coalesce statements for every year you added. If you keep your data in the tidy format, you can keep a running list at the year-cohort level using
DT %>%
group_by(year) %>%
count(cohort) %>%
ungroup() %>%
group_by(cohort) %>%
mutate(dropouts = lag(n) - n,
dropout_rate = dropouts / max(n)) %>%
replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
mutate(cumulative_dropouts = cumsum(dropouts),
cumulative_dropout_rate = cumulative_dropouts / max(n))
I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
as.character %>%
rle %$%
lengths %>%
first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
No need of dplyrhere but you can modify this example to use it with dplyr. The key is the function rle
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A",]$len)
I tried asking this question before but was it was poorly stated. This is a new attempt cause I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser that shows how many points they have scored so far (as both winners and losers).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
test_data[i,]$winner_points_sum_loop <-
(
test_data %>%
dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
+
test_data %>%
dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
)
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions how to tackle this problem? The queries take quite some time when the row numbers add up. I've tried elaborating with the AVE function, I can do it for one column to sum a players point as winner but can't figure out how to add their points as loser.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
unite(winner, winner, winner_points) %>% # unite winner columns
unite(loser, loser, loser_points) %>% # unite loser columns
gather(type, pl_pts, winner, loser, -date) %>% # reshape
separate(pl_pts, c("player","points"), convert = T) %>% # separate columns
arrange(date) %>% # order dates (in case it's not)
group_by(player) %>% # for each player
mutate(sum_points = cumsum(points) - points) %>% # get points up to that date
ungroup() %>% # forget the grouping
unite(pl_pts_sumpts, player, points, sum_points) %>% # unite columns
spread(type, pl_pts_sumpts) %>% # reshape
separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum) # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
I finally understood what you want. And I took an approach of getting cumulative points of each player at each point in time and then joining it to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
gather(end_game_status, player_id, winner, loser) %>%
gather(which_point, how_many_points, winner_points, loser_points) %>%
filter(
(end_game_status == "winner" & which_point == "winner_points") |
(end_game_status == "loser" & which_point == "loser_points")) %>%
arrange(date = as.Date(date)) %>%
group_by(player_id) %>%
mutate(cumulative_points = cumsum(how_many_points)) %>%
mutate(cumulative_points_sofar = lag(cumulative_points, default = 0))
select(player_id, date, cumulative_points)
output <- test_data %>%
left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
rename(winner_points_sum = cumulative_points_sofar) %>%
left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
rename(loser_points_sum = cumulative_points_sofar)
output
The difference to the previous question of the OP is that the OP is now asking for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So, my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format whereby two value variables are reshaped simultaneously, computes the cumulative sums for each player id , and finally reshapes from long back to wide format, again. In order to sum only points scored before the actual date, the rows are lagged by one.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4
I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lenghts don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seem to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 3
Then the result would be:
Month Average
1 8/3
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want. It's regular R. The sapply function keeps the months separated by "name". The sum function applied to each dataframe will not keep the column sums separate. (Correction # 2: used only target_vars):
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581