There is an illustration of my example.
Sample data:
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"),
B = c(1, 5, 7, 23, 54, 202))
df
ID A B
1 1 foo 1
2 1 bar 5
3 2 foo 7
4 2 foo 23
5 3 bar 54
6 5 bar 202
What I want to do is to summarize, by ID, and count of the same IDs. Furthermore, I want frequencies of IDs in subgroups based values of B in different numeric ranges (number of observations with B>=0 & B<5, B>=5 & B<10, B>=10 & B<15, B>=15 & B<20 etc for all IDs).
I want this result:
ID count count_0_5 count_5_10 etc
1 1 2 1 1 etc
2 2 2 NA 1 etc
3 3 1 NA NA etc
4 5 1 NA NA etc
I tried this code using package dplyr:
df %>%
group_by(ID) %>%
summarize(count=n(), count_0_5 = n(B>=0 & B<5))
However, it returns this error:
`Error in n(B>=0 & B<5) :
unused argument (B>=0 & B<5)`
Perhaps replacing n(B>=0 & B<5) with sum(B>=0 & B<5)?
This will sum the number of cases where the two specified conditions are accomplished.
However, you'll get 0's instead of NA's. This can be settled by:
ifelse(sum(B>=0 & B<5)>0, sum(B>=0 & B<5), NA)
I'm pretty sure that there may be a better solution (more clearer and efficient), but this should work!
library(dplyr)
library(tidyr)
df %>% group_by(ID) %>%
mutate(B_cut = cut(B, c(0,5,10,15,20,1000), labels = c('count_0_5','count_5_10','count_10_15','count_15_20','count_20_1000')), count=n()) %>%
group_by(ID,B_cut) %>% mutate(n=n()) %>% slice(1) %>% select(-A,-B) %>%
spread(B_cut, n)
#2nd option
left_join(df %>% group_by(ID) %>% summarise(n=n()),
df %>% mutate(B_cut = cut(B, c(0,5,10,15,20,1000), labels = c('count_0_5','count_5_10','count_10_15','count_15_20','count_20_1000'))) %>%
count(ID,B_cut) %>% spread(B_cut,n),
by='ID')
# A tibble: 4 x 5
# Groups: ID [4]
ID count count_0_5 count_5_10 count_20_1000
<dbl> <int> <int> <int> <int>
1 1 2 2 NA NA
2 2 2 NA 1 1
3 3 1 NA NA 1
4 5 1 NA NA 1
My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number())) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3), fill = "extend") %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package:
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Due to argument partial = TRUE, also the rows that are below the desired window of length three are summed.
Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or since you know the window size before hand, you could check with row_number() as well
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))
This question already has answers here:
Select the first and last row by group in a data frame
(6 answers)
Closed 4 years ago.
I have some data similar in structure to:
a <- data.frame("ID" = c("A", "A", "B", "B", "C", "C"),
"NUM" = c(1, 2, 4, 3, 6, 9),
"VAL" = c(1, 0, 1, 0, 1, 0))
And I am trying to sort it by ID and NUM then get the last row.
This code works to get the last row and summarize down to a unique ID, however, it doesn't actually get the full last row like I want.
a <- a %>% arrange(ID, NUM) %>%
group_by(ID) %>%
summarise(max(NUM))
I understand why this code doesn't work but am looking for the dplyr way of getting the last row for each unique ID
Expected Results:
ID NUM VAL
<fct <dbl> <dbl>
1 A 2 0
2 B 4 1
3 C 9 0
Note: I will admit that though it is nearly a duplicate of Select first and last row from grouped data, the answers on that thread were not quite what I was looking for.
You might try:
a %>%
group_by(ID) %>%
arrange(NUM) %>%
slice(n())
One dplyr option could be:
a %>%
arrange(ID, NUM) %>%
group_by(ID) %>%
summarise_all(last)
ID NUM VAL
<fct> <dbl> <dbl>
1 A 2. 0.
2 B 4. 1.
3 C 9. 0.
Or since dplyr 1.0.0:
a %>%
arrange(ID, NUM) %>%
group_by(ID) %>%
summarise(across(everything(), last))
Or using slice_max():
a %>%
group_by(ID) %>%
slice_max(order_by = NUM, n = 1)
tail() returns the last 6 items of a subsettable object. When using aggregate(), the parameters to the FUN argument are passed immediately after the function using a comma; here 1 refers to n = 1, which tells tail() to only return the last item.
aggregate(a[, c('NUM', 'VAL')], list(a$ID), tail, 1)
# Group.1 NUM VAL
# 1 A 2 0
# 2 B 3 0
# 3 C 9 0
You can use top_n. (grouping already sorts by ID, and sorting by NUM isn't necessary since there's only 1 value)
library(dplyr)
a %>%
group_by(ID) %>%
top_n(1, NUM)
# # A tibble: 3 x 3
# # Groups: ID [3]
# ID NUM VAL
# <fct> <dbl> <dbl>
# 1 A 2 0
# 2 B 4 1
# 3 C 9 0
I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
as.character %>%
rle %$%
lengths %>%
first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
No need of dplyrhere but you can modify this example to use it with dplyr. The key is the function rle
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A",]$len)
I tried asking this question before but was it was poorly stated. This is a new attempt cause I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser that shows how many points they have scored so far (as both winners and losers).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
test_data[i,]$winner_points_sum_loop <-
(
test_data %>%
dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
+
test_data %>%
dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
)
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions how to tackle this problem? The queries take quite some time when the row numbers add up. I've tried elaborating with the AVE function, I can do it for one column to sum a players point as winner but can't figure out how to add their points as loser.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
unite(winner, winner, winner_points) %>% # unite winner columns
unite(loser, loser, loser_points) %>% # unite loser columns
gather(type, pl_pts, winner, loser, -date) %>% # reshape
separate(pl_pts, c("player","points"), convert = T) %>% # separate columns
arrange(date) %>% # order dates (in case it's not)
group_by(player) %>% # for each player
mutate(sum_points = cumsum(points) - points) %>% # get points up to that date
ungroup() %>% # forget the grouping
unite(pl_pts_sumpts, player, points, sum_points) %>% # unite columns
spread(type, pl_pts_sumpts) %>% # reshape
separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum) # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
I finally understood what you want. And I took an approach of getting cumulative points of each player at each point in time and then joining it to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
gather(end_game_status, player_id, winner, loser) %>%
gather(which_point, how_many_points, winner_points, loser_points) %>%
filter(
(end_game_status == "winner" & which_point == "winner_points") |
(end_game_status == "loser" & which_point == "loser_points")) %>%
arrange(date = as.Date(date)) %>%
group_by(player_id) %>%
mutate(cumulative_points = cumsum(how_many_points)) %>%
mutate(cumulative_points_sofar = lag(cumulative_points, default = 0))
select(player_id, date, cumulative_points)
output <- test_data %>%
left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
rename(winner_points_sum = cumulative_points_sofar) %>%
left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
rename(loser_points_sum = cumulative_points_sofar)
output
The difference to the previous question of the OP is that the OP is now asking for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So, my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format whereby two value variables are reshaped simultaneously, computes the cumulative sums for each player id , and finally reshapes from long back to wide format, again. In order to sum only points scored before the actual date, the rows are lagged by one.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4