Filtering and summing rows in dplyr - r

I have a data that I want to first filter some rows and sum those remaining rows.
The filtering conditions as follows;
for gr==1 find the last occurrence of y_value==10 and keep the all rows before it (including the last occurrence of this value 10 row)!
for gr==2 find the first occurrence of y_value==10 and keep all the rows after it (including the first occurrence of this value 10 row)!
The data is like this;
df <- data.frame(gr=rep(c(1,2),c(8,7)),
y_value=c(c(2,10,10,8,10,6,0,0),c(0,0,10,10,6,8,10)))
gr y_value
1 1 2
2 1 10
3 1 10
4 1 8
5 1 10
6 1 6
7 1 0
8 1 0
9 2 0
10 2 0
11 2 10
12 2 10
13 2 6
14 2 8
15 2 10
I tried this in the light of summing-rows-based-on-conditional-in-groups;
df_temp <- df %>%
group_by(gr) %>%
mutate(rows_to_aggregate=cumsum(y_value==10)) %>%
filter(ifelse(gr==1, rows_to_aggregate !=0, ifelse(gr==2, rows_to_aggregate ==0 | y_value==10, rows_to_aggregate ==0))) %>%
filter(ifelse(gr==1, row_number(gr) != 1, ifelse(gr==2, row_number(gr) != n(), rows_to_aggregate ==0)))
but the if I do rows_to_aggregate !=0 in gr==1 the rows in the interest will be gone! Any guide at this point will be appreciated!

df_to_aggregate <- df %>%
group_by(gr) %>%
mutate(rows_to_aggregate = cumsum(y_value == 10)) %>%
filter(!(gr == 1 & rows_to_aggregate == max(rows_to_aggregate) & y_value != 10)) %>%
filter(!(gr == 2 & rows_to_aggregate == 0)) %>%
select(-rows_to_aggregate)
df_to_aggregate
# A tibble: 10 x 2
# Groups: gr [2]
gr y_value
<dbl> <dbl>
1 1 2
2 1 10
3 1 10
4 1 8
5 1 10
6 2 10
7 2 10
8 2 6
9 2 8
10 2 10

Do not know how to do it in dplyr, but this code seems to work
gr1 = df[df$gr==1,]
last = tail(which(gr1$y_value==10),1)
gr1 = gr1[1:(last-1),]
gr2 = df[df$gr==2,]
first = head(which(gr2$y_value==10),1)
gr2 = gr2[(first+1):dim(gr2)[1],]
final = rbind(gr1,gr2)

You can slice with a different slicing condition for each gr.
df %>%
group_by(gr) %>%
slice(if(any(gr==1)) {1:max(which(y_value==10))} else {min(which(y_value==10)):n()})
gr y_value
1 1 2
2 1 10
3 1 10
4 1 8
5 1 10
6 2 10
7 2 10
8 2 6
9 2 8
10 2 10

Related

How to count groupings of elements in base R or dplyr using multiple conditions?

I am trying to count the number of elements by groupings, subject to the condition that each grouping code ("Group") is > 0. Suppose we start with the below output DF generated via the code immediately beneath:
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 1
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 1
7 X 0 1
8 X 0 1
9 B 0 1
10 R 0 1
11 R 2 2
12 R 2 2
13 X 3 3
14 X 3 3
15 X 3 3
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>% group_by(Element) %>% mutate(reSeq = match(Group, unique(Group)))
Instead, I would like the reSeq column to calculate and output as shown below with explanations to the right:
Element Group reSeq reSeq explanation
<chr> <dbl> <int>
1 R 0 1 1st instance of R (ungrouped)(Group = 0 means not grouped)
2 R 0 2 2nd instance of R (ungrouped)(Group = 0 means not grouped)
3 X 0 1 1st instance of X (ungrouped)(Group = 0 means not grouped)
4 X 1 2 2nd instance of X (grouped by Group = 1)
5 X 1 2 2nd instance of X (grouped by Group = 1)
6 X 0 3 3rd instance of X (ungrouped)
7 X 0 4 4th instance of X (ungrouped)
8 X 0 5 5th instance of X (ungrouped)
9 B 0 1 1st instance of B (ungrouped)
10 R 0 3 3rd instance of R (ungrouped)
11 R 2 4 4th instance of R (grouped by Group = 2)
12 R 2 4 4th instance of R (grouped by Group = 2)
13 X 3 6 6th instance of X (grouped by Group = 3)
14 X 3 6 6th instance of X (grouped by Group = 3)
15 X 3 6 6th instance of X (grouped by Group = 3)
Any recommendations for doing this? If possible, starting with the dplyr code I use above because I am fairly familiar with it.
If we use rowid from data.table, can skip a couple of steps
library(dplyr)
library(data.table)
library(tidyr)
myDF %>%
mutate(reSeq = rowid(Element) * NA^!(Group == 0 |!duplicated(Group))) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
-output
# A tibble: 15 × 3
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 2
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 3
7 X 0 4
8 X 0 5
9 B 0 1
10 R 0 3
11 R 2 4
12 R 2 4
13 X 3 6
14 X 3 6
15 X 3 6
Below is what I managed to cobble together. Maybe there's a cleaner solution? Here's the code:
library(dplyr)
library(tidyr)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0)) %>%
mutate(reSeq = na_if(reSeq, 0)) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
And here's the output:
# A tibble: 15 x 4
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 3
7 X 0 5 4
8 X 0 6 5
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 6
14 X 3 8 6
15 X 3 9 6

How do I find the largest range in a dataset, and filter out the other data?

Competitor Laps
1 1 1
2 1 2
3 1 3
4 1 4
5 1 1
6 1 2
7 1 3
8 1 4
9 1 5
10 1 6
11 1 7
12 1 8
I need to identify the longest range in laps. Here, that range is from row 5 to row 12. The range is 7. As opposed to row 1 to row 4 which has a range of 3. After identifying the largest range, I should only keep the values values that contribute to said range. So, my final dataset should look like:
Competitor Laps
5 1 1
6 1 2
7 1 3
8 1 4
9 1 5
10 1 6
11 1 7
12 1 8
How should I go about this?
Potential solution with dplyr:
dat <- tibble(
Competitor = 1,
Laps = c(seq(1,4), seq(1,8))
)
dat |>
mutate(StintId = cumsum(if_else(Laps == 1, 1, 0))) |>
group_by(StintId) |>
mutate(range = max(Laps) - min(Laps)) |>
ungroup() |>
filter(range == max(range)) |>
select(-StintId, -range)
Output:
# A tibble: 8 x 2
Competitor Laps
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
Returns the largest range for each competitor. Assumes first laps always starts with 1, and laps are sequential.
df<-data.frame(Competitor=c(rep(1,12), rep(2,16)),
Laps=c(1:4, 1:8, 1:9, 1:7))
df %>%
group_by(Competitor) %>%
mutate(LapGroup=cumsum(if_else(Laps==1,1,0))) %>%
group_by(Competitor, LapGroup) %>%
mutate(MaxLaps=max(Laps)) %>%
group_by(Competitor) %>%
filter(MaxLaps==max(Laps))

Group data frame row by consecutive value in R [duplicate]

This question already has answers here:
Create grouping variable for consecutive sequences and split vector
(5 answers)
Closed 1 year ago.
I need to group a data frame by consecutive value in a row.
So for example, given this data frame:
tibble( time = c(1,2,3,4,5,10,11,20,30,31,32,40) )
I want to have a grouping column like:
tibble( time = c(1,2,3,4,5,10,11,20,30,31,32,40), group=c(1,1,1,1,1,2,2,3,4,4,4,5) )
What's a tidyverse (or base R) way to get the column group as explained?
We could it this way:
df %>%
arrange(time) %>%
group_by(grp = (time %/% 10)+1)
time group
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5
We could use diff on the adjacent values of 'time', check if the difference is not equal to 1, then change the logical vector to numeric index by taking the cumulative sum (cumsum) so that there is an increment of 1 at each TRUE value
library(dplyr)
df1 %>%
mutate(grp = cumsum(c(TRUE, diff(time) != 1)))
-output
# A tibble: 12 x 2
time grp
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5
You can use the following solution:
library(dplyr)
library(purrr)
df %>%
mutate(grp = accumulate(2:nrow(df), .init = 1,
~ if(time[.y] - time[.y - 1] == 1) {
.x
} else {
.x + 1
}))
# A tibble: 12 x 2
time grp
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5

R Tidy : Dynamic Sequential Threshold

I'm trying to find a tidy way to dynamically adjust a threshold as I "move" through a tibble using library(tidyverse). For example, imagine a tibble containing sequential observations:
example <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,3))
example
# A tibble: 11 x 1
observed
<dbl>
1 2
2 1
3 1
4 2
5 2
6 4
7 10
8 4
9 2
10 2
11 3
I'm trying to calculate a threshold that starts with the initial value (2) and increments by a prespecified amount (in this case, 1) unless the current observation is greater than that threshold in which case the current observation becomes the reference threshold and further thresholds increment from it. Here is what the final tibble would look like:
answer <-
example %>%
mutate(threshold = c(2,3,4,5,6,7,10,11,12,13,14))
answer
# A tibble: 11 x 2
observed threshold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 7
7 10 10
8 4 11
9 2 12
10 2 13
11 3 14
I'm looking for the best way to do this using dplyr/tidy. All help is appreciated!
EDIT:
The answers so far are very close, but miss in the case that the observed values drop and increase again. For example consider the same tibble as example above, but with a 4 instead of a 3 for the final observation:
example <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,4))
example
# A tibble: 11 x 1
observed
<dbl>
1 2
2 1
3 1
4 2
5 2
6 4
7 10
8 4
9 2
10 2
11 4
The diff & cumsum method then gives:
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh))) %>%
mutate(thresold = first(observed) + row_number() - 1) %>%
ungroup %>%
select(-gr)
A tibble: 11 x 2
observed thresold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 4
7 10 10
8 4 11
9 2 12
10 2 13
11 4 4
Where the final threshold value is incorrect.
You could use diff to create groups and add row number in the group to the first value.
library(dplyr)
thresh <- 1
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh))) %>%
mutate(thresold = first(observed) + row_number() - 1) %>%
ungroup %>%
select(-gr)
# A tibble: 11 x 2
# observed thresold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 4
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 3 14
To understand how the groups are created here are the detailed steps :
We first calculate the difference between consecutive values
diff(example$observed)
#[1] -1 0 1 0 2 6 -6 -2 0 1
Note that diff gives output of length 1 less than the actual length.
We compare it with thresh which gives TRUE for every time we have value greater than the threshold
diff(example$observed) > thresh
#[1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
Now since output of diff has one value less we add one value as TRUE
c(TRUE, diff(example$observed) > thresh)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
and then finally take cumsum to create groups which is used in group_by.
cumsum(c(TRUE, diff(example$observed) > thresh))
# [1] 1 1 1 1 1 2 3 3 3 3 3
EDIT
For the updated question we can add another condition to check of the previous value is greater than the current count and update the values accordingly.
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh) &
observed > first(observed) + row_number())) %>%
mutate(thresold = first(observed) + row_number() - 1) %>%
ungroup() %>%
select(-gr)
# A tibble: 11 x 2
# observed thresold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 7
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 4 14
We can create the grouping variable with lag of the column difference
library(dplyr)
thresh <- 1
example %>%
group_by(grp = cumsum((observed - lag(observed, default = first(observed)) >
thresh))) %>%
mutate(threshold = observed[1] + row_number() - 1) %>%
ungroup %>%
mutate(new = row_number() + 1,
threshold = pmax(threshold, new)) %>%
select(-grp, -new)
# A tibble: 11 x 2
# observed threshold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 7
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 3 14
I think I've figured out a way to do this, by utilizing zoo::locf (although I'm not sure this part is really necessary).
First create the harder of the two examples I've listed in my description:
example2 <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,4))
example2 %>%
mutate(def = first(observed) + row_number() - 1) %>%
mutate(t1 = pmax(observed,def)) %>%
mutate(local_maxima = ifelse(observed == t1,t1,NA)) %>%
mutate(groupings = zoo::na.locf(local_maxima)) %>%
group_by(groupings) %>%
mutate(threshold = groupings + row_number() - 1) %>%
ungroup() %>%
select(-def,-t1,-local_maxima,-groupings)
Result:
# A tibble: 11 x 2
observed threshold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 7
7 10 10
8 4 11
9 2 12
10 2 13
11 4 14
I'd definitely prefer a more elegant solution if anyone finds one.

Aggregate rows with specific shared value

I want to aggregate my data as follows:
Aggregate only for successive rows where status = 0
Keep age and sum up points
Example data:
da <- data.frame(userid = c(1,1,1,1,2,2,2,2), status = c(0,0,0,1,1,1,0,0), age = c(10,10,10,11,15,16,16,16), points = c(2,2,2,6,3,5,5,5))
da
userid status age points
1 1 0 10 2
2 1 0 10 2
3 1 0 10 2
4 1 1 11 6
5 2 1 15 3
6 2 1 16 5
7 2 0 16 5
8 2 0 16 5
I would like to have:
da2
userid status age points
1 1 0 10 6
2 1 1 11 6
3 2 1 15 3
4 2 1 16 5
5 2 0 16 10
da %>%
mutate(grp = with(rle(status),
rep(seq_along(values), lengths)) + cumsum(status != 0)) %>%
group_by_at(vars(-points)) %>%
summarise(points = sum(points)) %>%
ungroup() %>%
select(-grp)
## A tibble: 5 x 4
# userid status age points
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 10 6
#2 1 1 11 6
#3 2 0 16 10
#4 2 1 15 3
#5 2 1 16 5
You can use group_by from dplyr:
da %>% group_by(da$userid, cumsum(da$status), da$status)
%>% summarise(age=max(age), points=sum(points))
Output:
`da$userid` `cumsum(da$status)` `da$status` age points
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 6
2 1 1 1 11 6
3 2 2 1 15 3
4 2 3 0 16 10
5 2 3 1 16 5
Exactly the same idea as above :
library(dplyr)
data1 <- data %>% group_by(userid, age, status) %>%
filter(status == 0) %>%
summarise(points = sum(points))
data2 <- data %>%
group_by(userid, age, status) %>%
filter(status != 0) %>%
summarise(points = sum(points))
data <- rbind(data1,
data2)
We need to be more carreful with your specification of status equal to 0. I think the code of Quang Hoang works only for your specific example.
I hope it will help.

Resources