How do I use the tidyverse packages to get a running total of unique values occurring in a column? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 3 years ago.
I'm trying to use the tidyverse (whatever package is appropriate) to add a column (via mutate()) that is a running total of the unique values that have occurred in the column so far. Here is some toy data, showing the desired output.
data.frame("n"=c(1,1,1,6,7,8,8),"Unique cumsum"=c(1,1,1,2,3,4,4))
Who knows how to accomplish this in the tidyverse?

Here is an option with group_indices
library(dplyr)
df1 %>%
  mutate(unique_cumsum = group_indices(., n))
#   n unique_cumsum
# 1 1             1
# 2 1             1
# 3 1             1
# 4 6             2
# 5 7             3
# 6 8             4
# 7 8             4
data
df1 <- data.frame("n"=c(1,1,1,6,7,8,8))
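As far as I know, passing grouping columns directly to group_indices() is deprecated from dplyr 1.0.0 onward and emits a warning. Below is a minimal sketch of two replacements, assuming a recent dplyr version: dense_rank() and cur_group_id(). Both number the distinct values in sorted order, so they match a running count of unique values only when n is already sorted. The names res_rank and res_group are just illustrative.

```r
library(dplyr)

df1 <- data.frame(n = c(1, 1, 1, 6, 7, 8, 8))

# dense_rank() ranks the distinct values of n without gaps
res_rank <- df1 %>%
  mutate(unique_cumsum = dense_rank(n))

# cur_group_id() returns the index of the current group inside mutate()
res_group <- df1 %>%
  group_by(n) %>%
  mutate(unique_cumsum = cur_group_id()) %>%
  ungroup()

res_rank$unique_cumsum
res_group$unique_cumsum
```

On this sorted toy data both columns should agree with the desired output from the question.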

Here's one way, using the fact that a factor will assign a sequential value to each unique item, and then converting the underlying factor codes with as.numeric:
data.frame(n = c(1, 1, 1, 6, 7, 8, 8)) %>%
  mutate(unique_cumsum = as.numeric(factor(n)))
#   n unique_cumsum
# 1 1             1
# 2 1             1
# 3 1             1
# 4 6             2
# 5 7             3
# 6 8             4
# 7 8             4

Another solution:
df <- data.frame("n"=c(1,1,1,6,7,8,8))
df <- df %>% mutate(`unique cumsum` = cumsum(!duplicated(n)))
Because duplicated() flags repeats in order of appearance, this should work even if your data is not sorted.
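To see why sorting matters, here is a small base R sketch on a made-up unsorted vector (x is hypothetical, not from the question). It compares the running count from cumsum(!duplicated()) with the factor-code approach, and adds match(x, unique(x)) as one possible way to get a stable ID per value in order of first appearance.

```r
x <- c(8, 1, 8, 6, 1, 7)  # hypothetical unsorted input

# Running count of distinct values seen so far
running_unique <- cumsum(!duplicated(x))
# 1 2 2 3 3 4

# Factor codes follow the sorted levels, not the order of appearance
factor_codes <- as.numeric(factor(x))
# 4 1 4 2 1 3

# Stable ID per value, numbered in order of first appearance
first_seen_id <- match(x, unique(x))
# 1 2 1 3 2 4
```

Which of the three is wanted depends on whether the column should be a running total (the first) or a group identifier (the other two).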

Related

Go through a column and collect a running total in new column [duplicate]

This question already has answers here:
Creation of a specific vector without loop or recursion in R
(2 answers)
Split data.frame by value
(2 answers)
Closed 4 years ago.
I have a dataframe whose rows represent people. For a given family, the first row has the value 1 in column A, and all following rows contain members of the same family until another row has the value 1 in column A. Then, a new family starts.
I would like to assign IDs to all families in my dataset. In other words, I would like to take:
A
1
2
3
1
3
3
1
4
And turn it into:
A family_id
1 1
2 1
3 1
1 2
3 2
3 2
1 3
4 3
I'm playing with a dataframe of 3 million rows, so a simple for-loop solution I came up with falls short of necessary efficiency. Also, the family_id need not be sequential.
I'll take a dplyr solution.
data:
df <- data.frame(A = c(1:3,1,3,3,1,4))
code:
df$family_id <- cumsum(c(-1, diff(df$A)) < 0)
result:
#   A family_id
# 1 1         1
# 2 2         1
# 3 3         1
# 4 1         2
# 5 3         2
# 6 3         2
# 7 1         3
# 8 4         3
Please note:
This solution starts a new group whenever a value is smaller than the previous one.
If it is certain that a new group always begins with a 1, then Ronak's solution is perfect.
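The difference between the two rules can be sketched on a made-up vector in which a value drops inside a family. The vector A below is hypothetical, and cumsum(A == 1) is only my assumption of what a "new group starts at 1" rule looks like; it is not necessarily the answer referred to above.

```r
A <- c(1, 2, 3, 1, 3, 2, 1, 4)  # hypothetical: the second family contains a decrease (3 then 2)

# Rule 1: start a new group whenever the value drops
by_decrease <- cumsum(c(-1, diff(A)) < 0)
# 1 1 1 2 2 3 4 4

# Rule 2 (assumed alternative): start a new group at every 1
by_one <- cumsum(A == 1)
# 1 1 1 2 2 2 3 3
```

On the data in the question both rules give the same IDs; they diverge only when values can decrease within a family.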

Apply a maximum value to whole group [duplicate]

This question already has answers here:
Aggregate a dataframe on a given column and display another column
(8 answers)
Closed 6 years ago.
I have a df like this:
Id count
1 0
1 5
1 7
2 5
2 10
3 2
3 5
3 4
and I want to get the maximum count and apply that to the whole "group" based on ID, like this:
Id count max_count
1 0 7
1 5 7
1 7 7
2 5 10
2 10 10
3 2 5
3 5 5
3 4 5
I've tried pmax, slice etc. I'm generally having trouble working with data that is in interval-specific form; if you could direct me to tools well-suited to that type of data, would really appreciate it!
Figured it out with help from Gavin Simpson here: Aggregate a dataframe on a given column and display another column
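maxcount <- aggregate(count ~ Id, data = df, FUN = max)
names(maxcount)[2] <- "max_count"  # rename so merge() joins on Id only, not on count as well
new_df <- merge(df, maxcount, by = "Id")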
Better way:
df$max_count <- with(df, ave(count, Id, FUN = max))
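A self-contained sketch of the ave() approach on the data from the question follows. The data frame is rebuilt here by hand, so the column types are an assumption.

```r
df <- data.frame(
  Id    = c(1, 1, 1, 2, 2, 3, 3, 3),
  count = c(0, 5, 7, 5, 10, 2, 5, 4)
)

# ave() applies max() within each Id and returns a vector aligned with the original rows
df$max_count <- ave(df$count, df$Id, FUN = max)

df$max_count
# 7 7 7 10 10 5 5 5
```

One reason to prefer this over aggregate() plus merge() is that ave() leaves the row order untouched, whereas merge() sorts the result by the key columns by default.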

in R: Sum by group without summarising [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 6 years ago.
I have searched a lot, but not found a solution.
I have the following data frame:
Age no.observations Factor
1 1 4 A
2 1 3 A
3 1 12 A
4 1 5 B
5 1 9 B
6 1 3 B
7 2 12 A
8 2 3 A
9 2 6 A
10 2 7 B
11 2 9 B
12 2 1 B
I would like to create another column with the sum by the categories Age and Factor, thus having 19 for the first three rows, 17 for the next three, etc. I want this to be a column added to this data.frame, therefore dplyr and its summarise function do not help.
Use mutate with group_by to add the group sum without collapsing the rows:
library(dplyr)
df %>%
  group_by(Age, Factor) %>%
  mutate(no.observations.in.group = sum(no.observations)) %>%
  ungroup()
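Here is a runnable sketch of the same idea on the data from the question, rebuilt by hand (so the exact column types are an assumption). The names res and base_sum are only illustrative. A base R equivalent with ave() is included for comparison.

```r
library(dplyr)

df <- data.frame(
  Age = rep(c(1, 2), each = 6),
  no.observations = c(4, 3, 12, 5, 9, 3, 12, 3, 6, 7, 9, 1),
  Factor = rep(rep(c("A", "B"), each = 3), times = 2)
)

# Grouped mutate(): every row keeps its place and receives its group's sum
res <- df %>%
  group_by(Age, Factor) %>%
  mutate(no.observations.in.group = sum(no.observations)) %>%
  ungroup()

res$no.observations.in.group
# 19 19 19 17 17 17 21 21 21 17 17 17

# Base R equivalent: ave() accepts several grouping vectors
base_sum <- ave(df$no.observations, df$Age, df$Factor, FUN = sum)
```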

How to create a new row that would show me the number of observations in a group in an unbalanced panel dataset in R? [duplicate]

This question already has answers here:
Create counter with multiple variables [duplicate]
(6 answers)
Closed 6 years ago.
I have a dataset that looks like this:
id time
1 1
1 2
2 5
2 3
3 2
3 7
3 8
And I want to add another column that numbers the observations within each group:
id time label
1 1 1
1 2 2
2 5 1
2 3 2
3 2 1
3 7 2
3 8 3
We can use ave:
df1$label <- with(df1, ave(seq_along(id), id, FUN = seq_along))
Or with dplyr:
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(label = row_number())
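The following is a self-contained sketch of the dplyr version on the data from the question. Since the question mentions both a counter and "how many observations there are in a group", the sketch also adds a group_size column via n(); that column and its name are my addition, not part of the original answer.

```r
library(dplyr)

df1 <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3, 3),
  time = c(1, 2, 5, 3, 2, 7, 8)
)

res <- df1 %>%
  group_by(id) %>%
  mutate(
    label      = row_number(),  # position within the group: 1, 2, ...
    group_size = n()            # total rows in the group, repeated on each row
  ) %>%
  ungroup()

res$label
# 1 2 1 2 1 2 3
res$group_size
# 2 2 2 2 3 3 3
```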
