I have a data frame (df) that I need to tally by group.
I also want to identify the index of each observation within its group.
For example, group A has 4 observations, and I want to attach an index of 3 to the 3rd A observation.
df %>%
group_by(group) %>%
mutate(count = n())
# group index count
#1 A 1 4
#2 A 2 4
#3 A 3 4
#4 A 4 4
#5 B 1 2
#6 B 2 2
#7 C 1 3
#8 C 2 3
#9 C 3 3
#10 D 1 1
You want to use the window function row_number():
df %>%
group_by(group) %>%
mutate(index = row_number()) # explicit would be row_number(group)
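As a minimal sketch, here is the same idea end to end. The df below is an assumption, reconstructed from the group column of the output shown in the question; the base R line at the end is an alternative I am adding for comparison, not part of the original answer.

```r
library(dplyr)

# Assumed input: one row per observation, rebuilt from the question's output
df <- data.frame(group = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "D"))

result <- df %>%
  group_by(group) %>%
  mutate(
    index = row_number(),  # position of the row within its group
    count = n()            # total number of rows in the group
  ) %>%
  ungroup()

# Base R alternative: number the rows within each group with ave()
base_index <- ave(seq_along(df$group), df$group, FUN = seq_along)
```

Both index and base_index restart at 1 whenever the group changes, because each is computed separately per group.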
I'm trying to compute a running count of subgroups using dplyr, as illustrated and explained in the image below. I am trying to solve for Flag2 in the image. Any recommendations for how to do this?
Beneath the image I also have reproducible code that generates all columns up through Flag1, which works fine.
Reproducible code:
library(dplyr)
myData <-
  data.frame(
    Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
    Group = c(0,0,1,1,2,2,0,3,3,0,0,0)
  )
excelCopy <- myData %>%
  group_by(Element) %>%
  mutate(Element_Count = row_number()) %>%
  mutate(Flag1 = case_when(
    Group > 0 ~ match(Group, unique(Group)),
    TRUE ~ Element_Count
  )) %>%
  ungroup()
print.data.frame(excelCopy)
Using row_number and setting 0 values to NA
library(dplyr)
excelCopy |>
group_by(Element, Group) |>
mutate(Flag2 = ifelse(Group == 0, NA, row_number()))
Element Group Element_Count Flag1 Flag2
<chr> <dbl> <int> <int> <int>
1 A 0 1 1 NA
2 B 0 1 1 NA
3 B 1 2 2 1
4 B 1 3 2 2
5 B 2 4 3 1
6 B 2 5 3 2
7 A 0 2 2 NA
8 C 3 1 1 1
9 C 3 2 1 2
10 C 0 3 3 NA
11 C 0 4 4 NA
12 C 0 5 5 NA
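A self-contained variant of the same approach, as a sketch only: it swaps in dplyr's if_else() with NA_integer_ (my substitution, not the answer's code) so that both branches are explicitly integer. Note that grouping by Element and Group numbers every row sharing that pair, even if those rows are not adjacent.

```r
library(dplyr)

# Same data as in the question
myData <- data.frame(
  Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
  Group   = c(0,0,1,1,2,2,0,3,3,0,0,0)
)

flagged <- myData %>%
  group_by(Element, Group) %>%
  # Rows with Group == 0 get NA; all other rows are numbered within their subgroup
  mutate(Flag2 = if_else(Group == 0, NA_integer_, row_number())) %>%
  ungroup()
```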
I know this might be a simple operation but I can't find a solution. I know it should be some form of group_by and sum or cumsum, but I can't figure out how. I want to plot a cumulative count of something by group over time. I have multiple rows per group and time that need to be counted (and some missing data).
My dataset looks somewhat like this
df <- data.frame(group = c("A","A","A","A","B","B","B","C","C","C","C","C"),
time = c(1,1,2,3,1,2,2,1,2,2,3,3))
and I want this result:
group time count
A 1 2
A 2 3
A 3 4
B 1 1
B 2 3
C 1 1
C 2 3
C 3 5
I usually use dplyr, but I am also happy with base R.
How do I do that?
You can use the following solution:
library(dplyr)
df %>%
group_by(group, time) %>%
add_count() %>%
distinct() %>%
group_by(group) %>%
mutate(n = cumsum(n))
# A tibble: 8 x 3
# Groups: group [3]
group time n
<chr> <dbl> <int>
1 A 1 2
2 A 2 3
3 A 3 4
4 B 1 1
5 B 2 3
6 C 1 1
7 C 2 3
8 C 3 5
We can use summarise with group_by
library(dplyr)
df %>%
group_by(group, time) %>%
summarise(count = n()) %>%
group_by(group) %>%
mutate(count = cumsum(count)) %>%
ungroup
-output
# A tibble: 8 x 3
group time count
<chr> <dbl> <int>
1 A 1 2
2 A 2 3
3 A 3 4
4 B 1 1
5 B 2 3
6 C 1 1
7 C 2 3
8 C 3 5
You can use count and cumsum -
library(dplyr)
df %>%
count(group, time, name = 'count') %>%
group_by(group) %>%
mutate(count = cumsum(count)) %>%
ungroup
# group time count
# <chr> <dbl> <int>
#1 A 1 2
#2 A 2 3
#3 A 3 4
#4 B 1 1
#5 B 2 3
#6 C 1 1
#7 C 2 3
#8 C 3 5
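Since the question also allows base R, here is a rough equivalent without dplyr. This is a sketch under my own assumptions: the helper column n and the use of aggregate() and ave() are my choices, not taken from the answers above.

```r
df <- data.frame(group = c("A","A","A","A","B","B","B","C","C","C","C","C"),
                 time = c(1,1,2,3,1,2,2,1,2,2,3,3))

# Count rows per (group, time) by summing a helper column of ones
counts <- aggregate(n ~ group + time, data = transform(df, n = 1L), FUN = sum)

# aggregate() does not return rows ordered by group first, so sort explicitly
counts <- counts[order(counts$group, counts$time), ]

# Running total of the counts within each group
counts$count <- ave(counts$n, counts$group, FUN = cumsum)
```

Combinations of group and time that never occur (for example B at time 3) simply do not appear in counts, which matches the dplyr results above.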
For the following dataframe:
A<-c('A','A','A','B','B','B','B','B','C','C','C','C','D','D','D','D','D','D')
A<-data.frame(A)
How do you add a column that counts backwards within each group of 'A', restarting each time the group changes, as in:
Desired Output:
desired_output<-c(3,2,1,5,4,3,2,1,4,3,2,1,6,5,4,3,2,1)
desired_output<-data.frame(desired_output)
Thanks for your help.
We can use rev on the row_number() after grouping by 'A'
library(dplyr)
A <- A %>%
group_by(A) %>%
mutate(desired = rev(row_number())) %>%
ungroup
-output
# A tibble: 18 x 2
# A desired
# <chr> <int>
# 1 A 3
# 2 A 2
# 3 A 1
# 4 B 5
# 5 B 4
# 6 B 3
# 7 B 2
# 8 B 1
# 9 C 4
#10 C 3
#11 C 2
#12 C 1
#13 D 6
#14 D 5
#15 D 4
#16 D 3
#17 D 2
#18 D 1
Another option is to create the sequence from n() down to 1 with the : operator
A %>%
group_by(A) %>%
mutate(desired = n():1) %>%
ungroup
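For comparison, a base R sketch of the same countdown. The use of ave() here is my own addition, not part of the answer. Like the dplyr versions, it counts down within each distinct value of A, so it assumes each value forms a single contiguous block, as in the question's data.

```r
A <- data.frame(A = c('A','A','A','B','B','B','B','B','C','C','C','C',
                      'D','D','D','D','D','D'))

# For each group, replace the row positions with a reversed 1..n sequence
A$desired <- ave(seq_along(A$A), A$A, FUN = function(i) rev(seq_along(i)))
```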
Thanks in advance. I have the following data:
df <- data.frame(person=c(1,1,1,1,2,2,2,2,3,3,3,3),
neighborhood=c("A","A","A","A","B","B","C","C","D","D","E","F"))
I would like to generate a new column that gives the cumulative count of neighborhoods that each person moves through as the panel progresses. Like this:
df2 <- data.frame(person=c(1,1,1,1,2,2,2,2,3,3,3,3),
neighborhood=c("A","A","A","A","B","B","C","C","D","D","E","F"),
moved=c(0,0,0,0,0,0,1,1,0,0,1,2)
)
Thanks again.
We can group by 'person', then create 'moved' by matching 'neighborhood' against its unique values to get the index, and subtract 1.
df %>%
group_by(person) %>%
mutate(moved = match(neighborhood, unique(neighborhood))-1)
# person neighborhood moved
# <dbl> <fctr> <dbl>
#1 1 A 0
#2 1 A 0
#3 1 A 0
#4 1 A 0
#5 2 B 0
#6 2 B 0
#7 2 C 1
#8 2 C 1
#9 3 D 0
#10 3 D 0
#11 3 E 1
#12 3 F 2
or use factor with levels specified as the unique values in 'neighborhood', coerce to 'integer' and subtract 1.
df %>%
group_by(person) %>%
mutate(moved = as.integer(factor(neighborhood, levels = unique(neighborhood)))-1)
# person neighborhood moved
# <dbl> <fctr> <dbl>
#1 1 A 0
#2 1 A 0
#3 1 A 0
#4 1 A 0
#5 2 B 0
#6 2 B 0
#7 2 C 1
#8 2 C 1
#9 3 D 0
#10 3 D 0
#11 3 E 1
#12 3 F 2
This can also easily be achieved with rleid or the frank functions from the data.table package:
library(data.table)
# with 'rleid'
setDT(df)[, moved := rleid(neighborhood)-1, by = person]
# with 'frank'
setDT(df)[, moved := frank(neighborhood, ties.method='dense')-1, by = person]
the result:
> df
person neighborhood moved
1: 1 A 0
2: 1 A 0
3: 1 A 0
4: 1 A 0
5: 2 B 0
6: 2 B 0
7: 2 C 1
8: 2 C 1
9: 3 D 0
10: 3 D 0
11: 3 E 1
12: 3 F 2
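With dplyr you could use the dense_rank function (it ranks by sorted value, so it agrees with the match/unique result only when each person's neighborhoods first appear in sorted order, as they do here):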
library(dplyr)
df %>%
group_by(person) %>%
mutate(moved = dense_rank(neighborhood)-1)
This can be achieved using window functions of dplyr, as well. Here is the code:
library(dplyr)
my.df <- as_tibble(df)  # tbl_df() is deprecated in current dplyr
my.df %>%
  # Per person
  group_by(person) %>%
  # sort by neighborhood
  arrange(neighborhood) %>%
  # if the neighborhood has changed compared to the row before
  mutate(moved = (neighborhood != lag(neighborhood))) %>%
  # turn NAs (first rows) into FALSE
  mutate(moved = ifelse(is.na(moved), FALSE, moved)) %>%
  # use cumulative sum of the logical column to get number of moves
  mutate(no_moves = cumsum(moved))
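The approaches above all agree on the question's data, but they answer slightly different questions. A small sketch makes the difference visible. The trips data and the column names are hypothetical, invented only for illustration: one person who returns to a neighborhood they have already lived in.

```r
library(dplyr)

# Hypothetical panel: one person, neighborhoods out of alphabetical order,
# with a return to "B"
trips <- data.frame(person = c(1, 1, 1, 1),
                    neighborhood = c("B", "A", "B", "C"))

compared <- trips %>%
  group_by(person) %>%
  mutate(
    # order of first appearance: a revisit reuses the earlier index
    first_seen  = match(neighborhood, unique(neighborhood)) - 1,
    # rank by sorted value: depends on alphabetical order, not on time
    sorted_rank = dense_rank(neighborhood) - 1,
    # number of changes so far: every move counts, including returns
    changes     = cumsum(neighborhood != lag(neighborhood, default = first(neighborhood)))
  ) %>%
  ungroup()
```

Which column is the right one depends on whether a return to an earlier neighborhood should count as a new move; the original question's data does not contain such a case, so it does not settle this.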
Is it possible to loop through a list and replace the group_by variable when using dplyr? Let me illustrate:
Let's say I have a list of variables from the dataset myData, and each of the variables has the same groups, 1 through 10. Ideally I'd like to loop through the list and, for each variable, summarise and mutate the data as indicated below. Is this possible?
Here is a smaller generalized example. I just put the variable a in the group_by function, but ideally I'd like to loop through a list and get that output for each variable.
vars <- c("a", "b", "c")  # column names as strings
> myData
success a b c
1 0 2 1 3
2 1 1 3 1
3 1 1 3 1
4 0 1 1 3
5 1 2 2 1
6 1 2 3 2
7 0 2 2 3
8 0 1 1 3
9 0 2 3 2
10 1 1 1 2
11 1 1 2 2
12 0 1 1 1
13 0 3 1 1
14 1 3 2 1
> myData %>% group_by(a) %>%
+ summarise(success = sum(success), n = n()) %>%
+ mutate(success_prop = success / sum(n))
Source: local data frame [3 x 4]
a success n success_prop
1 1 4 7 0.28571429
2 2 2 5 0.14285714
3 3 1 2 0.07142857
final results might look something like this:
group a.success a.n a.success_prop b.success b.n b.success_prop c.success c.n c.success_prop
1 4 7 0.28571429 1 6 0.07142857 4 6 0.2857143
2 2 5 0.14285714 3 4 0.21428571 3 4 0.2142857
3 1 2 0.07142857 3 4 0.21428571 0 4 0
I would recommend converting your data to a tidy format as a first step:
library(tidyr)
library(dplyr)
tidy_data <- myData %>%
gather(key, value, a:c)
It is then straightforward to use group_by and summarise.
Edit
tidy_data %>%
  group_by(key, value) %>%
  summarise(
    success = sum(success),
    n = n()
  ) %>%
  group_by(key) %>%
  mutate(
    success_prop = success / sum(n)
  )
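If you do want to loop over the grouping variable directly, as the question asks, one possible sketch is to keep the variable names as strings and use the .data pronoun inside group_by. The myData below is retyped from the printout in the question; the results list and the group column name are my own naming.

```r
library(dplyr)

# Data retyped from the question's printout
myData <- data.frame(
  success = c(0,1,1,0,1,1,0,0,0,1,1,0,0,1),
  a = c(2,1,1,1,2,2,2,1,2,1,1,1,3,3),
  b = c(1,3,3,1,2,3,2,1,3,1,2,1,1,2),
  c = c(3,1,1,3,1,2,3,3,2,2,2,1,1,1)
)

vars <- c("a", "b", "c")  # grouping columns, given as strings

# One summary table per grouping variable
results <- lapply(vars, function(v) {
  myData %>%
    group_by(group = .data[[v]]) %>%   # look the column up by name
    summarise(success = sum(success), n = n()) %>%
    mutate(success_prop = success / sum(n))
})
names(results) <- vars
```

results$a then holds the same table as the hand-written group_by(a) call in the question, and results$b and results$c hold the corresponding tables for the other two variables.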