Using dplyr to get cumulative count by group - r

Thanks in advance. I have the following data:
df <- data.frame(person=c(1,1,1,1,2,2,2,2,3,3,3,3),
neighborhood=c("A","A","A","A","B","B","C","C","D","D","E","F"))
I would like to generate a new column that gives the cumulative count of neighborhoods that each person moves through as the panel progresses. Like such:
df2 <- data.frame(person=c(1,1,1,1,2,2,2,2,3,3,3,3),
neighborhood=c("A","A","A","A","B","B","C","C","D","D","E","F"),
moved=c(0,0,0,0,0,0,1,1,0,0,1,2)
)
Thanks again.

We can group by 'person', then create 'moved' by matching each 'neighborhood' against the group's unique values to get an index, and subtract 1 so that the first neighborhood counts as 0.
library(dplyr)
df %>%
  group_by(person) %>%
  mutate(moved = match(neighborhood, unique(neighborhood)) - 1)
# person neighborhood moved
# <dbl> <fctr> <dbl>
#1 1 A 0
#2 1 A 0
#3 1 A 0
#4 1 A 0
#5 2 B 0
#6 2 B 0
#7 2 C 1
#8 2 C 1
#9 3 D 0
#10 3 D 0
#11 3 E 1
#12 3 F 2
Alternatively, convert 'neighborhood' to a factor whose levels are its unique values in order of appearance, coerce it to integer, and subtract 1.
df %>%
group_by(person) %>%
mutate(moved = as.integer(factor(neighborhood, levels = unique(neighborhood)))-1)
# person neighborhood moved
# <dbl> <fctr> <dbl>
#1 1 A 0
#2 1 A 0
#3 1 A 0
#4 1 A 0
#5 2 B 0
#6 2 B 0
#7 2 C 1
#8 2 C 1
#9 3 D 0
#10 3 D 0
#11 3 E 1
#12 3 F 2
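For readers without dplyr, the same first-appearance index can be sketched in base R with ave(). This is only an illustration built on the question's data, not part of the original answers; the helper name first_seen_index is made up for the example.

```r
# Data from the question
df <- data.frame(person = c(1,1,1,1,2,2,2,2,3,3,3,3),
                 neighborhood = c("A","A","A","A","B","B","C","C","D","D","E","F"))

# Hypothetical helper: 0 for the first neighborhood seen, 1 for the second, ...
first_seen_index <- function(x) match(x, unique(x)) - 1

# ave() applies the helper separately within each person.
# Row indices are passed in so that FUN receives and returns numbers.
df$moved <- ave(seq_along(df$neighborhood), df$person,
                FUN = function(i) first_seen_index(df$neighborhood[i]))
df$moved
# 0 0 0 0 0 0 1 1 0 0 1 2
```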

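This can also be achieved with the rleid or frank functions from the data.table package. Note that rleid counts runs of consecutive equal values and frank ranks values in sorted order, so both reproduce the expected output here only because each person's neighborhoods are contiguous and appear in alphabetical order: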
library(data.table)
# with 'rleid'
setDT(df)[, moved := rleid(neighborhood)-1, by = person]
# with 'frank'
setDT(df)[, moved := frank(neighborhood, ties.method='dense')-1, by = person]
The result:
> df
person neighborhood moved
1: 1 A 0
2: 1 A 0
3: 1 A 0
4: 1 A 0
5: 2 B 0
6: 2 B 0
7: 2 C 1
8: 2 C 1
9: 3 D 0
10: 3 D 0
11: 3 E 1
12: 3 F 2
With dplyr you could use the dense_rank function (like frank, it ranks by sorted value rather than by order of appearance):
library(dplyr)
df %>%
group_by(person) %>%
mutate(moved = dense_rank(neighborhood)-1)
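The approaches above agree on the question's data, but they answer slightly different questions. The sketch below uses a made-up sequence for one person (not from the original post) and base R equivalents of each idea, so the differences are visible without extra packages.

```r
# Hypothetical example: one person who moves C -> A -> C -> B
neighborhood <- c("C", "A", "C", "B")

# Order of first appearance (the match/unique approach)
by_appearance <- match(neighborhood, unique(neighborhood)) - 1
by_appearance  # 0 1 0 2

# Sorted order (what a dense rank computes); factor() sorts its levels
by_sorted <- as.integer(factor(neighborhood)) - 1
by_sorted      # 2 0 2 1

# Run-based counting (the idea behind rleid): every change starts a new run
changed <- c(TRUE, neighborhood[-1] != neighborhood[-length(neighborhood)])
by_run <- cumsum(changed) - 1
by_run         # 0 1 2 3
```

Which one is wanted depends on whether a return to an earlier neighborhood should count as a new move.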

This can be achieved using window functions of dplyr, as well. Here is the code:
library(dplyr)
my.df <- as_tibble(df) # tbl_df() is deprecated in favour of as_tibble()
my.df %>%
# Per person
group_by(person) %>%
# sort by neighborhood
arrange(neighborhood) %>%
# if the neighborhood has changed compared to the row before
mutate(moved = (neighborhood != lag(neighborhood))) %>%
# turn NAs (first rows) into FALSE
mutate(moved = ifelse(is.na(moved), FALSE, moved)) %>%
# use cumulative sum of the logical column to get number of moves
mutate(no_moves = cumsum(moved))

Related

How to count the cumulative number of subgroupings using dplyr?

I'm trying to compute a running count of cumulative subgroupings using dplyr, as illustrated and explained in the image below. I am trying to solve for Flag2 in the image. Any recommendations for how to do this?
Beneath the image I also have the reproducible code that runs all columns up through Flag1 which works fine.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
Group = c(0,0,1,1,2,2,0,3,3,0,0,0)
)
excelCopy <- myData %>%
group_by(Element) %>%
mutate(Element_Count = row_number()) %>%
mutate(Flag1 = case_when(Group > 0 ~ match(Group, unique(Group)),TRUE ~ Element_Count)) %>%
ungroup()
print.data.frame(excelCopy)
Using row_number and setting 0 values to NA
library(dplyr)
excelCopy |>
group_by(Element, Group) |>
mutate(Flag2 = ifelse(Group == 0, NA, row_number()))
Element Group Element_Count Flag1 Flag2
<chr> <dbl> <int> <int> <int>
1 A 0 1 1 NA
2 B 0 1 1 NA
3 B 1 2 2 1
4 B 1 3 2 2
5 B 2 4 3 1
6 B 2 5 3 2
7 A 0 2 2 NA
8 C 3 1 1 1
9 C 3 2 1 2
10 C 0 3 3 NA
11 C 0 4 4 NA
12 C 0 5 5 NA

r convert summary data to presence/absence data

I conducted 5 presence/absence measures at multiple sites and summed them together and ended up with a dataframe that looked something like this:
df <- data.frame("site" = c("a", "b", "c"),
"species1" = c(0, 2, 1),
"species2" = c(5, 2, 4))
i.e. at site "a", species1 was recorded 0/5 times and species2 was recorded 5/5 times.
What I would like to do is convert this back into presence/absence data. Something like this:
data.frame("site" = rep(c("a", "b", "c"), each = 5),
"species1" = c(0,0,0,0,0, 1,1,0,0,0, 1,0,0,0,0),
"species2" = c(1,1,1,1,1, 1,1,0,0,0, 1,1,1,1,0))
I can duplicate each row 5 times with:
df %>% slice(rep(1:n(), each = 5))
but I can't figure out how to change "2" into "1,1,0,0,0". Ideally the order of the 1s and 0s (within each site) would also be randomised (i.e. "0,0,1,0,1"), but that might be too difficult.
Any help would be appreciated.
We can also use uncount
library(dplyr)
library(tidyr)
df %>%
uncount(max(species2), .remove = FALSE) %>%
group_by(site) %>%
mutate(across(starts_with('species'), ~ as.integer(row_number() <= first(.))))
# A tibble: 15 x 3
# Groups: site [3]
# site species1 species2
# <chr> <int> <int>
# 1 a 0 1
# 2 a 0 1
# 3 a 0 1
# 4 a 0 1
# 5 a 0 1
# 6 b 1 1
# 7 b 1 1
# 8 b 0 0
# 9 b 0 0
#10 b 0 0
#11 c 1 1
#12 c 0 1
#13 c 0 1
#14 c 0 1
#15 c 0 0
After repeating the rows, you can compare the row number within each site with the first value of the respective column and assign 1 if the row number is less than or equal to that value.
library(dplyr)
df %>%
slice(rep(seq_len(n()), each = 5)) %>%
group_by(site) %>%
mutate(across(starts_with('species'), ~+(row_number() <= first(.))))
#Use mutate_at with old dplyr
#mutate_at(vars(starts_with('species')), ~+(row_number() <= first(.)))
# site species1 species2
# <chr> <int> <int>
# 1 a 0 1
# 2 a 0 1
# 3 a 0 1
# 4 a 0 1
# 5 a 0 1
# 6 b 1 1
# 7 b 1 1
# 8 b 0 0
# 9 b 0 0
#10 b 0 0
#11 c 1 1
#12 c 0 1
#13 c 0 1
#14 c 0 1
#15 c 0 0
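The question also asks whether the 1s and 0s could be randomised within each site, which the answers above do not cover. One possible sketch in base R, assuming 5 visits per site as stated in the question; the helper name random_presence and the seed are illustrative choices, not part of the original post.

```r
set.seed(42)  # only to make the illustration reproducible

df <- data.frame(site = c("a", "b", "c"),
                 species1 = c(0, 2, 1),
                 species2 = c(5, 2, 4))
n_visits <- 5  # number of presence/absence measures per site (from the question)

# Hypothetical helper: k ones and n - k zeros, returned in random order
random_presence <- function(k, n) {
  x <- rep(c(1L, 0L), times = c(k, n - k))
  x[sample.int(n)]
}

pa <- data.frame(
  site     = rep(df$site, each = n_visits),
  species1 = unlist(lapply(df$species1, random_presence, n = n_visits)),
  species2 = unlist(lapply(df$species2, random_presence, n = n_visits))
)

# Per-site totals recover the original summed counts
aggregate(cbind(species1, species2) ~ site, data = pa, FUN = sum)
```

Because each column is shuffled independently, the species are not forced to share the same presence pattern within a site.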

which.max() by groups but output in the dataframe

There is this data frame given by (an example):
df <- read.table(header = TRUE, text = 'Group Utility
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
I want to use any command (I have been trying iterations of which.max() to no avail) to add a column to the dataset, say Choice, that indicates whether Utility is the maximum within the group given by Group. The table would look like:
Group Utility Choice
A 12 1
A 10 0
B 3 0
B 5 0
B 6 1
C 1 1
D 3 0
D 4 1
You can try this with dplyr
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Choice = ifelse(Utility == max(Utility), 1, 0)) %>%
ungroup()
Output
# A tibble: 8 x 3
Group Utility Choice
<fct> <int> <dbl>
1 A 12 1
2 A 10 0
3 B 3 0
4 B 5 0
5 B 6 1
6 C 1 1
7 D 3 0
8 D 4 1
A one-liner base R solution.
df$Choice <- with(df, ave(Utility, Group, FUN = function(x) +(x == max(x))))
df
# Group Utility Choice
#1 A 12 1
#2 A 10 0
#3 B 3 0
#4 B 5 0
#5 B 6 1
#6 C 1 1
#7 D 3 0
#8 D 4 1
An option with data.table
library(data.table)
setDT(df)[, Choice := +(Utility == max(Utility)), by = Group]
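Since the question mentions which.max(), it may help to see how it differs from comparing against max() when a group has ties. The vector below is a made-up group, not data from the question.

```r
utility <- c(3, 6, 6)  # hypothetical group in which the maximum occurs twice

# Comparing with max() flags every row that attains the maximum
flag_all_ties <- +(utility == max(utility))
flag_all_ties    # 0 1 1

# which.max() returns only the position of the first maximum
flag_first_only <- +(seq_along(utility) == which.max(utility))
flag_first_only  # 0 1 0
```

On the question's data no group has a tied maximum, so both variants would give the same Choice column there.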

Fill in rows based on condition for grouped data using tidyr

I have the following dataframe of which I am trying to create the 'index2' field conditional on the 'index1' field:
Basically this data represents a succession of behaviours for different individual (ID) penguins and I am trying to index groups of behaviour (index 2) that incorporates all other behaviours in between (and including) dives (which have been indexed into dive bouts = index 1). I would appreciate a tidyverse solution grouping by ID.
Reproducible:
df<-data.frame(ID=c(rep('A',9),rep('B',14)),behaviour=c('surface','dive','dive','dive','surface','commute','surface','dive', 'dive','dive','dive','surface','dive','dive','commute','commute','surface','dive','dive','surface','dive','dive','surface'),index1=c(0,1,1,1,0,0,0,1,1,2,2,0,3,3,0,0,0,3,3,0,3,3,0),index2=c(0,1,1,1,1,1,1,1,1,2,2,0,3,3,3,3,3,3,3,3,3,3,0))
We could create a function with rle
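frle <- function(x) {
  # runs of consecutive equal index values
  rl <- rle(x)
  # running maximum of the run values, so 0-runs inherit the last bout index
  i1 <- cummax(rl$values)
  # TRUE for runs that are immediately followed by a higher running maximum
  i2 <- c(i1[-1] != i1[-length(i1)], FALSE)
  # those runs separate two bouts, so set them to 0
  i1[i2] <- 0
  # expand the modified runs back to the original length
  as.integer(inverse.rle(within.list(rl, values <- i1)))
}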
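After grouping by 'ID', apply the function to 'Index1' inside mutate to get the expected column. (As the output shows, 'df1' here is the answerer's copy of the data, with capitalized column names 'Index1' and 'Index2'.)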
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Index2New = frle(Index1))
# A tibble: 19 x 5
# Groups: ID [2]
# ID behaviour Index1 Index2 Index2New
# <chr> <chr> <int> <int> <int>
# 1 A surface 0 0 0
# 2 A dive 1 1 1
# 3 A dive 1 1 1
# 4 A dive 1 1 1
# 5 A surface 0 1 1
# 6 A commute 0 1 1
# 7 A surface 0 1 1
# 8 A dive 1 1 1
# 9 A dive 1 1 1
#10 B dive 2 2 2
#11 B dive 2 2 2
#12 B surface 0 0 0
#13 B dive 3 3 3
#14 B dive 3 3 3
#15 B commute 0 3 3
#16 B commute 0 3 3
#17 B surface 0 3 3
#18 B dive 3 3 3
#19 B dive 3 3 3
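To see what frle does, it can be traced on the index1 values posted in the question. The sketch below copies the function from the answer and applies it to each bird separately; the variable names a_index1 and b_index1 are introduced here for illustration.

```r
# Function copied from the answer above
frle <- function(x) {
  rl <- rle(x)
  i1 <- cummax(rl$values)
  i2 <- c(i1[-1] != i1[-length(i1)], FALSE)
  i1[i2] <- 0
  as.integer(inverse.rle(within.list(rl, values <- i1)))
}

# index1 values from the question's data frame, split by ID
a_index1 <- c(0, 1, 1, 1, 0, 0, 0, 1, 1)
b_index1 <- c(2, 2, 0, 3, 3, 0, 0, 0, 3, 3, 0, 3, 3, 0)

frle(a_index1)
# 0 1 1 1 1 1 1 1 1

frle(b_index1)
# 2 2 0 3 3 3 3 3 3 3 3 3 3 3
```

For bird A this matches the posted index2. For bird B the final surface row comes out as 3, whereas the posted index2 has 0 there, so rows after the last dive of a bout may need separate handling.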

How to copy value of a cell to other rows based on the value of other two columns?

I have a data frame that looks like this:
zz = "Sub Item Answer
1 A 1 NA
2 A 1 0
3 A 2 NA
4 A 2 1
5 B 1 NA
6 B 1 1
7 B 2 NA
8 B 2 0"
Data = read.table(text=zz, header = TRUE)
The desirable result is to have the value under "Answer" (0 or 1) copied to the NA cells of the same Subject and the same Item. For instance, the answer = 0 in row 2 should be copied to the answer cell in row 1, but not other rows. The output should be like this:
zz2 = "Sub Item Answer
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0"
Data2 = read.table(text=zz2, header = TRUE)
How should I do this? I noticed that there are some previous questions that asked how to copy a cell to other cells such as replace NA value with the group value, but it was done based on the value of one column only. Also, this question is slightly different from Replace missing values (NA) with most recent non-NA by group, which aims to copy the most-recent numeric value to NAs.
Thanks for all your answers!
You can use zoo::na.locf.
library(tidyverse);
library(zoo);
Data %>% group_by(Sub, Item) %>% mutate(Answer = na.locf(Answer, fromLast = TRUE));
# A tibble: 8 x 3
## Groups: Sub, Item [4]
# Sub Item Answer
# <fct> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
Thanks to @steveb, here is an alternative without having to rely on zoo::na.locf:
Data %>% group_by(Sub, Item) %>% mutate(Answer = Answer[!is.na(Answer)]);
library(tidyverse)
Data%>%group_by(Sub,Item)%>%fill(Answer,.direction = "up")
# A tibble: 8 x 3
# Groups: Sub, Item [4]
Sub Item Answer
<fctr> <int> <int>
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0
Though it was not the OP's intention, I thought about situations where a (Sub, Item) group contains only NA values, or where a group contains multiple non-NA values.
One way to handle such situations is to take the max (or min) of the group and to treat an infinite result, which max() returns for an all-NA group, as NA.
A solution could be:
library(dplyr)
Data %>% group_by(Sub, Item) %>%
mutate(Answer = ifelse(max(Answer, na.rm=TRUE)== -Inf, NA,
as.integer(max(Answer, na.rm=TRUE))))
#Result
# Sub Item Answer
# <fctr> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
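The all-NA case can also be checked explicitly, which avoids the warning that max() emits when nothing is left after removing NAs. A minimal base R sketch, using made-up data in which one group has no recorded answer; the helper name fill_group is hypothetical.

```r
# Hypothetical data: group (B, 2) has no recorded answer at all
Data <- data.frame(Sub    = c("A", "A", "B", "B"),
                   Item   = c(1, 1, 2, 2),
                   Answer = c(NA, 0L, NA, NA))

# Hypothetical helper: spread the group's maximum over the group,
# or keep NA when the group has no non-NA value
fill_group <- function(x) {
  if (all(is.na(x))) return(rep(NA_integer_, length(x)))
  rep(max(x, na.rm = TRUE), length(x))
}

# ave() applies the helper within each Sub/Item combination
Data$Answer <- ave(Data$Answer, Data$Sub, Data$Item, FUN = fill_group)
Data$Answer
# 0 0 NA NA
```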
