Flagging an id when similar columns have different values in R

I need to flag an id when it has different grade values across the grade columns. Here is what my sample dataset looks like:
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df$Grade <- paste0(df$grade.1, df$grade.2, df$grade.3, df$grade.4)
> df
id grade.1 grade.2 grade.3 grade.4 Grade
1 11 3 3 4 NA 334NA
2 22 4 4 4 NA 444NA
3 33 5 5 6 NA 556NA
4 44 6 NA 5 5 6NA55
5 55 7 7 7 7 7777
When an id has different grade values in grade.1, grade.2, grade.3 and grade.4, that row needs to be flagged. NA values in those columns should not affect the flagging.
In other words, if the Grade column at the end contains any differing numbers, that id needs to be flagged.
My desired output should look like this:
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA Not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 Not_flagged
Any ideas?
Thanks!

A base R solution using rle omitting NA values.
df$flag <- apply(df[,2:5], 1, function(x)
ifelse(length(rle(x[!is.na(x)])$lengths)==1, "not_flagged", "flagged"))
df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
Data
df <- structure(list(id = c(11, 22, 33, 44, 55), grade.1 = c(3, 4,
5, 6, 7), grade.2 = c(3, 4, 5, NA, 7), grade.3 = c(4, 4, 6, 5,
7), grade.4 = c(NA, NA, NA, 5, 7)), class = "data.frame", row.names = c(NA,
-5L))

Here is a base R approach.
df$flag <- c("not_flagged", "flagged")[
apply(df[-1L], 1L, \(x) length( (ux <- unique(x))[!is.na(ux)] ) > 1L) + 1L
]
Output
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
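For larger data, the row-wise apply() can be avoided. Here is a minimal sketch, assuming all grade columns are numeric: a row has differing values exactly when its NA-ignoring minimum differs from its NA-ignoring maximum. The names grade_cols, row_min and row_max are only for illustration.

```r
df <- data.frame(id = c(11, 22, 33, 44, 55),
                 grade.1 = c(3, 4, 5, 6, 7),
                 grade.2 = c(3, 4, 5, NA, 7),
                 grade.3 = c(4, 4, 6, 5, 7),
                 grade.4 = c(NA, NA, NA, 5, 7))

# Select the grade columns by name instead of by position
grade_cols <- df[grepl("^grade\\.", names(df))]

# Vectorised row-wise min and max, ignoring NA
row_min <- do.call(pmin, c(grade_cols, na.rm = TRUE))
row_max <- do.call(pmax, c(grade_cols, na.rm = TRUE))

# Differing values exist exactly when min and max disagree
df$flag <- ifelse(row_min != row_max, "flagged", "not_flagged")
df
```

Note that a row whose grades are all NA would get NA as its flag here, so that case needs its own rule if it can occur.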

A possible solution:
library(tidyverse)
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df %>%
rowwise %>%
mutate(flag = if_else(length(unique(na.omit(c_across(2:5)))) == 1,
"not-flagged", "flagged")) %>% ungroup
#> # A tibble: 5 × 6
#> id grade.1 grade.2 grade.3 grade.4 flag
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 11 3 3 4 NA flagged
#> 2 22 4 4 4 NA not-flagged
#> 3 33 5 5 6 NA flagged
#> 4 44 6 NA 5 5 flagged
#> 5 55 7 7 7 7 not-flagged
Using data.table::uniqueN, which counts the number of unique elements in a vector (and allows NA removal):
library(data.table)
library(dplyr)
df %>%
rowwise %>%
mutate(flag = if_else(uniqueN(c_across(2:5), na.rm = TRUE) == 1,
"not-flagged", "flagged")) %>% ungroup

n_distinct from dplyr is very helpful. Here is a version using a combination of pivot_longer and pivot_wider:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-c(id, Grade),
names_to = "name",
values_to = "value"
) %>%
group_by(id) %>%
mutate(flag = ifelse(n_distinct(value, na.rm = TRUE)==1, "Not flagged", "Flagged")) %>%
pivot_wider(
names_from = name,
values_from = value
)
id Grade flag grade.1 grade.2 grade.3 grade.4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 334NA Flagged 3 3 4 NA
2 22 444NA Not flagged 4 4 4 NA
3 33 556NA Flagged 5 5 6 NA
4 44 6NA55 Flagged 6 NA 5 5
5 55 7777 Not flagged 7 7 7 7

Related

How to add rows so that each group has equal number of rows?

I have a data frame with unequal numbers of rows per group, see df in the example below. I would like to add rows containing the group name and NAs in all other columns so that there is an equal number of rows per group like in df.desired. The rows should be added after the last row from the respective group.
Example:
df = data.frame(group = c("A","A","A","A","B","B","B","C","C"),
col1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
col2 = c(12, 13, 14, 15, 21, 22, 23, 31, 32))
> df
group col1 col2
1 A 1 12
2 A 1 13
3 A 1 14
4 A 1 15
5 B 2 21
6 B 2 22
7 B 2 23
8 C 3 31
9 C 3 32
df.desired = data.frame(group = c("A","A","A","A","B","B","B","B","C","C","C","C"),
col1 = c(1, 1, 1, 1, 2, 2, 2, NA, 3, 3, NA, NA),
col2 = c(12, 13, 14, 15, 21, 22, 23, NA, 31, 32, NA, NA))
> df.desired
group col1 col2
1 A 1 12
2 A 1 13
3 A 1 14
4 A 1 15
5 B 2 21
6 B 2 22
7 B 2 23
8 B NA NA
9 C 3 31
10 C 3 32
11 C NA NA
12 C NA NA
I know how to do this with a loop but that would be super slow and I would prefer to use dplyr if possible. Does anyone have any ideas?
How about this:
library(dplyr)
df = data.frame(group = c("A","A","A","A","B","B","B","C","C"),
col1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
col2 = c(12, 13, 14, 15, 21, 22, 23, 31, 32))
maxgp <- max(table(df$group))
df %>%
group_by(group) %>%
summarise(across(everything(), ~c(.x, rep(NA, maxgp-n()))))
#> `summarise()` has grouped output by 'group'. You can override using the
#> `.groups` argument.
#> # A tibble: 12 × 3
#> # Groups: group [3]
#> group col1 col2
#> <chr> <dbl> <dbl>
#> 1 A 1 12
#> 2 A 1 13
#> 3 A 1 14
#> 4 A 1 15
#> 5 B 2 21
#> 6 B 2 22
#> 7 B 2 23
#> 8 B NA NA
#> 9 C 3 31
#> 10 C 3 32
#> 11 C NA NA
#> 12 C NA NA
Created on 2023-02-01 by the reprex package (v2.0.1)
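In dplyr 1.1.0 and later, having summarise() return more than one row per group is deprecated, and reframe() is the intended replacement. A sketch of the same idea with reframe(), assuming dplyr >= 1.1.0 is installed (the name out is only for illustration):

```r
library(dplyr)

df <- data.frame(group = c("A","A","A","A","B","B","B","C","C"),
                 col1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                 col2 = c(12, 13, 14, 15, 21, 22, 23, 31, 32))

# Size of the largest group; every group is padded up to this
maxgp <- max(table(df$group))

out <- df %>%
  group_by(group) %>%
  # reframe() may return any number of rows per group
  reframe(across(everything(), ~ c(.x, rep(NA, maxgp - n()))))
out
```

Unlike summarise(), reframe() always returns an ungrouped result, so no trailing ungroup() is needed.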
You can create row numbers for each group and then tidyr::complete:
library(dplyr)
df %>%
group_by(group) %>%
mutate(id = row_number()) %>%
ungroup() %>%
tidyr::complete(group, id) %>%
select(-id)
# # A tibble: 12 × 3
# group col1 col2
# <chr> <dbl> <dbl>
# 1 A 1 12
# 2 A 1 13
# 3 A 1 14
# 4 A 1 15
# 5 B 2 21
# 6 B 2 22
# 7 B 2 23
# 8 B NA NA
# 9 C 3 31
# 10 C 3 32
# 11 C NA NA
# 12 C NA NA
Update (from @Maël's answer)
As of dplyr 1.1.0, per-operation grouping with .by/by is supported for mutate(), summarise(), filter(), and the slice() family, so the code can be simplified to:
df %>%
mutate(id = row_number(), .by = group) %>%
tidyr::complete(group, id) %>%
select(-id)

merging with changing the form of columns

I have two data frames named "df" and "df1". I want to merge df with df1 on gender and dis (district) so that after merging I have only one "prob.dis" column. More precisely, if dis is 1, the value from prob.dis1 should be used, and if dis is 5, the value from prob.dis5. Any help would be appreciated.
df<-
age gender dis
10 1 1
11 2 5
10 1 4
11 2 2
10 1 1
11 2 2
10 1 4
11 2 5
10 1 3
11 2 3
df1<-
age gender prob.dis1 prob.dis2 prob.dis3 prob.dis4 prob.dis5
10 1 0.0099 0.0124 0.0037 0.0176 0.1
11 2 0.0021 0.802 0.005 0.0029 0.2
Transform df1 into long format and join it with df on gender and dis:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(
cols = matches("^prob.dis"),
names_to = c(".value", "dis"),
names_pattern = "([^0-9]+)([0-9]+)"
) %>%
mutate(dis = as.integer(dis)) %>%
select(-age) %>%
left_join(df, ., by = c("gender", "dis"))
age gender dis prob.dis
<dbl> <dbl> <dbl> <dbl>
1 10 1 1 0.0099
2 11 2 5 0.2
3 10 1 4 0.0176
4 11 2 2 0.802
5 10 1 1 0.0099
6 11 2 2 0.802
7 10 1 4 0.0176
8 11 2 5 0.2
9 10 1 3 0.0037
10 11 2 3 0.005
data:
df <- structure(list(age = c(10, 11, 10, 11, 10, 11, 10, 11, 10, 11
), gender = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2), dis = c(1, 5, 4,
2, 1, 2, 4, 5, 3, 3)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
df1 <- structure(list(age = c(10, 11), gender = c(1, 2), prob.dis1 = c(0.0099,
0.0021), prob.dis2 = c(0.0124, 0.802), prob.dis3 = c(0.0037,
0.005), prob.dis4 = c(0.0176, 0.0029), prob.dis5 = c(0.1, 0.2
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame")
)
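Update:
Judging from the code below, df2 is the wide probability table and df1 is the table with the dis column. The steps are:
pivot_longer df2
right_join with df1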
library(dplyr)
library(tidyr)
df1 <- df2 %>%
pivot_longer(
cols = starts_with("prob"),
names_to = "dis",
values_to = "prob.dis"
) %>%
mutate(dis = as.numeric(gsub("\\D", "", dis))) %>%
right_join(df1, by = c("age", "gender", "dis"))
Output:
age gender dis prob.dis
<int> <int> <dbl> <dbl>
1 10 1 1 0.0099
2 10 1 1 0.0099
3 10 1 3 0.0037
4 10 1 4 0.0176
5 10 1 4 0.0176
6 11 2 2 0.802
7 11 2 2 0.802
8 11 2 3 0.005
9 11 2 5 0.2
10 11 2 5 0.2
If dis only ranges from 1 to 5 and the prob.dis columns of df1 are sorted in that order, you can use match to find the row and 2 + df$dis to get the column, and then subset with a matrix built by cbind.
df$prop.dis <- df1[cbind(match(df$gender, df1$gender), 2 + df$dis)]
df
# age gender dis prop.dis
#1 10 1 1 0.0099
#2 11 2 5 0.2000
#3 10 1 4 0.0176
#4 11 2 2 0.8020
#5 10 1 1 0.0099
#6 11 2 2 0.8020
#7 10 1 4 0.0176
#8 11 2 5 0.2000
#9 10 1 3 0.0037
#10 11 2 3 0.0050
or using the names of df1 to match the columns:
df$prop.dis <- df1[cbind(match(df$gender, df1$gender)
, match(paste0("prob.dis", df$dis), names(df1)))]
If age should also be matched, use interaction in addition:
M <- c("age", "gender")
df$prop.dis <- df1[cbind(match(interaction(df[M]), interaction(df1[M])), 2 + df$dis)]
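The trick in the base R answer is matrix indexing: a two-column matrix of (row, column) pairs picks one cell per pair. Here is a small sketch on three lookups, assuming df1 is a plain data.frame with only numeric columns (for a tibble, converting with as.data.frame() first is a reasonable precaution). The names want_gender, want_dis, rows and cols are only for illustration.

```r
df1 <- data.frame(age = c(10, 11), gender = c(1, 2),
                  prob.dis1 = c(0.0099, 0.0021),
                  prob.dis2 = c(0.0124, 0.802),
                  prob.dis3 = c(0.0037, 0.005),
                  prob.dis4 = c(0.0176, 0.0029),
                  prob.dis5 = c(0.1, 0.2))

# Three (gender, dis) lookups: (1, 1), (2, 5), (1, 4)
want_gender <- c(1, 2, 1)
want_dis <- c(1, 5, 4)

rows <- match(want_gender, df1$gender)                     # row of each gender
cols <- match(paste0("prob.dis", want_dis), names(df1))    # column of each dis

# One cell per (row, column) pair
picked <- df1[cbind(rows, cols)]
picked
```

Matching the column by name rather than with 2 + dis keeps the lookup correct even if the column order of df1 changes.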

Calculate rolling sum by group [duplicate]

This question already has answers here:
Why are my dplyr group_by & summarize not working properly? (name-collision with plyr)
I would like to calculate a rolling sum (or a custom function) of 3 previous values, treating each group separately. I have tried this:
require(dplyr)
# Build dataframe
df <- data.frame(person = c(rep("Peter", 5), rep("James", 5)),
score1 = c(1,3,2,5,4,6,8,4,5,3),
score2 = c(1,1,1,5,1,3,4,8,9,0))
# Attempt rolling sum by group
df %>%
group_by(person) %>%
mutate(s1_rolling = rollsumr(score1, k = 3, fill = NA),
s2_rolling = rollsumr(score2, k = 3, fill = NA))
But the new columns do not treat each group separately, instead continuing down the whole dataset:
person score1 score2 s1_rolling s2_rolling
<chr> <dbl> <dbl> <dbl> <dbl>
1 Peter 1 1 NA NA
2 Peter 3 1 NA NA
3 Peter 2 1 6 3
4 Peter 5 5 10 7
5 Peter 4 1 11 7
6 James 6 3 15 9
7 James 8 4 18 8
8 James 4 8 18 15
9 James 5 9 17 21
10 James 3 0 12 17
I would like rows 6 and 7 to show NA in the two new columns, because until row 8 there is insufficient James data to sum 3 rows.
How can I do this?
It could be that plyr was also loaded and the mutate from plyr masked the mutate from dplyr. We could use dplyr::mutate
library(dplyr)
library(zoo)
df %>%
group_by(person) %>%
dplyr::mutate(s1_rolling = rollsumr(score1, k = 3, fill = NA),
s2_rolling = rollsumr(score2, k = 3, fill = NA))
# A tibble: 10 x 5
# Groups: person [2]
# person score1 score2 s1_rolling s2_rolling
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Peter 1 1 NA NA
# 2 Peter 3 1 NA NA
# 3 Peter 2 1 6 3
# 4 Peter 5 5 10 7
# 5 Peter 4 1 11 7
# 6 James 6 3 NA NA
# 7 James 8 4 NA NA
# 8 James 4 8 18 15
# 9 James 5 9 17 21
#10 James 3 0 12 17
If there is more than one column, we can also use across:
df %>%
group_by(person) %>%
dplyr::mutate(across(starts_with('score'),
~ rollsumr(., k = 3, fill = NA), .names = '{col}_rolling'))
For a faster version, use RcppRoll::roll_sumr
df %>%
group_by(person) %>%
dplyr::mutate(across(starts_with('score'),
~ RcppRoll::roll_sumr(., 3, fill = NA), .names = '{col}_rolling'))
The behavior can be reproduced with plyr::mutate
df %>%
group_by(person) %>%
plyr::mutate(s1_rolling = rollsumr(score1, k = 3, fill = NA),
s2_rolling = rollsumr(score2, k = 3, fill = NA))
# A tibble: 10 x 5
# Groups: person [2]
# person score1 score2 s1_rolling s2_rolling
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Peter 1 1 NA NA
# 2 Peter 3 1 NA NA
# 3 Peter 2 1 6 3
# 4 Peter 5 5 10 7
# 5 Peter 4 1 11 7
# 6 James 6 3 15 9
# 7 James 8 4 18 8
# 8 James 4 8 18 15
# 9 James 5 9 17 21
#10 James 3 0 12 17
I would suggest a slider approach with the slide_dbl() function, which works similarly to zoo and is compatible with dplyr:
library(slider)
library(dplyr)
#Code
# Build dataframe
df <- data.frame(person = c(rep("Peter", 5), rep("James", 5)),
score1 = c(1,3,2,5,4,6,8,4,5,3),
score2 = c(1,1,1,5,1,3,4,8,9,0))
# Attempt rolling sum by group
df %>%
group_by(person) %>%
mutate(s1_rolling = slide_dbl(score1, sum, .before = 2, .complete = TRUE),
s2_rolling = slide_dbl(score2, sum, .before = 2, .complete = TRUE))
Output:
# A tibble: 10 x 5
# Groups: person [2]
person score1 score2 s1_rolling s2_rolling
<fct> <dbl> <dbl> <dbl> <dbl>
1 Peter 1 1 NA NA
2 Peter 3 1 NA NA
3 Peter 2 1 6 3
4 Peter 5 5 10 7
5 Peter 4 1 11 7
6 James 6 3 NA NA
7 James 8 4 NA NA
8 James 4 8 18 15
9 James 5 9 17 21
10 James 3 0 12 17
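If package masking is a concern, the grouped rolling sum can also be done without any add-on packages. This is a minimal base R sketch, not what the answers above use: ave() applies a function per group, and stats::filter() with sides = 1 sums the current and the two previous values. The helper name roll3 is only for illustration.

```r
df <- data.frame(person = c(rep("Peter", 5), rep("James", 5)),
                 score1 = c(1,3,2,5,4,6,8,4,5,3),
                 score2 = c(1,1,1,5,1,3,4,8,9,0))

# Trailing window of 3: NA until three values are available.
# stats::filter is called explicitly because dplyr::filter would mask it.
roll3 <- function(x) as.numeric(stats::filter(x, rep(1, 3), sides = 1))

# ave() runs roll3 separately within each person
df$s1_rolling <- ave(df$score1, df$person, FUN = roll3)
df$s2_rolling <- ave(df$score2, df$person, FUN = roll3)
df
```

Because ave() restarts the window at each person, rows 6 and 7 come out as NA, as requested in the question.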

Restructure data based on row numbers in R

I am having trouble restructuring my data the way I need to.
My df looks like this:
id <- (1:20)
author <- c("A","A","A","A","A","B","B","B","A","A","A","B","B","B","B"
,"B","B","B","A","A")
df <- data.frame(id, author)
> print(df)
id author
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 B
7 7 B
8 8 B
9 9 A
10 10 A
11 11 A
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
18 18 B
19 19 A
20 20 A
And I'm trying to get a data structure where the columns are the authors and the rows indicate the first and last id values of each sequence of A or B values. So in this case the first row with author A is id = 1, and the last one of that series is id 5, and so forth.
Something like this:
A <- c(1, 5, 9, 11, 19,20)
B <- c(6, 8, 12, 18, NA, NA)
df.desired <- data.frame(A, B)
print(df.desired)
A B
1 1 6
2 5 8
3 9 12
4 11 18
5 19 NA
6 20 NA
Any ideas?
Thanks a lot!
We can create groups using data.table rleid, select 1st and last row in each group and get data in wide format.
library(dplyr)
df %>%
group_by(grp = data.table::rleid(author)) %>%
slice(1L, n()) %>%
group_by(author) %>%
mutate(grp = row_number()) %>%
tidyr::pivot_wider(names_from = author, values_from = id) %>%
select(-grp)
# A tibble: 6 x 2
# A B
# <int> <int>
#1 1 6
#2 5 8
#3 9 12
#4 11 18
#5 19 NA
#6 20 NA
For the updated request in the comments, we can do:
df %>%
group_by(grp = data.table::rleid(author)) %>%
slice(1L, n()) %>%
mutate(author = row_number()) %>%
tidyr::pivot_wider(names_from = author, values_from = id) %>%
ungroup %>%
select(-grp)
# A tibble: 5 x 2
# `1` `2`
# <int> <int>
#1 1 5
#2 6 8
#3 9 11
#4 12 18
#5 19 20
Here is a base R option
z <- rle(df$author)
lst <- split(df,findInterval(1:nrow(df),cumsum(z$lengths), left.open = TRUE))
u <- lapply(lst,function(v) range(v$id))
idx <- split(seq_along(z$values),z$values)
x <- lapply(idx,function(v) unlist(u[v],use.names = FALSE))
df.desired <- as.data.frame(lapply(x,`length<-`,max(lengths(x))))
which gives
> df.desired
A B
1 1 6
2 5 8
3 9 12
4 11 18
5 19 NA
6 20 NA
An option using data.table:
library(data.table)
dcast(
setDT(df)[, ri := rleid(author)][, id[c(1L, .N)], .(author, ri)],
rowid(author) ~ author, value.var="V1")
output:
author A B
1: 1 1 6
2: 2 5 8
3: 3 9 12
4: 4 11 18
5: 5 19 NA
6: 6 20 NA
If there is a possibility of an author having a single row, you will need unique(c(1L, .N)).
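The first and last id of each run, which all the answers above rely on, can also be read straight off rle(). This is a small sketch assuming the rows are already in id order. The names runs, starts and ends are only for illustration.

```r
id <- 1:20
author <- c("A","A","A","A","A","B","B","B","A","A","A","B","B","B","B",
            "B","B","B","A","A")

r <- rle(author)

# Last position of each run, then first position from the run length
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1L

# One row per run: author, first id, last id
runs <- data.frame(author = r$values, first = id[starts], last = id[ends])
runs
```

A run of length one simply gets first equal to last here, so the single-row case needs no special handling in this layout.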

Unnest or unchop dataframe containing lists of different lengths

I have a dataframe with several columns containing list columns that I want to unnest (or unchop). BUT, they are different lengths, so the resulting error is Error: No common size for...
Here is a reprex to show what works and doesn't work.
library(tidyr)
library(vctrs)
# This works as expected
df_A <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9)))
)
unchop(df_A, cols = c(A))
# A tibble: 7 x 2
ID A
<int> <dbl>
1 1 9
2 1 8
3 1 5
4 2 7
5 2 6
6 3 6
7 3 9
# This works as expected as the lists are the same lengths
df_AB_1 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2, 3), c(4, 5), c(7, 8)))
)
unchop(df_AB_1, cols = c(A, B))
# A tibble: 7 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 3
4 2 7 4
5 2 6 5
6 3 6 7
7 3 9 8
# This does NOT work as the lists are different lengths
df_AB_2 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0)))
)
unchop(df_AB_2, cols = c(A, B))
# Error: No common size for `A`, size 3, and `B`, size 2.
The output that I would like to achieve for df_AB_2 above is as follows where each list is unchopped and missing values are filled with NA:
# A tibble: 10 x 3
ID A B
<dbl> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
I have referenced this issue on Github and StackOverflow here.
Any ideas how to achieve the result above?
Versions
> packageVersion("tidyr")
[1] ‘1.0.0’
> packageVersion("vctrs")
[1] ‘0.2.0.9001’
Here is an idea via the tidyverse that you can generalise to as many columns as you want:
library(tidyverse)
df_AB_2 %>%
pivot_longer(c(A, B)) %>%
mutate(value = lapply(value, `length<-`, max(lengths(value)))) %>%
pivot_wider(names_from = name, values_from = value) %>%
unnest(c(A, B)) %>%
filter(rowSums(is.na(.[-1])) != 2)
which gives,
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Defining a helper function to update the lengths of the element and proceeding with dplyr:
foo <- function(x, len_vec) {
lapply(
seq_len(length(x)),
function(i) {
length(x[[i]]) <- len_vec[i]
x[[i]]
}
)
}
df_AB_2 %>%
mutate(maxl = pmax(lengths(A), lengths(B))) %>%
mutate(A = foo(A, maxl), B = foo(B, maxl)) %>%
unchop(cols = c(A, B)) %>%
select(-maxl)
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Using data.table:
library(data.table)
setDT(df_AB_2)
df_AB_2[, maxl := pmax(lengths(A), lengths(B))]
df_AB_2[, .(unlist(A)[seq_len(maxl)], unlist(B)[seq_len(maxl)]), by = ID]
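All of the answers above rest on the same step: pad each list element with NA up to the longer of the two lengths for that row, then flatten. Here is a package-free sketch of that step, using plain lists instead of list_of columns. The helper name pad and the names maxl and out are only for illustration.

```r
ID <- 1:3
A <- list(c(9, 8, 5), c(7, 6), c(6, 9))
B <- list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0))

# Target length per row: the longer of the two list elements
maxl <- pmax(lengths(A), lengths(B))

# Extending a vector's length pads it with NA
pad <- function(x, n) Map(function(v, k) { length(v) <- k; v }, x, n)

# Repeat each ID once per padded element, then flatten both lists
out <- data.frame(ID = rep(ID, maxl),
                  A = unlist(pad(A, maxl)),
                  B = unlist(pad(B, maxl)))
out
```

Padding per row, rather than to the global maximum, avoids creating rows that are NA in every column and then having to filter them out.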

Resources