Hashing every row of a tibble

I am using the newly minted dplyr 1.0.0 and the digest package to generate a hash of every row in a tibble.
I am aware of "adding hash to each row using dplyr and digest in R", but I would like to use the revamped rowwise() in dplyr 1.0.0.
See the example below. Does anyone have any idea why it fails? I should be allowed to digest a row whose entries are of different types.
library(dplyr)
library(digest)
df <- tibble(
student_id = letters[1:4],
student_id2 = letters[9:12],
test1 = 10:13,
test2 = 20:23,
test3 = 30:33,
test4 = 40:43
)
df
#> # A tibble: 4 x 6
#> student_id student_id2 test1 test2 test3 test4
#> <chr> <chr> <int> <int> <int> <int>
#> 1 a i 10 20 30 40
#> 2 b j 11 21 31 41
#> 3 c k 12 22 32 42
#> 4 d l 13 23 33 43
dd <- df %>%
rowwise(student_id) %>%
mutate(hash = digest(c_across(everything()))) %>%
ungroup
#> Error: Problem with `mutate()` input `hash`.
#> ✖ Can't combine `student_id2` <character> and `test1` <integer>.
#> ℹ Input `hash` is `digest(c_across(everything()))`.
#> ℹ The error occurred in row 1.
### but digest should not care too much about the type of the input
Created on 2020-06-04 by the reprex package (v0.3.0)
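(A quick check of that claim, for what it's worth: digest() on its own has no problem with mixed types.)
library(digest)
digest(list("a", 1L))  # works: digest() hashes arbitrary R objects, mixed-type lists included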

The error comes from c_across(), not digest(): c_across() combines the selected values of the current row into a single vector, which requires a common type, and <character> cannot be combined with <integer>. One option is to first convert all the columns to a single type and then do the rowwise operation:
library(dplyr)
library(digest)
df %>%
mutate(across(everything(), as.character)) %>%
rowwise %>%
mutate(hash = digest(c_across(everything())))
# A tibble: 4 x 7
# Rowwise:
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 a i 10 20 30 40 2638067de6dcfb3d58b83a83e0cd3089
#2 b j 11 21 31 41 21162fc0c528a6550b53c87ca0c2805e
#3 c k 12 22 32 42 8d7539eacff61efbd567b6100227523b
#4 d l 13 23 33 43 9739997605aa39620ce50e96f1ff4f70
Another option is to unite() the columns into a single one and then digest that column:
library(tidyr)
df %>%
unite(new, everything(), remove = FALSE) %>%
rowwise %>%
mutate(hash = digest(new)) %>%
select(-new)
# A tibble: 4 x 7
# Rowwise:
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <int> <int> <int> <int> <chr>
#1 a i 10 20 30 40 a9e4cafdfbc88f17b7593dfd684eb2a1
#2 b j 11 21 31 41 a67a5df8186972285bd7be59e6fdab38
#3 c k 12 22 32 42 9c20bd87a50642631278b3e6d28ecf68
#4 d l 13 23 33 43 3f4f373d1969dcf0c8f542023a258225
Another option is pmap(), where we concatenate each row's elements into a single vector; the integers get converted to character because a vector can hold only a single class:
library(purrr)
df %>%
mutate(hash = pmap_chr(., ~ digest(c(...))))
# A tibble: 4 x 7
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <int> <int> <int> <int> <chr>
#1 a i 10 20 30 40 f0fb4100907570ef9bda073b78dc44a6
#2 b j 11 21 31 41 754b09e8d4d854aa5e40aa88d1edfc66
#3 c k 12 22 32 42 5f3a699caff833e900fd956232cf61dd
#4 d l 13 23 33 43 4d31c65284e5db36c37461126a9eb63c
The advantage here is that we are not changing the column types of the original data.
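If changing column types is undesirable, a further option (a sketch, assuming dplyr 1.0.x, where cur_data() returns the current row as a one-row tibble; in later dplyr versions pick(everything()) plays the same role) is to digest the whole row object, which preserves every column's type:
library(dplyr)
library(digest)
df %>%
  rowwise() %>%
  mutate(hash = digest(cur_data())) %>%  # hash the current row as a one-row tibble
  ungroup()
Note these hashes will differ from the ones above, since a different object (a one-row tibble rather than a character vector) is being digested.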

Related

dplyr concatenate strings by group - row by row

I need to concatenate strings by group in dplyr, but the resulting column should account only for the current and previous rows, not the following ones.
I want my data to look like this:
ID   message   messages_used
1    53        53
1    54        53,54
1    55        53,54,55
2    53        53
2    58        53,58
Is it achievable using dplyr only?
You can use Reduce(..., accumulate = TRUE) from base R:
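(Assuming df holds the data implied by the desired output above, e.g.:)
df <- tibble::tibble(ID = c(1, 1, 1, 2, 2), message = c(53, 54, 55, 53, 58))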
library(dplyr)
df %>%
group_by(ID) %>%
mutate(messages_used = Reduce(\(x, y) paste(x, y, sep = ", "), message, accumulate = TRUE)) %>%
ungroup()
# # A tibble: 5 x 3
# ID message messages_used
# <int> <int> <chr>
# 1 1 53 53
# 2 1 54 53, 54
# 3 1 55 53, 54, 55
# 4 2 53 53
# 5 2 58 53, 58
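To see what the accumulate = TRUE part does on its own, here it is on a bare vector (values taken from the example):
Reduce(function(x, y) paste(x, y, sep = ", "), c(53, 54, 55), accumulate = TRUE)
# [1] "53"         "53, 54"     "53, 54, 55"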
We can use dplyr::group_by() and purrr::accumulate():
dat <- data.frame(ID = c(1,1,1,2,2), message = c(53,54,55,53,58))
library(dplyr)
library(purrr)
dat %>%
group_by(ID) %>%
mutate(message_used = accumulate(message, ~ paste(.x, .y, sep =",")))
#> # A tibble: 5 x 3
#> # Groups: ID [2]
#> ID message message_used
#> <dbl> <dbl> <chr>
#> 1 1 53 53
#> 2 1 54 53,54
#> 3 1 55 53,54,55
#> 4 2 53 53
#> 5 2 58 53,58
Created on 2022-05-11 by the reprex package (v2.0.1)

Combine mutate case_when() for columns that start_with() to replace certain characters

I have a complex data frame that looks like df1
library(tidyverse)
df <- tibble(position=c(100,200,300),
correction=c("62M89S",
"8M1D55M88S",
"1S25M1P36M89S"))
df1 <- df %>%
separate(correction, into = str_c("col", 1:5),
sep = "(?<=\\D)(?=\\d)", fill = "left", remove = FALSE)
df1
#> # A tibble: 3 × 7
#> position correction col1 col2 col3 col4 col5
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 100 62M89S <NA> <NA> <NA> 62M 89S
#> 2 200 8M1D55M88S <NA> 8M 1D 55M 88S
#> 3 300 1S25M1P36M89S 1S 25M 1P 36M 89S
Created on 2022-03-02 by the reprex package (v2.0.1)
I want, for every column that starts_with("col"), to replace the S, M, and D characters with "" (empty string) and the rest (the cells containing P) with 0.
I want my data to look like this
df1
#> # A tibble: 3 × 7
#> position correction col1 col2 col3 col4 col5
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 100 62M89S <NA> <NA> <NA> 62 89
#> 2 200 8M1D55M88S <NA> 8 1 55 88
#> 3 300 1S25M1P36M89S 1 25 0 36 89
Notice that the cell containing P has been converted to zero.
This is my poor effort, for which I am ashamed:
df1 %>%
mutate(across(starts_with("col")),
~case_when(grepl("*M") | grepl("*S") | grepl("*D") ~ "",
TRUE ~ 0))
Here is one possibility using case_when and grepl:
df1 %>%
mutate(
across(starts_with("col"),~case_when(
is.na(.) ~ NA_real_,
grepl("[SMD]$", .) ~ parse_number(.),
TRUE ~ 0
)
))
# A tibble: 3 x 7
position correction col1 col2 col3 col4 col5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 62M89S NA NA NA 62 89
2 200 8M1D55M88S NA 8 1 55 88
3 300 1S25M1P36M89S 1 25 0 36 89
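As a side note, parse_number() comes from readr (attached with the tidyverse) and extracts the numeric part of a string, which is why the trailing letters disappear, for example:
parse_number("62M")
#> [1] 62
This also means the col* columns come back as doubles here, whereas the question's desired output shows character columns; wrap the expression in as.character() if the original types matter.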
Please find below another solution using the map_df() function from the purrr library and the str_replace() function from stringr:
Reprex
Code
library(tidyverse)
df1 %>% select(starts_with("col")) %>%
map_df(., str_replace, ".P", "0") %>%
map_df(., str_replace, "\\D$", "") %>%
bind_cols(df1 %>% select(-starts_with("col")),.)
Output
#> # A tibble: 3 x 7
#> position correction col1 col2 col3 col4 col5
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 100 62M89S <NA> <NA> <NA> 62 89
#> 2 200 8M1D55M88S <NA> 8 1 55 88
#> 3 300 1S25M1P36M89S 1 25 0 36 89
Created on 2022-03-02 by the reprex package (v2.0.1)
df1 %>%
mutate_at(vars(starts_with('col')),
~ case_when(
grepl('[SMD]$', .x) ~ sub('[SMD]', '', .x),
grepl('P$' , .x) ~ '0',
TRUE ~ .x)
)
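A vectorized alternative (a sketch using stringr's str_replace_all() with a named vector of patterns, applied with across() to the col* columns, so no rowwise case_when() is needed):
library(tidyverse)
df1 %>%
  mutate(across(starts_with("col"),
                ~ str_replace_all(.x, c("\\d+P" = "0",     # cells like "1P" become "0"
                                        "[SMD]$" = ""))))  # drop the trailing S, M, or D
Unlike the case_when()/parse_number() approach, this keeps the columns as character, matching the desired output in the question, and leaves NA cells untouched.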

mutate or summarise across rows by variable containing string

I'd like to create a new data table containing the sums across rows of variables whose names contain a given string. As a noob, I have been trying to keep this within the tidyverse using the new dplyr across(). Help much appreciated.
dat<- data.frame("Image" = c(1,2,3,4),
"A" = c(1,2,3,4),
"A:B"= c(5,6,7,8),
"A:B:C"= c(9,10,11,12))
The goal is to obtain the sums across rows of the variables whose names contain "A", "B", or "C":
datsums<- data.frame("Image" = c(1,2,3,4),
"Asum"= c(15,18,21,24),
"Bsum"=c(14,16,18,20),
"Csum"=c(9,10,11,12))
I have been unsuccessful using the newer dplyr verbs:
datsums<- dat %>% summarise(across(str_detect("A")), sum, .names ="Asum",
across(str_detect("B")), sum, .names="Bsum",
across(str_detect("C")), sum, .names"Csum")
Use rowwise and c_across. Note the ignore.case = FALSE in contains() below: the default is case-insensitive matching, which would also pick up the "a" in Image and inflate Asum:
library(tidyverse)
dat %>%
rowwise() %>%
summarise(
Asum = sum(c_across(contains("A"))),
Bsum = sum(c_across(contains("B"))),
Csum = sum(c_across(contains("C")))
)
Returns:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 3
Asum Bsum Csum
<dbl> <dbl> <dbl>
1 15 14 9
2 18 16 10
3 21 18 11
4 24 20 12
To add columns to the original data.frame, use mutate instead of summarise:
dat %>%
rowwise() %>%
mutate(
Asum = sum(c_across(contains("A"))),
Bsum = sum(c_across(contains("B"))),
Csum = sum(c_across(contains("C")))
)
# A tibble: 4 x 7
# Rowwise:
Image A A.B A.B.C Asum Bsum Csum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 9 15 14 9
2 2 2 6 10 18 16 10
3 3 3 7 11 21 18 11
4 4 4 8 12 24 20 12
Since you want a row-wise sum, you could use:
library(dplyr)
dat %>%
transmute(Asum = rowSums(select(., contains('A', ignore.case = FALSE))),
Bsum = rowSums(select(., contains('B', ignore.case = FALSE))),
Csum = rowSums(select(., contains('C', ignore.case = FALSE))))
Or, for many variables, use:
cols <- c('A', 'B', 'C')
purrr::map_dfc(cols, ~dat %>%
transmute(!!paste0(.x, 'sum') :=
rowSums(select(., contains(.x, ignore.case = FALSE)))))
# Asum Bsum Csum
#1 15 14 9
#2 18 16 10
#3 21 18 11
#4 24 20 12
Use pivot_longer and pivot_wider:
library(tidyverse)
dat %>%
pivot_longer(-Image) %>%
separate_rows(name, sep = "\\.") %>%
pivot_wider(Image,
names_from = name,
values_from = value,
values_fn = sum,
names_prefix = "sum")
#> # A tibble: 4 x 4
#> Image sumA sumB sumC
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 15 14 9
#> 2 2 18 16 10
#> 3 3 21 18 11
#> 4 4 24 20 12
Created on 2020-12-07 by the reprex package (v0.3.0)
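For completeness, here is a base R sketch of the same row sums (assuming, as in the outputs above, that data.frame() turned the names into A, A.B, and A.B.C):
cols <- c("A", "B", "C")
sums <- sapply(cols, function(x) rowSums(dat[grepl(x, names(dat), fixed = TRUE)]))
colnames(sums) <- paste0(cols, "sum")
cbind(dat["Image"], sums)
#   Image Asum Bsum Csum
# 1     1   15   14    9
# 2     2   18   16   10
# 3     3   21   18   11
# 4     4   24   20   12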

Why does my dplyr percentile calculation not work with tidy evaluation?

I have a tibble with student test data, and I wish to convert these to percentiles using dplyr. For the sake of having a minimal example, imagine the following setup of three students.
require(tidyverse)
tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15))
The following code works and yields the desired output.
tbl %>% mutate(TestPercentile = cume_dist(Test) * 100)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 33.3
3 Cat 15 66.7
However, I actually want to do it programmatically because there are many such columns.
colname <- "Test"
percname <- str_c(colname, "Percentile")
tbl %>% mutate({{percname}} := cume_dist({{colname}}) * 100)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 100
3 Cat 15 100
Why does cume_dist make the percentile 100 for all students when I try to use tidy evaluation like this? (And ideally, if I can be permitted a second question, how can I fix it?)
If by programmatically you mean you want to write your own function, you can do it like this:
calculate_percentile <- function(data, colname) {
data %>%
mutate("{{colname}}Percentile" := cume_dist({{colname}} * 100))
}
tbl %>%
calculate_percentile(Test)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 33.3
3 Cat 15 66.7
Edit for multiple columns
New Data
tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15), Test_math = c(16, 30, 55), Test_music = c(3, 78, 34))
calculate_percentile <- function(data, colnames) {
data %>%
mutate(across({{colnames}}, ~cume_dist(.) * 100, .names = "{col}Percentile"))
}
test_columns <- c("Test_math", "Test_music")
tbl %>%
calculate_percentile(test_columns)
# A tibble: 3 x 6
Name Test Test_math Test_music Test_mathPercentile Test_musicPercentile
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 33.3 33.3
2 Bob 13 30 78 66.7 100
3 Cat 15 55 34 100 66.7
Why does your solution not work? Because your solution applies cume_dist literally to the string "Test":
tbl %>% mutate({{percname}} := print({{colname}}))
[1] "Test"
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <chr>
1 Alice 16 16 3 Test
2 Bob 13 30 78 Test
3 Cat 15 55 34 Test
Why does this give a TestPercentile value of 100? Because cume_dist of "Test" is 1:
cume_dist("Test")
#[1] 1
So we need to tell R not to evaluate the string "Test" itself but to look up the variable with that name, which we can do like this:
tbl %>% mutate({{percname}} := cume_dist(!!parse_quo(colname, env = global_env())) * 100)
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 100
2 Bob 13 30 78 33.3
3 Cat 15 55 34 66.7
#Check that this uses the values of "Test" and not the string "Test" itself:
tbl %>% mutate({{percname}} := print(!!parse_quo(colname, env = global_env())))
[1] 16 13 15
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 16
2 Bob 13 30 78 13
3 Cat 15 55 34 15
Passing the column name as a string:
library(dplyr)
library(rlang)
return_percentile <- function(data, colname) {
percname <- paste0(colname, "Percentile")
data %>% mutate({{percname}} := cume_dist(!!sym(colname)) * 100)
}
tbl %>% return_percentile("Test")
# A tibble: 3 x 3
# Name Test TestPercentile
# <chr> <dbl> <dbl>
#1 Alice 16 100
#2 Bob 13 33.3
#3 Cat 15 66.7
Passing the column name unquoted:
return_percentile <- function(data, colname) {
percname <- paste0(deparse(substitute(colname)), "Percentile")
data %>% mutate({{percname}} := cume_dist({{colname}}) * 100)
}
tbl %>% return_percentile(Test)
# A tibble: 3 x 3
# Name Test TestPercentile
# <chr> <dbl> <dbl>
#1 Alice 16 100
#2 Bob 13 33.3
#3 Cat 15 66.7
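Another way to pass the column name as a string (a sketch using the .data pronoun together with glue-style name injection, both available in dplyr 1.0+):
library(dplyr)
colname <- "Test"
percname <- paste0(colname, "Percentile")
tbl %>% mutate("{percname}" := cume_dist(.data[[colname]]) * 100)
# Gives the same TestPercentile column (100, 33.3, 66.7) without sym() or parse_quo().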

Grouping based on two variables, including their individual combinations (e.g. A - B is the same as B - A)

I got stuck on a coding problem at work. I have a data frame with three variables: var1, var2, and Length. The latter is the shared length between var1 and var2, e.g. a boundary.
Ultimately I want to calculate the percentage of each combination of var1 - var2 (var2 - var1 is regarded as identical) on the total length of each unique element in either var1 or var2. Because this sounds too complicated, I have made some examples to show where I am stuck.
library(tidyverse)
df <- tibble(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34))
#First I wanted the total length of each variable, irrespective of it occurring in var1 or var2
# I think that I figured this out. Let me know if it's a bit convoluted
var_unique <- unique(c(unique(df$var1),unique(df$var2)))
names(var_unique) <- var_unique
total_length <- map_df(var_unique, function(x){
df %>%
filter( var1 == x | var2 == x )%>%
summarise(var_total_length = sum(Length))
},.id = "var" )
total_length
#> # A tibble: 3 x 2
#> var var_total_length
#> <chr> <dbl>
#> 1 A 81
#> 2 B 56
#> 3 D 25
# Second I need the length of each combination of var1 and var2.
#I would like the "A" - "B" should be the same than "B" - "A"
# Grouping does not work in this case. This is where I am stuck
#Neiter this
df %>% group_by(var1,var2) %>%
mutate(combination_length = sum(Length))
#> # A tibble: 5 x 4
#> # Groups: var1, var2 [4]
#> var1 var2 Length combination_length
#> <chr> <chr> <dbl> <dbl>
#> 1 A B 10 44
#> 2 B A 12 12
#> 3 A D 5 5
#> 4 D A 20 20
#> 5 A B 34 44
# nor does that one do the job, because it looks at each individual combination of var1 and var2.
df %>% group_by(var1,var2) %>%
summarise(combination_length = sum(Length))
#> # A tibble: 4 x 3
#> # Groups: var1 [3]
#> var1 var2 combination_length
#> <chr> <chr> <dbl>
#> 1 A B 44
#> 2 A D 5
#> 3 B A 12
#> 4 D A 20
# this is the dataframe that I would like. Rows 1,2 and 5 of df should be considered the
# same group
tibble(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34),
combination_length = c(56,56,25,25,56))
#> # A tibble: 5 x 4
#> var1 var2 Length combination_length
#> <chr> <chr> <dbl> <dbl>
#> 1 A B 10 56
#> 2 B A 12 56
#> 3 A D 5 25
#> 4 D A 20 25
#> 5 A B 34 56
# Ultimately i want to divide each combination by the total length of the variable
# occurring in the combination to obtain the percentage of each boundary for each unique variable
Created on 2019-11-27 by the reprex package (v0.3.0)
I assume there are ways to make this less complex than my attempt.
We can group by the sorted pair of var1 and var2, which can be computed with pmin and pmax:
library(dplyr)
df %>%
group_by(group1 = pmin(var1, var2), group2 = pmax(var1, var2)) %>%
mutate(combination_length = sum(Length)) %>%
ungroup() %>%
select(-group1, -group2)
# var1 var2 Length combination_length
# <chr> <chr> <dbl> <dbl>
#1 A B 10 56
#2 B A 12 56
#3 A D 5 25
#4 D A 20 25
#5 A B 34 56
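To finish the calculation hinted at in the question (the share of each boundary in an element's total length), one possible sketch joins this result with the total_length table computed in the question; the column name pct_of_var1 is made up here, and the same join could be repeated for var2:
library(dplyr)
df %>%
  group_by(group1 = pmin(var1, var2), group2 = pmax(var1, var2)) %>%
  mutate(combination_length = sum(Length)) %>%
  ungroup() %>%
  select(-group1, -group2) %>%
  left_join(total_length, by = c("var1" = "var")) %>%          # bring in each element's total length
  mutate(pct_of_var1 = 100 * combination_length / var_total_length)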
Here is a base R solution using split(), where it is assumed that df is a data frame, i.e.,
df <- data.frame(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34))
then, using the following code
sp <- data.frame(t(apply(df[1:2], 1, sort)))
v <- split(df,sp)
res <- unsplit(lapply(v, function(x) data.frame(x,combination_length = sum(x[3]))),sp)
gives
> res
var1 var2 Length combination_length
1 A B 10 56
2 B A 12 56
3 A D 5 25
4 D A 20 25
5 A B 34 56

Resources