Hashing every row of a tibble

I am using the newly minted dplyr 1.0.0 and the digest package to generate a hash of every row in a tibble.
I am aware of "adding hash to each row using dplyr and digest in R", but I would like to use the revamped rowwise() in dplyr 1.0.0.
See the example below. Does anyone have any idea why it fails? I should be allowed to digest a row whose entries are of different types.
library(dplyr)
library(digest)
df <- tibble(
student_id = letters[1:4],
student_id2 = letters[9:12],
test1 = 10:13,
test2 = 20:23,
test3 = 30:33,
test4 = 40:43
)
df
#> # A tibble: 4 x 6
#> student_id student_id2 test1 test2 test3 test4
#> <chr> <chr> <int> <int> <int> <int>
#> 1 a i 10 20 30 40
#> 2 b j 11 21 31 41
#> 3 c k 12 22 32 42
#> 4 d l 13 23 33 43
dd <- df %>%
rowwise(student_id) %>%
mutate(hash = digest(c_across(everything()))) %>%
ungroup
#> Error: Problem with `mutate()` input `hash`.
#> ✖ Can't combine `student_id2` <character> and `test1` <integer>.
#> ℹ Input `hash` is `digest(c_across(everything()))`.
#> ℹ The error occurred in row 1.
### but digest should not care too much about the type of the input
Created on 2020-06-04 by the reprex package (v0.3.0)
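(A quick check of that claim, for what it's worth: digest() on its own has no problem with mixed types.)
library(digest)
digest(list("a", 1L))  # works: digest() hashes arbitrary R objects, mixed-type lists included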

The error comes from c_across(), not digest(): c_across() combines the selected values of the current row into a single vector, which requires a common type, and <character> cannot be combined with <integer>. One option is to first convert all the columns to a single type and then do the rowwise operation:
library(dplyr)
library(digest)
df %>%
mutate(across(everything(), as.character)) %>%
rowwise %>%
mutate(hash = digest(c_across(everything())))
# A tibble: 4 x 7
# Rowwise:
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 a i 10 20 30 40 2638067de6dcfb3d58b83a83e0cd3089
#2 b j 11 21 31 41 21162fc0c528a6550b53c87ca0c2805e
#3 c k 12 22 32 42 8d7539eacff61efbd567b6100227523b
#4 d l 13 23 33 43 9739997605aa39620ce50e96f1ff4f70
Another option is to unite() the columns into a single one and then digest that column:
library(tidyr)
df %>%
unite(new, everything(), remove = FALSE) %>%
rowwise %>%
mutate(hash = digest(new)) %>%
select(-new)
# A tibble: 4 x 7
# Rowwise:
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <int> <int> <int> <int> <chr>
#1 a i 10 20 30 40 a9e4cafdfbc88f17b7593dfd684eb2a1
#2 b j 11 21 31 41 a67a5df8186972285bd7be59e6fdab38
#3 c k 12 22 32 42 9c20bd87a50642631278b3e6d28ecf68
#4 d l 13 23 33 43 3f4f373d1969dcf0c8f542023a258225
Another option is pmap(), where we concatenate each row's elements into a single vector; the integers get converted to character because a vector can hold only a single class:
library(purrr)
df %>%
mutate(hash = pmap_chr(., ~ digest(c(...))))
# A tibble: 4 x 7
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <int> <int> <int> <int> <chr>
#1 a i 10 20 30 40 f0fb4100907570ef9bda073b78dc44a6
#2 b j 11 21 31 41 754b09e8d4d854aa5e40aa88d1edfc66
#3 c k 12 22 32 42 5f3a699caff833e900fd956232cf61dd
#4 d l 13 23 33 43 4d31c65284e5db36c37461126a9eb63c
The advantage here is that we are not changing the column types of the original data.
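If changing column types is undesirable, a further option (a sketch, assuming dplyr 1.0.x, where cur_data() returns the current row as a one-row tibble; in later dplyr versions pick(everything()) plays the same role) is to digest the whole row object, which preserves every column's type:
library(dplyr)
library(digest)
df %>%
  rowwise() %>%
  mutate(hash = digest(cur_data())) %>%  # hash the current row as a one-row tibble
  ungroup()
Note these hashes will differ from the ones above, since a different object (a one-row tibble rather than a character vector) is being digested.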

Related

dplyr concatenate strings by group - row by row

I need to concatenate strings by group in dplyr, but the resulting column should account only for the current and previous rows, not the following ones.
I want my data to look like this:
ID   message   messages_used
1    53        53
1    54        53,54
1    55        53,54,55
2    53        53
2    58        53,58
Is it achievable using dplyr only?
You can use Reduce(..., accumulate = TRUE) from base R:
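(Assuming df holds the data implied by the desired output above, e.g.:)
df <- tibble::tibble(ID = c(1, 1, 1, 2, 2), message = c(53, 54, 55, 53, 58))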
library(dplyr)
df %>%
group_by(ID) %>%
mutate(messages_used = Reduce(\(x, y) paste(x, y, sep = ", "), message, accumulate = TRUE)) %>%
ungroup()
# # A tibble: 5 x 3
# ID message messages_used
# <int> <int> <chr>
# 1 1 53 53
# 2 1 54 53, 54
# 3 1 55 53, 54, 55
# 4 2 53 53
# 5 2 58 53, 58
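To see what the accumulate = TRUE part does on its own, here it is on a bare vector (values taken from the example):
Reduce(function(x, y) paste(x, y, sep = ", "), c(53, 54, 55), accumulate = TRUE)
# [1] "53"         "53, 54"     "53, 54, 55"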
We can use dplyr::group_by() and purrr::accumulate():
dat <- data.frame(ID = c(1,1,1,2,2), message = c(53,54,55,53,58))
library(dplyr)
library(purrr)
dat %>%
group_by(ID) %>%
mutate(message_used = accumulate(message, ~ paste(.x, .y, sep =",")))
#> # A tibble: 5 x 3
#> # Groups: ID [2]
#> ID message message_used
#> <dbl> <dbl> <chr>
#> 1 1 53 53
#> 2 1 54 53,54
#> 3 1 55 53,54,55
#> 4 2 53 53
#> 5 2 58 53,58
Created on 2022-05-11 by the reprex package (v2.0.1)

Combine mutate case_when() for columns that start_with() to replace certain characters

I have a complex data frame that looks like df1
library(tidyverse)
df <- tibble(position=c(100,200,300),
correction=c("62M89S",
"8M1D55M88S",
"1S25M1P36M89S"))
df1 <- df %>%
separate(correction, into = str_c("col", 1:5),
sep = "(?<=\\D)(?=\\d)", fill = "left", remove = FALSE)
df1
#> # A tibble: 3 × 7
#> position correction col1 col2 col3 col4 col5
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 100 62M89S <NA> <NA> <NA> 62M 89S
#> 2 200 8M1D55M88S <NA> 8M 1D 55M 88S
#> 3 300 1S25M1P36M89S 1S 25M 1P 36M 89S
Created on 2022-03-02 by the reprex package (v2.0.1)
I want, for every column that starts_with("col"), to replace the S, M, and D characters with "" (empty string) and the rest (the cells containing P) with 0.
I want my data to look like this
df1
#> # A tibble: 3 × 7
#> position correction col1 col2 col3 col4 col5
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 100 62M89S <NA> <NA> <NA> 62 89
#> 2 200 8M1D55M88S <NA> 8 1 55 88
#> 3 300 1S25M1P36M89S 1 25 0 36 89
Notice that the cell containing P has been converted to zero.
This is my poor effort, for which I am ashamed:
df1 %>%
mutate(across(starts_with("col")),
~case_when(grepl("*M") | grepl("*S") | grepl("*D") ~ "",
TRUE ~ 0))
Here is one possibility using case_when and grepl:
df1 %>%
mutate(
across(starts_with("col"),~case_when(
is.na(.) ~ NA_real_,
grepl("[SMD]$", .) ~ parse_number(.),
TRUE ~ 0
)
))
# A tibble: 3 x 7
position correction col1 col2 col3 col4 col5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 62M89S NA NA NA 62 89
2 200 8M1D55M88S NA 8 1 55 88
3 300 1S25M1P36M89S 1 25 0 36 89
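As a side note, parse_number() comes from readr (attached with the tidyverse) and extracts the numeric part of a string, which is why the trailing letters disappear, for example:
parse_number("62M")
#> [1] 62
This also means the col* columns come back as doubles here, whereas the question's desired output shows character columns; wrap the expression in as.character() if the original types matter.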
Please find below another solution using the map_df() function from the purrr library and the str_replace() function from stringr:
Reprex
Code
library(tidyverse)
df1 %>% select(starts_with("col")) %>%
map_df(., str_replace, ".P", "0") %>%
map_df(., str_replace, "\\D$", "") %>%
bind_cols(df1 %>% select(-starts_with("col")),.)
Output
#> # A tibble: 3 x 7
#> position correction col1 col2 col3 col4 col5
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 100 62M89S <NA> <NA> <NA> 62 89
#> 2 200 8M1D55M88S <NA> 8 1 55 88
#> 3 300 1S25M1P36M89S 1 25 0 36 89
Created on 2022-03-02 by the reprex package (v2.0.1)
df1 %>%
mutate_at(vars(starts_with('col')),
~ case_when(
grepl('[SMD]$', .x) ~ sub('[SMD]', '', .x),
grepl('P$' , .x) ~ '0',
TRUE ~ .x)
)
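A vectorized alternative (a sketch using stringr's str_replace_all() with a named vector of patterns, applied with across() to the col* columns, so no rowwise case_when() is needed):
library(tidyverse)
df1 %>%
  mutate(across(starts_with("col"),
                ~ str_replace_all(.x, c("\\d+P" = "0",     # cells like "1P" become "0"
                                        "[SMD]$" = ""))))  # drop the trailing S, M, or D
Unlike the case_when()/parse_number() approach, this keeps the columns as character, matching the desired output in the question, and leaves NA cells untouched.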

mutate or summarise across rows by variable containing string

I'd like to create a new data table containing the sums across rows of variables whose names contain a given string. As a noob, I have been trying to keep this within the tidyverse using the new dplyr across(). Help much appreciated.
dat<- data.frame("Image" = c(1,2,3,4),
"A" = c(1,2,3,4),
"A:B"= c(5,6,7,8),
"A:B:C"= c(9,10,11,12))
The goal is to obtain the sums across rows of the variables whose names contain "A", "B", or "C":
datsums<- data.frame("Image" = c(1,2,3,4),
"Asum"= c(15,18,21,24),
"Bsum"=c(14,16,18,20),
"Csum"=c(9,10,11,12))
I have been unsuccessful using the newer dplyr verbs:
datsums<- dat %>% summarise(across(str_detect("A")), sum, .names ="Asum",
across(str_detect("B")), sum, .names="Bsum",
across(str_detect("C")), sum, .names"Csum")
Use rowwise and c_across. Note the ignore.case = FALSE in contains() below: the default is case-insensitive matching, which would also pick up the "a" in Image and inflate Asum:
library(tidyverse)
dat %>%
rowwise() %>%
summarise(
Asum = sum(c_across(contains("A"))),
Bsum = sum(c_across(contains("B"))),
Csum = sum(c_across(contains("C")))
)
Returns:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 3
Asum Bsum Csum
<dbl> <dbl> <dbl>
1 15 14 9
2 18 16 10
3 21 18 11
4 24 20 12
To add columns to the original data.frame, use mutate instead of summarise:
dat %>%
rowwise() %>%
mutate(
Asum = sum(c_across(contains("A"))),
Bsum = sum(c_across(contains("B"))),
Csum = sum(c_across(contains("C")))
)
# A tibble: 4 x 7
# Rowwise:
Image A A.B A.B.C Asum Bsum Csum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 9 15 14 9
2 2 2 6 10 18 16 10
3 3 3 7 11 21 18 11
4 4 4 8 12 24 20 12
Since you want a row-wise sum, you could use:
library(dplyr)
dat %>%
transmute(Asum = rowSums(select(., contains('A', ignore.case = FALSE))),
Bsum = rowSums(select(., contains('B', ignore.case = FALSE))),
Csum = rowSums(select(., contains('C', ignore.case = FALSE))))
Or, for many variables, use:
cols <- c('A', 'B', 'C')
purrr::map_dfc(cols, ~dat %>%
transmute(!!paste0(.x, 'sum') :=
rowSums(select(., contains(.x, ignore.case = FALSE)))))
# Asum Bsum Csum
#1 15 14 9
#2 18 16 10
#3 21 18 11
#4 24 20 12
Use pivot_longer and pivot_wider:
library(tidyverse)
dat %>%
pivot_longer(-Image) %>%
separate_rows(name, sep = "\\.") %>%
pivot_wider(Image,
names_from = name,
values_from = value,
values_fn = sum,
names_prefix = "sum")
#> # A tibble: 4 x 4
#> Image sumA sumB sumC
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 15 14 9
#> 2 2 18 16 10
#> 3 3 21 18 11
#> 4 4 24 20 12
Created on 2020-12-07 by the reprex package (v0.3.0)
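For completeness, here is a base R sketch of the same row sums (assuming, as in the outputs above, that data.frame() turned the names into A, A.B, and A.B.C):
cols <- c("A", "B", "C")
sums <- sapply(cols, function(x) rowSums(dat[grepl(x, names(dat), fixed = TRUE)]))
colnames(sums) <- paste0(cols, "sum")
cbind(dat["Image"], sums)
#   Image Asum Bsum Csum
# 1     1   15   14    9
# 2     2   18   16   10
# 3     3   21   18   11
# 4     4   24   20   12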

Why does my dplyr percentile calculation not work with tidy evaluation?

I have a tibble with student test data, and I wish to convert these to percentiles using dplyr. For the sake of having a minimal example, imagine the following setup of three students.
require(tidyverse)
tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15))
The following code works and yields the desired output.
tbl %>% mutate(TestPercentile = cume_dist(Test) * 100)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 33.3
3 Cat 15 66.7
However, I actually want to do it programmatically because there are many such columns.
colname <- "Test"
percname <- str_c(colname, "Percentile")
tbl %>% mutate({{percname}} := cume_dist({{colname}}) * 100)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 100
3 Cat 15 100
Why does cume_dist make the percentile 100 for all students when I try to use tidy evaluation like this? (And ideally, if I can be permitted a second question, how can I fix it?)
If by programmatically you mean you want to write your own function, you can do it like this:
calculate_percentile <- function(data, colname) {
data %>%
mutate("{{colname}}Percentile" := cume_dist({{colname}} * 100))
}
tbl %>%
calculate_percentile(Test)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 33.3
3 Cat 15 66.7
Edit for multiple columns
New Data
tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15), Test_math = c(16, 30, 55), Test_music = c(3, 78, 34))
calculate_percentile <- function(data, colnames) {
data %>%
mutate(across({{colnames}}, ~cume_dist(.) * 100, .names = "{col}Percentile"))
}
test_columns <- c("Test_math", "Test_music")
tbl %>%
calculate_percentile(test_columns)
# A tibble: 3 x 6
Name Test Test_math Test_music Test_mathPercentile Test_musicPercentile
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 33.3 33.3
2 Bob 13 30 78 66.7 100
3 Cat 15 55 34 100 66.7
Why does your solution not work? Because your solution applies cume_dist literally to the string "Test":
tbl %>% mutate({{percname}} := print({{colname}}))
[1] "Test"
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <chr>
1 Alice 16 16 3 Test
2 Bob 13 30 78 Test
3 Cat 15 55 34 Test
Why does this give a TestPercentile value of 100? Because cume_dist of "Test" is 1:
cume_dist("Test")
#[1] 1
So we need to tell R not to evaluate the string "Test" itself but to look up the variable with that name, which we can do like this:
tbl %>% mutate({{percname}} := cume_dist(!!parse_quo(colname, env = global_env())) * 100)
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 100
2 Bob 13 30 78 33.3
3 Cat 15 55 34 66.7
#Check that this uses the values of "Test" and not the string "Test" itself:
tbl %>% mutate({{percname}} := print(!!parse_quo(colname, env = global_env())))
[1] 16 13 15
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 16
2 Bob 13 30 78 13
3 Cat 15 55 34 15
Passing the column name as a string:
library(dplyr)
library(rlang)
return_percentile <- function(data, colname) {
percname <- paste0(colname, "Percentile")
data %>% mutate({{percname}} := cume_dist(!!sym(colname)) * 100)
}
tbl %>% return_percentile("Test")
# A tibble: 3 x 3
# Name Test TestPercentile
# <chr> <dbl> <dbl>
#1 Alice 16 100
#2 Bob 13 33.3
#3 Cat 15 66.7
Passing the column name unquoted:
return_percentile <- function(data, colname) {
percname <- paste0(deparse(substitute(colname)), "Percentile")
data %>% mutate({{percname}} := cume_dist({{colname}}) * 100)
}
tbl %>% return_percentile(Test)
# A tibble: 3 x 3
# Name Test TestPercentile
# <chr> <dbl> <dbl>
#1 Alice 16 100
#2 Bob 13 33.3
#3 Cat 15 66.7
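Another way to pass the column name as a string (a sketch using the .data pronoun together with glue-style name injection, both available in dplyr 1.0+):
library(dplyr)
colname <- "Test"
percname <- paste0(colname, "Percentile")
tbl %>% mutate("{percname}" := cume_dist(.data[[colname]]) * 100)
# Gives the same TestPercentile column (100, 33.3, 66.7) without sym() or parse_quo().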

Grouping based on two variables, including their individual combinations (e.g. A - B is the same as B - A)

I got stuck on a coding problem at work. I have a data frame with three variables: var1, var2, and Length. The latter is the shared length between var1 and var2, e.g. a boundary.
Ultimately I want to calculate the percentage of each combination of var1 - var2 (var2 - var1 is regarded as identical) on the total length of each unique element in either var1 or var2. Because this sounds too complicated, I have made some examples to show where I am stuck.
library(tidyverse)
df <- tibble(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34))
#First I wanted the total length of each variable, irrespective of it occurring in var1 or var2
# I think that I figured this out. Let me know if it's a bit convoluted
var_unique <- unique(c(unique(df$var1),unique(df$var2)))
names(var_unique) <- var_unique
total_length <- map_df(var_unique, function(x){
df %>%
filter( var1 == x | var2 == x )%>%
summarise(var_total_length = sum(Length))
},.id = "var" )
total_length
#> # A tibble: 3 x 2
#> var var_total_length
#> <chr> <dbl>
#> 1 A 81
#> 2 B 56
#> 3 D 25
# Second I need the length of each combination of var1 and var2.
#I would like the "A" - "B" should be the same than "B" - "A"
# Grouping does not work in this case. This is where I am stuck
#Neiter this
df %>% group_by(var1,var2) %>%
mutate(combination_length = sum(Length))
#> # A tibble: 5 x 4
#> # Groups: var1, var2 [4]
#> var1 var2 Length combination_length
#> <chr> <chr> <dbl> <dbl>
#> 1 A B 10 44
#> 2 B A 12 12
#> 3 A D 5 5
#> 4 D A 20 20
#> 5 A B 34 44
# nor does that one do the job, because it looks at each individual combination of var1 and var2.
df %>% group_by(var1,var2) %>%
summarise(combination_length = sum(Length))
#> # A tibble: 4 x 3
#> # Groups: var1 [3]
#> var1 var2 combination_length
#> <chr> <chr> <dbl>
#> 1 A B 44
#> 2 A D 5
#> 3 B A 12
#> 4 D A 20
# this is the dataframe that I would like. Rows 1,2 and 5 of df should be considered the
# same group
tibble(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34),
combination_length = c(56,56,25,25,56))
#> # A tibble: 5 x 4
#> var1 var2 Length combination_length
#> <chr> <chr> <dbl> <dbl>
#> 1 A B 10 56
#> 2 B A 12 56
#> 3 A D 5 25
#> 4 D A 20 25
#> 5 A B 34 56
# Ultimately i want to divide each combination by the total length of the variable
# occurring in the combination to obtain the percentage of each boundary for each unique variable
Created on 2019-11-27 by the reprex package (v0.3.0)
I assume there are ways to make this less complex than my attempt.
We can group by the sorted pair of var1 and var2, which can be computed with pmin and pmax:
library(dplyr)
df %>%
group_by(group1 = pmin(var1, var2), group2 = pmax(var1, var2)) %>%
mutate(combination_length = sum(Length)) %>%
ungroup() %>%
select(-group1, -group2)
# var1 var2 Length combination_length
# <chr> <chr> <dbl> <dbl>
#1 A B 10 56
#2 B A 12 56
#3 A D 5 25
#4 D A 20 25
#5 A B 34 56
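To finish the calculation hinted at in the question (the share of each boundary in an element's total length), one possible sketch joins this result with the total_length table computed in the question; the column name pct_of_var1 is made up here, and the same join could be repeated for var2:
library(dplyr)
df %>%
  group_by(group1 = pmin(var1, var2), group2 = pmax(var1, var2)) %>%
  mutate(combination_length = sum(Length)) %>%
  ungroup() %>%
  select(-group1, -group2) %>%
  left_join(total_length, by = c("var1" = "var")) %>%          # bring in each element's total length
  mutate(pct_of_var1 = 100 * combination_length / var_total_length)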
Here is a base R solution using split(), where it is assumed that df is a data frame, i.e.,
df <- data.frame(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34))
then, using the following code
sp <- data.frame(t(apply(df[1:2], 1, sort)))
v <- split(df,sp)
res <- unsplit(lapply(v, function(x) data.frame(x,combination_length = sum(x[3]))),sp)
gives
> res
var1 var2 Length combination_length
1 A B 10 56
2 B A 12 56
3 A D 5 25
4 D A 20 25
5 A B 34 56

Resources