Find differences in character column in R

I have a dataframe with ICPM codes before and after recoding of an operation.
df1 <- tibble::tribble(~ops, ~opsalt,
"8-915, 5-847.32", "5-847.32, 5-852.f3, 8-915",
"8-915, 5-781.30, 8-919, 5-807.4, 5-800.c1, 5-79b.81", "5-79b.81, 5-800.c1, 5-805.y, 5-807.4, 8-919, 5-781.30, 8-915",
"5-786.1, 5-808.a4, 5-784.1u, 5-783.2d, 5-788.5e", "5-788.5e, 5-783.2d, 5-780.4d, 5-784.7d, 5-784.1u, 5-808.a4, 5-786.1",
"8-915, 5-784.0v, 5-788.5f, 5-788.40, 5-808.b0, 5-786.k, 5-788.60, 5-788.00, 5-786.0, 5-783.2d", "5-788.00, 5-788.60, 5-786.0, 5-786.k, 5-788.40, 5-808.b0, 5-788.5f, 5-781.ad, 5-784.0v, 8-915")
I want to compute two new columns that contain the codes that differ between the two columns.
For the first row the difference between ops and opsalt would be character(0).
The difference between opsalt and ops would be 5-852.f3.
Tried:
df <- df %>% mutate(ops = strsplit(ops,",")) %>%
mutate(opsalt =strsplit(opsalt,","))
df <- df %>% rowwise() %>% mutate(neu_alt = list(setdiff(ops,opsalt))) %>% mutate(alt_neu = list(setdiff(opsalt,ops)))
This didn't work, because I want to compare parts of the respective strings and not the whole string.

It should work if you use ", " in strsplit and df1 in your first mutate call.
library(dplyr)
df1 %>%
mutate(across(c(ops, opsalt), ~ strsplit(.x, ", "))) %>%
rowwise %>%
mutate(neu_alt = list(setdiff(ops, opsalt)),
alt_neu = list(setdiff(opsalt, ops)))
#> # A tibble: 4 x 4
#> # Rowwise:
#> ops opsalt neu_alt alt_neu
#> <list> <list> <list> <list>
#> 1 <chr [2]> <chr [3]> <chr [0]> <chr [1]>
#> 2 <chr [6]> <chr [7]> <chr [0]> <chr [1]>
#> 3 <chr [5]> <chr [7]> <chr [0]> <chr [2]>
#> 4 <chr [10]> <chr [10]> <chr [1]> <chr [1]>
Created on 2022-01-04 by the reprex package (v0.3.0)
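To see why the separator matters, here is a minimal base R sketch using the first row of the question's data. The variable names (bad_ops, good_ops, and so on) are made up for illustration. Splitting on "," alone leaves a leading space on every code after the first, so setdiff treats " 8-915" and "8-915" as different strings.

```r
ops    <- "8-915, 5-847.32"
opsalt <- "5-847.32, 5-852.f3, 8-915"

# Splitting on "," keeps the space that follows each comma
bad_ops    <- strsplit(ops, ",")[[1]]
bad_opsalt <- strsplit(opsalt, ",")[[1]]
bad_diff   <- setdiff(bad_ops, bad_opsalt)   # nothing matches, both codes are returned

# Splitting on ", " yields clean codes that can be compared
good_ops    <- strsplit(ops, ", ")[[1]]
good_opsalt <- strsplit(opsalt, ", ")[[1]]
neu_alt <- setdiff(good_ops, good_opsalt)    # character(0)
alt_neu <- setdiff(good_opsalt, good_ops)    # "5-852.f3"
```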

If you want to keep them as strings, you can try this method. If you intend to run similar operations repeatedly, I suggest retaining the list-columns instead of calling strsplit on the strings each time.
df1 %>%
mutate(
d = mapply(function(...) toString(setdiff(...)),
strsplit(ops, "[ ,]+"), strsplit(opsalt, "[ ,]+"))
)
# # A tibble: 4 x 3
# ops opsalt d
# <chr> <chr> <chr>
# 1 8-915, 5-847.32 5-847.32, 5-852.f3, 8-915 ""
# 2 8-915, 5-781.30, 8-919, 5-807.4, 5-800.c1, 5-79b.81 5-79b.81, 5-800.c1, 5-805.y, 5-807.4, 8-919, 5-781.30, 8-915 ""
# 3 5-786.1, 5-808.a4, 5-784.1u, 5-783.2d, 5-788.5e 5-788.5e, 5-783.2d, 5-780.4d, 5-784.7d, 5-784.1u, 5-808.a4, 5-786.1 ""
# 4 8-915, 5-784.0v, 5-788.5f, 5-788.40, 5-808.b0, 5-786.k, 5-788.60, 5-788.00, 5-786.0, 5-783.2d 5-788.00, 5-788.60, 5-786.0, 5-786.k, 5-788.40, 5-808.b0, 5-788.5f, 5-781.ad, 5-784.0v, 8-915 "5-783.2d"
(I recommend using list-columns, though, as demonstrated in TimTeaFan's answer.)
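If both directions are needed as strings, one possible sketch is to split each column once and reuse the pieces. This uses a small made-up data frame (the second row is hypothetical, not from the question) and a helper named diff_as_string that is introduced here only for illustration.

```r
df1 <- data.frame(
  ops    = c("8-915, 5-847.32", "5-786.1, 5-783.2d"),
  opsalt = c("5-847.32, 5-852.f3, 8-915", "5-786.1, 5-781.ad"),
  stringsAsFactors = FALSE
)

# Split each column once; "[ ,]+" treats any run of commas and spaces as a separator
ops_list    <- strsplit(df1$ops, "[ ,]+")
opsalt_list <- strsplit(df1$opsalt, "[ ,]+")

# Hypothetical helper: elements of x missing from y, collapsed to one string
diff_as_string <- function(x, y) toString(setdiff(x, y))

df1$neu_alt <- mapply(diff_as_string, ops_list, opsalt_list)
df1$alt_neu <- mapply(diff_as_string, opsalt_list, ops_list)
```

An empty difference becomes the empty string "", which is the same convention as the d column above.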

Related

Convert list to string with conditions

I have a dataframe that looks like:
x <- tibble(
experiment_id = rep(c('1a','1b'),each=5),
keystroke = rep(c('a','SHIFT','b','SPACE','e'),2)
)
I know I can concatenate a list into a string using str_c or str_flatten and only keep certain values like below:
> y <- c('b','a','SPACE','d')
> y[y %in% letters]
[1] "b" "a" "d"
But when I try the same thing in a grouped pipe:
x_out <- x %>%
group_by(experiment_id) %>%
mutate(
grp = cumsum(lag(keystroke=='SPACE',default=0))) %>%
group_by(grp, .add=TRUE) %>%
mutate(within_keystrokes = list(keystroke),
within_word = within_keystrokes[within_keystrokes %in% letters]
) %>%
ungroup()
I get the error:
Error: Problem with `mutate()` input `within_word`.
x Input `within_word` can't be recycled to size 2.
ℹ Input `within_word` is `within_keystrokes[within_keystrokes %in% letters]`.
ℹ Input `within_word` must be size 2 or 1, not 0.
ℹ The error occurred in group 1: experiment_id = "1a", grp = 0.
I read this answer and tried using ifelse but still ran into errors.
Any insight into what I'm doing wrong?
EDIT: EXPECTED OUTPUT Sorry for not including this. I would expect the final df to look like:
x <- tibble(
experiment_id = rep(c('1a','1b'),each=5),
keystroke = rep(c('a','SHIFT','b','SPACE','e'),2),
within_keystrokes = list(list('a','SHIFT','b','SPACE'),
list('a','SHIFT','b','SPACE'),
list('a','SHIFT','b','SPACE'),
list('a','SHIFT','b','SPACE'),
'e',
list('a','SHIFT','b','SPACE'),
list('a','SHIFT','b','SPACE'),
list('a','SHIFT','b','SPACE'),
list('a','SHIFT','b','SPACE'),
'e'),
within_word = rep(list('ab','ab','ab','ab','e'),2)
)
You almost solved your issue. You could use
library(dplyr)
library(stringr)
x %>%
group_by(experiment_id, grp = cumsum(lag(keystroke == "SPACE", default = 0))) %>%
mutate(
within_keystrokes = list(keystroke),
within_word = list(str_c(keystroke[keystroke %in% letters], collapse = ""))
)
to get
# A tibble: 10 x 4
experiment_id keystroke within_keystrokes within_word
<chr> <chr> <list> <list>
1 1a a <list [4]> <chr [1]>
2 1a SHIFT <list [4]> <chr [1]>
3 1a b <list [4]> <chr [1]>
4 1a SPACE <list [4]> <chr [1]>
5 1a e <chr [1]> <chr [1]>
6 1b a <list [4]> <chr [1]>
7 1b SHIFT <list [4]> <chr [1]>
8 1b b <list [4]> <chr [1]>
9 1b SPACE <list [4]> <chr [1]>
10 1b e <chr [1]> <chr [1]>
If you don't want within_word to be a list, just remove the list() function.
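The grouping and filtering logic can be checked outside of dplyr. The following is a base R sketch for a single experiment, assuming the same rule as above: a SPACE closes the current word, so the key after it starts a new group. The names grp and words are only illustrative.

```r
keystroke <- c("a", "SHIFT", "b", "SPACE", "e")

# Equivalent of cumsum(lag(keystroke == "SPACE", default = 0)):
# shift the SPACE indicator down by one position, then accumulate
grp <- cumsum(c(0, head(keystroke == "SPACE", -1)))

# Within each group, keep only lowercase letters and paste them together
words <- vapply(
  split(keystroke, grp),
  function(k) paste(k[k %in% letters], collapse = ""),
  character(1)
)
```

The filter is applied to the character vector keystroke itself, not to a list wrapping it, which is the key difference from the attempt in the question.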

Expand tibble of email dataset in R

I have a massive tibble of my email data which looks like the following:
library(dplyr)
emails <- tibble(
from = c('employee.1#xtra.co','employee.5#xtra.co','employee.1#xtra.co',
'employee.3#xtra.co','employee.1#xtra.co'),
to = list(
c('employee.5#xtra.co', 'employee.3xtra.co'),
c('employee.3#xtra.co', 'employee.1#xtra.co'),
c('employee.2#xtra.co'),
c('employee.1#xtra.co'),
c('employee.3#xtra.co','employee.5#xtra.co','employee.6#xtra.co')),
cc = list(
c('employee.2xtra.co', 'employee.4xtra.co', 'employee.6xtra.co'),
c('employee.1xtra.co', 'employee.8xtra.co', 'employee.6xtra.co'),
NA,
c('employee.2xtra.co', 'employee.4xtra.co'),
c('employee.2xtra.co', 'employee.6xtra.co'))
)
emails
# A tibble: 5 x 3
from to cc
<chr> <list> <list>
1 employee.1#xtra.co <chr [2]> <chr [3]>
2 employee.5#xtra.co <chr [2]> <chr [3]>
3 employee.1#xtra.co <chr [1]> <lgl [1]>
4 employee.3#xtra.co <chr [1]> <chr [2]>
5 employee.1#xtra.co <chr [3]> <chr [2]>
I need your help to be able to expand each record for each combination. For example, what I want to achieve for row 1 is:
from to cc
employee.1#xtra.co employee.5#xtra.co employee.2xtra.co
employee.1#xtra.co employee.5#xtra.co employee.4xtra.co
employee.1#xtra.co employee.5#xtra.co employee.6xtra.co
employee.1#xtra.co employee.3xtra.co employee.2xtra.co
employee.1#xtra.co employee.3xtra.co employee.4xtra.co
employee.1#xtra.co employee.3xtra.co employee.6xtra.co
Thank you very much for your time.
We can apply unnest twice.
library(dplyr)
library(tidyr)
emails2 <- emails %>%
unnest(cols = "to") %>%
unnest(cols = "cc")
head(emails2)
# # A tibble: 6 x 3
# from to cc
# <chr> <chr> <chr>
# 1 employee.1#xtra.co employee.5#xtra.co employee.2xtra.co
# 2 employee.1#xtra.co employee.5#xtra.co employee.4xtra.co
# 3 employee.1#xtra.co employee.5#xtra.co employee.6xtra.co
# 4 employee.1#xtra.co employee.3xtra.co employee.2xtra.co
# 5 employee.1#xtra.co employee.3xtra.co employee.4xtra.co
# 6 employee.1#xtra.co employee.3xtra.co employee.6xtra.co
If you have more than two columns to expand, below is one approach. First identify the list columns and store their names in names_target, then use a for loop to apply unnest to each of them in turn.
names_target <- emails %>%
select(where(is.list)) %>%
names()
temp <- emails
for (i in names_target){
temp <- temp %>% unnest(cols = all_of(i))
}
identical(temp, emails2)
# [1] TRUE
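For intuition about what the repeated unnest produces per record, here is a base R sketch for row 1 only. It uses expand.grid rather than tidyr, so it is an illustration of the resulting combinations and not the method used above.

```r
from <- "employee.1#xtra.co"
to   <- c("employee.5#xtra.co", "employee.3xtra.co")
cc   <- c("employee.2xtra.co", "employee.4xtra.co", "employee.6xtra.co")

# expand.grid varies its first argument fastest, so cc is listed first
# to reproduce the row order of the desired output, then columns are reordered
row1 <- expand.grid(cc = cc, to = to, from = from, stringsAsFactors = FALSE)
row1 <- row1[, c("from", "to", "cc")]
```

This gives 2 x 3 = 6 rows, one per combination of to and cc.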

Get df subset based on condition applied to the list item

I would like to loop over a group of items and use each one as a condition for subsetting a dataframe. My code currently throws an error. I want to check whether each item in en_nam1 satisfies en1$attributes == item. If it does, I want to select that row and add it to another dataframe to return. Thank you.
A tibble: 6 x 2
attributes models
<chr> <list>
1 AT01S <chr [2]>
2 AT02S <chr [2]>
3 AGG101 <chr [1]>
4 AGG102 <chr [1]>
5 AGG103 <chr [1]>
6 AGG104 <chr [1]>
en_nam1
[1] "AT01S" "AT02S" "AGG101"
My code:
en_nam1 %>%
map(~subset(en1, en1$attributes == .x))
Expected result:
A tibble: 3 x 2
attributes models
<chr> <list>
1 AT01S <chr [2]>
2 AT02S <chr [2]>
3 AGG101 <chr [1]>
We don't need a loop here. It is more direct with %in%:
library(dplyr)
en1 %>%
filter(attributes %in% en_nam1)
Or use subset in base R:
subset(en1, attributes %in% en_nam1)
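A self-contained sketch of the base R version follows. The question does not show the contents of the models column, so the values below are placeholders invented for illustration.

```r
en1 <- data.frame(
  attributes = c("AT01S", "AT02S", "AGG101", "AGG102", "AGG103", "AGG104"),
  stringsAsFactors = FALSE
)
# Placeholder list column; the real model names are not given in the question
en1$models <- list(c("m1", "m2"), c("m1", "m2"), "m1", "m1", "m1", "m1")

en_nam1 <- c("AT01S", "AT02S", "AGG101")

# %in% tests every row against the whole vector at once, so no loop is needed
res <- subset(en1, attributes %in% en_nam1)
```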

Concatenate two columns with different vector sizes in R [duplicate]

I have a data frame with two columns, a and b, whose rows contain either a single value or a vector of values. I want to combine the two columns so that the values from both end up in a single vector per row. However, when I use the paste function, I am unable to concatenate the values in each row into a single vector.
The following is a reproducible example of this problem:
library(tibble)
library(tidyverse)
data_frame <-
tribble(
~a, ~b,
50, 3,
17, 50,
c("21", "19"), 50,
c("1", "10"), c("50", "51")
)
data_frame %>%
mutate(new_column = paste(a, b))
#> # A tibble: 4 x 3
#> a b new_column
#> <list> <list> <chr>
#> 1 <dbl [1]> <dbl [1]> "50 3"
#> 2 <dbl [1]> <dbl [1]> "17 50"
#> 3 <chr [2]> <dbl [1]> "c(\"21\", \"19\") 50"
#> 4 <chr [2]> <chr [2]> "c(\"1\", \"10\") c(\"50\", \"51\")"
In the new_column column, I want the results to be as following:
c("50", "3")
c("17", "50")
c("21", "19", "50")
c("1", "10", "50", "51")
Is there a way that I can combine the columns a and b to get the result in the above format? Thank you.
To combine two columns you can use c. In base R, you can do this with Map:
data_frame$new_col <- Map(c, data_frame$a, data_frame$b)
Or in tidyverse use map2:
library(dplyr)
library(purrr)
data_frame %>% mutate(new_col = map2(a, b, c))
# A tibble: 4 x 3
# a b new_col
# <list> <list> <list>
#1 <dbl [1]> <dbl [1]> <dbl [2]>
#2 <dbl [1]> <dbl [1]> <dbl [2]>
#3 <chr [2]> <dbl [1]> <chr [3]>
#4 <chr [2]> <chr [2]> <chr [4]>
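Note from the output that rows 1 and 2 stay numeric, because c only coerces to character when at least one input is character. If every row should hold character values, as in the desired output of the question, one option is to convert after combining. A base R sketch on plain lists standing in for the two columns:

```r
a <- list(50, 17, c("21", "19"), c("1", "10"))
b <- list(3, 50, 50, c("50", "51"))

# Element-wise concatenation; the type of each result follows c()'s coercion rules
new_col <- Map(c, a, b)

# Optional extra step (not part of the answer above): force character everywhere
new_col_chr <- lapply(new_col, as.character)
```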

How to deal with lists of lists when the first index represents rows?

How can I convert a list of lists to a data frame, where the first "layer" of lists should be the rows?
myList = list(
list(name="name1",num=20,dogs=list("dog1")),
list(name="name2",num=13,dogs = list()),
list(name="name3",num=5,dogs=list("dog2","dog4"))
)
My first idea was to unlist the elements in the "third layer"
myUnList = sapply(myList,function(x){y=x;y$dogs = unlist(y$dogs);y})
I can create a tibble
tibble(myUnList)
# A tibble: 3 x 1
myUnList
<list>
1 <list [3]>
2 <list [2]>
3 <list [3]>
Note that if myList[[1]] represented the vector of names, it would be simple, but I'm having trouble tidying data that is organized the other way. I thought about using purrr to "invert" the order.
Expected result:
# A tibble: 3 x 3
names num dogs
<list> <list> <list>
1 <chr [1]> <dbl [1]> <list [1]>
2 <chr [1]> <dbl [1]> <list [0]>
3 <chr [1]> <dbl [1]> <list [2]>
Are there other types of data structure that support entries of varying length?
We can extract each list element with the map functions from the purrr package and then build a new tibble with tibble() (data_frame() is deprecated).
library(tidyverse)
dat <- tibble(name = map_chr(myList, "name"),
num = map_dbl(myList, "num"),
dogs = map(myList, "dogs"))
dat
# # A tibble: 3 x 3
# name num dogs
# <chr> <dbl> <list>
# 1 name1 20.0 <list [1]>
# 2 name2 13.0 <NULL>
# 3 name3 5.00 <list [2]>
And if you prefer everything to be a list column, replace map_chr and map_dbl with map.
dat <- tibble(name = map(myList, "name"),
num = map(myList, "num"),
dogs = map(myList, "dogs"))
dat
# name num dogs
# <list> <list> <list>
# 1 <chr [1]> <dbl [1]> <list [1]>
# 2 <chr [1]> <dbl [1]> <NULL>
# 3 <chr [1]> <dbl [1]> <list [2]>
After some time playing around with purrr, I found another solution that doesn't require typing the names (which could be troublesome for really large lists).
myList %>% transpose() %>% simplify_all() %>% as_tibble()
Results in
# A tibble: 3 x 3
name num dogs
<chr> <dbl> <list>
1 name1 20 <list [1]>
2 name2 13 <list [0]>
3 name3 5 <list [2]>
The transpose function from purrr makes this type of conversion automatically.
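To make the "inversion" concrete, here is a base R sketch of roughly what transposing does for this example. It is not purrr's implementation, and it assumes every inner list has the same field names as the first one.

```r
myList <- list(
  list(name = "name1", num = 20, dogs = list("dog1")),
  list(name = "name2", num = 13, dogs = list()),
  list(name = "name3", num = 5,  dogs = list("dog2", "dog4"))
)

# Field names are taken from the first record (assumed to be shared by all)
fields <- names(myList[[1]])

# For each field, collect that field from every record: rows become columns
columns <- lapply(setNames(fields, fields), function(f) lapply(myList, `[[`, f))

# Fields with exactly one value per record can be flattened to atomic vectors
name_vec <- unlist(columns$name)
num_vec  <- unlist(columns$num)
```

The dogs field has varying lengths per record, so it stays a list, which is what a list column in a tibble holds.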
