Dictionary-like matching on string in R - r

I have a data frame in which a string variable is an informal list of elements that can be split on a symbol. I would like to perform an operation on these elements on the basis of another dataset.
e.g. task: Calculate the sum of the elements
df_1 <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"))
df_2 <- data.frame(groups=c("A","B","C","D"), values=c(1:4))
desired <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"),sum=c(6,5))

One option is to split 'groups' on the delimiter (,) into separate rows with separate_rows, join with the key/value dataset ('df_2'), and then, grouped by 'element', take the sum of 'values':
library(tidyverse)
df_1 %>%
separate_rows(groups) %>%
left_join(df_2) %>%
group_by(element) %>%
summarise(groups = toString(groups), sum = sum(values))
# A tibble: 2 x 3
# element groups sum
# <int> <chr> <int>
#1 1 A, B, C 6
#2 2 A, D 5
Another option, in base R, is to build a named key/value vector ('nm1'), use it to look up the values for each element of the split list, sum them, and assign the result to a new column in 'df_1':
nm1 <- setNames(df_2$values, df_2$groups)
df_1$sum <- sapply(strsplit(as.character(df_1$groups), ","), function(x) sum(nm1[x]))
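The named-vector lookup above can be sketched a little more defensively. This is only an illustration: the name `lookup`, the use of trimws() to tolerate input such as "A, B", and the choice to skip keys missing from 'df_2' via na.rm = TRUE are assumptions, not part of the question.

```r
df_1 <- data.frame(element = 1:2, groups = c("A,B,C", "A,D"))
df_2 <- data.frame(groups = c("A", "B", "C", "D"), values = 1:4)

# Named vector used as a dictionary: names are keys, elements are values
lookup <- setNames(df_2$values, as.character(df_2$groups))

# Split each string on the literal comma
parts <- strsplit(as.character(df_1$groups), ",", fixed = TRUE)

# Look up every key; trimws() tolerates stray spaces, and
# na.rm = TRUE ignores keys that are absent from the lookup (assumption)
df_1$sum <- vapply(parts,
                   function(x) sum(lookup[trimws(x)], na.rm = TRUE),
                   numeric(1))
df_1
```

Whether a missing key should be skipped or should raise an error depends on the data; dropping na.rm = TRUE makes such rows come out as NA instead.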

Related

Generate unique pairwise combinations of a column values based on 2 others columns in a data.table

I have tried a lot of things and I even searched for similar questions but couldn't find
solutions.
Suppose that we have a data.table in R:
ID = c("1","1","1","2","2")
Code = c("A","B","C","B","C")
N = c("3","3","3","2","2")
Basically, I want the unique pairwise combinations of the "Code" column within each group defined by the "ID" and "N" columns.
Is there an R function to return the following?
ID = c("1","1","1","2")
Combinations = c("A.B","B.C","C.A","B.C")
N = c("3","3","3","2")
As I said, I have tried many things, but each attempt gives me a data.table that contains only the combinations. In my final result I need a column for ID and two others for the combinations and N.
Do you have an idea how to do it?
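You may use combn within group_by (note that in dplyr 1.1.0 and later, returning several rows per group from summarise() is deprecated in favor of reframe()):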
library(dplyr)
result <- df %>%
group_by(ID, N) %>%
summarise(Code = combn(Code, 2, paste0, collapse = '.'), .groups = 'drop')
result
# ID N Code
# <chr> <chr> <chr>
#1 1 3 A.B
#2 1 3 A.C
#3 1 3 B.C
#4 2 2 B.C
data
df <- data.frame(ID, Code, N)
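For comparison, here is a rough base R sketch of the same per-group combn idea. The names `key`, `groups` and `pairs` are made up for illustration, and it assumes every ID/N group has at least two codes (combn() errors otherwise).

```r
df <- data.frame(ID   = c("1", "1", "1", "2", "2"),
                 Code = c("A", "B", "C", "B", "C"),
                 N    = c("3", "3", "3", "2", "2"),
                 stringsAsFactors = FALSE)

# One group per ID/N pair, kept in order of first appearance
key <- paste(df$ID, df$N)
groups <- split(df, factor(key, levels = unique(key)))

# For each group, paste every pair of codes together with "."
pairs <- lapply(groups, function(g) {
  data.frame(ID = g$ID[1], N = g$N[1],
             Code = combn(g$Code, 2, paste, collapse = "."),
             stringsAsFactors = FALSE)
})

result <- do.call(rbind, pairs)
rownames(result) <- NULL
result
```

Like the dplyr version, this yields "A.C" rather than "C.A" for the third pair, because combn() keeps the elements in their original order.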

R Subsetting text from a comma separated column in a data-frame

I have a data.frame with a column that looks like that:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to split this column into two columns, one containing all the values that start with F and one containing all the other values, so that the resulting data frame looks like this:
F other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try the following tidyverse approach. You can separate the rows on the comma, create a group according to the pattern, and then reshape to wide format to obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = F)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit to split at the commas. Then use grep to find the indexes of the elements containing F, select or deselect them by multiplying the indexes by 1 or -1, and paste each set back together.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
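One caveat with negative indexing: if a row contains no F code at all, grep() returns integer(0), and indexing with integer(0) * -1 selects nothing instead of everything. A hedged sketch that uses logical masks and works over several rows follows; the second row is invented for illustration, and "starts with F" is an assumed reading of the requirement.

```r
d <- data.frame(diagnosis = c("F.31.2,A.43.2,R.45.2,F.43.1",
                              "A.10.1,B.20.2"),  # hypothetical row without F codes
                stringsAsFactors = FALSE)

# Split every row on the literal comma
codes <- strsplit(d$diagnosis, ",", fixed = TRUE)

# Logical mask per row: TRUE where the code starts with "F"
is_f <- lapply(codes, startsWith, "F")

# Paste the selected and the remaining codes back together, row by row
d$F     <- mapply(function(x, keep) paste(x[keep],  collapse = ","), codes, is_f)
d$other <- mapply(function(x, keep) paste(x[!keep], collapse = ","), codes, is_f)
d
```

Rows without any F code get an empty string in the F column rather than losing their other codes.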

Remove the duplicate based on elements in character vector

I have a data frame like this. Column y contains three or more characters separated by commas (,). I want to remove a row if all of its characters are the same.
x <-c(1,2,3,4,5)
y <-c("a,a,a","a,a,b,c","b,c,a","b,b,b,b","a,b,b,c")
df<-data.frame(x,y)
desired output is
x <-c(2,3,5)
y <-c("a,a,b,c","b,c,a","a,b,b,c")
df<-data.frame(x,y)
You can use separate_rows to split the comma-separated values into different rows, remove the groups that have only one distinct value, and summarise the data again.
library(dplyr)
df %>%
tidyr::separate_rows(y) %>%
group_by(x) %>%
filter(n_distinct(y) > 1) %>%
summarise(y = toString(y))
# x y
# <dbl> <chr>
#1 2 a, a, b, c
#2 3 b, c, a
#3 5 a, b, b, c
In base R :
df[sapply(strsplit(df$y, ','), function(x) length(unique(x))) > 1, ]
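The base R one-liner can be unpacked into steps to make the filter visible. This is just a sketch with the same data; `n_unique` and `kept` are names chosen here for illustration.

```r
df <- data.frame(x = c(1, 2, 3, 4, 5),
                 y = c("a,a,a", "a,a,b,c", "b,c,a", "b,b,b,b", "a,b,b,c"),
                 stringsAsFactors = FALSE)

# Number of distinct characters in each row of y
n_unique <- lengths(lapply(strsplit(df$y, ",", fixed = TRUE), unique))
n_unique

# Keep only rows with more than one distinct character;
# the original strings in y are left unchanged
kept <- df[n_unique > 1, ]
kept
```

Unlike the separate_rows version, this keeps y exactly as it was entered (no added spaces after the commas).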

Sum by aggregating complex paired names in R

In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to work with text (like the "lion_tiger" example above).
We can split the 'pair' column by _, then sort and paste it back, use it in a group by function to get the sum
tapply(as.numeric(as.character(df$value)),
sapply(strsplit(as.character(df$pair), '_'), function(x)
paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
Using tidyverse and purrrlyr
df <- data.frame(name=c("lion_tiger","elephant_lion",
"tiger_lion"),value=c(1,2,3),stringsAsFactors=FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
by_row(.collate = "rows",
..f = function(this_row) {
paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
}) %>%
rename(sorted = ".out") %>%
group_by(sorted) %>%
summarize(sum(value))%>%show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4
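The same sort-then-group idea also works in base R for text IDs, without extra packages. A minimal sketch, assuming each ID is made of parts joined by "_" and that alphabetical order is an acceptable canonical form; the names `key` and `totals` are illustrative only.

```r
df <- data.frame(pair  = c("lion_tiger", "elephant_lion", "tiger_lion"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Canonical key: split on "_", sort the parts, paste them back together
key <- vapply(strsplit(df$pair, "_", fixed = TRUE),
              function(p) paste(sort(p), collapse = "_"),
              character(1))

# Sum the values that share a canonical key
totals <- aggregate(value ~ key,
                    data = data.frame(key, value = df$value,
                                      stringsAsFactors = FALSE),
                    FUN = sum)
totals
```

Here "tiger_lion" and "lion_tiger" both map to the key "lion_tiger", so their values are added together.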

Replicate each row of data.frame when occurrence

I am facing a tricky question and would be glad to have some help.
I have a data frame with an ID column whose values have different structures, something like the following:
ID
bbb-5p/mi-98/6134
abb-4p
bbb-5p/mi-98
Every time an ID contains "/", I would like to duplicate the row, once for each "/" found.
The name of each duplicated row should then be the root plus the characters right after the "/".
For example, this:
ID
bbb-5p/mi-98/6134
should give :
ID
bbb-5p
bbb-5p-mi-98
bbb-5p-6134
Also, my initial data frame has 5 variables:
[ID, varA, varB, varC, varD]
And every time I have this "/" I would like to duplicate the entire row. Then I am expecting to have a new data frame with something like
newID newvarA newvarB newvarC newvarD
bbb-5p varA(1) varB(1) varC(1) varD(1)
bbb-5p-mi-98 varA(1) varB(1) varC(1) varD(1)
bbb-5p-6134 varA(1) varB(1) varC(1) varD(1)
abb-4p varA(2) varB(2) varC(2) varD(2)
bbb-5p varA(3) varB(3) varC(3) varD(3)
bbb-5p-mi-98 varA(3) varB(3) varC(3) varD(3)
Any idea?
Thank you in advance
Peter
You can accomplish this in base R, using lapply() with a custom function. First, you split your character column on "/", resulting in a list of vectors:
l <- strsplit(df$ID,"/")
Then you apply a user defined function to each element of l using lapply():
l_stacked <- lapply(l, function(x)
if(length(x) > 1) {
c(x[1], paste0(x[1],"-",x[-1])) }
else { x })
The function first checks whether the vector has a length > 1. If so, it concatenates all elements with the first element, separated by "-". If length <= 1, it means the string didn't contain "/", hence it is returned as is. Finally we flatten our output using unlist() to be able to convert to data.frame.
data.frame(ID = unlist(l_stacked))
# ID
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98
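The output above only contains the ID column, whereas the question asks for the whole row to be duplicated. One possible way to carry the other columns along is to repeat the row indices as many times as each ID expands. In this sketch the column varA and its values are made up for illustration.

```r
df <- data.frame(ID   = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98"),
                 varA = c(10, 20, 30),  # hypothetical extra column
                 stringsAsFactors = FALSE)

parts <- strsplit(df$ID, "/", fixed = TRUE)

# Root first, then root-suffix for every piece after a "/"
new_id <- lapply(parts, function(x) {
  c(x[1], if (length(x) > 1) paste(x[1], x[-1], sep = "-"))
})

# Repeat each original row once per generated ID, then overwrite ID
out <- df[rep(seq_len(nrow(df)), lengths(new_id)), , drop = FALSE]
out$ID <- unlist(new_id)
rownames(out) <- NULL
out
```

Every other column is copied unchanged into the duplicated rows, which matches the newvarA ... newvarD layout described in the question.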
One way to achieve this is the following:
library(dplyr)
library(tidyr)
res <- df %>% mutate(i=row_number(),
ID = strsplit(ID,split='/')) %>%
unnest(ID) %>%
group_by(i) %>%
mutate(ID=ifelse(ID==first(ID),first(ID),paste(first(ID),ID,sep='-'))) %>%
ungroup() %>% select(-i)
### A tibble: 6 x 1
## ID
## <chr>
##1 bbb-5p
##2 bbb-5p-mi-98
##3 bbb-5p-6134
##4 abb-4p
##5 bbb-5p
##6 bbb-5p-mi-98
Notes:
First, create an indexing column i to group by later so that we can group each "root".
Use strsplit to split each row by "/".
tidyr::unnest the result to separate rows.
group_by the created index i and then if the row is the first row, just return the root; otherwise, paste to prepend the root to the row with separator "-".
Finally, ungroup and remove the created index column i.
Data
df <- structure(list(ID = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98"
)), .Names = "ID", row.names = c(NA, -3L), class = "data.frame")
ID
1 bbb-5p/mi-98/6134
2 abb-4p
3 bbb-5p/mi-98
Here is one option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1, ..)), creating a column of row names ('rn'). Grouped by 'rn', split 'ID' by /, then loop through the sequence of rows and paste the split elements together based on the index.
library(splitstackshape)
library(data.table)
setDT(df1, keep.rownames=TRUE)[, unlist(strsplit(ID, "/")),
by = rn][, .(ID=sapply(seq_len(.N), function(i)
paste(V1[unique(c(1,i))], collapse="-"))) , rn]
Another option uses dplyr/tidyr/tibble. Create the row names column with tibble::rownames_to_column and separate the rows into long format with separate_rows. Then, grouped by 'rn', mutate 'ID' by pasting the elements together when the group has more than one row, and remove the 'rn' column.
library(dplyr)
library(tidyr)
library(tibble)
rownames_to_column(df1, var = "rn") %>%
separate_rows(ID, sep="/") %>%
group_by(rn) %>%
mutate(ID = if(n()>1) c(ID[1], paste(ID[1], ID[-1], sep="-")) else ID) %>%
ungroup() %>%
select(-rn)
# ID
# <chr>
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98
