How to concatenate strings after str_split - r

Given this data frame
column_1 column_2
A w,x
B z
C q,r,s
My desired output would be
"Aw", "Ax", "Bz", "Cq", "Cr", "Cs"
I've tried
paste0(df$column_1, strsplit(df$column_2, ","))
But the output is
"Ac(\"w\", \"x\")" "Bz" "Cc(\"q\", \"r\", \"s\")"

We can split column_2 on "," and paste them with column_1 using mapply
unlist(mapply(paste0, df$column_1, strsplit(df$column_2, ","), USE.NAMES = FALSE))
#[1] "Aw" "Ax" "Bz" "Cq" "Cr" "Cs"

We can replicate the 'column_1' by the lengths of list output from strsplit and then do the paste
lst1 <- strsplit(df$column_2, ",")
paste0(rep(df$column_1, lengths(lst1)), unlist(lst1))
#[1] "Aw" "Ax" "Bz" "Cq" "Cr" "Cs"
NOTE: The above is a vectorized approach as we are not looping through the list
Or use stack to create a two column data.frame from list and then paste
do.call(paste0, stack(setNames(lst1, df$column_1))[2:1])
#[1] "Aw" "Ax" "Bz" "Cq" "Cr" "Cs"
Stacking into a two-column data.frame may be a bit less efficient than the first approach.
Or with tidyverse, split 'column_2' into long format with separate_rows, then unite the two columns and pull the result out as a vector
library(tidyverse)
df %>%
  separate_rows(column_2) %>%
  unite(newcol, column_1, column_2, sep="") %>%
  pull(newcol)
#[1] "Aw" "Ax" "Bz" "Cq" "Cr" "Cs"
The issue in the OP's approach is that the strsplit output is a list of vectors. We need either a function that loops over the list (lapply/sapply/vapply), or to unlist the list into a vector while replicating 'column_1' so that both vectors have the same length when pasting.
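To see why the original paste0 call misbehaves, it can help to inspect what strsplit returns. The following is only an illustrative sketch using the question's data; the names `parts` and `res` are introduced here and are not part of the original answer.

```r
# Same data as in the question
df <- data.frame(column_1 = c("A", "B", "C"),
                 column_2 = c("w,x", "z", "q,r,s"),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$column_2, ",")  # 'parts' is an illustrative name

is.list(parts)   # TRUE: one character vector per row
lengths(parts)   # 2 1 3: number of pieces per row

# paste0() coerces each list element to a single string,
# which is where text such as c("w", "x") in the output comes from
as.character(parts)

# Repeating column_1 once per piece lines the two vectors up
res <- paste0(rep(df$column_1, lengths(parts)), unlist(parts))
res
```

The key point is that `rep(..., lengths(parts))` and `unlist(parts)` always have the same length, so paste0 can work element by element.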
data
df <- structure(list(column_1 = c("A", "B", "C"), column_2 = c("w,x",
"z", "q,r,s")), class = "data.frame", row.names = c(NA, -3L))

This can also be achieved with the code below, although it is not very idiomatic.
df <- data.frame(column_1 = c("A", "B", "C"), column_2 = c("w,x", "z", "q,r,s"))
l_vals <- strsplit(as.character(df$column_2), split = ",", perl = TRUE)
l_append <- list()
# paste each column_1 value onto its own split pieces
for (i in seq_along(l_vals)) {
  l_append <- c(l_append, paste0(df$column_1[i], l_vals[[i]]))
}
unlist(l_append)
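If an explicit loop is preferred, one possible variation (a sketch, not the answerer's code) is to preallocate the result list rather than growing it with c() on every iteration. The names `pieces`, `out` and `res` are assumptions made for this illustration.

```r
df <- data.frame(column_1 = c("A", "B", "C"),
                 column_2 = c("w,x", "z", "q,r,s"),
                 stringsAsFactors = FALSE)

pieces <- strsplit(df$column_2, ",")

# Preallocate one slot per row instead of growing the list inside the loop
out <- vector("list", length(pieces))
for (i in seq_along(pieces)) {
  out[[i]] <- paste0(df$column_1[i], pieces[[i]])
}

res <- unlist(out)
res
```

For a data frame this small the difference is negligible; preallocation mainly matters when there are many rows.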

Related

In R, how to get the results of combn() from a matrix into a vector without losing data? [duplicate]

I know that combn() can give me all the unique combinations across a vector. However, it gives me a matrix output, and I want it as a vector. Wrapping the output in as.vector() makes every value individual, losing the purpose of running combn() in the first place. Imagine my dataset was c("a", "b", "c"), how can I use combn() (or some other function), to get a vector where my output would be:
my_data <- c("a", "b", "c")
#natural output with combn()
combn(my_data, 2, simplify = TRUE)
#output with as.vector() wrapped
as.vector(combn(my_data, 2, simplify = TRUE))
#desired output
answer <- c("ab", "ac", "bc") #I don't care about order
You can paste each column of the result together using apply
my_data <- c("a", "b", "c")
apply(combn(my_data, 2), 2, paste, collapse = "")
#> [1] "ab" "ac" "bc"
Created on 2023-01-25 with reprex v2.0.2
We can use combn as
combn(my_data, 2, FUN = paste, collapse = "")
[1] "ab" "ac" "bc"
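The reason as.vector() loses the pairing is that combn() returns a matrix with one combination per column, so flattening it simply strings the columns together. A small sketch with a four-element vector (chosen here for illustration, not taken from the question) shows both the matrix shape and the FUN-based shortcut:

```r
my_data <- c("a", "b", "c", "d")   # illustrative input

# One combination per column: 2 rows (elements per pair), 6 columns (pairs)
m <- combn(my_data, 2)
dim(m)

# Collapsing each column, either with apply() or directly through FUN
via_apply <- apply(m, 2, paste, collapse = "")
via_fun   <- combn(my_data, 2, FUN = paste, collapse = "")

via_fun
```

Both routes give the same character vector; passing FUN simply avoids building the intermediate matrix yourself.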

Non consecutive combinations of array elements in R

I want to generate all the possible combinations of nonadjacent elements in an array.
For example:
array_a <- c("A","B","C")
The possible combinations would be: AC and CA
How can I implement this in R?
If nonadjacent elements are defined as elements with distance greater than one in absolute values, then one option could be:
mat <- which(as.matrix(dist(seq_along(array_a))) > 1, arr.ind = TRUE)
paste0(array_a[mat[, 1]], array_a[mat[, 2]])
[1] "CA" "DA" "EA" "DB" "EB" "AC" "EC" "AD" "BD" "AE" "BE" "CE"
Sample data:
array_a <- c("A", "B", "C", "D", "E")
We can use outer
c(outer(array_a, array_a, FUN = paste, sep=""))
Or if we want to omit alternate elements
outer(array_a[c(TRUE, FALSE)], array_a[c(TRUE, FALSE)], FUN = paste, sep="")
Or using crossing
library(dplyr)
library(tidyr)
crossing(v1 = array_a[c(TRUE, FALSE)],
v2 = array_a[c(TRUE, FALSE)]) %>%
filter(v1 != v2) %>%
unite(v1, v1, v2, sep="") %>%
pull(v1)
#[1] "AC" "CA"
NOTE: The question does not make clear what counts as non-adjacent elements, so this answer is based on a different assumption (taking every other element).
Another base R option using expand.grid + subset
inds <- subset(expand.grid(seq_along(array_a), seq_along(array_a)), abs(Var1 - Var2) > 1)
paste0(array_a[inds$Var1],array_a[inds$Var2])
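The index-distance definition can be wrapped in a small helper. This is a sketch under the assumption that "nonadjacent" means positions at least two apart; `nonadjacent_pairs` and its `min_gap` argument are hypothetical names introduced here.

```r
# Hypothetical helper: ordered pairs of elements whose positions
# differ by at least 'min_gap' (2 means "not next to each other")
nonadjacent_pairs <- function(x, min_gap = 2) {
  idx  <- expand.grid(i = seq_along(x), j = seq_along(x))
  keep <- abs(idx$i - idx$j) >= min_gap
  paste0(x[idx$i[keep]], x[idx$j[keep]])
}

small <- nonadjacent_pairs(c("A", "B", "C"))
small

large <- nonadjacent_pairs(c("A", "B", "C", "D", "E"))
length(large)
```

For the three-element example from the question this yields only the A/C pair in both orders, and for five elements it yields twelve pairs, in line with the outputs shown in the answers above.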
@tmfmnk's solution is very neat. Still, I want to add something of my own.
I use the arrangements package for permutations without repetition.
array_a <- c("A", "B", "C", "D", "E")
# neighbouring pairs to remove from the permutations
vec = paste0(array_a[-1], head(array_a, -1))
cc = apply(arrangements::permutations(array_a, 2, replace = FALSE), 1, function(x) paste0(x, collapse = ""))
setdiff(cc, c(vec, stringi::stri_reverse(vec)))
#[1] "AC" "AD" "AE" "BD" "BE" "CA" "CE" "DA" "DB" "EA" "EB" "EC"

How to apply the same function to several variables in R?

I know that similar questions have already been asked (e.g. Passing list element names as a variable to functions within lapply or R - iteratively apply a function of a list of variables), but I couldn't manage to find a solution for my problem based on these posts.
I have an event dataset (~100 variables, >2000 observations) that contains variables with information on the involved actors. One variable can only contain one actor, so if several actors have been involved in the event, they are spread over several variables (e.g. actor1, actor2, ...). These actors can be classified into two groups ("s" and "nons"). For later use, I need two lists of actors: one that contains all actors of the category "s" and one that contains all actors of "nons". "s" only consists of three actors while "nons" consists of dozens of actors.
# create example data
library(dplyr)
df <- data.frame(id = c(1:8),
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"))
df <-
  df %>%
  mutate(actor1 = as.character(actor1),
         actor2 = as.character(actor2))
Since the script I am about to prepare is supposed to be used on updated versions of the dataset in the future, I would like to automate as much as possible and keep the parts of the script that would need to be adapted as limited as possible. My idea was to create one function per category that extracts the actors of the respective category (e.g. "nons") from one variable (e.g. actor1) in a list and then "loop" this function over the other variables (ideally with the apply family).
I know which category each actor belongs to ("A", "B", and "C" are category "s"), which allows me to define a separation rule as used in the function below (the filter command).
# create function
nons_function <- function(col) {
col_ <- enquo(col)
nons_list <-
df %>%
filter(!is.na(!!col_), !!col_ != "A", !!col_ != "B", !!col_ != "C") %>%
distinct(!!col_) %>%
pull()
nons_list
}
# create list of variables to "loop" over
actorlist <- c("actor1", "actor2")
This results in the following. Instead of two lists of actors I get a list that contains the variable names as character strings.
> lapply(actorlist, nons_function)
[[1]]
[1] "actor1"
[[2]]
[1] "actor2"
What I would like to get is something like the following:
> lapply(actorlist, nons_function)
[[1]]
[1] "D" "E" "F" "G" "H"
[[2]]
[1] "E" "I" "D" "G"
The problem is probably the way I am passing the variable names to my function within lapply. Apparently, my function is not able to use a character input as a variable name. However, I have not found a way either to adapt my function so that it accepts character input, or to provide it with a list of variables to loop over in a form it can digest.
Any help appreciated!
EDIT: Initially I had named the actors in a misleading way (the actor names indicated which category an actor belongs to), which led to answers that do not really help in my case. I have now changed the actor names from "s1", "s2", "nons1", "nons2" etc. to "A", "B", "C" etc.
Here is an option using base R.
For nons-actors:
lapply( df[, 2:3], function(x) grep( "^nons", x, value = TRUE ) )
#$actor1
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#
#$actor2
#[1] "nons2" "nons6" "nons1" "nons4"
and for s-actors:
lapply( df[, 2:3], function(x) grep( "^s", x, value = TRUE ) )
# $actor1
# [1] "s1" "s2"
#
# $actor2
# [1] "s1" "s2" "s3"
Here is an option
library(dplyr)
library(stringr)
library(purrr)
map(actorlist, ~ df %>%
      select(all_of(.x)) %>%
      filter(!str_detect(!! rlang::sym(.x), "^s\\d+$")) %>%
      pull(1))
#[[1]]
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#[[2]]
#[1] "nons2" "nons6" "nons1" "nons4"
It can be wrapped as a function as well. Note that the input is a string, so instead of enquo, use sym to convert it to a symbol and then evaluate it (!!)
f1 <- function(dat, colNm) {
  dat %>%
    select(all_of(colNm)) %>%
    filter(!str_detect(!! rlang::sym(colNm), "^s\\d+$")) %>%
    pull(1) %>%
    unique()
}
map(actorlist, f1, dat = df)
NOTE: This can be done more easily, but here we are using similar code from the OP's post
Another option is to use split with grepl in base R and that returns a list of both 'nons' and 's' after removing the NAs
lapply(df[2:3], function(x) {
x1 <- x[!is.na(x)]
split(x1, grepl("nons", x1))})
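With the renamed actors from the question's edit ("A", "B" and "C" being category "s"), the prefix-based patterns above no longer apply. A minimal base R sketch, assuming the category is defined by an explicit lookup vector (called `s_actors` here, a name introduced for illustration), could look like this:

```r
# Example data from the edited question
df <- data.frame(id = 1:8,
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
                 stringsAsFactors = FALSE)

s_actors  <- c("A", "B", "C")        # assumed membership of category "s"
actorlist <- c("actor1", "actor2")   # columns to loop over, given as strings

# Indexing with df[actorlist] lets lapply() work from character column names
nons <- lapply(df[actorlist], function(x) unique(x[!is.na(x) & !x %in% s_actors]))
s    <- lapply(df[actorlist], function(x) unique(x[x %in% s_actors]))

nons
s
```

Only `s_actors` and `actorlist` would need updating for a new version of the dataset, which seems close to what the question asks for.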
Check my solution and see if it works for you.
require("dplyr")
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("s1", "s2", "nons1", "nons2", "nons3", "nons4", "nons5", NA),
actor2 = c("s1", NA, "s2", "s3", "nons2", "nons6", "nons1", "nons4"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
# Function for getting the category
category_function <- function(col,categ){
if(categ == "non"){
outp = grep("^non",col,value = T)
}else{
outp = grep("^s",col,value = T)
}
return(outp)
}
# Apply the function to all variables whose name starts with "actor"
sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
My output was the following:
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
$actor1
[1] "nons1" "nons2" "nons3" "nons4" "nons5"
$actor2
[1] "nons2" "nons6" "nons1" "nons4"
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
$actor1
[1] "s1" "s2"
$actor2
[1] "s1" "s2" "s3"

unlist to produce a vector of same length

I have a list like this:
lst <- list(a = c("y"), b = c("A", "B", "C"), c = c("x1", "x2"))
> lst
$a
[1] "y"
$b
[1] "A" "B" "C"
$c
[1] "x1" "x2"
If I unlist it, I get:
> unlist(lst)
a b1 b2 b3 c1 c2
"y" "A" "B" "C" "x1" "x2"
How can I get a vector like:
a b c
"y" "A, B, C" "x1, x2"
Edit:
A similar question, "Convert a list of lists to a character vector", was answered previously. The answer proposed there by @42_, sapply( l, paste0, collapse=""), could be used with a small modification: sapply( l, paste0, collapse=", "). Ronak Shah's answer to my question, sapply(lst, toString), is a little more intuitive.
We can use toString to collapse the elements of each list component into a comma-separated string.
sapply(lst, toString)
#         a         b         c
#       "y" "A, B, C"  "x1, x2"
which is the same as using paste with the collapse argument set to ", "
sapply(lst, paste, collapse = ", ")
You can also do
unlist(Map(function(x) paste0(x,collapse = ","),lst))
Or
unlist(lapply(lst,function(x) paste0(x,collapse = ",")))
Or use purrr package
purrr::map_chr(lst,paste0,collapse = ",")
We can use map_chr with str_c
library(purrr)
library(stringr)
map_chr(lst, str_c, collapse=",")
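The separator is worth checking against the desired output: toString() joins with a comma followed by a space, while collapse = "," leaves no space. A short sketch comparing the two, using vapply() as a type-checked alternative to sapply() (the result names are chosen here for illustration):

```r
lst <- list(a = c("y"), b = c("A", "B", "C"), c = c("x1", "x2"))

# toString() is paste(x, collapse = ", "), so the pieces are separated by ", "
spaced <- vapply(lst, toString, character(1))

# An explicit collapse = "," gives the tighter form without spaces
tight <- vapply(lst, paste, character(1), collapse = ",")

spaced
tight
```

Either way the result is a named character vector with one element per list component, which is the shape asked for in the question.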

Disambiguate non-unique elements in a character vector

Given a vector of non-unique patient initials:
init = c("AA", "AB", "AB", "AB", "AC")
Looking for disambiguation as follows:
init1 = c("AA", "AB01", "AB02", "AB03", "AC")
i.e. unique initials should be left unchanged, non-unique are disambiguated by adding two-digit numbers.
Use the indicated function with ave:
uniquify <- function(x) if (length(x) == 1) x else sprintf("%s%02d", x, seq_along(x))
ave(init, init, FUN = uniquify)
## [1] "AA" "AB01" "AB02" "AB03" "AC"
If the basic requirement is just to ensure unique output, then make.unique(x) or make.unique(x, sep = "0"), as discussed in another answer and a comment, are concise; but if the output must be exactly as in the question, they do not give the same result. With 10 or more duplicates the output of those answers differs even more, whereas the solution here still gives the required result. Here is a further example illustrating 10 or more duplicates.
xx <- rep(c("A", "B", "C"), c(1, 10, 2))
ave(xx, xx, FUN = uniquify)
## [1] "A" "B01" "B02" "B03" "B04" "B05" "B06" "B07" "B08" "B09" "B10" "C01" "C02"
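For comparison, this is what the make.unique variants mentioned above return on the question's input. The sketch only illustrates the difference; the variable names are introduced here.

```r
init <- c("AA", "AB", "AB", "AB", "AC")

# Default separator: the first occurrence is left untouched,
# later ones get ".1", ".2", ...
default_sep <- make.unique(init)
default_sep

# sep = "0" looks closer to the target, but the first duplicate
# still keeps its bare name and numbering starts at the second one
zero_sep <- make.unique(init, sep = "0")
zero_sep
```

Neither result matches "AB01", "AB02", "AB03", which is why the ave-based function numbers every member of a duplicated group.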
The make.unique solution could be rescued like this:

Resources