Disambiguate non-unique elements in a character vector - r

Given a vector of non-unique patient initials:
init = c("AA", "AB", "AB", "AB", "AC")
Looking for disambiguation as follows:
init1 = c("AA", "AB01", "AB02", "AB03", "AC")
i.e. unique initials should be left unchanged, non-unique are disambiguated by adding two-digit numbers.

Use the indicated function with ave:
uniquify <- function(x) if (length(x) == 1) x else sprintf("%s%02d", x, seq_along(x))
ave(init, init, FUN = uniquify)
## [1] "AA" "AB01" "AB02" "AB03" "AC"
If the only requirement is unique output, then make.unique(x) or make.unique(x, sep = "0"), as discussed in another answer and a comment, are concise. They do not, however, reproduce the output requested in the question, because the first occurrence of a duplicated value is left without a suffix. With 10 or more duplicates their output diverges even further, whereas the solution here still follows the two-digit pattern. Here is a further example illustrating 10 or more duplicates.
xx <- rep(c("A", "B", "C"), c(1, 10, 2))
ave(xx, xx, FUN = uniquify)
## [1] "A" "B01" "B02" "B03" "B04" "B05" "B06" "B07" "B08" "B09" "B10" "C01" "C02"
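For comparison, here is a small sketch of what make.unique returns on the question's input. It uses only base R and the init vector from the question; the names res_default and res_zero are made up for this illustration.

```r
init <- c("AA", "AB", "AB", "AB", "AC")

# default separator: the first "AB" keeps its bare name, later ones get .1, .2
res_default <- make.unique(init)
res_default
## [1] "AA" "AB" "AB.1" "AB.2" "AC"

# sep = "0" mimics a two-digit suffix, but only for the 2nd..10th occurrence
res_zero <- make.unique(init, sep = "0")
res_zero
## [1] "AA" "AB" "AB01" "AB02" "AC"
```

In both cases the first "AB" is left unchanged, which is why the result differs from the output requested in the question.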
The make.unique solution could be rescued like this:

Related

In R, how to get the results of combn() from a matrix into a vector without losing data? [duplicate]

I know that combn() can give me all the unique combinations across a vector. However, it returns a matrix, and I want a vector. Wrapping the output in as.vector() turns every value into a separate element, which defeats the purpose of running combn() in the first place. Suppose my data is c("a", "b", "c"). How can I use combn() (or some other function) to get the vector shown as the desired output below?
my_data <- c("a", "b", "c")
#natural output with combn()
combn(my_data, 2, simplify = TRUE)
#output with as.vector() wrapped
as.vector(combn(my_data, 2, simplify = TRUE))
#desired output
answer <- c("ab", "ac", "bc") #I don't care about order
You can paste each column of the result together using apply
my_data <- c("a", "b", "c")
apply(combn(my_data, 2), 2, paste, collapse = "")
#> [1] "ab" "ac" "bc"
Created on 2023-01-25 with reprex v2.0.2
We can also pass the function to combn directly through its FUN argument:
combn(my_data, 2, FUN = paste, collapse = "")
[1] "ab" "ac" "bc"
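The FUN argument is applied to each combination in turn, so the same pattern extends to larger subsets. A minimal sketch, assuming a four-element input (more_data and triples are names made up for this illustration):

```r
more_data <- c("a", "b", "c", "d")

# each 3-element combination is passed to paste(); collapse = "" is forwarded to it
triples <- combn(more_data, 3, FUN = paste, collapse = "")
triples
## [1] "abc" "abd" "acd" "bcd"
```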

How to order vectors with priority layout?

Let's consider the following vector of strings:
x <- c("B", "C_small", "A", "B_big", "C", "A_huge", "D", "A_big", "B_tremendous")
As you can see, some strings in this vector start with the same letter, e.g. "B" and "B_big".
What I want to end up with is a vector ordered so that all strings with the same start are next to each other, while the order of the letters stays as in the original (so "B" comes first, "C" second, and so on). Let me give an example to clarify:
In simple words, I want to end up with the vector:
"B", "B_big", "B_tremendous", "C_small", "C", "A", "A_huge", "A_big", "D"
What I did to get this vector: reading from the left, I see "B", so I look for all other strings that start the same way and put them to the right of "B". Then comes "C", so I look through the remaining strings and put all those starting with "C", e.g. "C_small", next to it, and so on.
I'm not sure how to do it. I'm fairly sure that gsub can be used to get this result, but I don't know how to combine it with the searching and reordering. Could you please give me a hand?
Here's one option:
x <- c("B", "C_small", "A", "B_big", "C", "A_huge", "D", "A_big", "B_tremendous")
xorder <- unique(substr(x, 1, 1))
xnew <- c()
for (letter in xorder) {
  # append every string whose first character matches this letter
  xnew <- c(xnew, x[substr(x, 1, 1) == letter])
}
xnew
[1] "B" "B_big" "B_tremendous" "C_small" "C"
[6] "A" "A_huge" "A_big" "D"
Use the "prefix" as factor levels and then order:
sx = substr(x, 1, 1)
x[order(factor(sx, levels = unique(sx)))]
# [1] "B" "B_big" "B_tremendous" "C_small" "C" "A" "A_huge" "A_big" "D"
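To see why the factor trick works, the intermediate values can be inspected. This is only an illustrative walk-through of the answer above; codes and idx are names introduced here.

```r
x <- c("B", "C_small", "A", "B_big", "C", "A_huge", "D", "A_big", "B_tremendous")
sx <- substr(x, 1, 1)

# integer codes follow the order of first appearance: B = 1, C = 2, A = 3, D = 4
codes <- as.integer(factor(sx, levels = unique(sx)))
codes
## [1] 1 2 3 1 2 3 4 3 1

# order() keeps tied elements in their original relative order
idx <- order(codes)
idx
## [1] 1 4 9 2 5 3 6 8 7

# match() yields the same codes without building a factor
identical(codes, match(sx, unique(sx)))
## [1] TRUE
```

Because ties keep their original order, strings sharing a prefix stay in the order in which they appeared in x.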
If you are open to non-base alternatives, data.table::chgroup may be used, which "groups together duplicated values but retains the group order (according the first appearance order of each group), efficiently":
x[data.table::chgroup(substr(x, 1, 1))]
# [1] "B" "B_big" "B_tremendous" "C_small" "C" "A" "A_huge" "A_big" "D"
I suggest splitting the two parts of the text into separate dimensions. Then, define a clear rank order for the descriptive part of the name using a named character vector. From there you can reorder the input vector on the fly. Bundled as a function:
x <- c("B", "C_small", "A", "B_big", "C", "A_huge", "D", "A_big", "B_tremendous")
sorter <- function(x) {
  # separate the two parts
  prefix <- sub("_.*$", "", x)
  # inputs without an underscore have no suffix; label them "none"
  suffix <- ifelse(grepl("_", x), sub("^.*_", "", x), "none")
  # map each suffix to a rank ordering
  suffix_order <- c(
    "small" = -1,
    "none" = 0,
    "big" = 1,
    "huge" = 2,
    "tremendous" = 3
  )
  # return input vector,
  # ordered by the prefix and the mapping of suffix to rank
  x[order(prefix, suffix_order[suffix])]
}
sorter(x)
Result
[1] "A" "A_big" "A_huge" "B" "B_big" "B_tremendous" "C_small" "C" "D"
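If the prefixes should keep their order of first appearance (as the question asks) while the suffixes are still ranked within each prefix, the two ideas can be combined. This is a sketch under stated assumptions: the function name sorter_keep_order and the rank values are illustrative, and strings without an underscore are treated as having the suffix "none".

```r
x <- c("B", "C_small", "A", "B_big", "C", "A_huge", "D", "A_big", "B_tremendous")

sorter_keep_order <- function(x) {
  prefix <- sub("_.*$", "", x)
  # strings without an underscore get the placeholder suffix "none"
  suffix <- ifelse(grepl("_", x), sub("^.*_", "", x), "none")
  # assumed ranking of the suffixes (illustrative values)
  suffix_order <- c(small = -1, none = 0, big = 1, huge = 2, tremendous = 3)
  # first key: position of the prefix's first appearance; second key: suffix rank
  x[order(match(prefix, unique(prefix)), suffix_order[suffix])]
}

sorted <- sorter_keep_order(x)
sorted
## [1] "B" "B_big" "B_tremendous" "C_small" "C" "A" "A_big" "A_huge" "D"
```

Note that this ranks "A_big" before "A_huge", whereas the question's expected output keeps them in their order of appearance.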

Non consecutive combinations of array elements in R

I want to generate all the possible combinations of nonadjacent elements in an array.
For example:
array_a <- c("A","B","C")
possible combinations would be : AC and CA
How can I implement this in R?
If nonadjacent elements are defined as elements with distance greater than one in absolute values, then one option could be:
mat <- which(as.matrix(dist(seq_along(array_a))) > 1, arr.ind = TRUE)
paste0(array_a[mat[, 1]], array_a[mat[, 2]])
[1] "CA" "DA" "EA" "DB" "EB" "AC" "EC" "AD" "BD" "AE" "BE" "CE"
Sample data:
array_a <- c("A", "B", "C", "D", "E")
We can use outer
c(outer(array_a, array_a, FUN = paste, sep=""))
Or if we want to omit alternate elements
outer(array_a[c(TRUE, FALSE)], array_a[c(TRUE, FALSE)], FUN = paste, sep="")
Or using crossing
library(dplyr)
library(tidyr)
crossing(v1 = array_a[c(TRUE, FALSE)],
         v2 = array_a[c(TRUE, FALSE)]) %>%
  filter(v1 != v2) %>%
  unite(v1, v1, v2, sep="") %>%
  pull(v1)
#[1] "AC" "CA"
NOTE: The question does not make clear what counts as non-adjacent elements. This answer is based on a different assumption (taking alternate elements).
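If non-adjacent is instead taken to mean an index distance greater than one, the outer approach can be filtered with a logical matrix. A minimal sketch under that assumption, using the three-element input from the question (pairs, far_apart and nonadjacent are names made up here):

```r
array_a <- c("A", "B", "C")
i <- seq_along(array_a)

# all ordered pairs; element [r, c] is array_a[r] followed by array_a[c]
pairs <- outer(array_a, array_a, FUN = paste0)

# TRUE where the two positions are more than one step apart
far_apart <- abs(outer(i, i, FUN = "-")) > 1

nonadjacent <- pairs[far_apart]
nonadjacent
## [1] "CA" "AC"
```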
Another base R option using expand.grid + subset
inds <- subset(expand.grid(seq_along(array_a), seq_along(array_a)), abs(Var1 - Var2) > 1)
paste0(array_a[inds$Var1],array_a[inds$Var2])
The @tmfmnk solution is so cool. Still, I want to add something of my own.
I use the arrangements package for permutations without repetition.
array_a <- c("A", "B", "C", "D", "E")
# adjacent pairs (neighbours) to remove from the permutations
vec = paste0(array_a[-1], head(array_a, -1))
cc = apply(arrangements::permutations(array_a, 2, replace = FALSE), 1, function(x) paste0(x, collapse = ""))
setdiff(cc, c(vec, stringi::stri_reverse(vec)))
[1] "AC" "AD" "AE" "BD" "BE" "CA" "CE" "DA" "DB" "EA" "EB" "EC"

Replace the same occurrences by multiple strings

Let's say I have a vector with multiple strings:
a<- c('a?cd','ab?cd','abc?')
How can I replace the first "?" by b, the second "?" by c, and the third "?" by d, in order to produce a result like this:
'abcd','abcd','abcd'
Improving the topic with the answer from G. Grothendieck!
In case we have two symbols in the same element that should be replaced by different patterns:
a <- c('espa?a','per? an?n','peque?os')
L <- c('N','U','O','N')
fmt <- gsub("[?]", "%s", a)
g <- cumsum(sequence(nchar(gsub("[^?]", "", a)))==1)
mapply(function(fmt, x) do.call("sprintf", as.list(c(fmt, x))), fmt, split( L, g), USE.NAMES = FALSE)
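The intermediate values make the sprintf approach easier to follow. The walk-through below is a sketch: n_marks and res are names introduced here, and the grouping is built with rep() and a factor rather than cumsum(sequence(...) == 1), which is an assumed equivalent for this input that also keeps an empty group for elements without any "?".

```r
a <- c('espa?a', 'per? an?n', 'peque?os')
L <- c('N', 'U', 'O', 'N')

# every "?" becomes a sprintf placeholder
fmt <- gsub("[?]", "%s", a)
fmt
## [1] "espa%sa" "per%s an%sn" "peque%sos"

# number of "?" in each element
n_marks <- nchar(gsub("[^?]", "", a))
n_marks
## [1] 1 2 1

# element index for each replacement in L
g <- factor(rep(seq_along(a), n_marks), levels = seq_along(a))
as.integer(g)
## [1] 1 2 2 3

# fill each format string with its own group of replacements
res <- mapply(function(fmt, x) do.call(sprintf, as.list(c(fmt, x))),
              fmt, split(L, g), USE.NAMES = FALSE)
res
## [1] "espaNa" "perU anOn" "pequeNos"
```

One caveat of any sprintf-based approach: a literal "%" in the input would be interpreted as a format specifier, so such inputs would need escaping first.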
Apply chartr to each component as follows. Note that head(letters[-1], length(a)) is c("b", "c", "d"). No packages are used.
a<- c('a?cd','ab?cd','abc?') # test input
mapply(chartr, "?", head(letters[-1], length(a)), a, USE.NAMES = FALSE)
## [1] "abcd" "abccd" "abcd"
If what you meant was to check whether any of "a", "b", "c", "d" is missing from each component and, if so, to replace ? with that missing element, then first create a list L of replacements and then apply sub to each component with it. We assume that there are 0 or 1 missing elements in each component and 0 or 1 instances of ? in each component. Again, no packages are used.
L <- lapply(strsplit(a, ""), setdiff, x = letters[1:4])
L[lengths(L) == 0] <- ""
mapply(`sub`, "[?]", L, a, USE.NAMES = FALSE)
## [1] "abcd" "abcd" "abcd"
stringr::str_replace() has vectorized replacement so you can do:
library(stringr)
str_replace(a, "\\?", letters[seq_along(a) + 1])
[1] "abcd" "abccd" "abcd"
You can use str_replace from the stringr package:
library(stringr)
a<- c('a?cd','ab?cd','abc?')
str_replace(a,"[?]",letters[2:4])
[1] "abcd" "abccd" "abcd"
or
str_replace(a, "[?]", c("b", "c", "d"))
[1] "abcd" "abccd" "abcd"

How to apply the same function to several variables in R?

I know that similar questions have already been asked (e.g. Passing list element names as a variable to functions within lapply or R - iteratively apply a function of a list of variables), but I couldn't manage to find a solution for my problem based on these posts.
I have an event dataset (~100 variables, >2000 observations) that contains variables with information on the involved actors. One variable can only contain one actor, so if several actors have been involved in the event, they are spread over several variables (e.g. actor1, actor2, ...). These actors can be classified into two groups ("s" and "nons"). For later use, I need two lists of actors: one that contains all actors of the category "s" and one that contains all actors of "nons". "s" only consists of three actors while "nons" consists of dozens of actors.
# create example data
library(dplyr)
df <- data.frame(id = c(1:8),
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"))
df <-
  df %>%
  mutate(actor1 = as.character(actor1),
         actor2 = as.character(actor2))
Since the script I am about to prepare is supposed to be used on updated versions of the dataset in the future, I would like to automate as much as possible and keep the parts of the script that would need to be adapted as limited as possible. My idea was to create one function per category that extracts the actors of the respective category (e.g. "nons") from one variable (e.g. actor1) in a list and then "loop" this function over the other variables (ideally with the apply family).
I know which category each actor belongs to ("A", "B", and "C" are category "s"), which allows me to define a separation rule as used in the function below (the filter command).
# create function
nons_function <- function(col) {
  col_ <- enquo(col)
  nons_list <-
    df %>%
    filter(!is.na(!!col_), !!col_ != "A", !!col_ != "B", !!col_ != "C") %>%
    distinct(!!col_) %>%
    pull()
  nons_list
}
# create list of variables to "loop" over
actorlist <- c("actor1", "actor2")
This results in the following. Instead of two lists of actors I get a list that contains the variable names as character strings.
> lapply(actorlist, nons_function)
[[1]]
[1] "actor1"
[[2]]
[1] "actor2"
What I would like to get is something like the following:
> lapply(actorlist, nons_function)
[[1]]
[1] "D" "E" "F" "G" "H"
[[2]]
[1] "E" "I" "D" "G"
The problem is probably the way I am passing the variable names to my function within lapply. Apparently, my function is not able to use character input as variable names. However, I have not found a way either to adapt my function so that it accepts character input, or to provide it with a list of variables to loop over in a form it can digest.
Any help appreciated!
EDIT: Initially I had named the actors in a misleading way (the actor names indicated which category an actor belongs to), which led to answers that do not really help in my case. I have now changed the actor names from "s1", "s2", "nons1", "nons2" etc. to "A", "B", "C" etc.
Here is an option using base R.
for nons-actors:
lapply( df[, 2:3], function(x) grep( "^nons", x, value = TRUE ) )
#$actor1
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#
#$actor2
#[1] "nons2" "nons6" "nons1" "nons4"
and for s-actors:
lapply( df[, 2:3], function(x) grep( "^s", x, value = TRUE ) )
# $actor1
# [1] "s1" "s2"
#
# $actor2
# [1] "s1" "s2" "s3"
Here is an option
library(dplyr)
library(stringr)
library(purrr)
map(actorlist, ~ df %>%
  select(all_of(.x)) %>%
  filter(!str_detect(!! rlang::sym(.x), "^s\\d+$")) %>%
  pull(1))
#[[1]]
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#[[2]]
#[1] "nons2" "nons6" "nons1" "nons4"
It can be wrapped as a function as well. Note that the input is a string, so instead of enquo, use sym to convert it to a symbol and then unquote it with !!.
f1 <- function(dat, colNm) {
  dat %>%
    select(all_of(colNm)) %>%
    filter(!str_detect(!! rlang::sym(colNm), "^s\\d+$")) %>%
    pull(1) %>%
    unique()
}
map(actorlist, f1, dat = df)
NOTE: This can be done more easily, but here we are using similar code from the OP's post
Another option is to use split with grepl in base R and that returns a list of both 'nons' and 's' after removing the NAs
lapply(df[2:3], function(x) {
  x1 <- x[!is.na(x)]
  split(x1, grepl("nons", x1))
})
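Since the question was later edited to use actor names such as "A", "B" and "C", prefix-based patterns no longer separate the categories. Below is a minimal base R sketch for the edited data, assuming the members of category "s" are kept in a vector; s_actors and nons_actors are names chosen here for illustration and do not come from the original post.

```r
df <- data.frame(id = 1:8,
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
                 stringsAsFactors = FALSE)
actorlist <- c("actor1", "actor2")

# col is a column name given as a string, so [[ ]] can look it up directly
nons_actors <- function(data, col, s_actors = c("A", "B", "C")) {
  values <- data[[col]]
  unique(values[!is.na(values) & !(values %in% s_actors)])
}

# naming the input makes lapply return a named list
nons_per_column <- lapply(setNames(actorlist, actorlist), nons_actors, data = df)
nons_per_column
## $actor1
## [1] "D" "E" "F" "G" "H"
##
## $actor2
## [1] "E" "I" "D" "G"
```

Looking the column up with [[ ]] avoids tidy evaluation entirely, which is why a plain character vector of column names works with lapply here.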
Check my solution and see if it works for you.
require("dplyr")
# create example data
df <- data.frame(id = c(1:8),
                 actor1 = c("s1", "s2", "nons1", "nons2", "nons3", "nons4", "nons5", NA),
                 actor2 = c("s1", NA, "s2", "s3", "nons2", "nons6", "nons1", "nons4"))
df <-
  df %>%
  mutate(actor1 = as.character(actor1),
         actor2 = as.character(actor2))
# Function for getting the category
category_function <- function(col,categ){
  if(categ == "non"){
    outp = grep("^non",col,value = T)
  }else{
    outp = grep("^s",col,value = T)
  }
  return(outp)
}
# Apply the function to all variables whose name starts with "actor"
sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
My output was the following:
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
$actor1
[1] "nons1" "nons2" "nons3" "nons4" "nons5"
$actor2
[1] "nons2" "nons6" "nons1" "nons4"
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
$actor1
[1] "s1" "s2"
$actor2
[1] "s1" "s2" "s3"
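The question ultimately asks for two overall lists of actors rather than one list per column. A hedged base R sketch for the edited data ("A", "B", "C" as category "s"), pooling all columns whose name starts with "actor"; the names s_actors, actor_cols, s_list and nons_list are assumptions made for this illustration.

```r
df <- data.frame(id = 1:8,
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
                 stringsAsFactors = FALSE)
s_actors <- c("A", "B", "C")

# pick up every actor column automatically, so new columns need no code change
actor_cols <- grep("^actor", names(df), value = TRUE)

# pool all actor columns into one vector, then drop NAs and duplicates
all_actors <- unlist(df[actor_cols], use.names = FALSE)
all_actors <- unique(all_actors[!is.na(all_actors)])

s_list <- all_actors[all_actors %in% s_actors]
nons_list <- all_actors[!(all_actors %in% s_actors)]

s_list
## [1] "A" "B" "C"
nons_list
## [1] "D" "E" "F" "G" "H" "I"
```

Only s_actors would need to be maintained when the dataset is updated, since every actor not listed there falls into the "nons" list.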
