In R, how to get the results of combn() from a matrix into a vector without losing data? [duplicate]

I know that combn() can give me all the unique combinations across a vector. However, it gives me a matrix output, and I want it as a vector. Wrapping the output in as.vector() makes every value individual, losing the purpose of running combn() in the first place. Imagine my dataset was c("a", "b", "c"), how can I use combn() (or some other function), to get a vector where my output would be:
my_data <- c("a", "b", "c")
#natural output with combn()
combn(my_data, 2, simplify = TRUE)
#output with as.vector() wrapped
as.vector(combn(my_data, 2, simplify = TRUE))
#desired output
answer <- c("ab", "ac", "bc") #I don't care about order

You can paste each column of the result together using apply:
my_data <- c("a", "b", "c")
apply(combn(my_data, 2), 2, paste, collapse = "")
#> [1] "ab" "ac" "bc"
Created on 2023-01-25 with reprex v2.0.2

We can pass the pasting function to combn directly through its FUN argument:
combn(my_data, 2, FUN = paste, collapse = "")
[1] "ab" "ac" "bc"
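
To see why wrapping the result in as.vector() loses the pairing, it may help to look at the shape of what combn() returns. The following is only a small sketch; the variable names m and pairs are made up for illustration.

```r
my_data <- c("a", "b", "c")

# combn() returns a matrix with one combination per column
m <- combn(my_data, 2)
dim(m)        # 2 rows (elements per combination), 3 columns (combinations)

# as.vector() simply flattens that matrix column by column
as.vector(m)  # "a" "b" "a" "c" "b" "c"

# pasting within each column keeps every combination together
pairs <- apply(m, 2, paste, collapse = "")
pairs         # "ab" "ac" "bc"
```

Passing FUN = paste to combn() does the same pasting step while the combinations are generated, so the intermediate matrix is never needed.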


R: Is there a method in R to substitute the values of a vector using a dictionary (2-column data frame with old and new values)?

Is there a method in R to substitute the values of a vector using a dictionary (a 2-column data frame with old and new values)?
The only method I know is to extract the old values into a data frame and merge it with what I call the dictionary (a two-column data frame with old and new values), then reassign the new values to the original old values. However, when using merge (at least since R v4.1), the order of the x values is not maintained, so I am now using join, which keeps the original order of data frame x intact. I think there must be an easier way; I just have not found it. I hope this is understandable, and I appreciate any help.
cheers Hermann
You could use a named character vector as a dict for replacement by splicing it with !!! inside dplyr::recode. If your "dict" is stored as a two-column data frame, tidyr::deframe is handy for converting it to a named vector.
library(tidyverse)
x <- c("a", "b", "c")
dict <- tribble(
~old, ~new,
"a", "d",
"b", "e",
"c", "f"
)
recode(x, !!!deframe(dict))
#> [1] "d" "e" "f"
Created on 2021-06-14 by the reprex package (v1.0.0)
You can use match to substitute the values of a vector using a dictionary:
D$new[match(x, D$old)]
#[1] "d" "e" "f"
You can also use the names to get the new values:
L <- setNames(D$new, D$old)
L[x]
#"d" "e" "f"
Data:
x <- c("a", "b", "c")
D <- data.frame(old = c("a", "b", "c"), new = c("d", "e", "f"))
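
One thing to watch with the match approach is that values missing from the dictionary come back as NA. The sketch below shows one way to keep such values unchanged instead; the input with "z" and the names idx and out are assumptions made up for this illustration.

```r
# example input where "z" has no entry in the dictionary (hypothetical data)
x <- c("a", "b", "z")
D <- data.frame(old = c("a", "b", "c"), new = c("d", "e", "f"),
                stringsAsFactors = FALSE)

# position of each x in the dictionary, NA where there is no match
idx <- match(x, D$old)

# use the new value where a match exists, otherwise keep the original
out <- ifelse(is.na(idx), x, D$new[idx])
out  # "d" "e" "z"
```

Because match returns positions in the order of x, the result keeps the original order of the input vector, which avoids the reordering issue seen with merge.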

How to apply the same function to several variables in R?

I know that similar questions have already been asked (e.g. Passing list element names as a variable to functions within lapply or R - iteratively apply a function of a list of variables), but I couldn't manage to find a solution for my problem based on these posts.
I have an event dataset (~100 variables, >2000 observations) that contains variables with information on the involved actors. One variable can only contain one actor, so if several actors have been involved in the event, they are spread over several variables (e.g. actor1, actor2, ...). These actors can be classified into two groups ("s" and "nons"). For later use, I need two lists of actors: one that contains all actors of the category "s" and one that contains all actors of "nons". "s" only consists of three actors while "nons" consists of dozens of actors.
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
Since the script I am about to prepare is supposed to be used on updated versions of the dataset in the future, I would like to automate as much as possible and keep the parts of the script that would need to be adapted as limited as possible. My idea was to create one function per category that extracts the actors of the respective category (e.g. "nons") from one variable (e.g. actor1) in a list and then "loop" this function over the other variables (ideally with the apply family).
I know which category each actor belongs to ("A", "B", and "C" are category "s"), which allows me to define a separation rule as used in the function below (the filter command).
# create function
nons_function <- function(col) {
col_ <- enquo(col)
nons_list <-
df %>%
filter(!is.na(!!col_), !!col_ != "A", !!col_ != "B", !!col_ != "C") %>%
distinct(!!col_) %>%
pull()
nons_list
}
# create list of variables to "loop" over
actorlist <- c("actor1", "actor2")
This results in the following. Instead of two lists of actors I get a list that contains the variable names as character strings.
> lapply(actorlist, nons_function)
[[1]]
[1] "actor1"
[[2]]
[1] "actor2"
What I would like to get is something like the following:
> lapply(actorlist, nons_function)
[[1]]
[1] "D" "E" "F" "G" "H"
[[2]]
[1] "E" "I" "D" "G"
The problem is probably the way I am passing the variable names to my function within lapply. Apparently, my function is not able to use a character input as a variable name. However, I have not found a way either to adapt my function so that it accepts character input or to provide my function with a list of variables to loop over in a form it can digest.
Any help appreciated!
EDIT: Initially I had named the actors in a misleading way (the actor names indicated which category an actor belongs to), which led to answers that do not really help in my case. I have now changed the actor names from "s1", "s2", "nons1", "nons2" etc. to "A", "B", "C" etc.
Here is an option using base R.
for nons-actors:
lapply( df[, 2:3], function(x) grep( "^nons", x, value = TRUE ) )
#$actor1
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#
#$actor2
#[1] "nons2" "nons6" "nons1" "nons4"
and for s-actors:
lapply( df[, 2:3], function(x) grep( "^s", x, value = TRUE ) )
# $actor1
# [1] "s1" "s2"
#
# $actor2
# [1] "s1" "s2" "s3"
Here is an option using purrr and stringr:
library(dplyr)
library(stringr)
library(purrr)
map(actorlist, ~ df %>%
select(all_of(.x)) %>%
filter(!str_detect(!! rlang::sym(.x), "^s\\d+$")) %>%
pull(1))
#[[1]]
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#[[2]]
#[1] "nons2" "nons6" "nons1" "nons4"
It can be wrapped as a function as well. Note that the input is a string, so instead of enquo, use sym to convert it to a symbol and then unquote it with !!.
f1 <- function(dat, colNm) {
dat %>%
select(all_of(colNm)) %>%
filter(!str_detect(!! rlang::sym(colNm), "^s\\d+$")) %>%
pull(1) %>%
unique
}
map(actorlist, f1, dat = df)
NOTE: This can be done more easily, but here we are using similar code from the OP's post
Another option is to use split with grepl in base R and that returns a list of both 'nons' and 's' after removing the NAs
lapply(df[2:3], function(x) {
x1 <- x[!is.na(x)]
split(x1, grepl("nons", x1))})
Check my solution and see if it works for you.
require("dplyr")
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("s1", "s2", "nons1", "nons2", "nons3", "nons4", "nons5", NA),
actor2 = c("s1", NA, "s2", "s3", "nons2", "nons6", "nons1", "nons4"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
# Function for getting the category
category_function <- function(col,categ){
if(categ == "non"){
outp = grep("^non",col,value = T)
}else{
outp = grep("^s",col,value = T)
}
return(outp)
}
# Apply the function to all variables whose name starts with "actor"
sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
My output was the following:
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
$actor1
[1] "nons1" "nons2" "nons3" "nons4" "nons5"
$actor2
[1] "nons2" "nons6" "nons1" "nons4"
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
$actor1
[1] "s1" "s2"
$actor2
[1] "s1" "s2" "s3"
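
Since the question was edited so that actor names no longer encode their category, a pattern match on the names does not apply to the revised data. A minimal base R sketch for that case, assuming the "s" actors are listed explicitly (s_actors, actor_cols and nons_list are names made up here), could look like this:

```r
# revised example data from the question
df <- data.frame(id = 1:8,
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
                 stringsAsFactors = FALSE)

# the only part that needs maintaining: the known "s" actors
s_actors <- c("A", "B", "C")

# pick up every column whose name starts with "actor"
actor_cols <- grep("^actor", names(df), value = TRUE)

# per column: drop NAs and "s" actors, keep each remaining actor once
nons_list <- lapply(df[actor_cols], function(x) {
  unique(x[!is.na(x) & !(x %in% s_actors)])
})
nons_list
```

Looping over df[actor_cols] passes the column contents rather than the column names, which sidesteps the quoting problem described in the question.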

Joining lists into a vector [duplicate]

I want to create the following vector using a, b, c repeating each letter thrice:
BB<-c("a","a","a","b","b","b","c","c","c")
This is my code:
Alphabet<-c("a","b","c")
AA<-list()
for(i in 1:3){
AA[[i]]<-rep(Alphabet[i],each=3)
}
BB<-do.call(rbind,AA)
But I am getting a matrix instead of a vector:
dput(BB)
structure(c("a", "b", "c", "a", "b", "c", "a", "b", "c"), .Dim = c(3L,
3L))
What am I doing wrong?
As Akrun mentioned, we can use the rep function directly.
First create a vector that holds the letters:
A <- c("A","B","C")
Then apply rep to that vector, using the each argument:
AA <- rep(A,each=3)
print(AA)
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
You should use the c function to concatenate, not rbind. This will give you a vector.
Alphabet<-c("a","b","c")
AA<-list()
for(i in 1:3){
AA[[i]]<-rep(Alphabet[i],each=3)
}
BB<-do.call(c,AA)
Akrun's comment is also correct, if that's what you want.
You can also concatenate the rep function like so:
BB <- c(rep("a", 3), rep("b", 3), rep("c", 3))
Here is a solution, but note that this form of appending is not efficient for large inputs:
Alphabet <- c("a","b","c")
bb <- c()
for (i in seq_along(Alphabet)) {
bb <- c(bb, rep(Alphabet[i], 3))
}
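
To illustrate the difference between the approaches above, here is a short sketch that builds the same list without a loop and then combines it in two ways. The names AA, m and v are only illustrative.

```r
Alphabet <- c("a", "b", "c")

# one list element per letter, each repeated three times
AA <- lapply(Alphabet, rep, times = 3)

# rbind stacks the list elements as rows, which yields a 3 x 3 matrix
m <- do.call(rbind, AA)
is.matrix(m)  # TRUE

# unlist (or do.call(c, AA)) concatenates them into a flat vector
v <- unlist(AA)
v  # "a" "a" "a" "b" "b" "b" "c" "c" "c"

# the same result in one call, without any list
identical(v, rep(Alphabet, each = 3))  # TRUE
```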

R: how to find the intersection of a subset of vectors in a list

I have a list of vectors (characters). For example:
my_list <- list(c("a", "b", "c"),
c("a", "b", "c", "d"),
c("e", "d"))
For the intersection of all these three vectors, I could use: Reduce(intersect, my_list). But as you can see, there is no common element in all three vectors.
Then, what if I want to find the common element that appears "at least" a certain amount of times in the list? Such as: somefunction(my_list, time=2) would give me c("a", "b", "c", "d") because those elements appear two times.
Thanks.
We can convert this to a data.table and group by value to pick the elements that occur in exactly two of the vectors (use >= 2 in place of == 2 for "at least two"):
library(data.table)
setDT(stack(setNames(my_list, seq_along(my_list))))[,
if(uniqueN(ind)==2) values , values]$values
#[1] "a" "b" "c" "d"
A base R option would be to unlist the 'my_list', find the frequency count with the replicated sequence of 'my_list' using table, get the column sums, check whether it is equal to 2 and use that index to subset the names.
tblCount <- colSums(table(rep(seq_along(my_list), lengths(my_list)), unlist(my_list)))
names(tblCount)[tblCount==2]
#[1] "a" "b" "c" "d"
If you assume that each element will appear no more than once in a vector, you can "unlist" your vectors and count the frequency.
Here, using dplyr functions
library(dplyr)
my_list %>% unlist %>% tibble(v = .) %>% count(v) %>% filter(n >= 2) %>% .[["v"]]
Or base functions
subset(as.data.frame(table(unlist(my_list))), Freq>=2)$Var1
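The following computes the intersection across all vectors with purrr, which is equivalent to Reduce(intersect, my_list). For the example list it returns character(0), so it does not cover the "at least n times" case: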
my_list %>%
purrr::map(~ .) %>%
purrr::reduce(.f = dplyr::intersect, .x = .)
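
The counting idea from the answers above can be wrapped into the kind of helper the question asks for. This is a minimal base R sketch; the function name at_least and its argument names are hypothetical, and unique() is applied per vector so that repeats inside one vector are not counted twice.

```r
my_list <- list(c("a", "b", "c"),
                c("a", "b", "c", "d"),
                c("e", "d"))

# return the elements that occur in at least `times` vectors of `lst`
at_least <- function(lst, times = 2) {
  # deduplicate within each vector, then count across vectors
  counts <- table(unlist(lapply(lst, unique)))
  names(counts)[counts >= times]
}

at_least(my_list, times = 2)  # "a" "b" "c" "d"
at_least(my_list, times = 3)  # character(0)
```

Setting times to the length of the list gives the same elements as Reduce(intersect, my_list), though sorted, since table orders its names.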

Disambiguate non-unique elements in a character vector

Given a vector of non-unique patient initials:
init = c("AA", "AB", "AB", "AB", "AC")
Looking for disambiguation as follows:
init1 = c("AA", "AB01", "AB02", "AB03", "AC")
i.e. unique initials should be left unchanged, non-unique are disambiguated by adding two-digit numbers.
Use the indicated function with ave:
uniquify <- function(x) if (length(x) == 1) x else sprintf("%s%02d", x, seq_along(x))
ave(init, init, FUN = uniquify)
## [1] "AA" "AB01" "AB02" "AB03" "AC"
If the basic requirement is just to ensure unique output, then make.unique(x) or make.unique(x, sep = "0"), as discussed in another answer and a comment, are concise. If the output must be exactly as in the question, however, they do not give the same result, and with 10 or more duplicates their output diverges even further. The solution here handles that case too, as the following example with 10 or more duplicates illustrates.
xx <- rep(c("A", "B", "C"), c(1, 10, 2))
ave(xx, xx, FUN = uniquify)
## [1] "A" "B01" "B02" "B03" "B04" "B05" "B06" "B07" "B08" "B09" "B10" "C01" "C02"
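
For comparison, this short sketch shows what make.unique returns for the same kind of input, so the differences from the requested format are visible. The variable names u1, u2 and u3 are only illustrative.

```r
init <- c("AA", "AB", "AB", "AB", "AC")

# default separator: the first occurrence is kept as is, later ones get .1, .2, ...
u1 <- make.unique(init)
u1  # "AA" "AB" "AB.1" "AB.2" "AC"

# sep = "0" looks closer, but the first duplicate is still left unnumbered
u2 <- make.unique(init, sep = "0")
u2  # "AA" "AB" "AB01" "AB02" "AC"

# with 10 or more duplicates the suffix is no longer two digits wide
u3 <- make.unique(rep("B", 11), sep = "0")
tail(u3, 1)  # "B010"
```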
The make.unique solution could be rescued like this:
