How to apply the same function to several variables in R?

I know that similar questions have already been asked (e.g. Passing list element names as a variable to functions within lapply or R - iteratively apply a function of a list of variables), but I couldn't manage to find a solution for my problem based on these posts.
I have an event dataset (~100 variables, >2000 observations) that contains variables with information on the involved actors. One variable can only contain one actor, so if several actors have been involved in the event, they are spread over several variables (e.g. actor1, actor2, ...). These actors can be classified into two groups ("s" and "nons"). For later use, I need two lists of actors: one that contains all actors of the category "s" and one that contains all actors of "nons". "s" only consists of three actors while "nons" consists of dozens of actors.
# create example data
library(dplyr)

df <- data.frame(id = 1:8,
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"))
df <-
  df %>%
  mutate(actor1 = as.character(actor1),
         actor2 = as.character(actor2))
Since the script I am about to prepare is supposed to be used on updated versions of the dataset in the future, I would like to automate as much as possible and keep the parts of the script that would need to be adapted as limited as possible. My idea was to create one function per category that extracts the actors of the respective category (e.g. "nons") from one variable (e.g. actor1) in a list and then "loop" this function over the other variables (ideally with the apply family).
I know which category each actor belongs to ("A", "B", and "C" are category "s"), which allows me to define a separation rule as used in the function below (the filter command).
# create function
nons_function <- function(col) {
  col_ <- enquo(col)
  nons_list <-
    df %>%
    filter(!is.na(!!col_), !!col_ != "A", !!col_ != "B", !!col_ != "C") %>%
    distinct(!!col_) %>%
    pull()
  nons_list
}
# create list of variables to "loop" over
actorlist <- c("actor1", "actor2")
This results in the following. Instead of two lists of actors I get a list that contains the variable names as character strings.
> lapply(actorlist, nons_function)
[[1]]
[1] "actor1"
[[2]]
[1] "actor2"
What I would like to get is something like the following:
> lapply(actorlist, nons_function)
[[1]]
[1] "D" "E" "F" "G" "H"
[[2]]
[1] "E" "I" "D" "G"
The problem is probably the way I am passing the variable names to my function within lapply. Apparently, my function is not able to use a character input as variable names. However, I have not found a way either to adapt my function so that it accepts character input or to provide it with a list of variables to loop over in a form it can digest.
Any help appreciated!
EDIT: Initially I had named the actors in a misleading way (the actor names indicated which category an actor belongs to), which led to answers that do not really help in my case. I have now changed the actor names from "s1", "s2", "nons1", "nons2" etc. to "A", "B", "C" etc.

Here is an option using base R.
For nons-actors:
lapply( df[, 2:3], function(x) grep( "^nons", x, value = TRUE ) )
#$actor1
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#
#$actor2
#[1] "nons2" "nons6" "nons1" "nons4"
and for s-actors:
lapply( df[, 2:3], function(x) grep( "^s", x, value = TRUE ) )
# $actor1
# [1] "s1" "s2"
#
# $actor2
# [1] "s1" "s2" "s3"

Here is an option
library(dplyr)
library(stringr)
library(purrr)
map(actorlist, ~ df %>%
      select(all_of(.x)) %>%
      filter(!str_detect(!! rlang::sym(.x), "^s\\d+$")) %>%
      pull(1))
#[[1]]
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#[[2]]
#[1] "nons2" "nons6" "nons1" "nons4"
It can be wrapped in a function as well. Note that the input is a string, so instead of enquo, use sym to convert it to a symbol and then unquote it with !!.
f1 <- function(dat, colNm) {
  dat %>%
    select(all_of(colNm)) %>%
    filter(!str_detect(!! rlang::sym(colNm), "^s\\d+$")) %>%
    pull(1) %>%
    unique()
}
map(actorlist, f1, dat = df)
NOTE: This can be done more easily, but here we are using similar code from the OP's post
Another base R option is to use split with grepl, which returns a list containing both the 'nons' and the 's' actors for each column after removing the NAs:
lapply(df[2:3], function(x) {
x1 <- x[!is.na(x)]
split(x1, grepl("nons", x1))})
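With the renamed actors from the question's edit, the prefix-based patterns above no longer apply, so category membership has to come from an explicit vector. The following is a minimal base R sketch under that assumption: the three "s" actors are "A", "B" and "C" as stated in the question, and the names `s_actors`, `nons_from_column` and `nons_by_column` are hypothetical.

```r
# Example data from the question (after the edit)
df <- data.frame(
  id = 1:8,
  actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
  actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
  stringsAsFactors = FALSE
)

# Assumption: these are the only actors in category "s"
s_actors <- c("A", "B", "C")

# Hypothetical helper: takes the column name as a character string,
# drops NAs and "s" actors, and returns the remaining unique values
nons_from_column <- function(data, col, exclude = s_actors) {
  values <- data[[col]]
  unique(values[!is.na(values) & !(values %in% exclude)])
}

actorlist <- c("actor1", "actor2")

# Naming the input vector gives a named result list
nons_by_column <- lapply(setNames(actorlist, actorlist),
                         nons_from_column, data = df)
nons_by_column
```

Because the column is looked up with `[[`, the function accepts plain strings and no quoting or unquoting is needed. When the set of "s" actors changes, only `s_actors` has to be updated.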

Check my solution and see if it works for you.
require("dplyr")
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("s1", "s2", "nons1", "nons2", "nons3", "nons4", "nons5", NA),
actor2 = c("s1", NA, "s2", "s3", "nons2", "nons6", "nons1", "nons4"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
# Function for getting the category
category_function <- function(col, categ) {
  if (categ == "non") {
    outp <- grep("^non", col, value = TRUE)
  } else {
    outp <- grep("^s", col, value = TRUE)
  }
  return(outp)
}
# Apply the function to all variables whose name starts with "actor"
sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
My output was the following:
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
$actor1
[1] "nons1" "nons2" "nons3" "nons4" "nons5"
$actor2
[1] "nons2" "nons6" "nons1" "nons4"
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
$actor1
[1] "s1" "s2"
$actor2
[1] "s1" "s2" "s3"
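Since the question ultimately asks for one list per category across all actor columns, the per-column results can also be pooled. This is a sketch only, assuming the renamed actors from the question's edit and that "A", "B" and "C" are the "s" actors; `s_actors`, `actor_cols`, `s_list` and `nons_list` are hypothetical names.

```r
# Example data from the question (after the edit)
df <- data.frame(
  id = 1:8,
  actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
  actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
  stringsAsFactors = FALSE
)

# Assumption: these are the only actors in category "s"
s_actors <- c("A", "B", "C")

# Select every column whose name starts with "actor"
actor_cols <- grep("^actor", names(df), value = TRUE)

# Pool all actor columns into one vector, then drop NAs and duplicates
all_actors <- unlist(df[actor_cols], use.names = FALSE)
all_actors <- unique(all_actors[!is.na(all_actors)])

# Split the pooled actors into the two categories
s_list <- intersect(all_actors, s_actors)
nons_list <- setdiff(all_actors, s_actors)
```

New actor columns (actor3, actor4, ...) in an updated dataset are picked up automatically by the `grep` on the column names.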

Related

Speeding up recoding of a character column in R

I have some data where each data point is associated with a character vector of varying length. For example, it might be generated by the following function:
library(tidyverse)
set.seed(27)
generate_keyset <- function(...) {
sample(LETTERS[1:5], size = rpois(n = 1, lambda = 10), replace = TRUE)
}
generate_keyset()
#> [1] "A" "C" "A" "A" "A" "A" "A" "E" "C" "C" "A" "D" "A" "D" "C" "A"
I would like to summarize this keyset by converting it to a single number score. The way this works is straightforward: each key in the keyset has a value, and to get the value of the entire keyset I sum over the values. The key-value map is a tibble with several hundred entries, but you can imagine it looks like:
key_value_map <- tribble(
~key, ~value,
"A", 1,
"B", -2,
"C", 8,
"D", -4,
"E", 0
)
Currently I am scoring keysets with the following function:
score_keyset <- function(keyset) {
merged_keysets_to_map <- data.frame(
key = keyset,
stringsAsFactors = FALSE
) %>%
left_join(key_value_map, by = "key")
sum(merged_keysets_to_map$value)
}
score_keyset(LETTERS[1:4])
#> [1] 3
This works fine, except it is very slow, and I need to do this operation about a million times. For example, I would like the following to be much faster:
n <- 1e4 # in practice I have n = 1e6
fake_data <- tibble(
keyset = map(1:n, generate_keyset)
)
library(tictoc)
tic()
scored_data <- fake_data %>%
mutate(
value = map_dbl(keyset, score_keyset)
)
toc()
I am sure there is a much better way to do this with indexing, but it is escaping me at the moment. Help speeding this up is much appreciated.
Instead of doing a join and then sum, it would be more efficient if we use a named vector to match
library(tibble)
sum(deframe(key_value_map)[generate_keyset()])
Checking the timings, the OP's tic/toc showed 45.728 sec
tic()
v1 <- deframe(key_value_map)
scored_data2 <- fake_data %>%
mutate(
value = map_dbl(keyset, ~ sum(v1[.x]))
)
toc()
#0.952 sec elapsed
identical(scored_data, scored_data2)
#[1] TRUE
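If one function call per row is still too slow at n = 1e6, a further option is to flatten all keysets, look up every key in one step, and sum by row index with `rowsum`. This is a sketch and not part of the timings above; `key_values` and `keysets` are small stand-ins for the real map and data.

```r
# Stand-in for deframe(key_value_map): a named numeric vector
key_values <- c(A = 1, B = -2, C = 8, D = -4, E = 0)

# Stand-in for fake_data$keyset; the last keyset is empty on purpose
keysets <- list(c("A", "B", "C", "D"), c("A", "A", "E"), character(0))

# One group id per key, identifying the keyset it came from
group <- rep(seq_along(keysets), lengths(keysets))

# Look up all keys at once, then sum within each keyset
flat_values <- unname(key_values[unlist(keysets)])
sums <- rowsum(flat_values, group)

# Empty keysets never appear in `group`, so start from 0 and fill in
scores <- numeric(length(keysets))
scores[as.integer(rownames(sums))] <- sums[, 1]
scores
```

This trades the per-row calls for a single lookup and a single grouped sum; whether it is faster on the real data would need to be measured.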

using purrr to extract elements from multiple lists starting with a common letter

I have a list of lists. One element in each list has a name beginning with "n_". How do I extract these elements and store them in a separate list? Can I use a combination of map and starts_with?
E.g.:
m1 <- list(n_age = c(19,40,39),
names = c("a", "b", "c"))
m2 <- list(n_gender = c("m","f","f"),
names = c("f", "t", "d"))
nice_list <- list(m1, m2)
I was hoping that something like the following would work (it doesn't!):
output <- map(nice_list, starts_with("n_"))
You could (ab)use partial matching of $:
map(nice_list, `$`, "n_")
(I don't really recommend it).
(And I can't figure out why lapply(nice_list, `$`, "n_") doesn't work; it gives list(NULL, NULL).)
How about this?
map(nice_list, ~.x[grep("^n_", names(.x))])
#[[1]]
#[[1]]$n_age
#[1] 19 40 39
#
#
#[[2]]
#[[2]]$n_gender
#[1] "m" "f" "f"
Or using starts_with
map(nice_list, ~.x[starts_with("n_", vars = names(.x))])
Or to flatten the nested list, you could do
unlist(map(nice_list, ~.x[grep("^n_", names(.x))]), recursive = FALSE)
#$n_age
#[1] 19 40 39
#
#$n_gender
#[1] "m" "f" "f"
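The same selection can be sketched without purrr by using base R's `startsWith`, which tests the beginning of each name and needs no regular expression. The names `n_elements` and `flat` are hypothetical.

```r
m1 <- list(n_age = c(19, 40, 39),
           names = c("a", "b", "c"))
m2 <- list(n_gender = c("m", "f", "f"),
           names = c("f", "t", "d"))
nice_list <- list(m1, m2)

# Keep only the elements whose name begins with "n_"
n_elements <- lapply(nice_list, function(x) x[startsWith(names(x), "n_")])

# Flatten one level so the result is a single named list
flat <- unlist(n_elements, recursive = FALSE)
names(flat)
```

Unlike an unanchored `grep("n_", ...)`, `startsWith` would not pick up a name such as "mean_n_obs" that merely contains "n_".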

Recoding values with dplyr using a lookup table

Is there a way to use the recode function of dplyr together with a lookup table (data.frame or list)?
What I would like to have would look something like this:
# Recode values with list of named arguments
data <- sample(c("a", "b", "c", "d"), 10, replace = T)
lookup <- list(a = "Apple", b = "Pear")
dplyr::recode(data, lookup)
I found the mapvalues and revalue functions from the plyr package. Combining them is possible as explained here.
However, I am wondering whether something similar is possible with dplyr only.
do.call(dplyr::recode, c(list(data), lookup))
[1] "Pear" "c" "d" "c" "Pear" "Pear" "d" "c" "d" "c"
We can use base R
v1 <- unlist(lookup)[data]
ifelse(is.na(v1), data, v1)
It works like this:
dplyr::recode(data, !!!lookup)
This is also useful with mutate on a data frame or tibble:
df <- tibble(code = data)
df %>%
mutate(fruit = recode(code, !!!lookup))
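The `!!!` operator splices the elements of the list into the call as named arguments, so the spliced call should behave like one where the replacements are written out by hand. A small sketch with fixed input (the vector `codes` is made up for illustration):

```r
library(dplyr)

lookup <- list(a = "Apple", b = "Pear")

# Fixed input instead of sample(), so the result is reproducible
codes <- c("a", "b", "c", "a")

# Splicing the lookup list ...
spliced <- recode(codes, !!!lookup)

# ... is equivalent to spelling out the named replacements
explicit <- recode(codes, a = "Apple", b = "Pear")

spliced
identical(spliced, explicit)
```

Values without an entry in the lookup ("c" here) are left unchanged.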

R: how to find the intersection of a subset of vectors in a list

I have a list of vectors (characters). For example:
my_list <- list(c("a", "b", "c"),
c("a", "b", "c", "d"),
c("e", "d"))
For the intersection of all these three vectors, I could use: Reduce(intersect, my_list). But as you can see, there is no common element in all three vectors.
Then, what if I want to find the common elements that appear at least a certain number of times in the list? For example, somefunction(my_list, time=2) would give me c("a", "b", "c", "d") because those elements appear two times.
Thanks.
We can convert this to a data.table and do the group by action to get the elements
library(data.table)
setDT(stack(setNames(my_list, seq_along(my_list))))[,
if(uniqueN(ind)==2) values , values]$values
#[1] "a" "b" "c" "d"
A base R option would be to unlist the 'my_list', find the frequency count with the replicated sequence of 'my_list' using table, get the column sums, check whether it is equal to 2 and use that index to subset the names.
tblCount <- colSums(table(rep(seq_along(my_list), lengths(my_list)), unlist(my_list)))
names(tblCount)[tblCount==2]
#[1] "a" "b" "c" "d"
If you assume that each element will appear no more than once in a vector, you can "unlist" your vectors and count the frequency.
Here, using dplyr functions
library(dplyr)
my_list %>% unlist %>% tibble(v = .) %>% count(v) %>% filter(n >= 2) %>% .[["v"]]
Or base functions
subset(as.data.frame(table(unlist(my_list))), Freq>=2)$Var1
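This returns the intersection of all vectors, which for this example is empty (the same result as Reduce(intersect, my_list)):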
my_list %>%
purrr::map(~ .) %>%
purrr::reduce(.f = dplyr::intersect, .x = .)
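The counting idea from the answers above can be wrapped into the function the question asks for. This is a sketch in base R; the name `elements_in_at_least` is hypothetical. Each vector is de-duplicated first, so an element that is repeated within a single vector is counted only once.

```r
# Hypothetical helper: elements that occur in at least `times` vectors
elements_in_at_least <- function(vectors, times = 2) {
  # unique() per vector so repeats inside one vector count once
  counts <- table(unlist(lapply(vectors, unique)))
  names(counts)[as.vector(counts >= times)]
}

my_list <- list(c("a", "b", "c"),
                c("a", "b", "c", "d"),
                c("e", "d"))

in_two <- elements_in_at_least(my_list, times = 2)
in_three <- elements_in_at_least(my_list, times = 3)
in_two
```

With `times` equal to the number of vectors, the result matches Reduce(intersect, my_list), which is empty for this example.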

Disambiguate non-unique elements in a character vector

Given a vector of non-unique patient initials:
init = c("AA", "AB", "AB", "AB", "AC")
Looking for disambiguation as follows:
init1 = c("AA", "AB01", "AB02", "AB03", "AC")
i.e. unique initials should be left unchanged, non-unique are disambiguated by adding two-digit numbers.
Use the indicated function with ave:
uniquify <- function(x) if (length(x) == 1) x else sprintf("%s%02d", x, seq_along(x))
ave(init, init, FUN = uniquify)
## [1] "AA" "AB01" "AB02" "AB03" "AC"
If the basic requirement is just to ensure unique output, then make.unique(x) or make.unique(x, sep = "0"), as discussed in another answer and a comment, are concise. If the output must be exactly as in the question, however, they do not give the same result, and with 10 or more duplicates their output differs even more, whereas the solution here still gives the expected answer. Here is a further example illustrating 10 or more duplicates.
xx <- rep(c("A", "B", "C"), c(1, 10, 2))
ave(xx, xx, FUN = uniquify)
## [1] "A" "B01" "B02" "B03" "B04" "B05" "B06" "B07" "B08" "B09" "B10" "C01" "C02"
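For comparison, here is a small sketch of how make.unique with sep = "0" behaves on the input from the question. It leaves the first occurrence of a duplicated value untouched and numbers only the later ones, which is why its output differs from the requested result. The variable names `from_make_unique` and `from_ave` are hypothetical.

```r
init <- c("AA", "AB", "AB", "AB", "AC")

# make.unique keeps the first "AB" as is and numbers the rest from 1
from_make_unique <- make.unique(init, sep = "0")
from_make_unique
## [1] "AA"   "AB"   "AB01" "AB02" "AC"

# The ave-based approach numbers every member of a duplicated group
uniquify <- function(x) if (length(x) == 1) x else sprintf("%s%02d", x, seq_along(x))
from_ave <- ave(init, init, FUN = uniquify)
from_ave
## [1] "AA"   "AB01" "AB02" "AB03" "AC"
```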
The make.unique solution could be rescued like this:
