Recoding values with dplyr using a lookup table - r

Is there a way to use the recode function of dplyr together with a lookup table (data.frame or list)?
What I would like to have would look something like this:
# Recode values with list of named arguments
data <- sample(c("a", "b", "c", "d"), 10, replace = TRUE)
lookup <- list(a = "Apple", b = "Pear")
dplyr::recode(data, lookup)
I found the mapvalues and revalue functions from the plyr package. Combining them is possible as explained here.
However, I am wondering whether something similar is possible with dplyr only.

do.call(dplyr::recode, c(list(data), lookup))
[1] "Pear" "c" "d" "c" "Pear" "Pear" "d" "c" "d" "c"

We can use base R
v1 <- unlist(lookup)[data]
ifelse(is.na(v1), data, v1)
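The base R idea above can be wrapped in a small helper so it is reusable. This is only a sketch: the function name recode_with_lookup is made up for illustration, and it assumes the lookup is a named list or named character vector. Values without an entry in the lookup are returned unchanged.

```r
# Hypothetical helper: recode x via a named lookup vector or list,
# keeping values that have no entry in the lookup
recode_with_lookup <- function(x, lookup) {
  x <- as.character(x)              # a factor would otherwise index by integer code
  out <- unname(unlist(lookup)[x])  # NA where the lookup has no entry
  missing_entry <- is.na(out)
  out[missing_entry] <- x[missing_entry]
  out
}

lookup <- list(a = "Apple", b = "Pear")
result <- recode_with_lookup(c("a", "b", "c", "d"), lookup)
print(result)
```

Converting to character first matters when the input is a factor, because indexing a named vector with a factor uses the integer codes rather than the labels.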

It works like this:
dplyr::recode(data, !!!lookup)
Also useful with mutate in a data frame or tibble:
library(dplyr)
df <- tibble(code = data)
df %>%
  mutate(fruit = recode(code, !!!lookup))
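The question also mentions a data.frame as the lookup table. A minimal sketch of that case follows; the object lookup_df and its column names code and fruit are assumptions made up for illustration. The two columns are turned into a named vector, which is then spliced into recode with !!!.

```r
library(dplyr)

# Hypothetical lookup table stored as a data frame
lookup_df <- data.frame(code = c("a", "b"),
                        fruit = c("Apple", "Pear"),
                        stringsAsFactors = FALSE)

# Named character vector: names are the old values, elements the new ones
lookup_vec <- setNames(lookup_df$fruit, lookup_df$code)

data <- c("a", "b", "c", "d", "a")
recoded <- recode(data, !!!lookup_vec)
print(recoded)
```

Values of a character vector that have no entry in the lookup ("c" and "d" here) are left as they are.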

Related

How to use the same R recode function on multiple variables without coding each?

From the recode examples, what if I have two variables where I want to apply the same recode?
factor_vec1 <- factor(c("a", "b", "c"))
factor_vec2 <- factor(c("a", "d", "f"))
How can I recode the same answer without writing a recode for each factor_vec? These don't work, do I need to learn how to use purrr to do it, or is there another way?
Output 1: recode(c(factor_vec1, factor_vec2), a = "Apple")
Output 2: recode(c(factor_vec2, factor_vec2), a = "Apple", b = "Banana")
If there are not many items needed to be recoded, you can try a simple lookup table approach using base R.
v1 <- c("a", "b", "c")
v2 <- c("a", "d", "f")
# lookup table
lut <- c("a" ="Apple",
"b" = "Banana",
"c" = "c",
"d" = "d",
"f" = "f")
lut[v1]
lut[v2]
You can reuse the lookup table for any relevant variables. The results are:
> lut[v1]
a b c
"Apple" "Banana" "c"
> lut[v2]
a d f
"Apple" "d" "f"
Use a list to hold multiple vectors; you can then apply the same function to each of them using lapply/map.
library(dplyr)
list_fac <- lst(factor_vec1, factor_vec2)
list_fac <- purrr::map(list_fac, recode, a = "Apple", b = "Banana")
You can keep the vectors in the list itself (which is better) or copy the changed vectors into the global environment using list2env.
list2env(list_fac, .GlobalEnv)
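If the factors live in a data frame rather than as separate vectors, the same recode can be applied to several columns at once. The sketch below assumes dplyr 1.0 or later (for across) and uses a made-up data frame df_fac built from the two example factors.

```r
library(dplyr)

# Hypothetical data frame holding the two example factors as columns
df_fac <- data.frame(
  factor_vec1 = factor(c("a", "b", "c")),
  factor_vec2 = factor(c("a", "d", "f"))
)

# Apply the same recode to both columns
recoded <- df_fac %>%
  mutate(across(c(factor_vec1, factor_vec2),
                ~ recode(.x, a = "Apple", b = "Banana")))
print(recoded)
```

Levels without a replacement ("c", "d", "f") are kept unchanged.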

How to apply the same function to several variables in R?

I know that similar questions have already been asked (e.g. Passing list element names as a variable to functions within lapply or R - iteratively apply a function of a list of variables), but I couldn't manage to find a solution for my problem based on these posts.
I have an event dataset (~100 variables, >2000 observations) that contains variables with information on the involved actors. One variable can only contain one actor, so if several actors have been involved in the event, they are spread over several variables (e.g. actor1, actor2, ...). These actors can be classified into two groups ("s" and "nons"). For later use, I need two lists of actors: one that contains all actors of the category "s" and one that contains all actors of "nons". "s" only consists of three actors while "nons" consists of dozens of actors.
library(dplyr)
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
Since the script I am about to prepare is supposed to be used on updated versions of the dataset in the future, I would like to automate as much as possible and keep the parts of the script that would need to be adapted as limited as possible. My idea was to create one function per category that extracts the actors of the respective category (e.g. "nons") from one variable (e.g. actor1) in a list and then "loop" this function over the other variables (ideally with the apply family).
I know which category each actor belongs to ("A", "B", and "C" are category "s"), which allows me to define a separation rule as used in the function below (the filter command).
# create function
nons_function <- function(col) {
col_ <- enquo(col)
nons_list <-
df %>%
filter(!is.na(!!col_), !!col_ != "A", !!col_ != "B", !!col_ != "C") %>%
distinct(!!col_) %>%
pull()
nons_list
}
# create list of variables to "loop" over
actorlist <- c("actor1", "actor2")
This results in the following. Instead of two lists of actors I get a list that contains the variable names as character strings.
> lapply(actorlist, nons_function)
[[1]]
[1] "actor1"
[[2]]
[1] "actor2"
What I would like to get is something like the following:
> lapply(actorlist, nons_function)
[[1]]
[1] "D" "E" "F" "G" "H"
[[2]]
[1] "E" "I" "D" "G"
The problem is probably the way I am passing the variable names to my function within lapply. Apparently, my function is not able to use a character input as a variable name. However, I have not found a way either to adapt my function so that it accepts character input or to provide it with a list of variables to loop over in a form it can digest.
Any help appreciated!
EDIT: Initially I had named the actors in a misleading way (actor names indicated which category an actor belongs to), which led to answers that do not really help in my case. I have now changed the actor names from "s1", "s2", "nons1", "nons2" etc. to "A", "B", "C" etc.
Here is an option using base R.
For nons-actors:
lapply( df[, 2:3], function(x) grep( "^nons", x, value = TRUE ) )
#$actor1
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#
#$actor2
#[1] "nons2" "nons6" "nons1" "nons4"
and for s-actors:
lapply( df[, 2:3], function(x) grep( "^s", x, value = TRUE ) )
# $actor1
# [1] "s1" "s2"
#
# $actor2
# [1] "s1" "s2" "s3"
Here is an option
library(dplyr)
library(stringr)
library(purrr)
map(actorlist, ~ df %>%
  select(all_of(.x)) %>%
  filter(!str_detect(!! rlang::sym(.x), "^s\\d+$")) %>%
  pull(1))
#[[1]]
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#[[2]]
#[1] "nons2" "nons6" "nons1" "nons4"
It can be wrapped in a function as well. Note that the input is a string, so instead of enquo, use sym to convert it to a symbol and then evaluate it with !!.
f1 <- function(dat, colNm) {
  dat %>%
    select(all_of(colNm)) %>%
    filter(!str_detect(!! rlang::sym(colNm), "^s\\d+$")) %>%
    pull(1) %>%
    unique()
}
map(actorlist, f1, dat = df)
NOTE: This can be done more easily, but here we are using similar code from the OP's post
Another option is to use split with grepl in base R and that returns a list of both 'nons' and 's' after removing the NAs
lapply(df[2:3], function(x) {
x1 <- x[!is.na(x)]
split(x1, grepl("nons", x1))})
Check my solution and see if it works for you.
require("dplyr")
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("s1", "s2", "nons1", "nons2", "nons3", "nons4", "nons5", NA),
actor2 = c("s1", NA, "s2", "s3", "nons2", "nons6", "nons1", "nons4"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
# Function for getting the category
category_function <- function(col,categ){
if(categ == "non"){
outp = grep("^non",col,value = T)
}else{
outp = grep("^s",col,value = T)
}
return(outp)
}
# Apply the function to all variables whose name starts with "actor"
sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
My output was the following:
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
$actor1
[1] "nons1" "nons2" "nons3" "nons4" "nons5"
$actor2
[1] "nons2" "nons6" "nons1" "nons4"
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
$actor1
[1] "s1" "s2"
$actor2
[1] "s1" "s2" "s3"
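The answers above rely on the original "s"/"nons" name prefixes. For the edited example data in the question, where "A", "B" and "C" form category "s", a minimal sketch could pass the column name as a string and refer to it through the .data pronoun. The vector s_actors and the extra function arguments are assumptions added for illustration.

```r
library(dplyr)

# Example data from the edited question
df <- data.frame(id = 1:8,
                 actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
                 actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
                 stringsAsFactors = FALSE)

# Assumed separation rule: these actors belong to category "s"
s_actors <- c("A", "B", "C")

# col is a character string, so the column is accessed via .data[[col]]
nons_function <- function(dat, col, s_actors) {
  kept <- dat %>%
    filter(!is.na(.data[[col]]), !.data[[col]] %in% s_actors)
  unique(kept[[col]])
}

actorlist <- c("actor1", "actor2")
nons_lists <- lapply(actorlist, nons_function, dat = df, s_actors = s_actors)
print(nons_lists)
```

Keeping the category members in one vector means only s_actors has to be adapted when the dataset is updated.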

using purrr to extract elements from multiple lists starting with a common letter

I have a list of lists. One element in each list has a name beginning with "n_". How do I extract these elements and store them in a separate list? Can I use a combination of map and starts_with?
E.g.:
m1 <- list(n_age = c(19,40,39),
names = c("a", "b", "c"))
m2 <- list(n_gender = c("m","f","f"),
names = c("f", "t", "d"))
nice_list <- list(m1, m2)
I was hoping that something like the following would work (it doesn't!):
output <- map(nice_list, starts_with("n_"))
You could (ab)use partial matching of $:
map(nice_list, `$`, "n_")
(I don't really recommend it).
(And I can't figure out why lapply(nice_list, `$`, "n_") doesn't work; it gives a list(NULL, NULL).)
How about this?
map(nice_list, ~.x[grep("n_", names(.x))])
#[[1]]
#[[1]]$n_age
#[1] 19 40 39
#
#
#[[2]]
#[[2]]$n_gender
#[1] "m" "f" "f"
Or using starts_with
map(nice_list, ~.x[starts_with("n_", vars = names(.x))])
Or to flatten the nested list, you could do
unlist(map(nice_list, ~.x[grep("n_", names(.x))]), recursive = F)
#$n_age
#[1] 19 40 39
#
#$n_gender
#[1] "m" "f" "f"
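Note that grep("n_", ...) matches "n_" anywhere in a name, not only at the start. A sketch of a stricter variant uses base R's startsWith (available since R 3.3.0) inside map; the intermediate names picked and flat are made up for illustration.

```r
library(purrr)

m1 <- list(n_age = c(19, 40, 39),
           names = c("a", "b", "c"))
m2 <- list(n_gender = c("m", "f", "f"),
           names = c("f", "t", "d"))
nice_list <- list(m1, m2)

# Keep only the elements whose name begins with "n_"
picked <- map(nice_list, ~ .x[startsWith(names(.x), "n_")])

# Drop one level of nesting to get a single named list
flat <- unlist(picked, recursive = FALSE)
print(names(flat))
```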

R how to find the intersection of a subset of vectors in a list

I have a list of vectors (characters). For example:
my_list <- list(c("a", "b", "c"),
c("a", "b", "c", "d"),
c("e", "d"))
For the intersection of all these three vectors, I could use: Reduce(intersect, my_list). But as you can see, there is no common element in all three vectors.
Then, what if I want to find the common elements that appear "at least" a certain number of times in the list? For example, somefunction(my_list, time=2) would give me c("a", "b", "c", "d") because those elements appear two times.
Thanks.
We can convert this to a data.table and do the group by action to get the elements
library(data.table)
setDT(stack(setNames(my_list, seq_along(my_list))))[,
if(uniqueN(ind)==2) values , values]$values
#[1] "a" "b" "c" "d"
A base R option would be to unlist the 'my_list', find the frequency count with the replicated sequence of 'my_list' using table, get the column sums, check whether it is equal to 2 and use that index to subset the names.
tblCount <- colSums(table(rep(seq_along(my_list), lengths(my_list)), unlist(my_list)))
names(tblCount)[tblCount==2]
#[1] "a" "b" "c" "d"
If you assume that each element will appear no more than once in a vector, you can "unlist" your vectors and count the frequency.
Here, using dplyr functions
library(dplyr)
my_list %>% unlist %>% tibble(v=.) %>% count(v) %>% filter(n>=2) %>% .[["v"]]
Or base functions
subset(as.data.frame(table(unlist(my_list))), Freq>=2)$Var1
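Note that the following computes the intersection of all vectors (the same as Reduce(intersect, my_list)), so it returns an empty result for the example rather than the elements that appear at least twice: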
my_list %>%
purrr::map(~ .) %>%
purrr::reduce(.f = dplyr::intersect, .x = .)
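The counting approach from the answers above can be wrapped into the kind of function the question asks for. This is a sketch in base R; the name common_elements and its times argument are made up for illustration. Each vector is de-duplicated first, so an element repeated within one vector is only counted once.

```r
# Hypothetical helper: elements that occur in at least `times` of the vectors
common_elements <- function(vec_list, times = 2) {
  # unique() per vector so that each vector contributes at most one count
  counts <- table(unlist(lapply(vec_list, unique)))
  names(counts)[counts >= times]
}

my_list <- list(c("a", "b", "c"),
                c("a", "b", "c", "d"),
                c("e", "d"))

in_two <- common_elements(my_list, times = 2)
in_all <- common_elements(my_list, times = 3)
print(in_two)
print(in_all)
```

With times equal to the number of vectors, the result contains the same elements as Reduce(intersect, my_list), which is empty for this example.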

R beginner standard regarding grouping levels used in R

So one of the problems I am stuck on is that:
I have some variable X which takes values {1,2,3,4}. Thus
X:
1
2
2
4
2
3
What I want to do is turn the 1's and 2's into A, and the 3's and 4's into B.
Are there any suggestions or hints for how I would go about doing this?
I was initially thinking of using the subset command, but this seems to just extract them from the dataset.
One possible option is to use recodeVar from the doBy package
library(doBy)
x <- c(1, 2, 2, 4, 2, 3)
src = list(c(1, 2), c(3, 4))
tgt = list("A", "B")
recodeVar(x, src, tgt)
which yields
> recodeVar(x, src, tgt)
[1] "A" "A" "A" "B" "A" "B"
>
Or you can use the car package:
library(car)
recode(x, "1:2='A'; 3:4='B'")
X <- c(1,2,2,4,2,3)
Y <- ifelse(X %in% 1:2, "A", "B")
## or
Y <- cut(X,breaks=c(0,2.5,5),labels=c("A","B"))
The latter approach creates a factor rather than a character vector; you can use as.character to turn it back into a character vector if you want.
Another alternative:
LETTERS[ceiling((1:4)/2)]
[1] "A" "A" "B" "B"
LETTERS[ceiling(X/2)]
[1] "A" "A" "A" "B" "A" "B"
If it is a data frame, use the dplyr package:
dataframe %>%
  mutate(newvar = case_when(var %in% c(1, 2) ~ "A",
                            var %in% c(3, 4) ~ "B")) -> dataframe
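Another way to express the same grouping is a named lookup vector in base R, in the spirit of the lookup-table approach. The sketch below uses a made-up name, group_lut. The numeric values are converted to character so that they index the vector by name rather than by position.

```r
X <- c(1, 2, 2, 4, 2, 3)

# Hypothetical lookup vector: names are the old values, elements the new groups
group_lut <- c("1" = "A", "2" = "A", "3" = "B", "4" = "B")

# Index by name; unname() drops the old values from the result
Y <- unname(group_lut[as.character(X)])
print(Y)
```

Any value of X without an entry in group_lut would come back as NA.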

Resources