Speeding up recoding of a character column in R


I have some data where each data point is associated with a character vector of varying length. For example, it might be generated by the following function:
library(tidyverse)
set.seed(27)
generate_keyset <- function(...) {
sample(LETTERS[1:5], size = rpois(n = 1, lambda = 10), replace = TRUE)
}
generate_keyset()
#> [1] "A" "C" "A" "A" "A" "A" "A" "E" "C" "C" "A" "D" "A" "D" "C" "A"
I would like to summarize this keyset by converting it to a single number score. The way this works is straightforward: each key in the keyset has a value, and to get the value of the entire keyset I sum over the values. The key-value map is a tibble with several hundred entries, but you can imagine it looks like:
key_value_map <- tribble(
~key, ~value,
"A", 1,
"B", -2,
"C", 8,
"D", -4,
"E", 0
)
Currently I am scoring keysets with the following function:
score_keyset <- function(keyset) {
merged_keysets_to_map <- data.frame(
key = keyset,
stringsAsFactors = FALSE
) %>%
left_join(key_value_map, by = "key")
sum(merged_keysets_to_map$value)
}
score_keyset(LETTERS[1:4])
#> [1] 3
This works fine, except it is very slow, and I need to do this operation about a million times. For example, I would like the following to be much faster:
n <- 1e4 # in practice I have n = 1e6
fake_data <- tibble(
keyset = map(1:n, generate_keyset)
)
library(tictoc)
tic()
scored_data <- fake_data %>%
mutate(
value = map_dbl(keyset, score_keyset)
)
toc()
I am sure there is a much better way to do this with indexing, but it is escaping me at the moment. Any help speeding this up is much appreciated.

Instead of doing a join and then a sum, it is more efficient to convert the key-value map to a named vector and index it by key:
library(tibble)
sum(deframe(key_value_map)[generate_keyset()])
Checking the timings: the OP's code took 45.728 sec with tic/toc, while the named-vector version is much faster:
tic()
v1 <- deframe(key_value_map)
scored_data2 <- fake_data %>%
mutate(
value = map_dbl(keyset, ~ sum(v1[.x]))
)
toc()
#0.952 sec elapsed
identical(scored_data, scored_data2)
#[1] TRUE
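As a self-contained illustration of the same idea, here is a minimal base R sketch (no tibble or purrr needed). The names lookup and score_keyset_fast are made up for this example, and the three keysets are toy data rather than the OP's fake_data:

```r
# Key-value map from the question, as a plain data frame
key_value_map <- data.frame(
  key = c("A", "B", "C", "D", "E"),
  value = c(1, -2, 8, -4, 0),
  stringsAsFactors = FALSE
)

# Named numeric vector: names are the keys, elements are the values
lookup <- setNames(key_value_map$value, key_value_map$key)

# Indexing by name plays the role of the join; sum() collapses it to one score
score_keyset_fast <- function(keyset) sum(lookup[keyset])

score_keyset_fast(LETTERS[1:4])
#> [1] 3

# Toy keysets (hypothetical), including an empty one
keysets <- list(c("A", "B"), c("C", "C", "E"), character(0))
scores <- vapply(keysets, score_keyset_fast, numeric(1))
scores
#> [1] -1 16  0
```

As with the left_join version, a key that is missing from the map produces NA, so the score of that keyset becomes NA as well.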

Related

Non duplicate remove subsetting [duplicate]

a <- c("A", "B", "C", "A", "A", "B")
b <- c("A", "C", "A")
I want to subset a with respect to b such that the following result is obtained:
("B" "A" "B")
Traditional subsetting removes all the "A"s and "C"s from a, including the duplicates, and I don't want that. For example, b has two "A"s and one "C", so when subsetting a with respect to b only two "A"s and one "C" should be removed from a. All the remaining elements of a should stay, even if they are "A" or "C".
I just want to know if there is a way of doing this in R.

An easy option is to use vsetdiff from package vecsets, i.e.,
vecsets::vsetdiff(a,b)
such that
> vecsets::vsetdiff(a,b)
[1] "B" "A" "B"

Using tibble and dplyr, you can do:
enframe(a) %>%
transmute(name = value) %>%
group_by(name) %>%
mutate(ID = 1:n()) %>%
left_join(enframe(table(b)), by = c("name" = "name")) %>%
filter(ID > value | is.na(value)) %>%
pull(name)
[1] "B" "A" "B"

Here is a way to do this :
#Count occurrences of `a`
a_count <- table(a)
#Count occurrences of `b`
b_count <- table(b)
#Subtract the count present in b from a
a_count[names(b_count)] <- a_count[names(b_count)] - b_count
#Create a new vector of remaining values
rep(names(a_count), a_count)
#[1] "A" "B" "B"

Or:
a <- c("A", "B", "C", "A", "A", "B")
b <- c("A", "C", "A")
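greedy_delete <- function(x, rmv) {
  for (i in rmv) {
    idx <- match(i, x)  # position of the first remaining occurrence of i
    if (!is.na(idx)) x <- x[-idx]  # skip values that are not (or no longer) in x
  }
  x
}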
greedy_delete(a, b)
#"B" "A" "B"
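The same "remove the first k copies" logic can also be written without a loop that shrinks the vector. The following is only a sketch in base R; vsetdiff_base is a made-up name, and it assumes x is a non-empty vector without NA values:

```r
a <- c("A", "B", "C", "A", "A", "B")
b <- c("A", "C", "A")

vsetdiff_base <- function(x, y) {
  # Occurrence number of each element within its own value:
  # the 1st "A" gets 1, the 2nd "A" gets 2, and so on
  occ <- ave(seq_along(x), x, FUN = seq_along)
  # For each element of x, how many copies of its value appear in y (0 if none)
  n_drop <- vapply(x, function(v) sum(y == v), integer(1), USE.NAMES = FALSE)
  # Keep an element only once enough earlier copies have been dropped
  x[occ > n_drop]
}

vsetdiff_base(a, b)
#> [1] "B" "A" "B"
```

Unlike the table-based approach, this keeps the surviving elements in their original order.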

How to apply the same function to several variables in R?

I know that similar questions have already been asked (e.g. Passing list element names as a variable to functions within lapply or R - iteratively apply a function of a list of variables), but I couldn't manage to find a solution for my problem based on these posts.
I have an event dataset (~100 variables, >2000 observations) that contains variables with information on the involved actors. One variable can only contain one actor, so if several actors have been involved in the event, they are spread over several variables (e.g. actor1, actor2, ...). These actors can be classified into two groups ("s" and "nons"). For later use, I need two lists of actors: one that contains all actors of the category "s" and one that contains all actors of "nons". "s" only consists of three actors while "nons" consists of dozens of actors.
library(dplyr)
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
Since the script I am about to prepare is supposed to be used on updated versions of the dataset in the future, I would like to automate as much as possible and keep the parts of the script that would need to be adapted as limited as possible. My idea was to create one function per category that extracts the actors of the respective category (e.g. "nons") from one variable (e.g. actor1) in a list and then "loop" this function over the other variables (ideally with the apply family).
I know which category each actor belongs to ("A", "B", and "C" are category "s"), which allows me to define a separation rule as used in the function below (the filter command).
# create function
nons_function <- function(col) {
col_ <- enquo(col)
nons_list <-
df %>%
filter(!is.na(!!col_), !!col_ != "A", !!col_ != "B", !!col_ != "C") %>%
distinct(!!col_) %>%
pull()
nons_list
}
# create list of variables to "loop" over
actorlist <- c("actor1", "actor2")
This results in the following. Instead of two lists of actors I get a list that contains the variable names as character strings.
> lapply(actorlist, nons_function)
[[1]]
[1] "actor1"
[[2]]
[1] "actor2"
What I would like to get is something like the following:
> lapply(actorlist, nons_function)
[[1]]
[1] "D" "E" "F" "G" "H"
[[2]]
[1] "E" "I" "D" "G"
The problem is probably the way I am passing the variable names to my function within lapply. Apparently, my function is not able to use a character input as a variable name. However, I have not found a way either to adapt my function so that it accepts character input, or to provide my function with a list of variables to loop over in a form it can digest.
Any help appreciated!
EDIT: Initially I had named the actors in a misleading way (the actor names indicated which category an actor belongs to), which led to answers that do not really help in my case. I have now changed the actor names from "s1", "s2", "nons1", "nons2" etc. to "A", "B", "C" etc.

Here is an option using base R.
For nons-actors:
lapply( df[, 2:3], function(x) grep( "^nons", x, value = TRUE ) )
#$actor1
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#
#$actor2
#[1] "nons2" "nons6" "nons1" "nons4"
and for s-actors:
lapply( df[, 2:3], function(x) grep( "^s", x, value = TRUE ) )
# $actor1
# [1] "s1" "s2"
#
# $actor2
# [1] "s1" "s2" "s3"

Here is an option
library(dplyr)
library(stringr)
library(purrr)
map(actorlist, ~ df %>%
  select(all_of(.x)) %>%  # all_of() selects columns named in a character vector
  filter(!str_detect(!! rlang::sym(.x), "^s\\d+$")) %>%
  pull(1))
#[[1]]
#[1] "nons1" "nons2" "nons3" "nons4" "nons5"
#[[2]]
#[1] "nons2" "nons6" "nons1" "nons4"
It can be wrapped as a function as well. Note that the input is a string, so instead of enquo, use sym to convert it to a symbol and then evaluate it (!!)
f1 <- function(dat, colNm) {
  dat %>%
    select(all_of(colNm)) %>%  # colNm is a column name given as a string
    filter(!str_detect(!! rlang::sym(colNm), "^s\\d+$")) %>%
    pull(1) %>%
    unique()
}
map(actorlist, f1, dat = df)
NOTE: This can be done more easily, but here we are using similar code from the OP's post
Another option is to use split with grepl in base R and that returns a list of both 'nons' and 's' after removing the NAs
lapply(df[2:3], function(x) {
x1 <- x[!is.na(x)]
split(x1, grepl("nons", x1))})

Check my solution and see if it works for you.
require("dplyr")
# create example data
df <- data.frame(id = c(1:8),
actor1 = c("s1", "s2", "nons1", "nons2", "nons3", "nons4", "nons5", NA),
actor2 = c("s1", NA, "s2", "s3", "nons2", "nons6", "nons1", "nons4"))
df <-
df %>%
mutate(actor1 = as.character(actor1),
actor2 = as.character(actor2))
# Function for getting the category
category_function <- function(col,categ){
if(categ == "non"){
outp = grep("^non",col,value = T)
}else{
outp = grep("^s",col,value = T)
}
return(outp)
}
# Apply the function to all variables whose name starts with "actor"
sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
My output was the following:
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="non")
$actor1
[1] "nons1" "nons2" "nons3" "nons4" "nons5"
$actor2
[1] "nons2" "nons6" "nons1" "nons4"
> sapply(df[grep("actor",names(df),value=T)],category_function,categ="s")
$actor1
[1] "s1" "s2"
$actor2
[1] "s1" "s2" "s3"
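The answers above were written against the original actor names ("s1", "nons1", ...), so they rely on a name pattern. With the edited data, where the category is only known from a list of names, a membership test can replace the pattern match. This is a sketch in base R; s_actors and actor_cols are names introduced here for illustration:

```r
# Example data from the edited question
df <- data.frame(
  id = 1:8,
  actor1 = c("A", "B", "D", "E", "F", "G", "H", NA),
  actor2 = c("A", NA, "B", "C", "E", "I", "D", "G"),
  stringsAsFactors = FALSE
)

# The only part that needs maintaining: the actors in category "s"
s_actors <- c("A", "B", "C")

# Pick up every actor column by name, so new actor3, actor4, ... are included
actor_cols <- grep("^actor", names(df), value = TRUE)

# Per column: drop NAs and "s" actors, keep each remaining actor once
nons <- lapply(df[actor_cols], function(x) unique(x[!is.na(x) & !(x %in% s_actors)]))
nons
#> $actor1
#> [1] "D" "E" "F" "G" "H"
#>
#> $actor2
#> [1] "E" "I" "D" "G"

# One combined vector of all "nons" actors across the columns
unique(unlist(nons, use.names = FALSE))
#> [1] "D" "E" "F" "G" "H" "I"
```

Swapping !(x %in% s_actors) for x %in% s_actors gives the corresponding list of "s" actors.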

Replacing multiple values in a column of a data frame based on condition

Here is a data set of weights.
I have to replace the values in the weight column based on the following conditions:
if weight < 7, replace the value with a
if weight >= 7 and < 8, replace the value with b
if weight >= 8, replace the value with c

One option is ifelse
df1$New <- with(df1, ifelse(weight <7, "a", ifelse(weight >=7 & weight < 8, "b", "c")))
df1$New
#[1] "a" "c" "c" "a" "c" "b" "c"
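or we can use cut if there are many groups (right = FALSE makes each interval closed on the left, so a weight of exactly 7 maps to "b" and exactly 8 maps to "c", as the conditions require)
with(df1, as.character(cut(weight, breaks = c(-Inf, 7, 8, Inf),
                           labels = c('a', 'b', 'c'), right = FALSE)))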
#[1] "a" "c" "c" "a" "c" "b" "c"
data
set.seed(24)
df1 <- data.frame(weight = rnorm(7, 7, 3))

You could use dplyr as well, either to replace the existing column (give the result the same name) or to add a new one (give it a new name). The code below replaces the existing "weight" column.
library(dplyr)
yourdata %>%
  mutate(weight = ifelse(weight < 7, "a", ifelse(weight >= 7 & weight < 8, "b", "c")))
In future, please provide a reproducible example of your data rather than a JPEG; see: How to make a great R reproducible example?
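Another way to express the three conditions, shown here only as a sketch, is findInterval from base R, which returns the number of break points that are less than or equal to each value. The weights below are made up so that both boundaries (7 and 8) are exercised:

```r
weight <- c(6.5, 7, 7.9, 8, 10)  # made-up values, including both boundaries
labels <- c("a", "b", "c")

# findInterval() gives 0 below 7, 1 in [7, 8), and 2 at 8 or above;
# adding 1 turns that into an index into labels
new <- labels[findInterval(weight, c(7, 8)) + 1]
new
#> [1] "a" "b" "b" "c" "c"
```

Because the intervals are closed on the left by default, exactly 7 becomes "b" and exactly 8 becomes "c", matching the conditions in the question.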

R - two data frame columns to list of key-value pairs

Say I have a data frame
DF1 <- data.frame("a" = c("a", "b", "c"), "b" = 1:3)
What is the easiest way to turn this into a list?
DF2 <- list("a" = 1, "b" = 2, "c" = 3)
It must be really simple but I can't find out the answer.

You can use setNames and as.list
DF2 <- setNames(as.list(DF1$b), DF1$a)
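A quick check of the result, as a sketch: the as.character() call is an addition here, to guard against the key column being a factor (the data.frame default before R 4.0). Note that the values stay integers because column b was built with 1:3.

```r
DF1 <- data.frame(a = c("a", "b", "c"), b = 1:3)

# One list element per row: names come from column a, values from column b
DF2 <- setNames(as.list(DF1$b), as.character(DF1$a))

names(DF2)
#> [1] "a" "b" "c"
DF2[["b"]]
#> [1] 2
```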

R beginner standard regarding grouping levels used in R

So one of the problems I am stuck on is that:
I have some variable X which takes values {1,2,3,4}. Thus
X:
1
2
2
4
2
3
What I want to do is turn the 1's and 2's into A, and the 3's and 4's into B.
Are there any suggestions or hints on how I would go about doing this?
I was initially thinking of using the subset command, but that seems to just extract the values from the dataset.

One possible option is to use recodeVar from the doBy package
library(doBy)
x <- c(1, 2, 2, 4, 2, 3)
src = list(c(1, 2), c(3, 4))
tgt = list("A", "B")
recodeVar(x, src, tgt)
which yields
> recodeVar(x, src, tgt)
[1] "A" "A" "A" "B" "A" "B"
Or you can use the car package:
library(car)
recode(x, "1:2='A'; 3:4='B'")

X <- c(1,2,2,4,2,3)
Y <- ifelse(X %in% 1:2, "A", "B")
## or
Y <- cut(X,breaks=c(0,2.5,5),labels=c("A","B"))
The latter approach creates a factor rather than a character vector; you can use as.character to turn it back into a character vector if you want.

Another alternative:
LETTERS[ceiling((1:4)/2)]
[1] "A" "A" "B" "B"
LETTERS[ceiling(X/2)]
[1] "A" "A" "A" "B" "A" "B"

If it's a data frame, you can use case_when from the dplyr package:
library(dplyr)
dataframe %>%
  mutate(newvar = case_when(var %in% c(1, 2) ~ "A",
                            var %in% c(3, 4) ~ "B")) -> dataframe
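When the mapping is arbitrary rather than arithmetic, a named lookup vector is another option. This is only a sketch; the name lookup is made up here:

```r
X <- c(1, 2, 2, 4, 2, 3)

# Names are the old values (as strings), elements are the new labels
lookup <- c("1" = "A", "2" = "A", "3" = "B", "4" = "B")

# Index by name; unname() drops the old values from the result
Y <- unname(lookup[as.character(X)])
Y
#> [1] "A" "A" "A" "B" "A" "B"
```

Values of X that have no entry in lookup come back as NA, which makes unexpected codes easy to spot.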
