Efficient implementation of value selection based on a column in R

R> data.frame(x1=1:3, x2=11:13, y=c('a', 'a;b', 'b'))
x1 x2 y
1 1 11 a
2 2 12 a;b
3 3 13 b
I have a data.frame in the format shown above, where if y contains a, then x1 is added to the result, and if y contains b, then x2 is added to the result.
For this specific example, the result should be data.frame(i=c(1,2,2,3), v=c(1, 2, 12, 13)), where i is the row index. The order must be maintained as in the input. It is trivial to do this with element-by-element operations, but I am wondering if there is a more efficient implementation (e.g., based on vectorized operations).
Edit
A method based on *apply may be
f=data.frame(x1=1:3, x2=11:13, y=c('a', 'a;b', 'b'))
n=nrow(f)
do.call(
  rbind,
  lapply(seq_len(n), function(i) {
    do.call(
      rbind,
      lapply(strsplit(f$y[[i]], ';')[[1]], function(x) {
        if (x == 'a') {
          data.frame(i = i, v = f$x1[[i]])
        } else if (x == 'b') {
          data.frame(i = i, v = f$x2[[i]])
        } else {
          NULL
        }
      })
    )
  })
)
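For comparison, a fully vectorized base R sketch (my own illustration, not from the original post, assuming y only ever holds the tokens a and b so a plain grepl() test is safe):
f <- data.frame(x1 = 1:3, x2 = 11:13, y = c('a', 'a;b', 'b'))
has_a <- grepl('a', f$y, fixed = TRUE)  # rows whose y contains "a"
has_b <- grepl('b', f$y, fixed = TRUE)  # rows whose y contains "b"
res <- rbind(
  data.frame(i = which(has_a), v = f$x1[has_a]),
  data.frame(i = which(has_b), v = f$x2[has_b])
)
res <- res[order(res$i), ]  # restore the input row order (stable, so "a" values stay before "b" values)
rownames(res) <- NULL
res
#   i  v
# 1 1  1
# 2 2  2
# 3 2 12
# 4 3 13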

This will give you the desired values (using dplyr and tidyr, with df being the example data frame from the question):
library(dplyr)
library(tidyr)
vector <- df %>%
  separate_rows(y) %>%
  mutate(new_col = ifelse(y == "a", x1, x2)) %>%
  pull(new_col)
dput(vector)
output:
c(1L, 2L, 12L, 13L)
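If the index column i from the question is also needed, a small extension of the same idea (a sketch, again assuming dplyr and tidyr):
library(dplyr)
library(tidyr)
df <- data.frame(x1 = 1:3, x2 = 11:13, y = c("a", "a;b", "b"))
df %>%
  mutate(i = row_number()) %>%            # remember the original row index
  separate_rows(y) %>%
  mutate(v = ifelse(y == "a", x1, x2)) %>%
  select(i, v)
# i: 1 2 2 3, v: 1 2 12 13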

I don't know about efficiency for your particular case, but here is what I propose:
library(dplyr)
library(tibble)
library(tidyr)
dat <- tibble( # First create the data
  x1 = 1:3, x2 = 11:13, y = c('a', 'a;b', 'b'))
dat %>%
  add_row(x1 = 23, x2 = -2, y = "bla") %>% # Add a row for testing purposes
  separate_rows(y, sep = ";") %>%          # separate rows on ";"
  mutate(
    result =
      case_when( # Output either x1 or x2 based on the value in "y"
        y == "a" ~ x1,
        y == "b" ~ x2))

Related

Assign multiple columns when using mutate in dtplyr

Is there a way of getting my data table to look like my target table when using dtplyr and mutate?
A dummy table:
library(data.table)
library(dtplyr)
library(dplyr)
id <- rep(c("A","B"),each=3)
x1 <- rnorm(6)
x2 <- rnorm(6)
dat <- data.table(id,x1,x2)
A dummy function:
my_fun <- function(x,y){
cbind(a = x+10,b=y-10)
}
And I would like to use this type of syntax
dat |>
group_by(id) |>
mutate(my_fun(x = x1,y = x2))
Where the end result will look like this
data.table(id, x1, x2, a=x1+10,b=x2-10)
I would like a generic solution that works for functions returning a variable number of columns, but is that possible?
I think we would need more information about how this would work with a variable number of columns:
Are the columns named in a specific way?
Do the output columns need to be named in a specific way?
Are there standard calculations being done to each column dependent on name? E.g., x1 = +10 and x2 = -10?
At any rate, here is a solution that works with your provided data to return the data.table you specified:
my_fun <- function(data, ...){
  dots <- list(...)
  cbind(data,
        a = data[[dots$x]] + 10,
        b = data[[dots$y]] - 10
  )
}
dat |>
my_fun(x = "x1", y = "x2")
id x1 x2 a b
1: A 0.8485309 -0.3532837 10.848531 -10.353284
2: A 0.7248478 -1.6561564 10.724848 -11.656156
3: A -1.3629114 0.4210139 8.637089 -9.578986
4: B -1.7934827 0.6717033 8.206517 -9.328297
5: B -1.0971890 -0.3008422 8.902811 -10.300842
6: B 0.4396630 -0.7447419 10.439663 -10.744742
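I am not sure dtplyr can translate this, but with plain dplyr (1.0 or later) an unnamed mutate() expression that returns a data frame has its columns spliced into the output, so the syntax you asked for almost works as-is. A sketch using the original two-argument my_fun() from the question (not the modified one above):
library(dplyr)
my_fun <- function(x, y){
  cbind(a = x + 10, b = y - 10)
}
dat |>
  as.data.frame() |>
  group_by(id) |>
  mutate(as.data.frame(my_fun(x = x1, y = x2))) |>  # unnamed data frame result: columns a and b are spliced in
  ungroup()
This also generalizes to functions returning any number of columns, as long as the result is (or is converted to) a data frame.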

R lapply function with two arguments that are not fixed

I found a similar question asked before, but my question is a bit more complex: in my case the Y parameter is not fixed.
In function(X, Y){SOME FUNCTION}, X is a list of characters and Y is a list of data frames. Basically, I want the function to work on X and Y pairwise and produce the output as one list: the first element of X with the first element of Y, the second element of X with the second element of Y, the third element of X with the third element of Y, and so on.
Example of X, Y
X <- c("1", "2")
y1 <- data.frame("person.1" = "Amy", "bestfood..1" = "fish", "bestthing..1" = "book",
"person.2" = "Mike", "bestfood..2" = "fish", "bestthing..2" = "book")
y2 <- data.frame("person.1" = "Amy","bestfood..1" = "carrot", "bestthing..1" = "cloth",
"person.2" = "Mike","bestfood..2" = "carrot", "bestthing..2" = "cloth")
Y <- list(y1,y2)
The function:
addID <- function(X, Y) {
  rowlength <- length(Y)
  df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
  colnames(df) <- c("ID", "Person", "Food")
  df[1:nrow(df), 1] <- X
  # name
  namecols <- grep("person", colnames(Y))
  for (i in 1:length(namecols)) {
    name <- Y[1, namecols[i]]
    df[i, 2] <- as.character(name)
  }
  # food
  foodcols <- grep("bestfood", colnames(Y))
  for (i in 1:length(foodcols)) {
    food <- Y[1, foodcols[i]]
    df[i, 3] <- as.character(foodcols)
  }
  return(df)
}
I tried to use lapply but can't figure out the way to include the X list. When I try this:
lapply(Y, function, X=X)
The function doesn't work properly. I wonder if there are other ways to include X in it (I tried the function on an individual character and data frame, and it works just fine).
I hope this is clear. If not, please point it out, I will try my best to clarify. Thanks in advance.
UPDATE:
I tried Map as suggested in the comments. It returns: incorrect number of dimensions. I added some details to the function. It seems like R gets stuck on the last line.
outcome <- Map(addID, Y, X)
I get
error in Y[1, namecols[i]] : incorrect number of dimensions
In addition: Warning message:
In `[<-.data.frame`(`*tmp*`, 1:nrow(df), 1, value = list(person.1 = 1L, :
provided 6 variables to replace 1 variables
The outcome should look like:
z1 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("fish", "fish"))
z2 <- data.frame(ID = c(1,2), Person = c("Amy","Mike"), Food = c("carrot", "carrot"))
outcome <- list(z1,z2)
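For pairwise iteration like this, base R's Map() (a wrapper around mapply(..., SIMPLIFY = FALSE)) applies a function to the first elements of its arguments, then the second elements, and so on. A minimal sketch with made-up toy data (not the OP's):
X <- list("1", "2")
Y <- list(data.frame(person = "Amy"), data.frame(person = "Mike"))
Map(function(x, y) cbind(ID = x, y), X, Y)
# [[1]]: ID = "1", person = "Amy"
# [[2]]: ID = "2", person = "Mike"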
We could do this easily in the tidyverse:
library(dplyr)
library(tidyr)
bind_rows(Y, .id = 'ID') %>%
select(ID, starts_with('person'), contains('food')) %>%
pivot_longer(cols = -ID, names_to = c(".value"),
names_pattern = "([^.]+)\\.+\\d+")
-output
# A tibble: 4 x 3
ID person bestfood
<chr> <chr> <chr>
1 1 Amy fish
2 1 Mike fish
3 2 Amy carrot
4 2 Mike carrot
With the OP's function, if we modify it, it would work. Note the argument order: Map(addID, X, Y) matches the signature function(X, Y); the earlier Map(addID, Y, X) passed the data frames to X and the ID strings to Y, which is what produced the "incorrect number of dimensions" error.
addID <- function(X, Y) {
  rowlength <- length(Y)
  df <- as.data.frame(matrix(NA, nrow = rowlength, ncol = 3))
  colnames(df) <- c("ID", "Person", "Food")
  df[1:nrow(df), 1] <- X
  namecols <- grep("person", colnames(Y))
  df[, 2] <- unlist(Y[namecols])
  foodcols <- grep("bestfood", colnames(Y))
  df[, 3] <- unlist(Y[foodcols])
  return(unique(df))
}
-testing
Map(addID, X, Y)
$`1`
ID Person Food
1 1 Amy fish
2 1 Mike fish
$`2`
ID Person Food
1 2 Amy carrot
2 2 Mike carrot

How can I write a for loop for this?

I would like to write a for loop for the actions below. data is a data frame with multiple columns, each of which contains a list. I would like to replace all NULL values in each list column with NA so that I can bind all the lists into a data frame. If there's a more efficient way to do this than a for loop, I would like to know as well. Thank you.
for (i in names(data)){
list1=sapply(data[,1], function(x) ifelse(x == "NULL", NA, x))
list1=as.data.frame(list1)
list2=sapply(data[,2], function(x) ifelse(x == "NULL", NA, x))
list2=as.data.frame(list2)
.
.
.
fulllist=as.data.frame(cbind(list1,list2,....))
fulllist = as.data.frame(t(fulllist))
}
We loop over the columns of the data to find the list columns ('i1'). Using that index, we loop over those columns, then loop over the elements of each list and assign NA to the NULL elements:
i1 <- sapply(data, is.list)
data[i1] <- lapply(data[i1], function(x) {
  i2 <- sapply(x, is.null)
  x[i2] <- NA
  x
})
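A toy example of this approach (hypothetical data, not the OP's), with one list column containing NULLs:
data <- data.frame(id = 1:3)
data$val <- list("a", NULL, "c")   # list column with a NULL element
i1 <- sapply(data, is.list)
data[i1] <- lapply(data[i1], function(x) {
  i2 <- sapply(x, is.null)
  x[i2] <- NA
  x
})
str(data$val)
# List of 3: "a", NA, "c" -- the NULL is now NA, so the rows can be bound together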
If you are indeed working with a data frame, you could perhaps skip the detour through lists and recombining into a data frame:
purrr::map_df(.x = data, .f = ~ stringr::str_replace(.x, 'NULL', NA_character_))
This takes the data frame data and applies str_replace to each column, replacing the string "NULL" with the character version of NA. The output is also a data frame.
Here is an example:
library(purrr)
library(stringr)
df <- data.frame(
X1 = c('A', 'NULL', 'B'),
X2 = c('NULL', 'C', 'D'),
X3 = c('E', 'NULL', 'NULL')
)
purrr::map_df(.x = df, .f = ~ stringr::str_replace(.x, 'NULL', NA_character_))
# X1 X2 X3
# <chr> <chr> <chr>
# 1 A NA E
# 2 NA C NA
# 3 B D NA

Canonical tidyverse method to update some values of a vector from a look-up table

I frequently need to recode some (not all!) values in a data frame column based off of a look-up table. I'm not satisfied by the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I'd want to make sure I'm not duplicating something standard that's already out there.
## Toy example
data = data.frame(
id = 1:7,
x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
old = c("A", "D", "."),
new = c("a", "d", "!")
)
## desired result
# id x
# 1 1 a
# 2 2 a
# 3 3 B
# 4 4 C
# 5 5 d
# 6 6 AA
# 7 7 !
I can do it with a join, coalesce, and unselect as below, but this isn't as clear as I'd like: too many steps.
## This works, but is more steps than I want
library(dplyr)
data %>%
left_join(lookup, by = c("x" = "old")) %>%
mutate(x = coalesce(new, x)) %>%
select(-new)
It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer lookup as a data frame, but I'm okay with the named-vector solution. My concern here is that recode is in the Questioning lifecycle phase, so I'm worried that this method isn't stable.
lookup_v = pull(lookup, new) %>% setNames(lookup$old)
data %>%
mutate(x = recode(x, !!!lookup_v))
It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose forcats::fct_recode is a stable version of recode, but I don't want factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).
I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so doesn't yet meet my stability requirement.)
Summary of my requirements:
A single data column is updated based off of a lookup data frame (preferably) or named vector (allowable)
Not all values in the data are included in the lookup--the ones that are not present are not modified
Must work on character class input. Working more generally is a nice-to-have.
No dependencies outside of base R and tidyverse packages (though I'd also be interested in seeing a data.table solution)
No functions used that are in lifecycle phases like superseded or questioning. Please note any experimental lifecycle functions, as they have future potential.
Concise, clear code
I don't need extreme optimization, but nothing wildly inefficient (like regex when it's not needed)
A direct data.table solution, without %in%.
Depending on the length of the lookup / data tables, adding keys could improve performance substantially, but this isn't the case on this simple example.
library(data.table)
setDT(data)
setDT(lookup)
## If needed
# setkey(data,x)
# setkey(lookup,old)
data[lookup, x:=new, on=.(x=old)]
data
id x
1: 1 a
2: 2 a
3: 3 B
4: 4 C
5: 5 d
6: 6 AA
7: 7 !
Benchmarking
Expanding the original dataset to 10M rows, 15 runs using microbenchmark gave the following results on my computer:
Note that the forcats::fct_recode and dplyr::recode solutions mentioned by the OP have also been included. Neither works with the updated data, because a named vector with the entry . = ! throws an error, which is why results are tested on the original dataset.
data = data.frame(
id = 1:5,
x = c("A", "A", "B", "C", "D")
)
lookup = data.frame(
old = c("A", "D"),
new = c("a", "d")
)
set.seed(1)
data <- data[sample(1:5, 1E7, replace = T),]
dt_lookup <- data.table::copy(lookup)
dplyr_coalesce <- function(){
library(dplyr)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = coalesce(lookupV[ x ], x))
}
datatable_in <- function(){
library(data.table)
lookupV <- setNames(lookup$new, lookup$old)
setDT(dt_data)
dt_data[ x %in% names(lookupV), x := lookupV[ x ] ]
}
datatable <- function(){
library(data.table)
setDT(dt_data)
setDT(dt_lookup)
## If needed
# setkey(data,x)
# setkey(lookup,old)
dt_data[dt_lookup, x:=new, on =.(x=old)]
}
purrr_modify_if <- function(){
library(dplyr)
library(purrr)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))
}
stringr_str_replace_all_update <- function(){
library(dplyr)
library(stringr)
lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))
data %>%
dplyr::mutate(x = str_replace_all(x, lookupV))
}
base_named_vector <- function(){
lookupV <- c(with(lookup, setNames(new, old)), rlang::set_names(setdiff(unique(data$x), lookup$old)))
lookupV[data$x]
}
base_ifelse <- function(){
  lookupV <- setNames(lookup$new, lookup$old)
  with(data, ifelse(x %in% lookup$old, lookupV[x], x))  # look up replacements by name; keep originals otherwise
}
plyr_mapvalues <- function(){
library(plyr)
data %>%
dplyr::mutate(x = plyr::mapvalues(x, lookup$old, lookup$new, warn_missing = F))
}
base_match <- function(){
tochange <- match(data$x, lookup$old, nomatch = 0)
data$x[tochange > 0] <- lookup$new[tochange]
}
base_local_safe_lookup <- function(){
lv <- structure(lookup$new, names = lookup$old)
safe_lookup <- function(val) {
new_val <- lv[val]
unname(ifelse(is.na(new_val), val, new_val))
}
safe_lookup(data$x)
}
dplyr_recode <- function(){
library(dplyr)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = recode(x, !!!lookupV))
}
base_for <- function(){
for (i in seq_len(nrow(lookup))) {
data$x[data$x == lookup$old[i]] = lookup$new[i]
}
}
datatable_for <- function(){
library(data.table)
setDT(dt_data)
for (i in seq_len(nrow(lookup))) {
dt_data[x == lookup$old[i], x := lookup$new[i]]
}
}
forcats_fct_recode <- function(){
library(dplyr)
library(forcats)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = as.character(fct_recode(x, !!!lookupV)))
}
datatable_set <- function(){
library(data.table)
setDT(dt_data)
tochange <- dt_data[, chmatch(x, lookup$old, nomatch = 0)]
set(dt_data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])
}
library(microbenchmark)
bench <- microbenchmark(dplyr_coalesce(),
datatable(),
datatable_in(),
datatable_for(),
base_for(),
purrr_modify_if(),
stringr_str_replace_all_update(),
base_named_vector(),
base_ifelse(),
plyr_mapvalues(),
base_match(),
base_local_safe_lookup(),
dplyr_recode(),
forcats_fct_recode(),
datatable_set(),
times = 15L,
setup = dt_data <- data.table::copy(data))
bench$expr <- forcats::fct_rev(forcats::fct_reorder(bench$expr, bench$time, mean))
ggplot2::autoplot(bench)
Thanks to @Waldi and @nicola for advice on implementing the data.table solutions in the benchmark.
Combination of a named vector and coalesce:
# make lookup vector
lookupV <- setNames(lookup$new, lookup$old)
data %>%
mutate(x = coalesce(lookupV[ x ], x))
# id x
# 1 1 a
# 2 2 a
# 3 3 B
# 4 4 C
# 5 5 d
Or data.table:
library(data.table)
setDT(data)
data[ x %in% names(lookupV), x := lookupV[ x ] ]
This post might have a better solution for data.table - "update on merge":
R data table: update join
A base R option using match - thanks to @LMc & @nicola
tochange <- match(data$x, lookup$old, nomatch = 0)
data$x[tochange > 0] <- lookup$new[tochange]
One more data.table option using set() and chmatch
library(data.table)
setDT(data)
tochange <- data[, chmatch(x, lookup$old, nomatch = 0)]
set(data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])
Result
data
# id x
#1 1 a
#2 2 a
#3 3 B
#4 4 C
#5 5 d
#6 6 AA
#7 7 !
modify_if
You could use purrr::modify_if to only apply the named vector to values that exist in it. Though not a specified requirement, it has the benefit of the .else argument, which allows you to apply a different function to values not in your lookup.
I also wanted to include the use of tibble::deframe here to create the named vector. It is slower than setNames, though.
lookupV <- deframe(lookup)
data %>%
mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))
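For example, .else could apply a fallback function (here tolower(), purely for illustration) to the values that are not in the lookup:
library(purrr)
lookupV <- c(A = "a", D = "d", "." = "!")
x <- c("A", "A", "B", "C", "D", "AA", ".")
modify_if(x, x %in% names(lookupV), ~ lookupV[[.x]], .else = tolower)
# "a" "a" "b" "c" "d" "aa" "!"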
str_replace_all
Alternatively, you could use stringr::str_replace_all, which can take a named vector for the replacement argument.
data %>%
mutate(x = str_replace_all(x, lookupV))
Update
To accommodate the change in your edited example, the named vector used in str_replace_all needs to be modified: the entire literal string must be matched, so that "A" is not substituted inside "AA" and "." does not replace everything:
lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))
data %>%
mutate(x = str_replace_all(x, lookupV))
left_join
Using dplyr::left_join, this is very similar to the OP's solution, but it uses the .keep argument of mutate, so it has fewer steps. That argument is currently in the experimental lifecycle, so it is not included in the benchmark (though it lands around the middle of the posted solutions).
left_join(data, lookup, by = c("x" = "old")) %>%
mutate(x = coalesce(new, x) , .keep = "unused")
Base R
Named Vector
Create a substitution value for every unique value in your dataframe.
lookupV <- c(with(lookup, setNames(new, old)), setNames(nm = setdiff(unique(data$x), lookup$old)))
data$x <- lookupV[data$x]
ifelse
with(data, ifelse(x %in% lookup$old, lookupV[x], x))
Another clear option is to use a for loop with subsetting to iterate through the rows of the lookup table. This will almost always be quicker with data.table because of auto-indexing, or if you set the key (see ?data.table::setkey) ahead of time. Also, it will of course get slower as the lookup table gets longer; I would guess an update join would be preferred for a long lookup table.
Base R:
for (i in seq_len(nrow(lookup))) {
data$x[data$x == lookup$old[i]] <- lookup$new[i]
}
data$x
# [1] "a" "a" "B" "C" "d" "AA" "!"
Or the same logic with data.table:
library(data.table)
setDT(data)
for (i in seq_len(nrow(lookup))) {
data[x == lookup$old[i], x := lookup$new[i]]
}
data$x
# [1] "a" "a" "B" "C" "d" "AA" "!"
Data:
data = data.frame(
id = 1:7,
x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
old = c("A", "D", "."),
new = c("a", "d", "!")
)
Another base solution, with a lookup vector:
## Toy example
data = data.frame(
id = 1:5,
x = c("A", "A", "B", "C", "D"),
stringsAsFactors = F
)
lookup = data.frame(
old = c("A", "D"),
new = c("a", "d"),
stringsAsFactors = F
)
lv <- structure(lookup$new, names = lookup$old)
safe_lookup <- function(val) {
new_val <- lv[val]
unname(ifelse(is.na(new_val), val, new_val))
}
data$x <- safe_lookup(data$x)
A dplyr + plyr solution that is in line with all your bullet points (if you consider plyr part of the tidyverse):
data <- data %>%
  dplyr::mutate(
    x = plyr::mapvalues(x, lookup$old, lookup$new)  # can add warn_missing = FALSE to suppress warnings
  )
I basically share the same problem. Although dplyr::recode is in the "questioning" lifecycle, I don't expect it to become deprecated. At some point it might be superseded, but even in that case it should still be usable. Therefore I'm using a wrapper around dplyr::recode which allows the use of a named vector or two separate vectors (which could be a lookup table).
library(dplyr)
library(rlang)
recode2 <- function(x, new, old = NULL, .default = NULL, .missing = NULL) {
  if (!rlang::is_named(new) && !is.null(old)) {
    new <- setNames(new, old)
  }
  do.call(dplyr::recode,
          c(.x = list(x),
            .default = list(.default),
            .missing = list(.missing),
            as.list(new)))
}
data = data.frame(
id = 1:7,
x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
old = c("A", "D", "."),
new = c("a", "d", "!")
)
# two vectors new / old
data %>%
mutate(x = recode2(x, lookup$new, lookup$old))
#> id x
#> 1 1 a
#> 2 2 a
#> 3 3 B
#> 4 4 C
#> 5 5 d
#> 6 6 AA
#> 7 7 !
# named vector
data %>%
mutate(x = recode2(x, c("A" = "a",
"D" = "d",
"." = "!")))
#> id x
#> 1 1 a
#> 2 2 a
#> 3 3 B
#> 4 4 C
#> 5 5 d
#> 6 6 AA
#> 7 7 !
Created on 2021-04-21 by the reprex package (v0.3.0)

For loop for updating data.frame

I am trying to write a "for loop" to update my R data frame by iterating.
Here is my code:
datalist = list()
for (i in 1:5) {
dat <- data.frame(ID=LETTERS[seq( from = 1, to = 20 )],nutrition=rnorm(20, mean=50, sd=10),
Stage=c(rep("A1",5), rep("B1",15)))
dat$ADG<-dat$nutrition*0.05
dat$M_weight<-dat$nutrition*0.5+dat$ADG*100
dat$Age<-dat$M_weight*1.1+dat$ADG*0.6
dat$Stage<-as.character(dat$Stage)
dat$Stage[dat$ADG>=3]<-"C1"
dat$i <- i # maybe you want to keep track of which iteration produced it?
datalist[[i]] <- dat # add it to your list #
}
big_data = do.call(rbind, datalist)
From iteration 2 onward, I would like "Stage" updated to "C1" if ADG is equal to or greater than 3, but this should not apply to iteration 1.
Thank you so much! I appreciate any replies!
I think you want a recursive function instead of an iterative one.
Your data (with stringsAsFactors = FALSE):
dat <- data.frame(ID=LETTERS[seq( from = 1, to = 20 )], nutrition=rnorm(20, mean=50, sd=10), Stage=c(rep("A1",5), rep("B1",15)), stringsAsFactors=F)
Use the tidyverse for dplyr and purrr verbs:
library(tidyverse)
special <- function(dat, counter, end) {
  dat1 <- dat %>%
    mutate(ADG = nutrition * 0.05) %>%
    mutate(M_weight = nutrition * 0.5 + ADG * 100) %>%
    mutate(Age = M_weight * 1.1 + ADG * 0.6) %>%
    mutate(Stage = ifelse(ADG >= 3, "C1", Stage)) %>%
    mutate(i = counter)
  if (counter < end) {
    special(dat1, counter + 1, end)
  } else {
    return(dat1)
  }
}
desired <- map_df(2:5, ~special(dat,1,.x))
head(desired)
ID nutrition Stage ADG M_weight Age i
1 A 47.17826 A1 2.358913 259.4804 286.8438 2
2 B 64.55988 C1 3.227994 355.0794 392.5241 2
3 C 52.29020 A1 2.614510 287.5961 317.9244 2
4 D 59.96544 A1 2.998272 329.8099 364.5899 2
Let me know if this is not the output you were expecting
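If you would rather keep your original loop, a minimal sketch (my reading of the requirement, not from the answer above) is to make the Stage assignment conditional on the iteration counter:
datalist <- list()
for (i in 1:5) {
  dat <- data.frame(ID = LETTERS[1:20],
                    nutrition = rnorm(20, mean = 50, sd = 10),
                    Stage = c(rep("A1", 5), rep("B1", 15)),
                    stringsAsFactors = FALSE)
  dat$ADG      <- dat$nutrition * 0.05
  dat$M_weight <- dat$nutrition * 0.5 + dat$ADG * 100
  dat$Age      <- dat$M_weight * 1.1 + dat$ADG * 0.6
  if (i >= 2) {                      # only update Stage from the second iteration onward
    dat$Stage[dat$ADG >= 3] <- "C1"
  }
  dat$i <- i                         # keep track of which iteration produced this chunk
  datalist[[i]] <- dat
}
big_data <- do.call(rbind, datalist)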
