Someone should have asked this already, but I couldn't find an answer. Say I have:
x = data.frame(q=1,w=2,e=3, ...and many many columns...)
what is the most elegant way to rename an arbitrary subset of columns, whose position I don't necessarily know, into some other arbitrary names?
e.g. Say I want to rename "q" and "e" into "A" and "B", what is the most elegant code to do this?
Obviously, I can do a loop:
oldnames = c("q","e")
newnames = c("A","B")
for(i in 1:2) names(x)[names(x) == oldnames[i]] = newnames[i]
But I wonder if there is a better way? Maybe using some of the packages? (plyr::rename etc.)
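For reference, plyr::rename takes a named vector of c(oldname = "newname") pairs, so presumably something like this sketch would work:
library(plyr)
x <- plyr::rename(x, c(q = "A", e = "B"))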
With dplyr you would do:
library(dplyr)
df = data.frame(q = 1, w = 2, e = 3)
df %>% rename(A = q, B = e)
# A w B
#1 1 2 3
Or if you want to use vectors, as suggested by @Jelena-bioinf:
library(dplyr)
df = data.frame(q = 1, w = 2, e = 3)
oldnames = c("q","e")
newnames = c("A","B")
df %>% rename_at(vars(oldnames), ~ newnames)
# A w B
#1 1 2 3
L. D. Nicolas May suggested a change, given that rename_at has been superseded by rename_with:
df %>%
rename_with(~ newnames[which(oldnames == .x)], .cols = oldnames)
# A w B
#1 1 2 3
setnames from the data.table package will work on data.frames or data.tables:
library(data.table)
d <- data.frame(a=1:2,b=2:3,d=4:5)
setnames(d, old = c('a','d'), new = c('anew','dnew'))
d
# anew b dnew
# 1 1 2 4
# 2 2 3 5
Note that the changes are made by reference, so there is no copying (even for data.frames!).
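If you need to keep the original untouched, take an explicit copy first (a minimal sketch using data.table::copy()):
library(data.table)
d2 <- copy(d) # deep copy, so the rename below leaves d alone
setnames(d2, old = c('a','d'), new = c('anew','dnew'))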
Another solution for data frames that are not too large (building on @thelatemail's answer):
x <- data.frame(q=1,w=2,e=3)
> x
q w e
1 1 2 3
colnames(x) <- c("A","w","B")
> x
A w B
1 1 2 3
Alternatively, you can also use:
names(x) <- c("C","w","D")
> x
C w D
1 1 2 3
Furthermore, you can also rename a subset of the column names:
names(x)[2:3] <- c("E","F")
> x
C E F
1 1 2 3
Here is the most efficient way I have found to rename multiple columns using a combination of purrr::set_names() and a few stringr operations.
library(tidyverse)
# Make a tibble with bad names
data <- tibble(
`Bad NameS 1` = letters[1:10],
`bAd NameS 2` = rnorm(10)
)
data
# A tibble: 10 x 2
`Bad NameS 1` `bAd NameS 2`
<chr> <dbl>
1 a -0.840
2 b -1.56
3 c -0.625
4 d 0.506
5 e -1.52
6 f -0.212
7 g -1.50
8 h -1.53
9 i 0.420
10 j 0.957
# Use purrr::set_names() with an anonymous function of stringr operations
data %>%
set_names(~ str_to_lower(.) %>%
str_replace_all(" ", "_") %>%
str_replace_all("bad", "good"))
# A tibble: 10 x 2
good_names_1 good_names_2
<chr> <dbl>
1 a -0.840
2 b -1.56
3 c -0.625
4 d 0.506
5 e -1.52
6 f -0.212
7 g -1.50
8 h -1.53
9 i 0.420
10 j 0.957
Update dplyr 1.0.0
The newest dplyr version became more flexible by adding rename_with(), where _with refers to a function taken as input. The trick is to reformulate the character vector newnames into a formula (with ~), so it is equivalent to function(x) newnames.
In my subjective opinion, that is the most elegant dplyr expression.
Update: thanks to #desval, the oldnames vector must be wrapped by all_of to include all its elements:
# shortest & most elegant expression
df %>% rename_with(~ newnames, all_of(oldnames))
A w B
1 1 2 3
Side note:
If you reverse the order, the argument .fn must be named, because .fn is expected before the .cols argument:
df %>% rename_with(oldnames, .fn = ~ newnames)
A w B
1 1 2 3
or specify the argument .cols:
df %>% rename_with(.cols = oldnames, ~ newnames)
A w B
1 1 2 3
So I recently ran into this myself, if you're not sure if the columns exist and only want to rename those that do:
existing <- match(oldNames,names(x))
names(x)[na.omit(existing)] <- newNames[which(!is.na(existing))]
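A quick usage sketch (made-up data where "e" is absent, so only "q" is renamed):
x <- data.frame(q = 1, w = 2)
oldNames <- c("q","e")
newNames <- c("A","B")
existing <- match(oldNames, names(x))
names(x)[na.omit(existing)] <- newNames[which(!is.na(existing))]
x
#   A w
# 1 1 2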
Building on @user3114046's answer:
x <- data.frame(q=1,w=2,e=3)
x
# q w e
#1 1 2 3
names(x)[match(oldnames, names(x))] <- newnames
x
# A w B
#1 1 2 3
This won't be reliant on a specific ordering of columns in the x dataset.
You can use a named vector. Below are two options (with base R and dplyr).
base R, via subsetting:
x = data.frame(q = 1, w = 2, e = 3)
rename_vec <- c(q = "A", e = "B")
## vector of same length as names(x) which returns NA if there is no match to names(x)
which_rename <- rename_vec[names(x)]
## simple ifelse where names(x) will be renamed for every non-NA
names(x) <- ifelse(is.na(which_rename), names(x), which_rename)
x
#> A w B
#> 1 1 2 3
Or a dplyr option with !!!:
library(dplyr)
rename_vec <- c(A = "q", B = "e") # NB: the names are the other way round compared to the base R way!
x %>% rename(!!!rename_vec)
#> A w B
#> 1 1 2 3
The latter works because the 'big-bang' operator !!! forces evaluation of a list or a vector.
?`!!`
!!! forces-splice a list of objects. The elements of the list are
spliced in place, meaning that they each become one single argument.
names(x)[names(x) %in% c("q","e")] <- c("A","B")
Note this assumes "q" and "e" occur in that order in names(x); %in% keeps the columns' own order, so the replacements must be listed accordingly.
This would change all the occurrences of those letters in all names:
names(x) <- gsub("q", "A", gsub("e", "B", names(x) ) )
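If you only want whole-name matches, anchoring the patterns avoids that (a sketch):
names(x) <- gsub("^q$", "A", gsub("^e$", "B", names(x)))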
There are a few answers mentioning the functions dplyr::rename_with and rlang::set_names already, but they are separate. This answer illustrates the differences between the two and the use of functions and formulas to rename columns.
rename_with from the dplyr package can use either a function or a formula
to rename a selection of columns given as the .cols argument. For example, passing the function name toupper:
library(dplyr)
rename_with(head(iris), toupper, starts_with("Petal"))
Is equivalent to passing the formula ~ toupper(.x):
rename_with(head(iris), ~ toupper(.x), starts_with("Petal"))
When renaming all columns, you can also use set_names from the rlang package. To make a different example, let's use paste0 as the renaming function. paste0 takes two arguments; as a result, there are different ways to pass the second argument depending on whether we use a function or a formula.
rlang::set_names(head(iris), paste0, "_hi")
rlang::set_names(head(iris), ~ paste0(.x, "_hi"))
The same can be achieved with rename_with by passing the data frame as first
argument .data, the function as second argument .fn, all columns as third
argument .cols=everything() and the function parameters as the fourth
argument .... Alternatively you can place the second, third and fourth
arguments in a formula given as the second argument.
rename_with(head(iris), paste0, everything(), "_hi")
rename_with(head(iris), ~ paste0(.x, "_hi"))
rename_with only works with data frames. set_names is more generic and can
also perform vector renaming
rlang::set_names(1:4, c("a", "b", "c", "d"))
If the table contains two columns with the same base name (e.g. oldname.x and oldname.y after a join), give each a distinct new name in one call:
rename(df, newname_x = oldname.x, newname_y = oldname.y)
You can get the name set, save it as a list, and then do your bulk renaming on the string. A good example of this is when you are doing a long to wide transition on a dataset:
labWide
#      Lab1    Lab10    Lab11    Lab12    Lab13    Lab14    Lab15    Lab16
# 1 35.75366 22.79493 30.32075 34.25637 30.66477 32.04059 24.46663 22.53063
nameVec <- names(labWide)
nameVec <- gsub("Lab","LabLat",nameVec)
names(labWide) <- nameVec
"LabLat1" "LabLat10" "LabLat11" "LabLat12" "LabLat13" "LabLat14""LabLat15" "LabLat16" "
Side note: if you want to prepend one string to all of the column names, you can just use this simple code:
colnames(df) <- paste("renamed_",colnames(df),sep="")
Lots of sort-of answers, so I just wrote the function so you can copy/paste.
rename <- function(x, old_names, new_names) {
stopifnot(length(old_names) == length(new_names))
# pull out the names that are actually in x
old_nms <- old_names[old_names %in% names(x)]
new_nms <- new_names[old_names %in% names(x)]
# call out the column names that don't exist
not_nms <- setdiff(old_names, old_nms)
if(length(not_nms) > 0) {
msg <- paste(paste(not_nms, collapse = ", "),
"are not columns in the dataframe, so won't be renamed.")
warning(msg)
}
# rename
names(x)[names(x) %in% old_nms] <- new_nms
x
}
x = data.frame(q = 1, w = 2, e = 3)
rename(x, c("q", "e"), c("Q", "E"))
Q w E
1 1 2 3
If one row of the data contains the names you want to change all columns to, you can do
names(data) <- data[row,]
where data is your data frame and row is the row number containing the new values.
Then you can remove the row containing the names with
data <- data[-row,]
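A minimal self-contained sketch (assuming the names sit in row 1):
data <- data.frame(a = c("A","1","2"), b = c("B","3","4"), stringsAsFactors = FALSE)
row <- 1
names(data) <- data[row, ] # promote that row to column names
data <- data[-row, ]       # drop the now-redundant row
data
#   A B
# 2 1 3
# 3 2 4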
Here is a function that renames every column that appears in oldNames and silently skips the ones that don't exist, so it won't error:
rename <-function(x){
oldNames = c("a","b","c")
newNames = c("d","e","f")
existing <- match(oldNames,names(x))
names(x)[na.omit(existing)] <- newNames[which(!is.na(existing))]
return(x)
}
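A usage sketch (only "a" and "b" exist, so "c" is silently skipped):
x <- data.frame(a = 1, b = 2, z = 3)
rename(x)
#   d e z
# 1 1 2 3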
Many good answers above use specialized packages. Here is a simple function; note that transpose() comes from data.table (everything else is base R):
library(data.table) # for transpose()
df.rename.cols <- function(df, col2.list) {
# col2.list holds c(old, new) pairs; pairs are assumed to be listed in column order
tlist <- transpose(col2.list)
names(df)[which(names(df) %in% tlist[[1]])] <- tlist[[2]]
df
}
Here is an example:
df1 <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6), D = c(7, 8))
col.list <- list(c("A", "NewA"), c("C", "NewC"))
df.rename.cols(df1, col.list)
NewA B NewC D
1 1 3 5 7
2 2 4 6 8
I recently built off of @agile bean's answer (using rename_with, formerly rename_at) to build a function which changes column names if they exist in the data frame, so that one can make the column names of heterogeneous data frames match each other when applicable.
The looping can surely be improved, but I figured I'd share it for posterity.
create example data frame:
x= structure(list(observation_date = structure(c(18526L, 18784L,
17601L), class = c("IDate", "Date")), year = c(2020L, 2021L,
2018L)), sf_column = "geometry", agr = structure(c(id = NA_integer_,
common_name = NA_integer_, scientific_name = NA_integer_, observation_count = NA_integer_,
country = NA_integer_, country_code = NA_integer_, state = NA_integer_,
state_code = NA_integer_, county = NA_integer_, county_code = NA_integer_,
observation_date = NA_integer_, time_observations_started = NA_integer_,
observer_id = NA_integer_, sampling_event_identifier = NA_integer_,
protocol_type = NA_integer_, protocol_code = NA_integer_, duration_minutes = NA_integer_,
effort_distance_km = NA_integer_, effort_area_ha = NA_integer_,
number_observers = NA_integer_, all_species_reported = NA_integer_,
group_identifier = NA_integer_, year = NA_integer_, checklist_id = NA_integer_,
yday = NA_integer_), class = "factor", .Label = c("constant",
"aggregate", "identity")), row.names = c("3", "3.1", "3.2"), class = "data.frame")
function
match_col_names <- function(x){
col_names <- list(date = c("observation_date", "date"),
C = c("observation_count", "count","routetotal"),
yday = c("dayofyear"),
latitude = c("lat"),
longitude = c("lon","long")
)
for(i in seq_along(col_names)){
newname=names(col_names)[i]
oldnames=col_names[[i]]
toreplace = names(x)[which(names(x) %in% oldnames)]
x <- x %>%
rename_with(~newname, toreplace)
}
return(x)
}
apply function
x <- match_col_names(x)
For execution time, I would suggest using the data.table structure:
> df = data.table(x = 1:10, y = 3:12, z = 4:13)
> oldnames = c("x","y","z")
> newnames = c("X","Y","Z")
> library(microbenchmark)
> library(data.table)
> library(dplyr)
> microbenchmark(dplyr_1 = df %>% rename_at(vars(oldnames), ~ newnames) ,
+ dplyr_2 = df %>% rename(X=x,Y=y,Z=z) ,
+ data_tabl1= setnames(copy(df), old = c("x","y","z") , new = c("X","Y","Z")),
+ times = 100)
Unit: microseconds
expr min lq mean median uq max neval
dplyr_1 5760.3 6523.00 7092.538 6864.35 7210.45 17935.9 100
dplyr_2 2536.4 2788.40 3078.609 3010.65 3282.05 4689.8 100
data_tabl1 170.0 218.45 368.261 243.85 274.40 12351.7 100
I usually have to perform equivalent calculations on a series of variables/columns that can be identified by their suffix (ranging, let's say from _a to _i) and save the result in new variables/columns. The calculations are equivalent, but vary between the variables used in the calculations. These again can be identified by the same suffix (_a to _i). So what I basically want to achieve is the following:
newvar_a = (oldvar1_a + oldvar2_a) - z
...
newvar_i = (oldvar1_i + oldvar2_i) - z
This is the furthest I got:
mutate(across(c(oldvar1_a:oldvar1_i), ~ . - z, .names = "{col}_new"))
Thus, I'm able to "loop" over oldvar1_a to oldvar1_i, subtract z from them, and save the results in new columns named oldvar1_a_new to oldvar1_i_new. However, I'm not able to include oldvar2_a to oldvar2_i in the calculations, as R won't loop over them. (Additionally, I'd still need to rename the new columns.)
I found a way to achieve the result using a for-loop. However, this definitely doesn't look like the most efficient and straightforward way to do it:
for (i in letters[1:9]) {
oldvar1_x <- paste0("oldvar1_", i)
oldvar2_x <- paste0("oldvar2_", i)
newvar_x <- paste0("newvar_", i)
df <- df %>%
mutate(!!sym(newvar_x) := (!!sym(oldvar1_x) + !!sym(oldvar2_x)) - z)
}
Thus, I'd like to know whether/how to make mutate(across) loop over multiple columns that can be identified by suffixes (as in the example above)
In this case, you can use cur_data() and cur_column() to take advantage of the fact that we want to sum together columns that share a suffix and only need to swap out the number in the column name.
library(dplyr)
df <- data.frame(
oldvar1_a = 1:3,
oldvar2_a = 4:6,
oldvar1_i = 7:9,
oldvar2_i = 10:12,
z = c(1,10,20)
)
mutate(
df,
across(
starts_with("oldvar1"),
~ (.x + cur_data()[gsub("1", "2", cur_column())]) - z,
.names = "{col}_new"
)
)
#> oldvar1_a oldvar2_a oldvar1_i oldvar2_i z oldvar2_a oldvar2_i
#> 1 1 4 7 10 1 4 16
#> 2 2 5 8 11 10 -3 9
#> 3 3 6 9 12 20 -11 1
If you want to use this with case_when, just make sure to index using [[; you can read more here.
df <- data.frame(
oldvar1_a = 1:3,
oldvar2_a = 4:6,
oldvar1_i = 7:9,
oldvar2_i = 10:12,
z = c(1,2,0)
)
mutate(
df,
across(
starts_with("oldvar1"),
~ case_when(
z == 1 ~ .x,
z == 2 ~ cur_data()[[gsub("1", "2", cur_column())]],
TRUE ~ NA_integer_
),
.names = "{col}_new"
)
)
#> oldvar1_a oldvar2_a oldvar1_i oldvar2_i z oldvar1_a_new oldvar1_i_new
#> 1 1 4 7 10 1 1 7
#> 2 2 5 8 11 2 5 11
#> 3 3 6 9 12 0 NA NA
There is a fairly straightforward way to do what I believe you are attempting to do.
# first lets create data
library(dplyr)
df <- data.frame(var1_a=runif(10, min = 128, max = 131),
var2_a=runif(10, min = 128, max = 131),
var1_b=runif(10, min = 128, max = 131),
var2_b=runif(10, min = 128, max = 131),
var1_c=runif(10, min = 128, max = 131),
var2_c=runif(10, min = 128, max = 131))
# taking a wild guess at what your z is
z <- 4
# initialize a list
fnl <- list()
# iterate over all your combo, put in list
for (i in letters[1:3])
{
dc <- df %>% select(ends_with(i))
i <- dc %>% mutate(a = rowSums(dc[1:ncol(dc)]) - z)
fnl <- append(fnl, i)
}
# convert to a dataframe/tibble
final <- bind_cols(fnl)
I left the column names sloppy, assuming you had specific requirements here. You can convert this loop into a function and do the whole thing in a single step using purrr, as sketched below.
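A sketch of that purrr/dplyr version (the sum_a/sum_b/sum_c column names are my choice, not from the original; assumes the df and z defined above):
library(purrr)
final <- map_dfc(letters[1:3], function(i) {
  dc <- df %>% select(ends_with(i))
  dc %>% mutate("sum_{i}" := rowSums(dc) - z) # glue-style name on the left of :=
})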
I frequently need to recode some (not all!) values in a data frame column based off of a look-up table. I'm not satisfied by the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I'd want to make sure I'm not duplicating something standard that's already out there.
## Toy example
data = data.frame(
id = 1:7,
x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
old = c("A", "D", "."),
new = c("a", "d", "!")
)
## desired result
# id x
# 1 1 a
# 2 2 a
# 3 3 B
# 4 4 C
# 5 5 d
# 6 6 AA
# 7 7 !
I can do it with a join, coalesce, unselect as below, but this isn't as clear as I'd like: too many steps.
## This works, but is more steps than I want
library(dplyr)
data %>%
left_join(lookup, by = c("x" = "old")) %>%
mutate(x = coalesce(new, x)) %>%
select(-new)
It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer the lookup as a data frame, but I'm okay with the named-vector solution. My concern here is that recode is in the Questioning lifecycle phase, so I'm worried that this method isn't stable.
lookup_v = pull(lookup, new) %>% setNames(lookup$old)
data %>%
mutate(x = recode(x, !!!lookup_v))
It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose forcats::fct_recode is a stable version of recode, but I don't want a factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).
I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so doesn't yet meet my stability requirement.)
Summary of my requirements:
A single data column is updated based off of a lookup data frame (preferably) or named vector (allowable)
Not all values in the data are included in the lookup; the ones that are not present are not modified
Must work on character class input. Working more generally is a nice-to-have.
No dependencies outside of base R and tidyverse packages (though I'd also be interested in seeing a data.table solution)
No functions used that are in lifecycle phases like superseded or questioning. Please note any experimental lifecycle functions, as they have future potential.
Concise, clear code
I don't need extreme optimization, but nothing wildly inefficient (like regex when it's not needed)
A direct data.table solution, without %in%.
Depending on the length of the lookup / data tables, adding keys could improve performance substantially, but this isn't the case on this simple example.
library(data.table)
setDT(data)
setDT(lookup)
## If needed
# setkey(data,x)
# setkey(lookup,old)
data[lookup, x:=new, on=.(x=old)]
data
id x
1: 1 a
2: 2 a
3: 3 B
4: 4 C
5: 5 d
6: 6 AA
7: 7 !
Benchmarking
Expanding the original dataset to 10M rows, 15 runs using microbenchmark gave the following results on my computer:
Note that forcats::fct_recode and dplyr::recode solutions mentioned by the OP have also been included. Neither works with the updated data because the named vector that resolves to . = ! will throw an error, which is why results are tested on the original dataset.
data = data.frame(
id = 1:5,
x = c("A", "A", "B", "C", "D")
)
lookup = data.frame(
old = c("A", "D"),
new = c("a", "d")
)
set.seed(1)
data <- data[sample(1:5, 1E7, replace = T),]
dt_lookup <- data.table::copy(lookup)
dplyr_coalesce <- function(){
library(dplyr)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = coalesce(lookupV[ x ], x))
}
datatable_in <- function(){
library(data.table)
lookupV <- setNames(lookup$new, lookup$old)
setDT(dt_data)
dt_data[ x %in% names(lookupV), x := lookupV[ x ] ]
}
datatable <- function(){
library(data.table)
setDT(dt_data)
setDT(dt_lookup)
## If needed
# setkey(data,x)
# setkey(lookup,old)
dt_data[dt_lookup, x:=new, on =.(x=old)]
}
purrr_modify_if <- function(){
library(dplyr)
library(purrr)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))
}
stringr_str_replace_all_update <- function(){
library(dplyr)
library(stringr)
lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))
data %>%
dplyr::mutate(x = str_replace_all(x, lookupV))
}
base_named_vector <- function(){
lookupV <- c(with(lookup, setNames(new, old)), rlang::set_names(setdiff(unique(data$x), lookup$old)))
lookupV[data$x]
}
base_ifelse <- function(){
lookupV <- setNames(lookup$new, lookup$old)
with(data, ifelse(x %in% lookup$old, lookupV[x], x)) # use lookupV[x]; recycling lookup$new here would mismatch values
}
plyr_mapvalues <- function(){
library(plyr)
data %>%
dplyr::mutate(x = plyr::mapvalues(x, lookup$old, lookup$new, warn_missing = F))
}
base_match <- function(){
tochange <- match(data$x, lookup$old, nomatch = 0)
data$x[tochange > 0] <- lookup$new[tochange]
}
base_local_safe_lookup <- function(){
lv <- structure(lookup$new, names = lookup$old)
safe_lookup <- function(val) {
new_val <- lv[val]
unname(ifelse(is.na(new_val), val, new_val))
}
safe_lookup(data$x)
}
dplyr_recode <- function(){
library(dplyr)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = recode(x, !!!lookupV))
}
base_for <- function(){
for (i in seq_len(nrow(lookup))) {
data$x[data$x == lookup$old[i]] = lookup$new[i]
}
}
datatable_for <- function(){
library(data.table)
setDT(dt_data)
for (i in seq_len(nrow(lookup))) {
dt_data[x == lookup$old[i], x := lookup$new[i]]
}
}
forcats_fct_recode <- function(){
library(dplyr)
library(forcats)
lookupV <- setNames(lookup$new, lookup$old)
data %>%
dplyr::mutate(x = as.character(fct_recode(x, !!!lookupV)))
}
datatable_set <- function(){
library(data.table)
setDT(dt_data)
tochange <- dt_data[, chmatch(x, lookup$old, nomatch = 0)]
set(dt_data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])
}
library(microbenchmark)
bench <- microbenchmark(dplyr_coalesce(),
datatable(),
datatable_in(),
datatable_for(),
base_for(),
purrr_modify_if(),
stringr_str_replace_all_update(),
base_named_vector(),
base_ifelse(),
plyr_mapvalues(),
base_match(),
base_local_safe_lookup(),
dplyr_recode(),
forcats_fct_recode(),
datatable_set(),
times = 15L,
setup = dt_data <- data.table::copy(data))
bench$expr <- forcats::fct_rev(forcats::fct_reorder(bench$expr, bench$time, mean))
ggplot2::autoplot(bench)
Thanks to @Waldi and @nicola for advice implementing the data.table solutions in the benchmark.
Combination of a named vector and coalesce:
# make lookup vector
lookupV <- setNames(lookup$new, lookup$old)
data %>%
mutate(x = coalesce(lookupV[ x ], x))
# id x
# 1 1 a
# 2 2 a
# 3 3 B
# 4 4 C
# 5 5 d
Or data.table:
library(data.table)
setDT(data)
data[ x %in% names(lookupV), x := lookupV[ x ] ]
This post might have a better solution for data.table - "update on merge":
R data table: update join
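The update join referenced there is essentially the same one-liner as in the data.table answer above (a sketch):
library(data.table)
setDT(data)
data[lookup, on = .(x = old), x := i.new] # i.new refers to the new column from lookup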
A base R option using %in% and match - thanks to @LMc & @nicola
tochange <- match(data$x, lookup$old, nomatch = 0)
data$x[tochange > 0] <- lookup$new[tochange]
One more data.table option using set() and chmatch
library(data.table)
setDT(data)
tochange <- data[, chmatch(x, lookup$old, nomatch = 0)]
set(data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])
Result
data
# id x
#1 1 a
#2 2 a
#3 3 B
#4 4 C
#5 5 d
#6 6 AA
#7 7 !
modify_if
You could use purrr::modify_if to only apply the named vector to values that exist in it. Though not a specified requirement, it has the benefit of the .else argument, which allows you to apply a different function to values not in your lookup.
I also wanted to include the use of tibble::deframe here to create the named vector. It is slower than setNames, though.
lookupV <- deframe(lookup)
data %>%
mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))
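For example, .else could apply a different function to the values that are not in the lookup (a sketch; lowercasing is just for illustration):
data %>%
  mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x], .else = tolower))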
str_replace_all
Alternatively, you could use stringr::str_replace_all, which can take a named vector for the replacement argument.
data %>%
mutate(x = str_replace_all(x, lookupV))
Update
To accommodate the change to your edited example, the named vector used in str_replace_all needs to be modified: the entire literal string must be matched, so that "A" is not substituted inside "AA" and "." does not replace everything:
lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))
data %>%
mutate(x = str_replace_all(x, lookupV))
left_join
Using dplyr::left_join, this is very similar to the OP's solution, but uses the .keep argument of mutate so it has fewer steps. This argument is currently in the experimental lifecycle, so it is not included in the benchmark (though it lands around the middle of the posted solutions).
left_join(data, lookup, by = c("x" = "old")) %>%
mutate(x = coalesce(new, x) , .keep = "unused")
Base R
Named Vector
Create a substitution value for every unique value in your dataframe.
lookupV <- c(with(lookup, setNames(new, old)), setNames(nm = setdiff(unique(data$x), lookup$old)))
data$x <- lookupV[data$x]
ifelse
with(data, ifelse(x %in% lookup$old, lookupV[x], x))
Another clear option is a for-loop with subsetting over the rows of the lookup table. This will almost always be quicker with data.table because of auto indexing, or if you set the key (i.e., ?data.table::setkey()) ahead of time. It will, of course, get slower as the lookup table gets longer; an update-join would likely be preferred for a long lookup table.
Base R:
for (i in seq_len(nrow(lookup))) {
data$x[data$x == lookup$old[i]] <- lookup$new[i]
}
data$x
# [1] "a" "a" "B" "C" "d" "AA" "!"
Or the same logic with data.table:
library(data.table)
setDT(data)
for (i in seq_len(nrow(lookup))) {
data[x == lookup$old[i], x := lookup$new[i]]
}
data$x
# [1] "a" "a" "B" "C" "d" "AA" "!"
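For a long lookup table, the update-join mentioned above replaces the whole loop (a sketch, matching the accepted data.table answer):
data[lookup, on = .(x = old), x := i.new]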
Data:
data = data.frame(
id = 1:7,
x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
old = c("A", "D", "."),
new = c("a", "d", "!")
)
Another base solution, with a lookup vector:
## Toy example
data = data.frame(
id = 1:5,
x = c("A", "A", "B", "C", "D"),
stringsAsFactors = F
)
lookup = data.frame(
old = c("A", "D"),
new = c("a", "d"),
stringsAsFactors = F
)
lv <- structure(lookup$new, names = lookup$old)
safe_lookup <- function(val) {
new_val <- lv[val]
unname(ifelse(is.na(new_val), val, new_val))
}
data$x <- safe_lookup(data$x)
A dplyr + plyr solution that ticks all your bullet points (if you consider plyr part of the tidyverse):
data <- data %>%
dplyr::mutate(
x = plyr::mapvalues(x, lookup$old, lookup$new) #Can add , F to remove warnings
)
I basically share the same problem. Although dplyr::recode is in the "questioning" lifecycle stage, I don't expect it to become deprecated. At some point it might be superseded, but even then it should still be usable. Therefore I'm using a wrapper around dplyr::recode which allows the use of named vectors and/or two vectors (which could be a lookup table).
library(dplyr)
library(rlang)
recode2 <- function(x, new, old = NULL, .default = NULL, .missing = NULL) {
if (!rlang::is_named(new) && !is.null(old)) {
new <- setNames(new, old)
}
do.call(dplyr::recode,
c(.x = list(x),
.default = list(.default),
.missing = list(.missing),
as.list(new)))
}
data = data.frame(
id = 1:7,
x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
old = c("A", "D", "."),
new = c("a", "d", "!")
)
# two vectors new / old
data %>%
mutate(x = recode2(x, lookup$new, lookup$old))
#> id x
#> 1 1 a
#> 2 2 a
#> 3 3 B
#> 4 4 C
#> 5 5 d
#> 6 6 AA
#> 7 7 !
# named vector
data %>%
mutate(x = recode2(x, c("A" = "a",
"D" = "d",
"." = "!")))
#> id x
#> 1 1 a
#> 2 2 a
#> 3 3 B
#> 4 4 C
#> 5 5 d
#> 6 6 AA
#> 7 7 !
Created on 2021-04-21 by the reprex package (v0.3.0)
I'm trying to check the "pin" numbers of cases with missing data for each variable of interest in my dataset.
Here are some fake data:
c <- data.frame(pin = c(1, 2, 3, 4), type = c(1, 1, 2, 2), v1 = c(1, NA, NA,
NA), v2 = c(NA, NA, 1, 1))
I wrote a function "m.pin" to do this:
m.pin <- function(x, data = "c", return = "$pin") {
sect <- gsub("^.*\\[", "\\[", deparse(substitute(x)))
vect <- eval(parse(text = paste(data, return, sect, sep = "")))
return(vect[is.na(x)])
}
And I use it like so:
m.pin(c$v1[c$type == 1])
[1] 2
I wrote a function to apply "m.pin" over a list of variables to only return pins with missing data:
return.m.pin <- function(x, fun = m.pin) {
val.list <- lapply(x, fun)
condition <- lapply(val.list, function(x) length(x) > 0)
val.list[unlist(condition)]
}
But when I apply it, I get this error:
l <- lst(c$v1[c$type == 1], c$v2[c$type == 2])
return.m.pin(l)
Error in parse(text = paste(data, return, sect, sep = "")) :
<text>:1:9: unexpected ']'
1: c$pin[i]]
^
How can I rewrite my function(s) to avoid this issue?
Many thanks!
Please see Gregor's comment for the most critical issues with your code (to add: don't use return as a variable name as it is the name of a base R function).
It's not clear to me why you want to define a specific function m.pin, nor what you ultimately are trying to do, but I am assuming this is a critical design component.
Rewriting m.pin as
m.pin <- function(df, type, vcol) which(df[, "type"] == type & is.na(df[, vcol]))
we get
m.pin(df, 1, "v1")
#[1] 2
Or to identify rows with NA in "v1" for all types
lapply(unique(df$type), function(x) m.pin(df, x, "v1"))
#[[1]]
#[1] 2
#
#[[2]]
#[1] 3 4
Update
In response to Gregor's comment, perhaps this is what you're after?
by(df, df$type, function(x)
list(v1 = x$pin[which(is.na(x$v1))], v2 = x$pin[which(is.na(x$v2))]))
# df$type: 1
# $v1
# [1] 2
#
# $v2
# [1] 1 2
#
# ------------------------------------------------------------
# df$type: 2
# $v1
# [1] 3 4
#
# $v2
# integer(0)
This returns a list of the pin numbers for every type and NA entries in v1/v2.
Sample data
df <- data.frame(
pin = c(1, 2, 3, 4),
type = c(1, 1, 2, 2),
v1 = c(1, NA, NA, NA),
v2 = c(NA, NA, 1, 1))
I would suggest rewriting like this (if this approach is to be taken at all). I call your data d because c is already the name of an extremely common function.
# string column names, pass in the data frame as an object
# means no need for eval, parse, substitute, etc.
foo = function(data, na_col, return_col = "pin", filter_col, filter_val) {
if(! missing(filter_col) & ! missing(filter_val)) {
data = data[data[, filter_col] == filter_val, ]
}
data[is.na(data[, na_col]), return_col]
}
# working on the whole data frame
foo(d, na_col = "v1", return_col = "pin")
# [1] 2 3 4
# passing in a subset of the data
foo(d[d$type == 1, ], "v1", "pin")
# [1] 2
# using function arguments to subset the data
foo(d, "v1", "pin", filter_col = "type", filter_val = 1)
# [1] 2
# calling it with changing arguments:
# you could use `Map` or `mapply` to be fancy, but this for loop is nice and clear
inputs = data.frame(na_col = c("v1", "v2"), filter_val = c(1, 2), stringsAsFactors = FALSE)
result = list()
for (i in 1:nrow(inputs)) {
result[[i]] = foo(d, na_col = inputs$na_col[i], return_col = "pin",
filter_col = "type", filter_val = inputs$filter_val[i])
}
result
# [[1]]
# [1] 2
#
# [[2]]
# numeric(0)
A different approach I would suggest is melting your data into a long format, and simply taking a subset of the NA values, hence getting all combinations of type and the v* columns that have NA values at once. Do this once, and no function is needed to look up individual combinations.
d_long = reshape2::melt(d, id.vars = c("pin", "type"))
library(dplyr)
d_long %>% filter(is.na(value)) %>%
arrange(variable, type)
# pin type variable value
# 1 2 1 v1 NA
# 2 3 2 v1 NA
# 3 4 2 v1 NA
# 4 1 1 v2 NA
# 5 2 1 v2 NA
I have a data frame with the following variables:
df <- data.frame(ID = seq(1:5),
Price.A = c(10,12,14,16,18),
Price.B = c(6,7,9,8,5),
Price.C = c(27,26,25,24,23),
Choice = c("A", "A", "B", "B", "C"))
I want to create a variable called Expenditure, which picks the value from Price.A, Price.B or Price.C depending on the value of the variable Choice.
I tried to create it with the following code:
df$Expenditure <- with(df, get(paste("Price.", Choice, sep ="")))
However, that returns the value of Price.A for all observations.
In my real application, instead of A, B and C, I have hundreds of names, so an ifelse command is not feasible.
Does anyone know how to do this?
It would probably make more sense to reshape your data; currently it is not in a "tidy" format.
library(dplyr)
library(tidyr)
df %>% gather(Price, Expenditure, -ID, -Choice) %>%
filter(Price == paste0("Price.", Choice)) %>%
select(-Price)
Otherwise you could do matrix-indexing of a matrix
cols <- grep("Price", names(df), value=T)
mm <- as.matrix(df[, cols])
colidx <- match(paste0("Price.", df$Choice), cols)
df$Expenditure <- mm[cbind(1:length(colidx), colidx)]
df$Expenditure[df$Choice=="A"] <- df$Price.A[df$Choice=="A"]
df$Expenditure[df$Choice=="B"] <- df$Price.B[df$Choice=="B"]
df$Expenditure[df$Choice=="C"] <- df$Price.C[df$Choice=="C"]
Here's how to scale it up with a loop:
df$Expenditure <- NA
for(i in unique(df$Choice)){
j <- paste0("Price.",i)
df$Expenditure[df$Choice==i] <- df[df$Choice==i,colnames(df) == j]
}
ID Price.A Price.B Price.C Choice Expenditure
1 1 10 6 27 A 10
2 2 12 7 26 A 12
3 3 14 9 25 B 9
4 4 16 8 24 B 8
5 5 18 5 23 C 23
You could easily wrap this into a function and use apply if you prefer.
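A sketch of such a wrapper (the helper name pick_price is mine, not from the answer):
pick_price <- function(df, choice_col = "Choice", prefix = "Price.") {
  vapply(seq_len(nrow(df)), function(i) {
    df[i, paste0(prefix, df[[choice_col]][i])] # pick the matching price column for row i
  }, numeric(1))
}
df$Expenditure <- pick_price(df)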
There are also lots of overly complicated ways to do this, though I think it's terrible practice to use some 3rd-party package when base R does a wonderful job. Here's one:
df <- data.frame(ID = seq(1:5),
PriceA = c(10,12,14,16,18),
PriceB = c(6,7,9,8,5),
PriceC = c(27,26,25,24,23),
Choice = c("A", "A", "B", "B", "C"))
require(sqldf)
df$Expenditure <- sqldf("SELECT
  CASE
    WHEN Choice == 'A' THEN PriceA
    WHEN Choice == 'B' THEN PriceB
    WHEN Choice == 'C' THEN PriceC
  END
  from df")[[1]] # take the single result column as a vector
Here are a couple of *apply based approaches:
df$Expenditure <- sapply(seq_along(df[[1]]), function(i) {
df[i, sprintf("Price.%s", df$Choice[i])]
})
df$Expenditure <- mapply(function(x, y) {
df[x, sprintf("Price.%s", y)]
}, row.names(df), df$Choice
)
The second one assumes your object has the default row.names of 1:nrow(df).
How about
for (i in 1:nrow(df)) {
df$Expenditure[i] <- with(df[i, ], get(paste("Price.", Choice, sep="")))
}
I have a data.frame:
df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b",
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")
with a list column (each with a character vector):
> str(df)
'data.frame': 3 obs. of 2 variables:
$ id : int 1 2 3
$ vars:List of 3
..$ : chr "a"
..$ : chr "a" "b" "c"
..$ : chr "b" "c"
I want to filter the data.frame according to setdiff(vars,remove_this)
library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))
which gets me this:
> res
id vars
1 1
2 2 b, c
3 3 b, c
But to drop the character(0) vars I have to do something like:
res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...
Actual datasets:
560K rows and 3800K rows that also have 10 more columns (to carry along).
(this is quite slow, which leads to the question...)
What is the fastest way to do this in R?
Is there a dplyr / data.table / other faster method?
How to do this with Rcpp?
UPDATE/EXTENSION:
can the column modification be done in place rather than by copying the lapply(vars,setdiff(... result?
what's the most efficient way to filter out vars == character(0) if it must be a separate step?
Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:
library(data.table)
dt = as.data.table(df) # or use setDT to convert in place
dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
# id vars newcol
#1: 2 a,b,c b,c
#2: 3 b,c b,c
You can also delete the original column (with basically 0 cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
# id vars
#1: 2 b,c
#2: 3 b,c
To carry along the other columns, I think it's easiest to simply merge back:
dt[, othercol := 5:7]
# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
# id vars i.vars othercol
#1: 2 b,c a,b,c 6
#2: 3 b,c b,c 7
Here's another way:
# prep
DT <- data.table(df)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
get_badkeys <- function(x)
unlist(sapply(1:length(x),function(n) combn(sort(x),n,paste0,collapse="_")))
# choose values to exclude
baduns <- c("a","b")
# subset
DT[!J(get_badkeys(baduns))]
This is fairly fast, but it takes up your key.
Benchmarks. Here's a made-up example:
Candidates:
hannahh <- function(df,baduns){
df %>%
mutate(vars = lapply(.$vars, setdiff, baduns)) %>%
filter(sapply(vars, length) > 0)
}
eddi <- function(df,baduns){
dt = as.data.table(df)
dt[,
unlist(vars)
, by = id][!V1 %in% baduns,
.(vars = list(V1))
, keyby = id][dt, nomatch = 0]
}
stevenb <- function(df,baduns){
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, baduns)) %>%
mutate(length = length(newcol)) %>%
ungroup() %>%
filter(length > 0)
}
frank <- function(df,baduns){
DT <- data.table(df)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
DT[!J(get_badkeys(baduns))]
}
Simulation:
nvals <- 4
nbads <- 2
maxlen <- 4
nobs <- 1e4
valset <- letters[1:nvals] # assumed definition; valset was not shown in the original
exdf <- data.table(
id=1:nobs,
vars=replicate(nobs,list(sample(valset,sample(maxlen,1))))
)
setDF(exdf)
baduns <- valset[1:nbads]
Results:
system.time(frank_res <- frank(exdf,baduns))
# user system elapsed
# 0.24 0.00 0.28
system.time(hannahh_res <- hannahh(exdf,baduns))
# 0.42 0.00 0.42
system.time(eddi_res <- eddi(exdf,baduns))
# 0.05 0.00 0.04
system.time(stevenb_res <- stevenb(exdf,baduns))
# 36.27 55.36 93.98
Checks:
identical(sort(frank_res$id),eddi_res$id) # TRUE
identical(unlist(stevenb_res$id),eddi_res$id) # TRUE
identical(unlist(hannahh_res$id),eddi_res$id) # TRUE
Discussion:
For eddi() and hannahh(), the results scarcely change with nvals, nbads and maxlen. In contrast, when the number of baduns goes over 20, frank() becomes incredibly slow (like 20+ sec); it also scales with nbads and maxlen a little worse than the other two.
Scaling up nobs, eddi()'s lead over hannahh() stays the same, at about 10x. Against frank(), it sometimes shrinks and sometimes stays the same. In the best nobs = 1e5 case for frank(), eddi() is still 3x faster.
If we switch from a valset of characters to something that frank() must coerce to a character for its by-row paste0 operation, both eddi() and hannahh() beat it as nobs grows.
Benchmarks for doing this repeatedly. This is probably obvious, but if you have to do this "many" times (...how many is hard to say), it's better to create the key column than to go through the subsetting for each set of baduns. In the simulation above, eddi() is about 5x as fast as frank(), so I'd go for the latter if I was doing this subsetting 10+ times.
maxbadlen <- 2
set_o_baduns <- replicate(10,sample(valset,size=sample(maxbadlen,1)))
system.time({
DT <- data.table(exdf)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
for (i in 1:10) DT[!J(get_badkeys(set_o_baduns[[i]]))]
})
# user system elapsed
# 0.29 0.00 0.29
system.time({
dt = as.data.table(exdf)
for (i in 1:10) dt[,
unlist(vars), by = id][!V1 %in% set_o_baduns[[i]],
.(vars = list(V1)), keyby = id][dt, nomatch = 0]
})
# user system elapsed
# 0.39 0.00 0.39
system.time({
for (i in 1:10) hannahh(exdf,set_o_baduns[[i]])
})
# user system elapsed
# 4.10 0.00 4.13
So, as expected, frank() takes very little time for additional evaluations, while eddi() and hannahh() grow linearly.
Here's another idea:
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
mutate(length = length(newcol)) %>%
ungroup()
Which gives:
# id vars newcol length
#1 1 a 0
#2 2 a, b, c b, c 2
#3 3 b, c b, c 2
You could then filter on length > 0 to keep only non-empty newcol
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
mutate(length = length(newcol)) %>%
ungroup() %>%
filter(length > 0)
Which gives:
# id vars newcol length
#1 2 a, b, c b, c 2
#2 3 b, c b, c 2
Note: As mentioned by @Arun in the comments, this approach is quite slow. You are better off with the data.table solutions.