I'm confused about how this code should work:
foo <- factor(c("a", "b", "a", "c", "a", "a", "c", "c"))
#[1] a b a c a a c c
#Levels: a b c
factor(foo, exclude = "a")
#[1] a b a c a a c c
#Levels: a b c
Warning message:
In as.vector(exclude, typeof(x)) : NAs introduced by coercion
Shouldn't this return a factor with every "a" replaced by NA? If not, how can I achieve that?
This bug has been fixed since R-3.4.0. The following answer now only serves as a historical reference.
As I said in my comment, at the moment exclude only works for
factor(as.character(foo), exclude = "a")
rather than
factor(foo, exclude = "a")
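As a quick check of that workaround, here is a minimal sketch using only base R and the foo vector from the question. It converts the factor to character first and then confirms that "a" is dropped from the levels and turned into NA in the data.

```r
foo <- factor(c("a", "b", "a", "c", "a", "a", "c", "c"))

# Convert to character first so that exclude is compared against strings
res <- factor(as.character(foo), exclude = "a")

levels(res)        # the levels that remain: "b" "c"
which(is.na(res))  # positions that originally held "a": 1 3 5 6
```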
Note that the documentation in ?factor under R 3.3.1 is not satisfactory at all:
exclude: a vector of values to be excluded when forming the set of
levels. This should be of the same type as ‘x’, and will be
coerced if necessary.
The following calls give no warning or error, but they also have no effect:
## foo is a factor with `typeof` being "integer"
factor(foo, exclude = 1L)
factor(foo, exclude = factor("a", levels = levels(foo)))
#[1] a b a c a a c c
#Levels: a b c
Actually, the documentation appears quite contradictory, as it also reads:
The encoding of the vector happens as follows. First all the
values in ‘exclude’ are removed from ‘levels’.
so it looks like the developer really expects exclude to be a "character".
This is more likely a bug inside factor. The problem is rather evident: the following line inside factor(x, ...) makes a mess when the input vector x is of class "factor":
exclude <- as.vector(exclude, typeof(x))
as in that case typeof(x) is "integer". If exclude is a string, NA will be produced when trying to convert a string to an integer.
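The coercion can be reproduced outside of factor. The sketch below is only an illustration of the behaviour described here (the variable name excl is mine, and the warning is suppressed purely to keep the output quiet):

```r
foo <- factor(c("a", "b", "a", "c"))
typeof(foo)  # "integer": a factor is stored as integer codes

# The same conversion that factor() applied to exclude
excl <- suppressWarnings(as.vector("a", typeof(foo)))
excl         # NA

# No level matches NA, so no level is removed
levels(foo)[is.na(match(levels(foo), excl))]  # "a" "b" "c"
```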
I really have no idea why this line is inside factor. Without it, the two lines that follow do the right thing:
x <- as.character(x)
levels <- levels[is.na(match(levels, exclude))]
So, a remedy / fix is simply eliminating this line:
my_factor <- function (x = character(), levels, labels = levels, exclude = NA,
ordered = is.ordered(x), nmax = NA)
{
if (is.null(x))
x <- character()
nx <- names(x)
if (missing(levels)) {
y <- unique(x, nmax = nmax)
ind <- sort.list(y)
y <- as.character(y)
levels <- unique(y[ind])
}
force(ordered)
#exclude <- as.vector(exclude, typeof(x))
x <- as.character(x)
levels <- levels[is.na(match(levels, exclude))]
f <- match(x, levels)
if (!is.null(nx))
names(f) <- nx
nl <- length(labels)
nL <- length(levels)
if (!any(nl == c(1L, nL)))
stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
nl, nL), domain = NA)
levels(f) <- if (nl == nL)
as.character(labels)
else paste0(labels, seq_along(levels))
class(f) <- c(if (ordered) "ordered", "factor")
f
}
Let's have a test now:
my_factor(foo, exclude = "a")
#[1] <NA> b <NA> c <NA> <NA> c c
#Levels: b c
my_factor(as.character(foo), exclude = "a")
#[1] <NA> b <NA> c <NA> <NA> c c
#Levels: b c
I am working with two data.frames which use different terminology. To keep the terminology of each data.frame intact, I am currently deliberating whether it would be an idea to simply add the columns to the other data.frame.
df_a <- data.frame(
A = c("a", "b", "c"),
B = c("a", "b", "c")
)
df_b <- data.frame(
same_as_A = c("a", "b", "c"),
same_as_B = c("a", "b", "c")
)
df_a <- cbind(df_a, df_b)
df_b <- cbind(df_b, df_a)
This will, however, become problematic as soon as I start making changes to any of these columns. I was wondering whether there is instead a way, or even a trick, to refer to a column by more than one name. Obviously this does not work, but something like:
df_a <- data.frame(
A & same_as_A = c("a", "b", "c"),
B & same_as_B = c("a", "b", "c")
)
Where df_a$same_as_A is equal to df_a$A
"a" "b" "c"
You can derive your own subclass of data.frame, wrap [ and $, and handle aliases explicitly.
aliases <- function(x, ...) {
dots <- list(...)
stopifnot(!is.null(names(dots)), all(nzchar(names(dots))))
nms <- attr(x, "aliases")
attr(x, "aliases") <- c(nms[!names(nms) %in% names(dots)], dots)
if (class(x)[1] != "aliased_dataframe") {
class(x) <- c("aliased_dataframe", class(x))
}
x
}
`[.aliased_dataframe` <- function(x, i, j, ...) {
if (!inherits(x, "aliased_dataframe")) NextMethod()
if (!missing(j) && length(j)) {
aliases <- attr(x, "aliases")
ind <- j %in% names(aliases)
j[ind] <- unlist(aliases[ match(j[ind], names(aliases)) ])
}
NextMethod(object = x)
}
`$.aliased_dataframe` <- function(x, j, ...) {
if (!inherits(x, "aliased_dataframe")) NextMethod()
if (!missing(j) && length(j)) {
aliases <- attr(x, "aliases")
ind <- j %in% names(aliases)
j[ind] <- unlist(aliases[ match(j[ind], names(aliases)) ])
}
NextMethod(object = x)
}
`$<-.aliased_dataframe` <- function(x, j, ...) {
if (!inherits(x, "aliased_dataframe")) NextMethod()
if (!missing(j) && length(j)) {
aliases <- attr(x, "aliases")
ind <- j %in% names(aliases)
j[ind] <- unlist(aliases[ match(j[ind], names(aliases)) ])
}
NextMethod(object = x)
}
Demo:
df_b <- data.frame(
same_as_A = c("a", "b", "c"),
same_as_B = c("a", "b", "c")
)
df_b[, "a"]
# Error in `[.data.frame`(df_b, , "a") : undefined columns selected
df_b$a
# NULL
df_b <- aliases(df_b, a="same_as_A", b="same_as_B")
df_b[, "a"]
# [1] "a" "b" "c"
df_b$a
# [1] "a" "b" "c"
df_b$a <- c("A","B","C")
df_b
# same_as_A same_as_B
# 1 A a
# 2 B b
# 3 C c
Coincidentally, this works with tbl_df as well, but sadly not with data.table variants.
library(tibble) # or dplyr
df_b <- tibble(df_b)
df_b[, "a"]
# Error in `stop_subscript()`:
# ! Can't subset columns that don't exist.
# x Column `a` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.
df_b$a
# Warning: Unknown or uninitialised column: `a`.
# NULL
df_b <- aliases(df_b, a="same_as_A", b="same_as_B")
df_b[, "a"]
# # A tibble: 3 x 1
# same_as_A
# <chr>
# 1 a
# 2 b
# 3 c
df_b$a
# [1] "a" "b" "c"
df_b$a <- c("A","B","C")
df_b
# # A tibble: 3 x 2
# same_as_A same_as_B
# <chr> <chr>
# 1 A a
# 2 B b
# 3 C c
I should note that this accommodates explicit use of j=, as in df_b[,"a"]. The shortcut df_b["a"] technically overloads the i= argument; the base [.data.frame correctly infers your intent there, but these S3 wrappers do not, so df_b["a"] fails. Adding support is not difficult (just another conditional, perhaps starting with if (missing(j) && !missing(i) && is.character(i))), but for simplicity I'm keeping it out.
Another note, I did not overload [[, so df_b[["a"]] returns NULL. If it's really important to you, one could adapt this methodology to do that as well.
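For illustration, a minimal sketch of what a [[ method could look like. It is not part of the code above: it rebuilds the "aliases" attribute by hand so that it runs on its own, it only handles a single column name, and it uses base R's .subset2() rather than NextMethod(), so matrix-style calls such as df[[i, j]] are not supported.

```r
# Stand-alone setup mirroring the alias attribute used above
df <- data.frame(same_as_A = c("a", "b", "c"),
                 same_as_B = c("x", "y", "z"),
                 stringsAsFactors = FALSE)
attr(df, "aliases") <- list(a = "same_as_A", b = "same_as_B")
class(df) <- c("aliased_dataframe", class(df))

# Hypothetical [[ method: translate an alias to the real column name,
# then extract the column directly from the underlying list
`[[.aliased_dataframe` <- function(x, i, ...) {
  aliases <- attr(x, "aliases")
  if (is.character(i) && length(i) == 1L && i %in% names(aliases)) {
    i <- aliases[[i]]
  }
  .subset2(x, i)
}

df[["a"]]          # "a" "b" "c"
df[["same_as_B"]]  # real column names still work: "x" "y" "z"
```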
I'm working on a project where I have to apply the same transformation to multiple variables. For example
a <- a + 1
b <- b + 1
d <- d + 1
e <- e + 1
I can obviously perform the operations in sequence using
for (i in c(a, b, d, e)) i <- i + 1
However, I can't actually assign the result to each variable this way, since i is a copy of each variable, not a reference.
Is there a way to do this? Obviously, it'd be easier if the variables were merged in a data.frame or something, but that's not possible.
Usually, if you find yourself doing the same thing to multiple objects, they should be stored and thought of as a single object with sub-components. You say that storing these in a data.frame is not possible, so you can use a list instead. This allows you to use lapply/sapply to apply a function to each element of the list in one step.
a <- c(1, 2, 3)
b <- c(1, 4)
c <- 5
d <- rnorm(10)
e <- runif(5)
lstt <- list(a = a, b = b, c = c, d = d, e = e)
lstt$a
# [1] 1 2 3
lstt <- lapply(lstt, '+', 1)
lstt$a
# [1] 2 3 4
The question states that the variables to increment cannot be in a larger structure, but the comments say that this is not so after all, so we will assume they are in a list L.
L <- list(a = 1, b = 2, d = 3, e = 4) # test data
for(nm in names(L)) L[[nm]] <- L[[nm]] + 1
# or
L <- lapply(L, `+`, 1)
# or
L <- lapply(L, function(x) x + 1)
Scalars
If they are all scalars then they can be put in an ordinary vector:
v <- c(a = 1, b = 2, d = 3, e = 4)
v <- v + 1
Vectors
If they are all vectors of the same length they can be put in a data frame, or, if they are also of the same type, in a matrix, in which case we can also add 1 to it.
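A small sketch of both variants, with made-up test vectors a and b:

```r
a <- c(1, 2, 3)  # test data
b <- c(4, 5, 6)

# Same length: a data frame works
DF <- data.frame(a = a, b = b)
DF <- DF + 1

# Same length and same type: a matrix works too
m <- cbind(a = a, b = b)
m <- m + 1

DF$a       # 2 3 4
m[, "b"]   # 5 6 7
```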
Environment
If the variables do have to be free in an environment and nms is a vector of the variable names, we can iterate over the names and use them to subscript the environment env. If the names follow some pattern we may be able to use nms <- ls(pattern = "...", envir = env), or, if they are the only variables in that environment, we can use nms <- ls(env).
a <- b <- d <- e <- 1 # test data
env <- .GlobalEnv # can change this if not being done in global envir
nms <- c("a", "b", "d", "e")
for(nm in nms) env[[nm]] <- env[[nm]] + 1
a;b;d;e # check
## [1] 2
## [1] 2
## [1] 2
## [1] 2
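The pattern-based variant could look like the sketch below. The environment contents and the "^x" pattern are invented for the example; a separate environment is used so that nothing in the global environment is touched.

```r
env <- new.env()
env$x1 <- 1       # test data
env$x2 <- 10
env$other <- 100  # does not match the pattern, so it is left alone

nms <- ls(pattern = "^x", envir = env)
for (nm in nms) env[[nm]] <- env[[nm]] + 1

env$x1     # 2
env$x2     # 11
env$other  # 100
```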
In my R function below, I was wondering how I could get the length of the unique elements (which is 2) of two vectors a and b?
Here is what I tried without success:
foo <- function(...){
L <- list(...)
lengths(unique(unlist(L)))
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo(a, b) # the function returns 1 1 instead of 2 2
Use lapply() or sapply(), because your object is a list. It is also worth checking the difference between length() and lengths(): both exist but do different things. Here are two solutions, foo1 and foo2:
foo1 <- function(...){
L <- list(...)
sapply(L, function(x) length(unique(x)))
}
foo2 <- function(...){
L <- list(...)
lengths(lapply(L, unique))
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo1(a, b)
# [1] 2 2
foo2(a, b)
# [1] 2 2
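To make the difference between length() and lengths() concrete, here is a small sketch with a made-up list L. It also shows why the original attempt returned 1 1.

```r
L <- list(a = c("a", "b", "a"), b = "a")  # test data

length(L)    # 2: the number of list elements
lengths(L)   # 3 1: the length of each element

# unlist() merges everything into one character vector, so unique()
# leaves two strings and lengths() reports 1 for each of them
flat <- unique(unlist(L))
lengths(flat)               # 1 1

# Taking unique() per element keeps the list structure
lengths(lapply(L, unique))  # 2 1
```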
Here is the answer
Calling unlist() merges both vectors into a single character vector, so unique() leaves just "a" and "b", and lengths() then reports 1 for each of those two strings.
Apply unique() to each list element instead:
foo <- function(a,b){
L <- list(a,b)
lengths(lapply(L, unique)) # returns 2 2
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo(a, b)
I want to lapply two functions on a data set conditional on the value of a specific variable.
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
Something like:
df <- lapply(df, if(df$Letters=="A") first_function else second_function )
To produce:
df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4))
You can do it with dplyr and purrr. Obviously this is a basic function, but you should be able to build on it for your needs:
library(dplyr)
library(purrr)
calc <- function(y, x){
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
if(y == "A")
return(first_function(x))
return(second_function(x))
}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
df %>%
mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))
Letters Numbers
1 A 1
2 B 3
3 B 4
>(df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4)))
Letters Numbers
1 A 1
2 B 3
3 B 4
BENCHMARKING
I am not a data.table expert (feel free to add), so I did not incorporate it here. But @R Yoda is correct: although the purrr solution reads nicely, and future you will find it easier to read and extend, it is not that fast. I liked the ifelse approach, so I added case_when, which is easier to scale when dealing with multiple functions. Here are a couple of solutions:
library(dplyr)
library(purrr)
library(microbenchmark)
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
calc <- function(y, x){
if(y == "A")
return(first_function(x))
return(second_function(x))
}
df <- data.frame(Letters = rep(c("A","B","B"),1000), Numbers = 1:3)
basic <- function(){
data.frame(df$Letters, apply(df, 1, function(row) {
num <- as.numeric(row['Numbers'])
if (row['Letters'] == 'A') first_function(num) else second_function(num)
}))
}
dplyr_purrr <- function(){
df %>%
mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))
}
dplyr_case_when <- function(){
df %>%
mutate(Numbers = case_when(
Letters == "A" ~ first_function(Numbers),
TRUE ~ second_function(Numbers)))
}
map_list <- function(){
data.frame(df$Letters, map2_dbl(df$Letters, df$Numbers, ~calc(.x, .y)))
}
within_mapply <- function(){
within(df, Numbers <- mapply(Letters, Numbers,
FUN = function(x, y){
switch(x,
"A" = first_function(y),
"B" = second_function(y))
}))
}
within_ifelse <- function(){
within(df, Numbers <- ifelse(Letters == "A",
first_function(Numbers),
second_function(Numbers)))
}
within_case_when <- function(){
within(df, Numbers <- case_when(
Letters == "A" ~ first_function(Numbers),
TRUE ~ second_function(Numbers)))
}
(mbm <- microbenchmark(
basic(),
dplyr_purrr(),
dplyr_case_when(),
map_list(),
within_mapply(),
within_ifelse(),
within_case_when(),
times = 1000
))
Unit: microseconds
expr min lq mean median uq max neval cld
basic() 12816.427 24028.3375 27719.8182 26741.7770 29417.267 277756.650 1000 f
dplyr_purrr() 9682.884 17817.0475 20072.2752 19736.8445 21767.001 48344.265 1000 e
dplyr_case_when() 1098.258 2096.2080 2426.7183 2325.7470 2625.439 9039.601 1000 b
map_list() 8764.319 16873.8670 18962.8540 18586.2790 20599.000 41524.564 1000 d
within_mapply() 6718.368 12397.1440 13806.1752 13671.8120 14942.583 24958.390 1000 c
within_ifelse() 279.796 586.6675 690.1919 653.3345 737.232 8131.292 1000 a
within_case_when() 470.155 955.8990 1170.4641 1070.5655 1219.284 46736.879 1000 a
The simple way to do this with *apply would be to put the whole logic (with the conditional and the two functions) into another function and use apply with MARGIN=1 to pass the data in row by row (lapply will pass in the data by column):
apply(df, 1, function(row) {
num <- as.numeric(row['Numbers'])
if (row['Letters'] == 'A') first_function(num) else second_function(num)
})
[1] 1 3 4
The problem with this approach, as @r2evans points out in the comment below, is that when you use apply with a heterogeneous data.frame (in this case, Letters is type factor while Numbers is type integer), each row is passed to the applied function as a vector, which can only have a single type. Everything in the row is therefore coerced to the same type (in this case character). This is why it's necessary to use as.numeric(row['Numbers']) to turn Numbers back into type numeric. Depending on your data, this could be a simple fix (as above) or it could make things much more complicated and bug-prone. Either way, @akrun's solution is much better, since it preserves each variable's original data type.
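The coercion is easy to observe directly. A minimal sketch (the names row_types and col_types are only for illustration):

```r
df <- data.frame(Letters = c("A", "B", "B"), Numbers = 1:3)

# apply() converts the data.frame to a matrix first, so every row
# arrives in the function as a character vector
row_types <- apply(df, 1, function(row) typeof(row))
row_types  # "character" "character" "character"

# Column-wise access keeps the original types
col_types <- sapply(df, typeof)
col_types[["Numbers"]]  # "integer"
```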
lapply has difficulty in this case because it is column-based. You could transpose your data with t() and still use lapply if you insist. Here I provide two ways, which use mapply and ifelse:
df$Letters <- as.character(df$Letters)
# Method 1
within(df, Numbers <- mapply(Letters, Numbers, FUN = function(x, y){
switch(x, "A" = first_function(y),
"B" = second_function(y))
}))
# Method 2
within(df, Numbers <- ifelse(Letters == "A",
first_function(Numbers),
second_function(Numbers)))
Both give the same output:
# Letters Numbers
# 1 A 1
# 2 B 3
# 3 B 4
Here is a data.table variant for better performance in case of many data rows (but also showing an implicit conversion problem):
library(data.table)
setDT(df) # fast conversion from data.frame to data.table
df[ Letters == "A", Numbers := first_function(Numbers) ]
df[!(Letters == "A"), Numbers := second_function(Numbers)] # issues a warning, see below
df
# Letters Numbers
# 1: A 1
# 2: B 3
# 3: B 4
The issued warning is:
Warning message: In [.data.table(df, !(Letters == "A"),
:=(Numbers, second_function(Numbers))) : Coerced 'double' RHS to
'integer' to match the column's type; may have truncated precision.
Either change the target column ['Numbers'] to 'double' first (by
creating a new 'double' vector length 3 (nrows of entire table) and
assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g.
1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for
speed. Or, set the column type correctly up front when you create the
table and stick to it, please.
The reason is that the data.frame column Numbers is an integer
> str(df)
'data.frame': 3 obs. of 2 variables:
$ Letters: Factor w/ 2 levels "A","B": 1 2 2
$ Numbers: int 1 2 3
but the functions return a double, because the literals 0 and 1 in x + 0 and x + 1 are doubles, and adding a double to an integer gives a double:
> typeof(first_function(df$Numbers))
[1] "double"
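One way to avoid the warning is to keep the arithmetic in integers by using integer literals. A minimal sketch; the _int function names are hypothetical variants, not part of the question:

```r
x <- 1:3

typeof(x + 0)   # "double": 0 is a double literal
typeof(x + 0L)  # "integer": 0L is an integer literal

# Hypothetical integer-preserving variants of the two functions
first_function_int  <- function(x) x + 0L
second_function_int <- function(x) x + 1L

typeof(second_function_int(x))  # "integer"
```

Alternatively, the column can be converted to double up front with as.numeric(), as the warning message itself suggests.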
I know that the function relevel sets a specified level to be the first. I would like to know if there is a built-in function that sets a specified level to be the last. If not, what is an efficient way to write such a function?
The package forcats has a function that does this neatly.
f <- gl(2, 1, labels = c("b", "a"))
forcats::fct_relevel(f, "b", after = Inf)
#> [1] b a
#> Levels: a b
There is not a built-in function. You could do it like this:
lastlevel = function(f, last) {
if (!is.factor(f)) stop("f must be a factor")
orig_levels = levels(f)
if (! last %in% orig_levels) stop("last must be a level of f")
new_levels = c(setdiff(orig_levels, last), last)
factor(f, levels = new_levels)
}
x = factor(c("a", "b", "c"))
> lastlevel(x, "a")
[1] a b c
Levels: b c a
> lastlevel(x, "b")
[1] a b c
Levels: a c b
> lastlevel(x, "c")
[1] a b c
Levels: a b c
> lastlevel(x, "d")
Error in lastlevel(x, "d") : last must be a level of f
I feel a little silly because I just wrote that out, when I could have made a tiny modification to stats:::relevel.factor. A solution adapted from relevel would look like this:
lastlevel = function (f, last, ...) {
if (!is.factor(f)) stop("f must be a factor")
lev <- levels(f)
if (length(last) != 1L)
stop("'last' must be of length one")
if (is.character(last))
last <- match(last, lev)
if (is.na(last))
stop("'last' must be an existing level")
nlev <- length(lev)
if (last < 1 || last > nlev)
stop(gettextf("last = %d must be in 1L:%d", last, nlev),
domain = NA)
factor(f, levels = lev[c(seq_along(lev)[-last], last)])
}
It checks a few more inputs and also accepts a numeric index (e.g., last = 2 moves the second level to the end).