mutate_at to replace NAs with 0 - r

mtcars1 <- mtcars %>%
mutate(blah = LETTERS[seq(1,26, length.out = nrow(.))],
blah2 = letters[seq(1,26, length.out = nrow(.))])
# sprinkle some random NA values
mtcars1$blah[c(1,3,5,10,11)] <- NA
mtcars1$blah2[c(1,2,5,15,20)] <- NA
mtcars1 %>%
mutate_at(blah:blah2, function(x) {
if_else(is.na(x), 0, x)
})
Returns:
Error in check_dot_cols(.vars, .cols) : object 'blah' not found
How can I replace NAs with 0 across multiple columns using dplyr/tidyverse approach?

We can wrap the ranger of column names in vars and also make sure that the return types are the same for if_else (or case_when) based on the documentation of ?if_else
Compared to the base ifelse(), this function is more strict. It checks that true and false are the same type. This strictness makes the output type more predictable, and makes it somewhat faster.
library(dplyr)
mtcars1 %>%
mutate_at(vars(blah:blah2), ~ if_else(is.na(.), '0', .))
Here, instead of the anonymous function call (function(x)), the tidyverse concise option with ~ is used and as 'blah', 'blah2' are character types, the return type for true is also quoted ('0')
Instead of if_else, another option specifically for replacing NA is replace_na from tidyr
library(tidyr)
mtcars1 %>%
mutate_at(vars(blah:blah2), replace_na, '0')

Related

Moving from mutate_all to across() in dplyr 1.0

With the new release of dplyr I am refactoring quite a lot of code and removing functions that are now retired or deprecated. I had a function that is as follows:
processingAggregatedLoad <- function (df) {
defined <- ls()
passed <- names(as.list(match.call())[-1])
if (any(!defined %in% passed)) {
stop(paste("Missing values for the following arguments:", paste(setdiff(defined, passed), collapse=", ")))
}
df_isolated_load <- df %>% select(matches("snsr_val")) %>% mutate(global_demand = rowSums(.)) # we get isolated load
df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind")) # we get isolated quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate_all(~ factor(.), colnames(df_isolated_load_qlty)) %>%
mutate_each(funs(as.numeric(.)), colnames(df_isolated_load_qlty)) # we convert the qlty to factors and then to numeric
df_isolated_load_qlty[df_isolated_load_qlty[]==1] <- 1 # 1 is bad
df_isolated_load_qlty[df_isolated_load_qlty[]==2] <- 0 # 0 is good we mask to calculate the global index quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate(global_quality = rowSums(.)) %>% select(global_quality)
df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
return(df)
}
Basically the function does as follows:
1.The function selects all of the values of a pivoted dataframe and aggregated them.
2.The function selects the quality indicator (character) of a pivoted dataframe.
3.I convert the characters of the quality to factors and then to numeric to get the 2 levels (1 or 2).
4.I replace the numeric values of each of the individual columns by 0 or 1 depending on the level.
5.I rowsum the individual quality as I will get 0 if all of the values are good, otherwise the global quality is bad.
The problem is that I am getting the following messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: `mutate_each_()` is deprecated as of dplyr 0.7.0.
Please use `across()` instead.
I did multiple trials as for instance:
df_isolated_load_qlty %>% mutate(across(.fns = ~ as.factor(), .names = colnames(df_isolated_load_qlty)))
Error: Problem with `mutate()` input `..1`.
x All unnamed arguments must be length 1
ℹ Input `..1` is `across(.fns = ~as.factor(), .names = colnames(df_isolated_load_qlty))`.
But I am still a bit confused about the new dplyr syntax. Would someone be able to guide me a little bit around the right way of doing this?
mutate_each has been long deprecated and was replaced with mutate_all.
mutate_all is now replaced with across
across has default .cols as everything() which means it behaves as mutate_all by default (like here) if not mentioned explicitly.
You can apply the mulitple function in the same mutate call, so here factor and as.numeric can be applied together.
Considering all this you can change your existing function to :
library(dplyr)
processingAggregatedLoad <- function (df) {
defined <- ls()
passed <- names(as.list(match.call())[-1])
if (any(!defined %in% passed)) {
stop(paste("Missing values for the following arguments:",
paste(setdiff(defined, passed), collapse=", ")))
}
df_isolated_load <- df %>%
select(matches("snsr_val")) %>%
mutate(global_demand = rowSums(.))
df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind"))
df_isolated_load_qlty <- df_isolated_load_qlty %>%
mutate(across(.fns = ~as.numeric(factor(.))))
df_isolated_load_qlty[df_isolated_load_qlty ==1] <- 1
df_isolated_load_qlty[df_isolated_load_qlty==2] <- 0
df_isolated_load_qlty <- df_isolated_load_qlty %>%
mutate(global_quality = rowSums(.)) %>%
select(global_quality)
df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
return(df)
}

How to use mutate_at or mutate_if at the same time to do multiple action on data

I would like to apply 3 functions using one code on the same variables in my data.
I have a data set and there are certain columns in my data and i want to apply these function to all of them.
1- make them all factor data
2- replace spaces in the columns with missing(convert space values to missing)
3- give missing value an explicit factor level using fct_explicit_na
i have done this in separate code lines but i want to merge all of them using dplyr mutate function. I tried the following but didnt work
cols <- c("id12", "id13", "id14", "id15")
data_new <- data_old %>%
mutate_if(cols=="", NA) %>% # replace space with NA for cols
mutate_at(cols, factor) %>% # then turn them into factors
mutate_at(cols, fct_explicit_na) # give NAs explicit factor level
)
I get the error:
Error in tbl_if_vars(.tbl, .p, .env, ..., .include_group_vars = .include_group_vars) :
length(.p) == length(tibble_vars) is not TRUE
The mutate_if step is not doing what the OP intend to do. Instead, we can do this in a single step with
library(dplyr)
data_old %>%
mutate_at(vars(cols), ~ na_if(., "") %>%
factor %>%
fct_explicit_na)
Why the OP's code didn't work?
Using a reproducible example, below code converts columns that are factor to character class
iris1 <- iris %>%
mutate_if(is.factor, as.character) %>%
mutate(Species = replace(Species, c(1, 3, 5), ""))
Now, if we do
iris1 %>%
mutate_if("Species" == "", NA)
it is comparing two strings instead of checking the column values. Also, mutate_if should return a logical vector of length 1 for selecting that column.
Instead, if we use
iris1 %>%
mutate_if(~ any(. == ""), ~ na_if(., "")) %>%
head

R: row-wise dplyr::mutate using function that takes a data frame row and returns an integer

I am trying to use pipe mutate statement using a custom function. I looked a this somewhat similar SO post but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names so select some based on the value in the exclude column and e.g. calculates a sum on the variables not specified in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I gives a single row (row 1) to FUN I get the the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do similarly in a pipe with mutate (adding the result to a variable s). These two tries do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also do not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works of cause but is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
rowwise %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate
df %>%
group_by(1:n()) %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The reason mutate doesn't work in this case, is that you are passing the whole tibble to it, so it's like calling FUN(df).
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise allows us to apply each subsequent function to support arbitrary complex operations that need to be applied to each row.
Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
unnest returns our tibble into the more standard structure.
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
exclude B C D blob s
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 B 1 3 1 fd 4
2 B 0 4 1 fs 5
3 D 0 9 0 sa 9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
Forewarning, while I like the functionality of by_row, it's not always the best approach for performance! purrr is more intuitive, but also at a rather large speed loss. See the following microbenchmark test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than the approach of using a combination of do with rowwise or group_by(1:n()) (see #konvas answer), and rather on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but also perhaps easier to work with. Different user-defined functions may change the speed order.

Correct syntax for mutate_if

I would like to replace NA values with zeros via mutate_if in dplyr. The syntax below:
set.seed(1)
mtcars[sample(1:dim(mtcars)[1], 5),
sample(1:dim(mtcars)[2], 5)] <- NA
require(dplyr)
mtcars %>%
mutate_if(is.na,0)
mtcars %>%
mutate_if(is.na, funs(. = 0))
Returns error:
Error in vapply(tbl, p, logical(1), ...) : values must be length 1,
but FUN(X[[1]]) result is length 32
What's the correct syntax for this operation?
I learned this trick from the purrr tutorial, and it also works in dplyr.
There are two ways to solve this problem:
First, define custom functions outside the pipe, and use it in mutate_if():
any_column_NA <- function(x){
any(is.na(x))
}
replace_NA_0 <- function(x){
if_else(is.na(x),0,x)
}
mtcars %>% mutate_if(any_column_NA,replace_NA_0)
Second, use the combination of ~,. or .x.( .x can be replaced with ., but not any other character or symbol):
mtcars %>% mutate_if(~ any(is.na(.x)),~ if_else(is.na(.x),0,.x))
#This also works
mtcars %>% mutate_if(~ any(is.na(.)),~ if_else(is.na(.),0,.))
In your case, you can also use mutate_all():
mtcars %>% mutate_all(~ if_else(is.na(.x),0,.x))
Using ~, we can define an anonymous function, while .x or . stands for the variable. In mutate_if() case, . or .x is each column.
The "if" in mutate_if refers to choosing columns, not rows. Eg mutate_if(data, is.numeric, ...) means to carry out a transformation on all numeric columns in your dataset.
If you want to replace all NAs with zeros in numeric columns:
data %>% mutate_if(is.numeric, funs(ifelse(is.na(.), 0, .)))
mtcars %>% mutate_if(is.numeric, replace_na, 0)
or more recent syntax
mtcars %>% mutate(across(where(is.numeric),
replace_na, 0))
We can use set from data.table
library(data.table)
setDT(mtcars)
for(j in seq_along(mtcars)){
set(mtcars, i= which(is.na(mtcars[[j]])), j = j, value = 0)
}
I always struggle with replace_na function of dplyr
replace(is.na(.),0)
this works for me for what you are trying to do.

dplyr change many data types

I have a data.frame:
dat <- data.frame(fac1 = c(1, 2),
fac2 = c(4, 5),
fac3 = c(7, 8),
dbl1 = c('1', '2'),
dbl2 = c('4', '5'),
dbl3 = c('6', '7')
)
To change data types I can use something like
l1 <- c("fac1", "fac2", "fac3")
l2 <- c("dbl1", "dbl2", "dbl3")
dat[, l1] <- lapply(dat[, l1], factor)
dat[, l2] <- lapply(dat[, l2], as.numeric)
with dplyr
dat <- dat %>% mutate(
fac1 = factor(fac1), fac2 = factor(fac2), fac3 = factor(fac3),
dbl1 = as.numeric(dbl1), dbl2 = as.numeric(dbl2), dbl3 = as.numeric(dbl3)
)
is there a more elegant (shorter) way in dplyr?
thx
Christof
Edit (as of 2021-03)
As also pointed out in Eric's answer, mutate_[at|if|all] has been superseded by a combination of mutate() and across(). For reference, I will add the respective pendants to the examples in the original answer (see below):
# convert all factor to character
dat %>% mutate(across(where(is.factor), as.character))
# apply function (change encoding) to all character columns
dat %>% mutate(across(where(is.character),
function(x){iconv(x, to = "ASCII//TRANSLIT")}))
# subsitute all NA in numeric columns
dat %>% mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0)))
Original answer
Since Nick's answer is deprecated by now and Rafael's comment is really useful, I want to add this as an Answer. If you want to change all factor columns to character use mutate_if:
dat %>% mutate_if(is.factor, as.character)
Also other functions are allowed. I for instance used iconv to change the encoding of all character columns:
dat %>% mutate_if(is.character, function(x){iconv(x, to = "ASCII//TRANSLIT")})
or to substitute all NA by 0 in numeric columns:
dat %>% mutate_if(is.numeric, function(x){ifelse(is.na(x), 0, x)})
You can use the standard evaluation version of mutate_each (which is mutate_each_) to change the column classes:
dat %>% mutate_each_(funs(factor), l1) %>% mutate_each_(funs(as.numeric), l2)
EDIT - The syntax of this answer has been deprecated, loki's updated answer is more appropriate.
ORIGINAL-
From the bottom of the ?mutate_each (at least in dplyr 0.5) it looks like that function, as in #docendo discimus's answer, will be deprecated and replaced with more flexible alternatives mutate_if, mutate_all, and mutate_at. The one most similar to what #hadley mentions in his comment is probably using mutate_at. Note the order of the arguments is reversed, compared to mutate_each, and vars() uses select() like semantics, which I interpret to mean the ?select_helpers functions.
dat %>% mutate_at(vars(starts_with("fac")),funs(factor)) %>%
mutate_at(vars(starts_with("dbl")),funs(as.numeric))
But mutate_at can take column numbers instead of a vars() argument, and after reading through this page, and looking at the alternatives, I ended up using mutate_at but with grep to capture many different kinds of column names at once (unless you always have such obvious column names!)
dat %>% mutate_at(grep("^(fac|fctr|fckr)",colnames(.)),funs(factor)) %>%
mutate_at(grep("^(dbl|num|qty)",colnames(.)),funs(as.numeric))
I was pretty excited about figuring out mutate_at + grep, because now one line can work on lots of columns.
EDIT - now I see matches() in among the select_helpers, which handles regex, so now I like this.
dat %>% mutate_at(vars(matches("fac|fctr|fckr")),funs(factor)) %>%
mutate_at(vars(matches("dbl|num|qty")),funs(as.numeric))
Another generally-related comment - if you have all your date columns with matchable names, and consistent formats, this is powerful. In my case, this turns all my YYYYMMDD columns, which were read as numbers, into dates.
mutate_at(vars(matches("_DT$")),funs(as.Date(as.character(.),format="%Y%m%d")))
Dplyr across function has superseded _if, _at, and _all. See vignette("colwise").
dat %>%
mutate(across(all_of(l1), as.factor),
across(all_of(l2), as.numeric))
It's a one-liner with mutate_at:
dat %>% mutate_at("l1", factor) %>% mutate_at("l2", as.numeric)
A more general way of achieving column type transformation is as follows:
If you want to transform all your factor columns to character columns, e.g., this can be done using one pipe:
df %>% mutate_each_( funs(as.character(.)), names( .[,sapply(., is.factor)] ))
Or mayby even more simple with convert from hablar:
library(hablar)
dat %>%
convert(fct(fac1, fac2, fac3),
num(dbl1, dbl2, dbl3))
or combines with tidyselect:
dat %>%
convert(fct(contains("fac")),
num(contains("dbl")))
For future readers, if you are ok with dplyr guessing the column types, you can convert the col types of an entire df as if you were originally reading it in with readr and col_guess() with
library(tidyverse)
df %>% type_convert()
Try this
df[,1:11] <- sapply(df[,1:11], as.character)

Resources