Conditional Evaluation in Dplyr - r

I have a character vector r <- c(). I want to mutate a data frame based on the length of r.
This works
iris %>% if(length(r) > 0) mutate(Test = 1) else .
This does not work when I expand to add more dplyr verbs
iris %>% if(length(r) > 0) mutate(Test = 1) else . %>% mutate(Test2 = 1)
I am only looking for a dplyr-based solution.

As there are multiple statements, wrap them inside {} and refer to the incoming data explicitly as . inside the braces
r <- c()
iris %>%
{if(length(r) > 0) {
mutate(., Test = 1)
} else .}
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
...
Testing with length(r) > 0:
r <- 5
iris %>%
{if(length(r) > 0) {
mutate(., Test = 1)
} else .}
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
...
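The braces also address the original problem of adding more verbs after the conditional step. A minimal sketch, assuming the magrittr pipe loaded by dplyr; the names res_empty and res_filled are only for illustration:

```r
library(dplyr)

# Case 1: r is empty, so the conditional step passes the data through unchanged.
r <- c()
res_empty <- iris %>%
  { if (length(r) > 0) mutate(., Test = 1) else . } %>%
  mutate(Test2 = 1)   # this verb now runs in both cases

# Case 2: r is non-empty, so Test is added before Test2.
r <- 5
res_filled <- iris %>%
  { if (length(r) > 0) mutate(., Test = 1) else . } %>%
  mutate(Test2 = 1)

names(res_empty)
names(res_filled)
```

Because the whole if/else sits inside one braced expression, the following %>% applies to its result rather than being absorbed into the else branch.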
However, this can easily be done without the if/else block: convert the logical value to a numeric index by adding 1 (indexing in R starts from 1) and use it to select from a list holding NULL and 1. If length(r) is 0, NULL is selected and thus no column is created
iris %>%
mutate(Test = list(NULL, 1)[[1 + (length(r) > 0)]])
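The trick above relies on mutate() creating no column when the value is NULL. As a hedged variation on the same idea, an if without an else branch also evaluates to NULL when its condition is FALSE, so it can be used directly inside mutate(). The names out_empty and out_filled are illustrative only:

```r
library(dplyr)

r <- c()
# if (FALSE) 1 evaluates to NULL, so no Test column is created here.
out_empty <- iris %>% mutate(Test = if (length(r) > 0) 1)

r <- "a"
# The condition holds, so Test = 1 is recycled over all rows.
out_filled <- iris %>% mutate(Test = if (length(r) > 0) 1)

ncol(out_empty)
ncol(out_filled)
```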

library(dplyr)
An alternative is to use an intermediate helper function, which can later be replaced by an anonymous function
g_if <- function(df, r){
if(length(r)) {
ans <- df %>% mutate(test = 1)
} else {
ans <- df
}
invisible(ans)
}
r <- c()
iris %>% g_if(r) %>% str
#> 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
r <- c(1)
iris %>% g_if(r) %>% str
#> 'data.frame': 150 obs. of 6 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ test : num 1 1 1 1 1 1 1 1 1 1 ...
Now we can use the same idea with an anonymous function, that is, without explicitly defining the function g_if()
r <- c()
iris %>% {
function(df, cond){
if(length(cond) > 0) {
ans <- df %>% mutate(test = 1)
} else {
ans <- df
}
ans}}(r) %>%
head
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
r <- c(1)
iris %>% {
function(df, cond){
if(length(cond) > 0) {
ans <- df %>% mutate(test = 1)
} else {
ans <- df
}
ans}}(r) %>%
head
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species test
#> 1 5.1 3.5 1.4 0.2 setosa 1
#> 2 4.9 3.0 1.4 0.2 setosa 1
#> 3 4.7 3.2 1.3 0.2 setosa 1
#> 4 4.6 3.1 1.5 0.2 setosa 1
#> 5 5.0 3.6 1.4 0.2 setosa 1
#> 6 5.4 3.9 1.7 0.4 setosa 1
Created on 2021-06-17 by the reprex package (v0.3.0)

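We could use ifelse. Note that, as written, both branches return 1, so the Test column is added whether or not r is empty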
library(dplyr)
r <- c()
iris %>%
mutate(Test = ifelse(length(r) > 0, 1,1))
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1

The code below adds the variable if the condition is met. If not, it adds a variable populated with all NA and then removes it (I understand you need the new variable only if the condition is met).
library(dplyr)
r <- c()
iris %>%
mutate(test2 = if_else(length(r) > 0, 2, NA_real_)) %>% # NA_real_ keeps both branches numeric
select(where(~ !(all(is.na(.))))) #remove columns with all NAs

Related

Add a column count if values in multiple column meet threshold conditions: R

Consider the iris dataset. Let's say I want to create a column that counts, for each row, how many of the "sepal" columns have values between 1 and 5.
Here's what I have:
iris %>% rowwise() %>%
mutate(count = sum(if_any(contains("sepal", ignore.case = TRUE),
.fns = ~ between(.x, 1, 5)))) %>%
arrange(desc(count))
But the output is not what I want.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1 # Should be 1
2 4.9 3 1.4 0.2 setosa 1 # Should be 2
3 4.7 3.2 1.3 0.2 setosa 1 # Should be 2
4 4.6 3.1 1.5 0.2 setosa 1 # Should be 2
5 5 3.6 1.4 0.2 setosa 1 # Should be 2
6 5.4 3.9 1.7 0.4 setosa 1 # Should be 1
7 4.6 3.4 1.4 0.3 setosa 1 # Should be 2
8 5 3.4 1.5 0.2 setosa 1 # Should be 2
9 4.4 2.9 1.4 0.2 setosa 1 # Should be 2
10 4.9 3.1 1.5 0.1 setosa 1 # Should be 2
I can use case_when or if_else for the two columns but the actual dataset has a lot more columns. So I'm looking for a dplyr solution where I don't have to type out all the columns.
library(tidyverse)
iris %>%
mutate(
count = rowSums(across(contains("Sepal"), ~ between(.x, 1, 5)))
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 2
4 4.6 3.1 1.5 0.2 setosa 2
5 5.0 3.6 1.4 0.2 setosa 2
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 2
8 5.0 3.4 1.5 0.2 setosa 2
9 4.4 2.9 1.4 0.2 setosa 2
10 4.9 3.1 1.5 0.1 setosa 2
EDIT:
With c_across. To my understanding, c_across has to be used with rowwise() to perform rowwise aggregation and calculation.
iris %>%
rowwise() %>%
mutate(count = sum(between(c_across(contains("Sepal")), 1, 5)))
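The original attempt returns 1 in every row because if_any() collapses the selected columns into a single TRUE/FALSE per row, and the sum of one TRUE is 1. A small sketch checking that the two approaches above agree; the object names by_rowsums and by_rowwise are illustrative:

```r
library(dplyr)

# Vectorised: across() returns a data frame of logicals, rowSums() counts TRUEs per row.
by_rowsums <- iris %>%
  mutate(count = rowSums(across(contains("Sepal"), ~ between(.x, 1, 5))))

# Row by row: c_across() gathers the selected values of the current row into one vector.
by_rowwise <- iris %>%
  rowwise() %>%
  mutate(count = sum(between(c_across(contains("Sepal")), 1, 5))) %>%
  ungroup()

# Both approaches should give the same counts.
all(by_rowsums$count == by_rowwise$count)
head(by_rowsums$count)
```

On larger data the rowSums() version is usually the faster of the two, since it avoids evaluating the expression once per row.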

Rename and recode range of new variables in dataframe in R

I essentially want to recode and rename a range of variables in a data frame. I am looking for a way to do this in a single step.
Example in pseudo-code:
require(dplyr)
df <- iris %>% head()
df %>% mutate(
paste0("x", 1:3) = across( # In the example I want to rename
Sepal.Length:Petal.Length, # the variables I've selected
~ .x + 1 # and recoded to "x1" ... "x3"
)
)
df
Desired output:
x1 x2 x3 Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Maybe rename_with() is what you want. After that you can manipulate these renamed columns with mutate(across(...)).
library(dplyr)
df %>%
rename_with(~ paste0("x", seq_along(.x)), Sepal.Length:Petal.Length) %>%
mutate(across(x1:x3, ~ .x * 10))
x1 x2 x3 Petal.Width Species
1 51 35 14 0.2 setosa
2 49 30 14 0.2 setosa
3 47 32 13 0.2 setosa
4 46 31 15 0.2 setosa
5 50 36 14 0.2 setosa
6 54 39 17 0.4 setosa
If you want to manipulate and rename a range of columns in one step, try the argument .names in across().
df %>%
mutate(across(Sepal.Length:Petal.Length, ~ .x * 10,
.names = "x{seq_along(.col)}"),
.keep = "unused", .after = 1)
x1 x2 x3 Petal.Width Species
1 51 35 14 0.2 setosa
2 49 30 14 0.2 setosa
3 47 32 13 0.2 setosa
4 46 31 15 0.2 setosa
5 50 36 14 0.2 setosa
6 54 39 17 0.4 setosa
Hint: You can use seq_along() to create a sequence 1, 2, ... along with the selected columns, or match() to get the positions of the selected columns in the data, i.e. .names = "x{match(.col, names(df))}".
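For comparison, the .names argument is a glue specification in which {.col} stands for the name of the column being processed. A minimal sketch that keeps the original columns and appends a suffix instead of numbering them; the suffix "_x10" and the object name scaled are arbitrary choices for illustration:

```r
library(dplyr)

df <- head(iris)

# Each new column is named after its source column plus the "_x10" suffix.
scaled <- df %>%
  mutate(across(Sepal.Length:Petal.Length, ~ .x * 10,
                .names = "{.col}_x10"))

names(scaled)
```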
The code below lets you supply the column numbers to a for loop. I am not sure whether this is what you're going for.
require(dplyr)
df <- iris %>% head()
for(i in 1:3){
names(df)[i] <- paste0("x",i)
}
df
Outputs:
x1 x2 x3 Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You could add consecutive numbers to n columns with the same prefix this way:
df <- iris %>% head()
n <- 3
colnames(df)[1:n] <- sprintf("x%s",1:n)
output:
# x1 x2 x3 Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
Or to any nonconsecutive set of columns by:
n <- c(1,3,5)
colnames(df)[n] <- sprintf("x%s",n)
# x1 Sepal.Width x3 Petal.Width x5
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa

str_c over all but one column in tibble (R)

I am new to the tidyverse. I want to join all columns but one (as the names of the other columns might vary). Here is an example with iris that, obviously, does not work. Thanks :)
library(tidyverse)
dat <- as_tibble(iris)
dat %>% mutate(New = str_c(!Sepal.Length, sep="_"))
We can use select to select the columns that we want to paste and apply str_c with do.call.
library(tidyverse)
dat %>% mutate(New = do.call(str_c, c(select(., !Sepal.Length), sep="_")))
However, using unite would be simpler.
dat %>% unite(New, !Sepal.Length, sep="_", remove= FALSE)
# Sepal.Length New Sepal.Width Petal.Length Petal.Width Species
# <dbl> <chr> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5_1.4_0.2_setosa 3.5 1.4 0.2 setosa
# 2 4.9 3_1.4_0.2_setosa 3 1.4 0.2 setosa
# 3 4.7 3.2_1.3_0.2_setosa 3.2 1.3 0.2 setosa
# 4 4.6 3.1_1.5_0.2_setosa 3.1 1.5 0.2 setosa
# 5 5 3.6_1.4_0.2_setosa 3.6 1.4 0.2 setosa
# 6 5.4 3.9_1.7_0.4_setosa 3.9 1.7 0.4 setosa
# 7 4.6 3.4_1.4_0.3_setosa 3.4 1.4 0.3 setosa
# 8 5 3.4_1.5_0.2_setosa 3.4 1.5 0.2 setosa
# 9 4.4 2.9_1.4_0.2_setosa 2.9 1.4 0.2 setosa
#10 4.9 3.1_1.5_0.1_setosa 3.1 1.5 0.1 setosa
# … with 140 more rows
Using base R:
dat <- iris
cols <- grepl("Sepal.Length", names(dat))
tmp <- dat[, !cols]
dat$new <- apply(tmp, 1, paste0, collapse = "_")
head(dat)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
#> 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
#> 2 4.9 3.0 1.4 0.2 setosa 3.0_1.4_0.2_setosa
#> 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
#> 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
#> 5 5.0 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
#> 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
Created on 2021-02-01 by the reprex package (v1.0.0)
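Note that the apply() output above shows "3.0" where the unite() output shows "3": apply() first converts the data frame to a character matrix, which formats each numeric column to a common number of digits. A hedged base R sketch that avoids the matrix conversion by pasting the columns directly; the object name dat2 is illustrative:

```r
dat2 <- iris

# Logical index of the columns to keep (everything except Sepal.Length).
keep <- names(dat2) != "Sepal.Length"

# do.call() passes each remaining column as a separate argument to paste(),
# so every value is converted with as.character() on its own.
dat2$new <- do.call(paste, c(dat2[keep], sep = "_"))

head(dat2$new, 2)
```

This keeps the result in line with the unite() and str_c() versions, where 3.0 is rendered as "3".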
We can also use reduce
library(dplyr)
library(purrr)
library(stringr)
dat %>%
mutate(New = select(., -Sepal.Length) %>%
reduce(str_c, sep="_"))
# A tibble: 150 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
# 2 4.9 3 1.4 0.2 setosa 3_1.4_0.2_setosa
# 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
# 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
# 5 5 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
# 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
# 7 4.6 3.4 1.4 0.3 setosa 3.4_1.4_0.3_setosa
# 8 5 3.4 1.5 0.2 setosa 3.4_1.5_0.2_setosa
# 9 4.4 2.9 1.4 0.2 setosa 2.9_1.4_0.2_setosa
#10 4.9 3.1 1.5 0.1 setosa 3.1_1.5_0.1_setosa
# … with 140 more rows

Display column names that are only factors

Is there a way to extract only the names of columns that are factors? For example, in the iris dataset the last column is a factor, so only Species (the column name, not the entire column) should be extracted.
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> str(head(iris))
'data.frame': 6 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
We can use :
names(iris)[sapply(iris, is.factor)]
#[1] "Species"
Or using Filter :
names(Filter(is.factor, iris))
Another solution which involves the dplyr package (if by chance you are already using it in your own project) is
names(iris %>% select_if(is.factor))
or equivalently (choose the one you like more)
iris %>% select_if(is.factor) %>% names()
Output
# [1] "Species"
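In dplyr 1.0.0 and later, select_if() is superseded by select() combined with where(). A minimal sketch of the same lookup in that style, assuming a dplyr version that provides where(); the name factor_cols is illustrative:

```r
library(dplyr)

# where() applies the predicate to every column and keeps those returning TRUE.
factor_cols <- iris %>%
  select(where(is.factor)) %>%
  names()

factor_cols
```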

dplyr: how to reference columns by column index rather than column name using mutate?

Using dplyr, you can do something like this:
iris %>% head %>% mutate(sum=Sepal.Length + Sepal.Width)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
But above, I referenced the columns by their names. How can I use the column indices 1 and 2 to achieve the same result?
Here I have the following, but I feel it's not as elegant.
iris %>% head %>% mutate(sum=apply(select(.,1,2),1,sum))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
You can try:
iris %>% head %>% mutate(sum = .[[1]] + .[[2]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
I'm a bit late to the game, but my personal strategy in cases like this is to write my own tidyverse-compliant function that will do exactly what I want. By tidyverse-compliant, I mean that the first argument of the function is a data frame and that the output is a vector that can be added to the data frame.
sum_cols <- function(x, col1, col2){
x[[col1]] + x[[col2]]
}
iris %>%
head %>%
mutate(sum = sum_cols(x = ., col1 = 1, col2 = 2))
An alternative to reusing . in mutate that will respect grouping is to use dplyr::cur_data_all(). From help(cur_data_all)
cur_data_all() gives the current data for the current group (including grouping variables)
Consider the following:
iris %>% group_by(Species) %>% mutate(sum = .[[1]] + .[[2]]) %>% head
#Error: Problem with `mutate()` column `sum`.
#ℹ `sum = .[[1]] + .[[2]]`.
#ℹ `sum` must be size 50 or 1, not 150.
#ℹ The error occurred in group 1: Species = setosa.
If instead you use cur_data_all(), it works without issue:
iris %>% mutate(sum = select(cur_data_all(),1) + select(cur_data_all(),2)) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
The same approach works with the extract operator ([[).
iris %>% mutate(sum = cur_data()[[1]] + cur_data()[[2]]) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
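cur_data() and cur_data_all() were deprecated in dplyr 1.1.0 in favour of pick(). A hedged sketch of the grouped case using pick(), assuming dplyr >= 1.1.0; note that pick() works on the non-grouping columns of the current group, so positions 1 and 2 refer to Sepal.Length and Sepal.Width here. The name grouped_sum is illustrative:

```r
library(dplyr)

grouped_sum <- iris %>%
  group_by(Species) %>%
  # pick(1, 2) returns a data frame with the first two non-grouping columns
  # of the current group; rowSums() adds them row by row.
  mutate(sum = rowSums(pick(1, 2))) %>%
  ungroup()

head(grouped_sum$sum)
```

Unlike .[[1]] + .[[2]], this is evaluated per group, so the result has the size of each group rather than of the whole data frame.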
What do you think about this version?
Inspired by @SavedByJesus's answer.
applySum <- function(df, ...) {
assertthat::assert_that(...length() > 0, msg = "one or more column indexes are required")
mutate(df, Sum = apply(as.data.frame(df[, c(...)]), 1, sum))
}
iris %>%
head(2) %>%
applySum(1, 2)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
#
### you can select and sum more than two columns with the same function
#
iris %>%
head(2) %>%
applySum(1, 2, 3, 4)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 10.2
2 4.9 3.0 1.4 0.2 setosa 9.5
To address the issue that @pluke is asking about in the comments: dplyr doesn't really support column indices.
It is not a perfect solution, but you can use base R to get around this (note that this overwrites the first column with the sum)
iris[1] <- iris[1] + iris[2]
This can now (packageVersion("dplyr") >= 1.0.0) be done very nicely with the combination of dplyr::rowwise() and dplyr::c_across().
library(dplyr)
packageVersion("dplyr")
#> [1] '1.0.10'
iris %>%
head %>%
rowwise() %>%
mutate(sum = sum(c_across(c(1, 2))))
#> # A tibble: 6 × 6
#> # Rowwise:
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 8.6
#> 2 4.9 3 1.4 0.2 setosa 7.9
#> 3 4.7 3.2 1.3 0.2 setosa 7.9
#> 4 4.6 3.1 1.5 0.2 setosa 7.7
#> 5 5 3.6 1.4 0.2 setosa 8.6
#> 6 5.4 3.9 1.7 0.4 setosa 9.3
Created on 2022-11-01 with reprex v2.0.2
