Using variables for column functions in mutate() - r

How can I use variables in place of column names in dplyr calls? As an example, say I want to add a column called sum to the iris dataset that is the sum of Sepal.Length and Sepal.Width. In short, I want a working version of the code below.
x = "Sepal.Length"
y = "Sepal.Width"
head(iris%>% mutate(sum = x+y))
Currently, running the code outputs "Evaluation error: non-numeric argument to binary operator" as R evaluates x and y as character vectors. How do I instead get R to evaluate x and y as column names of the dataframe? I know that the answer is to use some form of lazy evaluation, but I'm having trouble figuring out exactly how to configure it.
Note that the proposed duplicate: dplyr - mutate: use dynamic variable names does not address this issue. The duplicate answers this question:
Not my question: How do I do:
var = "sum"
head(iris %>% mutate(var = Sepal.Length + Sepal.Width))

I think the recommended way is to use sym():
iris %>% mutate(sum = !!sym(x) + !!sym(y)) %>% head
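As a possible alternative to sym(), the .data pronoun can look up a column from a string held in a variable. This is only a minimal sketch, assuming a dplyr version that provides the .data pronoun; the name res is introduced here purely for illustration.

```r
library(dplyr)

x <- "Sepal.Length"
y <- "Sepal.Width"

# .data[[x]] fetches the column whose name is stored in x
res <- iris %>% mutate(sum = .data[[x]] + .data[[y]])
head(res)
```

Unlike get(), the .data pronoun only looks inside the data frame, so a typo in x should raise an error instead of silently picking up an object from the calling environment.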

It also works with get():
> rm(list = ls())
> data("iris")
>
> library(dplyr)
>
> x <- "Sepal.Length"
> y <- "Sepal.Width"
>
> head(iris %>% mutate(sum = get(x) + get(y)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
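If the same pattern is needed in several places, it could be wrapped in a small function. The sketch below is an assumption about how that might look: add_sum() and its out argument are hypothetical names, not part of dplyr. It also shows how the name of the new column can come from a string, which is the case the proposed duplicate covers.

```r
library(dplyr)
library(rlang)

# add_sum() is a hypothetical helper, not part of dplyr
add_sum <- function(df, x, y, out = "sum") {
  # !!sym(x) turns the string in x into a column reference;
  # !!out := ... takes the new column's name from the string in out
  df %>% mutate(!!out := !!sym(x) + !!sym(y))
}

res <- add_sum(iris, "Sepal.Length", "Sepal.Width", out = "sepal_total")
head(res)
```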

Related

Select values in a table conditional to an external table

I'd like to select the N largest values of each variable (column) in a data set, where N varies by column and by group and is given in another table. An example with the iris data is below:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
## Create a fake external table
library(data.table)
ext.tab <- data.table(species = c("setosa", "versicolor", "virginica"), N1 = 1:3, N2 = 3:5, N3 = 5:7, N4 = 7:9)
head(ext.tab)
species N1 N2 N3 N4
1: setosa 1 3 5 7
2: versicolor 2 4 6 8
3: virginica 3 5 7 9
Now for Iris setosa, I'd like to get the single largest value (N1 in ext.tab) of column 1 (Sepal.Length in the iris data), then the three largest values (N2 in ext.tab) of column 2 (Sepal.Width), then the five largest values (N3) of column 3 (Petal.Length), and so forth. Then I'd like to move on to Iris versicolor and do the same.
The result can be either a table or a list for each species with the values themselves or row indices for each variable (column). Any idea of a fast way to implement that?
Here is a tidyverse approach using a custom function. The function takes the variable and group names as character scalars and the number of maximum values as a numeric. Inside the function is a dplyr pipeline that uses the .data pronoun. I then reshaped ext.tab to long form and applied get_maximum() row-wise.
library(tidyverse)
get_maximum <- \(.x, .group, .n_max, .dat) {
  .dat %>%
    filter(Species == .group) %>%
    arrange(desc(.data[[.x]])) %>%
    slice(seq_len(.n_max)) %>%
    pull(.data[[.x]])
}
dat <- as_tibble(ext.tab) %>%
  pivot_longer(-species) %>%
  mutate(name = recode(
    name,
    N1 = "Sepal.Length",
    N2 = "Sepal.Width",
    N3 = "Petal.Length",
    N4 = "Petal.Width"
  )) %>%
  rowwise() %>%
  mutate(max_num = list(
    get_maximum(name, species, value, iris)
  )) %>%
  ungroup()
If you need the unique maximum values, you can add distinct() inside the custom function.
get_maximum_unique <- \(.x, .group, .n_max, .dat) {
  .dat %>%
    filter(Species == .group) %>%
    distinct(.data[[.x]]) %>%
    arrange(desc(.data[[.x]])) %>%
    slice(seq_len(.n_max)) %>%
    pull(.data[[.x]])
}
Here is an option using data.table. I have taken the liberty of renaming the column names.
cols <- setdiff(names(ext.tab), "Species")
iris[ext.tab, on=.(Species), by=.EACHI,
     .(.(mapply(function(x, n) -head(sort(-x, partial=n), n),
                x=mget(cols), n=mget(paste0("i.", cols)), SIMPLIFY=FALSE)))]$V1
data:
library(data.table)
iris <- as.data.table(iris)
ext.tab <- data.table(Species=c("setosa", "versicolor", "virginica"),
Sepal.Length=c(1:3),
Sepal.Width=c(3:5),
Petal.Length=c(5:7),
Petal.Width=c(7:9))
output:
[[1]]
[[1]]$Sepal.Length
[1] 5.8
[[1]]$Sepal.Width
[1] 4.4 4.2 4.1
[[1]]$Petal.Length
[1] 1.9 1.9 1.7 1.7 1.7
[[1]]$Petal.Width
[1] 0.4 0.4 0.6 0.4 0.5 0.4 0.4
[[2]]
[[2]]$Sepal.Length
[1] 7.0 6.9
[[2]]$Sepal.Width
[1] 3.4 3.3 3.2 3.2
[[2]]$Petal.Length
[1] 5.1 4.8 4.9 5.0 4.9 4.8
[[2]]$Petal.Width
[1] 1.7 1.6 1.6 1.8 1.5 1.5 1.6 1.5
[[3]]
[[3]]$Sepal.Length
[1] 7.7 7.9 7.7
[[3]]$Sepal.Width
[1] 3.8 3.8 3.6 3.4 3.4
[[3]]$Petal.Length
[1] 6.4 6.3 6.7 6.9 6.7 6.6 6.1
[[3]]$Petal.Width
[1] 2.5 2.5 2.4 2.5 2.4 2.4 2.3 2.3 2.3
Short explanation:
iris[ext.tab, on=.(Species), ...] performs a join that keeps every row of ext.tab
by=.EACHI evaluates the expression once for each row of ext.tab
x=mget(cols) gets the columns in iris
n=mget(paste0("i.", cols)) gets the number of values required for each column
-head(sort(-x, partial=n), n) performs a partial sort and extracts the n largest values (not necessarily in decreasing order)
SIMPLIFY=FALSE and .(.( )) are required to return the results as a list
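For comparison, the same selection can be sketched in base R with nested lapply() calls. This is not the approach of either answer, only an illustration under stated assumptions: ext and top_n are names introduced here, and ext mirrors the renamed ext.tab above as a plain data frame. A full sort replaces the partial one, so the values come back in decreasing order at some cost in speed.

```r
df <- datasets::iris

# Same counts as the renamed ext.tab, as a plain data frame (assumed name: ext)
ext <- data.frame(Species = c("setosa", "versicolor", "virginica"),
                  Sepal.Length = 1:3, Sepal.Width = 3:5,
                  Petal.Length = 5:7, Petal.Width = 7:9,
                  stringsAsFactors = FALSE)
cols <- setdiff(names(ext), "Species")

# For each species (row of ext) and each column, keep the n largest values
top_n <- lapply(seq_len(nrow(ext)), function(i) {
  sub <- df[df$Species == ext$Species[i], ]
  lapply(setNames(cols, cols), function(col) {
    head(sort(sub[[col]], decreasing = TRUE), ext[[col]][i])
  })
})
names(top_n) <- ext$Species

top_n$setosa$Sepal.Width
```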

Operating on list of strings representing column names?

I'm currently trying to automate a data task that requires taking in a list of column names as strings, then summing those columns row-wise. For example, suppose there is a list as follows:
> list
[1] "colname1" "colname2" "colname3"
How would I go about passing in this list to some function like sum() in tidyverse? That is, I would like to run something like the following:
df <- df %>%
rowwise %>%
mutate(new_var = sum(list))
Any suggestions would be greatly, greatly appreciated. Thanks.
You could use rowSums here. For example:
library(dplyr)
mycols <- colnames(iris)[3:4]
mycols
[1] "Petal.Length" "Petal.Width"
Then:
iris %>%
  mutate(new_var = rowSums(.[, mycols])) %>%
  head()
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1
You can pass the vector of column names to c_across(), wrapped in all_of() so that it is treated as a set of column names:
library(dplyr)
df <- df %>% rowwise() %>% mutate(new_var = sum(c_across(all_of(list))))
df
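Since df and list are not defined in the question, here is a runnable sketch of the same idea on iris. The names mycols, res_rowwise and res_vectorised are illustrative assumptions. It also shows a vectorised variant with rowSums(across(...)), which avoids rowwise() when the operation is a plain sum.

```r
library(dplyr)

mycols <- c("Petal.Length", "Petal.Width")

# Row-wise: c_across() collects the selected columns for each row
res_rowwise <- iris %>%
  rowwise() %>%
  mutate(new_var = sum(c_across(all_of(mycols)))) %>%
  ungroup()

# Vectorised: across() returns a data frame that rowSums() can consume
res_vectorised <- iris %>%
  mutate(new_var = rowSums(across(all_of(mycols))))

head(res_vectorised)
```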

Extracting columns from Data Frame based on a "formula"

I have some data which looks like:
library(dplyr)
library(stringr)
data(iris)
d <- iris %>%
  select(Species, everything()) %>%
  rename(Y = 1) %>%
  rename_at(vars(-c(1)), ~str_c("X", seq_along(.)))
Data:
Y X1 X2 X3 X4
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
I add a random variable:
d$noise <- rnorm(nrow(d))
I am trying to extract just the Y, X1, X2... XN variables (dynamically). What I currently have is:
d %>%
select("Y", cat(paste0("X", seq_along(2:ncol(.)), collapse = ", ")))
This doesn't work since it takes into account the noise column and doesn't work even without the noise column.
So I am trying to create a new data frame which just extracts the Y, X1, X2...XN columns.
dplyr provides two select helper functions that you could use: contains() for literal strings or matches() for regular expressions.
In this case you could do
d %>%
select("Y", contains("X"))
or
d %>%
select("Y", matches("^X\\d+$"))
The first one works in the example you provided but would fail if you have other variables whose names contain an "X". The second is more robust: with the ^ and $ anchors, it only captures variables whose names are "X" followed by one or more digits.
We can also use starts_with():
d %>%
select(Y, starts_with('X'))
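To see how the helpers differ, the sketch below rebuilds a frame shaped like the one in the question and adds one extra column. X_extra is a hypothetical column invented for this illustration, and strict and loose are assumed names.

```r
library(dplyr)

# Rebuild a frame shaped like the question's: Y, X1..X4, plus extra columns
d <- iris %>% select(Y = Species, everything())
names(d)[-1] <- paste0("X", seq_len(ncol(d) - 1))
d$noise <- rnorm(nrow(d))
d$X_extra <- 0  # hypothetical column: starts with "X" but is not X<digits>

# Anchored regex: only names that are exactly "X" followed by digits
strict <- d %>% select(Y, matches("^X\\d+$"))
# Prefix match: also picks up X_extra
loose <- d %>% select(Y, starts_with("X"))

names(strict)
names(loose)
```

Which helper is appropriate depends on how much the column naming scheme can be trusted; the anchored pattern is the more conservative choice.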

R: column profiling

In R I'm trying to profile the columns of a data frame. This is the data frame:
> library(MASS)
> data<-iris[1:5,1:4]
> data
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
I want the result of the profiling to look something like this:
min max mean
Sepal.Length 4.6 5.1 5
Sepal.Width 3.0 3.6 5
Petal.Length 1.3 1.5 3
Petal.Width 0.2 0.2 1
There could be many more functions I want to apply to the columns.
I'm able to get the data I want with this command:
library(dplyr)
data %>% summarise_all(funs(min, max, mean))
However, neither the shape nor the row/column names are as desired. Is there an elegant way of achieving what I want?
One-liner with base R:
t(sapply(data, summary))[, c('Min.', 'Max.', 'Mean')]
Or with each() from plyr:
library(plyr)
t(sapply(data, each(min,max,mean)))
Using dplyr, which allows any functions (a named list replaces the deprecated funs()):
library(dplyr)
library(tidyr)
data %>%
  gather() %>%
  group_by(key) %>%
  summarise_all(list(min = min, max = max, mean = mean))
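A base R sketch that accepts an arbitrary named list of functions and returns the requested shape (one row per column, one column per statistic) might look like the following. The names stats and profile are assumptions made for this example.

```r
data <- iris[1:5, 1:4]

# Any named list of functions that each return a single value
stats <- list(min = min, max = max, mean = mean)

# Inner sapply applies every function to one column;
# outer sapply loops over columns; t() puts variables in rows
profile <- t(sapply(data, function(col) sapply(stats, function(f) f(col))))
profile
```

Adding another statistic should only require another entry in stats, for example sd = sd.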

dplyr rowwise mutate without hardcoding names

I want to do something like this
df <- iris %>%
rowwise %>%
mutate(new_var = sum(Sepal.Length, Sepal.Width))
Except I want to do it without typing the variable names, e.g.
names_to_add <- c("Sepal.Length", "Sepal.Width")
df <- iris %>%
rowwise %>%
[some function that uses names_to_add]
I attempted a few things e.g.
df <- iris %>%
rowwise %>%
mutate(new_var = sum(sapply(names_to_add, get, envir = as.environment(.))))
but still can't figure it out. I'll take an answer that plays around with lazyeval or something that's simpler. Note that the sum function here is just a placeholder and my actual function is much more complex, although it returns one value per row. I'd also rather not use data.table
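You should check out the functions that end with _ in dplyr, for example mutate_() and summarise_(). Note that these underscored verbs have been deprecated since dplyr 0.7 in favor of tidy evaluation.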
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df <- iris %>%
rowwise %>% mutate_(names_to_add)
Edit
The results of the code:
df <- iris %>%
rowwise %>% mutate(new_var = sum(Sepal.Length, Sepal.Width))
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df2 <- iris %>%
rowwise %>% mutate_(new_var = names_to_add)
identical(df, df2)
[1] TRUE
Edit
I edited the answer and it now solves the problem. I wonder why it was downvoted. We use SE (standard evaluation), passing a string as input to mutate_(). More info: vignette("nse","dplyr")
x <- "Sepal.Length + Sepal.Width"
df <- mutate_(iris, x)
head(df)
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length + Sepal.Width
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
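Because the underscored verbs are deprecated, a string expression can instead be parsed and unquoted inside mutate(). The following is a minimal sketch assuming rlang is installed; expr_string is a name chosen for this example, and parsing strings as code is only reasonable when the strings come from a trusted source.

```r
library(dplyr)
library(rlang)

expr_string <- "Sepal.Length + Sepal.Width"

# parse_expr() turns the string into an R expression;
# !! injects that expression into mutate()
df <- iris %>% mutate(new_var = !!parse_expr(expr_string))
head(df)
```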

Resources