How can I use variables in the calculations of the mutate function? - r

Consider this famous table (already exists in R)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Please notice that it has a column named Sepal.Length.
I defined a variable with the same name. Please consider this code:
table = iris
Sepal.Length = 0
table2 = table %>% mutate ( new = Sepal.Length*Petal.Length )
If you check the result:
head(table2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
1 5.1 3.5 1.4 0.2 setosa 7.14
2 4.9 3.0 1.4 0.2 setosa 6.86
3 4.7 3.2 1.3 0.2 setosa 6.11
4 4.6 3.1 1.5 0.2 setosa 6.90
5 5.0 3.6 1.4 0.2 setosa 7.00
6 5.4 3.9 1.7 0.4 setosa 9.18
As you can see, the variable Sepal.Length = 0 was ignored and the column table$Sepal.Length was used to create the new column.
How can I use my own variables in the calculations inside mutate?

If we want to use an object from the global environment that has the same name as a column in the data, refer to it through the .env pronoun
library(dplyr)
table2 <- table %>%
mutate(new = .env$Sepal.Length * Petal.Length)
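To see the two pronouns side by side, here is a minimal sketch. The column names from_column and from_env are made up for illustration: .data refers to columns of the data frame, while .env refers to objects in the calling environment.

```r
library(dplyr)

Sepal.Length <- 0  # environment variable that shares its name with a column

res <- head(iris) %>%
  mutate(
    from_column = .data$Sepal.Length * Petal.Length,  # uses the iris column
    from_env    = .env$Sepal.Length * Petal.Length    # uses the variable defined above
  )

res[, c("from_column", "from_env")]
```

Being explicit with both pronouns makes the intent clear even when a name exists in both places.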

Alternatively, put !! in front of Sepal.Length as noted here https://stackoverflow.com/a/47659705/13015865
packages
library(dplyr)
Solution
table <- iris #no need to change the name of the dataset. But ok.
Sepal.Length <- 0
table %>% mutate ( new = !!Sepal.Length*Petal.Length )
output (head)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
1 5.1 3.5 1.4 0.2 setosa 0
2 4.9 3.0 1.4 0.2 setosa 0
3 4.7 3.2 1.3 0.2 setosa 0
4 4.6 3.1 1.5 0.2 setosa 0
5 5.0 3.6 1.4 0.2 setosa 0
6 5.4 3.9 1.7 0.4 setosa 0
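One way to check what !! does here is to capture the expression with rlang::expr() instead of running it inside mutate. This is only a sketch for inspection; the name e is arbitrary. The value of the environment variable should be inlined in place of the name, while Petal.Length stays a symbol that mutate later looks up in the data.

```r
library(rlang)

Sepal.Length <- 0

# expr() processes `!!` the same way mutate() does, but returns the
# resulting expression instead of evaluating it against a data frame.
e <- expr(!!Sepal.Length * Petal.Length)
print(e)
```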

Related

How to slice a dataset into multiple datasets in R

For this example, I'm going to use iris dataset built-in in R.
How can I get the same output without copying and pasting the syntax below?
package
library(dplyr)
Input
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
Manual Solution
I have to subset my dataset based on the name of the column names.
I know how to do this "manually" but it would require a lot of copying and pasting on my current dataset.
Sepal <- iris %>% select(contains("Sepal"))
Petal <- iris %>% select(contains("Petal"))
Output
head(Sepal)
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
# 6 5.4 3.9
head(Petal)
# Petal.Length Petal.Width
# 1 1.4 0.2
# 2 1.4 0.2
# 3 1.3 0.2
# 4 1.5 0.2
# 5 1.4 0.2
# 6 1.7 0.4
How can I automate this process? I think I can use the purrr package here, but I couldn't find a way to do it.
You can use
library(tidyverse)
map(set_names(c("Sepal", "Petal")), ~ select(iris, starts_with(.x)))
output (head)
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
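As a usage sketch for the list returned above (the names prefixes and parts are chosen here for illustration), each piece can be pulled out by its prefix:

```r
library(dplyr)
library(purrr)

prefixes <- c("Sepal", "Petal")

# set_names() names the vector after itself, so map() returns a named list
parts <- map(set_names(prefixes), ~ select(iris, starts_with(.x)))

names(parts)          # the list is named after the prefixes
head(parts$Sepal)     # one data frame per prefix
map_int(parts, ncol)  # number of columns in each piece
```

Adding another prefix to the vector is then the only change needed to get another subset.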
Another option is split.default, splitting on the prefix of each column name (the part before the dot), which returns a named list of data frames
library(dplyr)
library(stringr)
head(iris) %>%
select(-Species) %>%
split.default(str_remove(names(.), "\\..*"))
$Petal
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
$Sepal
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
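The same idea can be sketched in base R only, assuming the grouping key is everything before the first dot in the column name. The names measurements, groups and pieces are illustrative.

```r
# Drop the non-numeric Species column, then group the remaining columns
# by the part of their name that precedes the first dot.
measurements <- iris[names(iris) != "Species"]
groups <- sub("\\..*", "", names(measurements))

# split.default() treats the data frame as a list of columns
pieces <- split.default(measurements, groups)

names(pieces)
head(pieces$Sepal)
```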

Subsetting a dataframe in R

I have a data frame and want to create a subset, <Frame>, containing just the species variable, and then display its first five records. How can I subset it in R?
There are 10 rows and 7 columns; one column is species.
netID - fishID - species - tl - wtag - scale
Use select:
head(
select(dataframe, species),
5
)
Assuming your dataframe is called df you can subset with dplyr
library(dplyr)
df <- iris[1:10,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
newdf <- df %>% select(Species) %>% slice(1:5)
Here select() keeps only the Species column, and slice() then picks the range of rows you need. The output of newdf is
Species
1 setosa
2 setosa
3 setosa
4 setosa
5 setosa
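For comparison, a base R sketch of the same subset, assuming the same df built from the first ten rows of iris. Indexing with a single string in single brackets keeps the result a data frame.

```r
df <- iris[1:10, ]

# df["Species"] returns a one-column data frame; head(..., 5) keeps five rows
newdf <- head(df["Species"], 5)
newdf
```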

How to subtract two columns using tidyverse mutate with columns named by external variables

I’d like to dynamically assign which columns to subtract from each other. I’ve read around and it looks like I need to use all_of, and maybe across (How to subtract one column from multiple columns in a dataframe in R using dplyr, How to you use objects in dplyr filter?). I can get it working for one variable in a mutate call (e.g. mutate(y = all_of(x))), but I can’t seem to do even simple calculations using two. Here’s a simplified example of what I want to do:
var1 <- c("Sepal.Length")
var2 <- c("Sepal.Width")
result <- iris %>%
mutate(calculation = all_of(var1) - all_of(var2))
We may use the .data pronoun to extract each column as a vector. all_of/any_of are selection helpers, meant to be used with across (or select) to pick columns, so they do not return column values on their own
library(dplyr)
iris %>%
mutate(calculation = .data[[var1]] - .data[[var2]])%>%
head
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
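The .data approach carries over to a reusable function. The sketch below assumes a hypothetical helper, subtract_cols, that takes the two column names as strings; it is not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: subtract the column named `subtrahend`
# from the column named `minuend` and store it as `calculation`.
subtract_cols <- function(data, minuend, subtrahend) {
  data %>%
    mutate(calculation = .data[[minuend]] - .data[[subtrahend]])
}

res <- subtract_cols(head(iris), "Sepal.Length", "Sepal.Width")
res$calculation
```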
Or we may also use cur_data() (deprecated since dplyr 1.1.0 in favor of pick())
iris %>%
head %>%
mutate(calculation = cur_data()[[var1]] - cur_data()[[var2]])
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or another option is to pass both the variables in across, and then reduce with -
library(purrr)
iris %>%
head %>%
mutate(calculation = reduce(across(all_of(c(var1, var2))), `-`))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or could convert to symbol and evaluate (!!)
iris %>%
head %>%
mutate(calculation = !! rlang::sym(var1) - !! rlang::sym(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
Or if we want to use all_of in across, just subset the column with [[
iris %>%
head %>%
mutate(calculation = across(all_of(var1))[[1]] -
across(all_of(var2))[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
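We need to subset with [[ because across returns a data frame, not a vector. When .names is not supplied it keeps the original column name, so subtracting two across calls yields a one-column data frame named Sepal.Length, and calculation becomes a data frame column rather than a numeric vector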
out <- iris %>%
head %>%
mutate(calculation = across(all_of(var1)) -
across(all_of(var2)))
out
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5.0 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
str(out)
'data.frame': 6 obs. of 6 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
$ calculation :'data.frame': 6 obs. of 1 variable:
..$ Sepal.Length: num 1.6 1.9 1.5 1.5 1.4 1.5
We could use get to access the variable values where the name of variable is stored in a string (thanks to akrun for assist):
iris %>%
mutate(calculation = get(var1) - get(var2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species calculation
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3 1.4 0.2 setosa 1.9
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.5
5 5 3.6 1.4 0.2 setosa 1.4
6 5.4 3.9 1.7 0.4 setosa 1.5
7 4.6 3.4 1.4 0.3 setosa 1.2
8 5 3.4 1.5 0.2 setosa 1.6
9 4.4 2.9 1.4 0.2 setosa 1.5
10 4.9 3.1 1.5 0.1 setosa 1.8
# ... with 140 more rows
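One caveat with get is worth sketching: it also searches the surrounding environment, so a name that is not a column can be picked up silently, whereas .data[[ ]] looks only at columns. In the example below Sepal.Depth is a made-up object that is deliberately not a column of iris.

```r
library(dplyr)

Sepal.Depth <- 1               # not a column of iris, only an object in the environment
missing_col <- "Sepal.Depth"

# get() falls back to the environment, so the missing column goes unnoticed
via_get <- head(iris) %>% mutate(x = get(missing_col))

# .data[[ ]] only looks at columns, so the same name raises an error
via_data <- tryCatch(
  head(iris) %>% mutate(x = .data[[missing_col]]),
  error = function(e) "error: column not found"
)

via_get$x
via_data
```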

Creating new columns with mutate() and across()

This is a simplified version of the actual problem I'm dealing with. In this example, I'll be working with four columns, and the actual problem requires working with about 20-30 columns.
Consider the iris dataset. Suppose that I wanted to, for some reason, append new columns which would be equal to double the .Length and the .Width columns. With the following code, this would change the existing columns:
library(dplyr)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
df_iris <- iris %>% mutate(across(matches("(\\.)(Length|Width)"),
function(x) { x * 2 }))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 10.2 7.0 2.8 0.4 setosa
2 9.8 6.0 2.8 0.4 setosa
3 9.4 6.4 2.6 0.4 setosa
4 9.2 6.2 3.0 0.4 setosa
5 10.0 7.2 2.8 0.4 setosa
6 10.8 7.8 3.4 0.8 setosa
However, instead, I would like to have this doubled calculation create NEW columns, say .Length.2 and .Width.2. One way this could be done is the following:
double <- function(x) {
x * 2
}
df_iris <- iris %>%
mutate(Sepal.Length.2 = double(Sepal.Length),
Sepal.Width.2 = double(Sepal.Width),
Petal.Length.2 = double(Petal.Length),
Petal.Width.2 = double(Petal.Width))
head(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
4 4.6 3.1 1.5 0.2 setosa 9.2 6.2 3.0 0.4
5 5.0 3.6 1.4 0.2 setosa 10.0 7.2 2.8 0.4
6 5.4 3.9 1.7 0.4 setosa 10.8 7.8 3.4 0.8
Is there a way to do this in dplyr without:
relying on superseded/deprecated functions?
having to manually specify each column name?
We can use across with the .names argument (tested with dplyr 1.0.6); double is the function defined in the question
library(dplyr)
df_iris <- iris %>%
mutate(across(where(is.numeric), double, .names = '{.col}.2'))
-output
head(df_iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.2 Sepal.Width.2 Petal.Length.2 Petal.Width.2
1 5.1 3.5 1.4 0.2 setosa 10.2 7.0 2.8 0.4
2 4.9 3.0 1.4 0.2 setosa 9.8 6.0 2.8 0.4
3 4.7 3.2 1.3 0.2 setosa 9.4 6.4 2.6 0.4
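If only some numeric columns should be doubled, the selection can be narrowed. This sketch reuses a pattern similar to the matches() call from the question and an anonymous function in place of double; for iris it selects the same four columns.

```r
library(dplyr)

df_iris <- iris %>%
  mutate(across(
    matches("\\.(Length|Width)$"),  # columns ending in .Length or .Width
    ~ .x * 2,                       # double each selected column
    .names = "{.col}.2"             # write results to new columns
  ))

names(df_iris)
```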

Renaming columns based on condition about their names

I would like to add a prefix to my dataset column names only if they already begin with a certain string, and I would like to do it (if possible) using a dplyr pipeline.
Taking the iris dataset as toy example, I was able to get the expected result with base R (with a quite cumbersome line of code):
data("iris")
colnames(iris)[startsWith(colnames(iris), "Sepal")] <- paste0("YAY_", colnames(iris)[startsWith(colnames(iris), "Sepal")])
head(iris)
YAY_Sepal.Length YAY_Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
In this example, the prefix YAY_ has been added to all the column names starting with Sepal. Is there a way to obtain the same result with a dplyr command/pipeline?
An option would be rename_at (superseded by rename_with since dplyr 1.0.0, but still available)
library(tidyverse)
iris %>%
rename_at(vars(starts_with("Sepal")), ~ str_c("YAY_", .))
# YAY_Sepal.Length YAY_Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
# ...
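With dplyr 1.0.0 or later, the same rename can be sketched with rename_with, which takes the renaming function first and the column selection second. The name renamed is illustrative, and the sketch assumes iris still has its original column names.

```r
library(dplyr)

renamed <- iris %>%
  rename_with(~ paste0("YAY_", .x), .cols = starts_with("Sepal"))

names(renamed)
```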
