I have a tibble data frame in R and I want to add a new column name but the name must come from the value of a variable, what is the easiest way to achieve this?
# let us generate a whole set of new features
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(hello=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species hello
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
# This is the example that does not work as desired
var_name='hello'
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(var_name=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species var_name
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
In the first example the column name created from mutate is actually a column called 'hello'. In the second example mutate names the column 'var_name', instead of 'hello' which is the desired outcome.
Any suggestions on how to make this as easy as possible?
Enter the command ?dplyr_data_masking
If you read through that, you can see there are at least 2 ways you can get your desired result.
iris_tbl <- iris_tbl %>% mutate("{var_name}" := 1)
Or
iris_tbl <- iris_tbl %>% mutate({{var_name}} := 1)
Related
Consider iris dataset. Let's say I want to create a column count if values "sepal" columns are between 1 to 5.
Here's what I have:
iris %>% rowwise() %>%
mutate(count = sum(if_any(contains("sepal", ignore.case = TRUE),
.fns = ~ between(.x, 1, 5)))) %>%
arrange(desc(count))
But the output is not what I want.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1 # Should be 1
2 4.9 3 1.4 0.2 setosa 1 # Should be 2
3 4.7 3.2 1.3 0.2 setosa 1 # Should be 2
4 4.6 3.1 1.5 0.2 setosa 1 # Should be 2
5 5 3.6 1.4 0.2 setosa 1 # Should be 2
6 5.4 3.9 1.7 0.4 setosa 1 # Should be 1
7 4.6 3.4 1.4 0.3 setosa 1 # Should be 2
8 5 3.4 1.5 0.2 setosa 1 # Should be 2
9 4.4 2.9 1.4 0.2 setosa 1 # Should be 2
10 4.9 3.1 1.5 0.1 setosa 1 # Should be 2
I can use case_when or if_else for the two columns but the actual dataset has a lot more columns. So I'm looking for a dplyr solution where I don't have to type out all the columns.
library(tidyverse)
iris %>%
mutate(
count = rowSums(across(contains("Sepal"), ~ between(.x, 1, 5)))
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 2
4 4.6 3.1 1.5 0.2 setosa 2
5 5.0 3.6 1.4 0.2 setosa 2
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 2
8 5.0 3.4 1.5 0.2 setosa 2
9 4.4 2.9 1.4 0.2 setosa 2
10 4.9 3.1 1.5 0.1 setosa 2
EDIT:
With c_across. To my understanding, c_across has to be used with rowwise() to perform rowwise aggregation and calculation.
iris %>%
rowwise() %>%
mutate(count = sum(between(c_across(contains("Sepal")), 1, 5)))
I am new to tidyverse. I want to join all columns but one (as the names of the other columns might vary). Here an example with iris that does not work obviously. Thanks :)
library(tidyverse)
dat <- as_tibble(iris)
dat %>% mutate(New = str_c(!Sepal.Length, sep="_"))
We can use select to select the columns that we want to paste and apply str_c with do.call.
library(tidyverse)
dat %>% mutate(New = do.call(str_c, c(select(., !Sepal.Length), sep="_")))
However, using unite would be simpler.
dat %>% unite(New, !Sepal.Length, sep="_", remove= FALSE)
# Sepal.Length New Sepal.Width Petal.Length Petal.Width Species
# <dbl> <chr> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5_1.4_0.2_setosa 3.5 1.4 0.2 setosa
# 2 4.9 3_1.4_0.2_setosa 3 1.4 0.2 setosa
# 3 4.7 3.2_1.3_0.2_setosa 3.2 1.3 0.2 setosa
# 4 4.6 3.1_1.5_0.2_setosa 3.1 1.5 0.2 setosa
# 5 5 3.6_1.4_0.2_setosa 3.6 1.4 0.2 setosa
# 6 5.4 3.9_1.7_0.4_setosa 3.9 1.7 0.4 setosa
# 7 4.6 3.4_1.4_0.3_setosa 3.4 1.4 0.3 setosa
# 8 5 3.4_1.5_0.2_setosa 3.4 1.5 0.2 setosa
# 9 4.4 2.9_1.4_0.2_setosa 2.9 1.4 0.2 setosa
#10 4.9 3.1_1.5_0.1_setosa 3.1 1.5 0.1 setosa
# … with 140 more rows
using base
dat <- iris
cols <- grepl("Sepal.Length", names(dat))
tmp <- dat[, !cols]
dat$new <- apply(tmp, 1, paste0, collapse = "_")
head(dat)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
#> 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
#> 2 4.9 3.0 1.4 0.2 setosa 3.0_1.4_0.2_setosa
#> 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
#> 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
#> 5 5.0 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
#> 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
Created on 2021-02-01 by the reprex package (v1.0.0)
We can reduce
library(dplyr)
library(purrr)
library(stringr)
dat %>%
mutate(New = select(., -Sepal.Length) %>%
reduce(str_c, sep="_"))
# A tibble: 150 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
# 2 4.9 3 1.4 0.2 setosa 3_1.4_0.2_setosa
# 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
# 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
# 5 5 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
# 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
# 7 4.6 3.4 1.4 0.3 setosa 3.4_1.4_0.3_setosa
# 8 5 3.4 1.5 0.2 setosa 3.4_1.5_0.2_setosa
# 9 4.4 2.9 1.4 0.2 setosa 2.9_1.4_0.2_setosa
#10 4.9 3.1 1.5 0.1 setosa 3.1_1.5_0.1_setosa
# … with 140 more rows
library(tidyverse)
iris %>% as_tibble() %>% select(everything())
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
Say I want to select everything in the iris data frame except Species. How do I list this one exception while utilizing tidyselect::everything()?
My actual pipe is below, and when
... %>%
group_by(`ID`) %>%
fill(everything, .direction = "updown") %>%
... %>%
and I get the following error:
Error: Column ID can't be modified because it's a grouping variable
You would do
iris %>% as_tibble() %>% select(-Species)
but assuming you have good reason not to want that, here's a way using everything()
iris %>% as_tibble() %>% select(setdiff(everything(), one_of("Species")))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 1.4 0.3
#> 8 5 3.4 1.5 0.2
#> 9 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 1.5 0.1
#> # ... with 140 more rows
(or just iris %>% as_tibble() %>% select(setdiff(everything(), 5)) if it's acceptable)
I would like to create a function that takes a data frame and a character vector containing column names as input, and uses tidy verse quoting functions inside in a safe manner.
I believe I have a working example of what I want to do. I would like to know if there is a more elegant solution or I am thinking about this problem incorrectly (perhaps I shouldn't want to do this?). From what I can tell, in order to avoid variable scoping issues I need to wrap the column names in .data[[]] and make it an expression before unquoting for tidy verse NSE verbs.
Regarding previous questions this answer is along the right lines but I want to abstract the code into a function. A github issue
asks about this but using rlang::syms won't work as far as I can tell because the
combination of the column labels with .data makes it an expression not a symbol.
Here
and here
solve the problem but as far as I can tell don't account for a subtle bug in which the variables can leak
in from the environment if they don't exist as column labels in the dataframe or the solutions don't work for the input being a vector of labels.
# Setup
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("rlang")))
# define iris with and without Sepal.Width column
iris <- tibble::as_tibble(iris)
df_with_missing <- iris %>% select(-Sepal.Width)
# This should not be findable by my function
Sepal.Width <- iris$Sepal.Width * -1
################
# Now lets try a function for which we programmatically define the column labels
programmatic_mutate_y <- function(df, col_names, safe = FALSE) {
# Add .data[[]] to the col_names to make evalutation safer
col_exprs <- rlang::parse_exprs(
purrr::map_chr(
col_names,
~ glue::glue(stringr::str_c('.data[["{.x}"]]'))
)
)
output <- dplyr::mutate(df, product = purrr::pmap_dbl(list(!!!col_exprs), ~ prod(...)))
output
}
################
# The desired output
testthat::expect_error(programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length")))
programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 17.8
#> 2 4.9 3 1.4 0.2 setosa 14.7
#> 3 4.7 3.2 1.3 0.2 setosa 15.0
#> 4 4.6 3.1 1.5 0.2 setosa 14.3
#> 5 5 3.6 1.4 0.2 setosa 18
#> 6 5.4 3.9 1.7 0.4 setosa 21.1
#> 7 4.6 3.4 1.4 0.3 setosa 15.6
#> 8 5 3.4 1.5 0.2 setosa 17
#> 9 4.4 2.9 1.4 0.2 setosa 12.8
#> 10 4.9 3.1 1.5 0.1 setosa 15.2
#> # … with 140 more rows
Created on 2019-08-09 by the reprex package (v0.3.0)
I think you are making things complicated. With the _at variant, you can use strings as arguments in almost every dplyr functions. purrr::pmap_dbl() is used to map calculation by rows.
programmatic_mutate_y_v1 <- function(df, col_names, safe = FALSE) {
df["product"] <- purrr::pmap_dbl(dplyr::select_at(df,col_names),prod)
return(df)
}
programmatic_mutate_y_v1(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows
We can turn col_names into a single expression with parse_expr and paste, then unquote when being evaluated in mutate:
library(dplyr)
library(rlang)
programmatic_mutate_y <- function(df, col_names){
mutate(df, product = !!parse_expr(paste(col_names, collapse = "*")))
}
Output:
> parse_expr(paste(c("Sepal.Width", "Sepal.Length"), collapse = "*"))
Sepal.Width * Sepal.Length
> programmatic_mutate_y(df_with_missing, c("Sepal.Width", "Sepal.Length"))
> Error: object 'Sepal.Width' not found
> programmatic_mutate_y(iris, c("Sepal.Width", "Sepal.Length"))
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species product
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 17.8
2 4.9 3 1.4 0.2 setosa 14.7
3 4.7 3.2 1.3 0.2 setosa 15.0
4 4.6 3.1 1.5 0.2 setosa 14.3
5 5 3.6 1.4 0.2 setosa 18
6 5.4 3.9 1.7 0.4 setosa 21.1
7 4.6 3.4 1.4 0.3 setosa 15.6
8 5 3.4 1.5 0.2 setosa 17
9 4.4 2.9 1.4 0.2 setosa 12.8
10 4.9 3.1 1.5 0.1 setosa 15.2
# ... with 140 more rows
My current dataframe in R has the following dimensions
nrows=605
ncol: 1514
The first column indicates the class/ label and my dataset has only two classes namely: setosa and iris.
test[1:5,]
class id1 id2...
1: setosa 2 4.....
2: setosa 2 5 .....
3: setosa 5 4 .....
4: iris 5 9......
5: iris 7 9 ....
However the dataframe is ordered as of now : ie. Rows 2- row 233 of my dataframe correspond to class setosa and class iris is from 234 until end. I want the dataset to be rearranged so that the samples are mixed up.
The expected output should be in following form:
If I do df[1:10,] ie. 10 lines of dataframe ,I should be able to see samples of both iris and setosa. Any ideas or suggestion on how to do this?
library( tidyverse )
iris[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5.0 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
# 10 4.9 3.1 1.5 0.1 setosa
df <- iris %>%
group_by( Species ) %>%
mutate( id = row_number() ) %>%
arrange( id ) %>%
select ( -id )
df[1:10,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 7 3.2 4.7 1.4 versicolor
# 3 6.3 3.3 6 2.5 virginica
# 4 4.9 3 1.4 0.2 setosa
# 5 6.4 3.2 4.5 1.5 versicolor
# 6 5.8 2.7 5.1 1.9 virginica
# 7 4.7 3.2 1.3 0.2 setosa
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 7.1 3 5.9 2.1 virginica
# 10 4.6 3.1 1.5 0.2 setosa