I want to add a new column based on a given character vector.
For example, I want to add the column d defined in expr below:
library(magrittr)
data <- tibble::tibble(
a = c(1, 2),
b = c(3, 4)
)
expr <- "d = a + b"
so that the result matches the following:
data %>%
dplyr::mutate(d = a + b)
# # A tibble: 2 x 3
# a b d
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
However, in the code below, while the calculations themselves (i.e., the addition) work, the names of the new columns differ from what I expected.
data %>%
dplyr::mutate(!!rlang::parse_expr(expr))
# # A tibble: 2 x 3
# a b `d = a + b`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
data %>%
dplyr::mutate(!!rlang::parse_quo(expr, env = rlang::global_env()))
# # A tibble: 2 x 3
# a b `d = a + b`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
data %>%
dplyr::mutate(rlang::eval_tidy(rlang::parse_expr(expr)))
# # A tibble: 2 x 3
# a b `rlang::eval_tidy(rlang::parse_expr(expr))`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
How can I properly use an expression in dplyr::mutate?
My question is similar to this, but in my example, the new variable (d) and its definition (a + b) are given in a single character vector (expr).
Let's first look at what kind of expressions dplyr::mutate takes to create named variables: we need a named list of expressions, where each element's name becomes the new column name and the expression defines its values.
library(tidyverse)
data <- tibble::tibble(
a = c(1, 2),
b = c(3, 4)
)
expr <- "d = a + b"
# let's rewrite the string above as named list containing an expression.
expr2 <- list(d = expr(a + b))
# this works as expected:
data %>%
mutate(!!! expr2)
#> # A tibble: 2 x 3
#> a b d
#> <dbl> <dbl> <dbl>
#> 1 1 3 4
#> 2 2 4 6
Now we simply need a function that transforms a string into a named list containing the expression of the right-hand side of the equation. The name needs to be the left-hand side of the equation. We can do this with regular string manipulations. Finally we need to transform the right-hand side of the equation from a string into an expression. We can use base R's str2lang here.
create_expr_ls <- function(str_expr) {
expr_nm <- str_extract(str_expr, "^\\w+")
expr_code <- str_replace_all(str_expr, "(^\\w+\\s?=\\s?)(.*)", "\\2")
set_names(list(str2lang(expr_code)), expr_nm)
}
expr3 <- create_expr_ls(expr)
data %>%
mutate(!!! expr3)
#> # A tibble: 2 x 3
#> a b d
#> <dbl> <dbl> <dbl>
#> 1 1 3 4
#> 2 2 4 6
Created on 2022-01-23 by the reprex package (v0.3.0)
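As an alternative to the regular-expression split, the string can be parsed once and the two sides of the assignment read off the resulting call. This is a minimal sketch, assuming expr always has the form name = expression; split_assignment is a hypothetical helper name, not part of rlang or dplyr:

```r
library(dplyr)
library(rlang)

data <- tibble(a = c(1, 2), b = c(3, 4))
expr <- "d = a + b"

# Hypothetical helper: parse "name = expression" and return a named list
# holding the right-hand side as an unevaluated expression.
split_assignment <- function(str_expr) {
  parsed <- parse_expr(str_expr)     # a call to `=` with two arguments
  stopifnot(is_call(parsed, "="))    # fail early on any other input shape
  set_names(list(parsed[[3]]), as_string(parsed[[2]]))
}

res <- data %>% mutate(!!!split_assignment(expr))
res
```

Because the split happens on the parsed call rather than on the raw text, an = sign inside the right-hand side (for example in a function argument) should not confuse it.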
Any of these works. The second is similar to the first but does not require that rlang be on the search path. The third and fourth also work if the d = part is not present in expr, in which case default names are used. The last one uses only base R and is also the shortest.
data %>% mutate(within(., !!parse_expr(expr)))
data %>% mutate(within(., !!parse(text = expr)))
data %>% mutate(data, !!parse_expr(sprintf("tibble(%s)", expr)))
data %>% { eval_tidy(parse_expr(sprintf("mutate(., %s)", expr))) }
within(data, eval(parse(text = expr))) # base R
Note
Assume this preamble:
library(dplyr)
library(rlang)
# input
data <- tibble(a = c(1, 2), b = c(3, 4))
expr <- "d = a + b"
To get the desired name for the mutated column, you can still use the same syntax and assign the results to a column with the preferred name. To get this name you can use a regular expression to find what is before = and then remove any leading or trailing spaces that might exist.
expr <- "x = a * b"
col_name <- trimws(stringr::str_extract(expr, "[^=]+"))
data %>%
dplyr::mutate(!!col_name := !!rlang::parse_expr(expr))
# A tibble: 2 × 3
a b x
<dbl> <dbl> <dbl>
1 1 3 3
2 2 4 8
data %>%
dplyr::mutate(!!col_name := !!rlang::parse_quo(expr, env = rlang::global_env()))
# A tibble: 2 × 3
a b x
<dbl> <dbl> <dbl>
1 1 3 3
2 2 4 8
data %>%
dplyr::mutate(!!col_name := rlang::eval_tidy(rlang::parse_expr(expr)))
# A tibble: 2 × 3
a b x
<dbl> <dbl> <dbl>
1 1 3 3
2 2 4 8
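The same idea extends to several definitions at once. This is a sketch under the assumption that each string contains a column name, then the first = sign, then the definition; the vector exprs and the intermediate names are introduced here for illustration only:

```r
library(dplyr)
library(rlang)

data <- tibble(a = c(1, 2), b = c(3, 4))
exprs <- c("d = a + b", "e = a * b")   # hypothetical input

# Everything before the first "=" is the column name ...
col_names <- trimws(sub("=.*$", "", exprs))
# ... and everything after it is the definition.
rhs <- trimws(sub("^[^=]*=", "", exprs))

# Parse each definition and name it after its target column.
new_cols <- set_names(lapply(rhs, parse_expr), col_names)

res <- data %>% mutate(!!!new_cols)
res
```

Splicing the named list with !!! lets mutate create all columns in one call, in the order given.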
I'm looking for some help with this possibly basic issue. Suppose I have the following:
tibble(
x= c("a","a","b","b","b","b"),
y= c(1,2,1,2,1,2)
)
# A tibble: 6 x 2
# x y
# <chr> <dbl>
# 1 a 1
# 2 a 2
# 3 b 1
# 4 b 2
# 5 b 1
# 6 b 2
I would like to transform to the following tibble:
tibble(
x= c("a","b","b"),
y.1= c("1","1","1"),
y.2= c("2","2","2")
)
# A tibble: 3 x 3
# x y.1 y.2
# <chr> <chr> <chr>
# 1 a 1 2
# 2 b 1 2
# 3 b 1 2
What's the best way to achieve this? I tried tidyr::pivot_wider, but I couldn't figure out how to keep the two separate "b" rows in the x column.
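Given the current structure, one possible approach is to use recycling of a logical index: c(TRUE, FALSE) is recycled along the vector and keeps every odd-numbered element, while c(FALSE, TRUE) keeps every even-numbered one.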
library(tidyverse)
df = tibble(
x= c("a","a","b","b","b","b"),
y= c(1,2,1,2,1,2)
)
df %>%
summarise(
x = x[c(TRUE, FALSE)],
y.1 = y[c(TRUE, FALSE)],
y.2 = y[c(FALSE, TRUE)]
)
#> # A tibble: 3 x 3
#> x y.1 y.2
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 b 1 2
#> 3 b 1 2
Note, however, that this breaks if the rows are not in strictly alternating order (first value, then second value) within each pair.
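The pairing can also be made explicit and handed to pivot_wider. This is only a sketch: it still assumes that every two consecutive rows belong together, and the helper columns pair and pos are names introduced here for illustration:

```r
library(dplyr)
library(tidyr)

df <- tibble(
  x = c("a", "a", "b", "b", "b", "b"),
  y = c(1, 2, 1, 2, 1, 2)
)

res <- df %>%
  mutate(pair = (row_number() + 1) %/% 2) %>%  # rows 1-2 -> 1, rows 3-4 -> 2, ...
  group_by(pair) %>%
  mutate(pos = row_number()) %>%               # position within the pair: 1 or 2
  ungroup() %>%
  pivot_wider(names_from = pos, values_from = y, names_prefix = "y.") %>%
  select(-pair)
res
```

The pair column is what keeps the two "b" groups apart; without it, pivot_wider would see duplicate identifiers for x = "b".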
I want to turn this line of code into a function:
mutate(var_avg = rowMeans(select(., starts_with("var"))))
It works in the pipe:
df <- read_csv("var_one,var_two,var_three
1,1,1
2,2,2
3,3,3")
df %>% mutate(var_avg = rowMeans(select(., starts_with("var"))))
># A tibble: 3 x 4
> var_one var_two var_three var_avg
> <dbl> <dbl> <dbl> <dbl>
>1 1 1 1 1
>2 2 2 2 2
>3 3 3 3 3
Here's my attempt (I'm new at writing functions):
colnameMeans <- function(x) {
columnname <- paste0("avg_",x)
mutate(columnname <- rowMeans(select(., starts_with(x))))
}
It doesn't work.
df %>% colnameMeans("var")
>Error in colnameMeans(., "var") : unused argument ("var")
I have a lot to learn about functions and I'm not sure where to start with fixing this. Any help would be much appreciated. Note that this is a simplified example. In my real data, I have several column prefixes and I want to calculate a row-wise mean for each one. EDIT: Being able to run the function for multiple prefixes at once would be a bonus.
To assign a column name held in a string on the left-hand side, use := and unquote the string with !!. The <- inside mutate won't work: mutate names new columns with =, and an unquoted name on the left of = is taken literally rather than evaluated. In addition, the function needs to take the data as an argument so that the pipe can pass it in:
library(dplyr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
mutate(., !! columnname := rowMeans(select(., starts_with(x))))
}
df %>%
colnameMeans('var')
# A tibble: 3 x 4
# var_one var_two var_three avg_var
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
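If there are several prefixes, loop over them with map_dfc (using transmute so that each call returns only the new column), then bind the results to the original data: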
library(purrr)
library(stringr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
transmute(., !! columnname := rowMeans(select(., starts_with(x))))
}
map_dfc(c('var', 'alt'), ~ df1 %>%
colnameMeans(.x)) %>%
bind_cols(df1, .)
# A tibble: 3 x 8
# var_one var_two var_three alt_var_one alt_var_two alt_var_three avg_var avg_alt
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1 1 1 1
#2 2 2 2 2 2 2 2 2
#3 3 3 3 3 3 3 3 3
data
df1 <- bind_cols(df, df %>% rename_all(~ str_replace(., 'var_', 'alt_var_')))
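A variation that handles all prefixes inside one function is to fold over them with purrr::reduce, adding one column per step. This is a sketch on made-up data; add_prefix_means and the alt_ columns are assumptions for illustration, not part of the original question:

```r
library(dplyr)
library(purrr)

# Hypothetical example data with two column prefixes
df1 <- tibble(
  var_one = c(1, 2, 3), var_two = c(1, 2, 3),
  alt_one = c(4, 5, 6), alt_two = c(6, 7, 8)
)

# Hypothetical helper: add one avg_<prefix> column per prefix.
add_prefix_means <- function(data, prefixes) {
  reduce(prefixes, function(acc, prefix) {
    columnname <- paste0("avg_", prefix)
    # Row-wise mean over the columns whose names start with the prefix
    mutate(acc, !!columnname := rowMeans(select(acc, starts_with(prefix))))
  }, .init = data)
}

res <- add_prefix_means(df1, c("var", "alt"))
res
```

Since each step returns the full data frame, no separate bind_cols call is needed afterwards.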
I am trying to rename all columns in my data frame using dplyr and stringr, but it seems not to be working the way I desire. How should I change the following code to get the output I want (shown in the code below)?
Here is the fully reproducible code:
library(dplyr)
library(stringr)
library(tibble)
library(rlang)
# dataframe
x <-
tibble::as.tibble(cbind(
Grace_neu_wrong = c(1:4),
Grace_acc_wrong = c(1:4),
Grace_att_wrong = c(1:4),
Grace_int_wrong = c(1:4)
))
# defining custom function to rename the entire dataframe in a certain way
string_conversion <- function(df, ...) {
# preparing the dataframe
df <- dplyr::select(.data = df,
!!rlang::quo(...))
# custom function to split the name of each column in a certain way
splitfn <- function(x) {
x <- as.character(x)
split <- stringr::str_split(string = x, pattern = "_")[[1]]
paste(split[2], split[3], '_', split[1], sep = '')
}
# applying the splitfn function to each column name and outputting the data frame
df_new <- df %>%
dplyr::select_all(.funs = colnames) %>%
dplyr::mutate_all(.funs = splitfn)
return(df_new)
}
# the output I get
string_conversion(df = x, names(x))
#> # A tibble: 4 x 4
#> Grace_neu_wrong Grace_acc_wrong Grace_att_wrong Grace_int_wrong
#> <chr> <chr> <chr> <chr>
#> 1 NANA_1 NANA_1 NANA_1 NANA_1
#> 2 NANA_1 NANA_1 NANA_1 NANA_1
#> 3 NANA_1 NANA_1 NANA_1 NANA_1
#> 4 NANA_1 NANA_1 NANA_1 NANA_1
# the output I desire
tibble::as.tibble(cbind(
neuwrong_Grace = c(1:4),
accwrong_Grace = c(1:4),
attwrong_Grace = c(1:4),
intwrong_Grace = c(1:4)
))
#> # A tibble: 4 x 4
#> neuwrong_Grace accwrong_Grace attwrong_Grace intwrong_Grace
#> <int> <int> <int> <int>
#> 1 1 1 1 1
#> 2 2 2 2 2
#> 3 3 3 3 3
#> 4 4 4 4 4
Created on 2018-02-08 by the reprex package (v0.1.1.9000).
You can do this in a single line without mutate, which is meant for changing column values rather than column names. Instead, use stringr::str_replace with a regular expression.
The pattern "(.*)_(.*)_(.*)" is three groups of characters separated by underscores.
We simply make the replacement "\\2\\3_\\1", which is group 2, then group 3, then an underscore, then group 1, giving us the desired result.
The code is consequently just one line long:
names(x) <- str_replace(names(x), "(.*)_(.*)_(.*)", "\\2\\3_\\1")
print(x)
# A tibble: 4 x 4
neuwrong_Grace accwrong_Grace attwrong_Grace intwrong_Grace
<int> <int> <int> <int>
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
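If you prefer to stay inside a pipe, the same replacement can be passed to dplyr::rename_with. This is a minimal sketch assuming dplyr 1.0.0 or later and a cut-down version of the data with two columns:

```r
library(dplyr)
library(stringr)

x <- tibble(
  Grace_neu_wrong = 1:4,
  Grace_acc_wrong = 1:4
)

# rename_with applies the function to the column names, not the values
res <- x %>%
  rename_with(~ str_replace(.x, "(.*)_(.*)_(.*)", "\\2\\3_\\1"))
names(res)
```

Unlike assigning to names(x), this leaves the original x untouched and returns a renamed copy.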
I've imported an Excel data set and want to convert nearly all of its columns (more than 90) from character to numeric. What is the best way to do this? Converting each column one by one isn't efficient.
This should do as you wish:
# Random data frame for illustration (100 columns wide)
df <- data.frame(replicate(100, sample(0:1, 1000, replace = TRUE)))
# Check column names (just in case you want to look up column positions)
colnames(df)
# Specify columns
cols <- seq_along(df) # adapts automatically if you add more columns later
# Or, to specify particular column numbers only:
# cols <- c(1:100)
# With the help of the magrittr compound assignment pipe (%<>%), change all to numeric
library(magrittr)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Check our columns are numeric
str(df)
Assuming your data is already imported with all character columns, you can convert the relevant columns to numeric using mutate_at by position or name:
suppressPackageStartupMessages(library(tidyverse))
# Assume the imported Excel file has 5 columns, a to e
df <- tibble(a = as.character(1:3),
b = as.character(5:7),
c = as.character(8:10),
d = as.character(2:4),
e = as.character(2:4))
# select the columns by position (convert all except 'b')
df %>% mutate_at(c(1, 3:5), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# or drop the columns that shouldn't be used ('b' and 'd' should stay as chr)
df %>% mutate_at(-c(2, 4), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# select the columns by name
df %>% mutate_at(c("a", "c", "d", "e"), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
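In dplyr 1.0.0 and later, the same conversions can be written with across() inside mutate(). This is a sketch on a small made-up tibble; the id column stands in for whichever columns should stay character:

```r
library(dplyr)

# Hypothetical import result: every column arrived as character
df <- tibble(
  id = c("x", "y", "z"),
  a  = c("1", "2", "3"),
  b  = c("4.5", "5", "6")
)

# Convert every column except id to numeric
res <- df %>% mutate(across(-id, as.numeric))
res
```

The selection inside across() accepts the same helpers as select(), so positions, names, or starts_with() can be used to pick the columns.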
I have a tibble in which one column is a list column whose elements always hold two numeric values named a and b (e.g. as a result of calling purrr::map with a function that returns a list), say:
df <- tibble(x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))
df
# A tibble: 3 × 2
x y
<int> <list>
1 1 <list [2]>
2 2 <list [2]>
3 3 <list [2]>
How do I separate the list column y into two columns a and b, and get:
df_res <- tibble(x = 1:3, a = c(1,3,5), b = c(2,4,6))
df_res
# A tibble: 3 × 3
x a b
<int> <dbl> <dbl>
1 1 1 2
2 2 3 4
3 3 5 6
Looking for something like tidyr::separate to deal with a list instead of a string.
Using dplyr (current release: 0.7.0):
bind_cols(df[1], bind_rows(df$y))
# # A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
# 1 1 1 2
# 2 2 3 4
# 3 3 5 6
edit based on OP's comment:
To embed this in a pipe and in case you have many non-list columns, we can try:
df %>% select(-y) %>% bind_cols(bind_rows(df$y))
We could also make use of map_df from purrr:
library(tidyverse)
df %>%
summarise(x = list(x), new = list(map_df(.$y, bind_rows))) %>%
unnest
# A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
#1 1 1 2
#2 2 3 4
#3 3 5 6
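Later tidyr releases (1.0.0 and up) added unnest_wider(), which does this directly. This is a minimal sketch, assuming such a version is installed:

```r
library(tibble)
library(tidyr)

df <- tibble(
  x = 1:3,
  y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6))
)

# Each named element of the list column becomes its own column
res <- unnest_wider(df, y)
res
```

This keeps all other columns in place, so it works unchanged when there are many non-list columns.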