How to pass tibble of variable names and function calls to tibble - r

I'm trying to go from a tibble of variable names and functions like this:
N <- 100
dat <-
tibble(
variable_name = c("a", "b"),
variable_value = c("rnorm(N)", "rnorm(N)")
)
to a tibble with two variables a and b of length N
dat2 <-
tibble(
a = rnorm(N),
b = rnorm(N)
)
Is there a !!! or rlang-y way to accomplish this?

We can evaluate the string
library(dplyr)
library(purrr)
library(tibble)
deframe(dat) %>%
map_dfc(~ eval(rlang::parse_expr(.x)))
-output
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 0.0750 2.55
2 -1.65 -1.48
3 1.77 -0.627
4 0.766 -0.0411
5 0.832 0.200
6 -1.91 -0.533
7 -0.0208 -0.266
8 -0.409 1.08
9 -1.38 -0.181
10 0.727 0.252
# … with 90 more rows
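Since the question asks specifically about !!!, here is a minimal sketch of a splicing variant. It assumes the same dat and N as in the question; the intermediate name exprs is only illustrative. Each string is parsed into an unevaluated call, the calls are named after variable_name, and the named list is spliced into tibble():

```r
library(tibble)
library(rlang)

N <- 100
dat <- tibble(
  variable_name  = c("a", "b"),
  variable_value = c("rnorm(N)", "rnorm(N)")
)

# Parse each string into an unevaluated call (nothing is run yet)
exprs <- lapply(dat$variable_value, parse_expr)

# Name the calls so that they become column names after splicing
names(exprs) <- dat$variable_name

# Splice the named calls into tibble(); each call is evaluated as one column
dat2 <- tibble(!!!exprs)
```

As with any approach that parses strings, this runs whatever code the strings contain, so it is only appropriate when variable_value comes from a trusted source.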

Here is a base way with a pipe and an as_tibble call.
Map(function(x) eval(str2lang(x)), setNames(dat$variable_value, dat$variable_name)) %>%
as_tibble
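If no packages at all are wanted, the same idea can be sketched with base functions only. This assumes R 3.6 or later (for str2lang) and returns a plain data.frame rather than a tibble; the name calls is only illustrative:

```r
N <- 100
dat <- data.frame(
  variable_name  = c("a", "b"),
  variable_value = c("rnorm(N)", "rnorm(N)"),
  stringsAsFactors = FALSE
)

# Named character vector: names become columns, values are the calls to run
calls <- setNames(dat$variable_value, dat$variable_name)

# Parse and evaluate each string, then bind the results column-wise
dat2 <- as.data.frame(lapply(calls, function(x) eval(str2lang(x))))
```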

Related

Using group_modify with selected columns (retaining whole data frame and order)

I have run out of R power on this one. I appreciate any help, it is probably quite simple for someone with more experience.
I have a data frame (tibble) with some numerical columns, a group column, and some other columns with other information. I want to do operations on the numerical columns, by group, but still retain all the columns.
I've put an example below: I am replacing the NAs with the group mean, for each column. The columns to replace the NAs are specified by the df_names variable.
It basically works, except it removes all columns except the numerical ones, AND reorders everything. Which makes it hard to reassemble. I could work around this, but I have a feeling there must be a simpler way to direct group_modify to specified columns, while retaining the other columns, and keeping the order.
Can anyone help? Thanks so much in advance!
Will
library("tidyverse")
# create tibble
df <- tibble(
name=letters[1:10],
csize=c("L","S","S","L","L","S","L","S","L","S"),
v1=rnorm(10),
v2=rnorm(10),
v3=rnorm(10)
)
# introduce some missing data
df$v1[3] <- NA
df$v1[6] <- NA
df$v1[7] <- NA
df$v3[2] <- NA
# these are the cols where I want to replace the NAs
df_names <- c("v1","v2","v3")
# this is the grouping variable (has to be stored as a string, since it is an input to the function)
groupvar <- "csize"
# now I want to replace the NAs with column means, restricted to their group
# the following line works, but the problem is that it removes the name column, and reorders the rows...
df_imp <- df %>% group_by(.dots=groupvar) %>% select(df_names) %>% group_modify( ~{replace_na(.x,as.list(colMeans(.x, na.rm=TRUE)))})
group_modify is overkill in this case; mutate(across()) is your friend here:
df %>% group_by(.dots = groupvar) %>%
mutate(across(all_of(df_names), ~if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))
Result:
> df
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L -1.22 1.48 -0.628
2 b S -1.17 0.0890 -0.130
3 c S -0.422 -0.0956 -0.0271
4 d L -0.265 0.180 -0.786
5 e L -0.491 0.509 -0.359
6 f S -0.422 -0.712 0.232
7 g L -0.400 -1.13 1.13
8 h S -0.538 -0.0785 0.690
9 i L 0.373 0.308 0.252
10 j S 0.445 0.743 -1.41
Does this work:
> library(dplyr)
> df %>% group_by(csize) %>% mutate(across(v1:v3, ~ replace_na(., mean(., na.rm = T))))
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L 1.57 0.310 -1.76
2 b S -0.705 0.0655 0.577
3 c S -1.05 1.28 1.82
4 d L 0.958 -2.09 -0.371
5 e L -0.712 0.247 -1.13
6 f S -1.05 -0.516 -0.107
7 g L 0.403 1.79 0.128
8 h S -0.793 1.52 1.07
9 i L -0.206 -0.369 -1.77
10 j S -1.65 -0.992 -0.476
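Because the grouping variable in the question is stored as a string, here is a small sketch of the same imputation that passes the string through across(all_of()) instead of the .dots argument. It assumes dplyr 1.0.0 or later; the data values are made up so that the result can be checked by hand:

```r
library(dplyr)
library(tidyr)

df <- tibble(
  name  = letters[1:6],
  csize = c("L", "S", "S", "L", "L", "S"),
  v1    = c(1, NA, 3, 5, NA, 7),
  v2    = c(NA, 2, 4, 6, 8, 10)
)

df_names <- c("v1", "v2")  # columns in which to replace NAs
groupvar <- "csize"        # grouping column, held as a string

df_imp <- df %>%
  # group by the column named in groupvar
  group_by(across(all_of(groupvar))) %>%
  # replace each NA with the mean of its own group, column by column
  mutate(across(all_of(df_names), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()
```

mutate keeps every column and the original row order, so name and csize survive untouched.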

In R, use nonstandard evaluation to select specific variables from data.frames

I've got several large-ish data.frames set up like a relational database, and I'd like to make a single function to look for whatever variable I need and grab it from that particular data.frame and add it to the data.frame I'm currently working on. I've got a way to do this that works, but it requires temporarily making a list of all the data.frames, which seems inefficient. I suspect that nonstandard evaluation would solve this problem for me, but I'm not sure how to do it.
Here's what works but seems inefficient:
Table1 <- data.frame(ID = LETTERS[1:10], ColA = rnorm(10), ColB = rnorm(10),
ColC = rnorm(10))
Table2 <- data.frame(ID = LETTERS[1:10], ColD = rnorm(10), ColE = rnorm(10),
ColF = rnorm(10))
Table3 <- data.frame(ID = LETTERS[1:10], ColG = rnorm(10), ColH = rnorm(10),
ColI = rnorm(10))
Key <- data.frame(Table = rep(c("Table1", "Table2", "Table3"), each = 4),
ColumnName = c("ID", paste0("Col", LETTERS[1:3]),
"ID", paste0("Col", LETTERS[4:6]),
"ID", paste0("Col", LETTERS[7:9])))
# function for grabbing info from other tables
grab <- function(StartDF, ColNames){
AllDFs <- list(Table1, Table2, Table3)
names(AllDFs) <- c("Table1", "Table2", "Table3")
# Determine which data.frames have that column
WhichDF <- Key %>% filter(ColumnName %in% ColNames) %>%
select(Table)
TempDF <- StartDF
for(i in 1:length(ColNames)){
ToAdd <- AllDFs[WhichDF[i, 1]]
ToAdd <- ToAdd[[1]] %>%
select(c(ColNames[i], ID))
TempDF <- TempDF %>% left_join(ToAdd)
rm(ToAdd)
}
return(TempDF)
}
grab(Table1, c("ColE", "ColH"))
What would be great instead would be something like this:
grab <- function(StartDF, ColNames){
# Some function that returns the column names of all the data.frames
# without me creating a new object that is a list of them
# Some function that left_joins the correct data.frame plus the column
# "ID" to my starting data.frame, again without needing to create that list
# of all the data.frames
}
Instead of creating the list manually, we can use mget to fetch the objects named in the 'Table' column of the 'Key' dataset directly:
library(dplyr)
library(purrr)
grab <- function(StartDF, ColNames){
# filter the rows of Key based on the ColNames input
# pull the Table column as a vector
# column was factor, so convert to character class
# return the value of the objects with mget in a list
Tables <- Key %>%
filter(ColumnName %in% ColNames) %>%
pull(Table) %>%
as.character %>%
mget(envir = .GlobalEnv)
TempDF <- StartDF
# use the same left_joins in a loop after selecting only the
# ID and corresponding columns from 'ColNames'
for(i in seq_along(ColNames)){
ToAdd <- Tables[[i]] %>%
select(ColNames[i], ID)
TempDF <- TempDF %>%
left_join(ToAdd)
rm(ToAdd)
}
TempDF
}
grab(Table1, c("ColE", "ColH"))
Or another option is reduce
grab <- function(StartDF, ColNames) {
#only change is that instead of a for loop
# use reduce with left_join after selecting the corresponding columns
# with map
Key %>%
filter(ColumnName %in% ColNames) %>%
pull(Table) %>%
as.character %>%
mget(envir = .GlobalEnv) %>%
map2(ColNames, ~ .x %>%
select(ID, .y)) %>%
append(list(StartDF), .) %>%
reduce(left_join)
}
grab(Table1, c("ColE", "ColH"))
# ID ColA ColB ColC ColE ColH
#1 A -0.9490093 0.5177143 -1.91015491 0.07777086 1.86277670
#2 B -0.7182786 -1.1019146 -0.70802738 -0.73965230 0.18375660
#3 C 0.5064516 -1.6904354 1.11106206 2.04315508 -0.65365228
#4 D 0.9362477 0.5260682 -0.03419651 -0.51628310 -1.17104181
#5 E 0.5636047 -0.9470895 0.43303304 -2.95928629 1.86425049
#6 F 1.0598531 0.4144901 0.10239896 1.57681703 -0.05382603
#7 G 1.1335047 -0.8282173 -0.28327898 2.02917831 0.50768462
#8 H 0.2941341 0.3261185 -0.15528127 -0.46470035 -0.86561320
#9 I -2.1434905 0.6567689 0.02298549 0.90822132 0.64360337
#10 J 0.4291258 1.3410147 0.67544567 0.12466251 0.75989623
There is a serious bug in the accepted solution: it only works if the columns in ColNames are listed in the same order as their tables appear in Key. Also, I redefined your data to use tibbles instead. They're basically the same as data frames, but their default settings are nicer (e.g. you don't need stringsAsFactors = FALSE)
library(tidyverse)
Table1 <- tibble(
ID = LETTERS[1:10], ColA = rnorm(10), ColB = rnorm(10), ColC = rnorm(10)
)
Table2 <- tibble(
ID = LETTERS[1:10], ColD = rnorm(10), ColE = rnorm(10), ColF = rnorm(10)
)
Table3 <- tibble(
ID = LETTERS[1:10], ColG = rnorm(10), ColH = rnorm(10), ColI = rnorm(10)
)
Key <- tibble(
Table = rep(c("Table1", "Table2", "Table3"), each = 4),
ColumnName = c("ID", paste0("Col", LETTERS[1:3]),
"ID", paste0("Col", LETTERS[4:6]),
"ID", paste0("Col", LETTERS[7:9]))
)
grab_akrun <- function(StartDF, ColNames) {
#only change is that instead of a for loop
# use reduce with left_join after selecting the corresponding columns
# with map
Key %>%
filter(ColumnName %in% ColNames) %>%
pull(Table) %>%
as.character %>%
mget(envir = .GlobalEnv) %>%
map2(ColNames, ~ .x %>%
select(ID, .y)) %>%
append(list(Table1), .) %>%
reduce(left_join)
}
grab_akrun(Table1, c("ColE", "ColH"))
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 6
#> ID ColA ColB ColC ColE ColH
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A -0.658 -0.613 0.689 -0.850 -0.795
#> 2 B 0.143 0.732 -0.212 -1.74 1.99
#> 3 C -0.966 -0.570 -0.354 0.559 -1.11
#> 4 D -1.05 0.269 -0.856 -0.370 -1.35
#> 5 E 0.255 -0.349 0.329 1.39 0.421
#> 6 F 1.51 1.38 0.707 -0.639 0.289
#> 7 G -1.28 1.44 -1.35 1.94 -1.04
#> 8 H -1.56 -0.434 0.231 0.467 0.656
#> 9 I -0.553 -1.64 -0.761 0.133 0.249
#> 10 J -0.950 0.418 -0.843 0.593 0.343
This works, but if you change the order:
grab_akrun(Table1, c("ColH", "ColE"))
#> Error: Unknown column `ColH`
Instead, you should approach it like this:
grab_new <- function(StartDF, ColNames) {
Key %>%
filter(ColumnName %in% ColNames) %>%
pluck("Table") %>%
mget(inherits = TRUE) %>%
map(~select(.x, ID, intersect(colnames(.x), ColNames))) %>%
reduce(left_join, .init = StartDF)
}
grab_new(Table1, c("ColE", "ColH"))
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 6
#> ID ColA ColB ColC ColE ColH
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A -0.658 -0.613 0.689 -0.850 -0.795
#> 2 B 0.143 0.732 -0.212 -1.74 1.99
#> 3 C -0.966 -0.570 -0.354 0.559 -1.11
#> 4 D -1.05 0.269 -0.856 -0.370 -1.35
#> 5 E 0.255 -0.349 0.329 1.39 0.421
#> 6 F 1.51 1.38 0.707 -0.639 0.289
#> 7 G -1.28 1.44 -1.35 1.94 -1.04
#> 8 H -1.56 -0.434 0.231 0.467 0.656
#> 9 I -0.553 -1.64 -0.761 0.133 0.249
#> 10 J -0.950 0.418 -0.843 0.593 0.343
grab_new(Table1, c("ColH", "ColE"))
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 6
#> ID ColA ColB ColC ColE ColH
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A -0.658 -0.613 0.689 -0.850 -0.795
#> 2 B 0.143 0.732 -0.212 -1.74 1.99
#> 3 C -0.966 -0.570 -0.354 0.559 -1.11
#> 4 D -1.05 0.269 -0.856 -0.370 -1.35
#> 5 E 0.255 -0.349 0.329 1.39 0.421
#> 6 F 1.51 1.38 0.707 -0.639 0.289
#> 7 G -1.28 1.44 -1.35 1.94 -1.04
#> 8 H -1.56 -0.434 0.231 0.467 0.656
#> 9 I -0.553 -1.64 -0.761 0.133 0.249
#> 10 J -0.950 0.418 -0.843 0.593 0.343
Which works as expected.
Created on 2020-01-21 by the reprex package (v0.3.0)
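The order-independence can also be checked on small deterministic data. The following is only a sketch under a few assumptions: the function name grab_sketch is hypothetical, the tables live in an environment reachable from the function, and dplyr 1.0.0 or later is available for any_of(). It joins explicitly by "ID", which also silences the "Joining, by" messages:

```r
library(dplyr)
library(purrr)

Table1 <- tibble(ID = LETTERS[1:3], ColA = 1:3)
Table2 <- tibble(ID = LETTERS[1:3], ColE = 4:6)
Table3 <- tibble(ID = LETTERS[1:3], ColH = 7:9)

Key <- tibble(
  Table      = rep(c("Table1", "Table2", "Table3"), each = 2),
  ColumnName = c("ID", "ColA", "ID", "ColE", "ID", "ColH")
)

grab_sketch <- function(StartDF, ColNames) {
  # names of the tables that hold at least one requested column
  tbl_names <- Key %>%
    filter(ColumnName %in% ColNames) %>%
    pull(Table) %>%
    unique()

  # fetch those tables by name, searching enclosing environments
  tables <- mget(tbl_names, inherits = TRUE)

  tables %>%
    # keep ID plus whichever requested columns each table actually has
    map(~ select(.x, ID, any_of(ColNames))) %>%
    # join everything onto the starting data frame, one table at a time
    reduce(function(acc, tbl) left_join(acc, tbl, by = "ID"), .init = StartDF)
}

res1 <- grab_sketch(Table1, c("ColE", "ColH"))
res2 <- grab_sketch(Table1, c("ColH", "ColE"))
```

Because each table selects its own columns by name rather than by position in ColNames, res1 and res2 come out the same.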

Changing / coercing multiple columns of a tibble while avoiding a loop

I have a tibble with several columns in which numbers are stored as text:
my_tbl <- tibble(names = letters[1:5],
value1 = as.character(runif(5)),
value2 = as.character(runif(5)))
Now, I'd like to change the type of these columns ("value1" and "value2") from character to numeric. The only option I've found is using a for loop:
for (i in 2:ncol(my_tbl)) {
my_tbl[[i]] <- as.numeric(my_tbl[[i]])
}
Is there a possibility to do this without a loop?
You can use mutate_if from dplyr:
library(dplyr)
# grouping by names keeps that column out of mutate_if, which skips grouping variables
my_tbl <- my_tbl %>%
group_by(names) %>%
mutate_if(is.character, as.numeric)
my_tbl
## A tibble: 5 x 3
## Groups: names [5]
# names value1 value2
# <chr> <dbl> <dbl>
#1 a 0.427 0.0191
#2 b 0.817 0.300
#3 c 0.108 0.158
#4 d 0.394 0.643
#5 e 0.775 0.311
With purrr you could do this:
If you already know your target columns :
library(purrr)
modify_at(my_tbl,-1,as.numeric)
If you need to detect them:
modify_if(my_tbl,~is.character(.) && !any(grepl("[[:alpha:]]",.)),as.numeric)
# # A tibble: 5 x 3
# names value1 value2
# <chr> <dbl> <dbl>
# 1 a 0.715 0.943
# 2 b 0.639 0.128
# 3 c 0.471 0.0395
# 4 d 0.374 0.374
# 5 e 0.500 0.800
Using dplyr instead of purrr, these will yield the same results:
library(dplyr)
mutate_at(my_tbl,-1,as.numeric)
mutate_if(my_tbl,~is.character(.) && !any(grepl("[[:alpha:]]",.)),as.numeric)
The base R translations:
my_tbl[-1] <- lapply(my_tbl[-1],as.numeric)
my_tbl[] <- lapply(my_tbl,function(x)
if (is.character(x) && !any(grepl("[[:alpha:]]",x))) as.numeric(x)
else x)
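With dplyr 1.0.0 or later, the scoped verbs above can also be written with across(). A minimal sketch, assuming the target columns share the "value" prefix as in the question (the values here are made up):

```r
library(dplyr)

my_tbl <- tibble(
  names  = letters[1:3],
  value1 = c("0.5", "1.25", "2"),
  value2 = c("3", "4.5", "6")
)

# Convert every column whose name starts with "value"; names is left alone
converted <- my_tbl %>%
  mutate(across(starts_with("value"), as.numeric))
```

Selecting by name rather than by position (-1) keeps the call correct if the column order changes.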

function for dplyr with argument that defaults to "."

Let's say I want to sum over all columns in a tibble to create a new column called "total". I could do:
library(tibble)
library(dplyr)
set.seed(42)
N <- 10
Df <- tibble(p_1 = rnorm(N),
p_2 = rnorm(N),
q_1 = rnorm(N),
q_2 = rnorm(N))
# Works fine
Df %>% mutate(total = apply(., 1, sum))
I could make a helper function like so,
myfun <- function(Df){
apply(Df, 1, sum)
}
# Works fine
Df %>% mutate(total = myfun(.))
But let's say this myfun was usually going to be used in this way, i.e. within a dplyr verb function, then the "." referencing the data frame is a bit superfluous, and it would be nice if the myfun function could replace this with a default value. I'd like something like this:
myfun2 <- function(Df=.){
apply(Df, 1, sum)
}
which does not work.
Df %>% mutate(total = myfun2())
Error in mutate_impl(.data, dots) :
Evaluation error: object '.' not found.
Because I am not even sure how the "." works, I don't think I can formulate the question better, but basically, I want to know if there is a way of saying, in effect, if the Df is not defined in myfun2, get the data-frame that is normally referenced by "."?
One option would be to quote the function and then evaluate with !!
library(tidyverse)
myfun <- function() {
quote(reduce(., `+`))
}
r1 <- Df %>%
mutate(total = !! myfun())
r1
# A tibble: 10 x 5
# p_1 p_2 q_1 q_2 total
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1.37 1.30 -0.307 0.455 2.82
# 2 -0.565 2.29 -1.78 0.705 0.645
# 3 0.363 -1.39 -0.172 1.04 -0.163
# 4 0.633 -0.279 1.21 -0.609 0.960
# 5 0.404 -0.133 1.90 0.505 2.67
# 6 -0.106 0.636 -0.430 -1.72 -1.62
# 7 1.51 -0.284 -0.257 -0.784 0.186
# 8 -0.0947 -2.66 -1.76 -0.851 -5.37
# 9 2.02 -2.44 0.460 -2.41 -2.38
#10 -0.0627 1.32 -0.640 0.0361 0.654
Note that reduce was used to be more in line with the tidyverse, but the OP's function can also be quoted to get the same result
myfun2 <- function() {
quote(apply(., 1, sum ))
}
r2 <- Df %>%
mutate(total = !! myfun2())
all.equal(r2$total, r1$total)
#[1] TRUE
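Another possibility, sketched here under the assumption that dplyr 1.1.0 or later is installed, is to give the helper a default argument of pick(everything()), which returns the current data as a tibble when evaluated inside a dplyr verb. The name myfun3 is hypothetical, and rowSums stands in for apply(Df, 1, sum):

```r
library(dplyr)

Df <- tibble(p_1 = c(1, 2, 3), q_1 = c(4, 5, 6))

# The default argument is evaluated lazily, i.e. only once myfun3() is
# running inside mutate(), where pick() can see the current columns
myfun3 <- function(Df = pick(everything())) {
  rowSums(Df)
}

out <- Df %>% mutate(total = myfun3())
```

Called outside a dplyr verb without an argument, myfun3() would fail, because pick() only works inside a data-masking context; passing a data frame explicitly still works anywhere.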

Order data frame by the last column with dplyr

library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(colnames(df) %>% tail(1) %>% desc())
I am looping over a list of data frames. There are different columns in the data frames and the last column of each may have a different name.
I need to arrange every data frame by its last column. The simple case looks like the above code.
Using arrange_at and ncol:
df %>% arrange_at(ncol(.), desc)
As arrange_at will be deprecated in the future, you could also use:
# option 1
df %>% arrange(desc(.[[ncol(.)]]))
# option 2
df %>% arrange(across(ncol(.), desc))
If we need to arrange by the last column name, either use the name string
df %>%
arrange_at(vars(last(names(.))), desc)
Or specify the index
df %>%
arrange_at(ncol(.), desc)
The new dplyr way (I guess from 1.0.0 on) would be using across(last_col()):
library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(across(last_col(), desc))
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.283 0.443 1.30 0.910
#> 2 0.797 -0.0819 -0.936 0.828
#> 3 0.0717 -0.858 -0.355 0.671
#> 4 -1.38 -1.08 -0.472 0.426
#> 5 1.52 1.43 -0.0593 0.249
#> 6 0.827 -1.28 1.86 0.0824
#> 7 -0.448 0.0558 -1.48 -0.143
#> 8 0.377 -0.601 0.238 -0.918
#> 9 0.770 1.93 1.23 -1.43
#> 10 0.0532 -0.0934 -1.14 -2.08
> packageVersion("dplyr")
#> [1] ‘1.0.4’
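Since the question is about looping over a list of data frames whose last columns have different names, here is a short sketch applying the across(last_col()) approach to each element. The list dfs and its contents are made up for illustration; dplyr 1.0.0 or later is assumed:

```r
library(dplyr)

dfs <- list(
  tibble(a = c(1, 2, 3), score = c(0.2, 0.9, 0.5)),
  tibble(x = c("u", "v", "w"), y = c(10, 30, 20), last_one = c(3, 1, 2))
)

# Sort each data frame by its own last column, in descending order
sorted <- lapply(dfs, function(d) arrange(d, across(last_col(), desc)))
```

Each data frame is sorted independently, so the last column never has to be named.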
