Vectorised column operations in dplyr - r

I am looking for a tidy way of incorporating vectorised operations on columns using dplyr.
Basically, having a simple df as follows:
library(dplyr)
df <- data.frame("X" = runif(10),
"Y" = runif(10), "Z" = runif(10)) %>%
as_tibble() # tbl_df() is deprecated; as_tibble() is re-exported by dplyr
I am now looking to apply the following vectorised formula:
Formula <- "X / Y + lag(Z)"
Of course the following won't work as it is looking for a column 'X / Y + lag(Z)':
df %>% mutate(Result := !!sym(Formula))
Can anyone suggest a simple way of applying a formula from a vector directly in a pipe on columns to achieve:
df %>% mutate(Result = X/Y+lag(Z))

Is this what you're looking for?
set.seed(1)
df <- data.frame("X" = runif(10),
"Y" = runif(10), "Z" = runif(10)) %>%
as_tibble() # tbl_df() is deprecated; as_tibble() is re-exported by dplyr
Formula <- "X / Y + lag(Z)"
df <- df %>% mutate(Result = eval(parse(text = Formula)))
X Y Z Result
<dbl> <dbl> <dbl> <dbl>
1 0.153 0.0158 0.527 NA
2 0.322 0.231 0.327 1.93
3 0.479 0.0958 0.365 5.33
4 0.764 0.537 0.105 1.79
5 0.180 0.223 0.0243 0.913
6 0.178 0.538 0.975 0.355
7 0.869 0.820 0.845 2.03
8 0.356 0.263 0.0628 2.20
9 0.0399 0.710 0.968 0.119
10 0.863 0.422 0.825 3.02
parse() turns the string into an unevaluated expression, and eval() then evaluates it inside mutate(), where X, Y and Z resolve to the columns of df.
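The same parse-then-evaluate idea can be sketched in base R, without dplyr. The data frame df_small and the formula string below are made up for illustration, and lag() is left out because that helper comes from dplyr:

```r
# Hypothetical toy data, not the df from the question
df_small <- data.frame(X = c(2, 4, 6), Y = c(1, 2, 3), Z = c(10, 20, 30))
formula_string <- "X / Y + Z"

# parse() returns an expression vector; [[1]] extracts the single unevaluated call
parsed_call <- parse(text = formula_string)[[1]]

# eval() looks the symbols X, Y and Z up among the data frame's columns
result <- eval(parsed_call, envir = df_small)
print(result)
#> [1] 12 22 32
```

Inside mutate() the data frame plays the role of envir, which is why the one-liner in the answer works.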

With the tidyverse, rlang::parse_expr() can be used to turn the string into an expression, which !! then injects into mutate():
library(dplyr)
df <- df %>%
mutate(Calc_Col = !! rlang::parse_expr(Formula))
and if we need to pass the column name as a variable, use := (as @Nick mentioned in the comments)
Name <- "Calc_Col"
df <- df %>%
mutate(!!Name := !!rlang::parse_expr(Formula))
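Because both answers run whatever text is in Formula as code, it may be worth checking which names the formula refers to before evaluating it. A minimal base R sketch, assuming the available columns are X, Y and Z as in the question (used_vars and missing_vars are illustrative names, not part of either answer):

```r
formula_string <- "X / Y + lag(Z)"
available_cols <- c("X", "Y", "Z")  # assumed column names, as in the question

parsed_call <- parse(text = formula_string)[[1]]

# all.vars() lists the variable names in the call;
# function names such as lag, / and + are not included by default
used_vars <- all.vars(parsed_call)
missing_vars <- setdiff(used_vars, available_cols)

if (length(missing_vars) > 0) {
  stop("Formula refers to unknown columns: ", paste(missing_vars, collapse = ", "))
}
print(used_vars)
#> [1] "X" "Y" "Z"
```

This only checks column names; it does not restrict which functions the string may call.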

Related

How can I show an output in r by a condition

I'm trying to print output from my code based on a condition, print.condition(pregunta_menos < 12347), but it only displays an error. How can I display the column with the minimum sum in the output?
Here's part of my code:
pregunta_menos <- colSums(!is.na(df))
as.data.table(pregunta_menos,keep.rownames = TRUE)
I need to print "The Minimum column is:" followed by the name of that column.
Perhaps:
library(tidyverse)
set.seed(111)
#some dummy data
(df <-
paste0('x', 1:5) %>%
map_dfc(~ tibble('{.x}' := rnorm(5))))
#> # A tibble: 5 × 5
#> x1 x2 x3 x4 x5
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.235 0.140 -0.174 -1.57 0.362
#> 2 -0.331 -1.50 -0.407 -0.0859 0.347
#> 3 -0.312 -1.01 1.85 -0.359 0.190
#> 4 -2.30 -0.948 0.394 -1.19 -0.160
#> 5 -0.171 -0.494 0.798 0.364 0.327
min_col <-
df %>%
colSums() %>%
which.min() %>%
names()
paste("The Minimum column is:", min_col)
#> [1] "The Minimum column is: x2"
#Calculate only with numeric columns
#select the numeric columns first, so this also works when the
#number of rows differs from the number of columns
min_col_numeric_only <-
df %>%
select(where(is.numeric)) %>%
colSums() %>%
which.min() %>%
names()
paste("The Minimum column is:", min_col_numeric_only)
#> [1] "The Minimum column is: x2"
Created on 2021-11-23 by the reprex package (v2.0.1)
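The question's own starting point, colSums(!is.na(df)), counts the non-missing values per column, so the same which.min() step applies there too. A small base R sketch with made-up data (the survey data frame is hypothetical):

```r
# Hypothetical survey answers with some missing values
survey <- data.frame(
  q1 = c(1, NA, 3, 4),
  q2 = c(NA, NA, 3, 4),
  q3 = c(1, 2, 3, NA)
)

# Number of non-missing answers per column
answered <- colSums(!is.na(survey))

# which.min() returns the first position of the minimum; names() gives the column
min_col <- names(which.min(answered))
label <- paste("The Minimum column is:", min_col)
print(label)
#> [1] "The Minimum column is: q2"
```

If several columns tie for the minimum, which.min() reports only the first of them.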

How to pass tibble of variable names and function calls to tibble

I'm trying to go from a tibble of variable names and functions like this:
N <- 100
dat <-
tibble(
variable_name = c("a", "b"),
variable_value = c("rnorm(N)", "rnorm(N)")
)
to a tibble with two variables a and b of length N
dat2 <-
tibble(
a = rnorm(N),
b = rnorm(N)
)
is there a !!! or rlang-y way to accomplish this?
We can evaluate the strings as code:
library(dplyr)
library(purrr)
library(tibble)
deframe(dat) %>%
map_dfc(~ eval(rlang::parse_expr(.x)))
-output
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 0.0750 2.55
2 -1.65 -1.48
3 1.77 -0.627
4 0.766 -0.0411
5 0.832 0.200
6 -1.91 -0.533
7 -0.0208 -0.266
8 -0.409 1.08
9 -1.38 -0.181
10 0.727 0.252
# … with 90 more rows
Here is a mostly base R way, with a pipe and an as_tibble call.
Map(function(x) eval(str2lang(x)), setNames(dat$variable_value, dat$variable_name)) %>%
as_tibble
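To make the evaluation step easier to follow, here is a base R sketch that swaps rnorm(N) for deterministic expressions so the result can be checked by eye. The spec data frame and its contents are made up for illustration:

```r
N <- 5

# Hypothetical specification: one column name and one code string per row
spec <- data.frame(
  variable_name = c("a", "b"),
  variable_value = c("seq_len(N)", "rev(seq_len(N))"),
  stringsAsFactors = FALSE
)

# Named character vector: names become column names, values are code strings
named_code <- setNames(spec$variable_value, spec$variable_name)

# str2lang() parses one string into a call; eval() runs it, finding N by lexical scoping
columns <- lapply(named_code, function(code) eval(str2lang(code)))

out <- as.data.frame(columns)
print(out)
```

Here out has a column a holding 1 to 5 and a column b holding 5 to 1, mirroring what the answers above do with random draws.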

Tidyverse way to get lm()-residuals using a subset of columns as predictors but keeping all in output

I have some data like this:
dat = tibble(
var1 = rep(c('A','B'),each=5)
, var2 = rnorm(10)
, var3 = rnorm(10)
, var4 = rnorm(10)
, var5 = rnorm(10)
)
I can get what I want by explicitly naming the columns to be used in the lm formula:
dat %>%
#dat has columns: var1 through var5
dplyr::group_by(var1) %>%
dplyr::mutate(
resids = resid(lm( var2 ~ var3 + var4 ))
)
But I actually have many columns in my real data set, and the number and names of the ones I'll be using will vary. I do know the names of the ones I don't want, so I thought this would work:
dat %>%
#dat has columns: var1 through var5
dplyr::group_by(var1) %>%
dplyr::mutate(
resids = resid(lm(
formula = var2 ~ .
, data = . %>% select(-var1, -var5)
))
)
But that doesn't seem to work. Any suggestions?
Your code doesn't work because the special symbol "." in data = . %>% select(-var1, -var5) is actually the input data, which has not been grouped. Since dplyr 1.0.0 this can be solved with cur_data(), which gives the data for the current group (excluding grouping variables). Note that from dplyr 1.1.0 on, cur_data() is deprecated in favour of pick().
dat %>%
group_by(var1) %>%
mutate(
resids = resid(lm(var2 ~ ., cur_data() %>% select(-var5)))
)
Note that I use select(-var5) instead of select(-var1, -var5) because cur_data() has already excluded the grouping variables, i.e. var1, so select(-var1, -var5) will give an error:
Can't subset columns that don't exist.
As @IRTFM comments, you can also select variables before passing the data to lm. Remember to use cur_data() rather than ".".
dat %>%
group_by(var1) %>%
select(-var5) %>%
mutate(
resids = resid(lm(var2 ~ ., cur_data()))
)
Here you still don't need to exclude var1 in select() because grouping variables cannot be excluded.
Since you know the variables that aren't going to be on the right-hand side of the model, one option is to get those names into a vector and then build the formula for lm(). This means doing an extra step outside your pipe chain.
Building the formula is a relatively common approach for, e.g., making functions for fitting models. I wrote a blog post to show this approach here.
In your case, you can pull the names of the variables you want in the model as a character vector based on the variables you don't want.
modvars = dat %>%
select(-var1, -var5, -var2) %>%
names()
modvars
[1] "var3" "var4"
You can build the formula for model fitting using as.formula() after pasting the variables together. This is what that would look like:
as.formula(paste("var2 ~", paste(modvars, collapse = "+") ) )
var2 ~ var3 + var4
Even easier is the reformulate() approach from the comments, thanks to @BenBolker.
reformulate(modvars, response = "var2")
var2 ~ var3 + var4
You could build this outside your pipe chain or put it directly within the chain. Here I do the latter.
dat %>%
dplyr::group_by(var1) %>%
dplyr::mutate(
resids = resid(lm(
formula = reformulate(modvars, response = "var2") )
)
)
# A tibble: 10 x 6
# Groups: var1 [2]
var1 var2 var3 var4 var5 resids
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.0792 0.265 0.637 -0.106 0.386
2 A -0.845 0.386 1.20 1.55 -0.232
3 A 0.465 1.12 -0.750 0.726 -0.141
4 A -0.365 -1.19 0.174 0.347 -0.126
5 A 0.395 -0.0515 -0.464 -0.0934 0.112
6 B -2.83 -0.0664 -0.0958 0.588 -1.99
7 B 0.383 1.16 -0.339 0.492 0.838
8 B 1.35 0.270 2.40 0.626 -0.512
9 B 0.620 -1.33 1.32 -0.148 0.688
10 B 0.664 -0.0487 0.426 -0.158 0.973
The residuals match your original approach where you wrote the formula out:
dat %>%
#dat has columns: var1 through var5
dplyr::group_by(var1) %>%
dplyr::mutate(
resids = resid(lm( var2 ~ var3 + var4 ))
)
# A tibble: 10 x 6
# Groups: var1 [2]
var1 var2 var3 var4 var5 resids
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.0792 0.265 0.637 -0.106 0.386
2 A -0.845 0.386 1.20 1.55 -0.232
3 A 0.465 1.12 -0.750 0.726 -0.141
4 A -0.365 -1.19 0.174 0.347 -0.126
5 A 0.395 -0.0515 -0.464 -0.0934 0.112
6 B -2.83 -0.0664 -0.0958 0.588 -1.99
7 B 0.383 1.16 -0.339 0.492 0.838
8 B 1.35 0.270 2.40 0.626 -0.512
9 B 0.620 -1.33 1.32 -0.148 0.688
10 B 0.664 -0.0487 0.426 -0.158 0.973
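For comparison, the same build-the-formula idea can be sketched in base R with an explicit loop over groups. The seed and the simulated data are arbitrary, so the numbers will not match the output above:

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dat <- data.frame(
  var1 = rep(c("A", "B"), each = 5),
  var2 = rnorm(10), var3 = rnorm(10), var4 = rnorm(10), var5 = rnorm(10)
)

# Predictors are all columns except the grouping, response and unwanted ones
predictors <- setdiff(names(dat), c("var1", "var2", "var5"))
model_formula <- reformulate(predictors, response = "var2")

# Fit one model per group and store that group's residuals in place
dat$resids <- NA_real_
for (g in unique(dat$var1)) {
  rows <- dat$var1 == g
  fit <- lm(model_formula, data = dat[rows, ])
  dat$resids[rows] <- resid(fit)
}
```

Passing the subset through the data argument keeps each fit restricted to its own group.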
The issue here is not the subsetting but rather the confusion over the . in the formula: does it refer to the variables or to the data?
There are also other issues here, i.e. with the select. You should do data = {.} %>% select(-var1, -var5) rather than what you have, or simply data = select(., -var1, -var5). How to solve this issue:
use nest_by + unnest
With nest_by(), the grouping variable is automatically removed from the data:
dat %>%
nest_by(var1) %>%
mutate(resid = list(resid(lm(formula = var2 ~ .-var5, data = data))))%>%
unnest(c(data, resid))
use group_by + summarize
dat %>%
group_by(var1) %>%
summarise(resid = list(resid(lm(formula = var2 ~ . - var5, data = cur_data()))), .groups = "drop") %>% # cur_data() is the current group's data without var1; a bare . would be all rows
unnest(resid)
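On the meaning of . inside a formula: it stands for every column of the data argument that is not already used as the response, and terms can be removed from it with a minus sign. A base R sketch with a made-up data frame d shows the expansion via terms():

```r
# Hypothetical data with the same column names as in the question (minus var1)
d <- data.frame(
  var2 = c(1.5, 2.1, 3.3, 4.0),
  var3 = c(0.2, 0.4, 0.1, 0.9),
  var4 = c(5, 3, 8, 1),
  var5 = c(2, 2, 7, 4)
)

# terms() expands the dot using the columns of d, then drops var5
expanded <- terms(var2 ~ . - var5, data = d)
term_labels <- attr(expanded, "term.labels")
print(term_labels)
#> [1] "var3" "var4"
```

This formula dot is unrelated to the magrittr placeholder ., which is why the two are easy to confuse inside a pipe.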

function for dplyr with argument that defaults to "."

Let's say I want to sum over all columns in a tibble to create a new column called "total". I could do:
library(tibble)
library(dplyr)
set.seed(42)
N <- 10
Df <- tibble(p_1 = rnorm(N),
p_2 = rnorm(N),
q_1 = rnorm(N),
q_2 = rnorm(N))
# Works fine
Df %>% mutate(total = apply(., 1, sum))
I could make a helper function like so,
myfun <- function(Df){
apply(Df, 1, sum)
}
# Works fine
Df %>% mutate(total = myfun(.))
But let's say this myfun was usually going to be used in this way, i.e. within a dplyr verb function. Then the "." referencing the data frame is a bit superfluous, and it would be nice if the myfun function could replace it with a default value. I'd like something like this:
myfun2 <- function(Df=.){
apply(Df, 1, sum)
}
which does not work.
Df %>% mutate(total = myfun2())
Error in mutate_impl(.data, dots) :
Evaluation error: object '.' not found.
Because I am not even sure how the "." works, I don't think I can formulate the question better, but basically I want to know if there is a way of saying, in effect, "if Df is not defined in myfun2, get the data frame that is normally referenced by '.'".
One option would be to quote the function and then evaluate with !!
library(tidyverse)
myfun <- function() {
quote(reduce(., `+`))
}
r1 <- Df %>%
mutate(total = !! myfun())
r1
# A tibble: 10 x 5
# p_1 p_2 q_1 q_2 total
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1.37 1.30 -0.307 0.455 2.82
# 2 -0.565 2.29 -1.78 0.705 0.645
# 3 0.363 -1.39 -0.172 1.04 -0.163
# 4 0.633 -0.279 1.21 -0.609 0.960
# 5 0.404 -0.133 1.90 0.505 2.67
# 6 -0.106 0.636 -0.430 -1.72 -1.62
# 7 1.51 -0.284 -0.257 -0.784 0.186
# 8 -0.0947 -2.66 -1.76 -0.851 -5.37
# 9 2.02 -2.44 0.460 -2.41 -2.38
#10 -0.0627 1.32 -0.640 0.0361 0.654
Note that reduce was used to be more in line with the tidyverse, but the OP's function can also be quoted to get the same result
myfun2 <- function() {
quote(apply(., 1, sum ))
}
r2 <- Df %>%
mutate(total = !! myfun2())
all.equal(r2$total, r1$total)
#[1] TRUE
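What makes this work is that the helper returns unevaluated code that still contains the symbol ., and the value of . is only supplied at evaluation time. A base R sketch of that mechanism, using eval() with a list in place of !! and the pipe (row_total_call and toy are illustrative names):

```r
# Returns the call itself, not its result; . is still just a symbol here
row_total_call <- function() {
  quote(apply(., 1, sum))
}

# Hypothetical toy data
toy <- data.frame(p_1 = c(1, 2), p_2 = c(10, 20))

# Supply a value for . only when the call is evaluated
total <- eval(row_total_call(), list(. = toy))
print(unname(total))
#> [1] 11 22
```

In the answer, !! injects the quoted call into mutate(), and the pipe provides the binding for ..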

Order data frame by the last column with dplyr

library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(colnames(df) %>% tail(1) %>% desc())
I am looping over a list of data frames. There are different columns in the data frames and the last column of each may have a different name.
I need to arrange every data frame by its last column. The simple case looks like the above code.
Using arrange_at and ncol:
df %>% arrange_at(ncol(.), desc)
As arrange_at will be deprecated in the future, you could also use:
# option 1
df %>% arrange(desc(.[[ncol(.)]])) # [[ ]] extracts the last column as a vector
# option 2
df %>% arrange(across(ncol(.), desc))
If we need to arrange by the last column name, either use the name string
df %>%
arrange_at(vars(last(names(.))), desc)
Or specify the index
df %>%
arrange_at(ncol(.), desc)
The new dplyr way (I guess from 1.0.0 on) would be using across(last_col()):
library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(across(last_col(), desc))
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.283 0.443 1.30 0.910
#> 2 0.797 -0.0819 -0.936 0.828
#> 3 0.0717 -0.858 -0.355 0.671
#> 4 -1.38 -1.08 -0.472 0.426
#> 5 1.52 1.43 -0.0593 0.249
#> 6 0.827 -1.28 1.86 0.0824
#> 7 -0.448 0.0558 -1.48 -0.143
#> 8 0.377 -0.601 0.238 -0.918
#> 9 0.770 1.93 1.23 -1.43
#> 10 0.0532 -0.0934 -1.14 -2.08
> packageVersion("dplyr")
#> [1] ‘1.0.4’
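For the looping use case from the question, a base R sketch that sorts every data frame in a list by its last column, whatever that column is called. The helper sort_by_last and the example frames are hypothetical:

```r
# Sort a data frame in descending order of its last column
sort_by_last <- function(d) {
  d[order(d[[ncol(d)]], decreasing = TRUE), , drop = FALSE]
}

# Hypothetical list of data frames whose last columns have different names
frames <- list(
  first = data.frame(a = 1:3, score = c(0.2, 0.9, 0.5)),
  second = data.frame(x = c("p", "q"), total = c(5, 7), stringsAsFactors = FALSE)
)

sorted <- lapply(frames, sort_by_last)
print(sorted$first)
```

The dplyr equivalent inside lapply() or purrr::map() would be the arrange(across(last_col(), desc)) call shown above.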
