Using the pipe in unique() function in R is not working

I am having trouble using the pipe operator (%>%) with the unique function.
df = data.frame(
a = c(1,2,3,1),
b = 'a')
unique(df$a) # no problem here
df %>% unique(.$a) # not working here
# I got "Error: argument 'incomparables != FALSE' is not used (yet)"
Any idea?

As other answers mention, df %>% unique(.$a) is equivalent to df %>% unique(., .$a).
To force the dots to be explicit you can do:
df %>% {unique(.$a)}
# [1] 1 2 3
An alternative option from magrittr
df %$% unique(a)
# [1] 1 2 3
Or possibly stating the obvious:
df$a %>% unique()
# [1] 1 2 3

What is happening is that %>% takes the object on the left hand side and feeds it into the first argument of the function by default, and then will feed in other arguments as provided. Here is an example:
df = data.frame(
a = c(1,2,3,1),
b = 'a')
MyFun<-function(x,y=FALSE){
return(match.call())
}
> df %>% MyFun(.$a)
MyFun(x = ., y = .$a)
Here %>% matches df to x and .$a to y.
So for unique your code is being interpreted as:
unique(x=df, incomparables=.$a)
which explains the error. In your case you need to pull out a before you run unique. If you want to stick with %>%, you can use df %>% .$a %>% unique(), but obviously there are lots of other ways to do that.
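A minimal sketch of the argument matching described above, assuming magrittr is installed. show_args is a hypothetical helper written only for this illustration; it is not part of any package.

```r
library(magrittr)

df <- data.frame(a = c(1, 2, 3, 1), b = "a")

# Hypothetical helper that simply returns what it received.
show_args <- function(x, y = NULL) list(x = x, y = y)

# `.` appears only inside a nested call (.$a), so df is still
# inserted as the first argument and .$a becomes the second one.
nested <- df %>% show_args(.$a)

# Braces switch off the automatic first-argument insertion,
# so the column is the only argument that gets passed.
braced <- df %>% { show_args(.$a) }

is.data.frame(nested$x)  # x received the whole data frame
is.numeric(braced$x)     # x received just the column
```

The same reasoning applies to unique(): without braces, the column ends up in the second argument (incomparables), not the first.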


R: What is the expected output of passing a character vector to dplyr::all_of()?

I am trying to understand the expected output of dplyr::group_by() in conjunction with the use of dplyr::all_of(). My understanding is that using dplyr::all_of() should convert character vectors containing variable names to bare names so that group_by() can use them, but this doesn't appear to happen.
Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.
What is expected when using all_of and how might I alternatively pass a character vector to group_by to process as a vector of bare names?
library(dplyr)
# Create a 40-row data.frame (var is recycled to the length of bar)
# with 2 variables, each with 2 unique values.
df <- data.frame(var = rep(c("a", "b"), 10),
bar = rep(c(1, 2), 20))
# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())
# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())
# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())
# Output 4: Error in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(in_var) %>%
summarize(n = n())
})
# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(all_of(in_var)) %>%
summarize(n = n())
})
We can use group_by_at
lapply(foo2, function(in_var) df %>%
group_by_at(all_of(in_var)) %>%
summarise(n = n()))
-output
#[[1]]
# A tibble: 2 x 2
# var n
#* <chr> <int>
#1 a 20
#2 b 20
#[[2]]
# A tibble: 2 x 2
# bar n
#* <dbl> <int>
#1 1 20
#2 2 20
As across replaces some of the functionality of group_by_at, we can use it instead with all_of:
lapply(foo2, function(in_var) df %>%
group_by(across(all_of(in_var))) %>%
summarise(n = n()))
Or convert to symbol and evaluate (!!)
lapply(foo2, function(in_var) df %>%
group_by(!! rlang::sym(in_var)) %>%
summarise(n = n()))
Or use map
library(purrr)
map(foo2, ~ df %>%
group_by(!! rlang::sym(.x)) %>%
summarise(n = n()))
Or instead of group_by, it can be count
map(foo2, ~ df %>%
count(across(all_of(.x))))
To add to @akrun's answer, which shows multiple ways to achieve the desired output: my understanding is that all_of() is a selection helper for variables stored as a character vector, for use in dplyr functions that accept selection syntax, and that it uses vctrs underneath. any_of() is a less strict version of all_of() that does not complain about missing names, which is convenient in some use cases.
Reading ?tidyselect::all_of is helpful. This page is also helpful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
The scoped dplyr verbs are being superseded by across, based on decisions by the devs at RStudio. See ?group_by_at or the other *_if, *_at, *_all documentation. So I guess it really depends on what version of dplyr you are using in your workflow and what works best for you.
This SO post also gives context of changes in solutions over time with passing characters into dplyr functions, and there's probably more posts out there.
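A small sketch of the difference between the two helpers, assuming a recent dplyr. The column name "not_a_column" is made up for the illustration, and df mirrors the 40-row data from the question.

```r
library(dplyr)

df <- data.frame(var = rep(c("a", "b"), 20),
                 bar = rep(c(1, 2), 20))

# A character vector where one name does not exist in df (hypothetical name).
wanted <- c("var", "not_a_column")

# any_of() silently skips names that are missing.
lenient <- df %>% select(any_of(wanted))

# all_of() is strict: a missing name raises an error, caught here.
strict <- tryCatch(
  df %>% select(all_of(wanted)),
  error = function(e) "all_of() raised an error"
)

# Inside group_by(), wrap the helper in across() so that it is
# evaluated as a column selection and not as a plain value.
grouped <- df %>%
  group_by(across(all_of("var"))) %>%
  summarise(n = n())
```

Both helpers are meant for selection contexts such as select() or across(); group_by() on its own computes on values, which is why the across() wrapper is needed there.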

Moving from mutate_all to across() in dplyr 1.0

With the new release of dplyr I am refactoring quite a lot of code and removing functions that are now retired or deprecated. I had a function that is as follows:
processingAggregatedLoad <- function (df) {
defined <- ls()
passed <- names(as.list(match.call())[-1])
if (any(!defined %in% passed)) {
stop(paste("Missing values for the following arguments:", paste(setdiff(defined, passed), collapse=", ")))
}
df_isolated_load <- df %>% select(matches("snsr_val")) %>% mutate(global_demand = rowSums(.)) # we get isolated load
df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind")) # we get isolated quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate_all(~ factor(.), colnames(df_isolated_load_qlty)) %>%
mutate_each(funs(as.numeric(.)), colnames(df_isolated_load_qlty)) # we convert the qlty to factors and then to numeric
df_isolated_load_qlty[df_isolated_load_qlty[]==1] <- 1 # 1 is bad
df_isolated_load_qlty[df_isolated_load_qlty[]==2] <- 0 # 0 is good we mask to calculate the global index quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate(global_quality = rowSums(.)) %>% select(global_quality)
df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
return(df)
}
Basically the function does as follows:
1. The function selects all of the values of a pivoted dataframe and aggregates them.
2. The function selects the quality indicator (character) of a pivoted dataframe.
3. I convert the characters of the quality to factors and then to numeric to get the 2 levels (1 or 2).
4. I replace the numeric values of each of the individual columns with 0 or 1 depending on the level.
5. I take the row sum of the individual quality values: I get 0 if all of the values are good; otherwise the global quality is bad.
The problem is that I am getting the following messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: `mutate_each_()` is deprecated as of dplyr 0.7.0.
Please use `across()` instead.
I did multiple trials as for instance:
df_isolated_load_qlty %>% mutate(across(.fns = ~ as.factor(), .names = colnames(df_isolated_load_qlty)))
Error: Problem with `mutate()` input `..1`.
x All unnamed arguments must be length 1
ℹ Input `..1` is `across(.fns = ~as.factor(), .names = colnames(df_isolated_load_qlty))`.
But I am still a bit confused about the new dplyr syntax. Would someone be able to guide me a little bit around the right way of doing this?
mutate_each has long been deprecated and was replaced by mutate_all.
mutate_all is now superseded by across.
across has everything() as its default .cols, which means that it behaves like mutate_all (as here) unless .cols is given explicitly.
You can apply multiple functions in the same mutate call, so here factor and as.numeric can be applied together.
Considering all this, you can change your existing function to:
library(dplyr)
processingAggregatedLoad <- function (df) {
defined <- ls()
passed <- names(as.list(match.call())[-1])
if (any(!defined %in% passed)) {
stop(paste("Missing values for the following arguments:",
paste(setdiff(defined, passed), collapse=", ")))
}
df_isolated_load <- df %>%
select(matches("snsr_val")) %>%
mutate(global_demand = rowSums(.))
df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind"))
df_isolated_load_qlty <- df_isolated_load_qlty %>%
mutate(across(.fns = ~as.numeric(factor(.))))
df_isolated_load_qlty[df_isolated_load_qlty ==1] <- 1
df_isolated_load_qlty[df_isolated_load_qlty==2] <- 0
df_isolated_load_qlty <- df_isolated_load_qlty %>%
mutate(global_quality = rowSums(.)) %>%
select(global_quality)
df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
return(df)
}
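To see the conversion step in isolation, here is a minimal sketch on made-up data. The column values "N" and "Y" are an assumption, since the question does not show the real ones. It spells out everything() because, as far as I know, newer dplyr releases discourage relying on the default .cols.

```r
library(dplyr)

# Hypothetical quality flags; the real values are not shown in the question.
qlty <- data.frame(
  s1_qlty_good_ind = c("N", "Y", "Y"),
  s2_qlty_good_ind = c("Y", "N", "Y")
)

# One lambda replaces the mutate_all() + mutate_each() pair.
converted <- qlty %>%
  mutate(across(everything(), ~ as.numeric(factor(.x))))

# .names takes a single glue template, not a vector of column names,
# which is why passing colnames(...) to it fails.
with_suffix <- qlty %>%
  mutate(across(everything(), ~ as.numeric(factor(.x)),
                .names = "{.col}_num"))
```

Keep in mind that factor() numbers its levels in sorted order, so a column that happens to contain only one distinct value would be mapped entirely to 1.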

Apply dplyr functions on a single column across a list using piping

I'm trying to filter something across a list of dataframes for a specific column. Typically across a single dataframe using dplyr I would use:
#creating dataframe
df <- data.frame(a = 0:10, d = 10:20)
# filtering column a for rows greater than 7
df %>% filter(a > 7)
I've tried doing this across a list using the following:
# creating list
x <- list(data.frame(a = 0:10, b = 10:20),
data.frame(c = 11:20, d = 21:30),
data.frame(e = 15:25, f = 35:45))
# selecting the appropriate column and trying to filter
# this is not working
x[1][[1]][1] %>% lapply(. %>% {filter(. > 2)})
# however, if I use the min() function it works
x[1][[1]][1] %>% lapply(. %>% {min(.)})
I find the %>% syntax quite easy to understand and carry out. However, in this case, selecting a specific column and doing something quite simple like filtering is not working. I'm guessing map could be equally useful. Any help is appreciated.
You can use filter_at to refer column by position.
library(dplyr)
purrr::map(x, ~.x %>% filter_at(1, any_vars(. > 7)))
In filter, you can subset the column and use it
purrr::map(x, ~.x %>% filter(.[[1]] > 7))
In base R, that would be :
lapply(x, function(y) y[y[[1]] > 7, ])
It seems you are interested in checking the condition on the first column of each dataframe in your list.
One solution using dplyr would be
lapply(x, function(df) {df %>% filter_at(1, ~. > 7)})
The 1 in filter_at indicates that I want to check the condition on the first column (1 is a positional index) of each dataframe in the list.
EDIT
After the discussion in the comments, I propose the following solution
lapply(x, function(df) {df %>% filter(a > 7) %>% select(a) %>% slice(1)})
Input data
x <- list(data.frame(a = 0:10, b = 10:20),
data.frame(a = 11:20, b = 21:30),
data.frame(a = 15:25, b = 35:45))
Output
[[1]]
a
1 8
[[2]]
a
1 11
[[3]]
a
1 15
Using filter with across
library(dplyr)
library(purrr)
map(x, ~ .x %>%
filter(across(names(.)[1], ~ .> 7)))
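Another possible variant, assuming a dplyr version that provides if_all() (1.0.4 or later, as far as I know). This is only a sketch using the list from the question; it selects the first column by position and uses a plain function instead of a purrr lambda to avoid nesting two ~ formulas.

```r
library(dplyr)

x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(c = 11:20, d = 21:30),
          data.frame(e = 15:25, f = 35:45))

# Keep the rows where the first column (selected by position) exceeds 7.
filtered <- lapply(x, function(d) {
  d %>% filter(if_all(1, ~ .x > 7))
})

# Number of rows kept in each data frame.
sapply(filtered, nrow)
```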

Use $ dollar sign at end of an R magrittr pipeline to return a vector

I'd like to use $ at the end of a magrittr/tidyverse pipeline. $ works directly next to tidyverse functions like read_csv and filter, but as soon I create a pipeline with %>% it raises an error. Here is a simple reproducible example.
# Load libraries and create a dummy data file
library(dplyr)
library(readr)
write_csv(data_frame(x=c(0,1), y=c(0,2)), 'tmp.csv')
# This works
y <- read_csv('tmp.csv')$y
str(y)
# This also works
df_y <- read_csv('tmp.csv')
y <- filter(df_y, y > 0)$y
str(y)
# This does not work
y <- read_csv('tmp.csv') %>% filter(y > 0)$y
My questions are:
1) What are the underlying explanations/mechanics for why using $ at the end of a pipeline does not work?
2) What's a best practice way for what I am trying to accomplish? Specifically, to get a vector as the end result of a pipeline?
It does not work because $ binds more tightly than %>%, so the pipe treats $, not filter, as the function and tries to run:
"$"(., filter(y > 0), y)
which, of course, makes no sense.
Suppose DF is as shown below. Then any of the subsequent lines of code work as expected:
DF <- data.frame(y = seq(-3, 3))
DF %>% filter(y > 0) %>% "$"(y)
## [1] 1 2 3
DF %>% { filter(., y > 0)$y }
## [1] 1 2 3
DF %>% filter(y > 0) %>% "[["("y")
## [1] 1 2 3
library(magrittr) # supplies extract2 as an alias for [[
DF %>% filter(y > 0) %>% extract2("y")
## [1] 1 2 3
question 1: I think the problem is grouping. Enclose most of that statement in parentheses, and it produces the same result as your first two approaches:
y <- (read_csv('tmp.csv') %>% filter(y > 0))$y
question 2: the newish function dplyr::pull() is my preference for pulling out a single vector, instead of returning an entire data.frame.
read_csv('tmp.csv') %>%
filter(y > 0) %>%
dplyr::pull(y)
The older way was to treat the data.frame as a list, and pull out a single element. The dot on the last line is magrittr syntax for the output of a pipe.
read_csv('tmp.csv') %>%
filter(y > 0) %>%
.[["y"]]
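The grouping can be checked without running the pipeline by inspecting the parsed expression. A small sketch, using the DF from the earlier answer in place of the CSV file:

```r
library(dplyr)

# quote() parses the expression without evaluating it.
expr <- quote(DF %>% filter(y > 0)$y)

# The right-hand side of %>% is the third element of the call.
rhs <- expr[[3]]

# The outermost function on the right-hand side is `$`, not filter.
rhs_fun <- as.character(rhs[[1]])
rhs_fun

# pull() keeps everything inside the pipeline and returns a vector.
DF <- data.frame(y = seq(-3, 3))
y_vec <- DF %>% filter(y > 0) %>% pull(y)
```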

Assign intermediate output to temp variable as part of dplyr pipeline

Q: In an R dplyr pipeline, how can I assign some intermediate output to a temp variable for use further down the pipeline?
My approach below works. But it assigns into the global frame, which is undesirable. There has to be a better way, right? I figured my approach involving the commented line would get the desired results. No dice. Confused why that didn't work.
df <- data.frame(a = LETTERS[1:3], b=1:3)
df %>%
filter(b < 3) %>%
assign("tmp", ., envir = .GlobalEnv) %>% # works
#assign("tmp", .) %>% # doesn't work
mutate(b = b*2) %>%
bind_rows(tmp)
a b
1 A 2
2 B 4
3 A 1
4 B 2
This does not create an object in the global environment:
df %>%
filter(b < 3) %>%
{
{ . -> tmp } %>%
mutate(b = b*2) %>%
bind_rows(tmp)
}
This can also be used for debugging if you use . ->> tmp instead of . -> tmp or insert this into the pipeline:
{ browser(); . } %>%
I often find the need to save an intermediate product in a pipeline. While my use case is typically to avoid duplicating filters for later splitting, manipulation and reassembly, the technique can work well here:
df %>%
filter(b < 3) %>%
{. ->> intermediateResult} %>% # this saves intermediate
mutate(b = b*2) %>%
bind_rows(intermediateResult)
pipeR is a package that extends the capabilities of the pipe without adding different pipes (as magrittr does). To assign, you pass a variable name, quoted with ~ in parentheses as an element in your pipe:
library(dplyr)
library(pipeR)
df %>>%
filter(b < 3) %>>%
(~tmp) %>>%
mutate(b = b*2) %>>%
bind_rows(tmp)
## a b
## 1 A 2
## 2 B 4
## 3 A 1
## 4 B 2
tmp
## a b
## 1 A 1
## 2 B 2
While the syntax is not terribly descriptive, pipeR is very well documented.
You can generate the desired object at the location in the pipeline where it's needed. For example:
df %>% filter(b < 3) %>% mutate(b = b*2) %>%
bind_rows(df %>% filter(b < 3))
This method avoids having to filter twice:
df %>%
filter(b < 3) %>%
bind_rows(., mutate(., b = b*2))
I was interested in the question for the sake of debugging (wanting to save intermediate results so that I can inspect and manipulate them from the console without having to separate the pipeline into two pieces, which is cumbersome). So, for my purposes, the only problem with the OP's original solution was that it was slightly verbose.
This can be fixed by defining a helper function:
to_var <- function(., ..., env=.GlobalEnv) {
var_name = quo_name(quos(...)[[1]])
assign(var_name, ., envir=env)
.
}
Which can then be used as follows:
df <- data.frame(a = LETTERS[1:3], b=1:3)
df %>%
filter(b < 3) %>%
to_var(tmp) %>%
mutate(b = b*2) %>%
bind_rows(tmp)
# tmp still exists here
That still uses the global environment, but you can also explicitly pass a more local environment as in the following example:
f <- function() {
df <- data.frame(a = LETTERS[1:3], b=1:3)
env = environment()
df %>%
filter(b < 3) %>%
to_var(tmp, env=env) %>%
mutate(b = b*2) %>%
bind_rows(tmp)
}
f()
# tmp does not exist here
The problem with the accepted solution is that it didn't seem to work out of the box with tidyverse pipes.
G. Grothendieck's solution doesn't work for the debugging use case at all. (update: see G. Grothendieck's comment below and his updated answer!)
Finally, the reason assign("tmp", .) %>% doesn't work is that the default 'envir' argument for assign() is the "current environment" (see documentation for assign) which is different at each stage of the pipeline. To see this, try inserting { print(environment()); . } %>% into the pipeline at various points and see that a different address is printed each time. (It is probably possible to tweak the definition of to_var so that the default is the grandparent environment instead.)
Just tacking on a simplistic note to @tiechert's good post: as long as you're operating inside a function call, you can get the function's environment() reference and then use assign() to output the current state of the pipe to the function's environment and keep it separate from the global.
f = function(df) {
env = environment()
df %>%
# <actions here> %>%
assign("tmp", ., envir = env) %>% # Assign to function environment, not the pipe
# <more actions> %>%
.[]
# tmp is accessible here, within the function
}
# tmp does not exist here
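If the temporary is only needed for one step, another option is to move that step into a small function so that no variable has to be assigned at all. This is just a sketch; double_and_stack is a hypothetical name chosen for the illustration.

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

# Hypothetical helper: the function argument plays the role of tmp,
# so nothing is written to the global or any other environment.
double_and_stack <- function(d) {
  bind_rows(mutate(d, b = b * 2), d)
}

res <- df %>%
  filter(b < 3) %>%
  double_and_stack()

res
```

This gives the same rows, in the same order, as the original pipeline in the question, with the doubled rows first.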
