Unexpected behavior with n_distinct inside pipe - r

I am trying to use the n_distinct function from dplyr inside a pipe in a function and am finding it to be sensitive to my choice of syntax in a way I didn't expect. Here's a toy example.
# preliminaries
library(tidyverse)
set.seed(123)
X <- data.frame(a1 = rnorm(10), a2 = rnorm(10), b = rep(LETTERS[1:5], times = 2), stringsAsFactors = FALSE)
print(X)
            a1         a2 b
1  -0.56047565  1.2240818 A
2  -0.23017749  0.3598138 B
3   1.55870831  0.4007715 C
4   0.07050839  0.1106827 D
5   0.12928774 -0.5558411 E
6   1.71506499  1.7869131 A
7   0.46091621  0.4978505 B
8  -1.26506123 -1.9666172 C
9  -0.68685285  0.7013559 D
10 -0.44566197 -0.4727914 E
Okay, now let's say I want to iterate a function over the names of selected columns in that data frame (humor me). Here, I'm going to use values in the selected column to filter the initial data set, count the number of unique ids that remain, and return the results as a one-row tibble that I then bind into a new tibble. When I create a new tibble inside the function and then apply n_distinct to a selected column in that tibble as its own step, I get the expected results from n_distinct, 5 and 4.
bind_rows(map(str_subset(colnames(X), "a"), function(i) {
  subdf <- filter(X, !!sym(i) > 0)
  value <- n_distinct(subdf$b)
  tibble(y = i, n_uniq = value)
}))
# A tibble: 2 x 2
  y     n_uniq
  <chr>  <int>
1 a1         5
2 a2         4
If I put n_distinct inside a pipe and use . to refer to the filtered tibble, however, the code executes but I get a different and incorrect result.
bind_rows(map(str_subset(colnames(X), "a"), function(i) {
  value <- filter(X, !!sym(i) > 0) %>% n_distinct(.$b)
  tibble(y = i, n_uniq = value)
}))
# A tibble: 2 x 2
  y     n_uniq
  <chr>  <int>
1 a1         5
2 a2         7
What's up with that? Am I misunderstanding the use of . inside a pipe? Is something funky with n_distinct?

n_distinct accepts multiple arguments, and here you're actually passing both the tibble and the b column, since the pipe inserts its left-hand side as the first argument by default. Here are some other ways of getting the expected output:
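# braces stop the pipe from also inserting the tibble as the first argument
filter(X, !!sym(i) > 0) %>%
  {n_distinct(.$b)}
# with() evaluates n_distinct(b) using the columns of the piped tibble
filter(X, !!sym(i) > 0) %>%
  with(n_distinct(b))
# the exposition pipe %$% from magrittr exposes the columns by name
library(magrittr)
filter(X, !!sym(i) > 0) %$%
  n_distinct(b)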
Also, not directly related to your question, there's a convenience function for this kind of thing, map_dfr, which maps and row-binds in one step:
map_dfr(str_subset(colnames(X), "a"), function(i) {
  value <- filter(X, !!sym(i) > 0) %>% {n_distinct(.$b)}
  tibble(y = i, n_uniq = value)
})
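The first-argument rule described in this answer can be checked on a plain numeric vector. The following is a minimal sketch, not part of the original answer, assuming only that magrittr is installed; the variable names are made up for illustration. When the dot appears only inside a nested call such as .[1] or .$b, the pipe still inserts the left-hand side as the first argument, and wrapping the call in braces suppresses that.

```r
library(magrittr)

x <- c(1, 2, 3)

# The dot is nested inside `[`, so x is ALSO inserted as the first argument:
# this evaluates sum(x, x[1])
nested <- x %>% sum(.[1])

# Braces turn the right-hand side into an expression, so only x[1] is summed
braced <- x %>% {sum(.[1])}

# A dot used directly as an argument replaces the default insertion:
# this evaluates sum(10, x)
direct <- x %>% sum(10, .)

print(c(nested = nested, braced = braced, direct = direct))
```

The same mechanism explains the 7 in the question: n_distinct received the filtered tibble and the b column, so it counted distinct rows rather than distinct values of b.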

Here is a minimal example of what I think you are seeing.
iris %>%
  n_distinct(.$Species)
# 149
n_distinct(iris$Species)
# 3
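The first option is actually equivalent to the following call: the data frame itself is passed as the first argument, so distinct rows are counted and the .$Species is redundant.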
n_distinct(iris, iris$Species)
# 149
I think that to pipe it without unusual syntax you need something like this.
iris %>%
  distinct(Species) %>%
  count()
# 3

Agreed. If you're just looking for the number of distinct values, as in Adam's last example, you might be better off with
length(unique(iris$Species))
depending on what your goals are.


Extract first Non NA value over multiple columns

I'm still learning R and was wondering if there is an elegant way of manipulating the df below to achieve df2.
I'm not sure if a loop is supposed to be used for this, but basically I want to extract the first non-NA "X_No" value if the "X_No" value is NA in the first row. This is perhaps best described through an example going from df to the desired df2.
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
I'm hoping for an elegant solution to this, as there are over 1000 columns similar to the example provided.
I've looked all over the web for a similar example that would reproduce the expected result, but to no avail.
Your help is very much appreciated.
Thank you
I don't know if I'd call it "elegant", but here is a potential solution:
library(tidyverse)
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
output <- df %>%
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("Col", ".value")) %>%
  drop_na() %>%
  group_by(Col) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(-Col)
df2
#> ID No
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
output
#> # A tibble: 5 × 2
#> ID No
#> <chr> <dbl>
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
all_equal(df2, output)
#> [1] TRUE
Created on 2023-02-08 with reprex v2.0.2
Using base R with max.col (assuming the columns alternate between ID and No):
ind <- max.col(!is.na(t(df[c(FALSE, TRUE)])), "first")
m1 <- cbind(seq_along(ind), ind)
data.frame(ID = t(df[c(TRUE, FALSE)])[m1], No = t(df[c(FALSE, TRUE)])[m1])
ID No
1 A 11
2 D 12
3 F 13
4 M 40
5 W 50
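How max.col finds the first non-NA entry may be easier to see on a small matrix. This is a minimal sketch on made-up data, not the answerer's code: each row of m stands in for one "X_No" column after transposing, and first_idx and first_val are illustrative names.

```r
# Toy matrix: each row plays the role of one "X_No" column from df
m <- rbind(c(NA, 5, 7),
           c(3, NA, NA),
           c(NA, NA, 9))

# !is.na(m) is TRUE where a value is present; with ties.method = "first",
# max.col() returns the column index of the first TRUE in each row
first_idx <- max.col(!is.na(m), ties.method = "first")

# Matrix indexing with (row, column) pairs picks one value per row
first_val <- m[cbind(seq_len(nrow(m)), first_idx)]

print(first_idx)
print(first_val)
```

One caveat of this approach: for a row that is entirely NA, all entries tie, so max.col returns 1 and the extracted value is NA.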
Here is a data.table solution that should scale well to a (very) large dataset.
functionally
Split the data.frame into a list of chunks of columns, based on their names: all columns starting with A_ go to the first element, all columns starting with B_ to the second, and so on.
Then stack these list elements on top of each other using data.table::rbindlist, ignoring the column names (this only works if A_ has the same number of columns as B_, which has the same number of columns as n_).
Now get the first non-NA value of each value in the first column.
code
library(data.table)
# split based on what comes before the underscore
L <- split.default(df, f = gsub("(.*)_.*", "\\1", names(df)))
# bind together again
DT <- rbindlist(L, use.names = FALSE)
# extract the first value of the non-NA
DT[!is.na(A_No), .(No = A_No[1]), keyby = .(ID = A_ID)]
# ID No
# 1: A 11
# 2: D 12
# 3: F 13
# 4: G 14
# 5: I 15
# 6: M 40
# 7: P 20
# 8: W 50
# 9: X 32
#10: Y 48
#11: Z 40
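Note that the result above is keyed by ID across all stacked rows, so it has eleven rows rather than the five in df2. If one row per column pair is wanted, a possible variant is to keep the group label with rbindlist's idcol argument and take the first non-NA row within each group. This is a sketch on a shortened copy of the question's data, not the answerer's code; the names grp and res are hypothetical.

```r
library(data.table)

# Shortened version of the question's data (three column pairs)
df <- data.frame(A_ID = c("A", "B", "I", "N"), A_No = c(11, NA, 15, NA),
                 B_ID = c("B", "C", "D", "J"), B_No = c(NA, NA, 12, NA),
                 C_ID = c("E", "F", "G", "P"), C_No = c(NA, 13, 14, 20))

# Split on the prefix before the underscore, then stack by position,
# keeping the prefix in a column called grp (name chosen for illustration)
L <- split.default(df, f = sub("_.*", "", names(df)))
DT <- rbindlist(L, use.names = FALSE, idcol = "grp")

# First non-NA row within each original column pair
res <- DT[!is.na(A_No), .(ID = A_ID[1], No = A_No[1]), by = grp]
print(res)
```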

product of multiple selected columns in a data frame in R

I have a data frame with a subset of variables that starts with 'AA_' (e.g., AA_1, AA_2, ... AA_100) along with other variables X, Y, Z.
If I would like to get the product of all 'AA_' variables, what would be the most efficient way in R to achieve this?
I am thinking something like
mydata = mydata %>%
  mutate(AA_product = reduce(starts_with('AA_'), `*`))
but it does not quite work
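Here, we need to select the columns from the data first, because starts_with() only works inside a selecting function such as select():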
library(dplyr)
library(purrr)
mydata %>%
  mutate(AA_product = reduce(select(., starts_with('AA_')), `*`))
-output
# X Y Z AA_1 AA_2 AA_3 AA_product
#1 1 2 3 1 2 3 6
#2 2 3 4 2 3 4 24
#3 3 4 5 3 4 5 60
Another, less efficient approach is rowwise with c_across:
mydata %>%
  rowwise() %>%
  mutate(AA_prod = prod(c_across(starts_with('AA')))) %>%
  ungroup()
data
mydata <- data.frame(X = 1:3, Y = 2:4, Z = 3:5,
                     AA_1 = 1:3, AA_2 = 2:4, AA_3 = 3:5)
If you want row-wise product for "AA_" columns, you can do this in base R with Reduce :
cols <- grep('AA_', names(mydata))
mydata$AA_product <- Reduce(`*`, mydata[cols])
and apply :
mydata$AA_product <- apply(mydata[cols], 1, prod)
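As a quick check that the two base R approaches agree, here is a small self-contained sketch using the sample data from the other answer. The anchored pattern "^AA_" is an assumption about the naming scheme, and by_reduce and by_apply are illustrative names.

```r
mydata <- data.frame(X = 1:3, Y = 2:4, Z = 3:5,
                     AA_1 = 1:3, AA_2 = 2:4, AA_3 = 3:5)

# "^AA_" anchors the match at the start of the name (an assumption about
# the naming scheme); plain "AA_" would also match a column like "XAA_1"
cols <- grep("^AA_", names(mydata))

# Vectorised: multiplies whole columns together, one column at a time
by_reduce <- Reduce(`*`, mydata[cols])

# Row by row: calls prod() once per row
by_apply <- apply(mydata[cols], 1, prod)

print(by_reduce)
print(unname(by_apply))
```

With many rows and comparatively few columns, the Reduce version is likely to be the faster of the two, since each multiplication works on whole columns.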

pmap purrr error: Argument 1 must have names

I plan to sum a data.table row-wise and add a constant to it. What is wrong with this code? I am specifically looking for a pmap_dfr solution:
library(data.table)
library(tidyverse)
temp.dt <- data.table(a = 1:3, b = 1:3, c = 1:3)
d <- 10
temp.dt %>% pmap_dfr(., sum, d) # add columns a b and c and add variable d to it
The output expected is a single column tibble with the following rows:
13
16
19
Error thrown: Argument 1 must have names.
I have been able to get it to work with pmap and pmap_dbl but it fails when using pmap_dfr. Additionally, the example I have provided is a toy example. I want the d variable as an input argument to the sum function instead of adding d later to the row-wise sum.
For example, I know the following works:
temp.dt %>% pmap_dbl(., sum) + d
The problem occurs for regular data frames too. To reduce this to the essentials, start a new R session, drop the data.table part, and use the input shown below, a 3x4 data.frame, so that rows and columns are not confused. Also note that pmap_dfr(sum, d) is the same as pmap(sum, d) %>% bind_rows, and it is in the bind_rows step that the problem occurs.
library(dplyr)
library(purrr)
# test input
temp.df <- data.frame(a = 1:3, b = 1:3, c = 1:3, z = 1:3)
rownames(temp.df) <- LETTERS[1:3]
d <- 10
out <- temp.df %>% pmap(sum, d) # this works
out %>% bind_rows
## Error: Argument 1 must have names
The problem, as the error states, is that out has no names, and bind_rows apparently does not supply default names for the result. For example, this will work. I am not suggesting that you necessarily do this, but am just trying to illustrate why the original does not work by showing minimal changes that make it work:
temp.df %>% pmap(sum, d) %>% set_names(rownames(temp.df)) %>% bind_rows
## # A tibble: 1 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 14 18 22
or this could be written like this to avoid writing temp.df twice:
temp.df %>% { set_names(pmap(., sum, d), rownames(.)) } %>% bind_rows
I think we can conclude that pmap_dfr is just not the right function to use here.
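If pmap_dfr is still required, one way around the naming problem, offered here as a sketch rather than as part of the answer above, is to have the mapped function return a named one-row tibble, so that bind_rows has column names to work with. The column name total is made up for illustration, and the example assumes purrr, tibble, and dplyr are installed.

```r
library(purrr)
library(tibble)

temp.df <- data.frame(a = 1:3, b = 1:3, c = 1:3)
d <- 10

# Each row is passed as arguments a, b, c; the function returns a one-row
# tibble with a named column, which pmap_dfr can then row-bind
res <- pmap_dfr(temp.df, function(...) tibble(total = sum(..., d)))
print(res)
```

This keeps d as an argument to sum, as the question asks, and yields one row per input row instead of one column per input row.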
Base R
Of course, this is all trivial in base R as you can do this:
rowSums(temp.df) + d
## A B C
## 14 18 22
or more generally:
as.data.frame.list(apply(temp.df, 1, sum, d))
## A B C
## 14 18 22
or
as.data.frame.list(Reduce("+", temp.df) + d)
## X14 X18 X22
##1 14 18 22
data.table
In data.table we can write:
library(data.table)
DT <- as.data.table(temp.df)
DT[, as.list(rowSums(.SD) + d)]
## V1 V2 V3
## 1: 14 18 22
DT[, as.list(apply(.SD, 1, sum, d))]
## V1 V2 V3
## 1: 14 18 22
Also note that using data.table directly tends to be faster than sticking another level on top of it so if you thought you were getting the benefit of data.table's speed by using it with dplyr and purrr you likely aren't.
A pmap_dfr solution is to first transpose the dataset. We can later rename the columns as desired:
temp.dt %>%
  t() %>%
  as.data.frame() -> tmp_dt
pmap_dfr(list(tmp_dt, 10), sum)
# A tibble: 1 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 13 16 19
A possible dplyr-base alternative:
temp.dt %>%
  mutate(Sum = rowSums(.) + d) %>%
  pull(Sum)
[1] 13 16 19
Or using pmap_dbl:
temp.dt %>%
pmap_dbl(.,sum) + d
[1] 13 16 19

Targeting a specific column within a dplyr pipe

I very much enjoy working with the magrittr pipes in R %>% and try to use them as often / efficiently as possible. I quite often need to target specific columns in a pipe chain, for example to change the column type. This results in me having to break the chain / my workflow because I need to target only a specific column instead of my entire dataframe.
Consider the following example:
library(tidyverse)
rm(list = ls())
a <- c(1:20)
b <- rep(c("a", "b"), 10)
df <- data_frame(a, b) %>%
  rename(info = b) %>%
  recode(x = df$info, "a" = "x") #I'd like to target only the df$info column here
This obviously doesn't work, because dplyr doesn't expect me to change the x = argument for a function in a pipe chain.
library(tidyverse)
rm(list = ls())
a <- c(1:20)
b <- rep(c("a", "b"), 10)
df <- data_frame(a, b) %>%
  rename(info = b)
df$info <- df$info %>% #this works as expected, but is not as elegant
  recode("a" = "x")
This is how I think it should be done, but I feel that it is not as efficient / elegant as I would like it to be, especially if I plan on chaining more functions together after recoding.
Is there a convenient way around this, so I can tell a command in my pipe chain to target only a specific column?
recode works on a vector, not on a whole data frame, so we need to place it inside mutate:
data_frame(a, b) %>%
  rename(info = b) %>%
  mutate(info = recode(info, a = "x"))
# A tibble: 20 x 2
# a info
# <int> <chr>
# 1 1 x
# 2 2 b
# 3 3 x
# 4 4 b
# 5 5 x
# 6 6 b
# 7 7 x
# 8 8 b
# 9 9 x
#10 10 b
#11 11 x
#12 12 b
#13 13 x
#14 14 b
#15 15 x
#16 16 b
#17 17 x
#18 18 b
#19 19 x
#20 20 b
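Because mutate returns the whole data frame, the chain can simply continue after the recode step. Here is a minimal sketch of that, assuming dplyr is installed; the filter and summarise steps and the names res, n, and total are added purely for illustration, and tibble() is used in place of the older data_frame().

```r
library(dplyr)

a <- 1:20
b <- rep(c("a", "b"), 10)

res <- tibble(a, b) %>%
  rename(info = b) %>%
  mutate(info = recode(info, a = "x")) %>%  # only the info column changes
  filter(info == "x") %>%                   # the chain simply continues
  summarise(n = n(), total = sum(a))
print(res)
```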

separate() in tidyr with NA

I have a question related to separate() in the tidyr package. When there is no NA in a data frame, separate() works. I have been using this function a lot. But, today I had a case in which there were NAs in a data frame. separate() returned an error message. I could be very silly. But, I wonder if tidyr may not be designed for this kind of data cleaning. Or is there any way separate() can work with NAs? Thank you very much for taking your time.
Here is an updated sample based on the comments. Say I want to separate characters in y and create new columns. If I remove the row with NA, separate() will work. But, I do not want to delete the row, what could I do?
x <- c("a-1","b-2","c-3")
y <- c("d-4","e-5", NA)
z <- c("f-6", "g-7", "h-8")
foo <- data.frame(x, y, z, stringsAsFactors = FALSE)
ana <- foo %>%
  separate(y, c("part1", "part2"))
# > foo
# x y z
# 1 a-1 d-4 f-6
# 2 b-2 e-5 g-7
# 3 c-3 <NA> h-8
# > ana <- foo %>%
# + separate(y, c("part1", "part2"))
# Error: Values not split into 2 pieces at 3
One way would be:
res <- foo %>%
  mutate(y = ifelse(is.na(y), paste0(NA, "-", NA), y)) %>%
  separate(y, c('part1', 'part2'))
res[res=='NA'] <- NA
res
# x part1 part2 z
#1 a-1 d 4 f-6
#2 b-2 e 5 g-7
#3 c-3 <NA> <NA> h-8
You can use the extra option in separate.
Here's an example from Hadley's GitHub issue page:
> df <- data.frame(x = c("a", "a b", "a b c", NA))
> df
      x
1     a
2   a b
3 a b c
4  <NA>
> df %>% separate(x, c("a", "b"), extra = "merge")
     a    b
1    a <NA>
2    a    b
3    a  b c
4 <NA> <NA>
> df %>% separate(x, c("a", "b"), extra = "drop")
     a    b
1    a <NA>
2    a    b
3    a    b
4 <NA> <NA>
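In recent versions of tidyr, separate() appears to propagate a missing value to each new column rather than raising an error, so the call from the question may work unchanged after an upgrade. This is a minimal sketch on the question's data, assuming a reasonably current tidyr; sep = "-" is added explicitly here and is not in the original call.

```r
library(tidyr)

foo <- data.frame(x = c("a-1", "b-2", "c-3"),
                  y = c("d-4", "e-5", NA),
                  z = c("f-6", "g-7", "h-8"),
                  stringsAsFactors = FALSE)

# Split y on the hyphen; the NA row is expected to yield NA in both parts
ana <- separate(foo, y, c("part1", "part2"), sep = "-")
print(ana)
```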
