R double row filter by string

I'm cleaning a dataset that doesn't yet have column names (so I'm working with indexes) and I'm trying to filter two columns of a df by piping the results of the first filter into the second and don't understand why the below doesn't work:
stripcols <- c("","Total+")
df <- df %>%
filter(!df[,1] %in% stripcols) %>%
filter(!df[,2] %in% stripcols)
Running this results in:
Error in filter_impl(.data, quo) : Result must have length 46, not 58
This is easily worked around by running the filter twice, but I don't understand why this didn't work.
I'm also curious as to whether there is a way to do this with one filter command that is applied on both columns rather than two.

The source of the error is that you are always comparing against nrow(df) rows regardless of how many rows hit the second filter. For instance:
dat <- data.frame(a=1:10)
dat %>% filter(a > 5)
# a
# 1 6
# 2 7
# 3 8
# 4 9
# 5 10
The way you're writing it, you're doing
dat %>% filter(dat[,1] > 5)
# a
# 1 6
# 2 7
# 3 8
# 4 9
# 5 10
For this first call, the number of rows that go into filter is 10, and the number of rows being compared inside filter is also 10. However, if you were to do:
dat %>% filter(dat[,1] > 5) %>% filter(dat[,1] > 7)
# Error in filter_impl(.data, quo) : Result must have length 5, not 10
this fails because the number of rows going into the second filter is only 5 not 10, though we are giving the filter command 10 comparisons by using dat[,1].
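To make the mismatch concrete, here is a small base R sketch (the names first_pass, rows_in and comparisons are only illustrative) that compares the number of rows reaching the second step with the number of comparisons you get by referring back to the original dat:

```r
dat <- data.frame(a = 1:10)

# Rows that survive the first condition; this is what the second filter receives.
first_pass <- dat[dat[, 1] > 5, , drop = FALSE]

rows_in <- nrow(first_pass)          # size of the data entering the second step
comparisons <- length(dat[, 1] > 7)  # size of a condition built from the original dat

rows_in
comparisons
```

The two numbers differ, which is exactly what the "Result must have length ..." error is complaining about.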
(N.B.: many comments about names are perfectly appropriate, but let's continue with the theme of using column indices.)
The first trick is to give each filter only as many comparisons as there are rows in the data coming in. Put another way, do the comparisons on the state of the data at that point in the pipeline. magrittr (and therefore dplyr) does this with the . placeholder. The dot can always be inferred (it defaults to the first argument of the RHS function, the function after %>%), but some feel that being explicit is better. For instance, this is legal:
mtcars %>%
group_by(cyl) %>%
tally()
# # A tibble: 3 x 2
# cyl n
# <dbl> <int>
# 1 4 11
# 2 6 7
# 3 8 14
but an explicit equivalent pipe is this:
mtcars %>%
group_by(., cyl) %>%
tally(.)
If the first argument to the function is not the frame itself, then the %>% inferred way will fail:
mtcars %>%
xtabs(~ cyl + vs)
# Error in as.data.frame.default(data, optional = TRUE) :
# cannot coerce class '"formula"' to a data.frame
(Because it is effectively calling xtabs(., ~cyl + vs), and without named arguments xtabs assumes the first argument to be a formula.)
so we must be explicit in these situations:
mtcars %>%
xtabs(~ cyl + vs, data = .)
# vs
# cyl 0 1
# 4 1 10
# 6 3 4
# 8 14 0
(contrived example, granted). One could also do mtcars %>% xtabs(formula=~cyl+vs), but my point stands.
So to adapt your code, I would expect this to work:
df %>%
filter(!.[,1] %in% stripcols) %>%
filter(!.[,2] %in% stripcols)
I think I'd prefer the [[ approach (partly because I know that tbl_df and data.frame deal with [,1] slightly differently ... and though it works with it, I still prefer the explicitness of [[):
df %>%
filter(!.[[1]] %in% stripcols) %>%
filter(!.[[2]] %in% stripcols)
which should work. Of course, combining works just fine, too:
df %>%
filter(!.[[1]] %in% stripcols, !.[[2]] %in% stripcols)
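If more columns need the same check, a sketch with if_all() may be tidier than repeating the condition. This assumes dplyr 1.0.4 or later (where if_all() is available); the toy data frame and its column names V1 and V2 are made up for illustration, and the columns are still selected by position:

```r
library(dplyr)

stripcols <- c("", "Total+")

# Hypothetical stand-in for the unnamed data; only the positions matter here.
df <- data.frame(
  V1 = c("a", "", "b", "Total+", "c"),
  V2 = c("x", "y", "Total+", "z", "w"),
  stringsAsFactors = FALSE
)

# Keep a row only if neither of the first two columns holds a value in stripcols.
cleaned <- df %>%
  filter(if_all(1:2, ~ !.x %in% stripcols))

cleaned
```

Because the condition is evaluated on the data that filter() receives, there is no length mismatch to worry about.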

Related

R/arrow summarizing on variable columns

I have a large-ish parquet file I'm referencing via arrow::open_dataset. I'd like to get the max value of one or more of the columns, where I don't know a priori which (or how many) columns. In general, this sounds like "programming with dplyr" (assuming arrow-10 and its recent support of dplyr::across), but I can't get it to work.
write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a")
open_dataset("quux.parquet") %>%
summarize(across(sym(vars), ~ max(.))) %>%
collect()
# # A tibble: 1 x 1
# a
# <dbl>
# 1 9
But when vars is length 2 or more, I assume I need to be using syms or similar, but that fails with
vars <- c("a", "b")
open_dataset("quux.parquet") %>%
summarize(across(all_of(syms(vars)), ~ max(.))) %>%
collect()
# Error: Must subset columns with a valid subscript vector.
# x Subscript has the wrong type `list`.
# i It must be numeric or character.
How do I lazily (not load all data) find the max of multiple columns in an arrow dataset?
I suspect that the correct answer in dplyr will be some form of syms; whether or not arrow supports that is the next question. I'm not tied to the dplyr mechanisms: if there's a method using ds$NewScan() or similar, I'm amenable.
Is this the kind of thing you're after - using tidyselect's all_of function?
library(arrow)
library(dplyr)
write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a", "d")
open_dataset("quux.parquet") %>%
summarize(across(all_of(vars), ~ max(.))) %>%
collect()
#> # A tibble: 1 × 2
#> a d
#> <dbl> <chr>
#> 1 9 r
See https://tidyselect.r-lib.org/reference/index.html for the different tidyselect functions you may also want to check out.
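For comparison, here is the same pattern on an ordinary in-memory data frame. This sketch deliberately leaves arrow out (so it says nothing about laziness) and only shows that all_of() takes the character vector directly, with no sym() or syms() involved:

```r
library(dplyr)

dat <- data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r"))
vars <- c("a", "b")

# all_of() selects columns by name from a character vector of any length.
res <- dat %>%
  summarize(across(all_of(vars), ~ max(.x)))

res
```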

Question on R programming with dplyr and tidy evaluation

Folks I have a couple of questions about how tidy evaluation works with dplyr
The following code produces a tally of cars by cylinder using the mtcars dataset:
mtcars %>%
select(cyl) %>%
group_by(cyl) %>%
tally()
With output as expected:
# A tibble: 3 x 2
cyl n
* <dbl> <int>
1 4 11
2 6 7
3 8 14
If I want to pass the grouping factor as variable, then this fails:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(var) %>%
tally()
with error message:
Error: Must group by variables found in `.data`.
* Column `var` is not found.
This also fails:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by({{ var}}) %>%
tally()
Producing output:
# A tibble: 1 x 2
`"cyl"` n
* <chr> <int>
1 cyl 32
This code, however, works as expected:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(.data[[ var]]) %>%
tally()
Producing the expected output:
# A tibble: 3 x 2
cyl n
* <dbl> <int>
1 4 11
2 6 7
3 8 14
I have two questions about this and wondering if someone can help!
Why does select(var) work fine without using any of the dplyr tidy evaluation extensions, such as select({{ var }}) or select(.data[[ var ]])?
What is it about group_by() that makes group_by({{ var }}) wrong but group_by(.data[[ var ]]) right?
Thanks so much!
Matt.
It depends on how those functions work and accept input.
If you look at the documentation at ?select the relevant part for this question is -
These helpers select variables from a character vector:
all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
any_of(): Same as all_of(), except that no error is thrown for names that don't exist.
So you can use all_of() and any_of() in select() with character vectors. A bare character vector is also accepted, which is why mtcars %>% select(var) runs, but it gives this warning:
Note: Using an external vector in selections is ambiguous.
ℹ Use all_of(var) instead of var to silence this message.
and no warning with mtcars %>% select(all_of(var)).
As far as group_by is concerned there is no such specific provision and you need to use mtcars %>% group_by(.data[[var]]).
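A minimal sketch of the alternatives, assuming dplyr 1.0.0 or later. The helper tally_by() is hypothetical and is only there to show where {{ }} is meant to be used: it forwards an unquoted column name passed as a function argument, which is why it does not help when var already holds a string.

```r
library(dplyr)

var <- "cyl"

# String stored in a variable: use the .data pronoun ...
by_pronoun <- mtcars %>%
  group_by(.data[[var]]) %>%
  tally()

# ... or tidyselect via across(), which also accepts several names at once.
by_across <- mtcars %>%
  group_by(across(all_of(var))) %>%
  tally()

# {{ }} forwards a bare column name that was given as a function argument.
tally_by <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    tally()
}
by_embrace <- tally_by(mtcars, cyl)
```

All three should produce the same three-row tally as the hard-coded group_by(cyl) version.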

R: rowwise nth element ordered_by row values

I have this input:
t <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
And want to have the rowwise nth-lowest element of the dataframe ordered by the rowwise values, so that the output is something like this (example for nth_element = 2):
[1] 2 3 5 4
I tried a function like this:
apply(t, 1, nth, n=1, order_by = .)
But this does not work. Two questions:
What should I type in the order_by argument to make this function work?
Which is the best way to summarise rows with an own summary function if I don't want to mention the column names in the rowwise summary function?
Sidenote:
I don't want to mention the column names specifically, I want the function to use all columns in the dataset.
I tried the rownth function from the Rfast package but it only provides one result. Does anybody know what I do wrong?
We can use apply and sort to do this.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
nth_lowest <- 2
apply(d, 1, FUN = function(x) sort(x)[nth_lowest])
# [1] 2 3 5 4
Note that I am calling the data d instead of t, because t is already the name of a base R function (matrix transpose). It is not reserved, but masking it can be confusing.
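If this is needed in several places, the apply/sort idea can be wrapped in a small helper. This is only a sketch: the name row_nth_lowest() is made up here, and it assumes all columns are numeric so that as.matrix() does not turn them into strings.

```r
# Hypothetical helper: n-th lowest value of each row, using every column.
row_nth_lowest <- function(data, n) {
  m <- as.matrix(data)
  stopifnot(is.numeric(m), n >= 1, n <= ncol(m))
  # sort() orders each row ascending; [n] picks the n-th lowest element.
  unname(apply(m, 1, function(row) sort(row)[n]))
}

d <- data.frame(x = c(1, 2, 8, 4), y = c(2, 3, 4, 5), k = c(3, 4, 5, 1))
result <- row_nth_lowest(d, 2)
result
```

No column names are mentioned anywhere, so the helper uses whatever columns the data frame has.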
Not as elegant as @bouncyball's answer, but using dplyr (and tidyr), one possibility is to do:
library(dplyr)
library(tidyr)
t %>% mutate(Row = row_number()) %>%
pivot_longer(-Row, names_to = "Col", values_to = "Val") %>%
group_by(Row) %>%
arrange(Val) %>%
slice(2) %>%
select(Val)
Adding missing grouping variables: `Row`
# A tibble: 4 x 2
# Groups: Row [4]
Row Val
<int> <dbl>
1 1 2
2 2 3
3 3 5
4 4 4
Using Rfast could reduce run time for big data, but it works on matrices only, so the data frame has to be converted first.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
d<- Rfast::data.frame.to_matrix(d)
nth_lowests <- rep(2,nrow(d)) # one entry per row, since rownth works row by row
Rfast::rownth(d,nth_lowests)
# [1] 2 3 5 4
You could also use the parallel version of Rfast::rownth

R: Create new column based on a list of values from multiple columns

I want to create a new column (T/F) based on any value from a list being present in multiple columns. For this example I'm using mtcars, searching for two values in two columns, but my actual challenge is many values in many columns.
I have a successful filter using filter_at() included below, but I've been unable to apply that logic to a mutate:
# there are 7 cars with 6 cyl
mtcars %>%
filter(cyl == 6)
# there are 2 cars with 19.2 mpg, one with 6 cyl, one with 8
mtcars %>%
filter(mpg == 19.2)
# there are 8 rows with either.
# these are the rows I want as TRUE
mtcars %>%
filter(mpg == 19.2 | cyl == 6)
# set the cols to look at
mtcars_cols <- mtcars %>%
select(matches('^(mp|cy)')) %>% names()
# set the values to look at
mtcars_numbs <- c(19.2, 6)
# result is 8 rows with either value in either col.
# this is a successful filter of the data
out1 <- mtcars %>%
filter_at(vars(mtcars_cols), any_vars(
. %in% mtcars_numbs
)
)
# shows set with all 6 cyl, plus one 8cyl 19.2 mpg
out1 %>%
select(mpg, cyl)
# This attempts to apply the filter list to the cols,
# but I only get 6 rows as True
# I tried to change == to %in% but that results in an error
out2 <- mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs) > 0
)
# only 6 rows returned
out2 %>%
filter(myset == T)
I'm not sure why the two rows are skipped. I think it might be the use of rowSums that is aggregating those two rows in some way.
If we want to do the corresponding checks, it may be better to use map2
library(dplyr)
library(purrr)
map2_df(mtcars_cols, mtcars_numbs, ~
mtcars %>%
filter(!! rlang::sym(.x) == .y)) %>%
distinct
NOTE: Comparing floating point numbers with == can get into trouble, as the precision can vary and result in FALSE.
Also, note that == works as intended only when the lhs and rhs have the same length or the rhs is of length 1 (in which case it is recycled). If the rhs is longer than 1 and not equal in length to the lhs, it is recycled in column order (down the columns), so its elements do not line up with the columns as intended.
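The recycling behaviour is easier to see on a tiny matrix. In this base R sketch the matrix m and the vector targets are invented for illustration; the intent is to test column a against 1 and column b against 2:

```r
m <- matrix(c(1, 2, 1, 2,   # column "a"
              2, 2, 1, 1),  # column "b"
            ncol = 2, dimnames = list(NULL, c("a", "b")))
targets <- c(1, 2)

# Plain ==: targets is recycled down the columns (1, 2, 1, 2, ...),
# so rows alternate between being compared to 1 and to 2.
recycled <- m == targets

# col(m) gives each cell's column index, so every cell in column j
# is compared against targets[j].
per_column <- m == targets[col(m)]

recycled
per_column
```

The two logical matrices differ, which is the same effect that made rowSums() miss two rows in the question.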
We can replicate to make the lengths equal and now it should work
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs[col(select(., mtcars_cols))]) > 0
) %>% pull(myset) %>% sum
#[1] 8
In the above code select is used twice for better understanding. Otherwise, we can also use rep
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == rep(mtcars_numbs, each = n())) > 0
) %>%
pull(myset) %>%
sum
#[1] 8
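Since the original filter_at() call used %in% (any of the values in any of the columns), another option might be if_any() inside mutate(). This is a sketch assuming dplyr 1.0.4 or later. Note that it checks every value against every selected column, which is not the same as the column-by-column pairing above, although for this data both flag the same rows:

```r
library(dplyr)

mtcars_cols <- c("mpg", "cyl")
mtcars_numbs <- c(19.2, 6)

# TRUE when any selected column holds any of the values of interest.
out3 <- mtcars %>%
  mutate(myset = if_any(all_of(mtcars_cols), ~ .x %in% mtcars_numbs))

sum(out3$myset)
```

This scales to many columns and many values without building the comparison vector by hand.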

Parallel wilcox.test using group_by and summarise

There must be an R-ly way to call wilcox.test over multiple observations in parallel using group_by. I've spent a good deal of time reading up on this but still can't figure out a call to wilcox.test that does the job. Example data and code below, using magrittr pipes and summarize().
library(dplyr)
library(magrittr)
# create a data frame where x is the dependent variable, id1 is a category variable (here with five levels), and id2 is a binary category variable used for the two-sample wilcoxon test
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
# make sure piping and grouping are called correctly, with "sum" function as a well-behaving example function
df %>% group_by(id1) %>% summarise(s=sum(x))
df %>% group_by(id1,id2) %>% summarise(s=sum(x))
# make sure wilcox.test is called correctly
wilcox.test(x~id2, data=df, paired=FALSE)$p.value
# yet, cannot call wilcox.test within pipe with summarise (regardless of group_by). Expected output is five p-values (one for each level of id1)
df %>% group_by(id1) %>% summarise(w=wilcox.test(x~id2, data=., paired=FALSE)$p.value)
df %>% summarise(wilcox.test(x~id2, data=., paired=FALSE))
# even specifying formula argument by name doesn't help
df %>% group_by(id1) %>% summarise(w=wilcox.test(formula=x~id2, data=., paired=FALSE)$p.value)
The buggy calls yield this error:
Error in wilcox.test.formula(c(1.09057358373486,
2.28465932554436, 0.885617572657959, : 'formula' missing or incorrect
Thanks for your help; I hope it will be helpful to others with similar questions as well.
Your task will be easily accomplished using the do function (call ?do after loading the dplyr library). Using your data, the chain will look like this:
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
df <- tbl_df(df)
res <- df %>% group_by(id1) %>%
do(w = wilcox.test(x~id2, data=., paired=FALSE)) %>%
summarise(id1, Wilcox = w$p.value)
output
res
Source: local data frame [5 x 2]
id1 Wilcox
(int) (dbl)
1 1 0.6904762
2 2 0.4206349
3 3 1.0000000
4 4 0.6904762
5 5 1.0000000
Note I added the do function between the group_by and summarize.
I hope it helps.
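In more recent dplyr versions do() is, as far as I know, superseded, so a sketch that avoids both do() and the data = . argument may be useful. It passes the two samples to wilcox.test() as plain vectors inside summarise(); set.seed() is added here only to make the illustration reproducible:

```r
library(dplyr)

set.seed(1)
df <- data.frame(x = abs(rnorm(50)), id1 = rep(1:5, 10), id2 = rep(1:2, 25))

# Inside summarise(), x and id2 refer to the current group's values,
# so the two samples can be split by subsetting x on id2.
res <- df %>%
  group_by(id1) %>%
  summarise(w = wilcox.test(x[id2 == 1], x[id2 == 2], paired = FALSE)$p.value)

res
```

This gives one p-value per level of id1, the same shape as the do() result above.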
You can do this with base R (although the result is a cumbersome list):
by(df, df$id1, function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
or with plyr (ddply comes from the plyr package, not dplyr):
ddply(df, .(id1), function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
id1 V1
1 1 0.3095238
2 2 1.0000000
3 3 0.8412698
4 4 0.6904762
5 5 0.3095238
