How to use xtabs in a user-defined function?

I'm trying to create a wrapper for xtabs with a custom function.
I tried doing this:
xtabs_fun <- function(data, variable) {
  xtabs(~ variable, data = data)
}
But using it doesn't return the correct value.
xtabs_fun(mtcars, "cyl")
cyl
1
The expected output should be this:
xtabs(~ cyl, data = mtcars)
cyl
4 6 8
11 7 14

We could use reformulate:
xtabs_fun <- function(data, variable) {
  xtabs(reformulate(variable), data = data)
}
# testing
> xtabs_fun(mtcars, "cyl")
cyl
4 6 8
11 7 14
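Since reformulate() accepts a character vector of term labels, the same idea extends to several variables; a small sketch of that extension (xtabs_fun2 is a hypothetical name and the multi-variable call is my addition, not part of the original answer):
xtabs_fun2 <- function(data, variables) {
  # reformulate() joins multiple term labels with "+", e.g. ~ cyl + gear
  xtabs(reformulate(variables), data = data)
}
xtabs_fun2(mtcars, c("cyl", "gear"))  # cross-tabulation of cyl by gear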

xtabs also accepts a data frame, so:
xtabs_fun <- function(data, variable) xtabs(data[variable])
# test
xtabs_fun(mtcars, "cyl")
## cyl
## 4 6 8
## 11 7 14
Alternatively, use table, which works the same way:
table_fun <- function(data, variable) table(data[variable])
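As a quick check that the table() version behaves the same (this is the output I would expect for mtcars; exact spacing may differ):
table_fun(mtcars, "cyl")
## cyl
##  4  6  8
## 11  7 14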

Variable names inside formulas are always taken literally rather than being substituted, so inside your function, xtabs is looking for a column called variable inside mtcars, which doesn't exist. Instead, you can build the formula as a string:
xtabs_fun <- function(data, variable) {
  xtabs(as.formula(paste("~", variable)), data = data)
}
xtabs_fun(mtcars, "cyl")
#> cyl
#> 4 6 8
#> 11 7 14
If you prefer your function to take unquoted variable names in a tidyverse style, you can do
xtabs_fun <- function(data, variable) {
  xtabs(as.formula(paste("~", deparse(substitute(variable)))), data = data)
}
mtcars |> xtabs_fun(cyl)
#> cyl
#> 4 6 8
#> 11 7 14
Created on 2022-09-24 with reprex v2.0.2

Related

What is the tidyverse way of passing columns into functions? [duplicate]

I am working on a function to perform PCA on a dataset, and I wanted to write a function to do the same stuff on different columns. However, I'm having a hard time doing so because I can't seem to make the function understand that I'm passing through columns. As an example:
perform_pca <- function(columns_to_exclude = c()) {
  pca <- data %>%
    select(-columns_to_exclude) %>%
    other_stuff() %>%
    prcomp()
  pvar_pve <- tibble(
    p.var = pca$sdev ^ 2 / sum(pca$sdev ^ 2),
    pve = cumsum(p.var),
    row_id = seq(1, length(pca) - length(columns_to_exclude))
  )
  ggplot(pvar_pve, ...other things)
}
However, calling it afterwards with
perform_pca(c(data$column1, data$column2, whatever_else))
only works if I call it without arguments. If I pass it one or more columns, it gives me an error message about the tibble length.
Put another way, what is the correct way of passing tibble columns into functions so that dplyr recognizes them as such? For example
test <- function(columns) {
  data %>%
    select(columns)
}
test(c(var1,var2))
would return an error. What's the correct way to actually do this?
You can do it without curly brackets just by using ... to pass to select and passing column names separately:
library(tidyverse)
data <- tibble(
  a = 1:10,
  b = rnorm(10),
  c = letters[1:10],
  d = 21:30
)
test <- function(data, ...) {
  data %>%
    select(-c(...))
}
test(data, a, b)
#> # A tibble: 10 × 2
#> c d
#> <chr> <int>
#> 1 a 21
#> 2 b 22
#> 3 c 23
#> 4 d 24
#> 5 e 25
#> 6 f 26
#> 7 g 27
#> 8 h 28
#> 9 i 29
#> 10 j 30
See the tidy evaluation documentation for information on this and other ways of doing things with tidy evaluation. The benefit of doing it this way, and of making data the first argument, is that you can pipe your data frame into the function, and tidyselect will suggest variables from inside your data frame environment as arguments to the function.
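For instance, because data is the first argument, the function can be used directly in a pipe, and tidyselect helpers also flow through the dots. A small usage sketch, reusing the data and test() defined above (where() assumes dplyr >= 1.0.0):
data %>% test(a, b)              # same as test(data, a, b)
data %>% test(where(is.numeric)) # drop all numeric columns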
You can also pass a vector of columns, which is where curly brackets are needed:
test <- function(data, vars) {
  data %>%
    select(-c({{ vars }}))
}
test(data, c(a, b))

Pass multiple column names in function to dplyr::distinct() with Spark

I want to specify an unknown number of column names in a function that will use dplyr::distinct(). My current attempt is:
myFunction <- function(table, id) {
  table %>%
    dplyr::distinct(.data[[id]])
}
I'm trying .data[[id]] above because the data-masking section of this dplyr blog post states:
When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).
and the documentation for dplyr::distinct() says about its second argument:
<data-masking> Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.
Spark
More specifically, I'm trying to use this function with Spark.
sc <- sparklyr::spark_connect(master = "local")
mtcars_tbl <- sparklyr::copy_to(sc, mtcars, "mtcars_spark")
##### desired return
mtcars_tbl %>% dplyr::distinct(cyl, gear)
# Source: spark<?> [?? x 2]
cyl gear
<dbl> <dbl>
1 6 4
2 4 4
3 6 3
4 8 3
5 4 3
6 4 5
7 8 5
8 6 5
##### myFunction fails
id = c("cyl", "gear")
myFunction(mtcars_tbl, id)
Error: Can't convert a call to a string
Run `rlang::last_error()` to see where the error occurred.
Following this comment, I have other failed attempts:
myFunction <- function(table, id) {
  table %>%
    dplyr::distinct(.dots = id)
}
myFunction(mtcars_tbl, id)
# Source: spark<?> [?? x 1]
.dots
<list>
1 <named list [2]>
#####
myFunction <- function(table, id) {
  table %>%
    dplyr::distinct_(id)
}
myFunction(mtcars_tbl, id)
Error in UseMethod("distinct_") :
no applicable method for 'distinct_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"
Distinct applies to all columns of a table at once. Consider an example table:
A B
1 4
1 4
2 3
2 3
3 3
3 5
It is not clear what applying distinct to only column A, but not column B should return. The following example is clearly not a good choice because it breaks the relationship between columns A and B. For example, there is no (A = 2, B = 4) row in the original dataset.
A B
1 4
2 4
3 3
  3
  3
  5
Hence the best approach is to select only those columns you want first, and then take the distinct. Something more like:
myFunction <- function(table, id) {
  table %>%
    dplyr::select(dplyr::all_of(id)) %>%
    dplyr::distinct()
}
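A quick usage sketch of that function with the id vector from the question (I have not re-run this against a live Spark connection, but the lazy select()/distinct() pipeline translates to SQL, and the same call also works on a plain data frame):
id <- c("cyl", "gear")
myFunction(mtcars_tbl, id)  # should match mtcars_tbl %>% dplyr::distinct(cyl, gear)
myFunction(mtcars, id)      # works on an ordinary data frame too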

Stratified random sampling from data frame: follow-up

I am trying to randomly sample 50% of the data for each group, following Stratified random sampling from data frame. A reproducible example using the mtcars dataset in R is below. What I don't understand is: the sample index clearly shows a group of gear labeled '5', but when the index is applied to the mtcars dataset, the sampled data mtcars2 does not contain any record with gear = '5'. What went wrong? Thank you very much.
> set.seed(14908141)
> index=tapply(1:nrow(mtcars),mtcars$gear,function(x){sample(length(x),length(x)*0.5)})
> index
$`3`
[1] 6 7 14 4 12 9 13
$`4`
[1] 12 7 8 4 6 5
$`5`
[1] 5 1
> mtcars2=mtcars[unlist(index),]
> table(mtcars2$gear)
3 4
12 3
I think the approach you've used creates indices 1:length(x) within each gear group, so the same row numbers repeat across groups; see in your output above that index 7 appears in both gear group 3 and gear group 4. When you then subset with unlist(index), those within-group indices are treated as row numbers of the whole mtcars data frame, which is why the result is not what you expect.
Base R
I would use split first to split by gear:
res <- split(mtcars, mtcars$gear)
Then I loop over this list using lapply and sample 50% of the rows in each group:
res2 <- lapply(res, function(x) {
  x[sample(1:nrow(x), nrow(x) * 0.5, FALSE), ]
})
If you would like one dataset at the end (instead of a list), you can combine them using do.call:
final_df <- do.call(rbind, res2)
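As a small sanity check I've added (not part of the original answer), you can compare group sizes before and after sampling:
table(mtcars$gear)     # original group sizes
table(final_df$gear)   # should be roughly half of each group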
dplyr
A simpler approach would be:
library(dplyr)
mtcars %>%
  group_by(gear) %>%
  sample_frac(0.5)
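In recent versions of dplyr, sample_frac() is superseded by slice_sample(), so an equivalent grouped call (assuming dplyr >= 1.0.0) would be:
mtcars %>%
  group_by(gear) %>%
  slice_sample(prop = 0.5)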

Using quasiquotation in functions with a formula interface

I want to write a custom function that can take bare and "string" inputs, and can handle both functions with and without the formula interface.
custom function example
# setup
set.seed(123)
library(tidyverse)
# custom function
foo <- function(data, x, y) {
  # function without formula
  print(table(data %>% dplyr::pull({{ x }}), data %>% dplyr::pull({{ y }})))
  # function with formula
  print(
    broom::tidy(stats::t.test(
      formula = rlang::new_formula({{ rlang::ensym(y) }}, {{ rlang::ensym(x) }}),
      data = data
    ))
  )
}
bare
works for both functions with and without formula interface
foo(mtcars, am, cyl)
#>
#> 4 6 8
#> 0 3 4 12
#> 1 8 3 2
#> # A tibble: 1 x 10
#> estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.87 6.95 5.08 3.35 0.00246 25.9 0.724 3.02
#> # ... with 2 more variables: method <chr>, alternative <chr>
string
works for both functions with and without formula interface
foo(mtcars, "am", "cyl")
#>
#> 4 6 8
#> 0 3 4 12
#> 1 8 3 2
#> # A tibble: 1 x 10
#> estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.87 6.95 5.08 3.35 0.00246 25.9 0.724 3.02
#> # ... with 2 more variables: method <chr>, alternative <chr>
as colnames
works only for functions without the formula interface
foo(mtcars, colnames(mtcars)[9], colnames(mtcars)[2])
#>
#> 4 6 8
#> 0 3 4 12
#> 1 8 3 2
#> Error: Only strings can be converted to symbols
#> Backtrace:
#> x
#> 1. \-global::foo(mtcars, colnames(mtcars)[9], colnames(mtcars)[2])
#> 2. +-base::print(...)
#> 3. +-broom::tidy(...)
#> 4. +-stats::t.test(...)
#> 5. +-rlang::new_formula(...)
#> 6. \-rlang::ensym(y)
How can I modify the original function so that it will work with all the above-mentioned ways of entering the inputs and for both kinds of functions used?
The nice philosophy of rlang is that you get to control when you want values to be evaluated via the !! and {{ }} operators. You seem to want to make a function that takes strings, symbols, and (possibly evaluated) expressions all in the same parameter. Using symbols or bare strings is actually easy with ensym, but also wanting to allow for code like colnames(mtcars)[9] that has to be evaluated before returning a string is the problem. This can potentially be quite confusing. For example, what's the behavior you expect when you run the following?
am <- 'disp'
cyl <- 'gear'
foo(mtcars, am, cyl)
You could write a helper function if you want to assume all "calls" should be evaluated but symbols and literals should not. Here's a "cleaner" function
clean_quo <- function(x) {
  if (rlang::quo_is_call(x)) {
    x <- rlang::eval_tidy(x)
  } else if (!rlang::quo_is_symbolic(x)) {
    x <- rlang::quo_get_expr(x)
  }
  if (is.character(x)) x <- rlang::sym(x)
  if (!rlang::is_quosure(x)) x <- rlang::new_quosure(x)
  x
}
and you could use that in your function with
foo <- function(data, x, y) {
  x <- clean_quo(rlang::enquo(x))
  y <- clean_quo(rlang::enquo(y))
  # function without formula
  print(table(data %>% dplyr::pull(!!x), data %>% dplyr::pull(!!y)))
  # function with formula
  print(
    broom::tidy(stats::t.test(
      formula = rlang::new_formula(rlang::quo_get_expr(y), rlang::quo_get_expr(x)),
      data = data
    ))
  )
}
Doing so will allow all these to return the same values
foo(mtcars, am, cyl)
foo(mtcars, "am", "cyl")
foo(mtcars, colnames(mtcars)[9], colnames(mtcars)[2])
But you are probably just delaying other possible problems. I would not recommend over-interpreting user intentions with this kind of code; that's why it's better to let users explicitly do the unquoting themselves. Perhaps provide two different versions of the function: one for parameters that require evaluation and one for those that do not.
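For illustration, a minimal sketch of that suggestion; the split into foo_chr() and foo_sym(), and their names, are my assumptions rather than part of the original answer. One version takes character strings, the other takes bare column names, and the caller decides which to use:
# Sketch only: two explicit variants instead of one "guess the intent" function
foo_chr <- function(data, x, y) {
  # x and y are column names as strings
  print(table(data[[x]], data[[y]]))
  broom::tidy(stats::t.test(
    formula = rlang::new_formula(rlang::sym(y), rlang::sym(x)),
    data = data
  ))
}
foo_sym <- function(data, x, y) {
  # x and y are bare column names; convert to strings and delegate
  foo_chr(data, rlang::as_name(rlang::ensym(x)), rlang::as_name(rlang::ensym(y)))
}
foo_sym(mtcars, am, cyl)
foo_chr(mtcars, colnames(mtcars)[9], colnames(mtcars)[2])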
I have to agree with @MrFlick and others about the inherent ambiguity when mixing standard and non-standard evaluation. (I also pointed this out in your similar question from a while ago.)
However, one can argue that dplyr::select() works with symbols, strings and expressions of the form colnames(.)[.]. If you absolutely must have the same interface, then you can leverage tidyselect to resolve your inputs:
library( rlang )
library( tidyselect )
ttest <- function(data, x, y) {
  ## Identify locations of x and y in data, get column names as symbols
  s <- eval_select(expr(c({{ x }}, {{ y }})), data) %>% names() %>% syms()
  ## Use the corresponding symbols to build the formula by hand
  broom::tidy(stats::t.test(
    formula = new_formula(s[[2]], s[[1]]),
    data = data
  ))
}
## All three now work
ttest( mtcars, am, cyl )
ttest( mtcars, "am", "cyl" )
ttest( mtcars, colnames(mtcars)[9], colnames(mtcars)[2] )
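If the table() part of the original foo() is also needed, the resolved names can be reused for it; a minimal sketch under that assumption (foo2 is a hypothetical name, not from the answer):
foo2 <- function(data, x, y) {
  ## Resolve x and y to column names once, then reuse them everywhere
  nm <- names(eval_select(expr(c({{ x }}, {{ y }})), data))
  print(table(data[[nm[1]]], data[[nm[2]]]))
  broom::tidy(stats::t.test(
    formula = new_formula(sym(nm[2]), sym(nm[1])),
    data = data
  ))
}
foo2(mtcars, "am", "cyl")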

How to drop all columns with the same name (tidyverse or base R)?

My dataset has several columns with the same name and I'd like to delete them.
Unfortunately, select(data, -Offender) doesn't work (Error: Can't bind data because some arguments have the same name), and I can't use janitor::clean_names to get around it either, since I need the remaining column names to stay unchanged.
I tried base R (df1 <- subset(df1, select = -c(Offender))), but it then renames the other Offender columns.
Here's a base solution.
# Create data frame
df <- data.frame(runif(10), runif(10), runif(10), runif(10))
names(df) <- c("foo", "bar", "bar", "bar")
# Examine data
print(df)
#> foo bar bar bar
#> 1 0.4126322 0.1436917 0.4870304 0.80514914
#> 2 0.1642798 0.7948858 0.7741611 0.64353516
#> 3 0.6485090 0.6607900 0.9260459 0.75226835
#> 4 0.5254531 0.5535233 0.5746025 0.07131740
#> 5 0.1367792 0.4830001 0.4947695 0.46050685
#> 6 0.9517484 0.5750567 0.2231635 0.20050668
#> 7 0.9535626 0.1362335 0.5429668 0.07581875
#> 8 0.6679136 0.5017095 0.3106441 0.02235911
#> 9 0.4489072 0.3811038 0.1922099 0.07851105
#> 10 0.7724686 0.1284880 0.4240226 0.44462081
# Remove all columns called "bar"
df[, "bar" == names(df)] <- NULL
# Examine data again
print(df)
#> foo
#> 1 0.4126322
#> 2 0.1642798
#> 3 0.6485090
#> 4 0.5254531
#> 5 0.1367792
#> 6 0.9517484
#> 7 0.9535626
#> 8 0.6679136
#> 9 0.4489072
#> 10 0.7724686
Created on 2019-06-24 by the reprex package (v0.3.0)
I figured it out as I finished typing the question:
Use the base R command df1 <- subset(df1, select = -Offender) to rename all the Offenders, leaving the other columns untouched.
Then use select_at(vars(-starts_with("Offender"))) and Bob's your uncle.
In short:
df1 <- subset(df1, select = -Offender) %>%
  select_at(vars(-starts_with("Offender")))
Essentially, base R won't complain that columns have the same name and will silently fix it for you. At that point, you can tidyverse to your heart's content.
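As a side note (my addition, not part of the original answers), plain base R indexing can also drop every column with that exact name in one step, mirroring the "bar" example above:
# Keep only columns whose name is not exactly "Offender"
df1 <- df1[, names(df1) != "Offender", drop = FALSE]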
If you want to remove all columns named Offender, I think there is an easier and more elegant way:
df2 <- df1 %>%
  dplyr::select(-starts_with("Offender"))
If that is not what you want, maybe a minimal example would help to better illustrate what you are trying to do.
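One caveat worth noting (my addition): starts_with("Offender") would also drop columns such as OffenderAge. If only the exactly named columns should go, an anchored regular expression via matches() is stricter:
df2 <- df1 %>%
  dplyr::select(-matches("^Offender$"))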
