Using dplyr in a function to create new dataframes

I'm trying to create new dataframes from dplyr 0.4.3 functions using R 3.2.2.
What I want to do is create some new dataframes using dplyr::filter to separate out data from one ginormous dataframe into a bunch of smaller dataframes.
For a bog-simple reproducible example, I used this:
filter(mtcars, cyl == 4)
I know I need to assign that to a dataframe of its own, so I started with:
paste("Cylinders:", x, sep = "") <- filter(mtcars, cyl == 4)
That didn't work -- it gave me the error found here: Assignment Expands to Non-Language Object
From there, I found this: Create A Variable Name with Paste in R
(also, big ups to the authors of the above)
And that led me to this, which works:
assign(paste("gears_cars_cylinders", 4, sep = "_"), filter(mtcars, cyl == 4)) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
and by "works," I mean I get a dataframe named gears_cars_cylinders_4 with all the goodies from
filter(mtcars, cyl == 4) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
But ultimately, I think I need to wrap this whole thing in a function and be able to feed it the cylinder numbers from mtcars$cyl. I'm thinking something like plyr::ldply(mtcars$cyl, function_name)?
In my real-life data, I have about 70 different classes I need to split out into separate dataframes to drop into DT::datatable tabs in Shiny, which is a whole nuther mess. Anyway.
When I try this:
function_name <- function(x){
assign(paste("gears_cars_cylinders", x, sep = "_"), filter(mtcars, cyl == x)) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
}
and then function_name(6),
I get the output of the dataframe to the screen, but not a dataframe with the name.
Am I looking right over the answer here?

You need to assign the new data frames into the environment from which you're calling function_name(). Try something like this:
library(dplyr)
foo <- function(x) {
assign(paste("gears_cars_cylinders", x, sep = "_"),
envir = parent.frame(),
value = mtcars %>%
filter(cyl == x) %>%
count(gear))
}
for(cyl in sort(unique(mtcars$cyl))) foo(cyl)
ls()
#> [1] "cyl" "foo"
#> [3] "gears_cars_cylinders_4" "gears_cars_cylinders_6"
#> [5] "gears_cars_cylinders_8"
gears_cars_cylinders_4
#> Source: local data frame [3 x 2]
#>
#>    gear     n
#>   (dbl) (int)
#> 1     3     1
#> 2     4     8
#> 3     5     2
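To see why the function in the question printed its result but left no named object behind, here is a small base R sketch. The function and object names (make_local, make_in_caller, cars_cyl_*) are made up for illustration. By default assign() writes into the environment it is called from, which inside a function is that function's own frame, and that frame is thrown away when the function returns.

```r
# assign() without `envir`: the object is created in the function's own
# frame, which disappears as soon as the function returns.
make_local <- function(x) {
  assign(paste("cars_cyl", x, sep = "_"), mtcars[mtcars$cyl == x, ])
  invisible(NULL)
}

# assign() with envir = parent.frame(): the object is created in the
# environment the function was called from.
make_in_caller <- function(x) {
  assign(paste("cars_cyl", x, sep = "_"), mtcars[mtcars$cyl == x, ],
         envir = parent.frame())
  invisible(NULL)
}

make_local(4)
exists("cars_cyl_4", inherits = FALSE)  # FALSE in a fresh session

make_in_caller(6)
exists("cars_cyl_6", inherits = FALSE)  # TRUE
nrow(cars_cyl_6)                        # 7 six-cylinder cars in mtcars
```

Calling make_in_caller() at the top level puts the object in the global environment; calling it from inside another function would put it in that function's frame instead.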

Related

group by and concatenate string using sparklyr

There are a number of questions asking precisely the same thing but none within the context of a sparklyr environment. How does one group by a column and then concatenate the values of some other column as a list?
For example the following results in the desired output in a local R environment.
mtcars %>%
distinct(gear, cyl) %>%
group_by(gear) %>%
summarize(test_list = paste0(cyl, collapse = ";")) %>%
select(gear, test_list) %>%
as.data.frame() %>%
print()
  gear test_list
1    3     6;8;4
2    4       6;4
3    5     4;8;6
But registering that same table with Spark and running the same code fails on the summarize with a SQL parsing error (see code below), probably because it tries to apply a Spark collapse function instead of R's C-based collapse. I know PySpark and Spark SQL have a collect_set() function that achieves the desired effect; is there something analogous for sparklyr?
sdf_copy_to(sc, x = mtcars, name = "mtcars_test")
tbl(sc, "mtcars_test") %>%
distinct(gear, cyl) %>%
group_by(gear) %>%
summarize(test_list = paste0(cyl, collapse = ";"))
Error:
Error : org.apache.spark.sql.catalyst.parser.ParseException:
In PySpark, the following approach is similar (except that the concatenated column is an array, which can then be collapsed).
from pyspark.sql.functions import collect_set
df2 = spark.table("mtcars_test")
df2.groupby("gear").agg(collect_set('cyl')).createOrReplaceTempView("mtcars_test_cont")
display(spark.table("mtcars_test_cont"))
gear  collect_set(cyl)
3     [8, 4, 6]
4     [4, 6]
5     [8, 4, 6]
Instead of using R functions, you can use Spark SQL syntax directly by wrapping it in the sql() function from dbplyr. The script below produces the desired output:
sdf_copy_to(sc, x = mtcars, name = "mtcars_test")
tbl(sc, "mtcars_test") %>%
group_by(gear) %>%
summarize(test_list = sql("array_join(collect_set(cast(cyl as int)), ';')"))
#> gear test_list
#> <dbl> <chr>
#> 4 6;4
#> 3 6;4;8
#> 5 6;4;8
I only changed the last line of your code, where you used paste0().
This is one reason why I prefer SparkR to sparklyr: almost all PySpark syntax works the same way there.
SparkR::agg(
SparkR::group_by(SparkR::createDataFrame(mtcars), SparkR::column("gear")),
test_list = SparkR::array_join(
SparkR::collect_set(SparkR::cast(SparkR::column("cyl"), "integer")),
";"
)
) %>%
SparkR::collect()
#> gear test_list
#> 4 6;4
#> 3 6;4;8
#> 5 6;4;8

Reconcile dataset *column types* (formats) using a dictionary/list in R/dplyr

Following on the renaming request #67453183 I want to do the same for formats using the dictionary, because it won't bring together columns of distinct types.
I have a series of data sets and a dictionary to bring these together, but I'm struggling to figure out how to automate this. Suppose this data and dictionary (the actual one is much longer, which is why I want to automate):
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("factor", "numeric")
)
I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.... I want to automate 'coalesce all columns with duplicate names'.
And to bring these together, the types need to be the same too. I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.
#ronakShah in the previous query proposed
pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
flatten_chr() -> val
mtcars_all <- list(mtcarsA,mtcarsB) %>%
map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))
Which works great in the previous example but not if the formats vary. Here it throws error:
Error: Can't combine `..1$cyl_true` <double> and `..2$cyl_true` <factor<51fac>>.
This response to #56773354 offers a related solution if one has a complete list of types, but not for a type list by column name, as I have.
Desired output:
mtcars_all
# A tibble: 4 x 3
  mpg_true  cyl_true  disp
  <factor> <numeric> <dbl>
1       21         6   160
2       21         6   160
3     22.8         4   108
4     21.4         6   258
Something simpler:
library(magrittr) # %<>% is cool
library(dplyr)
# The renaming is easy:
renameA <- dic$nameA
renameB <- dic$nameB
names(renameA) <- dic$true_name
names(renameB) <- dic$true_name
mtcarsA %<>% rename(all_of(renameA))
mtcarsB %<>% rename(all_of(renameB))
# Formatting is a little harder:
formats <- dic$true_format
names(formats) <- dic$true_name
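lapply(names(formats), function (x) {
  # there's no nice programmatic way to do this, I think
  coercer <- switch(formats[[x]],
    factor = as.factor,
    # go through character first: as.numeric() on a factor would return
    # its integer codes (1, 2, ...) rather than the labels ("4", "6")
    numeric = function(v) as.numeric(as.character(v)),
    # fail early; a warning here would leave `coercer` holding a string
    stop("Unrecognized format: ", formats[[x]])
  )
  mtcarsA[[x]] <<- coercer(mtcarsA[[x]])
  mtcarsB[[x]] <<- coercer(mtcarsB[[x]])
})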
mtcars_all <- bind_rows(mtcarsA, mtcarsB)
As background, be aware of how base R concatenated factors before 4.1.0 (c() on factors returned the underlying integer codes) and how that changed in 4.1.0 (c() now returns a factor whose levels are the union of the inputs' levels). Here it probably doesn't matter, because bind_rows() uses the vctrs package.
I took a different approach from Ronak's to read the dictionary. It is more verbose, but I find it a bit more readable. A benchmark would be interesting to see which one is faster ;-)
Unfortunately, it seems that you cannot blindly cast a variable to a factor with as(), so I switched to character instead. In practice, it should behave like a factor for most purposes, and you can call as_factor() on the final object if this is very important to you. Another possibility would be to store a casting function name (such as as_factor) in the dictionary, retrieve it using get() and use it instead of as().
library(tidyverse)
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("numeric", "character") #instead of factor
)
dic2 = dic %>%
pivot_longer(-c(true_name, true_format), names_to=NULL)
read_dic = function(key, dict=dic2){
x = dict[dict$value==key,][["true_name"]]
if(length(x)!=1) x=key
x
}
rename_from_dic = function(df, dict=dic2){
rename_with(df, ~{
map_chr(.x, ~read_dic(.x, dict))
})
}
cast_from_dic = function(df, dict=dic){
mutate(df, across(everything(), ~{
cl=dict[dict$true_name==cur_column(),][["true_format"]]
if(length(cl)!=1) cl=class(.x)
as(.x, cl, strict=FALSE)
}))
}
list(mtcarsA,mtcarsB) %>%
map(rename_from_dic) %>%
map_df(cast_from_dic)
#> # A tibble: 4 x 3
#> mpg_true cyl_true disp
#> <dbl> <chr> <dbl>
#> 1 21 6 160
#> 2 21 6 160
#> 3 22.8 4 108
#> 4 21.4 6 258
Created on 2021-05-09 by the reprex package (v2.0.0)
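The alternative mentioned above (storing the name of a casting function in the dictionary and retrieving it with get()) might look roughly like the sketch below. It is base R only and not part of the original answer: the caster column, the dic_cast table and the cast_with_dic() helper are all hypothetical names, and the two small data frames stand in for the already-renamed mtcarsA and mtcarsB.

```r
# Hypothetical dictionary: `caster` holds the *name* of a coercion function.
dic_cast <- data.frame(
  true_name = c("mpg_true", "cyl_true"),
  caster    = c("as.numeric", "as.character"),
  stringsAsFactors = FALSE
)

# Look up each column's caster by name and apply it; columns that are not
# listed in the dictionary (here: disp) are left untouched.
cast_with_dic <- function(df, dict = dic_cast) {
  for (i in seq_len(nrow(dict))) {
    col <- dict$true_name[i]
    if (col %in% names(df)) {
      cast_fun <- get(dict$caster[i], mode = "function")
      df[[col]] <- cast_fun(df[[col]])
    }
  }
  df
}

# Stand-ins for the renamed data sets; cyl_true is a factor in the second.
a <- data.frame(mpg_true = c(21, 21), cyl_true = c(6, 6),
                disp = c(160, 160))
b <- data.frame(mpg_true = c(22.8, 21.4), cyl_true = factor(c(4, 6)),
                disp = c(108, 258))

all_cars <- do.call(rbind, lapply(list(a, b), cast_with_dic))
str(all_cars)
```

Storing function names rather than class names means any coercion that exists as a function can be used, including ones like as.factor that as() does not handle.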

Programmatically dropping a `group_by` field in dplyr

I'm writing functions that take in a data.frame and then do some operations. I need to add and subtract items from the group_by criteria in order to get where I want to go.
If I want to add a group_by criterion to a df, that's pretty easy:
library(tidyverse)
set.seed(42)
n <- 10
input <- data.frame(a = 'a',
b = 'b' ,
vals = 1
)
input %>%
group_by(a) ->
grouped
grouped
#> # A tibble: 1 x 3
#> # Groups: a [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## add a group:
grouped %>%
group_by(b, add=TRUE)
#> # A tibble: 1 x 3
#> # Groups: a, b [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## drop a group?
But how do I programmatically drop the grouping by b which I added, yet keep all other groupings the same?
Here's an approach that uses tidyeval so that bare column names can be used as the function arguments. I'm not sure if it makes sense to convert the bare column names to text (as I've done below) or if there's a more elegant way to work directly with the bare column names.
drop_groups = function(data, ...) {
groups = map_chr(groups(data), rlang::quo_text)
drop = map_chr(quos(...), rlang::quo_text)
if(any(!drop %in% groups)) {
warning(paste("Input data frame is not grouped by the following groups:",
paste(drop[!drop %in% groups], collapse=", ")))
}
data %>% group_by_at(setdiff(groups, drop))
}
d = mtcars %>% group_by(cyl, vs, am)
groups(d %>% drop_groups(vs, cyl))
[[1]]
am
groups(d %>% drop_groups(a, vs, b, c))
[[1]]
cyl
[[2]]
am
Warning message:
In drop_groups(., a, vs, b, c) :
Input data frame is not grouped by the following groups: a, b, c
UPDATE: The approach below works directly with quosured column names, without converting them to strings. I'm not sure which approach is "preferred" in the tidyeval paradigm, or whether there is yet another, more desirable method.
drop_groups2 = function(data, ...) {
groups = map(groups(data), quo)
drop = quos(...)
if(any(!drop %in% groups)) {
warning(paste("Input data frame is not grouped by the following groups:",
paste(drop[!drop %in% groups], collapse=", ")))
}
data %>% group_by(!!!setdiff(groups, drop))
}
Maybe something like this to remove grouping variables from the end of the list back:
grouped %>%
group_by(b, add=TRUE) -> grouped
grouped %>% group_by_at(.vars = group_vars(.)[-2])
or use head or tail or something on the output from group_vars for more control.
It would be interesting to have this sort of utility function available more generally:
peel_groups <- function(.data,n){
.data %>%
group_by_at(.vars = head(group_vars(.data),-n))
}
A more thought out version would likely include more careful checks on n being out of bounds.
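As a rough idea of what those checks could look like, here is a sketch. It assumes dplyr 1.0.0 or later (for across() and all_of()), and the name peel_groups_checked is made up for illustration.

```r
library(dplyr)

# Drop the last `n` grouping variables, with basic bounds checking.
peel_groups_checked <- function(.data, n = 1) {
  vars <- group_vars(.data)
  stopifnot(is.numeric(n), length(n) == 1, n >= 0)
  if (n > length(vars)) {
    warning("n exceeds the number of grouping variables; dropping all groups")
    n <- length(vars)
  }
  keep <- head(vars, length(vars) - n)
  # Nothing left to group by: return an ungrouped data frame.
  if (length(keep) == 0) return(ungroup(.data))
  .data %>%
    ungroup() %>%
    group_by(across(all_of(keep)))
}

d <- mtcars %>% group_by(cyl, vs, am)
group_vars(peel_groups_checked(d, 1))
#> [1] "cyl" "vs"
group_vars(peel_groups_checked(d, 3))
#> character(0)
```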
Function to remove groups by column name:
drop_groups_at <- function(df, vars){
df %>%
group_by_at(setdiff(group_vars(.), vars))
}
input %>%
group_by(a, b) %>%
drop_groups_at('b') %>%
group_vars
# [1] "a"
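The answers above rely on group_by_at() and add = TRUE, which newer dplyr releases have superseded. Assuming dplyr 1.0.0 or later, the same drop-by-name idea can be sketched with across() and all_of(); the function name drop_groups_by_name is hypothetical.

```r
library(dplyr)

# Regroup by every current grouping variable except those named in `vars`.
drop_groups_by_name <- function(df, vars) {
  keep <- setdiff(group_vars(df), vars)
  # Nothing left to group by: return an ungrouped data frame.
  if (length(keep) == 0) return(ungroup(df))
  df %>%
    ungroup() %>%
    group_by(across(all_of(keep)))
}

d <- mtcars %>% group_by(cyl, vs, am)

d %>% drop_groups_by_name("vs") %>% group_vars()
#> [1] "cyl" "am"

# Adding a group in the same dplyr versions uses `.add` instead of `add`:
mtcars %>% group_by(cyl) %>% group_by(gear, .add = TRUE) %>% group_vars()
#> [1] "cyl"  "gear"
```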

How to programmatically group a data_frame by each column name specified in a vector? [duplicate]

I'm writing a function where the user is asked to define one or more grouping variables in the function call. The data is then grouped using dplyr and it works as expected if there is only one grouping variable, but I haven't figured out how to do it with multiple grouping variables.
Example:
x <- c("cyl")
y <- c("cyl", "gear")
dots <- list(~cyl, ~gear)
library(dplyr)
library(lazyeval)
mtcars %>% group_by_(x) # groups by cyl
mtcars %>% group_by_(y) # groups only by cyl (not gear)
mtcars %>% group_by_(.dots = dots) # groups by cyl and gear, this is what I want.
I tried to turn y into the same as dots using:
mtcars %>% group_by_(.dots = interp(~var, var = list(y)))
#Error: is.call(expr) || is.name(expr) || is.atomic(expr) is not TRUE
How to use a user-defined input string of > 1 variable names (like y in the example) to group the data using dplyr?
(This question is somehow related to this one but not answered there.)
No need for interp here, just use as.formula to convert the strings to formulas:
dots = sapply(y, . %>% {as.formula(paste0('~', .))})
mtcars %>% group_by_(.dots = dots)
The reason why your interp approach doesn’t work is that the expression gives you back the following:
~list(c("cyl", "gear"))
– not what you want. You could, of course, sapply interp over y, which would be similar to using as.formula above:
dots1 = sapply(y, . %>% {interp(~var, var = .)})
But, in fact, you can also directly pass y:
mtcars %>% group_by_(.dots = y)
The dplyr vignette on non-standard evaluation goes into more detail and explains the difference between these approaches.
slice_rows() from the purrrlyr package (https://github.com/hadley/purrrlyr) groups a data.frame by taking a vector of column names (strings) or positions (integers):
y <- c("cyl", "gear")
mtcars_grp <- mtcars %>% purrrlyr::slice_rows(y)
class(mtcars_grp)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
group_vars(mtcars_grp)
#> [1] "cyl" "gear"
Particularly useful now that group_by_() has been deprecated.
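For readers on current dplyr, here is a sketch of grouping by a character vector without group_by_(). It assumes dplyr 1.0.0 or later for across() and all_of(), and the rlang package for syms(); neither variant comes from the answers above.

```r
library(dplyr)

y <- c("cyl", "gear")

# Variant 1: select the grouping columns by name with across(all_of()).
grouped <- mtcars %>% group_by(across(all_of(y)))
group_vars(grouped)
#> [1] "cyl"  "gear"
n_groups(grouped)
#> [1] 8

# Variant 2: turn the strings into symbols and splice them in with !!!.
grouped2 <- mtcars %>% group_by(!!!rlang::syms(y))
group_vars(grouped2)
#> [1] "cyl"  "gear"
```

all_of() raises an error if any name in y is missing from the data, which is usually what you want when the names come from user input.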
