Using dplyr within a function, Grouping Error with function arguments - r

Below I have a working example of what I would like the function to do, and then script for the function, noting where the Error occurs.
The error message is:
Error: index out of bounds
Which I know usually means R can’t find the variable that’s being called.
Interestingly, in my function example below, if I only group by my subgroup_name (which is passed to the function and becomes a column in the newly created dataframe) the function will successfully regroup that variable, but I also want to group by a newly created column (from the melt) called variable.
Similar code used to work for me using regroup(), but that has been deprecated. I am trying to use group_by_() but to no avail.
I have read many other posts and answers and experimented several hours today but still not successful.
# Initialize example dataset
library(dplyr)
library(reshape2) # assumption: melt() comes from reshape2
database <- ggplot2::diamonds
database$diamond <- row.names(database) # needed for melting
subgroup_name <- "cut" # can replace with "color" or "clarity"
subgroup_column <- 2 # can replace with 3 for color, 4 for clarity
# This works, although it would be preferable not to need separate variables for subgroup_name and subgroup_column number
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by(cut, variable) %>%
summarise(value = round(mean(value, na.rm = TRUE),2))
# This does not work, I am expecting the same output as above
subgroup_analysis <- function(database,...){
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by_(subgroup_name, variable) %>% # problem appears to be with finding "variable"
summarise(value = round(mean(value, na.rm = TRUE),2))
print(df)
}
subgroup_analysis(database, subgroup_column, subgroup_name)

From the NSE vignette:
If you also want to output variables to vary, you need to pass a list
of quoted objects to the .dots argument:
Here, variable should be quoted:
subgroup_analysis <- function(database,...){
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by_(subgroup_name, quote(variable)) %>%
summarise(value = round(mean(value, na.rm = TRUE),2))
print(df)
}
subgroup_analysis(database, subgroup_column, subgroup_name)
As mentioned by @RichardScriven, if you plan to assign the result to a new variable, you may want to remove the print() call at the end and just return df, or skip assigning df inside the function altogether.
Otherwise the result is printed even when you do x <- subgroup_analysis(...)
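Since group_by_() has itself been deprecated in later dplyr releases, here is a minimal sketch of the same idea for dplyr 1.0 or newer. It assumes tidyr is available and uses pivot_longer() in place of melt(). The helper name subgroup_means, its measure_cols argument and the toy data frame are made up for illustration; they are not from the original post.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: the grouping column is passed as a string
subgroup_means <- function(data, subgroup_name, measure_cols = c("x", "y", "z")) {
  data %>%
    select(all_of(c(subgroup_name, measure_cols))) %>%
    # pivot_longer() stands in for melt(); "variable" and "value" mirror melt()'s default names
    pivot_longer(all_of(measure_cols), names_to = "variable", values_to = "value") %>%
    # group by the string-named column and by the newly created "variable" column
    group_by(across(all_of(c(subgroup_name, "variable")))) %>%
    summarise(value = round(mean(value, na.rm = TRUE), 2), .groups = "drop")
}

# Small made-up stand-in for ggplot2::diamonds
toy <- data.frame(
  cut   = c("Fair", "Fair", "Ideal", "Ideal"),
  color = c("D", "E", "D", "E"),
  x     = c(4, 6, 5, 7),
  y     = c(4, 6, 5, 7),
  z     = c(2, 4, 3, 5)
)

result <- subgroup_means(toy, "cut")
print(result)
```

Because the function returns the summary instead of printing it, x <- subgroup_means(toy, "color") assigns silently, and only one argument is needed to pick the subgroup.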

Related

R function returns 0, code inside same function when run outside of function returns correct value

Okay, so I'm new to stackoverflow and R. This is probably a pretty silly problem, but I couldn't find a solution anywhere.
Packages
library(tidyverse)
Here is the code
sample_size_finder <- function(variable, value){
df %>%
select(variable) %>%
filter(variable == value) %>%
count()
}
df = data frame I imported earlier
variable = the name of the column I want to select and filter from
value = the value in that column I want to filter for
I want the function to look for the column in my data frame specified by the argument "variable", select it, and filter from it all observations that are equal to the argument value. Then, count() all those observations.
When I try to run the function
sample_size_finder("team_size", 4)
it returns
n
1 0
Which is zero and therefore the wrong output.
Running the code inside without the function wrapped around it with the same arguments
df %>%
select(team_size) %>%
filter(team_size == 4) %>%
count()
returns
n
1 12
which is the correct output (12 observations for team_size equal to 4).
I narrowed my issue down to the variable argument in my sample_size_finder(variable, value) function. Here is what I tried so far:
changed the function to only have "value" as an argument. This gives the desired output, but I want to have the "variable" argument included
changed the function to only have "variable" as an argument. This returned "0" as output
ran the function without putting "variable" in string quotes. This returned "0" as output
I'm almost ready to give up and just use the code without the function, but somehow this bothers me a lot. It would be so appreciated if a kind soul would help me :)
All the best,
Matthias
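Inside the function, variable holds the string "team_size", so filter(variable == value) compares that string with 4 and matches no rows. Use the .data pronoun to look the column up by its name: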
library(dplyr)
sample_size_finder <- function(variable, value){
df %>%
select(all_of(variable)) %>% # .data inside select() is deprecated in recent tidyselect versions
filter(.data[[variable]] == value) %>%
count()
}
sample_size_finder("team_size", 4)
The same count can be written in base R as:
sample_size_finder <- function(variable, value){
sum(df[[variable]] == value)
}
We may use across with all_of
library(dplyr)
sample_size_finder <- function(variable, value){
df %>%
select(all_of(variable)) %>%
filter(across(all_of(variable), ~ . == value)) %>%
count()
}
sample_size_finder("team_size", 4)
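To make the difference concrete, here is a small sketch on made-up data. The function name count_matches and the teams data frame are hypothetical, and the data frame is passed in as an argument rather than read from the global environment, which is a design choice of this sketch and not part of the answers above.

```r
library(dplyr)

# Hypothetical variant: the data frame is an explicit argument
count_matches <- function(data, variable, value) {
  data %>%
    filter(.data[[variable]] == value) %>%  # look the column up by its string name
    count()
}

# Made-up data
teams <- data.frame(team_size = c(4, 4, 3, 5, 4))

matched <- count_matches(teams, "team_size", 4)
print(matched)

# What the original function effectively did: compare the string itself with 4
unmatched <- teams %>% filter("team_size" == 4) %>% count()
print(unmatched)
```

The first call counts the three rows where team_size is 4; the second keeps no rows because the string "team_size" never equals 4. As far as I know, recent dplyr versions recommend if_all() or if_any() over across() inside filter(), so the .data form is probably the simpler choice for a single column.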

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

I am trying to use the basic summarise() function but keep getting the same error.
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
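A quick sketch on made-up data shows how n() and n_distinct() differ. The readings data frame and the column names rows and distinct_dates are invented for this illustration.

```r
library(dplyr)

# Made-up readings: ID 1 has a repeated date
readings <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03")
)

per_id <- readings %>%
  group_by(ID) %>%
  summarise(
    rows           = n(),              # rows in the group, takes no argument
    distinct_dates = n_distinct(Date)  # unique Date values in the group
  )
print(per_id)
```

For ID 1 this gives 3 rows but only 2 distinct dates; for ID 2 both numbers are 2.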
Of note, you could have written your code with purrr::map, which has better functions than _apply as you can specify the return type with the suffix. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a single data.frame by row-binding the results, but you could have used map_dfc, which binds them by column instead.

With dplyr and enquo my code works but not when I pass to purrr::map

I want to create a plot for each column in a vector called dates. My data frame contains only these columns and I want to group on it, count the occurrences and then plot it.
Below code works, except for map which I want to use to go across a previously unknown number of columns. I think I'm using map correctly, I've had success with it before. I'm new to using quosures but given that my function call works I'm not sure what is wrong. I've looked at several other posts that appear to be set up this way.
df <- data.frame(
date1 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
date2 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
stringsAsFactors = FALSE
)
dates<-names(df)
library(tidyverse)
dates.count<-function(.x){
group_by<-enquo(.x)
df %>% group_by(!!group_by) %>% summarise(count=n()) %>% ungroup() %>% ggplot() + geom_point(aes(y=count,x=!!group_by))
}
dates.count(date1)
map(dates,~dates.count(.x))
I get this error: Error in grouped_df_impl(data, unname(vars), drop) : Column .x is unknown
When you pass the variable names to map() you are using strings, which indicates you need ensym() instead of enquo().
So your function would look like
dates.count <- function(.x){
group_by = ensym(.x)
df %>%
group_by(!!group_by) %>%
summarise(count=n()) %>%
ungroup() %>%
ggplot() +
geom_point(aes(y=count,x=!!group_by))
}
And you would use the variable names as strings for the argument.
dates.count("date2")
Note that tidyeval doesn't always play nicely with the formula interface of map() (I think I'm remembering that correctly). You can always do an anonymous function instead, but in your case where you want to map the column names to a function with a single argument you can just do
map(dates, dates.count)
Using the formula interface in map() I needed an extra !!:
map(dates, ~dates.count(!!.x))
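If the column names are always strings, another option that avoids quoting and unquoting altogether is to select the column by name. This is only a sketch: count_by is a hypothetical helper, it takes the data frame as an argument, and it stops at the counting step without plotting.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  date1 = c("2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02", "2018-01-02"),
  date2 = c("2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02", "2018-01-02"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: col is a column name given as a string
count_by <- function(data, col) {
  data %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop")
}

# Works with the formula interface because .x is just a string here
counts <- map(names(df), ~ count_by(df, .x))
print(counts[[1]])
```

For the plot itself, ggplot2 accepts the same string through the .data pronoun, for example aes(x = .data[[col]], y = count).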

Loop through aggregation using dplyr

I've tried to find the answer to this probably-obvious question. I have multiple predictor variables that I need to loop through in order to get a summary of another column for each predictor. The data frame will change with every iteration, so I need code that works for multiple different data frames. Here are the places I've looked so far:
R- producing a summary calculation for each column that is dependent on aggregations at a factor level
Multiple data frame handling
Using the mtcars dataset, this is what I've tried:
# mtcars ships with base R (datasets package), so nothing needs installing
library(dplyr)
data <- mtcars
#loop through names
variable <- list(colnames(data))
for(i in variable){
data1 <- data %>%
group_by(i)
summarise('number' = mean(mpg))
}
However, I get the following error:
Error in grouped_df_impl(data, unname(vars), drop) :
Column `i` is unknown
Not sure where to go next.
There are multiple issues in the code:
1) variable is unnecessarily created as a list
2) because of 1, the loop runs once over the single list element (the whole vector of names) instead of once per column name
3) group_by_at can be used in place of group_by for string inputs
4) the pipe (%>%) is missing between group_by and summarise
5) the output should be stored in a list; otherwise it is overwritten on each pass, because every iteration assigns to the same object data1
The code below applies these corrections:
variable <- colnames(data) #is a `vector` now
data1 <- list() # initialize as a `list`
for(i in variable){
data1[[i]] <- data %>%
group_by_at(i) %>% #changed to `group_by_at`
summarise(number = mean(mpg))
}
Alternatively, purrr::map returns the output as a list of tibbles and avoids initializing the list and assigning into it:
purrr::map(variable, ~ data %>%
group_by_at(.x) %>%
summarise(number = mean(mpg)))
If we need to bind the list elements, use bind_rows. Because the first column has a different name in each element, this creates one column per grouping variable and fills the gaps with NA:
purrr::map(variable, ~ data %>%
group_by_at(.x) %>%
summarise(number = mean(mpg))) %>%
purrr::set_names(variable) %>% # namespaced, since only dplyr is attached here
bind_rows(., .id = 'variable')
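One way around the NA-filled columns is to give the grouping column the same name in every element before binding. The sketch below assumes dplyr 1.0 or newer and purrr. The column name level and the choice of three grouping variables are mine, purely for illustration.

```r
library(dplyr)
library(purrr)

# A few grouping columns, given as strings
vars <- c("cyl", "gear", "am")

stacked <- vars %>%
  set_names() %>%               # name each element after its own value
  map(function(v) {
    mtcars %>%
      group_by(level = .data[[v]]) %>%   # same column name in every element
      summarise(number = mean(mpg), .groups = "drop")
  }) %>%
  bind_rows(.id = "variable")   # list names become the "variable" column

print(stacked)
```

Each row now carries the grouping variable, its level and the mean mpg, so the stacked result has three columns regardless of how many variables are looped over. This only stacks cleanly when the grouping columns share a type, as they do in mtcars.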

Why assigning dplyr's n() function makes it unexecutable within summarise and mutate?

Depending on some condition, I have to choose between using dplyr::n and an arbitrary function (say for instance a function that returns 2 whatever argument is given).
If I do the following:
new_n <- dplyr::n
new_n <- ifelse(is.null(k), new_n, my_new_n)
data <- data %>% group_by_(z) %>% mutate_(n = new_n)
If for instance dplyr::n gets assigned to new_n I get the error
Error: This function should not be called directly
while I was expecting it to work normally as it would do if I had written
data <- data %>% group_by_(z) %>% mutate_(n = n())
Why is this happening? Is there a work around? Basically I need to assign a different value to the variable n within the data depending on k but I cannot change the part of code where the mutate is performed due to the project requirements.
EDIT: added simple example.
For instance, if you try to run
if (require("nycflights13")) {
carriers <- group_by(flights, carrier)
summarise(carriers, n())
mutate(carriers, n = n())
filter(carriers, n() < 100)
}
everything works fine, however if you try to run
new_n <- n
summarise(carriers, new_n())
the code won't work and you'll get the error above even though what I did was just assigning n to new_n.
With mutate() you use n() but with mutate_() you use ~n()
So either use
data %>% group_by(z) %>% mutate(n = n())
or
data %>% group_by_(~z) %>% mutate_(n = ~n())
