Loop through aggregation using dplyr - r

I've tried to find the answer to this probably-obvious question. I have multiple predictor variables that I need to loop through in order to get a summary of another column for each predictor. The data frame will change with every iteration, so I need code that works for multiple different data frames. Here are the places I've looked so far:
R- producing a summary calculation for each column that is dependent on aggregations at a factor level
Multiple data frame handling
Using the mtcars dataset, this is what I've tried:
#get mtcars data from graphics package
install.packages("graphics")
library(graphics)
data <- mtcars
#loop through names
variable <- list(colnames(data))
for(i in variable){
data1 <- data %>%
group_by(i)
summarise('number' = mean(mpg))
}
However, I get the following error:
Error in grouped_df_impl(data, unname(vars), drop) :
Column `i` is unknown
Not sure where to go next.

There are multiple issues in the code:
1) 'variable' is unnecessarily created as a list
2) as a consequence of 1, the loop runs once over the whole vector of names instead of once per column name
3) group_by expects bare column names; group_by_at can be used in its place for string inputs
4) the chain operator (%>%) is missing between group_by and summarise
5) the output should be stored in a list, or else it will be overwritten on every iteration because we keep assigning to the same object 'data1'
The code below makes these corrections:
library(dplyr)
variable <- colnames(data) #is a `vector` now
data1 <- list() # initialize as a `list`
for(i in variable){
data1[[i]] <- data %>%
group_by_at(i) %>% #changed to `group_by_at`
summarise(number = mean(mpg))
}
Alternatively, purrr::map does this in tidyverse syntax: it returns the output as a list of tibbles and avoids initializing a list and assigning into it
purrr::map(variable, ~ data %>%
group_by_at(.x) %>%
summarise(number = mean(mpg)))
If we need to bind the list elements, use bind_rows. Note that because the grouping column has a different name in each element, this creates one column per grouping variable and fills the missing values with NA
purrr::map(variable, ~ data %>%
group_by_at(.x) %>%
summarise(number = mean(mpg))) %>%
set_names(variable) %>%
bind_rows(., .id = 'variable')
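One way to avoid the NA-filled columns is to give the grouping column the same name in every summary before binding. This is only a minimal sketch, assuming dplyr >= 1.0.0 (for `.groups` and `.before`) and that stacking the values of all predictors in a single column is acceptable. The names `level` and `summaries` are made up for this illustration.

```r
library(dplyr)
library(purrr)

data <- mtcars
# every column except the one being summarised
variable <- setdiff(colnames(data), "mpg")

summaries <- map(variable, function(v) {
  data %>%
    group_by(level = .data[[v]]) %>%   # same column name for every predictor
    summarise(number = mean(mpg), .groups = "drop") %>%
    mutate(variable = v, .before = 1)  # record which predictor was used
}) %>%
  bind_rows()
```

All mtcars columns are numeric, so they stack cleanly. With mixed column types, `level` would probably need converting with as.character() before binding.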

Related

Developing Functions to Make New Dataframes in R

I am trying to develop a function that will take data, see if it matches a value in a category (e.g., 'Accident'), and if so, develop a new dataframe using the following code.
cat.df <- function(i) {
sdb.i <- sdb %>%
filter(Category == i) %>%
group_by(Year) %>%
summarise(count = n()}
The name of the dataframe should be sdb.i, where i is the name of the category (e.g., 'Accident'). Unfortunately, I cannot get it to work. I'm notoriously bad with functions and would love some help.
It's not entirely clear what you are after so I am making a guess.
First of all, your function cat.df misses a closing bracket so it would not run.
I think it is good practice to pass all objects as parameters to a function. In my example I use the iris dataset so I pass this explicitly to the function.
You cannot change the name of a data frame in the way you describe. I offer two alternatives: if the number of categories is limited, you can just create a separately named object for each one; if you have many categories, it is best to combine all result objects into a list.
library(dplyr)
data(iris)
cat.df <- function(data, i) {
data <- data %>%
filter(Species== i) %>%
group_by(Petal.Width) %>%
summarise(count = n())
return(data)
}
result.setosa <- cat.df(iris, "setosa") # unique name
Species <- sort(unique(iris$Species))
results_list <- lapply(Species, function(x) cat.df(iris, x)) # combine all df's into a list
names(results_list) <- Species # name the list elements
You can then get the list elements as e.g. results_list$setosa or results_list[[1]].
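If a named list is all that is needed, a similar result can probably be had without writing a function. The following is a sketch under the same iris assumption as above: it counts all categories in one pass and then splits the result by category. The object name `counts_by_species` is made up for this illustration.

```r
library(dplyr)

# count rows per Species and Petal.Width in a single pass
counts_by_species <- iris %>%
  count(Species, Petal.Width, name = "count")

# split() returns a list named after the levels of Species
results_list <- split(counts_by_species, counts_by_species$Species)
```

Unlike the function-based version, each list element here still carries its Species column.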

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

Trying to perform the basic Summarise() function but getting the same error again and again!
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
*Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.*
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
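To make the difference between n() and n_distinct() concrete, here is a small sketch on made-up data. The `toy` data frame is hypothetical and only mimics the ID and Date columns from the question.

```r
library(dplyr)

# hypothetical data: ID 1 has a repeated date, ID 2 does not
toy <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-01", "2020-01-03")
)

counts <- toy %>%
  group_by(ID) %>%
  summarise(rows  = n(),              # number of rows in the group
            dates = n_distinct(Date), # number of unique Date values
            .groups = "drop")
```

For ID 1, `rows` is 3 while `dates` is 2, because one date appears twice.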
Of note, you could have written your code with purrr::map, whose variants let you specify the return type with a suffix, unlike the base *apply functions. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a data.frame by binding rows; map_dfc would instead return a data.frame by binding columns.

Comparing multiple variables in more than two groups with t.test

I tried to do a t-test comparing values between time1/2/3.. and threshold.
here is my data frame:
time.df1<-data.frame("condition" =c("A","B","C","A","C","B"),
"time1" = c(1,3,2,6,2,3) ,
"time2" = c(1,1,2,8,2,9) ,
"time3" = c(-2,12,4,1,0,6),
"time4" = c(-8,3,2,1,9,6),
"threshold" = c(-2,3,8,1,9,-3))
and I tried to compare each two values by:
time.df1%>%
select_if(is.numeric) %>%
purrr::map_df(~ broom::tidy(t.test(. ~ threshold)))
However, I got this error message
Error in eval(predvars, data, env) : object 'threshold' not found
So, I tried another way (maybe it is wrong)
time.df2<-time.df1%>%gather(TF,value,time1:time4)
time.df2%>% group_by(condition) %>% do(tidy(t.test(value~TF, data=.)))
Sadly, I got this error, even when I limited the condition to only two levels (A, B):
Error in t.test.formula(value ~ TF, data = .) : grouping factor must have exactly 2 levels
I wish to loop a t-test over each time column against the threshold column per condition, then use broom::tidy to get the results in tidy format. My approaches apparently aren't working; any advice to improve my code is much appreciated.
An alternative route would be to define a function with the required options for t.test() up front, create a data frame for each pair of variables (i.e. each combination of 'time*' and 'threshold'), nest them into a list column, and then use map() with the relevant functions from 'broom' to simplify the outputs.
library(tidyverse)
library(broom)
ttestfn <- function(data, ...){
# amend this function to include required options for t.test
res = t.test(data$resp, data$threshold)
return(res)
}
df2 <-
time.df1 %>%
gather(time, "resp", - threshold, -condition) %>%
group_by(time) %>%
nest() %>%
mutate(ttests = map(data, ttestfn),
glances = map(ttests, glance))
# df2 has data frames, t-test objects and glance summaries
# as separate list columns
Now it's easy to query this object to extract what you want
df2 %>%
select(time, glances) %>% # keep only what we want; .drop is deprecated since tidyr 1.0.0
unnest(glances)
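On recent package versions the same idea can be written without list columns. The sketch below assumes tidyr >= 1.0.0 (for pivot_longer) and dplyr >= 0.8.1 (for group_modify). Like the code above, it compares each time column with threshold and ignores 'condition'. The name `ttest_by_time` is made up for this illustration.

```r
library(dplyr)
library(tidyr)
library(broom)

# data frame from the question
time.df1 <- data.frame("condition" = c("A", "B", "C", "A", "C", "B"),
                       "time1" = c(1, 3, 2, 6, 2, 3),
                       "time2" = c(1, 1, 2, 8, 2, 9),
                       "time3" = c(-2, 12, 4, 1, 0, 6),
                       "time4" = c(-8, 3, 2, 1, 9, 6),
                       "threshold" = c(-2, 3, 8, 1, 9, -3))

ttest_by_time <- time.df1 %>%
  # one row per observation and time column; threshold is repeated alongside
  pivot_longer(starts_with("time"), names_to = "time", values_to = "resp") %>%
  group_by(time) %>%
  # two-sample (Welch) t-test of the responses against threshold, per time column
  group_modify(~ tidy(t.test(.x$resp, .x$threshold))) %>%
  ungroup()
```

The result has one row per time column, with the usual broom columns such as estimate, statistic and p.value.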
However, it's unclear to me what you want to do with 'condition', so I'm wondering if it is more straightforward to reframe the question in terms of a GLM (as camille suggested in the comments: ANOVA is part of the GLM family).
Reshape the data, define 'threshold' as the reference level of the 'time' factor and the default 'treatment' contrasts used by R will compare each time to 'threshold':
time.df2 <-
time.df1 %>%
gather(key = "time", value = "resp", -condition) %>%
mutate(time = fct_relevel(time, "threshold")) # define 'threshold' as baseline
fit.aov <- aov(resp ~ condition * time, data = time.df2)
summary(fit.aov)
summary.lm(fit.aov) # coefficients and p-values
Of course this assumes that all subjects are independent (i.e. there are no repeated measures). If not, then you'll need to move on to more complicated procedures. Anyway, moving to appropriate GLMs for the study design should help minimise the pitfalls of doing multiple t-tests on the same data set.
We could remove the threshold from the selected columns and then reintroduce it as the second sample passed to t.test for each time column
library(tidyverse)
time.df1 %>%
select_if(is.numeric) %>%
select(-threshold) %>%
map_df(~ broom::tidy(t.test(.x, time.df1$threshold)), .id = "time")

Passing column names as both variables and columns in a single dplyr function in R

I am writing a code in which a column name (e.g. "Category") is supplied by the user and assigned to a variable biz.area. For example...
biz.area <- "Category"
The original data frame is saved as risk.data. User also supplies the range of columns to analyze by providing column names for variables first.column and last.column.
Text in these columns will be broken up into bigrams for further text analysis including tf_idf.
My code for this analysis is given below.
x.bigrams <- risk.data %>%
gather(fields, alldata, first.column:last.column) %>%
unnest_tokens(bigrams,alldata,token = "ngrams", n=2) %>%
count(bigrams, biz.area, sort=TRUE) %>%
bind_tf_idf(bigrams, biz.area, n) %>%
arrange(desc(tf_idf))
However, I get the following error.
Error in grouped_df_impl(data, unname(vars), drop) : Column
x.biz.area is unknown
This is because count() expects a column name text string instead of variable biz.area. If I use count_() instead, I get the following error.
Error in compat_lazy_dots(vars, caller_env()) : object 'bigrams'
not found
This is because count_() expects to find only variables and bigrams is not a variable.
How can I pass both a constant and a variable to count() or count_()?
Thanks for your suggestion!
It looks to me like you need quosures, so that you can pass column names as variables, rather than as strings or values. Since you're already using dplyr, you can use dplyr's non-standard evaluation techniques.
Try something along these lines:
library(tidyverse)
library(tidytext) # provides unnest_tokens() and bind_tf_idf()
analyze_risk <- function(area, firstcol, lastcol) {
# turn your arguments into quosures
areaq <- enquo(area)
firstcolq <- enquo(firstcol)
lastcolq <- enquo(lastcol)
# run your analysis on the risk data
risk.data %>%
gather(fields, alldata, !!firstcolq:!!lastcolq) %>%
unnest_tokens(bigrams,alldata,token = "ngrams", n=2) %>%
count(bigrams, !!areaq, sort=TRUE) %>%
bind_tf_idf(bigrams, !!areaq, n) %>%
arrange(desc(tf_idf))
}
In this case, your users would pass bare column names into the function like this:
myresults <- analyze_risk(Category, Name_of_Firstcol, Name_of_Lastcol)
If you want users to pass in strings, you'll need to convert them to symbols with rlang::sym() instead of capturing them with enquo().
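For the string case, a minimal sketch could look like the following. It uses mtcars rather than the risk data so that it runs on its own, and the helper name `count_by` is made up for this illustration.

```r
library(dplyr)
library(rlang)

# hypothetical helper: `area` is a column name supplied as a string
count_by <- function(data, area) {
  area_sym <- sym(area)                  # string -> symbol
  data %>%
    count(!!area_sym, sort = TRUE)       # unquote the symbol inside count()
}

biz.area <- "cyl"
res <- count_by(mtcars, biz.area)
```

The same `!!area_sym` pattern should carry over to the count() and bind_tf_idf() calls in the function above.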

Using dplyr within a function, Grouping Error with function arguments

Below I have a working example of what I would like the function to do, and then script for the function, noting where the Error occurs.
The error message is:
Error: index out of bounds
Which I know usually means R can’t find the variable that’s being called.
Interestingly, in my function example below, if I only group by my subgroup_name (which is passed to the function and becomes a column in the newly created dataframe) the function will successfully regroup that variable, but I also want to group by a newly created column (from the melt) called variable.
Similar code used to work for me using regroup(), but that has been deprecated. I am trying to use group_by_() but to no avail.
I have read many other posts and answers and experimented several hours today but still not successful.
# Initialize example dataset
database <- ggplot2::diamonds
database$diamond <- row.names(diamonds) # needed for melting
subgroup_name <- "cut" # can replace with "color" or "clarity"
subgroup_column <- 2 # can replace with 3 for color, 4 for clarity
# This works, although it would be preferable not to need separate variables for subgroup_name and subgroup_column number
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by(cut, variable) %>%
summarise(value = round(mean(value, na.rm = TRUE),2))
# This does not work, I am expecting the same output as above
subgroup_analysis <- function(database,...){
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by_(subgroup_name, variable) %>% # problem appears to be with finding "variable"
summarise(value = round(mean(value, na.rm = TRUE),2))
print(df)
}
subgroup_analysis(database, subgroup_column, subgroup_name)
From the NSE vignette:
If you also want to output variables to vary, you need to pass a list
of quoted objects to the .dots argument:
Here, variable should be quoted:
subgroup_analysis <- function(database,...){
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by_(subgroup_name, quote(variable)) %>%
summarise(value = round(mean(value, na.rm = TRUE),2))
print(df)
}
subgroup_analysis(database, subgroup_column, subgroup_name)
As mentioned by @RichardScriven, if you plan to assign the result to a new variable, you may want to remove the print call at the end and just write df, or not assign df inside the function at all.
Otherwise the result prints even when you do x <- subgroup_analysis(...)
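group_by_() has since been deprecated as well. As a rough sketch of how the same function might look today, assuming dplyr >= 1.0.0 (for across and all_of) and tidyr >= 1.0.0 (for pivot_longer in place of melt), and with the subgroup passed as a single string argument:

```r
library(dplyr)
library(tidyr)

subgroup_analysis <- function(database, subgroup_name) {
  database %>%
    select(all_of(subgroup_name), x, y, z) %>%
    # long format: one row per diamond and dimension
    pivot_longer(c(x, y, z), names_to = "variable", values_to = "value") %>%
    # group by the user-supplied column plus the new "variable" column
    group_by(across(all_of(c(subgroup_name, "variable")))) %>%
    summarise(value = round(mean(value, na.rm = TRUE), 2), .groups = "drop")
}

res <- subgroup_analysis(ggplot2::diamonds, "cut") # or "color", "clarity"
```

This version needs neither the column index nor the helper 'diamond' id column, and it returns the result instead of printing it.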
