Summarizing across multiple variables and assigning to new variable names - r

I am using the 'across' function to get summary statistics for a series of variables (for example, all variables that start with 'f_'). Since across stores the summarised results back under the original variable names, having multiple across calls with different summarising functions overwrites the results (as shown below).
I can think of a work-around by renaming the variables after summarise() and cbind the resulting individual tables. However, that seems cumbersome, and I am wondering if there is a tidy (pun intended) way to store the series of summarised results to new variable names.
var_stats = data %>%
summarise(
across(starts_with('f_'), max, na.rm = T),
across(starts_with('f_'), min, na.rm = T)
)

With across you can use multiple summary functions at the same time and control the names of the variables. Does the following example help you?
mtcars %>%
summarise(across(starts_with("c"), list(custom_min = min, custom_max = max), .names = "{.col}_{.fn}"))
cyl_custom_min cyl_custom_max carb_custom_min carb_custom_max
1 4 8 1 8
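Applied to the columns from the question, the same pattern might look like the sketch below. The data frame and its column names (f_a, f_b, other) are made up for illustration; only the 'f_' prefix and na.rm = TRUE come from the question. Lambdas are used so that na.rm is passed explicitly to each function.

```r
library(dplyr)

# Hypothetical data: two columns starting with "f_" plus one that should be ignored
data <- tibble(
  f_a = c(1, NA, 3),
  f_b = c(10, 20, NA),
  other = c("x", "y", "z")
)

var_stats <- data %>%
  summarise(
    across(
      starts_with("f_"),
      # A named list gives each function a label that .names can use as {.fn}
      list(
        max = ~ max(.x, na.rm = TRUE),
        min = ~ min(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  )

# Resulting columns: f_a_max, f_a_min, f_b_max, f_b_min
print(var_stats)
```

Because both functions sit in one across call, nothing is overwritten and no renaming or cbind step is needed.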

Related

Using summarize_all() for multiple data types

I am attempting to use the new dplyr scoped summarize() verbs to search through my data table and create a summary dataframe, grouped by treatment arm, for each of a set of outcomes. Each summary contains statistics for both numeric predictor variables (2.5th, 50th, 97.5th percentiles) and categorical predictor variables (counts). I appear to have been successful with these computations (thanks to this massively helpful post - r summarize_if with multiple conditions).
However, my dataframes are not visually friendly: the quantile() and table() functions insert lists into each dataframe cell, so I am unable to scroll through the dataframe in the RStudio viewer to browse the results. Does anybody have any suggestions on how to reorganize or view this dataframe in RStudio in order to see the full contents of these lists more clearly?
Thank you kindly!
outcome.dfs <- list()
dt <- data.table(short.vars.full) # convert to data table to allow for subsetting with column name stored in variable
for (ae.outcome in outcomes.list) {
outcome.dfs[[ae.outcome]] <- dt[get(ae.outcome) == ae.outcome, ] %>%
select(-USUBJID) %>%
group_by(XDARM) %>%
summarise_all(~ if(is.numeric(.))
list(format(round(quantile(., probs = c(.025,0.50,0.975), na.rm = TRUE), 2), nsmall=2))
else if (is.factor(.))
list(table(.)))
}
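One possible way to avoid list cells for the numeric variables is to give each percentile its own column, so every cell holds a single number. The sketch below is not from the original thread; the toy data and the age column are hypothetical, and only XDARM and the three percentiles come from the question. Counts for factor columns would need separate handling (for example, with count()).

```r
library(dplyr)

# Hypothetical stand-in for the real data: one treatment-arm column, one numeric column
toy <- tibble(
  XDARM = c("A", "A", "A", "B", "B", "B"),
  age = c(30, 40, 50, 20, 60, 70)
)

q_summary <- toy %>%
  group_by(XDARM) %>%
  summarise(
    across(
      where(is.numeric),
      # names = FALSE returns a plain number instead of a vector named "50%" etc.
      list(
        p025 = ~ quantile(.x, 0.025, na.rm = TRUE, names = FALSE),
        p50  = ~ quantile(.x, 0.5, na.rm = TRUE, names = FALSE),
        p975 = ~ quantile(.x, 0.975, na.rm = TRUE, names = FALSE)
      ),
      .names = "{.col}_{.fn}"
    )
  )

# One row per arm, with columns age_p025, age_p50 and age_p975
print(q_summary)
```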

Why selected variables in dplyr package are not in output df in R?

I have a df with 30 columns and 2000 rows.
From the df, I selected several variables by name and calculated the mean of Value for each block of 3 rows, grouped by the group and type variables.
But there are only 3 variables (group, type, res) in the output data.
How do I keep the selected variables in the output df? Is there anything wrong with this code?
output <- data %>%
select(group, type, A, B, C, Value) %>%
group_by(group = gl(n()/3, 3), type) %>%
summarise(res = mean(Value))
Thanks in advance!
As others have pointed out, summarize only returns grouping variables and those variables specified in summarize. This is by design – summarize returns a single row for each group, so there must be a single value for each variable.
The function used in summarize must return a single value (so that's covered), while using group_by with variables ensures that these variables are the same within the group. But for the other variables, there could be several different values within the group: which would summarize choose? Instead of making a guess, it drops those variables.
There are several options to get around this, which one is best depends on your data and what you want to do with it:
Add these variables as grouping variables. This is the preferred method, but obviously it only works if the structure of the data allows it. For example, in a hypothetical dataset, if you want to group by city but want to preserve the state variable, using group_by(city, state) will divide into groups the same way as group_by(city) since city and state are linked (for example, "Boston" will always be with "MA").
Define them in summarize and choose only the first value to be the value for that group, as in @thc's answer. Note that you will lose any other values of those variables and it's not always clear which value will be kept and which will be lost.
Use mutate instead - this will keep the original number of rows rather than collapsing to 1 per group, but will ensure that you don't lose any data.
Join them as a comma (or other) separated string by adding: A = paste(A, collapse = ', ') to the summarize for each variable you want to keep. This will preserve the information, at the expense of making it difficult to work with in any future steps.
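As a rough illustration of the mutate and string-joining options, here is a small sketch on made-up data (the values and the two-group layout are assumptions, not the asker's data). Note that it is the collapse argument of paste() that joins a vector into a single string.

```r
library(dplyr)

# Hypothetical data: two groups of three rows each
data <- tibble(
  type = c("x", "x", "x", "y", "y", "y"),
  A = c("a1", "a2", "a3", "a4", "a5", "a6"),
  Value = c(1, 2, 3, 4, 5, 6)
)

# mutate keeps all six rows and repeats the group mean on each of them
kept_rows <- data %>%
  group_by(type) %>%
  mutate(res = mean(Value)) %>%
  ungroup()

# summarise collapses to one row per group; A becomes one comma-separated string
collapsed <- data %>%
  group_by(type) %>%
  summarise(res = mean(Value), A = paste(A, collapse = ", "))

print(kept_rows)
print(collapsed)
```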
You can include them in summarise instead, e.g.:
output <- data %>%
select(group, type, A, B, C, Value) %>%
group_by(group = gl(n()/3, 3), type) %>%
summarise(res = mean(Value), A=A[1], B=B[1], C=C[1] )
I believe this is the fastest approach under dplyr if you have a very large data.frame.

Subsets in RStudio (basic questions)?

I am terrible with R and I am trying to figure out subsets. I have read the data file into RStudio via:
> Vehicle_Data <-read.table("VehicleData.txt.txt", header=T,sep="\t",quote="")
> attach(Vehicle_Data)
I'm confused about subsets. One of the columns in my data is Type which includes a variety of vehicle types. I need to narrow down Car within the type column so I can calculate the mean MPG value of the cars only.
Here's what I have tried:
> TypeCar<-subset(Vehicle_Data, Type=="Car")
I think this worked to subset the data, but I'm not sure. Also, how do I calculate the mean MPG from the subset?
The code for subsetting appears to be fine. To calculate the mean, you need to use the mean() function in this way:
mean_mpg <- mean(TypeCar$MPG, na.rm = TRUE)
Setting na.rm = TRUE makes mean() drop any NA values present in your data before computing the result.
You can also use the tidyverse to perform data transformations such as subsetting (filtering):
Vehicle_Data %>%
filter(Type=="Car")
You can also calculate the mean MPG per Type like so:
Vehicle_Data %>%
group_by(Type) %>%
summarise(mean.MPG=mean(MPG, na.rm = TRUE))
If you'd like to calculate the mean of an existing subset of data (i.e. TypeCar), you can just run mean(TypeCar$MPG, na.rm = TRUE)

dplyr default grouping option

I'm a bit confused about the default grouping option in dplyr. I always assumed that without an explicit group_by, any operations are row-wise. However, I have a data frame
data.raw = data.frame(a=c(1,2,3,4),b=c(1,2,3,4))
and when I want to calculate the mean of each ROW
data = data.raw %>%
mutate(data.average = mean(c(a,b),na.rm = T))
it returns the mean value of all the elements in a and b. It seems by doing c() all data are grouped in one group and mean performed on that. I wonder how it is possible to know how functions used within mutate etc. introduce grouping.
ps. I'm not looking for a solution for this specific problem, but asking for more generally how function calls affect grouping in dplyr.
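The behaviour described above can be reproduced with a short sketch. It rests on one point that is an explanation added here rather than something stated in the thread: without grouping, mutate evaluates its expressions on whole columns, so c(a, b) is one 8-element vector; rowwise() makes each row its own group. The result names (whole_column, per_row, vectorised) are illustrative.

```r
library(dplyr)

data.raw <- data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 3, 4))

# Ungrouped: a and b are whole columns, so c(a, b) has 8 elements
# and the single overall mean is recycled to every row
whole_column <- data.raw %>%
  mutate(data.average = mean(c(a, b), na.rm = TRUE))

# rowwise(): each row is its own group, so c(a, b) has 2 elements
per_row <- data.raw %>%
  rowwise() %>%
  mutate(data.average = mean(c(a, b), na.rm = TRUE)) %>%
  ungroup()

# A vectorised function such as rowMeans() gives per-row results without grouping
vectorised <- data.raw %>%
  mutate(data.average = rowMeans(cbind(a, b), na.rm = TRUE))

print(whole_column)
print(per_row)
```

The function called inside mutate does not introduce grouping by itself; what matters is whether it returns one value per input element (like rowMeans or +) or one value for the whole vector (like mean).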

R code for creating variable for accuracy/percentages

I am having some trouble with R code for a variable I am trying to add to my data frame. Essentially, participants responded to two classes of stimuli (A and B) and their responses could either be correct or incorrect. The important variables (columns) in my data set are: ID (participants' ID), stimtype (A or B), and response (correct or incorrect).
What I want to do is create, for each participant, two "accuracy score" variables (columns): one that lists the accuracy percentage for stimulus type A, and one for stimulus type B.
I can get those percentages fairly easily using table functions, but am having difficulty creating those variables in my dataset. Any advice very much appreciated, thank you!!!
If you have a data.frame mydata with character stimtypes and a TRUE/FALSE response, you can use
library(dplyr)
result <- mydata %>%
group_by(ID, stimtype) %>%
summarize(pct_response = 100 * mean(response, na.rm = T))
This interprets the logical responses (T/F) as 1/0 and taking the mean will give you the percentage for a given ID and stimtype. However, the result will have two rows per ID, with one for each stimtype. If you want the results in two columns, you can use tidyr::spread
library(tidyr)
result %>%
spread(key = stimtype, value = pct_response)
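For a complete picture, here is a sketch of the same idea on made-up data, using tidyr::pivot_wider, the newer function that supersedes spread, and then joining the two accuracy columns back onto the original rows. The toy values, the acc_ prefix and the object names are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two participants, two trials per stimulus type
mydata <- tibble(
  ID = c(1, 1, 1, 1, 2, 2, 2, 2),
  stimtype = c("A", "A", "B", "B", "A", "A", "B", "B"),
  response = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# One row per participant, one accuracy column per stimulus type
accuracy_wide <- mydata %>%
  group_by(ID, stimtype) %>%
  summarise(pct_response = 100 * mean(response, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(
    names_from = stimtype,
    values_from = pct_response,
    names_prefix = "acc_"
  )

# Attach acc_A and acc_B to every row of the original data
mydata_with_accuracy <- mydata %>%
  left_join(accuracy_wide, by = "ID")

print(accuracy_wide)
```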

Resources