Is there a way using dplyr to summarise using group_by() then take a global mean, then add that to the same data frame without having to create a second dataframe?
Right now I am doing this like this:
library(dplyr)
speciesiris <- iris %>%
group_by(Species) %>%
summarise(mpw=mean(Petal.Width))
iris %>%
summarise(mpw=mean(Petal.Width)) %>%
mutate(Species="All Species") %>%
bind_rows(speciesiris)
One potential pitfall here is that I want not the mean of means but rather a global mean or at least the option of both. So is there a better way of doing this hopefully all in one pipe?
One line to do everything (but not recommended):
iris %>%
  summarise(mpw = mean(Petal.Width)) %>%   # global mean
  mutate(Species = "All Species") %>%
  bind_rows(
    iris %>%
      group_by(Species) %>%                # mean by species
      summarise(mpw = mean(Petal.Width))
  )
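Another way to keep everything in one pipe, sketched here under the assumption of a reasonably recent dplyr: stack a copy of the data relabelled as "All Species" under the original rows, then group and summarise once. Because the extra group contains every raw row, its value is a true global mean rather than a mean of means. The name with_total is only illustrative.

```r
library(dplyr)

with_total <- iris %>%
  # convert the factor first so the new label binds cleanly
  mutate(Species = as.character(Species)) %>%
  # append a second copy of all rows, labelled as one extra group
  bind_rows(mutate(., Species = "All Species")) %>%
  group_by(Species) %>%
  summarise(mpw = mean(Petal.Width))

with_total
```

The trade-off is that the data is temporarily doubled in size, which is fine for small tables but may matter for large ones.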
I have a CSV file, mtcars, that contains car models with different variables.
I know that to find the mean, I just do mean(mtcars$mpg), and to find the variance, var(mtcars$mpg). Instead of writing this a number of times, how would I display the means and variances of all variables in one line? The first column in the dataset contains strings, so how would I disregard that one column when calculating the mean and variance? Thanks.
In the tidyverse, we can reshape the data to 'long' format, then group by 'name' and summarise to get the mean and variance in two columns:
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
rownames_to_column('model') %>%
pivot_longer(cols = -model) %>%
group_by(name) %>%
summarise(Mean = mean(value), Var = var(value))
Or another option is summarise_if
mtcars %>%
rownames_to_column('model') %>%
summarise_if(is.numeric, list(Mean = mean, Var = var)) %>%
pivot_longer(cols = everything())
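In dplyr 1.0.0 and later, the scoped verbs such as summarise_if are superseded by across(). A minimal sketch of the same idea, assuming the built-in mtcars (where every column is numeric) and using the illustrative name stats, with one row per variable and the mean and variance side by side:

```r
library(dplyr)
library(tidyr)

stats <- mtcars %>%
  # across() names the results mpg_Mean, mpg_Var, cyl_Mean, ...
  summarise(across(where(is.numeric), list(Mean = mean, Var = var))) %>%
  # split each name at "_" into the variable and the statistic
  pivot_longer(everything(),
               names_to = c("variable", ".value"),
               names_sep = "_")

stats
```

Splitting on "_" relies on the column names themselves containing no underscore, which holds for mtcars but is an assumption for other data.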
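Or with colMeans and matrixStats::colVars. Here [-1] drops the first (string) column of the data as read from the CSV; the built-in mtcars stores the model names as row names, so nothing needs dropping there: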
colMeans(mtcars[-1])
matrixStats::colVars(as.matrix(mtcars[-1]))
You can also try
Rfast::colVars(x)
where x is a numeric matrix. In my experiments it is at least 2 times faster. A second alternative would be
Rfast2::colmeansvars(x)
To get a quick frequency table (tabulation) of one column, or of multiple columns at once, I use the tabyl function like so:
library(janitor)
library(tidyverse)
#tabulate one column at a time
iris %>%
tabyl(Petal.Width)
#tabulate multiple columns at once using map
iris %>%
select(Petal.Width, Petal.Length) %>%
map(tabyl)
I'm trying to replicate these two cases but have the output by a grouping variable, Species in this example. I would like the simplest solution and I would like to try the newer group_split and group_map commands for this.
I have been able to produce a similar type output in a dataframe format (although a simple list that tabyl produces is what I want for the case of more than one variable):
#works
iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(out = map(data, ~ tabyl(.x$Petal.Width) %>%
                     as_tibble())) %>%
  select(-data) %>%
  unnest(out)
This works, but I would have thought it could be simpler, like my column-wise approach. I was thinking of something like this for one column per grouping variable:
#by group for one column
iris %>%
group_by(Species) %>%
group_split() %>%
map(~tabyl(Petal.Width))
For multiple columns, I'm not sure whether I need the select step here. Maybe group_map could simplify it to one line?
#by group for multiple columns
iris %>%
#do i need to select grouping variable and variables of interest?
select(Species, Petal.Width, Petal.Length) %>%
group_by(Species) %>%
group_split() %>%
map(~tabyl()) #could I use group_map and select the columns at once?
Any suggestions please?
iris %>%
  # use split(.$Species) if you need a list with names
  group_split(Species) %>%
  map(~ imap(.x %>% select(Species, Petal.Width, Petal.Length),
             function(x, y) {
               out <- tabyl(x)
               colnames(out)[1] <- y
               out
             }))
If you just need the default column name for the first column, then you can do iris %>% group_split(Species) %>% map(~map(.x, tabyl))
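For the single-column case, a shorter sketch is possible by passing the column name to tabyl's data-frame method, which also keeps the column name in the output. This assumes janitor, dplyr and purrr are installed; the names by_species and by_species_gm are only illustrative.

```r
library(dplyr)
library(purrr)
library(janitor)

# Named list: one frequency table of Petal.Width per species
by_species <- iris %>%
  split(.$Species) %>%
  map(~ tabyl(.x, Petal.Width))

# Same idea with group_map(); the result is an unnamed list,
# and .x holds the group's rows without the grouping column
by_species_gm <- iris %>%
  group_by(Species) %>%
  group_map(~ tabyl(.x, Petal.Width))

by_species$setosa
```

split() is used for the first version because it names the list elements after the species, which group_split() and group_map() do not.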
I feel like there is a more elegant way with dplyr to recreate the following result of joining the results of a summarize call with mutate.
inner_join(iris,
iris %>% group_by(Species) %>% summarize(n = length(Species),
Mean.Sepal.Length = mean(Sepal.Length)),
by = "Species")
I feel there may be a way to use mutate for this...
#iris %>% mutate(???)
No need for the inner_join. You can just use group_by() with mutate().
iris %>%
group_by(Species) %>%
mutate(n=n(), Mean.Sepal.Length=mean(Sepal.Length))
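One detail worth keeping in mind: the result of group_by() followed by mutate() is still grouped by Species. A minimal sketch that adds ungroup() at the end, so that later steps are not silently computed per species (the name res is only illustrative):

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  # n() and mean() are evaluated within each species,
  # and the values are repeated on every row of that species
  mutate(n = n(), Mean.Sepal.Length = mean(Sepal.Length)) %>%
  ungroup()

head(res)
```

All 150 rows are kept, which is what the inner_join version produced as well.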
I am trying to summarise one variable after splitting the data with group_by from the dplyr package. The following code works fine, but I cannot replace summarise_each with summarise, even though only one column needs to be calculated. Why is that?
iris %>% group_by(Species) %>% select(one_of('Sepal.Length')) %>%
summarise_each(funs(mean(.)))
Otherwise I get output like "S3:lazy".
summarize and summarize_each work quite differently. summarize is in fact simpler — just specify the expression directly:
iris %>%
group_by(Species) %>%
select(Sepal.Length) %>%
summarize(Sepal.Length = mean(Sepal.Length))
You can choose any name for the output column; it doesn't need to be the same as the input.
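For reference, summarise_each() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, a sketch of the closest equivalent uses across(), which keeps the "apply this function to these columns" style (the name means is only illustrative):

```r
library(dplyr)

means <- iris %>%
  group_by(Species) %>%
  # across() applies mean to each selected column and keeps its name
  summarise(across(Sepal.Length, mean))

means
```

More columns can be added to the selection, for example across(c(Sepal.Length, Sepal.Width), mean), without changing the rest of the pipe.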
I would like to compare the standard deviation for a variable, to the standard deviations of the variable once grouped by a factor.
This is the overall sd()
require(dplyr)
iris %>% summarise(
  Overall.SD = sd(Sepal.Length)
)
However, I can't access it once I have used group_by
iris %>%
  group_by(Species) %>%
  summarise(
    Species.SD = sd(Sepal.Length),
    Overall.SD = sd(iris$Sepal.Length),
    Species.SD < Overall.SD
  )
Is there a way to make dplyr look back to the overall dataset?
I would compute Overall.SD with mutate before grouping the data, so that the rest of the data is kept as it was.
iris %>%
  # you can use mutate instead of summarise here
  mutate(Overall.SD = sd(Sepal.Length)) %>%
  group_by(Species) %>%
  summarise(Species.SD = sd(Sepal.Length),
            # You could also remove this line if you just want the comparison
            # and don't need to display the actual Overall.SD
            Overall.SD = Overall.SD[1],
            Species.SD < Overall.SD)
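A variation on the same idea, sketched here as one possible alternative rather than the answer's method: compute the overall value once outside the pipe, then attach it after summarising. The names overall_sd, sd_compare and Below.Overall are only illustrative.

```r
library(dplyr)

# Overall standard deviation, computed once on the ungrouped data
overall_sd <- sd(iris$Sepal.Length)

sd_compare <- iris %>%
  group_by(Species) %>%
  summarise(Species.SD = sd(Sepal.Length)) %>%
  # summarise() has dropped the grouping, so this mutate is row-wise over 3 rows
  mutate(Overall.SD = overall_sd,
         Below.Overall = Species.SD < Overall.SD)

sd_compare
```

Naming the comparison column explicitly avoids ending up with a column literally called "Species.SD < Overall.SD".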