I would like to compare the standard deviation for a variable, to the standard deviations of the variable once grouped by a factor.
This is the overall sd()
library(dplyr)
iris %>% summarise(
  Overall.SD = sd(Sepal.Length)
)
However, I can't access it once I have used group_by
iris %>%
  group_by(Species) %>%
  summarise(
    Species.SD = sd(Sepal.Length),
    Overall.SD = sd(iris$Sepal.Length),
    Species.SD < Overall.SD
  )
Is there a way to make dplyr look back to the overall dataset?
I would compute Overall.SD before grouping the data, using mutate so that the rest of the data is kept as it was.
iris %>%
  mutate(Overall.SD = sd(Sepal.Length)) %>% # mutate, not summarise, so every row is kept
  group_by(Species) %>%
  summarise(Species.SD = sd(Sepal.Length),
            Overall.SD = Overall.SD[1], # constant within each group, so take the first value
            Species.SD < Overall.SD)    # if you drop the line above, compare against Overall.SD[1] here
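As an alternative sketch, the overall SD can be computed once outside the pipe and then referenced after the grouped summary. The names overall_sd, sd_comparison and Below.Overall are made up for this illustration.

```r
library(dplyr)

# Overall SD computed once, before any grouping (overall_sd is a hypothetical name)
overall_sd <- sd(iris$Sepal.Length)

sd_comparison <- iris %>%
  group_by(Species) %>%
  summarise(Species.SD = sd(Sepal.Length)) %>%
  # summarise() has already collapsed to one row per species,
  # so mutate() here just attaches the constant and the comparison
  mutate(Overall.SD = overall_sd,
         Below.Overall = Species.SD < Overall.SD)

sd_comparison
```

This keeps the grouped step limited to group-level statistics and gives the comparison column a proper name.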
Related
I am trying to create summary statistics without losing column values. For example, using the iris dataset, I want to group_by the species and compute summary statistics such as the sd and mean.
Once I have done this, I want to add the results back to the original dataset. How can I do this? I can only do the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
This gives a summary table with one row per species.
I then want to add the mean and sd back to the original iris data, so that I can compute the z-score of each individual row relative to its own species.
To explain further: essentially, create groups by species and then find the z-score of each individual plant based on its species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
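library(dplyr)
iris %>%
  group_by(Species) %>%
  # .names keeps the original columns and adds a *_Z column next to them;
  # as.numeric() drops the one-column matrix that scale() returns
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Z")) %>%
  ungroup() %>%
  # joining the scaled rows back by Species alone would match every row with
  # every other row of its species, so the Z columns are added in place instead
  relocate(Species, .after = last_col())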
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
data %>% left_join(df, by = "Species") %>%
mutate(Z = (Sepal.Length-mean.iris)/sd.iris)
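The same result can be sketched without the intermediate summary table and join, by letting a grouped mutate compute the per-species statistics in place. The name iris_z is only for illustration.

```r
library(dplyr)

iris_z <- iris %>%
  group_by(Species) %>%
  # inside a grouped mutate(), mean() and sd() are evaluated per species
  # and recycled to every row of that species
  mutate(mean.iris = mean(Sepal.Length),
         sd.iris   = sd(Sepal.Length),
         Z         = (Sepal.Length - mean.iris) / sd.iris) %>%
  ungroup()

head(iris_z)
```

The output has the same number of rows as iris, with the mean, sd and z-score attached to each plant.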
I have a CSV file, mtcars, that contains car models with different variables.
I know that to find the mean I just do mean(mtcars$mpg), and to find the variance, var(mtcars$mpg). Instead of writing this a number of times, how would I display the mean and variance of every variable at once? The first column in the dataset contains strings, so how would I disregard that one column when calculating the mean and variance? Thanks.
In the tidyverse, we can reshape to 'long' format, then group by 'name' and get the mean and variance as summarised output in two columns
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
rownames_to_column('model') %>%
pivot_longer(cols = -model) %>%
group_by(name) %>%
summarise(Mean = mean(value), Var = var(value))
Or another option is summarise_if
mtcars %>%
rownames_to_column('model') %>%
summarise_if(is.numeric, list(Mean = mean, Var = var)) %>%
pivot_longer(cols = everything())
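summarise_if is superseded in current dplyr in favour of across(). A rough equivalent is sketched below; the name stats_wide is made up for this example.

```r
library(dplyr)
library(tibble)

stats_wide <- mtcars %>%
  rownames_to_column("model") %>%
  # where(is.numeric) skips the character 'model' column;
  # the named list yields columns such as mpg_Mean and mpg_Var
  summarise(across(where(is.numeric), list(Mean = mean, Var = var)))

stats_wide
```

The result is a single row with two columns per numeric variable, which can be reshaped with pivot_longer as above.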
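Or with colMeans and matrixStats::colVars (the [-1] drops the first, character column of the CSV; omit it for the built-in mtcars, where the model names are row names rather than a column)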
colMeans(mtcars[-1])
matrixStats::colVars(as.matrix(mtcars[-1]))
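If the position of the string column is not known in advance, a base R sketch can select the numeric columns by type instead of by index. The data frame cars below is a hypothetical stand-in for the CSV, built from the built-in mtcars, and stats is an illustrative name.

```r
# Hypothetical stand-in for the CSV: model names as an ordinary first column
cars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Keep only the numeric columns, wherever they are
numeric_cols <- cars[sapply(cars, is.numeric)]

# One row per variable, with its mean and variance
stats <- data.frame(Mean = sapply(numeric_cols, mean),
                    Var  = sapply(numeric_cols, var))
stats
```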
Good morning User.
Try also using
Rfast::colVars(x)
My experiments say it is at least 2 times faster. A second alternative would be
Rfast2::colmeansvars(x)
To get a quick frequency table (tabulation) of one column, or of multiple columns at once, I use the tabyl function like so:
library(janitor)
library(tidyverse)
#tabulate one column at a time
iris %>%
tabyl(Petal.Width)
#tabulate multiple columns at once using map
iris %>%
select(Petal.Width, Petal.Length) %>%
map(tabyl)
I'm trying to replicate these two cases but have the output by a grouping variable, Species in this example. I would like the simplest solution and I would like to try the newer group_split and group_map commands for this.
I have been able to produce a similar type output in a dataframe format (although a simple list that tabyl produces is what I want for the case of more than one variable):
#works
iris %>%
group_by(Species) %>%
nest() %>%
mutate(out = map(data, ~ tabyl(.x$Petal.Width) %>%
as_tibble)) %>%
select(-data) %>%
unnest(out)
This works, but I would have thought it could be simpler, like my column approach above. I was thinking of something like this for one column per grouping variable:
#by group for one column
iris %>%
group_by(Species) %>%
group_split() %>%
map(~tabyl(Petal.Width))
For multiple columns I'm not sure I need the select row here? Maybe group_map could simplify it in one line?
#by group for multiple columns
iris %>%
#do i need to select grouping variable and variables of interest?
select(Species, Petal.Width, Petal.Length) %>%
group_by(Species) %>%
group_split() %>%
map(~tabyl()) #could I use group_map and select the columns at once?
Any suggestions please?
iris %>%
  # use split(.$Species) if you need a list with names
  group_split(Species) %>%
  map(~ imap(.x %>% select(Species, Petal.Width, Petal.Length),
             function(x, y) {
               out <- tabyl(x)
               # label the first column with the variable name
               colnames(out)[1] <- y
               out
             }))
If you just need the default column name for the first column, then you can do iris %>% group_split(Species) %>% map(~map(.x, tabyl))
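Since the question asks about group_map, here is a minimal sketch of that variant. It relies on group_map passing each group's data without the grouping column by default; the name tabs is made up for this example.

```r
library(dplyr)
library(purrr)
library(janitor)

tabs <- iris %>%
  select(Species, Petal.Width, Petal.Length) %>%
  group_by(Species) %>%
  # .x is one species' rows without the Species column,
  # so the inner map() tabulates each remaining column
  group_map(~ map(.x, tabyl))

# One list element per species, each holding one tabyl per selected column
length(tabs)
names(tabs[[1]])
```

The outer list is unnamed and follows the group order, so use split(.$Species) as in the answer above if you need names.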
If I'm working with a dataset and I want to group the data (i.e. by country), compute a summary statistic (mean()) and then ungroup() the data.frame to have a dataset with the original dimensions (country-year) and a new column that lists the mean for each country (repeated over n years), how would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
Which creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
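If the summarised table is needed for its own sake, a sketch of another route is to join it back onto the original rows. This assumes the gapminder data frame comes from the gapminder package; country_means and with_mean are illustrative names.

```r
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

# One row per country
country_means <- gapminder %>%
  group_by(country) %>%
  summarise(mn = mean(pop))

# Back to the original country-year dimensions, with mn repeated per country
with_mean <- gapminder %>%
  left_join(country_means, by = "country")

dim(with_mean)
```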
Actually, ungroup() is not needed in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
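The column values are identical. The differences are that the latter runs a bit slower and returns an ungrouped result, whereas the former stays grouped by country for any later operations.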
Is there a way using dplyr to summarise using group_by() then take a global mean, then add that to the same data frame without having to create a second dataframe?
Right now I am doing this like this:
library(dplyr)
speciesiris <- iris %>%
group_by(Species) %>%
summarise(mpw=mean(Petal.Width))
iris %>%
summarise(mpw=mean(Petal.Width)) %>%
mutate(Species="All Species") %>%
bind_rows(speciesiris)
One potential pitfall here is that I want not the mean of means but rather a global mean or at least the option of both. So is there a better way of doing this hopefully all in one pipe?
Everything in one pipe (but not recommended):
iris %>%
  summarise(mpw = mean(Petal.Width)) %>% # Global mean
  mutate(Species = "All Species") %>%
  bind_rows(
    iris %>%
      group_by(Species) %>% # Mean by Species
      summarise(mpw = mean(Petal.Width))
  )
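Another single-pipe sketch stacks a relabelled copy of the data under the original and summarises once, so the "All Species" row is a true global mean rather than a mean of means. The name species_and_global is made up for this example.

```r
library(dplyr)

species_and_global <- iris %>%
  # append a second copy of every row, labelled "All Species"
  # (bind_rows() combines the factor and character Species into character)
  bind_rows(mutate(., Species = "All Species")) %>%
  group_by(Species) %>%
  summarise(mpw = mean(Petal.Width))

species_and_global
```

The trade-off is that the data is temporarily doubled in size, which is fine for iris but worth keeping in mind for large tables.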