I have a CSV file, mtcars, that contains car models along with different variables.
I know that to find the mean I can use mean(mtcars$mpg), and to find the variance, var(mtcars$mpg). Instead of writing this out for every variable, how can I display the mean and variance of every variable at once? The first column in the dataset contains strings, so how do I exclude that column when calculating the mean and variance? Thanks.
In the tidyverse, we can reshape the data to 'long' format, group by 'name', and then get the mean and variance as a summarised output in two columns
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
rownames_to_column('model') %>%
pivot_longer(cols = -model) %>%
group_by(name) %>%
summarise(Mean = mean(value), Var = var(value))
Or another option is summarise_if
mtcars %>%
rownames_to_column('model') %>%
summarise_if(is.numeric, list(Mean = mean, Var = var)) %>%
pivot_longer(cols = everything())
Or with colMeans and matrixStats::colVars, dropping the first (string) column by position
colMeans(mtcars[-1])
matrixStats::colVars(as.matrix(mtcars[-1]))
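The same idea can be sketched in base R without relying on column position. This is only an illustration: `cars`, `num_cols` and `stats` are hypothetical names, and the `model` column is added here to mimic the CSV described in the question.

```r
# Mimic the CSV: built-in mtcars with the model names as a first, character column
cars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# Keep only the numeric columns, wherever the string column happens to sit
num_cols <- Filter(is.numeric, cars)

# One row per variable, with its mean and variance side by side
stats <- data.frame(
  variable = names(num_cols),
  Mean = vapply(num_cols, mean, numeric(1)),
  Var = vapply(num_cols, var, numeric(1)),
  row.names = NULL
)
print(stats)
```

Selecting by type rather than by position means the string column is skipped even if it is not the first one.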
Good morning User.
Try also using
Rfast::colVars(x)
In my experiments it was at least twice as fast. A second alternative would be
Rfast2::colmeansvars(x)
I am trying to create summary statistics without losing column values. For example using the iris dataset, I want to group_by the species and find the summary statistics, such as the sd and mean.
Once I have done this, I want to add the result back to the original dataset. How can I do this? I can only do the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
I then want to add the resulting mean and sd to the original iris data, so that I can compute the z score of each individual row within its species.
To explain further: essentially, group by species and then find the z score of each individual plant relative to its own species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
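library(dplyr)
iris %>%
  group_by(Species) %>%
  # scale() returns a one-column matrix, so convert it back to a plain vector;
  # .names adds the Z-score columns next to the originals, so no join is needed
  # (joining back on Species alone would pair every row with every row of its species)
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Z")) %>%
  ungroup() %>%
  relocate(Species, .after = last_col())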
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
data %>% left_join(df, by = "Species") %>%
mutate(Z = (Sepal.Length-mean.iris)/sd.iris)
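As a possible shortcut, the join can be skipped by computing the group statistics inside a grouped mutate(). This is a minimal sketch reusing the column names from the answer above; `iris_z` is a hypothetical name.

```r
library(dplyr)

iris_z <- iris %>%
  group_by(Species) %>%
  mutate(
    mean.iris = mean(Sepal.Length),           # per-species mean, repeated on every row
    sd.iris = sd(Sepal.Length),               # per-species sd, repeated on every row
    Z = (Sepal.Length - mean.iris) / sd.iris  # z score within the species
  ) %>%
  ungroup()

head(iris_z)
```

Because mutate() keeps every row, the result has the same 150 rows as iris, plus the three new columns.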
This is a long-standing question, but now I really need to solve this puzzle. I use dplyr all the time and I think it is great for summarising variables. However, I have had only partial success displaying a pivot table. dplyr always reports a single row with all the results, which is annoying. I have to copy and paste the results into Excel to organize everything...
I found some code that almost works.
Its result should instead be laid out like a pivot table, because I always report my results in that style.
Use this code to get the same results:
library(tidyverse)
set.seed(123)
ds <- data.frame(group=c("american", "canadian"),
iq=rnorm(n=50,mean=100,sd=15),
income=rnorm(n=50, mean=1500, sd=300),
math=rnorm(n=50, mean=5, sd=2))
ds %>%
group_by(group) %>%
summarise_at(vars(iq, income, math), list(mean = mean, sd = sd)) %>% # funs() is deprecated; a named list gives the same iq_mean, iq_sd, ... columns
t %>%
as.data.frame %>%
rownames_to_column %>%
separate(rowname, into = c("feature", "fun"), sep = "_")
To clarify, I've tried this code, but spread works with only one summary (mean or sd, etc). Some people use gather(), but it's complicated to work with group_by and gather().
Thanks for any help.
Instead of transposing (t) and changing the class types, gather the output of the summarise step into 'long' format, split the key with separate, merge the group and statistic names with unite, and then spread the result back to wide format
library(tidyverse)
ds %>%
group_by(group) %>%
summarise_at(vars(iq, income, math), list(mean = mean, sd = sd)) %>% # funs() is deprecated; a named list gives the same column names
gather(key, val, iq_mean:math_sd) %>%
separate(key, into = c('key1', 'key2')) %>%
unite(group, group, key2) %>%
spread(group, val)
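With more recent versions of tidyr, the same reshaping can be sketched with pivot_longer() and pivot_wider(), which supersede gather() and spread(). This assumes dplyr >= 1.0 and tidyr >= 1.0; the names `feature`, `stat` and `wide` are illustrative.

```r
library(dplyr)
library(tidyr)

set.seed(123)
ds <- data.frame(group = c("american", "canadian"),
                 iq = rnorm(n = 50, mean = 100, sd = 15),
                 income = rnorm(n = 50, mean = 1500, sd = 300),
                 math = rnorm(n = 50, mean = 5, sd = 2))

wide <- ds %>%
  group_by(group) %>%
  # produces iq_mean, iq_sd, income_mean, ... one row per group
  summarise(across(c(iq, income, math), list(mean = mean, sd = sd))) %>%
  # split each column name at "_" into the feature and the statistic
  pivot_longer(-group, names_to = c("feature", "stat"), names_sep = "_") %>%
  # one row per feature, one column per group/statistic pair
  pivot_wider(names_from = c(group, stat), values_from = value)

wide
```

The result has one row per feature (iq, income, math) and one column for each combination of group and statistic.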
If I'm working with a dataset and I want to group the data (e.g. by country), compute a summary statistic (mean()) and then ungroup() the data.frame so that I get a dataset with the original dimensions (country-year) and a new column that lists the mean for each country (repeated over n years), how would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
This creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
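The difference in row counts can be checked with a small sketch. It uses the built-in iris data as a stand-in for gapminder (an assumption made only so that the example runs without extra packages); `by_summarise` and `by_mutate` are hypothetical names.

```r
library(dplyr)

# summarise() collapses each group to a single row
by_summarise <- iris %>%
  group_by(Species) %>%
  summarise(mn = mean(Petal.Length))

# mutate() keeps every row and repeats the group mean within each group
by_mutate <- iris %>%
  group_by(Species) %>%
  mutate(mn = mean(Petal.Length)) %>%
  ungroup()

nrow(by_summarise)  # 3, one row per species
nrow(by_mutate)     # 150, same as nrow(iris)
```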
Actually, ungroup() is not needed to compute the new column in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
The differences are that the first result is still grouped by country, so any later verbs keep operating per group, and that the latter runs a bit slower.
Is there a way using dplyr to summarise using group_by() then take a global mean, then add that to the same data frame without having to create a second dataframe?
Right now I am doing this like this:
library(dplyr)
speciesiris <- iris %>%
group_by(Species) %>%
summarise(mpw=mean(Petal.Width))
iris %>%
summarise(mpw=mean(Petal.Width)) %>%
mutate(Species="All Species") %>%
bind_rows(speciesiris)
One potential pitfall here is that I want not the mean of means but rather a global mean or at least the option of both. So is there a better way of doing this hopefully all in one pipe?
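One pipe to do everything (but not recommended). Note that each %>% must end a line rather than start the next one, otherwise R treats the first line as a complete expression:
iris %>%
  summarise(mpw = mean(Petal.Width)) %>%   # global mean
  mutate(Species = "All Species") %>%
  bind_rows(
    iris %>%
      group_by(Species) %>%                # mean by Species
      summarise(mpw = mean(Petal.Width))
  )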
I would like to compare the standard deviation for a variable, to the standard deviations of the variable once grouped by a factor.
This is the overall sd()
library(dplyr)
iris %>% summarise(
  Overall.SD = sd(Sepal.Length)
)
However, I can't access it once I have used group_by
iris %>%
  group_by(Species) %>%
  summarise(
    Species.SD = sd(Sepal.Length),
    Overall.SD = sd(iris$Sepal.Length),
    Species.SD < Overall.SD
  )
Is there a way to make dplyr look back to the overall dataset?
I would compute the Overall.SD before grouping the data using mutate so that the other data is kept as it was.
iris %>%
mutate(Overall.SD = sd(Sepal.Length)) %>% # you can use mutate instead of summarise here
group_by(Species) %>%
summarise(Species.SD = sd(Sepal.Length),
Overall.SD = Overall.SD[1], # You could also remove this line if you just want the comparison and don't need to display the actual Overall.SD
Species.SD < Overall.SD)
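Another possible approach, sketched here under the assumption that a single overall value is all that is needed, is to compute the overall SD once as a plain number before grouping and attach it after summarising. The names `overall_sd`, `sd_compare` and `Below.Overall` are hypothetical.

```r
library(dplyr)

# Overall standard deviation, computed once on the ungrouped data
overall_sd <- sd(iris$Sepal.Length)

sd_compare <- iris %>%
  group_by(Species) %>%
  summarise(Species.SD = sd(Sepal.Length)) %>%
  # summarise() returns one row per species, so the scalar is simply recycled
  mutate(Overall.SD = overall_sd,
         Below.Overall = Species.SD < Overall.SD)

sd_compare
```

Naming the comparison column makes it easier to refer to in later steps than the unnamed expression.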