tabulate using tabyl by grouping variable using group_split and group_map - r

To get a quick frequency table (tabulation) of one column, or of several columns at once, I use the tabyl function like so:
library(janitor)
library(tidyverse)
#tabulate one column at a time
iris %>%
tabyl(Petal.Width)
#tabulate multiple columns at once using map
iris %>%
select(Petal.Width, Petal.Length) %>%
map(tabyl)
I'm trying to replicate these two cases but have the output by a grouping variable, Species in this example. I would like the simplest solution and I would like to try the newer group_split and group_map commands for this.
I have been able to produce similar output as a data frame (although the plain list that tabyl produces is what I want when there is more than one variable):
#works
iris %>%
group_by(Species) %>%
nest() %>%
mutate(out = map(data, ~ tabyl(.x$Petal.Width) %>%
as_tibble)) %>%
select(-data) %>%
unnest(out)
This works, but I would have thought it could be simpler, like my per-column approach. I was thinking of something like this for one column per grouping variable:
#by group for one column
iris %>%
group_by(Species) %>%
group_split() %>%
map(~tabyl(Petal.Width))
For multiple columns I'm not sure I need the select row here? Maybe group_map could simplify it in one line?
#by group for multiple columns
iris %>%
#do i need to select grouping variable and variables of interest?
select(Species, Petal.Width, Petal.Length) %>%
group_by(Species) %>%
group_split() %>%
map(~tabyl()) #could I use group_map and select the columns at once?
Any suggestions please?

iris %>%
  # use split(.$Species) if you need a list with names
  group_split(Species) %>%
  map(~ imap(.x %>% select(Species, Petal.Width, Petal.Length),
             function(x, y) {
               out <- tabyl(x)
               colnames(out)[1] <- y
               out
             }))
If you just need the default column name for the first column, then you can do iris %>% group_split(Species) %>% map(~map(.x, tabyl))
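Since the question asks about group_map specifically, here is a minimal sketch of how it might look, assuming a dplyr version that ships group_map (0.8.0 or later). The object names by_width and by_cols are made up for illustration. Note that group_map returns an unnamed list, and that .x does not contain the grouping column by default.

```r
library(dplyr)
library(purrr)
library(janitor)

# One column per group: group_map() passes each group's rows as .x
by_width <- iris %>%
  group_by(Species) %>%
  group_map(~ tabyl(.x, Petal.Width))

# Several columns per group: select the columns of interest inside the lambda,
# then tabulate each one with map()
by_cols <- iris %>%
  group_by(Species) %>%
  group_map(~ map(select(.x, Petal.Width, Petal.Length), tabyl))
```

If you need to know which list element belongs to which species, the split(.$Species) variant mentioned above keeps the names.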

Related

Using map on specific column in list?

I'm trying to split a dataframe into a list of dataframes and then sort each dataframe by a specific variable using map(). I thought my approach would work, but I'm clearly not passing something to the function correctly, and I'm unsure how to make it work. For instance, using lapply() I could do this:
library(tidyverse)
df = iris
df %>%
group_split(Species) %>%
{lapply(.,function(x) {x %>% arrange(desc(Sepal.Length))})}
Using map(), I've tried this approach but it's not working:
df %>%
group_split(Species) %>%
map(.,arrange(Sepal.Length),desc)
How can I structure this so it works? I only want to apply the map() to one of the columns as in the lapply() example.
df %>%
group_split(Species) %>%
map(~arrange(.data = .x, desc(Sepal.Length)))
or
df %>%
group_split(Species) %>%
map(~.x %>% arrange(desc(Sepal.Length)))
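The attempt in the question fails because arrange(Sepal.Length) is an ordinary call that gets evaluated on its own, with Sepal.Length treated as the data argument, rather than a function that map() can apply to each piece. As a rough sketch, the formula shorthand, an explicit anonymous function, and the lapply() version should all give the same ordering (the names pieces, with_formula, with_function and with_lapply are just for illustration):

```r
library(dplyr)
library(purrr)

pieces <- iris %>% group_split(Species)

# purrr formula shorthand: .x is the current data frame
with_formula <- map(pieces, ~ arrange(.x, desc(Sepal.Length)))

# the same thing with an explicit anonymous function
with_function <- map(pieces, function(d) arrange(d, desc(Sepal.Length)))

# base R equivalent from the question
with_lapply <- lapply(pieces, function(d) arrange(d, desc(Sepal.Length)))
```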

Group by, summarise and return the value back to the dataset in R?

I am trying to create summary statistics without losing column values. For example using the iris dataset, I want to group_by the species and find the summary statistics, such as the sd and mean.
Once I have done this, I want to add the results back to the original dataset. How can I do this? So far I can only do the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
The result has one row per species, with the columns mean.iris and sd.iris.
I then want to add the mean and sd back to the original iris data so that I can compute the z-score of each row relative to its own species.
In other words: group by species, then find the z-score of each individual plant within its species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
library(dplyr)
iris %>%
  group_by(Species) %>%
  # add a *_Z column next to each numeric column;
  # as.numeric() drops the one-column matrix that scale() returns
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Z")) %>%
  ungroup() %>%
  relocate(Species, .after = last_col())
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
data %>% left_join(df, by = "Species") %>%
mutate(Z = (Sepal.Length-mean.iris)/sd.iris)
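For a single variable, the join can probably be skipped by computing the group statistics inside a grouped mutate(). The sketch below does that and then checks that the z-scores really are standardised within each species; iris_z and z_check are illustrative names, not part of the original answers.

```r
library(dplyr)

iris_z <- iris %>%
  group_by(Species) %>%
  mutate(
    mean.iris = mean(Sepal.Length),                # per-species mean, repeated on every row
    sd.iris   = sd(Sepal.Length),                  # per-species standard deviation
    Z         = (Sepal.Length - mean.iris) / sd.iris
  ) %>%
  ungroup()

# Sanity check: within each species, Z should have mean ~0 and sd ~1
z_check <- iris_z %>%
  group_by(Species) %>%
  summarise(z_mean = mean(Z), z_sd = sd(Z))
```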

Can I use mutate() to mimic a value i would join from a summarize() with dplyr?

I feel like there is a more elegant way with dplyr to get the following result, which joins the output of a summarize call back onto the data:
inner_join(iris,
iris %>% group_by(Species) %>% summarize(n = length(Species),
Mean.Sepal.Length = mean(Sepal.Length)),
by = "Species")
I feel there may be a way to use mutate for this, something like:
#iris %>% mutate(???)
No need for the inner_join. You can just do a group_by() followed by a mutate().
iris %>%
group_by(Species) %>%
mutate(n=n(), Mean.Sepal.Length=mean(Sepal.Length))
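One thing to keep in mind is that the result of a grouped mutate() is still grouped, so later steps will keep operating per species unless you call ungroup(). A small sketch comparing the two approaches (joined and mutated are names chosen here for illustration):

```r
library(dplyr)

# the join-based version from the question
joined <- inner_join(
  iris,
  iris %>%
    group_by(Species) %>%
    summarize(n = n(), Mean.Sepal.Length = mean(Sepal.Length)),
  by = "Species"
)

# the grouped mutate() version, ungrouped afterwards
mutated <- iris %>%
  group_by(Species) %>%
  mutate(n = n(), Mean.Sepal.Length = mean(Sepal.Length)) %>%
  ungroup()

# both should carry the same per-row group mean
same_means <- all.equal(joined$Mean.Sepal.Length, mutated$Mean.Sepal.Length)
```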

Split a Dataset into a Nested List of Dataframes and then Spread Using Tidyr and Purrr

library(ggmosaic)
library(tidyverse)
Below is the sample code
happy2<-happy%>%
select(sex,marital,degree,health)%>%
group_by(sex,marital,degree,health)%>%
summarise(Count=n())
The following code splits the dataset into a nested list with tables of male and female (sex variable) for each category of the degree variable.
happy2 %>%
split(.$degree) %>%
lapply(function(x) split(x, x$sex))
This is where I'm now struggling. I would like to reshape, or using Tidyr, spread the "marital" variable, or perhaps this should be split again, so that each category of "marital" is a column header with each column containing the "health" variable and corresponding "Count". The redundant "sex" and "degree" columns can be dropped.
Since I'm working with a list, I've been attempting to use Tidyverse methods, for example, I've been trying to use purrr to drop variables:
happy2 %>% map(~select(.x, -sex))
I'm thinking that I can also spread using purrr, but I'm having trouble making this work.
To help illustrate what I'm looking for, I attached a pic of the possible structure. I didn't include all categories and the counts are not correct since I'm only showing the structure. I suppose the "marital" category could also be a third split variable as well if that's easier? So what I'm hoping for is male and female tables for each category of degree, with marital by health and showing the corresponding count.
Help would be appreciated...
Would the following work? I changed the syntax for split by sex so that I can chain the subsequent commands together:
happy2 %>%
  ungroup() %>%  # drop the grouping left by summarise() so select() can remove sex and degree
  split(.$degree) %>%
  lapply(function(x) x %>% split(.$sex) %>%
    lapply(function(x) x %>% select(-sex, -degree) %>%
      spread(health, Count)))
Edit:
This would give you a separate table for each marital status:
happy2 %>%
ungroup() %>%
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital)))
And if you don't want the first column indicating marital status, the following version drops that:
happy2 %>%
ungroup() %>%
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital) %>%
lapply(function(x) x %>% select(-marital))))
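If the goal is the layout described in the question, with each marital category as a column header, health as the row label and Count as the cell value, then something along these lines might work. This is only a sketch: it uses tidyr's pivot_wider (the newer replacement for spread) and a tiny made-up data frame called toy that mimics the structure of happy2, so the values are not real counts.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical stand-in for happy2: same columns, invented values
toy <- tibble(
  sex     = c("male", "male", "female", "female"),
  degree  = c("college", "college", "college", "college"),
  marital = c("married", "divorced", "married", "married"),
  health  = c("good", "good", "good", "fair"),
  Count   = c(3L, 1L, 2L, 4L)
)

nested <- toy %>%
  split(.$degree) %>%                                  # first level: degree
  map(function(by_degree) {
    by_degree %>%
      split(.$sex) %>%                                 # second level: sex
      map(function(d) {
        d %>%
          select(-sex, -degree) %>%                    # drop the now-redundant columns
          pivot_wider(names_from = marital, values_from = Count)
      })
  })
```

With the real data you would start from happy2 %>% ungroup() instead of toy. Combinations that do not occur show up as NA unless you pass values_fill to pivot_wider.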
What about this:
# cleaned up your code a bit
# removed the select (as it does nothing)
# consistent column names (count is lower case like the rest of the variables)
# added spacing
happy2 <- happy %>%
group_by(sex, marital, degree, health) %>%
summarise(count=n())
happy2 %>%
dplyr::ungroup() %>%
split(list(.$degree, .$sex, .$marital)) %>%
lapply(. %>% select(health, count))
Or do you really want the "marital" status as the table heading for the "health" column, as in your picture?

group_by and global mean within a single dplyr pipe

Is there a way using dplyr to summarise using group_by() then take a global mean, then add that to the same data frame without having to create a second dataframe?
Right now I am doing this like this:
library(dplyr)
speciesiris <- iris %>%
group_by(Species) %>%
summarise(mpw=mean(Petal.Width))
iris %>%
summarise(mpw=mean(Petal.Width)) %>%
mutate(Species="All Species") %>%
bind_rows(speciesiris)
One potential pitfall here is that I want not the mean of means but rather a global mean or at least the option of both. So is there a better way of doing this hopefully all in one pipe?
One pipe to do everything (but not recommended):
iris %>%
  summarise(mpw = mean(Petal.Width)) %>%   # global mean
  mutate(Species = "All Species") %>%
  bind_rows(
    iris %>%
      group_by(Species) %>%                # mean by Species
      summarise(mpw = mean(Petal.Width))
  )
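Another option, sketched here under the assumption that carrying the group sizes along is acceptable, is to summarise once per species and then derive the global mean from those rows as a count-weighted mean. That gives the true global mean rather than the mean of means. The names res and n are illustrative.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(mpw = mean(Petal.Width), n = n()) %>%
  mutate(Species = as.character(Species)) %>%
  # "." is the per-species summary; weighting by n recovers the global mean
  bind_rows(
    summarise(., Species = "All Species", mpw = weighted.mean(mpw, n), n = sum(n))
  )
```

Replacing weighted.mean(mpw, n) with mean(mpw) would give the mean of means instead, if that is ever the quantity you want.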
