Summarizing categorical variables by multiple groups in R

I have 1609 observations from 93 unique publications. I am doing a qualitative analysis of my data, and I have the following variables: soil texture (coarse soil, sandy soil, sandy loam, sandy clay loam), experimental design (field, greenhouse, and lab), and publication title (93 unique publication titles). I want to count unique publication titles for each soil texture for each experimental design.
I could only get counts of unique publication titles for each experimental design or for each soil texture using the following code:
df4_2 <- metadata2 %>%
group_by(publication_title) %>%
group_by(experiment_cond) %>%
summarise(count = n_distinct(publication_title))%>%
drop_na()
View(df4_2)
# OR
df4_3 <- metadata2 %>%
group_by(publication_title) %>%
group_by(soil_texture) %>%
summarise(count = n_distinct(publication_title))%>%
drop_na()
View(df4_3)
Does anyone know how I can summarize unique publication titles for each soil texture within each experimental design?
I tried the following code but it did not work:
df4_4 <- metadata2 %>%
group_by(publication_title) %>%
group_by(soil_texture) %>%
group_by(experiment_cond) %>%
summarise(count = n_distinct(publication_title))%>%
drop_na()
View(df4_4)

By default, each group_by() overrides/drops any previous groupings. To group by multiple variables, include them in the same group_by() call:
library(dplyr)
library(tidyr)  # for drop_na()

metadata2 %>%
  group_by(soil_texture, experiment_cond) %>%
  summarise(count = n_distinct(publication_title)) %>%
  drop_na()
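If you also want to silence the "summarise() has grouped output" message and return an ungrouped result, summarise() has a .groups argument (assuming dplyr >= 1.0.0, where it was introduced):
metadata2 %>%
  group_by(soil_texture, experiment_cond) %>%
  summarise(count = n_distinct(publication_title), .groups = "drop") %>%
  drop_na()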
If you did need to add groups in separate calls (not necessary here, but sometimes useful), use the .add argument:
metadata2 %>%
  group_by(soil_texture) %>%
  group_by(experiment_cond, .add = TRUE) %>%
  summarise(count = n_distinct(publication_title)) %>%
  drop_na()
Finally, note you shouldn’t group by publication_title; if you do, the n_distinct() per group would always be 1.
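As a quick illustration of that last point (a sketch using the same column names), grouping by publication_title itself gives a count of 1 in every row:
metadata2 %>%
  group_by(publication_title) %>%
  summarise(count = n_distinct(publication_title))
# `count` is 1 for every publication, since each group contains exactly one title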

Related

R impute/'approximate' delayed covid time series points using non-delayed total

I'm using a grouped time series dataset where there are often NAs for the most recent dates (the length of the NA run varies fairly randomly by group). A total across all the series is provided in the data; for recent dates this total is greater than the sum of the individual series, presumably because the total itself is imputed/forecast.
So, my question is: how can the missing values be estimated, assuming that the series total is correct?
My general approach is to calculate what proportion of the total each series makes up and somehow extrapolate that to the future missing dates. As the graphs show, I'm not so successful. There are complications caused by differing last dates of reported data, and I'm not sure whether working with cumulative values makes a difference.
R code for simulated data and failed solution below:
Code:
## simulate simple grouped time series
library(tidyverse)
set.seed(555)
## time series length, e.g. 10
len=10
## group names
grps=letters[1:5]
df=bind_rows(lapply(grps,function(z){
tibble(rn=seq(1:len)) %>%
mutate(real_val=runif(len,min=0,max=1)) %>%
mutate(grp=z) %>%
select(grp,rn,real_val) %>%
## replace final data points with NA, length varying by group
## this simulates delays in data reporting across groups
mutate(reported_val=ifelse(rn>len-match(z,letters)+1,NA,real_val)) %>%
# mutate(reported_val=ifelse(rn>len-runif(1,0,round(max_trim)),NA,real_val)) %>%
## make cumulative to assist viz a bit. may affect imputation method.
group_by(grp) %>%
arrange(rn) %>%
mutate(real_val=cumsum(real_val),reported_val=cumsum(reported_val)) %>%
ungroup()
}))
df
## attempt to impute/estimate missing real values, given total for each rn (date)
## general solution is to use (adjusted?) proportions of the total.
df2=df %>%
group_by(rn) %>%
mutate(sum_real_val=sum(real_val),sum_reported_val=sum(reported_val,na.rm=T)) %>%
ungroup() %>%
## total value missing for each date
mutate(val_missing=sum_real_val-sum_reported_val) %>%
## what proportion of the continent that country takes up
mutate(prop=reported_val/sum_real_val) %>%
# mutate(prop=reported_val/sum_reported_val) %>%
## fill missing proportions to end terminus from most recent value
group_by(grp) %>%
arrange(rn) %>%
fill(prop,.direction='down') %>%
ungroup() %>%
## get estimated proportion of those missing
mutate(temp1=ifelse(is.na(reported_val),prop,NA)) %>%
## re-calculate proportion as only of those missing.
group_by(grp) %>%
mutate(prop_temp1=temp1/sum(temp1,na.rm=T)) %>%
ungroup() %>%
## if value missing, then multiply total missing by expected proportion of missing
mutate(result_val=ifelse(is.na(reported_val),val_missing*prop_temp1,reported_val)) %>%
ungroup()
## time series plot
## stacked by group, black line shows real_val total.
ggplot(df2 %>%
select(grp,rn,real_val,reported_val,sum_real_val,result_val) %>%
gather(val_grp,val,-c(grp,rn,sum_real_val)) %>%
ungroup(),aes(rn,val))+
# geom_line(aes(colour=loc))+
geom_area(aes(fill=grp))+
geom_line(aes(y=sum_real_val))+
facet_wrap("val_grp")
## but alas the result total doesn't agree with the reported total
## nb the imputed values for each group don't necessarily have to agree with the real values.
It's hard to know what conclusions to draw from the incomplete dates. One simple assumption could be to take the last share and extrapolate that into the future:
## reference date: the latest rn on which all groups share the same reporting
## status (here, the last fully reported date); compute each group's share of
## the total on that date
default_share <- df %>%
  count(rn, grp, wt = !is.na(reported_val)) %>%
  count(rn, n) %>%
  filter(nn == max(nn)) %>%
  slice_max(rn) %>%
  left_join(df) %>%
  mutate(share = real_val / sum(real_val)) %>%
  select(rn, grp, share)

df %>%
  group_by(rn) %>%
  ## for dates after the reference date, distribute the known total according
  ## to the reference shares. Note: default_share$share[rn == rn] returns all
  ## of the shares, so this relies on the rows within each rn group being in
  ## the same grp order as default_share.
  mutate(result_val = if_else(rn > default_share$rn[[1]],
                              sum(real_val) * default_share$share[rn == rn],
                              real_val)) %>%
  ungroup() %>%
  select(-reported_val) %>%
  pivot_longer(-c(grp:rn)) %>%
  ggplot(aes(rn, value)) +
  geom_area(aes(fill = grp)) +
  facet_wrap(~name)
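As a sanity check (a sketch; it assumes you assign the output of the mutate()/ungroup() step above to an object, here hypothetically called df3, before piping into ggplot), the extrapolated group values should add back up to the real total on every date:
# df3: hypothetical name for the data frame produced by the mutate()/ungroup() step
df3 %>%
  group_by(rn) %>%
  summarise(total_result = sum(result_val),
            total_real   = sum(real_val))
# the two totals should match for every rn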

How to add a total distance column in 'flights' dataset? DPLYR, Group_by, Ungroup

I am working with 'flights' dataset from 'nycflights13' package in R.
I want to add a column which adds the total distance covered by each 'carrier' in 2013. I got the total distance covered by each carrier and have stored the value in a new variable.
We have 16 carriers, so how do I combine a 16-row summary with a data frame that has many more rows?
carrier <- flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance))
How can I add the sum of these carrier distances as a new column in the flights dataset?
Thank you for your time and effort here.
PS: I tried a for loop, but it doesn't work. I am new to programming.
Use mutate instead:
flights %>%
  group_by(carrier) %>%
  mutate(TotalDistance = sum(distance)) %>%
  ungroup() -> carrier
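If you are on dplyr 1.1.0 or later, the same per-group mutate() can also be written without an explicit group_by()/ungroup() pair, using the .by argument (a minimal sketch):
flights %>%
  mutate(TotalDistance = sum(distance), .by = carrier) -> carrier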
We can also use left_join.
library(nycflights13)
data("flights")
library(dplyr)
flights %>%
  left_join(flights %>%
              group_by(carrier) %>%
              select(distance) %>%
              summarize(TotalDistance = sum(distance)) %>%
              arrange(desc(TotalDistance)),
            by = 'carrier') -> carrier
This will work even if you don't use arrange at the end.
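Either way, a quick sanity check on the carrier object created above is to look at the distinct carrier/TotalDistance pairs, which should give 16 rows, one per carrier:
carrier %>%
  distinct(carrier, TotalDistance)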

Group by, summarise and return the value back to the dataset in R?

I am trying to create summary statistics without losing column values. For example using the iris dataset, I want to group_by the species and find the summary statistics, such as the sd and mean.
Once I have done this, I want to add the result back to the original dataset. How can I do this? I can only manage the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
The result is a three-row summary table, one row per species.
I want to then add the result of mean and sd to the original iris data, this is so that I can get the z score for each individual row if it belongs to that species.
For further explanation: essentially, create groups by species and then find the z score of each individual plant based on its species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
library(dplyr)
library(stringr)
iris %>%
  group_by(Species) %>%
  mutate(across(where(is.numeric), scale)) %>%
  rename_with(~ str_c(., "_Z"), where(is.numeric)) %>%
  ungroup() %>%
  left_join(iris, ., by = "Species") %>%
  relocate(Species, .after = last_col())
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
  group_by(Species) %>%
  summarise(mean.iris = mean(Sepal.Length), sd.iris = sd(Sepal.Length))

data %>%
  left_join(df, by = "Species") %>%
  mutate(Z = (Sepal.Length - mean.iris) / sd.iris)
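For completeness, the same z score can also be computed in a single grouped mutate(), without building and joining a separate summary table (a sketch using the same iris columns):
iris %>%
  group_by(Species) %>%
  mutate(Z = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length)) %>%
  ungroup()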

Explain ungroup() in dplyr

If I'm working with a dataset and I want to group the data (i.e. by country), compute a summary statistic (mean()) and then ungroup() the data.frame to have a dataset with the original dimensions (country-year) and a new column that lists the mean for each country (repeated over n years), how would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
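For example (a sketch with the same gapminder columns), a further mutate() behaves differently depending on whether the grouping is still in place:
gapminder %>%
  group_by(country) %>%
  mutate(mn = pop / mean(pop)) %>%    # mean(pop) is the per-country mean
  ungroup() %>%
  mutate(overall = pop / mean(pop))   # mean(pop) is now the overall mean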
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
Which creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
Actually, ungroup() is not needed in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
The values are identical; the practical difference is that the first result is still grouped by country (and ungrouping adds a small amount of extra work), which only matters if you run further grouped operations on it afterwards.
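You can see that remaining difference by inspecting the grouping of each result (a small sketch):
gapminder %>%
  group_by(country) %>%
  mutate(mn = pop / mean(pop)) %>%
  group_vars()
# "country" -- still grouped

gapminder %>%
  group_by(country) %>%
  mutate(mn = pop / mean(pop)) %>%
  ungroup() %>%
  group_vars()
# character(0) -- grouping dropped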

Split a Dataset into a Nested List of Dataframes and then Spread Using Tidyr and Purrr

library(ggmosaic)
library(tidyverse)
Below is the sample code
happy2<-happy%>%
select(sex,marital,degree,health)%>%
group_by(sex,marital,degree,health)%>%
summarise(Count=n())
The following code splits the dataset into a nested list with tables of male and female (sex variable) for each category of the degree variable.
happy2 %>%
split(.$degree) %>%
lapply(function(x) split(x, x$sex))
This is where I'm now struggling. I would like to reshape, or using Tidyr, spread the "marital" variable, or perhaps this should be split again, so that each category of "marital" is a column header with each column containing the "health" variable and corresponding "Count". The redundant "sex" and "degree" columns can be dropped.
Since I'm working with a list, I've been attempting to use Tidyverse methods, for example, I've been trying to use purrr to drop variables:
happy2 %>% map(~ select(.x, -sex))
I'm thinking that I can also spread using purrr, but I'm having trouble making this work.
To help illustrate what I'm looking for, I attached a pic of the possible structure. I didn't include all categories and the counts are not correct since I'm only showing the structure. I suppose the "marital" category could also be a third split variable as well if that's easier? So what I'm hoping for is male and female tables for each category of degree, with marital by health and showing the corresponding count.
Help would be appreciated...
Would the following work? I changed the syntax for split by sex so that I can chain the subsequent commands together:
happy2 %>%
  split(.$degree) %>%
  lapply(function(x) x %>% split(.$sex) %>%
           lapply(function(x) x %>% select(-sex, -degree) %>%
                    spread(health, Count)))
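If you are on a current version of tidyr, spread() has been superseded by pivot_wider(); the same reshaping could be written as follows (a sketch that keeps the structure above, ungrouping first like the later versions):
happy2 %>%
  ungroup() %>%
  split(.$degree) %>%
  lapply(function(x) x %>% split(.$sex) %>%
           lapply(function(x) x %>% select(-sex, -degree) %>%
                    pivot_wider(names_from = health, values_from = Count)))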
Edit:
This would give you a separate table for each marital status:
happy2 %>%
  ungroup() %>%
  split(.$degree) %>%
  lapply(function(x) x %>% split(.$sex) %>%
           lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital)))
And if you don't want the first column indicating marital status, the following version drops that:
happy2 %>%
  ungroup() %>%
  split(.$degree) %>%
  lapply(function(x) x %>% split(.$sex) %>%
           lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital) %>%
                    lapply(function(x) x %>% select(-marital))))
What about this:
# cleaned up your code a bit
# removed the select (as it does nothing)
# consistent column names (count is lower case like the rest of the variables)
# added spacing
happy2 <- happy %>%
  group_by(sex, marital, degree, health) %>%
  summarise(count = n())

happy2 %>%
  dplyr::ungroup() %>%
  split(list(.$degree, .$sex, .$marital)) %>%
  lapply(. %>% select(health, count))
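A roughly equivalent split can also be done with dplyr's group_split() (a sketch; note that, unlike split(), it returns an unnamed list):
happy2 %>%
  dplyr::ungroup() %>%
  dplyr::group_split(degree, sex, marital) %>%
  lapply(. %>% select(health, count))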
Or do you really want the "marital" status as the table heading for the "health" column, as in your picture?
