If I'm working with a dataset and I want to group the data (e.g. by country), compute a summary statistic (mean()), and then ungroup() the data.frame so that it has its original dimensions (country-year) plus a new column listing the mean for each country (repeated over n years), how would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
This creates mn as the mean for each group, repeated on every row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
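To make the difference concrete, here is a minimal sketch on made-up country-year data (the toy tibble and its values are assumptions for illustration, not the gapminder data):

```r
library(dplyr)

# Toy country-year data (hypothetical values)
toy <- tibble(
  country = c("A", "A", "B", "B", "B"),
  year    = c(2000, 2005, 2000, 2005, 2010),
  pop     = c(10, 20, 1, 2, 3)
)

# summarize() collapses each group to a single row
collapsed <- toy %>%
  group_by(country) %>%
  summarize(mn = mean(pop))

# mutate() keeps every row and repeats the group mean within each group
kept <- toy %>%
  group_by(country) %>%
  mutate(mn = mean(pop)) %>%
  ungroup()

nrow(collapsed)  # 2, one row per country
nrow(kept)       # 5, same as nrow(toy)
kept$mn          # 15 15 2 2 2
```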
Actually, ungroup() is not needed in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
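The mn values are identical. The difference is that the first result is still grouped by country, so any later verb keeps operating per group, while the second is an ungrouped tibble. The extra ungroup() call may make the latter run marginally slower.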
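A quick way to see what ungroup() changes is to inspect the grouping with group_vars() and then run a further aggregate. This is a sketch on made-up data; the toy tibble and the ratio and total columns are assumptions for illustration:

```r
library(dplyr)

toy <- tibble(
  country = c("A", "A", "B", "B"),
  pop     = c(10, 30, 1, 3)
)

still_grouped <- toy %>%
  group_by(country) %>%
  mutate(ratio = pop / mean(pop))

ungrouped <- still_grouped %>% ungroup()

# The computed column is the same either way
identical(still_grouped$ratio, ungrouped$ratio)  # TRUE

# The grouping is only gone after ungroup()
group_vars(still_grouped)  # "country"
group_vars(ungrouped)      # character(0)

# A later aggregate therefore behaves differently
total_grouped   <- still_grouped %>% mutate(total = sum(pop)) %>% pull(total)
total_ungrouped <- ungrouped %>% mutate(total = sum(pop)) %>% pull(total)
total_grouped    # 40 40 4 4
total_ungrouped  # 44 44 44 44
```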
I have 1609 observations from 93 unique publications. I am doing a qualitative analysis of my data, and I have the following variables: soil texture (coarse soil, sandy soil, sandy loam, sandy clay loam), experimental design (field, greenhouse, and lab), and publication title (93 unique publication titles). I want to count unique publication titles for each soil texture for each experimental design.
I could only get the number of unique publication titles for each experimental design or for each soil texture, using the following code:
df4_2 <- metadata2 %>%
group_by(publication_title) %>%
group_by(experiment_cond) %>%
summarise(count = n_distinct(publication_title))%>%
drop_na()
View(df4_2)
# OR
df4_3 <- metadata2 %>%
group_by(publication_title) %>%
group_by(soil_texture) %>%
summarise(count = n_distinct(publication_title))%>%
drop_na()
View(df4_3)
Does anyone know how I can count unique publication titles for each combination of soil texture and experimental design?
I tried the following code, but it did not work:
df4_4 <- metadata2 %>%
group_by(publication_title) %>%
group_by(soil_texture) %>%
group_by(experiment_cond) %>%
summarise(count = n_distinct(publication_title))%>%
drop_na()
View(df4_4)
By default, each group_by() overrides/drops any previous groupings. To group by multiple variables, include them in the same group_by() call:
library(dplyr)
metadata2 %>%
group_by(soil_texture, experiment_cond) %>%
summarise(count = n_distinct(publication_title)) %>%
drop_na()
If you did need to add groups in separate calls (not necessary here, but sometimes useful), use the .add argument:
metadata2 %>%
group_by(soil_texture) %>%
group_by(experiment_cond, .add = TRUE) %>%
summarise(count = n_distinct(publication_title)) %>%
drop_na()
Finally, note you shouldn’t group by publication_title; if you do, the n_distinct() per group would always be 1.
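The points above can be checked on a small stand-in table. Only the column names come from the question; the meta tibble and its rows are made up for illustration:

```r
library(dplyr)

# Hypothetical stand-in for metadata2
meta <- tibble(
  publication_title = c("P1", "P1", "P2", "P3", "P3", "P3"),
  soil_texture      = c("sandy soil", "sandy soil", "sandy soil",
                        "sandy loam", "sandy loam", "sandy loam"),
  experiment_cond   = c("field", "field", "field", "lab", "lab", "greenhouse")
)

# Consecutive group_by() calls: only the last grouping survives
overridden <- meta %>% group_by(soil_texture) %>% group_by(experiment_cond)
group_vars(overridden)  # "experiment_cond"

# One call with both variables counts titles per combination
counts <- meta %>%
  group_by(soil_texture, experiment_cond) %>%
  summarise(count = n_distinct(publication_title), .groups = "drop")
counts$count  # 1 1 2 (sandy loam/greenhouse, sandy loam/lab, sandy soil/field)

# Grouping by publication_title as well makes every count equal to 1
per_title <- meta %>%
  group_by(publication_title, soil_texture) %>%
  summarise(count = n_distinct(publication_title), .groups = "drop")
all(per_title$count == 1)  # TRUE
```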
I'm using a grouped time series data set where there are often NAs for more recent dates (the number of NAs varies fairly randomly). The data also provides a total across all the series; for more recent dates this total is greater than the sum of the individual series, I guess because of imputation/forecasting.
So, my question is, how can the missing values be estimated, assuming that the series total is correct?
My general approach is to calculate what proportion of the total each series represents and somehow extrapolate it to the missing dates. As you can see from the graphs, I'm not very successful. There are complications caused by the differing last dates of reported data. I'm not sure whether using cumulative values makes a difference.
R code for simulated data and failed solution below:
Code:
## simulate simple grouped time series
library(tidyverse)
set.seed(555)
## time series length, e.g. 10
len=10
## group names
grps=letters[1:5]
df=bind_rows(lapply(grps,function(z){
tibble(rn=seq_len(len)) %>%
mutate(real_val=runif(len,min=0,max=1)) %>%
mutate(grp=z) %>%
select(grp,rn,real_val) %>%
## replace final data points with NA, length varying by group
## this simulates delays in data reporting across groups
mutate(reported_val=ifelse(rn>len-match(z,letters)+1,NA,real_val)) %>%
# mutate(reported_val=ifelse(rn>len-runif(1,0,round(max_trim)),NA,real_val)) %>%
## make cumulative to assist viz a bit. may affect imputation method.
group_by(grp) %>%
arrange(rn) %>%
mutate(real_val=cumsum(real_val),reported_val=cumsum(reported_val)) %>%
ungroup()
}))
df
## attempt to impute/estimate missing real values, given total for each rn (date)
## general solution is to use (adjusted?) proportions of the total.
df2=df %>%
group_by(rn) %>%
mutate(sum_real_val=sum(real_val),sum_reported_val=sum(reported_val,na.rm=T)) %>%
ungroup() %>%
## total value missing for each date
mutate(val_missing=sum_real_val-sum_reported_val) %>%
## what proportion of the total each group takes up
mutate(prop=reported_val/sum_real_val) %>%
# mutate(prop=reported_val/sum_reported_val) %>%
## fill missing proportions to end terminus from most recent value
group_by(grp) %>%
arrange(rn) %>%
fill(prop,.direction='down') %>%
ungroup() %>%
## get estimated proportion of those missing
mutate(temp1=ifelse(is.na(reported_val),prop,NA)) %>%
## re-calculate proportion as only of those missing.
group_by(grp) %>%
mutate(prop_temp1=temp1/sum(temp1,na.rm=T)) %>%
ungroup() %>%
## if value missing, then multiply total missing by expected proportion of missing
mutate(result_val=ifelse(is.na(reported_val),val_missing*prop_temp1,reported_val)) %>%
ungroup()
## time series plot
## stacked by group, black line shows real_val total.
ggplot(df2 %>%
select(grp,rn,real_val,reported_val,sum_real_val,result_val) %>%
gather(val_grp,val,-c(grp,rn,sum_real_val)) %>%
ungroup(),aes(rn,val))+
# geom_line(aes(colour=loc))+
geom_area(aes(fill=grp))+
geom_line(aes(y=sum_real_val))+
facet_wrap("val_grp")
## but alas the result total doesn't agree with the reported total
## nb the imputed values for each group don't necessarily have to agree with the real values.
It's hard to know what conclusions to draw from the incomplete dates. One simple assumption could be to take the last share and extrapolate that into the future:
default_share <- df %>%
count(rn, grp, wt = !is.na(reported_val)) %>%
count(rn, n) %>%
filter(nn == max(nn)) %>%
slice_max(rn) %>%
left_join(df) %>%
mutate(share = real_val / sum(real_val)) %>%
select(rn, grp, share)
df %>%
group_by(rn) %>%
mutate(result_val = if_else(rn > default_share$rn[[1]],
sum(real_val) * default_share$share[rn == rn],
real_val)
) %>% ungroup() %>%
select(-reported_val) %>%
pivot_longer(-c(grp:rn)) %>%
ggplot(aes(rn,value))+
geom_area(aes(fill=grp))+
facet_wrap(~name)
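If the estimates must add up to the given total on every date, one possible variant (not the method in the answer above) is to carry each series' last observed share forward and split only the unreported remainder among the series that are missing on that date. The sketch below uses made-up data; the toy tibble and the total, share, remainder, w and result columns are all assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three series, four dates, a known total per date
toy <- tibble(
  rn    = rep(1:4, times = 3),
  grp   = rep(c("a", "b", "c"), each = 4),
  val   = c(10, 20, 30, 40,   # a: fully reported
            10, 20, 30, NA,   # b: last date missing
            20, 40, NA, NA),  # c: last two dates missing
  total = rep(c(40, 80, 120, 160), times = 3)
)

est <- toy %>%
  # share of the total while reported, carried forward over the missing dates
  mutate(share = val / total) %>%
  group_by(grp) %>%
  arrange(rn, .by_group = TRUE) %>%
  fill(share, .direction = "down") %>%
  # within each date, split the unreported remainder among the missing series
  group_by(rn) %>%
  mutate(
    remainder = total - sum(val, na.rm = TRUE),
    w         = if_else(is.na(val), share, 0),
    result    = if_else(is.na(val), remainder * w / sum(w), val)
  ) %>%
  ungroup()

# The estimated series now sum to the given total on every date
check <- est %>%
  group_by(rn) %>%
  summarise(s = sum(result), t = first(total))
```

The weights are normalised within each date (group_by(rn)) rather than within each series, which is what makes the per-date sums match the total.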
I am working with the 'flights' dataset from the 'nycflights13' package in R.
I want to add a column with the total distance covered by each 'carrier' in 2013. I have computed the total distance for each carrier and stored the result in a new variable.
There are 16 carriers, so how do I bind 16 totals to a data frame with many more rows?
carrier <- flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance))
How can I add these per-carrier totals as a new column in the flights dataset?
Thank you for your time and effort here.
PS. I tried running a for loop, but it doesn't work. I am new to programming.
Use mutate instead:
flights %>%
group_by(carrier) %>%
mutate(TotalDistance = sum(distance)) %>%
ungroup()-> carrier
We can also use left_join.
library(nycflights13)
data("flights")
library(dplyr)
flights %>%
left_join(flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance)), by='carrier')-> carrier
This will work even if you don't use arrange at the end.
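Both approaches give the same column. A small sketch on a made-up miniature of flights (the mini tibble and its values are assumptions; only the carrier and distance column names come from the question):

```r
library(dplyr)

# Hypothetical miniature of flights: only carrier and distance
mini <- tibble(
  carrier  = c("AA", "AA", "UA", "UA", "UA"),
  distance = c(100, 200, 50, 50, 400)
)

# Grouped mutate: the total is repeated on every row of the carrier
via_mutate <- mini %>%
  group_by(carrier) %>%
  mutate(TotalDistance = sum(distance)) %>%
  ungroup()

# Summarize to one row per carrier, then join back onto the full data
totals <- mini %>%
  group_by(carrier) %>%
  summarize(TotalDistance = sum(distance))

via_join <- mini %>% left_join(totals, by = "carrier")

via_mutate$TotalDistance  # 300 300 500 500 500
identical(via_mutate$TotalDistance, via_join$TotalDistance)  # TRUE
```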
I am not able to understand exactly how this code works. I found it in a tutorial guide:
Data manipulation in R - Steph Locke
On page 133 there is an example that I only partially understand.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(month, carrier) %>%
summarise(n=n()) %>% ##sum of items;
group_by(month) %>%
mutate(prop=scales::percent(n/sum(n)), n=NULL) %>%
spread(month, prop)
flights %>%
group_by(month, carrier) %>% ## This is grouping by months and within the months by carrier;
summarise(n=n()) %>% ## It is summing the items, giving for each month and each carrier the sum of items;
At this point there is another group_by(); it looks like it is nested inside group_by(month, carrier).
Then:
mutate(prop=scales::percent(n/sum(n)), n=NULL) %>% ## Calculates the percentage of items over the total and store them in "prop"
The last line creates the matrix, with one column per month holding the values from prop.
I would like to understand better what exactly the second group_by(month) %>% is doing.
Thank you in advance for every reply.
The second group_by is not needed here because, when each group yields a single row, summarise defaults to .groups = "drop_last". After the first summarise, the last grouping variable (carrier) is dropped and only month remains as a grouping column. We can change the code to
flights %>%
group_by(month, carrier) %>%
summarise(n=n()) %>%
mutate(prop=scales::percent(n/sum(n)), n=NULL)
If we set .groups to "drop" instead, summarise drops all the grouping variables, so a new group_by call is needed. Also, mutate does not remove grouping, so an ungroup() after the last grouped mutate is useful:
flights %>%
group_by(month, carrier) %>%
summarise(n=n(), .groups = "drop") %>%
group_by(month) %>%
mutate(prop=scales::percent(n/sum(n)), n=NULL) %>%
ungroup()
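The grouping left behind by each .groups setting can be inspected with group_vars(). A minimal sketch on made-up data (the toy tibble is an assumption; month and carrier mirror the flights columns):

```r
library(dplyr)

toy <- tibble(
  month   = c(1, 1, 1, 2, 2),
  carrier = c("AA", "AA", "UA", "AA", "UA")
)

by_both <- toy %>% group_by(month, carrier)

# "drop_last" removes only the last grouping variable (carrier)
after_default <- by_both %>% summarise(n = n(), .groups = "drop_last")
group_vars(after_default)  # "month"

# "drop" removes all grouping
after_drop <- by_both %>% summarise(n = n(), .groups = "drop")
group_vars(after_drop)     # character(0)

# With month still grouped, sum(n) is the monthly total
prop <- after_default %>% mutate(prop = n / sum(n)) %>% pull(prop)
prop  # 2/3, 1/3, 1/2, 1/2
```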
I am trying to create summary statistics without losing column values. For example, using the iris dataset, I want to group_by the species and compute summary statistics such as the sd and mean.
Once I have done this, I want to add the results back to the original dataset. How can I do this? I can only do the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
The result has one row per species, with its mean and sd.
I then want to add the mean and sd to the original iris data so that I can compute the z score of each individual row within its species.
To explain further: essentially, create groups by species and then find the z score of each individual plant based on its species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
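library(dplyr)
iris %>%
group_by(Species) %>%
## scale() returns a one-column matrix, so convert it to a plain vector;
## .names keeps the original columns and adds a *_Z column next to each
mutate(across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Z")) %>%
ungroup() %>%
## no join is needed: joining back on Species alone would match every row
## of a species with every other row of that species
relocate(Species, .after = last_col())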
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
data %>% left_join(df, by = "Species") %>%
mutate(Z = (Sepal.Length-mean.iris)/sd.iris)
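The same z score can also be computed in a single grouped mutate, without the intermediate summary table. A sketch comparing the two routes on the built-in iris data (the object names z_mutate, stats and z_join are made up for illustration):

```r
library(dplyr)

# Route 1: grouped mutate, mean() and sd() are evaluated per species
z_mutate <- iris %>%
  group_by(Species) %>%
  mutate(Z = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length)) %>%
  ungroup()

# Route 2: summarise per species, then join the statistics back
stats <- iris %>%
  group_by(Species) %>%
  summarise(mean.iris = mean(Sepal.Length), sd.iris = sd(Sepal.Length))

z_join <- iris %>%
  left_join(stats, by = "Species") %>%
  mutate(Z = (Sepal.Length - mean.iris) / sd.iris)

nrow(z_mutate)                     # 150, same as nrow(iris)
all.equal(z_mutate$Z, z_join$Z)    # TRUE
```

The join route keeps the mean and sd as columns, which is handy if they are needed later; the mutate route is shorter when only the z score is wanted.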