Compute grouped mean while retaining single-row group in R (dplyr)

I'm trying to compute the mean and standard deviation for a dataset. I have a list of organizations, but one organization has just a single row for the column "cpue". When I compute the grouped mean for each organization and species (scientific name), that organization yields an NA. I would still like to retain the single-row group's value in the "mean" column so that I can plot it (without an sd). Is there a way to tell dplyr to retain groups with a single row when calculating the mean? Data below:
l <- data.frame(organization = c("A", "B", "B", "A", "B", "A", "C"),
                species = c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
                cpue = c(1, 2, 1, 5, 6, 1, 3))

library(dplyr)
l2 <- l %>%
  group_by(organization, species) %>%
  summarize(mean = mean(cpue),
            sd = sd(cpue))
Any help would be much appreciated!

We can add an if/else condition in sd to check the number of rows, i.e. if n() == 1 then return the 'cpue' value, or else compute the sd of 'cpue':
library(dplyr)
l1 <- l %>%
  group_by(organization, species) %>%
  summarize(mean = mean(cpue),
            sd = if (n() == 1) cpue else sd(cpue),
            .groups = 'drop')
Output
l1
# A tibble: 6 x 4
#  organization species  mean    sd
#* <chr>        <chr>   <dbl> <dbl>
#1 A            bird      5    5
#2 A            shark     1    1
#3 A            turtle    1    1
#4 B            shark     2    2
#5 B            turtle    3.5  3.54
#6 C            bird      3    3
If the condition is based on the value of the grouping variable 'organization', then build the if/else condition by extracting the grouping value with cur_group():
l %>%
  group_by(organization, species) %>%
  summarise(mean = mean(cpue),
            sd = if (cur_group()$organization == 'A') cpue else sd(cpue),
            .groups = 'drop')
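Since sd() already returns NA for a single-row group, another option is to keep that NA and let ggplot2 drop the missing error bar while keeping the bar itself. A minimal sketch, assuming a dodged bar chart is the goal (the plot layout here is an assumption, not from the original question):

library(dplyr)
library(ggplot2)

l2 <- l %>%
  group_by(organization, species) %>%
  summarize(mean = mean(cpue),
            sd = sd(cpue),  # NA for single-row groups
            .groups = "drop")

# geom_errorbar() skips rows whose ymin/ymax are NA, so the
# single-row group keeps its bar but simply gets no error bar
ggplot(l2, aes(species, mean, fill = organization)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
                position = position_dodge(width = 0.9), width = 0.2)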

Related

Calculating percentiles and showing them as stacked bars for benchmarking

This is a follow-up question to Calculate proportions of multiple columns.
I have the following data:
location <- rep(c("A", "B", "C", "D"), times = c(4, 6, 3, 7))
ID <- 1:20
Var1 <- rep(c(0, 2, 1, 1, 0), times = 4)
Var2 <- rep(c(2, 1, 1, 0, 2), times = 4)
Var3 <- rep(c(1, 1, 0, 2, 0), times = 4)
# cbind() coerces everything to character, hence the type_convert() below
df <- as.data.frame(cbind(location, ID, Var1, Var2, Var3))
And with some help I counted occurrences and proportions of the different scores (0, 1, 2) in the different Vars like this:
library(dplyr)
library(tidyr)
library(readr)

df %>%
  pivot_longer(starts_with("Var"), values_to = "score") %>%
  type_convert() %>%
  group_by(location, name) %>%
  count(score) %>%
  mutate(frac = n / sum(n)) -> dfmut
Now I have a data frame that looks like this, which I called dfmut:
# A tibble: 36 × 5
# Groups:   location, name [12]
   location name  score     n  frac
   <chr>    <chr> <dbl> <int> <dbl>
 1 A        Var1      0     2   0.4
 2 A        Var1      1     2   0.4
 3 A        Var1      2     1   0.2
 4 A        Var2      0     1   0.2
 5 A        Var2      1     2   0.4
 6 A        Var2      2     2   0.4
 7 A        Var3      0     2   0.4
 8 A        Var3      1     2   0.4
 9 A        Var3      2     1   0.2
10 B        Var1      0     2   0.4
What I would like to do now is get the 10th, 25th, 75th and 90th percentiles of the scores that are not 0 and turn them into a stacked bar chart. Here is an example story: location (A, B, etc.) is gardens of different people where they grew different kinds of vegetables (Var1, Var2, etc.). We scored how well the vegetables turned out, with score 0 = optimal, score 1 = suboptimal, score 2 = failure.
The goal is to get a stacked bar chart that shows how high the proportion of non-optimal (score 1 and 2) vegetables is in the 10% best, the 25% best gardens, etc. Then I want to indicate to each gardener where they lie in the ranking regarding each Var.
This could look something like the image (dark green = best 10%, through dark pink-purple = worst 10%), with the dot indicating garden A.
I started making a new data frame with the quantiles, which is probably not very elegant, so feel free to point out how I could do this more efficiently:
library(reshape2)

dfmut %>%
  subset(name == "Var1") %>%
  subset(score == "1" | score == "2") -> Var1_12

Percentiles <- c("10", "25", "75", "90")
Var1 <- quantile(Var1_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))
data <- data.frame(Percentiles, Var1)

dfmut %>%
  subset(name == "Var2") %>%
  subset(score == "1" | score == "2") -> Var2_12

data$Var2 <- quantile(Var2_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))
data_tidy <- melt(data, id.vars = "Percentiles")
I can't get any further than this. Probably because I'm on an entirely wrong path...
Thank you for your help!
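On the efficiency point: the per-Var blocks above repeat the same steps, so one more compact route to the same long-format quantile table, sketched with dplyr's reframe() (dplyr >= 1.1) and assuming dfmut as created above:

library(dplyr)

probs <- c(0.1, 0.25, 0.75, 0.9)

# One pass over all Vars: keep the non-optimal scores, then one
# quantile() call per Var; reframe() allows multi-row results per group
data_tidy <- dfmut %>%
  ungroup() %>%
  filter(score %in% c(1, 2)) %>%
  group_by(name) %>%
  reframe(Percentiles = paste0(probs * 100, "th"),
          value = quantile(frac, probs = probs))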

New Variable that counts based on row values in R

I want to create a new column that counts the number of rows matching a given value.
Creating reproducible data:
library(tibble)
data <- tibble(Category = c("A", "B", "A", "A", "A"))
I want the data to eventually look like the code below, but instead of creating the variable manually, I want to build a new variable CountA with a conditional mutate() or something similar that counts the total number of rows where Category is "A":
tibble(Category = c("A", "B", "A", "A", "A"), CountA = c(4,4,4,4,4))
I know that I could filter out the non-A values and then generate the CountA variable, but I need to keep those rows for a different purpose.
You can create a logical vector in mutate(), then sum the number of TRUE values:
library(dplyr)
data %>%
  mutate(countA = sum(Category == "A", na.rm = TRUE))
Or in base R:
data$countA <- sum(data$Category == "A", na.rm = TRUE)
Output
  Category countA
  <chr>     <int>
1 A             4
2 B             4
3 A             4
4 A             4
5 A             4
If you want to create a new column for every Category, you could do something like this:
library(tidyverse)

data %>%
  group_by(Category) %>%
  mutate(obs = n(),
         grp = Category,
         row = row_number()) %>%
  pivot_wider(names_from = "grp", values_from = "obs", names_prefix = "Count") %>%
  ungroup() %>%
  select(-row) %>%
  fill(-"Category", .direction = "updown")
Output
  Category CountA CountB
  <chr>     <int>  <int>
1 A             4      1
2 B             4      1
3 A             4      1
4 A             4      1
5 A             4      1
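If the pivot feels heavy, a plain loop over the unique values produces the same wide result; a minimal sketch (the Count<value> column names are just a convention here):

# Add one Count<value> column per unique Category; each holds the
# total number of rows with that value (constant down the column)
for (val in unique(data$Category)) {
  data[[paste0("Count", val)]] <- sum(data$Category == val, na.rm = TRUE)
}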

dplyr collapse by rank of variable but ignore NA

I am struggling with a collapse of my data.
Basically my data consists of multiple indicators with multiple observations for each year. I want to convert this to one observation for each indicator for each country.
I have a rank column which specifies the sequence in which the observations have to be chosen.
Basically the observation with the first rank (thus 1 instead of 2) has to be chosen, as long as for that rank the value is not NA.
An additional question: The years in my dataset vary over time, thus is there a way to make the code dynamic in the sense that it applies the code to all column names from 1990 to 2025 when they exist?
# Note: data.frame() prefixes numeric names with "X" (so `1999` becomes
# X1999) unless check.names = FALSE is supplied
df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1),
                 id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")),
                 `1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22),
                 `2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2),
                 `2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
                 rank = c(1, 2, 3, 4, 1, 2, 3, 1, 3, 2, 4, 5))
The result should be the following dataset:
result <- data.frame(country.code = c(1, 1, 1),
                     id = as.factor(c("GDP", "CA", "GR")),
                     `1999` = c(1000, 100, 22),
                     `2000` = c(1, 1, 12),
                     `2001` = c(3, 100, 1))
I attempted the following solution (but this does not work given the NAs in the data, and I would have to specify each column):
test <- df %>%
  group_by(country.code, id) %>%
  summarise(test1999 = X1999[which.min(rank)])
I don't see how I can tell R to omit the cases where the 1999 column is NA.
We can subset using the minimum rank among the non-NA values of a column, e.g. x[rank == min(rank[!is.na(x)])].
Regarding the additional question (the years in the dataset vary over time): with summarise_at, vars(matches("[0-9]{4}")) selects any column whose name contains four consecutive digits, i.e. 1990-2025, using the regular expression [0-9]{4} (a digit "0-9" repeated exactly four times), and funs applies the above procedure to each of them:
library(dplyr)

# assumes the year columns are literally named 1999, 2000, ...
# (i.e. the data was built with check.names = FALSE)
df %>%
  group_by(country.code, id) %>%
  summarise(`1999` = `1999`[rank == ifelse(all(is.na(`1999`)), 1, min(rank[!is.na(`1999`)]))])

df %>%
  group_by(country.code, id) %>%
  summarise_at(vars(matches("[0-9]{4}")),
               funs(.[rank == ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
# A tibble: 3 x 5
# Groups:   country.code [?]
  country.code id    `1999` `2000` `2001`
         <dbl> <fct>  <dbl>  <dbl>  <dbl>
1            1 CA       100      1    100
2            1 GDP     1000      1      3
3            1 GR        22     12      1
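In current dplyr, summarise_at() and funs() are superseded; the same idea written with across() would look roughly like this (a sketch, under the same column-naming assumption):

library(dplyr)

df %>%
  group_by(country.code, id) %>%
  summarise(across(matches("[0-9]{4}"),
                   # pick the value at the lowest rank whose entry is not NA;
                   # fall back to rank 1 when the whole column is NA
                   ~ .x[rank == ifelse(all(is.na(.x)), 1, min(rank[!is.na(.x)]))]),
            .groups = "drop")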
Here is one option that uses tidyr::fill to replace the NAs with the first non-NA value after arranging the data by id and rank. It might not be the most efficient approach because we first gather and then spread the data again.
library(tidyverse)

df %>%
  arrange(id, rank) %>%
  gather(key, value, X1999:X2001) %>%
  tidyr::fill(value, .direction = "up") %>%
  spread(key, value) %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup()
# A tibble: 3 x 6
#  country.code id     rank X1999 X2000 X2001
#  <dbl>        <fct> <dbl> <dbl> <dbl> <dbl>
#1 1            CA        1   100     1   100
#2 1            GDP       1  1000     1     3
#3 1            GR        1    22    12     1
NOTE: the column names here are X1999, X2000 etc. rather than 1999, 2000 as in your data, but that is easily adaptable.
You can reshape the data frame to long form, remove the NAs, select the values corresponding to the minimum rank, and spread back to wide form:
library(dplyr)
library(tidyr)

test <- df %>%
  gather("Year", "Value", X1999:X2001) %>%
  filter(!is.na(Value)) %>%
  group_by(country.code, id, Year) %>%
  arrange(rank) %>%
  summarise(Value = first(Value)) %>%
  spread(Year, Value)

Iterate through columns and row values (list) in R dplyr

This question is based on the following post with additional requirements (Iterate through columns in dplyr?).
The original code is as follows:
library(dplyr)

df <- data.frame(col1 = rep(1, 15),
                 col2 = rep(2, 15),
                 col3 = rep(3, 15),
                 group = c(rep("A", 5), rep("B", 5), rep("C", 5)))

for (col in c("col1", "col2", "col3")) {
  filt.df <- df %>%
    filter(group == "A") %>%
    select_(.dots = c('group', col))  # select_() is deprecated; newer dplyr uses select(all_of(...))
  # do other things, like ggplotting
  print(filt.df)
}
My objective is to output a frequency table for each unique COL by GROUP combination. The current example filters to a single GROUP value A, B, or C. In my case, I want to iterate (loop) through a list of values in GROUP (list <- c("A", "B", "C")) and generate a frequency table for each combination.
The frequency table is based on counts. For Col1 the result would look something like the table below. The example data set is simplified. My real dataset is more complex with multiple 'values' per 'group'. I need to iterate through Col1-Col3 by group.
group value n prop
A     1     5 .1
B     2     5 .1
C     3     5 .1
A better example of the frequency table is here: How to use dplyr to generate a frequency table
I struggled with this for a couple of days, and I could have done better with my example. Thanks for the posts. Here is what I ended up doing to solve this. The result is a series of frequency tables, one for each column and each unique value found in group. I had 3 columns (col1, col2, col3) and 3 unique values in group (A, B, C), so 3x3 = 9 frequency tables, plus a frequency table of the group column itself, which is nonsensical. I am sure there is a better way to do this. The output generates some labeling, which is useful.
library(summarytools)  # assumption: freq() here comes from summarytools

# Build unique group list
group <- unique(df$group)

# Generate frequency tables via a loop
iterate_by_group <- function(x) {
  for (i in seq_along(group)) {
    filt.df <- x[x$group == group[i], ]
    print(lapply(filt.df, freq))
  }
}

# Run
iterate_by_group(df)
We could gather into long format and then get the frequency (n()) by group
library(tidyverse)

gather(df, value, val, col1:col3) %>%
  group_by(group, value = parse_number(value)) %>%
  summarise(n = n(), prop = n / nrow(.))
# A tibble: 9 x 4
# Groups:   group [?]
#  group value     n  prop
#  <fct> <dbl> <int> <dbl>
#1 A         1     5 0.111
#2 A         2     5 0.111
#3 A         3     5 0.111
#4 B         1     5 0.111
#5 B         2     5 0.111
#6 B         3     5 0.111
#7 C         1     5 0.111
#8 C         2     5 0.111
#9 C         3     5 0.111
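If the proportions should instead sum to 1 within each group/column pair (one self-contained frequency table per combination), a variant using the newer pivot_longer() and count() could look like this (a sketch):

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(col1:col3, names_to = "column", values_to = "value") %>%
  count(group, column, value) %>%
  group_by(group, column) %>%
  mutate(prop = n / sum(n)) %>%  # proportion within each group/column table
  ungroup()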
Is this what you want?
df %>%
  group_by(group) %>%
  summarise_all(funs(freq = sum))

Using group_by and summarise from dplyr for all rows not containing the variable to group_by

I have a data.frame such as
df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
                  cost = c(100, 10, 120, 102, 102))
I know that I can use
df1.a <- group_by(df1, id) %>%
  summarise(no.c = n(),
            m.costs = mean(cost))
to calculate the number of observations and the mean by id. How could I do this if I want to calculate the number of observations and the mean for all rows that are NOT equal to the id? For example, it would give 3 as the number of observations not in A and 2 for observations not in B.
I would like to use the dplyr package and the group_by function, since I have to do this for a lot of huge data frames.
You can use the . to refer to the whole data.frame, which lets you calculate the differences between the group and the whole:
df1 %>%
  group_by(id) %>%
  summarise(n = n(),
            n_other = nrow(.) - n,
            mean_cost = mean(cost),
            mean_other = (sum(.$cost) - sum(cost)) / n_other)
## # A tibble: 2 × 5
##       id     n n_other mean_cost mean_other
##   <fctr> <int>   <int>     <dbl>      <dbl>
## 1      A     2       3        55        108
## 2      B     3       2       108         55
As you can see from the results, with two groups you could just use rev, but this approach will scale to more groups or calculations easily.
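The same idiom extends to other leave-one-group-out statistics; for example, a small sketch adding a median (cur_group() needs dplyr >= 1.0):

library(dplyr)

df1 %>%
  group_by(id) %>%
  summarise(n_other = nrow(.) - n(),
            # median cost across all rows outside the current group
            median_other = median(.$cost[.$id != cur_group()$id]))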
Looking for something like this? This first calculates the total cost and the total number of rows, then subtracts each group's cost and row count and takes the average of the remaining cost:
sumCost <- sum(df1$cost)
totRows <- nrow(df1)

df1 %>%
  group_by(id) %>%
  summarise(no.c = totRows - n(),
            m.costs = (sumCost - sum(cost)) / no.c)
# A tibble: 2 x 3
#  id     no.c m.costs
#  <fctr> <int>   <dbl>
#1 A          3     108
#2 B          2      55
