Extracting unique column combination and finding sum and count in R

I have a flight database with 4 columns, as shown below.
Original:
I want an output with one row per unique combination of 3 columns (Origin/Destination/Airline) that sums the number of passengers for each combination and counts the number of rows for each combination. The result would be something like this.
Output:
I am able to do 1 part of it using the group_by function
df %>% group_by(Origin, destination, carrier) %>% summarise(count = n())
How can I also include the sum of passengers?

We can use dplyr
library(dplyr)
df1 %>%
group_by(Origin, Destination, Airline) %>%
dplyr::summarise(count = n(), TotalPassengers = sum(Passengers))
# Groups: Origin, Destination [2]
# Origin Destination Airline count TotalPassengers
# <chr> <chr> <chr> <int> <dbl>
#1 ABE ATL 9A 2 3
#2 ABE ATL DL 1 5
#3 NYC SFA AA 3 21
#4 NYC SFA DL 1 5
data
df1 <- data.frame(Origin = rep(c("ABE", "NYC"), c(3, 4)),
Destination = rep(c("ATL", "SFA"), c(3, 4)),
Airline = c("9A", "9A", "DL", "AA", "AA", "AA", "DL"),
Passengers = c(2, 1, 5, 4, 10, 7, 5))
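For comparison, the same count and sum can be sketched in base R without dplyr. This is only an illustration using the df1 data above; the names totals, counts and res are made up for the example.

```r
df1 <- data.frame(
  Origin = rep(c("ABE", "NYC"), c(3, 4)),
  Destination = rep(c("ATL", "SFA"), c(3, 4)),
  Airline = c("9A", "9A", "DL", "AA", "AA", "AA", "DL"),
  Passengers = c(2, 1, 5, 4, 10, 7, 5)
)

# Sum of passengers per unique Origin/Destination/Airline combination
totals <- aggregate(Passengers ~ Origin + Destination + Airline,
                    data = df1, FUN = sum)
names(totals)[4] <- "TotalPassengers"

# Number of rows per combination, using length() instead of sum()
counts <- aggregate(Passengers ~ Origin + Destination + Airline,
                    data = df1, FUN = length)
names(counts)[4] <- "count"

# Join the two summaries on the three key columns
res <- merge(counts, totals, by = c("Origin", "Destination", "Airline"))
res
```

The dplyr version is shorter because summarise() can compute both values in one pass; the base version needs one aggregate() call per statistic.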

Related

Count unique strings that only occur in a single group based on all possible groups

I have the following df
a = data.frame(PA = c("A", "A", "A", "B", "B"), Family = c("aa", "ab", "ac", "aa", "ad"))
What I want to obtain is a count of unique 'Family' strings (aa, ab, ac, ad) in each PA (A or B) based on all possible PAs. For example, aa is a unique string for A and B, but since it occurs in both PAs I don't want it. On the other hand, ab and ac are unique for PA A and only occur in PA A: that's what I want.
Using dplyr I was doing something like:
a %>% group_by(PA) %>%
summarise(count_family = n_distinct(Family))
But this only returns the number of distinct terms inside each PA. I want the Families that occur in exactly one PA, considering all possible PAs.

Here's a tidyverse approach.
First keep only the Family values that occur once in the whole data frame, then group_by(PA) and count.
library(tidyverse)
a %>% group_by(Family) %>%
filter(n() == 1) %>%
group_by(PA) %>%
summarize(count_family = n())
Output
# A tibble: 2 x 2
PA count_family
<chr> <int>
1 A 2
2 B 1
Output before summarise()
# A tibble: 3 x 2
# Groups: Family [3]
PA Family
<chr> <chr>
1 A ab
2 A ac
3 B ad
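Note that filter(n() == 1) counts rows, so a Family repeated within the same PA would also be dropped. If that matters, a variant that counts distinct PAs per Family may be closer to the question. A base R sketch of that idea, assuming the data frame a from the question (pa_per_family and exclusive are illustrative names):

```r
a <- data.frame(PA = c("A", "A", "A", "B", "B"),
                Family = c("aa", "ab", "ac", "aa", "ad"))

# Number of distinct PAs in which each Family appears
pa_per_family <- tapply(a$PA, a$Family, function(x) length(unique(x)))

# Keep rows whose Family is found in exactly one PA, then drop repeated rows
exclusive <- unique(a[pa_per_family[as.character(a$Family)] == 1, ])

# Count the exclusive Families per PA
counts <- table(exclusive$PA)
counts
```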

How to keep only rows that have highest value in certain column in R

I have a dataframe that looks like this:
library(tidyverse)
df <- tribble (
~Species, ~North, ~South, ~East, ~West,
"a", 4, 3, 2, 3,
"b", 2, 3, 4, 5,
"C", 2, 3, 3, 3,
"D", 3, 2, 2, 2
)
I want to filter for species where the highest value is in a given column, e.g. North.
In this case, species a and D would be selected. The expected output would be a df with only species a and D in it.
I used a workaround like this:
df %>%
group_by(Species) %>%
mutate(rowmean = mean(c_across(North:West))) %>%
filter(North > rowmean) %>%
ungroup() %>%
select(!rowmean)
which seems like a lot of code for a simple task!
However, I can't find a more concise way to do this. Is there a (preferably tidyverse) way to perform this task more cleanly?
Kind regards

An easier approach is with max.col in base R. Select the numeric columns, get the column index of the maximum value in each row, check whether that index equals 1, i.e. the first of the selected columns (North, since we selected only from the 2nd column onwards), and subset the rows
subset(df, max.col(df[-1], 'first') == 1)
# A tibble: 2 x 5
# Species North South East West
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 a 4 3 2 3
#2 D 3 2 2 2
If it is based on the rowwise mean
subset(df, North > rowMeans(df[-1]))
Or if we prefer to use dplyr
library(dplyr)
df %>%
filter(max.col(cur_data()[-1], 'first') == 1)
Similarly, if it is based on the rowwise mean
df %>%
filter(North > rowMeans(cur_data()[-1]))
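To see what max.col returns before it is compared with 1, here is a small sketch on the same data, built as a plain data.frame so it does not need the tidyverse. The names num, idx and keep are only for illustration.

```r
df <- data.frame(
  Species = c("a", "b", "C", "D"),
  North = c(4, 2, 2, 3),
  South = c(3, 3, 3, 2),
  East  = c(2, 4, 3, 2),
  West  = c(3, 5, 3, 2)
)

# Numeric columns only, as a matrix
num <- as.matrix(df[-1])

# Column index of the row maximum; ties go to the left-most column
idx <- max.col(num, ties.method = "first")
idx                 # 1 4 2 1

# Name of the winning column per row
colnames(num)[idx]

# Rows where North (column 1 of num) holds the maximum
keep <- df[idx == 1, ]
keep
```

With ties.method = "first", a row where North ties with another column still counts as a North row, because North is the left-most column.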

# base
df[df$North > rowMeans(df[-1]), ]
# A tibble: 2 x 5
Species North South East West
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 4 3 2 3
2 D 3 2 2 2

Compute grouped mean while retaining single-row group in R (dplyr)

I'm trying to compute mean + standard deviation for a dataset. I have a list of organizations, but one organization has just one single row for the column "cpue." When I try to compute the grouped mean for each organization and another variable (scientific name), this organization is removed and yields a NA. I would like to retain the single-group value however, and for it to be in the "mean" column so that I can plot it (without sd). Is there a way to tell dplyr to retain groups with a single row when calculating the mean? Data below:
l<- df<- data.frame(organization = c("A","B", "B", "A","B", "A", "C"),
species= c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
cpue= c(1, 2, 1, 5, 6, 1, 3))
l2<- l %>%
group_by( organization, species)%>%
summarize(mean= mean(cpue),
sd=sd(cpue))
Any help would be much appreciated!

We can create an if/else condition in sd to check for the number of rows i.e. if n() ==1 then return the 'cpue' or else compute the sd of 'cpue'
library(dplyr)
l1 <- l %>%
group_by( organization, species)%>%
summarize(mean= mean(cpue),
sd= if(n() == 1) cpue else sd(cpue), .groups = 'drop')
-output
l1
# A tibble: 6 x 4
# organization species mean sd
#* <chr> <chr> <dbl> <dbl>
#1 A bird 5 5
#2 A shark 1 1
#3 A turtle 1 1
#4 B shark 2 2
#5 B turtle 3.5 3.54
#6 C bird 3 3
If the condition is based on the value of grouping variable 'organization', then create the condition in if/else by extracting the grouping variable with cur_group()
l %>%
group_by(organization, species) %>%
summarise(mean = mean(cpue),
sd = if(cur_group()$organization == 'A') cpue else sd(cpue),
.groups = 'drop')
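The if/else is needed because sd() of a single value is NA. A minimal base R sketch of the same rule, wrapped in a hypothetical helper sd_or_value (not part of dplyr or base R):

```r
l <- data.frame(
  organization = c("A", "B", "B", "A", "B", "A", "C"),
  species = c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
  cpue = c(1, 2, 1, 5, 6, 1, 3)
)

sd(3)  # NA: the standard deviation of one observation is undefined

# Hypothetical helper: return the value itself for one-row groups, else sd()
sd_or_value <- function(x) if (length(x) == 1) x else sd(x)

# Apply it per organization/species combination
key <- interaction(l$organization, l$species, drop = TRUE)
res <- tapply(l$cpue, key, sd_or_value)
res
```

Whether the raw value belongs in the sd column is a plotting convenience; returning NA_real_ instead would keep the column a true standard deviation.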

dplyr collapse by rank of variable but ignore NA

I am struggling with a collapse of my data.
Basically my data consists of multiple indicators with multiple observations for each year. I want to convert this to one observation for each indicator for each country.
I have a rank indicator which specifies the order in which the observations have to be chosen.
Basically the observation with the first rank (thus 1 instead of 2) has to be chosen, as long as for that rank the value is not NA.
An additional question: The years in my dataset vary over time, thus is there a way to make the code dynamic in the sense that it applies the code to all column names from 1990 to 2025 when they exist?
df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1),
id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")),
`1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22),
`2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2),
`2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
rank = c(1, 2 , 3 , 4 , 1, 2, 3, 1, 3, 2, 4, 5))
The result should be the following dataset:
result <- data.frame(country.code = c(1, 1, 1),
id = as.factor(c("GDP", "CA", "GR")),
`1999`= c(1000, 100, 22),
`2000`= c(1, 1, 12),
`2001`= c(3, 100, 1))
I attempted the following solution (but this does not work given the NAs in the data, and I would have to specify each column):
test <- df %>% group_by(country.code, id) %>%
summarise(test1999 = `1999`[which.min(rank)])
I don't see how I can tell R to omit the cases of the column 1999 that are NA.

We can subset using the minimum rank of the non-NA values for a column, e.g. x[rank==min(rank[!is.na(x)])].
An additional question: The years in my dataset vary over time,....
Using summarise_at, vars and matches can be used to select any column name with 4 digits i.e. 1990-2025 using a regular expression [0-9]{4} (which means search for a digit "0-9" repeated exactly 4 times) and apply the above procedure to them using funs
library(dplyr)
df %>% group_by(country.code,id) %>%
summarise(`1999` = `1999`[rank==ifelse(all(is.na(`1999`)),1, min(rank[!is.na(`1999`)]))])
df %>% group_by(country.code,id) %>%
summarise_at(vars(matches("[0-9]{4}")),funs(.[rank==ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
# A tibble: 3 x 5
# Groups: country.code [?]
country.code id `1999` `2000` `2001`
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 CA 100 1 100
2 1 GDP 1000 1 3
3 1 GR 22 12 1
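The selection rule can also be checked in isolation. Below is a base R sketch, assuming year columns named X1999, X2000 (the names data.frame() produces by default) and a reduced version of the data; first_by_rank is a hypothetical helper, not a dplyr function.

```r
# Hypothetical helper: value with the lowest rank among the non-NA entries
first_by_rank <- function(x, rank) {
  ok <- !is.na(x)
  if (!any(ok)) return(NA_real_)   # whole column is NA for this group
  x[ok][which.min(rank[ok])]
}

# GR values for 2000 from the question: rank 1 is NA, so rank 2 wins
first_by_rank(c(NA, 1000, 12, 13, 2), c(1, 3, 2, 4, 5))  # 12

df <- data.frame(
  id = c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA"),
  X1999 = c(NA, NA, NA, 1000, NA, NA, 100),
  X2000 = c(NA, NA, 1, 2, NA, 1, 2),
  rank = c(1, 2, 3, 4, 1, 2, 3)
)

# Pick up whichever four-digit year columns exist
year_cols <- grep("^X[0-9]{4}$", names(df), value = TRUE)

# One row per id, applying the helper to every year column
res <- do.call(rbind, lapply(split(df, df$id), function(g) {
  cbind(data.frame(id = g$id[1]),
        as.data.frame(lapply(g[year_cols], first_by_rank, rank = g$rank)))
}))
res
```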

Here is one option that uses tidyr::fill to replace the NAs by the first non-NA value after we arranged the data by id and rank. It might not be the most efficient approach because we first gather and then spread the data again.
library(tidyverse)
df %>%
arrange(id, rank) %>%
gather(key, value, X1999:X2001) %>%
tidyr::fill(value, .direction = "up") %>%
spread(key, value) %>%
group_by(id) %>%
slice(1) %>%
ungroup()
# A tibble: 3 x 6
# country.code id rank X1999 X2000 X2001
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 CA 1 100 1 100
#2 1 GDP 1 1000 1 3
#3 1 GR 1 22 12 1
NOTE: the column names are probably not 1999, 2000 etc. as in your data. But the code is easily adaptable.

You can change the data frame to long form, remove the NAs, select the value corresponding to the minimum rank, and spread back to wide form
library(dplyr)
library(tidyr)
test <- df %>%
gather("Year", "Value", X1999:X2001) %>%
filter(!is.na(Value))%>%
group_by(country.code, id, Year) %>%
arrange(rank)%>%
summarise(first(Value)) %>%
spread(Year, `first(Value)`)

Using group_by and summarise from dplyr for all rows not containing the variable to group_by

I have a data.frame such as
df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
cost = c(100, 10, 120, 102, 102))
I know that I can use
df1.a <- group_by(df1, id) %>%
summarise(no.c = n(),
m.costs = mean(cost))
to calculate the number of observations and mean by id. How could I do so if I want to calculate the number of observations and mean for all rows that are NOT equal to the id, so that it would, for example, give me 3 as the number of observations that are not A and 2 for observations that are not B?
I would like to use the dplyr package and the group_by function since I have to do this for a lot of huge data frames.

You can use the . to refer to the whole data.frame, which lets you calculate the differences between the group and the whole:
df1 %>% group_by(id) %>%
summarise(n = n(),
n_other = nrow(.) - n,
mean_cost = mean(cost),
mean_other = (sum(.$cost) - sum(cost)) / n_other)
## # A tibble: 2 × 5
## id n n_other mean_cost mean_other
## <fctr> <int> <int> <dbl> <dbl>
## 1 A 2 3 55 108
## 2 B 3 2 108 55
As you can see from the results, with two groups you could just use rev, but this approach will scale to more groups or calculations easily.

Looking for something like this? This first calculates the total cost and total number of rows, then subtracts each group's cost and row count from those totals and takes the average of the remaining cost:
sumCost = sum(df1$cost)
totRows = nrow(df1)
df1 %>%
group_by(id) %>%
summarise(no.c = totRows - n(),
m.costs = (sumCost - sum(cost))/no.c)
# A tibble: 2 x 3
# id no.c m.costs
# <fctr> <int> <dbl>
#1 A 3 108
#2 B 2 55
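The same "everything except this group" arithmetic can be sketched in base R with tapply, assuming the df1 from the question. The names n_by_id, sum_by_id and res are only for illustration.

```r
df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
                  cost = c(100, 10, 120, 102, 102))

# Per-group row counts and cost sums
n_by_id   <- tapply(df1$cost, df1$id, length)
sum_by_id <- tapply(df1$cost, df1$id, sum)

# Subtract each group from the overall totals
n_other    <- nrow(df1) - n_by_id
mean_other <- (sum(df1$cost) - sum_by_id) / n_other

res <- data.frame(id = names(n_by_id),
                  n_other = as.vector(n_other),
                  mean_other = as.vector(mean_other))
res
```

Because only group totals are subtracted from grand totals, the cost stays linear in the number of rows regardless of how many groups there are.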
