Select n-th largest row per group in R

I am trying to select the n-th largest row per group in a dataset. As an example, take the iris dataset. I found this code online that returns the second largest value of Sepal.Length for each species:
library(dplyr)
myfun <- function(x) {
u <- unique(x)
sort(u, decreasing = TRUE)[2L]
}
iris %>%
group_by(Species) %>%
summarise(result = myfun(Sepal.Length))
I just want to check that I have understood this correctly. If I want the 3rd largest value, do I simply change the index like this? And how can I select all the rows from the original data?
library(dplyr)
myfun <- function(x) {
u <- unique(x)
sort(u, decreasing = TRUE)[3L]
}
iris %>%
group_by(Species) %>%
summarise(result = myfun(Sepal.Length))

Just modify the function to take an extra argument n to make it dynamic:
myfun <- function(x, n) {
u <- unique(x)
sort(u, decreasing = TRUE)[n]
}
and then call it as
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(result = myfun(Sepal.Length, 3))
-output
# A tibble: 3 × 2
Species result
<fct> <dbl>
1 setosa 5.5
2 versicolor 6.8
3 virginica 7.6
To get the result for all numeric columns, loop across them:
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), ~ myfun(.x, 3)))
# or use nth
# summarise(across(where(is.numeric), ~ nth(unique(.x),
# order_by = -unique(.x), 3)))
-output
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.5 4.1 1.6 0.4
2 versicolor 6.8 3.2 4.9 1.6
3 virginica 7.6 3.4 6.6 2.3
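The question also asked how to keep all rows of the original data rather than a single summarised value. A minimal sketch, assuming the goal is every row whose Sepal.Length equals the n-th largest distinct value in its group. dense_rank is one way to express this and is not taken from the answers above:

```r
library(dplyr)

n <- 3  # which largest distinct value to keep

# dense_rank(desc(x)) gives 1 to the largest distinct value, 2 to the next, ...
nth_rows <- iris %>%
  group_by(Species) %>%
  filter(dense_rank(desc(Sepal.Length)) == n) %>%
  ungroup()

nth_rows
```

Ties are kept, so a group can contribute more than one row, and every original column is retained.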

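We could use nth from the dplyr package after grouping and arranging. Note that only Sepal.Length is sorted here, so the other columns return the third distinct value in that row order rather than their own third largest value:
library(dplyr)
iris %>%
group_by(Species) %>%
arrange(-Sepal.Length, .by_group = TRUE) %>%
summarise(across(everything(), ~nth(unique(.x), 3)))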
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.5 3.8 1.7 0.3
2 versicolor 6.8 2.8 4.8 1.7
3 virginica 7.6 2.8 6.9 2.3

Related

Get column value associated to another column maximum in dplyr's across

After grouping by species and taking the max Sepal.Length (column 1) for each group, I need to grab the values of columns 2 to 4 that are associated with the maximum value of column 1 (by group). I'm able to do so for each column individually but not with across. Any tips?
library(dplyr)
library(datasets)
data(iris)
Summarise by species with the data associated with the max Sepal.Length (by group), column by column:
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
max_sep_length = max(Sepal.Length),
sep_w_associated_to = Sepal.Width[which.max(Sepal.Length)],
pet_l_associated_to = Petal.Length[which.max(Sepal.Length)],
pet_w_associated_to = Petal.Width[which.max(Sepal.Length)]
)
Now I would like to obtain the same result using across, but the outcome is different from what I expected (the data frame iris_summary now has the same number of rows as iris, and I can't understand why):
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
max_sepa_length = max(Sepal.Length),
across(
.cols = Sepal.Width : Petal.Width,
.funs = ~ .x[which.max(Sepal.Length)]
)
)
Or use slice_max:
library(dplyr) # `by` needs dplyr >= 1.1.0; otherwise use `group_by(Species)` first
iris %>%
slice_max(Sepal.Length, n = 1, by = 'Species')
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 7.0 3.2 4.7 1.4 versicolor
3 7.9 3.8 6.4 2.0 virginica
In base R you could do:
merge(aggregate(Sepal.Length~Species, iris, max), iris)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.8 4.0 1.2 0.2
2 versicolor 7.0 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2.0
If we want to do the same with across, here is one option:
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~ .[which.max(Sepal.Length)]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4 1.2 0.2
2 versicolor 7 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2
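If the goal is simply the full row holding the group maximum, slice with which.max is another way to write it. A sketch for comparison with the column-by-column summarise from the question. Note that which.max returns the first maximum, so ties are not kept:

```r
library(dplyr)

# Keep, for each species, the first row with the largest Sepal.Length
max_rows <- iris %>%
  group_by(Species) %>%
  slice(which.max(Sepal.Length)) %>%
  ungroup()

max_rows
```

Because whole rows are kept, the associated Sepal.Width, Petal.Length and Petal.Width values come along without naming them.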

How to add p value column in a data frame comparing row wise?

For the example dataset iris, I would like to compute a table that gives me the p-values for a t-test comparing the species setosa and versicolor to virginica (i.e. virginica would be the reference group/control).
Currently, I've computed the average values for the columns (sepal length, sepal width, petal length, petal width) and am trying to do a t-test grouped by species against the control.
An example output would have these columns:
Sepal width p value, Sepal length p value, Petal length p value, Petal width p value
Thanks in advance for all your help!
Edit 1:
Here is what I wrote, applied to iris (which doesn't exactly fit). I basically cleaned up my data to only include certain independent variables, which is why I have the %>%.
iris %>%
group_by(species) %>%
addcol = function(iris)%>%
Sepal.length.p.value = mutate(iris, function(t.test(vars(3), ~./[species == 'Sentosa'])))
and basically I did that for each of the independent variables.
You can try the following:
library(dplyr)
library(tidyr)
library(broom)
pivot_longer(iris,-Species) %>% group_by(name)
# A tibble: 600 x 3
# Groups: name [4]
Species name value
<fct> <chr> <dbl>
1 setosa Sepal.Length 5.1
2 setosa Sepal.Width 3.5
3 setosa Petal.Length 1.4
4 setosa Petal.Width 0.2
5 setosa Sepal.Length 4.9
6 setosa Sepal.Width 3
At this step, we have converted the data into long format and grouped it by variable. It is then a matter of applying a pairwise t-test within each group and filtering out the comparisons you don't need. We can use broom for this:
res = pivot_longer(iris,-Species) %>% group_by(name) %>%
do(tidy(pairwise.t.test(.$value,.$Species,pool.sd =FALSE))) %>%
filter(group1=="virginica" | group2=="virginica")
# A tibble: 8 x 4
# Groups: name [4]
name group1 group2 p.value
<chr> <chr> <chr> <dbl>
1 Petal.Length virginica setosa 2.78e-49
2 Petal.Length virginica versicolor 4.90e-22
3 Petal.Width virginica setosa 7.31e-48
4 Petal.Width virginica versicolor 2.11e-25
5 Sepal.Length virginica setosa 1.19e-24
6 Sepal.Length virginica versicolor 1.87e- 7
7 Sepal.Width virginica setosa 9.14e- 9
8 Sepal.Width virginica versicolor 1.82e- 3
Note that I set pool.sd =FALSE in pairwise.t.test so that it would be similar to a t.test, but ideally, if you have many groups, and their variances are similar, it pays to use a pooled SD.
You can put this in wide format again:
pivot_wider(res,values_from=p.value,names_from=name)
# A tibble: 2 x 6
group1 group2 Petal.Length Petal.Width Sepal.Length Sepal.Width
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 virginica setosa 2.78e-49 7.31e-48 1.19e-24 0.00000000914
2 virginica versicolor 4.90e-22 2.11e-25 1.87e- 7 0.00182
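pairwise.t.test adjusts p-values with Holm's method by default, which is why some values above are larger than those from a plain t.test. A short sketch, assuming unadjusted Welch p-values are wanted, that sets p.adjust.method = "none" and checks one comparison against t.test:

```r
# Unadjusted pairwise Welch t-tests for one variable
unadjusted <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                              pool.sd = FALSE, p.adjust.method = "none")

# The same comparison done directly with t.test
virginica <- iris$Sepal.Length[iris$Species == "virginica"]
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
direct <- t.test(virginica, setosa)$p.value

# Rows and columns of the p-value matrix are named by species
p_pairwise <- unadjusted$p.value["virginica", "setosa"]
all.equal(p_pairwise, direct)
```

Whether to adjust depends on the analysis. With only two comparisons against a control, some correction for multiple testing is usually still worth considering.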
This is a possible solution: Cycle through the variable names in iris using purrr::map_dfc() and within that map_dfc() you cycle through the treatment groups (versicolor and setosa) with purrr::map_dfr(). That way the results of the inner cycle are combined rowwise and the results of the outer cycle are combined columnwise.
var_names <- names(iris)
var_names <- var_names[-length(var_names)] # Last variable is the group/Species variable, we don't want to include that.
treat_group <- c(versicolor = "versicolor", setosa = "setosa") # Using a named vector here will help map_dfr() to give useful names to the rows, otherwise it would just be 1 and 2.
library(purrr)
library(dplyr)
map_dfc(var_names, function(x) {
map_dfr(treat_group, function(y) {
res <-
tibble(t.test(iris[x][iris$Species == "virginica",],
iris[x][iris$Species == y,])$p.value)
names(res) <- x
res
}, .id = "species")
}) %>%
select(-matches("[1-3]")) # drop columns with numeric characters in it, to get rid of repeated species columns
#> # A tibble: 2 x 5
#> species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor 1.87e- 7 0.00182 4.90e-22 2.11e-25
#> 2 setosa 3.97e-25 0.00000000457 9.27e-50 2.44e-48
You could just split your data into control and treatment groups and use dplyr::summarise within your groups to create a column that gives you the p-value of a t-test.
library(dplyr)
control <- iris %>%
filter(Species == "virginica")
dat <- iris %>%
group_by(Species) %>%
filter(Species != "virginica") %>%
summarise("Sepal Width p value" = t.test(Sepal.Width, control$Sepal.Width)$p.value,
"Sepal length p value" = t.test(Sepal.Length, control$Sepal.Length)$p.value,
"Petal length p value" = t.test(Petal.Length, control$Petal.Length)$p.value,
"Petal width p value" = t.test(Petal.Width, control$Petal.Width)$p.value)
With the output being:
# A tibble: 2 x 5
Species `Sepal Width p value` `Sepal length p value` `Petal length p value` `Petal width p value`
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 0.00000000457 3.97e-25 9.27e-50 2.44e-48
2 versicolor 0.00182 1.87e- 7 4.90e-22 2.11e-25
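The four near-identical t.test calls can also be written once with across. A sketch, assuming dplyr >= 1.0.0. cur_column() supplies the name of the column being processed so the matching control column can be looked up, and the .names pattern is an illustrative choice:

```r
library(dplyr)

# Reference group, as in the answer above
control <- filter(iris, Species == "virginica")

p_values <- iris %>%
  filter(Species != "virginica") %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric),
                   # compare this group's column with the same column in control
                   ~ t.test(.x, control[[cur_column()]])$p.value,
                   .names = "{.col} p value"))

p_values
```

This scales to any number of numeric columns without further typing, at the cost of being a little less explicit than the hand-written version.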

Data Frame to matrix - many rows

I'm trying to convert a data.frame to a matrix. I calculated some statistics for the iris dataset and want each statistic to be placed in a separate row. The code below puts all statistics (avg and median) in one single row, which is not the desired output. I want something like this:
stat Sepal.Length Sepal.Width ....
avg 10.5 .....
med ...... .....
Code below:
data_iris <- iris
avg <- data_iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),mean,na.rm=TRUE)
med <- data_iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),median,na.rm=TRUE)
column <- colnames(data_iris[1:4])
rown <- c("avg","median")
df <- data.frame(avg=avg,med=med)
m <- data.matrix(df)
And an additional question: I'd like to calculate quantiles, but an error comes up:
qrtl <- data_iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),quantile,na.rm=TRUE)
error: Column Sepal.Length must be length 1 (a summary value), not 5
What's wrong?
It can be done if we do a reshape into 'long' with pivot_longer
library(dplyr)
library(tidyr)
iris %>%
summarise_if(is.numeric, list(avg = mean, med = median)) %>%
pivot_longer(everything(), names_to = c('.value', 'stat'), names_sep="_")
# stat Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 avg 5.843333 3.057333 3.758 1.199333
#2 med 5.800000 3.000000 4.350 1.300000
If it needs to be converted to matrix, then change the 'stat' to rownames and then use data.matrix
library(tibble)
iris %>%
summarise_if(is.numeric, list(avg = mean, med = median)) %>%
pivot_longer(everything(), names_to = c('.value', 'stat'), names_sep="_") %>%
column_to_rownames('stat') %>%
data.matrix
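If a matrix is the only goal, it can also be built directly in base R without an intermediate data frame. A sketch using colMeans and sapply, with the row names avg and med taken from the question:

```r
num <- iris[1:4]  # the four numeric columns

# One row per statistic, one column per variable
m <- rbind(avg = colMeans(num, na.rm = TRUE),
           med = sapply(num, median, na.rm = TRUE))
m
```

rbind keeps the column names from the named vectors, so no separate step is needed to label the matrix.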
The quantile call works fine in the dev version of dplyr (0.8.99.9000):
iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),quantile, na.rm=TRUE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 4.3 2.0 1.00 0.1
#2 5.1 2.8 1.60 0.3
#3 5.8 3.0 4.35 1.3
#4 6.4 3.3 5.10 1.8
#5 7.9 4.4 6.90 2.5
The OP's package version is 0.8.3, so wrapping the result in a list may work:
iris %>%
summarise_at(vars(Sepal.Length:Petal.Width),
list(quantile = ~ list(quantile(., na.rm=TRUE)))) %>%
unnest(c(names(.)))
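In current dplyr (1.1.0 and later), summarise is meant to return one row per group, and multi-row results such as quantiles go through reframe instead. A sketch under that assumption. The probs vector and the prob column name are illustrative choices:

```r
library(dplyr)

probs <- c(0, 0.25, 0.5, 0.75, 1)

qrtl <- iris %>%
  reframe(prob = probs,
          across(Sepal.Length:Petal.Width,
                 # unname() drops the "0%", "25%", ... labels from quantile()
                 ~ unname(quantile(.x, probs, na.rm = TRUE))))
qrtl
```

The columns are selected by range rather than with where(is.numeric) so that the newly created prob column is not picked up by across as well.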
We can use map with transpose and then bind rows from different statistics together.
library(purrr)
map(data_iris[1:4], ~list(mean = mean(.x), sd = sd(.x))) %>%
transpose() %>%
dplyr::bind_rows(.id = "statistics")
# A tibble: 2 x 5
# statistics Sepal.Length Sepal.Width Petal.Length Petal.Width
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 mean 5.84 3.06 3.76 1.20
#2 sd 0.828 0.436 1.77 0.762
Or
map_df(data_iris[1:4], ~c(mean = mean(.x), sd = sd(.x)))

dplyr summarise (collapse) dataset by different functions for multiple columns

I'm trying to dplyr::summarise a dataset (collapse) by different summarise_at/summarise_if functions so that I have the same named variables in my output dataset. Example:
library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1 5.1 3.5 1.4 0.2 setosa 2000 red A
2 4.9 3.0 1.4 0.2 setosa 2000 red A
3 4.7 3.2 1.3 0.2 setosa 2000 red A
The resulting dataset should look like this:
full
Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 setosa 2000 87 6.2 5.8 1.9 A red
2 setosa 3000 84.4 6.1 5.5 1.9 A red
3 versicolor 2000 69.4 33.6 7 4.9 B green
4 versicolor 3000 69.1 32.7 6.8 5.1 B green
5 virginica 2000 73.2 51.1 7.7 6.9 C blue
6 virginica 3000 75.5 50.2 7.9 6.4 C blue
I can achieve this by doing the following which is a bit repetitive:
sums <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Width")), list(sum))
max <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Length")), list(max))
last <- iris %>%
group_by(Species, year) %>%
summarise_if(is.factor, list(last))
first <- iris %>%
group_by(Species, year) %>%
summarise_if(is.character, list(first))
full <- full_join(sums, max) %>% full_join(last) %>% full_join(first)
I have found similar approaches below but can't figure out the approach I've tried here. I would prefer not to make my own function as I think something like this is cleaner by passing everything through a pipe and joining:
test <- iris %>%
#group_by(.vars = vars(Species, year)) %>% #why doesnt this work?
group_by_at(.vars = vars(Species, year)) %>% #doesnt work
{left_join(
summarise_at(., vars(matches("Width")), list(sum)),
summarise_at(., vars(matches("Length")), list(max)),
summarise_if(., is.factor, list(last)),
summarise_if(., is.character, list(first))
)
} #doesnt work
This doesn't work. Any suggestions or other approaches?
Helpful:
How can I use summarise_at to apply different functions to different columns?
Summarize different Columns with different Functions
Using dplyr summarize with different operations for multiple columns
By default, the dplyr::left_join() function only accepts two data frames. If you want to use this function with more than two data frames, you can iterate it with the Reduce function (base R function):
iris %>%
group_by(Species, year) %>%
{
Reduce(
function(x, y) left_join(x, y),
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
))
}
# Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 setosa 2000 87 6.2 5.8 1.9 A red
# 2 setosa 3000 84.4 6.1 5.5 1.9 A red
# 3 versicolor 2000 69.4 33.6 7 4.9 B green
# 4 versicolor 3000 69.1 32.7 6.8 5.1 B green
# 5 virginica 2000 73.2 51.1 7.7 6.9 C blue
# 6 virginica 3000 75.5 50.2 7.9 6.4 C blue
Furthermore, notice that I had to call the functions with their package prefix (::) to avoid clashes with the names of the previously created data frames (max, last, first).
Borrowing @Ulises' idea and using purrr::reduce instead of Reduce is an alternative:
iris %>%
group_by(Species, year) %>%
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
) %>%
.[c(2:5)] %>%
reduce(left_join)
Or a solution with curly brackets, which stops the pipe from passing the grouped data as the first argument:
iris %>%
group_by(Species, year) %>%
{
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
)
} %>%
reduce(left_join)
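With dplyr 1.0.0 or later the joins can be avoided altogether, because several across calls can sit in one summarise. A sketch, assuming the year, color and letter columns are created as in the question (here on a copy named iris2, a hypothetical name, so the built-in data is left alone):

```r
library(dplyr)

# Rebuild the example data on a copy
iris2 <- iris
iris2$year <- rep(c(2000, 3000), each = 25)                  # for grouping
iris2$color <- rep(c("red", "green", "blue"), each = 50)     # character column
iris2$letter <- as.factor(rep(c("A", "B", "C"), each = 50))  # factor column

# One summarise, a different function per set of columns
full <- iris2 %>%
  group_by(Species, year) %>%
  summarise(across(matches("Width"), base::sum),
            across(matches("Length"), base::max),
            across(where(is.factor), dplyr::last),
            across(where(is.character), dplyr::first),
            .groups = "drop")
full
```

Grouping columns are never selected by across, so Species is not touched by where(is.factor). The package prefixes are kept for the same reason as in the answer above: they still work if objects named max, last or first exist in the workspace.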

How does one summarize with conditions into a single variable in R?

I would like to use summarise() from dplyr after grouping data to compute a new variable. But, I would like it to use one equation for some of the data and a second equation for the rest of the data.
I have tried using group_by() and summarise() with if_else() but it isn't working.
Here's an example. Let's say--for some reason--I wanted to find a special value for sepal length. For the species 'setosa' this special value is twice the mean of the sepal length. For all of the other species it is simply the mean of sepal length. This is the code I've tried, but it doesn't work with summarise()
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(sepal_special = if_else(Species == "setosa", mean(Sepal.Length)*2, mean(Sepal.Length)))
This idea works with mutate() but I would need to re-format the tibble to be the dataset I am looking for.
library(dplyr)
iris %>%
group_by(Species) %>%
mutate(sepal_special = if_else(Species == "setosa", mean(Sepal.Length)*2, mean(Sepal.Length)))
This is how I want the resulting tibble to be laid out:
library(dplyr)
iris %>%
group_by(Species)%>%
summarise(sepal_mean = mean(Sepal.Length))
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
#>
But my result would show the value for setosa multiplied by 2:
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa **10.02**
#2 versicolor 5.94
#3 virginica 6.59
#>
Suggestions? I feel like I've really searched for ways to use if_else() with summarise() but can't find it anywhere, which means there must be a better way.
Thanks!
After the mutate step, use summarise to get the first element of 'sepal_special' for each 'Species'
iris %>%
group_by(Species) %>%
mutate(sepal_special = if_else(Species == "setosa",
mean(Sepal.Length)*2, mean(Sepal.Length))) %>%
summarise(sepal_special = first(sepal_special))
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 10.0
#2 versicolor 5.94
#3 virginica 6.59
Or, instead of calling mutate, apply if_else inside summarise and take the first value:
iris %>%
group_by(Species) %>%
summarise(sepal_special = if_else(Species == "setosa",
mean(Sepal.Length)*2, mean(Sepal.Length))[1])
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 10.0
#2 versicolor 5.94
#3 virginica 6.59
Another option: since twice the mean is the same as the mean of twice the values, you can double the sepal lengths for setosa and then summarise:
iris %>%
mutate(Sepal.Length = ifelse(Species == "setosa", 2*Sepal.Length, Sepal.Length)) %>%
group_by(Species) %>%
summarise(sepal_special = mean(Sepal.Length))
# A tibble: 3 x 2
Species sepal_special
<fct> <dbl>
1 setosa 10.0
2 versicolor 5.94
3 virginica 6.59
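The original summarise attempt returns more than one value per group because if_else(Species == "setosa", ...) is evaluated on every row of the group. A sketch that keeps everything in one summarise by reducing the condition to a single value per group first. first(Species) is used here on the assumption that Species is the grouping variable and therefore constant within each group:

```r
library(dplyr)

special <- iris %>%
  group_by(Species) %>%
  summarise(sepal_special = mean(Sepal.Length) *
              # multiplier is 2 for setosa and 1 for every other species
              if_else(first(Species) == "setosa", 2, 1))
special
```

Both inputs to the multiplication are length one, so summarise gets exactly one value per species.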
