I can select and arrange a single column:
iris %>%
select(Petal.Width, Species) %>%
arrange(desc(Petal.Width))
But I want to do this for the whole dataframe. I'm approaching this with a for loop:
features <- colnames(iris)
top <- data.frame()
for (i in 1:length(features)) {
label <- features[[i]]
iris %>%
select(label, Species) %>%
arrange(desc(label)) %>%
top_n(3) %>%
rbind(top)
}
# Error in arrange_impl(.data, dots) :
# incorrect size (1) at position 1, expecting : 150
Which gives me an error.
Apparently arrange(desc(label)) doesn't work. I searched around and tried things like UQ and substitute to unquote the label, but without success.
The rbind(top) and top_n steps at the end might also not be exactly what I want, but my main problem right now is how to pass the label so that it works inside the for loop.
And maybe someone knows a better approach altogether than my for loop...
The desired output is a dataframe with the top 3 of every column.
If you want to use something on all columns, there are multiple ways. I like to gather (or melt) the data first and then use dplyr again.
For example, in your case, this would result in
library(tidyr)
library(dplyr)
iris %>%
gather("var", "val", -Species) %>%
group_by(var) %>%
arrange(desc(val)) %>%
top_n(3)
#> Selecting by val
#> # A tibble: 14 x 3
#> # Groups: var [4]
#> Species var val
#> <fctr> <chr> <dbl>
#> 1 virginica Sepal.Length 7.9
#> 2 virginica Sepal.Length 7.7
#> 3 virginica Sepal.Length 7.7
#> 4 virginica Sepal.Length 7.7
#> 5 virginica Sepal.Length 7.7
#> 6 virginica Petal.Length 6.9
#> 7 virginica Petal.Length 6.7
#> 8 virginica Petal.Length 6.7
#> 9 setosa Sepal.Width 4.4
#> 10 setosa Sepal.Width 4.2
#> 11 setosa Sepal.Width 4.1
#> 12 virginica Petal.Width 2.5
#> 13 virginica Petal.Width 2.5
#> 14 virginica Petal.Width 2.5
As you can see, top_n keeps every row that ties on the top 3 values rather than exactly 3 rows, which is why Sepal.Length returns 5 rows. If you want exactly 3 rows per variable, you can replace top_n(3) with slice(1:3).
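A minimal sketch of the same idea with newer tidyverse verbs, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are available (pivot_longer and slice_max are not part of the original answer). Setting with_ties = FALSE is what limits each variable to exactly 3 rows:

```r
library(dplyr)
library(tidyr)

top3 <- iris %>%
  # one row per (flower, measurement) pair
  pivot_longer(-Species, names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  # keep the 3 largest values per variable, breaking ties by row order
  slice_max(val, n = 3, with_ties = FALSE) %>%
  ungroup()

top3
```

Which of the tied rows survive depends on row order, so this is only appropriate when any of the tied rows will do.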
Does that give you what you were looking for?
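If you would rather keep the loop, here is one possible sketch, assuming dplyr >= 1.0.0. The column named by the string in label is selected with all_of() and renamed to a common name (value and feature are illustrative names, not from the question), so the pieces can be stacked afterwards:

```r
library(dplyr)

# Loop over the numeric columns only; Species is kept as an identifier.
features <- setdiff(colnames(iris), "Species")
top_list <- vector("list", length(features))

for (i in seq_along(features)) {
  label <- features[[i]]
  top_list[[i]] <- iris %>%
    # all_of() tells select() that `label` holds a column name as a string
    select(Species, value = all_of(label)) %>%
    arrange(desc(value)) %>%
    slice_head(n = 3) %>%
    mutate(feature = label)
}

# Stack the per-column results into one data frame
top <- bind_rows(top_list)
top
```

Collecting the pieces in a list and binding once at the end avoids growing a data frame inside the loop.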
After grouping by species and taking the max of Sepal.Length (column 1) for each group, I need to grab the values of columns 2 to 4 that are associated with the maximum value of column 1 (by group). I'm able to do so for each column one at a time, but not with across. Any tips?
library(dplyr)
library(datasets)
data(iris)
Summarise by species with the data associated with the max Sepal.Length (by group), column by column:
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
max_sep_length = max(Sepal.Length),
sep_w_associated_to = Sepal.Width[which.max(Sepal.Length)],
pet_l_associated_to = Petal.Length[which.max(Sepal.Length)],
pet_w_associated_to = Petal.Width[which.max(Sepal.Length)]
)
Now I would like to obtain the same result using across, but the outcome is different from what I expected (the df iris_summary now has the same number of rows as iris, and I can't understand why...)
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
max_sepa_length = max(Sepal.Length),
across(
.cols = Sepal.Width : Petal.Width,
.funs = ~ .x[which.max(Sepal.Length)]
)
)
Or use slice_max
library(dplyr) # devel can have `.by` or use `group_by(Species)`
iris %>%
slice_max(Sepal.Length, n = 1, by = 'Species')
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 7.0 3.2 4.7 1.4 versicolor
3 7.9 3.8 6.4 2.0 virginica
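One caveat worth sketching: by default slice_max keeps ties, so a group can return more than one row if several flowers share the maximum. Assuming dplyr >= 1.0.0, a variant that is guaranteed to return one row per species could look like this:

```r
library(dplyr)

one_per_species <- iris %>%
  group_by(Species) %>%
  # with_ties = FALSE keeps only the first row among tied maxima
  slice_max(Sepal.Length, n = 1, with_ties = FALSE) %>%
  ungroup()

one_per_species
```

For iris the result is the same either way, because each species has a unique maximum Sepal.Length.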
In base R you could do:
merge(aggregate(Sepal.Length~Species, iris, max), iris)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.8 4.0 1.2 0.2
2 versicolor 7.0 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2.0
If we want to do the same with across, here is one option:
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~ .[which.max(Sepal.Length)]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4 1.2 0.2
2 versicolor 7 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2
Using the iris dataset as an example, I want to write a user-defined function that
runs a pairwise t-test on all 4 columns (excluding the Species column) for each data split
exports the results as 3 worksheets of an Excel file
See below for my attempt:
library(tidyr)
library(reshape) # for melting /stacking the data
library(multcomp) # for pairwise test
library(xlsx) # export excel file with worksheet
options(scipen = 100)
# dataset
iris
data_stats <- function(data){
# melt the dataframe
df <- melt(data, id.vars=c('Species'),var='group')
# split the dataframe into three list of dataframe
dfsplit<-split(df,df$column)
# pairwise t-test
results <- pairwise.t.test(dfsplit$value, dfsplit$group,p.adjust.method = "BH")
# export each result as a worksheet of an excel file
write.xlsx(results, file="Results.xlsx", sheetName="versicolor_stats", row.names=FALSE)
write.xlsx(results, file="Results.xlsx", sheetName="virginica_stats", append=TRUE, row.names=FALSE)
write.xlsx(results, file="Results.xlsx", sheetName="setosa_stats", append=TRUE, row.names=FALSE)
}
# testing the code on iris data
data_stats(iris)
Please comment and share your code. Thanks
Here is an option with tidyverse - reshape to 'long' format with pivot_longer, then use group_modify to do the pairwise.t.test , tidy the output and unnest the list output
library(dplyr)
library(tidyr)
library(broom)
ttest_out <- iris %>%
pivot_longer(cols = -Species) %>%
group_by(Species) %>%
group_modify(~ .x %>%
summarise(out = list(pairwise.t.test(value, name) %>%
tidy))) %>%
ungroup %>%
unnest(out)
-output
ttest_out
# A tibble: 18 × 4
Species group1 group2 p.value
<fct> <chr> <chr> <dbl>
1 setosa Petal.Width Petal.Length 1.77e- 54
2 setosa Sepal.Length Petal.Length 2.77e-132
3 setosa Sepal.Length Petal.Width 1.95e-156
4 setosa Sepal.Width Petal.Length 1.61e- 86
5 setosa Sepal.Width Petal.Width 1.13e-123
6 setosa Sepal.Width Sepal.Length 4.88e- 71
7 versicolor Petal.Width Petal.Length 5.35e- 90
8 versicolor Sepal.Length Petal.Length 3.78e- 52
9 versicolor Sepal.Length Petal.Width 5.02e-125
10 versicolor Sepal.Width Petal.Length 1.36e- 45
11 versicolor Sepal.Width Petal.Width 3.46e- 44
12 versicolor Sepal.Width Sepal.Length 1.25e- 95
13 virginica Petal.Width Petal.Length 1.39e- 90
14 virginica Sepal.Length Petal.Length 6.67e- 22
15 virginica Sepal.Length Petal.Width 3.47e-110
16 virginica Sepal.Width Petal.Length 2.35e- 68
17 virginica Sepal.Width Petal.Width 1.87e- 19
18 virginica Sepal.Width Sepal.Length 2.47e- 92
Update: The statistical part of this question (applying pairwise.t.test to the iris dataset) has been answered previously on SO. And here is another solution.
The accepted solution runs a series of pairwise t-tests and produces a column of p-values (exactly as the question asks), but it's not a very meaningful set of t-tests. You might suspect that from the fact that we see p-values like 2.77e-132 and that group1 and group2 are continuous variables, not the levels of a factor.
The hypotheses that these tests evaluate are whether, for each species separately, sepal is the same as petal and length is the same as width. The pairwise t-test procedure is designed to compare a single continuous variable (say sepal length) across all the levels of a factor (say species).
To begin with, let's apply pairwise.t.test to the column Sepal.Width, so that we can check later on that we get the right p-values.
library("broom")
library("tidyverse")
pairwise.t.test(iris$Sepal.Width, iris$Species)
#>
#> Pairwise comparisons using t tests with pooled SD
#>
#> data: iris$Sepal.Width and iris$Species
#>
#> setosa versicolor
#> versicolor < 2e-16 -
#> virginica 9.1e-10 0.0031
#>
#> P value adjustment method: holm
If you've ever seen the iris dataset, you know these p-values "make sense": Virginica & Versicolor are more similar to each other than to Setosa.
So now let's apply the tests in a tidy way to the four numeric columns.
t_pvals <- iris %>%
pivot_longer(
-Species,
names_to = "Variable",
values_to = "x"
) %>%
# The trick to performing the right tests is to group the tibble by Variable,
# not by Species because Species is the grouping variable for the t-tests.
group_by(
Variable
) %>%
group_modify(
~ tidy(pairwise.t.test(.x$x, .x$Species))
) %>%
ungroup()
t_pvals
#> # A tibble: 12 × 4
#> Variable group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 Petal.Length versicolor setosa 1.05e-68
#> 2 Petal.Length virginica setosa 1.23e-90
#> 3 Petal.Length virginica versicolor 1.81e-31
#> 4 Petal.Width versicolor setosa 2.51e-57
#> 5 Petal.Width virginica setosa 2.39e-85
#> 6 Petal.Width virginica versicolor 8.82e-37
#> 7 Sepal.Length versicolor setosa 1.75e-15
#> 8 Sepal.Length virginica setosa 6.64e-32
#> 9 Sepal.Length virginica versicolor 2.77e- 9
#> 10 Sepal.Width versicolor setosa 5.50e-17
#> 11 Sepal.Width virginica setosa 9.08e-10
#> 12 Sepal.Width virginica versicolor 3.15e- 3
The p-values for the Sepal.Width comparisons are at the bottom. We got the p-values we expected!
Next we format the p-values so that they are easier on the eyes.
t_pvals <- t_pvals %>%
mutate(
across(
# a lambda avoids passing extra arguments through across(), which newer dplyr deprecates
p.value, ~ rstatix::p_format(.x, accuracy = 0.05)
)
)
t_pvals
#> # A tibble: 12 × 4
#> Variable group1 group2 p.value
#> <chr> <chr> <chr> <chr>
#> 1 Petal.Length versicolor setosa <0.05
#> 2 Petal.Length virginica setosa <0.05
#> 3 Petal.Length virginica versicolor <0.05
#> 4 Petal.Width versicolor setosa <0.05
#> 5 Petal.Width virginica setosa <0.05
#> 6 Petal.Width virginica versicolor <0.05
#> 7 Sepal.Length versicolor setosa <0.05
#> 8 Sepal.Length virginica setosa <0.05
#> 9 Sepal.Length virginica versicolor <0.05
#> 10 Sepal.Width versicolor setosa <0.05
#> 11 Sepal.Width virginica setosa <0.05
#> 12 Sepal.Width virginica versicolor <0.05
And finally we save the results to a file.
t_pvals %>%
write_csv("pairwise-t-tests-on-iris-data.csv")
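The question asked for separate worksheets rather than a single CSV. A possible sketch, assuming the writexl package is installed (it is not used anywhere in the original posts): write_xlsx() accepts a named list of data frames and writes one sheet per element. The output path in tempdir() and the sheets and out_path names are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(broom)
library(writexl)  # assumption: not part of the original answer

# Recompute the unformatted p-values so the example is self-contained
t_pvals <- iris %>%
  pivot_longer(-Species, names_to = "Variable", values_to = "x") %>%
  group_by(Variable) %>%
  group_modify(~ tidy(pairwise.t.test(.x$x, .x$Species))) %>%
  ungroup()

# One data frame per measured variable; the list names become sheet names
sheets <- split(t_pvals, t_pvals$Variable)

out_path <- file.path(tempdir(), "pairwise-t-tests.xlsx")
write_xlsx(sheets, out_path)
```

Splitting by Variable gives four sheets; split on a different column if another layout is wanted.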
Let's say we want to calculate the means of sepal length based on tercile groups of sepal width.
We can use the split_quantile function from the fabricatr package and do the following:
iris %>%
group_by(split_quantile(Sepal.Width, 3)) %>%
summarise(Sepal.Length = mean(Sepal.Length))
So far so good. Now, let's say we want to group_by(Species, split_quantile(Sepal.Width, 3)) instead of just group_by(split_quantile(Sepal.Width, 3)).
However, what if we want the terciles to be calculated within each species rather than over the whole dataset?
Basically, what I'm looking for could be achieved by splitting iris into several dataframes based on Species, using split_quantile on those dataframes to calculate terciles and then joining the dataframes back together. However, I'm looking for a way to do this without splitting the dataframe.
You have more or less written the answer in your text: create a new variable for the tercile after grouping by Species, then regroup by both Species and Tercile.
library(tidyverse)
library(fabricatr)
iris %>%
group_by(Species) %>%
mutate(Tercile = split_quantile(Sepal.Width, 3)) %>%
group_by(Species, Tercile) %>%
summarise(Sepal.Length = mean(Sepal.Length))
#> # A tibble: 9 x 3
#> # Groups: Species [3]
#> Species Tercile Sepal.Length
#> <fct> <fct> <dbl>
#> 1 setosa 1 4.69
#> 2 setosa 2 5.08
#> 3 setosa 3 5.27
#> 4 versicolor 1 5.61
#> 5 versicolor 2 6.12
#> 6 versicolor 3 6.22
#> 7 virginica 1 6.29
#> 8 virginica 2 6.73
#> 9 virginica 3 6.81
Created on 2020-05-27 by the reprex package (v0.3.0)
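If installing fabricatr is not an option, a rough equivalent can be sketched with dplyr::ntile(), which is a swapped-in function rather than the one used above. ntile() assigns rows to equally sized buckets by rank, so it may place tied values differently from split_quantile() and the group means may differ slightly from the output shown:

```r
library(dplyr)

tercile_means <- iris %>%
  group_by(Species) %>%
  # bucket rows into 3 rank-based groups within each species
  mutate(Tercile = ntile(Sepal.Width, 3)) %>%
  group_by(Species, Tercile) %>%
  summarise(Sepal.Length = mean(Sepal.Length), .groups = "drop")

tercile_means
```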
I am trying to find a correlation of all variables within a grouping variable. Specifically I am trying to use purrr to do this to replace a loop that I've been using. But I've gotten a bit stuck, partially because I want to use two functions when applying over the vector of interest. For example:
## load packages
library(corrr)
library(dplyr)
library(purrr)
Without any groups this works fine (and this is the base of what I'd like to do):
iris %>%
select(-Species) %>%
correlate() %>%
stretch()
But I get stymied when I try to group this:
iris %>%
group_by(Species) %>%
correlate() %>%
stretch()
Error in stats::cor(x = x, y = y, use = use, method = method) : 'x'
must be numeric
So my thought is to use purrr... seems like the exact place where I use it right?
iris %>%
split(.$Species) %>%
map_dbl(~correlate) ## then how do i incorporate `stretch()`
Error: Can't coerce element 1 from a closure to a double
Obviously this is wrong but I am not sure exactly how I should apply map_* here...
This is the loop that I am trying to replace. It does give the correct output, but I'd prefer not to use it because it is less flexible than the purrr approach:
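# Store the levels under a name that does not clash with the Species column;
# inside filter(), Species[i] would otherwise index the column itself.
species_levels <- unique(iris$Species)
df <- c()
for(i in seq_along(species_levels)){
u <- iris %>%
filter(Species == species_levels[i]) %>%
select(-Species) %>%
correlate() %>%
stretch() %>%
mutate(Species = species_levels[i])
df <- rbind(df, u)
}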
df
# A tibble: 48 x 4
x y r Species
<chr> <chr> <dbl> <fctr>
1 Sepal.Length Sepal.Length NA setosa
2 Sepal.Length Sepal.Width 0.7425467 setosa
3 Sepal.Length Petal.Length 0.2671758 setosa
4 Sepal.Length Petal.Width 0.2780984 setosa
5 Sepal.Width Sepal.Length 0.7425467 setosa
6 Sepal.Width Sepal.Width NA setosa
7 Sepal.Width Petal.Length 0.1777000 setosa
8 Sepal.Width Petal.Width 0.2327520 setosa
9 Petal.Length Sepal.Length 0.2671758 setosa
10 Petal.Length Sepal.Width 0.1777000 setosa
So in sum, can someone outline how to use purrr when I need to use two functions. In other words, how do I replace the loop above?
You need more flexible summary syntax with group_by %>% do, where in do, you can access each subgroup with . and apply correlate and stretch just like a normal data frame:
library(corrr)
library(dplyr)
iris %>% group_by(Species) %>% do(
select(., -Species) %>% correlate() %>% stretch()
)
# A tibble: 48 x 4
# Groups: Species [3]
# Species x y r
# <fctr> <chr> <chr> <dbl>
# 1 setosa Sepal.Length Sepal.Length NA
# 2 setosa Sepal.Length Sepal.Width 0.7425467
# 3 setosa Sepal.Length Petal.Length 0.2671758
# 4 setosa Sepal.Length Petal.Width 0.2780984
# 5 setosa Sepal.Width Sepal.Length 0.7425467
# 6 setosa Sepal.Width Sepal.Width NA
# 7 setosa Sepal.Width Petal.Length 0.1777000
# 8 setosa Sepal.Width Petal.Width 0.2327520
# 9 setosa Petal.Length Sepal.Length 0.2671758
#10 setosa Petal.Length Sepal.Width 0.1777000
# ... with 38 more rows
With purrr, you can nest data under each group first and then map over it:
library(purrr)
library(tidyr)
library(dplyr)
iris %>%
group_by(Species) %>% nest() %>%
mutate(data = map(data, compose(stretch, correlate))) %>%
unnest(data)
# A tibble: 48 x 4
# Species x y r
# <fctr> <chr> <chr> <dbl>
# 1 setosa Sepal.Length Sepal.Length NA
# 2 setosa Sepal.Length Sepal.Width 0.7425467
# 3 setosa Sepal.Length Petal.Length 0.2671758
# 4 setosa Sepal.Length Petal.Width 0.2780984
# 5 setosa Sepal.Width Sepal.Length 0.7425467
# 6 setosa Sepal.Width Sepal.Width NA
# 7 setosa Sepal.Width Petal.Length 0.1777000
# 8 setosa Sepal.Width Petal.Width 0.2327520
# 9 setosa Petal.Length Sepal.Length 0.2671758
#10 setosa Petal.Length Sepal.Width 0.1777000
# ... with 38 more rows
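The split() attempt from the question can also be made to work. A minimal sketch, assuming corrr, dplyr and purrr are installed: use map() (which returns a list) instead of map_dbl(), chain both functions inside one lambda, and stack the pieces with bind_rows(). The name cors is illustrative.

```r
library(corrr)
library(dplyr)
library(purrr)

cors <- iris %>%
  split(.$Species) %>%                      # named list, one data frame per species
  map(~ .x %>%
        select(-Species) %>%
        correlate() %>%                     # correlation data frame
        stretch()) %>%                      # long format: x, y, r
  bind_rows(.id = "Species")                # list names become the Species column

cors
```

map_dbl() failed in the question because it expects each element to return a single number, while each element here is a whole data frame.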
I am trying to use dplyr to lag some variables (all of which have a common naming convention) for each group in my data set.
I thought mutate_if would work, but I get an error (below). mutate_each works, but for all columns rather than the select few.
For example, I was looking to lag only the Sepal measurements:
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
# mutate_each(funs(lag(.)))
mutate_if(contains("Sepal"), funs(lag(.)))
#> Error in get(as.character(FUN), mode = "function", envir = envir) : object 'p' of mode 'function' was not found
to get a final data set like:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 NA NA 1.4 0.2 setosa
# 2 5.1 3.5 1.4 0.2 setosa
# 3 4.9 3.0 1.3 0.2 setosa
# 4 NA NA 4.7 1.4 versicolor
# 5 7.0 3.2 4.5 1.5 versicolor
# 6 6.4 3.2 4.9 1.5 versicolor
# 7 NA NA 6.0 2.5 virginica
# 8 6.3 3.3 5.1 1.9 virginica
# 9 5.8 2.7 5.9 2.1 virginica
This seems to work,
library(dplyr)
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
mutate_if(grepl('Sepal', names(.)), funs(lag(.)))
As @aosmith explains, contains returns an index of the columns that match the string, whereas mutate_if relies on predicate functions that return logical vectors, which is why the grepl option works.
In addition, as @StevenBeaupre mentions, mutate_at with vars() accepts the contains helper directly:
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
mutate_at(vars(contains('Sepal')), lag)
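In current dplyr the scoped verbs (mutate_if, mutate_at, mutate_each) and funs() have been superseded by across(). A sketch of the same operation, assuming dplyr >= 1.0.0:

```r
library(dplyr)

lagged <- iris %>%
  as_tibble() %>%
  group_by(Species) %>%
  slice(1:3) %>%
  # across() takes tidyselect helpers such as contains() directly
  mutate(across(contains("Sepal"), lag)) %>%
  ungroup()

lagged
```

Because the data is still grouped when mutate() runs, the lag restarts within each species, giving NA in the first row of every group.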