I have a data frame with a lot of columns and want to summarise them with multiple functions.
library(dplyr)
library(tidyr)
test_df <- data.frame(Group = sample(c("A", "B", "C"), 10, TRUE), var1 = sample(1:5, 10, TRUE), var2 = sample(3:7, 10, TRUE))
test_df %>%
group_by(Group) %>%
summarise_all(c(Mean = mean, Sum = sum))
# A tibble: 3 x 5
Group var1_Mean var2_Mean var1_Sum var2_Sum
<chr> <dbl> <dbl> <int> <int>
1 A 3.14 5.14 22 36
2 B 4.5 4.5 9 9
3 C 4 6 4 6
This results in a tibble whose first column is Group and whose other column names combine the previous column name with the function name.
The desired result is a table with the previous column names in the first column and the groups and functions in the column names.
I can achieve this with
test_longer <- test_df %>% pivot_longer(cols = starts_with("var"), names_to = "var", values_to = "val")
# Add row number because spread needs unique identifiers for rows
test_longer <- test_longer %>%
group_by(Group) %>%
mutate(grouped_id = row_number())
spread(test_longer, Group, val) %>%
select(-grouped_id) %>%
group_by(var) %>%
summarise_all(c(Mean = mean, Sum = sum), na.rm = T)
# A tibble: 2 x 7
var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
<chr> <dbl> <dbl> <dbl> <int> <int> <int>
1 var1 3.14 4.5 4 22 9 4
2 var2 5.14 4.5 6 36 9 6
But this seems to be a rather long detour... There probably is a better way, but I could not find it. Any suggestions? Thank you
There are lots of ways to go about it, but I would simplify it by pivoting to a longer data frame first and then grouping by var and Group. Then you can just pivot wider to get the final result you want. Note that I used summarize(across()), which replaces the superseded summarize_all(), even though with a single value column I could have just specified Mean = ... and Sum = ... manually.
set.seed(123)
test_df %>%
pivot_longer(
var1:var2,
names_to = "var"
) %>%
group_by(Group, var) %>%
summarize(
across(
everything(),
list(Mean = mean, Sum = sum),
.names = "{.fn}"
),
.groups = "drop"
) %>%
pivot_wider(
names_from = "Group",
values_from = c(Mean, Sum),
names_glue = "{Group}_{.value}"
)
#> # A tibble: 2 × 7
#> var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 var1 1 2.5 3.2 1 10 16
#> 2 var2 5 4.5 4.4 5 18 22
I have a dataframe with 0-3 rows depending on the underlying data. Here is an example with 2 rows:
df <- tibble(ID = c(1, 1), v = c(1, 2))
ID v
<dbl> <dbl>
1 1 1
2 1 2
I now want to convert each row of v into a separate column. As I have 3 rows at maximum, the result should look like this:
ID v1 v2 v3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 2
What's the best way to achieve this? Thanks!
Perhaps this helps
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(nm = str_c("v", 2:3)) %>%
complete(ID, nm = str_c("v", 1:3)) %>%
pivot_wider(names_from = nm, values_from = v)
Update, at the OP's request (see comments). This uses max_n <- 3 as defined in the first answer below:
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),]) %>%
arrange(!is.na(v), v) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = row,
values_from = v,
names_glue = "v_{.name}")
ID v_1 v_2 v_3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 2
First answer:
Maybe something like this.
What we are doing here is:
define the maximum size of a group (in this case it is 3)
then fill up each group to that maximum of 3 by adding NA rows
for naming, add a row_number() column and use pivot_wider with its arguments:
library(dplyr)
library(tidyr)
max_n <- 3
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),]) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = row,
values_from = v,
names_glue = "v_{.name}")
ID v_1 v_2 v_3
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 NA
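The fill-up step relies on the fact that indexing past the end of an object returns NA. A minimal base R sketch of that idea on a plain vector, for a single ID (the names padded and wide are made up for illustration and are not part of the answer):

```r
# Values of v for one ID; max_n is the maximum group size, as in the answer.
v <- c(1, 2)
max_n <- 3

# Indexing beyond length(v) yields NA, which pads the vector to max_n entries.
padded <- v[seq_len(max_n)]

# Name the entries v_1, v_2, v_3 and turn them into a one-row data frame.
wide <- as.data.frame(as.list(setNames(padded, paste0("v_", seq_len(max_n)))))
wide
```

This only shows the padding for one group; the dplyr pipeline above does the same per ID and keeps the ID column.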
I have a 1-by-4 table that contains summary statistics of two variables. For example,
df <- data.frame(
x_min=1,
x_max=2,
y_min=3,
y_max=4)
df
x_min x_max y_min y_max
1 1 2 3 4
I'd like to shape it into a 2-by-2 format:
x y
min 1 3
max 2 4
I'm able to get the result by the following code:
df %>%
pivot_longer(everything(),names_to = 'stat',values_to = 'val') %>%
separate(stat,into = c('var','stat'),sep = '_') %>%
pivot_wider(names_from = var, values_from = val)
However, I feel that this is a bit too circuitous because it first converts df into a table that's way too "long", and then "widens" it back to the appropriate size.
Is there a way to use pivot_longer() to get directly to the final result (that is, without involving pivot_wider())?
You could do:
df <- data.frame(
x_min=1,
x_max=2,
y_min=3,
y_max=4)
tidyr::pivot_longer(df, everything(), names_to = c(".value", "name"), names_sep = "_")
#> # A tibble: 2 × 3
#> name x y
#> <chr> <dbl> <dbl>
#> 1 min 1 3
#> 2 max 2 4
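To see what names_sep = "_" together with ".value" is doing, it may help to reproduce the split by hand: the part before the underscore becomes the output column, the part after it becomes the row label. A rough base R sketch of the same bookkeeping (parts and out are illustrative names, and this is only a cross-check, not a replacement for pivot_longer):

```r
df <- data.frame(x_min = 1, x_max = 2, y_min = 3, y_max = 4)

# Split each column name at "_" into (variable, statistic).
parts <- do.call(rbind, strsplit(names(df), "_", fixed = TRUE))

# Cross-tabulate the single row of values by statistic (rows) and variable (columns).
out <- tapply(unlist(df), list(stat = parts[, 2], var = parts[, 1]), identity)
out
```

Note that tapply orders the rows alphabetically (max before min), whereas pivot_longer keeps the order of first appearance.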
library(tidyverse)
df %>%
pivot_longer(everything(), names_to = c('.value', 'rowname'), names_sep = '_')%>%
column_to_rownames()
x y
min 1 3
max 2 4
The dataframe looks like this
df = data.frame(name = c("A","B","C"),
exam1 = c(2,6,4),
exam2 = c(3,5,6),
exam3 = c(5,3,3),
exam4 = c(1,NA,5))
I want to extract the top 3 exam scores for each 'name' and find their average using apply() or dplyr rowwise() functions.
With apply, use MARGIN = 1 to loop over the rows of the numeric columns. In base R, sort each row, take the head (or tail, depending on whether decreasing is TRUE or FALSE) and return its mean:
apply(df[-1], 1, FUN = function(x) mean(head(sort(x, decreasing = TRUE), 3)))
[1] 3.333333 4.666667 5.000000
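As a worked example of what happens inside the function for one row, here is row B, which contains an NA. The variable names (scores_b, sorted_b, top3_mean_b) are just for illustration:

```r
# Row B from the example: exam4 is missing.
scores_b <- c(exam1 = 6, exam2 = 5, exam3 = 3, exam4 = NA)

# sort() drops NA by default (na.last = NA), so only 6, 5, 3 remain.
sorted_b <- sort(scores_b, decreasing = TRUE)

# head(..., 3) keeps the three largest scores; their mean is (6 + 5 + 3) / 3.
top3_mean_b <- mean(head(sorted_b, 3))
top3_mean_b
```

Because sort() already removes the NA, the mean of row B is computed from the three remaining scores without any extra NA handling.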
Or with dplyr/rowwise
library(dplyr)
df %>%
rowwise %>%
mutate(Mean = mean(head(sort(c_across(where(is.numeric)),
decreasing = TRUE), 3))) %>%
ungroup
# A tibble: 3 × 6
name exam1 exam2 exam3 exam4 Mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2 3 5 1 3.33
2 B 6 5 3 NA 4.67
3 C 4 6 3 5 5
Here is an alternative approach that pivots and uses top_n. This keeps only the top 3 scores per name:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-name,
names_to = "exam",
values_to = "value"
) %>%
group_by(name) %>%
top_n(3, value) %>%
mutate(mean = mean(value)) %>%
pivot_wider(
names_from = exam,
values_from = value
)
name mean exam1 exam2 exam3 exam4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 3.33 2 3 5 NA
2 B 4.67 6 5 3 NA
3 C 5 4 6 NA 5
OR:
library(tidyr)
df %>%
pivot_longer(
-name,
names_to = "exam",
values_to = "value"
) %>%
group_by(name) %>%
top_n(3, value) %>%
summarise(mean = mean(value))
name mean
<chr> <dbl>
1 A 3.33
2 B 4.67
3 C 5
Using purrr::pmap_dfr:
library(tidyverse)
df = data.frame(name = c("A","B","C"),
exam1 = c(2,6,4),
exam2 = c(3,5,6),
exam3 = c(5,3,3),
exam4 = c(1,NA,5))
df %>%
pmap_dfr(~ list(means = mean(sort(c(..2,..3,..4,..5), decreasing=T)[1:3]))) %>%
bind_cols(df,.)
#> name exam1 exam2 exam3 exam4 means
#> 1 A 2 3 5 1 3.333333
#> 2 B 6 5 3 NA 4.666667
#> 3 C 4 6 3 5 5.000000
Another possible solution, based on tidyr::pivot_longer and without using rowwise:
library(tidyverse)
df = data.frame(name = c("A","B","C"),
exam1 = c(2,6,4),
exam2 = c(3,5,6),
exam3 = c(5,3,3),
exam4 = c(1,NA,5))
df %>%
pivot_longer(cols = 2:5, names_to = "names") %>%
group_by(name) %>%
slice_max(value, n=3) %>%
summarise(mean = mean(value)) %>%
inner_join(df)
#> Joining, by = "name"
#> # A tibble: 3 × 6
#> name mean exam1 exam2 exam3 exam4
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 3.33 2 3 5 1
#> 2 B 4.67 6 5 3 NA
#> 3 C 5 4 6 3 5
I went back to the question and tried using basic dplyr manipulation of 'df' which also works, much like some of the really helpful solutions in earlier posts.
df_long <- df %>%
pivot_longer(cols = -name,
names_to = "exam",
values_to = "score")
df_long %>%
group_by(name) %>%
arrange(desc(score)) %>%
slice(1:3) %>%
summarise(mean_score = mean(score))
@Paul Smith, nice idea to add inner_join(df).
I would take @akrun's answer and add the na.rm parameter, in case a future variant of this approach lets NA values reach mean().
The final code would be:
df <- data.frame(name = c("A","B","C"),
exam1 = c(2,6,4),
exam2 = c(3,5,6),
exam3 = c(5,3,3),
exam4 = c(1,NA,5))
results <- apply(df[-1], 1, FUN = function(x) mean(
head(sort(x, decreasing = TRUE), 3),
na.rm=TRUE))
names(results) <- df$name
results
The results should look like this:
> results
A B C
3.333333 4.666667 5.000000
I would like to perform multiple pairwise t-tests on a dataset containing about 400 different column variables and 3 subject groups, and extract p-values for every comparison. A shorter representative example of the data, using only 2 variables, could be the following:
df <- tibble(var1 = rnorm(90, 1, 1), var2 = rnorm(90, 1.5, 1), group = rep(1:3, each = 30))
Ideally the end result will be a summarised data frame containing four columns: one for the variable being tested (var1, var2 etc.), two for the groups being compared each time and a final one for the p-value.
I've tried duplicating the group column in the long form and doing a double group_by in order to do the comparisons, but without success:
result <- df %>%
pivot_longer(var1:var2, "var", "value") %>%
rename(group_a = group) %>%
mutate(group_b = group_a) %>%
group_by(group_a, group_b) %>%
summarise(n = n())
We can reshape the data into 'long' format with pivot_longer, then, grouped by 'group', apply pairwise.t.test, convert the result into a tibble with tidy (from broom), wrap it in a list column and unnest that column:
library(dplyr)
library(tidyr)
library(broom)
df %>%
pivot_longer(cols = -group, names_to = 'grp') %>%
group_by(group) %>%
summarise(out = list(pairwise.t.test(value, grp
) %>%
tidy)) %>%
unnest(c(out))
-output
# A tibble: 3 x 4
group group1 group2 p.value
<int> <chr> <chr> <dbl>
1 1 var2 var1 0.0760
2 2 var2 var1 0.0233
3 3 var2 var1 0.000244
In case you end up wanting more information about the t-tests, here is an approach that will allow you to extract more information such as the degrees of freedom and value of the test statistic:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df <- tibble(
var1 = rnorm(90, 1, 1),
var2 = rnorm(90, 1.5, 1),
group = rep(1:3, each = 30)
)
df %>%
select(-group) %>%
names() %>%
map_dfr(~ {
y <- .
combn(3, 2) %>%
t() %>%
as.data.frame() %>%
pmap_dfr(function(V1, V2) {
df %>%
select(group, all_of(y)) %>%
filter(group %in% c(V1, V2)) %>%
t.test(as.formula(sprintf("%s ~ group", y)), ., var.equal = TRUE) %>%
tidy() %>%
transmute(y = y,
group_1 = V1,
group_2 = V2,
df = parameter,
t_value = statistic,
p_value = p.value
)
})
})
#> # A tibble: 6 x 6
#> y group_1 group_2 df t_value p_value
#> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 var1 1 2 58 -0.337 0.737
#> 2 var1 1 3 58 -1.35 0.183
#> 3 var1 2 3 58 -1.06 0.295
#> 4 var2 1 2 58 -0.152 0.879
#> 5 var2 1 3 58 1.72 0.0908
#> 6 var2 2 3 58 1.67 0.100
And here is @akrun's answer tweaked to give the same p-values as the above approach. Note the p.adjust.method = "none", which gives unadjusted pairwise t-tests and will therefore inflate your Type I error rate across the comparisons.
df %>%
pivot_longer(
cols = -group,
names_to = "y"
) %>%
group_by(y) %>%
summarise(
out = list(
tidy(
pairwise.t.test(
value,
group,
p.adjust.method = "none",
pool.sd = FALSE
)
)
)
) %>%
unnest(c(out))
#> # A tibble: 6 x 4
#> y group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 var1 2 1 0.737
#> 2 var1 3 1 0.183
#> 3 var1 3 2 0.295
#> 4 var2 2 1 0.879
#> 5 var2 3 1 0.0909
#> 6 var2 3 2 0.100
Created on 2021-07-30 by the reprex package (v1.0.0)
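To illustrate the note on p.adjust.method, here is a small sketch of what an adjustment would do to the three unadjusted var1 p-values from the output above. pairwise.t.test uses "holm" by default; the names p_raw, p_holm and p_bonf are only for this example:

```r
# Unadjusted p-values for var1, copied from the output above.
p_raw <- c(`2 vs 1` = 0.737, `3 vs 1` = 0.183, `3 vs 2` = 0.295)

# Holm adjustment (the default method of pairwise.t.test) within that family.
p_holm <- p.adjust(p_raw, method = "holm")
p_holm

# Bonferroni multiplies by the number of tests and caps the result at 1.
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf
```

Both adjustments are applied within one family of comparisons (here, the three group pairs for a single variable), so with about 400 variables you may also want to think about adjusting across variables.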
My code is messy.
If a group's sum is smaller than two, its name should become "unpopular".
df <- data.frame(vote=c("A","A","A","B","B","B","B","B","B","C","D"),
val=c(rep(1,11))
)
df %>% group_by(vote) %>% summarise(val=sum(val))
Output:
vote val
<fct> <dbl>
1 A 3
2 B 6
3 C 1
4 D 1
But I need
vote val
<fct> <dbl>
1 A 3
2 B 6
3 unpopular 2
My idea is
df2 <- df %>% group_by(vote) %>% summarise(val=sum(val))
df2$vote[df2$val < 2] <- "unpop"
df2 %>% group_by....
It's not elegant.
Do you know of a neat, helpful function for this?
We can do a double grouping
library(dplyr)
df %>%
group_by(vote) %>%
summarise(val=sum(val)) %>%
group_by(vote = replace(vote, val <2, 'unpop')) %>%
summarise(val = sum(val))
-output
# A tibble: 3 x 2
# vote val
# <chr> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
Or another option with rowsum
df %>%
group_by(vote = replace(vote, vote %in%
names(which((rowsum(val, vote) < 2)[,1])), 'unpopular')) %>%
summarise(val = sum(val))
Or using fct_lump_n from forcats
library(forcats)
df %>%
group_by(vote = fct_lump_n(vote, 2, other_level = "unpop")) %>%
summarise(val = sum(val))
# A tibble: 3 x 2
# vote val
# <fct> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
Or using table
df %>%
group_by(vote = replace(vote,
vote %in% names(which(table(vote) < 2)), 'unpop')) %>%
summarise(val = sum(val))
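One thing to keep in mind: table(vote) and fct_lump_n() (without weights) look at how often each vote occurs, while rowsum(val, vote) and the double grouping look at the sum of val. They agree here only because every val is 1. A small sketch with made-up data (df_w and the other names are hypothetical) shows where they differ:

```r
# Hypothetical data where val is not always 1.
df_w <- data.frame(vote = c("A", "A", "C", "D"), val = c(1, 1, 5, 1))

# Number of rows per vote: A = 2, C = 1, D = 1.
counts <- table(df_w$vote)

# Sum of val per vote: A = 2, C = 5, D = 1.
sums <- rowsum(df_w$val, df_w$vote)[, 1]

# Levels that would be relabelled under each criterion.
rare_by_count <- names(which(counts < 2))
rare_by_sum <- names(which(sums < 2))
rare_by_count
rare_by_sum
```

Here C would be lumped when going by row counts but kept when going by the sum of val, so pick the variant that matches what "unpopular" should mean for your data.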
If you want to relabel vote based on the sum of val, in base R you can do this:
aggregate(val~vote, transform(aggregate(val~vote, df, sum),
vote = replace(vote, val < 2, 'unpop')), sum)
# vote val
#1 A 3
#2 B 6
#3 unpop 2