dplyr summarize: how to include all table columns in the output table - r

I have the following dataset:
# Dataset
x<-tbl_df(data.frame(locus=c(1,2,2,3,4,4,5,5,5,6),v=c(1,1,2,1,1,2,1,2,3,1),rpkm=rnorm(10,10)))
If I use the following command:
# Subset
x%>%group_by(locus)%>%summarize(max(rpkm))
I obtained
locus max(rpkm)
1 9.316949
2 10.273270
3 9.879886
4 10.944641
5 10.837681
6 13.450680
While I'd like to obtain
locus v max(rpkm)
1 1 9.316949
2 1 10.273270
3 1 9.879886
4 2 10.944641
5 1 10.837681
6 1 13.450680
So I'd like the output table to also include the "v" value from the row that holds the maximum rpkm.
Is that possible?

Try:
x %>% group_by(locus) %>%
summarize(max(rpkm), v = v[which.max(rpkm)]) # which.max() returns a single index even when values tie

You can use the top_n function instead
# with set.seed(15)
x %>% group_by(locus) %>% top_n(1, rpkm)
# locus v rpkm
# 1 1 1 10.258823
# 2 2 1 11.831121
# 3 3 1 10.897198
# 4 4 1 10.488016
# 5 5 2 11.090773
# 6 6 1 8.924999
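In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). A minimal sketch of the same idea, assuming a recent dplyr; the name res is only illustrative, and with_ties = FALSE is a choice made here to guarantee one row per locus:

```r
library(dplyr)

set.seed(15)
x <- tibble(
  locus = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6),
  v     = c(1, 1, 2, 1, 1, 2, 1, 2, 3, 1),
  rpkm  = rnorm(10, 10)
)

# Keep the whole row with the largest rpkm in each locus,
# so every other column (here v) comes along automatically.
res <- x %>%
  group_by(locus) %>%
  slice_max(rpkm, n = 1, with_ties = FALSE) %>%
  ungroup()

res
```

Leaving with_ties at its default (TRUE) would keep every row that shares the group maximum, which matches what the filter(rpkm == max(rpkm)) approach does.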

Try this:
x %>% group_by(locus) %>% filter(rpkm==max(rpkm))

I assume you're looking for a way to avoid typing all of the column names by hand. You can achieve that by using across within summarize, like so:
iris %>%
group_by(Species) %>%
dplyr::summarize(
across(everything()),
mean_l = mean(Sepal.Length)
) %>%
head()
# A tibble: 6 × 6
# Groups: Species [1]
Species Sepal.Length Sepal.Width Petal.Length Petal.Width mean_l
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2 5.01
2 setosa 4.9 3 1.4 0.2 5.01
3 setosa 4.7 3.2 1.3 0.2 5.01
4 setosa 4.6 3.1 1.5 0.2 5.01
5 setosa 5 3.6 1.4 0.2 5.01
6 setosa 5.4 3.9 1.7 0.4 5.01
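Note that in dplyr 1.1.0 and later, a summarize() call that returns more than one row per group, and an across() call without a function, may both emit deprecation warnings. If the aim is simply to keep every row and column while adding a per-group statistic, a grouped mutate() is one alternative. A minimal sketch (the name res is illustrative):

```r
library(dplyr)

# mutate() keeps all rows and columns; the group mean is
# repeated on every row of its group.
res <- iris %>%
  group_by(Species) %>%
  mutate(mean_l = mean(Sepal.Length)) %>%
  ungroup()

head(res)
```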

Related

Create a column in the original dataset to indicate whether the row was drawn in a random stratified sample

I would like to draw a stratified random sample (n = 375) from a dataset. Based on the stratified random sample, I would like to add a column to the original dataset indicating whether the row is in the stratified random sample (1) or not (0).
iris <- iris
# Get a random stratified sample
library(tidyverse)
stratified <- iris %>%
group_by(Species) %>%
sample_n(size=1)
# The final result I would like to get:
iris$sample3 <- 0
iris[21,6] <- 1
iris[65,6] <- 1
iris[106,6] <- 1
After doing that, I would like to repeat the procedure by drawing a second stratified random sample (n = 125) from my first stratified random sample (n = 375) and repeat the creation of a column.
You can add a column to your data frame that has the required number of 1s per group (and 0 otherwise).
set.seed(1)
samples <- 1
sample1 <- iris %>%
group_by(Species) %>%
mutate(sampled = as.numeric(row_number() %in% sample(n(), samples)))
sample1
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0
#> 2 4.9 3 1.4 0.2 setosa 0
#> 3 4.7 3.2 1.3 0.2 setosa 0
#> 4 4.6 3.1 1.5 0.2 setosa 1
#> 5 5 3.6 1.4 0.2 setosa 0
#> 6 5.4 3.9 1.7 0.4 setosa 0
#> 7 4.6 3.4 1.4 0.3 setosa 0
#> 8 5 3.4 1.5 0.2 setosa 0
#> 9 4.4 2.9 1.4 0.2 setosa 0
#> 10 4.9 3.1 1.5 0.1 setosa 0
#> # ... with 140 more rows
To get the sampled values, simply filter to find the 1s:
sample1 %>% filter(sampled == 1)
#> # A tibble: 3 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 4.6 3.1 1.5 0.2 setosa 1
#> 2 5.6 3 4.1 1.3 versicolor 1
#> 3 6.3 3.3 6 2.5 virginica 1
Created on 2022-05-16 by the reprex package (v2.0.1)
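The question also asks for a second sample drawn from within the first one. One possible way to extend the idea above is to regroup by the first indicator and sample again, flagging only rows that were already selected. This is a sketch under assumptions: the sizes n1 and n2 and the column names sample1 and sample2 are hypothetical, chosen small enough to fit iris.

```r
library(dplyr)

set.seed(1)
n1 <- 10  # hypothetical first-stage sample size per Species
n2 <- 3   # hypothetical second-stage sample size per Species

staged <- iris %>%
  group_by(Species) %>%
  # Stage 1: flag n1 random rows per Species
  mutate(sample1 = as.numeric(row_number() %in% sample(n(), n1))) %>%
  # Stage 2: within each Species, draw n2 rows among those flagged in stage 1
  group_by(Species, sample1) %>%
  mutate(sample2 = as.numeric(sample1 == 1 &
                              row_number() %in% sample(n(), min(n2, n())))) %>%
  ungroup()

# Cross-tabulate the two indicators per Species
count(staged, Species, sample1, sample2)
```

Every row with sample2 equal to 1 also has sample1 equal to 1, so the second sample is nested inside the first.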

save mutate output under dplyr

I'm computing the frequency by group under dplyr. But the output is not automatically saved as a dataframe and only shows the first 10 rows. Does anyone know how to do that? I need to use all rows of data for further analyses. THANKS!
library(dplyr)
data01 %>%
group_by(Country, relsta) %>%
summarize(Freq=n()) %>%
mutate (married = Freq/sum(Freq))
Output
Country relsta Freq married
<int> <chr> <int> <dbl>
1 1 1 15 0.176
2 1 3 1 0.0118
3 1 4 28 0.329
4 1 5 6 0.0706
5 1 6 22 0.259
6 1 7 1 0.0118
7 1 99 12 0.141
8 2 NA 273 1
9 3 NA 129 1
10 4 2 9 0.0796
# ... with 115 more rows
dplyr returns tibbles, which print only the first 10 rows by default; the remaining rows are still there, just hidden from you. Here is an example using iris:
library(dplyr)
res1 <- iris %>%
group_by(Sepal.Length, Species) %>%
summarize(Freq=n()) %>%
mutate(foo = Freq/sum(Freq))
res1
# Sepal.Length Species Freq foo
# <dbl> <fct> <int> <dbl>
# 1 4.3 setosa 1 1
# 2 4.4 setosa 3 1
# 3 4.5 setosa 1 1
# 4 4.6 setosa 4 1
# 5 4.7 setosa 2 1
# 6 4.8 setosa 5 1
# 7 4.9 setosa 4 0.667
# 8 4.9 versicolor 1 0.167
# 9 4.9 virginica 1 0.167
# 10 5 setosa 8 0.8
# # … with 47 more rows
Notice the … with 47 more rows. You may also check the dimensions:
dim(res1)
# [1] 57 4
Also,
class(res1)
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
whereas:
class(iris)
# [1] "data.frame"
To see more rows, convert with as.data.frame(). If the data is very large, rows still get omitted; you can raise that limit with e.g. options(max.print=3000). (The default is 99999 entries in base R; RStudio typically sets it to 1000.)
as.data.frame(res1)
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 55 7.6 virginica 1 1.0000000
# 56 7.7 virginica 4 1.0000000
# 57 7.9 virginica 1 1.0000000
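If the goal is only to keep the result and look at every row, another option is to assign the pipeline to a name and print the tibble with n = Inf. A minimal sketch using iris in place of data01; the names freq and share are illustrative:

```r
library(dplyr)

# Assigning the pipeline stores the complete result, not just what is printed
freq <- iris %>%
  count(Sepal.Length, Species, name = "Freq") %>%
  group_by(Sepal.Length) %>%
  mutate(share = Freq / sum(Freq)) %>%
  ungroup()

# n = Inf asks the tibble print method to show all rows
print(freq, n = Inf)
```

The object freq holds all rows regardless of how many are displayed, so it can be passed directly to further analyses.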
You could also consider using base R. Since the following line already gives you the "Freq" column,
as.data.frame.table(with(iris, table(Sepal.Length, Species)))
you could do this:
res2 <- with(iris, table(Sepal.Length, Species)) |>
as.data.frame.table() |>
transform(foo=ave(Freq, Sepal.Length, FUN=\(x) x/sum(x))) |>
subset(Freq > 0)
res2
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 103 7.6 virginica 1 1.0000000
# 104 7.7 virginica 4 1.0000000
# 105 7.9 virginica 1 1.0000000
Where:
dim(res2)
# [1] 57 4
class(res2)
# [1] "data.frame"
Note: R >= 4.1 used
The summarize function always returns just one row per group. mutate will keep all the rows here. Try:
library(dplyr)
data02 = data01 %>%
group_by(Country, relsta) %>%
mutate(Freq=n()) %>%
mutate (married = Freq/sum(Freq))

How to take non-missing value associated with max index for each group using summarize_all

I want to find the non-missing value of each group associated with the largest index value, for many columns.
I have gotten fairly close by using summarize_all with which.max, but I am not sure how to remove the NAs from each vector before I find the latest value. I read about using na.rm in summarize_all with functions like mean, but I am not sure how to get similar functionality without a built-in function. I have tried na.omit, but it doesn't provide the solution I'm looking for.
a <- head(iris, 10)
a$num <- 1:10
a$grp <- c("a","a","a","b","b","c","c","d","d","d")
a[10, "Species"] <- NA
a %>%
group_by(grp) %>%
summarize_all(funs(na.omit(.)[which.max(num)]))
grp Sepal.Length Sepal.Width Petal.Length Petal.Width Species num
<chr> <dbl> <dbl> <dbl> <dbl> <fct> <int>
1 a 4.70 3.20 1.30 0.200 setosa 3
2 b 5.00 3.60 1.40 0.200 setosa 5
3 c 4.60 3.40 1.40 0.300 setosa 7
4 d 4.90 3.10 1.50 0.100 NA 10
I expect all the values in the Species column to be setosa, however the last value is NA.
Instead of looking at all num, we may look only at those where the corresponding variable is not NA:
a %>%
group_by(grp) %>%
summarize_all(funs(na.omit(.)[which.max(num[!is.na(.)])]))
# A tibble: 4 x 7
# grp Sepal.Length Sepal.Width Petal.Length Petal.Width Species num
# <chr> <dbl> <dbl> <dbl> <dbl> <fct> <int>
# 1 a 4.7 3.2 1.3 0.2 setosa 3
# 2 b 5 3.6 1.4 0.2 setosa 5
# 3 c 4.6 3.4 1.4 0.3 setosa 7
# 4 d 4.9 3.1 1.5 0.1 setosa 10
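funs() and summarize_all() are deprecated or superseded in current dplyr. A rough equivalent with across() might look like the sketch below; it assumes dplyr 1.0.0 or later, and the name res is illustrative. The num column is handled separately here so that the lambda always sees the original num values.

```r
library(dplyr)

a <- head(iris, 10)
a$num <- 1:10
a$grp <- c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d")
a[10, "Species"] <- NA

res <- a %>%
  group_by(grp) %>%
  summarise(
    # For each column, drop its NAs, then pick the value whose num is
    # largest among the rows where that column is not NA.
    across(-num, ~ .x[!is.na(.x)][which.max(num[!is.na(.x)])]),
    num = max(num)
  )

res
```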
If you prefer a data.table approach, you can try the following (note that it only filters on Species being non-missing):
library(data.table)
a <- data.table(a)
a[!is.na(Species), .SD[which.max(num)], by = grp]
You could also approach this a little differently and complete the NA case first:
library(tidyverse)
a %>% group_by(grp) %>%
fill(Species) %>%
filter(num == max(num))
# A tibble: 4 x 7
# Groups: grp [4]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species num grp
<dbl> <dbl> <dbl> <dbl> <fct> <int> <chr>
1 4.7 3.2 1.3 0.2 setosa 3 a
2 5 3.6 1.4 0.2 setosa 5 b
3 4.6 3.4 1.4 0.3 setosa 7 c
4 4.9 3.1 1.5 0.1 setosa 10 d

purrr::reduce/reduce2 or mapped mutate_at()? - functions applied to respective column

I have a map of functions I want to apply to their respective columns.
Is there something like a mapped mutate_at?
my_map <-
data_frame(col = names(iris)[-5],
calc = rep(c("floor", "ceiling"), 2))
my_map
# A tibble: 4 x 2
col calc
<chr> <chr>
Sepal.Length floor
Sepal.Width ceiling
Petal.Length floor
Petal.Width ceiling
Failed attempt:
tbl_df(iris) %>% mutate_at(vars(my_map$col), funs_(my_map$calc))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_floor Sepal.Width_floor Petal.Length_floor Petal.Width_floor Sepal.Length_ceiling
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
5.1 3.5 1.4 0.2 setosa 5 3 1 0 6
4.9 3 1.4 0.2 setosa 4 3 1 0 5
Desired output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
5.0 4.0 1.0 1.0 setosa
4.0 3.0 1.0 1.0 setosa
One last thing: my_map$calc may contain functions that are not known in advance.
For example, someone could change the last "floor" to "round".
I don't think there's a straightforward way to do this with the dplyr::mutate_* functions. One workaround is to use the reduce (or reduce2) function and mutate each column with its corresponding transform function, one by one:
library(tidyverse)
reduce2(.x = my_map$col,
.y = my_map$calc,
.f = function(df, col, f) mutate_at(df, vars(col), f),
.init = iris) %>% head(2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
Here is a way to use map2 to replace each column.
library(tidyverse)
iris2 <- iris
iris2[, -5] <- map2(my_map$calc, my_map$col, function(x, y){
x2 <- eval(parse(text = x))
y2 <- iris2[[y]]
result <- x2(y2)
return(result)
})
head(iris2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
# 3 4 4 1 1 setosa
# 4 4 4 1 1 setosa
# 5 5 4 1 1 setosa
# 6 5 4 1 1 setosa
We could start from my_map :
library(tidyverse)
map2(my_map$col,my_map$calc,~transmute_at(iris,.x,.y)) %>%
bind_cols(iris[!names(iris) %in% my_map$col]) %>% # or less general: iris[-5]
head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
# 3 4 4 1 1 setosa
# 4 4 4 1 1 setosa
# 5 5 4 1 1 setosa
# 6 5 4 1 1 setosa
If we assume that all variables that should get the floor function share a suffix (here Length), and all variables that should get the ceiling function share another suffix (here Width), then we can apply the following code:
library(tidyverse)
iris %>%
mutate_at(vars(ends_with("Length")), funs(floor)) %>%
mutate_at(vars(ends_with("Width")), funs(ceiling))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 4 1 1 setosa
# 2 4 3 1 1 setosa
Although verbose, I find the following very readable and a simple implementation of map:
iris2 <- iris %>%
mutate(id = 1:n()) %>%
gather(key = col, value, my_map$col ) %>%
full_join(my_map, by = "col") %>%
mutate(value = invoke_map(.f = calc, .x = value)) %>%
unnest() %>%
select(-calc) %>%
spread(col, value) %>%
select(-id)
head(iris2)
# Species Petal.Length Petal.Width Sepal.Length Sepal.Width
# 1 setosa 1 1 5 4
# 2 setosa 1 1 4 3
# 3 setosa 1 1 4 4
# 4 setosa 1 1 4 4
# 5 setosa 1 1 5 4
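With dplyr 1.0.0 or later, another possibility is to look up each column's function by name inside across() using cur_column(). A minimal sketch, assuming every entry of my_map$calc names a function that match.fun() can find; fns and iris2 are illustrative names:

```r
library(dplyr)

my_map <- tibble(
  col  = names(iris)[-5],
  calc = rep(c("floor", "ceiling"), 2)
)

# Named list: column name -> function object resolved from its name
fns <- setNames(lapply(my_map$calc, match.fun), my_map$col)

iris2 <- iris %>%
  # cur_column() gives the name of the column currently being processed
  mutate(across(all_of(my_map$col), ~ fns[[cur_column()]](.x)))

head(iris2, 2)
```

Because the functions are resolved by name, changing an entry of my_map$calc to "round" should need no other change to the code.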

R: dplyr for loop to get summaries by column

How do you write a dplyr for loop that will provide summaries for each column of a data.table object?
Let's examine a toy example to help illustrate what I'm trying to achieve and what I have tried. We have 5 variables:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
I can get a summary of counts by distinct Sepal.Length like so:
iris %>%
group_by(Sepal.Length) %>%
summarise(no_rows = length(Sepal.Length))
# A tibble: 35 x 2
Sepal.Length no_rows
<dbl> <int>
1 4.30 1
2 4.40 3
3 4.50 1
4 4.60 4
5 4.70 2
6 4.80 5
7 4.90 6
8 5.00 10
9 5.10 9
10 5.20 4
# ... with 25 more rows
I would like to write the above in a for loop that loops through each of the 5 variables in the data frame. I started by replacing Sepal.Length above with: paste(names(iris)[1]).
iris %>%
group_by( paste(names(iris)[1]) ) %>%
summarise(no_rows = length( paste(names(iris)[1])) )
But I get:
# A tibble: 1 x 2
`names(design_mat4)[1]` no_rows
<chr> <int>
1 email_status 1
Is there a better way of achieving my aims, perhaps one that avoids a for loop? Are there leads or suggestions that I can follow to write a working for loop? Code or suggestions welcome.
Not sure if it matters, but please note that I am working with a data.table object while the above toy example is a data.frame object. I know that there are nuances between the two that may impact the syntax needed.
Or do it in base R.
lapply(iris, function(x) aggregate(x, by = list(x), length))
This gives you the results in a list
lapply(names(iris),
function(var){
iris %>%
group_by(!!rlang::sym(var)) %>% # unquote the symbol with !! so group_by() sees the column itself
summarise(no_rows = n())
})
Here's a better dplyr answer from @Frank
lapply(names(iris) %>% setNames(.,.), function(var) iris %>% count(!!as.name(var)))
And a data.table answer
lapply(names(iris) %>% setNames(.,.), function(x) as.data.table(iris)[, .(n = .N), by = x])
If all variables are of the same type, a simpler way to approach the problem is to reshape to long form:
library(tidyverse)
iris %>%
select(-Species) %>%
gather(variable, value) %>%
count(variable, value)
#> # A tibble: 123 x 3
#> variable value n
#> <chr> <dbl> <int>
#> 1 Petal.Length 1.00 1
#> 2 Petal.Length 1.10 1
#> 3 Petal.Length 1.20 2
#> 4 Petal.Length 1.30 7
#> 5 Petal.Length 1.40 13
#> 6 Petal.Length 1.50 13
#> 7 Petal.Length 1.60 7
#> 8 Petal.Length 1.70 4
#> 9 Petal.Length 1.90 2
#> 10 Petal.Length 3.00 1
#> # ... with 113 more rows
If you include Species, the value column will be coerced to character, though.
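If that coercion is acceptable, it can be made explicit, which also lets Species be counted alongside the numeric columns. A sketch using pivot_longer(), the current replacement for gather() in tidyr 1.0.0 and later; the name counts is illustrative:

```r
library(dplyr)
library(tidyr)

counts <- iris %>%
  # Convert every column to character up front so all columns share one type
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  count(variable, value)

counts
```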

Resources