How to group by counts in R?

How do I count the number of cases per group (similar to COUNT ... GROUP BY in SQL)?
Here is my code, which works:
library(magrittr)
library(dplyr)
df <- data.frame(dose=c("A", "B", "C","D", "E", "B","B", "E", "A","C", "C", "B"),
len=c(4.2, 10, 29.5,4.2, 10, 29.5,4.2, 10, 29.5,4.2, 10, 29.5))
mt_mean <- df %>% group_by(dose) %>% summarise(avg_count = sum(len) )
mt_mean
But I want the counts, NOT the sum.
When I change avg_count = sum(len) to avg_count = count(len),
the following error is thrown:
Error in summarise_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "c('double', 'numeric')".

Staying with the dplyr library and using summarise:
mt_mean <- df %>%
group_by(dose) %>%
summarise(avg_count = n())
Alternatively, you can go even simpler in dplyr with count (per @Frank):
mt_mean <- df %>%
count(dose) %>%
rename(avg_count = n)
This way, you also skip the explicit group_by() step.
Both approaches give you:
> mt_mean
# A tibble: 5 x 2
dose avg_count
<fctr> <int>
1 A 2
2 B 4
3 C 3
4 D 1
5 E 2
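As a cross-check that does not depend on dplyr, here is a minimal base R sketch of the same per-group count. It rebuilds the data from the question; the column name avg_count is kept only to match the output above, and `counts` is a name introduced for illustration.

```r
# Same data as in the question
df <- data.frame(
  dose = c("A", "B", "C", "D", "E", "B", "B", "E", "A", "C", "C", "B"),
  len  = c(4.2, 10, 29.5, 4.2, 10, 29.5, 4.2, 10, 29.5, 4.2, 10, 29.5)
)

# table() counts the rows per dose level; as.data.frame() turns the
# result into a two-column data frame with the counts in avg_count
counts <- as.data.frame(table(dose = df$dose), responseName = "avg_count")
counts
```

The counts should agree with the dplyr result: n() and count() tally rows per group, whereas count(len) inside summarise() fails because count() expects a data frame, not a column vector.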

Related

Concatenate 2 vectors into a string by common element

I have a data.frame with 2 columns. If an element appears in both columns, this should be the grouping criterion. I then want to create a new column which concatenates all elements by group into a single, sorted string.
df <- tibble::tribble(
~col1, ~col2,
"a", "b",
"b","c",
"c","b",
"d",NA,
"e","d",
"f","d",
"g","d",
"h","i",
"i","h",
"j", NA
)
outcome <- tibble::tribble(
~result,
c("a_b_c"),
c("d_e_f_g"),
c("h_i"),
c("j")
)
Any help is appreciated, since I have not yet found a starting point for solving this. Thanks!
Get the connected components with igraph and paste the members of each component together.
library(dplyr)
library(igraph)
df %>%
mutate(col2 = coalesce(col2, col1)) %>%
as.matrix %>%
graph_from_edgelist %>%
components %>%
groups %>%
sapply(paste, collapse = "_") %>%
stack
giving:
values ind
1 a_b_c 1
2 d_e_f_g 2
3 h_i 3
4 j 4

R group by problem: the multiple combination ID

I'm using the group_by function on a dataset in R, but some IDs have to be counted in more than one target group. Here is the sample dataset:
df <- data.frame(ID = c ("A","A","B","C","C","D"),
Var1 = c(1,3,2,3,1,2))
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2 
I have to group ID into A+B, B+C, and D (say F = A+B and G = B+C). The target result dataset is below:
ID Var1
F 6
G 6
D 2
I use the following code to solve it:
library(dplyr)
library(tidyr)
df <- df %>% mutate(F=ifelse(ID %in% c("A", "B"), 1, 0),
G = ifelse(ID %in% c("B", "C"), 1, 0),
D = ifelse(ID == "D", 1, 0))
df %>%
gather(var, val, F:D) %>%
filter(val==1) %>%
group_by(var) %>%
summarise(Var1=sum(Var1))
But this approach fails because of the memory limit (the dataset is large).
Is there another way to solve it?
Any suggestions would be greatly appreciated.
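One possible approach, as a sketch only and not tested on large data: first collapse the data to one total per ID, then add up those totals for each combination. The overlapping groups then only touch the small per-ID table, so the full dataset is never reshaped or duplicated. The names `per_id`, `groups`, `totals`, and `result` are introduced here for illustration.

```r
df <- data.frame(ID   = c("A", "A", "B", "C", "C", "D"),
                 Var1 = c(1, 3, 2, 3, 1, 2))

# One total per ID (a named vector: A = 4, B = 2, C = 4, D = 2)
per_id <- tapply(df$Var1, df$ID, sum)

# Which IDs make up each target group; B appears in both F and G
groups <- list(F = c("A", "B"), G = c("B", "C"), D = "D")

# Sum the per-ID totals for each group
totals <- vapply(groups, function(ids) sum(per_id[ids]), numeric(1))

result <- data.frame(ID = names(totals), Var1 = unname(totals))
result
```

On the sample data this gives 6 for F, 6 for G, and 2 for D. An ID listed in `groups` but absent from the data would yield NA for its group, so that case may need handling.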

How to splice a tidyselect-style list of column names into a call of my function

I am trying to write a function that deduplicates my grouped data frame. It asserts that the values in each group are all the same and then keeps only the first row of the group. I am trying to give it tidyselect-like semantics, as seen in pivot_longer(), because I just need to forward the column names into a summarise(a = n_distinct(...)) call.
So for an example table
test <- tribble(
~G, ~F, ~v1, ~v2,
"A", "a", 1, 2,
"A", "b", 1, 2,
"B", "a", 3, 3,
"B", "b", 3, 3) %>%
group_by(G)
I expect the call remove_duplicates(test, c(v1, v2)) (using the tidyselect helper c()) to return
G F v1 v2
A a 1 2
B a 3 3
but I get
Error: `arg` must be a symbol
I tried to use the new "embrace" syntax to solve this (see function code below), which fails with the message shown above.
# Assert that values in each group are identical and keep the first row of each
# group
# tab: A grouped tibble
# vars: <tidy-select> Columns expected to be constant throughout the group
remove_duplicates <- function(tab, vars){
# Assert identical results for identical models and keep only the first per group.
tab %>%
summarise(a = n_distinct({{{vars}}}) == 1, .groups = "drop") %>%
{stopifnot(all(.$a))}
# Remove duplicates
tab <- tab %>%
slice(1) %>%
ungroup()
return(tab)
}
I think that I somehow would need to specify that the evaluation context of the expression vars must be changed to the sub-data-frame of tab that is currently under evaluation by substitute.
So something like
tab %>%
summarise(a = do.call(n_distinct, TIDYSELECT_TO_LIST_OF_VECTORS(vars, context = CURRENT_GROUP)))
but I do not understand the technical details enough to really make this work...
This works as expected if you first enquos your vars then use the curly-curly operator on the result:
remove_duplicates <- function(tab, vars){
vars <- enquos(vars)
tab %>%
summarise(a = n_distinct({{vars}}) == 1, .groups = "drop") %>%
{stopifnot(all(.$a))}
tab %>% slice(1) %>% ungroup()
}
So now
remove_duplicates(test, c(v1, v2))
#> # A tibble: 2 x 4
#> G F v1 v2
#> <chr> <chr> <dbl> <dbl>
#> 1 A a 1 2
#> 2 B a 3 3
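Another way to write the check, as a hedged sketch: assuming dplyr 1.1.0 or later, pick() accepts a tidyselect expression inside a data-masking verb and returns the selected columns of the current group as a tibble, so the number of distinct rows can be counted directly. The function name remove_duplicates_pick and the column name `constant` are made up for this example.

```r
library(dplyr)

test <- tribble(
  ~G,  ~F,  ~v1, ~v2,
  "A", "a", 1,   2,
  "A", "b", 1,   2,
  "B", "a", 3,   3,
  "B", "b", 3,   3) %>%
  group_by(G)

# tab:  a grouped tibble
# vars: <tidy-select> columns expected to be constant within each group
remove_duplicates_pick <- function(tab, vars) {
  # For each group, check that the selected columns hold one distinct row
  check <- tab %>%
    summarise(constant = nrow(unique(pick({{ vars }}))) == 1,
              .groups = "drop")
  stopifnot(all(check$constant))
  # Keep the first row of each group
  tab %>% slice(1) %>% ungroup()
}

res <- remove_duplicates_pick(test, c(v1, v2))
res
```

With the example table this should keep the first row of each group, and it should stop with an error if a selected column varies within a group.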

How to get count-of-a-count with dplyr?

Let's say we have the data frame
df <- data.frame(x = c("a", "a", "b", "a", "c"))
Using dplyr count, we get
df %>% count(x)
x n
1 a 3
2 b 1
3 c 1
I now want to do a count on the resulting n column. If the n column were named m, the result I'm looking for is
m n
1 1 2
2 3 1
How can this be done with dplyr?
Thank you very much!
dplyr seems to have trouble with count(n).
For instance:
d <- data.frame(n = sample(1:2, 10, TRUE), x = 1:10)
d %>% count(n)
A workaround is to rename n:
df %>% # using data defined in question
count(x) %>%
rename(m = n) %>%
count(m)
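The rename step can also be folded into the first count. This is a small sketch assuming a dplyr version in which count() has the name argument (it was added in 0.8.0, as far as I know); `count_of_counts` is a name introduced for illustration.

```r
library(dplyr)

df <- data.frame(x = c("a", "a", "b", "a", "c"))

count_of_counts <- df %>%
  count(x, name = "m") %>%  # first count, stored in column m instead of n
  count(m)                  # second count, stored in the default column n

count_of_counts
```

For the data in the question this should give m = 1 with n = 2 and m = 3 with n = 1.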
EDIT: I was wrong. Didn't have the newest version of dplyr so I didn't have the count function.
With dplyr, one way to count is with n(). In your example you would do the following to obtain the first counts:
df <- data.frame(x = c("a", "a", "b", "a", "c"))
df %>% group_by(x) %>% summarise(count=n())
Then if you want to count the occurrences of particular counts you can do:
df %>% group_by(x) %>% summarise(count=n()) %>% group_by(count) %>% summarise(newCount=n())
This is a dplyr way.
sum((df %>% count(x))$n)
##[1] 5
If you are willing to give data.table a try, it could be quite straightforward.
df <- data.frame(x = c("a", "a", "b", "a", "c"))
library(data.table)
setDT(df)[, .N, by=x][, list(count_of_N=.N), by=N]
# N count_of_N
# 1: 3 1
# 2: 1 2
If you want to count:
df %>% count(x) %>% summarise(length(n))
# length(n)
#1 3
If you want the sum:
df %>% count(x) %>% summarise(sum(n))
# sum(n)
#1 5
It's not pure plyr, but this may work:
countr<-function(x){data.frame(table(x))}
t<-count(df,x)
countr(t[,2])

overlapping groups in dplyr

I'm trying to calculate "rolling" summary statistics based on a grouping factor. Is there a nice way to process by (overlapping) groups, based on (say) an ordered factor?
As an example, say I want to calculate the sum of val by groups
df <- data.frame(grp = c("a", "a", "b", "b", "c", "c", "c"),
val = rnorm(7))
For groups based on grp, it's easy:
df %>% group_by(grp) %>% summarise(total = sum(val))
# result:
grp total
1 a 1.6388
2 b 0.7421
3 c 1.1707
However, what I want to do is calculate "rolling" sums for successive groups ("a" & "b", then "b" & "c", etc.). The desired output would be something like this:
grp1 grp2 total
1 a b 2.3809
2 b c 1.9128
I'm having trouble doing this in dplyr. In particular, I can't seem to figure out how to get "overlapping" groups - the "b" rows in the above example should end up in two output groups.
Try lag:
df %>%
group_by(grp) %>%
arrange(grp) %>%
summarise(total = sum(val)) %>%
mutate(grp1 = lag(grp), grp2 = grp, total = total + lag(total)) %>%
select(grp1, grp2, total) %>%
na.omit
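To make the arithmetic concrete, here is a base R sketch that starts from the per-group totals printed in the question (1.6388, 0.7421, 1.1707) instead of fresh rnorm() draws. It pairs each group with its successor and adds the two totals; `totals`, `k`, and `rolling` are names introduced for illustration.

```r
# Per-group totals as printed in the question
totals <- c(a = 1.6388, b = 0.7421, c = 1.1707)
k <- length(totals)

# Pair every group with the next one and add their totals
rolling <- data.frame(
  grp1  = names(totals)[-k],               # a, b
  grp2  = names(totals)[-1],               # b, c
  total = unname(totals[-k] + totals[-1])  # a + b, b + c
)
rolling
```

The "b" total contributes to both output rows, which is the overlap the question asks for.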
