Overlapping groups in dplyr

I'm trying to calculate "rolling" summary statistics based on a grouping factor. Is there a nice way to process by (overlapping) groups, based on (say) an ordered factor?
As an example, say I want to calculate the sum of val by group:
df <- data.frame(grp = c("a", "a", "b", "b", "c", "c", "c"),
                 val = rnorm(7))
For groups based on grp, it's easy:
df %>% group_by(grp) %>% summarise(total = sum(val))
# result:
  grp  total
1   a 1.6388
2   b 0.7421
3   c 1.1707
However, what I want to do is calculate "rolling" sums for successive groups ("a" & "b", then "b" & "c", etc.). The desired output would be something like this, where each total is the sum over both groups:
  grp1 grp2  total
1    a    b 2.3809
2    b    c 1.9128
I'm having trouble doing this in dplyr. In particular, I can't seem to figure out how to get "overlapping" groups - the "b" rows in the above example should end up in two output groups.

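Try lag:
library(dplyr)

df %>%
  group_by(grp) %>%
  arrange(grp) %>%
  summarise(total = sum(val)) %>%   # one row per group
  mutate(grp1 = lag(grp), grp2 = grp, total = total + lag(total)) %>%   # pair each group with the previous one
  select(grp1, grp2, total) %>%
  na.omit()   # drop the first row, which has no previous group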
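Because the question uses rnorm() without a seed, the totals above cannot be reproduced exactly. As a minimal sketch, the same lag idea can be checked by hand on fixed values; the values 1:7 and the name rolling are assumptions chosen for illustration, not part of the original post.

```r
library(dplyr)

# Fixed values (an assumption for illustration) so the totals can be checked by hand.
df <- data.frame(grp = c("a", "a", "b", "b", "c", "c", "c"),
                 val = 1:7)

rolling <- df %>%
  group_by(grp) %>%
  summarise(total = sum(val)) %>%          # per-group sums: a = 3, b = 7, c = 18
  mutate(grp1 = lag(grp),                  # previous group label
         grp2 = grp,                       # current group label
         total = total + lag(total)) %>%   # current sum plus previous sum
  filter(!is.na(grp1)) %>%                 # the first group has no predecessor
  select(grp1, grp2, total)

print(rolling)
```

Here the "b" rows contribute to both output rows (a & b = 10, b & c = 25), which is the overlap the question asks for. This only covers windows of two adjacent groups; wider windows would need additional lag() terms or a different approach.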

Related

I have a column with multiple groups. I want each group as its own column

I have a dataframe with two columns. The first column has the groups, and the second column has values.
I want each grouping to be a column, and I want all associated values under the appropriate column.
I prefer a tidyverse solution. If possible, I prefer a solution that automatically takes each group and creates a new column with the appropriate values under it. I do not want to manually type the headers.
This is what I have:
df <- data.frame(Group = c("A", "A", "A", "B", "B", "B"),
Value = c(2,2,2,3,3,3))
df$Group <- as.character(df$Group)
This is what I want:
want <- data.frame(A = c(2,2,2),
B = c(3,3,3))
With tidyverse, we can use pivot_wider after creating a sequence column by 'Group'
library(dplyr)
library(tidyr)
df %>%
group_by(Group) %>%
mutate(rn = row_number()) %>%
pivot_wider(names_from = Group, values_from = Value) %>%
select(-rn)
# A tibble: 3 x 2
# A B
# <dbl> <dbl>
#1 2 3
#2 2 3
#3 2 3
In base R, it can be done with unstack
unstack(df, Value ~ Group)
Or with data.table
library(data.table)
dcast(setDT(df), rowid(Group) ~ Group, value.var = 'Value')[, .(A, B)]
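The example data has groups of equal size. As a hedged sketch of what the pivot_wider approach does when group sizes differ, the variant below uses a hypothetical data frame df2 with only two "B" rows; the shorter column is padded with NA. The names df2 and wide are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with unequal group sizes (3 rows for A, 2 for B).
df2 <- data.frame(Group = c("A", "A", "A", "B", "B"),
                  Value = c(2, 2, 2, 3, 3))

wide <- df2 %>%
  group_by(Group) %>%
  mutate(rn = row_number()) %>%   # position of each row within its group
  ungroup() %>%
  pivot_wider(names_from = Group, values_from = Value) %>%
  select(-rn)

print(wide)
```

The row index rn is what lets pivot_wider line up the first, second and third value of each group; where a group has no value at a given position, the cell is left as NA.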

Using dplyr collapse rows taking condition from another numeric column

An example df:
experiment = c("A", "A", "A", "A", "A", "B", "B", "B")
count = c(1,2,3,4,5,1,2,1)
df = cbind.data.frame(experiment, count)
Desired output:
experiment_1 = c("A", "A", "A", "B", "B")
freq = c(1,1,3,2,1) # frequency
freq_per = c(20,20,60,66.6,33.3) # frequency percent
df_1 = cbind.data.frame(experiment_1, freq, freq_per)
I want to do the following:
1. Group df using experiment
2. Calculate freq using the count column
3. Calculate freq_per
4. Calculate sum of freq_per for all observations with count >= 3
I have the following code. How do I do step 4?
freq_count = df %>%
  dplyr::group_by(experiment, count) %>%
  summarize(freq = n()) %>%
  na.omit() %>%
  mutate(freq_per = freq / sum(freq) * 100)
Thank you very much.
There may be a more concise approach but I would suggest collapsing your count in a new column using mutate() and ifelse() and then summarising:
freq_count %>%
  mutate(collapsed_count = ifelse(count >= 3, 3, count)) %>%
  group_by(collapsed_count, .add = TRUE) %>% # adds a 2nd grouping var (.add replaces the deprecated add in dplyr >= 1.0.0)
  summarise(freq = sum(freq), freq_per = sum(freq_per)) %>%
  select(-collapsed_count) # dropped to match your df_1.
Also, just fyi, for step 2 you might consider the count() function if you're keen to save some keystrokes. Also tibble() or data.frame() are likely better options than calling the dataframe method of cbind explicitly to create a data frame.
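As a sketch of the count() suggestion, the whole task can also be written by collapsing the count values first and counting afterwards. This is an alternative ordering of the steps, not the code from the answer; it uses pmin() to cap count at 3, and the name collapsed is an assumption for illustration.

```r
library(dplyr)

df <- data.frame(experiment = c("A", "A", "A", "A", "A", "B", "B", "B"),
                 count = c(1, 2, 3, 4, 5, 1, 2, 1))

collapsed <- df %>%
  mutate(count = pmin(count, 3)) %>%               # treat every count >= 3 as 3
  count(experiment, count, name = "freq") %>%      # frequency per experiment/count
  group_by(experiment) %>%
  mutate(freq_per = freq / sum(freq) * 100) %>%    # percent within each experiment
  ungroup()

print(collapsed)
```

For the example data this gives freq values of 1, 1, 3 for experiment A and 2, 1 for experiment B, matching the desired df_1 up to rounding of the percentages.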

Weighting of an already calculated average

I have this kind of data:
category <- c("A", "B", "C", "B", "B", "B", "C")
mean <- c(5,4,5,4,3,1,5)
counts <- c(5, 200, 300, 150, 400, 200,250)
df <- data.frame(category, mean, counts)
category is some kind of factor, and mean is an already calculated average. Each mean was calculated from the different values of a scale (1-5) and their counts. I only have the calculated means, not the individual values they were calculated from.
The goal is to aggregate the means within each category, like this:
library(dplyr)
df %>% group_by(category) %>%
summarise(weighted.mean(mean, counts))
A 5.000000
B 2.947368
C 5.000000
The problem is that, in my case, an average based on higher counts, like C (550), is more valuable than one based on lower counts, like A (5). Any idea how to take this into account?
My solution would be this one, but I don't know if it's valid:
df %>% mutate(y = mean * counts) %>%
mutate(category = as.factor(category)) %>%
group_by(category) %>%
summarise(X = sum(y)) %>%
arrange(desc(X))
B 2800
C 2750
A 25
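Note that sum(mean * counts) is the numerator of the weighted mean, so it mixes the average level with the volume of responses. One possible way to keep the two apart, offered as a sketch rather than a statistical recommendation, is to report the weighted mean and the total count side by side. The column names weighted_mean and total_counts and the name agg are assumptions for illustration.

```r
library(dplyr)

df <- data.frame(category = c("A", "B", "C", "B", "B", "B", "C"),
                 mean     = c(5, 4, 5, 4, 3, 1, 5),
                 counts   = c(5, 200, 300, 150, 400, 200, 250))

agg <- df %>%
  group_by(category) %>%
  summarise(weighted_mean = weighted.mean(mean, counts),  # average level
            total_counts  = sum(counts)) %>%              # how much data backs it
  arrange(desc(total_counts))

print(agg)
```

With both columns available, the reader can see that A and C have the same weighted mean of 5 but rest on 5 and 550 counts respectively. Whether and how to combine the two into a single score depends on the goal of the analysis.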

Drop unused levels from a factor after filtering data frame using dplyr

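I used a dplyr pipeline to create a new data set containing only the names that have fewer than 4 rows.
df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9,
                 stringsAsFactors = TRUE) # name must be a factor to reproduce this (no longer the default since R 4.0.0)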
aa = df %>%
group_by(name) %>%
filter(n() < 4)
But when I type
table(aa$name)
I get,
a b c
3 2 0
I would like to have my output as follows
a b
3 2
How can I completely separate the new frame aa from df?
To complement your answer and KoenV's comment: you can write your solution in one line, or apply factor(), which also removes the unused levels:
table(droplevels(aa$name))
table(factor(aa$name))
or because you are using dplyr add droplevels at the end:
aa <- df %>%
group_by(name) %>%
filter(n() < 4) %>%
droplevels()
table(aa$name)
# Without using table
df %>%
group_by(name) %>%
summarise(count = n()) %>%
filter(count < 4)
aaNew <- droplevels(aa)
table(aaNew$name)
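The empty "c" entry appears because a factor stores its set of levels separately from its values, so filtering out every "c" row does not remove the level. A minimal base R sketch of this, using a standalone factor f rather than the data frame from the question (the names f and kept are assumptions for illustration):

```r
# A factor keeps its levels even after all values of one level are removed.
f <- factor(c("a", "a", "a", "b", "b", "c"))
kept <- f[f != "c"]

levels(kept)               # still lists "c"
table(kept)                # "c" is reported with a count of 0

levels(droplevels(kept))   # unused level removed
table(droplevels(kept))    # only "a" and "b" remain
```

The same applies to factor columns inside a data frame, which is why calling droplevels() on the filtered frame, or factor() on the column, gives the desired table.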

How to get count-of-a-count with dplyr?

Let's say we have the data frame
df <- data.frame(x = c("a", "a", "b", "a", "c"))
Using dplyr count, we get
df %>% count(x)
x n
1 a 3
2 b 1
3 c 1
I now want to do a count on the resulting n column. If the n column were named m, the result I'm looking for is
m n
1 1 2
2 3 1
How can this be done with dplyr?
Thank you very much!
dplyr seems to have trouble with count(n).
For instance:
d <- data.frame(n = sample(1:2, 10, TRUE), x = 1:10)
d %>% count(n)
A workaround is to rename n:
df %>% # using data defined in question
count(x) %>%
rename(m = n) %>%
count(m)
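As a variant of the rename workaround, count() also accepts a name argument for the output column, so the first count can be stored directly as m. This is a sketch assuming a dplyr version that supports name; the variable counts_of_counts is a name chosen for illustration.

```r
library(dplyr)

df <- data.frame(x = c("a", "a", "b", "a", "c"))

counts_of_counts <- df %>%
  count(x, name = "m") %>%   # how often each x occurs, stored in column m
  count(m)                   # how often each count value occurs, stored in n

print(counts_of_counts)
```

For the example data, one value of x occurs three times and two values occur once, so the result has m = 1 with n = 2 and m = 3 with n = 1, matching the table requested in the question.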
EDIT: I was wrong. Didn't have the newest version of dplyr so I didn't have the count function.
With dplyr, one way to count is with n(). In your example, you would do the following to obtain the first counts:
df <- data.frame(x = c("a", "a", "b", "a", "c"))
df %>% group_by(x) %>% summarise(count=n())
Then if you want to count the occurrences of particular counts you can do:
df %>% group_by(x) %>% summarise(count=n()) %>% group_by(count) %>% summarise(newCount=n())
This is a dplyr way.
sum((df %>% count(x))$n)
##[1] 5
If you are willing to give data.table a try, it could be quite straightforward.
df <- data.frame(x = c("a", "a", "b", "a", "c"))
library(data.table)
setDT(df)[, .N, by=x][, list(count_of_N=.N), by=N]
# N count_of_N
# 1: 3 1
# 2: 1 2
If you want to count:
df %>% count(x) %>% summarise(length(n))
# length(n)
#1 3
If you want the sum:
df %>% count(x) %>% summarise(sum(n))
# sum(n)
#1 5
It's not pure dplyr, but this may work:
countr <- function(x) { data.frame(table(x)) }
t <- count(df, x)
countr(t[, 2])
