Stepwise column sum in data frame based on another column in R

I have a data frame like this:
Team  x
A     3
B     5
A     2
A     3
B     1
B     6
I'm looking for output like this (just one additional column):
Team  x  avg(x)
A     3  0
B     5  0
A     2  3
A     3  2.5
B     1  5
B     6  3
avg(x) is the average of all previous instances of x where Team is the same. I have the following R code which gets the overall average, however I'm looking for the "step-wise" average.
new_df <- df %>% group_by(Team) %>% summarise(avg_x = mean(x))
Is there a way to vectorize this while only evaluating the previous rows on each "iteration"?

You want the cummean() function from dplyr, combined with lag() (replace_na() comes from tidyr):
df %>% group_by(Team) %>% mutate(avg_x = replace_na(lag(cummean(x)), 0))
Producing the following:
# A tibble: 6 × 3
# Groups: Team [2]
Team x avg_x
<chr> <dbl> <dbl>
1 A 3 0
2 B 5 0
3 A 2 3
4 A 3 2.5
5 B 1 5
6 B 6 3
As required.
Edit 1:
As @Ritchie Sacramento pointed out, the following is cleaner and clearer:
df %>% group_by(Team) %>% mutate(avg_x = lag(cummean(x), default = 0))
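If you would rather avoid a dplyr dependency, the same lagged running mean can be sketched in base R with ave(). This is only an illustrative alternative, not part of the answer above. The data frame is rebuilt from the question with the value column assumed to be named x, and the helper name lagged_cummean is made up for this example.

```r
# Example data from the question (value column assumed to be named x)
df <- data.frame(
  Team = c("A", "B", "A", "A", "B", "B"),
  x    = c(3, 5, 2, 3, 1, 6)
)

# Hypothetical helper: mean of all *previous* values, 0 when there are none
lagged_cummean <- function(v) {
  running <- cumsum(v) / seq_along(v)  # cumulative mean including the current row
  c(0, head(running, -1))              # shift down by one; the first row has no history
}

# ave() applies the helper within each Team and keeps the original row order
df$avg_x <- ave(df$x, df$Team, FUN = lagged_cummean)
print(df)
```

Because ave() writes each group's result back into the original positions, no sorting or re-joining is needed afterwards.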

Related

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as_tibble(data.frame(id = rep(1:3, each = 2),
                           item_code = c("A", "A", "B", "B", "B", "Z"),
                           score = rep(1, 6)))
additional_rows <- as_tibble(data.frame(item_code = c("A", "Z"),
                                        score = c(6, 6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 8 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: added the other arrange terms so that the scores come out in the same order.
Edit 2: there should now be no duplicated rows added from additional_rows, as per your comment.
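Since plyr masks several dplyr functions (arrange, mutate, summarise) when it is loaded after dplyr, a dplyr-only variant may be easier to keep in a tidyverse script. The following is a minimal sketch of that idea, not the answerer's code. It assumes the example data from the question, built here with tibble() so item_code is a character column, and the intermediate name new_rows is made up for illustration.

```r
library(dplyr)

# Example data from the question
df <- tibble(id = rep(1:3, each = 2),
             item_code = c("A", "A", "B", "B", "B", "Z"),
             score = rep(1, 6))
additional_rows <- tibble(item_code = c("A", "Z"),
                          score = c(6, 6))

# One candidate row per (id, item_code) pair that has a match in additional_rows;
# distinct() prevents the extra row from being repeated within a group
new_rows <- df %>%
  distinct(id, item_code) %>%
  inner_join(additional_rows, by = "item_code")

# Stack the extra rows with the original data and put them on top of each group
result <- bind_rows(new_rows, df) %>%
  arrange(id, item_code, desc(score))
print(result)
```

Sorting by desc(score) only puts the added rows first because their score (6) is higher than the existing scores (1). With other data, an explicit marker column would be a safer sort key.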

Dense Rank by Multiple Columns in R

How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(). But I'm struggling with this.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x, y, lex.order = TRUE)))
But if you have a better solution, please share.
data.table
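data.table has you covered with frank().
library(data.table)
frank(df, x,y, ties.method = 'min')
[1] 1 2 3 4 5 5 5 8 9 10
Note that ties.method = 'min' leaves gaps after ties (8, 9, 10 above); use ties.method = 'dense' to get the gap-free ranks 1 2 3 4 5 5 5 6 7 8.
You can assign df$r <- frank(df, x, y, ties.method = 'dense') to add the rank as a new column.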
tidyr/dplyr
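Another option (though clunkier) is to use tidyr::unite to collapse your columns into one and then apply dplyr::dense_rank. Because the united column is a character string, it is ranked lexicographically, so this only matches the numeric order while the values sort the same way as strings (for example, "10_1" sorts before "2_1").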
library(tidyverse)
df %>%
# add a single column with all the info
unite(xy, x, y) %>%
cbind(df) %>%
# dense rank on that
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
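The cur_group_id() approach does not depend on the rows being sorted, because group_by() numbers the groups in sorted order of the grouping columns. Here is a small sketch to illustrate that, assuming dplyr 1.0.0 or later. The data are the question's rows in a shuffled order chosen for this example.

```r
library(dplyr)

# Same (x, y) pairs as in the question, but deliberately shuffled
df <- data.frame(x = c(3, 1, 2, 1, 3, 2, 1, 2, 3, 1),
                 y = c(2, 4, 2, 1, 3, 2, 3, 2, 1, 2))

res <- df %>%
  group_by(x, y) %>%              # groups are numbered in sorted (x, y) order
  mutate(r = cur_group_id()) %>%  # group number doubles as the dense rank
  ungroup()

print(res)
```

Each row receives the rank of its (x, y) pair, and the three (2, 2) rows share rank 5 even though they are not adjacent.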

R Show duplicates in dataframe

I am trying to "highlight" duplicates in my dataframe. I found various tutorials on dropping duplicates or creating a new dataset containing only duplicates. But since I suspect something went wrong in earlier stages of my data work, I would (for now) just like to see which observations appear to be duplicates, in order to understand what went wrong. I would like R to create column c, as in this example:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <-data.frame(a,b,c)
You can count how often each combination of a and b appears with dplyr:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <-data.frame(a,b)
library(dplyr)
df %>%
group_by(a,b) %>% # for each combination of a and b
mutate(c = n()) %>% # count times they appear
ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
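The same count can also be obtained without dplyr. This is a base R sketch offered as an assumption about what is useful here, not as part of the answer above. The column name is_dup and the helper variable key are made up for illustration.

```r
# Example data from the question
a <- c("C", "A", "A", "B", "A", "C", "C")
b <- c(1, 1, 2, 1, 2, 1, 2)
df <- data.frame(a, b)

# Build one key per row, then look up how often each key occurs
key <- paste(df$a, df$b)
df$c <- as.integer(table(key)[key])

# Logical flag: TRUE for every row that has at least one identical twin.
# duplicated() alone misses the first occurrence, hence the fromLast pass.
df$is_dup <- duplicated(df[c("a", "b")]) |
  duplicated(df[c("a", "b")], fromLast = TRUE)

print(df)
```

Filtering with df[df$is_dup, ] then shows only the suspicious rows while leaving the original data frame intact.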

How to combine data points in a data frame in R?

The data frame x has a column in which the values are periodic. For each unique value in that column, I want to calculate the sum of the second column. If x is something like this:
x <- data.frame(a=c(1:2,1:2,1:2),b=c(1,4,5,2,3,4))
a b
1 1 1
2 2 4
3 1 5
4 2 2
5 1 3
6 2 4
The output I want is the following data frame:
a b
1 9
2 10
Using aggregate as follows will give you the desired result:
aggregate(b ~ a, x, sum)
Here is the option with dplyr
library(dplyr)
x %>%
group_by(a) %>%
summarise(b = sum(b))
# A tibble: 2 x 2
# a b
# <int> <dbl>
#1 1 9.00
#2 2 10.0

R: Producing frequency table by selecting certain rows

I have a minimal example of a data set D that looks something like:
score person freq
10 1 3
10 2 5
10 3 4
8 1 3
7 2 2
6 4 1
Now, I want to be able to plot frequency of score=10 against person.
However, if I do:
#My bad, turns out the next line only works for matrices anyway:
#D = D[which(D[,1] == 10)]
D = subset(D, score == 10)
then I get:
score person freq
10 1 3
10 2 5
10 3 4
However, this is what I would like to get:
score person freq
10 1 3
10 2 5
10 3 4
10 4 0
Is there any quick and painless way for me to do this in R?
Here's a base R approach:
subset(as.data.frame(xtabs(freq ~ score + person, D)), score == 10)
# score person Freq
#4 10 1 3
#8 10 2 5
#12 10 3 4
#16 10 4 0
You can use complete() from the tidyr package to create the missing rows and then you can simply subset:
library(tidyr)
D2 <- complete(D, score, person, fill = list(freq = 0))
D2[D2$score == 10, ]
## Source: local data frame [4 x 3]
##
## score person freq
## (int) (int) (dbl)
## 1 10 1 3
## 2 10 2 5
## 3 10 3 4
## 4 10 4 0
complete() takes as the first argument the data frame that it should work with. Then follow the names of the columns that should be completed. The argument fill is a list that gives for each of the remaining columns (which is only freq here) the value they should be filled with.
As suggested by docendo-discimus, this can be further simplified by also using the dplyr package, as follows:
library(tidyr)
library(dplyr)
complete(D, score, person, fill = list(freq = 0)) %>% filter(score == 10)
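To make the complete() call above reproducible, here is a self-contained sketch. The question does not show how D was built, so the data.frame() call below is an assumption reconstructed from the printed table.

```r
library(tidyr)

# Data reconstructed from the table in the question (construction is assumed)
D <- data.frame(score  = c(10, 10, 10, 8, 7, 6),
                person = c(1, 2, 3, 1, 2, 4),
                freq   = c(3, 5, 4, 3, 2, 1))

# Expand to every score/person combination; new rows get freq = 0
D2 <- complete(D, score, person, fill = list(freq = 0))

# Keep only the rows for score 10, now including person 4
res <- D2[D2$score == 10, ]
print(res)
```

Note that complete() creates all 4 x 4 combinations of score and person, so the intermediate table has 16 rows before subsetting.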
Here is a dplyr approach:
D %>% mutate(freq = ifelse(score == 10, freq, 0),
score = 10) %>%
group_by(score, person) %>%
summarise(freq = max(freq))
Source: local data frame [4 x 3]
Groups: score [?]
score person freq
(dbl) (int) (dbl)
1 10 1 3
2 10 2 5
3 10 3 4
4 10 4 0
