Cumulative aggregates within tidyverse - r

Say I have a tibble (or data.table) which consists of two columns:
a <- tibble(id = rep(c("A", "B"), each = 6), val = c(1, 0, 0, 1 ,0,1,0,0,0,1,1,1))
Furthermore I have a function called myfun which takes a numeric vector of arbitrary length as input and returns a single number. For example, you can think of myfun as being the standard deviation.
Now I would like to create a third column to my tibble (called result) which contains the outputs of myfun applied to val cumulated and grouped with respect to id.
For example, the first entry of result should contain myfun(val[1]).
The second entry should contain myfun(val[1:2]), and so on.
In other words, I would like to implement a cumulative version of myfun.
Of course there are a lot of easy solutions outside the tidyverse using loops and whatnot.
But I would be interested in a solution within the tidyverse or within the data.table framework.
Any help is appreciated.

You could do it this way:
library(tidyverse)
a %>%
group_by(id) %>%
mutate(y = map_dbl(seq_along(val),~sd(val[1:.x]))) %>%
ungroup
# # A tibble: 12 x 3
# id val y
# <chr> <dbl> <dbl>
# 1 A 1 NA
# 2 A 0 0.7071068
# 3 A 0 0.5773503
# 4 A 1 0.5773503
# 5 A 0 0.5477226
# 6 A 1 0.5477226
# 7 B 0 NA
# 8 B 0 0.0000000
# 9 B 0 0.0000000
# 10 B 1 0.5000000
# 11 B 1 0.5477226
# 12 B 1 0.5477226
Explanation
As is common in tidyverse chains, we group first; then we use mutate rather than summarize, because we want to keep every original row.
The function map_dbl is used here to loop over a vector of end indices. seq_along(val) is 1:6 for both groups here.
Using functions from the map family we can use the ~ notation, which will assume the first parameter of the function is named .x.
Looping through these indices we compute first sd(val[1:1]) which is sd(val[1]) which is NA, then sd(val[1:2]) etc...
map_dbl returns by design a vector of doubles, and these are stacked in the y column.
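Since the question also mentions data.table, here is a minimal sketch of the same idea there. The helper name cum_apply is made up for this example (it is not part of any package), and sd stands in for myfun:

```r
library(data.table)

# Hypothetical helper: apply f to x[1], x[1:2], ..., x[1:n]
# and return one number per position.
cum_apply <- function(x, f) {
  vapply(seq_along(x), function(i) f(x[seq_len(i)]), numeric(1))
}

a <- data.table(id  = rep(c("A", "B"), each = 6),
                val = c(1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1))

# by = id restarts the cumulative window for every group
a[, result := cum_apply(val, sd), by = id]
```

Any function that returns a single number can be passed in place of sd. Like the map_dbl version, this recomputes f on a growing prefix, so the cost grows quadratically with the group size.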

One can use zoo::rollapplyr with a dynamic width (a vector containing one width per row). To build that width vector for each group, 1:n() or seq(n()) can be used.
Let's apply it to the function sd using the data provided by the OP:
library(dplyr)
library(zoo)
a %>% group_by(id) %>%
mutate(y = rollapplyr(val, 1:n(), sd ))
# # Groups: id [2]
# id val y
# <chr> <dbl> <dbl>
# 1 A 1.00 NA
# 2 A 0 0.707
# 3 A 0 0.577
# 4 A 1.00 0.577
# 5 A 0 0.548
# 6 A 1.00 0.548
# 7 B 0 NA
# 8 B 0 0
# 9 B 0 0
# 10 B 1.00 0.500
# 11 B 1.00 0.548
# 12 B 1.00 0.548

Related

How to vectorize nested loops and update a dataframe

I have a dataframe with a column called Product (with many products), a column called Timestamp (representing the date in a discrete ordinal variable) and a column called Rating.
I am trying to calculate the moving average and the moving standard deviation for the Rating variable, by each Product, taking into account the Timestamp.
The data looks something like this:
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
Now I add the columns for the moving average and the moving standard deviation:
DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))
And finally, I am using this code with nested for loops to get the result I want:
for (product in unique(DF$Product)) {
for (timestamp in DF[DF$Product==product,]$Timestamp){
if (timestamp==1) {
DF[DF$Product==product &
DF$Timestamp==timestamp,]$Moving.avg <-
DF[DF$Product==product &
DF$Timestamp==timestamp,]$Rating
DF[DF$Product==product &
DF$Timestamp==timestamp,]$Moving.sd <- 0
}else{
index_start <- which(DF$Product==product &
DF$Timestamp==1)
index_end <- which(DF$Product==product &
DF$Timestamp==timestamp)
DF[DF$Product==product &
DF$Timestamp==timestamp,]$Moving.avg <-
mean(DF[index_start:index_end,]$Rating)
DF[DF$Product==product &
DF$Timestamp==timestamp,]$Moving.sd <-
sd(DF[index_start:index_end,]$Rating)
}
}
}
The code works fine but it is too slow.
I wonder how can I use vectorization to make this faster?
If you want to do the whole thing vectorised in base R you could try:
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
do.call(rbind, lapply(seq(nrow(x)), function(y) {
c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))}))})))
#> Product Timestamp Rating Moving.avg Moving.sd
#> 1 a 1 4 4.000000 NA
#> 2 a 2 3 3.500000 0.7071068
#> 3 a 3 5 4.000000 1.0000000
#> 4 a 4 3 3.750000 0.9574271
#> 5 b 1 3 3.000000 NA
#> 6 b 2 4 3.500000 0.7071068
#> 7 b 3 5 4.000000 1.0000000
#> 8 c 1 3 3.000000 NA
#> 9 c 2 1 2.000000 1.4142136
#> 10 c 3 1 1.666667 1.1547005
#> 11 c 4 2 1.750000 0.9574271
#> 12 c 5 5 2.400000 1.6733201
Note though that the sd of a single number is NA rather than 0. The snippet above only prints the combined data frame; if you assign the result back to DF, it is simple to replace these values with DF$Moving.sd[is.na(DF$Moving.sd)] <- 0
Created on 2020-08-31 by the reprex package (v0.3.0)
I think you are looking for cumulative mean and cumulative standard deviation.
For the cumulative mean you can use dplyr's cummean function, and TTR::runSD for the cumulative standard deviation.
library(dplyr)
DF %>%
group_by(Product) %>%
mutate(cum_avg = cummean(Rating),
cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE))
# Product Timestamp Rating cum_avg cum_std
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 4 4 NaN
# 2 a 2 3 3.5 0.707
# 3 a 3 5 4 1
# 4 a 4 3 3.75 0.957
# 5 b 1 3 3 NaN
# 6 b 2 4 3.5 0.707
# 7 b 3 5 4 1
# 8 c 1 3 3 NaN
# 9 c 2 1 2 1.41
#10 c 3 1 1.67 1.15
#11 c 4 2 1.75 0.957
#12 c 5 5 2.4 1.67
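If you would rather avoid extra packages, a cumulative standard deviation can also be sketched in base R from running sums. The function name cum_sd is an assumption for this example, and the code assumes the rows are already sorted by Timestamp within each Product. The one-pass formula can lose precision when the values are large compared to their spread, so treat it as a sketch rather than a drop-in replacement for sd:

```r
# Hypothetical helper: sample standard deviation of x[1:i] for every i,
# computed from running sums instead of re-running sd() on each prefix.
cum_sd <- function(x) {
  n  <- seq_along(x)
  s1 <- cumsum(x)      # running sum
  s2 <- cumsum(x^2)    # running sum of squares
  v  <- (s2 - s1^2 / n) / (n - 1)
  v[n < 2] <- NA_real_ # sd is undefined for a single observation
  sqrt(pmax(v, 0))     # guard against tiny negative rounding errors
}

DF <- data.frame(Product   = c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating    = c(4,3,5,3,3,4,5,3,1,1,2,5))

# ave() applies the function within each Product and keeps the row order
DF$cum_avg <- ave(DF$Rating, DF$Product,
                  FUN = function(v) cumsum(v) / seq_along(v))
DF$cum_std <- ave(DF$Rating, DF$Product, FUN = cum_sd)
```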
Does this example work for you? Here I am using the function runner() from the runner package. runner() applies a function over a window that you define, and it works fine with the group_by() function from dplyr. You set the window size with the k argument.
library(runner)
library(dplyr)
library(magrittr)
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
DF <- DF %>%
group_by(Product) %>%
arrange(Timestamp, .by_group = TRUE)
DF <- DF %>%
mutate(
average = runner(Rating, f = function(x) mean(x), k = 3),
deviation = runner(Rating, f = function(x) sd(x), k = 3)
)
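It is worth mentioning that the function will expand the window on the first rows of each group (that is, each Product) of your data.frame until it reaches the size defined in the k argument. So in the first two rows, where we do not yet have 3 values, runner() will apply the function to the one or two rows available so far. Note that k = 3 gives a rolling window of at most three rows, not the cumulative window from the first timestamp that the loop in the question computes, so the two only agree on the first three rows of each group.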
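To make the difference between a rolling and a cumulative window concrete, here is a small base R sketch using the ratings of product "a". The helper names roll_apply and cum_apply are invented for this illustration and are not part of the runner package:

```r
rating <- c(4, 3, 5, 3)

# Rolling window: at most k values ending at position i
roll_apply <- function(x, f, k) {
  vapply(seq_along(x), function(i) f(x[max(1, i - k + 1):i]), numeric(1))
}

# Cumulative window: everything from the first value up to position i
cum_apply <- function(x, f) {
  vapply(seq_along(x), function(i) f(x[seq_len(i)]), numeric(1))
}

roll_mean <- roll_apply(rating, mean, k = 3)
cum_mean  <- cum_apply(rating, mean)
```

The two vectors agree on the first three positions and differ at the fourth, where the rolling window drops the first rating (mean of 3, 5, 3) while the cumulative window keeps it (mean of 4, 3, 5, 3).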
Building on this answer to a related question, you could also do it this way with dplyr (plus purrr for map_dbl):
library(dplyr)
library(purrr)
DF <- DF %>%
# Sort in order of product and then timestamp within product
arrange(Product, Timestamp) %>%
# group data by product
group_by(Product) %>%
# use the cumulative mean function to calculate the means
mutate(Moving.avg = cummean(Rating),
# use the map_dbl function to calculate standard deviations up to a certain index value
Moving.sd = map_dbl(seq_along(Timestamp),~sd(Rating[1:.x])),
# replace Moving.sd=0 when Timestamp takes on its smallest value
Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0,
TRUE ~ Moving.sd)) %>%
# ungroup the data
ungroup

R list top n entries in a dataframe by factor or integer

I would like to list the top n entries in a dataframe for each level of a factor or an integer. Here is the code I have:
index <- c(1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3)
prob <- runif(20,0,1)
x.df <- data.frame(cbind(index,prob))
x.df <- x.df[order(x.df$index,-x.df$prob),]
head(x.df[x.df$index==1,],n=3)
head(x.df[x.df$index==2,],n=3)
head(x.df[x.df$index==3,],n=3)
This does the job, but I'd like to not have to explicitly have a head statement for every level/integer. Thanks.
In base R, there is a by-group function called tapply (this relies on x.df already being sorted by descending prob, as in your code):
with(x.df, stack(tapply(prob, index, head, 3)))
# values ind
#1 0.9045300 1
#2 0.7651376 1
#3 0.3631891 1
#4 0.9471318 2
#5 0.9206743 2
#6 0.7675069 2
#7 0.9866379 3
#8 0.9149754 3
#9 0.7862320 3
And there is also the by function for data.frame:
do.call(rbind, by(x.df, index, head, 3))
which yields the same result
Assuming that your dataframe is already ordered the way you want, you can do this:
library(dplyr)
x.df %>%
group_by(index) %>% # for each index
slice(1:3) %>% # get top 3 rows
ungroup() # forget the grouping
# # A tibble: 9 x 2
# index prob
# <dbl> <dbl>
# 1 1 0.943
# 2 1 0.461
# 3 1 0.251
# 4 2 0.739
# 5 2 0.697
# 6 2 0.695
# 7 3 0.968
# 8 3 0.915
# 9 3 0.635
Assuming it's unordered, rank by prob explicitly:
x.df %>%
group_by(index) %>%
top_n(3, prob) %>%
ungroup()
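In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). A minimal sketch, assuming a recent dplyr and using a seeded version of the example data (set.seed(1) is only there to make the run repeatable):

```r
library(dplyr)

set.seed(1)
x.df <- data.frame(index = rep(1:3, times = c(5, 10, 5)),
                   prob  = runif(20))

top3 <- x.df %>%
  group_by(index) %>%
  # keep the 3 largest prob values per index; with_ties = FALSE
  # guarantees exactly 3 rows even if values are tied
  slice_max(prob, n = 3, with_ties = FALSE) %>%
  ungroup()
```

This does not need the data to be sorted beforehand.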
An easy solution using the data.table package:
library(data.table)
setDT(x.df)[, head(.SD, 3), by = index]
Output:
index prob
1: 1 0.7863076
2: 1 0.7103228
3: 1 0.5657803
4: 2 0.9939695
5: 2 0.7517839
6: 2 0.7348664
7: 3 0.9260537
8: 3 0.5889305
9: 3 0.5557626
Note: if your prob is not ordered, then use this:
setDT(x.df)[order(-prob), head(.SD, 3), by = index]

Subtracting the last value in a group from previous values in dplyr

I have the following data
data = tribble(~t,~key,~value,
1,"a",10,
2,"a",20,
3,"a",30,
1,"b",100,
2,"b",200,
3,"b",300,
1,"c",1000,
2,"c",2000,
3,"c",3000)
and would like to get the following result
result = tribble(~t,~key,~value,
1,"a",-20,
2,"a",-10,
3,"a",0,
1,"b",-200,
2,"b",-100,
3,"b",0,
1,"c",-2000,
2,"c",-1000,
3,"c",0)
The idea is that I would like to subtract the 3rd value from all of the other values in that group. I tried to group_by the key, but struggled with the row-wise subtraction within the group.
We can use the last function from dplyr. The arrange call makes sure the rows are in the right order, so that last(value) really is the value at the largest t in each group.
library(dplyr)
data2 <- data %>%
arrange(key, t) %>%
group_by(key) %>%
mutate(value = value - last(value)) %>%
ungroup()
data2
# # A tibble: 9 x 3
# t key value
# <dbl> <chr> <dbl>
# 1 1 a -20
# 2 2 a -10
# 3 3 a 0
# 4 1 b -200
# 5 2 b -100
# 6 3 b 0
# 7 1 c -2000
# 8 2 c -1000
# 9 3 c 0
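If you would rather not depend on the row order at all, one option is to pick the reference value by position of the largest t instead of using last(). A small sketch, with the rows deliberately shuffled to show that no arrange() is needed (the shuffled data frame is made up for this example):

```r
library(dplyr)

# Same values as in the question, but in scrambled row order
data <- data.frame(
  t     = c(3, 1, 2, 2, 3, 1, 1, 3, 2),
  key   = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  value = c(30, 10, 20, 200, 300, 100, 1000, 3000, 2000)
)

data2 <- data %>%
  group_by(key) %>%
  # value[which.max(t)] is the value at the latest t within the group
  mutate(value = value - value[which.max(t)]) %>%
  ungroup() %>%
  arrange(key, t)   # only for display; not needed for the calculation
```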

For each group find observations with max value of several columns

Assume I have a data frame like so:
set.seed(4)
df<-data.frame(
group = rep(1:10, each=3),
id = rep(sample(1:3), 10),
x = sample(c(rep(0, 15), runif(15))),
y = sample(c(rep(0, 15), runif(15))),
z = sample(c(rep(0, 15), runif(15)))
)
As seen above, some elements of x, y, z vectors take value of zero, the rest being drawn from the uniform distribution between 0 and 1.
For each group, determined by the first column, I want to find three IDs from the second column, pointing to the highest value of x, y, z variables in the group. Assume there are no draws except for the cases in which a variable takes a value of 0 in all observations of a given group - in that case I don't want to return any number as an id of a row with maximum value.
The output would look like so:
group x y z
1 2 2 1
2 2 3 1
... .........
My first thought is to select rows with maximum values separately for each variable and then use merge to put it in one table. However, I'm wondering if it can be done without merge, for example with standard dplyr functions.
Here is my proposed solution using plyr:
library(plyr)
ddply(df,.variables = c("group"),
.fun = function(t){apply(X = t[,c(-1,-2)],MARGIN = 2,
function(z){ifelse(sum(abs(z))==0,yes = NA,no = t$id[which.max(z)])})})
# group x y z
#1 1 2 2 1
#2 2 2 3 1
#3 3 1 3 2
#4 4 3 3 1
#5 5 2 3 NA
#6 6 3 1 3
#7 7 1 1 2
#8 8 NA 2 3
#9 9 2 1 3
#10 10 2 NA 2
Here is a solution using dplyr and tidyr. Notice that if all values in a group are the same, we cannot decide which id should be selected, so filter(n_distinct(Value) > 1) is added to remove those records. In the final output df2, NA marks the cases where all values are the same. We can decide later whether to impute those NAs. This solution should work for any number of ids or columns (x, y, z, ...).
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, -id) %>%
arrange(group, Column, desc(Value)) %>%
group_by(group, Column) %>%
# If all values from a group-Column are all the same, remove that group-Column
filter(n_distinct(Value) > 1) %>%
slice(1) %>%
select(-Value) %>%
spread(Column, id)
If you want to stick with just dplyr, you can use the multiple-column summarize/mutate functions. This should work regardless of the form of id; my initial attempt was slightly cleaner but assumed that an id of zero was invalid.
df %>%
group_by(group) %>%
mutate_at(vars(-id),
# If the row is the max within the group, set the value
# to the id and use NA otherwise
funs(ifelse(max(.) != 0 & . == max(.),
id,
NA))) %>%
select(-id) %>%
summarize_all(funs(
# There are zero or one non-NA values per group, so handle both cases
if(any(!is.na(.)))
na.omit(.) else NA))
## # A tibble: 10 x 4
## group x y z
## <int> <int> <int> <int>
## 1 1 2 2 1
## 2 2 2 3 1
## 3 3 1 3 2
## 4 4 3 3 1
## 5 5 2 3 NA
## 6 6 3 1 3
## 7 7 1 1 2
## 8 8 NA 2 3
## 9 9 2 1 3
## 10 10 2 NA 2
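With dplyr 1.0.0 or later, the same logic can be sketched more compactly with across(). This is only one possible rewrite, and it assumes, as in the question, that a group maximum of 0 means the column is all zeros in that group:

```r
library(dplyr)

set.seed(4)
df <- data.frame(
  group = rep(1:10, each = 3),
  id    = rep(sample(1:3), 10),
  x     = sample(c(rep(0, 15), runif(15))),
  y     = sample(c(rep(0, 15), runif(15))),
  z     = sample(c(rep(0, 15), runif(15)))
)

res <- df %>%
  group_by(group) %>%
  summarise(across(
    c(x, y, z),
    # id of the row holding the group maximum, or NA if the column is all zeros
    ~ if (max(.x) == 0) NA_integer_ else id[which.max(.x)]
  ))
```

The exact ids you get depend on the random draws (and therefore on the R version used for sample()), so they may not match the printed output above.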

R: How to calculate mean for each row with missing values using dplyr

I want to calculate means over several columns for each row in my dataframe containing missing values, and place results in a new column called 'means.' Here's my dataframe:
df <- data.frame(A=c(3,4,5),B=c(0,6,8),C=c(9,NA,1))
A B C
1 3 0 9
2 4 6 NA
3 5 8 1
The code below successfully accomplishes the task if columns have no missing values, such as columns A and B.
library(dplyr)
df %>%
rowwise() %>%
mutate(means=mean(A:B, na.rm=T))
A B C means
<dbl> <dbl> <dbl> <dbl>
1 3 0 9 1.5
2 4 6 NA 5.0
3 5 8 1 6.5
However, if a column has missing values, such as C, then I get an error:
> df %>% rowwise() %>% mutate(means=mean(A:C, na.rm=T))
Error: NA/NaN argument
Ideally, I'd like to implement it with dplyr.
df %>%
mutate(means=rowMeans(., na.rm=TRUE))
The . is a "pronoun" that references the data frame df that was piped into mutate.
A B C means
1 3 0 9 4.000000
2 4 6 NA 5.000000
3 5 8 1 4.666667
You can also select only specific columns to include, using all the usual methods (column names, indices, grep, etc.).
df %>%
mutate(means=rowMeans(.[ , c("A","C")], na.rm=TRUE))
A B C means
1 3 0 9 6
2 4 6 NA 4
3 5 8 1 3
It is simple to accomplish in base R as well:
cbind(df, "means"=rowMeans(df, na.rm=TRUE))
A B C means
1 3 0 9 4.000000
2 4 6 NA 5.000000
3 5 8 1 4.666667
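rowMeans performs the calculation and accepts the na.rm argument to skip missing values, while cbind lets you bind the means, under whatever name you want, to the data.frame df.
Regarding the error in the OP's code: inside mean(), A:C is not a column range but the sequence operator, so it builds the integer sequence from the value of A to the value of C and fails when C is NA. (That mean(A:B) matched the mean of A and B is a coincidence, because the mean of an arithmetic sequence equals the mean of its endpoints.) Instead, we can use the concatenate function c to collect those elements into a single vector and then take the mean, as mean only averages its first argument.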
df %>%
rowwise() %>%
mutate(means = mean(c(A, B, C), na.rm = TRUE))
# A B C means
# <dbl> <dbl> <dbl> <dbl>
#1 3 0 9 4.000000
#2 4 6 NA 5.000000
#3 5 8 1 4.666667
Also, we can use rowMeans with transform
transform(df, means = rowMeans(df, na.rm = TRUE))
# A B C means
#1 3 0 9 4.000000
#2 4 6 NA 5.000000
#3 5 8 1 4.666667
Or using data.table
library(data.table)
setDT(df)[, means := rowMeans(.SD, na.rm = TRUE)]
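If you like the A:C column-range syntax from the question, dplyr 1.0.0 and later offer c_across() for exactly this row-wise case. A minimal sketch, assuming a recent dplyr:

```r
library(dplyr)

df <- data.frame(A = c(3, 4, 5), B = c(0, 6, 8), C = c(9, NA, 1))

res <- df %>%
  rowwise() %>%
  # c_across(A:C) collects the columns A through C of the current row
  mutate(means = mean(c_across(A:C), na.rm = TRUE)) %>%
  ungroup()
```

Here A:C is interpreted as a column selection, not as a numeric sequence, so the NA in C is simply dropped by na.rm = TRUE. For wide data, rowMeans will usually be faster than a rowwise() pipeline.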
