How to vectorize nested loops and update a dataframe - r

I have a dataframe with a column called Product (with many products), a column called Timestamp (representing the date in a discrete ordinal variable) and a column called Rating.
I am trying to calculate the moving average and the moving standard deviation for the Rating variable, by each Product, taking into account the Timestamp.
The data looks something like this:
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
Now I add the columns for the moving average and the moving standard deviation:
DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))
And finally, I am using this code with nested for loops to get the result I want:
for (product in unique(DF$Product)) {
  for (timestamp in DF[DF$Product == product, ]$Timestamp) {
    if (timestamp == 1) {
      DF[DF$Product == product &
           DF$Timestamp == timestamp, ]$Moving.avg <-
        DF[DF$Product == product &
             DF$Timestamp == timestamp, ]$Rating
      DF[DF$Product == product &
           DF$Timestamp == timestamp, ]$Moving.sd <- 0
    } else {
      index_start <- which(DF$Product == product &
                             DF$Timestamp == 1)
      index_end <- which(DF$Product == product &
                           DF$Timestamp == timestamp)
      DF[DF$Product == product &
           DF$Timestamp == timestamp, ]$Moving.avg <-
        mean(DF[index_start:index_end, ]$Rating)
      DF[DF$Product == product &
           DF$Timestamp == timestamp, ]$Moving.sd <-
        sd(DF[index_start:index_end, ]$Rating)
    }
  }
}
The code works fine, but it is far too slow.
How can I use vectorization to make this faster?

If you want to do the whole thing vectorised in base R you could try:
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
  do.call(rbind, lapply(seq(nrow(x)), function(y) {
    c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))
  }))
})))
#> Product Timestamp Rating Moving.avg Moving.sd
#> 1 a 1 4 4.000000 NA
#> 2 a 2 3 3.500000 0.7071068
#> 3 a 3 5 4.000000 1.0000000
#> 4 a 4 3 3.750000 0.9574271
#> 5 b 1 3 3.000000 NA
#> 6 b 2 4 3.500000 0.7071068
#> 7 b 3 5 4.000000 1.0000000
#> 8 c 1 3 3.000000 NA
#> 9 c 2 1 2.000000 1.4142136
#> 10 c 3 1 1.666667 1.1547005
#> 11 c 4 2 1.750000 0.9574271
#> 12 c 5 5 2.400000 1.6733201
Note, though, that the sd of a single number is NA rather than 0. If desired, these are simple to replace with DF$Moving.sd[is.na(DF$Moving.sd)] <- 0
Created on 2020-08-31 by the reprex package (v0.3.0)

I think you are looking for the cumulative mean and cumulative standard deviation.
For the cumulative mean you can use dplyr's cummean function, and TTR::runSD with cumulative = TRUE for the cumulative standard deviation.
library(dplyr)
DF %>%
group_by(Product) %>%
mutate(cum_avg = cummean(Rating),
cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE))
# Product Timestamp Rating cum_avg cum_std
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 4 4 NaN
# 2 a 2 3 3.5 0.707
# 3 a 3 5 4 1
# 4 a 4 3 3.75 0.957
# 5 b 1 3 3 NaN
# 6 b 2 4 3.5 0.707
# 7 b 3 5 4 1
# 8 c 1 3 3 NaN
# 9 c 2 1 2 1.41
#10 c 3 1 1.67 1.15
#11 c 4 2 1.75 0.957
#12 c 5 5 2.4 1.67
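If you also want the first value in each group to be 0, as in the original loop, one option is to replace the NaN values in an extra mutate step. A minimal sketch, relying on is.na() being TRUE for NaN:
DF %>%
  group_by(Product) %>%
  mutate(cum_avg = cummean(Rating),
         cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE),
         # is.na(NaN) is TRUE, so this turns the leading NaN into 0
         cum_std = ifelse(is.na(cum_std), 0, cum_std))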

Does this example work for you? Here I am using the runner() function from the runner package. runner() applies a function over a window that you define, and it works fine with dplyr's group_by(). You set the size of the window with the k argument.
library(runner)
library(dplyr)
library(magrittr)
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
DF <- DF %>%
group_by(Product) %>%
arrange(Timestamp, .by_group = T)
DF <- DF %>%
mutate(
average = runner(Rating, f = function(x) mean(x), k = 3),
deviation = runner(Rating, f = function(x) sd(x), k = 3)
)
It is worth mentioning that the function expands the window on the first lines of each group (each Product) in your data.frame until it reaches the size defined by the k argument. So on the first two lines, where there are not yet 3 previous values, runner() applies the function to just those lines.
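If you want the fully cumulative statistics from the question rather than a rolling window of 3, my understanding is that runner() uses a cumulative window by default when k is not supplied; a sketch (please check against your runner version):
DF %>%
  group_by(Product) %>%
  arrange(Timestamp, .by_group = TRUE) %>%
  mutate(
    # no k: the window grows from the first row of each group
    average = runner(Rating, f = mean),
    deviation = runner(Rating, f = sd)
  )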

Building on this answer to a related question, you could also do it this way with dplyr and purrr:
library(dplyr)
library(purrr)
DF <- DF %>%
  # sort by product and then by timestamp within product
  arrange(Product, Timestamp) %>%
  # group data by product
  group_by(Product) %>%
  # use the cumulative mean function to calculate the means
  mutate(Moving.avg = cummean(Rating),
         # use map_dbl to calculate the standard deviation up to each index value
         Moving.sd = map_dbl(seq_along(Timestamp), ~ sd(Rating[1:.x])),
         # set Moving.sd to 0 where Timestamp takes its smallest value
         Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0,
                               TRUE ~ Moving.sd)) %>%
  # ungroup the data
  ungroup()

Tidy way of comparing "tiles" of users

Let's say df presents an aggregated metric in an A/B test with groups A and B. x is, for example, the number of page visits, and n the number of users with that number of visits. (In reality there are far more users and the differences are small.) Note that the number of users differs between the groups.
library(tidyverse)
df <- bind_rows(
tibble(group = "A", x = rpois(100, 1)),
tibble(group = "B", x = rpois(200, 2))
) %>%
count(group, x)
I want to compare tiles of users. By tile, I mean users in group A that have the same x value.
For example, if 34.17% of users in group A have value 0, I want to compare them to the average x of the lowest 34.17% of users in group B. Next, for example, users with 1 visit in group A fall between 34.17% and 74.8%; I want to compare them with users in the same percentile range (but it should be more precise) in group B. And so on.
Here's my try:
n_fake <- 1000
df_agg_per_imp <- df %>%
group_by(group) %>%
mutate(
p_max = n_fake * cumsum(n) / sum(n),
p_min = lag(p_max, default = 0),
p = map2(p_min + 1, p_max, seq)
) %>%
ungroup()
df_agg_per_imp %>%
unnest(p) %>%
pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
group_by(A) %>%
summarise(
p_min = min(p) / n_fake,
p_max = max(p) / n_fake,
rel_uplift = mean(B) / mean(A)
)
#> # A tibble: 6 × 4
#> A p_min p_max rel_uplift
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.001 0.34 Inf
#> 2 1 0.341 0.74 1.92
#> 3 2 0.741 0.91 1.57
#> 4 3 0.911 0.96 1.33
#> 5 4 0.961 0.99 1.21
#> 6 5 0.991 1 1.2
What I don't like is that I have to create a row for each user (and there could be millions) to get the results I want. Is there a simpler/better way to do it?
You may be able to do something like this:
Extend the creation of your initial frame to get the proportions in A and B, and pivot wider:
set.seed(123)
df <- bind_rows(
tibble(group = "A", x = rpois(100, 1)),
tibble(group = "B", x = rpois(200, 2))
) %>%
count(group, x) %>%
group_by(group) %>%
mutate(prop = n/sum(n)) %>%
pivot_wider(id_cols=x, names_from=group,values_from=prop)
With the seed above, this gives you a frame like this:
# A tibble: 7 x 3
x A B
<int> <dbl> <dbl>
1 0 0.35 0.095
2 1 0.38 0.33
3 2 0.21 0.285
4 3 0.04 0.14
5 4 0.02 0.085
6 5 NA 0.055
7 6 NA 0.01
Create a function that estimates the rel_uplift while also returning an updated set of group B proportions and group B values (i.e. x values):
f <- function(a, aval, bvec, bvals) {
  cindex = which(cumsum(bvec) >= a)
  if (length(cindex) == 0) bindex = seq_along(bvec)
  else bindex = 1:min(cindex)
  rem = sum(bvec[bindex]) - a
  bmean = sum(bvals[bindex] * (bvec[bindex] - c(rep(0, length(bindex) - 1), rem)))
  if (length(bindex) > 1) {
    if (rem != 0) bindex = bindex[1:(length(bindex) - 1)]
    bvec = bvec[-bindex]
    bvals = bvals[-bindex]
  }
  bvec[1] = rem
  list("rel_uplift" = bmean / (a * aval), "bvec" = bvec, "bvals" = bvals)
}
Initialize a results data frame and a list called fres which contains the initial bvec and initial bvals:
result=data.frame()
fres = list("bvec" = df$B,"bvals" = df$x)
Use a for loop over the values of df$A, each time getting the rel_uplift and preparing an updated set of bvec and bvals to be used in the next call to the function:
for (a in df %>% filter(!is.na(A)) %>% pull(A)) {
  x = df %>% filter(A == a) %>% pull(x)
  fres = f(a, x, fres[["bvec"]], fres[["bvals"]])
  result = rbind(result, data.frame(x = x, A = a, rel_uplift = fres[["rel_uplift"]]))
}
result
x A rel_uplift
1 0 0.35 Inf
2 1 0.38 1.855263
3 2 0.21 1.726190
4 3 0.04 1.666667
5 4 0.02 1.375000
If I understand correctly, you want to compare counts by two parameters simultaneously, i.e. by $group and by $x.
From the example in the initial post I see that not all values of $x may be available for each group.
Summarizing by two co-variables can be done with base R.
Here is a simple function (assuming that you're always looking at $group and $x):
countnByGroup <- function(xx, asPercent=FALSE) {
  lev <- unique(xx$x)
  grp <- unique(xx$group)
  out <- sapply(grp, function(x) {
    z <- rep(NA, length(lev))
    names(z) <- lev
    w <- which(xx$group == x)
    if (length(w) > 0) z[match(xx$x[w], lev)] <- xx$n[w]
    z
  })
  if (asPercent) out <- 100 * apply(out, 2, function(x) x / sum(x, na.rm=TRUE))
  out
}
Note that in the function above the main argument was called 'xx' to avoid confusion with $x.
df # produced using the code from your example
## A tibble: 13 x 3
# group x n
# <chr> <int> <int>
# 1 A 0 36
# 2 A 1 38
# 3 A 2 19
# 4 A 3 6
# 5 A 4 1
# 6 B 0 27
# 7 B 1 44
# 8 B 2 55
# 9 B 3 44
#10 B 4 21
#11 B 5 6
#12 B 6 2
#13 B 8 1
One gets:
countnByGroup(df)
# A B
#0 36 27
#1 38 44
#2 19 55
#3 6 44
#4 1 21
#5 NA 6
#6 NA 2
#8 NA 1
## and
countnByGroup(df, asPercent=T)
# A B
#0 36 13.5
#1 38 22.0
#2 19 27.5
#3 6 22.0
#4 1 10.5
#5 NA 3.0
#6 NA 1.0
#8 NA 0.5
As long as you don't apply any rounding you'll have the results as precise as they get.
By chance the random values from above didn't produce more digits when processing, so by chance the percent values for A are all integers.
Another interesting option may be to consider two-way tables in R using table().
But in that case you need your entries as separate rows, not already transformed to count data as in your example above.
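For illustration, a minimal sketch of both routes; df_raw is a hypothetical data frame holding the raw, uncounted observations (one row per user, with columns group and x), while xtabs() can work directly from the counted data:
# from raw observations (one row per user)
table(df_raw$x, df_raw$group)
# or directly from the counted data, treating n as frequencies
xtabs(n ~ x + group, data = df)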

R list top n entries in a dataframe by factor or integer

I would like to list the top n entries in a dataframe for each level of a factor or an integer. Here is the code I have:
index <- c(1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3)
prob <- runif(20,0,1)
x.df <- data.frame(cbind(index,prob))
x.df <- x.df[order(x.df$index,-x.df$prob),]
head(x.df[x.df$index==1,],n=3)
head(x.df[x.df$index==2,],n=3)
head(x.df[x.df$index==3,],n=3)
This does the job, but I'd like not to have an explicit head statement for every level/integer. Thanks.
In base R, there is a by-group method called tapply:
with(x.df, stack(tapply(prob, index, head, 3)))
# values ind
#1 0.9045300 1
#2 0.7651376 1
#3 0.3631891 1
#4 0.9471318 2
#5 0.9206743 2
#6 0.7675069 2
#7 0.9866379 3
#8 0.9149754 3
#9 0.7862320 3
And there is also the by function for data.frame:
do.call(rbind, by(x.df, index, head, 3))
which yields the same result
Assuming that your dataframe is ordered the way you want, you can do this:
library(dplyr)
x.df %>%
group_by(index) %>% # for each index
slice(1:3) %>% # get top 3 rows
ungroup() # forget the grouping
# # A tibble: 9 x 2
# index prob
# <dbl> <dbl>
# 1 1 0.943
# 2 1 0.461
# 3 1 0.251
# 4 2 0.739
# 5 2 0.697
# 6 2 0.695
# 7 3 0.968
# 8 3 0.915
# 9 3 0.635
Assuming it's unordered
x.df %>%
group_by(index) %>%
top_n(3) %>%
ungroup()
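In more recent dplyr versions (1.0.0 and later), slice_max() is the recommended replacement for top_n(); a sketch of the equivalent call:
x.df %>%
  group_by(index) %>%
  slice_max(prob, n = 3) %>%   # top 3 prob values per index
  ungroup()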
An easy solution using the data.table package:
> setDT(x.df)[,head(.SD,3),by=index]
Output:
index prob
1: 1 0.7863076
2: 1 0.7103228
3: 1 0.5657803
4: 2 0.9939695
5: 2 0.7517839
6: 2 0.7348664
7: 3 0.9260537
8: 3 0.5889305
9: 3 0.5557626
Note: if your prob is not ordered, then use this:
> setDT(x.df)[order(-prob),head(.SD,3),by=index]

Cumulative aggregates within tidyverse

Say I have a tibble (or data.table) which consists of two columns:
a <- tibble(id = rep(c("A", "B"), each = 6), val = c(1, 0, 0, 1 ,0,1,0,0,0,1,1,1))
Furthermore I have a function called myfun which takes a numeric vector of arbitrary length as input and returns a single number. For example, you can think of myfun as being the standard deviation.
Now I would like to create a third column to my tibble (called result) which contains the outputs of myfun applied to val cumulated and grouped with respect to id.
For example, the first entry of result should contain myfun(val[1]).
The second entry should contain myfun(val[1:2]), and so on.
In other words, I would like to implement a cumulative version of myfun.
Of course there are a lot of easy solutions outside the tidyverse using loops and whatnot,
but I would be interested in a solution within the tidyverse or the data.table framework.
Any help is appreciated.
You could do it this way:
library(tidyverse)
a %>%
group_by(id) %>%
mutate(y = map_dbl(seq_along(val),~sd(val[1:.x]))) %>%
ungroup
# # A tibble: 12 x 3
# id val y
# <chr> <dbl> <dbl>
# 1 A 1 NA
# 2 A 0 0.7071068
# 3 A 0 0.5773503
# 4 A 1 0.5773503
# 5 A 0 0.5477226
# 6 A 1 0.5477226
# 7 B 0 NA
# 8 B 0 0.0000000
# 9 B 0 0.0000000
# 10 B 1 0.5000000
# 11 B 1 0.5477226
# 12 B 1 0.5477226
Explanation
We first group, as is usual with tidyverse chains, and then use mutate rather than summarize, because we want to keep the unaggregated rows.
The function map_dbl is used here to loop over a vector of final indices; seq_along(val) is 1:6 for both groups here.
With functions from the map family we can use the ~ notation, in which the first parameter of the anonymous function is referred to as .x.
Looping through these indices we first compute sd(val[1:1]), which is sd(val[1]) and therefore NA, then sd(val[1:2]), and so on.
map_dbl returns a vector of doubles by design, and these are stored in the y column.
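As a small standalone illustration of the ~ notation (assuming purrr is loaded), the two calls below are equivalent:
library(purrr)
map_dbl(1:3, ~ .x^2)            # 1 4 9
map_dbl(1:3, function(.x) .x^2) # same result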
One can use zoo::rollapplyr with a dynamic width (a vector of window widths). To prepare a dynamic width for each group, 1:n() or seq(n()) can be used.
Let's apply it to the sd function using the data provided by the OP:
library(dplyr)
library(zoo)
a %>% group_by(id) %>%
mutate(y = rollapplyr(val, 1:n(), sd ))
# # Groups: id [2]
# id val y
# <chr> <dbl> <dbl>
# 1 A 1.00 NA
# 2 A 0 0.707
# 3 A 0 0.577
# 4 A 1.00 0.577
# 5 A 0 0.548
# 6 A 1.00 0.548
# 7 B 0 NA
# 8 B 0 0
# 9 B 0 0
# 10 B 1.00 0.500
# 11 B 1.00 0.548
# 12 B 1.00 0.548

Divide (and name) one group of columns by another group in dplyr

After a (very scary) dplyr pipeline I've ended up with a dataset like this:
year A B C [....] Z count.A count.B count.C [....] count.Z
1999 10 20 10 ... 6 3 5 67 ... 6
2000 3 5 5 ... 7 5 2 5 ... 5
Some example data to reproduce:
df <- data.frame(year = c(1999, 2000),
A = c(10, 20),
B = c(3, 6),
C = c(1, 2),
count.A = c(1, 2),
count.B = c(8, 9),
count.C = c(5, 7))
What I really need is to combine each column with its "count" counterpart, i.e.
weight.A = A / count.A,
weight.B = B / count.B
I have to do this programmatically, as I have hundreds of columns. Is there a way to do that in a dplyr pipeline?
Don't store variables in column names. If you reshape your data to make it tidy, the calculation is really simple:
library(tidyverse)
df %>% gather(var, val, -year) %>% # reshape to long
separate(var, c('var', 'letter'), fill = 'left') %>% # extract var from former col names
mutate(var = coalesce(var, 'value')) %>% # add name for unnamed var
spread(var, val) %>% # reshape back to wide
mutate(weight = value / count) # now this is very simple
#> year letter count value weight
#> 1 1999 A 1 10 10.0000000
#> 2 1999 B 8 3 0.3750000
#> 3 1999 C 5 1 0.2000000
#> 4 2000 A 2 20 10.0000000
#> 5 2000 B 9 6 0.6666667
#> 6 2000 C 7 2 0.2857143
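Note that gather() and spread() have since been superseded by pivot_longer() and pivot_wider(). A sketch of the same idea with the newer verbs, assuming tidyr >= 1.0.0 and dplyr >= 1.0.0; the value columns are renamed first so a single "." separator can be split on:
df %>%
  rename_with(~ paste0("value.", .x), c(A, B, C)) %>%                       # A -> value.A, etc.
  pivot_longer(-year, names_to = c("var", "letter"), names_sep = "\\.") %>% # reshape to long
  pivot_wider(names_from = var, values_from = value) %>%                    # one row per year/letter
  mutate(weight = value / count)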
If your columns are consistently named (and easy enough to retrieve) you could easily do this using an lapply:
cols <- c("A","B","C")
df[,paste0("weighted.",cols)] <- lapply(cols, function(x) df[,x] / df[, paste0("count.",x)])
# year A B C count.A count.B count.C weighted.A weighted.B weighted.C
#1 1999 10 3 1 1 8 5 10 0.3750000 0.2000000
#2 2000 20 6 2 2 9 7 10 0.6666667 0.2857143
Assuming that the columns are in order, we can use data.table. Specify the columns of interest in .SDcols, divide the first half of the Subset of Data.table (.SD) by the other half, and assign (:=) the result to new columns.
library(data.table)
setDT(df)[, paste0("weighted.", names(df)[2:4]) := .SD[, 1:3] / .SD[, 4:6], .SDcols = A:count.C]
df
# year A B C count.A count.B count.C weighted.A weighted.B weighted.C
#1: 1999 10 3 1 1 8 5 10 0.3750000 0.2000000
#2: 2000 20 6 2 2 9 7 10 0.6666667 0.2857143
Assuming you can programmatically create a vector of all the relevant column names, here is how I'd do it for your example above:
for (c.name in c("A", "B", "C")) {
  c.weight <- sprintf("weight.%s", c.name)
  c.count <- sprintf("count.%s", c.name)
  df[, c.weight] <- df[, c.name] / df[, c.count]
}
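For completeness, one way to build that vector from the count.* names rather than typing it out; a sketch that assumes every count.X column has a matching X column:
# take the count.* column names and strip the "count." prefix
cols <- sub("^count\\.", "", grep("^count\\.", names(df), value = TRUE))
cols
# [1] "A" "B" "C"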

R: How to calculate mean for each row with missing values using dplyr

I want to calculate means over several columns for each row of my dataframe, which contains missing values, and place the results in a new column called 'means'. Here's my dataframe:
df <- data.frame(A=c(3,4,5),B=c(0,6,8),C=c(9,NA,1))
A B C
1 3 0 9
2 4 6 NA
3 5 8 1
The code below successfully accomplishes the task if columns have no missing values, such as columns A and B.
library(dplyr)
df %>%
rowwise() %>%
mutate(means=mean(A:B, na.rm=T))
A B C means
<dbl> <dbl> <dbl> <dbl>
1 3 0 9 1.5
2 4 6 NA 5.0
3 5 8 1 6.5
However, if a column has missing values, such as C, then I get an error:
> df %>% rowwise() %>% mutate(means=mean(A:C, na.rm=T))
Error: NA/NaN argument
Ideally, I'd like to implement it with dplyr.
df %>%
mutate(means=rowMeans(., na.rm=TRUE))
The . is a "pronoun" that references the data frame df that was piped into mutate.
A B C means
1 3 0 9 4.000000
2 4 6 NA 5.000000
3 5 8 1 4.666667
You can also select only specific columns to include, using all the usual methods (column names, indices, grep, etc.).
df %>%
mutate(means=rowMeans(.[ , c("A","C")], na.rm=TRUE))
A B C means
1 3 0 9 6
2 4 6 NA 4
3 5 8 1 3
It is simple to accomplish in base R as well:
cbind(df, "means"=rowMeans(df, na.rm=TRUE))
A B C means
1 3 0 9 4.000000
2 4 6 NA 5.000000
3 5 8 1 4.666667
rowMeans performs the calculation and its na.rm argument skips missing values, while cbind lets you bind the means, under whatever name you want, to the data.frame df.
Regarding the error in the OP's code, we can use the concatenation function c to combine those elements into a single vector and then take the mean, since mean expects a single vector argument.
df %>%
rowwise() %>%
mutate(means = mean(c(A, B, C), na.rm = TRUE))
# A B C means
# <dbl> <dbl> <dbl> <dbl>
#1 3 0 9 4.000000
#2 4 6 NA 5.000000
#3 5 8 1 4.666667
Also, we can use rowMeans with transform
transform(df, means = rowMeans(df, na.rm = TRUE))
# A B C means
#1 3 0 9 4.000000
#2 4 6 NA 5.000000
#3 5 8 1 4.666667
Or using data.table
library(data.table)
setDT(df)[, means := rowMeans(.SD, na.rm = TRUE)]
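As with the dplyr version above, specific columns can be selected for the row means via .SDcols; a sketch, assuming columns A and C as in the earlier example:
setDT(df)[, means := rowMeans(.SD, na.rm = TRUE), .SDcols = c("A", "C")]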
