Let's say df presents an aggregated metric in an AB test with groups A and B. x is, for example, the number of page visits, and n the number of users with that many visits. (In reality there are far more users and the differences are small.) Note that the number of users differs between groups.
library(tidyverse)

df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>%
  count(group, x)
I want to compare tiles of users. By tile, I mean users in group A that share the same x value.
For example, if 34.17% of users in group A have value 0, I want to compare it to the average x of the lowest 34.17% of users in group B. Next, users with 1 visit in group A sit between the 34.17% and 74.8% percentiles, so I want to compare them with the same percentile range (the exact boundaries, not these rounded figures) of users in group B. And so on.
Here's my try:
n_fake <- 1000
df_agg_per_imp <- df %>%
  group_by(group) %>%
  mutate(
    p_max = n_fake * cumsum(n) / sum(n),  # tile's upper bound on a 1..n_fake scale
    p_min = lag(p_max, default = 0),      # tile's lower bound
    p = map2(p_min + 1, p_max, seq)       # one element per (fake) user -- the expensive part
  ) %>%
  ungroup()
df_agg_per_imp %>%
  unnest(p) %>%
  pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
  group_by(A) %>%
  summarise(
    p_min = min(p) / n_fake,
    p_max = max(p) / n_fake,
    rel_uplift = mean(B) / mean(A)
  )
#> # A tibble: 6 × 4
#> A p_min p_max rel_uplift
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.001 0.34 Inf
#> 2 1 0.341 0.74 1.92
#> 3 2 0.741 0.91 1.57
#> 4 3 0.911 0.96 1.33
#> 5 4 0.961 0.99 1.21
#> 6 5 0.991 1 1.2
What I don't like is that I have to create a row for each user (and this could be millions) to get the results I want. Is there a simpler/better way to do it?
You may be able to do something like this:
Extend the creation of your initial frame to get the proportion in A and B, and pivot wider:
set.seed(123)

df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>%
  count(group, x) %>%
  group_by(group) %>%
  mutate(prop = n / sum(n)) %>%
  pivot_wider(id_cols = x, names_from = group, values_from = prop)
With the seed above, this gives you a frame like this:
# A tibble: 7 x 3
x A B
<int> <dbl> <dbl>
1 0 0.35 0.095
2 1 0.38 0.33
3 2 0.21 0.285
4 3 0.04 0.14
5 4 0.02 0.085
6 5 NA 0.055
7 6 NA 0.01
Create a function that estimates the rel_uplift, while also returning an updated set of group B proportions (bvec) and group B values (bvals, i.e. the x values):
f <- function(a, aval, bvec, bvals) {
  # a: proportion of group A in the current tile; aval: its x value
  # bvec / bvals: remaining group B proportions and x values
  cindex <- which(cumsum(bvec) >= a)
  if (length(cindex) == 0) bindex <- seq_along(bvec)  # B mass already used up
  else bindex <- 1:min(cindex)                        # B buckets needed to cover mass a
  rem <- sum(bvec[bindex]) - a                        # overshoot in the last bucket used
  # weighted sum of B values over exactly mass a
  bmean <- sum(bvals[bindex] * (bvec[bindex] - c(rep(0, length(bindex) - 1), rem)))
  if (length(bindex) > 1) {
    if (rem != 0) bindex <- bindex[1:(length(bindex) - 1)]  # keep the partly used bucket
    bvec <- bvec[-bindex]
    bvals <- bvals[-bindex]
  }
  bvec[1] <- rem  # carry the overshoot forward as the first remaining bucket's mass
  list("rel_uplift" = bmean / (a * aval), "bvec" = bvec, "bvals" = bvals)
}
Initiate a results dataframe, and a list called fres which contains the initial bvec and bvals:
result <- data.frame()
fres <- list("bvec" = df$B, "bvals" = df$x)
Use a for loop over the values of df$A, each time getting the rel_uplift and preparing the updated set of bvec and bvals to be fed back into the function:
for (a in df %>% filter(!is.na(A)) %>% pull(A)) {
  x <- df %>% filter(A == a) %>% pull(x)
  fres <- f(a, x, fres[["bvec"]], fres[["bvals"]])
  result <- rbind(result, data.frame(x = x, A = a, rel_uplift = fres[["rel_uplift"]]))
}
result
x A rel_uplift
1 0 0.35 Inf
2 1 0.38 1.855263
3 2 0.21 1.726190
4 3 0.04 1.666667
5 4 0.02 1.375000
If I understand right, you want to compare counts by two parameters simultaneously, i.e. by $group and by $x.
From the example in the initial post I see that not all values of $x may be available for each group.
Summarizing by two co-variables can be done in base R.
Here is a simple function (assuming that you're always looking at $group and $x):
countnByGroup <- function(xx, asPercent = FALSE) {
  lev <- unique(xx$x)
  grp <- unique(xx$group)
  out <- sapply(grp, function(x) {
    z <- rep(NA, length(lev))
    names(z) <- lev
    w <- which(xx$group == x)
    if (length(w) > 0) z[match(xx$x[w], lev)] <- xx$n[w]
    z
  })
  if (asPercent) out <- 100 * apply(out, 2, function(x) x / sum(x, na.rm = TRUE))
  out
}
Note: in the function above the main variable is called 'xx' to avoid confusion with $x.
df # produced using the code from your example
## A tibble: 13 x 3
# group x n
# <chr> <int> <int>
# 1 A 0 36
# 2 A 1 38
# 3 A 2 19
# 4 A 3 6
# 5 A 4 1
# 6 B 0 27
# 7 B 1 44
# 8 B 2 55
# 9 B 3 44
#10 B 4 21
#11 B 5 6
#12 B 6 2
#13 B 8 1
One gets:
countnByGroup(df)
# A B
#0 36 27
#1 38 44
#2 19 55
#3 6 44
#4 1 21
#5 NA 6
#6 NA 2
#8 NA 1
## and
countnByGroup(df, asPercent=T)
# A B
#0 36 13.5
#1 38 22.0
#2 19 27.5
#3 6 22.0
#4 1 10.5
#5 NA 3.0
#6 NA 1.0
#8 NA 0.5
As long as you don't apply any rounding, the results stay as precise as it gets.
By chance the random values from above didn't produce more digits during processing, so the percent values for A all happen to be integers.
Another interesting option may be to consider two-way tables in R using table().
But in that case you need your entries as separate lines, not already transformed to count data as in your example above.
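For completeness, a quick sketch using base R's xtabs() instead, which does accept a count column on the left-hand side of its formula, so the already-aggregated df works directly (no expansion back to one row per user needed); note that missing combinations come out as 0 rather than NA here:
# cross-tabulate the aggregated counts: rows = x, columns = group
xtabs(n ~ x + group, data = df)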
Related
I have a dataset containing about 60 variables (A, B, C, D, ...), each with 3 corresponding information columns (A, Group_A and WOE_A) as in the list below:
ID A Group_A WOE_A B Group_B WOE_B C Group_C WOE_C D Group_D WOE_D Status
213 0 1 0.87 0 1 0.65 0 1 0.80 915.7 4 -0.30 1
321 12 5 0.08 4 4 -0.43 6 5 -0.20 85.3 2 0.26 0
32 0 1 0.87 0 1 0.65 0 1 0.80 28.6 2 0.26 1
13 7 4 -0.69 2 3 -0.82 4 4 -0.80 31.8 2 0.26 0
43 1 2 -0.04 1 2 -0.49 1 2 -0.22 51.7 2 0.26 0
656 2 3 -0.28 2 3 -0.82 2 3 -0.65 8.5 1 1.14 0
435 2 3 -0.28 0 1 0.65 0 1 0.80 39.8 2 0.26 0
65 8 4 -0.69 3 4 -0.43 5 4 -0.80 243.0 3 0.00 0
565 0 1 0.87 0 1 0.65 0 1 0.80 4.0 1 1.14 0
432 0 1 0.87 0 1 0.65 0 1 0.80 81.6 2 0.26 0
I want to print a table in R with some statistics (Min(A), Max(A), WOE_A, Count(Group_A), Count(Group_A, where Status=1), Count(Group_A, where Status=0)), all grouped by Group for each of the 60 variables and I think I need to perform it in a loop.
I tried the "dplyr" package, but I don't know how to refer to all the three columns (A, Group_A and WOE_A) that relate to a variable (A) and also how to summarize the information for all the desired statistics.
The code I began with is:
df <- data
List <- list(df)
for (colname in colnames(df)) {
  List[[colname]] <- df %>%
    group_by(df[, colname]) %>%
    count()
}
List
This is how I want to print results:
Var A
Group  Min(A)  Max(A)  WOE_A  Count(Group_A)  Count_1(Group_A, where Status=1)  Count_0(Group_A, where Status=0)
1
2
3
4
5
Thank you very much!
Laura
Laura, as mentioned by the others, working with "long" data frames is better than working with wide ones.
Your initial idea using dplyr and group_by() got you almost there.
Note: this is also a way to break your data down and then combine it with generic columns, if the wide-to-long reshape pushes the limits.
Let's start with this:
library(dplyr)

#---------- extract all "A" measurements
df %>%
  select(A, Group_A, WOE_A, Status) %>%
  #---------- grouped summary of multiple stats
  group_by(A) %>%
  summarise(
    Min = min(A)
    , Max = max(A)
    , WOE_A = unique(WOE_A)
    , Count = n()                      # n() is a helper function of dplyr
    , CountStatus1 = sum(Status == 1)  # use sum() to count logical conditions
    , CountStatus0 = sum(Status == 0)
  )
This yields:
# A tibble: 6 x 7
A Min Max WOE_A Count CountStatus1 CountStatus0
<dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 0 0 0 0.87 4 2 2
2 1 1 1 -0.04 1 0 1
3 2 2 2 -0.28 2 0 2
4 7 7 7 -0.69 1 0 1
5 8 8 8 -0.69 1 0 1
6 12 12 12 0.08 1 0 1
OK. Turning your wide dataframe into a long one is not trivial, as you nest measurements and variable names. On top, ID and Status are id/key variables for each row.
The standard tool to convert wide to long is tidyr's pivot_longer(). Read up on this.
In your particular case we want to push multiple columns into multiple targets. For this you need to get a feel for the .value sentinel. The pivot_longer() help pages are useful for studying this case.
To ease the pain of constructing a complex regex to decode the variable names, I rename your group-id labels, e.g. A, B, to X_A, X_B. This ensures that all column names have the form what_letter.
library(tidyr)

df %>%
  # ----------- prepare variable names to be well-formed, you may do this upstream
  rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
  # ----------- call pivot_longer with the .value sentinel and names_pattern
  # ----------- (an advanced use of its capabilities)
  pivot_longer(
    cols = -c("ID", "Status")          # apply to all cols besides ID and Status
    , names_to = c(".value", "label")  # target column names are based on origin names
                                       # plus an individual label (think id, name as you like)
    , names_pattern = "(.*)(.*_[A-D]{1})$"  # regex for the origin column patterns,
                                       # built of 2 parts (...)(...):
                                       # (.*) no or any symbol, possibly multiple times
                                       # (.*_[A-D]{1}) as above, but ending with underscore and 1 letter
  )
This gives you
# A tibble: 40 x 6
ID Status label X Group WOE
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 213 1 _A 0 1 0.87
2 213 1 _B 0 1 0.65
3 213 1 _C 0 1 0.8
4 213 1 _D 916. 4 -0.3
5 321 0 _A 12 5 0.08
6 321 0 _B 4 4 -0.43
7 321 0 _C 6 5 -0.2
8 321 0 _D 85.3 2 0.26
9 32 1 _A 0 1 0.87
10 32 1 _B 0 1 0.65
Putting it all together:
df %>%
  # ------------ prepare and make long
  rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
  pivot_longer(cols = -c("ID", "Status")
               , names_to = c(".value", "label")
               , names_pattern = "(.*)(.*_[A-D]{1})$") %>%
  # ------------- calculate stats on groups
  group_by(label, X) %>%
  summarise(Min = min(X), Max = max(X), WOE = unique(WOE)
            , Count = n(), CountStatus1 = sum(Status == 1)
            , CountStatus0 = sum(Status == 0)
  )
Voila:
# A tibble: 27 x 8
# Groups: label [4]
label X Min Max WOE Count CountStatus1 CountStatus0
<chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 _A 0 0 0 0.87 4 2 2
2 _A 1 1 1 -0.04 1 0 1
3 _A 2 2 2 -0.28 2 0 2
4 _A 7 7 7 -0.69 1 0 1
5 _A 8 8 8 -0.69 1 0 1
6 _A 12 12 12 0.08 1 0 1
7 _B 0 0 0 0.65 5 2 3
8 _B 1 1 1 -0.49 1 0 1
9 _B 2 2 2 -0.82 2 0 2
10 _B 3 3 3 -0.43 1 0 1
# ... with 17 more rows
The loop that I managed to write is below.
Apart from the tables I wanted to list, I also needed a chart showing some of the information from each table, and then to print a PDF with each variable's table and chart on a separate page.
data <- as.data.frame(data)

# 5 is the column where the first information related to a variable is, so for
# each variable I build the data from its related columns;
# each variable has 3 columns (Value, Group, WOE)
for (i in seq(5, 223, 3)) {
  ID <- data[, 1]
  A <- data[, i]
  Group <- data[, i + 1]
  WOE <- data[, i + 2]
  Status <- data[, 224]
  df <- data.frame(ID, A, Group, WOE, Status)

  # Build table T with its corresponding statistics
  T <- df %>%
    select(A, Group, WOE, Status) %>%
    group_by(Group) %>%
    summarise(
      Min = min(A, na.rm = TRUE), Max = max(A, na.rm = TRUE), WOE = unique(WOE),
      Count = n(),
      CountStatus1 = sum(Status == 1),
      CountStatus0 = sum(Status == 0),
      BadRate = round((CountStatus1 / Count) * 100, 1))

  print(colnames(data)[i])
  print(T)

  # Then plot some information from table T
  p <- ggplot(T) +
    geom_col(aes(x = Group, y = CountStatus1), size = 1, color = "darkgreen", fill = "darkgreen") +
    geom_line(aes(x = Group, y = WOE * 1000), col = "firebrick", size = 0.9) +
    geom_point(aes(x = Group, y = WOE * 1000), col = "gray", size = 3) +
    ggtitle(label = paste("WOE and Event Count by Group", " - ", colnames(data)[i])) +
    labs(x = "Group", y = "Event Count", size = 7) +
    theme(plot.title = element_text(size = 8, face = "bold", margin = margin(10, 0, 10, 0)),
          axis.text.x = element_text(angle = 0, hjust = 1)) +
    scale_y_continuous(sec.axis = sec_axis(trans = ~ . / 1000, name = "WOE", breaks = seq(-3, 5, 0.5)))
  print(p)
}
The information is printed for all the variables I need, as in the pictures below:
[Table for one of the variables]
[Chart for the same variable]
However, I now encounter a problem with exporting the results: I do not know how to print each table and chart on a distinct page of a PDF.
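One way to get there, as a minimal sketch (assuming the gridExtra package for drawing the table; the file name and page size are placeholders): keep a single pdf() device open across iterations. Each grid.newpage() and each printed ggplot starts a new page, so every variable's table and chart land on their own pages:
library(grid)       # grid.newpage()
library(gridExtra)  # grid.table()

pdf("results.pdf", width = 10, height = 7)  # one multi-page PDF
for (i in seq(5, 223, 3)) {
  # ... build T and p exactly as in the loop above ...
  grid.newpage()  # fresh page for the table
  grid.table(T)   # draw the summary table as a graphical object
  print(p)        # printing the ggplot opens the next page automatically
}
dev.off()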
I have data by segment over time. Each segment has a different range: for example, A runs from 1 to 3 while C runs from 1 to 7.
For each segment there can be missing times, for which I want to interpolate (linear, spline, etc.).
How can I do it within dplyr?
have <- data.frame(time =c(1,3,1,2,5,1,3,5,7),
segment=c('A','A','B','B','B','C','C','C','C'),
toInterpolate= c(0.12,0.31,0.15,0.24,0.55,0.11,0.35,0.53,0.79))
have
want <- data.frame(time =c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7),
segment=c('A','A','A','B','B','B','B','B','C','C','C','C','C','C','C'),
Interpolated= c(0.12,0.21,0.31,0.15,0.24,0.34,0.41,0.55,0.11,0.28,0.35,0.45,0.53,0.69,0.79))
# note: the Interpolated values here are just placeholders (not based on actual linear/spline interpolation)
want
We can use complete to complete the sequence and na.spline from zoo for interpolation.
library(dplyr)
library(tidyr)
library(zoo)
have %>%
group_by(segment) %>%
complete(time = min(time):max(time)) %>%
mutate(toInterpolate = na.spline(toInterpolate))
# segment time toInterpolate
# <chr> <dbl> <dbl>
# 1 A 1 0.12
# 2 A 2 0.215
# 3 A 3 0.31
# 4 B 1 0.15
# 5 B 2 0.24
# 6 B 3 0.337
# 7 B 4 0.44
# 8 B 5 0.55
# 9 C 1 0.11
#10 C 2 0.246
#11 C 3 0.35
#12 C 4 0.439
#13 C 5 0.53
#14 C 6 0.641
#15 C 7 0.79
To create the sequence at a smaller granularity, complete the finer grid first and interpolate afterwards (otherwise the newly added time points are left as NA):
have %>%
  group_by(segment) %>%
  complete(time = seq(min(time), max(time), 0.1)) %>%
  mutate(toInterpolate = na.spline(toInterpolate))
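If you want linear rather than spline interpolation (the question mentions both), zoo's na.approx is a drop-in replacement for na.spline in either pipeline:
have %>%
  group_by(segment) %>%
  complete(time = min(time):max(time)) %>%
  mutate(toInterpolate = na.approx(toInterpolate))  # linear instead of spline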
I have a multilevel dataset df on my hands with the following organization:
ID Eye Video_number Time Day measurement1
40001 L 1 1 1 0.60
40001 L 2 1 1 0.50
40001 L 3 1 1 0.80
40001 L 1 2 1 0.60
40001 L 2 2 1 0.60
40001 L 3 2 1 0.60
Goal: I am trying to replace cell values of measurements that have a coefficient of variation above 45 with NA, since these values are probably less precise and should be excluded.
The coefficient of variation (often denoted CV) of a distribution is defined as the ratio of the standard deviation to the mean, with $\mu$ and $\sigma$ obtained from the raw data.
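In symbols, expressed in percent as in the cv() helper below:
$$\mathrm{CV} = \frac{\sigma}{\mu} \times 100$$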
I obtained the CV values by Time unit (averaging the measurements of the three videos within one Time unit) with the following function and for loop. I got help from the following threads:
How to correctly use group_by() and summarise() in a For loop in R
Append data frames together in a for loop
# Define function
cv <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100
}

# Variables
vars <- c("measurement1", "measurement2", "measurement3")

# Create a table with all CV values by ID, Eye, Day, and Time
df_cv <- data.frame()
for (i in vars) {
  df <- df.m2
  df$values <- df[, which(colnames(df.m2) == i)]
  x <- df %>%
    group_by(ID, Eye, Day, Time) %>%
    summarise(Count = n(),
              Mean = mean(values, na.rm = TRUE),
              SD = sd(values, na.rm = TRUE),
              CV = cv(values)) %>%
    mutate(Variable = paste(i, "cv", sep = "_"))
  df_cv <- rbind(df_cv, x)
  df_cv$CV[is.nan(df_cv$CV)] <- 0  # 0/0 in the CV formula gives NaN
}
It resulted in the following table df_cv:
ID Eye Day Time Count Mean SD CV Variable
40001 L 1 1 3 0.56666667 0.057735027 10.1885342 measurement1_cv
40001 L 1 2 3 0.36666667 0.404145188 110.2214150 measurement1_cv
40001 L 1 3 3 0.50000000 0.000000000 0.0000000 measurement1_cv
I reformatted df_cv above to wide format (variables and CVs across the row rather than down a column) using dcast. This enabled me to merge the CVs back into the original df:
df_cv <- dcast(df_cv, ID + Eye + Day + Time ~ Variable, value.var = "CV")
df <- merge(df, df_cv, by = c("ID", "Eye", "Day", "Time"))
ID Eye Video_number Time Day measurement1 measurement1_cv
40001 L 1 1 1 0.60 10.1885342
40001 L 2 1 1 0.50 10.1885342
40001 L 3 1 1 0.80 10.1885342
40001 L 1 2 1 0.80 110.2214150
40001 L 2 2 1 0.30 110.2214150
40001 L 3 2 1 0.00 110.2214150
I now want to set the cells of measurement1 that have a CV > 45 to NA. I know how to do this measurement by measurement, but I was wondering whether a for loop could do it, since I am analyzing a lot of variables.
df$measurement1[df$measurement1_cv>45]<-NA
df$measurement2[df$measurement2_cv>45]<-NA
df$measurement3[df$measurement3_cv>45]<-NA
Below are my failed attempts:
for (i in vars) {
df<-df.m3
df$i[df$i_cv>45]<-NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) :
replacement has 0 rows, data has 609
for (i in vars) {
df<-df.m3
df$i[df$paste(i,"_cv")>45]<-NA
}
Error in df$paste(i, "_cv") : attempt to apply non-function
Any help is greatly appreciated!
Two thoughts. But first, augmenting your data a little:
df$measurement2 <- 0.9; df$measurement2_cv <- c(rep(46, 3), rep(44, 3))
library(dplyr)
First: an inline function that operates on one group at a time (i.e., it ignores grouping, so it must be used within dplyr::do).
myfunc <- function(x) {
  CV <- grep("^measurement.*_cv", colnames(x), value = TRUE)  # the *_cv columns
  MEAS <- gsub("_cv$", "", CV)                                # their measurement counterparts
  CV <- CV[MEAS %in% colnames(x)]
  MEAS <- MEAS[MEAS %in% colnames(x)]
  x[, MEAS] <- Map(function(meas, cv) replace(meas, cv > 45, NA), x[, MEAS], x[, CV])
  x
}
df %>%
  group_by(ID, Eye, Day, Time) %>%
  do(myfunc(.))
# # A tibble: 6 x 9
# # Groups: ID, Eye, Day, Time [2]
# ID Eye Video_number Time Day measurement1 measurement1_cv measurement2 measurement2_cv
# <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 40001 L 1 1 1 0.6 10.2 NA 46
# 2 40001 L 2 1 1 0.5 10.2 NA 46
# 3 40001 L 3 1 1 0.8 10.2 NA 46
# 4 40001 L 1 2 1 NA 110. 0.9 44
# 5 40001 L 2 2 1 NA 110. 0.9 44
# 6 40001 L 3 2 1 NA 110. 0.9 44
Second: pivot, calculate, then unpivot, using tidyr.
library(tidyr) # pivot_*
df %>%
  rename_with(.fn = ~ paste0(., "_val"), .cols = matches("^meas.*[^v]$")) %>%
  rename_with(.fn = ~ gsub("(.*)_(.*)", "\\2_\\1", .), .cols = starts_with("meas")) %>%
  pivot_longer(matches("meas"), names_sep = "_", names_to = c(".value", "meas")) %>%
  mutate(val = if_else(cv > 45, NA_real_, val)) %>%
  pivot_wider(1:5, names_from = "meas", names_sep = "_", values_from = c("val", "cv"))
# # A tibble: 6 x 9
# ID Eye Video_number Time Day val_measurement1 val_measurement2 cv_measurement1 cv_measurement2
# <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 40001 L 1 1 1 0.6 NA 10.2 46
# 2 40001 L 2 1 1 0.5 NA 10.2 46
# 3 40001 L 3 1 1 0.8 NA 10.2 46
# 4 40001 L 1 2 1 NA 0.9 110. 44
# 5 40001 L 2 2 1 NA 0.9 110. 44
# 6 40001 L 3 2 1 NA 0.9 110. 44
(I admit that this looks/feels wonky; it shouldn't need all of that renaming, and you then have to un-rename for the final step. Perhaps the first solution is cleaner/better :-)
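For reference, a minimal sketch of the direct fix for the loop in the question: df$i looks for a column literally named "i", and df$paste(...) tries to call a column as a function. Indexing with [[ ]] and the constructed column name works:
for (i in vars) {
  df[[i]][df[[paste0(i, "_cv")]] > 45] <- NA  # index columns by name, not with $
}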
I have a dataframe with a column called Product (with many products), a column called Timestamp (representing the date in a discrete ordinal variable) and a column called Rating.
I am trying to calculate the moving average and moving standard deviation of the Rating variable for each Product, taking the Timestamp into account.
The data looks something like this:
DF <- data.frame(Product = c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating = c(4,3,5,3,3,4,5,3,1,1,2,5))
Now I add the columns for the moving average and the moving standard deviation:
DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))
And finally, I am using this code with nested for loops to get the result I want:
for (product in unique(DF$Product)) {
  for (timestamp in DF[DF$Product == product, ]$Timestamp) {
    if (timestamp == 1) {
      DF[DF$Product == product & DF$Timestamp == timestamp, ]$Moving.avg <-
        DF[DF$Product == product & DF$Timestamp == timestamp, ]$Rating
      DF[DF$Product == product & DF$Timestamp == timestamp, ]$Moving.sd <- 0
    } else {
      index_start <- which(DF$Product == product & DF$Timestamp == 1)
      index_end <- which(DF$Product == product & DF$Timestamp == timestamp)
      DF[DF$Product == product & DF$Timestamp == timestamp, ]$Moving.avg <-
        mean(DF[index_start:index_end, ]$Rating)
      DF[DF$Product == product & DF$Timestamp == timestamp, ]$Moving.sd <-
        sd(DF[index_start:index_end, ]$Rating)
    }
  }
}
The code works fine but it is too slow.
I wonder how I can use vectorization to make this faster?
If you want to do the whole thing vectorised in base R you could try:
DF <- data.frame(Product = c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating = c(4,3,5,3,3,4,5,3,1,1,2,5))

cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
  do.call(rbind, lapply(seq(nrow(x)), function(y) {
    c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))
  }))
})))
#> Product Timestamp Rating Moving.avg Moving.sd
#> 1 a 1 4 4.000000 NA
#> 2 a 2 3 3.500000 0.7071068
#> 3 a 3 5 4.000000 1.0000000
#> 4 a 4 3 3.750000 0.9574271
#> 5 b 1 3 3.000000 NA
#> 6 b 2 4 3.500000 0.7071068
#> 7 b 3 5 4.000000 1.0000000
#> 8 c 1 3 3.000000 NA
#> 9 c 2 1 2.000000 1.4142136
#> 10 c 3 1 1.666667 1.1547005
#> 11 c 4 2 1.750000 0.9574271
#> 12 c 5 5 2.400000 1.6733201
Note though that the sd of a single number is NA rather than 0. If desired, it would be simple to replace these with DF$Moving.sd[is.na(DF$Moving.sd)] <- 0
Created on 2020-08-31 by the reprex package (v0.3.0)
I think you are looking for the cumulative mean and the cumulative standard deviation.
For the cumulative mean you can use the cummean function, and TTR::runSD with cumulative = TRUE for the cumulative standard deviation.
library(dplyr)
DF %>%
  group_by(Product) %>%
  mutate(cum_avg = cummean(Rating),
         cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE))
# Product Timestamp Rating cum_avg cum_std
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 4 4 NaN
# 2 a 2 3 3.5 0.707
# 3 a 3 5 4 1
# 4 a 4 3 3.75 0.957
# 5 b 1 3 3 NaN
# 6 b 2 4 3.5 0.707
# 7 b 3 5 4 1
# 8 c 1 3 3 NaN
# 9 c 2 1 2 1.41
#10 c 3 1 1.67 1.15
#11 c 4 2 1.75 0.957
#12 c 5 5 2.4 1.67
Does this example work for you? Here I am using runner() from the runner package. runner() applies a function over a window that you define and works fine with dplyr's group_by(). You set the window size with the k argument.
library(runner)
library(dplyr)
library(magrittr)
DF <- data.frame(Product = c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating = c(4,3,5,3,3,4,5,3,1,1,2,5))
DF <- DF %>%
  group_by(Product) %>%
  arrange(Timestamp, .by_group = TRUE)

DF <- DF %>%
  mutate(
    average = runner(Rating, f = function(x) mean(x), k = 3),
    deviation = runner(Rating, f = function(x) sd(x), k = 3)
  )
It is worth mentioning that the function expands the window on the first rows of each group (each Product) until it reaches the size defined by the k argument. So on the first two rows, where there are not yet 3 previous values, runner() applies the function to just those rows.
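A toy call showing that expanding-then-rolling behaviour in isolation (mirroring the arguments used above):
library(runner)
runner(1:5, f = function(x) mean(x), k = 3)
#> [1] 1.0 1.5 2.0 3.0 4.0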
Building on this answer to a related question, you could also do it this way with dplyr (map_dbl() comes from purrr, so load that as well):
library(dplyr)
library(purrr)

DF <- DF %>%
  # Sort by product and then by timestamp within product
  arrange(Product, Timestamp) %>%
  # group data by product
  group_by(Product) %>%
  # use the cumulative mean function to calculate the means
  mutate(Moving.avg = cummean(Rating),
         # use map_dbl to calculate the standard deviation up to each index
         Moving.sd = map_dbl(seq_along(Timestamp), ~ sd(Rating[1:.x])),
         # set Moving.sd to 0 when Timestamp takes its smallest value
         Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0,
                               TRUE ~ Moving.sd)) %>%
  # ungroup the data
  ungroup()
To visualize the problem, let's assume I have a dataset data in R with the following columns:
factor
param
T1_g1
T2_g1
T1_g2
T2_g2
I want to perform an operation on a subset of columns:
data_final <- data %>%
  mutate_at(vars(T1, T2), funs(if (param > 100) {
    . * T(n)_g1
  } else {
    . * T(n)_g2
  }))
How do I reference the correct column name in the expression T(n)_g1 so it fetches data from T1_g1 and T2_g1, respectively, while mutating?
(in a real case scenario, I have much more columns and conditions, hence manually typing all possible cases is not an option)
if needs a single TRUE/FALSE, but since this condition will be a vector, you need if_else (or ifelse). I don't know that you can (easily) dynamically determine the other column names based on the to-be-changed name within a quick mutate* interface. A quick hack could be:
data %>%
  mutate(
    T1 = if_else(param > 100, T1_g1, T1_g2) * T1,
    T2 = if_else(param > 100, T2_g1, T2_g2) * T2
  )
but this only works if you have a small/static list of T* variables to modify.
If there is a dynamic (or just "high") number of these T* variables, one method includes reshaping the frame to a longer format. (One could argue that a long format might be a better fit for this regardless, so I'll step you through wide-long-mutate as well as wide-long-mutate-wide.)
Some data:
x <- tibble(
  param = c(1L, 50L, 101L, 150L),
  T1 = 1:4,
  T2 = 5:8,
  T1_g1 = (1:4) / 10,
  T1_g2 = (1:4) * 10,
  T2_g1 = (5:8) / 10,
  T2_g2 = (5:8) * 10
)
x
# # A tibble: 4 x 7
# param T1 T2 T1_g1 T1_g2 T2_g1 T2_g2
# <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 5 0.1 10 0.5 50
# 2 50 2 6 0.2 20 0.6 60
# 3 101 3 7 0.3 30 0.7 70
# 4 150 4 8 0.4 40 0.8 80
First, the first reshaping:
x %>%
  gather(k, v, -param) %>%
  mutate(
    num = sub("^T([0-9]+).*", "\\1", k),
    k = sub("^T[0-9]+(.*)", "T\\1", k)
  ) %>%
  spread(k, v)
# # A tibble: 8 x 5
# param num T T_g1 T_g2
# <int> <chr> <dbl> <dbl> <dbl>
# 1 1 1 1 0.1 10
# 2 1 2 5 0.5 50
# 3 50 1 2 0.2 20
# 4 50 2 6 0.6 60
# 5 101 1 3 0.3 30
# 6 101 2 7 0.7 70
# 7 150 1 4 0.4 40
# 8 150 2 8 0.8 80
What we've done is turn four rows with 3*n columns following the T#, T#_g1, T#_g2 pattern into just 3 such columns and n times the number of rows, preserving this n as another column (for now). This is arguably a good format to work with in general: the tidyverse, and notably ggplot2, really likes data in this shape, and there are likely more benefits I don't know.
Now the full shebang (repeating the first few lines of code):
x %>%
  gather(k, v, -param) %>%
  mutate(
    num = sub("^T([0-9]+).*", "\\1", k),
    k = sub("^T[0-9]+(.*)", "T\\1", k)
  ) %>%
  spread(k, v) %>%
  mutate(T = T * if_else(param > 100, T_g1, T_g2)) %>%
  gather(k, v, -param, -num) %>%
  mutate(k = if_else(grepl("^T", k), paste0("T", num, substr(k, 2, nchar(k))), k)) %>%
  select(-num) %>%
  spread(k, v)
# # A tibble: 4 x 7
# param T1 T1_g1 T1_g2 T2 T2_g1 T2_g2
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 10 0.1 10 250 0.5 50
# 2 50 40 0.2 20 360 0.6 60
# 3 101 0.900 0.3 30 4.90 0.7 70
# 4 150 1.6 0.4 40 6.4 0.8 80
After the reshaping, your initial mutate_at concept is reduced to a single mutate(T = ...) call; the rest involves re-hydrating the original width.
If your data is large, this might be a little cumbersome. Other solutions might involve manually determining the T# columns and doing the ifelse manually (outside of mutate), as sketched below.
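A rough sketch of that manual approach, assuming the column naming convention from x above (T<number> columns with matching _g1/_g2 companions):
# find the T# columns and scale each one by its g1 or g2 companion, row by row
tcols <- grep("^T[0-9]+$", names(x), value = TRUE)
for (tc in tcols) {
  x[[tc]] <- x[[tc]] * ifelse(x$param > 100,
                              x[[paste0(tc, "_g1")]],
                              x[[paste0(tc, "_g2")]])
}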