Reference other columns by name fragment in mutate_at - r

To visualize the problem, let's assume I have a dataset data in R with the following columns:
factor
param
T1_g1
T2_g1
T1_g2
T2_g2
I want to perform an operation on a subset of columns:
data_final <- data %>%
mutate_at(vars(T1, T2), funs(if(param > 100) {
. * T(n)_g1
} else {
. * T(n)_g2
}))
How do I reference the correct column name in the expression T(n)_g1 so it fetches data from T1_g1 and T2_g1, respectively, while mutating?
(In a real-case scenario I have many more columns and conditions, so manually typing out all possible cases is not an option.)

if needs a single logical value, but since param > 100 is a vector, you need if_else (or ifelse). I don't know that you can (easily) determine the other column names dynamically from the name of the column being changed within a quick mutate* interface. A quick hack could be:
data %>%
mutate(
T1 = if_else(param > 100, T1_g1, T1_g2) * T1,
T2 = if_else(param > 100, T2_g1, T2_g2) * T2
)
but this only works if you have a small/static list of T* variables to modify.
If there is a dynamic (or just "high") number of these T* variables, one method includes reshaping the frame to a longer format. (One could argue that a long format might be a better fit for this regardless, so I'll step you through wide-long-mutate as well as wide-long-mutate-wide.)
Some data:
library(dplyr)
library(tidyr) # gather(), spread()
# tibble() replaces the deprecated data_frame()
x <- tibble(
param = c(1L,50L,101L,150L),
T1 = 1:4,
T2 = 5:8,
T1_g1 = (1:4)/10,
T1_g2 = (1:4)*10,
T2_g1 = (5:8)/10,
T2_g2 = (5:8)*10
)
x
# # A tibble: 4 x 7
# param T1 T2 T1_g1 T1_g2 T2_g1 T2_g2
# <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 5 0.1 10 0.5 50
# 2 50 2 6 0.2 20 0.6 60
# 3 101 3 7 0.3 30 0.7 70
# 4 150 4 8 0.4 40 0.8 80
First, the first reshaping:
x %>%
gather(k, v, -param) %>%
mutate(
num = sub("^T([0-9]+).*", "\\1", k),
k = sub("^T[0-9]+(.*)", "T\\1", k)
) %>%
spread(k, v)
# # A tibble: 8 x 5
# param num T T_g1 T_g2
# <int> <chr> <dbl> <dbl> <dbl>
# 1 1 1 1 0.1 10
# 2 1 2 5 0.5 50
# 3 50 1 2 0.2 20
# 4 50 2 6 0.6 60
# 5 101 1 3 0.3 30
# 6 101 2 7 0.7 70
# 7 150 1 4 0.4 40
# 8 150 2 8 0.8 80
What we've done is turn four rows with 3*n columns (following the T#, T#_g1, and T#_g2 pattern) into just 3 value columns with n times the number of rows. We preserve this n as another column, num (for now). This long layout is arguably a good format to work with in general: the tidyverse, and notably ggplot2, work best with data in this shape.
Now the full shebang (repeating the first few lines of code):
x %>%
gather(k, v, -param) %>%
mutate(
num = sub("^T([0-9]+).*", "\\1", k),
k = sub("^T[0-9]+(.*)", "T\\1", k)
) %>%
spread(k, v) %>%
mutate(T = T * if_else(param > 100, T_g1, T_g2)) %>%
gather(k, v, -param, -num) %>%
mutate(k = if_else(grepl("^T", k), paste0("T", num, substr(k, 2, nchar(k))), k)) %>%
select(-num) %>%
spread(k, v)
# # A tibble: 4 x 7
# param T1 T1_g1 T1_g2 T2 T2_g1 T2_g2
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 10 0.1 10 250 0.5 50
# 2 50 40 0.2 20 360 0.6 60
# 3 101 0.900 0.3 30 4.90 0.7 70
# 4 150 1.6 0.4 40 6.4 0.8 80
After the reshaping, your initial mutate_at concept is reduced to a single mutate(T = ...) call. The rest involves re-hydrating the width.
If your data is large, this might be a little cumbersome. Other solutions might involve manually determining the T# columns and manually doing the ifelse (outside of mutate).
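The "manual" route mentioned above can be sketched in base R as a loop over the T# names, building the companion column names with paste0(). This is only a sketch on the sample data from this answer, rebuilt as a plain data.frame; the helper vector t_cols is a name introduced here for illustration.

```r
# Sample data from the answer, rebuilt as a plain data.frame
x <- data.frame(
  param = c(1L, 50L, 101L, 150L),
  T1 = 1:4, T2 = 5:8,
  T1_g1 = (1:4) / 10, T1_g2 = (1:4) * 10,
  T2_g1 = (5:8) / 10, T2_g2 = (5:8) * 10
)

# Columns named exactly T<number> are the ones to modify (assumed naming scheme)
t_cols <- grep("^T[0-9]+$", names(x), value = TRUE)

for (nm in t_cols) {
  g1 <- x[[paste0(nm, "_g1")]]  # e.g. T1_g1 for T1
  g2 <- x[[paste0(nm, "_g2")]]  # e.g. T1_g2 for T1
  # vectorised choice of multiplier, row by row
  x[[nm]] <- x[[nm]] * ifelse(x$param > 100, g1, g2)
}
x
```

This avoids reshaping altogether, at the cost of stepping outside the mutate* interface.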

Related

Tidy way of comparing "tiles" of users

Let's say df holds an aggregated metric from an A/B test with groups A and B. Here x is, for example, the number of page visits and n is the number of users with that many visits. (In reality there are far more users and the differences are small.) Note that the number of users differs between groups.
library(tidyverse)
df <- bind_rows(
tibble(group = "A", x = rpois(100, 1)),
tibble(group = "B", x = rpois(200, 2))
) %>%
count(group, x)
I want to compare tiles of users. By tile, I mean the users in group A that share the same x value.
For example, if 34.17% of users in group A have value 0, I want to compare them to the average x of the lowest 34.17% of users in group B. Next, if users with 1 visit in group A fall between 34.17% and 74.8%, I want to compare them with the users in the same percentile range (though this should be more precise) in group B. And so on.
Here's my try:
n_fake <- 1000
df_agg_per_imp <- df %>%
group_by(group) %>%
mutate(
p_max = n_fake * cumsum(n) / sum(n),
p_min = lag(p_max, default = 0),
p = map2(p_min + 1, p_max, seq)
) %>%
ungroup()
df_agg_per_imp %>%
unnest(p) %>%
pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
group_by(A) %>%
summarise(
p_min = min(p) / n_fake,
p_max = max(p) / n_fake,
rel_uplift = mean(B) / mean(A)
)
#> # A tibble: 6 × 4
#> A p_min p_max rel_uplift
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.001 0.34 Inf
#> 2 1 0.341 0.74 1.92
#> 3 2 0.741 0.91 1.57
#> 4 3 0.911 0.96 1.33
#> 5 4 0.961 0.99 1.21
#> 6 5 0.991 1 1.2
What I don't like is that I have to create a row for each user (and there could be millions) to get the results I want. Is there a simpler or better way to do it?
You may be able to do something like this:
Extend the creation of your initial frame to get the proportions in A and B, and pivot wider:
set.seed(123)
df <- bind_rows(
tibble(group = "A", x = rpois(100, 1)),
tibble(group = "B", x = rpois(200, 2))
) %>%
count(group, x) %>%
group_by(group) %>%
mutate(prop = n/sum(n)) %>%
pivot_wider(id_cols=x, names_from=group,values_from=prop)
With the seed above, this gives you a frame like this:
# A tibble: 7 x 3
x A B
<int> <dbl> <dbl>
1 0 0.35 0.095
2 1 0.38 0.33
3 2 0.21 0.285
4 3 0.04 0.14
5 4 0.02 0.085
6 5 NA 0.055
7 6 NA 0.01
Create a function that estimates rel_uplift while also returning an updated set of group B proportions (bvec) and group B values (bvals, i.e. the x values):
f <- function(a,aval,bvec,bvals) {
cindex = which(cumsum(bvec)>=a)
if(length(cindex) == 0) bindex=seq_along(bvec)
else bindex= 1:min(cindex)
rem = sum(bvec[bindex])-a
bmean = sum(bvals[bindex] * (bvec[bindex] - c(rep(0,length(bindex)-1), rem)))
if(length(bindex)>1) {
if(rem!=0) bindex = bindex[1:(length(bindex)-1)]
bvec = bvec[-bindex]
bvals = bvals[-bindex]
}
bvec[1] = rem
list("rel_uplift" = bmean/(a*aval),"bvec" = bvec, "bvals" = bvals )
}
Initiate a dataframe, and a list called fres which contains the initial bvec and initial bvals
result=data.frame()
fres = list("bvec" = df$B,"bvals" = df$x)
Use a for loop to loop over the values of df$A, each time getting the rel_uplift, and preparing an updated set of bvec and bvals to be used in the function
for(a in df %>% filter(!is.na(A)) %>% pull(A)) {
x = df %>% filter(A==a) %>% pull(x)
fres = f(a, x,fres[["bvec"]],fres[["bvals"]])
result = rbind(result,data.frame(x =x, A=a,rel_uplift=fres[["rel_uplift"]]))
}
result
x A rel_uplift
1 0 0.35 Inf
2 1 0.38 1.855263
3 2 0.21 1.726190
4 3 0.04 1.666667
5 4 0.02 1.375000
If I understand right you want to compare counts by two parameters simultaneously, ie by $group and by $x.
From the example in the initial post I see that not all values $x may be available for each group.
Summarizing by 2 co-variables can be done with base R.
Here a simple function (assuming that you're always looking at $group and $x):
countnByGroup <- function(xx, asPercent=FALSE) {
lev <- unique(xx$x)
grp <- unique(xx$group)
out <- sapply(grp, function(x) {z <- rep(NA, length(lev)); names(z) <- lev
w <- which(xx$group==x); if(length(w) >0) z[match(xx$x[w], lev)] <- xx$n[w]
z })
if(asPercent) out <- 100*apply(out, 2, function(x) x/sum(x, na.rm=TRUE))
out }
Note that in the function above the main argument is called 'xx' to avoid confusion with $x.
df # produced using the code from your example
## A tibble: 13 x 3
# group x n
# <chr> <int> <int>
# 1 A 0 36
# 2 A 1 38
# 3 A 2 19
# 4 A 3 6
# 5 A 4 1
# 6 B 0 27
# 7 B 1 44
# 8 B 2 55
# 9 B 3 44
#10 B 4 21
#11 B 5 6
#12 B 6 2
#13 B 8 1
One gets :
countnByGroup(df)
# A B
#0 36 27
#1 38 44
#2 19 55
#3 6 44
#4 1 21
#5 NA 6
#6 NA 2
#8 NA 1
## and
countnByGroup(df, asPercent=T)
# A B
#0 36 13.5
#1 38 22.0
#2 19 27.5
#3 6 22.0
#4 1 10.5
#5 NA 3.0
#6 NA 1.0
#8 NA 0.5
As long as you don't apply any rounding, the results are as precise as they can be.
By chance, the random values above didn't produce more digits, and because group A has exactly 100 users its percent values happen to equal its counts.
Another interesting option may be to consider two-way tables in R using table().
But in this case you need your entries as separate lines and not already transformed to counting data as in your example above.
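As a rough illustration of the table() route, here is a sketch that starts from the raw, one-row-per-user data (before count() is applied). The object names raw, tab, and pct are introduced here for illustration only.

```r
set.seed(123)
# One row per user, i.e. the data before it was aggregated with count()
raw <- rbind(
  data.frame(group = "A", x = rpois(100, 1)),
  data.frame(group = "B", x = rpois(200, 2))
)

# Two-way table of counts: rows are x values, columns are groups
tab <- table(x = raw$x, group = raw$group)

# Column-wise percentages (each group sums to 100)
pct <- 100 * prop.table(tab, margin = 2)
round(pct, 1)
```

Note that table() reports 0 rather than NA for x values that never occur in a group.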

Get rows with same values and creates different columns in R

I have a df with a repeating sequence in the first column, and I want to collect the values that share the same number (in column 1) into separate columns.
Note: my df has 25502100 rows and the sequence is formed by 845 values.
See one simple example of my df below:
df <- data.frame(x = c(1,2,3,4,1,2,3,4), y = c(0.1,-2,-3,1,0,10,6,9))
I would like a function to transform this df in:
df_new
x y z
1 1 0.1 0
2 2 -2.0 10
3 3 -3.0 6
4 4 1.0 9
Does anyone have a solution?
An option with pivot_wider
library(tidyr)
library(data.table)
library(dplyr)
df %>%
mutate(rn = c('y', 'z')[rowid(x)]) %>%
pivot_wider(names_from = rn, values_from = y)
Output:
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 0.1 0
#2 2 -2 10
#3 3 -3 6
#4 4 1 9
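For a very large frame, a base R alternative may be worth trying. The sketch below assumes the rows arrive as complete, identically ordered repetitions of the sequence (as in the example), so the y column can simply be folded into a matrix with one column per repetition. The names n_seq and m are introduced here for illustration.

```r
df <- data.frame(x = c(1, 2, 3, 4, 1, 2, 3, 4),
                 y = c(0.1, -2, -3, 1, 0, 10, 6, 9))

n_seq <- length(unique(df$x))       # length of one repetition (4 here)
stopifnot(nrow(df) %% n_seq == 0)   # assumption: only complete repetitions

# Fill column by column: each repetition becomes one column
m <- matrix(df$y, nrow = n_seq)

df_new <- data.frame(x = df$x[seq_len(n_seq)], m)
names(df_new)[-1] <- c("y", "z")    # one name per repetition
df_new
```

If the repetitions are not guaranteed to be complete and in the same order, the pivot_wider approach above is the safer choice.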

R: Replace NA in for loop with respective but changed variable name

I have a multilevel dataset df on my hands with the following organization:
ID Eye Video_number Time Day measurement1
40001 L 1 1 1 0.60
40001 L 2 1 1 0.50
40001 L 3 1 1 0.80
40001 L 1 2 1 0.60
40001 L 2 2 1 0.60
40001 L 3 2 1 0.60
Goal: I am trying to replace measurement values whose coefficient of variation is above 45 with NA, since these values are probably less precise and should be excluded.
The coefficient of variation (sometimes denoted CV) of a distribution is defined as the ratio of the standard deviation to the mean, $\sigma / \mu$ (expressed here as a percentage), with $\mu$ and $\sigma$ obtained from the raw data.
I obtained the CV values by Time units (averaging measurement of three videos in one Time unit) with the following function and for loop. I got help from the following threads:
How to correctly use group_by() and summarise() in a For loop in R
Append data frames together in a for loop
# Define function
cv <- function(x){
sd(na.omit(x))/mean(na.omit(x))*100}
# Variables
vars <- c("measurement1", "measurement2", "measurement3")
# Create a table with all CV values by ID, Eye, Day, and Time
df_cv=data.frame()
for (i in vars){
df<-df.m2
df$values<-df[,which(colnames(df.m2)==i)]
x<-df%>%
group_by(ID,Eye,Day,Time) %>%
summarise(Count = n(),
Mean = mean(values, na.rm = TRUE),
SD = sd(values, na.rm = TRUE),
CV = cv(values))%>%
mutate(Variable=paste(i,"cv",sep="_"))
df_cv<-rbind(df_cv,x)
df_cv$CV[is.nan(df_cv$CV)]<-0 # for 0/0 on CV formula giving NaN
}
It resulted in the following table df_cv:
ID Eye Day Time Count Mean SD CV Variable
40001 L 1 1 3 0.56666667 0.057735027 10.1885342 measurement1_cv
40001 L 1 2 3 0.36666667 0.404145188 110.2214150 measurement1_cv
40001 L 1 3 3 0.50000000 0.000000000 0.0000000 measurement1_cv
I reformatted df_cv above to wide format (Variables and CVs across row rather than down a column). This enabled me to merge the CVs with the original df
df_cv<-dcast(df_cv,ID+Eye+Day+Time~Variable,value.var = "CV")
df<-merge(df,df_cv,by=c("ID","Eye","Day","Time"))
ID Eye Video_number Time Day measurement1 measurement1_cv
40001 L 1 1 1 0.60 10.1885342
40001 L 2 1 1 0.50 10.1885342
40001 L 3 1 1 0.80 10.1885342
40001 L 1 2 1 0.80 110.2214150
40001 L 2 2 1 0.30 110.2214150
40001 L 3 2 1 0.00 110.2214150
I now want to put NAs into the cells of measurement1 that have a CV > 45. I know how to do this measurement by measurement, but I was wondering whether a for loop could do this, since I have a lot of variables I am analyzing.
df$measurement1[df$measurement1_cv>45]<-NA
df$measurement2[df$measurement2_cv>45]<-NA
df$measurement3[df$measurement3_cv>45]<-NA
Below are my failed attempts:
for (i in vars) {
df<-df.m3
df$i[df$i_cv>45]<-NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) :
replacement has 0 rows, data has 609
for (i in vars) {
df<-df.m3
df$i[df$paste(i,"_cv")>45]<-NA
}
Error in df$paste(i, "_cv") : attempt to apply non-function
Any help is greatly appreciated!
Two thoughts.
But first, augmenting your data a little.
df$measurement2 <- 0.9; df$measurement2_cv <- c(rep(46, 3), rep(44, 3))
library(dplyr)
First, a function that operates on only one group at a time (i.e., it ignores grouping, so it must be called within dplyr::do).
myfunc <- function(x) {
CV <- grep("^measurement.*_cv", colnames(x), value = TRUE)
MEAS <- gsub("_cv$", "", CV)
CV <- CV[MEAS %in% colnames(x)]
MEAS <- MEAS[MEAS %in% colnames(x)]
x[,MEAS] <- Map(function(meas, cv) replace(meas, cv > 45, NA), x[,MEAS], x[,CV])
x
}
df %>%
group_by(ID, Eye, Day, Time) %>%
do(myfunc(.))
# # A tibble: 6 x 9
# # Groups: ID, Eye, Day, Time [2]
# ID Eye Video_number Time Day measurement1 measurement1_cv measurement2 measurement2_cv
# <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 40001 L 1 1 1 0.6 10.2 NA 46
# 2 40001 L 2 1 1 0.5 10.2 NA 46
# 3 40001 L 3 1 1 0.8 10.2 NA 46
# 4 40001 L 1 2 1 NA 110. 0.9 44
# 5 40001 L 2 2 1 NA 110. 0.9 44
# 6 40001 L 3 2 1 NA 110. 0.9 44
Pivot, calculate, then unpivot, using tidyr.
library(tidyr) # pivot_*
df %>%
rename_with(.fn = ~ paste0(., "_val"), .cols = matches("^meas.*[^v]$")) %>%
rename_with(.fn = ~ gsub("(.*)_(.*)", "\\2_\\1", .), .cols = starts_with("meas")) %>%
pivot_longer(., matches("meas"), names_sep = "_", names_to = c(".value", "meas")) %>%
mutate(val = if_else(cv > 45, NA_real_, val)) %>%
pivot_wider(id_cols = 1:5, names_from = "meas", names_sep = "_", values_from = c("val", "cv"))
# # A tibble: 6 x 9
# ID Eye Video_number Time Day val_measurement1 val_measurement2 cv_measurement1 cv_measurement2
# <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 40001 L 1 1 1 0.6 NA 10.2 46
# 2 40001 L 2 1 1 0.5 NA 10.2 46
# 3 40001 L 3 1 1 0.8 NA 10.2 46
# 4 40001 L 1 2 1 NA 0.9 110. 44
# 5 40001 L 2 2 1 NA 0.9 110. 44
# 6 40001 L 3 2 1 NA 0.9 110. 44
(I admit that this looks/feels wonky: it shouldn't need all of that renaming, and you would still need to un-rename for the final step.) Perhaps the first solution is cleaner/better :-)
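For completeness, the for loop from the question can be made to work with [[ instead of $. The $ operator takes its argument literally (df$i looks for a column named "i"), whereas [[ accepts a character string built at run time. The sketch below uses the six example rows shown above plus the augmented measurement2 columns; the names cv_col and too_variable are introduced here for illustration.

```r
# Example rows from the merged table, plus the augmented measurement2 columns
df <- data.frame(
  measurement1    = c(0.6, 0.5, 0.8, 0.8, 0.3, 0.0),
  measurement1_cv = c(rep(10.1885342, 3), rep(110.2214150, 3)),
  measurement2    = 0.9,
  measurement2_cv = c(rep(46, 3), rep(44, 3))
)
vars <- c("measurement1", "measurement2")

for (v in vars) {
  cv_col <- paste0(v, "_cv")               # e.g. "measurement1_cv"
  too_variable <- which(df[[cv_col]] > 45) # which() skips NA comparisons
  df[[v]][too_variable] <- NA
}
df
```

Note that paste(i, "_cv") in the failed attempt would also insert a space; paste0() does not.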

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to try and avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, dplyr library is perfect and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
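library(dplyr) # provides %>%, group_by(), summarise()
df2 = df1 %>%
dplyr::group_by(subject, part) %>%
dplyr::summarise(
sad_mean = mean(na.omit(sad)),
# percentage of rows in the group where sad is missing
na_count = sum(is.na(sad)) / n() * 100
)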
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part you can calculate mean of sad and calculate ratio of NA value using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
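The mean(is.na(...)) idea also works without any package. As a minimal base R sketch (the name missing_pct is introduced here for illustration), tapply() applies it to every subject/part combination and returns a matrix with subjects as rows and parts as columns:

```r
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject, part, sad)

# is.na() gives TRUE/FALSE; its mean is the share of missing values
missing_pct <- with(df1, tapply(sad, list(subject = subject, part = part),
                                function(v) 100 * mean(is.na(v))))
missing_pct
```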
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And for full interaction with df2 you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4

Filling in non-existing rows in R + dplyr [duplicate]

Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.
My data look basically like this:
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests integer-like data, but your use of factor suggests ordinal or categorical. Because of that, you need to define authoritatively what the complete set of factor levels can be. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust your all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Using tidyr::complete on the existing columns alone won't work because week 47 never appears in the data.
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0)) %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
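As a variation on the join, and only if it is acceptable to treat FiscalWeek as a number rather than a factor, tidyr::complete() can be handed the full week range explicitly via tidyr::full_seq(). This is a sketch under that assumption, not a replacement for deciding what the complete set of weeks really is.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  FiscalWeek = as.factor(c(45, 46, 48, 48, 48)),
  Group = c("A", "A", "A", "B", "C"),
  Amount = c(1, 1, 1, 5, 6)
)

averages <- df %>%
  # assumption: weeks are consecutive integers, so convert the factor labels
  mutate(FiscalWeek = as.numeric(as.character(FiscalWeek))) %>%
  # full_seq() fills the gap at week 47; absent combinations get Amount = 0
  complete(FiscalWeek = full_seq(FiscalWeek, 1), Group,
           fill = list(Amount = 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
averages
```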
You can try this. I hope this helps.
library(dplyr)
#Define range
df %>% mutate(FiscalWeek=as.numeric(as.character(FiscalWeek))) -> df
range <- length(seq(min(df$FiscalWeek),max(df$FiscalWeek),by=1))
#Aggregation
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = sum(Amount)/range)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know number of weeks:
df %>%
group_by(Group) %>%
summarise(Avgs = sum(Amount) / length(45:48))

Resources