I have a dataset containing about 60 variables (A, B, C, D, ...), each with 3 corresponding columns (e.g. A, Group_A and WOE_A), as in the table below:
ID A Group_A WOE_A B Group_B WOE_B C Group_C WOE_C D Group_D WOE_D Status
213 0 1 0.87 0 1 0.65 0 1 0.80 915.7 4 -0.30 1
321 12 5 0.08 4 4 -0.43 6 5 -0.20 85.3 2 0.26 0
32 0 1 0.87 0 1 0.65 0 1 0.80 28.6 2 0.26 1
13 7 4 -0.69 2 3 -0.82 4 4 -0.80 31.8 2 0.26 0
43 1 2 -0.04 1 2 -0.49 1 2 -0.22 51.7 2 0.26 0
656 2 3 -0.28 2 3 -0.82 2 3 -0.65 8.5 1 1.14 0
435 2 3 -0.28 0 1 0.65 0 1 0.80 39.8 2 0.26 0
65 8 4 -0.69 3 4 -0.43 5 4 -0.80 243.0 3 0.00 0
565 0 1 0.87 0 1 0.65 0 1 0.80 4.0 1 1.14 0
432 0 1 0.87 0 1 0.65 0 1 0.80 81.6 2 0.26 0
I want to print a table in R with some statistics (Min(A), Max(A), WOE_A, Count(Group_A), Count(Group_A) where Status = 1, Count(Group_A) where Status = 0), all grouped by Group for each of the 60 variables, and I think I need to do this in a loop.
I tried the dplyr package, but I don't know how to refer to all three columns (A, Group_A and WOE_A) that relate to a variable (A), nor how to summarise the information into all the desired statistics.
The code I began with is:
df <- data
List <- list(df)
for (colname in colnames(df)) {
List[[colname]]<- df %>%
group_by(df[,colname]) %>%
count()
}
List
This is how I want to print results:
**Var A
Group Min(A) Max(A) WOE_A Count(Group_A) Count_1(Group_A, where Status=1) Count_0(Group_A, where Status=0)**
1
2
3
4
5
Thank you very much!
Laura
Laura, as mentioned by the others, working with "long" data frames is better than with wide data frames.
Your initial idea using dplyr and group_by() got you almost there.
Note: this is also a way to break down your data and then combine it with generic columns, if the wide-to-long conversion pushes the limits.
Let's start with this:
library(dplyr)
#---------- extract all "A" measurements
df %>%
select(A, Group_A, WOE_A, Status) %>%
#---------- grouped summary of multiple stats
group_by(A) %>%
summarise(
Min = min(A)
, Max = max(A)
, WOE_A = unique(WOE_A)
, Count = n() # n() is a helper function of dplyr
, CountStatus1 = sum(Status == 1) # use sum() to count logical conditions
, CountStatus0 = sum(Status == 0)
)
This yields:
# A tibble: 6 x 7
A Min Max WOE_A Count CountStatus1 CountStatus0
<dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 0 0 0 0.87 4 2 2
2 1 1 1 -0.04 1 0 1
3 2 2 2 -0.28 2 0 2
4 7 7 7 -0.69 1 0 1
5 8 8 8 -0.69 1 0 1
6 12 12 12 0.08 1 0 1
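Side note (my addition): the desired output in the question is grouped by Group_A rather than by the raw value A. Since the WOE is constant within each group in the sample data, the same summarise works with the grouping column swapped:
df %>%
  select(A, Group_A, WOE_A, Status) %>%
  group_by(Group_A) %>%
  summarise(
    Min = min(A)
    , Max = max(A)
    , WOE_A = unique(WOE_A)   # WOE is constant within a group
    , Count = n()
    , CountStatus1 = sum(Status == 1)
    , CountStatus0 = sum(Status == 0)
  )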
OK. Turning your wide data frame into a long one is not trivial, because measurements and variable names are nested in the column names. On top of that, ID and Status are id/key variables for each row.
The standard tool to convert wide to long is tidyr's pivot_longer(). Read up on this.
In your particular case we want to push multiple columns into multiple targets. For this you need to get a feel for the .value sentinel. The pivot_longer() help pages might be useful for studying this case.
To ease the pain of constructing a complex regex to decode the variable names, I rename your group-id labels, e.g. A, B, to X_A, X_B. This ensures that all column names are built in the form what_letter!
library(tidyr)
df %>%
# ----------- prepare variable names to be well-formed, you may do this upstream
rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
# ----------- call pivot longer with .value sentinel and names_pattern
# ----------- that is an advanced use of the capabilities
pivot_longer(
cols = -c("ID","Status") # apply to all cols besides ID and Status
, names_to = c(".value", "label") # target column names are based on origin names
# and an individual label (think id, name as u like)
, names_pattern = "(.*)(.*_[A-D]{1})$") # regex for the origin column patterns
# pattern is built of 2 parts (...)(...)
# (.*) no or any symbol possibly multiple times
# (.*_[A-D]{1}) as above, but ending with underscore and 1 letter
This gives you
# A tibble: 40 x 6
ID Status label X Group WOE
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 213 1 _A 0 1 0.87
2 213 1 _B 0 1 0.65
3 213 1 _C 0 1 0.8
4 213 1 _D 916. 4 -0.3
5 321 0 _A 12 5 0.08
6 321 0 _B 4 4 -0.43
7 321 0 _C 6 5 -0.2
8 321 0 _D 85.3 2 0.26
9 32 1 _A 0 1 0.87
10 32 1 _B 0 1 0.65
Putting it all together:
df %>%
# ------------ prepare and make long
rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
pivot_longer(cols = -c("ID","Status")
, names_to = c(".value", "label")
, names_pattern = "(.*)(.*_[A-D]{1})$") %>%
# ------------- calculate stats on groups
group_by(label, X) %>%
summarise(Min = min(X), Max = max(X), WOE = unique(WOE)
,Count = n(), CountStatus1 = sum(Status == 1)
, CountStatus0 = sum(Status == 0)
)
Voila:
# A tibble: 27 x 8
# Groups: label [4]
label X Min Max WOE Count CountStatus1 CountStatus0
<chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 _A 0 0 0 0.87 4 2 2
2 _A 1 1 1 -0.04 1 0 1
3 _A 2 2 2 -0.28 2 0 2
4 _A 7 7 7 -0.69 1 0 1
5 _A 8 8 8 -0.69 1 0 1
6 _A 12 12 12 0.08 1 0 1
7 _B 0 0 0 0.65 5 2 3
8 _B 1 1 1 -0.49 1 0 1
9 _B 2 2 2 -0.82 2 0 2
10 _B 3 3 3 -0.43 1 0 1
# ... with 17 more rows
The loop that I managed to write is below.
Apart from the tables I wanted to list, I also needed to make a chart showing some of the information from each listed table, and then print a PDF with each variable's table and chart on a separate page.
library(dplyr)
library(ggplot2)

data <- as.data.frame(data)
# Column 5 is where the first column related to a variable sits; each variable
# has 3 consecutive columns (Value, Group, WOE), so I step through them in threes
for (i in seq(5, 223, 3)) {
  ID <- data[, 1]
  A <- data[, i]
  Group <- data[, i + 1]
  WOE <- data[, i + 2]
  Status <- data[, 224]
  df <- data.frame(ID, A, Group, WOE, Status)
  # Build table T with its corresponding statistics
  T <- df %>%
    select(A, Group, WOE, Status) %>%
    group_by(Group) %>%
    summarise(
      Min = min(A, na.rm = TRUE), Max = max(A, na.rm = TRUE), WOE = unique(WOE),
      Count = n(),
      CountStatus1 = sum(Status == 1),
      CountStatus0 = sum(Status == 0),
      BadRate = round((CountStatus1 / Count) * 100, 1))
  print(colnames(data)[i])
  print(T)
  # Then I plot some information from table T
  p <- ggplot(T) + geom_col(aes(x = Group, y = CountStatus1), size = 1, color = "darkgreen", fill = "darkgreen")
  p <- p + geom_line(aes(x = Group, y = WOE * 1000), col = "firebrick", size = 0.9) +
    geom_point(aes(x = Group, y = WOE * 1000), col = "gray", size = 3) +
    ggtitle(label = paste("WOE and Event Count by Group", " - ", colnames(data)[i])) +
    labs(x = "Group", y = "Event Count", size = 7) +
    theme(plot.title = element_text(size = 8, face = "bold", margin = margin(10, 0, 10, 0)),
          axis.text.x = element_text(angle = 0, hjust = 1)) +
    scale_y_continuous(sec.axis = sec_axis(trans = ~ . / 1000, name = "WOE", breaks = seq(-3, 5, 0.5)))
  print(p)
}
The information is printed for all the variables I need; the original post showed the resulting table and chart for one of the variables as images.
However, I now have a problem exporting the results to PDF: I do not know how to print each variable's table and chart on a distinct page of a PDF file.
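A minimal sketch of one way to do this (my addition, not from the original post; the file name results.pdf is made up): open a pdf() device before the loop, render the summary table as a graphical object with gridExtra::tableGrob(), and draw the table and chart together with grid.arrange(). Each grid.arrange() call starts a new page by default, so every variable lands on its own PDF page.
library(ggplot2)
library(gridExtra)  # tableGrob() draws a data frame as a grid graphic

pdf("results.pdf", width = 10, height = 7)
for (i in seq(5, 223, 3)) {
  # ... build T and p exactly as in the loop above ...
  grid.arrange(tableGrob(T), p, ncol = 1)  # one page per variable
}
dev.off()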
Let's say df presents an aggregated metric in an AB test with groups A and B. x is, for example, the number of page visits, and n the number of users with that number of visits. (In reality there are far more users and the differences are small.) Note that the number of users per group differs.
library(tidyverse)
df <- bind_rows(
tibble(group = "A", x = rpois(100, 1)),
tibble(group = "B", x = rpois(200, 2))
) %>%
count(group, x)
I want to compare tiles of users. By tile, I mean users in group A that have the same x value.
For example, if 34.17% of users in group A have value 0, I want to compare them to the average x of the lowest 34.17% of users in group B. Next, users with 1 visit in group A sit between the 34.17% and 74.8% percentiles; I want to compare them with the same percentile band (but it should be more precise) of users in group B. Etc...
Here's my try:
n_fake <- 1000
df_agg_per_imp <- df %>%
group_by(group) %>%
mutate(
p_max = n_fake * cumsum(n) / sum(n),
p_min = lag(p_max, default = 0),
p = map2(p_min + 1, p_max, seq)
) %>%
ungroup()
df_agg_per_imp %>%
unnest(p) %>%
pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
group_by(A) %>%
summarise(
p_min = min(p) / n_fake,
p_max = max(p) / n_fake,
rel_uplift = mean(B) / mean(A)
)
#> # A tibble: 6 × 4
#> A p_min p_max rel_uplift
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.001 0.34 Inf
#> 2 1 0.341 0.74 1.92
#> 3 2 0.741 0.91 1.57
#> 4 3 0.911 0.96 1.33
#> 5 4 0.961 0.99 1.21
#> 6 5 0.991 1 1.2
What I don't like is that I have to create a row for each user (and this could be millions) to get the results I want. Is there a simpler/better way to do it?
You may be able to do something like this:
Extend the creation of your initial frame to get the proportion in A and B, and pivot wider:
set.seed(123)
df <- bind_rows(
tibble(group = "A", x = rpois(100, 1)),
tibble(group = "B", x = rpois(200, 2))
) %>%
count(group, x) %>%
group_by(group) %>%
mutate(prop = n/sum(n)) %>%
pivot_wider(id_cols=x, names_from=group,values_from=prop)
With the seed above, this gives you a frame like this:
# A tibble: 7 x 3
x A B
<int> <dbl> <dbl>
1 0 0.35 0.095
2 1 0.38 0.33
3 2 0.21 0.285
4 3 0.04 0.14
5 4 0.02 0.085
6 5 NA 0.055
7 6 NA 0.01
Create a function that estimates the rel_uplift, while also returning an updated set of group B proportions (bvec) and group B values (bvals, i.e. the x values):
f <- function(a, aval, bvec, bvals) {
  # B bins needed to cover proportion a (all of them if the cumsum never reaches a)
  cindex = which(cumsum(bvec) >= a)
  if (length(cindex) == 0) bindex = seq_along(bvec)
  else bindex = 1:min(cindex)
  # portion of the last included bin that exceeds a
  rem = sum(bvec[bindex]) - a
  # weighted sum of B values covering exactly proportion a (divide by a for the mean)
  bmean = sum(bvals[bindex] * (bvec[bindex] - c(rep(0, length(bindex) - 1), rem)))
  # drop the fully consumed bins and carry the remainder over to the next call
  if (length(bindex) > 1) {
    if (rem != 0) bindex = bindex[1:(length(bindex) - 1)]
    bvec = bvec[-bindex]
    bvals = bvals[-bindex]
  }
  bvec[1] = rem
  list("rel_uplift" = bmean / (a * aval), "bvec" = bvec, "bvals" = bvals)
}
Initialise an empty data frame, and a list called fres which contains the initial bvec and bvals:
result=data.frame()
fres = list("bvec" = df$B,"bvals" = df$x)
Use a for loop over the values of df$A, each time getting the rel_uplift and preparing the updated set of bvec and bvals for the next call of the function:
for (a in df %>% filter(!is.na(A)) %>% pull(A)) {
  x = df %>% filter(A == a) %>% pull(x)
  fres = f(a, x, fres[["bvec"]], fres[["bvals"]])
  result = rbind(result, data.frame(x = x, A = a, rel_uplift = fres[["rel_uplift"]]))
}
result
x A rel_uplift
1 0 0.35 Inf
2 1 0.38 1.855263
3 2 0.21 1.726190
4 3 0.04 1.666667
5 4 0.02 1.375000
If I understand right, you want to compare counts by two parameters simultaneously, i.e. by $group and by $x.
From the example in the initial post I see that not all values of $x may be available for each group.
Summarizing by 2 co-variables can be done with base R.
Here a simple function (assuming that you're always looking at $group and $x):
countnByGroup <- function(xx, asPercent = FALSE) {
  lev <- unique(xx$x)
  grp <- unique(xx$group)
  out <- sapply(grp, function(x) {
    z <- rep(NA, length(lev))
    names(z) <- lev
    w <- which(xx$group == x)
    if (length(w) > 0) z[match(xx$x[w], lev)] <- xx$n[w]
    z
  })
  if (asPercent) out <- 100 * apply(out, 2, function(x) x / sum(x, na.rm = TRUE))
  out
}
Note: in the function above the main variable is called 'xx' to avoid confusion with $x.
df # produced using the code from your example
## A tibble: 13 x 3
# group x n
# <chr> <int> <int>
# 1 A 0 36
# 2 A 1 38
# 3 A 2 19
# 4 A 3 6
# 5 A 4 1
# 6 B 0 27
# 7 B 1 44
# 8 B 2 55
# 9 B 3 44
#10 B 4 21
#11 B 5 6
#12 B 6 2
#13 B 8 1
One gets:
countnByGroup(df)
# A B
#0 36 27
#1 38 44
#2 19 55
#3 6 44
#4 1 21
#5 NA 6
#6 NA 2
#8 NA 1
## and
countnByGroup(df, asPercent=T)
# A B
#0 36 13.5
#1 38 22.0
#2 19 27.5
#3 6 22.0
#4 1 10.5
#5 NA 3.0
#6 NA 1.0
#8 NA 0.5
As long as you don't apply any rounding, the results are as precise as it gets. By chance the random values above didn't produce more decimal digits, so the percent values for A all happen to be integers.
Another interesting option may be to use two-way tables in R via table(). But in this case you need your entries as separate lines, not already aggregated into counts as in your example above.
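For illustration, a small sketch (my addition) that expands the counted data back to one row per user with tidyr::uncount() and then cross-tabulates:
library(tidyr)
# expand the count column n back to one row per user, then build a two-way table
df_long <- uncount(df, n)
table(df_long$group, df_long$x)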
I am using the dplyr library in R.
I created the following dataset:
library(dplyr)
#create data
a = rnorm(100,100,10)
b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )
#create frame
train_data = data.frame(a,b,group)
train_data$group = as.factor(train_data$group)
From here, I want to make a new variable called "diff" which records if variable "b" is bigger than variable "a":
train_data$diff = ifelse(train_data$b > train_data$a,1,0)
Now, I want to make a new variable ("perc") in the "train_data" table, which calculates:
for each unique group
the "percentage" of the "diff" variable
e.g.
suppose there are 20 rows where group = A.
In those 20 rows, there are 10 rows where the variable "diff" is 1.
Therefore perc = 10/20 = 0.5, so for those 20 rows the value of "perc" should be 0.5.
Using another Stack Overflow post (Compute "percent complete" within subgroups using dplyr in R?), I tried to implement this:
final_table = data.frame(train_data %>% group_by(group) %>% mutate(perc = diff/max(diff)))
But this is not giving me the desired output:
head(final_table)
a b group diff perc
1 107.19028 117.37028 D 1 1
2 105.34165 87.96513 A 0 0
3 120.21911 94.30301 C 0 0
4 98.06001 104.82173 D 1 1
5 104.54841 90.00205 B 0 0
6 90.77172 79.31384 D 0 0
7 96.22783 88.60185 D 0 0
8 113.67500 87.28380 B 0 0
9 96.82708 89.51343 C 0 0
10 115.38720 100.79550 C 0 0
11 105.30922 80.55969 C 0 0
12 114.93315 95.78172 B 0 0
13 105.20058 109.66729 C 1 1
For example, rows 11 and 13 both have group = C, but different values of the "perc" variable. Furthermore, it doesn't seem like percentages are being calculated here.
Can someone please show me how to fix this?
Note: Is it also possible to create a table with 4 rows in which the summaries are provided? I think the Count = n() command can be used for this?
E.g.
Group Number of Rows Perc
a 20 0.6
b 20 0.7
c 50 0.9
d 10 0.24
Or a general summary (i.e. in the whole table, what is the percentage of rows where the "diff" variable is 1?):
d = sum(train_data$diff) / nrow(train_data)  # count() expects a data frame, so nrow() is used here
Thanks
Please let me know if I misunderstood your questions:
library(dplyr)
#create data
a = rnorm(100,100,10)
b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )
#create frame
train_data = data.frame(a,b,group)
# Question 1
train_data %>%
group_by(group) %>%
mutate(
percent = sum(a>b)/n()
)
#> # A tibble: 100 x 4
#> # Groups: group [4]
#> a b group percent
#> <dbl> <dbl> <chr> <dbl>
#> 1 95.0 88.9 B 0.429
#> 2 96.4 95.1 A 0.35
#> 3 102. 110. A 0.35
#> 4 97.4 96.2 A 0.35
#> 5 90.7 92.7 A 0.35
#> 6 92.0 105. B 0.429
#> 7 93.8 85.1 A 0.35
#> 8 101. 102. B 0.429
#> 9 92.0 99.1 A 0.35
#> 10 77.6 87.8 B 0.429
#> # ... with 90 more rows
# Question 2
train_data %>%
group_by(group) %>%
summarize(
rows= n(),
percent = sum(a>b)/n()
)
#> # A tibble: 4 x 3
#> group rows percent
#> <chr> <int> <dbl>
#> 1 A 60 0.35
#> 2 B 21 0.429
#> 3 C 8 0.375
#> 4 D 11 0.364
Created on 2021-07-02 by the reprex package (v2.0.0)
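A small addition (mine, not part of the answer above): the question defined diff as b > a, and since diff is a 0/1 indicator, the per-group percentage is simply its group mean:
train_data$diff = ifelse(train_data$b > train_data$a, 1, 0)

# per-row column (question 1)
train_data %>%
  group_by(group) %>%
  mutate(perc = mean(diff)) %>%
  ungroup()

# 4-row summary table (question 2)
train_data %>%
  group_by(group) %>%
  summarize(rows = n(), perc = mean(diff))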
Each segment has a different range; for example, A runs from 1 to 3 while C runs from 1 to 7.
For each segment there can be missing time points, for which I want to perform interpolation (linear, spline, etc.).
How can I do it within dplyr?
have <- data.frame(time =c(1,3,1,2,5,1,3,5,7),
segment=c('A','A','B','B','B','C','C','C','C'),
toInterpolate= c(0.12,0.31,0.15,0.24,0.55,0.11,0.35,0.53,0.79))
have
want <- data.frame(time =c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7),
segment=c('A','A','A','B','B','B','B','B','C','C','C','C','C','C','C'),
Interpolated= c(0.12,0.21,0.31,0.15,0.24,0.34,0.41,0.55,0.11,0.28,0.35,0.45,0.53,0.69,0.79))
# note that the interpolated values here are just randomnly put, (not based on actual linear/spline interpolation)
want
We can use complete to complete the sequence and na.spline from zoo for interpolation.
library(dplyr)
library(tidyr)
library(zoo)
have %>%
group_by(segment) %>%
complete(time = min(time):max(time)) %>%
mutate(toInterpolate = na.spline(toInterpolate))
# segment time toInterpolate
# <chr> <dbl> <dbl>
# 1 A 1 0.12
# 2 A 2 0.215
# 3 A 3 0.31
# 4 B 1 0.15
# 5 B 2 0.24
# 6 B 3 0.337
# 7 B 4 0.44
# 8 B 5 0.55
# 9 C 1 0.11
#10 C 2 0.246
#11 C 3 0.35
#12 C 4 0.439
#13 C 5 0.53
#14 C 6 0.641
#15 C 7 0.79
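Since the question also mentioned linear interpolation: zoo's na.approx() is the drop-in linear counterpart of na.spline() (a sketch under the same setup):
have %>%
  group_by(segment) %>%
  complete(time = min(time):max(time)) %>%
  mutate(toInterpolate = na.approx(toInterpolate))  # linear instead of spline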
To create the sequence at a smaller granularity, complete to the finer grid before interpolating (otherwise the newly added rows are left as NA):
have %>%
  group_by(segment) %>%
  complete(time = seq(min(time), max(time), by = 0.1)) %>%
  mutate(toInterpolate = na.spline(toInterpolate))
And so my first question on Stack Overflow begins.
I have this code:
a <- rep(letters[1:4], each = 4)
time <- c(0, 0, 1, 1, 0, 1, 2, 2, 1, 1, 2, 2, 0, 0, 1, 2)
cost <- rep(c(0.4, 0.2, 0.1, 0.5, 0.5, 0.22, 0.15, 0.18), each = 2)
df <- data.frame(a = a, time = time, cost = cost)
The code above is just a short illustration of the much larger dataset I have; the original post showed the resulting data frame as an image.
Do you know how I can merge the rows with duplicated time values into one, aggregating the costs (they represent different kinds of costs even though they happen to coincide at some instances) at each time point for each letter of column a?
Thanks in advance!
Does this work:
> library(dplyr)
> df %>% group_by(a, time) %>% summarise(cost = sum(cost))
`summarise()` regrouping output by 'a' (override with `.groups` argument)
# A tibble: 10 x 3
# Groups: a [4]
a time cost
<chr> <dbl> <dbl>
1 a 0 0.8
2 a 1 0.4
3 b 0 0.1
4 b 1 0.1
5 b 2 1
6 c 1 1
7 c 2 0.44
8 d 0 0.3
9 d 1 0.18
10 d 2 0.18
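Side note (my addition): the regrouping message can be silenced by setting the .groups argument explicitly:
df %>% group_by(a, time) %>% summarise(cost = sum(cost), .groups = "drop")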
Using base R:
> aggregate(cost~a+time, df, sum)
a time cost
1 a 0 0.80
2 b 0 0.10
3 d 0 0.30
4 a 1 0.40
5 b 1 0.10
6 c 1 1.00
7 d 1 0.18
8 b 2 1.00
9 c 2 0.44
10 d 2 0.18
I have a multilevel dataset df on my hands with the following organization:
ID Eye Video_number Time Day measurement1
40001 L 1 1 1 0.60
40001 L 2 1 1 0.50
40001 L 3 1 1 0.80
40001 L 1 2 1 0.60
40001 L 2 2 1 0.60
40001 L 3 2 1 0.60
Goal: I am trying to replace cell values of measurements that have a coefficient of variation above 45 with NA, since these values are probably less precise and should be excluded.
The coefficient of variation (sometimes denoted CV) of a distribution is defined as the ratio of the standard deviation to the mean, $CV = \sigma / \mu$, with $\mu$ and $\sigma$ obtained from the raw data (the code below expresses it as a percentage).
I obtained the CV values by Time unit (averaging the measurements of the three videos in one Time unit) with the following function and for loop. I got help from the following threads:
How to correctly use group_by() and summarise() in a For loop in R
Append data frames together in a for loop
# Define function
cv <- function(x) {
  sd(na.omit(x)) / mean(na.omit(x)) * 100
}

# Variables
vars <- c("measurement1", "measurement2", "measurement3")

# Create a table with all CV values by ID, Eye, Day, and Time
df_cv <- data.frame()
for (i in vars) {
  df <- df.m2
  df$values <- df[, which(colnames(df.m2) == i)]
  x <- df %>%
    group_by(ID, Eye, Day, Time) %>%
    summarise(Count = n(),
              Mean = mean(values, na.rm = TRUE),
              SD = sd(values, na.rm = TRUE),
              CV = cv(values)) %>%
    mutate(Variable = paste(i, "cv", sep = "_"))
  df_cv <- rbind(df_cv, x)
  df_cv$CV[is.nan(df_cv$CV)] <- 0  # 0/0 in the CV formula gives NaN
}
It resulted in the following table df_cv:
ID Eye Day Time Count Mean SD CV Variable
40001 L 1 1 3 0.56666667 0.057735027 10.1885342 measurement1_cv
40001 L 1 2 3 0.36666667 0.404145188 110.2214150 measurement1_cv
40001 L 1 3 3 0.50000000 0.000000000 0.0000000 measurement1_cv
I reformatted df_cv above to wide format (Variables and CVs across the row rather than down a column). This enabled me to merge the CVs with the original df:
library(reshape2)  # for dcast
df_cv <- dcast(df_cv, ID + Eye + Day + Time ~ Variable, value.var = "CV")
df <- merge(df, df_cv, by = c("ID", "Eye", "Day", "Time"))
ID Eye Video_number Time Day measurement1 measurement1_cv
40001 L 1 1 1 0.60 10.1885342
40001 L 2 1 1 0.50 10.1885342
40001 L 3 1 1 0.80 10.1885342
40001 L 1 2 1 0.80 110.2214150
40001 L 2 2 1 0.30 110.2214150
40001 L 3 2 1 0.00 110.2214150
I now want to put NAs into the cells of measurement1 that have a CV > 45. I know how to do this measurement by measurement, but I was wondering if there is a for loop capable of doing this, since I have a lot of variables to analyse.
df$measurement1[df$measurement1_cv>45]<-NA
df$measurement2[df$measurement2_cv>45]<-NA
df$measurement3[df$measurement3_cv>45]<-NA
Below are my failed attempts:
for (i in vars) {
df<-df.m3
df$i[df$i_cv>45]<-NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) :
replacement has 0 rows, data has 609
for (i in vars) {
df<-df.m3
df$i[df$paste(i,"_cv")>45]<-NA
}
Error in df$paste(i, "_cv") : attempt to apply non-function
Any help is greatly appreciated!
Two thoughts.
But first, augmenting your data a little.
df$measurement2 <- 0.9; df$measurement2_cv <- c(rep(46, 3), rep(44, 3))
library(dplyr)
First, an inline function that operates on one group at a time (i.e., it ignores grouping, so it must be used within dplyr::do).
myfunc <- function(x) {
CV <- grep("^measurement.*_cv", colnames(x), value = TRUE)
MEAS <- gsub("_cv$", "", CV)
CV <- CV[MEAS %in% colnames(x)]
MEAS <- MEAS[MEAS %in% colnames(x)]
x[,MEAS] <- Map(function(meas, cv) replace(meas, cv > 45, NA), x[,MEAS], x[,CV])
x
}
df %>%
group_by(ID, Eye, Day, Time) %>%
do(myfunc(.))
# # A tibble: 6 x 9
# # Groups: ID, Eye, Day, Time [2]
# ID Eye Video_number Time Day measurement1 measurement1_cv measurement2 measurement2_cv
# <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 40001 L 1 1 1 0.6 10.2 NA 46
# 2 40001 L 2 1 1 0.5 10.2 NA 46
# 3 40001 L 3 1 1 0.8 10.2 NA 46
# 4 40001 L 1 2 1 NA 110. 0.9 44
# 5 40001 L 2 2 1 NA 110. 0.9 44
# 6 40001 L 3 2 1 NA 110. 0.9 44
Second, pivot, calculate, then unpivot, using tidyr.
library(tidyr) # pivot_*
df %>%
rename_with(.fn = ~ paste0(., "_val"), .cols = matches("^meas.*[^v]$")) %>%
rename_with(.fn = ~ gsub("(.*)_(.*)", "\\2_\\1", .), .cols = starts_with("meas")) %>%
pivot_longer(., matches("meas"), names_sep = "_", names_to = c(".value", "meas")) %>%
mutate(val = if_else(cv > 45, NA_real_, val)) %>%
pivot_wider(1:5, names_from = "meas", names_sep = "_", values_from = c("val", "cv"))
# # A tibble: 6 x 9
# ID Eye Video_number Time Day val_measurement1 val_measurement2 cv_measurement1 cv_measurement2
# <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 40001 L 1 1 1 0.6 NA 10.2 46
# 2 40001 L 2 1 1 0.5 NA 10.2 46
# 3 40001 L 3 1 1 0.8 NA 10.2 46
# 4 40001 L 1 2 1 NA 0.9 110. 44
# 5 40001 L 2 2 1 NA 0.9 110. 44
# 6 40001 L 3 2 1 NA 0.9 110. 44
(I admit that this looks/feels wonky; it shouldn't need all of that renaming, and then you need to un-rename for the final step.) Perhaps the first solution is cleaner/better :-)
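As an aside (my addition, separate from the two approaches above), the original loop can also be fixed directly: df$i looks for a column literally named "i", whereas [[ ]] evaluates i to get the column name:
vars <- c("measurement1", "measurement2", "measurement3")
for (i in vars) {
  cv_col <- paste0(i, "_cv")            # e.g. "measurement1_cv"
  df[[i]][df[[cv_col]] > 45] <- NA      # [[ ]] indexes by the string held in i
}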