For MICE imputation I need to constrain the predictions so that the imputed values have the same mean as a measured value. The situation is that we are dealing with pooled blood serum samples (individual blood samples are pooled together), where the measured values represent the mean of those individuals. I am trying to predict the concentration of x in each individual based on the measured pool mean and covariate data. You'll notice in my dummy dataset that there are 3 individuals (Individual_id) for each pool (Pool_id), so when imputing these values for the individuals we need the average of those 3 individuals to equal the Pool_mean.
How can we constrain the MICE algorithm to still predict from the covariate data but have the means match exactly (any method is fine; "cart" in this case)? Could this conceptually be done through a MICE squeeze constraint with inputs from the mean?
The code is below:
library(mice)
library(dplyr)
#create demo data table as an example
Pool_id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
Pool_mean <- c(15, 15, 15, 35, 35, 35, 42, 42, 42)
Individual_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
concentration <- c(10, 20, NA, 30, NA,NA, NA, NA, 70)
co_variate <- c(0.1, 0.2, 0.1, 0.2, 0.3, 0.1, 0.1, 0.2, 0.3)
df <- data.frame(Pool_id, Pool_mean, Individual_id, concentration, co_variate)
#run mice to impute missing data
initial_imputed <- mice(df, m = 5, maxit = 10, meth = "cart", seed = 3985)
completed <- complete(initial_imputed)
I know that we can constrain mice using post-processing, perhaps with a custom function like vec_squeeze below. However, I need to constrain the values based on a mean. How could I update this function to do that?
vec_squeeze <- function(x, bounds) {
  stopifnot(length(x) == nrow(bounds))
  pmin(pmax(x, bounds[,1]), bounds[,2])
}
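For reference, here is how vec_squeeze behaves on its own; the x and bounds values below are made up purely for illustration. It clamps each value into its own [lower, upper] interval, so by itself it cannot force a group of values to match an exact mean.
x <- c(5, 25, 60)
bounds <- cbind(lower = c(0, 10, 40), upper = c(20, 30, 55))
vec_squeeze(x, bounds)
#> [1]  5 25 55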
Here's an example of how to use passive imputation on the 3rd variable to force the mean of the imputations to be equal to pool_mean from the data. First generate some data in 'wide' format.
set.seed(123)
# Using larger example data to avoid issues with imputation models
n <- 20
pool_id <- rep(1:n, each = 3)
ind_id <- rep(1:3, times = n)
cov_1 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_2 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_3 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
conc_1 <- round(rnorm(n*3, mean = 20 + 5*cov_1, sd = 5))
conc_2 <- round(rnorm(n*3, mean = 20 + 5*cov_2, sd = 5))
conc_3 <- round(rnorm(n*3, mean = 20 + 5*cov_3, sd = 5))
pool_mean <- apply(cbind(conc_1, conc_2, conc_3), FUN = mean, MARGIN = 1)
df <- data.frame(pool_id, ind_id, pool_mean, conc_1, conc_2,
conc_3, cov_1, cov_2, cov_3)
df[which(rbinom(n*3, 1, prob = 0.5) == 1), "conc_3"] <- NA
df[which(rbinom(n*3, 1, prob = 0.2) == 1), "conc_2"] <- NA
df[which(is.na(df$conc_2)),"conc_3"] <- NA
head(df)
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 NA NA 0.3 0.1 0.2
#> 4 2 1 25.00000 25 NA NA 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 NA 0.2 0.3 0.3
I forced missing values into the third position to avoid re-arranging. I also have ind_id repeated within each pool_id instead of unique, but that's not important for what follows.
The key part of the passive imputation is meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)". If (A+B+C)/3 = D, then 3D - A - B = C.
library(mice)
ini <- mice(df, maxit = 0, printFlag = FALSE)
# Limit the variables used in prediction, to avoid co-linearity
pred <- ini$predictorMatrix
pred[,] <- 0
pred["conc_1", c("pool_mean","cov_1")] <- 1
pred["conc_2", c("pool_mean","conc_1","cov_2")] <- 1
# Set the imputation methods. Use passive imputation for conc_3
meth <- ini$method
meth["conc_2"] <- "pmm"
meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)"
# Control the visit sequence to ensure that conc_3 is updated
# after conc_2. Add other missing variables if needed.
visit_seq <- c("conc_2", "conc_3")
imps <- mice(df, method = meth,
predictorMatrix = pred,
visitSequence = visit_seq,
printFlag = FALSE)
head(complete(imps, action = 1))
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 18 5 0.3 0.1 0.2
#> 4 2 1 25.00000 25 23 27 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 24 0.2 0.3 0.3
Created on 2022-11-20 with reprex v2.0.2
The imputation procedure has correctly imputed row 6's missing conc_3 as 24 (3*22 - 23 - 19). The other rows have received a random imputation for conc_2 and then had conc_3 passively imputed. In general conc_3 has no other restrictions on it: in row 3, conc_3 = 5, which may be questionably low, and it could even become negative in some situations. Better modelling of conc_2 would help.
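If hard limits on the imputed concentrations are also needed, one option (a rough sketch, not part of the solution above) is to post-process the conc_2 draws with mice's squeeze() before conc_3 is derived from them. The 0-60 bounds below are arbitrary placeholders, and this still does not by itself guarantee a non-negative conc_3.
post <- ini$post
# clamp the conc_2 draws to a plausible range before conc_3 is computed from them;
# the bounds are placeholders, not values taken from the data
post["conc_2"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(0, 60))"
imps2 <- mice(df, method = meth,
              predictorMatrix = pred,
              visitSequence = visit_seq,
              post = post,
              printFlag = FALSE)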
I'm looking to create a hybrid of cumsum() and TTR::runSum(), where cumsum() runs up until a pre-specified number of data points, at which point it behaves more like runSum().
For example:
library(TTR)
data <- rep(1:3,2)
cumsum <- cumsum(data)
runSum <- runSum(data, n = 3)
DesiredResult <- ifelse(is.na(runSum),cumsum,runSum)
Is there a way to get to DesiredResult that doesn't require finagling with NAs?
That is what the partial=TRUE argument to rollapplyr does. Here we show this with sum and also with sd and IQR. (Note that the sd of one value is NA and we chose IQR since it is a measure of spread that can be calculated for scalars although it is always 0 in that case.)
library(zoo)
rollapplyr(data, 3, sum, partial = TRUE)
## [1] 1 3 6 6 6 6
rollapplyr(data, 3, sd, partial = TRUE)
## [1] NA 0.7071068 1.0000000 1.0000000 1.0000000 1.0000000
rollapplyr(data, 3, IQR, partial = TRUE)
## [1] 0.0 0.5 1.0 1.0 1.0 1.0
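As a quick sanity check, the partial = TRUE result matches the DesiredResult built in the question (assuming data and DesiredResult are still in the workspace):
all.equal(as.numeric(rollapplyr(data, 3, sum, partial = TRUE)), as.numeric(DesiredResult))
## [1] TRUE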
Here are three alternatives.
n <- 3
rowSums(embed(c(rep(0, n - 1), data), n)) # base R
# [1] 1 3 6 6 6 6
library(TTR)
runSum(c(rep(0, n - 1), data), n = n)
# [1] NA NA 1 3 6 6 6 6 # na.omit fixes the beginning
library(zoo)
rollsum(c(rep(0, n - 1), data), k = 3, align = "right")
# [1] 1 3 6 6 6 6
I have a data.table that looks like:
A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)
M <- data.table(A,B,C,D)
My question is similar to R help: divide values by sum produced through factor, with a few extra considerations. A specifies a date (I'm simply using integers here), B identifies individuals, C is a classification the individual belongs to, and D is a value variable.
For each classification c of C and each day a of A, divide the value D by the sum of the values of all individuals in c, carrying observations backward when needed such that 0 < x - a <= N, where x is the date of another individual's observation (meaning we pick the smallest such x - a and use that observation as an approximation for the other individual's value in group c on day a).
Let's say N=5. Here's my expected output.
A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7/(0.7+0.3), 0.3/(0.3), 0.5/(0.5), 0.9/(0.9), 4/(4+7), 7/(4+7))
M <- data.table(A,B,C,D)
Note that individual 4's value is not carried backward for individual 3, because the gap (20 - 5 = 15) is greater than 5. Is there a nice way of doing this in data.table?
For each value in D, I wish to divide by the sum of all the values of the same group (I, II, or III) on that day. However, you'll notice that for some groups, observations do not exist on that day. I'll try to walk through the logic on a few observations.
Edit: Let me try and walk through a few cases.
For individual 1 (column B) on day 1 (column A), the individual is of group I (column C). Other individuals of group I are: 2. For each of those others, we see that for individual 2, their nearest observation is on day 3 and 3-1<=5, so we'll use 0.3 in the denominator.
For individual 3 (column B) on day 5 (column A), the individual is of group II (column C). The other individual of group II is: 4. We see that individual 4's nearest observation is on day 20 and 20-5>5, so we cannot use their observation in the denominator.
This, I think, will give you your answer:
A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
V <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
N=5
#Put data into a frame
test = data.frame(A,B,C,V)
#order the data
test = test[order(test$C, test$A),] # order by group, then day (works whether C is a factor or character)
#Get the 'rollback' possibilities for each value
Roll = sapply(test$A, FUN = function(x){paste(which(test$A < (x+N) & test$A >= x), collapse=",")})
#Get the groupings
Group = sapply(test$C, FUN = function(x){paste(which(test$C == x), collapse=",")})
#Intersect the values
ToGet = apply(cbind(Roll, Group), MARGIN=1, FUN=function(x){intersect(unlist(strsplit(x[1],",")), unlist(strsplit(x[2],",")))})
#Calculate the denominators
test$D = sapply(ToGet, FUN=function(x){sum(test$V[as.numeric(x)])})
test$Calc = test$V/test$D
Output:
> test
A B C V D Calc
1 1 1 I 0.7 1.0 0.7000000
2 3 2 I 0.3 0.4 0.7500000
7 7 7 I 0.1 0.1 1.0000000
3 5 3 II 0.5 0.5 1.0000000
4 20 4 II 0.9 0.9 1.0000000
5 21 5 III 4.0 11.0 0.3636364
6 21 6 III 7.0 11.0 0.6363636
The question is tagged with data.table, so here is a data.table solution which uses a non-equi join to identify the individuals within each group whose observations fall within a date window of 5 days and treat them as a cohort. (Note that the join window [day, day + N] includes each individual's own observation, so its own value is always part of the denominator.)
library(data.table) # CRAN version 1.10.4 used
# set length of date window in days
N <- 5L
# give columns more semantic names according to OP's description
setnames(M, c("day", "id", "grp", "val"))
# prepare data for non-equi join: allowable date range
ranged <- M[, .(start = day, end = day + N, co.id = id, grp)]
# non-equi join to determine cohort
joined <- M[ranged, on = c("grp", "day>=start", "day<=end")]
# compute denominator for each cohort
grouped <- joined[, .(den = sum(val)), by = co.id]
# final update on join and order
result <- M[grouped, on = c("id==co.id"), calc := val / den][order(grp, id)]
result
# day id grp val calc
#1: 1 1 I 0.7 0.7000000
#2: 3 2 I 0.3 0.7500000
#3: 7 7 I 0.1 1.0000000
#4: 5 3 II 0.5 1.0000000
#5: 20 4 II 0.9 1.0000000
#6: 21 5 III 4.0 0.3636364
#7: 21 6 III 7.0 0.6363636
Data
A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
M <- data.table(A,B,C,D)
Compact versions
For those who prefer compact code, here is a more convoluted version:
joined <- M[M[, .(start = day, end = day + N, co.id = id, grp)],
on = c("grp", "day>=start", "day<=end")]
M[joined[, .(den = sum(val)), by = co.id], on = c("id==co.id"),
calc := val / den][order(grp, id)]
Or, as a "one-liner":
M[M[M[, .(start = day, end = day + N, co.id = id, grp)],
on = c("grp", "day>=start", "day<=end")
][, .(den = sum(val)), co.id],
on = c("id==co.id"), calc := val / den][order(grp, id)]
I am trying to simulate the forecast (rather than just taking the mean) using this code:
set.seed(100)
df <- data.frame("lower" = c(-1, 0.5, 0), "upper" = c(0, 2, 4), "mean" = c(-0.5, 1.2, 2.5))
df$simulation <- rnorm(1, df$mean, (df$upper - df$lower) / 2 / 1.96)
df
# lower upper mean simulation
#1 -1.0 0 -0.5 -0.6281103
#2 0.5 2 1.2 -0.6281103
#3 0.0 4 2.5 -0.6281103
I get the same value in every row of the simulation column.
However, if I set n to the number of rows, the result looks better:
df$simulation <- rnorm(nrow(df), df$mean, (df$upper - df$lower) / 2 / 1.96)
df
# lower upper mean simulation
#1 -1.0 0 -0.5 -0.6281103
#2 0.5 2 1.2 1.2503308
#3 0.0 4 2.5 2.4194724
Is the latter solution the right way of doing this?
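For what it's worth, the difference comes from the n argument: rnorm(n, mean, sd) returns n draws, and the mean and sd vectors are recycled to length n. With n = 1 only a single draw is made (using just the first elements of mean and sd), and that one value is then recycled down the column when assigned. A minimal check:
length(rnorm(1, mean = c(-0.5, 1.2, 2.5), sd = 1))
# [1] 1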
I'm trying to write a program that takes a large data frame and replaces each column of values by the cumulative frequency of those values (sorted ascending). For instance, if a column contains the values 5, 8, 3, 5, 4, 3, 8, 5, 5, 1, then the relative and cumulative frequencies are:
1: rel_freq=0.1, cum_freq = 0.1
3: rel_freq=0.2, cum_freq = 0.3
4: rel_freq=0.1, cum_freq = 0.4
5: rel_freq=0.4, cum_freq = 0.8
8: rel_freq=0.2, cum_freq = 1.0
Then the original column becomes: 0.8, 1.0, 0.3, 0.8, 0.4, 0.3, 1.0, 0.8, 0.8, 0.1
The following code performs this operation correctly, but it scales poorly, probably due to the nested loops. Any idea how to perform this task more efficiently?
mydata = read.table(.....)
totalcols = ncol(mydata)
totalrows = nrow(mydata)
for (i in 1:totalcols) {
  freqtable = data.frame(table(mydata[,i])/totalrows) # create freq table
  freqtable$CumSum = cumsum(freqtable$Freq) # calc cumulative freq
  hashtable = new.env(hash=TRUE)
  nrows = nrow(freqtable)
  # store cum freq in hash
  for (x in 1:nrows) {
    dummy = toString(freqtable$Var1[x])
    hashtable[[dummy]] = freqtable$CumSum[x]
  }
  # replace original data with cum freq
  for (j in 1:totalrows) {
    dummy = toString(mydata[j,i])
    mydata[j,i] = hashtable[[dummy]]
  }
}
This handles a single column without the for-loop:
R> x <- c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1)
R> y <- cumsum(table(x)/length(x))
R> y[as.character(x)]
5 8 3 5 4 3 8 5 5 1
0.8 1.0 0.3 0.8 0.4 0.3 1.0 0.8 0.8 0.1
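To apply the same idea to every column of a data frame (as in the question), one could wrap it in lapply(); this is a sketch that assumes mydata is the question's data frame of numeric columns:
# replace each column by the cumulative frequency of its values
mydata[] <- lapply(mydata, function(x) {
  cf <- cumsum(table(x) / length(x))
  as.numeric(cf[as.character(x)])
})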
Here is one way, using a data frame with two variables, each containing your example data:
d <- data.frame(var1 = c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1),
var2 = c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1))
use a simple function to:
1. generate the cumsum() of the relative proportions given by table(x) / length(x), then
2. match() the observations in a variable with the names of the table of cumulative sums, then
3. use the id matches to select from the table of cumulative sums (and un-name it).
Such a function is:
f <- function(x) {
  tab <- cumsum(table(x) / length(x))
  ind <- match(x, as.numeric(names(tab)))
  unname(tab[ind])
}
In practice we use lapply() and coerce to a data frame:
out <- data.frame(lapply(d, f))
out
which gives:
R> out
var1 var2
1 0.8 0.8
2 1.0 1.0
3 0.3 0.3
4 0.8 0.8
5 0.4 0.4
6 0.3 0.3
7 1.0 1.0
8 0.8 0.8
9 0.8 0.8
10 0.1 0.1
R Version 2.11.1 32-bit on Windows 7
I have two data sets as shown below:
data_set_A:
USER_B ACTION
10 0.1
11 0.3
12 0.1
data_set_B:
USER_A USER_B ACTION
1 10 0.2
1 11 0.1
1 15 0.1
2 12 0.2
How to add the ACTION of USER_B from data_set_A to data_set_B? The USER_B in data_set_A is a subset of USER_B in data_set_B.
For the example above, the result would be:
USER_A USER_B ACTION
1 10 0.2+0.1
1 11 0.1+0.3
1 15 0.1
2 12 0.2+0.1
In data_set_B I don't need to consider USER_A, just whether USER_B appears in data_set_A.
I wonder if this could be achieved without doing it row by row?
dfa <- data.frame(
user_b = 10:12,
action = c(0.1, 0.3, 0.1)
)
dfb <- data.frame(
user_a = c(1, 1, 1, 2),
user_b = c(10, 11, 15, 12),
action = c(0.2, 0.1, 0.1, 0.2)
)
# look up each dfb user_b in dfa; gives NA where there is no match
action <- dfa$action[match(dfb$user_b, dfa$user_b)]
# unmatched users contribute 0
action[is.na(action)] <- 0
dfb$action <- dfb$action + action
dfb
user_a user_b action
1 1 10 0.3
2 1 11 0.4
3 1 15 0.1
4 2 12 0.3
One way is to do the equivalent of a database merge on the two data sets to form the action pairs you want and then sum those. Using #Andrie's example code:
dfa <- data.frame(
user_b = 10:12,
action = c(0.1, 0.3, 0.1)
)
dfb <- data.frame(
user_a = c(1, 1, 1, 2),
user_b = c(10, 11, 15, 12),
action = c(0.2, 0.1, 0.1, 0.2)
)
Solution Code
I'll first present the full solution and then explain the steps:
mdat <- merge(dfb, dfa, by = "user_b", all.x = TRUE)
res <- data.frame(mdat[,c(2,1)],
action = rowSums(mdat[, c("action.x", "action.y")],
na.rm = TRUE))
res <- res[order(res$user_a, res$user_b),]
res now contains the results.
Explanation
We first merge the two data frames, matching on user_b:
## merge the data
mdat <- merge(dfb, dfa, by = "user_b", all.x = TRUE)
mdat
giving:
> mdat
user_b user_a action.x action.y
1 10 1 0.2 0.1
2 11 1 0.1 0.3
3 12 2 0.2 0.1
4 15 1 0.1 NA
Then we just use this object to create the result data frame, summing the two action.* columns (action.x and action.y) row-wise:
## format the merged data with summed `action`
res <- data.frame(mdat[,c(2,1)],
action = rowSums(mdat[, c("action.x", "action.y")],
na.rm = TRUE))
## reorder
res <- res[order(res$user_a, res$user_b),]
res
resulting in
> res
user_a user_b action
1 1 10 0.3
2 1 11 0.4
4 1 15 0.1
3 2 12 0.3
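One detail worth noting about this approach: user 15 appears only in dfb, so after the merge its action.y is NA, and the na.rm = TRUE in rowSums() treats that NA as zero, mirroring the explicit zero-fill in the match()-based answer above. For example:
mdat[mdat$user_b == 15, ]
#   user_b user_a action.x action.y
# 4     15      1      0.1       NA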