For MICE imputations I need to constrain the predictions so that the imputed values have a fixed mean (which is a measured value). We are dealing with pooled blood serum samples (individual blood samples are pooled together), so each measured value represents the mean of the individuals in that pool. I am trying to predict what the concentration of x was in each individual, based on the measured mean and covariate data. You'll notice in my dummy dataset that there are 3 individuals (Individual_id) for each pool (Pool_id). So when imputing these values for the individuals, the average of those 3 individuals must equal the Pool_mean.
How can we constrain the mice algorithm so that it still predicts based on covariate data but the means match exactly (any method can be chosen; "cart" in this case)? Could this conceptually be done through a mice squeeze constraint with inputs from the mean?
The code is below:
library(mice)
library(dplyr)
#create demo data table as an example
Pool_id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
Pool_mean <- c(15, 15, 15, 35, 35, 35, 42, 42, 42)
Individual_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
concentration <- c(10, 20, NA, 30, NA,NA, NA, NA, 70)
co_variate <- c(0.1, 0.2, 0.1, 0.2, 0.3, 0.1, 0.1, 0.2, 0.3)
df <- data.frame(Pool_id, Pool_mean, Individual_id, concentration, co_variate)
#run mice to impute missing data
initial_imputed <- mice(df, m = 5, maxit = 10, method = "cart", seed = 3985)
completed <- complete(initial_imputed)
I know that we can constrain mice using post-processing and maybe a custom function like the vec_squeeze below. However, I need to constrain the values based on a mean. How could I update this function to do this?
vec_squeeze <- function(x, bounds) {
stopifnot(length(x) == nrow(bounds))
pmin(pmax(x, bounds[,1]), bounds[,2])
}
Here's an example of how to use passive imputation on the 3rd variable to force the mean of the imputations to be equal to pool_mean from the data. First generate some data in 'wide' format.
set.seed(123)
# Using larger example data to avoid issues with imputation models
n <- 20
pool_id <- rep(1:n, each = 3)
ind_id <- rep(1:3, times = n)
cov_1 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_2 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_3 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
conc_1 <- round(rnorm(n*3, mean = 20 + 5*cov_1, sd = 5))
conc_2 <- round(rnorm(n*3, mean = 20 + 5*cov_2, sd = 5))
conc_3 <- round(rnorm(n*3, mean = 20 + 5*cov_3, sd = 5))
pool_mean <- apply(cbind(conc_1, conc_2, conc_3), FUN = mean, MARGIN = 1)
df <- data.frame(pool_id, ind_id, pool_mean, conc_1, conc_2,
conc_3, cov_1, cov_2, cov_3)
df[which(rbinom(n*3, 1, prob = 0.5) == 1), "conc_3"] <- NA
df[which(rbinom(n*3, 1, prob = 0.2) == 1), "conc_2"] <- NA
df[which(is.na(df$conc_2)),"conc_3"] <- NA
head(df)
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 NA NA 0.3 0.1 0.2
#> 4 2 1 25.00000 25 NA NA 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 NA 0.2 0.3 0.3
I forced missing values into the third position to avoid re-arranging. I also have ind_id repeated within each pool_id instead of unique, but that's not important for what follows.
The key part of the passive imputation is meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)". If (A+B+C)/3 = D, then 3D - A - B = C.
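As a quick sanity check of that identity outside of mice, here is a minimal base R sketch. The numbers are taken from rows 1 and 6 of the example data printed above; the standalone vectors are only for illustration and are not part of the mice workflow.

```r
# Values copied from rows 1 and 6 of the example data
pool_mean <- c(18, 22)
conc_1 <- c(14, 23)
conc_2 <- c(16, 19)

# Solve (conc_1 + conc_2 + conc_3) / 3 = pool_mean for conc_3
conc_3 <- 3 * pool_mean - conc_1 - conc_2

# The row means of the three concentrations recover pool_mean
recovered <- rowMeans(cbind(conc_1, conc_2, conc_3))
print(conc_3)
print(recovered)
```

Both rows give conc_3 = 24, which agrees with the observed value in row 1 and with the passively imputed value in row 6.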
library(mice)
ini <- mice(df, maxit = 0, printFlag = FALSE)
# Limit the variables used in prediction, to avoid co-linearity
pred <- ini$predictorMatrix
pred[,] <- 0
pred["conc_1", c("pool_mean","cov_1")] <- 1
pred["conc_2", c("pool_mean","conc_1","cov_2")] <- 1
# Set the imputation methods. Use passive imputation for conc_3
meth <- ini$method
meth["conc_2"] <- "pmm"
meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)"
# Control the visit sequence to ensure that conc_3 is updated
# after conc_2. Add other missing variables if needed.
visit_seq <- c("conc_2", "conc_3")
imps <- mice(df, method = meth,
predictorMatrix = pred,
visitSequence = visit_seq,
printFlag = FALSE)
head(complete(imps, action = 1))
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 18 5 0.3 0.1 0.2
#> 4 2 1 25.00000 25 23 27 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 24 0.2 0.3 0.3
Created on 2022-11-20 with reprex v2.0.2
The imputation procedure has correctly replaced row 6's conc_3 value with 24. The other rows have received a random imputation for conc_2 and then had conc_3 passively imputed. In general conc_3 has no other restrictions on it. In row 3 conc_3 = 5, which may be questionably low. It could even become negative in some situations. Better modelling of conc_2 would help.
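To screen a completed data set for these issues, a small check along the following lines may help. This is only a sketch: check_pool_constraint and its min_plausible threshold are hypothetical names introduced here, not mice functions, and the data frame simply re-types rows 3, 4 and 6 of the completed output above (with pool_mean for row 3 written as 49/3).

```r
# Rows 3, 4 and 6 of the completed data shown above
completed <- data.frame(
  pool_mean = c(49 / 3, 25, 22),
  conc_1 = c(26, 25, 23),
  conc_2 = c(18, 23, 19),
  conc_3 = c(5, 27, 24)
)

# Hypothetical helper: checks the mean constraint and flags low conc_3 values
check_pool_constraint <- function(d, tol = 1e-8, min_plausible = 0) {
  row_mean <- unname(rowMeans(d[, c("conc_1", "conc_2", "conc_3")]))
  data.frame(
    mean_ok = abs(row_mean - d$pool_mean) < tol,  # constraint satisfied?
    below_min = d$conc_3 < min_plausible          # implausibly low conc_3?
  )
}

# The threshold of 10 is an arbitrary choice for this illustration
res <- check_pool_constraint(completed, min_plausible = 10)
print(res)
```

The mean constraint holds for every row by construction, so the useful part is the second column, which flags rows such as row 3 where the passive imputation produced a suspiciously low value.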
I tried to use mutate and then accumulate, but I keep getting only the 1+port figure.
mutate(acc_growth_p1 = accumulate(1+port1, '*'))
mutate(acc_growth_p1 = cumprod(port1 + 1))
For example:
data.frame(rates = seq(0, 0.5, by = 0.1)) %>%
mutate(growth = cumprod(rates + 1))
rates growth
1 0.0 1.0000
2 0.1 1.1000
3 0.2 1.3200
4 0.3 1.7160
5 0.4 2.4024
6 0.5 3.6036
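For comparison, the same running product can be written as a general fold in base R. The sketch below assumes nothing beyond base R and shows that Reduce with accumulate = TRUE and the multiplication function gives the same result as cumprod on the example rates.

```r
rates <- seq(0, 0.5, by = 0.1)

# Specialised running product
growth_cumprod <- cumprod(1 + rates)

# General running fold: note the backticks, which pass the function `*` itself
growth_reduce <- Reduce(`*`, 1 + rates, accumulate = TRUE)

print(growth_cumprod)
print(all.equal(growth_cumprod, growth_reduce))
```

cumprod is the simpler choice here; the Reduce form is mainly useful when the accumulating function is not a plain product.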
I have two data frames, both with 220 obs and 80 variables. The first data frame, df1, contains only the values 1, 2, and 3. The second data frame, df2, contains decimal values such as 0.12, -0.03, 0.01 etc. (they represent market cap weighted stock returns for a given month). PS: The length of the original data set is 80.
For example
df1 = data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
a b c d
1 2 3 1 3
2 2 2 1 3
3 1 3 2 1
df2 = data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6), c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))
a b c d
1 0.1 0.3 0.2 0.1
2 0.1 0.4 0.3 0.5
3 0.2 0.6 0.5 0.6
How can I sum the rows of df2 and turn the result into a matrix with 220 obs and 3 variables, based on the values in df1? Note that df1 and df2 have the same column names in the same order. That is, I want to create a third data frame, df3, by summing the values in each row of df2 according to the indicator values in df1:
df3 =
X1 X2 X3
1 0.2 0.1 0.4
2 0.3 0.5 0.5
3 0.8 0.5 0.6
Let's first look at (X1,1). Row 1 in df1 contains only one data point with value 1, which is (c,1). Thus, we take the matching value in row 1 of df2 to get 0.2. Now look at (X1,3) (last value of column X1). Row 3 in df1 has two data points with value 1. In df2 those two values are 0.2 (a,3) and 0.6 (d,3), which sum to 0.8.
Here is how each cell of df3 is calculated:
calculation = data.frame("1" = c("0+0+0.2+0", "0+0+0.3+0", "0.2+0+0+0.6"), "2" = c("0.1+0+0+0", "0.1+0.4+0+0", "0+0+0.5+0"), "3" = c("0+0.3+0+0.1", "0+0+0+0.5", "0+0.6+0+0"))
X1 X2 X3
1 0 + 0 + 0.2 + 0 0.1 + 0 + 0 + 0 0 + 0.3 + 0 + 0.1
2 0 + 0 + 0.3 + 0 0.1 + 0.4 + 0 + 0 0 + 0 + 0 + 0.5
3 0.2 + 0 + 0 + 0.6 0 + 0 + 0.5 + 0 0 + 0.6 + 0 + 0
More practical explanation based on stocks. Assume df1 is a matrix that describes buy, hold, and sell recommendations. df2 describes the market weighted stock returns. All variables/columns are different stocks. df3 creates a matrix with three different portfolios. If the stock is "buy", I want to put it in a "buy" portfolio. If the stock is "hold", I want to put it in a "hold" portfolio, etc. This is easily done in Excel with nested IF,AND,OR functions, but I do not know how to do it in R.
We could use tapply after converting the datasets to matrices, grouping by the row index of 'df2' and by the values of 'df1'
tapply(as.matrix(df2), list(row(df2), as.matrix(df1)), FUN = sum)
# 1 2 3
#[1,] 0.2 0.1 0.4
#[2,] 0.3 0.5 0.5
#[3,] 0.8 0.5 0.6
Or with tidyverse, bind the datasets after gathering both into 'long' format, and then do a grouped sum
library(tidyverse)
gather(df1) %>%
bind_cols(gather(df2)) %>%
group_by(key) %>%
group_by(rn = row_number(), value) %>%
summarise(value1 = sum(value1)) %>%
spread(value, value1) %>%
ungroup %>%
select(-rn)
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 0.2 0.1 0.4
#2 0.3 0.5 0.5
#3 0.8 0.5 0.6
Here is another base R method that uses rowsum to perform group sums and loops through the rows with mapply.
t(mapply(rowsum, as.data.frame(t(df2)), as.data.frame(t(df1))))
[,1] [,2] [,3]
V1 0.2 0.1 0.4
V2 0.3 0.5 0.5
V3 0.8 0.5 0.6
Note that I am using R 3.4.4. The as.data.frame calls are needed because t returns a matrix even when it is fed a data.frame, and mapply has to loop over columns rather than over individual elements.
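The zero-padded calculation table in the question also suggests a direct masking approach, sketched here in base R as one more option. It assumes df1 and df2 have identical dimensions and column order, as stated in the question; the names m1, m2 and groups are introduced only for this illustration.

```r
df1 <- data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
df2 <- data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6),
                  c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)
groups <- sort(unique(as.vector(m1)))  # 1, 2, 3

# For each group, zero out the returns that belong to other groups,
# then sum across each row
df3 <- sapply(groups, function(g) rowSums(m2 * (m1 == g)))
colnames(df3) <- paste0("X", groups)
print(df3)
```

Each column of df3 is one portfolio, and a row with no member of a group simply gets 0 for that group.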
For a specific task, I have written the following R script:
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))
cut <- unique(pred)
cut_n <- length(cut)
n <- length(pred)
class_1 <- numeric(cut_n)
class_2 <- numeric(cut_n)
curr_cut <- cut[1]
class_1_c <- 0
class_2_c <- 0
j <- 1
for (i in 1:n) {
  if (curr_cut != pred[i]) {
    j <- j + 1
    curr_cut <- pred[i]
  }
  if (grp[i] == levels(grp)[1])
    class_1_c <- class_1_c + 1
  else
    class_2_c <- class_2_c + 1
  class_1[j] <- class_1_c
  class_2[j] <- class_2_c
}
cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
My goal above was to compute, for each unique value in pred, the cumulative number of times each level of grp has appeared. For example, I get the following output for the above:
index: 0.1 0.2 0.3
class1: 2 3 5
class2: 1 2 2
I am a beginner in R and I have a few questions about this:
How can I make this code faster and simpler?
Is it possible to vectorize this and avoid the for loop?
Is there a different "R-esque" way of doing this?
Any help would be greatly appreciated. Thanks!
You can start by getting the group/pred counts using table
table(grp, pred)
# pred
# grp 0.1 0.2 0.3
# 1 2 1 2
# 2 1 1 0
Of course this isn't exactly what you wanted. You want cumulative totals, so we can adjust this result by applying a cumulative sum across each row (transposed to better match your data layout)
t(apply(table(grp, pred), 1, cumsum))
# grp 0.1 0.2 0.3
# 1 2 3 5
# 2 1 2 2
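Putting the two steps together, the following sketch recovers the class_1 and class_2 vectors from the question. One caveat is worth stating as an assumption: table orders the pred values by sorting them, whereas the loop in the question relies on pred already being grouped in order, so the two only agree when pred is sorted.

```r
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))

# Counts per group (rows) and pred value (columns)
counts <- table(grp, pred)

# Cumulative sum along each row; apply returns the result column-wise,
# so transpose to get groups back in rows
cum_counts <- t(apply(counts, 1, cumsum))

class_1 <- unname(cum_counts["1", ])
class_2 <- unname(cum_counts["2", ])
cat("index:", colnames(cum_counts), "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
```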
Suppose I have the following vector.
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
Is there a way to get a frequency table over irregular ranges, such as the following?
Frequency between 0 and 0.1
Frequency between 0.2 and 0.4
Frequency between 0.5 and 0.8
Frequency between 0.9 and 1
Thanks
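There are a few extra unnecessary groups in here, but you can ignore those or subset them. Note that the intervals are closed on the right, so 0.2 is counted in (0.1,0.2] rather than together with 0.3 and 0.4, and include.lowest = TRUE is needed for the value 0 to be counted at all.
table(cut(test, breaks = c(0,0.1,0.2,0.4,0.5,0.8,0.9,1), include.lowest = TRUE))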
I'm not aware of a dedicated function, but you could write your own:
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
mapply(function (start, end) { sum(test >= start & test <= end) },
c(0, 0.2, 0.5, 0.9), # starts
c(0.1, 0.4, 0.8, 1)) # ends
# [1] 3 11 10 4
The use of mapply is purely to vectorise over the starts and ends which you supply. Note test is hard-coded into this function and the endpoints are inclusive, so adjust as necessary, etc.
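If hard-coding test is a concern, the same idea can be wrapped in a small function that takes the vector as an argument. This is a sketch only: count_in_ranges is a hypothetical name, and it keeps the inclusive endpoints used above.

```r
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,
          0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)

# Hypothetical helper: counts the values of x in each closed range [start, end]
count_in_ranges <- function(x, starts, ends) {
  stopifnot(length(starts) == length(ends))
  counts <- vapply(
    seq_along(starts),
    function(i) sum(x >= starts[i] & x <= ends[i]),
    integer(1)
  )
  names(counts) <- paste(starts, "to", ends)
  counts
}

counts <- count_in_ranges(test, c(0, 0.2, 0.5, 0.9), c(0.1, 0.4, 0.8, 1))
print(counts)
```

Values that fall between two ranges (for example 0.15) are not counted anywhere, so the counts only add up to length(x) when the ranges cover every value present.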
Something like this maybe:
labs <- c("0 and 0.1", "0.2 and 0.4", "0.5 and 0.8", "0.9 and 1")
table(cut(test, c(0, .2, .5, .9, 1.1), right = FALSE, labels = labs))
## 0 and 0.1 0.2 and 0.4 0.5 and 0.8 0.9 and 1
## 3 11 10 4
Assuming that you really want to bin these as tenths, and there are no missing intervals, findInterval is made for the task.
Here, 1.0 is in a group by itself:
table(findInterval(test, c(0,.2, .5, .9, 1)))
## 1 2 3 4 5
## 3 11 10 3 1
With this statement, 1.0 is in the last interval, with .9:
table(findInterval(test, c(0,.2, .5, .9, 1), rightmost.closed = TRUE))
## 1 2 3 4
## 3 11 10 4
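The interval numbers returned by findInterval can be turned into readable labels by wrapping them in a factor. The sketch below reuses the labels from the cut answer above; the variable names are only illustrative.

```r
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,
          0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)

breaks <- c(0, 0.2, 0.5, 0.9, 1)
labs <- c("0 and 0.1", "0.2 and 0.4", "0.5 and 0.8", "0.9 and 1")

# rightmost.closed = TRUE puts 1.0 into the last interval together with 0.9
bins <- findInterval(test, breaks, rightmost.closed = TRUE)

# Fixing the levels keeps empty intervals in the table with a count of 0
freq <- table(factor(bins, levels = seq_along(labs), labels = labs))
print(freq)
```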