How to simulate MAR missing data in R?

I would like to simulate some missing data in R but am having trouble. I have created two variables ('pre' and 'post') that represent a measurement for the same individual pre- and post-treatment (i.e. paired data). I have been able to do this for data that is Missing Completely at Random (MCAR), see below, but I can't figure out how to code it for Missing at Random (MAR). For the MAR missing data, I would like to create three categories based on the pre-treatment observations that determine how many of the post-treatment observations are missing, i.e.:
For pre > 25, 40% post missing
For pre > 21 and ≤ 25, 30% post missing
For pre ≤ 21, 20% post missing
Can anyone help out? (I'd be really grateful!)
Thanks
library(mvtnorm) # rmvnorm() comes from the mvtnorm package
set.seed(80122)
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
data <- rmvnorm(n,mean=c(28,18),sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix
# Split into pre and post treatment and check correlation is what was specified
pre <- data[, 1]
post <- data[, 2]
cor.test(pre,post)
# Simulate MCAR
mcar <- 1 - rbinom(n, 1, 0.2) # ~20% of entries will be 0; those positions become NA below
post_mcar <- post
post_mcar[mcar == 0] <- NA # Set the randomly selected ~20% of post values to NA
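A quick sanity check (not in the original post, just for illustration) is that the realised MCAR rate comes out near the requested 20%:
mean(is.na(post_mcar)) # should be roughly 0.2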

This is an old question, but I thought I'd take a crack at it.
Simulate fake data as in the OP:
library(tidyverse)
library(mvtnorm)
# Number of data values
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
set.seed(80122)
data <- rmvnorm(n, mean=c(28,18),
                sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix
Convert to data frame:
data = as.data.frame(data)
names(data) = c("pre", "post")
Simulate missing completely at random (MCAR) data:
data$post_mcar <- data$post
set.seed(2)
data$post_mcar[sample(1:nrow(data), 0.2*nrow(data))] = NA
Simulate missing at random (MAR) data: First, we'll create a grouping variable, frac, whose value is the fraction of the group that we want to set to missing. We'll use the cut function to create these groups and set the label values, then we'll convert the labels to numeric for later use:
data = data %>%
  mutate(post_mar = post,
         frac = as.numeric(as.character(cut(pre, breaks=c(-Inf, 21, 25, Inf),
                                            labels=c(0.2, 0.3, 0.4)))))
Now, group by frac and set a randomly selected fraction of each group's post_mar values to NA, using frac as the fraction to set missing.
set.seed(3)
data = data %>%
  group_by(frac) %>%
  mutate(post_mar = replace(post_mar, row_number(post_mar) %in% sample(1:n(), round(unique(frac)*n())), NA)) %>%
  ungroup()
Here are the last 6 rows of the resulting data frame:
pre post post_mcar post_mar frac
995 28.63476 19.35081 19.35081 19.35081 0.4
996 32.86278 24.16119 NA NA 0.4
997 28.25965 16.64538 16.64538 16.64538 0.4
998 24.35255 17.80365 17.80365 17.80365 0.3
999 28.12426 18.25222 18.25222 NA 0.4
1000 27.55075 14.47757 14.47757 14.47757 0.4
Here's a check on the fraction of values missing in each group. Note that the actual percentage of values set to missing can differ from frac if the requested percentage doesn't result in an integer number of rows. Here, for example, there's no way to select 20% of 8 values. It can be 12.5% (1 value) or 25% (2 values).
data %>% group_by(frac) %>%
  summarise(N = n(),
            N_missing = sum(is.na(post_mar)),
            Frac_missing = N_missing/N)
frac N N_missing Frac_missing
1 0.2 8 2 0.2500000
2 0.3 138 41 0.2971014
3 0.4 854 342 0.4004684
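For completeness, here is a minimal base-R sketch of the same MAR mechanism, working directly on the pre and post vectors from the question rather than the data frame above; the post_mar2 and band names and the seed are just illustrative.
# Band 1: pre <= 21, band 2: 21 < pre <= 25, band 3: pre > 25 (right-closed, as in cut() above)
band <- cut(pre, breaks = c(-Inf, 21, 25, Inf), labels = FALSE)
frac <- c(0.2, 0.3, 0.4)[band]          # per-band missingness fractions
post_mar2 <- post
set.seed(4)
for (f in unique(frac)) {
  idx <- which(frac == f)                              # rows in this pre-treatment band
  post_mar2[sample(idx, round(f * length(idx)))] <- NA # set the requested fraction to NA
}
tapply(is.na(post_mar2), frac, mean)                   # check the per-band missing rates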

Related

How to trim a percentage of data points within a specific range in R

I have a text file containing millions of p-values (range: 1 to 5e-09, column $P). My goal is to generate a Manhattan plot in R using these p-values. However, since the vast majority of the p-values are in the 0.01-1 range, I would like to randomly trim, say, 95% of the p-values in this range before generating the plot (so as to reduce the output file size). Until now, I have been using:
data <- read.table(<path_to_my_p-value_file>)
data <- subset(data,data$P<=0.01)
but this command removes all p-values greater than 0.01, which results in an unsightly gap between the x-axis and the remaining p-values in the Manhattan plot. Is there a way to trim most of the p-values within a specified range (instead of all)?
It feels a bit hacky, but the following could do it. It first checks a condition (here, x > 0) and then sets a given percentage of those values to missing based on runif() (here 95%). After that you could remove the rows with missing values.
There should be a better way of achieving the desired results though...
df2 <- df %>% mutate(
  x = if_else(condition = x > 0,
              true = if_else(runif(length(x)) < .95, NA_real_, x),
              false = x)
)
reprex
library(dplyr)
set.seed(42)
n <- 300
df <- data.frame(
  x = rnorm(n),
  y = rnorm(n)
)
df2 <- df %>% mutate(
  x = if_else(condition = x > 0,
              true = if_else(runif(length(x)) < .95, NA_real_, x),
              false = x)
)
plot(df, pch = 3)
points(df2, col = "red")
Created on 2021-07-05 by the reprex package (v2.0.0)
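The final removal step mentioned above ("After that you could remove the rows with missing values") is then just a filter; for example, with the df2 from the reprex (the name df2_trimmed is just illustrative):
df2_trimmed <- df2 %>% filter(!is.na(x)) # keep only the rows that survived the thinning
nrow(df2_trimmed)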
Here is a method that would zero out 90% of the highest 95% of values. Obviously you would not want to do this on the original of your data, but rather on a copy from which you would then remove the 0's. Multiply the higher p-values (the highest 95% in this example) by a random sample from {0, 1} of the correct length, with probability 0.9 for 0 and 0.1 for 1:
set.seed(123)
dx <- data.frame(x=runif(100))
dx$sel <- dx$x < 0.05 #Should "select" the lowest 5%, leave them alone
dx$x[!dx$sel] <- dx$x[!dx$sel]* # only work on the higher ones
sample(c(0,1),size=sum(!dx$sel), replace=TRUE, prob=c(.9,.1))
Gets you five values below 0.05 and 11 above 0.05. The exact number of those higher values will vary a bit depending on the random seed and the length of the constructed vector.
> table(dx$x)
0 0.000624773325398564 0.0246136845089495 0.0420595335308462
84 1 1 1
0.0455564993899316 0.0458311666734517 0.0935949867125601 0.102924682665616
1 1 1 1
0.320373242488131 0.414546335814521 0.453334156190977 0.511505459900945
1 1 1 1
0.59414202044718 0.656758127966896 0.883017404004931 0.892419044394046
1 1 1 1
0.954503649147227
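The removal step mentioned at the start (done on the copy, so the original data stay intact) might then look like this; the name dx_trimmed is just illustrative:
dx_trimmed <- dx[dx$x > 0, ] # drops the zeroed-out rows, keeping the 5 low p-values plus the 11 surviving higher ones
nrow(dx_trimmed)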
You also might look at the code used by functions that do "winsorizing". (No, I didn't misspell that term.)

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like much more spread-out data, so I could simply transform it by adding 50 and have a range quite similar to my first variable.
Does anyone of you know a way to generate this kind of - more or less -meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
On its own this gives r = -1, so add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
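If a different seed does push values outside the range, a simple paired filter along these lines would handle it (the keep/A2/B2 names are illustrative); note that dropping or truncating values will nudge the correlation a little, so it's worth re-checking it afterwards:
keep <- B >= 0 & B <= 1e5
A2 <- A[keep]; B2 <- B[keep] # drop out-of-range pairs together so A and B stay aligned
cor(A2, B2)                  # re-check the correlation after filtering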
If you're happy with the correlation and the marginal distribution (i.e. shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, will be an example of the 'linear transformation' recommended by @gregor-de-cillia.
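For example, applying that linear transformation to the var2 generated by the code in the question leaves the correlation untouched, since correlation is invariant to adding a constant and multiplying by a positive constant (the name var2_scaled is illustrative):
var2_scaled <- var2 * 100000 + 50000 # rescale from roughly (-0.5, 0.5) to roughly (0, 100000)
cor(var1, var2_scaled)               # still ~ -0.78
range(var2_scaled)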

How does createDataPartition function from caret package split data?

From the documentation:
For bootstrap samples, simple random sampling is used.
For other data splitting, the random sampling is done within the levels of y
when y is a factor in an attempt to balance the class distributions within
the splits.
For numeric y, the sample is split into groups sections based on percentiles
and sampling is done within these subgroups.
For createDataPartition, the number of percentiles is set via the groups
argument.
I don't understand why this "balance" thing is needed. I think I understand it superficially, but any additional insight would be really helpful.
It means that if you have a data set ds with 10000 rows
set.seed(42)
ds <- data.frame(values = runif(10000))
with 2 "classes" with unequal distribution (9000 vs 1000)
ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
# 1 2
# 9000 1000
you can create a sample, which tries to maintain the ratio / "balance" of the factor classes.
library(caret)
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
# 1 2
# 900 100
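For the numeric-y case the documentation describes, here is a hedged sketch of the percentile-based splitting; the sample size and the p and groups values are just illustrative:
set.seed(42)
y_num <- rnorm(1000, mean = 50, sd = 10)
idx <- createDataPartition(y_num, p = 0.8, list = FALSE, groups = 5) # 5 percentile-based groups
# The training split should roughly preserve the distribution of y_num:
summary(y_num[idx])
summary(y_num[-idx])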

Calculate correlation coefficient by bootstrapping

I'm looking at the correlation between the day of the year that 5 species of bird started moulting their feathers and the numbers of days it took these 5 species to complete the moulting of their feathers.
I've tried to simulate my data in the code below. For each of the 5 species, I have start day for 10 individuals and the durations for 10 individuals. For each species, I calculated the mean start day and mean duration then calculated the correlation across these 5 species.
What I want to do is bootstrap the mean start date and bootstrap the mean duration for each species. I want to repeat this 10,000 times and calculate the correlation coefficient after each repeat. I then want to extract the 0.025, 0.5 and 0.975 quantiles of the 10,000 correlation coefficients.
I got as far as simulating the raw data, but my code quickly got messy once I tried to bootstrap. Can anyone help me with this?
# speciesXX_start_day is the day of the year on which each of 10 individual birds started moulting its feathers
# speciesXX_duration is the number of days each individual bird took to complete the moulting of its feathers
species1_start_day <- as.integer(rnorm(10, 10, 2))
species1_duration <- as.integer(rnorm(10, 100, 2))
species2_start_day <- as.integer(rnorm(10, 20, 2))
species2_duration <- as.integer(rnorm(10, 101, 2))
species3_start_day <- as.integer(rnorm(10, 30, 2))
species3_duration <- as.integer(rnorm(10, 102, 2))
species4_start_day <- as.integer(rnorm(10, 40, 2))
species4_duration <- as.integer(rnorm(10, 103, 2))
species5_start_day <- as.integer(rnorm(10, 50, 2))
species5_duration <- as.integer(rnorm(10, 104, 2))
start_dates <- list(species1_start_day, species2_start_day, species3_start_day, species4_start_day, species5_start_day)
start_duration <- list(species1_duration, species2_duration, species3_duration, species4_duration, species5_duration)
library(plyr)
# mean start date for each of the 5 species
starts_mean <- laply(start_dates, mean)
# mean duration for each of the 5 species
durations_mean <- laply(start_duration, mean)
# correlation between start date and duration
cor(starts_mean, durations_mean)
R allows you to resample datasets with the sample function. In order to bootstrap, you can just take random samples (with replacement) of your original dataset and then recalculate the statistics for each subsample. You can save the intermediate results in a data structure so that you can process the data afterwards.
A possible example solution for your specific problem is added below. We take 10000 subsamples of size 3 for each of the species, calculate the statistics and then save the results in a list or vector. After the bootstrap we are able to process all the data:
nrSamples <- 10000
listOfMeanStart <- vector("list", nrSamples)    # pre-allocate one slot per bootstrap sample
listOfMeanDuration <- vector("list", nrSamples)
correlations <- vector(mode = "numeric", length = nrSamples)
for (i in seq_len(nrSamples)) {
  sampleStartDate <- sapply(start_dates, sample, size = 3, replace = TRUE)
  sampleDurations <- sapply(start_duration, sample, size = 3, replace = TRUE)
  listOfMeanStart[[i]] <- apply(sampleStartDate, 2, mean)      # mean start day per species
  listOfMeanDuration[[i]] <- apply(sampleDurations, 2, mean)   # mean duration per species
  correlations[i] <- cor(listOfMeanStart[[i]], listOfMeanDuration[[i]])
}
quantile(correlations,c(0.025,.5,0.975))
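For what it's worth, a more conventional bootstrap would resample each species' full set of 10 observations with replacement rather than subsamples of size 3; here is a compact sketch of that variant using replicate() (the correlations_full, boot_starts and boot_durations names are illustrative):
set.seed(1)
correlations_full <- replicate(nrSamples, {
  boot_starts    <- sapply(start_dates,    function(x) mean(sample(x, replace = TRUE)))
  boot_durations <- sapply(start_duration, function(x) mean(sample(x, replace = TRUE)))
  cor(boot_starts, boot_durations)
})
quantile(correlations_full, c(0.025, 0.5, 0.975))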

Applying an lm function to different ranges of data and separate groups using data.table

How do I perform a linear regression using different intervals for data in different groups in a data.table?
I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.
I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.
I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot and day combination using a different interval of counts during each period. Period 1 should regress CO2 during counts 1-5, period 2 using 1-7 and period 3 using 1-9.
library(plyr)       # for dlply()/ldply() below
library(data.table) # for as.data.table() below
CO2 <- rep((runif(10, 350,359)), 300) # 10 days, 10 plots, 3 fences
count <- rep((1:10), 300) # 10 days, 10 plots, 3 fences
DOY <-rep(rep(152:161, each=10),30) # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each=1000) # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each=100),3) # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)
I expect an output which gives me the R2 fit and slope of the line for each plot, fence and DOY.
The data I have provided is a small subsample; my real data has 1e6 rows. The following works, but is slow:
model <- function(df) {
  lm(CO2 ~ count, data = subset(df, ifelse(df$period == 1, count > 1 & count < 5,
                                    ifelse(df$period == 2, count > 1 & count < 7,
                                                           count > 1 & count < 9))))
}
model_flux <- dlply(flux, .(fence, plot, DOY), model)
rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")
Here is a "data.table" way to do this:
library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[, include := (period == 1 & count %in% 2:4) |
       (period == 2 & count %in% 2:6) |
       (period == 3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)
model <- function(df) {
  fit <- lm(CO2 ~ count, data = df)
  return(list(intercept = coef(fit)[1],
              slope = coef(fit)[2],
              rsquare = summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]
Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that dlply(...) returns a data frame and this code produces a data table. It isn't much faster on this test dataset.
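For reference, the intermediate subset and setkey steps aren't strictly required; the same result can be computed in one chained call (a sketch using the include flag defined above, with coefs_flux2 as an illustrative name):
coefs_flux2 <- flux[(include), model(.SD), by = .(fence, plot, DOY)]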
