Generating different percentages of MAR data in R

The following two R functions are from the book "Flexible Imputation of Missing Data" (pages 59 and 63). The first one generates missing completely at random (MCAR) data and the second one generates missing at random (MAR) data. Both functions give approximately 50% missing values.
In the MCAR function, we can generate different percentages of missing data by changing the value of p. But in the MAR function, I don't understand which parameter we should change to generate a different percentage of missing data, such as 10% or 30%.
MCAR
makemissing <- function(data, p = 0.5){
  # p is the probability of a row being observed; rows drawn as 0 get NA on y
  rx <- rbinom(nrow(data), 1, p)
  data[rx == 0, "y"] <- NA
  return(data)
}
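For reference, a quick sketch of how p maps onto the missingness rate in this function: p is the probability of a row being observed, so p = 0.7 gives roughly 30% missing values (the data frame here is just a hypothetical example):
dat <- data.frame(y = rnorm(100), x = rnorm(100))
mis30 <- makemissing(dat, p = 0.7)   # roughly 30% of y set to NA
mean(is.na(mis30$y))                 # should be close to 0.3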
MAR
logistic <- function(x) exp(x)/(1 + exp(x))
library(MASS)   # needed for mvrnorm
set.seed(32881)
n <- 10000
y <- mvrnorm(n = n, mu = c(5, 5), Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))
# the probability of y[,2] being observed depends on y[,1]
p2.marright <- 1 - logistic(-5 + y[,1])
r2.marright <- rbinom(n, 1, p2.marright)
yobs <- y
yobs[r2.marright == 0, 2] <- NA

For the MCAR function, the probability of an observation being missing is the same for every case (50% with p = 0.5), because the missingness does not depend on the data. For the MAR version, the probability of an observation being missing is different for each observation, since it depends on the value of y[,1]. In your code, the probability that y[,2] is observed is saved in the variable p2.marright (so 1 - p2.marright is its probability of being missing). You can perhaps see this more easily by lining up all of the values in a data frame:
df <- data.frame(y1 = y[,1], y2_ori = y[,2], y2_mis = yobs[,2], p2.marright = p2.marright, r2.marright)
head(df)
y1 y2_ori y2_mis p2.marright r2.marright
1 2.086475 3.432803 3.432803 0.9485110 1
2 3.784675 5.005584 5.005584 0.7712399 1
3 4.818409 5.356688 NA 0.5452733 0
4 2.937422 3.898014 3.898014 0.8872124 1
5 6.422158 5.032659 5.032659 0.1943236 1
6 4.115106 5.083162 5.083162 0.7078354 1
You can see that whether or not an observation will be NA on y2 is encoded in r2.marright, which is a probabilistic binary version of p2.marright: for higher values of p2.marright, r2.marright is more likely to be 1. To change the overall rate of missingness, you can change the calculation of p2.marright to bias it higher or lower.
You can manipulate p2.marright by changing the constant in the logistic transformation (-5 in the example). If you increase it (make it less negative, e.g. -4) then p2.marright will decrease, resulting in more missing values on y2. If you decrease it (make it more negative, e.g. -6) then you'll end up with fewer missing values on y2. (The reason -5 is resulting in 50% missingness is because 5 is the mean of the variable being transformed, y1.) This works, but the mechanism is rather opaque, and it might be difficult for you to control it easily. For example, it's not obvious what you should set the constant to be if you want 20% missingness on y2.
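If you want a specific missingness rate rather than trial and error on the constant, one option (not from the book; just a sketch using the objects defined above) is to solve for the intercept numerically so that the average missingness probability hits your target:
# hedged sketch: find the intercept c so that the expected missingness rate equals target
# the missingness probability for case i is logistic(c + y[i,1]), since
# 1 - logistic(c + y[,1]) is the probability of being observed
target <- 0.20
f <- function(c) mean(logistic(c + y[, 1])) - target
c_star <- uniroot(f, interval = c(-20, 10))$root
p2 <- 1 - logistic(c_star + y[, 1])
r2 <- rbinom(n, 1, p2)
yobs2 <- y
yobs2[r2 == 0, 2] <- NA
mean(is.na(yobs2[, 2]))   # should be close to 0.20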

Related

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R and we're studying paired t-tests. I have come across questions where we're given two sets of data and asked to find whether the difference in means is equal to 0, greater than 0, and so on. The function we use for two samples x and y with unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test that the difference in means is greater than or equal to a specified number, against the alternative that it's less than that number, rather than 0?
For example, say we're given two data sets of before and after weights for 10 people. How do we test that the mean difference in weight is greater than or equal to, say, 3 kg against the alternative that the mean difference in weight is less than 3 kg? Is there a way to do this? Would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you're in need of more theoretical proof. Is it ok to add/subtract the 3kg from either x or y and then use the t-test to check for similarity? I think this would tell you at least which outcome is more likely, if that's the end goal. It would be good to get feedback on this
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3, assuming your variables are in the units you describe above; choose alternative="greater" or alternative="less" depending on which direction you want as the alternative (for H0: difference >= 3 vs. HA: difference < 3, it would be alternative="less" on the appropriately ordered pair).
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange [here](https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317).
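As a hedged usage sketch with the simulated data above (where the true average increase is 3), the paired test of H0: mean difference >= 3 against HA: mean difference < 3 would be:
# y_act - x has a true mean of about 3, so this should not reject
t.test(y_act, x, mu = 3, alternative = "less", paired = TRUE)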

Why am I getting an error with plinear algorithm for non-linear regression in R?

I have a 2 element list with X and Y values which I would like to do non-linear regression on with R.
NP delta_f_norm
3.125E-08 1.305366836
6.25E-08 0
0.000000125 3.048361059
0.00000025 2.709158322
0.0000005 2.919379441
0.000001 42.8860945
0.000002 49.75418233
0.000004 50.89313017
0.000008 50.18050031
0.000016 49.67195257
0.000032 48.89396054
0.000064 48.00787709
0.0000006 16.50229042
0.0000007 8.906829316
0.0000008 14.2697833
2.74E-08 -0.913767771
4.11E-08 -0.942489364
6.17E-08 0.586660918
9.24E-08 -0.080955695
1.387E-07 1.672777115
2.081E-07 0.880006555
3.121E-07 13.23952061
4.682E-07 44.73003305
7.023E-07 57.11640257
1.0535E-06 54.09032726
1.5802E-06 58.71029183
2.3704E-06 56.85467325
3.5556E-06 57.83003606
5.3333E-06 53.71761902
0.000008 53.55511726
I import the plain text data, normalize the Y values and change the scale on the x values:
install.packages("tidyverse")
library(tidyverse)
# load in the data points, make sure the working directory is set correctly
# I have already trimmed data manually, so it is just tab separated, x values in the left
# column, y values in the right, with the first line containing the name of the variable
bind_curve <- read_tsv("MST_data.txt")
view(bind_curve)
# normalize curve to max
# as fractional occupancy of binding sites
bind_curve$delta_f_norm <- bind_curve$delta_f_norm/max(bind_curve$delta_f_norm)
#change units to nanomolar
# rescale the concentrations by 1e06 (micromolar if the raw values are molar)
bind_curve$NP <- bind_curve$NP*1e06
# due to the way the plinear algorithm works, y values cannot be zero, so we have to change them to very small values
for (i in 1:nrow(bind_curve))
{
if (bind_curve[i,2] == 0)
{
bind_curve[i,2] <- 1e-10
}
}
# here Ka is the apparent Kd and n is the Hill coefficient; the parameters were
# guesstimated by looking at the data
view(bind_curve)
hill_model <- nls((delta_f_norm ~ 1/(((Ka/NP)^n)+1)), data = bind_curve, start = list(Ka=700, n=2), algorithm = "plinear")
summary(hill_model)
this gives the following error:
Error in chol2inv(object$m$Rmat()) :
element (2, 2) is zero, so the inverse cannot be computed
This makes no sense, as element (2,2) was 0 when it was imported, but I specifically overwrote it with a small non-zero value to allow inversion. Inspection of the data frame before creating the non-linear model even shows the value is not 0, so why is it reporting that it is? Is this an issue where bind_curve exists in 2 different namespaces or something? That's the only possible way I can think that this would happen.
Ok, I forgot to convert the units on my initial Ka guess when I changed the units on the NP data (700 vs. 0.7), so obviously my starting values were very far off, which must have been what caused it to fail. I don't understand what that has to do with 0 values in the data, but whatever, it's fixed.
A mod can delete this post. I'm a moron :p
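For anyone hitting the same error: the (2, 2) in the message refers to a matrix internal to nls (the R factor of the parameter gradient), not to the data, which is why the overwritten zero made no difference. Below is a minimal sketch of the corrected call, assuming the rescaled bind_curve from above; the only substantive change is the start value Ka = 0.7 instead of 700, matching the new concentration units:
hill_model <- nls(delta_f_norm ~ 1/((Ka/NP)^n + 1),
                  data = bind_curve,
                  start = list(Ka = 0.7, n = 2),
                  algorithm = "plinear")
summary(hill_model)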

Minimal depth interaction from randomForestExplainer package

So when using the minimal depth interaction feature of the randomForestExplainer package in R, I'm getting some hard-to-interpret results.
I simulated some data (x1, x2, ..., x5) where x1 is binary and x2-x5 are continuous. In my model, there are no interactions.
I'm using the randomForest package to create a random forest and then running it through the randomForestExplainer package.
Here's the code I'm using to simulate the data and random forest:
library(randomForest)
library(randomForestExplainer)
n <- 100
p <- 4
# Create data:
xrandom <- matrix(rnorm(n*p)+5, nrow=n)
colnames(xrandom)<- paste0("x",2:5)
d <- data.frame(xrandom)
d$x1 <- factor(sample(1:2, n, replace=T))
# Equation:
y <- d$x2 + rnorm(n)/5
y[d$x1==1] <- y[d$x1==1]+5
d$y <- y
# Random Forest:
fr <- randomForest(y ~ ., data=d,localImp=T)
# Random Forest Explainer:
interactions_frame <- min_depth_interactions(fr, names(d)[-6])
head(interactions_frame, 2)
This produces the following:
  variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1       x1            x1       4.670732           0       x1:x1              1.703252
2       x1            x2       2.606190         221       x2:x1              1.703252
So, my question is, if x1:x1 has 0 occurrences (which is expected), then how can it also have a mean_min_depth?
Surely if it has 0 occurrences, then it can't possibly have a minimum depth? [or rather, the min depth = 0 or NA]
What's going on here? Am I misinterpreting something?
Thanks
My understanding is this has to do with the choice of the mean_sample argument of min_depth_interactions. The default choice replaces the NAs with the depth of the maximal subtree whose root is x1. Details below.
What is this argument mean_sample for? It specifies how to deal with trees where the interaction of interest is not present. There are three options:
relevant_trees. This only considers the trees where the interaction of interest is present. In your example, this gives NA for mean_min_depth of interaction x1:x1, which is the behavior you were looking for.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "relevant_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 NA 0 x1:x1 1.947475
2 x1 x2 1.426606 218 x2:x1 1.947475
all_trees. There is a major problem with relevant_trees: for an interaction that only shows up in a small number of trees, taking the mean conditional minimum depth over just those trees ignores the fact that the interaction is not that important. In this case, a small mean conditional minimum depth doesn't mean an interaction is important. To address this, specifying mean_sample = "all_trees" replaces the conditional minimum depth for the interaction of interest by the mean depth of the maximal subtree of the root variable. Basically, if we are looking at the interaction x1:x2, then for a tree where this interaction is absent, it is given the depth of the maximal subtree whose root is x1. This assigns a (hopefully large) numeric value to mean_min_depth of interaction x1:x2, thus making it look less important.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "all_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.97568
2 x1 x2 3.654522 218 x2:x1 1.97568
top_trees. Now this is the default choice for mean_sample. My understanding is it's similar to all_trees, but tries to down-weight the contribution of replacing missing values. The motivation is that all_trees pulls mean_min_depth towards the same value when there are many parameters but not enough observations, i.e. shallow trees. To reduce the contribution of replacing missing values, top_trees only calculates the mean conditional minimal depth on a subset of n trees, where n is the number of trees in which ANY interaction with the specified root is present. Let's say in your example, out of those 500 trees only 300 have any interaction x1:whatever; then we only consider those 300 trees when filling in the value for x1:x1. Because there are 0 occurrences of this interaction, replacing 500 NAs vs. replacing 300 NAs with the same value doesn't affect the mean, so it's the same value, 4.787879. (There's a slight difference between our results; I think it has to do with seed values.)
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.947475
2 x1 x2 2.951051 218 x2:x1 1.947475
This answer is based on my understanding of the package author's thesis: https://rawgit.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer_Master_thesis.pdf
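If it helps to see these interactions at a glance, randomForestExplainer also has a plotting helper; a small hedged sketch using the frame computed above:
plot_min_depth_interactions(interactions_frame)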

How to solve prcomp.default(): cannot rescale a constant/zero column to unit variance

I have a data set of 9 samples (rows) with 51608 variables (columns) and I keep getting the error whenever I try to scale it:
This works fine
pca = prcomp(pca_data)
However,
pca = prcomp(pca_data, scale = T)
gives
> Error in prcomp.default(pca_data, center = T, scale = T) :
cannot rescale a constant/zero column to unit variance
Obviously it's a little hard to post a reproducible example. Any ideas what the deal could be?
Looking for constant columns:
sapply(1:ncol(pca_data), function(x){
length = unique(pca_data[, x]) %>% length
}) %>% table
Output:
.
2 3 4 5 6 7 8 9
3892 4189 2124 1783 1622 2078 5179 30741
So no constant columns. Same with NA's -
is.na(pca_data) %>% sum
>[1] 0
This works fine:
pca_data = scale(pca_data)
But then afterwards both still give the exact same error:
pca = prcomp(pca_data)
pca = prcomp(pca_data, center = F, scale = F)
So why can't I manage to get a scaled PCA on this data? Ok, let's make 100% sure that it's not constant.
pca_data = pca_data + rnorm(nrow(pca_data) * ncol(pca_data))
Same errors. Numeric data?
sapply( 1:nrow(pca_data), function(row){
sapply(1:ncol(pca_data), function(column){
!is.numeric(pca_data[row, column])
})
} ) %>% sum
Still the same errors. I'm out of ideas.
Edit: more details, and at least a hack to solve it.
Later, still having a hard time clustering this data eg:
Error in hclust(d, method = "ward.D") :
NaN dissimilarity value in intermediate results.
Trimming values under a certain cutoff, e.g. < 1, to zero had no effect. What finally worked was trimming all columns that had more than x zeros in the column. It worked when I kept only columns with at most 6 zeros; allowing columns with 7 or more zeros gave errors. No idea if this means that this is a problem in general or if this just happened to catch a problematic column. I'd still be happy to hear if anyone has any ideas why, because this should work just fine as long as no variable is all zeros (or constant in another way). A sketch of the workaround is below.
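For reference, a short sketch of that workaround (the threshold name is just illustrative; it drops columns with more than a chosen number of zeros before calling prcomp):
max_zeros <- 6
keep <- colSums(pca_data == 0) <= max_zeros
pca <- prcomp(pca_data[, keep], scale. = TRUE)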
I don't think you're looking for zero-variance columns correctly. Let's try with some dummy data. First, an acceptable 10x100 matrix:
mat <- matrix(rnorm(1000, 0), nrow = 10)
And one with a zero-variance column. Let's call it oopsmat.
const <- rep(0.1,100)
oopsmat <- cbind(const, mat)
The first few elements of oopsmat look like this:
const
[1,] 0.1 0.75048899 0.5997527 -0.151815650 0.01002536 0.6736613 -0.225324647 -0.64374844 -0.7879052
[2,] 0.1 0.09143491 -0.8732389 -1.844355560 0.23682805 0.4353462 -0.148243210 0.61859245 0.5691021
[3,] 0.1 -0.80649512 1.3929716 -1.438738923 -0.09881381 0.2504555 -0.857300053 -0.98528008 0.9816383
[4,] 0.1 0.49174471 -0.8110623 -0.941413109 -0.70916436 1.3332522 0.003040624 0.29067871 -0.3752594
[5,] 0.1 1.20068447 -0.9811222 0.928731706 -1.97469637 -1.1374734 0.661594937 2.96029102 0.6040814
Let's try scaled and unscaled PCAs on oopsmat:
PCs <- prcomp(oopsmat) #works
PCs <- prcomp(oopsmat, scale. = T) #not forgetting the dot
#Error in prcomp.default(oopsmat, scale. = T) :
#cannot rescale a constant/zero column to unit variance
Because you can't divide by the standard deviation if it's zero. To identify the zero-variance column, we can use which as follows to get the variable name.
which(apply(oopsmat, 2, var)==0)
#const
#1
And to remove zero variance columns from the dataset, you can use the same apply expression, setting variance not equal to zero.
oopsmat[ , which(apply(oopsmat, 2, var) != 0)]
Hope that helps make things clearer!
In addition to Joe's answer, check that the classes of the columns in your data frame are plain numerics.
If a column is stored as integer64 (from the bit64 package, for example), base functions such as var() don't handle it properly and can report a variance of 0, causing the scaling to fail.
So if,
class(my_df$some_column)
is an integer64, for example, then do the following
my_df$some_column <- as.numeric(my_df$some_column)
Hope this helps someone.
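If there are many such columns, a hedged sketch that converts every integer64 column of the (hypothetical) my_df at once:
int64_cols <- sapply(my_df, function(col) inherits(col, "integer64"))
my_df[int64_cols] <- lapply(my_df[int64_cols], as.numeric)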
The error occurs because one of the columns has constant values.
Calculate the standard deviation of all the numeric columns to find the zero-variance variables.
If the standard deviation is zero, you can remove the variable and compute the PCA, as in the sketch below.
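A minimal sketch of that check, assuming the data are in pca_data with all-numeric columns:
sds <- apply(pca_data, 2, sd)
which(sds == 0)                          # indices/names of the zero-variance columns
pca <- prcomp(pca_data[, sds != 0], scale. = TRUE)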

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples for a given geometric distribution with PMF
P(X = n) = (1 - p)^(n - 1) * p, for n = 1, 2, ..., with p = 0.3.
Using the inverse transform method, I came up with the following expression for generating the values:
X = ceiling(log(U) / log(1 - p))
where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it starts from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly; you have to do it the first way. You estimate the parameter from the data (by maximum likelihood or minimum chi-square) and then test as above, but with one fewer degree of freedom for the estimated parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
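For concreteness, a minimal sketch of the estimated-p version, reusing the 0-based sample y and the same 15+ binning as above (the ML estimate of p for the 0-based geometric is 1/(1 + mean(y)); one extra degree of freedom is lost for the estimated parameter):
p_hat <- 1/(1 + mean(y))                          # ML estimate of p
expec_hat <- dgeom(0:14, p_hat) * 1000
expec_hat <- c(expec_hat, 1000 - sum(expec_hat))  # lump 15+ into the last cell
obs <- table(factor(y, levels = 0:14), exclude = NULL)
chisqstat <- sum((obs - expec_hat)^2 / expec_hat)
pchisq(chisqstat, df = 16 - 1 - 1, lower.tail = FALSE)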
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
library(vcd)
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
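Printing the summary of the fit should then give the goodness-of-fit test, and plot() draws a rootogram (a hedged usage sketch, assuming the vector x from above):
summary(G.fit)
plot(G.fit)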
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of questions with upvoted answers, so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which i want to test its distribution, specifically if it indeed follows a geometric distribution. I want to generate a QQ PLot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500), prob=0.3)), jitter(sim.res),
       main = expression("Q-Q plot for" ~~ {G}[n == 500]),
       ylim = c(0, max(qgeom(ppoints(500), prob=0.3), sim.res)),
       xlim = c(0, max(qgeom(ppoints(500), prob=0.3), sim.res)))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")
