I am looking to simulate a data set with pre-determined correlations between the variables. The code below is where I am so far, but I want to be able to control the parameters of each variable individually.
In short, how do I change the SD, mean, min/max, intervals, skew and kurtosis for each variable individually?
library(tidyverse)
library(faux)
cmat <- c(1, .195, .346, .674, .561,
.195, 1, .479, .721, .631,
.346, .479, 1, .154, .121,
.674, .721, .154, 1, .241,
.561, .631, .121, .241, 1)
nps_sales <- round(rnorm_multi(100, 5, 3, .5, cmat,
varnames = c("NPS",
"change in NPS",
"sales (t0)",
"sales (t1)",
"sales (t2)")), 0) %>%
tibble()
You have specified rnorm_multi(n = 100, vars = 5, mu = 3, sd = .5, r = cmat). rnorm_multi will accept vectors of the appropriate length for mu and sd (e.g. mu = c(3,3,3,2,2) and sd = c(1,0.5,0.5,1,2)), which will set the means and standard deviations accordingly.
Adjusting the other characteristics (min/max, skew, kurtosis, etc.) will be much more challenging, and may require a question on CrossValidated; the reason everyone uses the multivariate normal is that it's easy to specify means, SDs, and correlations, but you can't control the other aspects of the distributions easily. You can transform the results to achieve some level of skew/kurtosis, but this may not give you as much flexibility and control as you want (see e.g. here).
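As a rough sketch of both points, here is a base R version that does not depend on faux: it draws correlated normals with per-variable means and SDs, then pushes one column through a quantile transform to make it skewed. The 3x3 matrix R_target, the mu and s values and the gamma marginal are made-up illustration values, not taken from the question, and the transform shifts the Pearson correlations slightly.

```r
set.seed(1)
n <- 1e5

# Hypothetical target correlation matrix (must be positive definite)
R_target <- matrix(c(1.0, 0.3, 0.5,
                     0.3, 1.0, 0.2,
                     0.5, 0.2, 1.0), nrow = 3, byrow = TRUE)
mu <- c(3, 3, 2)     # one mean per variable
s  <- c(1, 0.5, 2)   # one SD per variable

# Correlated standard normals: rows of Z have correlation R_target
Z <- matrix(rnorm(n * 3), ncol = 3) %*% chol(R_target)

# Rescale each column to its own SD and mean
X <- sweep(sweep(Z, 2, s, `*`), 2, mu, `+`)

# Make the third variable right-skewed: map its normal quantiles
# onto a gamma distribution (shape = 2 is an arbitrary choice)
skewed <- qgamma(pnorm(Z[, 3]), shape = 2)

round(colMeans(X), 2)                    # close to mu
round(apply(X, 2, sd), 2)                # close to s
round(cor(X), 2)                         # close to R_target
round(cor(cbind(X[, 1:2], skewed)), 2)   # near, but not exactly, R_target
```

The last line is the part to watch: any nonlinear transform of a margin changes the Pearson correlation somewhat, so the starting correlations may need adjusting if the final ones have to hit exact targets.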
I'm working on a task where we look at 12 independent and identically distributed random variables, each of which has a standard normal distribution.
From that I understand that each has a mean of 0 and an sd of 1.
We then have an interval of (-1.644, 1.644)
To find the probability of a single random variable landing in this interval I write:
(pnorm(1.644, mean = 0, sd = 1, lower.tail=TRUE) - pnorm(-1.644, mean = 0, sd = 1, lower.tail=TRUE))
Which returns the Probability of 0.8998238
I'm able to find the probability of at least one of the 12 random variables landing outside of the interval (-1.644, 1.644) with the following:
PROB_1 = 1-(0.8998238^12)
#PROB_1 = 0.7182333
However, how would I find the probability of exactly 2 random variables landing outside of the interval? I've attempted the following:
((12*11)/2)*((1-0.7182333)^2)*(0.7182333^10)
I'm sure I'm missing something here, and there is a much easier way to solve this.
Any help is much appreciated.
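You need the binomial coefficient, and the per-variable probability of landing outside the interval is 1-prob (about 0.1002), not PROB_1. The dbinom call and the explicit formula below give the same result: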
prob=pnorm(1.644, mean = 0, sd = 1, lower.tail=TRUE)-pnorm(-1.644, mean = 0, sd = 1, lower.tail=TRUE)
dbinom(2, 12, 1-prob)
prob^10 * (1-prob)^2 * choose(12, 2)
0.2304877
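As a sanity check, the closed form and dbinom can be compared against a quick simulation. This is only a sketch; the seed and the number of replicates are arbitrary choices.

```r
p_in  <- pnorm(1.644) - pnorm(-1.644)  # one variable inside the interval
p_out <- 1 - p_in                      # one variable outside the interval

by_formula <- choose(12, 2) * p_out^2 * p_in^10
by_dbinom  <- dbinom(2, size = 12, prob = p_out)

# Simulate 12 standard normals many times and count how many fall outside
set.seed(42)
draws     <- matrix(rnorm(12 * 1e5), ncol = 12)
n_outside <- rowSums(abs(draws) > 1.644)
simulated <- mean(n_outside == 2)

c(formula = by_formula, dbinom = by_dbinom, simulated = simulated)
```

The first two values agree exactly, and the simulated share of rows with exactly two values outside the interval should land close to them.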
I have some data that looks like this:
x y
1: 3 1
2: 6 1
3: 1 0
4: 31 8
5: 1 0
---
Edit: if it helps, here are sample vectors for x and y:
x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
The column on the left (x) is my sample size, and the column on the right (y) is the number of successes that occur in each sample.
I would like to fit these data using a binomial distribution in order to find the probability of a success (p). All examples for fitting a binomial distribution that I've found so far assume a constant sample size (n) across all data points, but here I have varying sample sizes.
How do I fit data like these, with varying sample sizes, to a binomial distribution? The desired outcome is p, the probability of observing a success in a sample size of 1.
How do I accomplish a fit like this using R?
(Edit #2: Response below outlines solution and related R code if I assume that the events observed in each sample can be assumed to be independent, in addition to assuming that the samples themselves are also independent. This works for my data - thanks!)
What about calculating the empirical probability of success?
x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
avr.sample <- mean(x)
avr.success <- mean(y)
p <- avr.success/avr.sample
[1] 0.1151515
Or using binom.test
z <- x-y # number of fails
binom.test(x = c(sum(y), sum(z)))
Exact binomial test
data: c(sum(y), sum(z))
number of successes = 19, number of trials = 165, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.07077061 0.17397215
sample estimates:
probability of success
0.1151515
However, this assumes that:
The events corresponding to the rows are independent from each other
The events in the same row are independent from each other as well
This means that in every iteration k of the experiment (i.e. row of x) we perform an action such as throwing x[k] identical dice (not necessarily fair dice), and a success would mean getting a given (predetermined) number n in 1:6.
If we suppose that the above results were obtained when trying to get a 1 when throwing x[k] dice in every iteration k, then one could say that the empirical probability of getting a 1 is (~) 0.1151515.
In the end, the total number of successes would follow B(sum(x), p).
PS: In the above illustration, the dice are identical to each other not only within any given iteration but across all iterations.
library(bbmle)
x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
mf = function(prob, x, size){
-sum(dbinom(x, size, prob, log=TRUE))
}
m1 = mle2(mf, start=list(prob=0.01), data=list(x=y, size=x))
print(m1)
Coefficients:
prob
0.1151535
Log-likelihood: -13.47
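Another option, assuming the same independence conditions, is an intercept-only logistic regression in base R, where each row contributes its own number of trials. This is a sketch rather than a recommendation over mle2; the names fit and p_hat are just for illustration.

```r
x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)  # trials per sample
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)      # successes per sample

# Response is a two-column matrix: successes, failures
fit <- glm(cbind(y, x - y) ~ 1, family = binomial)

# The intercept is on the logit scale; plogis() converts it back to p
p_hat <- plogis(coef(fit)[[1]])
p_hat
sum(y) / sum(x)  # pooled proportion, the closed-form MLE

# Wald interval on the logit scale, back-transformed to a probability
est <- coef(summary(fit))[1, "Estimate"]
se  <- coef(summary(fit))[1, "Std. Error"]
plogis(est + c(-1, 1) * qnorm(0.975) * se)
```

The fitted p_hat matches the pooled proportion sum(y)/sum(x), which is what the likelihood-based fit above converges to as well.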
I have some data on study participants, the percent change in a biomarker, and their ultimate outcome. I'd like to use an ROC curve to find the best cutoff value for the biomarker for predicting the outcome, using the Youden method but get different answers from different packages and need to know where I'm going wrong.
To set up the dataset:
ID <- c(1:17)
PercentChange <- c(-85.5927051671732, -85.4849965108165,
-63.302752293578, -33.5509138381201, -75, -87.0988867059594,
-93.2523616734143, 65.2037617554859, -19.226393629124,
-44.7095435684647, -65.7342657342657, -43.7831467295227,
-37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
-72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)
df <- data.frame(ID, PercentChange, Outcome)
For outcome, 1 is favorable and 0 is unfavorable.
Now with package pROC I did:
library(pROC)
roc <- roc(df$Outcome, df$PercentChange, auc= TRUE, plot = TRUE)
coords(roc, "b", ret="t", best.method="youden")
plot(roc, print.thres="best", print.thres.best.method="youden",main = "Percent change")
This gives me a reasonable curve and a cutoff (by the Youden index) of -44.246 that I verified has the correct sensitivity and specificity listed. The cutpoint seems a bit weird since it's halfway between two of the actual values and not an actual value, but it works.
Then using OptimalCutpoints I tried
library(OptimalCutpoints)
optimal.cutpoint.Youden <- optimal.cutpoints(X = "PercentChange", status = "Outcome", tag.healthy = 1, methods = "Youden", data = df)
summary(optimal.cutpoint.Youden)
plot(optimal.cutpoint.Youden)
This gives me a different curve, and a different cutpoint of -43.783, which is one of the two points that pROC took the midpoint of. The sensitivity and specificity are also flipped from what I calculated using that cutpoint.
Lastly I tried the ROC function from the Epi package
library(Epi)
ROC(form=Outcome~PercentChange, data=df, plot="ROC", PV=TRUE, MX=TRUE)
Which gave me a third completely different curve and says "24" at the cutpoint, which doesn't make any sense. Can anyone help me figure this out? I'm not asking on the stats stackexchange because they'd want to get into whether Youden is appropriate or not rather than the technical application of these functions.
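To see where a midpoint cutoff like -44.246 can come from, the Youden index can be computed by hand in base R. This is a minimal sketch under two assumptions of mine: a participant is predicted favorable (1) when PercentChange is below the threshold, and candidate thresholds are the midpoints between consecutive sorted values. It does not claim to reproduce the internals of pROC, OptimalCutpoints or Epi.

```r
PercentChange <- c(-85.5927051671732, -85.4849965108165,
  -63.302752293578, -33.5509138381201, -75, -87.0988867059594,
  -93.2523616734143, 65.2037617554859, -19.226393629124,
  -44.7095435684647, -65.7342657342657, -43.7831467295227,
  -37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
  -72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)

# Candidate thresholds: midpoints between consecutive sorted values
v <- sort(unique(PercentChange))
thresholds <- (head(v, -1) + tail(v, -1)) / 2

# Youden index J = sensitivity + specificity - 1 at each threshold,
# assuming "below threshold" means predicted favorable (Outcome == 1)
youden <- sapply(thresholds, function(th) {
  pred_pos <- PercentChange < th
  sens <- mean(pred_pos[Outcome == 1])
  spec <- mean(!pred_pos[Outcome == 0])
  sens + spec - 1
})

best <- which.max(youden)
thresholds[best]  # midpoint between -44.7095 and -43.7831
youden[best]
```

Under these assumptions the maximum lands on the same midpoint that pROC reported. A function that only evaluates observed values as cutoffs, uses >= instead of <, or treats 0 rather than 1 as the positive class could report a neighbouring value or swapped sensitivity and specificity, so the direction and positive-class settings of each package are worth checking.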
Suppose, I have the following ordered data sets:
X <- c(12, 15, 23, 4, 9, 36, 10, 16, 67, 45, 58, 32, 40, 58, 33)
# and
Y <- c(1.5, 3.3, 10, 2.1, 8.3, 6.3, 4, 5.1, 1.4, 1.6, 1.8, 3.1, 2.2, 4, 3)
What is meant by "the correlation of their ordered pairs after standardization"?
How do I find (code for) it in R?
In order to standardize the given sets X and Y, we will first
calculate the mean, variance and standard deviation of each set,
treating it as a population.
In the next step, we subtract the mean of each set from each
individual value in that set, and in the final step we divide the
values obtained in the 2nd step by the set's standard deviation. The
result is nothing but the Z-score of each individual value (say Xi).
Doing so, we will get a mean of 0 and a standard deviation of 1 for both
X and Y sets.
This is the standardized condition, because we will always get a mean of zero and a standard deviation of one for all of the sets (in your case X and Y).
We will also look into the relationships between ordered pairs.
If we look at certain standard relationships, such as the covariance,
the correlation, the slope of the best-fit line of Y against X,
and the Y intercept, will they be the same for the original
values and the standardized values, or will they be different?
And if they are different, how different will they be and why?
This was the context of the question.
What I tried in R is as follows:
Your Data Set is:
X <- c(12, 15, 23, 4, 9, 36, 10, 16, 67, 45, 58, 32, 40, 58, 33)
# and
Y <- c(1.5, 3.3, 10, 2.1, 8.3, 6.3, 4, 5.1, 1.4, 1.6, 1.8, 3.1, 2.2, 4, 3)
Statistics for Original Data, where n = 15 observations for X and Y each
# Variance
VarX <- sum((X - mean(X))^2)/15 ## Which gives us Variance of X set as 374.5156
VarY <- sum((Y - mean(Y))^2)/15 ## Which gives us Variance of Y set as 6.226489
# Standard Deviation
sdX <- sqrt(VarX) ## Which gives us Std. Dev. of X set as 19.3524
sdY <- sqrt(VarY) ## Which gives us Std. Dev. of Y set as 2.495293
# Z-scores
Z_Score_X <- (X - mean(X))/sdX
Z_Score_Y <- (Y - mean(Y))/sdY
# A Check, mean of ZScores should be close or equal to 0
# and Std. Dev. must be close or equal to 1
round(mean(Z_Score_X), 0) # Yes, it is 0
round(sd(Z_Score_X), 0) # Yes, it is 1
round(mean(Z_Score_Y), 0) # Yes, it is 0
round(sd(Z_Score_Y), 0) # Yes, it is 1
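One detail the rounded checks above hide: sd() in R divides by n - 1, while the Z-scores were built with the population divisor n. A small sketch of the difference, using only X (z_pop and z_samp are illustrative names):

```r
X <- c(12, 15, 23, 4, 9, 36, 10, 16, 67, 45, 58, 32, 40, 58, 33)
n <- length(X)

# Z-scores with the population standard deviation (divisor n)
pop_sd <- sqrt(mean((X - mean(X))^2))
z_pop  <- (X - mean(X)) / pop_sd
sd(z_pop)        # equals sqrt(n / (n - 1)), about 1.035 for n = 15

# Z-scores with the sample standard deviation (divisor n - 1),
# which is what scale() uses by default
z_samp <- as.numeric(scale(X))
sd(z_samp)       # exactly 1, up to floating-point error
```

Either convention works for the correlation, as long as the same divisor is used for the covariance and for both standard deviations.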
This is the standardized condition, where we have the same mean
and standard deviation for both X and Y (as in the Z-score data sets above).
Now we will look into the relationships between ordered pairs.
If we look at certain standard relationships, such as the covariance,
the correlation, the slope of the best-fit line of Y against X,
and the Y intercept, will they be the same for the original
values and the standardized values, or will they be different?
And if they are different, how different will they be and why?
Let's calculate the rest...
First we look at the covariance of X and Y...
Covariance (X, Y) = (1/n) * summation(i = 1 to n) of products
of (Xi - mean(X)) and (Yi - mean(Y))
where Xi and Yi together form an ordered pair (remember step 3 above,
the Z-scores)
# Covariance for older sets (X, Y)
covXY <- (1/15) * sum((X - mean(X))*(Y - mean(Y)))
# Covariance for New sets (Z_Score_X, Z_Score_Y)
covXYZ <- (1/15) * sum((Z_Score_X - mean(Z_Score_X))*(Z_Score_Y - mean(Z_Score_Y)))
Next we will look at the slope (Beta) of the best-fit line of Y against X
Recall, Beta = slope = delta_Y / delta_X
# Slope for old set (X, Y)
Beta_X_Y <- round(lm(Y ~ X)$coeff[[2]], 2)
# Slope for standardized values in new set (Z_Score_X, Z_Score_Y)
Beta_ZScoreXY <- round(lm(Z_Score_Y ~ Z_Score_X)$coeff[[2]], 2)
Please note that the intercept for the standardized values will always be ZERO.
The reason is that the best fit line always passes through the point of
means, and for standardized values the means are zero (as in our case of
Z_Score_X, Z_Score_Y, the means are 0, 0).
In other words, the best fit line for standardized data must go through the origin.
In practice the computed intercept may differ from zero by a tiny floating-point error, which is why it is rounded below.
# Intercept for old set
Intercept_X_Y <- round(lm(Y ~ X)$coeff[[1]], 2)
# 5.17
# Intercept for standardized set, should be zero
Intercept_ZScore_X_Y <- round(lm(Z_Score_Y ~ Z_Score_X)$coeff[[1]], 2)
# Yes, it is 0
Finally, we will look at the correlation, which is equal to the
covariance of X and Y divided by the standard deviation of X times the standard deviation of Y
# Correlation of old set
CorrelationXY <- round(covXY / (sdX * sdY), 2)
# Variance for new set
VarZScoreX <- sum((Z_Score_X - mean(Z_Score_X))^2)/15
VarZScoreY <- sum((Z_Score_Y - mean(Z_Score_Y))^2)/15
sdZScoreX <- sqrt(VarZScoreX)
sdZScoreY <- sqrt(VarZScoreY)
# Correlation of new set
correlation_ZScore_X_Y <- round(covXYZ / (sdZScoreX * sdZScoreY), 2)
Therefore, what we see here is that the one thing that remains constant
between the old set of data and the new set of standardized (z score) data is the
correlation (in our case it is -0.34). The correlation is UNCHANGED.
One other point to note: for every standardized set, the slope and the
covariance are EQUAL to the correlation (all -0.34 in our case), and the intercept
of the standardized set is equal to zero.
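The same conclusions can be cross-checked with built-in functions. A short sketch, assuming scale() for the standardization (it uses the n - 1 standard deviation, which does not affect the correlation); zx, zy and fit are illustrative names:

```r
X <- c(12, 15, 23, 4, 9, 36, 10, 16, 67, 45, 58, 32, 40, 58, 33)
Y <- c(1.5, 3.3, 10, 2.1, 8.3, 6.3, 4, 5.1, 1.4, 1.6, 1.8, 3.1, 2.2, 4, 3)

# Standardize both sets
zx <- as.numeric(scale(X))
zy <- as.numeric(scale(Y))

cor(X, Y)      # correlation of the original ordered pairs
cor(zx, zy)    # identical after standardization
cov(zx, zy)    # covariance of the Z-scores equals the correlation

# Regression on the standardized values:
# intercept is zero up to rounding error, slope equals the correlation
fit <- lm(zy ~ zx)
coef(fit)
```

Here cor(X, Y) is the direct answer to "the correlation of their ordered pairs after standardization", since standardizing leaves it unchanged.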