Creating data with pre-determined correlations in R - r

I am looking to simulate a data set with pre-determined correlations between the variables. The code below is where I am at, but I want to be able to control the parameters of the features individually.
In short, how do I change the SD, mean, min/max, intervals, skew and kurtosis for each variable individually?
library(tidyverse)
library(faux)
cmat <- c(1, .195, .346, .674, .561,
          .195, 1, .479, .721, .631,
          .346, .479, 1, .154, .121,
          .674, .721, .154, 1, .241,
          .561, .631, .121, .241, 1)
nps_sales <- round(rnorm_multi(100, 5, 3, .5, cmat,
                               varnames = c("NPS",
                                            "change in NPS",
                                            "sales (t0)",
                                            "sales (t1)",
                                            "sales (t2)")), 0) %>%
  tibble()

You have specified rnorm_multi(n = 100, vars = 5, mu = 3, sd = .5, r = cmat). rnorm_multi will accept vectors of the appropriate length for mu and sd (e.g. mu = c(3, 3, 3, 2, 2) and sd = c(1, 0.5, 0.5, 1, 2)), which will set the means and standard deviations accordingly.
Adjusting the other characteristics (min/max, skew, kurtosis, etc.) will be much more challenging, and may be better suited to a question on CrossValidated; the reason everyone uses the multivariate normal is that it is easy to specify means, SDs, and correlations, but you cannot easily control the other aspects of the distributions. You can transform the results to achieve some degree of skew/kurtosis, but this may not give you as much flexibility and control as you want (see e.g. here).
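A minimal sketch building on this (the specific mu and sd values below are made up for illustration): pass one mean and one SD per variable, and apply any hard minimum/maximum afterwards. Note that post-hoc clamping or rounding will pull the realised means, SDs and correlations away from the requested values.
library(tidyverse)
library(faux)
cmat <- c(1, .195, .346, .674, .561,
          .195, 1, .479, .721, .631,
          .346, .479, 1, .154, .121,
          .674, .721, .154, 1, .241,
          .561, .631, .121, .241, 1)
nps_sales <- rnorm_multi(
  n = 100, vars = 5,
  mu = c(50, 3, 200, 210, 220),   # one mean per variable (illustrative values)
  sd = c(10, 0.5, 40, 40, 40),    # one SD per variable (illustrative values)
  r = cmat,
  varnames = c("NPS", "change in NPS", "sales (t0)", "sales (t1)", "sales (t2)")
) %>%
  as_tibble() %>%
  mutate(across(starts_with("sales"), ~ pmax(round(.x), 0)))  # enforce integers and a floor of 0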

Related

Is pnorm(q = (x - $\mu$)/$\sigma$) ever different from pnorm(q = x, mean = $\mu$, sd = $\sigma$)?

It is the case that the probability density for a standardized and unstandardized random variable will differ. E.g., in R
dnorm(x = 0, mean = 1, sd = 2)
dnorm(x = (0 - 1)/2)
However,
pnorm(q = 0, mean = 1, sd = 2)
pnorm(q = (0 - 1)/2)
yields the same value.
Are there any situations in which the Normal cumulative distribution function will yield a different probability for the same random variable when it is standardized versus unstandardized? If yes, is there a particular example in which this difference arises? If not, is there a general proof of this property?
Thanks so much for any help and/or insight!
This isn't really a coding question, but I'll answer it anyway.
Short answer: yes, they may differ.
Long answer:
A normal distribution is usually thought of as y = f(x), that is, a curve over the domain of x. When you standardize, you are converting from units of x to units of z. For example, if x ~ N(15, 5^2), then a value of 10 is 5 x-units less than the mean. Notice that this is also 1 standard deviation less than the mean. When you standardize, you convert x to z ~ N(0, 1^2). Now, that example value of 10, when standardized into z-units, becomes a value of -1 (i.e., it's still one standard deviation less than the mean).
As a result, the area under the curve to the left of x=10 is the same as the area under the curve to the left of z=-1. In words, the cumulative probability up to those cut-offs is the same.
However, the heights of the curves are different. Let the normal distribution curves be f(x) and g(z). Then f(10) != g(-1). In code:
dnorm(10, 15, 5) != dnorm(-1, 0, 1)
The reason is that the act of standardizing either "spreads" or "squishes" the f(x) curve to make it "fit" over the new z domain as g(z).
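A quick numeric check of this relationship (my own illustration, using the x ~ N(15, 5^2) example above): the standardized density has to be divided by the SD to match the original curve, while the cumulative probabilities agree with no adjustment.
x <- 10; m <- 15; s <- 5
dnorm(x, m, s)                                # 0.04839414
dnorm((x - m) / s)                            # 0.2419707 -- differs by a factor of s
dnorm((x - m) / s) / s                        # 0.04839414 -- matches after rescaling by 1/s
all.equal(pnorm(x, m, s), pnorm((x - m) / s)) # TRUE: the areas to the left are identical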
Here are two links that let you visualize the spreading/squishing:
https://academo.org/demos/gaussian-distribution/
https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php
Hope this helps!

Confused by ROC curves and cutoffs in R

I have some data on study participants, the percent change in a biomarker, and their ultimate outcome. I'd like to use an ROC curve to find the best cutoff value for the biomarker for predicting the outcome, using the Youden method, but I get different answers from different packages and need to know where I'm going wrong.
To set up the dataset:
ID <- c(1:17)
PercentChange <- c(-85.5927051671732, -85.4849965108165,
                   -63.302752293578, -33.5509138381201, -75, -87.0988867059594,
                   -93.2523616734143, 65.2037617554859, -19.226393629124,
                   -44.7095435684647, -65.7342657342657, -43.7831467295227,
                   -37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
                   -72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)
df <- data.frame(ID, PercentChange, Outcome)
For outcome, 1 is favorable and 0 is unfavorable.
Now with package pROC I did:
library(pROC)
roc <- roc(df$Outcome, df$PercentChange, auc= TRUE, plot = TRUE)
coords(roc, "b", ret="t", best.method="youden")
plot(roc, print.thres="best", print.thres.best.method="youden",main = "Percent change")
This gives me a reasonable curve and a cutoff (by the Youden index) of -44.246, which I verified has the correct sensitivity and specificity. The cutpoint seems a bit weird since it's halfway between two of the actual values rather than an actual value, but it works.
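For what it's worth (my reading of pROC's behaviour, not something stated in the thread): pROC evaluates candidate thresholds at the midpoints between consecutive observed values, which is exactly where -44.246 comes from; it is the mean of the two adjacent PercentChange values in the data above.
mean(c(-44.7095435684647, -43.7831467295227))
# [1] -44.24635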
Then using OptimalCutpoints I tried
library(OptimalCutpoints)
optimal.cutpoint.Youden <- optimal.cutpoints(X = "PercentChange", status = "Outcome", tag.healthy = 1, methods = "Youden", data = df)
summary(optimal.cutpoint.Youden)
plot(optimal.cutpoint.Youden)
This gives me a different curve, and a different cutpoint of -43.783, which is one of the two points that pROC took the midpoint of. The sensitivity and specificity are also flipped from what I calculated using that cutpoint.
Lastly I tried the roc function from the Epi package
library(Epi)
ROC(form=Outcome~PercentChange, data=df, plot="ROC", PV=TRUE, MX=TRUE)
This gave me a third, completely different curve and says "24" at the cutpoint, which doesn't make any sense. Can anyone help me figure this out? I'm not asking on the stats Stack Exchange because they'd want to get into whether Youden is appropriate or not, rather than the technical application of these functions.

Fitting truncated normal distribution in R

I'm trying to fit a truncated normal distribution to data using fitdistr (from MASS), specifying upper and lower bounds. However, when comparing the MLE-fitted parameters to those of an MLE fit without bounds (via fitdistrplus::fitdist), they seem to be the same.
library(fitdistrplus)
library(MASS)
dt <- rnorm(100, 1, 0.5)
cat("truncated:", fitdistr(dt, "normal", lower = 0, upper = 1.5, method = "mle")$estimate,
"original:", fitdist(dt, "norm", method = "mle")$estimate, sep = "\n")
truncated:
1.034495
0.4112629
original:
1.034495
0.4112629
I'm not a statistics genius, but I'm pretty sure the parameters should be different: truncating the distribution changes both the mean and the SD, because the density is rescaled over the truncated range. Is this right?
Thanks for your advice
Cheers,
Simon
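One possible explanation (not from the original thread): the lower and upper arguments here act as box constraints on the parameters during optimisation (and for "normal", fitdistr uses the closed-form estimates anyway, so the bounds are effectively ignored); they do not truncate the distribution, which is why both calls agree. A minimal sketch of fitting a genuinely truncated normal, assuming the truncnorm package for dtruncnorm()/ptruncnorm():
library(fitdistrplus)
library(truncnorm)
set.seed(1)
x <- rnorm(1000, 1, 0.5)
x <- x[x > 0 & x < 1.5]  # keep only observations inside the truncation bounds
fit_trunc <- fitdist(x, "truncnorm",
                     fix.arg = list(a = 0, b = 1.5),
                     start = list(mean = mean(x), sd = sd(x)))
summary(fit_trunc)  # mean/sd estimates now differ from an untruncated normal fit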

SEM-Path analysis in R

I ran a path analysis in R, and the following matrix represents the effects between variables.
M <- c(0, 0, 0, 0, 0)
p<-c(0, 0, 0, 0, 0)
O <- c(0, 0, 0, 0, 0)
T <- c(1, 0, 1, 1, 0)
Sales <- c(1, 1, 1, 1, 0)
sales_path <- rbind(M, p, O, T, Sales)
colnames(sales_path) <- rownames(sales_path)
#innerplot(sales_pls)
sales_blocks <- list(
c("m1", "m2"),
#c("pr"),
c("R1"),
#c("C1"),
c("tt1"),
c("Sales")
)
sales_modes = rep("A", 5)
sales_pls <- plspm(input_file, sales_path, sales_blocks, scheme = "centroid", scaled = FALSE, modes = sales_modes)
I have 2 questions:
Can I use the weights I receive to calculate the value of a latent variable? E.g. my M variable has two manifest variables; is there a formula to calculate its value?
The main purpose of running the path analysis is to predict sales. Is that possible using the estimated path coefficients (betas) for each latent variable?
I want to know if I am able to calculate the value of the latent variable
Yes. It's actually very easy. Your latent variable scores can be obtained by running sales_pls$scores. You can also run summary(sales_pls). For easier interpretation you may wish to have the latent variables expressed in the same scale as the indicators (manifest variables). This can be accomplished by normalizing the outer weights in each block of indicators so that the weights are expressed as proportions. However, in order to apply this normalization all the outer weights must be positive.
You can apply the function rescale() to the object sales_pls and get scores expressed in the original scale of the manifest variables. Once you get the rescaled scores you can use summary() to verify that the obtained results make sense (i.e. scores expressed in original scale of indicators):
# rescaling scores
rescaled_sales_pls = rescale(sales_pls)
# summary
summary(rescaled_sales_pls)
(again you can also run rescaled_sales_pls)
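As a rough illustration of the normalization described above (a sketch of the idea, not plspm's actual internals; it assumes all outer weights are positive and that input_file holds the raw indicators):
w <- sales_pls$outer_model   # data frame with indicator name, block and outer weight
manual_scores <- sapply(split(w, w$block), function(blk) {
  wts <- blk$weight / sum(blk$weight)  # weights as proportions within the block
  drop(as.matrix(input_file[, as.character(blk$name)]) %*% wts)  # weighted average of raw indicators
})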
if I can predict Sales using the betas from the output
Theoretically I guess you could, but I'm not really sure why you would want to. The utility of path analysis here is to decompose the sources of a correlation between an independent variable and a dependent variable of a multiple regression model.
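If you did want to try it, a rough sketch (building on the sales_pls fit above; this just reproduces the structural equation for Sales from the latent variable scores, it is not a dedicated plspm prediction routine):
betas <- sales_pls$path_coefs["Sales", ]          # path coefficients pointing into Sales
lv_scores <- sales_pls$scores                     # latent variable scores
sales_hat <- lv_scores[, names(betas)] %*% betas  # implied Sales scores (no intercept term)
cor(sales_hat, lv_scores[, "Sales"])              # how well the inner model reproduces Sales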

R: how does caret choose default tuning range?

When using R caret to compare multiple models on the same data set, caret is smart enough to select different tuning ranges for different models if the same tuneLength is specified for all models and no model-specific tuneGrid is specified.
For example, the tuning ranges chosen by caret for one particular data set are:
earth(nprune): 2, 5, 8, 11, 14
gamSpline(df): 1, 1.5, 2, 2.5, 3
rpart(cp): 0.010, 0.054, 0.116, 0.123, 0.358
Does anyone know how caret determines these default tuning ranges? I have been searching through the documentation but still haven't pinned down the algorithm to choose the ranges.
It depends on the model. For rpart and a few others, it fits an initial model to get a sense of what reasonable values should be. In other cases, it is less intelligent. For example, for gamSpline it is expand.grid(df = seq(1, 3, length = len)).
You can see what it does per model using getModelInfo:
> getModelInfo("earth")[[1]]$grid
function(x, y, len = NULL) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  mod <- earth(.outcome ~ ., data = dat, pmethod = "none")
  maxTerms <- nrow(mod$dirs)
  maxTerms <- min(200, floor(maxTerms * .75) + 2)
  data.frame(nprune = unique(floor(seq(2, to = maxTerms, length = len))),
             degree = 1)
}
Max
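As an aside (not part of the answer above): if the default range isn't what you want, you can bypass it entirely by supplying an explicit tuneGrid whose column names match the model's tuning parameters. A minimal sketch for method = "earth" (assumes the earth package is installed):
library(caret)
my_grid <- expand.grid(nprune = c(2, 4, 8, 16), degree = 1:2)
fit <- train(Sepal.Length ~ ., data = iris,
             method = "earth",
             tuneGrid = my_grid,
             trControl = trainControl(method = "cv", number = 5))
fit$bestTune  # the combination selected by cross-validation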

Resources