I have some data on study participants: the percent change in a biomarker and each participant's ultimate outcome. I'd like to use an ROC curve to find the best cutoff value of the biomarker for predicting the outcome using the Youden method, but I get different answers from different packages and need to know where I'm going wrong.
To set up the dataset:
ID <- c(1:17)
PercentChange <- c(-85.5927051671732, -85.4849965108165,
-63.302752293578, -33.5509138381201, -75, -87.0988867059594,
-93.2523616734143, 65.2037617554859, -19.226393629124,
-44.7095435684647, -65.7342657342657, -43.7831467295227,
-37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
-72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)
df <- data.frame(ID, PercentChange, Outcome)
For outcome, 1 is favorable and 0 is unfavorable.
Now with package pROC I did:
library(pROC)
roc <- roc(df$Outcome, df$PercentChange, auc= TRUE, plot = TRUE)
coords(roc, "b", ret="t", best.method="youden")
plot(roc, print.thres="best", print.thres.best.method="youden",main = "Percent change")
This gives me a reasonable curve and a cutoff (by the Youden index) of -44.246, and I verified that the sensitivity and specificity listed for it are correct. The cutpoint seems a bit weird since it's halfway between two of the actual values rather than an actual value, but it works.
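For reference, the midpoint cutoff can be reproduced by hand. The following is a minimal base-R sketch, not pROC's implementation: it assumes that candidate thresholds are the midpoints between consecutive observed values and that lower values predict the favorable outcome (1), which seems to be the direction pROC picks automatically for this data. The names `thresholds`, `youden` and `best` are mine.

```r
PercentChange <- c(-85.5927051671732, -85.4849965108165,
                   -63.302752293578, -33.5509138381201, -75, -87.0988867059594,
                   -93.2523616734143, 65.2037617554859, -19.226393629124,
                   -44.7095435684647, -65.7342657342657, -43.7831467295227,
                   -37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
                   -72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)

# Candidate thresholds: midpoints between consecutive sorted values (assumption)
x <- sort(unique(PercentChange))
thresholds <- (head(x, -1) + tail(x, -1)) / 2

# Youden index J = sensitivity + specificity - 1 at each threshold,
# predicting Outcome 1 when PercentChange is below the threshold (assumption)
youden <- sapply(thresholds, function(t) {
  pred_pos <- PercentChange < t
  sens <- mean(pred_pos[Outcome == 1])
  spec <- mean(!pred_pos[Outcome == 0])
  sens + spec - 1
})

best <- thresholds[which.max(youden)]
best         # midpoint between -44.71 and -43.78
max(youden)  # 10/14 + 3/3 - 1
```

Any cutoff strictly between -44.71 and -43.78 classifies this sample identically, so the midpoint is just one representative of that interval.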
Then using OptimalCutpoints I tried
library(OptimalCutpoints)
optimal.cutpoint.Youden <- optimal.cutpoints(X = "PercentChange", status = "Outcome", tag.healthy = 1, methods = "Youden", data = df)
summary(optimal.cutpoint.Youden)
plot(optimal.cutpoint.Youden)
This gives me a different curve, and a different cutpoint of -43.783, which is one of the two points that pROC took the midpoint of. The sensitivity and specificity are also flipped from what I calculated using that cutpoint.
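The flipped values look consistent with `tag.healthy = 1`: that argument marks Outcome 1 as the reference ("healthy") group, so sensitivity would then refer to detecting Outcome 0. A quick base-R check of that reading, assuming the rule "classify as Outcome 0 when PercentChange is at or above the cutoff" (my assumption about the direction, not something taken from the package documentation):

```r
PercentChange <- c(-85.5927051671732, -85.4849965108165,
                   -63.302752293578, -33.5509138381201, -75, -87.0988867059594,
                   -93.2523616734143, 65.2037617554859, -19.226393629124,
                   -44.7095435684647, -65.7342657342657, -43.7831467295227,
                   -37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
                   -72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)

cutoff <- -43.7831467295227  # the observed value reported by OptimalCutpoints

# Treat Outcome 0 (unfavorable) as the class being detected
pred_unfav <- PercentChange >= cutoff
sens_unfav <- mean(pred_unfav[Outcome == 0])   # share of 0s flagged as 0
spec_unfav <- mean(!pred_unfav[Outcome == 1])  # share of 1s not flagged as 0

c(sensitivity = sens_unfav, specificity = spec_unfav)
```

Under that assumption the two numbers are the pROC specificity and sensitivity with their roles swapped, and the classification is the same as with the midpoint cutoff.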
Lastly I tried the ROC function from the Epi package
library(Epi)
ROC(form=Outcome~PercentChange, data=df, plot="ROC", PV=TRUE, MX=TRUE)
This gave me a third, completely different curve, and it says "24" at the cutpoint, which doesn't make any sense. Can anyone help me figure this out? I'm not asking on the stats stackexchange because they'd want to get into whether Youden is appropriate or not, rather than the technical application of these functions.
I'm looking to calculate the Minimum Detectable Effect for a potential Difference-in-Differences design where treatment and control are assigned at the cluster level and my outcome at the individual level is dichotomous. To do this I'm working in R with the clusterPower package, specifically the cpa.did.binary function. The help file for this function notes that d is "The expected absolute difference." I'm interpreting this as being a Minimum Detectable Effect; is that correct? If this is the MDE, is this output the expected difference in logits?
Thanks to anyone that can help. Alternatively, if you have a better package or way of calculating MDE that is also welcome.
# Input
cpa.did.binary(power = .8,
nclusters = 10,
nsubjects = 100,
p = .5,
d = NA,
ICC = .04,
rho_c = 0,
rho_s = 0)
# Output
> d
>0.2086531
I am looking to simulate a data set with pre-determined correlations between the variables. The code, below, is where I am at but I want to be able to control the parameters of the features individually.
In short, how do I change the SD, mean and min/max, intervals, skew and kurtosis for each variable individually?
library(tidyverse)
library(faux)
cmat <- c(1, .195, .346, .674, .561,
.195, 1, .479, .721, .631,
.346, .479, 1, .154, .121,
.674, .721, .154, 1, .241,
.561, .631, .121, .241, 1)
nps_sales <- round(rnorm_multi(100, 5, 3, .5, cmat,
varnames = c("NPS",
"change in NPS",
"sales (t0)",
"sales (t1)",
"sales (t2)")), 0) %>%
tibble()
You have specified rnorm_multi(n = 100, vars = 5, mu = 3, sd = .5, cmat = ...). rnorm_multi will accept vectors of the appropriate length for mu and sd (e.g. mu = c(3,3,3,2,2) and sd = c(1,0.5,0.5,1,2)), which will set the means and standard deviations of the individual variables accordingly.
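To make it concrete what per-variable means and SDs do, here is a minimal base-R sketch of the same idea. It is not faux's implementation: it scales a correlation matrix into a covariance matrix and uses a Cholesky factor to generate the draws. The 3x3 matrix is the upper-left corner of the question's `cmat`; `mu` and `sds` are illustrative values of my own.

```r
set.seed(1)

# Correlations among the first three variables of the question's cmat
R <- matrix(c(1,    .195, .346,
              .195, 1,    .479,
              .346, .479, 1), nrow = 3)

mu  <- c(3, 3, 2)    # one mean per variable (illustrative)
sds <- c(1, 0.5, 2)  # one SD per variable (illustrative)

# Covariance matrix: Sigma[i, j] = sds[i] * R[i, j] * sds[j]
Sigma <- diag(sds) %*% R %*% diag(sds)

# chol() returns upper-triangular U with t(U) %*% U == Sigma,
# so rows of Z %*% U have covariance Sigma
Z <- matrix(rnorm(100 * 3), ncol = 3)
X <- sweep(Z %*% chol(Sigma), 2, mu, "+")
colnames(X) <- c("NPS", "change in NPS", "sales (t0)")

round(colMeans(X), 2)      # close to mu
round(apply(X, 2, sd), 2)  # close to sds
round(cor(X), 2)           # close to R
```

The sample moments only approximate the targets because of sampling noise.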
Adjusting the other characteristics (min/max, skew, kurtosis, etc.), will be much more challenging, and may require a question on CrossValidated; the reason everyone uses the multivariate normal is that it's easy to specify means, SDs, and correlations, but you can't control the other aspects of the distributions easily. You can transform the results to achieve some level of skew/kurtosis, but this may not get as much flexibility and control as you want (see e.g. here).
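One way to do such a transformation, sketched under my own assumptions (a gamma target with shape 2 and a correlation of 0.5, neither of which comes from the question): push a normal variable through `pnorm()` to get uniforms, then through the quantile function of a skewed distribution. Because the map is monotone, rank (Spearman) correlations are preserved, while the Pearson correlation generally shifts somewhat.

```r
set.seed(42)
n <- 1000
r <- 0.5  # illustrative target correlation

# Two correlated standard normals
z1 <- rnorm(n)
z2 <- r * z1 + sqrt(1 - r^2) * rnorm(n)

# Monotone transform of z1 to a right-skewed gamma marginal
skewed <- qgamma(pnorm(z1), shape = 2, rate = 1)

# Pearson correlation before and after the transform
cor(z1, z2)
cor(skewed, z2)

# Spearman correlation is unchanged by the monotone transform
cor(z1, z2, method = "spearman")
cor(skewed, z2, method = "spearman")
```

This gives control over each marginal's shape, but the Pearson correlations you specified are then only approximately reproduced.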
I'm trying to use a kNN model to predict new data, but the error says:
"'train' and 'class' have different lengths"
Can someone replicate and solve this error?
weather <- c(1, 1, 1, 0, 0, 0)
temperature <- c(1, 0, 0, 1, 0, 0)
golf <- c(1, 0, 1, 0, 1, 0)
df <- data.frame(weather, temperature, golf)
df_new <- data.frame(weather = c(1,1,1,1,1,1,1,1,1), temp = c(0,0,0,0,0,0,0,0,0), sunnday= c(1,1,1,0,1,1,1,0,0))
pred_knn <- knn(train=df[, c(1,2)], test=df_new, cl=df$golf, k=1)
Thank you very much!
The knn function in R requires the training data to contain only the independent variables, since the dependent variable is passed separately through the "cl" parameter. Changing the call to the line below will fix this particular error.
pred_knn <- knn(train=df[,c(1,2)], test=df_new, cl=df$golf, k=1)
However, note that running the above line will throw another error. Since knn calculates the Euclidean distance between observations, it requires all independent variables to be numeric. These pages have helpful related information. I would recommend using a different classifier for this particular data set.
https://towardsdatascience.com/k-nearest-neighbors-algorithm-with-examples-in-r-simply-explained-knn-1f2c88da405c
https://discuss.analyticsvidhya.com/t/how-to-resolve-error-na-nan-inf-in-foreign-function-call-arg-6-in-knn/7280/4
Hope this helps.
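For completeness, a minimal sketch of a call that runs, assuming the `class` package (a recommended package that ships with standard R installations). As far as I can tell, `knn()` also expects `test` to have the same number of columns as `train`, so the new data here uses the same two predictors. The rows of `new_x` are hypothetical, not the `df_new` from the question.

```r
library(class)

# Training predictors only; the response goes in `cl`
train_x <- data.frame(weather     = c(1, 1, 1, 0, 0, 0),
                      temperature = c(1, 0, 0, 1, 0, 0))
golf <- factor(c(1, 0, 1, 0, 1, 0))

# Hypothetical new observations with the same columns as train_x
new_x <- data.frame(weather     = c(1, 0, 1),
                    temperature = c(1, 1, 0))

pred_knn <- knn(train = train_x, test = new_x, cl = golf, k = 1)
pred_knn
```

The first two new rows each match exactly one training row, so their predictions are determined. The third row ties between two training rows with different labels, and `knn()` breaks such ties at random.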
I had a similar issue with data from the ISLR Weekly dataframe:
knn.pred = knn(train$Lag2,test$Lag2,train$Direction,k=1)
Error in knn(train$Lag2, test$Lag2, train$Direction, k = 1) :
dims of 'test' and 'train' differ
where
train = subset(Weekly, Year >= 1990 & Year <= 2008)
test = subset(Weekly, Year < 1990 | Year > 2008)
I finally solved it by putting the test and the train in as.matrix() like this:
knn.pred = knn(as.matrix(train$Lag2),as.matrix(test$Lag2),train$Direction,k=1)
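A likely reason this works: `train$Lag2` is a plain vector, and my understanding is that `knn()` treats a vector passed as `test` as a single observation (one row), so its column count no longer matches `train`. Wrapping in `as.matrix()` turns each vector into a one-column matrix. A small sketch with made-up data (not the Weekly data set):

```r
library(class)

# Made-up single-predictor data
x_train <- c(1, 2, 3, 10, 11, 12)
cl      <- factor(c("Down", "Down", "Down", "Up", "Up", "Up"))
x_test  <- c(1.4, 10.6)

# A bare vector as `test` is read as one row with two columns: error
res <- tryCatch(knn(x_train, x_test, cl, k = 1),
                error = function(e) conditionMessage(e))
res

# One-column matrices give one row per observation
dim(as.matrix(x_test))
knn.pred <- knn(as.matrix(x_train), as.matrix(x_test), cl, k = 1)
knn.pred
```

Subsetting the data frame with `drop = FALSE`, e.g. `train[, "Lag2", drop = FALSE]`, should be an equivalent way to keep the one-column shape.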
Is there a way of extracting the functional form of the likelihood function that gets formed by the msm function in R?
How can I extract the likelihood function that gets formed in the example below? I want to try and implement my own version of the quasi-Newton maximisation algorithm to improve my understanding.
library(msm)
# look at transition counts
statetable.msm(state, PTNUM, data = cav)
# define transition intensity matrix
# 1's mean a transition can occur
# 0's mean a transition should not occur
# any number can be placed on the diagonal as R overwrites the diagonals
# prior to maximising
q <- rbind(
c(0, 1, 0, 1),
c(1, 0, 1, 1),
c(0, 1, 0, 1),
c(0, 0, 0, 0)
)
# fit msm to the data
# the fnscale rescales the likelihood to prevent overflow
msm.fit <- msm(state ~ years, PTNUM, data = cav, qmatrix = q, control=list(fnscale=4000))
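As a starting point for a hand-rolled version, here is a sketch of the general form of the likelihood for a panel-observed continuous-time Markov model: each observed transition from state r to state s over an interval t contributes the (r, s) entry of P(t) = exp(tQ). This is my own simplification and not msm's code, which also handles things such as exact death times, censoring and covariates. All names (`mat_exp`, `build_q`, `neg_loglik`) and the toy two-state data are hypothetical.

```r
# Matrix exponential via eigendecomposition (assumes A is diagonalisable)
mat_exp <- function(A) {
  e <- eigen(A)
  Re(e$vectors %*% diag(exp(e$values), nrow(A)) %*% solve(e$vectors))
}

# Build the intensity matrix from log-rates; `allowed` is a 0/1 matrix like q.
# Rates are filled in column-major order; diagonals make each row sum to zero.
build_q <- function(log_rates, allowed) {
  Q <- matrix(0, nrow(allowed), ncol(allowed))
  Q[allowed == 1] <- exp(log_rates)
  diag(Q) <- -rowSums(Q)
  Q
}

# Minus log-likelihood: sum of log P(dt)[from, to] over observed transitions
neg_loglik <- function(log_rates, allowed, from, to, dt) {
  Q <- build_q(log_rates, allowed)
  ll <- 0
  for (i in seq_along(dt)) {
    P <- mat_exp(Q * dt[i])
    ll <- ll + log(P[from[i], to[i]])
  }
  -ll
}

# Toy two-state example with hypothetical observed transitions
allowed <- rbind(c(0, 1),
                 c(1, 0))
from <- c(1, 1, 2, 1, 2, 2, 1, 2)
to   <- c(1, 2, 2, 1, 1, 2, 2, 1)
dt   <- c(1, 1, 2, 1, 1, 2, 1, 1)

fn <- function(p) neg_loglik(p, allowed, from, to, dt)
start_value <- fn(c(0, 0))

# Quasi-Newton (BFGS) minimisation with numerical gradients
fit <- optim(c(0, 0), fn, method = "BFGS")
exp(fit$par)  # estimated transition intensities
fit$value     # minus log-likelihood at the optimum
```

For the fitted model in the question, `msm.fit$minus2loglik` should give minus twice the maximised log-likelihood, which is a useful number to compare your own implementation against.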
I ran a path analysis in R, and the following matrix represents the effects between the variables.
M <- c(0, 0, 0, 0, 0)
p <- c(0, 0, 0, 0, 0)
O <- c(0, 0, 0, 0, 0)
T <- c(1, 0, 1, 1, 0)
Sales <- c(1, 1, 1, 1, 0)
sales_path <- rbind(M, p, O, T, Sales)
colnames(sales_path) <- rownames(sales_path)
#innerplot(sales_pls)
sales_blocks <- list(
c("m1", "m2"),
#c("pr"),
c("R1"),
#c("C1"),
c("tt1"),
c("Sales")
)
sales_modes = rep("A", 5)
sales_pls <- plspm(input_file, sales_path, sales_blocks, scheme = "centroid", scaled = FALSE, modes = sales_modes)
I have 2 questions:
Can I use the weights I receive to calculate the value of a latent variable? For example, my M variable has the manifest variables m1 and m2; is there a formula to calculate its value?
The main purpose of running the path analysis is to predict the sales. Is that possible using the estimates (betas) for each latent variable?
i want to know if i am able to calculate the value of the latent variable
Yes. It's actually very easy. Your latent variable scores can be obtained by running sales_pls$scores. You can also run summary(sales_pls). For easier interpretation you may wish to have the latent variables expressed in the same scale as the indicators (manifest variables). This can be accomplished by normalizing the outer weights in each block of indicators so that the weights are expressed as proportions. However, in order to apply this normalization, all the outer weights must be positive.
You can apply the function rescale() to the object sales_pls and get scores expressed in the original scale of the manifest variables. Once you get the rescaled scores you can use summary() to verify that the obtained results make sense (i.e. scores expressed in original scale of indicators):
# rescaling scores
rescaled_sales_pls = rescale(sales_pls)
# summary
summary(rescaled_sales_pls)
(again you can also run rescaled_sales_pls)
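To illustrate the normalization described above, here is a base-R sketch of a latent score computed as a weighted average of its block's indicators. The indicator values and outer weights are made up. In a real fit I believe the weights are available in sales_pls$outer_model, and this is my reading of what the rescaling does rather than plspm's own code.

```r
# Hypothetical indicator values for block M (manifest variables m1, m2)
m1 <- c(2, 4, 5, 3)
m2 <- c(1, 5, 4, 2)
X  <- cbind(m1, m2)

# Hypothetical positive outer weights for the block
w <- c(m1 = 0.6, m2 = 0.5)

# Normalize the weights so they are proportions summing to 1
w_norm <- w / sum(w)

# Latent score for M: weighted average of the indicators, on their scale
M_score <- as.vector(X %*% w_norm)
M_score
```

Because the normalized weights are positive and sum to 1, each score lies between the smallest and largest indicator value for that observation, which is why the rescaled scores are on the indicators' original scale.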
if i can predict Sales using the betas from the output
Theoretically I guess you could, but I'm not really sure why you would want to. The utility of path analysis here is to decompose the sources of a correlation between an independent variable and a dependent variable of a multiple regression model.