How to construct a ROC curve in R with a small clinical dataset

I need help with a ROC curve in R. My data are not difficult, but I don't know how to get the ROC curve and AUC. I have typed the dataset below in case you need to have a look. The cutoff comes from the CMMS score (e.g. a score < 5, < 10 or < 20 means the patient has dementia).
Table - Relationship of clinical dementia to outcome on the Mini-Mental Status Test
CMMS score   Nondemented   Demented
0–5          0             2
6–10         0             1
11–15        3             4
16–20        9             5
21–25        16            3
26–30        18            1
Total        46            16
Please let me know any ideas. Thank you.
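
One way to get a ROC curve straight from the grouped table, sketched in base R. It assumes a patient is called "demented" when the score falls at or below the upper bound of a band, so each band boundary gives one point on the curve. The variable names are made up for this illustration, and the AUC is a trapezoidal estimate from only six cutoffs.

```r
# Counts per CMMS band, copied from the table (lowest scores first)
score_upper <- c(5, 10, 15, 20, 25, 30)
nondemented <- c(0, 0, 3, 9, 16, 18)
demented    <- c(2, 1, 4, 5, 3, 1)

# Rule assumed here: classify as demented when score <= cutoff.
# Sensitivity (true positive rate) and 1 - specificity (false positive rate)
# at each cutoff, with the (0, 0) starting point added in front.
tpr <- c(0, cumsum(demented) / sum(demented))
fpr <- c(0, cumsum(nondemented) / sum(nondemented))

# Trapezoidal area under the curve
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "b",
     xlab = "1 - specificity", ylab = "Sensitivity",
     main = "ROC curve for CMMS score")
abline(0, 1, lty = 2)  # chance line
print(round(auc, 3))
```

With individual patient scores instead of grouped counts, the same steps apply with one cutoff per distinct score.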

Related

Bayesian Question: Exponential Prior and Poisson Likelihood: Posterior?

I need help with a particular question and confirmation of my understanding.
The belief is that absences in a company follow a Poisson(λ) distribution.
It is additionally believed that λ is less than 5 with probability 0.75, so it is decided that an exponential distribution will be the prior for λ. You take a random sample of 50 students and find out the number of absences that each has had over the past semester.
The data are summarised below; note that 0 and 1 are binned together.
Number of absences   ≤ 1   2    3   4   5   6   7   8   9   10
Frequency            18    13   8   3   4   3   0   0   0   1
To calculate the posterior distribution, my understanding is that it is prior × likelihood, which in this case is an Exponential(1/2.56) prior and a Poisson likelihood, with the belief that the probability of λ being less than 5 is 0.75 incorporated by solving
-ln(1-0.75)/(1/2.56) = 3.5489.
Furthermore, a similar thread calculated the posterior to be a Gamma(sum(xi)+1, n+lambda).
With those assumptions, I have some code to visualise this:
x=seq(from=0, to=10, by= 1)
plot(x,dexp(x,rate = 0.390625),type="l",col="red")
lines(x,dpois(x,3.54890),col="blue")
lines(x,dgamma(x,128+1,50+3.54890),col="green")
Any help or clarification surrounding this would be greatly appreciated.
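
For comparison, a minimal sketch of the conjugate update in base R. It rests on two assumptions that are not settled in the question: the 75% statement is read as P(λ < 5) = 0.75 under the exponential prior, which gives a different rate from the 1/2.56 used above, and every observation in the "≤ 1" bin is counted as 1, which is what a sum of 128 implies. An Exponential(rate) prior is a Gamma(1, rate), so with a Poisson likelihood the posterior is Gamma(1 + sum(x), rate + n).

```r
# Prior rate from P(lambda < 5) = 0.75, i.e. 1 - exp(-5 * rate) = 0.75
prior_rate <- -log(1 - 0.75) / 5

# Data from the table; ASSUMPTION: the "<= 1" bin is treated as 1 absence
absences  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
frequency <- c(18, 13, 8, 3, 4, 3, 0, 0, 0, 1)
n     <- sum(frequency)             # sample size
sum_x <- sum(absences * frequency)  # total number of absences

# Gamma(1, prior_rate) prior + Poisson likelihood -> Gamma posterior
post_shape <- 1 + sum_x
post_rate  <- prior_rate + n

# Plot prior and posterior densities on a fine grid of lambda values
lambda <- seq(0.01, 10, length.out = 500)
plot(lambda, dgamma(lambda, shape = post_shape, rate = post_rate),
     type = "l", col = "green", xlab = "lambda", ylab = "Density")
lines(lambda, dexp(lambda, rate = prior_rate), col = "red")
legend("topright", legend = c("Prior", "Posterior"),
       col = c("red", "green"), lty = 1)
```

The Poisson pmf is a distribution over counts rather than over λ, so it is left out of the plot here; the posterior mean is post_shape / post_rate.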

Output "randomForest" with changing MeanDecreaseAccuracyValues

I have a question relating to the “randomForest” package in R. I am trying to build a model with the ecological variables that best explain my species occupancy data for 41 sites in the field (which I have gathered from camera traps). My ultimate goal is to do species occupancy modeling using the “unmarked” package, but before I get to that stage I need to select the variables that best explain my occupancy, since I have many. To gain some understanding of the randomForest package I generated a fake occupancy dataset and a fake variable dataset (with variables A and D being good predictors of my occupancy and B and C being bad predictors). When I run the randomForest my output looks like this:
           0        1  MeanDecreaseAccuracy  MeanDecreaseGini
A 25.3537667 27.75533            26.9634018        20.6505920
B  0.9567857  0.00000             0.9665287         0.0728273
C  0.4261638  0.00000             0.4242409         0.1411643
D 32.1889374 35.52439            34.0485837        27.0691574
OOB estimate of error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000
I did not make separate training and test sets; I put extra weight on the model to correctly predict the “1’s”, and the variables are scaled.
I understand that this output tells me that A and D are important variables because they have high MeanDecreaseAccuracy values. However, D is the inverse of A (they are perfectly correlated) so why does D have a higher MeanDecreaseAccuracy value?
Moreover, when I run the randomForest with only A and D as variables, these values change while the confusion matrix stays the same:
         0        1  MeanDecreaseAccuracy  MeanDecreaseGini
A 28.79540 29.77911              29.00879          23.58469
D 29.75068 30.79498              29.97520          24.53415
OOB estimate of error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000
When I run the model with only 1 good predictor (A or D) or with a good and bad predictor (AB or CD) the confusion matrix stays the same but the MeanDecreaseAccuracy values of my predictors change.
Why do these values change and how should I approach the selection of my variables? (I am a beginner in occupancy modeling).
Thanks a lot!

Predicting survival within time (cumulative hazard) [duplicate]

This question already has an answer here:
Extract survival probabilities in Survfit by groups
(1 answer)
Closed 3 years ago.
By using R, how can one develop an index score for predicting patient overall survival (OS)?
I have a shortlist of 4 candidate predictors that were shown to be associated with OS. They resulted from Cox multivariate regression (run with coxph()). The predictors are protein levels, hence they are all continuous variables.
The data table looks something like this (showing only n=10 here):
         days    Status  Prot1     Prot13    Prot7     Prot21
Subj_1   115.69  0       2.284498  6.319168  6.070115  8.457412
Subj_2   72.30   1       2.473034  6.066573  6.140178  8.225987
Subj_3   1.08    1       2.662481  6.212845  6.971018  8.128949
Subj_4   69.63   1       2.761391  5.902610  6.433883  7.876319
Subj_5   78.41   1       3.038122  6.355257  6.852981  7.500973
Subj_6   42.90   1       2.058549  6.020681  7.231307  8.164025
Subj_7   31.00   1       2.305096  5.415107  8.126941  8.566320
Subj_8   51.12   1       2.931978  5.574601  7.503275  7.529957
Subj_9   11.01   1       2.218814  6.270222  6.710297  8.193895
Subj_10  27.68   1       2.821947  6.132379  6.911071  8.428218
The question is: how can I create a formula that is capable of classifying these patients into 2 groups: a group where the estimated survival is < 60% in a 1-year period, and another which will include those with estimated survival > 60% in the same time period?
Would there be any function() in R that deals with that?
Thanks a lot in advance.
I think you should post this question at https://stats.stackexchange.com since it is a matter of statistics. Anyway, you could try a binomial regression to start, but there are many other models you could try. How many subjects do you have?
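
As an alternative to a binomial regression, here is a rough sketch that stays with the Cox model and reads each subject's predicted survival at one year. It uses the lung dataset that ships with the survival package as a stand-in, not the protein data from the question, and the covariates age and sex are placeholders for Prot1, Prot13, Prot7 and Prot21. It also assumes follow-up time is recorded in days, so one year is taken as 365.

```r
library(survival)  # recommended package, shipped with most R installations

# Stand-in data: 'lung' from the survival package (NOT the protein data above)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Predicted survival curve for every subject, read off at 365 days
curves   <- survfit(fit, newdata = lung)
surv_1yr <- as.numeric(summary(curves, times = 365)$surv)

# Two groups, split at the 60% threshold from the question
risk_group <- ifelse(surv_1yr < 0.60, "below 60%", "60% or above")
table(risk_group)
```

With the real data the formula would become something like Surv(days, Status) ~ Prot1 + Prot13 + Prot7 + Prot21. Predictions made on the same subjects used to fit the model will look better than they would on new patients.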

How to produce a classification table of predicted vs actual values

I'm new to R and neural networks and haven't been able to figure out how to predict a variable from a trained neural network and produce a classification table of predicted vs actual values. I haven't been able to understand what the code for a classification table of predicted vs actual values means. I would greatly appreciate it if you could explain the code. This is what I have done so far:
model_Wine <- train(wine~., wine_df, method='nnet', trace=FALSE)
prediction <- predict(model_Wine, wine_df)
table(prediction, wine_df$wine)
prediction  A  B  C
         A 59  0  0
         B  0 71  0
         C  0  0 48
How do I produce a classification table of predicted vs actual "wine" values?
Thank you!
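
The table() call above already is a classification table: rows are the predicted classes and columns are the actual ones. A small base R sketch that labels the two dimensions and adds the overall accuracy follows. The vectors are rebuilt from the counts in the output above so that the example runs without the wine data or the trained model; in practice `predicted` would be the result of predict() and `actual` would be wine_df$wine.

```r
# Rebuild actual and predicted labels from the counts shown in the output above
actual    <- factor(rep(c("A", "B", "C"), times = c(59, 71, 48)))
predicted <- actual  # the output shows every case on the diagonal

# Named dimensions make it clear which side is which
tab <- table(Predicted = predicted, Actual = actual)
print(tab)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)
print(accuracy)
```

Because the predictions were made on the same data used for training, this accuracy is likely optimistic; a held-out test set gives a fairer table.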

Get data distribution given a dataframe R

I'm new to R and I would like to "see" which distribution my data follow.
My data represent the entering passengers at 10 bus stops for 1 month at the same time each day.
Date  Stop.1  Stop.2  Stop.3  Stop.4  Stop.5  ...
2-9   3       26      11      3       0
3-9   2       44      23      0       12
4-9   26      16      0       0       4
...
My goal is to understand the distribution of the "entering" passengers at each stop so that I can create a simulation for the arrivals at each Stop.
Density Plot
You can plot the probability density function to let you "see" the distribution. Assuming df is a dataframe containing the information:
d <- density(df$Stop.1) # returns the density data for Stop 1.
plot(d) # plots the results
Summary statistics
Or you can use the basicStats function from the fBasics package to give some useful summary statistics.
library(fBasics)
basicStats(df)
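
Since the values are counts, a quick base R check is to compare the mean and variance per stop: for a Poisson distribution they are roughly equal. The sketch below uses only the three rows shown in the question, which is far too little to conclude anything, and the resampling step is one possible way to simulate arrivals without committing to a parametric distribution. The names `dispersion` and `sim_stop1` are made up for this example.

```r
# Only the three rows visible in the question; the full month is not shown
df <- data.frame(
  Date   = c("2-9", "3-9", "4-9"),
  Stop.1 = c(3, 2, 26),
  Stop.2 = c(26, 44, 16),
  Stop.3 = c(11, 23, 0),
  Stop.4 = c(3, 0, 0),
  Stop.5 = c(0, 12, 4)
)
stops <- df[, -1]  # drop the Date column

# Mean, variance and their ratio per stop (a ratio near 1 is consistent with Poisson)
stop_mean  <- sapply(stops, mean)
stop_var   <- sapply(stops, var)
dispersion <- stop_var / stop_mean
print(round(dispersion, 2))

# Simulate 30 days of arrivals at Stop 1 by resampling the observed counts
set.seed(1)
sim_stop1 <- sample(df$Stop.1, size = 30, replace = TRUE)
```

If the ratio is well above 1, a negative binomial is a common next candidate for count data.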
