Simulating from a vector of discrete data - r

I have a vector of discrete data and I want to simulate from the empirical distribution associated with it. Previously I simulated with the function rlogspline after doing fit<-logspline(vector_of_data), where vector_of_data was supposed to come from a continuous distribution, which is why I used logspline. With this vector, however, I am certain that the values are discrete, so I can't use logspline to obtain a "fit" for it.
Basically, what I want to do is obtain a "fit" of the observed data and then use that fit to simulate those values. Do you think this can be done in R?
Thank you very much for your help.

I think sample(x,...,replace=TRUE) (sampling with replacement) should simulate from the empirical distribution ...
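A minimal sketch of that idea, assuming a made-up vector x of discrete observations (the values are only for illustration):

```r
set.seed(1)

# Hypothetical discrete observations (made-up values)
x <- c(2, 3, 3, 5, 5, 5, 8)

# Draw 10000 values from the empirical distribution of x
sim <- sample(x, size = 10000, replace = TRUE)

# The simulated relative frequencies should be close to the observed ones
obs_freq <- prop.table(table(x))
sim_freq <- prop.table(table(sim))
print(round(obs_freq, 3))
print(round(sim_freq, 3))
```

Every simulated value is one of the observed values, which is what you want for discrete data.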

I am not totally clear on exactly what you are trying to do, but could you use something like quantile and runif? For example:
obs <- c(125,110,115,100,150) # original observations
sim <- quantile(obs, runif(10000)) # simulations
hist(sim, freq=FALSE)
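One caveat with the quantile approach: by default quantile uses type = 7, which interpolates between neighbouring observations, so the simulated values are not restricted to the observed ones. A sketch of a variant for discrete data, assuming you want the plain inverse of the empirical CDF (type = 1):

```r
set.seed(1)
obs <- c(125, 110, 115, 100, 150) # original observations

# Default (type = 7) interpolates, so values between observations appear
sim_default <- quantile(obs, runif(10000), names = FALSE)

# type = 1 is the inverse of the empirical CDF: only observed values come back
sim_discrete <- quantile(obs, runif(10000), type = 1, names = FALSE)

print(all(sim_default %in% obs))  # interpolated values are not all in obs
print(all(sim_discrete %in% obs)) # every draw is an observed value
print(table(sim_discrete))
```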

Related

How to use termplot function with fixed predictive variable values?

Let's assume I want to draw a plot similar to the one here using R, i.e. hazard ratio on the y axis and some predictor variable on the x axis, based on a Cox model with a spline term. The only exception is that I want to set my own x axis points; termplot seems to pick all the unique values from the data set, but I want to use some sort of grid. This is because I am doing multiple imputation, which induces different unique values in every round. Otherwise I can do combined inference quite easily, but it would be a lot easier to make predictions for the same predictor values in every imputation round.
So, I need to find a way to use termplot function so that I can fix predictor values or to find a workaround. I tried to use predict function but its newdata argument requires values for all other (adjusting) variables too, which inflates standard errors. This is a problem because I am also plotting confidence intervals. I think I could do this manually without any functions except that spline terms are out of my reach in this sense.
Here is an illustrative example.
library(survival)
data(diabetic)
diabetic<-diabetic[diabetic$eye=="right",]
# Model with spline term and two adjusting variables
mod<-coxph(Surv(time,status)~pspline(age,df=3)+risk+laser,data=diabetic)
summary(mod)
# Let's pretend this is the grid
# These are in the data but it's easier for comparison in this example
ages<-20:25
# Right SEs but what if I want to use different age values?
termplot(mod,term=1,se=TRUE,plot=FALSE)$age[20:25,]
# This does something else
termplot(mod,data=data.frame(age=ages),term=1,se=TRUE,plot=FALSE)$age
# This produces an error
predict(mod,newdata=data.frame(age=ages),se.fit=TRUE)
# This inflates variance
# May actually work with models without categorical variables: what to do with them?
# Actual predictions are different but all that matters is the difference between ages
# and they are in line
predict(mod,newdata=data.frame(age=ages,risk=mean(diabetic$risk),laser="xenon"),se.fit=TRUE)
Please let me know if I didn't explain my problem sufficiently. I tried to keep it as simple as possible.
In the end, this is how I worked it out. First, I made the predictions and SEs using the termplot function, and then I used linear interpolation to get approximately right predictions and SEs for my custom grid.
ptemp<-termplot(mod,term=1,se=TRUE,plot=FALSE)
ptemp<-data.frame(ptemp[1]) # Picking up age and corresponding estimates and SEs
x<-ptemp[,1]; y<-ptemp[,2]; se<-ptemp[,3]
f<-approxfun(x,y) # Linear interpolation function
x2<-seq(from=10,to=50,by=0.5) # You can make a finer grid if you want
y2<-f(x2) # Interpolation itself
f_se<-approxfun(x,se) # Same for SEs
se2<-f_se(x2)
dat<-data.frame(x2,y2,se2) # The wanted result
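One thing to watch with this approach: approxfun returns NA for grid points outside the range of the original x values, so a grid wider than the observed ages produces missing predictions. A small sketch with made-up numbers (x and y here are hypothetical, not output from the model above):

```r
# Hypothetical predictor values and term estimates (made-up numbers)
x <- c(20, 25, 30, 35, 40)
y <- c(0.10, 0.05, 0.00, 0.08, 0.20)

f  <- approxfun(x, y)           # default rule = 1: NA outside range(x)
f2 <- approxfun(x, y, rule = 2) # rule = 2: hold the boundary value constant

grid <- c(10, 22.5, 50)
print(f(grid))  # NA at 10 and 50, interpolated value at 22.5
print(f2(grid)) # boundary values at 10 and 50
```

Holding the boundary value constant is itself an extrapolation, so restricting the grid to the observed range is probably the safer choice.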

I am doing a risk aggregation of losses in R using a Poisson distribution for the frequency of losses and an ecdf for the severity of losses

I am really new to this and I have no idea how to use the ecdf function in R. Here is what I need to do, step by step:
1. Define the frequency of losses using a Poisson distribution.
2. Generate an ecdf function to be used for the severity of losses.
3. Linearly interpolate the ecdf function.
4. Take the inverse transform of the linearly interpolated ecdf function.
For example, I can use the code freq <- rpois(10,5) to generate random loss frequencies, but I then have to use this vector in steps 2-4 and I have no idea how to do that. For step 2, my problem is how to take the Poisson draws as input and use them to compute severity with the ecdf function. If anybody knows how to do this, please help me.
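A rough sketch of how the four steps could fit together, assuming a vector of historical loss amounts is available (losses below is made up, and the anchoring of the interpolated inverse at probability 0 is my own assumption, not a standard the question specifies):

```r
set.seed(42)

# Hypothetical historical loss amounts (made-up numbers)
losses <- c(1200, 850, 4300, 670, 2900, 1500, 980, 7600, 2100, 1300)

# Step 1: number of losses in each of 10 periods
freq <- rpois(10, lambda = 5)

# Step 2: empirical CDF of the severity data
Fn <- ecdf(losses)

# Steps 3 and 4: evaluate the ECDF at the sorted data, then interpolate
# linearly with probabilities as input and loss amounts as output.
# Assumption: probabilities below 1/n are mapped to the smallest loss.
xs <- sort(unique(losses))
ps <- Fn(xs)
inv_cdf <- approxfun(c(0, ps), c(min(xs), xs), rule = 2)

# Inverse transform sampling: draw freq[i] severities per period and sum them
agg_loss <- vapply(freq, function(n) sum(inv_cdf(runif(n))), numeric(1))
print(data.frame(freq = freq, agg_loss = round(agg_loss)))
```

The Poisson draws are not an input to ecdf itself; they only decide how many severities to draw from the inverse in each period.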

How do I calculate AUC from two continuous variables in R?

I have the following data:
# actual value:
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
I already calculated MSE and RMSE for these two, but they're asking for AUC and ROC curve. How can I calculate it from this data using R? I thought AUC is for classification problems, was I mistaken? Can we still calculate AUC for numeric values like above?
Question:
I thought AUC is for classification problems, was I mistaken?
You are not mistaken. The area under the receiver operating characteristic curve can't be computed for two numeric vectors like in your example. It's used to determine how well your binary classifier stands up to a gold standard binary classifier. You need a vector of cases vs. controls, or levels for the vector a that put each value in one of two categories.
Here's an example of how you'd do this with the pROC package:
library(pROC)
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
df <- data.frame(a = a, p = p)
# order the data frame according to the actual values
odf <- df[order(df$a),]
# convert the actual values to an ordered binary classification
odf$a <- odf$a > 12 # arbitrarily decided to use 12 as the threshold
# construct the roc object
roc_obj <- roc(odf$a, odf$p)
auc(roc_obj)
# Area under the curve: 0.9615
Here, we have arbitrarily decided that the threshold for the gold standard (a) is 12. If that's the case, then observations that have a value lower than 12 are controls. The prediction (p) classifies very well, with an AUC of 0.9615. We don't have to decide on the threshold for our prediction classifier in order to determine the AUC, because it's independent of the threshold decision. We can slide up and down depending on whether it's more important to find cases or to not misclassify a control.
Important Note
I completely made up the threshold for the gold standard classifier. If you choose a different threshold (for the gold standard), you'll get a different AUC. For example, if we chose 28, the AUC would be 1. The AUC is independent of the threshold for the predictor, but absolutely depends on the threshold for the gold standard.
EDIT
To clarify the above note, which was apparently misunderstood, you were not mistaken. This kind of analysis is for classification problems. You cannot use it here without more information. In order to do it, you need a threshold for your a vector, which you don't have. You CAN'T make one up and expect to get a non made up result for the AUC. Because the AUC depends on the threshold for the gold standard classifier, if you just make up the threshold, as we did in the exercise above, you are also just making up the AUC.
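To see the dependence on the gold standard threshold without pROC, here is a small base R sketch. The helper auc_rank is hypothetical (my own name); it computes the AUC as the share of case/control pairs in which the case has the higher prediction, assuming no tied predictions:

```r
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)

# Hypothetical helper: rank-based (Mann-Whitney) AUC, higher score = case
auc_rank <- function(is_case, score) {
  r <- rank(score)
  n_case <- sum(is_case)
  n_ctrl <- sum(!is_case)
  (sum(r[is_case]) - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)
}

auc_12 <- auc_rank(a > 12, p) # made-up gold standard threshold of 12
auc_28 <- auc_rank(a > 28, p) # made-up gold standard threshold of 28
print(auc_12) # 0.9615385
print(auc_28) # 1
```

Same predictions, two made-up thresholds, two different AUCs.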

Prediction at a new value using lowess function in R

I am using the lowess function to fit a regression between two variables x and y. Now I want to know the fitted value at a new value of x. For example, how do I find the fitted value at x=2.5 in the following example? I know loess can do that, but I want to reproduce someone's plot and he used lowess.
set.seed(1)
x <- 1:10
y <- x + rnorm(x)
fit <- lowess(x, y)
plot(x, y)
lines(fit)
Local regression (lowess) is a non-parametric statistical method; it's not like linear regression, where you can use the model directly to estimate new values.
You'll need to take the values from the function (that's why it only returns a list to you) and choose your own interpolation scheme. Use the scheme to predict your new points.
A common technique is spline interpolation (but there are others):
https://www.r-bloggers.com/interpolation-and-smoothing-functions-in-base-r/
EDIT: I'm pretty sure the predict function (for loess fits) does the interpolation for you. I also can't find any information about what exactly predict uses, so I've tried to trace the source code.
https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/stats/R/loess.R
else { ## interpolate
## need to eliminate points outside original range - not in pred_
I'm sure the R code calls the underlying C implementation, but it's not well documented so I don't know what algorithm it uses.
My suggestion is: either trust the predict function or roll out your own interpolation algorithm.
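A minimal sketch of the roll-your-own route for the lowess fit from the question, assuming plain linear interpolation between the fitted points is acceptable (spline interpolation via splinefun would work similarly):

```r
set.seed(1)
x <- 1:10
y <- x + rnorm(x)
fit <- lowess(x, y)

# lowess returns a list with the sorted x values and the smoothed y values;
# approx interpolates linearly between those fitted points
pred <- approx(fit$x, fit$y, xout = 2.5)$y
print(pred)

plot(x, y)
lines(fit)
points(2.5, pred, pch = 19) # interpolated value lies on the drawn line
```

Because lines(fit) also joins the fitted points with straight segments, the linearly interpolated value sits exactly on the plotted curve.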

Extract a random sample from an unknown distribution (not generating stochastic random deviates)

I have a vector of data. I need to build the density / distribution function and, from that, extract a random sample, i.e. I need a function similar to rnorm(), rpois(), rbinom(), etc., but with a distribution built from a vector of data. All in R. Thank you so much.
It has nothing to do with generating stochastic random deviates.
I know the function sample() does something similar, but not exactly. If I use sample() I obtain only elements from my original data, as from a discrete distribution, and I need draws as from a continuous distribution.
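One possible sketch, assuming a Gaussian kernel density estimate is an acceptable model for the data: resample the observations and add normal noise with the kernel bandwidth, which amounts to drawing from the kernel density estimate. The data vector x and the helper name r_kde are made up for illustration:

```r
set.seed(123)

# Hypothetical data vector (made-up values)
x <- c(4.1, 5.3, 5.9, 6.2, 7.0, 7.4, 8.8, 9.5, 10.1, 12.6)

# Bandwidth that density() uses by default (Gaussian kernel, bw = "nrd0")
bw <- bw.nrd0(x)

# Hypothetical helper: pick an observation at random, then jitter it with
# Gaussian noise of standard deviation bw
r_kde <- function(n) sample(x, n, replace = TRUE) + rnorm(n, sd = bw)

draws <- r_kde(1000)
hist(draws, freq = FALSE)
lines(density(x)) # the draws should roughly follow this estimate
```

Unlike sample() alone, the draws are not restricted to the original values. The logspline / rlogspline pair mentioned in the first question is another option for continuous data.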
