Advice on calculating a function to describe the upper bound of data in R

I have a scatter plot of a dataset and I am interested in calculating the upper bound of the data. I don't know whether there is a standard statistical approach for this, so what I was considering was splitting the X-axis into small ranges, calculating the maximum within each range, and then trying to identify a function to describe these points. Is there a function already in R to do this?
If it's relevant, there are 92,611 points.
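For reference, the windowing idea described in the question can be sketched in base R with cut() and tapply(); this is only a hypothetical illustration, with x and y standing in for the plotted coordinates (which are not included here):
# Hypothetical sketch of the binned-maximum idea: split x into windows,
# take the maximum y within each window, then fit a curve to those maxima
breaks <- seq(min(x), max(x), length.out = 51)          # 50 equal-width windows
bin <- cut(x, breaks, include.lowest = TRUE)
bin_max <- tapply(y, bin, max)                          # maximum per window
bin_mid <- (head(breaks, -1) + tail(breaks, -1)) / 2    # window midpoints
plot(x, y, col = "grey")
points(bin_mid, bin_max, pch = 19, col = "red")         # candidate points to fit a function to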

You might like to look into quantile regression, which is available in the quantreg package. Whether this is useful will depend on whether you want the absolute maximum within your "windows" or whether some extreme quantile, say the 95th or 99th, is acceptable. If you are not familiar with quantile regression, consider that linear regression fits a model for the expectation (mean) response, conditional upon the model covariates. Quantile regression for the middle quantile (0.5) instead fits a model for the median response, conditional upon the model covariates.
Here is an example using the quantreg package, to show you what I mean. First, generate some dummy data similar to the data you show:
set.seed(1)
N <- 5000
DF <- data.frame(Y = rev(sort(rlnorm(N, -0.9))) + rnorm(N),
                 X = seq_len(N))
plot(Y ~ X, data = DF)
Next, fit the model to the 99th percentile (or the 0.99 quantile):
library(quantreg)
mod <- rq(Y ~ log(X), data = DF, tau = 0.99)
To generate the "fitted line", we predict from the model at 100 equally spaced values of X:
pDF <- data.frame(X = seq(1, 5000, length = 100))
pDF <- within(pDF, Y <- predict(mod, newdata = pDF))
and add the fitted model to the plot:
lines(Y ~ X, data = pDF, col = "red", lwd = 2)
This should give you a plot of the data with the fitted 99th percentile curve overlaid in red.
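As a quick sanity check (reusing DF and mod from the example above), roughly 99% of the observations should lie below the fitted curve:
# proportion of points at or below the fitted 0.99-quantile curve; should be close to 0.99
mean(DF$Y <= predict(mod))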

I would second Gavin's recommendation of quantile regression. Your data might be simulated with your X and Y each log-normally distributed. You can see what a plot of the joint distribution of two independent (no imposed correlation, though not necessarily cor(x, y) == 0) log-normal variates looks like if you run:
x <- rlnorm(1000, log(300), sdlog = 1)
y <- rlnorm(1000, log(7), sdlog = 1)
plot(x, y, cex = 0.3)
You might consider looking at their individual distributions with qqplot (in the base plotting functions), remembering that the tails of such distributions can behave in a surprising manner. You should be more interested in how well the bulk of the values fits a particular distribution than the extremes ... unless of course your applications are in finance or insurance. Don't want another global financial crisis because of poor modeling assumptions about tail behavior, now do we?
qqplot(x, rlnorm(10000, log(300), sdlog=1) )

Related

Simulating likelihood ratio test (LRT) p-values using the Monte Carlo method

I'm trying to figure out my assignment, which is to simulate LRT p-values using the Monte Carlo method. As far as I understand, the LRT is supposed to test for the "better", more accurate model.
I know how to perform such a test:
library(lmtest)  # for lrtest()
nested  <- glm(finalgrade ~ absences, data = grades)
complex <- glm(finalgrade ~ absences + age, data = grades)
lrtest(nested, complex)
From there I can return my p-value and perform some calculations, such as type I and type II error rates or the power of the test, and see how they change depending on the number of simulations.
My question is how I am supposed to simulate the random data. It doesn't have to be grades or school-related stuff; this was just a showcase of my understanding.
I was thinking about making a data frame with 3 to 4 columns, with one column being a dependent value (0, 1) and the rest being random numbers generated from the normal distribution or some other distribution.
But I don't know if this approach will create understandable results, or if this even makes sense.
I looked at this function but it didn't really help me understand anything.
I came up with something like this:
library(lmtest)
n <- 1000
dependent <- sample(c(0, 1), replace = TRUE, size = n)
pvalue <- c()
for (i in 1:1000) {
  independent_x <- rnorm(n, mean = 2, sd = 0.2)
  independent_y <- rnorm(n, mean = 7, sd = 0.5)
  nested  <- lm(dependent ~ independent_x)
  complex <- lm(dependent ~ independent_x + independent_y)
  # store the Pr(>Chisq) of the likelihood ratio test
  pvalue <- c(pvalue, as.numeric(lrtest(nested, complex)[5][2, 1]))
}
but I don't know if this is the right direction.
I would be really thankful if someone could help me to understand how to simulate data for the Monte Carlo sampling method.
Monte Carlo simulations are performed to compute a distribution of something that is difficult to compute or for which one is too lazy to perform the exact computation.
The likelihood ratio test computes a p-value based on the distribution of the likelihood ratio $\Lambda$, and that distribution is what you want to simulate instead of computing it or estimating it with formulas. The trick is to use simulation instead of computations.
Your problem does not seem to be so much how to perform the simulations, but rather what distribution you are interested in and want to simulate, and what boundary conditions you need to fix. Which computation or estimation is it that you want to replace or estimate with simulation?
For your likelihood ratio test you probably want to test the hypothesis $H_0: \theta_{age} = 0$ against the alternative hypothesis $H_a: \theta_{age} \neq 0$. In this case you compute the ratio of the likelihood $\mathcal{L}$ where one of the hypotheses is a composite hypothesis and you select the highest likelihood among them.
$$\Lambda = \frac{\mathcal{L}(\theta_{age} = 0| \text{some data})}{\text{sup}_{\theta_{age} \neq 0}\mathcal{L}( \theta_{age} | \text{some data})} = \frac{\mathcal{L}(\theta_{age} = 0| \text{some data})}{\mathcal{L}( \hat\theta_{age} | \text{some data})} $$ where the supremum is found by using the likelihood for the maximum likelihood estimator $ \hat\theta_{age} $
To compute these likelihood functions you need assumptions about the distributions. In your case you do this with glm (where you need to decide on a distribution and link function) or the simpler lm (which assumes a Gaussian conditional distribution for the data).
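As a minimal sketch of this (with hypothetical simulated data, since the grades data are not available), the ratio above amounts to comparing the log-likelihoods of the two nested fits:
# Hypothetical data; under H0 the age coefficient is zero
set.seed(42)
d <- data.frame(absences = rpois(200, 3), age = rnorm(200, 16, 1))
d$finalgrade <- 50 + 2*d$absences + rnorm(200, 0, 5)
m0 <- glm(finalgrade ~ absences, data = d)          # null model
m1 <- glm(finalgrade ~ absences + age, data = d)    # alternative model
LR <- -2*(logLik(m0) - logLik(m1))                  # -2 log(Lambda)
pchisq(as.numeric(LR), df = 1, lower.tail = FALSE)  # asymptotic p-value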
The simulations are then computed for a given null hypothesis. For instance, given some data, you assume that $\theta_{age} = 0$ and you want to compute what the distribution of the outcomes of $\Lambda$ is. You need some more data and parameters:
The independent variables. These you probably want to fix at some values that relate to your practical problem. You want to know the distribution given some independent variables. Potentially you may wish to study what happens when there is error in these independent variables; in that case you may also simulate them.
The variance/dispersion/noise level of the conditional distribution. This you may vary to see how it influences the statistic. Or you may have some value of interest, for instance if you have data for which you estimated the noise.
The other coefficients. These you may likewise vary or keep fixed, depending on whether you want to model a particular situation or a broader range of situations.
Example
The code below computes a simulation for a given regressor matrix (the independent variables) and given other coefficients. For large sample size the distribution will approach a chi-squared distribution. The simulation shows that using that limit as an estimate for the distribution underestimates the p-value by a lot.
(I ran the code with only 5000 simulations because I am using an online R editor and compiler; on a computer you can get more precise results.)
n_sim = 5*10^3
### simulate likelihood ratio test
### given coefficient and independent variables
### we assume a logistic model with binomial distribution
sim = function(theta1, X) {
  ### compute model
  Z = X %*% theta1
  p = 1/(1+exp(-Z))
  ### simulate dependent variable
  Y = rbinom(length(p), 1, p)
  ### compute (log)likelihood ratio
  mod1 = glm(Y ~ 1 + X[,2] + X[,3], family = binomial)
  mod0 = glm(Y ~ 1 + X[,2], family = binomial)
  logratio = -2*(logLik(mod0) - logLik(mod1))
  return(as.numeric(logratio))
}
set.seed(1)
n = 10
### coefficients with the last one zero
theta1 = c(1,1,0)
### some regressor matrix, independent variables
X = cbind(rep(1,n), matrix(rnorm(n*2),n)) ### first column is intercept
### simulate
Lsim = replicate(n_sim,sim(theta1,X))
### ordering for empirical distribution
Lsim = Lsim[order(Lsim)]
perc = c(1:length(Lsim))/length(Lsim)
plot(Lsim, 1-perc, main = "empirical distribution", ylab = "P(likelihood > L)", xlab = "L", type = "l")
lines(qchisq(perc,1),1-perc, lty = 2)
legend(8,1, c("n=10","n=40", "chi-squared estimate"), lty = c(1,1,2), col = c(1,2,1))
#### repeat with larger n
set.seed(1)
n = 40
theta1 = c(1,1,0)
X = cbind(rep(1,n), matrix(rnorm(n*2),n))
Lsim2 = replicate(n_sim,sim(theta1,X))
Lsim2 = Lsim2[order(Lsim2)]
lines(Lsim2, 1-perc, col = 2)
Note that there are many variants and this is just an example of what simulation does. Here we simulate data based on a given distribution. (And it replaces a computation that we could not perform: we had an estimate with a chi-squared distribution, but that is not accurate for small $n$.)
Other times this distribution is not known and one uses the data and some resampling method to simulate/estimate the distribution of the statistic.
For your situation you need to figure out what exact computation (for which information/conditions are given) it is that you want to replace by using simulations.
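For example, once you have a simulated null distribution such as Lsim above, a Monte Carlo p-value for an observed statistic is simply the tail proportion of the simulations (a sketch; the observed value here is hypothetical):
# Monte Carlo p-value: proportion of simulated statistics at least as extreme as the observed one
observed <- 3.1                                          # hypothetical observed -2*log(Lambda)
p_mc <- mean(Lsim >= observed)                           # simulated p-value
p_chisq <- pchisq(observed, df = 1, lower.tail = FALSE)  # chi-squared approximation
c(monte_carlo = p_mc, chi_squared = p_chisq)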

How to graph an increasing exponential decay curve in R?

My prof decided that our first experience with coding was going to be trying to fit the function z(t) = A(1 - e^(-t/T)) to a given data set from class using R. I'm completely lost. I keep using the lm and nls functions without quite knowing how they work. So far, I have the data graphed, but I have no clue how to get any sort of line more complicated than
mod3 <- lm(y ~ I(x^1/5))
pre3 <- predict(mod3)
lines(pre3)
To sum up: how do I find the A and T parameters? Do I use nls for the formula? Anything helps. I'll include a picture of the graph and the data I have to use; please ignore the random lines on the plot.
One could attempt to transform your expression into a linear relationship, but sometimes it is easier to just let the computer do the work. As mentioned in the comments, R has the nls function to perform the nonlinear regression.
Here is an example using some dummy data. Supply the nls function with your equation, the data frame containing the data, and initial estimates of the parameters.
See comments for additional details.
# create dummy data
A <- 0.8
T1 <- 13
t <- seq(2, 50, 3)
z <- A*(1 - exp(-t/T1))
z <- z + rnorm(length(z), 0, 0.005)  # add noise
# put the data in a data frame
df <- data.frame(t, z)
# solve the non-linear model
model <- nls(z ~ A*(1 - exp(-t/Tc)), data = df, start = list(A = 1, Tc = 1))
print(summary(model))
# predict
pred_y <- predict(model, data.frame(t))
# plot
plot(x = t, y = z)
lines(y = pred_y, x = t, col = "blue")
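The fitted parameters then answer the question of how to find A and T directly; coef() extracts them from the nls fit shown above:
coef(model)           # named vector with the estimated A and Tc
coef(summary(model))  # estimates together with standard errors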

Response curves for the binary part of a zero-inflated model in R

I've plotted the response curves for each of my predictors against the predicted values to determine how each predictor influences my counts. However, I also want to plot the binary part of my zero-inflated model to see how the predictors in that part help explain the probability of false zeroes. I am trying to get a plot similar to the one at the bottom of the page linked below; however, they don't provide reproducible code in that example.
https://fukamilab.github.io/BIO202/04-C-zero-data.html#sketch_fitted_and_predicted_values
I've included some code below showing my zero-inflated model and the predictors used. I then use the predict function to predict estimates for a much larger raster grid (new.data), and I want to see the relationship between those predicted values and the predictors I use across the entire raster grid.
library(pscl)     # zeroinfl()
library(ggplot2)
mod1 <- zeroinfl(Response ~ x1 + x2 | x1, link = "logit", data = data,
                 dist = "negbin")
modpred <- predict(mod1, new.data, se.fit = TRUE, type = "response")
response1 <- ggplot(data, aes(x = x1, y = modpred)) +
  geom_point() +
  geom_smooth(data = data, aes(x = x1, y = modpred))
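If the goal is the binary (zero-inflation) part specifically, predict.zeroinfl also accepts type = "zero", which returns the predicted probability of a structural (excess) zero; a sketch reusing mod1 and new.data from above (assuming new.data contains x1):
# Response curve for the zero-inflation (binary) component of the model
zero_pred <- predict(mod1, new.data, type = "zero")
ggplot(data.frame(x1 = new.data$x1, p_zero = zero_pred),
       aes(x = x1, y = p_zero)) +
  geom_point() +
  geom_smooth()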

How to fit and check a negative binomial distribution for count data in R?

I have a sample which is distributed, say, as rnorm(300, mean = 100, sd = 10) (the rnorm is just used as sample data). I want to (a) fit a negative binomial to it visually, using ggplot2 or base R,
and (b) also run an appropriate test to check whether or not it actually is negative binomial (in this case it shouldn't be). I am not sure how to achieve this in R.
So I tried this as a starting point (from this thread):
tt <- 1:100
hist(tt, prob = TRUE)
library(fitdistrplus)
fit <- fitdist(data = tt, distr = "nbinom", method = "mle")
# density evaluated at the fitted parameters (these are the values in fit$estimate)
fitD <- dnbinom(0:100, size = 2.116252, mu = 50.492772)
lines(0:100, fitD, lwd = 3, col = "blue")
However, this fits the data on the probability density scale. I want to fit it on the regular frequency (count) scale.
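One way to stay on the frequency (count) scale, reusing tt and fit from above, is to plot the observed frequencies and overlay the expected counts, i.e. the fitted probabilities multiplied by the sample size (a sketch, not a full answer):
# Observed frequencies versus expected negative binomial counts
obs <- table(factor(tt, levels = 0:100))                  # observed counts at each value
expected <- length(tt) * dnbinom(0:100, size = fit$estimate["size"],
                                 mu = fit$estimate["mu"]) # expected counts under the fit
plot(0:100, as.numeric(obs), type = "h", xlab = "value", ylab = "frequency")
lines(0:100, expected, col = "blue", lwd = 2)
# For part (b), fitdistrplus::gofstat(fit) reports goodness-of-fit statistics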

How can I get the probability density function from a regression random forest?

I am using random forests for a regression problem to predict the label values Test-Y for a given set Test-X (new values of the features). The model has been trained on given Train-X (features) and Train-Y (labels). The randomForest package in R serves me very well in predicting the numerical values of Test-Y. But this is not all I want.
Instead of only a number, I want to use random forests to produce a probability density function. I searched for a solution for several days and here is what I found so far:
"randomForest" doesn't produce probabilities for regression, but only in classification. (via "predict" and setting type=prob).
Using "quantregForest" provides a nice way to make and visualize prediction intervals. But still not the probability density function!
Any other thought on this?
Please see the predict.all parameter of the predict.randomForest function.
library("ggplot2")
library("randomForest")
data(mpg)
rf = randomForest(cty ~ displ + cyl + trans, data = mpg)
# Predict the first car in the dataset
pred = predict(rf, newdata = mpg[1, ], predict.all = TRUE)
hist(pred$individual)
The histogram of 500 "elementary" predictions looks like this:
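A kernel density estimate over those per-tree predictions then gives an approximate predictive density for that observation (a sketch reusing pred from above):
# Approximate predictive density from the individual tree predictions
dens <- density(as.numeric(pred$individual))
plot(dens, main = "Approximate predictive density, first observation")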
You can also use quantregForest with a very fine grid of quantiles, convert them into a cumulative distribution function (CDF) with the R function ecdf, and convert this CDF into a density estimate with a kernel density estimator.
