Fit Poisson distribution to data (histogram + line) - R

I need to do exactly what #interstellar asked here (Fit poisson distribution to data), but within the R environment (not MATLAB).
So, I created a barplot with my observed values, and I just need to fit a Poisson distribution on it.
Here is my data:
df = read.table(text = 'Var1 Freq
6 1
7 2
8 5
9 7
10 9
11 6
12 4
13 3
14 2
15 1', header = TRUE)
The barplot I created is the following:
t = barplot(df$Freq, ylim = c(0,10))
axis(1, at=t, labels=df$Var1)
I am still new to R, so how could I use the fitdist function (or something else) to draw a line over my barplot?
Any help would be really appreciated.
UPDATE
I have worked out something, but I am not 100% sure it is correct:
#create barplot
t = barplot(df$Freq, ylim = c(0,10))
axis(1, at=t, labels=df$Var1)
#find lambda value from my data (fitdist comes from the fitdistrplus package)
library(fitdistrplus)
pois = fitdist(df$Freq, 'pois', method = 'mle')
print(pois)
#result
Fitting of the distribution ' pois ' by maximum likelihood
Parameters:
       estimate Std. Error
lambda        4  0.6324555
#create 10 values from a real poisson distribution
dist = dpois(1:10, lambda = 4)
#multiply them by `sum(df$Freq)` in order to scale them to the barplot
dist = dist * sum(df$Freq)
#add the line plot to the original barplot
lines(dist, lwd = 2)
Result: (plot not shown)
However, the curve is not smooth...

The package vcd comes with the goodfit() function, which essentially does exactly what you ask for: fit the model by ML and then visualize observed and fitted frequencies. By default, a square-root scale is adopted to better bring out departures at lower expected frequencies. Also by default, the bars hang from the curve to align all deviations along the axis. This version is called a rootogram (see our recent discussion in The American Statistician for more details). The defaults can be changed, though, to get a standing barplot on the raw scale:
library(vcd)
gf <- goodfit(df[, 2:1], "poisson")
plot(gf, type = "standing", scale = "raw")
plot(gf, type = "hanging", scale = "sqrt")
Attention: Also note that in your version of the code you obtain exactly 4 as the MLE because you only use $Freq in the estimation, not $Var1. My version of the code assumes that your data is meant to have 1 observation of a 6, 2 observations of a 7, etc. The code may have to be adapted if this is not what you mean.
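Under that assumption, the raw observations can be reconstructed by repeating each Var1 value Freq times; fitdist() then estimates lambda from the event counts themselves. A minimal sketch (not part of the original answer; obs is just an illustrative name):
library(fitdistrplus)
obs <- rep(df$Var1, df$Freq)            # 1 six, 2 sevens, 5 eights, ...
fitdist(obs, 'pois', method = 'mle')    # lambda = mean(obs), about 10.2 here, not 4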

#fit poisson distr to df$Freq data (fitdist is from the fitdistrplus package)
poisson = fitdist(df$Freq, 'pois', method = 'mle')
print(poisson)
#plot
plot(df$Var1, df$Freq, type = 'h', ylim = c(0,10), ylab = 'No. of years with x events',
xlab = 'No. of events in a year', main = 'All 13-day events with Poisson')
dist = dpois(1:10, lambda = 4)
dist = dist * sum(df$Freq)
dist = as.data.frame(dist)
dist$Var1 = df$Var1
lines(dist$Var1, dist$dist, lwd = 2)


adding knn fit to plot in R

~Beginner in R~
I have the following code for a data set that has variables: price, mileage, and color. I have plotted a basic plot of x=mileage and y=price, and fitted a linear regression line to the plot.
cd = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")
cd = cd[,c('price','mileage','color')]
n = nrow(cd)
set.seed(99)
pin = .75 #percent train (or percent in-sample)
ii = sample(1:n,floor(pin*n))
cdtr = cd[ii,]
cdte = cd[-ii,]
dim(cd)
plot(cd$mileage, cd$price, xlab="Mileage", ylab="Price", pch=16,cex=.8)
abline(lm(cd$price ~ cd$mileage), col="red", lwd=2)
## FITTING KNN
library(class)  # the knn() call below comes from the class package
pred_knn = knn(data.frame(cdtr$mileage), data.frame(cdte$mileage), cl = cdtr$price, k = 50)
I am trying to fit a line using pred_knn to the plot so that my plot looks like this:
However, I am not sure how to go about adding the kNN fit to my plot.
As dcarlson mentions in the comments, it looks like you're using class::knn to predict a continuous variable, when it's meant to be used for classification (i.e. categorical responses).
The FNN package allows for kNN regression. Does this help:
library(FNN)
pred_knn = FNN::knn.reg(train = cdtr[2], test = cdte[2], y = cdtr[1], k = 50)
plot(cdte$mileage, cdte$price, xlab="Mileage", ylab="Price", pch=16, cex=.8)
ORDER = order(cdte$mileage)
lines(cdte$mileage[ORDER],pred_knn$pred[ORDER])
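If a smoother line is wanted, one option (my own sketch, not part of the original answer; grid and fit_grid are made-up names) is to predict on an evenly spaced grid of mileage values and draw the line over that grid:
grid <- data.frame(mileage = seq(min(cd$mileage), max(cd$mileage), length.out = 200))
fit_grid <- FNN::knn.reg(train = cdtr["mileage"], test = grid, y = cdtr$price, k = 50)
lines(grid$mileage, fit_grid$pred, col = "red", lwd = 2)  # overlay the kNN regression line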

Graphing prediction line of a logistic regression in R

I have set up a logistic regression model in R and successfully plotted the points of the model to show a relationship in the dataset. I am having trouble showing the line graph of the prediction. The model predicts readmission rates of a hospital based on the length of the initial stay (in days). Here is my code:
mydata <- read.csv(file = 'C:\\Users\\nickg\\Downloads\\3kfid8emf9rkc9ek30sf\\medical_clean.csv', header=TRUE)[,c("Initial_days","ReAdmis")]
head(mydata)
mydata$ReAdmis.f <- factor(mydata$ReAdmis)
logfit <- glm(mydata$ReAdmis.f ~ mydata$Initial_days, data = mydata, family = binomial)
summary(logfit)
range(mydata$Initial_days)
xweight <- seq(0, 79.992, .008)
yweight <- predict(logfit, list(xweight), type = "response")
plot(mydata$Initial_days, mydata$ReAdmis.f, pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
lines(xweight, yweight)
As you can see I have the model set up and ranges described by xweight and yweight, but nothing shows up for the line.
Always use curve for this:
plot(ReAdmis.f ~ Initial_days, data = mydata,
     pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
curve(predict(logfit, newdata = data.frame(Initial_days = x),
              # x is created by the curve function based on the plot's x limits
              # note that newdata must contain the x variable with exactly the same name as in the original data
              type = "response"),
      add = TRUE)
However, the issue here could be that your y variable is a factor variable (internally that's values of 1 and 2 if you have two levels) whereas logistic regression predictions are always in the interval [0, 1]. You should convert ReAdmis.f into 0/1 integer values before running the code.
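For example, one way to do the conversion and refit (a sketch only; ReAdmis01 is a name I made up, and this assumes ReAdmis has exactly two levels):
mydata$ReAdmis01 <- as.integer(mydata$ReAdmis.f) - 1   # recode factor levels 1/2 as 0/1
logfit <- glm(ReAdmis01 ~ Initial_days, data = mydata, family = binomial)
plot(ReAdmis01 ~ Initial_days, data = mydata, pch = 16,
     xlab = "Initial Days", ylab = "ReAdmission Y/N")
curve(predict(logfit, newdata = data.frame(Initial_days = x), type = "response"),
      add = TRUE)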

R: Plot Individual Predictions

I am using the R programming language. I am trying to follow this tutorial: https://rdrr.io/cran/randomForestSRC/man/plot.competing.risk.rfsrc.html
This tutorial shows how to use the "survival random forest" algorithm, an algorithm used to analyze survival data. In this example, the "follic" data set is used, and the survival random forest algorithm is used to analyze the instant hazard of an observation experiencing "status 1" vs "status 2" (this is called "competing risks").
In the code below, the survival random forest model is trained on the follic data set using all observations except the last two observations. Then, this model is used to predict the hazards of the last two observations:
#load library
library(randomForestSRC)
#load data
data(follic, package = "randomForestSRC")
#train model on all observations except the last 2 observations
follic.obj <- rfsrc(Surv(time, status) ~ ., follic[c(1:539),], nsplit = 3, ntree = 100)
#use model to predict the last two observations
f <- predict(follic.obj, follic[540:541, ])
#plot individual curves - does not work
plot.competing.risk(f)
However, this seems to produce the average hazards for the last two observations experiencing "status 1 vs status 2".
Is there a way to plot the individual hazards of the first observation and the second observation?
Thanks
EDIT1:
I know how to do this for other functions in this package, e.g. here you can plot these curves for 7 observations at once:
data(veteran, package = "randomForestSRC")
plot.survival(rfsrc(Surv(time, status)~ ., veteran), cens.model = "rfsrc")
## pbc data
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
## use subset to focus on specific individuals
plot.survival(pbc.obj, subset = c(3, 10))
This example seems to show the predicted survival curves for 7 observations (plus the confidence intervals - the red line is the average) at once. But I still do not know how to do this for the "plot.competing.risk" function.
EDIT2:
I think there might be an indirect way to solve this - you can predict each observation individually:
#use model to predict the last two observations individually
f1 <- predict(follic.obj, follic[540, ])
f2 <- predict(follic.obj, follic[541, ])
#plot individual curves
plot.competing.risk(f1)
plot.competing.risk(f2)
But I was hoping there was a more straightforward way to do this. Does anyone know how?
One possible way is to modify the plot.competing.risk function to draw individual lines, and to plot inside a for loop so that the individual lines overlap, as shown below.
#use model to predict the last three observations
f <- predict(follic.obj, follic[539:541, ])
x <- f
par(mfrow = c(2, 2))
for (k in 1:3) {                        #k for type of plot
  for (i in 1:dim(x$chf)[1]) {          #i for all individuals in x
    #cschf <- apply(x$chf, c(2, 3), mean, na.rm = TRUE) #original group mean
    cschf = x$chf[i,,]                  #individual values
    #cif <- apply(x$cif, c(2, 3), mean, na.rm = TRUE)   #original group mean
    cif = x$cif[i,,]                    #individual values
    cpc <- do.call(cbind, lapply(1:ncol(cif), function(j) {
      cif[, j]/(1 - rowSums(cif[, -j, drop = FALSE]))
    }))
    if (k==1) {
      matx = cschf
      range = range(x$chf)
    }
    if (k==2) {
      matx = cif
      range = range(x$cif)
    }
    if (k==3) {
      matx = cpc
      range = c(0,1)                    #manually assign, for now
    }
    ylab = c("Cause-Specific CHF","Probability (%)","Probability (%)")[k]
    matplot(x$time.interest, matx, type='l', lty=1, lwd=3, col=1:2,
            add=ifelse(i==1,F,T), ylim=range, xlab="Time", ylab=ylab) #ADD tag for overlapping individual lines
  }
  legend <- paste(c("CSCHF","CIF","CPC")[k], 1:2, " ")
  legend("bottomright", legend = legend, col = (1:2), lty = 1, lwd = 3)
}

Copula result in R

I have a table of two columns; it consists of an already computed index for 2 variables. A sample is quoted as follows:
V1, V2
0.46,1.08
0.84,1.05
-0.68,0.93
-0.99,0.68
-0.87,0.30
-1.08,-0.09
-1.16,-0.34
-0.61,-0.43
-0.65,-0.48
0.73,-0.48
In order to find out the correlation between the aforementioned data, I am using the copula package in R.
I used the following VineCopula code to figure out which copula family to use:
library(VineCopula)
selectedCopula <- BiCopSelect(u,v,familyset=NA)
selectedCopula
It suggested using the survival Gumbel, the rotated version of the Gumbel copula according to the copula R manual (Link).
However, I chose the Frank copula, since it offers a symmetric dependence structure and permits modeling both positive and negative dependence in the data; how plausible is that?
One more thing: after running the following self-explanatory copula code:
# Estimate V1 distribution parameters and visually compare simulated vs observed data
x_mean <- mean(mydata$V1)
#Normal Distribution
hist(mydata$V1, breaks = 20, col = "green", density = 30)
hist(rnorm( nrow(mydata), mean = x_mean, sd = sd(mydata$V1)),
breaks = 20,col = "blue", add = T, density = 30, angle = -45)
# Same for V2
y_mean <- mean(mydata$V2)
#Normal Distribution
hist(mydata$V2, breaks = 20, col = "green", density = 30)
hist(rnorm(nrow(mydata), mean = y_mean,sd = sd(mydata$V2)),
breaks = 20, col = "blue", add = T, density = 30, angle = -45)
# Measure association using Kendall's Tau
cor(mydata, method = "kendall")
#Fitting process with copula choice
# Estimate copula parameters
cop_model <- frankCopula(dim = 2)
m <- pobs(as.matrix(mydata))
fit <- fitCopula(cop_model, m, method = 'ml')
coef(fit)
# Check Kendall's tau value for the Frank copula with parameter = 3.236104
tau(frankCopula(param = 3.23))
#Building the bivariate distribution using frank copula
# Build the bivariate distribution
sdx =sd(mydata$V1)
sdy =sd(mydata$V2)
my_dist <- mvdc(frankCopula(param = 3.23, dim = 2), margins = c("norm","norm"),
paramMargins = list(list(mean = x_mean, sd=sdx),
list(mean = y_mean, sd=sdy)))
# Generate 439 random sample observations from the multivariate distribution
v <- rMvdc(439, my_dist)
# Compute the density
pdf_mvd <- dMvdc(v, my_dist)
# Compute the CDF
cdf_mvd <- pMvdc(v, my_dist)
# Sample 439 observations from the distribution
sim <- rMvdc(439,my_dist)
# Plot the data for a visual comparison
plot(mydata$V1, mydata$V2, main = 'Test dataset x and y', col = "blue")
points(sim[,1], sim[,2], col = 'red')
legend('bottomright', c('Observed', 'Simulated'), col = c('blue', 'red'), pch=21)
The plotted data set shows good fitting results even for extreme values.
Here, I want to present the correlated values obtained from applying the Frank copula together with my original data in the same line graph, but I could not figure out how to extract the Frank copula results (as a single column, so I can plot them against the original data and have a visual comparison).
I am not sure if I correctly understand your question. However, if you want the copula data (generated from the Frank copula), they are stored in sim. If you are asking for the Kendall tau, it is stored in the fitted copula object. You cannot have the Frank copula data as one column, as it must be a matrix. Also, the pobs function already returns a matrix, so you do not need to use as.matrix. If you need more help, I am very happy to help.
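For example, a small sketch using the objects from your code above: the fitted parameter and its implied Kendall tau come from fit, and the simulated pairs are the rows of sim, so each margin can be overlaid on the original series:
theta <- as.numeric(coef(fit))     # fitted Frank copula parameter
tau(frankCopula(param = theta))    # Kendall's tau implied by the fitted copula
plot(mydata$V1, type = 'l', col = 'blue', ylab = 'V1')   # original series
lines(sim[, 1], col = 'red')                             # simulated V1 margin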

Exponential curve fitting in R

time = 1:100
head(y)
0.07841589 0.07686316 0.07534116 0.07384931 0.07238699 0.07095363
plot(time,y)
This is an exponential curve.
How can I fit a line to this curve without knowing the formula? I can't use nls, as the formula is unknown (only data points are given).
How can I get the equation for this curve and determine the constants in the equation?
I tried loess but it doesn't give the intercepts.
You need a model to fit to the data.
Without knowing the full details of your model, let's say that this is an exponential growth model, which one could write as: y = a * e^(r*t)
Where y is your measured variable, t is the time at which it was measured,
a is the value of y when t = 0 and r is the growth constant.
We want to estimate a and r.
This is a non-linear problem because we want to estimate the exponent, r.
However, in this case we can use some algebra and transform it into a linear equation by taking the log on both sides and solving (remember
logarithmic rules), resulting in:
log(y) = log(a) + r * t
We can visualise this with an example, by generating a curve from our model, assuming some values for a and r:
t <- 1:100 # these are your time points
a <- 10 # assume the size at t = 0 is 10
r <- 0.1 # assume a growth constant
y <- a*exp(r*t) # generate some y observations from our exponential model
# visualise
par(mfrow = c(1, 2))
plot(t, y) # on the original scale
plot(t, log(y)) # taking the log(y)
So, for this case, we could explore two possibilies:
Fit our non-linear model to the original data (for example using nls() function)
Fit our "linearised" model to the log-transformed data (for example using the lm() function)
Which option to choose (and there are more options) depends on what we think (or assume) the data-generating process behind our data is.
Let's illustrate with some simulations that include added noise (sampled from
a normal distribution), to mimic real data. Please look at this
StackExchange post
for the reasoning behind this simulation (pointed out by Alejo Bernardin's comment).
set.seed(12) # for reproducible results
# errors constant across time - additive
y_add <- a*exp(r*t) + rnorm(length(t), sd = 5000) # or: rnorm(length(t), mean = a*exp(r*t), sd = 5000)
# errors grow as y grows - multiplicative (constant on the log-scale)
y_mult <- a*exp(r*t + rnorm(length(t), sd = 1)) # or: rlnorm(length(t), mean = log(a) + r*t, sd = 1)
# visualise
par(mfrow = c(1, 2))
plot(t, y_add, main = "additive error")
lines(t, a*exp(t*r), col = "red")
plot(t, y_mult, main = "multiplicative error")
lines(t, a*exp(t*r), col = "red")
For the additive model, we could use nls(), because the error is constant across
t. When using nls() we need to specify some starting values for the optimization algorithm (try to "guesstimate" what these are, because nls() often struggles to converge on a solution).
add_nls <- nls(y_add ~ a*exp(r*t),
start = list(a = 0.5, r = 0.2))
coef(add_nls)
# a r
# 11.30876845 0.09867135
Using the coef() function we can get the estimates for the two parameters.
This gives us OK estimates, close to what we simulated (a = 10 and r = 0.1).
You could see that the error variance is reasonably constant across the range of the data, by plotting the residuals of the model:
plot(t, resid(add_nls))
abline(h = 0, lty = 2)
For the multiplicative error case (our y_mult simulated values), we should use lm() on log-transformed data, because
the error is constant on that scale instead.
mult_lm <- lm(log(y_mult) ~ t)
coef(mult_lm)
# (Intercept) t
# 2.39448488 0.09837215
To interpret this output, remember again that our linearised model is log(y) = log(a) + r*t, which is equivalent to a linear model of the form Y = β0 + β1 * X, where β0 is our intercept and β1 our slope.
Therefore, in this output (Intercept) is equivalent to log(a) of our model and t is the coefficient for the time variable, so equivalent to our r.
To meaningfully interpret the (Intercept) we can take its exponential (exp(2.39448488)), giving us ~10.96, which is quite close to our simulated value.
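In code, that back-transformation is simply:
exp(coef(mult_lm)["(Intercept)"])   # ~10.96, close to the simulated a = 10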
It's worth noting what would happen if we fit data where the error is multiplicative using the nls function instead:
mult_nls <- nls(y_mult ~ a*exp(r*t), start = list(a = 0.5, r = 0.2))
coef(mult_nls)
# a r
# 281.06913343 0.06955642
Now we over-estimate a and under-estimate r
(Mario Reutter
highlighted this in his comment). We can visualise the consequence of using the wrong approach to fit our model:
# get the model's coefficients
lm_coef <- coef(mult_lm)
nls_coef <- coef(mult_nls)
# make the plot
plot(t, y_mult)
lines(t, a*exp(r*t), col = "brown", lwd = 5)
lines(t, exp(lm_coef[1])*exp(lm_coef[2]*t), col = "dodgerblue", lwd = 2)
lines(t, nls_coef[1]*exp(nls_coef[2]*t), col = "orange2", lwd = 2)
legend("topleft", col = c("brown", "dodgerblue", "orange2"),
legend = c("known model", "nls fit", "lm fit"), lwd = 3)
We can see how the lm() fit to log-transformed data was substantially better than the nls() fit on the original data.
You can again plot the residuals of this model, to see that the variance is not constant across the range of the data (we can also see this in the graphs above, where the spread of the data increases for higher values of t):
plot(t, resid(mult_nls))
abline(h = 0, lty = 2)
Unfortunately, taking the logarithm and fitting a linear model is not optimal. The reason is that the errors for large y-values weigh much more than those for small y-values when applying the exponential function to go back to the original model.
Here is one example:
f <- function(x){exp(0.3*x+5)}
squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
x <- 0:12
y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
x
y
#--------------------------------------------------------------------
M <- lm(log(y)~x)
a <- unlist(M[1])[2]
b <- unlist(M[1])[1]
print(c(a,b))
squaredError(a,b,x,y)
approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
for ( i in 0:10 )
{
eps <- -i*sign(approxPartAbl_a)*1e-5
print(c(eps,squaredError(a+eps,b,x,y)))
}
Result:
> f <- function(x){exp(0.3*x+5)}
> squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
> x <- 0:12
> y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
> x
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12
> y
[1] 151.2182 203.4020 278.3769 366.8992 503.5895 682.4353 880.1597 1186.5158 1630.9129 2238.1607 3035.8076 4094.6925 5559.3036
> #--------------------------------------------------------------------
>
> M <- lm(log(y)~x)
> a <- unlist(M[1])[2]
> b <- unlist(M[1])[1]
> print(c(a,b))
coefficients.x coefficients.(Intercept)
0.2995808 5.0135529
> squaredError(a,b,x,y)
[1] 5409.752
> approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
> for ( i in 0:10 )
+ {
+ eps <- -i*sign(approxPartAbl_a)*1e-5
+ print(c(eps,squaredError(a+eps,b,x,y)))
+ }
[1] 0.000 5409.752
[1] -0.00001 5282.91927
[1] -0.00002 5157.68422
[1] -0.00003 5034.04589
[1] -0.00004 4912.00375
[1] -0.00005 4791.55728
[1] -0.00006 4672.70592
[1] -0.00007 4555.44917
[1] -0.00008 4439.78647
[1] -0.00009 4325.71730
[1] -0.0001 4213.2411
>
Perhaps one can try some numeric method, e.g. a gradient search, to find the minimum of the squared error function.
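A minimal sketch of that idea (my own, not from the output above) is to hand the squared-error objective to optim(), starting from the log-linear estimates a and b computed earlier:
sse <- function(p) squaredError(p[1], p[2], x, y)   # reuse the objective defined above
opt <- optim(c(a, b), sse)                          # Nelder-Mead by default
opt$par     # refined estimates of the exponent and the intercept
opt$value   # squared error at the optimum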
If it really is exponential, you can try taking the logarithm of your variable and fitting a linear model to that.
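A sketch of that approach with the variables from the question (assuming y is strictly positive and roughly follows y = a * exp(r * time)):
fit_log <- lm(log(y) ~ time)
a_hat <- exp(coef(fit_log)[1])   # estimate of y at time = 0
r_hat <- coef(fit_log)[2]        # growth (here, decay) constant
plot(time, y)
lines(time, a_hat * exp(r_hat * time), col = "red", lwd = 2)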
