Logistic function for regression kriging in R

I would like to perform regression kriging (RK) on binary presence-absence data, using a host grid as the (constant) predictor. I used a logistic regression to estimate the relationship between the binary outcome and the predictor, but I suspect it does not meet the RK assumptions: the predictor does not come out as significant in the model. Is there an alternative way to approach this?
Data for the code: https://drive.google.com/folderview?id=0B7-8DA0HVZqDYk1BcFFwSkZCcjQ&usp=sharing
library(sp)
presabs <- read.csv("Pres_Abs.csv", header = TRUE,
                    colClasses = c("integer", "numeric", "numeric", "integer"))
coordinates(presabs) <- c("Long", "Lat")      # creates a SpatialPointsDataFrame
host <- read.asciigrid("host.asc.txt")        # reads an ArcInfo ASCII raster map
host.ov <- overlay(host, presabs)             # create grid-points overlay
presabs$host.asc.txt <- log(host.ov$host.asc.txt)   # copy (log-transformed) host values
glm.presabs <- glm(Pres ~ host.asc.txt, family = binomial, data = presabs)
summary(glm.presabs)
Weighted Residuals:
    Min      1Q  Median      3Q     Max
-0.3786 -0.3762 -0.3708 -0.3497  3.3137
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.942428   0.320104  -6.068 1.38e-08 ***
host.asc.txt -0.001453   0.003034  -0.479    0.633
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.007 on 127 degrees of freedom
Multiple R-squared: 1.317e-05, Adjusted R-squared: -0.007861
F-statistic: 0.001673 on 1 and 127 DF, p-value: 0.9674
Then, when it comes to the actual kriging, I have built the code below from a tutorial, but it seems the residuals from the glm are never actually fed into the krige function. Can it be improved in gstat?
library(gstat)
# Set bin width and maximum distance for the variogram:
Bin <- 0.09
MaxDist <- 1
BinNo <- MaxDist/Bin
# Calculate and plot the sample variogram
surpts.var <- variogram(Pres ~ 1, presabs, cutoff = MaxDist, width = Bin)
plot(surpts.var)
# Initial parameter values for the variogram model
psill <- 0.05921
distance <- 63.7/111
nugget <- 0.06233   # constant
# Fit and plot the variogram model:
null.vgm <- vgm(psill, "Sph", distance, nugget)   # initial parameters
vgm_Pres_r <- fit.variogram(surpts.var, model = null.vgm,
                            fit.ranges = TRUE, fit.method = 1)
plot(surpts.var, vgm_Pres_r)
# Run RK using universal kriging:
presabs_uk <- krige(Pres ~ host.asc.txt, locations = presabs,
                    newdata = host, model = vgm_Pres_r)

krige mentions that it is
[using universal kriging]
This means that it fits a linear model, not a generalized linear model. It uses the variogram that you fitted to the raw data, not to the residuals. The residual variogram would have been obtained by
surpts.var <- variogram(Pres~host.asc.txt, presabs, cutoff=MaxDist, width = Bin)
but is nearly identical, since your variable and the grid map are nearly uncorrelated:
> cor(presabs$Pres,presabs$host.asc.txt)
[1] -0.04281038
so it is not a surprise that you don't recognize the grid map in the universal kriging predictions: the two are nearly (linearly) independent.
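If you want the GLM actually fed into the kriging, one option is to do the regression kriging by hand: krige the GLM residuals with an ordinary kriging model and add them back to the GLM trend predicted on the grid. A rough sketch only (assuming glm.presabs, presabs, host and the variogram settings above; note that residuals and trend are combined on the response scale here, so predictions can fall outside 0-1):
# residuals of the logistic trend model, on the response (probability) scale
presabs$glm.res <- residuals(glm.presabs, type = "response")
# variogram of the GLM residuals rather than of the raw 0/1 data
res.var <- variogram(glm.res ~ 1, presabs, cutoff = MaxDist, width = Bin)
res.vgm <- fit.variogram(res.var, model = vgm(psill, "Sph", distance, nugget))
# ordinary kriging of the residuals onto the host grid
res.krig <- krige(glm.res ~ 1, locations = presabs, newdata = host, model = res.vgm)
# GLM trend on the grid (apply the same transformation used when fitting, here the log)
host$trend <- predict(glm.presabs,
                      newdata = data.frame(host.asc.txt = log(host$host.asc.txt)),
                      type = "response")
# regression-kriging prediction = trend + kriged residuals
host$RK.pred <- host$trend + res.krig$var1.pred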

Related

How to fit a known linear equation to my data in R?

I used a linear model to obtain the best fit to my data, using the lm() function.
From the literature I know that the optimal fit would be a linear regression with slope = 1 and intercept = 0. I would like to see how well this equation (y = x) fits my data. How do I proceed in order to find an R^2 as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how well the equation y = x fits the data presented above (R^2 and p-value).
I would be very grateful if somebody could help me with this (basic) problem, as I found no answers to my question on Stack Overflow.
Best regards, Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -103.328  -39.130   -4.881   40.428  114.829
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461   13.11026   1.762    0.083 .
#> modelled     0.91143    0.07052  12.924   <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled) / var(measured)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled, in that the slope in the lm model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
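You can check that directly with confint() on the fitted model (a quick sketch, using the measured and modelled vectors above):
fit <- lm(measured ~ modelled)
confint(fit)
# the interval for `modelled` (roughly 0.77 to 1.05) contains 1,
# and the interval for `(Intercept)` (roughly -3 to 49) contains 0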
As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and intercept = 0, you essentially already have a fit: modelled is the predicted value. What you need is the R-squared of this fit. R-squared is defined as:
R2 = MSS/TSS = (TSS - RSS)/TSS
where RSS is the residual sum of squares and TSS is the total sum of squares.
We can only work with observations that are complete (non-NA), so we first identify them and then calculate each term:
nonNA <- !is.na(modelled) & !is.na(measured)
# residual sum of squares from your prediction (y = x)
RSS <- sum((modelled[nonNA] - measured[nonNA])^2)
# total sum of squares of the data
TSS <- sum((measured[nonNA] - mean(measured[nonNA]))^2)
1 - RSS/TSS
[1] 0.7116585
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, then if fm is the lm object for that undisclosed model then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled, but the formula differs depending on whether or not there is an intercept in the undisclosed model. The signs are that there is no intercept, since if there were an intercept sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
In any case R^2 and the p value are shown in the output of the summary(fm) where fm is the undisclosed linear model so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.
For example, if the undisclosed model is the following, using the built-in CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we get the output below, where the last two lines show the R squared and the p value.
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
     Min       1Q   Median       3Q      Max
-18.2145  -4.2549   0.5479   5.3048  12.9968
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      25.830052   1.579918  16.349  < 2e-16 ***
TypeMississippi -12.659524   1.544261  -8.198 3.06e-12 ***
conc              0.017731   0.002625   6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16

How to use lapply or sapply for GLM on multiple species separately?

I am trying to run a GLM on multiple different species in my data set. Currently I have been sub-setting my data for each species and copying this code, and it has turned into quite a mess. I know there has to be a better way to do this (maybe with the lapply function?), but I'm not sure how to begin with that.
I'm running the model on the CPUE (catch per unit effort) for a species and using Year, Salinity, Discharge, and Rainfall as my explanatory variables.
My data is here: https://drive.google.com/file/d/1_ylbMoqevvsuucwZn2VMA_KMNaykDItk/view?usp=sharing
This is the code that I have tried. It gets the job done, but I have just been copying it and changing the species each time. I'm hoping to find a way to simplify this process and clean up my code a bit.
fish_df$pinfishCPUE <- ifelse(fish_df$Commonname == "Pinfish", fish_df$CPUE, 0)
#create binomial column
fish_df$binom <- ifelse(fish_df$pinfishCPUE > 0, 1,0)
glm.full.bin = glm(binom~Year+Salinity+Discharge +Rainfall,data=fish_df,family=binomial)
glm.base.bin = glm(binom~Year,data=fish_df,family=binomial)
#step to simplify model and get appropriate order
glm.step.bin = step(glm.base.bin,scope=list(upper=glm.full.bin,lower=~Year),direction='forward',
trace=1,k=log(nrow(fish_df)))
#final model - may choose to reduce based on deviance and cutoff in above step
glm.final.bin = glm.step.bin
print(summary(glm.final.bin))
#calculate the LSMeans for the proportion of positive trips
lsm.b.glm = emmeans(glm.final.bin,"Year",data=fish_df)
LSMeansProp = summary(lsm.b.glm)
Output:
Call:
glm(formula = log.CPUE ~ Month + Salinity + Temperature, family = gaussian,
data = fish_B_pos)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.8927 -0.7852  0.1038  0.8974  3.5887
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.38530    0.72009   3.313  0.00098 ***
Month        0.10333    0.03433   3.010  0.00272 **
Salinity    -0.13530    0.01241 -10.900  < 2e-16 ***
Temperature  0.06901    0.01434   4.811  1.9e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 1.679401)
Null deviance: 1286.4 on 603 degrees of freedom
Residual deviance: 1007.6 on 600 degrees of freedom
AIC: 2033.2
Number of Fisher Scoring iterations: 2
I would suggest the following approach: create a function for the models and then use lapply over the list that results from applying split() to the data frame by the variable Commonname:
library(emmeans)
# Load data
fish_df <- read.csv('fish_df.csv', stringsAsFactors = FALSE)
# Split the data by species
List <- split(fish_df, fish_df$Commonname)
# Function that fits the models for one species' subset
mymodelfun <- function(x)
{
  # Create binomial column
  x$binom <- ifelse(x$pinfishCPUE > 0, 1, 0)
  glm.full.bin <- glm(binom ~ Year + Salinity + Discharge + Rainfall, data = x, family = binomial)
  glm.base.bin <- glm(binom ~ Year, data = x, family = binomial)
  # Step to simplify the model and get an appropriate order
  glm.step.bin <- step(glm.base.bin, scope = list(upper = glm.full.bin, lower = ~Year),
                       direction = 'forward', trace = 1, k = log(nrow(x)))
  # Final model - may choose to reduce based on deviance and cutoff in the step above
  glm.final.bin <- glm.step.bin
  print(summary(glm.final.bin))
  # Calculate the LSMeans for the proportion of positive trips
  lsm.b.glm <- emmeans(glm.final.bin, "Year", data = x)
  LSMeansProp <- summary(lsm.b.glm)
  return(LSMeansProp)
}
# Apply the function to each species
Lmods <- lapply(List, mymodelfun)
Lmods will contain the results of the models; here is an example:
Lmods$`Atlantic Stingray`
Output:
 Year emmean    SE  df asymp.LCL asymp.UCL
 2009  -22.6 48196 Inf    -94485     94440
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
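If you then want all the per-species LSMeans in one table, a small sketch (assuming the Lmods list created above):
props <- do.call(rbind, lapply(names(Lmods), function(sp) {
  cbind(Species = sp, as.data.frame(Lmods[[sp]]))   # stack each species' summary
}))
head(props)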

How to obtain Poisson's distribution "lambda" from R glm() coefficients

My R script produces the glm() coefficients below.
What is the Poisson lambda, then? It should be ~3.0, since that is what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-22.726  -12.726   -8.624    6.405   18.515
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  8.222532   0.015100  544.53   <2e-16 ***
h_mids      -0.363560   0.004393  -82.75   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done via this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
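Equivalently, since the default link for the Poisson family is the log, the intercept of this intercept-only fit is log(lambda):
exp(coef(fit))   # ~ 3.005, i.e. the estimated lambda, the same value as mean(x)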
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not any more.
At root, what you need to do is set up a negative log-likelihood function that sums (# counts) * log(prob(count | lambda)) over the bins and minimize it using optim(); the solution given below using the bbmle package is a little more complex up front, but gives you added benefits such as easily computed confidence intervals.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts = unname(c(tt)),
                 val = as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs left-closure).
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x, val, lambda, log = FALSE) {
  # x: counts per bin; val: the count value each bin represents
  probs <- dpois(val, lambda, log = TRUE)   # log-probability of each value
  r <- sum(x * probs)                       # log-likelihood of the binned data
  if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts ~ dpoisbin(val, exp(loglambda)),
           data = dd,
           start = list(loglambda = 0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
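If you would rather avoid the extra package, the same idea works with optim() directly; a minimal sketch, assuming the dd data frame built above:
nll <- function(loglambda) {
  # negative log-likelihood of the binned data
  -sum(dd$counts * dpois(dd$val, exp(loglambda), log = TRUE))
}
opt <- optim(par = 0, fn = nll, method = "BFGS")   # BFGS avoids the 1-D Nelder-Mead warning
exp(opt$par)   # estimated lambda, again close to mean(random_pois)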

Multiple correlation coefficient in R

I am looking for a way to calculate the multiple correlation coefficient in R (http://en.wikipedia.org/wiki/Multiple_correlation); is there a built-in function to calculate it?
I have one dependent variable and three independent ones.
I am not able to find it online; any ideas?
The easiest way to calculate what you seem to be asking for when you refer to 'the multiple correlation coefficient' (i.e. the correlation between two or more independent variables on the one hand, and one dependent variable on the other) is to fit a multiple linear regression predicting the dependent variable from the independent variables, and then calculate the coefficient of correlation between the predicted and observed values of the dependent variable.
Here, for example, we create a linear model called mpg.model, with mpg as the dependent variable and wt and cyl as the independent variables, using the built-in mtcars dataset:
> mpg.model <- lm(mpg ~ wt + cyl, data = mtcars)
Having created the above model, we correlate the observed values of mpg (which are embedded in the object, within the model data frame) with the predicted values for the same variable (also embedded):
> cor(mpg.model$model$mpg, mpg.model$fitted.values)
[1] 0.9111681
R will in fact do this calculation for you, but without telling you so, when you ask it to create the summary of a model (as in Brian's answer): the summary of an lm object contains R-squared, which is the square of the coefficient of correlation between the predicted and observed values of the dependent variable. So an alternative way to get the same result is to extract R-squared from the summary.lm object and take the square root of it, thus:
> sqrt(summary(mpg.model)$r.squared)
[1] 0.9111681
I feel that I should point out, however, that the term 'multiple correlation coefficient' is ambiguous.
The built-in function lm gives at least one version, not sure if this is what you are looking for:
fit <- lm(yield ~ N + P + K, data = npk)
summary(fit)
Gives:
Call:
lm(formula = yield ~ N + P + K, data = npk)
Residuals:
    Min      1Q  Median      3Q     Max
-9.2667 -3.6542  0.7083  3.4792  9.3333
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   54.650      2.205  24.784   <2e-16 ***
N1             5.617      2.205   2.547   0.0192 *
P1            -1.183      2.205  -0.537   0.5974
K1            -3.983      2.205  -1.806   0.0859 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.401 on 20 degrees of freedom
Multiple R-squared: 0.3342, Adjusted R-squared: 0.2343
F-statistic: 3.346 on 3 and 20 DF, p-value: 0.0397
More info on what's going on at ?summary.lm and ?lm.
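Another route (a sketch only, reusing the npk fit above) is to apply the definition from the Wikipedia article directly: R = sqrt(c' Rxx^-1 c), where c holds the correlations between the predictors and the response and Rxx is the correlation matrix of the predictors:
y   <- npk$yield
X   <- model.matrix(~ N + P + K, data = npk)[, -1]   # dummy-coded predictors, intercept dropped
cxy <- cor(X, y)                                      # predictor-response correlations
Rxx <- cor(X)                                         # predictor-predictor correlations
sqrt(drop(t(cxy) %*% solve(Rxx) %*% cxy))             # matches sqrt(0.3342) from the lm summary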
Try this:
# load sample data
data(mtcars)
# calculate the correlation coefficients between all variables in `mtcars`
# using the inbuilt function
M <- cor(mtcars)
# M is a matrix of pairwise correlation coefficients, which you can display
# just by running
print(M)
# If you want to plot the correlation coefficient
library(corrplot)
corrplot(M, method="number",type= "lower",insig = "blank", number.cex = 0.6)

How can I draw a Gaussian curve over data that follow a Gaussian distribution in R?

I have some data that look like they follow a Gaussian distribution, so I use
my.glm <- glm(b1 ~ a1, family = gaussian)
and then
summary(my.glm)
The results are:
Call:
glm(formula = b1 ~ a1, family = gaussian)
Deviance Residuals:
      Min        1Q    Median        3Q       Max
-0.067556 -0.029598  0.002121  0.030980  0.044499
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.433697   0.018629   23.28 1.36e-12 ***
a1          -0.027146   0.001927  -14.09 1.16e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.001262014)
Null deviance: 0.268224 on 15 degrees of freedom
Residual deviance: 0.017668 on 14 degrees of freedom
AIC: -57.531
Number of Fisher Scoring iterations: 2
I think the model fits well, but how can I draw a Gaussian curve over these data?
Assuming that the intercept has a normal distribution, you can plot its distribution like this:
x <- seq(0.3,0.6,by =0.001)
plot(x, dnorm(x, 0.433697, 0.018629), type = 'l')
and you might want to add your data:
rug(b1)
since you didn't supply data, we can make some up (with some transforms to match stats in the example):
set.seed(0)
b <- rnorm(15)
b1 <- ((b - mean(b))/sd(b) * 0.018629) + 0.433697
rug(b1)
you could also overlay a kernel density estimate of the data
lines(density(b1), col = 'red')
This produces a plot of the normal density for the intercept, with the data shown as a rug and the kernel density estimate overlaid in red.
Simple: ?dnorm
Use dnorm to create a Gaussian curve of the desired mean and s.d. without tying yourself to any numerically fitted function. This is a simple, and good, way to show how well your data 'fit' a theoretical curve. It is not the same thing as plotting the fitted data and trying to figure out "how close" to a Gaussian it is.
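A minimal sketch of that idea, assuming b1 holds the data from the question:
m <- mean(b1, na.rm = TRUE)
s <- sd(b1, na.rm = TRUE)
hist(b1, freq = FALSE)                                                 # density-scaled histogram
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "blue", lwd = 2)   # theoretical Gaussian curve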
