Compute a kernel ridge regression in R for model selection

I have a dataframe df
df<-structure(list(P = c(794.102395099402, 1299.01021921817, 1219.80731174175,
1403.00786976395, 742.749487463385, 340.246973543409, 90.3220586792255,
195.85557320714, 199.390867672674, 191.4970921278, 334.452413539092,
251.730350291822, 235.899165861309, 442.969718728163, 471.120193046119,
458.464154601097, 950.298132134912, 454.660729622624, 591.212003320456,
546.188716055825, 976.994105334083, 1021.67000560164, 945.965200876724,
932.324768081307, 3112.60002304117, 624.005047807736, 0, 937.509240627289,
892.926195849975, 598.564015734103, 907.984807726741, 363.400837339461,
817.629824627294, 2493.75851182081, 451.149000503123, 1028.41455932241,
615.640039284434, 688.915621065535, NaN, 988.21297, NaN, 394.7,
277.7, 277.7, 492.7, 823.6, 1539.1, 556.4, 556.4, 556.4), T = c(11.7087701201175,
8.38748953516909, 9.07065637842101, 9.96978059247473, 2.87026334756687,
-1.20497751697385, 1.69057148825093, 2.79168506923385, -1.03659741363293,
-2.44619473778322, -1.0414166493637, -0.0616510891024765, -2.19566614081763,
2.101408628412, 1.30197334094966, 1.38963309876057, 1.11283280896495,
0.570385633957982, 1.05118063842584, 0.816991857384802, 8.95069454902333,
6.41067954598958, 8.42110173395973, 13.6455092557636, 25.706509843239,
15.5098014530832, 6.60783204117648, 6.27004335176393, 10.0769600264915,
3.05237224011361, 7.52869186722913, 11.2970127691776, 6.60356510073103,
7.3210245298803, 8.4723724171517, 21.6988324356057, 7.34952593890056,
6.04325232771032, NaN, 25.990913731, NaN, 1.5416666667, 15.1416666667,
15.1416666667, 0.825, 4.3666666667, 7.225, -2.075, -2.075, -2.075
), A = c(76.6, 52.5, 3.5, 15, 71.5, 161.833333333333, 154, 72.5,
39, 40, 23, 14.5, 5.5, 78, 129, 73.5, 100, 10, 3, 29.5, 65, 44,
68.5, 56.5, 101, 52.1428571428571, 66.5, 1, 106, 36.6, 21.2,
10, 135, 46.5, 17.5, 35.5, 86, 70.5, 65, 97, 30.5, 96, 79, 11,
162, 350, 42, 200, 50, 250), Y = c(1135.40733061247, 2232.28817154825,
682.15711101488, 1205.97307573068, 1004.2559099408, 656.537378609781,
520.796355544007, 437.780508459633, 449.167726897157, 256.552344558528,
585.618137514404, 299.815636674633, 230.279491515383, 1051.74875971674,
801.07750760983, 572.337961145761, 666.132923644351, 373.524159859929,
128.198042456082, 528.555426408071, 1077.30188477292, 1529.43757814094,
1802.78658590423, 1289.80342084379, 3703.38329098125, 1834.54460388103,
1087.48954802548, 613.15010408836, 1750.11457900004, 704.123482171384,
1710.60321283154, 326.663507855032, 1468.32489464969, 1233.05517321796,
852.500007182098, 1246.5605930537, 1186.31346316832, 1460.48566379373,
2770, 3630, 3225, 831, 734, 387, 548.8, 1144, 1055, 911, 727,
777)), .Names = c("P", "T", "A", "Y"), row.names = c(NA, -50L
), class = "data.frame")
I want to do model selection using a kernel ridge regression. I have done it with a simple stepwise regression analysis (see below), but I would now like to do it with a kernel ridge regression.
library(caret)
Step <- train(Y ~ P + T + A, data = df,
              preProcess = c("center", "scale"),
              method = "lmStepAIC",
              trControl = trainControl(method = "repeatedcv", repeats = 10),
              na.action = na.omit)
Anyone know how I could compute a kernel ridge regression for model selection?

Using the CVST package that etienne linked, here is how you can train and predict with a Kernel Ridge Regression learner:
library(CVST)
## Assuming df is already in your environment
d = constructData(x=df[,1:3], y=df$Y) ## Structure data in CVST format
krr_learner = constructKRRLearner() ## Build the base learner
params = list(kernel='rbfdot', sigma=100, lambda=0.01) ## Function params; documentation defines lambda as '.1/getN(d)'
krr_trained = krr_learner$learn(d, params)
## Now to predict, format your test data, 'dTest', the same way as you did in 'd'
pred = krr_learner$predict(krr_trained, dTest)
What makes CVST slightly painful is the intermediate data-prep step that requires you to call the constructData function. This is an adapted example from page 7 of the documentation.
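Since your goal is model selection, note that CVST can also grid-search the kernel hyperparameters by cross-validation via constructParams() and CV(). A minimal sketch, assuming the grids below (arbitrary starting values, not recommendations):
params_grid = constructParams(kernel='rbfdot',
                              sigma=10^(-2:2),
                              lambda=c(0.01, 0.1, 1)/getN(d)) ## expands to the full grid of settings
opt = CV(d, krr_learner, params_grid, fold=5) ## k-fold CV over the grid; returns the best setting(s)
krr_trained = krr_learner$learn(d, opt[[1]]) ## refit with the selected parameters
The package also provides fastCV() for the sequential-testing procedure that gives CVST its name.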
It's worth mentioning that when I ran this code on your example, I got the following singularity warning:
Lapack routine dgesv: system is exactly singular: U[1,1] = 0
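A likely contributor (an assumption on my part, not something I've verified) is the NaN rows in P and T; dropping incomplete cases before the constructData step may avoid the singular system:
df_cc = df[complete.cases(df), ] ## drop rows containing NaN/NA
d = constructData(x=df_cc[,1:3], y=df_cc$Y) ## rebuild the CVST data without missing rows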

Related

R: mgcv - add colorbar to 2D heatmap of GAM

I'm fitting a gam with mgcv and plot the result with the default plot.gam() function. My model includes a 2D-smoother and I want to plot the result as a heatmap. Is there any way to add a colorbar for the heatmap?
I've previously looked into other GAM plotting packages, but none of them provided the necessary visualisation. Please note, this is just a simplification for illustration purposes; the actual model (and the reporting needs) are much more complicated.
Edited: I initially had y and z swapped in my tensor product; updated to reflect the correct version in both the code and the plot.
df.gam<-gam(y~te(x,z), data=df, method='REML')
plot(df.gam, scheme=2, hcolors=heat.colors(999, rev =T), rug=F)
sample data:
structure(list(x = c(3, 17, 37, 9, 4, 11, 20.5, 11.5, 16, 17,
18, 15, 13, 29.5, 13.5, 25, 15, 13, 20, 20.5, 17, 11, 11, 5,
16, 13, 3.5, 16, 16, 5, 20.5, 2, 20, 9, 23.5, 18, 3.5, 16, 23,
3, 37, 24, 5, 2, 9, 3, 8, 10.5, 37, 3, 9, 11, 10.5, 9, 5.5, 8,
22, 15.5, 18, 15, 3.5, 4.5, 20, 22, 4, 8, 18, 19, 26, 9, 5, 18,
10.5, 30, 15, 13, 27, 19, 5.5, 18, 11.5, 23.5, 2, 25, 30, 17,
18, 5, 16.5, 9, 2, 2, 23, 21, 15.5, 13, 3, 24, 17, 4.5), z = c(144,
59, 66, 99, 136, 46, 76, 87, 54, 59, 46, 96, 38, 101, 84, 64,
92, 56, 69, 76, 93, 109, 46, 124, 54, 98, 131, 89, 69, 124, 105,
120, 69, 99, 84, 75, 129, 69, 74, 112, 66, 78, 118, 120, 103,
116, 98, 57, 66, 116, 108, 95, 57, 41, 20, 89, 61, 61, 82, 52,
129, 119, 69, 61, 136, 98, 94, 70, 77, 108, 118, 94, 105, 52,
52, 38, 73, 59, 110, 97, 87, 84, 119, 64, 68, 93, 94, 9, 96,
103, 119, 119, 74, 52, 95, 56, 112, 78, 93, 119), y = c(96.535,
113.54, 108.17, 104.755, 94.36, 110.74, 112.83, 110.525, 103.645,
117.875, 105.035, 109.62, 105.24, 119.485, 107.52, 107.925, 107.875,
108.015, 115.455, 114.69, 116.715, 103.725, 110.395, 100.42,
108.79, 110.94, 99.13, 110.935, 112.94, 100.785, 110.035, 102.95,
108.42, 109.385, 119.09, 110.93, 99.885, 109.96, 116.575, 100.91,
114.615, 113.87, 103.08, 101.15, 98.68, 101.825, 105.36, 110.045,
118.575, 108.45, 99.21, 109.19, 107.175, 103.14, 94.855, 108.15,
109.345, 110.935, 112.395, 111.13, 95.185, 100.335, 112.105,
111.595, 100.365, 108.75, 116.695, 110.745, 112.455, 104.92,
102.13, 110.905, 107.365, 113.785, 105.595, 107.65, 114.325,
108.195, 96.72, 112.65, 103.81, 115.93, 101.41, 115.455, 108.58,
118.705, 116.465, 96.89, 108.655, 107.225, 101.79, 102.235, 112.08,
109.455, 111.945, 104.11, 94.775, 110.745, 112.44, 102.525)), row.names = c(NA,
-100L), class = "data.frame")
It would be easier (IMHO) to do this reliably within the ggplot2 ecosystem.
I'll show a canned approach using my {gratia} package, but also check out {mgcViz}. I'll also suggest a more generic solution using tools from {gratia} to extract information about your model's smooths, which you can then plot yourself using ggplot().
library('mgcv')
library('gratia')
library('ggplot2')
library('dplyr')
# load your snippet of data via df <- structure( .... )
# then fit your model (note you have y as response & in the tensor product
# I assume z is the response below and x and y are coordinates
m <- gam(z ~ te(x, y), data=df, method='REML')
# now visualise the model using {gratia}
draw(m)
This produces a heatmap of the estimated te(x,y) smooth (figure not shown).
{gratia}'s draw() methods can't plot everything yet, but where it doesn't work you should still be able to evaluate the data you need using tools in {gratia}, which you can then plot with ggplot() itself by hand.
To get values for your smooths, i.e. the data behind the plots that plot.gam() or draw() display, use gratia::smooth_estimates()
# dist controls what we do with covariate combinations too far
# from support of the data. 0.1 matches mgcv:::plot.gam behaviour
sm <- smooth_estimates(m, dist = 0.1)
yielding
r$> sm
# A tibble: 10,000 × 7
   smooth  type   by      est    se     x     y
   <chr>   <chr>  <chr> <dbl> <dbl> <dbl> <dbl>
 1 te(x,y) Tensor NA     35.3 11.5      2  94.4
 2 te(x,y) Tensor NA     35.5 11.0      2  94.6
 3 te(x,y) Tensor NA     35.7 10.6      2  94.9
 4 te(x,y) Tensor NA     35.9 10.3      2  95.1
 5 te(x,y) Tensor NA     36.2  9.87     2  95.4
 6 te(x,y) Tensor NA     36.4  9.49     2  95.6
 7 te(x,y) Tensor NA     36.6  9.13     2  95.9
 8 te(x,y) Tensor NA     36.8  8.78     2  96.1
 9 te(x,y) Tensor NA     37.0  8.45     2  96.4
10 te(x,y) Tensor NA     37.2  8.13     2  96.6
# … with 9,990 more rows
In the output, x and y are a grid of values over the range of both covariates (the number of points in the grid in each covariate is controlled by n such that the grid for a 2d tensor product smooth is of size n by n). est is the estimated value of the smooth at the values of the covariates and se its standard error. For models with multiple smooths, the smooth variable uses the internal label that {mgcv} gives each smooth - these are the labels used in the output you get from calling summary() on your GAM.
We can add a confidence interval if needed using add_confint().
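For example, a small sketch of the previous call with an interval added (the exact interval column names may vary with your {gratia} version):
sm <- smooth_estimates(m, dist = 0.1) %>%
    add_confint() # appends lower/upper interval columns alongside est and se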
Now you can plot your smooth(s) by hand using ggplot(). At this point you have two options
if draw() can handle the type of smooth you want to plot, you can use the draw() method for that object and then build upon it, or
plot everything by hand.
Option 1
# evaluate just the smooth you want to plot
smooth_estimates(m, smooth = "te(x,y)", dist = 0.1) %>%
    draw() +
    geom_point(data = df, alpha = 0.2) # add a point layer for original data
This pretty much gets you what draw() produced when given the model object itself, and you can add to it as if it were a ggplot object (which is not the case for the objects returned by gratia:::draw.gam(), which are wrapped by {patchwork} and need other ways to interact with the plots).
Option 2
Here you are in full control
sm <- smooth_estimates(m, smooth = "te(x,y)", dist = 0.1)
ggplot(sm, aes(x = x, y = y)) +
    geom_raster(aes(fill = est)) +
    geom_point(data = df, alpha = 0.2) + # add a point layer for original data
    scale_fill_viridis_c(option = "plasma")
which produces the raster-plus-points version (figure not shown).
A diverging palette is likely better for this, along the lines of the one gratia:::draw.smooth_estimates uses:
sm <- smooth_estimates(m, smooth = "te(x,y)", dist = 0.1)
ggplot(sm, aes(x = x, y = y)) +
    geom_raster(aes(fill = est)) +
    geom_contour(aes(z = est), colour = "black") +
    geom_point(data = df, alpha = 0.2) + # add a point layer for original data
    scale_fill_distiller(palette = "RdBu", type = "div") +
    expand_limits(fill = c(-1, 1) * abs(max(sm[["est"]], na.rm = TRUE))) # na.rm because dist trims far cells to NA
which produces the diverging-palette version with contours (figure not shown).
Finally, if {gratia} can't handle your model, I'd appreciate you filing a bug report here so that I can work on supporting as many model types as possible. But do try {mgcViz} as well for an alternative approach to visualising GAMs fitted using {mgcv}.
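For reference, a quick {mgcViz} sketch along the same lines (the layer functions assume a recent {mgcViz}):
library(mgcViz)
b <- getViz(m) # convert the fitted GAM for {mgcViz} plotting
plot(sm(b, 1)) + # select the first (and only) smooth, te(x,y)
    l_fitRaster() + # heatmap of the fitted surface
    l_fitContour() + # contour lines
    l_points() # overlay the observed covariate values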
A base plot solution would be to use fields::image.plot directly. Unfortunately, it requires data in a classic wide format, not the long format needed by ggplot.
We can facilitate plotting by grabbing the object returned by plot.gam() and then doing a little manipulation of it to get what we need for image.plot().
Following on from #Anke's answer, then: instead of plotting with plot.gam() and using image.plot() only to add the legend, we use plot.gam() to get the data we need, but do everything with image.plot().
library(fields)
plt <- plot(df.gam) # plot.gam() invisibly returns a list of n elements, one per plot
plt <- plt[[1]]
# extract the `$fit` variable - this is est from smooth_estimates
fit <- plt$fit
# reshape fit (which is a 1-column matrix) to have dimension 40x40 (the default grid size)
dim(fit) <- c(40, 40)
# plot with image.plot
image.plot(x = plt$x, y = plt$y, z = fit, col = heat.colors(999, rev = TRUE))
contour(x = plt$x, y = plt$y, z = fit, add = TRUE)
box()
This produces the image.plot() heatmap with contour lines and a colour bar (figure not shown).
You could also use the fields::plot.surface() function
l <- list(x = plt$x, y = plt$y, z = fit)
plot.surface(l, type = "C", col = heat.colors(999, rev = TRUE))
box()
This produces a similar contoured surface plot (figure not shown).
See ?fields::plot.surface for other arguments to modify the contour plot etc.
As shown, these all have the correct range on the colour bar. It would appear that in #Anke's version the colour-bar mapping is off in all of the plots, though mostly just a little bit, so it wasn't as noticeable.
Following Gavin Simpson's answer and this thread (How to add colorbar with perspective plot in R), I think I've come up with a solution that uses plot.gam() (though I really love that {gratia} takes it into a ggplot universe, and I will definitely look into that more).
library(fields)
library(gratia) # for smooth_estimates()
df.gam <- gam(y ~ te(x, z), data = df, method = 'REML')
sm <- as.data.frame(smooth_estimates(df.gam, dist = 0.1))
plot(df.gam, scheme = 2, hcolors = heat.colors(999, rev = TRUE), contour.col = 'black',
     rug = FALSE, main = '', cex.lab = 1.75, cex.axis = 1.75)
image.plot(legend.only = TRUE, zlim = range(sm$est, na.rm = TRUE), col = heat.colors(999, rev = TRUE),
           legend.shrink = 0.5, axis.args = list(at = c(-10, -5, 0, 5, 10, 15, 20)))
I hope I understood correctly that gratia::smooth_estimates() actually pulls out the partial effects.
For my model with multiple terms (and multiple tensor products), this seems to work nicely by indexing the sections of the respective terms in sm, as sketched below. Except for one, where the colorbar and the heatmap aren't quite matching up. I can't provide the actual underlying data, but I add that plot for illustration in case anyone has any idea; I'm using the same approach as outlined above. In the colorbar, dark red is at 15-20, but in the heatmap the isolines just above 0 already correspond to dark red (while 0 is dark yellow'ish in the colorbar).
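For completeness, the per-term indexing I mean looks like this (a sketch; subset sm by the {mgcv} smooth label so the legend range matches one term):
sm_term <- sm[sm$smooth == "te(x,z)", ] # rows belonging to a single tensor product
image.plot(legend.only = TRUE, zlim = range(sm_term$est, na.rm = TRUE),
           col = heat.colors(999, rev = TRUE), legend.shrink = 0.5)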

Unexpected result while using lowess to smooth a data.table column in R

I have a data.table test_dt in which I want to smooth the y column using the lowess function.
test_dt <- structure(list(x = c(28.75, 30, 31.25, 32.5, 33.75, 35, 36.25,
37.5, 38.75, 40, 41.25, 42.5, 43.75, 45, 46.25, 47.5, 48.75,
50, 52.5, 55, 57.5, 60, 62.5, 63.75, 65, 67.5, 70, 72.5, 75,
77.5, 80, 82.5, 85, 87.5, 90, 92.5, 95, 97.5, 100, 102.5, 103.75,
105, 106.25, 107.5, 108.75, 110, 111.25, 112.5, 113.75, 115,
116.25, 117.5, 118.75, 120, 121.25, 122.5, 125, 130, 135, 140,
145), y = c(116.78, 115.53, 114.28, 113.05, 111.78, 110.53, 109.28,
108.05, 106.78, 105.53, 104.28, 103.025, 101.775, 100.525, 99.28,
98.05, 96.8, 95.525, 93.1, 90.65, 88.225, 85.775, 83.35, 82.15,
80.9, 78.5, 76.075, 73.675, 71.25, 68.85, 66.5, 64.075, 61.725,
59.4, 57.075, 54.725, 52.475, 50.225, 48, 45.75, 44.65, 43.55,
42.475, 41.45, 40.35, 39.275, 38.25, 37.225, 36.175, 35.175,
34.175, 33.225, 32.275, 31.3, 30.35, 29.45, 27.625, 24.175, 21,
18.125, 15.55), z = c(116.778248424972, 115.531456655985, 114.284502467544,
113.034850770519, 111.784500981402, 110.533319511795, 109.284500954429,
108.034850457264, 106.784502297216, 105.531265565238, 104.278221015846,
103.026780249377, 101.775992395759, 100.528761292272, 99.2853168637851,
98.043586202838, 96.8021989104315, 95.5702032427799, 93.1041279347743,
90.6575956222915, 88.2179393348852, 85.783500434839, 83.3503011023971,
82.136280706039, 80.922846825298, 78.4965179152157, 76.0823895453039,
73.6686672097464, 71.264486719796, 68.8702598156142, 66.4865368523571,
64.1182523898466, 61.7552221811808, 59.4004347738795, 57.0823289450761,
54.7908645949795, 52.5071096685879, 50.2308279167219, 47.9940967492558,
45.7658417529877, 44.6514226583931, 43.5622751034012, 42.4876666190815,
41.4173110074806, 40.3555584369672, 39.3004471381618, 38.2552969838653,
37.2202353638959, 36.1963659189447, 35.1889616530209, 34.2004259883859,
33.2295174626826, 32.2669278456991, 31.3171387914754, 30.3742375589802,
29.4555719783757, 27.6243725086786, 23.9784367995753, 27.625,
27.625, 27.625)), row.names = c(NA, -61L), class = c("data.table",
"data.frame"))
As can be seen in the image below (not reproduced here), I am getting an unexpected result. The expected result is that the line (z column) in the graph should closely follow the points (y column).
Here is my code -
library(data.table)
library(ggplot2)
test_dt[, z := lowess(x = x, y = y, f = 0.1)$y]
ggplot(test_dt) + geom_point(aes(x, y)) + geom_line(aes(x, z))
Q1. Can someone suggest why lowess is not smoothing properly?
Q2. Since lowess is not working as expected, is there any other function in R that would be more efficient in smoothing the y column without producing a spike (as lowess did on the boundary points)?
You could use loess instead:
test_dt[, z := predict(loess(y ~ x, data = test_dt))]
ggplot(test_dt) + geom_point(aes(x, y)) + geom_line(aes(x, z))
Note though, that if all you want to do is plot the line, this is exactly the method that geom_smooth uses, so without even creating a z column, you could do:
ggplot(test_dt, aes(x, y)) + geom_point() + geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2021-11-07 by the reprex package (v2.0.0)
The problem got solved by setting the number of robustness iterations to zero. lowess acts like loess when iterations are kept at zero (loess, with its default family = "gaussian", applies no robustness iterations).
test_dt[, z := lowess(x = x, y = y, f = 0.1, iter=0)$y]
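As a quick sanity check on that equivalence, lowess(iter = 0) can be compared against a loess fit forced into a comparable configuration (a sketch; degree = 1, matching span, and exact evaluation are my assumptions):
z_lowess <- lowess(test_dt$x, test_dt$y, f = 0.1, iter = 0)$y
z_loess <- predict(loess(y ~ x, data = test_dt, span = 0.1,
                         degree = 1, surface = "direct"))
max(abs(z_lowess - z_loess)) # expected to be near zero if the two agree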

How do I plot 3 variables along the x-axis in R plot?

How do I label the x-axis as actual, knn, and pca and plot their respective values along the y-axis?
dat.pca.knn <- rbind(actual, knn, pca)
plot(c(1, ncol(dat)), range(dat.pca.knn), type = "n",
     main = "Profile plot of probeset 206054_at\nwith actual and imputed values of GSM146784_Normal",
     xlab = "Actual vs imputed values", ylab = "Expression Intensity",
     axes = F, col = "red")
axis(side = 1, at = 1:length(dat.pca.knn), labels = dimnames(dat.pca.knn)[[2]],
     cex.axis = 0.4, col = "1", las = 2, tick = T)
axis(side = 2)
for (i in 1:length(dat.pca.knn)) {
    dat.y <- as.numeric(dat.pca.knn[i, ])
    lines(c(1:ncol(dat.pca.knn)), dat.y, lwd = 2, type = "p", col = "1")
}
Data
dput(dat[1:2,])
structure(c(1942.1, 40.1, 2358.3, 58.2, 2465.2, 132.6, 2732.9,
64.3, 1952.2, 66.1, 2048.3, 69, 2109, 109.7, 3005.1, 59.4, 2568.1,
81.7, 2107.7, 100.8, 1940.2, 170.1, 2608.8, 186.7, 1837.2, 103.8,
1559.2, 86.8, 2111.6, 86, 2641, 152.7, 1972.7, 124.8, 1737.2,
115, 1636.1, 202.1, 2718.4, 257.3), .Dim = c(2L, 20L), .Dimnames = list(
c("1007_s_at", "1053_at"), c("GSM146778_Normal", "GSM146780_Normal",
"GSM146782_Normal", "GSM146784_Normal", "GSM146786_Normal",
"GSM146789_Normal", "GSM146790_Normal", "GSM146792_Normal",
"GSM146794_Normal", "GSM146796_Normal", "GSM146779_Tumor",
"GSM146781_Tumor", "GSM146783_Tumor", "GSM146785_Tumor",
"GSM146787_Tumor", "GSM146788_Tumor", "GSM146791_Tumor",
"GSM146793_Tumor", "GSM146795_Tumor", "GSM146797_Tumor")))
dat.pca.knn
> print(dat.pca.knn)
            [,1]
actual  8385.300
knn     7559.533
pca    10418.002
You probably need a barplot, if I understand you correctly.
# just to recreate your data
dat.pca.knn <- dplyr::tribble(
~actual, ~knn, ~pca,
8385.300, 7559.533, 10418.002
)
with(
dat.pca.knn,
barplot(height = c(actual, knn, pca),
names.arg = c("actual", "knn", "pca"))
)
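Alternatively, since dat.pca.knn was built with rbind() and is the 3 x 1 matrix shown above, base R can use it directly (a sketch using the same labels):
barplot(dat.pca.knn[, 1], # the single column of values
        names.arg = rownames(dat.pca.knn), # "actual", "knn", "pca"
        ylab = "Expression Intensity")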

Overlaying a histogram with normal distribution

I need to draw a histogram of data (weights) overlaid with a line showing the expected normal distribution.
I am totally new to R and statistics. I know I probably got something fundamentally wrong about frequencies, density, and dnorm, but I am stuck.
weights <- c(97.6,95,94.3 ,92.3 ,90.7 ,89.4 ,88.2 ,86.9 ,85.8 ,85.5 ,84.4 ,84.1 ,82.5 ,81.4 ,80.8 ,80 ,79.8 ,79.5 ,78.4 ,78.4 ,78.2 ,78.1 ,78 ,77.4 ,76.5 ,75.4 ,74.8 ,74.1 ,73.5 ,73.2 ,73 ,72.3 ,72.3 ,72.2 ,71.8 ,71.7 ,71.6 ,71.6 ,71.5 ,71.3 ,70.7 ,70.6 ,70.5 ,69.2 ,68.6 ,68.3 ,67.5 ,67 ,66.8 ,66.6 ,65.8 ,65.6 ,64.9 ,64.6 ,64.5 ,64.5 ,64.3 ,64.2 ,63.9 ,63.7 ,62.7 ,62.3 ,62.2 ,59.4 ,57.8 ,57.8 ,57.6 ,56.4 ,53.6 ,53.2 )
hist(weights)
m <- mean(weights)
sd <- sd(weights)
x <- seq(min(weights), max(weights), length.out = length(weights))
xn <- dnorm(x, mean = m, sd = sd) * length(weights) #what is the correct factor???
lines(x, xn)
I expected the line to follow the histogram approximately, but it is far too low.
What you need is to plot the histogram on the density scale rather than the frequency scale, and then add the density of the weights, i.e.
weights = c(97.6,95,94.3 ,92.3 ,90.7 ,89.4 ,88.2 ,86.9 ,85.8 ,85.5 ,84.4 ,84.1 ,82.5 ,81.4 ,80.8 ,80 ,79.8 ,79.5 ,78.4 ,78.4 ,78.2 ,78.1 ,78 ,77.4 ,76.5 ,75.4 ,74.8 ,74.1 ,73.5 ,73.2 ,73 ,72.3 ,72.3 ,72.2 ,71.8 ,71.7 ,71.6 ,71.6 ,71.5 ,71.3 ,70.7 ,70.6 ,70.5 ,69.2 ,68.6 ,68.3 ,67.5 ,67 ,66.8 ,66.6 ,65.8 ,65.6 ,64.9 ,64.6 ,64.5 ,64.5 ,64.3 ,64.2 ,63.9 ,63.7 ,62.7 ,62.3 ,62.2 ,59.4 ,57.8 ,57.8 ,57.6 ,56.4 ,53.6 ,53.2 )
hist(weights, prob = T)
lines(density(weights), col = "red")
Hope this helps.
The problem in your code is that hist plots frequencies while dnorm calculates densities.
You can make the histogram on the density scale instead, so that it and the curve share the same scale, just by adding freq = FALSE:
hist(weights, freq = F)
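Then the normal curve can be added on the same density scale with no scaling factor at all, for example:
hist(weights, freq = FALSE)
curve(dnorm(x, mean = mean(weights), sd = sd(weights)),
      add = TRUE, col = "blue") # x is curve()'s own plotting variable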
You're nearly there, you just have to factor in the histogram bin widths.
weights <- c(97.6, 95, 94.3, 92.3, 90.7, 89.4, 88.2, 86.9, 85.8,
85.5, 84.4, 84.1, 82.5, 81.4, 80.8, 80, 79.8, 79.5, 78.4, 78.4,
78.2, 78.1, 78, 77.4, 76.5, 75.4, 74.8, 74.1, 73.5, 73.2, 73,
72.3, 72.3, 72.2, 71.8, 71.7, 71.6, 71.6, 71.5, 71.3, 70.7,
70.6, 70.5, 69.2, 68.6, 68.3, 67.5, 67, 66.8, 66.6, 65.8, 65.6,
64.9, 64.6, 64.5, 64.5, 64.3, 64.2, 63.9, 63.7, 62.7, 62.3,
62.2, 59.4, 57.8, 57.8, 57.6, 56.4, 53.6, 53.2)
h <- hist(weights, freq = TRUE)
binwi <- diff(h$breaks)[1] # bin width (hist uses equal-width bins by default)
x <- seq(min(weights) - 10, max(weights) + 10, 0.01)
xn <- dnorm(x, mean = mean(weights), sd = sd(weights)) * length(weights) * binwi
lines(x, xn)

Error when running a gam with poisson family and offset

I'm trying to run a gam in R and I'm getting a strange error message.
Generally, I have a number of counts per volume of water sampled, and I want to correct for the volume sampled. I'm trying to generate a smooth function that fits the counts as a function of depth, accounting for differences in volume sampled.
test <- structure(list(depth = c(2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5,
37.5, 42.5, 47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5,
92.5, 97.5), count = c(53323, 665, 1090, 491, 540, 514, 612,
775, 601, 497, 295, 348, 357, 294, 292, 968, 455, 148, 155, 101
), vol = c(2119.92, 111.76, 156.64, 71.28, 77.44, 73.92, 62.48,
78.32, 74.8, 81.84, 53.68, 80.96, 80.08, 79.2, 79.2, 77.44, 77.44,
84.48, 73.04, 59.84)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), .Names = c("depth", "count", "vol"
))
gam(count ~ s(depth) + offset(vol), data = test, family = "poisson")
Error in if (pdev - old.pdev > div.thresh) { : missing value where TRUE/FALSE needed
Any idea why this is not working? If I get rid of the offset, or if I set family = "gaussian", the function runs as one would expect.
Edit: I find that
gam(count ~ s(depth) + offset(log(vol)), data = test, family = "poisson")
does run, and I think I saw somewhere that one should log-transform the offset variable in these models, so maybe this is actually working OK.
You definitely need to put vol on the log scale (for this model).
More generally, an offset enters the model on the scale of the link function. Hence, if your model used family = poisson(link = 'sqrt'), then you'd want to include the offset as offset(sqrt(vol)).
I suspect the error comes from some overflow or bad value in the likelihood/deviance, arising from the assumption that the vol values were already on the log scale while the initial model was being fitted.
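To make the exposure adjustment concrete, a small sketch (the prediction grid is made up for illustration): with offset(log(vol)) in the model, predicting at vol = 1 gives the expected count per unit volume.
library(mgcv)
m <- gam(count ~ s(depth) + offset(log(vol)), data = test, family = poisson)
# vol = 1 makes log(vol) = 0, so predictions are rates per unit volume
newd <- data.frame(depth = seq(2.5, 97.5, by = 5), vol = 1)
pred_rate <- predict(m, newdata = newd, type = "response")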
