Error when running a GAM with Poisson family and offset in R

I'm trying to run a GAM in R and I'm getting a strange error message.
I have counts per volume of water sampled, and I want to correct the counts for the volume. I'm trying to generate a smooth function that fits the counts as a function of depth, accounting for differences in the volume sampled.
test <- structure(list(depth = c(2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5,
37.5, 42.5, 47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5,
92.5, 97.5), count = c(53323, 665, 1090, 491, 540, 514, 612,
775, 601, 497, 295, 348, 357, 294, 292, 968, 455, 148, 155, 101
), vol = c(2119.92, 111.76, 156.64, 71.28, 77.44, 73.92, 62.48,
78.32, 74.8, 81.84, 53.68, 80.96, 80.08, 79.2, 79.2, 77.44, 77.44,
84.48, 73.04, 59.84)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), .Names = c("depth", "count", "vol"
))
gam(count ~ s(depth) + offset(vol), data = test, family = "poisson")
Error in if (pdev - old.pdev > div.thresh) { : missing value where TRUE/FALSE needed
Any idea why this is not working? If I get rid of the offset, or if I set family = "gaussian", the function runs as one would expect.
Edit: I find that
gam(count ~ s(depth) + offset(log(vol)), data = test, family = "poisson")
does run, and I think I saw something saying that one should log-transform the offset variable for these models, so maybe this is actually working OK.

You definitely need to put vol on the log scale (for this model).
More generally, an offset enters the model on the scale of the link function. Hence if your model used family = poisson(link = 'sqrt'), then you'd want to include the offset as offset(sqrt(vol)).
I suspect the error is coming from some overflow or bad value in the likelihood/deviance arising from assuming that the vol values were on the log scale whilst the initial model was fitting.
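For completeness, here is a minimal sketch of the corrected call (the method = "REML" argument is an addition of mine, not part of the original model; mgcv's default is "GCV.Cp"):
library(mgcv)
# The offset must be supplied on the link (log) scale for a log-link Poisson GAM
m <- gam(count ~ s(depth) + offset(log(vol)), data = test,
         family = poisson(link = "log"), method = "REML")
summary(m)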

Related

Simple curve fitting in R

I am trying to find a fit for my data, but so far I've had no luck. I've tried a logarithmic fit and several models from the drc package, but I am sure there must be a better one; I just don't know which type.
On a different note, I would be grateful for advice on how to go about curve hunting in general.
library(drc)
df<-structure(list(x = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55), y = c(0.1066, -0.6204, -0.2028, 0.2621, 0.4083,
0.4497, 0.6343, 0.7762, 0.8809, 1.0029, 0.8089, 0.7845, 0.8009,
0.9319, 0.9414, 0.9505, 0.9323, 1.0321, 0.9381, 0.8975, 1.0929,
1.0236, 0.9589, 1.0644, 1.0411, 1.0763, 0.9679, 1.003, 1.142,
1.1049, 1.2868, 1.1569, 1.1952, 1.0802, 1.2125, 1.3765, 1.263,
1.2507, 1.2125, 1.2207, 1.2836, 1.3352, 1.1311, 1.2321, 1.4277,
1.1645), w = c(898, 20566, 3011, 1364, 1520, 2376, 1923, 1934,
1366, 1010, 380, 421, 283, 262, 227, 173, 118, 113, 95, 69, 123,
70, 80, 82, 68, 83, 76, 94, 101, 97, 115, 79, 98, 84, 92, 121,
97, 102, 93, 92, 101, 74, 124, 64, 52, 63)), row.names = c(NA,
-46L), class = c("tbl_df", "tbl", "data.frame"), na.action = structure(c(`47` = 47L), class = "omit"))
fit <- drm(y ~ x, data = df, fct = LL.4(), weights = w)
plot(fit)
1) If we ignore the weights, then y = a + b * x + c / x^2 seems to fit and, being linear in the coefficients, is easy to fit. The data seem upward sloping, so we started with a line, but then we needed to dampen that, so we added a reciprocal term. A reciprocal quadratic worked slightly better than a plain reciprocal based on the residual sum of squares, so we switched to that.
fm <- lm(y ~ x + I(1 / x^2), df)
coef(summary(fm))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.053856e+00 0.116960752 9.010341 1.849238e-11
## x 4.863077e-03 0.002718613 1.788808 8.069195e-02
## I(1/x^2) -1.460443e+02 16.518887452 -8.841049 3.160306e-11
The coefficient of the x term is not significant at the 5% level -- the p value is 8% in the table above -- so we can remove it and the model will fit nearly as well, giving a model with only two parameters. In the plot below, the fm fit with 3 parameters is solid and the fm2 fit with 2 parameters is dashed.
fm2 <- lm(y ~ I(1 / x^2), df)
plot(y ~ x, df)
lines(fitted(fm) ~ x, df)
lines(fitted(fm2) ~ x, df, lty = 2)
2) Another approach is to use two straight lines. This is still continuous but has one non-differentiable point at the point of transition. The model has 4 parameters, the intercepts and slopes of each line. Below we do use the weights. It has the advantage of an obvious motivation based on the appearance of the data. The break point at the intersection of the two lines may have significance as the transition point between the higher sloping initial growth and the lower sloping subsequent growth.
# starting values use lines fitted to 1st ten and last 10 points
fm_1 <- lm(y ~ x, df, subset = 1:10)
fm_2 <- lm(y ~ x, df, subset = seq(to = nrow(df), length = 10))
st <- list(a = coef(fm_1)[[1]], b = coef(fm_1)[[2]],
           c = coef(fm_2)[[1]], d = coef(fm_2)[[2]])
fm3 <- nls(y ~ pmin(a + b * x, c + d * x), df, start = st, weights = w)
# point of transition
X <- with(as.list(coef(fm3)), (a - c) / (d - b)); X
## [1] 16.38465
Y <- with(as.list(coef(fm3)), a + b * X); Y
## [1] 0.8262229
plot(y ~ x, df)
lines(fitted(fm3) ~ x, df)
The basic idea is to understand how the selected function performs. Take a function you know (e.g. the logistic) and modify it. Or (even better) go to the literature and see which functions people use in your specific domain. Then create a user-defined model, play with it to understand the parameters, define good start values and then fit it.
Here's a quick-and-dirty example of a user-defined function (with the growthrates package). Something similar can surely be done with drc.
library("growthrates")
grow_userdefined <- function(time, parms) {
  with(as.list(parms), {
    # shifted logistic: standard logistic growth plus a vertical offset
    y <- (K * y0) / (y0 + (K - y0) * exp(-mumax * time)) + shift
    return(as.matrix(data.frame(time = time, y = y)))
  })
}
fit <- fit_growthmodel(FUN = grow_userdefined,
                       p = c(y0 = -1, K = 1, mumax = 0.1, shift = 1),
                       time = df$x, y = df$y)
plot(fit)
summary(fit)
It can of course be made better. As there is no exponential phase at the onset, one can for example start with a simple saturation function instead of a logistic, e.g. something Monod-like; a sketch follows below. As said, the preferred way is to use a function related to the application domain.
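For illustration, here is a sketch of that Monod-like alternative in the same user-defined-model style; the exact parameterisation and the starting values below are guesses of mine and may need tuning:
grow_monod_shift <- function(time, parms) {
  with(as.list(parms), {
    # Monod-like saturation plus a vertical offset
    y <- K * time / (Ks + time) + shift
    as.matrix(data.frame(time = time, y = y))
  })
}
fit2 <- fit_growthmodel(FUN = grow_monod_shift,
                        p = c(K = 1.5, Ks = 10, shift = 0),
                        time = df$x, y = df$y)
plot(fit2)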

Unexpected result while using lowess to smooth a data.table column in R

I have a data.table test_dt in which I want to smooth the y column using lowess function.
test_dt <- structure(list(x = c(28.75, 30, 31.25, 32.5, 33.75, 35, 36.25,
37.5, 38.75, 40, 41.25, 42.5, 43.75, 45, 46.25, 47.5, 48.75,
50, 52.5, 55, 57.5, 60, 62.5, 63.75, 65, 67.5, 70, 72.5, 75,
77.5, 80, 82.5, 85, 87.5, 90, 92.5, 95, 97.5, 100, 102.5, 103.75,
105, 106.25, 107.5, 108.75, 110, 111.25, 112.5, 113.75, 115,
116.25, 117.5, 118.75, 120, 121.25, 122.5, 125, 130, 135, 140,
145), y = c(116.78, 115.53, 114.28, 113.05, 111.78, 110.53, 109.28,
108.05, 106.78, 105.53, 104.28, 103.025, 101.775, 100.525, 99.28,
98.05, 96.8, 95.525, 93.1, 90.65, 88.225, 85.775, 83.35, 82.15,
80.9, 78.5, 76.075, 73.675, 71.25, 68.85, 66.5, 64.075, 61.725,
59.4, 57.075, 54.725, 52.475, 50.225, 48, 45.75, 44.65, 43.55,
42.475, 41.45, 40.35, 39.275, 38.25, 37.225, 36.175, 35.175,
34.175, 33.225, 32.275, 31.3, 30.35, 29.45, 27.625, 24.175, 21,
18.125, 15.55), z = c(116.778248424972, 115.531456655985, 114.284502467544,
113.034850770519, 111.784500981402, 110.533319511795, 109.284500954429,
108.034850457264, 106.784502297216, 105.531265565238, 104.278221015846,
103.026780249377, 101.775992395759, 100.528761292272, 99.2853168637851,
98.043586202838, 96.8021989104315, 95.5702032427799, 93.1041279347743,
90.6575956222915, 88.2179393348852, 85.783500434839, 83.3503011023971,
82.136280706039, 80.922846825298, 78.4965179152157, 76.0823895453039,
73.6686672097464, 71.264486719796, 68.8702598156142, 66.4865368523571,
64.1182523898466, 61.7552221811808, 59.4004347738795, 57.0823289450761,
54.7908645949795, 52.5071096685879, 50.2308279167219, 47.9940967492558,
45.7658417529877, 44.6514226583931, 43.5622751034012, 42.4876666190815,
41.4173110074806, 40.3555584369672, 39.3004471381618, 38.2552969838653,
37.2202353638959, 36.1963659189447, 35.1889616530209, 34.2004259883859,
33.2295174626826, 32.2669278456991, 31.3171387914754, 30.3742375589802,
29.4555719783757, 27.6243725086786, 23.9784367995753, 27.625,
27.625, 27.625)), row.names = c(NA, -61L), class = c("data.table",
"data.frame"))
As can be seen in the image below, I am getting an unexpected result. The expected result is that the line (z column) in the graph below should closely follow the points (y column).
Here is my code:
library(data.table)
library(ggplot2)
test_dt[, z := lowess(x = x, y = y, f = 0.1)$y]
ggplot(test_dt) + geom_point(aes(x, y)) + geom_line(aes(x, z))
Q1. Can someone suggest why lowess is not smoothing properly?
Q2. Since lowess is not working as expected, is there any other function in R that would do a better job of smoothing the y column without producing a spike (as lowess did at the boundary points)?
You could use loess instead:
test_dt[, z := predict(loess(y ~ x, data = test_dt))]
ggplot(test_dt) + geom_point(aes(x, y)) + geom_line(aes(x, z))
Note, though, that if all you want to do is plot the line, this is exactly the method that geom_smooth uses, so without even creating a z column you could do:
ggplot(test_dt, aes(x, y)) + geom_point() + geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2021-11-07 by the reprex package (v2.0.0)
The problem was solved by setting the number of robustness iterations to zero: with iter = 0, lowess behaves like loess, which by default performs no robustness iterations.
test_dt[, z := lowess(x = x, y = y, f = 0.1, iter = 0)$y]
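As a rough sanity check (an addition of mine, not from the original answer), you can compare the iter = 0 lowess fit against a degree-1 loess with the same span; the two use different implementations, so expect the values to be close but not identical:
z_lowess <- lowess(test_dt$x, test_dt$y, f = 0.1, iter = 0)$y
z_loess <- predict(loess(y ~ x, data = test_dt, span = 0.1, degree = 1))
summary(z_lowess - z_loess)  # differences should be small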

R Shiny: Aesthetics must be either length 1 or the same as the data (8): y

I am trying to plot multiple line charts in R Shiny and include checkboxes, but I get an error message when I try to select more than one:
Error: Aesthetics must be either length 1 or the same as the data (8): y
I was able to plot them without error when using radio buttons, but checkboxes don't work. Here's my code:
library(readxl)
scores <- read.csv("newscores.csv")
weeks <- read.csv("weeks.csv")
library(ggplot2)
# ui.R
fluidPage(
  # Give the page a title
  titlePanel("Fantasy Scores"),
  # Generate a row with a sidebar
  sidebarLayout(
    # Define the sidebar with one input
    sidebarPanel(
      checkboxGroupInput("Team", label = h3("Select Team"),
                         choices = c("Jeff", "Jordan", "Emmerts",
                                     "CJ", "Jimmy", "Phil",
                                     "Mat", "Clegg", "Rob", "Shawn",
                                     "Seth", "Truscott"))
    ),
    # Create a spot for the plot
    mainPanel(
      plotOutput("phonePlot")
    )
  )
)
# server.R
function(input, output) {
  # Fill in the spot we created for a plot
  output$phonePlot <- renderPlot({
    # Render a line chart of the selected team's weekly scores
    ggplot(weeks, aes(x = Week, y = weeks[, input$Team])) +
      geom_point() + geom_line() + ylim(59, 160) +
      labs(title = "Scores by Week", x = "Week", y = "Points") +
      theme_minimal()
  })
}
And I pasted the dataset (weeks.csv) here (it's quite small):
structure(list(Week = c(1, 2, 3, 4, 5, 6, 7, 8), Jeff = c(150.4,
91.5, 85.9, 122.9, 84.8, 116.6, 107.7, 101.3), Jordan = c(111,
89.7, 127.6, 105.1, 115.1, 84, 108.9, 65.6), Emmerts = c(128.9,
108.6, 92.7, 96.4, 78, 73.8, 131, 120.4), CJ = c(72.8, 104.1,
84, 92.9, 62.8, 59.4, 82.2, 70.7), Jimmy = c(76.7, 68.4, 105.7,
111.3, 97.3, 108.5, 102.6, 109.7), Phil = c(96.2, 83.9, 91.5,
98.8, 81.7, 71.7, 78.3, 76.6), Mat = c(93.6, 106.2, 92.8, 87.5,
131, 152.7, 142.6, 105.1), Clegg = c(117.9, 93.7, 98, 109.8,
95.5, 104.2, 93.3, 103.1), Rob = c(80, 72.1, 74, 84.8, 111.7,
105, 89.4, 77.6), Shawn = c(84.4, 116.6, 80.9, 106.1, 106.2,
85.5, 88.4, 107.2), Seth = c(131.9, 98.6, 87.2, 111.7, 109, 125.7,
96.3, 108.9), Truscott = c(100.5, 68.5, 88.8, 96.3, 91.5, 97.6,
70.4, 111.3)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8"))
Thanks for the help!
This is a solution:
library(reshape2)
library(dplyr)
library(ggplot2)
# For testing outside Shiny, simulate the input object:
#input <- NULL
#input$Team <- c("Seth", "Truscott")
weeks_melt <- melt(weeks, "Week")
p <- ggplot(weeks_melt, aes(x = Week, y = value, color = variable)) + geom_blank()
weeks_filtered <- weeks_melt %>% filter(variable %in% input$Team)
p + geom_line(data = weeks_filtered)
Note that we create a geom_blank() so ggplot creates the scales right away (so the colors are fixed even when you change what is selected), then we filter the data before adding the lines. If we don't fix the scales, all the colors and axis scales will vary each time you include or remove one of the factor levels, and this is not a good idea in most cases.
Shiny doesn't require you to build your plots in a single line, so you're free to filter your data before doing anything. Another point is that the first two non-commented lines could be outside your server function, to save a bit of time when transitioning between visualizations.
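For reference, here is a minimal sketch of how those pieces fit into the server, with the melt and the geom_blank() plot built once outside the render call, as suggested above:
library(reshape2)
library(dplyr)
library(ggplot2)
weeks_melt <- melt(weeks, "Week")
p <- ggplot(weeks_melt, aes(x = Week, y = value, color = variable)) + geom_blank()
# server.R
function(input, output) {
  output$phonePlot <- renderPlot({
    weeks_filtered <- weeks_melt %>% filter(variable %in% input$Team)
    p + geom_line(data = weeks_filtered)
  })
}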

How to smooth data of increasing noise

Chemist here (so not very good with statistical analysis) and novice in R:
I have various sets of data where the yield of a reaction is monitored over time, such as:
The data:
df <- structure(list(time = c(15, 30, 45, 60, 75, 90, 105, 120, 135,
150, 165, 180, 195, 210, 225, 240, 255, 270, 285, 300, 315, 330,
345, 360, 375, 390, 405, 420, 435, 450, 465, 480, 495, 510, 525,
540, 555, 570, 585, 600, 615, 630, 645, 660, 675, 690, 705, 720,
735, 750, 765, 780, 795, 810, 825, 840, 855, 870, 885, 900, 915,
930, 945, 960, 975, 990, 1005, 1020, 1035, 1050, 1065, 1080,
1095, 1110, 1125, 1140, 1155, 1170, 1185, 1200, 1215, 1230, 1245,
1260, 1275, 1290, 1305, 1320, 1335, 1350, 1365, 1380, 1395, 1410,
1425, 1440, 1455, 1470, 1485, 1500, 1515, 1530, 1545, 1560, 1575,
1590, 1605, 1620, 1635, 1650, 1665, 1680, 1695, 1710, 1725, 1740,
1755, 1770, 1785, 1800, 1815, 1830, 1845, 1860, 1875, 1890, 1905,
1920, 1935, 1950, 1965, 1980, 1995, 2010, 2025, 2040, 2055, 2070,
2085, 2100, 2115, 2130), yield = c(9.3411, 9.32582, 10.5475,
13.5358, 17.3376, 16.7444, 20.7234, 19.8374, 24.327, 27.4162,
27.38, 31.3926, 29.3289, 32.2556, 33.0025, 35.3358, 35.8986,
40.1859, 40.3886, 42.2828, 41.23, 43.8108, 43.9391, 43.9543,
48.0524, 47.8295, 48.674, 48.2456, 50.2641, 50.7147, 49.6828,
52.8877, 51.7906, 57.2553, 53.6175, 57.0186, 57.6598, 56.4049,
57.1446, 58.5464, 60.7213, 61.0584, 57.7481, 59.9151, 64.475,
61.2322, 63.5167, 64.6289, 64.4245, 62.0048, 65.5821, 65.8275,
65.7584, 68.0523, 65.4874, 68.401, 68.1503, 67.8713, 69.5478,
69.9774, 73.4199, 66.7266, 70.4732, 67.5119, 69.6107, 70.4911,
72.7592, 69.3821, 72.049, 70.2548, 71.6336, 70.6215, 70.8611,
72.0337, 72.2842, 76.0792, 75.2526, 72.7016, 73.6547, 75.6202,
76.5013, 74.2459, 76.033, 78.4803, 76.3058, 73.837, 74.795, 76.2126,
75.1816, 75.3594, 79.9158, 77.8157, 77.8152, 75.3712, 78.3249,
79.1198, 77.6184, 78.1244, 78.1741, 77.9305, 79.7576, 78.0261,
79.8136, 75.5314, 80.2177, 79.786, 81.078, 78.4183, 80.8013,
79.3855, 81.5268, 78.416, 78.9021, 79.9394, 80.8221, 81.241,
80.6111, 79.7504, 81.6001, 80.7021, 81.1008, 82.843, 82.2716,
83.024, 81.0381, 80.0248, 85.1418, 83.1229, 83.3334, 83.2149,
84.836, 79.5156, 81.909, 81.1477, 85.1715, 83.7502, 83.8336,
83.7595, 86.0062, 84.9572, 86.6709, 84.4124)), .Names = c("time",
"yield"), row.names = c(NA, -142L), class = "data.frame")
What i want to do to the data:
I need to smooth the data in order to plot the 1st derivative. In the paper, the author mentioned that one can fit a high-order polynomial and use that to do the processing, which I think is wrong since we don't really know the true relationship between time and yield for these data, and it is definitely not polynomial. I tried regardless, and as expected the plot of the derivative did not make any chemical sense. Next I looked into loess using loes <- loess(yield ~ time, data = df, span = 0.9), which gave a much better fit. However, the best results so far were obtained using:
spl <- smooth.spline(df$time, y = df$yield, cv = TRUE)
predspl <- as.data.frame(predict(spl))
colnames(predspl) <- c('time', 'yield')
pred.der <- as.data.frame(predict(spl, deriv = 1))
colnames(pred.der) <- c('time', 'yield')
which gave the best fit, especially for the initial data points (by visual inspection).
The problem i have:
The issue, however, is that the derivative looks really good only up to t = 500 s and then starts wiggling more and more towards the end. This shouldn't happen from a chemistry point of view; it is just a result of overfitting towards the end of the data due to the increase in noise. I know this because for some experiments that I performed 3 times and averaged (so the noise decreased), the wiggling in the plot of the derivative is much smaller.
What i have tried so far:
I tried different values of spar, which, although they smooth the later data correctly, cause a poor fit to the initial data (which are the most important). I also tried reducing the number of knots, but got a result similar to the one from changing the spar value. What I think I need is a larger number of knots at the beginning, smoothly decreasing to a small number of knots towards the end, to avoid that overfitting.
The question:
Is my reasoning correct here? Does anyone know how I can achieve the above effect in order to get a smooth derivative without any wiggling? Do I need to try a different fit other than the spline, maybe? I have attached a picture at the end where you can see the derivative from smooth.spline vs time, and a black line (drawn by hand) of what it should look like. Thank you for your help in advance.
I think you're on the right track on having more closely spaced knots for the spline at the start of the curve. You can specify knot locations for smooth.spline using all.knots (at least on R >= 3.4.3; I skimmed the release notes for R, but couldn't pinpoint the version where this became available).
Below is an example, and the resulting, smoother fit for the derivative after some manual work of trying out different knot positions:
with(df, {
  kn <- c(0, c(50, 100, 200, 350, 500, 1500) / max(time), 1)  # knot locations on the [0, 1] scale
  s <- smooth.spline(time, yield, cv = TRUE)        # CV-chosen smoothness
  s2 <- smooth.spline(time, yield, all.knots = kn)  # hand-placed knots
  ds <- predict(s, deriv = 1)
  ds2 <- predict(s2, deriv = 1)
  np <- list(mfrow = c(2, 1), mar = c(4, 4, 1, 2))
  withr::with_par(np, {
    plot(time, yield)
    lines(s)
    lines(s2, lty = 2, col = 'red')
    plot(ds, type = 'l', ylim = c(0, 0.15))
    lines(ds2, lty = 2, col = 'red')
  })
})
You can probably fine-tune the locations further, but I wouldn't be too concerned about it. The primary fits are already near enough indistinguishable, and I'd say you're asking quite a lot from these data in terms of identifying details about the derivative (this should be evident if you plot(time[-1], diff(yield) / diff(time)), which gives you an impression of how much information your data carry about the derivative; a runnable version follows below).
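Here is that finite-difference check as a runnable line (the ylim value is carried over from the derivative plot above):
with(df, plot(time[-1], diff(yield) / diff(time), ylim = c(0, 0.15)))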
Created on 2018-02-15 by the reprex package (v0.2.0).

Compute a kernel ridge regression in R for model selection

I have a data frame df:
df<-structure(list(P = c(794.102395099402, 1299.01021921817, 1219.80731174175,
1403.00786976395, 742.749487463385, 340.246973543409, 90.3220586792255,
195.85557320714, 199.390867672674, 191.4970921278, 334.452413539092,
251.730350291822, 235.899165861309, 442.969718728163, 471.120193046119,
458.464154601097, 950.298132134912, 454.660729622624, 591.212003320456,
546.188716055825, 976.994105334083, 1021.67000560164, 945.965200876724,
932.324768081307, 3112.60002304117, 624.005047807736, 0, 937.509240627289,
892.926195849975, 598.564015734103, 907.984807726741, 363.400837339461,
817.629824627294, 2493.75851182081, 451.149000503123, 1028.41455932241,
615.640039284434, 688.915621065535, NaN, 988.21297, NaN, 394.7,
277.7, 277.7, 492.7, 823.6, 1539.1, 556.4, 556.4, 556.4), T = c(11.7087701201175,
8.38748953516909, 9.07065637842101, 9.96978059247473, 2.87026334756687,
-1.20497751697385, 1.69057148825093, 2.79168506923385, -1.03659741363293,
-2.44619473778322, -1.0414166493637, -0.0616510891024765, -2.19566614081763,
2.101408628412, 1.30197334094966, 1.38963309876057, 1.11283280896495,
0.570385633957982, 1.05118063842584, 0.816991857384802, 8.95069454902333,
6.41067954598958, 8.42110173395973, 13.6455092557636, 25.706509843239,
15.5098014530832, 6.60783204117648, 6.27004335176393, 10.0769600264915,
3.05237224011361, 7.52869186722913, 11.2970127691776, 6.60356510073103,
7.3210245298803, 8.4723724171517, 21.6988324356057, 7.34952593890056,
6.04325232771032, NaN, 25.990913731, NaN, 1.5416666667, 15.1416666667,
15.1416666667, 0.825, 4.3666666667, 7.225, -2.075, -2.075, -2.075
), A = c(76.6, 52.5, 3.5, 15, 71.5, 161.833333333333, 154, 72.5,
39, 40, 23, 14.5, 5.5, 78, 129, 73.5, 100, 10, 3, 29.5, 65, 44,
68.5, 56.5, 101, 52.1428571428571, 66.5, 1, 106, 36.6, 21.2,
10, 135, 46.5, 17.5, 35.5, 86, 70.5, 65, 97, 30.5, 96, 79, 11,
162, 350, 42, 200, 50, 250), Y = c(1135.40733061247, 2232.28817154825,
682.15711101488, 1205.97307573068, 1004.2559099408, 656.537378609781,
520.796355544007, 437.780508459633, 449.167726897157, 256.552344558528,
585.618137514404, 299.815636674633, 230.279491515383, 1051.74875971674,
801.07750760983, 572.337961145761, 666.132923644351, 373.524159859929,
128.198042456082, 528.555426408071, 1077.30188477292, 1529.43757814094,
1802.78658590423, 1289.80342084379, 3703.38329098125, 1834.54460388103,
1087.48954802548, 613.15010408836, 1750.11457900004, 704.123482171384,
1710.60321283154, 326.663507855032, 1468.32489464969, 1233.05517321796,
852.500007182098, 1246.5605930537, 1186.31346316832, 1460.48566379373,
2770, 3630, 3225, 831, 734, 387, 548.8, 1144, 1055, 911, 727,
777)), .Names = c("P", "T", "A", "Y"), row.names = c(NA, -50L
), class = "data.frame")
I want to do model selection using kernel ridge regression. I have done it with a simple stepwise regression analysis (see below), but I would like to do it with kernel ridge regression now.
library(caret)
Step <- train(Y ~ P + T + A, data = df,
              preProcess = c("center", "scale"),
              method = "lmStepAIC",
              trControl = trainControl(method = "cv", repeats = 10),
              na.rm = TRUE)
Anyone know how I could compute a kernel ridge regression for model selection?
Using the CVST package that etienne linked, here is how you can train and predict with a Kernel Ridge Regression learner:
library(CVST)
## Assuming df is already in your environment
d = constructData(x=df[,1:3], y=df$Y) ## Structure data in CVST format
krr_learner = constructKRRLearner() ## Build the base learner
params = list(kernel='rbfdot', sigma=100, lambda=0.01) ## Function params; documentation defines lambda as '.1/getN(d)'
krr_trained = krr_learner$learn(d, params)
## Now to predict, format your test data, 'dTest', the same way as you did in 'd'
pred = krr_learner$predict(krr_trained, dTest)
What makes CVST slightly painful is the intermediary data prep step that requires you to call the constructData function. This is an adapted example from page 7 of the documentation.
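Since the question is about model selection, it's also worth noting that CVST ships its own cross-validation helper; here is a sketch of a grid search over the kernel parameters (the grid values below are arbitrary choices of mine, not from the documentation):
params_grid = constructParams(kernel = 'rbfdot',
                              sigma = 10^(-2:2),
                              lambda = 10^(-4:0))
opt_params = CV(d, krr_learner, params_grid, fold = 5)  ## returns the best parameter set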
It's worth mentioning that when I ran this code on your example, I got the following singularity warning:
Lapack routine dgesv: system is exactly singular: U[1,1] = 0
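One plausible mitigation (a suggestion of mine; I haven't verified that it fixes this particular warning) is to drop the rows containing NaN and scale the predictors before constructing the CVST data object:
df_clean = na.omit(df)  ## remove rows with NaN in P or T
d = constructData(x = scale(df_clean[, 1:3]), y = df_clean$Y)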
