Rpart Plot in R [duplicate] - r

This question already has an answer here:
not creating tree by rpart in R
(1 answer)
Closed 7 months ago.
I am doing some regression analysis on the small data I have based on the admission number where I want to see the effect of other variables on it. Regression works fine and I do get a good output but how can I build a regression Tree. Can anyone please help me! It is only giving me 1 node, not the complete tree.
data:
structure(list(YEAR = c(2012, 2013, 2014, 2015, 2016, 2017, 2018,
2019, 2020, 2021), RANK_W = c(197, 115, 98, 101, 88, 94, 103,
128, 127, 121), RANK_Y = c(19, 6, 6, 6, 4, 6, 5, 10, 6, 6), GRADS = c(10276,
10156, 10144, 10163, 10080, 9958, 9636, 9102, 8833, 8234), CPINL = c(96.04,
98.44, 99.4, 100, 100.32, 101.7, 103.44, 106.16, 107.51, 110.39
), RENT = c(576, 576, 576, 621, 621, 621, 629, 629, 629, 662),
ACCOM = c(33902, 35449, 35838, 35719, 35747, 36362, 36841,
36882, 36797, 37675), UNEMP = c(0.54, 0.74, 0.74, 0.63, 0.57,
0.47, 0.34, 0.31, 0.35, 0.38), HINC = c(24800, 24800, 26000,
26000, 26900, 27700, 27900, 29800, 30500, 30500), Adm.Numbers = c(1660,
1726, 1846, 1955, 2026, 1999, 1954, 1924, 1952, 2078)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
Code:
model <- lm(Adm.Numbers ~. - YEAR, data = FACTORS_Thesis_1_)
print(model)
summary(model)
Tree <- rpart(Adm.Numbers ~. - YEAR, data = FACTORS_Thesis_1_, method = "anova")
Tree
rpart.plot(Tree)

It is not possible to plot the tree using its default settings. You can control these in the rpart function. Here is a reproducible example:
library(rpart.plot)
Tree <- rpart(Adm.Numbers ~. - YEAR, data = FACTORS_Thesis_1_, method = "anova", control =rpart.control(minsplit =1,minbucket=1, cp=0))
rpart.plot(Tree)
Created on 2022-07-11 by the reprex package (v2.0.1)

Related

How the wind rose varies by month: Package openair

I have date for 8 years. Sample of my data:
structure(list(Data = c("1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015",
"1/5/2015", "1/6/2015", "1/7/2015", "1/8/2015", "1/9/2015", "1/10/2015",
"1/11/2015", "1/12/2015", "1/13/2015", "1/14/2015", "1/15/2015",
"1/16/2015", "1/17/2015", "1/18/2015", "1/19/2015", "1/20/2015",
"1/21/2015", "1/22/2015", "1/23/2015", "1/24/2015", "1/25/2015",
"1/26/2015", "1/27/2015", "1/28/2015", "1/29/2015", "1/30/2015",
"1/31/2015"), no2 = c(3.56, 11.13, 11.84, 4.88, 6.16, 12.56,
18.99, 24.74, 10.81, 12.7, 6.08, 7.34, 16.88, 16.65, 15.81, 20.78,
15.03, 11.82, 15.18, 17, 15.21, 13.86, 10.28, 8.34, 11.89, 7.22,
15.44, 10.55, 8.19, 5.04, 14.65), ws = c(10.84, 3.71, 2.08, 4.59,
6.18, 2.97, 2.13, 1.22, 1.92, 2.07, 3.09, 4.75, 2.12, 1.8, 1.9,
1.79, 1.58, 1.86, 1.58, 1.47, 1.7, 2.6, 2.67, 3.21, 1.78, 4.58,
1.79, 3.1, 3.49, 6.15, 2.59), wd = c(90, 112.5, 112.5, 270, 90,
135, 112.5, 112.5, 270, 315, 270, 112.5, 112.5, 135, 135, 112.5,
292.5, 135, 270, 135, 112.5, 112.5, 270, 112.5, 112.5, 112.5,
112.5, 112.5, 270, 270, 270)), class = "data.frame", row.names = c(NA,
-31L))
library(openair)
windRose(nitrogen,
key = list(header="Wind Rose Acri", footer="wind speed",
plot.style = c("ticks", "border"),
fit = "all", height = 1,
space = "top"))
pollutionRose(nitrogen, pollutant = "no2")
I want to show how the wind rose varies by month. The same problem (Wind rose with ggplot (R)?) but tried realised by function from Openair package.
You could convert your Data column to a name called date with date format and specify type argument with "month". type according to documenation:
type determines how the data are split i.e. conditioned, and then
plotted. The default is will produce a single plot using the entire
data. Type can be one of the built-in types as detailed in cutData
e.g. “season”, “year”, “weekday” and so on. For example, type =
"season" will produce four plots --- one for each season.
It is also possible to choose type as another variable in the data
frame. If that variable is numeric, then the data will be split into
four quantiles (if possible) and labelled accordingly. If type is an
existing character or factor variable, then those categories/levels
will be used directly. This offers great flexibility for understanding
the variation of different variables and how they depend on one
another.
Type can be up length two e.g. type = c("season", "weekday") will
produce a 2x2 plot split by season and day of the week. Note, when two
types are provided the first forms the columns and the second the
rows.
Please note you only provided one month:
library(openair)
# add month column
nitrogen$date <- as.POSIXct(nitrogen$Data, format = '%m/%d/%Y')
windRose(nitrogen,
key = list(header="Wind Rose Acri", footer="wind speed",
plot.style = c("ticks", "border"),
fit = "all", height = 1,
space = "top"),
type = 'month')
Created on 2022-12-13 with reprex v2.0.2
Here is an example with build-in data with type = 'month':
library(openair)
windRose(mydata, type = "month")
Created on 2022-12-13 with reprex v2.0.2

Forecasting with ARIMA and dummy variables

I am attempting to include a dummy regressor that notes the beginning of the pandemic and runs a regression with ARIMA errors.
My dataset revolves around breaking & entering's happening in Toronto from 2014 to 2021. The issue is that the trend takes a turn due to covid-19 around 2020.
Auto.arima provides me with a ARIMA(1,0,1) model as it is not taking into account the impact of covid-19 and is performing according to the implied return to the series average.
When trying to include a dummy regressor that notes the beginning of the pandemic and run a regression with ARIMA errors I get the following error:
In ifelse(time(BEDATA_GROUPEDtsssarima) >= yearmonth("2020-03"), :
Incompatible methods ("Ops.ts", ">=.vctrs_vctr") for ">="
Code:
# Create a binary time series that indicates the start of the pandemic
library(fpp3)
library(forecast)
library(zoo)
# Check if timeseries
class(BEDATA_GROUPED)
#Convert timeseries
BEDATA_GROUPEDtsssarima <- ts(BEDATA_GROUPED[,2], frequency = 12, start = c(2014, 1))
class(BEDATA_GROUPEDtsssarima)
#Plot
forecast::autoplot(BEDATA_GROUPEDtsssarima)
# Assume that the pandemic began in March 2020
pandemic_dummy <- ifelse(time(BEDATA_GROUPEDtsssarima) >= yearmonth("2020-03"), 1, 0)
# Use auto.arima() to fit an ARIMA model with the dummy variable as an exogenous variable
beddatamodel <- auto.arima(BEDATA_GROUPEDtsssarima, xreg = pandemic_dummy, ic="aic", trace = TRUE)
# Create a binary time series that indicates the start of the pandemic
# In this example, we will assume that the pandemic began in March 2020
pandemic_dummy <- ifelse(time(BEDATA_GROUPEDtsssarima) >= yearmonth("2020-03"), 1, 0)
# Use auto.arima() to fit an ARIMA model with the dummy variable as an exogenous variable
beddatamodel <- auto.arima(BEDATA_GROUPEDtsssarima, xreg = pandemic_dummy, ic="aic", trace = TRUE)
# Create a binary time series for the forecast period that includes the pandemic dummy variable
forecast_period <- time(BEDATA_GROUPEDtsssarima)["2022/01/01/":"2023/12/31/"]
pandemic_dummy_forecast <- ifelse(forecast_period >= yearmonth("2020-03"), 1, 0)
# Use the forecast()
forecast(pandemic_dummy_forecast)
Dataset:
structure(list(occurrence_yrmn = c("2014-January", "2014-February",
"2014-March", "2014-April", "2014-May", "2014-June", "2014-July",
"2014-August", "2014-September", "2014-October", "2014-November",
"2014-December", "2015-January", "2015-February", "2015-March",
"2015-April", "2015-May", "2015-June", "2015-July", "2015-August",
"2015-September", "2015-October", "2015-November", "2015-December",
"2016-January", "2016-February", "2016-March", "2016-April",
"2016-May", "2016-June", "2016-July", "2016-August", "2016-September",
"2016-October", "2016-November", "2016-December", "2017-January",
"2017-February", "2017-March", "2017-April", "2017-May", "2017-June",
"2017-July", "2017-August", "2017-September", "2017-October",
"2017-November", "2017-December", "2018-January", "2018-February",
"2018-March", "2018-April", "2018-May", "2018-June", "2018-July",
"2018-August", "2018-September", "2018-October", "2018-November",
"2018-December", "2019-January", "2019-February", "2019-March",
"2019-April", "2019-May", "2019-June", "2019-July", "2019-August",
"2019-September", "2019-October", "2019-November", "2019-December",
"2020-January", "2020-February", "2020-March", "2020-April",
"2020-May", "2020-June", "2020-July", "2020-August", "2020-September",
"2020-October", "2020-November", "2020-December", "2021-January",
"2021-February", "2021-March", "2021-April", "2021-May", "2021-June",
"2021-July", "2021-August", "2021-September", "2021-October",
"2021-November", "2021-December"), MCI = c(586, 482, 567, 626,
625, 610, 576, 634, 636, 663, 657, 556, 513, 415, 510, 542, 549,
618, 623, 666, 641, 632, 593, 617, 541, 523, 504, 536, 498, 552,
522, 519, 496, 541, 602, 570, 571, 492, 560, 525, 507, 523, 593,
623, 578, 657, 683, 588, 664, 582, 619, 512, 630, 644, 563, 654,
635, 732, 639, 748, 719, 567, 607, 746, 739, 686, 805, 762, 696,
777, 755, 675, 704, 617, 732, 609, 464, 487, 565, 609, 513, 533,
505, 578, 526, 418, 428, 421, 502, 452, 509, 492, 478, 469, 457,
457)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-96L))
I see you have used the fpp3 library, so I've had a go using the tidyverts tools. I've had a go at three models: a plain ARIMA, a plain regression using the pandemic dummy variable, and a dynamic model using both ARIMA and the dummy variable.
Hope this helps! :-)
library(tsibble)
library(fable)
library(fabletools)
library(feasts)
library(dplyr)
Create a tsibble:
BEDATA_GROUPED <- BEDATA_GROUPED |>
mutate(Month = yearmonth(occurrence_yrmn)) |>
as_tsibble(index = Month)
autoplot(BEDATA_GROUPED)
Assume that the pandemic began in March 2020
and create a dummy variable:
pandemic_start <- yearmonth("2020-03-01")
BEDATA_GROUPED <- BEDATA_GROUPED |>
mutate(pandemic_dummy = ifelse(Month >= pandemic_start, 1, 0))
Work up a plain ARIMA:
BEDATA_GROUPED_arima <- BEDATA_GROUPED |>
model(ARIMA(MCI, stepwise = FALSE))
BEDATA_GROUPED_arima |>
gg_tsresiduals()
BEDATA_GROUPED_arima |>
forecast(h = 5) |>
autoplot()
Work up a plain regression:
BEDATA_GROUPED_TSLM <- BEDATA_GROUPED |>
model(TSLM(MCI ~ pandemic_dummy)) |>
report()
BEDATA_GROUPED_TSLM |>
gg_tsresiduals()
Make a data set to predict on:
new_data <- structure(list(Month = structure(c(18993, 19024, 19052, 19083,
19113), class = c("yearmonth", "vctrs_vctr")), pandemic_dummy = c(1,
1, 1, 1, 1)), class = c("tbl_ts", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), key = structure(list(.rows = structure(list(
1:5), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L)), index = structure("Month", ordered = TRUE), index2 = "Month", interval = structure(list(
year = 0, quarter = 0, month = 1, week = 0, day = 0, hour = 0,
minute = 0, second = 0, millisecond = 0, microsecond = 0,
nanosecond = 0, unit = 0), .regular = TRUE, class = c("interval",
"vctrs_rcrd", "vctrs_vctr")))
Forecast plain regression:
BEDATA_GROUPED_TSLM |>
forecast(new_data = new_data) |>
autoplot()
Work up a dynamic regression, with ARIMA and the pandemic dummy variablee:
BEDATA_GROUPED_dyn_ARIMA <- BEDATA_GROUPED |>
model(ARIMA(MCI ~ pandemic_dummy)) |>
report()
BEDATA_GROUPED_dyn_ARIMA |>
gg_tsresiduals()
BEDATA_GROUPED_dyn_ARIMA |>
forecast(new_data = new_data) |>
autoplot()

Predicting effects of a cyclic covariate in a spatial GAM-GEE

I have fit a GAM-GEE in the {geepack} package as I wanted to account for within individual residual autocorrelation for some spatial animal data with known individuals. I have a cyclic covariate I want to fit using a cyclic spline and so I did this by generating the basis functions for the spline first in {mgcv}, as cyclic cubic splines are not currently available in the {splines} package, and then using this as a covariate in the geeglm() fit. I did this by first:
cyclic_basisfunc <- mgcv::gam(response ~s(cyclic_term, bs="cc"), fit=F, data=df, family = "binomial")$X %>%
as_tibble() %>%
dplyr::select(V2:ncol(.)) %>%
as.matrix()
and then:
fit1 <- geepack::geeglm(response ~ cyclic_basisfunc +
bs(beta1)+
bs(lon+lat),
family="binomial",
data=df,
id=as.factor(id),
corstr = "independence")
I can estimate the partial effects of the cyclic smooth by extracting the relevant elements of the model matrix which correspond to that term (in this case 8 basis functions for the cyclic term). However when it comes to predicting over a spatial grid, using the predict() function, I cannot figure out how to appropriately name the elements in the prediction grid. The input variable for the cyclic term was a matrix and any attempt to include the prediction values either as a numeric vector or a matrix returns faults.
I tried generating a prediction grid over my study area as such (n.b. have tried using both the name of the cyclic_basisfunc and cyclic_term as column headers):
predgrid <- data.frame(expand.grid(lat=seq(min(df$lat-0.1), max(df$lat+0.1), length.out = 20),
lon=seq(min(df$lon-0.1), max(df$lon+0.1), length.out = 20),
cyclic_basis_func=seq(min(cyclic_basis_func),
max(cyclic_basisfunc), length.out=20),
beta1=25)))
predict(fit1, predgrid)
and get the error:
Error: variable 'cyclic_basisfunc' was fitted with type "nmatrix.8" but type "numeric" was supplied
This makes sense as the original covariate was fit as an 8 column matrix, 1 for each basis function. Have tried inputting a column into the prediction gird as an 8 column matrix yet still this does not seem to work. Is there a way to generate an appropriate prediction grid for this model? (dput for sample of data below).
df <- structure(list(response = c(0.117, 0.915, 0.22, 0.284, 0.551,
0.871, 0.25, 0.261, 0.117, 0.875, 0.67, 0.533, 0.324, 0.138,
0.154, 0.286, 0.935, 0.744, 0.118, 0.224, 0.865, 0.13, 0.889,
0.115, 0, 0, 0.703, 0.917, 0.812, 0.14, 0.219, 0.114, 0.24, 0.163,
0.115, 0.228, 0, 0, 0.115, 0.106, 0.243, 0.13, 0.908, 0.117,
0.472, 0.95, 0.217, 0.133, 0.744, 0.92, 0.26, 0.138, 0.958, 0.113,
0.147, 0.132, 0.496, 0.148, 0.119, 0.186, 0.721, 0.889, 0.162,
0.157, 0.269, 0.25, 0.129, 0.357, 0.188, 0.361, 0.137, 0.128,
0.872, 0.121, 0.135, 0.466, 0.15, 0.589, 0.138, 0.134, 0.122,
0.463, 0.121, 0.369, 0.813, 0.145, 0.911, 0.17, 0.123, 0.649,
0.476, 0.119, 0.367, 0.135, 0.923, 0.875, 0.119, 0.115, 0.895,
0.923), cyclic_term = c(0.732222222222222, -2.28277777777778,
2.56777777777778, -3.43333333333333, -0.89, -5.68611111111111,
-3.47388888888889, -2.88277777777778, -1.79277777777778, 1.50333333333333,
-0.910555555555556, -4.14944444444444, -0.379027777777778, 0.113333333333333,
0.075, -3.99055555555556, -5.48388888888889, -1.84513888888889,
0.286111111111111, -1.24833333333333, -2.19652777777778, -0.532222222222222,
0.598333333333333, 5.43222222222222, -2.73222222222222, -1.11125,
3.16833333333333, -2.88055555555556, 1.90319444444444, 2.585,
-5.64333333333333, -3.79666666666667, 0.692083333333333, 5.80666666666667,
-4.42333333333333, 1.95666666666667, 2.61722222222222, -5.90055555555556,
1.63, 3.55222222222222, -4.20111111111111, -2.34, 3.39277777777778,
-4.32944444444444, 1.32222222222222, 4.74388888888889, 0.251111111111111,
0.0705555555555556, -1.84513888888889, 5.14305555555556, -3.92555555555556,
-2.34277777777778, 2.55777777777778, -3.75944444444444, 2.32277777777778,
1.82944444444444, -3.42111111111111, 3.22388888888889, -3.90055555555556,
1.94, -5.01888888888889, 4.93902777777778, -2.97388888888889,
4.11222222222222, 1.55055555555556, -2.29055555555556, 0.556666666666667,
1.40375, 3.52944444444444, 4.56555555555556, 1.30611111111111,
-2.59944444444444, 4.11166666666667, 6.005, 1.28111111111111,
-2.35333333333333, -0.125, 0.991666666666667, -4.29055555555556,
4.64055555555556, 1.19222222222222, -0.710555555555556, 0.125,
3.835, -3.79722222222222, 1.46, 0.455833333333333, -5.855, 2.59277777777778,
-1.42777777777778, 4.815, 0.508888888888889, -2.14333333333333,
-1.47444444444444, 5.01847222222222, 3.06666666666667, 5.92388888888889,
1.39944444444444, 5.00236111111111, 4.21666666666667), beta1 = c(32.95,
28.8, 32.15, 32.75, 34.7, 29.7, 28.95, 28.85, 32.25, 28.5, 33.5,
28.5, 30.75, 28.8, 32.4, 32.65, 29.7, 32.25, 29.7, 31.85, 32.15,
31.45, 31.25, 31.05, 38.7, 35.2, 31.65, 33, 32.05, 28.7, 31.85,
35.5, 31.25, 35.25, 33.25, 29.7, 35.5, 30.55, 35.45, 36, 33,
29.85, 31.15, 33.35, 34.8, 28.1, 35.35, 28.8, 32.25, 29.3, 29.7,
28.95, 28.4, 34.7, 28.5, 28.8, 28.5, 33.1, 36.35, 29, 26.95,
33.05, 32.2, 29.2, 30.35, 36, 29.7, 34.25, 34.1, 31.9, 32.05,
28.9, 33.3, 31.85, 32.55, 28.8, 29.1, 39.2, 28.95, 32.15, 28.8,
33.1, 29.5, 37.95, 32.85, 28.5, 30.3, 34.55, 28.15, 33.35, 35.35,
31.6, 35.95, 28.9, 31.1, 32.5, 35.7, 31.85, 32.95, 33.55), lon = c(-8.14386899769604,
-8.1572279376935, -8.15157242384156, -8.11266145447895, -8.15174118001044,
-8.15335952072546, -8.14600667978252, -8.15297882985764, -8.15568356665537,
-8.13472008705139, -8.09368161533273, -8.10491923518749, -8.1603014305949,
-8.15632063618518, -8.13543502792374, -8.09733172904193, -8.15868642071182,
-8.12876868592058, -8.15690393084368, -8.10847025857788, -8.15564957894737,
-8.16047965739412, -8.1538774272955, -8.13959002494812, -8.16031308174808,
-8.13153629064039, -8.10225327088153, -8.11704735322503, -8.10320579591837,
-8.09718723480212, -8.14769670963066, -8.15279598640478, -8.1536752924518,
-8.15005347845117, -8.11959004402161, -8.15327362169542, -8.15338984397156,
-8.15480377425017, -8.14843624352758, -8.1536150198704, -8.14265275265451,
-8.15419676931013, -8.14388546959556, -8.12423783110794, -8.15865186565657,
-8.12779523300791, -8.15498210353148, -8.15711005849511, -8.12876868592058,
-8.14498268712871, -8.12777905464012, -8.14658887045715, -8.14966988563538,
-8.15137416124342, -8.14757286777317, -8.15659830243711, -8.11739216850327,
-8.14816670318125, -8.15283383471808, -8.15503278645551, -8.12968355281506,
-8.11532096238244, -8.15388445232111, -8.12550097443724, -8.14214966153336,
-8.15406262640947, -8.15366204068896, -8.16073804747475, -8.14748077293754,
-8.10236112317726, -8.13306401111526, -8.15754008293152, -8.11496173845736,
-8.14744857712061, -8.13097980901942, -8.15712565747807, -8.16003438931358,
-8.16002796870351, -8.11892837135011, -8.13700008392334, -8.15721941772399,
-8.14819490909576, -8.1561399618782, -8.16012501716613, -8.11709369783742,
-8.13470092070733, -8.14629675, -8.15713679162397, -8.13686372838595,
-8.12430530737705, -8.15464372671311, -8.15669989585876, -8.16044796649062,
-8.15766701538992, -8.09362511111111, -8.13870000839233, -8.14998125100998,
-8.15243885077905, -8.12291705479452, -8.15384632745479), lat = c(33.6622974395803,
33.6600173368609, 33.6599819460086, 33.6598656189251, 33.6673908233704,
33.6593042234088, 33.6572338143965, 33.6580708565473, 33.6629485478539,
33.6547317504883, 33.6598712810572, 33.6567040290043, 33.6652851274788,
33.6596383685524, 33.6570077561196, 33.6611995549193, 33.661593106719,
33.6588793953069, 33.6614662478323, 33.6584457075095, 33.6641300278638,
33.6621415089752, 33.6598426484043, 33.6570205688477, 33.6727939605693,
33.6593126847291, 33.6591196082918, 33.6591313969, 33.6605346,
33.6553558936485, 33.6650662271625, 33.6653484273653, 33.6600826614748,
33.6642524699626, 33.6585006713867, 33.658733988916, 33.664975062454,
33.6610514512704, 33.6621421965042, 33.6687249002733, 33.6576841575676,
33.6600360803426, 33.6574947407709, 33.6584060573279, 33.6673802707071,
33.6550894507841, 33.6654066532252, 33.6595139325288, 33.6588793953069,
33.6568044356436, 33.6559013620991, 33.6568076288124, 33.6572189331055,
33.6670448613725, 33.6563744930416, 33.6598780530983, 33.6554991693199,
33.6602992307323, 33.6667773110577, 33.6591215807971, 33.6549378741848,
33.6592876394984, 33.6624833258972, 33.6556429238423, 33.6564236265265,
33.6680011078708, 33.6591718634038, 33.6684195434343, 33.6642363505652,
33.6599223753418, 33.6575357831156, 33.6607284545898, 33.6602992227631,
33.6646009116676, 33.6569601709497, 33.6597380618501, 33.6610439757783,
33.6709440919706, 33.6553216040898, 33.6567993164063, 33.6604019538554,
33.6604042053223, 33.6607238769535, 33.6704235076904, 33.6597095889516,
33.6546409606935, 33.658124, 33.6659726122235, 33.6550456763148,
33.6591343852459, 33.6648648385784, 33.6633987426758, 33.6695201510491,
33.6609040792869, 33.655895, 33.6606597900391, 33.6629373624763,
33.6612642187822, 33.6580512191781, 33.6635307311956), id = c(8,
14, 12, 12, 7, 12, 8, 12, 10, 12, 14, 12, 6, 14, 10, 12, 8, 4,
14, 14, 2, 10, 2, 12, 9, 5, 12, 12, 5, 10, 12, 14, 2, 14, 14,
14, 12, 12, 14, 8, 12, 10, 12, 14, 6, 12, 12, 14, 4, 3, 12, 10,
14, 8, 14, 14, 12, 10, 8, 14, 12, 3, 10, 12, 12, 12, 14, 6, 14,
9, 12, 10, 12, 14, 14, 14, 14, 10, 12, 12, 14, 12, 14, 14, 12,
12, 4, 8, 14, 2, 14, 14, 11, 10, 3, 12, 8, 14, 3, 12)), row.names = c("25101",
"15358", "80097", "89024", "27479", "98805", "25425", "86335",
"31333", "93483", "12171", "100849", "155418", "12853", "33858",
"100851", "22470", "149448", "7443", "12167", "144934", "33938",
"144132", "91062", "56909", "153781", "95039", "99533", "153760",
"32687", "86913", "12298", "144133", "7402", "11672", "13939",
"78548", "98801", "8135", "24867", "91818", "32609", "95688",
"11675", "155218", "94268", "90367", "7440", "149447", "145506",
"90571", "35105", "14210", "26177", "14975", "16723", "86359",
"34450", "26139", "14198", "89237", "145503", "31062", "92665",
"87694", "78666", "13917", "155219", "15350", "60377", "82820",
"33174", "87056", "7406", "15370", "15356", "9330", "33533",
"86726", "95709", "8131", "96538", "13911", "14229", "86539",
"93482", "145837", "22101", "11305", "144939", "7391", "7445",
"21817", "32804", "145314", "98223", "24175", "8132", "145504",
"87396"), class = "data.frame")
Have you tried evaluating the new values using the same basis used in your original data, then use the result as covariates? Note that you will get several warnings because your response is a proportion/probability and you have not given weights or anything like that. Assuming that the model is correctly specified, see if the code below does what you want.
It seems that you can extract the splines from your GAM using the "lpmatrix" type in predict.gam. See here:
How to extract fitted splines from a GAM (`mgcv::gam`)
I get a bit lost to be honest, so I would suggest just setting up the basis for your covariate without fitting a GAM.
Make sure the number and location of knots makes sense and set up a basis.
As I understand it, cSplinesDes() evaluates the basis functions at the desired points
spl_bs <- cSplineDes(x = df$cyclic_term,
knots = seq(min(df$cyclic_term),
max(df$cyclic_term),
length.out = 8),
ord = 4, derivs = 0)
Give reasonable names
colnames(spl_bs) <- paste0("cyclic.", 1:ncol(spl_bs))
You probably want to include these new columns in your data frame.
Note that I removed the Intercept term below but you might want to leave it in
df <- cbind(spl_bs, df)
Fit GEE with convenience code to specify model
model <- reformulate(c(-1,
paste0("cyclic.", 1:(ncol(spl_bs))),
"splines::bs(beta1)", "splines::bs(lon+lat)"),
response = "response")
fit1 <- geepack::geeglm(model,
family="binomial",
data=df,
id=as.factor(id),
corstr = "independence")
Now with the predictions. Create new data
predgrid <- data.frame(expand.grid(lat=seq(min(df$lat-0.1), max(df$lat+0.1), length.out = 20),
lon=seq(min(df$lon-0.1), max(df$lon+0.1), length.out = 20),
cyclic_term=seq(min(df$cyclic_term),
max(df$cyclic_term), length.out=20),
beta1=25))
Set the same basis you used for the original data, but now evaluate at the new points
new_spline <- cSplineDes(predgrid$cyclic_term,
knots = seq(min(df$cyclic_term),
max(df$cyclic_term),
length.out = 8),
ord = 4, derivs=0)
colnames(new_spline) <- paste0("cyclic.", 1:ncol(new_spline))
Assemble new data and predict from fit1
newdf <- cbind(new_spline, predgrid)
predict(fit1, newdf)
Note that latitude/longitude spans out of range of original data, which is dangerous

Negative binomial regression of month trend

I read a paper about negative binomial regression:"We modelled the number of Ecoli bloodstream infections and E coli UTIs per month using negative-binomial regression (incorporating overdispersion), assuming the same underlying population(no offset)." The figure as the followings
I also have a set of data, want to figure the infection like the figure with month/year, how can I do that? thank you very much
df <- structure(list(Year = c(2013, 2013, 2013, 2013, 2013, 2013, 2013,
2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2015, 2015), Month = c(1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), Incidence = c(2.25538216197745,
3.49502862307924, 2.76311704439615, 2.9836483329794, 3.09375,
3.0368028900429, 3.82920688208141, 3.9154960734432, 3.33517393705135,
3.54593329432417, 3.27586206896552, 3.25655281969817, 3.35912052117264,
3.21672101986362, 2.78237182605312, 2.58435732397113, 2.72516428295323,
3.1227603153476, 2.6300688599847, 2.66324718879463, 2.62653374233129,
2.45256358498183, 2.39520958083832, 3.58683926645092, 3.41995942421022,
3.61001317523057, 2.62718158187895, 2.86944045911047, 2.77978993118435,
2.89282762420792, 2.69410829432029, 3.22232223222322, 3.39818882811799,
3.36725958337297, 2.90030211480363, 3.20789124668435), Inpatient = c(8779,
6638, 9663, 9418, 9600, 8858, 9532, 9041, 9055, 8545, 9280, 10072,
9824, 6746, 10279, 10254, 10348, 9767, 10456, 10138, 10432, 9908,
9853, 11124, 10351, 7590, 10772, 11152, 11044, 10889, 11321,
11110, 11153, 10513, 11585, 12064), infection = c(198, 232, 267,
281, 297, 269, 365, 354, 302, 303, 304, 328, 330, 217, 286, 265,
282, 305, 275, 270, 274, 243, 236, 399, 354, 274, 283, 320, 307,
315, 305, 358, 379, 354, 336, 387)), row.names = c(NA, -36L), class = c("tbl_df",
"tbl", "data.frame"))
reference:
Vihta K D, Stoesser N, Llewelyn M J, et al. Trends over time in Escherichia coli bloodstream infections, urinary tract infections, and antibiotic susceptibilities in Oxfordshire, UK, 1998–2016: a study of electronic health records[J]. The Lancet Infectious Diseases, 2018, 18(10): 1138-1149.
Using the data above, one can do the following:
library(MASS) # for function glm.nb
library(ggplot2)
library(broom) # for tidy model outputs
Create a date, to make plotting easy
df$t <- as.Date(paste("01", df$Month, df$Year, sep = "-"), format = "%d-%m-%Y")
Plot the data. geom_smooth adds the trend line and confidence intervals, using the date as the predictor.
p <- ggplot(data = df, aes(x = t, y = infection)) +
geom_point() +
geom_smooth(method = "glm.nb")
p
To perform the regression, set the count of infections as the dependent variable and the nth month as the independent variable, below month_as_integer.
df$month_as_integer <- seq_along(df$Month)
m1 <- glm.nb(infection ~ month_as_integer, data = df)
using tidy from the broom package, one can get the estimate and confidence intervals as a data frame.
out1 <- as.data.frame(tidy(m1, exponentiate = TRUE, conf.int = TRUE) )
out1
term estimate std.error statistic p.value conf.low conf.high
1 (Intercept) 264.44399 0.048006493 116.184897 0.000000000 240.943378 290.556355
2 month_as_integer 1.00697 0.002250993 3.085763 0.002030303 1.002569 1.011394

Compute a kernel ridge regression in R for model selection

I have a dataframe df
df<-structure(list(P = c(794.102395099402, 1299.01021921817, 1219.80731174175,
1403.00786976395, 742.749487463385, 340.246973543409, 90.3220586792255,
195.85557320714, 199.390867672674, 191.4970921278, 334.452413539092,
251.730350291822, 235.899165861309, 442.969718728163, 471.120193046119,
458.464154601097, 950.298132134912, 454.660729622624, 591.212003320456,
546.188716055825, 976.994105334083, 1021.67000560164, 945.965200876724,
932.324768081307, 3112.60002304117, 624.005047807736, 0, 937.509240627289,
892.926195849975, 598.564015734103, 907.984807726741, 363.400837339461,
817.629824627294, 2493.75851182081, 451.149000503123, 1028.41455932241,
615.640039284434, 688.915621065535, NaN, 988.21297, NaN, 394.7,
277.7, 277.7, 492.7, 823.6, 1539.1, 556.4, 556.4, 556.4), T = c(11.7087701201175,
8.38748953516909, 9.07065637842101, 9.96978059247473, 2.87026334756687,
-1.20497751697385, 1.69057148825093, 2.79168506923385, -1.03659741363293,
-2.44619473778322, -1.0414166493637, -0.0616510891024765, -2.19566614081763,
2.101408628412, 1.30197334094966, 1.38963309876057, 1.11283280896495,
0.570385633957982, 1.05118063842584, 0.816991857384802, 8.95069454902333,
6.41067954598958, 8.42110173395973, 13.6455092557636, 25.706509843239,
15.5098014530832, 6.60783204117648, 6.27004335176393, 10.0769600264915,
3.05237224011361, 7.52869186722913, 11.2970127691776, 6.60356510073103,
7.3210245298803, 8.4723724171517, 21.6988324356057, 7.34952593890056,
6.04325232771032, NaN, 25.990913731, NaN, 1.5416666667, 15.1416666667,
15.1416666667, 0.825, 4.3666666667, 7.225, -2.075, -2.075, -2.075
), A = c(76.6, 52.5, 3.5, 15, 71.5, 161.833333333333, 154, 72.5,
39, 40, 23, 14.5, 5.5, 78, 129, 73.5, 100, 10, 3, 29.5, 65, 44,
68.5, 56.5, 101, 52.1428571428571, 66.5, 1, 106, 36.6, 21.2,
10, 135, 46.5, 17.5, 35.5, 86, 70.5, 65, 97, 30.5, 96, 79, 11,
162, 350, 42, 200, 50, 250), Y = c(1135.40733061247, 2232.28817154825,
682.15711101488, 1205.97307573068, 1004.2559099408, 656.537378609781,
520.796355544007, 437.780508459633, 449.167726897157, 256.552344558528,
585.618137514404, 299.815636674633, 230.279491515383, 1051.74875971674,
801.07750760983, 572.337961145761, 666.132923644351, 373.524159859929,
128.198042456082, 528.555426408071, 1077.30188477292, 1529.43757814094,
1802.78658590423, 1289.80342084379, 3703.38329098125, 1834.54460388103,
1087.48954802548, 613.15010408836, 1750.11457900004, 704.123482171384,
1710.60321283154, 326.663507855032, 1468.32489464969, 1233.05517321796,
852.500007182098, 1246.5605930537, 1186.31346316832, 1460.48566379373,
2770, 3630, 3225, 831, 734, 387, 548.8, 1144, 1055, 911, 727,
777)), .Names = c("P", "T", "A", "Y"), row.names = c(NA, -50L
), class = "data.frame")
I want to do a model selection by using a kernel ridge regression. I have done it with a simple step wise regression analysis (see below) but I would like to do it using a kernel ridge regression now.
library(caret)
Step <- train(Y~ P+T+A, data=df,
preProcess= c("center", "scale"),
method = "lmStepAIC",
trainControl(method="cv",repeats = 10), na.rm=T)
Anyone know how I could compute a kernel ridge regression for model selection?
Using the CVST package that etienne linked, here is how you can train and predict with a Kernel Ridge Regression learner:
library(CVST)
## Assuming df is already in your environment
d = constructData(x=df[,1:3], y=df$Y) ## Structure data in CVST format
krr_learner = constructKRRLearner() ## Build the base learner
params = list(kernel='rbfdot', sigma=100, lambda=0.01) ## Function params; documentation defines lambda as '.1/getN(d)'
krr_trained = krr_learner$learn(d, params)
## Now to predict, format your test data, 'dTest', the same way as you did in 'd'
pred = krr_learner$predict(krr_trained, dTest)
What makes CVST slightly painful is the intermediary data prep step that requires you call the constructData function. This is an adapted example from page 7 in the documentation.
It's worth mentioning that when I ran this code on your example, I got the following singularity warning:
Lapack routine dgesv: system is exactly singular: U[1,1] = 0

Resources