Predicting effects of a cyclic covariate in a spatial GAM-GEE - r

I have fit a GAM-GEE in the {geepack} package as I wanted to account for within individual residual autocorrelation for some spatial animal data with known individuals. I have a cyclic covariate I want to fit using a cyclic spline and so I did this by generating the basis functions for the spline first in {mgcv}, as cyclic cubic splines are not currently available in the {splines} package, and then using this as a covariate in the geeglm() fit. I did this by first:
cyclic_basisfunc <- mgcv::gam(response ~s(cyclic_term, bs="cc"), fit=F, data=df, family = "binomial")$X %>%
as_tibble() %>%
dplyr::select(V2:ncol(.)) %>%
as.matrix()
and then:
fit1 <- geepack::geeglm(response ~ cyclic_basisfunc +
bs(beta1)+
bs(lon+lat),
family="binomial",
data=df,
id=as.factor(id),
corstr = "independence")
I can estimate the partial effects of the cyclic smooth by extracting the relevant elements of the model matrix which correspond to that term (in this case 8 basis functions for the cyclic term). However when it comes to predicting over a spatial grid, using the predict() function, I cannot figure out how to appropriately name the elements in the prediction grid. The input variable for the cyclic term was a matrix and any attempt to include the prediction values either as a numeric vector or a matrix returns faults.
I tried generating a prediction grid over my study area as such (n.b. have tried using both the name of the cyclic_basisfunc and cyclic_term as column headers):
predgrid <- data.frame(expand.grid(lat=seq(min(df$lat-0.1), max(df$lat+0.1), length.out = 20),
lon=seq(min(df$lon-0.1), max(df$lon+0.1), length.out = 20),
cyclic_basis_func=seq(min(cyclic_basis_func),
max(cyclic_basisfunc), length.out=20),
beta1=25)))
predict(fit1, predgrid)
and get the error:
Error: variable 'cyclic_basisfunc' was fitted with type "nmatrix.8" but type "numeric" was supplied
This makes sense as the original covariate was fit as an 8 column matrix, 1 for each basis function. Have tried inputting a column into the prediction gird as an 8 column matrix yet still this does not seem to work. Is there a way to generate an appropriate prediction grid for this model? (dput for sample of data below).
df <- structure(list(response = c(0.117, 0.915, 0.22, 0.284, 0.551,
0.871, 0.25, 0.261, 0.117, 0.875, 0.67, 0.533, 0.324, 0.138,
0.154, 0.286, 0.935, 0.744, 0.118, 0.224, 0.865, 0.13, 0.889,
0.115, 0, 0, 0.703, 0.917, 0.812, 0.14, 0.219, 0.114, 0.24, 0.163,
0.115, 0.228, 0, 0, 0.115, 0.106, 0.243, 0.13, 0.908, 0.117,
0.472, 0.95, 0.217, 0.133, 0.744, 0.92, 0.26, 0.138, 0.958, 0.113,
0.147, 0.132, 0.496, 0.148, 0.119, 0.186, 0.721, 0.889, 0.162,
0.157, 0.269, 0.25, 0.129, 0.357, 0.188, 0.361, 0.137, 0.128,
0.872, 0.121, 0.135, 0.466, 0.15, 0.589, 0.138, 0.134, 0.122,
0.463, 0.121, 0.369, 0.813, 0.145, 0.911, 0.17, 0.123, 0.649,
0.476, 0.119, 0.367, 0.135, 0.923, 0.875, 0.119, 0.115, 0.895,
0.923), cyclic_term = c(0.732222222222222, -2.28277777777778,
2.56777777777778, -3.43333333333333, -0.89, -5.68611111111111,
-3.47388888888889, -2.88277777777778, -1.79277777777778, 1.50333333333333,
-0.910555555555556, -4.14944444444444, -0.379027777777778, 0.113333333333333,
0.075, -3.99055555555556, -5.48388888888889, -1.84513888888889,
0.286111111111111, -1.24833333333333, -2.19652777777778, -0.532222222222222,
0.598333333333333, 5.43222222222222, -2.73222222222222, -1.11125,
3.16833333333333, -2.88055555555556, 1.90319444444444, 2.585,
-5.64333333333333, -3.79666666666667, 0.692083333333333, 5.80666666666667,
-4.42333333333333, 1.95666666666667, 2.61722222222222, -5.90055555555556,
1.63, 3.55222222222222, -4.20111111111111, -2.34, 3.39277777777778,
-4.32944444444444, 1.32222222222222, 4.74388888888889, 0.251111111111111,
0.0705555555555556, -1.84513888888889, 5.14305555555556, -3.92555555555556,
-2.34277777777778, 2.55777777777778, -3.75944444444444, 2.32277777777778,
1.82944444444444, -3.42111111111111, 3.22388888888889, -3.90055555555556,
1.94, -5.01888888888889, 4.93902777777778, -2.97388888888889,
4.11222222222222, 1.55055555555556, -2.29055555555556, 0.556666666666667,
1.40375, 3.52944444444444, 4.56555555555556, 1.30611111111111,
-2.59944444444444, 4.11166666666667, 6.005, 1.28111111111111,
-2.35333333333333, -0.125, 0.991666666666667, -4.29055555555556,
4.64055555555556, 1.19222222222222, -0.710555555555556, 0.125,
3.835, -3.79722222222222, 1.46, 0.455833333333333, -5.855, 2.59277777777778,
-1.42777777777778, 4.815, 0.508888888888889, -2.14333333333333,
-1.47444444444444, 5.01847222222222, 3.06666666666667, 5.92388888888889,
1.39944444444444, 5.00236111111111, 4.21666666666667), beta1 = c(32.95,
28.8, 32.15, 32.75, 34.7, 29.7, 28.95, 28.85, 32.25, 28.5, 33.5,
28.5, 30.75, 28.8, 32.4, 32.65, 29.7, 32.25, 29.7, 31.85, 32.15,
31.45, 31.25, 31.05, 38.7, 35.2, 31.65, 33, 32.05, 28.7, 31.85,
35.5, 31.25, 35.25, 33.25, 29.7, 35.5, 30.55, 35.45, 36, 33,
29.85, 31.15, 33.35, 34.8, 28.1, 35.35, 28.8, 32.25, 29.3, 29.7,
28.95, 28.4, 34.7, 28.5, 28.8, 28.5, 33.1, 36.35, 29, 26.95,
33.05, 32.2, 29.2, 30.35, 36, 29.7, 34.25, 34.1, 31.9, 32.05,
28.9, 33.3, 31.85, 32.55, 28.8, 29.1, 39.2, 28.95, 32.15, 28.8,
33.1, 29.5, 37.95, 32.85, 28.5, 30.3, 34.55, 28.15, 33.35, 35.35,
31.6, 35.95, 28.9, 31.1, 32.5, 35.7, 31.85, 32.95, 33.55), lon = c(-8.14386899769604,
-8.1572279376935, -8.15157242384156, -8.11266145447895, -8.15174118001044,
-8.15335952072546, -8.14600667978252, -8.15297882985764, -8.15568356665537,
-8.13472008705139, -8.09368161533273, -8.10491923518749, -8.1603014305949,
-8.15632063618518, -8.13543502792374, -8.09733172904193, -8.15868642071182,
-8.12876868592058, -8.15690393084368, -8.10847025857788, -8.15564957894737,
-8.16047965739412, -8.1538774272955, -8.13959002494812, -8.16031308174808,
-8.13153629064039, -8.10225327088153, -8.11704735322503, -8.10320579591837,
-8.09718723480212, -8.14769670963066, -8.15279598640478, -8.1536752924518,
-8.15005347845117, -8.11959004402161, -8.15327362169542, -8.15338984397156,
-8.15480377425017, -8.14843624352758, -8.1536150198704, -8.14265275265451,
-8.15419676931013, -8.14388546959556, -8.12423783110794, -8.15865186565657,
-8.12779523300791, -8.15498210353148, -8.15711005849511, -8.12876868592058,
-8.14498268712871, -8.12777905464012, -8.14658887045715, -8.14966988563538,
-8.15137416124342, -8.14757286777317, -8.15659830243711, -8.11739216850327,
-8.14816670318125, -8.15283383471808, -8.15503278645551, -8.12968355281506,
-8.11532096238244, -8.15388445232111, -8.12550097443724, -8.14214966153336,
-8.15406262640947, -8.15366204068896, -8.16073804747475, -8.14748077293754,
-8.10236112317726, -8.13306401111526, -8.15754008293152, -8.11496173845736,
-8.14744857712061, -8.13097980901942, -8.15712565747807, -8.16003438931358,
-8.16002796870351, -8.11892837135011, -8.13700008392334, -8.15721941772399,
-8.14819490909576, -8.1561399618782, -8.16012501716613, -8.11709369783742,
-8.13470092070733, -8.14629675, -8.15713679162397, -8.13686372838595,
-8.12430530737705, -8.15464372671311, -8.15669989585876, -8.16044796649062,
-8.15766701538992, -8.09362511111111, -8.13870000839233, -8.14998125100998,
-8.15243885077905, -8.12291705479452, -8.15384632745479), lat = c(33.6622974395803,
33.6600173368609, 33.6599819460086, 33.6598656189251, 33.6673908233704,
33.6593042234088, 33.6572338143965, 33.6580708565473, 33.6629485478539,
33.6547317504883, 33.6598712810572, 33.6567040290043, 33.6652851274788,
33.6596383685524, 33.6570077561196, 33.6611995549193, 33.661593106719,
33.6588793953069, 33.6614662478323, 33.6584457075095, 33.6641300278638,
33.6621415089752, 33.6598426484043, 33.6570205688477, 33.6727939605693,
33.6593126847291, 33.6591196082918, 33.6591313969, 33.6605346,
33.6553558936485, 33.6650662271625, 33.6653484273653, 33.6600826614748,
33.6642524699626, 33.6585006713867, 33.658733988916, 33.664975062454,
33.6610514512704, 33.6621421965042, 33.6687249002733, 33.6576841575676,
33.6600360803426, 33.6574947407709, 33.6584060573279, 33.6673802707071,
33.6550894507841, 33.6654066532252, 33.6595139325288, 33.6588793953069,
33.6568044356436, 33.6559013620991, 33.6568076288124, 33.6572189331055,
33.6670448613725, 33.6563744930416, 33.6598780530983, 33.6554991693199,
33.6602992307323, 33.6667773110577, 33.6591215807971, 33.6549378741848,
33.6592876394984, 33.6624833258972, 33.6556429238423, 33.6564236265265,
33.6680011078708, 33.6591718634038, 33.6684195434343, 33.6642363505652,
33.6599223753418, 33.6575357831156, 33.6607284545898, 33.6602992227631,
33.6646009116676, 33.6569601709497, 33.6597380618501, 33.6610439757783,
33.6709440919706, 33.6553216040898, 33.6567993164063, 33.6604019538554,
33.6604042053223, 33.6607238769535, 33.6704235076904, 33.6597095889516,
33.6546409606935, 33.658124, 33.6659726122235, 33.6550456763148,
33.6591343852459, 33.6648648385784, 33.6633987426758, 33.6695201510491,
33.6609040792869, 33.655895, 33.6606597900391, 33.6629373624763,
33.6612642187822, 33.6580512191781, 33.6635307311956), id = c(8,
14, 12, 12, 7, 12, 8, 12, 10, 12, 14, 12, 6, 14, 10, 12, 8, 4,
14, 14, 2, 10, 2, 12, 9, 5, 12, 12, 5, 10, 12, 14, 2, 14, 14,
14, 12, 12, 14, 8, 12, 10, 12, 14, 6, 12, 12, 14, 4, 3, 12, 10,
14, 8, 14, 14, 12, 10, 8, 14, 12, 3, 10, 12, 12, 12, 14, 6, 14,
9, 12, 10, 12, 14, 14, 14, 14, 10, 12, 12, 14, 12, 14, 14, 12,
12, 4, 8, 14, 2, 14, 14, 11, 10, 3, 12, 8, 14, 3, 12)), row.names = c("25101",
"15358", "80097", "89024", "27479", "98805", "25425", "86335",
"31333", "93483", "12171", "100849", "155418", "12853", "33858",
"100851", "22470", "149448", "7443", "12167", "144934", "33938",
"144132", "91062", "56909", "153781", "95039", "99533", "153760",
"32687", "86913", "12298", "144133", "7402", "11672", "13939",
"78548", "98801", "8135", "24867", "91818", "32609", "95688",
"11675", "155218", "94268", "90367", "7440", "149447", "145506",
"90571", "35105", "14210", "26177", "14975", "16723", "86359",
"34450", "26139", "14198", "89237", "145503", "31062", "92665",
"87694", "78666", "13917", "155219", "15350", "60377", "82820",
"33174", "87056", "7406", "15370", "15356", "9330", "33533",
"86726", "95709", "8131", "96538", "13911", "14229", "86539",
"93482", "145837", "22101", "11305", "144939", "7391", "7445",
"21817", "32804", "145314", "98223", "24175", "8132", "145504",
"87396"), class = "data.frame")

Have you tried evaluating the new values using the same basis used in your original data, then use the result as covariates? Note that you will get several warnings because your response is a proportion/probability and you have not given weights or anything like that. Assuming that the model is correctly specified, see if the code below does what you want.
It seems that you can extract the splines from your GAM using the "lpmatrix" type in predict.gam. See here:
How to extract fitted splines from a GAM (`mgcv::gam`)
I get a bit lost to be honest, so I would suggest just setting up the basis for your covariate without fitting a GAM.
Make sure the number and location of knots makes sense and set up a basis.
As I understand it, cSplinesDes() evaluates the basis functions at the desired points
spl_bs <- cSplineDes(x = df$cyclic_term,
knots = seq(min(df$cyclic_term),
max(df$cyclic_term),
length.out = 8),
ord = 4, derivs = 0)
Give reasonable names
colnames(spl_bs) <- paste0("cyclic.", 1:ncol(spl_bs))
You probably want to include these new columns in your data frame.
Note that I removed the Intercept term below but you might want to leave it in
df <- cbind(spl_bs, df)
Fit GEE with convenience code to specify model
model <- reformulate(c(-1,
paste0("cyclic.", 1:(ncol(spl_bs))),
"splines::bs(beta1)", "splines::bs(lon+lat)"),
response = "response")
fit1 <- geepack::geeglm(model,
family="binomial",
data=df,
id=as.factor(id),
corstr = "independence")
Now with the predictions. Create new data
predgrid <- data.frame(expand.grid(lat=seq(min(df$lat-0.1), max(df$lat+0.1), length.out = 20),
lon=seq(min(df$lon-0.1), max(df$lon+0.1), length.out = 20),
cyclic_term=seq(min(df$cyclic_term),
max(df$cyclic_term), length.out=20),
beta1=25))
Set the same basis you used for the original data, but now evaluate at the new points
new_spline <- cSplineDes(predgrid$cyclic_term,
knots = seq(min(df$cyclic_term),
max(df$cyclic_term),
length.out = 8),
ord = 4, derivs=0)
colnames(new_spline) <- paste0("cyclic.", 1:ncol(new_spline))
Assemble new data and predict from fit1
newdf <- cbind(new_spline, predgrid)
predict(fit1, newdf)
Note that latitude/longitude spans out of range of original data, which is dangerous

Related

Error in nlsModel(formula, mf, start, wts) : singular gradient matrix at initial parameter estimates for Bb

I am trying to fit an equation through the given data but have been unsuccessful
data <- data.frame(x=c(2.27, 2.72, 3.18, 3.63, 4.08, 4.54, 4.99, 5.45, 5.90, 6.35, 6.81, 7.26, 7.71, 8.17, 8.62, 9.08, 9.53, 9.98, 10.44, 10.89, 11.34, 11.80, 12.25, 12.71, 13.16, 13.61, 14.07, 14.52, 14.97, 15.43, 15.88, 16.34, 16.79, 17.24, 17.70, 18.15, 18.61, 19.06, 19.51, 19.97, 20.42, 20.87, 21.33),
y=c(200.723, 249.508, 293.024, 327.770, 354.081, 372.079, 381.493, 383.478, 378.901, 368.833, 354.063, 336.278, 316.076, 293.924, 271.432, 248.239, 225.940, 204.327, 183.262, 163.830, 145.750, 128.835, 113.568, 99.451, 87.036, 75.876, 65.766, 57.008, 49.223, 42.267, 36.352, 31.062, 26.580, 22.644, 19.255, 16.391, 13.811, 11.716, 9.921, 8.364, 7.087, 5.801, 4.523))
frequency <- (data$x)
brightness <- (data$y)*2.71057477e-3
# Define the Planck blackbody equation as a function in R
B <- function(frequency, t) {
h <- 6.62607015e-34
c <- 299792458
k <- 1.380649e-23
(2 * h * frequency^3 * c^-2) / (exp((h * frequency) / (k * t)) - 1)
}
library(stats)
fit <- nls(brightness ~ B(frequency, t), data = data, start = list(t = 2.5))
# Summarize the fit
summary(fit)
# Plot the data and the fitted model
library(ggplot2)
ggplot(data, aes(x = frequency, y = brightness)) +
geom_point() +
geom_line(aes(x = frequency, y = predict(fit)))
I just need a basic scatter plot of the data with the equation fitted over it as a line
data <- data.frame(x=c(2.27, 2.72, 3.18, 3.63, 4.08, 4.54, 4.99, 5.45, 5.90, 6.35, 6.81, 7.26, 7.71, 8.17, 8.62, 9.08, 9.53, 9.98, 10.44, 10.89, 11.34, 11.80, 12.25, 12.71, 13.16, 13.61, 14.07, 14.52, 14.97, 15.43, 15.88, 16.34, 16.79, 17.24, 17.70, 18.15, 18.61, 19.06, 19.51, 19.97, 20.42, 20.87, 21.33),
y=c(200.723, 249.508, 293.024, 327.770, 354.081, 372.079, 381.493, 383.478, 378.901, 368.833, 354.063, 336.278, 316.076, 293.924, 271.432, 248.239, 225.940, 204.327, 183.262, 163.830, 145.750, 128.835, 113.568, 99.451, 87.036, 75.876, 65.766, 57.008, 49.223, 42.267, 36.352, 31.062, 26.580, 22.644, 19.255, 16.391, 13.811, 11.716, 9.921, 8.364, 7.087, 5.801, 4.523))
data$frequency <- (data$x)
data$brightness <- (data$y)*2.71057477e-3
# Define the Planck blackbody equation as a function in R
B <- function(frequency, t) {
h <- 6.62607015e-34
c <- 299792458
k <- 1.380649e-23
(2 * h * frequency^3 * c^-2) / (expm1((h * frequency) / (k * t)))
}
library(stats)
fit <- nls(brightness ~ B(frequency, t), data = data, start = list(t = 2.5))
# Summarize the fit
summary(fit)
# Plot the data and the fitted model
library(ggplot2)
ggplot(data, aes(x = frequency, y = brightness)) +
geom_point() +
geom_line(aes(x = frequency, y = predict(fit)))
This has been asked as comments on your previous questions, What are the units on frequency and brightness?
Going back to your original question the units on data$x is 1/cm so it is not frequency but wavenumber and brightness is MJy/sr which is equal to 1.256E-19 kg/s2
So if you use the correct form of the Plank's equation (There is a typo on the Wikipedia page) and perform the unit conversions you are able to perform a fit. Not necessary a great fit.
data <- data.frame(freq = c(2.27, 2.72, 3.18, 3.63, 4.08, 4.54, 4.99, 5.45, 5.90, 6.35, 6.81, 7.26, 7.71, 8.17, 8.62, 9.08, 9.53, 9.98, 10.44, 10.89, 11.34, 11.80, 12.25, 12.71, 13.16, 13.61, 14.07, 14.52, 14.97, 15.43, 15.88, 16.34, 16.79, 17.24, 17.70, 18.15, 18.61, 19.06, 19.51, 19.97, 20.42, 20.87, 21.33),
brightness = c(200.723, 249.508, 293.024, 327.770, 354.081, 372.079, 381.493, 383.478, 378.901, 368.833, 354.063, 336.278, 316.076, 293.924, 271.432, 248.239, 225.940, 204.327, 183.262, 163.830, 145.750, 128.835, 113.568, 99.451, 87.036, 75.876, 65.766, 57.008, 49.223, 42.267, 36.352, 31.062, 26.580, 22.644, 19.255, 16.391, 13.811, 11.716, 9.921, 8.364, 7.087, 5.801, 4.523))
planck <- function(freq, t, h, c, k) {
# freq = wavenumber * c
# (2 * h * freq^3) / (c^2) * 1/(exp((h * freq) / (k * t)) - 1)
(2 * h * c * freq^3) * 1/(exp((h*c*freq)/(k*t))-1)
}
# fit the data using nls
h = 6.62607e-34
c= 3e8
k = 1.38065e-23
data$freq<- data$freq*100 #(convert from 1/cm to 1/m)
fit <- nls(brightness*1.256E-19 ~ planck(freq, t, h, c, k), start = list(t = 3), data = data)
summary(fit)
#1.256E-19 to convert from kg/s^2 to MJy/sr
brightness_fit <- predict(fit, data)/1.256E-19
# plot the data and the fitted curve
plot(data$freq, data$brightness, xlab = "Frequency (1/m)", ylab = "Brightness")
lines(data$freq, y=brightness_fit, col = "red")

How the wind rose varies by month: Package openair

I have date for 8 years. Sample of my data:
structure(list(Data = c("1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015",
"1/5/2015", "1/6/2015", "1/7/2015", "1/8/2015", "1/9/2015", "1/10/2015",
"1/11/2015", "1/12/2015", "1/13/2015", "1/14/2015", "1/15/2015",
"1/16/2015", "1/17/2015", "1/18/2015", "1/19/2015", "1/20/2015",
"1/21/2015", "1/22/2015", "1/23/2015", "1/24/2015", "1/25/2015",
"1/26/2015", "1/27/2015", "1/28/2015", "1/29/2015", "1/30/2015",
"1/31/2015"), no2 = c(3.56, 11.13, 11.84, 4.88, 6.16, 12.56,
18.99, 24.74, 10.81, 12.7, 6.08, 7.34, 16.88, 16.65, 15.81, 20.78,
15.03, 11.82, 15.18, 17, 15.21, 13.86, 10.28, 8.34, 11.89, 7.22,
15.44, 10.55, 8.19, 5.04, 14.65), ws = c(10.84, 3.71, 2.08, 4.59,
6.18, 2.97, 2.13, 1.22, 1.92, 2.07, 3.09, 4.75, 2.12, 1.8, 1.9,
1.79, 1.58, 1.86, 1.58, 1.47, 1.7, 2.6, 2.67, 3.21, 1.78, 4.58,
1.79, 3.1, 3.49, 6.15, 2.59), wd = c(90, 112.5, 112.5, 270, 90,
135, 112.5, 112.5, 270, 315, 270, 112.5, 112.5, 135, 135, 112.5,
292.5, 135, 270, 135, 112.5, 112.5, 270, 112.5, 112.5, 112.5,
112.5, 112.5, 270, 270, 270)), class = "data.frame", row.names = c(NA,
-31L))
library(openair)
windRose(nitrogen,
key = list(header="Wind Rose Acri", footer="wind speed",
plot.style = c("ticks", "border"),
fit = "all", height = 1,
space = "top"))
pollutionRose(nitrogen, pollutant = "no2")
I want to show how the wind rose varies by month. The same problem (Wind rose with ggplot (R)?) but tried realised by function from Openair package.
You could convert your Data column to a name called date with date format and specify type argument with "month". type according to documenation:
type determines how the data are split i.e. conditioned, and then
plotted. The default is will produce a single plot using the entire
data. Type can be one of the built-in types as detailed in cutData
e.g. “season”, “year”, “weekday” and so on. For example, type =
"season" will produce four plots --- one for each season.
It is also possible to choose type as another variable in the data
frame. If that variable is numeric, then the data will be split into
four quantiles (if possible) and labelled accordingly. If type is an
existing character or factor variable, then those categories/levels
will be used directly. This offers great flexibility for understanding
the variation of different variables and how they depend on one
another.
Type can be up length two e.g. type = c("season", "weekday") will
produce a 2x2 plot split by season and day of the week. Note, when two
types are provided the first forms the columns and the second the
rows.
Please note you only provided one month:
library(openair)
# add month column
nitrogen$date <- as.POSIXct(nitrogen$Data, format = '%m/%d/%Y')
windRose(nitrogen,
key = list(header="Wind Rose Acri", footer="wind speed",
plot.style = c("ticks", "border"),
fit = "all", height = 1,
space = "top"),
type = 'month')
Created on 2022-12-13 with reprex v2.0.2
Here is an example with build-in data with type = 'month':
library(openair)
windRose(mydata, type = "month")
Created on 2022-12-13 with reprex v2.0.2

Getting the distance matrix back from already clustered data

I have used hclust in the TSclust package to do agglomerative hierarchical clustering. My question is, Can I get the dissimlarity (distance) matrix back from hclust? I wanted the values of the distance to rank which variable is closer to a single variable in the group of variables.
example: If (x1,x2, x3,x4,x5,x6,x7,x8,x9,x10) are the variables used to form the distance matrix, then what I wanted is the distance between x3 and the rest of variables (x3x1,x3x2,x3x4,x3x5, and so on). Can we do that? Here is the code and reproducible data.
Data:
structure(list(x1 = c(186.41, 100.18, 12.3, 14.38, 25.97, 0.06,
0, 6.17, 244.06, 19.26, 256.18, 255.69, 121.88, 75, 121.45, 11.34,
34.68, 3.09, 34.3, 26.13, 111.31), x2 = c(327.2, 8.05, 4.23,
6.7, 3.12, 1.91, 37.03, 39.17, 140.06, 83.72, 263.29, 261.22,
202.48, 23.27, 2.87, 7.17, 14.48, 3.41, 5.95, 70.56, 91.58),
x3 = c(220.18, 126.14, 98.59, 8.56, 0.5, 0.9, 17.45, 191.1,
164.64, 224.36, 262.86, 237.75, 254.88, 42.05, 9.12, 0.04,
12.22, 0.61, 61.86, 114.08, 78.94), x4 = c(90.74, 26.11,
47.86, 10.86, 3.74, 23.69, 61.79, 68.12, 87.92, 171.76, 260.98,
266.62, 96.27, 57.15, 78.89, 16.73, 6.59, 49.44, 57.21, 202.2,
67.17), x5 = c(134.09, 27.06, 7.44, 4.53, 17, 47.66, 95.96,
129.53, 40.23, 157.37, 172.61, 248.56, 160.84, 421.94, 109.93,
22.77, 2.11, 49.18, 64.13, 52.61, 180.87), x6 = c(173.17,
46.68, 6.54, 3.05, 0.35, 0.12, 5.09, 72.46, 58.19, 112.31,
233.77, 215.82, 100.63, 65.84, 2.69, 0.01, 3.63, 12.93, 66.55,
28, 61.74), x7 = c(157.22, 141.81, 19.98, 116.18, 16.55,
122.3, 62.67, 141.84, 78.3, 227.27, 340.22, 351.38, 147.73,
0.3, 56.12, 33.2, 5.51, 54.4, 82.98, 152.66, 218.26), x8 = c(274.08,
51.92, 54.86, 15.37, 0.31, 0.05, 36.3, 162.04, 171.78, 181.39,
310.73, 261.55, 237.99, 123.99, 1.92, 0.74, 0.23, 18.51,
7.68, 65.55, 171.33), x9 = c(262.71, 192.34, 2.75, 21.68,
1.69, 3.92, 0.09, 9.33, 120.36, 282.92, 236.7, 161.59, 255.44,
126.44, 7.63, 2.04, 1.02, 0.12, 5.87, 146.25, 134.11), x10 = c(82.71,
44.09, 1.52, 2.63, 4.38, 28.64, 168.43, 80.62, 20.36, 39.29,
302.31, 247.52, 165.73, 18.27, 2.67, 1.77, 23.13, 53.47,
53.14, 46.61, 86.29)), class = "data.frame", row.names = c(NA,
-21L))
Code:
as.ts(cdata)
library(dplyr) # data wrangling
library(ggplot2) # grammar of graphics
library(ggdendro) # dendrograms
library(TSclust) # cluster time series
cluster analysis
dist_ts <- TSclust::diss(SERIES = t(cdata), METHOD = "INT.PER") # note the data frame must be transposed
hc <- stats::hclust(dist_ts, method="complete") # method can be also "average" or diana (for DIvisive ANAlysis Clustering)
hcdata <- ggdendro::dendro_data(hc)
names_order <- hcdata$labels$label
# Use the following to remove labels from dendogram so not doubling up - but good for checking hcdata$labels$label <- ""
hcdata%>%ggdendro::ggdendrogram(., rotate=FALSE, leaf_labels=FALSE)
I believe the object you are looking for is stored in the variable dist_ts:
dist_ts <- TSclust::diss(SERIES = t(cdata), METHOD = "INT.PER")
print(dist_ts)

How to make months of the year my x-axis using xyplot

Here is my data
[![enter image description here][1]][1]
my code
library(ggplot2)
library(reshape)
dt1 =read.csv("C:/Users/My DELL/Documents/R_data/machine learning/dt1.csv")
head(dt1)
dt1$month <- seq(nrow(dt1))
library(reshape2)
mm <- melt(subset(dt1,select=c(month,EgbeNa,UrejeNa,EroNa,RefNa,EgbeMg,UrejeMg,EroMg,RefMg
)),id.var="month")
head(mm)
library(lattice)
xyplot(value ~ month|variable,data=mm,type="l",
scales=list(y=list(relation="free")),
layout=c(1,8))
dt_repr = structure(list(Date = c("01-11-17", "01-12-17", "01-01-18", "01-02-18",
"01-03-18", "01-04-18", "01-05-18", "01-06-18", "01-07-18", "01-08-18",
"01-09-18", "01-10-18", "01-11-18", "01-12-18", "01-01-19", "01-02-19",
"01-03-19", "01-04-19", "01-05-19", "01-06-19", "01-07-19", "01-08-19",
"01-09-19", "01-10-19"), month = 1:24, EgbeNa = c(27.4, 29.25,
31.1, 20.4, 13.55, 14, 16.25, 18.5, 24.95, 16.2, 30.15, 28.6,
35.1, 36.5, 28.45, 31.5, 38.1, 28, 32.55, 30.5, 33.2, 30.8, 13,
24.3), UrejeNa = c(10.45, 9, 7.55, 13.35, 11.6, 12.475, 20.1625,
27.85, 21.5, 32.05, 17.65, 15.15, 25.7, 18.8, 26.85, 20.65, 23.5,
26.45, 30.2, 25.75, 28.3, 31.45, 44.4, 39.6), EroNa = c(44.45,
40.55, 36.65, 43, 39.825, 36.825, 44.1, 51.65, 44.2, 56.1, 61.3,
66.05, 15.75, 19.15, 13.05, 12.2, 21.7, 17.9, 14.6, 33.3, 21.2,
19.6, 32.7, 25.1), RefNa = c(10.55, 9.75, 12.35, 19.65, 10.6,
13.74, 22.62, 25.82, 20.4, 31.2, 16.95, 14.25, 15.03, 17.15,
12.75, 13.5, 20.45, 16.8, 15.5, 25.4, 19.5, 19.8, 26.7, 25.1),
EgbeMg = c(4.118, 4.7155, 5.313, 4.4865, 5.1535, 5.1295,
5.113, 5.103, 5.721, 5.285, 3.8575, 4.128, 5.4205, 6.2975,
5.134, 5.4605, 5.124, 4.203, 5.2635, 5.135, 6.092, 5.575,
4.139, 4.8645), UrejeMg = c(3.6655, 3.977, 4.288, 4.192,
4.676, 4.434, 4.7005, 4.966, 5.3895, 5.7165, 4.881, 4.1015,
3.743, 6.132, 6.0785, 6.1775, 6.3135, 6.028, 5.739, 6.126,
4.5155, 4.716, 5.2165, 5.678), EroMg = c(2.472, 2.31425,
2.1565, 2.2115, 2.184, 2.135, 4.135, 6.2005, 5.457, 5.981,
5.784, 5.885, 5.406, 5.248, 4.967, 4.449, 5.058, 5.1675,
5.667, 6.966, 5.17, 4.8965, 7.201, 6.538), RefMg = c(3.75,
3.87, 4.82, 4.132, 3.98, 4.23, 4.57, 5.01, 5.02, 4.67, 4.18,
4.51, 5.21, 5.18, 4.76, 4.29, 4.95, 5.07, 5.45, 5.86, 5.11,
4.79, 6.01, 5.24)), class = "data.frame", row.names = c(NA,
-24L)) #This data is reproducible
and the output
I want to use Date as my x-axis, the Date covers 24 months. It starts at 01-11-17 and ends at 01-10-19. Anyone can help please.
It is difficult to provide answers without using your data. You need to provide your data in a usable format as #r2evans says above. However, you can convert your Date row, which appears to be a string, to Date type and use that as your X-axis. You can format how the date should be displayed by adding the format in the scales list.
For example, in your case:
...
scales=list(
y=list(relation="free"),
x = list(format = "%m-%Y") # or whatever format you need
),
...
or whatever format you need.
Here is one way how you could achieve your task:
library(tidyverse)
library(lubridate)
library(lattice)
df <- dt_repr %>%
pivot_longer(
cols = c(-Date, -month),
names_to = "names",
values_to = "values"
) %>%
mutate(Date = dmy(Date))
xyplot(values ~ Date|names,data=df,type="l",
scales=list(y=list(relation="free")),
layout=c(1,8))
I got the solution using this set of instruction:
#From Painless way to install a new version of R?
Run in the old version of R (or via RStudio)
setwd("C:/Temp/")
packages <- installed.packages()[,"Package"]
save(packages, file="Rpackages")
# INSTALL NEW R VERSION
if(!require(installr)) { install.packages("installr"); require(installr)} #load / install+load installr
# See here for more on installr: https://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/
# step by step functions:
check.for.updates.R() # tells you if there is a new version of R or not.
install.R() # download and run the latest R installer
# Install library - run in the new version of R. This calls package names and installs them from repos, thus all packages should be correct to the most recent version
setwd("C:/Temp/")
load("Rpackages")
for (p in setdiff(packages, installed.packages()[,"Package"]))
install.packages(p)
# Installr includes a package migration tool but this simply copies packages, it does not update them
copy.packages.between.libraries() # copy your packages to the newest R installation from the one version before it (if ask=T, it will ask you between which two versions to perform the copying)
Then all the error messages are gone, the missing packages tidyverse and ggplot2 came back and I have my desired plot with expected x axis

How to calculate the 99th percentile of a dataset

I have a dataset of the mean of weights of two sample sizes, I have 100,000 tests and I am trying to find out the 99th percentile but I do not understand how to do so, I have found out the median quartile by doing the following;
summary(Lifts)
Large Small
Min. : 62.5 Min. : 54.2
1st Qu.: 99.1 1st Qu.: 96.0
Median :106.0 Median :106.0
Mean :106.0 Mean :106.0
3rd Qu.:112.9 3rd Qu.:116.0
Max. :147.5 Max. :156.8
I need to find the 99th percentile of both the large and the small, I have tried using the quartile command;
quantile(Lifts, probs = c(0, 0.25, 0.50, 0.99))
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected
But I receive that error
any help would be appreciated
If we specify the column (for example using $ notation) we get rid of the error:
quantile(Lifts$Large, probs = c(0, 0.25, 0.50, 0.99))
Or
quantile(Lifts$Small, probs = c(0, 0.25, 0.50, 0.99))
Generally, to apply a function on all columns of a data frame, we can use lapply, which also works with quantile.
lapply(lifts, quantile, probs=c(0, 0.25, 0.50, 0.99))
# $large
# 0% 25% 50% 99%
# 14.400 161.675 488.450 950.506
#
# $small
# 0% 25% 50% 99%
# 0.900 30.800 43.650 97.744
We also may use sapply which does the same but yields as output a matrix.
sapply(lifts, quantile, probs=c(0, 0.25, 0.50, 0.99))
# large small
# 0% 14.400 0.900
# 25% 161.675 30.800
# 50% 488.450 43.650
# 99% 950.506 97.744
Data
lifts <- structure(list(large = c(489.9, 734.5, 905.6, 41.9, 950.2, 73.9,
444.7, 950.8, 303.9, 539, 399.4, 429.5, 670.2, 39.1, 324.6, 829.6,
97.9, 216.6, 500.1, 364.4, 762.6, 205.7, 191.6, 128.6, 749.2,
185, 736.9, 46.9, 114.2, 774.4, 626.5, 42.5, 52.5, 724.3, 518.3,
932.7, 602.5, 14.4, 794.9, 149.7, 621.6, 674.2, 685.1, 153.9,
42.3, 487, 787.5, 351.6, 689.3, 862.3), small = c(56.5, 63.6,
49.5, 76.7, 78, 25.8, 57.8, 19.2, 27.7, 38.3, 36.4, 4.4, 89.2,
68.8, 36.1, 71.8, 69.1, 35.8, 38.2, 26.9, 95.5, 30.7, 43.2, 58.8,
44.1, 35.4, 91.2, 37.1, 99.9, 94.5, 52, 38.2, 40.1, 50.9, 81.7,
7.5, 77.5, 71.9, 70.6, 8.2, 90.1, 31.1, 3.4, 52, 0.9, 30.5, 12.7,
45.6, 34.2, 13.4)), class = "data.frame", row.names = c(NA, -50L
))

Resources