I use glm monthly to calculate a binomial model on the payment behaviour of a credit database, using a call like:
modelx = glm(paid ~ ., data = credit_db, family = binomial())
For the last month, I used R version 3.2.2 (just recently upgraded) and the results were very different from the previous month's (done with R version 3.2.0). To check the code, I repeated the previous month's calculations with version 3.2.2 and got different results from the earlier calculation done in R 3.2.0.
The coefficients are also wildly different. I start with an exploratory model that includes a variable holding the average number of delinquency days during the month, which should yield low coefficients for low averages. In version 3.2.0, an extract of summary(modelx) was:
## Coefficients: Estimate Std. Error z value
## delinquency_avg_days1 -0.59329 0.18581 -3.193
## delinquency_avg_days2 -1.32286 0.19830 -6.671
## delinquency_avg_days3 -1.47359 0.21986 -6.702
## delinquency_avg_days4 -1.64158 0.21653 -7.581
## delinquency_avg_days5 -2.56311 0.25234 -10.158
## delinquency_avg_days6 -2.59042 0.25886 -10.007
and for version 3.2.2
## Coefficients Estimate Std. Error z value
## delinquency_avg_days.L -1.320e+01 1.083e+03 -0.012
## delinquency_avg_days.Q -1.140e+00 1.169e+03 -0.001
## delinquency_avg_days.C 3.439e+00 1.118e+03 0.003
## delinquency_avg_days^4 8.454e+00 1.020e+03 0.008
## delinquency_avg_days^5 3.733e+00 9.362e+02 0.004
## delinquency_avg_days^6 -4.988e+00 9.348e+02 -0.005
The summary output is a little different, since the Pr(>|z|) is shown. Notice that the coefficient names changed too.
In the dataset, this delinquency_avg_days variable has the following distribution (0 is "not paid", 1 is "paid"). As you can see, coefficients might be large for average days larger than 20 or so. The number of "paid" cases was sampled to closely match the number of "not paid" cases.
0 1
0 140 663
1 59 209
2 62 118
3 56 87
4 66 50
5 69 41
6 64 40
7 78 30
8 75 31
9 70 29
10 77 23
11 69 18
12 79 17
13 61 13
14 53 5
15 67 18
16 50 10
17 40 9
18 39 8
19 23 9
20 24 2
21 36 9
22 35 1
23 17 0
24 11 0
25 11 0
26 7 1
27 3 0
28 0 0
29 0 1
30 1 0
In previous months, I used this exploratory model to create a second binomial model using ranges of average delinquency days. But this other model gives similar results with a few levels.
Now, I'd like to know whether there are substantial changes that require specifying other parameters or there is an issue with glm in version 3.2.2.
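One thing the changed coefficient names hint at (nothing in the post confirms it): suffixes such as .L, .Q, .C and ^4 are what R produces for an ordered factor under polynomial contrasts, while names like delinquency_avg_days1 come from an unordered factor under treatment contrasts. The sketch below uses simulated data and made-up variable names, not the credit database, and only illustrates that the two codings give different coefficient names while describing the same fitted model.

```r
# Simulated stand-in data; 'days' and 'paid' are illustrative names only
set.seed(1)
days <- sample(0:6, 500, replace = TRUE)
paid <- rbinom(500, 1, plogis(1 - 0.4 * days))

d_unordered <- data.frame(paid, days = factor(days))                  # treatment contrasts
d_ordered   <- data.frame(paid, days = factor(days, ordered = TRUE))  # polynomial contrasts

fit_u <- glm(paid ~ days, data = d_unordered, family = binomial())
fit_o <- glm(paid ~ days, data = d_ordered,   family = binomial())

names(coef(fit_u))  # (Intercept), days1, ..., days6
names(coef(fit_o))  # (Intercept), days.L, days.Q, days.C, days^4, days^5, days^6

# Same fitted probabilities, only the parameterisation differs
all.equal(unname(fitted(fit_u)), unname(fitted(fit_o)), tolerance = 1e-6)
```

If this is what happened, comparing class(credit_db$delinquency_avg_days) between the two runs would show it.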


Creating and plotting confidence intervals

I have fitted a Poisson GLM to my data, and I now wish to create 95% CIs and plot them with my data. I'm having a couple of issues when plotting: I can't get the intervals to capture my data, they just seem to plot the same line as the model without capturing the data points. I'm also unsure whether I've created my CIs for the mean correctly. My data and code are below, in case anyone knows how to fix this.
data used
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
my code used to create the model and intervals before plotting
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=TRUE, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
This is the outcome I currently get, with no data points. The model line is correct, but the CI isn't, and the data points are missing, so I think the CIs are being computed incorrectly somewhere.
Edit: Response to OP's providing full data set.
This started out as a question about plotting data and models on the same graph, but has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots it seems clear that poisson glm is just not a good model. To say it differently, the number of cases may vary with date, but is also influenced by other things not in your model (external regressors).
Plotting just your data suggests strongly that you have at least two and perhaps more regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one actually). The code below uses the segmented package to build successive poisson glm using 1 breakpoint (two regimes).
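library(data.table); library(ggplot2); library(segmented)
setDT(aids) # convert aids to a data.table
# fitted values from a Poisson GLM with one estimated breakpoint in date
aids[, pred:=
  fitted(segmented(glm(cases~date, .SD, family = poisson), seg.Z = ~date, npsi=1))]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))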
Note that we need to tell segmented the number of breakpoints, but not where they are: the algorithm figures that out for you. So here, we see a regime prior to 3Q87 which is well modeled using a Poisson GLM, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
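get.pred <- \(p.n, p.DT) {
  fit     <- glm(cases~date, p.DT, family=poisson)
  fit.seg <- segmented(fit, seg.Z = ~date, npsi=p.n)
  # predictions and standard errors on the response scale
  predict(fit.seg, type='response', se.fit=TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se'):=get.pred(x, .SD)][, npsi:=x] } ))
ggplot(gg.dt, aes(x=date))+
  geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
  geom_line(aes(y=pred))+ geom_point(aes(y=cases))+
  facet_wrap(~npsi) # one panel per number of breakpoints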
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the poisson glm the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
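setDT(aids) # convert aids to a data.table
# predict.glm with se.fit=TRUE returns fit, se.fit and residual.scale
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
  geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
  geom_line(aes(y=pred))+ geom_point(aes(y=cases))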
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
  stat_smooth(method = glm, method.args=list(family=poisson))+
  geom_point()
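For reference, here is a base-R sketch of the interval calculation the question was attempting. It assumes a log link, so the limits are built on the linear-predictor scale and mapped back with exp() rather than the inverse logit. The data frame holds only the first 12 rows of the posted data to keep the example short, and the names fit, upper and lower are illustrative. Note that this is an interval for the mean, so it is not expected to enclose most of the individual data points.

```r
# First 12 rows of the posted data (cases and date columns only)
aids <- data.frame(
  cases = c(2, 6, 10, 8, 12, 9, 28, 28, 36, 32, 46, 47),
  date  = seq(83, by = 0.25, length.out = 12)
)

model3 <- glm(cases ~ date, data = aids, family = poisson(link = "log"))

# Predictions and standard errors on the link (log) scale
p <- predict(model3, newdata = aids, type = "link", se.fit = TRUE)

# Back-transform with the inverse of the log link
fit   <- exp(p$fit)
upper <- exp(p$fit + 1.96 * p$se.fit)
lower <- exp(p$fit - 1.96 * p$se.fit)

# Data first, then add the fitted mean and its envelope to the same plot
plot(aids$date, aids$cases, xlab = "Date", ylab = "Cases", pch = 20)
lines(aids$date, fit)
lines(aids$date, upper, lty = 2)
lines(aids$date, lower, lty = 2)
```

Using lines() instead of a second plot() call is what keeps the data points on the figure.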

evaluating neural network performance

I trained my neural network with a sigmoid activation function, so the predicted values lie in the range [0,1). However, the real data, to which a z-score transformation has been applied, ranges beyond [0,1). In this case, what would be the appropriate way to evaluate my model? Should I also rescale the original test data to the same range and then evaluate with criteria like mean squared forecast error?
> real_predicted_neural
predicted real
1 1.909219e-07 -3.57877473
2 4.161819e-08 -2.28704595
3 1.754706e-11 -1.08509429
4 1.149891e-13 -0.46573114
5 7.777560e-02 0.42381300
6 4.173448e-07 -0.44060297
7 1.119703e-01 0.21075550
8 8.682557e-01 -0.01292402
9 4.736056e-08 -0.29830701
10 7.506821e-08 -1.20302227
11 7.341235e-01 -0.03986571
12 7.501776e-05 -0.94315815
13 1.145697e-04 0.49730175
14 2.214929e-13 0.04252241
15 4.597199e-01 -0.38539901
16 2.324931e-03 -0.74468628
17 4.366025e-06 -0.77037244
18 1.394450e-06 0.16679048
19 5.869884e-11 -0.75876486
20 1.817941e-04 0.04303387
21 7.060773e-04 0.06099372
22 8.267170e-06 -1.21687318
23 9.388680e-02 0.61135319
24 1.099290e-01 0.55715201
25 9.757236e-01 -0.33480226
26 9.544055e-01 0.09061006
27 7.322074e-07 0.09290822
28 1.014327e-06 -0.61658893
29 7.848382e-08 -0.78739456
30 1.791908e-04 -0.44073540
31 1.357918e-03 -0.22099008
32 5.192233e-06 -0.32744703
33 2.624779e-06 -0.37644068
34 6.414216e-02 -0.36947939
35 1.388143e-06 -0.00994845
36 3.010872e-05 -0.05984833
37 9.873201e-03 -0.21815268
38 3.896163e-04 -0.24009094
39 2.718760e-02 0.33383333
40 1.025650e-02 0.09779867
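One common option (an assumption here, not something the question settles) is to map the targets into [0,1] with a min-max transform, train and predict on that scale, and then invert the transform before computing the error. The sketch below uses simulated numbers rather than the table above, and to_unit, from_unit and the fake predictions are illustrative names only.

```r
set.seed(42)
real_z <- rnorm(40)                 # stand-in for z-scored targets

# Bounds for the min-max transform; in practice these would come from the training set
rng <- range(real_z)
to_unit   <- function(x) (x - rng[1]) / (rng[2] - rng[1])
from_unit <- function(u) u * (rng[2] - rng[1]) + rng[1]

# Fake "sigmoid outputs": rescaled targets plus noise, clipped to [0, 1]
pred_unit <- pmin(pmax(to_unit(real_z) + rnorm(40, sd = 0.05), 0), 1)

# Error on the [0, 1] scale and on the original z-score scale
mse_unit <- mean((to_unit(real_z) - pred_unit)^2)
pred_z   <- from_unit(pred_unit)
mse_z    <- mean((real_z - pred_z)^2)
```

The two errors differ only by the squared width of the range, so either scale ranks models the same way as long as predictions and targets are compared on the same one.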

Clustering biological sequences based on numeric values

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers which represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list of sequences has over 10^5 entries (hence the need for computational efficiency).
I then convert these sequences into numeric vectors by the following:
key <- HDMD::AAMetric.Atchley
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p = 13
output <- do.call(cbind, lapply(1:p, function(i)
  m1[seq(i, nrow(m1), by = p), ]))
I want to cluster output (which now holds 65-dimensional vectors) in an efficient way.
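To make the layout of that matrix concrete, here is a small self-contained sketch of the same encoding. It does not use HDMD: key is a random stand-in for the Atchley table (20 amino acids by 5 factors), and the column names f1..f5 and the _pos suffix are invented for illustration.

```r
set.seed(1)
aa  <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
# Hypothetical stand-in for HDMD::AAMetric.Atchley (random values, same shape)
key <- matrix(rnorm(20 * 5), nrow = 20, dimnames = list(aa, paste0("f", 1:5)))

p <- 13
sequences <- replicate(100, paste(sample(aa, p, replace = TRUE), collapse = ""))

# One row per sequence, one column per position
chars <- do.call(rbind, strsplit(sequences, ""))

# For each position, look up the 5 factors of the residue there, then bind side by side
output <- do.call(cbind, lapply(seq_len(p), function(i) key[chars[, i], , drop = FALSE]))
colnames(output) <- paste0(rep(colnames(key), times = p), "_pos", rep(seq_len(p), each = 5))
rownames(output) <- NULL

dim(output)  # one row per sequence, 5 * 13 columns
```

Columns 1 to 5 describe position 1, columns 6 to 10 position 2, and so on.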
I was originally using Mini-batch kmeans, but I noticed the results were very inconsistent when I repeated. I need a consistent clustering approach.
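As a point of comparison for the repeatability issue, the sketch below uses plain stats::kmeans on simulated data (not mini-batch k-means and not the sequence data). It only illustrates two generic levers, a fixed random seed and several random starts via nstart; whether they are enough for the real data is an open question.

```r
set.seed(123)
# Two well-separated simulated groups, 50 points each, in 4 dimensions
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 4),
           matrix(rnorm(200, mean = 5), ncol = 4))

# Fix the seed before each run and keep the best of 25 random starts
run <- function(seed) {
  set.seed(seed)
  kmeans(x, centers = 2, nstart = 25)
}

a <- run(1)
b <- run(1)
identical(a$cluster, b$cluster)  # same seed, same assignment
```

With different seeds the cluster labels may be permuted, so comparing tot.withinss or a label-invariant index is safer than comparing the labels directly.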
I am also concerned about the curse of dimensionality: at 65 dimensions, Euclidean distance may not work well.
Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there are no outliers or noise.
In addition to this, feature selection will not work, as each of the properties of each amino acid and each amino acid are relevant in the biological context.
How would you recommend clustering these vectors?
I think self organizing maps can be of help here - at least the implementation is quite fast so you will know soon enough if it is helpful or not:
using the data from the op along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
You define the number of clusters in advance:
library(SOMbrero) # provides trainSOM() and superClass()
fit <- trainSOM(x.data = output, dimension = c(5, 5), nb.save = 10, maxit = 2000,
                scaling="none", radius.type = "gaussian")
The nb.save argument stores intermediate steps, which allows further exploration of how the training developed during the iterations:
plot(fit, what ="energy")
It seems that more iterations are in order.
check the frequency of clusters:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
predict clusters based on new data:
predict(fit, output[1:20,])
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
check which variables were important for clustering:
#part of output
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
find optimal number of clusters:
fit1 <- superClass(fit, k = 4)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
Much more in this vignette

R: calculate p-value of t-test using qt-function

We are testing whether a computer's performance increases after its OS update by comparing the performance of 10 different programs before and after the update, which results in:
Program: #1 #2 #3 #4 #5
before: 34 29 32 27 28
after: 32 34 36 27 28
Now we should do a t-test by calculating t ourselves and obtaining only the p-value with the qt function in R. But how do I use the qt function to get my p-value?
I calculated t as -0.6486 (t.test in R says -0.64854, so close enough) and df is 8.
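A short sketch of how the pieces fit together, assuming the pooled two-sample test that matches df = 8 and a two-sided alternative (the question does not say which alternative is wanted). In R, pt() is the distribution function that yields a p-value from t, while qt() is its inverse and gives the critical value to compare t against. Variable names are illustrative.

```r
before <- c(34, 29, 32, 27, 28)
after  <- c(32, 34, 36, 27, 28)

n1 <- length(before); n2 <- length(after)
df <- n1 + n2 - 2

# Pooled variance and t statistic, computed by hand
sp2    <- ((n1 - 1) * var(before) + (n2 - 1) * var(after)) / df
t_stat <- (mean(before) - mean(after)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# p-value from the t distribution function
p_two_sided <- 2 * pt(-abs(t_stat), df)

# qt gives the critical value for a 5% two-sided test
t_crit <- qt(0.975, df)
reject <- abs(t_stat) > t_crit

# qt inverts pt: feeding the p-value back recovers |t|
qt(1 - p_two_sided / 2, df)
```

So with qt alone you can decide the test by comparing |t| with the critical value; to report the p-value itself, pt is the direct route.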

Barchart help in R

I am trying to set up a bar chart to compare control and experimental samples taken of specific compounds. The data set is known as 'hydrocarbon3' and contains the following information:
Exp. Contr.
c12 89 49
c17 79 30
c26 78 35
c42 63 3
pris 0.5 0.8
phy 0.5 0.9
nap 87 48
nap1 83 44
nap2 78 44
nap3 73 20
acen1 81 50
acen2 86 46
fluor 83 11
fluor1 68 13
fluor2 79 17
dibe 65 7
dibe1 67 6
dibe2 56 10
phen 82 13
phen1 70 12
phen2 65 15
phen3 53 14
fluro 62 9
pyren 48 11
pyren1 34 10
pyren2 19 8
chrys 22 3
chrys1 21 3
chrys2 21 3
When I create a bar chart with the formula:
barplot(as.matrix(hydrocarbon3), beside=T,
main=c("Fig 1. Change in concentrations of different hydrocarbon compounds\nin sediments with and without the presence of bacteria after 21 days"),
xlab="Oiled sediment samples collected at 21 days",
ylab="% loss in concentration relative to day 0")
I receive this diagram. However, I need the control and experimental samples of each chemical to be next to each other to allow a more accurate comparison, rather than the experimental samples bunched on the left and the control samples bunched on the right. Is there a way to correct this in R?
Try transposing your matrix:
barplot(t(as.matrix(hydrocarbon3)), beside=T)
Basically, barplot will plot things in the order they show up in the matrix, which, since a matrix is just a vector wrapped colwise, means barplot will plot all the values of the first column, then all those of the second column, etc.
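A small sketch of the transposition, using only the first four rows of the posted table so it runs on its own (the legend and las settings are optional extras, not part of the answer above):

```r
# First four compounds from the question's table
hydrocarbon3 <- data.frame(
  Exp.   = c(89, 79, 78, 63),
  Contr. = c(49, 30, 35, 3),
  row.names = c("c12", "c17", "c26", "c42")
)

# After transposing, each column is one compound and each row one treatment,
# so beside = TRUE draws the Exp. and Contr. bars of a compound next to each other
m <- t(as.matrix(hydrocarbon3))
mids <- barplot(m, beside = TRUE, legend.text = rownames(m), las = 2)
```

barplot returns the bar midpoints, here one column per compound, which is handy if you later want to add labels or error bars.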
Check this question out: Barplot with 2 variables side by side
It uses ggplot2, so you'll have to use the following code before running it:
library(ggplot2)
Hopefully this works for you. Plus it looks a little nicer with ggplot2!
> df
row exp con
1 a 1 2
2 b 2 3
3 c 3 4
> barplot(rbind(df$exp,df$con),
+ beside = TRUE,names.arg=df$row)
