Interpretation of the output of R function bs() (B-spline basis matrix)

I often use B-splines for regression. Up to now I've never needed to understand the output of bs in detail: I would just choose the model I was interested in and fit it with lm. However, I now need to reproduce a B-spline model in external (non-R) code. So, what's the meaning of the matrix generated by bs? Example:
library(splines)
x <- c(0.0, 11.0, 17.9, 49.3, 77.4)
bs(x, df = 3, degree = 1) # generate degree 1 (linear) B-splines with 2 internal knots
# 1 2 3
# [1,] 0.0000000 0.0000000 0.0000000
# [2,] 0.8270677 0.0000000 0.0000000
# [3,] 0.8198433 0.1801567 0.0000000
# [4,] 0.0000000 0.7286085 0.2713915
# [5,] 0.0000000 0.0000000 1.0000000
# attr(,"degree")
# [1] 1
# attr(,"knots")
# 33.33333% 66.66667%
# 13.30000 38.83333
# attr(,"Boundary.knots")
# [1] 0.0 77.4
# attr(,"intercept")
# [1] FALSE
# attr(,"class")
# [1] "bs" "basis" "matrix"
OK, so degree is 1, as I specified in input. knots is telling me that the two internal knots are at x = 13.30000 and x = 38.83333 respectively. I was a bit surprised to see that the knots are at fixed quantiles; I had hoped R would find the best quantiles for my data, but of course that would make the model nonlinear, and it also wouldn't be possible without knowing the response data. intercept = FALSE means that no intercept was included in the basis (is that a good thing? I've always been taught not to fit linear models without an intercept... well, I guess lm is just adding one anyway).
However, what about the matrix? I don't really understand how to interpret it. Since it has three columns, I take it that there are three basis functions. That makes sense: if I have two internal knots K1 and K2, I will have a spline between the left boundary knot B1 and K1, another spline between K1 and K2, and a final one between K2 and B2, so... three basis functions, ok. But which are the basis functions exactly? For example, what does this column mean?
# 1
# [1,] 0.0000000
# [2,] 0.8270677
# [3,] 0.8198433
# [4,] 0.0000000
# [5,] 0.0000000
EDIT: this is similar to, but not precisely the same as, this question. That question asks about the interpretation of the regression coefficients, but I'm a step before that: I would like to understand the meaning of the entries of the model matrix. If I try to make the same plots as suggested in the first answer, I get messed-up plots:
b <- bs(x, df = 3, degree = 1)
b1 <- b[, 1] ## basis 1
b2 <- b[, 2] ## basis 2
b3 <- b[, 3] ## basis 3
par(mfrow = c(1, 3))
plot(x, b1, type = "l", main = "basis 1: b1")
plot(x, b2, type = "l", main = "basis 2: b2")
plot(x, b3, type = "l", main = "basis 3: b3")
These can't be the B-spline basis functions, because they have too many knots (each function should only have one).
The second answer would actually allow me to reconstruct my model outside R, so I guess I could go with that. However, that answer doesn't exactly explain what the elements of the b matrix are either: it deals with the coefficients of a linear regression, which I still haven't introduced here. It's true that that is my final goal, but I also wanted to understand this intermediate step.

The matrix b
# 1 2 3
# [1,] 0.0000000 0.0000000 0.0000000
# [2,] 0.8270677 0.0000000 0.0000000
# [3,] 0.8198433 0.1801567 0.0000000
# [4,] 0.0000000 0.7286085 0.2713915
# [5,] 0.0000000 0.0000000 1.0000000
is actually just the matrix of the values of the three basis functions at each point of x, which should have been obvious to me since it's exactly the same interpretation as for a polynomial linear model. As a matter of fact, since the boundary knots are
bknots <- attr(b,"Boundary.knots")
# [1] 0.0 77.4
and the internal knots are
iknots <- attr(b,"knots")
# 33.33333% 66.66667%
# 13.30000 38.83333
then the three basis functions, as shown here, are:
knots <- c(bknots[1], iknots, bknots[2])
y1 <- c(0, 1, 0, 0)
y2 <- c(0, 0, 1, 0)
y3 <- c(0, 0, 0, 1)
par(mfrow = c(1, 3))
plot(knots, y1, type = "l", main = "basis 1: b1")
plot(knots, y2, type = "l", main = "basis 2: b2")
plot(knots, y3, type = "l", main = "basis 3: b3")
Now, consider b[,1]
# 1
# [1,] 0.0000000
# [2,] 0.8270677
# [3,] 0.8198433
# [4,] 0.0000000
# [5,] 0.0000000
These must be the values of b1 in x <- c(0.0, 11.0, 17.9, 49.3, 77.4). As a matter of fact, b1 is 0 in knots[1] = 0 and 1 in knots[2] = 13.3000, meaning that in x[2] (11.0) the value must be 11/13.3 = 0.8270677, as expected. Similarly, since b1 is 0 for knots[3] = 38.83333, the value in x[3] (17.9) must be (38.83333-13.3)/17.9 = 0.8198433. Since x[4], x[5] > knots[3] = 38.83333, b1 is 0 there. A similar interpretation can be given for the other two columns.
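Incidentally, the messed-up plots in the question are just an artifact of plotting the basis at only the five data points. One way to see the true shapes (a sketch of my own, using the predict method that splines provides for bs objects) is to evaluate the same basis on a fine grid:
## evaluate the fitted basis on a dense grid instead of only the 5 data points
xx <- seq(bknots[1], bknots[2], length.out = 201)
bb <- predict(b, xx)  # predict.bs evaluates the same basis at new x values
par(mfrow = c(1, 3))
for (k in 1:3) plot(xx, bb[, k], type = "l", main = paste("basis", k))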

Just a small correction to the excellent answer by @DeltaIV above (it looks like I cannot comment).
So in b1, when he calculated b1(x[3]), it should be (38.83333 - 17.9)/(38.83333 - 13.3) = 0.8198433 by linear interpolation. Everything else is perfect.
Note b1 should look like this:
$$b_1(t) = \frac{t}{13.3}\,I(0 \le t < 13.3) + \frac{38.83333 - t}{38.83333 - 13.3}\,I(13.3 \le t < 38.83333)$$
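To make that concrete, here is a minimal sketch (a check of my own, not part of the answers above) that rebuilds all three degree-1 basis functions as piecewise-linear "tent" functions over the knot sequence; the third basis is a half-tent because the right boundary knot is repeated:
library(splines)
x <- c(0.0, 11.0, 17.9, 49.3, 77.4)
k <- c(0, 13.3, 38.83333, 77.4)  # boundary and internal knots
tent <- function(t, l, p, r) {   # 0 at l and r, 1 at the peak p
  ifelse(t >= l & t <= p, (t - l) / (p - l),
         ifelse(t > p & t <= r, (r - t) / (r - p), 0))
}
B <- cbind(tent(x, k[1], k[2], k[3]),                     # b1
           tent(x, k[2], k[3], k[4]),                     # b2
           pmax(pmin((x - k[3]) / (k[4] - k[3]), 1), 0))  # b3: half-tent
max(abs(B - bs(x, df = 3, degree = 1)))  # ~0, i.e. it reproduces the bs() matrix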

Related

Calculating a distance matrix by dtw

I have two matrices of normalized read counts for control and treatment in a time series, day 1 to day 26. I want to calculate a distance matrix by Dynamic Time Warping and afterwards use it for clustering, but it seems too complicated. Here is what I did; can anyone help clarify? Thanks a lot
> head(control[,1:4])
MAST2 WWC2 PHYHIPL R3HDM2
Control_D1 6.591024 5.695156 3.388652 5.756384
Control_D2 8.043454 5.365221 6.859768 6.936970
Control_D3 7.731590 4.868267 6.919972 6.931073
Control_D4 8.129948 5.105528 6.627016 7.090268
Control_D5 7.690863 4.729501 6.824746 6.904610
Control_D6 8.101723 5.334501 6.868990 7.115883
>
> head(lead[,1:4])
MAST2 WWC2 PHYHIPL R3HDM2
Lead30_D1 6.418423 5.610699 3.734425 5.778046
Lead30_D2 7.918360 4.295191 6.559294 6.780952
Lead30_D3 7.807142 4.294722 6.599187 6.716040
Lead30_D4 7.856720 4.432136 6.572337 6.848483
Lead30_D5 7.827311 4.204738 6.607107 6.784094
Lead30_D6 7.848760 4.458451 6.581216 6.943003
>
> dim(control)
[1] 26 2603
> dim(lead)
[1] 26 2603
library(dtw)
for (i in control) {
  for (j in lead) {
    result[i,j] <- dtw( dist(control[,,i],lead[,,j]), distance.only=T )$normalizedDistance
  }
}
This fails with:
Error in lead[, , j] : incorrect number of dimensions
There have already been questions similar to yours, but the answers haven't been too detailed. Here's a breakdown of what you need to know, in the specific case of R.
Calculating cross-distance matrices
The proxy package is made specifically for the calculation of cross-distance matrices. You should check its vignette to see which measures it already implements. An example of its use:
set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)
suppressPackageStartupMessages(library(proxy))
distance_matrix <- proxy::dist(sample_data, method = "euclidean",
                               upper = TRUE, diag = TRUE)
print(distance_matrix)
#> 1 2 3 4 5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000
Note: in the context of time series, proxy treats each row of a matrix as a series, which is confirmed by the fact that sample_data above is a 5x10 matrix and the resulting cross-distance matrix is 5x5.
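A practical note of my own: if your series are stored as columns instead (as in the control and lead matrices of the question), transpose first so that each row becomes one series.
## sample_data is 5x10; after t() each of the 10 original columns is a series
col_series_dist <- proxy::dist(t(sample_data), method = "euclidean")
dim(as.matrix(col_series_dist))  # 10 x 10: one entry per pair of columns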
Using the DTW distance
The dtw package implements many variations of DTW, and it also leverages proxy. You could calculate a DTW distance matrix with:
suppressPackageStartupMessages(library(dtw))
dtw_distmat <- proxy::dist(sample_data, method = "dtw",
                           upper = TRUE, diag = TRUE)
print(dtw_distmat)
Using custom distances
One nice thing about proxy is that it gives you the option to register custom functions. You seem to be interested in the normalized version of DTW, so you could do something like this:
ndtw <- function(x, y = NULL, ...) {
  dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}
pr_DB$set_entry(
  FUN = ndtw,
  names = "ndtw",
  loop = TRUE,
  distance = TRUE
)
ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
                            upper = TRUE, diag = TRUE)
print(ndtw_distmat)
#> 1 2 3 4 5
#> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
See the documentation of pr_DB for more information.
Other DTW implementations
The dtwclust package
(which I made)
implements a basic but faster version of DTW which can use multi-threading and also leverages proxy:
suppressPackageStartupMessages(library(dtwclust))
dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
print(dtw_basic_distmat)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
The dtw_basic implementation only supports two step patterns and one window type, but it is considerably faster:
suppressPackageStartupMessages(library(microbenchmark))
microbenchmark(
  proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
  proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
)
Unit: microseconds
                                                                                    expr      min       lq     mean   median       uq       max neval cld
 proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L) 5279.124 5621.742 6070.069 5802.354 6348.199 10411.000   100   b
                        proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)  657.966  710.418  776.474  752.282  814.037  1161.626   100  a
Another multi-threaded implementation is included in the parallelDist package, although I haven't personally tested it.
Multivariate or multi-dimensional time series
A single multivariate series is commonly a matrix where time spans the rows and the multiple variables span the columns. DTW also works for them:
mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
#> [1] 22.80421
The nice thing about proxy is that it can calculate distances between objects contained in lists too, so you can put several multivariate series in lists of matrices:
mv_series <- lapply(1L:5L, function(dummy) {
  matrix(rnorm(15L), nrow = 5L, ncol = 3L)
})
mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
print(mv_distmat_dtwclust)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00000 27.43599 32.14207 36.42211 31.19279
#> [2,] 27.43599 0.00000 20.88470 23.88436 29.73219
#> [3,] 32.14207 20.88470 0.00000 22.14376 29.99899
#> [4,] 36.42211 23.88436 22.14376 0.00000 28.81111
#> [5,] 31.19279 29.73219 29.99899 28.81111 0.00000
Your case
Regardless of what you choose, you can probably use proxy to get your result, but since you haven't provided your whole data, I can't give you a more specific example. I presume that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE) would give you the distance between one pair of series, assuming you're treating each one as a multivariate series with 4 variables.
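As a hedged sketch of the alternative reading (assuming instead that each gene column is one univariate series of 26 points): proxy::dist also accepts two matrices and returns the cross-distance matrix between the rows of the first and the rows of the second, so transposing gives all control-vs-lead gene pairs. Restricted to the first 4 genes to keep it small:
## uses the ndtw entry registered above; result is a 4 x 4 cross-distance matrix
cross_distmat <- proxy::dist(t(control[, 1:4]), t(lead[, 1:4]), method = "ndtw")
cross_distmat  # [i, j] = normalized DTW between control gene i and lead gene j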
If your question is "why am I getting this error?", the answer is that you're trying to subset a matrix, which is a two-dimensional array, along a 3rd dimension. See:
dim(lead)
# [1] 26 2603
lead[,,6.418423] # yes, that's the value j has the first time through the loop
# This will reproduce your error
lead[,,1]
# This will also reproduce your error
Hopefully you can see now that you have a few problems:
You're trying to subset a matrix along a 3rd dimension.
Your i and j values are the values in control and lead respectively. You can use them as values, or you can generate the index instead, e.g., for (i in seq_along(control)), if you're planning to use it for something other than getting that same value out.
Taking it to the next step, it's unclear what you want to pass to the dist function. dist takes a single matrix and computes the distances between its rows. You seem to be trying to pass it two values from two different matrices, or perhaps two subsets of two different matrices. It looks like you might need to go back and look at the examples in the relevant documentation. A corrected sketch follows below.
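If you do want an explicit double loop, here is a hedged sketch of a corrected version (my assumptions: each gene column is a series over the 26 days, restricted to the first 4 genes for illustration):
library(dtw)
genes <- colnames(control)[1:4]
result <- matrix(NA_real_, length(genes), length(genes),
                 dimnames = list(genes, genes))
for (i in seq_along(genes)) {  # index the columns, don't loop over the values
  for (j in seq_along(genes)) {
    result[i, j] <- dtw(control[, genes[i]], lead[, genes[j]],
                        distance.only = TRUE)$normalizedDistance
  }
}
result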

Rolling PCA and plotting proportional variance of principal components

I'm using the following code to perform PCA:
PCA <- prcomp(Ret1, center = TRUE, scale. = TRUE)
summary(PCA)
I get the following result:
#Importance of components:
# PC1 PC2 PC3 PC4
#Standard deviation 1.6338 0.9675 0.60446 0.17051
#Proportion of Variance 0.6673 0.2340 0.09134 0.00727
#Cumulative Proportion 0.6673 0.9014 0.99273 1.00000
What I would like to do is a rolling PCA over a specific window (e.g. 180 days). The result should be a matrix which shows the evolution of the "Proportion of Variance" of all principal components through time.
I tried it with
rollapply(Ret1, 180, prcomp)
but this doesn't work, and I have no idea how to save the "Proportion of Variance" for each time step in a matrix.
The output matrix should look like this:
# PC1 PC2 PC3 PC4
#Period 1 0.6673 0.2340 0.09134 0.00727
#Period 2 0.7673 0.1340 0.09134 0.00727
# ....
Here is a mini subset of my data Ret1:
Cats Dogs Human Frogs
2016-12-13 0.0084041063 6.518479e-03 6.096295e-04 5.781271e-03
2016-12-14 -0.0035340384 -8.150321e-03 4.418382e-04 -5.978296e-03
2016-12-15 0.0107522782 3.875708e-03 -1.784663e-02 3.012253e-03
2016-12-16 0.0033034130 -1.752174e-03 -1.753624e-03 -4.448850e-04
2016-12-17 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
2016-12-18 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
2016-12-19 0.0019876743 1.973190e-03 -8.577261e-03 1.996151e-03
2016-12-20 0.0033235161 3.630921e-03 -4.757395e-03 4.594355e-03
2016-12-21 0.0003401156 -2.460351e-03 3.708875e-03 -1.636413e-03
2016-12-22 -0.0010940147 -1.864724e-03 -7.991572e-03 -1.158029e-03
2016-12-23 -0.0005387228 1.250898e-03 -2.843725e-03 7.492594e-04
2016-12-24 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
2016-12-25 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
2016-12-26 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
2016-12-27 0.0019465877 2.245918e-03 0.000000e+00 5.632058e-04
2016-12-28 0.0002396803 -8.391658e-03 8.307552e-03 -5.598988e-03
2016-12-29 -0.0020884556 -2.933868e-04 1.661246e-03 -7.010738e-04
2016-12-30 0.0026172923 -4.647865e-03 9.574997e-03 -2.889166e-03
I tried the following:
PCA <- function(x) {
  Output <- cumsum((apply((prcomp(x, center = TRUE, scale. = TRUE))$x, 2, var)) / sum(vars))
  return(Output)
}
window <- 10
data <- Ret1
result <- rollapply(data, window, PCA)
plot(result)
#Gives you the Proportion of Variance = cumsum((apply((prcomp(x,center = TRUE, scale. = TRUE))$x, 2, var))/sum(vars))
First, the correct function for your purpose may be written as follows, using the $sdev result of prcomp. I have dropped center = TRUE since it is the function default; note that prcomp defaults to scale. = FALSE, so add scale. = TRUE back if you want the scaled version.
PCA <- function(x) {
  oo <- prcomp(x)$sdev
  oo / sum(oo)
}
Now we can easily use sapply to do the rolling operation:
## for your mini dataset of 18 rows
window <- 10
n <- nrow(Ret1)
oo <- sapply(seq_len(n - window + 1), function (i) PCA(Ret1[i:(i + window - 1), ]))
oo <- t(oo) ## an extra transposition as `sapply` does `cbind`
# [,1] [,2] [,3] [,4]
# [1,] 0.5206345 0.3251099 0.12789683 0.02635877
# [2,] 0.5722264 0.2493518 0.14588631 0.03253553
# [3,] 0.6051199 0.1973694 0.16151859 0.03599217
# [4,] 0.5195527 0.2874197 0.16497219 0.02805543
# [5,] 0.5682829 0.3100708 0.09456654 0.02707977
# [6,] 0.5344804 0.3149862 0.08912882 0.06140464
# [7,] 0.5954948 0.2542775 0.10434155 0.04588616
# [8,] 0.5627977 0.2581071 0.13068875 0.04840648
# [9,] 0.6089650 0.2559285 0.11022974 0.02487672
Each column is a PC, while each row gives the proportional variance of each component in that period.
To further plot the result, you can use matplot:
matplot(oo, type = "l", lty = 1, col = 1:4,
xlab = "period", ylab = "proportional variance")
PCs 1-4 are drawn in colours 1:4, i.e., "black", "red", "green" and "blue".
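If you want the colours to be self-explanatory, you could also add a legend (my addition):
legend("topright", legend = paste0("PC", 1:4), col = 1:4, lty = 1, bty = "n")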
Additional comments:
If you want to use zoo::rollapply, do
oo <- zoo::rollapply(Ret1, window, PCA, by.column = FALSE)
Strictly speaking, the above reports proportional standard deviation. If you really want proportional variance, change the PCA function to:
PCA <- function(x) {
  oo <- prcomp(x)$sdev ^ 2
  oo / sum(oo)
}
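As a quick sanity check of my own: with this variance version, every row of the rolling result should sum to 1.
oo <- t(sapply(seq_len(n - window + 1), function(i) PCA(Ret1[i:(i + window - 1), ])))
stopifnot(isTRUE(all.equal(rowSums(oo), rep(1, nrow(oo)))))  # proportions sum to 1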

Building a Markov chain in R

I have text in a column and I would like to build a Markov chain. I was wondering if there is a way to build a Markov chain for states A, B, C, D and generate a Markov chain with those states. Any thoughts?
A <- c('A-B-C-D', 'A-B-C-A', 'A-B-A-B')
Since you mentioned that you know how to work with statetable.msm, here's a way to translate the data into a form it can handle:
dd <- c('A-B-C-D', 'A-B-C-A', 'A-B-A-B')
Split on dashes and arrange in columns:
d2 <- data.frame(do.call(cbind,strsplit(dd,"-")))
Arrange in a data frame, identified by sequence:
d3 <- tidyr::gather(d2)
Construct the transition matrix:
library(msm)
statetable.msm(value, key, data = d3)
If you want to compute the transition probability matrix (row stochastic) with MLE from the data, try this:
A <- c('A-B-C-D', 'A-B-C-A', 'A-B-A-B', 'D-B-C-A') # the data: your example data, modified a little bit
df <- as.data.frame(do.call(rbind, lapply(strsplit(A, split = '-'),
  function(x) t(sapply(1:(length(x) - 1), function(i) c(x[i], x[i + 1]))))))
tr.mat <- table(df[,1], df[,2])
tr.mat <- tr.mat / rowSums(tr.mat) # make the matrix row-stochastic
tr.mat
# A B C D
# A 0.0000000 1.0000000 0.0000000 0.0000000 # P(A|A), P(B|A), P(C|A), P(D|A) with MLE from data
# B 0.2500000 0.0000000 0.7500000 0.0000000
# C 0.6666667 0.0000000 0.0000000 0.3333333
# D 0.0000000 1.0000000 0.0000000 0.0000000
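If you also want to generate (simulate) a chain with those states, one option (a suggestion of mine, beyond what was asked above) is the markovchain package, which can wrap the row-stochastic matrix and sample from it:
library(markovchain)
P <- matrix(as.numeric(tr.mat), nrow = nrow(tr.mat),
            dimnames = dimnames(tr.mat))  # plain numeric matrix from the table
mc <- new("markovchain", states = rownames(P), transitionMatrix = P)
set.seed(1)
rmarkovchain(n = 10, object = mc, t0 = "A")  # simulate 10 steps starting from A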

How to calculate "terms" from the predict function manually when the regression has an interaction term

Does anyone know how the predict function calculates terms when there is an interaction term in the regression model? I know how to compute the terms when the regression has no interaction term, but when I add one I can't reproduce them manually anymore. Here is some example data; I would like to see how to calculate those values manually. Thanks! -Aleksi
set.seed(2)
a <- c(4,3,2,5,3) # first I make some data
b <- c(2,1,4,3,5)
e <- rnorm(5)
y <- 0.6*a + e
data <- data.frame(a, b, y)
model1 <- lm(y ~ a*b, data = data) # regression
predict(model1, type = 'terms', data) # terms
#This gives the result:
a b a:b
1 0.04870807 -0.3649011 0.2049069
2 -0.03247205 -0.7298021 0.7740928
3 -0.11365216 0.3649011 0.2049069
4 0.12988818 0.0000000 -0.5919534
5 -0.03247205 0.7298021 -0.5919534
attr(,"constant")
[1] 1.973031
Your model is technically y = b0 + b1*a + b2*b + b3*a*b + e. The term for a is calculated by multiplying the variable by its coefficient and centering the result. So, for example, the terms for a would be
cf <- coef(model1)
scale(a * cf[2], scale = FALSE)
[,1]
[1,] 0.04870807
[2,] -0.03247205
[3,] -0.11365216
[4,] 0.12988818
[5,] -0.03247205
which matches your output above.
And since the interaction term is nothing more than the product of the two variables, this translates to
scale(a * b * cf[4], scale = FALSE)
[,1]
[1,] 0.2049069
[2,] 0.7740928
[3,] 0.2049069
[4,] -0.5919534
[5,] -0.5919534
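Putting both together, a compact check of my own: every column of predict(..., type = "terms") is the centered product of a model-matrix column with its coefficient.
cf <- coef(model1)
X <- model.matrix(model1)[, -1]  # drop the intercept column: a, b, a:b
tt <- scale(X, center = TRUE, scale = FALSE) %*% diag(cf[-1])
colnames(tt) <- colnames(X)
tt  # reproduces predict(model1, type = "terms", data) column by column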

Find the maximum value in a decision tree

I created a decision tree with the party package in R. I'm trying to get the route/branch with the maximum value. It can be the mean value that comes from a box plot, and it can be the probability value that comes from a binary tree. (source: rdatamining.com)
This can be done pretty easily, actually, though while your definition of maximum value is clear for a regression tree, it is not very clear for a classification tree, as in each node a different level can have its own maximum.
Either way, here's a pretty simple helper function that will return the predictions for each type of tree:
GetPredicts <- function(ct) {
  f <- function(ct, i) nodes(ct, i)[[1]]$prediction
  Terminals <- unique(where(ct))
  Predictions <- sapply(Terminals, f, ct = ct)
  if (is.matrix(Predictions)) {
    colnames(Predictions) <- Terminals
    return(Predictions)
  } else {
    return(setNames(Predictions, Terminals))
  }
}
Now, luckily, you took your trees from the examples of ?ctree, so we can test them (next time, please provide the code you used yourself).
Regression tree (your first tree)
## load the package and create the tree
library(party)
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
               controls = ctree_control(maxsurrogate = 3))
plot(airct)
Now, test the function
res <- GetPredicts(airct)
res
# 5 3 6 9 8
# 18.47917 55.60000 31.14286 48.71429 81.63333
So we've got the predictions for each terminal node. You can easily proceed with which.max(res) from here, as sketched below (I'll leave the rest for you to decide).
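For instance (my addition), picking the terminal node with the largest predicted mean:
names(res)[which.max(res)]  # "8": node 8, with predicted mean Ozone 81.63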
Classification tree (your second tree)
irisct <- ctree(Species ~ ., data = iris)
plot(irisct, type = "simple")
Run the function
res <- GetPredicts(irisct)
res
# 2 5 6 7
# [1,] 1 0.00000000 0.0 0.00000000
# [2,] 0 0.97826087 0.5 0.02173913
# [3,] 0 0.02173913 0.5 0.97826087
Now, the output is a bit harder to read because each class has its own probabilities. You could make this a bit more readable using
row.names(res) <- levels(iris$Species)
res
# 2 5 6 7
# setosa 1 0.00000000 0.0 0.00000000
# versicolor 0 0.97826087 0.5 0.02173913
# virginica 0 0.02173913 0.5 0.97826087
Then, you could do something like the following in order to get the overall maximum value:
which(res == max(res), arr.ind = TRUE)
# row col
# setosa 1 1
For column/row maxes, you could do
matrixStats::colMaxs(res)
# [1] 1.0000000 0.9782609 0.5000000 0.9782609
matrixStats::rowMaxs(res)
# [1] 1.0000000 0.9782609 0.9782609
But, again, I'll leave to you to decide on how to proceed from here.
