I have text in a column and I would like to build a Markov chain from it. I was wondering if there is a way to build a Markov chain for states A, B, C, D and generate a Markov chain with those states. Any thoughts?
A <- c('A-B-C-D', 'A-B-C-A', 'A-B-A-B')
Since you mentioned that you know how to work with statetable.msm, here's a way to translate the data into a form it can handle:
library(msm)  ## statetable.msm is from the msm package
dd <- c('A-B-C-D', 'A-B-C-A', 'A-B-A-B')
Split on dashes and arrange the sequences in columns (one column per sequence):
d2 <- data.frame(do.call(cbind, strsplit(dd, "-")))
Reshape into a long data frame, with each row identified by its sequence:
d3 <- tidyr::gather(d2)
Construct the transition count table (here value is the state and key identifies the sequence):
statetable.msm(value, key, data = d3)
This yields the observed transition counts: four A→B transitions, two B→C, and one each of B→A, C→A and C→D.
If you want to compute the transition probability matrix (row stochastic) with MLE from the data, try this:
A <- c('A-B-C-D', 'A-B-C-A', 'A-B-A-B', 'D-B-C-A') # the data, with your example modified a little
# collect every consecutive (from, to) state pair as a two-column data frame
df <- as.data.frame(do.call(rbind, lapply(strsplit(A, split = '-'),
  function(x) t(sapply(1:(length(x) - 1), function(i) c(x[i], x[i + 1]))))))
tr.mat <- table(df[,1], df[,2])
tr.mat <- tr.mat / rowSums(tr.mat) # make the matrix row-stochastic
tr.mat
# A B C D
# A 0.0000000 1.0000000 0.0000000 0.0000000 # P(A|A), P(B|A), P(C|A), P(D|A) with MLE from data
# B 0.2500000 0.0000000 0.7500000 0.0000000
# C 0.6666667 0.0000000 0.0000000 0.3333333
# D 0.0000000 1.0000000 0.0000000 0.0000000
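With tr.mat in hand, you can generate a new chain by repeatedly sampling the next state from the current state's row of the transition matrix. Here is a minimal sketch; the seed, the starting state 'A' and the chain length are arbitrary choices:
set.seed(1)
n.steps <- 10
chain <- character(n.steps)
chain[1] <- 'A'  # arbitrary starting state
for (k in 2:n.steps) {
  chain[k] <- sample(colnames(tr.mat), size = 1, prob = tr.mat[chain[k - 1], ])
}
paste(chain, collapse = '-')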
I am trying to store the coefficients & SEs of a linear regression in R. The regression starts with a sample size of 10 and needs to add 1 for each loop up to 1000. I have generated random variables (using rnorm), created variables to store the values in and can get the code to store the first regression, but it stops after 1 loop (sample size 10). What am I missing in my code here? Thank you for your help.
matrix_coef <-NULL
df <- data.frame(yi, x1, x2)
for (i in 10:1000) {
lm(df)
matrix_coef <- summary(lm(df))
b0[i]<- coef(matrix_coef)[1:1, 1:1]
bx1[i] <- coef(matrix_coef)[2:2, 1:1 ]
bx2[i] <- coef(matrix_coef)[3:3, 1:1]
sd0[i] <- coef(matrix_coef)[1:1, 2:2]
sdx1[i] <- coef(matrix_coef)[2:2, 2:2]
sdx2[i] <- coef(matrix_coef)[3:3, 2:2]
}
You've got a few issues:
Inside the loop, df doesn't change; you're always fitting lm to the full data. You need to use i to change the data that the model trains on. Something like lm(df[1:i, ]), perhaps (assuming your full data has 1000 rows and you want the iterations to fit the first 10, then 11, then 12, ..., 1000 rows, rather than, say, resampling each time).
The line lm(df) by itself fits a model, but you don't store the result, and your next line summary(lm(df)) fits the exact same model again.
You initialize the wrong object. matrix_coef <- NULL essentially does nothing: the line matrix_coef <- summary(lm(df)) inside your loop will have no problem creating the matrix_coef object. (And we can save a bunch of typing by defining it as coef(summary(lm(df))) and not typing coef() on all the subsequent lines.) However, the objects that you index into (b0, bx1, etc.) should be initialized, preferably to the correct length.
1:1 is the same as 1. 2:2 is the same as 2, etc.
Addressing all of these gives us this:
set.seed(47)
n <- 15 ## small n for demonstration purposes
df <- data.frame(yi = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
b0 <- numeric(n)
bx1 <- numeric(n)
bx2 <- numeric(n)
sd0 <- numeric(n)
sdx1 <- numeric(n)
sdx2 <- numeric(n)
for (i in 10:n) {
matrix_coef <- coef(summary(lm(df[1:i, ])))
b0[i] <- matrix_coef[1, 1]
bx1[i] <- matrix_coef[2, 1]
bx2[i] <- matrix_coef[3, 1]
sd0[i] <- matrix_coef[1, 2]
sdx1[i] <- matrix_coef[2, 2]
sdx2[i] <- matrix_coef[3, 2]
}
b0
# [1] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
# [9] 0.00000000 -0.17267676 -0.17490218 -0.08251370 -0.04048010 -0.04976162 -0.03881043
sdx2
# [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.4888940
# [11] 0.4470402 0.4569932 0.4724890 0.4669553 0.4323498
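As a small variant (a sketch under the same setup, n and df, as above), you could preallocate a single results matrix instead of six separate vectors:
results <- matrix(NA_real_, nrow = n, ncol = 6,
                  dimnames = list(NULL, c("b0", "bx1", "bx2", "sd0", "sdx1", "sdx2")))
for (i in 10:n) {
  mc <- coef(summary(lm(df[1:i, ])))
  results[i, ] <- c(mc[1:3, 1], mc[1:3, 2])  # estimates first, then standard errors
}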
I often use B-splines for regression. Up to now I've never needed to understand the output of bs in detail: I would just choose the model I was interested in and fit it with lm. However, I now need to reproduce a B-spline model in external (non-R) code. So, what's the meaning of the matrix generated by bs? Example:
library(splines)  ## bs() is from the splines package
x <- c(0.0, 11.0, 17.9, 49.3, 77.4)
bs(x, df = 3, degree = 1) # generate degree 1 (linear) B-splines with 2 internal knots
# 1 2 3
# [1,] 0.0000000 0.0000000 0.0000000
# [2,] 0.8270677 0.0000000 0.0000000
# [3,] 0.8198433 0.1801567 0.0000000
# [4,] 0.0000000 0.7286085 0.2713915
# [5,] 0.0000000 0.0000000 1.0000000
# attr(,"degree")
# [1] 1
# attr(,"knots")
# 33.33333% 66.66667%
# 13.30000 38.83333
# attr(,"Boundary.knots")
# [1] 0.0 77.4
# attr(,"intercept")
# [1] FALSE
# attr(,"class")
# [1] "bs" "basis" "matrix"
Ok, so degree is 1, as I specified in input. knots tells me that the two internal knots are at x = 13.3000 and x = 38.8333 respectively. I was a bit surprised to see that the knots are at fixed quantiles; I had hoped R would find the best quantiles for my data, but of course that would make the model nonlinear, and it also wouldn't be possible without knowing the response data. intercept = FALSE means that no intercept was included in the basis (is that a good thing? I've always been taught not to fit linear models without an intercept... well, I guess lm just adds one anyway).
However, what about the matrix? I don't really understand how to interpret it. With three columns, I would think it means that the basis functions are three. This makes sense: if I have two internal knots K1 and K2, I will have a spline between left boundary knot B1 and K1, another spline between K1 and K2, and a final one between K2 and B2, so...three basis functions, ok. But which are the basis functions exactly? For example, what does this column mean?
# 1
# [1,] 0.0000000
# [2,] 0.8270677
# [3,] 0.8198433
# [4,] 0.0000000
# [5,] 0.0000000
EDIT: this is similar to, but not precisely the same as, this question. That question asks about the interpretation of the regression coefficients, but I'm a step before that: I would like to understand the meaning of the model matrix coefficients. If I try to make the same plots as suggested in the first answer, I get messed-up plots:
b <- bs(x, df = 3, degree = 1)
b1 <- b[, 1] ## basis 1
b2 <- b[, 2] ## basis 2
b3 <- b[, 3] ## basis 3
par(mfrow = c(1, 3))
plot(x, b1, type = "l", main = "basis 1: b1")
plot(x, b2, type = "l", main = "basis 2: b2")
plot(x, b3, type = "l", main = "basis 3: b3")
These can't be the B-spline basis functions, because they have too many knots (each function should only have one).
The second answer would actually allow me to reconstruct my model outside R, so I guess I could go with that. However, that answer also doesn't exactly explain what the elements of the b matrix are: it deals with the coefficients of a linear regression, which I still haven't introduced here. It's true that that is my final goal, but I wanted to understand this intermediate step as well.
The matrix b
# 1 2 3
# [1,] 0.0000000 0.0000000 0.0000000
# [2,] 0.8270677 0.0000000 0.0000000
# [3,] 0.8198433 0.1801567 0.0000000
# [4,] 0.0000000 0.7286085 0.2713915
# [5,] 0.0000000 0.0000000 1.0000000
is actually just the matrix of the values of the three basis functions at each point of x, which should have been obvious to me, since it's exactly the same interpretation as for a polynomial linear model. As a matter of fact, since the boundary knots are
bknots <- attr(b,"Boundary.knots")
# [1] 0.0 77.4
and the internal knots are
iknots <- attr(b,"knots")
# 33.33333% 66.66667%
# 13.30000 38.83333
then the three basis functions, as shown here, are:
knots <- c(bknots[1], iknots, bknots[2])
y1 <- c(0, 1, 0, 0)
y2 <- c(0, 0, 1, 0)
y3 <- c(0, 0, 0, 1)
par(mfrow = c(1, 3))
plot(knots, y1, type = "l", main = "basis 1: b1")
plot(knots, y2, type = "l", main = "basis 2: b2")
plot(knots, y3, type = "l", main = "basis 3: b3")
Now, consider b[,1]
# 1
# [1,] 0.0000000
# [2,] 0.8270677
# [3,] 0.8198433
# [4,] 0.0000000
# [5,] 0.0000000
These must be the values of b1 in x <- c(0.0, 11.0, 17.9, 49.3, 77.4). As a matter of fact, b1 is 0 in knots[1] = 0 and 1 in knots[2] = 13.3000, meaning that in x[2] (11.0) the value must be 11/13.3 = 0.8270677, as expected. Similarly, since b1 is 0 for knots[3] = 38.83333, the value in x[3] (17.9) must be (38.83333-13.3)/17.9 = 0.8198433. Since x[4], x[5] > knots[3] = 38.83333, b1 is 0 there. A similar interpretation can be given for the other two columns.
Just a small correction to the excellent answer by @DeltaIV above (it looks like I cannot comment).
So in b1, when he calculated b1(x[3]), it should be (38.83333-17.9)/(38.83333-13.3)=0.8198433 by linear interpolation. Everything else is perfect.
Note that b1 should look like this:
$$b_1(t) = \frac{t}{13.3}\, I(0 \le t < 13.3) + \frac{38.83333 - t}{38.83333 - 13.3}\, I(13.3 \le t < 38.83333)$$
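As a quick sanity check (a sketch assuming x and b from the question are still in scope), evaluating this piecewise formula at x reproduces the first column of the bs() output:
b1_manual <- function(t) {
  ifelse(t < 13.3, t / 13.3,
         ifelse(t < 38.83333, (38.83333 - t) / (38.83333 - 13.3), 0))
}
b1_manual(x)
# [1] 0.0000000 0.8270677 0.8198433 0.0000000 0.0000000  # matches b[, 1]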
I created a decision tree with the party package in R.
I'm trying to get the route/branch with the maximum value.
It could be the mean value that comes from the box plot (regression tree), or the probability value that comes from the binary tree (classification tree).
(source of the tree images: rdatamining.com)
This can be done pretty easily, actually. While your definition of maximum value is clear for a regression tree, it is not so clear for a classification tree, as in each node a different level can have its own maximum.
Either way, here's a pretty simple helper function that will return the predictions for each type of tree:
GetPredicts <- function(ct){
  ## per-node prediction: a mean (regression tree) or class probabilities (classification tree)
  f <- function(ct, i) nodes(ct, i)[[1]]$prediction
  Terminals <- unique(where(ct))  # IDs of the terminal nodes
  Predictions <- sapply(Terminals, f, ct = ct)
  if (is.matrix(Predictions)) {   # classification: one column of probabilities per node
    colnames(Predictions) <- Terminals
    return(Predictions)
  } else {                        # regression: one predicted mean per node
    return(setNames(Predictions, Terminals))
  }
}
Now, luckily you've taken your trees from the examples in ?ctree, so we can test them (next time, please provide the code you used yourself).
Regression tree (your first tree)
## load the package and create the tree
library(party)
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
controls = ctree_control(maxsurrogate = 3))
plot(airct)
Now, test the function
res <- GetPredicts(airct)
res
# 5 3 6 9 8
# 18.47917 55.60000 31.14286 48.71429 81.63333
So we've got the predictions for each terminal node. You can easily proceed with which.max(res) from here (I'll leave it for you to decide).
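For instance, a minimal sketch of pulling out the terminal node with the largest predicted mean, using the res above:
names(res)[which.max(res)]
# [1] "8"
max(res)
# [1] 81.63333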
Classification tree (your second tree)
irisct <- ctree(Species ~ ., data = iris)
plot(irisct, type = "simple")
Run the function
res <- GetPredicts(irisct)
res
# 2 5 6 7
# [1,] 1 0.00000000 0.0 0.00000000
# [2,] 0 0.97826087 0.5 0.02173913
# [3,] 0 0.02173913 0.5 0.97826087
Now, the output is a bit harder to read, because each class has its own probabilities. You could make this a bit more readable using
row.names(res) <- levels(iris$Species)
res
# 2 5 6 7
# setosa 1 0.00000000 0.0 0.00000000
# versicolor 0 0.97826087 0.5 0.02173913
# virginica 0 0.02173913 0.5 0.97826087
Then, you could do something like the following in order to get the overall maximum value
which(res == max(res), arr.ind = TRUE)
# row col
# setosa 1 1
For column/row maxes, you could do
matrixStats::colMaxs(res)
# [1] 1.0000000 0.9782609 0.5000000 0.9782609
matrixStats::rowMaxs(res)
# [1] 1.0000000 0.9782609 0.9782609
But, again, I'll leave to you to decide on how to proceed from here.
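If what you're after is the predicted class in each terminal node, here is one more small sketch, using the res with row names set above (note that node 6 is a 50/50 tie, which which.max resolves to the first class):
apply(res, 2, function(p) row.names(res)[which.max(p)])
#          2            5            6            7
#   "setosa" "versicolor" "versicolor" "virginica"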
I want a matrix with only the correlation coefficients whose absolute value is bigger than 0.2. I came up with the following solution.
mts.data <- ts(data.frame(a = arima.sim(model = list(1, 0, 0), n = 10),
                          b = arima.sim(model = list(1, 0, 1), n = 10),
                          c = arima.sim(model = list(1, 0, 0), n = 10),
                          d = arima.sim(model = list(1, 0, 2), n = 10),
                          e = arima.sim(model = list(2, 0, 1), n = 10)),
               start = c(2007, 1), frequency = 12)
critcor <- function(x) {
  crit.mat <- matrix(0, nrow = ncol(x), ncol = ncol(x))
  for (j in 1:ncol(x)) {
    for (i in 1:ncol(x)) {
      if (abs(cor(x[, i], x[, j])) > 0.2) {
        crit.mat[i, j] <- cor(x[, i], x[, j])
      }
    }
  }
  return(crit.mat)
}
This works fine. Unfortunately, my data set contains missing values.
mts.data[1:3, 4] <- NA
mts.data[9:10, 5] <- NA
When I run my function, I get an error.
critcor(mts.data)
# Error in if (abs(cor(x[, i], x[, j])) > 0.2) { :
# missing value where TRUE/FALSE needed
I've been browsing the Internet for several hours now and I have absolutely no idea how I could fix this. If a correlation cannot be computed because of the missing values, I want my function to just print a 0 instead.
You can greatly simplify your code like this:
cm = cor(mts.data, use = "p")
cm[abs(cm) <= 0.2] = 0
which gives:
> cm
a b c d e
a 1.0000000 0.0000000 -0.4667718 -0.5241904 -0.6864418
b 0.0000000 1.0000000 0.0000000 -0.3270387 0.0000000
c -0.4667718 0.0000000 1.0000000 0.4708803 0.5222566
d -0.5241904 -0.3270387 0.4708803 1.0000000 0.0000000
e -0.6864418 0.0000000 0.5222566 0.0000000 1.0000000
The snippet use = "p" is short for use = "pairwise.complete.obs", i.e. NAs will be omitted where necessary. For more options and details, see ?cor.
The error you received arose because cor() returned NA: the comparison NA > 0.2 then also evaluates to NA, and if() does not accept NA as its condition, hence the error.
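If you'd rather keep your original loop-based function, a minimal sketch of a fix is to compute each correlation on pairwise-complete observations and guard the comparison against NA, so failed correlations simply stay 0:
critcor <- function(x) {
  crit.mat <- matrix(0, nrow = ncol(x), ncol = ncol(x))
  for (j in 1:ncol(x)) {
    for (i in 1:ncol(x)) {
      r <- cor(x[, i], x[, j], use = "pairwise.complete.obs")
      if (!is.na(r) && abs(r) > 0.2) {  # NA correlations are left as 0
        crit.mat[i, j] <- r
      }
    }
  }
  return(crit.mat)
}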
I have got the following code:
library(urca)  ## ca.jo is from the urca package
test <- ca.jo(x, type = 'trace', ecdet = 'const', K = 2)
When I write summary(test), the output includes:
Eigenvectors, normalised to first column:
(These are the cointegration relations)
gld.l2 gdx.l2
gld.l2 1.000000 1.0000000
gdx.l2 -1.488325 -0.1993057
How can I access these normalised eigenvectors?
When I write
slot(test, "Vorg")
I only get the following data
gld.l2 gdx.l2
gld.l2 -0.01346063 -0.012380092
gdx.l2 0.02003378 0.002467422
but I want the normalised ones.
## reproducible example with the denmark data shipped with urca
data(denmark)
sjd <- denmark[, c("LRM", "LRY", "IBO", "IDE")]
sjd.vecm <- ca.jo(sjd, ecdet = "const", type="eigen", K=2, spec="longrun",
season=4)
sm <- summary(sjd.vecm)
sm@V
LRM.l2 LRY.l2 IBO.l2 IDE.l2 constant
LRM.l2 1.000000 1.0000000 1.0000000 1.000000 1.0000000
LRY.l2 -1.032949 -1.3681031 -3.2266580 -1.883625 -0.6336946
IBO.l2 5.206919 0.2429825 0.5382847 24.399487 1.6965828
IDE.l2 -4.215879 6.8411103 -5.6473903 -14.298037 -1.8951589
constant -6.059932 -4.2708474 7.8963696 -2.263224 -8.0330127
You might want to check str(sm) for more.
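The same slot access should work on your own fit, i.e. summary(test)@V. If I recall the urca class layout correctly (treat this as an assumption and verify with str(test)), the ca.jo object itself also stores the normalised eigenvectors in a V slot, next to the unnormalised Vorg you already found:
slot(test, "V")  # assumed slot holding the normalised eigenvectors
# or equivalently
test@V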