Generate a random sample using a higher-order Markov chain in R

Is there a way to generate a random sample from a higher-order Markov chain? I used the clickstream package to estimate a 2nd-order Markov chain and I'm now trying to generate a sample from it. I understand how to do this from a transition matrix with the randomClickstreams function, but that only works for a 1st-order Markov chain.
Here's a reproducible example where we generate a sample from a transition matrix and then fit a 2nd-order Markov chain on the sample:
library(clickstream)

trans_mat <- matrix(c(0,   0.2, 0.7, 0,   0.1,
                      0.2, 0,   0.5, 0,   0.3,
                      0.1, 0.1, 0.1, 0.7, 0,
                      0,   0.4, 0.2, 0.1, 0.3,
                      0,   0,   0,   0,   1), nrow = 5)

cls <- randomClickstreams(states = c("P1", "P2", "P3", "P4", "end"),
                          startProbabilities = c(0.5, 0.5, 0, 0, 0),
                          transitionMatrix = trans_mat,
                          meanLength = 20, n = 1000)

# fit a 2nd-order Markov chain:
mc <- fitMarkovChain(clickstreamList = cls, order = 2,
                     control = list(optimizer = "quadratic"))
The fitted model consists of 2 transition matrices and 2 lambda parameters:
How can I then use these elements to create a random sample of, say, 10000 journeys?
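For reference, here is a minimal sketch (not part of the clickstream API) of how one could simulate from such a model, assuming it behaves like a mixture transition distribution: the next-state probabilities are the lambda-weighted mix of the relevant rows of the lag-1 and lag-2 transition matrices. Q1, Q2 and lambda are placeholders for the two matrices and the lambda vector you would extract from mc; both matrices are assumed to be row-stochastic with the state names as dimnames.
simulate_mtd <- function(Q1, Q2, lambda, states, start_probs,
                         n = 10000, max_len = 50) {
  one_journey <- function() {
    path <- sample(states, 1, prob = start_probs)
    # second step: only one previous state is available, so use the lag-1 matrix alone
    if (path[1] != "end")
      path[2] <- sample(states, 1, prob = Q1[path[1], ])
    while (tail(path, 1) != "end" && length(path) < max_len) {
      k <- length(path)
      # 2nd-order step: mix the lag-1 and lag-2 rows with the lambda weights
      p <- lambda[1] * Q1[path[k], ] + lambda[2] * Q2[path[k - 1], ]
      path[k + 1] <- sample(states, 1, prob = p)
    }
    path
  }
  replicate(n, one_journey(), simplify = FALSE)
}
Called with the matrices and lambdas taken from the fitted mc and n = 10000, this would return a list of 10000 simulated journeys.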

Related

bnlearn Error: Wrong number of conditional probability distributions

I am learning to work with bnlearn and I keep running into the following error in the last line of my code below:
Error in custom.fit(dag, cpt) : wrong number of conditional probability distributions
What am I doing wrong?
modelstring(dag) = "[s][r][nblw|r][nblg|nblw][mlw|s:r][f|s:r:mlw][mlg|mlw:f][mlgr|mlg:nblg]"
###View DAG Specifics
dag
arcs(dag)
nodes(dag)
# Create Levels
State <- c("State0", "State1")
##Create probability distributions given; these are all 2d b/c they have 1 or 2 nodes
cptS <- matrix(c(0.6, 0.4), ncol=2, dimnames=list(NULL, State))
cptR <- matrix(c(0.7, 0.3), ncol=2, dimnames=list(NULL, State))
cptNBLW <- matrix(c(0.95, 0.05, 0.05, 0.95), ncol=2, dimnames=list(NULL, "r"= State))
cptNBLG <- matrix(c(0.9, 0.099999999999999998, 0.2, 0.8), ncol=2, dimnames=list(NULL,
"nblw"=State))
cptMLG <- matrix(c(0.95, 0.05, 0.4, 0.6, 0.2, 0.8, 0.05, 0.95),ncol=2,nrow = 2,
dimnames=list("mlw"= State, "f"=State))
cptMLGR <- matrix(c(0.6,0.4,0.95,0.05,0.2,0.8,0.55,0.45),ncol=2,nrow = 2,
dimnames=list("mlg"= State, "nblg"=State))
cptMLW <-matrix(c(0.95, 0.05, 0.1, 0.9, 0.2, 0.8, 0.01, 0.99), ncol=2,nrow = 2,byrow = TRUE,
dimnames=list("r"= State, "s"=State))
# Build 3-d matrices (because you have 3 nodes, you can't use the matrix function; you have to build it from scratch)
cptF <- c(0.05, 0.95, 0.4, 0.6, 0.9, 0.1, 0.99, 0.01, 0.9, 0.1, 0.95, 0.05, 0.95, 0.05, 0.99,
0.01)
dim(cptF) <- c(2, 2, 2, 2)
dimnames(cptF) <- list("s"=State, "r"=State, "mlw"=State)
###Create CPT Table
cpt <- list(s = cptS, r = cptR, mlw = cptMLW,nblw= cptNBLW,
mlg= cptMLG, nblg= cptNBLG, mlgr= cptMLGR)
# Construct BN network with Conditional Probability Table
S.net <- custom.fit(dag,cpt)
Reference: https://rpubs.com/sarataheri/bnlearnCGM
You have several errors in your CPT definitions. Primarily, you need to make sure that:
the number of probabilities supplied is equal to the product of the numbers of states in the child and parent nodes,
the number of dimensions of the matrix/array is equal to the number of parent nodes plus one, for the child node,
the child node is given in the first dimension when the number of dimensions is greater than one,
the names given in the dimnames arguments (e.g. the names in dimnames=list(ThisName = ...)) match the names that were defined in the DAG, in your case with modelstring and in my answer with model2network. (So my earlier suggestion of using dimnames=list(cptNBLW = ...) should be dimnames=list(nblw = ...) to match how node nblw was declared in the model string.)
You also did not add node f into your cpt list.
Below is your code with comments where things have been changed. (I have commented out the offending lines and added ones straight after)
library(bnlearn)
dag <- model2network("[s][r][nblw|r][nblg|nblw][mlw|s:r][mlg|mlw:f][mlgr|mlg:nblg][f|s:r:mlw]")
State <- c("State0", "State1")
cptS <- matrix(c(0.6, 0.4), ncol=2, dimnames=list(NULL, State))
cptR <- matrix(c(0.7, 0.3), ncol=2, dimnames=list(NULL, State))
# add child node into first slot of dimnames
cptNBLW <- matrix(c(0.95, 0.05, 0.05, 0.95), ncol=2, dimnames=list(nblw=State, "r"= State))
cptNBLG <- matrix(c(0.9, 0.099999999999999998, 0.2, 0.8), ncol=2, dimnames=list(nblg=State,"nblw"=State))
# Use a 3d array and not matrix, and add child node into dimnames
# cptMLG <- matrix(c(0.95, 0.05, 0.4, 0.6, 0.2, 0.8, 0.05, 0.95),ncol=2,nrow = 2, dimnames=list("mlw"= State, "f"=State))
cptMLG <- array(c(0.95, 0.05, 0.4, 0.6, 0.2, 0.8, 0.05, 0.95),dim=c(2,2,2), dimnames=list(mlg = State, "mlw"= State, "f"=State))
# cptMLGR <- matrix(c(0.6,0.4,0.95,0.05,0.2,0.8,0.55,0.45),ncol=2,nrow = 2, dimnames=list("mlg"= State, "nblg"=State))
cptMLGR <- array(c(0.6,0.4,0.95,0.05,0.2,0.8,0.55,0.45), dim=c(2,2,2), dimnames=list(mlgr=State, "mlg"= State, "nblg"=State))
# cptMLW <-matrix(c(0.95, 0.05, 0.1, 0.9, 0.2, 0.8, 0.01, 0.99), ncol=2,nrow = 2,byrow = TRUE, dimnames=list("r"= State, "s"=State))
cptMLW <-array(c(0.95, 0.05, 0.1, 0.9, 0.2, 0.8, 0.01, 0.99), dim=c(2,2,2), dimnames=list(mlw=State, "r"= State, "s"=State))
# add child into first slot of dimnames
cptF <- c(0.05, 0.95, 0.4, 0.6, 0.9, 0.1, 0.99, 0.01, 0.9, 0.1, 0.95, 0.05, 0.95, 0.05, 0.99, 0.01)
dim(cptF) <- c(2, 2, 2, 2)
dimnames(cptF) <- list("f" = State, "s"=State, "r"=State, "mlw"=State)
# add missing node f into list
cpt <- list(s = cptS, r = cptR, mlw = cptMLW,nblw= cptNBLW, mlg= cptMLG, nblg= cptNBLG, mlgr= cptMLGR, f=cptF)
# Construct BN network with Conditional Probability Table
S.net <- custom.fit(dag, dist=cpt)
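If the network now builds without the error, a quick sanity check (a suggestion, not part of the original answer) is to print one of the repaired nodes and draw a few samples with bnlearn's rbn:
S.net$f            # CPT of node f; the child dimension comes first
rbn(S.net, n = 5)  # 5 random observations drawn from the fitted network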

How to keep some variables fixed during the search of nsga2R?

I'm using nsga2R to get the fitness value of the best solution, which has 10 variables. However, I would like to keep 4 of them fixed through all generations and have the algorithm generate the other 6 randomly. How do I do that with the nsga2R optimization algorithm?
A sample of the code I'm using now:
NSGA <- nsga2R(fn = function(x) myfitnessFun(x, m, 10), varNo = 10, objDim = 2,
               generations = 1, mprob = 0.2, popSize = 50, cprob = 0.8,
               lowerBounds = rep(1, 10), upperBounds = rep(N, 10))
I'm looking to find the best 10 sensor locations out of N sensors that satisfy our fitness function. The question is how to fix, as an example, 4 of these 10, while the remaining 6 are selected by the algorithm.
Our data contains the coordinates of these sensors; here is a sample:
(structure(c(47.4, 47.6105263157895, 47.8210526315789, 48.0315789473684,
5.71, 5.71, 5.71, 5.71, 0, 0, 0, 0), .Dim = c(4L, 3L)))
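A common workaround (a sketch, not a feature of nsga2R) is to let the algorithm search only over the 6 free positions and inject the 4 fixed sensor indices inside a wrapper around the fitness function. fixed_sensors below is a hypothetical placeholder; myfitnessFun, m and N are the objects from the question.
fixed_sensors <- c(12, 45, 78, 90)   # hypothetical: the 4 locations to keep fixed

NSGA <- nsga2R(fn = function(x_free) myfitnessFun(c(fixed_sensors, x_free), m, 10),
               varNo = 6, objDim = 2, generations = 1,
               mprob = 0.2, popSize = 50, cprob = 0.8,
               lowerBounds = rep(1, 6), upperBounds = rep(N, 6))
The search then runs over 6 variables only, while every evaluation of myfitnessFun still sees all 10 sensor locations.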

Getting the last number in a series before it drops below a threshold in R

I have the following data:
dat<- structure(list(Pentad = 1:73, RR = c(0, 0.014285714, 0, 0.088571429,
0.071428571, 0, 0.065714286, 0.028571429, 0.094285714, 0.011428571,
0, 0, 0, 0, 0, 0, 0.04, 0, 0.814285714, 0.285714286, 1.14, 5.334285714,
2.351428571, 1.985714286, 1.494285714, 2.005714286, 20.04857143,
25.00857143, 16.32, 11.06857143, 8.965714286, 3.985714286, 5.202857143,
7.802857143, 4.451428571, 9.22, 32.04857143, 19.50571429, 3.148571429,
2.434285714, 9.057142857, 28.70857143, 34.15142857, 33.02571429,
46.50571429, 70.61714286, 3.168571429, 1.928571429, 7.031428571,
0.902857143, 5.377142857, 11.35714286, 15.04571429, 11.66285714,
21.24, 11.43714286, 11.69428571, 2.977142857, 4.337142857, 0.871428571,
1.391428571, 0.871428571, 1.145714286, 2.317142857, 0.182857143,
0.282857143, 0.348571429, 0, 0.345714286, 0.142857143, 0.18,
4.894285714, 0.037142857), YY = c(0.577142857, 0, 1.282857143,
1.445714286, 0.111428571, 0.36, 0, 0, 0, 1, 0.011428571, 0.008571429,
0.305714286, 0, 0, 0, 0, 0.8, 0.062857143, 0, 0, 0, 0, 0.013333333,
0.043333333, 1.486666667, 0, 2.486666667, 1.943333333, 0.773333333,
8.106666667, 7.733333333, 0.5, 4.356666667, 2.66, 6.626666667,
4.404285714, 7.977142857, 12.94285714, 18.49428571, 7.357142857,
11.08285714, 9.034285714, 14.29142857, 34.61428571, 45.30285714,
6.66, 6.702857143, 5.962857143, 14.85428571, 2.1, 2.837142857,
7.391428571, 32.03714286, 9.005714286, 3.525714286, 12.32, 2.32,
7.994285714, 6.565714286, 4.771428571, 2.354285714, 0.005714286,
2.508571429, 0.817142857, 2.885714286, 0.897142857, 0, 0, 0,
0, 0.145714286, 0.434285714)), class = "data.frame", row.names = c(NA,
-73L))
There are three columns: Pentad, RR, and YY.
I would like to get the following:
(a) Get the first pentad when the precipitation exceeds the "annual mean" in at least three consecutive pentads.
(b) Get the last pentad when the precipitation exceeds the "annual mean" in at least three consecutive pentads, before it drops below the annual mean.
I was able to do (a) using the following script:
first_exceed_seq <- function(x, thresh = mean(x), len = 3) {
  # Logical vector: does x exceed the threshold?
  exceed_thresh <- x > thresh
  # Indices of transition points, where exceed_thresh[i - 1] != exceed_thresh[i]
  transition <- which(diff(c(0, exceed_thresh)) != 0)
  # Reference index, grouping observations after each transition
  index <- vector("numeric", length(x))
  index[transition] <- 1
  index <- cumsum(index)
  # Break x into groups following the transitions
  exceed_list <- split(exceed_thresh, index)
  # Count the values exceeding the threshold in each group
  num_exceed <- vapply(exceed_list, sum, numeric(1))
  # Starting index of the first run where at least len values exceed thresh
  transition[as.numeric(names(which(num_exceed >= len))[1])]
}
first_exceed_seq(dat$RR)
Here's the plot of the time series:
The correct answer in (a) is 27.
I would like to ask how can I do this for (b). The correct answer for (b) should be 57.
I'd appreciate any help with this in R.
I don't know if I got your problem right.
This is what I tried:
library(dplyr)

dat %>%
  mutate(
    annual_mean = mean(RR),
    exceed_thresh = RR > annual_mean,
    lag1 = lag(exceed_thresh, 1),
    lag2 = lag(exceed_thresh, 2),
    pick_3 = ifelse(exceed_thresh & lag1 & lag2, RR, NA)
  )
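If I read (b) correctly, you want the end of the last run of at least three consecutive above-mean pentads. Here is a sketch of that using rle on the same exceedance indicator, mirroring the logic of first_exceed_seq:
last_exceed_seq <- function(x, thresh = mean(x), len = 3) {
  r <- rle(x > thresh)
  run_ends <- cumsum(r$lengths)            # last index of every run
  ok <- r$values & r$lengths >= len        # above-threshold runs of length >= len
  if (!any(ok)) return(NA_integer_)
  tail(run_ends[ok], 1)                    # end of the last qualifying run
}
last_exceed_seq(dat$RR)   # should return 57 for this data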

How to set up an optimization problem to split a group of people into two groups, with several constraints

I am a bit stuck with this problem; I found a temporary (and perhaps suboptimal) solution using Excel, but I'd like to hear your opinion/advice, please.
9 people want to form a group and go on a trip together.
They have 5 possible destinations and 5 possible dates.
Each person gives a score for each destination and each date (separately, so 10 scores in total per person, not 25).
A score equal to 0 means that the person won't participate if the corresponding destination or date is chosen.
All scores different from 0 are normalized to sum up to 1 for each person, by destination and by date.
These are the data (in R):
tr_op <- structure(list(Dest_A = c(0.333333333, 0.285714286, 0.1, 0.333333333,
0.263157895, 0.2, 0.2, 0.25, 0), Dest_B = c(0.266666667, 0, 0.5,
0.333333333, 0.105263158, 0.2, 0.2, 0, 0), Dest_C = c(0.133333333,
0.214285714, 0.3, 0, 0.263157895, 0.2, 0.2, 0.25, 0.5), Dest_D = c(0.2,
0.357142857, 0.1, 0, 0.105263158, 0.2, 0.2, 0.25, 0.5), Dest_E = c(0.066666667,
0.142857143, 0, 0.333333333, 0.263157895, 0.2, 0.2, 0.25, 0),
Date_1 = c(0.119047619, 0.294117647, 0.2, 0.238095238, 0.111111111,
0, 0.2, 0.095238095, 0.333333333), Date_2 = c(0.166666667,
0.058823529, 0.2, 0.238095238, 0.111111111, 0, 0.2, 0.095238095,
0), Date_3 = c(0.238095238, 0.294117647, 0.2, 0.047619048,
0.111111111, 0, 0.2, 0.095238095, 0.333333333), Date_4 = c(0.238095238,
0.058823529, 0.2, 0.238095238, 0.111111111, 0, 0.2, 0.238095238,
0), Date_5 = c(0.238095238, 0.294117647, 0.2, 0.238095238,
0.555555556, 1, 0.2, 0.476190476, 0.333333333)), .Names = c("Dest_A",
"Dest_B", "Dest_C", "Dest_D", "Dest_E", "Date_1", "Date_2", "Date_3",
"Date_4", "Date_5"), class = "data.frame", row.names = c(NA,
-9L))
Each row is one person.
Each cell is the normalized score expressed by the person in the corresponding row, for the destination or date in the corresponding column name.
As there is no single destination or date everyone agrees about, they can't all go together, so they decide to split into two groups and do two trips (2 different destinations and 2 different dates).
The problem is to assign each person to a destination and a date, at the same time choosing the destination-date combinations that maximize the sum of normalized scores.
There is no need to include all the people.
However, there can't be more than 2 destinations or more than 2 dates, and obviously the people in the same group must all agree on the destination and on the date (i.e. a person can't be assigned to a destination or a date for which they gave score 0).
I used Excel's Solver, setting up binary cells as solution vectors and constraining them and their sums as needed; it concluded that the people corresponding to rows 1, 3 and 4 should go to destination B on date 4, and the rest to destination D on date 5.
However, the problem as I set it up is (apparently) not a linear program: it can't be solved by the simplex method, and only the evolutionary solver worked, so I'm not sure I got a globally optimal answer.
--> How would you solve this?
Do you see any analogy with already known problems?
Can you suggest further reading / other posts or literature I could consult?
Thanks!
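One way to get a provably optimal split is to write this as an integer linear program and hand it to a MILP solver. Below is a rough sketch of one possible formulation, assuming the ompr, ompr.roi and ROI.plugin.glpk packages (none of which appear in the question); treat it as a starting point rather than a tested model.
library(dplyr)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

dest_score <- as.matrix(tr_op[, 1:5])    # person x destination scores
date_score <- as.matrix(tr_op[, 6:10])   # person x date scores
nP <- nrow(dest_score); nD <- 5; nT <- 5

# feasible[p, d, t] = 1 only if person p gave a non-zero score to both d and t
feasible <- array(0, dim = c(nP, nD, nT))
for (p in 1:nP) feasible[p, , ] <- outer(dest_score[p, ] > 0, date_score[p, ] > 0)

model <- MIPModel() %>%
  # y[p, d, t] = 1 if person p joins the trip to destination d on date t
  add_variable(y[p, d, t], p = 1:nP, d = 1:nD, t = 1:nT, type = "binary") %>%
  # z[d, t] = 1 if the destination/date combination (d, t) is one of the two trips
  add_variable(z[d, t], d = 1:nD, t = 1:nT, type = "binary") %>%
  # maximize the total normalized score of the assigned people
  set_objective(sum_expr((dest_score[p, d] + date_score[p, t]) * y[p, d, t],
                         p = 1:nP, d = 1:nD, t = 1:nT), "max") %>%
  # each person joins at most one trip
  add_constraint(sum_expr(y[p, d, t], d = 1:nD, t = 1:nT) <= 1, p = 1:nP) %>%
  # people can only be assigned to a combination that is actually chosen
  add_constraint(y[p, d, t] <= z[d, t], p = 1:nP, d = 1:nD, t = 1:nT) %>%
  # score 0 is a veto: that person cannot be assigned to that combination
  add_constraint(y[p, d, t] <= feasible[p, d, t], p = 1:nP, d = 1:nD, t = 1:nT) %>%
  # exactly two trips, with two different destinations and two different dates
  add_constraint(sum_expr(z[d, t], d = 1:nD, t = 1:nT) == 2) %>%
  add_constraint(sum_expr(z[d, t], t = 1:nT) <= 1, d = 1:nD) %>%
  add_constraint(sum_expr(z[d, t], d = 1:nD) <= 1, t = 1:nT)

result <- solve_model(model, with_ROI(solver = "glpk"))
get_solution(result, y[p, d, t]) %>% filter(value > 0.5)
This is essentially a variant of the assignment problem, so searching for that term should turn up relevant literature, and unlike the evolutionary solver a MILP solver certifies global optimality.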

Huge degree of a Markov chain matrix

We ran into an interesting problem in R. We wanted to compute the 100th power of a given Markov chain matrix.
The problem is that after some iterations the matrix suddenly goes to zero. We think this is caused by rounding errors in the matrix multiplication.
Do you have any ideas on how to fix it?
a <- c(0.2, 0, 0.7, 0.1)
b <- c(0.5, 0.2, 0.2, 0.1)
c <- c(0, 0.3, 0.7, 0)
d <- c(0.1, 0.8, 0, 0.1)
mat <- matrix(c(a, b, c, d), ncol = 4, byrow = TRUE)

z <- mat
for (i in 1:100) {
  print(i)
  z <- z %*% z
  print(z)
}
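Note that z <- z %*% z squares the current matrix on every pass, so after i iterations you have mat^(2^i) rather than mat^i; by i = 100 the exponent is astronomically large and the tiny rounding errors of floating-point multiplication get amplified until everything underflows to zero. Two ways to get mat^100 (the expm package is an assumption, it is not used in the question):
# multiply by mat once per iteration
z <- diag(nrow(mat))
for (i in 1:100) z <- z %*% mat
z

# or use the %^% matrix-power operator from the expm package
library(expm)
mat %^% 100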
