Consider the Markov chain with state space S = {1, 2}, transition matrix
    P = | 1/2  1/2 |
        |  0    1  |
(rows index the current state, columns the next state, as in mat below) and initial distribution α = (1/2, 1/2).
Suppose the source code for the simulation is the following:
alpha <- c(1, 1) / 2
mat <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2)
chainSim <- function(alpha, mat, n)
{
out <- numeric(n)
out[1] <- sample(1:2, 1, prob = alpha)
for(i in 2:n)
out[i] <- sample(1:2, 1, prob = mat[out[i - 1], ])
out
}
Suppose the following is the result of a 5-step Markov Chain simulation repeated 10 times:
> sim
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 1 1 2 2 2 1 1 1 2
[2,] 2 1 2 2 2 2 2 1 1 2
[3,] 2 1 2 2 2 2 2 1 2 2
[4,] 2 2 2 2 2 2 2 1 2 2
[5,] 2 2 2 2 2 2 2 2 2 2
[6,] 2 2 2 2 2 2 2 2 2 2
What would be the values of the following?
P(X1 = 1, X3 = 1)
P(X5 = 2 | X0 = 1, X2 = 1)
E(X2)
I tried them as follows:
mean(sim[4, ] == 1 && sim[2, ]== 1)
?
c(1,2) * mean(sim[2, ])
What would be (2)? Am I correct with the rest?
Kindly explain your response.
You are almost correct about 1: there is a difference whether you use && or &, see
?`&&`
It should be
mean(sim[1 + 1, ] == 1 & sim[1 + 3, ] == 1)
Then 2 is given by
mean(sim[1 + 5, sim[1 + 0, ] == 1 & sim[1 + 2, ] == 1] == 2)
where you may get NaN if the conditioning event {X0 = 1, X2 = 1} does not occur in your simulation.
Lastly, point 3 is
mean(sim[1 + 2, ])
since a natural estimator of the expected value is simply the sample average.
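As a check, the three estimators above can be applied to the 6 x 10 matrix printed in the question. This is only a sketch: the matrix is retyped by hand from the post, and the names p13, cond, p5, and ex2 are made up for illustration. Row i + 1 holds X_i because row 1 is X0.

```r
# The simulation output from the question, retyped by hand (row 1 is X0)
sim <- matrix(c(2, 1, 1, 2, 2, 2, 1, 1, 1, 2,
                2, 1, 2, 2, 2, 2, 2, 1, 1, 2,
                2, 1, 2, 2, 2, 2, 2, 1, 2, 2,
                2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
                2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
              nrow = 6, byrow = TRUE)

# 1. P(X1 = 1, X3 = 1): share of runs in which both events occur
p13 <- mean(sim[1 + 1, ] == 1 & sim[1 + 3, ] == 1)

# 2. P(X5 = 2 | X0 = 1, X2 = 1): keep only the runs satisfying the condition
cond <- sim[1 + 0, ] == 1 & sim[1 + 2, ] == 1
p5 <- mean(sim[1 + 5, cond] == 2)

# 3. E(X2): sample average of the third row
ex2 <- mean(sim[1 + 2, ])

c(p13 = p13, p5 = p5, ex2 = ex2)
# p13 = 0.1, p5 = 1, ex2 = 1.8
```

With only 10 runs (and just 2 of them satisfying the condition in point 2) these estimates are very rough; a larger number of replications would be needed for them to be informative.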
Take advantage of the problem structure: state 2 is an absorbing state. The only way to have X1 = 1 and X3 = 1 is for the chain to start in state 1 and stay in state 1 at every step up to time 3. Hence, the answer is (0.5)^4 = 0.0625: one factor of 0.5 for X0 = 1 and one for each of the three transitions from 1 to 1.
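The same number can be obtained from the transition matrix directly. The following is a small sketch, assuming the matrix from the question is stored as P (the names p_x1, P2, and p_exact are illustrative): P(X1 = 1, X3 = 1) = P(X1 = 1) * P(X3 = 1 | X1 = 1), where the second factor is a two-step transition probability.

```r
alpha <- c(1, 1) / 2
P <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2)  # same values as mat in the question

# The distribution of X1 is alpha %*% P; its first entry is P(X1 = 1)
p_x1 <- as.numeric(alpha %*% P)[1]

# Two-step transition matrix; entry [1, 1] is P(X3 = 1 | X1 = 1)
P2 <- P %*% P

p_exact <- p_x1 * P2[1, 1]
p_exact
# 0.0625
```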
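In terms of simulation, rather than
mean(sim[4, ] == 1 && sim[2, ]== 1)
it should be
mean(sim[4, ] == 1 & sim[2, ]== 1)
&& only checks the first component of each vector, and in R 4.3.0 and later it is an error to pass it a vector of length greater than one; & compares element by element.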
For the second part, one possible way is to note that
P(X5=2|X0=1, X2=1)=P(X5=2,X0=1, X2=1)/P(X0=1, X2=1)
of which you can then first estimate the numerator and the denominator separately and then compute the ratio.
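A rough sketch of that ratio estimate, assuming the chainSim function from the question and an arbitrarily chosen 5000 replications (the seed and the names cond, num, den, ratio, and direct are illustrative). It also shows that the ratio coincides with the sample mean taken over the runs that satisfy the condition.

```r
alpha <- c(1, 1) / 2
mat <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2)

# Simulator from the question, reproduced so the example is self-contained
chainSim <- function(alpha, mat, n) {
  out <- numeric(n)
  out[1] <- sample(1:2, 1, prob = alpha)
  for (i in 2:n)
    out[i] <- sample(1:2, 1, prob = mat[out[i - 1], ])
  out
}

set.seed(42)                                     # arbitrary seed, for reproducibility only
sim <- replicate(5000, chainSim(alpha, mat, 6))  # 6 rows: X0, ..., X5

cond <- sim[1, ] == 1 & sim[3, ] == 1            # event {X0 = 1, X2 = 1}
num <- mean(sim[6, ] == 2 & cond)                # estimates P(X5 = 2, X0 = 1, X2 = 1)
den <- mean(cond)                                # estimates P(X0 = 1, X2 = 1)

ratio <- num / den
direct <- mean(sim[6, cond] == 2)                # sample mean over the conditioning runs
c(ratio = ratio, direct = direct)
```

The two numbers agree because both equal the count of runs with all three events divided by the count of runs satisfying the condition.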
Alternatively, by the Markov property and time homogeneity, P(X5=2|X0=1, X2=1)=P(X5=2| X2=1)=P(X3=2| X0=1)
For the third question, E(X2) is a single number, not a vector. It can be estimated by mean(sim[3,])
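For comparison with the simulated estimates, the exact values for the second and third questions can be computed from powers of the transition matrix. This is a sketch under the assumption that the matrix from the question is stored as P; the names P2, P3, p_cond, dist_x2, and e_x2 are illustrative.

```r
alpha <- c(1, 1) / 2
P <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2)

P2 <- P %*% P    # two-step transition probabilities
P3 <- P2 %*% P   # three-step transition probabilities

# P(X3 = 2 | X0 = 1) is the (1, 2) entry of the three-step matrix
p_cond <- P3[1, 2]

# E(X2) = sum over states s of s * P(X2 = s)
dist_x2 <- as.numeric(alpha %*% P2)
e_x2 <- sum(c(1, 2) * dist_x2)

c(p_cond = p_cond, e_x2 = e_x2)
# p_cond = 0.875, e_x2 = 1.875
```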
I have a vector made of 0 and non-zero numbers. I would like to know the length and starting-position of each of the non-zero number series:
a = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 2.6301334, 1.8372030, 0.0000000, 0.0000000, 0.0000000, 1.5632647, 1.1433757, 0.0000000, 1.5412216, 0.8762267, 0.0000000, 1.3087967, 0.0000000, 0.0000000, 0.0000000)
based on a previous post it is easy to find the starting positions of the non-zero regions:
Finding the index of first changes in the elements of a vector in R
c(1,1+which(diff(a)!=0))
However I cannot seem to configure a way of finding the length of these regions....
I have tried the following:
dif=diff(which(a==0))
dif_corrected=dif-1 # to correct for the added lengths
row=rbind(position=seq(length(a)), length=c(1, dif_corrected))
position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
length 1 0 0 0 0 2 0 0 2 2 1 0 0 1 0
NOTE: not all columns are displayed ( there are actually 20)
Then I subset this to take away 0 values:
> row[,-which(row[2,]==0)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
position 1 6 9 10 11 14 19
length 1 2 2 2 1 1 2
This seems like a decent way of coming up with the positions and lengths of each non-zero series in the series, but it is incorrect:
The position 9 (identified as the start of a non-zero series) is a 0 and instead 10 and 11 are non-zero so I would expect the position 10 and a length of 2 to appear here....
The only result that is correct is position 6 which is the start of the first non-zero series- it is correctly identified as having a length of 2- all other positions are incorrect.
Can anyone tell me how to index correctly to identify the starting-position of each of the non-zero series and the corresponding lengths?
NOTE: I only did this in R because of the usefulness of the which command, but it would also be good to know how to do this in numpy and create a dictionary of positions and length values.
It seems like rle could be useful here.
# a slightly simpler vector
a <- c(0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1)
# runs of zero and non-zero elements
r <- rle(a != 0)
# lengths of non-zero elements
r$lengths[r$values]
# [1] 2 3 1
# start of non-zero runs
cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
# [1] 3 6 12
This also works on vectors with only 0 or non-0, and does not depend on whether or not the vector starts/ends with 0 or non-0. E.g.:
a <- c(1, 1)
a <- c(0, 0)
a <- c(1, 1, 0, 1, 1)
a <- c(0, 0, 1, 1, 0, 0)
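To make those edge cases easy to try, the rle steps can be wrapped in a small function. This is just a sketch; nonzero_runs is a made-up name, not part of base R, and it returns the start positions and lengths as a data frame.

```r
# Hypothetical helper wrapping the rle() steps shown above
nonzero_runs <- function(a) {
  r <- rle(a != 0)          # runs of zero (FALSE) and non-zero (TRUE) elements
  ends <- cumsum(r$lengths) # last position of every run
  data.frame(start  = (ends - r$lengths + 1)[r$values],
             length = r$lengths[r$values])
}

nonzero_runs(c(1, 1))              # one run: start 1, length 2
nonzero_runs(c(0, 0))              # no runs: a data frame with zero rows
nonzero_runs(c(1, 1, 0, 1, 1))     # starts 1 and 4, lengths 2 and 2
nonzero_runs(c(0, 0, 1, 1, 0, 0))  # one run: start 3, length 2
```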
A possible data.table alternative, using rleid to create groups, and .I to get the start index and calculate the length.
library(data.table)
d <- data.table(a)
d[ , .(start = min(.I), len = max(.I) - min(.I) + 1, nonzero = (a != 0)[1]),
by = .(run = rleid(a != 0))]
# run start len nonzero
# 1: 1 1 2 FALSE
# 2: 2 3 2 TRUE
# 3: 3 5 1 FALSE
# 4: 4 6 3 TRUE
# 5: 5 9 3 FALSE
# 6: 6 12 1 TRUE
If desired, the runs can then easily be sliced by the 'nonzero' column.
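For numpy this is a parallel method to @Maple's (with a fix for arrays ending with a nonzero):
import numpy as np

def subSeries(a):
    # 1 where the element is non-zero (up to floating-point tolerance), else 0
    d = np.logical_not(np.isclose(a, np.zeros_like(a))).astype(int)
    edges = np.diff(np.r_[0, d, 0])    # zero padding catches runs touching either end
    starts = np.where(edges == 1)[0]   # 0-based index of the first element of each run
    ends = np.where(edges == -1)[0]    # 0-based index one past the last element
    return np.c_[starts, ends - starts]  # columns: start index, run length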
Definition:
sublistLen = function(list) {
z_list <- c(0, list, 0)
ids_start <- which(diff(z_list != 0) == 1)
ids_end <- which(diff(z_list != 0) == - 1)
lengths <- ids_end - ids_start
return(
list(
'ids_start' = ids_start,
'ids_end' = ids_end - 1,
'lengths' = lengths)
)
}
Example:
> a <- c(-2,0,0,12,5,0,124,0,0,0,0,4,48,24,12,2,0,9,1)
> sublistLen(a)
$ids_start
[1] 1 4 7 12 18
$ids_end
[1] 1 5 7 16 19
$lengths
[1] 1 2 1 5 2
I'm trying to create a vector whose elements add up to a specific number. For example, let's say I want to create a vector with 4 elements, and they must add up to 20, so its elements could be 6, 6, 4, 4 or 2, 5, 7, 6, whatever. I tried to run some lines using sample() and seq() but I cannot do it.
Any help appreciated.
To divide into 4 parts, you need three breakpoints from the 19 possible breaks between 20 numbers. Then your partitions are just the sizes of the intervals between 0, your breakpoints, and 20:
> sort(sample(19,3))
[1] 5 7 12
> diff(c(0, 5,7,12,20))
[1] 5 2 5 8
As a test, let's create a big matrix of them. Each column is an instance:
> trials = sapply(1:1000, function(X){diff(c(0,sort(sample(19,3)),20))})
> trials[,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 1 8 13 3 2
[2,] 4 7 10 2 9 5
[3,] 2 11 1 4 3 7
[4,] 11 1 1 1 5 6
Do they all add to 20?
> all(apply(trials,2,sum)==20)
[1] TRUE
Are there any weird cases?
> range(trials)
[1] 1 17
No, there are no zeroes and nothing bigger than 17, which will be a (1,1,1,17) case. You can't have an 18 without a zero.
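The breakpoint idea generalises to any number of parts and any total. A minimal sketch, assuming positive integer parts are wanted (rand_parts is a hypothetical name, and the seed is arbitrary):

```r
# Hypothetical helper: split `total` into `n` positive integers
rand_parts <- function(n, total) {
  stopifnot(n >= 2, total >= n)  # need at least n - 1 distinct breakpoints
  breaks <- sort(sample(total - 1, n - 1))
  diff(c(0, breaks, total))
}

set.seed(1)  # arbitrary seed, for reproducibility only
x <- rand_parts(4, 20)
x
sum(x)       # 20
```

Because the breakpoints are sampled without replacement, no part can be zero; allowing zeros would need a different sampling scheme.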
foo = function(n, sum1){
#Divide sum1 into 'n' parts
x = rep(sum1/n, n)
#For each x, sample a value from 1 to that value minus one
f = sapply(x, function(a) sample(1:(a-1), 1))
#Add and subtract f from 'x' so that sum(x) does not change
x = x + sample(f)
x = x - sample(f)
x = floor(x)
x[n] = x[n] - (sum(x) - sum1)
return(x)
}
I asked a similar question previously, but this one is a little trickier. I have a matrix (say A) of POSITIVE INTEGER solutions (previously NON-NEGATIVE solutions) to the indefinite equation x1+x2+x3 = 8. I also have another matrix (say B) with columns
0 1 0 1
0 0 1 1
I want to generate matrices using rows of A and the columns of B.
For example, let (2,2,4) be one solution (one row) of the matrix A. In this case, I cannot simply use rep. So I tried to generate all the three-column matrices from matrix B and then apply rep, but couldn't figure that out. I use the following lines to generate a list of all three-column matrices.
cols <- combn(ncol(B), 3, simplify=F, FUN=as.numeric)
M3 <- lapply(cols, function(x) cbind(B[,x]))
For example, cols[[1]]
[1] 1 2 3
Then, the columns of my new matrix would be
0 0 1 1 0 0 0 0
0 0 0 0 1 1 1 1
The columns of this new matrix are repeats of the columns of B, i.e., the first column 2 times, the second column 2 times, and the third column 4 times. I want to apply this procedure to all the rows of matrix A. How do I do this?
The documentation (?rep) says of the times argument:
if times is a vector of the same length as x (after replication by
each), the result consists of x[1] repeated times[1] times, x[2]
repeated times[2] times and so on.
The basic idea is:
B <- matrix(c(0, 1, 0, 1, 0, 0, 1, 1), byrow = T, nrow = 2)
cols <- combn(ncol(B), 3, simplify=F, FUN=as.numeric)
a1 <- c(2, 2, 4)
cols[[1]] # [1] 1 2 3
rep(cols[[1]], a1) # [1] 1 1 2 2 3 3 3 3
B[, rep(cols[[1]], a1)]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 0 1 1 0 0 0 0
# [2,] 0 0 0 0 1 1 1 1
testA <- rbind(c(2,2,4), c(2,1,5), c(2,3,3))
## apply(..., lapply(...)) approach (output is list in list)
apply(testA, 1, function(x) lapply(cols, function(y) B[, rep(y, x)]))
## other approach using combination of indices
ind <- expand.grid(ind_cols = 1:length(cols), ind_A = 1:nrow(testA))
col_ind <- apply(ind, 1, function(x) rep(cols[[x[1]]], testA[x[2],]))
lapply(1:ncol(col_ind), function(x) B[, col_ind[,x]]) # output is list
library(dplyr)
apply(col_ind, 2, function(x) t(B[, x])) %>% matrix(ncol = 8, byrow=T) # output is matrix
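If a plain named list is easier to work with than the nested apply output, an explicit double loop in base R is one alternative. This is only a sketch: the result names such as "A1_cols1" are invented here for illustration, and testA stands in for the full solution matrix A.

```r
B <- matrix(c(0, 1, 0, 1, 0, 0, 1, 1), byrow = TRUE, nrow = 2)
cols <- combn(ncol(B), 3, simplify = FALSE)
testA <- rbind(c(2, 2, 4), c(2, 1, 5), c(2, 3, 3))

# One list element per (row of A, column triple) pair
res <- list()
for (i in seq_len(nrow(testA))) {
  for (j in seq_along(cols)) {
    # repeat the j-th triple of column indices according to the i-th row of A
    res[[paste0("A", i, "_cols", j)]] <- B[, rep(cols[[j]], times = testA[i, ])]
  }
}

length(res)                  # 3 rows of A x 4 column triples = 12
all(sapply(res, ncol) == 8)  # every matrix has 8 columns
res[["A1_cols1"]]
```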
In this post there is a method to initialize the centers for the K-means algorithm in R. However, the data used therein is scalar (i.e. numbers).
A variation on this question: what if the data has multiple dimensions. In that case, the new centers should be vectors, so start should be a vector of vectors... I tried something like :
C1<- c(1,2)
C2<- c(4,-5)
to have my two initial centers, and then use
kmeans(dat, c(C1,C2))
but it didn't work. I also tried cbind() instead of c(). Same result...
The matrix start must have one row per cluster and one column per variable (dimension): as many rows as the number of clusters you are attempting to identify, and as many columns as there are variables in the data set.
Here is an extension of the post you linked to, expanding the example to 3 dimensions (variables), x, y, and z:
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0 , 5)),
y = rnorm(99, mean = c(-5, 0, 5)),
z = rnorm(99, mean = c(-5, 2, -4)))
plot(dat)
The plot is a scatterplot matrix of x, y, and z (figure omitted).
Now we need to specify cluster centres for each of our three clusters. This is done via a matrix as before:
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)
> start
[,1] [,2] [,3]
[1,] -5 -5 -5
[2,] 0 0 2
[3,] 5 5 -4
Here, the important thing to note is that the clusters are in rows. The columns are coordinates on that dimension of the specified cluster centre. Hence for cluster 1 we are specifying that the centroid is at (-5,-5,-5)
Calling kmeans()
kmeans(dat, start)
results in it picking groups very close to our initial starting points (as it should for this example):
> kmeans(dat, start)
K-means clustering with 3 clusters of sizes 33, 33, 33
Cluster means:
x y z
1 -4.8371412 -4.98259934 -4.953537
2 0.2106241 0.07808787 2.073369
3 4.9708243 4.77465974 -4.047120
Clustering vector:
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[39] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[77] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
Within cluster sum of squares by cluster:
[1] 117.78043 77.65203 77.00541
(between_SS / total_SS = 93.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
It is worth noting here the output for the cluster centres:
Cluster means:
x y z
1 -4.8371412 -4.98259934 -4.953537
2 0.2106241 0.07808787 2.073369
3 4.9708243 4.77465974 -4.047120
This layout is exactly the same as the matrix start.
You don't have to build the matrix directly using matrix(), nor do you have to specify the centres column-wise. For example:
c1 <- c(-5, -5, -5)
c2 <- c( 0, 0, 2)
c3 <- c( 5, 5, -4)
start2 <- rbind(c1, c2, c3)
> start2
[,1] [,2] [,3]
c1 -5 -5 -5
c2 0 0 2
c3 5 5 -4
Or
start3 <- matrix(c(-5, -5, -5,
0, 0, 2,
5, 5, -4), ncol = 3, nrow = 3, byrow = TRUE)
> start3
[,1] [,2] [,3]
[1,] -5 -5 -5
[2,] 0 0 2
[3,] 5 5 -4
If those are more comfortable for you.
The key thing to remember is that variables are in columns, cluster centres in the rows.
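A quick way to see why the layout matters is to compare a flattened vector of centres with a properly shaped matrix. The sketch below uses made-up two-dimensional data around the centres from the question (the data, the seed, and the names flat, bad, and fit are all illustrative); the flattened version is wrapped in tryCatch because it is expected to fail.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
C1 <- c(1, 2)
C2 <- c(4, -5)

# Made-up data: 50 points around each centre, two variables
dat <- rbind(cbind(rnorm(50, C1[1]), rnorm(50, C1[2])),
             cbind(rnorm(50, C2[1]), rnorm(50, C2[2])))

# c(C1, C2) is a plain vector of length 4, so the cluster/variable layout is lost
flat <- c(C1, C2)
bad <- tryCatch(kmeans(dat, flat), error = function(e) e)
inherits(bad, "error")

# rbind() gives one row per centre and one column per variable
start <- rbind(C1, C2)
stopifnot(ncol(start) == ncol(dat))  # sanity check before clustering
fit <- kmeans(dat, start)
fit$centers
```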
## Your centers
C1 <- c(1, 2)
C2 <- c(4, -5)
## Simulate some data with groups around these centers
library(MASS)
set.seed(0)
dat <- rbind(mvrnorm(100, mu=C1, Sigma = matrix(c(2,3,3,10), 2)),
mvrnorm(100, mu=C2, Sigma = matrix(c(10,3,3,2), 2)))
clusts <- kmeans(dat, rbind(C1, C2)) # get clusters with your center starting points
## Look at them
plot(dat, col=clusts$cluster)
I need to read in a CSV file with no headers and with an unknown number of columns and rows. However, every other column belongs in one matrix, while the columns in between need to go in a different matrix. Example
CSV input:
1,2,3,4
1,2,3,4
1,2,3,4
1,2,3,4
Desired result would be equivalent to:
matrix1 <- matrix(c( 1, 3,
                     1, 3,
                     1, 3,
                     1, 3), NumberOfRows, NumberOfColumns, byrow=T);
and
matrix2 <- matrix(c( 2, 4,
                     2, 4,
                     2, 4,
                     2, 4), NumberOfRows, NumberOfColumns, byrow=T);
I have tried something like this (but it seems overly complex and doesn't work anyway). Isn't there a simple way to do this in R?
mydata<- read.csv("~/Desktop/file.csv", header=FALSE, nrows=4000);
columnCount<-ncol(mydata);
rowCount<-nrow(mydata);
evenColumns <- matrix(); oddColumns <-matrix();
for (i in 1:columnCount) {
if (i %% 2) {
for (l in 1:rowCount){
col <- 1;
evenColumns[col, l] <-mydata[i,l];
col<-col+1;
}
}
else {
for (l in 1:rowCount){
col <-1;
oddColumns[col, l] <-mydata[i,l];
col<-col+1;
}
}
}
How should this be done properly in R?
You can get the column numbers with seq:
full = read.csv("mat.csv", header=FALSE)
odds = as.matrix(full[, seq(1, ncol(full), by=2)])
evens = as.matrix(full[, seq(2, ncol(full), by=2)])
Output:
> odds
V1 V3
[1,] 1 3
[2,] 1 3
[3,] 1 3
[4,] 1 3
> evens
V2 V4
[1,] 2 4
[2,] 2 4
[3,] 2 4
[4,] 2 4
Similar to the problem discussed here
mat.even <- mydata[,which(1:ncol(mydata) %% 2 == 0)]
mat.odd <- mydata[,which(1:ncol(mydata) %% 2 == 1)]
Every other starting with the first:
> cdat[ , c(TRUE,FALSE)]
V1 V3
1 1 3
2 1 3
3 1 3
4 1 3
Every other starting with the second:
> cdat[ , !c(TRUE,FALSE)]
V2 V4
1 2 4
2 2 4
3 2 4
4 2 4
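The approaches above can be tried without a file on disk by reading the example CSV from a string. This is a sketch, assuming the four-line input from the question; csv_text, odds, and evens are illustrative names, and unname() is used only to drop the V1, V2, ... column names.

```r
# The example input from the question, as a string instead of a file
csv_text <- "1,2,3,4\n1,2,3,4\n1,2,3,4\n1,2,3,4"
full <- read.csv(text = csv_text, header = FALSE)

# A logical index of length 2 is recycled across the columns;
# drop = FALSE keeps a data frame even if only one column is selected
odds  <- unname(as.matrix(full[, c(TRUE, FALSE), drop = FALSE]))
evens <- unname(as.matrix(full[, c(FALSE, TRUE), drop = FALSE]))

odds
evens
```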