Simulating a Markov Chain in R and Sequence Search

I am working on simulating a Markov chain in R in which the states are sunny (S), cloudy (C), and rainy (R), and I am trying to figure out the probability that a sunny day is followed by two consecutive cloudy days.
Here is what I have so far:
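# Transition matrix: matrix() fills by column, so here row i gives the probabilities
# of moving from state i (S, C, R) to each of the three states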
P = matrix(c(0.7, 0.3, 0.2, 0.2, 0.5, 0.6, 0.1, 0.2, 0.2), 3)
print(P)
x = c("S", "C", "R")
n = 10000
states = character(n+100)
states[1] = "C"
for (i in 2:(n+100)) {
  if (states[i-1] == "S") {cond.prob = P[1,]}
  else if (states[i-1] == "C") {cond.prob = P[2,]}
  else {cond.prob = P[3,]}
  states[i] = sample(x, 1, prob = cond.prob)
}
print(states[1:100])
states = states[-(1:100)]
head(states)
tail(states)
states[1:200]
At the end of this I am left with a sequence of states. I am looking to divide this sequence into groups of three states (for the three days in the chain) and then count how many of those three-state groups equal SCC.
I am drawing a blank on how to go about doing this, and any help would be greatly appreciated!

Assuming you want a sliding window (i.e., SCC could occur in positions 1-3, or 2-4, etc.), collapsing the states to a string and doing a regex search should do the trick:
collapsed <- paste(states, collapse = "")
# gregexpr() returns -1 when there are no matches, so count the match positions > 0
matches <- gregexpr("SCC", collapsed)[[1]]
sum(matches > 0)
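If you prefer to avoid regular expressions, a vectorized comparison of shifted copies of the states vector gives the same sliding-window count (a sketch, assuming states is the burned-in vector from the question):
n_kept <- length(states)
sum(states[1:(n_kept - 2)] == "S" &
    states[2:(n_kept - 1)] == "C" &
    states[3:n_kept] == "C")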
On the other hand, if you DON'T want a sliding window (i.e., SCC must be in position 1-3, or 4-6, or 7-9, etc.) then you can chop up the sequence using tapply:
indexer <- rep(1:(ceiling(length(states)/3)), each=3, length=length(states))
chopped <- tapply(states, indexer, paste0, collapse="")
sum(chopped == "SCC")

Eric provided the correct answer, but just for the sake of completeness:
you can get the probability you are looking for using the equilibrium distribution of the chain:
# the equilibrium distribution
e <- eigen(t(P))$vectors[,1]
e.norm <- e/sum(e)
# probability of SCC
e.norm[1] * P[1,2] * P[2,2]
This is less computationally expensive and exact, whereas the simulated estimate is subject to Monte Carlo error and, without a sufficient burn-in, is biased towards the initial state of the chain.
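As a quick sanity check (a sketch): a stationary distribution satisfies pi %*% P = pi, so multiplying the normalized eigenvector back through the transition matrix should reproduce it:
as.vector(e.norm %*% P)  # should match e.norm up to numerical error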

Related

Find local minimum in a vector with R

Taking the ideas from the following links:
the local minimum between the two peaks
How to explain ...
I am looking for the local minimum or minima, avoiding functions already written for this purpose [local or global max/min].
My progress so far:
#DATA
simulate <- function(lambda = 0.3, mu = c(0, 4), sd = c(1, 1), n.obs = 10^5) {
  x1 <- rnorm(n.obs, mu[1], sd[1])
  x2 <- rnorm(n.obs, mu[2], sd[2])
  return(ifelse(runif(n.obs) < lambda, x1, x2))
}
data <- simulate()
hist(data)
d <- density(data)
#
#https://stackoverflow.com/a/25276661/8409550
##Since the x-values are equally spaced, we can estimate dy using diff(d$y)
d$x[which.min(abs(diff(d$y)))]
#With our data we did not obtain the expected value
#
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
#we check
#1
optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
optimize(approxfun(d$x,d$y),interval=c(0,4),maximum = TRUE)$maximum
#2
tp <- pastecs::turnpoints(d$y)
summary(tp)
ind <- (1:length(d$y))[pastecs::extract(tp, no.tp = FALSE, peak = TRUE, pit = TRUE)]
d$x[ind[2]]
d$x[ind[1]]
d$x[ind[3]]
My questions and request for help:
Why did this command line fail:
d$x[which.min(abs(diff(d$y)))]
Is it possible to eliminate the need to add one to the index in these command lines:
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
How can I get the optimize function to return the two expected maximum values?
Question 1
The answer to the first question is straightforward. The line d$x[which.min(abs(diff(d$y)))] asks for the x value at which there was the smallest change in y between two consecutive points. The answer is that this happened at the extreme right of the plot, where the density curve is essentially flat:
which.min(abs(diff(d$y)))
#> [1] 511
length(abs(diff(d$y)))
#> [1] 511
This is not only smaller than the difference at your local maxima/minima points; it is orders of magnitude smaller. Let's zoom in on the peak value of d$y, including only the peak and the point on each side:
which.max(d$y)
#> [1] 324
plot(d$x[323:325], d$y[323:325])
We can see that the smallest difference between two consecutive points there is around 0.00005, or 5e-5. Now look at the end of the plot where it is flattest:
plot(d$x[510:512], d$y[510:512])
The difference is about 1e-7, which is why this is the flattest point.
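To put numbers on this comparison (a small sketch using the same d object as above):
min(abs(diff(d$y)))             # ~1e-7, at the flat right tail
abs(diff(d$y))[which.max(d$y)]  # ~5e-5, the change just after the peak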
Question 2
The answer to your second question is "no, not really". You are taking a double diff, which is two elements shorter than x, and if x is n elements long, a double diff will correspond to elements 2 to (n - 1) in x. You can remove the +1 from the index, but you will have an off-by-one error if you do that. If you really wanted to, you could concatenate dummy zeros at each stage of the diff, like this:
d$x[which(c(0, diff(sign(diff(c(d$y, 0))))) > 0)]
which gives the same result, but this is longer, harder to read and harder to justify, so why would you?
Question 3
The answer to the third question is that you could use the "pit" as the dividing point between the minimum and maximum value of d$x to find the two "peaks". If you really want a single call to get both at once, you could do it inside an sapply:
pit <- optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
peaks <- sapply(1:2, function(i) {
  optimize(approxfun(d$x, d$y),
           interval = c(min(d$x), pit, max(d$x))[i:(i + 1)],
           maximum = TRUE)$maximum
})
pit
#> [1] 1.691798
peaks
#> [1] -0.02249845 3.99552521

Calculate income and wealth shares in R

I have a simple question, but I am confused about how deciles, quantiles, and percentiles are defined.
My purpose is to compute various income and wealth shares, that is, the share of total income or wealth held by x per cent of the population.
So, say that one wants to compute how much wealth the top 10% own.
How can I do this in R? Are my calculations below correct?
MWE
w<-rgamma(10000, 3, scale = 1/3)
per <- quantile(w, c(0.1, 0.9))
top_1_percent <- (per[2]/sum(w))*100
bottom_90_percent <- (per[1]/sum(w))*100
The top 10% should be:
sum(w[w > per[2]])/sum(w)
Alternatively:
sum(tail(sort(w), .1 * length(w))) / sum(w)
The bottom 90% is 1 - top 10%.
If I understand the question correctly, the following will do it.
set.seed(1234) # Make the results reproducible
w <- rgamma(10000, 3, scale = 1/3)
per <- quantile(w, c(0.1, 0.9))
Now get an index i1 on the top 10% and sum their wealth.
i1 <- w >= per[2]
sum(w[i1])
#[1] 2196.856
And the same for the bottom 10%, with index i2.
i2 <- w <= per[1]
sum(w[i2])
#[1] 254.6375
Note that I am using >= and <=. See the help page ?quantile to see the types of quantile computations R can do. This is given by argument type.
Edit.
To compute proportions and percentages of wealth of the top 10% and bottom 10%, divide by the total wealth and multiply by 100.
top10 <- sum(w[i1])/sum(w)
top10
#[1] 0.221291
100*top10
#[1] 22.1291
bottom10 <- sum(w[i2])/sum(w)
bottom10
#[1] 0.02564983
100*bottom10
#[1] 2.564983
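More generally, a small helper makes it easy to compute the share held by any top fraction of the population (a sketch; top_share is a name introduced here for illustration):
top_share <- function(w, p = 0.1) {
  cutoff <- quantile(w, 1 - p)
  sum(w[w >= cutoff]) / sum(w)
}
top_share(w, 0.10)  # share of total wealth held by the top 10%
top_share(w, 0.01)  # share held by the top 1%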

Rolling weighted average in R (multiple observations)

Is there any fast function that can calculate a weighted rolling average? This is necessary because I have multiple observations (not always the same number) per data point (one per second), which I average. When I take the rolling average, I want to re-weight by the number of observations to get an unbiased rolling average.
So far, I came up with this solution (in this example with a window of 3 seconds).
library(data.table)
library(zoo)  # for rollapply()
sam <- data.table(val_mean = c(1:15), N = c(11:25))
sam[, weighted := val_mean * N]
sam[, rollnumerator := rollapply(weighted, 3, sum, fill = NA, align = "left")]
sam[, rolldenominator := rollapply(N, 3, sum, fill = NA, align = "left")]
sam[, rollnumerator / rolldenominator]
I couldn't find any question that already addresses this problem.
This is not about unequal spacing of the data: I can take care of that by expanding my data.table with NAs to include each second (the example above is equally spaced). Also, I don't want to include weights in the sense of RcppRoll's roll_mean: There, weights are fixed for all time windows ("A vector of length n, giving the weights for each element within a window."), while in my case the weights change according to the values currently processed. Thirdly, I don't want an adaptive window size, it should stay fixed (say at 3 seconds).
1) Use by.column = FALSE:
library(data.table)
library(zoo)
wmean <- function(x) weighted.mean(x[, 1], x[, 2])
sam[, rollapplyr(.SD, 3, wmean, by.column = FALSE, fill = NA, align = "left")]
2) Another approach is to encode the values and weights into a complex vector:
wmean_cmplx <- function(x) weighted.mean(Re(x), Im(x))
sam[, rollapply(complex(real = val_mean, imag = N), 3, wmean_cmplx,
fill = NA, align = "left")]
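As a quick consistency check (a sketch reusing the sam table and the rolling columns from the question), the weighted.mean approach should reproduce the manual numerator/denominator construction:
manual <- sam$rollnumerator / sam$rolldenominator
wmean <- function(x) weighted.mean(x[, 1], x[, 2])
rolled <- rollapply(as.matrix(sam[, .(val_mean, N)]), 3, wmean,
                    by.column = FALSE, fill = NA, align = "left")
all.equal(manual, as.numeric(rolled))  # expected TRUE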

Generating random edges with probability between nodes multiple times in R?

I want to generate a random network with 3 nodes, where edges are placed at random between pairs of nodes with a specific probability or strength.
The nodes should be chosen with the following probabilities:
Node 1: 0.6
Node 2: 0.3
Node 3: 0.1
And I want to do this multiple times, as this is part of temporal data, so there is one connection per time stamp.
I am doing it in R with igraph, but the ER model could not do it.
Any idea how I can do this?
Edit: the edges are directed.
(Note: I am a biology student, not someone who works with networks a lot, so any direction and suggestions will be helpful.)
What you want seems to be impossible. There are three possible edges: 1-2, 1-3, and 2-3. Let p_ij denote the probability that edge i-j is chosen. Note that the probability that a given node appears on a randomly chosen edge is the sum of the two edge probabilities that involve that node, so that e.g.
p(1) = p_12 + p_13
You seem to want the p_ij to satisfy:
p_12 + p_13 = 0.6
p_12 + p_23 = 0.3
p_13 + p_23 = 0.1
p_12 + p_13 + p_23 = 1
with the further constraint that each p_ij >= 0.
This is simply impossible. The first three equations yield p_12 = 0.4, p_13 = 0.2 and p_23 = -0.1, which violates the fourth equation and the nonnegativity constraints.
If this argument doesn't imply that what you want is impossible, please explain what you are trying to do more clearly.
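A quick numerical check of this argument (a sketch that solves the first three equations as a linear system):
# rows: p_12 + p_13 = 0.6, p_12 + p_23 = 0.3, p_13 + p_23 = 0.1
A <- rbind(c(1, 1, 0),
           c(1, 0, 1),
           c(0, 1, 1))
solve(A, c(0.6, 0.3, 0.1))  # 0.4, 0.2, -0.1: p_23 is negative and the three sum to 0.5, not 1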
On Edit: If you want to simulate movement between nodes (as per your comments) you can do something like this:
# The following function takes a vector of node
# populations and picks a random move of an individual
# from one node to another. The source node is
# picked with probability proportional to its
# population. The return value is a vector of
# the form c(from, to).
move <- function(nodes){
  total.pop <- sum(nodes)
  n <- length(nodes)
  from <- sample(1:n, 1, prob = nodes/total.pop)
  to <- sample((1:n)[-from], 1)
  c(from, to)
}
# The following function takes an initial population
# distribution and simulates `steps` moves according to the above.
# It returns a data frame with one row per step.
moves <- function(nodes, steps){
  n <- length(nodes)
  current <- nodes  # current populations
  A <- matrix(0, steps + 1, n + 3)
  A[1,] <- c(0, c(NA, NA), nodes)
  for(i in 1:steps){
    v <- move(current)
    current[v[1]] <- current[v[1]] - 1
    current[v[2]] <- current[v[2]] + 1
    A[i+1,] <- c(i, v, current)
  }
  df <- data.frame(A)
  names(df) <- c("Step", "From", "To", paste("Node", 1:n, sep = ""))
  df
}
For example:
> sim <- moves(c(60,30,10),1000)
> plot(sim$Step,sim$Node1,type = "l",ylim = c(0,100),xlab = "Time Step", ylab = "populations")
> points(sim$Step,sim$Node2,type = "l",col = "blue")
> points(sim$Step,sim$Node3,type = "l",col = "red")
Output: a line plot of the three node populations (Node1 in black, Node2 in blue, Node3 in red) over the 1000 time steps.

Consistent Cluster Order with Kmeans in R

This might not be possible, but Google has failed me so far so I'm hoping someone else might have some insight. Sorry if this has been asked before.
The background: I have a database of information on different cities (name, population, pollution, crime, etc.) by year. I'm querying it to aggregate the data on a per-city basis and outputting the result to a table. That works fine.
The next step is I'm running the kmeans() function in R on the data set to find clusters, in testing I've found that 5 clusters is almost always a good choice via the "elbow method".
The issue I'm having is that these clusters have distinct meanings/interpretations, so I want to tag each row in the original data set with the cluster's interpretation for that row, not the cluster number. So I don't want to identify row 2 with "cluster 5", I want to say "low population, high crime, low income".
If R would output the clusters in the same order, say having cluster 5 always equate to the cluster of cities with "low population, high crime, low income", that would work fine, but it doesn't. For instance, if you run code like this:
> a = kmeans(city_date,centers=5)
> b = kmeans(city_date,centers=5)
> c = kmeans(city_date,centers=5)
Then run this code:
a$centers
b$centers
c$centers
The clusters will all contain the same data set, but the cluster number will be different. So if I have a mapping table in SQL that has cluster number and interpretation, it won't work, because when I run it one day it might have the "low population, high crime, low income" cluster as 5, and the next it might be 2, the next 4, etc.
What I'm trying to figure out is if there is a way to keep the output consistent. The data set gets updated so it won't even be the same every time, and since R doesn't keep the cluster order consistent even with the same data set, I am wondering if it will be possible at all.
Thanks for any help anyone can provide. On my end my current idea is to output the $centers data to a SQL table, then order the table by the various metrics, each time the one with the highest/lowest getting tagged as such, and then concatenating the results to tag the level. This may work but isn't very elegant.
I know this is a very old post, but I only came across it now. I had the same problem today and adapted the suggestion by Barker to come up with a solution:
library(dplyr)
# create a random data frame
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
# use kmeans a first time to get the centers
centers <- kmeans(df$obs, centers = 3)$centers
# order the centers
centers <- sort(centers)
# call kmeans again but this time passing the centers calculated in the previous step
clusteridx <- kmeans(df$obs, centers = centers)$cluster
Not very elegant, but it works. The clusteridx vector will always return the cluster number based on the centers in ascending order.
This can also be collapsed into just one line if you prefer:
clusteridx <- kmeans(df$obs, centers = sort(kmeans(df$obs, centers = 3)$centers))$cluster
k-means is usually initialized randomly a few times to avoid local minima. If you want the resulting clusters to be ordered, you have to order them manually after the k-means algorithm finishes.
I haven't done this myself so I am not sure it will work, but kmeans has the parameter:
centers - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
If you know roughly where the clusters should be (perhaps by getting the cluster centers from a dataset you are matching to), you could use that to initialize the model. That would make the starting locations non-random, so the clusters should stay in the same order. Also, as an added benefit, initializing the cluster centers close to where they will end up should speed up your clustering.
Edit
I just checked using the data from the kmeans example, but initializing with the first center at (1,1) and the second at (0,0) (the means of the distributions used to make the clusters) as below.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, matrix(c(1,0,1,0),ncol=2)))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
After repeated runs, I found that the first cluster was always in the top right and the second in the bottom left, whereas initializing with just 2 clusters caused them to switch back and forth. If you have some approximate starting values for your clusters (i.e., a quantification of "low population, high crime, low income"), those could be your initialization and give you the results you want.
This function runs kmeans with 1-dimensional input and returns a normal "kmeans" object with sensibly numbered clusters, without having to run the kmeans twice.
ordered_kmeans = function(x, centers, iter.max = 10, nstart = 1,
                          algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
                          trace = FALSE,
                          desc = TRUE) {
  if (NCOL(x) > 1) {
    stop("only one-dimensional inputs are allowed")
  }
  k = kmeans(x = x, centers = centers, iter.max = iter.max, nstart = nstart,
             algorithm = algorithm, trace = trace)
  centers_ind = order(k$centers, decreasing = desc)
  centers_ord = setNames(seq_along(k$centers), nm = centers_ind)
  k$cluster = unname(centers_ord[as.character(k$cluster)])
  k$centers = matrix(k$centers[centers_ind], ncol = 1)
  k$withinss = k$withinss[centers_ind]
  k$size = k$size[centers_ind]
  k
}
Example usage:
vec = c(20.28, 9.49, 7.14, 2.48, 2.36, 1.82, 1.3, 1.26, 1.11, 0.98,
0.81, 0.73, 0.66, 0.63, 0.57, 0.53, 0.44, 0.42, 0.38, 0.37, 0.33,
0.29, 0.28, 0.27, 0.26, 0.23, 0.23, 0.2, 0.18, 0.16, 0.15, 0.14,
0.14, 0.12, 0.11, 0.1, 0.1, 0.08)
# For comparison
set.seed(1)
k = kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3); k
set.seed(1)
k = ordered_kmeans(vec, centers = 3, desc = FALSE); k
Here's an example where you ascribe letter factor groups to the k-means clusters, ordered from A is low to C is high. The parameters can be altered to fit the data you have.
df <- data.frame(id = 1:10, obs = sample(0:500, 10))
km <- kmeans(df$obs, centers = 3)
km.order <- as.numeric(names(sort(km$centers[,1])))
names(km.order) <- toupper(letters)[1:3]
km.order <- sort(km.order)
clus.order <- factor(names(km.order[km$cluster]))
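To attach the ordered labels back to the original data (a small usage sketch; df, km, and clus.order come from the snippet above):
df$group <- clus.order
df[order(df$obs), ]  # rows with the smallest obs values should now carry group "A"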
