R: Simulating Correlated Coin Flips

I am working with the R programming language.
I want to simulate coin flips such that:
If heads, then the next flip is heads with p = 0.6 and tails with p = 0.4.
If tails, then the next flip is tails with p = 0.6 and heads with p = 0.4.
Using the 'markovchain' package in R, I did this as follows:
library(markovchain)
# transition matrix
P <- matrix(c(0.6, 0.4, 0.4, 0.6), byrow = TRUE, nrow = 2)
rownames(P) <- colnames(P) <- c("H", "T")
mc <- new("markovchain", states = c("H", "T"), transitionMatrix = P)
# Generate states
states <- rmarkovchain(n = 100, object = mc, t0 = "H")
# Print
table(states)
The output looks something like this:
> states
[1] "H" "T" "T" "H" "T" "T" "T" "H" "H"
My Question: Can someone please show me how I can do this in base R?
I think I need to:
create an empty vector of size n
assign x[1] = "H" or "T" with prob 0.5
write an ifelse statement like x[i] = ifelse(x[i-1] == "H", sample(c("H", "T"), 1, prob = c(0.6, 0.4)), sample(c("H", "T"), 1, prob = c(0.4, 0.6)))
But I am not sure how to do this.
Can someone please show me how to do this?
Thanks!

Here it is.
# Fair is a fair coin for the first flip; A and B are the biased coins
# used after a head and after a tail, respectively.
Fair <- function() sample(c(rep('H', 5), rep('T', 5)), 1)
A <- function() sample(c(rep('H', 6), rep('T', 4)), 1)
B <- function() sample(c(rep('H', 4), rep('T', 6)), 1)
x <- Fair()
# Each new flip depends only on the last element of x
for (i in 1:10) x <- c(x, ifelse(tail(x, n = 1) == 'H', A(), B()))
print(x)
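For comparison, here is a minimal base R sketch in the style the asker outlined, using sample()'s prob argument directly instead of pools of repeated letters (n = 100 and the fair first flip are just illustrative choices):
set.seed(1)
n <- 100
flips <- character(n)
flips[1] <- sample(c("H", "T"), 1)  # fair first flip
for (i in 2:n) {
  # Stay on the previous face with probability 0.6, switch with 0.4
  p_heads <- if (flips[i - 1] == "H") 0.6 else 0.4
  flips[i] <- sample(c("H", "T"), 1, prob = c(p_heads, 1 - p_heads))
}
table(flips)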

Related

Extract single linkage clusters from very large pairs list

I have a very large pairs list that I need to break down into single linkage communities. So far I have been able to do this entirely in R just fine. But I need to prepare for the eventuality that the entire list may be too large to hold in memory, or for igraph's R implementation to handle. A very simple version of this task looks like:
library(igraph)
df <- data.frame("p1" = c("a", "a", "d", "d"),
                 "p2" = c("b", "c", "e", "f"),
                 "val" = c(0.5, 0.75, 0.25, 0.35))
g <- graph_from_data_frame(d = df, directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g, vids = x),
             USE.NAMES = FALSE, simplify = FALSE)
If df is incredibly large - on the scale of hundreds of millions to tens of billions of rows - is there a way for me to extract individual positions of sg without having to build g in its entirety? It's relatively easy for me to store representations of df outside of R, either as a compressed txt file or as a SQLite database.
To address the problem with igraph's R implementation (assuming the dataset still fits in RAM; otherwise see @Paul Brodersen's answer):
The solution below works by specifying one element of the graph and then following all connections until no further edges are found. It therefore creates the subgraph without building the whole graph. It looks a bit hacky compared to a recursive function, but it scales better.
library(igraph)
reduce_graph <- function(df, element) {
  stop <- FALSE
  elements_to_inspect <- element
  rows_graph <- 0
  # Repeatedly pull in every row that touches the current frontier of
  # vertices; stop once a pass adds no new rows.
  while (stop == FALSE) {
    graph_parts <- df[df$p1 %in% elements_to_inspect |
                        df$p2 %in% elements_to_inspect, ]
    elements_to_inspect <- unique(c(graph_parts$p1, graph_parts$p2))
    if (nrow(graph_parts) == rows_graph) {
      stop <- TRUE
    } else {
      rows_graph <- nrow(graph_parts)
    }
  }
  return(graph_parts)
}
df <- data.frame("p1" = c("a", "a", "d", "d", "o"),
                 "p2" = c("b", "c", "e", "f", "u"),
                 "val" = c(100, 0.75, 0.25, 0.35, 1))
small_graph <- reduce_graph(df, "f")
g <- graph_from_data_frame(d = small_graph, directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g, vids = x),
             USE.NAMES = FALSE, simplify = FALSE)
One can test the speed on a bigger dataset.
## Larger dataset with lots of sparse graphs.
set.seed(100)
p1 <- as.character(sample(1:10000000, 1000000, replace = TRUE))
p2 <- as.character(sample(1:10000000, 1000000, replace = TRUE))
val <- rep(1, 1000000)
df <- data.frame("p1" = p1, "p2" = p2, "val" = val)
small_graph <- reduce_graph(df, "9420672")  # has 3 pairwise connections
g <- graph_from_data_frame(d = small_graph, directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g, vids = x),
             USE.NAMES = FALSE, simplify = FALSE)
Building the groups and subgraph takes about a second, compared to multiple minutes for the whole graph on my machine. This of course depends on how sparsely connected the graphs are.
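The question also mentions storing df outside of R as a SQLite database. The same frontier expansion can be pushed down into SQL so that only the matched rows ever enter R. Here is a sketch under the assumption of a database file pairs.db containing a table pairs with columns p1, p2, val (the file and table names are hypothetical); it uses DBI/RSQLite and naive string quoting, so it is not injection-safe:
library(DBI)

reduce_graph_sqlite <- function(con, element) {
  frontier <- element
  n_rows <- 0
  repeat {
    # Fetch every row touching the current frontier of vertex names
    in_list <- paste(sprintf("'%s'", frontier), collapse = ", ")
    graph_parts <- dbGetQuery(con, sprintf(
      "SELECT p1, p2, val FROM pairs WHERE p1 IN (%s) OR p2 IN (%s)",
      in_list, in_list))
    frontier <- unique(c(graph_parts$p1, graph_parts$p2))
    if (nrow(graph_parts) == n_rows) break  # no new edges found
    n_rows <- nrow(graph_parts)
  }
  graph_parts
}

con <- dbConnect(RSQLite::SQLite(), "pairs.db")
small_graph <- reduce_graph_sqlite(con, "9420672")
dbDisconnect(con)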

How to generate n MarkovChain sequences of 25 transitions each

I have a transition matrix "T" and would like to produce 20 different sequences of 25 states each.
I have the markovchain package and have tried the following:
lapply(1:20,markovchainSequence(n = 25, markovchain = T, t0 = "In"))
but it says that markovchainSequence is not a function. Is there a way around this, please?
A reproducible example would really help here, but I think this gets the job done. The error comes from the fact that lapply's second argument must be a function, while your call passes the result of calling markovchainSequence instead. You may just need a bigger transition matrix.
set.seed(123)
statesNames <- c("a", "b", "c")  # easier with three states
t <- new("markovchain", states = statesNames,
         transitionMatrix = matrix(c(0.2, 0.5, 0.3,
                                     0,   0.2, 0.8,
                                     0.1, 0.8, 0.1),
                                   nrow = 3, byrow = TRUE,
                                   dimnames = list(statesNames, statesNames)))
mchain <- function(n) {
  markovchainSequence(n = n, markovchain = t, t0 = "a")
}
lapply(rep(25, 20), mchain)  # 20 sequences; change 25 to the desired length
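Since every sequence has the same length, replicate() is an equivalent alternative to the rep() trick; a small sketch with the same chain:
seqs <- replicate(20, markovchainSequence(n = 25, markovchain = t, t0 = "a"),
                  simplify = FALSE)
length(seqs)  # 20 sequences of 25 states each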

Subset r data.table conditionally using is.null()

I have a data.table
library(data.table)
testDT <- data.table(
  L = 1:32,
  M = rep(letters[23:26], each = 64),
  N = rep(LETTERS[1:2], times = 2, each = 512),
  O = rnorm(2048, 1))
testDT$L <- factor(testDT$L, levels = 1:32)
I created a function to subset the data set conditionally. If the subsetting variable G is NULL and H is "w", then I want all values within testDT$N, together with only the value "w" within testDT$M, to be returned. This is what I created, which does not work correctly:
G <- NULL
H <- "w"
testDT1 <- testDT[if(is.null(G)) {eval(call("%in%", as.name("N"), G))} &
                    if(is.null(H)) {eval(call("%in%", as.name("M"), H))}]
I verified that everything but the if(is.null()) portion works by creating this, which subsets correctly:
G <- "A"
H <- "w"
testDT1 <- testDT[{eval(call("%in%", as.name("N"), G))} &
                  {eval(call("%in%", as.name("M"), H))}]
How can I use the is.null() condition correctly?
Using computing on the language, you can prepare the call object with a dedicated helper function.
library(data.table)
testDT = data.table(
  L = factor(1:32),
  M = rep(letters[23:26], each = 64),
  N = rep(LETTERS[1:2], times = 2, each = 512),
  O = rnorm(2048, 1)
)
i.expr = function(var, x) {
  if (is.null(x)) TRUE
  else call("%in%", as.name(var), x)
}
G = NULL
H = "w"
i.expr("N", G)
#[1] TRUE
i.expr("M", H)
#M %in% "w"
testDT1 = testDT[eval(i.expr("N", G)) & eval(i.expr("M", H))]
G = "A"
H = "w"
i.expr("N", G)
#N %in% "A"
i.expr("M", H)
#M %in% "w"
testDT2 = testDT[eval(i.expr("N", G)) & eval(i.expr("M", H))]
If you always subset by two conditions joined with the & operator, I would merge them into a single function so you can call it once as testDT[eval(i.expr(...))].
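For example, a minimal sketch of that merged helper (the i.expr2 name is hypothetical):
i.expr2 <- function(var1, x1, var2, x2) {
  # Build each condition, substituting TRUE when the value is NULL,
  # then join them with & into a single call object.
  e1 <- if (is.null(x1)) TRUE else call("%in%", as.name(var1), x1)
  e2 <- if (is.null(x2)) TRUE else call("%in%", as.name(var2), x2)
  call("&", e1, e2)
}
G <- NULL
H <- "w"
testDT1 <- testDT[eval(i.expr2("N", G, "M", H))]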

Iteratively define user-defined discrete distributions

I am writing a script that, using the distr package, defines some discrete distributions based on the following objects:
margins <- c("discrete1", "discrete2")
vec1 <- list(support=c(0,1,2), probabilities=c(0.2, 0.2, 0.6))
vec2 <- list(support=c(12,14,20), probabilities=c(0.1, 0.15, 0.75))
Here is the code, which works as expected: it creates the two distributions.
library("distr")
discrete1 <- DiscreteDistribution(supp = vec1[[1]], prob = vec1[[2]])
ddiscrete1 <- d(discrete1)  # Density function
pdiscrete1 <- p(discrete1)  # Distribution function
qdiscrete1 <- q(discrete1)  # Quantile function
rdiscrete1 <- r(discrete1)  # Random number generator
discrete2 <- DiscreteDistribution(supp = vec2[[1]], prob = vec2[[2]])
ddiscrete2 <- d(discrete2)
pdiscrete2 <- p(discrete2)
qdiscrete2 <- q(discrete2)
rdiscrete2 <- r(discrete2)
Once the two (or possibly more) distributions are defined, my final goal is to sample random numbers from them:
rdiscrete1(100)
rdiscrete2(100)
The problem with this code is that the number of distributions can be very high. I wonder how the creation of these functions could be automated in a more elegant manner.
Also, I need the two distributions to be of class DiscreteDistribution and not nested in lists (see is(discrete1) in my example).
l <- list(list(support = c(0, 1, 2), probabilities = c(0.2, 0.2, 0.6)),
          list(support = c(12, 14, 20), probabilities = c(0.1, 0.15, 0.75)))
distrs <- lapply(1:length(l), function(n) {
  d <- DiscreteDistribution(supp = l[[n]][[1]], prob = l[[n]][[2]])
  list(d = d, dd = d(d), pd = p(d), qd = q(d), rd = r(d))
})
# First object is of class DiscreteDistribution
is(distrs[[1]][[1]])
# [1] "DiscreteDistribution"   "UnivariateDistribution" "AcDcLcDistribution"
# [4] "Distribution"           "UnivDistrListOrDistribution"
# Random numbers
dim(sapply(distrs, function(x) x[[5]](100)))
# [1] 100   2
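Since the question already defines a margins vector of names, the list can be named so each distribution's functions are reachable directly; a short follow-up sketch:
names(distrs) <- c("discrete1", "discrete2")  # the question's margins vector
distrs$discrete1$rd(100)  # 100 random draws from the first distribution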

Problem with cut() in R

I want to assign subjects to classes based on probabilities that I provide. I will be doing this in a variety of cases, with different values. Sometimes, I want the probability of a particular class to be 0. I've been using
classlist <- cut(runif(p), c(0, pdrop, ptitrate, pcomplete, pnoise, 1), labels = c("D", "T", "C", "N", "O"))
but this fails when two of the p values are equal, because cut() requires unique breaks. I could make them differ by minimal amounts, e.g. pdrop = .2 and ptitrate = .200001, but is there a better way?
Thanks
Peter
I suggest sample():
> p <- 100
> groups <- c("D", "T", "C", "N", "O")
> probVec <- c(0.2, 0.2, 0.3, 0.25, 0.05)
> classlist <- factor(sample(groups, size=p, replace=TRUE, prob=probVec))
> table(classlist)
classlist
 C  D  N  O  T
26 16 28  5 25
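Unlike cut() on cumulative breaks, sample()'s prob argument takes the class probabilities directly, so a zero probability is legal and simply means that class is never drawn. For instance:
# "O" gets probability 0 and never occurs; factor(levels = groups)
# keeps it in the table with a count of 0.
table(factor(sample(groups, size = 100, replace = TRUE,
                    prob = c(0.2, 0.2, 0.3, 0.3, 0)),
             levels = groups))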
