The setup is: "Suppose that you have a population of 10,000 and 6,000 people are Democrats. We simulate survey sampling from this hypothetical population. First, generate a vector of the population in R. Then, create 500 random samples of size n = 1,000."
My code so far is:
pop<-c(rep("Democrat", 6000), rep("Republican", 4000))
nTrials <- 500
n <- 1000
results <- rep(NA, nTrials)
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=FALSE)
results[i] <- sampled
}
Then I rewrote it to keep every sample, coding Democrats as 1 and Republicans as 0:
pop <- c(rep(1, 6000), rep(0, 4000))
nTrials <- 500
n <- 1000
results <- matrix(data=NA, ncol = nTrials, nrow = n)
Y<-matrix(data=NA, ncol=nTrials, nrow=1)
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
}
I think this code works, but I'm worried about how to tell if the matrix is filling correctly.
You can easily inspect objects you've saved using the View() function. We can also put a line into our R code that halts execution until a key is pressed, using readline() (there is a Stack Exchange thread covering this).
Putting the two together, we can put two lines into a loop: one that shows us the current version of the final output, and another that pauses the loop until we continue. This will let us explore the behaviour of the loop step by step. Using one of your loops as an example:
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
View(Y)
readline(prompt="Press [enter] to continue")
}
Keep in mind this will keep going for the specified number of trials.
You could reduce the number of trials, but then you could not be sure of getting the same result, so instead we can put a break statement into the code. This lets us interrupt a for loop early, once we are happy with how things are building up. To make the break really shine, let's pair it with some user input, so you can choose whether or not to continue.
Combining all of this, we get something like:
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
View(Y)
interrupt <- readline(prompt="Enter 1 for next loop, 0 to exit: ")
if (interrupt == "0") {break}  # readline() returns a character string
}
For what it's worth, your code looks perfectly fine to me so far.
Try this
replicate(500L, sample(c("Democrat", "Republican"), 1000L, replace = TRUE, prob = c(0.6, 0.4)), simplify = FALSE)
Or this
pop <- c(rep("Democrat", 6000L), rep("Republican", 4000L))
replicate(500L, sample(pop, 1000L), simplify = FALSE)
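As a quick sanity check on either version, you can store the samples and look at the distribution of sample proportions, which should centre near the population value of 0.6. A minimal sketch:
samples <- replicate(500L, sample(pop, 1000L), simplify = FALSE)
prop_dem <- vapply(samples, function(s) mean(s == "Democrat"), numeric(1))
hist(prop_dem, xlab = "Sample proportion of Democrats")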
I was just attempting to recreate an example using the rbinom function, but the number of "successes" is a lot higher than I would expect given the low probabilities.
numSamples <- 10000 #number of samples to be drawn from population
numTrials <- 100 #this is the sample size (size of each sample)
probs <- seq(0.001, 0.9999, 0.01)
for (i in 1:length(probs)) {
x <- rbinom(n = numSamples, size = numTrials, prob = probs[i])
}
Everything seems straightforward, except that I am getting the number of successes to be between 97 and 100 for all of the samples. When I do a few test cases manually using the smaller probabilities, such as 0.001, I get the expected number of successes: 0. So there must be an issue with how my for loop works. What is going wrong?
Your loop overwrites x on every iteration, so when it finishes, x holds only the draws for the last (and largest) probability, roughly 0.99, which is why every sample shows 97-100 successes. Here's one way to keep all the samples:
res <- list()
for (i in 1:length(probs)) {
res <- c(res,
list(rbinom(n = numSamples, size = numTrials, prob = probs[i])))
}
You can then unlist(res) to get one long vector, or do.call(rbind, res) to collapse to a matrix. Growing a list this way and then collapsing it doesn't incur the same performance penalty as growing a vector.
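Since the number of iterations is known up front, the same result can also be built without growing anything; a compact equivalent with lapply():
res <- lapply(probs, function(p) rbinom(n = numSamples, size = numTrials, prob = p))
mat <- do.call(rbind, res)  # one row per probability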
R and probability noob here. I'm looking to create a histogram that shows the distribution of how many attempts it took to return a heads, repeated over 1000+ simulated runs on the equivalent of an unfairly weighted coin (0.1 heads, 0.9 tails).
From my understanding, this is not a geometric distribution or binomial distribution (but might make use of either of these to create the simulated results).
The real-world (ish) scenario I am looking to model this for is a speedrun of the game Zelda: Ocarina of Time. One of the goals in this speedrun is to obtain an item from a character that has a 1 in 10 chance of giving the player the item each attempt. As such, the player stops attempting once they receive the item (which they have a 1/10 chance of receiving each attempt). Every run, runners/viewers will keep track of how many attempts it took to receive the item during that run, as this affects the time it takes the runner to complete the game.
This is an example of what I'm looking to create:
(though with more detailed labels on the x axis, if possible). In this example, I manually flipped a virtual coin with a 1/10 chance of heads over and over; once I got a successful result, I recorded how many attempts it took into a vector in R, and repeated this about 100 times. I then mapped this vector onto a histogram to visualise the distribution of the usual number of attempts it takes to get a successful result. Basically, I'd like to automate this simulation instead of manually flipping the virtual unfair coin, writing down how many attempts it took before heads, and entering it into R myself.
I'm not sure if this is quite what you're looking for, but if you create a function for your manual coin flipping, you can just use replicate() to call it many times:
foo <- function(p = 0.1) {
i <- 0
failure <- TRUE
while ( failure ) {
i <- i + 1
if ( sample(x = c(TRUE, FALSE), size = 1, prob = c(p, 1-p)) ) {
failure <- FALSE
}
}
return(i)
}
set.seed(42)
number_of_attempts <- replicate(1000, foo())
hist(number_of_attempts, xlab = "Number of Attempts Until First Success")
As I alluded to in my comment though, I'm not sure why you think the geometric distribution is inappropriate.
It "is used for modeling the number of failures until the first success" (from the Wikipedia on it).
So, we can just sample from it and add one; the approaches are equivalent, but this will be faster when your number of samples is high:
number_of_attempts2 <- rgeom(1000, 0.1) + 1
hist(number_of_attempts2, xlab = "Number of Attempts Until First Success")
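As a quick check that the two approaches agree (using foo() as defined above), both means should land near 1/p = 10:
set.seed(42)
c(mean(replicate(10000, foo())), mean(rgeom(10000, 0.1) + 1))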
I would use the rle function, since you can run a lot of simulations in a short period of time. Use it to count the runs of tails before a head:
> n <- 1e6
> # generate a long string of flips with unfair coin
> flips <- sample(0:1,
+ n,
+ replace = TRUE,
+ prob = c(0.9, 0.1))
> counts <- rle(flips)
> # now pull out the "lengths" of "0" which will be the tails before
> # a head is flipped
> runs <- counts$lengths[counts$value == 0]
> sprintf("# of simulations: %d max run of tails: %d mean: %.1f\n",
+ length(runs),
+ max(runs),
+ mean(runs))
[1] "# of simulations: 90326 max run of tails: 115 mean: 10.0\n"
> ggplot()+
+ geom_histogram(aes(runs),
+ binwidth = 1,
+ fill = 'blue')
and you get a chart like this:
Histogram of runs
Note that rle() counts the tails before each head, not the total attempts; add 1 to runs if you want to include the successful flip itself.
I would tabulate the cumsum: cumsum(tosses) labels each toss with the number of heads seen so far, so tabulate() counts how many tosses fall in each run, which is the number of attempts per success.
p <- 0.1
N <- 1e8
set.seed(42)
tosses <- sample(0:1, N, TRUE, prob = c(1 - p, p))  # 1 = heads, with probability p
attempts <- tabulate(cumsum(tosses))  # attempts per success
length(attempts)
# [1] 10003599
hist(attempts, freq = FALSE, col = "#F48024")
I asked a similar question on CrossValidated, but did not get a response. I went ahead anyway, and built out a function but am having a problem with replication...
The original question, posted on CrossValidated, was as follows:
I am seeking a function (or short algorithm, ideally implemented in R) that produces something similar to the following:
See, I would like to be able to generate a vector of n items that follows this sort of pattern, mapped to a set of inputs (say, seq(1:n)). Ideally, I would be able to tell the algorithm to "spike" to a maximum height h on every kth time period, and decay at rate r. However, I would be sufficiently happy with simply being able to generate a spike pattern that occurs periodically.
I wrote some code in R, which is included here, that works fairly well...
## Neural Networks / Deep Learning ##
# first, must install Python from:
# https://www.anaconda.com/download/#windows
# https://www.python.org/downloads/
if (!require(keras)) devtools::install_github("rstudio/keras") ; library(keras)
# install_tensorflow()
spikes_model <- function(maxiter, total_spikes = 10, max_height = 0.001, min_height = 0.000005, decay_rate = 1) {
value_at_iteration <- rep(0, maxiter)
spike_at <- maxiter / total_spikes
current_rate <- min_height
holder_timeval <- 0
for(i in 1:maxiter) {
spike_indicator <- i / spike_at
if (is.integer(spike_indicator)) {
current_rate <- max_height
value_at_iteration[i] <- current_rate
holder_timeval <- spike_indicator
} else if (i < spike_at) {
current_rate <- min_height
value_at_iteration[i] <- current_rate
} else {
timeval <- i - (holder_timeval*spike_at)
current_rate <- max_height*exp(-decay_rate*timeval) + min_height
value_at_iteration[i] <- current_rate
}
}
return(value_at_iteration)
}
asdf <- spikes_model(maxiter = 100)
plot(asdf, type="l")
... which results in the following plot:
This is exactly what I want, except there is only one spike. I know there is a code or logic error somewhere, but I cannot find where I am going wrong. Please help me replicate this spike procedure across time.
The code this scheduler is used in:
eps <- 1000
sch <- spikes_model(eps)
lr_schedule <- function(epoch, lr) {
lrn <- sch[as.integer(epoch)]
lrn <- k_cast_to_floatx(lrn)
return(lrn)
}
## Add callback to automatically adjust learning rate downward when training reaches plateau ##
reduce_lr <- callback_learning_rate_scheduler(lr_schedule)
## Fit model using training data, validate with validation data ##
mod1.hst <- mod1 %>% fit(
x=X.train, y=Y.train,
epochs=eps, batch_size=nrow(X.train),
validation_data = list(X.val, Y.val),
shuffle=TRUE, callbacks = list(checkpoint, reduce_lr)
)
Wow, I just figured out my own error. I was using the is.integer() function, which does not work how I wanted: is.integer() tests whether a value's storage type is integer, not whether the value is a whole number, and i / spike_at is stored as a double. I needed to use the is.whole.number() function from mosaic.
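If you'd rather avoid the mosaic dependency, a small base-R check, adapted from the example in ?integer, does the same job:
is_whole <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
is_whole(100 / 10)   # TRUE
is.integer(100 / 10) # FALSE: the value is stored as a double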
Fixing that single error, I find the following chart, which is exactly what I wanted.
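As an aside, once the spike positions are known in advance, the loop can be replaced entirely; here is a vectorized sketch of the same spike-and-decay idea (the flat segment before the first spike will differ slightly from the loop version):
spikes_vec <- function(maxiter, total_spikes = 10, max_height = 0.001,
                       min_height = 0.000005, decay_rate = 1) {
  spike_at <- maxiter / total_spikes
  t_since <- seq_len(maxiter) %% spike_at   # 0 at each spike, then 1, 2, ...
  max_height * exp(-decay_rate * t_since) + min_height
}
plot(spikes_vec(100), type = "l")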
set.seed(123)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.30)
result[[m]]=u
}
result
for (m in 1:40) if (any(result[[m]] == 1)) break
m
m is the exit time for the company; as we change the probability, it will give a different result. Using this m as the exit, I have to find whether there was a funding round in between, so I draw from a binomial distribution with some probability: when you get a 1, that means there is a funding round (j). If there is a funding round, I have to find the limit of the round using a random uniform distribution. I am not sure if the code for rbinom is right or whether it runs until m.
I am getting the y value for all 40 iterations, but I need it so that when rbinom == 1 it goes to the next loop. I am also trying to store the values in a matrix, but they are not getting stored. Please help me with that.
mat1<- matrix(0,nrow = 40,ncol = 2) #empty matrix
for(j in 1:m) {
k<- if(any(rbinom(1e3,40,0.42)==1)) #funding round
{
y<- runif(j, min = 0, max = 1) #lower and upper bound
mat1[l][0]<-j
mat1[l][1]<-y #matrix storing the value
}
}
result
mat1
y
The answer to your first question:
result <- vector("list",40)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.05)
print(u)
result[[m]]=u
}
u
The second question is not clear. Could you rephrase it?
To generate 40 vectors of random binomial numbers you don't need a loop at all, use ?replicate.
u <- replicate(40, rbinom(1e3, 40, 0.05))
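Note that with the default simplify, replicate() returns a 1000 x 40 matrix here, so each column is one of the 40 vectors:
dim(u)   # 1000 40
u[, 3]   # the third vector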
As for your second question, there are several problems with your code. I will try to address them; it will be up to you to say whether the proposed corrections are right.
The following does basically nothing
for(k in 1:40)
{
n<- (any(rbinom(1e3,40,0.05)==1)) # n is TRUE/FALSE
}
k # at this point, equal to 40
There are better ways of creating a T/F variable.
# list(0, nrow = 40, ncol = 2) # wrong, don't use list()
matrix(0, nrow = 40,ncol = 2) # or maybe NA
Then you set l=0 when indices in R start at 1. Anyway, I don't believe you'll need this variable l.
if(any(rbinom(1e3,40,0.30)==1)) # probably TRUE, left as an exercise
# in probability theory
Then, finally,
mat1[l][0]<-j # index `0` doesn't exist
Please revise your code and tell us what you want to do; we're glad to help.
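In the meantime, here is a guess at what the storage loop might look like with valid indexing; treat it as a sketch of one possible intent (row j records round j), not a definitive fix:
mat1 <- matrix(0, nrow = m, ncol = 2)
for (j in 1:m) {
  if (any(rbinom(1e3, 40, 0.42) == 1)) {  # a funding round occurs
    y <- runif(1, min = 0, max = 1)       # limit of the round
    mat1[j, 1] <- j                       # row/column indices start at 1
    mat1[j, 2] <- y
  }
}
mat1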
I have two samples of 1000 in length each, and I need to construct covariance matrices for these two samples.
Each sample is made up of 10 clusters of size 100. Now, each unit has a variable attached to it that identifies the cluster it came from, and the covariance between two units will be X if they are from the same cluster, or Y if they are from different clusters.
So I need to find a way to construct a covariance matrix that looks like the following picture, except the blocks of X are 100x100 and not 3x3:
Is there any method of doing this easily? The matrix is far too big to create it by manually inputting the data, and the procedure needs to be repeated thousands of times within a loop.
You mean something like this?
m <- c(rep(1, 100), rep(0, 300),
rep(0, 100), rep(1, 100), rep(0, 200),
rep(0, 200), rep(1, 100), rep(0, 100),
rep(0, 300), rep(1, 100))
m <- matrix(m, nrow = 4, byrow = TRUE)  # one row per block pattern; without nrow, matrix() returns a single column
m <- m[rep(1:4, each = 100), ]          # expand each pattern to 100 rows for the full 400 x 400 matrix
m
I managed to find a straightforward solution requiring no extra packages, so I'll post the solution here in case other people encounter the same problem.
The easiest way for me was to create a double loop that goes through each index of the matrix and enters the value manually. Obviously this is very computationally expensive, so if you need to do this many times I'd recommend a more efficient approach (see the vectorized sketch after the loop below).
m<-matrix(rep(NA,1000000),ncol=1000)
for(i in 1:1000){
for(j in 1:1000){
if(sampleA$cluster[i]==sampleA$cluster[j]){
m[i,j]<-"X"
}
else{
m[i,j]<-"Y"
}
}
}
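For completeness, here is one such vectorized sketch. X and Y stand for whatever covariance values you use (the numbers below are placeholders), and cluster labels the 1000 units by cluster, as sampleA$cluster does above:
X <- 1    # within-cluster covariance (placeholder)
Y <- 0.2  # between-cluster covariance (placeholder)
cluster <- rep(1:10, each = 100)
m <- ifelse(outer(cluster, cluster, "=="), X, Y)  # 1000 x 1000 in one step
# equivalently, as a Kronecker product of block patterns:
# m <- kronecker(diag(10), matrix(X - Y, 100, 100)) + Y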