I am new to R programming and need to simulate the following experiment in R.
A poker hand consists of 5 cards dealt from a conventional pack of 52 cards, the order of the cards not being important. Find the probability that a given hand has at least one king and at least one queen.
I know how to do this for at least one king, but not for at least one king and at least one queen. The code for at least one king is:
deck<- rep(1:13 , each=4)
#here J=11 , Q=12, K=13
nhands <- 1000
Xk<- c(rep(-1, nhands))
for( i in 1:nhands){
hand <- sample( deck , 5 , replace= FALSE)
numberofK<-0
for( j in 1:5){ # count Kings
if( hand[j] == 13){
numberofK <- numberofK +1
}
}
#print(numberofK)
Xk[i] <-numberofK
#print(hand)
}
table(Xk) /nhands
Can anyone please help me code the required one? Thanks.
The probability that a hand of 5 cards contains at least one king or one queen can be estimated as follows, where sample(deck, 5) deals a hand of 5 cards, any(c(12, 13) %in% ...) checks whether a king or a queen is in the hand, and sum counts how many times this happens within the 100,000 simulations:
set.seed(10)
sum(sapply(1:100000, function(i) { any(c(12, 13) %in% sample(deck, 5))}))/100000
# [1] 0.58365
Theoretically, the probability of such case would be:
(choose(52, 5) - choose(44, 5))/choose(52, 5)
# [1] 0.5821375
The two are pretty close.
On the other hand, if it is indeed "and", meaning at least one king and at least one queen, simulation gives:
set.seed(10)
sum(sapply(1:100000, function(i) { all(c(12, 13) %in% sample(deck, 5))}))/100000
# [1] 0.09932
And theoretically:
(choose(52, 5) - choose(44, 5) - (2*(choose(48, 5) - choose(44, 5))))/choose(52, 5)
# [1] 0.1001785
And the numbers match closely.
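The inclusion-exclusion count behind the formula above can be written out step by step. This is only a sketch: the function name p_both_ranks and its arguments are made up for illustration, and it assumes 4 cards per rank. It also runs a smaller simulation that tests for a king and a queen separately, as a cross-check.

```r
# Exact probability that a k-card hand holds at least one card of each of
# two given ranks (e.g. king and queen), by inclusion-exclusion.
# p_both_ranks is a hypothetical helper, not part of the original answer.
p_both_ranks <- function(deck_size = 52, k = 5, per_rank = 4) {
  total     <- choose(deck_size, k)
  no_first  <- choose(deck_size - per_rank, k)      # hands with no king (same count for no queen)
  neither   <- choose(deck_size - 2 * per_rank, k)  # hands with no king and no queen
  (total - 2 * no_first + neither) / total
}
p_exact <- p_both_ranks()

# Cross-check by simulation: test each rank separately in every hand
set.seed(1)
deck <- rep(1:13, each = 4)   # J = 11, Q = 12, K = 13
hits <- replicate(20000, {
  hand <- sample(deck, 5)
  any(hand == 13) && any(hand == 12)
})
p_sim <- mean(hits)
c(exact = p_exact, simulated = p_sim)
```

The exact value is algebraically the same expression as the choose() formula above, just grouped as "all hands, minus hands missing a king, minus hands missing a queen, plus hands missing both".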
I am visiting a restaurant that has a menu with N dishes. Every time that I visit the restaurant I pick one dish at random. I am thinking, what is the average time until I taste all the N dishes in the restaurant?
I think that the number of dishes that I have tasted after n visits in the restaurant is a Markov chain with transition probabilities:
p_{k,k+1} = (N-k)/N
and
p_{k,k} = k/N
for k =0,1,2,...,N
I want to simulate this process in R.
Doing so (I need help here) given that the restaurant has 100 dishes I did:
nits <- 1000 #simulate the problem 1000 times
count <- 0
N = 100 # number of dishes
for (i in 1:nits){
x <- 1:N
while(length(x) > 0){
x <- x[x != sample(x=x,size=1)] # pick one dish at random that I have not tasted
count <- count + 1/nits
}
}
count
I want some help because my mathematical result is that the average time is N*log(N), and the code above produces different results.
You have 2 issues.
It's always a red flag when you loop over i, but don't use i inside the loop. Set up a structure to hold the results of every iteration:
results = integer(length = nits)
...
for (i in 1:nits){
...
while(length(x) > 0){
...
}
results[i] <- count
}
Your text says
pick one dish at random
Your code says
pick one dish at random that I have not tasted
If you always pick a dish you have not tasted, then the problem is trivial: it will take N visits. Let's adjust your code to pick one dish at random whether you have tasted it or not:
nits <- 1000 #simulate the problem 1000 times
results = integer(length = nits)
N = 100 # number of dishes
for (i in 1:nits){
dishes = 1:N
tasted = rep(0, N)
count = 0
while(sum(tasted) < N){
tasted[sample(dishes, size = 1)] = 1
count = count + 1
}
results[i] = count
}
results
Looking at the results, I think you may have made a math error:
100 * log(100)
# [1] 460.517
mean(results)
# [1] 518.302
You can read more about this problem on Wikipedia: Coupon Collector's Problem. Using the result there, the simulation is doing quite well:
100 * log(100) + .577 * 100 + 0.5
# [1] 518.717
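For comparison, the exact expectation is N times the N-th harmonic number, and the chain from the question can be simulated without an inner while loop. The sketch below assumes that the waiting time in state k is geometric with success probability (N-k)/N, which is what the transition probabilities in the question imply; the helper name simulate_visits is made up for illustration.

```r
N <- 100

# Exact expected number of visits: N * (1 + 1/2 + ... + 1/N)
expected_visits <- N * sum(1 / (1:N))

# One replication: add up the waiting times in states k = 0, ..., N-1.
# rgeom() returns the number of failures before the first success,
# so each waiting time is rgeom(...) + 1.
simulate_visits <- function(N) {
  p_new <- (N - 0:(N - 1)) / N   # probability of tasting a new dish in state k
  sum(rgeom(N, p_new) + 1)
}

set.seed(1)
sim_mean <- mean(replicate(5000, simulate_visits(N)))
c(exact = expected_visits, simulated = sim_mean)
```

The approximation N*log(N) + 0.577*N + 0.5 used above is the usual asymptotic expansion of this harmonic sum.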
I'm trying to run a simulation in R but I'm quite stuck. The simulation has to do with a variation of the Airplane Probability problem.
This is the scenario: A small 100 seat theatre is conducting a play, and assigns a random seat number (from 1–100) to the ticketed guests right before they walk in. There are 36 guests in total, who usually sit in their assigned seats. If their seats are occupied for some reason, they choose another seat at random. An actor who is part of the play messes this up by picking a seat out of the 100 seats randomly, possibly taking a ticketed audience's numbered seat.
I want to run this in R and answer two questions: What is the probability that the last person is in the wrong seat? And on average, approximately how many people will sit in the wrong seat?
Could someone help me with this? I added my code below and what I attempt to do is find the probability that the last person is in the wrong seat... I think there are errors in my code and I would love some suggestions/help to make it better!
#Following are the 2 empty vectors which we will use later to store some probabilities and other stuff
Probregister <- c()
Register <- c()
Person <- c(1:36) #this vector creates 100 people standing in que from 1 to 100.
Seat <- sample(1:100, 100) #this vector allots each one of them a seat randomly. These are the assigned seats.
Actualseats <- c(1:100) #these are 100 empty seats in the theatre yet to be filled. Each entry is a seat no.
Actualperson <- rep(0,36) #This is an empty vector. Here we will know who is actually occupying the given Actualseat.
Data <- data.frame(Person, Seat, Actualseats, Actualperson)
Data$Actualperson[sample(1:100,1)] <- 1 #Selecting any seat from 100 empty seats in the theatre.
#this next loop cycles the decision procedure given in question from 2nd person to 36th person.
for(i in 2:36) {
if (Data$Actualperson[Data$Seat[i]] == 0) {
Data$Actualperson[Data$Seat[i]] <- i #If the seat assigned to ith person is empty then the person sits in it.
} else {
#This next line is very crucial piece and read it carefully.
#First square bracket selects only those seats which are empty. ie. Actualperson = 0
#Second square bracket randomly chooses 1 seat from these empty seats for ith person to sit in.
Data$Actualperson[which(Data$Actualperson == 0)][sample(1:length(Data$Actualperson[which(Data$Actualperson == 0)]), 1)] <- i #If their assigned seat is unavailable then they select randomly from remaining empty seats.
}
} #Here the loop ends for one trial. T
if(Data$Actualperson[Data$Seat[36]] == 36) {
Register <- append(Register, "Yes", after = length(Register)) #if 36th person is sitting in his alloted seat then add "Yes" to the Register.
} else {
Register <- append(Register, "No", after = length(Register)) #if 36th person is not sitting in his alloted seat then add "No" to the Register.
}
}
Probability <- length(Register[which(Register=="Yes")])/length(Register)
Probregister <- append(Probregister, Probability, after = length(Probregister))
}
Probsummary <- summary(Probregister)
plot(density(Probregister), col="red")
abline(v = Probsummary[3], col="blue")
This is the simulation I ran. p is the proportion of the 100 seats that the actor removes. You may replace it with a count n and drop the n <- floor(100 * p) line in the function.
func <- function(p){
x <- c(1:100) #stands for seats
y <- c(1:36) #stands for 36 person's seats, consider it as 1~36 cause it doesn't matter
correct <- rep(NA, 36) #dummy vector to record if person seat on correct seat
fin_passenger_dummy <- rep(NA,36) #dummy vector to record final passenger seat on correct seat
n <- floor(100 * p) #number of seats that an actor remove
yy <- sample(y, 36) #order of persons
actor <- sample(x, n) #id's of removed seats
seats <- setdiff(x, actor) #id's of remained seats
for (i in 1:36){
if (yy[i] %in% seats){
correct[yy[i]] <- TRUE #append that yy[i] seat on his seat
fin_passenger_dummy[i] <- TRUE #append that yy[i] seat on his seat
seats <- setdiff(seats, yy[i]) #update remaining seats
} else{
y_sad <- sample(seats, 1) #randomly choose seat to seat
correct[yy[i]] <- FALSE
fin_passenger_dummy[i] <- FALSE
seats <- setdiff(seats, y_sad)
}
}
return(list(total = correct, final = tail(fin_passenger_dummy, 1))) # tail() is base R; last() needs dplyr or data.table
}
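To estimate the probability for the last person, replicate this function enough times and take the mean of $final. Note that $final is TRUE when the last person sits in their own seat, so the probability of the wrong seat is one minus this mean. For example, p = 0.3 means the actor removes 30 seats: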
dummy <- c()
for (i in 1:1000){
dummy <- c(dummy, func(0.3)$final)
}
mean(dummy)
[1] 0.532
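And for "On average, approximately how many people will sit in the wrong seat", note that $total flags the people who sit in their own seats, so the sum below counts correct seats and the wrong-seat count is 36 minus it: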
dummy <- c()
for (i in 1:1000){
dummy <- c(dummy, sum(func(0.3)$total))
}
mean(dummy)
[1] 11.7015
will do.
If you need more explanation of the code, please let me know.
I'll refer to anyone that sits in a random unticketed seat as a "floater". The first floater is the actor. If the actor takes someone's seat, that person becomes the floater, etc. A few observations to speed things up:
The actual seats positions/ordering doesn't matter, only the order with which the guests enter the theatre
Each unoccupied ticketed seat has an equal probability that the floater will sit in it
The probability that the floater will sit in another guest's seat is the number of unoccupied ticketed seats divided by the number of unoccupied seats
An expected 64% of the simulation replications result in all guests sitting in their ticketed seats. The other replications don't need to be simulated. We need only simulate the number of replications that require simulation (via rbinom).
This lets us run the simulation recursively:
TheatreRec <- function(tickets, seats) {
# recursive function for simulating the theatre seating problem
# Inputs:
# tickets: the number of ticketed guests yet to be seated
# seats: the number of unoccupied seats remaining
# Output: an integer vector of length 2:
# 1: number of guests in the wrong seat
# 2: whether the last seated guest sits in the wrong seat (0 or 1)
# the floater sits in a random unoccupied seat
floater <- sample(seats, 1)
if (floater > tickets) {
# the floater didn't take anyone's seat
return(c(0L, 0L))
} else if (floater < tickets){
# the floater took one of the guests' seats, but not the last guest's seat
return(c(1L, 0L) + TheatreRec(tickets - floater, seats - floater))
} else {
# the floater took the last guest's seat
return(c(1L, 1L))
}
}
# create a vectorized version of TheatreRec
TheatreRecVec <- Vectorize(TheatreRec)
I'll run a million replications. For an expected 36% of the replications, the actor will sit in one of the guest's seats. For these replications, use the sample function to simulate whose seat the actor takes (in order of entry into the theatre). Then complete the simulation with TheatreRecVec, which gives the results column-wise. Note that for all of these replications, the first floater (after the actor) needs to be added to the results of TheatreRecVec.
floater <- sample(36, rbinom(1, 1e6, 0.36), replace = TRUE)
(results <- setNames(rowSums(rbind(1L, floater == 36) + TheatreRecVec(36L - floater, 100L - floater))/1e6, c("Avg. in wrong seat", "P(last guest in wrong seat)")))
Avg. in wrong seat P(last guest in wrong seat)
0.442944 0.015423
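As a cross-check on the shortcuts above, the seating can also be simulated directly, seat by seat. This is a slower sketch under the same reading of the problem (distinct ticketed seats, the actor sits first, displaced guests pick uniformly among free seats); the function name simulate_theatre is made up for illustration.

```r
simulate_theatre <- function(n_seats = 100, n_guests = 36) {
  assigned <- sample.int(n_seats, n_guests)   # distinct ticketed seats, in order of entry
  occupied <- logical(n_seats)
  occupied[sample.int(n_seats, 1)] <- TRUE    # the actor sits in a random seat
  wrong <- logical(n_guests)
  for (g in seq_len(n_guests)) {
    seat <- assigned[g]
    if (occupied[seat]) {
      # own seat is taken: choose uniformly among the free seats
      free <- which(!occupied)
      seat <- free[sample.int(length(free), 1)]
      wrong[g] <- TRUE
    }
    occupied[seat] <- TRUE
  }
  c(n_wrong = sum(wrong), last_wrong = as.integer(wrong[n_guests]))
}

set.seed(1)
sims <- replicate(20000, simulate_theatre())  # 2 x 20000 matrix
direct <- rowMeans(sims)
direct
```

With far fewer replications than the recursive version, the two averages should only agree to about two decimal places.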
EDIT to compare the simulation to the exact solutions:
To check the simulation, we can use the exact value for the expected number of guests who end up in the wrong seat:
digamma(s + 1) - digamma(s - g + 1)
where s is the number of seats in the theatre, and g is the number of ticketed guests.
> digamma(100 + 1) - digamma(100 - 36 + 1)
[1] 0.4434866
The probability that the last guest ends up in the wrong seat is simply 1/(s - g + 1)
> 1/(100 - 36 + 1)
[1] 0.01538462
These match pretty closely with the simulation results above.
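The digamma expression is a difference of harmonic numbers, so it can also be written as a plain sum of reciprocals from s - g + 1 up to s. A short check of that identity, using the same s and g:

```r
s <- 100   # seats in the theatre
g <- 36    # ticketed guests

# digamma(s + 1) - digamma(s - g + 1) equals 1/(s-g+1) + ... + 1/s
expected_wrong <- sum(1 / ((s - g + 1):s))
via_digamma    <- digamma(s + 1) - digamma(s - g + 1)
c(harmonic_sum = expected_wrong, digamma = via_digamma)
```

The first term of the sum, 1/(s - g + 1), is the same quantity as the last-guest probability quoted above.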
A five card charlie is where you draw five cards and don't go bust, i.e. the points from 5 cards is <= 21. I want to find the probability of a 5 card charlie by brute force - i.e. simulate a large number of "plays" and check if you go bust or not.
I'm doing this by brute force in R. I'm assuming here that there are 4 decks, as is common in casinos, and I'm sampling 5 cards from these 4 decks, checking whether the hand wins, and if so counting it towards the probability. Googling suggests it should be around 1/50, i.e. 2%:
deck <- c(rep(1:9, 16), rep(10, 64))
n <- 0
size <- 1:10e6
for (i in size){
smpl <- sample(deck,5,replace = F)
if (sum(smpl) <= 21){
n <- n+1
}
}
print(n/max(size) * 100)
[1] 5.98644
Note that "deck" here holds point values: the cards worth 1 to 9 points come in 4 suits and 4 decks, so 1:9 is repeated 16 times, and Ten, Jack, Queen and King all count as ten, giving 4*4*4 = 64 such cards.
I sample 5 cards without replacement, check whether the sum is <= 21 and count it if so, repeat this 10 million times, and calculate the probability. However, this gives 6% rather than 2%.
I have two questions:
1) How can I modify this so that I can sample 100 million or more plays?
2) Where am I going wrong with the 6% probability?
I think that what is off here is the assumption that it should be 2%.
Your code said about 5%. I've adapted an existing answer and it also says 5%:
deck <- c(rep(1:9, 4), rep(10, 16))
result <- combn(deck, 5, function(x) {sum(x) <= 21})
sum(result)/dim(result)
[1] 0.05385693
For k = 5, 6, 7 - card charlie you could try the following (to compute the probability with simulation) with replicate:
sapply(5:7, function(k) mean(replicate(n=10^6,
sum(sample(c(rep(1:9, 4), rep(10, 16)), k, replace = F)) <= 21)))
#[1] 0.053943 0.008525 0.000890
Here is how the probability decreases with k (for k-card charlie)
library(ggplot2)
ggplot(aes(card, prob),
data=data.frame(card=2:7, prob=sapply(2:7, function(x) mean(replicate(n=10^6, sum(sample(c(rep(1:9, 4), rep(10, 16)),x,replace = F)) <= 21))))) +
geom_point() + geom_line()
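For more than one deck, enumerating every hand with combn quickly becomes too large, but the exact probability can still be counted by working through the ten point values one at a time. The sketch below is one way to do that; the function name charlie_prob and its arguments are made up for illustration, and it assumes the same point system as above (values 1 to 9 with 4 cards each per deck, and 16 ten-valued cards per deck).

```r
charlie_prob <- function(decks = 1, k = 5, limit = 21) {
  counts <- c(rep(4 * decks, 9), 16 * decks)   # cards available per point value 1..10
  # ways[n + 1, s + 1] = number of ways to choose n cards with total s
  # using only the point values processed so far
  ways <- matrix(0, nrow = k + 1, ncol = limit + 1)
  ways[1, 1] <- 1
  for (v in 1:10) {
    new_ways <- ways                 # j = 0 cards of value v
    for (j in 1:k) {                 # j cards of value v
      for (n in 0:(k - j)) {
        for (s in 0:limit) {
          if (s + j * v > limit) break
          new_ways[n + j + 1, s + j * v + 1] <-
            new_ways[n + j + 1, s + j * v + 1] +
            choose(counts[v], j) * ways[n + 1, s + 1]
        }
      }
    }
    ways <- new_ways
  }
  sum(ways[k + 1, ]) / choose(sum(counts), k)
}

p_one_deck   <- charlie_prob(decks = 1)   # comparable to the combn result above
p_four_decks <- charlie_prob(decks = 4)   # comparable to the 4-deck simulation in the question
c(one_deck = p_one_deck, four_decks = p_four_decks)
```

Because this only loops over point values, hand sizes and totals, the running time does not depend on the number of decks or on a number of simulated plays.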
I have following data.frame and dictionaries with pos/negWords:
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "orgtop",
"great improvement", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super")
negWords <- c("hate","bad","not good","horrible")
And the following function, which matches words in each sentence against pos/negWords from the dictionaries and computes a sentiment value according to frequency of occurrence, using exact matches only.
# descending order for words length (prepare data for function below)
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
scoreSentence <- function(sentence){
score <- 0
for(x in 1:nrow(wordsDF)){
match <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
count <- length(grep(match,sentence)) # count them
if(count){
score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), '', sentence) # remove words which were matched
}
}
score
}
which generate desired output with calling:
SentimentScore <- unlist(lapply(sent$words, scoreSentence))
bbb <- cbind(sent, SentimentScore)
This resulted into mentioned desired output:
words user SentimentScore
1 just right size and i love this notebook 1 2
2 benefits great laptop 2 2
3 wouldnt bad notebook 3 -1
4 very good quality 4 1
5 orgtop 5 0
6 great improvement 6 1
7 notebook is not good but i love batterytop 7 0
A for loop was used for this, but I have 7,000 pos/negWords and 200,000 sentences, so it never finishes...
Please, do you have a better solution for this task? Mainly I need the same result in SentimentScore :-)
I'll appreciate any of your advice or solution. Many thanks in advance.
First, you should run on subpieces of your data.frame, because list resizing during lapply probably generates a huge overhead :
ptm = proc.time(); f=lapply(1:100000, function(X){X}); print(proc.time()-ptm)
user system elapsed
0.056 0.004 0.061
ptm = proc.time(); f=lapply(1:1000000, function(X){X}); print(proc.time()-ptm)
user system elapsed
1.112 0.004 1.119
Here a factor of 10 in sequence size yields a factor of roughly 18 to 20 in computation time (1.119 s versus 0.061 s elapsed). So use small lists, then concatenate them into one big list.
The declaration of a big data.frame does not take long compared to its extension, so you have to declare it and then fill it with your sublists :
bbb = data.frame( words=sent[1], user=sent[2], scoreSentence=rep(0, nrow(sent)) )
MAX_SIZE = 10000
for ( ii in 0:(ceiling(nrow(sent)/MAX_SIZE)-1) ) {
selected_rows = (1 + ii * MAX_SIZE):min( (ii+1)*MAX_SIZE, nrow(sent) )
bbb[selected_rows, "scoreSentence"] = unlist(lapply(sent$words[selected_rows], scoreSentence))
}
MAX_SIZE must be big enough because a for loop is slower than lapply (you want as few passes through the for loop as possible), but not too big, or the list extension overhead will make the program slower.
Alternative with parallelisation
Parallelisation is a good way to make a set of complex calculations faster by running each of them on a different core. In your case we make the calculations complex by sending big chunks of sentences.
With mclapply from the parallel package, you send each chunk to a different forked process; each one is fast because the chunk is not too big. A wrapper for scoreSentence that handles a vector of rows is required, and it must be defined before mclapply is called:
library(parallel) # provides mclapply
# wrapper so that each worker scores a whole chunk of rows
scoreSentenceWrapper <- function(selected_rows) {
return(unlist(lapply(sent$words[selected_rows], scoreSentence)))
}
bbb = data.frame( words=sent[1], user=sent[2], scoreSentence=rep(0, nrow(sent)) )
MAX_SIZE = 10000
mc_list = list()
mc_list[[ceiling(nrow(sent)/MAX_SIZE)]] = 0
for ( ii in 0:(ceiling(nrow(sent)/MAX_SIZE)-1) ) {
mc_list[[ii+1]] = (1 + ii * MAX_SIZE):min( (ii+1)*MAX_SIZE, nrow(sent) )
}
bbb[,"scoreSentence"] = unlist(mclapply(mc_list, scoreSentenceWrapper))
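Another option, independent of chunking, is to turn the loops inside out: loop once over the dictionary and let grepl and gsub work on all sentences at once, since both are vectorized over their text argument. The sketch below is an assumption about how that could look, not a drop-in replacement that has been timed on the full data. The name score_all is made up, it uses \\b with perl = TRUE instead of \\< and \\>, and it assumes the dictionary entries contain no regex metacharacters.

```r
sentences <- c("just right size and i love this notebook", "benefits great laptop",
               "wouldnt bad notebook", "very good quality", "orgtop",
               "great improvement", "notebook is not good but i love batterytop")
posWords <- c("great", "improvement", "love", "great improvement", "very good", "good",
              "right", "very", "benefits", "extra", "benefit", "top", "extraordinarily",
              "extraordinary", "super", "benefits super")
negWords <- c("hate", "bad", "not good", "horrible")

# Dictionary sorted by decreasing length, so longer phrases are matched first
dict <- data.frame(words = c(posWords, negWords),
                   value = c(rep(1, length(posWords)), rep(-1, length(negWords))),
                   stringsAsFactors = FALSE)
dict <- dict[order(-nchar(dict$words)), ]

score_all <- function(sentences, dict) {
  score <- numeric(length(sentences))
  for (x in seq_len(nrow(dict))) {
    word <- dict$words[x]
    hit <- grepl(paste0("\\b", word, "\\b"), sentences, perl = TRUE)
    if (any(hit)) {
      score[hit] <- score[hit] + dict$value[x]   # one point per sentence, as in scoreSentence
      # remove the matched phrase so shorter entries cannot match it again
      sentences[hit] <- gsub(paste0("\\s*\\b", word, "\\b\\s*"), "",
                             sentences[hit], perl = TRUE)
    }
  }
  score
}

SentimentScore <- score_all(sentences, dict)
SentimentScore
```

This still makes one pass per dictionary entry, but each pass handles every sentence in a single call instead of calling scoreSentence once per sentence.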
I am relatively new to R and all its wisdom, and I am trying to make my script more efficient. I am using a loop to simulate how an animal moves among different sites. The problem is that when I increase the number of sites or change the initial parameters (based on a fixed probability of moving or staying in the same site), I end up with a very complicated loop. Since I have to run several different simulations with different parameters, I would prefer a more efficient loop or function that can adjust to different situations. The first loop fills a matrix according to the initial probabilities; the second loop compares the cumulative probability matrix against a random number from a list of values (10 in this example) and decides the fate of that individual (either stay or go to a new site).
Here is a simplification of my code:
N<-4 # number of sites
sites<-LETTERS[seq(from=1,to=N)]
p.stay<-0.45
p.move<-0.4
move<-matrix(c(0),nrow=N,ncol=N,dimnames=list(c(sites),c(sites)))
from<-array(0,c(N,N),dimnames=list(c(sites),c(sites)))
to<-array(0,c(N,N),dimnames=list(c(sites),c(sites)))
# Filling matrix with fixed probability #
for(from in 1:N){
for(to in 1:N){
if(from==to){move[from,to]<-p.stay} else {move[from,to]<-p.move/(N-1)}
}
}
move
cumsum.move<-cumsum(data.frame(move))
steps<-100
result<-as.character("") # for storing results
rand<-sample(random,steps,replace=TRUE)
time.step<-data.frame(rand)
colnames(time.step)<-c("time.step")
time.step$event<-""
to.r<-(rbind(sites))
j<-sample(1:N,1,replace=T) # first column to select (random number)
k<-sample(1:N,1,replace=T) # site selected after leaving and coming back
# Beginning of the longer loop #
for(i in 1:steps){
if (time.step$time.step[i]<cumsum.move[1,j]){time.step$event[i]<-to.r[1]} else
if (time.step$time.step[i]<cumsum.move[2,j]){time.step$event[i]<-to.r[2]} else
if (time.step$time.step[i]<cumsum.move[3,j]){time.step$event[i]<-to.r[3]} else
if (time.step$time.step[i]<cumsum.move[4,j]){time.step$event[i]<-to.r[4]} else
if (time.step$time.step[i]<(0.95)){time.step$event[i]<-NA} else
if (time.step$time.step[i]<1.0) break # break the loop
result[i]<-time.step$event[i]
j<-which(to.r==result[i])
if(length(j)==0){j<-k} # for individuals the leave and come back later
}
time.step
result
This loop is part of a bigger loop that will simulate and store the result after a series of simulations. Any ideas or comments on how I can improve the efficiency of this loop so that I can easily modify the number of sites or change the initial probability parameters without repeating or having to do major edits of the loop will be appreciated.
I'm not sure if I'm capturing the essence of your code, but this is faster than the for loop. It starts to have an advantage once we get past a few thousand steps. I replaced "random" with a sample from the uniform distribution (runif()).
system.time(
time.step$event <- sapply(
time.step$time.step,
function(x) rownames(
cumsum.move[which(cumsum.move[,j] > x),])[[1]]
)
)
Here are my results at 10,000 steps. I'm working on a laptop, so 100,000 steps with the for loop didn't finish in under 1 minute, but sapply did it in 14 seconds.
> system.time(
+ time.step$event <- sapply(
+ time.step$time.step,
+ function(x) rownames(
+ cumsum.move[which(cumsum.move[,j] > x),])[[1]]
+ )
+ )
user system elapsed
1.384 0.000 1.387
> head(time.step)
time.step event
1 0.2787642 C
2 0.3098240 C
3 0.9079045 D
4 0.9904031 D
5 0.3754330 C
6 0.6984415 C
> system.time(
+ for(i in 1:steps){
+ if (time.step$time.step[i]<cumsum.move[1,j]){time.step$event[i]<-to.r[1]} else
+ if (time.step$time.step[i]<cumsum.move[2,j]){time.step$event[i]<-to.r[2]} else
+ if (time.step$time.step[i]<cumsum.move[3,j]){time.step$event[i]<-to.r[3]} else
+ if (time.step$time.step[i]<cumsum.move[4,j]){time.step$event[i]<-to.r[4]}
+ result[i]<-time.step$event[i]
+ }
+ )
user system elapsed
3.137 0.000 3.143
> head(time.step)
time.step event
1 0.2787642 C
2 0.3098240 C
3 0.9079045 D
4 0.9904031 D
5 0.3754330 C
6 0.6984415 C
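To avoid editing the if/else chain whenever the number of sites changes, the next state can be drawn with sample.int() and a prob vector built from the current row of the matrix. The sketch below is one possible reading of the question's loop, not a verified equivalent: p.away and p.end are names introduced here for the 0.85 to 0.95 band (recorded as NA) and the 0.95 to 1.00 band (loop stops), and home plays the role of k in the question.

```r
N <- 4                       # number of sites; change freely
sites <- LETTERS[seq_len(N)]
p.stay <- 0.45
p.move <- 0.4
p.away <- 0.10               # assumption: the band that records NA in the question
p.end  <- 0.05               # assumption: the band that breaks the loop

# Transition matrix without nested loops
move <- matrix(p.move / (N - 1), N, N, dimnames = list(sites, sites))
diag(move) <- p.stay

simulate_path <- function(move, steps, p.away, p.end) {
  n <- nrow(move)
  outcomes <- c(rownames(move), NA)   # NA marks an "away" step
  current <- sample.int(n, 1)         # starting site
  home <- sample.int(n, 1)            # site used after an away step
  path <- character(steps)
  n_done <- 0
  for (i in seq_len(steps)) {
    nxt <- sample.int(n + 2, 1, prob = c(move[current, ], p.away, p.end))
    if (nxt == n + 2) break           # end of this individual's record
    path[i] <- outcomes[nxt]
    n_done <- i
    current <- if (nxt <= n) nxt else home
  }
  path[seq_len(n_done)]
}

set.seed(1)
path <- simulate_path(move, steps = 100, p.away, p.end)
head(path)
```

Changing N or the probabilities only touches the first few lines; the loop body stays the same.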