Generate samples from frequency table with fixed values - r

I have a population 2x2 frequency table, with specific values: 20, 37, 37, 20. I need to generate N number of samples from this population (for simulation purposes).
How can I do it in R?

Try this. In the example, the integers 1 to 4 represent the four cells of the 2x2 table. As you can see, the relative frequencies closely resemble those in your 20, 37, 37, 20 table.
probs <- c(20, 37, 37, 20)  # cell counts of the population table
N <- 1000                   # sample size
mysample <- sample(x = 1:4, size = N, replace = TRUE, prob = probs / sum(probs))
table(mysample) / N
# Run again with a sample size of 100,000
N <- 100000
mysample <- sample(x = 1:4, size = N, replace = TRUE, prob = probs / sum(probs))
# The relative frequencies should now be close to 20/114, 37/114, 37/114, 20/114
table(mysample) / N
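If a "sample" is meant to be a whole 2x2 table rather than a single cell label, one option is to draw the four cell counts directly from a multinomial distribution. The sketch below assumes each simulated table has the same total as the population table (114) and that the cells are filled column by column; the names n_tables, tabs and first_table are made up for illustration.

```r
probs <- c(20, 37, 37, 20)   # cell counts of the population table
n_tables <- 5                # hypothetical number of simulated tables
set.seed(1)                  # only for reproducibility

# Each column of `tabs` holds the four cell counts of one simulated table;
# rmultinom() rescales `prob` to sum to 1 internally.
tabs <- rmultinom(n = n_tables, size = sum(probs), prob = probs)

# Reshape the first draw into a 2x2 table (assumed column-major cell order)
first_table <- matrix(tabs[, 1], nrow = 2)
first_table
```

Every column of tabs sums to 114, so each draw is a table of the same size as the population table.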

Related

How to simulate random variates with no pdf in R?

I want to simulate a string of random non-negative integer values in R.
However, those values should not follow any particular probability distribution function and could be empirically distributed.
How do I go about doing it?
You will need a distribution; there is no alternative, philosophically. There's no such thing as a "random number," only numbers randomly distributed according to some distribution.
To sample from an empirical distribution stored as my_dist, you can use sample():
my_dist <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55) # first 10 Fibonacci numbers
sample(my_dist, 100, replace = TRUE) # draw 100 numbers from my_dist w/ replacement
Or, for uniformly distributed integers between (for instance) 1 and 10, you could do:
sample(1:10, 100, replace = TRUE)
There are, of course, specific distributions implemented as functions in base R and various packages, but I'll avoid those since you said you weren't interested in them.
Editing per Rui's good suggestion: If you want non-uniform probabilities, you can specify the prob parameter:
sample(1:3, 100, replace = TRUE, prob = c(6, 3, 1))
# draws a 1 with 60% prob., a 2 with 30% prob., and a 3 with 10% prob. (the weights are rescaled to sum to 1)
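As a quick check of the weighting described above, a larger draw can be tabulated and compared with the intended 60/30/10 split. This is only a sketch; the names draws and rel_freq and the sample size are arbitrary choices.

```r
set.seed(42)  # only for reproducibility
draws <- sample(1:3, 1e5, replace = TRUE, prob = c(6, 3, 1))

# Relative frequency of each value; should be near 0.6, 0.3 and 0.1
rel_freq <- prop.table(table(draws))
round(rel_freq, 2)
```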

Calculating Running Percentile in R

I'm trying to calculate the percentiles from 1:i in a column. For example, for the nth data point, calculate the percentile only using the first n values.
I have tried using quantile, but can't seem to figure out how to generalize it.
mydata <- c(1, 25, 43, 2, 5, 17, 40, 15, 12, 8)
perc.fn <- function(vec, n) {
  (rank(vec[1:n], na.last = TRUE) - 1) / (length(vec[1:n]) - 1)
}
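One possible reading of the question is that, for each position i, only the percentile of the i-th value among the first i values is wanted. Under that assumption, and reusing the same (rank - 1) / (n - 1) formula, a sketch could look like the following. The function name running_percentile is hypothetical, and the first element is returned as NA because the formula is undefined for a single value.

```r
mydata <- c(1, 25, 43, 2, 5, 17, 40, 15, 12, 8)

running_percentile <- function(vec) {
  vapply(seq_along(vec), function(i) {
    if (i == 1) return(NA_real_)       # (rank - 1) / 0 is undefined
    # rank of the i-th value among the first i values, rescaled to [0, 1]
    (rank(vec[1:i])[i] - 1) / (i - 1)
  }, numeric(1))
}

running_pct <- running_percentile(mydata)
running_pct
```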

In R, sample from a neighborhood according to scores

I have a vector of numbers, and I would like to sample a number which is between a given position in the vector and its neighbors such that the two closest neighbors have the largest impact, and this impact is decreasing according to the distance from the reference point.
For example, let's say I have the following vector:
vec = c(15, 16, 18, 21, 24, 30, 31)
and my reference is the number 16 in position #2. I would like to sample a number that falls, with high probability, between 15 and 16 or (with the same high probability) between 16 and 18. The sampled numbers can be floats. A number between 16 and 21 should then be sampled with a lower probability, a number between 16 and 24 with a lower probability still, and so on.
The position of the reference is not known in advance, it can be anywhere in the vector.
I tried playing with runif and quantiles, but I'm not sure how to design the scores of the neighbors.
Specifically, I wrote the following function but I suspect there might be a better/more efficient way of doing this:
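GenerateNumbers <- function(Ind, N) {
  # Weight of each position: inverse of its distance from the reference position
  dist <- 1 / abs(Ind - 1:length(N))
  dist <- dist[!is.infinite(dist)]  # drop the reference position itself
  dist <- dist / sum(dist)
  sum(dist)  # sanity check --> 1
  # One uniform draw from each interval between adjacent elements
  V <- numeric(length(N) - 1)
  for (i in 1:(length(N) - 1)) {
    V[i] <- runif(1, N[i], N[i + 1])
  }
  sample(V, 1, prob = dist)
}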
where Ind is the position of the reference number (position 2 in this case, where the value is 16), and N is the vector. dist weighs the probabilities so that the closer neighbors have a higher impact.
Improvements upon this code would be highly appreciated!
I would go with a truncated Gaussian random sample generator, such as in the truncnorm package. On your example:
# To install it: install.packages("truncnorm")
library(truncnorm)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- rtruncnorm(n=100, a=vec[1], b=vec[7], mean=vec[2], sd=1)
The histogram of the generated sample fulfills the given prerequisites.
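If the interval weighting from the question is preferred over a truncated normal, the same idea can be written without the loop: pick one interval with probability inversely proportional to its distance from the reference position, then draw a single uniform number inside it. This is only a sketch in base R; the function name generate_number and the 1/distance weights are assumptions carried over from the question, not a recommended weighting.

```r
generate_number <- function(ind, vec) {
  n <- length(vec)
  left <- seq_len(n - 1)   # interval i spans vec[i] .. vec[i + 1]
  # Distance (in positions) from the reference to the far end of each interval
  d <- ifelse(left < ind, ind - left, left + 1 - ind)
  # sample.int() rescales `prob` internally, so 1 / d need not sum to 1
  i <- sample.int(n - 1, 1, prob = 1 / d)
  runif(1, vec[i], vec[i + 1])
}

vec <- c(15, 16, 18, 21, 24, 30, 31)
set.seed(1)  # only for reproducibility
draws <- replicate(1000, generate_number(2, vec))
hist(draws)
```

With the reference at position 2, the two adjacent intervals (15 to 16 and 16 to 18) each get weight 1, and the weights then fall off as 1/2, 1/3, 1/4 and 1/5.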

Fitting Binomial Distribution in R using data with varying sample sizes

I have some data that looks like this:
x y
1: 3 1
2: 6 1
3: 1 0
4: 31 8
5: 1 0
---
Edit: if it helps, here are sample vectors for x and y:
x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
The column on the left (x) is my sample size, and the column on the right (y) is the number of successes that occur in each sample.
I would like to fit these data using a binomial distribution in order to find the probability of a success (p). All examples for fitting a binomial distribution that I've found so far assume a constant sample size (n) across all data points, but here I have varying sample sizes.
How do I fit data like these, with varying sample sizes, to a binomial distribution? The desired outcome is p, the probability of observing a success in a sample size of 1.
How do I accomplish a fit like this using R?
(Edit #2: Response below outlines solution and related R code if I assume that the events observed in each sample can be assumed to be independent, in addition to assuming that the samples themselves are also independent. This works for my data - thanks!)
What about calculating the empirical probability of success?
x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
avr.sample <- mean(x)
avr.success <- mean(y)
p <- avr.success/avr.sample  # same as sum(y)/sum(x) = 19/165
p
# [1] 0.1151515
Or using binom.test:
z <- x-y # number of fails
binom.test(x = c(sum(y), sum(z)))
# Exact binomial test
# data: c(sum(y), sum(z))
# number of successes = 19, number of trials = 165, p-value < 2.2e-16
# alternative hypothesis: true probability of success is not equal to 0.5
# 95 percent confidence interval:
# 0.07077061 0.17397215
# sample estimates:
# probability of success
# 0.1151515
However, this assumes that:
The events corresponding to the rows are independent from each other
The events in the same row are independent from each other as well
This means that in every iteration k of the experiment (i.e. row of x) we perform an action such as throwing x[k] identical dice (not necessarily fair ones), and a success means getting a given (predetermined) number n in 1:6.
If we suppose that the above results were obtained when trying to get a 1 when throwing x[k] dice in every iteration k, then one could say that the empirical probability of getting a 1 is approximately 0.1151515.
In the end, the distribution in question would be B(sum(x), p).
PS: In the above illustration, the dice are identical to each other not only within any given iteration but across all iterations.
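Another option is a maximum-likelihood fit, for example with the bbmle package:
library(bbmle)
x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
# Negative log-likelihood; inside mf, x is the number of successes and size the number of trials
mf = function(prob, x, size){
  -sum(dbinom(x, size, prob, log=TRUE))
}
# Note the mapping: the successes y are passed as x, the sample sizes x as size
m1 = mle2(mf, start=list(prob=0.01), data=list(x=y, size=x))
print(m1)
# Coefficients:
# prob
# 0.1151535
# Log-likelihood: -13.47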
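The same maximum-likelihood estimate can also be obtained in base R, without extra packages, by fitting an intercept-only binomial GLM. This is a sketch under the same independence assumptions as above; the names fit, p_hat and ci are made up for illustration, and the interval shown is a simple Wald interval rather than the exact one from binom.test.

```r
x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)   # trials per row
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)       # successes per row

# Intercept-only binomial GLM; the response is cbind(successes, failures)
fit <- glm(cbind(y, x - y) ~ 1, family = binomial)

# The intercept is on the logit scale, so map it back to a probability
p_hat <- unname(plogis(coef(fit)))
p_hat

# Wald confidence interval on the logit scale, transformed back
ci <- plogis(coef(fit) + c(-1, 1) * qnorm(0.975) * sqrt(vcov(fit)[1, 1]))
ci
```

Because the model has a single common probability, p_hat agrees with sum(y) / sum(x) up to the fitting tolerance.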

Knapsack algorithm restricted to N-element solution

This excerpt from the CRAN documentation for the knapsack() function in the adagio package works as expected: it solves the knapsack problem with profit vector p, weight vector w, and capacity cap, selecting the subset of elements with maximum profit subject to the constraint that the total weight of the selected elements does not exceed the capacity.
library(adagio)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
(is <- knapsack(w, p, cap))
How can I add a vector length constraint to the solution and still get an optimal answer? For example, the above exercise, but the selected subset must include exactly three elements.
One approach would be to explicitly model the problem as a mixed integer linear programming problem; the advantage of explicitly modeling it in this way is that linear constraints like "pick exactly three objects" are simple to model. Here is an example with the lpSolve package in R, where each element in the knapsack problem is represented by a binary variable in a mixed integer linear programming formulation. The requirement that we select exactly three elements is captured by the constraint requiring the decision variables to sum to exactly 3.
library(lpSolve)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
exact.num.elt <- 3
mod <- lp(direction = "max",
          objective.in = p,
          const.mat = rbind(w, rep(1, length(p))),
          const.dir = c("<=", "="),
          const.rhs = c(cap, exact.num.elt),
          all.bin = TRUE)
# Solution
which(mod$solution >= 0.999)
# [1] 2 3 4
# Profit
mod$objval
# [1] 250
While subsetting the optimal solution from the adagio::knapsack function to the desired size is a reasonable heuristic when the desired subset size is smaller than the cardinality of the optimal solution to the standard problem, there exist examples where the optimal solution to the standard knapsack problem and the optimal solution to the size-constrained knapsack problem are disjoint. For instance, consider the following problem data:
p <- c(2, 2, 2, 2, 3, 3)
w <- c(1, 1, 1, 1, 2, 2)
cap <- 4
exact.num.elt <- 2
With capacity 4 and no size constraint, the standard knapsack problem will select the four elements with profit 2 and weight 1, getting total profit 8. However, when exactly 2 elements must be selected, the optimal solution is instead to select the two elements with profit 3 and weight 2, getting total profit 6.
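For a problem this small, the size-constrained optimum can be double-checked by brute force in base R, enumerating every subset of the required size. This is only a verification sketch, not a replacement for the integer programming model, since the number of subsets grows quickly; the names k, subsets, feasible, profits, best_items and best_profit are made up for illustration.

```r
p <- c(2, 2, 2, 2, 3, 3)
w <- c(1, 1, 1, 1, 2, 2)
cap <- 4
k <- 2                                   # required number of elements

subsets <- combn(length(p), k)           # each column is one candidate subset
total_w <- colSums(matrix(w[subsets], nrow = k))
profits <- colSums(matrix(p[subsets], nrow = k))
feasible <- total_w <= cap

# Best feasible subset; infeasible ones are masked with -Inf
best <- which.max(ifelse(feasible, profits, -Inf))
best_items <- subsets[, best]
best_profit <- profits[best]
best_items
best_profit
```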

Resources