Calculating Running Percentile in R

I'm trying to calculate running percentiles over a column, using only the values from 1:i at each step. For example, for the nth data point, calculate its percentile using only the first n values.
I have tried using quantile, but can't seem to figure out how to generalize it.
mydata <- c(1, 25, 43, 2, 5, 17, 40, 15, 12, 8)

perc.fn <- function(vec, n){
  (rank(vec[1:n], na.last = TRUE) - 1) / (length(vec[1:n]) - 1)
}
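If the goal is the percentile rank of the nth point among the first n values, one way to generalize this (a sketch reusing the same rank-based formula; not part of the original post) is to apply the function to every prefix and keep only its last element:

# Running percentile: rank of the n-th value among the first n observations.
# The first entry is NaN because a single observation has no rank spread.
running.pct <- sapply(seq_along(mydata), function(n) {
  r <- (rank(mydata[1:n], na.last = TRUE) - 1) / (n - 1)
  r[n]
})
running.pct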


In R, sample from a neighborhood according to scores

I have a vector of numbers, and I would like to sample a value that lies between a given position in the vector and one of its neighbors, such that the two closest neighbors have the largest influence and the influence decreases with distance from the reference point.
For example, lets say I have the following vector:
vec = c(15, 16, 18, 21, 24, 30, 31)
and my reference is the number 16 in position #2. I would like to sample a number that will, with high probability, lie between 15 and 16 or (with the same high probability) between 16 and 18. The sampled numbers can be floats. Then, with decreasing probability, a number between 16 and 21; with a yet lower probability, between 16 and 24; and so on.
The position of the reference is not known in advance, it can be anywhere in the vector.
I tried playing with runif and quantiles, but I'm not sure how to design the scores of the neighbors.
Specifically, I wrote the following function but I suspect there might be a better/more efficient way of doing this:
GenerateNumbers <- function(Ind, N){
  # weight each position by the inverse of its distance to the reference index
  dist <- 1 / abs(Ind - 1:length(N))
  dist <- dist[!is.infinite(dist)]   # drop the reference position itself
  dist <- dist / sum(dist)
  sum(dist)                          # sanity check --> 1
  # draw one candidate uniformly from each interval between consecutive values
  V <- numeric(length(N) - 1)
  for (i in 1:(length(N) - 1)) {
    V[i] <- runif(1, N[i], N[i + 1])
  }
  # pick one candidate with probability proportional to the weights
  sample(V, 1, prob = dist)
}
where Ind is the position of the reference number (16 in this case) and N is the vector. dist is a way of weighting the probabilities so that the closer neighbors have a higher impact.
Improvements upon this code would be highly appreciated!
I would go with a truncated Gaussian random sample generator, such as the one in the truncnorm package. Applied to your example:
# To install it: install.packages("truncnorm")
library(truncnorm)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- rtruncnorm(n=100, a=vec[1], b=vec[7], mean=vec[2], sd=1)
The histogram of the generated sample satisfies the stated requirements.
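For a quick visual check of that claim (a hypothetical follow-up, not part of the original answer), plot the draws:

hist(x, breaks = 20, main = "rtruncnorm draws around vec[2]")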

Fitting Binomial Distribution in R using data with varying sample sizes

I have some data that looks like this:
    x  y
1:  3  1
2:  6  1
3:  1  0
4: 31  8
5:  1  0
---
Edit: if it helps, here are sample vectors for x and y:
x = c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y = c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
The column on the left (x) is my sample size, and the column on the right (y) is the number of successes that occur in each sample.
I would like to fit these data using a binomial distribution in order to find the probability of a success (p). All examples for fitting a binomial distribution that I've found so far assume a constant sample size (n) across all data points, but here I have varying sample sizes.
How do I fit data like these, with varying sample sizes, to a binomial distribution? The desired outcome is p, the probability of observing a success in a sample size of 1.
How do I accomplish a fit like this using R?
(Edit #2: The response below outlines a solution and related R code, assuming that the events observed within each sample are independent and that the samples themselves are independent as well. This works for my data - thanks!)
What about calculating the empirical probability of success?
x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)
avr.sample <- mean(x)
avr.success <- mean(y)
p <- avr.success/avr.sample
p
# [1] 0.1151515
Or use binom.test:
z <- x - y  # number of failures
binom.test(x = c(sum(y), sum(z)))
Exact binomial test
data: c(sum(y), sum(z))
number of successes = 19, number of trials = 165, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.07077061 0.17397215
sample estimates:
probability of success
0.1151515
However, this assumes that:
The events corresponding to the rows are independent of each other
The events within the same row are independent of each other as well
This means that in every iteration k of the experiment (i.e. row of x) we perform an action such as throwing x[k] identical dice (not necessarily fair dice), and success means rolling a given (predetermined) number n in 1:6.
If we suppose that the above results were obtained while trying to roll a 1 with the x[k] dice in every iteration k, then one could say that the empirical probability of rolling a 1 is (~) 0.1151515.
In the end, the distribution in question would be B(sum(x), p).
PS: In the above illustration, the dice are identical to each other not only within any given iteration but across all iterations.
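Another option is a maximum-likelihood fit: treat each observed count y[k] as Binomial(size = x[k], prob = p) and minimise the joint negative log-likelihood, here with mle2 from the bbmle package: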
library(bbmle)

x <- c(3, 6, 1, 31, 1, 18, 73, 29, 2, 1)
y <- c(1, 1, 0, 8, 0, 0, 8, 1, 0, 0)

# negative log-likelihood: x successes out of size trials with probability prob
mf <- function(prob, x, size){
  -sum(dbinom(x, size, prob, log = TRUE))
}

# note the argument mapping: the successes y go in as x, the trials x as size
m1 <- mle2(mf, start = list(prob = 0.01), data = list(x = y, size = x))
print(m1)
Coefficients:
prob
0.1151535
Log-likelihood: -13.47
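As a sanity check, the closed-form MLE for this model is sum(y)/sum(x) = 19/165 ≈ 0.11515, so the numerical fit agrees with the empirical estimate above up to optimizer tolerance.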

Knapsack algorithm restricted to N-element solution

This excerpt from the CRAN documentation for the adagio function knapsack() functions as expected -- it solves the knapsack problem with profit vector p, weight vector w, and capacity cap, selecting the subset of elements with maximum profit subject to the constraint that the total weight of selected elements does not exceed the capacity.
library(adagio)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
(is <- knapsack(w, p, cap))
How can I add a vector length constraint to the solution and still get an optimal answer? For example, the above exercise, but the selected subset must include exactly three elements.
One approach would be to explicitly model the problem as a mixed integer linear programming problem; the advantage of explicitly modeling it in this way is that linear constraints like "pick exactly three objects" are simple to model. Here is an example with the lpSolve package in R, where each element in the knapsack problem is represented by a binary variable in a mixed integer linear programming formulation. The requirement that we select exactly three elements is captured by the constraint requiring the decision variables to sum to exactly 3.
library(lpSolve)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
exact.num.elt <- 3
mod <- lp(direction = "max",
          objective.in = p,
          const.mat = rbind(w, rep(1, length(p))),
          const.dir = c("<=", "="),
          const.rhs = c(cap, exact.num.elt),
          all.bin = TRUE)
# Solution
which(mod$solution >= 0.999)
# [1] 2 3 4
# Profit
mod$objval
# [1] 250
Subsetting the optimal solution of adagio::knapsack down to the desired size is a reasonable heuristic when the desired subset size is smaller than the cardinality of the unconstrained optimum. However, there are instances where the optimal solution to the standard knapsack problem and the optimal solution to the size-constrained problem are disjoint. For instance, consider the following problem data:
p <- c(2, 2, 2, 2, 3, 3)
w <- c(1, 1, 1, 1, 2, 2)
cap <- 4
exact.num.elt <- 2
With capacity 4 and no size constraint, the standard knapsack problem will select the four elements with profit 2 and weight 1, getting total profit 8. However, with size limit 2 the optimal solution is instead to select the two elements with profit 3 and weight 2, getting total profit 6.
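To see this concretely, here is a sketch (not part of the original answer) that runs the same lpSolve formulation on the counterexample data, once without and once with the size constraint:

library(lpSolve)
# same data as above
p <- c(2, 2, 2, 2, 3, 3)
w <- c(1, 1, 1, 1, 2, 2)
cap <- 4
exact.num.elt <- 2

# Standard knapsack: weight constraint only
std <- lp(direction = "max",
          objective.in = p,
          const.mat = matrix(w, nrow = 1),
          const.dir = "<=",
          const.rhs = cap,
          all.bin = TRUE)
which(std$solution >= 0.999)   # [1] 1 2 3 4   (profit 8)

# Size-constrained knapsack: additionally require exactly two elements
con <- lp(direction = "max",
          objective.in = p,
          const.mat = rbind(w, rep(1, length(p))),
          const.dir = c("<=", "="),
          const.rhs = c(cap, exact.num.elt),
          all.bin = TRUE)
which(con$solution >= 0.999)   # [1] 5 6       (profit 6)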

Generate samples from frequency table with fixed values

I have a population 2x2 frequency table, with specific values: 20, 37, 37, 20. I need to generate N number of samples from this population (for simulation purposes).
How can I do it in R?
Try this. In the example, the integers 1 to 4 represent cells 1, 2, 3, and 4 of the 2x2 table. As you can see, the relative frequencies closely resemble those in your 20, 37, 37, 20 table.
probs <- c(20, 37, 37, 20)
N <- 1000   # sample size
mysample <- sample(x = c(1, 2, 3, 4), size = N, replace = TRUE, prob = probs/sum(probs))
table(mysample)/N

# Run again for 100,000 samples
N <- 100000
mysample <- sample(x = c(1, 2, 3, 4), size = N, replace = TRUE, prob = probs/sum(probs))
# The relative frequencies should be similar to those in the original table
table(mysample)/N
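If the draws then need to go back into 2x2 form, one option (a sketch, assuming the cells 1 to 4 are numbered column-wise) is:

# Counts per cell reshaped into a 2x2 table; factor() keeps cells with zero draws
matrix(table(factor(mysample, levels = 1:4)), nrow = 2)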

How to apply Shapiro test in R?

I'm pretty new to statistics and I need your help. I just installed the R software and I have no idea how to work with it. I have a small sample that looks like this:
Group A : 10, 12, 14, 19, 20, 23, 34, 41, 12, 13
Group B : 8, 12, 14, 15, 15, 16, 21, 36, 14, 19
I want to apply t-test but before that I would like to apply Shapiro test to know whether my sample comes from a population which has a normal distribution. I know there is a function shapiro.test() but how can I give my numbers as an input to this function?
Can I simply enter shapiro.test(10,12,14,19,20,23,34,41,12,13, 8,12, 14,15,15,16,21,36,14,19)?
OK, because I'm feeling nice, let's work through this. I am assuming you know how to run commands, etc. First up, put your data into vectors:
A = c(10, 12, 14, 19, 20, 23, 34, 41, 12, 13)
B = c(8, 12, 14, 15, 15, 16, 21, 36, 14, 19)
Let's check the help for shapiro.test().
help(shapiro.test)
In there you'll see the following:
Usage
shapiro.test(x)
Arguments
x a numeric vector of data values. Missing values are allowed, but
the number of non-missing values must be between 3 and 5000.
So, the input needs to be a vector of values. Now we know that we can run the 'shapiro.test()' function directly with our vectors, A and B. R uses named arguments for most of its functions, and so we tell the function what we are passing in:
shapiro.test(x = A)
and the result is put to the screen:
Shapiro-Wilk normality test
data: A
W = 0.8429, p-value = 0.0478
then we can do the same for B:
shapiro.test(x = B)
which gives us
Shapiro-Wilk normality test
data: B
W = 0.8051, p-value = 0.0167
If we want, we can test A and B together, although it's hard to know whether this is a valid test. By 'valid', I mean: imagine that you are pulling numbers out of a bag to get A and B. If the numbers in A get thrown back in the bag before we take B, we've double-counted. If the numbers in A didn't get thrown back in, testing x = c(A, B) is reasonable because all we've done is increase the size of our sample.
shapiro.test(x = c(A,B))
Do these mean that the data are normally distributed? Well, in the help we see this:
Value
...
p.value an approximate p-value for the test. This is said in Royston (1995) to be adequate for p.value < 0.1
So maybe that's good enough. But it depends on your requirements!
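Since the stated end goal is a t-test, once you are satisfied with the normality checks the comparison itself is a single call (shown here with R's default Welch correction):

t.test(A, B)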
