I wrote this function, which estimates (over 1000 iterations) the probability that the value 1 does not appear in a random sample of a given size.
bday.function <- function(sample.size){
x <- vector()
for (i in 1:1000){
x[i] <- !any(data.frame(table(sample(1:365, sample.size, replace=TRUE)))$Var1 == 1)
}
return(mean(x))
}
Now I want to use this function and another for-loop to calculate the probability for every sample size between 500 and 1500 and make a simple scatter plot of my results. Here is what I tried:
z <- vector()
for (i in 500:1500) {
z[i] <- bday.function(i)
return(plot(z))
}
Edit: when I run bday.function, the output is the number of TRUE values divided by the total number (1000) of TRUE/FALSE outcomes:
bday.function(750)
[1] 0.122
I would like to replicate this for sample sizes between 500 and 1500 to generate a simple scatter plot
Edit 2: Thanks to everybody for the help! Here's my final solution:
x <- vector(length = 1000)
for (i in 1:1000){
x[i] <- !any(sample(1:365, 500, replace=TRUE) == 1)
}
x
bday.function <- function(sample.size){
x <- vector(length= 1000)
for (i in 1:1000){
x[i] <- !any(sample(1:365, sample.size, replace=TRUE) == 1)
}
return(mean(x))
}
bday.function(750)
tmp.index <- 500:1500
z <- numeric(length(tmp.index))  # one slot per sample size (1001 values)
for (i in seq_along(tmp.index)) {
z[i] <- bday.function(tmp.index[i])
}
#Plot
plot(tmp.index, z, xlab = "sample size", ylab = "Probability of no birthdays")
As @JohnColeman pointed out in his comment, your function can be slow. Try these changes to your code to produce the plot. I ran only 60 simulations for the output shown, as I had other things to finish:
#Function
bday.function <- function(sample.size){
x <- vector()
for (i in 1:1000){
x[i] <- !any(data.frame(table(sample(1:365, sample.size, replace=TRUE)))$Var1 == 1)
}
return(mean(x))
}
#Loop
z <- vector()
vec <- 500:1500
for (i in seq_along(vec)) {
z[i] <- bday.function(vec[i])
}
#Plot
plot(z)
Output:
Are you looking for something like this?
bday.function <- function(sample.size) {
mean(sapply(seq(1000), function(x)
+!any(sample(365, sample.size, replace = TRUE) == 1)))
}
x <- 500:1500
y <- sapply(x, bday.function)
plot(x, y, xlab = "sample size", ylab = "Probability of no birthdays")
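As a sanity check on the simulated values, the quantity being estimated has a closed form: each draw misses the value 1 with probability 364/365, and the draws are independent, so a sample of size n contains no 1 with probability (364/365)^n. A minimal sketch, where the name p_no_one is made up for illustration:

```r
# Exact probability that the value 1 never appears in n draws from 1:365
# (p_no_one is a hypothetical helper name, not part of the original code)
p_no_one <- function(n) (364 / 365)^n

sizes <- 500:1500
exact <- p_no_one(sizes)

# Compare one simulated estimate against the exact value
set.seed(1)
sim_750 <- mean(replicate(1000, !any(sample(365, 750, replace = TRUE) == 1)))

# The exact curve can be drawn directly, or overlaid on the simulated points
plot(sizes, exact, type = "l",
     xlab = "sample size", ylab = "Probability of no birthdays")
```

For a sample size of 750 the exact value is roughly 0.128, which is in line with the simulated 0.122 reported in the question.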
I am looking to sample repeatedly from a distribution with a specific condition.
I am sampling 50 values in each of four iterations and saving the results. However, I need each individual result from an iteration to be smaller than the previous iteration's result at the same position.
mu.c <- c(7,6,5,3) # Means of control chains
chains.sim <- function(vector, N) {
all.list <- list()
for (i in 1:length(vector)) {
Y <- MASS::rnegbin(n = N, mu = vector[i], theta = 4)
name <- paste('position:',i, sep = '')
all.list[[name]] <- Y
}
all.list
}
chains.sim(mu.c, 50)
The sampling part works fine, but the Y individual results are of course not always smaller than the results from the previous iteration ("position").
Is there a way to repeat the sampling process until the result is smaller?
I would really appreciate your help!
I would add a while loop inside your for loop which samples data sets until the condition is met.
mu.c <- c(7,6,5,3) # Means of control chains
chain.sim <- function(vector, N) {
all.list <- list()
all.list[[1]] <- MASS::rnegbin(n = N, mu = vector[1], theta = 4)
for (i in 2:length(vector)) {
is_smaller <- FALSE
while(!is_smaller){
Y <- MASS::rnegbin(n = N, mu = vector[i], theta = 4)
if (all(all.list[[i-1]] >= Y)) is_smaller <- TRUE
}
all.list[[i]] <- Y
}
all.list
}
chain.sim(mu.c, 3)
Note that I changed the condition to >=, because if 0 is generated in any round, it will never find smaller values. Also, with 50 elements this code will never stop, because it is really unlikely to get two samples where each value is smaller, let alone 4 different samples.
Edit:
It can be made much faster by sampling each value individually, as you pointed out:
chain.sim <- function(vector, N) {
all.list <- list()
all.list[[1]] <- MASS::rnegbin(n = N, mu = vector[1], theta = 4)
for (i in 2:length(vector)) {
Y <- numeric(N)
for (j in 1:N){
previous_value <- all.list[[i-1]][j]
if (previous_value == 0){
Y[j] = 0
next
}
is_smaller <- FALSE
while(!is_smaller){
val <- MASS::rnegbin(1, mu = vector[i], theta = 4)
if (val <= previous_value) is_smaller <- TRUE
Y[j] <- val
}
}
all.list[[i]] <- Y
}
all.list
}
chain.sim(mu.c, 50)
If 0 is encountered anywhere, no more simulation is necessary for that position, as we know the next value can only be 0. This makes the simulation much faster.
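The per-element idea can also be written with a small helper and checked afterwards. This is only a sketch under assumptions: draw_leq is a made-up name, and stats::rnbinom(size = theta, mu = mu) is used as a stand-in for MASS::rnegbin(mu = mu, theta = theta), on the assumption that the two parametrize the same negative binomial distribution.

```r
# Hypothetical helper: redraw one value until it does not exceed `prev`.
# Assumption: rnbinom(size = theta, mu = mu) stands in for MASS::rnegbin.
draw_leq <- function(prev, mu, theta = 4) {
  if (prev == 0) return(0)          # nothing below 0 is possible
  repeat {
    val <- rnbinom(1, size = theta, mu = mu)
    if (val <= prev) return(val)
  }
}

set.seed(1)
mu.c <- c(7, 6, 5, 3)
chains <- vector("list", length(mu.c))
chains[[1]] <- rnbinom(50, size = 4, mu = mu.c[1])
for (i in 2:length(mu.c)) {
  # each position is constrained by the same position in the previous chain
  chains[[i]] <- vapply(chains[[i - 1]], draw_leq, numeric(1), mu = mu.c[i])
}

# Verify the condition: every position is non-increasing across iterations
m <- do.call(cbind, chains)
non_increasing <- all(apply(m, 1, function(r) all(diff(r) <= 0)))
```

Checking non_increasing after the run is a cheap way to confirm that the constraint actually holds for every position.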
I want to create a custom bootstrap function because I want to better understand what the bootstrap is doing, and the existing bootstrap libraries do not seem to solve my issue.
The Problem: I would like to create my own Wald confidence interval function that takes in the bootstrap data, outputs the confidence interval, tests whether the true p lies within the interval, and computes the coverage.
Right now, I am getting this type of error:
Error in bootresults[i, ] <- waldCI(y = bootdata[i], n = numTrials) :
number of items to replace is not a multiple of replacement length
The goal: I want the bootresults dataset to return 4 columns (the p value, the lower bound, the upper bound, and whether or not p is in the interval) and to get a graph similar to this one:
(image: Wald interval chart)
Code:
set.seed(42)
samples10 <- list()
i <- 1
while(i < 100) {
samples10[[i]] <- rbinom(1500, size=10, prob=i*.01) ## 1500 draws, each out of 10 trials
i <- i + 1
}
sample10 <- data.frame(samples10)
colnames(sample10) <- c(seq(.01, .99, .01)) ## p-values
waldCI <- function(y, n, alpha=0.05) {
p <- colSums(y)/(n*200)
sd <- sqrt(p*((1 - p)/(n*200)))
z <- qnorm(c(alpha/2, 1 - alpha/2))
ci <- p + z*sd
return(ci)
}
B <- 200
numTrials <- 10
bootresults <- matrix(ncol=length(sample10), nrow=B) ## rows=200, cols=99
## empty matrix in the beginning
set.seed(42)
for(i in seq_len(B)) {
bootdata <- sample10[sample(B, replace=T), ]
bootresults[i, ] <- waldCI(y=bootdata[i], n=numTrials)
## Pseudocode:
# boot_test_data$in_interval <-
# ifelse(boot_test_data$lower1 < i/100 & i/100 < boot_test_data$upper1, 1, 0)
# coverage[i] <- sum(boot_test_data$in_interval) / length(boot_test_data$in_interval)
}
Any help is greatly appreciated since I am fairly new to R.
It looks like you want to initialize a three-dimensional array bootresults rather than a two-dimensional matrix. In your waldCI() you may use colMeans.
waldCI <- function(y, alpha=0.05) {
p <- colMeans(y)
se <- sqrt(p*(1 - p)/nrow(y))
z <- qnorm(1 - alpha/2)
ci <- p + z*se %*% cbind(lower=-1, upper=1)
return(ci)
}
B <- 200
numTrials <- 10
## initialize array
bootresults1 <- array(dim=c(ncol(samples10), 4, B),
dimnames=list(c(), c("p.values", "lower", "upper", "in.int"), c()))
set.seed(42)
for(i in seq_len(B)) {
samp <- samples10[sample(nrow(samples10), numTrials, replace=F), ]
ci <- waldCI(samp)
bootresults1[,,i] <- cbind(p.values, ci, in.int=ci[, 1] < p.values & p.values < ci[, 2])
}
coverage <- rowMeans(bootresults1[,4,])
plot(p.values, coverage, type="l", main="My Plot")
Similar approach, more R-ish, though:
p.values <- seq(.01, .99, .01)
set.seed(42)
samples10 <- `colnames<-`(sapply(p.values, function(pr) rbinom(1.5e3, 1, pr)), p.values)
BOOT <- function(numTrials, ...) {
samp <- samples10[sample(nrow(samples10), numTrials, replace=F), ]
ci <- waldCI(samp, ...)
cbind(p.values, ci, in.int=ci[, 1] < p.values & p.values < ci[, 2])
}
B <- 200
numTrials <- 10
set.seed(42)
bootresults2 <- replicate(B, BOOT(numTrials=10))
stopifnot(all.equal(bootresults1, bootresults2))
Data:
Note that I used rbinom(..., size=1, ...) to create your sample data. The use of "p" as an object name suggested that the data should be binomial.
set.seed(42)
samples10 <- matrix(nrow=1500, ncol=99, dimnames=list(c(), c(seq(.01, .99, .01))))
i <- 1
while (i < 100) {
samples10[, i] <- rbinom(1500, size=1, prob=i*.01) ## rows=1500; columns=99
i <- i + 1
}
Without a while loop, you could proceed vectorized:
p.values <- seq(.01, .99, .01)
set.seed(42)
samples10 <- `colnames<-`(sapply(p.values, function(pr) rbinom(1.5e3, 1, pr)), p.values)
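For intuition about what waldCI() computes per column, here is the same Wald interval, p ± z·sqrt(p(1 − p)/n), for a single 0/1 vector. This is a minimal sketch; wald_ci_one and the toy data are made up for illustration.

```r
# Wald interval for one vector of 0/1 outcomes
# (wald_ci_one is a hypothetical name used only in this sketch)
wald_ci_one <- function(x, alpha = 0.05) {
  p  <- mean(x)                       # sample proportion
  se <- sqrt(p * (1 - p) / length(x)) # standard error of the proportion
  z  <- qnorm(1 - alpha / 2)          # two-sided normal quantile
  c(lower = p - z * se, upper = p + z * se)
}

# Toy data: 30 successes out of 100
x  <- c(rep(1, 30), rep(0, 70))
ci <- wald_ci_one(x)

# Coverage check for a known true proportion
true_p  <- 0.3
covered <- ci[["lower"]] < true_p && true_p < ci[["upper"]]
```

With 30 successes out of 100 the interval runs from roughly 0.21 to 0.39. Repeating the coverage check over many resamples and averaging the results is what the coverage column in the answer above does for each of the 99 probabilities.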
What is the minimum sample size n (or the length n = length(x) of the data vector x) such that the difference D = 1 - statx4(x)/statx5(x) of the functions statx4 and statx5 is no more than 1/100 i.e. D ≤ 1/100?
And here are the functions:
statx4 <- function(x) {
numerator <- sum((x-mean(x))^2)
denominator <- length(x)
result <- numerator/denominator
return(result)
}
statx5 <- function(x) {
numerator <- sum((x-mean(x))^2)
denominator <- length(x)-1
result <- numerator/denominator
return(result)
}
I've been doing this exercise set for a while, but haven't managed to get anything valid on this question. Could you point me in the right direction?
For the normal distribution, it is the following:
statx4 <- function(x) {
numerator <- sum((x-mean(x))^2)
denominator <- length(x)
result <- numerator/denominator
return(result)
}
statx5 <- function(x) {
numerator <- sum((x-mean(x))^2)
denominator <- length(x)-1
result <- numerator/denominator
return(result)
}
D <- function(x){
1-statx4(x)/statx5(x)
}
DD <- function(N=1111,seed =1){
set.seed(seed)
Logi <- vector()
for (n in 1:N) {
x<- rnorm(n)
y <- D(x)
Logi[n] <- (y > 1/100)
}
return(Logi)
}
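counts <- vector()  # renamed from `min` to avoid masking base::min
for (seed in 1:100) {
message(seed)
# count the sample sizes for which D > 1/100, using the current seed;
# n = 1 yields NA (0/0 in statx5), so it is dropped
counts[seed] <- sum(DD(1000, seed), na.rm = TRUE)
}
Answer <- mean(counts)+1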
Answer
Note that the function D evaluates the relative difference between the biased variance (statx4, dividing by n) and the unbiased variance (statx5, dividing by n - 1).
I think this problem should be stated more clearly in a mathematical sense.
I got the solutions today, and all you had to do was try random values:
a <- rnorm(99); 1-statx4(a)/statx5(a)
a <- rnorm(100); 1-statx4(a)/statx5(a)
a <- rnorm(101); 1-statx4(a)/statx5(a)
And the correct answer is 100.
Thank you all for the help.
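The reason the answer does not depend on the data: both functions share the same numerator, so statx4(x)/statx5(x) = (n - 1)/n and therefore D = 1 - (n - 1)/n = 1/n. The condition D ≤ 1/100 then reads 1/n ≤ 1/100, which holds exactly when n ≥ 100. A short numerical check of this identity (the compact one-line definitions are a restatement of the functions above, and d_vals and matches_one_over_n are names invented for this sketch):

```r
# Same statistics as in the question, written compactly
statx4 <- function(x) sum((x - mean(x))^2) / length(x)
statx5 <- function(x) sum((x - mean(x))^2) / (length(x) - 1)
D <- function(x) 1 - statx4(x) / statx5(x)

set.seed(1)
ns <- c(99, 100, 101)

# D evaluated on random normal data of each length
d_vals <- sapply(ns, function(n) D(rnorm(n)))

# Up to floating-point error, D equals 1/n regardless of the data drawn
matches_one_over_n <- isTRUE(all.equal(d_vals, 1 / ns))
```

Because the comparison at n = 100 sits exactly on the boundary, a floating-point evaluation of D can land a hair above or below 1/100, which is why the algebraic argument is more reliable than simulation here.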
I have the following code to analyze data sets:
library("Matrix")
Data <-list(c(2,3),c(3,2),c(2,2))
TheSizes=c(3,4)
n=2
dd=2
StdGrid <- function(Data,TheSizes)
{
SGrid <- list(
Values = Data,
Sizes = TheSizes
)
class(SGrid) <- append(class(SGrid), c("StdGrid","Moment"))
return(SGrid)
}
theObject=StdGrid
MHistogramC <- function(theObject,n,dd)
{
sizes <- theObject$Sizes
l <- length(sizes)
data <- theObject$Values
Xarray <- matrix(rep(0,l*n),ncol=n)
N <- matrix(rep(0,l*n),ncol=n)
Histo <- matrix(rep(0,l*n),ncol=n)
GrandX <- lapply(data,function(x) log(x))
minX <- rep(0,l)
maxX <- rep(0,l)
DeltaX <- rep(0,l)
for(i in 1:l){
minX[i] <- min(GrandX[[i]])
maxX[i] <- max(GrandX[[i]])
DeltaX[i] <- maxX[i]/n-minX[i]/n
}
nzero <- numeric()
for(j in 1:n){
for(i in 1:l){
Xarray[i,j] <- minX[i]+(j-1/2)*DeltaX[i]
N[i,j] <- length(which((GrandX[[i]] >= minX[i]+(j-1/2)*DeltaX[i]-DeltaX[i]) & (GrandX[[i]] <= minX[i]+(j-1/2)*DeltaX[i]+DeltaX[i])))
Histo[i,j] <- log(N[i,j])
}
if(min(Histo[,j]) > - 10000){
nzero <- c(nzero,j)
}
}
alpha <- rep(0,lnzero)
falpha <- rep(0,lnzero)
for(j in 1:length(nzero)){
fit <- lm(Xarray[,nzero[j]] ~ log(sizes/dd))
alpha[j] <- fit$coefficients[[2]]
fit2 <- lm(Histo[,nzero[j]] ~ log(sizes/dd))
falpha[j] <- -fit2$coefficients[[2]]
}
Result <- data.frame(alpha=alpha,falpha=falpha)
return(Result)
}
MHistogramU <- function(theObject,n,dd)
{
sizes <- theObject$Sizes
l <- length(sizes)
data <- theObject$Values
Xarray <- matrix(rep(0,l*n),ncol=n)
N <- matrix(rep(0,l*n),ncol=n)
Histo <- matrix(rep(0,l*n),ncol=n)
GrandX <- lapply(data,function(x) log(x))
minX <- rep(0,l)
maxX <- rep(0,l)
DeltaX <- rep(0,l)
for(i in 1:l){
minX[i] <- min(GrandX[[i]])
maxX[i] <- max(GrandX[[i]])
DeltaX[i] <- maxX[i]/n-minX[i]/n
}
nzero <- numeric()
for(j in 1:n){
for(i in 1:l){
Xarray[i,j] <- minX[i]+(j-1/2)*DeltaX[i]
N[i,j] <- length(which((GrandX[[i]] >= minX[i]+(j-1/2)*DeltaX[i]-sqrt(DeltaX[i])) & (GrandX[[i]] <= minX[i]+(j-1/2)*DeltaX[i]+sqrt(DeltaX[i]))))
Histo[i,j] <- log(N[i,j])
}
if(min(Histo[,j]) > - 10000){
nzero <- c(nzero,j)
}
}
alpha <- rep(0,lnzero)
falpha <- rep(0,lnzero)
for(j in 1:length(nzero)){
fit <- lm(Xarray[,nzero[j]] ~ log(sizes/dd))
alpha[j] <- fit$coefficients[[2]]
fit2 <- lm(Histo[,nzero[j]] ~ log(sizes/dd))
falpha[j] <- -fit2$coefficients[[2]]
}
Result <- data.frame(alpha=alpha,falpha=falpha)
return(Result)
}
This runs without errors, but I don't get anything in return. If I try to print "Result", the console says that the object "Result" was not found.
The inputs are:
Data : is a list of vector/grids
TheSizes : is a vector
theObject : the data defined as the class 'StdGrid' (defined above);
n : the number of values of alpha to be calculated;
dd : the dimension of the physical support of the measure.
What can I do to see the data frame that the code is supposed to return?
That is because your code only defines functions; it never calls them. A function returns a value only when it is called; it won't call itself.
Looking at your code, it's hard to tell what you are trying to calculate, but assuming everything else is correct and all functions are coded properly, add the following lines at the end of your code to compute the results:
var_MHistogramU <- MHistogramU(theObject,n,dd)
var_MHistogramC <- MHistogramC(theObject,n,dd)
To view the result, simply print the variables.
print(var_MHistogramU)
print(var_MHistogramC)
Remember: printing Result won't work, because Result is a local variable inside the function and is not accessible from the global environment.
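A stripped-down illustration of the same point, with a made-up function name (make_result) and toy values rather than the histogram code above:

```r
# Defining a function does not run it; Result exists only inside the call
make_result <- function() {
  Result <- data.frame(alpha = c(0.1, 0.2), falpha = c(1, 2))
  return(Result)
}

# Calling the function and storing its return value makes the data available
out <- make_result()
print(out)

# The local name is still unknown in the calling environment
result_visible <- exists("Result", inherits = FALSE)
```

Here out holds the data frame, while result_visible is FALSE as long as no object called Result has been created outside the function.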
I would like to make an acf plot in R for only the negative values of a time series. I cannot simply subset the data to the negative values beforehand, because that would remove an arbitrary number of positive days between the negative values and make the autocorrelation unreasonably high. Instead, I would like to run the autocorrelation on the whole time series and then keep only the results where the first day is negative.
For example, in theory, I could build a data frame containing the original series and all of the lagged series, filter for the negative values in the original series, and then plot the correlations. However, I would like to automate this using the acf plot.
Here is an example of my time series:
> dput(exampleSeries)
c(0, 0, -0.000687, -0.004489, -0.005688, 0.000801, 0.005601,
0.004546, 0.003451, -0.000836, -0.002796, 0.005581, -0.003247,
-0.002416, 0.00122, 0.005337, -0.000195, -0.004255, -0.003097,
0.000751, -0.002037, 0.00837, -0.003965, -0.001786, 0.008497,
0.000693, 0.000824, 0.005681, 0.002274, 0.000773, 0.001141, 0.000652,
0.001559, -0.006201, 0.000479, -0.002041, 0.002757, -0.000736,
-2.1e-05, 0.000904, -0.000319, -0.000227, -0.006589, 0.000998,
0.00171, 0.000271, -0.004121, -0.002788, -9e-04, 0.001639, 0.004245,
-0.00267, -0.004738, 0.001192, 0.002175, 0.004666, 0.006005,
0.001218, -0.003188, -0.004363, 0.000462, -0.002241, -0.004806,
0.000463, 0.000795, -0.005715, 0.004635, -0.004286, -0.008908,
-0.001044, -0.000842, -0.00445, -0.006094, -0.001846, 0.005013,
-0.006599, 0.001914, 0.00221, 6.2e-05, -0.001391, 0.004369, -0.005739,
-0.003467, -0.002103, -0.000882, 0.001483, 0.003074, 0.00165,
-0.00035, -0.000573, -0.00316, -0.00102, -0.00144, 0.003421,
0.005436, 0.001994, 0.00619, 0.005319, 7.3e-05, 0.004513)
I tried to implement your description.
correl <- function(x, lag.max = 10){
library(dplyr)
m <- matrix(ncol = lag.max, nrow = length(x))
for(i in 1:lag.max){
m[,i] <- lag(x, i)
}
m <- m[x<0,]
res <- apply(m, 2, function(y) cor(y, x[x<0], use = "complete.obs"))
barplot(res)
}
correl(exampleSeries)
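The same idea can be sketched in base R without dplyr. This assumes, as in the function above, that the condition applies to the current day x[t] and that the correlation is taken against x[t - k]; cond_acf is a made-up name and the simulated series is only a placeholder for the real data.

```r
# Lag-k correlation between x[t] and x[t - k], using only the days t
# where x[t] is negative (or positive when negative = FALSE).
# cond_acf is a hypothetical helper, not an existing R function.
cond_acf <- function(x, lag.max = 10, negative = TRUE) {
  n <- length(x)
  keep <- if (negative) x < 0 else x > 0
  sapply(seq_len(lag.max), function(k) {
    cur  <- x[(k + 1):n]      # x[t]
    prev <- x[1:(n - k)]      # x[t - k]
    sel  <- keep[(k + 1):n]   # condition on the current day
    cor(cur[sel], prev[sel])
  })
}

# Placeholder series standing in for exampleSeries
set.seed(1)
demo_series <- rnorm(100, sd = 0.004)
res <- cond_acf(demo_series, lag.max = 10)
barplot(res, names.arg = 1:10, xlab = "lag", ylab = "correlation")
```

Because the full series is lagged before filtering, the positive days between negative values are kept in their correct positions rather than being squeezed out.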
Maybe just write your own function? Something like:
negativeACF <- function(x, num.lags = 10)
{
n <- length(x)
acfs <- sapply(0:num.lags, function(i) cor(tail(x, n - i), head(x, n - i)))  # lag 0 pairs the series with itself
names(acfs) <- 0:num.lags
acfs[acfs < 0]
}
results <- negativeACF(exampleSeries, num.lags=20)
barplot(results)
Yeah, I ended up writing my own functions and replacing the values in the R acf object with my own correlations. So:
library(zoo)  # provides zoo() and the zoo lag method used below
genACF <- function(series, my.acf, lag.max = NULL, neg){
x <- na.fail(as.ts(series))
x.freq <- frequency(x)
x <- as.matrix(x)
if (!is.numeric(x))
stop("'x' must be numeric")
sampleT <- as.integer(nrow(x))
nser <- as.integer(ncol(x))
if (is.null(lag.max))
lag.max <- floor(10 * (log10(sampleT) - log10(nser)))
lag.max <- as.integer(min(lag.max, sampleT - 1L))
if (is.na(lag.max) || lag.max < 0)
stop("'lag.max' must be at least 0")
if(neg){
indices <- which(series < 0)
}else{
indices <- which(series > 0)
}
series <- scale(series, scale = FALSE)
series.zoo <- zoo(series)
for(i in 0:lag.max){
lag.series <- lag(series.zoo, k = -i, na.pad = TRUE)
temp.corr <- cor(series.zoo[indices], lag.series[indices], use = 'complete.obs', method = 'pearson')
my.acf[i+1] <- temp.corr
}
my.acf[1] <- 0
return(my.acf)
}
plotMyACF <- function(series, main, type = 'correlation', neg = TRUE){
series.acf <- acf(series, plot = FALSE)
my.acf <- genACF(series, series.acf$acf, neg = neg)
series.acf$acf <- my.acf
plot(series.acf, xlim = c(1, dim(series.acf$acf)[1] - (type == 'correlation')), xaxt = "n", main = main)
if (dim(series.acf$acf)[1] < 25){
axis(1, at = 1:(dim(series.acf$acf)[1] - 1))
}else{
axis(1)
}
}
And I get something like this: