Plotting the distribution of variances in R

My dataset has 2 fields:
Time stamp t: varies between 0 and 60.
Variable x: the variance in the value of a variable (say, A) from t-1 to t; varies between -100% and 100%.
There are roughly 500 records for each value of the time stamp, e.g.
500 records where t = 0 and x takes any value between -100% and 100%,
490 records where t = 1 and x takes any value between -100% and 100%, and so on.
Note that the value of x is 0 for ~80% of the records.
The aim here is to determine at what value of t (it can be one value or a range, e.g. t = 22, or t between 20 and 25) the day-on-day change in A is smallest. This effectively translates to finding the t at which x is most frequently 0 and, when it is not, is at least close to zero.
To this end, I aim to plot the distribution of x for each day. I can think of using a violin plot with x on the Y axis and t on the X axis, but having 60 values of t makes it difficult to plot them all in one chart. Can you suggest an alternative plot for the intended visual analysis?
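For reference, the violin attempt looks something like this; df, t and x below stand in for my actual data frame and column names:
library(ggplot2)
ggplot(df, aes(factor(t), x)) +
  geom_violin()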

Does it help if you take the absolute value of the variance (so it's concentrated in 0-100) and try logs, as in here? https://stats.stackexchange.com/questions/251066/boxplot-for-data-with-a-large-number-of-zero-values
When you say smallest, you mean closest to 0, right? In that case it's better to work with the absolute variance (on a 0-1 scale), as you can then treat it like zero-inflated binomial data, e.g. with the VGAM package: https://rdrr.io/cran/VGAM/man/zibinomial.html
I've had a play around, and below is an example that I think makes sense. I've only had some experience with zero-inflated models, so it would be good if anyone has feedback :)
library(ggplot2)
library(data.table)
library(VGAM)
# simulate some data
N_t <- 60 # number of t
N_o <- 500 # number of observations at t
t_smallest <- 30 # best value
# simulate some data crudely
set.seed(1)
dataL <- lapply(1:N_t, function(t){
  dist <- abs(t_smallest - t) + 10
  values <- round(rbeta(N_o, 10/dist, 300/dist), 2) * sample(c(-1, 1), N_o, replace = TRUE)
  data.table(t, values)
})
data <- rbindlist(dataL)
# raw
ggplot(data, aes(factor(t), values)) + geom_boxplot() +
  coord_cartesian(ylim = c(0, 0.1))
# log transformed - may look better with your data
ggplot(data, aes(factor(t), log(abs(values) + 1))) +
  geom_violin()
# use absolute values, package needs it as integer p & n, so approximate these
data[, abs.values := abs(values)]
data[, p := round(1000*abs.values, 0)]
data[, n := 1000]
# with a gam, so smooth fit on t. Found it to be unstable though
fit <- vgam(cbind(p, n-p) ~ s(t), zibinomialff, data = data, trace = TRUE)
# glm, with a coefficient for each t, so treats independently
fit2 <- vglm(cbind(p, n-p) ~ factor(t), zibinomialff, data = data, trace = TRUE)
# predict
output <- data.table(t=1:N_t)
output[, prediction := predict(fit, newdata=output, type="response")]
output[, prediction2 := predict(fit2, newdata=output, type="response")]
# plot out with predictions
ggplot(data, aes(factor(t), abs.values)) +
  geom_boxplot(col = "darkgrey") +
  geom_line(data = output, aes(x = t, y = prediction2)) +
  geom_line(data = output, aes(x = t, y = prediction), col = "darkorange") +
  geom_vline(xintercept = output[prediction == min(prediction), t]) +
  coord_cartesian(ylim = c(0, 0.1))
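If useful, the estimated best t can then be read straight off the predictions; a small follow-up using the objects above:
# t with the smallest predicted absolute change, for each fit
output[which.min(prediction), t]  # smooth (vgam) fit
output[which.min(prediction2), t] # per-t (vglm) fit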

Related

Simulate in R the number of samples needed in order to achieve the true standard deviation

I want to recreate in R the figure above, which simulates the number of samples needed in order to achieve the true standard deviation.
How can I do it in R?
I suppose the distribution is a t-distribution or normal.
So I have to generate numbers from these distributions, increase the sample size each time, and plot the results in order to recreate the plot shown in the figure.
Any help?
set.seed(123)
x <- list(v1 = rnorm(1, 0, 12),   v2 = rnorm(10, 0, 11),
          v3 = rnorm(20, 0, 10),  v4 = rnorm(30, 0, 9),
          v5 = rnorm(40, 0, 8),   v6 = rnorm(50, 0, 7),
          v7 = rnorm(60, 0, 6),   v8 = rnorm(70, 0, 5),
          v9 = rnorm(80, 0, 4),   v10 = rnorm(90, 0, 3),
          v11 = rnorm(100, 0, 2), v12 = rnorm(110, 0, 2))
g <- lapply(x, sd)
g
g1 <- unlist(g)
plot(g1, type = "l")
First, start with a random uniform population of suitable size, and select the sample sizes for which you want to compute the standard error of the mean.
set.seed(123)
x <- runif(1e6, 0, 1)
sample_size <- 5:120
You can define a function to compute this sigma_m. Here you draw a sample of size n from x with replacement, take its standard deviation, and divide by sqrt(n).
calc_sigma_m <- function(n, x) {
  sd(sample(x, n, replace = TRUE)) / sqrt(n)
}
A data frame can neatly store the sample sizes and sigma_m values for plotting:
df <- data.frame(sample_size,
                 sigma_m = sapply(sample_size, calc_sigma_m, x))
Your initial plot will look like this:
library(ggplot2)
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line()
As expected, this is not smooth, especially at smaller sample sizes.
If you want a smoother curve for demonstration, repeat the sampling and sigma_m calculation many times and take the mean.
calc_sigma_m_mean <- function(n, x) {
  mean(replicate(1000, sd(sample(x, n, replace = TRUE)) / sqrt(n)))
}
df <- data.frame(sample_size, sigma_m = sapply(sample_size, calc_sigma_m_mean, x))
Then you will get a smoother curve:
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line()
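As a further sanity check (my addition, not part of the question), you can overlay the analytic standard error sd(x)/sqrt(n) on the simulated curve; the two should nearly coincide:
df$sigma_theory <- sd(x) / sqrt(df$sample_size)
ggplot(df, aes(sample_size)) +
  geom_line(aes(y = sigma_m)) +
  geom_line(aes(y = sigma_theory), linetype = "dashed")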

How to generate a population of random numbers within a certain exponentially increasing range

I have 16068 datapoints with values that range between 150 and 54850 (mean = 3034.22). What would the R code be to generate a set of random numbers that grow in frequency exponentially between 54850 and 150?
I've tried using the rexp() function in R, but can't figure out how to set the range to between 150 and 54850. In my actual data population, the lambda value is 25.
set.seed(123)
myrange <- c(54850, 150)
rexp(16068, 1/25, myrange)
The call produces an error.
Error in rexp(16068, 1/25, myrange) : unused argument (myrange)
The hypothesized population should increase exponentially the closer the data values are to 150. I have 25 data points with a value of 150 and only one with a value of 54850. The simulated population should fall in this range.
This is really more of a question for math.stackexchange, but out of curiosity I provide this solution. Maybe it is sufficient for your needs.
First, ?rexp tells us that it has only two arguments (n and rate) and no range, so we generate a random exponential sample of the desired length.
set.seed(42) # for sake of reproducibility
n <- 16068
mr <- c(54850, 150) # your 'myrange' with less typing
y0 <- rexp(n, 1/25) # simulate exp. dist.
y <- y0[order(-y0)] # sort
Now we need a mathematical approach to rescale the distribution.
# f(x) = (b-a)(x - min(x))/(max(x)-min(x)) + a
y.scaled <- (mr[1] - mr[2]) * (y - min(y)) / (max(y) - min(y)) + mr[2]
Proof:
> range(y.scaled)
[1] 150.312 54850.312
That's not too bad.
Plot:
plot(y.scaled, type="l")
Note: there might be some mathematical issues with this rescaling; see e.g. this answer.
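If the rescaling bothers you, a rough alternative (my sketch, not the approach above) is to sample directly from a truncated exponential on [150, 54850] via the inverse CDF; the rate below is an assumption, chosen so the mean excess over 150 roughly matches the reported mean of ~3034:
set.seed(42)
n <- 16068
a <- 150; b <- 54850
rate <- 1 / 2884                                    # assumed: 3034 - 150 is about 2884
u <- runif(n)
x.trunc <- a - log(1 - u * (1 - exp(-rate * (b - a)))) / rate
range(x.trunc)                                      # guaranteed to stay within [150, 54850]
hist(x.trunc, breaks = 100)                         # most mass near 150, very little near 54850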

Partial Cross-correlation in R

I think the title is fairly self-explanatory. I want to compute the cross-correlation between two time series controlled for the values at other lags. I can't find any existing R code to do this, and I'm not at all confident enough in my knowledge of statistics (or R) to try to write something myself. It would be analogous to the partial autocorrelation function, just for the cross-correlation instead of the autocorrelation.
If it helps at all, my larger objective is to look for lagged correlations between different measurements of a physical system (to start with, flux and photon index from gamma ray measurements of blazars), with the goal of building a general linear model to try to predict flaring events.
Look at my answer to my own question (same as the one you posted).
You can make use of the pacf function in R, extending it to a matrix with 2 or more time series. I have checked results between the multivariate acf and ccf functions and they yield the same results, so the same can be concluded about the multivariate pacf and the non-existing pccf.
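A minimal sketch of that idea, assuming x and y are two aligned numeric series of equal length:
xy <- cbind(x, y)
pacf(xy, lag.max = 10)  # off-diagonal panels show the partial cross terms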
I believe the function below works:
library(dplyr)
library(tibble)
library(ggplot2)
pccf <- function(x, y, nlags = 7, partial = TRUE){
  # x (numeric): variable that leads y
  # y (numeric): variable of interest
  # nlags (integer): number of lags (including zero)
  # partial (boolean): partial or absolute correlation
  # trim y
  y <- y[-(1:(nlags - 1))]
  # lagged matrix of x
  x_lagged <- embed(x, nlags)
  # process for each lag
  rho <- lag <- NULL
  for(i in 1:nlags){
    if(partial){
      # residuals of x at the lag of interest regressed on all other lags of x
      ex <- lm(x_lagged[, i] ~ x_lagged[, -i])$residuals
      # residuals of y regressed on all lags of x but the one of interest
      ey <- lm(y ~ x_lagged[, -i])$residuals
    } else {
      ex <- x_lagged[, i]
      ey <- y
    }
    # calculate correlation
    rho[i] <- cor(ex, ey, use = "pairwise.complete.obs")
    lag[i] <- i - 1
  }
  return(
    tibble(lag = lag, rho = rho) %>%
      arrange(lag)
  )
}
# test
n <- 200    # count
nlag <- 6   # number of lags
x <- as.numeric(arima.sim(n = n, list(ar = c(phi = 0.9)), sd = 1)) # simulate time series x
y <- lag(x, nlag) + rnorm(n, 0, 0.5) # simulate y to lag x
y <- y[(nlag + 1):n]                 # remove NAs from lag
x <- x[(nlag + 1):n]                 # align with y
pccf(x, y, nlags = 10, partial = FALSE) %>%
  mutate(type = 'Cross correlation') %>%
  bind_rows(
    pccf(x, y, nlags = 10, partial = TRUE) %>%
      mutate(type = 'Partial cross correlation')
  ) %>%
  ggplot() +
  geom_col(aes(-lag, rho), width = 0.1) +
  facet_wrap(~type, scales = 'free_y', ncol = 1) +
  scale_x_continuous(breaks = -10:0) +
  theme_bw(base_size = 20)

How to simulate daily stock returns in R

I need to simulate a stock's daily returns. I am given r = (P(t+1) - P(t))/P(t), normally distributed with mean µ = 1% and sd σ = 5%. P(t) is the stock price at the end of day t. Simulate 100,000 instances of such daily returns.
Since I am a new R user, how do I set up t for this example? I am assuming P should be set up as:
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
You are getting it wrong: from what you wrote, the mean and the sd apply to the return and not to the price. I furthermore assume that the mean is set on an annual basis (a 1% rate of return from one day to the next is just... huge!) and that t moves along a range of 252 trading days per year.
With these hypotheses, you can get a series of daily returns in R with:
r = rnorm(100000, .01/252, .005)
Assuming the model you mentioned, you can get the series of prices P (containing 100,001 elements; I will take P[1] = 100, change it to your own value if needed):
factor = 1 + r
temp = 100
P = c(100, sapply(1:100000, function(u){
  p = factor[u] * temp
  temp <<- p
  p
}))
Your configuration for the return you mention (mean = 0.01 and sd = 0.05) will however lead to exploding stock prices (an unrealistic model and parameters). Be careful to check that prod(factor) will not return Inf.
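A quick way to check this (using the factor vector defined above):
is.finite(prod(factor))  # FALSE would mean the cumulative product overflowed to Inf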
Here is the result for the first 1000 values of P, representing 4 years:
plot(1:1000, P[1:1000])
One classical model (which does not mean this model is realistic) assumes the observed log returns follow a normal distribution.
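For completeness, a minimal sketch of that log-return model; the daily mean and sd below are illustrative assumptions, not values from the question:
set.seed(1)
mu <- 0.01 / 252                    # assumed daily mean log return
sigma <- 0.01                       # assumed daily sd of log returns
logret <- rnorm(100000, mu, sigma)
P_log <- 100 * exp(cumsum(logret))  # price path implied by normal log returns
plot(P_log[1:1000], type = "l")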
Hope this helps.
I see you already have an answer, and ColonelBeauvel might have more domain knowledge than I (assuming this is business or finance homework). I approached it a bit differently and am going to post a commented transcript. His method uses the <<- operator, which is considered a somewhat suspect strategy in R, although I must admit it seems quite elegant in this application. I suspect my method will probably be a lot faster if you ever get into doing large-scale simulations.
Starting with your code:
P <- rnorm(100000, .01, .05)
# r=(P(t+1)-P(t))/P(t) definition, not R code
# inference: P_t+1 = r_t*P_t + P_t = P_t*(1+r_t)
# So, all future P's will be determined by P_1 and r_t
Since P_2 will be P_1*(1+r_1), P_3 will be P_1*(1+r_1)*(1+r_2), i.e. a continued product of the vector (1+r), for which there is a vectorized function.
P <- P_1*cumprod(1+r)
#Error: object 'P_1' not found
P_1 <- 100
P <- P_1*cumprod(1+r)
#Error: object 'r' not found
# So the random simulation should have been for `r`, not P
r <- rnorm(100000, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
#Error in plot.window(...) : infinite axis extents [GEPretty(-inf,inf,5)]
str(P)
This occurred because the cumulative product went above the limits of numerical capacity and got assigned to Inf (infinity). Let's be a little more careful:
r <- rnorm(300, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
The strategy below iteratively updates the price at time t as 'temp' and multiplies it by a single value. It's likely to be a lot slower.
r = rnorm(100000, .01/252, .005)
factor = 1 + r
temp = 100
P = c(100, sapply(1:300, function(u){
  p = factor[u] * temp
  temp <<- p
  p
}))
> system.time( {r <- rnorm(10000, .01/250, .05)
+ P <- P_1*cumprod(1+r)
+ })
user system elapsed
0.001 0.000 0.002
> system.time({r = rnorm(10000, .01/252, .05)
+ factor = 1 + r
+ temp = 100
+ P = c(100, sapply(1:300, function(u){
+ p = factor[u]*temp
+ temp<<-p
+ p
+ }))})
user system elapsed
0.079 0.004 0.101
To simulate the daily log return of a stock, use the following method:
Consider working with 256 days of daily stock price data.
Load the original data into R.
Create another data.frame for the simulated log return.
Code:
logr <- data.frame(Date = gati$Date[1:255],
                   Shareprice = gati$Adj.Close[1:255],
                   LogReturn = log(gati$Adj.Close[1:255] / gati$Adj.Close[2:256]))
gati is the dataset; Date and Adj.Close are the variables. Notice the index ranges in the brackets: each log return compares consecutive rows.
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
The second line translates directly into:
r <- (P[-1] - P[-length(P)]) / P[-length(P)] # (1:5)[-1] gives 2:5
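A quick way to convince yourself the indexing is right (my addition, same P and r as above):
all.equal(r, diff(P) / head(P, -1))  # should return TRUE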
Stock returns are not normally distributed for simple returns ("R"), given their -1 lower bound per compounded period. However, log returns ("r") generally are. The below is adapted from #42's post above. There don't seem to be any #Rstats solutions for simulating from a log mean ("Expected Return") and log stdev ("Risk"), so I've included one here for those looking for "Monte Carlo Simulation using Log Expected Return and Log Standard Deviation"; these are normally distributed and have no lower bound at -1. Note: from this single example, it would take looping over thousands of runs to simulate a portfolio, i.e. stacking 100k plots like the one below and averaging a single slice to calculate the portfolio's average expected return at a chosen forward month. The below should give a good basis for doing so.
startPrice = 100
forwardPeriods = 12*10 # 10 years * 12 months with Month-over-Month E[r]
factor = exp(rnorm(forwardPeriods, .04, .10)) # Monthly Expected Ln Return = .04 and Expected Monthly Risk = .1
temp = startPrice
P = c(startPrice, sapply(1:forwardPeriods, function(u){
  p = factor[u] * temp
  temp <<- p
  p
}))
plot(P, type = "b", xlab = "Forward End of Month Prices", ylab = "Expected Price from Log E[r]", ylim = c(0,max(P)))
n <- length(P)
logRet <- log(P[-1]/P[-n])
# Notice, with many samples this nearly matches our initial log E[r] and stdev(r)
mean(logRet)
# [1] 0.04540838
sqrt(var(logRet))
# [1] 0.1055676
If tested with a negative log expected return, the price should not fall below zero. The other examples will return negative prices with negative expected returns. The code I've shared here can be tested to confirm that negative prices do not occur in the simulation.
min(P)
# [1] 100
max(P)
# [1] 23252.67
Horizontal axis is number of days, and vertical axis is price.
n_prices <- 1000
volatility <- 0.2
amplitude <- 10
chng <- amplitude * rnorm(n_prices, 0, volatility)
prices <- cumsum(chng)
plot(prices, type='l')

R -- Simulate sigmoidally correlated covariates

I am attempting to simulate two variables, weight and age, for a population of children. These data should be sigmoidally correlated such that at low ages weight changes slowly, then by approximately 30 weeks postmenstrual age there is an acceleration in weight gain, which begins to level off past about 50 weeks.
I have been able to use the code below to get a linear correlation between weight and age to work fairly well. The part I am having trouble with is adapting this code to get a more sigmoidal shape to the data. Any suggestions would be greatly appreciated.
# Load required packages
library(MASS)
library(ggplot2)
# Set the number of simulated data points
n <- 100
# Set the mean and standard deviations for
# the two variables
mean_age <- 50
sd_age <- 20
mean_wt <- 10
sd_wt <- 4
# Set the desired level of correlation
# between the two variables
cor_agewt <- 0.9
# Build the covariance matrix
covmat <- matrix(c(sd_age^2, cor_agewt * sd_age * sd_wt,
                   cor_agewt * sd_age * sd_wt, sd_wt^2),
                 nrow = 2, ncol = 2, byrow = TRUE)
# Simulate the correlated results
res <- mvrnorm(n, c(mean_age, mean_wt), covmat)
# Reorganize the simulated data into a data frame
df <- data.frame(age = res[,1],
                 wt = res[,2])
# Plot the results and fit a loess spline
# to the data
ggplot(df, aes(x = age, y = wt)) +
  geom_point() +
  stat_smooth(method = 'loess')
Current output:
Ideal output (albeit over a smaller range of ages and weights):
One approach is to specify the functional form between weight and age more explicitly than with just a single correlation. After specifying the functional form of weight ~ age + e, you just draw (age, e) and then calculate the weight. A simple example follows:
set.seed(1234)
mean_age <- 50; sd_age <- 20
mean_wt <- 3.5; sd_wt <- 2.2
n<-400
age.seq<-rnorm(n,mean_age,sd_age)
age.seq<-age.seq[order(age.seq)]
#functional form: (here a "logistic" with a location and scale)
f<-function(x,loc,sca) 1/(1+exp(-(x-loc)/sca))
wt<-f(age.seq,65,20) #wt
m<-mean_wt/mean(wt) #simple adjustment of the mean
sdfit<-sqrt( sd_wt^2-var(m*wt) )
sim_wt<-m*wt+rnorm(n,0,sdfit) #simulated wt
plot(age.seq,sim_wt)
lines(age.seq,m*wt)
with mean & sd:
>sd(age.seq); sd(sim_wt); mean(sim_wt); mean(age.seq) #check
[1] 20.29432
[1] 2.20271
[1] 3.437339
[1] 50.1549
EDIT (partially in response to a comment):
A restriction on the sample space, e.g. a positivity criterion for weights, makes the problem much harder. But if you drop the mean + sd restriction on weights, then it's easy to extend the example to a flexible specification of the functional form. Following is a simple example using a truncated normal distribution:
set.seed(1234)
mean_age<-30
sd_age<-10
n<-500
#ex. of control of functional-form
loc<-40 #location
scale<-10 #scaling
sd_wt <- 0.8 #in the truncated normal
ey_min<-c(0,0.2) #in the truncated normal
ey_max<-c(55,6) #in the truncated normal
age.seq<-rnorm(n,mean_age,sd_age)
#age.seq<-0:55
n<-length(age.seq)
age.seq<-age.seq[order(age.seq)]
#functional form: (here a "logistic" with a location and scale)
f<-function(x,loc,sca) 1/(1+exp(-(x-loc)/sca))
wt<-f(age.seq,loc,scale) #wt
#correct lower:
corr_lower<-ey_min[2]-f(ey_min[1],loc,scale) #add. correction lower
wt<-wt+corr_lower
#correct upper
mult<-(ey_max[2]-ey_min[2])/(f(ey_max[1],loc,scale)+corr_lower) #mult. correction
wt<-ey_min[2]+wt*mult*(age.seq/ey_max[1])
plot(age.seq,wt,type="l",ylim=c(0,8)) #plot mean used as par in the truncated normal
sim_wt <- truncnorm::rtruncnorm(n, a = 0, mean = wt, sd = sd_wt)
points(age.seq,sim_wt)
abline(h=0.2,col=2);abline(v=0,col=2)
abline(h=6,col=2);abline(v=55,col=2)
which gives (red lines illustrating the controls):
Of course you could also try to control the variance with respect to age; simplified:
plot(age.seq,wt,type="l",ylim=c(0,8)) #plot mean used as par in the truncated normal
sim_wt <- truncnorm::rtruncnorm(n, a = 0, mean = wt, sd = sd_wt * seq(0.3, 1.3, len = n))
points(age.seq,sim_wt)
The point here is that you need more structure to simulate specific data like this (not going into e.g. bootstrap methods); there is no built-in R function to the rescue. Of course it gets harder to sample from the distribution as you introduce more restrictions. You can always consult Cross Validated for different approaches, choice of distribution, etc.
