I was wondering if it would be possible to get help streamlining some code I have made for a class at uni. I have essentially been thrown into the deep end with R in the past week (so I know very little), and wanted to know if there is a really obvious way I could streamline this so it isn't as clunky!
I am calculating the settlement rate of a population of barnacles on the rocky shore (as per Hines 1979). I have my script up and running for my three species at four different settlement rates with no problem; I just wanted to know how I could neaten it up a bit. The script is as follows:
# Roughgarden et al 1985
# Six age classes. Data from Roughgardenetal1985_1Species.xls
# Population projection matrix
############################### C.FISSUS #####################################
#1.0
A <- matrix(c(0.8609, 1.4062, 1.9515, 2.4957, 2.6825, 2.8339,
0.1522, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.2378, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.1000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.1000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.1000, 0.0000
), nrow=6, byrow=TRUE)
par(mfrow=c(2,4))
# Starting population vector
N0 <- matrix(c(0, 0, 0, 0, 0, 0), nrow=6, byrow=TRUE)
# Settlement per unit free space (per cm2 / 100 = per mm2), for each species use: 1.0, 0.1, 0.01, and 0.001
s <- 1.0
# Area occupied by age classes (mm2)
Ax <- matrix(c(2.33,9.45,15.15,18.78,20.92,22.14), nrow=6, byrow=TRUE)
# Set up matrix to store population stage (rows) structure over time (cols)
Nt <- matrix(data=0, ncol=50, nrow=6) # Create a matrix to store results
Nt[,1] <- N0 # Make the first column (t = 1) equal to N0
for (step in 1:49) { # Step through time, calculating Nt+1 each time
Nt[,step+1]<- A %*% Nt[,step] # Apply mortality
AreaOfBarnacles <- Ax * Nt[,step+1] # Calculate area occupied by surviving barnacles
Ft <- max(100 - sum(AreaOfBarnacles),0) # Calculate free space
print(sum(AreaOfBarnacles))
Nt[1,step+1] <- s * Ft # Number of new recruits
}
#Nt
# Transpose Nt for plotting
TNt <- t(Nt)
matplot(TNt, xlab = "Time, t", ylab = "Population Size, Nt", type="l", main = "Chthamalus fissus")
title(main="s = 1.0", line = 0.5)
I essentially need to run this part of the script a total of 12 times: four times for each of the three species, with a different s value each time (1, 0.1, 0.01, and 0.001). I wanted to add a bit that would effectively say "run this script under these four settlement rates and produce four graphs each time", so that I would only have to repeat this section three times (once for each species). However, I couldn't get it to work and ended up doing it the long way!
Thank you so much for taking the time to read this lengthy question, like I said, I'm VERY new to R (and coding in general) so I do apologise if anything I am asking is stupid!
P.S. (bonus round?)
How would I add a legend to these graphs without it getting in the way? Is there a way I can make a legend that is its own image so it doesn't overlay my graphs?
You can wrap your operations into a function:
## Defining the function
population.projection <- function(settlement, matrix_A, area_occupied) {
# Starting population vector
N0 <- matrix(c(0, 0, 0, 0, 0, 0), nrow=6, byrow=TRUE)
# Set up matrix to store population stage (rows) structure over time (cols)
Nt <- matrix(data=0, ncol=50, nrow=6) # Create a matrix to store results
Nt[,1] <- N0 # Make the first column (t = 1) equal to N0
for (step in 1:49) { # Step through time, calculating Nt+1 each time
Nt[,step+1]<- matrix_A %*% Nt[,step] # Apply mortality
AreaOfBarnacles <- area_occupied * Nt[,step+1] # Calculate area occupied by surviving barnacles
Ft <- max(100 - sum(AreaOfBarnacles),0) # Calculate free space
# print(sum(AreaOfBarnacles))
Nt[1,step+1] <- settlement * Ft # Number of new recruits
}
# Transpose Nt for plotting
return(t(Nt))
}
This function takes your s variable and the two matrices A and Ax, renamed settlement, matrix_A and area_occupied to be more self-explanatory.
You can then input your data:
## matrix_A input
matrix_A<- matrix(c(0.8609, 1.4062, 1.9515, 2.4957, 2.6825, 2.8339,
0.1522, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.2378, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.1000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.1000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.1000, 0.0000
), nrow=6, byrow=TRUE)
## Area occupied by age classes (mm2)
area_occupied <- matrix(c(2.33,9.45,15.15,18.78,20.92,22.14), nrow=6, byrow=TRUE)
## Setting the s values
my_settlement_values <- c(1, 0.1, 0.01, 0.001)
And loop through your settlement values for plotting the results:
## Setting the graphic parameters
par(mfrow=c(2,2))
## Looping through the s values
for(one_settlement in my_settlement_values) {
## Plotting the results
matplot(population.projection(settlement = one_settlement, matrix_A, area_occupied), xlab = "Time, t", ylab = "Population Size, Nt", type="l", main = "Chthamalus fissus")
## Adding the title
title(main = paste("s =", one_settlement), line = 0.5)
}
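If you also want to avoid repeating this block once per species, one option is an outer loop over the species-specific inputs. This is only a sketch: it assumes you collect each species' projection matrix and area vector in named lists, and the list names and missing entries below are placeholders to fill in from your spreadsheet.
## Species-specific inputs (placeholders: add the other two species' matrices and areas)
species_matrices <- list("Chthamalus fissus" = matrix_A)
species_areas    <- list("Chthamalus fissus" = area_occupied)
for (species in names(species_matrices)) {
  par(mfrow = c(2, 2))   # one 2x2 page of plots per species
  for (one_settlement in my_settlement_values) {
    matplot(population.projection(settlement    = one_settlement,
                                  matrix_A      = species_matrices[[species]],
                                  area_occupied = species_areas[[species]]),
            xlab = "Time, t", ylab = "Population Size, Nt",
            type = "l", main = species)
    title(main = paste("s =", one_settlement), line = 0.5)
  }
}
As for the bonus question: one common base-R trick is to reserve an outer margin with par(oma = ...) and, after the loop, draw a single legend there with par(xpd = NA) set, so it sits beside the panels rather than on top of them.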
I have a data set that is split into 3 profiles
Profile 1 = 0.478 (95% confidence interval: 0.4, 0.56)
Profile 2 = 0.415 (95% confidence interval: 0.34, 0.49)
Profile 3 = 0.107 (95% confidence interval: 0.06, 0.15)
Profile 1 + Profile 2 + Profile 3 = 1
I want to create a stochastic model that selects a value for each profile from that proportion's confidence interval, while keeping the three values summing to one. I have been using
pro1_prop <- rpert(1, 0.4, 0.478, 0.56)
pro2_prop <- rpert(1, 0.34, 0.415, 0.49)
pro3_prop <- 1 - (pro1_prop + pro2_prop)
But this does not seem robust enough. Also, on some iterations (pro1_prop + pro2_prop) > 1, which results in a negative value for pro3_prop. Is there a better way of doing this? Thank you!
It is straightforward to sample from the posterior distributions of the proportions using Bayesian methods. I'll assume a multinomial model, where each observation is one of the three profiles.
Say the counts data for the three profiles are 76, 66, and 17.
Using a Dirichlet prior distribution, Dir(1/2, 1/2, 1/2), the posterior is also Dirichlet-distributed: Dir(76.5, 66.5, 17.5), which can be sampled using normalized random gamma variates.
x <- c(76, 66, 17) # observations
# take 1M samples of the proportions from the posterior distribution
theta <- matrix(rgamma(3e6, rep(x + 1/2, each = 1e6)), ncol = 3)
theta <- theta/rowSums(theta)
head(theta)
#> [,1] [,2] [,3]
#> [1,] 0.5372362 0.3666786 0.09608526
#> [2,] 0.4008362 0.4365053 0.16265852
#> [3,] 0.5073144 0.3686412 0.12404435
#> [4,] 0.4752601 0.4367119 0.08802793
#> [5,] 0.4428575 0.4520680 0.10507456
#> [6,] 0.4494075 0.4178494 0.13274311
# compare the Bayesian credible intervals with the frequentist confidence intervals
cbind(
t(mapply(function(i) quantile(theta[,i], c(0.025, 0.975)), seq_along(x))),
t(mapply(function(y) setNames(prop.test(y, sum(x))$conf.int, c("2.5%", "97.5%")), x))
)
#> 2.5% 97.5% 2.5% 97.5%
#> [1,] 0.39994839 0.5537903 0.39873573 0.5583192
#> [2,] 0.33939396 0.4910900 0.33840295 0.4959541
#> [3,] 0.06581214 0.1614677 0.06535702 0.1682029
If samples within the individual 95% CIs are needed, simply reject samples that fall outside the desired interval.
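For example, a minimal sketch of that rejection step applied to the theta samples above, using the intervals stated in the question:
# keep only draws where every proportion falls inside its stated 95% CI
lwr <- c(0.40, 0.34, 0.06)
upr <- c(0.56, 0.49, 0.15)
inside <- theta[, 1] >= lwr[1] & theta[, 1] <= upr[1] &
          theta[, 2] >= lwr[2] & theta[, 2] <= upr[2] &
          theta[, 3] >= lwr[3] & theta[, 3] <= upr[3]
theta_in_ci <- theta[inside, ]   # retained rows still sum to 1
mean(inside)                     # fraction of draws kept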
TL;DR: Sample all three values (for example from a pert distribution, as you did) and norm those values afterwards so they add up to one.
Sampling all three values independently from each other and then dividing by their sum so that the normed values add up to one seems to be the easiest option as it is quite hard to sample from the set of legal values directly.
Legal values:
The downside of my approach is that the normed values are not necessarily legal (i.e. in the range of the confidence intervals) any more. However, for these values using a pert distribution, this only happens about 0.5% of the time.
Code:
library(plotly)
library(freedom)
library(data.table)
# define lower (L) and upper (U) bounds and expected values (E)
prof1L <- 0.4
prof1E <- 0.478
prof1U <- 0.56
prof2L <- 0.34
prof2E <- 0.415
prof2U <- 0.49
prof3L <- 0.06
prof3E <- 0.107
prof3U <- 0.15
dt <- as.data.table(expand.grid(
Profile1 = seq(prof1L, prof1U, by = 0.002),
Profile2 = seq(prof2L, prof2U, by = 0.002),
Profile3 = seq(prof3L, prof3U, by = 0.002)
))
# color based on how far the points are away from the center
dt[, color := abs(Profile1 - prof1E) + abs(Profile2 - prof2E) + abs(Profile3 - prof3E)]
# only keep those points that (almost) add up to one
dt <- dt[abs(Profile1 + Profile2 + Profile3 - 1) < 0.01]
# plot the legal values
fig <- plot_ly(dt, x = ~Profile1, y = ~Profile2, z = ~Profile3, color = ~color, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers()
fig
# try to simulate the legal values:
# first sample without considering the condition that the profiles need to add up to 1
nSample <- 100000
dtSample <- data.table(
Profile1Sample = rpert(nSample, prof1L, prof1U, prof1E),
Profile2Sample = rpert(nSample, prof2L, prof2U, prof2E),
Profile3Sample = rpert(nSample, prof3L, prof3U, prof3E)
)
# we want to norm the samples by dividing by their sum
dtSample[, SampleSums := Profile1Sample + Profile2Sample + Profile3Sample]
dtSample[, Profile1SampleNormed := Profile1Sample / SampleSums]
dtSample[, Profile2SampleNormed := Profile2Sample / SampleSums]
dtSample[, Profile3SampleNormed := Profile3Sample / SampleSums]
# now get rid of the cases where the normed values are not legal any more
# (e.g. Profile 1 = 0.56, Profile 2 = 0.49, Profile 3 = 0.06 => dividing by their sum of 1.11
# pushes Profile 3 down to about 0.054, below its lower bound of 0.06)
dtSample <- dtSample[
prof1L <= Profile1SampleNormed & Profile1SampleNormed <= prof1U &
prof2L <= Profile2SampleNormed & Profile2SampleNormed <= prof2U &
prof3L <= Profile3SampleNormed & Profile3SampleNormed <= prof3U
]
# see if the sampled values follow the desired distribution
hist(dtSample$Profile1SampleNormed)
hist(dtSample$Profile2SampleNormed)
hist(dtSample$Profile3SampleNormed)
Histogram of normed sampled values for Profile 1:
Ok, some thoughts on the matter.
Let's think about the Dirichlet distribution, as one that provides random variables summing to 1.
We're talking about Dir(a1, a2, a3), and we have to find suitable ai.
From the expression E[Xi] = ai/Sum(i, ai), it is clear we can get the three ratios by solving the equations
a1/Sum(i, ai) = 0.478
a2/Sum(i, ai) = 0.415
a3/Sum(i, ai) = 0.107
Note that we have only solved for the RATIOS. In other words, if in the expression E[Xi] = ai/Sum(i, ai) we multiply every ai by the same value, the means stay the same. So we are free to choose a multiplier m, and what changes is the variance/std. dev.: a large multiplier means smaller variance, i.e. sampled values packed tighter around the means.
We could try to choose m to satisfy the three 95% CI conditions, but that is three equations for the variances with only one degree of freedom, so in general it is not possible to match all of them exactly.
One could play with the numbers and the code.
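For instance, a rough sketch of that idea (the multiplier m = 160 is arbitrary and only illustrates how it controls the spread):
p <- c(0.478, 0.415, 0.107)   # target means: the ratios solved above
m <- 160                      # free multiplier: larger m gives tighter samples
alpha <- m * p                # Dir(alpha) has E[Xi] = p[i] for any m > 0
n <- 1e5
sims <- matrix(rgamma(3 * n, shape = rep(alpha, each = n)), ncol = 3)
sims <- sims / rowSums(sims)  # Dirichlet draws: each row sums to 1
# compare the central 95% ranges with the stated confidence intervals
apply(sims, 2, quantile, probs = c(0.025, 0.975))
Increasing m tightens those ranges; decreasing it widens them.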
I estimated a VECM and would like to run 4 separate tests of weak exogeneity, one for each variable.
library(urca)
library(vars)
data(Canada)
e prod rw U
1980 Q1 929.6105 405.3665 386.1361 7.53
1980 Q2 929.8040 404.6398 388.1358 7.70
1980 Q3 930.3184 403.8149 390.5401 7.47
1980 Q4 931.4277 404.2158 393.9638 7.27
1981 Q1 932.6620 405.0467 396.7647 7.37
1981 Q2 933.5509 404.4167 400.0217 7.13
...
jt = ca.jo(Canada, type = "trace", ecdet = "const", K = 2, spec = "transitory")
t = cajorls(jt, r = 1)
t$rlm$coefficients
e.d prod.d rw.d U.d
ect1 -0.005972228 0.004658649 -0.10607044 -0.02190508
e.dl1 0.812608320 -0.063226620 -0.36178542 -0.60482042
prod.dl1 0.208945048 0.275454380 -0.08418285 -0.09031236
rw.dl1 -0.045040603 0.094392696 -0.05462048 -0.01443323
U.dl1 0.218358784 -0.538972799 0.24391761 -0.16978208
t$beta
ect1
e.l1 1.00000000
prod.l1 0.08536852
rw.l1 -0.14261822
U.l1 4.28476955
constant -967.81673980
I guess that my equations are:
and I would like to test whether alpha_e, alpha_prod, alpha_rw and alpha_U (marked in red in the picture above) are zero, and to impose the necessary restrictions on my model. So, my question is: how can I do it?
I guess that my estimated alphas are:
e.d prod.d rw.d U.d
ect1 -0.005972228 0.004658649 -0.10607044 -0.02190508
I guess that I should use the alrtest function from the urca package:
alrtest(z = jt, A = A1, r = 1)
and probably my A matrix for alpha_e should be like this:
A1 = matrix(c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1),
nrow = 4, ncol = 3, byrow = TRUE)
The results of the test:
jt1 = alrtest(z = jt, A = A1, r = 1)
summary(jt1)
The value of the likelihood ratio test statistic:
0.48 distributed as chi square with 1 df.
The p-value of the test statistic is: 0.49
Eigenvectors, normalised to first column
of the restricted VAR:
[,1]
RK.e.l1 1.0000
RK.prod.l1 0.1352
RK.rw.l1 -0.1937
RK.U.l1 3.9760
RK.constant -960.2126
Weights W of the restricted VAR:
[,1]
[1,] 0.0000
[2,] 0.0084
[3,] -0.1342
[4,] -0.0315
Which I guess means that I can't reject the hypothesis of weak exogeneity (alpha_e = 0). And my new alphas here are: 0.0000, 0.0084, -0.1342, -0.0315.
Now the question is how can I impose this restriction on my VECM model?
If I do:
t1 = cajorls(jt1, r = 1)
t1$rlm$coefficients
e.d prod.d rw.d U.d
ect1 -0.005754775 0.007717881 -0.13282970 -0.02848404
e.dl1 0.830418381 -0.049601229 -0.30644063 -0.60236338
prod.dl1 0.207857861 0.272499006 -0.06742147 -0.08561076
rw.dl1 -0.037677197 0.102991919 -0.05986655 -0.02019326
U.dl1 0.231855899 -0.530897862 0.30720652 -0.16277775
t1$beta
ect1
e.l1 1.0000000
prod.l1 0.1351633
rw.l1 -0.1936612
U.l1 3.9759842
constant -960.2126150
the new model doesn't have 0.0000, 0.0084, -0.1342, -0.0315 for the alphas. It has -0.005754775, 0.007717881, -0.13282970, -0.02848404 instead.
How can I get a re-estimated model with alpha_e = 0? I want it because I would like to use the model for predictions (vecm -> vec2var -> predict, but vec2var doesn't accept jt1 directly). And in general, are the calculations I made correct or not?
Just for illustration, this is how imposing a restriction on alpha looks in EViews (not for this example):
If you have 1 cointegrating relationship (r = 1), as in t = cajorls(jt, r = 1), your loading matrix cannot have 4 rows and 3 columns:
A1 = matrix(c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1),
nrow = 4, ncol = 3, byrow = TRUE)
Matrix A can only have 4 rows and 1 column if you have 4 variables and 1 cointegrating relationship.
I have a (32x750) tensor
tensor([[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0043],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0043],
[ 0.0000, 0.0044, 0.0000, ..., 0.0044, 0.0000, 0.0000],
...,
[ 0.0059, 0.0000, 0.0059, ..., 0.0059, 0.0000, 0.0000],
[ 0.0059, 0.0000, 0.0059, ..., 0.0059, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0056, 0.0000]], device='cuda:0')
And I want to get the number of nonzero elements in each row, something like [12, 47, 0, 5, ..., 8, 7, 50].
This discussion and this one didn't solve my problem; they concern the number of nonzero elements in a 1-D tensor.
Thanks
Problem solved using this post
I used: 750 - (tensor == 0).sum(dim=1)
# equivalent per-row loop (much slower than the vectorized version above)
list_of_num_nonzero_in_each_row = []
for row in my_tensor:
    list_of_num_nonzero_in_each_row.append((row != 0.0).sum().item())
Torch now has a count_nonzero function built in. This vectorized implementation is going to be faster than iterating over the tensor, and it supports counting over a given dimension; for the per-row counts asked for here, that is dim=1:
torch.count_nonzero(x, dim=1)   # one count per row of the (32, 750) tensor
I have a data frame with 20 variables and 400k instances. All variables are normalized with mean 0 and standard deviation 1. I want to write a function which could classify each instance of each variable into quantiles.
Let's say we have a normalized vector
a <- c(0.2132821, -1.5136988, 0.6450274, 1.5085178, 0.2132821, 1.5085178, 0.6450274)
And the quantiles for this vector are
quant.a <- c(-1.5136988, -1.0819535, 0.2132821, 1.0767726, 1.5085178)
where -1.5136988 is 0%
-1.0819535 is 25%
0.2132821 is 50%
1.0767726 is 75%
1.5085178 is 100% (all are elements in vector 'quant.a')
Now, I want to classify each element of vector 'a' as follows
new.a <- c(0.5, 0, 0.75, 1, 0.5, 1, 0.75)
You can use the following code to work through the example, as it is not possible for me to share the actual data:
# Generate random data
set.seed(99)
# All variables are on a scale of 1-9
a <- floor(runif(500, min = 1, max = 9))
b <- floor(runif(500, min = 1, max = 9))
c <- floor(runif(500, min = 1, max = 9))
# store variables as dataframe
x <- data.frame(cbind(a,b,c))
#Scale variables
scaled.dat <- data.frame(scale(x))
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat)
apply(scaled.dat, 2, sd)
# generate quantiles for each variables
quantiles <- data.frame(apply(scaled.dat,2,quantile))
Thanks in advance
a <- c(0.2132821, -1.5136988, 0.6450274 , 1.5085178 , 0.2132821 , 1.5085178 , 0.6450274)
quant.a = quantile(a)
aux_matrix = findInterval(a, quant.a)
new.a = ifelse(aux_matrix == 1|aux_matrix == 0, 0,
ifelse(aux_matrix == 2, 0.5,
ifelse(aux_matrix==3,0.75,
1)))
print(new.a)
0.50 0.00 0.75 1.00 0.50 1.00 0.75
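Since the question ultimately needs this per column of a data frame, the same idea can be applied column-wise. A sketch (the helper name is made up; scaled.dat is the example data frame built in the question):
# quartile breaks per column, then map the four bins to 0, 0.5, 0.75, 1 as above
to_quartile_labels <- function(v) {
  bin <- findInterval(v, quantile(v), rightmost.closed = TRUE)  # bins 1..4
  c(0, 0.5, 0.75, 1)[bin]
}
new.dat <- as.data.frame(lapply(scaled.dat, to_quartile_labels))
head(new.dat)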
library(dplyr)
yourdataframe %>%
  mutate(across(everything(), ~ ntile(.x, 4) / 4))  # quartile (1-4) of every column, scaled to 0.25-1
Let's say we have a simple long-only problem with four assets and several constraints. Below is how I would normally optimise portfolio weights with some basic constraints, such as weights sum to 1, no short selling and no leverage.
library(quadprog)
# set the covariance matrix:
cov <- cbind(c(0.1486, 0.0778, -0.0240, -0.0154),
c(0.0778, 0.1170, 0.0066, 0.0029),
c(-0.0240, 0.0066, 0.0444, 0.0193),
c(-0.0154, 0.0029, 0.0193, 0.0148)
)
# expected returns:
dvec <- c(0.0308, 0.0269, 0.0145, 0.0130)
# constraints:
# 1) weights to sum to 1
# 2) minimum weight for each asset = 0
# 3) maximum weight for each asset = 1
Amat <- cbind(c(1, 1, 1, 1), diag(1,4,4), diag(-1,4,4))
bvec <- c(1, 0, 0, 0, 0, -1, -1, -1, -1)
meq = 1
# The solution for portfolio weights is as follows:
round(solve.QP(cov, dvec=dvec, Amat=Amat, bvec=bvec, meq=meq)$solution,4)
Now, I would like to add a constraint that the first asset is less than or equal to 60% of the first three assets taken together. How could I add this constraint to the above portfolio? It is easy to set the upper bound for an asset as a percentage of the overall portfolio, but I don't know how to set the upper bound for an asset as a percentage of a certain group of assets.
Any thoughts would be much appreciated.
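For what it's worth, one possible encoding (a sketch that reuses cov, dvec, Amat, bvec and meq from above): rewrite w1 <= 0.6*(w1 + w2 + w3) as -0.4*w1 + 0.6*w2 + 0.6*w3 >= 0 and append it as one more inequality column:
# extra inequality: -0.4*w1 + 0.6*w2 + 0.6*w3 + 0*w4 >= 0
Amat2 <- cbind(Amat, c(-0.4, 0.6, 0.6, 0))
bvec2 <- c(bvec, 0)
round(solve.QP(cov, dvec = dvec, Amat = Amat2, bvec = bvec2, meq = meq)$solution, 4)
solve.QP treats every column of Amat after the first meq columns as a >= constraint, so the new column slots in without touching the existing ones.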