R: how can I do a data[[k]] calculation?

I have a data set from a sample without replacement that looks like this:
The picture shows the frequency of each species, and there are 50 data.c[[k]] tables like this.
Now I'm trying Jackknife resampling (without replacement) to estimate coverage; code below:
data.c <- sapply(1:50, function(k) table(data[, k]))  # freq
mdata <- sapply(1:50, function(k) sum(data.c[[k]] == 1))
True_c <- 1 - sum(np * (exp(lchoose(N - data.c[[k]], i)) / exp(lchoose(N, i))))
##True_c function shows error message##
My result shows "Error in N - data.c : non-numeric argument to binary operator".
I want to compute True_c as N (the population size) minus the species' frequencies, passed through the lchoose function. How can I adjust my code?
My entire code is shown below:
### without replacement
for (seed in c(99, 100)) {
  set.seed(seed)
  for (s in c(100, 1000)) {
    sdata <- rlnorm(s, 0, 1)
    p <- sdata / sum(sdata)
    gn <- p * s * 10
    gn <- round(gn)
    M <- replace(gn, gn == 0, 1)  # or gn[gn == 0] <- 1
    N <- sum(M); N
    np <- M / N  # new probabilities
    pop_index <- rep(1:s, times = M)
    for (i in c(100, 500, 1000, 5000, N)) {
      data <- replicate(50, sample(pop_index, i, replace = FALSE, prob = NULL))
      data.c <- sapply(1:50, function(k) table(data[, k]))  # frequency table per replicate
      mdata <- sapply(1:50, function(k) sum(data.c[[k]] == 1))  # singletons (freq 1) per replicate
      True_c <- 1 - sum(np * (exp(lchoose(N - data.c, i)) / exp(lchoose(N, i))))  # errors: data.c is a list
      c.hat <- (1 - (1 - (i / N)) * (mdata / i))  # geo
      bias <- mean(c.hat) - True_c
      var <- var(c.hat)
      cat("sample_size", i, "\n",
          "True_C =", True_c, "\n",
          "bias =", bias, "\n",
          "variance =", var, "\n", "\n")
    }
  }
}
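One way around the error, under an assumption worth stating loudly: data.c is a list of per-replicate tables, so N - data.c has no meaning, but if the "true" coverage is meant to use the population frequencies M (which, like np, has length s), no loop over k is needed at all. A minimal sketch, working on the log scale so the binomial coefficients don't overflow:

# Sketch only -- assumes True_c should use the population counts M rather
# than the per-sample tables in the list data.c. For each species j,
# P(absent from a sample of size i) = choose(N - M[j], i) / choose(N, i).
True_c <- 1 - sum(np * exp(lchoose(N - M, i) - lchoose(N, i)))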

Related

Unnest/Unlist moving window results in R

I have a dataframe that has two columns, x and y (both populated with numbers). I am trying to look at a moving window within the data, and I've done it like this (source):
# Extract just x and y from the original data frame
df <- dat_fin %>% select(x, y)
# Moving window creation
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
This lapply creates a list of tibbles that are each 10 (x, y) pairs. At this point, I am trying to compute a single quantity using each of the sets of 10 pairs; my current (not working) code looks like this:
library(shotGroups)
for (f in 1:length(windfs)) {
  tsceps[f] = getCEP(windfs[f], accuracy = TRUE)
}
When I run this, I get the error:
Error in getCEP.default(windfs, accuracy = TRUE) : xy must be numeric
My goal is for the variable I've called tsceps to end up as a 1 x length(windfs) data frame, where each value comes from the getCEP calculation for one of the windowed subsets.
I've tried various things with unnest and unlist, all of which were unsuccessful.
What am I missing?
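The error comes from single-bracket indexing: on a list, windfs[f] returns a length-one sub-list, whereas windfs[[f]] extracts the tibble itself, which is the numeric xy input getCEP expects. A quick way to see the difference, using the windfs built above:

str(windfs[1])    # a list holding one tibble -- not numeric, so getCEP errors
str(windfs[[1]])  # the 10-row tibble itself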
Working code:
df <- dat_fin %>% select(x, y)
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
tsceps <- vector(mode = "numeric", length = length(windfs))
library(shotGroups)
for (j in 1:length(windfs)) {
  tsceps[j] <- getCEP(windfs[[j]], type = "CorrNormal", CEPlevel = 0.50, accuracy = TRUE)
}
ults <- unlist(tsceps)
ults_cep <- vector(mode = "numeric", length = length(ults))
for (k in 1:length(ults)) {
  ults_cep[k] <- ults[[k]]
}
To get this working with multiple type arguments to getCEP, just use additional code blocks for each type required.

for loop with 2 vectors to calculate power in R fails

I have 2 vectors containing numbers that I'm using to simulate the power of my study, but I keep getting this error at the for loop:
Error in pwr.2p2n.test(h, n1 = i, n2 = j, sig.level = 0.05) :
number of observations in the first group must be at least 2
I would be grateful for your suggestions to get it working.
## sample code
grp1.n <- seq(30, 150, 5)       # group 1, N
grp2.n <- seq(30, 150, 5) - 15  # group 2, N - 15
h <- 0.85                       # specify large effect size
grp1.length <- length(grp1.n)
grp2.length <- length(grp2.n)
power.holder <- array(numeric(grp1.length * grp2.length),
                      dim = c(grp1.length, grp2.length),
                      dimnames = list(grp1.n, grp2.n))
for (i in 1:grp1.length) {
  for (j in 1:grp2.length) {
    result.pwr.2p2n.test <- pwr.2p2n.test(h, n1 = i, n2 = j, sig.level = 0.05)
    power.holder[i, j] <- ceiling(result.pwr.2p2n.test$power)
    return(result.pwr.2p2n.test)
  }
}
I'm not entirely sure if this is what you want, but I think it is:
library(pwr)

grp1.n <- seq(30, 150, 5)       # group 1, N
grp2.n <- seq(30, 150, 5) - 15  # group 2, N - 15
h <- 0.85                       # specify large effect size
grp1.length <- length(grp1.n)
grp2.length <- length(grp2.n)
power.holder <- array(numeric(grp1.length * grp2.length),
                      dim = c(grp1.length, grp2.length),
                      dimnames = list(grp1.n, grp2.n))
for (i in 1:grp1.length) {
  for (j in 1:grp2.length) {
    result.pwr.2p2n.test <- pwr.2p2n.test(h, n1 = grp1.n[i], n2 = grp2.n[j],
                                          sig.level = 0.05)
    power.holder[i, j] <- result.pwr.2p2n.test$power
  }
}
power.holder
The key change is in the pwr.2p2n.test call: use the actual group sizes rather than the loop indices.
Old: pwr.2p2n.test(h, n1=i, n2=j, sig.level=0.05)
New: pwr.2p2n.test(h, n1=grp1.n[i], n2=grp2.n[j], sig.level=0.05)
Two smaller fixes: return() is only valid inside a function, so the loop now just fills power.holder, and ceiling() is dropped because it would round every power up to 1. Note there was also a missing } bracket in your code.
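For what it's worth, the same grid can be built without explicit loops. A sketch under the same assumptions as above (pwr loaded; grp1.n, grp2.n, and h as defined there); Vectorize() keeps each pwr.2p2n.test call scalar while outer() assembles the matrix:

# Loop-free variant of the grid above (power.grid is a made-up name)
power.grid <- outer(grp1.n, grp2.n,
                    Vectorize(function(a, b)
                      pwr.2p2n.test(h, n1 = a, n2 = b, sig.level = 0.05)$power))
dimnames(power.grid) <- list(grp1.n, grp2.n)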

Txt Prediction Model Numerical Expression Warning

I have three data frames created from different n-gram counts (uni, bi, tri); each data frame contains the separated n-gram, the frequency count (n), and a probability added using smoothing.
I have written three functions that look through the tables and return the most probable word based on an input string, and I have bound them together:
## Prediction Model
trigramwords <- function(FirstWord, SecondWord, n = 5, allow.cartesian = TRUE) {
  probword <- trigramtable[.(FirstWord, SecondWord), allow.cartesian = TRUE][order(-Prob)]
  if (any(is.na(probword)))
    return(bigramwords(SecondWord, n))
  if (nrow(probword) > n)
    return(probword[1:n, ThirdWord])
  count <- nrow(probword)
  bgramwords <- bigramtable(SecondWord, n)[1:(n - count)]
  return(c(probword[, ThirdWord], bgramwords))
}

bigramwords <- function(FirstWord, n = 5, allow.cartesian = TRUE) {
  probword <- bigramtable[FirstWord][order(-Prob)]
  if (any(is.na(probword)))
    return(Unigramword(n))
  if (nrow(probword) > n)
    return(probword[1:n, SecondWord])
  count <- nrow(probword)
  word1 <- Unigramword(n)[1:(n - count)]
  return(c(probword[, SecondWord], word1))
}

## Back-off Model
Unigramword <- function(n = 5, allow.cartesian = TRUE) {
  return(sample(UnigramTable[, FirstWord], size = n))
}

## Bind Functions
predictword <- function(str) {
  require(quanteda)
  tokens <- tokens(x = char_tolower(str))
  tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
  words <- trigramwords(tokens[1], tokens[2], 5)
  chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")
  print(words[1])
}
However, I receive the following warning message and the output is always the same word. If I use only the bigramwords function it works fine, but when I add the trigram function I get the warning. I believe it's because 1:n is not defined correctly.
Warning message:
In 1:n : numerical expression has 5718534 elements: only the first used
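For reference, this warning is exactly what R emits when the right-hand side of 1:n has more than one element; only the first element is used. A minimal stand-alone illustration (the vector here is made up):

n <- c(5, 7, 9)  # n has accidentally become a vector
1:n              # Warning: numerical expression has 3 elements: only the first used

So somewhere along the trigram path, n (or an expression such as n - count) is arriving as a 5718534-element vector rather than a single number.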

Regsubsets results differ with coef() for model with linear dependencies

While using regsubsets from package leaps on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that, when linear dependencies are found, reordering changes the position of the coefficients and coef() returns wrong values.
I use mtcars just to "simulate" the problem I had with other data. In the first example there is no issue of linear dependencies: the best model by BIC is mpg ~ wt + cyl, and coef() and summary()$which give the same result. In the second example I add a dummy variable, so perfect multicollinearity becomes possible, but with the variables in this order (the dummy in the last column) the problem does not appear. In the last example, after changing the order of the variables in the dataset, the problem finally shows up and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta2)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
The order of variables used by summary.regsubsets and by regsubsets differs. The generic coef() for regsubsets calls both in one function, and the results get scrambled if you use force.in or a formula with a fixed order. Changing some lines in the coef() function might help. Try the code below and see if it works:
coef.regsubsets <- function(object, id, vcov = FALSE, ...) {
  s <- summary(object)
  invars <- s$which[id, , drop = FALSE]
  betas <- vector("list", length(id))
  for (i in 1:length(id)) {
    # added
    var.name <- names(which(invars[i, ]))
    thismodel <- which(object$xnames %in% var.name)
    names(thismodel) <- var.name
    # deleted
    # thismodel <- which(invars[i, ])
    qr <- .Fortran("REORDR", np = as.integer(object$np),
                   nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
                   d = as.double(object$d), rbar = as.double(object$rbar),
                   thetab = as.double(object$thetab), rss = as.double(object$rss),
                   tol = as.double(object$tol), list = as.integer(thismodel),
                   n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
    beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
                     d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
                     tol = as.double(qr$tol), beta = numeric(length(thismodel)),
                     nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
    names(beta) <- object$xnames[qr$vorder[1:qr$n]]
    reorder <- order(qr$vorder[1:qr$n])
    beta <- beta[reorder]
    if (vcov) {
      p <- length(thismodel)
      R <- diag(qr$np)
      R[row(R) > col(R)] <- qr$rbar
      R <- t(R)
      R <- sqrt(qr$d) * R
      R <- R[1:p, 1:p, drop = FALSE]
      R <- chol2inv(R)
      dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
                          object$xnames[qr$vorder[1:p]])
      V <- R * s$rss[id[i]] / (object$nn - p)
      V <- V[reorder, reorder]
      attr(beta, "vcov") <- V
    }
    betas[[i]] <- beta
  }
  if (length(id) == 1)
    beta
  else betas
}
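After sourcing this replacement (it masks the version from leaps for the current session), coef(bestSubset.cars3, w) should line up with best.sum3$which[w,] from Example 3 above.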
Another solution that works for me is to randomize the order of the columns (the independent variables) in your dataset before running regsubsets. The idea is that after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reorder behavior in the regsubsets algorithm.
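A minimal sketch of that workaround, assuming the dta2 data frame from the examples above (the response column stays first; only the predictor order is randomized):

# Hypothetical column-shuffling workaround; shuffled/bestSubset.shuffled are made-up names
set.seed(1)
shuffled <- dta2[, c("mpg", sample(setdiff(names(dta2), "mpg")))]
bestSubset.shuffled <- regsubsets(mpg ~ ., data = shuffled)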

Value-at-Risk (Extreme-Value Theory) using Monte Carlo Simulation in R

I have code that successfully calculates VaR based on extreme value theory using historical data. I'm trying to run this same code on multiple simulated price paths (i.e. calculating a VaR for each path) and then taking the median or average of those VaRs.
Every example I could find online had the simulation function return the price at the end of the period, and then replicated the function X many times. That makes sense to me, except that I essentially need to calculate value-at-risk for each simulated path. Below is the code I have so far. I can say that the code works when using historical data (i.e. the "evt" function works fine and the data table is populated correctly when the lossOnly, u, and evtVar lines aren't in a function). However, I've been trying to implement simulation in the second function with various combinations, all of which have failed.
library('RODBC')
library('nor1mix')
library('fExtremes')
library('QRM')
library('fGarch')
# function for computing the EVT VaR
evt <- function(data, u) {
  # fit excess returns to the GPD to get parameter estimates
  gpdfit <- tryCatch({
    gpdfit <- gpdFit(data, u, type = "mle")
  }, warning = function(w) {
    gpdfit <- gpdFit(data, u, type = "mle", optfunc = "nlminb")
    return(gpdfit)
  }, error = function(e) {
    gpdfit <- gpdFit(data, u, type = "pwm", optfunc = "nlminb")
    return(gpdfit)
  }, finally = {})
  # now calculate VaRs
  xi <- gpdfit@fit$par.ests["xi"]
  beta <- gpdfit@fit$par.ests["beta"]
  Nu <- length(gpdfit@data$exceedances)
  n <- length(data)
  evtVar95 <- (u + ((beta / xi) * (((n / Nu) * .05)^(-xi) - 1.))) * 100
  evtVar99 <- (u + ((beta / xi) * (((n / Nu) * .01)^(-xi) - 1.))) * 100
  evtVar997 <- (u + ((beta / xi) * (((n / Nu) * .003)^(-xi) - 1.))) * 100
  evtVar999 <- (u + ((beta / xi) * (((n / Nu) * .001)^(-xi) - 1.))) * 100
  # return calculations
  return(cbind(evtVar95, evtVar99, evtVar997, evtVar999, u, xi, beta, Nu, n))
}
# data <- read.table("pricedata.txt")
prices <- data$V1
returns <- diff(log(prices))  # or returns <- log(prices[-1] / prices[-n])
xi <- mean(returns)
std <- sd(returns)
N <- length(prices)
lstval <- prices[N]
options(scipen = 999)
p <- c(lstval, rep(NA, N - 1))

gen.path <- function() {
  N <- length(prices)
  for (i in 2:N)
    p[i] <- p[i - 1] * exp(rnorm(1, xi, std))
  # plot(p, type = "l", col = "brown", main = "Simulated Price")
  # evt calculation
  # first get only the losses and then make them absolute
  lossOnly <- abs(p[p < 0])
  # get threshold
  u <- quantile(lossOnly, probs = 0.9, names = FALSE)
  evtVar <- evt(lossOnly, u)
  return(evtVar)
}

runs <- 10
sim.evtVar <- replicate(runs, gen.path())
evtVar <- mean(sim.evtVar)
# add data to the totals table
VaR <- c(evtVar[1], evtVar[2], evtVar[3], evtVar[4], evtVar[5], evtVar[6], evtVar[7], evtVar[8], evtVar[9])
DF <- data.frame(VaR, row.names = c("evtVar95", "evtVaR_99", "evtVaR_997", "evtVaR_999", "u", "xi", "beta", "Nu", "n"))
In short, I'm trying to run the value-at-risk function (the first function) within the Monte Carlo function (the second function) and put the average simulated values into a data table. I know the first function works, but it's the second function that's driving me crazy. These are the errors I'm getting:
> sim.evtVar <- replicate(runs, gen.path())
Error in if (xi > 0.5) { : missing value where TRUE/FALSE needed
Called from: .gpdpwmFit(x, u)
Browse[1]> evtVar <- mean(sim.evtVar)
Error during wrapup: object 'sim.evtVar' not found
Browse[1]>
> #add data to total table
> VaR <- c(evtVar[1],evtVar[2],evtVar[3],evtVar[4],evtVar[5],evtVar[6],evtVar[7],evtVar[8],evtVar[9])
Error: object 'evtVar' not found
> DF <- data.frame(VaR, row.names=c("evtVar95","evtVaR_99","evtVaR_997","evtVaR_999","u","xi","beta","Nu","n"))
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class ""function"" to a data.frame
Any help you can provide is greatly appreciated! Thank you in advance!
I think the problem is this row:
lstval <- prices[N]
because a stock price can never be negative, so this row in your function produces an empty vector:
lossOnly <- abs(p[p < 0])
You should try instead:
lstval <- min(returns)
if you want the highest negative return of your dataset.
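Along the same diagnosis, a hedged alternative sketch: if the goal is to feed evt() the losses from each simulated path, they can be derived from the simulated returns rather than from the (always positive) price level:

# Inside gen.path(): take losses from the simulated log returns, not from p itself
sim.rets <- diff(log(p))
lossOnly <- abs(sim.rets[sim.rets < 0])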
