Errors when trying to create and use a random permutation test function in R

I am trying to complete a random permutation test in RStudio and keep getting the following two errors:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?
#create groups/data vectors
drinks = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
mosquito = c(27, 20, 21, 26, 27, 3, 24, 21, 20, 19, 23, 24, 28, 19, 24, 29, 18, 20, 17, 31, 20, 25, 28, 21, 27, 21, 22, 15, 12, 21, 19, 15, 22, 24, 19, 23, 13, 22, 20, 24, 18, 20)
#create function
rpermut = function(group, outcome, permutation){
  diff = rep(NA, permutation)
  for(i in 1:permutation){
    outcome = sample(outcome)
    diff[i] = mean(outcome[group==levels(group)[1]]) - mean(outcome[group==levels(group)[2]])
  }
  diff
}
#adding values to function
mosrep = rpermut(group=drinks, outcome=mosquito, permutation=1000)
I am not sure what the error codes mean nor how to fix things so that the function will run. I would greatly appreciate any assistance you may be able to provide on where I am going wrong here!

So it seems to work for me with a few changes.
Firstly, I assume that both drinks and mosquito should be the same length, which in your question is not the case:
> length(drinks)
[1] 43
> length(mosquito)
[1] 42
Secondly, levels() works on factors, whereas drinks and mosquito are numeric vectors:
> class(drinks)
[1] "numeric"
> class(mosquito)
[1] "numeric"
Therefore, to make this function work on my machine, I had to adjust it to this:
rpermut = function(group, outcome, permutation){
  diff = c()
  group = as.factor(group)
  for(i in 1:permutation){
    outcome = sample(outcome)
    diff[i] = mean(outcome[group==levels(group)[1]]) - mean(outcome[group==levels(group)[2]])
  }
  return(diff)
}
This just converts group to a factor with as.factor(). I also changed diff = rep(NA, permutation) to diff = c(), which creates an empty vector. There is no need to pre-fill all the values with NA, since assigning to diff[i] grows the vector in the same way.
So the vectors need to be the same length, and then this should work. A simple check could be added as well:
if(length(group) != length(outcome)){
  stop("input vector lengths do not match!")
}
All together:
rpermut = function(group, outcome, permutation){
  if(length(group) != length(outcome)){
    stop("input vector lengths do not match!")
  }
  diff = c()
  group = as.factor(group)
  for(i in 1:permutation){
    outcome = sample(outcome)
    diff[i] = mean(outcome[group==levels(group)[1]]) - mean(outcome[group==levels(group)[2]])
  }
  return(diff)
}
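Once the lengths agree, the function runs. As a minimal sketch of a full call (assuming, purely for illustration, that the extra 1 in drinks is the stray value; the fix below is hypothetical, and you should correct whichever vector is actually wrong):
drinks_fixed = drinks[-1]  # hypothetical fix: drop one 1 so both vectors have length 42
set.seed(1)                # for reproducibility
mosrep = rpermut(group = drinks_fixed, outcome = mosquito, permutation = 1000)
# observed difference in group means
obs = mean(mosquito[drinks_fixed == 1]) - mean(mosquito[drinks_fixed == 2])
# two-sided permutation p-value: proportion of permuted differences at least as extreme
mean(abs(mosrep) >= abs(obs))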


Is there a way for me to simply plot all of these vectors on a single graph?

These are the vectors that all need to be plotted on the same graph. I would like to plot them on one set of axes. I've seen methods using matrices, but I can't fathom how I would organize this as a matrix, and I would rather work with the vectors. Is there a method I can use to get these all on a single graph?
x_axis <- c(0, 1, 2, 3, 4, 7)
mouse_r_veh <- c(6, 7, 5, 2, 3, 7)
mouse_r_cap <- c(27, 22, 21, 25, 21, 25)
mouse_rr_veh <- c(7, 3, 4, 6, 4, 17)
mouse_rr_cap <- c(24, 27, 29, 9, 10, 21)
mouse_l_veh <- c(10, 12, 11, 16, 13, 2)
mouse_l_cap <- c(26, 23, 23, 23, 24, 22)
mouse_ll_veh <- c(0, 2, 1, 3, 0, 0)
If you don't want to use matplot or ggplot, you could just make a single plot call and add the other series with lines. Note that the series should be plotted against x_axis (which is non-uniform), not against their index:
# set up the axes with the first series, then add the rest
plot(x_axis, mouse_r_veh, type = "l", col = "green", ylim = c(0, 30))
lines(x_axis, mouse_r_cap, col = "red")
# ... et cetera
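With seven series on one set of axes, a legend is practically required; a minimal sketch (the labels and colors here are placeholders matching the two lines above, and should be extended to all series):
legend("topright", legend = c("mouse_r_veh", "mouse_r_cap"),
       col = c("green", "red"), lty = 1, cex = 0.8)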
If you don't mind using matplot with a matrix, you could do the following (keep x_axis out of the matrix and pass it as the x argument, otherwise it gets drawn as a line of its own):
mx <- cbind(mouse_r_veh, mouse_r_cap,
            mouse_rr_veh, mouse_rr_cap, mouse_l_veh,
            mouse_l_cap, mouse_ll_veh)
matplot(x_axis, mx, type = "l")
You could put the data in a data.frame and use pivot_longer() to create a new variable holding the name of each series:
library(tidyr)
library(ggplot2)
df <- data.frame(x_axis,
                 mouse_r_veh,
                 mouse_r_cap,
                 mouse_rr_veh,
                 mouse_rr_cap,
                 mouse_l_veh,
                 mouse_l_cap,
                 mouse_ll_veh)
data <- df %>% pivot_longer(cols = contains('mouse'))
ggplot(data) + geom_line(aes(x = x_axis, y = value, color = name))

Zelen Exact Test - Trying to use k 2x2 tables in the function zelen.test()

I am trying to use the zelen.test function on the package NSM3. I am having difficulty reading the data into the function.
You can recreate my data using
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(data, ncol = 2)
The documentation on CRAN gives the signature zelen.test(z, example = F, r = 3), where z is an array of k 2x2 matrices, example is set to FALSE because TRUE just returns the p-value for a built-in example, and r is the number of decimals the user wants in the returned p-value.
I've tried:
zelen.test(events, r = 4)
I thought it may want the study number and the trial data, so I tried this:
studies <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7)
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(cbind(studies, events), ncol = 3)
zelen.test(events, r = 4)
but it continues to return an error stating
"Error in z[1, 1, ] : incorrect number of dimensions" for both cases I tried above.
Any help would be greatly appreciated!
If we check the source code by typing zelen.test at the console, we see that when example = TRUE it constructs a 3D array:
...
if (example)
z <- array(c(2, 1, 2, 5, 1, 5, 4, 1), dim = c(2, 2, 2))
...
The dimensions required for the input z are also specified in the documentation (?zelen.test):
z - data as an array of k 2x2 matrices. Small data sets only!
So we need to construct a 3-dimensional array:
library(NSM3)
z1 <- array(c(4, 2, 3, 3, 8, 3, 4, 7), c(2, 2, 2))
zelen.test(z1, r = 4)
# Zelen's test:
# P = 1
Or with a 3rd dimension of length 3:
z1 <- array( c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1), c(2, 2, 3))
zelen.test(z1, r = 4)
# Zelen's test:
# P = 0.1238
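To use all seven tables at once, here is a hedged sketch. It assumes (as in the question's 14x2 matrix) that the first 14 values are the first column of counts, the last 14 the second, and that each consecutive pair of rows forms one 2x2 table; if the tables are laid out differently, the indexing must change accordingly:
library(NSM3)
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
          74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(data, ncol = 2)      # 14 x 2, as in the question
z_all <- array(NA, dim = c(2, 2, 7))  # k = 7 tables, each 2 x 2
for (k in 1:7) {
  z_all[, , k] <- events[(2 * k - 1):(2 * k), ]  # rows 2k-1 and 2k form table k
}
zelen.test(z_all, r = 4)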

as.h2o is creating 3 levels in my target variable instead of 2, making the model multinomial instead of binomial - how do I prevent this?

So I am using h2o.ai to create a binomial classification model. However, when I use as.h2o to convert my data sets, it takes my target variable's column header, which is "BUY", and adds it to the levels. So instead of just the 2 levels 1 and 2, there are three levels: BUY, 1, and 2. This makes the model multinomial, which is not what I want. How do I fix this?
When I run perfH2o, this is the output:
H2OMultinomialMetrics: gbm
Test Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.3260208
RMSE: (Extract with `h2o.rmse`) 0.5709823
Logloss: (Extract with `h2o.logloss`) 1.016186
Mean Per-Class Error: 0.2755556
R^2: (Extract with `h2o.r2`) -0.1913934
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
BUY NO YES Error Rate
BUY 1 0 0 0.0000 = 0 / 1 #see here it is taking the header and thinking it is a level
NO 0 16 9 0.3600 = 9 / 25
YES 0 7 8 0.4667 = 7 / 15
Totals 1 23 17 0.3902 = 16 / 41
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-3 Hit Ratios:
k hit_ratio
1 1 0.609756
2 2 0.975610
3 3 1.000000
Here is my code:
#Getting packages
#install.packages("dplyr")
library(dplyr)
library(tidyverse)
library(tidyr)
#install.packages("tidyquant") #Used to quickly load the "tidyverse" (dplyr, tidyr, ggplot, etc)
along with custom,
#business-report-friendly ggplot themes. Also great for time series analysis (not featured)
library(tidyquant)
#install.packages("unbalanced")
library(unbalanced) #contains various methods for working with unbalanced data. I will be using
#the ubSMOTE() function
#installing the latest stable release of H2O. H2O is a professional machine learning package
# The following two commands remove any previously installed H2O packages for R.
#if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
#if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Next, we download packages that H2O depends on.
#pkgs <- c("RCurl","jsonlite")
#for (pkg in pkgs) {
# if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
#}
# Now we download, install and initialize the H2O package for R.
#install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/rel-yule/2/R")
# Finally, let's load H2O and start up an H2O cluster
library(h2o)
h2o.init()
#Now getting the data
ngData <- read.csv(file.choose())
#Now I am going to create my Train, validation, and test set
splitPercentage1 <- .70
splitPercentage2 <- .5
numRows1 <- nrow(ngData)
sampleSize1 <- floor(splitPercentage1*numRows1)
set.seed(1)
idxTrain1 <- sample(1:numRows1, size = sampleSize1)
validationRaw <- ngData[-idxTrain1,]
trainRaw <- ngData[idxTrain1,]
#validation set created now time to make test set out of validation set
numRows2 <- nrow(validationRaw)
sampleSize2 <- floor(splitPercentage2*numRows2)
idxTrain2 <- sample(1:numRows2, size = sampleSize2)
testRaw <- validationRaw[-idxTrain2,]
validationRaw <- validationRaw[idxTrain2,]
#Now I have a randomly set train set, validation set, and test set
View(trainRaw)
View(testRaw)
View(validationRaw)
#all look good however we need our target variable "BUY" to be a factor not numeric
#also Buy = 1 Sell = 0 in the BUY column
trainRaw[,11] <- as.factor(trainRaw[,11])
testRaw[,11] <- as.factor(testRaw[,11])
validationRaw[,11] <- as.factor(validationRaw[,11])
View(trainRaw)
View(testRaw)
View(validationRaw)
#now to balance the data which i don't know if that is very necessary so I
#will check how balanced it is
Buytable <- table(trainRaw$BUY)
Buydistr <- prop.table(Buytable)
Buydistr
#very balanced with 52% sell and 47% buy so no need to balance
h2o.no_progress()
#converting into h2o data frames
trainH20 <- as.h2o(trainRaw)
validH20 <- as.h2o(validationRaw)
testH20 <- as.h2o(testRaw)
#now to find a classification model
y <- "BUY"
x <- setdiff(names(trainH20), y)
automl_models_h2o <- h2o.automl(
x = x,
y = y,
training_frame = trainH20,
validation_frame = validH20,
leaderboard_frame = testH20,
max_runtime_secs = 60
)
#time to extract the leading model
NGLeader <- automl_models_h2o@leader
#making predictions using h2o.predict()
predH2o <- h2o.predict(NGLeader, newdata = testH20)
as_tibble(predH2o)
#now to check the performance
perfH2o <- h2o.performance(NGLeader, newdata = testH20)
perfH2o
h2o.r2(perfH2o)
#very bad r^2
#turns out my model believes that BUY is one of the possible outcomes of Y, so it is
#multinomial; I must fix that
#######################################################################
Here is a glimpse() of my data:
Rows: 185
Columns: 11
$ ï..Month April, July, August, August, July, February, September, January, March, February, June,...
$ East.Region -12, 24, 26, 21, 19, -43, 25, -43, -15, -9, 27, -28, 26, -27, 22, 23, 32, -54, 21, 12, ...
$ Midwest.Region -20, 20, 36, 29, 16, -47, 35, -38, -7, -4, 35, -31, 45, -27, 22, 29, 27, -56, 30, 14, -...
$ Mountain.Region -4, 6, 4, 3, 2, -6, 3, -10, 2, 0, 9, -2, 5, -9, 5, 3, 6, -6, 4, 2, -4, 5, 5, 3, -1, -7,...
$ Pacific.Region 5, 5, 2, 0, -1, -10, 5, -13, 9, -1, 11, -3, 0, -14, 7, 0, 9, -11, 0, -3, -8, 5, 5, 6, 0...
$ South.Central.Region 12, 3, 2, -2, -2, -41, 37, -15, 35, 21, 18, 1, 20, -10, 5, -6, 32, -38, 12, -14, -6, 17...
$ Salt 8, -5, -2, -5, -6, -19, 14, 13, 19, 5, -1, -1, 3, 15, -5, -3, 12, -8, 1, -13, -3, 3, -2...
$ NonSalt 3, 7, 4, 4, 3, -22, 22, -28, 18, 16, 18, 3, 17, -25, 10, -4, 19, -29, 11, -2, -3, 15, 1...
$ Total.Lower.48 -19, 58, 69, 51, 34, -149, 105, -119, 23, 7, 98, -63, 96, -87, 61, 49, 106, -163, 67, 1...
$ Flow.Change -0.34, -0.06, 0.41, 3.64, -0.47, -0.10, 0.42, -0.51, -1.64, -1.08, -0.15, -0.27, 0.43, ...
$ BUY 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, ...
The first choice (better, but more lines of code have to change) is not to load the data into R at all: have H2O load it, and have H2O split it too. The H2O loader will recognize that the first row is a header row and treat it as column names, not as data.
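A minimal sketch of that first approach, assuming a hypothetical file name "ngData.csv" (h2o.importFile() parses the header itself, and h2o.splitFrame() replaces the manual sampling):
library(h2o)
h2o.init()
ngData_h2o <- h2o.importFile("ngData.csv")   # hypothetical path; H2O treats row 1 as column names
ngData_h2o$BUY <- as.factor(ngData_h2o$BUY)  # make the target a factor on the H2O side
splits <- h2o.splitFrame(ngData_h2o, ratios = c(0.7, 0.15), seed = 1)
trainH20 <- splits[[1]]
validH20 <- splits[[2]]
testH20 <- splits[[3]]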
The second approach is to strip out the header row in the R code.
However, that is already the default behaviour of read.csv() (its header argument defaults to TRUE). So your data must have "BUY" somewhere other than the first row. In that case, either fix the data file manually, or seek out and destroy the bad row after loading it into R.
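A hedged sketch of the seek-and-destroy option, assuming the stray row literally contains the string "BUY" in the target column (adjust to whatever the bad row actually looks like):
ngData <- ngData[ngData$BUY != "BUY", ]          # drop any row whose BUY field is the header string
ngData$BUY <- droplevels(as.factor(ngData$BUY))  # re-coerce so the spurious "BUY" level disappears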
(If you disagree, please post a sample data file that demonstrates the problem with the code you have given.)

Changing lattice coplot strip font

I'm trying to change the size of the strip text for biom and days. I saw other posts suggesting strip.custom() to change the font size, but it's not working and I can't understand how to fix it. I get this warning:
Warning messages:
In plot.xy(xy.coords(x, y), type = type, ...) :
"strip" is not a graphical parameter
growth <- c(15, 12, 7, 9, 4, 5, 9, 9)
days <- c(1, 2, 1, 2, 1, 2, 1, 2)
biom <- c(0.5, 1, 0.5, 1, 0.5, 1, 0.5, 1)
herb <- c(20, 22, 35, 30, 35, 34, 28, 26)
dfgrowth <- data.frame(growth, days, biom, herb)
library(lattice)
coplot(growth ~ herb | as.factor(biom) * as.factor(days), data = dfgrowth,
       columns = 4, overlap = 0, pch = 21, bar.bg = c(num = gray(1), fac = gray(0.95)),
       strip = strip.custom(par.strip.text = list(cex = 2)))
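One likely explanation, offered here as a hedged observation rather than a confirmed answer: coplot() is a base graphics function, not a lattice one, so the lattice helper strip.custom() is passed through to par(), producing exactly this warning. A minimal sketch of a lattice equivalent, where par.strip.text does control strip text size:
library(lattice)
# same conditioning structure as the coplot call above
xyplot(growth ~ herb | as.factor(biom) * as.factor(days), data = dfgrowth,
       par.strip.text = list(cex = 2))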

How do I get confidence intervals without inverting a singular Hessian matrix in R?

I'm a student working on an epidemiology model in R, using maximum likelihood methods. I created my negative log likelihood function. It's sort of gross looking, but here it is:
NLLdiff = function(v1, CV1, v2, CV2, st1 = (czI01 - czV01), st2 = (czI02 - czV02),
                   st01 = czI01, st02 = czI02, tt1 = czT01, tt2 = czT02) {
  prob1 = (1 + v1 * CV1 * tt1)^(-1/CV1)
  prob2 = (1 + v2 * CV2 * tt2)^(-1/CV2)
  -(sum(dbinom(st1, st01, prob1, log = TRUE)) + sum(dbinom(st2, st02, prob2, log = TRUE)))
}
The reason the first line looks so awful is that most of the input data is supplied there as default arguments. czI01, for example, is already declared. I did this simply so that my later calls to the function don't all have to carry awful vectors in them.
I then optimized for CV1, CV2, v1 and v2 using mle2 (from the bbmle library). That call is also a bit gross looking:
ml.cz.diff = mle2(NLLdiff, start = list(v1 = vguess, CV1 = cguess, v2 = vguess, CV2 = cguess),
                  method = "L-BFGS-B", lower = 0.0001)
Now, everything works fine up until here. ml.cz.diff gives me values that I can turn into a plot that reasonably fits my data. I also have several different models and can get AICc values to compare them. However, when I try to get confidence intervals around v1, CV1, v2 and CV2 I have problems: along with some warnings, I get a negative lower bound on CV1, which is impossible, as CV1 actually represents a squared quantity in the biological model.
Is there a better way to get confidence intervals? Or, really, a way to get confidence intervals that make sense here?
What I see happening is that, for some values in the optimization space, my Hessian matrix happens to be singular. But since I'm optimizing over 4 variables and don't have overly extensive programming knowledge, I can't come up with a good optimization method that doesn't rely on the Hessian. I have googled the problem - the suggestions were that my model is bad, but I'm reconstructing work done before, which suggests my model really isn't awful (the plots I make using ml.cz.diff look like the plots in the original work). I have also read the relevant parts of the manual as well as Bolker's book Ecological Models and Data in R, and I have tried different optimization methods, which resulted in longer run times but the same errors. The "SANN" method didn't finish running within an hour, so I didn't wait around to see the result.
In a nutshell: my confidence intervals are bad. Is there a relatively straightforward way to fix them in R?
My vectors are:
czT01 = c(5, 5, 5, 5, 5, 5, 5, 25, 25, 25, 25, 25, 25, 25, 50, 50, 50, 50, 50, 50, 50)
czT02 = c(5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 25, 25, 25, 25, 25, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75)
czI01 = c(25, 24, 22, 22, 26, 23, 25, 25, 25, 23, 25, 18, 21, 24, 22, 23, 25, 23, 25, 25, 25)
czI02 = c(13, 16, 5, 18, 16, 13, 17, 22, 13, 15, 15, 22, 12, 12, 13, 13, 11, 19, 21, 13, 21, 18, 16, 15, 11)
czV01 = c(1, 4, 5, 5, 2, 3, 4, 11, 8, 1, 11, 12, 10, 16, 5, 15, 18, 12, 23, 13, 22)
czV02 = c(0, 3, 1, 5, 1, 6, 3, 4, 7, 12, 2, 8, 8, 5, 3, 6, 4, 6, 11, 5, 11, 1, 13, 9, 7)
and I get my guesses by:
v = -log((c(czI01, czI02) - c(czV01, czV02))/c(czI01, czI02))/c(czT01, czT02)
vguess = mean(v)
cguess = var(v)/vguess^2
It's also possible that I'm doing something else completely wrong, but my results seem reasonable so I haven't caught it.
You could change the parameterization so that the constraints are always satisfied. Rewrite the likelihood as a function of ln(CV1) and ln(CV2); that way you can be sure that CV1 and CV2 remain strictly positive.
NLLdiff_2 = function(v1, lnCV1, v2, lnCV2, st1 = (czI01 - czV01), st2 = (czI02 - czV02),
                     st01 = czI01, st02 = czI02, tt1 = czT01, tt2 = czT02) {
  prob1 = (1 + v1 * exp(lnCV1) * tt1)^(-1/exp(lnCV1))
  prob2 = (1 + v2 * exp(lnCV2) * tt2)^(-1/exp(lnCV2))
  -(sum(dbinom(st1, st01, prob1, log = TRUE)) + sum(dbinom(st2, st02, prob2, log = TRUE)))
}
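A hedged sketch of how the refit and back-transformation might look (the starting values reuse vguess and cguess from the question; confint() on an mle2 fit computes profile-likelihood intervals, and exponentiating the lnCV bounds is valid because the transform is monotone). The bounds on v1 and v2 mirror the original lower = 0.0001; the -20 lower bounds on the log-scale parameters are effectively unconstrained:
library(bbmle)
ml.cz.diff2 = mle2(NLLdiff_2,
                   start = list(v1 = vguess, lnCV1 = log(cguess),
                                v2 = vguess, lnCV2 = log(cguess)),
                   method = "L-BFGS-B",
                   lower = c(v1 = 1e-4, lnCV1 = -20, v2 = 1e-4, lnCV2 = -20))
ci_log = confint(ml.cz.diff2, parm = c("lnCV1", "lnCV2"))  # profile CIs on the log scale
exp(ci_log)                                                # back-transform to CV1, CV2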
