Multtest package in R not properly adjusting p-values

I am trying to use the R package multtest to adjust a list of p-values for multiple testing. However, multtest only returns a vector of 1s equal in length to the list of p-values that were analyzed.
The input file is a text file in which the pvalues are separated by newline characters. A segment of the file is reproduced below:
0.182942602
0.333002877
0.282000206
0.161501682
0.161501682
I downloaded the multtest package (multtest_2.14.0) from Bioconductor, and am running it in R version x64 2.15.2. Does anyone know if there is a compatibility problem between multtest and R 2.15.2?
My code:
library(multtest, verbose = FALSE)
table1 <- read.table("p-values.txt", header = FALSE, colClasses = "double")
table2 <- as.vector(as.double(table1[,1]))
results <- p.adjust(table2, method = c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"))
write.table(results, file = "output.txt")

This is not an error; it is the correct adjustment when no p-values in that vector can be considered significant.
Your code performs the Holm correction (method accepts only a single value; with a vector supplied, the first item, "holm", is used). The Holm method will correctly return all ones in the case that
min(p) * length(p) > 1
In that situation (using this multiple hypothesis testing framework), there are no p-values in the vector that can be considered significant.
If you'd like to see the gory details, the code for the holm method (taken from the p.adjust source) is
i <- seq_len(lp)
o <- order(p)
ro <- order(o)
pmin(1, cummax((n - i + 1L) * p[o]))[ro]
where p is the input vector, and lp and n are both the length of that vector. The expression (n - i + 1L) * p[o] says "for each item in the sorted list, take n + 1 minus its index, then multiply it by the value". For the minimum item, that is (n + 1 - 1) * min(p) = n * min(p). cummax is the cumulative maximum, which means none of the subsequent items can be smaller than the first value. And pmin(1, ...) means that any item greater than 1 is set to 1 (since a p-value above 1 is meaningless).
This means that if n * min(p) is greater than one, then the adjusted p-value of the smallest item is 1, which means the adjusted p-value of every item must be 1.
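As a quick sanity check, here is a minimal sketch with made-up p-values (the exact values are illustrative; what matters is that n * min(p) > 1):
set.seed(42)
p <- runif(20, min = 0.1, max = 0.5)  # hypothetical raw p-values
min(p) * length(p)                    # at least 2 here, so the condition above holds
p.adjust(p, method = "holm")          # every adjusted p-value is 1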


What does lag(log(emp), 1:2) mean when using pgmm function?

I tried an example regarding the pgmm function in the plm package. The code is as follows:
library(plm)
data("EmplUK", package = "plm")
## Arellano and Bond (1991), table 4 col. b
z1 <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1)
+ log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:99),
data = EmplUK, effect = "twoways", model = "twosteps")
summary(z1, robust = FALSE)
I am not sure about the meaning of lag(log(emp), 1:2) or lag(log(emp), 2:99). Does lag(log(emp), 1:2) mean the first through second lags of log(emp), and lag(log(emp), 2:99) the second through 99th lags?
Also, I sometimes get an error when running the summary step, and sometimes not (with identical code):
Error in !class_ind : invalid argument type
Can anyone help me with these problems?
log, a base R function, gives you the (natural) logarithm, in this case of variable emp.
lag of package plm can be given a second argument, called k, like in your example. By looking at ?plm::lag.plm it becomes clear: k is
an integer, the number of lags for the lag and lead methods (can also
be negative). For the lag method, a positive (negative) k gives lagged
(leading) values. For the lead method, a positive (negative) k gives
leading (lagged) values, thus, lag(x, k = -1) yields the same as
lead(x, k = 1). If k is an integer with length > 1 (k = c(k1, k2,
...)), a matrix with multiple lagged pseries is returned
Thus, instead of typing lag twice to get the first and second lag:
lag(<your_variable>, 1)
lag(<your_variable>, 2)
one can simply type
lag(<your_variable>, k = 1:2), or without the named argument,
lag(<your_variable>, 1:2).
Setting k to 2:99 gives you the 2nd to 99th lags.
The numbers refer to time periods, not to the number of individuals (units); the lagging is applied separately within each individual.
You may want to run the example in ?plm::lag.plm to aid understanding of that function.
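For a concrete look, here is a minimal sketch built on the quoted documentation, using the same EmplUK data (the index columns firm and year are those of the packaged data set):
library(plm)
data("EmplUK", package = "plm")
E <- pdata.frame(EmplUK, index = c("firm", "year"))
head(lag(E$emp, 1:2))  # a two-column matrix: the first and second lags of emp, per firm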

Optimize within for loop cannot find function

I've got a function, KozakTaper, that returns the diameter of a tree trunk at a given height (DHT). There's no algebraic way to rearrange the original taper equation to return DHT at a given diameter (4 inches, for my purposes)...enter R! (using 3.4.3 on Windows 10)
My approach was to use a for loop to iterate likely values of DHT (25-100% of total tree height, HT), and then use optimize to choose the one that returns a diameter closest to 4". Too bad I get the error message Error in f(arg, ...) : could not find function "f".
Here's a shortened definition of KozakTaper along with my best attempt so far.
KozakTaper = function(Bark, SPP, DHT, DBH, HT, Planted) {
  if (Bark == 'ob' & SPP == 'AB') {
    a0_tap = 1.0693567631
    a1_tap = 0.9975021951
    a2_tap = -0.01282775
    b1_tap = 0.3921013594
    b2_tap = -1.054622304
    b3_tap = 0.7758393514
    b4_tap = 4.1034897617
    b5_tap = 0.1185960455
    b6_tap = -1.080697381
    b7_tap = 0
  } else if (Bark == 'ob' & SPP == 'RS') {
    a0_tap = 0.8758
    a1_tap = 0.992
    a2_tap = 0.0633
    b1_tap = 0.4128
    b2_tap = -0.6877
    b3_tap = 0.4413
    b4_tap = 1.1818
    b5_tap = 0.1131
    b6_tap = -0.4356
    b7_tap = 0.1042
  } else {
    a0_tap = 1.1263776728
    a1_tap = 0.9485083275
    a2_tap = 0.0371321602
    b1_tap = 0.7662525552
    b2_tap = -0.028147685
    b3_tap = 0.2334044323
    b4_tap = 4.8569609081
    b5_tap = 0.0753180483
    b6_tap = -0.205052535
    b7_tap = 0
  }
  p = 1.3/HT
  z = DHT/HT
  Xi = (1 - z^(1/3))/(1 - p^(1/3))
  Qi = 1 - z^(1/3)
  y = (a0_tap * (DBH^a1_tap) * (HT^a2_tap)) * Xi^(b1_tap * z^4 + b2_tap * (exp(-DBH/HT)) +
    b3_tap * Xi^0.1 + b4_tap * (1/DBH) + b5_tap * HT^Qi + b6_tap * Xi + b7_tap * Planted)
  return(round(y, 4))
}
HT <- .3048*85 # converting from English to metric (sorry, it's forestry)
for (i in c((HT*.25):(HT+1))) {
  d <- KozakTaper(Bark='ob', SPP='RS', DHT=i, DBH=2.54*19, HT=.3048*85, Planted=0)
  frame <- na.omit(d)
  optimize(f=abs(10.16-d), interval=frame, lower=1, upper=90,
           maximum = FALSE,
           tol = .Machine$double.eps^0.25)
}
Eventually I would like this code to iterate through a csv and return i for the best d, which will require some rearranging, but I figured I should make it work for one tree first.
When I print d I get multiple values, so it is iterating through i, but it gets held up at the optimize function.
Defining frame was my most recent tactic, because d returns one NaN at the end, but it may not be the best input for interval. I've tried interval=c((HT*.25):(HT+1)), defining KozakTaper within the for loop, and defining f prior to the optimize, but I get the same error. Suggestions for what part I should target (or other approaches) are appreciated!
Edit with a follow-up question:
I'm now trying to run this script for each row of a csv, "Input." The row contains the values for KozakTaper, and I've called them with this:
Input=read.csv...
Input$Opt=0
o <- optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
                                                     SPP='Input$Species',
                                                     DHT=x,
                                                     DBH=(2.54*Input$DBH),
                                                     HT=(.3048*Input$Ht),
                                                     Planted=0)),
              lower=Input$Ht*.25, upper=Input$Ht+1,
              maximum = FALSE, tol = .Machine$double.eps^0.25)
Input$Opt <- o$minimum
Input$Mht <- Input$Opt/.3048 # converting back to English
Input$Ht and Input$DBH are numeric; Input$Species is factor.
However, I get the error invalid function value in 'optimize'. I get it whether I define "o" or just run optimize. Oddly, when I don't call values from the row but instead use the code from the answer, it tells me object 'HT' not found. I have the awful feeling this is due to some obvious/careless error on my part, but I'm not finding posts about this error with optimize. If you notice what I've done wrong, your explanation will be appreciated!
I'm not an expert on optimize, but I see three issues: 1) your call to KozakTaper does not iterate through the range you specify in the loop; 2) KozakTaper returns a single number, not a vector; 3) you haven't given optimize a function but an expression.
So what is happening is that you are not giving optimize anything to iterate over.
All you should need is this:
optimize(f = function(x) abs(10.16 - KozakTaper(Bark='ob',
                                                SPP='RS',
                                                DHT=x,
                                                DBH=2.54*19,
                                                HT=.3048*85,
                                                Planted=0)),
         lower=HT*.25, upper=HT+1,
         maximum = FALSE, tol = .Machine$double.eps^0.25)
$minimum
[1] 22.67713 ##Hopefully this is the right answer
$objective
[1] 0
optimize will now substitute values of x between lower and upper, trying to minimize the difference.
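For the follow-up question, a minimal sketch of a row-wise version (an assumption on my part, not the answer's code: it presumes Input has columns Species, DBH, and Ht, passes each row's own values rather than the string 'Input$Species', and keeps the bounds in metres):
Input$Opt <- sapply(seq_len(nrow(Input)), function(r) {
  optimize(f = function(x) abs(10.16 - KozakTaper(Bark = 'ob',
                                                  SPP = as.character(Input$Species[r]),
                                                  DHT = x,
                                                  DBH = 2.54 * Input$DBH[r],
                                                  HT = .3048 * Input$Ht[r],
                                                  Planted = 0)),
           lower = .3048 * Input$Ht[r] * 0.25, upper = .3048 * Input$Ht[r] + 1,
           maximum = FALSE, tol = .Machine$double.eps^0.25)$minimum
})
Input$Mht <- Input$Opt / .3048 # back to feet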

Set constraints in a Matrix - OPTIM in R

I have a vector of numeric values as input (starting values for optim's par):
my.data.var <- c(10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25)
The optimization problem is a minimization problem. The error function calculates the sum of square roots of the differences between two matrices (the given-values matrix vs. the calculated matrix). The calculated matrix is the one built from the vector above. Hence, in the error function, I stack the vector into a matrix with
my.data.var.mat <- matrix(my.data.var, nrow = 4, ncol = 6, byrow = TRUE)
The constraint that I must introduce is that colSums(my.data.var.mat) <= 1.
The optim call is defined as
sols <- optim(my.data.var, Error.func, method = "L-BFGS-B",
              upper = c(Inf,1,1,1,1,1,Inf,1,1,1,1,1,Inf,1,1,1,1,1,Inf,1,1,1,1,1),
              lower = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
The error function is defined as
Error.func <- function(my.data.var){
  my.data.var.mat <- matrix(my.data.var, nrow = ncol(my.data.matrix.prod), ncol = ncol(my.data.matrix.inj)+1, byrow = TRUE)
  Calc.Qjk.Value <- Qjk.Cal.func(my.data.timet0, my.data.qo, my.data.matrix.time,
                                 my.data.matrix.inj, my.data.matrix.prod, my.data.var, my.data.var.mat)
  diff.values <- my.data.matrix.prod - Calc.Qjk.Value # difference between calculated and original matrices
  Error <- ((colSums((diff.values^2), na.rm = FALSE, dims = 1))/nrow(my.data.matrix.inj))^0.5 # square root of the column-wise mean squared difference
  Error_total <- sum(Error, na.rm = FALSE)/ncol(my.data.matrix.prod) # total average error
  Error_total
}
Given data sets: my.data.matrix.prod, my.data.timet0, my.data.qo, my.data.matrix.time, my.data.matrix.inj
So my question is: how and where should I introduce the column-sum constraint? Put another way, how can optim vary the parameter vector subject to a matrix column-sum constraint?
I realized that nloptr is a better option than optim, since my problem involves inequality constraints. I modified the implementation as I explain in this post: "multiple inequality constraints" - Minimization with R nloptr package.
Hence, closing this thread.
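For readers landing here, a minimal sketch of the nloptr route (the constraint function and option values below are illustrative assumptions, not the poster's actual code): nloptr expects inequality constraints in the form g(x) <= 0, so colSums(mat) <= 1 becomes colSums(mat) - 1 <= 0.
library(nloptr)
# One inequality per column: colSums(mat) - 1 <= 0
eval_g_ineq <- function(x) {
  mat <- matrix(x, nrow = 4, ncol = 6, byrow = TRUE)
  colSums(mat) - 1
}
sols <- nloptr(x0 = my.data.var,
               eval_f = Error.func,          # objective from the question
               lb = rep(0, 24),
               ub = rep(c(Inf, rep(1, 5)), 4),
               eval_g_ineq = eval_g_ineq,
               opts = list(algorithm = "NLOPT_LN_COBYLA",  # derivative-free, handles inequality constraints
                           xtol_rel = 1e-6, maxeval = 10000))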

Error in if (nval != nrow(par0$Rho)) stop(paste("Row dimension of \"Rho\" not equal to\n", : argument is of length zero

I am using the hmm.discnp library and combining it with the HMM library in R. I am trying to fit an HMM to my data, and it gives me the above error. Here is the code:
#Reading the csv file
corpus1<-read.csv("C:/Users/harspath/Downloads/Personal/RData.csv", header = TRUE)
good_data<- as.list(corpus1)
#Defining Libraries
library (HMM)
require(hmm.discnp)
#Defining states i.e. End State and symbols i.e. the observations
states=c("Buying", "Not Buying")
symbols=c("ID", "Device ID","DeviceOSVector","MobileBrandVector","BrowserVector","SearchValueVector","TimeOnPage","NoOfCicks","NoOfScrolls","PageLoadTime","TaskComplete")
hmm1 <- initHMM(states, symbols, startProbs = NULL, transProbs = NULL, emissionProbs = NULL)
tpm<-hmm1$transProbs
rho<-hmm1$emissionProbs
my_hmm = hmm(good_data,par0 = list(tpm,rho),stationary=FALSE)
# transition probability matrix
my_hmm$tpm
# output probabilities
my_hmm$Rho
# initial probabilities (don't know/know)
my_hmm$ispd
This should fix it:
my_hmm = hmm(good_data,par0 = list(tpm,Rho = rho),stationary=FALSE)
OR:
Rho<-hmm1$emissionProbs
my_hmm = hmm(good_data,par0 = list(tpm,Rho),stationary=FALSE)
The reasoning here is that the par0 argument expects a named list. The names should be tpm and Rho (upper case first letter). However, you have written rho (lower case). From ?hmm:
par0
An optional (named) list of starting values for the parameters of the model, with components tpm (transition probability matrix) and Rho. The matrix Rho specifies the probability that the observations take on each value in yval, given the state of the hidden Markov chain. The columns of Rho correspond to states, the rows to the values of yval.
In the error, you can see that hmm has the following code:
if (nval != nrow(par0$Rho)) *something*
So, it tries to look in par0$Rho. Since you don't have that one, you get the error argument is of length zero.
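To see the mechanics in isolation, here is a minimal sketch with toy matrices (nothing hmm-specific): $-extraction of a missing name returns NULL, and nrow(NULL) is NULL, so the if() condition ends up with length zero.
tpm <- matrix(0.5, 2, 2) # toy transition matrix
rho <- matrix(0.5, 2, 2) # toy emission matrix
par0 <- list(tpm, rho)   # unnamed: par0$Rho is NULL
nrow(par0$Rho)           # NULL, hence "argument is of length zero"
par0 <- list(tpm = tpm, Rho = rho)
nrow(par0$Rho)           # 2, as hmm() expects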

cost function in cv.glm of boot library in R

I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
with the first two denoting the data and model, and K specifying the number of folds.
My problem is the cost parameter, which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function that returns the rate of misclassification, something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge has already cost me hours here, and I was wondering if someone could point me in the right direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost() given above is, in my opinion, unnecessarily obfuscated; the following alternative does exactly the same thing in a more expressive way:
cost <- function(r, pi = 0) {
mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
I will try to explain the cost function in simple words. Let's take
cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consists of many observations; think of it as a series of numbers.
glmfit
It is the generalized linear model, which runs on the above series. But there is a catch: the data is split into K parts, and glmfit is evaluated on each part in turn (the test set), using the rest as the training set. The output of glmfit is a series with the same number of elements as the split input passed.
cost
The cost function. It takes two arguments: first, the held-out input series (the test set), and second, the output of glmfit on that test input. The default is the mean squared error function: it sums the squares of the differences between observed and predicted data points. Inside the function, a loop runs over the test set (output and input should have the same number of elements), calculates each difference, squares it, and adds it to an output variable.
K
The number of parts into which the input is split. The default gives leave-one-out cross-validation.
Judging from your cost-function description: your input (x) would be a set of numbers between 0 and 1 (0-0.5 = no, 0.5-1 = yes) and your output (y) is 'yes' or 'no'. So the error (e) between observation and prediction would be:
cost <- function(x, y){
  e <- 0
  for (i in 1:length(x)){
    if (x[i] > 0.5) {
      err <- if (y[i] == 'yes') 0 else x[i] - 0.5
    } else {
      err <- if (y[i] == 'no') 0 else 0.5 - x[i]
    }
    e <- e + err*err # accumulate the squared error
  }
  e <- e/length(x) # mean squared error
  return(e)
}
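As a quick check of this function, using vectors in the same style as the earlier answer (x = predicted probabilities, y = observed labels; the numbers are illustrative):
x <- seq(0.1, 0.9, by = 0.1)
y <- c('no','no','no','yes','no','no','no','yes','yes')
cost(x, y) # penalties 0.1^2 + 0.1^2 + 0.2^2 averaged over 9 points: about 0.0067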
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you wanted to do so, you would write a function that returns the cost you want to minimize, using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. So for the cost function that (I think) you described in your post, you are looking for a function that returns the average rate of accurate classifications, which would look something like this:
cost <- function(labels, pred){
  mean(labels == ifelse(pred > 0.5, 1, 0))
}
With that function defined, you can then pass it into your cv.glm() call. That said, I wouldn't recommend using your own cost function over the default one unless you have a reason to. Your example isn't reproducible, so here is another example:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type="response" argument in the predict function. The documentation in library(boot) should state this explicitly; otherwise one is forced to assume that the default type="link" is used inside cv.glm, in which case the cost function would not work as intended.
