Library "TableOne" multiple comparisons. Calculate line by line p-values - r

I received a comment from a reviewer who wants p-values for each line (i.e., each level of certain variables) in a demographic characteristics table (Table 1). Even though the request seems quite strange (and imprecise) to me, I would like to comply with the suggestion.
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc,
                           factorVars = c("status", "edema", "stage"))
print(tableOne, nonnormal = c("bili", "chol", "copper", "alk.phos", "trig"),
      exact = c("status", "stage"), smd = TRUE)
The printed output (omitted here) shows a single pooled p-value per variable.
I need the p-values for each level of the variables status, edema, and stage, with Bonferroni correction. I went through the documentation without success.
In addition, is it correct to use chi-squared to compare sample sizes across rows?
UPDATE:
I'm not sure whether my approach is correct, but I would like to share it. For the variable status I generated a dummy variable for each level, then calculated a chi-squared test for each.
library(tableone)
## Load data
library(survival); data(pbc)
d <- pbc[,c("status", "trt")]
# Create one dummy variable per level of status
d$status.0 <- ifelse(d$status==0, 1,0)
d$status.1 <- ifelse(d$status==1, 1,0)
d$status.2 <- ifelse(d$status==2, 1,0)
t <- rbind(
  chisq.test(d$status.0, d$trt),  # p-value = 0.7202
  chisq.test(d$status.1, d$trt),  # p-value = 1
  chisq.test(d$status.2, d$trt)   # p-value = 0.7818
)
t
Bonferroni adjustment for multiple comparisons:
p <- t[,"p.value"]
p.adjust(p, method = "bonferroni")
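The same idea in a more compact form (a sketch) loops over the observed levels of status and adjusts in one step:
# one chi-squared test per level of status, then Bonferroni
levs <- sort(unique(d$status))
p <- sapply(levs, function(l) chisq.test(d$status == l, d$trt)$p.value)
names(p) <- paste0("status.", levs)
p.adjust(p, method = "bonferroni")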

This question was posted some time ago, so I suppose you have already answered the reviewer.
I don't really understand the point of computing adjusted p-values for just three variables. Adjusting p-values depends on the number of comparisons made: if you pass p.adjust() a vector of only 3 p-values, the results will not really be adjusted for the number of comparisons you made (you actually made more than a dozen and a half!).
I show below how to extract all the p-values so you can compute the adjusted ones.
To extract p-values from a tableone object there is a proper way via object attributes (explained first), and two quick-and-dirty ways (at the bottom).
To extract them, I first copy your code to create your tableOne:
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc,
                           factorVars = c("status", "edema", "stage"))
You can see what your tableOne object contains via attributes():
attributes(tableOne)
A tableOne object usually holds one table for continuous variables and one for categorical variables. You can use attributes() on them too:
attributes(tableOne$CatTable)
# you can notice $pValues
Now that you know where the p-values live, you can extract them with attr():
attr(tableOne$CatTable, "pValues")
Something similar works for the numerical variables:
attributes(tableOne$ContTable)
# $pValues are there
attr(tableOne$ContTable, "pValues")
You have p-values for normal and non-normal variables. Since you specified which are which earlier, you can extract both:
mypCont <- attr(tableOne$ContTable, "pValues") # put them in an object
nonnormal <- c("bili", "chol", "copper", "alk.phos", "trig") # copied from your code
mypCont[rownames(mypCont) %in% nonnormal, "pNonNormal"] # extract non-normal p-values
"%!in%" <- Negate("%in%")
mypCont[rownames(mypCont) %!in% nonnormal, "pNormal"] # extract normal p-values
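With both sets in hand you can adjust everything at once. A sketch, assuming the categorical p-value matrix has columns pApprox and pExact as in current tableone versions (check colnames(attr(tableOne$CatTable, "pValues")); pExact may be NA where no exact test was run):
mypCat <- attr(tableOne$CatTable, "pValues")
exact <- c("status", "stage") # copied from your print() call
pcat <- ifelse(rownames(mypCat) %in% exact, mypCat[, "pExact"], mypCat[, "pApprox"])
pcont <- ifelse(rownames(mypCont) %in% nonnormal, mypCont[, "pNonNormal"], mypCont[, "pNormal"])
p <- setNames(c(pcat, pcont), c(rownames(mypCat), rownames(mypCont)))
p.adjust(p, method = "bonferroni")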
All that said, with your p-values extracted, I think there are two much more convenient quick-and-dirty ways to accomplish the same:
Quick and dirty way A: use dput() on your printed tableOne, then search the console output for the p-values and copy-paste them into your script to store them in an object.
Quick and dirty way B: the tableone vignette has an "Exporting" section; you can use print(tableOne, quote = TRUE) and then copy-paste the output into a spreadsheet (LibreOffice, Excel, ...).
Then I would select the p-value column, transpose it, bring it back into R to compute adjusted p-values with p.adjust(), and copy them back into the spreadsheet for journal submission.

Related

Hmisc: How do I include the results of reformM as an argument in aregImpute?

I am using aregImpute from the Hmisc package to impute missing values in a dataset. Since the results are not invariant to the order of variables in the formula, I have used the reformM function to generate a number of random formula permutations:
reformM(~ x1 + x2 + x3..., data = d, nperm = Z)
The function returns a list.
I have reviewed the help file, but am unclear on how the results can be passed to aregImpute.
Can the object be passed such that aregImpute will iterate over the different orderings to assess any variability? That is, can more than one permutation be passed to aregImpute? If not, do I simply create a new formula using the new order of variables?
You can certainly use purrr::map to pass each formula from reformM to aregImpute. Below is a minimal example. That said, I do think it is a better idea to order the variables so that those with higher missing-data rates are imputed first.
library(Hmisc)      # for reformM and aregImpute
library(missForest) # for prodNA
library(tidyverse)
iris_miss <- prodNA(iris, 0.10) ## make every variable 10% missing
## factorial(ncol(iris_miss)) ## number of possible permutations
## get 10 random permutations of the formula
permlist <- reformM(as.formula(paste("~", names(iris_miss) %>% str_flatten(" + "))),
                    data = iris_miss, nperm = 10)
map(permlist, aregImpute, nk = 3, n.impute = 5, data = iris_miss)
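To address the variability question directly, you could compare the fits across permutations. A sketch, assuming the fit object stores its imputed values in an $imputed list, which is worth confirming with str(fits[[1]]):
fits <- map(permlist, aregImpute, nk = 3, n.impute = 5, data = iris_miss)
# compare, e.g., the mean imputed Sepal.Length across the 10 permutations
# ($imputed component assumed; check the aregImpute value documentation)
map_dbl(fits, ~ mean(.x$imputed$Sepal.Length))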

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure, using the data frame linked in the original post (datafile).
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work, a snippet of the data frame was pasted in the original post (omitted here).
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look at the effects of the environmental variables on a different, composite variable:
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above; this didn't work either.
Regardless, this leads to my second problem.
2) When I try to plot the rda I get a very strange plot (not shown here): there are five dots in three places, and I have no idea where they come from.
I will have to graph the genetic rda too, and I figure I'll run into the same issue, so I thought I'd ask now.
I've been through several tutorials and tried many iterations of each issue. What I have provided here is, I think, the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single-column data frame to a vector).
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
The problem is that rda expects a separate object for the left-hand side of the formula (ALL in your code) and does not look it up in the data = argument.
As mentioned above, you can create a new data frame holding the variable needed for the analysis, but here is a one-line solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin Simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and the rowid function to map the categorical variables to integers or, preferably, not use them at all. I also wanted to set the ID vector as site names, but I leave that out here.
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#Remove NAs first; I like looking at my data frames before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers (this doesn't work for ID -- why?)
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data=df, scale=T)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and depth show the highest variation, which is almost logical since only three predictors were used. The integers assigned to the categorical vectors probably have no meaning at all: the function simply assigns integers, top to bottom, to each successive unique character string. I am also not really sure which question you want to answer; based on that, you can organize the data frame.

Can I get unwtd.count included when running svymean from the R survey package?

I've written an R script to loop through a bunch of variables in a survey and output weighted values, CVs, CIs, etc.
I would like it to also output the unweighted observation count.
I know it's a bit of a lazy question because I could calculate the unweighted counts on my own and join them back in; I'm just trying to replicate a Stata script that returns 'obs':
svy:tab jdvariable, per cv ci obs column format(%14.4g)
This is my calculated values table:
myresult_year_calc <- svyby(make.formula(newmetricname), # variable to pass to the function
                            by = ~year,                  # grouping
                            design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
                            vartype = c("ci", "cvpct"),  # report variation as CI and CV percentage
                            na.rm.all = TRUE,
                            FUN = svymean)               # function from the survey package
By passing unwtd.count as FUN instead, I get the counts I want.
myresult_year_obs <- svyby(make.formula(newmetricname),
                           by = ~year,
                           design = subset(csurvey, geoname %in% jv_geo),
                           vartype = c("ci", "cvpct"),
                           na.rm.all = TRUE,
                           FUN = unwtd.count)
Honestly, in writing this question I made it 98% of the way to a solution, but I'll ask anyway in case someone knows a more efficient way.
myresult_year_calc and myresult_year_obs both return what I expect, and if I use merge(myresult_year_calc, myresult_year_obs, by = "year") I get the table I want. This gives me just one count per year in this example, though, instead of one count for 'Yes' responses and one for 'No'.
Is there any way to get both means and unweighted counts with a single command?
I figured this out by creating a second design object with weights = ~0. When I ran svyby with the svytotal function on the unweighted design, it followed the formula.
dsgn2 <- svydesign(ids = ~0,
                   weights = ~0,
                   data = data)
unweighted_n <- svyby(~interaction(group1, group2), ~as.factor(mean_rating),
                      design = dsgn2, FUN = svytotal, na.rm = TRUE)
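If you prefer the two-call route described in the question, the two svyby results can simply be joined on the grouping variable; a minimal sketch:
# join weighted estimates and unweighted counts on the grouping column
myresult_year <- merge(myresult_year_calc, myresult_year_obs, by = "year")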

Issue with h2o package in R: subsetted data frames leading to near-perfect prediction accuracy

I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information from the parent, but I also feel it causes issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See the sample code below; I included comments to clarify what I'm doing, but it's fairly short:
library(h2o) # needed for h2o.init, h2o.deeplearning, h2o.predict
dataset = read.csv("dataset.csv")[,-1] # Read dataset but omit the first column (an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~., data=dataset) # Automatically create dummy variables
y = as.factor(y) # Ensure y has factor data type
dataset = data.frame(y, X) # Create final data.frame dataset
train = sample(length(y), length(y)/1.66) # Create training indices (integer positions, not booleans)
test = (-train) # Create testing indices (negative indexing)
h2o.init(nthreads = 2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with a subset of dataset.
mlModel = h2o.deeplearning(y = 'y', training_frame = as.h2o(dataset[train, , drop = TRUE]),
                           activation = "Rectifier", hidden = c(6, 6), epochs = 10,
                           train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel, newdata = as.h2o(dataset[test, -1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to a data frame; as.vector() caused issues for me
predictions = predictions[, 1] # Extract the predicted labels
mean(predictions != y[test]) # test-set misclassification rate
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use H2O's own functions to load and split the data:
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with the training split.
mlModel = h2o.deeplearning(x = x, y = y, training_frame = train, activation = "Rectifier",
                           hidden = c(6, 6), epochs = 10, train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
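To see the held-out error explicitly, a sketch using h2o's accessors (assuming a classification model, as in the question):
perf <- h2o.performance(mlModel, test) # metrics computed on the held-out frame
h2o.confusionMatrix(perf)              # per-class and overall error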
It is hard to say what the problem with your original code is without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split properly, and the model is actually being trained on the test data.

Perform operation on each imputed dataset in R's MICE

How can I perform an operation (like subsetting or adding a calculated column) on each imputed dataset in an object of class mids from R's package mice? I would like the result to still be a mids object.
Edit: Example
library(mice)
data(nhanes)
# create imputed datasets
imput = mice(nhanes)
The imputed values are stored as a list of data frames, one per variable:
imput$imp
Each of these has rows only for the observations that were imputed for the given variable.
The original (incomplete) dataset is stored here:
imput$data
For example, how would I create a new variable calculated as chl/2 in each of the imputed datasets, yielding a new mids object?
This can be done easily as follows.
Use complete() to convert a mids object to a long-format data.frame:
long1 <- complete(midsobj1, action='long', include=TRUE)
Perform whatever manipulations are needed:
long1$new.var <- long1$chl/2
long2 <- subset(long1, age >= 5)
Use as.mids() to convert the manipulated data back to a mids object:
midsobj2 <- as.mids(long2)
Now you can use midsobj2 as required. Note that include=TRUE (which carries along the original data with missing values) is needed for as.mids() to compress the long-format data properly. Also note that prior to mice v2.25 there was a bug in the as.mids() function (see https://stats.stackexchange.com/a/158327/69413).
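Applied to the imput object created above, the round trip looks like this (a sketch):
long <- complete(imput, action = "long", include = TRUE)
long$chl.half <- long$chl / 2 # the new calculated column
imput2 <- as.mids(long)       # back to a mids object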
EDIT: According to this answer https://stackoverflow.com/a/34859264/4269699 (from what is essentially a duplicate question), you can also edit the mids object directly by accessing $data and $imp. For example:
midsobj2<-midsobj1
midsobj2$data$new.var <- midsobj2$data$chl/2
midsobj2$imp$new.var <- midsobj2$imp$chl/2
You will run into trouble, though, if you want to subset $imp or use $call, so I wouldn't recommend this solution in general.
Another option is to calculate the variables before the imputation and place restrictions on them.
library(mice)
# Create the additional variable - this will have missing
nhanes$extra <- nhanes$chl / 2
# Change the method of imputation for extra, so that it always equals chl/2
# Change the predictor matrix so only chl predicts extra
ini <- mice(nhanes, max = 0, print = FALSE)
meth <- ini$meth
meth["extra"] <- "~I(chl / 2)"
pred <- ini$pred
pred["extra", ] <- 0      # so that only chl predicts extra
pred["extra", "chl"] <- 1
# Imputations
imput <- mice(nhanes, seed = 1, pred = pred, meth = meth, print = FALSE)
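You can verify that the restriction held in the completed data; a quick sketch:
comp <- complete(imput, 1)           # first completed dataset
all.equal(comp$extra, comp$chl / 2)  # should be TRUE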
There are examples in the article "mice: Multivariate Imputation by Chained Equations in R".
There is a method of with() for mids objects that can help you here:
with(imput, chl/2)
The documentation is at ?with.mids.
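The same mechanism extends to model fits, which can then be pooled across imputations; a minimal sketch:
fit <- with(imput, lm(bmi ~ I(chl / 2))) # fit the model on each imputed dataset
summary(pool(fit))                       # pooled estimates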
There's a function for this in the basecamb package:
library(basecamb)
apply_function_to_imputed_data(mids_object, fun) # placeholders: your mids object and the function to apply
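For the chl/2 example the call might look like this; a sketch only, since the shape of the supplied function (taking and returning a completed data.frame) is assumed here, so check ?apply_function_to_imputed_data:
# hypothetical usage; the function's expected signature is an assumption
imput3 <- apply_function_to_imputed_data(imput, function(df) {
  df$chl.half <- df$chl / 2
  df
})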
