R: Difficulty analyzing GSE7864 from NCBI GEO with limma

I am trying to analyze GSE7864 and would like to know how miR34a, miR34b, and miR34c influence gene expression, i.e., what are the differentially expressed genes (DEGs) caused by miR34a, miR34b, and miR34c, respectively?
The following is my code, but I am not sure how to construct a design matrix from the targets frame tTarget (in the sense of the limma tutorial). I select a subset of arrays by Cy3 channel into a targets frame called sTarget; I know sTarget corresponds to a two-color design with a common reference (p. 37 of the limma tutorial), but with sTarget alone limma cannot fit a linear model, since there are not enough replicates per treatment. In this case, how can I get the DEGs perturbed by miR34a, miR34b, and miR34c, respectively? Or is there a way to obtain the DEGs using all arrays instead of just the 3 in sTarget? If so, how should the design matrix and contrast matrix be constructed? I cannot find a similar example in the limma tutorial.
If a 2-fold change is used as the cut-off for DEGs in GSM190752 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM190752), then FC = 10^VALUE (since VALUE is the log10 ratio)? And the genes with FC > 2 or FC < 0.5 (i.e., abs(VALUE) > log10(2)) are the DEGs perturbed by miR34a?
Any help is appreciated!
Kevin
The code I used is listed:
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7864
library(GEOquery)
eset <- getGEO(filename = "GSE7864_series_matrix.txt.gz")
# Targets frame: Cy3 = cell line (common reference), Cy5 = treatment
tCy3 <- rep(c("A549H1", "HCT116Dicer", "TOV21GH1", "DLDDicer", "HeLa", "A549p53", "TOV21Gp53"), each = 4)
tCy5 <- rep(c("Luc", "miR34a", "miR34b", "miR34c"), times = 7)
pd <- pData(eset)
tTarget <- data.frame(gsm = rownames(pd), Cy3 = tCy3, Cy5 = tCy5)
# Subset: A549H1 arrays treated with the miR-34 family
sCy3 <- c("A549H1")
sCy5 <- c("miR34a", "miR34b", "miR34c")
isSelected <- (tTarget$Cy3 %in% sCy3) & (tTarget$Cy5 %in% sCy5)
sTarget <- tTarget[isSelected, ]

This was answered on the Bioconductor support forum:
https://support.bioconductor.org/p/91258/#91332
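For completeness, here is a minimal sketch of one way to use all arrays, assuming the series-matrix values are log-ratios of each Cy5 treatment against its Cy3 cell-line reference; the treatment/cell-line factors and the blocking choice are illustrative, not taken from the linked answer:
library(limma)
# Treatment factor (Luc is the common control) plus cell line as a blocking factor
treatment <- factor(tCy5, levels = c("Luc", "miR34a", "miR34b", "miR34c"))
cellline <- factor(tCy3)
design <- model.matrix(~ 0 + treatment + cellline)
colnames(design) <- sub("^treatment", "", colnames(design))
fit <- lmFit(eset, design)
# One contrast per miR-34 family member versus the Luc control
cm <- makeContrasts(miR34a - Luc, miR34b - Luc, miR34c - Luc, levels = design)
fit2 <- eBayes(contrasts.fit(fit, cm))
topTable(fit2, coef = "miR34a - Luc") # DEGs perturbed by miR34a
# On the fold-change question: VALUE is a log10 ratio, so FC = 10^VALUE,
# and a 2-fold cut-off means abs(VALUE) > log10(2), not abs(FC) > 2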

Related

Fastshap summary plot - Error: can't combine <double> and <factor<919a3>>

I'm trying to get a summary plot using the fastshap explain() function, as in the code below.
p_function_G <- function(object, newdata) {
  # select the G ("AntiSocial") class probability
  caret::predict.train(object, newdata = newdata, type = "prob")[, "AntiSocial"]
}
# Calculate the Shapley values
#
# boostFit: a caret model using the catboost algorithm
# trainset: the dataset used for building the caret model.
# The dataset contains 4 categories W, G, R, GM
# corresponding to 4 different animal behaviors
library(caret)
shap_values_G <- fastshap::explain(
  xgb_fit,
  X = game_train,
  pred_wrapper = p_function_G,
  nsim = 50,
  newdata = game_train[which(game_test == "AntiSocial"), ]
)
However, I'm getting this error:
Error in `stop_vctrs()`:
Can't combine `latitude` <double> and `gender` <factor<919a3>>
What's the way out?
I see that you are adapting code from Julia Silge's "Predict ratings for board games" tutorial. The original code used SHAPforxgboost for generating SHAP values, but you're using the fastshap package.
Because Shapley explanations are only recently starting to gain traction, there aren't many standard data formats yet. fastshap does not like tidyverse tibbles; it only takes matrices or matrix-likes.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. This fails because a matrix can only hold a single type (e.g. all double or all factor, not both).
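To see why, here is a toy illustration (the data and column names are invented for this example, not taken from the question):
# A double column and a factor column, like latitude and gender above
df <- data.frame(latitude = c(51.5, 48.9), gender = factor(c("M", "F")))
as.matrix(df) # base R silently coerces everything to character
# The vctrs machinery that fastshap ends up triggering refuses to mix
# <double> and <factor> rather than silently coercing, hence the error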
I also ran into a similar issue and found that you can solve it by passing a plain data.frame instead of a tibble. I don't have access to your full code, but could you try replacing the shap_values_G block as follows:
shap_values_G <- fastshap::explain(
  xgb_fit,
  X = game_train,
  pred_wrapper = p_function_G,
  nsim = 50,
  newdata = as.data.frame(game_train[which(game_test == "AntiSocial"), ])
)
Wrap newdata with as.data.frame. This converts the tibble to a data.frame and so shouldn't upset fastshap.

Error: `data` and `reference` should be factors with the same levels (random forest)

This is code I am writing for an assignment. I cannot seem to get a confusion matrix for the predictions; please help me troubleshoot the code or make any necessary recommendations.
library(caret)
library(randomForest)
set.seed(1234)
test_index1 <- createDataPartition(water_potability3$Potability, p = 0.1, list = FALSE)
water_potability_train <- water_potability3[test_index1, -c(4, 6:9)]
water_potability3_test <- water_potability3[!1:nrow(water_potability3) %in% test_index1, -c(4, 6:9)]
trf <- tuneRF(x = water_potability_train[, 1:4], y = water_potability_train$Potability)
(mintree <- trf[which.min(trf[, 2]), 1])
rf_model <- randomForest(x = water_potability_train[, -5], y = water_potability_train$Potability,
                         mtry = mintree, importance = TRUE)
plot(rf_model, main = "")
varImpPlot(rf_model, main = "")
preds_rf <- predict(rf_model, water_potability3_test[, -5])
table(preds_rf, water_potability3_test$Potability)
confusionMatrix(preds_rf, water_potability3_test$Potability)
Every time I create the confusion matrix I get the error "Error: `data` and `reference` should be factors with the same levels".
As you don't share a dataset that allows me to reproduce the error, I'm gonna have a guess and provide the solution I would have used myself. If this doesn't work for you, please provide some data and perhaps explain what the Potability column contains :-)
When randomly splitting the data into training and test partitions, you risk not having observations from every class in both partitions. E.g. if you have 10 classes, you might only have 8 of them in the smaller test partition. And then when your model predicts one of the two other classes that were available in the training partition, the two factors have different levels.
So I use partition() from groupdata2 with the cat_col argument that ensures each class is represented in both partitions (if possible). Then I use confusion_matrix() from cvms as it allows different levels in the two factors.
library(groupdata2)
library(cvms)
set.seed(1234)
# Create list with two partitions
# where the ratio of classes in Potability is similar
parts <- partition(water_potability3[, -c(4, 6:9)],
                   p = 0.1, cat_col = "Potability")
# Extract the two partitions
water_potability3_test <- parts[[1]]
water_potability_train <- parts[[2]]
# The modeling (unchanged from the question, apart from the partition step above)
trf <- tuneRF(x = water_potability_train[, 1:4],
              y = water_potability_train$Potability)
(mintree <- trf[which.min(trf[, 2]), 1])
rf_model <- randomForest(
  x = water_potability_train[, -5],
  y = water_potability_train$Potability,
  mtry = mintree,
  importance = TRUE
)
preds_rf <- predict(rf_model, water_potability3_test[, -5])
# Create confusion matrix
conf_mat <- cvms::confusion_matrix(
targets = water_potability3_test$Potability,
predictions = preds_rf
)
# The basic confusion matrix table
conf_mat$Table
# Or as a plot
plot_confusion_matrix(conf_mat)
You may also check out cvms::evaluate(), which has additional evaluation metrics.
Read more
More on the groupdata2 train/test partitioning functionality here:
https://cran.rstudio.com//web/packages/groupdata2/vignettes/cross-validation_with_groupdata2.html
More on the cvms confusion matrix functionality here:
https://cran.r-project.org/web/packages/cvms/vignettes/Creating_a_confusion_matrix.html

How to apply weights associated with the NIS (National inpatient sample) in R

I am trying to apply the weights provided with the NIS data using the R package "survey", but I have been unsuccessful. I am fairly new to R and to survey commands.
This is what I have tried:
# Read the unweighted dataset (read.dta13() is from the readstata13 package)
library(readstata13)
d <- read.dta13(path)
sum(d$DISCWT) # This produces the correct weighted number of cases I need.
d$count <- 1
library(survey)
# Create survey design object (the formulas refer to columns of `data`)
dsvy <- svydesign(id = ~HOSP_NIS, strata = ~NIS_STRATUM,
                  weights = ~DISCWT, nest = TRUE, data = d)
svytotal(~count, dsvy)
However I get the following error after running the survey total:
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1], :
Stratum (1131) has only one PSU at stage 1
Any help would be greatly appreciated, thank you!
The error indicates that you have specified a design where one of the strata has just a single primary sampling unit. It's not possible to get an unbiased estimate of variance for a design like that: the contribution of stratum 1131 will end up as 0/0.
As you see, R's default response is to give an error, because a reasonably likely explanation is that the data or the svydesign statement is wrong. Sometimes, as here, that's not what you want, and the global option 'survey.lonely.psu' describes other ways to respond. You want to set
options(survey.lonely.psu = "adjust")
This and other options are documented at help(surveyoptions).
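Applied to the design from the question, a minimal sketch would be:
options(survey.lonely.psu = "adjust")
# Re-create the design and the total; strata with a single PSU now
# contribute a conservative term centered at the grand mean
dsvy <- svydesign(id = ~HOSP_NIS, strata = ~NIS_STRATUM,
                  weights = ~DISCWT, nest = TRUE, data = d)
svytotal(~count, dsvy)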

Library "TableOne" multiple comparisons. Calculate line by line p-values

I received a comment from a reviewer who wanted all the p-values for each level of specific variables in a demographic characteristics table (Table 1). Even though the request appears quite strange (and inexact) to me, I would like to comply with the suggestion.
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
print(tableOne, nonnormal = c("bili","chol","copper","alk.phos","trig"), exact = c("status","stage"), smd = TRUE)
The output is the printed, stratified Table 1 (screenshot omitted).
I need to have the p-values for each level of the variables status, edema and stage, with Bonferroni correction. I went through the documentation without success.
In addition, is it correct to use chi-squared to compare sample sizes across rows?
UPDATE:
I'm not sure if my approach is correct, but I would like to share it with you. For the variable status I generated a dummy variable for each level, then calculated a chi-squared test for each against the stratifying variable.
library(tableone)
## Load data
library(survival); data(pbc)
d <- pbc[,c("status", "trt")]
# Convert dummy variables
d$status.0 <- ifelse(d$status==0, 1,0)
d$status.1 <- ifelse(d$status==1, 1,0)
d$status.2 <- ifelse(d$status==2, 1,0)
t <- rbind(
chisq.test(d$status.0, d$trt),
# p-value = 0.7202
chisq.test(d$status.1, d$trt),
# p-value = 1
chisq.test(d$status.2, d$trt)
#p-value = 0.7818
)
t
BONFERRONI ADJ FOR MULTIPLE COMPARISONS:
p <- t[,"p.value"]
p.adjust(p, method = "bonferroni")
This question was posted some time ago, so I suppose you have already answered the reviewer.
I don't really understand why you would compute adjusted p-values for just three variables. Adjusting p-values depends on the number of comparisons made: if you pass p.adjust() a vector of only 3 p-values, the results will not really be adjusted for the number of comparisons you actually made (more than a dozen and a half!).
I show how to extract all p-values so you can compute the adjusted ones.
To extract p-values from a tableone object, you can work with the object's attributes (explained first), or use one of two quick-and-dirty ways (at the bottom).
To extract them, first I copy your code to create your tableOne:
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
You can see what your "tableOne" object has via attributes()
attributes(tableOne)
You can see that a tableOne object usually has a table for continuous and one for categorical variables. You can use attributes() on them too
attributes(tableOne$CatTable)
# you can notice $pValues
Now you know "where" the pValues are, you can extract them with attr()
attr(tableOne$CatTable, "pValues")
Something similar with numerical variables:
attributes(tableOne$ContTable)
# $pValues are there
attr(tableOne$ContTable, "pValues")
You have p-values for both normally and non-normally distributed variables.
As you specified them before, you can extract both:
mypCont <- attr(tableOne$ContTable, "pValues") # put them in an object
nonnormal = c("bili","chol","copper","alk.phos","trig") # copied from your code
mypCont[rownames(mypCont) %in% nonnormal, "pNonNormal"] # extract NonNormal
"%!in%" <- Negate("%in%")
mypCont[rownames(mypCont) %!in% nonnormal, "pNormal"] # extract Normal
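Putting it together, here is a sketch that pools every per-variable p-value and Bonferroni-adjusts them in one call (I am assuming the pApprox/pExact column names that tableone stores for the categorical table; double-check them in your version via attributes()):
pCat <- attr(tableOne$CatTable, "pValues")
exact <- c("status", "stage") # from the print() call above
allP <- c(
  # exact-test p-values where requested, approximate otherwise
  ifelse(rownames(pCat) %in% exact, pCat[, "pExact"], pCat[, "pApprox"]),
  # non-normal p-values where requested, normal otherwise
  ifelse(rownames(mypCont) %in% nonnormal,
         mypCont[, "pNonNormal"], mypCont[, "pNormal"])
)
p.adjust(allP, method = "bonferroni")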
All that said, with your p-values extracted, I think there are two much more convenient quick-and-dirty ways to accomplish the same:
Quick and dirty way A: use dput() on your printed tableOne, find the p-values in the console output, and copy-paste them into your script to store them in an object.
Quick and dirty way B: if you look in the tableone vignette, there is an "Exporting" section; you can use print(tableOne, quote = TRUE) and then copy and paste the output into a spreadsheet (LibreOffice, Excel...).
Then select the column with the p-values, transpose it, read it back into R, compute the adjusted p-values with p.adjust(), and copy them back into the spreadsheet for journal submission.

How are BRR weights used in the survey package for R?

Does anyone know how to use BRR weights in Lumley's survey package for estimating variance when your dataset already has BRR weights in it?
I am working with PISA data, and they already include 80 BRR replicates in their dataset. How can I get as.svrepdesign to use these, instead of trying to create its own? I tried the following and got the subsequent error:
dstrat <- svydesign(id = ~uniqueID, strata = ~strataVar, weights = ~studentWeight,
                    data = data, nest = TRUE)
dstrat <- as.svrepdesign(dstrat, type = "BRR")
Error in brrweights(design$strata[, 1], design$cluster[, 1], ...,
  fay.rho = fay.rho, : Can't split with odd numbers of PSUs in a stratum
Any help would be greatly appreciated, thanks.
no need to use as.svrepdesign() if you have a data frame with the replicate weights already :) you can create the replicate weighted design directly from your data frame.
say you have data with a main weight column called mainwgt and 80 replicate weight columns called repwgt1 through repwgt80, you could use this --
yoursurvey <-
svrepdesign(
weights = ~mainwgt ,
repweights = "repwgt[0-9]+" ,
type = "BRR",
data = yourdata ,
combined.weights = TRUE
)
-- this way, you don't have to identify the exact column numbers. then you can run normal survey commands like --
svymean( ~variable , design = yoursurvey )
if you'd like another example, here's some example code and an explanatory blog post using the current population survey.
I haven't used the PISA data, but I used svrepdesign() last year with the Public Use Microdata Sample from the American Community Survey (US Census Bureau), which also ships with 80 replicate weights. They state to use the Fay method for that specific survey, so here is how one can construct the svrepdesign object with that data:
pums_p.rep<-svrepdesign(variables=pums_p[,2:7],
repweights=pums_p[8:87],
weights=pums_p[,1],combined.weights=TRUE,
type="Fay",rho=(1-1/sqrt(4)),scale=1,rscales=1)
attach(pums_p.rep)
#CROSS - TABS
#unweighted
xtabs(~ is5to17youth + withinAMILimit)
table(is5to17youth + withinAMILimit)
#weighted, mean income by sex by race for select age groups
svyby(~PINCP,~RAC1P+SEX,subset(
pums_p.rep,AGEP > 25 & AGEP <35),na.rm = TRUE,svymean,vartype="se","cv")
In getting this to work, I found the article from A. Damico helpful: Damico, A. (2009). Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data. The R Journal, 1(2), 37–44.
