I am trying to run a Mann-Whitney test across a large data set. Here is an excerpt of my input:
GeneID GeneID-2 GeneName TSS-ID Locus-ID TAp73fTfTAAdEmp TAp73fTfTFAdEmp TAp73fTfTJAdEmp TAp73fTfTAAdCre TAp73fTfTFAdCre TAp73fTfTJAdCre
ENSMUSG00000028180 ENSMUSG00000028180 Zranb2 TSS1050,TSS17719,TSS52367,TSS53246,TSS72833,TSS73222 3:157534159-157548390 11.32013333 11.66344 11.87956667 13.01974667 14.70944667 10.94043867
ENSMUSG00000028184 ENSMUSG00000028184 Lphn2 TSS23298,TSS2403,TSS74519 3:148815585-148989316 15.0983 15.09572 14.03578667 17.00742667 17.90735333 14.69675333
ENSMUSG00000028187 ENSMUSG00000028187 Rpf1 TSS66485 3:146506347-146521423 12.34542667 14.11470667 10.493766 14.57954 11.93746667 11.07405867
ENSMUSG00000028189 ENSMUSG00000028189 Ctbs TSS36674,TSS72417 3:146450469-146465849 1.288003867 1.435658 1.959620667 1.427768 1.502116667 1.243928267
ENSMUSG00000020755 ENSMUSG00000020755 Sap30bp TSS14892,TSS218,TSS54781,TSS58430 11:115933281-115966725 31.91070667 31.68585333 26.86939333 39.05116667 30.62916667 27.22893333
ENSMUSG00000020752 ENSMUSG00000020752 Recql5 TSS26689,TSS42686,TSS60902,TSS75513,TSS9111 11:115892594-115933477 10.55415467 9.373216667 8.315984 7.255579333 7.022178 8.553787333
ENSMUSG00000020758 ENSMUSG00000020758 Itgb4 TSS23937,TSS28540,TSS29211,TSS34600,TSS36953,TSS4070,TSS6591,TSS68296 11:115974708-116008412 130.2124 117.3862 129.323 134.1108667 134.8743333 165.3330667
ENSMUSG00000069833 ENSMUSG00000069833 Ahnak TSS54612 19:8989283-9076919 116.3223333 135.2628 130.1286 147.045 142.8164 127.2352
ENSMUSG00000033863 ENSMUSG00000033863 Klf9 TSS87300 19:23141225-23166911 23.23418667 27.46006 26.56143333 21.09004667 18.47022 16.63767333
ENSMUSG00000069835 ENSMUSG00000069835 Sat2 TSS71535,TSS9615 11:69622023-69623870 0.975045133 0.886760067 1.593631333 1.469496 1.2373384 1.292182733
ENSMUSG00000028233 ENSMUSG00000028233 Tgs1 TSS24151,TSS28446,TSS50213,TSS68499,TSS79096 4:3574874-3616619 4.221024667 4.212087333 4.160574 5.113266667 6.917347333 5.22148
ENSMUSG00000028232 ENSMUSG00000028232 Tmem68 TSS12134,TSS25773,TSS25778,TSS49743,TSS7797 4:3549040-3574853 4.048868 3.906129333 6.024607333 4.613682 6.292972 4.287184
I wrote the same script for a t-test and it worked. However, replacing the test with "wilcox" gives me this error:
Error in wilcox.test.default(x[i, 1:3], x[i, 4:6], var.equal = TRUE) :
'x' must be numeric
My code is:
library(preprocessCore)
err <-file("err.Rout", open="wt")
sink(err, type="message")
x <- read.table("Data.txt", row.names=1, header=TRUE, sep="\t", na.strings="NA")
x<-x[,5:ncol(x)]
p<-matrix(0,nrow(x),3)
for (i in 1:nrow(x)) {
myTest <- try(wilcox.test(x[i,1:3], x[i,4:6], var.equal=TRUE))
if (inherits(myTest, "try-error"))
{ p[i,2]=1 }
else
{p[i,2]=myTest$p.value; num=rowMeans(x[i,1:3], na.rm = FALSE); den=rowMeans(x[i,4:6], na.rm = FALSE); ratio=num/den; p[i,1]=ratio }
}
p[,3] = p.adjust(p[,2], method="none")
colnames(p) <- c("FoldChange", "p-value", "Adjusted-p")
write.table(p, file = "tmpPval-fold.txt", append = FALSE, quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
sink()
I'd appreciate your help in this matter. As I said, it worked perfectly when I used the t-test instead of 'wilcox'.
There are (at least) two problems with your code at the moment, and one of them is the cause of that error. The object returned by x[i,1:3] has class data.frame, which is a list object and fails the is.numeric test inside wilcox.test. Try coercing:
wilcox.test(as.numeric(x[1,(1:3)]), as.numeric(x[1,(4:6)]), var.equal=TRUE)
But what-the-F is var.equal doing in a call to a non-parametric test, which makes no assumption of equal variance? (In fact it is simply being ignored.) And how do you expect to get useful information from a test that compares only 3 items against 3 items? That is never going to be "significant" or even particularly informative. I doubt that a t.test could be informative at 3 vs 3, but a non-parametric test based on the ordering of values is even less likely to give a statistical signal of "significance".
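Both points above can be checked on two rows copied from the excerpt in the question. This is only a sketch: the small data frame stands in for the full table, and converting it with as.matrix() before looping is my own choice rather than something from the original script. With 3 vs 3 values and no ties, the exact two-sided p-value cannot go below 2/choose(6, 3) = 0.1, which is what the fully separated Tgs1 row returns.

```r
# Two rows taken from the question's excerpt (Zranb2 and Tgs1), keyed by GeneID.
x <- data.frame(
  TAp73fTfTAAdEmp = c(11.32013333, 4.221024667),
  TAp73fTfTFAdEmp = c(11.66344,    4.212087333),
  TAp73fTfTJAdEmp = c(11.87956667, 4.160574),
  TAp73fTfTAAdCre = c(13.01974667, 5.113266667),
  TAp73fTfTFAdCre = c(14.70944667, 6.917347333),
  TAp73fTfTJAdCre = c(10.94043867, 5.22148),
  row.names = c("ENSMUSG00000028180", "ENSMUSG00000028233")
)

# A one-row data.frame slice is a list, not a numeric vector.
is.numeric(x[1, 1:3])              # FALSE
is.numeric(as.numeric(x[1, 1:3]))  # TRUE

# Converting once to a matrix makes every row a plain numeric vector.
m <- as.matrix(x)
pvals <- apply(m, 1, function(r) wilcox.test(r[1:3], r[4:6])$p.value)
fold  <- rowMeans(m[, 1:3]) / rowMeans(m[, 4:6])

# Smallest p-value a two-sided exact test can give for 3 vs 3 values.
min_p <- wilcox.test(1:3, 4:6)$p.value  # 0.1
```

No row can fall below the usual 0.05 threshold with this design, whatever the data look like.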
I've been using the quanteda SML workflow as described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful to set up my own classification task. However, instead of the fixed held-out train/test sampling I would like to use a k-fold cross-validation. Could you point me towards the best way to implement it into the workflow? Is there an easy way to apply it in quanteda?
Many thanks
I tried to add a cross validation based on this example:
https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_remove(pattern = stopwords("en")) %>%
tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
require(confusionMatrix)
confusionMatrix(tab_class, mode = "everything", positive = "pos")
#n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)
but it returns "Error in group.samples(Y) : argument "Y" is missing, with no default"
This should probably be a comment, but I cannot post comments yet. I think the problem is that you are calling crossval() from the wrong package. The link you shared documents crossval() from the GitHub package quanteda/quanteda.classifiers, not from the separate crossval package. The function you called has a different definition and presumably expects a different pipeline: it requires additional X and Y arguments, and their absence is the reason for your error.
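If installing the GitHub package is not an option, k-fold cross-validation can also be written by hand around textmodel_nb(). The following is a minimal sketch under my own assumptions, not the quanteda.classifiers implementation: the names k, fold and acc are illustrative, the tokenisation is simplified compared with the question, and accuracy is the only metric computed.

```r
library(quanteda)
library(quanteda.textmodels)

# Build one dfm for the whole corpus (simplified preprocessing).
dfmat <- dfm(tokens(data_corpus_moviereviews, remove_punct = TRUE))

k <- 5
set.seed(300)
# Assign every document to one of k folds at random.
fold <- sample(rep(seq_len(k), length.out = ndoc(dfmat)))

acc <- vapply(seq_len(k), function(f) {
  train <- dfmat[fold != f, ]
  test  <- dfmat[fold == f, ]
  mod   <- textmodel_nb(train, train$sentiment)
  # Align the held-out features with the training features before predicting.
  pred  <- predict(mod, newdata = dfm_match(test, featnames(train)))
  mean(pred == test$sentiment)
}, numeric(1))

mean(acc)  # average held-out accuracy over the k folds
```

Each document is used for testing exactly once, so the mean of acc is an estimate based on all documents rather than on one fixed held-out sample.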
I am trying to get a set of cross tables with 70 variables. But no matter what I did, R kept generating the "function" back to me. I tried to move substitute after CrossTable but R seemed to have trouble using list(i=as.name(x)).
library(gmodels)
Independent_List <- colnames(Comorbidity)[1:70]
Comorbidity_Table <- lapply(Independent_List, function(x) {
substitute(CrossTable(i ,
Comorbidity$sleep,
prop.c = TRUE,
prop.r = FALSE,
prop.t = FALSE,
prop.chisq = FALSE,
data =Comorbidity),
list(i=as.name(x)))
})
lapply(Comorbidity_Table, summary)
[[1]]
Length Class Mode
8 call call
[[2]]
Length Class Mode
8 call call
[[3]]
Length Class Mode
8 call call
The goal is to try to make a table with specific cell numbers and column percentage and merge with my looped glm results.
I ended up using a much simpler method to solve this problem:
Tables <- lapply(Table_Data[, 1:11], function(x){table(x, Table_Data$TSD,exclude = NA)})
Prop_Tabs <- lapply(Tables[1:11], function(x){prop.table(x,2)})
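For anyone wondering why the original attempt returned objects of class "call": substitute() only builds the expression and never runs it, so each list element still has to be passed to eval(). The sketch below uses the built-in mtcars data as a stand-in for Comorbidity, and the column names are chosen purely for illustration.

```r
vars <- c("cyl", "gear")

# substitute() returns unevaluated calls, which is what summary() reported.
calls <- lapply(vars, function(v) {
  substitute(table(mtcars[[i]], mtcars$am), list(i = v))
})
class(calls[[1]])  # "call"

# Evaluating the calls produces the actual tables.
tabs_from_calls <- lapply(calls, eval)

# The simpler route from the answer: build the tables directly.
tabs  <- lapply(mtcars[vars], function(x) table(x, mtcars$am))
props <- lapply(tabs, prop.table, margin = 2)  # column percentages
```

Because margin = 2 is used, every column of each table in props sums to 1.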
library(nnet)
set.seed(9850)
train1<- sample(1:155,110)
test1 <- setdiff(1:110,train1)
ideal <- class.ind(hepatitis$class)
hepatitisANN = nnet(hepatitis[train1,-20], ideal[train1,], size=10, softmax=TRUE)
j <- predict(hepatitisANN, hepatitis[test1,-20], type="class")
hepatitis[test1,]$class
table(predict(hepatitisANN, hepatitis[test1,-20], type="class"),hepatitis[test1,]$class)
confusionMatrix(hepatitis[test1,]$class, j)
Error:
Error in nnet.default(hepatitis[train1, -20], ideal[train1, ], size = 10, :
NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In nnet.default(hepatitis[train1, -20], ideal[train1, ], size = 10, :
NAs introduced by coercion
The hepatitis variable holds the hepatitis dataset available from UCI.
This error message is because you have character values in your data.
Try reading the hepatitis dataset with na.strings = "?". This is documented in the description of the dataset on the UCI page.
headers <- c("Class","AGE","SEX","STEROID","ANTIVIRALS","FATIGUE","MALAISE","ANOREXIA","LIVER BIG","LIVER FIRM","SPLEEN PALPABLE","SPIDERS","ASCITES","VARICES","BILIRUBIN","ALK PHOSPHATE","SGOT","ALBUMIN","PROTIME","HISTOLOGY")
hepatitis <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data", header = FALSE, na.strings = "?")
names(hepatitis) <- headers
library(nnet)
set.seed(9850)
train1<- sample(1:155,110)
# test rows are all rows not sampled for training (1:110 would drop rows 111 to 155)
test1 <- setdiff(1:155,train1)
ideal <- class.ind(hepatitis$Class)
# will give error due to missing values
# 1st column of hepatitis dataset is the class variable
hepatitisANN <- nnet(hepatitis[train1,-1], ideal[train1,], size=10, softmax=TRUE)
This code will not give your error, but it will give an error about missing values. You will need to address those before you can continue.
Also be aware that the class variable is the first variable in the dataset when it comes straight from the UCI data repository.
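The effect of na.strings can be seen without downloading anything. The three inline records below are made up for illustration and are not rows from the hepatitis file; dropping incomplete rows with complete.cases() is shown as one possible way to handle the missing values, not the only one.

```r
raw <- "2,30,?\n1,50,1.2\n2,?,0.9"

# Without na.strings, any column containing "?" is not read as numeric
# (character in R >= 4.0, factor in older versions).
bad  <- read.csv(text = raw, header = FALSE)
# With na.strings = "?", the same columns are numeric with NA entries.
good <- read.csv(text = raw, header = FALSE, na.strings = "?")

sapply(bad, class)
sapply(good, class)

# One option before fitting: keep only the rows with no missing values.
good_complete <- good[complete.cases(good), ]
```

Whether dropping rows is acceptable depends on how many records are lost; imputing the missing values is the usual alternative.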
Edit based on comments:
The na.action only works if you use the formula notation of nnet.
So in your case:
hepatitisANN <- nnet(class.ind(Class)~., hepatitis[train1,], size=10, softmax=TRUE, na.action = na.omit)
A very basic question, but I am not able to apply this to my code, so I am seeking help here.
I get the error shown below when running this R code:
knn.pred <- knn(tdm.stack.nl_train, tdm.stack.nl_Test, tdm.cand_train, prob = TRUE)
> Error in knn(tdm.stack.nl_train, tdm.stack.nl_Test, tdm.cand_train, prob = TRUE) :
> dims of 'test' and 'train' differ.
I want to print the message shown below when this error occurs, but I could not achieve this. I am not good at writing functions yet. Please help.
out <- tryCatch( when error = {print('New words seen in testing data')})
It's better and easier to use try:
knn.pred <- try(knn(tdm.stack.nl_train, tdm.stack.nl_Test, tdm.cand_train, prob = TRUE))
if (inherits(knn.pred, "try-error")) { # error management
print('New words seen in testing data')
}
You could do:
tryCatch(knn.pred <- knn(tdm.stack.nl_train, tdm.stack.nl_Test, tdm.cand_train, prob = TRUE),
error = function(e) {
stop('New words seen in testing data')
})
This shows up as:
tryCatch(knn.pred <- knn(tdm.stack.nl_train, tdm.stack.nl_Test, tdm.cand_train, prob = TRUE),
error = function(e) {
stop('New words seen in testing data')
})
Error in value[[3L]](cond) : New words seen in testing data
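If the goal is to print the message and carry on rather than stop with a new error, the handler can emit the message and return a fallback value instead. A small sketch, assuming made-up matrices whose column counts deliberately differ so that knn() fails the same way; returning NULL as the fallback is my own choice.

```r
library(class)

# Hypothetical toy data: train has 2 columns, test has 3, so knn() will fail.
train <- matrix(rnorm(6), nrow = 3, ncol = 2)
test  <- matrix(rnorm(3), nrow = 1, ncol = 3)
cl    <- factor(c("a", "b", "a"))

knn.pred <- tryCatch(
  knn(train, test, cl, prob = TRUE),
  error = function(e) {
    message("New words seen in testing data")
    NULL  # fallback value so the script can continue
  }
)

is.null(knn.pred)  # TRUE when the call failed
```

The original error text is still available inside the handler as conditionMessage(e) if it needs to be logged as well.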
I have a problem with plotting my results. Previously (about two weeks ago) I could use the same code as below to plot my data, but now I'm getting an error:
data<- read.table("my_step.odt", header = FALSE, sep = "", quote="\"'", dec=".", as.is = FALSE, strip.white=FALSE, col.names=c(.......))
mgn_my <- data[1:49999,18]
sim <- data[1:49999, 21]
plot(sim , mgn_my , type="l",xlab="Time (ns)",ylab="mx")
error
Error in table(x, y) : attempt to make a table with >= 2^31 elements
any suggestion?
I have had a similar problem as you before. Based on my response from another post, here's what I would suggest before you run plot:
Option 1: Use droplevels
mgn_my <- droplevels(data[1:49999,18])
Option 2: Use apply. This approach seems "friendlier" if you are familiar with apply-family functions in R. For example:
mgn_my <- data[1:49999,18]
apply(mgn_my,1,plot)
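One possible explanation, offered as an assumption since the data file is not shown: with as.is = FALSE, any column containing a non-numeric entry is read as a factor, and plot() on two factors tries to build a table of their levels. Also note that apply() expects an array or data frame, so it raises an error when given a single column vector. The sketch below uses a made-up data frame and a hypothetical helper to_num to show how such columns could be converted back to numbers and how the offending entries could be located.

```r
# Hypothetical stand-in for two columns that were read as factors.
df <- data.frame(
  time = factor(c("0.1", "0.2", "0.3")),
  mx   = factor(c("1.5", "x", "2.5"))
)

# as.numeric() on a factor returns level codes, so convert via character.
to_num <- function(f) suppressWarnings(as.numeric(as.character(f)))

sim    <- to_num(df$time)
mgn_my <- to_num(df$mx)

which(is.na(mgn_my))  # positions of entries that were not valid numbers

plot(sim, mgn_my, type = "l", xlab = "Time (ns)", ylab = "mx")
```

Checking sapply(data, class) right after read.table() shows which columns were affected in the real file.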