Class probabilities in neural networks - R

I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be A, B, or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device ~ ., data = dataframetrain[1:100, ], method = "mlp",
             trControl = trainControl(classProbs = TRUE))
(p <- predict(fit, newdata = dataframetest, type = "prob"))
I thought that the class probabilities for each record must sum up to one. But I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me? I don't know what I did wrong.

There's probably nothing wrong; it just seems that caret returns the raw values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret, the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function, the outputs can be converted to probabilities by dividing each prediction by its respective row sum; see this question.
This behavior seems to hold when using method = "mlp" or method = "mlpWeightDecay", but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
# shuffle the rows of the data frame
iris <- iris[sample(nrow(iris)), ]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
             size = 5, learnFuncParams = c(0.1), maxit = 50,
             inputsTest = iris2$inputsTest, targetsTest = iris2$targetsTest)
predictions <- predict(model, iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = FALSE)
modelCaret <- train(y = irisTargets[training],
                    x = irisValues[training, ],
                    method = "nnet")
predictionsCaret <- predict(modelCaret,
                            newdata = irisValues[-training, ],
                            type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
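Applied back to the original question, the same row-sum normalization should give class probabilities that sum to one. A minimal sketch, assuming p is the prediction data frame returned by predict(fit, ..., type = "prob"):
# divide each row of the raw mlp outputs by its row sum
pProbs <- p / rowSums(p)
rowSums(pProbs)
# every row now sums to exactly 1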

I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
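For illustration, here is a minimal softmax sketch in plain R (not specific to caret or any particular model; the raw matrix below is made up), showing how raw output-layer activations are mapped to values that sum to one:
# softmax maps a vector of raw activations to probabilities that sum to one
softmax <- function(x) {
  ex <- exp(x - max(x))  # subtract the max for numerical stability
  ex / sum(ex)
}
# hypothetical raw outputs for two records and three classes
raw <- matrix(c(2.1, 0.3, -1.2,
                0.5, 0.5,  0.4), nrow = 2, byrow = TRUE)
probs <- t(apply(raw, 1, softmax))
rowSums(probs)
# 1 1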

Related

lcmm::predictClass with l-spline link function

I am getting an error message trying to predict class membership in lcmm::predictClass(). This seems to be due to using a spline-based link function, as exemplified below. The lcmm::predictClass() function works okay for the default link function.
The following shows (1) a reproducible example giving the error message, and (2) a working example with the same broad approach.
## define initialisation values for quick result here
BB <- c(-19.064, 21.718, -1.192, -1.295, -1.205, -0.281, 0.110,
        -0.232, 1.339, -1.007, 1.019, -9.395, 1.702, 2.030,
        2.089, 1.352, -9.369, 1.220, 1.532, 2.481, 1.223)
library(lcmm)
m2c <- multlcmm(Ydep1 + Ydep2 ~ 1 + Time * X2,
                random = ~1 + Time,
                subject = "ID",
                link = "3-quant-splines",
                ng = 2,
                mixture = ~1 + Time,
                classmb = ~1 + X1,
                data = data_lcmm,
                B = BB)
## converges in 3 iterations
## define the prediction cases
library(dplyr)
X <- data_lcmm %>%
  filter(ID %in% sample(ID, 10)) %>% ## 10 random IDs
  select(ID, Ydep1, Ydep2, Time, X1, X2)
## find predicted class memberships
predictClass(m2c, newdata=X)
## Error in multlcmm(fixed = Ydep1 + Ydep2 ~ 1 + Time * X2, mixture = ~1 + :
## Length of vector range is not correct.
On the other hand, a similar approach with a linear link function gives the following. Note that these models are based on the example on the ?multlcmm help page.
library(lcmm)
m2 <- multlcmm(Ydep1 + Ydep2 ~ 1 + Time * X2,
               random = ~1 + Time,
               subject = "ID",
               link = "linear",
               ng = 2,
               mixture = ~1 + Time,
               classmb = ~1 + X1,
               data = data_lcmm,
               B = c(18, -20.77, 1.16, -1.41, -1.39, -0.32, 0.16,
                     -0.26, 1.69, 1.12, 1.1, 10.8, 1.24, 24.88, 1.89))
## converges in 2 iterations
library(dplyr)
X <- data_lcmm %>%
  filter(ID %in% sample(ID, 10)) %>%
  select(ID, Ydep1, Ydep2, Time, X1, X2)
predictClass(m2, newdata=X)
## ID class prob1 prob2
## 1 21 2 0.031948951 9.680510e-01
## 2 25 2 0.042938984 9.570610e-01
## 3 33 2 0.026053178 9.739468e-01
## 4 46 1 0.999999964 3.597409e-08
## 5 50 2 0.066291287 9.337087e-01
## 6 74 2 0.005630593 9.943694e-01
## 7 120 2 0.024787290 9.752127e-01
## 8 171 2 0.053499974 9.465000e-01
## 9 229 1 0.999999996 4.368222e-09
##10 235 2 0.008173507 9.918265e-01
## ...or similar
The other predict functions, predictL() and predictY(), seem to work fine. predictRE() throws the same error message.
I will also email the package maintainer.

Results from DTU analysis tutorial not what they should be?

I’m currently following a differential transcript usage (DTU) analysis tutorial (link here) and am using the sample datasets provided by the authors. However, my results stop matching those from the tutorial after I create a dmDSdata object and filter it (I’ve included the code below). Creating the object works fine, but after filtering and estimating model parameters, the results tables I produce show different genes and transcripts from the ones shown in the tutorial:
# Load the DRIMSeq package and create a dmDSdata object with the
# counts and samples data frames
library(DRIMSeq)
dmDS <- dmDSdata(counts = counts, samples = samples)
dmDS # returns information about the number of genes
# Each row of the dmDSdata object contains all the transcripts corresponding
# to a particular gene
methods(class = class(dmDS))
counts(dmDS[1,])[,1:4]
# Filter the dmDS object before estimating model parameters
n <- 12 # the total number of samples
n.small <- 6 # sample size of the smallest group
dmDS <- dmFilter(dmDS,
                 min_samps_feature_expr = n.small, min_feature_expr = 10,
                 min_samps_feature_prop = n.small, min_feature_prop = 0.1,
                 min_samps_gene_expr = n, min_gene_expr = 10)
dmDS
# Find out how many of the genes remaining after filtering have N isoforms
# by counting the number of unique gene IDs and tabulating the results
table(table(counts(dmDS)$gene_id))
# Create a design matrix using a design formula as well as the sample
# information contained in the dmDS object (accessed via samples.csv)
design_full <- model.matrix(~condition, data = DRIMSeq::samples(dmDS))
colnames(design_full)
# To accelerate the subsequent steps, subset to the first 250 genes
dmDS <- dmDS[1:250,]
# Estimating model parameters and testing for differential transcript use
# Estimate the precision, which is inversely related to dispersion in the
# Dirichlet Multinomial model
# Fit regression coefficients
# Perform null hypothesis testing on the coefficient of interest
set.seed(1)
system.time({
  dmDS <- dmPrecision(dmDS, design = design_full)
  dmDS <- dmFit(dmDS, design = design_full)
  dmDS <- dmTest(dmDS, coef = "condition2")
})
# Tabulate the results, including a p-value per gene or a p-value per transcript
# p-value per gene: is there DTU within this gene?
# p-value per transcript: has the proportion of this transcript changed within
# its parent gene?
results <- DRIMSeq::results(dmDS) # per gene
results.txp <- DRIMSeq::results(dmDS, level = "feature") # per transcript
At this point, the results I should get are as follows:
head(results)
## gene_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 1.493561 4 8.277814e-01 9.120246e-01
## 2 ENSG00000000460.16 1.068294 3 7.847330e-01 9.101892e-01
## 3 ENSG00000000938.12 4.366806 2 1.126575e-01 2.750169e-01
## 4 ENSG00000001084.11 1.630085 3 6.525877e-01 8.643316e-01
## 5 ENSG00000001167.14 28.402587 1 9.853354e-08 5.007113e-07
## 6 ENSG00000001461.16 9.815460 1 1.730510e-03 6.732766e-03
head(results.txp)
## gene_id feature_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 ENST00000367771.10 0.16587607 1 0.6838032 0.9171007
## 2 ENSG00000000457.13 ENST00000367770.5 0.01666448 1 0.8972856 0.9788571
## 3 ENSG00000000457.13 ENST00000367772.8 1.02668495 1 0.3109386 0.6667146
## 4 ENSG00000000457.13 ENST00000423670.1 0.06046507 1 0.8057624 0.9323782
## 5 ENSG00000000457.13 ENST00000470238.1 0.28905766 1 0.5908250 0.8713427
## 6 ENSG00000000460.16 ENST00000496973.5 0.83415788 1 0.3610730 0.7232298
However, what I see in the R console is the following:
head(results)
## gene_id lr df pvalue adj_pvalue
## 1 ENSG00000237094.12 52.9721358 1 3.383138e-13 2.532227e-12
## 2 ENSG00000237491.8 2.7403807 1 9.784145e-02 3.179847e-01
## 3 ENSG00000228794.8 6.9271154 2 3.131814e-02 1.330626e-01
## 4 ENSG00000187961.13 0.9699384 2 6.157162e-01 8.934371e-01
## 5 ENSG00000217801.9 0.2262070 1 6.343506e-01 8.934371e-01
## 6 ENSG00000131591.17 30.4292202 1 3.462727e-08 2.136131e-07
head(results.txp)
## gene_id feature_id lr df pvalue adj_pvalue
## 1 ENSG00000237094.12 ENST00000599771.6 52.9721358 1 3.383138e-13 3.341499e-12
## 2 ENSG00000237094.12 ENST00000608420.1 52.9721358 1 3.383138e-13 3.341499e-12
## 3 ENSG00000237491.8 ENST00000585826.1 2.7403807 1 9.784145e-02 3.528888e-01
## 4 ENSG00000237491.8 ENST00000592547.1 2.7403807 1 9.784145e-02 3.528888e-01
## 5 ENSG00000228794.8 ENST00000445118.6 0.4788971 1 4.889223e-01 8.378376e-01
## 6 ENSG00000228794.8 ENST00000449005.5 0.5862693 1 4.438654e-01 8.201190e-01
I have tried switching from R version 4.1 and Bioconductor version 3.13 to the older versions used in the tutorial, but I got error messages when trying to install the rnaseqDTU package, saying it was not available for older versions of Bioconductor. As I use RStudio, I also tried clearing my global environment and running the code again, but that did not work either. I'm not sure what to do about this issue and would appreciate any potential solutions! Thanks.

nnet gives me error "NA/NaN/Inf in foreign function call (arg 2)" in RStudio

I've been trying to run a neural network in RStudio to predict the outcome of a bank marketing campaign, but for some reason I get the error below.
> bankData_net <- nnet(bankData[A,c(1:4)], Train_lab[A,], size=3, maxit=100, softmax=TRUE)
# weights: 23
Error in nnet.default(bankData[A, c(1:4)], Train_lab[A, ], size = 3, maxit = 100, :
NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In nnet.default(bankData[A, c(1:4)], Train_lab[A, ], size = 3, maxit = 100, :
NAs introduced by coercion
The data frame looks like the one below (these are only the first 10 rows to give an idea; the full dataset has a few thousand rows).
age job marital education y
1 56 housemaid married basic.4y no
2 57 services married high.school no
3 37 services married high.school no
4 40 admin. married basic.6y no
5 56 services married high.school yes
6 45 services married basic.9y no
7 59 admin. married professional.course no
8 41 blue-collar married unknown yes
9 24 technician single professional.course no
10 25 services single high.school no
And below is the code I'm trying to run.
# save the data set in a variable
bankData = read.csv("data/bank-additional.csv", sep = ";")
# print the first 10 rows of the bank data
head(bankData, n=10)
# Remove variable "duration" which is not helpful
newbankData <- subset(bankData, select = c(age, job, marital, education, y))
head(newbankData, n=10)
library(nnet)
# create train labels: convert the text bankData responses to numeric class labels
Train_lab <- class.ind(bankData$y)
# set seed for random number generator for repeatable results
set.seed(1)
# Create indexes for training (70%) and validation (30%) data
A <- sort(sample(nrow(bankData), nrow(bankData)*.7))
# train neural net
# bankData[A,c(1:4)] selects the first 4 variables as inputs
# size=3 for 3 hidden units, maxit=100 to train for 100 iterations
bankData_net <- nnet(bankData[A,c(1:4)], Train_lab[A,], size=3, maxit=100, softmax=TRUE)
# test
Yt <- predict(bankData_net, bankData[-A,c(1:4)], type="class")
# build a confusion matrix
conf.matrix <- table(bankData[-A,]$y, Yt)
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix))
colnames(conf.matrix) <- paste("Pred", colnames(conf.matrix))
print(conf.matrix)
Please help me figure out what I'm doing wrong and how I can fix it.

Creating a loop with compare_means

I am trying to create a loop that runs compare_means (from the ggpubr library in R) across all columns in a data frame and then selects only the significant adjusted p-values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data <- list()
for (i in 2:length(df3)) {
  data <- compare_means(df3[[i]] ~ Cluster2, data = df3, paired = FALSE,
                        p.adjust.method = "bonferroni", method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to create an output that can be converted into a data frame with all the information contained in the compare_means output.
Thanks a lot.
Try this (storing each result in the list, keyed by the column name, instead of overwriting it on every iteration):
library(ggpubr)
data <- list()
for (i in 2:(length(df3) - 1)) {
  new <- df3[, c(colnames(df3)[i], "Cluster2")]
  colnames(new) <- c("interest", "Cluster2")
  data[[colnames(df3)[i]]] <- compare_means(interest ~ Cluster2, data = new,
                                            paired = FALSE,
                                            p.adjust.method = "bonferroni",
                                            method = "wilcox.test")
}
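To get everything into a single data frame, something along these lines should work (a sketch, assuming the loop above stores one compare_means result per list element, named after the column it came from; compare_means returns a tibble with a p.adj column, and the 0.05 cutoff here is just an example):
# combine the per-variable results into one data frame;
# .id records which column each result came from
results_df <- dplyr::bind_rows(data, .id = "variable")
# keep only the rows with significant adjusted p-values
subset(results_df, p.adj < 0.05)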

plot new values for best fit nonlinear curve

I have created a best fit for a non-linear function, and it seems to be working correctly:
# define a function
fncTtr <- function(n, d) (d/n) * (sqrt(1 + 2*(n/d)) - 1)
# fit
dFit <- nls(dData$ttr ~ fncTtr(dData$n, d), data = dData, start = list(d = 25), trace = TRUE)
summary(dFit)
plot(dData$ttr ~ dData$n, main = "Fitted d value", pch = 19)
xl <- seq(min(dData$n), max(dData$n), (max(dData$n) - min(dData$n))/1000)
lines(xl, predict(dFit, newdata = xl), col = "blue")
The plot of my observations comes out correctly, but I am having trouble displaying the best-fit curve on it. I create the independent variable xl with 1000 values and want to compute the fitted values for them using the best fit. When I call lines(), I get the error message:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
If I try to execute only the predict function:
a <-predict(dFit,newdata=xl)
str(a)
I can see that xl has 1000 components but "a" has only 16 components. Shouldn't I have the same number of values in a?
data used:
n ttr d
1 35 0.6951 27.739
2 36 0.6925 28.072
3 37 0.6905 28.507
4 38 0.6887 28.946
5 39 0.6790 28.003
6 40 0.6703 27.247
7 41 0.6566 25.735
8 42 0.6605 26.981
9 43 0.6567 27.016
10 44 0.6466 26.026
11 45 0.6531 27.667
12 46 0.6461 27.128
13 47 0.6336 25.751
14 48 0.6225 24.636
15 49 0.6214 24.992
16 50 0.6248 26.011
OK, I think I found the solution; however, I'm not sure I'd be able to explain it fully.
When calling predict.nls, what you pass to the newdata argument has to be a named list or data frame whose name matches the predictor variable used in the original call to nls (here n).
# Here I replaced dData$ttr and dData$n with ttr and n
dFit <- nls(ttr ~ fncTtr(n, d), data = dData, start = list(d = 25), trace = TRUE)
plot(dData$ttr ~ dData$n, main = "Fitted d value", pch = 19)
xl <- seq(min(dData$n), max(dData$n), (max(dData$n) - min(dData$n))/1000)
a <- predict(dFit, newdata = list(n = xl))
length(a) == length(xl)
# [1] TRUE
lines(xl, a, col = "blue")
