Results from DTU analysis tutorial not what they should be?

I’m currently following a differential transcript usage (DTU) analysis tutorial (link here) and am using the sample datasets provided by the authors. However, my results stop matching those from the tutorial after I create a dmDSdata object and filter it (I’ve included the code below). Creating the object works fine, but after filtering and estimating model parameters, the results tables I produce show different genes and transcripts from the ones shown in the tutorial:
# Load the DRIMSeq package and create a dmDSdata object with the
# counts and samples data frames
library(DRIMSeq)
dmDS <- dmDSdata(counts = counts, samples = samples)
dmDS # returns information about the number of genes
# Each row of the dmDSdata object contains all the transcripts corresponding
# to a particular gene
methods(class = class(dmDS))
counts(dmDS[1,])[,1:4]
# Filter the dmDS object before estimating model parameters
n <- 12 # the total number of samples
n.small <- 6 # sample size of the smallest group
dmDS <- dmFilter(dmDS,
                 min_samps_feature_expr = n.small, min_feature_expr = 10,
                 min_samps_feature_prop = n.small, min_feature_prop = 0.1,
                 min_samps_gene_expr = n, min_gene_expr = 10)
dmDS
# Find out how many of the genes remaining after filtering have N isoforms
# by counting the number of unique gene IDs and tabulating the results
table(table(counts(dmDS)$gene_id))
# Create a design matrix using a design formula as well as the sample
# information contained in the dmDS object (accessed via samples.csv)
design_full <- model.matrix(~condition, data = DRIMSeq::samples(dmDS))
colnames(design_full)
# To accelerate the subsequent steps, subset to the first 250 genes
dmDS <- dmDS[1:250,]
# Estimating model parameters and testing for differential transcript use
# Estimate the precision, which is inversely related to dispersion in the
# Dirichlet Multinomial model
# Fit regression coefficients
# Perform null hypothesis testing on the coefficient of interest
set.seed(1)
system.time({
  dmDS <- dmPrecision(dmDS, design = design_full)
  dmDS <- dmFit(dmDS, design = design_full)
  dmDS <- dmTest(dmDS, coef = "condition2")
})
# Tabulate the results, including a p-value per gene or a p-value per transcript
# p-value per gene: is there DTU within this gene?
# p-value per transcript: has the proportion of this transcript changed within
# its parent gene?
results <- DRIMSeq::results(dmDS) # per gene
results.txp <- DRIMSeq::results(dmDS, level = "feature") # per transcript
At this point, the results I should get are as follows:
head(results)
## gene_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 1.493561 4 8.277814e-01 9.120246e-01
## 2 ENSG00000000460.16 1.068294 3 7.847330e-01 9.101892e-01
## 3 ENSG00000000938.12 4.366806 2 1.126575e-01 2.750169e-01
## 4 ENSG00000001084.11 1.630085 3 6.525877e-01 8.643316e-01
## 5 ENSG00000001167.14 28.402587 1 9.853354e-08 5.007113e-07
## 6 ENSG00000001461.16 9.815460 1 1.730510e-03 6.732766e-03
head(results.txp)
## gene_id feature_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 ENST00000367771.10 0.16587607 1 0.6838032 0.9171007
## 2 ENSG00000000457.13 ENST00000367770.5 0.01666448 1 0.8972856 0.9788571
## 3 ENSG00000000457.13 ENST00000367772.8 1.02668495 1 0.3109386 0.6667146
## 4 ENSG00000000457.13 ENST00000423670.1 0.06046507 1 0.8057624 0.9323782
## 5 ENSG00000000457.13 ENST00000470238.1 0.28905766 1 0.5908250 0.8713427
## 6 ENSG00000000460.16 ENST00000496973.5 0.83415788 1 0.3610730 0.7232298
However, what I see in the R console is the following:
head(results)
## gene_id lr df pvalue adj_pvalue
## 1 ENSG00000237094.12 52.9721358 1 3.383138e-13 2.532227e-12
## 2 ENSG00000237491.8 2.7403807 1 9.784145e-02 3.179847e-01
## 3 ENSG00000228794.8 6.9271154 2 3.131814e-02 1.330626e-01
## 4 ENSG00000187961.13 0.9699384 2 6.157162e-01 8.934371e-01
## 5 ENSG00000217801.9 0.2262070 1 6.343506e-01 8.934371e-01
## 6 ENSG00000131591.17 30.4292202 1 3.462727e-08 2.136131e-07
head(results.txp)
## gene_id feature_id lr df pvalue adj_pvalue
## 1 ENSG00000237094.12 ENST00000599771.6 52.9721358 1 3.383138e-13 3.341499e-12
## 2 ENSG00000237094.12 ENST00000608420.1 52.9721358 1 3.383138e-13 3.341499e-12
## 3 ENSG00000237491.8 ENST00000585826.1 2.7403807 1 9.784145e-02 3.528888e-01
## 4 ENSG00000237491.8 ENST00000592547.1 2.7403807 1 9.784145e-02 3.528888e-01
## 5 ENSG00000228794.8 ENST00000445118.6 0.4788971 1 4.889223e-01 8.378376e-01
## 6 ENSG00000228794.8 ENST00000449005.5 0.5862693 1 4.438654e-01 8.201190e-01
I have tried switching from R version 4.1 and Bioconductor version 3.13 to the older versions used in the tutorial, but when I tried to download the rnaseqDTU package I got error messages saying it was not available for older versions of Bioconductor. As I use RStudio, I also tried clearing my global environment and running the code again, but that did not work either. I’m not sure what to do about this issue and would appreciate any potential solutions! Thanks.
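One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the tutorial's output starts with genes in sorted ID order (ENSG00000000457, ENSG00000000460, ...), while the output above starts with ENSG00000237094, so the two runs may simply hold the genes in a different order. If so, dmDS[1:250,] would select a different set of genes in each case. The toy sketch below uses made-up gene IDs and counts and a hypothetical helper, first_genes(); it assumes, without having verified it against DRIMSeq, that the gene order in the object follows the row order of the input counts.

```r
# Toy counts table with hypothetical gene IDs, rows NOT sorted by gene_id
counts_toy <- data.frame(
  gene_id    = c("geneC", "geneC", "geneA", "geneA", "geneB", "geneB"),
  feature_id = paste0("tx", 1:6),
  s1         = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the first k distinct genes in row order
first_genes <- function(df, k) head(unique(df$gene_id), k)

# "Taking the first k genes" depends on how the rows are ordered
unsorted_pick <- first_genes(counts_toy, 2)

# Sorting by gene and transcript ID first gives a reproducible selection
counts_sorted <- counts_toy[order(counts_toy$gene_id, counts_toy$feature_id), ]
sorted_pick <- first_genes(counts_sorted, 2)

print(unsorted_pick)
print(sorted_pick)
```

If the ordering does turn out to be the cause, comparing head(counts$gene_id) with the tutorial's counts table before calling dmDSdata() should show the difference directly.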

Related

lcmm::predictClass with l-spline link function

I am getting an error message trying to predict class membership in lcmm::predictClass(). This seems to be due to using a spline-based link function, as exemplified below. The lcmm::predictClass() function works okay for the default link function.
The following shows 1) a reproducible example giving the error message, and 2) a working example with the same broad approach.
## define initialisation values for quick result here
BB <- c(-19.064,21.718,-1.192,-1.295,-1.205,-0.281,0.110,
-0.232, 1.339,-1.007, 1.019,-9.395, 1.702,2.030,
2.089, 1.352,-9.369, 1.220, 1.532, 2.481,1.223)
library(lcmm)
m2c <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="3-quant-splines",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=BB)
## converges in 3 iterations
## define the prediction cases
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>% ## 10 random IDs
select(ID,Ydep1,Ydep2,Time,X1,X2)
## find predicted class memberships
predictClass(m2c, newdata=X)
## Error in multlcmm(fixed = Ydep1 + Ydep2 ~ 1 + Time * X2, mixture = ~1 + :
## Length of vector range is not correct.
On the other hand, a similar approach with a linear link function gives the following. Note that these models are based on the example in the ?multlcmm help section.
library(lcmm)
m2 <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="linear",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=c(18,-20.77,1.16,-1.41,-1.39,-0.32,0.16,
-0.26,1.69,1.12,1.1,10.8,1.24,24.88,1.89))
## converges in 2 iterations
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>%
select(ID,Ydep1,Ydep2,Time,X1,X2)
predictClass(m2, newdata=X)
## ID class prob1 prob2
## 1 21 2 0.031948951 9.680510e-01
## 2 25 2 0.042938984 9.570610e-01
## 3 33 2 0.026053178 9.739468e-01
## 4 46 1 0.999999964 3.597409e-08
## 5 50 2 0.066291287 9.337087e-01
## 6 74 2 0.005630593 9.943694e-01
## 7 120 2 0.024787290 9.752127e-01
## 8 171 2 0.053499974 9.465000e-01
## 9 229 1 0.999999996 4.368222e-09
##10 235 2 0.008173507 9.918265e-01
## ...or similar
The other predict functions predictL() and predictY() seem to work okay. The predictRE() throws the same error message.
I will also email the package maintainer.

Correct variable values in a dataframe applying a function using variable-specific values in another dataframe in R

I have a df called 'covs' with sites in rows and 9 different environmental variables for each site in columns. I need to recalculate the value of each cell using the function (x - center_values(x)) / scale_values(x). However, 'center_values' and 'scale_values' are different for each environmental covariate, and they are located in another df called 'correction'.
I have found many solutions for applying a function for a whole df, but not for applying specific values according to the id of the value to transform.
covs <- read.table(text = "X elev builtup river grip pa npp treecov
384879-2009 1 24.379101 25188.572 1241.8348 1431.1082 5.705152e+03 16536.664 60.23175
385822-2009 2 29.533478 32821.770 2748.9053 1361.7772 2.358533e+03 15773.115 62.38455
385823-2009 3 30.097059 28358.244 2525.7627 1073.8772 4.340906e+03 14899.451 46.03269
386765-2009 4 33.877861 40557.891 927.4295 1049.4838 4.580944e+03 15362.518 53.08151
386766-2009 5 38.605156 36182.801 1479.6178 1056.2130 2.517869e+03 13389.958 35.71379",
header= TRUE)
correction <- read.table(text = "var_name center_values scale_values
1 X 196.5 113.304898393671
2 elev 200.217889868483 307.718211316278
3 builtup 31624.4888660664 23553.2438790344
4 river 1390.41023742909 1549.88661649406
5 grip 5972.67361738244 6996.57793554527
6 pa 2731.33431010861 4504.71055521749
7 npp 10205.2997576655 2913.19658598938
8 treecov 47.9080656134352 17.7101565911347
9 nonveg 7.96755640452006 4.56625351682905", header= TRUE)
Could someone help me write code to recalculate the environmental covariate values in 'covs' using the covariate-specific values reported in 'correction'? E.g. for each value in the column 'elev' of the df 'covs', I need to subtract the 'center_values' entry reported for 'elev' in the 'correction' df, and then divide by the 'scale_values' entry for 'elev' reported in the 'correction' df. Thank you for your kind help.
You may assign var_name to row names, then loop over the names of covs to do the calculations in an sapply.
rownames(correction) <- correction$var_name
res <- as.data.frame(sapply(names(covs), function(x)
  (covs[, x] - correction[x, "center_values"]) / correction[x, "scale_values"]))
res
# X elev builtup river grip pa npp treecov
# 1 -1.725433 -0.5714280 -0.27324970 -0.09586213 -0.6491124 0.66015733 2.173339 0.6958541
# 2 -1.716607 -0.5546776 0.05083296 0.87651254 -0.6590217 -0.08275811 1.911239 0.8174114
# 3 -1.707781 -0.5528462 -0.13867495 0.73253905 -0.7001703 0.35730857 1.611340 -0.1058927
# 4 -1.698956 -0.5405596 0.37928543 -0.29871910 -0.7036568 0.41059457 1.770295 0.2921174
# 5 -1.690130 -0.5251972 0.19353224 0.05755748 -0.7026950 -0.04738713 1.093183 -0.6885470
Check e.g. "elev":
(covs[,"elev"] - correction["elev", "center_values"]) / correction["elev", "scale_values"]
# [1] -0.5714280 -0.5546776 -0.5528462 -0.5405596 -0.5251972
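As a possible alternative to the sapply loop, base R's scale() accepts one centering value and one scaling value per column, so the lookup can be done once with match(). The sketch below is only an illustration: it uses a cut-down copy of the question's data (three columns, values copied from above) and the names idx and res_scaled are introduced here.

```r
# Cut-down copies of the question's tables (X, elev and npp only)
covs <- data.frame(
  X    = 1:5,
  elev = c(24.379101, 29.533478, 30.097059, 33.877861, 38.605156),
  npp  = c(16536.664, 15773.115, 14899.451, 15362.518, 13389.958)
)
correction <- data.frame(
  var_name      = c("X", "elev", "npp", "nonveg"),
  center_values = c(196.5, 200.217889868483, 10205.2997576655, 7.96755640452006),
  scale_values  = c(113.304898393671, 307.718211316278, 2913.19658598938, 4.56625351682905),
  stringsAsFactors = FALSE
)

# Position of each covs column in the correction table
idx <- match(names(covs), correction$var_name)
stopifnot(!anyNA(idx))  # every column must have a correction entry

# scale() subtracts center[j] from column j, then divides by scale[j]
res_scaled <- as.data.frame(scale(covs,
                                  center = correction$center_values[idx],
                                  scale  = correction$scale_values[idx]))
print(res_scaled)
```

Using match() rather than row names means extra rows in 'correction' (such as nonveg) are ignored, and a column with no correction entry stops the script instead of silently producing NA.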

parse multiple XML files based on a vector and rbind in a dataframe

With some effort and help from the stackers, I have been able to parse a webpage and save it as a dataframe. I want to repeat the same operation on multiple xml files and rbind the list. Here is what I tried and did successfully:
library(XML)
xml.url <- "http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml"
doc <- xmlParse(xml.url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
The code above works well. However, when I try to wrap it in a function to do the same for multiple XML files:
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
xml_url_test = as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml",
ERS_ID))
XML_parser <- function(XML_url){
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
return(x_t)
}
major_test <- sapply(xml_url_test, XML_parser)
It works, but gives me a long list that is not in the right data frame format as I generated for the single XML file.
Finally, I would also like to add a column to the final dataframe that holds the ERS number from the ERS_ID vector.
Something like x_t$ERSid <- ERS_ID in the function.
Can someone point out what I am missing in the function, as well as any better ways to do the task?
Thanks!
Your main issue is using sapply() instead of lapply(): the latter always returns a list, whereas the former attempts to simplify the result to a vector or matrix (here, a matrix).
major_test <- lapply(xml_url_test, XML_parser)
Of course, sapply is a wrapper around lapply and can also return a list when called with simplify=FALSE:
major_test <- sapply(xml_url_test, XML_parser, simplify=FALSE)
However, a few other items came up:
At the beginning, you are not inserting each ERS_ID into the URL stem with sprintf's %s placeholder, so right now the same URL is repeated.
At the end, you are not binding your list of data frames into a single final data frame.
Add the new ERS column inside your function by passing in the ERS_ID vector. While creating the column, you can also remove the ERS prefix with gsub.
R code (adjusted)
XML_parser <- function(eid) {
XML_url <- as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", eid))
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
x_t$ERSid <- gsub("ERS", "", eid) # ADD COL, REMOVE ERS
x_t <- x_t[, c(ncol(x_t), seq_len(ncol(x_t) - 1))] # MOVE NEW COL TO FIRST
return(x_t)
}
major_test <- lapply(ERS_ID, XML_parser)
# major_test <- sapply(ERS_ID, XML_parser, simplify=FALSE)
# BIND DATA FRAMES TOGETHER
finaldf <- do.call(rbind, major_test)
# RESET ROW NAMES
row.names(finaldf) <- seq(nrow(finaldf))
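The difference between sapply and lapply, and the final binding step, can be reproduced without touching the ENA server. This is a minimal sketch: toy_parser() and the short IDs are hypothetical stand-ins for XML_parser() and the real ERS accessions.

```r
# Hypothetical stand-in for XML_parser(): one-row data frame per ID
toy_parser <- function(id) {
  data.frame(ERSid = id, id_length = nchar(id), stringsAsFactors = FALSE)
}
ids <- c("ERS1", "ERS22", "ERS333")

# sapply() simplifies the three data frames into a list-matrix
# (one column per ID), which is not a usable data frame
simplified <- sapply(ids, toy_parser)

# lapply() keeps them as a plain list of data frames
kept <- lapply(ids, toy_parser)

# Bind the list into a single data frame and reset the row names
finaldf <- do.call(rbind, kept)
row.names(finaldf) <- seq_len(nrow(finaldf))

print(is.matrix(simplified))
print(finaldf)
```

Note that do.call(rbind, ...) requires every data frame to have the same column names; if the samples carry different attribute sets, a function such as dplyr::bind_rows, which fills missing columns with NA, may be a better fit.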
Using xml2 and the tidyverse you can do something like this:
require(xml2)
require(purrr)
require(tidyr)
urls <- rep("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml", 2)
identifier <- LETTERS[seq_along(urls)] # Take a unique identifier per url here
parse_attribute <- function(x){
out <- data.frame(tag = xml_text(xml_find_all(x, "./TAG")),
value = xml_text(xml_find_all(x, "./VALUE")), stringsAsFactors = FALSE)
spread(out, tag, value)
}
doc <- map(urls, read_xml)
out <- doc %>%
map(xml_find_all, "//SAMPLE_ATTRIBUTE") %>%
set_names(identifier) %>%
map_df(parse_attribute, .id="url")
This gives you a 2x36 data.frame. To parse the column types, I would suggest using readr::type_convert(out).
The object out looks as follows:
url age body product body site body-mass index chimera check collection date
1 A 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
2 B 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
disease status ENA-BASE-COUNT ENA-CHECKLIST ENA-FIRST-PUBLIC ENA-LAST-UPDATE ENA-SPOT-COUNT
1 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
2 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
environment (biome) environment (feature) environment (material) experimental factor
1 organism-associated habitat organism-associated habitat mucus microbiome
2 organism-associated habitat organism-associated habitat mucus microbiome
gastrointestinal tract disorder geographic location (country and/or sea,region) geographic location (latitude)
1 Ulcerative Colitis India 72.82807
2 Ulcerative Colitis India 72.82807
geographic location (longitude) host subject id human gut environmental package investigation type
1 18.94084 1 human-gut metagenome
2 18.94084 1 human-gut metagenome
medication multiplex identifiers pcr primers phenotype project name
1 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
2 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
sample collection device or method sequence quality check sequencing method sequencing template sex target gene
1 biopsy software pyrosequencing DNA male 16S rRNA
2 biopsy software pyrosequencing DNA male 16S rRNA
target subfragment
1 V1V2
2 V1V2
purrr is really helpful here, as you can iterate over a vector of URLs or a list of XML files with map, or within nested elements with at_depth, and simplify the results with the *_df forms and flatten.
library(tidyverse)
library(xml2)
# be kind, don't call this more times than you need to
x <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762") %>%
sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", .) %>%
map(read_xml) # read each URL into a list item
df <- x %>% map(xml_find_all, '//SAMPLE_ATTRIBUTE') %>% # for each item select nodes
at_depth(2, as_list) %>% # convert each (nested) attribute to list
map_df(map_df, flatten) # flatten items, collect pages to df, then all to one df
df
## # A tibble: 175 × 3
## TAG VALUE UNITS
## <chr> <chr> <chr>
## 1 investigation type metagenome <NA>
## 2 project name BMRP <NA>
## 3 experimental factor microbiome <NA>
## 4 target gene 16S rRNA <NA>
## 5 target subfragment V1V2 <NA>
## 6 pcr primers 27F-338R <NA>
## 7 multiplex identifiers TGATACGTCT <NA>
## 8 sequencing method pyrosequencing <NA>
## 9 sequence quality check software <NA>
## 10 chimera check ChimeraSlayer; Usearch 4.1 database <NA>
## # ... with 165 more rows
You can retrieve multiple IDs with a single REST url using a comma-separated list or range like ERS445758-ERS445762 and avoid multiple queries to the ENA.
This code gets all 5 samples into a node set and then applies functions using a leading dot in the XPath string, so that it is relative to that node.
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
url <- paste0( "http://www.ebi.ac.uk/ena/data/view/", paste(ERS_ID, collapse=","), "&display=xml")
doc <- xmlParse(url)
samples <- getNodeSet( doc, "//SAMPLE")
## check the first node
samples[[1]]
## get the sample attribute node set and apply xmlToDataFrame to that
x <- lapply( lapply(samples, getNodeSet, ".//SAMPLE_ATTRIBUTE"), xmlToDataFrame)
# labels for bind_rows
names(x) <- sapply(samples, xpathSApply, ".//PRIMARY_ID", xmlValue)
library(dplyr)
y <- bind_rows(x, .id="sample")
z <- subset(y, TAG %in% c("age","sex","body site","body-mass index") , 1:3)
sample TAG VALUE
15 ERS445758 age 28
16 ERS445758 sex male
17 ERS445758 body site Sigmoid colon
19 ERS445758 body-mass index 16.9550173
50 ERS445759 age 58
51 ERS445759 sex male
...
library(tidyr)
z %>% spread( TAG, VALUE)
sample age body site body-mass index sex
1 ERS445758 28 Sigmoid colon 16.9550173 male
2 ERS445759 58 Sigmoid colon 23.22543185 male
3 ERS445760 26 Sigmoid colon 20.76124567 female
4 ERS445761 30 Sigmoid colon 0 male
5 ERS445762 36 Sigmoid colon 0 male

How can I implement a function?

I have two data files as below:
head (RNA)
Gene_ID chr start end
1 ENSG00000000003.1 X 99883667 99884983
2 ENSG00000000003.2 X 99885756 99885863
3 ENSG00000000003.3 X 99887482 99887565
4 ENSG00000000003.4 X 99888402 99888536
5 ENSG00000000003.5 X 99888928 99889026
6 ENSG00000000003.6 X 99890175 99890249
head(snp)
chr start end SNP_No
1 1 58812 58812 SNP_1
2 1 67230 67230 SNP_2
3 1 79529 79529 SNP_3
4 1 79595 79595 SNP_4
5 1 85665 85665 SNP_5
6 1 86064 86064 SNP_6
I would like to find overlaps between the SNP file and the RNA file, so I used the GenomicRanges R package and ran the commands below:
gr_RNA <- GRanges(seqnames=RNA$chr,IRanges(start=RNA$start,end=RNA$end,names=RNA$Gene_ID))
gr_SNP <- GRanges(seqnames=SNP$chr, IRanges(start=SNP$start,end=SNP$end,names=SNP$SNP_No))
overlaps <- findOverlaps(gr_RNA, gr_SNP)
subsetByOver <- subsetByOverlaps(gr_RNA, gr_SNP)
match_hit <- data.frame(names(gr_RNA)[queryHits(overlaps)],names(gr_SNP)[subjectHits(overlaps)],stringsAsFactors=F)
names(match_hit) <- c('Gene_ID','SNP')
head(match_hit)
Gene_ID SNP
1 ENSG00000000457.1 SNP_307301
2 ENSG00000000457.2 SNP_307307
3 ENSG00000000457.11 SNP_307365
4 ENSG00000000457.12 SNP_307387
5 ENSG00000000460.1 SNP_306845
6 ENSG00000000460.1 SNP_306846
dim(match_hit)
[1] 12287 2
Then I expanded the start and end positions from the RNA file by 100 ("start-100" and "end+100") and ran the script again as below:
gr_RNA1 <- GRanges(seqnames=RNA$chr, IRanges(start=(RNA$start)-100, end=(RNA$end)+100, names=RNA$Gene_ID))
overlaps <- findOverlaps(gr_RNA1, gr_SNP)
subsetByOver<-subsetByOverlaps(gr_RNA1, gr_SNP)
match_hit1 <- data.frame(names(gr_RNA1)[queryHits(overlaps)],names(gr_SNP)[subjectHits(overlaps)],stringsAsFactors=F)
dim(match_hit1)
[1] 17976 2
Now I want to implement a function that takes the RNA table, the SNP table, and the expansion distance, and returns the final results.
Functions in R are defined like this:
myFunction <- function(parameters) {
#function Code
return(result)
}
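Applied to the question, the steps already shown can be wrapped as follows. This is a sketch, not a tested solution for the real data: find_snp_overlaps() is a hypothetical name, the column names are taken from the head() output in the question, and the tiny RNA_toy and SNP_toy tables are made up. Clamping the expanded start at 1 with pmax() is an addition that the original commands did not have.

```r
library(GenomicRanges)

# Hypothetical helper (not part of GenomicRanges): returns one row per
# overlapping (Gene_ID, SNP) pair after widening each RNA range by
# `distance` on both sides
find_snp_overlaps <- function(rna, snp, distance = 0) {
  gr_rna <- GRanges(seqnames = rna$chr,
                    ranges = IRanges(start = pmax(rna$start - distance, 1),
                                     end = rna$end + distance,
                                     names = rna$Gene_ID))
  gr_snp <- GRanges(seqnames = snp$chr,
                    ranges = IRanges(start = snp$start,
                                     end = snp$end,
                                     names = snp$SNP_No))
  hits <- findOverlaps(gr_rna, gr_snp)
  data.frame(Gene_ID = names(gr_rna)[queryHits(hits)],
             SNP = names(gr_snp)[subjectHits(hits)],
             stringsAsFactors = FALSE)
}

# Made-up toy tables with the same columns as in the question
RNA_toy <- data.frame(Gene_ID = c("geneA.1", "geneA.2"),
                      chr = c("1", "1"),
                      start = c(1000, 5000),
                      end = c(2000, 6000),
                      stringsAsFactors = FALSE)
SNP_toy <- data.frame(chr = c("1", "1", "1"),
                      start = c(1500, 2050, 9000),
                      end = c(1500, 2050, 9000),
                      SNP_No = c("SNP_1", "SNP_2", "SNP_3"),
                      stringsAsFactors = FALSE)

hits_0 <- find_snp_overlaps(RNA_toy, SNP_toy)
hits_100 <- find_snp_overlaps(RNA_toy, SNP_toy, distance = 100)
print(hits_0)
print(hits_100)
```

With the real tables, the call would then be find_snp_overlaps(RNA, SNP, 100), which should correspond to match_hit1 above apart from the clamping at position 1.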

Class probabilities in Neural networks

I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be either A,B or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device~.,data=dataframetrain[1:100,], method="mlp",
trControl=trainControl(classProbs=TRUE))
(p=(predict(fit,newdata=dataframetest,type=("prob"))))
I thought that the class probabilities for each record must sum up to one. But I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me because I don't know what I did wrong.
There's probably nothing wrong, it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function the outputs can be converted to probabilities by dividing the predictions by the respective row sum, see this question.
This behavior seems to be true when using method = "mlp" or method = "mlpWeightDecay" but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
#shuffle the vector
iris <- iris[sample(1:nrow(iris),length(1:nrow(iris))),1:ncol(iris)]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
size=5, learnFuncParams=c(0.1), maxit=50,
inputsTest=iris2$inputsTest, targetsTest=iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = F)
modelCaret <- train(y = irisTargets[training],
x = irisValues[training, ],
method = "nnet")
predictionsCaret <- predict(modelCaret,
newdata = irisValues[-training, ],
type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
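For illustration, softmax can also be applied to raw output-layer values after the fact. The sketch below is plain base R with made-up numbers; softmax() is defined here and is not a caret or RSNNS function. It rescales each row differently from the row-sum division shown earlier, so the two approaches give different probabilities.

```r
# Softmax of a numeric vector; subtracting the maximum first avoids
# overflow in exp() without changing the result
softmax <- function(z) {
  z <- z - max(z)
  exp(z) / sum(exp(z))
}

# Made-up raw outputs for two records and three classes (A, B, C)
raw <- matrix(c(2.0, 1.0, 0.1,
                0.5, 0.5, 3.0),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))

# Apply softmax row by row; apply() returns the rows as columns,
# hence the transpose
probs <- t(apply(raw, 1, softmax))

print(probs)
print(rowSums(probs))
```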
