lcmm::predictClass with l-spline link function - r

I am getting an error message trying to predict class membership in lcmm::predictClass(). This seems to be due to using a spline-based link function, as exemplified below. The lcmm::predictClass() function works okay for the default link function.
The following shows 1) a reproducible example giving the error message, and 2) a working example with the same broad approach.
## define initialisation values for quick result here
BB <- c(-19.064,21.718,-1.192,-1.295,-1.205,-0.281,0.110,
-0.232, 1.339,-1.007, 1.019,-9.395, 1.702,2.030,
2.089, 1.352,-9.369, 1.220, 1.532, 2.481,1.223)
library(lcmm)
m2c <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="3-quant-splines",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=BB)
## converges in 3 iterations
## define the prediction cases
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>% ## 10 random IDs
select(ID,Ydep1,Ydep2,Time,X1,X2)
## find predicted class memberships
predictClass(m2c, newdata=X)
## Error in multlcmm(fixed = Ydep1 + Ydep2 ~ 1 + Time * X2, mixture = ~1 + :
## Length of vector range is not correct.
On the other hand, a similar approach with a linear link function gives the following. Note that these models are based on the example in the ?multlcmm help section.
library(lcmm)
m2 <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="linear",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=c(18,-20.77,1.16,-1.41,-1.39,-0.32,0.16,
-0.26,1.69,1.12,1.1,10.8,1.24,24.88,1.89))
## converges in 2 iterations
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>%
select(ID,Ydep1,Ydep2,Time,X1,X2)
predictClass(m2, newdata=X)
## ID class prob1 prob2
## 1 21 2 0.031948951 9.680510e-01
## 2 25 2 0.042938984 9.570610e-01
## 3 33 2 0.026053178 9.739468e-01
## 4 46 1 0.999999964 3.597409e-08
## 5 50 2 0.066291287 9.337087e-01
## 6 74 2 0.005630593 9.943694e-01
## 7 120 2 0.024787290 9.752127e-01
## 8 171 2 0.053499974 9.465000e-01
## 9 229 1 0.999999996 4.368222e-09
##10 235 2 0.008173507 9.918265e-01
## ...or similar
The other predict functions, predictL() and predictY(), seem to work okay. predictRE() throws the same error message.
I will also email the package maintainer.

Related

comp() returns ranks instead of p-values

I was given an example in which comp() returns p-values, but when I run it, it returns ranks instead. So let me ask:
Why is the comp() function from the survMisc package returning ranks instead of p-values?
Is there a way to change it?
test_drug <- survfit(Surv(N_Days,Cens) ~ Drug, data = df)
comp(ten(test_drug), p=c(0,1,1,0.5,0.5),q=c(1,0,1,0.5,2))
output:
Q Var Z pNorm
1 3.3457e+00 2.7643e+01 0.63634 4
n 3.2000e+02 1.0304e+06 0.31524 10
sqrtN 3.4634e+01 4.8218e+03 0.49877 9
S1 2.1524e+00 1.6867e+01 0.52410 7
S2 2.1294e+00 1.6650e+01 0.52185 8
FH_p=0_q=1 1.1647e+00 2.2356e+00 0.77898 3
FH_p=1_q=0 2.1809e+00 1.7056e+01 0.52809 6
FH_p=1_q=1 8.4412e-01 7.9005e-01 0.94968 1
FH_p=0.5_q=0.5 1.6895e+00 4.1759e+00 0.82678 2
FH_p=0.5_q=2 2.7491e-01 2.2027e-01 0.58575 5
maxAbsZ Var Q pSupBr
1 5.8550e+00 2.7643e+01 1.11361 5
n 9.7000e+02 1.0304e+06 0.95556 6
sqrtN 6.3636e+01 4.8218e+03 0.91643 7
S1 3.5891e+00 1.6867e+01 0.87391 9
S2 3.5737e+00 1.6650e+01 0.87581 8
FH_p=0_q=1 2.2539e+00 2.2356e+00 1.50743 2
FH_p=1_q=0 3.6025e+00 1.7056e+01 0.87230 10
FH_p=1_q=1 1.4726e+00 7.9005e-01 1.65678 1
FH_p=0.5_q=0.5 2.9457e+00 4.1759e+00 1.44148 3
FH_p=0.5_q=2 6.3430e-01 2.2027e-01 1.35150 4
So, according to the issue here:
https://github.com/dardisco/survMisc/issues/21
and to information I got from a lecturer who solved the problem earlier,
this is an issue with the R version, and the function itself needs to be updated by the package authors or contributors.
It can be worked around by reading the results with attr(), using the 'tft' attribute, which stands for test for trend. Code example here:
test_bilirubin <- survfit(Surv(N_Days,Cens) ~ Bilirubin_cat, data = df)
b=ten(test_bilirubin)
comp(b,p=c(0,1,1,0.5,0.5),q=c(1,0,1,0.5,2))
d=attr(b,"tft")
# "lrt" - the log-rank family of tests
# "sup" - Renyi test
# "tft" - test for trend
cbind(d$tft$W,round(d$tft$pChisq,4))

Results from DTU analysis tutorial not what they should be?

I’m currently following a differential transcript usage (DTU) analysis tutorial (link here) and am using the sample datasets provided by the authors. However, my results stop matching those from the tutorial after I create a dmDSdata object and filter it (I’ve included the code below). Creating the object works fine, but after filtering and estimating model parameters, the results tables I produce show different genes and transcripts from the ones shown in the tutorial:
# Load the DRIMSeq package and create a dmDSdata object with the
# counts and samples data frames
library(DRIMSeq)
dmDS <- dmDSdata(counts = counts, samples = samples)
dmDS # returns information about the number of genes
# Each row of the dmDSdata object contains all the transcripts corresponding
# to a particular gene
methods(class = class(dmDS))
counts(dmDS[1,])[,1:4]
# Filter the dmDS object before estimating model parameters
n <- 12 # the total number of samples
n.small <- 6 # sample size of the smallest group
dmDS <- dmFilter(dmDS,
min_samps_feature_expr = n.small, min_feature_expr = 10 ,
min_samps_feature_prop = n.small, min_feature_prop = 0.1,
min_samps_gene_expr = n, min_gene_expr = 10)
dmDS
# Find out how many of the genes remaining after filtering have N isoforms
# by counting the number of unique gene IDs and tabulating the results
table(table(counts(dmDS)$gene_id))
# Create a design matrix using a design formula as well as the sample
# information contained in the dmDS object (accessed via samples.csv)
design_full <- model.matrix(~condition, data = DRIMSeq::samples(dmDS))
colnames(design_full)
# To accelerate the subsequent steps, subset to the first 250 genes
dmDS <- dmDS[1:250,]
# Estimating model parameters and testing for differential transcript use
# Estimate the precision, which is inversely related to dispersion in the
# Dirichlet Multinomial model
# Fit regression coefficients
# Perform null hypothesis testing on the coefficient of interest
set.seed(1)
system.time({
dmDS <- dmPrecision(dmDS, design = design_full )
dmDS <- dmFit (dmDS, design = design_full )
dmDS <- dmTest (dmDS, coef = "condition2")
})
# Tabulate the results, including a p-value per gene or a p-value per transcript
# p-value per gene: is there DTU within this gene?
# p-value per transcript: has the proportion of this transcript changed within
# its parent gene?
results <- DRIMSeq::results(dmDS) # per gene
results.txp <- DRIMSeq::results(dmDS, level = "feature") # per transcript
At this point, the results I should get are as follows:
head(results)
## gene_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 1.493561 4 8.277814e-01 9.120246e-01
## 2 ENSG00000000460.16 1.068294 3 7.847330e-01 9.101892e-01
## 3 ENSG00000000938.12 4.366806 2 1.126575e-01 2.750169e-01
## 4 ENSG00000001084.11 1.630085 3 6.525877e-01 8.643316e-01
## 5 ENSG00000001167.14 28.402587 1 9.853354e-08 5.007113e-07
## 6 ENSG00000001461.16 9.815460 1 1.730510e-03 6.732766e-03
head(results.txp)
## gene_id feature_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 ENST00000367771.10 0.16587607 1 0.6838032 0.9171007
## 2 ENSG00000000457.13 ENST00000367770.5 0.01666448 1 0.8972856 0.9788571
## 3 ENSG00000000457.13 ENST00000367772.8 1.02668495 1 0.3109386 0.6667146
## 4 ENSG00000000457.13 ENST00000423670.1 0.06046507 1 0.8057624 0.9323782
## 5 ENSG00000000457.13 ENST00000470238.1 0.28905766 1 0.5908250 0.8713427
## 6 ENSG00000000460.16 ENST00000496973.5 0.83415788 1 0.3610730 0.7232298
However, what I see in the R console is the following:
head(results)
## gene_id lr df pvalue adj_pvalue
## 1 ENSG00000237094.12 52.9721358 1 3.383138e-13 2.532227e-12
## 2 ENSG00000237491.8 2.7403807 1 9.784145e-02 3.179847e-01
## 3 ENSG00000228794.8 6.9271154 2 3.131814e-02 1.330626e-01
## 4 ENSG00000187961.13 0.9699384 2 6.157162e-01 8.934371e-01
## 5 ENSG00000217801.9 0.2262070 1 6.343506e-01 8.934371e-01
## 6 ENSG00000131591.17 30.4292202 1 3.462727e-08 2.136131e-07
head(results.txp)
## gene_id feature_id lr df pvalue adj_pvalue
## 1 ENSG00000237094.12 ENST00000599771.6 52.9721358 1 3.383138e-13 3.341499e-12
## 2 ENSG00000237094.12 ENST00000608420.1 52.9721358 1 3.383138e-13 3.341499e-12
## 3 ENSG00000237491.8 ENST00000585826.1 2.7403807 1 9.784145e-02 3.528888e-01
## 4 ENSG00000237491.8 ENST00000592547.1 2.7403807 1 9.784145e-02 3.528888e-01
## 5 ENSG00000228794.8 ENST00000445118.6 0.4788971 1 4.889223e-01 8.378376e-01
## 6 ENSG00000228794.8 ENST00000449005.5 0.5862693 1 4.438654e-01 8.201190e-01
I have tried switching from R version 4.1 and Bioconductor version 3.13 to the older ones used in the tutorial, but I got error messages when trying to download the rnaseqDTU package, which said it was not available for older versions of Bioconductor. As I use RStudio, I also tried clearing my global environment and running the code again, but that did not work either. I’m not sure what to do about this issue and would appreciate any potential solutions! Thanks.

nnet gives me error "NA/NaN/Inf in foreign function call (arg 2)" in RStudio

I've been trying to run a neural network in RStudio to predict the response to a bank marketing campaign, but for some reason I get the error below.
> bankData_net <- nnet(bankData[A,c(1:4)], Train_lab[A,], size=3, maxit=100, softmax=TRUE)
# weights: 23
Error in nnet.default(bankData[A, c(1:4)], Train_lab[A, ], size = 3, maxit = 100, :
NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In nnet.default(bankData[A, c(1:4)], Train_lab[A, ], size = 3, maxit = 100, :
NAs introduced by coercion
The data set looks like the one below (these are only the first 10 rows, to give an idea; the full data set has a few thousand rows).
age job marital education y
1 56 housemaid married basic.4y no
2 57 services married high.school no
3 37 services married high.school no
4 40 admin. married basic.6y no
5 56 services married high.school yes
6 45 services married basic.9y no
7 59 admin. married professional.course no
8 41 blue-collar married unknown yes
9 24 technician single professional.course no
10 25 services single high.school no
And below is the code I'm trying to run.
# save the data set in a variable
bankData = read.csv("data/bank-additional.csv", sep = ";")
# print first 10 rows of the bank data
head(bankData, n=10)
# Remove variable "duration" which is not helpful
newbankData <- subset(bankData, select = c(age, job, marital, education, y))
head(newbankData, n=10)
library(nnet)
# create train labels: convert the text bankData responses to numeric class labels
Train_lab <- class.ind(bankData$y)
# set seed for random number generator for repeatable results
set.seed(1)
# Create indexes for training (70%) and validation (30%) data
A <- sort(sample(nrow(bankData), nrow(bankData)*.7))
# train neural net
# bankData[A,c(1:4)] is to select the first 4 variables as inputs
# size=3 for 3 hidden units, maxit=100 to train for 100 iterations
bankData_net <- nnet(bankData[A,c(1:4)], Train_lab[A,], size=3, maxit=100, softmax=TRUE)
# test
Yt <- predict(bankData_net, bankData[-A,c(1:4)], type="class")
# build a confusion matrix
conf.matrix <- table(bankData[-A,]$y, Yt)
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix))
colnames(conf.matrix) <- paste("Pred", colnames(conf.matrix))
print(conf.matrix)
Please help me understand what I'm doing wrong and how I can fix it.
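One likely cause, offered as a guess rather than a confirmed diagnosis: the input columns job, marital and education are text, and nnet() needs a purely numeric input matrix, which would explain the "NAs introduced by coercion" warning. The sketch below uses a small made-up data frame (toy, x_num and y_ind are hypothetical names, not from the code above) and expands the categorical columns into dummy variables with model.matrix() before fitting.

```r
library(nnet)

set.seed(1)

# Hypothetical toy data mimicking the structure shown above (not the real file)
toy <- data.frame(
  age     = c(56, 57, 37, 40, 56, 45, 59, 41, 24, 25),
  job     = c("housemaid", "services", "services", "admin.", "services",
              "services", "admin.", "blue-collar", "technician", "services"),
  marital = c(rep("married", 8), "single", "single"),
  y       = c("no", "no", "no", "no", "yes", "no", "no", "yes", "no", "no"),
  stringsAsFactors = TRUE
)

# Expand factors into 0/1 dummy columns; drop the intercept column
x_num <- model.matrix(~ age + job + marital, data = toy)[, -1]

# Put age on a comparable scale to the dummy columns
x_num[, "age"] <- as.numeric(scale(x_num[, "age"]))

# One indicator column per class, as in the original code
y_ind <- class.ind(toy$y)

fit <- nnet(x_num, y_ind, size = 3, maxit = 100, softmax = TRUE, trace = FALSE)

# Raw predictions are class probabilities; pick the most probable class per row
probs <- predict(fit, x_num)
pred  <- colnames(y_ind)[max.col(probs)]
```

The same idea should carry over to the real data by building the matrix from newbankData instead of the toy data frame. The class label is derived from the raw probabilities here, which avoids relying on predict(..., type = "class") for a model fitted with the matrix interface.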

R/Plotly: Error in list2env(data) : first argument must be a named list

I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want and is leading to the error Error in list2env(data) : first argument must be a named list when I run the plotly code. Again, though, I'm not very experienced writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
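For illustration, here is a minimal sketch of what that function returns for the five scores shown in the sample output earlier. The input vector scores and the conversion of Score back to a number are assumptions added for this example; table() stores the scores as factor levels, so the conversion may be needed before using Score on a numeric axis.

```r
scoreFun <- function(x) {
  tbl <- data.frame(table(x))
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  tbl
}

# Scores of the five example sentences shown above
scores <- c(1, 2, 0, 1, 2)

pct <- scoreFun(scores)

# Score comes back as a factor; convert via character to keep the real values
pct$Score <- as.numeric(as.character(pct$Score))

print(pct)
#   Score Count Percentage
# 1     0     1         20
# 2     1     2         40
# 3     2     2         40
```

Because the result is a real data frame with named columns, it can be passed as the data argument of a plotting call, which a matrix with only row names cannot.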

Class probabilities in Neural networks

I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be either A,B or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device~.,data=dataframetrain[1:100,], method="mlp",
trControl=trainControl(classProbs=TRUE))
(p=(predict(fit,newdata=dataframetest,type=("prob"))))
I thought that the class probabilities for each record must sum up to one. But I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me because I don't know what I did wrong.
There's probably nothing wrong, it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function the outputs can be converted to probabilities by dividing the predictions by the respective row sum, see this question.
This behavior seems to be true when using method = "mlp" or method = "mlpWeightDecay" but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
#shuffle the vector
iris <- iris[sample(1:nrow(iris),length(1:nrow(iris))),1:ncol(iris)]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
size=5, learnFuncParams=c(0.1), maxit=50,
inputsTest=iris2$inputsTest, targetsTest=iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = F)
modelCaret <- train(y = irisTargets[training],
x = irisValues[training, ],
method = "nnet")
predictionsCaret <- predict(modelCaret,
newdata = irisValues[-training, ],
type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
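As a rough sketch of the two options mentioned here, the base R code below rescales a matrix of raw output values so that each row sums to one, first by dividing by the row sum and then with a softmax. The matrix raw and the helper softmax_rows() are made up for illustration and are not part of caret or RSNNS.

```r
# Hypothetical raw outputs of a 3-class network for two records
raw <- matrix(c(0.90, 0.10, 0.03,
                0.20, 0.70, 0.15),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))

# Option 1: divide each row by its sum (only sensible for non-negative outputs)
probs_norm <- raw / rowSums(raw)

# Option 2: softmax applied row by row
softmax_rows <- function(z) {
  z <- z - apply(z, 1, max)  # subtract each row's maximum for numerical stability
  e <- exp(z)
  e / rowSums(e)
}
probs_soft <- softmax_rows(raw)

rowSums(probs_norm)
rowSums(probs_soft)
```

Both results sum to one per row, but they are not the same numbers: the softmax exponentiates the outputs first, so it is not equivalent to simple rescaling. Applying it after training is also not the same as training a network with a softmax output layer.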
