Annotated correlation tables with stargazer - r

I want to report correlation tables in a latex report and I'm using 'stargazer' to transform my R objects into tex-code. The correlational data is currently stored in a data frame.
I would like to print rownames and possibly add an annotation under the table. I couldn't find a 'print rownames'-argument and the 'notes'-argument doesn't seem to work.
Any Ideas?
## create object
x <- matrix(1:4, 2, byrow = TRUE)
dimnames(x) <- list(c("A", "B"), c("A", "B"))
x <- data.frame(x)
## create Tex-Code
stargazer(x, summary = FALSE, title = "2x2 Matrix",
notes = "This is a two by two Matrix")

As of version 5.0, stargazer can directly output the content of matrices/vectors. The following code should provide an easy and intuitive resolution to your problem:
## create object
x <- matrix(1:4, 2, byrow = TRUE)
dimnames(x) <- list(c("A", "B"), c("A", "B"))
## create Tex-Code
stargazer(x, title = "2x2 Matrix",
notes = "This is a two by two Matrix")

This is rather a markdown solution that can be converted to LaTeX with e.g. Pandoc:
> require(pander)
> pander(x, caption = 'Annotation')
---------------
A B
------- --- ---
**A** 1 2
**B** 3 4
---------------
Table: Annotation

To get the 'rownames', try this hackish solution:
## create object
x <- matrix(1:4, 2, byrow = TRUE)
x <- data.frame(x)
x <- cbind(c("A","B"),x)
colnames(x) <- c("","A", "B")
## create Tex-Code
stargazer(x, summary = FALSE, title = "2x2 Matrix",
notes = "This is a two by two Matrix", type="text")
At the moment (v. 4.5.1), 'stargazer' is best suited to working with regression tables and data frames. Your question, however, suggests that users might be interested in better support for matrices. Expect this in future releases (next few months).
As for notes, these really only work for regression tables at the moment. However, they will be available for summary statistics and data frame tables in the next release. If you're willing to edit the source, you can get something very close (although not quite perfect) to the future implementation by replacing the following line(s):
.format.s.stat.parts <<- c("-!","stat names","-!","statistics1","-!")
by:
.format.s.stat.parts <<- c("-!","stat names","-!","statistics1","-!","notes")

Related

k-fold cross validation in quanteda

I've been using the quanteda SML workflow as described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful to set up my own classification task. However, instead of the fixed held-out train/test sampling I would like to use a k-fold cross-validation. Could you point me towards the best way to implement it into the workflow? Is there an easy way to apply it in quanteda?
Many thanks
I tried to add a cross validation based on this example:
https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_number = TRUE) %>%
tokens_remove(pattern = stopwords("en")) %>%
tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
require(confusionMatrix)
confusionMatrix(tab_class, mode = "everything", positive = "pos")
#n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)
but it returns "Error in group.samples(Y) : argument "Y" is missing, with no default"
It should probably be a comment, but I cannot post them yet. I think your problem is caused by the usage of the crossval() function from the improper package. The link you shared suggests that you want to use it from the remote quanteda/quanteda.classifiers package, instead of crossval. The one you used presumably requires a different pipeline cause its definition is different. The used function requires additional X and Y arguments. Their lack is a reason for your error.

How to coerce stslist.freq to dataframe

I am doing some describtive sequence analysis using the "TraMineR" library. I want to report my findings via R-Markdown in html format. For formating tables I use "kable" and "kableExtra".
To get the frequency and propotions of the most common sequences I use seqtab(). The result is an stslist.freq object. When I try to coerce it to a dataframe, the dataframe is not containing any frequencies and proportions.
I tried to print the results of seqtab() and store this result again. This gives me the dataframe I desire. However there are two "problems" with that: (1) I don't understand what is happening here and it seems like a "dirty" trick, (2) as a result I also get the output of the print command in my final html document if I don't split the code in multiple chunks and disable the ouput in the specific chunk.
Here is some code to replicate the problem:
library("TraMineR")
#Data creation
data.long <- data.frame(
id=rep(1:50, each=4),
time = c(0,1,2,3),
status = sample(letters[1:2], 200, replace = TRUE),
weight=rep(runif(50, 0, 1), each=4)
)
#reshape
data.wide <- reshape(data.long, v.names = "status", idvar="id", direction="wide", timevar="time")
#sequence
sequence <- seqdef(data.wide,
var=c("status.0", "status.1", "status.2", "status.3"),
weights=data.wide$weight)
#frequencies of sequences
##doesn't work:
seqtab.df1 <- as.data.frame(seqtab(sequence))
##works:
seqtab.df2 <- print(seqtab(sequence))
I expect the dataframe to be the same as the one saved in seqtab.df2, however either without using the print command or with "silently" (no output printed) using the print command.
Thank you very much for your help and let me know if I forgot something to make answering the question possible!
If you look at the class() of the object returned by seqtab, it has the type
class(seqtab(sequence))
# [1] "stslist.freq" "stslist" "data.frame"
so if we look at exactly, what's happening in the print statement for such an object we can get a clue what's going on
TraMineR:::print.stslist.freq
# function (x, digits = 2, width = 1, ...)
# {
# table <- attr(x, "freq")
# print(table, digits = digits, width = width, ...)
# }
# <bytecode: 0x0000000003e831f8>
# <environment: namespace:TraMineR>
We see that what it's really giving you is the "freq" attribute. You can extract this directly and skip the print()
attr(seqtab(sequence), "freq")
# Freq Percent
# a/3-b/1 4.283261 20.130845
# b/1-a/1-b/2 2.773341 13.034390
# a/2-b/1-a/1 2.141982 10.067073
# a/1-b/1-a/1-b/1 1.880359 8.837476
# a/1-b/2-a/1 1.723489 8.100203
# b/1-a/2-b/1 1.418302 6.665861
# b/2-a/1-b/1 1.365099 6.415813
# a/1-b/3 1.241644 5.835586
# a/1-b/1-a/2 1.164434 5.472710
# a/2-b/2 1.092656 5.135360

How to visulize the convolution layer and feature layer in mxnet after cnn was finished trained?

I want to plot or visualize the result of each layers out from a trained CNN with mxnet in R. Like w´those abstract art from what a nn's each layer can see.
But I don't know how. Please somebody help me. One way I can think out is to put the weights and bias back to every step and plot the step out. But when I try to put model$arg.params$convolution0_weight back to mx.symbol.Convolution(), I get
Error in mx.varg.symbol.Convolution(list(...)) :
./base.h:291: Unsupported parameter type object type for argument weight, expect integer, logical, or string.
Can anyone help me?
I thought out one way, but encounter a difficulty at one step. Here is what I did.
I found all the trained cnn's parameters inmodel$arg.params , and to compute with parameters we can use mx.nd... founctions as bellow:
`#convolution 1_result
conv1_result<- mxnet::mx.nd.Convolution(data=mx.nd.array(train_array),weight=model$arg.params$convolution0_weight,bias=model$arg.params$convolution0_bias,kernel=c(8,8),num_filter = 50)
str(conv1_result)
tanh1_result<-mx.nd.Activation(data= conv1_result, act_type = "sigmoid")
pool1_result <- mx.nd.Pooling(data = tanh1_result, pool_type = "avg", kernel = c(4,4), stride = c(4,4))
conv2 result
conv2_result<- mxnet::mx.nd.Convolution(data=pool1_result,weight=model$arg.params$convolution1_weight,bias=model$arg.params$convolution1_bias,kernel=c(5,5),num_filter = 50)
tanh2_result<-mx.nd.Activation(data= conv1_result, act_type = "sigmoid")
pool2_result <- mx.nd.Pooling(data = tanh1_result, pool_type = "avg", kernel = c(4,4), stride = c(4,4))
1st fully connected layer result
flat_result <- mx.nd.flatten(data = pool2_result)
fcl_1_result <- mx.nd.FullyConnected(data = flat_result,weight = model$arg.params$fullyconnected0_weight,bias = model$arg.params$fullyconnected0_bias, num_hidden = 500)
tanh_3_result <- mx.nd.Activation(data = fcl_1_result, act_type = "tanh")
2nd fully connected layer result
fcl_2_result <- mx.nd.FullyConnected(data = tanh_3,weight = model$arg.params$fullyconnected1_weight,bias = model$arg.params$fullyconnected1_bias, num_hidden =100)`
but when I came to mx.nd.FullyConnected() step , I encountered not sufficient memory(i have 16 GB RAM) and R crashed.
So, does anyone know how to batch_size the input data in
mx.nd.FullyConnected(), or any method to make mx.nd.FullyConnected() run successfully as mx.model.FeedForward.create()
did?
Here is the code that can help you to achieve what you want. The code below displays activations of 2 convolution layers of LeNet. The code gets as an input MNIST dataset, which is 28x28 grayscale images (downloaded automatically), and produces images as activations.
You can grab outputs from executor. To see the list of available outputs use names(executor$ref.outputs)
The result of each output is available as a matrix with values in [-1; 1] range. The dimensions of the matrix depends on parameters of the layer. The code use these matrices to display as greyscaled images where -1 is white pixel, 1 - black pixel. (most of the code is taken from https://github.com/apache/incubator-mxnet/issues/1152 and massaged a little bit)
The code is a self sufficient to run, but I have noticed that if I build the model second time in the same R session, the names of ouputs get different indices, and later the code fails because the expected names of outputs are hard coded. So if you decide to create a model more than once, you will need to restart R session.
Hope it helps and you can adjust this example to your case.
library(mxnet)
download.file('https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/R/data/mnist_csv.zip', destfile = 'mnist_csv.zip')
unzip('mnist_csv.zip', exdir = '.')
train <- read.csv('train.csv', header=TRUE)
data.x <- train[,-1]
data.x <- data.x/255
data.y <- train[,1]
val_ind = 1:100
train.x <- data.x[-val_ind,]
train.x <- t(data.matrix(train.x))
train.y <- data.y[-val_ind]
val.x <- data.x[val_ind,]
val.x <- t(data.matrix(val.x))
val.y <- data.y[val_ind]
train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
val.array <- val.x
dim(val.array) <- c(28, 28, 1, ncol(val.x))
# input layer
data <- mx.symbol.Variable('data')
# first convolutional layer
convLayer1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=30)
convAct1 <- mx.symbol.Activation(data=convLayer1, act_type="tanh")
poolLayer1 <- mx.symbol.Pooling(data=convAct1, pool_type="max", kernel=c(2,2), stride=c(2,2))
# second convolutional layer
convLayer2 <- mx.symbol.Convolution(data=poolLayer1, kernel=c(5,5), num_filter=60)
convAct2 <- mx.symbol.Activation(data=convLayer2, act_type="tanh")
poolLayer2 <- mx.symbol.Pooling(data=convAct2, pool_type="max",
kernel=c(2,2), stride=c(2,2))
# big hidden layer
flattenData <- mx.symbol.Flatten(data=poolLayer2)
hiddenLayer <- mx.symbol.FullyConnected(flattenData, num_hidden=500)
hiddenAct <- mx.symbol.Activation(hiddenLayer, act_type="tanh")
# softmax output layer
outLayer <- mx.symbol.FullyConnected(hiddenAct, num_hidden=10)
LeNet1 <- mx.symbol.SoftmaxOutput(outLayer)
# Group some output layers for visual analysis
out <- mx.symbol.Group(c(convAct1, poolLayer1, convAct2, poolLayer2, LeNet1))
# Create an executor
executor <- mx.simple.bind(symbol=out, data=dim(val.array), ctx=mx.cpu())
# Prepare for training the model
mx.set.seed(0)
# Set a logger to keep track of callback data
logger <- mx.metric.logger$new()
# Using cpu by default, but set gpu if your machine has a supported one
devices=mx.cpu(0)
# Train model
model <- mx.model.FeedForward.create(LeNet1, X=train.array, y=train.y,
eval.data=list(data=val.array, label=val.y),
ctx=devices,
num.round=1,
array.batch.size=100,
learning.rate=0.05,
momentum=0.9,
wd=0.00001,
eval.metric=mx.metric.accuracy,
epoch.end.callback=mx.callback.log.train.metric(100, logger))
# Update parameters
mx.exec.update.arg.arrays(executor, model$arg.params, match.name=TRUE)
mx.exec.update.aux.arrays(executor, model$aux.params, match.name=TRUE)
# Select data to use
mx.exec.update.arg.arrays(executor, list(data=mx.nd.array(val.array)), match.name=TRUE)
# Do a forward pass with the current parameters and data
mx.exec.forward(executor, is.train=FALSE)
# List of outputs available.
names(executor$ref.outputs)
# Plot the filters of a sample from validation set
sample_index <- 99 # sample number in validation set. Change it to if you want to see other samples
activation0_filter_count <- 30 # number of filters of the "convLayer1" layer
par(mfrow=c(6,5), mar=c(0.1,0.1,0.1,0.1)) # number of rows x columns in output
dim(executor$ref.outputs$activation0_output)
for (i in 1:activation0_filter_count) {
outputData <- as.array(executor$ref.outputs$activation0_output)[,,i,sample_index]
image(outputData,
xaxt='n', yaxt='n',
col=gray(seq(1,0,-0.1)))
}
activation1_filter_count <- 60 # number of filters of the "convLayer2" layer
dim(executor$ref.outputs$activation1_output)
par(mfrow=c(6,10), mar=c(0.1,0.1,0.1,0.1)) # number of rows x columns in output
for (i in 1:activation1_filter_count) {
outputData <- as.array(executor$ref.outputs$activation1_output)[,,i,sample_index]
image(outputData,
xaxt='n', yaxt='n',
col=gray(seq(1,0,-0.1)))
}
As a result you should see the following images for a validation sample #2 (use RStudio left and right arrows to navigate between them).

How do I display labels from data on Dissimilarity matrix using Coldiss function rather than default numbers?

I think I have read every page on the internet that mentions coldiss and I am still having trouble getting the labels to look correctly. In the image I inserted, the matrices look good but the labels are default numbers (so aren't that useful for a stand alone image) and in the ordered matrix the matrix gets ordered correctly, but the labels didn't re-order, which doesn't make sense.
[Matrix output images][1]
My questions are:
1) How do I get the labels to order properly for the ordered matrix? If the cells in the heat map are changing colors after being ordered, the respective labels should be different too.
2) Is it possible to edit the coldiss function to use my isolate labels that can be found in the top row or first column to label the heat map rather than the default numbers?
Here is the code I'm running.
library(gclus)
library(ape)
source("coldiss.txt")
tree<-read.tree("BP_SNPS_only-BioNJ_tree_100BS")
PatristicDistMatrix100BS<-cophenetic.phylo(tree)
coldiss(D = PatristicDistMatrix100BS, nc = 4, byrank = TRUE, diag = TRUE)
Here is the coldiss.txt file:
# coldiss()
# Color plots of a dissimilarity matrix, without and with ordering
#
# License: GPL-2
# Author: Francois Gillet, 23 August 2012
#
"coldiss" <- function(D, nc = 4, byrank = TRUE, diag = FALSE)
{
require(gclus)
if (max(D)>1) D <- D/max(D)
if (byrank) {
spe.color <- dmat.color(1-D, cm.colors(nc))
}
else {
spe.color <- dmat.color(1-D, byrank=FALSE, cm.colors(nc))
}
spe.o <- order.single(1-D)
speo.color <- spe.color[spe.o, spe.o]
op <- par(mfrow=c(1,2), pty="s")
if (diag) {
plotcolors(spe.color, rlabels=attributes(D)$Labels,
main="Dissimilarity Matrix",
dlabels=attributes(D)$Labels)
plotcolors(speo.color, rlabels=attributes(D)$Labels[spe.o],
main="Ordered Dissimilarity Matrix",
dlabels=attributes(D)$Labels[spe.o])
}
else {
plotcolors(spe.color, rlabels=attributes(D)$Labels,
main="Dissimilarity Matrix")
plotcolors(speo.color, rlabels=attributes(D)$Labels[spe.o],
main="Ordered Dissimilarity Matrix")
}
par(op)
}
# Usage:
# coldiss(D = dissimilarity.matrix, nc = 4, byrank = TRUE, diag = FALSE)
# If D is not a dissimilarity matrix (max(D) > 1), then D is divided by max(D)
# nc number of colours (classes)
# byrank= TRUE equal-sized classes
# byrank= FALSE equal-length intervals
# diag = TRUE print object labels also on the diagonal
# Example:
# coldiss(spe.dj, nc=9, byrank=F, diag=T)
Here is an abbreviated version of PatristicDistMatrix100BS:
CDC-B043_1995 CDC-A267_1994 CDC-A161_1992 CDC-C931_1998
CDC-B043_1995 0 0.00099 0.00099 0.00166
CDC-A267_1994 0.00099 0 0.00066 0.00133
CDC-A161_1992 0.00099 0.00066 0 0.00133
CDC-C931_1998 0.00166 0.00133 0.00133 0
I hope this provides all the relevant information and thank you for any help you can provide even if it's a completely different function.
There is nothing wrong in the code. The main problem I think is some other packages you have loaded. I also had same problem but when I tried separately it worked well and as you require. Just remove other packages or calculate separately. For more details have a look on the code of chapter three of this document (http://adn.biol.umontreal.ca/~numericalecology/numecolR/). Here is the code I work with.
(vegan must be loaded after ade4 to avoid some conflicts)
library(ade4)
library(vegan)
library(gclus)
library(cluster)
library(FD)
files must be in the working directory. You can search this file from internet from this link (https://github.com/JoeyBernhardt/NumericalEcology)
source("coldiss.R")
source("panelutils.R")
Then calculate your dissimilarity matrix and plot using the code
BCD <- vegdist(df[-1])
coldiss(BCD, byrank = FALSE, diag = TRUE)
Hopefully it will work.

Microarray Limma package, in topTable function don't assign ID for probsets column

I tried a tutorial by Daniel Swan ,it works perfectly well. But I'm facing a problem in topTable function of limma package.
The "topTable" function create a "probeset list" but this probset list have not "ID" header (other columns name is their sample name, but Probe list column have not name (ID)).
At the result, when I am runing:
gene.symbols <- getSYMBOL(probeset.list$ID, "hgu133plus2")
I'm getting the following error
Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype):
'keys' must be a character vector
topTable is:
logFC AveExpr t P.Value adj.P.Val B
204779_s_at 7.367790 4.171707 72.77347 3.284937e-15 8.969850e-11 20.25762
207016_s_at 6.936667 4.027733 57.39252 3.694641e-14 5.044293e-10 19.44987
209631_s_at 5.192949 4.003992 51.24892 1.170273e-13 1.065182e-09 18.96660
my expression Set achieved by simpleaffy (gcrma) package.
I'm runing R 3.0.2 under windows 7 with latest bioconductor packages, simpleaffy_2.38.0 , limma_3.18.13 and anotation files: hgu133plus2.db_2.10.1 ,hgu133plus2probe_2.13.0, hgu133plus2cdf_2.13.0
I would be very thankful, if somebody could help me.
The IDs are not stored as an ID column, but as the rownames of the table. Change the line to:
gene.symbols <- getSYMBOL(rownames(probeset.list), "hgu133plus2")
If you want there to be an ID column instead of using row names, you can assign one with:
probeset.list$ID = rownames(probeset.list)
According to the documentation of the toptable function, the ID column will exist if and only if there are duplicated gene names:
If ‘fit’ had unique rownames, then the row.names of the above
data.frame are the same in sorted order. Otherwise, the row.names
of the data.frame indicate the row number in ‘fit’. If ‘fit’ had
duplicated row names, then these are preserved in the ‘ID’ column
of the data.frame, or in ‘ID0’ if ‘genelist’ already contained an
‘ID’ column.
In the other examples you've seen ID used, there must have been duplicate gene names in the input. This makes sense because R typically doesn't like having duplicated rownames (but has no problem having duplicate IDs in a column).
Hope my piece of working codes can make your question clear:
library(limma) # загружаем нужную библиотека
library(siggenes)
library(cluster)
library(stats)
data <- read.table("AneurismDataAllProbesGenesisLog2NormalizedExperAndGenes.tab", sep = "\t", header = TRUE) # read from file
q = as.matrix(data) # данные в матрицу
b = as.matrix(cbind(data[, 2:10], data[, 11:14])) # cмежные колонки данных
m = normalizeQuantiles(b, ties=TRUE)
f = data.frame(condition = c(0,0,0,0,0,0,0,0,0,1,1,1,1)) # дизайн
fit = lmFit(m, f) # линейная модель
e = eBayes(fit) # тест Байеса
volcanoplot(e, coef=1, highlight=5, names=data$GeneName, xlab="Log Fold Change", ylab="Log Odds", pch=19, cex=0.67, col = "dark blue") # график-вулкан
z = rownames(m) = data[, 1]
hc <- hclust(dist(m), "ave") # кластерграмма
plot(hc)
plot(hc, hang = -1)
print(e$coefficients) # output eBayes coefficients
print(e$p.value) # get out the P values
toptable(e) # select 10 most differentialy expressed genes, the disadvantage that it outputs only the gene row number and not the name
printresult <-toptable(e) # assign the result to a variable
write.csv(printresult, file = "eBayesTableAneurism", row.names = TRUE) # write to the file in the current folder
volcanoplot(e, coef=1, highlight=10, names=data[,1], xlab="Log Fold Change", ylab="Log Odds", pch=19, cex=0.67, col = "red") # график-вулкан c именами
volcanoplot(e, coef=1, highlight=5, names=data[,1], xlab="Log Fold Change", ylab="Log Odds", pch=19, cex=0.67, col = "blue") # график-вулкан с именами (Volcano with gene names)

Resources