Plot a correlation matrix from a file in R

I tried to find a solution but couldn't.
I have a .txt file with a correlation matrix which was previously created from other records. It looks like this:
CXCL9 IL2RG TAP1
CXCL9 1
IL2RG 0.828 1
TAP1 0.605 0.631 1
CD274 0.564 0.57 0.679
LAG3 0.624 0.676 0.681
I am trying to generate a correlogram, and for that I've done this:
library(corrplot)
m <- read.table("file.txt", sep="\t", header=TRUE, check.names = FALSE)
mymatrix <- as.matrix(m)
corrplot(mymatrix, type = "lower", method="number")
And I get this message:
Error in corrplot(mymatrix, type = "lower") : The matrix is not in [-1, 1]!
How can I make a simple correlogram with this data (maybe with a heatmap)?
The desired output: [image of the intended correlogram, not reproduced here]

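A minimal sketch of the heatmap route the question suggests (not from the original thread; it assumes the file parses as above, with gene names as row names and NAs in the unfilled upper triangle, and sidesteps corrplot entirely):
library(ggplot2)
library(reshape2)

m <- read.table("file.txt", sep = "\t", header = TRUE, check.names = FALSE)
m$gene <- rownames(m)
## melt to long form: one row per (gene, partner) pair
long <- melt(m, id.vars = "gene", variable.name = "partner", value.name = "r")

ggplot(long, aes(partner, gene, fill = r)) +
    geom_tile() +
    geom_text(aes(label = round(r, 2)), na.rm = TRUE) +
    scale_fill_gradient2(limits = c(-1, 1), na.value = "white")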
Related

Obtaining mean phylogenetic tree branch lengths from an ensemble of phylogenetic trees

I have a set of phylogenetic trees, some with different topologies and different branch lengths. Here is an example set:
(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);
(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);
(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);
(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);
(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);
(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);
(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);
(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);
In this example there are 2 unique topologies; R's ape::unique.multiPhylo shows this (assuming the example above is saved to a file tree.fn):
tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = F, use.edge.length = F)
> length(tree)
[1] 8
> length(unique.tree)
[1] 2
My question is: how do I get a list of trees, one per unique topology in the input list, where the branch lengths are a summary statistic, such as the mean or median, across all trees with that topology?
In the example above, it should return the first tree as-is, because its topology is unique, plus one tree with the topology shared by the other trees and mean or median branch lengths.
If I understand well, you want to sort all the trees into groups by unique topology (e.g. in your example, the first group contains one tree, etc.) and then measure some stats for each group?
You can do that by first grouping the topologies into a list:
library(ape)

set.seed(5)
## Generating 20 four-tip trees (hopefully some will share topologies!)
tree_list <- rmtree(20, 4)
## How many unique topologies?
length(unique(tree_list))
## Sorting the trees by topology
tree_list_tmp <- tree_list
sorted_tree_list <- list()
counter <- 0
while(length(tree_list_tmp) != 0) {
    counter <- counter + 1
    ## Is the first tree equal to any of the trees in the list?
    equal_to_tree_one <- unlist(lapply(tree_list_tmp, function(x, base)
        all.equal(x, base, use.edge.length = FALSE), base = tree_list_tmp[[1]]))
    ## Saving the identical trees
    sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
    ## Removing them from the list
    tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
    ## Repeat while there are still some trees!
}
## The number of groups should equal the number of unique topologies
length(sorted_tree_list) == length(unique(tree_list))
## Giving them names for fanciness
names(sorted_tree_list) <- paste0("topology", 1:length(sorted_tree_list))
Then, for all the trees in each unique topology group, you can extract different summary statistics with a function. Here, for example, I measure the branch length mean, standard deviation, and 5% and 95% quantiles.
## function for getting some stats
get.statistics <- function(unique_topology_group) {
    ## Extract the branch lengths of all the trees
    branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))
    ## Apply some statistics
    return(c(n = length(unique_topology_group),
             mean = mean(branch_lengths),
             sd = sd(branch_lengths),
             quantile(branch_lengths, prob = c(0.05, 0.95))))
}
## Getting all the stats
all_stats <- lapply(sorted_tree_list, get.statistics)
## and making it into a nice table
round(do.call(rbind, all_stats), digits = 3)
#              n  mean    sd    5%   95%
# topology1    3 0.559 0.315 0.113 0.962
# topology2    2 0.556 0.259 0.201 0.889
# topology3    4 0.525 0.378 0.033 0.989
# topology4    2 0.489 0.291 0.049 0.855
# topology5    2 0.549 0.291 0.062 0.882
# topology6    1 0.731 0.211 0.443 0.926
# topology7    3 0.432 0.224 0.091 0.789
# topology8    1 0.577 0.329 0.115 0.890
# topology9    1 0.473 0.351 0.108 0.833
# topology10   1 0.439 0.307 0.060 0.795
Of course you can tweak it to get your own desired stats, or even get the stats per tree per group (using a double lapply: lapply(sorted_tree_list, lapply, get.statistics), or something like that).
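The question also asked for one representative tree per topology with summarised branch lengths, which the answer above stops short of. A hedged sketch of that extra step (not from the original answer; it assumes every tree in a group shares identical tip labels and topology, and matches edges between trees by the set of tips each edge subtends):
library(ape)

## Canonical id per edge: the sorted tip labels below that edge
edge_ids <- function(tree) {
    sapply(seq_len(nrow(tree$edge)), function(i) {
        node <- tree$edge[i, 2]
        tips <- if (node <= Ntip(tree)) tree$tip.label[node]
                else extract.clade(tree, node)$tip.label
        paste(sort(tips), collapse = "|")
    })
}

## One representative tree per group, with mean branch lengths
mean_branch_tree <- function(group) {
    ref <- group[[1]]
    ids <- edge_ids(ref)
    ## Align every tree's edge lengths to the reference edge order
    mat <- sapply(group, function(tr) tr$edge.length[match(ids, edge_ids(tr))])
    ## as.matrix() keeps this working for single-tree groups too
    ref$edge.length <- rowMeans(as.matrix(mat))
    ref
}

summary_trees <- lapply(sorted_tree_list, mean_branch_tree)
Swapping rowMeans for a per-row median (apply(as.matrix(mat), 1, median)) gives the median version.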

Is there a way to get predicted topic probabilities from a seededlda model in R?

I am using R seededlda package to train a seeded lda model (https://github.com/koheiw/seededlda).
model = textmodel_seededlda(dfmt, dict, residual = TRUE)
topics(model) returns the predicted topic for each document. I was wondering if there is a way to get the probabilities of all topics for each document?
I tried using posterior from topicmodels but it gave an error.
probabilities <- topicmodels::posterior(seeded_lda)
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘posterior’ for signature ‘"textmodel_lda", "missing"’
I have faced the same issue and used a workaround based on the source code of seededlda: just fetch the probabilities directly from the output of seededlda::textmodel_seededlda().
Below you will find a replicable example of that workaround using the built-in corpus quanteda::data_char_ukimmig2010 and a made-up seed dictionary.
### packs
require(quanteda)
require(seededlda)
require(dplyr)
## prep the dfm
dfmat <- dfm(quanteda::data_char_ukimmig2010,
             remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
             remove = stopwords("en"))
## prep the seed dictionary (a made-up one)
my_dict <- dictionary(list(
    civil_rights = c("bisexual", "charter of rights", "civil libert", "disabilit",
                     "discriminat", "diversity", "equal employm", "equal opportunit"),
    healthcare = c("health", "hematolog", "hepatolog", "hiv", "obesity", "influenza"),
    labour = c("hiring", "income", "salar", "union")))
## fit the model
slda <- textmodel_seededlda(x = dfmat, dictionary = my_dict,
                            case_insensitive = TRUE, verbose = TRUE, residual = TRUE)
## extract the topic props across the documents
tidy_topics <- as_tibble(slda$theta, rownames = "doc_id")
## it should end up looking like this
tidy_topics
# A tibble: 9 x 5
  doc_id       civil_rights healthcare  labour  other
  <chr>               <dbl>      <dbl>   <dbl>  <dbl>
1 BNP               0.561      0.204   0.0500  0.185
2 Coalition         0.00382    0.0725  0.897   0.0267
3 Conservative      0.0102     0.00205 0.781   0.207
4 Greens            0.00467    0.693   0.201   0.101
5 Labour            0.0191     0.0924  0.840   0.0484
6 LibDem            0.018      0.23    0.538   0.214
7 PC                0.0574     0.631   0.123   0.189
8 SNP               0.00725    0.123   0.471   0.399
9 UKIP              0.00980    0.0770  0.259   0.654
For reusability, I ended up writing these two functions for fetching word-topic probabilities and document-topic probabilities. Both take the output of seededlda::textmodel_seededlda() as input.
### packs
require(dplyr)
require(tidyr)

### get document-topic probabilities
get_doc_topic_probs <- function(slda) {
    out <- slda$theta %>%
        as_tibble(rownames = "doc_id")
    return(out)
}

### get word-topic probabilities
get_word_topic_probs <- function(slda) {
    out <- slda$phi %>%
        as_tibble(rownames = "topic") %>%
        pivot_longer(cols = !matches("topic"), names_to = "token", values_to = "prob")
    return(out)
}
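A minimal usage sketch (assuming the slda object fitted in the example above):
doc_probs <- get_doc_topic_probs(slda)    # one row per document, one column per topic
word_probs <- get_word_topic_probs(slda)  # long table: topic, token, prob
head(word_probs)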

Writing a formula and trying to loop it in R

A2.DM19C.MICSw… A2.DM19C.MICSw… A2.IF12C.MICSwm… A2.DM12C.MICSwm… A2.HA12C.MICSwm…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.131 0.0516 -0.294 1.29 0.144
2 -0.175 -0.0250 -0.183 1.31 0.146
3 -0.128 0.0691 -0.294 1.31 0.0224
4 -0.175 0.0359 -0.294 1.31 0.136
5 -0.142 0.0169 -0.295 1.31 0.0239
6 -0.252 -0.0918 -0.272 1.33 -0.0263
I have a head of data that looks like this; the dataset is called data_LOG. I want to z-score all these columns. Because there are over 1000 columns, I want to loop the formula so that I can quickly convert all these values to z-scores. The equation for a z-score is (y - mean(y)) / sd(y). So I made a function called zscore.
zscore <- function(r){
    Cal <- (r - mean(r)) / sd(r)
    return(Cal)
}
This works just fine when tested against the first column. I want the z-score data to be in a new data frame I call dataZ.
dataZ <- data_LOG
However, when I attempt to loop the formula, I get an error.
for (i in 1:ncol(data_LOG)) {
    dataZ[,i] <- zscore(data_LOG[,i])
}
Error in is.data.frame(x) :
  'list' object cannot be coerced to type 'double'
In addition: Warning message:
In mean.default(r) : ...
I am unsure what this means or how to fix it. Please help!
The error occurs because data_LOG is a tibble, and data_LOG[, i] on a tibble returns a one-column tibble rather than a vector, so mean() is handed a list. Extracting the column with [[i]] avoids that. If you want to keep your approach, try this:
dataZ <- NULL
for (i in 1:ncol(data_LOG)) {
    z <- zscore(data_LOG[[i]])  # [[i]] extracts the column as a numeric vector
    dataZ <- cbind(dataZ, z)    # append, keeping the original column order
}
dataZ <- as.data.frame(dataZ)
names(dataZ) <- names(data_LOG)
You could also use apply() in combination with a standardizing function such as scale():
dataZ <- as.data.frame(apply(data_LOG, 2, scale))  # MARGIN = 2 applies scale() over columns
HTH :)
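In fact scale() already works column-wise on a whole data frame, so a one-liner sketch (assuming every column of data_LOG is numeric):
dataZ <- as.data.frame(scale(data_LOG))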

How can I get Mplus's "Results in Probability Scale" output via the R MplusAutomation package?

I'm performing a latent class analysis using Mplus, and trying to get the output into R via the MplusAutomation package (since I'm doing this many times, I want to avoid copying by hand). I'd like to grab the "Results in Probability Scale" subsection in the "Model Results" section of the Mplus output, but I'm unable to find it in the R object MplusAutomation creates from the .out file. That object contains a "parameters" data frame which includes other information from the "Model Results" section, so is it a matter of "Results in Probability Scale" being a simple transformation of the other model results data, that I could do myself in R? If not, is there some other way of recreating the results of this section from what info I do have in R? Or is the information I'm looking for stored somewhere else in the output?
The "Results in Probability Scale" section does not seem to be parsed by MplusAutomation.
However, you can convert the threshold parameters to the probability scale yourself using the formula prob = 1 / (1 + exp(est)).
For example, the code below should reproduce the results in probability scale from this UCLA example:
library(dplyr)
library(tidyr)
library(MplusAutomation)
# Fetch & write output from the UCLA LCA example to a temp file
lca_ex_out <- tempfile(fileext = '.out')
fileConn <- file(lca_ex_out)
writeLines(readLines('https://stats.idre.ucla.edu/stat/mplus/dae/lca1.out'), fileConn)
close(fileConn)
lca_ex_result <- readModels(lca_ex_out)  # extract results from temp file
# select threshold parameters, convert to probability & lay out as a table
lca_ex_result$parameters$unstandardized %>%
    filter(paramHeader == 'Thresholds') %>%
    mutate(est_probscale = 1 / (1 + exp(est))) %>%
    select(param, LatentClass, est_probscale) %>%
    spread(LatentClass, est_probscale)
Output:
    param     1     2     3
1 ITEM1$1 0.908 0.312 0.923
2 ITEM2$1 0.337 0.164 0.546
3 ITEM3$1 0.067 0.036 0.426
4 ITEM4$1 0.065 0.056 0.418
5 ITEM5$1 0.219 0.044 0.765
6 ITEM6$1 0.320 0.183 0.471
7 ITEM7$1 0.113 0.098 0.512
8 ITEM8$1 0.140 0.110 0.619
9 ITEM9$1 0.325 0.188 0.349
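For binary items the two category probabilities Mplus reports sum to 1, so the other category is just the complement of est_probscale. A small hedged extension of the pipeline above (the column name est_probscale_other is made up here):
lca_ex_result$parameters$unstandardized %>%
    filter(paramHeader == 'Thresholds') %>%
    mutate(est_probscale = 1 / (1 + exp(est)),       # as above
           est_probscale_other = 1 - est_probscale)  # complementary category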

neural network: in neurons[[i]] %*% weights[[i]] : requires numeric/complex matrix/vector arguments

I am trying to use the neural network method on my data and I am stuck.
I am always getting the message:
in neurons[[i]] %*% weights[[i]] : requires numeric/complex matrix/vector arguments
The facts are:
I am reading my data using read.csv.
I am adding a link to a file with some of my data, I hope it helps:
https://www.dropbox.com/s/b1btx0cnhmj229p/collineardata0.4%287.2.2017%29.csv?dl=0
I have no NA in my data (I checked twice).
The outcome of str(data) is:
'data.frame': 20 obs. of 457 variables:
$ X300.5_alinine.sulphate : num 0.351 0.542 0.902 0.656 1 ...
$ X300.5_bromocresol.green : num 0.435 0.603 0.749 0.314 0.922 ...
$ X300.5_bromophenol.blue : num 0.415 0.662 0.863 0.345 0.784 ...
$ X300.5_bromothymol.blue : num 0.2365 0.0343 0.4106 0.3867 0.8037 ...
$ X300.5_chlorophenol.red : num 0.465 0.1998 0.7786 0.0699 1 ...
$ X300.5_cresol.red : num 0.534 0.311 0.678 0.213 0.821 ...
... (str output truncated)
I have tried to use model.matrix.
The code I have was tried on different datasets (e.g. iris) and it worked.
Can anyone please try and suggest what is wrong with my data or my data reading?
The code is:
require(neuralnet)
require(MASS)
require(grid)
require(nnet)
#READ IN DATA
data<-read.table("data.csv", sep=",", dec=".", head=TRUE)
dim(data)
# Create Vector of Column Max and Min Values
maxs <- apply(data[,3:459], 2, max)
mins <- apply(data[,3:459], 2, min)
# Use scale() and convert the resulting matrix to a data frame
scaled.data <- as.data.frame(scale(data[,3:459],center = mins, scale = maxs - mins))
# Check out results
print(head(scaled.data,2))
#create formula
feats <- names(scaled.data)
# Concatenate strings
f <- paste(feats,collapse=' + ')
f <- paste('data$Type ~',f)
# Convert to formula
f <- as.formula(f)
f
#creating neural net
nn <- neuralnet(f,model,hidden=c(21,15),linear.output=FALSE)
str(scaled.data)
apply(scaled.data,2,function(x) sum(is.na(x)))
There are multiple things wrong with your code.
1. Your dependent variable Type is a factor with multiple levels. neuralnet only accepts numeric input, so you must convert it to a binary indicator matrix with model.matrix:
y <- model.matrix(~ Type + 0, data = data[, 1, drop = FALSE])
# fix up the names for as.formula
y_feats <- gsub(" |\\+", "", colnames(y))
colnames(y) <- y_feats
scaled.data <- cbind(y, scaled.data)
# concatenate strings
f <- paste(feats, collapse = ' + ')
y_f <- paste(y_feats, collapse = ' + ')
f <- paste(y_f, '~', f)
# convert to formula
f <- as.formula(f)
2. You didn't pass your scaled.data into the neuralnet call.
nn <- neuralnet(f, scaled.data, hidden = c(21, 15), linear.output = FALSE)
The function will run now, but you will need to look into multiclass problems (beyond the scope of this question). This package does not output straight probabilities, so you must be cautious.
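To illustrate, a hedged sketch of pulling class predictions out of the fitted net (not from the original answer; it assumes the objects defined above and, per the caveat, treats the raw outputs as scores rather than probabilities):
out <- neuralnet::compute(nn, scaled.data[, feats])  # raw network outputs
scores <- out$net.result                             # one column per Type level
colnames(scores) <- y_feats
pred <- y_feats[max.col(scores)]                     # highest-scoring class per row
table(pred, data$Type)                               # rough confusion table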
