The pvclust package in R provides the pvclust() function. The example in its help file includes this call:
boston.pp <- pvpick(boston.pv)
This is supposed to print out the clusters with high p-values. The output of this function is:
$clusters
$clusters[[1]]
[1] "rm" "medv"
$clusters[[2]]
[1] "zn" "dis"
$clusters[[3]]
[1] "crim" "indus" "nox" "age" "rad" "tax" "ptratio" "lstat"
$edges
[1] 3 5 9
I have a lot of trouble understanding what the output means, especially since I have very limited technical background on cluster analysis. In particular, I don't understand the meaning of the vector of names under each cluster. Can someone explain this for me? Thanks!
https://cran.r-project.org/web/packages/pvclust/pvclust.pdf
describes pvclust:
For data expressed as (n x p) matrix or data frame, we assume that the data is n observations of p objects, which are to be clustered. The i’th row vector corresponds to the i’th observation of these objects and the j’th column vector corresponds to a sample of j’th object with size n
Output of pvpick:
clusters - a list of character string vectors. Each vector corresponds to the names of objects in each cluster.
Have you plotted the dendrogram of the pvclust output? pvclust treats each column of the Boston dataset as an object to be clustered, so each vector in the pvpick output simply lists the column names that fall into one cluster. You will see the same groups in the dendrogram if you plot it.
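To make the "columns are the objects" point concrete, here is a minimal sketch using only base R. It is not pvclust itself: it uses hclust() with correlation distance and average linkage (which, as far as I know, match pvclust's defaults) on the built-in mtcars data, and it cuts the tree into an arbitrary k = 3 groups instead of selecting clusters by bootstrap p-value. The dataset and the choice of k are assumptions made purely for illustration.

```r
# Cluster the COLUMNS (variables) of a data frame, as pvclust does.
# Assumption: mtcars stands in for the Boston data; k = 3 is arbitrary.
d  <- as.dist(1 - cor(mtcars))        # correlation distance between columns
hc <- hclust(d, method = "average")   # average-linkage hierarchical clustering

plot(hc)                              # dendrogram whose leaves are column names

groups   <- cutree(hc, k = 3)         # assign each column to one of 3 groups
clusters <- split(names(groups), groups)
print(clusters)                       # a list of name vectors, one per cluster
```

The printed list has the same shape as the $clusters element returned by pvpick: each entry is a vector of variable names that ended up in the same branch of the dendrogram.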
I would like to know if there is any way (I'm sure there is) to get the elements of the additive relationship matrix A in R.
I already have the pedigree and I was successful in getting the A matrix in two different ways:
by using the function makeA from the pedigree package:
library(pedigree)
makeA(pedigree_renum, which = pedigree_renum$ID=="1-2372") #for all the animals
#> [1] TRUE
but I can not get the elements from the matrix
by using the function getA from the pedigreemm package. In this case I get the 2372 x 2372 A matrix:
class(pedigree_general)
#> [1] "pedigree"
attr(,"package")
#> [1] "pedigreemm"
matrizA<-getA(pedigree_general)
class(matrizA)
#> [1] "dsCMatrix"
attr(,"package")
#> [1] "Matrix"
But I can't find out how to save certain elements of the matrix, such as the elements above the diagonal.
Hope some of you can help me figure this out!
Different approaches to obtain the same result are welcome :)
Greetings from Buenos Aires.
From Pedigree's documentation on makeA:
Makes the A matrix for a part of a pedigree and stores it in a file called A.txt
What you have missed, if I'm not mistaken, is that the matrix you are searching for should be loaded from the file A.txt, which is the output file of the command. Example:
id <- 1:6
dam <- c(0,0,1,1,4,4)
sire <- c(0,0,2,2,3,5)
ped <- data.frame(id,dam,sire)
makeA(ped,which = c(rep(FALSE,4),rep(TRUE,2)))
A <- read.table("A.txt")
After printing A, here is the matrix:
Let me know if I'm missing something.
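Once the relationship matrix is available as an ordinary R matrix, the elements above the diagonal can be pulled out with upper.tri(). The sketch below uses a small hypothetical 3 x 3 symmetric matrix (the animal names and values are made up, not taken from any pedigree). For a sparse dsCMatrix such as the one returned by getA, converting with as.matrix() first should work at this size, though I have not tested that on the asker's data.

```r
# Hypothetical symmetric matrix standing in for the A matrix.
ids <- c("a1", "a2", "a3")
A <- matrix(c(1.00, 0.50, 0.25,
              0.50, 1.00, 0.50,
              0.25, 0.50, 1.00),
            nrow = 3, dimnames = list(ids, ids))

# Row/column positions of every element strictly above the diagonal.
idx <- which(upper.tri(A), arr.ind = TRUE)

# One row per pair: the two animal names and their relationship coefficient.
upper <- data.frame(row   = rownames(A)[idx[, "row"]],
                    col   = colnames(A)[idx[, "col"]],
                    value = A[idx],
                    stringsAsFactors = FALSE)
print(upper)
```

The resulting data frame can be saved with write.table() or write.csv() like any other.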
I am learning about cross-validation.
In the lines below, input and query are both data frames.
my.knn <- get.knnx(input,query,k=2)
nn.index <- my.knn$nn.index
What does the second line mean? What will nn.index be?
my.knn is a list. The second line takes the element named nn.index out of that list so you can work with it as a variable of its own.
EXAMPLE OF GETTING ELEMENTS OUT OF A LIST
stats <- list("mean" = 10, "data" = c(0, 10 ,20))
#just get the average out
my.average <- stats$mean
So a list can hold different kinds of results from your analysis and can mix types (numbers, strings, vectors). The $ syntax extracts one named element of the list.
If you type my.knn at the prompt you will see its contents, with each element marked with $. This will help you see what is in your list.
In the example:
> stats
$mean
[1] 10
$data
[1] 0 10 20
SPECIFICS ON FUNCTION
Assuming you are using the FNN package, I looked at the get.knnx documentation here: http://www.inside-r.org/packages/cran/fnn/docs/get.knn
The output is a list containing:
nn.index: an n x k matrix for the nearest neighbor indice(s).
nn.dist: an n x k matrix for the nearest neighbor Euclidean distances.
So the list returned by your function has these two elements: the first holds the indices of the nearest neighbours and the second holds the corresponding distances.
Trust this helps.
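To show what those two matrices look like without needing FNN installed, here is a brute-force sketch in base R. It is not how FNN computes the result; it only produces matrices of the same shape and meaning for a tiny hypothetical input and query (both made up for this illustration).

```r
# Hypothetical data: 3 reference points and 2 query points in 2-D.
input <- data.frame(x = c(0, 1, 5), y = c(0, 0, 0))
query <- data.frame(x = c(0.9, 4), y = c(0, 0))
k <- 2

# Euclidean distances from each query row (rows) to each input row (columns).
nq <- nrow(query)
d  <- as.matrix(dist(rbind(query, input)))[seq_len(nq),
                                           nq + seq_len(nrow(input)),
                                           drop = FALSE]

# For every query point: which input rows are closest, and how far away.
nn.index <- t(apply(d, 1, function(r) order(r)[seq_len(k)]))
nn.dist  <- t(apply(d, 1, function(r) sort(r)[seq_len(k)]))

print(nn.index)  # row i holds the input row numbers nearest to query point i
print(nn.dist)   # row i holds the matching distances
```

Each row of nn.index refers back to row numbers of input, so input[nn.index[1, ], ] gives the neighbours of the first query point.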
I am studying http://cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf
Here's my question: how do I extract the names from an ordered vector? The problem in the book asks for the gene identifiers of the three genes (from the patients in disease stage B1) with the largest mean as output.
The data set is from package "ALL"
source("http://bioconductor.org/biocLite.R")
biocLite("ALL")
Here's what I've got so far:
library("ALL")
data("ALL")
B1 <- exprs(ALL[,ALL$BT=="B1"])
hist(B1)
mean(B1)
meanB1 <- apply(exprs(ALL[,ALL$BT=="B1"]),1,mean)
omeanB1 <- order((meanB1), decreasing=TRUE)
I'm wondering if there is a particular function I can call in R to extract just the names of the genes. In the package "golub", there is a golub.gnames to help extract the gene names.
It seems to me that you're almost there. Once you have the order, you can apply it to meanB1:
head(meanB1[omeanB1])
# AFFX-hum_alu_at 31962_at 31957_r_at 40887_g_at 36546_r_at
# 13.41648 13.16671 13.15995 13.10987 12.94578
# 1288_s_at
# 12.80290
To get the names of the top three genes, you can do:
names(meanB1[omeanB1])[1:3]
# [1] "AFFX-hum_alu_at" "31962_at" "31957_r_at"
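The same idea can be written without the intermediate order vector by sorting the named vector directly. The sketch below uses a small hypothetical named vector (gene names and values are made up) so it runs without the ALL package.

```r
# Hypothetical per-gene means; the names play the role of gene identifiers.
meanB1 <- c(geneA = 5.2, geneB = 13.4, geneC = 9.1, geneD = 12.8)

# sort() keeps the names attached, so the top names can be read off directly.
top3 <- names(sort(meanB1, decreasing = TRUE))[1:3]
print(top3)
```

On the real data, rowMeans(B1) computes the same per-gene means as apply(B1, 1, mean) and keeps the row names.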
I am analysing some gene expression data with R. I would like to do differential gene expression analysis with limma's eBayes (limma is part of Bioconductor), but to do that I need to have my expression data as an eset object. The thing is, I only have preprocessed data and do not have the CEL files that I could convert directly to an eset object. I tried searching the Internet but couldn't find a solution. The only thing I found was that it IS possible.
Why eBayes:
It should give robust results even with only two or three samples in some of the groups, and I do indeed have 3 groups of 2 to 3 samples each.
In detail what I have and want to do:
I have expression data, already as log-transformed, normalized intensity values. The data is in an expression matrix. There are about 20 000 rows; each row is a gene and the row names are the official gene names. There are 22 columns and each column corresponds to one cancer sample. I have different cancer subtypes there and would like to compare, for example, the gene expression of the subtype 1 samples to that of the subtype 2 samples. Below is a two-row, five-column example of what my matrix looks like.
Example matrix:
SAMP1 SAMP2 SAMP3 SAMP4 SAMP5
GENE1 123.764 122.476 23.4764 2.24343 123.3124
GENE2 224.233 455.111 124.122 112.155 800.4516
The problem:
To evaluate the differential gene expression with eBayes I would need the eset object out of this expression data and I have honestly no idea how to go about that step. :(
I am very grateful for every bit of info that can help me out! If someone can suggest another reliable method for small sample size comparisons, that might solve my problem as well.
Thank you!
Using an ExpressionSet seems to be quite similar to a SummarizedExperiment which is also prevalent in Bioconductor packages. From what I understand, there is nothing special about using one or the other in a package--in my experience, it's considered as a generalized container for data in order to standardize the data set format across Bioconductor packages.
From the vignette on Bioconductor:
Affymetrix data will usually be normalized using the affy package. We will assume here that the data is available as an ExpressionSet object called eset. Such an object will have a slot containing the log-expression values for each gene on each array which can be extracted using exprs(eset).
In other words, there's nothing special about the data in an ExpressionSet. An ExpressionSet is simply a bunch of related experimental data strung together into one object, and it appears that I can create a new one directly from a plain matrix:
library(Biobase)  # ExpressionSet() and exprs() come from Biobase
library(limma)
# counts is the assay data I already have.
dim(counts)
# [1] 64102 8
# Creates a new ExpressionSet object (quite bare, only the assay data)
asdf <- ExpressionSet(assayData = counts)
# Returns the data you put in.
exprs(asdf)
This works on my setup.
The second part that you need to consider is the design matrix for the differential expression comparison. You will need predefined factors to go along with your samples (probably within a phenoData argument to ExpressionSet) and then create a model.matrix using R's formula syntax. Formulas look similar to: dependent ~ factor1 + factor2 + co:related. Note that factor1 is a factor category or dimension, not just one level.
Once you have that, you should be able to run lmFit. I've actually not used limma much before but it appears to be similar to edgeR's scheme.
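As a rough illustration of the design-matrix step, here is a sketch in base R for five samples split into two subtypes. The group labels and sample names are hypothetical; the real ones would come from your own sample annotation. It only builds the matrix and does not call limma.

```r
# Hypothetical subtype labels, one per sample (column) of the expression matrix.
group <- factor(c("sub1", "sub1", "sub1", "sub2", "sub2"))

# "~ 0 + group" gives one indicator column per subtype, with no intercept.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
rownames(design) <- paste0("SAMP", 1:5)
print(design)
```

A matrix like this is what gets passed as the design argument to lmFit; the comparison between subtypes is then expressed as a contrast between its columns.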
I decided to make this an answer to help any other poor sod who runs into the same problem. I figured it out myself after going through the links kindly given in the comments.
ExpressionSet() does take a matrix and turn it into an eSet object fine. I just had to make sure the data was a matrix instead of a data frame.
I have created a correlation matrix with an external program (SparCC). I have calculated p-values from the same data in SparCC as well and I end up with two objects which I imported into R, let's call them corr and pval and
> ncol(corr)==nrow(corr)
[1] TRUE
> ncol(pval)==nrow(pval)
[1] TRUE
and
> colnames(corr)==rownames(pval)
[1] TRUE ...
and the same the other way around.
Since the matrices (or should I be using data.frame?) are fairly large (about 1000 items), I would like to extract the significant correlations from the corr matrix by looking up their p-values in the pval matrix. I have looked into doing something with apply:
extracted.values <- apply(corr, nrows(corr), which(pval<0.1))
But since the part with which isn't really a function, it outputs an error.
Since the which command outputs a list of positions in the pval matrix, I'm a bit at a loss as to how to retrieve the colnames and rownames for each desired item.
Is there an easier way of doing what I want, like creating a correlation object from scratch in R (is this at all possible?) which contains both corr and pval matrices and extracting the significant values? I have found this solution in Python, but a solution with R would be much appreciated if it's less complicated than what I think it is.
thanks for any help!
edit: the python example doesn't keep headers.
You can simply do
corr[pval < 0.1]
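Logical indexing like this returns the values as a plain vector, so the row and column names are lost. If the names are needed, one possible approach is which() with arr.ind = TRUE, sketched below on small hypothetical corr and pval matrices (the item names and numbers are made up). The upper.tri() mask is an assumption that the matrices are symmetric, so each pair is reported once and the diagonal is skipped.

```r
# Hypothetical 3 x 3 correlation and p-value matrices with matching names.
items <- c("A", "B", "C")
corr <- matrix(c( 1.0,  0.8,  0.1,
                  0.8,  1.0, -0.6,
                  0.1, -0.6,  1.0), nrow = 3, dimnames = list(items, items))
pval <- matrix(c(0.00, 0.01, 0.70,
                 0.01, 0.00, 0.04,
                 0.70, 0.04, 0.00), nrow = 3, dimnames = list(items, items))

# Positions of significant pairs above the diagonal, as (row, col) indices.
keep <- which(pval < 0.1 & upper.tri(pval), arr.ind = TRUE)

# Look up names and values at those positions.
sig <- data.frame(row  = rownames(corr)[keep[, "row"]],
                  col  = colnames(corr)[keep[, "col"]],
                  corr = corr[keep],
                  pval = pval[keep],
                  stringsAsFactors = FALSE)
print(sig)
```

This keeps corr and pval as matrices; there is no need to convert them to data frames for the lookup.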