Data structure and package for a radial dendrogram in R - r

I'd like to create a radial dendrogram in R, but being new to the software, I don't know if I chose the correct data structure and package.
I've created a YAML file that looks as follows:
(Image: Data structure)
I know the exact hierarchy of the languages, but I need R to calculate the x and y values. I'd use hclust for that, I think?
I found these instructions, for example: https://stats.stackexchange.com/questions/4062/how-to-plot-a-fan-polar-dendrogram-in-r, but they use the mtcars dataset. I'd just like to know whether it makes sense to set up my data as above or whether I should use a different structure. When I try to import the data I get an error message saying I've got more columns than column headers, so I must be doing something wrong.
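Since the hierarchy is already known, one possible route is to write it down as a Newick string and let a tree package compute the layout, with no clustering step. The following is only a minimal sketch: it assumes the ape package is installed, and the language names and groupings are hypothetical placeholders, not the asker's data.

```r
library(ape)

# Hypothetical hierarchy written in Newick notation; inner nodes carry labels.
newick <- "((Spanish,Portuguese,French)Romance,(English,German,Dutch)Germanic)IndoEuropean;"

# Parse the string into a "phylo" tree object.
tree <- read.tree(text = newick)

# type = "fan" draws the tree radially; ape computes the coordinates itself.
plot(tree, type = "fan", show.node.label = TRUE)
```

With this approach the YAML file would only need to be converted to nested parentheses, so a table-style import with read.table() or read.csv() is not required.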

Related

Plotting a subset of data from a prcomp matrix without re-running prcomp

My question is similar to one posted 2 years ago that never received a full answer (subset of prcomp object in R). P.S. Sorry for commenting on that post to ask for an answer.
Basically, my question is the same. I have generated a PCA table using prcomp that has 10000+ genes, and 1700+ cells, made up of 7 timepoints. Plotting all of them in a single file makes it difficult to see.
I would like to plot each timepoint separately, using the same PCA results table (ie without re-running prcomp).
Thanks Dean for giving me tips on posting. To think of a way to describe my dataset without actually loading it here, will take me a week I believe. I also tried the
dput(droplevels(head(object,2)))
option, but it was just too much info since I have such a large dataset. In short, it is a large single-cell matrix of the kind commonly seen in packages such as Seurat (https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html). EDIT: I have posted a screenshot of a subset of my matrix.
Sorry I don't know how to re-create this or even export a text format.. But this is what I can provide:
My TPM matrix has 16541 rows (defining genes), and 1798 columns (defining cells).
In it, I have "re-labelled" my columns based on timepoints, using codes such as:
D0<-c(colnames(TPM[,grep("20180419-24837-1-*", colnames(TPM))])) #D0: 286 cells
D7<-c(colnames(TPM[,grep("20180419-24837-2-*", colnames(TPM))])) #D7: 237 cells
D10<-c(colnames(TPM[,grep("20180419-24947-5-*", colnames(TPM))])) #D10: 304 cells
...... and I continued to label each timepoint.
Each timepoint was also given a specific colour.
rc<-rep("white", ncol(TPM))
rc[grep("20180419-24837-1-*", colnames(TPM))] <- "magenta"
...... and I continued to give colour to each timepoint.
I performed a PCA using this code:
pcaRes<-prcomp(t(log(TPM+1)), center= TRUE, scale. = TRUE)
Then I proceeded to plot a PCA plot using:
plot(pcaRes$x[,1], pcaRes$x[,2], xlab="PC1", ylab="PC2",
cex=1.0, col= rc, pch=16, main="")
Then I wanted to plot a PCA plot with only D0, using the same PCA output (pcaRes). This is where I am stuck.
P.S. If anyone else has an easier way of advising how to input an example data here from my large matrix, I welcome any help. Thanks so much! Sorry I am very new in bioinformatics.
Stack Exchange for Bioinformatics is where you will need to go to ask questions or learn about the packages and functions you need for your area of specialty. Stack Exchange for Bioinformatics is linked with Stack Overflow, so you will just need to join; you'll have the same login.
Classes S3, S4 and Base.
This is a very basic overview of classes in R. Think of a class as a parent from which you inherit skills or abilities: as a result, you can do certain tasks better than others, and in some cases you cannot do a task at all.
In R, as in programming generally, parent classes exist to avoid reinventing the wheel, so that the average person does not have to repeatedly write a function to do something as simple as plot() a graph. This machinery is hidden; to access it, you inherit from the parent. The child reads the traits off the parent(s), and then it either performs the task or gives you a cryptic error message.
Base and S3 classes work well together; they are like the working-class people of the R world. S4 is a specialized class system made so that specific fields of study can provide the functionality needed in their industry. This means you can only use certain Base and S3 functions with S4 objects; most are just not compatible. So it's nothing you've done wrong: plot() and ggplot() just have the wrong parent(s) to work with your dataset.
Typical Base and S3 class data frame: box-like structure. Along the left-hand side are all the column names, neatly stacked on top of each other.
Seurat S4 class data frame: tree-like structure, formatted to be read by specific function(s).
Well hope that helps and I wish you well in your career. Cheers Conrad
Ps if this helps, then click the arrow up. :)
Thanks @ConradThiele for your suggestion, I will check out that site.
I had a chat with other bioinformaticians around the institute. My query has little to do with the object being an S4 class, since I am performing prcomp outside of the package. I extracted my matrix from the object and then ran prcomp on it.
The solution is simple: run prcomp on the full dataset, convert the prcomp output into a data frame, add columns for additional details such as "timepoint", create new data frame(s) containing only the timepoint/variable of interest from the prcomp result, and then plot these subsets using plot() or whatever function you use.
This was not my solution; it came from a bioinformatician at my institute whom I asked for help. Hope this helps others! Thanks again for your time.
P.S. If I have the time, I will post a copy of the code I suggested soon.
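The steps described in the solution above could look roughly like the sketch below. It is not the poster's actual code: the small random matrix stands in for the real TPM matrix, and the timepoint vector is a hypothetical labelling with one entry per cell. The objects pcaRes and TPM reuse the names from the question.

```r
# Synthetic stand-in for the TPM matrix: 50 genes x 30 cells (assumption).
set.seed(1)
TPM <- matrix(rexp(50 * 30, rate = 0.1), nrow = 50,
              dimnames = list(paste0("gene", 1:50), paste0("cell", 1:30)))

# Hypothetical timepoint label for each cell (column of TPM).
timepoint <- rep(c("D0", "D7", "D10"), each = 10)

# Run the PCA once on the full dataset, as in the question.
pcaRes <- prcomp(t(log(TPM + 1)), center = TRUE, scale. = TRUE)

# Convert the scores to a data frame and attach the timepoint labels.
scores <- as.data.frame(pcaRes$x)
scores$timepoint <- timepoint

# Subset the scores (not the input matrix), so prcomp is not re-run.
d0 <- scores[scores$timepoint == "D0", ]

# Plot D0 only, keeping the axis limits of the full PCA for comparability.
plot(d0$PC1, d0$PC2, xlab = "PC1", ylab = "PC2", pch = 16, col = "magenta",
     xlim = range(scores$PC1), ylim = range(scores$PC2), main = "D0 only")
```

Fixing xlim and ylim to the full range is a design choice: it keeps the per-timepoint plots directly comparable with each other.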

Stacked or Grouped Barcharts from Weighted Survey Data (Class="survey.design2" "survey.design") in R

I am working with weighted survey data of the class survey.design2 and survey.design. With the package survey, and the function call svytable, I can create contingency tables for survey data. With these contingency tables, I can then create normal bar-charts using lattice. The standard way for doing this (e.g. barchart(cars ~ mpg | factor(cyl), data=mtcars,...)) doesn't work for this data type.
I am used to working with ggplot2, and would like to create either stacked or grouped bar-charts, if possible even with facet-wraps. Unfortunately, ggplot2 does not know how to deal with data of the type survey.design2 either. As far as I know, there is no add-on that would allow ggplot2 to deal with this kind of data.
So far I have:
sub-set my data set
converted it into class survey.design2 with the function call svydesign(),
plotted multiple bar-charts in one window using grid.arrange(). This sort of provides for a work around for facetting, but still doesn't allow me to create stacked or grouped bar-charts.
I'd be grateful for any suggestions.
Thank you
Good morning MatthewR
I have a data set with 62732 observations and 691 variables.
(Image: Original Data Set)
So any example based on a random number generator should work as well, I guess. I am really just interested in a work around to this issue, not necessarily the final code.
I then convert the data frame into survey.design format using:
df_Survey <- svydesign(id=~1, weights=~IXPXHJ, data=df)
IXPXHJ is the variable by which the original sample data set is weighted so as to represent the entire population. head(df$IXPXHJ) looks something like this:
87.70876
78.51809
91.95209
94.38899
105.32005
56.30210
str(df_Survey) looks something like this.
(Image: Survey Data Structure)
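One possible workaround, sketched here under assumptions, is to let svytable() do the weighting and hand ggplot2 an ordinary data frame built from the resulting table. The survey and ggplot2 packages are assumed to be installed; the data are random, as the question suggests, and the columns region and answer are hypothetical. Only the weight name IXPXHJ and the svydesign() call are taken from the question.

```r
library(survey)
library(ggplot2)

# Random stand-in for the survey data (hypothetical variables).
set.seed(123)
n <- 500
df <- data.frame(
  region = factor(sample(c("North", "South", "East"), n, replace = TRUE)),
  answer = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  IXPXHJ = runif(n, 50, 110)  # sampling weights
)

# Same design call as in the question.
df_Survey <- svydesign(id = ~1, weights = ~IXPXHJ, data = df)

# Weighted contingency table, then a plain data frame that ggplot2 understands.
tab <- svytable(~region + answer, design = df_Survey)
plot_df <- as.data.frame(tab)  # columns: region, answer, Freq

# Stacked bars; use position = "dodge" for grouped bars instead.
p <- ggplot(plot_df, aes(x = region, y = Freq, fill = answer)) +
  geom_col(position = "stack")
print(p)
```

Adding a third variable to the svytable() formula would give a column that can be passed to facet_wrap().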

RangedSummarizedExperiment for DESeq2

I'm trying to use the DESeq2 package in R for differential gene expression, but I'm having trouble creating the required RangedSummarizedExperiment object from my input data. I have found several tutorials and vignettes for doing this, but they all seem to apply to a raw data set that is different from mine. My data has gene names as row names and patient id as column names, and the data is simply integer count data. There has to be a simple way to create the RangedSummarizedExperiment object from this type of input data, but I haven't yet found a way. Can anybody help? Thanks.
I had a similar problem understanding how to use this data structure. I eventually managed to do without it by using DESeqDataSetFromMatrix. You can see an example in the first code block of Modify r object with rpy2 (this code is pure R, rpy2 stuff comes after). In this example, I have genes as rows and samples as columns, so it is likely you will be able to adopt the same approach.
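A minimal sketch of the DESeqDataSetFromMatrix approach mentioned above, assuming DESeq2 is installed from Bioconductor. The count matrix is simulated, and the condition column with its control/treated grouping is a hypothetical sample annotation that the question does not describe.

```r
library(DESeq2)

# Simulated integer counts: genes as rows, patients as columns (assumption).
set.seed(1)
counts <- matrix(rnbinom(100 * 6, mu = 100, size = 1), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("patient", 1:6)))
storage.mode(counts) <- "integer"

# Sample annotation: one row per column of the count matrix, same order.
col_data <- data.frame(
  condition = factor(rep(c("control", "treated"), each = 3)),
  row.names = colnames(counts)
)
stopifnot(identical(rownames(col_data), colnames(counts)))

# Build the DESeq2 object directly from the matrix.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = col_data,
                              design = ~ condition)
dds
```

The row names of the annotation table must match the column names of the count matrix, which is why the sketch checks this before constructing the object.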

reaching max.print on R

I just found a bunch of weather data that I would like to play around with in glmnet in R. First I've been reading and organizing the data in R, and right now I am just trying to look at the raw data of each variable. Unfortunately, each variable has a lot of data and R isn't able to print it all. Is there a way I can view all the raw data in R or just in the file itself? I've tried opening the file in Excel, without success. Thanks!
Try using frequency tables; you can group the values into segments.
You can also use str(), summary(), table(), pairs(), plot() etc. There are several libraries (such as decr) which facilitate analyzing numerical and factor levels. Let me know if you need help with any.
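As a rough illustration of the suggestions above, the following base R sketch inspects a large vector without printing every value. The temperature vector is a hypothetical stand-in for one of the weather variables, and the bin boundaries are chosen arbitrarily.

```r
# Hypothetical stand-in for one large weather variable.
set.seed(42)
temperature <- rnorm(200000, mean = 12, sd = 8)

str(temperature)        # type, length and the first few values
summary(temperature)    # minimum, quartiles, mean, maximum
head(temperature, 20)   # first 20 raw values only

# Frequency table grouped into 10-degree segments.
bins <- cut(temperature, breaks = seq(-40, 60, by = 10))
freq <- table(bins)
print(freq)

# The print limit can be inspected and temporarily raised if really needed.
old <- options(max.print = 5000)
getOption("max.print")
options(old)            # restore the previous setting
```

Raising max.print is rarely useful for data of this size; summaries and binned counts usually say more than a full dump of the values.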

How can I create a dendrogram in R using pre-clustered data created elsewhere?

I have clustering code written in Java, from which I can create a nested tree structure, e.g. the following shows a tiny piece of the tree where the two "isRetired" objects were clustered in the first iteration, and this group was clustered with "setIsRequired" in the fifth iteration. The distances between the objects in the clusters are shown in parentheses.
|+5 (dist. = 0.0438171125324851)
    |+1 (dist. = 2.220446049250313E-16)
        |-isRetired
        |-isRetired
    |-setIsRetired
I would prefer to present my results in a more traditional dendrogram style, and it looks like R has some nice capabilities, but because I know very little about R, I am unclear on how to take advantage of them.
Is it possible for me to write out a tree structure to a file from Java, and then, with a few lines of R code, produce a dendrogram? From the R program, I'd like to do something like:
Read from a file into a data structure (an "hclust" object?)
Convert the data structure into a dendrogram (using "as.dendrogram"?)
Display the dendrogram using "plot"
I guess the question boils down to whether R provides an easy way of reading from a file and converting that string input into an (hclust) object. If so, what should the data in the input file look like?
I think what you are looking for is phylog. You can print your tree in a file in Newick notation, parse that out and construct a phylog object which you can easily visualize. The end of the webpage gives an example of how to do this. You also might want to consider phylobase. Although you don't want the entire functionality provided by these packages, you can piggyback on the constructs they use to represent trees and their plotting capabilities.
EDIT: It looks like a similar question to yours has been asked before here providing a simpler solution. So basically the only thing you will have to code here is your Newick parser or a parser for any other representation you want to output from Java.
The ape (Analysis of Phylogenetics and Evolution) package contains dendrogram drawing functionality, and it is capable of reading trees in Newick format. Because it is an optional package, you'll need to install it. It is theoretically easy to use, e.g. the following commands produce a dendrogram:
> library("ape")
> gcPhylo <- read.tree(file = "gc.tree")
> plot(gcPhylo, show.node.label = TRUE)
My main complaint thus far is that there is little diagnostic information when there is trouble with the syntax of the file containing the tree information in Newick format. I've had success reading these same files with other tools (which in some cases, may be because the tools are forgiving of certain faults in the syntax).
You can also produce a dendrogram using the phylog class from the ade4 package, as shown below.
> library(ade4)
> newickString <- paste(readLines("gc.tree"), collapse = "")
> gcPhylog <- newick2phylog(newickString)
> plot(gcPhylog, clabel.nodes=1)
Both can work with trees in Newick format and both have many plotting options.
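For the hclust route raised in the question, an "hclust" object can also be assembled by hand in base R, without any extra package. The sketch below encodes only the tiny tree fragment from the question. It assumes the standard hclust conventions (negative numbers in merge denote single observations, positive numbers denote earlier merge rows), and the method and dist.method entries are placeholder strings.

```r
# Each row of merge is one clustering step:
# row 1 joins observations 1 and 2; row 2 joins observation 3 with the row-1 cluster.
merge <- matrix(c(-1L, -2L,
                  -3L,  1L), ncol = 2, byrow = TRUE)

hc <- structure(
  list(merge = merge,
       height = c(2.220446049250313e-16, 0.0438171125324851),  # distances from the Java output
       order = c(3L, 1L, 2L),                                   # left-to-right leaf order
       labels = c("isRetired", "isRetired", "setIsRetired"),
       method = "external",        # placeholder text
       dist.method = "unknown"),   # placeholder text
  class = "hclust")

# Convert to a dendrogram and draw it.
dend <- as.dendrogram(hc)
plot(dend)
```

A Java program would then only need to write out the merge steps and heights, for example as a small CSV file that R reads back into the merge matrix and height vector.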
