Plotting a subset of data from a prcomp matrix without re-running prcomp - r

I am asking a question similar to one posted 2 years ago that never received a full answer (subset of prcomp object in R). P.S. Sorry for commenting on that post to ask for an answer.
Basically, my question is the same. I have generated a PCA table using prcomp from 10000+ genes and 1700+ cells spanning 7 timepoints. Plotting all of the cells in a single plot makes it difficult to see anything.
I would like to plot each timepoint separately, using the same PCA results table (i.e. without re-running prcomp).
Thanks Dean for giving me tips on posting. Thinking of a way to describe my dataset without actually loading it here would take me a week, I believe. I also tried the
dput(droplevels(head(object,2)))
option, but it produced too much output because my dataset is so large. In short, it is a large single-cell expression matrix of the kind commonly used with packages such as Seurat (https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html). EDIT: I have posted a screenshot of a subset of my matrix.
Sorry, I don't know how to re-create this or even export it in a text format. But this is what I can provide:
My TPM matrix has 16541 rows (defining genes), and 1798 columns (defining cells).
In it, I have "re-labelled" my columns based on timepoints, using codes such as:
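# value = TRUE returns the matching column names directly.
# The trailing "*" is dropped: in a regex, "-*" means "zero or more hyphens", not a wildcard.
D0 <- grep("20180419-24837-1-", colnames(TPM), value = TRUE) #D0: 286 cells
D7 <- grep("20180419-24837-2-", colnames(TPM), value = TRUE) #D7: 237 cells
D10 <- grep("20180419-24947-5-", colnames(TPM), value = TRUE) #D10: 304 cells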
...... and I continued to label each timepoint.
Each timepoint was also given a specific colour.
rc <- rep("white", ncol(TPM))
rc[grep("20180419-24837-1-", colnames(TPM))] <- "magenta"
...... and I continued to give colour to each timepoint.
I performed a PCA using this code:
pcaRes<-prcomp(t(log(TPM+1)), center= TRUE, scale. = TRUE)
Then I proceeded to plot a PCA plot using:
plot(pcaRes$x[,1], pcaRes$x[,2], xlab="PC1", ylab="PC2",
cex=1.0, col= rc, pch=16, main="")
Then I wanted to plot a PCA plot with only D0, using the same PCA output (pcaRes). This is where I am stuck.
P.S. If anyone can advise an easier way to provide example data from my large matrix, I welcome any help. Thanks so much! Sorry, I am very new to bioinformatics.

Stack Exchange for Bioinformatics is where you will need to go to ask questions or learn about the packages and functions you need for your area of specialty. Stack Exchange for Bioinformatics is linked with Stack Overflow, so you will just need to join; you'll have the same login.
Classes S3, S4 and Base.
This is a very basic overview of classes in R. Think of a class as a parent from whom you inherit skills or abilities; as a result you are able to achieve certain tasks better than others, and in some cases you will not be able to do the task at all.
In R, as in all programming, parent classes are created to save re-inventing the wheel, so that the average person does not have to repeatedly write a function to do something simple like plot() a graph. This machinery is hidden; to access it, you inherit from the parent. The child reads the traits off the parent(s), and then it either performs the task or gives you a cryptic error message.
Base and S3 classes work well together; they are like the working-class people of the R world. S4 is a specialized class system made for specific fields of study, providing the functionality needed in their industry. This means you can only use certain Base and S3 functions with S4 objects; most are just not compatible. So it's nothing you've done wrong: plot() and ggplot() just have the wrong parent(s) to work with your dataset.
Typical Base and S3 class dataframe: box-like structure. Along the left-hand side are all the column names, nice and neatly stacked on top of each other.
Seurat S4 class dataframe: tree-like structure, formatted to be read by specific function(s).
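To make the S3/S4 distinction concrete, here is a minimal sketch in base R. It uses the built-in USArrests data as a stand-in, and the S4 class "Container" is hypothetical, defined only for this illustration (it is not Seurat's class). It shows that the object returned by prcomp is an ordinary S3 list accessed with $, whereas S4 slots are accessed with @.

```r
# prcomp returns an S3 object: a plain list with a class attribute.
pca <- prcomp(USArrests, scale. = TRUE)
class(pca)         # "prcomp"
isS4(pca)          # FALSE
scores <- pca$x    # list elements are reached with $

# A hypothetical S4 class, defined here only for illustration.
setClass("Container", representation(data = "matrix"))
obj <- new("Container", data = as.matrix(USArrests))
isS4(obj)          # TRUE
m <- obj@data      # S4 slots are reached with @

# Once a plain matrix has been pulled out of the S4 object,
# base functions such as prcomp work on it as usual.
pca_from_s4 <- prcomp(m, scale. = TRUE)
```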
Well hope that helps and I wish you well in your career. Cheers Conrad
Ps if this helps, then click the arrow up. :)

Thanks @ConradThiele for your suggestion, I will check out that site.
I had a chat with other bioinformaticians around the institute. My query has little to do with the object being an S4 class, since I am performing prcomp outside of the package: I extracted my matrix from the object and then ran prcomp on it.
The solution is simple: run prcomp on the full dataset, convert the prcomp output into a dataframe, add columns for additional details such as "timepoint", create new sub-dataframe(s) containing only the "timepoint"/"variable" of interest, and then plot these using "plot" or whatever function you use.
This was not my solution but came from a bioinformatician at my institute whom I went to for help. Hope this helps others! Thanks again for your time.
P.S. If I have the time, I will post a copy of the code that was suggested to me soon.
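The steps described in this answer can be sketched roughly as follows. This is not the code from the institute, only a minimal illustration under stated assumptions: the TPM matrix is simulated, and the cell names and the timepoints vector are made up for the example. Only the prcomp call and the plot arguments follow the question.

```r
# Simulated stand-in for the real TPM matrix (genes in rows, cells in columns).
set.seed(1)
n_genes <- 50
timepoints <- rep(c("D0", "D7", "D10"), times = c(20, 15, 25))  # hypothetical labels
TPM <- matrix(rexp(n_genes * length(timepoints), rate = 0.1),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0(timepoints, "_cell", seq_along(timepoints))))

# 1. Run prcomp once, on the full dataset.
pcaRes <- prcomp(t(log(TPM + 1)), center = TRUE, scale. = TRUE)

# 2. Convert the scores (one row per cell) into a data frame
#    and add a column holding each cell's timepoint.
scores <- as.data.frame(pcaRes$x)
scores$timepoint <- timepoints

# 3. Build a sub-data-frame for the timepoint of interest.
d0 <- scores[scores$timepoint == "D0", ]

# 4. Plot only those cells. Taking the axis limits from the full result
#    keeps the per-timepoint plots comparable with each other.
plot(d0$PC1, d0$PC2, xlab = "PC1", ylab = "PC2",
     cex = 1.0, col = "magenta", pch = 16, main = "D0",
     xlim = range(scores$PC1), ylim = range(scores$PC2))
```

Because the rows of pcaRes$x are in the same order as the columns of TPM, the same subset can also be taken directly from the matrix, e.g. pcaRes$x[timepoints == "D0", 1:2], without building a data frame first.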

Related

Add extra explanatory layer onto PCA

I am trying to run a PCA on a dataframe which is accompanied by a metadata table. The PCA table is all normalized, scaled, etc.; the metadata, however, is not.
I want the PCA not only to cluster based on the dataframe, but also to have the option of adding one or more columns from the metadata table as explanatory variables. Again, these are not scaled and normalized with the main dataset. Also, I am not looking to color the plots by a certain data column; I want the column to be considered in the actual clustering.
I am aware that this sounds kind of vague, but I am having a hard time finding the exact words. After looking around for a bit I found demixed PCA, which seems to be very close to what I want to achieve. Sadly, there is no package in R to run it.
Any recommendations are welcome and thank you in advance.

R newbie- is there a way to separate or filter out items listed in a single cell for plotting purposes?

Problem
R and Stack Overflow newbie here, so please be patient with me. I am currently working on a data.frame that will act as a summary of various modeling approaches used to predict either fall events or fall rates within an in-patient setting, based on a range of hospital, environmental and individual-level variables.
My data is in long format and some studies have several rows (I have created a row for each model type, with some studies having built multiple). For some columns (i.e., Model performance) I have multiple entries separated by a comma (e.g., C-statistic, Hosmer-Lemeshow test, likelihood ratio, and so forth). My question is, is there a way to separate these so I can create a barplot in ggplot2 that shows the prevalence of different methods and there is one bar per statistic/test type, with the height of the bar being a count of the number of instances in the data frame it occurs? At the moment this obviously does not work as some bars have a label that contains all of the values (i.e, C-statistic, Hosmer-Lemeshow test, likelihood ratio), which means there can be multiple bars that contain "c-statistic" for example, because the list is slightly different.
Screenshots and code
I have attached a screenshot of my data.frame below. The column I refer to is "Statistic.reported"
Screenshot of data.frame:
I have also attached an image of what happens when I create a basic barplot with the following code:
Bar <- ggplot(Modelling.Data, aes(x=Statistic.reported)) +geom_bar()+ theme_classic()
Image of plot using current basic code:
Things I have tried
I have tried using the tidyr package function separate_rows; my code for this was as follows:
separate_rows(Modelling.Data,Modelling.Data$Statistic.reported, sep = ",")
From this I got an error that said "Can't subset columns that don't exist".
Hopefully, this makes sense, but I'm really new to all of this so if you need anything else please tell me. Any tips or advice would be hugely appreciated! Apologies in advance for my complete lack of knowledge.
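One likely cause of the "Can't subset columns that don't exist" error is that separate_rows expects the bare column name (Statistic.reported) rather than Modelling.Data$Statistic.reported, which passes the column's values as if they were column names. A minimal sketch of the corrected call follows; the small data frame here is made up to stand in for the one in the screenshot, and the separator pattern ",\\s*" is an assumption that entries are separated by a comma and optional spaces.

```r
library(tidyr)

# Made-up stand-in for Modelling.Data (the real one is only shown as a screenshot).
Modelling.Data <- data.frame(
  Study = c("A", "B", "C"),
  Statistic.reported = c("C-statistic, Hosmer-Lemeshow test",
                         "C-statistic",
                         "Likelihood ratio, C-statistic"),
  stringsAsFactors = FALSE
)

# Pass the column by its bare name; the separator also swallows
# the space after each comma so labels are not left with leading blanks.
long <- separate_rows(Modelling.Data, Statistic.reported, sep = ",\\s*")

# One row per study/statistic pair, so each statistic is counted on its own.
counts <- table(long$Statistic.reported)
barplot(counts, ylab = "Count")
```

The long data frame can equally be passed to the ggplot call from the question in place of Modelling.Data. Labels that differ only in capitalisation (e.g. "c-statistic" vs "C-statistic") would still be counted separately unless they are harmonised first.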

R: [Indicspecies package] multipatt function: extract values from summary.multipatt

I am working with the multipatt function from the 'indicspecies' package and am unable to extract the summary values. Unfortunately I can't print the whole summary and am left with partial information for my model. The reason is the huge amount of data that needs to be printed from the summary (300.000 different species, 3 groups, 6 comparable combinations).
This is what happens with summary being saved (pre-code incl.):
x <- multipatt(data, ...)
sumx <-summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based around the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems not resolved (and is not exactly about the same function, rather the base function indval).
I was wondering if anyone has experience with the indicspecies package and knows a way to extract the info from the summary.
It is possible to extract significance and other information from the other saved data from the model, but it might be nice to just get a quick complete overview from the data.
ps. I tried
options(max.print=1000000)
but this didn't solve it for me.
I used to capture the summary output for a multipatt object, but don't any more because the p-values reported are not corrected for multiple testing. To answer the OP's question, you can capture the summary output using capture.output,
e.g.
dat.multipatt.summary<-capture.output(summary(dat.multipatt, indvalcomp=TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the summary output actually isn't helpful. To be clear ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer for how to correct the p-values here https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
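To illustrate why assigning the summary gives NULL while capture.output still works, here is a minimal base R sketch. The function print_only_summary is hypothetical and only mimics the behaviour shown in the question (printing to the console and returning NULL); it is not the indicspecies code. The p-values are made up, and p.adjust with method = "fdr" is shown as one common correction, not necessarily the one used in the linked answer.

```r
# Hypothetical function that prints its results and returns NULL,
# mimicking the behaviour of the summary shown in the question.
print_only_summary <- function(p) {
  cat("p-values:\n")
  print(p)
  invisible(NULL)
}

# Made-up p-values for three species.
p <- c(sp1 = 0.001, sp2 = 0.02, sp3 = 0.04)

# Assigning the result keeps only the return value, which is NULL.
res <- print_only_summary(p)

# capture.output() collects the printed lines as a character vector instead.
txt <- capture.output(print_only_summary(p))

# Correcting p-values for multiple testing with base R,
# here using the Benjamini-Hochberg ("fdr") method.
p_adj <- p.adjust(p, method = "fdr")
```

With a real multipatt result, the uncorrected p-values are held in the sign table quoted above, so the correction would be applied to that column rather than to text captured from the summary.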
I don't have any experience with this package, and since you haven't provided the data, it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check the object.size or class or something else of x to see if it indeed has any content.
Also, instead of accessing all the contents of summary(x) together, you can use @ to access its slots (similar to $ in a dataframe).
If you need further assistance, it'd be better to provide at least a small subset or some other sample data so that the community can work with it.

Updating existing plot instead of creating new in a for loop

I am attempting a homework problem where I am tasked to plot the histogram that results from a Galton board experiment, essentially creating a normal distribution by adding one value at a time and updating the histogram after each trial (ball). I would like to find a way to update the histogram after each addition of a new value to the distribution; instead, my code currently makes a whole ton of plots.
So far I've set up a vector with length=1000 (though theoretically I should be able to apply my final code to a vector of any length?) and created a loop to add values to it using rbinom() with 200 "pegs" with a probability of 50% (falling left or right).
x <- numeric(1000) #create vector length of 1000 values of 0
for (i in 1:1000) {
  x[i] <- sum(rbinom(200, 1, 0.5))
  hist(x, freq = FALSE)
}
I have the hist() call within the for loop (this may be a cardinal sin in R...), which as you can imagine produces 1000 graphs! Definitely not the right way to go about this. Is there any way to just update on top of the previous plot? I'm thinking of things like abline(), lines(), etc., which (as far as I can tell) just add lines on top of an already existing plot in R without creating a new one. This is probably because the data associated with those functions isn't the same as the data in a vector? Anyway, I haven't been able to figure this out with Google. I haven't tried using ggplot2 or the animation packages yet, though I'm only vaguely familiar with the former and I imagine there's a learning curve.
A final note: I'm fairly new to R, so I'd appreciate unrelated advice on the above code, but I also think it's very productive to work things out on your own, so I would prefer hints and/or general advice instead of pasting working code.
Thank you very much in advance for your help!

R plot data.frame to get more effective overview of data

At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function where you can plot every column of a dataframe against every other column, as in:
For the dependency between issue.age and duration this plot is actually interesting because you can clearly see that high issue ages come with shorter policy durations (because there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual". In fact you can't see anything from them. I would like to see at one glance whether the distribution of issue ages has changed over the different issue years, something like
where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset that I put in because then I can also do it faster manually in Excel.
So my question is, is there an easy way to plot each column of a matrix against every other column with more flexible chart types than with the standard plot(data.frame)?
Try the ggpairs() function from the GGally library. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
library(GGally) # provides ggpairs()
data(tips, package = "reshape")
ggpairs(tips)
