Why is PCA analysis in R using order as a variable? - r

I am doing a PCA in R. I am not by any means a programmer, so please have some patience with me if I'm too vague or use incorrect terminology :)
So, for context, I am doing PCA of a giant dataset of US counties, with a ton of demographic data!
pcatest <- prcomp(countydata, center = TRUE, scale = TRUE)
Beforehand, this prcomp function was not accepting my countydata dataframe, saying it was "not numeric," so I needed to unlist it, use the as.numeric function, create a matrix and turn it back into a dataframe.
Anyways, after doing this, I noticed that the PCA results were definitely a bit weird. For most counties in the US, PC1 was around -0.9, but in nearly every county in Iowa, as well as some in Illinois and Indiana, values ranged from 20 to 40. Counties in Alabama, Alaska, and Arizona also had significantly lower than average values, despite being highly demographically different. I meticulously checked my data, and nothing seemed off that would lead to this PCA failure. I checked to see if numerical order or row number was accidentally made a variable analyzed by PCA, and it didn't seem like it!
Now, I do not know what to do. Maybe this error has something to do with what I had to do in order to use the prcomp function, maybe not. Has anyone else had this issue? If so, I would really like help. Thank you! :)
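One thing worth checking, offered as a guess rather than a confirmed diagnosis: the unlist / as.numeric / matrix round trip described above can silently change the data. If any column was stored as a factor, as.numeric() returns its level codes (which follow alphabetical order), and rebuilding a matrix from one long vector can misplace values if the dimensions or fill order differ from the original. A minimal sketch, using a made-up toy data frame (countydata_raw and its columns are hypothetical), of converting column by column instead:

```r
# Hypothetical toy stand-in for countydata (all names and values are made up)
countydata_raw <- data.frame(
  state  = factor(c("Alabama", "Iowa", "Alaska", "Iowa")),
  pop    = c("1000", "250", "4000", "300"),  # numbers stored as text
  income = c(41000, 52000, 60000, 51000),
  stringsAsFactors = FALSE
)

# as.numeric() on a factor returns the level codes (alphabetical order),
# not anything measured about the county
state_codes <- as.numeric(countydata_raw$state)

# Safer: convert text columns one at a time, then keep only the columns
# that are genuinely numeric, so the row/column layout never changes
num <- countydata_raw
num$pop <- as.numeric(num$pop)
num <- num[sapply(num, is.numeric)]

pcatest <- prcomp(num, center = TRUE, scale. = TRUE)
summary(pcatest)
```

Running str(countydata) before converting anything shows which columns are not numeric, which is usually the quickest way to see why prcomp complained.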

Related

How to eliminate no-neighbour data when doing spatial clustering

Hello wonderful people!
I'm trying to run a cluster analysis on some data that I've mapped on a choropleth map. It's the % participation rates per constituency.
I'm trying to run a Moran_result test, but am unable to get the data in list format. I keep getting the error: "Error in nb2listw(nb) : Empty neighbour sets found".
I assume this is because some constituencies (like the Isle of Wight) have no neighbours. I can't find online which constituencies are islands or have no neighbours, and wondered if there is a speedier way to bypass or solve this issue in R, rather than googling all 573 England and Wales constituencies.
Can you help?
Ideas: I thought that maybe I could create "fake" polygons to surround all constituencies with no value so at the very least they could be listed. Or maybe there is a way of searching which have no neighbours and then removing them? Both of these I'm unsure how to do.
My goal: I want to get a few spatial clusters where the participation rates are similar and then extract that data so I can compare it to a regression model I have. If you know of another way to do this, other than above, please let me know.
I've tried: new_dataframe <- filter(election_merged_sf, !is.na(nb)), but this doesn't actually remove any objects. I assume this is because it is testing whether there are numeric neighbours, when it needs to be done spatially.
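Both ideas above (finding the constituencies with no neighbours and removing them, or keeping them in the list anyway) can be sketched with the spdep package. This is a minimal illustration on made-up square polygons, not on the real constituency data; the object toy and its columns are hypothetical, and it assumes the sf and spdep packages are installed.

```r
library(sf)
library(spdep)

# Helper that builds a unit square with its lower-left corner at (x, y)
square <- function(x, y) {
  st_polygon(list(rbind(c(x, y), c(x + 1, y), c(x + 1, y + 1),
                        c(x, y + 1), c(x, y))))
}

# Hypothetical toy map: three touching squares and one isolated "island"
toy <- st_sf(
  name = c("A", "B", "C", "Island"),
  rate = c(0.61, 0.58, 0.70, 0.66),
  geometry = st_sfc(square(0, 0), square(1, 0), square(2, 0), square(10, 10))
)

nb <- poly2nb(toy)        # contiguity neighbours (warns about the island)
n_neigh <- card(nb)       # number of neighbours for each polygon
toy$name[n_neigh == 0]    # names of the polygons with no neighbours

# Option 1: drop the no-neighbour polygons and rebuild the neighbour list
toy_connected <- toy[n_neigh > 0, ]
nb_connected <- poly2nb(toy_connected)
lw <- nb2listw(nb_connected)

# Option 2: keep every polygon and allow empty neighbour sets
lw_all <- nb2listw(nb, zero.policy = TRUE)
```

Either weights list can then be passed on to moran.test(); with option 2, zero.policy = TRUE generally has to be passed there as well.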

Is there an R function/package for determining WWF biomes from latlong coordinates?

Very new here, hi, postgraduate student who is tearing their hair out.
I have inherited a dataset of secondary data collected from research papers on species populations and their genetic diversity, and have been adding more appropriate data to this sheet in preparation to perform some analyses. Part of the analysis will include subsetting the data by biome type to create comparisons between the biomes, and therefore I've been cleaning up and trying to add this information to the data I've added. I have latlong coordinates for each population (in decimal degrees) and it appears that the person working on this before me was able to use these to determine the biome for each point, specifically following the Olson et al. (2001)/WWF 14 biome categorisation, but at this point I'll take anything.
However, I have no idea how this was achieved and truly can't find anything to help. After googling just about every combination of "r package biomes WWF latitude longitude species assign derive convert" that you can think of, the only packages that I have located are non-functioning in my version of RStudio (e.g. biomeara, ggbiome), leaving me with no idea if they'd even work, and all other pages that I have dragged up seem to already have biome data included with their dataset. Other research papers I have found describe assigning biomes based on latlong coords and give 0 steps on how to actually achieve this. Is it possible in R? Am I losing my mind? Does anyone know of a way to do this, whether in R or not, that preferably doesn't take forever, as I have over 8000 populations to assess? Many thanks!
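One general way to do this kind of lookup in R is a spatial join with the sf package: turn the coordinates into points and ask which polygon each point falls in. The sketch below uses two made-up square polygons with a BIOME column standing in for a biome map; with the real WWF terrestrial ecoregions shapefile the polygons would instead come from st_read(), and the assumption that its biome attribute is called BIOME should be checked against the downloaded file. The objects biomes, pops, and their columns are hypothetical.

```r
library(sf)

# Helper that builds a square polygon with its lower-left corner at (x, y)
square <- function(x, y, size = 10) {
  st_polygon(list(rbind(c(x, y), c(x + size, y), c(x + size, y + size),
                        c(x, y + size), c(x, y))))
}

# Hypothetical stand-in for the biome polygons (lon/lat, WGS84)
biomes <- st_sf(
  BIOME = c(1, 13),
  geometry = st_sfc(square(0, 0), square(10, 0), crs = 4326)
)

# Hypothetical population table with decimal-degree coordinates
pops <- data.frame(pop_id = c("A", "B", "C"),
                   lon = c(5, 15, 50),
                   lat = c(5, 5, 50))
pts <- st_as_sf(pops, coords = c("lon", "lat"), crs = 4326)

# Spatial join: each point picks up the BIOME of the polygon it falls in;
# points outside every polygon get NA
joined <- st_join(pts, biomes, join = st_intersects)
joined$BIOME
```

The join is vectorised, so a few thousand points should not need a loop.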

custom code for compact letter display from pairwise table output

I would like to create a custom code that creates a compact letter display from a pairwise test I have performed.
I have done this with pairwise t-tests with success (packages for this exist), and I am also familiar with the package library(multcomp) when I run linear models and the function cld() to get the compact letter displays, but they will not work for my specific case here.
I work with Kaplan-Meier survival data often, and after I run the pairwise_survdiff() function to see if any statistical differences exist between groups (found in the packages library(survival) and library(survminer)), I am easily able to extract a table to display all pairwise comparisons and their corresponding p-values. I have included an example for you here today (see df below).
When there are many comparisons to do by hand, it becomes a mess to find out which groups are different / similar, and it's prone to human error when many levels exist. Up to now, I've always done it by hand. I would like to change this.
Could someone help me with a code that helps do this automatically?
Here is a mock dataframe df with 10 treatments (named treatment-1....treatment-10), and the rows are filled with p-values. Let's assume anything below p<0.05 as significant. However, it would be very cool to have a code that would allow a more conservative approach, and say set the desired cut off for statistical significance (say anything below p<0.01 as significant for example).
Thanks for your help, and again, here is a play dataframe
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = T)
While reflecting on this, I believe I found an answer on my own, with library(multcompView) and library(rcompanion)
Nonetheless, I think it's important, since I have seen / heard this question multiple times. Here is how I solved my problem
library(rcompanion)    # fullPTable()
library(multcompView)  # multcompLetters()
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = TRUE)
# Convert the triangular table of pairwise p-values into a full matrix
PT1 <- fullPTable(df)
# Groups that share a letter are not significantly different at the threshold
multcompLetters(PT1,
                compare = "<",
                threshold = 0.05,
                Letters = letters,
                reversed = FALSE)
This gives me the desired output with the compact letter displays between groups. Additionally, one could make the test more or less conservative by changing the threshold= argument.
Very happy with the result. This has bothered me for a while. I hope it is useful to other members

Plotting a subset of data from a prcomp matrix without re-running prcomp

I am asking a question similar to one posted 2 years ago that has no full answer (subset of prcomp object in R). P.S. sorry for commenting on it to ask for an answer.
Basically, my question is the same. I have generated a PCA table using prcomp that has 10000+ genes, and 1700+ cells, made up of 7 timepoints. Plotting all of them in a single file makes it difficult to see.
I would like to plot each timepoint separately, using the same PCA results table (ie without re-running prcomp).
Thanks Dean for giving me tips on posting. To think of a way to describe my dataset without actually loading it here, will take me a week I believe. I also tried the
dput(droplevels(head(object,2)))
option, but it was just too much info since I have such a large dataset. In short, it is a large single-cell matrix of the kind commonly seen in packages such as Seurat (https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html). EDIT: I have posted a screenshot of a subset of my matrix.
Sorry I don't know how to re-create this or even export a text format.. But this is what I can provide:
My TPM matrix has 16541 rows (defining genes), and 1798 columns (defining cells).
In it, I have "re-labelled" my columns based on timepoints, using codes such as:
D0 <- grep("20180419-24837-1-", colnames(TPM), value = TRUE)   # D0: 286 cells
D7 <- grep("20180419-24837-2-", colnames(TPM), value = TRUE)   # D7: 237 cells
D10 <- grep("20180419-24947-5-", colnames(TPM), value = TRUE)  # D10: 304 cells
...... and I continued to label each timepoint.
Each timepoint was also given a specific colour.
rc<-rep("white", ncol(TPM))
rc[grep("20180419-24837-1-", colnames(TPM))] <- "magenta"
...... and I continued to give colour to each timepoint.
I performed a PCA using this code:
pcaRes<-prcomp(t(log(TPM+1)), center= TRUE, scale. = TRUE)
Then I proceeded to plot a PCA plot using:
plot(pcaRes$x[,1], pcaRes$x[,2], xlab="PC1", ylab="PC2",
cex=1.0, col= rc, pch=16, main="")
Then I wanted to plot a PCA plot only with D0, using the same PCA output (pcaRes). This is where I am stuck.
P.S. If anyone else has an easier way of advising how to input an example data here from my large matrix, I welcome any help. Thanks so much! Sorry I am very new in bioinformatics.
Stack Exchange for Bioinformatics is where you will need to go to ask question(s) or learn about the package(s) and function(s) you need to deal with your area of specialty. Stack Exchange for Bioinformatics is linked with Stack Overflow, so you will just need to join; you'll have the same login.
Classes S3, S4 and Base.
This is a very basic overview of classes in R. Think of a class as the parent you inherit all of your skills or abilities from; as a result you are able to achieve certain tasks better than others, and in some cases you will not be able to do the task at all.
In R and all programming, to save re-inventing the wheel, parent classes are created so that the average person does not have to repeatedly write a function to do something simple like plot() a graph. This stuff is hidden; to access it, you inherit from the parent. The child reads the traits off the parent(s), and then it either performs the task or gives you a cryptic error message.
Base and S3 classes work well together; they are like the working class people of the R world. S4 is a specialized class made for specific fields of study, to provide the specific functionality needed in their industry. This means you can only use certain Base and S3 functions with S4 class objects; most are just not compatible. So it's nothing you've done wrong; plot() and ggplot() just have the wrong parent(s) to work with your dataset.
Typical Base and S3 Class dataframe: Box like structure. Along the left hand side is all the column names, nice and neatly stacked on top of each other.
Seurat S4 Class dataframe: Tree like structure, formatted to be read by a specific function(s).
Well hope that helps and I wish you well in your career. Cheers Conrad
Ps if this helps, then click the arrow up. :)
Thanks @ConradThiele for your suggestion, I will check out that site.
I had a chat with other bioinformaticians around the institute. My query has little to do with the object being an S4 class, since I am performing prcomp outside of the package. I extracted my matrix out of the object and then ran prcomp on it.
The solution is simple: run prcomp on the full dataset, convert the prcomp output into a dataframe, add columns for additional details such as "timepoint", create new sub-dataframe(s) containing only the "timepoint"/"variable" of interest from the prcomp result, and then plot these using "plot" or whatever function you use.
This was not my solution but came from a bioinformatician I went to for help at my institute. Hope this helps others! Thanks again for your time.
P.S. If I have the time, I will post a copy of the code I suggested soon.
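The steps described above can be sketched roughly as follows. This is not the code from the institute, only a minimal illustration on a small random matrix; the toy TPM, the cell names, and the timepoint labels are made up to mimic the naming scheme in the question.

```r
set.seed(1)

# Hypothetical toy stand-in for the TPM matrix: 50 genes x 12 cells,
# with column names that mimic the three batch prefixes in the question
prefixes <- c("20180419-24837-1-", "20180419-24837-2-", "20180419-24947-5-")
TPM <- matrix(rexp(50 * 12, rate = 0.1), nrow = 50,
              dimnames = list(paste0("gene", 1:50),
                              paste0(rep(prefixes, each = 4), 1:4)))

# Run the PCA once on the full dataset
pcaRes <- prcomp(t(log(TPM + 1)), center = TRUE, scale. = TRUE)

# Turn the scores into a data frame and label each cell with its timepoint
scores <- as.data.frame(pcaRes$x)
scores$timepoint <- NA_character_
scores$timepoint[grepl("20180419-24837-1-", rownames(scores))] <- "D0"
scores$timepoint[grepl("20180419-24837-2-", rownames(scores))] <- "D7"
scores$timepoint[grepl("20180419-24947-5-", rownames(scores))] <- "D10"

# Subset one timepoint and plot it, keeping the axis limits of the full PCA
d0 <- scores[scores$timepoint == "D0", ]
plot(d0$PC1, d0$PC2, xlab = "PC1", ylab = "PC2",
     xlim = range(scores$PC1), ylim = range(scores$PC2),
     col = "magenta", pch = 16, main = "D0")
```

Fixing xlim and ylim to the range of the full score table keeps the per-timepoint plots comparable with each other.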

How to read this graph R

Sorry if this may come across as really stupid, but I did some clustering using k-means in R and plotted this graph.
Could someone please explain to me how to read this plot, or give me the name for these types of plots so I can google further?
Thank you
PS: for clustering experts can you identify any meaningful clusters from this plot?
You effectively ignored all attributes except quantity.
The result is meaningless.
K-means only works well when every axis has a comparable distribution.
Restart from the beginning, with careful preprocessing, scaling, and understanding your data.
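As a rough illustration of the scaling step, here is a minimal sketch on made-up data (the toy data frame and its columns quantity and price are hypothetical, not the asker's data). It standardises each column with scale() before calling kmeans(), so that no single attribute dominates the distance calculation just because its numbers are larger.

```r
set.seed(42)

# Hypothetical toy data: quantity is on a far larger scale than price
toy <- data.frame(
  quantity = runif(50, min = 0, max = 10000),
  price    = c(rnorm(25, mean = 2, sd = 0.2), rnorm(25, mean = 8, sd = 0.2))
)

# Unscaled: Euclidean distances are driven almost entirely by quantity
km_raw <- kmeans(toy, centers = 2, nstart = 25)

# Scaled: every column is centred to mean 0 and standard deviation 1
scaled <- scale(toy)
km_scaled <- kmeans(scaled, centers = 2, nstart = 25)

# Compare how the two clusterings divide the same rows
table(raw = km_raw$cluster, scaled = km_scaled$cluster)
```

Standardising is only one preprocessing choice; skewed attributes may also need a transformation such as a log before scaling.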
