PCA biplot: show only the variables in R

I ran a PCA on a set of 45,000 genes across 5 different samples. When I draw a biplot, all I see is a mass of text (corresponding to the observation names), and I cannot see the location of my samples. Is there a way to plot only the location of the samples, and not the observations, in a biplot?
Using built-in data from R:
usa <- USArrests
pca1 <- prcomp(usa)
biplot(pca1)
This generates a biplot where all the states (observation names) overlap the variables (standing in for my different samples), such as Rape. Is it possible to plot only the variables (samples), and not the states (observation names)?

biplot.default uses text() to write the name of each observation. Because it does not draw the observations as points, you need to modify the source if you only want the points (and not the labels) to be plotted.
However, you could "hack" it by doing something like:
biplot(pca1, xlabs = rep(".", nrow(usa)))
I hope this is what you're looking for!
Edit: If this is not satisfactory, you can modify the source shown when you run stats:::biplot.default so that it uses points.
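Another route, if you would rather not touch the biplot source, is to draw the variables yourself from the loadings stored in the prcomp result. This is only a minimal sketch with base graphics, not what biplot() does internally (biplot rescales both sets of coordinates); the names load and lim are made up for the example.

```r
if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script

pca1 <- prcomp(USArrests)

# Loadings: one row per variable, one column per principal component.
load <- pca1$rotation[, 1:2]

# Empty canvas sized to the loadings, then one arrow and label per variable.
lim <- c(-1, 1) * max(abs(load)) * 1.2
plot(0, 0, type = "n", xlim = lim, ylim = lim, xlab = "PC1", ylab = "PC2")
arrows(0, 0, load[, 1], load[, 2], length = 0.1, col = "red")
text(load[, 1], load[, 2], labels = rownames(load), pos = 3, col = "red")
```

Only the four variables are drawn, so none of the state names can cover them.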

Related

How to find outliers in R for MDS?

I'm making a MDS plot with Salary and WAR data from baseball players. I have attached it here. Because there are so many data points I just want to label the Baseball Player names for the outlier data points. Any idea on how to do so?
Well, first it depends on the method you want to use to detect the outliers. I created a reproducible example, since you did not provide a dataset. I'm using the mtcars dataset, which ships with R.
For outlier detection I'm using Cook's distance (you can read about that in a post here).
First I set up the data and calculate Cook's distance (cooksd) with cooks.distance from stats, for a linear model of hp on qsec:
# Cook's distance for each car in a linear model of hp on qsec
df <- cbind(mtcars[, c("hp", "qsec")], cooksd = cooks.distance(lm(hp ~ qsec, data = mtcars)))
plot(df$cooksd)
We see that there are two outliers. I'm using a threshold of 4*mean(cooksd) and flag the rows above it in my dataset as follows. Those rows are considered outliers.
# 1 if Cook's distance exceeds 4 times its mean, 0 otherwise
df$outlier <- as.integer(df$cooksd > 4 * mean(df$cooksd))
The last step is to plot the values and label only those above the threshold. I'm using the row.names here, but you can use any column; just filter it the same way as the other data points.
plot(df$hp, df$qsec)
# text() must be called after plot(), not passed to it as an argument
text(df[df$outlier==1,"hp"], df[df$outlier==1,"qsec"], labels=row.names(df[df$outlier == 1,]), cex=0.6, pos=4, col="red")
A big problem with outlier detection is that there are many different methods, and the interpretation must be adjusted for every case. Which method or threshold you should use won't be answered here, since that is more of a discussion for Stack Exchange. You can use Cook's distance, the standard deviation, or whatever you deem most useful. But the solution stays the same:
use your method to find the outliers
label them in your dataset
add text to your plot only for the rows you labelled.
For your three clusters, you just have to fit an lm model for every cluster, label the outliers and then join all three results.
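The per-cluster step could look roughly like the sketch below. It assumes the cluster membership is already available as a column; here mtcars$cyl stands in for it, and flag_outliers, pieces and res are names made up for the example.

```r
# cyl (4, 6 or 8 cylinders) stands in for a cluster label here.
dat <- mtcars[, c("hp", "qsec", "cyl")]

# Fit hp ~ qsec within one cluster and flag rows above 4 * mean Cook's distance.
flag_outliers <- function(d) {
  cd <- cooks.distance(lm(hp ~ qsec, data = d))
  d$cooksd  <- cd
  d$outlier <- as.integer(cd > 4 * mean(cd))
  d
}

# One model per cluster, then join the pieces back into one data frame.
pieces <- lapply(split(dat, dat$cyl), flag_outliers)
res <- do.call(rbind, unname(pieces))

# Rows that would receive a label in the plot.
res[res$outlier == 1, ]
```

The threshold is computed inside each cluster, so a point is only compared with the other members of its own group.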

R: Re-colour and connect certain points in a model-plot

This is my first question here so please excuse any mistakes I (may or may not) make.
The premise:
I have a vegetation dataset containing paired data on different plots for old and new observations. I used the 'openxlsx' package to load my data, and the 'vegan' package to run an NMDS as follows:
mydata <- read.xlsx(mydata)
mydataMDS <- metaMDS(mydata, k=2, trymax=500)
The result is then used for a model via the "envfit()"-function, including environmental variables:
myenvdata <- read.xlsx(myenvdata)
mydataMDS_fit <- envfit(mydataMDS, myenvdata, perm=10000, na.rm=TRUE)
plot(mydataMDS, display="sites")
plot(mydataMDS_fit, p.max=0.01, axis=TRUE)
Now I have a plot of my "mydataMDS" ordination, including the vectors that R calculated for "mydataMDS_fit".
The problem:
I want to colour and connect certain points within this plot. As "mydata" consists of observations within the same plot at different times, I intend to colour all points of old observations in one colour, and all the new ones in a different one. I've read about adding columns in order to group old and new observations, but as I'm working with a model there are no columns. How can I edit my datasets ("mydata", "mydataMDS", "myenvdata", "mydataMDS_fit") in order to show old and new plots in 2 different colours (one colour for old, one colour for new), and connect the paired observations with lines? Or: Is there a possibility to directly re-colour the points within my graphical output via checking for old/new observations?
(Sorry, I feel like my explanation is quite complicated, but I still hope someone may be able to help)
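One possible approach, sketched with base graphics only: keep the grouping in separate vectors that follow the row order of the site scores, rather than trying to add columns to the model object. Everything below is an assumption for illustration. The matrix sc is simulated and stands in for the site scores (with vegan these would come from scores(mydataMDS, display = "sites")), and period and pair are hypothetical vectors you would build from your own plot metadata.

```r
if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script

set.seed(1)
n_pairs <- 6

# Simulated stand-in for the site scores (two columns: NMDS1, NMDS2).
sc <- matrix(rnorm(2 * n_pairs * 2), ncol = 2,
             dimnames = list(NULL, c("NMDS1", "NMDS2")))

# Hypothetical grouping vectors, in the same row order as the scores.
period <- rep(c("old", "new"), times = n_pairs)
pair   <- rep(seq_len(n_pairs), each = 2)

# Split into old and new, each sorted by pair so that rows line up.
is_old <- period == "old"
old <- sc[is_old, , drop = FALSE][order(pair[is_old]), , drop = FALSE]
new <- sc[!is_old, , drop = FALSE][order(pair[!is_old]), , drop = FALSE]

cols <- c(old = "steelblue", new = "firebrick")
plot(sc, type = "n")
segments(old[, 1], old[, 2], new[, 1], new[, 2], col = "grey50")  # connect pairs
points(sc, col = cols[period], pch = 19)                          # colour by period
legend("topright", legend = names(cols), col = cols, pch = 19)
```

Because only points() and segments() are used, the envfit vectors could still be added on top with the plot() call from the question.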

Plot the relationship of each column to a singular column in a table

I have one table of derived vegetation indices for 63 sample sites from different satellites. This gives me a table with 63 observations (sample sites) and 56 variables (1 sample ID, 50 vegetation indices, 4 biomass and 1 LAI). The first column of the table is the sample ID, and the last 5 columns are the biomass and LAI.
I want to generate a plot showing the relationship between a single vegetation index and one of the biomass parameters.
I am able to do this using the plot function, for one observation and variable at a time.
plot(data$Dry10, data$X8047EVImea)
I don't want to run this code 50 times and again by 5 sets for each biomass and LAI parameter.
Is there a way to loop or nested loop this plot function so that I can generate 200 graphs at once?
Also, I will place a regression line in each plot to see what vegetation index will best represent the amount of biomass present at the sample site.
This is my first post on stackoverflow, so please don't hesitate to request more information on the problem if I have missed something.
As noted in my comment you can accomplish this with a faceted plot in the ggplot2 package. This does require a little bit of data re-arrangement that can be accomplished with the reshape2 package. Here is some code that will be close to what you want to do but since I don't completely know your data formats it might take some fixes:
library(ggplot2)
library(reshape2)
vegDat <- data[, 1:51]         # sample ID + 50 vegetation indices
bioDat <- data[, c(1, 52:55)]  # sample ID + 4 biomass columns (add 56 for LAI)
idCol <- names(data)[1]
## melt the data.frames so the biomass and vegetation headers are now variables,
## keeping the sample ID so that values stay matched to their site
vegDatM <- melt(vegDat, id.vars=idCol, variable.name='vegInd', value.name='vegVal')
bioDatM <- melt(bioDat, id.vars=idCol, variable.name='bioInd', value.name='bioVal')
## Join these datasets on the sample ID to create all comparisons to be made
gdat <- merge(vegDatM, bioDatM, by=idCol)
## plot the data in a faceted grid
ggplot(gdat) + geom_point(aes(x=vegVal, y=bioVal)) + facet_grid(vegInd ~ bioInd)
Note that since there are 50 rows of plots, you may want to open a device with a large height (or width if you swap the facets), e.g. pdf('foo.pdf', height=20). Hope this gets you on the right track.
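If you prefer the nested loop the question asks about, including a regression line per panel, a base R sketch could look like this. The data are simulated and much smaller than yours (3 indices, 2 biomass columns). Dry10 is taken from the question, while VI1 to VI3, Wet10, veg_cols, bio_cols and r2 are made-up names.

```r
if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script

set.seed(42)
n <- 63
# Simulated stand-in for the real table: ID, 3 indices, 2 biomass columns.
dat <- data.frame(ID = seq_len(n),
                  VI1 = runif(n), VI2 = runif(n), VI3 = runif(n))
dat$Dry10 <- 2 * dat$VI1 + rnorm(n, sd = 0.2)  # related to VI1 by construction
dat$Wet10 <- rnorm(n)                          # unrelated noise

veg_cols <- c("VI1", "VI2", "VI3")
bio_cols <- c("Dry10", "Wet10")

# R-squared of every index/biomass regression, filled in by the loop.
r2 <- matrix(NA_real_, length(veg_cols), length(bio_cols),
             dimnames = list(veg_cols, bio_cols))

op <- par(mfrow = c(length(bio_cols), length(veg_cols)))
for (b in bio_cols) {
  for (v in veg_cols) {
    fit <- lm(dat[[b]] ~ dat[[v]])
    plot(dat[[v]], dat[[b]], xlab = v, ylab = b)
    abline(fit, col = "red")               # regression line for this panel
    r2[v, b] <- summary(fit)$r.squared
  }
}
par(op)

round(r2, 2)
```

Collecting R-squared in a matrix is one simple way to rank the indices afterwards; with the full table you would write the panels to a multi-page pdf() instead of one par(mfrow) grid.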

How to structure data for R?

So... newbie R user here. I have some observations that I'd like to record using R and be able to add to later.
The items are sorted by weights, and the number at each weight recorded. So far what I have looks like this:
weights <- c(rep(171.5, times=1), rep(171.6, times=2), rep(171.7, times=4), rep(171.8, times=18), rep(171.9, times=39), rep(172.0, times=36), rep(172.1, times=34), rep(172.2, times=25))
There will be a total of 500 items being observed.
I'm going to be taking additional observations over time to (hopefully) see how the distribution of weights changes with use/wear. I'd like to be able to make plots showing either stacked histograms or boxplots.
What would be the best way to format / store this data to facilitate this kind of use case? A matrix, dataframe, something else?
As other comments have suggested, the most versatile (and perhaps most useful) container for your data would be a data frame, which works well with library(ggplot2) for your future plotting needs (such as box plots and various histograms).
Toy example
All the code below does is use your weights vector from above to create a data frame with some dummy IDs and draw a box-and-whisker plot.
library(ggplot2)
IDs<-sample(LETTERS[1:5],length(weights),TRUE) #dummy ID values
df<-data.frame(ID=IDs,Weights=weights) #make data frame with your original `weights` vector
ggplot(data=df,aes(factor(ID),Weights))+geom_boxplot() #box-plot
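Since you want to add observations later, one layout that may help is a long data frame with one row per session and weight, plus the count. This is just a sketch: the session labels and the second set of counts are made up, and obs, later and items are names chosen for the example.

```r
if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script

# One row per (session, weight) with the number of items counted.
obs <- data.frame(
  session = "run1",
  weight  = c(171.5, 171.6, 171.7, 171.8, 171.9, 172.0, 172.1, 172.2),
  count   = c(1, 2, 4, 18, 39, 36, 34, 25)
)

# A later session (made-up numbers) is appended with rbind().
later <- data.frame(session = "run2",
                    weight  = c(171.7, 171.8, 171.9, 172.0),
                    count   = c(10, 60, 70, 19))
obs <- rbind(obs, later)

# Expand the counts to one row per item when a function needs raw values.
items <- obs[rep(seq_len(nrow(obs)), obs$count), c("session", "weight")]

boxplot(weight ~ session, data = items)
```

The compact obs table is easy to save and extend, and the expanded items table has the same shape as the df used in the toy example above, so it can go straight into ggplot2 as well.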

PCA Biplot : A way to hide vectors to see all data points clearly

I am trying to do PCA with R.
My Data has 10,000 columns and 90 rows
I used the prcomp function to do PCA.
Trying to prepare a biplot with the prcomp results, I ran into the problem that the 10,000 plotted vectors cover my datapoints. Is there any option for the biplot to hide the vectors' representation?
OR
I can use plot to get the PCA results. But I am not sure how to label these points according to my datapoints, which are numbered 1 to 90.
Sample<-read.table(file.choose(),header=FALSE,sep="\t")
# scale every column, then drop columns that contain NA after scaling
Sample.scaled<-data.frame(apply(Sample,2,scale))
Sample.scaled.2<-data.frame(t(na.omit(t(Sample.scaled))))
pca.Sample<-prcomp(Sample.scaled.2,retx=TRUE)
pdf("Sample_plot.pdf")
plot(pca.Sample$x)
dev.off()
If you do a help(prcomp) or ?prcomp, the help file tells us everything contained in the object returned by prcomp(). We just need to pick which parts we want to plot and do it with some function that gives us more control than biplot().
A more general trick for cases when the help file doesn't clarify things is to call str() on the prcomp object (in your case pca.Sample) to see all its parts and find what we want. (str() compactly displays the internal structure of an R object.)
Here is an example with some of R's sample data:
# do a pca of arrests in different states
p<-prcomp(USArrests, scale. = TRUE)
str(p) gives me something ugly and too long to include, but I can see that p$x has the states as rownames and their locations on the principal components as columns. Armed with this, we can plot it any way we want, such as with plot() and text() (for labels):
# plot and add labels
plot(p$x[,1],p$x[,2])
text(p$x[,1],p$x[,2],labels=rownames(p$x))
If we are making a scatterplot with many observations, the labels may not be readable. We therefore might want to only label more extreme values, which we can identify with quantile():
#make a new dataframe with the info from p we want to plot
df <- data.frame(PC1=p$x[,1],PC2=p$x[,2],labels=rownames(p$x))
#make sure labels are not factors, so we can easily reassign them
df$labels <- as.character(df$labels)
# use quantile() to identify which ones are within 25-75 percentile on both
# PC and blank their labels out
df[ df$PC1 > quantile(df$PC1)["25%"] &
df$PC1 < quantile(df$PC1)["75%"] &
df$PC2 > quantile(df$PC2)["25%"] &
df$PC2 < quantile(df$PC2)["75%"],]$labels <- ""
# plot
plot(df$PC1,df$PC2)
text(df$PC1,df$PC2,labels=df$labels)
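For data shaped like yours, where the points are simply numbered 1 to 90, the same idea can be sketched as follows. The matrix m is simulated and far narrower than 10,000 columns; m and pca are names made up for the example.

```r
if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script

set.seed(7)
# Simulated stand-in: 90 rows (data points) and 200 columns (variables).
m <- matrix(rnorm(90 * 200), nrow = 90)

pca <- prcomp(m, scale. = TRUE)

# Empty plot of the first two components, then the row numbers as labels.
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = seq_len(nrow(pca$x)), cex = 0.7)
```

No vectors are drawn at all, so nothing covers the 90 labelled points.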

Resources