I have just performed a PCA analysis for a large data set with approximately 20,000 variables. To do so, I used the following code:
df_pca <- prcomp(df, center=FALSE, scale.=TRUE)
I am curious how my variables influenced PCA.1 (Dimension 1 of the PCA analysis) and PCA.2 (Dimension 2 of the PCA analysis).
I used the following code to look at how each variable influenced the dimensional analysis:
fviz_pca_var(df_pca, col.var = "black")
However, this creates a graph with all 20,000 of my variables and since there is so much information, it is unreadable.
Is there a way to select the variables that have most influenced PCA.1 and PCA.2 and graph only those?
Thank you in advance!
To see which variables contribute most to a given dimension, use fviz_contrib():
library(factoextra)
fviz_contrib(df_pca,
             choice = "var",
             axes = 5,
             top = 10, color = 'darkorange3', barfill = 'blue4', fill = 'blue4')
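The axes argument selects the dimension you want to see; in this case it is dimension 5. For the combined contribution to the first two dimensions, use axes = 1:2. The top argument limits the plot to the 10 largest contributors.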
If you want to see the scree plot, which helps you choose how many dimensions to keep, along with the eigenvalues, you can use this:
fviz_screeplot(df_pca, ncp = 14, linecolor = 'darkorange3', barfill = 'blue4',
               barcolor = 'blue4', xlab = "Dimensions",
               ylab = '% variance',
               main = 'Reduction of components')
get_eigenvalue(df_pca)
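To graph only the most influential variables in the variable plot itself, fviz_pca_var() accepts a select.var argument. The following is a minimal sketch on simulated data; df_demo and pca_demo are made-up stand-ins for the real data frame and PCA object, and the cutoff of 10 is arbitrary.

```r
library(factoextra)

# Simulated stand-in for the real data: 50 observations, 200 variables.
set.seed(42)
df_demo <- as.data.frame(matrix(rnorm(50 * 200), nrow = 50))

pca_demo <- prcomp(df_demo, center = TRUE, scale. = TRUE)

# Draw only the 10 variables with the highest contribution to PC1 and PC2.
p <- fviz_pca_var(pca_demo,
                  axes = c(1, 2),
                  select.var = list(contrib = 10),
                  col.var = "black")
print(p)
```

With the real object this would be fviz_pca_var(df_pca, select.var = list(contrib = 10), col.var = "black").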
What you want to do first is get the loadings table, which relates each principal component (the synthetic variable) to the real variables. Do that like this:
a <- df_pca$rotation
Then we can use dplyr to manipulate the data frame and extract what we want:
library(dplyr)
library(tibble)
a %>% as.data.frame %>% rownames_to_column %>%
select(rowname, PC1, PC2) %>% arrange(desc(PC1^2+PC2^2)) %>% head(10)
The above will show the top 10 most important variables for PC1 and PC2 combined. You can run the same thing for PC1 only by changing to arrange(desc(abs(PC1))), or for PC2 only by changing to arrange(desc(abs(PC2))), and see more or fewer than 10 variables by changing head(10).
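The ranking by PC1^2 + PC2^2 can also be done in base R. This is only a sketch on simulated data (the matrix demo and the variable names v1 to v8 are invented for illustration); it shows how the squared loadings are summed and how the top variable names can then be reused to subset the loadings.

```r
# Simulated data: 30 observations of 8 variables named v1..v8.
set.seed(1)
demo <- matrix(rnorm(30 * 8), nrow = 30,
               dimnames = list(NULL, paste0("v", 1:8)))

pca <- prcomp(demo, center = TRUE, scale. = TRUE)

# Loadings of every variable on the first two components.
load12 <- pca$rotation[, c("PC1", "PC2")]

# Squared length of each variable's loading vector in the PC1/PC2 plane.
score <- rowSums(load12^2)

# Names of the 3 variables with the largest combined loading.
top <- names(sort(score, decreasing = TRUE))[1:3]

# Loadings restricted to those variables, ready for plotting.
load12[top, ]
```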
I'm rather new to R and have been trying to analyze some of my proteomic data with it. In particular, I'm trying to make a PCA plot so that I can see some of the similarities/differences between my different treatments. I essentially have 5 treatments, each in triplicates, so 15 columns total. My 95 rows are each represented by a different Protein ID. My actual data are normalized LFQ intensities, which is only to say that it's numeric and I know a PCA analysis/plot is appropriate.
I was able to successfully create a PCA plot with this code:
PlantPCA <- prcomp(Plant_Num_Named, center = TRUE, scale = TRUE)
summary(PlantPCA)
#create a quick plot
plot(PlantPCA$x[,1],PlantPCA$x[,2], xlab="PC1 (69.4%)", ylab = "PC2 (9.2%)", main = "PC1 / PC2 - plot")
#full plot
fviz_pca_ind(PlantPCA, geom.ind = "point", pointshape = 21,
pointsize = 2) +
ggtitle("2D PCA-plot from 15 feature dataset") +
theme(plot.title = element_text(hjust = 0.5))
Which gives me this:
My PCA Plot
I know it's basic; I just wanted to get it to work first. However, now I want to add ellipses to surround my 5 different treatments, disregarding the different triplicates, and I'm not sure how to do so. I've seen people do this when the grouping they want to color by is another column of categorical data, but that isn't the case here.
Ideally, I would have 5 different colors for my treatments (which again are currently different columns) and the ellipses would match the color.
I want something similar to this:
Example Plot
From this tutorial website: https://towardsdatascience.com/principal-component-analysis-pca-101-using-r-361f4c53a9ff
Is this something that is attainable? I just need a little direction. Any and all advice is welcome!
I can see many posts on this topic, but none addresses this question. Apologies if I missed a relevant answer. I have a large protein expression dataset, with samples like so as the columns:
rep1_0hr, rep1_16hr, rep1_24hr, rep1_48hr, rep1_72hr .....
and 2000+ proteins in the rows. In other words each sample is a different developmental timepoint.
If it is of any interest, the original dataset is 'mulvey2015' from the pRolocdata package in R, which I converted to a SummarizedExperiment object in RStudio.
I first ran k-means clustering on the data (an assay() of a SummarizedExperiment dataset) to get 12 clusters:
k_mul <- kmeans(scale(assay(mul)), centers = 12, nstart = 10)
Then:
summary(k_mul)
produced the expected output.
I would like the visualisation to look like this, with samples on the x-axis and expression on the y-axis. The plots look like they have been generated using facet_wrap() in ggplot:
For ggplot the data need to be provided as a dataframe with a column for the cluster identity of an individual protein. Also the data need to be in long format. I tried pivoting (pivot_longer) the original dataset, but of course there are a very large number of data points. Moreover, the image I posted shows that for any one plot, the number of coloured lines is smaller than the total number of proteins, suggesting that there might have been dimension reduction on the dataset first, but I am unsure. Up till now I have been running the kmeans algorithm without dimension reduction. Can I get guidance please for how to produce this plot?
Here is my attempt at reverse engineering the plot:
library(pRolocdata)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
# Load the dataset shipped with pRolocdata
data(mulvey2015)
mulvey2015 %>%
  Biobase::assayData() %>%
  magrittr::extract2("exprs") %>%
  data.frame(check.names = FALSE) %>%
  tibble::rownames_to_column("prot_id") %>%
  # Cluster the proteins and store the cluster label as a factor
  mutate(.,
         cl = kmeans(select(., -prot_id),
                     centers = 12,
                     nstart = 10) %>%
           magrittr::extract2("cluster") %>%
           as.factor()) %>%
  # Long format: one row per protein and timepoint
  pivot_longer(cols = !c(prot_id, cl),
               names_to = "Timepoint",
               values_to = "Expression") %>%
  ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
  geom_line(aes(group = prot_id)) +
  facet_wrap(~ cl, ncol = 4)
As for your questions, pivot_longer is usually quite performant unless it fails to find unique key combinations or runs into problems with data type conversion. The plot can be improved by:
tweaking the alpha parameter of geom_line (e.g. alpha = 0.5), to give an idea of the density of lines
finding a good abbreviation and order for Timepoint
changing the axis.text.x orientation
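The three suggestions above could look roughly like this. It is a sketch on simulated data: demo_long, the 30 fake proteins, the 3 fake clusters and the shortened timepoint labels are all invented for illustration and are not taken from mulvey2015.

```r
library(ggplot2)

set.seed(1)
timepoints <- c("0hr", "16hr", "24hr", "48hr", "72hr")

# Fake long-format data: 30 proteins x 5 timepoints, 3 clusters of 10 proteins.
demo_long <- expand.grid(prot_id = paste0("p", 1:30),
                         Timepoint = timepoints,
                         stringsAsFactors = FALSE)
demo_long$cl <- factor(rep(rep(1:3, each = 10), times = length(timepoints)))
demo_long$Expression <- rnorm(nrow(demo_long))

# Fix the order of the x axis explicitly instead of relying on alphabetical order.
demo_long$Timepoint <- factor(demo_long$Timepoint, levels = timepoints)

p <- ggplot(demo_long, aes(x = Timepoint, y = Expression, colour = cl)) +
  geom_line(aes(group = prot_id), alpha = 0.5) +            # transparency shows line density
  facet_wrap(~ cl, ncol = 3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # rotated tick labels
print(p)
```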
Here is my own, very similar solution to the above.
dfsa_mul <- data.frame(scale(assay(mul)))
dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")
Add the kmeans $cluster column to the dfsa_mul2 data frame (ksa_mul is the kmeans() result). Only change clus to a factor after executing pivot_longer:
dfsa_mul2$clus <- ksa_mul$cluster
dfsa_mul2 %>%
pivot_longer(cols = -c("protID", "clus"),
names_to = "samples",
values_to = "expression") %>%
ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
geom_line(aes(group = protID)) +
facet_wrap(~ factor(clus))
This generates a series of plots identical to the graphs posted by @sbarbit.
I am trying to visualize my data. All I need is a plot to compare the distribution of the different variables.
I already tried multi.hist, and that would actually be enough for me. The problem is that I cannot keep the axis limits the same across the histograms, so I cannot compare the distributions, because the scale is fitted separately for each variable.
I also have a categorical variable in my data (topic 1-5). Maybe there is a good way to visualize this too, but it is not essential if it is not easy to do.
I tried a lot with ggplot as well, but I am rather new to R and could not make anything good yet.
Below you see an example for my data.
Thank you very much in advance :)
My data:
Data
Try first converting your data to long format:
df2 <- df %>% pivot_longer(cols = 1:5, names_to = 'set', values_to = 'sub_means')
Then you can do a density plot, either colouring by set and faceting by topic:
df2 %>% ggplot(aes(x = sub_means, fill = set)) + geom_density(alpha = 0.5) + facet_wrap(~topic)
Or vice versa:
df2 %>% ggplot(aes(x = sub_means, fill = factor(topic))) + geom_density(alpha = 0.5) + facet_wrap(~set)
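Since the question was about histograms on a common scale, a faceted histogram may also do the job: facet_wrap() keeps the x and y scales identical across panels by default (scales = "fixed"). A sketch on simulated data follows; df_demo and its columns a, b and c are made up, as the real data were only posted as an image.

```r
library(ggplot2)
library(tidyr)

# Made-up data: three numeric variables plus a topic column (1-5).
set.seed(1)
df_demo <- data.frame(a = rnorm(100, mean = 3),
                      b = rnorm(100, mean = 4),
                      c = rnorm(100, mean = 2, sd = 0.5),
                      topic = sample(1:5, 100, replace = TRUE))

# Long format: one row per observation and variable.
long_demo <- pivot_longer(df_demo, cols = c(a, b, c),
                          names_to = "set", values_to = "sub_means")

# One histogram per variable; all panels share the same axis limits.
p <- ggplot(long_demo, aes(x = sub_means)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ set, scales = "fixed")
print(p)
```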
Here's the dataset I'm using: https://www.dropbox.com/s/b3nv38jjo5dxcl6/nba_2013.csv?dl=0, it contains statistics for NBA players.
And I want to see how different columns correlate with each other, thus I want to draw pairwise scatterplots.
Here's my code:
library(GGally)
nba %>%
select(ast, fg, trb) %>%
ggpairs()
The nba variable contains the whole dataset.
And when I want to draw the pairwise scatterplot, I get something like this:
Generated Output
However, some of the graphs are replaced by "Corr" values. Is there a way to replace these "Corr" values with actual graphs, so that the output looks as follows:
Desired Output
Here is an approach:
nba %>%
  select(ast, fg, trb) %>%
  ggpairs(upper = list(continuous = "points",
                       combo = "facethist", discrete = "facetbar", na = "na"))
I work with a massive 4D nifti file (x - y - z - subject; MRI data) and due to the size I can't convert to a csv file and open in R. I would like to get a series of overlaying density plots (classic example here) one for each subject with the idea to just visualise that there is not much variance in density distributions across the sample.
I could however, extract summary statistics for each subject (mean, median, SD, range etc. of the variable of interest) and use these to create the density plots (at least for the variables that are normally distributed). Something like this would be fantastic but I am not sure how to do it for density plots.
Your help will be much appreciated.
So these really aren't density plots per se - they are plots of densities of normal distributions with given means and standard deviations.
That can be done in ggplot2, but you need to expand your table of subjects and summaries into grids of points and normal densities at those points.
Here's an example. First, make up some data, consisting of subject IDs and some simulated sample averages and sample standard deviations.
library(tidyverse)
set.seed(1)
foo <- tibble(Subject = LETTERS[1:10], avg = runif(10, 10, 20), stdev = runif(10, 1, 2))  # tibble() replaces the deprecated data_frame()
Now, for each subject we need to obtain a suitable grid of "x" values along with the normal density (for that subject's avg and stdev) evaluated at those "x" values. I've chosen plus/minus 4 standard deviations. This can be done using do. But that produces a funny data frame with a column consisting of data frames. I use unnest to explode out the data frame.
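bar <- foo %>%
  group_by(Subject) %>%
  # One grid of 50 x values per subject, spanning avg +/- 4 standard deviations
  do(densities = tibble(x = seq(.$avg - 4 * .$stdev, .$avg + 4 * .$stdev, length.out = 50),
                        density = dnorm(x, .$avg, .$stdev))) %>%
  # Name the list column explicitly; a bare unnest() triggers a warning in current tidyr
  unnest(densities)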
Have a look at bar to see what happened. Now we can use ggplot2 to put all these normal densities on the same plot. I'm guessing with lots of subjects you wouldn't want a legend for the plot.
bar %>%
ggplot(aes(x=x, y=density, color=Subject)) +
geom_line(show.legend = FALSE)
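On recent versions of dplyr (1.1.0 or later, which is an assumption about your setup), the same grid can probably be built without do() and unnest() by using reframe(). This is a sketch of that alternative, reusing the made-up foo data from above:

```r
library(dplyr)
library(ggplot2)

set.seed(1)
foo <- tibble(Subject = LETTERS[1:10],
              avg = runif(10, 10, 20),
              stdev = runif(10, 1, 2))

# reframe() may return any number of rows per group, so each subject
# expands directly into 50 grid points with their normal densities.
bar <- foo %>%
  group_by(Subject) %>%
  reframe(x = seq(avg - 4 * stdev, avg + 4 * stdev, length.out = 50),
          density = dnorm(x, avg, stdev))

p <- ggplot(bar, aes(x = x, y = density, color = Subject)) +
  geom_line(show.legend = FALSE)
print(p)
```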