Principal Component Analysis in R data color


Hi everyone, I have a simple question that I haven't been able to find an answer to in any tutorial. I've done a simple principal component analysis on a data set and then plotted the result with biplot.
CP <- prcomp(dat, scale. = T)
summary(CP)
biplot(CP)
With this I get a scatter plot of my data on the first and second components. I want to separate my data by color, telling R to draw my first 20 observations in red and the next 20 in blue. I don't know how to tell R to color those two sets of data.
Any help would be much appreciated. Thanks!
(I'm very new to R.)

Disclaimer: This is not a direct answer, but it can be tweaked to obtain the desired output.
library(ggbiplot)
data(wine)
wine.pca <- prcomp(wine, scale. = TRUE)
print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class, ellipse = TRUE, circle = TRUE))

Using plot() will give you more flexibility. You may use it alone or with text() for text labels, as below (thanks to @flodel for useful comments):
col = rep(c("red","blue"),each=20)
plot(CP$x[,1], CP$x[,2], pch="", main = "Your Plot Title", xlab = "PC 1", ylab = "PC 2")
text(CP$x[,1], CP$x[,2], labels=rownames(CP$x), col = col)
However, if you want to use biplot(), try this code:
biplot(CP$x[1:20,], CP$x[21:40,], col=c("red","blue"))
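Note that biplot() treats its second argument as the variables and draws them as arrows, so the call above shows rows 21-40 as arrows rather than as points. If both groups should appear as colored points with the variable loadings as arrows, one option is to build the plot by hand. The following is only a sketch: dat is simulated here as a stand-in for the asker's data (40 rows, 4 numeric columns is an assumption), and the arrow stretch factor of 2 is arbitrary.

```r
# Simulated stand-in for the asker's data (assumption: 40 rows, 4 numeric columns).
set.seed(1)
dat <- rbind(matrix(rnorm(80, mean = 0), ncol = 4),
             matrix(rnorm(80, mean = 2), ncol = 4))
colnames(dat) <- paste0("V", 1:4)

CP <- prcomp(dat, scale. = TRUE)

# One colour per observation: first 20 red, next 20 blue.
grp_col <- rep(c("red", "blue"), each = 20)

# Scores on the first two components, coloured by group.
plot(CP$x[, 1], CP$x[, 2], col = grp_col, pch = 19,
     xlab = "PC 1", ylab = "PC 2")

# Variable loadings drawn as arrows; the factor 2 is an arbitrary
# stretch so the arrows are visible on the score scale.
arrows(0, 0, 2 * CP$rotation[, 1], 2 * CP$rotation[, 2],
       length = 0.08, col = "grey30")
text(2.2 * CP$rotation[, 1], 2.2 * CP$rotation[, 2],
     labels = rownames(CP$rotation), col = "grey30")

legend("topright", legend = c("rows 1-20", "rows 21-40"),
       col = c("red", "blue"), pch = 19, bty = "n")
```

With real data, only the dat and grp_col lines should need changing, as long as grp_col has one entry per row of dat.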

Related

How to Label the "Outliers" in the box plot when there are many values which equals the outlier?

Please refer to the box plot picture above. I want to label only the outliers.
I use the code below to make a label column for the outliers.
outliers_price = boxplot(Ready_to_work_data$median_price ~
Ready_to_work_data$Regionname,plot=FALSE)$out
Ready_to_work_data$lable_price <- ifelse(Ready_to_work_data$median_price %in%
outliers_price, Ready_to_work_data$median_price, "")
Now when I use the code geom_text(aes(label = lable_price)), I see the plot below (plot2), where all the matching values are labelled, many of which aren't outliers. How do I resolve this?

Since there is no data provided in the question, it's hard to reproduce the needed plot. But here is one solution for tagging overlapping outliers with non-overlapping labels. The labels will not overlap even when the points do, because ggstatsplot uses ggrepel under the hood.
library(ggstatsplot)
ggbetweenstats(
  data = movies_long,
  x = genre,
  y = rating,
  plot.type = "box",
  outlier.tagging = TRUE,
  outlier.label = title,
  outlier.coef = 2,
  messages = FALSE
)
Created on 2018-10-17 by the reprex package (v0.2.1.9000)
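The extra labels in the question appear because %in% compares every price against the pooled outlier values from all regions, so a value that is an outlier in one region is also labelled in regions where it is ordinary. One way around this, sketched below in base R, is to flag outliers within each group separately. The data frame df and its columns region and price are made-up stand-ins for Ready_to_work_data, not the asker's data.

```r
# Toy data (hypothetical): 10 is an outlier in region A but ordinary in region B.
df <- data.frame(
  region = rep(c("A", "B"), each = 7),
  price  = c(1, 2, 3, 2, 3, 2, 10,   8, 9, 10, 11, 12, 10, 9)
)

# TRUE for values that boxplot.stats() reports as outliers of the vector v.
is_out <- function(v) v %in% boxplot.stats(v)$out

# ave() applies is_out() within each region; it returns 0/1, hence as.logical().
df$is_outlier <- as.logical(ave(df$price, df$region, FUN = is_out))

# Label column: the price for group-wise outliers, NA everywhere else.
df$label_price <- ifelse(df$is_outlier, df$price, NA)

print(df[df$is_outlier, ])
```

The label_price column can then be mapped with geom_text(aes(label = label_price), na.rm = TRUE), so that only the group-wise outliers are labelled.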

Run points() after plot() on a dataframe

I'm new to R and want to plot specific points over an existing plot. I'm using the swiss data frame, which I visualize with plot(swiss).
After this, I want to add the outliers given by the Mahalanobis distance:
mu_hat <- apply(swiss, 2, mean); sigma_hat <- cov(swiss)
mahalanobis_distance <- mahalanobis(swiss, mu_hat, sigma_hat)
outliers <- swiss[names(mahalanobis_distance[mahalanobis_distance > 10]),]
points(outliers, pch = 'x', col = 'red')
but this last line has no effect: the outlier points aren't added to the previous plot. I see that if I repeat this procedure on a pair of variables, say
plot(swiss[2:3])
points(outliers[2:3], pch = 'x', col = 'red')
the red points are added to the plot.
Question: is there any restriction on how the points() function can be used for a multivariate data frame?

Here's a solution using GGally::ggpairs. It's a little ugly as we need to modify the ggally_points function to specify the desired color scheme.
I've assumed that mu_hat = colMeans(swiss) and sigma_hat = cov(swiss).
library(dplyr)
library(GGally)
swiss %>%
  bind_cols(distance = mahalanobis(swiss, colMeans(swiss), cov(swiss))) %>%
  mutate(is_outlier = ifelse(distance > 10, "yes", "no")) %>%
  ggpairs(columns = 1:6,
          mapping = aes(color = is_outlier),
          upper = list(continuous = function(data, mapping, ...) {
            ggally_points(data = data, mapping = mapping) +
              scale_colour_manual(values = c("black", "red"))
          }),
          lower = list(continuous = function(data, mapping, ...) {
            ggally_points(data = data, mapping = mapping) +
              scale_colour_manual(values = c("black", "red"))
          }),
          axisLabels = "internal")

Unfortunately this isn't possible the way you're currently doing things. When plotting a data frame, R produces many plots and aligns them. What you're actually seeing there is 6 by 6 = 36 individual panels which have all been aligned to look nice.
When you use the points() function, it tells R to place the points on the current plot, which doesn't really make sense when you have 36 plots, at least not the way you want it to.
ggplot2 is a really powerful tool in R; it provides far greater customizability. For example, you could set up the data frame to include your outliers, labelled as "outlier", and place it in each plot that you have set up as facets. The more you explore it, the more likely you are to find other plots that suit your needs as well.
Plotting a data frame in base R is a good exploratory tool. You could set up those outliers as a separate data frame and plot it, so you can see each of the 6 by 6 plots side by side and compare. It all depends on your goal. If your goal is to produce exactly what you've described, the ggplot2 package will help you create something more professional. As @Gregor suggested in the comments, looking up the function ggpairs from the GGally package would be a good place to start.
A quick google image search shows some funky plots akin to what you're after and then some!
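For a base-graphics route, plot() on a data frame with more than two columns calls pairs(), and pairs() forwards col and pch to every panel. So the outliers can be marked in the original call instead of being added afterwards with points(). A minimal sketch, reusing the threshold of 10 from the question:

```r
# Mahalanobis distance of each row from the column means of swiss.
d <- mahalanobis(swiss, colMeans(swiss), cov(swiss))
is_out <- d > 10

# One colour and one symbol per observation; pairs() reuses them in every panel.
pairs(swiss,
      col = ifelse(is_out, "red", "black"),
      pch = ifelse(is_out, 4, 1))
```

Because the color and symbol vectors have one entry per row of swiss, each observation keeps the same marking across all panels.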

R: PCA plot with different colors for Sites

I've recently been trying to analyse my data and want to make the graphs a little nicer, but I'm failing at this.
I have a data set with 144 sites and 5 environmental variables. It's basically about the substrate composition around an island and the fish abundance. On this island there is supposed to be a difference in substrate composition between the north and the south side. Right now I am doing a PCA, and with the biplot function it works quite well, but I would like to change the plot a bit.
I need one where the sites are just points and not numbered, arrows point to the different variables, and the sites are colored according to their location (north or south side). So I tried everything I could find.
Most examples were with the dune data and suggested something like this:
library(vegan)
library(biplot)
data(dune)
mod <- rda(dune, scale = TRUE)
biplot(mod, scaling = 3, type = c("text", "points"))
So according to this I would just need to specify text and points, and R would label the variables and draw plain points for the sites. When I do this, however, I get the error:
Error in plot.default(x, type = "n", xlim = xlim, ylim = ylim, col = col[1L], :
formal argument "type" matched by multiple actual arguments
I have no idea how to get around this.
The next strategy I found was to build the plot manually, like this:
require("vegan")
data(dune, dune.env)
mod <- rda(dune, scale = TRUE)
scl <- 3 ## scaling == 3
colvec <- c("red2", "green4", "mediumblue")
plot(mod, type = "n", scaling = scl)
with(dune.env, points(mod, display = "sites", col = colvec[Use],
scaling = scl, pch = 21, bg = colvec[Use]))
text(mod,display="species", scaling = scl, cex = 0.8, col = "darkcyan")
with(dune.env, legend("bottomright", legend = levels(Use), bty = "n",
col = colvec, pch = 21, pt.bg = colvec))
This works fine as well: I get different colors and points, but now the arrows are missing. I found that this should be easy to correct by putting display = "bp" in the text line, but this doesn't work either. Every time I use "bp", R says:
Error in match.arg(display) :
argument "display" is missing, with no default
So I'm kind of desperate now. I looked through all the answers here and I don't understand why display = "bp" and type = c("text", "points") are not working for me.
If anyone has an idea I would be super grateful.
https://www.dropbox.com/sh/y8xzq0bs6mus727/AADmasrXxUp6JTTHN5Gr9eufa?dl=0
This is the link to my Dropbox folder. It contains my R script and the CSV files. The one named environmentalvariables_Kon1 also contains the data about the north and south side.
So if anyone could help me, that would be awesome. I really don't know what to do anymore.
Best regards,
Nancy

You can add arrows with arrows(). See the code for vegan:::biplot.rda to see how it works in the original function.
With your plot, add
# use the same scaling as the plot() and text() calls so the arrows line up with the labels
g <- scores(mod, display = "species", scaling = scl)
len <- 1
arrows(0, 0, len * g[, 1], len * g[, 2], length = 0.05, col = "darkcyan")
You might want to adjust the value of len to make the arrows longer.

how to make a biplot without label in R

I used
biplot(prcomp(data, scale.=T), xlabs=rep("·", nrow(data)))
but it did not omit the labels.
Even if I remove the labels, my plot is messy and ugly, as can be seen below!
I also need to show the percentage of variance explained by each PC on the axes.
I used the following command to plot the image:
biplot(prcomp(data, scale.=T), xlabs=rep("·", nrow(data)), ylabs = rep("·", ncol(data)))

Try this:
devtools::install_github("sinhrks/ggfortify")
library(ggfortify)
ggplot2::autoplot(stats::prcomp(USArrests, scale=TRUE), label = FALSE, loadings.label = TRUE)
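If extra packages are not wanted, the percentages can also be computed from the prcomp object and written into the axis labels of a plain score plot. This is only a base-R sketch of that idea, using USArrests as in the answer above; the name var_pct is introduced here for illustration.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Percentage of total variance carried by each component.
var_pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# Scores as plain dots, with the percentages written into the axis labels.
plot(pca$x[, 1], pca$x[, 2], pch = 20,
     xlab = paste0("PC1 (", var_pct[1], "%)"),
     ylab = paste0("PC2 (", var_pct[2], "%)"))

# Dotted reference lines through the origin.
abline(h = 0, v = 0, lty = 3, col = "grey60")
```

The squared sdev values are the component variances, so dividing each by their sum gives the same proportions that summary(pca) reports.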

How to produce a sequence frequency plot for one single cluster within a cluster solution

I couldn't find a sufficient answer to my problem; perhaps someone can help here? (I am a beginner in R.)
I do sequence analysis; the state space is n = 10 and the time span is t = 168 (months). I drew a sequence frequency plot for a cluster solution with 8 clusters. However, the plot is hard to interpret because the individual panels are too cramped or too small (see graph below).
So far I have followed this procedure (very close to the instructions in the TraMineR help document):
dist.om1 <- seqdist(neu.seq, method = "OM", indel = 1, sm = submat)
clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
cluster8 <- cutree(clusterward1, k = 8)
cluster8 <- factor(cluster8, labels = c("Typ 1", "Typ 2", "Typ 3", "Typ 4", "Typ 5", "Typ 6", "Typ 7", "Typ 8"))
seqfplot(neu.seq, group = cluster8, pbarw = T, withlegend = T)
I tried to reconfigure the margins, but the result was always the same plot (the attached plot was done with the default settings). So I thought that, instead, I could draw a sequence frequency plot for a single cluster within my 8-cluster solution.
(In Stata, for a single sequence index plot, I would write something like sqindexplot if cluster8 == 4.)
However, I don't know how this is done in R. If someone has an idea how to get a prettier sequence frequency plot, I'd be very grateful!
Thank you!
Oliver

With 8 groups you may need to reduce the font size of the axes labels using the cex.plot argument. For example:
seqfplot(neu.seq, group = cluster8, withlegend = T, cex.plot=.5)
You may also get better looking plots with the border=NA argument that suppresses the black border around the bars representing each sequence pattern.
Alternatively, if you are using graphics devices such as pdf, png or jpeg to create your plot files, try adjusting the width and height parameters of those functions. The larger the height value, the smaller the text appears.
To get only cluster 4, use
seqfplot(neu.seq[cluster8=="Typ 4",], withlegend = T)
(See also How to identify sequences within each cluster? )
And if you want to combine the plots yourself using for example par(mfrow=c(.,.)) you have to disable the automatic legend, and insert the legend manually with seqlegend, e.g.
par(mfrow=c(2,2))
seqfplot(neu.seq[cluster8=="Typ 4",], withlegend = F)
seqfplot(neu.seq[cluster8=="Typ 5",], withlegend = F)
seqfplot(neu.seq[cluster8=="Typ 6",], withlegend = F)
seqlegend(neu.seq)
dev.off()
Hope this helps.
