R colour code plot by rownames for principal component analysis - r

I am attempting to complete a principal component analysis on a set of data containing columns of numeric data.
Assuming a dataset like this (in reality I have a preconfigured data frame; this one is for reproducibility):
v1 <- c(1,2,3,4,5,6,7)
v2 <- c(3,6,2,5,2,4,9)
v3 <- c(6,1,4,2,3,7,5)
dataset <-data.frame(v1,v2,v3)
row.names(dataset) <-c('New York', 'Seattle', 'Washington DC', 'Dallas', 'Chicago','Los Angeles','Minneapolis')
I have run my principal component analysis and successfully plotted it:
pca=prcomp(dataset,scale=TRUE)
plot(pca$x[,1], pca$x[,2],
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],cex=0.7,pos=3,col="darkgrey")
What I want to do, however, is colour code my data points based on the city, which is stored in the row names of my dataset. I also want to use these cities (i.e. the rownames) as labels.
I've tried the following, but neither attempt has worked:
## attempt 1 - I get row labels, but no chart
plot(pca$x[,1], pca$x[,2],col=rownames(dataset),pch=rownames(dataset),
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],labels=rownames(dataset),cex=0.7,pos=3,col="darkgrey")
## attempt 2
datasetwithcity = rownames_to_column(dataset, var = "city")
head(datasetwithcity)
OnlyCities=datasetwithcity[,1]
OnlyCities
# this didn't work:
City_Labels=as.numeric(OnlyCities)
head(City_Labels)
# gets city labels, but loses points and no colour
plot(pca$x[,1], pca$x[,2],col=City_Labels,pch=City_Labels,
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],labels=rownames(dataset),
cex=0.7,pos=3,col="darkgrey")

There are many different ways to do this.
In base R, you could do:
plot(pca$x[,1], pca$x[,2],
xlab="First PC",ylab="Second PC", col = seq(nrow(pca$x)),
xlim = c(-2.5, 2.5), ylim = c(-2, 2))
text(pca$x[,1], pca$x[,2],cex=0.7,pos=3,col="darkgrey")
text(x = pca$x[,1], y = pca$x[,2], labels = rownames(pca$x), pos = 1)
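As a possible variation on the base R approach, the sketch below maps each city to its own colour through a factor and reuses that colour for the labels. It also shows why the question's attempts fail: col expects colour names or numbers, and as.numeric() on city names returns only NA. The names city and city_cols are illustrative, and hcl.colors() assumes R 3.6 or later.

```r
# Rebuild the example data from the question
v1 <- c(1, 2, 3, 4, 5, 6, 7)
v2 <- c(3, 6, 2, 5, 2, 4, 9)
v3 <- c(6, 1, 4, 2, 3, 7, 5)
dataset <- data.frame(v1, v2, v3)
row.names(dataset) <- c("New York", "Seattle", "Washington DC", "Dallas",
                        "Chicago", "Los Angeles", "Minneapolis")
pca <- prcomp(dataset, scale. = TRUE)

# City names cannot be converted to numbers directly: this is all NA
city_as_number <- suppressWarnings(as.numeric(rownames(dataset)))

# Going through a factor gives one integer code per city instead
city <- factor(rownames(pca$x))
city_cols <- hcl.colors(nlevels(city), palette = "Dark 3")[as.integer(city)]

plot(pca$x[, 1], pca$x[, 2], col = city_cols, pch = 19,
     xlab = "First PC", ylab = "Second PC",
     xlim = c(-2.5, 2.5), ylim = c(-2, 2))
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x),
     col = city_cols, cex = 0.7, pos = 3)
```

With real data where several rows share a city, the same factor would give all of those rows the same colour.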
Personally, I think the resulting aesthetics are nicer (and easier to change to suit your needs) with ggplot. The code is also a bit easier to read once you get used to the syntax.
library(ggplot2)
df <- as.data.frame(pca$x)
df$city <- rownames(df)
ggplot(df, aes(PC1, PC2, color = city)) +
geom_point(size = 3) +
geom_text(aes(label = city) , vjust = 2) +
lims(x = c(-2.5, 2.5), y = c(-2, 2)) +
theme_bw() +
theme(legend.position = "none")
Created on 2021-10-28 by the reprex package (v2.0.0)

Related

R ggplot2 - setting breaks when also using facets

I want to have some more flexibility for setting the breaks on a ggplot that has facets.
library(data.table)
library(tidyverse)
dt <- data.table(x = rnorm(1000),
group = sample(c(1,2), size=1000, replace = TRUE))
The problem is I want to create the breaks based off all of the data for a particular facet, but the documentation for breaks says:
A function that takes the limits as input and returns breaks as output (e.g., a function returned by scales::extended_breaks()). Also accepts rlang lambda function notation.
Note that you just get the limits of the data for that facet. Say I want to use the output of summary to create my breaks. E.g.:
breaks_f <- function(x){
print(x) # included this to confirm I only get limits
as.numeric(ceiling(summary(x)))
}
dt %>%
ggplot(aes(x=x)) +
geom_density(adjust=.8, color = NA, alpha = .8, fill = 'blue')+
scale_x_continuous(breaks = breaks_f) +
facet_wrap(vars(group), scales = 'free')
If you run this, you'll see that the breaks are based on summary(c(min, max)) for each facet, not on all of the data for that facet.
So is there a way to access all of the data within each facet?
Thanks!
One option would be ggh4x::facetted_pos_scales, which allows you to
... vary labels, breaks, limits, transformations and even axis guides for each panel individually.
Hence, using ggh4x::facetted_pos_scales you could apply your function and set the breaks individually for each panel:
library(data.table)
library(tidyverse)
library(ggplot2)
library(ggh4x)
set.seed(123)
dt <- data.table(
x = rnorm(1000),
group = sample(c(1, 2), size = 1000, replace = TRUE)
)
breaks_f <- function(x) {
print(x) # included this to confirm I only get limits
as.numeric(ceiling(summary(x)))
}
dt %>%
ggplot(aes(x = x)) +
geom_density(adjust = .8, color = NA, alpha = .8, fill = "blue") +
facet_wrap(vars(group), scales = "free") +
facetted_pos_scales(x = list(
scale_x_continuous(breaks = breaks_f),
scale_x_continuous(breaks = breaks_f)
))
#> [1] -2.937358 3.535249
#> [1] -2.937358 3.535249
#> [1] -3.084506 2.959591
#> [1] -3.084506 2.959591
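The printed output above suggests that breaks_f still receives only the limits of each panel. If the breaks really need to come from all of the data in a facet, one possible approach is to compute them per group beforehand and hand fixed break vectors to the per-panel scales. The following is a minimal sketch under that assumption; breaks_by_group is a hypothetical helper name, and the plotting part only runs if ggplot2 and ggh4x are installed.

```r
set.seed(123)
dt <- data.frame(
  x = rnorm(1000),
  group = sample(c(1, 2), size = 1000, replace = TRUE)
)

# Compute the breaks from the full data of each group, not from the limits
breaks_by_group <- lapply(
  split(dt$x, dt$group),
  function(x) unique(as.numeric(ceiling(summary(x))))
)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggh4x", quietly = TRUE)) {
  library(ggplot2)
  # One scale per panel, in the same order as the facets (group 1, group 2)
  panel_scales <- unname(lapply(
    breaks_by_group,
    function(b) scale_x_continuous(breaks = b)
  ))
  p <- ggplot(dt, aes(x = x)) +
    geom_density(adjust = .8, color = NA, alpha = .8, fill = "blue") +
    facet_wrap(vars(group), scales = "free") +
    ggh4x::facetted_pos_scales(x = panel_scales)
  print(p)
}
```

The design choice here is to move the data-dependent step out of the scale entirely, since the breaks function is only ever given limits.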

R ggplot2 facet chart with annotations varying among the facets

I am trying to construct a facet plot using ggplot2 with an annotation that varies from one facet to the next. The annotation is to be located using the plot area coordinates between 0 and 1, rather than the usual (x,y) coordinates, and is to be in the same location for every facet. The annotation is to be constructed using the y aesthetic and the paste0() function.
My reprex shows one case that works, but this case does not include the part that comes from the y aesthetic and the annotation does not vary among the facets. The reprex also shows a second case where the percentage change in the last (most recent) value for the y aesthetic is added to the annotation, and this does not work. It is this second case that I want to solve.
The reprex uses the ggpp package, but I have also tried using annotation_custom instead of ggpp. However I have not been able to get that to work either. Any help much appreciated.
Here is my reprex:
# Reprex for facets with placed annotation
library(ggplot2)
library(ggpp)
PC <- function(x) {y <- round(100*(x/lag(x)-1),1)}
df <- data.frame(tm=1:25,A=sample(1:100,25,replace=T),
B=sample(1:100,25,replace=T),C=sample(1:100,25,replace=T),
D=sample(1:100,25,replace=T))
df <- tidyr::pivot_longer(df,cols=2:5,names_to="City",values_to="Value")
# This works:
ggplot(df,aes(x=tm,y=Value))+
geom_line()+
scale_y_continuous(lim=c(-10,100))+
ggpp::geom_text_npc(aes(npcx = x, npcy = y, label=label),
data = data.frame(x = 0.05, y = 0.05,
label='% change in various cities'))+
facet_wrap(~City,scale="free_y")
# But this does not work:
ggplot(df,aes(x=tm,y=Value))+
geom_line()+
scale_y_continuous(lim=c(-10,100))+
ggpp::geom_text_npc(aes(npcx = x, npcy = y, label=label),
data = data.frame(x = 0.05, y = 0.05,
label=paste0("Last change in this city ",PC(y)[25],'%')))+
facet_wrap(~City,scale="free_y")
The solution below is a bit complicated, and there are probably simpler ones, but it works.
1. Function PC()
Unless package dplyr is loaded, your function PC calls stats::lag, not dplyr::lag. It also assigns the result to y, so the value is returned only invisibly. A cleaner version is
PC <- function(x) {round(100*(x/dplyr::lag(x) - 1), 1)}
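To illustrate the difference, here is a small sketch with made-up numbers. On a plain vector, stats::lag only shifts the time attributes and leaves the values where they are, so the ratio is always 1. The base R helper pc_base is a hypothetical stand-in that mimics what dplyr::lag does for this case, so the example runs without dplyr.

```r
x <- c(50, 55, 44)  # made-up values for illustration

# stats::lag() keeps the values in place; only the "tsp" attribute changes
lagged_stats <- as.numeric(stats::lag(x))

# Hypothetical base R equivalent of the corrected PC():
# shift the values down by one position and pad with NA
pc_base <- function(x) {
  previous <- c(NA, head(x, -1))
  round(100 * (x / previous - 1), 1)
}

pc_base(x)
```

Here pc_base(x) gives NA for the first element (there is no previous value), then the percentage change from 50 to 55 and from 55 to 44.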
2. The data
The plot is created with data = df, but the label layer is given its own data set, so the y used there no longer comes from df.
The ggpp::geom_text_npc layer cannot compute PC(y) correctly because the y inside its data.frame() call is looked up in the calling environment, not in df. It is not the y aesthetic of the plot, so it is either an unrelated object or not found at all.
A way to correct this is to note that there are four labels to plot, one per city, and to compute the last change value for each beforehand. This is very simple:
Value <- with(df, tapply(Value, City, \(y) PC(y)[length(y)]))
Value
# A B C D
# -24.1 -16.7 -91.3 46.9
The labels data then becomes
df_labels <- data.frame(
x = rep(0.05, length(Value)), y = rep(0.05, length(Value)),
City = names(Value),
label = paste0("Last change in this city ", Value, "%")
)
3. The plot
Full reproducible example, from top to bottom.
# Reprex for facets with placed annotation
suppressPackageStartupMessages({
library(ggplot2)
library(ggpp)
})
set.seed(2022)
PC <- function(x) {round(100*(x/dplyr::lag(x) - 1), 1)}
df <- data.frame(tm=1:25,A=sample(1:100,25,replace=T),
B=sample(1:100,25,replace=T),
C=sample(1:100,25,replace=T),
D=sample(1:100,25,replace=T))
df <- tidyr::pivot_longer(df,cols=2:5,names_to="City",values_to="Value")
Value <- with(df, tapply(Value, City, \(y) PC(y)[length(y)]))
df_labels <- data.frame(
x = rep(0.05, length(Value)), y = rep(0.05, length(Value)),
City = names(Value),
label = paste0("Last change in this city ", Value, "%")
)
ggplot(df, aes(x = tm, y = Value)) +
geom_line() +
scale_y_continuous(lim = c(-10, 100)) +
ggpp::geom_text_npc(
data = df_labels,
mapping = aes(
npcx = x, npcy = y,
label = label
)
) +
facet_wrap(~ City, scale = "free_y")
Created on 2022-08-08 by the reprex package (v2.0.1)

violin plot in R and values in X axis

I created these two violin plots in R, using:
install.packages("vioplot")
par(mfrow = c(1, 2))
vioplot::vioplot(HEL$Y,las=2,main="HEL$Y",col="deepskyblue",notch=TRUE)
vioplot::vioplot(ITA$Y,las=2,main="ITA$Y",col="aquamarine",notch=TRUE)
As a result I get the following. However, I don't know why I get 1 and 2 on the x axis. How can I get rid of the 2?
Thanks for your help.
This mysterious behavior is due to the use of the argument "notch = TRUE". Example:
library(vioplot)
set.seed(456)
vioplot(rnorm(10), notch = TRUE)
My interpretation is that notch is not an argument of vioplot, so the function interprets it as data to add to the graph (see the little smudge at y = 1: that's where it wants to put the new data, since TRUE equals 1 when it is converted into a numeric).
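The mechanism described above can be sketched with a toy function. This is not vioplot's actual source, only an illustration of how a function that collects its data from x and ... would treat an unrecognised named argument as one more data set; toy_collect_data is a hypothetical name.

```r
# Toy illustration: every extra argument in ... is collected as a data set
toy_collect_data <- function(x, ...) {
  datas <- list(x, ...)
  length(datas)
}

set.seed(456)
n_without <- toy_collect_data(rnorm(10))               # one data set
n_with    <- toy_collect_data(rnorm(10), notch = TRUE) # two data sets

# TRUE becomes 1 when treated as a number, hence the smudge at y = 1
notch_value <- as.numeric(TRUE)
```

If this interpretation is right, dropping notch = TRUE from the two calls in the question should leave a single violin, and a single tick, in each panel.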
To confirm that an unknown argument is interpreted as data to be plotted, here is a little experiment:
vioplot(rnorm(10), unknown_argument = rnorm(10))
And the result:
This is a ggplot2 solution in case you're interested.
library(ggplot2)
library(dplyr)
# Recreate similar data
HEL <- data.frame(Y = rnorm(50, 8, 3))
ITA <- data.frame(Y = rnorm(50, 9, 2))
# Join in a single dataframe and reshape to longer format
dat <- bind_rows(rename(HEL, hel_y = Y),
rename(ITA, ita_y = Y)) |>
tidyr::pivot_longer(everything(), values_drop_na = TRUE) # drop the NAs created by bind_rows()
# Make the plots
dat |>
ggplot(aes(name, value)) +
geom_violin(aes(fill = name)) +
geom_boxplot(width = 0.1) +
scale_fill_manual(values = c("deepskyblue", "aquamarine")) +
theme(legend.position = "none")
Created on 2022-04-28 by the reprex package (v2.0.1)

R control jitter function - avoid overplotting / non-random jitter

My problem seems simple: I am using ggplot2 with geom_jitter() to plot a variable (take my picture as an example).
Jitter now adds some random noise to the variable (the variable is just called "1" in this example) to prevent overplotting. So I have now random noise in the y-direction and clearly what otherwise would be completely overplotted is now better visible.
But here is my question:
As you can see, there are still some points that overplot each other. In my example, this could easily be prevented if the offsets in the y-direction were not random noise but were placed more strategically.
Can I somehow alter the geom_jitter() behavior or is there a similar function in ggplot2 that does exactly this?
Not really a minimal example, but also not too long:
library("imputeTS")
library("ggplot2")
data <- tsAirgap
# 2.1 Create required data
# Get all indices of the data that comes directly before and after an NA
na_indx_after <- which(is.na(data[1:(length(data) - 1)])) + 1
# starting from index 2 moves all indexes one in front, so no -1 needed for before
na_indx_before <- which(is.na(data[2:length(data)]))
# Get the actual values to the indices and put them in a data frame with a label
before <- data.frame(id = "1", type = "before", input = na_remove(data[na_indx_before]))
after <- data.frame(id = "1", type = "after", input = na_remove(data[na_indx_after]))
all <- data.frame(id = "1", type = "source", input = na_remove(data))
# Get n values for the plot labels
n_before <- length(before$input)
n_all <- length(all$input)
n_after <- length(after$input)
# 2.4 Create dataframe for ggplot2
# join the data together in one dataframe
df <- rbind(before, after, all)
# Create the plot
gg <- ggplot(data = df) +
geom_jitter(mapping = aes(x = id, y = input, color = type, alpha = type), width = 0.5 , height = 0.5)
gg <- gg + ggplot2::scale_color_manual(
values = c("before" = "skyblue1", "after" = "yellowgreen", "source" = "gray66")
)
gg <- gg + ggplot2::scale_alpha_manual(
values = c("before" = 1, "after" = 1, "source" = 0.3)
)
gg + ggplot2::theme_linedraw() + theme(aspect.ratio = 0.5) + ggplot2::coord_flip()
So many good suggestions... here is what Ben's suggestion would look like for my example:
I changed parts of my code to:
gg <- ggplot(data = df, aes(x = input, color = type, fill = type, alpha = type)) +
geom_dotplot(binwidth = 15)
This would basically also work as intended for me. ggbeeswarm, as suggested by Jon, also worked great for my purpose.
I thought of a hack I really like, using ggrepel. It's normally used for labels, but nothing prevents you from making the label into a point.
df <- data.frame(x = rnorm(200),
col = sample(LETTERS[1:3], 200, replace = TRUE),
y = 1)
ggplot(df, aes(x, y, label = "●", color = col)) + # using unicode black circle
ggrepel::geom_text_repel(segment.color = NA,
box.padding = 0.01, key_glyph = "point")
A downside of this method is that ggrepel can take a lot of time for a large number of points, and will recalculate differently each time you change the plot size. A faster alternative would be to use ggbeeswarm::geom_quasirandom, which uses a deterministic process to define jitter that looks random.
ggplot(df, aes(x,y, color = col)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE)
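For readers who want to see what a non-random offset can look like without extra packages, here is a minimal base R sketch. It is not how ggbeeswarm or ggrepel work internally; it simply bins the values and stacks the points of each bin symmetrically around zero. The bin width of 0.25 and the names bin and offset are arbitrary choices for illustration.

```r
set.seed(1)
x <- rnorm(200)

# Group nearby values into bins of width 0.25
bin <- cut(x,
           breaks = seq(floor(min(x)), ceiling(max(x)), by = 0.25),
           include.lowest = TRUE)

# Within each bin, give the points evenly spaced offsets centred on zero
offset <- ave(x, bin, FUN = function(v) seq_along(v) - (length(v) + 1) / 2)

plot(x, offset, pch = 19, col = "gray40",
     xlab = "value", ylab = "deterministic offset")
```

Because the offsets depend only on the data and the bin width, the plot looks the same every time it is drawn.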

Adding text outside plot doesn't work in r

I have a simple dataset:
11 observations, 1 variable.
I want to plot them adding my own axis names, but when I want to change the position of them, R keeps plotting them in the exact same spot.
Here is my script:
plot(data[,5], xlab = "", xaxt='n')
axis(1, at = 1:11, labels = F)
text(1:11, par("usr")[3] - 0.1, srt = 90, adj = 1, labels = names, xpd = TRUE)
I have changed the -0.1 to various other numbers, but R keeps placing the labels in the exact same spot. I tried with short names like "a" but the result is the same.
Thanks in advance
My data:
10308.9
10201.6
12685.3
3957.93
7677.1
9671.7
11849.4
10755.7
11283.4
11583.8
12066.9
names <- rep("name",11)
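One possible explanation for the base graphics behaviour, offered here as an assumption rather than something confirmed in the thread: the offset in text() is in the units of the y axis, and these values span several thousand, so a shift of 0.1 (or even 10) is too small to see. A sketch that scales the offset to the plotted range instead is shown below; the 5% factor and the names values, labs and offset are illustrative choices.

```r
values <- c(10308.9, 10201.6, 12685.3, 3957.93, 7677.1, 9671.7,
            11849.4, 10755.7, 11283.4, 11583.8, 12066.9)
labs <- rep("name", 11)

plot(values, xlab = "", xaxt = "n")
axis(1, at = 1:11, labels = FALSE)

# par("usr") holds the plot region limits: c(x1, x2, y1, y2)
usr <- par("usr")

# Offset expressed as 5% of the visible y range, in data units
offset <- 0.05 * (usr[4] - usr[3])

text(1:11, usr[3] - offset, labels = labs,
     srt = 90, adj = 1, xpd = TRUE)
```

Changing the 0.05 factor should then move the labels visibly closer to or further from the axis.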
My ggplot solution:
# creating the sample dataframe
data <- read.table(text="10308.9
10201.6
12685.3
3957.93
7677.1
9671.7
11849.4
10755.7
11283.4
11583.8
12066.9", header=FALSE)
# adding a names column
data$names <- as.factor(paste0("name",sprintf("%02.0f", seq(1,11,1))))
#creating the plot
library(ggplot2)
ggplot(data, aes(x=names, y=V1)) +
geom_bar(stat="identity", fill = "white", color = "black")
which gives:
When you want to change the order of the bars, you can do that with transform:
# transforming the data (I placed "name04" as the first one)
data2 <- transform(data,
newnames=factor(names,
levels=c("name04","name01","name02","name03","name05","name06","name07","name08","name09","name10","name11"),
ordered =TRUE))
#creating the plot
ggplot(data2, aes(x=newnames, y=V1)) +
geom_bar(stat="identity", fill="white", color="black")
which gives:
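Typing out every level by hand is easy to get wrong, since a factor cannot have duplicated levels. As a hedged alternative, the level order can be built programmatically by putting the chosen name first and appending the rest with setdiff(). The names nm, new_levels and f below are illustrative.

```r
# Same naming scheme as above: name01 ... name11
nm <- paste0("name", sprintf("%02d", 1:11))

# Put "name04" first, then all remaining names in their original order
new_levels <- c("name04", setdiff(nm, "name04"))

f <- factor(nm, levels = new_levels)
levels(f)
```

The resulting factor could then be used for the x aesthetic in the same way as newnames.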
