Adding text annotation to a clustering scatter plot (tSNE) - r

I have XY data (a 2D tSNE embedding of high dimensional data) which I'd like to scatter plot. The data are assigned to several clusters, so I'd like to color code the points by cluster and then add a single label for each cluster, that has the same color coding as the clusters, and is located outside (as much as possible) from the cluster's points.
Any idea how to do this using R in either ggplot2 and ggrepel or plotly?
Here's the example data (the XY coordinates and cluster assignments are in df and the labels in label.df) and the ggplot2 part of it:
library(dplyr)
library(ggplot2)
set.seed(1)
df <- do.call(rbind,lapply(seq(1,20,4),function(i) data.frame(x=rnorm(50,mean=i,sd=1),y=rnorm(50,mean=i,sd=1),cluster=i)))
df$cluster <- factor(df$cluster)
label.df <- data.frame(cluster=levels(df$cluster),label=paste0("cluster: ",levels(df$cluster)))
ggplot(df,aes(x=x,y=y,color=cluster))+geom_point()+theme_minimal()+theme(legend.position="none")

The geom_label_repel() function in the ggrepel package adds labels to a plot while trying to "repel" them so that they do not overlap each other or the data points. Only a slight addition to your existing code is needed: summarize the data to get the coordinates where each label should go (here I chose the upper left'ish region of each cluster, which is the min of x and the max of y) and merge the result with your existing data frame containing the cluster labels. Pass this data frame to geom_label_repel() and name the label column in aes().
library(dplyr)
library(ggplot2)
library(ggrepel)
set.seed(1)
df <- do.call(rbind,lapply(seq(1,20,4),function(i) data.frame(x=rnorm(50,mean=i,sd=1),y=rnorm(50,mean=i,sd=1),cluster=i)))
df$cluster <- factor(df$cluster)
label.df <- data.frame(cluster=levels(df$cluster),label=paste0("cluster: ",levels(df$cluster)))
label.df_2 <- df %>%
  group_by(cluster) %>%
  summarize(x = min(x), y = max(y)) %>%
  left_join(label.df, by = "cluster") # name the join column explicitly
ggplot(df, aes(x = x, y = y, color = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "none") +
  ggrepel::geom_label_repel(data = label.df_2, aes(label = label))
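For reference, the label anchors can also be computed without dplyr. This is a minimal base R sketch of the same idea (min of x and max of y per cluster); the names anchor_x, anchor_y and anchors are made up for illustration and do not appear in the answer above.

```r
set.seed(1)
df <- do.call(rbind, lapply(seq(1, 20, 4), function(i)
  data.frame(x = rnorm(50, mean = i, sd = 1),
             y = rnorm(50, mean = i, sd = 1),
             cluster = i)))
df$cluster <- factor(df$cluster)

# One row per cluster: leftmost x and topmost y (hypothetical helper names)
anchor_x <- aggregate(x ~ cluster, data = df, FUN = min)
anchor_y <- aggregate(y ~ cluster, data = df, FUN = max)
anchors <- merge(anchor_x, anchor_y, by = "cluster")

# Same label text as in label.df
anchors$label <- paste0("cluster: ", anchors$cluster)
print(anchors)
```

The resulting anchors data frame could then be passed to geom_label_repel(data = anchors, aes(label = label)) in place of label.df_2.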

Related

Plotting naturalearth RasterLayer with natural colors using R and ggplot2

I am trying to create a map of some parts of South America using naturalearthdata. It is important for me that I create a ggplot2 object, so I can further modify and combine the plot with geom_sf and geom_point. The last part, which I thought would be most challenging, is already done.
My main problem right now is that the raster-object is colored as if the integer value were an intensity. However, I would like the color to be as in the source:
https://www.naturalearthdata.com/downloads/10m-natural-earth-1/10m-natural-earth-1-with-shaded-relief-water-and-drainages/
I use the following code:
library(tidyverse)
library(raster)
hills <- raster("./map_data/NE1_LR_LC_SR_W_DR.tif")
# for the larger dataframe, my RStudio keeps crashing
hills_df <- as.data.frame(hills, xy = T) %>%
filter(x >= -90) %>% filter(x <= -55) %>% filter(y <= 5) %>% filter(y >= -30)
ggplot(data = hills_df) +
geom_raster(aes(x=x, y=y,fill=NE1_LR_LC_SR_W_DR))
The problem is that, besides the coordinates, the data frame has only one other column, which holds an integer value. By setting the aesthetics to fill=value, this integer gets mapped as a magnitude. How can I implement the natural coloring as in the source?
Best,
Tarotis
One way to achieve this is to make a data.frame with the coordinates of each cell of the rasterStack and the values of the three layers combined via rgb (see this blog post).
However, geom_tile and even geom_raster appear to be very slow when as few as tens of thousands of pixels are involved. I use the following workaround:
Create a 3D array from the rasterStack
Collapse it to two dimensions with the rgb function (as explained in this SO answer)
Pass the matrix of RGB values to ggplot with annotation_raster
The code can be as follows:
library(tidyverse)
library(raster)
natearth_map <-
  raster::stack('./map_data/NE1_LR_LC_SR_W_DR.tif') %>% # import tiff as rasterStack
  crop(extent(-90, -55, -30, 5)) %>% # subset to desired extent
  as.array %>% # convert to 3D array
  `/`(255) %>% # switch to proportions to meet rgb() requirements
  apply(c(1, 2), function(x) rgb(matrix(x, ncol = 3))) %>% # collapse layers to RGB colors
  annotation_raster(-90, -55, -30, 5) # wrap the color matrix as a ggplot2 layer
Then you can add it to your ggplot like this.
ggplot(data, aes(x, y)) + natearth_map + geom_point()
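The collapse step in the pipeline above can be checked on a toy array before running it on the full raster. This is a small base R sketch with an invented 2 x 2 "image" (the object names img and cols are illustrative only); it uses rgb(v[1], v[2], v[3]) per pixel, which should be equivalent to the rgb(matrix(x, ncol = 3)) call in the answer.

```r
# A 2 x 2 image with three layers (red, green, blue), values in 0-255
img <- array(0, dim = c(2, 2, 3))
img[1, 1, ] <- c(255, 0, 0)      # red pixel
img[1, 2, ] <- c(0, 255, 0)      # green pixel
img[2, 1, ] <- c(0, 0, 255)      # blue pixel
img[2, 2, ] <- c(255, 255, 255)  # white pixel

# Scale to [0, 1], then collapse the third dimension into one hex color per cell
cols <- apply(img / 255, c(1, 2), function(v) rgb(v[1], v[2], v[3]))
print(cols)
```

The result is a character matrix with the same rows and columns as the image, which is the kind of object annotation_raster() accepts.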

Change position of legend in plot of pec object

I am trying to plot the prediction error curve from pec package but I can't change the legend position and size. There's an example from pec package:
library(rms)
library(pec)
data(pbc)
pbc <- pbc[sample(1:NROW(pbc),size=100),]
f1 <- psm(Surv(time,status!=0)~edema+log(bili)+age+sex+albumin,data=pbc)
f2 <- coxph(Surv(time,status!=0)~edema+log(bili)+age+sex+albumin,data=pbc,x=TRUE,y=TRUE)
f3 <- cph(Surv(time,status!=0)~edema+log(bili)+age+sex+albumin,data=pbc,surv=TRUE)
brier <- pec(list("Weibull"=f1,"CoxPH"=f2,"CPH"=f3),data=pbc,formula=Surv(time,status!=0)~1)
print(brier)
plot(brier)
But it shows a big legend in the middle of the plot.
I also tried:
plot(brier, legend = "topright")
class(brier)
But then no legend is shown.
How can I change the position of the legend? Also, is it possible to plot this graph using ggplot?
I think I got what you want using ggplot2. The idea is to pick the elements of your brier object that contain the data for the plot, put them in a data frame, and plot that.
library(ggplot2)
# packages for the pipe and pivot_longer; you can do it with base functions, I just prefer these
library(tidyr)
library(dplyr)
df <- do.call(cbind, brier[["AppErr"]]) # contains y values for each model
df <- cbind(brier[["time"]], df) # values of the x axis
colnames(df)[1] <- "time"
# pivoting the table to long format makes it easier to use ggplot
df <- as.data.frame(df) %>%
  pivot_longer(cols = 2:last_col(), names_to = "models", values_to = "values")
ggplot(data = df, aes(x = time, y = values, color = models)) +
  geom_line() # I suppose you know how to customize axis names etc.
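As the comments above note, the reshaping can also be done with base functions. This is a minimal sketch on toy stand-ins for brier[["time"]] and brier[["AppErr"]]; the numbers are invented for illustration and are not output from pec.

```r
# Toy stand-ins (invented values): a time vector and one error curve per model
time <- c(0, 1, 2)
app_err <- list(Weibull = c(0.00, 0.10, 0.20),
                CoxPH   = c(0.00, 0.12, 0.18))

# Long format: one row per (time, model) pair, same column names as above
long <- data.frame(
  time   = rep(time, times = length(app_err)),
  models = rep(names(app_err), each = length(time)),
  values = unlist(app_err, use.names = FALSE)
)
print(long)
```

A data frame of this shape can go straight into the ggplot() call with color = models.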

How to make "interactive" time series plots for exploratory data analysis

I have a time series data frame similar to the data created below. Measurements of 5 variables are taken on each individual, and individuals have unique ID numbers. Note that in this data set every individual has the same length (1000 observations each), but in my real data set each individual has a different number of observations.
For each individual, I want to plot all 5 variables on top of one another (i.e. all on the y axis) against time (x axis). I want to print each of these plots to an external document of some kind (pdf, or whatever is recommended for this application) with one plot per page, so that each individual has its own page with a single plot. I want these time series plots to be "interactive", in that I can move my mouse over a point and it will tell me the time of that data point. My goal is to explore the association between peaks, valleys, and other regions across the 5 variables.
I am not sure if ggplot2 is still the best tool for this, but I would like the plots to be aesthetically appealing so that patterns in the data are easier to see. Also, is pasting these plots into a pdf the most sensible route? Or would I be better off using an R notebook or some other application?
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data<- data.frame(cbind(ID,time,one,two,three,four,five))
Try using the plotly package. And since you want it to be interactive, you'll want to export as something like html rather than pdf.
To produce a single faceted plot (note I added stringsAsFactors = FALSE to your sample data):
library(tidyverse)
library(plotly)
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data <- data.frame(cbind(ID, time, one, two, three, four, five),
                   stringsAsFactors = FALSE)
data_long <- data %>%
  gather(variable,
         value,
         one:five) %>%
  mutate(time = as.numeric(time),
         value = as.numeric(value))
plot <- data_long %>%
  ggplot(aes(x = time,
             y = value,
             color = variable)) +
  geom_point() +
  facet_wrap(~ID)
interactive_plot <- ggplotly(plot)
htmlwidgets::saveWidget(interactive_plot, "example.html")
If you want to produce and export an interactive plot for every ID programmatically:
walk(unique(data_long$ID),
     ~ htmlwidgets::saveWidget(ggplotly(data_long %>%
                                          filter(ID == .x) %>%
                                          ggplot(aes(x = time,
                                                     y = value,
                                                     color = variable)) +
                                          geom_point() +
                                          labs(title = paste(.x))),
                               paste("plot_for_ID_", .x, ".html", sep = "")))
Edit: I changed map() to walk() so that the plots are produced without console output (previously just a list with 3 empty elements).
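A side note on the as.numeric() calls: they are needed because cbind() on a mix of character and numeric vectors yields a character matrix, so every column arrives in the data frame as text. This is a small base R sketch of the difference; the object names via_cbind and direct are made up for illustration.

```r
ID <- rep(c("A", "B"), each = 3)
time <- rep(1:3, times = 2)
one <- rnorm(3)  # shorter vector, recycled to length 6 in both calls below

# cbind() first: everything is coerced to character
via_cbind <- data.frame(cbind(ID, time, one), stringsAsFactors = FALSE)

# Passing the vectors directly keeps each column's own type
direct <- data.frame(ID, time, one, stringsAsFactors = FALSE)

sapply(via_cbind, class)
sapply(direct, class)
```

Building the data frame directly would make the later as.numeric() conversion unnecessary, though the answer keeps the question's cbind() construction.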

Change plotting order of categories of data in tmap map R

I am plotting some spatial data in R using the tmap package. I define breaks and plot color in the tm_dots function. I'd like to be able to define the plot order of the categories so that they are defined by the category (highest category on top, second highest below that, etc.). I need to be able to see clearly where the highest category points are. I know this could be achieved with multiple spatial point data frames, but is there another less clunky way? Below is an example using the meuse data. I make the points huge so they overlap. So ideally here in the plot the plot order would be: blue, green, orange, red.
library(tmap)
library(sp)
data("meuse")
coordinates(meuse) <- c("x","y")
tm_layout() +
tm_shape(meuse) + tm_dots("cadmium", breaks = c(1,2,3,4,Inf), palette = "-Spectral", auto.palette.mapping = FALSE,
size = 1) +
tm_legend(legend.outside = TRUE)
Turns out the default plot order is the original data frame row order. To make the categories plot in the correct order, I create a numeric factor variable for the categories and sort the original data frame by it.
library(tmap)
library(sp)
library(dplyr)
data("meuse")
meuse <- meuse %>%
  mutate(cat = base::cut(cadmium, breaks = c(-Inf,1,2,3,4,Inf),
                         labels = c(1,2,3,4,5))) %>%
  arrange(cat)
coordinates(meuse) <- c("x","y")
tm_layout() +
tm_shape(meuse) + tm_dots("cadmium", breaks = c(1,2,3,4,Inf), palette = "-Spectral", auto.palette.mapping = FALSE,
size = 1) +
tm_legend(legend.outside = TRUE)
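The binning and sorting step can be checked on its own with a few values. This is a minimal base R sketch with invented cadmium values (not taken from meuse); the name draw_order is illustrative only.

```r
# Invented values for illustration
cadmium <- c(3.5, 0.4, 12.1, 1.7, 2.2)

# Same breaks as in the answer: categories 1 (lowest) to 5 (highest)
cat <- cut(cadmium, breaks = c(-Inf, 1, 2, 3, 4, Inf), labels = 1:5)

# Sorting by category puts the highest category last, i.e. drawn last and on top
draw_order <- order(cat)
print(cadmium[draw_order])
```

Rows that come later in the data frame are drawn later, so after sorting the highest-category points end up on top.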

How to do a 2d heatmap with color smoothing ... or a density plot from absolute values?

I've done the rounds here and via google without a solution, so please help if you can.
I'm looking to create something like this : painSensitivityHeatMap using ggplot2
I can create something kinda similar using geom_tile, but without the smoothing between data points ... the only solution I have found requires a lot of code and data interpolation. Not very elegant, methinks: uglySolutionUsingTile
So I'm thinking, I could coerce the density2d plots to my purposes instead by having the plot use fixed values rather than a calculated data-point density -- much in the same way that stat='identity' can be used in histograms to make them represent data values, rather than data counts.
So a minimal working example:
df <- expand.grid(letters[1:5], LETTERS[1:5])
df$value <- sample(1:4, 25, replace=TRUE)
# A not so pretty, non-smooth tile plot
ggplot(df, aes(x=Var1, y=Var2, fill=value)) + geom_tile()
# A potentially beautiful density2d plot, except it fails :-(
ggplot(df, aes(x=Var1, y=Var2)) + geom_density2d(aes(color=..value..))
This took me a little while, but here is a solution for future reference.
It uses idw from the gstat package and spsample from the sp package.
I've written a function which takes a data frame, the number of blocks (tiles), and lower and upper anchors for the colour scale.
The function creates a polygon (a simple 5x5 square) and from that creates a grid of that shape.
In my data, the location variables are ordered factors, therefore I unclass them into numbers (1 to 5, corresponding to the polygon grid) and convert them to coordinates, thus converting tmpDF from a data frame to a spatial data frame. Note: there are no overlapping/duplicate locations, i.e. 25 observations corresponding to the 5x5 grid.
The idw function fills in the polygon grid (newdata) with inverse-distance weighted values ... in other words, it interpolates my data to the full polygon grid of a given number of tiles ('blocks').
Finally I create a ggplot based on a color gradient from the colorRamps package.
library(dplyr)      # %>%, mutate, filter
library(sp)         # Polygon, spsample, coordinates
library(gstat)      # idw
library(ggplot2)
library(colorRamps) # matlab.like2
painMapLumbar <- function(tmpDF, blocks=2500, lowLimit=min(tmpDF$value), highLimit=max(tmpDF$value)) {
  # Create polygon to represent the lower back (lumbar)
  poly <- Polygon(matrix(c(0.5, 0.5,0.5, 5.5,5.5, 5.5,5.5, 0.5,0.5, 0.5), ncol=2, byrow=TRUE))
  # Create a grid of datapoints from the polygon
  polyGrid <- spsample(poly, n=blocks, type="regular")
  # Filter out the data for the figure we want
  tmpDF <- tmpDF %>% mutate(x=unclass(x)) %>% mutate(y=unclass(y))
  tmpDF <- tmpDF %>% filter(y<6) # Lumbar region only
  coordinates(tmpDF) <- ~x+y
  # Interpolate the data as Inverse Distance Weighted
  invDistanceWeigthed <- as.data.frame(idw(formula = value ~ 1, locations = tmpDF, newdata = polyGrid))
  p <- ggplot(invDistanceWeigthed, aes(x=x1, y=x2, fill=var1.pred)) + geom_tile() + scale_fill_gradientn(colours=matlab.like2(100), limits=c(lowLimit,highLimit))
  return(p)
}
I hope this is useful to someone ... thanks for the replies above ... they helped me move on.
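To make the interpolation step less of a black box, here is a minimal base R sketch of inverse-distance weighting at a single point. It is not gstat's implementation; the function name idw_point and the power of 2 are assumptions made for this illustration.

```r
# Inverse-distance weighted estimate at (px, py) from observations (x, y, value).
# Hypothetical helper, not part of gstat; power = 2 is an assumed weighting exponent.
idw_point <- function(px, py, x, y, value, power = 2) {
  d <- sqrt((x - px)^2 + (y - py)^2)
  # Exactly on an observation: return that observation's value
  if (any(d == 0)) return(value[which.min(d)])
  w <- 1 / d^power
  sum(w * value) / sum(w)
}

# Two invented observations on a line
obs <- data.frame(x = c(1, 3), y = c(1, 1), value = c(0, 4))

mid   <- idw_point(2,   1, obs$x, obs$y, obs$value)  # equal distances, plain average
near1 <- idw_point(1.5, 1, obs$x, obs$y, obs$value)  # pulled towards the first point
print(c(mid, near1))
```

Applying such an estimate to every cell of a regular grid gives the smooth surface that geom_tile() then draws, which is roughly what idw() does for the newdata grid in the function above.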
