How to tailor R plot axis?

I have the following data stored in a CSV file.
NodeID,pageRank
0,0.0327814931593
1,0.384378430034
2,0.342932804288
3,0.0390870921
4,0.0808856932345
5,0.0390870921
I have read the CSV file in R and ordered pageRank values in descending order.
data <- read.csv("pagerank.csv")
data <- data[order(-data$pageRank),]
After ordering, the data look like this:
1 0.38437843
2 0.34293280
4 0.08088569
3 0.03908709
5 0.03908709
0 0.03278149
In the above example, the first column is the NodeID (no longer in sequential order) and the second column is the pageRank (in descending order). Next, I used the following command to plot the data.
plot(data$pageRank, type="o", col="red", xlab="Node Rank", ylab="PageRank Value")
The plot shows the Y-axis (pageRank values) properly. However, the X-axis shows sequential numbers (0,1,2,3,4,5). Instead of sequential numbers, how can I show the NodeID values (1,2,4,3,5,0) available in data on the X-axis while maintaining the descending order of pageRank? I have tried the following, but it does not maintain pageRank's descending order.
plot(data$NodeID, data$pageRank, type="o", col="red", xlab="NodeID", ylab="PageRank Value")

In ggplot, we assemble the plot from layers connected with the plus operator +.
So we can start by defining the dataset to plot (data), then use the aes function to specify which variables to use for the x and y axes. Finally we tell it to plot both points and a line using this data.
library(ggplot2)
ggplot(data = data,
aes(x = NodeID, y = pageRank)) +
geom_point() +
geom_line() +
xlab('Node Rank') + ylab('PageRank Value')
Simple! I highly recommend using ggplot whenever possible over the limited and obtusely designed base R graphics.
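Note that mapping NodeID directly to x places the points in NodeID order, not in rank order. A minimal base R sketch that keeps the descending pageRank order and only relabels the axis could look like the following. It assumes the sample data from the question, rebuilt inline so the example is self-contained, and relies on the standard xaxt = "n" plus axis() mechanism rather than anything specific to this thread.

```r
# Sample data from the question, typed inline so the sketch is self-contained
data <- data.frame(
  NodeID   = 0:5,
  pageRank = c(0.0327814931593, 0.384378430034, 0.342932804288,
               0.0390870921, 0.0808856932345, 0.0390870921)
)

# Sort by pageRank, largest first (same call as in the question)
data <- data[order(-data$pageRank), ]

# Plot against the rank position and suppress the default x-axis ...
plot(seq_along(data$pageRank), data$pageRank, type = "o", col = "red",
     xaxt = "n", xlab = "NodeID", ylab = "PageRank Value")

# ... then draw an x-axis whose tick labels are the NodeID values
axis(side = 1, at = seq_along(data$NodeID), labels = data$NodeID)
```

The x positions are still 1 to 6, so the line keeps its descending shape; only the tick labels change.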

You can use fct_reorder from the forcats package to do the ordering for you.
library(magrittr)
library(tidyverse)
library(forcats)
txt <- "NodeID,pageRank
0,0.0327814931593
1,0.384378430034
2,0.342932804288
3,0.0390870921
4,0.0808856932345
5,0.0390870921"
df <- read_csv(txt)
# Convert NodeID column to factor first
df %<>%
mutate(NodeID = factor(NodeID))
# Plot
ggplot(df, aes(pageRank, fct_reorder(NodeID, pageRank))) +
geom_point() +
geom_segment(aes(y = NodeID, yend = NodeID, x = 0, xend = pageRank), color = "red") +
scale_x_continuous(expand = c(0, 0)) +
ylab("Node Rank") +
xlab("PageRank Value") +
theme_classic(base_size = 16)
ggplot(df, aes(y = pageRank, x = fct_reorder(NodeID, -pageRank))) +
geom_line(group = 1, color = "red") +
scale_y_continuous(expand = c(0, 0)) +
xlab("Node Rank") +
ylab("PageRank Value") +
theme_classic(base_size = 16)
Using base R plot
df$NodeID <- factor(df$NodeID, levels = c("1", "2", "4", "3", "5", "0"))
plot(df$pageRank ~ df$NodeID, xlab = "Node Rank", ylab = "PageRank Value")
Created on 2018-08-08 by the reprex package (v0.2.0.9000).
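The base R variant above hard-codes the factor levels. As a hedged sketch, the same level order can be derived from the data instead, which is roughly what fct_reorder does for you. The data frame is rebuilt inline here and order(-pageRank) is borrowed from the question; nothing else is assumed.

```r
# Sample data from the question, typed inline so the sketch is self-contained
df <- data.frame(
  NodeID   = 0:5,
  pageRank = c(0.0327814931593, 0.384378430034, 0.342932804288,
               0.0390870921, 0.0808856932345, 0.0390870921)
)

# Row positions sorted by descending pageRank
ord <- order(-df$pageRank)

# Use the NodeID values in that order as the factor levels
df$NodeID <- factor(df$NodeID, levels = df$NodeID[ord])

levels(df$NodeID)
```

Because the levels come from the data, the plot stays correct if the pageRank values change.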

Related

how to input data from multiple columns in x and y arguments in ggplot

I am trying to create a density plot for particle size data. My data has multiple density and size readings for each genotype set. Is there a way to specify multiple columns into x and y using ggplot? I tried coding for this but am only getting a blank plot as of now. This is the link to the csv file I used: https://drive.google.com/file/d/11djXTmZliPCGLCZavukjb0TT28HsKMRQ/view?usp=sharing
Thanks!
crop.data6 <- read.csv("barleygt25.csv", header = TRUE)
crop.data6
library(ggplot2)
plot1 = ggplot(data=crop.data6, aes(x=, xend=bq, y=a, yend=bq, color=genotype))
plot1
Your data is in a strange format that doesn't lend itself well to plotting. Effectively, it needs to be transposed then pivoted into long format to make it suitable for plotting:
df <- data.frame(xvals = c(t(crop.data6[1:9, -c(1:2)])),
yvals = c(t(crop.data6[10:18, -c(1:2)])),
genotype = rep(crop.data6$genotype[1:9], each = 68))
ggplot(df, aes(xvals, yvals, color = genotype)) +
geom_line(size = 1) +
scale_color_brewer(palette = "Set1") +
theme_bw(base_size = 16) +
labs(x = "value", y = "density")
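To see what the c(t(...)) idiom in the answer does, here is a small sketch on a made-up wide table. The column names v1 to v3 and the values are hypothetical and much smaller than the real file; the point is only that transposing first makes c() flatten the table row by row.

```r
# Hypothetical wide table: one row per genotype, readings spread over columns
wide <- data.frame(
  genotype = c("A", "B"),
  v1 = c(1, 2),
  v2 = c(3, 4),
  v3 = c(5, 6)
)

# t() turns each row into a column; c() then flattens column by column,
# which is row by row with respect to the original table
vals <- c(t(wide[, -1]))

# Repeat each genotype once per reading so the labels line up with vals
long <- data.frame(
  genotype = rep(wide$genotype, each = ncol(wide) - 1),
  value    = vals
)
long
```

In the answer, each = 68 plays the role of ncol(wide) - 1 here: it is the number of readings per row.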

How do I correctly connect data points ggplot

I am making a stratigraphic plot but somehow, my data points don't connect correctly.
The purpose of this plot is that the values on the x-axis are connected so you get an overview of the change in d18O throughout time (age, ma).
I've used the following script:
library(readxl)
R_pliocene_tot <- read_excel("Desktop/R_d18o.xlsx")
View(R_pliocene_tot)
install.packages("analogue")
install.packages("gridExtra")
library(tidyverse)
R_pliocene_Rtot <- R_pliocene_tot %>%
gather(key=param, value=value, -age_ma)
R_pliocene_Rtot
R_pliocene_Rtot %>%
ggplot(aes(x=value, y=age_ma)) +
geom_path() +
geom_point() +
facet_wrap(~param, scales = "free_x") +
scale_y_reverse() +
labs(x = NULL, y = "Age (ma)")
which leads to the following figure:
Something is wrong with the geom_path function, I guess, but I can't figure out what it is.
Although the comment seems to solve the problem, I don't think the question asked was actually answered. So here is a short introduction to geom_path in the ggplot2 library.
library(dplyr)
library(ggplot2)
# This dataset contains two groups, with random values for y and x running from 1 to 20.
# The param column just replicates the param variable from the question.
df <- tibble(x = rep(seq(1, 20, by = 1), 2),
y = runif(40, min = 1, max = 100),
group = c(rep("group 1", 20), rep("group 2", 20)),
param = rep("a param", 40))
df %>%
ggplot(aes(x = x, y = y)) +
# geom_path has a group aesthetic, which tells the function
# which data points belong to which path.
# Points in the same group are connected together.
# Here I use color to help distinguish the paths a bit more.
geom_path(aes(group = group, color = group)) +
geom_point() +
facet_wrap(~param, scales = "free_x") +
scale_y_reverse() +
labs(x = NULL, y = "Age (ma)")
In your data, which works well with group = 1, I guess all data points belong to one group and you just want to draw a line connecting all of them. Taking my example data above and drawing it with the aesthetic group = 1, you can see that the result has two lines similar to the example above, but now the end point of group 1 is connected to the starting point of group 2.
So all data points are now on one path, but the order in which they are drawn depends on the order in which they appear in the data. (I keep the color just to make it a bit clearer.)
df %>%
ggplot(aes(x = x, y = y)) +
geom_path(aes(group = 1, color = group)) +
geom_point() +
facet_wrap(~param, scales = "free_x") +
scale_y_reverse() +
labs(x = NULL, y = "Age (ma)")
Hope this gives you a better understanding of ggplot2::geom_path.
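Since geom_path connects points in the order they appear in the data, the row order is also worth checking. The sketch below uses made-up values and hypothetical column names (age_ma, d18o) that are not the data from the question; it only illustrates that sorting by age before plotting gives one continuous profile.

```r
library(ggplot2)

# Hypothetical stratigraphic data: rows deliberately NOT in age order
strat <- data.frame(
  age_ma = c(3.0, 2.5, 4.0, 3.5, 5.0, 4.5),
  d18o   = c(3.1, 2.9, 3.4, 3.3, 3.0, 3.6)
)

# Sort the rows by age so geom_path() walks down the section once
strat_sorted <- strat[order(strat$age_ma), ]

p <- ggplot(strat_sorted, aes(x = d18o, y = age_ma)) +
  geom_path() +
  geom_point() +
  scale_y_reverse() +
  labs(x = "d18O", y = "Age (ma)")
print(p)
```

Plotting strat instead of strat_sorted with the same code would draw a path that jumps back and forth between ages.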

ggplot is not graphing a vertical line

I am trying to plot a graph in ggplot2 where the x-axis represents month-day combinations, the dots represent y-values for two different groups.
When graphing my original data set using this code,
ggplot(graphing.df, aes(MONTHDAY, y.var, color = GROUP)) +
geom_point() +
ylab(paste0(""))+
scale_x_discrete(breaks = function(x) x[seq(1, length(x), by = 15)])+
theme(legend.text = element_blank(),
legend.title = element_blank()) +
geom_vline(xintercept = which(graphing.df$MONTHDAY == "12-27")[1], col='red', lwd=2)
I get this graph where the vertical line is not showing.
When I tried to create a reproducible example using the following code...
df <- data.frame(MONTHDAY = c("01-01", "01-01", "01-02", "01-02", "01-03", "01-03"),
TYPE = rep(c("A", "B"), 3),
VALUE = sample(1:10, 6, replace = TRUE))
verticle_line <- "01-02"
ggplot(df, aes(MONTHDAY, VALUE, color = TYPE)) +
geom_point() +
#geom_vline(xintercept = which(df$MONTHDAY == verticle_line)[1], col='red', lwd=2)+
geom_vline(xintercept = which(df$MONTHDAY == verticle_line), col='blue', lwd=2)
The vertical line is showing, but now it's showing in the wrong place.
In my original data set I have two values for each month-day combination (representing each of the two groups). The month-day combination column is a character vector, it is not a factor and does not have levels.
Here is a way. It subsets the data keeping only the rows of interest and plots the vertical line defined by MONTHDAY.
library(ggplot2)
verticle_line <- "01-02"
ggplot(df, aes(MONTHDAY, VALUE, color = TYPE)) +
geom_point() +
geom_vline(data = subset(df, MONTHDAY == verticle_line),
mapping = aes(xintercept = MONTHDAY), color = 'blue', size = 2)
Data
I will repost the data creation code, this time setting the RNG seed in order to make the example reproducible.
set.seed(2020)
df <- data.frame(MONTHDAY = c("01-01", "01-01", "01-02", "01-02", "01-03", "01-03"),
TYPE = rep(c("A", "B"), 3),
VALUE = sample(1:10, 6, replace = TRUE))
The reason your line is not showing up where you expect is because you are setting the value of xintercept= via the output of the which() function. which() returns the index value where the condition is true. So in the case of your reproducible example, you get the following:
> which(df$MONTHDAY == verticle_line)
[1] 3 4
It returns a vector indicating that in df$MONTHDAY, indexes 3 and 4 in that vector are true. So your code below:
geom_vline(xintercept = which(df$MONTHDAY == verticle_line)...
Reduces down to this:
geom_vline(xintercept = c(3,4)...
Your MONTHDAY axis is not formatted as a date, but treated as a discrete axis of character vectors. In this case xintercept=c(3,4) applied to a discrete axis draws two vertical lines at x intercepts equivalent to the 3rd and 4th discrete position on that axis: in other words, "01-03" and... some unknown 4th position that is not observable within the axis limits.
How do you fix this? Just take out which():
ggplot(df, aes(MONTHDAY, VALUE, color = TYPE)) +
geom_point() +
geom_vline(xintercept = verticle_line, col='blue', lwd=2)
We can also get the corresponding values of 'MONTHDAY' after subsetting:
ggplot(df, aes(MONTHDAY, VALUE, color = TYPE)) +
geom_point() +
geom_vline(xintercept = df$MONTHDAY[df$MONTHDAY == verticle_line],
col='blue', lwd=2)
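The difference between a row index and an axis position can be checked without plotting. This sketch assumes that ggplot2 places character values on a discrete axis in sorted order, which is worth verifying for your own data; the variable names row_idx and axis_pos are only illustrative.

```r
# Same MONTHDAY column as in the reproducible example
df <- data.frame(
  MONTHDAY = c("01-01", "01-01", "01-02", "01-02", "01-03", "01-03")
)
verticle_line <- "01-02"

# which() gives ROW indices in the data frame
row_idx <- which(df$MONTHDAY == verticle_line)

# The position on a discrete axis is the rank among the sorted unique values
# (assumption: the axis uses sorted order for character data)
axis_pos <- match(verticle_line, sort(unique(df$MONTHDAY)))

row_idx
axis_pos
```

With two rows per month-day, the row indices are roughly twice the axis position, which is why the line lands too far to the right or outside the panel.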

How to graph two sets of data with lines and two *different* point symbols with *distinguishable* data point symbols in legend?

I have been trying to plot two sets of data with different point symbols and connecting lines in different colors using the R package ggplot2, but for the life of me, I have not been able to get the legend to correctly distinguish between the two curves by showing the associated data point symbol for each curve.
I can get the legend to show different line colors, but I have not been able to make it show different data point symbols for each set of data.
The following code:
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
library("ggplot2")
print(
ggplot(data = df, aes(df$thrd_cnt, y=df$runtime, color=)) +
geom_line(aes(y=df$runtime4, color = "4 cores")) +
geom_point(aes(y=df$runtime4, color = "4 cores"), fill = "white",
size = 3, shape = 21) +
geom_line(aes(y=df$runtime8, color = "8 cores")) +
geom_point(aes(y=df$runtime8, color = "8 cores"), fill = "white",
size = 3, shape = 23) +
xlab("Number of Threads") +
ylab(substitute(paste("Execution Time, ", italic(milisec)))) +
scale_x_continuous(breaks=c(1,2,4,8,16)) +
theme(legend.position = c(0.3, 0.8)) +
labs(color="# cores")
)
## save a pdf and a png
ggsave("runtime.pdf", width=5, height=3.5)
ggsave("runtime.png", width=5, height=3.5)
outputs this graph:
But the data point symbols in the legend are not distinguishable. The legend shows the same symbol for both graphs (which is formed of both data point symbols on top of each other).
One possible solution is to define the number of threads as a factor, then I might be able to get the data point symbols on the legend right, but still I don't know how to do that.
Any help would be appreciated.
As noted, you need to gather the data into a long format so you can map the cores variable to colour and shape. To keep the same choices of shape and fill as in your original plot, use scale_shape_manual to set the shape corresponding to each level of cores. Note that you need to set the name for both the colour and shape legends in labs() to ensure they coincide and don't produce two legends. I also used mutate so that the levels of cores don't confusingly include the word runtime.
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
library(tidyverse)
ggplot(
data = df %>%
gather(cores, runtime, runtime4, runtime8) %>%
mutate(cores = str_c(str_extract(cores, "\\d"), " cores")),
mapping = aes(x = thrd_cnt, y = runtime, colour = cores)
) +
geom_line() +
geom_point(aes(shape = cores), size = 3, fill = "white") +
scale_x_continuous(breaks = c(1, 2, 4, 8, 16)) +
scale_shape_manual(values = c("4 cores" = 21, "8 cores" = 23)) +
theme(legend.position = c(0.3, 0.8)) +
labs(
x = "Number of Threads",
y = "Execution Time (millisec)",
colour = "# cores",
shape = "# cores"
)
Created on 2018-04-10 by the reprex package (v0.2.0).
Mapping shape works too, and if you're doing more with df, it might make sense to convert it and keep it in long, 'tidy' format.
library("ggplot2")
library("tidyr") # provides gather() and the %>% pipe
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
df <- df %>% gather("runtime", "millisec", 2:3)
ggplot(data = df, aes(x = thrd_cnt, y = millisec, color = runtime, shape = runtime)) +
geom_line() + geom_point()
After gathering into a "long" formatted data frame, you pass colour and shape (pch) as aesthetics:
library(tidyverse)
df <- data.frame( thrd_cnt=c(1,2,4,8,16),
runtime4=c(53,38,31,41,54),
runtime8=c(54,35,31,35,44))
df %>% gather(key=run, value=time, -thrd_cnt) %>%
ggplot(aes(thrd_cnt, time, pch=run, colour=run)) + geom_line() + geom_point()
(Notice how brief the code is, compared to the original post)
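The answers above use gather(), which tidyr now documents as superseded by pivot_longer(). As a hedged sketch of the same reshaping with the newer function: the names long and p are just illustrative, and names_prefix is used here to strip the "runtime" part of the column names.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(thrd_cnt = c(1, 2, 4, 8, 16),
                 runtime4 = c(53, 38, 31, 41, 54),
                 runtime8 = c(54, 35, 31, 35, 44))

# One row per (thread count, core setting) pair
long <- pivot_longer(df,
                     cols = c(runtime4, runtime8),
                     names_to = "cores",
                     names_prefix = "runtime",
                     values_to = "runtime")
long$cores <- paste(long$cores, "cores")

# Mapping both colour and shape to the same variable yields a single legend
p <- ggplot(long, aes(thrd_cnt, runtime, colour = cores, shape = cores)) +
  geom_line() +
  geom_point(size = 3)
print(p)
```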

Heatmaps in R using ggplot function - how to cluster rows?

I am currently generating heatmaps in R using the ggplot function. In the code below, I first read the data into a data frame, remove any duplicate rows, convert the timestamp field to a factor, melt the data frame (by 'timestamp'), scale all variables between 0 and 1, and then plot the heatmap.
In the resulting heatmap, time is plotted on the x axis and each iostat-sda variable (see sample data below) is plotted along the y axis. Note: If you want to try out the R code, you can paste the sample data below into a file called iostat-sda.csv.
However, I really need to be able to cluster the rows within this heatmap. Does anyone know how this can be achieved using the ggplot function?
Any help would be very much appreciated!
############################## The code
library(ggplot2)
library(reshape2) # melt()
library(plyr) # ddply() and .()
library(scales) # rescale()
fileToAnalyse <- read.csv(file="iostat-sda.csv", header=TRUE, sep=",")
fileToAnalyse <- subset(fileToAnalyse, !duplicated(timestamp))
fileToAnalyse[,1] <- factor(fileToAnalyse[,1])
fileToAnalyse.m <- melt(fileToAnalyse, id=1)
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale = rescale(value)) # scales each variable between 0 and 1
base_size <- 9
# opts() and theme_blank() belong to older ggplot2 releases; current releases use theme() and element_blank()
ggplot(fileToAnalyse.s, aes(timestamp, variable)) +
geom_tile(aes(fill = rescale), colour = "black") +
scale_fill_gradient(low = "black", high = "white") +
theme_grey(base_size = base_size) + labs(x = "Time", y = "") +
opts(title = paste("Heatmap"), legend.position = "right", axis.text.x = theme_blank(), axis.ticks = theme_blank()) +
scale_y_discrete(expand = c(0, 0)) + scale_x_discrete(expand = c(0, 0))
########################## Sample data from iostat-sda.csv
timestamp,DSKRRQM,DSKWRQM,DSKR,DSKW,DSKRMB,DSKWMB,DSKARQS,DSKAQUS,DSKAWAIT,DSKSVCTM,DSKUtil
1319204905,0.33,0.98,10.35,2.37,0.72,0.02,120.00,0.01,0.40,0.31,0.39
1319204906,1.00,4841.00,682.00,489.00,60.09,40.68,176.23,2.91,2.42,0.50,59.00
1319204907,0.00,1600.00,293.00,192.00,32.64,13.89,196.45,5.48,10.76,2.04,99.00
1319204908,0.00,3309.00,1807.00,304.00,217.39,26.82,236.93,4.84,2.41,0.45,96.00
1319204909,0.00,5110.00,93.00,427.00,0.72,43.31,173.43,4.43,8.67,1.90,99.00
1319204910,0.00,6345.00,115.00,496.00,0.96,52.25,178.34,4.00,6.32,1.62,99.00
1319204911,0.00,6793.00,129.00,666.00,1.33,57.22,150.83,4.74,6.16,1.26,100.00
1319204912,0.00,6444.00,115.00,500.00,0.93,53.06,179.77,4.20,6.83,1.58,97.00
1319204913,0.00,1923.00,835.00,215.00,78.45,16.68,185.55,4.81,4.58,0.91,96.00
1319204914,0.00,0.00,788.00,0.00,83.51,0.00,217.04,0.45,0.57,0.25,20.00
1319204915,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204916,0.00,4.00,2.00,4.00,0.01,0.04,17.67,0.00,0.00,0.00,0.00
1319204917,0.00,8.00,4.00,8.00,0.02,0.09,17.83,0.00,0.00,0.00,0.00
1319204918,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204919,0.00,2.00,113.00,4.00,11.96,0.03,209.93,0.06,0.51,0.43,5.00
1319204920,0.00,59.00,147.00,54.00,11.15,0.63,120.02,0.04,0.20,0.15,3.00
1319204921,1.00,19.00,57.00,18.00,4.68,0.20,133.47,0.07,0.93,0.67,5.00
There is a nice package called NeatMap which simplifies generating heatmaps in ggplot2. Some of the row clustering methods include Multidimensional Scaling, PCA, or hierarchical clustering. Things to watch out for are:
Data to make.heatmap1 has to be in wide format
Data has to be a matrix, not a dataframe
Assign rownames to the wide-format matrix before generating the plot
I've changed your code slightly to avoid naming variables the same as existing functions (e.g. rescale):
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale.x = rescale(value) ) #scales each variable between 0 and 1
fileToAnalyse.w <- dcast(fileToAnalyse.s, timestamp ~ variable, value.var="rescale.x")
rownames(fileToAnalyse.w) <- as.character(fileToAnalyse.w[, 1])
ggheatmap <- make.heatmap1(as.matrix(fileToAnalyse.w[, -1]), row.method = "complete.linkage", row.metric="euclidean", column.cluster.method ="none", row.labels = rownames(fileToAnalyse.w)) +
scale_fill_gradient(low = "black", high = "white") + labs(x = "Time", y = "") + opts(title = paste("Heatmap"))
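If NeatMap is not available, one alternative (not the method of the answer above) is to cluster with base R's hclust() and reorder the factor levels of the row variable before calling geom_tile(). The sketch below is only an illustration under stated assumptions: a small random matrix stands in for the rescaled iostat data, and the names mat, row_order and long are made up.

```r
library(ggplot2)

# Stand-in for the rescaled data: 5 variables (rows) by 8 time points (columns)
set.seed(42)
mat <- matrix(runif(40), nrow = 5,
              dimnames = list(paste0("var", 1:5), paste0("t", 1:8)))

# Hierarchical clustering of the rows (complete linkage, Euclidean distance)
hc <- hclust(dist(mat, method = "euclidean"), method = "complete")
row_order <- rownames(mat)[hc$order]

# Long format for ggplot2; as.vector() flattens the matrix column by column,
# so row names repeat fastest and column names repeat in blocks
long <- data.frame(
  variable = factor(rep(rownames(mat), times = ncol(mat)), levels = row_order),
  time     = factor(rep(colnames(mat), each = nrow(mat)), levels = colnames(mat)),
  value    = as.vector(mat)
)

# The y-axis follows the factor levels, i.e. the dendrogram order
p <- ggplot(long, aes(time, variable, fill = value)) +
  geom_tile(colour = "black") +
  scale_fill_gradient(low = "black", high = "white") +
  labs(x = "Time", y = "", title = "Heatmap")
print(p)
```

This reorders the rows so that similar variables sit next to each other, but it does not draw the dendrogram itself.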
