Perfmon Data Plot. Scaling y axis with ggplot - r

I am currently working on a script that will take in Windows Perfmon Data, and plot graphs from this data, as I have found the PAL tool far too slow.
This is my first pass and is quite basic at the moment.
I am struggling with the scaling of the y axis. I am currently getting horrible graphs like this:
How can I scale the y axis appropriately so that there are reasonable breaks for data between 0 and 1 (e.g. 0.0000123, 0.12, 0.98, 0.00000024)?
I was hoping for something dynamic like:
scale_y_continuous(breaks = c(min(d[,i]), 0, max(d[,i])))
Error in Summary.factor(c(1L, 105L, 181L, 125L, 699L, 55L, 270L, 226L, :
min not meaningful for factors
Any help appreciated.
require(lattice)
require(ggplot2)
require(reshape2)
# Read in Perfmon -- MUST BE CSV
d <- read.table("~/R/RPerfmon.csv",header=TRUE,sep=",",dec=".",check.names=FALSE)
# Rename First Column to Time as this is standard in all Perfmon CSVs
colnames(d)[1]="Time"
# Convert Time Column into proper format
d$Time<-as.POSIXct(d$Time, format='%m/%d/%Y %H:%M:%S')
# Strip out The computer name from all Column Headers (Perfmon Counters)
# The regex matches a-zA-Z, underscores and dashes, may need to be expanded
colnames(d) <- sub("^\\\\\\\\[a-zA-Z_-]*\\\\", "", colnames(d))
colnames(d) <- sub("\\\\", "|", colnames(d))
colnames(d)
warnings()
pdf(paste("PerfmonPlot_",Sys.Date(),".pdf",sep=""))
for (i in 2:ncol(d)) {
p <- qplot(d[,"Time"],y=d[,i], data=d, xlab="Time",ylab="", main=colnames(d[i]))
p <- p + geom_hline()
p <- p + scale_y_continuous(breaks = c(min(d[,i]), 0, max(d[,i])))
print(p)
}
dev.off()

In order to get reasonable breaks between 0 and 1, you can for example use:
scale_y_continuous(breaks=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0))
A rewritten plot-part of your code:
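# qplot() drew points by default, so add them explicitly here
ggplot(d, aes(x=Time, y=d[,i])) +
geom_point() +
# current ggplot2 needs an explicit yintercept; 0 is assumed here
geom_hline(yintercept=0) +
scale_y_continuous(breaks=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0)) +
labs(title=colnames(d[i]), x="Time",y="")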
And a more dynamic way of setting the breaks:
scale_y_continuous(breaks=seq(from=round(min(d[,i]),1), to=round(max(d[,i]),1), by=0.1))
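However, when you look at the error message, you can see that the y-variables are factors. So you have to convert them to numeric first. Note that as.numeric applied directly to a factor returns the integer level codes, not the values, so use as.numeric(as.character(d[,i])), or read the file with stringsAsFactors=FALSE so the columns are never factors.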
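It may be worth checking what the conversion does before relying on it. The sketch below uses a made-up factor f (not the Perfmon data) to show the difference between the level codes and the printed values:

```r
# Hypothetical factor standing in for a Perfmon column that was read as text
f <- factor(c("0.12", "0.98", "0.0000123"))

# as.numeric() on a factor gives the integer level codes, not the values
codes <- as.numeric(f)

# Going through as.character() recovers the numbers that were printed
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

If the codes rather than the values end up on the y axis, the plot still renders, which makes this mistake easy to miss.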

Here is the code I ended up with after a bit of playing in case anyone wants to be able to do the same:
The key to making it dynamic was the following (note the as.numeric to avoid any errors)
ynumeric <- as.numeric(d[,i])
ymin <- min(ynumeric,na.rm = TRUE)
ymax <- max(ynumeric,na.rm = TRUE)
#generate sequence of 10
ybreaks <- seq(ymin, ymax, length.out = 10)
#Then passing this to the y_continuous function
p <- p + scale_y_continuous(breaks=c(ybreaks))
I hope to expand this in the future to be somewhere in the region of PAL's complexity, but using R for efficiency.
require(lattice)
require(ggplot2)
require(reshape2)
# Read in Perfmon -- MUST BE CSV
d <- read.table("~/R/RPerfmon.csv",header=TRUE,sep=",",dec=".",check.names=FALSE,stringsAsFactors=FALSE)
# Rename First Column to Time as this is standard in all Perfmon CSVs
colnames(d)[1]="Time"
# Convert Time Column into proper format
d$Time<-as.POSIXct(d$Time, format='%m/%d/%Y %H:%M:%S')
# Strip out The computer name from all Column Headers (Perfmon Counters)
# The regex matches a-zA-Z, underscores and dashes, may need to be expanded
colnames(d) <- sub("^\\\\\\\\[a-zA-Z_-]*\\\\", "", colnames(d))
colnames(d) <- sub("\\\\", "|", colnames(d))
colnames(d)
warnings()
pdf(paste("PerfmonPlotData_",Sys.Date(),".pdf",sep=""))
for (i in 2:ncol(d)) {
ynumeric <- as.numeric(d[,i])
ymin <- min(ynumeric,na.rm = TRUE)
ymax <- max(ynumeric,na.rm = TRUE)
#generate sequence of 10
ybreaks <- seq(ymin, ymax, length.out = 10)
print(ybreaks)
print(paste(ymin,ymax))
p <- qplot(d[,"Time"],y=ynumeric, data=d, xlab="Time",ylab="", main=colnames(d[i]))
p <- p + geom_smooth(size=3,se=TRUE) + theme_bw()
p <- p + scale_y_continuous(breaks=c(ybreaks))
print(p)
}
dev.off()
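The breaks from seq(ymin, ymax, length.out = 10) are evenly spaced but usually land on awkward numbers. One possible alternative is base R's pretty(), which picks round values covering the range. A sketch with made-up counter values (y is hypothetical, not taken from the CSV):

```r
# Hypothetical counter values between 0 and 1
y <- c(0.0000123, 0.12, 0.98, 0.00000024)

ymin <- min(y, na.rm = TRUE)
ymax <- max(y, na.rm = TRUE)

# Evenly spaced breaks, as in the script above
ybreaks_seq <- seq(ymin, ymax, length.out = 10)

# Round-number breaks covering the same range; n is only a target count
ybreaks_pretty <- pretty(c(ymin, ymax), n = 10)

print(ybreaks_seq)
print(ybreaks_pretty)
```

Either vector can be passed to scale_y_continuous(breaks = ...); which one reads better depends on the counter.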

Related

Missing ticks in custom axis transformation

When plotting the ratio between two variables, their relative order is often of no concern, yet depending on which variable is in the numerator, its relative size is constrained either to (0,1) or (1, Inf), which is somewhat unintuitive and breaks symmetry. I want to plot ratios "symmetrically", without resorting to symmetric log-scale, by having a y-axis that goes like 1/4, 1/3, 1/2, 1, 2, 3, 4 or, equivalently, 4^-1, 3^-1, 2^-1, 1, 2, 3, 4 in regular intervals. I've come up with the following:
symmult <- function(x){
isf <- is.finite(x) & (x>0)
xf <- x[isf]
xf <- ifelse(xf>=1,
xf-1,
1-(1/xf))
x[isf] <- xf
x[!isf] <- NA
x[!is.finite(x)] <- NA
return(x)
}
symmultinv <- function(x){
isf <- is.finite(x)
xf <- x[isf]
xf <- ifelse(x[isf]>=0,
x[isf]+1,
-1/(x[isf]-1))
x[isf] <- xf
x[!isf] <- NA
x[!is.finite(x)] <- NA
return(x)
}
sym_mult_trans = function(){trans_new("sym_mult", symmult, symmultinv )}
x <- c(-4:-2, 1:4)
x[x<1] <- 1/abs(x[x<1])
ggplot() +
geom_point(aes(x=x, y=x)) +
scale_y_continuous(trans="sym_mult")
The transformation works, but I cannot get the axis labels etc. to work for any 0<x<1, without setting them manually. Any help would be greatly appreciated.
You can create bespoke 'breaks' and 'format' functions that you can use inside trans_new (or pass to scale_y_continuous directly via its breaks and labels parameters).
For the breaks function, remember it will take as input a length-two numeric vector representing the range of the y axis. You must then convert this to a number of appropriate breaks. Here, if the minimum of the range is less than one, we take its reciprocal, find the pretty breaks between one and that number, then take the reciprocal of the output. We concatenate that onto pretty breaks between 1 and our range maximum:
# Define breaks function
symmult_breaks <- function(x) {
c(1 / extended_breaks(5)(c(1/x[x < 1], 1)),
extended_breaks(5)(c(1, x[x >= 1])))
}
For the labelling function, remember, it needs to take as input the vector of numbers produced by our breaks function. We can paste a 1/ in front of the reciprocal of numbers less than one, but leave numbers of 1 or more unaltered:
# Define labelling function
symmult_labs <- function(x) {
labs <- character(length(x))
labs[x >= 1] <- as.character(x[x >= 1])
labs[x < 1] <- paste("1", as.character(1/x[x < 1]), sep = "/")
labs
}
So your full new transformation becomes:
# Use our four functions to define the whole transformation:
sym_mult_trans <- function() {
trans_new(name = "sym_mult",
transform = symmult,
inverse = symmultinv,
breaks = symmult_breaks,
format = symmult_labs)
}
And your plot becomes:
ggplot() +
geom_point(aes(x = x, y = x)) +
scale_y_continuous(trans = "sym_mult")
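As a sanity check that needs only base R, the sketch below condenses the transform and its inverse from the question (fwd and inv are hypothetical names, and the NA handling is left out for brevity). It checks that reciprocal ratios land symmetrically around zero, that the inverse recovers them, and that the labelling function prints values below one as fractions:

```r
# Condensed forward transform: ratios >= 1 shift down by 1,
# ratios in (0, 1) map to 1 - 1/x, so 1/2 -> -1 and 1/4 -> -3
fwd <- function(x) ifelse(x >= 1, x - 1, 1 - 1 / x)

# Condensed inverse: undoes fwd on both sides of zero
inv <- function(y) ifelse(y >= 0, y + 1, -1 / (y - 1))

# Labelling function as defined in the answer above
symmult_labs <- function(x) {
  labs <- character(length(x))
  labs[x >= 1] <- as.character(x[x >= 1])
  labs[x < 1] <- paste("1", as.character(1 / x[x < 1]), sep = "/")
  labs
}

ratios <- c(1/4, 1/2, 1, 2, 4)
transformed <- fwd(ratios)
recovered <- inv(transformed)
labels <- symmult_labs(ratios)

print(transformed)
print(labels)
```

Ratios whose reciprocal is not exactly representable may print with extra digits, so a real labelling function might want to round before pasting.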

How can one control the number of axis ticks within `facet_wrap()`?

I have a figure created with facet_wrap visualizing the estimated density of many groups. Some of the groups have a much smaller variance than others. This leads to the x axis not being readable for some panels. Minimal reproducible example:
library(tidyverse)
x1 <- rnorm(1e4)
x2 <- rnorm(1e4,mean=2,sd=0.00001)
data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group,scales="free")
The obvious solution to the problem is to increase the figure size, so that everything becomes readable. However, there are too many panels to make this a useful solution. My favourite solution would be to control the number of axis ticks, for example allow for only two ticks on all x-axes. Is there a way to accomplish this?
Edit after suggestions:
Adding + scale_x_continuous(n.breaks = 2) looks like it should exactly do what I want, but it actually does not:
Following the answer in the suggested question Change the number of breaks using facet_grid in ggplot2, I end up with two axis ticks, but undesirably many decimal points:
equal_breaks <- function(n = 3, s = 0.5, ...){
function(x){
# rescaling
d <- s * diff(range(x)) / (1+2*s)
seq(min(x)+d, max(x)-d, length=n)
}
}
data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group,scales="free") + scale_x_continuous(breaks=equal_breaks(n=3, s=0.05), expand = c(0.05, 0))
You can add if(seq[2]-seq[1] < 10^(-r)) seq else round(seq, r) to the function equal_breaks developed here.
By doing so, you will round your labels on the x-axis only if the difference between them is above a threshold 10^(-r).
equal_breaks <- function(n = 3, s = 0.05, r = 0,...){
function(x){
d <- s * diff(range(x)) / (1+2*s)
seq = seq(min(x)+d, max(x)-d, length=n)
if(seq[2]-seq[1] < 10^(-r)) seq else round(seq, r)
}
}
data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group, scales="free") +
scale_x_continuous(breaks=equal_breaks(n=3, s=0.05, r=0))
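To see what the threshold does without drawing anything, the breaks function can be called directly on an axis range. This is a sketch: the two ranges are made up to resemble a wide and a very narrow panel, and the inner variable is renamed brk so it does not shadow seq():

```r
equal_breaks <- function(n = 3, s = 0.05, r = 0, ...) {
  function(x) {
    d <- s * diff(range(x)) / (1 + 2 * s)
    brk <- seq(min(x) + d, max(x) - d, length.out = n)
    # Round only when the spacing between breaks is at least 10^(-r)
    if (brk[2] - brk[1] < 10^(-r)) brk else round(brk, r)
  }
}

# Hypothetical axis ranges: one wide panel, one very narrow panel
wide   <- equal_breaks(n = 3, s = 0.05, r = 0)(c(-4, 4))
narrow <- equal_breaks(n = 3, s = 0.05, r = 0)(c(1.99996, 2.00004))

print(wide)
print(narrow)
```

The wide range is rounded to whole numbers, while the narrow one is returned unrounded, which is why its labels keep many decimals.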
As you rightly pointed out, this answer gives only two alternatives for the number of digits; another possibility is to return round(seq, -floor(log10(abs(seq[2]-seq[1])))), which gets the "optimal" number of digits for every facet.
equal_breaks <- function(n = 3, s = 0.1,...){
function(x){
d <- s * diff(range(x)) / (1+2*s)
seq = seq(min(x)+d, max(x)-d, length=n)
round(seq, -floor(log10(abs(seq[2]-seq[1]))))
}
}
# x3 is a third group that this answer does not define, e.g. x3 <- rnorm(1e4,mean=2,sd=0.01)
data.frame(x=c(x1,x2,x3),group=c(rep("1",length(x1)),rep("2",length(x2)),rep("3",length(x3)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group, scales="free") +
scale_x_continuous(breaks=equal_breaks(n=3, s=0.1))
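The expression -floor(log10(abs(step))) is the number of decimal places needed to keep the first significant digit of the spacing between breaks. A small sketch with made-up spacings (digits_for_step is a hypothetical helper name, not part of the answer):

```r
# Hypothetical helper: decimal places that keep the first
# significant digit of the spacing between two breaks
digits_for_step <- function(step) -floor(log10(abs(step)))

d_wide   <- digits_for_step(3.6)      # spacing in a wide panel
d_medium <- digits_for_step(0.5)
d_narrow <- digits_for_step(0.01234)  # spacing in a narrow panel
d_tens   <- digits_for_step(25)       # negative: round to tens

# Applying the narrow-panel digits to some made-up break positions
rounded <- round(c(1.99123, 2.00357, 2.01591), d_narrow)

print(c(d_wide, d_medium, d_narrow, d_tens))
print(rounded)
```

Because round() accepts negative digits, the same expression also handles panels whose breaks are tens or hundreds apart.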
Thanks so much for so many helpful suggestions and great answers! I figured out a solution that works for arbitrarily complex datasets (at least I hope so) by modifying the approach by @Maël and borrowing the great function by RHertel from Count leading zeros between the decimal point and first nonzero digit.
Rounding to the first significant decimal point leads to highly asymmetric ticks in some cases, therefore I rounded to the second significant decimal point.
library(tidyverse)
x1 <- rnorm(1e4)
x2 <- rnorm(1e4,mean=2,sd=0.000001)
x3 <- rnorm(1e4,mean=2,sd=0.01)
zeros_after_period <- function(x) {
if (isTRUE(all.equal(round(x),x))) return (0) # y would be -Inf for integer values
y <- log10(abs(x)-floor(abs(x)))
ifelse(isTRUE(all.equal(round(y),y)), -y-1, -ceiling(y))} # corrects case ending with ..01
equal_breaks <- function(n,s){
function(x){
x=x*10000
d <- s * diff(range(x)) / (1+2*s)
seq = seq(min(x)+d, max(x)-d, length=n) / 10000
round(seq,zeros_after_period(seq[2]-seq[1])+2)
}
}
data.frame(x=c(x1,x2,x3),group=c(rep("1",length(x1)),rep("2",length(x2)),rep("3",length(x3)))) %>%
ggplot(.) + geom_density(aes(x=x)) + facet_wrap(~group, scales="free") +
scale_x_continuous(breaks=equal_breaks(n=2, s=0.1))
Apologies for answering my own question ... but that would not have been possible without the great help from the community :-)
One option to achieve your desired result would be to use a custom breaks and limits function which builds on scales::breaks_extended to first get pretty breaks for the range and then makes use of seq to get the desired number of breaks. However, depending on the desired number of breaks this simple approach will not ensure that we end up with pretty breaks:
library(ggplot2)
set.seed(123)
x1 <- rnorm(1e4)
x2 <- rnorm(1e4,mean=2,sd=0.00001)
mylimits <- function(x) range(scales::breaks_extended()(x))
mybreaks <- function(n = 3) {
function(x) {
breaks <- mylimits(x)
seq(breaks[1], breaks[2], length.out = n)
}
}
d <- data.frame(x=c(x1,x2),group=c(rep("1",length(x1)),rep("2",length(x2))))
ggplot(d) +
geom_density(aes(x=x)) +
scale_x_continuous(breaks = mybreaks(n = 3), limits = mylimits) +
facet_wrap(~group,scales="free")

How to plot pie chart in R from a table with relative Frequency?

I am brand new to R. I need to plot a pie graph. Now I have tried my best but it's not generating a pie chart for me. Below is my code.
socialIssue <- matrix(c(245,112,153,71,133,306),ncol=1,byrow=T)
rownames(socialIssue) <- c("Housing","Transportation","Health Care","Education","Food","Other")
colnames(socialIssue) <- c("Frequency")
socialIssue <- as.table(socialIssue)
socialIssue/sum(socialIssue)
cols <- rainbow(nrow(socialIssue))
pie(socialIssue$Frequency, labels=paste0(round(socialIssue$Frequency/sum(socialIssue$Frequency)*100,2),"%"),colnames=cols)
This is the following output. The frequency outputted is correct.
> socialIssue <- matrix(c(245,112,153,71,133,306),ncol=1,byrow=T)
> rownames(socialIssue) <- c("Housing","Transportation","Health Care","Education","Food","Other")
> colnames(socialIssue) <- c("Frequency")
> socialIssue <- as.table(socialIssue)
> socialIssue/sum(socialIssue)
               Frequency
Housing        0.24019608
Transportation 0.10980392
Health Care    0.15000000
Education      0.06960784
Food           0.13039216
Other          0.30000000
>
> cols <- rainbow(nrow(socialIssue))
> pie(socialIssue$Frequency, labels=paste0(round(socialIssue$Frequency/sum(socialIssue$Frequency)*100,2),"%"),colnames=cols)
Error in socialIssue$Frequency : $ operator is invalid for atomic vectors
Convert to dataframe and then plot
socialIssue = as.data.frame(socialIssue)
socialIssue$percent = round(100*socialIssue$Freq/sum(socialIssue$Freq), digits = 1)
socialIssue$label = paste(socialIssue$Var1," (", socialIssue$percent,"%)", sep = "")
pie(socialIssue$Freq, labels = socialIssue$label, col = cols)
This does it:
pie(socialIssue[, 1],
labels = paste0(round(socialIssue[, 1] / sum(socialIssue[, 1]) * 100, 2), "%"))
Because you have a matrix, not a data frame.
prop.table takes care of the % calculation - sprintf deals with the formatting of the number values so you have consistent decimal places.
All of the conversion code isn't required either:
socialIssue <- matrix(c(245,112,153,71,133,306),ncol=1,byrow=T)
pie(socialIssue, labels=sprintf("%.2f%%", prop.table(socialIssue)*100))
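The label computation can be checked on its own before plotting. In this sketch the counts are the ones from the question, stored in a named vector freq (a hypothetical name), and the pie() call is only run in an interactive session:

```r
# Counts from the question, as a named numeric vector
freq <- c("Housing" = 245, "Transportation" = 112, "Health Care" = 153,
          "Education" = 71, "Food" = 133, "Other" = 306)

# prop.table() turns counts into shares that sum to 1;
# sprintf() fixes the number of decimals and appends a literal %
shares <- prop.table(freq)
pct_labels <- sprintf("%.2f%%", shares * 100)

print(pct_labels)

# Draw the chart only when a graphics device is expected
if (interactive()) {
  pie(freq, labels = pct_labels, col = rainbow(length(freq)))
}
```

Using sprintf rather than round keeps trailing zeros, so 15 prints as 15.00% and the labels line up.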
With base R, with the colors you used (the parameter name should be col, not colnames), and with a legend added:
pie(socialIssue[,1], labels=paste0(round(socialIssue/sum(socialIssue)*100,2),"%"),col=cols)
legend('bottomright', legend=rownames(socialIssue), fill=cols, bty='n')
or with ggplot2
socialIssue <- matrix(c(245,112,153,71,133,306),ncol=1,byrow=T)
rownames(socialIssue) <- c("Housing","Transportation","Health Care","Education","Food","Other")
colnames(socialIssue) <- c("Frequency")
library(ggplot2)
library(scales)
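# One row per issue, so that fill maps to the issue name and the legend labels match
df <- data.frame(Issue=rownames(socialIssue), Frequency=socialIssue[,1])
cols <- rainbow(nrow(df))
# Plot shares rather than raw counts so the percent labels are meaningful
ggplot(df, aes(x='', y=Frequency/sum(Frequency), fill=Issue)) +
geom_bar(width=1, stat='identity') +
scale_fill_manual(values=cols) +
scale_y_continuous(labels=percent) +
coord_polar(theta = "y") + theme_bw()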

R ggplot2 boxplot from 10 files

I have 4 files each called 0_X_cell.csv, 0_S_cell.csv and 15_X_cell.csv, 15_S_cell.csv of the format:
p U:0 U:1 U:2 Tracer Tracer_0 U_0:0
-34.014 0.15268 -3.7907 -0.20155 10.081 10.032 0.12454
-33.836 0.07349 -2.1457 -0.30531 27.706 27.278 0.076542
I'd like to create boxplots out of the values for Tracer/3600 and put them on the same graph using ggplot2 but I'm finding it not quite so straightforward. Any suggestions would be much appreciated:
I'm thinking it might be something like this:
Import data from all files into separate variables:
Extract Tracer from each one and put into a data.frame
Plot the boxplots of every column Tracer/3600. But each column will be called Tracer...
What would the correct procedure be?
Here's one way to do it (if I understood you correctly):
`0_X_cell.csv` <- `0_S_cell.csv` <- `15_X_cell.csv` <- `15_S_cell.csv` <- read.table(header=T, text="
p U:0 U:1 U:2 Tracer Tracer_0 U_0:0
-34.014 0.15268 -3.7907 -0.20155 10.081 10.032 0.12454
-33.836 0.07349 -2.1457 -0.30531 27.706 27.278 0.076542")
lst <- mget(grep("cell.csv", ls(), fixed=TRUE, value=TRUE))
df <- stack(lapply(lapply(lst, "[", "Tracer"), unlist))
df$ind <- sub("^(\\d+_[A-Z]).*$", "\\1", df$ind)
library(ggplot2)
ggplot(df, aes(ind, values/3600)) + geom_boxplot()
To read in the data from your dir:
z <- list.files(pattern = ".*cell\\.csv$")
z <- lapply(1:length(z), function(x) {chars <- strsplit(z[x], "_");
cbind(data.frame(Tracer = read.csv(z[x])$Tracer), time = chars[[1]][1], treatment = chars[[1]][2])})
z <- do.call(rbind, z)
Then plot it:
library(ggplot2)
ggplot(z, aes(y = Tracer/3600, x = factor(time))) +geom_boxplot(aes(fill = factor(treatment))) + ylab("Tracer")
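The grouping variables in both answers come from the file names. A minimal sketch of that parsing step on its own, assuming the names always follow the time_treatment_cell.csv pattern from the question (no files are read here):

```r
# File names as given in the question
files <- c("0_X_cell.csv", "0_S_cell.csv", "15_X_cell.csv", "15_S_cell.csv")

# Split each name on "_" and bind the pieces into a character matrix
parts <- do.call(rbind, strsplit(files, "_", fixed = TRUE))

# First piece is the time, second piece is the treatment
meta <- data.frame(file = files,
                   time = as.integer(parts[, 1]),
                   treatment = parts[, 2],
                   stringsAsFactors = FALSE)

print(meta)
```

Storing time as an integer keeps the boxes in numeric order (0 before 15) once it is wrapped in factor() for plotting.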

How to extract the endpoints of an interval in R?

I've searched, but I cannot find an answer. I want to further process the data of a plot I've created in R with geom_bin2d. I've extracted the bins (intervals) from such a plot using
> library(ggplot2)
> my_plot <- ggplot(diamonds, aes(x = x, y = y))+ geom_bin2d(bins=3)
> plot_data <- ggplot_build(my_plot)
> data <- plot_data$data[[1]]
> data$xbin[[1]]
[1] [0,3.58]
Levels: [0,3.58] (3.58,7.16] (7.16,10.7] (10.7,14.3]
Nothing I tried worked, including min and mean. How do I access the endpoints of such an interval like data$xbin[[1]]?
(Update: I turned the example into a complete test case based on a built-in data set.)
Something like
library(stringr)
x <- cut(seq(1:5), breaks = 2)
as.numeric(unlist(str_extract_all(as.character(x[1]), "\\d+\\.*\\d*")))
or in your example
my_plot <- ggplot(diamonds, aes(x = x, y = y))+ geom_bin2d(bins=3)
plot_data <- ggplot_build(my_plot)
data <- plot_data$data[[1]]
x <- data$xbin[[1]]
as.numeric(unlist(str_extract_all(as.character(x), "\\d+\\.*\\d*")))[2]
3.58
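A base R variant of the same idea, without stringr, might strip the brackets and split on the comma, which also copes with negative endpoints. This is a sketch: interval_endpoints is a hypothetical helper and the labels are made up in the style that cut() produces:

```r
# Hypothetical helper: parse labels such as "(3.58,7.16]" into numbers
interval_endpoints <- function(x) {
  # Drop the opening and closing brackets or parentheses
  stripped <- gsub("[][()]", "", as.character(x))
  # Split "lower,upper" and convert both halves to numeric
  parts <- strsplit(stripped, ",", fixed = TRUE)
  out <- t(vapply(parts, as.numeric, numeric(2)))
  colnames(out) <- c("lower", "upper")
  out
}

# Made-up interval labels in the style of cut() output
bins <- factor(c("[0,3.58]", "(3.58,7.16]", "(-1.5,2]"))
ep <- interval_endpoints(bins)

print(ep)
```

The values recovered this way are the rounded numbers printed in the labels, not necessarily the exact bin edges used internally.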
