How to extract the endpoints of an interval in R?

I've searched, but I cannot find an answer. I want to further process the data of a plot I've created in R with geom_bin2d. I've extracted the bins (intervals) from such a plot using
> library(ggplot2)
> my_plot <- ggplot(diamonds, aes(x = x, y = y))+ geom_bin2d(bins=3)
> plot_data <- ggplot_build(my_plot)
> data <- plot_data$data[[1]]
> data$xbin[[1]]
[1] [0,3.58]
Levels: [0,3.58] (3.58,7.16] (7.16,10.7] (10.7,14.3]
Nothing I tried worked, including min and mean. How do I access the endpoints of such an interval like data$xbin[[1]]?
(Update: I turned the example into a complete test case based on a built-in data set.)

Something like
library(stringr)
x <- cut(1:5, breaks = 2)
as.numeric(unlist(str_extract_all(as.character(x[1]), "\\d+\\.*\\d*")))
or in your example:
my_plot <- ggplot(diamonds, aes(x = x, y = y))+ geom_bin2d(bins=3)
plot_data <- ggplot_build(my_plot)
data <- plot_data$data[[1]]
x <- data$xbin[[1]]
as.numeric(unlist(str_extract_all(as.character(x), "\\d+\\.*\\d*")))[2]
3.58
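For completeness, here is a sketch of a small base-R helper along the same lines (not from the original answer): given a factor of cut()-style interval labels it returns both endpoints per bin, and its pattern also allows a leading minus sign, which the one above does not:

interval_endpoints <- function(f) {
  lab <- as.character(f)
  # pull the two numbers out of labels such as "(3.58,7.16]"
  nums <- regmatches(lab, gregexpr("-?\\d+\\.?\\d*", lab))
  out <- t(vapply(nums, as.numeric, numeric(2)))
  colnames(out) <- c("lower", "upper")
  out
}

interval_endpoints(data$xbin[1:2])  # endpoints of the first two bins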

Related

Making Multiple Plots from a Function call in RStudio

I am relatively new to R. My question results from a project in an online learning course. I am using RStudio to make multiple plots from a function call. I want a new plot for each column, with that column on the y-axis and month on the x-axis. The function works when displaying a single variable. However, when I try the function call using multiple columns I receive:
"Error: More than one expression parsed"
Similar code worked in the online program's simulated platform.
I have provided my code below with a small sample from the data frame. Is it possible to derive multiple plots in this way? If so, how can I update or correct my code to make a plot for each column?
month <- c('mar', 'oct', 'oct')
day <- c('fri', 'tue', 'sat')
FFMC <- c(86.2, 90.6, 90.6)
DMC <- c(26.2, 35.4, 43.7)
DC <- c(94.3, 669.1, 686.9)
ISI <- c(5.1, 6.7, 6.7)
temp <- c(8.2, 18.0, 14.6)
RH <- c(51, 33, 33)
wind <- c(6.7, 0.9, 1.3)
rain <- c(0.0, 0.0, 0.0)
forestfires_df <- data.frame(month, day, FFMC, DMC, DC, ISI, temp, RH, wind, rain)
library(ggplot2)
library(purrr)
month_box <- function(x, y) {
  ggplot(data = forestfires_df, aes_string(x = month, y = y_var)) +
    geom_boxplot() +
    theme_bw()
}
month <- names(forestfires_df)[1]
y_var <- names(forestfires_df)[3:10]
month_plots <- map2(month, y_var, month_box)
#After running month_plots I receive "Error: More than one expression parsed"
The issue is that the arguments used inside the function body should match the function's formal arguments:
month_box <- function(x, y) {
  ggplot(data = forestfires_df, aes_string(x = x, y = y)) +
    geom_boxplot() +
    theme_bw()
}
If we use 'month' and 'y_var' directly inside the function, 'y_var' (which has length 8) is never iterated over; its length is the very reason we loop with map2 in the first place. With the change, map2 works as expected:
map2(month, y_var, month_box)
Or using an anonymous function:
map2(month, y_var, ~ month_box(.x, .y))
Like I mentioned in a comment, aes_string has been soft-deprecated in favor of using tidyeval to write ggplot2 functions. You can rewrite your function as a simple tidyeval-based one, then map over the columns of interest passing bare column names or their positions the way you would with most other tidyverse functions.
There are a couple ways to write a function like this. The older way is with quosures and unquoting columns, but its syntax can be confusing. dplyr comes with a very in-depth vignette, but I like this blog post as a quick guide.
month_box_quo <- function(x, y) {
  x_var <- enquo(x)
  y_var <- enquo(y)
  ggplot(forestfires_df, aes(x = !!x_var, y = !!y_var)) +
    geom_boxplot()
}
A single call looks like this, with bare column names:
month_box_quo(x = month, y = DMC)
Or with map_at and column positions (or with vars()):
# mapped over variables of interest; assumes y gets the mapped-over column
map_at(forestfires_df, 3:10, month_box_quo, x = month)
# or with formula shorthand
map_at(forestfires_df, 3:10, ~month_box_quo(x = month, y = .))
The newer tidyeval syntax ({{}}, or curly-curly) is easier to follow, and returns the same list of plots as above.
month_box_curly <- function(x, y) {
  ggplot(forestfires_df, aes(x = {{ x }}, y = {{ y }})) +
    geom_boxplot()
}
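For example (a sketch that is not part of the original answer), the curly-curly version can be called with bare column names, or mapped over the y columns by name via the .data pronoun:

library(purrr)

# a single call with bare column names
month_box_curly(x = month, y = DMC)

# sketch: one plot per y column, looked up by name through .data
y_cols <- names(forestfires_df)[3:10]
month_plots <- map(y_cols, function(col) {
  month_box_curly(x = month, y = .data[[col]])
})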

Pass variables as parameters to plot_ly function

I would like to create a function that creates different kinds of plotly plots based on the parameters that are passed into it. If I create the following data
library(plotly)
#### test data
lead <- rep("Fred Smith", 30)
lead <- append(lead, rep("Terry Jones", 30))
lead <- append(lead, rep("Henry Sarduci", 30))
proj_date <- seq(as.Date('2017-11-01'), as.Date('2017-11-30'), by = 'day')
proj_date <- append(proj_date, rep(proj_date, 2))
set.seed(1237)
actHrs <- runif(90, 1, 100)
cummActHrs <- cumsum(actHrs)
forHrs <- runif(90, 1, 100)
cummForHrs <- cumsum(forHrs)
df <- data.frame(Lead = lead, date_seq = proj_date,
                 cActHrs = cummActHrs,
                 cForHrs = cummForHrs)
I could plot it using:
plot_ly(data = df, x = ~date_seq, y = ~cActHrs, split = ~Lead)
If I made a makePlot function like the one shown below, how would I make it do something like this:
makePlot <- function(plot_data = df, x_var = date_seq, y_var, split_var) {
  plot <- plot_ly(data = df, x = ~x_var, y = ~y_var, split = ~split_var)
  return(plot)
}
?
Is there a function I can wrap x_var, y_var, and split_var with so that plotly will recognize them as x, y, and split parameters?
Eventually got around to figuring this out and hope this little follow-up takes some of the mystery out of these types of tasks. Although this question is focused on plotting, it's important to first build an understanding of how the functions in various R packages (e.g. dplyr and plotly) evaluate expressions and how to manipulate the way those expressions are evaluated. A great reference for building this understanding is Hadley's article on programming with dplyr.
Once that's under your belt, this turns out to be pretty easy. The trick is to simply pass your variable arguments like you do when you call dplyr functions and be sure to quote those parameters inside your plotting function. For the question above, this function worked for me:
makePlot <- function(plot_data = df, x_var, y_var, split_var,
                     type_var = "scatter",
                     mode_var = "lines+markers") {
  quo_x <- enquo(x_var)
  quo_y <- enquo(y_var)
  quo_split <- enquo(split_var)
  # print(c(quo_x, quo_y, quo_split))
  plot <- plot_ly(data = plot_data, x = quo_x, y = quo_y, split = quo_split,
                  type = type_var, mode = mode_var)
  return(plot)
}
# using df created in question, pass col's as args like dplyr functions
p1 <- makePlot(df, date_seq, cActHrs, Lead)
p2 <- makePlot(df, date_seq, cForHrs, Lead)
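If you would rather pass the column names as strings, a possible variation (a sketch, not part of the original answer; makePlotStr is just an illustrative name) builds the ~-formulas that plot_ly expects with as.formula():

makePlotStr <- function(plot_data = df, x_var, y_var, split_var,
                        type_var = "scatter",
                        mode_var = "lines+markers") {
  plot_ly(data = plot_data,
          x = as.formula(paste0("~", x_var)),
          y = as.formula(paste0("~", y_var)),
          split = as.formula(paste0("~", split_var)),
          type = type_var, mode = mode_var)
}

p3 <- makePlotStr(df, "date_seq", "cActHrs", "Lead")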

ggplotly and geom_area: display information when hovering over an area (not a point)

When it comes to a ggplotly graph, it's easy to display information when hovering over a specific point. This code does the job:
library(ggplot2)
library(plotly)
toy_df <- data.frame("t" = c(seq(1, 10), seq(1, 10)),
                     "value" = c(runif(10, 0, 10), 2 * runif(10, 0, 10)),
                     "event" = c(rep("A", 10), rep("B", 10)))
p <- ggplot() + geom_area(aes(y = value, x = t, fill=event), data = toy_df)
ggplotly(p)
But I would like to display information when hovering over one of the areas, because in my case each area is an event that I want to be able to describe in more depth.
Polygons in ggplot2 (geom_polygon) provide a possible solution.
Below you can find some rather rough code that should clarify the main idea:
library(ggplot2)
library(plotly)
library(dplyr)  # for %>%, group_by() and mutate() used below
set.seed(1)
toy_df <- data.frame("t" = c(seq(1, 10), seq(1, 10)),
                     "value" = c(runif(10, 0, 10), 2 * runif(10, 0, 10)),
                     "event" = c(rep("A", 10), rep("B", 10)))
# In order to create polygons like those drawn by geom_area,
# two points on the x-axis must be added: one at t=1 and one at t=10
toy_df2 <- toy_df[NULL, ]
for (k in unique(toy_df$event)) {
  subdf <- subset(toy_df, toy_df$event == k)
  nr <- nrow(subdf)
  row1 <- subdf[1, ]
  row1$value <- 0
  row2 <- subdf[nr, ]
  row2$value <- 0
  toy_df2 <- rbind(toy_df2, row1, subdf, row2)
}
# Stack the polygons: event A is drawn on top of event B,
# so add B's values to A's
toy_df2$value[toy_df2$event == "A"] <- toy_df2$value[toy_df2$event == "A"] +
  toy_df2$value[toy_df2$event == "B"]
# Calculate mean values for the two events: they will be displayed in the tooltip
toy_df2 <- toy_df2 %>% group_by(event) %>% mutate(mn=round(mean(value),3))
p <- ggplot(data = toy_df2, aes(y = value, x = t, fill = event,
                                text = paste0("Value:", mn, "<br>Event:", event))) +
  geom_polygon()
ggplotly(p, tooltip="text")

R ggplot2 boxplot from 10 files

I have 4 files, called 0_X_cell.csv, 0_S_cell.csv, 15_X_cell.csv and 15_S_cell.csv, of the format:
p U:0 U:1 U:2 Tracer Tracer_0 U_0:0
-34.014 0.15268 -3.7907 -0.20155 10.081 10.032 0.12454
-33.836 0.07349 -2.1457 -0.30531 27.706 27.278 0.076542
I'd like to create boxplots out of the values for Tracer/3600 and put them on the same graph using ggplot2, but I'm finding it not quite so straightforward. Any suggestions would be much appreciated.
I'm thinking it might be something like this:
1. Import data from all files into separate variables.
2. Extract Tracer from each one and put it into a data.frame.
3. Plot the boxplots of every column Tracer/3600. But each column will be called Tracer...
What would the correct procedure be?
Here's one way to do it (if I understood you correctly):
`0_X_cell.csv` <- `0_S_cell.csv` <- `15_X_cell.csv` <- `15_S_cell.csv` <- read.table(header=T, text="
p U:0 U:1 U:2 Tracer Tracer_0 U_0:0
-34.014 0.15268 -3.7907 -0.20155 10.081 10.032 0.12454
-33.836 0.07349 -2.1457 -0.30531 27.706 27.278 0.076542")
lst <- mget(grep("cell.csv", ls(), fixed=TRUE, value=TRUE))
df <- stack(lapply(lapply(lst, "[", "Tracer"), unlist))
df$ind <- sub("^(\\d+_[A-Z]).*$", "\\1", df$ind)
library(ggplot2)
ggplot(df, aes(ind, values/3600)) + geom_boxplot()
To read in the data from your dir:
z <- list.files(pattern = ".*cell\\.csv$")
z <- lapply(seq_along(z), function(x) {
  chars <- strsplit(z[x], "_")
  cbind(data.frame(Tracer = read.csv(z[x])$Tracer),
        time = chars[[1]][1], treatment = chars[[1]][2])
})
z <- do.call(rbind, z)
Then plot it:
library(ggplot2)
ggplot(z, aes(y = Tracer/3600, x = factor(time))) +
  geom_boxplot(aes(fill = factor(treatment))) +
  ylab("Tracer")

Perfmon Data Plot. Scaling y axis with ggplot

I am currently working on a script that will take in Windows Perfmon Data, and plot graphs from this data, as I have found the PAL tool far too slow.
This is my first pass and is quite basic at the moment.
I am struggling with the scaling of the y axis. I am currently getting horrible graphs like this:
How can I scale the y axis appropriately so that there are reasonable breaks when the data lie between 0 and 1 (e.g. 0.0000123, 0.12, 0.98, 0.00000024)?
I was hoping for something dynamic like:
scale_y_continuous(breaks = c(min(d[,i]), 0, max(d[,i])))
That, however, fails with:
Error in Summary.factor(c(1L, 105L, 181L, 125L, 699L, 55L, 270L, 226L, :
min not meaningful for factors
Any help appreciated.
require(lattice)
require(ggplot2)
require(reshape2)
# Read in Perfmon -- MUST BE CSV
d <- read.table("~/R/RPerfmon.csv",header=TRUE,sep=",",dec=".",check.names=FALSE)
# Rename First Column to Time as this is standard in all Perfmon CSVs
colnames(d)[1]="Time"
# Convert Time Column into proper format
d$Time<-as.POSIXct(d$Time, format='%m/%d/%Y %H:%M:%S')
# Strip out The computer name from all Column Headers (Perfmon Counters)
# The regex matches a-zA-Z, underscores and dashes, may need to be expanded
colnames(d) <- sub("^\\\\\\\\[a-zA-Z_-]*\\\\", "", colnames(d))
colnames(d) <- sub("\\\\", "|", colnames(d))
colnames(d)
warnings()
pdf(paste("PerfmonPlot_",Sys.Date(),".pdf",sep=""))
for (i in 2:ncol(d)) {
  p <- qplot(d[, "Time"], y = d[, i], data = d, xlab = "Time", ylab = "", main = colnames(d[i]))
  p <- p + geom_hline()
  p <- p + scale_y_continuous(breaks = c(min(d[, i]), 0, max(d[, i])))
  print(p)
}
dev.off()
In order to get reasonable breaks between 0 and 1, you can for example use:
scale_y_continuous(breaks=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0))
A rewritten plot-part of your code:
ggplot(d, aes(x = Time, y = d[, i])) +
  geom_hline() +
  scale_y_continuous(breaks = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)) +
  labs(title = colnames(d[i]), x = "Time", y = "")
And a more dynamic way of setting the breaks:
scale_y_continuous(breaks=seq(from=round(min(d[,i]),1), to=round(max(d[,i]),1), by=0.1))
However, when you look at the error message, you can see that the y-variables are factors, so you have to convert them to numeric first. Note that as.numeric() applied directly to a factor returns the underlying level codes; use as.numeric(as.character(x)) (or read the file with stringsAsFactors = FALSE) to recover the actual values.
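As an additional sketch (not part of the original answer), the scales package that ships alongside ggplot2 can also pick sensible breaks automatically once the column is numeric; here i and d are assumed to be the loop variable and data frame from the question:

library(scales)

y_num <- as.numeric(as.character(d[, i]))   # factor -> character -> numeric
ggplot(d, aes(x = Time, y = y_num)) +
  geom_point() +
  scale_y_continuous(breaks = pretty_breaks(n = 10)) +
  labs(title = colnames(d)[i], x = "Time", y = "")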
Here is the code I ended up with after a bit of playing around, in case anyone wants to do the same.
The key to making it dynamic was the following (note the as.numeric to avoid the error above):
ynumeric <- as.numeric(d[, i])
ymin <- min(ynumeric, na.rm = TRUE)
ymax <- max(ynumeric, na.rm = TRUE)
# generate a sequence of 10 breaks
ybreaks <- seq(ymin, ymax, length.out = 10)
# then pass this to scale_y_continuous
p <- p + scale_y_continuous(breaks = ybreaks)
I hope to expand this in the future to be somewhere in the region of PAL's complexity, but using R for efficiency.
require(lattice)
require(ggplot2)
require(reshape2)
# Read in Perfmon -- MUST BE CSV
d <- read.table("~/R/RPerfmon.csv",header=TRUE,sep=",",dec=".",check.names=FALSE,stringsAsFactors=FALSE)
# Rename First Column to Time as this is standard in all Perfmon CSVs
colnames(d)[1]="Time"
# Convert Time Column into proper format
d$Time<-as.POSIXct(d$Time, format='%m/%d/%Y %H:%M:%S')
# Strip out The computer name from all Column Headers (Perfmon Counters)
# The regex matches a-zA-Z, underscores and dashes, may need to be expanded
colnames(d) <- sub("^\\\\\\\\[a-zA-Z_-]*\\\\", "", colnames(d))
colnames(d) <- sub("\\\\", "|", colnames(d))
colnames(d)
warnings()
pdf(paste("PerfmonPlotData_",Sys.Date(),".pdf",sep=""))
for (i in 2:ncol(d)) {
  ynumeric <- as.numeric(d[, i])
  ymin <- min(ynumeric, na.rm = TRUE)
  ymax <- max(ynumeric, na.rm = TRUE)
  # generate a sequence of 10 breaks
  ybreaks <- seq(ymin, ymax, length.out = 10)
  print(ybreaks)
  print(paste(ymin, ymax))
  p <- qplot(d[, "Time"], y = ynumeric, data = d, xlab = "Time", ylab = "", main = colnames(d[i]))
  p <- p + geom_smooth(size = 3, se = TRUE) + theme_bw()
  p <- p + scale_y_continuous(breaks = ybreaks)
  print(p)
}
dev.off()
