Trouble plotting with dates in R - r

I am relatively new to R and am having trouble plotting grouped data against date. I have count data grouped by month over 4 years. I don't want May 2008 grouped with May 2009; I want a point for each month of each year, with standard errors. Here is my code so far, but it gives a blank graph with no points. If I remove the axis.POSIXct line, I get a graph with points and error bars. The problem seems to be the scaling or data format of the plot versus the axis. Can anyone help me here?
> r <- as.POSIXct(range(refmCount$mo.yr), "month")
>
> ############# can get plot and points to line up on the x-axis##########################
> plot(refmCount$mo.yr, refmCount$count, type = "n", xaxt = "n",
+ xlab = "Date",
+ ylab = "Mean number of salamanders per night",
+ xlim = c(r[1], r[2]))
> axis.POSIXct(1, at = seq(r[1], r[2], by = "month"), format = "%b")
> points(refmCount$mo.yr, refmCount$count, type = "p", pch = 19)
> points(depmCount$mo.yr, depmCount$count, type = "p", pch = 24)
> arrows(refmCount$mo.yr, refmCount$count + refmCount$se, refmCount$mo.yr, refmCount$count - refmCount$se, angle = 90, code = 3, length = 0)
>
> str(refmCount)
'data.frame': 19 obs. of 7 variables:
$ mo.yr:Class 'Date' num [1:19] 14000 14031 14061 14092 14123 ...
$ trt : Factor w/ 2 levels "Depletion","Reference": 2 2 2 2 2 2 2 2 2 2 ...
$ N : num 75 110 15 10 34 20 20 10 40 15 ...
$ count: num 3.6 5.95 3.47 6.7 11.12 ...
$ sd : num 8.58 8.4 4.42 3.47 11.88 ...
$ se : num 0.99 0.801 1.142 1.096 2.037 ...
$ ci : num 1.97 1.59 2.45 2.48 4.14 ...
> r
[1] "2008-04-30 20:00:00 EDT" "2011-05-31 20:00:00 EDT"
>

You have two choices. Install package "zoo" and use the yearmon class, or calculate numeric months so that May 2005 is 2005.4167. You can create prettier labels with paste(month.abb[month], year).
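A minimal sketch of the second option, using base R only. The dates are made up for illustration, and the names mo_yr, num_month, and tick_labels are hypothetical; the real mo.yr column is assumed to be of class Date, as the str() output above shows.

```r
# Hypothetical monthly dates standing in for refmCount$mo.yr (class "Date")
mo_yr <- as.Date(c("2008-05-01", "2008-06-01", "2009-05-01"))

yr <- as.integer(format(mo_yr, "%Y"))
mo <- as.integer(format(mo_yr, "%m"))

# Numeric month on a yearly scale: May 2008 becomes 2008 + 5/12
num_month <- yr + mo / 12

# Readable labels such as "May 2008"; month.abb is a built-in constant
tick_labels <- paste(month.abb[mo], yr)
```

num_month can then serve as the x values in plot(..., xaxt = "n"), and axis(1, at = num_month, labels = tick_labels) draws the labels on the same numeric scale as the points.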

Related

I'm having problems with my ggplot2 theme system

I'm having problems with my ggplot2 plots and I don't know why. Even after restarting RStudio, the theme is not restored to the original default theme.
library(tidyverse)
chic <- read_csv("./chicago-nmmaps-custom.csv")
ggplot(chic, aes(x = date, y = temp)) +
geom_point()
That is the code I ran. The plot I got (image not included) does not look like the normal default output (image not included).
You could use theme_set to replace older themes like this:
library(ggplot2)
p <- ggplot(mtcars, aes(mpg, wt)) +
geom_point()
p
old <- theme_set(theme_bw())
p
theme_set(old)
p
Created on 2022-10-08 with reprex v2.0.2
The problem is that the date column is not a date object; it is a column of class "character". Coerce it to class "Date" and the default grey theme is used.
The output of str shows the classes of the data set's columns, and date is displayed as chr, meaning a column of class "character". R has proper date and date-time classes, and this column should be converted to one of them. Everything afterwards will be easier, including the ggplot2 code. ggplot2's scales scale_*_date and scale_*_datetime even have dedicated arguments for date and date-time breaks and labels, respectively.
str(chic)
#> 'data.frame': 5114 obs. of 9 variables:
#> $ city : chr "chic" "chic" "chic" "chic" ...
#> $ date : chr "1987-01-01" "1987-01-02" "1987-01-03" "1987-01-04" ...
#> $ death : int 130 150 101 135 126 130 129 109 125 153 ...
#> $ temp : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
#> $ dewpoint: num 31.5 29.9 27.4 28.6 28.9 ...
#> $ pm10 : num 27.8 NA 33.7 40.8 NA ...
#> $ o3 : num 4.03 4.58 3.4 3.94 4.4 ...
#> $ time : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ season : chr "winter" "winter" "winter" "winter" ...
library(ggplot2)
chic |>
dplyr::mutate(date = as.Date(date)) |>
ggplot(aes(date, temp)) +
geom_point() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y")
Created on 2022-10-08 with reprex v2.0.2
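Before plotting, it can help to check the class of the column directly. The following base R sketch uses a short made-up vector in place of the date column of chic; the names date_chr, date_obj, and span_days are introduced only for this illustration.

```r
# Hypothetical stand-in for the character `date` column of `chic`
date_chr <- c("1987-01-01", "1987-01-02", "1987-01-03")
class(date_chr)                 # "character": treated as discrete labels

# ISO "YYYY-MM-DD" strings parse without an explicit format
date_obj <- as.Date(date_chr)
class(date_obj)                 # "Date"

# Date arithmetic only works after the conversion
span_days <- as.numeric(diff(range(date_obj)), units = "days")
```

If the strings were in another layout, such as "01/02/1987", as.Date would need a format argument (for example format = "%m/%d/%Y").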

Formula notation for scatterplot producing unexpected results

I am working on a map, where the color of each point is proportional to one response variable, and the size of the point is proportional to another. I've noticed that when I try to plot the points using formula notation things go haywire, while default notation performs as expected. I have used formula notation to plot maps many times before, and thought that the notations were nearly interchangeable. Why would these produce different results? I have read through the plot.formula and plot.default documentation and haven't been able to figure it out. Based on this I am wondering if it has to do with the columns of dat being coerced to factors, but I'm not sure why that would be happening. Any ideas?
Consider the following example data frame, dat:
latitude <- c(runif(10, min = 45, max = 48))
latitude[9] <- NA
longitude <- c(runif(10, min = -124.5, max = -122.5))
longitude[9] <- NA
color <- c("#00FFCCCC", "#99FF00CC", "#FF0000CC", "#3300FFCC", "#00FFCCCC",
"#00FFCCCC", "#3300FFCC", "#00FFCCCC", NA, "#3300FFCC")
size <- c(4.916667, 5.750000, 7.000000, 2.000000, 5.750000,
4.500000, 2.000000, 4.500000, NA, 2.000000)
dat <- as.data.frame(cbind(longitude, latitude, color, size))
Plotting according to formula notation
plot(latitude ~ longitude, data = dat, type = "p", pch = 21, col = 1, bg = color, cex = size)
produces a garbled plot (image not included) and the following error: graphical parameter "type" is obsolete.
Plotting according to the default notation
plot(longitude, latitude, type = "p", pch = 21, col = 1, bg = color, cex = size)
works as expected, though with the same error.
There are a couple of problems with this. First is that your use of cbind is turning this into a matrix, albeit temporarily, which is converting your numbers to character. See:
dat <- as.data.frame(cbind(longitude, latitude, color, size))
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: Factor w/ 9 levels "-122.855375511572",..: 6 8 9 1 4 3 2 7 NA 5
# $ latitude : Factor w/ 9 levels "45.5418886151165",..: 6 2 4 1 3 7 5 9 NA 8
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : Factor w/ 5 levels "2","4.5","4.916667",..: 3 4 5 1 4 2 1 2 NA 1
If instead you just use data.frame, you'll get:
dat <- data.frame(longitude, latitude, color, size)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
But now the colors are all off. Okay, the problem is likely that your $color column is a factor, which is interpreted internally as integer codes. Try stringsAsFactors = FALSE (the default since R 4.0.0):
dat <- data.frame(longitude, latitude, color, size, stringsAsFactors=FALSE)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : chr "#00FFCCCC" "#99FF00CC" "#FF0000CC" "#3300FFCC" ...
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
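The coercion described above can be checked on a tiny example. This is only a sketch with made-up values; m is a hypothetical name for the intermediate matrix that cbind creates.

```r
longitude <- c(-123.1, -124.0)
color <- c("#FF0000CC", "#00FFCCCC")

# cbind() builds a matrix, and a matrix holds a single type,
# so the numbers are silently converted to character strings
m <- cbind(longitude, color)
typeof(m)            # "character"

# data.frame() keeps each column's own type
dat <- data.frame(longitude, color, stringsAsFactors = FALSE)
sapply(dat, class)   # longitude: "numeric", color: "character"
```

Running str() on a data frame before plotting is a quick way to catch this kind of unintended conversion.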

ggplot2 facet_wrap geom_text not accepting date values

I have a small data set, local, (5 observations) with two types: a and b.
Each observation has a Date field (p.start), a ratio, and a duration.
local
principal p.start duration allocated.days ratio
1 P 2015-03-18 1 162.0000 162.0000
2 V 2015-08-28 4 24.0000 6.0000
3 V 2015-09-03 1 89.0000 89.0000
4 V 2015-03-30 1 32.0000 32.0000
5 P 2015-01-29 1 150.1667 150.1667
str(local)
'data.frame': 5 obs. of 5 variables:
$ principal : chr "P" "V" "V" "V" ...
$ p.start : Date, format: "2015-03-18" "2015-08-28" "2015-09-03" "2015-03-30" ...
$ duration : Factor w/ 10 levels "1","2","3","4",..: 1 4 1 1 1
$ allocated.days: num 162 24 89 32 150
$ ratio : num 162 6 89 32 150
I have another data frame, stats, with text to be added to a faceted plot.
stats
principal xx yy zz
1 P 2015-02-28 145.8 Average = 156
2 V 2015-02-28 145.8 Average = 24
str(stats)
'data.frame': 2 obs. of 4 variables:
$ principal: chr "P" "V"
$ xx : Date, format: "2015-02-28" "2015-02-28"
$ yy : num 146 146
$ zz : chr "Average = 156" "Average = 24"
The following code fails:
p = ggplot (local, aes (x = p.start, y = ratio, size = duration))
p = p + geom_point (colour = "blue"); p
p = p + facet_wrap (~ principal, nrow = 2); p
p = p + geom_text(aes(x=xx, y=yy, label=zz), data= stats)
p
Error: Continuous value supplied to discrete scale
Any ideas? I'm missing something obvious.
The problem is that you are plotting from 2 data.frames, but your initial ggplot call includes aes parameters referring to just the local data.frame.
So although your geom_text specifies data=stats, it is still looking for size=duration.
The following line works for me:
ggplot(local) +
geom_point(aes(x=p.start, y=ratio, size=duration), colour="blue") +
facet_wrap(~ principal, nrow=2) +
geom_text(data=stats, aes(x=xx, y=yy, label=zz))
Just remove size = duration from ggplot (local, aes (x = p.start, y = ratio, size = duration)) and add it into geom_point (colour = "blue"). Then, it should work.
ggplot(local, aes(x=p.start, y=ratio))+
geom_point(colour="blue", aes(size=duration))+
facet_wrap(~principal, nrow=2)+
geom_text(aes(x=xx, y=yy, label=zz), data=stats)
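Another option, sketched here on the assumption that ggplot2 is installed, is to leave the aesthetics in ggplot() and tell geom_text not to inherit them via inherit.aes = FALSE. The data frames obs and notes are small made-up stand-ins for local and stats.

```r
library(ggplot2)

# Made-up stand-ins for the `local` and `stats` data frames in the question
obs <- data.frame(
  principal = c("P", "V", "V"),
  p.start   = as.Date(c("2015-03-18", "2015-08-28", "2015-09-03")),
  duration  = factor(c(1, 4, 1)),
  ratio     = c(162, 6, 89)
)
notes <- data.frame(
  principal = c("P", "V"),
  xx        = as.Date(c("2015-02-28", "2015-02-28")),
  yy        = c(145.8, 145.8),
  zz        = c("Average = 156", "Average = 24")
)

p <- ggplot(obs, aes(x = p.start, y = ratio, size = duration)) +
  geom_point(colour = "blue") +
  facet_wrap(~ principal, nrow = 2) +
  # inherit.aes = FALSE stops this layer from picking up size = duration
  geom_text(data = notes, aes(x = xx, y = yy, label = zz), inherit.aes = FALSE)
```

Printing p draws the plot. This keeps the shared mapping in one place, at the cost of having to spell out every aesthetic the text layer needs.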

Adding Different Percentiles in boxplots in R

I am fairly new to R and recently used it to make some boxplots. I also added the mean and standard deviation to my boxplot. I was wondering if I could add some kind of tick mark or circle at different percentiles as well. Say I want to mark the 85th and 90th percentiles in each HOUR boxplot; is there a way to do this? My data consist of a year's worth of loads in MW for each hour, and my output consists of 24 boxplots, one per hour, for each month. I am doing one month at a time because I am not sure if there is a way to run all 96 boxplots (each month, weekday/weekend, for 4 different zones) at once. Thanks in advance for any help.
JANWD <-read.csv("C:\\My Directory\\MWBox2.csv")
JANWD.df<-data.frame(JANWD)
JANWD.sub <-subset(JANWD.df, MONTH < 2 & weekend == "NO")
KeepCols <-c("Hour" , "Houston_Load")
HWD <- JANWD.sub[ ,KeepCols]
sds <- tapply(HWD$Houston_Load, HWD$Hour, sd)  # named sds to avoid shadowing the sd() function
means <- tapply(HWD$Houston_Load, HWD$Hour, mean)
boxplot(Houston_Load ~ Hour, data=HWD, xlab="WEEKDAY HOURS", ylab="MW Difference", ylim= c(-10, 20), smooth=TRUE ,col ="bisque", range=0)
points(sds, pch = 22, col= "blue")
points(means, pch=23, col ="red")
#Output of the subset of data used to run boxplot for month january in Houston
str(HWD)
'data.frame': 504 obs. of 2 variables:
$ Hour : int 1 2 3 4 5 6 7 8 9 10 ...
$ Houston_Load: num 1.922 2.747 -2.389 0.515 1.922 ...
#OUTPUT of the original data
str(JANWD)
'data.frame': 8783 obs. of 9 variables:
$ Date : Factor w/ 366 levels "1/1/2012","1/10/2012",..: 306 306 306 306 306 306 306 306 306 306 ...
$ Hour : int 1 2 3 4 5 6 7 8 9 10 ...
$ MONTH : int 8 8 8 8 8 8 8 8 8 8 ...
$ weekend : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
$ TOTAL_LOAD : num 0.607 5.111 6.252 7.607 0.607 ...
$ Houston_Load: num -2.389 0.515 1.922 2.747 -2.389 ...
$ North_Load : num 2.95 4.14 3.55 3.91 2.95 ...
$ South_Load : num -0.108 0.267 0.54 0.638 -0.108 ...
$ West_Load : num 0.154 0.193 0.236 0.311 0.154 ...
Here is one way, using quantile() to compute the relevant percentiles for you. I add the marks using rug().
set.seed(1)
X <- rnorm(200)
boxplot(X, yaxt = "n")
## compute the required quantiles
qntl <- quantile(X, probs = c(0.85, 0.90))
## add them as a rug plot to the left hand side
rug(qntl, side = 2, col = "blue", lwd = 2)
## add the box and axes
axis(2)
box()
Update: In response to the OP providing str() output, here is an example similar to the data that the OP has to hand:
set.seed(1) ## make reproducible
HWD <- data.frame(Hour = rep(0:23, 10),
Houston_Load = rnorm(24*10))
I presume you want ticks at the 85th and 90th percentiles for each Hour? If so, we need to split the data by Hour and compute the quantiles via quantile(), as I showed earlier:
quants <- sapply(split(HWD$Houston_Load, list(HWD$Hour)),
quantile, probs = c(0.85, 0.9))
which gives:
R> quants <- sapply(split(HWD$Houston_Load, list(HWD$Hour)),
+ quantile, probs = c(0.85, 0.9))
R> quants
0 1 2 3 4 5 6
85% 0.3576510 0.8633506 1.581443 0.2264709 0.4164411 0.2864026 1.053742
90% 0.6116363 0.9273008 2.109248 0.4218297 0.5554147 0.4474140 1.366114
7 8 9 10 11 12 13 14
85% 0.5352211 0.5175485 1.790593 1.394988 0.7280584 0.8578999 1.437778 1.087101
90% 0.8625322 0.5969672 1.830352 1.519262 0.9399476 1.1401877 1.763725 1.102516
15 16 17 18 19 20 21
85% 0.6855288 0.4874499 0.5493679 0.9754414 1.095362 0.7936225 1.824002
90% 0.8737872 0.6121487 0.6078405 1.0990935 1.233637 0.9431199 2.175961
22 23
85% 1.058648 0.6950166
90% 1.145783 0.8436541
Now we can draw marks at the x locations of the boxplots
boxplot(Houston_Load ~ Hour, data = HWD, axes = FALSE)
xlocs <- 1:24 ## where to draw marks
tickl <- 0.15 ## half-length of marks used
for(i in seq_len(ncol(quants))) {
segments(x0 = rep(xlocs[i] - tickl, 2), y0 = quants[, i],
x1 = rep(xlocs[i] + tickl, 2), y1 = quants[, i],
col = c("red", "blue"), lwd = 2)
}
title(xlab = "Hour", ylab = "Houston Load")
axis(1, at = xlocs, labels = xlocs - 1)
axis(2)
box()
legend("bottomleft", legend = paste(c("0.85", "0.90"), "quantile"),
bty = "n", lty = "solid", lwd = 2, col = c("red", "blue"))
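The resulting figure (not included here) shows a red tick at the 85th percentile and a blue tick at the 90th percentile on each hourly boxplot.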
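Because segments() is vectorized, the marks can also be drawn in a single call instead of a loop. The sketch below uses the same simulated HWD data; the names half, x0, x1, and y are introduced here for illustration only.

```r
set.seed(1)
HWD <- data.frame(Hour = rep(0:23, 10),
                  Houston_Load = rnorm(24 * 10))

# 2 x 24 matrix: rows are the 85% and 90% quantiles, columns are hours
quants <- sapply(split(HWD$Houston_Load, HWD$Hour),
                 quantile, probs = c(0.85, 0.90))

half <- 0.15                       # half-width of each tick mark
xlocs <- seq_len(ncol(quants))     # boxplots sit at x = 1, 2, ..., 24

# as.vector() reads the matrix column by column, so each hour's two
# quantiles are adjacent; repeat every x position once per quantile
y  <- as.vector(quants)
x0 <- rep(xlocs - half, each = nrow(quants))
x1 <- rep(xlocs + half, each = nrow(quants))

if (!interactive()) pdf(NULL)      # no output file when run as a script
boxplot(Houston_Load ~ Hour, data = HWD)
segments(x0, y, x1, y, col = rep(c("red", "blue"), times = ncol(quants)), lwd = 2)
```

The colours alternate red and blue in the same order as the rows of quants, so the result should match the loop version.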

How can I have a date field formatted properly in a smoothScatter plot?

I have data that looks like this:
> head(data)
date price volume
1 2011-06-26 17:16:05 17.51001 2.000
2 2011-06-26 20:50:00 14.80351 2.981
3 2011-06-26 20:51:00 14.90000 2.000
4 2011-06-26 20:52:00 14.89001 0.790
5 2011-06-26 20:53:00 15.00000 1.000
6 2011-06-26 21:05:01 16.20000 6.500
> str(head(data))
'data.frame': 6 obs. of 3 variables:
$ date : POSIXct, format: "2011-06-26 17:16:05" "2011-06-26 20:50:00" "2011-06-26 20:51:00" "2011-06-26 20:52:00" ...
$ price : num 17.5 14.8 14.9 14.9 15 ...
$ volume: num 2 2.98 2 0.79 1 ...
When I plot it like this:
someColors <- colorRampPalette(c("black", "blue", "orange", "red"), space="Lab")
smoothScatter(data, colramp=someColors)
I get almost exactly what I'm looking for, but it converts the posix dates to numbers. How can I set the x labels more usefully so that my stuff is a bit more readable?
Edit: I can get an approximation of what I want like this:
smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=data$date,
labels=lapply(data$date, function(d) strftime(d, "%F")),
tick=FALSE)
That's terribly slow, though. It seems like I should be able to prep the data or advise the label drawer a bit.
In terms of speed, it might help to specify a range of dates to use for the x-axis labels. For example:
days <- seq(min(data$date), max(data$date), by = 'month')
axis(1, at=days,
labels=strftime(days, "%F"),
tick=FALSE)
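It might also help to truncate the times to whole days. Convert the result back with as.POSIXct so the tick positions stay on the plot's seconds-based scale:
days <- as.POSIXct(seq(as.Date(min(data$date)), as.Date(max(data$date)), by = 'month'))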
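Putting the pieces together, here is a sketch on simulated data. The trades data frame and its values are made up, and the example assumes the KernSmooth package that smoothScatter depends on is available (it ships with standard R installations as a recommended package).

```r
set.seed(42)
n <- 500

# Made-up prices spread over about 90 days, standing in for `data`
trades <- data.frame(
  date  = as.POSIXct("2011-06-26 17:00:00", tz = "UTC") +
            sort(runif(n, 0, 90 * 86400)),
  price = 15 + cumsum(rnorm(n, sd = 0.1))
)

# A handful of monthly ticks instead of one label per observation
ticks <- seq(min(trades$date), max(trades$date), by = "month")
tick_labels <- format(ticks, "%Y-%m-%d")

if (!interactive()) pdf(NULL)      # no output file when run as a script
smoothScatter(as.numeric(trades$date), trades$price,
              xaxt = "n", xlab = "date", ylab = "price")

# POSIXct values are seconds since 1970, which is the scale of the x-axis
axis(1, at = as.numeric(ticks), labels = tick_labels)
```

Formatting a few tick positions once is much cheaper than calling strftime on every row, which is the slow part of the lapply approach in the question.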
