ggplot2 geom_line() and smoothing


I am trying to create a smoothed line graph in ggplot2 that looks more like this:
Source: http://www.esrl.noaa.gov/psd/enso/mei/
and less like this:
Source: https://dl.dropboxusercontent.com/u/16400709/StackOverflow/Rplot02.png
My data are available on dropbox.
Having looked at previous posts I used the code below:
#MEI Line Graph
d4 <- read.csv("https://dl.dropboxusercontent.com/u/16400709/StackOverflow/Data_MEI.csv")
head(d4,n=20)
MEI<-ggplot(d4,aes(x=d4$Date, y=d4$MEI,group=1))+geom_line()
MEI+stat_smooth(method ="auto",level=0.95)
What I think I need is to reduce the amount of smoothing taking place, but I have yet to figure out how to achieve this.
d4s<-SMA(d4$MEI,n=8)
plot.ts(d4s)
SMA() works well, but I can't get it to work with ggplot.
Any hints would be appreciated!

Be aware that the MEI index is for a 2-month period, so it's already got some smoothing built in. Assuming that you are using the MEI data that NOAA ESRL publishes, you should be able to create the same plot.
First of all, you need to get the system set up, as you'll be working with time zones:
# set things up ----
working.dir = file.path('/code/R/StackOverflow/')
setwd(working.dir)
Sys.setenv(TZ='GMT')
Now download your data and read it in:
d.in <- read.csv("MEI.txt")
The next step is to get the dates formatted properly.
d.in$Date <- as.POSIXct(d.in$Date,
format = "%d/%m/%Y",
tz = "GMT")
Because we need to figure out where things cross the x-axis, we'll have to work with numeric dates. Use the epoch value (seconds since 1970-01-01):
d <- data.frame(x = as.numeric(d.in$Date),
                y = d.in$MEI)
Now we can figure out the zero-crossings. We'll use Beroe's example for that.
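# For each pair of neighbouring points, fit a line and find where it hits y = 0.
# lapply() always returns a list, so rbind() can simply skip the NULL entries.
rx <- do.call("rbind",
              lapply(1:(nrow(d) - 1), function(i) {
                f <- lm(x ~ y, d[i:(i + 1), ])
                if (f$qr$rank < 2) return(NULL)
                r <- predict(f, newdata = data.frame(y = 0))
                if (d[i, ]$x < r && r < d[i + 1, ]$x)
                  return(data.frame(x = r, y = 0))
                else return(NULL)
              }))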
and tack that on the end of the initial data:
d2 <- rbind(d,rx)
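For what it's worth, the same crossings can be found without fitting a model for every pair of points. This is only a sketch on made-up numbers: zero_crossings() is a hypothetical helper, not part of Beroe's example, and it assumes x is sorted and ignores points that sit exactly on zero.

```r
# Hypothetical helper: linear interpolation between neighbours of opposite sign
zero_crossings <- function(x, y) {
  n <- length(y)
  # indices i where y[i] and y[i + 1] have opposite signs
  i <- which(y[-n] * y[-1] < 0)
  # x value where the straight line through the two points reaches y = 0
  x0 <- x[i] - y[i] * (x[i + 1] - x[i]) / (y[i + 1] - y[i])
  data.frame(x = x0, y = rep(0, length(i)))
}

# Made-up data with two sign changes
d_demo <- data.frame(x = 1:5, y = c(-1, 1, 2, -2, -1))
rx_demo <- zero_crossings(d_demo$x, d_demo$y)

# Append the crossings and restore the x ordering
d2_demo <- rbind(d_demo, rx_demo)
d2_demo <- d2_demo[order(d2_demo$x), ]
```

On the toy data the crossings land halfway between the neighbouring points, because the values on either side are symmetric about zero.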
Now convert back to dates. The epoch values count seconds from 1970-01-01, so that is the origin to use:
d2$date <- as.POSIXct(d2$x,
                      origin = "1970-01-01",
                      tz = "GMT")
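A quick round-trip check of the date conversion, as a sketch with two made-up dates (one before 1970, so the epoch value is negative). It assumes the numeric values are seconds since 1970-01-01, which is what as.numeric() gives for a POSIXct value:

```r
# Two made-up dates in GMT
dates <- as.POSIXct(c("1950-01-01", "2013-06-01"), tz = "GMT")

# POSIXct -> seconds since the Unix epoch
secs <- as.numeric(dates)

# seconds -> POSIXct, using the matching origin
back <- as.POSIXct(secs, origin = "1970-01-01", tz = "GMT")

# TRUE only if the dates survive the round trip
round_trip_ok <- identical(format(back, "%Y-%m-%d"), format(dates, "%Y-%m-%d"))
```

If the origin does not match the epoch, every date comes back shifted by the difference between the two.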
Now we can do the plot:
library(ggplot2)
ggplot(d2, aes(x = date, y = y)) +
  geom_area(data = subset(d2, y <= 0), fill = "blue") +
  geom_area(data = subset(d2, y >= 0), fill = "red") +
  scale_y_continuous(name = "MEI")
and that gives you this:
Now, do you really need to smooth this?
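If you do want a lighter smooth in ggplot2, one option is to compute the moving average yourself and draw it as a second line; another is to lower the loess span. A rough sketch on simulated data follows. The d4 values below are made up, MEI_sma is a column name I chose, and stats::filter() with sides = 1 gives a trailing average, which is what I understand TTR's SMA() to compute.

```r
library(ggplot2)

# Made-up stand-in for the MEI series: 240 monthly values
set.seed(1)
d4 <- data.frame(
  Date = seq(as.Date("1990-01-01"), by = "month", length.out = 240),
  MEI  = as.numeric(arima.sim(list(ar = 0.9), n = 240))
)

# Trailing 8-point simple moving average; the first 7 values are NA
n <- 8
d4$MEI_sma <- as.numeric(stats::filter(d4$MEI, rep(1 / n, n), sides = 1))

p <- ggplot(d4, aes(x = Date, y = MEI)) +
  geom_line(colour = "grey60") +                               # raw series
  geom_line(aes(y = MEI_sma), colour = "blue", na.rm = TRUE) + # moving average
  stat_smooth(method = "loess", span = 0.1, se = FALSE,        # light loess smooth
              colour = "red")
```

A smaller span or a smaller n follows the raw series more closely; printing p draws all three lines for comparison.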


Improper facet_wrap output

The below is my sample data
Though there are values for mobile and tablet across all 4 dates, when I try to facet_wrap across device category, my results are not what is expected. All the values corresponding to each date are being added to the desktop only and are not being distributed across the 3 categories.
The code that I used is
qplot(data=gaData, x=gaData$Date, y=gaData$Users, xlim = c(20170101,20170101))+
facet_wrap(~gaData$Device.Category, ncol = 1)
The output that I'm seeing in the plot is
I'm new to the whole data visualization area. I'm unable to identify what is wrong with the code.
P.S. I'm able to plot mobile and tablet individually for the same dates successfully as individual plots.

x <- data.frame(Date = c('2017-01-01','2017-01-01','2017-01-01','2017-01-02','2017-01-02','2017-01-02','2017-01-03','2017-01-03','2017-01-03',
'2017-01-04','2017-01-04','2017-01-04'), Device = c("desktop","mobile","tablet","desktop","mobile","tablet",
"desktop","mobile","tablet","desktop","mobile","tablet"),
Users = c(404,223,39,529,211,43,1195,285,29,1019,275,35))
x$Date <- as.POSIXct(x$Date, tz = "UTC")
ggplot(x, aes(Date, Users)) + geom_line() + facet_wrap(~Device)
Is this what you wanted?

Hope this helps.
# Simulate some dummy data
dat <- data.frame(
  Date = rep(20170101:20170104, each = 3),
  Device = rep(c('D', 'M', 'T'), 4),
  Users = round(runif(n = 12, min = 10, max = 1000))
)
# This is the 'base map', map variables onto aesthetics
ggplot(aes(x = Date, y = Users, col = Device), data = dat) +
  # What kind of geometry?
  geom_line() +
  geom_point() +
  # From 1d panel to 2d
  facet_wrap(~Device, ncol = 1)
You may also consider converting the Date variable to class Date.
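A minimal sketch of that conversion, assuming the dates are stored as integers of the form YYYYMMDD as in the dummy data (dat_demo is a name used only for this illustration):

```r
# Dates stored as integers such as 20170101
dat_demo <- data.frame(Date = rep(20170101:20170104, each = 3))

# Go through character so that the digits can be parsed as year, month, day
dat_demo$Date <- as.Date(as.character(dat_demo$Date), format = "%Y%m%d")
```

With a real Date column, ggplot2 treats the axis as dates rather than as large integers.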
The following references may help you gain some understanding of ggplot2:
http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
http://ggplot2.tidyverse.org/reference/
Also, DataCamp provides wonderful online tutorials.
Welcome to the amazing world of R. Cheers.

First, I suggest you convert 'Date' to a proper date class. I use the lubridate package, whose ymd() returns Date objects:
library(lubridate)
Date=ymd(c(rep("2017-01-01",3),rep("2017-01-02",3),rep("2017-01-03",3),rep("2017-01-04",3)))
Then we can build the rest of the data frame:
Country=rep("United States",12)
Device.Category=rep(c("Desktop","Mobile","Tablet"),4)
Users=c(404,223,39,529,211,43,1195,285,29,1019,275,35)
df=data.frame(Date,Country,Device.Category,Users)
If you want to plot only for "2017-01-01" use this
ggplot(df,aes(x=Date,y=Users))+geom_point()+facet_grid(Device.Category~.)+xlim(ymd("2017-01-01"),ymd("2017-01-01"))
Or, if you want all dates, just remove the xlim() call:
ggplot(df,aes(x=Date,y=Users))+geom_point()+facet_grid(Device.Category~.)

How do I set the series labels in a multiline ggplot2 series?

I'm currently working on automating some basic experiential analysis using R. Currently, I've got my script set up as follows, which generates the plot shown below.
data <- list()
for (experiment in experiments) {
path = paste('../out/', experiment, '/', plot, '.csv', sep="")
data[[experiment]] <- read.csv(path, header=F)
}
df <- data.frame(Year=1:40,
'current'=colMeans(data[['current']]),
'vip'=colMeans(data[['vip']]),
'vipbonus'=colMeans(data[['vipbonus']]))
df <- melt(df, id.vars = 'Year', variable.name = 'Series')
plotted <- ggplot(df, aes(Year, value)) +
geom_line(aes(colour = Series)) +
labs(y = ylabel, title = title)
file = paste(plot, '.png', sep="")
ggsave(filename = file, plot = plotted)
While this is close to what we want the final product to look like, the series labels need to be updated. Ideally we want them to be something like "VIP, no bonus", "VIP, with bonus" and so forth, but obviously using labels like that in the data frame is not valid R (and invalid characters are automatically replaced with . even with backticks). Since these experiments are a work in progress, we also know that we are going to need more series labels in the future, so we don't want to lose the ability of ggplot to automatically set the colors for us.
How can I set the series labels to be appropriate for humans?

The OP explained that he is currently working on automating some basic experiential analysis, part of which is the relabeling of the series. The OP showed also some code which is used to prepare the data to be plotted.
Based on the additional information supplied in comments, I believe the overall processing could be streamlined which will address the series labeling issue as well.
Some preparations
# used for creating file paths
experiments <- c("current", "vip", "vipbonus")
# used for labeling the series
exp_labels <- c("Current", "VIP, no bonus", "VIP, with bonus")
plot <- "dataset1" # e.g.
paths <- paste0(file.path("../out", experiments, plot), ".csv")
paths
#[1] "../out/current/dataset1.csv" "../out/vip/dataset1.csv" "../out/vipbonus/dataset1.csv"
Read data
library(data.table) #version 1.10.4 used here
# read all files into one large data.table
# add running count in column "Series" to identify the source of each row
DT <- rbindlist(lapply(paths, fread, header = FALSE), idcol = "Series")
# rename file chunks = Series, use predefined labels
DT[, Series := factor(Series, labels = exp_labels)]
Reshape and aggregate by groups
# reshape from wide to long
molten <- melt(DT, id.vars = "Series")
# compute means by Series and Year = variable
aggregated <- molten[, .(value = mean(value)), by = .(Series, variable)]
# take factor level number of "variable" as Year
aggregated[, Year := as.integer(variable)]
Note that aggregation is done in long format (after melt()) to save typing the same command for each column.
Create chart & save to disk
library(ggplot2)
ggplot(aggregated, aes(Year, value)) +
geom_line(aes(colour = Series)) +
labs(y = "ylabel", title = "title")
file = paste(plot, '.png', sep="")
ggsave(filename = file) # by default, the last plot is saved

While this may not be an ideal approach, what worked for us was to update the relevant series labels after the melt command was performed:
df$Series <- as.character(df$Series)
df$Series[df$Series == "current"] <- "Current"
df$Series[df$Series == "vip"] <- "VIP, no bonus"
df$Series[df$Series == "vipbonus"] <- "VIP, with bonus"
Which results in plots like the following:
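Another option, sketched here on made-up numbers, is to leave the data alone and relabel the legend through the colour scale. The series_labels vector and the values in df_demo are assumptions for illustration; the colours are still assigned automatically.

```r
library(ggplot2)

# Made-up long-format data standing in for the melted data frame
df_demo <- data.frame(
  Year   = rep(1:5, times = 3),
  Series = factor(rep(c("current", "vip", "vipbonus"), each = 5)),
  value  = c(1:5, 2:6, 3:7)
)

# Human-readable legend text, keyed by the series names used in the data
series_labels <- c(current  = "Current",
                   vip      = "VIP, no bonus",
                   vipbonus = "VIP, with bonus")

plotted <- ggplot(df_demo, aes(Year, value)) +
  geom_line(aes(colour = Series)) +
  # breaks are the values in the data, labels are what the legend shows
  scale_colour_discrete(breaks = names(series_labels),
                        labels = unname(series_labels))
```

Adding a new experiment then only means adding one entry to series_labels; the column values stay short and safe to use in code.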

You can try this
library(tidyverse)
df <- df %>% dplyr::mutate(Series = as.character(Series),
Series = fct_recode(Series,
"Current" = "current",
"VIP, no bonus" = "vip",
"VIP, with bonus" = "vipbonus"))

How to deal with "data of class uneval" error from ggplot2?

While trying to overlay a new line on an existing ggplot, I am getting the following error:
Error: ggplot2 doesn't know how to deal with data of class uneval
The first part of my code works fine. Below is an image of "recent" hourly wind generation data from a Midwestern United States electric power market.
Now I want to overlay the last two days' worth of observations in red. It should be easy, but I can't figure out why I am getting an error.
Any assistance would be greatly appreciated.
Below is a reproducible example:
# Read in Wind data
fname <- "https://www.midwestiso.org/Library/Repository/Market%20Reports/20130510_hwd_HIST.csv"
df <- read.csv(fname, header=TRUE, sep="," , skip=7)
df <- df[1:(length(df$MKTHOUR)-5),]
# format variables
df$MWh <- as.numeric(df$MWh)
df$Datetime <- strptime(df$MKTHOUR, "%m/%d/%y %I:%M %p")
# Create some variables
df$Date <- as.Date(df$Datetime)
df$HrEnd <- df$Datetime$hour+1
# Subset recent and last data
last.obs <- range(df$Date)[2]
df.recent <- subset(df, Date %in% seq(last.obs-30, last.obs-2, by=1))
df.last <- subset(df, Date %in% seq(last.obs-2, last.obs, by=1))
# plot recent in Grey
p <- ggplot(df.recent, aes(HrEnd, MWh, group=factor(Date))) +
geom_line(color="grey") +
scale_y_continuous(labels = comma) +
scale_x_continuous(breaks = seq(1,24,1)) +
labs(y="MWh") +
labs(x="Hour Ending") +
labs(title="Hourly Wind Generation")
p
# plot last two days in Red
p <- p + geom_line(df.last, aes(HrEnd, MWh, group=factor(Date)), color="red")
p

When you add a new data set to a geom, you need to use the data= argument, or put the arguments in the proper order: mapping=..., data=.... Take a look at the arguments in ?geom_line.
Thus:
p + geom_line(data=df.last, aes(HrEnd, MWh, group=factor(Date)), color="red")
Or:
p + geom_line(aes(HrEnd, MWh, group=factor(Date)), df.last, color="red")
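A small self-contained check of the above, using toy data in place of the wind file (the df.recent and df.last built here are made up, and the exact error text depends on the ggplot2 version):

```r
library(ggplot2)

# Toy stand-ins for the two subsets
df.recent <- data.frame(HrEnd = rep(1:24, 2),
                        MWh   = c(1:24, 24:1),
                        Date  = rep(c("d1", "d2"), each = 24))
df.last <- data.frame(HrEnd = 1:24, MWh = rep(12, 24), Date = "d3")

p <- ggplot(df.recent, aes(HrEnd, MWh, group = factor(Date))) +
  geom_line(color = "grey")

# Named data= argument: the layer is added without complaint
ok <- p + geom_line(data = df.last,
                    aes(HrEnd, MWh, group = factor(Date)), color = "red")

# Data frame passed positionally ends up in `mapping` and the aes() in `data`,
# which is rejected with an error
bad <- tryCatch(
  p + geom_line(df.last, aes(HrEnd, MWh, group = factor(Date)), color = "red"),
  error = function(e) e
)
```

Here ok is a plot with both layers, while bad holds the captured error condition instead of a plot.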

Another cause is accidentally putting the data=... inside the aes(...) instead of outside:
RIGHT:
ggplot(data=df[df$var7=='9-06',], aes(x=lifetime,y=rep_rate,group=mdcp,color=mdcp) ...)
WRONG:
ggplot(aes(data=df[df$var7=='9-06',],x=lifetime,y=rep_rate,group=mdcp,color=mdcp) ...)
In particular this can happen when you prototype your plot command with qplot(), which doesn't use an explicit aes(), then edit/copy-and-paste it into a ggplot()
qplot(data=..., x=...,y=..., ...)
ggplot(data=..., aes(x=...,y=...,...))
It's a pity ggplot's error message isn't Missing 'data' argument! instead of this cryptic nonsense, because that's what this message often means.

This could also occur if you refer to a variable in the data.frame that doesn't exist. For example, recently I forgot to tell ddply to summarize by one of my variables that I used in geom_line to specify line color. Then, ggplot didn't know where to find the variable I hadn't created in the summary table, and I got this error.

How to create histogram in R with CSV time data?

I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars representing hits per hour per srv01, srv02, srv03, srv04.
Thank you for helping me here!

I don't know if I understood you right, so I will split my answer in two parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
require(ggplot2)
ggplot(data=df) + geom_histogram(aes(x=hours), binwidth=1) + facet_wrap(~ server)
# or use a color instead
ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), binwidth=1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
The examples are given below. The third makes comparison easier but ggplot2 simply looks better I think.
[Figures: plots produced by the first, second, and third solutions]

An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
Now we can use some plyr magic:
library(plyr)
hits_hour = count(dat, vars = c("server","hr"))
And create the plot:
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
Which looks like:
I don't really like this plot, I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Which looks like:
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.
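Both answers assume the log is already in a data frame with named columns. Here is a minimal base-R sketch of getting there from the sample lines in the question. The column names server, timestamp, user and ip are my own choice, since the file has no header row, and the text is read inline only to keep the example self-contained.

```r
# The four sample lines from the question, read inline instead of from logs.csv
log_text <- "svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'"

# No header row in the file; fields are wrapped in single quotes
df <- read.csv(text = log_text, header = FALSE, quote = "'",
               col.names = c("server", "timestamp", "user", "ip"),
               stringsAsFactors = FALSE)

# Hour of day as an integer (0-23)
df$hours <- as.integer(format(strptime(df$timestamp, "%H:%M:%S"), "%H"))

# Hits per server and hour, in long format ready for plotting
hits <- as.data.frame(table(server = df$server, hour = df$hours))
```

With the real file, replace text = log_text with the file name; hits can then be passed to geom_bar(stat = "identity") in the same way as hits_hour above.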

Easiest way to create an irregular time series graph (R? GGPLOT? ITS?)

I'm a graphic designer who is trying to use R to create graphs that are too complicated for Excel. I'm specifically trying to create an irregular time series step chart. I've had no problems creating a regular time series chart, but for some reason, the irregular dates are throwing everything off.
I'm starting with a basic text file with two columns of data:
01-04-1940 4
05-29-1963 35
12-02-2002 24
I've loaded the data using
d <- read.delim("file.txt", header = TRUE)
and I've converted the first column to dates using
d$date <- as.Date(d$date, format = "%m-%d-%Y")
But at this point, I can't find any more information anywhere on how to proceed. I've seen the R package "ITS," But I cannot find any documentation on it beyond technical descriptions of the classes involved.
I'd much appreciate it if someone with some experience in R could point out the few lines of code I need to create this graph. Thanks!

ggplot deals quite nicely with data in date format. Here are some suggestions:
d <- data.frame(
date = c("01-04-1940", "05-29-1963", "12-02-2002"),
value = c(4, 35, 24)
)
d$date <- as.Date(d$date, format = "%m-%d-%Y")
ggplot(d, aes(x=date, y=value)) + geom_step(colour="blue")
ggplot(d, aes(x=date, y=value)) + geom_line(colour="red")

I would use xts/zoo. They both handle irregular time series easily.
library(zoo)
z <- zoo(d[,2], d[,1])  # values in column 2, indexed by the dates in column 1
plot(z)
plot(z, type="s")
