I have a table of maximum trip lengths by month that I am trying to graph in R.
When I plot it, the x-axis is not ordered by month; it is ordered alphabetically instead.
I'm just getting started in R, and I used the following code from one of the videos I watched, adjusted for my table names:
max_trips <- read.csv("max_and_min_trips.csv")
ggplot(data=max_trips)+
geom_point(mapping = aes(x=month,y=max_trip_duration))+
scale_x_month(month_labels = "%Y-%m")
The simple answer is that the data in your "month" column is stored as a vector of strings, not as dates. In R, this data type is called "character" (or chr). You can confirm this by typing class(max_trips$month) in your console; the result will almost certainly be "character". ggplot2 treats a character column as a discrete variable and orders its values alphabetically. The solution is therefore to (1) convert the column to a date and (2) adjust the formatting of the dates on the x axis using scale_x_date() and/or related functions.
I'll demonstrate the process with a simple example dataset and plot. Here's the basic data frame and plot. As you'll see, the plot is again arranged "alphabetically" rather than chronologically, as it would be if the mydf$dates values were stored as dates in "month/year" format.
library(ggplot2)
library(lubridate)
mydf <- data.frame(
dates = c("1/21", "2/20", "12/21", "3/19", "10/19", "9/19"),
yvals = c(13, 31, 14, 10, 20, 18))
ggplot(mydf, aes(x = dates, y = yvals)) + geom_point()
Convert to Date
To convert to a date, you can use a few different functions, but I find the lubridate package particularly useful here. We will use the as_date() function for the conversion; however, we cannot just apply as_date() directly to mydf$dates, or we get NA values and the following warning in the console:
> as_date(mydf$dates)
[1] NA NA NA NA NA NA
Warning message:
All formats failed to parse. No formats found.
Since there are so many ways to format data that represent dates, date-times, etc., we need to specify that our data is in "month/year" format. The other key point is that a Date must specify year, month and day. Our data specifies only month and year, so we first need to add an arbitrary "day" to each value before converting. Here's something that works:
mydf$dates <- as_date(
paste0("1/", mydf$dates), # need to add a "day" to correctly format for date
format = "%d/%m/%y" # nomenclature from strptime()
)
The paste0(...) function adds "1/" before each value in mydf$dates, and the format = argument then specifies that the character values should be read as "day/month/year". For more information on the nomenclature for date formats, see the help for the strptime() function (?strptime).
Now our column is in date format:
> mydf$dates
[1] "2021-01-01" "2020-02-01" "2021-12-01" "2019-03-01" "2019-10-01" "2019-09-01"
> class(mydf$dates)
[1] "Date"
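For reference, the same conversion can be sketched in base R, without lubridate. This is only an illustration, assuming the strings are in "month/year" form with two-digit years as in the example data; the names raw_dates and parsed are made up for this sketch.

```r
# Example values in "month/year" form, copied from the data frame above
raw_dates <- c("1/21", "2/20", "12/21", "3/19", "10/19", "9/19")

# Prepend an arbitrary day ("1/") so each string is a complete date,
# then parse it as day/month/two-digit-year
parsed <- as.Date(paste0("1/", raw_dates), format = "%d/%m/%y")

class(parsed)             # "Date"
raw_dates[order(parsed)]  # the original strings in chronological order
```

Because parsed is a real Date vector, order() sorts it chronologically, which is the same reason the date scale in ggplot2 places the points in the expected order.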
Changing Date Scale
When plotting now, the data is organized in the proper order along a date scale x axis.
p <- ggplot(mydf, aes(x = dates, y = yvals)) + geom_point()
p
If the labeling isn't quite what you are looking for, check the documentation for the scale_x_date() function for some suggestions. The basic idea is to set the spacing of the breaks with date_breaks= and how they are labeled with date_labels=.
p + scale_x_date(date_breaks = "4 months", date_labels = "%m/%y")
In the OP's case, assuming the month column is stored as "month/year" strings like the example above, I would suggest the following code should work:
library(ggplot2)
library(lubridate)
max_trips <- read.csv("max_and_min_trips.csv")
max_trips$month <- as_date(
paste0("1/", max_trips$month), # add a "day" so the string is a complete date
format = "%d/%m/%y")
ggplot(data=max_trips)+
geom_point(mapping = aes(x=month,y=max_trip_duration))+
scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m")
Related
I would like to use ggplot to graph portions of time series data. For example, say I only wanted to graph the last five dates of this data. Is there a way to specify this in ggplot without subsetting the data ahead of time? I tried using xlim, but it didn't work.
date <- c("2016-03-24","2016-03-25","2016-03-26","2016-03-27","2016-03-28",
"2016-03-29","2016-03-30","2016-03-31","2016-04-01","2016-04-02")
Temp <- c(35,34,92,42,21,47,37,42,63,12)
df <- data.frame(date,Temp)
My attempt:
ggplot(df) + geom_line(aes(x=date,y=Temp)) + xlim("2016-03-29","2016-04-02")
My dates are formatted as POSIXct.
You have to supply the xlim values as Date (as.Date()) or POSIXct (as.POSIXct()) objects, matching the class of the column on the x axis. Is this what you want?
library(ggplot2)
df$date <- as.Date(df$date, format = "%Y-%m-%d")
ggplot(df) + geom_line(aes(x=date,y=Temp)) +
xlim(as.Date(c("2016-03-29", "2016-04-02"), format = "%Y-%m-%d"))
PS: Be aware that you will get the following warning, because xlim() removes the rows that fall outside the limits:
Warning message:
Removed 5 rows containing missing values (geom_path)
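If you would rather zoom in without dropping any data, one option is coord_cartesian(), which changes only the visible window. A minimal sketch, assuming the date column is converted to Date first (the variable name p is just chosen here):

```r
library(ggplot2)

date <- c("2016-03-24", "2016-03-25", "2016-03-26", "2016-03-27", "2016-03-28",
          "2016-03-29", "2016-03-30", "2016-03-31", "2016-04-01", "2016-04-02")
Temp <- c(35, 34, 92, 42, 21, 47, 37, 42, 63, 12)
df <- data.frame(date = as.Date(date), Temp)

# coord_cartesian() only restricts what is shown; all 10 rows stay in the layer,
# so no "Removed ... rows" warning is issued
p <- ggplot(df) +
  geom_line(aes(x = date, y = Temp)) +
  coord_cartesian(xlim = as.Date(c("2016-03-29", "2016-04-02")))
p
```

One visible difference from xlim(): the line runs to the left edge of the panel, because the point just outside the window is still part of the data.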
I have hourly time series data for three homes (H1, H2, H3) over five consecutive days, created as
library(xts)
library(ggplot2)
set.seed(123)
dt <- data.frame(H1 = rnorm(24*5,200,2),H2 = rnorm(24*5,150,2),H3 = rnorm(24*5,50,2)) # hourly data of three homes for 5 days
timestamp <- seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-01-05 23:59:59"), by = "hour") # create timestamp
dt$timestamp <- timestamp
Now I want to plot data homewise in facet form; accordingly I melt dataframe as
tempdf <- reshape2::melt(dt,id.vars="timestamp") # melt data for faceting
colnames(tempdf) <- c("time","var","val") # rename so as not to result in conflict with another melt inside geom_line
Within each facet (for each home), I want to see the values of all the five days in line plot form (each facet should contain 5 lines corresponding to different days). Accordingly,
ggplot(tempdf) + facet_wrap(~var) +
geom_line(data = function(x) {
locdat <- xts(x$val,x$time)# create timeseries object for easy splitting
sub <- split.xts(locdat,f="days") # split data daywise of considered home
sub2 <- sapply(sub, function(y) return(coredata(y))) # arrange data in matrix form
df_sub2 <- as.data.frame(sub2)
df_sub2$timestamp <- index(sub[[1]]) # forcing same timestamp for all days [okay with me]
df_melt <- reshape2::melt(df_sub2,id.vars="timestamp") # melt to plot inside each facet
#return(df_melt)
df_melt
}, aes(x=timestamp, y=value,group=variable,color=variable),inherit.aes = FALSE)
I have forced the same timestamp for all the days of a home to make plotting simple.
The only problem with the resulting plot is that it shows the same data in all the facets. Ideally, the H1 facet should contain data for home 1 only, the H2 facet data for home 2, and so on. I know that I am not passing home-wise data to geom_line() correctly; can anyone help me do this in the correct manner?
I think that you may find it more efficient to modify the data outside the call to ggplot rather than inside it (allows closer inspection of what is happening at each step, at least in my opinion).
Here, I am using lubridate to generate two new columns. The first holds only the date (and not the time), so that the lines can be grouped and colored by day. The second holds the full datetime, but I then modify the date so that they are all the same. This leaves only the time of day as mattering (and we can suppress the chosen date in the plot).
library(lubridate)
tempdf$day <- date(tempdf$time)
tempdf$forPlotTime <- tempdf$time
date(tempdf$forPlotTime) <- as.Date("2016-01-01") # same dummy date for every row
Then, I can pass that modified data.frame to ggplot. You will likely want to modify colors/labels, but this should get you a pretty good start.
ggplot(tempdf
, aes(x = forPlotTime
, y = val
, col = as.factor(day))) +
geom_line() +
facet_wrap(~var) +
scale_x_datetime(date_breaks = "6 hours"
, date_labels = "%H:%M")
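The "same date for every row" trick can also be sketched in base R, without lubridate. This is only an illustration: the UTC time zone and the standalone vectors time, day and forPlotTime are assumptions made to keep the example reproducible.

```r
# Hourly timestamps for 5 days (UTC assumed for reproducibility)
time <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"),
            by = "hour", length.out = 24 * 5)

# Calendar day of each observation, usable for grouping and coloring
day <- as.Date(time, tz = "UTC")

# Keep the clock time but move every observation onto one dummy date
forPlotTime <- as.POSIXct(paste("2016-01-01", format(time, "%H:%M:%S")),
                          tz = "UTC")

length(unique(day))   # number of distinct days
range(forPlotTime)    # every value now falls on 2016-01-01
```

With the dates collapsed like this, the x axis only distinguishes time of day, so the lines for different days overlap within each facet.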
I am trying to understand why R behaves differently with the "aggregate" function. I wanted to average 15m-data to hourly data. For this, I passed the 15m-data together with a pre-designed "hour" array (4 times the same date per hour, taking the original POSIXct array) to the aggregate function.
After some time, I realized that the function was behaving oddly (well, probably the data was odd, but why?) when I passed the date array as
strftime(data.15min$posix, format="%Y-%m-%d %H")
However, if I passed the data as
cut(data.15min$posix, "1 hour")
the data was averaged correctly.
Below, a minimal example is embedded, including a sample of the data.
I would be happy to understand what I did wrong.
Thanks in advance!
d <- 3
bla <- read.table("test_daten.dat",header=TRUE,sep=",")
data.15min <- NULL
data.15min$posix <- as.POSIXct(bla$dates,tz="UTC")
data.15min$o3 <- bla$o3
hourtimes <- unique(as.POSIXct(paste(strftime(data.15min$posix, format="%Y-%m-%d %H"),":00:00",sep=""),tz="Universal"))
agg.mean <- function (xx, yy, rm.na = T)
# xx: parameter that determines the aggregation: list(xx), e.g. hour etc.
# yy: parameter that will be aggregated
{
aa <- yy
out.mean <- aggregate(aa, list(xx), FUN = mean, na.rm=rm.na)
out.mean <- out.mean[,2]
}
#############
data.o3.hour.mean <- round(agg.mean(strftime(data.15min$posix, format="%m/%d/%y %H"), data.15min$o3), d); data.o3.hour.mean[1:100]
win.graph(10,5)
par(mar=c(5,15,4,2), new =T)
plot(data.15min$posix,data.15min$o3,col=3,type="l",ylim=c(10,60)) # original data
par(mar=c(5,15,4,2), new =T)
plot(data.date.hour_mean,data.o3.hour.mean,col=5,type="l",ylim=c(10,60)) # Wrong
##############
data.o3.hour.mean <- round(agg.mean(cut(data.15min$posix, "1 hour"), data.15min$o3), d); data.o3.hour.mean[1:100]
win.graph(10,5)
par(mar=c(5,15,4,2), new =T)
plot(data.15min$posix,data.15min$o3,col=3,type="l",ylim=c(10,60)) # original data
par(mar=c(5,15,4,2), new =T)
plot(data.date.hour_mean,data.o3.hour.mean,col=5,type="l",ylim=c(10,60)) # Correct
Too long for a comment.
The reason your results look different is that aggregate(...) sorts the results by your grouping variable(s). In the first case,
strftime(data.15min$posix, format="%m/%d/%y %H")
is a character vector in month/day/year format, which does not sort chronologically. So the first row corresponds to the "date" "01/01/96 00".
In your second case,
cut(data.15min$posix, "1 hour")
generates a factor whose levels are in chronological order, so the groups sort properly. So the first row corresponds to the date: 1995-11-04 13:00:00.
If you had used
strftime(data.15min$posix, format="%Y-%m-%d %H")
in your first case, you would have gotten the same result as using cut(...).
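The effect is easy to reproduce with a small made-up series that crosses a year boundary. This is a sketch, not the original data: the vectors times and o3 are invented for illustration, and format() is used in place of strftime() so that the UTC time zone of the vector is respected.

```r
# 16 quarter-hour timestamps spanning the 1995/1996 year boundary (made-up data)
times <- seq(as.POSIXct("1995-12-31 22:00:00", tz = "UTC"),
             by = "15 min", length.out = 16)
o3 <- as.numeric(seq_along(times))

# Three grouping keys for the same hours
key_mdy <- format(times, "%m/%d/%y %H")  # character, month first
key_iso <- format(times, "%Y-%m-%d %H")  # character, year first
key_cut <- cut(times, "1 hour")          # factor, chronological levels

by_mdy <- aggregate(o3, list(key_mdy), FUN = mean)
by_iso <- aggregate(o3, list(key_iso), FUN = mean)
by_cut <- aggregate(o3, list(key_cut), FUN = mean)

by_mdy  # rows sorted as text: the January 1996 hours come first
by_iso  # rows in chronological order
by_cut  # same order as by_iso
```

The hourly means are identical in all three results; only the row order differs, which is why plotting the month-first version against a chronological time axis looks wrong.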
I have a dataframe called EWMA_SD252 with 3561 obs. of 102 variables (daily volatilities of 100 stocks since 2000). Here is a sample:
Data IBOV ABEV3 AEDU3 ALLL3
3000 2012-02-09 16.88756 15.00696 33.46089 25.04788
3001 2012-02-10 18.72925 14.55346 32.72209 24.93913
3002 2012-02-13 20.87183 15.25370 31.91537 24.28962
3003 2012-02-14 20.60184 14.86653 31.04094 28.18687
3004 2012-02-15 20.07140 14.56653 37.45965 33.47379
3005 2012-02-16 19.99611 16.80995 37.36497 32.46208
3006 2012-02-17 19.39035 17.31730 38.85145 31.50452
What I am trying to do is, using a single command, subset an interval for a particular stock using date references and also plot a chart for the same interval. So far I have been able to do the subsetting part, but I am stuck on plotting the chart. Here is what I have coded so far:
Getting the date interval and the stock name:
datas = function(x,y,z){
intervalo_datas(as.Date(x,"%d/%m/%Y"),as.Date(y,"%d/%m/%Y"),z)
}
Subsetting the Data :
intervalo_datas <- function(x,y,z){
cbind(as.data.frame(EWMA_SD252[,1]),as.data.frame(EWMA_SD252[,z]))[EWMA_SD252$Data >= x & EWMA_SD252$Data <= y,]
}
Now I am stuck. Is it possible, using a function, to get the ABEV3 data.frame and plot a chart with dates on the x axis and volatility on the y axis, using just the command below?
ABEV3 = datas("09/02/2012","17/02/2012","ABEV3")
I think you should use the xts package. It is well suited to:
manipulating time series, especially financial time series
subsetting time series
plotting time series
So I would create an xts object using your data. Then I wrap the subset/plot in a single function like what you tried to do.
library(xts)
dat_ts <- xts(dat[,-1],as.Date(dat$Data))
plot_data <-
function(start,end,stock)
plot(dat_ts[paste(start,end,sep='/'),stock])
You can call it like this :
plot_data('2012-02-09','2012-02-14','IBOV')
You could use ggplot2 and reshape2 to make a function that automatically plots an arbitrary quantity of stocks:
plot_stocks <- function(data, date1, date2, stocks){
require(ggplot2)
require(reshape2)
date1 <- as.Date(date1, "%d/%m/%Y")
date2 <- as.Date(date2, "%d/%m/%Y")
data <- data[data$Data >= date1 & data$Data <= date2,c("Data", stocks)] # inclusive bounds, as in the question
data <- melt(data, id="Data")
names(data) <- c("Data", "Stock", "Value")
ggplot(data, aes(Data, Value, color=Stock)) + geom_line()
}
Plotting one stock "ABEV3":
plot_stocks(EWMA_SD252, "09/02/2012", "17/02/2012", "ABEV3")
Plotting three stocks:
plot_stocks(EWMA_SD252, "09/02/2012", "17/02/2012", c("IBOV", "ABEV3", "AEDU3"))
You can further customize your function by adding other geoms, like geom_smooth etc.
(I'm assuming your EWMA_SD252 data.frame's Data column is already Date class. Convert it if it's not already.)
It looks like you're trying to plot a particular column of your data.frame for a given date interval. It will be much easier for others to read your code (and you too in 6 months!) if you use variable names that are more descriptive than x, y, and z, e.g. date0, date1, column.
Let's rewrite your function. If EWMA_SD252 is already a data.frame, then you don't need to cbind individual columns of it into a data.frame. Giving a data argument makes things more flexible as well. All your datas function does is convert to Dates and call intervalo_datas, so we should wrap that up as well.
intervalo_datas <- function(date0, date1, column_name, data = EWMA_SD252) {
  # convert strings such as "09/02/2012" to Date; leave Date inputs untouched
  if (!inherits(date0, "Date")) date0 <- as.Date(date0, "%d/%m/%Y")
  if (!inherits(date1, "Date")) date1 <- as.Date(date1, "%d/%m/%Y")
  # keep the date column (first column) plus the requested stock
  cols <- c(1, which(names(data) == column_name))
  return(data[data$Data >= date0 & data$Data <= date1, cols])
}
Now you should be able to get a subset this way
ABEV3 = intervalo_datas("09/02/2012", "17/02/2012", "ABEV3")
And plot like this.
plot(ABEV3[, 1], ABEV3[, 2])
If you want the subsetting function to also plot, just add the plot command before the return line (but define the subset first!). Using something like xts as agstudy recommends will simplify things and handle the dates better on the axis labels.
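As a rough sketch of that last suggestion, the function below subsets first, then plots, then returns the subset. The name subset_and_plot and the toy data frame (built from a few of the sample rows in the question) are assumptions made for illustration only.

```r
# Toy data copied from the sample rows in the question (two stock columns only)
EWMA_SD252 <- data.frame(
  Data  = as.Date(c("2012-02-09", "2012-02-10", "2012-02-13", "2012-02-14",
                    "2012-02-15", "2012-02-16", "2012-02-17")),
  IBOV  = c(16.88756, 18.72925, 20.87183, 20.60184, 20.07140, 19.99611, 19.39035),
  ABEV3 = c(15.00696, 14.55346, 15.25370, 14.86653, 14.56653, 16.80995, 17.31730)
)

# Hypothetical helper: subset one column over a date interval, plot it,
# and return the subset invisibly so it can still be assigned
subset_and_plot <- function(date0, date1, column_name, data = EWMA_SD252) {
  if (!inherits(date0, "Date")) date0 <- as.Date(date0, "%d/%m/%Y")
  if (!inherits(date1, "Date")) date1 <- as.Date(date1, "%d/%m/%Y")
  out <- data[data$Data >= date0 & data$Data <= date1, c("Data", column_name)]
  plot(out$Data, out[[column_name]], type = "l",
       xlab = "Date", ylab = column_name)
  invisible(out)
}

ABEV3 <- subset_and_plot("10/02/2012", "15/02/2012", "ABEV3")
```

Returning the subset with invisible() keeps the single-command usage from the question: the call draws the chart and the assignment still captures the data.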
I have data that looks like this:
"date", "sunrise"
2009-01-01, 05:31
2009-01-02, 05:31
2009-01-03, 05:33
2009-01-05, 05:34
....
2009-12-31, 05:29
and I want to plot this in R, with "date" as the x-axis, and "sunrise" as the y-axis.
You need to work a bit harder to get R to draw a suitable plot (i.e. get suitable axes). Say I have data similar to yours (here in a csv file for convenience):
"date","sunrise"
2009-01-01,05:31
2009-01-02,05:31
2009-01-03,05:33
2009-01-05,05:34
2009-01-06,05:35
2009-01-07,05:36
2009-01-08,05:37
2009-01-09,05:38
2009-01-10,05:39
2009-01-11,05:40
2009-01-12,05:40
2009-01-13,05:41
We can read the data in and format it appropriately so R knows the special nature of the data. The read.csv() call includes argument colClasses so R doesn't convert the dates/times into factors.
dat <- read.csv("foo.txt", colClasses = "character")
## Now convert the imported data to appropriate types
dat <- within(dat, {
date <- as.Date(date) ## no need for 'format' argument as data in correct format
sunrise <- as.POSIXct(sunrise, format = "%H:%M")
})
str(dat)
Now comes the slightly tricky bit as R gets the axes wrong (or perhaps better to say they aren't what we want) if you just do
plot(sunrise ~ date, data = dat)
## or
with(dat, plot(date, sunrise))
The first version gets both axes wrong, and the second can dispatch correctly on the dates so gets the x-axis correct, but the y-axis labels are not right.
So, suppress the plotting of the axes, and then add them yourself using axis.FOO functions where FOO is Date or POSIXct:
plot(sunrise ~ date, data = dat, axes = FALSE)
with(dat, axis.POSIXct(x = sunrise, side = 2, format = "%H:%M"))
with(dat, axis.Date(x = date, side = 1))
box() ## complete the plot frame
HTH
I think you can use the as.Date and as.POSIXct functions to convert the two columns to the proper format (the format parameter of as.POSIXct should be set to "%H:%M").
The standard plot function should then be able to deal with times and dates by itself.