I'm plotting a time series graph using ggplot; however, whenever the size of the data frame is greater than around 600 rows, ggplot throws the following error:
Error in anyDuplicated.default(breaks) : length 1136073601 is too
large for hashing
In fact, it just gave me the same error when I tried to plot 400 items.
The data is melted like so (there are four variables: speed, dir, temp and pressure):
time variable value
1 2006-07-01 00:00:00 speed 4.180111
2 2006-07-02 00:00:00 speed 5.527226
3 2006-07-09 00:00:00 speed 6.650821
4 2006-07-16 00:00:00 speed 4.380063
5 2006-07-23 00:00:00 speed 5.641709
6 2006-07-30 00:00:00 speed 7.636913
7 2006-08-06 00:00:00 speed 7.128334
8 2006-08-13 00:00:00 speed 4.719046
...
201 2006-07-01 00:00:00 temp 17.140069
202 2006-07-02 00:00:00 temp 17.517480
203 2006-07-09 00:00:00 temp 14.211002
204 2006-07-16 00:00:00 temp 20.121617
205 2006-07-23 00:00:00 temp 17.933492
206 2006-07-30 00:00:00 temp 15.244583
My code to plot these is based on what I found here: http://had.co.nz/ggplot2/scale_date.html
qplot(time, value, data = test3, geom = "line", group = variable) +
  facet_grid(variable ~ ., scale = "free_y")
I'd be very grateful for any pointers!
To convert the time column from character to a date-time I'm using:
test$time <- strptime(test$time, format="%Y-%m-%d %H:%M:%S")
test$time <- as.POSIXct(test$time, format="%H:%M:%S")
test3 = melt(test,id="time")
class(test$time) returns [1] "POSIXt" "POSIXct"
Try setting a timezone explicitly in the call to as.POSIXct(), as in https://gist.github.com/925852
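For example, a minimal sketch of that suggestion based on the question's own conversion code (untested; "UTC" is only a placeholder, use whatever timezone the data were actually recorded in):
# explicit timezone in the character -> POSIXct conversion
test$time <- as.POSIXct(test$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
test3 <- melt(test, id = "time")   # as in the question
# class(test$time) should still be c("POSIXct", "POSIXt"), now with an explicit tzone attribute
attr(test$time, "tzone")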
Over the past few days I have been struggling to handle my data. The problem is that none of the information I find online or in books suits my data.
My original data consists of more than 100 columns of time series (independent of each other), each with 48 months of values, starting in 08/2017 and finishing in 07/2021.
The objective is to obtain a value/metric representing the trend/seasonality of each time series, so I can then make comparisons between them.
Below is a data sample along with two approaches that I tried but that failed.
Data sample (with only 6 columns of data, named in order from 287 to 293):
287 288 289 290 292 293
2017-08-01 0.1613709 0.09907194 0.2542814 0.2179386 0.08020622 0.07926023
2017-09-01 0.1774719 0.10227714 0.2211257 0.1979846 0.09384094 0.10182659
2017-10-01 0.1738235 0.11191972 0.2099357 0.1930938 0.08038543 0.09304474
2017-11-01 0.1999949 0.14005038 0.2282944 0.2140095 0.08814765 0.10820706
2017-12-01 0.2203560 0.16408010 0.1864422 0.1890152 0.08735655 0.11958204
2018-01-01 0.2728642 0.22230381 0.1906515 0.1954573 0.10269819 0.13728082
2018-02-01 0.2771547 0.24142554 0.2287340 0.2431592 0.12353792 0.15428189
2018-03-01 0.2610135 0.24747148 0.2631311 0.2862447 0.18993516 0.17344621
2018-04-01 0.3502901 0.32087711 0.3012136 0.3339466 0.18706540 0.20857209
2018-05-01 0.3669179 0.36063092 0.3789247 0.3781572 0.18566273 0.20633488
2018-06-01 0.2643827 0.27359616 0.3415491 0.3172041 0.19025036 0.18735599
2018-07-01 0.2335092 0.29352583 0.3298348 0.2986179 0.17155325 0.15914827
2018-08-01 0.1994154 0.24043388 0.2868625 0.2659566 0.16226752 0.14772256
2018-09-01 0.1709875 0.20753322 0.2648888 0.2465150 0.15494714 0.14099699
2018-10-01 0.1843677 0.20504727 0.2600666 0.2480716 0.14583226 0.13660546
2018-11-01 0.2662550 0.23209503 0.1921081 0.2067601 0.14891306 0.14775722
2018-12-01 0.3455008 0.25827029 0.1825465 0.2222157 0.15189449 0.15854924
2019-01-01 0.3562984 0.28744854 0.1726661 0.2381863 0.15497530 0.16970100
2019-02-01 0.3596556 0.29504905 0.2190216 0.2532990 0.16528823 0.17614880
2019-03-01 0.3676633 0.30941445 0.2663822 0.3146126 0.19225333 0.19722699
2019-04-01 0.3471219 0.32011859 0.3318789 0.3620176 0.21693162 0.21269362
2019-05-01 0.3391499 0.33623537 0.3498372 0.3514615 0.22655705 0.21467237
2019-06-01 0.2134116 0.23256447 0.3097683 0.2937520 0.20671346 0.18182811
2019-07-01 0.1947303 0.25061919 0.3017159 0.2840877 0.16773642 0.12524420
2019-08-01 0.1676979 0.23042951 0.2933951 0.2741012 0.17294869 0.14598469
2019-09-01 0.1574564 0.20590697 0.2507077 0.2448338 0.16662829 0.14514487
2019-10-01 0.1670441 0.21569649 0.2239352 0.2349953 0.15196066 0.14107334
2019-11-01 0.2314212 0.23944840 0.1962703 0.2248290 0.16566737 0.18157745
2019-12-01 0.2937217 0.26243412 0.2524490 0.2844418 0.17893194 0.22077498
2020-01-01 0.3023854 0.28244002 0.2816947 0.3094329 0.16686343 0.22517501
2020-02-01 0.3511840 0.30870934 0.3109404 0.3344240 0.15479491 0.22957504
2020-03-01 0.3968343 0.33328386 0.3382992 0.3578028 0.14350501 0.23369119
2020-04-01 0.3745884 0.34262505 0.3675449 0.3827939 0.19862225 0.23809122
2020-05-01 0.3530601 0.35166492 0.3709603 0.3476905 0.25196152 0.24234931
2020-06-01 0.2282214 0.20867654 0.3517663 0.3336991 0.24879937 0.22456414
2020-07-01 0.2057477 0.21648387 0.3331914 0.3201591 0.20879761 0.18008671
2020-08-01 0.2000177 0.19419089 0.3040352 0.2979807 0.19359850 0.16924703
2020-09-01 0.1848961 0.19882785 0.2737280 0.2814912 0.17682968 0.15218477
2020-10-01 0.3177567 0.22982973 0.2646506 0.2804482 0.20588015 0.20085790
2020-11-01 0.3710144 0.28390520 0.2552706 0.2793703 0.18294126 0.15860050
2020-12-01 0.3783443 0.27966508 0.2316715 0.2586552 0.17646898 0.17848388
2021-01-01 0.3458173 0.25866979 0.2361880 0.2659490 0.17908497 0.18354894
2021-02-01 0.3604397 0.27641854 0.2407045 0.2732429 0.19147607 0.18462597
2021-03-01 0.3736471 0.29244967 0.2685608 0.2918238 0.20266803 0.18559877
2021-04-01 0.3581235 0.31151629 0.3729554 0.3619925 0.22856252 0.20997657
2021-05-01 0.3513976 0.34056181 0.4269086 0.4071241 0.26643216 0.24394560
2021-06-01 0.2306971 0.29087504 0.3798922 0.2053191 0.25745857 0.23557143
2021-07-01 0.2577626 0.26011944 0.3343924 0.3452438 0.21910554 0.19516812
I have tried to approach the issue with an xts object:
projsxts <- xts(x= projs_2017Jul_t, order.by = projs_2017Jul_time)
plot(projsxts, main="NDVI values for oak projects with ESR (fitted values)", xlab="Time", ylab="NDVI")
[xts time series plot](https://i.stack.imgur.com/M46YQ.png)
And also the normal ts approach, using "mts" as the class for a multiple time series:
projs_2017Jul_ts1 <- ts(projs_2017Jul_t, frequency = 12, start=c(2017,8), end = c(2021,8), class = "mts", names = names2017)
print(projs_2017Jul_ts1)
I can obtain a summary, but when I try to use decompose I get the error "time series has no or less than 2 periods", although the series has 48 months. If I try stl, it says it is only allowed for univariate series.
describe2017 <- summary.matrix(projs_2017Jul_ts1)  # gives Min, Median, Mean, Max (...) values per column

projs_2017Jul_ts1 <- decompose(projs_2017Jul_ts1)
## Error in decompose(projs_2017Jul_ts1) : time series has no or less than 2 periods

decompose_ts <- stl(projs_2017Jul_ts1)
## Error in stl(projs_2017Jul_ts1) : only univariate series are allowed
Any advice/suggestion on how to do this, please? Thank you !
Your basic approach is correct (create a time-series object, then use methods to decompose the time series). I was able to reproduce your error, which is good.
The stl function only takes a single (univariate) time series, and when you feed a single time series into stl, it gives the same error as you got using the decompose function. I think your data are not long enough for the algorithm to decompose. Typically you need two full periods of data; in this case the period is likely supra-annual, and roughly four years of monthly data is not long enough for the algorithm to identify the periodicity of the series.
See this post: error in stl, series has less than two periods (erroneous?)
## code I used to get your data into R
x <- readClipboard()
ts.data <- read.table(text = x, header = TRUE)

## code to create a time-series object for column 287
ts1 <- xts::xts(ts.data[, "X287"], order.by = as.Date(row.names(ts.data)))

## check the plot
plot(ts1)
[plot of ts for 287]

## use stl - Cleveland et al. 1990 method for decomposing a time series
## into seasonal, trend and remainder components
stl.ts1 <- stl(as.ts(ts1))
## Error in stl(as.ts(ts1)) :
##   series is not periodic or has less than two periods
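As a hedged follow-up: if (and only if) an annual cycle is assumed for these data, building the univariate series as a plain monthly ts with frequency = 12 gives stl enough declared periods to run. This is only a sketch of that idea, not a claim that an annual period is the right choice here; ts.data and the X287 column are the objects read in above.
## sketch: assume an annual cycle (frequency = 12) for a single column
ts287 <- ts(ts.data[, "X287"], start = c(2017, 8), frequency = 12)

## stl needs a univariate ts with frequency > 1 and more than two periods;
## 48 monthly values give 4 complete annual periods under this assumption
stl287 <- stl(ts287, s.window = "periodic")
plot(stl287)

## seasonal, trend and remainder components, one row per month
head(stl287$time.series)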
I am currently working on a project involving data of delivery timings. The data can be both negative (indicating that the delivery was not late but actually ahead of the estimate) or positive (indicating that it was indeed late).
I would like to obtain the five-number summary and interquartile range using the fivenum() function on the data. However, because all of the values end up positive after conversion, my statistics are not accurate. The following is an example of the data I am working with:
Delivery.Late Reaction.Time Time.Until.Send.To.Vendor
1 00:01:29 00:00:00 00:05:08
2 00:12:19 00:00:00 00:04:52
3 00:02:55 00:00:00 00:05:42
4 00:06:14 00:00:00 00:14:34
5 -00:06:05 00:00:00 00:01:42
6 00:09:58 00:00:00 00:02:56
From this, I am interested in the Delivery.Late variable and would like to perform exploratory / diagnostic statistics on it.
I have used the chron package to convert the column data into chronological objects but chron(object) always takes the absolute value of the time and turns it into a positive value. Here is a sample of my code:
library(chron)
feb_01_07 <- read.csv("~/filepath/data.csv")
#converting factor to time
feb_01_07[,19] <- chron(times=feb_01_07$Delivery.Late)
#Five number summary and interquartile range for $Delivery.Late column
fivenum(feb_01_07$Delivery.Late, na.rm=TRUE)
After running fivenum() I get the results:
[1] 00:01:29 00:02:55 00:06:09 00:09:58 00:12:19
This is inaccurate because the lowest number (the first term) should in fact be -00:06:05 and not 00:01:29; -00:06:05 was converted to a positive chron object and became the median instead.
How can I convert the values to time objects while maintaining the negative signs? Thanks so much for any insight!
You can do something like this:
library(chron)
delivery_late <- c("00:01:29", "00:12:19", "-00:06:05")

# remember which entries were negative before chron() drops the sign
not_late_idx <- grep(pattern = "^-", x = delivery_late)

times <- chron(times = delivery_late)
# restore the sign on those entries
times[not_late_idx] <- -1 * times[not_late_idx]
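With the sign restored, the five-number summary from the question should treat the early delivery as the minimum; an untested continuation of the snippet above:
# fivenum() works on the underlying numeric (fraction-of-a-day) values,
# so the negative entry now comes out as the minimum rather than the median
fivenum(as.numeric(times))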
1) chron times can represent negative times but will render them as negative numbers. We can convert the column like this:
library(chron)
# convert string in form [-]HH:MM:SS to times object
neg_times <- function(x) ifelse(grepl("-", x), - times(sub("-", "", x)), times(x))
DF <- read.table("data.dat")
test <- transform(DF, Delivery.Late = neg_times(Delivery.Late))
giving:
> test
Delivery.Late Reaction.Time Time.Until.Send.To.Vendor
1 0.001030093 00:00:00 00:05:08
2 0.008553241 00:00:00 00:04:52
3 0.002025463 00:00:00 00:05:42
4 0.004328704 00:00:00 00:14:34
5 -0.004224537 00:00:00 00:01:42
6 0.006921296 00:00:00 00:02:56
and we could also define a formatting routine:
# format a possibly negative times object
format_neg_times <- function(x) {
paste0(ifelse(x < 0, "-", ""), format(times(abs(x))))
}
format_neg_times(test[[1]])
## [1] "00:01:29" "00:12:19" "00:02:55" "00:06:14" "-00:06:05" "00:09:58"
2) The example in the question only has times that are before noon. If it is always the case that the times are between -12:00:00 and 12:00:00 then we could represent negative times as x + 1 like this:
library(chron)
wrap_neg_times <- function(x) times(neg_times(x) %% 1)
DF <- read.table("data.dat")
test2 <- transform(DF, Delivery.Late = wrap_neg_times(Delivery.Late))
giving:
> test2
Delivery.Late Reaction.Time Time.Until.Send.To.Vendor
1 00:01:29 00:00:00 00:05:08
2 00:12:19 00:00:00 00:04:52
3 00:02:55 00:00:00 00:05:42
4 00:06:14 00:00:00 00:14:34
5 23:53:55 00:00:00 00:01:42
6 00:09:58 00:00:00 00:02:56
format_wrap_neg_times <- function(x) {
format_neg_times(ifelse(x > 0.5, x - 1, x))
}
format_wrap_neg_times(test2[[1]])
## [1] "00:01:29" "00:12:19" "00:02:55" "00:06:14" "-00:06:05" "00:09:58"
Note
The input in reproducible form:
Lines <- "
Delivery.Late Reaction.Time Time.Until.Send.To.Vendor
1 00:01:29 00:00:00 00:05:08
2 00:12:19 00:00:00 00:04:52
3 00:02:55 00:00:00 00:05:42
4 00:06:14 00:00:00 00:14:34
5 -00:06:05 00:00:00 00:01:42
6 00:09:58 00:00:00 00:02:56"
cat(Lines, file = "data.dat")
I am new to R and I am having some issues with the padr package described here.
I have an hourly data set that is missing hours, and I would like to insert rows with a placeholder value for the missing data. I am trying to use the pad function and the fill_by_value function from the padr package, but I am getting an error when I use the pad function.
The data called Mendo is presented as:
Date.Local Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
I want the final data to look like:
Date.Local Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:03:00 999
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
I am under the impression the padr package wants a datetime POSIXct column, so I use the command
Mendo$Time.Local <- as.POSIXct(paste(Mendo$Date.Local, Mendo$Time.Local), format = '%Y-%m-%d %H:%M')
to get:
Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
Now I try to use the pad function as instructed in the link provided above. My line of code is:
Mendo_padded <- Mendo %>% pad
and I get the error:
Error in if (total_invalid == nrow(x)) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In if (unique(nchar(x_char)) == 10) { :
the condition has length > 1 and only the first element will be used
If this were to work, I would then use the command
Mendo_padded %>% fill_by_value(Sample.Measurement, value = 999)
to get all the missing hours Sample.Measurement value to be 999.
I would love feedback, suggestions or comments on what I may be doing wrong and how I can go about getting this code to work! Thank you!
It seems that pad can automatically detect which column is of Date / POSIXct / POSIXlt type, so you do not need to supply Mendo$Time.Local to pad. The padding will then be applied at hourly intervals.
library(magrittr)
library(padr)

PM10 <- read.csv(file = "../Downloads/hourly_81102_2016.csv",
                 stringsAsFactors = FALSE)  # don't change the columns to factors
Mendo <- PM10[PM10$County.Name == "Mendocino", ]

# combine the date and time columns into a single POSIXct column
Mendo$Time.Local <- as.POSIXct(paste(Mendo$Date.Local, Mendo$Time.Local),
                               format = '%Y-%m-%d %H:%M')

# keep only the datetime column and the measurement
Mendo <- Mendo[, c("Time.Local", "Sample.Measurement")]

# drop rows whose datetime is NA, then pad the missing hours
Mendo_padded <- Mendo %>% na.omit %>%
  pad(interval = 'hour',
      start_val = NULL, end_val = NULL, group = NULL,
      break_above = 1)
You may also consider using the columns Time.GMT and Date.GMT, because local dates and times depend on where you (your computer) are.
Edit: As suggested by OP, na.omit should be used before pad to avoid NA values in the Date column.
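For completeness, here is a small sketch of the pad + fill_by_value step on a hand-made data frame shaped like the sample in the question (hourly timestamps are assumed, matching the interval = 'hour' call above; Mendo_toy is just an illustrative name):
library(padr)
library(magrittr)

# toy data with a missing 03:00 hour (illustrative only)
Mendo_toy <- data.frame(
  Time.Local = as.POSIXct(c("2016-01-01 00:00:00", "2016-01-01 01:00:00",
                            "2016-01-01 02:00:00", "2016-01-01 04:00:00",
                            "2016-01-01 05:00:00")),
  Sample.Measurement = c(3, 4, 1, 4, 5)
)

Mendo_toy %>%
  pad(interval = "hour") %>%                        # inserts the missing 03:00 row with NA
  fill_by_value(Sample.Measurement, value = 999)    # replaces that NA with 999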
I have a data file containing the raw values of my measurements. The top entries look like this:
495.08
1117.728
872.712
665.632
713.296
1172.44
1302.544
1428.832
1413.536
1361.896
1126.656
644.776
1251.616
1252.824
[...]
The measurements are taken at 15-minute intervals. In order to put this into an R time series object I use the following code:
data <- scan("./data__2.dat",skip=1)
datats <- ts(data, frequency=24*60/15, start=c(2014,1))
But this leaves me with a plot whose time axis spans far more than one year, although the data are only from one year. Hence, it seems like the frequency is wrong. Any thoughts on how to fix this?
By doing:
library(xts)
library(lubridate)
df <- data.frame(interval = seq(ymd_hms('2014-01-21 00:00:00'),
by = '15 min',length.out=(60*24*365/15)),
data = rnorm(60*24*365/15))
ts <- xts(df, order.by=df$interval)
You get:
interval data
1 2014-01-21 00:00:00 -1.3975823
2 2014-01-21 00:15:00 -0.4710713
3 2014-01-21 00:30:00 0.9149273
4 2014-01-21 00:45:00 -0.3053136
5 2014-01-21 01:00:00 -1.2459707
6 2014-01-21 01:15:00 0.4749215
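Applying the same idea to the scanned vector from the original question might look like this (a sketch; the "2014-01-01 00:00:00" start time is an assumption, so use the real timestamp of the first measurement):
library(xts)

data   <- scan("./data__2.dat", skip = 1)
idx    <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
              by = "15 min", length.out = length(data))
datats <- xts(data, order.by = idx)

plot(datats)   # the time axis now follows the actual 15-minute timestamps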
We have a CSV file with dates in Excel serial format and NAV values for Manager A and Manager B as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure [Dates, Manager A Nav, Manager B Nav].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck on the fact that the dates are in Excel format and don't know how to convert them. Is the problem set up in the correct way?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since that origin. Usually this is Jan 0 1900 (go figure!), but be careful, as I don't think this is always the case. You can try this:
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct, but you should also pay attention to the timezone (UTC by default):
data$Date <- as.POSIXct(data$Date, origin = "1899/12/30" )
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
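To get from there to the zoo object described in the question ([Dates, Manager A NAV, Manager B NAV]), one possible sketch, assuming both date columns have been converted with as.Date as above, is:
library(zoo)

# drop the empty trailing rows in the Manager A columns before indexing
a <- na.omit(data[, c("Date", "Manager.A")])
b <- na.omit(data[, c("Date.1", "Manager.B")])

managerA <- zoo(a$Manager.A, order.by = a$Date)
managerB <- zoo(b$Manager.B, order.by = b$Date.1)

# outer join on the union of the two date indexes;
# NA where one manager has no NAV on a given date
nav <- merge(A = managerA, B = managerB)
head(nav)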