Over the past few days I have been struggling a lot trying to handle my data. The problem is that none of the information I find online or in books suits my data.
My original data consists of 100+ columns of time series (independent of each other), each with 48 monthly values, starting in 08/2017 and ending in 07/2021.
The objective is to obtain a value/metric representing the trend/seasonality of each time series, so I can then compare them.
Below are a data sample and two approaches that I tried to follow but that failed.
Data sample (with only 6 columns of data, named 287 to 293):
287 288 289 290 292 293
2017-08-01 0.1613709 0.09907194 0.2542814 0.2179386 0.08020622 0.07926023
2017-09-01 0.1774719 0.10227714 0.2211257 0.1979846 0.09384094 0.10182659
2017-10-01 0.1738235 0.11191972 0.2099357 0.1930938 0.08038543 0.09304474
2017-11-01 0.1999949 0.14005038 0.2282944 0.2140095 0.08814765 0.10820706
2017-12-01 0.2203560 0.16408010 0.1864422 0.1890152 0.08735655 0.11958204
2018-01-01 0.2728642 0.22230381 0.1906515 0.1954573 0.10269819 0.13728082
2018-02-01 0.2771547 0.24142554 0.2287340 0.2431592 0.12353792 0.15428189
2018-03-01 0.2610135 0.24747148 0.2631311 0.2862447 0.18993516 0.17344621
2018-04-01 0.3502901 0.32087711 0.3012136 0.3339466 0.18706540 0.20857209
2018-05-01 0.3669179 0.36063092 0.3789247 0.3781572 0.18566273 0.20633488
2018-06-01 0.2643827 0.27359616 0.3415491 0.3172041 0.19025036 0.18735599
2018-07-01 0.2335092 0.29352583 0.3298348 0.2986179 0.17155325 0.15914827
2018-08-01 0.1994154 0.24043388 0.2868625 0.2659566 0.16226752 0.14772256
2018-09-01 0.1709875 0.20753322 0.2648888 0.2465150 0.15494714 0.14099699
2018-10-01 0.1843677 0.20504727 0.2600666 0.2480716 0.14583226 0.13660546
2018-11-01 0.2662550 0.23209503 0.1921081 0.2067601 0.14891306 0.14775722
2018-12-01 0.3455008 0.25827029 0.1825465 0.2222157 0.15189449 0.15854924
2019-01-01 0.3562984 0.28744854 0.1726661 0.2381863 0.15497530 0.16970100
2019-02-01 0.3596556 0.29504905 0.2190216 0.2532990 0.16528823 0.17614880
2019-03-01 0.3676633 0.30941445 0.2663822 0.3146126 0.19225333 0.19722699
2019-04-01 0.3471219 0.32011859 0.3318789 0.3620176 0.21693162 0.21269362
2019-05-01 0.3391499 0.33623537 0.3498372 0.3514615 0.22655705 0.21467237
2019-06-01 0.2134116 0.23256447 0.3097683 0.2937520 0.20671346 0.18182811
2019-07-01 0.1947303 0.25061919 0.3017159 0.2840877 0.16773642 0.12524420
2019-08-01 0.1676979 0.23042951 0.2933951 0.2741012 0.17294869 0.14598469
2019-09-01 0.1574564 0.20590697 0.2507077 0.2448338 0.16662829 0.14514487
2019-10-01 0.1670441 0.21569649 0.2239352 0.2349953 0.15196066 0.14107334
2019-11-01 0.2314212 0.23944840 0.1962703 0.2248290 0.16566737 0.18157745
2019-12-01 0.2937217 0.26243412 0.2524490 0.2844418 0.17893194 0.22077498
2020-01-01 0.3023854 0.28244002 0.2816947 0.3094329 0.16686343 0.22517501
2020-02-01 0.3511840 0.30870934 0.3109404 0.3344240 0.15479491 0.22957504
2020-03-01 0.3968343 0.33328386 0.3382992 0.3578028 0.14350501 0.23369119
2020-04-01 0.3745884 0.34262505 0.3675449 0.3827939 0.19862225 0.23809122
2020-05-01 0.3530601 0.35166492 0.3709603 0.3476905 0.25196152 0.24234931
2020-06-01 0.2282214 0.20867654 0.3517663 0.3336991 0.24879937 0.22456414
2020-07-01 0.2057477 0.21648387 0.3331914 0.3201591 0.20879761 0.18008671
2020-08-01 0.2000177 0.19419089 0.3040352 0.2979807 0.19359850 0.16924703
2020-09-01 0.1848961 0.19882785 0.2737280 0.2814912 0.17682968 0.15218477
2020-10-01 0.3177567 0.22982973 0.2646506 0.2804482 0.20588015 0.20085790
2020-11-01 0.3710144 0.28390520 0.2552706 0.2793703 0.18294126 0.15860050
2020-12-01 0.3783443 0.27966508 0.2316715 0.2586552 0.17646898 0.17848388
2021-01-01 0.3458173 0.25866979 0.2361880 0.2659490 0.17908497 0.18354894
2021-02-01 0.3604397 0.27641854 0.2407045 0.2732429 0.19147607 0.18462597
2021-03-01 0.3736471 0.29244967 0.2685608 0.2918238 0.20266803 0.18559877
2021-04-01 0.3581235 0.31151629 0.3729554 0.3619925 0.22856252 0.20997657
2021-05-01 0.3513976 0.34056181 0.4269086 0.4071241 0.26643216 0.24394560
2021-06-01 0.2306971 0.29087504 0.3798922 0.2053191 0.25745857 0.23557143
2021-07-01 0.2577626 0.26011944 0.3343924 0.3452438 0.21910554 0.19516812
I have tried to approach the issue with the xts format:
projsxts <- xts(x= projs_2017Jul_t, order.by = projs_2017Jul_time)
plot(projsxts, main="NDVI values for oak projects with ESR (fitted values)", xlab="Time", ylab="NDVI")
[Xts timeseries plot][1]
[1]: https://i.stack.imgur.com/M46YQ.png
And also the normal ts approach, using "mts" as the class for a multiple time series:
projs_2017Jul_ts1 <- ts(projs_2017Jul_t, frequency = 12, start=c(2017,8), end = c(2021,8), class = "mts", names = names2017)
print(projs_2017Jul_ts1)
I can obtain a summary, but when I try to use decompose I get the error "time series has no or less than 2 periods", although it has 48 months.
If I try stl, it says it is only allowed for univariate series.
describe2017 <- summary.matrix(projs_2017Jul_ts1) # gives Min, Median, Mean, Max (...) values per column
projs_2017Jul_ts1 <- decompose(projs_2017Jul_ts1)
*"Error in decompose(projs_2017Jul_ts1) : time series has no or less than 2 periods"*
decompose_ts <- stl(projs_2017Jul_ts1)
*Error in stl(projs_2017Jul_ts1) : only univariate series are allowed*
Any advice/suggestions on how to do this, please? Thank you!
Your basic approach is correct (create a time-series object, then use methods to decompose it). I was able to reproduce your error, which is good.
The stl function only takes a single (univariate) time series, but when you feed a single time series into stl, it gives the same error you got from the decompose function. I think your data are not long enough for the algorithm to decompose: typically you need two full periods of data, and here the period is likely supra-annual, so four years is not long enough for the algorithm to identify the periodicity of the series.
See this post: error in stl, series has less than two periods (erroneous?)
## code I used to get your data into R
x <- readClipboard()
ts.data <- read.table(text = x, header = TRUE)
## code to create a time-series object for 287
ts1 <- xts::xts(ts.data[,"X287"], order.by = as.Date(row.names(ts.data)))
## check the plot
plot(ts1)
[plot of ts for 287]
## use stl - Cleveland et al. (1990) method for decomposing a time series into seasonal, trend and remainder
stl.ts1 <- stl(as.ts(ts1))
Error in stl(as.ts(ts1)) :
  series is not periodic or has less than two periods
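One thing worth checking (an assumption on my part, not something I verified against your full data): the xts to ts conversion can leave the series with frequency 1, in which case stl fails regardless of length. If the monthly frequency is attached explicitly with ts(), the 48 points give four full 12-month periods, which satisfies stl's two-period minimum. A minimal sketch for looping over every column, reusing ts.data from above:
## sketch: attach frequency = 12 explicitly, then decompose column by column
decomp_list <- lapply(ts.data, function(col) {
  col_ts <- ts(col, frequency = 12, start = c(2017, 8))
  stl(col_ts, s.window = "periodic")
})
## e.g. the trend component of series 287
head(decomp_list[["X287"]]$time.series[, "trend"])
From each stl fit you could then extract a single trend or seasonal-strength number per series for the comparisons you mention.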
I am seeing missing time series data corresponding to the GMT clock change for summer; I guess this might also happen for winter. My query has two parts:
1. How to generate the missing timestamps with the code below. The table is in xts format.
2. How to filter the records by time in a way that includes the missing timestamps once generated. This is only a sample dataset. Thanks.
start <- as.POSIXct("2022-03-27 00:58:00")
interval <- 2
end <- as.POSIXct("2022-03-27 03:00:00")
missing_timestamp <- data.frame(TIMESTAMP = seq(from=start, by=interval*60, to=end))
head(missing_timestamp)
TIMESTAMP
1 2022-03-27 00:58:00
2 2022-03-27 02:00:00
3 2022-03-27 02:02:00
4 2022-03-27 02:04:00
5 2022-03-27 02:06:00
6 2022-03-27 02:08:00
Update:
Linked to the second query: when the code below is executed for times between 00:00 and 02:06, all the data is returned rather than the first four records.
a <- missing_timestamp %>% filter(TIMESTAMP > ymd_hms("2022-03-27 00:00:00") & TIMESTAMP < ymd_hms("2022-03-27 02:06:00"))
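If the session timezone is one that enters summer time at 01:00 on 2022-03-27 (e.g. Europe/London; the question does not say, so this is an assumption), the hour 01:00 to 01:59 simply does not exist locally, which would explain both the gap in the sequence and the filter returning everything: ymd_hms() defaults to UTC, so the comparison happens on a shifted scale. A sketch of one way around both issues, generating and filtering in UTC:
library(dplyr)
library(lubridate)
## generate the sequence in UTC, where no hour is skipped
start <- as.POSIXct("2022-03-27 00:58:00", tz = "UTC")
end <- as.POSIXct("2022-03-27 03:00:00", tz = "UTC")
missing_timestamp <- data.frame(TIMESTAMP = seq(from = start, by = 2 * 60, to = end))
## give ymd_hms() the same timezone as TIMESTAMP so the filter behaves as expected
a <- missing_timestamp %>%
  filter(TIMESTAMP > ymd_hms("2022-03-27 00:00:00", tz = "UTC") &
         TIMESTAMP < ymd_hms("2022-03-27 02:06:00", tz = "UTC"))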
I've been trying to calculate intervals for individuals and have run into a weird error. Specifically, in this code:
library(lubridate)
library(tidyverse)
library(plyr)
df<-tibble(dates=mdy(c("2/20/20","2/25/20","3/1/20","3/11/20","3/20/20")),recips=c("x","x","a","a","a"),treatment=c("T","P","T","P","P"),eventtype=c("a","real","y","z","real"))
df%>%mutate(window=interval(start=dates,end=dates+weeks(2)))
ddply(df,.(recips),mutate,window=interval(start=dates,end=dates+weeks(2)))
the last line throws an error that the second-to-last line doesn't. Any tips?
The issue is the class of the output of interval ("Interval"), which ddply cannot handle. An option is to convert it to character with as.character:
plyr::ddply(df, c("recips"), plyr::mutate,
window = as.character(interval(start = dates, end = dates + weeks(2))))
Output:
# dates recips treatment eventtype window
#1 2020-03-01 a T y 2020-03-01 UTC--2020-03-15 UTC
#2 2020-03-11 a P z 2020-03-11 UTC--2020-03-25 UTC
#3 2020-03-20 a P real 2020-03-20 UTC--2020-04-03 UTC
#4 2020-02-20 x T a 2020-02-20 UTC--2020-03-05 UTC
#5 2020-02-25 x P real 2020-02-25 UTC--2020-03-10 UTC
Based on the data shown, we are creating the interval on each element of 'dates', so the group_by operation is not needed:
library(dplyr)
df %>%
mutate(window = interval(start=dates,end=dates+weeks(2)))
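If the windows are later used to test whether other dates fall inside them (a guess at the use case, not something the question states), keeping the Interval class in the dplyr pipeline lets lubridate's %within% do the containment check directly:
library(dplyr)
library(lubridate)
df2 <- df %>% mutate(window = interval(start = dates, end = dates + weeks(2)))
## which rows' dates fall inside the first window?
df2$dates %within% df2$window[1]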
I have a huge dataset that in .csv format has 2 columns (one Date_Time and the other Q.vanda).
This is what the head and tail of the data look like:
> head(mdf.vanda)
Date_Time Q.vanda
1 1969-12-05 21:00:00 0
2 1969-12-05 21:01:00 4
3 1969-12-05 21:05:00 11
4 1969-12-05 21:20:00 17
5 1969-12-05 22:45:00 27
6 1969-12-05 22:55:00 23
> tail(mdf.vanda)
Date_Time Q.vanda
165738 2016-01-19 10:15:00 2995.25
165739 2016-01-19 10:30:00 2858.04
165740 2016-01-19 10:45:00 2956.94
165741 2016-01-19 11:00:00 2972.52
165742 2016-01-19 11:15:00 2776.99
165743 2016-01-19 11:30:00 3082.53
There are 48 years of data in between, and I want to create a for loop to subset them by year (e.g. from 1969/10/01 to 1970/10/01, from 1970/10/01 to 1971/10/01, etc.).
I wrote some code, but it is giving me an error that I am not able to resolve. I am pretty new to R, so feel free to suggest other code that you think is more efficient for my purpose.
code:
cut <- as.POSIXct(strptime(as.character(c('1969/10/01','1970/10/01','1971/10/01','1972/10/01','1973/10/01','1974/10/01','1975/10/01','1976/10/01','1977/10/01','1978/10/01','1979/10/01','1980/10/01','1981/10/01','1982/10/01','1983/10/01','1984/10/01','1985/10/01','1986/10/01','1987/10/01','1988/10/01','1989/10/01','1990/10/01','1991/10/01','1992/10/01','1993/10/01','1994/10/01','1995/10/01','1996/10/01','1997/10/01','1998/10/01',
'1999/10/01','2000/10/01','2001/10/01','2002/10/01','2003/10/01','2004/10/01',
'2005/10/01','2006/10/01','2007/10/01','2008/10/01','2009/10/01','2010/10/01',
'2011/10/01','2012/10/01','2013/10/01','2014/10/01','2015/10/01','2016/10/01')),format = "%Y/%m/%d"))
df.sub <- as.data.frame(matrix(data=NA,nrow=14496, ncol=96)) #nrow = (31+30+31+31+28)*(4*24)[days * readings/day] , ncol = (48*2)[Seasons*cols]
i.odd <- seq(1,49, by=2)
for (i in 1:48) {df.sub[1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= cut[i] & mdf.vanda$Date_Time < cut[i+1]])
,i.odd[i]:(i.odd[i]+1)] <- subset(mdf.vanda,mdf.vanda$Date_Time > cut[i] & mdf.vanda$Date_Time < cut[i+1])}
Error:
Error in `[<-.data.frame`(`*tmp*`, 1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= :
  replacement element 1 has 1595 rows, need 1596
You can split your data as shown below:
split(mdf.vanda, findInterval(as.Date(mdf.vanda$Date_Time), seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
There is no need for a loop here. Base R has the cut function to perform this very operation, significantly faster than a loop, and you already have the break points defined in your "cut" variable:
#cut <- as.POSIXct(c('1969/10/01', ... ,'2016/10/01'),format = "%Y/%m/%d")
mytime<-cut(mdf.vanda$Date_Time, breaks = cut, include.lowest = TRUE)
The variable "mytime" is a vector the length of your data frame with a label to bin the data.
You could then use the split function to break your dataframe in a list of data frames or use the group_by function from the dplyr library for additional data processing.
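For example (a sketch reusing mytime from above; the Q.vanda summary is just an illustration):
by_year <- split(mdf.vanda, mytime)
## e.g. mean discharge per (October-to-October) year
sapply(by_year, function(d) mean(d$Q.vanda, na.rm = TRUE))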
I suggest you have a look at the convenient quantmod package. Once you have time series data, you can use the apply.yearly function (from xts, on which quantmod builds) and apply any function to every year of data.
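A sketch of what that could look like, assuming Q.vanda is first converted to an xts object (note that apply.yearly() uses calendar-year endpoints, not the October-to-October years in the question):
library(xts) # quantmod builds on xts, which provides apply.yearly()
q_xts <- xts(mdf.vanda$Q.vanda, order.by = as.POSIXct(mdf.vanda$Date_Time))
apply.yearly(q_xts, mean)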
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM), but only for days that have at least one data point in every hour of that window (11 AM, 12, 1, 2, 3, 4 PM). I ultimately want to sum the values of mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data at any of the times between 11 AM and 4 PM, even when a given day only has data for, e.g., 12 and 1 PM.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to @Henry Navarro's answer, for solving an additional problem mentioned in the question.
If I understand properly, another concern of the question is to find the dates such that there are data points for at least each hour of the given interval within the day. A possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(
  days = unique(your_data$days),
  all_hours_present = sapply(your_data_by_days_list, function(Z)
    sum(unique(Z$hour_only) %in% hours_intervals) >= length(hours_intervals)),
  row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now a column "all_hours_present" indicating whether the data for the corresponding day contains at least one value for each hour in the given hours_intervals. You may use this column to subset your data:
subset(your_data, all_hours_present)
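And for the final step mentioned in the question, summing mat.3 per day over the chosen window (column names assumed from the printed data):
complete_days <- subset(your_data, all_hours_present & hour_only %in% hours_intervals)
aggregate(mat.3 ~ days, data = complete_days, FUN = sum)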
Try creating a new variable in your data frame with only the hour:
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable, try the following:
# auxiliary variable flagging your interval of time
your_data$aux_var <- ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var == 1:
your_data[which(your_data$aux_var ==1),]
I'm plotting a time series graph using ggplot; however, whenever the size of the data frame is greater than around 600 rows, ggplot throws the following error:
Error in anyDuplicated.default(breaks) : length 1136073601 is too large for hashing
In fact, it just gave me the same error when I tried to plot 400 items.
The data is melted like so, except there are four variables (speed, dir, temp and pressure):
time variable value
1 2006-07-01 00:00:00 speed 4.180111
2 2006-07-02 00:00:00 speed 5.527226
3 2006-07-09 00:00:00 speed 6.650821
4 2006-07-16 00:00:00 speed 4.380063
5 2006-07-23 00:00:00 speed 5.641709
6 2006-07-30 00:00:00 speed 7.636913
7 2006-08-06 00:00:00 speed 7.128334
8 2006-08-13 00:00:00 speed 4.719046
...
201 2006-07-01 00:00:00 temp 17.140069
202 2006-07-02 00:00:00 temp 17.517480
203 2006-07-09 00:00:00 temp 14.211002
204 2006-07-16 00:00:00 temp 20.121617
205 2006-07-23 00:00:00 temp 17.933492
206 2006-07-30 00:00:00 temp 15.244583
My code to plot these is based on what I found here: http://had.co.nz/ggplot2/scale_date.html
qplot(time, value, data = test3, geom = "line", group = variable) +
  facet_grid(variable ~ ., scale = "free_y")
Any pointers and I'd be very grateful!
To massage the date from character to date, I'm using:
test$time <- strptime(test$time, format="%Y-%m-%d %H:%M:%S")
test$time <- as.POSIXct(test$time, format="%H:%M:%S")
test3 = melt(test,id="time")
class(test$time) returns "[1] "POSIXt" "POSIXct""
Try setting a timezone explicitly in the call to as.POSIXct(), as in https://gist.github.com/925852
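A sketch of what that might look like with the question's own variables (UTC is just an assumption; use whatever timezone the data was recorded in):
test$time <- as.POSIXct(test$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
test3 <- melt(test, id = "time")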