I have the dataset below, which is a data frame, and I would like to convert it into a time series so that I can do ARIMA forecasting.
I have searched various topics on SO but could not find anything similar at YEARMONTH grain. Everyone talks about a date field, but here I don't have a date.
I am using the code below, but it gives an error:
dataset <- data.frame(year =c(2017), YearMonth = c(201701,201702,201703,201704), sales = c(100,200,300,400))
library(zoo)
newdataset <- as.ts(read.zoo(dataset, FUN = as.yearmon))
# Error:
#
# In zoo(coredata(x), tt) :
# some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I know it gives the error because the year column is the 1st column and does not have unique values, but I am not sure how to fix it.
Any help would be really appreciated.
Regards,
Akash
One option is to convert YearMonth to the first day of each month and then generate the ts from that.
library(zoo)
dataset$YearMonth = as.Date(as.yearmon(as.character(dataset$YearMonth),"%Y%m"), frac = 0)
dataset
# year YearMonth sales
# 1 2017 2017-01-01 100
# 2 2017 2017-02-01 200
# 3 2017 2017-03-01 300
# 4 2017 2017-04-01 400
If you only need a ts object, another option is:
dataset$YearMonth = as.yearmon(as.character(dataset$YearMonth),"%Y%m")
as.ts(dataset[-1])
# Time Series:
# Start = 1
# End = 4
# Frequency = 1
# YearMonth sales
# 1 2017.000 100
# 2 2017.083 200
# 3 2017.167 300
# 4 2017.250 400
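If the goal is a monthly series for ARIMA, a frequency of 12 is usually what you want, rather than the frequency of 1 shown above. Here is a minimal base-R sketch, assuming the rows are sorted and the months are consecutive with no gaps; the names first_ym and sales_ts are only for illustration:

```r
dataset <- data.frame(year = 2017,
                      YearMonth = c(201701, 201702, 201703, 201704),
                      sales = c(100, 200, 300, 400))

# Split the first YYYYMM value into its year and month parts
first_ym <- dataset$YearMonth[1]
start_year <- first_ym %/% 100   # integer division keeps the year
start_month <- first_ym %% 100   # remainder keeps the month

# Monthly series (assumption: sorted rows, no missing months)
sales_ts <- ts(dataset$sales, start = c(start_year, start_month), frequency = 12)
sales_ts
```

An object like this can then be passed to a fitting function such as arima().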
Related
Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07/03/2003 to 31/12/2016.
I need to split/filter the data by multiple time periods, as per below.
Dates required in the new data frame:
07/03/2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012.
Let's take the following data.frame with dates:
library(lubridate)
library(dplyr)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd, dplyr::filter and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt<-data.table(date_sample=sample(seq(date1, date4, by="day"), N),numeric_sample=sample(N,replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the data before and after applying it.
Before:
> plot(my_dt)
my_dt<-my_dt[!(date_sample %within% forbidden_dates)]#applying the temporal cut
After:
> plot(my_dt)
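For completeness, the two wanted ranges from the question can also be kept directly with plain comparisons, without any extra packages. This is only a sketch on a small made-up data frame; the column name date and the sample values are assumptions for illustration:

```r
# Toy data: dates at the edges of the wanted and unwanted periods
df <- data.frame(date = as.Date(c("2003-03-07", "2005-03-06", "2005-03-07",
                                  "2012-12-31", "2013-01-01", "2016-12-31")))

# TRUE for rows inside the first wanted range
in_first <- df$date >= as.Date("2003-03-07") & df$date <= as.Date("2005-03-06")
# TRUE for rows inside the second wanted range
in_second <- df$date >= as.Date("2013-01-01") & df$date <= as.Date("2016-12-31")

# Keep rows that fall in either range; drop = FALSE keeps a data frame
kept <- df[in_first | in_second, , drop = FALSE]
kept
```

The same logical condition can be used inside dplyr::filter or a data.table subset if you prefer those.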
I have the following data set. I am trying to split the date_1 field into month and days. Then converting the month number to a month name.
date_1,no_of_births_1
1/1,1482
2/2,1213
3/23,1220
4/4,1319
5/11,1262
6/18,1271
I am using month.abb[] to convert the month number to a name. But instead of returning the matching month name for each month number, the result is wrong.
For example, month.abb[2] is generating Apr instead of Feb.
date_1 no_of_births_1 V1 V2 month
1 1/1 1482 1 1 Jan
2 2/2 1213 2 2 Apr
3 3/23 1220 3 23 May
4 4/4 1319 4 4 Jun
5 5/11 1262 5 11 Jul
6 6/18 1271 6 18 Aug
Below is the code I am using:
birthday<-read.csv("Birthday_s.csv",header = TRUE)
birthday$date_1<-as.character(birthday$date_1)
#split the data
listx<-sapply(birthday$date_1,function(x) strsplit(x,"/"))
library(base)
#convert to data frame
mat<-as.data.frame(matrix(unlist(listx),ncol = 2, byrow = TRUE))
#combine birthday and mat
birthday2<-cbind(birthday,mat)
#convert month number to month name
birthday2$month<-sapply(birthday2$V1, function(x) month.abb[as.numeric(x)])
When I run your code, I get the correct months. However, your code is more complicated than necessary. Here are two ways to extract month and day from date_1:
First, when you read the data, use stringsAsFactors=FALSE, which prevents strings from getting converted to factors.
birthday <- read.csv("Birthday_s.csv",header = TRUE, stringsAsFactors=FALSE)
Extract month and days using date functions:
library(lubridate)
birthday$month = month(as.POSIXct(birthday$date_1, format="%m/%d"), abbr=TRUE, label=TRUE)
birthday$day = day(as.POSIXct(birthday$date_1, format="%m/%d"))
Extract month and days using Regular Expressions:
birthday$month = month.abb[as.numeric(gsub("([0-9]{1,2}).*", "\\1", birthday$date_1))]
birthday$day = as.numeric(gsub(".*/([0-9]{1,2}$)", "\\1", birthday$date_1))
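One plausible explanation for the wrong months in the question, assuming the split columns ended up as factors (the default for strings in data frames before R 4.0), is that as.numeric() on a factor returns the internal level codes rather than the original numbers. A small sketch of the pitfall, with a made-up vector f:

```r
# A factor of month numbers stored as text
f <- factor(c("1", "2", "10", "11"))

# Level codes: positions in the sorted levels, not the month numbers
codes <- as.numeric(f)

# Going through character first recovers the real numbers
months_num <- as.numeric(as.character(f))

# Correct month names
month.abb[months_num]
```

Reading the file with stringsAsFactors=FALSE, as suggested above, avoids the problem at the source.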
I am trying to calculate the mean date independent of year for each level of a factor.
DF <- data.frame(Date = seq(as.Date("2013-2-15"), by = "day", length.out = 730))
DF$ID = rep(c("AAA", "BBB", "CCC"), length.out = 730)
head(DF)
Date ID
1 2013-02-15 AAA
2 2013-02-16 BBB
3 2013-02-17 CCC
4 2013-02-18 AAA
5 2013-02-19 BBB
6 2013-02-20 CCC
With the data above and the code below, I can calculate the mean date for each factor, but this includes the year.
I want a mean month and day across years. The preferred result would be a POSIXct time class formatted as month-day (eg. 12-31 for Dec 31st) representing the mean month and day across multiple years.
library(dplyr)
DF2 <- DF %>% group_by(ID) %>% mutate(
Col = mean(Date, na.rm = T))
DF2
Addition
I am looking for the mean day of the year with a month and day component, for each factor level. If the date represents, for example, the date an animal reproduced, I am not interested in the differences between years, but instead want a single mean day.
The end result would look like DF2 but with the new value calculated as previously described (mean day of the year with a month and day component).
Sorry this was not more clear.
If I understand your question correctly, here's how to get a mean date column. I first extract the day of the year with yday from POSIXlt. I then calculate the mean. To get a date back, I have to add those days to an actual year, hence the creation of the Year object. As requested, I put the results in the same format as DF2 in your example.
library(dplyr)
DF2 <- DF %>%
mutate(Year=format(Date,"%Y"),
Date_day=as.POSIXlt(Date, origin = "1960-01-01")$yday)%>%
group_by(ID) %>%
mutate(Col = mean(Date_day, na.rm = T),Mean_date=format(as.Date(paste0(Year,"-01-01"))+Col,"%m-%d"))%>%
select(Date,ID,Mean_date)
DF2
> DF2
Source: local data frame [730 x 3]
Groups: ID [3]
Date ID Mean_date
(date) (chr) (chr)
1 2013-02-15 AAA 07-02
2 2013-02-16 BBB 07-02
3 2013-02-17 CCC 07-01
4 2013-02-18 AAA 07-02
5 2013-02-19 BBB 07-02
6 2013-02-20 CCC 07-01
7 2013-02-21 AAA 07-02
8 2013-02-22 BBB 07-02
9 2013-02-23 CCC 07-01
10 2013-02-24 AAA 07-02
.. ... ... ...
You can take the mean of dates by using the mean function. However, note that the result will be different depending on the data type. For POSIXct, the mean is returned as a date and time, much as the mean of a set of integers is usually not an integer. For Date, the result prints as a whole date with no time-of-day component.
For example, I recently took a mean of dates. Look at the output when different data types are used.
> mean(as.Date(stationPointDf$knockInDate))
[1] "2018-06-04"
> mean(as.POSIXct(stationPointDf$knockInDate))
[1] "2018-06-03 21:19:21 CDT"
If I am looking for a mean Month and Day across years, I convert all the dates to have the current year using lubridate package.
library(lubridate)
year(myVectorOfDates) <- 2018
Then, I compute the mean and drop the year.
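The same idea can be sketched in base R without lubridate. The vector dates below is made up for illustration, and 2018 is an arbitrary reference year; note that a Feb 29 would become NA when forced into a non-leap year.

```r
# Made-up dates from three different years
dates <- as.Date(c("2016-06-10", "2017-06-20", "2018-06-30"))

# Force every date into the same reference year (2018 is arbitrary)
same_year <- as.Date(format(dates, "2018-%m-%d"))

# Average, then keep only the month-day part as text
mean_md <- format(mean(same_year), "%m-%d")
mean_md
```

The result is a character string, since a Date or POSIXct value cannot exist without a year.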
I have 3 days long time series (data) sampled every minute (60*24*3 values):
require(zoo)
t<-seq(as.POSIXlt("2015/02/02 00:01:00"),as.POSIXlt("2015/02/04 24:00:00"), length.out=60*24*3)
d<-seq(1,2, length.out=60*24*3)
data<-zoo(d,t)
I would like to calculate:
Mean values (over three days span) for every minute of the hour assuming that all hours are equal. In this case I should have 60 values in the output with time stamps:
01:00, 02:00, ..., 60:00. Each mean must be calculated over 24x3=72 values, since we have 72 hours in three days long time series.
Same as above but additionally tracking hours:
00:01:00, 00:02:00, ..., 23:60:00. Each mean will be calculated over three values, since we have three days long time series.
Both of the following use aggregate.zoo and create zoo series. The index of the resulting zoo series will be of chron "times" class.
library(chron) # "times" class
aggregate(data, times(format(time(data), "00:%M:00")), mean)
aggregate(data, times(format(time(data), "%H:%M:00")), mean)
If it's OK that the index is of class "character", then times can be omitted, in which case chron is not needed.
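If you only need the numbers and not a zoo object, the same grouping can be sketched with tapply in base R. The names stamps, by_minute and by_hour_minute are hypothetical, and the timestamps are rebuilt here in UTC starting at midnight so the block runs on its own:

```r
# Three days of one-minute timestamps
stamps <- seq(as.POSIXct("2015-02-02 00:00:00", tz = "UTC"),
              by = "min", length.out = 60 * 24 * 3)
d <- seq(1, 2, length.out = length(stamps))

# Mean per minute of the hour: 60 groups of 72 values each
by_minute <- tapply(d, format(stamps, "%M"), mean)

# Mean per hour and minute of the day: 1440 groups of 3 values each
by_hour_minute <- tapply(d, format(stamps, "%H:%M"), mean)

head(by_minute)
head(by_hour_minute)
```

The names of the result are the character keys, for example "00" to "59" for the per-minute means.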
You can do this using data.table and lubridate:
library(data.table)
library(lubridate)
##
Dt <- data.table(
Data=as.numeric(data),
Index=index(data))
##
min_dt <- Dt[
,list(Mean=mean(Data)),
by=list(Minute=minute(Index))]
##
hmin_dt <- Dt[
,list(Mean=mean(Data)),
by=list(Hour=hour(Index),
Minute=minute(Index))]
##
R> head(min_dt)
Minute Mean
1: 1 1.493170
2: 2 1.493401
3: 3 1.493633
4: 4 1.493864
5: 5 1.494096
6: 6 1.494327
##
R> head(hmin_dt)
Hour Minute Mean
1: 0 1 1.333411
2: 0 2 1.333642
3: 0 3 1.333874
4: 0 4 1.334105
5: 0 5 1.334337
6: 0 6 1.334568
Data:
library(zoo)
t <- seq(
as.POSIXlt("2015/02/02 00:01:00"),
as.POSIXlt("2015/02/04 24:00:00"),
length.out=60*24*3)
d <- seq(1,2,length.out=60*24*3)
data <- zoo(d,t)
I have a dataframe called daily which looks like this:
daily[1:10,]
Climate_Division Date Precipitation
1 1 1948-07-01 0.2100000
2 1 1948-07-02 0.7000000
3 1 1948-07-03 0.1900000
4 1 1948-07-04 0.1033333
5 1 1948-07-05 0.1982895
6 1 1948-07-06 0.1433333
7 1 1948-07-07 NA
8 1 1948-07-08 NA
9 1 1948-07-09 NA
10 1 1948-07-10 NA
My objective is to replace each NA with the average of the values for that same calendar day across all years (1948-1995). For example, since row 7 has an NA for July 7, 1948, I would average all the July 7 values from 1948-1995 and use that average for that day.
What I have tried so far is this:
index <- which(is.na(daily$Precipitation)) # find where the NA's occur
daily_avg <- daily # copy dataframe
daily_avg$Date <- strftime(daily_avg$Date, format="2000-%m-%d") # Change the Date format to represent only the day and month and disregard year
daily_avg <- aggregate(Precipitation~Date, FUN = mean, data = daily_avg, na.rm = TRUE) # find the mean precip per day
daily[index,3] <- daily_avg[daily_avg$Date %in% strftime(daily[index,2], format="2000-%m-%d"), 2]
The last line of the code is not working properly, and I'm not sure why yet. That is my thought process so far. However, I was wondering if there is a better way of doing this using a built-in function that I am not aware of. Any help is greatly appreciated. Thank you.
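A likely reason the last line misbehaves is that %in% returns one TRUE/FALSE per row of daily_avg, so the selected averages come back once each in daily_avg's order instead of one per missing row. Looking the averages up by key keeps them aligned. A minimal sketch on made-up values (the names key and day_avg are assumptions for illustration):

```r
# Toy data: two calendar days over three years, with two NAs
daily <- data.frame(
  Date = as.Date(c("1948-07-01", "1948-07-02", "1949-07-01",
                   "1949-07-02", "1950-07-01", "1950-07-02")),
  Precipitation = c(0.21, NA, 0.19, 0.10, NA, 0.14)
)

# Month-day key for every row
key <- format(daily$Date, "%m-%d")

# Mean per month-day, named by the key
day_avg <- tapply(daily$Precipitation, key, mean, na.rm = TRUE)

# Look up the average for each missing row by its own key
index <- which(is.na(daily$Precipitation))
daily$Precipitation[index] <- unname(day_avg[key[index]])
daily
```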
I think the data in your example doesn't illustrate the problem. You should give data for a certain day over many years with some NA values. For example, here I reduce the problem to 2 days over 3 years.
Climate_Division Date Precipitation
1 1 1948-07-01 0.2100000
2 1 1948-07-02 NA
3 1 1949-07-01 0.1900000
4 1 1949-07-02 0.1033333
5 1 1950-07-01 NA
6 1 1950-07-02 0.1433333
The idea, if I understand correctly, is to replace NA values with the mean of the values for that day over all years. You can use ave and transform to create a new column containing the mean, then replace the NA values with it.
daily$daymonth <- strftime(daily$Date, format="%m-%d")
daily <- transform(daily, mp =ave(Precipitation,daymonth,
FUN=function(x) mean(x,na.rm=TRUE) ))
transform(daily, Precipitation =ifelse(is.na(Precipitation),mp,Precipitation))
Climate_Division Date Precipitation daymonth mp
1 1 1948-07-01 0.2100000 07-01 0.2000000
2 1 1948-07-02 0.1233333 07-02 0.1233333
3 1 1949-07-01 0.1900000 07-01 0.2000000
4 1 1949-07-02 0.1033333 07-02 0.1233333
5 1 1950-07-01 0.2000000 07-01 0.2000000
6 1 1950-07-02 0.1433333 07-02 0.1233333
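If the averages should be computed separately for each climate division (an assumption, since the sample data only shows division 1), ave accepts more than one grouping variable. A sketch on made-up values:

```r
# Toy data: two divisions, the same calendar day over four years
daily <- data.frame(
  Climate_Division = rep(c(1, 2), each = 4),
  Date = as.Date(rep(c("1948-07-01", "1949-07-01",
                       "1950-07-01", "1951-07-01"), times = 2)),
  Precipitation = c(0.2, NA, 0.4, 0.3, 1.0, 2.0, NA, 3.0)
)

daily$daymonth <- format(daily$Date, "%m-%d")

# Mean per division and month-day, ignoring NAs
daily$mp <- ave(daily$Precipitation, daily$Climate_Division, daily$daymonth,
                FUN = function(x) mean(x, na.rm = TRUE))

# Fill each NA with the mean of its own division and day
daily$Precipitation <- ifelse(is.na(daily$Precipitation), daily$mp, daily$Precipitation)
daily
```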
Using data.table
Some dummy data
set.seed(1)
library(data.table)
daily <- seq(as.Date('1948-01-01'),as.Date('1995-12-31'), by = "day")
dd <- data.table(date = daily, precip = runif(length(daily)))
# add na values
nas <- sample(length(daily),300, FALSE)
dd[, precip := {is.na(precip) <- nas; precip}]
## calculate the daily averages
# add day and month
dd[, c('month','day') := list(month(date), mday(date))]
# average over years for each (month, day) pair
monthdate <- dd[, list(mprecip = mean(precip, na.rm = TRUE)),
keyby = list(month, day)]
# set key for joining
setkey(dd, month, day)
# replace NA with day-month averages
dd[monthdate, precip := ifelse(is.na(precip), mprecip, precip)]
# set key to reorder to daily
setkey(dd, date)
A slightly neater version of mnel's answer, which I would prefer to the accepted one:
set.seed(1)
library(data.table)
# step 1: form data
daily <- seq(as.Date('1948-01-01'),as.Date('1995-12-31'),by="day")
dd <- data.table(date = daily, precip = runif(length(daily)))
# step 2: add NA values
nas <- sample(length(daily),300, FALSE)
dd[, precip := {is.na(precip) <- nas; precip}]
# step 3: replace NAs with day-of-month across years averages
dd[, c('month','day') := list(month(date), mday(date))]
dd[,precip:= ifelse(is.na(precip), mean(precip, na.rm=TRUE), precip), by=list(month,day)]