I am currently using the "economics" dataset in the ggplot2 package. I have been told to try this code, and it works, but I do not understand the first line. How does this date conversion function work? I have seen in the vignette that it is supposed to change the timezone, but it doesn't seem to be used for that purpose here. What does x refer to? I would be grateful for any help!
year <- function(x) as.POSIXlt(x)$year + 1900
ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path(colour = "grey50") +
geom_point(aes(colour = year(date)))
The as.POSIXlt function converts an existing date or date-time (in a variety of different formats) into an object of class "POSIXlt". This is really just a list with different components such as year, month, day, etc.
We can see this with a simple example:
my_date <- as.POSIXlt("2022-07-20")
unclass(my_date)
#> $sec
#> [1] 0
#>
#> $min
#> [1] 0
#>
#> $hour
#> [1] 0
#>
#> $mday
#> [1] 20
#>
#> $mon
#> [1] 6
#>
#> $year
#> [1] 122
#>
#> $wday
#> [1] 3
#>
#> $yday
#> [1] 200
#>
#> $isdst
#> [1] 0
#>
#> attr(,"tzone")
#> [1] "GMT"
We can use $ notation to extract any of these elements just as we can with a normal list:
my_date$year
#> [1] 122
Since the year is represented as an integer relative to 1900, adding 1900 to the result simply returns the year as an integer:
my_date$year + 1900
#> [1] 2022
So your year function will simply extract the year as a number from a date or date-time in a variety of formats. In the case of your plot code, it is simply extracting the year from the date column.
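To make the role of x concrete: x is just the argument of the function, i.e. whatever is passed in when year(...) is called. In the plot, that is the date column of economics. A minimal sketch, using a made-up vector of dates (the values are illustrative, not taken from the dataset):

```r
# Same helper as in the question: x is whatever date-like vector is passed in
year <- function(x) as.POSIXlt(x)$year + 1900

# Illustrative dates (assumed for this example)
dates <- as.Date(c("1967-07-01", "1999-12-31", "2015-04-01"))

years <- year(dates)   # here x is `dates`
print(years)
#> [1] 1967 1999 2015

# Other components are extracted the same way; note that $mon counts from 0
months <- as.POSIXlt(dates)$mon + 1
print(months)
#> [1]  7 12  4
```

The same offset logic applies to the other list components: $year is relative to 1900 and $mon starts at 0, so both need adjusting before display.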
The year function you have shown is essentially identical to the year.default function in the lubridate package, except you don't need to load in an extra package:
lubridate:::year.default
#> function (x)
#> as.POSIXlt(x, tz = tz(x))$year + 1900
#> <bytecode: 0x000001ba60f19bf8>
#> <environment: namespace:lubridate>
Created on 2022-07-20 by the reprex package (v2.0.1)
I want to count the number of events that occur within intervals.
I start with a table that has three columns: start dates, end dates, and the interval created by them.
table <-
tibble(
start = c( "2022-08-02", "2022-10-06", "2023-01-11"),
end = c("2022-08-04", "2023-02-06", "2023-02-04"),
interval = start %--% end
)
I also have a vector of timestamp events:
events <- c(ymd("2022-08-07"), ymd("2022-10-17"), ymd("2023-01-17"), ymd("2023-02-02"))
For each interval in my table, I want to know how many events fell within that interval so that my final table looks something like this (but with the correct counts):
start end interval n_events_within_interval
<chr> <chr> <Interval> <int>
1 2022-08-02 2022-08-04 2022-08-02 UTC--2022-08-04 UTC 2
2 2022-10-06 2023-02-06 2022-10-06 UTC--2023-02-06 UTC 2
3 2023-01-11 2023-02-04 2023-01-11 UTC--2023-02-04 UTC 2
I have tried this so far, but I'm not sure how to get mutate to cycle through the events vector for each row:
library(tidyverse)
library(lubridate)
library(purrr)
table <-
tibble(
start = c( "2022-08-02", "2022-10-06", "2023-01-11"),
end = c("2022-08-04", "2023-02-06", "2023-02-04"),
interval = start %--% end
)
table
#> # A tibble: 3 × 3
#> start end interval
#> <chr> <chr> <Interval>
#> 1 2022-08-02 2022-08-04 2022-08-02 UTC--2022-08-04 UTC
#> 2 2022-10-06 2023-02-06 2022-10-06 UTC--2023-02-06 UTC
#> 3 2023-01-11 2023-02-04 2023-01-11 UTC--2023-02-04 UTC
events <- c(ymd("2022-08-07"), ymd("2022-10-17"), ymd("2023-01-17"), ymd("2023-02-02"))
events
#> [1] "2022-08-07" "2022-10-17" "2023-01-17" "2023-02-02"
table %>%
mutate(
n_events_within_interval = sum(events %within% interval)
)
#> Warning in as.numeric(a) - as.numeric(int#start): longer object length is not a
#> multiple of shorter object length
#> Warning in as.numeric(a) - as.numeric(int#start) <= int#.Data: longer object
#> length is not a multiple of shorter object length
#> Warning in as.numeric(a) - as.numeric(int#start): longer object length is not a
#> multiple of shorter object length
#> # A tibble: 3 × 4
#> start end interval n_events_within_interval
#> <chr> <chr> <Interval> <int>
#> 1 2022-08-02 2022-08-04 2022-08-02 UTC--2022-08-04 UTC 2
#> 2 2022-10-06 2023-02-06 2022-10-06 UTC--2023-02-06 UTC 2
#> 3 2023-01-11 2023-02-04 2023-01-11 UTC--2023-02-04 UTC 2
Created on 2023-02-15 with reprex v2.0.2
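Inside mutate, events %within% interval is evaluated against the whole interval column at once, so the 4 events are recycled against the 3 intervals (hence the warnings) and every row receives the same total. We could use rowwise so that the expression is evaluated once per row:
library(dplyr)
library(lubridate)
table %>%
rowwise() %>%
mutate(n_events_within_interval = sum(events %within% interval)) %>%
ungroup()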
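For reference, the per-row logic can also be spelled out in base R, which makes it easy to check the counts by hand. This is only a sketch: it assumes both end points of each interval are inclusive (as with %within%), and the names starts, ends and n_events are made up for the example.

```r
# Interval bounds and events from the question, stored as Date vectors
starts <- as.Date(c("2022-08-02", "2022-10-06", "2023-01-11"))
ends   <- as.Date(c("2022-08-04", "2023-02-06", "2023-02-04"))
events <- as.Date(c("2022-08-07", "2022-10-17", "2023-01-17", "2023-02-02"))

# For each interval i, count the events with starts[i] <= event <= ends[i]
n_events <- vapply(
  seq_along(starts),
  function(i) sum(events >= starts[i] & events <= ends[i]),
  integer(1)
)

print(n_events)
#> [1] 0 3 2
```

No event falls in the first interval, three fall in the second, and two in the third, which is what the rowwise version should return as well.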
I have a dataframe of comments where one column df$date corresponds to the dates when the comment was made, expressed as shown below:
[1] "2019-06-01" "2019-07-01" "2019-10-01" "2019-10-01" "2019-09-01" "2019-04-01" "2019-04-01" "2019-04-01" "2019-04-01"
[10] "2019-04-01" "2018-08-01" "2018-08-01" "2018-08-01" "2018-07-01" "2018-08-01" "2018-07-01" "2018-07-01" "2018-06-01"
I want to add a new column with the seasons. Basically, I want to say that if the date was made between December and February, then the corresponding season would be winter. I have tried the following, but it gives me: "Error: Not compatible with requested type: [type=character; target=double]."
df$season = ifelse(between(df$date,"2018-11-30", "2019-03-01"), "Invierno"
ifelse(between(df$date,"2019-02-28", "2019-06-01"),"Spring", ifelse(between(df$date,"2019-06-30", "2019-07-01"),"Summer",
"Fall")))
Does this mean I have to reformat the date to character or is there any way I could create the seasons column using the date format?
Thanks in advance!
Two things are happening here. First, you need to tell R that your date strings are actually dates, using as.Date; the error arises because between() is being handed character strings rather than dates. Second, nested ifelse calls are hard to read (and ifelse drops the Date class whenever the values it returns are themselves dates), so this is a good opportunity to introduce dplyr's case_when syntax, which accomplishes the same thing in a more readable way:
library(dplyr, warn.conflicts = FALSE)
date <- c("2019-06-01","2019-07-01","2019-10-01","2019-10-01","2019-09-01","2019-04-01","2019-04-01","2019-04-01","2019-04-01",
"2019-04-01","2018-08-01","2018-08-01","2018-08-01","2018-07-01","2018-08-01","2018-07-01","2018-07-01","2018-06-01")
df <- data.frame(date)
## first need to tell R that date is actually a date
df$date <- as.Date(df$date)
## Turns out that ifelse doesn't actually work well for dates so I'll introduce the glorious case_when function
df$season <- case_when(
between(df$date, as.Date("2018-11-30"), as.Date("2019-03-01")) ~ "Invierno",
between(df$date, as.Date("2019-02-28"), as.Date("2019-06-01")) ~ "Spring",
between(df$date, as.Date("2019-06-30"), as.Date("2019-07-01")) ~ "Summer",
TRUE ~ "Fall"
)
df
#> date season
#> 1 2019-06-01 Spring
#> 2 2019-07-01 Summer
#> 3 2019-10-01 Fall
#> 4 2019-10-01 Fall
#> 5 2019-09-01 Fall
#> 6 2019-04-01 Spring
#> 7 2019-04-01 Spring
#> 8 2019-04-01 Spring
#> 9 2019-04-01 Spring
#> 10 2019-04-01 Spring
#> 11 2018-08-01 Fall
#> 12 2018-08-01 Fall
#> 13 2018-08-01 Fall
#> 14 2018-07-01 Fall
#> 15 2018-08-01 Fall
#> 16 2018-07-01 Fall
#> 17 2018-07-01 Fall
#> 18 2018-06-01 Fall
Created on 2020-05-29 by the reprex package (v0.3.0)
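Note that fixed date ranges tie each season to one particular year, which is why the 2018 summer dates above fall through to "Fall". If the season should depend only on the month, one option is to look it up from the month number. The sketch below assumes northern-hemisphere meteorological seasons (December to February is winter, and so on); the lookup vector season_by_month and the sample dates are made up for the example.

```r
# A few illustrative dates, already converted with as.Date
dates <- as.Date(c("2019-06-01", "2019-01-15", "2018-08-01",
                   "2018-12-25", "2019-10-01", "2019-04-01"))

# One season label per calendar month, January first (assumed convention)
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")

# Month number (1-12) of each date, then look up the label
month_num <- as.integer(format(dates, "%m"))
season <- season_by_month[month_num]

print(data.frame(date = dates, season = season))
```

Because the lookup ignores the year, August 2018 and August 2019 both map to "Summer" without adding any new ranges.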
I have spent a day searching for the answer to this question and still cannot figure out how this works (I am relatively new to R).
The data:
I have the daily revenue of a store. The starting date is November 2017, and the end date is February 2020. (It is not a typical Jan - Dec every year data). There is no missing value, and every day's sale is recorded. There are 2 columns: date (in proper date format) and revenue (in numerical format).
I am trying to build a time series forecasting model for my sales data. One pre-requisite is that I need to transform my data into the ts object. All those posts online I have seen dealt with yearly or monthly data, yet I have not yet seen anyone mention daily data.
I tried to convert my data to a ts object this way (I named my data "d"):
d_ts <- ts(d$revenue, start=min(d$date), end = max(d$date), frequency = 365)
I then got really weird results as such:
Start = c(17420, 1)
End = c(18311, 1)
Frequency = 365
[1] 1174.77 214.92 10.00 684.86 7020.04 11302.50 30613.55 29920.98 24546.49 22089.89 30291.65 32993.05 26517.11 39670.38 30361.32 17510.72
[17] 9888.76 3032.27 1229.74 2426.36 ....... [ reached getOption("max.print") -- omitted 324216 entries ]
There are 892 days in this dataset, so how come the ts object has 325,216 entries?
I looked into this book called "Hands-On Time-Series with R" and found the following excerpt:
[image of the book excerpt, not reproduced here]
This basically means the ts() object does NOT work for daily data. Is this why my ts() conversion is messed up?
My questions are ...
(1) How can I make my daily revenue data to be a time series object before feeding into a model, if ts() does not work for daily data? All those time-series models require my data to be in time-series format though.
(2) Does the fact that my data does not start on Jan 2017 & end on Dec 2019 (i.e. those perfect 12 months in a year data shown in many online posts) have any complications? If so, what should I adjust so that the time series forecasting would be meaningful?
I have been stuck on this issue and could not wrap my head around. I really, really appreciate your help!
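The ts function can work with any time interval; the series is defined by the start and end points together with frequency, the number of observations per unit of time. Because you are passing dates, which are stored internally as the number of days since 1970-01-01, one unit corresponds to one day. With frequency = 365, ts() therefore expects 365 observations per day, i.e. (18311 - 17420) * 365 + 1 = 325,216 values, and it recycles your 892 revenues to fill them. The help file at ?ts also shows examples of how to use annual or quarterly data.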
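The effect of frequency on the length of the series can be checked directly. A small sketch, using random numbers as a stand-in for the 892 daily revenues and the start and end values printed in the question:

```r
revenue <- runif(892)  # stand-in for the 892 daily revenues

# Start and end as day counts, frequency = 365 (as in the question):
# ts() recycles the data to fill 365 observations per day
wrong <- ts(revenue, start = 17420, end = 18311, frequency = 365)
length(wrong)
#> [1] 325216

# With frequency = 1 there is exactly one observation per day
right <- ts(revenue, start = 17420, end = 18311, frequency = 1)
length(right)
#> [1] 892
```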
To read in your daily data correctly you need to set frequency=1. Using some data similar in structure to what you've got:
#Compile a dataframe like yours
library(lubridate)
set.seed(0)
d <- data.frame(date=seq.Date(dmy("01/11/2017/"), by="day", length.out=892))
d$revenue <- runif(892)
head(d)
#date revenue
# 1 2017-11-01 0.8966972
# 2 2017-11-02 0.2655087
# 3 2017-11-03 0.3721239
# 4 2017-11-04 0.5728534
# 5 2017-11-05 0.9082078
# 6 2017-11-06 0.2016819
#Convert to timeseries object
d_ts <- ts(d$revenue, start=min(d$date), end = max(d$date), frequency = 1)
d_ts
# Time Series:
# Start = 17471
# End = 18362
# Frequency = 1
# [1] 0.896697200 0.265508663 0.372123900 0.572853363 0.908207790 0.201681931 0.898389685 0.944675269 0.660797792
# [10] 0.629114044 0.061786270 0.205974575 0.176556753 0.687022847 0.384103718 0.769841420 0.497699242 0.717618508
With daily data, you are better off using a tsibble class rather than a ts class. There are modelling and forecasting tools available via the fable package.
library(tsibble)
library(fable)
set.seed(1)
d_tsibble <- data.frame(
date = seq(as.Date("2017-11-01"), by = "day", length.out = 892),
revenue = rnorm(892)
) %>%
as_tsibble(index = date)
d_tsibble
#> # A tsibble: 892 x 2 [1D]
#> date revenue
#> <date> <dbl>
#> 1 2017-11-01 -0.626
#> 2 2017-11-02 0.184
#> 3 2017-11-03 -0.836
#> 4 2017-11-04 1.60
#> 5 2017-11-05 0.330
#> 6 2017-11-06 -0.820
#> 7 2017-11-07 0.487
#> 8 2017-11-08 0.738
#> 9 2017-11-09 0.576
#> 10 2017-11-10 -0.305
#> # … with 882 more rows
d_tsibble %>%
model(
arima = ARIMA(revenue)
) %>%
forecast(h = "14 days")
#> # A fable: 14 x 4 [1D]
#> # Key: .model [1]
#> .model date revenue .distribution
#> <chr> <date> <dbl> <dist>
#> 1 arima 2020-04-11 -0.0178 N(-1.8e-02, 1.1)
#> 2 arima 2020-04-12 -0.0117 N(-1.2e-02, 1.1)
#> 3 arima 2020-04-13 -0.00765 N(-7.7e-03, 1.1)
#> 4 arima 2020-04-14 -0.00501 N(-5.0e-03, 1.1)
#> 5 arima 2020-04-15 -0.00329 N(-3.3e-03, 1.1)
#> 6 arima 2020-04-16 -0.00215 N(-2.2e-03, 1.1)
#> 7 arima 2020-04-17 -0.00141 N(-1.4e-03, 1.1)
#> 8 arima 2020-04-18 -0.000925 N(-9.2e-04, 1.1)
#> 9 arima 2020-04-19 -0.000606 N(-6.1e-04, 1.1)
#> 10 arima 2020-04-20 -0.000397 N(-4.0e-04, 1.1)
#> 11 arima 2020-04-21 -0.000260 N(-2.6e-04, 1.1)
#> 12 arima 2020-04-22 -0.000171 N(-1.7e-04, 1.1)
#> 13 arima 2020-04-23 -0.000112 N(-1.1e-04, 1.1)
#> 14 arima 2020-04-24 -0.0000732 N(-7.3e-05, 1.1)
Created on 2020-04-01 by the reprex package (v0.3.0)
I have looked far and wide for a solution to this issue, but I cannot seem to figure it out. I do not have much experience working with xts objects in R.
I have 40 xts objects (ETF data) and I want to run the quantmod function weeklyReturn on each of them individually.
I have tried to refer to them by using the ls() function:
lapply(ls(), weeklyReturn)
I have also tried the object() function
lapply(object(), weeklyReturn)
I have also tried using as.xts() in my call to coerce the ls() objects to be used as xts but to no avail.
How can I run this function on every xts object in the environment?
Thank you,
It would be better to load all of your xts objects into a list or create them in a way that returns them in a list to begin with. Then you could do results = lapply(xts.list, weeklyReturn).
To work with objects in the global environment, you could test for whether the object is an xts object and then run weeklyReturn on it if it is. Something like this:
results = lapply(setNames(ls(), ls()), function(i) {
x = get(i)
if(is.xts(x)) {
weeklyReturn(x)
}
})
results = results[!sapply(results, is.null)]
Or you could select only the xts objects to begin with:
results = sapply(ls()[sapply(ls(), function(i) is.xts(get(i)))],
function(i) weeklyReturn(get(i)), simplify=FALSE, USE.NAMES=TRUE)
lapply(ls(), weeklyReturn) doesn't work, because ls() returns the object names as strings. The get function takes a string as an argument and returns the object with that name.
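The same idea can be written a little more compactly with mget(), which fetches several objects by name at once, and Filter(). The sketch below is an illustration rather than part of the answers above: apply_to_class is a hypothetical helper, and it is demonstrated on base ts objects in a scratch environment so that it runs without quantmod. For the original problem one would pass "xts" and weeklyReturn instead.

```r
# Hypothetical helper: apply `fun` to every object of class `cls` found in `env`
apply_to_class <- function(cls, fun, env = globalenv()) {
  objs <- mget(ls(envir = env), envir = env)          # named list of all objects
  keep <- Filter(function(x) inherits(x, cls), objs)  # keep only the wanted class
  lapply(keep, fun)                                   # names are preserved
}

# Scratch environment with two series and one unrelated object
demo_env <- new.env()
assign("a", ts(1:10), envir = demo_env)
assign("b", ts(1:5), envir = demo_env)
assign("not_a_series", "hello", envir = demo_env)

results <- apply_to_class("ts", length, env = demo_env)
print(results)

# With quantmod loaded, the equivalent call for the question would be:
# results <- apply_to_class("xts", weeklyReturn)
```

Using inherits() rather than is.xts() keeps the helper independent of any particular package, at the cost of having to pass the class name as a string.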
An alternate solution using the tidyquant package. Note that this is data frame based so I will not be working with xts objects. I use two core functions to scale the analysis. First, tq_get() is used to go from a vector of ETF symbols to getting the prices. Second, tq_transmute() is used to apply the weeklyReturn function to the adjusted prices.
library(tidyquant)
etf_vec <- c("SPY", "QEFA", "TOTL", "GLD")
# Use tq_get to get prices
etf_prices <- tq_get(etf_vec, get = "stock.prices", from = "2017-01-01", to = "2017-05-31")
etf_prices
#> # A tibble: 408 x 8
#> symbol date open high low close volume adjusted
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 SPY 2017-01-03 227.121 227.919 225.951 225.24 91366500 223.1760
#> 2 SPY 2017-01-04 227.707 228.847 227.696 226.58 78744400 224.5037
#> 3 SPY 2017-01-05 228.363 228.675 227.565 226.40 78379000 224.3254
#> 4 SPY 2017-01-06 228.625 229.856 227.989 227.21 71559900 225.1280
#> 5 SPY 2017-01-09 229.009 229.170 228.514 226.46 46265300 224.3848
#> 6 SPY 2017-01-10 228.575 229.554 228.100 226.46 63771900 224.3848
#> 7 SPY 2017-01-11 228.453 229.200 227.676 227.10 74650000 225.0190
#> 8 SPY 2017-01-12 228.595 228.847 227.040 226.53 72113200 224.4542
#> 9 SPY 2017-01-13 228.827 229.503 228.786 227.05 62717900 224.9694
#> 10 SPY 2017-01-17 228.403 228.877 227.888 226.25 61240800 224.1767
#> # ... with 398 more rows
# Use tq_transmute to apply weeklyReturn to multiple groups
etf_returns_w <- etf_prices %>%
group_by(symbol) %>%
tq_transmute(select = adjusted, mutate_fun = weeklyReturn)
etf_returns_w
#> # A tibble: 88 x 3
#> # Groups: symbol [4]
#> symbol date weekly.returns
#> <chr> <date> <dbl>
#> 1 SPY 2017-01-06 0.0087462358
#> 2 SPY 2017-01-13 -0.0007042173
#> 3 SPY 2017-01-20 -0.0013653367
#> 4 SPY 2017-01-27 0.0098350474
#> 5 SPY 2017-02-03 0.0016159256
#> 6 SPY 2017-02-10 0.0094619381
#> 7 SPY 2017-02-17 0.0154636969
#> 8 SPY 2017-02-24 0.0070186222
#> 9 SPY 2017-03-03 0.0070964211
#> 10 SPY 2017-03-10 -0.0030618336
#> # ... with 78 more rows
This is the structure (str output) of my data frame:
'data.frame': 10652 obs. of 4 variables:
$ Date: chr "06-15-2017" "06-15-2017" "06-15-2017" "06-15-2017" ...
$ Time: Factor w/ 951 levels "00:00:01","00:00:02",..: 396 398 400 402 404 406 407 409 411 413 ...
$ CPU : num 2.4 2.4 2.3 2.3 2.2 2.2 2.1 2.1 2.1 2.1 ...
$ MEM : num 2.5 2.5 2.5 2.6 2.6 2.6 2.6 2.7 2.9 2.9 ...
I want to make R read the date and time column in Date and Time format.
I have tried:
DateData$Date_Time = within(DateData, { timestamp=format(as.POSIXct(paste(DateData$Date, DateData$Time)), "%d/%m/%Y %H:%M:%S") })
I have tried this after merging the date and time column-
DateData$Date_Time = as.chron(DateData$Date_Time, "%d/%m/%Y %H:%M:%S")
DateData = within(DateData, { timestamp=strptime(paste((DateData$Date, DateData$Time), "%Y/%m/%d%H:%M:%S") })
And this: DateData$DateTime = strptime(DateData$DateTime,"%m-%d-%Y %H:%M:%S")
Nothing seems to work for me.
Dealing with conversion after importing data
This is a sample of your data
df <- data.frame(Date = c("06-15-2017","06-15-2017","06-15-2017","06-15-2017"), Time = c("00:00:01", "00:00:02", "00:00:03", "00:00:04"), stringsAsFactors = F)
For date objects, you can use base R, lubridate, or the anytime package:
df$Date_base <- as.Date(df$Date, format = "%m-%d-%Y") # %Y is the four-digit year; %y would read only "20" and return 2020-06-15
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df$Date_lubridate <- mdy(df$Date)
library(anytime)
df$Date_anytime <- anytime(df$Date)
For working with time objects only (not date-times), you can use the hms package, or Period objects from the lubridate package via lubridate::hms:
library(hms)
#>
#> Attaching package: 'hms'
#> The following object is masked from 'package:lubridate':
#>
#> hms
df$Time_hms <- as_hms(df$Time) # as_hms() replaces the deprecated as.hms()
df$Time_lubridate <- lubridate::hms(df$Time) # hms in lubridate is masked by hms package
Here is what the results look like:
df
#> Date Time Date_base Date_lubridate Date_anytime Time_hms
#> 1 06-15-2017 00:00:01 2017-06-15 2017-06-15 2017-06-15 00:00:01
#> 2 06-15-2017 00:00:02 2017-06-15 2017-06-15 2017-06-15 00:00:02
#> 3 06-15-2017 00:00:03 2017-06-15 2017-06-15 2017-06-15 00:00:03
#> 4 06-15-2017 00:00:04 2017-06-15 2017-06-15 2017-06-15 00:00:04
#> Time_lubridate
#> 1 1S
#> 2 2S
#> 3 3S
#> 4 4S
Class of each column and summary of df:
sapply(df, class)
#> $Date
#> [1] "character"
#>
#> $Time
#> [1] "character"
#>
#> $Date_base
#> [1] "Date"
#>
#> $Date_lubridate
#> [1] "Date"
#>
#> $Date_anytime
#> [1] "POSIXct" "POSIXt"
#>
#> $Time_hms
#> [1] "hms" "difftime"
#>
#> $Time_lubridate
#> [1] "Period"
#> attr(,"package")
#> [1] "lubridate"
summary(df)
#> Date Time Date_base
#> Length:4 Length:4 Min. :2017-06-15
#> Class :character Class :character 1st Qu.:2017-06-15
#> Mode :character Mode :character Median :2017-06-15
#> Mean :2017-06-15
#> 3rd Qu.:2017-06-15
#> Max. :2017-06-15
#> Date_lubridate Date_anytime Time_hms
#> Min. :2017-06-15 Min. :2017-06-15 Length:4
#> 1st Qu.:2017-06-15 1st Qu.:2017-06-15 Class1:hms
#> Median :2017-06-15 Median :2017-06-15 Class2:difftime
#> Mean :2017-06-15 Mean :2017-06-15 Mode :numeric
#> 3rd Qu.:2017-06-15 3rd Qu.:2017-06-15
#> Max. :2017-06-15 Max. :2017-06-15
#> Time_lubridate
#> Min. :1S
#> 1st Qu.:1.75S
#> Median :2.5S
#> Mean :2.5S
#> 3rd Qu.:3.25S
#> Max. :4S
Dealing with conversion directly when reading
You can deal with type conversion directly when you read data from a file using the readr package.
library(readr)
read_csv('Date, Time
06-15-2017, 00:00:01
06-15-2017, 00:00:02
06-15-2017, 00:00:03
06-15-2017, 00:00:04
', col_types = cols(Date = col_date(format = "%m-%d-%Y"),
Time = col_time()))
#> # A tibble: 4 x 2
#> Date Time
#> <date> <time>
#> 1 2017-06-15 00:00:01
#> 2 2017-06-15 00:00:02
#> 3 2017-06-15 00:00:03
#> 4 2017-06-15 00:00:04
With readr, the data are imported directly into a tibble (the tidyverse's variant of a data frame) with the columns already parsed as date and time. More information is available in the readr documentation.
You used date-time formats that don't match your data at multiple places.
If you paste the Date and Time columns together with a space separator, the format to parse is %m-%d-%Y %H:%M:%S.
That is, to combine the two columns and parse as date-time:
DateData$DateTime <- strptime(paste(DateData$Date, DateData$Time, sep=' '), '%m-%d-%Y %H:%M:%S')
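strptime() returns a POSIXlt object; as.POSIXct() accepts the same format string and yields a more compact class that is usually more convenient as a data frame column. A small sketch with made-up rows in the same layout as the question (the tz = "UTC" choice is an assumption, so substitute the time zone of your data):

```r
# Made-up rows mirroring the question: Date as character, Time as factor
DateData <- data.frame(
  Date = c("06-15-2017", "06-15-2017"),
  Time = factor(c("00:00:01", "00:00:02")),
  stringsAsFactors = FALSE
)

# paste() converts the factor to its labels, so no explicit as.character() is needed
DateData$DateTime <- as.POSIXct(
  paste(DateData$Date, DateData$Time),
  format = "%m-%d-%Y %H:%M:%S",
  tz = "UTC"  # assumed time zone
)

class(DateData$DateTime)
#> [1] "POSIXct" "POSIXt"
format(DateData$DateTime, "%Y-%m-%d %H:%M:%S")
#> [1] "2017-06-15 00:00:01" "2017-06-15 00:00:02"
```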
Install and load the lubridate package:
install.packages("lubridate")
library(lubridate)
Paste the Date and Time columns together:
DFanalysis$DateStamp <- paste(DFanalysis$Date, DFanalysis$Time, sep = " ")
Check the class of DateStamp:
class(DFanalysis$DateStamp)
If the class is character, we can convert it directly:
DFanalysis$DateStamp <- mdy_hms(DFanalysis$DateStamp)