For example,
dateIntervals <- as.Date(c("2020-08-10", "2020-11-11", "2021-07-05"))
possibleDates <- seq(as.Date("2020-01-02"), dateIntervals[3], by = "day")
genDF <- function() data.frame(Date = sample(possibleDates, 100), Value = runif(100))
listdf <- replicate(2, genDF(), simplify = FALSE)
Here listdf has two data frame elements (each with 100 random dates drawn from possibleDates and 100 random values),
and listdf[[1]] looks like this:
A data.frame: 100 × 2
Date Value
<date> <dbl>
2020-07-24 0.63482411
2020-02-26 0.25989280
2020-10-26 0.21721077
2020-10-11 0.34774192
2020-08-18 0.67758312
2020-02-03 0.22929624
2020-06-10 0.30279353
2020-05-29 0.95549488
...
lapply(listdf, function(x) split(x, findInterval(x$Date, dateIntervals)))
This turns listdf into a 2 × 3 nested list, with each data frame split by date interval:
1.$`0`
A data.frame: 43 × 2
Date Value
<date> <dbl>
1 2020-07-24 0.63482411
2 2020-02-26 0.25989280
6 2020-02-03 0.22929624
7 2020-06-10 0.30279353
...
$`1`
A data.frame: 15 × 2
Date Value
<date> <dbl>
3 2020-10-26 0.21721077
4 2020-10-11 0.34774192
5 2020-08-18 0.67758312
31 2020-11-09 0.59149301
...
$`2`
A data.frame: 42 × 2
Date Value
<date> <dbl>
9 2021-06-28 0.10055644
10 2021-05-17 0.63942936
12 2021-04-22 0.63589801
13 2021-02-01 0.70106156
...
2.$`0`
A data.frame: 43 × 2
Date Value
<date> <dbl>
2 2020-07-16 0.81376364
4 2020-07-03 0.05152627
7 2020-01-21 0.98677433
8 2020-03-23 0.13513921
...
$`1`
A data.frame: 18 × 2
Date Value
<date> <dbl>
5 2020-11-01 0.02740125
12 2020-09-04 0.82042568
15 2020-08-12 0.54190868
16 2020-09-19 0.05933666
18 2020-10-05 0.04983061
...
$`2`
A data.frame: 38 × 2
Date Value
<date> <dbl>
1 2021-04-13 0.46199245
3 2021-06-12 0.71461155
6 2021-01-24 0.56527997
9 2021-04-17 0.72634151
13 2021-04-20 0.55489499
...
I want only the first two of the split pieces ($`0` and $`1` for both 1. and 2.).
Is there a parameter in the split function that does this
(returning only the first or last n elements)?
I want something like this:
lapply(listdf, function(x) split(x, findInterval(x$Date, dateIntervals), some parameter=2))
That is, the "2" would mean keeping only the first two groups. Does split have a parameter that can do this?
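split() does not appear to have an argument that limits how many groups are returned. A minimal sketch of one workaround, assuming it is acceptable to build all groups first and then keep the first two with head() (the set.seed() call and the name firstTwo are additions made here for illustration):

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible

dateIntervals <- as.Date(c("2020-08-10", "2020-11-11", "2021-07-05"))
possibleDates <- seq(as.Date("2020-01-02"), dateIntervals[3], by = "day")
genDF <- function() data.frame(Date = sample(possibleDates, 100), Value = runif(100))
listdf <- replicate(2, genDF(), simplify = FALSE)

# Split each data frame by interval, then keep only the first two groups.
firstTwo <- lapply(listdf, function(x) {
  head(split(x, findInterval(x$Date, dateIntervals)), 2)
})

# Show which groups were kept for each data frame.
lapply(firstTwo, names)
```

Replacing head() with tail() would keep the last n groups instead.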
I am working with a large list of dataframes that use inconsistent date formats. I would like to conditionally mutate across the list so that any dataframe that contains a given string uses one date format, and those that do not contain the string use another. In other words, I want to distinguish between dataframes launched in 2019 (which use mdy) and those launched in all other years (which use dmy).
The following code will conditionally mutate rows within a dataframe, but I am unsure how to conditionally mutate across the entire column.
dataframes %>% map(~.x %>%
  mutate(date_time = if_else(str_detect(date_time, "/19 "),
                             mdy_hms(date_time), dmy_hms(date_time))))
Thank you!
Edit: data and code example. Some dataframes contain a mixture of years.
library(tidyverse)
library(lubridate)
dataframes <- list(
tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2) # July 6th
)
dataframes %>%
map(~.x %>%
mutate(date_time = if_else(str_detect(date_time, "/19 "),
mdy_hms(date_time), dmy_hms(date_time)),
date = date(date_time),
month = month(date_time),
doy = yday(date_time)))
[[1]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2019-07-06 13:00:00 1 2019-07-06 7 187
2 2020-06-07 13:00:00 2 2020-06-07 6 159
[[2]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2020-07-06 13:00:00 1 2020-07-06 7 188
2 2021-07-06 13:00:00 2 2021-07-06 7 187
If you are trying to determine the format of the date column for the whole data.frame based on the presence of any date from 2019, then a small tweak of your code should work.
Instead of evaluating each record for the presence of "/19 ", set the condition of if_else() to any(str_detect(...)), which returns TRUE if any of the values is TRUE. However, the result of any() always has length 1, so you then need to rep() it to match the number of rows in the data.frame, using dplyr::n().
library(tidyverse)
library(lubridate)
dataframes <- list(
tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2) # July 6th
)
dataframes %>%
map( ~ .x %>%
mutate(
date_time = if_else(str_detect(date_time, "/19 ") %>%
any() %>%
rep(n()),
mdy_hms(date_time),
dmy_hms(date_time)),
date = date(date_time),
month = month(date_time),
doy = yday(date_time)
))
#> [[1]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2019-07-06 13:00:00 1 2019-07-06 7 187
#> 2 2020-07-06 13:00:00 2 2020-07-06 7 188
#>
#> [[2]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2020-07-06 13:00:00 1 2020-07-06 7 188
#> 2 2021-07-06 13:00:00 2 2021-07-06 7 187
Created on 2022-07-20 by the reprex package (v2.0.1)
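Because the any() condition is a single TRUE or FALSE per data frame, another option is to choose the parser once with an ordinary if/else, so that only one of the two parsers runs on each data frame. A minimal sketch, assuming lubridate is installed; the helper name parse_by_frame is hypothetical, and base grepl() stands in for str_detect():

```r
library(lubridate)

dataframes <- list(
  data.frame(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2),
  data.frame(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2)
)

# Hypothetical helper: pick one parser for the whole data frame.
parse_by_frame <- function(df) {
  # TRUE if any row mentions year 19, in which case the frame is treated as mdy.
  has_2019 <- any(grepl("/19 ", df$date_time, fixed = TRUE))
  parser <- if (has_2019) mdy_hms else dmy_hms
  df$date_time <- parser(df$date_time)
  df
}

parsed <- lapply(dataframes, parse_by_frame)
```

With if_else(), both mdy_hms() and dmy_hms() are evaluated on every row, which may produce parse warnings for strings that are only valid in one format; choosing the parser up front avoids that.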
I have two datasets that I would like to join based on date. One is a survey dataset, and the other is a list of prices at various dates. The dates don't match exactly, so I would like to join on the nearest date in the survey dataset (the price data is weekly).
Here's a brief snippet of what the survey dataset looks like (there are many other variables, but here's the two most relevant):
ID          actual.date
20120377    2012-09-26
2020455822  2020-11-23
20126758    2012-10-26
20124241    2012-10-25
2020426572  2020-11-28
And here's the price dataset (also much larger, but you get the idea):
date        price.var1        price.var2
2017-10-30  2.74733926399869  2.73994826674735
2015-03-16  2.77028200438506  2.74079930272231
2010-10-18  3.4265947805337   3.41591263539176
2012-10-29  4.10095806545397  4.14717556976502
2012-01-09  3.87888859352037  3.93074237884497
What I would like to do is join the price dataset to the survey dataset, joining on the nearest date.
I've tried a number of different things, none of which have worked to my satisfaction.
#reading in sample data
library(data.table)
library(dplyr)
survey <- fread(" ID actual.date
1: 20120377 2012-09-26
2: 2020455822 2020-11-23
3: 20126758 2012-10-26
4: 20124241 2012-10-25
5: 2020426572 2020-11-28") %>% select(-V1)
price <- fread("date price.var1 price.var2
1: 2017-10-30 2.747339 2.739948
2: 2015-03-16 2.770282 2.740799
3: 2010-10-18 3.426595 3.415913
4: 2012-10-29 4.100958 4.147176
5: 2012-01-09 3.878889 3.930742") %>% select(-V1)
#using data.table
setDT(survey)[,DT_DATE := actual.date]
setDT(price)[,DT_DATE := date]
survey_price <- survey[price,on=.(DT_DATE),roll="nearest"]
#This works, and they join, but it drops a ton of observations, which won't work
#using dplyr
library(dplyr)
survey_price <- left_join(survey,price,by=c("actual.date"="date"))
#this joins them without dropping observations, but all of the price variables become NAs
You were almost there.
In the X[i, on = ...] syntax, the result has one row for each row of i, so i should be survey in order to keep all of its rows:
setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price
date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572
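In this join the date column of the result carries the survey dates (the values from i), so the price date that was actually matched is no longer visible. A minimal sketch for inspecting the match, assuming data.table is installed; price.date and gap_days are hypothetical helper columns introduced here, and the data are the question's sample rows:

```r
library(data.table)

survey <- data.table(
  ID = c(20120377, 2020455822, 20126758, 20124241, 2020426572),
  actual.date = as.IDate(c("2012-09-26", "2020-11-23", "2012-10-26",
                           "2012-10-25", "2020-11-28"))
)
price <- data.table(
  date = as.IDate(c("2017-10-30", "2015-03-16", "2010-10-18",
                    "2012-10-29", "2012-01-09")),
  price.var1 = c(2.747339, 2.770282, 3.426595, 4.100958, 3.878889),
  price.var2 = c(2.739948, 2.740799, 3.415913, 4.147176, 3.930742)
)

# Copy the price date into a helper column that survives the join.
price[, price.date := date]

survey_price <- price[survey, on = .(date = actual.date), roll = "nearest"]

# Gap in days between each survey date and the price date it was matched to.
survey_price[, gap_days := abs(as.integer(date - price.date))]
survey_price
```

A large gap_days value, such as for the 2020 survey rows matched to a 2017 price in this small sample, signals that the price table has no nearby week for that survey date.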
Another option: convert the dates to numeric, use Closest() from DescTools to find the survey date closest to each price date, and replace the price date with that value so the two tables can be merged on it.
Example datasets
library(tidyverse)
library(lubridate)
survey <- tibble(
ID = sample(20000:40000, 9, replace = TRUE),
actual.date = seq(today() %m+% days(5), today() %m+% days(5) %m+% months(2),
"week")
)
price <- tibble(
date = seq(today(), today() %m+% months(2), by = "week"),
price_1 = sample(2:6, 9, replace = TRUE),
price_2 = sample(2:6, 9, replace = TRUE)
)
survey
# A tibble: 9 x 2
ID actual.date
<int> <date>
1 34592 2022-05-07
2 37846 2022-05-14
3 22715 2022-05-21
4 22510 2022-05-28
5 30143 2022-06-04
6 34348 2022-06-11
7 21538 2022-06-18
8 39802 2022-06-25
9 36493 2022-07-02
price
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-02 6 6
2 2022-05-09 3 2
3 2022-05-16 6 4
4 2022-05-23 6 2
5 2022-05-30 2 6
6 2022-06-06 2 4
7 2022-06-13 2 2
8 2022-06-20 3 5
9 2022-06-27 5 6
library(tidyverse)
library(lubridate)
library(DescTools)
price <- price %>%
mutate(date = Closest(survey$actual.date %>%
as.numeric, date %>%
as.numeric) %>%
as_date())
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-07 6 6
2 2022-05-14 3 2
3 2022-05-21 6 4
4 2022-05-28 6 2
5 2022-06-04 2 6
6 2022-06-11 2 4
7 2022-06-18 2 2
8 2022-06-25 3 5
9 2022-07-02 5 6
merge(survey, price, by.x = "actual.date", by.y = "date")
actual.date ID price_1 price_2
1 2022-05-07 34592 6 6
2 2022-05-14 37846 3 2
3 2022-05-21 22715 6 4
4 2022-05-28 22510 6 2
5 2022-06-04 30143 2 6
6 2022-06-11 34348 2 4
7 2022-06-18 21538 2 2
8 2022-06-25 39802 3 5
9 2022-07-02 36493 5 6
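The Closest() approach rewrites the price dates, which appears to line up cleanly only when every price date maps to a different survey date. A base R sketch that instead looks up the nearest price row for each survey row, using the question's sample data (the names nearest and survey_price are chosen here for illustration):

```r
survey <- data.frame(
  ID = c(20120377, 2020455822, 20126758, 20124241, 2020426572),
  actual.date = as.Date(c("2012-09-26", "2020-11-23", "2012-10-26",
                          "2012-10-25", "2020-11-28"))
)
price <- data.frame(
  date = as.Date(c("2017-10-30", "2015-03-16", "2010-10-18",
                   "2012-10-29", "2012-01-09")),
  price.var1 = c(2.747339, 2.770282, 3.426595, 4.100958, 3.878889),
  price.var2 = c(2.739948, 2.740799, 3.415913, 4.147176, 3.930742)
)

# For each survey row, find the index of the price row with the smallest
# absolute gap in days.
nearest <- vapply(seq_len(nrow(survey)), function(i) {
  which.min(abs(as.numeric(price$date - survey$actual.date[i])))
}, integer(1))

# Attach the matched price rows; every survey row is kept.
survey_price <- cbind(survey, price[nearest, ])
rownames(survey_price) <- NULL
survey_price
```

This compares every survey date against every price date, so it is only a reasonable choice for small to moderately sized tables.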
I'm trying to add a column to a Tidyquant tibble. Here's the code:
library(tidyquant)
symbol <- 'AAPL'
start_date <- as.Date('2022-01-01')
end_date <- as.Date('2022-03-31')
prices <- tq_get(symbol,
from = start_date,
to = end_date,
get = 'stock.prices')
head(prices)
# A tibble: 6 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2022-01-03 178. 183. 178. 182. 104487900 182.
2 AAPL 2022-01-04 183. 183. 179. 180. 99310400 179.
3 AAPL 2022-01-05 180. 180. 175. 175. 94537600 175.
4 AAPL 2022-01-06 173. 175. 172. 172 96904000 172.
5 AAPL 2022-01-07 173. 174. 171. 172. 86709100 172.
6 AAPL 2022-01-10 169. 172. 168. 172. 106765600 172.
Now, I'm attempting to add the change_on_day column (that's just the difference in the 'adjusted' prices between one day and the next) using the following:
prices$change_on_day <- diff(prices$adjusted)
The error message is:
Error: Assigned data `diff(prices$adjusted)` must be compatible with existing data.
x Existing data has 61 rows.
x Assigned data has 60 rows.
i Only vectors of size 1 are recycled.
How would I add this price difference column?
Thanks!
If you are trying to subtract the previous day's value from today's value, you should be able to do that with the lag() function:
prices %>%
mutate(change_on_day=adjusted-lag(adjusted,1))
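The original error arises because diff() returns one fewer element than its input (60 differences for 61 rows). A base R sketch of the same idea as the lag() approach, padding the first row with NA; the toy data frame below uses adjusted prices rounded from the head(prices) output in the question and is only for illustration:

```r
# Toy data: adjusted prices rounded from the head(prices) output.
prices <- data.frame(
  date = as.Date(c("2022-01-03", "2022-01-04", "2022-01-05",
                   "2022-01-06", "2022-01-07", "2022-01-10")),
  adjusted = c(182, 179, 175, 172, 172, 172)
)

# diff() yields n - 1 values, so prepend NA for the first day,
# which has no previous day to compare against.
prices$change_on_day <- c(NA, diff(prices$adjusted))
prices
```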
We can use tq_transmute with quantmod::periodReturn, setting the period argument to 'daily', to calculate daily returns. Note that this yields proportional changes (returns) rather than absolute price differences.
library(tidyquant)
symbol <- "AAPL"
start_date <- as.Date("2022-01-01")
end_date <- as.Date("2022-03-31")
prices <- tq_get(symbol,
from = start_date,
to = end_date,
get = "stock.prices"
)
stock_returns_daily <- prices %>%
  tq_transmute(
    select = adjusted,
    mutate_fun = periodReturn,
    period = "daily",
    col_rename = "change_on_day"
  )
stock_returns_daily
#> # A tibble: 61 × 2
#> date change_on_day
#> <date> <dbl>
#> 1 2022-01-03 0
#> 2 2022-01-04 -0.0127
#> 3 2022-01-05 -0.0266
#> 4 2022-01-06 -0.0167
#> 5 2022-01-07 0.000988
#> 6 2022-01-10 0.000116
#> 7 2022-01-11 0.0168
#> 8 2022-01-12 0.00257
#> 9 2022-01-13 -0.0190
#> 10 2022-01-14 0.00511
#> # … with 51 more rows
Created on 2022-04-18 by the reprex package (v2.0.1)
For more information check this vignette
I'm working with trip ticket data that includes a column with dates and times. I want to group trips into Morning (05:00-10:59), Lunch (11:00-12:59), Afternoon (13:00-17:59), Evening (18:00-23:59), and Dawn/Graveyard (00:00-04:59), and then count the number of trips (by counting the unique values in the trip_id column) in each of those categories.
However, I don't know how to group or summarize by time values. Is this possible in R?
trip_id start_time end_time day_of_week
1 CFA86D4455AA1030 2021-03-16 08:32:30 2021-03-16 08:36:34 Tuesday
2 30D9DC61227D1AF3 2021-03-28 01:26:28 2021-03-28 01:36:55 Sunday
3 846D87A15682A284 2021-03-11 21:17:29 2021-03-11 21:33:53 Thursday
4 994D05AA75A168F2 2021-03-11 13:26:42 2021-03-11 13:55:41 Thursday
5 DF7464FBE92D8308 2021-03-21 09:09:37 2021-03-21 09:27:33 Sunday
Here's a solution with hour() and case_when().
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
trip <- tibble(start_time = mdy_hm("1/1/2022 1:00") + minutes(seq(0, 700, 15)))
trip <- trip %>%
mutate(
hr = hour(start_time),
time_of_day = case_when(
hr >= 5 & hr < 11 ~ "morning",
hr >= 11 & hr < 13 ~ "afternoon",
TRUE ~ "fill in the rest yourself :)"
)
)
print(trip)
#> # A tibble: 47 x 3
#> start_time hr time_of_day
#> <dttm> <int> <chr>
#> 1 2022-01-01 01:00:00 1 fill in the rest yourself :)
#> 2 2022-01-01 01:15:00 1 fill in the rest yourself :)
#> 3 2022-01-01 01:30:00 1 fill in the rest yourself :)
#> 4 2022-01-01 01:45:00 1 fill in the rest yourself :)
#> 5 2022-01-01 02:00:00 2 fill in the rest yourself :)
#> 6 2022-01-01 02:15:00 2 fill in the rest yourself :)
#> 7 2022-01-01 02:30:00 2 fill in the rest yourself :)
#> 8 2022-01-01 02:45:00 2 fill in the rest yourself :)
#> 9 2022-01-01 03:00:00 3 fill in the rest yourself :)
#> 10 2022-01-01 03:15:00 3 fill in the rest yourself :)
#> # ... with 37 more rows
trips <- trip %>%
count(time_of_day)
print(trips)
#> # A tibble: 3 x 2
#> time_of_day n
#> <chr> <int>
#> 1 afternoon 7
#> 2 fill in the rest yourself :) 16
#> 3 morning 24
Created on 2022-03-21 by the reprex package (v2.0.1)
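The answer above leaves most buckets unfilled and labels 11:00-12:59 as "afternoon", whereas the question calls that slot Lunch. A base R sketch covering all five buckets from the question with cut(), using the five sample rows shown there; the UTC time zone and the lowercase bucket labels are assumptions made for illustration:

```r
trips <- data.frame(
  trip_id = c("CFA86D4455AA1030", "30D9DC61227D1AF3", "846D87A15682A284",
              "994D05AA75A168F2", "DF7464FBE92D8308"),
  start_time = as.POSIXct(c("2021-03-16 08:32:30", "2021-03-28 01:26:28",
                            "2021-03-11 21:17:29", "2021-03-11 13:26:42",
                            "2021-03-21 09:09:37"), tz = "UTC")  # assumed time zone
)

# Hour of day as an integer from 0 to 23.
hr <- as.integer(format(trips$start_time, "%H"))

# Left-closed intervals: [0,5) dawn, [5,11) morning, [11,13) lunch,
# [13,18) afternoon, [18,24) evening.
trips$time_of_day <- cut(
  hr,
  breaks = c(0, 5, 11, 13, 18, 24),
  right = FALSE,
  labels = c("dawn", "morning", "lunch", "afternoon", "evening")
)

# Count unique trip_id values per bucket; empty buckets are reported as 0.
unique_trips <- trips[!duplicated(trips$trip_id), ]
counts <- table(unique_trips$time_of_day)
counts
```

Because time_of_day is a factor, the counts come out in chronological bucket order rather than alphabetical order.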
I want to use the Prophet() function in R, but I cannot convert my column "YearWeek" to a Date column.
I have a column "YearWeek" that stores values from 201401 up to 201937 i.e. starting in 2014 week 1 up to 2019 week 37.
I don't know how to declare this column as a date in the form yyyy-ww needed to use the Prophet() function.
Does anyone know how to do this?
Thank you in advance.
One solution could be to append 01 to the end of your yyyyww-formatted values so that each one identifies a specific day.
Data:
library(tidyverse)
df <- cross2(2014:2019, str_pad(1:52, width = 2, pad = 0)) %>%
map_df(set_names, c("year", "week")) %>%
transmute(date = paste(year, week, sep = "")) %>%
arrange(date)
head(df)
#> # A tibble: 6 x 1
#> date
#> <chr>
#> 1 201401
#> 2 201402
#> 3 201403
#> 4 201404
#> 5 201405
#> 6 201406
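Now let's append the 01 and convert to a date. With the format "%Y%U%w", the appended 0 is read as the weekday (%w, where 0 is Sunday) and the trailing 1 is ignored, so each value maps to the Sunday that starts that week: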
df %>%
mutate(date = paste(date, "01", sep = ""),
new_date = as.Date(date, "%Y%U%w"))
#> # A tibble: 312 x 2
#> date new_date
#> <chr> <date>
#> 1 20140101 2014-01-05
#> 2 20140201 2014-01-12
#> 3 20140301 2014-01-19
#> 4 20140401 2014-01-26
#> 5 20140501 2014-02-02
#> 6 20140601 2014-02-09
#> 7 20140701 2014-02-16
#> 8 20140801 2014-02-23
#> 9 20140901 2014-03-02
#> 10 20141001 2014-03-09
#> # ... with 302 more rows
Created on 2019-10-10 by the reprex package (v0.3.0)
More info about a numeric week of the year can be found here.
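If the Monday of each week is preferred, a similar base R sketch appends a single weekday digit and parses it with %u (1 = Monday). This assumes the YearWeek values follow the %U convention, where weeks start on Sunday and week 1 begins on the first Sunday of the year, which may differ from ISO 8601 week numbering; the names year_week and ds are chosen here for illustration:

```r
# Sample YearWeek values in yyyyww form (first weeks and the last week mentioned).
year_week <- c(201401, 201402, 201937)

# Append weekday 1 (Monday) so each year-week identifies a single day,
# then parse year (%Y), week of year (%U) and weekday (%u).
ds <- as.Date(paste0(year_week, "1"), format = "%Y%U%u")
ds
```

It is worth checking a few converted dates against a calendar before modelling, since week numbering conventions vary between data sources.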