How to avoid timezone offset when using read_csv_arrow - r

Let's say I have a csv file. For example, this one, https://www.misoenergy.org/planning/generator-interconnection/GI_Queue/gi-interactive-queue/#
If I do
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-12 20:00:00 NA 2003-12-12 19:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-14 20:00:00 NA 2013-10-21 20:00:00 2015-12-31 19:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-07 19:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-11-30 19:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).
I can do Sys.setenv(TZ="GMT") before I load the file and then that avoids the offset issue.
Sys.setenv(TZ="GMT")
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-13 00:00:00 NA 2003-12-13 00:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-15 00:00:00 NA 2013-10-22 00:00:00 2016-01-01 00:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-08 00:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-12-01 00:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
While setting my session tz to GMT isn't really too onerous, I'm wondering if there's a way to have it either assume the file is the same as my local time zone and just keep it that way or if it wants to assume it's GMT in the file then just keep it in GMT regardless of my local timezone.

It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).
Actually, the timezone conversion you are seeing just happens when you print. You can see this if you save the data frame to a variable and print it before and after you change your current timezone:
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
test <- miso_queue %>% collect()
Sys.setenv(TZ="US/Pacific")
test[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-12 17:00:00
# 2 2012-05-14 17:00:00
# 3 1995-11-07 16:00:00
# 4 1998-11-30 16:00:00
# 5 1998-11-30 16:00:00
# 6 1998-11-30 16:00:00
# 7 1999-02-14 16:00:00
# 8 1999-02-14 16:00:00
# 9 1999-07-29 17:00:00
# 10 1999-08-12 17:00:00
# # … with 3,333 more rows
Sys.setenv(TZ="GMT")
test[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-13 00:00:00
# 2 2012-05-15 00:00:00
# 3 1995-11-08 00:00:00
# 4 1998-12-01 00:00:00
# 5 1998-12-01 00:00:00
# 6 1998-12-01 00:00:00
# 7 1999-02-15 00:00:00
# 8 1999-02-15 00:00:00
# 9 1999-07-30 00:00:00
# 10 1999-08-13 00:00:00
# # … with 3,333 more rows
However, the example you showed contains no time-of-day data, so you might be better off reading that column as a date instead of a timestamp. Unfortunately, I think Arrow currently only lets you parse a column as a date if you provide the schema for the whole table. One alternative is to convert the date columns after reading.
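A minimal base R sketch of both points, assuming the parsed values are midnight-UTC instants (as the GMT output in the question suggests). The data frame and column names here are made up for illustration and do not come from the MISO file:

```r
# One midnight-UTC instant, like the values shown in the GMT output above
x <- as.POSIXct("2013-09-13 00:00:00", tz = "UTC")

# The stored instant is the same; only the display time zone differs
format(x, "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(x, "%Y-%m-%d %H:%M:%S", tz = "America/New_York")

# Hypothetical collected table with one timestamp column
df <- data.frame(
  id = c("E002", "E291"),
  queue = as.POSIXct(c("2013-09-13 00:00:00", "2012-05-15 00:00:00"), tz = "UTC")
)

# Convert every POSIXct column to Date, stating the time zone explicitly
# so the result does not depend on the session's TZ setting
is_time <- vapply(df, inherits, logical(1), what = "POSIXct")
df[is_time] <- lapply(df[is_time], as.Date, tz = "UTC")
```

Passing tz explicitly to as.Date() is the design choice that matters here: the calendar date then no longer changes with Sys.setenv(TZ = ...).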

Related

Conditionally mutate column across list of dataframes in R

I am working with a large list of dataframes that use inconsistent date formats. I would like to conditionally mutate across the list so that any dataframe that contains a string will use one date format, and those that do not contain the string use another format. In other words, I want to distinguish between dataframes launched in year 2019 (which use mdy) and those launched in all others years (which use dmy).
The following code will conditionally mutate rows within a dataframe, but I am unsure how to conditionally mutate across the entire column.
dataframes %>% map(~.x %>%
mutate(date_time = if_else(str_detect(date_time, "/19 "),
mdy_hms(date_time), dmy_hms(date_time))))
Thank you!
edit
Data and code example. There are dataframes that contain a mixture of years.
library(tidyverse)
library(lubridate)
dataframes <- list(
tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2) # July 6th
)
dataframes %>%
map(~.x %>%
mutate(date_time = if_else(str_detect(date_time, "/19 "),
mdy_hms(date_time), dmy_hms(date_time)),
date = date(date_time),
month = month(date_time),
doy = yday(date_time)))
[[1]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2019-07-06 13:00:00 1 2019-07-06 7 187
2 2020-06-07 13:00:00 2 2020-06-07 6 159
[[2]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2020-07-06 13:00:00 1 2020-07-06 7 188
2 2021-07-06 13:00:00 2 2021-07-06 7 187
If you are trying to determine the format of the date column for the whole data.frame based on the presence of any date from 2019, then a small tweak of your code should work.
Instead of evaluating each record for the presence of /19 , you set the condition of the if_else() to be any(str_detect(...)) which returns TRUE if any of the values are TRUE. However the result of any() is always of length 1 so you then need to rep() the result to match the length of the whole data.frame using dplyr::n().
library(tidyverse)
library(lubridate)
dataframes <- list(
tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2) # July 6th
)
dataframes %>%
map( ~ .x %>%
mutate(
date_time = if_else(str_detect(date_time, "/19 ") %>%
any() %>%
rep(n()),
mdy_hms(date_time),
dmy_hms(date_time)),
date = date(date_time),
month = month(date_time),
doy = yday(date_time)
))
#> [[1]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2019-07-06 13:00:00 1 2019-07-06 7 187
#> 2 2020-07-06 13:00:00 2 2020-07-06 7 188
#>
#> [[2]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2020-07-06 13:00:00 1 2020-07-06 7 188
#> 2 2021-07-06 13:00:00 2 2021-07-06 7 187
Created on 2022-07-20 by the reprex package (v2.0.1)
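Because the condition is a single TRUE or FALSE for the whole data frame, a plain if/else is another way to write this, with no rep() needed. The sketch below is an alternative, not the code from the answer above; it assumes lubridate is installed, and parse_frame_dates is a hypothetical helper name:

```r
library(lubridate)

# Hypothetical helper: choose one parser for the whole vector
parse_frame_dates <- function(x) {
  if (any(grepl("/19 ", x, fixed = TRUE))) mdy_hms(x) else dmy_hms(x)
}

frames <- list(
  data.frame(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM")),
  data.frame(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"))
)

# Apply the helper to each data frame in the list
parsed <- lapply(frames, function(df) {
  df$date_time <- parse_frame_dates(df$date_time)
  df
})
```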

How to convert a class character into POSIXct POSIXt in r [duplicate]

I am currently trying to merge two dataframes that have the following format:
`Company Name` Date `Press Release`
<chr> <dttm> <chr>
1 ExxonMobil 2021-05-27 00:00:00 Mena Report
2 Shell 2021-05-27 00:00:00 Mena Report
3 JPMorgan 2021-05-27 00:00:00 Mena Report
4 Shell NA DeSmogBlog
5 ExxonMobil NA DeSmogBlog
6 ExxonMobil 2021-04-20 00:00:00 The Guardian
and
Date `Equity Price` `Company Name`
<chr> <dbl> <chr>
1 07/30/21 153. JPMorgan
2 07/29/21 153 JPMorgan
3 07/28/21 152. JPMorgan
4 07/27/21 151. JPMorgan
5 07/26/21 152. JPMorgan
6 07/23/21 151. JPMorgan
I need to merge them by 'Date', but I cannot convert the 'Date' column of the second dataset into POSIXct. I already tried the following code, but it does not work:
n.equity <- as.POSIXct(equity$Date)
You have to supply a format. In your case:
n.equity <- as.POSIXct(equity$Date, format = "%m/%d/%y")
Output:
[1] "2021-07-30 CEST" "2021-07-29 CEST" "2021-07-28 CEST" "2021-07-27 CEST"
[5] "2021-07-26 CEST" "2021-07-23 CEST"
Here is a complete example of how you can do it with the lubridate package:
library(lubridate)
library(dplyr)
df1 <- df1 %>%
mutate(Date = ymd_hms(Date))
df2 <- df2 %>%
mutate(Date = mdy(Date))
df1 %>%
full_join(df2, by="Date")
output:
# A tibble: 12 x 5
`Company Name.x` Date `Press Release` `Equity Price` `Company Name.y`
<chr> <dttm> <chr> <dbl> <chr>
1 ExxonMobil 2021-05-27 00:00:00 Mena Report NA NA
2 Shell 2021-05-27 00:00:00 Mena Report NA NA
3 JPMorgan 2021-05-27 00:00:00 Mena Report NA NA
4 Shell NA DeSmogBlog NA NA
5 ExxonMobil NA DeSmogBlog NA NA
6 ExxonMobil 2021-04-20 00:00:00 The Guardian NA NA
7 NA 2021-07-30 00:00:00 NA 153 JPMorgan
8 NA 2021-07-29 00:00:00 NA 153 JPMorgan
9 NA 2021-07-28 00:00:00 NA 152 JPMorgan
10 NA 2021-07-27 00:00:00 NA 151 JPMorgan
11 NA 2021-07-26 00:00:00 NA 152 JPMorgan
12 NA 2021-07-23 00:00:00 NA 151 JPMorgan
data
df1 <- tribble(
~`Company Name`, ~Date, ~`Press Release`,
"ExxonMobil", "2021-05-27 00:00:00", "Mena Report",
"Shell", "2021-05-27 00:00:00", "Mena Report",
"JPMorgan", "2021-05-27 00:00:00", "Mena Report",
"Shell", "NA", "DeSmogBlog",
"ExxonMobil", "NA", "DeSmogBlog",
"ExxonMobil", "2021-04-20 00:00:00", "The Guardian")
df2 <- tribble(
~Date, ~`Equity Price`, ~`Company Name`,
"07/30/21", 153., "JPMorgan",
"07/29/21", 153, "JPMorgan",
"07/28/21", 152., "JPMorgan",
"07/27/21", 151., "JPMorgan",
"07/26/21", 152., "JPMorgan",
"07/23/21", 151., "JPMorgan")
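A join on a date column only matches when both sides hold the same class and the same values. As a rough base R sketch, one option is to reduce both columns to Date before merging. The two small data frames below are illustrative stand-ins, not the asker's data, and merging on the company as well is an assumption about what is wanted:

```r
# Stand-in for the press-release table (Date stored as POSIXct)
press <- data.frame(
  company = c("JPMorgan", "Shell"),
  Date = as.POSIXct(c("2021-07-30 00:00:00", "2021-05-27 00:00:00"), tz = "UTC")
)

# Stand-in for the equity table (Date stored as text, month/day/2-digit year)
equity <- data.frame(
  company = c("JPMorgan", "JPMorgan"),
  Date = c("07/30/21", "07/29/21"),
  price = c(153, 153)
)

# Bring both key columns to class Date
press$Date <- as.Date(press$Date, tz = "UTC")
equity$Date <- as.Date(equity$Date, format = "%m/%d/%y")

# Full outer join on company and date
merged <- merge(press, equity, by = c("company", "Date"), all = TRUE)
```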

Return Tibble as Json

I am new to R and I am trying to use the hansard package.
Is there any way I could export the results of any of the queries as JSON rather than as a tibble?
library(hansard)
library(tibble)
#example query
z <- mp_vote_record(172, "aye", start_date = "2017-01-01", end_date = "2017-05-03")
print(z)
Giving the output:
# A tibble: 38 x 5
about title uin date_value date_datatype
<chr> <chr> <chr> <dttm> <chr>
1 722300 Early Parl~ CD:20~ 2017-04-19 00:00:00 POSIXct
2 714865 Pension Sc~ CD:20~ 2017-03-29 00:00:00 POSIXct
3 714866 Pension Sc~ CD:20~ 2017-03-29 00:00:00 POSIXct
4 714868 Pension Sc~ CD:20~ 2017-03-29 00:00:00 POSIXct
5 713962 Bus Servic~ CD:20~ 2017-03-27 00:00:00 POSIXct
6 713963 Bus Servic~ CD:20~ 2017-03-27 00:00:00 POSIXct
7 714005 Bus Servic~ CD:20~ 2017-03-27 00:00:00 POSIXct
8 710264 Reproducti~ CD:20~ 2017-03-13 00:00:00 POSIXct
9 708770 Children a~ CD:20~ 2017-03-07 00:00:00 POSIXct
10 708773 Children a~ CD:20~ 2017-03-07 00:00:00 POSIXct
# ... with 28 more rows
You can convert the tibble to JSON with the jsonlite package. An example using the built-in data set iris:
library(dplyr)
library(jsonlite)
mydata <- as_tibble(iris)
toJSON(mydata)
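For a result with a date-time column like the one in the question, a sketch along these lines may be closer to what is needed. It assumes jsonlite is installed; the votes data frame is a small made-up stand-in for the hansard result, and the output file is a temporary file:

```r
library(jsonlite)

# Made-up stand-in for a query result with a POSIXct column
votes <- data.frame(
  about = c("722300", "714865"),
  date_value = as.POSIXct(c("2017-04-19", "2017-03-29"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Serialise to a JSON string, writing date-times in ISO 8601 form
json_text <- toJSON(votes, POSIXt = "ISO8601", pretty = TRUE)

# Or write straight to a file, then read it back to check the content
out_file <- tempfile(fileext = ".json")
write_json(votes, out_file, POSIXt = "ISO8601")
round_trip <- fromJSON(out_file)
```

Note that the date-times come back from fromJSON() as character strings, so they would need to be parsed again after reading.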

R- create dataset by removing duplicates based on a condition - filter

I have a data frame where for each day, I have several prices.
I would like to modify my data frame with the following code :
newdf <- Data %>%
filter(
if (Data$Date == Data$Echeance) {
Data$Close == lag(Data$Close,1)
} else {
Data$Close == Data$Close
}
)
However, it is not giving me what I want, that is :
create a new data frame where the variable Close takes its normal value, unless the day of Date is equal to the day of Echeance. In this case, take the following Close value.
I added filter because I wanted to remove the duplicate dates, and keep only one date per day where Close satisfies the condition above.
There is no error message, it just doesn't give me the right database.
Here is a glimpse of my data:
Date Echeance Compens. Open Haut Bas Close
1 1998-03-27 00:00:00 1998-09-10 00:00:00 125. 828 828 820 820. 197
2 1998-03-27 00:00:00 1998-11-10 00:00:00 128. 847 847 842 842. 124
3 1998-03-27 00:00:00 1999-01-11 00:00:00 131. 858 858 858 858. 2
4 1998-03-30 00:00:00 1998-09-10 00:00:00 125. 821 821 820 820. 38
5 1998-03-30 00:00:00 1998-11-10 00:00:00 129. 843 843 843 843. 1
6 1998-03-30 00:00:00 1999-01-11 00:00:00 131. 860 860 860 860. 5
Thanks a lot in advance.
Sounds like a use case for ifelse, with dplyr:
library(dplyr)
Data %>%
mutate(Close = ifelse(Date==Echeance, lead(Close,1), Close))
Here is an example:
dat %>%
mutate(var_new = ifelse(date1==date2, lead(var,1), var))
# A tibble: 3 x 4
# date1 date2 var var_new
# <date> <date> <int> <int>
# 1 2018-03-27 2018-03-27 10 11
# 2 2018-03-28 2018-01-01 11 11
# 3 2018-03-29 2018-02-01 12 12
The function lead shifts the vector by one position, so each row sees the value from the following row (the last row gets NA). Also note that I created var_new just to show the difference, but you can overwrite var directly in mutate.
Data used:
dat <- tibble(date1 = seq(from=as.Date("2018-03-27"), to=as.Date("2018-03-29"), by="day"),
date2 = c(as.Date("2018-03-27"), as.Date("2018-01-01"), as.Date("2018-02-01")),
var = 10:12)
dat
# A tibble: 3 x 3
# date1 date2 var
# <date> <date> <int>
# 1 2018-03-27 2018-03-27 10
# 2 2018-03-28 2018-01-01 11
# 3 2018-03-29 2018-02-01 12
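To make the shift explicit, here is the same calculation sketched in base R, with c(var[-1], NA) standing in for dplyr's lead(var, 1):

```r
dat <- data.frame(
  date1 = as.Date(c("2018-03-27", "2018-03-28", "2018-03-29")),
  date2 = as.Date(c("2018-03-27", "2018-01-01", "2018-02-01")),
  var = 10:12
)

# Each element is replaced by the next one; the last position becomes NA
next_var <- c(dat$var[-1], NA)

# Take the following value only where the two dates coincide
dat$var_new <- ifelse(dat$date1 == dat$date2, next_var, dat$var)
```

If the last row's dates coincide, var_new is NA there, because there is no following value to take.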

R: extract hour from variable format timestamp

My data frame has timestamps with and without seconds, and leading zeros on months and hours are used inconsistently, i.e. 01 or 1
library(tidyverse)
df <- data_frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 01:03', '12/30/2015 23:04:25'))
cust timestamp
A 5/31/2016 1:03:12
A 05/25/2016 01:06
B 6/16/2016 01:03
B 12/30/2015 23:04:25
How to extract hours into a separate column? The desired output:
cust timestamp hours
A 5/31/2016 1:03:12 1
A 05/25/2016 01:06 1
B 6/16/2016 9:03 9
B 12/30/2015 23:04:25 23
I prefer the answer with tidyverse and mutate, but my attempt fails to extract hours correctly:
df %>% mutate(hours=strptime(timestamp, '%H') %>% as.character() )
# A tibble: 4 × 3
cust timestamp hours
<chr> <chr> <chr>
1 A 5/31/2016 1:03:12 2016-10-31 05:00:00
2 A 05/25/2016 01:06 2016-10-31 05:00:00
3 B 6/16/2016 01:03 2016-10-31 06:00:00
4 B 12/30/2015 23:04:25 2016-10-31 12:00:00
Try this:
library(lubridate)
df <- data.frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 09:03', '12/30/2015 23:04:25'))
df %>% mutate(hours=hour(strptime(timestamp, '%m/%d/%Y %H:%M')) %>% as.character() )
cust timestamp hours
1 A 5/31/2016 1:03:12 1
2 A 05/25/2016 01:06 1
3 B 6/16/2016 09:03 9
4 B 12/30/2015 23:04:25 23
Here is a solution that appends 00 for the seconds when they are missing, then converts to a date using lubridate and extracts the hours using format. Note, if you don't want the 00:00 at the end of the hours, you can just eliminate them from the output format in format:
df %>%
mutate(
cleanTime = ifelse(grepl(":[0-9][0-9]:", timestamp)
, timestamp
, paste0(timestamp, ":00")) %>% mdy_hms
, hour = format(cleanTime, "%H:00:00")
)
returns:
cust timestamp cleanTime hour
<chr> <chr> <dttm> <chr>
1 A 5/31/2016 1:03:12 2016-05-31 01:03:12 01:00:00
2 A 05/25/2016 01:06 2016-05-25 01:06:00 01:00:00
3 B 6/16/2016 01:03 2016-06-16 01:03:00 01:00:00
4 B 12/30/2015 23:04:25 2015-12-30 23:04:25 23:00:00
Your timestamp is a character string, so you need to parse it as a date-time (with as.POSIXct or strptime, for example) before you can extract components such as the hour. Note that as.Date would drop the time of day.
You may have to go through some string manipulation to get consistently formatted data before you convert it. Append :00 to timestamps with missing seconds, using grepl(), paste0() or other regex functions. Afterwards do as.POSIXct(df$timestamp, format = '%m/%d/%Y %H:%M:%S'), and then you will be able to use format(..., '%H') to extract the hours.
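The padding idea can be sketched in base R as follows. This is an illustration using the timestamps from the question; the time zone is set to UTC only so that the result does not depend on the session:

```r
ts <- c("5/31/2016 1:03:12", "05/25/2016 01:06",
        "6/16/2016 01:03", "12/30/2015 23:04:25")

# TRUE where the string already ends in hours:minutes:seconds
has_seconds <- grepl("[0-9]+:[0-9]+:[0-9]+$", ts)

# Append ":00" only where the seconds are missing
padded <- ifelse(has_seconds, ts, paste0(ts, ":00"))

# Parse with a single format, then pull out the hour as an integer
parsed <- as.POSIXct(padded, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
hours <- as.integer(format(parsed, "%H"))
```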
