How to convert a character class into POSIXct/POSIXt in R [duplicate]

This question already has answers here:
Convert date-time string to class Date
(4 answers)
Closed 1 year ago.
I am currently trying to merge two dataframes that have the following format:
  `Company Name` Date                `Press Release`
  <chr>          <dttm>              <chr>
1 ExxonMobil     2021-05-27 00:00:00 Mena Report
2 Shell          2021-05-27 00:00:00 Mena Report
3 JPMorgan       2021-05-27 00:00:00 Mena Report
4 Shell          NA                  DeSmogBlog
5 ExxonMobil     NA                  DeSmogBlog
6 ExxonMobil     2021-04-20 00:00:00 The Guardian
and
  Date     `Equity Price` `Company Name`
  <chr>             <dbl> <chr>
1 07/30/21           153. JPMorgan
2 07/29/21           153  JPMorgan
3 07/28/21           152. JPMorgan
4 07/27/21           151. JPMorgan
5 07/26/21           152. JPMorgan
6 07/23/21           151. JPMorgan
I need to merge them by 'Date', but I cannot convert the 'Date' column of the second dataset into POSIXct/POSIXt. I already tried the following code, but it does not work:
n.equity <- as.POSIXct(equity$Date)

You have to supply a format. In your case:
n.equity <- as.POSIXct(equity$Date, format = "%m/%d/%y")
Output:
[1] "2021-07-30 CEST" "2021-07-29 CEST" "2021-07-28 CEST" "2021-07-27 CEST"
[5] "2021-07-26 CEST" "2021-07-23 CEST"
Here is a complete example of how you can do it with the lubridate package:
library(lubridate)
library(dplyr)
df1 <- df1 %>%
mutate(Date = ymd_hms(Date))
df2 <- df2 %>%
mutate(Date = mdy(Date))
df1 %>%
full_join(df2, by="Date")
Output:
# A tibble: 12 x 5
   `Company Name.x` Date                `Press Release` `Equity Price` `Company Name.y`
   <chr>            <dttm>              <chr>                    <dbl> <chr>
 1 ExxonMobil       2021-05-27 00:00:00 Mena Report                 NA NA
 2 Shell            2021-05-27 00:00:00 Mena Report                 NA NA
 3 JPMorgan         2021-05-27 00:00:00 Mena Report                 NA NA
 4 Shell            NA                  DeSmogBlog                  NA NA
 5 ExxonMobil       NA                  DeSmogBlog                  NA NA
 6 ExxonMobil       2021-04-20 00:00:00 The Guardian                NA NA
 7 NA               2021-07-30 00:00:00 NA                         153 JPMorgan
 8 NA               2021-07-29 00:00:00 NA                         153 JPMorgan
 9 NA               2021-07-28 00:00:00 NA                         152 JPMorgan
10 NA               2021-07-27 00:00:00 NA                         151 JPMorgan
11 NA               2021-07-26 00:00:00 NA                         152 JPMorgan
12 NA               2021-07-23 00:00:00 NA                         151 JPMorgan
data
df1 <- tribble(
~`Company Name`, ~Date, ~`Press Release`,
"ExxonMobil", "2021-05-27 00:00:00", "Mena Report",
"Shell", "2021-05-27 00:00:00", "Mena Report",
"JPMorgan", "2021-05-27 00:00:00", "Mena Report",
"Shell", "NA", "DeSmogBlog",
"ExxonMobil", "NA", "DeSmogBlog",
"ExxonMobil", "2021-04-20 00:00:00", "The Guardian")
df2 <- tribble(
~Date, ~`Equity Price`, ~`Company Name`,
"07/30/21", 153., "JPMorgan",
"07/29/21", 153, "JPMorgan",
"07/28/21", 152., "JPMorgan",
"07/27/21", 151., "JPMorgan",
"07/26/21", 152., "JPMorgan",
"07/23/21", 151., "JPMorgan")
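For comparison, the same idea also works in base R with merge() — a minimal sketch on toy data (simplified column names, not the exact ones above):

```r
df1 <- data.frame(Company = c("ExxonMobil", "Shell"),
                  Date = c("2021-05-27 00:00:00", "2021-05-27 00:00:00"))
df2 <- data.frame(Date = c("07/30/21", "05/27/21"),
                  EquityPrice = c(153, 151))

# bring both Date columns to the common Date class first
df1$Date <- as.Date(df1$Date)                      # ISO-style timestamps
df2$Date <- as.Date(df2$Date, format = "%m/%d/%y") # m/d/y strings

# all = TRUE gives a full outer join, like full_join()
merged <- merge(df1, df2, by = "Date", all = TRUE)
```

The key point is the same as in the lubridate answer: both Date columns must be the same class before joining.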

Related

How to avoid timezone offset when using read_csv_arrow

Let's say I have a csv file. For example, this one, https://www.misoenergy.org/planning/generator-interconnection/GI_Queue/gi-interactive-queue/#
If I do
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-12 20:00:00 NA 2003-12-12 19:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-14 20:00:00 NA 2013-10-21 20:00:00 2015-12-31 19:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-07 19:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-11-30 19:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).
I can do Sys.setenv(TZ="GMT") before I load the file and then that avoids the offset issue.
Sys.setenv(TZ="GMT")
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-13 00:00:00 NA 2003-12-13 00:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-15 00:00:00 NA 2013-10-22 00:00:00 2016-01-01 00:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-08 00:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-12-01 00:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
While setting my session tz to GMT isn't really too onerous, I'm wondering if there's a way to have it either assume the file is the same as my local time zone and just keep it that way or if it wants to assume it's GMT in the file then just keep it in GMT regardless of my local timezone.
It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).
Actually, the timezone conversion you are seeing just happens when you print. You can see this if you save the data frame to a variable and print it before and after you change your current timezone:
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
df <- miso_queue %>% collect()
Sys.setenv(TZ="US/Pacific")
df[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-12 17:00:00
# 2 2012-05-14 17:00:00
# 3 1995-11-07 16:00:00
# 4 1998-11-30 16:00:00
# 5 1998-11-30 16:00:00
# 6 1998-11-30 16:00:00
# 7 1999-02-14 16:00:00
# 8 1999-02-14 16:00:00
# 9 1999-07-29 17:00:00
# 10 1999-08-12 17:00:00
# # … with 3,333 more rows
Sys.setenv(TZ="GMT")
df[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-13 00:00:00
# 2 2012-05-15 00:00:00
# 3 1995-11-08 00:00:00
# 4 1998-12-01 00:00:00
# 5 1998-12-01 00:00:00
# 6 1998-12-01 00:00:00
# 7 1999-02-15 00:00:00
# 8 1999-02-15 00:00:00
# 9 1999-07-30 00:00:00
# 10 1999-08-13 00:00:00
# # … with 3,333 more rows
However, in the example you showed there is no time data, so you might be better off reading that column as a date instead of a timestamp. Unfortunately, right now I think Arrow only lets you parse a column as a date if you provide the schema for the whole table. One alternative would be to parse the date columns after reading.
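A sketch of that last option — convert the timestamp column after collecting, pinning the timezone so the calendar day doesn't shift with the session's local zone (the file and column name are taken from the example above):

```r
library(arrow)
library(dplyr)

miso_queue <- read_csv_arrow("GI Interactive Queue.csv",
                             as_data_frame = FALSE,
                             timestamp_parsers = "%m/%d/%Y")
df <- miso_queue %>% collect()

# as.Date() with an explicit tz recovers the date the parser intended,
# regardless of what Sys.timezone() happens to be
df$`Queue Date` <- as.Date(df$`Queue Date`, tz = "GMT")
```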

R convert "Y-m-d" or "m/d/Y" to the same format

I have a huge (~10,000,000 rows) dataframe with a column that consists of dates, i.e.:
df <- data.frame(StartDate = as.character(c("2014-08-20 11:59:38",
"2014-08-21 16:17:44",
"2014-08-22 19:02:10",
"9/1/2014 08:05:13",
"9/2/2014 15:13:28",
"9/3/2014 00:22:01")))
The problem is that date formats are mixed - I would like to standardise them so as to get:
StartDate
1 2014-08-20
2 2014-08-21
3 2014-08-22
4 2014-09-01
5 2014-09-02
6 2014-09-03
1. as.Date() approach
as.Date("2014-08-31 23:59:38", "%m/%d/%Y")
as.Date("9/1/2014 00:00:28", "%m/%d/%Y")
gives
[1] NA
[1] "2014-09-01"
2. lubridate approach
dmy("9/1/2014 00:00:28")
mdy("9/1/2014 00:00:28")
dmy("2014-08-31 23:59:38")
mdy("2014-08-31 23:59:38")
in each case returns
[1] NA
Warning message:
All formats failed to parse. No formats found.
Is there any neat solution to that?
It may be easier to use parse_date from the parsedate package:
library(parsedate)
df$StartDate <- as.Date(parse_date(df$StartDate))
Output:
> df$StartDate
[1] "2014-08-20" "2014-08-21" "2014-08-22" "2014-09-01" "2014-09-02" "2014-09-03"
I have just found out that anytime::anydate extracts the dates directly and straightforwardly:
library(anytime)
library(tidyverse)
df %>%
mutate(Date = anydate(StartDate))
#> StartDate Date
#> 1 2014-08-20 11:59:38 2014-08-20
#> 2 2014-08-21 16:17:44 2014-08-21
#> 3 2014-08-22 19:02:10 2014-08-22
#> 4 9/1/2014 08:05:13 2014-09-01
#> 5 9/2/2014 15:13:28 2014-09-02
#> 6 9/3/2014 00:22:01 2014-09-03
Another solution, based on lubridate:
library(tidyverse)
library(lubridate)
df %>%
mutate(Date = if_else(!str_detect(StartDate, "/"),
                      date(ymd_hms(StartDate, quiet = TRUE)),
                      date(mdy_hms(StartDate, quiet = TRUE))))
#> StartDate Date
#> 1 2014-08-20 11:59:38 2014-08-20
#> 2 2014-08-21 16:17:44 2014-08-21
#> 3 2014-08-22 19:02:10 2014-08-22
#> 4 9/1/2014 08:05:13 2014-09-01
#> 5 9/2/2014 15:13:28 2014-09-02
#> 6 9/3/2014 00:22:01 2014-09-03
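lubridate itself can also handle the mixed formats in one pass via parse_date_time, which tries a vector of candidate orders per element (a small sketch):

```r
library(lubridate)

x <- c("2014-08-20 11:59:38", "9/1/2014 08:05:13")

# each element is matched against the orders until one fits
as.Date(parse_date_time(x, orders = c("Ymd HMS", "mdY HMS")))
#> [1] "2014-08-20" "2014-09-01"
```

This avoids the manual str_detect branching, at the cost of trusting the order list to disambiguate correctly.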

How to copy data from 1 dataframe into another based on several conditions

In my df1 (with columns df1$id, df1$datetime_interval, df1$datetime_event and df1$event) I'd like to put data from df2 (with columns df2$id and df2$datetime_event) based on these conditions:
if df1$id and df2$id match
and if df2$datetime_event is within df1$datetime_interval,
then I want the value of df2$datetime_event copied into the df1$datetime_event column of that particular row in df1, and a string (for instance "yes") in df1$event.
If the conditions aren't met, I want no result (NA).
So:
df1
ID datetime_interval datetime_event event
1 2019-04-19 21:50:00 UTC--2019-04-20 21:31:00 UTC NA NA
1 2019-07-02 04:23:00 UTC--2019-07-02 08:51:00 UTC NA NA
2 2019-07-04 19:45:00 UTC--2019-07-05 00:30:00 UTC NA NA
3 2019-06-07 08:55:00 UTC--2019-06-07 14:43:00 UTC NA NA
3 2019-05-06 17:18:00 UTC--2019-05-06 23:18:00 UTC NA NA
6 2019-08-02 22:00:00 UTC--2019-08-04 03:10:00 UTC NA NA
df2
ID datetime_event
1 2019-04-19 21:55:00
3 2019-05-06 21:23:00
5 2019-07-04 19:45:00
6 2019-05-06 17:18:00
6 2019-08-03 10:10:00
I have tried some things, but it didn't work out like I want it to. I'm still missing some steps and I don't know how to move on from this. This is what I have so far:
for (i in seq_along(df1$id)) {
  for (j in seq_along(df2$id)) {
    if (df2$id[j] == df1$id[i]) {
      if (df2$datetime_event[j] %within% df1$datetime_interval[i]) {
        df1$datetime_event[i] <- df2$datetime_event[j]
      }
    }
  }
}
my desired outcome is this:
df1
ID datetime_interval datetime_event event
1 2019-04-19 21:50:00 UTC--2019-04-20 21:31:00 UTC 2019-04-19 21:55:00 yes
1 2019-07-02 04:23:00 UTC--2019-07-02 08:51:00 UTC NA NA
2 2019-07-04 19:45:00 UTC--2019-07-05 00:30:00 UTC NA NA
3 2019-06-07 08:55:00 UTC--2019-06-07 14:43:00 UTC NA NA
3 2019-05-06 17:18:00 UTC--2019-05-06 23:18:00 UTC 2019-05-06 21:23:00 yes
6 2019-08-02 22:00:00 UTC--2019-08-04 03:10:00 UTC 2019-08-03 10:10:00 yes
Thank you in advance for all new input! Cause I'm stuck...
dput(df1)
structure(list(ID = c(1, 1, 2, 3, 3, 6), datetime_interval = c("2019-04-19 21:50:00 UTC--2019-04-20 21:31:00 UTC",
"2019-07-02 04:23:00 UTC--2019-07-02 08:51:00 UTC", "2019-07-04 19:45:00 UTC--2019-07-05 00:30:00 UTC",
"2019-06-07 08:55:00 UTC--2019-06-07 14:43:00 UTC", "2019-05-06 17:18:00 UTC--2019-05-06 23:18:00 UTC",
"2019-08-02 22:00:00 UTC--2019-08-04 03:10:00 UTC"), datetime_event = c("NA",
"NA", "NA", "NA", "NA", "NA"), event = c("NA", "NA", "NA", "NA",
"NA", "NA")), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
dput(df2)
structure(list(ID = c(1, 3, 5, 6, 6), datetime_event = c("2019-04-19 21:55:00 UTC",
"2019-05-06 21:23:00 UTC", "2019-07-04 19:45:00 UTC", "2019-05-06 17:18:00 UTC",
"2019-08-03 10:10:00 UTC")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Tricky problem. I think this works:
library(dplyr)
library(tidyr)
# convert datetime_interval to datetime class start and end columns
# and add row IDs
df1 = df1 %>%
separate(datetime_interval, into = c("start", "end"), sep = "--") %>%
mutate_at(vars(start, end), as.POSIXct) %>%
select(-datetime_event, -event) %>%
mutate(row_id = row_number())
# convert datetime event to datetime class
df2 = df2 %>%
mutate(datetime_event = as.POSIXct(datetime_event))
# join and filter
df1 %>% left_join(df2, by = "ID") %>%
mutate(
datetime_event = ifelse(datetime_event >= start & datetime_event <= end, datetime_event, NA),
event = ifelse(is.na(datetime_event), NA, "yes")
) %>%
arrange(row_id, datetime_event) %>%
group_by(row_id) %>%
slice(1)
# # A tibble: 6 x 6
# # Groups: row_id [6]
# ID start end row_id datetime_event event
# <dbl> <dttm> <dttm> <int> <dbl> <chr>
# 1 1 2019-04-19 21:50:00 2019-04-20 21:31:00 1 1555725300 yes
# 2 1 2019-07-02 04:23:00 2019-07-02 08:51:00 2 NA NA
# 3 2 2019-07-04 19:45:00 2019-07-05 00:30:00 3 NA NA
# 4 3 2019-06-07 08:55:00 2019-06-07 14:43:00 4 NA NA
# 5 3 2019-05-06 17:18:00 2019-05-06 23:18:00 5 1557192180 yes
# 6 6 2019-08-02 22:00:00 2019-08-04 03:10:00 6 1564841400 yes
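One caveat with the output above: base ifelse() drops the POSIXct class, which is why datetime_event prints as a number of seconds. A hedged variant of that step using dplyr::if_else, which is type-stable, shown on toy data:

```r
library(dplyr)

d <- tibble(
  start = as.POSIXct("2019-04-19 21:50:00", tz = "UTC"),
  end   = as.POSIXct("2019-04-20 21:31:00", tz = "UTC"),
  datetime_event = as.POSIXct("2019-04-19 21:55:00", tz = "UTC")
)

d %>%
  mutate(
    # if_else() keeps the POSIXct class; the NA must be typed to match
    datetime_event = if_else(datetime_event >= start & datetime_event <= end,
                             datetime_event, as.POSIXct(NA)),
    event = if_else(is.na(datetime_event), NA_character_, "yes")
  )
```

With this change, datetime_event stays a <dttm> column in the joined result.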

Dates with dttm format appear with a strange format when displayed in a DT:: datatable

I have a dataframe loaded from a csv which looks like:
A tibble: 5 x 8
ID EventType EventDate EventValue EventValueExt1 EventValueExt2 EventValueExt3 EventValueExt4
<dbl> <chr> <dttm> <chr> <dbl> <dbl> <dbl> <dbl>
1 12340 steps 2019-11-26 21:18:00 3017 NA NA NA NA
2 12339 steps 2019-11-25 14:23:00 3016 NA NA NA NA
3 12338 steps 2019-11-25 14:00:00 3015 NA NA NA NA
4 12337 geo_logging 2019-11-22 19:10:00 40.748498,-73.9933~ 16.4 0 16.8 0
5 12336 geo_logging 2019-11-22 19:09:00 40.7484843,-73.993~ 22.2 0 16.8 0
Then I try to create a DT::datatable with: datatable(device1_report1583417393205)
and in the resulting table the datetime column is displayed in a strange format. I try
class(device1_report1583417393205$EventDate)
[1] "POSIXct" "POSIXt"
and then I try:
library(lubridate)
library(chron)
device1_report1583417393205$EventDate <- ymd_hms(device1_report1583417393205$EventDate)
chron(dates = format(device1_report1583417393205$EventDate, '%Y-%m-%d'), time = format(device1_report1583417393205$EventDate, "%H:%M:%S"),
format = c('y-m-d', 'h:m:s'))
but I still get the same result in my DT. I also tried making them factors, but still the same. Any ideas?
EDIT: Reproducible example
require(lubridate)
require(dplyr)
df = data.frame(timestring = c("2015-12-12 13:34:56", "2015-12-14 16:23:32"),
localzone = c("America/Los_Angeles", "America/New_York"), stringsAsFactors = F)
df$moment = as.POSIXct(df$timestring, format="%Y-%m-%d %H:%M:%S", tz="UTC")
df = df %>% rowwise() %>% mutate(localtime = force_tz(moment, localzone))
df
DT::datatable(df)
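One option worth trying (a sketch; formatDate is DT's helper for client-side date rendering) is to format the datetime column explicitly:

```r
library(DT)

df <- data.frame(EventDate = as.POSIXct(c("2019-11-26 21:18:00",
                                          "2019-11-25 14:23:00"),
                                        tz = "UTC"))

# toLocaleString renders the value using the browser's locale conventions;
# other methods include "toDateString" and "toISOString"
datatable(df) %>% formatDate("EventDate", method = "toLocaleString")
```

The formatting happens in JavaScript on the rendered table, so the underlying POSIXct column is untouched.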

Calculate Log Difference for Each Day in R, Producing NA for the First Observation of Each Day

Problem: Calculate the difference in log for each day (group by each day). The ideal result should produce NA for the first observation for each day.
library(dplyr)
library(tidyverse)
library(tibble)
library(lubridate)
df <- tibble(
  t = c("2019-10-01 09:30", "2019-10-01 09:35", "2019-10-01 09:40",
        "2019-10-02 09:30", "2019-10-02 09:35", "2019-10-02 09:40",
        "2019-10-03 09:30", "2019-10-03 09:35", "2019-10-03 09:40"),
  v = c(105.0061, 104.891, 104.8321, 104.5552, 104.4407, 104.5837,
        104.5534, 103.6992, 103.5851)
) # data
# my attempt
df %>%
# create day
mutate(day = day(t)) %>%
# group by day
group_by(day) %>%
# calculate log difference and append column
mutate(logdif = diff(log(df$v)))
The problem is
Error: Column `logdif` must be length 3 (the group size) or one, not 8
What I need:
[1] NA -0.0010967280 -0.0005616930 NA -0.0010957154
[6] 0.0013682615 NA -0.0082035450 -0.0011009036
Never use $ inside dplyr pipes; also, you need to prepend NA to the diff output:
library(dplyr)
df %>%
mutate(day = lubridate::day(t)) %>%
group_by(day) %>%
mutate(logdif = c(NA, diff(log(v))))
# t v day logdif
# <chr> <dbl> <int> <dbl>
#1 2019-10-01 09:30 105. 1 NA
#2 2019-10-01 09:35 105. 1 -0.00110
#3 2019-10-01 09:40 105. 1 -0.000562
#4 2019-10-02 09:30 105. 2 NA
#5 2019-10-02 09:35 104. 2 -0.00110
#6 2019-10-02 09:40 105. 2 0.00137
#7 2019-10-03 09:30 105. 3 NA
#8 2019-10-03 09:35 104. 3 -0.00820
#9 2019-10-03 09:40 104. 3 -0.00110
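An equivalent formulation (a sketch) avoids padding diff() by hand: lag() is already NA on the first row of each group, so the NA appears for free:

```r
library(dplyr)
library(lubridate)

df <- tibble(
  t = c("2019-10-01 09:30", "2019-10-01 09:35", "2019-10-02 09:30"),
  v = c(105.0061, 104.891, 104.5552)
)

df %>%
  group_by(day = day(t)) %>%
  mutate(logdif = log(v) - log(lag(v))) %>%  # NA wherever lag() has no value
  ungroup()
```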
