I have a dataset with a date-time vector (format is m/d/y h:m) that looks like this:
june2018_2$datetime
[1] "6/1/2018 1:00" "6/1/2018 2:00" "6/1/2018 3:00" "6/1/2018 4:00"
And I have 61 other variables that are all numeric (with some already missing values indicated with 'NA'). My date time vector is missing some hourly slots and I want to make the date-time vector full and fill in missing spots in the other 61 variables with 'NA'. I tried to use what's already out there but I can't seem to find some code or function that works for what I'm specifically working with. Any tips?
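If your datetime column is not already POSIXct, convert it first inside mutate. Then complete() can fill in the missing rows by the hour; the other columns in the data frame will be NA for the added rows. Note that first() and last() assume the rows are already in time order.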
library(tidyverse)
df %>%
mutate(datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M")) %>%
complete(datetime = seq(from = first(datetime), to = last(datetime), by = "hours"))
For example, if you have test data:
set.seed(123)
df <- data.frame(
datetime = c("6/1/2018 1:00", "6/1/2018 3:00", "6/1/2018 5:00", "6/1/2018 9:00"),
var1 = sample(10,4)
)
The output would be:
# A tibble: 9 x 2
datetime var1
<dttm> <int>
1 2018-06-01 01:00:00 3
2 2018-06-01 02:00:00 NA
3 2018-06-01 03:00:00 10
4 2018-06-01 04:00:00 NA
5 2018-06-01 05:00:00 2
6 2018-06-01 06:00:00 NA
7 2018-06-01 07:00:00 NA
8 2018-06-01 08:00:00 NA
9 2018-06-01 09:00:00 8
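The same gap-filling idea can be sketched in base R without tidyverse: build the full hourly sequence, then left-join the original data onto it. This is a minimal sketch, not the answer's own method; the `tz = "UTC"` setting and the names `full` and `filled` are assumptions made for illustration.

```r
# Test data from the answer above, with var1 typed in by hand
df <- data.frame(
  datetime = c("6/1/2018 1:00", "6/1/2018 3:00", "6/1/2018 5:00", "6/1/2018 9:00"),
  var1 = c(3, 10, 2, 8)
)

# Parse the m/d/y h:m strings; UTC is an assumption that avoids DST gaps
df$datetime <- as.POSIXct(df$datetime, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Every hour between the earliest and latest timestamp
full <- data.frame(datetime = seq(min(df$datetime), max(df$datetime), by = "hour"))

# Left join: hours with no observation get NA in all other columns
filled <- merge(full, df, by = "datetime", all.x = TRUE)
print(filled)
```

Using min() and max() rather than first() and last() means the input does not have to be sorted beforehand.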
I have a time series, that spans almost 20 years with a resolution of 15 min.
I want to extract only hourly values (00:00:00, 01:00:00, and so on...) and plot the resulting time series.
The df looks like this:
3 columns: date, time, and discharge
How would you approach this?
A reproducible example would be good for this kind of question. Here is my code; I hope it helps:
#creating dummy data
df <- data.frame(time = seq(as.POSIXct("2018-01-01 00:00:00"), as.POSIXct("2018-01-01 23:59:59"), by = "15 min"), variable = runif(96, 0, 1))
example output: (only 5 rows)
time variable
1 2018-01-01 00:00:00 0.331546992
2 2018-01-01 00:15:00 0.407269290
3 2018-01-01 00:30:00 0.635367577
4 2018-01-01 00:45:00 0.808612045
5 2018-01-01 01:00:00 0.258801201
library(tidyverse)  # provides %>%, filter() and ggplot()
df %>% filter(format(time, "%M:%S") == "00:00")
output:
1 2018-01-01 00:00:00 0.76198532
2 2018-01-01 01:00:00 0.01304103
3 2018-01-01 02:00:00 0.10729465
4 2018-01-01 03:00:00 0.74534184
5 2018-01-01 04:00:00 0.25942667
plot(df %>% filter(format(time, "%M:%S") == "00:00") %>% ggplot(aes(x = time, y = variable)) + geom_line())
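Since the question describes separate date and time columns, the two would first need to be combined into one date-time before filtering. The sketch below does this in base R. The column formats ("YYYY-MM-DD" and "HH:MM:SS"), the made-up discharge values, and `tz = "UTC"` are all assumptions, because the question does not show the actual data.

```r
# Hypothetical data: two hours of 15-minute readings
df <- data.frame(
  date = rep("2018-01-01", 8),
  time = sprintf("%02d:%02d:00", rep(0:1, each = 4), rep(c(0L, 15L, 30L, 45L), 2)),
  discharge = c(1.2, 1.3, 1.1, 1.4, 1.5, 1.6, 1.4, 1.3)
)

# Combine the two text columns into a single POSIXct column
df$datetime <- as.POSIXct(paste(df$date, df$time),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Keep only rows that fall exactly on the hour
hourly <- df[format(df$datetime, "%M:%S") == "00:00", ]
print(hourly)
```

The result could then be plotted with, for example, plot(hourly$datetime, hourly$discharge, type = "l").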
I have the data frame below (df) from ENTSO-E showing German power prices. I created the "Hour" column with the lubridate function hour(df$date). The output was a range (1, 2, ..., 23, 0).
# to replace 0 with 24
df["Hour"][df["Hour"]=="0"]<- "24"
I need to work on an hourly basis, so I filtered each hour from 1 to 24, but I cannot filter the replaced hour, H24.
H1 <- df %>%
filter(Hour==1)
H24 <- df %>%
filter(Hour==24)
Error in match.fun(FUN) : object 'Hour' not found
The 24 values are still in the Hour column, and its class is numeric, but I cannot do any calculation with the Hour column.
class(df$Hour)
[1] "numeric"
mean(german_last_4$Hour)
[1] NA
I think the problem is with the replacement. Is there any other way to produce a result that works with H24?
date                 price   Hour
2019-01-01 01:00:00  28.32   1
2019-01-01 02:00:00  10.07   2
2019-01-01 03:00:00  -4.08   3
2019-01-01 04:00:00  -9.91   4
2019-01-01 05:00:00  -7.41   5
2019-01-01 06:00:00  -12.55  6
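One thing that may be worth checking: assigning the string "24" into a numeric column coerces the whole column to character, which can break later numeric work. A minimal sketch of a purely numeric replacement follows. The three rows and their prices are made up for illustration, and base R's format() stands in for lubridate's hour() so the example has no package dependencies.

```r
# Made-up rows around midnight (prices are illustrative only)
df <- data.frame(
  date = as.POSIXct(c("2019-01-01 23:00:00", "2019-01-02 00:00:00",
                      "2019-01-02 01:00:00"), tz = "UTC"),
  price = c(30.10, 25.40, 28.32)
)

# Extract the hour as an integer (0..23)
df$Hour <- as.integer(format(df$date, "%H"))

# Replace 0 with 24 using numbers, not strings, so the column stays numeric
df$Hour[df$Hour == 0L] <- 24L

# Filtering and arithmetic now work on hour 24
H24 <- df[df$Hour == 24, ]
print(H24)
print(mean(df$Hour))
```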
I merged two dataframes, ~251k rows and ~237k rows, respectively, based on nearest datetime. However, I have an issue with the output.
Here are small, hypothetical examples.
Large dataframe:
dflarge <- data.frame(datetime = c("2021-01-01 12:47:16", "2021-01-01 13:47:16", "2021-01-01 14:47:16", "2021-01-01 15:47:16", "2021-01-01 16:47:16"))
Converting to datetime format:
dflarge$datetime <- as.POSIXct(dflarge$datetime)
tibble(dflarge)
# A tibble: 5 x 1
datetime
<dttm>
1 2021-01-01 12:47:16
2 2021-01-01 13:47:16
3 2021-01-01 14:47:16
4 2021-01-01 15:47:16
5 2021-01-01 16:47:16
Small dataframe and necessary format conversions:
dfsmall <- data.frame(datetime = c("2021-01-01 15:00:00", "2021-01-01 16:00:00", "2021-01-01 17:00:00"), value = c("0.5", "1.0", "1.5"))
dfsmall$datetime <- as.POSIXct(dfsmall$datetime)
dfsmall$value <- as.numeric(dfsmall$value)
tibble(dfsmall)
# A tibble: 3 x 2
datetime value
<dttm> <dbl>
1 2021-01-01 15:00:00 0.5
2 2021-01-01 16:00:00 1
3 2021-01-01 17:00:00 1.5
Now I perform the merge...
library(data.table)
setDT(dflarge)[, value := setDT(dfsmall)[dflarge, value, on = "datetime", roll = "nearest"]]
tibble(dflarge)
# A tibble: 5 x 2
datetime value
<dttm> <dbl>
1 2021-01-01 12:47:16 0.5
2 2021-01-01 13:47:16 0.5
3 2021-01-01 14:47:16 0.5
4 2021-01-01 15:47:16 1
5 2021-01-01 16:47:16 1.5
Although the result follows the join logic, as you can see the first two records have also been assigned the value 0.5, which is incorrect!
Removing or modifying these values manually will not suffice, i.e., scrolling through a quarter of a million records to find where the duplications start. The scripts I am compiling are for autonomous database merging and appending.
I basically need a function that only matches dfsmall$value to a dflarge$datetime that is within about 1 hour of dfsmall$datetime, not >= 2 hours apart, and subsequently replaces irrelevant values with NA or NaN.
Borrowing ideas from a solution to a similar question:
setDT(dflarge)
setDT(dfsmall)
dflarge[, joindatetime := datetime + 3600] # we add 3600 secs (one hour)
dflarge[, value := dfsmall[dflarge, value, on = .(datetime = joindatetime), roll = 7200]
][, joindatetime := NULL]
# datetime value
# 1: 2021-01-01 12:47:16 <NA>
# 2: 2021-01-01 13:47:16 <NA>
# 3: 2021-01-01 14:47:16 0.5
# 4: 2021-01-01 15:47:16 1.0
# 5: 2021-01-01 16:47:16 1.5
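For comparison, the "nearest match, but only within a tolerance" rule can also be sketched in base R with findInterval(). This is an illustrative alternative rather than the data.table approach above; the one-hour tolerance `max_gap`, `tz = "UTC"`, and the helper variable names are assumptions. It requires dfsmall to be sorted by datetime.

```r
dflarge <- data.frame(datetime = as.POSIXct(
  c("2021-01-01 12:47:16", "2021-01-01 13:47:16", "2021-01-01 14:47:16",
    "2021-01-01 15:47:16", "2021-01-01 16:47:16"), tz = "UTC"))
dfsmall <- data.frame(
  datetime = as.POSIXct(c("2021-01-01 15:00:00", "2021-01-01 16:00:00",
                          "2021-01-01 17:00:00"), tz = "UTC"),
  value = c(0.5, 1.0, 1.5)
)

max_gap <- 3600  # assumed tolerance in seconds (one hour)

t_large <- as.numeric(dflarge$datetime)
t_small <- as.numeric(dfsmall$datetime)  # must be sorted ascending

# Index of the last small timestamp at or before each large timestamp (0 if none)
j <- findInterval(t_large, t_small)
lo <- pmax(j, 1L)                        # candidate at or before
hi <- pmin(j + 1L, length(t_small))      # candidate after

# Pick whichever candidate is closer in time
idx <- ifelse(abs(t_small[hi] - t_large) < abs(t_large - t_small[lo]), hi, lo)
gap <- abs(t_small[idx] - t_large)

# Keep the matched value only when it lies within the tolerance
dflarge$value <- ifelse(gap <= max_gap, dfsmall$value[idx], NA_real_)
print(dflarge)
```

Because every step is vectorised, this should scale to a few hundred thousand rows, though that has not been measured here.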
I have 2 columns.
One is the date:
2011-04-13
2013-07-29
2010-11-23
The other is the time (hour):
3
22
15
I want to make a new column that contains the date-time.
It should look like this:
2011-04-13 3:00:00
2013-07-29 22:00:00
2010-11-23 15:00:00
I managed to combine them as a string,
but when I convert them to datetime I get only the date; the time disappears.
Any idea how to get date and time in one column?
My script:
data <- read.csv("d:\\__r\\hour.csv")
data$date <- as.POSIXct(paste(data$dteday , paste(data$hr, ":00:00", sep=""), sep=" "))
As an example, you can use the ymd_hm function from lubridate:
a <- c("2014-09-08", "2014-09-08", "2014-09-08")
b <- c(3, 4, 5)
library(lubridate)
library(tidyverse)
tibble(a, b) %>%
mutate(time = paste0(a, " ", b, "-0"),
time = ymd_hm(time))
output would be:
# A tibble: 3 x 3
a b time
<chr> <dbl> <dttm>
1 2014-09-08 3 2014-09-08 03:00:00
2 2014-09-08 4 2014-09-08 04:00:00
3 2014-09-08 5 2014-09-08 05:00:00
I found that this fixed the problem:
data$date <- as.POSIXct(strptime(paste(data$dteday , paste(data$hr, ":00:00", sep=""), sep=" "), "%Y-%m-%d %H:%M:%S"))
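A shorter variant of the same idea is to paste the date and the hour together and parse them with an explicit format. This is a minimal sketch using the three rows from the question; the data frame name `dat` and `tz = "UTC"` are assumptions.

```r
dat <- data.frame(
  dteday = c("2011-04-13", "2013-07-29", "2010-11-23"),
  hr = c(3, 22, 15)
)

# "%H" reads the hour; minutes and seconds default to zero
dat$datetime <- as.POSIXct(paste(dat$dteday, dat$hr),
                           format = "%Y-%m-%d %H", tz = "UTC")

# Print with an explicit format so the time part is always shown
print(format(dat$datetime, "%Y-%m-%d %H:%M:%S"))
```

An explicit format in the final step is useful because R prints only the date part when every time in a POSIXct vector is midnight.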
My data frame has timestamps with and without seconds, and an inconsistent use of a leading 0 in months and hours, i.e. 01 or 1.
library(tidyverse)
df <- data_frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 01:03', '12/30/2015 23:04:25'))
cust timestamp
A 5/31/2016 1:03:12
A 05/25/2016 01:06
B 6/16/2016 01:03
B 12/30/2015 23:04:25
How to extract hours into a separate column? The desired output:
cust timestamp hours
A 5/31/2016 1:03:12 1
A 05/25/2016 01:06 1
B 6/16/2016 9:03 9
B 12/30/2015 23:04:25 23
I would prefer an answer with tidyverse and mutate, but my attempt fails to extract the hours correctly:
df %>% mutate(hours=strptime(timestamp, '%H') %>% as.character() )
# A tibble: 4 × 3
cust timestamp hours
<chr> <chr> <chr>
1 A 5/31/2016 1:03:12 2016-10-31 05:00:00
2 A 05/25/2016 01:06 2016-10-31 05:00:00
3 B 6/16/2016 01:03 2016-10-31 06:00:00
4 B 12/30/2015 23:04:25 2016-10-31 12:00:00
Try this:
library(dplyr)      # for %>% and mutate()
library(lubridate)  # for hour()
df <- data.frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 09:03', '12/30/2015 23:04:25'))
df %>% mutate(hours=hour(strptime(timestamp, '%m/%d/%Y %H:%M')) %>% as.character() )
cust timestamp hours
1 A 5/31/2016 1:03:12 1
2 A 05/25/2016 01:06 1
3 B 6/16/2016 09:03 9
4 B 12/30/2015 23:04:25 23
Here is a solution that appends 00 for the seconds when they are missing, then converts to a date using lubridate and extracts the hours using format. Note, if you don't want the 00:00 at the end of the hours, you can just eliminate them from the output format in format:
df %>%
mutate(
cleanTime = ifelse(grepl(":[0-9][0-9]:", timestamp)
, timestamp
, paste0(timestamp, ":00")) %>% mdy_hms
, hour = format(cleanTime, "%H:00:00")
)
returns:
cust timestamp cleanTime hour
<chr> <chr> <dttm> <chr>
1 A 5/31/2016 1:03:12 2016-05-31 01:03:12 01:00:00
2 A 05/25/2016 01:06 2016-05-25 01:06:00 01:00:00
3 B 6/16/2016 01:03 2016-06-16 01:03:00 01:00:00
4 B 12/30/2015 23:04:25 2015-12-30 23:04:25 23:00:00
Your timestamp is a character string, so you need to parse it into a date-time (with as.POSIXct or strptime, for example) before you can extract the hour from it.
You may have to go through some string manipulation to get consistently formatted data before you can convert it. Append :00 to timestamps with missing seconds, using paste0() together with a regex test such as grepl(). Zero-padding single-digit months and hours is optional, because strptime accepts them on input. Afterwards do as.POSIXct(df$timestamp, format = '%m/%d/%Y %H:%M:%S'); then you will be able to use format(..., '%H') to extract the hours. Note that as.Date would drop the time of day, so it is not suitable here.
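Those steps can be sketched in base R as follows, using the four timestamps from the question. The regular expression used to detect existing seconds, the variable names, and `tz = "UTC"` are assumptions chosen for illustration.

```r
ts <- c("5/31/2016 1:03:12", "05/25/2016 01:06",
        "6/16/2016 01:03", "12/30/2015 23:04:25")

# TRUE when the string already ends in :MM:SS
has_seconds <- grepl(":[0-9]{2}:[0-9]{2}$", ts)

# Append ":00" only where the seconds are missing
clean <- ifelse(has_seconds, ts, paste0(ts, ":00"))

# Parse with one common format, then pull out the hour as an integer
parsed <- as.POSIXct(clean, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
hours <- as.integer(format(parsed, "%H"))
print(hours)
```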