Adding data based on date from another dataframe


I have two datasets. One with multiple dates:
date, time
1 2013-05-01 12:43:34
2 2013-05-02 05:04:23
3 2013-05-02 09:34:34
4 2013-05-02 12:32:23
5 2013-05-03 23:23:23
6 2013-05-04 15:34:17
and one with sunrise and sunset data:
Sunrise Sunset
2013-05-01 06:43:00 2013-05-01 21:02:12
2013-05-02 06:44:00 2013-05-02 21:03:13
2013-05-03 06:44:56 2013-05-03 21:04:02
2013-05-04 06:45:32 2013-05-04 21:05:00
I want to add a column to the first dataframe with either "Day" or "Night", depending on whether the date and time in the first dataframe fall between that day's sunrise and sunset.
date, time Day or night
1 2013-05-01 12:43:34 Day
2 2013-05-02 05:04:23 Night
3 2013-05-02 09:34:34 Day
4 2013-05-02 12:32:23 Day
5 2013-05-03 23:23:23 Night
6 2013-05-04 15:34:17 Day
I tried copying and if_else functions, but the two dataframes have different numbers of rows: for one year I have 365 sunrises and sunsets, but I also have multiple measurements per day (28,000 rows in total).
Can anyone help me with this problem?
Thanks in advance.

df1 <- structure(list(date_time = c("2013-05-01 12:43:34", "2013-05-02 05:04:23",
"2013-05-02 09:34:34", "2013-05-02 12:32:23", "2013-05-03 23:23:23",
"2013-05-04 15:34:17")), row.names = c(NA, -6L), class = c("data.frame"))
df2 <- structure(list(Sunrise = c("2013-05-01 06:43:00", "2013-05-02 06:44:00",
"2013-05-03 06:44:56", "2013-05-04 06:45:32"), Sunset = c("2013-05-01 21:02:12",
"2013-05-02 21:03:13", "2013-05-03 21:04:02", "2013-05-04 21:05:00"
)), row.names = c(NA, -4L), class = c("data.frame"))
library(dplyr) # for %>% and mutate
# prepare df1
df1 <- df1 %>%
mutate(date_time = as.POSIXct(date_time, tz = "UTC")) %>%
mutate(Date = as.Date(date_time))
# prepare df2
df2 <- df2 %>%
mutate(Sunrise = as.POSIXct(Sunrise, tz = "UTC")) %>%
mutate(Sunset = as.POSIXct(Sunset, tz = "UTC")) %>%
mutate(Date = as.Date(Sunrise))
library(lubridate) # for the use of interval
merge(df1, df2, by = "Date") %>%
mutate(DayOrNight = ifelse(date_time %within% interval(Sunrise, Sunset), "Day", "Night"))
# Date date_time Sunrise Sunset DayOrNight
# 1 2013-05-01 2013-05-01 12:43:34 2013-05-01 06:43:00 2013-05-01 21:02:12 Day
# 2 2013-05-02 2013-05-02 05:04:23 2013-05-02 06:44:00 2013-05-02 21:03:13 Night
# 3 2013-05-02 2013-05-02 09:34:34 2013-05-02 06:44:00 2013-05-02 21:03:13 Day
# 4 2013-05-02 2013-05-02 12:32:23 2013-05-02 06:44:00 2013-05-02 21:03:13 Day
# 5 2013-05-03 2013-05-03 23:23:23 2013-05-03 06:44:56 2013-05-03 21:04:02 Night
# 6 2013-05-04 2013-05-04 15:34:17 2013-05-04 06:45:32 2013-05-04 21:05:00 Day
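For readers who prefer not to load dplyr or lubridate, the same join-then-compare idea can be sketched in base R: look up each measurement's calendar day in the sunrise table with match, then compare the timestamp against that day's sunrise and sunset. This is only a sketch on a shortened copy of the data; the helper names day_key and idx are illustrative, and it assumes exactly one sunrise/sunset row per date with everything in UTC.

```r
# Shortened copies of the two data frames from the question (UTC assumed)
df1 <- data.frame(date_time = as.POSIXct(
  c("2013-05-01 12:43:34", "2013-05-02 05:04:23", "2013-05-03 23:23:23"),
  tz = "UTC"))
df2 <- data.frame(
  Sunrise = as.POSIXct(c("2013-05-01 06:43:00", "2013-05-02 06:44:00",
                         "2013-05-03 06:44:56"), tz = "UTC"),
  Sunset  = as.POSIXct(c("2013-05-01 21:02:12", "2013-05-02 21:03:13",
                         "2013-05-03 21:04:02"), tz = "UTC"))

# Calendar day of each measurement, as a plain "YYYY-MM-DD" string
day_key <- format(df1$date_time, "%Y-%m-%d")

# Row of df2 holding the sunrise/sunset for that day (NA if the day is missing)
idx <- match(day_key, format(df2$Sunrise, "%Y-%m-%d"))

# "Day" if the timestamp lies between that day's sunrise and sunset
df1$DayOrNight <- ifelse(
  df1$date_time >= df2$Sunrise[idx] & df1$date_time <= df2$Sunset[idx],
  "Day", "Night")

print(df1)
```

Because match returns one index per measurement, the differing row counts of the two data frames are not a problem; a day with no sunrise row would simply yield NA.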

Related

Add column in dataframe based on 3 columns from another dataframe using R

I have 2 dataframes which are as follows:
Dataframe 1: traffic_df which is hourly data.
Date_Time
Traffic
2020-03-09 06:00:00
10
2020-03-09 07:00:00
20
2020-03-10 07:00:00
20
2020-03-24 08:00:00
15
Dataframe 2: Alert.level
Start
End
Alert.level
10/03/2020 13:30
23/03/2020 13:30
2
23/03/2020 13:30
25/03/2020 23:59
3
I want to add a 3rd column to traffic_df which is the associated Alert.level if the Date_Time falls within the Start and End Date_Time of the Alert.level df so that the resulting dataframe will look like this:
Dataframe 1: traffic_df
Date_Time
Traffic
Alert.Level
2020-03-09 06:00:00
10
2020-03-09 07:00:00
20
2020-03-10 07:00:00
20
2
2020-03-24 08:00:00
15
3
Is there any way to do this without having to make a matching hourly dataframe and then using a join?
I'm thinking of somehow using the map function.
Code to produce the df:
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00", "2020-03-10 07:00:00",
"2020-03-24 08:00:00"), Traffic = c(10L, 20L, 20L, 15L)),
row.names = c(NA, -4L), class = "data.frame")
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
End = c("23/03/2020 13:30", "25/03/2020 23:59"),
Alert.level = c(2, 3))

You may try the fuzzyjoin package.
Data
library(dplyr) # for %>% and mutate
library(lubridate)
traffic_df <- structure(list(Date_Time = c("2020-03-09 06:00:00", "2020-03-09 07:00:00", "2020-03-10 07:00:00",
"2020-03-24 08:00:00"), Traffic = c(10L, 20L, 20L, 15L)),
row.names = c(NA, -4L), class = "data.frame") %>%
mutate(Date_Time = ymd_hms(Date_Time))
# Start and End carry no seconds, so parse them with dmy_hm rather than dmy_hms
Alert.Level = data.frame(Start = c("10/03/2020 13:30", "23/03/2020 13:30"),
End = c("23/03/2020 13:30", "25/03/2020 23:59"),
Alert.level = c(2, 3)) %>%
mutate(Start = dmy_hm(Start),
End = dmy_hm(End))
Code
library(fuzzyjoin)
traffic_df %>%
fuzzy_left_join(Alert.Level,
match_fun = list(`>=`, `<=`),
by = list(x = c("Date_Time",
"Date_Time"),
y = c("Start",
"End"))) %>%
select(-Start, -End)
Output
In contrast to your expected output above, row three is not matched, because 07:00 on 10 March is before the 13:30 start of the first alert period.
Date_Time Traffic Alert.level
1 2020-03-09 06:00:00 10 NA
2 2020-03-09 07:00:00 20 NA
3 2020-03-10 07:00:00 20 NA
4 2020-03-24 08:00:00 15 3

Here is a solution using sqldf. Note that I renamed the data.frame to have an underscore for convenience with SQL.
library(sqldf)
Alert_level <- Alert.Level # the data frame is named Alert.Level; copy it under an SQL-friendly name
sqldf("SELECT * FROM traffic_df
LEFT JOIN Alert_level
ON traffic_df.Date_Time BETWEEN Alert_level.Start AND Alert_level.End")
Output
Date_Time Traffic Start End Alert.level
1 2020-03-09 06:00:00 10 <NA> <NA> NA
2 2020-03-09 07:00:00 20 <NA> <NA> NA
3 2020-03-10 07:00:00 20 <NA> <NA> NA
4 2020-03-24 08:00:00 15 2020-03-23 13:30:00 2020-03-25 23:59:00 3
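If the alert periods are sorted by Start and do not overlap, a base-R lookup with findInterval is another option that avoids both a join and an expanded hourly table. The following is a sketch under those assumptions, not a drop-in replacement for the answers above; the names alerts, idx and inside are illustrative, and all times are treated as UTC.

```r
traffic_df <- data.frame(
  Date_Time = as.POSIXct(c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                           "2020-03-10 07:00:00", "2020-03-24 08:00:00"),
                         tz = "UTC"),
  Traffic = c(10L, 20L, 20L, 15L))

alerts <- data.frame(
  Start = as.POSIXct(c("10/03/2020 13:30", "23/03/2020 13:30"),
                     format = "%d/%m/%Y %H:%M", tz = "UTC"),
  End   = as.POSIXct(c("23/03/2020 13:30", "25/03/2020 23:59"),
                     format = "%d/%m/%Y %H:%M", tz = "UTC"),
  Alert.level = c(2, 3))

# Index of the latest period whose Start is <= the timestamp (0 = before all)
idx <- findInterval(as.numeric(traffic_df$Date_Time), as.numeric(alerts$Start))
idx[idx == 0] <- NA

# Keep the match only if the timestamp is also on or before that period's End
inside <- !is.na(idx) & traffic_df$Date_Time <= alerts$End[idx]
traffic_df$Alert.level <- ifelse(inside, alerts$Alert.level[idx], NA)

print(traffic_df)
```

Note that a timestamp exactly equal to a shared boundary (here 23/03/2020 13:30) is assigned to the later period, because findInterval picks the last Start that is not greater than the timestamp.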

I like outer approaches in such cases. First, define a Vectorized FUNction that checks whether a given x lies within an interval y. Pass it to outer, which evaluates each Date_Time against each start/end interval of Alert.Level. This gives a logical matrix o showing which interval, if any, applies (I use unname to avoid confusion). Then, in traffic_df, we create an NA column alert_lv (it just needs a name other than "Alert.Level"), pick out the entries that matched an interval, and fill in the corresponding levels from Alert.Level.
FUN <- Vectorize(function(x, y) x >= y[1] & x < y[2])
(o <- unname(outer(traffic_df$Date_Time, Alert.Level[-3], FUN)))
# [,1] [,2] [,3] [,4]
# [1,] FALSE FALSE TRUE FALSE
# [2,] FALSE FALSE FALSE TRUE
w <- unlist(apply(o, 1, which))
traffic_df <- within(traffic_df, {
alert_lv <- NA
alert_lv[rowSums(o) > 0] <- Alert.Level[w, 3]
})
traffic_df
# Date_Time Traffic alert_lv
# 1 2020-03-09 06:00:00 10 NA
# 2 2020-03-09 07:00:00 20 NA
# 3 2020-03-10 07:00:00 20 2
# 4 2020-03-24 08:00:00 15 3
Note: To use this solution you first need the usual 'POSIXct' formats, so first you should do
traffic_df$Date_Time <- as.POSIXct(traffic_df$Date_Time)
Alert.Level[1:2] <- lapply(Alert.Level[1:2], strptime, format='%d/%m/%Y %H:%M')
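The outer idea can also be sketched with plain index vectors, which keeps the start and end of every interval explicit and makes the shape of the resulting matrix easy to predict (one row per timestamp, one column per interval). This is an illustrative variant rather than the code above; times, starts, ends, lv and in_interval are assumed names, and UTC is assumed throughout.

```r
times <- as.POSIXct(c("2020-03-09 06:00:00", "2020-03-09 07:00:00",
                      "2020-03-10 07:00:00", "2020-03-24 08:00:00"), tz = "UTC")
starts <- as.POSIXct(c("10/03/2020 13:30", "23/03/2020 13:30"),
                     format = "%d/%m/%Y %H:%M", tz = "UTC")
ends   <- as.POSIXct(c("23/03/2020 13:30", "25/03/2020 23:59"),
                     format = "%d/%m/%Y %H:%M", tz = "UTC")
lv <- c(2, 3)

# i indexes timestamps, j indexes intervals; both arrive as full-length vectors
in_interval <- function(i, j) times[i] >= starts[j] & times[i] < ends[j]

# 4 x 2 logical matrix: o[i, j] is TRUE if timestamp i lies in interval j
o <- outer(seq_along(times), seq_along(starts), in_interval)

# First matching interval per timestamp, or NA if none matches
w <- apply(o, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)
alert_lv <- lv[w]

print(data.frame(Date_Time = times, alert_lv = alert_lv))
```

With these inputs only the last timestamp falls inside an interval, so alert_lv comes out as NA, NA, NA, 3.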

aggregate data frame to typical year/week

So I have a large data frame with a date-time column of class POSIXct and another column with price data of class numeric. The date-time column has values of the form "1998-12-07 02:00:00 AEST", which are half-hourly observations across 20 years. A sample data set can be generated with the following code (vary the 100 to whatever number of observations is necessary):
data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"), as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))
I want to look at a typical year and a typical week. For the typical year I have the following code:
mean.year <- aggregate(df$price, by = list(format(df$date.time, "%m-%d %H:%M")), mean)
It seems to give me what I want:
Group.1 x
1 01-01 00:00 31.86200
2 01-01 00:30 34.20526
3 01-01 01:00 28.40105
4 01-01 01:30 26.01684
5 01-01 02:00 23.68895
6 01-01 02:30 23.70632
However, the column "Group.1" is of class character and I would like it to be of class POSIXct. How can I do this?
For the typical week I have the following code:
mean.week <- aggregate(df$price, by = list(format(df$date.time, "%wday %H:%M")), mean)
The output is as follows:
Group.1 x
1 0day 00:00 33.05613
2 0day 00:30 30.92815
3 0day 01:00 29.26245
4 0day 01:30 29.47959
5 0day 02:00 29.18380
6 0day 02:30 25.99400
Again, column "Group.1" is of class character and I would like POSIXct. Also, I would like to have the day of the week as "Monday", "Tuesday", etc. instead of 0day. How would I do this?

Convert the datetime to a character string that can validly be converted back to POSIXct and then do so:
mean.year <- aggregate(df["price"],
by = list(time = as.POSIXct(format(df$date.time, "2000-%m-%d %H:%M"))), mean)
head(mean.year)
## time price
## 1 2000-12-07 02:00:00 -0.56047565
## 2 2000-12-07 02:30:00 -0.23017749
## 3 2000-12-07 03:00:00 1.55870831
## 4 2000-12-07 03:30:00 0.07050839
## 5 2000-12-07 04:00:00 0.12928774
## 6 2000-12-07 04:30:00 1.71506499
To get the day of the week use %a or %A -- see ?strptime for the list of percent codes.
mean.week <- aggregate(df["price"],
by = list(time = format(df$date.time, "%a %H:%M")), mean)
head(mean.week)
## time price
## 1 Mon 02:00 -0.56047565
## 2 Mon 02:30 -0.23017749
## 3 Mon 03:00 1.55870831
## 4 Mon 03:30 0.07050839
## 5 Mon 04:00 0.12928774
## 6 Mon 04:30 1.71506499
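If the weekday labels should be full names that sort in calendar order rather than alphabetically, one possible sketch is to build an ordered factor from the numeric weekday and aggregate on that plus the time of day. The column names weekday and time and the vector day_names are illustrative, the sample data is generated in UTC rather than AEST, and the English day names are hard-coded so the result does not depend on the locale.

```r
set.seed(123)
df <- data.frame(
  date.time = seq(as.POSIXct("1998-12-07 02:00:00", tz = "UTC"),
                  by = "30 min", length.out = 1000),
  price = rnorm(1000))

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
wday <- as.POSIXlt(df$date.time)$wday

# Factor levels run Monday..Sunday so that sorting follows the week
df$weekday <- factor(day_names[wday + 1], levels = day_names[c(2:7, 1)])
df$time <- format(df$date.time, "%H:%M")

mean.week <- aggregate(price ~ weekday + time, data = df, FUN = mean)
mean.week <- mean.week[order(mean.week$weekday, mean.week$time), ]
head(mean.week)
```

The result has one row per weekday and half-hour slot (7 x 48 = 336 rows once the data covers a full week), starting with Monday 00:00.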
Note
The input df in reproducible form -- note that set.seed is needed to make it reproducible:
set.seed(123)
df <- data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"),
as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))

Changing quarterly data into hourly data

I have data as below, covering 01.01.2015 to 31.12.2015.
The data is quarter-hourly. I want to add up, for example, the values for 0:00, 0:15, 0:30 and 0:45 to get one hourly value. How can I turn this into hourly data?
Thank you in advance.
Date Hour Day-ahead Total Load Forecast [MW] - Germany (DE)
01.01.2015 0:00 42955
01.01.2015 0:15 42412
01.01.2015 0:30 41901
01.01.2015 0:45 41355
01.01.2015 1:00 40710
01.01.2015 1:15 40204
01.01.2015 1:30 39640
01.01.2015 1:45 39324
01.01.2015 2:00 39002
01.01.2015 2:15 38869
01.01.2015 2:30 38783
01.01.2015 2:45 38598
01.01.2015 3:00 38626
01.01.2015 3:15 38459
01.01.2015 3:30 38414
...
> dput(head(new3))
structure(list(Date = structure(c(16436, 16436, 16436, 16436,
16436, 16436), class = "Date"), Hour = c("0:00", "0:15", "0:30",
"0:45", "1:00", "1:15"), Dayahead = c("42955", "42412", "41901",
"41355", "40710", "40204"), Actual = c(42425L, 42021L, 42068L,
41874L, 41230L, 40810L), Difference = c("530", "391", "-167",
"-519", "-520", "-606")), .Names = c("Date", "Hour", "Dayahead",
"Actual", "Difference"), row.names = c(NA, 6L), class = "data.frame")

I've created a small example data set.
df <- read.csv(text = "Date,Hour,Val
2013-06-03,06:01,0
2013-06-03,12:08,-1
2013-06-03,12:48,3.3
2013-06-03,13:58,2
2013-06-03,13:01,12
2013-06-03,13:08,3
2013-06-03,14:48,4
2013-06-03,14:58,8
2013-06-03,15:01,9.2
2013-06-03,15:08,12.3
2013-06-03,16:48,0
2013-06-03,19:58,-10", stringsAsFactors = FALSE)
With group_by and summarize from dplyr and floor_date from lubridate this can be done:
library(dplyr)
library(lubridate)
df %>%
group_by(Hours=floor_date(ymd_hm(paste(Date, Hour)), "1 hour")) %>%
summarize(Val=sum(Val))
# # A tibble: 7 x 2
# Hours Val
# <dttm> <dbl>
# 1 2013-06-03 06:00:00 0
# 2 2013-06-03 12:00:00 2.30
# 3 2013-06-03 13:00:00 17.0
# 4 2013-06-03 14:00:00 12.0
# 5 2013-06-03 15:00:00 21.5
# 6 2013-06-03 16:00:00 0
# 7 2013-06-03 19:00:00 -10.0
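The same floor-then-sum idea can be sketched in base R directly on data shaped like the dput in the question, where Date is a Date column, Hour is text such as "0:15", and Dayahead is stored as character. This is a minimal sketch using only the first six rows; stamp, hour_start and hourly are illustrative names and UTC is assumed.

```r
new3 <- data.frame(
  Date = as.Date(rep("2015-01-01", 6)),
  Hour = c("0:00", "0:15", "0:30", "0:45", "1:00", "1:15"),
  Dayahead = c("42955", "42412", "41901", "41355", "40710", "40204"),
  stringsAsFactors = FALSE)

# The values were read as text, so convert them before summing
new3$Dayahead <- as.numeric(new3$Dayahead)

# Combine date and time of day into one timestamp
stamp <- as.POSIXct(paste(new3$Date, new3$Hour),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")

# Truncate each timestamp to the start of its hour
hour_start <- as.POSIXct(trunc(stamp, units = "hours"))

# Sum the quarter-hour values within each hour
hourly <- aggregate(new3["Dayahead"], by = list(HourStart = hour_start), FUN = sum)
print(hourly)
```

Here the 00:00 row is the sum of all four quarter-hours, while the 01:00 row only covers the two quarter-hours present in this short extract.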

Let's say your data frame is called df:
> head(df)
Date Hour Forecast
1 01.01.2015 12:00:00 AM 42955
2 01.01.2015 12:15:00 AM 42412
3 01.01.2015 12:30:00 AM 41901
4 01.01.2015 12:45:00 AM 41355
5 01.01.2015 01:00:00 AM 40710
6 01.01.2015 01:15:00 AM 40204
You can aggregate your forecast to an hourly basis with the following code:
library(plyr) # for ddply
library(magrittr) # for the %>% pipe
library(lubridate)
df$DateTime=paste(df$Date,df$Hour,sep=" ")%>%dmy_hms%>%floor_date(unit="hour")
result<-ddply(df,.(DateTime),summarize,x=sum(Forecast))
> result
DateTime x
1 2015-01-01 00:00:00 168623
2 2015-01-01 01:00:00 159878
3 2015-01-01 02:00:00 155252
4 2015-01-01 03:00:00 115499
The variable x holds the sum of the forecasts for each hour. For example, the 00:00:00 timestamp aggregates the values for 00:00, 00:15, 00:30 and 00:45.

Fill missing sequence values with dplyr

I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.
Assumptions:
There will never be missing data in the first or last row; I'm generating the missing dates from the missing days between the min and max of a data set
There can be multiple gaps in the data set
Current data:
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260,
1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end",
"SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
This was my attempt at achieving this goal, but it only works for the first missing value:
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
The outstanding answer from @mathematical.coffee below:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
ungroup() %>%
select(-tmp)

EDIT: new version works for any number of NA runs.
This one doesn't need zoo, either.
First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs so that each group with the same tmp consists of one non-NA value followed by a run of NA values.
Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))
end SNAP_ID tmp
1 2015-06-26 12:59:00 365.0 1
2 2015-06-26 13:59:00 366.0 2
3 2015-06-27 00:01:00 366.1 2
4 2015-06-27 23:00:00 366.2 2
5 2015-06-28 00:01:00 366.3 2
6 2015-06-28 23:00:00 366.4 2
7 2015-06-29 09:00:00 367.0 3
8 2015-06-29 09:59:00 368.0 4
Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
EDIT: this is the old version which doesn't work for subsequent runs of NAs.
If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.
library(zoo)
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
SNAP_ID))
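The grouping trick from the newer version can be illustrated without any packages, which may help to see what cumsum(!is.na(...)) is doing. This sketch works on a bare vector with base R's ave; the names snap, grp and filled are illustrative, and it assumes, as stated in the question, that the first value is never NA.

```r
snap <- c(365, 366, NA, NA, NA, NA, 367, 368)

# Each non-NA value starts a new group; the NAs that follow share its group id
grp <- cumsum(!is.na(snap))
print(grp)

# Within each group: first (non-NA) value plus 0.1 per position after it
filled <- ave(snap, grp, FUN = function(v) v[1] + 0.1 * (seq_along(v) - 1))
print(filled)
```

As with the dplyr version, a run of ten or more NAs would reach the next whole number, so the 0.1 step would need to be made smaller for longer gaps.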

create 30 min interval for time series with different start time

I have electricity sensor readings at 15-minute intervals, but the start time is not fixed. For example,
on this day it starts at minute 13; on another day it starts at a different minute:
dateTime KW
1/1/2013 1:13 34.70
1/1/2013 1:28 43.50
1/1/2013 1:43 50.50
1/1/2013 1:58 57.50
.
.
.//here start from min 02
1/30/2013 0:02 131736.30
1/30/2013 0:17 131744.30
1/30/2013 0:32 131751.10
1/30/2013 0:47 131759.00
I have data for one year and I need regular 30-minute intervals starting from midnight (00:00).
I am new to R. Can anyone help me?

Maybe you can try:
dT <- as.POSIXct(strptime(df$dateTime, '%m/%d/%Y %H:%M'))
grp <- as.POSIXct(cut(c(as.POSIXct(gsub(' +.*', '', min(dT))), dT,
as.POSIXct(gsub(' +.*', '', max(dT)+24*3600))), breaks='30 min'))
df$grp <- grp[-c(1,length(grp))]
df
# dateTime KW grp
#1 1/1/2013 1:13 34.7 2013-01-01 01:00:00
#2 1/1/2013 1:28 43.5 2013-01-01 01:00:00
#3 1/1/2013 1:43 50.5 2013-01-01 01:30:00
#4 1/1/2013 1:58 57.5 2013-01-01 01:30:00
#5 1/30/2013 0:02 131736.3 2013-01-30 00:00:00
#6 1/30/2013 0:17 131744.3 2013-01-30 00:00:00
#7 1/30/2013 0:32 131751.1 2013-01-30 00:30:00
#8 1/30/2013 0:47 131759.0 2013-01-30 00:30:00
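An alternative way to get the same 30-minute bins is to floor the timestamps arithmetically: convert to seconds since the epoch, round down to a multiple of 1800, and convert back. This is a sketch under the assumption that the readings are in UTC (or another zone whose offset is a multiple of 30 minutes, so that the bins start on the hour and half-hour); the data is a shortened copy of the example.

```r
df <- data.frame(
  dateTime = c("1/1/2013 1:13", "1/1/2013 1:28", "1/1/2013 1:43",
               "1/30/2013 0:02", "1/30/2013 0:47"),
  KW = c(34.7, 43.5, 50.5, 131736.3, 131759),
  stringsAsFactors = FALSE)

dT <- as.POSIXct(df$dateTime, format = "%m/%d/%Y %H:%M", tz = "UTC")

# 1800 seconds = 30 minutes; floor() drops each reading to the start of its bin
df$grp <- as.POSIXct(floor(as.numeric(dT) / 1800) * 1800,
                     origin = "1970-01-01", tz = "UTC")
print(df)
```

The grp column can then be used as the grouping variable for whatever summary is needed per half-hour, for example with aggregate.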
data
df <- structure(list(dateTime = c("1/1/2013 1:13", "1/1/2013 1:28",
"1/1/2013 1:43", "1/1/2013 1:58", "1/30/2013 0:02", "1/30/2013 0:17",
"1/30/2013 0:32", "1/30/2013 0:47"), KW = c(34.7, 43.5, 50.5,
57.5, 131736.3, 131744.3, 131751.1, 131759)), .Names = c("dateTime",
"KW"), class = "data.frame", row.names = c(NA, -8L))
