I've got a time series dataset from a weather station. There are 3 columns: time (date and time), p (rain, mm), and h (water level, m).
I need to make a new column factor_rain with values 1 and 0: 1 if the water level (df$h) was influenced by rain (df$p), which is the case if there was rain within the last 5 hours (5 entries).
In all other cases it should be 0.
Part of the dataset is here:
df <- data.frame(time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
"2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
"2017-06-05 9:00:00", "2017-06-05 11:00:00",
"2017-06-05 13:00:00", "2017-06-05 16:00:00",
"2017-06-05 19:00:00", "2017-06-05 21:00:00", "2017-06-05 23:00:00",
"2017-06-06 9:00:00", "2017-06-06 11:00:00", "2017-06-06 13:00:00",
"2017-06-06 16:00:00", "2017-06-06 17:00:00", "2017-06-06 18:00:00",
"2017-06-06 19:00:00"),
p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
NA, NA, NA, NA, NA, NA, NA, NA, NA),
h = c(23,NA,NA,NA,NA,32,NA,NA,28,NA,NA,
33,NA,NA,NA,29,NA,NA,NA,NA))
I tried the simplest approach I could think of, but unfortunately it only works for one case:
> df$factor_rain[df$p[-c(1:5)] > 1 & df$h > 1] <- 1
> Warning message:
In df$p[-c(1:5)] > 1 & df$h > 1 :
longer object length is not a multiple of shorter object length
Is there any way to fix it? If you can suggest how to use real time (something from the xts library, for example), that would be great. I mean using a 5-hour threshold, not 5 values.
This is the result I need:
> df
time p h factor_rain
1 2017-06-04 9:00:00 NA 23 0
2 2017-06-04 13:00:00 NA NA 0
3 2017-06-04 17:00:00 16.4 NA 0
4 2017-06-04 19:00:00 NA NA 0
5 2017-06-04 21:00:00 NA NA 0
6 2017-06-04 23:00:00 NA 32 1
7 2017-06-05 9:00:00 NA NA 0
8 2017-06-05 11:00:00 NA NA 0
9 2017-06-05 13:00:00 NA 28 0
10 2017-06-05 16:00:00 NA NA 0
11 2017-06-05 19:00:00 12.0 NA 0
12 2017-06-05 21:00:00 NA 33 1
13 2017-06-05 23:00:00 NA NA 0
14 2017-06-06 9:00:00 NA NA 0
15 2017-06-06 11:00:00 NA NA 0
16 2017-06-06 13:00:00 NA 29 0
17 2017-06-06 16:00:00 NA NA 0
18 2017-06-06 17:00:00 NA NA 0
19 2017-06-06 18:00:00 NA NA 0
20 2017-06-06 19:00:00 NA NA 0
You can use
df$factorrain = FALSE
df$factorrain[rowSums(expand.grid(which(!is.na(df$p)), 0:4))] = TRUE
# time p h factorrain
# 1 2017-06-04 9:00:00 NA 23 FALSE
# 2 2017-06-04 13:00:00 NA NA FALSE
# 3 2017-06-04 17:00:00 16.4 NA TRUE
# 4 2017-06-04 19:00:00 NA NA TRUE
# 5 2017-06-04 21:00:00 NA NA TRUE
# 6 2017-06-04 23:00:00 NA 32 TRUE
# 7 2017-06-05 9:00:00 NA NA TRUE
# 8 2017-06-05 11:00:00 NA NA FALSE
# 9 2017-06-05 13:00:00 NA 28 FALSE
# 10 2017-06-05 16:00:00 NA NA FALSE
# 11 2017-06-05 19:00:00 12.0 NA TRUE
# 12 2017-06-05 21:00:00 NA 33 TRUE
# 13 2017-06-05 23:00:00 NA NA TRUE
# 14 2017-06-06 9:00:00 NA NA TRUE
# 15 2017-06-06 11:00:00 NA NA TRUE
# 16 2017-06-06 13:00:00 NA 29 FALSE
# 17 2017-06-06 16:00:00 NA NA FALSE
# 18 2017-06-06 17:00:00 NA NA FALSE
# 19 2017-06-06 18:00:00 NA NA FALSE
# 20 2017-06-06 19:00:00 NA NA FALSE
Or, a similar approach with sapply:
df$factorrain = FALSE
df$factorrain[sapply(which(!is.na(df$p)), function(x) x+(0:4))] = TRUE
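One caveat with both index-based versions: if rain is recorded in the last few rows, the offsets 0:4 run past the end of the data frame and the assignment should fail with a length mismatch. A minimal sketch that drops the out-of-range indices first (obs is a made-up toy data frame, not the data from the question):

```r
# Toy data: rain at rows 2 and 8, only 9 rows in total (illustrative values)
obs <- data.frame(p = c(NA, 16.4, NA, NA, NA, NA, NA, 12, NA))

# Row of each rain record plus the 4 rows that follow it
idx <- unique(as.vector(outer(which(!is.na(obs$p)), 0:4, `+`)))

# Keep only indices that exist in the data frame
idx <- idx[idx <= nrow(obs)]

obs$factorrain <- FALSE
obs$factorrain[idx] <- TRUE
obs
```

Like the versions above, this counts rows rather than elapsed time, so it is only appropriate when the entries are evenly spaced.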
One solution is a non-equi join from data.table, which applies the 5-hour rule to the actual timestamps.
library(data.table)
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S")
setDT(df)
df[,timeLow := time-5*60*60]
df[df,.(time, p, h = i.h), on=.(time < time, time >= timeLow)][
,.(factor_rain = ifelse(!is.na(first(h)), any(!is.na(p)),FALSE)),by=.(time)][
df,.(time, p, h, factor_rain),on="time"]
# time p h factor_rain
# 1: 2017-06-04 09:00:00 NA 23 FALSE
# 2: 2017-06-04 13:00:00 NA NA FALSE
# 3: 2017-06-04 17:00:00 16.4 NA FALSE
# 4: 2017-06-04 19:00:00 NA NA FALSE
# 5: 2017-06-04 21:00:00 NA NA FALSE
# 6: 2017-06-04 23:00:00 NA 32 FALSE <-- There is no rain in last 5 hours
# 7: 2017-06-05 09:00:00 NA NA FALSE
# 8: 2017-06-05 11:00:00 NA NA FALSE
# 9: 2017-06-05 13:00:00 NA 28 FALSE
# 10: 2017-06-05 16:00:00 NA NA FALSE
# 11: 2017-06-05 19:00:00 12.0 NA FALSE
# 12: 2017-06-05 21:00:00 NA 33 TRUE
# 13: 2017-06-05 23:00:00 NA NA FALSE
# 14: 2017-06-06 09:00:00 NA NA FALSE
# 15: 2017-06-06 11:00:00 NA NA FALSE
# 16: 2017-06-06 13:00:00 NA 29 FALSE
# 17: 2017-06-06 16:00:00 NA NA FALSE
# 18: 2017-06-06 17:00:00 NA NA FALSE
# 19: 2017-06-06 18:00:00 NA NA FALSE
# 20: 2017-06-06 19:00:00 NA NA FALSE
Note: The solution can be optimized a bit. I'll take up optimization in a while.
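For comparison, the same 5-hour rule can be sketched in base R without a join. This assumes the same reading of the rule as the join above (a water-level reading is flagged when rain was recorded strictly earlier and at most 5 hours before it); obs is a small made-up data frame in the shape of the question's data.

```r
# Toy data in the same shape as the question (values are illustrative)
obs <- data.frame(
  time = as.POSIXct(c("2017-06-05 16:00:00", "2017-06-05 19:00:00",
                      "2017-06-05 21:00:00", "2017-06-06 13:00:00"),
                    tz = "UTC"),
  p = c(NA, 12, NA, NA),
  h = c(NA, NA, 33, 29)
)

# Timestamps at which rain was recorded
rain_times <- obs$time[!is.na(obs$p)]

# For each row: is there a water-level reading, and did rain fall
# strictly before it but no more than 5 hours earlier?
obs$factor_rain <- vapply(seq_len(nrow(obs)), function(i) {
  lag_h <- as.numeric(difftime(obs$time[i], rain_times, units = "hours"))
  as.integer(!is.na(obs$h[i]) && any(lag_h > 0 & lag_h <= 5))
}, integer(1))

obs
```

The loop compares every row against every rain record, so it is fine for small data; for long series the join is likely the better choice.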
This is a bit of a curious case for which I have been unable to find a solution on Stack Overflow. I have a dataset with a date-time column and a column of values that indicate an event, as in the dat example below. The date-times are hourly, but note that occasional "missed" hours exist (2 hours are missing between rows 12 and 13).
dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),
max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6))
> dat
datetime event
1 2010-04-03 03:00:00 1
2 2010-04-03 04:00:00 NA
3 2010-04-03 05:00:00 NA
4 2010-04-03 06:00:00 NA
5 2010-04-03 07:00:00 NA
6 2010-04-03 08:00:00 NA
7 2010-04-03 09:00:00 NA
8 2010-04-03 10:00:00 NA
9 2010-04-03 11:00:00 NA
10 2010-04-03 12:00:00 NA
11 2010-04-03 13:00:00 2
12 2010-04-03 14:00:00 NA
13 2010-04-03 17:00:00 NA
14 2010-04-03 18:00:00 NA
15 2010-04-03 19:00:00 NA
16 2010-04-03 20:00:00 NA
17 2010-04-03 21:00:00 3
18 2010-04-03 22:00:00 4
19 2010-04-03 23:00:00 NA
20 2010-04-04 00:00:00 NA
21 2010-04-04 01:00:00 NA
22 2010-04-04 02:00:00 NA
23 2010-04-04 03:00:00 NA
24 2010-04-04 04:00:00 NA
25 2010-04-04 05:00:00 NA
26 2010-04-04 06:00:00 NA
27 2010-04-04 07:00:00 NA
28 2010-04-04 08:00:00 5
29 2010-04-04 09:00:00 NA
30 2010-04-04 10:00:00 6
I would like each row within an interval of 7 hours after the event occurs to be identified with a unique identifier, but with the following caveats (hence the "curious case"):
if a subsequent event occurs within the 7 hours of the event prior, that subsequent event is essentially ignored (i.e., "event" number does not equal assigned identifier value), and
missing times are accounted for (i.e., the rule is based on the time elapsed, not the number of rows).
The product would look like result:
library(dplyr)
result <- dat %>%
mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
> result
datetime event id
1 2010-04-03 03:00:00 1 1
2 2010-04-03 04:00:00 NA 1
3 2010-04-03 05:00:00 NA 1
4 2010-04-03 06:00:00 NA 1
5 2010-04-03 07:00:00 NA 1
6 2010-04-03 08:00:00 NA 1
7 2010-04-03 09:00:00 NA 1
8 2010-04-03 10:00:00 NA 1
9 2010-04-03 11:00:00 NA NA
10 2010-04-03 12:00:00 NA NA
11 2010-04-03 13:00:00 2 2
12 2010-04-03 14:00:00 NA 2
13 2010-04-03 17:00:00 NA 2
14 2010-04-03 18:00:00 NA 2
15 2010-04-03 19:00:00 NA 2
16 2010-04-03 20:00:00 NA 2
17 2010-04-03 21:00:00 3 3
18 2010-04-03 22:00:00 4 3
19 2010-04-03 23:00:00 NA 3
20 2010-04-04 00:00:00 NA 3
21 2010-04-04 01:00:00 NA 3
22 2010-04-04 02:00:00 NA 3
23 2010-04-04 03:00:00 NA 3
24 2010-04-04 04:00:00 NA 3
25 2010-04-04 05:00:00 NA NA
26 2010-04-04 06:00:00 NA NA
27 2010-04-04 07:00:00 NA NA
28 2010-04-04 08:00:00 5 4
29 2010-04-04 09:00:00 NA 4
30 2010-04-04 10:00:00 6 4
Ideally, this would be accomplished in a dplyr framework.
library(lubridate)
library(tidyverse)
dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),
max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6)) %>%
mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
Events <- dat %>%
#Get only the rows with events
filter(!is.na(event)) %>%
#Get the duration of time between events
mutate(
EventLag = datetime - lag(datetime)) %>%
## keep events that occurred > 7 hrs after the previous one, plus the first one (NA lag). but in the real data
## I do not suspect your first point would ever be an event...? Maybe this can be removed in the
## real dataset...
## units = "hours" makes the comparison independent of the units difftime picks automatically
filter(as.numeric(EventLag, units = "hours") > 7 | is.na(EventLag)) %>%
as.data.frame()
## You now have all of the events that are of interest (i.e. those that occurred outside of the 7 hr buffer)
## Give the events a new ID so there are no gaps
## Join them with the rest of the datetime stamps
Events <- Events %>%
mutate(ID = row_number()) %>%
dplyr::select(datetime, ID)
## Expand each event by 7 hrs
Events <- Events %>%
group_by(ID) %>%
do(data.frame(ID= .$ID, datetime= seq(.$datetime, .$datetime + hours(7), by = '1 hour'), stringsAsFactors=FALSE)) %>%
as.data.frame()
## Join with initial data by datetime
DatJoin <- dat %>%
left_join(Events, by = "datetime")
DatJoin
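Note that filtering on the lag to the previous event is not quite the same as ignoring events inside an active 7-hour window: in a chain of events each less than 7 hours apart, every event after the first would be dropped, even those that fall after the first window has expired. Below is a hedged sketch with an explicit loop that tracks the start of the current window instead. assign_event_ids is a hypothetical helper name, and the data is rebuilt from the question with an explicit UTC time zone.

```r
# Assign a running id to every row within `window_hours` of a "window-opening"
# event; events that fall inside an open window do not start a new one.
assign_event_ids <- function(datetime, event, window_hours = 7) {
  id <- rep(NA_integer_, length(datetime))
  current_id <- 0L
  window_start <- NULL
  for (i in seq_along(datetime)) {
    in_window <- !is.null(window_start) &&
      as.numeric(difftime(datetime[i], window_start, units = "hours")) <= window_hours
    if (!in_window && !is.na(event[i])) {
      # An event outside any open window starts a new window
      current_id <- current_id + 1L
      window_start <- datetime[i]
      in_window <- TRUE
    }
    if (in_window) id[i] <- current_id
  }
  id
}

# Data from the question, with the two missing hours removed
datetime <- seq(as.POSIXct("2010-04-03 03:00:00", tz = "UTC"),
                as.POSIXct("2010-04-04 10:00:00", tz = "UTC"),
                by = "hour")[-c(13, 14)]
event <- c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6)

ids <- assign_event_ids(datetime, event)
data.frame(datetime, event, id = ids)
```

Because the window is measured with difftime rather than by counting rows, the missing hours between rows 12 and 13 are handled without any padding.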
I struggle with nested ifelse. I want to create a new variable using dplyr::mutate based on values of other variables. See the reproducible example below.
library(dplyr)
library(hms)
# make a test dataframe
datetime <- as.POSIXct(c("2015-01-26 10:10:00 UTC","2015-01-26 10:20:00 UTC","2015-01-26 10:30:00 UTC", "2015-01-26 10:40:00 UTC","2015-01-26 10:50:00 UTC","2015-01-26 11:00:00 UTC","2015-01-26 00:10:00 UTC","2015-01-26 11:20:00 UTC","2015-01-26 11:30:00 UTC","2017-03-10 10:00:00 UTC"))
time <- hms::as_hms(datetime)
pco2_corr <- c(90,135,181,226,272,317,363,NA,454,300)
State_Zero <- c(NA,NA,1,rep(NA,7))
State_Flush <- c(rep(NA,4),1,rep(NA,5))
z <- tibble(datetime, time, pco2_corr, State_Zero, State_Flush)
# now create a new variable
z <- z %>%
dplyr::mutate(pco2_corr_qf = ifelse(is.na(pco2_corr), 15,
ifelse((State_Zero >= 1 | State_Flush >= 1), 4,
ifelse(pco2_corr < 100 | pco2_corr > 450, 7,
ifelse((time >= "00:00:00" & time <= "01:30:00") |
(time >= "12:00:00" & time <= "13:00:00"), 16,
ifelse((datetime >= "2017-03-10 08:00:00" &
datetime < "2017-03-21 20:00:00"), 99,
1))))))
z
# A tibble: 10 x 6
datetime time pco2_corr State_Zero State_Flush pco2_corr_qf
<dttm> <time> <dbl> <dbl> <dbl> <dbl>
1 2015-01-26 10:10:00 10:10 90 NA NA NA
2 2015-01-26 10:20:00 10:20 135 NA NA NA
3 2015-01-26 10:30:00 10:30 181 1 NA 4
4 2015-01-26 10:40:00 10:40 226 NA NA NA
5 2015-01-26 10:50:00 10:50 272 NA 1 4
6 2015-01-26 11:00:00 11:00 317 NA NA NA
7 2015-01-26 00:10:00 00:10 363 NA NA NA
8 2015-01-26 11:20:00 11:20 NA NA NA 15
9 2015-01-26 11:30:00 11:30 454 NA NA NA
10 2017-03-10 10:00:00 10:00 300 NA NA NA
The first two ifelse work fine but the next three do not. The new variable pco2_corr_qf should not have any NA but values 7, 16, 99 and 1.
What am I doing wrong?
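You are comparing time with a string, which gives incorrect output; convert the string to the relevant class first. Note also that ifelse returns NA wherever its test is NA, which happens for State_Zero >= 1 | State_Flush >= 1 on every row where both columns are NA. case_when is a better alternative to nested ifelse here, because it treats an NA condition as not matched and moves on to the next one.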
library(dplyr)
library(hms)
z %>%
mutate(pco2_corr_qf = case_when(
is.na(pco2_corr) ~ 15,
State_Zero >= 1 | State_Flush >= 1 ~ 4,
pco2_corr < 100 | pco2_corr > 450 ~ 7,
(time >= as_hms("00:00:00") & time <= as_hms("01:30:00")) |
(time >= as_hms("12:00:00") & time <= as_hms("13:00:00")) ~ 16,
datetime >= as.POSIXct("2017-03-10 08:00:00") &
datetime < as.POSIXct("2017-03-21 20:00:00") ~ 99,
TRUE ~ 1))
# datetime time pco2_corr State_Zero State_Flush pco2_corr_qf
# <dttm> <time> <dbl> <dbl> <dbl> <dbl>
# 1 2015-01-26 10:10:00 10:10 90 NA NA 7
# 2 2015-01-26 10:20:00 10:20 135 NA NA 1
# 3 2015-01-26 10:30:00 10:30 181 1 NA 4
# 4 2015-01-26 10:40:00 10:40 226 NA NA 1
# 5 2015-01-26 10:50:00 10:50 272 NA 1 4
# 6 2015-01-26 11:00:00 11:00 317 NA NA 1
# 7 2015-01-26 00:10:00 00:10 363 NA NA 16
# 8 2015-01-26 11:20:00 11:20 NA NA NA 15
# 9 2015-01-26 11:30:00 11:30 454 NA NA 7
#10 2017-03-10 10:00:00 10:00 300 NA NA 99
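The NA behaviour of ifelse can be seen in isolation with a small base R sketch. The vectors below are made up for illustration, and wrapping the test in %in% TRUE is just one possible way to turn NA into FALSE if you want to stay with ifelse.

```r
# Illustrative vectors, not the data from the question
state_zero  <- c(NA, 1, NA)
state_flush <- c(NA, NA, NA)
value       <- c(90, 200, 300)

# NA | NA is NA, TRUE | NA is TRUE
flagged <- state_zero >= 1 | state_flush >= 1

# An NA test yields NA, and the inner ifelse never gets a say for those rows
naive <- ifelse(flagged, 4, ifelse(value < 100, 7, 1))

# `%in% TRUE` maps NA to FALSE, so the remaining conditions are evaluated
safe <- ifelse(flagged %in% TRUE, 4, ifelse(value < 100, 7, 1))

naive
safe
```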
I've got a data frame with 3 variables: a POSIXct object (time), a numeric (RRR) and a factor (he). RRR is the amount of liquid precipitation and he is the hydrological event number; its time corresponds to the beginning of the flood event.
df <- structure(list(time = structure(c(1396879200, 1396922400, 1396976400,
1397008800, 1397095200, 1397332800, 1397354400, 1397397600, 1397451600,
1397484000, 1397527200, 1397786400, 1397959200, 1398002400, 1398024000,
1398132000, 1398175200, 1398218400, 1398261600, 1398369600, 1398466800,
1398477600, 1398520800, 1398564000, 1398607200, 1398747600, 1398780000,
1398909600, 1398952800, 1398974400, 1398996000),
class = c("POSIXct", "POSIXt"),
tzone = ""),
RRR = c(NA, 2, NA, 4, NA, NA, 0.9, 3,
NA, 0.4, 11, NA, 0.5, 1, NA, 13, 4, 0.8, 0.3, NA, NA, 8, 4, 11,
1, NA, 7, 1, 0.4, NA, 4),
he = c(1, NA, 2, NA, 3, 4, NA, NA,
5, NA, NA, 6, NA, NA, 7, NA, NA, NA, NA, 8, 9, NA, NA, NA, NA,
10, NA, NA, NA, 11, NA)),
class = "data.frame",
row.names = c(NA, -31L))
The head of my data frame looks as follows:
> df
time RRR he
1 2014-04-07 18:00:00 NA 1
2 2014-04-08 06:00:00 2.0 NA
3 2014-04-08 21:00:00 NA 2
4 2014-04-09 06:00:00 4.0 NA
5 2014-04-10 06:00:00 NA 3
6 2014-04-13 00:00:00 NA 4
7 2014-04-13 06:00:00 0.9 NA
8 2014-04-13 18:00:00 3.0 NA
9 2014-04-14 09:00:00 NA 5
I need to calculate the time difference between time of every he value and last non-NA RRR value. For example, for he = 2 the desired difference would be difftime(df$time[3], df$time[2]), while for he = 4 the time difference should be difftime(df$time[6], df$time[4]). So in the end I want to get a dataframe like this, where 'diff' is the time difference in hours.
> df
time RRR he diff
1 2014-04-07 18:00:00 NA 1 NA
2 2014-04-08 06:00:00 2.0 NA NA
3 2014-04-08 21:00:00 NA 2 15
4 2014-04-09 06:00:00 4.0 NA NA
5 2014-04-10 06:00:00 NA 3 24
6 2014-04-13 00:00:00 NA 4 90
7 2014-04-13 06:00:00 0.9 NA NA
8 2014-04-13 18:00:00 3.0 NA NA
9 2014-04-14 09:00:00 NA 5 15
I'm sure that there must be easier ways, but using tidyverse and data.table you can do:
library(tidyverse)
library(data.table) # for rleid()
df %>%
mutate(time = as.POSIXct(time, format = "%Y-%m-%d %H:%M:%S")) %>% #Transforming "time" into a datetime object
fill(RRR) %>% #Filling the NA values in "RRR" with the last non-NA value
group_by(temp = rleid(RRR)) %>% #Grouping by run length of "RRR"
mutate(temp2 = seq_along(temp)) %>% #Sequencing around the run length of "RRR"
group_by(RRR, temp) %>% #Group by "RRR" and run length of "RRR"
mutate(diff = ifelse(!is.na(he), difftime(time, time[temp2 == 1], units="hours"), NA)) %>% #Computing the difference in hours between the first occurrence of a non-NA "RRR" value and the non-NA "he" values
ungroup() %>%
select(-temp, -temp2, -RRR) %>% #Removing the redundant variables
rowid_to_column() %>% #Creating unique row IDs
left_join(df %>%
rowid_to_column() %>%
select(RRR, rowid), by = c("rowid" = "rowid")) %>% #Merging with the original df to get the original values of "RRR"
select(-rowid) #Removing the redundant variables
time he diff RRR
<dttm> <dbl> <dbl> <dbl>
1 2014-04-07 16:00:00 1. 0. NA
2 2014-04-08 04:00:00 NA NA 2.00
3 2014-04-08 19:00:00 2. 15. NA
4 2014-04-09 04:00:00 NA NA 4.00
5 2014-04-10 04:00:00 3. 24. NA
6 2014-04-12 22:00:00 4. 90. NA
7 2014-04-13 04:00:00 NA NA 0.900
8 2014-04-13 16:00:00 NA NA 3.00
9 2014-04-14 07:00:00 5. 15. NA
10 2014-04-14 16:00:00 NA NA 0.400
Here's a data.table approach making use of its non-equi join capabilities:
library(data.table)
setDT(df)
df[df[!is.na(he)][df[!is.na(RRR)], on = .(time>time), rrr_time := i.time],
on = .(time, he), rrr_time := i.rrr_time][, diff := difftime(time, rrr_time)]
The result is:
# time RRR he rrr_time diff
# <POSc> <num> <num> <POSc> <difftime>
# 1: 2014-04-07 16:00:00 NA 1 <NA> NA hours
# 2: 2014-04-08 04:00:00 2.0 NA <NA> NA hours
# 3: 2014-04-08 19:00:00 NA 2 2014-04-08 04:00:00 15 hours
# 4: 2014-04-09 04:00:00 4.0 NA <NA> NA hours
# 5: 2014-04-10 04:00:00 NA 3 2014-04-09 04:00:00 24 hours
# 6: 2014-04-12 22:00:00 NA 4 2014-04-09 04:00:00 90 hours
# 7: 2014-04-13 04:00:00 0.9 NA <NA> NA hours
# 8: 2014-04-13 16:00:00 3.0 NA <NA> NA hours
# 9: 2014-04-14 07:00:00 NA 5 2014-04-13 16:00:00 15 hours
# 10: 2014-04-14 16:00:00 0.4 NA <NA> NA hours
# 11: 2014-04-15 04:00:00 11.0 NA <NA> NA hours
# 12: 2014-04-18 04:00:00 NA 6 2014-04-15 04:00:00 72 hours
# 13: 2014-04-20 04:00:00 0.5 NA <NA> NA hours
# 14: 2014-04-20 16:00:00 1.0 NA <NA> NA hours
# 15: 2014-04-20 22:00:00 NA 7 2014-04-20 16:00:00 6 hours
# 16: 2014-04-22 04:00:00 13.0 NA <NA> NA hours
# 17: 2014-04-22 16:00:00 4.0 NA <NA> NA hours
# 18: 2014-04-23 04:00:00 0.8 NA <NA> NA hours
# 19: 2014-04-23 16:00:00 0.3 NA <NA> NA hours
# 20: 2014-04-24 22:00:00 NA 8 2014-04-23 16:00:00 30 hours
# 21: 2014-04-26 01:00:00 NA 9 2014-04-23 16:00:00 57 hours
# 22: 2014-04-26 04:00:00 8.0 NA <NA> NA hours
# 23: 2014-04-26 16:00:00 4.0 NA <NA> NA hours
# 24: 2014-04-27 04:00:00 11.0 NA <NA> NA hours
# 25: 2014-04-27 16:00:00 1.0 NA <NA> NA hours
# 26: 2014-04-29 07:00:00 NA 10 2014-04-27 16:00:00 39 hours
# 27: 2014-04-29 16:00:00 7.0 NA <NA> NA hours
# 28: 2014-05-01 04:00:00 1.0 NA <NA> NA hours
# 29: 2014-05-01 16:00:00 0.4 NA <NA> NA hours
# 30: 2014-05-01 22:00:00 NA 11 2014-05-01 16:00:00 6 hours
# 31: 2014-05-02 04:00:00 4.0 NA <NA> NA hours
# time RRR he rrr_time diff
A base alternative with findInterval (here d is the data frame called df in the question):
t_he <- d$time[!is.na(d$he)]
t_r <- d$time[!is.na(d$RRR)]
i <- findInterval(t_he, t_r)
d[!is.na(d$he), "diff"] <- t_he - t_r[replace(i, i == 0, NA)]
# time RRR he diff
# 1 2014-04-07 16:00:00 NA 1 NA hours
# 2 2014-04-08 04:00:00 2.0 NA NA hours
# 3 2014-04-08 19:00:00 NA 2 15 hours
# 4 2014-04-09 04:00:00 4.0 NA NA hours
# 5 2014-04-10 04:00:00 NA 3 24 hours
# 6 2014-04-12 22:00:00 NA 4 90 hours
# 7 2014-04-13 04:00:00 0.9 NA NA hours
# 8 2014-04-13 16:00:00 3.0 NA NA hours
# 9 2014-04-14 07:00:00 NA 5 15 hours
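Subtracting two POSIXct vectors lets R pick the units of the result, so the same code could return minutes or days on other data. A small sketch of the same findInterval idea with the units fixed to hours; the five rows are copied from the head of the question's data, and the UTC time zone is an assumption made only to keep the example reproducible.

```r
d <- data.frame(
  time = as.POSIXct(c("2014-04-07 18:00", "2014-04-08 06:00",
                      "2014-04-08 21:00", "2014-04-09 06:00",
                      "2014-04-10 06:00"), tz = "UTC"),
  RRR = c(NA, 2, NA, 4, NA),
  he  = c(1, NA, 2, NA, 3)
)

t_he <- d$time[!is.na(d$he)]   # times of the hydrological events
t_r  <- d$time[!is.na(d$RRR)]  # times with recorded precipitation (sorted)

# For each event, index of the latest precipitation time that is <= the event
# time; 0 means there is no earlier precipitation record
i <- findInterval(t_he, t_r)
last_rain <- t_r[replace(i, i == 0, NA)]

# Always report the difference in hours, as a plain numeric column
d$diff <- NA_real_
d$diff[!is.na(d$he)] <- as.numeric(difftime(t_he, last_rain, units = "hours"))
d
```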
I have a dataframe that looks like this:
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
So I have every 15 minutes a measurement value. The structure is:
> str(dat)
'data.frame': 245383 obs. of 2 variables:
$ time : POSIXct, format: "2010-01-01 00:00:00" "2010-01-01 00:15:00" "2010-01-01 00:30:00" "2010-01-01 00:45:00" ...
$ radiation: num 230 443 282 314 286 225 77 89 97 330 ...
Now I want to interpolate, so my aim is a dataframe with values for every minute.
I searched a few times and tried some methods with the zoo package, but I have some problems with the data frame. I have to convert it to a text file, I guess? I have no idea how to do that.
Here is a tidyverse solution.
library('tidyverse')
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
dat <- head(dat, 3)
dat
# time radiation
# 1 2010-01-01 00:00:00 241
# 2 2010-01-01 00:15:00 438
# 3 2010-01-01 00:30:00 457
You can create a data frame with all of the required times. Using full_join will make the missing radiation values be NA.
approx will fill the NAs with a linear approximation.
dat %>%
full_join(data.frame(time = seq(
from = min(.$time),
to = max(.$time),
by = 'min'))) %>%
arrange(time) %>%
mutate(radiation = approx(radiation, n = n())$y)
# Joining, by = "time"
# time radiation
# 1 2010-01-01 00:00:00 241.0000
# 2 2010-01-01 00:01:00 254.1333
# 3 2010-01-01 00:02:00 267.2667
# 4 2010-01-01 00:03:00 280.4000
# 5 2010-01-01 00:04:00 293.5333
# 6 2010-01-01 00:05:00 306.6667
# 7 2010-01-01 00:06:00 319.8000
# 8 2010-01-01 00:07:00 332.9333
# 9 2010-01-01 00:08:00 346.0667
# 10 2010-01-01 00:09:00 359.2000
# 11 2010-01-01 00:10:00 372.3333
# 12 2010-01-01 00:11:00 385.4667
# 13 2010-01-01 00:12:00 398.6000
# 14 2010-01-01 00:13:00 411.7333
# 15 2010-01-01 00:14:00 424.8667
# 16 2010-01-01 00:15:00 438.0000
# 17 2010-01-01 00:16:00 439.2667
# 18 2010-01-01 00:17:00 440.5333
# 19 2010-01-01 00:18:00 441.8000
# 20 2010-01-01 00:19:00 443.0667
# 21 2010-01-01 00:20:00 444.3333
# 22 2010-01-01 00:21:00 445.6000
# 23 2010-01-01 00:22:00 446.8667
# 24 2010-01-01 00:23:00 448.1333
# 25 2010-01-01 00:24:00 449.4000
# 26 2010-01-01 00:25:00 450.6667
# 27 2010-01-01 00:26:00 451.9333
# 28 2010-01-01 00:27:00 453.2000
# 29 2010-01-01 00:28:00 454.4667
# 30 2010-01-01 00:29:00 455.7333
# 31 2010-01-01 00:30:00 457.0000
You can use the approx function like this:
dat <- data.frame(time = seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 2887, replace = TRUE))
mins <- seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60)
out <- approx(dat$time, dat$radiation, mins)
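approx returns a list with components x and y rather than a data frame, so one more step is needed to get a per-minute table. A minimal sketch on three made-up readings (the radiation values and the UTC time zone are assumptions for illustration); the times are converted to numeric explicitly so it is clear what is being interpolated.

```r
# Three illustrative 15-minute readings
dat <- data.frame(
  time = seq(as.POSIXct("2016-12-01 00:00:00", tz = "UTC"),
             by = 15 * 60, length.out = 3),
  radiation = c(100, 130, 160)
)

# Target grid: one timestamp per minute over the observed range
mins <- seq(min(dat$time), max(dat$time), by = 60)

# Linear interpolation on the numeric (seconds since epoch) time axis
out <- approx(x = as.numeric(dat$time), y = dat$radiation,
              xout = as.numeric(mins))

# Rebuild a data frame, reusing the POSIXct grid for the time column
per_minute <- data.frame(time = mins, radiation = out$y)
head(per_minute)
```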
Here is a solution using pad from the padr package to fill the gaps in your time column. na.approx is used for interpolation.
library(padr)
library(zoo)
dat[1:2, ]
time radiation
#1 2010-01-01 00:00:00 133
#2 2010-01-01 00:15:00 187
dat_padded <- pad(dat[1:2, ], interval = "min")
dat_padded$radiation <- zoo::na.approx(dat_padded$radiation)
dat_padded
time radiation
#1 2010-01-01 00:00:00 133.0
#2 2010-01-01 00:01:00 136.6
#3 2010-01-01 00:02:00 140.2
#4 2010-01-01 00:03:00 143.8
#5 2010-01-01 00:04:00 147.4
#6 2010-01-01 00:05:00 151.0
#7 2010-01-01 00:06:00 154.6
#8 2010-01-01 00:07:00 158.2
#9 2010-01-01 00:08:00 161.8
#10 2010-01-01 00:09:00 165.4
#11 2010-01-01 00:10:00 169.0
#12 2010-01-01 00:11:00 172.6
#13 2010-01-01 00:12:00 176.2
#14 2010-01-01 00:13:00 179.8
#15 2010-01-01 00:14:00 183.4
#16 2010-01-01 00:15:00 187.0
data
set.seed(1)
dat <-
data.frame(
time = seq(
as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60 * 99,
by = 60 * 15
),
radiation = sample(1:500, 245383, replace = TRUE)
)
I have a time series data of two columns, and I want a graph with averaged hourly pattern for each month, like the graph attached but with two time series.
timestamp ET_control ET_treatment
1 2016-01-01 00:00:00 NA NA
2 2016-01-01 00:30:00 NA NA
3 2016-01-01 01:00:00 NA NA
4 2016-01-01 01:30:00 NA NA
5 2016-01-01 02:00:00 NA NA
6 2016-01-01 02:30:00 NA NA
7 2016-01-01 03:00:00 NA NA
8 2016-01-01 03:30:00 NA NA
9 2016-01-01 04:00:00 NA NA
10 2016-01-01 04:30:00 NA NA
11 2016-01-01 05:00:00 NA NA
12 2016-01-01 05:30:00 NA NA
13 2016-01-01 06:00:00 NA NA
14 2016-01-01 06:30:00 NA NA
15 2016-01-01 07:00:00 NA NA
16 2016-01-01 07:30:00 NA NA
17 2016-01-01 08:00:00 NA NA
18 2016-01-01 08:30:00 NA NA
19 2016-01-01 09:00:00 NA NA
20 2016-01-01 09:30:00 NA NA
21 2016-01-01 10:00:00 NA NA
22 2016-01-01 10:30:00 NA NA
23 2016-01-01 11:00:00 NA NA
24 2016-01-01 11:30:00 0.09863437 NA
25 2016-01-01 12:00:00 0.11465258 NA
26 2016-01-01 12:30:00 0.12356855 NA
27 2016-01-01 13:00:00 0.09246215 0.085398782
28 2016-01-01 13:30:00 0.08843156 0.072877001
29 2016-01-01 14:00:00 0.08536019 0.081885947
30 2016-01-01 14:30:00 0.08558541 NA
31 2016-01-01 15:00:00 0.05571436 NA
32 2016-01-01 15:30:00 0.04087248 0.038582547
33 2016-01-01 16:00:00 0.04233724 NA
34 2016-01-01 16:30:00 0.02150660 0.019560578
35 2016-01-01 17:00:00 0.01803765 0.019691155
36 2016-01-01 17:30:00 NA 0.005190489
37 2016-01-01 18:00:00 NA NA
38 2016-01-01 18:30:00 NA NA
39 2016-01-01 19:00:00 NA NA
40 2016-01-01 19:30:00 NA NA
41 2016-01-01 20:00:00 NA NA
42 2016-01-01 20:30:00 NA NA
43 2016-01-01 21:00:00 NA NA
44 2016-01-01 21:30:00 NA NA
45 2016-01-01 22:00:00 NA NA
46 2016-01-01 22:30:00 NA NA
47 2016-01-01 23:00:00 NA NA
48 2016-01-01 23:30:00 NA NA
49 2016-01-02 00:00:00 NA NA
50 2016-01-02 00:30:00 NA NA
Given that t is your data.frame, with packages dplyr and ggplot2:
t <- t %>% mutate(
month = format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%b"),
hour=format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%H"))
tm <- t %>% group_by(month, hour) %>%
summarize(ET_control_mean=mean(ET_control, na.rm=T))
ggplot(tm, aes(x=hour, y=ET_control_mean)) + geom_point() + facet_wrap(~ month)
If you want to have both columns in your graph, you should transform your data into the 'long' format.
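A rough sketch of the long-format version, using base R for the reshaping so it runs without extra packages. The data frame et and its values are made up, the month is formatted as a number ("%m") to avoid locale-dependent names and alphabetical panel order, and the plot is only built when ggplot2 is installed.

```r
# Illustrative half-hourly data in the shape of the question
et <- data.frame(
  timestamp = c("2016-01-01 13:00:00", "2016-01-01 13:30:00",
                "2016-01-02 13:00:00", "2016-02-01 13:00:00"),
  ET_control   = c(0.10, 0.08, 0.12, 0.20),
  ET_treatment = c(0.08, NA, 0.10, 0.18)
)
ts <- as.POSIXct(et$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Long format: one row per timestamp and series
long <- data.frame(
  month  = rep(format(ts, "%m"), times = 2),
  hour   = rep(format(ts, "%H"), times = 2),
  series = rep(c("ET_control", "ET_treatment"), each = nrow(et)),
  ET     = c(et$ET_control, et$ET_treatment)
)

# Mean per month, hour and series; rows with NA in ET are dropped by default
long_means <- aggregate(ET ~ month + hour + series, data = long, FUN = mean)

# One panel per month, one line per series
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long_means,
                       ggplot2::aes(x = hour, y = ET,
                                    colour = series, group = series)) +
    ggplot2::geom_line() +
    ggplot2::geom_point() +
    ggplot2::facet_wrap(~ month)
}
```

With dplyr and tidyr loaded, the reshaping step could presumably be written with pivot_longer followed by group_by and summarize instead.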