Fill in missing dates in a dataframe - r

I have two dataframes, interest rates and monthly standard deviation of price returns, that I have managed to merge together. However, the interest rate data has gaps in its dates where the markets were not open, i.e. weekends and holidays. The monthly returns all start on the first of the month, so where this lines up with a market closure the data doesn't merge correctly. An example of the dataframes is:
Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341
Date InterestRate
03/11/2014 2
31/10/2014 1.5
As you can see, there is no 01/11/2014 in the interest rate data, so merging them together gives me:
Date InterestRate Rollingstd
03/11/2014 2 0.01341
31/10/2014 1.5 0.01341
I guess a fix for this would be to expand the interest rate dataframe so that it includes all dates and just fill the interest rate data up through the gaps, so it looks like this:
Date InterestRate
03/11/2014 2
02/11/2014 1.5
01/11/2014 1.5
31/10/2014 1.5
This would ensure there are no missing dates in the dataframe. Any ideas on how I could do this?

Do you want this?
df2 <- read.table(text = 'Date InterestRate
03/11/2014 2
31/10/2014 1.5', header = TRUE)

df1 <- read.table(text = 'Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341', header = TRUE)

library(tidyverse)

df1 %>%
  full_join(df2, by = 'Date') %>%
  mutate(Date = as.Date(Date, '%d/%m/%Y')) %>%
  arrange(Date) %>%
  complete(Date = seq.Date(min(Date), max(Date), 'days')) %>%
  fill(InterestRate, .direction = 'up') %>%
  as.data.frame()
#> Date Rollingstd InterestRate
#> 1 2014-10-01 0.01341 1.5
#> 2 2014-10-02 NA 1.5
#> 3 2014-10-03 NA 1.5
#> 4 2014-10-04 NA 1.5
#> 5 2014-10-05 NA 1.5
#> 6 2014-10-06 NA 1.5
#> 7 2014-10-07 NA 1.5
#> 8 2014-10-08 NA 1.5
#> 9 2014-10-09 NA 1.5
#> 10 2014-10-10 NA 1.5
#> 11 2014-10-11 NA 1.5
#> 12 2014-10-12 NA 1.5
#> 13 2014-10-13 NA 1.5
#> 14 2014-10-14 NA 1.5
#> 15 2014-10-15 NA 1.5
#> 16 2014-10-16 NA 1.5
#> 17 2014-10-17 NA 1.5
#> 18 2014-10-18 NA 1.5
#> 19 2014-10-19 NA 1.5
#> 20 2014-10-20 NA 1.5
#> 21 2014-10-21 NA 1.5
#> 22 2014-10-22 NA 1.5
#> 23 2014-10-23 NA 1.5
#> 24 2014-10-24 NA 1.5
#> 25 2014-10-25 NA 1.5
#> 26 2014-10-26 NA 1.5
#> 27 2014-10-27 NA 1.5
#> 28 2014-10-28 NA 1.5
#> 29 2014-10-29 NA 1.5
#> 30 2014-10-30 NA 1.5
#> 31 2014-10-31 NA 1.5
#> 32 2014-11-01 0.00925 2.0
#> 33 2014-11-02 NA 2.0
#> 34 2014-11-03 NA 2.0
Created on 2021-05-23 by the reprex package (v2.0.0)
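A small follow-up, not part of the answer above: the question framed the fix as expanding the interest-rate table on its own and carrying the last observed rate across the gap (so 01/11/2014 and 02/11/2014 keep 1.5). A minimal sketch of that variant, reusing df1 and df2 from the reprex:
library(tidyverse)

# Sketch: complete the interest-rate table to a daily grid, carry the last
# observed rate forward in time, then join the monthly rolling-std data on.
df2_daily <- df2 %>%
  mutate(Date = as.Date(Date, '%d/%m/%Y')) %>%
  arrange(Date) %>%
  complete(Date = seq.Date(min(Date), max(Date), 'days')) %>%
  fill(InterestRate, .direction = 'down')   # 01/11 and 02/11 get 1.5

df2_daily %>%
  left_join(df1 %>% mutate(Date = as.Date(Date, '%d/%m/%Y')), by = 'Date')
Which fill direction is right depends on whether the Friday rate or the following Monday's rate should apply over the weekend.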

Related

heatwaveR package, ts2clm() turn temperature values into NA

I'm using the heatwaveR package in R to make a plot (event_line()) and visualize the heatwaves over the years. The first step is to run ts2clm(), but this command turns my temp column into NA so I can't plot anything. Does anyone see any errors?
This is my data:
>>> Data
t temp
[Date] [num]
0 2020-05-14 6.9
1 2020-05-06 6.8
2 2020-04-23 5.5
3 2020-04-16 3.6
4 2020-03-31 2.5
5 2020-02-25 2.3
6 2020-01-30 2.8
7 2019-10-02 13.4
8 2022-09-02 19
9 2022-08-15 18.7
...
687 1974-05-06 4.2
This is my code:
library(readxl)
library(heatwaveR)

# Load data
Data <- read_xlsx("seili_raw_temp.xlsx")

# Set t as class Date
Data$t <- as.Date(Data$t, format = "%Y-%m-%d")

# Construct seasonal and threshold climatologies
ts <- ts2clm(Data, climatologyPeriod = c("1974-05-06", "2020-05-14"))
# This is the point where almost all temp values turn into NA, so you can ignore below.

# Detect events
res <- detect_event(ts)

# Draw heatwave plot
event_line(res, min_duration = 3, metric = "int_cum",
           start_date = "1974-05-06", end_date = "2020-05-14")
The data you posted isn't long enough to get the function to work, so I just made some up:
library(heatwaveR)
library(lubridate)

set.seed(1234)
Data <- data.frame(
  t = seq(ymd("2015-01-01"), ymd("2023-01-01"), by = "7 day"))
Data$temp <- runif(nrow(Data), 0, 45)
Then, when I execute the function, I get the result below. The problem is that your data (like the ones I generated) have one observation every 7 days. The ts2clm() function pads out the dataset so that every day has an entry and if a temperature was not observed on that day, it fills in with a missing value.
ts <- ts2clm(Data, climatologyPeriod = c("2015-01-01", "2022-12-29"))
ts
#> # A tibble: 2,920 × 5
#> doy t temp seas thresh
#> <int> <date> <dbl> <dbl> <dbl>
#> 1 1 2015-01-01 5.12 22.5 38.6
#> 2 2 2015-01-02 NA 22.4 38.5
#> 3 3 2015-01-03 NA 22.2 38.2
#> 4 4 2015-01-04 NA 22.1 37.9
#> 5 5 2015-01-05 NA 21.9 37.3
#> 6 6 2015-01-06 NA 21.7 36.8
#> 7 7 2015-01-07 NA 21.5 36.5
#> 8 8 2015-01-08 28.0 21.3 36.1
#> 9 9 2015-01-09 NA 21.2 36.1
#> 10 10 2015-01-10 NA 21.0 35.8
#> # … with 2,910 more rows
Created on 2023-02-10 by the reprex package (v2.0.1)
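To see the padding behaviour in isolation, without heatwaveR, here is a small sketch of mine using tidyr::complete() on the made-up weekly data; it produces the same pattern of NAs between observations that ts2clm() returns in the temp column:
library(tidyr)

# Expand the weekly data to a daily grid; days without an observation get NA.
Data_daily <- complete(Data, t = seq(min(t), max(t), by = "1 day"))
head(Data_daily, 10)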

How to copy part of rows based on group by 'id' in R?

I have a data frame such as below:
id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54
Within each 'id', I have to copy the values of the 'PP' through 'nh' columns to the other rows with that 'id'. My target data frame is as below:
id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA 1 0 0 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
I would appreciate it if anybody shares their comments with me.
Best Regards
Another option using na.locf:
df <- read.table(text = "id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54", header = TRUE)

library(dplyr)
library(zoo)

df %>%
  group_by(id) %>%
  summarise(across(everything(), ~ na.locf(., na.rm = FALSE, fromLast = FALSE)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 5 × 10
#> # Groups: id [2]
#> id Date Age Sex PP Duration cd nh W_B R_B
#> <int> <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
#> 1 583 99/07/19 51 2 NA 1 0 0 6.2 4.26
#> 2 583 99/07/23 51 2 NA 1 0 0 7 4.35
#> 3 3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
#> 4 3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
#> 5 3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
Created on 2022-07-02 by the reprex package (v2.0.1)
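A small variant of the na.locf() approach above (my sketch, not part of either answer): using mutate() instead of summarise() avoids the grouping message, and limiting across() to the PP:nh columns leaves everything else untouched.
library(dplyr)
library(zoo)

# Fill NAs forward within each id, only for the PP through nh columns.
df %>%
  group_by(id) %>%
  mutate(across(PP:nh, ~ na.locf(., na.rm = FALSE))) %>%
  ungroup()
The tidyr-based answer below does the same job with fill().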
library(tidyverse)

df <- read_table("id Date Age Sex PP Duration cd nh W_B R_B
583 99/07/19 51 2 NA 1 0 0 6.2 4.26
583 99/07/23 51 2 NA NA NA NA 7 4.35
3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
3024 99/11/01 42 2 NA NA NA NA 5.2 5.47
3024 99/11/02 42 2 NA NA NA NA 7.1 5.54")

df %>%
  group_by(id) %>%
  fill(PP:nh, .direction = 'updown')
#> # A tibble: 5 × 10
#> # Groups: id [2]
#> id Date Age Sex PP Duration cd nh W_B R_B
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 583 99/07/19 51 2 NA 1 0 0 6.2 4.26
#> 2 583 99/07/23 51 2 NA 1 NA 0 7 4.35
#> 3 3024 99/10/30 42 2 4 6 NA 1 6.2 5.28
#> 4 3024 99/11/01 42 2 4 6 NA 1 5.2 5.47
#> 5 3024 99/11/02 42 2 4 6 NA 1 7.1 5.54
Created on 2022-07-02 by the reprex package (v2.0.1)

How to transpose or pivot columns of a data frame in R

I am trying to transpose data in R.
The data was retrieved from a JSON file, and here is an example of the dataframe I am using:
date parameter value
2020-07-01T23:50:00Z wind_dir 236.0
2020-07-01T23:40:00Z wind_dir 236.0
2020-07-01T23:40:00Z wind_speed 1.9
2020-07-01T23:30:00Z wind_dir 239.0
2020-07-01T23:10:00Z wind_dir 184.0
2020-07-01T23:00:00Z wind_dir 178.0
2020-07-01T22:50:00Z wind_speed 1.1
2020-07-01T22:50:00Z wind_dir 197.0
2020-07-01T22:40:00Z wind_speed 1.8
2020-07-01T22:30:00Z wind_speed 1.4
2020-07-01T22:20:00Z wind_dir 172.0
2020-07-01T22:20:00Z wind_speed 1.4
2020-07-01T22:00:00Z wind_dir 170.0
I need to change the date, so it does not include T and Z.
I want to transpose the rows and separate the parameters into two columns: wind_speed and wind_dir
The final dataset should look like:
date wind_dir wind_speed
2020-07-01 23:50:00 236.0 NA
2020-07-01 23:40:00 236.0 1.9
2020-07-01 23:30:00 239.0 NA
2020-07-01 23:10:00 184.0 NA
2020-07-01 23:00:00 178.0 NA
2020-07-01 22:50:00 197.0 1.1
2020-07-01 22:40:00 NA 1.8
2020-07-01 22:30:00 NA 1.4
2020-07-01 22:20:00 172.0 1.4
2020-07-01 22:00:00 170.0 NA
I would also like the rows sorted by increasing timestamp.
I appreciate your help!
You may try
library(dplyr)
library(tidyr)   # for pivot_wider()

df %>%
  mutate(date = gsub("T|Z", " ", date)) %>%
  pivot_wider(names_from = parameter, values_from = value)
date wind_dir wind_speed
<chr> <dbl> <dbl>
1 "2020-07-01 23:50:00 " 236 NA
2 "2020-07-01 23:40:00 " 236 1.9
3 "2020-07-01 23:30:00 " 239 NA
4 "2020-07-01 23:10:00 " 184 NA
5 "2020-07-01 23:00:00 " 178 NA
6 "2020-07-01 22:50:00 " 197 1.1
7 "2020-07-01 22:40:00 " NA 1.8
8 "2020-07-01 22:30:00 " NA 1.4
9 "2020-07-01 22:20:00 " 172 1.4
10 "2020-07-01 22:00:00 " 170 NA
There are functions called pivot_wider() and pivot_longer() from the tidyr package you can use. The T and Z are part of ISO 8601, so you can easily parse the strings into an object of class datetime:
library(tidyverse)

data <- tribble(
  ~date,                    ~parameter,   ~value,
  "2020-07-01T23:50:00Z ",  "wind_dir",   236.0,
  "2020-07-01T23:40:00Z ",  "wind_dir",   236.0,
  "2020-07-01T23:40:00Z",   "wind_speed", 1.9
)
data
#> # A tibble: 3 x 3
#> date parameter value
#> <chr> <chr> <dbl>
#> 1 "2020-07-01T23:50:00Z " wind_dir 236
#> 2 "2020-07-01T23:40:00Z " wind_dir 236
#> 3 "2020-07-01T23:40:00Z" wind_speed 1.9
data %>%
mutate(date = date %>% parse_datetime()) %>%
pivot_wider(names_from = parameter, values_from = value)
#> # A tibble: 2 x 3
#> date wind_dir wind_speed
#> <dttm> <dbl> <dbl>
#> 1 2020-07-01 23:50:00 236 NA
#> 2 2020-07-01 23:40:00 236 1.9
Created on 2021-10-18 by the reprex package (v2.0.1)
The T and Z removals can be done using base R gsub(). The rest can be done using pivot_wider() from the tidyr package:
raw <- read.table(text = "date parameter value
2020-07-01T23:50:00Z wind_dir 236.0
2020-07-01T23:40:00Z wind_dir 236.0
2020-07-01T23:40:00Z wind_speed 1.9
2020-07-01T23:30:00Z wind_dir 239.0
2020-07-01T23:10:00Z wind_dir 184.0
2020-07-01T23:00:00Z wind_dir 178.0
2020-07-01T22:50:00Z wind_speed 1.1
2020-07-01T22:50:00Z wind_dir 197.0
2020-07-01T22:40:00Z wind_speed 1.8
2020-07-01T22:30:00Z wind_speed 1.4
2020-07-01T22:20:00Z wind_dir 172.0
2020-07-01T22:20:00Z wind_speed 1.4
2020-07-01T22:00:00Z wind_dir 170.0", header = TRUE)
raw$date <- trimws(gsub("[TZ]", " ", raw$date))
library(tidyr)
packageVersion("tidyr")
#> [1] '1.1.2'
raw <- pivot_wider(raw,
                   names_from = "parameter",
                   values_from = "value")
raw
#> # A tibble: 10 x 3
#> date wind_dir wind_speed
#> <chr> <dbl> <dbl>
#> 1 2020-07-01 23:50:00 236 NA
#> 2 2020-07-01 23:40:00 236 1.9
#> 3 2020-07-01 23:30:00 239 NA
#> 4 2020-07-01 23:10:00 184 NA
#> 5 2020-07-01 23:00:00 178 NA
#> 6 2020-07-01 22:50:00 197 1.1
#> 7 2020-07-01 22:40:00 NA 1.8
#> 8 2020-07-01 22:30:00 NA 1.4
#> 9 2020-07-01 22:20:00 172 1.4
#> 10 2020-07-01 22:00:00 170 NA
Created on 2021-10-18 by the reprex package (v2.0.0)
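None of the answers sort the result, but the question also asked for the rows in increasing time order. A minimal follow-up (my sketch), reusing the raw object from the last answer: once the T and Z are removed, the character timestamps sort chronologically, so a plain arrange() is enough.
library(dplyr)

raw <- arrange(raw, date)   # oldest timestamp first
raw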

Create a conditional column based on another table

I have two data frames, Table1 and Table2.
Table1:
code
CM171
CM114
CM129
CM131
CM154
CM197
CM42
CM54
CM55
Table2:
code;y;diff_y
CM60;1060;2.9
CM55;255;0.7
CM54;1182;3.2
CM53;1046;2.9
CM47;589;1.6
CM42;992;2.7
CM39;1596;4.4
CM36;1113;3
CM34;1975;5.4
CM226;155;0.4
CM224;46;0.1
CM212;43;0.1
CM197;726;2
CM154;1122;3.1
CM150;206;0.6
CM144;620;1.7
CM132;8;0
CM131;618;1.7
CM129;479;1.3
CM121;634;1.7
CM114;15;0
CM109;1050;2.9
CM107;1165;3.2
CM103;194;0.5
I want to add a column to Table2 based on the values in Table1. I tried to do this using dplyr:
result <- Table2 %>%
  mutate(fbp = case_when(
    code == Table1$code ~ "y",
  ))
But this only works for a few rows. Does anyone know why it doesn't add all rows? The values are not repeated.
Try this. The == operator compares the two vectors element by element (recycling the shorter one), so it only matches where positions happen to line up. Instead you can use %in% to test each code against all values. Here is the code:
# Code
result <- Table2 %>%
  mutate(fbp = case_when(
    code %in% Table1$code ~ "y",
  ))
Output:
code y diff_y fbp
1 CM60 1060 2.9 <NA>
2 CM55 255 0.7 y
3 CM54 1182 3.2 y
4 CM53 1046 2.9 <NA>
5 CM47 589 1.6 <NA>
6 CM42 992 2.7 y
7 CM39 1596 4.4 <NA>
8 CM36 1113 3.0 <NA>
9 CM34 1975 5.4 <NA>
10 CM226 155 0.4 <NA>
11 CM224 46 0.1 <NA>
12 CM212 43 0.1 <NA>
13 CM197 726 2.0 y
14 CM154 1122 3.1 y
15 CM150 206 0.6 <NA>
16 CM144 620 1.7 <NA>
17 CM132 8 0.0 <NA>
18 CM131 618 1.7 y
19 CM129 479 1.3 y
20 CM121 634 1.7 <NA>
21 CM114 15 0.0 y
22 CM109 1050 2.9 <NA>
23 CM107 1165 3.2 <NA>
24 CM103 194 0.5 <NA>
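The same flag can also be built without case_when(); a base R sketch of mine, not part of the answer above:
# Membership test with %in%, then ifelse() to fill the new column.
Table2$fbp <- ifelse(Table2$code %in% Table1$code, "y", NA_character_)
A left_join() on Table1 would work too if you later need more than a yes/no flag.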

Linear interpolation among columns in r

I am working with some temperature data where I have temperatures at certain depths, e.g. 0.9m, 2.5m and 5m. I would like to interpolate these values so I obtain the temperature at each whole meter, e.g. 1m, 2m and 3m. The original data looks like this:
df
# A tibble: 5 x 3
date d_0.9 d_2.5
<dttm> <dbl> <dbl>
1 2004-01-05 03:00:00 7 8
2 2004-01-05 04:00:00 7.5 9
3 2004-01-05 05:00:00 7 8
4 2004-01-05 06:00:00 6.92 NA
What I would like to get is something like:
df_int
# A tibble: 5 x 5
date d_0.9 d_1 d_2 d_2.5
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2004-01-05 03:00:00 7 7.0625 7.6875 8
2 2004-01-05 04:00:00 7.5 7.59375 8.53125 9
3 2004-01-05 05:00:00 7 7.0625 7.6875 8
4 2004-01-05 06:00:00 6.92 NA NA NA
I have to do this for a very large data frame. Is there an efficient way of doing it?
Many thanks in advance
One option is to convert the data to long format, use a join to add rows for the depths we want to interpolate at, and then use approx for the interpolation:
library(tidyverse)

# Data
df = tibble(date = seq(as.POSIXct("2004-01-05 03:00:00"),
                       as.POSIXct("2004-01-05 06:00:00"),
                       by = "1 hour"),
            d_0.9 = c(7, 7.5, 7, 6.92),
            d_2.5 = c(8, NA, 8, NA),
            d_5.0 = c(10, 10.5, 9.4, NA))

# Create a data frame with all of the times and depths we want to interpolate at
depths = sort(unique(c(c(0.9, 2.5, 5), seq(ceiling(0.9), floor(5), 1))))
depths = crossing(date = unique(df$date), depth = depths)

# Convert data to long format, join to add interpolation depths, then interpolate
df.interp = df %>%
  gather(depth, value, -date) %>%
  mutate(depth = as.numeric(gsub("d_", "", depth))) %>%
  full_join(depths) %>%
  arrange(date, depth) %>%
  group_by(date) %>%
  mutate(value.interp = if (length(na.omit(value)) > 1) {
    approx(depth, value, xout = depth)$y
  } else {
    value
  })
In the code above, the if statement is included to prevent approx() from throwing an error when a given date has only one non-missing value.
df.interp
date depth value value.interp
1 2004-01-05 03:00:00 0.9 7.00 7.000000
2 2004-01-05 03:00:00 1.0 NA 7.062500
3 2004-01-05 03:00:00 2.0 NA 7.687500
4 2004-01-05 03:00:00 2.5 8.00 8.000000
5 2004-01-05 03:00:00 3.0 NA 8.400000
6 2004-01-05 03:00:00 4.0 NA 9.200000
7 2004-01-05 03:00:00 5.0 10.00 10.000000
8 2004-01-05 04:00:00 0.9 7.50 7.500000
9 2004-01-05 04:00:00 1.0 NA 7.573171
10 2004-01-05 04:00:00 2.0 NA 8.304878
11 2004-01-05 04:00:00 2.5 NA 8.670732
12 2004-01-05 04:00:00 3.0 NA 9.036585
13 2004-01-05 04:00:00 4.0 NA 9.768293
14 2004-01-05 04:00:00 5.0 10.50 10.500000
15 2004-01-05 05:00:00 0.9 7.00 7.000000
16 2004-01-05 05:00:00 1.0 NA 7.062500
17 2004-01-05 05:00:00 2.0 NA 7.687500
18 2004-01-05 05:00:00 2.5 8.00 8.000000
19 2004-01-05 05:00:00 3.0 NA 8.280000
20 2004-01-05 05:00:00 4.0 NA 8.840000
21 2004-01-05 05:00:00 5.0 9.40 9.400000
22 2004-01-05 06:00:00 0.9 6.92 6.920000
23 2004-01-05 06:00:00 1.0 NA NA
24 2004-01-05 06:00:00 2.0 NA NA
25 2004-01-05 06:00:00 2.5 NA NA
26 2004-01-05 06:00:00 3.0 NA NA
27 2004-01-05 06:00:00 4.0 NA NA
28 2004-01-05 06:00:00 5.0 NA NA
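If you need the result back in the wide layout shown in the question (one column per depth), a short follow-up to the answer above (my sketch) is to pivot the interpolated values back out:
library(tidyverse)

# Spread the interpolated long data back to one column per depth
# (d_0.9, d_1, d_2, ..., d_5), matching the layout in the question.
df.wide <- df.interp %>%
  ungroup() %>%
  select(date, depth, value.interp) %>%
  mutate(depth = paste0("d_", depth)) %>%
  pivot_wider(names_from = depth, values_from = value.interp)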
