Low to high frequency conversion in panel data in R using tempdisagg

I have daily panel data with four variables: date, cusip (an id identifier), PD (probability of default), and price. PD is only available on a quarterly basis, for the first day of January, April, July, and October. I want to generate daily data for PD using Chow-Lin frequency conversion from the tempdisagg package. I know how to apply the td() function to a time series, but I couldn't find examples with panel data frames. Here are my code and sample data (a dput() snippet, so only a few sample days are included instead of full quarters). Running td() reports an error:
Error in td(PD ~ price, conversion = "first", method = "chow-lin-fixed", fixed.rho = 0.5) :
  In numeric mode, 'to' must be an integer number.
I know that both price and PD are high-frequency daily indicators in mydata, so I guess I need to use the to.quarterly() function on PD, or something similar.
library(dplyr)
library(zoo)
library(tempdisagg)
library(tsbox)
mydata <- structure(list(date = structure(c(13516, 13516, 13517, 13517,13518, 13518, 13521, 13605, 13605, 13606), class = "Date"), cusip = c("31677310","66585910", "31677310", "66585910", "31677310", "66585910", "31677310","66585910", "31677310", "66585910"), PD = c(0.076891, 0.096,NA, NA, NA, NA, NA, 0.094341, 0.08867, NA), price = c(40.98, 61.31,40.99, 60.77, 40.18, 59.97, 39.92, 59.96, 38.6, 60.69)), row.names = c(6L,13L, 36L, 43L, 66L, 73L, 96L, 1843L, 1866L, 1873L), class = "data.frame")
mydata <- mydata %>%
  group_by(cusip) %>%
  arrange(cusip, date) %>%
  mutate(PDdaily = td(PD ~ price, conversion = "first",
                      method = "chow-lin-fixed", fixed.rho = 0.5))

Your example is not sufficient: for each disaggregation, we need at least three low-frequency values to be able to perform a regression.
Here is an alternative example, with 3 pairs of low and high frequency series:
library(tidyverse)
library(tempdisagg)
library(tsbox)
mydata <- ts_c(
  low_freq = ts_frequency(fdeaths, "year"),
  high_freq = mdeaths
) %>%
  ts_tbl() %>%
  ts_wide() %>%
  crossing(id = 1:3) %>%
  arrange(id)
Applying td multiple times on data in a data frame will be cumbersome.
It is easier to extract the data into two lists, one with the low and one with high frequency series:
list_lf <- group_split(ts_na_omit(select(mydata, time, value = low_freq, id)), id, keep = FALSE)
list_hf <- group_split(select(mydata, time, value = high_freq, id), id, keep = FALSE)
Now you can use Map() or map2() to apply the function to each pair of elements:
ans <- map2(list_lf, list_hf, ~ predict(td(.x ~ .y)))
Transforming the disaggregated data back to a data frame:
bind_rows(ans, .id = "id")
#> # A tibble: 216 x 3
#> id time value
#> <chr> <date> <dbl>
#> 1 1 1974-01-01 59.2
#> 2 1 1974-02-01 54.2
#> 3 1 1974-03-01 54.4
#> 4 1 1974-04-01 54.4
#> 5 1 1974-05-01 47.3
#> 6 1 1974-06-01 42.8
#> 7 1 1974-07-01 43.3
#> 8 1 1974-08-01 40.6
#> 9 1 1974-09-01 42.0
#> 10 1 1974-10-01 47.3
#> # … with 206 more rows
Created on 2020-06-03 by the reprex package (v0.3.0)
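Coming back to the panel data in the question, the same split/disaggregate/bind pattern could in principle be applied per cusip. This is only a sketch, not tested on the question's data: it assumes each cusip has at least three non-missing quarterly PD values (the tiny dput() sample above does not) and a recent tempdisagg/tsbox that supports daily disaggregation of data frames:
# mydata here refers to the question's daily panel, not the example above
list_lf <- group_split(ts_na_omit(select(mydata, time = date, value = PD, id = cusip)),
                       id, keep = FALSE)   # use .keep = FALSE with dplyr >= 1.0
list_hf <- group_split(select(mydata, time = date, value = price, id = cusip),
                       id, keep = FALSE)

ans <- map2(list_lf, list_hf,
            ~ predict(td(.x ~ .y, conversion = "first",
                         method = "chow-lin-fixed", fixed.rho = 0.5)))
bind_rows(ans, .id = "id")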

Related

Mutate several columns based on one condition

I'd like to assign different values to several columns, based on the value in another column, i.e. do a multiple mutate based on a single condition.
For example, I would have a dataframe like this:
df <- tibble(cfr = c("IRL000I12572", "ESP000023522", "ESP000023194"),
             vessel_name = c("RACHEL JAY", "ALAKRANTXU", "DONIENE"),
             length = c(NA, NA, 109.30),
             tonnage = c(NA, NA, 3507.00),
             power = c(NA, NA, 7149.05))
I'd like to manually assign a set of values to length, tonnage, and power when cfr == IRL000I12572, another set of values when cfr == ESP000023522, and keep the given values when cfr == ESP000023194.
Right now, I'm doing it using either an ifelse() or case_when() statement in my mutate(), but I end up with three lines per cfr (and I have many)...
For example:
df <- df %>%
  mutate(length = ifelse(cfr == "IRL000I12572", 22.5, length),
         tonnage = ifelse(cfr == "IRL000I12572", 153.00, tonnage),
         power = ifelse(cfr == "IRL000I12572", 370, power))
Is there a way to 'condense' the statement and have only one per cfr value, to assign the three different length, tonnage, and power values in one row?
Thanks!
You can use rows_update() from dplyr. Note that this is marked as an experimental function, so use at your own risk!
library(dplyr)
df <- tibble(cfr = c("IRL000I12572", "ESP000023522", "ESP000023194"),
             vessel_name = c("RACHEL JAY", "ALAKRANTXU", "DONIENE"),
             length = c(NA, NA, 109.30),
             tonnage = c(NA, NA, 3507.00),
             power = c(NA, NA, 7149.05))

df_update <- tibble(cfr = "IRL000I12572",
                    length = 22.5,
                    tonnage = 153.00,
                    power = 370)

df %>%
  rows_update(df_update, by = "cfr")
# A tibble: 3 x 5
cfr vessel_name length tonnage power
<chr> <chr> <dbl> <dbl> <dbl>
1 IRL000I12572 RACHEL JAY 22.5 153 370
2 ESP000023522 ALAKRANTXU NA NA NA
3 ESP000023194 DONIENE 109. 3507 7149.
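Since rows_update() matches rows on the key column, the update table can hold one row per cfr that needs new values, so a single call covers all of them. A small sketch (the values for ESP000023522 are made up purely for illustration):
df_update <- tibble(cfr = c("IRL000I12572", "ESP000023522"),
                    length = c(22.5, 30.0),      # second row: hypothetical values
                    tonnage = c(153.00, 410.0),
                    power = c(370, 520))

df %>%
  rows_update(df_update, by = "cfr")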
You can also make use of across() to pull from a reference list (or vector), but this would require a separate reference list (or some other lookup mechanism) per cfr value.
x <- list(length = 22.5,
          tonnage = 153.00,
          power = 370)

df %>%
  mutate(across(all_of(names(x)), ~ ifelse(cfr == "IRL000I12572", x[[cur_column()]], .)))
In base R you could do:
df[df$cfr == "IRL000I12572", -c(1:2)] <- list(22.5, 153.00, 370)
So that
df
#> # A tibble: 3 x 5
#> cfr vessel_name length tonnage power
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 IRL000I12572 RACHEL JAY 22.5 153 370
#> 2 ESP000023522 ALAKRANTXU NA NA NA
#> 3 ESP000023194 DONIENE 109. 3507 7149.
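A slightly more defensive variant of the base R assignment (not from the original answer) names the target columns explicitly instead of dropping the first two by position:
df[df$cfr == "IRL000I12572", c("length", "tonnage", "power")] <- list(22.5, 153.00, 370)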

Remove rows with two conditions in R

I have the following dataset:
df <- structure(list(Data = structure(c(1623888000, 1629158400, 1629158400
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Client = c("Client1",
"Client1", "Client1"), Fund = c("Fund1", "Fund1", "Fund2"), Nature = c("Application",
"Rescue", "Application"), Quantity = c(433.059697, 0, 171.546757
), Value = c(69800, -70305.67, 24875), `NAV Yesterday` = c(162.40991399996,
162.40991399996, 145.044589000056), `NAV in Application Date` = c(161.178702344125,
162.346370458944, 145.004198476337), `Var NAV` = c(0.00763879866215962,
0.00039140721678275, 0.000278547270652531), `Var * Value` = c(533.188146618741,
-27.5181466187465, 6.92886335748171), FinalValue = c(70333.1881466187,
-70333.1881466187, 24881.9288633575), `Rentability WRONG` = c(0.0210345899274819,
0.0210345899274819, 0.0210345899274819)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
What I need to do is:
If Quantity = 0 in a row, then remove all rows with the same Fund name as that one, but remove only the rows that have Date <= the Date of the row where Quantity = 0.
What I did here is:
I grouped the data by Fund
Arranged each group by Data
Created a column zero_point that assigns 1 to the row where Quantity == 0 and NA otherwise
Filled the fields in zero_point that come before the actual "zero point" with the same value.
Filtered those rows out.
library(dplyr)
library(tidyr)

output <- df %>%
  group_by(Fund) %>%
  arrange(Data) %>%
  mutate(zero_point = case_when(Quantity == 0 ~ 1)) %>%
  fill(zero_point, .direction = "up") %>%
  filter(is.na(zero_point))
(On the condition that there is only one instance where Quantity is 0 per Fund group)
You can try:
library(dplyr)

df %>%
  filter({
    # Row index where Quantity == 0
    inds = which(Quantity == 0)
    # Drop rows where the Data value is less than or equal to the Data value
    # at Quantity == 0 and the Fund is the same as the one at Quantity == 0.
    !(Data <= Data[inds] & Fund %in% Fund[inds])
  })
Here's a thought:
df %>%
  group_by(Fund) %>%
  filter(!any(Quantity == 0) | Data <= Data[which.min(Quantity)])
# # A tibble: 3 x 12
# # Groups: Fund [2]
# Data Client Fund Nature Quantity Value `NAV Yesterday` `NAV in Applica~ `Var NAV` `Var * Value` FinalValue `Rentability WR~
# <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2021-06-17 00:00:00 Clien~ Fund1 Appli~ 433. 69800 162. 161. 0.00764 533. 70333. 0.0210
# 2 2021-08-17 00:00:00 Clien~ Fund1 Rescue 0 -70306. 162. 162. 0.000391 -27.5 -70333. 0.0210
# 3 2021-08-17 00:00:00 Clien~ Fund2 Appli~ 172. 24875 145. 145. 0.000279 6.93 24882. 0.0210
I'm assuming you meant "Data <= Data of the Quantity = 0 Fund", therefore using Data instead of Date (not found) or NAV in Application Date.
This filters out nothing in this sample data; I'm hoping the logic is correct.
Testing for equality with floating-point (numeric) values can be problematic at times (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). If you have some small near-zero numbers, then this will silently produce counter-intuitive results without warning or error. You might be more defensive and use something like:
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)])
or even
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) |
           row_number() == which.min(Quantity) |
           Data < Data[which.min(Quantity)])
While the latter is a bit paranoid (and double-calculates which.min(.)), it should not succumb to problems with equality tests.
The only time this will fail is if all(is.na(Quantity)); that is, which.min(c(NA, NA)) returns integer(0), which will cause an error in dplyr::filter. One might choose to add a safeguard with something like filter(any(!is.na(Quantity)) & (...)).
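One way to make that safeguard concrete without relying on & (which is vectorised and evaluates both sides) is a per-group if() guard; a minimal sketch:
df %>%
  group_by(Fund) %>%
  filter(
    # Keep groups whose Quantity is entirely NA untouched,
    # otherwise apply the same condition as above.
    if (all(is.na(Quantity))) TRUE
    else all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)]
  )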

map over every combination of lat/lon values and store in a matrix the distance

I am trying to compute some distances for each combination of lat/lon values I have.
The first data frame looks like:
lon lat NOMVIAL
<dbl> <dbl> <chr>
1 -99.1 19.5 Tepozanes
2 -99.0 19.3 Bartolomé Díaz de León
3 -99.2 19.3 Renato Leduc
4 -99.2 19.2 Cuautlalpan
The second data frame looks like:
CVEGEO mean_lat mean_lon
<int> <dbl> <dbl>
1 90130143 19.2 -99.1
2 90130234 19.2 -99.0
3 90090300 19.2 -99.0
So I want to take each row in df2 and compute the distances to each row in df1, storing the results as a matrix. I can compute the distances for a single lat/lon value using:
df1 %>%
  add_column(
    M_lat = -99.183203,
    M_long = 19.506582
  ) %>%
  mutate(
    Distance = geosphere::distHaversine(cbind(lon, lat), cbind(M_lat, M_long))
  )
However, I would like to use map and store the results as a matrix, as below.
Expected output:
90020001 90030001 90040001 90040010 90040020
Tepozanes 999 111 ... ... ...
Renato Leduc
Samahil
...
Primera ... ... ... ... ...
Where the column names come from the CVEGEO column in df2 and the row names come from the NOMVIAL column in df1. The values are the computed distances.
Data:
df1 <- structure(list(lon = c(-99.12587729, -99.03630014, -99.16578649,
-99.18373215, -99.21312146, -99.29082258, -99.19958018, -99.05745354,
-99.09046923, -99.04686154), lat = c(19.543991, 19.2921902, 19.29272965,
19.2346386, 19.29264198, 19.32628302, 19.29913009, 19.38650317,
19.47120537, 19.31618134), NOMVIAL = c("Tepozanes", "Bartolomé Díaz de León",
"Renato Leduc", "Cuautlalpan", "Samahil", "Ninguno", "Monte de Sueve",
"Ninguno", "Rinoceronte", "Primera")), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(CVEGEO = c(90130143L, 90130234L, 90090300L, 90130284L,
90130290L), mean_lat = c(19.2256141377143, 19.2447500775758,
19.209320585524, 19.2219817711111, 19.2405991752941), mean_lon = c(-99.143825052,
-99.0409973439394, -98.9713545799563, -99.1172106433333, -99.1347260164706
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
EDIT:
To clarify a little further:
I would like to take each row of df2 and use this to compute the distances of all rows in df1.
Take row 1 in df2
CVEGEO mean_lat mean_lon
90130143 19.22561 -99.14383
Take these values and compute the distance for all of the rows in df1 - so for each of the 10 rows in df1 I will have a distance computed based on this row in df2. Then move to row 2 of df2 and do the same again...
So this matrix will have dimensions 5 columns and 10 rows.
You can use the following solution. Note that you have to swap the mean_lat and mean_lon columns in your second data frame so that mean_lon comes first, because distHaversine() expects coordinates in (lon, lat) order; otherwise lon and lat are exchanged and you will get an error.
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
# df4 corresponds to df1; df5 is df2 with mean_lon placed before mean_lat
df4 <- df1
df5 <- df2 %>% select(CVEGEO, mean_lon, mean_lat)

map2(df4$lon, df4$lat, ~ df5 %>%
       rowwise() %>%
       mutate(output = geosphere::distHaversine(c(.x, .y), c_across(mean_lon:mean_lat)))) %>%
  set_names(df4$NOMVIAL) %>%
  map(~ .x %>%
        select(CVEGEO, output) %>%
        pivot_wider(names_from = CVEGEO, values_from = output)) %>%
  bind_rows() %>%
  rownames_to_column() %>%
  mutate(rowname = df4$NOMVIAL)
# A tibble: 10 x 6
rowname `90130143` `90130234` `90090300` `90130284` `90130290`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tepozanes 35492. 34483. 40636. 35857. 33786.
2 Bartolomé Díaz de León 13513. 5304. 11476. 11549. 11831.
3 Renato Leduc 7820. 14159. 22444. 9385. 6658.
4 Cuautlalpan 4313. 15044. 22501. 7133. 5193.
5 Samahil 10426. 18857. 27048. 12785. 10071.
6 Ninguno 19083. 27775. 36007. 21625. 18974.
7 Monte de Sueve 10065. 17730. 25985. 12194. 9429.
8 Ninguno 20078. 15874. 21699. 19361. 18158.
9 Rinoceronte 27908. 25739. 31724. 27885. 26088.
10 Primera 14334. 7976. 14299. 12830. 12491.
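As an aside (not part of the original answer), geosphere::distm() computes all pairwise distances in one call and returns exactly the 10 x 5 matrix asked for in the question; a minimal sketch:
library(geosphere)

# distm() takes two sets of points, each given as (lon, lat),
# and returns the full distance matrix between them.
m <- distm(cbind(df1$lon, df1$lat),
           cbind(df2$mean_lon, df2$mean_lat),
           fun = distHaversine)
dimnames(m) <- list(df1$NOMVIAL, df2$CVEGEO)
m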

dplyr, purrr, dynamically generate/calculate new columns in R

I have the following problem. I have a data frame/tibble with (a lot of) columns that represent a value in different years, e.g. the number of inhabitants in a city at different points in time. I now want to generate columns that give me the growth rate (see the screenshots below). It should be something like using mutate() while looping over the columns. I think this should be a common task, but I can't find any hint on how to do it.
Edit:
A minimal example could look like this:
## Minimal example
library(tidyverse)
## Given data frame
df <- tibble(
  City = c("Melbourne", "Sydney", "Adelaide"),
  year_2000 = c(100, 100, 205),
  year_2001 = c(101, 100, 207),
  year_2002 = c(102, 100, 209)
)
## Result
df <- df %>%
  mutate(
    gr_2000_2001 = year_2001 / year_2000 * 100 - 100,
    gr_2001_2002 = year_2002 / year_2001 * 100 - 100
  )
I want to find a way to automate the mutate() step in a smart way, as I have to do it for 150 years of data.
[Screenshots of the desired wide-format input and output tables omitted.]
The easiest way in this example would probably be to make your data tidy and then apply whatever formula you use to calculate growth rates, using dplyr's lag() function on a data frame grouped by City:
## Minimal example
library(tidyverse)
df <- data.frame(City = c("Melbourne", "Sydney"),
                 year_2000 = c(100, 100),
                 year_2001 = c(101, 100),
                 year_2002 = c(102, 102))

df %>%
  gather(year, value, 2:4) %>%
  group_by(City) %>%
  mutate(growth = value / dplyr::lag(value, n = 1))
The result is this:
# A tibble: 6 x 4
# Groups: City [2]
City year value growth
<fct> <chr> <dbl> <dbl>
1 Melbourne year_2000 100 NA
2 Sydney year_2000 100 NA
3 Melbourne year_2001 101 1.01
4 Sydney year_2001 100 1
5 Melbourne year_2002 102 1.01
6 Sydney year_2002 102 1.02
If you absolutely need the data in the format you provided in the screenshots, you can then apply spread() to reshape it into the original format. This is not generally recommended, however.
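If the wide gr_ columns from the question are really what is needed, one possible sketch (using pivot_longer()/pivot_wider(), the newer replacements for gather()/spread(), and the percentage formula from the question) is:
df %>%
  pivot_longer(-City, names_to = "year", names_prefix = "year_",
               values_to = "value") %>%
  group_by(City) %>%
  mutate(gr = value / dplyr::lag(value) * 100 - 100,
         period = paste0("gr_", dplyr::lag(year), "_", year)) %>%
  filter(!is.na(gr)) %>%
  select(City, period, gr) %>%
  pivot_wider(names_from = period, values_from = gr)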

R GGplot2 Stacked Columns Chart

I am trying to make a stacked columns chart in R. Sorry, but I am still learning; that's why I need help.
This is how I have the data:
structure(list(Category = structure(c(2L, 3L, 4L, 1L), .Label = c("MLC1000",
"MLC1051", "MLC1648", "MLC5726"), class = "factor"), Minutes = c(2751698L,
2478850L, 556802L, 2892097L), Items = c(684L, 607L, 135L, 711L
), Visits = c(130293L, 65282L, 25484L, 81216L), Sold = c(2625L,
1093L, 681L, 1802L)), .Names = c("Category", "Minutes", "Items",
"Visits", "Sold"), class = "data.frame", row.names = c(NA, -4L)
)
And I want to create this graphic:
I think there are two pretty basic principles that you should apply to make this problem easier to handle. First, you should make your data tidy. Second, you shouldn't leave ggplot to do your calculations for you.
library(tidyverse)
a <- data_frame(
  category = letters[1:4],
  minutes = c(2751698, 2478850, 556802, 2892097),
  visits = c(130293, 65282, 25484, 81216),
  sold = c(2625, 1093, 681, 1802)
) %>%
  gather(variable, value, -category) %>%   # make tidy
  group_by(variable) %>%
  mutate(weight = value / sum(value))       # calculate weight variable
## Source: local data frame [12 x 4]
## Groups: variable [3]
## category variable value weight
## <chr> <chr> <dbl> <dbl>
## 1 a minutes 2751698 0.31703610
## 2 b minutes 2478850 0.28559999
## 3 c minutes 556802 0.06415178
## 4 d minutes 2892097 0.33321213
## 5 a visits 130293 0.43104127
## 6 b visits 65282 0.21596890
## 7 c visits 25484 0.08430734
## 8 d visits 81216 0.26868249
## 9 a sold 2625 0.42331882
## 10 b sold 1093 0.17626189
## 11 c sold 681 0.10982100
## 12 d sold 1802 0.29059829
I don't know what was up with your structure(), but I couldn't build a data frame from it without crashing my R session.
Once we get the data into this format, the ggplot2 call is actually really easy:
ggplot(a, aes(x = variable, weight = weight * 100, fill = category)) +
geom_bar()
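An equivalent sketch uses geom_col(), which takes the precomputed weights directly as bar heights and stacks them by fill:
ggplot(a, aes(x = variable, y = weight * 100, fill = category)) +
  geom_col()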
