How to check for each country if date falls within specific interval across rows? - r

Building on Check if a date is within an interval in R, we want to see if a specific event falls into a timeframe specified by another event. To give you a concrete example: For each country, did event (battle/protests/...) happen at the time of elections?
country <- c("Angola","Angola","Angola","Angola","Angola", "Benin","Benin","Benin","Benin","Benin","Benin")
event_type <- c("battle", "protests","riots", "riots", "elections","elections","protests","riots","violence","riots","elections")
event_date <- as.Date(c("2017-06-16", "2017-01-23", "2016-03-15", "2017-09-18", "2017-08-23", "2019-04-18", "2019-03-12", "2019-04-14", "2018-03-15", "2015-09-14", "2016-03-20"))
start_ecycle <- as.Date(c(NA,NA,NA,NA,"2017-05-25", "2019-01-18",NA,NA,NA,NA,"2015-12-21"))
end_ecycle <- as.Date(c(NA,NA,NA,NA,"2017-09-22","2019-05-18",NA,NA,NA,NA,"2016-04-19"))
mydata <- data.frame(country, event_type, event_date, start_ecycle, end_ecycle)
To this end, we created an interval variable
library(lubridate)
is.instant(mydata$start_ecycle); is.instant(mydata$end_ecycle)
mydata$ecycle <- interval(mydata$start_ecycle, mydata$end_ecycle)
Now, we got stuck. This is what the data.frame should look like in the end - i.e. here column G "ecycle_within" is added with 1 if event_date falls within ecycle (per country):
Any help much appreciated. Thanks!

Based on your comment about the election cycles being spread across rows, I would recommend first creating a separate dataset containing only the elections data.
You can then join that election dates table onto the events. This will, however, create a duplicate row for each combination of event and election date range.
The %within% function from lubridate can then be used to check whether an event falls within a specific election date range.
Lastly, I reduce the number of rows by filtering out the rows that correspond to election date ranges that aren't relevant.
I am more familiar with dplyr and purrr and used them to implement it below, but you should be able to do something similar with base R functions too (a minimal base R sketch follows the output at the end).
The output is close to your required output, though I'm not 100% sure why you want it structured this way.
library(tidyverse)
library(lubridate)
library(purrr)
elections <- mydata %>%
  as_tibble() %>%
  select(country, event_type, start_ecycle, end_ecycle) %>%
  filter(event_type == "elections") %>%
  mutate(election_year = year(start_ecycle)) %>%
  select(country, start_ecycle, end_ecycle, election_year)
mydata2 <- mydata %>%
  as_tibble() %>%
  mutate(row = row_number()) %>%
  select(row, country, event_type, event_date) %>%
  left_join(elections, by = "country") %>%
  mutate(ecycle = map2(start_ecycle, end_ecycle, ~ interval(.x, .y))) %>%
  mutate(ecycle_within = map2_int(event_date, ecycle, ~ .x %within% .y)) %>%
  select(-ecycle) %>%
  group_by(country, event_type, event_date) %>%
  arrange(desc(ecycle_within)) %>%
  slice(1:1) %>%
  ungroup() %>%
  arrange(row) %>%
  select(-row)
mydata2 %>% select(-election_year)
#> # A tibble: 11 x 6
#> country event_type event_date start_ecycle end_ecycle ecycle_within
#> <fct> <fct> <date> <date> <date> <int>
#> 1 Angola battle 2017-06-16 2017-05-25 2017-09-22 1
#> 2 Angola protests 2017-01-23 2017-05-25 2017-09-22 0
#> 3 Angola riots 2016-03-15 2017-05-25 2017-09-22 0
#> 4 Angola riots 2017-09-18 2017-05-25 2017-09-22 1
#> 5 Angola elections 2017-08-23 2017-05-25 2017-09-22 1
#> 6 Benin elections 2019-04-18 2019-01-18 2019-05-18 1
#> 7 Benin protests 2019-03-12 2019-01-18 2019-05-18 1
#> 8 Benin riots 2019-04-14 2019-01-18 2019-05-18 1
#> 9 Benin violence 2018-03-15 2019-01-18 2019-05-18 0
#> 10 Benin riots 2015-09-14 2019-01-18 2019-05-18 0
#> 11 Benin elections 2016-03-20 2015-12-21 2016-04-19 1
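As mentioned above, base R can manage the same join, flag and deduplicate steps. A minimal sketch (an illustration, not part of the original answer):
# Base R sketch: merge events with election windows per country, flag events
# inside a window, then keep one row per original event, preferring flagged matches.
elections_base <- subset(mydata, event_type == "elections",
                         select = c(country, start_ecycle, end_ecycle))
joined <- merge(mydata[, c("country", "event_type", "event_date")],
                elections_base, by = "country")
joined$ecycle_within <- as.integer(joined$event_date >= joined$start_ecycle &
                                     joined$event_date <= joined$end_ecycle)
joined <- joined[order(-joined$ecycle_within), ]
mydata2_base <- joined[!duplicated(joined[, c("country", "event_type", "event_date")]), ]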

Related

Cleaning an oddly structured dataframe from an excel file (any recommendations on functions also appreciated)

I'm trying to make a dataframe pulled from an excel file more user-friendly by creating a "Type" column.
The data can be found here: https://www.dmo.gov.uk/data/pdfdatareport?reportCode=D1A (direct download excel link here: https://www.dmo.gov.uk/umbraco/surface/DataExport/GetDataExport?reportCode=D1A&exportFormatValue=xls&parameters=%26COBDate%3D11%2F04%2F2011)
As you can probably see, the type of data is all grouped together in column A, like so:
What I'd like to do is change the title "Conventional Gilts" to "Name", and create a "Type" column that holds the different categories pulled from their grouped titles. In the linked file, the "Types" would be: "Ultra-Short", "Short", "Medium", "Long", "Index-linked Gilts (3-month Indexation Lag)", "Undated Gilts (non "rump")", and ""Rump" Gilts".
While I feel I would need to do some form of pattern recognition using a function like grepl, I'm not sure how I can achieve this from a 'dynamic' perspective (i.e. still working if new categories are created).
Any advice on how to achieve this (or even achieve this in a function) would be greatly appreciated.
I don't know about a single function to do all this; the data is haphazardly arranged and needs to be fixed "manually", for example:
library(readxl)
library(tidyverse)
gilts <- read_xls("C:/Users/Administrator/Documents/gilts.xls")
gilts %>%
  filter(!apply(gilts, 1, function(x) all(is.na(x)))) %>%
  filter(seq(nrow(.)) < 44) %>%
  select(1:7) %>%
  filter(seq(nrow(.)) != 1) %>%
  setNames(unlist(slice(., 1))) %>%
  filter(seq(nrow(.)) != 1) %>%
  mutate(splitter = cumsum(is.na(`ISIN Code`))) %>%
  group_by(splitter) %>%
  mutate(Type = first(`Conventional Gilts`)) %>%
  summarize(across(everything(), ~.x[-1])) %>%
  ungroup() %>%
  select(-1) %>%
  select(c(8, 1:7)) %>%
  rename(Name = `Conventional Gilts`) %>%
  mutate(across(c(4, 5, 7),
                ~ as.Date(as.numeric(.x), origin = "1899-12-30"))) %>%
  mutate(across(contains("million"), as.numeric))
#> `summarise()` has grouped output by 'splitter'. You can override using the
#> `.groups` argument.
#> # A tibble: 37 x 8
#> Type Name ISIN ~1 Redempti~2 First Is~3 Divid~4 Current/~5 Total~6
#> <chr> <chr> <chr> <date> <date> <chr> <date> <dbl>
#> 1 Ultra-Short 9% Conv~ GB0002~ 2011-07-12 1987-07-12 12 Jan~ 2011-07-01 7312.
#> 2 Ultra-Short 3¼% Tre~ GB00B3~ 2011-12-07 2008-11-14 7 Jun/~ 2011-05-26 15747
#> 3 Ultra-Short 5% Trea~ GB0030~ 2012-03-07 2001-05-25 7 Mar/~ 2011-08-26 26867.
#> 4 Ultra-Short 5¼% Tre~ GB00B1~ 2012-06-07 2007-03-16 7 Jun/~ 2011-05-26 25612.
#> 5 Ultra-Short 4½% Tre~ GB00B2~ 2013-03-07 2008-03-05 7 Mar/~ 2011-08-26 33787.
#> 6 Ultra-Short 8% Trea~ GB0008~ 2013-09-27 1993-04-01 27 Mar~ 2011-09-16 8378.
#> 7 Ultra-Short 2¼% Tre~ GB00B3~ 2014-03-07 2009-03-20 7 Mar/~ 2011-08-26 29123.
#> 8 Short 5% Trea~ GB0031~ 2014-09-07 2002-07-25 7 Mar/~ 2011-08-26 36579.
#> 9 Short 2¾% Tre~ GB00B4~ 2015-01-22 2009-11-04 22 Jan~ 2011-07-13 28181.
#> 10 Short 4¾% Tre~ GB0033~ 2015-09-07 2003-09-26 7 Mar/~ 2011-08-26 33650.
#> # ... with 27 more rows, and abbreviated variable names 1: `ISIN Code`,
#> # 2: `Redemption Date`, 3: `First Issue Date`, 4: `Dividend Dates`,
#> # 5: `Current/Next \nEx-dividend Date`,
#> # 6: `Total Amount in Issue \n(£ million nominal)`
Created on 2022-10-30 with reprex v2.0.2
Different approach, premised on the fact that all the gilts start with numbers and the types do not. Makes use of janitor which has super helpful functions for cleaning up messy imported data like this.
library(tidyverse)
library(readxl)
library(janitor)
import_gilts <- read_excel("20221031 - Gilts in Issue.xls.xls", skip = 7)
gilts <- import_gilts %>%
  filter(!str_detect(1, "^Note|^Page")) %>%
  rename(Name = `Conventional Gilts`) %>%
  remove_empty(which = "rows") %>%
  mutate(Type = case_when(str_detect(Name, "^[^0-9]") ~ Name,
                          TRUE ~ NA_character_),
         .before = Name) %>%
  fill(Type, .direction = "down") %>%
  arrange(desc(...9)) %>%
  row_to_names(row_number = 2) %>%
  rename(Type = 1,
         Name = 2) %>%
  filter(Type != Name)
Quick draft so there's certainly room for improvement.
Should be able to be turned into a function as long as the number of imported columns and number of rows to skip reading in the file stay the same.
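For illustration, a hedged sketch of that idea: the pipeline above wrapped in a helper that only parameterises the file path and the number of rows to skip, the two things assumed to stay constant between releases (the name clean_gilts and its arguments are hypothetical, and this is untested against other releases of the file):
clean_gilts <- function(path, skip = 7) {
  # same steps as above, just wrapped so new releases can be cleaned in one call
  read_excel(path, skip = skip) %>%
    filter(!str_detect(1, "^Note|^Page")) %>%
    rename(Name = `Conventional Gilts`) %>%
    remove_empty(which = "rows") %>%
    mutate(Type = case_when(str_detect(Name, "^[^0-9]") ~ Name,
                            TRUE ~ NA_character_),
           .before = Name) %>%
    fill(Type, .direction = "down") %>%
    arrange(desc(...9)) %>%
    row_to_names(row_number = 2) %>%
    rename(Type = 1, Name = 2) %>%
    filter(Type != Name)
}
gilts <- clean_gilts("20221031 - Gilts in Issue.xls.xls")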

map_df -- Argument 1 must be a data frame or a named atomic vector

I am an infectious diseases physician and have set myself the challenge of creating a dataframe with the UK's cumulative published cases of monkeypox, so I can graph it as a running tally or a choropleth map, as there is no nice dashboard for this at present.
All the data is published as html webpages rather than as a nice csv, so I am trying to scrape it all off the internet using the rvest package.
Data is only published intermittently (about twice per week), with the cumulative totals for each of the 4 home nations in the UK.
I have managed to get working code to pull data from each of the separate webpages; testing it on the first 2 pages in my mpx_gov_uk_pages list works well, giving a small example tibble:
library(tidyverse)
library(lubridate)
library(rvest)
library(janitor)
# load in overview page url which has links to each date of published cases
mpx_gov_uk_overview_page <- c("https://www.gov.uk/government/publications/monkeypox-outbreak-epidemiological-overview")
# extract urls for each date page
mpx_gov_uk_pages <- mpx_gov_uk_overview_page %>%
  read_html() %>%
  html_nodes(".govuk-link") %>%
  html_attr('href') %>%
  str_subset("\\d{1,2}-[a-z]+-\\d{4}") %>%
  paste0("https://www.gov.uk", .) %>%
  as.character()
# make table for home nations for each date
table1 <- mpx_gov_uk_pages[1] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[1], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))
table2 <- mpx_gov_uk_pages[2] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[2], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
# Combine tables
bind_rows(table1, table2)
#> # A tibble: 8 × 3
#> date area cases
#> <date> <chr> <dbl>
#> 1 2022-08-02 England 2638
#> 2 2022-08-02 Northern Ireland 24
#> 3 2022-08-02 Scotland 65
#> 4 2022-08-02 Wales 32
#> 5 2022-07-29 England 2436
#> 6 2022-07-29 Northern Ireland 19
#> 7 2022-07-29 Scotland 61
#> 8 2022-07-29 Wales 30
I want to automate this by creating a generic function and passing the list of urls to purrr::map_df, as there will be an ever-growing number of pages (there are already 13):
pull_first_table <- function(x){
  x %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    janitor::clean_names() %>%
    rename(area = starts_with(c("uk", "devolved")),
           cases = matches(c("total", "confirmed_cases"))) %>%
    separate(cases, c("cases", NA), sep = "\\s\\(") %>%
    mutate(date = dmy(str_extract({{x}}, "\\d{1,2}-[a-z]+-\\d{4}")),
           cases = as.numeric(gsub(",", "", cases))) %>%
    select(date, area, cases) %>%
    filter(!area %in% c("Total"))
}
summary_table <- map_df(mpx_gov_uk_pages, ~ pull_first_table)
Error in `dplyr::bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_error()` to see where the error occurred.
The generic function seems to work OK when I supply it with a single element, e.g. mpx_gov_uk_pages[2], but I cannot seem to get map_df to work properly even though the web scraping is producing tibbles.
All help and pointers greatly welcomed.
We just need the function itself, not a lambda expression: with ~ pull_first_table, each iteration returns the function object rather than the result of calling it, so bind_rows() is handed a list of functions instead of data frames, hence the error.
map_dfr(mpx_gov_uk_pages, pull_first_table)
Output:
# A tibble: 52 × 3
date area cases
<date> <chr> <dbl>
1 2022-08-02 England 2638
2 2022-08-02 Northern Ireland 24
3 2022-08-02 Scotland 65
4 2022-08-02 Wales 32
5 2022-07-29 England 2436
6 2022-07-29 Northern Ireland 19
7 2022-07-29 Scotland 61
8 2022-07-29 Wales 30
9 2022-07-26 England 2325
10 2022-07-26 Northern Ireland 18
# … with 42 more rows
If we do want to use a lambda expression, it must call the function:
map_dfr(mpx_gov_uk_pages, ~ pull_first_table(.x))
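As a side note (an assumption about the installed package version, not part of the answer above): with purrr >= 1.0.0, where map_dfr() is superseded, the same result can be written as
summary_table <- mpx_gov_uk_pages %>%
  map(pull_first_table) %>%   # one tibble per page
  list_rbind()                # row-bind them into a single tibble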

How to look up date value from same table based on a condition in R?

I am not sure whether this task calls for a self join or not. I am basically trying to look up, for each State.UnionTerritory in the dataframe below, the latest date on which the Daily_confirmed cases were <= half of the cases on the current date.
This will help me to get the doubling time of cases for each State on each date.
library(tidyverse)
library(lubridate)
df_ind <- read_csv("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/sample_data.csv")
df_ind %>% head()
# output
Date State.UnionTerritory Daily_confirmed
<date> <chr> <dbl>
1 2021-12-23 Haryana 46
2 2021-12-23 Maharashtra 1179
3 2021-12-23 Delhi 118
4 2021-12-22 Haryana 55
5 2021-12-22 Maharashtra 1201
6 2021-12-22 Delhi 125
For example, Delhi has 118 cases on 2021-12-23; the latest date on which Delhi had half of this or fewer is 2021-12-15 with 57 cases, so the doubling time would be 2021-12-23 - 2021-12-15 = 8 days.
so I should get something like:
This should be applied for each State in the data & on all dates.
df_ind <- df_ind %>%
mutate(Daily_confirmed_half = as.integer(Daily_confirmed / 2) )
I am not sure how exactly to get from here to the proper dates, i.e. a Doubling_Date column holding the date on which the Daily_confirmed cases last met the <= half condition.
I can group, summarize and use first() to pull the latest dates, but I am not sure what the efficient way would be to bring the right dates into another column of this data frame.
I'm not sure if this is your desired output. The approach uses a full_join together with filter to simulate a non-equi join in dplyr. Then we do some data cleaning, and in the last step we need a left_join back to the original data, since there are quite a few days for which we cannot calculate the doubling date because it is not included in the time series.
library(tidyverse)
library(lubridate)
df_ind %>%
  group_by(State.UnionTerritory) %>%
  full_join(., ., by = c("State.UnionTerritory")) %>%
  filter(Date.x > Date.y,
         Daily_confirmed.x > (Daily_confirmed.y * 2)) %>%
  group_by(Date.x, State.UnionTerritory) %>%
  filter(Date.y == max(Date.y)) %>%
  filter(Daily_confirmed.y == max(Daily_confirmed.y)) %>%
  rename("Date" = Date.x,
         "Daily_confirmed" = Daily_confirmed.x,
         "Doubling_Date" = Date.y) %>%
  select(-Daily_confirmed.y) %>%
  mutate(Day_to_double = Date - Doubling_Date) %>%
  left_join(df_ind,
            .,
            by = c("Date", "State.UnionTerritory", "Daily_confirmed")) %>%
  arrange(State.UnionTerritory, desc(Date))
#> # A tibble: 252 x 5
#> Date State.UnionTerritory Daily_confirmed Doubling_Date Day_to_double
#> <date> <chr> <dbl> <date> <drtn>
#> 1 2021-12-23 Delhi 118 2021-12-15 8 days
#> 2 2021-12-22 Delhi 125 2021-12-15 7 days
#> 3 2021-12-21 Delhi 102 2021-12-14 7 days
#> 4 2021-12-20 Delhi 91 2021-12-14 6 days
#> 5 2021-12-19 Delhi 107 2021-12-14 5 days
#> 6 2021-12-18 Delhi 86 2021-12-13 5 days
#> 7 2021-12-17 Delhi 69 2021-12-13 4 days
#> 8 2021-12-16 Delhi 85 2021-12-13 3 days
#> 9 2021-12-15 Delhi 57 2021-11-27 18 days
#> 10 2021-12-14 Delhi 45 2021-11-15 29 days
#> # … with 242 more rows
Created on 2021-12-25 by the reprex package (v0.3.0)
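As a side note (an assumption about the installed dplyr version, not part of the answer above): with dplyr >= 1.1.0 the full_join plus filter can be replaced by a real non-equi join via join_by(). A rough sketch, with ties broken arbitrarily and the lookup/helper column names invented for illustration:
lookup <- df_ind %>%
  rename(Doubling_Date = Date, earlier_confirmed = Daily_confirmed)
df_ind %>%
  mutate(half = Daily_confirmed / 2) %>%
  left_join(lookup,
            join_by(State.UnionTerritory, Date > Doubling_Date, half > earlier_confirmed)) %>%
  group_by(State.UnionTerritory, Date, Daily_confirmed) %>%
  slice_max(Doubling_Date, n = 1, with_ties = FALSE) %>%   # latest qualifying earlier date
  ungroup() %>%
  mutate(Day_to_double = Date - Doubling_Date) %>%
  select(-half, -earlier_confirmed)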

aggregation of the region's values in the dataset

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
I read the dataset in as above.
Can we find the day with the fewest deaths in the Asia region?
The important thing here is the sum of deaths across all countries in the Asia region; the days should then be sorted to find that one.
Desired output, something like:
date region death
2020/02/17 asia 6300 (Asia region sum)
The numbers in this output are made up as an example; they are not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
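To make the differencing step concrete: prepending the first cumulative value keeps the daily vector the same length as the cumulative one, so it can sit alongside it in the same tibble. A tiny worked example:
cum_deaths <- c(17, 18, 26, 42)
c(first(cum_deaths), diff(cum_deaths))
#> [1] 17  1  8 16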
Using the dplyr package for the data manipulation:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>%
  group_by(region, day) %>%                      # grouping by region and day
  summarise(death = sum(death)) %>%              # summing within the groups
  filter(region == "Asia", death == min(death))  # keeping only the minimum for Asia
Then you have:
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17
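Note that, since the death counts are cumulative (as the first answer points out), the minimum of the raw values tends to fall on the earliest day. A hedged sketch combining the two answers (difference per day first, then take the minimum of the daily totals):
df %>%
  filter(region == "Asia") %>%
  mutate(day = as.Date(day)) %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%            # cumulative Asia total per day
  arrange(day) %>%
  mutate(new_deaths = c(first(deaths), diff(deaths))) %>%  # daily new deaths
  filter(new_deaths == min(new_deaths))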

Create ID for specific sequence of consecutive days based on grouping variable in R

For a list of events at the country-day level, we would like to create a unique ID for a sequence of consecutive days in a specific country (if two or more days of events in a country are consecutive --> create unique ID), so that I can ultimately reduce the data frame to specific sequences of events rather than event days.
I did not manage to aggregate the data based on the sequence of events. I believe this response is similar (Creating groups of consecutive days meeting a given criteria), but it is in SQL.
The data has the following format:
country <- c("Angola","Angola","Angola","Angola","Angola", "Benin","Benin","Benin","Benin","Benin","Benin")
event_date <- as.Date(c("2017-06-16", "2017-06-17", "2017-06-18", "2017-08-22", "2017-08-23", "2019-04-18", "2019-04-19", "2019-04-20", "2018-03-15", "2018-03-16", "2016-03-17"))
mydata <- data.frame(country, event_date)
In the output, I expect to have a new column with the ID that is unique to each series of events in a country:
seq.ID <- c(1,1,1,2,2,3,3,3,4,4,4)
mydata2 <- data.frame(country, event_date, seq.ID)
So that ultimately, I can reduce the data to the level of country and sequence of events:
mydata3 <- mydata2[!duplicated(mydata2$seq.ID),]
Try:
library(dplyr)
mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L))
Output:
# A tibble: 5 x 2
# Groups: country [2]
seq.ID country
<int> <fct>
1 1 Angola
2 2 Angola
3 1 Benin
4 2 Benin
5 3 Benin
You can also use the .keep_all argument in distinct and preserve the first date of each sequence:
mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L),
           .keep_all = TRUE)
# A tibble: 5 x 3
# Groups: country [2]
country event_date seq.ID
<fct> <date> <int>
1 Angola 2017-06-16 1
2 Angola 2017-08-22 2
3 Benin 2019-04-18 1
4 Benin 2018-03-15 2
5 Benin 2016-03-17 3
In case of desired non-aggregated output with different sequence IDs, you could do:
mydata %>%
  mutate(
    seq.ID = cumsum(
      (event_date != lag(event_date, default = first(event_date)) + 1L) |
        country != lag(country, default = first(country))
    )
  )
country event_date seq.ID
1 Angola 2017-06-16 1
2 Angola 2017-06-17 1
3 Angola 2017-06-18 1
4 Angola 2017-08-22 2
5 Angola 2017-08-23 2
6 Benin 2019-04-18 3
7 Benin 2019-04-19 3
8 Benin 2019-04-20 3
9 Benin 2018-03-15 4
10 Benin 2018-03-16 4
11 Benin 2016-03-17 5
Note that there is a typo in your last event_date; this is why the outputs don't correspond 100% to your desired output.
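For completeness, a hedged data.table sketch of the same idea (not from the answer above; like the dplyr code, it assumes that consecutive days appear on adjacent rows within each country):
library(data.table)
dt <- as.data.table(mydata)
dt[, new_run := c(TRUE, diff(event_date) != 1), by = country]  # TRUE where a new run starts
dt[, seq.ID := cumsum(new_run)][, new_run := NULL]             # IDs are unique across countries
dt[!duplicated(seq.ID)]                                        # one row per sequence of events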
