expanding a named list inside a dataframe - r

I have a number of JSON files in the following format
{ "year": 2019,
"numberofhomes": 480,
"meetingdate": "2019-02-09",
"votes":
[
{ "date": "2019-01-23", "votes": 39 },
{ "date": "2019-02-01", "votes": 124 },
{ "date": "2019-02-09", "votes": 164 }
]
}
While reading this in with jsonlite::read_json, the resulting column for votes is a named list.
jsonlite::read_json("sources/votes-2019.json", simplifyDataFrame = FALSE) %>%
as_tibble()
# A tibble: 3 x 4
year numberofhomes meetingdate votes
<int> <int> <chr> <list>
1 2019 480 2019-02-09 <named list [2]>
2 2019 480 2019-02-09 <named list [2]>
3 2019 480 2019-02-09 <named list [2]>
Or the alternative
jsonlite::read_json("sources/votes-2019.json", simplifyDataFrame = TRUE) %>%
as_tibble()
# A tibble: 3 x 4
year numberofhomes meetingdate votes$date $votes
<int> <int> <chr> <chr> <int>
1 2019 480 2019-02-09 2019-01-23 39
2 2019 480 2019-02-09 2019-02-01 124
3 2019 480 2019-02-09 2019-02-09 164
How can I transform the last column(s) into a normal dataframe column? Alternatively, is there a better way to read in JSON files with nested arrays?

You can use unnest_wider:
library(tibble)
jsonlite::read_json("sources/votes-2019.json", simplifyDataFrame = FALSE) %>%
as_tibble() %>%
unnest_wider(votes)

Related

Group and add variable of type stock and another type in a single step?

I want to group by district summing 'incoming' values at quarter and get the value of the 'stock' in the last quarter (3) in just one step. 'stock' can not summed through quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45.000 rows and 41 variables of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result but in three steps and I don't think it's efficient and error prone due to the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum)) %>%
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum)) %>%
final <- full_join(basea, baseb)
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
Given that the dataset only has 3 quarters and not 4. If that's not the case use nth(3) instead of last()
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
here is a data.table approach
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use-case for creating a custom function that can be QC'd and maintained easier:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as #Tom Hoel but without using a function to subset, instead just use []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205

How do I add only the last element of one vector to the last element of another vector?

I have a dataframe with hundreds of different investments (represented by the "id" column), their cashflows, and market value. The following example demonstrates the data that I'm working with:
df <- data.frame(date = c("2020-01-31", "2020-02-29", "2020-03-31", "2020-02-29", "2020-03-31", "2020-04-30", "2020-05-31"),
id = c("alpha", "alpha", "alpha", "bravo", "bravo", "bravo", "bravo"),
cashflow = c(-100,20,4,-50,8,12,8),
market_value = c(100,90,80,50,110,120,115))
I ultimately want to calculate the IRR per investment. However, before I can do that, I need to add only the last market value number to the corresponding cashflow. I don't care about any market values before that. In this case, the last cashflow for "alpha" investment must be 84 (i.e., 80 market value + 4 cashflow) and the last cashflow for "bravo" investment must be 123 (i.e., 115 market value + 8 cashflow).
Desired output:
id
cashflow
alpha
-100
alpha
20
alpha
84
bravo
-50
bravo
8
bravo
12
bravo
123
Thanks!
I'm not too sure on what final output you want but here's how you'd just take the last.
df %>%
mutate(total = cashflow + market_value) %>%
group_by(id) %>%
slice_max(order_by = date) %>%
ungroup()
#> # A tibble: 2 × 5
#> date id cashflow market_value total
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2020-03-31 alpha 4 80 84
#> 2 2020-05-31 bravo 8 115 123
Created on 2022-07-22 by the reprex package (v2.0.1)
EDIT - just seen what I think is your desired output, how's this?
df %>%
group_by(id) %>%
mutate(
cashflow = if_else(row_number() == n(), cashflow + market_value, cashflow)
)
#> # A tibble: 7 × 4
#> # Groups: id [2]
#> date id cashflow market_value
#> <chr> <chr> <dbl> <dbl>
#> 1 2020-01-31 alpha -100 100
#> 2 2020-02-29 alpha 20 90
#> 3 2020-03-31 alpha 84 80
#> 4 2020-02-29 bravo -50 50
#> 5 2020-03-31 bravo 8 110
#> 6 2020-04-30 bravo 12 120
#> 7 2020-05-31 bravo 123 115
Created on 2022-07-22 by the reprex package (v2.0.1)

rvest html_table() use second row as header

I am trying to scrape data from a table on fbref however the tables contain two headers with the subheader being incorporated into the first row of data. Does anyone know how to skip the first line and use the second line as the table header so that data types can be maintained? Here is my code below.
library(rvest)
library(dplyr)
team_link = "https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats-All-Competitions"
team_page = read_html(team_link)
shooting_table = team_page %>% html_nodes("#all_stats_shooting") %>%
html_table()
shooting_table = shooting_table[[1]]
You can use the janitor package
library(janitor)
shooting_table %>%
row_to_names(1)
Which gives us:
# A tibble: 28 × 23
Player Nation Pos Age `90s` Gls Sh SoT `SoT%` `Sh/90` `SoT/90` `G/Sh` `G/SoT` Dist FK PK
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Edouard M… sn SEN GK 29 34.0 0 0 0 "" 0.00 0.00 "" "" "" 0 0
2 Antonio R… de GER DF 28 33.7 3 48 13 "27.1" 1.42 0.39 "0.06" "0.23" "19.… 0 0
3 Thiago Si… br BRA DF 36 29.4 3 18 5 "27.8" 0.61 0.17 "0.17" "0.60" "10.… 0 0
4 Mason Mou… eng E… MF,FW 22 26.3 11 75 27 "36.0" 2.86 1.03 "0.13" "0.37" "17.… 6 1

Divide Results Into Three Groups Based On Condition And Date Check

This is one I have been having trouble with for days. I need to take my data and divide results into three groups based on conditions and date check. You can see this in the original data table that I have provided.
Table with original data
Basically, I need to do this by individual. If they fail then they have 7 days to pass. If they fail and pass within 7 days then they go in the Yes category. If they fail and then have another failure within 7 days, they go in the No category. If they have a failed result and nothing after that, then they go in the Refused category.
So, I need to test the row after a Fail for a Pass or Fail or Nothing by individual and then check that it is within 7 days.
Individuals such as Sam, since he did not take another test after the second failure, can be in multiple groups at the same time. Luke on the other hand, passed but it was after the 7 day period so they scored a refused. The final table would then look like this:
enter image description here
I have tried to use if-else statements but I don't know how to check the next row of the same individual and ignore any other rows other than the row that exist, if any, right after Fail per individual.
I don't know if this can be done in R but I appreciate any help I can get on this.
Thank you!
It is not a complete solution, but my suggestion.
Your dataset:
# A tibble: 13 x 4
name result time_1 time_2
<chr> <chr> <date> <date>
1 Joe Fail 2022-03-01 NA
2 Joe Pass NA 2022-03-05
3 Heather Fail 2022-03-21 NA
4 Heather Pass NA 2022-03-26
5 Heather Pass NA 2022-03-27
6 Heather Fail 2022-03-13 NA
7 Heather Pass NA 2022-03-17
8 Sam Fail 2022-03-20 NA
9 Sam Fail 2022-03-21 NA
10 Luke Fail 2022-03-11 NA
11 Luke Pass NA 2022-03-13
12 Luke Fail 2022-03-19 NA
13 Luke Pass NA 2022-03-29
library(lubridate)
library(tidyverse)
df_clean <- df %>%
arrange(name, result, time_1, time_2) %>%
group_by(name, result) %>%
mutate(attempt = 1:n()) %>%
unite(col = "result",
c("result", "attempt"),
sep = "_", remove = TRUE) %>%
unite(col = "time",
c("time_1", "time_2"),
sep = "", remove = TRUE) %>%
mutate(time = time %>% str_remove_all("NA") %>% as.Date()) %>%
ungroup() %>%
spread(key = result, value = time)
"Cleaned dataset":
# A tibble: 4 x 6
name Fail_1 Fail_2 Pass_1 Pass_2 Pass_3
<chr> <date> <date> <date> <date> <date>
1 Heather 2022-03-13 2022-03-21 2022-03-17 2022-03-26 2022-03-27
2 Joe 2022-03-01 NA 2022-03-05 NA NA
3 Luke 2022-03-11 2022-03-19 2022-03-13 2022-03-29 NA
4 Sam 2022-03-20 2022-03-21 NA NA NA
df_clean %>%
mutate(yes = case_when(interval(Fail_1, Pass_1) %>%
as.numeric("days") <= 7 ~ 1,
TRUE ~ 0),
refused = case_when(is.Date(Fail_1) & is.na(Pass_1) ~ 1,
TRUE ~ 0))
# A tibble: 4 x 8
name Fail_1 Fail_2 Pass_1 Pass_2 Pass_3 yes refused
<chr> <date> <date> <date> <date> <date> <dbl> <dbl>
1 Heather 2022-03-13 2022-03-21 2022-03-17 2022-03-26 2022-03-27 1 0
2 Joe 2022-03-01 NA 2022-03-05 NA NA 1 0
3 Luke 2022-03-11 2022-03-19 2022-03-13 2022-03-29 NA 1 0
4 Sam 2022-03-20 2022-03-21 NA NA NA 0 1

Extracting table data from a website using R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I want to get information from (https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html) using R.
The data is not in .csv or excel format. I am not sure where to start. I know very basic R and would welcome any help! thank you!
Presuming it's the table of data from the page you are looking for
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbl <- html_table(page)[[1]]
tbl <- as.tibble(tbl)
tbl
# A tibble: 260 x 9
`Medicinal\r\n … `Submission Numb… `Innovative Dru… Manufacturer `Drug(s) Containi… `Notice of Compl… `6 Year\r\n … `Pediatric Exte… `Data Protectio…
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 abiraterone ace… 138343 Zytiga Janssen I… N/A 2011-07-27 2017-07-27 N/A 2019-07-27
2 aclidinium bromide 157598 Tudorza Genu… AstraZeneca … Duaklir Genuair 2013-07-29 2019-07-29 N/A 2021-07-29
3 afatinib dimaleate 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01 N/A 2021-11-01
4 aflibercept 149321 Eylea Bayer Inc. N/A 2013-11-08 2019-11-08 N/A 2021-11-08
5 albiglutide 165145 Eperzan GlaxoSmithKl… N/A 2015-07-15 2021-07-15 N/A 2023-07-15
6 alectinib hydrochl… 189442 Alecensaro Hoffmann-La … N/A 2016-09-29 2022-09-29 N/A 2024-09-29
7 alirocumab 183116 Praluent Sanofi-avent… N/A 2016-04-11 2022-04-11 N/A 2024-04-11
8 alogliptin benzoate 158335 Nesina Takeda Ca… "Kazano\r\n … 2013-11-27 2019-11-27 N/A 2021-11-27
9 anthrax immune glo… 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06 Yes 2026-05-06
10 antihemophilic fac… 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22 Yes 2023-02-22
# ... with 250 more rows
To read in the 2nd/3rd/4th table on the page change the number in tbl <- html_table(page)[[1]] to the number table wish to read
You'll be able to extract this data through web scraping.
Try something like
library(rvest)
library(dplyr)
url <- "https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html"
page_html <- read_html(url)
tables <- page_html %>% html_nodes("table")
for (i in 1:length(tables)) {
table <- tables[i]
table_header <- table %>% html_nodes("thead th") %>% html_text(.) %>% trimws(.) %>% gsub("\r", "", .) %>% gsub("\n", "", .)
table_data <- matrix(ncol=length(table_header), nrow=1) %>% as.data.frame(.)
colnames(table_data) <- table_header
rows <- table %>% html_nodes("tr")
for (j in 2:length(rows)) {
table_data[j-1, ] <- rows[j] %>% html_nodes("td") %>% html_text(.) %>% trimws(.)
}
assign(paste0("table_data", i), table_data)
}
You can process them all the same way without a for loop and without using assign() (shudder). Plus, we can assign the table caption (the <h2> above each) to each table for a reference:
library(rvest)
xdf <- read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbls <- html_table(xdf, trim = TRUE)
We clean up the column names using janitor::clean_names() then find the captions, clean them up so they're proper variable names and assign them to each table:
setNames(
lapply(tbls, function(tbl) {
janitor::clean_names(tbl) %>% # CLEAN UP TABLE COLUMN NAMES
tibble::as_tibble() # solely for better printing
}),
html_nodes(xdf, "table > caption") %>% # ASSIGN THE TABLE HEADER TO THE LIST ELEMENT
html_text() %>% # BUT WE NEED TO CLEAN THEM UP FIRST
trimws() %>%
tolower() %>%
gsub("[[:punct:][:space:]]+", "_", .) %>%
gsub("_+", "_", .) %>%
make.unique(sep = "_")
) -> tbls
Now we can access them by name in the list without using the nigh-never-recommended assign() (shudder again):
tbls$products_for_human_use_active_data_protection_period
## # A tibble: 260 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abiraterone … 138343 Zytiga Janssen … N/A 2011-07-27 2017-07-27
## 2 aclidinium brom… 157598 Tudorza Gen… AstraZeneca… Duaklir Genu… 2013-07-29 2019-07-29
## 3 afatinib dimale… 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01
## 4 aflibercept 149321 Eylea Bayer In… N/A 2013-11-08 2019-11-08
## 5 albiglutide 165145 Eperzan GlaxoSmithK… N/A 2015-07-15 2021-07-15
## 6 alectinib hydro… 189442 Alecensaro Hoffmann-La… N/A 2016-09-29 2022-09-29
## 7 alirocumab 183116 Praluent Sanofi-aven… N/A 2016-04-11 2022-04-11
## 8 alogliptin benz… 158335 Nesina Takeda C… "Kazano\r\n … 2013-11-27 2019-11-27
## 9 anthrax immune … 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06
## 10 antihemophilic … 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22
## # ... with 250 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_human_use_expired_data_protection_period
## # A tibble: 92 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abatacept 98531 Orencia Bristol-Mye… N/A 2006-06-29 2012-06-29
## 2 acamprosate cal… 103287 Campral Mylan Pharm… N/A 2007-03-16 2013-03-16
## 3 alglucosidase a… 103381 Myozyme Genzyme Can… N/A 2006-08-14 2012-08-14
## 4 aliskiren hemif… 105388 Rasilez Novartis Ph… "Rasilez HCT\r\… 2007-11-14 2013-11-14
## 5 ambrisentan 113287 Volibris GlaxoSmithK… N/A 2008-03-20 2014-03-20
## 6 anidulafungin 110202 Eraxis Pfizer Cana… N/A 2007-11-14 2013-11-14
## 7 aprepitant 108483 Emend Merck Fross… "Emend Tri-Pack… 2007-08-24 2013-08-24
## 8 aripiprazole 120192 Abilify Bristol-Mye… Abilify Maintena 2009-07-09 2015-07-09
## 9 azacitidine 127108 Vidaza Celgene N/A 2009-10-23 2015-10-23
## 10 besifloxacin 123400 Besivance Bausch & … N/A 2009-10-23 2015-10-23
## # ... with 82 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_veterinary_use_active_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 afoxolaner 163768 Nexgard Merial Cana… Nexgard Spectra 2014-07-08 2020-07-08
## 2 avilamycin 156949 Surmax 100 Pre… Elanco Cana… Surmax 200 Prem… 2014-02-18 2020-02-18
## 3 cefpodoxime pro… 149164 Simplicef Zoetis Cana… N/A 2012-12-06 2018-12-06
## 4 clodronate diso… 172789 Osphos Injecti… Dechra Ltd. N/A 2015-05-06 2021-05-06
## 5 closantel sodium 180678 Flukiver Elanco Divi… N/A 2015-11-24 2021-11-24
## 6 derquantel 184844 Startect Zoetis Cana… N/A 2016-04-27 2022-04-27
## 7 dibotermin alfa… 148153 Truscient Zoetis Cana… N/A 2012-11-20 2018-11-20
## 8 fluralaner 166320 Bravecto Intervet Ca… N/A 2014-05-23 2020-05-23
## 9 gonadotropin re… 140525 Improvest Zoetis Cana… N/A 2011-06-22 2017-06-22
## 10 insulin human (… 150211 Prozinc Boehringer … N/A 2013-04-24 2019-04-24
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
tbls$products_for_veterinary_use_expired_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 acetaminophen 110139 Pracetam 20% O… Ceva Animal… N/A 2009-03-05 2015-03-05
## 2 buprenorphine h… 126077 Vetergesic Mul… Sogeval UK … N/A 2010-02-03 2016-02-03
## 3 cefovecin sodium 110061 Convenia Zoetis Cana… N/A 2007-05-30 2013-05-30
## 4 cephalexin mono… 126970 Vetolexin Vétoquinol … Cefaseptin 2010-06-24 2016-06-24
## 5 dirlotapide 110110 Slentrol Zoetis Cana… N/A 2008-08-14 2014-08-14
## 6 emamectin benzo… 109976 Slice Intervet Ca… N/A 2009-06-29 2015-06-29
## 7 emodepside 112103 / 112106… Profender Bayer Healt… N/A 2008-08-28 2014-08-28
## 8 firocoxib 110661 / 110379 Previcox Merial Cana… N/A 2007-09-28 2013-09-28
## 9 fluoxetine hydr… 109825 / 109826… Reconcile Elanco, Div… N/A 2008-03-28 2014-03-28
## 10 gamithromycin 125823 Zactran Merial Cana… N/A 2010-03-29 2016-03-29
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
There are also N/As in each we can turn into NA and there's a column drug_s_containing_the_medicinal_ingredient_variations common to each that - when an observation is not N/A - is one or more drugs separated by \r\n so we can use that to turn it into a list column that you can post-process with, say, tidyr::unnest():
lapply(tbls, function(x) {
# Make "N/A" into real NAs
x[] <- lapply(x, function(.x) ifelse(.x == "N/A", NA_character_, .x))
# The common `drug_s_containing_the_medicinal_ingredient_variations`
# column - when not N/A - has one drug per-line so we can use that
# fact to turn it into a list column which you can use `tidyr::unnest()` on
x$drug_s_containing_the_medicinal_ingredient_variations <-
lapply(x$drug_s_containing_the_medicinal_ingredient_variations, function(.x) {
strsplit(trimws(.x), "[\r\n]+")
})
x
}) -> tbls

Resources