How can I reshape data from long to wide - r

** Sample data added after comment**
What I have:
pmts <- data.frame(stringsAsFactors=FALSE,
name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
pmt_date = c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17")
)
#> name pmt_amount pmt_date
#> 1 johndoe 550 9/1/16
#> 2 johndoe 550 11/1/16
#> 3 janedoe 995 12/15/16
#> 4 foo 375 1/5/17
#> 5 foo 375 3/5/17
#> 6 foo 375 5/5/17
What I am looking to achieve:
read.table(header = T, text =
"name pmt_amount first_pmt second_pmt third_pmt
johndoe 550 9/1/16 11/1/16 NA
janedoe 995 12/15/16 NA NA
foo 375 1/5/17 3/5/17 5/5/17"
)
#> name pmt_amount first_pmt second_pmt third_pmt
#> 1 johndoe 550 9/1/16 11/1/16 <NA>
#> 2 janedoe 995 12/15/16 <NA> <NA>
#> 3 foo 375 1/5/17 3/5/17 5/5/17
** End of update**
I have a large dataset with payment information for different products. Some of these products have a pay-in-full option as well as two-pay and three-pay options. I need to create fields named First_Payment, Second_Payment, and Third_Payment, populated with NA in the respective fields if there were only one or two payments.
I've tried a couple of options and the best workaround I have thus far is this:
pmts %>%
group_by(Email, Name, Amount, Form.Title) %>%
summarise(First_Payment = min(Payment.Date),
Second_Payment = median(Payment.Date),
Last_Payment = max(Payment.Date)) -> pmts
This obviously is not ideal, as it makes up a payment date for the 2-pay plans, and I would have to instruct the end user to ignore this field and just look at the 1st and 3rd fields.
I also tried to summarise with partial sorts like this:
n <- length(pmts$Payment.Date)
sort(pmts$Payment.Date,partial=n-1)[n-1]
However, if there weren't three payments for the person, it would take the n-1 date from the entire data set and apply it to all other fields.
Ideally, if it was a pay-in-full, the First_Payment field would have the date and the 2nd/3rd fields would be NA. The 2-pay would have the 1st and 2nd dates and the 3rd field would be NA. And finally the 3-pay would have all 3 dates.
The end users here are not super data savvy so I'm trying to make this as easy to interpret as possible. Any suggestions would be tremendously appreciated. Thank you!

Using data.table this is a simple one-liner
library(data.table) #v1.9.8+
dcast(setDT(pmts), name + pmt_amount ~ rowid(pmt_amount))
# Using 'pmt_date' as value column. Use 'value.var' to override
# name pmt_amount 1 2 3
# 1: foo 375 1/5/17 3/5/17 5/5/17
# 2: janedoe 995 12/15/16 NA NA
# 3: johndoe 550 9/1/16 11/1/16 NA
dcast converts from long to wide and it accepts expressions. rowid is just adding a row counter per pmt_amount.
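If you also want the column names from the desired output rather than 1/2/3, you can pass value.var explicitly and rename afterwards; a small sketch, assuming at most three payments per group:
wide <- dcast(setDT(pmts), name + pmt_amount ~ rowid(pmt_amount), value.var = "pmt_date")
setnames(wide, old = c("1", "2", "3"), new = c("first_pmt", "second_pmt", "third_pmt"))
wide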

You can use tidyr for this.
library(dplyr)
library(tidyr)
pmts <- tibble(
name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
pmt_date = lubridate::mdy(c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17"))
)
pmts
#> # A tibble: 6 x 3
#> name pmt_amount pmt_date
#> <chr> <int> <date>
#> 1 johndoe 550 2016-09-01
#> 2 johndoe 550 2016-11-01
#> 3 janedoe 995 2016-12-15
#> 4 foo 375 2017-01-05
#> 5 foo 375 2017-03-05
#> 6 foo 375 2017-05-05
pmts_long <- pmts %>%
group_by(name) %>%
arrange(name, pmt_date) %>%
mutate(pmt = row_number()) %>%
ungroup() %>%
complete(name, nesting(pmt)) %>%
fill(pmt_amount, .direction = "down")
pmts_long
#> # A tibble: 9 x 4
#> name pmt pmt_amount pmt_date
#> <chr> <int> <int> <date>
#> 1 foo 1 375 2017-01-05
#> 2 foo 2 375 2017-03-05
#> 3 foo 3 375 2017-05-05
#> 4 janedoe 1 995 2016-12-15
#> 5 janedoe 2 995 NA
#> 6 janedoe 3 995 NA
#> 7 johndoe 1 550 2016-09-01
#> 8 johndoe 2 550 2016-11-01
#> 9 johndoe 3 550 NA
pmts_wide <- pmts_long %>%
gather("key", "val", -name, -pmt_amount, -pmt) %>%
unite(pmt_number, key, pmt) %>%
spread(pmt_number, val)
pmts_wide
#> # A tibble: 3 x 5
#> name pmt_amount pmt_date_1 pmt_date_2 pmt_date_3
#> * <chr> <int> <date> <date> <date>
#> 1 foo 375 2017-01-05 2017-03-05 2017-05-05
#> 2 janedoe 995 2016-12-15 NA NA
#> 3 johndoe 550 2016-09-01 2016-11-01 NA
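For completeness, with tidyr 1.0 or later the gather/unite/spread steps can be replaced by pivot_wider(); a shorter sketch assuming the same pmts tibble (missing payments come out as NA automatically):
pmts %>%
group_by(name, pmt_amount) %>%
arrange(pmt_date, .by_group = TRUE) %>%
mutate(pmt = row_number()) %>%
ungroup() %>%
tidyr::pivot_wider(names_from = pmt, values_from = pmt_date, names_prefix = "pmt_date_")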

Related

dplyr arrange is not working while order is fine

I am trying to obtain the 10 largest investors in a country, but I get confusing results when using arrange in dplyr versus order in base R.
head(fdi_partner)
gives the following result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result (the rows are not reordered):
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code works fine with base R:
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what I am doing wrong?
dplyr (and tidyverse packages more generally) expect bare, unquoted variable names. arrange("My variable") sorts by the constant string "My variable", so the rows come back unchanged. If your variable name has a space in it, wrap it in backticks rather than quotes:
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
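As a follow-up, to get the 10 largest investors by number of projects you can combine the backticked name with desc(); a sketch using the renamed columns from the question (dropping the TOTAL row is an assumption about how the data should be treated):
library(dplyr)
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
filter(`Main counterparts` != "TOTAL") %>%
arrange(desc(`Number of projects`)) %>%
head(10)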

Remove duplicates based on multiple conditions

I have some individuals that are listed twice because they have received numerous degrees. I am trying to only get the rows with the latest degree granting date. Below are examples of the current output and the desired output
people  | g_date     | wage | quarter
personA | 2009-01-01 | 100  | 20201
personA | 2009-01-01 | 100  | 20202
personA | 2010-01-01 | 100  | 20201
personA | 2010-01-01 | 100  | 20202
personB | 2012-01-01 | 50   | 20201
personB | 2012-01-01 | 50   | 20202
personB | 2012-01-01 | 50   | 20203
Desired output
people  | g_date     | wage | quarter
personA | 2010-01-01 | 100  | 20201
personA | 2010-01-01 | 100  | 20202
personB | 2012-01-01 | 50   | 20201
personB | 2012-01-01 | 50   | 20202
personB | 2012-01-01 | 50   | 20203
I have used the code that is below but it is removing all the rows so that there is only one row per person.
df<-df[order(df$g_date),]
df<-df[!duplicated(df$people, fromLast = TRUE),]
Another option is using group_by with slice_max ordered by g_date, like this:
library(dplyr)
df %>%
group_by(people, quarter) %>%
slice_max(order_by = g_date, n = 1)
#> # A tibble: 5 × 4
#> # Groups: people, quarter [5]
#> people g_date wage quarter
#> <chr> <chr> <dbl> <int>
#> 1 personA 2010-01-01 100 20201
#> 2 personA 2010-01-01 100 20202
#> 3 personB 2012-01-01 50 20201
#> 4 personB 2012-01-01 50 20202
#> 5 personB 2012-01-01 50 20203
Created on 2022-12-15 with reprex v2.0.2
A base R alternative: aggregate() finds each person's latest g_date, and merge() keeps only the rows that match it:
merge(df, aggregate(. ~ people, df[1:2], max))
#> people g_date wage quarter
#> 1 personA 2010-01-01 100 20201
#> 2 personA 2010-01-01 100 20202
#> 3 personB 2012-01-01 50 20201
#> 4 personB 2012-01-01 50 20202
#> 5 personB 2012-01-01 50 20203
Update (thanks to #Villalba, removed first answer):
We could first group and arrange, then filter:
library(dplyr)
library(lubridate)
df %>%
group_by(people, quarter) %>%
mutate(g_date = ymd(g_date)) %>%
arrange(g_date, .by_group = TRUE) %>%
filter(row_number()==n())
people g_date wage quarter
<chr> <date> <int> <int>
1 personA 2010-01-01 100 20201
2 personA 2010-01-01 100 20202
3 personB 2012-01-01 50 20201
4 personB 2012-01-01 50 20202
5 personB 2012-01-01 50 20203
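A data.table sketch of the same idea, ordering by g_date and keeping the last row per people/quarter group (assuming g_date sorts correctly as text, as it does in YYYY-MM-DD format):
library(data.table)
setDT(df)[order(g_date), .SD[.N], by = .(people, quarter)]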

Group and add variable of type stock and another type in a single step?

I want to group by district, summing the 'incoming' values across quarters and getting the value of 'stock' in the last quarter (3), in just one step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45,000 rows and 41 variables, of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result, but only in three steps, and I think that approach is inefficient and error prone given the size of the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum))
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb)
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes the dataset has exactly 3 quarters and not 4. If that's not the case, use nth(stock, 3) instead of last(stock):
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use case for creating a custom function that can be QC'd and maintained more easily:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as #Tom Hoel's, but using [ to subset instead of a helper function:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
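Since the real data has 8 stock-type columns among 41 variables, the same pattern scales with across(); a sketch where stock_vars and flow_vars are hypothetical character vectors holding the column names of each type:
stock_vars <- c("stock") # replace with the 8 stock-type column names
flow_vars <- c("incoming") # replace with the column names that can be summed
df %>%
group_by(district) %>%
summarise(across(all_of(stock_vars), last),
across(all_of(flow_vars), sum))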

Creating serial number for unique entries in R

I want to assign the same serial number to every row with the same Submission_Id within one Batch_No. Could someone please help me figure this out?
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Expected result
Sl.No <- c(1,1,1,1,2,2,2,2,2,3,3,4,1,1,1,1,1,1,1,1,1)
One way to do it is creating run-length IDs with data.table::rleid(Submission_Id), grouped by Batch_No. We can use this inside dplyr. To show this I created a tibble() with the two given vectors Submission_Id and Batch_No.
library(dplyr)
library(data.table)
dat <- tibble(Submission_Id = Submission_Id,
Batch_No = Batch_No)
dat %>%
group_by(Batch_No) %>%
mutate(S1.No = data.table::rleid(Submission_Id))
#> # A tibble: 21 x 3
#> # Groups: Batch_No [3]
#> Submission_Id Batch_No S1.No
#> <dbl> <dbl> <int>
#> 1 619295 633 1
#> 2 619295 633 1
#> 3 619295 633 1
#> 4 619295 633 1
#> 5 619296 633 2
#> 6 619296 633 2
#> 7 619296 633 2
#> 8 619296 633 2
#> 9 619296 633 2
#> 10 556921 633 3
#> # ... with 11 more rows
The original data
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Created on 2022-12-16 by the reprex package (v2.0.1)
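If you prefer to stay entirely in dplyr, consecutive_id() (available in dplyr 1.1.0 and later) plays the same role as rleid(); a sketch using the same dat tibble:
dat %>%
group_by(Batch_No) %>%
mutate(S1.No = consecutive_id(Submission_Id)) %>%
ungroup()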

Extracting table data from a website using R [closed]

I want to get information from (https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html) using R.
The data is not in .csv or excel format. I am not sure where to start. I know very basic R and would welcome any help! thank you!
Presuming it's the table of data on the page you are looking for:
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbl <- html_table(page)[[1]]
tbl <- as_tibble(tbl)
tbl
# A tibble: 260 x 9
`Medicinal\r\n … `Submission Numb… `Innovative Dru… Manufacturer `Drug(s) Containi… `Notice of Compl… `6 Year\r\n … `Pediatric Exte… `Data Protectio…
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 abiraterone ace… 138343 Zytiga Janssen I… N/A 2011-07-27 2017-07-27 N/A 2019-07-27
2 aclidinium bromide 157598 Tudorza Genu… AstraZeneca … Duaklir Genuair 2013-07-29 2019-07-29 N/A 2021-07-29
3 afatinib dimaleate 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01 N/A 2021-11-01
4 aflibercept 149321 Eylea Bayer Inc. N/A 2013-11-08 2019-11-08 N/A 2021-11-08
5 albiglutide 165145 Eperzan GlaxoSmithKl… N/A 2015-07-15 2021-07-15 N/A 2023-07-15
6 alectinib hydrochl… 189442 Alecensaro Hoffmann-La … N/A 2016-09-29 2022-09-29 N/A 2024-09-29
7 alirocumab 183116 Praluent Sanofi-avent… N/A 2016-04-11 2022-04-11 N/A 2024-04-11
8 alogliptin benzoate 158335 Nesina Takeda Ca… "Kazano\r\n … 2013-11-27 2019-11-27 N/A 2021-11-27
9 anthrax immune glo… 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06 Yes 2026-05-06
10 antihemophilic fac… 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22 Yes 2023-02-22
# ... with 250 more rows
To read the 2nd/3rd/4th table on the page, change the number in tbl <- html_table(page)[[1]] to the index of the table you wish to read.
You'll be able to extract this data through web scraping.
Try something like
library(rvest)
library(dplyr)
url <- "https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html"
page_html <- read_html(url)
tables <- page_html %>% html_nodes("table")
for (i in 1:length(tables)) {
table <- tables[i]
# column names come from the <thead> cells, with line breaks stripped
table_header <- table %>% html_nodes("thead th") %>% html_text(.) %>% trimws(.) %>% gsub("\r", "", .) %>% gsub("\n", "", .)
table_data <- matrix(ncol=length(table_header), nrow=1) %>% as.data.frame(.)
colnames(table_data) <- table_header
# one row of data per <tr>, skipping the header row
rows <- table %>% html_nodes("tr")
for (j in 2:length(rows)) {
table_data[j-1, ] <- rows[j] %>% html_nodes("td") %>% html_text(.) %>% trimws(.)
}
# saves each table as table_data1, table_data2, ...
assign(paste0("table_data", i), table_data)
}
You can process them all the same way without a for loop and without using assign() (shudder). Plus, we can assign the table caption (the heading above each table) to each table for reference:
library(rvest)
xdf <- read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbls <- html_table(xdf, trim = TRUE)
We clean up the column names using janitor::clean_names() then find the captions, clean them up so they're proper variable names and assign them to each table:
setNames(
lapply(tbls, function(tbl) {
janitor::clean_names(tbl) %>% # CLEAN UP TABLE COLUMN NAMES
tibble::as_tibble() # solely for better printing
}),
html_nodes(xdf, "table > caption") %>% # ASSIGN THE TABLE HEADER TO THE LIST ELEMENT
html_text() %>% # BUT WE NEED TO CLEAN THEM UP FIRST
trimws() %>%
tolower() %>%
gsub("[[:punct:][:space:]]+", "_", .) %>%
gsub("_+", "_", .) %>%
make.unique(sep = "_")
) -> tbls
Now we can access them by name in the list without using the nigh-never-recommended assign() (shudder again):
tbls$products_for_human_use_active_data_protection_period
## # A tibble: 260 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abiraterone … 138343 Zytiga Janssen … N/A 2011-07-27 2017-07-27
## 2 aclidinium brom… 157598 Tudorza Gen… AstraZeneca… Duaklir Genu… 2013-07-29 2019-07-29
## 3 afatinib dimale… 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01
## 4 aflibercept 149321 Eylea Bayer In… N/A 2013-11-08 2019-11-08
## 5 albiglutide 165145 Eperzan GlaxoSmithK… N/A 2015-07-15 2021-07-15
## 6 alectinib hydro… 189442 Alecensaro Hoffmann-La… N/A 2016-09-29 2022-09-29
## 7 alirocumab 183116 Praluent Sanofi-aven… N/A 2016-04-11 2022-04-11
## 8 alogliptin benz… 158335 Nesina Takeda C… "Kazano\r\n … 2013-11-27 2019-11-27
## 9 anthrax immune … 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06
## 10 antihemophilic … 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22
## # ... with 250 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_human_use_expired_data_protection_period
## # A tibble: 92 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abatacept 98531 Orencia Bristol-Mye… N/A 2006-06-29 2012-06-29
## 2 acamprosate cal… 103287 Campral Mylan Pharm… N/A 2007-03-16 2013-03-16
## 3 alglucosidase a… 103381 Myozyme Genzyme Can… N/A 2006-08-14 2012-08-14
## 4 aliskiren hemif… 105388 Rasilez Novartis Ph… "Rasilez HCT\r\… 2007-11-14 2013-11-14
## 5 ambrisentan 113287 Volibris GlaxoSmithK… N/A 2008-03-20 2014-03-20
## 6 anidulafungin 110202 Eraxis Pfizer Cana… N/A 2007-11-14 2013-11-14
## 7 aprepitant 108483 Emend Merck Fross… "Emend Tri-Pack… 2007-08-24 2013-08-24
## 8 aripiprazole 120192 Abilify Bristol-Mye… Abilify Maintena 2009-07-09 2015-07-09
## 9 azacitidine 127108 Vidaza Celgene N/A 2009-10-23 2015-10-23
## 10 besifloxacin 123400 Besivance Bausch & … N/A 2009-10-23 2015-10-23
## # ... with 82 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_veterinary_use_active_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 afoxolaner 163768 Nexgard Merial Cana… Nexgard Spectra 2014-07-08 2020-07-08
## 2 avilamycin 156949 Surmax 100 Pre… Elanco Cana… Surmax 200 Prem… 2014-02-18 2020-02-18
## 3 cefpodoxime pro… 149164 Simplicef Zoetis Cana… N/A 2012-12-06 2018-12-06
## 4 clodronate diso… 172789 Osphos Injecti… Dechra Ltd. N/A 2015-05-06 2021-05-06
## 5 closantel sodium 180678 Flukiver Elanco Divi… N/A 2015-11-24 2021-11-24
## 6 derquantel 184844 Startect Zoetis Cana… N/A 2016-04-27 2022-04-27
## 7 dibotermin alfa… 148153 Truscient Zoetis Cana… N/A 2012-11-20 2018-11-20
## 8 fluralaner 166320 Bravecto Intervet Ca… N/A 2014-05-23 2020-05-23
## 9 gonadotropin re… 140525 Improvest Zoetis Cana… N/A 2011-06-22 2017-06-22
## 10 insulin human (… 150211 Prozinc Boehringer … N/A 2013-04-24 2019-04-24
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
tbls$products_for_veterinary_use_expired_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 acetaminophen 110139 Pracetam 20% O… Ceva Animal… N/A 2009-03-05 2015-03-05
## 2 buprenorphine h… 126077 Vetergesic Mul… Sogeval UK … N/A 2010-02-03 2016-02-03
## 3 cefovecin sodium 110061 Convenia Zoetis Cana… N/A 2007-05-30 2013-05-30
## 4 cephalexin mono… 126970 Vetolexin Vétoquinol … Cefaseptin 2010-06-24 2016-06-24
## 5 dirlotapide 110110 Slentrol Zoetis Cana… N/A 2008-08-14 2014-08-14
## 6 emamectin benzo… 109976 Slice Intervet Ca… N/A 2009-06-29 2015-06-29
## 7 emodepside 112103 / 112106… Profender Bayer Healt… N/A 2008-08-28 2014-08-28
## 8 firocoxib 110661 / 110379 Previcox Merial Cana… N/A 2007-09-28 2013-09-28
## 9 fluoxetine hydr… 109825 / 109826… Reconcile Elanco, Div… N/A 2008-03-28 2014-03-28
## 10 gamithromycin 125823 Zactran Merial Cana… N/A 2010-03-29 2016-03-29
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
There are also "N/A" strings in each table that we can turn into real NA values. And there's a column common to each, drug_s_containing_the_medicinal_ingredient_variations, that - when an observation is not N/A - holds one or more drugs separated by \r\n, so we can use that fact to turn it into a list column you can post-process with, say, tidyr::unnest():
lapply(tbls, function(x) {
# Make "N/A" into real NAs
x[] <- lapply(x, function(.x) ifelse(.x == "N/A", NA_character_, .x))
# The common `drug_s_containing_the_medicinal_ingredient_variations`
# column - when not N/A - has one drug per line, so splitting on the line
# breaks turns it into a list column you can use `tidyr::unnest()` on
# (strsplit() is vectorised, so no inner lapply() is needed)
x$drug_s_containing_the_medicinal_ingredient_variations <-
strsplit(trimws(x$drug_s_containing_the_medicinal_ingredient_variations), "[\r\n]+")
x
}) -> tbls
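For example, to expand that list column back out to one row per related drug (a sketch using one of the table names produced above; rows whose value was N/A keep a single NA row):
library(tidyr)
tbls$products_for_human_use_active_data_protection_period %>%
unnest(drug_s_containing_the_medicinal_ingredient_variations)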
