Extracting table data from a website using R [closed] - r

I want to get information from (https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html) using R.
The data is not in .csv or Excel format, and I am not sure where to start. I know only very basic R and would welcome any help. Thank you!

Presuming it's the table of data on that page you are looking for:
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbl <- html_table(page)[[1]]
tbl <- as_tibble(tbl)
tbl
# A tibble: 260 x 9
`Medicinal\r\n … `Submission Numb… `Innovative Dru… Manufacturer `Drug(s) Containi… `Notice of Compl… `6 Year\r\n … `Pediatric Exte… `Data Protectio…
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 abiraterone ace… 138343 Zytiga Janssen I… N/A 2011-07-27 2017-07-27 N/A 2019-07-27
2 aclidinium bromide 157598 Tudorza Genu… AstraZeneca … Duaklir Genuair 2013-07-29 2019-07-29 N/A 2021-07-29
3 afatinib dimaleate 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01 N/A 2021-11-01
4 aflibercept 149321 Eylea Bayer Inc. N/A 2013-11-08 2019-11-08 N/A 2021-11-08
5 albiglutide 165145 Eperzan GlaxoSmithKl… N/A 2015-07-15 2021-07-15 N/A 2023-07-15
6 alectinib hydrochl… 189442 Alecensaro Hoffmann-La … N/A 2016-09-29 2022-09-29 N/A 2024-09-29
7 alirocumab 183116 Praluent Sanofi-avent… N/A 2016-04-11 2022-04-11 N/A 2024-04-11
8 alogliptin benzoate 158335 Nesina Takeda Ca… "Kazano\r\n … 2013-11-27 2019-11-27 N/A 2021-11-27
9 anthrax immune glo… 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06 Yes 2026-05-06
10 antihemophilic fac… 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22 Yes 2023-02-22
# ... with 250 more rows
To read in the 2nd/3rd/4th table on the page, change the number in tbl <- html_table(page)[[1]] to the number of the table you wish to read.
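For instance, reusing the page object from above, you can read the second table the same way, or pull every table on the page into a list in one go (a small sketch):
# Second table on the page
tbl2 <- as_tibble(html_table(page)[[2]])
# Or read every table into a list of tibbles at once
all_tbls <- lapply(html_table(page), as_tibble)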

You'll be able to extract this data through web scraping.
Try something like
library(rvest)
library(dplyr)
url <- "https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html"
page_html <- read_html(url)
tables <- page_html %>% html_nodes("table")
for (i in seq_along(tables)) {
  table <- tables[i]
  # Column names come from the header row; strip stray \r and \n from the cells
  table_header <- table %>%
    html_nodes("thead th") %>%
    html_text() %>%
    trimws() %>%
    gsub("\r", "", .) %>%
    gsub("\n", "", .)
  table_data <- matrix(ncol = length(table_header), nrow = 1) %>% as.data.frame()
  colnames(table_data) <- table_header
  rows <- table %>% html_nodes("tr")
  # Skip the first row (the header) and fill the data frame row by row
  for (j in 2:length(rows)) {
    table_data[j - 1, ] <- rows[j] %>% html_nodes("td") %>% html_text() %>% trimws()
  }
  assign(paste0("table_data", i), table_data)
}

You can process them all the same way without a for loop and without using assign() (shudder). Plus, we can assign each table's caption (the heading shown above it on the page) to the corresponding list element for reference:
library(rvest)
xdf <- read_html("https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/applications-submissions/register-innovative-drugs/register.html")
tbls <- html_table(xdf, trim = TRUE)
We clean up the column names using janitor::clean_names(), then find the captions, clean them up so they're proper variable names, and assign them as the names of the list elements:
setNames(
  lapply(tbls, function(tbl) {
    janitor::clean_names(tbl) %>% # CLEAN UP TABLE COLUMN NAMES
      tibble::as_tibble()         # solely for better printing
  }),
  html_nodes(xdf, "table > caption") %>% # ASSIGN THE TABLE CAPTION TO THE LIST ELEMENT
    html_text() %>%                      # BUT WE NEED TO CLEAN THEM UP FIRST
    trimws() %>%
    tolower() %>%
    gsub("[[:punct:][:space:]]+", "_", .) %>%
    gsub("_+", "_", .) %>%
    make.unique(sep = "_")
) -> tbls
Now we can access them by name in the list without using the nigh-never-recommended assign() (shudder again):
tbls$products_for_human_use_active_data_protection_period
## # A tibble: 260 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abiraterone … 138343 Zytiga Janssen … N/A 2011-07-27 2017-07-27
## 2 aclidinium brom… 157598 Tudorza Gen… AstraZeneca… Duaklir Genu… 2013-07-29 2019-07-29
## 3 afatinib dimale… 158730 Giotrif Boehringer … N/A 2013-11-01 2019-11-01
## 4 aflibercept 149321 Eylea Bayer In… N/A 2013-11-08 2019-11-08
## 5 albiglutide 165145 Eperzan GlaxoSmithK… N/A 2015-07-15 2021-07-15
## 6 alectinib hydro… 189442 Alecensaro Hoffmann-La… N/A 2016-09-29 2022-09-29
## 7 alirocumab 183116 Praluent Sanofi-aven… N/A 2016-04-11 2022-04-11
## 8 alogliptin benz… 158335 Nesina Takeda C… "Kazano\r\n … 2013-11-27 2019-11-27
## 9 anthrax immune … 200446 Anthrasil Emergent … N/A 2017-11-06 2023-11-06
## 10 antihemophilic … 163447 Eloctate Bioverativ … N/A 2014-08-22 2020-08-22
## # ... with 250 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_human_use_expired_data_protection_period
## # A tibble: 92 x 9
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 abatacept 98531 Orencia Bristol-Mye… N/A 2006-06-29 2012-06-29
## 2 acamprosate cal… 103287 Campral Mylan Pharm… N/A 2007-03-16 2013-03-16
## 3 alglucosidase a… 103381 Myozyme Genzyme Can… N/A 2006-08-14 2012-08-14
## 4 aliskiren hemif… 105388 Rasilez Novartis Ph… "Rasilez HCT\r\… 2007-11-14 2013-11-14
## 5 ambrisentan 113287 Volibris GlaxoSmithK… N/A 2008-03-20 2014-03-20
## 6 anidulafungin 110202 Eraxis Pfizer Cana… N/A 2007-11-14 2013-11-14
## 7 aprepitant 108483 Emend Merck Fross… "Emend Tri-Pack… 2007-08-24 2013-08-24
## 8 aripiprazole 120192 Abilify Bristol-Mye… Abilify Maintena 2009-07-09 2015-07-09
## 9 azacitidine 127108 Vidaza Celgene N/A 2009-10-23 2015-10-23
## 10 besifloxacin 123400 Besivance Bausch & … N/A 2009-10-23 2015-10-23
## # ... with 82 more rows, and 2 more variables: pediatric_extension_yes_no <chr>, data_protection_ends <chr>
tbls$products_for_veterinary_use_active_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 afoxolaner 163768 Nexgard Merial Cana… Nexgard Spectra 2014-07-08 2020-07-08
## 2 avilamycin 156949 Surmax 100 Pre… Elanco Cana… Surmax 200 Prem… 2014-02-18 2020-02-18
## 3 cefpodoxime pro… 149164 Simplicef Zoetis Cana… N/A 2012-12-06 2018-12-06
## 4 clodronate diso… 172789 Osphos Injecti… Dechra Ltd. N/A 2015-05-06 2021-05-06
## 5 closantel sodium 180678 Flukiver Elanco Divi… N/A 2015-11-24 2021-11-24
## 6 derquantel 184844 Startect Zoetis Cana… N/A 2016-04-27 2022-04-27
## 7 dibotermin alfa… 148153 Truscient Zoetis Cana… N/A 2012-11-20 2018-11-20
## 8 fluralaner 166320 Bravecto Intervet Ca… N/A 2014-05-23 2020-05-23
## 9 gonadotropin re… 140525 Improvest Zoetis Cana… N/A 2011-06-22 2017-06-22
## 10 insulin human (… 150211 Prozinc Boehringer … N/A 2013-04-24 2019-04-24
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
tbls$products_for_veterinary_use_expired_data_protection_period
## # A tibble: 26 x 8
## medicinal_ingre… submission_numb… innovative_drug manufacturer drug_s_containi… notice_of_compl… x6_year_no_file…
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 acetaminophen 110139 Pracetam 20% O… Ceva Animal… N/A 2009-03-05 2015-03-05
## 2 buprenorphine h… 126077 Vetergesic Mul… Sogeval UK … N/A 2010-02-03 2016-02-03
## 3 cefovecin sodium 110061 Convenia Zoetis Cana… N/A 2007-05-30 2013-05-30
## 4 cephalexin mono… 126970 Vetolexin Vétoquinol … Cefaseptin 2010-06-24 2016-06-24
## 5 dirlotapide 110110 Slentrol Zoetis Cana… N/A 2008-08-14 2014-08-14
## 6 emamectin benzo… 109976 Slice Intervet Ca… N/A 2009-06-29 2015-06-29
## 7 emodepside 112103 / 112106… Profender Bayer Healt… N/A 2008-08-28 2014-08-28
## 8 firocoxib 110661 / 110379 Previcox Merial Cana… N/A 2007-09-28 2013-09-28
## 9 fluoxetine hydr… 109825 / 109826… Reconcile Elanco, Div… N/A 2008-03-28 2014-03-28
## 10 gamithromycin 125823 Zactran Merial Cana… N/A 2010-03-29 2016-03-29
## # ... with 16 more rows, and 1 more variable: data_protection_ends <chr>
Each table also contains literal "N/A" strings that we can turn into real NA values. There's also a column, drug_s_containing_the_medicinal_ingredient_variations, common to each table that, when an observation is not N/A, holds one or more drugs separated by \r\n. We can use that to turn it into a list column you can post-process with, say, tidyr::unnest():
lapply(tbls, function(x) {
  # Make "N/A" into real NAs
  x[] <- lapply(x, function(.x) ifelse(.x == "N/A", NA_character_, .x))
  # The common `drug_s_containing_the_medicinal_ingredient_variations` column,
  # when not NA, has one drug per line, so we can use that fact to turn it into
  # a list column which you can use `tidyr::unnest()` on
  x$drug_s_containing_the_medicinal_ingredient_variations <-
    lapply(x$drug_s_containing_the_medicinal_ingredient_variations, function(.x) {
      strsplit(trimws(.x), "[\r\n]+")[[1]] # [[1]] so each element is a character vector, not a nested list
    })
  x
}) -> tbls
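For instance, a minimal sketch of expanding that list column back into one row per related drug (assuming tidyr >= 1.0, which takes a cols argument; older versions take the bare column name):
library(dplyr)
library(tidyr)
tbls$products_for_human_use_active_data_protection_period %>%
  unnest(cols = drug_s_containing_the_medicinal_ingredient_variations)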

Related

dplyr arrange is not working while order is fine

I am trying to obtain the largest 10 investors in a country, but I get confusing results using arrange() in dplyr versus order() in base R.
head(fdi_partner)
gives the following result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code works fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's going wrong with my code?
dplyr (and tidyverse packages more generally) expects bare, unquoted variable names. Passing a quoted string to arrange() just sorts by that constant string, which leaves the row order unchanged. If a variable name contains spaces, wrap it in backticks rather than quotes:
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
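Note that the original goal was the largest 10 investors, so besides the backticks you likely also want desc(); a sketch against the renamed data from the question:
fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
  arrange(desc(`Number of projects`)) %>% # backticks, not quotes, plus desc() for largest first
  head(11)                                # 11 rows if the TOTAL row is kept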

Creating serial number for unique entries in R

I want to assign the same serial number to all identical Submission_Ids within one Batch_No. Could someone please help me figure this out?
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Expected result
Sl.No <- c(1,1,1,1,2,2,2,2,2,3,3,4,1,1,1,1,1,1,1,1,1)
One way to do it is to create run-length IDs with data.table::rleid(Submission_Id), grouped by Batch_No. We can use this inside dplyr. To show this, I created a tibble() with the two given vectors Submission_Id and Batch_No.
library(dplyr)
library(data.table)
dat <- tibble(Submission_Id = Submission_Id,
Batch_No = Batch_No)
dat %>%
group_by(Batch_No) %>%
mutate(S1.No = data.table::rleid(Submission_Id))
#> # A tibble: 21 x 3
#> # Groups: Batch_No [3]
#> Submission_Id Batch_No S1.No
#> <dbl> <dbl> <int>
#> 1 619295 633 1
#> 2 619295 633 1
#> 3 619295 633 1
#> 4 619295 633 1
#> 5 619296 633 2
#> 6 619296 633 2
#> 7 619296 633 2
#> 8 619296 633 2
#> 9 619296 633 2
#> 10 556921 633 3
#> # ... with 11 more rows
The original data
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Created on 2022-12-16 by the reprex package (v2.0.1)
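If you prefer to avoid the data.table dependency, the same run-length ID can be rebuilt with cumsum() on a change indicator (a sketch on the dat tibble above; it does the same thing rleid() does within each batch):
library(dplyr)
dat %>%
  group_by(Batch_No) %>%
  mutate(S1.No = cumsum(Submission_Id != lag(Submission_Id, default = first(Submission_Id))) + 1) %>%
  ungroup()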

Trying to extract specific characters in a column in R?

The content in the column appears as follows: $1,521+ 2 bds. I want to extract 1521 and put it in a new column. I know this can be done in Alteryx using regex; can I do it in R?
How about the following?:
library(tidyverse)
x <- '$1,521+ 2 bds'
parse_number(x)
For example:
library(tidyverse)
#generate some data
tbl <- tibble(string = str_c('$', as.character(seq(1521, 1541, 1)), '+', ' 2bds'))
new_col <-
tbl$string %>%
str_split('\\+',simplify = TRUE) %>%
`[`(, 1) %>%
str_sub(2, -1) #get rid of '$' at the start
mutate(tbl, number = new_col)
#> # A tibble: 21 x 2
#> string number
#> <chr> <chr>
#> 1 $1521+ 2bds 1521
#> 2 $1522+ 2bds 1522
#> 3 $1523+ 2bds 1523
#> 4 $1524+ 2bds 1524
#> 5 $1525+ 2bds 1525
#> 6 $1526+ 2bds 1526
#> 7 $1527+ 2bds 1527
#> 8 $1528+ 2bds 1528
#> 9 $1529+ 2bds 1529
#> 10 $1530+ 2bds 1530
#> # … with 11 more rows
Created on 2021-06-12 by the reprex package (v2.0.0)
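parse_number() (from readr, loaded with the tidyverse) also works directly on a column, so the same thing can be a single mutate() on the generated tbl; it handles both the leading $ and the comma from the original example:
tbl %>%
  mutate(number = parse_number(string))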
We can use sub from base R
as.numeric( sub("\\$(\\d+),(\\d+).*", "\\1\\2", x))
#[1] 1521
data
x <- '$1,521+ 2 bds'

Restructuring data depending on recurring values in R

Right now, I am trying to restructure my data (it's about the responsiveness of contacted people in a survey), which has a structure like this:
df_test <- data.frame(
Residence=c(rep("Berlin",10),rep("Frankfurt",10),rep("Munich",10)),
Response=c(rep(TRUE,14),rep(FALSE,16)),
ID=c(rep(1:15,each=2)),
Contact = c(rep(c("Phone","Mail","In_Person","Phone","eMail","Phone"))),
Date = sample(seq(as.Date('2000/01/01'), as.Date('2001/01/01'), by="day"), 30)
)
df_test <- df_test[order(df_test$ID,df_test$Date),]
In the resulting dataframe, each line represents one contact event and, usually, all people (labelled by ID) have been contacted multiple times by different means:
#first 4 lines of dataframe:
Residence Response ID Contact Date
2 Berlin TRUE 1 Mail 2000-07-25
1 Berlin TRUE 1 Phone 2000-09-25
3 Berlin TRUE 2 In_Person 2000-02-06
4 Berlin TRUE 2 Phone 2000-10-01
To get a nice overview with focus on the contacted people for e.g. plots, I want to create a new data frame in which every line represents one contacted person, with fixed values just appearing once (e.g. ID, Residence, Response) while contact-specific values (Contact, Date) are listed in each line like so:
#restructured lines in new dataframe from first 4 lines of original dataframe:
Residence Response ID Contact Date Contact.1 Date.1
1 Berlin TRUE 1 Mail 2000-07-25 Phone 2000-09-25
2 Berlin TRUE 2 In_Person 2000-02-06 Phone 2000-10-01
With the initial sorting by date I hope to also get the contact attempts in each line in chronological order.
While I don't have any code which is close to running, I tried to at least get a dataframe with an empty column and fill it with the extracted IDs, without duplicates:
for (i in df_test[, "ID"]) {
  if (df_test[i, "ID"] != df_test[i - 1, "ID"]) {
    df_test_restructured <- append(df_test_restructured, df_test[i, "ID"])
  }
}
After many unfruitful attempts, I figured there should be some existing and more efficient strategies or functions unknown to me. Any suggestions? Thanks in advance <3
EDIT: Ideally, each row would have the contact attempts listed in order, since people have also been contacted multiple times via the same medium. I want to extract information like, e.g., whether people mostly responded after the first reminder email following an initial email.
Assuming you want one row per person (ID) showing on what date and by what means (phone, email, ...) there was a contact, you could do something like this with the tidyverse.
library(tidyverse)
df_test <- data.frame(
Residence=c(rep("Berlin",10),rep("Frankfurt",10),rep("Munich",10)),
Response=c(rep(TRUE,14),rep(FALSE,16)),
ID=c(rep(1:15,each=2)),
Contact = c(rep(c("Phone","Mail","In_Person","Phone","eMail","Phone"))),
Date = sample(seq(as.Date('2000/01/01'), as.Date('2001/01/01'), by="day"), 30)
)
df_test %>%
group_by(ID) %>%
pivot_wider(names_from = Contact, values_from = Date)
#> # A tibble: 15 x 7
#> # Groups: ID [15]
#> Residence Response ID Phone Mail In_Person eMail
#> <chr> <lgl> <int> <date> <date> <date> <date>
#> 1 Berlin TRUE 1 2000-01-04 2000-09-06 NA NA
#> 2 Berlin TRUE 2 2000-03-15 NA 2000-05-19 NA
#> 3 Berlin TRUE 3 2000-11-05 NA NA 2000-05-06
#> 4 Berlin TRUE 4 2000-11-02 2000-03-29 NA NA
#> 5 Berlin TRUE 5 2000-12-20 NA 2000-04-30 NA
#> 6 Frankfurt TRUE 6 2000-02-23 NA NA 2000-02-05
#> 7 Frankfurt TRUE 7 2000-08-30 2000-11-29 NA NA
#> 8 Frankfurt FALSE 8 2000-02-20 NA 2000-08-08 NA
#> 9 Frankfurt FALSE 9 2000-12-11 NA NA 2000-05-25
#> 10 Frankfurt FALSE 10 2000-12-21 2000-01-15 NA NA
#> 11 Munich FALSE 11 2000-07-07 NA 2000-12-16 NA
#> 12 Munich FALSE 12 2000-08-26 NA NA 2000-09-03
#> 13 Munich FALSE 13 2000-05-02 2000-11-20 NA NA
#> 14 Munich FALSE 14 2000-04-05 NA 2000-09-30 NA
#> 15 Munich FALSE 15 2000-09-26 NA NA 2000-05-22
New addition based on your updated goal
I am not sure if this is the tidiest way, but I think it is what you are looking for.
df_test %>%
group_by(ID) %>%
arrange(Date) %>%
mutate(no = row_number()) %>%
pivot_wider(names_from = c(no), values_from = c(Contact,Date)) %>%
select(c(Residence:Contact_1, Date_1, Contact_2, Date_2)) %>%
arrange(ID)
#> # A tibble: 15 x 7
#> # Groups: ID [15]
#> Residence Response ID Contact_1 Date_1 Contact_2 Date_2
#> <chr> <lgl> <int> <chr> <date> <chr> <date>
#> 1 Berlin TRUE 1 Mail 2000-01-09 Phone 2000-04-26
#> 2 Berlin TRUE 2 Phone 2000-01-27 In_Person 2000-10-14
#> 3 Berlin TRUE 3 eMail 2000-03-01 Phone 2000-07-14
#> 4 Berlin TRUE 4 Phone 2000-05-19 Mail 2000-09-22
#> 5 Berlin TRUE 5 Phone 2000-07-06 In_Person 2000-12-03
#> 6 Frankfurt TRUE 6 eMail 2000-07-05 Phone 2000-11-20
#> 7 Frankfurt TRUE 7 Phone 2000-02-06 Mail 2000-12-28
#> 8 Frankfurt FALSE 8 Phone 2000-04-03 In_Person 2000-09-06
#> 9 Frankfurt FALSE 9 eMail 2000-06-16 Phone 2000-06-24
#> 10 Frankfurt FALSE 10 Phone 2000-01-26 Mail 2000-05-02
#> 11 Munich FALSE 11 In_Person 2000-02-15 Phone 2000-06-28
#> 12 Munich FALSE 12 eMail 2000-03-22 Phone 2000-04-24
#> 13 Munich FALSE 13 Phone 2000-03-21 Mail 2000-08-02
#> 14 Munich FALSE 14 In_Person 2000-09-01 Phone 2000-11-27
#> 15 Munich FALSE 15 Phone 2000-05-27 eMail 2000-07-09
You can start by doing:
df_test %>%
  pivot_wider(names_from = Contact, values_from = Date)
# A tibble: 15 x 7
Residence Response ID Phone Mail In_Person eMail
<fct> <lgl> <int> <date> <date> <date> <date>
1 Berlin TRUE 1 2000-01-20 2000-02-18 NA NA
2 Berlin TRUE 2 2000-07-24 NA 2000-03-19 NA
Actually, plotting directly from your original long df is entirely doable.
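For instance, a minimal sketch of such a plot from the long df_test, assuming you want one point per contact attempt, coloured by medium:
library(ggplot2)
ggplot(df_test, aes(x = Date, y = factor(ID), colour = Contact, shape = Response)) +
  geom_point() +
  labs(y = "ID")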

How can I reshape data from long to wide

** Sample data added after comment**
What I have:
pmts <- data.frame(stringsAsFactors=FALSE,
name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
pmt_date = c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17")
)
#> name pmt_amount pmt_date
#> 1 johndoe 550 9/1/16
#> 2 johndoe 550 11/1/16
#> 3 janedoe 995 12/15/16
#> 4 foo 375 1/5/17
#> 5 foo 375 3/5/17
#> 6 foo 375 5/5/17
What I am looking to achieve:
read.table(header = T, text =
"name pmt_amount first_pmt second_pmt third_pmt
johndoe 550 9/1/16 11/1/16 NA
janedoe 995 12/15/16 NA NA
foo 375 1/5/17 3/5/17 5/5/17"
)
#> name pmt_amount first_pmt second_pmt third_pmt
#> 1 johndoe 550 9/1/16 11/1/16 <NA>
#> 2 janedoe 995 12/15/16 <NA> <NA>
#> 3 foo 375 1/5/17 3/5/17 5/5/17
** End of update**
I have a large dataset with payment information for different products. Some of these products have a pay-in-full option as well as a two-pay and three-pay option. I need to create fields that would be First_Payment, Second_Payment, and Third_Payment and would populate NA in the respective fields if there were only one or two payments.
I've tried a couple options and the best workaround I have thus far is this:
pmts %>%
group_by(Email, Name, Amount, Form.Title) %>%
summarise(First_Payment = min(Payment.Date),
Second_Payment = median(Payment.Date),
Last_Payment = max(Payment.Date)) -> pmts
This obviously is not ideal, as it makes up a payment date for the 2-pay plans, and I would have to instruct the end user to ignore this field and just look at the 1st and 3rd fields.
I also tried to summarise with partial sorts like this:
n <- length(pmts$Payment.Date)
sort(pmts$Payment.Date,partial=n-1)[n-1]
However, if there weren't three payments for the person, it would take the n-1 date from the entire data set and apply it to all other fields.
Ideally, if it was a pay-in-full, the First_Payment field would have the date and the 2nd/3rd fields would say NA. The 2-pay would have 1st and 2nd dates and the 3rd field would say NA. And finally, the 3-pay would have all 3 dates.
The end users here are not super data savvy so I'm trying to make this as easy to interpret as possible. Any suggestions would be tremendously appreciated. Thank you!
Using data.table, this is a simple one-liner:
library(data.table) #v1.9.8+
dcast(setDT(pmts), name + pmt_amount ~ rowid(pmt_amount))
# Using 'pmt_date' as value column. Use 'value.var' to override
# name pmt_amount 1 2 3
# 1: foo 375 1/5/17 3/5/17 5/5/17
# 2: janedoe 995 12/15/16 NA NA
# 3: johndoe 550 9/1/16 11/1/16 NA
dcast converts from long to wide, and the formula accepts expressions. rowid() just adds a row counter per pmt_amount.
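If you want proper column names and to silence the value-column message, value.var and rowid()'s prefix argument can be set explicitly (a sketch; the pmt_date_ prefix is just illustrative):
dcast(setDT(pmts), name + pmt_amount ~ rowid(pmt_amount, prefix = "pmt_date_"),
      value.var = "pmt_date")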
You can use tidyr for this.
library(dplyr)
library(tidyr)
pmts <- tibble(
name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
pmt_date = lubridate::mdy(c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17"))
)
pmts
#> # A tibble: 6 x 3
#> name pmt_amount pmt_date
#> <chr> <int> <date>
#> 1 johndoe 550 2016-09-01
#> 2 johndoe 550 2016-11-01
#> 3 janedoe 995 2016-12-15
#> 4 foo 375 2017-01-05
#> 5 foo 375 2017-03-05
#> 6 foo 375 2017-05-05
pmts_long <- pmts %>%
group_by(name) %>%
arrange(name, pmt_date) %>%
mutate(pmt = row_number()) %>%
ungroup() %>%
complete(name, nesting(pmt)) %>%
fill(pmt_amount, .direction = "down")
pmts_long
#> # A tibble: 9 x 4
#> name pmt pmt_amount pmt_date
#> <chr> <int> <int> <date>
#> 1 foo 1 375 2017-01-05
#> 2 foo 2 375 2017-03-05
#> 3 foo 3 375 2017-05-05
#> 4 janedoe 1 995 2016-12-15
#> 5 janedoe 2 995 NA
#> 6 janedoe 3 995 NA
#> 7 johndoe 1 550 2016-09-01
#> 8 johndoe 2 550 2016-11-01
#> 9 johndoe 3 550 NA
pmts_wide <- pmts_long %>%
gather("key", "val", -name, -pmt_amount, -pmt) %>%
unite(pmt_number, key, pmt) %>%
spread(pmt_number, val)
pmts_wide
#> # A tibble: 3 x 5
#> name pmt_amount pmt_date_1 pmt_date_2 pmt_date_3
#> * <chr> <int> <date> <date> <date>
#> 1 foo 375 2017-01-05 2017-03-05 2017-05-05
#> 2 janedoe 995 2016-12-15 NA NA
#> 3 johndoe 550 2016-09-01 2016-11-01 NA
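On current tidyr (1.0 or later) the gather()/unite()/spread() chain can be collapsed into a single pivot_wider() call; a sketch using the same pmts tibble:
library(dplyr)
library(tidyr)
pmts %>%
  group_by(name) %>%
  arrange(name, pmt_date) %>%
  mutate(pmt = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = pmt, values_from = pmt_date, names_prefix = "pmt_date_")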
