How to scrape a multi-page website using R

I want to scrape the contents of a multi-page website using R. Currently I'm able to scrape the first page; how do I scrape all pages and store them in a CSV?
Here's my code so far:
library(rvest)
library(tibble)
library(tidyr)
library(dplyr)
df <- 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=1&selectedItem=viewAllAwardedContracts.do&T01_ps=100' %>%
  read_html() %>%
  html_table()
df
write.csv(df, "Contracts_test_taneps.csv")

Scrape multiple pages. Change 1:2 to 1:N, where N is the number of pages you need:
library(tidyverse)
library(rvest)
get_taneps <- function(page) {
  str_c("https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=",
        page, "&selectedItem=viewAllAwardedContracts.do&T01_ps=100") %>%
    read_html() %>%
    html_table() %>%
    getElement(1) %>%
    janitor::clean_names()
}
df <- map_dfr(1:2, get_taneps)
df
# A tibble: 200 x 7
tender_no procuring_entity suppl~1 award~2 award~3 lot_n~4 notic~5
<chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Municipal Council SHIBAM~ 08/11/~ "66200~ N/A NA
2 AE/005/2022-2023/DODOMA/FA/NC/02 Ministry of Livestock and Fish~ NINO G~ 04/11/~ "46511~ N/A NA
3 LGA/014/2022/2023/G/01 UTAWALA Bagamoyo District Council VILANG~ 02/11/~ "90000~ N/A NA
4 LGA/014/014/2022/2023/G/01 FEDHA 3EPICAR Bagamoyo District Council VILANG~ 02/11/~ "88100~ N/A NA
5 LGA/014/2022/2023/G/01/ARDHI Bagamoyo District Council VILANG~ 31/10/~ "16088~ N/A NA
6 LGA/014/2022/2023/G/11 VIFAA VYA USAFI SOKO LA SAMAKI Bagamoyo District Council MBUTUL~ 31/10/~ "10000~ N/A NA
7 DCD - 000899- 400E - ANIMAL FEEDS Kibaha Education Centre ALOYCE~ 29/10/~ "82400~ N/A NA
8 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Regional Referral Hos~ JIGABH~ 02/11/~ "17950~ N/A NA
9 IE/023/2022-23/HQ/G/13 Commission for Mediation and A~ AKO GR~ 27/10/~ "42500~ N/A NA
10 AE/005/2022-2023/MOROGORO/FA/G/05 Morogoro Municipal Council THE GR~ 01/11/~ "17247~ N/A NA
# ... with 190 more rows, and abbreviated variable names 1: supplier_name, 2: award_date, 3: award_amount, 4: lot_name,
# 5: notice_pdf
# i Use `print(n = ...)` to see more rows
Write it out as a CSV:
write_csv(df, "Contracts_test_taneps.csv")
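If the total number of pages isn't known up front, it can help to make the loop fault-tolerant. A sketch (my own addition, assuming a request past the last page errors out or returns nothing; the page count of 50 is a placeholder, not the site's real count):

```r
library(tidyverse)
library(rvest)

# Wrap the scraper defined above so a failing page yields NULL
# instead of aborting the whole run; map_dfr() drops NULL results.
safe_get <- possibly(get_taneps, otherwise = NULL)

all_pages <- map_dfr(1:50, safe_get)
write_csv(all_pages, "Contracts_test_taneps.csv")
```

purrr::possibly() is handy for any scraping loop where a single bad page shouldn't cost you the pages already fetched.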


Xpath not "seeing" table in Rvest

I am attempting to scrape the table on this website. I have done it on several other websites but can't scrape this one.
Does it have anything to do with the fact that the table is generated dynamically rather than static?
Here is my code:
library(tidyverse)
library(rvest)
link <- "https://portal.vietcombank.com.vn/Personal/TG/Pages/ty-gia.aspx?devicechannel=default"
webpage <- read_html(link)
webpage %>%
  html_nodes(xpath = "//*[@class = 'tbl-01 rateTable']") %>%
  html_table()
Thank you all very much.
The data is actually loaded from another URL, which can be found by looking under the Network tab in the browser's developer tools.
library(rvest)
url <- 'https://portal.vietcombank.com.vn/UserControls/TVPortal.TyGia/pListTyGia.aspx?txttungay=5/10/2021&BacrhID=1&isEn=False'
data <- url %>% read_html() %>% html_table() %>% .[[1]]
data
# `Ngoại tệ` `Ngoại tệ` Mua Mua Bán
# <chr> <chr> <chr> <chr> <chr>
# 1 Tên ngoại tệ Mã NT Tiền mặt Chuyển khoản Bán
# 2 AUSTRALIAN DOLLAR AUD 16,146.00 16,309.09 16,820.59
# 3 CANADIAN DOLLAR CAD 17,613.63 17,791.55 18,349.54
# 4 SWISS FRANC CHF 23,991.41 24,233.75 24,993.78
# 5 YUAN RENMINBI CNY 3,459.86 3,494.81 3,604.96
# 6 DANISH KRONE DKK - 3,489.44 3,620.54
# 7 EURO EUR 25,758.61 26,018.80 27,099.20
# 8 POUND STERLING GBP 30,155.61 30,460.21 31,415.53
# 9 HONGKONG DOLLAR HKD 2,849.58 2,878.36 2,968.63
#10 INDIAN RUPEE INR - 304.65 316.60
# … with 11 more rows
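Since the date is embedded in the txttungay parameter of that URL, the same endpoint can presumably be queried for other days. A sketch (the get_rates helper is my own; the parameter names txttungay, BacrhID, and isEn are copied verbatim from the Network-tab URL, but whether the endpoint accepts zero-padded dates is an unverified assumption):

```r
library(rvest)

# Build the hidden-endpoint URL for an arbitrary date and pull the first table.
get_rates <- function(date) {
  paste0("https://portal.vietcombank.com.vn/UserControls/TVPortal.TyGia/pListTyGia.aspx",
         "?txttungay=", format(as.Date(date), "%d/%m/%Y"),
         "&BacrhID=1&isEn=False") %>%
    read_html() %>%
    html_table() %>%
    .[[1]]
}

get_rates("2021-10-05")
```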

Change data type of all columns in list of data frames before using `bind_rows()`

I have a list of data frames, e.g. from the following code:
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
  read_html() %>%
  html_nodes(css = 'table[class="wikitable sortable"]') %>%
  html_table(fill = TRUE)
I would now like to combine the data frames into one, e.g. with dplyr::bind_rows(), but get Error: Can't combine ..1$Deaths <integer> and ..5$Deaths <character>. (The answer suggested here doesn't do the trick.)
So I need to convert the data types before row binding. I would like to do this inside a pipe (a tidyverse solution would be ideal) and, due to the structure of the remaining project, not loop through the data frames, but instead use something vectorized like lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work) to convert all values to character.
Can someone help me with this?
You can change all the column classes to characters and bind them together with map_df.
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
  rvest::read_html() %>%
  html_nodes(css = 'table[class="wikitable sortable"]') %>%
  html_table(fill = TRUE) %>%
  map_df(~ .x %>% mutate(across(.fns = as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
#2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
#3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
#4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
#5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows
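If you want proper column types back after the character bind (so Deaths is numeric again where possible), one option is to let R re-infer the types afterwards with utils::type.convert(), which has a data.frame method in R >= 4.0. A sketch, where tables is a placeholder for the list of scraped data frames:

```r
library(tidyverse)

# `tables` stands for the list of data frames produced by html_table() above.
combined <- tables %>%
  map_df(~ .x %>% mutate(across(.fns = as.character))) %>%  # unify as character
  type.convert(as.is = TRUE)                                # re-guess numeric/integer
```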

Restructuring data depending on recurring values in R

Right now, I'm trying to restructure my data (it's about the responsiveness of contacted people in a survey), which has a structure like this:
df_test <- data.frame(
  Residence = c(rep("Berlin", 10), rep("Frankfurt", 10), rep("Munich", 10)),
  Response = c(rep(TRUE, 14), rep(FALSE, 16)),
  ID = rep(1:15, each = 2),
  Contact = rep(c("Phone", "Mail", "In_Person", "Phone", "eMail", "Phone")),
  Date = sample(seq(as.Date('2000/01/01'), as.Date('2001/01/01'), by = "day"), 30)
)
df_test <- df_test[order(df_test$ID, df_test$Date), ]
In the resulting dataframe, each line represents one contact event and, usually, all people (labelled by ID) have been contacted multiple times by different means:
#first 4 lines of dataframe:
Residence Response ID Contact Date
2 Berlin TRUE 1 Mail 2000-07-25
1 Berlin TRUE 1 Phone 2000-09-25
3 Berlin TRUE 2 In_Person 2000-02-06
4 Berlin TRUE 2 Phone 2000-10-01
To get a nice overview with focus on the contacted people for e.g. plots, I want to create a new data frame in which every line represents one contacted person, with fixed values just appearing once (e.g. ID, Residence, Response) while contact-specific values (Contact, Date) are listed in each line like so:
#restructured lines in new dataframe from first 4 lines of original dataframe:
Residence Response ID Contact Date Contact.1 Date.1
1 Berlin TRUE 1 Mail 2000-07-25 Phone 2000-09-25
2 Berlin TRUE 2 In_Person 2000-02-06 Phone 2000-10-01
With the initial sorting by date, I hope to also get the contact attempts in each line in chronological order.
While I don't have any code that is close to running, I tried to at least get a data frame with an empty column and fill it with the extracted IDs, without duplicates:
for (i in df_test[, "ID"]) {
  if (df_test[i, "ID"] != df_test[i - 1, "ID"]) {
    df_test_restructured <- append(df_test_restructured, df_test[i, "ID"])
  }
}
After many unfruitful attempts, I figured there should be some existing and more efficient strategies or functions unknown to me. Any suggestions? Thanks in advance <3
EDIT: Ideally, each row would have the contact attempts listed in order, since people have also been contacted multiple times via the same medium. I want to extract info like, e.g., whether people mostly responded after the first reminder email, having already been sent an initial email.
Assuming you want one row per person (ID) showing at what date and by what means (phone, email, ...) there was a contact, you could do something like this with the tidyverse:
library(tidyverse)
df_test <- data.frame(
  Residence = c(rep("Berlin", 10), rep("Frankfurt", 10), rep("Munich", 10)),
  Response = c(rep(TRUE, 14), rep(FALSE, 16)),
  ID = rep(1:15, each = 2),
  Contact = rep(c("Phone", "Mail", "In_Person", "Phone", "eMail", "Phone")),
  Date = sample(seq(as.Date('2000/01/01'), as.Date('2001/01/01'), by = "day"), 30)
)
df_test %>%
  group_by(ID) %>%
  pivot_wider(names_from = Contact, values_from = Date)
#> # A tibble: 15 x 7
#> # Groups: ID [15]
#> Residence Response ID Phone Mail In_Person eMail
#> <chr> <lgl> <int> <date> <date> <date> <date>
#> 1 Berlin TRUE 1 2000-01-04 2000-09-06 NA NA
#> 2 Berlin TRUE 2 2000-03-15 NA 2000-05-19 NA
#> 3 Berlin TRUE 3 2000-11-05 NA NA 2000-05-06
#> 4 Berlin TRUE 4 2000-11-02 2000-03-29 NA NA
#> 5 Berlin TRUE 5 2000-12-20 NA 2000-04-30 NA
#> 6 Frankfurt TRUE 6 2000-02-23 NA NA 2000-02-05
#> 7 Frankfurt TRUE 7 2000-08-30 2000-11-29 NA NA
#> 8 Frankfurt FALSE 8 2000-02-20 NA 2000-08-08 NA
#> 9 Frankfurt FALSE 9 2000-12-11 NA NA 2000-05-25
#> 10 Frankfurt FALSE 10 2000-12-21 2000-01-15 NA NA
#> 11 Munich FALSE 11 2000-07-07 NA 2000-12-16 NA
#> 12 Munich FALSE 12 2000-08-26 NA NA 2000-09-03
#> 13 Munich FALSE 13 2000-05-02 2000-11-20 NA NA
#> 14 Munich FALSE 14 2000-04-05 NA 2000-09-30 NA
#> 15 Munich FALSE 15 2000-09-26 NA NA 2000-05-22
New addition based on your edited target:
I am not sure if this is the tidiest way, but I guess it is what you are looking for.
df_test %>%
  group_by(ID) %>%
  arrange(Date) %>%
  mutate(no = row_number()) %>%
  pivot_wider(names_from = no, values_from = c(Contact, Date)) %>%
  select(Residence:Contact_1, Date_1, Contact_2, Date_2) %>%
  arrange(ID)
#> # A tibble: 15 x 7
#> # Groups: ID [15]
#> Residence Response ID Contact_1 Date_1 Contact_2 Date_2
#> <chr> <lgl> <int> <chr> <date> <chr> <date>
#> 1 Berlin TRUE 1 Mail 2000-01-09 Phone 2000-04-26
#> 2 Berlin TRUE 2 Phone 2000-01-27 In_Person 2000-10-14
#> 3 Berlin TRUE 3 eMail 2000-03-01 Phone 2000-07-14
#> 4 Berlin TRUE 4 Phone 2000-05-19 Mail 2000-09-22
#> 5 Berlin TRUE 5 Phone 2000-07-06 In_Person 2000-12-03
#> 6 Frankfurt TRUE 6 eMail 2000-07-05 Phone 2000-11-20
#> 7 Frankfurt TRUE 7 Phone 2000-02-06 Mail 2000-12-28
#> 8 Frankfurt FALSE 8 Phone 2000-04-03 In_Person 2000-09-06
#> 9 Frankfurt FALSE 9 eMail 2000-06-16 Phone 2000-06-24
#> 10 Frankfurt FALSE 10 Phone 2000-01-26 Mail 2000-05-02
#> 11 Munich FALSE 11 In_Person 2000-02-15 Phone 2000-06-28
#> 12 Munich FALSE 12 eMail 2000-03-22 Phone 2000-04-24
#> 13 Munich FALSE 13 Phone 2000-03-21 Mail 2000-08-02
#> 14 Munich FALSE 14 In_Person 2000-09-01 Phone 2000-11-27
#> 15 Munich FALSE 15 Phone 2000-05-27 eMail 2000-07-09
You can start by doing:
df_test %>%
  pivot_wider(names_from = Contact, values_from = Date)
# A tibble: 15 x 7
Residence Response ID Phone Mail In_Person eMail
<fct> <lgl> <int> <date> <date> <date> <date>
1 Berlin TRUE 1 2000-01-20 2000-02-18 NA NA
2 Berlin TRUE 2 2000-07-24 NA 2000-03-19 NA
Actually, plotting with your original df is really doable.

How can I transpose data in each variable from long to wide using group_by in R?

I have a dataframe with the id variable name. I'm trying to figure out a way to transpose each variable in the dataframe by name.
My current df is below:
name jobtitle companyname datesemployed empduration joblocation jobdescrip
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati…
2 David… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
3 David… Data An… NA Jan 2018 – J… 6 mos Belfast, U… Working wi…
However, I'd like a dataframe in which there is only one row for name, and every observation for name becomes its own column, like below:
name jobtitle_1 companyname_1 datesemployed_1 empduration_1 joblocation_1 jobdescrip_1 job_title2 companyname_2 datesemployed_2 empduration_2 joblocation_2 jobdescrip_2
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
I have used commands like gather_by and melt in the past to reshape from long to wide, but in this case, I'm not sure how to apply it, since every observation for the id variable will need to become its own column.
It sounds like you are looking for gather and pivot_wider.
I used my own sample data with two names:
df <- tibble(name = c('David', 'David', 'David', 'Bill', 'Bill'),
             jobtitle = c('PM', 'TPM', 'Analyst', 'Dev', 'Eng'),
             companyname = c('EOS', 'Options', NA, 'Microsoft', 'Nintendo'))
First add an index column to distinguish the different positions for each name.
indexed <- df %>%
  group_by(name) %>%
  mutate(.index = row_number())
indexed
# name jobtitle companyname .index
# <chr> <chr> <chr> <int>
# 1 David PM EOS 1
# 2 David TPM Options 2
# 3 David Analyst NA 3
# 4 Bill Dev Microsoft 1
# 5 Bill Eng Nintendo 2
Then it is possible to use gather to get a long form, with one value per row.
gathered <- indexed %>% gather('var', 'val', -c(name, .index))
gathered
# name .index var val
# <chr> <int> <chr> <chr>
# 1 David 1 jobtitle PM
# 2 David 2 jobtitle TPM
# 3 David 3 jobtitle Analyst
# 4 Bill 1 jobtitle Dev
# 5 Bill 2 jobtitle Eng
# 6 David 1 companyname EOS
# 7 David 2 companyname Options
# 8 David 3 companyname NA
# 9 Bill 1 companyname Microsoft
# 10 Bill 2 companyname Nintendo
Now pivot_wider can be used to create a column for each variable and index.
gathered %>% pivot_wider(names_from = c(var, .index), values_from = val)
# name jobtitle_1 jobtitle_2 jobtitle_3 companyname_1 companyname_2 companyname_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 David PM TPM Analyst EOS Options NA
# 2 Bill Dev Eng NA Microsoft Nintendo NA
Get the data in long format, create a unique column identifier and get it back to wide format.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -name, names_to = 'col') %>%
  group_by(name, col) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = c(col, row), values_from = value)

How to replace NA values when parsing a html page to make a dataframe? [duplicate]

This question already has answers here: How to return NA when nothing is found in an xpath? (2 answers). Closed 5 years ago.
When parsing an HTML page, some nodes can be missing, so we get NA values. When we then try to build a data frame from the data in a list, the vectors have different lengths and construction fails.
Is there any easy way to make this work? Please see the following example:
library(rvest)
library(RCurl)
library(XML)
pg <- getURL("https://agences.axa.fr/ile-de-france/paris/paris-19e-75019")
page <- htmlTreeParse(pg, useInternal = TRUE, encoding = "UTF-8")
unlist(xpathApply(page, '//b[@class="Name"]', xmlValue))
data.frame(noms = unlist(xpathApply(page, '//b[@class="Name"]', xmlValue)),
           rue = unlist(xpathApply(page, '//span[@class="street-address"]', xmlValue)))
Using rvest and purrr (the tidyverse package for lists/functional programming, which pairs very nicely with rvest),
library(rvest)
library(purrr)
# be nice, only scrape once
h <- 'https://agences.axa.fr/ile-de-france/paris/paris-19e-75019' %>% read_html()
df <- h %>%
  # select each list item
  html_nodes('div.ListConseiller li') %>%
  # for each item, make a list of parsed name and street; coerce results to data.frame
  map_df(~ list(nom = .x %>% html_node('b.Name') %>% html_text(),
                rue = .x %>% html_node('span.street-address') %>% html_text(trim = TRUE)))
df
#> # A tibble: 14 × 2
#> nom rue
#> <chr> <chr>
#> 1 Marie France Tmim <NA>
#> 2 Rachel Tobie <NA>
#> 3 Bernard Licha <NA>
#> 4 David Giuili <NA>
#> 5 Myriam Yajid Khalfi <NA>
#> 6 Eytan Elmaleh <NA>
#> 7 Allister Charles <NA>
#> 8 Serge Savergne 321 Rue De Belleville
#> 9 Patrick Allouche 1 Rue Clavel
#> 10 Anne Fleiter 14 Avenue De Laumiere
#> 11 Eric Fitoussi <NA>
#> 12 Jean-Baptiste Crocombette 1 Bis Rue Emile Desvaux
#> 13 Eric Zunino 14 Rue De Thionville
#> 14 Eric Hayoun <NA>
The code uses CSS selectors for brevity, but use XPath ones via the xpath parameter of html_nodes and html_node, if you prefer.
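For example, an XPath version of the same answer (a sketch, assuming the class attributes match exactly as written in the question's XPath expressions) could look like:

```r
# Same logic as above, swapping the css selectors for xpath ones;
# the relative './/' prefix searches within each list item.
df <- h %>%
  html_nodes(xpath = '//div[contains(@class, "ListConseiller")]//li') %>%
  map_df(~ list(nom = .x %>% html_node(xpath = './/b[@class="Name"]') %>%
                  html_text(),
                rue = .x %>% html_node(xpath = './/span[@class="street-address"]') %>%
                  html_text(trim = TRUE)))
```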
