R - Web Page Scraping - Trouble obtaining attribute values using rvest

I am trying to use rvest to pull ISO country info from Wikipedia (including links from another page). I can't find a way of correctly obtaining the links (the href attribute) without also including the name (I have tried the XPath string() function, but it causes an error). The code is fairly easy to run, and self-explanatory.
Any help appreciated!
library(rvest)
library(dplyr)
searchPage <- read_html("https://en.wikipedia.org/wiki/ISO_3166-2")
nodes <- html_node(searchPage, xpath = '(//h2[(span/@id = "Current_codes")]/following-sibling::table)[1]')
codes <- html_nodes(nodes, xpath = 'tr/td[1]/a/text()')
names <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/text()')
#Following brings back data but attribute name as well
links <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/@href')
#Following returns nothing
links2 <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/@href/text()')
#Following Errors
links3 <- html_nodes(nodes, xpath = 'string(tr/td[2]//a[@title]/@href)')
#Following Errors
links4 <- sapply(nodes, function(x) { x %>% read_html() %>% html_nodes("tr/td[2]//a[@title]") %>% html_attr("href") })
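For reference, a minimal sketch of the piece that works (assuming the nodes object above resolved to the intended table): select the <a> elements first, then pull the attribute with html_attr(), which is also what the answer below does.
# sketch: grab the <a> nodes, then extract only the attribute value
link_nodes <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]')
links_only <- html_attr(link_nodes, "href")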

You should have included more info in your question. The "self-explanatory"
bit nearly made me ignore it (hint: out of respect for others' time, provide sufficient verbal detail along with the broken code).
I say that because I have no idea whether this is what you needed, since you didn't really say.
library(rvest)
library(tibble)
pg <- read_html("https://en.wikipedia.org/wiki/ISO_3166-2")
tab <- html_node(pg, xpath=".//table[contains(., 'Zimbabwe')]")
iso_col <- html_nodes(tab, xpath=".//td[1]/a[contains(@href, 'ISO')]")
name_col <- html_nodes(tab, xpath=".//td[2]")
data_frame(
iso2c = html_text(iso_col),
iso2c_link = html_attr(iso_col, "href"),
country_name = html_text(name_col),
country_link = html_nodes(name_col, xpath=".//a[contains(@href, 'wiki')]") %>% html_attr("href")
)
## # A tibble: 249 x 4
## iso2c iso2c_link country_name country_link
## <chr> <chr> <chr> <chr>
## 1 AD /wiki/ISO_3166-2:AD Andorra /wiki/Andorra
## 2 AE /wiki/ISO_3166-2:AE United Arab Emirates /wiki/United_Arab_Emirates
## 3 AF /wiki/ISO_3166-2:AF Afghanistan /wiki/Afghanistan
## 4 AG /wiki/ISO_3166-2:AG Antigua and Barbuda /wiki/Antigua_and_Barbuda
## 5 AI /wiki/ISO_3166-2:AI Anguilla /wiki/Anguilla
## 6 AL /wiki/ISO_3166-2:AL Albania /wiki/Albania
## 7 AM /wiki/ISO_3166-2:AM Armenia /wiki/Armenia
## 8 AO /wiki/ISO_3166-2:AO Angola /wiki/Angola
## 9 AQ /wiki/ISO_3166-2:AQ Antarctica /wiki/Antarctica
## 10 AR /wiki/ISO_3166-2:AR Argentina /wiki/Argentina
## # ... with 239 more rows
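If you need absolute URLs rather than the relative hrefs shown above, a small hedged follow-up (not part of the original answer) is to expand them with url_absolute() from xml2:
library(xml2)
# sketch: turn relative hrefs like "/wiki/Andorra" into full Wikipedia URLs
full_iso_links <- url_absolute(html_attr(iso_col, "href"), "https://en.wikipedia.org/")
full_country_links <- url_absolute(
  html_attr(html_nodes(name_col, xpath = ".//a[contains(@href, 'wiki')]"), "href"),
  "https://en.wikipedia.org/"
)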

Related

Rvest : Extracting clickable content

I am trying to extract the table in the link below
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
I want the whole table to be extracted, and I am using the following code:
library(rvest)
url <- "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
html_page <- read_html(url)
tab <- html_page %>% html_table(fill = TRUE)
I get the table in tab[[1]]; however, if you look at the website, the table has a clickable section with additional data, and that part is missing from the extracted table. I would appreciate any help on how the whole table can be extracted.
I'm not sure what you're missing. When I pulled from this website, I saw that there are multiple tabs, but I got all of the data.
As a check, here are the results when I query for the last line of the website's data (the row shown at the bottom of the table when you show all rows):
library(rvest)
library(tidyverse)
hx = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
htp <- read_html(hx) %>% html_table(., fill = T)
tbOne = htp[[1]][, 1:10] # just the data
tbOne %>% filter(`State Name` == "Uttar Pradesh",
                 `District Name` == "Badaun",
                 `Market Name` == "Wazirganj")
# # A tibble: 1 × 10
# `State Name` `District Name` `Market Name` Variety Group `Arrivals (Tonnes)`
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Uttar Pradesh Badaun Wazirganj Dara Cereals 3.50
# # … with 4 more variables: `Min Price (Rs./Quintal)` <chr>,
# # `Max Price (Rs./Quintal)` <chr>, `Modal Price (Rs./Quintal)` <chr>,
# # `Reported Date` <chr>
Update
When I pressed the 2 (the next page), nothing happened at first, and I did try repeatedly. It turns out I just needed to be more patient than I was. Sorry about that.
The URL has the query in it, so the URL can be used to get all of the data. You could do this by adding the states you're missing, or you could do it for every state. For example, page one ends on Uttar Pradesh, but we don't know whether that is all of Uttar Pradesh. That might make more sense when you see what I did.
Using rvest, I collected all of the states' names from the form. Then I put these name-value pairs into a data frame.
# collect form values for State
ht <- read_html(hx) %>% html_form()
df1 <- as.data.frame(ht[[1]][["fields"]][["ctl00$ddlState"]][["options"]]) %>%
  rownames_to_column("State")
names(df1)[2] <- "Abb"
To only look at the states that were not included in page one, you could just query the states after Utter Pradesh like this.
which(df1$State == "Uttar Pradesh", arr.ind = T)
# [1] 35
# split the URL
urone = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State="
urtwo = "&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=West+Bengal&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
# collect remaining states' data
df2 <- map(36:nrow(df1),
           function(x){
             # assemble URL
             y = toString(df1$Abb[x])
             urall = paste0(urone, y, urtwo)
             # get table
             tabs <- read_html(urall) %>% html_table(., fill = T)
             tabs
           })
length(df2)
# [1] 2
length(df2[[1]]) # state 36 is empty
length(df2[[2]]) # state 37 is not
# add the new data to the original data
df3 <- df2[[2]][[1]]
tbOne <- rbind(tbOne, df3) # one data frame of tabled data
If you wanted to make sure that you had all the data for each state, you could expand this. However, using map for that much data may be slow, so I used the function mclapply from the parallel package. In this code, I used 15 cores; you may need to change this depending on your computer's processor. Using 15 made this take less than a second.
# skip row 1, that's "select" or all
library(parallel)
df4 <- mclapply(2:nrow(df1), mc.cores = getOption("mc.cores", 15L),
                function(x){
                  # assemble URL
                  y = toString(df1$Abb[x])
                  urall = paste0(urone, y, urtwo)
                  # get table
                  tabs <- read_html(urall) %>% html_table(., fill = T)
                  tabs
                })
length(df4)
# [1] 36
# create storage using first state with data
df5 <- df4[[7]][[1]]
map(8:36,
    function(x){
      y = length(df4[[x]])
      if(y > 0){
        df5 <<- rbind(df5, df4[[x]][[1]])
      }
    })
Now you have a data frame, df5, built by querying each state separately.
I didn't look into how the data differ; however, my tbOne data frame has 577 observations, while my df5 data frame has 584.
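If you want to see what the two pulls disagree on, here is a quick, hedged check (it assumes the shared column names in the two data frames line up, which I have not verified):
library(dplyr)
# rows present in df5 that have no match in tbOne (joins on the common columns)
extra_rows <- anti_join(df5, tbOne)
nrow(extra_rows)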

scraping wikimedia category trees

I want to use R to scrape the links contained within a Wikimedia category tree, along with the structure of the tree, starting from https://commons.wikimedia.org/wiki/Category:Sports. The code below can open up all the collapsible bullet points.
library(RSelenium)
rD <- rsDriver(check = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://commons.wikimedia.org/wiki/Category:Sports")
n <- 1
# n <- 10 # takes a long time to expand all bullet points
for(i in 1:n){
  b <- remDr$findElements(using = "css selector", "[title='expand']")
  for(j in seq_along(b)){
    b[[j]]$clickElement()
  }
}
... but I am struggling to build a database that records each link together with its level in the category tree.
I can get the bullet hrefs and names using the code below, but I am struggling to find a way to indicate which level each bullet point refers to (i.e. how deep in the category tree each bullet point is). I am thinking there might be a clever XPath method to count how many CategoryTreeChildren elements deep each bullet is, but that is reaching well beyond my capabilities (a rough sketch of that idea is included after the output below).
# for testing, I manually expand the bullets for the first couple of branches
# (fully, for Bulgarian women's badminton and basketball) and the last possible
# branch, rather than letting the for loop run through multiple cycles.
library(tidyverse)
library(rvest)
s <- remDr$getPageSource()
d <- read_html(s[[1]]) %>%
  html_nodes("div#mw-subcategories") %>%
  html_nodes("div.CategoryTreeItem") %>%
  html_nodes("a") %>%
  map(xml_attrs) %>%
  map_df(~as.list(.)) %>%
  as_tibble()
# > d
# # A tibble: 135 x 2
# href title
# <chr> <chr>
# 1 /wiki/Category:Categories_by_sport Category:Categories by sport
# 2 /wiki/Category:Categories_by_sport_by_c~ Category:Categories by sport by co~
# 3 /wiki/Category:Categories_of_Bulgaria_b~ Category:Categories of Bulgaria by~
# 4 /wiki/Category:Female_sportspeople_from~ Category:Female sportspeople from ~
# 5 /wiki/Category:Female_badminton_players~ Category:Female badminton players ~
# 6 /wiki/Category:Maria_Delcheva Category:Maria Delcheva
# 7 /wiki/Category:Petya_Nedelcheva Category:Petya Nedelcheva
# 8 /wiki/Category:Gabriela_Stoeva Category:Gabriela Stoeva
# 9 /wiki/Category:Stefani_Stoeva Category:Stefani Stoeva
# 10 /wiki/Category:Women%27s_basketball_pla~ Category:Women's basketball player~
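A rough, untested sketch of the XPath-counting idea mentioned above: it assumes the nested containers carry the CategoryTreeChildren class, and takes the number of such ancestors of each link as its depth.
library(rvest)
library(purrr)
library(tibble)
doc   <- read_html(s[[1]])
items <- html_nodes(doc, "div#mw-subcategories div.CategoryTreeItem a")
d_depth <- tibble(
  href  = html_attr(items, "href"),
  title = html_attr(items, "title"),
  # depth = number of enclosing CategoryTreeChildren divs around each link
  depth = map_int(items, ~ length(
    html_nodes(.x, xpath = "ancestor::div[contains(@class, 'CategoryTreeChildren')]")
  ))
)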
I have also played around with the WikipediR package; the package description says it can be used to retrieve elements of category trees, but I cannot find an example of how to implement it.

Using jsonlite for multilevel nested lists and the NASA API in R

Using NASA's API to retrieve information on Mars' weather, I get back a list of lists of lists.
In Python, pandas does a beautiful job of formatting the data; however, I have to call the API from R.
This is how the data is structured in R, even after using jsonlite's fromJSON(x, flatten = TRUE) function.
I'd like to structure the raw data to be like the pandas table.
Here is my API code:
library(httr)
library(jsonlite)
req <- "https://api.nasa.gov/insight_weather/?api_key=&feedtype=json&ver=1.0"
response <- GET(req)
response <- content(response, as="text")
mars <- fromJSON(response, flatten = TRUE)
There's more information returned from the API query than the table screenshot shows, but I've focused on returning a table with a structure similar to the example. If you want the extra information, like wind direction, it has a different structure and might be easier to parse separately and merge (a sketch follows the output below).
library(jsonlite)
library(purrr)
library(dplyr)
library(tidyr)
req <- "https://api.nasa.gov/insight_weather/?api_key=DEMO_KEY&feedtype=json&ver=1.0"
mars <- fromJSON(req)
map(mars[1:7], ~ unlist(.x[1:6]) %>%   # flatten the first six elements of each sol
      bind_rows()) %>%
  bind_rows(.id = "day") %>%           # one row per sol, keyed by day
  pivot_longer(cols = grep("\\.", names(.)),
               names_sep = "\\.",
               names_to = c(".value", "var"))
# A tibble: 28 x 8
day First_UTC Last_UTC Season var AT HWS PRE
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer av -65.475 5.364 637.752
2 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer ct 178174 79226 89083
3 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer mn -100.044 0.236 618.015
4 402 2020-01-13T06:24:59Z 2020-01-14T07:04:33Z summer mx -16.815 21.146 653.7326
5 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer av -62.449 5.683 636.87
6 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer ct 211897 95539 105800
7 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer mn -101.272 0.205 618.1757
8 403 2020-01-14T07:04:34Z 2020-01-15T07:44:08Z summer mx -16.931 20.986 653.4973
9 404 2020-01-15T07:44:09Z 2020-01-16T08:23:44Z summer av -63.622 5.303 636.148
10 404 2020-01-15T07:44:09Z 2020-01-16T08:23:44Z summer ct 293286 132690 158958
# … with 18 more rows
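For the wind-direction data mentioned above, a hedged sketch: it assumes each sol's WD element contains a most_common entry with fields such as compass_point and ct, which is what the InSight API appears to return.
library(purrr)
library(dplyr)
# one row per sol with the most common wind direction (field names are assumptions)
wd <- map_dfr(mars[1:7], ~ bind_rows(.x$WD$most_common), .id = "day")
# wd can then be joined to the long table above by "day"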

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R: http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (all the pages).
I'm new to programming, and everywhere I've looked, tables are mostly identified with IDs, divs, or classes. On this page there are none; the data is stored in table format. How should I scrape it?
This is what I did:
library(rvest)
webpage <- read_html("http://zipnet.in/index.php
page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[9:10] %>%
html_table(fill = TRUE)
colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
"Police Station", "Status", "Mobile Type(GSM/CDMA)",
"FIR/DD/GD Dat")
You can scrape the table data by targeting the CSS id of each table. It looks like each page is composed of three different tables pasted one after another. Two of the tables have the #AutoNumber15 CSS id, while the third (the one in the middle) has the #AutoNumber16 CSS id.
I put a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))
# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
  # default link
  link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
  # build link
  link <- paste0(link, page_id)
  # get tables data
  wp <- read_html(link)
  wp %>%
    html_nodes("#AutoNumber16, #AutoNumber15") %>%
    html_table(fill = TRUE) %>%
    bind_rows()
}
# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))
# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the sever is not overloaded with requests.
map_df(iter_page, ~ {
  pb$tick()$print()
  df <- get_page(.x)
  Sys.sleep(sample(10, 1) * 0.1)
  as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
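If you then want one row per record instead of the label/value layout shown above, here is a hedged sketch. raw_tab is a hypothetical name for the saved result of the map_df() call, and the sketch assumes each record starts with a "FIR/DD/GD Number" row and that X2 and X4 hold the values for the labels in X1 and X3.
library(dplyr)
library(tidyr)
tidy_tab <- raw_tab %>%
  filter(!is.na(X1)) %>%                                 # drop the blank separator rows
  mutate(record = cumsum(X1 == "FIR/DD/GD Number")) %>%  # new record at each FIR number row
  { bind_rows(select(., record, field = X1, value = X2),
              select(., record, field = X3, value = X4)) } %>%
  pivot_wider(names_from = field, values_from = value)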

Creating a dataframe from a scraped character vector

I am trying to create a dataframe that has the columns: First name, Last name, Party, State, Member ID. Here is my code
library('rvest')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
candidate_list <- html_text(candidate_nodes)
My main issue is getting the member IDs. An example ID is A000009. When I use the gsub function I lose the leading A in this example, because it gets treated as part of the candidate's last name (Abercrombie), and I do not know how to add the A back onto the member ID. Of course, if there's a better way, I am open to any suggestions.
Since you've got an HTML table, use html_table to extract it to a data.frame. You'll need fill = TRUE, because the table has extra empty rows inserted between each entry, which you can easily drop afterwards with tidyr::drop_na.
library(tidyverse)
library(rvest)
page <- 'https://www.congress.gov/help/field-values/member-bioguide-ids' %>%
  read_html()
members <- page %>%
  html_node('table') %>%
  html_table(fill = TRUE) %>%
  set_names('member', 'bioguide') %>%
  drop_na(member) %>% # remove empty rows inserted in the table
  tbl_df() # for printing
members
#> # A tibble: 2,243 x 2
#> member bioguide
#> * <chr> <chr>
#> 1 Abdnor, James (Republican - South Dakota) A000009
#> 2 Abercrombie, Neil (Democratic - Hawaii) A000014
#> 3 Abourezk, James (Democratic - South Dakota) A000017
#> 4 Abraham, Ralph Lee (Republican - Louisiana) A000374
#> 5 Abraham, Spencer (Republican - Michigan) A000355
#> 6 Abzug, Bella S. (Democratic - New York) A000018
#> 7 Acevedo-Vila, Anibal (Democratic - Puerto Rico) A000359
#> 8 Ackerman, Gary L. (Democratic - New York) A000022
#> 9 Adams, Alma S. (Democratic - North Carolina) A000370
#> 10 Adams, Brock (Democratic - Washington) A000031
#> # ... with 2,233 more rows
The member column could be further extracted, if you like.
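A minimal sketch of that further extraction with tidyr::extract (the regex assumes the "Name (Party - State)" pattern visible in the output above):
library(tidyr)
members %>%
  extract(member,
          into   = c("name", "party", "state"),
          regex  = "^(.*?)\\s*\\((.*?) - (.*)\\)$",
          remove = FALSE)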
There are also many other useful sources for this data, some of which correlate it with other useful variables. This one is well-structured and updated regularly.
Give this a try. I have updated this to include separating out the different fields.
library('rvest')
library('dplyr')
library('tidyr')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
df.candidates <- as.data.frame(html_table(candidate_nodes, header = TRUE, fill = TRUE), stringsAsFactors = FALSE)
df.candidates <- df.candidates[!is.na(df.candidates$Member),]
df.candidates <- df.candidates %>%
  # pull the "(Party - State)" text out of each Member entry, then split it
  mutate(Party.State = gsub("[\\(\\)]", "", regmatches(Member, regexpr("\\(.*?\\)", Member)))) %>%
  separate(Party.State, into = c("Party", "State"), sep = " - ") %>%
  # everything before the opening parenthesis is the name
  mutate(Full.name = trimws(regmatches(Member, regexpr("^[^\\(]+", Member)))) %>%
  separate(Full.name, into = c("Last.Name", "First.Name", "Suffix"), sep = ",", fill = "right") %>%
  select(First.Name, Last.Name, Suffix, Party, State, Member.ID)
This is a bit hackish, but if you want to extract the variables using regex here are a few pointers.
candidate_list <- unlist(candidate_list)
ID <- regmatches(candidate_list,
                 gregexpr("[a-zA-Z]{1}[0-9]{6}", candidate_list))
party_state <- regmatches(candidate_list,
                          gregexpr("(?<=\\()[^)]+(?=\\))", candidate_list, perl = TRUE))
names_etc <- strsplit(candidate_list, "[a-zA-Z]{1}[0-9]{6}")
names <- sapply(names_etc, function(x) sub(" \\([^)]*\\)", "", x))
