I am trying to take the details from a series of locations in a Google Earth kml file.
Getting the ids and coordinates works but for the name of location (which is located in the first table cell (td tag) of the Description), when I do it for ALL the locations, it returns the same value for all of them (Stratford Road - the name of the first location).
library(sf)
library(tidyverse)
library(rvest)
removeHtmlTags <- function(htmlString) {
return(gsub("<.*?>", "", htmlString))
}
getHtmlTableCells<- function(htmlString) {
# Convert html to html doc
htmldoc <- read_html(htmlString)
# get html for each cell (i.e. within <td></td>)
table_cells_with_tags <- html_nodes(htmldoc, "td")
# remove the html tags (<td></td>)
return(removeHtmlTags(table_cells_with_tags))[1]
}
download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
rename(id = Name) %>%
mutate(latitude = st_coordinates(geometry)[,1],
longitiude = st_coordinates(geometry)[,2],
name = getHtmlTableCells(Description)[1]) %>%
st_drop_geometry()
Now if I use the function on a particular location and get the first table cell (td), then it works, returning Stratford Road and Selly Oak for the first as below.
getHtmlTableCells(locations$Description[1])[1]
getHtmlTableCells(locations$Description[2])[1]
What am I doing wrong?
read_html is not vectorised - it does not accept a vector of different html to parse. We can apply your function over each element of the vector:
locations <- st_read("aqms.kml", stringsAsFactors = FALSE)
locations %>%
rename(id = Name) %>%
mutate(latitude = st_coordinates(geometry)[,1],
longitiude = st_coordinates(geometry)[,2],
name = sapply(Description, function(x) getHtmlTableCells(x)[1])) %>%
st_drop_geometry()
#> latitude longitiude name
#> 1 -1.871622 52.45920 Stratford Road
#> 2 -1.934559 52.44513 Selly Oak (Bristol Road)
#> 3 -1.830070 52.43771 Acocks Green
#> 4 -1.898731 52.48180 Colmore Row
#> 5 -1.896764 52.48607 St Chads Queensway
#> 6 -1.891955 52.47990 Moor Street Queensway
#> 7 -1.918173 52.48138 Birmingham Ladywood
#> 8 -1.902121 52.47675 Lower Severn Street
#> 9 -1.786413 52.56815 New Hall
#> 10 -1.874989 52.47609 Birmingham A4540 Roadside
Alternatively, since you're making use of regex anyway within your function, you could make use of stringr::str_extract to extract your text (which is already vectorised).
library(sf)
library(tidyverse)
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
rename(id = Name) %>%
mutate(latitude = st_coordinates(geometry)[,1],
longitiude = st_coordinates(geometry)[,2],
name = str_extract(Description, '(?<=Location</td> <td>)[^<]+')) %>%
st_drop_geometry()
Where (?<=Location</td> <td>) is a lookbehind for the Location td tag that precedes our name, and [^<]+ matches anything up to the next tag following the name.
Your getHtmlTableCells function isn't vectorized. If you pass it a single html string, it works fine, but if you pass it multiple strings it will only process the first. Also, you have put a [1] after the return statement, which doesn't do anything. It needs to be inside the brackets. One you do this, it is easy to vectorize the function using sapply.
So make a tiny change in your function...
getHtmlTableCells <- function(htmlString) {
# Convert html to html doc
htmldoc <- read_html(htmlString)
# get html for each cell (i.e. within <td></td>)
table_cells_with_tags <- html_nodes(htmldoc, "td")
# remove the html tags (<td></td>)
return(removeHtmlTags(table_cells_with_tags)[1])
}
and vectorize it like this:
download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
rename(id = Name) %>%
mutate(latitude = st_coordinates(geometry)[,1],
longitiude = st_coordinates(geometry)[,2],
name = sapply(as.list(Description), getHtmlTableCells)) %>%
st_drop_geometry()
Which gives the correct result:
locations$name
#> [1] "Stratford Road" "Selly Oak (Bristol Road)"
#> [3] "Acocks Green" "Colmore Row"
#> [5] "St Chads Queensway" "Moor Street Queensway"
#> [7] "Birmingham Ladywood" "Lower Severn Street"
#> [9] "New Hall" "Birmingham A4540 Roadside"
Related
I am performing webscraping on a site and have been able to get basic data, but I now need to collect data from a more complicated part of the page.
I am using rvest to pull data from the AAA gas prices website:
https://gasprices.aaa.com/
I am now trying to pull county-level data, which is only displayed on the map (if you hover your cursor over an individual county. I need to get the county gas prices for individual counties in different states. For example, if you click on Maine, to go to the Maine page (https://gasprices.aaa.com/?state=ME), I need to webscrape the price for Aroostook (the northernmost county on the map).
I have been able to use rvest to extract the data for the metro areas (lower on the page), using html_nodes and the node "td". However, the code for the map is more complex. Instead of the simple "td" node, the developer tools (in Chrome) gives <td class="fm-tooltip-comment">$4.928</td on the line with the price ($4.928 is the current price in Aroostook, as of the date of this post). I cannot seem to identify that with the rvest package to extract it.
I have read that the class can be used, or others have proposed using the css code to designate it within rvest, but I am unfamiliar with how to do so. Pulling the metro-area numbers was straightforward, however the county-level prices embedded within the map do not seem as accessible.
Is there a way to extract this county-level data so that I can webscrape in R? And, can this then be repeatable for all the counties/states from which I must select? Do I need the css code, and if so how do I access it/write it properly for rvest to use?
It looks like the information you are looking for is store in the "index.php" file that gets downloaded when the web page loads.
The current link for Maine is "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=89346&ver=5.9.3".
I am not sure what the r=89346 value is for, maybe a timestamp, tracking id, temporary token (to prevent web scraping) etc. I suspect this URL will change thus you may need to use the developer tools on the browser to obtain the current url.
Also, map_id refers to state but I don't know the rational, Florida is 1, NC is 35 and Maine is 21.
Download this file, then extract the JSON data and convert. The data starts with a {"st1": and ends with }}.
library(dplyr)
#read the index_php file and turn it into character string
index_php <-readLines("https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=19770&ver=5.9.3")
index_php <- paste(index_php, collapse = " ")
#extract out the correct JSON data part and convert
jsondata <- stringr::str_extract(index_php, "\\{\"st1\":.+?\\}\\}")
data<-jsonlite::fromJSON(jsondata)
#create a data frame with the results
answer <- bind_rows(data)
id name shortname link comment image color_map color_map_over
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Androscoggin "" "" $4.964 "" #ca3338 #ca3338
2 2 Aroostook "" "" $4.928 "" #dd7a7a #dd7a7a
3 3 Cumberland "" "" $4.944 "" #ca3338 #ca3338
4 4 Franklin "" "" $4.936 "" #dd7a7a #dd7a7a
5 5 Hancock "" "" $4.900 "" #01b5da #01b5da
6 6 Kennebec "" "" $4.955 "" #ca3338 #ca3338
There are some extra columns which need removal, I leave it as an exercise for the reader.
So, you can gather the state info, including state level prices from the initial US page. You can also, from there, gather the urls for each state page. Make a request to each of those pages, and store the returned html. You can then, depending on whether the county data is in a php file, either extract the php file links, request that file and process out the info you want, or, in the case of no php file, extract the necessary data from the html already stored from the state requests.
Below extracts all the prices for all states and counties. There is a state DataFrame and a state with counties DataFrame.
library(tidyverse)
library(rvest)
get_data <- function(state, url) {
# extract county and price data from php files. Pass in state abbreviation and php file URI.
s <- read_html(url) %>%
html_text() %>%
str_match("map_data\\s+:\\s+(.*\\}),") %>%
.[, 2]
return(
tibble(
state = state,
county = s %>% str_match_all(',"name":"(.*?)"') %>% .[[1]] %>% .[, 2],
price = s %>% str_match_all(',"comment":"(.*?)"') %>% .[[1]] %>% .[, 2]
)
)
}
start_url <- "https://gasprices.aaa.com/?state=US"
page <- read_html(start_url)
# get state price info and urls for state pages
data_strings <- page %>%
html_text() %>%
stringr::str_match('placestxt = (".*")') %>%
.[, 2] %>%
str_replace_all('\\"', "") %>%
str_split(";")
df_state <- data.frame(subset(data_strings[[1]], lapply(data_strings, function(x) {
x != ""
})[[1]]) %>% map(., ~ str_split(.x, ",")) %>% unlist(recursive = F)) %>%
transpose() %>%
.[c(1:4)] %>%
set_names("abbr", "state", "price", "url")
state_data <- lapply(df_state$url, read_html)
# find the php file links
df_state$data_url <- lapply(state_data, function(item) {
item %>%
html_element("[src*=js_data]") %>%
html_attr("src")
})
# separate out dataframe according to whether county data is in php file or in previously stored html
no_valid_data_url <- df_state %>% filter(is.na(data_url))
has_valid_data_url <- df_state %>% filter(!is.na(data_url))
# grab the data for states where there are php files with county info
df_state_county <- map2_dfr(has_valid_data_url$state, has_valid_data_url$data_url, get_data)
# add in missing info i.e. # handle cases where data_url is NA e.g. https://gasprices.aaa.com/?state=DC
if (nrow(no_valid_data_url) > 0) {
html_to_use <- state_data[match(no_valid_data_url$abbr, df_state$abbr)]
df_state_county_no_data_url <- map_dfr(html_to_use, function(html) {
state_node <- html %>% html_element(".selected")
state_text <- state_node %>% html_text(trim = T)
return(
data.frame(
state = state_text,
county = state_text,
price = html %>% html_element('td:contains("Current Avg.") + td') %>% html_text()
)
)
})
df_state_county <- rbind(df_state_county, df_state_county_no_data_url)
}
head(df_state, 2)
head(df_state_county, 2)
I scraped a wikipedia table using r
library(rvest)
url <- "https://en.wikipedia.org/wiki/New_York_City"
nyc <- url %>%
read_html() %>%
html_node(xpath = '//*[#id="mw-content-text"]/div/table[1]') %>%
html_table(fill = TRUE)
And want to save the values into a new dataframe.
Output
Area population
468.484 sq mi 8,336,817
What is the best way to do this?
You need to choose which table. From the table select needed columns and rows. Assign column names using setNames and reset rownames by setting them to NULL. I'm sure you want population column as.integer, just use gsub before to clean out the non-digits.
I'm not sure about the html_node line and left it out.
library(rvest)
url <- "https://en.wikipedia.org/wiki/New_York_City"
nyc <- read_html(url)
# nyc <- html_node(nyc, xpath = '//*[#id="mw-content-text"]/div/table[1]')
nyc <- html_table(nyc, header=TRUE, fill = TRUE)
nyc <- `rownames<-`(
setNames(nyc[[3]][-c(1:2, 10), 2:3], c("area", "population")),
NULL)
nyc <- transform(nyc, population=as.integer(gsub("\\D", "", population)))
nyc
# area population
# 1 Bronx 1418207
# 2 Kings 2559903
# 3 New York 1628706
# 4 Queens 2253858
# 5 Richmond 476143
# 6 City of New York 8336817
# 7 State of New York 19453561
Judging from OP's example output they want the table given at a different xpath to that which they provided in the question. Please see the following workflow, note: names have been set manually to save the hassle of formatting the strings from rows:
# Initialise package in session: rvest => .GlobalEnv()
library(rvest)
# Store the url scalar: url => character vector
url <- "https://en.wikipedia.org/wiki/New_York_City"
# Scrape the table and store it memory: nyc => data.frame
nyc <-
url %>%
read_html() %>%
html_node(xpath = '/html/body/div[3]/div[3]/div[4]/div/table[3]') %>%
html_table(fill = TRUE) %>%
data.frame()
# Set the names appropriately: names(nyc) character vector
names(nyc) <- c("borough", "county", "pop_est_2019",
"gdp_bill_usd", "gdp_per_cap",
"land_area_sq_mi", "land_area_sq_km",
"density_pop_sq_mi", "density_pop_sq_km")
# Coerce the vectors to the appropriate type: cleaned => data.frame
cleaned <- data.frame(lapply(nyc[4:nrow(nyc)-1,], function(x){
if(length(grep("\\d+\\,\\d+$|^\\d+\\.\\d+$", x)) > 0){
as.numeric(trimws(gsub("\\,", "", as.character(x)), "both"))
}else{
as.factor(x)
}
}
)
)
I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs (author_reg), which I'm using to scrape affiliation data. However, I have several columns indicating multiple authors (each of which I need the affiliation data for). When there aren't multiple authors, the cell has an NA value. Some of the columns are mostly NA values so how do I alter my code so it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
links = c(paste0(http1, x, ".html"),paste0(http2, x, ".html"))
# here we try both links and store under attempts
attempts = links %>% map(function(i){
try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
})
# the good ones will have "character" class, the failed ones, try-error
gdlink = which(sapply(attempts,class) != "try-error")
if(length(gdlink)>0){
return(attempts[[gdlink[1]]])
}
else{
return("True 404 error")
}
})
Thanks in advance for your help!
As far as I see the target links, you can try the following way. First, you want to scrape all links from https://ideas.repec.org/e/ and create all links. Then, check if each link exists or not. (There are about 26000 links with this URL, and I do not have time to check all. So I just used 100 URLs in the following demonstration.) Extract all existing links.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href") %>%
.[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_sample, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract affiliation information. There are several spots with h3. So I tried to specifically target h3 that stays in xpath containing id = "affiliation". If there is no affiliation information, R returns character(0). When enframe() is applied, these elements are removed. For instance, pab127 does not have any affiliation information, so there is no entry for this link.
lapply(urls, function(x){
read_html(x, encoding = "UTF-8") %>%
html_nodes(xpath = '//*[#id="affiliation"]') %>%
html_nodes("h3") %>%
html_text() %>%
trimws() -> foo
return(foo)}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
A few weeks ago, someone here helped me immensely get a list of all the links in the Notable Names database. i was able to run this code and get the following output
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
html_nodes(".newslink") %>%
html_attrs() %>%
map(pluck(1, 1))
all_ppl_urls <- map(
all_surname_urls,
function(x) read_html(x) %>%
html_nodes("a") %>%
html_attrs() %>%
map(pluck(1, 1))
) %>%
unlist()
all_ppl_urls <- setdiff(
all_ppl_urls[!duplicated(all_ppl_urls)],
c(all_surname_urls, "http://www.nndb.com/")
)
all_ppl_urls[1] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
# [1] "AKA Lee William Aaker"
# [2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
# [3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
# [4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
# ...
My original intention was to get a dataframe where i'd get the name of the person, their gender, race, occupation and nationality into a single dataframe.
A lot of the questions I saw here and on other sites was helpful if your data came in an html table and that's not the case with the notable names database. I know a loop needs to be involved for all 40K sites but after a weekend of searching for answers i can't seems to find out how. Can someone assist?
Edited to add
I tried following some of the rules here but this request was a bit more complex
## I tried to run list <- all_ppl_urls%>% map(read_html) but that was taking a LONG time so I decided to just get the first ten links for the sake of showing my example:
example <- head(all_ppl_urls, 10)
list <- example %>% map(read_html)
test <-list %>% map_df(~{
text_1 <- html_nodes(.x, 'p , b') %>% html_text
and i got this error:
Error:
In addition: Warning message:
closing unused connection 3 (http://www.nndb.com/people/965/000279128/)
Here you have a way to get data looking at each of your html files. This is just an approach that gets some good results...but... you must notice that those gsub functions should be edited in order to get better results. This happens because that list of urls or, lets say, that webpage, is not homogenized in how data are displayed. This is something you have to deal with. For instance, here are just two screenshots where you can find those differences in web presentation:
Anyway, you can manage this adapting this code:
library(purrr)
library(rvest)
[...] #here is your data
all_ppl_urls[100] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
# [3] "Gender: MaleReligion: Eastern OrthodoxRace or Ethnicity: Middle EasternSexual orientation: StraightOccupation: PoliticianParty Affiliation: Republican"
#-----------------------------------------------------------------------------------------------
# NEW WAY
toString(read_html(all_ppl_urls[100])) #get example of how html looks...
#><b>AKA</b> Edmund Spencer Abraham</p>\n<p><b>Born:</b> 12-Jun-1952<br><b>Birthplace:</b> East Lansing, MI<br></p>\n<p><b>Gender:</b> Male<br><b>
#1. remove NA urls (avoid problems later on)
urls <- all_ppl_urls[!is.na(all_ppl_urls)]
length(all_ppl_urls)
length(urls)
#function that creates a list with your data
GetLife <- function (htmlurl) {
htmltext <- toString(read_html(htmlurl))
name <- gsub('^.*AKA</b>\\s*|\\s*</p>\n.*$', '', htmltext)
gender <- gsub('^.*Gender:</b>\\s*|\\s*<br>.*$', '', htmltext)
race <- gsub('^.*Race or Ethnicity:</b>\\s*|\\s*<br>.*$', '', htmltext)
occupation <- gsub('^.*Occupation:</b>\\s*|\\s*<br>.*$|\\s*</a>.*$|\\s*</p>.*$', '', htmltext)
#as occupation seems to have to many hyperlinks we are making another step
occupation <- gsub("<[^>]+>", "",occupation)
nationality <- gsub('^.*Nationality:</b>\\s*|\\s*<br>.*$', '', htmltext)
res <- c(ifelse(nchar(name)>100, NA, name), #function that cleans weird results >100 chars
ifelse(nchar(gender)>100, NA, gender),
ifelse(nchar(race)>100, NA, race),
ifelse(nchar(occupation)>100, NA, occupation),
ifelse(nchar(nationality)>100, NA, nationality),
htmlurl)
return(res)
}
emptydf <- data.frame(matrix(ncol=6, nrow=0)) #creaty empty data frame
colnames(emptydf) <- c("name","gender","race","occupation","nationality","url") #set names in empty data frame
urls <- urls[2020:2030] #sample some of the urls
for (i in 1:length(urls)){
emptydf[i,] <- GetLife(urls[i])
}
emptydf
Here is an example of those 10 urls analized:
name gender race occupation nationality url
1 <NA> Male White Business United States http://www.nndb.com/people/214/000128827/
2 Mark Alexander Ballas, Jr. Male White Dancer United States http://www.nndb.com/people/162/000346121/
3 Thomas Cass Ballenger Male White Politician United States http://www.nndb.com/people/354/000032258/
4 Severiano Ballesteros Sota Male Hispanic Golf Spain http://www.nndb.com/people/778/000116430/
5 Richard Achilles Ballinger Male White Government United States http://www.nndb.com/people/511/000168007/
6 Steven Anthony Ballmer Male White Business United States http://www.nndb.com/people/644/000022578/
7 Edward Michael Balls Male White Politician England http://www.nndb.com/people/846/000141423/
8 <NA> Male White Judge United States http://www.nndb.com/people/533/000168029/
9 <NA> Male Asian Engineer England http://www.nndb.com/people/100/000123728/
10 Michael A. Balmuth Male White Business United States http://www.nndb.com/people/635/000175110/
11 Aristotle N. Balogh Male White Business United States http://www.nndb.com/people/311/000172792/
Update
Included an error routine for profiles which could not be parsed properly. If there is any error you will get an NA row (even if some info could be parsed properly - this is due to the fact that we read all fields at once and we are relying that all fields could be read).
Maybe you want to further develop that code to return partial information? You could do this by reading the fields one after another (instead of once) and if there is an error return NA for this field and not the entire row. This has the downside, however, that the code need to parse the doc not only once per profile but several times.
Here's a function which relies on Xpath to select the relevant fields:
library(rvest)
library(glue)
library(tibble)
library(dplyr)
library(purrr)
scrape_profile <- function(url) {
fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")
filter <- glue("contains(text(), '{fields}')") %>%
paste0(collapse = " or ")
xp_string <- glue("//b[{filter}]/following::text()[normalize-space()!=''][1]")
tryCatch({
doc <- read_html(url)
name <- doc %>%
html_node(xpath = "(//b/text())[1]") %>%
as.character()
doc %>%
html_nodes(xpath = xp_string) %>%
as.character() %>%
gsub("^\\s|\\s$", "", .) %>%
as.list() %>%
setNames(c("Gender", "Race", "Occupation", "Nationality")) %>%
as_tibble() %>%
mutate(Name = name) %>%
select(Name, everything())
}, error = function(err) {
message(glue("Profile <{url}> could not be parsed properly."))
tibble(Name = ifelse(exists("name"), name, NA), Gender = NA,
Race = NA, Occupation = NA,
Nationality = NA)
})
}
All you have to do now is to apply scrape_profile to all of your profile urls:
map_dfr(all_ppl_urls[1:5], scrape_profile)
# # A tibble: 5 x 5
# Name Gender Race Occupation Nationality
# <chr> <chr> <chr> <chr> <chr>
# 1 Lee Aaker Male White Actor United States
# 2 Aaliyah Female Black Singer United States
# 3 Alvar Aalto Male White Architect Finland
# 4 Willie Aames Male White Actor United States
# 5 Kjetil André Aamodt Male White Skier Norway
Explanation
Identify Structure of Website: When looking at the source code of the profile site, you see that all relevant information but the name follows a label in bold (i.e. <b> tags), sometimes there is also a link tag (<a>).
Construct selector: With this information we now can construct either a css or an XPath selector. However, since we want to select text nodes, XPath seems to be the only(?) option: //b[contains(text(), "Gender:")]/following::text()[normalize-space()!=' '][1] selects
the first non empty text node ::text()[normalize-space()!=' '][1] which is
a sibling (/following) of
a <b> tag (//b) which
contains the text Gender: ([contains(text(), "Gender:")])
Multiple Select: since all tags are built in the same way, we can construct an Xpath which matches more than one element avoiding explicit loops. This we do by pasting several contains(.) statements together separated by or
Further Formatting: Finally we remove whitespaces and return it in a tibble
Name Field: Last step is to extract the name, which is basically the first bold (<b>) text
Good morning,
I'm new to scraping with R, and I'm having a hard time to scrape a list of elements from a webpage in a useful manner.
This is my script
library(rvest)
url <- read_html("https://www.pole-emploi.fr/annuaire/provins-77070")
webpage <- url %>%
html_nodes('.zone') %>%
html_text()
webpage
When I run the script all the elements appear squeezed together without any whitespace between, which is comprehensible since each item is enclosed in a single tag.
[1] "77114GouaixHerméNoyen-sur-SeineVilliers-sur-Seine"
[2] "77118BalloyBazoches-lès-BrayGravon"
I would like to have them either like this (or separated by commas)
[1] "77114 Gouaix Hermé Noyen-sur-Seine Villiers-sur-Seine"
[2] "77118 Balloy Bazoches-lès-Bray Gravon"
Or even better on a tidy format
Postal City
77114 Gouaix
77114 Hermé
77114 Noyen-sur-Seine
77114 Villiers-sur-Seine
I have tried to find other selector or Xpaths in the page without success. The most I have got is to select one single element of the list.
Any help would be greatly apprecaited.
Thanks in advance.
Each list element looks like this (truncated for brevity):
<li class="zone">\n<span class="code-postal">77114</span><ul>\n<li>Gouaix</li>\n<li>Hermé</li>\n ...
So, each one has a set of child nodes that look uniform. We can target the <span> and the <li> elements in the nested <ul> to get what you want:
library(rvest)
library(tidyverse)
pg <- read_html("https://www.pole-emploi.fr/annuaire/provins-77070")
html_nodes(pg, ".zone") %>%
map_df(~{
data_frame(
postal = html_node(.x, "span") %>% html_text(trim=TRUE),
city = html_nodes(.x, "ul > li") %>% html_text(trim=TRUE)
)
})
## # A tibble: 95 x 2
## postal city
## <chr> <chr>
## 1 77114 Gouaix
## 2 77114 Hermé
## 3 77114 Noyen-sur-Seine
## 4 77114 Villiers-sur-Seine
## 5 77118 Balloy
## 6 77118 Bazoches-lès-Bray
## 7 77118 Gravon
## 8 77126 Châtenay-sur-Seine
## 9 77126 Égligny
## 10 77134 Les Ormes-sur-Voulzie
## # ... with 85 more rows
the tidyverse method with explicit anonymous function (vs .x via formula function):
html_nodes(pg, ".zone") %>%
map_df(function(x) {
data_frame(
postal = html_node(x, "span") %>% html_text(trim=TRUE),
city = html_nodes(x, "ul > li") %>% html_text(trim=TRUE)
)
})
and, a pure base R version:
elements <- html_nodes(pg, ".zone")
lapply(elements, function(x) {
data.frame(
postal = html_text(html_node(x, "span"), trim=TRUE),
city = html_text(html_nodes(x, "ul > li"), trim=TRUE),
stringsAsFactors = FALSE
)
}) -> tmp
Reduce(rbind.data.frame, tmp)
# or
do.call(rbind.data.frame, tmp)