Data Scraping a Table with Multiple pages - r

I am currently trying to retrieve a table from the CDC website however (https://www.cdc.gov/obesity/data/prevalence-maps.html#states) the table in question has multiple pages that must be scrolled through and I am having difficulty retrieving it and putting it into RStudio. I have tried to utilize the possibly() function from purrr but no luck. Any help is appreciated.
library(rvest)
library(dplyr)
library(purrr)
link <- "https://www.cdc.gov/obesity/data/prevalence-maps.html"
xpaths <- paste0('//*[#id="DataTables_Table_0', 1:9, '"]/table[2]')
scrape_table <- function(link, xpath){
link %>%
read_html() %>%
html_nodes(xpath = xpath) %>%
html_table() %>%
flatten_df %>%
setNames(c("State", "Prevalence", "95 CI"))
}
scrape_table_possibly <- possibly(scrape_table, otherwise = NULL)
scraped_tables <- map(xpaths, ~ scrape_table_possibly(link = link, xpath = .x))

The page source doesn't contain the data but gets external data by JS so you can't scrape it by rvest anyway. The table you want is from this file: https://www.cdc.gov/obesity/data/maps/2019-overall.csv
Edit: I scrolled down and saw other tables:
https://www.cdc.gov/obesity/data/maps/2019-white.csv
https://www.cdc.gov/obesity/data/maps/2019-hispanic.csv
https://www.cdc.gov/obesity/data/maps/2019-black.csv

Related

Rvest Pulls Empty Tables

The site I use to scrape data has changed and I'm having issues pulling the data into table format. I used two different types of codes below trying to get the tables, but it is returning blanks instead of tables.
I'm a novice in regards to scraping and would appreciate the expertise of the group. Should I look for other solutions in rvest, or try to learn a program like rSelenium?
https://www.pgatour.com/stats/detail/02675
Scrape for Multiple Links
library("dplyr")
library("purr")
library("rvest")
df23 <- expand.grid(
stat_id = c("02568","02674", "02567", "02564", "101")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/detail/',
stat_id
)
) %>%
as_tibble()
#replaced tournament_id with stat_id
get_info <- function(link, stat_id){
data <- link %>%
read_html() %>%
html_table() %>%
.[[2]]
}
test_main_stats <- df23 %>%
mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_main_stats <- test_main_stats %>%
unnest(everything())
Alternative Code
url <- read_html("https://www.pgatour.com/stats/detail/02568")
test1 <- url %>%
html_nodes(".css-8atqhb") %>%
html_table
This page uses javascript to create the table, so rvest will not directly work. But if one examines the page's source code, all of the data is stored in JSON format in a "<script>" node.
This code finds that node and converts from JSON to a list. The variable is the main table but there is a wealth of other information contained in the JSON data struture.
#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
#find the script with the correct id tage, strip the html code
datascript <- page %>% html_elements(xpath = ".//script[#id='__NEXT_DATA__']") %>% html_text()
#convert from JSON
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)
#get the main table
answer <-output$props$pageProps$statDetails$rows

How to scrape a table created using datawrapper using rvest?

I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code i have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But, i get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the datawrapper page to extract Table 1.
I will be really grateful if someone can point out the mistake i am making, or suggest a solution to scrape this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the csv download url from the iframe page then request that csv
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]
csv_url <- paste(iframe,id, 'dataset.csv', sep = '/' )
data <- vroom(csv_url, show_col_types = FALSE)

Launch web browser and copy information contained R

I'm trying to find a way to copy-paste the title and the abstract from a PubMed page.
I started using
browseURL("https://pubmed.ncbi.nlm.nih.gov/19592249") ## final numbers are the PMID
now I can't find a way to obtain the title and the abstract in a txt way. I have to do it for multiple PMID so I need to automatize it. It can be useful also just copying everything is on that page and after I can take only what I need.
Is it possible to do that? thanks!
I suppose what you're trying to do is scrape PubMed for articles of interest?
Here's one way to do this using the rvest package:
#Required libraries.
library(magrittr)
library(rvest)
#Function.
getpubmed <- function(url){
dat <- rvest::read_html(url)
pid <- dat %>% html_elements(xpath = '//*[#title="PubMed ID"]') %>% html_text2() %>% unique()
ptitle <- dat %>% html_elements(xpath = '//*[#class="heading-title"]') %>% html_text2() %>% unique()
pabs <- dat %>% html_elements(xpath = '//*[#id="enc-abstract"]') %>% html_text2()
return(data.frame(pubmed_id = pid, title = ptitle, abs = pabs, stringsAsFactors = FALSE))
}
#Test run.
urls <- c("https://pubmed.ncbi.nlm.nih.gov/19592249", "https://pubmed.ncbi.nlm.nih.gov/22281223/")
df <- do.call("rbind", lapply(urls, getpubmed))
The code should be fairly self-explanatory. (I've not added the contents of df here for brevity.) The function getpubmed does no error-handling or anything of that sort, but it is a start. By supplying a vector of URLs to the do.call("rbind", lapply(urls, getpubmed)) construct, you can get back a data.frame consisting of the PubMed ID, title, and abstract as columns.
Another option would be to explore the easyPubMed package.
I would also use a function and rvest. However, I would go with a passing the pid in as the argument function, using html_node as only a single node is needed to be matched, and use faster css selectors. String cleaning is done via stringr package:
library(rvest)
library(stringr)
library(dplyr)
get_abstract <- function(pid){
page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))
df <-tibble(
title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
)
return(df)
}
get_abstract('19592249')

RVEST package seems to collect data in random order

I have the following question.
I am trying to harvest data from the Booking website (for me only, in order to learn the functionality of the rvest package). Everything's good and fine, the package seems to collect what I want and to put everything in the table (dataframe).
Here's my code:
library(rvest)
library(lubridate)
library(tidyverse)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
so in this chunk I collect the data from the first 60 pages after first manually feeding the Booking search engine with the country of my choise (Spain), the dates I am interested in (just some arbitrary interval) and the number of people (I used defaults here).
Then, I add this code to select the properties I want:
read_hotel <- function(url){ # collecting hotel names
ho <- read_html(url)
headline <- ho %>%
html_nodes("span.sr-hotel__name") %>% # the node I want to read
html_text() %>%
as_tibble()
}
hotels <- map_dfr(page_booking, read_hotel)
read_pr <- function(url){ # collecting price tags
pr <- read_html(url)
full_pr <- pr %>%
html_nodes("div.bui-price-display__value") %>% #the node I want to read
html_text() %>%
as_tibble()
}
fullprice <- map_dfr(page_booking, read_pr)
... and eventually save the whole data in the dataframe:
dfr <- tibble(hotels = hotels,
price_fact = fullprice)
I collect more parameters but this doesn't matter. The final dataframe of 1500 rows and two columns is then created. But the problem is the data within the second column does not correspond to the data in the first one. Which is really strange and renders my dataframe to be useless.
I don't really understand how the package works in the background and why does it behaves that way. I also paid attention the first rows in the first column of the dataframe (hotel name) do not correspond to the first hotels I see on the website. So it seems to be a different search/sort/filter criteria the rvest package uses.
Could you please explain me the processes take place during the rvest node hoping?
I would really appreciate at least some explanation, just to better understand the tool we work with.
You shouldn't scrape hotels' name and price separately like that. What you should do is get all nodes of items (hotels), then scrape the name and price relatively of each hotel. With this method, you can't mess up the order.
library(rvest)
library(purrr)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
hotels <-
map_dfr(
page_booking,
function(url) {
pg <- read_html(url)
items <- pg %>%
html_nodes(".sr_item")
map_dfr(
items,
function(item) {
data.frame(
hotel = item %>% html_node(xpath = "./descendant::*[contains(#class,'sr-hotel__name')]") %>% html_text(trim = T),
price = item %>% html_node(xpath = "./descendant::*[contains(#class,'bui-price-display__value')]") %>% html_text(trim = T)
)
}
)
}
)
(The dots start the XPath syntaxes present the current node which is the hotel item.)
Update:
Update the code that I think faster but still does the job:
hotels <-
map_dfr(
page_booking,
function(url) {
pg <- read_html(url)
items <- pg %>%
html_nodes(".sr_item")
data.frame(
hotel = items %>% html_node(xpath = "./descendant::*[contains(#class,'sr-hotel__name')]") %>% html_text(trim = T),
price = items %>% html_node(xpath = "./descendant::*[contains(#class,'bui-price-display__value')]") %>% html_text(trim = T)
)
}
)

R - Web Scrape of job board

I am trying to get a list of Companies and jobs in a table from indeed.com's job board.
I am using the rvest package using a URL Base of http://www.indeed.com/jobs?q=proprietary+trader&
install.packages("gtools")
install.packages('rvest")
library(rvest)
library(gtools)
mydata = read.csv("setup.csv", header=TRUE)
url_base <- "http://www.indeed.com/jobs?q=proprietary+trader&"
names <- mydata$Page
results<-data.frame()
for (name in names){
url <-paste0(url_base,name)
title.results <- url %>%
html() %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- url %>%
html() %>%
html_nodes(".company") %>%
html_text()
results <- smartbind(company.results, title.results)
results3<-data.frame(company=company.results, title=title.results)
}
new <- results(Company=company, Title=title)
and then looping a contatenation. For some reason it is not grabbing all of the jobs and mixing the companies and jobs.
It might be because you make two separate requests to the page. You should change the middle part of your code to:
page <- url %>%
html()
title.results <- page %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- page %>%
html_nodes(".company") %>%
html_text()
When I do that, it seems to give me 10 jobs and companies which match. Can you give an example otherwise of a query URL that doesn't work?

Resources