New to web scraping.
I am trying to scrape a site. I recently learnt how to get information from tables, but I want to know how to get the table name. (I believe "table name" might be the wrong term here, but bear with me.)
E.g. https://www.msc.com/che/about-us/our-fleet?page=1
MSC is a shipping firm, and I need to get the list of their fleet and information on each ship.
I have written the following code that will retrieve the table data for each ship.
library(rvest)

# Read the i-th URL and extract every table on the page
df <- MSCwp[i, 1] %>%
  read_html() %>%
  html_table()
MSCwp is the list of URLs. This code gets me all the information I need about the ships listed on the webpage except their names.
Is there any way to retrieve the name along with the table?
E.g. df for the above-mentioned website will return 10 tables (corresponding to the ships on the webpage). df[1] will have information about the ship Agamemnon, but I am not sure how to retrieve the ship name along with the table.
You need to pull the names out from the main page.
library(rvest)
library(dplyr)
url <- "https://www.msc.com/che/about-us/our-fleet?page=1"
page <- read_html(url)
names <- page %>% html_elements("dd a") %>% html_text()
names
[1] "AGAMEMNON" "AGIOS DIMITRIOS" "ALABAMA" "ALLEGRO" "AMALTHEA" "AMERICA" "ANASTASIA"
[8] "ANTWERP TRADER" "ARCHIMIDIS" "ARIES"
In this case I am looking for the text in the "a" child node of the "dd" nodes.
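If the names and the tables come back in the same page order (one table per ship, which appears to be the case here but is worth verifying), you can attach the names to the tables directly. A minimal sketch, assuming that ordering holds:

library(rvest)

page <- read_html("https://www.msc.com/che/about-us/our-fleet?page=1")

# Ship names from the listing, in page order
ship_names <- page %>% html_elements("dd a") %>% html_text()

# All tables on the page, assumed to be one per ship in the same order
tables <- page %>% html_table()

# Label each table with its ship's name
names(tables) <- ship_names
tables[["AGAMEMNON"]]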
I'm new to web scraping, so I may not be doing all the proper checks here. I'm attempting to scrape information from a URL, but I'm not able to extract the nodes I need. See the sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen), which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in your browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
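A quick programmatic version of that same check (just a sketch, searching for part of the product name from the question):

library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"

# TRUE means the name is present in the static (non-rendered) HTML,
# so it is reachable without executing JavaScript
raw <- as.character(read_html(url))
grepl("Madone SLR 9", raw, fixed = TRUE)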
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, as the code gives clear indications as to what is being selected.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)

# Product name is stored in a "product-name" attribute
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")

# Spec tables are marked with the class "sprocket__table spec"
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()

# Price is held in the value attribute of the element with this id
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
I'm attempting to scrape tables from this page:
https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds
I'm trying to gather the info under "player","over", and "under"
so the first row would be Joe Flacco 1.5 +140 1.5 -190 (these numbers change, so they might be different when you're reading this)
As an example, here is code I used on the same website but for a different table/link:
library(rvest)

url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds'

test <- url %>%
  read_html() %>%
  html_nodes('.default-color , .sportsbook-outcome-cell__line , .sportsbook-row-name') %>%
  html_text()
This code gives me the exact data that I want.
Note that this working code is for a separate page: https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds
I used the SelectorGadget extension to ascertain the selector value.
The 2 different pages I'm looking at are accessible from the header above "Bal Ravens".
Pass Yds is the default table selection for the page; Pass TDs is next to it, which gets you to the page I'm also attempting to scrape.
For some reason, scraping the Pass TDs table using the same method as the Pass Yds table leaves me with an empty string:
url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds'

test <- url %>%
  read_html() %>%
  html_nodes('.sportsbook-row-name , .default-color , .sportsbook-outcome-cell__line') %>%
  html_text()
Note that when using SelectorGadget for this page, it gives me a different html_nodes selector.
I have also tried using XPath and finding the individual tables (with html_table) on the Inspect page. Again, this process works with the Pass Yds page, but not the Pass TDs page. I assume this problem relates to the fact that the table on the website is variable, with Pass Yds being the default.
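One way to test that assumption (a sketch; count_nodes is just a helper name for illustration) is to count how many matching nodes each page's static HTML actually contains:

library(rvest)

count_nodes <- function(url, selector) {
  url %>%
    read_html() %>%
    html_nodes(selector) %>%
    length()
}

# A count of 0 would mean the Pass TDs rows are not in the static HTML
# that read_html() receives, i.e. they are rendered by JavaScript
count_nodes('https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds', '.sportsbook-row-name')
count_nodes('https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds', '.sportsbook-row-name')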
If anyone could help me with this, or point me in the direction to information regarding scraping these menu-selectable tables, I would greatly appreciate it.
Thanks!
I have a dataset that contains URLs of profiles of politicians in the German parliament. Many of these profiles also include links to the politicians' Twitter pages. I want to create a loop, or ideally use purrr::map(), that scrapes the Twitter links, adding them as a new column to the original dataset.
I've found several code examples that do something similar:
Scraping of Multi-page websites
Scraper.R scraping politicians twitter profiles
Scraping li elements with Rvest
But I cannot get them to run with my own data.
Taking a small sample of the URLs from the dataframe, I've converted them to a vector, which looks like this:
> links
[1] "https://www.abgeordnetenwatch.de/profile/julia-obermeier"
[2] "https://www.abgeordnetenwatch.de/profile/anja-weisgerber"
[3] "https://www.abgeordnetenwatch.de/profile/klaus-ernst"
[4] "https://www.abgeordnetenwatch.de/profile/astrid-freudenstein"
Not all of the URLs have links to the politician's Twitter profile; missing links would ideally be returned as NA.
This is my attempt:
library(rvest)
library(purrr)

pages <- links %>% map(read_html)
The result is a list of 4. Next, to get a dataframe containing the Twitter links and the politicians' names (so I can merge them with the original dataset), I try the following code:
pages %>% map_df(~{
  data_frame(name = html_node(pages, "h1") %>% html_text(trim = TRUE) %>%
               html_node(pages, "href") %>% html_text(trim = TRUE))})
#Error in UseMethod("xml_find_first") :
#no applicable method for 'xml_find_first' applied to an object of class "list"
I think the issue is the HTML being in a list rather than a vector, but I haven't found a way to convert it while maintaining the original intent of returning a dataframe.
I also know from doing it step by step on one URL that the code scrapes numerous links when I only want the URL of the politician's Twitter profile, but I think that's something that can easily be fixed once I have the dataframe.
An overall more refined solution would be much appreciated.
Really any help would be appreciated.
You can't apply html_node() on an html_text() result, and inside map_df() you need to work on each page (.x) rather than passing the whole pages list.
This works :
pages %>% map_df(~{
  data_frame(name = html_node(.x, "h1") %>% html_text(trim = TRUE))
})
# A tibble: 4 × 1
name
<chr>
1 Julia Obermeier
2 Anja Weisgerber
3 Klaus Ernst
4 Astrid Freudenstein
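If you also want the Twitter link, with NA where a profile has none, the same pattern extends. A sketch, assuming the Twitter link on these profile pages is an ordinary anchor whose href contains twitter.com:

library(rvest)
library(purrr)
library(dplyr)

pages %>% map_df(~{
  tibble(
    name = html_node(.x, "h1") %>% html_text(trim = TRUE),
    # html_node() returns a missing node when nothing matches,
    # so html_attr() naturally yields NA for profiles without Twitter
    twitter = html_node(.x, "a[href*='twitter.com']") %>% html_attr("href")
  )
})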
I have a list of names (first name, last name, and date of birth) that I need to search for on the Fulton County, Georgia (USA) Jail website to determine whether each person is in jail or has been released.
The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400
The site requires you to enter a last name and first name, and then it gives you a list of results.
I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow, and I am using SelectorGadget to help figure out the CSS tags.
Here is the code I have so far. Right now I can't figure out which html_node to use.
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session(fc.url)
# Grab initial form
form.unfilled <- jail %>% html_node("form")
form.unfilled
The result I get from form.unfilled is {xml_missing} <NA> which I know isn't right.
I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.
Thanks.
It appears that on the initial call the webpage opens onto "http://justice.fultoncountyga.gov/PAJailManager/default.aspx". Once the session is started, you should be able to jump to the search page:
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
Note: Verify that your actions are within the terms of service for the website. Many sites do have policy against scraping.
The website relies heavily on JavaScript to render itself. When opening the link you provided in a fresh browser instance, you get redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit of JavaScript to send you to the page with the form.
rvest is unable to execute arbitrary JavaScript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example, Firefox or Chrome), which executes the JavaScript as intended.
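A minimal sketch of that route (assuming a local Selenium setup; the browser and port are arbitrary choices):

library(RSelenium)
library(rvest)

# Start a local browser driver
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client

# Navigate and let the page's JavaScript run
remDr$navigate("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")

# Hand the rendered HTML over to rvest
page <- remDr$getPageSource()[[1]] %>% read_html()

remDr$close()
rD$server$stop()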
Thanks to Dave2e.
Here is the code that works. This question is answered (but I'll post another one because I'm not getting a table of data as a result).
Note: I cannot find any Terms of Service on this site that I'm querying
library(rvest)
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()
form.unfilled
#name values
lname <- "DOE"
fname <- "JOHN"
# Fill the form with name values
form.filled <- form.unfilled %>%
  set_values("LastName" = lname,
             "FirstName" = fname)
#Submit form
r <- submit_form(jail2, form.filled,
submit = "SearchSubmit")
#grab tables from submitted form
table <- r %>% html_nodes("table")
#grab a table with some data
table[[5]] %>% html_table()
# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."
I am doing research on World Bank (WB) projects in developing countries.
To do so, I am scraping their website in order to collect the data I am interested in.
The structure of the webpage I want to scrape is the following:
1. List of countries: the list of all countries in which the WB has developed projects.
1.1. By clicking on a single country in 1., one gets that country's project list (which spans many webpages); it includes all the projects in that country. Of course, I have linked just one page for a single country here, but every country has a number of pages dedicated to this.
1.1.1. By clicking on a single project in 1.1., one gets, among other things, the project's overview, which is what I am interested in.
In other words, my problem is to find a way to create a dataframe including all the countries, a complete list of all projects for each country, and an overview of every single project.
So far, this is the code that I have (unsuccessfully) written:
library(rvest)
library(purrr)
library(dplyr)

WB_links <- "http://projects.worldbank.org/country?lang=en&page=projects"

WB_proj <- function(x) {
  Sys.sleep(5)
  url <- sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", x)
  html <- read_html(url)
  tibble(title = html_nodes(html, ".grid_20") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".grid_20") %>% html_attr("href"))
}

WB_scrape <- map_df(1:5, WB_proj) %>%
  mutate(study_description =
           map(project_url,
               ~ read_html(sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", .x)) %>%
                 html_node() %>%  # note: html_node() is missing a selector here
                 html_text()))
Any suggestion?
Note: I am sorry if this question seems trivial, but I am quite a newbie in R and I haven't found help on this by looking around (though I could have missed something, of course).