I want to extract the 20th table from a Wikipedia page https://en.wikipedia.org/wiki/...
I now use this code, but it only extracts the first heading table.
the_url <- "https://en.wikipedia.org/wiki/..."
tb <- the_url %>% read_html() %>%
html_node("table") %>%
html_table(fill = TRUE)
What should I do to get the specific one? Thank you!!
Instead of indexing where table position could move, you could anchor according to relationship to element with id prize_money. Return just a single node for efficiency. Avoid longer xpaths as they can be fragile.
library(rvest)
table <- read_html('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money') %>%
html_node(xpath = "//*[#id='Prize_money']/parent::h4/following-sibling::table[1]") %>%
html_table(fill = T)
since you have a specific table you want to scrape you can identify in in the html_node() call by using the xpath of the webpage element:
library(dplyr)
library(rvest)
the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"
the_url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
html_table(fill=TRUE)
Try this code.
library(rvest)
webpage <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup")
tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[3:4] %>%
html_table(fill = TRUE)
str(tbls_ls)
Related
I'm trying to scrape "1,335,000" from the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R.
t2<-read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//*[contains(#class, 'info__value--2AHH7')]") %>%
rvest::html_text()
However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?
As Dave2e pointed the page uses javascript, thus can't make use of rvest.
url = "https://fortune.com/company/amazon-com/fortune500/"
#launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[#id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>%
html_text()
[1] "1,335,000"
Data is loaded dynamically from a script tag. No need for expense of a browser. You could either extract the entire JavaScript object within the script, pass to jsonlite to handle as JSON, then extract what you want, or, if just after the employee count, regex that out from the response text.
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)
page <- read_html('https://fortune.com/company/amazon-com/fortune500/')
data <- page %>% html_element('#preload') %>% html_text() %>%
stringr::str_match(. , "PRELOADED_STATE__ = (.*);") %>% .[, 2] %>% jsonlite::parse_json()
print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)
#shorter version
print(page %>%html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[,2] %>% as.integer() %>% format(big.mark=","))
I am using Rvest to scrape google news.
However, I encounter missing values in element "Time" from time to time on different keywords. Since the values are missing, it will end up having "different number of rows error" for the data frame of scraping result.
Is there anyway to fill-in NA for these missing values?
Below is the example of the code I am using.
html_dat <- read_html(paste0("https://news.google.com/search?q=",Search,"&hl=en-US&gl=US&ceid=US%3Aen"))
dat <- data.frame(Link = html_dat %>%
html_nodes('.VDXfz') %>%
html_attr('href')) %>%
mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
news_dat <- data.frame(
Title = html_dat %>%
html_nodes('.DY5T1d') %>%
html_text(),
Link = dat$Link,
Description = html_dat %>%
html_nodes('.Rai5ob') %>%
html_text(),
Time = html_dat %>%
html_nodes('.WW6dff') %>%
html_text()
)
Without knowing the exact page you were looking at I tried the first Google news page.
In the Rvest page, html_node (without the s) will always return a value even it is NA. Therefore in order to keep the vectors the same length, one needed to find the common parent node for all of the desired data nodes. Then parse the desired information from each one of those nodes.
Assuming the Title node is most complete, go up 1 level with xml_parent() and attempt to retrieving the same number of description nodes, this didn't work. Then tried 2 levels up using xml_parent() %>% xml_parent(), this seems to work.
library(rvest)
url <-"https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)
Title = html_dat %>% html_nodes('.DY5T1d') %>% html_text()
# Link = dat$Link
Link = html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/",Link)
#Find the common parent node
#(this was trial and error) Tried the parent then the grandparent
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description = Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time = Titlenodes %>% html_node('.WW6dff') %>% html_text()
answer <- data.frame(Title, Time, Description, Link)
I try to scape the prices, area and addresses from all flats of this homepage (https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden)
Getting the data for one list element with Rvest and xpath works fine (see code), but I donĀ“t know how to get the ID of each list element to loop through all elements.
Here is a part of the html-code with the data-go-to-expose-id I need for the loop. How can I get all IDs?
<span class="slick-bg-layer"></span><img alt="Immobilienbild" class="gallery__image block height-full" src="https://pictures.immobilienscout24.de/listings/541dfd45-c75a-4da7-a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80">a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80"></a>
And here is my current R-code to fetch the data from one list element:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
address <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[#id="result-103049161"]/div[2]/div[2]/div[1]/div[2]/div[2]/a') %>% html_text()
price <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[#id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[1]/dd') %>% html_text()
area <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[#id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[2]/dd') %>% html_text()
Does this get what you are after
library("tidyverse")
library("httr")
library("rvest")
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
x <- read_html(url)
x %>%
html_nodes("#listings") %>%
html_nodes(".result-list__listing") %>%
html_attr("data-id")
I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal xpath approach (Python) or Rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links{
keep = substr(x, 10, 36)
url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
webpage2 = read_html(url2)
home_team = webpage2 %>% html_nodes(xpath='//*[#id="all_home_starters"]') %>% html_text()
away_team = webpage2 %>% html_nodes(xpath='//*[#id="all_vis_starters"]') %>% html_text()
home_starters = webpage2 %>% html_nodes(xpath='//*[(#id="div_home_starters")]') %>% html_text()
home_starters2 = webpage2 %>% html_nodes(xpath='//*[(#id="div_home_starters")]') %>% html_table()
#code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally it should contain the table or elements of the table I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it shoudl be done. This is given my example but I'm sure you could apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
I used this script to extract the text from a webpage
url <- "http://www.dlink.com/it/it"
doc <- getURL(url)
#get the text from the body
html <- htmlTreeParse(doc, useInternal = TRUE)
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
txt<-toString(txt)
but the problem is that it takes just the words in the first page, how can I extend it to the whole website?
I'd go with rvest to scrape the links and purrr to iterate:
library(rvest)
library(purrr)
url <- "http://www.dlink.com/it/it"
r <- read_html(url) %>%
html_nodes('a') %>%
html_attr('href') %>%
Filter(function(f) !is.na(f) & !grepl(x = f, pattern = '#|facebook|linkedin|twitter|youtube'), .) %>%
map(~{
print(.x)
html_session(url) %>%
jump_to(.x) %>%
read_html() %>%
html_nodes('body') %>%
html_text() %>%
toString()
})
I filtered out social nets and dead links from the list of links, some more tuning might be in order.
Be advised that you will be scraping a lot of garbage. Some more targeting on what to scrape inside each page might be needed (ie: something more specfic than the whole body tag)