Question
I wanted to use rvest to scrape specific parts of a website (a car sales platform).
The CSS is frankly too confusing for me to figure out what's wrong on my own.
#### scraping the website www.otomoto.pl with used cars #####
library(rvest)
library(stringr)

baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page="
i <- 1
for ( i in 1:7000 )
{
  link = paste0(baseURL_otomoto, i)
  out = read_html(link)
  print(i)
  print(link)
  ### building year
  build_year = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  mileage = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  volume = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  fuel_type = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
}
Any guesses what might be the reason for this behaviour?
PS#1.
How can I scrape all the build_year, mileage and fuel_type data from the offers available on the analysed website at once, as a data.frame? Using classes (xpath = '//div[@class=...') didn't work in my case.
PS#2.
I wanted to scrape details of the actual offers using, for instance:
gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>%
  html_text() %>%
  str_replace_all("\n","") %>%
  str_replace_all("\r","") %>%
  str_trim()
where the index in ul[a] runs over a in 1:2 and the index in li[b] runs over b in 1:12.
Unfortunately this concept fails, as the resulting data frame is empty. Any guesses why?
First and foremost, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all, a mere two weeks later). For example, instead of:
html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
  html_text()
you can write:
html_nodes(out, css="[data-code=year]") %>% html_text()
Second, read the documentation of the libraries you use. The pattern argument of str_replace_all may be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n","") %>% str_replace_all("\r","")). html_text can do the trimming for you, which means that str_trim() is not needed at all.
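For example (a sketch reusing the attribute selector shown above), the whole build_year pipeline collapses to:
# trim = TRUE handles leading/trailing whitespace; the regex removes any interior line breaks
build_year <- html_nodes(out, css = "[data-code=year]") %>%
  html_text(trim = TRUE) %>%
  str_replace_all("[\n\r]", "")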
Third, if you find yourself copy-pasting code, step back and ask whether a function wouldn't be a better solution; it usually would. In your case I would personally skip the str_replace_all calls until the data-cleaning step, and then call them once on the data.frame holding the entire scraped dataset.
To create a data.frame from your data, call the data.frame() function with column names and content, like this:
data.frame(build_year = build_year,
           mileage = mileage,
           volume = volume,
           fuel_type = fuel_type,
           price = price,
           link = link,
           offer_details = offer_details)
Or you could initialize a data.frame with one column only and then add further vectors as columns:
output_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
output_df$volume <- html_nodes(out, css="[data-code=engine_capacity]") %>%
html_text(TRUE)
Finally, you should note that data.frame columns must all be the same length, while some of the data that you scrape is optional. At the moment of writing this answer, a few offers had no engine capacity and no offer description. You have to use two html_nodes calls in succession (as a single CSS selector will not match what doesn't exist). But even then, html_nodes will silently drop missing data. This can be worked around by piping the html_nodes output into an html_node call:
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
html_node("[data-code=engine_capacity]") %>%
html_text(TRUE)
The final version of my approach to the loop internals is below. Just make sure that you initialize an empty data.frame before running it, and that you merge the output of the current iteration into the final data frame (using, for example, rbind), or each iteration will overwrite the results of the previous one. Alternatively you could use do.call(rbind, lapply()), which is idiomatic R for such a task; a sketch of that pattern follows the loop code below.
As a side note, when scraping a large amount of quickly changing data, consider decoupling the data-download and data-processing steps. Imagine that there is some corner case you haven't accounted for that causes R to terminate. How will you proceed if such a condition appears in the middle of your iterations? The longer you stay on one page, the more duplicates you introduce (as new offers appear and existing ones are pushed down to later pages), and the more offers you miss (as sales are concluded and offers disappear forever). A sketch of such decoupling is shown next, followed by the final loop internals.
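As an illustration only (the directory and file names below are hypothetical, not part of the original code), the download step could simply persist the raw HTML to disk, and the parsing could happen later, offline:
library(rvest)
library(xml2)

# step 1: download only -- keep each page exactly as served
dir.create("otomoto_pages", showWarnings = FALSE)
for (i in 1:10) {   # small range for illustration; the question uses 1:7000
  page <- read_html(paste0("https://www.otomoto.pl/osobowe/?page=", i))
  write_html(page, file.path("otomoto_pages", paste0("page_", i, ".html")))
}

# step 2 (possibly much later): parse the saved files without touching the site again
saved_files <- list.files("otomoto_pages", full.names = TRUE)
out <- read_html(saved_files[1])
And here are the final loop internals: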
current_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
current_df$mileage = html_nodes(out, css="[data-code=mileage]") %>%
  html_text(TRUE)
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
current_df$fuel_type = html_nodes(out, css="[data-code=fuel_type]") %>%
  html_text(TRUE)
current_df$price = out %>% html_nodes(xpath="//div[@class='offer-price']//span[contains(@class, 'number')]") %>%
  html_text(TRUE)
current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>%
  html_text(TRUE) %>%
  str_replace_all("[\n\r]", "")
current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>%
  html_node("h3") %>%
  html_text(TRUE)
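As promised, here is a sketch of the do.call(rbind, lapply()) idiom mentioned above. The helper function name and the 1:5 page range are illustrative only; the selectors repeat the ones used in the loop internals, and only two columns are kept for brevity:
library(rvest)

scrape_page <- function(i) {
  out <- read_html(paste0("https://www.otomoto.pl/osobowe/?page=", i))
  # one parent node per offer, so optional fields become NA instead of being dropped
  params <- out %>% html_nodes("ul.offer-item__params")
  data.frame(
    build_year = params %>% html_node("[data-code=year]") %>% html_text(TRUE),
    mileage    = params %>% html_node("[data-code=mileage]") %>% html_text(TRUE)
  )
}

# one data frame for all scraped pages, without manual rbind inside a for loop
all_offers <- do.call(rbind, lapply(1:5, scrape_page))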
Related
I created another post related to this here, but it caused a lot of confusion, so hopefully this post is clearer. FYI, the code I initially wrote for the loop (which didn't do what I needed) is also in the linked post if you want to see it.
Goal
I'm trying to scrape all the competitor data from the IBJJF website. This includes each competitor's division, gender, belt and weight. Although these stats only appear once at the top of the page, the code below appropriately fills in the correct information for each competitor.
The problem
When this code is used in a for-loop, the loop does not automatically fill in the appropriate information for each competitor. For example, if the belt is black, I would like it to say black next to all competitors on that page. But I cannot figure out how to do that.
Question
How would you change this code into a loop so that the appropriate division, gender, belt, and weight is next to each competitor's name? Or if you were trying to capture all the data I mentioned from each page, how would you go about doing it?
library(rvest)
library(tidyverse)
MensUrl <- read_html('https://www.bjjcompsystem.com/tournaments/1869/categories/2053146')
## SCRAPE FIGHT INFO -------------------------------------------
# fight info
ageDivision <- MensUrl %>%
html_nodes('.category-title__age-division') %>%
html_text()
gender <- MensUrl %>%
html_nodes('.category-title__age-division+ .category-title__label') %>%
html_text()
belt <- MensUrl %>%
html_nodes('.category-title__label:nth-child(3)') %>%
html_text()
weight <- MensUrl %>%
html_nodes('.category-title__label:nth-child(4)') %>%
html_text()
fightAndMat <- MensUrl %>%
html_nodes('.bracket-match-header__where , .bracket-match-header__fight') %>%
html_text()
date = MensUrl %>%
html_nodes('.bracket-match-header__when') %>%
html_text()
CompetitorNo = MensUrl %>%
html_nodes('.match-card__competitor-n') %>%
html_text()
name = MensUrl %>%
html_nodes('.match-card__competitor-description div:nth-child(1)') %>%
html_text()
gym = MensUrl %>%
html_nodes('.match-card__club-name') %>%
html_text()
#### create match df ####
matches = data.frame('division' = ageDivision,
                     'gender' = gender,
                     'belt' = belt,
                     'weight' = weight,
                     'fightAndMat' = fightAndMat,
                     'date' = date,
                     'competitor' = CompetitorNo,
                     'name' = name,
                     'gym' = gym)
This is what the above code produces when put in a data frame. I want the final result to look the same.
How are you? I am trying to extract some info from this sports betting webpage using rvest. I asked a related question a few days ago and achieved almost 100% of my goals. So far, and thanks to you, I have successfully extracted the title, the score and the time of the matches being played using the following code:
library(rvest)
library(tidyverse)
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
read_html()
data=data.frame(
Titulo = page %>%
html_elements(".titulo") %>%
html_text(),
Marcador = page %>%
html_elements(".marcador") %>%
html_text(),
Tiempo = page %>%
html_elements(".marcador+ span") %>%
html_text() %>%
str_squish()
)
Now I want to get the repeated values: for example, if the country of a match is "Brasil", I want the data frame to say that the country is Brasil for every match in that category. So far I only managed to extract all the countries, but individually. The same applies to the sport name and the tournament.
Can you help me with that? Thanks in advance.
You could re-write your code to use separate functions that work with different levels of information. These can be called in a nested fashion, making the code easier to read.
Essentially, this uses nested map_dfr() calls to produce a single dataframe from functions working with lists at different levels within the DOM.
Below, you can think of it as an outer loop over sports, then an intermediate loop over countries, and an innermost loop over events within a sport and country.
library(rvest)
library(tidyverse)
get_sport_info <- function(sport) {
  df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
  df$sport <- sport %>%
    html_element(".sport-name") %>%
    html_text()
  return(df)
}

get_play_info <- function(play) {
  df <- map_dfr(play %>% html_elements(".event"), ~
    data.frame(
      titulo = .x %>% html_element(".titulo") %>% html_text(),
      marcador = .x %>% html_element(".marcador") %>% html_text(),
      tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
    ))
  df$country <- play %>%
    html_element(".category-name") %>%
    html_text()
  return(df)
}
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)
I have been working on some R code. The purpose is to collect the average word length and other stats about the words in a section of a website with 50 pages. Collecting the stats is no problem; it's the easy part. However, getting my code to collect the stats over all 50 pages is the hard part: it only ever seems to output information from the first page. See the code below, and please ignore the poor indentation.
install.packages(c('tidytext', 'tidyverse'))
library(tidyverse)
library(tidytext)
library(rvest)
library(stringr)
websitePage <- read_html('http://books.toscrape.com/catalogue/page-1.html')
textSort <- websitePage %>%
html_nodes('.product_pod a') %>%
html_text()
for (page_result in seq(from = 1, to = 50, by = 1)) {
link = paste0('http://books.toscrape.com/catalogue/page-',page_result,'.html')
page = read_html(link)
# Creates a tibble
textSort.tbl <- tibble(text = textSort)
textSort.tidy <- textSort.tbl %>%
unnest_tokens(word, text)
}
# Finds the average word length
textSort.tidy %>%
map(nchar) %>%
map(mean)
# Finds the most common words
textSort.tidy %>%
count(word, sort = TRUE)
# Removes the stop words and then finds most common words
textSort.tidy %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
# Counts the number of times the word "Girl" is in the text
textSort.tidy %>%
count(word) %>%
filter(word == "Girl")
You can use lapply/map to extract the text from multiple links.
library(rvest)
link <- paste0('http://books.toscrape.com/catalogue/page-',1:50,'.html')
result <- lapply(link, function(x) x %>%
read_html %>%
html_nodes('.product_pod a') %>%
html_text)
You can continue using lapply if you want to apply other functions to text.
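For example, a sketch of one way to rebuild the tidy-text workflow from the question over all 50 pages, building on the result list returned above (column names are illustrative):
library(tibble)
library(dplyr)
library(tidytext)

# one row per scraped title, tagged with the page it came from
textSort.tidy <- tibble(page = rep(seq_along(result), lengths(result)),
                        text = unlist(result)) %>%
  unnest_tokens(word, text)

# average word length across all pages
textSort.tidy %>% summarise(avg_length = mean(nchar(word)))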
I am using Rvest to scrape google news.
However, I encounter missing values in the "Time" element from time to time, depending on the keyword. Since the values are missing, building the data frame of scraping results ends up failing with a "differing number of rows" error.
Is there any way to fill in NA for these missing values?
Below is the example of the code I am using.
html_dat <- read_html(paste0("https://news.google.com/search?q=",Search,"&hl=en-US&gl=US&ceid=US%3Aen"))
dat <- data.frame(Link = html_dat %>%
html_nodes('.VDXfz') %>%
html_attr('href')) %>%
mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
news_dat <- data.frame(
Title = html_dat %>%
html_nodes('.DY5T1d') %>%
html_text(),
Link = dat$Link,
Description = html_dat %>%
html_nodes('.Rai5ob') %>%
html_text(),
Time = html_dat %>%
html_nodes('.WW6dff') %>%
html_text()
)
Without knowing the exact page you were looking at, I tried the first Google News page.
Per the rvest documentation, html_node (without the s) will always return a value, even if it is NA. Therefore, to keep the vectors the same length, one needs to find the common parent node of all the desired data nodes and then parse the desired information out of each of those parent nodes.
Assuming the Title node is the most complete, I went up one level with xml_parent() and attempted to retrieve the same number of description nodes; this didn't work. I then tried two levels up using xml_parent() %>% xml_parent(), and this seems to work.
library(rvest)
url <-"https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)
Title = html_dat %>% html_nodes('.DY5T1d') %>% html_text()
# Link = dat$Link
Link = html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/",Link)
#Find the common parent node
#(this was trial and error) Tried the parent then the grandparent
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description = Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time = Titlenodes %>% html_node('.WW6dff') %>% html_text()
answer <- data.frame(Title, Time, Description, Link)
I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach in Python or my rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links) {
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
  # code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the URLs for all boxscores in 2016, and the for loop goes to each boxscore page with the hope of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally it should contain the table or elements of the table I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This uses my own example, but I'm sure you could apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
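Applied to one of the boxscore pages from the question, the same trick might look like the sketch below. The id 'home_starters' is an assumption about how the commented-out starters table is named; check the page source for the actual id:
library(rvest)

"https://www.pro-football-reference.com/boxscores/201609110rav.htm" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>% # the starters tables are hidden inside HTML comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#home_starters') %>%  # assumed table id -- verify in the page source
  html_table()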