Scraping only some columns from multiple tables

I would like to scrape only the candidate names from these tables and the votes that are reported in the third column (after the image, candidate name).
This is as far as I've gotten.
library(rvest)
ndp_leadership<-url('https://en.wikipedia.org/wiki/New_Democratic_Party_leadership_elections')
results<-read_html(ndp_leadership, 'table')
results<-html_nodes(results, 'table')
out<-results %>%
html_nodes(xpath="//*[contains(., 'Candidate')]//tr/td")
out

Although this doesn't really use XPath, here's one way to do it:
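library(rvest)
library(purrr)   # provides map()

# ndp_leadership is the url() connection defined in the question;
# column 2 of every .wikitable holds the candidate names
results <- read_html(ndp_leadership) %>%
  html_nodes(".wikitable") %>%
  html_table(fill = TRUE) %>%
  map(~ .[, 2]) %>%
  unlist() %>%
  setdiff(., c("Candidate", "Total"))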
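
Since the question also asks for the votes, here is a minimal sketch of keeping both columns. It parses a small inline table instead of the live Wikipedia page, so the table contents, the column positions (2 and 3) and the helper name pick_columns are assumptions made for illustration; the real tables may need different positions.

```r
library(rvest)
library(purrr)

# Hypothetical stand-in for one results table (not real data)
html <- read_html('
  <table class="wikitable">
    <tr><th>Image</th><th>Candidate</th><th>Votes</th></tr>
    <tr><td></td><td>Candidate A</td><td>1,200</td></tr>
    <tr><td></td><td>Candidate B</td><td>800</td></tr>
    <tr><td></td><td>Total</td><td>2,000</td></tr>
  </table>')

# Keep columns 2 and 3, drop header/total rows, turn votes into numbers
pick_columns <- function(tbl) {
  out <- as.data.frame(tbl)[, c(2, 3)]
  names(out) <- c("candidate", "votes")
  out <- out[!out$candidate %in% c("Candidate", "Total"), ]
  out$votes <- as.numeric(gsub("[^0-9.]", "", out$votes))
  out
}

tables  <- html %>% html_nodes(".wikitable") %>% html_table()
results <- do.call(rbind, map(tables, pick_columns))
```

Applying the same helper to every table and stacking the results keeps each candidate next to the matching vote count, which a single unlist() call cannot do.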

Related

Extracting repeated class with rvest html_elements in R

I am trying to extract some information from this sports betting webpage using rvest. I asked a related question a few days ago and got almost everything I needed. So far, thanks to your help, I have successfully extracted the title, the score and the time of the matches being played with the following code:
library(rvest)
library(tidyverse)
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
  read_html()
data <- data.frame(
  Titulo = page %>%
    html_elements(".titulo") %>%
    html_text(),
  Marcador = page %>%
    html_elements(".marcador") %>%
    html_text(),
  Tiempo = page %>%
    html_elements(".marcador+ span") %>%
    html_text() %>%
    str_squish()
)
Now I want to get repeated values. For example, if the country of a match is "Brasil", I want the data frame to show Brasil as the country for every match in that category. So far I have only managed to extract all the countries individually. The same applies to the sport name and the tournament.
Can you help me with that? Thanks in advance.

You could rewrite your code to use separate functions that work with different levels of information. These can be called in a nested fashion, which makes the code easier to read.
Essentially, nested map_dfr() calls produce a single data frame from functions that work with lists at different levels of the DOM.
Below, you can think of it as an outer loop over sports, an intermediate loop over countries, and an innermost loop over events within a sport and country.
library(rvest)
library(tidyverse)

get_sport_info <- function(sport) {
  df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
  df$sport <- sport %>%
    html_element(".sport-name") %>%
    html_text()
  return(df)
}

get_play_info <- function(play) {
  df <- map_dfr(play %>% html_elements(".event"), ~
    data.frame(
      titulo = .x %>% html_element(".titulo") %>% html_text(),
      marcador = .x %>% html_element(".marcador") %>% html_text(),
      tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
    ))
  df$country <- play %>%
    html_element(".category-name") %>%
    html_text()
  return(df)
}

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)
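
To see the nesting work without hitting the live site, here is a minimal offline sketch of the same idea. It swaps map_dfr() for base R's lapply() plus do.call(rbind, ...) so that it needs only rvest; the inline markup and the function names get_category and get_sport are assumptions that merely mimic the sport > category > event structure.

```r
library(rvest)

# Hypothetical markup that mimics the nesting sport > category > event
page <- read_html('
  <div class="sport"><span class="sport-name">Futbol</span>
    <div class="category"><span class="category-name">Brasil</span>
      <div class="event"><span class="titulo">A vs B</span></div>
      <div class="event"><span class="titulo">C vs D</span></div>
    </div>
    <div class="category"><span class="category-name">Uruguay</span>
      <div class="event"><span class="titulo">E vs F</span></div>
    </div>
  </div>')

# Inner level: one row per event, tagged with the enclosing country
get_category <- function(category) {
  rows <- lapply(html_elements(category, ".event"), function(event) {
    data.frame(titulo = html_text(html_element(event, ".titulo")))
  })
  df <- do.call(rbind, rows)
  df$country <- html_text(html_element(category, ".category-name"))
  df
}

# Outer level: stack all categories of a sport, then tag the sport name
get_sport <- function(sport) {
  df <- do.call(rbind, lapply(html_elements(sport, ".category"), get_category))
  df$sport <- html_text(html_element(sport, ".sport-name"))
  df
}

final <- do.call(rbind, lapply(html_elements(page, ".sport"), get_sport))
```

Each level only knows about its own label (country or sport) and repeats it for every row produced by the level below, which is what fills in the repeated values.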

What would be the best practice to merge additional variables to data based on specific row information when web scraping in R using 'rvest'?

I'm currently web scraping the IMDB website to extract movie data.
I would like to know how you would solve this problem.
library(tidyverse)
library(data.table)
library(rvest)
library(janitor)
#top rated movies website
url <- 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
# extract the title of the movies using rvest
titles <- url %>%
read_html() %>%
html_nodes(' .titleColumn a') %>%
html_text() %>%
as.data.table() %>%
setnames(. ,old = colnames(.), new='title')
# extract links to each of the titles, this will be the reference
links <- url %>%
read_html() %>%
html_nodes('.titleColumn a') %>%
html_attr('href') %>%
as.data.table() %>%
setnames(. ,old = colnames(.), new='links')
# creating a DT with the data
movies <- cbind(titles,links)
This gives me a movies data.table with title and links as columns.
Next, I would like to extract additional data for each movie using the links.
I will use the first row as an example.
#the first link in movies
link <- 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=NJ52X0MM1V9FKSPBT46G&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'
# get budget data
budget <- link %>%
read_html() %>%
html_nodes(select) %>%
html_text() %>%
gsub('\\n','',.) %>%
str_split(.,'\\:')%>%
as.data.table() %>%
janitor::row_to_names(row_number = 1) %>%
setnames(.,old=colnames(.),new= tolower(gsub(' ','_' , str_trim(colnames(.)))))
budget[,(colnames(budget))] <- lapply(budget,function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
Now I have a 1x4 table with budget information.
I would like to pull the data for each link in movies and merge it into the data.table, giving a final table with 6 columns: 'title', 'links' and the four budget variables. I tried to write a function that wraps the budget code, takes each row's link as a parameter and is then called with lapply, but I don't think this is the correct approach.
Is there an efficient way to do this?
Thanks so much for your help.

I think this would solve your problem:
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'
# get budget data
## As function
get_budget = function(link,select){
budget <- link %>%
read_html() %>%
html_nodes(select) %>%
html_text() %>%
gsub('\\n','',.) %>%
str_split(.,'\\:')%>%
as.data.table() %>%
janitor::row_to_names(row_number = 1) %>%
setnames(.,old=colnames(.),new= tolower(gsub(' ','_' , str_trim(colnames(.)))))
budget[,(colnames(budget))] <- lapply(budget,function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
return(budget)
}
# Scraping every page is slow, so keep only the first 10 movies for this example:
movies <- movies[1:10, ]
# The scraped links are relative and already start with "/", so prepend the host only
tmp <- lapply(movies[, links], function(x)
  get_budget(link = paste0("https://www.imdb.com", x), select = select)) %>%
  rbindlist(fill = TRUE)
movies <- cbind(movies, tmp)
The result is the movies table with the budget columns appended.
Finally, a few small suggestions that would make your code cleaner:
setnames() does not need the . placeholder from magrittr; the piped object is passed as its first argument automatically.
When possible, avoid setnames(., old = colnames(.), new = 'links'). In your case setnames('links') is enough, since you are renaming all of the columns.
setnames(dt, old = oldnames, new = newnames) is only necessary when oldnames is not equal to names(dt).
Since DT is another popular R package, completely unrelated to data.table, it is better to refer to a data.table as a data.table rather than as a DT.
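
The setnames() points above can be sketched on toy data. The tables and their values here are made up for illustration; only the setnames() calls mirror the advice.

```r
library(data.table)

# Hypothetical one-column table standing in for the scraped links
links <- as.data.table(c("/title/a/", "/title/b/"))

# Renaming every column: the new names alone are enough
setnames(links, "links")

# Renaming only some columns: old and new are both needed
dt <- data.table(a = 1:2, b = 3:4)
setnames(dt, old = "b", new = "budget")
```

Both calls modify the table by reference, so no assignment is needed.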

PDF: Table Extraction - Tabulizer (R)

I'm trying to extract a table from a PDF with the R tabulizer package. The functions run fine, but they do not capture all the data in the table.
My code is below:
library(tabulizer)
library(tidyverse)
library(abjutils)
D_path = "https://github.com/financebr/files/raw/master/Compacto09-08-2019.pdf"
out <- extract_tables(D_path,encoding = 'UTF-8')
arrumar_nomes <- function(x) {
x %>%
tolower() %>%
str_trim() %>%
str_replace_all('[[:space:]]+', '_') %>%
str_replace_all('%', 'p') %>%
str_replace_all('r\\$', '') %>%
abjutils::rm_accent()
}
tab_tidy <- out %>%
map(as_tibble) %>%
bind_rows() %>%
set_names(arrumar_nomes(.[1,])) %>%
slice(-1) %>%
mutate_all(funs(str_replace_all(., '[[:space:]]+', ' '))) %>%
mutate_all(str_trim)
Comparing the PDF table (D_path) with the tab_tidy data, you can see that some information is missing. None of the first-column cells, which are merged in the PDF, are picked up by extract_tables(). In addition, none of the rows containing “Boi Gordo” and “Boi Magro” are found by the function.
The rest is extracted perfectly. Do you know why this happens and how to solve it? The existing questions on this topic here have few answers.

Scraped table returns empty data frame

I'm trying to scrape two things. I want to extract the links from each individual school on a page with this code:
scraped_links <- read_html("https://www.scholenopdekaart.nl/middelbare-scholen/zoeken/") %>%
html_nodes("a.school-naam") %>%
html_attr("href") %>%
html_table() %>%
as.data.frame() %>%
as.tbl()
Then I want to scrape the tables on these pages:
scraped_tables <- read_html("https://www.scholenopdekaart.nl/Middelbare-scholen/146/1086/Almere-College/Slaagpercentage") %>%
html_nodes(xpath = "/html/body/div[3]/div[3]/div[1]/div[3]/div[3]/div[3]") %>%
html_table() %>%
as.data.frame() %>%
as.tbl()
They both return empty data frames. I tried css selectors, multiple xpaths, but I can't get it to work... Hope someone can help me.

rvest web content scraping issue / car trading website

Question
I want to scrape specific parts of a website (a car sales platform) with rvest.
The CSS is frankly too confusing for me to figure out what's wrong on my own.
#### scraping the website www.otomoto.pl with used cars #####
library(rvest)
library(stringr)
baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page="
i <- 1
for ( i in 1:7000 )
{
  link = paste0(baseURL_otomoto,i)
  out = read_html(link)
  print(i)
  print(link)
  ### building year
  build_year = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  mileage = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  volume = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  fuel_type = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
  offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
}
Any guesses what might be the reason for this behaviour?
PS#1.
How can I scrape all the build_year, mileage and fuel_type data from the offers on the analysed website at once, as a data.frame? Using classes (xpath = '//div[@class=...) didn't work in my case.
PS#2.
I wanted to scrape details of the actual offers using, for instance:
gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>%
  html_text() %>%
  str_replace_all("\n","") %>%
  str_replace_all("\r","") %>%
  str_trim()
where the index a in ul[a] runs over 1:2 and the index b in li[b] runs over 1:12.
Unfortunately, this approach fails: the resulting data frame is empty. Any guesses why?

First and foremost, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all a mere two weeks later). For example, instead of:
html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
  html_text()
you can write:
html_nodes(out, css="[data-code=year]") %>% html_text()
Second, read the documentation of the libraries that you use. The pattern in str_replace_all() may be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n","") %>% str_replace_all("\r","")). html_text() can trim text for you (trim = TRUE), which means that str_trim() is not needed at all.
Third, if you find yourself copy-pasting code, step back and consider whether a function would be a better solution; usually it would. In your case, I would personally skip the str_replace_all() calls until the data cleaning step and then apply them to the data.frame holding all of the scraped data.
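
The second and third points can be combined into one small helper. This is only a sketch: scrape_text is a hypothetical name, and the inline HTML stands in for a single offer, so the live page markup may differ.

```r
library(rvest)
library(stringr)

# Hypothetical helper: select by CSS, extract trimmed text and drop any
# embedded line breaks with a single regular expression
scrape_text <- function(page, css) {
  page %>%
    html_nodes(css) %>%
    html_text(trim = TRUE) %>%
    str_replace_all("[\n\r]", "")
}

# Inline stand-in for one offer (made-up values)
page <- read_html('<ul><li data-code="year"> 2015 </li><li data-code="mileage">120 000\r\n km</li></ul>')

build_year <- scrape_text(page, "[data-code=year]")
mileage    <- scrape_text(page, "[data-code=mileage]")
```

With the helper in place, each field in the loop becomes a one-line call, and any change to the cleaning rules happens in a single spot.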
To create a data.frame from your data, call data.frame() with column names and contents, like this:
data.frame(build_year = build_year,
mileage = mileage,
volume = volume,
fuel_type = fuel_type,
price = price,
link = link,
offer_details = offer_details)
Or you could initialize data.frame with one column only and then add further vectors as columns:
output_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
output_df$volume <- html_nodes(out, css="[data-code=engine_capacity]") %>%
html_text(TRUE)
Finally, note that all data.frame columns must be the same length, while some of the data that you scrape is optional. At the time of writing this answer, a few offers had no engine capacity or no offer description. You have to use two html_nodes calls in succession (as a single CSS selector will not match what doesn't exist). But even then, html_nodes will silently drop missing data. This can be worked around by piping the html_nodes output into an html_node call:
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
html_node("[data-code=engine_capacity]") %>%
html_text(TRUE)
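
The difference is easiest to see on a tiny example. The markup below is hypothetical (three offers, the second one without an engine capacity) and only borrows the class and attribute names used above.

```r
library(rvest)

# Hypothetical markup: the second offer has no engine capacity
page <- read_html('
  <ul class="offer-item__params"><li data-code="engine_capacity">1 600 cm3</li></ul>
  <ul class="offer-item__params"><li data-code="year">2012</li></ul>
  <ul class="offer-item__params"><li data-code="engine_capacity">1 998 cm3</li></ul>')

# html_nodes() alone returns only the elements that exist
direct <- page %>%
  html_nodes("[data-code=engine_capacity]") %>%
  html_text(TRUE)

# html_nodes() followed by html_node() returns one result per offer,
# with NA where the element is missing
per_offer <- page %>%
  html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
```

Here direct has two elements while per_offer has three, so only per_offer lines up with the other per-offer columns.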
The final version of my approach to the loop internals is shown at the end of this answer. Just make sure that you initialize an empty data.frame before the loop and that you merge the output of the current iteration into the final data frame (for example with rbind), or each iteration will overwrite the results of the previous one. Alternatively, you could use do.call(rbind, lapply()), which is idiomatic R for this kind of task.
As a side note, when scraping a large amount of quickly changing data, consider decoupling the downloading and processing steps. Imagine that there is some corner case that you haven't accounted for which causes R to terminate. How will you proceed if such a condition appears in the middle of your iterations? The longer you stay on one page, the more duplicates you introduce (as new offers appear and existing ones are pushed down to later pages), and the more offers you miss (as sales are concluded and offers disappear forever).
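
One possible way to decouple the two steps is sketched below. The function names download_pages and parse_saved, the file naming scheme and the directory argument are all assumptions for illustration; only download.file(), list.files() and read_html() are existing functions.

```r
library(rvest)

# Step 1 (network): store the raw HTML untouched, skipping pages already saved
download_pages <- function(base_url, pages, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (i in pages) {
    dest <- file.path(dir, sprintf("page_%04d.html", i))
    if (!file.exists(dest)) {
      download.file(paste0(base_url, i), dest, quiet = TRUE)
    }
  }
}

# Step 2 (offline): parse whatever is on disk; safe to rerun after a crash
parse_saved <- function(dir) {
  files <- list.files(dir, pattern = "\\.html$", full.names = TRUE)
  lapply(files, read_html)
}
```

A call such as download_pages(baseURL_otomoto, 1:10, "raw_pages") would fetch quickly, and parse_saved("raw_pages") could then be repeated as often as needed while the extraction code is being debugged.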
current_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
current_df$mileage = html_nodes(out, css="[data-code=mileage]") %>%
  html_text(TRUE)
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
current_df$fuel_type = html_nodes(out, css="[data-code=fuel_type]") %>%
  html_text(TRUE)
current_df$price = out %>% html_nodes(xpath="//div[@class='offer-price']//span[contains(@class, 'number')]") %>%
  html_text(TRUE)
current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>%
  html_text(TRUE) %>%
  str_replace_all("[\n\r]", "")
current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>%
  html_node("h3") %>%
  html_text(TRUE)
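
The do.call(rbind, lapply()) idiom mentioned above can be sketched as follows. The function scrape_page is a hypothetical stand-in for the loop internals: it returns made-up rows instead of downloading anything, so the example runs offline.

```r
# Hypothetical stand-in for the per-page scraping code: returns one
# data.frame per page number (made-up values, no network access)
scrape_page <- function(i) {
  data.frame(page = i, build_year = c("2010", "2015"), stringsAsFactors = FALSE)
}

# lapply() builds a list of per-page data frames;
# do.call(rbind, ...) stacks them into a single data frame
all_offers <- do.call(rbind, lapply(1:3, scrape_page))
```

This avoids both the empty data.frame initialisation and the risk of overwriting earlier iterations, because every page's result is kept in the list until the final rbind.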
