Web scrape subtitles from opensubtitles.org in R

I'm new to web scraping, and I'm currently trying to download subtitle files for over 100,000 films for a research project. Each film has a unique IMDb ID (e.g., the ID for Inception is 1375666). I have a list in R containing the 102524 IDs, and I want to download the corresponding subtitles from opensubtitles.org.
Each film has its own page on the site, for example, Inception has:
https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666
The link to download the subtitles is obtained by clicking on the first link in the table called "Movie name", which takes you to a new page, then clicking the "Download" button on that page.
I'm using rvest to scrape the pages, and I've written this code:
library(rvest)

for (i in 1:102524) {
  subtitle.url <- paste0("https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-", movie.ids[i])
  read_html(subtitle.url) %>%
    html_nodes(".head+ .expandable .bnone")
  # Not sure where to go from here
}
Any help on how to do this will be greatly appreciated.
EDIT: I know I'm asking something pretty complicated, but any pointers on where to start would be great.

Following the link and the download button, we can see that the actual subtitle file is downloaded from https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922 (for your example). I found this out by inspecting the Network tab in Mozilla's Developer Tools while doing a download.
We can download directly from that address using:
download.file('https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922',
              destfile = 'subtitle-6961922.zip')
The base url (https://www.opensubtitles.org/en/download/vrf-108d030f/sub/) is fixed for all the downloads as far as I can see, so we only need the subtitle id.
The id can be extracted from the search page with:
id <- read_html(subtitle.url) %>%
  html_node('.bnone') %>%
  html_attr('href') %>%
  stringr::str_extract('\\d+')
So, putting it all together:
library(rvest)

search_url   <- 'https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-'
download_url <- 'https://www.opensubtitles.org/en/download/vrf-108d030f/sub/'

for (i in 1:102524) {
  subtitle.url <- paste0(search_url, movie.ids[i])

  # Extract the subtitle id from the first "Movie name" link
  id <- read_html(subtitle.url) %>%
    html_node('.bnone') %>%
    html_attr('href') %>%
    stringr::str_extract('\\d+')

  download.file(paste0(download_url, id),
                destfile = paste0('subtitle-', movie.ids[i], '.zip'))

  # Wait somewhere between 1 and 4 seconds before the next download,
  # as a courtesy to the site
  Sys.sleep(runif(1, 1, 4))
}
Keep in mind this will take a long time!
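With over 100,000 films, some requests will inevitably fail (missing subtitles, timeouts, server hiccups), and an unhandled error would abort the whole loop. A minimal sketch of how you might guard against that, reusing the same search_url, download_url and movie.ids objects as above (the tryCatch wrapper and the failed-id log are my additions, not part of the original answer):
failed.ids <- c()
for (i in seq_along(movie.ids)) {
  ok <- tryCatch({
    subtitle.url <- paste0(search_url, movie.ids[i])
    id <- read_html(subtitle.url) %>%
      html_node('.bnone') %>%
      html_attr('href') %>%
      stringr::str_extract('\\d+')
    download.file(paste0(download_url, id),
                  destfile = paste0('subtitle-', movie.ids[i], '.zip'))
    TRUE
  }, error = function(e) FALSE)
  # Remember which ids failed so they can be retried later
  if (!ok) failed.ids <- c(failed.ids, movie.ids[i])
  Sys.sleep(runif(1, 1, 4))
}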

Related

How do I extract certain html nodes using rvest?

I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places and there are numerous ways to obtain it. I will offer an "as on the tin" option, as the code gives clear indications as to what is being selected.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
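Then, purely as an illustration, you can bundle the pieces and inspect them (specs comes back as a list with one tibble per spec table):
result <- list(name = name, price = price, specs = specs)
str(result, max.level = 1)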

Scraping a paginated website with the same URL when clicking the "next page" button

I am new to R and sorry in advance if my question is too basic.
I am scraping a website (here's the link: http://jhsjk.people.cn/result?type ) to download all the articles. I noted that every time I click the next page button, the URL remains unchanged.
I tried to use a loop with rvest to scrape the next pages, but failed.
I searched on this website and learned that I might use the RSelenium package to get there, but I still couldn't sort this out :( (I am so stupid on this)
here's my code
library(rvest)

url <- "http://jhsjk.people.cn/result?type"
page <- read_html(url)
title <- page %>% html_elements(css = ".btbg .w1200.p2_cn.cf #news_list.list_14.p1_2.clearfix a") %>% html_text()
link <- page %>% html_elements(css = ".btbg .w1200.p2_cn.cf #news_list.list_14.p1_2.clearfix a") %>% html_attr('href')
press_releases_df <- data.frame(title = title, link = link)
With this code, I can only extract the first page. I want to use a loop, but I don't really know what should be looped over. Should it be the page number?
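Since the URL never changes, the paging is almost certainly done by JavaScript, so rvest alone can't follow it; one common workaround is to drive a real browser with RSelenium and click the pager in a loop. A rough, untested sketch of that idea (the .page_n selector for the "next page" button is a placeholder; you would need to find the real one with your browser's inspector):
library(rvest)
library(RSelenium)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
remDr$navigate("http://jhsjk.people.cn/result?type")

all_pages <- list()
for (p in 1:10) {  # however many pages you need
  # parse whatever the browser has currently rendered with rvest
  page <- read_html(remDr$getPageSource()[[1]])
  all_pages[[p]] <- data.frame(
    title = page %>% html_elements("#news_list a") %>% html_text(),
    link  = page %>% html_elements("#news_list a") %>% html_attr("href")
  )
  # click the "next page" button (placeholder selector) and wait for the new content
  remDr$findElement(using = "css selector", ".page_n")$clickElement()
  Sys.sleep(2)
}
press_releases_df <- do.call(rbind, all_pages)
remDr$close()
driver$server$stop()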

Using rvest to Scrape Multiple Job Listing pages

I have read through multiple other similar questions and can't seem to find one that gives me the right answer.
I am trying to scrape all the current job titles on TeamWorkOnline.com.
This is the specific URL: https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&commit=Search
I have no problem starting the scraping process with this code:
listings <- data.frame(title = character(),
                       stringsAsFactors = FALSE)
{
  url_ds <- paste0('https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&commit=Search', i)
  var <- read_html(url_ds)
  # job title
  title <- var %>%
    html_nodes('.margin-none') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
  listings <- rbind(listings, as.data.frame(cbind(title)))
}
However, if you look at the site, there is 'numbered navigation' at the bottom to continue to other pages where more jobs are listed.
I cannot seem to figure out how to add the correct code to get rvest to automatically navigate to the other pages and scrape those jobs as well.
Any help would be greatly appreciated.
Try this:
library(rvest)
library(stringr)
listings <- character()
for (i in 1:25) {
  url_ds <- paste0("https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page=", i)
  # job title
  title <- read_html(url_ds) %>%
    html_nodes('.margin-none') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
  listings <- c(listings, title)
}
Simply loop through all pages to scrape and combine them.
There are 25 pages in the search results,
https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page=25
Whenever you click on the next button, the number at the end of the URL changes to match the navigation page number. So if the above code works for the first page, you just need to iterate through the range 1 to 25, append the page number to the URL, and extract each page (see the sketch below).
I hope it works
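For completeness, the same idea can be written as a single expression (a sketch, assuming the &page= parameter behaves exactly as shown above):
library(rvest)
library(stringr)

base_url <- "https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page="

listings <- unlist(lapply(1:25, function(i) {
  read_html(paste0(base_url, i)) %>%
    html_nodes(".margin-none") %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
}))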

How do I find html_node on search form?

I have a list of names (first name, last name, and date-of-birth) that I need to search the Fulton County Georgia (USA) Jail website to determine if a person is in or released from jail.
The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400
The site requires you enter a last name and first name, then it gives you a list of results.
I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow. I am using SelectorGadget to help figure out the CSS tags.
Here is the code I have so far. Right now I can't figure out what html_node to use.
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session(fc.url)
# Grab initial form
form.unfilled <- jail %>% html_node("form")
form.unfilled
The result I get from form.unfilled is {xml_missing} <NA> which I know isn't right.
I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.
Thanks.
It appears that on the initial call the webpage opens onto "http://justice.fultoncountyga.gov/PAJailManager/default.aspx". Once the session is started, you should be able to jump to the search page:
library(rvest)
# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form")
Note: Verify that your actions are within the terms of service for the website. Many sites do have policy against scraping.
The website relies heavily on JavaScript to render itself. When opening the link you provided in a fresh browser instance, you get redirected to http://justice.fultoncountyga.gov/PAJailManager/default.aspx, where you have to click the "Jail Records" link. This executes a bit of JavaScript to send you to the page with the form.
rvest is unable to execute arbitrary JavaScript. You might have to look at RSelenium. Selenium basically remote-controls a browser (for example Firefox or Chrome), which executes the JavaScript as intended.
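A rough sketch of what that could look like for this site; the field names "LastName", "FirstName" and "SearchSubmit" are taken from the form used in the code below, everything else (locators, waits) is illustrative and untested:
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

# Let a real browser execute the site's JavaScript
remDr$navigate("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")

# Fill in and submit the search form
remDr$findElement(using = "name", "LastName")$sendKeysToElement(list("DOE"))
remDr$findElement(using = "name", "FirstName")$sendKeysToElement(list("JOHN"))
remDr$findElement(using = "name", "SearchSubmit")$clickElement()
Sys.sleep(3)  # give the results time to render

# Hand the rendered HTML back to rvest
tables <- read_html(remDr$getPageSource()[[1]]) %>% html_nodes("table")

remDr$close()
driver$server$stop()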
Thanks to Dave2e.
Here is the code that works. This question is answered (but I'll post another one, because I'm not getting a table of data as a result).
Note: I cannot find any Terms of Service on this site that I'm querying
library(rvest)
# start session
jail <- html_session("http://justice.fultoncountyga.gov/PAJailManager/default.aspx")
#jump to search page
jail2 <- jail %>% jump_to("http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400")
#list the form's fields
html_form(jail2)[[1]]
# Grab initial form
form.unfilled <- jail2 %>% html_node("form") %>% html_form()
form.unfilled
#name values
lname <- "DOE"
fname <- "JOHN"
# Fill the form with name values
form.filled <- form.unfilled %>%
  set_values("LastName" = lname,
             "FirstName" = fname)
# Submit form
r <- submit_form(jail2, form.filled,
                 submit = "SearchSubmit")
#grab tables from submitted form
table <- r %>% html_nodes("table")
#grab a table with some data
table[[5]] %>% html_table()
# resulting text in this table:
# " An error occurred while processing your request.Please contact your system administrator."

Web Scraping Image URL for a series of events in ESPN Play-By-Play

I am trying to use web scraping to generate a play by play dataset from ESPN. I have figured out most of it, but have been unable to tell which team the event is for, as this is only encoded on ESPN in the form of an image. The best way I have come up with to solve this problem is to get the URL of the logo for each entry and compare it to the URL of the logo for each team at the top of the page. However, I have been unable to figure out how to get an attribute such as the url from the image.
I am running this on R and am using the rvest package. The url I am scraping is https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906 and I am scraping using the SelectorGadget Chrome extension. I have also tried comparing the name of the player to the boxscore, which has all of the players listed, but each team has a player with the last name of Jones, so I would prefer to be able to get the team by looking at the image, as this will always be right.
library(rvest)
url <- "https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906"
webpage <- read_html(url)
# have been able to successfully scrape game_details and score
game_details_html <- html_nodes(webpage,'.game-details')
game_details <- html_text(game_details_html) %>% as.character()
score_html <- html_nodes(webpage,'.combined-score')
score <- html_text(score_html)
# have not been able to scrape image
ImgNode <- html_nodes(webpage, css = "#gp-quarter-1 .team-logo")
link <- html_attr(ImgNode, "src")
For each event, I want it to be labeled "Duke" or "Wake Forest".
Is there a way to generate the URL for each image? Any help would be greatly appreciated.
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"
Your code returns these two logo URLs: 500/150 is Duke and 500/154 is Wake Forest. You can create a simple data frame with these and then join the tables.
link_df <- as.data.frame(link)
link_ref_df <- data.frame(link = c("https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100",
                                   "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"),
                          team_name = c("Duke", "Wake Forest"))
link_merged <- merge(link_df,
                     link_ref_df,
                     by = 'link',
                     all.x = T)
This is not scalable if you're doing hundreds of these with other teams, but it works for this specific case (for a more general approach, see the sketch below).
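If you did need something more scalable, one option (a sketch, building on link_df from above) is to pull the numeric team id out of each logo URL and join against a lookup table of ESPN team ids; the two-row lookup here just mirrors the ids above, and a real one would have to cover every team you care about:
library(stringr)

# The ESPN team id is the number after "500/" in the logo URL
link_df$team_id <- str_extract(link_df$link, "(?<=500/)\\d+")

team_lookup <- data.frame(team_id   = c("150", "154"),
                          team_name = c("Duke", "Wake Forest"))

link_merged <- merge(link_df, team_lookup, by = "team_id", all.x = TRUE)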
