Using rvest to Scrape Multiple Job Listing Pages in R

I have read through multiple other similar questions and can't seem to find one that gives me the right answer.
I am trying to scrape all the current job titles on TeamWorkOnline.com.
This is the specific URL: https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&commit=Search
I have no problem starting the scraping process with this code:
listings <- data.frame(title = character(),
                       stringsAsFactors = FALSE)
{
  url_ds <- paste0('https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&commit=Search', i)
  var <- read_html(url_ds)
  # job title
  title <- var %>%
    html_nodes('.margin-none') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
  listings <- rbind(listings, as.data.frame(cbind(title)))
}
However, if you look at the site, there is 'numbered navigation' at the bottom to continue to other pages where more jobs are listed.
I cannot seem to figure out how to add the correct code to get rvest to automatically navigate to the other pages and scrape those jobs as well.
Any help would be greatly appreciated.

Try this:
library(rvest)
library(stringr)

listings <- character()
for (i in 1:25) {
  url_ds <- paste0("https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page=", i)
  # job title
  title <- read_html(url_ds) %>%
    html_nodes('.margin-none') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
  listings <- c(listings, title)
}
Simply loop through all the pages, scrape each one, and combine the results.
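If you prefer the data frame structure from your original code, you can collect the titles into one at the end; a minimal sketch:
# put the scraped titles into a data frame and drop rows where the regex did not match
listings_df <- data.frame(title = listings, stringsAsFactors = FALSE)
listings_df <- listings_df[!is.na(listings_df$title), , drop = FALSE]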

There are 25 pages in the search results; the last one is
https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page=25
Whenever you click the next button, the number at the end of the URL changes to match the page you navigated to. If the code above works for the first page, you only need to iterate over the range 1 to 25, append each page number to the URL, and extract the listings, as in the sketch below.
I hope this helps.
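A compact way to do that iteration, assuming the same .margin-none selector works on every page, is to build the 25 page URLs up front and lapply over them (a sketch equivalent to the loop above):
library(rvest)
library(stringr)

base_url <- "https://www.teamworkonline.com/jobs-in-sports?employment_opportunity_search%5Bexclude_united_states_opportunities%5D=0&page="
page_urls <- paste0(base_url, 1:25)
# scrape each page and flatten the results into one character vector
listings <- unlist(lapply(page_urls, function(u) {
  read_html(u) %>%
    html_nodes(".margin-none") %>%
    html_text() %>%
    str_extract("(\\w+.+)+")
}))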

Related

How do I extract certain html nodes using rvest?

I'm new to web scraping, so I may not be doing all the proper checks here. I'm attempting to scrape information from a URL; however, I'm not able to extract the nodes I need. See the sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen 7), which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page which can be reviewed by disabling JS running in browser or comparing rendered page against page source.
You can view page source with Ctrl+U, then Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, since the code gives a clear indication of what is being selected.
I've updated the syntax and reduced the number of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
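A quick, purely illustrative check of the scraped values:
# one row of summary data; specs is a list of tables returned by html_table()
product <- data.frame(name = name, price = price, stringsAsFactors = FALSE)
product
if (length(specs) > 0) specs[[1]]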

Scraping a paginated website with the same URL when clicking the "next page" button

I am new to R and sorry in advance if my question is too basic.
I am scraping a website (here's the link: http://jhsjk.people.cn/result?type ) to download all the articles. I noticed that every time I click the next-page button, the URL remains unchanged.
I tried to use a loop with rvest to scrape the next page, but failed.
I searched on this website and learned that I might use the RSelenium package to get there, but I still could not sort this out :(
Here's my code:
url <- c("http://jhsjk.people.cn/result?type")
page <- read_html(url)
title <- page %>% html_nodes(css = ".btbg .w1200.p2_cn.cf #news_list.list_14.p1_2.clearfix a") %>% html_text()
link <- page %>% html_elements(css = ".btbg .w1200.p2_cn.cf #news_list.list_14.p1_2.clearfix a") %>% html_attr('href')
press_releases_df <- data.frame(title = title, link = link)
With this code, I can only extract the first page. I want to use a loop, but I don't really know what should be looped over. Should it be the page number?
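Not a complete answer, but a sketch of the usual pattern. First check in your browser's network tab whether clicking the next-page button sends a request with a page number in it (the ?type=&page=2 pattern below is an assumption, not verified). If it does, you can loop over that number with your existing selectors; if the page number never appears in any request, you will need RSelenium to click the button instead.
library(rvest)

n_pages <- 5  # placeholder: set this to the real number of result pages
all_pages <- lapply(1:n_pages, function(i) {
  # assumed URL pattern; confirm it in the network tab before relying on it
  page <- read_html(paste0("http://jhsjk.people.cn/result?type=&page=", i))
  nodes <- page %>% html_elements(css = ".btbg .w1200.p2_cn.cf #news_list.list_14.p1_2.clearfix a")
  data.frame(title = html_text(nodes), link = html_attr(nodes, "href"),
             stringsAsFactors = FALSE)
})
press_releases_df <- do.call(rbind, all_pages)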

Webscraping in R: Why does my loop return NA?

I've posted about the same question before here but the other thread is dying and I'm getting desperate.
I'm trying to scrape a webpage using rvest etc. Most of the stuff works, but now I need R to loop through a list of links, and all it gives me is NA.
This is my code:
install.packages("rvest")
library(rvest)
library(xml2)
site20min <- read_xml("https://api.20min.ch/rss/view/1")
urls <- site20min %>% html_nodes('link') %>% html_text()
I need the next line because the first two links the API gives me point back to the homepage:
urls <- urls[-c(1:2)]
If I print my links now it gives me a list of 109 links.
urls
Now this is my loop. I need it to give me the first link in urls so I can read_html it.
I'm looking for something like: "https://beta.20min.ch/story/so-sieht-die-coronavirus-kampagne-des-bundes-aus-255254143692?legacy=true".
I use break so it shows me only the first link but all I get is NA.
for (i in i:length(urls)) {
  link <- urls[i]
  break
}
link
If I can get this far, I think I can handle the rest with rvest but I've tried for hours now and just ain't getting anywhere.
Thx for your help.
Can you try out
for (i in 1:length(urls)) {
  link <- urls[i]
  break
}
link
instead?
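As a side note, seq_along(urls) is the safer idiom (it behaves correctly when urls is empty), and once the indexing works you can read each link inside the same loop; a small sketch with a polite pause between requests:
pages <- vector("list", length(urls))
for (i in seq_along(urls)) {
  pages[[i]] <- read_html(urls[i])
  Sys.sleep(1)  # avoid hammering the server
}
# pages[[1]] now holds the parsed HTML of the first article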

R - Using rvest to scrape Google + reviews

As part of a project, I am trying to scrape the complete reviews from Google+ (in previous attempts on other websites, my reviews were truncated by a "More" link that hides the full review unless you click on it).
I have chosen the package rvest for this. However, I do not seem to be getting the results I want.
Here are my steps
library(rvest)
library(xml2)
library(RSelenium)
queens <- read_html("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")
#Here I use the selectorgadget tool to identify the user review part that I wish to scrape
reviews <- queens %>%
  html_nodes(".review-snippet") %>%
  html_text()
However this doesn't seem to be working. I do not get any output here.
I am quite new to this package and web scraping, so any inputs on this would be greatly appreciated.
Here is the workflow with RSelenium and rvest:
1. Scroll down as many times as needed to load as much content as you want; remember to pause once in a while to let the content load.
2. Click on all the "More" buttons to get the full reviews.
3. Get the page source and use rvest to collect all reviews in a list.
What you want to scrape is not static, so you need the help of RSelenium. This should work:
library(rvest)
library(xml2)
library(RSelenium)
rmDr <- rsDriver(browser = c("chrome"), chromever = "73.0.3683.68")
myclient <- rmDr$client
myclient$navigate("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")
# click on the snippet to switch focus ----------
webEle <- myclient$findElement(using = "css", value = ".review-snippet")
webEle$clickElement()
# simulate scrolling down several times -------------
scroll_down_times <- 20
for (i in 1:scroll_down_times) {
  webEle$sendKeysToActiveElement(sendKeys = list(key = "page_down"))
  # the content needs time to load; wait 1 second every 5 scroll-downs
  if (i %% 5 == 0) {
    Sys.sleep(1)
  }
}
# loop and simulate clicking on all "More" elements -------------
webEles <- myclient$findElements(using = "css", value = ".review-more-link")
for (webEle in webEles) {
  tryCatch(webEle$clickElement(), error = function(e) {print(e)}) # tryCatch prevents any error from stopping the loop
}
pagesource <- myclient$getPageSource()[[1]]
# this should get you the full review, including translation and original text -------------
reviews <- read_html(pagesource) %>%
  html_nodes(".review-full-text") %>%
  html_text()
# number of stars
stars <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes("g-review-stars > span") %>%
  html_attr("aria-label")
# time posted
post_time <- read_html(pagesource) %>%
  html_node(".review-dialog-list") %>%
  html_nodes(".dehysf") %>%
  html_text()
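As a follow-up, you may want to combine the pieces and shut the browser down when finished; a sketch (the three vectors only line up if every review exposes all three fields):
# bind the scraped fields together once their lengths agree
if (length(reviews) == length(stars) && length(reviews) == length(post_time)) {
  review_df <- data.frame(review = reviews, stars = stars, posted = post_time,
                          stringsAsFactors = FALSE)
}
# close the browser session and stop the Selenium server
myclient$close()
rmDr$server$stop()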

How to write a loop to extract articles from website archive which links to numerous external sources?

I am trying to extract articles for a period of 200 days from the Time.mk archive, e.g. http://www.time.mk/week/2016/22. Each day has top 10 headings, each of which links to all articles related to it (at the bottom of each heading, e.g. "320 поврзани вести", i.e. "320 related news items"). Following this link leads to a list of all related articles.
This is what I've managed so far:
library(rvest)

url <- "http://www.time.mk/week/2016/22"
frontpage <- read_html(url) %>%
  html_nodes(".other_articles") %>%
  html_attr("href") %>%
  paste0()
mark <- "http://www.time.mk/"
frontpagelinks <- paste0(mark, frontpage)
By now I can access the primary links leading to the related news.
The following extracts all of the links to related news for the first heading, after which I filter my data to keep only the links I need.
final <- list()
final <- read_html(frontpagelinks[1]) %>%
  html_nodes("h1 a") %>%
  html_attr("href") %>%
  paste0()
My question is how I can instruct R, whether via a loop or some other option, to extract the links from all 10 headings in frontpagelinks at once. I tried a variety of options but nothing really worked.
Thanks!
EDIT
Parfait's response worked like a charm! Thank you so much.
I've run into an inexplicable issue however after using that solution.
Whereas before, when I was going link by link, I could easily sort out the data for only those portals that I need via:
a1onJune = str_extract_all(dataframe, ".a1on.")
Which provided me with a clean output: [130] "a1on dot mk/wordpress/archives/618719"
with only the links I needed. Now, if I try to run the same code on the larger data frame of all the links, I inexplicably get many variants of this:
"\"alsat dot mk/News/255645\", \"a1on dot mk/wordpress/archives/620944\", , \"http://www dot libertas dot mk/sdsm-poradi-kriminalot-na-vmro-dpmne-makedonija-stana-slepo-tsrevo-na-balkanot/\",
As you can see, it returns my desired link, but also many others (I've edited most of them out for clarity's sake) that occur both before and after it.
Given that I'm using the same expression, I don't see why this would be happening.
Any thoughts?
Simply run lapply to return a list of links for each element of frontpagelinks:
linksList <- lapply(frontpagelinks, function(i) {
  read_html(i) %>%
    html_nodes("h1 a") %>%
    html_attr("href")
})
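Regarding the EDIT: lapply returns a list, so flatten it first and filter the flat character vector with grepl, rather than running str_extract_all on a data frame; a sketch using the a1on pattern from the question:
all_links <- unlist(linksList)
# keep only the links from the portal of interest
a1on_links <- all_links[grepl("a1on", all_links)]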
