I am new to web scraping and I am trying to scrape movie budget data from IMDb. Here is my code:
# remDr is assumed to be an already-connected RSelenium remote driver
budget <- vector()
for (i in 1:50) {
  # reload the results page, because clicking a link navigates away from it
  remDr$navigate('http://www.imdb.com/search/title?sort=moviemeter,asc&start=1&title_type=feature&year=2011,2011')
  webElems <- remDr$findElements('css selector', '.wlb_lite+ a')
  webElems[[i]]$clickElement()
  b <- remDr$findElements('css selector', '.txt-block:nth-child(11)')
  b_text <- unlist(lapply(b, function(x) x$getElementText()))
  if (is.null(b_text)) {
    budget <- c(budget, 'NULL')   # no budget block found on this title page
  } else {
    budget <- c(budget, b_text)   # keep the scraped budget text
  }
  print(b_text)
}
On each page there are 50 movies. I want to click every link one by one and collect the corresponding budget data. If I do not run the code in a loop, it works well, but it always returns 'NULL' when I run it in a loop. I suspect that is because the pages do not load completely inside the loop. I tried the 'setTimeout' and 'setImplicitWaitTimeout' commands, but they did not help. Can anybody help me out?
Try
Sys.sleep(<time in seconds>)
inside each loop iteration instead of setTimeout. That has solved problems like yours for me.
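For illustration, here is a minimal sketch of where the pauses would go in your loop (the 5-second delay is an arbitrary guess; tune it to how long the pages actually take to render):
for (i in 1:50) {
  remDr$navigate('http://www.imdb.com/search/title?sort=moviemeter,asc&start=1&title_type=feature&year=2011,2011')
  Sys.sleep(5)   # give the results page time to finish loading
  webElems <- remDr$findElements('css selector', '.wlb_lite+ a')
  webElems[[i]]$clickElement()
  Sys.sleep(5)   # give the title page time to finish loading
  b <- remDr$findElements('css selector', '.txt-block:nth-child(11)')
  b_text <- unlist(lapply(b, function(x) x$getElementText()))
  budget <- c(budget, if (is.null(b_text)) 'NULL' else b_text)
}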
I am trying to scrape some Indeed job postings for personal use (code below). However, at the moment I have to go to the last page of results to find out its "index" or page number, and only then can I iterate from the first page to the last.
I wanted to make this automatic, so that I only provide the URL and the function takes care of the rest. Could anyone help me out? Also, since I will be scraping a couple hundred pages, I am afraid I will get kicked out, so I want to make sure I keep as much data as possible; that is why I write to a CSV file on every iteration, as in the example below. Is there a better way to do that, too?
Indeed didn't give me an API key, so this is the only method I know. Here is the code:
## sequencing the pages based on the results (here I just did the 1st page to the 5th page)
page_results <- seq(from = 10, to = 50, by = 10)
first_page_url <- "https://www.indeed.com/jobs?q=data+analyst&l=United+States"
for(i in seq_along(page_results)) {
Sys.sleep(1)
url <- paste0(first_page_url, "&start=", page_results[i]) # the second page will have url + "&start=20", and so on
page <- xml2::read_html(url)
####
#bunch of scraping from each page, method for that is implemented already
#....
####
print(i) # prints through the fifth page, so i will print 1 to 5
# I also wanted to write the CSV line by line, so that if an error happens I at least keep everything scraped before the error
# is there anything more efficient than this?
write.table(as.data.frame(i), "i.csv", sep = ",", col.names = !file.exists("i.csv"), append = T)
}
I took this advice and wanted to close this question to reduce the number of open questions, so I answered my own question. Thank you, SO community, for always helping out.
"I think the manual approach where you decide to give the page start and page end makes more sense, and "scraping friendly" because you can control how much pages you want to get (plus respects the company servers). You know after a while you see same job descriptions. So stick with current approach in my opinion. About writing the .csv file per iteration, I think that's fine. Someone better than me should definitely say something. Because I don't have enough knowledge in R yet." – UltaPitt
I'm having some difficulties web scraping. Specifically, I'm scraping web pages that generally have tables embedded. However, for the instances in which there is no embedded table, I can't seem to handle the error in a way that doesn't break the loop.
Example code below:
event = c("UFC 226: Miocic vs. Cormier", "ONE Championship 76: Battle for the Heavens", "Rizin FF 12")
eventLinks = c("https://www.bestfightodds.com/events/ufc-226-miocic-vs-cormier-1447", "https://www.bestfightodds.com/events/one-championship-76-battle-for-the-heavens-1532", "https://www.bestfightodds.com/events/rizin-ff-12-1538")
testLinks = data.frame(event, eventLinks)
for (i in 1:length(testLinks)) {
print(testLinks$event[i])
event = tryCatch(as.data.frame(read_html(testLinks$eventLink[i]) %>% html_table(fill=T)),
error = function(e) {NA})
}
The second link does not have a table embedded. I thought I'd just skip it with my tryCatch, but instead of skipping it, the link breaks the loop.
What I'm hoping to figure out is a way to skip links with no tables, but continue scraping the next link in the list. To continue using the example above, I want the tryCatch to move from the second link onto the third.
Any help? Much appreciated!
There are a few things to fix here. Firstly, your links are stored as factors (you can see this with testLinks %>% sapply(class)), so you'll need to convert them to character using as.character(). I've done this in the code below.
Secondly, you need to assign each scrape to a list element, so we create a list outside the loop with events <- list(), and then assign each scrape to an element of that list inside the loop, i.e. events[[i]] <- "something". Without a list, you'll simply overwrite the first scrape with the second, the second with the third, and so on.
Now your tryCatch will work and assign NA when a URL does not contain a table (there will be no error).
library(rvest)   # provides read_html() and html_table()

events <- list()
for (i in 1:nrow(testLinks)) {
  print(testLinks$event[i])
  events[[i]] <- tryCatch(
    as.data.frame(read_html(as.character(testLinks$eventLinks[i])) %>% html_table(fill = TRUE)),
    error = function(e) NA)
}
events
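As a small follow-up, once the loop finishes you may want to label each element and drop the failed scrapes; a sketch (the event column is used purely for labelling):
names(events) <- as.character(testLinks$event)
scraped <- Filter(is.data.frame, events)   # keep only the links that actually had a table
scraped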
So, here's the current situation:
I have 2000+ lines of R code that produces a couple dozen text files. This code runs in under 10 seconds.
I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste the response into Excel, and finally save them as text files again. This takes hours and is prone to user error.
Another ~600 lines of R code then combines these dozens of text files into a single analysis. This takes a couple of minutes.
I'd like to automate step 2, and I think I'm close; I just can't quite get it to work. Here's some sample code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
The code runs, and every time I've run it "balcoResults" comes back with "Status: 200". Success! EXCEPT the file size is 0...
I don't know where the problem is, but my best guess is that the text block isn't getting filled out before the form is submitted. If I go to the website (http://hess.ess.washington.edu/math/v3/v3_age_in.html) and manually submit an empty form, it produces a blank webpage: pure white, nothing on it.
The problem with this potential explanation (and me fixing the code) is that I don't know why the text block wouldn't be filled out. The results of set_values tells me that "text_block" has 120 characters in it. This is the correct length for textString. I don't know why these 120 characters wouldn't be pasted into the web form.
An alternative possibility is that R isn't waiting long enough to get a response from the website, but this seems less likely because a single sample (as here) runs quickly and the status code of the response is 200.
Yesterday I took the DataCamp course on "Working with Web Data in R." I've explored GET and POST from the httr package, but I don't know how to pick apart the GET response to modify the form and then have POST submit it. I've considered trying the package RSelenium, but according to what I've read, I'd have to download and install a "Selenium Server". This intimidates me, but I could probably do it -- if I was convinced that RSelenium would solve my problem. When I look on CRAN at the function names in the RSelenium package, it's not clear which ones would help me. Without firm knowledge for how RSelenium would solve my problem, or even if it would, this seems like a poor return on the time investment required. (But if you guys told me it was the way to go, and which functions to use, I'd be happy to do it.)
I've explored SO for fixes, but none of the posts that I've found have helped. I've looked here, here, and here, to list three.
Any suggestions?
After two days of thinking, I spotted the problem. I didn't assign the result of the set_values function to a variable (if that's the right R terminology).
Here's the corrected code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
balcoForm <- set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
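If it's useful, once the submission succeeds the server's reply can be pulled out of the returned session; a sketch, assuming the old html_session object stores its httr response as $response (which is what this rvest API did) and that you simply want the returned page's text back:
library(httr)
library(rvest)

reply_html <- content(balcoResults$response, as = "text")   # raw HTML returned by the server
reply_text <- read_html(reply_html) %>% html_text()         # visible text, ready to save or parse further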
I would like to extract the table (table 4) from the URL "http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02". The catch is that I have to use RSelenium.
Now here is the code I am using:
library(RSelenium)   # remDr is an already-connected remote driver
library(XML)         # htmlParse, readHTMLTable

remDr$navigate(URL)
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc)
The above code is not able to extract table 4. However, when I do not use RSelenium, as below, I am able to extract the table easily:
download.file(URL,'quote.html')
doc<-htmlParse('quote.html')
x<-readHTMLTable(doc,which=5)
Please let me know the solution, as I have been stuck on this part for a month now. I appreciate your suggestions.
I think it works fine. The table you were able to get using download.file can also be obtained with RSelenium using the following code:
readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE, which = 6)
Hope that helps!
I found the solution. In my case, I had to first navigate to the inner frame (boxBg1) before I could extract the outer HTML and then use the readHTMLTable function. It works fine now. I will post again if I run into a similar issue in the future.
I'm struggling with more or less the same issue. I'm trying to come up with a solution that doesn't use htmlParse, for example (after navigating to the page):
table <- remDr$findElements(using = "tag name", value = "table")
You might have to use css or xpath on yours; the next step is the part I'm still working on.
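One possible way to finish that approach, sketched under the assumption that you still want to hand the markup to XML::readHTMLTable, is to pull each element's outerHTML out of the driver and parse that:
library(XML)

tableHTML <- table[[1]]$getElementAttribute("outerHTML")[[1]]   # raw HTML of the first matched <table>
tb <- readHTMLTable(htmlParse(tableHTML, asText = TRUE), which = 1)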
I finally got a table downloaded into a nice little data frame. It seems easy once you get it figured out. Using the help page from the XML package:
library(RSelenium)
library(XML)
u <- 'http://www.w3schools.com/html/html_tables.asp'
doc <- htmlParse(u)
tableNodes <- getNodeSet(doc, "//table")
tb <- readHTMLTable(tableNodes[[1]])
I am trying to scrape data from a website which lists the ratings of multiple products. Let's say a category has 800 brands; with 10 brands per page, I will need to scrape data from 80 pages. For example, here is the data for baby care; there are 24 pages' worth of brands that I need: http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D1%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D
The page number (the 1 in page%3D1 near the end) is the only thing that changes in this URL as we move from page to page, so I thought it would be straightforward to write a loop in R. But what I find is that as I move to page 2, the page does not reload; only the results are updated, in about 5 seconds. R does not wait those 5 seconds, and thus I got the data from the first page 26 times.
I also tried entering the page-2 URL directly and running my code without a loop. Same story: I got page 1 results. I am sure I can't be the only one facing this. Any help is appreciated. I have attached the code.
Thanks a million, and I hope my question was clear enough.
# build the URL
library(XML)      # htmlParse, xmlGetAttr, xmlValue
library(selectr)  # querySelector, querySelectorAll

N <- matrix(NA, 26, 15)
R <- matrix(NA, 26, 60)
for (n in 1:26) {
  url <- paste0("http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D", n, "%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D")
  raw.data <- readLines(url)
  Parse <- htmlParse(raw.data)
  #####
  A <- querySelector(Parse, "div.results-container")
  #####
  Name <- querySelectorAll(A, "div.reviews>a")
  Ratings <- querySelectorAll(A, "div.value")
  N[n, ] <- sapply(Name, function(x) xmlGetAttr(x, "href"))
  R[n, ] <- sapply(Ratings, xmlValue)
}
Referring to the HTML source reveals that the URLs you want can be simplified to this structure:
http://www.goodguide.com/products?category_id=152775-baby-care&page=2&sort_order=DESC
The content of these URLs is retrieved by R as expected.
Note that you can also go straight to:
u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
Parse <- htmlParse(u)
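Putting the two together, a sketch of how the original loop could use the simplified URL (the selectors and the 26 x 15 / 26 x 60 matrix dimensions are copied unchanged from the question's code, and assume each page returns exactly 15 links and 60 rating values):
library(XML)
library(selectr)

N <- matrix(NA, 26, 15)
R <- matrix(NA, 26, 60)
for (n in 1:26) {
  u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
  Parse <- htmlParse(u)
  A <- querySelector(Parse, "div.results-container")
  N[n, ] <- sapply(querySelectorAll(A, "div.reviews>a"), function(x) xmlGetAttr(x, "href"))
  R[n, ] <- sapply(querySelectorAll(A, "div.value"), xmlValue)
}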