I'm trying to write a scraper that goes through a list of pages (all from the same site) and either 1. downloads the HTML/CSS from each page, or 2. gets me the links that exist within a list item with a particular class. (For now, my code reflects the former.) I'm doing this in R; Python returned a 403 error on the very first GET request to the site, so BeautifulSoup and Selenium were ruled out. In R, my code works for a time (a rather short one), and then I receive a 403 error, specifically:
"Error in open.connection(x, "rb") : HTTP error 403."
I considered putting a Sys.sleep() timer on each iteration of the loop, but I need to run this nearly 1000 times, so I found that solution impractical. I'm a little stumped as to what to do, particularly since the code does work, but only for a short time before it's halted. I was looking into proxies/headers, but my knowledge of either of these is unfortunately rather limited (although, of course, I'd be willing to learn if anyone has a suggestion involving either of these). Any help would be sincerely appreciated. Here's the code for reference:
for (i in seq_along(data1$Search)) {
  url <- data1$Search[i]
  name <- data1$Name[i]
  download.file(url, destfile = paste0(name, ".html"), quiet = TRUE)
}
where data1 is a two-column data frame with the columns "Search" and "Name". Once again, any suggestions are most welcome. Thank you.
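For what it's worth, here is a hedged sketch of the header approach mentioned above, assuming the 403 is triggered by R's default non-browser User-Agent (the User-Agent string below is purely illustrative), making the request with httr and writing the body to disk:

library(httr)
for (i in seq_along(data1$Search)) {
  url <- data1$Search[i]
  name <- data1$Name[i]
  # send a browser-like User-Agent with each request
  resp <- GET(url, user_agent("Mozilla/5.0 (compatible; my-scraper)"))
  if (status_code(resp) == 200) {
    writeBin(content(resp, as = "raw"), paste0(name, ".html"))
  }
  Sys.sleep(runif(1, 1, 2)) # optional short, randomized pause between requests
}

Whether this avoids the block depends on how the site rate-limits; if it throttles by request volume rather than by User-Agent, only slowing down (or using proxies) will help.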
I am trying to scrape some Indeed job postings for personal use (code below); however, I currently have to go to the last page to find out what its "index" or page number is, and only then can I iterate from the first to the last page.
I wanted to make this automatic, where I only provide the URL and the function takes care of the rest. Could anyone help me out? Also, since I will be scraping a couple hundred pages, I fear that I will get kicked out, so I want to make sure to keep as much data as possible; that's why I am writing to a CSV file on each iteration, as in the example below (see also the sketch after the code). Is there a better way to do that too?
Indeed didn't give me an API key, so this is the only method I know. Here is the code:
## sequencing the pages based on the results (here I just did page 1 to page 5)
page_results <- seq(from = 10, to = 50, by = 10)
first_page_url <- "https://www.indeed.com/jobs?q=data+analyst&l=United+States"
for (i in seq_along(page_results)) {
  Sys.sleep(1)
  url <- paste0(first_page_url, "&start=", page_results[i]) # subsequent pages add &start=10, 20, and so on
  page <- xml2::read_html(url)
  ####
  # bunch of scraping from each page, method for that is implemented already
  # ....
  ####
  print(i) # prints up to the fifth page, so it will print 1 to 5
  # I also wanted to write the CSV line by line so if some error happens I at least keep everything pre-error
  # is there anything more efficient than this?
  write.table(as.data.frame(i), "i.csv", sep = ",", col.names = !file.exists("i.csv"), append = TRUE)
}
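On the incremental-write question, a hedged sketch of a slightly more defensive version: wrap each page read in tryCatch so a single failed request doesn't abort the loop, and append whatever was scraped to the CSV as you go (the file name "results.csv" and the trivial page_df are placeholders; in practice page_df would come from the existing scraping code):

for (i in seq_along(page_results)) {
  Sys.sleep(1)
  url <- paste0(first_page_url, "&start=", page_results[i])
  page <- tryCatch(xml2::read_html(url), error = function(e) NULL)
  if (is.null(page)) next # skip pages that fail to load instead of stopping
  # in practice page_df comes from the existing scraping step;
  # a trivial stand-in is used here so the sketch runs on its own
  page_df <- data.frame(page_index = i, url = url)
  write.table(page_df, "results.csv", sep = ",",
              col.names = !file.exists("results.csv"),
              row.names = FALSE, append = TRUE)
}

Because each iteration flushes its rows to disk, everything scraped before an error (or a ban) is preserved.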
I took this advice and wanted to close this question to reduce the number of open questions, so I answered my own question. Thank you, SO community, for always helping out.
"I think the manual approach, where you decide on the page start and page end, makes more sense and is more "scraping friendly", because you can control how many pages you want to get (plus it respects the company's servers). You know after a while you see the same job descriptions. So stick with the current approach, in my opinion. About writing the .csv file per iteration, I think that's fine. Someone better than me should definitely say something, because I don't have enough knowledge in R yet." – UltaPitt
Dear Stack Overflow users,
I am using R to scrape profiles of a few psychotherapists from Psychology Today; this is done as an exercise to learn more about web scraping.
I am new to R and I have to go through this intense training that will help me with future projects. It means that I might not know precisely what I am doing at the moment (e.g. I might not interpret the script or R's error messages correctly), but I have to get it done. Therefore, I beg your pardon for possible misunderstandings or inaccuracies.
In short, the situation is the following.
I have created a function through which I scrape information from 2 nodes of psychotherapists' profiles; the function is shown in this Stack Overflow post.
Then I create a loop where that function is used on a few psychotherapists' profiles; the loop is in the above post as well, but I report it below because that is the part of the script that generates some problems (in addition to what I solved in the above-mentioned post).
j <- 1
MHP_codes <- c(150140:150180) # therapist identifiers
df_list <- vector(mode = "list", length = length(MHP_codes))
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Reading the HTML code from the website
  URL <- read_html(URL)
  df_list[[j]] <- tryCatch(getProfile(URL),
                           error = function(e) NA)
  j <- j + 1
}
When the loop is done, I bind the information from the different profiles into one data frame and save it:
final_df <- rbind.fill(df_list) # rbind.fill() is from the plyr package
save(final_df, file = "final_df.Rda")
The function (getProfile) works well on individual profiles.
It also works on a small range of profiles (c(150100:150150)).
Please note that I do not know which psychotherapist IDs are actually assigned, so many URLs within the range do not exist.
However, generally speaking, tryCatch should handle this. When a URL is non-existent (and thus the ID is not associated with any psychotherapist), each of the 2 nodes (and thus each of the 2 corresponding variables in my data frame) is empty (i.e. the data frame shows NAs in the corresponding cells).
However, in some ID ranges, two problems can happen.
First, I get an error message such as the following:
Error in open.connection(x, "rb") : HTTP error 404.
So this happens despite the fact that I am using tryCatch, and despite the fact that it generally appears to work (at least until the error message appears).
Moreover, after the loop is stopped and R runs the line:
final_df <- rbind.fill(df_list)
A second message (a warning, this time) appears:
Warning message:
In df[[var]] :
closing unused connection 3 (https://www.psychologytoday.com/us/therapists/illinois/150152)
It seems like there is a specific problem with that one empty URL.
In fact, when I change the ID range, the loop works well despite non-existent URLs: on one hand, when the URL exists, the information is scraped from the website; on the other hand, when the URL does not exist, the 2 variables associated with that URL (and thus with that psychotherapist ID) get an NA.
Is it possible, perhaps, to tell R to skip a URL if it is empty, without recording anything?
This solution would be excellent, since it would shrink the data frame to the existing URLs, but I do not know how to do it, nor whether it would actually solve my problem.
Can anyone help me sort out this issue?
Yes, you need to wrap a tryCatch around the read_html call. This is where R tries to connect to the website, so that is where it will throw an error (as opposed to returning an empty object) if it fails to connect. You can catch that error and then use next to tell R to skip to the next iteration of the loop.
library(rvest)
##Valid URL, works fine
URL <- "https://news.bbc.co.uk"
read_html(URL)
##Invalid URL, error raised
URL <- "https://news.bbc.co.uk/not_exist"
read_html(URL)
##Leads to error
Error in open.connection(x, "rb") : HTTP error 404.
##Invalid URL, catch and skip to next iteration of the loop
URL <- "https://news.bbc.co.uk/not_exist"
tryCatch({
    URL <- read_html(URL)
  },
  error = function(e) {
    print("URL Not Found, skipping")
    next
  })
I would like to thank @Jul for the answer.
Here I post my updated loop:
j <- 1
MHP_codes <- c(150000:150200) # therapist identifiers
df_list <- vector(mode = "list", length = length(MHP_codes))
for (code1 in MHP_codes) {
  delayedAssign("do.next", {next})
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Reading the HTML code from the website
  URL <- tryCatch(read_html(URL),
                  error = function(e) force(do.next))
  df_list[[j]] <- getProfile(URL)
  j <- j + 1
}
final_df <- rbind.fill(df_list)
As you can see, something had to be changed: although the answer from @Jul came close to solving the problem, the loop still stopped, so I had to slightly change the original suggestion.
In particular, I introduced the following line in the loop but outside of the tryCatch call:
delayedAssign("do.next", {next})
And the following expression in the tryCatch error handler:
force(do.next)
This is based on this other Stack Overflow post.
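For reference, a commonly used alternative to the delayedAssign trick (a sketch of my own, not taken from the original posts) is to have tryCatch return NULL on error and do the skipping in the loop body itself, which avoids calling next from inside the handler:

j <- 1
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  page <- tryCatch(read_html(URL), error = function(e) NULL)
  if (is.null(page)) next # non-existent profile: skip without recording anything
  df_list[[j]] <- getProfile(page)
  j <- j + 1
}

With this pattern only profiles that actually loaded are written into df_list, and rbind.fill() simply drops the unused NULL slots.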
So, here's the current situation:
I have 2000+ lines of R code that produces a couple dozen text files. This code runs in under 10 seconds.
I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste the response into Excel, and finally save them as text files again. This takes hours and is prone to user error.
Another ~600 lines of R code then combines these dozens of text files into a single analysis. This takes a couple of minutes.
I'd like to automate step 2, and I think I'm close; I just can't quite get it to work. Here's some sample code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
The code runs, and every time I've run it, "balcoResults" comes back with "Status: 200". Success! EXCEPT that the file size is 0...
I don't know where the problem is, but my best guess is that the text block isn't getting filled out before the form is submitted. If I go to the website (http://hess.ess.washington.edu/math/v3/v3_age_in.html) and manually submit an empty form, it produces a blank webpage: pure white, nothing on it.
The problem with this potential explanation (and with me fixing the code) is that I don't know why the text block wouldn't be filled out. The results of set_values tell me that "text_block" has 120 characters in it, which is the correct length for textString. I don't know why these 120 characters wouldn't be pasted into the web form.
An alternative possibility is that R isn't waiting long enough to get a response from the website, but this seems less likely because a single sample (as here) runs quickly and the status code of the response is 200.
Yesterday I took the DataCamp course on "Working with Web Data in R." I've explored GET and POST from the httr package, but I don't know how to pick apart the GET response to modify the form and then have POST submit it. I've considered trying the RSelenium package, but according to what I've read, I'd have to download and install a "Selenium Server". This intimidates me, but I could probably do it, if I were convinced that RSelenium would solve my problem. When I look on CRAN at the function names in the RSelenium package, it's not clear which ones would help me. Without firm knowledge of how RSelenium would solve my problem, or even whether it would, this seems like a poor return on the time investment required. (But if you told me it was the way to go, and which functions to use, I'd be happy to do it.)
I've explored SO for fixes, but none of the posts that I've found have helped. I've looked here, here, and here, to list three.
Any suggestions?
After two days of thinking, I spotted the problem: I didn't assign the result of the set_values function to a variable (if that's the right R terminology).
Here's the corrected code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
balcoForm <- set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
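To actually capture the calculator's output rather than just the status line, one hedged option, assuming (as with older rvest session objects) that the value returned by submit_form exposes the underlying httr response as $response, is:

library(httr)
# pull the raw text of the response and write it to disk
resultText <- content(balcoResults$response, as = "text")
writeLines(resultText, "balco_results.txt")

If that field is not available in your rvest version, inspecting str(balcoResults) should show where the response body lives.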
I'm trying to scrape a set of news articles using rvest and boilerpipeR. The code works fine most of the time; however, it crashes for some specific URLs. I searched online high and low and could not find anyone with anything similar.
require(RCurl)   # for getURLContent()
require(rvest)
require(stringr)
require(boilerpipeR)
# this is a problematic URL; its duplicates also generate fatal errors
url <- "http://viagem.estadao.com.br/noticias/geral,museu-da-mafia-ganha-exposicao-permanente-da-serie-the-breaking-bad,10000018395"
content_html <- getURLContent(url)              # HTML source code as a character string
article_text <- ArticleExtractor(content_html)  # returns 'NA'
# the next line induces a fatal error
encoded_exit <- read_html(content_html, encoding = "UTF-8")
paragraph <- html_nodes(encoded_exit, "p")
article_text <- html_text(paragraph)
article_text <- iconv(article_text, from = "UTF-8", to = "latin1")
This is not the only news piece for which ArticleExtractor() returns 'NA', and the code was built to handle that as a valid result. This whole snippet is inside a tryCatch(), so regular errors should not be able to stop execution.
The main issue is that the entire R session just crashes and has to be reloaded, which prevents me from grabbing data and debugging it.
What could be causing this issue?
And how can I stop it from crashing the entire R session?
I had the same problem.
Rscript crashes without any error message (session aborted), whether I use 32-bit or 64-bit R.
The solution for me was to look at the URL I was scraping.
If the page has severe mistakes in its HTML syntax, Rscript will crash. It's reproducible. Check the page with https://validator.w3.org.
In your case:
"Error: Start tag body seen but an element of the same type was
already open."
From line 107, column 1; to line 107, column 25
crashed it. So your document had two opening <body> tags. A quick-and-dirty solution for me was to first check whether read_html got valid HTML content:
url = "http://www.blah.de"
page = read_html(url, encoding = "UTF-8")
# check HTML-validity first to prevent fatal crash
if (!grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case=T)) {
print("Skip this Site")
}
# proceed with html_nodes(..) etc
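As an additional safeguard against the whole session being taken down (a sketch of my own, not part of the answer above, and assuming the callr package is an option), the risky parsing can be pushed into a throwaway child R process, so that a crash only kills the child:

library(callr)
# read and parse the page in a separate R process; if that process
# segfaults, the parent just receives a catchable error
extract_paragraphs <- function(url) {
  callr::r(function(u) {
    page <- xml2::read_html(u, encoding = "UTF-8")
    rvest::html_text(rvest::html_nodes(page, "p"))
  }, args = list(u = url))
}
article_text <- tryCatch(extract_paragraphs(url), error = function(e) NA)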
I'm a beginner at web scraping and I'm not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unsuccessful in finding a solution. If it has already been answered somewhere else, I apologize in advance and thank you for your suggestions.
Getting to it. I'm trying to build a script with R that will:
1. Search for specific keywords in a newspaper website;
2. Give me the headlines, dates and contents for the number of results/pages that I desire.
I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read about RCurl and so on, but it still hasn't made much sense to me).
Below is a partial sample of the code I've written so far (scraping only the headlines of the first page to keep it simple).
library(RCurl)
library(XML)
curl <- getCurlHandle()
curlSetOpt(cookiefile = 'cookies.txt', curl = curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
search <- getForm("http://www.washingtonpost.com/newssearch/search.html",
                  .params = list(st = "Dilma Rousseff"),
                  .opts = curlOptions(followLocation = TRUE),
                  curl = curl)
results <- htmlParse(search)
results <- xmlRoot(results)
results <- getNodeSet(results, "//div[@class='pb-feed-headline']/h3")
results <- unlist(lapply(results, xmlValue))
I understand that I could perform the search directly on the website and then inspect the URL for references regarding the page numbers or the number of the news article displayed in each page and, then, use a loop to scrape each different page.
But please bear in mind that after I learn how to go from page 1 to page 2, 3, and so on, I will try to develop my script to perform more searches with different keywords in different websites, all at the same time, so the solution in the previous paragraph doesn't seem the best to me so far.
If you have any other solution to suggest, I will gladly embrace it. I hope I've managed to state my issue clearly enough to get a share of your ideas and maybe help others facing similar issues. I thank you all in advance.
Best regards
First, I'd recommend you use httr instead of RCurl; for most problems it's much easier to use.
library(httr)
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
         query = list(
           st = "Dilma Rousseff"
         ))
stop_for_status(r)
content(r)
Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff",
startat = 10
)
)
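Putting the first two points together, iterating over result pages is then just a loop over startat values (a sketch of my own; the step of 10 and the upper bound are assumptions about the page size and how many pages you want):

pages <- list()
for (startat in seq(0, 90, by = 10)) {
  r <- GET("http://www.washingtonpost.com/newssearch/search.html",
           query = list(st = "Dilma Rousseff", startat = startat))
  stop_for_status(r)
  pages[[length(pages) + 1]] <- content(r) # keep each parsed page for later extraction
  Sys.sleep(1) # be polite between requests
}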
Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:
# devtools::install_github("hadley/rvest")
library(rvest)
page <- html(r)
links <- page[sel(".pb-feed-headline a")]
links["href"]
html_text(links)
I highly recommend reading the selectorgadget tutorial and using that to figure out what css selectors you need.