I have a very long list of websites that I'd like to scrape for their titles, descriptions, and keywords.
I'm using ContentScraper from the Rcrawler package, and I know it works, but there are certain URLs it can't handle, and it just generates the error message below. Is there any way to make it skip those URLs instead of stopping the entire run?
Error: 'NULL' does not exist in current working directory
I've looked at this, but I don't think it has an answer to my problem. Here is the code I'm using; any advice is greatly appreciated.
Web_Info <- ContentScraper(Url = Websites_List,
                           XpathPatterns = c('/html/head/title',
                                             '//meta[@name="description"]/@content',
                                             '//meta[@name="keywords"]/@content'),
                           PatternsName = c("Title", "Description", "Keywords"),
                           asDataFrame = TRUE)
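One possible approach (not from the original post, just a hedged sketch): call ContentScraper on one URL at a time and wrap each call in tryCatch, so a failing URL yields a row of NAs instead of aborting the whole run. The exact shape of what ContentScraper returns for a single URL is an assumption here and may need adjusting.

library(Rcrawler)

# Sketch: scrape one URL at a time so a failure only affects that URL.
# Assumes ContentScraper(..., asDataFrame = TRUE) gives one row per URL;
# on error we substitute a row of NAs instead.
scrape_one <- function(u) {
  tryCatch(
    ContentScraper(Url = u,
                   XpathPatterns = c('/html/head/title',
                                     '//meta[@name="description"]/@content',
                                     '//meta[@name="keywords"]/@content'),
                   PatternsName = c("Title", "Description", "Keywords"),
                   asDataFrame = TRUE),
    error = function(e) data.frame(Title = NA, Description = NA, Keywords = NA)
  )
}

Web_Info <- do.call(rbind, lapply(Websites_List, scrape_one))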
I'm trying to write a scraper that goes through a list of pages (all from the same site) and either 1. downloads the HTML/CSS from each page, or 2. collects the links that sit inside a list item with a particular class. (For now, my code does the former.) I'm doing this in R; Python returned a 403 error on the very first GET request to the site, so BeautifulSoup and Selenium were ruled out. In R, my code works for a time (a rather short one), and then I receive a 403 error, specifically:
"Error in open.connection(x, "rb") : HTTP error 403."
I considered putting a Sys.sleep() timer on each item in the loop, but I need to run this nearly 1000 times, so I found that solution impractical. I'm a little stumped as to what to do, particularly since the code does work, but only for a short time before it's halted. I was looking into proxies/headers, but my knowledge of either of these is unfortunately rather limited (although, of course, I'd be willing to learn if anyone has a suggestion involving either of these). Any help would be sincerely appreciated. Here's the code for reference:
for (i in seq_along(data1$Search)) {
  url  <- data1$Search[i]
  name <- data1$Name[i]
  download.file(url, destfile = paste0(name, ".html"), quiet = TRUE)
}
where data1 is a two-column dataframe with the columns "Search" and "Name". Once again, any suggestions are very welcome. Thank you.
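Not part of the original question, but since headers came up: a minimal sketch of the same loop using httr with a browser-like User-Agent, which is one common way around 403 responses (whether it helps depends entirely on the site). The UA string and delay below are placeholders.

library(httr)  # GET(), user_agent(), write_disk(), http_error()

# Hypothetical sketch: send a custom User-Agent and write each response
# body straight to disk.
ua <- user_agent("Mozilla/5.0 (compatible; my-scraper)")

for (i in seq_along(data1$Search)) {
  resp <- GET(data1$Search[i], ua,
              write_disk(paste0(data1$Name[i], ".html"), overwrite = TRUE))
  if (http_error(resp)) warning("Request failed for: ", data1$Search[i])
  Sys.sleep(1)  # small politeness delay; tune or drop as needed
}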
I am having an issue trying to scrape data from a site using R. I am trying to scrape the first table from the following webpage:
http://racing-reference.info/race/2016_Daytona_500/W
I looked at a lot of the threads about this, but I can't figure out how to make it work, most likely because I don't know HTML or much about it.
I have tried lots of different things with the code, and I keep getting the same error:
Error: failed to load HTTP resource
Here is what I have now:
library(RCurl)
library(XML)
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
doc <- htmlTreeParse(URL, useInternalNodes = TRUE)
If possible could you explain why whatever the solution is works and why what I have is giving an error? Thanks in advance.
Your sample code specifically included RCurl, but did not use it. You needed to. I think that you will get what you want from:
URL <- "http://racing-reference.info/race/2016_Daytona_500/W"
Content <- getURL(URL)
doc <- htmlTreeParse(Content, useInternalNodes = TRUE)
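Since the goal was the first table on the page, one possible next step (not in the original answer) is XML::readHTMLTable, assuming the table you want is the first one the parser finds:

library(XML)

# Sketch: pull every <table> from the parsed document, then take the first.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
first_table <- tables[[1]]  # assumption: the desired table is parsed first
head(first_table)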
I'm trying to scrape a website using R. However, I cannot get all the information from the website for an unknown reason. I found a workaround by first downloading the complete webpage ("Save as" from the browser). I was wondering whether it would be possible to download complete webpages from within R using some function.
I tried "download.file" and "htmlParse", but they seem to only download the source code.
url = "http://www.tripadvisor.com/Hotel_Review-g2216639-d2215212-Reviews-Ayurveda_Kuren_Maho-Yapahuwa_North_Western_Province.html"
download.file(url , "webpage")
doc <- htmlParse(urll)
ratings = as.data.frame(xpathSApply(doc,'//div[#class="rating reviewItemInline"]/span//#alt'))
This worked with rvest on the first go.
library(rvest); library(plyr); library(stringi)  # html()/html_nodes(), llply(), stri_replace_all_regex()

llply(html(url) %>% html_nodes('div.rating.reviewItemInline'), function(i)
  data.frame(nth_stars = html_nodes(i, 'img') %>% html_attr('alt'),
             date_var  = html_text(i) %>% stri_replace_all_regex('(\n|Reviewed)', '')))
I'm a beginner at web scraping and I'm not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unsuccessful in finding a solution. If it has already been answered somewhere else, I apologize in advance and thank you for your suggestions.
Getting to it. I'm trying to build a script with R that will:
1. Search for specific keywords in a newspaper website;
2. Give me the headlines, dates and contents for the number of results/pages that I desire.
I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read about RCurl and so on, but it still hasn't made much sense to me).
Below is a partial sample of the code I've written so far (scraping only the headlines from the first page to keep it simple).
library(RCurl)  # getCurlHandle(), curlSetOpt(), getForm()
library(XML)    # htmlParse(), getNodeSet(), xmlValue()

curl <- getCurlHandle()
curlSetOpt(cookiefile = 'cookies.txt', curl = curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

search <- getForm("http://www.washingtonpost.com/newssearch/search.html",
                  .params = list(st = "Dilma Rousseff"),
                  .opts = curlOptions(followLocation = TRUE),
                  curl = curl)

results <- htmlParse(search)
results <- xmlRoot(results)
results <- getNodeSet(results, "//div[@class='pb-feed-headline']/h3")
results <- unlist(lapply(results, xmlValue))
I understand that I could perform the search directly on the website, inspect the URL for the page number or the index of the first article displayed on each page, and then use a loop to scrape each page.
But please bear in mind that after I learn how to go from page 1 to page 2, 3, and so on, I will try to develop my script to perform more searches with different keywords in different websites, all at the same time, so the solution in the previous paragraph doesn't seem the best to me so far.
If you have any other solution to suggest, I will gladly embrace it. I hope I've stated my issue clearly enough to get a share of your ideas and maybe help others facing similar issues. I thank you all in advance.
Best regards
First, I'd recommend you use httr instead of RCurl - for most problems it's much easier to use.
library(httr)

r <- GET("http://www.washingtonpost.com/newssearch/search.html",
  query = list(
    st = "Dilma Rousseff"
  )
)
stop_for_status(r)
content(r)
Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff",
startat = 10
)
)
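For illustration only (not from the original answer): a minimal sketch of looping over result pages by stepping startat, assuming it advances in steps of 10 as the page-2 example suggests.

library(httr)

# Hypothetical sketch: fetch the first five result pages by stepping the
# assumed startat offset (0, 10, 20, ...).
pages <- lapply(seq(0, 40, by = 10), function(offset) {
  r <- GET("http://www.washingtonpost.com/newssearch/search.html",
           query = list(st = "Dilma Rousseff", startat = offset))
  stop_for_status(r)
  content(r)
})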
Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:
# devtools::install_github("hadley/rvest")
library(rvest)
page <- html(r)
links <- html_nodes(page, ".pb-feed-headline a")
html_attr(links, "href")
html_text(links)
I highly recommend reading the selectorgadget tutorial and using that to figure out which CSS selectors you need.
I'm trying to pull tweets using the twitteR package, but I'm having an issue getting them through the searchTwitter function when I specify a geocode the way they have it in their docs. Please see code below:
#Oauth code (successful authentication)
keyword = "the"
statuses = searchTwitter(keyword, n=100, lang="en",sinceID = NULL, geocode="39.312957, -76.618119, 10km",retryOnRateLimit=10)
The code works perfectly when I leave out geocode="39.312957, -76.618119, 10km", but when I include it, I get the following:
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
100 tweets were requested but the API can only return 0
I thought maybe my formatting was wrong, but based on the twitteR CRAN page the string is in the right format (I also tried switching between km and mi).
Has anyone else experienced this, or does anyone know a better way to search by geocode? Could they have deprecated the geocode functionality?
I'm looking for tweets from Baltimore so if there is a better way to do so, I'm all ears. (By the way, I want to avoid trying to pull all tweets and then filter myself because I think I will hit the data limit fairly quickly and miss out on what I'm looking for)
Thanks!
I believe you need to remove the spaces in the geocode parameter:
statuses = searchTwitter(keyword, n=100, lang="en",sinceID = NULL, geocode="39.312957,-76.618119,10km",retryOnRateLimit=10)
FWIW, you can use the Twitter desktop client's "Develop" console to test out URLs before committing them to scripts.
Had the same issue. Your parameters are in the correct order, but you must avoid any whitespace within the geocode. Also, 10km might be too small a radius for the accuracy of the coordinates given; you might want to try 12mi.