I'm a beginner in web scraping and I'm not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unsuccessful in finding a solution. If it has already been answered somewhere else, I apologize in advance and thank you for your suggestions.
Getting to it. I'm trying to build a script with R that will:
1. Search for specific keywords in a newspaper website;
2. Give me the headlines, dates and contents for the number of results/pages that I desire.
I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read about RCurl and so on, but it still hasn't made much sense to me).
Below is a partial sample of the code I've written so far (scraping only the headlines of the first page to keep it simple).
library(RCurl)
library(XML)

# curl handle that stores cookies and follows redirects
curl <- getCurlHandle()
curlSetOpt(cookiefile = "cookies.txt", curl = curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

# submit the search form
search <- getForm("http://www.washingtonpost.com/newssearch/search.html",
                  .params = list(st = "Dilma Rousseff"),
                  .opts = curlOptions(followLocation = TRUE),
                  curl = curl)

# parse the result page and extract the headline text
results <- htmlParse(search)
results <- xmlRoot(results)
results <- getNodeSet(results, "//div[@class='pb-feed-headline']/h3")
results <- unlist(lapply(results, xmlValue))
I understand that I could perform the search directly on the website, inspect the URL for references to the page number or to the index of the first article displayed on each page, and then use a loop to scrape each page.
But please bear in mind that after I learn how to go from page 1 to pages 2, 3, and so on, I will try to develop my script to perform more searches with different keywords on different websites, all at the same time, so the solution in the previous paragraph doesn't seem like the best one to me so far.
If you have any other solution to suggest, I will gladly embrace it. I hope I've managed to state my issue clearly, so I can get a share of your ideas and maybe help others facing similar issues. I thank you all in advance.
Best regards
First, I'd recommend you use httr instead of RCurl - for most problems it's much easier to use.
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
  query = list(
    st = "Dilma Rousseff"
  )
)
stop_for_status(r)
content(r)
Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
  query = list(
    st = "Dilma Rousseff",
    startat = 10
  )
)
Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:
# devtools::install_github("hadley/rvest")
library(rvest)
page <- html(r)
links <- page[sel(".pb-feed-headline a")]
links["href"]
html_text(links)
I highly recommend reading the selectorgadget tutorial and using that to figure out what css selectors you need.
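Putting the pieces above together, a minimal sketch of the full pagination loop might look like the following. This is only an illustration: the ".pb-feed-headline a" selector and the step size of 10 are taken from the answer above as assumptions about the site's markup, and it uses the released httr/rvest function names (read_html, html_nodes, html_text) rather than the experimental ones shown earlier.
library(httr)
library(rvest)

# Loop over the first five result pages via the startat parameter and
# collect the headline text from each one.
headlines <- lapply(seq(0, 40, by = 10), function(start) {
  r <- GET("http://www.washingtonpost.com/newssearch/search.html",
           query = list(st = "Dilma Rousseff", startat = start))
  stop_for_status(r)
  page <- read_html(content(r, as = "text"))
  # ".pb-feed-headline a" is assumed from the answer above; check it with
  # selectorgadget if the markup has changed
  html_text(html_nodes(page, ".pb-feed-headline a"))
})
headlines <- unlist(headlines)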
Related
I am trying to scrape some Indeed job postings for personal use (code below). However, I currently have to go to the last page manually to find out its "index" or page number, and only then can I iterate from the first page to the last.
I wanted to make this automatic, where I only provide the URL and the function takes care of the rest (a sketch of what I mean follows the code below). Could anyone help me out? Also, since I will be scraping a couple hundred pages, I fear that I will get kicked out, so I want to make sure I keep as much data as possible along the way; that is why I am writing to a CSV file on each iteration, as in the example below. Is there a better way to do that too?
Indeed didn't give me an API key so this is the only method I know. Here is the code:
## sequencing the pages based on the results (here I only did five pages)
page_results <- seq(from = 10, to = 50, by = 10)
first_page_url <- "https://www.indeed.com/jobs?q=data+analyst&l=United+States"
for(i in seq_along(page_results)) {
  Sys.sleep(1)
  url <- paste0(first_page_url, "&start=", page_results[i])  # each page adds 10 to the &start= offset
  page <- xml2::read_html(url)
  ####
  # bunch of scraping from each page; the method for that is implemented already
  # ....
  ####
  print(i)  # prints the page index, 1 through 5
  # I also wanted to write the CSV line by line, so if some error happens I at least keep everything scraped before the error.
  # Is there anything more efficient than this?
  write.table(as.data.frame(i), "i.csv", sep = ",", col.names = !file.exists("i.csv"), append = T)
}
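For context, a sketch of the "automatic" version being asked about: read the first page, pull the total result count, and derive the page offsets from it. The "#searchCountPages" selector and the "Page 1 of N jobs" wording are assumptions about Indeed's markup at the time, so treat this as an illustration to adapt rather than working code.
library(rvest)

first_page_url <- "https://www.indeed.com/jobs?q=data+analyst&l=United+States"
first_page <- xml2::read_html(first_page_url)

# "#searchCountPages" and the "Page 1 of N jobs" text are assumptions about
# Indeed's markup; inspect the page and adjust the selector and the regex
count_text <- html_text(html_node(first_page, "#searchCountPages"))
total_jobs <- as.numeric(gsub("[^0-9]", "", sub(".*of", "", count_text)))

# Indeed shows 10 results per page, so the offsets run 0, 10, 20, ...
page_results <- seq(from = 0, to = total_jobs - 1, by = 10)
Writing the CSV on each iteration is reasonable; appending the scraped rows themselves (rather than just i) with append = TRUE keeps everything collected up to the point of a failure.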
I took this advice and wanted to close this out to reduce the number of open questions, so I answered my own question. Thank you, SO community, for always helping out.
"I think the manual approach where you decide to give the page start and page end makes more sense, and "scraping friendly" because you can control how much pages you want to get (plus respects the company servers). You know after a while you see same job descriptions. So stick with current approach in my opinion. About writing the .csv file per iteration, I think that's fine. Someone better than me should definitely say something. Because I don't have enough knowledge in R yet." – UltaPitt
Note: I haven't asked a question here before and am still not sure how to make this legible, so let me know about any confusion or share tips on making this more readable.
I'm trying to download user information from the 2004/06 to 2004/09 Internet Archive captures of makeoutclub.com (a wacky, now-defunct social network targeted toward alternative music fans, which was created in ~2000, making it one of the oldest profile-based social networks on the Internet) using R,* specifically the rcrawler package.
So far, I've been able to use the package to get the usernames and profile links into a dataframe, using xpath to identify the elements I want, but somehow it doesn't work for either the location or the interests sections of the profiles, both of which are just text rather than other elements in the html. For an idea of the site/data I'm talking about, here's the page I've been testing my xpath on: https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html
I have been testing out my xpath expressions using rcrawler's ContentScraper function, which extracts the set of elements matching the specified xpath from one specific page of the site you need to crawl. Here is my functioning expression that identifies the usernames and links on the site, with the specific page I'm using specified, and returns a vector:
testwaybacktable <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = c("//tr[1]/td/font/a[1]/@href", "//tr[1]/td/font/a[1]"), ManyPerPattern = TRUE)
And here is the bad one, where I'm testing the "location," which ends up returning an empty vector
testwaybacklocations <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[1]/td/font/text()[2]", ManyPerPattern = TRUE)
And the other bad one, this one looking for the text under "interests":
testwaybackint <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[2]/td/font/text()", ManyPerPattern = TRUE)
The xpath expressions I'm using here seem to select the right elements when I try them in the Chrome inspector, but the program doesn't seem to read them. I also tried selecting only one element for each field, and it still produced an empty vector. I know that this tool can read text on this webpage (I tested another random piece of text), but somehow I'm getting nothing when I run this test.
Is there something wrong with my xpath expression? Should I be using different tools to do this?
Thanks for your patience!
*This is for a digital humanities project that will hopefully use some NLP to analyze language around gender and sexuality in particular, in dialogue with an NLP analysis of the lyrics of the most popular bands on the site.
A late answer, but maybe it will help nonetheless. Also, I am not sure about the whole TOS question, but I think that's yours to figure out. Long story short... I will just try to address the technical aspects of your problem ;)
I am not familiar with the rcrawler package. Usually I use rvest for web scraping, and I think it is a good choice. To achieve the desired output you would have to use something like this:
# parameters
url <- your_url
xpath_pattern <- your_pattern

# get the data
wp <- xml2::read_html(url)

# extract whatever you need
res <- rvest::html_nodes(wp, xpath = xpath_pattern)
I think it is not possible to use a vector with multiple elements as the pattern argument, but you can run html_nodes separately for each pattern you want to extract.
I think the first two URLs/patterns should work this way. The pattern in your last URL seems to be wrong somehow. If you want to extract the text inside the tables, it should probably be something like "//tr[2]/td/font/text()[2]".
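Applied to the archived page from the question, a minimal sketch might look like this. It assumes the Wayback Machine capture still serves the same markup; also note that the tbody elements in the question's patterns are usually inserted by the browser's inspector and often do not exist in the raw HTML, which is a common reason such expressions return nothing:
library(rvest)

url <- "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html"
wp <- xml2::read_html(url)

# usernames and profile links (the working patterns from the question)
user_names <- rvest::html_text(rvest::html_nodes(wp, xpath = "//tr[1]/td/font/a[1]"))
user_links <- rvest::html_attr(rvest::html_nodes(wp, xpath = "//tr[1]/td/font/a[1]"), "href")

# free text inside the profile tables, e.g. the pattern suggested above;
# adjust the indices if the text does not line up with the fields you want
locations <- rvest::html_text(rvest::html_nodes(wp, xpath = "//tr[2]/td/font/text()[2]"))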
So, here's the current situation:
1. I have 2000+ lines of R code that produces a couple dozen text files. This code runs in under 10 seconds.
2. I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste the response into Excel, and finally save them as text files again. This takes hours and is prone to user error.
3. Another ~600 lines of R code then combines these dozens of text files into a single analysis. This takes a couple of minutes.
I'd like to automate step 2, and I think I'm close; I just can't quite get it to work. Here's some sample code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
The code runs and every time I've done it "balcoResults" comes back with "Status: 200". Success! EXCEPT the file size is 0...
I don't know where the problem is, but my best guess is that the text block isn't getting filled out before the form is submitted. If I go to the website (http://hess.ess.washington.edu/math/v3/v3_age_in.html) and manually submit an empty form, it produces a blank webpage: pure white, nothing on it.
The problem with this potential explanation (and with me fixing the code) is that I don't know why the text block wouldn't be filled out. The result of set_values tells me that "text_block" has 120 characters in it. This is the correct length for textString. I don't know why these 120 characters wouldn't be pasted into the web form.
An alternative possibility is that R isn't waiting long enough to get a response from the website, but this seems less likely because a single sample (as here) runs quickly and the status code of the response is 200.
Yesterday I took the DataCamp course on "Working with Web Data in R." I've explored GET and POST from the httr package, but I don't know how to pick apart the GET response to modify the form and then have POST submit it. I've considered trying the RSelenium package, but according to what I've read, I'd have to download and install a "Selenium Server". This intimidates me, but I could probably do it, if I were convinced that RSelenium would solve my problem. When I look on CRAN at the function names in the RSelenium package, it's not clear which ones would help me. Without firm knowledge of how RSelenium would solve my problem, or even whether it would, this seems like a poor return on the time investment required. (But if you told me it was the way to go, and which functions to use, I'd be happy to do it.)
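For reference, a rough sketch of the direct httr route mentioned above. The summary and text_block field names come from the rvest code earlier in the question, while the way the form's action URL is looked up is an assumption about how rvest stores the parsed form, so verify both against the actual page:
library(httr)
library(rvest)
library(xml2)

url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"

# read the form action off the page instead of hard-coding it;
# the slot name differs across rvest versions (an assumption, check str(form))
form <- html_form(read_html(url))[[1]]
action <- if (!is.null(form$action)) form$action else form$url

# textString as defined in the snippet above
r <- POST(action,
          body = list(summary = "no", text_block = textString),
          encode = "form")
stop_for_status(r)
content(r, as = "text")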
I've explored SO for fixes, but none of the posts that I've found have helped. I've looked here, here, and here, to list three.
Any suggestions?
After two days of thinking, I spotted the problem. I didn't assign the result of the set_values function to a variable (if that's the right R terminology).
Here's the corrected code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
balcoForm <- set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
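To capture the website's output for the downstream analysis step, one option is to pull the response body out of the session and write it to disk. This is a sketch: the $response slot is an assumption about how older rvest session objects are structured, so check str(balcoResults) first.
library(httr)

# assumption: the submit_form() session keeps the underlying httr response in $response
response_text <- content(balcoResults$response, as = "text")
writeLines(response_text, "balco_output.html")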
I am fairly new to R and am having trouble with pulling data from the Forbes website.
My current code is:
library(XML)
url = "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
data = readHTMLTable(url)
However, when I change the page # in the url from 1 to 2 (or to any other number), the data that is pulled is the same data from page 1. For some reason R does not pull the data from the correct page. If you manually paste the link into the browser with a specific page #, then it works fine.
Does anyone have an idea as to why this is happening?
Thanks!
This appears to be an issue caused by the URL fragment, which is the part introduced by the pound sign. A fragment essentially names an anchor on the page and directs your browser to jump to that particular location; it is handled entirely by the browser and is never sent to the server.
You might be having this trouble because readHTMLTable() does not (and cannot) act on URL fragments: the request it sends is the same no matter what comes after the #, so the server keeps returning the first page. See if you can find a version of the same table that does not require # in the URL.
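A quick way to see this (a small illustration, not part of the original answer): httr's parse_url() shows that everything after the # ends up in the fragment component, which never becomes part of the HTTP request, so both page URLs fetch the same document.
library(httr)

u1 <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc"
u2 <- "http://www.forbes.com/global2000/list/#page:2_sort:0_direction:asc"

# the scheme, host, and path are identical; only the fragment differs,
# and the fragment is not sent to the server
identical(parse_url(u1)$path, parse_url(u2)$path)  # TRUE
parse_url(u2)$fragment                             # "page:2_sort:0_direction:asc"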
Here are some helpful links that might shed light on what you are experiencing:
What is it when a link has a pound "#" sign in it
https://support.microsoft.com/kb/202261/en-us
If I come across anything else that's helpful, I'll share it in follow-up comments.
What you might need to do is use the URLencode() function in R.
kdb.url <- "http://m1:5000/q.csv?select from data0 where folio0 = `KF"
kdb.url <- URLencode(kdb.url)
df <- read.csv(kdb.url, header=TRUE)
You might have meta-characters in your URL too. (Mine are the spaces and the backtick.)
>kdb.url
[1] "http://m1:5000/q.csv?select%20from%20data0%20where%20folio0%20=%20%60KF"
They think of everything those R guys.
I'm trying to pull tweets using the twitteR package, but I'm having an issue getting them through the searchTwitter function when I specify a geocode the way they have it in their docs. Please see code below:
#Oauth code (successful authentication)
keyword = "the"
statuses = searchTwitter(keyword, n=100, lang="en",sinceID = NULL, geocode="39.312957, -76.618119, 10km",retryOnRateLimit=10)
The code works perfectly when I leave out geocode="39.312957, -76.618119, 10km", but when I include it, I get the following:
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
100 tweets were requested but the API can only return 0
I thought maybe my formatting was wrong but based on the twitteR CRAN page the string is in the right format (I also tried switching between km and mi).
Has anyone else experienced this, or does anyone know a better way to search for a specific geocode? Could they have deprecated the geocode functionality?
I'm looking for tweets from Baltimore so if there is a better way to do so, I'm all ears. (By the way, I want to avoid trying to pull all tweets and then filter myself because I think I will hit the data limit fairly quickly and miss out on what I'm looking for)
Thanks!
I believe you need to remove the spaces in the geocode parameter:
statuses = searchTwitter(keyword, n=100, lang="en",sinceID = NULL, geocode="39.312957,-76.618119,10km",retryOnRateLimit=10)
FWIW, you can use the Twitter desktop client's "Develop" console to test out URLs before committing them to scripts.
Had the same issue. Your parameters are in the correct order, but you must avoid any whitespace within the geocode. Also, 10km might be too small a radius for the accuracy of the given coordinates; you might want to try 12mi.
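Combining both suggestions, the corrected call would look something like this (whitespace removed from the geocode and the wider 12mi radius suggested above); twListToDF() then turns the result into a data frame for easier filtering:
library(twitteR)

keyword <- "the"  # as in the question
statuses <- searchTwitter(keyword, n = 100, lang = "en",
                          geocode = "39.312957,-76.618119,12mi",
                          retryOnRateLimit = 10)

# convert the returned status objects to a data frame
tweets_df <- twListToDF(statuses)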