Scrape the Data with RCurl - r

I want to crawl some data from the following URL using RCurl and XML.
http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?&lang=
The date range is from "2000-06-05" to "2013-12-30", so there are more than 10000 pages.
These are the elements on the page associated with the data:
<form name="report1_turnPageForm" method=post
action="http://datacenter.mep.gov.cn:80/.../air.../air_dairy.jsp..." style="display:none">
<input type=hidden name=reportParamsId value=122169>
<input type=hidden name=report1_currPage value="1">
<input type=hidden name=report1_cachedId value=53661>
</form>
and the link also looks like this
http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=2013-12-15&enddate=2013-12-30&page=31
so there are startdate, enddate, and page parameters.
Then I began to crawl the web:
require(RCurl)
require(XML)
k = postForm("http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?&lang=")
k = iconv(k, 'gbk', 'utf-8')
k = htmlParse(k, asText = TRUE, encoding = 'utf-8')
Then I don't know what to do next, and I'm not sure whether I'm on the right track.
I also tried this:
k = sapply(getNodeSet(doc = k, path = "//font[@color='#0000FF' and @size='2']"),
           xmlValue)[1:24]
It doesn't work...
Could you give me some suggestions? Thanks a lot!
Scrapy and BeautifulSoup solutions are also strongly welcomed!

If XML is sufficient, maybe this would be a starting point:
require(XML)
url <- "http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=2013-12-15&enddate=2013-12-30&page=%d"
pages <- 2
tabs <- vector("list", length=pages)
for (page in 1:pages) {
  doc <- htmlParse(paste(suppressWarnings(readLines(sprintf(url, page),
                                                    encoding = "UTF-8")),
                         collapse = "\n"))
  tabs[[page]] <- readHTMLTable(doc,
                                header = TRUE,
                                which = 4) # readHTMLTable(doc)[["report1"]]
}
do.call(rbind.data.frame, tabs) # output
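If you wanted to push the same approach over the whole range the question mentions, one option is to generate the startdate/enddate windows yourself and keep the rest of the loop unchanged. A rough sketch, untested against the live site: the 15-day window, the single page per window, and the which = 4 table index are assumptions carried over from the code above, not facts about the site.
require(XML)
base   <- "http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=%s&enddate=%s&page=%d"
starts <- seq(as.Date("2000-06-05"), as.Date("2013-12-30"), by = "15 days")
ends   <- pmin(starts + 14, as.Date("2013-12-30"))
tabs   <- vector("list", length(starts))
for (i in seq_along(starts)) {
  u   <- sprintf(base, starts[i], ends[i], 1)  # page 1 of this date window (assumed)
  doc <- htmlParse(paste(suppressWarnings(readLines(u, encoding = "UTF-8")),
                         collapse = "\n"))
  tabs[[i]] <- readHTMLTable(doc, header = TRUE, which = 4)
}
result <- do.call(rbind.data.frame, tabs)
In practice you would still need to read the real page count for each window from the pager on the page itself.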

Related

Filling and submit search with rvest in R

I am learning how to fill and submit forms with rvest in R, and I got stuck when I wanted to search for the ggplot tag on Stack Overflow. This is my code:
url<-"https://stackoverflow.com/questions"
(session<-html_session("https://stackoverflow.com/questions"))
(form<-html_form(session)[[2]])
(filled_form<-set_values(form, tagQuery = "ggplot"))
searched<-submit_form(session, filled_form)
I've got the error:
Submitting with '<unnamed>'
Error in parse_url(url) : length(url) == 1 is not TRUE
Following this question (rvest error on form submission), I tried several things to solve it, but I couldn't:
filled_form$fields[[13]]$name<-"submit"
filled_form$fields[[14]]$name<-"submit"
filled_form$fields[[13]]$type<-"button"
filled_form$fields[[14]]$type<-"button"
Any help, guys?
The search query is in html_form(session)[[1]].
As there is no submit button in this form:
<form> 'search' (GET /search)
  <input text> 'q':
this workaround seems to work:
<form> 'search' (GET /search)
  <input text> 'q':
  <input submit> '':
Giving the following code sequence:
library(rvest)
url<-"https://stackoverflow.com/questions"
(session<-html_session("https://stackoverflow.com/questions"))
(form<-html_form(session)[[1]])
fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
form[["fields"]][["submit"]] <- fake_submit_button
(filled_form<-set_values(form, q = "ggplot"))
searched<-submit_form(session, filled_form)
The problem is that the reply has a captcha:
searched$url
[1] "https://stackoverflow.com/nocaptcha?s=7291e7e6-9b8b-4b5f-bd1c-0f6890c23573"
You won't be able to handle this with rvest, but after clicking manually on the captcha you get the query you're looking for:
https://stackoverflow.com/search?q=ggplot
Probably much easier to use my other answer with:
read_html(paste0('https://stackoverflow.com/search?tab=newest&q=',search))
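For completeness, a small sketch of that simpler route: skip the form entirely, hit the search URL directly, and pull something out of the result. The a.question-hyperlink selector is an assumption about Stack Overflow's markup and may need adjusting.
library(rvest)
search <- "ggplot"
page   <- read_html(paste0("https://stackoverflow.com/search?tab=newest&q=", search))
# grab the result titles; the selector is a guess and may change with the site layout
titles <- html_text(html_nodes(page, "a.question-hyperlink"))
head(titles)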

How do I scrape pages with URL in the forms of page=0%2C0, page=0%2C1, page=0%2C2... (Using R)

I'm pretty new to using R and have a question about web scraping multiple pages with a different form of page URL.
How should I scrape multiple pages with a for loop when the pages are in the form of page=0%2C0, page=0%2C1, page=0%2C2, ...?
I tried using the modify_url function to add a number after "0%2C", but the result is always the first page.
Below is the code I wrote:
library(xml2)
library(stringr)
library(httr)
list.url <- 'http://www.epilepsy.com/connect/forums/living-epilepsy-adults?page=0%2C'
post.num <- 1
a<-"0%2C"
for(page.num in 1:10){
  h = read_html(modify_url(list.url, query=list(page=(sprintf('a', page.num)))))
  article_list <- html_nodes(h, 'span.field-content')
  link <- html_nodes(h, 'span.field-content a')
  html_attr(link, 'href')
  article_href <- unique(html_attr(link, 'href'))
  for(link in article_href){
    link = sprintf('http://www.epilepsy.com%s', link)
    print(link)
    h = read_html(link)
    # extracting title
    title = html_text(html_node(h, 'div.panel-pane.pane-node-title.no-title.block'))
    title <- str_trim(title)
    str_replace_all(title, '[[:space:]]', '')
    print(title)
    # extracting contents
    content = html_text(html_node(h, 'div.field-item.even p'))
    print(content)
    dataf[post.num, 'content'] = content
    # add on post numbers
    post.num <- post.num + 1
  }
}
Thanks in advance.
modify_url is more for working with abstract parts of URLs: queries, ports, and so on. You're better off editing the URL as a string (like you describe), so paste0 fits the bill.
Replace the line
h = read_html(modify_url(list.url, query=list(page=(sprintf('a', page.num)))))
with
h = read_html(paste0(list.url, page.num))
I ran the modified code (after commenting out the dataf line, since you didn't provide that part), and a lot of different stuff was printed to the console. So it seems to work.
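To see the difference for yourself, compare what the two approaches produce for this URL (a quick illustration, nothing more):
library(httr)
list.url <- 'http://www.epilepsy.com/connect/forums/living-epilepsy-adults?page=0%2C'
# plain string concatenation keeps the "0%2C" prefix intact:
paste0(list.url, 3)
# "http://www.epilepsy.com/connect/forums/living-epilepsy-adults?page=0%2C3"
# modify_url() rebuilds the query string, so the "page=0%2C..." pattern is lost:
modify_url(list.url, query = list(page = 3))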

NodeSet as character

I want to get a NodeSet, with the getNodeSet function from the XML package, and write it as text in a file.
For example :
> getNodeSet(htmlParse("http://www.google.fr/"), "//div[@id='hplogo']")[[1]]
<div title="Google" align="left" id="hplogo" onload="window.lol&&lol()" style="height:110px;width:276px;background:url(/images/srpr/logo9w.png) no-repeat">
<div nowrap="" style="color:#777;font-size:16px;font-weight:bold;position:relative;top:70px;left:218px">France</div>
</div>
I want to save this whole node unchanged in a file.
The problem is that we can't write the object directly with:
writeLines(getNodeSet(...), file)
And as.character(getNodeSet(...)) returns a C pointer.
How can I do this ? Thank you.
To save an XML object to a file, use saveXML, e.g.,
url = "http://www.google.fr/"
nodes = getNodeSet(htmlParse(url), "//div[@id='hplogo']")[[1]]
fl <- saveXML(nodes, tempfile())
readLines(fl)
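If there are several matching nodes rather than one, saveXML can also be applied per node: called without a file argument it returns the node serialized as a character string, which writeLines can then write out. A small sketch along the lines of the answer above (the output file name is arbitrary):
library(XML)
url   <- "http://www.google.fr/"
nodes <- getNodeSet(htmlParse(url), "//div[@id='hplogo']")
txt   <- sapply(nodes, saveXML)   # one string per node
writeLines(txt, "hplogo.html")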
There has to be a better way; until then, you can capture what the print method for an XMLNode outputs:
nodes <- getNodeSet(...)
sapply(nodes, function(x)paste(capture.output(print(x)), collapse = ""))
I know it might be a bit outdated, but I ran into the same problem and wanted to leave this for future reference. After searching and struggling, the answer is as simple as:
htmlnodes <- toString(nodes)
writeLines(htmlnodes, file)

R: XPath expression returns links outside of selected element

I am using R to scrape the links from the main table on that page, using XPath syntax. The main table is the third one on the page, and I want only the links to magazine articles.
My code follows:
require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))
If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in my third line by asking object y to include only the third table.
What am I doing wrong? What is the correct/more efficient way to code this with XPath?
Note: XPath novice writing.
Answered (really quickly), thanks very much! My solution is below.
extract <- function(x) {
  message(x)
  html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
  html = xpathApply(html, "//table")[[3]]
  html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
  html = gsub("#ac_newscomment", "", html)
  html = unique(html)
}
d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)
This saves all links to news items with keyword 'Hadopi' on this website.
You need to start the pattern with . if you want to restrict the search to the current node.
/ goes back to the start of the document (even if the root node is not in y).
xpathSApply(y, ".//a/#href" )
Alternatively, you can extract the third table directly with XPath:
xpathApply(x, "//table[3]//a[contains(#href,'/magazine/') and not(contains(#href, '/recherche/'))]/#href")

removing data with tags from a vector

I have a string vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
I want to remove these tags and get the following vector instead:
abc <- "welcome Have fun"
Try
> gsub("(<[^>]*>)","",abc)
What this says is: substitute every instance of < followed by anything that isn't a >, up to a >, with nothing.
You can't just do gsub("<.*>","",abc) because regexps are greedy, and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
This solution might fail if you've got > in your tags - but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
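Since the question is about a character vector, it may be worth noting that gsub() is vectorised, so the same call cleans every element at once (a small made-up two-element example):
abc <- c("welcome <span class=\"r\">abc</span> Have fun!",
         "<b>second</b> element")
gsub("(<[^>]*>)", "", abc)
# [1] "welcome abc Have fun!" "second element"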
You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse.
You can then convert it to text, i.e., strip all the tags, with xmlValue.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
library(XML)
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
xmlValue( xmlRoot(doc) )
If you also want to remove the contents of the tags (the <span> here), you can use xmlDOMApply to transform the XML tree.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)
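Since the question mentions a string vector, the same xmlValue idea extends element-wise; a small sketch (the helper name and the two-element vector are just for illustration):
library(XML)
strip_tags <- function(s) {
  # parse the fragment and return the concatenated text of all its nodes
  xmlValue(xmlRoot(htmlParse(s, asText = TRUE)))
}
frags <- c("welcome <span class=\"r\">abc</span> Have fun!",
           "<b>second</b> element")
sapply(frags, strip_tags, USE.NAMES = FALSE)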
