I'm testing some web scraping scripts in R. I've read many tutorials and docs and tried different things, but no success so far.
The URL I'm trying to scrape is this one. It serves public government data and has no statements against web scrapers. The site is in Portuguese, but I believe that won't be a big problem.
It shows a search form with several fields. My test was a search for data from a particular state ("RJ", in the field "UF") and city ("Rio de Janeiro", in the field "MUNICIPIO"). Clicking "Pesquisar" (Search) shows the following output:
Using Firebug, I found the URL it calls (using the parameters above) is:
http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=33&municipioDecorate%3AmunicipioSelect=3304557&bairroDecorate%3AbairroInput=&pesquisar.x=42&pesquisar.y=16&javax.faces.ViewState=j_id10
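For readability, the same query can be spelled out as an httr parameter list (just a sketch to show the parameters; as described below, a plain GET like this still hits the session problem):
library(httr)
# sketch only: the same search parameters as a query list; a plain GET still
# fails with the session-expired error described further down
resp <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam",
            query = list(buscaForm = "buscaForm",
                         `codEntidadeDecorate:codEntidadeInput` = "",
                         `noEntidadeDecorate:noEntidadeInput` = "",
                         `descEnderecoDecorate:descEnderecoInput` = "",
                         `estadoDecorate:estadoSelect` = 33,
                         `municipioDecorate:municipioSelect` = 3304557,
                         `bairroDecorate:bairroInput` = "",
                         `pesquisar.x` = 42,
                         `pesquisar.y` = 16,
                         `javax.faces.ViewState` = "j_id10"))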
The site uses a jsessionid, as can be seen using the following:
library(rvest)
library(httr)
url <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/")
cookies(url)
Knowing it uses a jsessionid, I used cookies(url) to get that value and inserted it into a new URL like this:
url <- read_html("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=008142964577DBEC622E6D0C8AF2F034?buscaForm=buscaForm&codEntidadeDecorate%3AcodEntidadeInput=33108064&noEntidadeDecorate%3AnoEntidadeInput=&descEnderecoDecorate%3AdescEnderecoInput=&estadoDecorate%3AestadoSelect=org.jboss.seam.ui.NoSelectionConverter.noSelectionValue&bairroDecorate%3AbairroInput=&pesquisar.x=65&pesquisar.y=8&javax.faces.ViewState=j_id2")
html_text(url)
Well, the output doesn't have the data. In fact, it has an error message. Translated into English, it basically says the session has expired.
I assume it is a basic mistake, but I looked all around and couldn't find a way to overcome this.
This combination worked for me:
library(curl)
library(xml2)
library(httr)
library(rvest)
library(stringi)
# warm up the curl handle
start <- GET("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")
# get the cookies
ck <- handle_cookies(handle_find("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam")$handle)
# make the POST request (%s+% below is stringi's string concatenation operator)
res <- POST("http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam;jsessionid=" %s+% ck[1,]$value,
user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:40.0) Gecko/20100101 Firefox/40.0"),
accept("*/*"),
encode="form",
            multipart=FALSE, # this generates a warning but seems to be necessary
add_headers(Referer="http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"),
body=list(`buscaForm`="buscaForm",
`codEntidadeDecorate:codEntidadeInput`="",
`noEntidadeDecorate:noEntidadeInput`="",
`descEnderecoDecorate:descEnderecoInput`="",
`estadoDecorate:estadoSelect`=33,
`municipioDecorate:municipioSelect`=3304557,
`bairroDecorate:bairroInput`="",
`pesquisar.x`=50,
`pesquisar.y`=15,
`javax.faces.ViewState`="j_id1"))
doc <- read_html(content(res, as="text"))
html_nodes(doc, "table")
## {xml_nodeset (5)}
## [1] <table border="0" cellpadding="0" cellspacing="0" class="rich-tabpanel " id="j_id17" sty ...
## [2] <table border="0" cellpadding="0" cellspacing="0">\n <tr>\n <td>\n <img alt="" ...
## [3] <table border="0" cellpadding="0" cellspacing="0" id="j_id18_shifted" onclick="if (RichF ...
## [4] <table border="0" cellpadding="0" cellspacing="0" style="height: 100%; width: 100%;">\n ...
## [5] <table border="0" cellpadding="10" cellspacing="0" class="dr-tbpnl-cntnt-pstn rich-tabpa ...
I used Burp Suite to inspect what was going on and did a quick test at the command line with the output from "Copy as cURL", adding --verbose so I could validate what was being sent/received. I then mimicked the curl parameters.
By starting at the bare search page, the cookies for the session id and the BIG-IP server are already warmed up (i.e. they will be sent with every request, so you don't have to mess with them), BUT the session id still needs to be filled in on the URL path, so we have to retrieve the cookies and then splice it in.
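If you want the hits as data frames rather than raw nodes, something along these lines should work (a sketch; which of the five tables actually holds the results is an assumption you need to verify against the page):
# parse every <table> on the results page into a data frame and inspect them
tbls <- html_table(html_nodes(doc, "table"), fill = TRUE)
str(tbls, max.level = 1)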
Related
Suppose I have the below text:
x <- "<p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>covr</code>. Furthermore, the results from <code>testthat</code> should be saved in the JUnit XML format and the results from <code>covr</code> should be saved in the Cobertura format.</p>\n\n<p>The following code does the trick (when <code>getwd()</code> is the root of the package):</p>\n\n<pre><code>options(\"testthat.output_file\" = \"test-results.xml\")\ndevtools::test(reporter = testthat::JunitReporter$new())\n\ncov <- covr::package_coverage()\ncovr::to_cobertura(cov, \"coverage.xml\")\n</code></pre>\n\n<p>However, the tests are executed <em>twice</em>. Once with <code>devtools::test</code> and once with <code>covr::package_coverage</code>. </p>\n\n<p>My understanding is that <code>covr::package_coverage</code> executes the tests, but it does not produce <code>test-results.xml</code>.</p>\n\n<p>As the title suggests, I would like get both <code>test-results.xml</code> and <code>coverage.xml</code> with a single execution of the test suite.</p>\n"
PROBLEM:
I need to remove all <code></code> tags and their content, regardless of whether they are on their own or inside another tag.
I HAVE TRIED:
I have tried the following:
content <- xml2::read_html(x) %>%
rvest::html_nodes(css = ":not(code)")
print(content)
But the result I get is the following, and the tags are still there:
{xml_nodeset (8)}
[1] <body>\n<p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>cov ...
[2] <p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>covr</code> ...
[3] <p>The following code does the trick (when <code>getwd()</code> is the root of the package):</p>
[4] <pre><code>options("testthat.output_file" = "test-results.xml")\ndevtools::test(reporter = testthat::JunitReporter$new ...
[5] <p>However, the tests are executed <em>twice</em>. Once with <code>devtools::test</code> and once with <code>covr::pac ...
[6] <em>twice</em>
[7] <p>My understanding is that <code>covr::package_coverage</code> executes the tests, but it does not produce <code>test ...
[8] <p>As the title suggests, I would like get both <code>test-results.xml</code> and <code>coverage.xml</code> with a sin ...
The solution was the following:
content <- xml2::read_html(x)
toRemove <- rvest::html_nodes(content, css = "code")
xml2::xml_remove(toRemove)
After that, content contained neither the code tags nor their content, and nothing had to be manipulated as a string.
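Condensed into one go, and with the remaining text pulled out afterwards, the same idea looks roughly like this (sketch):
library(xml2)
library(rvest)

content <- read_html(x)
xml_remove(html_nodes(content, "code"))  # drop the <code> nodes in place
html_text(content)                       # plain text, without the code snippets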
I am trying to get data from IMDB with
page <- read_html("URL_of_Movie")
The output is always in German, although my settings are set to "English". However, I need the data in its original form, in English.
I saw other questions here, like this one:
curl -H "Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3" http://www.imdb.com/title/tt0076306/
which shows how to request English with curl, but I don't know how to integrate this into my R code.
I needed to include the GET function and the correct syntax for the language header:
page <- read_html(GET(
"https://www.imdb.com/list/ls020643534/?sort=list_order,asc&st_dt=&mode=detail&page=1&title_type=movie&ref_=ttls_ref_typ",
add_headers("Accept-Language" = "en-US")))
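To check that the header is actually being honoured you can look at the page title (a sketch; IMDb may also want a browser-like User-Agent, and the selector may need adjusting if the markup changes):
library(httr)
library(rvest)

page <- read_html(GET("https://www.imdb.com/title/tt0076306/",
                      add_headers("Accept-Language" = "en-US")))
html_text(html_node(page, "title"))  # should now show the English title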
Given this URL requires the following login data:
Username or e-mail (Benutzername oder E-Mail) -> User: testuserscrap#web.de
Password (Passwort) -> Password: testuserscrap
(The website is kind of fantasy football of the German Bundesliga.)
There is an existing post where someone asks for help with the same website.
However, I do not want to retrieve information about particular players, but about the actual team. In the browser, these steps are required:
Click on the icon circled in red (in the screenshot):
This leads to a page where I would like to retrieve all the player names in lists 1 and 2:
That means I would like to get output such as:
Diego Contento
Alfred Finnbogason
...
I am not sure which way might be the best one. According to the referenced post, there seems to be an API. However, I cannot access the information with the code adapted from that post:
library(rvest)
library(jsonlite)
library(httr)
library(plyr)
library(dplyr)
url<-"https://kickbase.sky.de/"
page<-html_session(url)
page<-rvest:::request_POST(page,url="https://kickbase.sky.de/api/v1/user/login",
body=list("email"="testuserscrap#web.de",
"password"="testuserscrap",
"redirect_url"="https://www.kickbase.com/transfermarkt/kader"),
encode='json'
)
ck <- cookies(page)
player_page<-jump_to(ck$value,"https://api.kickbase.com/leagues/1420282/lineupex")
Unfortunately, I'm not much of an expert in coding or web scraping. I have tried many things but haven't come to a solution :/ Therefore, I would be really grateful for any advice or idea on how I can retrieve the information.
Best :)
Wow, this was a tough question, but a very good learning experience for me. To solve this one I used the "curlconverter" package, available for download from GitHub via the devtools package. See https://github.com/hrbrmstr/curlconverter, and other questions/answers posted here at Stack Overflow.
First log in to the web page using your browser and navigate to the page of interest. Using the developer tools, copy the 'cURL' command for the request of interest. The cURL command can be stripped of nonessential parts, but I would need to determine which parts are noncritical through trial and error.
Then use the straighten function, edit the userid and password (these were not saved with the cURL address), make the request, and then parse the return.
#cURL copied from network tab for the requested file
xcurl<-"curl 'https://api.kickbase.com/leagues/1420282/lineupex'
-XGET
-H 'Accept: */*'
-H 'Origin: https://kickbase.com'
-H 'Referer: https://kickbase.com/transfermarkt/kader'
-H 'Accept-Language: en-us'
-H 'Host: api.kickbase.com'
-H 'Authorization: Bearer XU3DGDZBxlHB0sjqG01yLhHihT2AacPeIeWOlY+u3nxz/iokfCjn8a9vaKeKFXwxJpcH/0FXOgGg3J2EfmUUDJ9uwjT+oxHZTGc1EuOxbG0i66fRBBm1RBT0Yd4ACRDQ9BCs8yb+/w9+gOPIyhM2Vio3DZemExATq22osCGeW6VzYmos/3F8MTDbKOAk8NPKQYr5xPSght26ayZ4/X21ag==' \
-H 'Accept-Encoding: br, gzip, deflate'
-H 'Connection: keep-alive'"
#See https://github.com/hrbrmstr/curlconverter, install from devtools
library(curlconverter)
library(dplyr)
my_ip<-straighten(xcurl)
#add password and user id
my_ip[[1]]$password<-"testuserscrap"
my_ip[[1]]$username<-"testuserscrap#web.de"
#Make the page request
response <- my_ip %>% make_req()
#retrieve the entire file as pretty-printed JSON (optional)
#jsonfile <- jsonlite::toJSON(content(response[[1]](), as="parsed"), auto_unbox = TRUE, pretty=TRUE)
#retrieve only the player info from the file and convert each entry to a data frame
dfs <- lapply(content(response[[1]](), as="parsed")$players, data.frame)
#not every player has the same information, thus bind_rows instead of rbind
players <- do.call(bind_rows, dfs)
players
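If you would rather not depend on curlconverter, the same request can in principle be made directly with httr by reusing the Bearer token from the cURL dump above; the token is session-bound, so treat this as a sketch:
library(httr)
library(dplyr)

# assumption: `token` holds a currently valid Bearer token from the login
res <- GET("https://api.kickbase.com/leagues/1420282/lineupex",
           add_headers(Authorization = paste("Bearer", token),
                       Referer = "https://kickbase.com/transfermarkt/kader"))
players <- bind_rows(lapply(content(res, as = "parsed")$players, data.frame))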
In case you are still looking for access to the kickbase API: I wrote a small Python library for it recently and released it just a few days ago. It might still have some bugs, but it serves my purpose, and maybe you want to contribute to it. :)
https://github.com/kevinskyba/kickbase-api-python
I am constructing a URI in R which is generated on the fly and has ~40,000 characters.
I tried using
RCurl
jsonlite
curl
All three give a "bad URL" error when connecting through an HTTP GET request. I am refraining from using httr, as it installs 5 additional dependencies and I want minimal dependencies in my R program. I am also unsure whether even httr would be able to handle so many characters in a URL.
Is there a way I can encode/pack it to an allowed limit, or a better approach/package that can handle URLs of any length, similar to Python's urllib?
Thanks in advance.
This is not a limitation of RCurl.
Let's make a long URL and try it:
> s = paste0(rep(letters,2000),collapse="")
> nchar(s)
[1] 52000
That's 52000 characters of a-z. Stick it on a URL:
> url = paste0("http://www.omegahat.net/RCurl/", s)
> nchar(url)
[1] 52030
> substr(url, 1, 40)
[1] "http://www.omegahat.net/RCurl/abcdefghij"
Now try and get it:
> txt = getURL(url)
> txt
[1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>414 Request-URI Too Large</title>\n</head><body>\n<h1>Request-URI Too Large</h1>\n<p>The requested URL's length exceeds the capacity\nlimit for this server.<br />\n</p>\n</body></html>\n"
>
That's the correct response from the server. The server decided the URL was too long and returned a 414 error, which proves RCurl can request URLs of over 40,000 characters.
Until we know more, I can only presume the "bad URL" message is coming from the server, about which we know nothing.
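One way to see the HTTP status code (rather than just the body text) and confirm where the limit sits, using the curl package the question already mentions (sketch):
library(curl)

res <- curl_fetch_memory(url)  # `url` built as above
res$status_code                # a 414 here would confirm a server-side limit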
I am trying to download text, such as company profiles, from this website:
http://www.evca.eu/about-evca/members/member-search/#lsearch
In the past I had good success with similar tasks using, for example, the XML package, but that won't work here because the data I am trying to grab is loaded dynamically, and the individual elements in the list don't have their own URLs.
Unfortunately I don't know much about web design, so I am not really sure how to tackle this. Any suggestions? It would really suck to do this manually. Thanks.
First, download Fiddler Web Debugger or a similar tool. It places itself between your browser and the web server, so you can see what is going on (including dynamic/AJAX communication).
Run it, go to the website you are trying to understand, and perform the actions you want to automate.
For example, if you open http://www.evca.eu/about-evca/members/member-search/#lsearch, enter "a" in the search box and then choose "All" (to get all results), you will see in Fiddler that the browser opens the http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999 URL and sends "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest".
You can do the same with R: parse the result, get some text, and maybe some links to other pages.
The R code below does the same as described above:
library(httr)
library(stringr)

r <- POST("http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999",
          body = "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest")
stop_for_status(r)
txt <- content(r, "text")
# grab everything between the "Full company details" link and the closing </h2>
matches <- str_match_all(txt, "Full company details.*?</h2>")
# remove some rubbish from the matches
companies <- gsub("(Full company details)|\t|\n|\r|<[^>]+>", '', matches[[1]])
# remove leading spaces
companies <- gsub("^[ ]+", '', companies)
Result:
> length(companies)
[1] 1148
> head(companies)
[,1]
[1,] "350 Investment Partners"
[2,] "350 Investment Partners LLP"
[3,] "360° Capital Management SA"
[4,] "360° Capital Partners France - Advisory Company"
[5,] "360° Capital Partners Italia - Advisory Company"
[6,] "3i Deutschland Gesellschaft für Industriebeteiligungen mbH"