I would like to scrape all the information from the "How did you contribute to OpenStreetMap?" page (https://hdyc.neis-one.org/). It is necessary to log in to OSM in order to view a user profile.
Since there are quite a lot of profiles that need to be scraped, I want to automatically scrape the list using the rvest package (https://rvest.tidyverse.org/).
So far I attempted to do this:
> library(rvest)
> url <- "https://hdyc.neis-one.org/?mrsensible"
> pgsession <- session(url)
> pgsession
<session> https://hdyc.neis-one.org/?mrsensible
Status: 200
Type: text/html
Size: 4245
When I try to read the information of my OSM record with read_html(url), here is what it returns:
> read_html(url)
{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<meta ...
[2] <body onload="init();">\n <div class="copyright">Copyright © <a target="_bl ...
So it doesn't really capture the information.
Would it be possible to scrape the data using rvest?
Many thanks in advance!
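For the login step, rvest's session and form helpers can in principle be used to authenticate against openstreetmap.org first and then request the hdyc page within the same session. This is only a rough sketch: the form index, the field names (username, password) and whether hdyc.neis-one.org actually honours a session obtained this way are assumptions that need checking against the live site.
library(rvest)
# log in to OSM first (form index and field names are assumed -- inspect the live form to confirm)
login_page <- session("https://www.openstreetmap.org/login")
login_form <- html_form(login_page)[[1]]
filled     <- html_form_set(login_form, username = "my_user", password = "my_password")
logged_in  <- session_submit(login_page, filled)
# then request the hdyc profile within the authenticated session
profile <- session_jump_to(logged_in, "https://hdyc.neis-one.org/?mrsensible")
profile %>% html_element("body")
Note that hdyc also builds much of the page with JavaScript, so even an authenticated rvest session may not see the rendered numbers.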
I am trying to scrape the value 13.486 Kč from https://www.aofis.cz/informace-pro-klienty/elba-opf/.
For some reason, the following code does not seem to find the number. I am rather a newbie at this, so perhaps the string in xml_find_all is wrong. Can anyone please have a look and tell me why it fails?
library(rvest)
library(xml2)
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- read_html(page)
Price <- read_page %>%
  rvest::html_nodes('page-content') %>%
  xml2::xml_find_all("//strong[contains(#class 'sg_selected')]") %>%
  rvest::html_text()
Price
Thank you!!
Michael
The HTML you see in your browser's developer panel (or with SelectorGadget) is not the same as the content that is delivered to your R session. What R receives is essentially a JavaScript file which then builds the web page. This is why your rvest call isn't finding the correct node: there are no matching HTML nodes in the string you are processing!
There are a few different ways to get the information you want, but perhaps the best is to just get the monetary values from the javascript code using regex:
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- httr::content(httr::GET(page), "text")
stringr::str_extract_all(read_page, "\\d+\\.\\d+ K")[[1]][1]
#> [1] "13.486 K"
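If you also want the currency suffix, a slightly wider pattern should work, assuming the page is served as UTF-8 and the amount really appears as "13.486 Kč" in the underlying script:
stringr::str_extract_all(read_page, "\\d+\\.\\d+\\s*Kč")[[1]][1]
# expected to give "13.486 Kč" if the pattern matches the embedded text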
I would like to scrape a table from a webpage into R on a Mac:
library("rvest")
url <- "https://www.gpw.pl/wskazniki_spolek_full"
gpw <- url %>%
  read_html() %>%
  html_nodes(xpath = '//table') %>%
  html_table(fill = TRUE)
gpw
Error: Failure when receiving data from the peer
Error: object 'gpw' not found
When I download the table from this website in RStudio on Windows, or via Excel (Data tab -> Get External Data -> From Web), it works.
EDIT:
I have a problem accessing this specific webpage; it seems like there is no connection at all.
When I try the same code on a different webpage it works, for example:
url <- "https://pl.wikipedia.org/wiki/Mapa"
Map <- url %>%
read_html()
Map
{xml_document}
<html class="client-nojs" lang="pl" dir="ltr">
[1] <head>\n <meta charset="UTF-8"/>\n <title>Mapa – Wikipedia, wolna encyklopedia</title>\n <s ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-Mapa rootpage-Ma ...
I was looking for similar cases but there were no answers for R (only some for PHP curl).
The following code works, but if you want a clean table you will have to do some cleaning afterwards (a sketch follows the code):
library("rvest")
url <- "https://www.gpw.pl/wskazniki_spolek_full"
population <- url %>%
  read_html() %>%
  html_nodes(xpath = '//table') %>%
  html_table(fill = TRUE)
population
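As a starting point for the cleaning, something along these lines may help. The index of the table that holds the indicators and which columns are numeric are assumptions, since the page layout can change:
# html_table(fill = TRUE) returns a list of data frames;
# inspect `population` and pick the one with the indicators (index 1 is assumed)
wskazniki <- population[[1]]
# make the column names syntactically valid
names(wskazniki) <- make.names(names(wskazniki), unique = TRUE)
# the tables use a comma as the decimal separator; a helper like this
# converts such columns to numeric (apply it to the columns that need it)
to_num <- function(x) as.numeric(gsub(",", ".", gsub("[^0-9,.-]", "", x)))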
For instance, I want to scrape data for flights operating between Chicago (ORD) and New Delhi (DEL). I would search for the flights on makemytrip and this is the URL that gets generated: http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408
When I try to read this HTML page using the rvest package, this is what I get:
htmlpage <- read_html("http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408")
htmlpage
{xml_document}
<html>
[1] <head>\n <meta http-equiv="Content-Type" cont ...
[2] <body onload="done_loading();">\n\n <div id= ...
myhtml <- html_nodes(htmlpage, ".flight_info")
> myhtml
{xml_nodeset (0)}
I need help parsing/scraping this data and understanding what is going wrong here.
Thanks!
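A quick way to check whether the flight details are in the raw response at all, or only added later by JavaScript (the done_loading() handler on the <body> tag hints at the latter), is to search the downloaded text for the class you are selecting on. This is just a diagnostic, not a fix:
raw <- httr::content(httr::GET("http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408"), "text")
# FALSE would mean the listings are injected by JavaScript after the page loads,
# so rvest alone will never see them
grepl("flight_info", raw, fixed = TRUE)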
I am scraping an Airbnb page using rvest.
My objective is to get the number of listings of a user (on the lower left-hand side of the web page) as well as the links for each listing.
However, it seems that Airbnb is blocking access to the source or something. I am a bit lost...
1) Using SelectorGadget and rvest, I have identified the node I'm interested in. Here is my entire code:
library(rvest)
URL = "https://www.airbnb.com/users/show/..."
# put any user id instead of ...
source = read_html(URL)
source %>% html_nodes(".row-space-3") %>% .[[1]] %>% html_text()
And here is my (disappointing) output:
[1] "\n "
Looking at the webpage source code, I should get "Listings (2)"; here is the relevant part:
<div class="listings row-space-2 row-space-top-4">
<h2 class="row-space-3">
Listings
<small>(2)</small>
</h2>
What is happening?
PS:
2) I noticed that when I try to get the source code by brute force with XML, a whole section is missing compared to the source code in Chrome or Firefox:
library(XML)
library(RCurl)
URL = "https://www.airbnb.com/users/show/..."
parsed <- htmlParse(getURL(URL), asText = TRUE, encoding = "UTF-8")
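A useful first check is whether the listings section is in the raw response at all, and whether sending a browser-like User-Agent changes what the server returns. If the section is not in the text either way, it is most likely rendered by JavaScript and neither rvest nor XML will ever see it. A minimal sketch (the header value is just an example):
library(httr)
URL <- "https://www.airbnb.com/users/show/..."  # put a real user id instead of ...
resp <- GET(URL, user_agent("Mozilla/5.0"))
raw <- content(resp, "text")
# is the listings heading present in what the server sent to R?
grepl("Listings", raw, fixed = TRUE)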
I am gathering data about different universities and I have a question about the following error, which appears after executing the code below. The problem occurs when using htmlParse().
Code:
library(RCurl)
library(XML)
url1 <- "http://nces.ed.gov/collegenavigator/?id=165015"
webpage1 <- getURL(url1)
doc1 <- htmlParse(webpage1)
Output:
Error in htmlParse(webpage1) : File
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> ...
(the error message goes on to dump the raw HTML head of the page: the title "College Navigator - National Center for Education Statistics", stylesheet links, keyword/description meta tags, and a meta name="robots" content="index,nofollow" tag)
I have web scraped pages before using this package and never had an issue. Does the name="robots" meta tag have anything to do with it? Any help would be greatly appreciated.
The W3C validator (http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fnces.ed.gov%2Fcollegenavigator%2F%3Fid%3D165015) indicates the webpage is badly formed. Your browser can compensate for this, but your R package is having problems.
If you are using Windows you can get the IE browser to fix it for you, as follows:
library(rcom)
library(XML)
# drive an Internet Explorer instance via COM and let it repair the markup
ie <- comCreateObject('InternetExplorer.Application')
ie[["visible"]] <- TRUE  # TRUE is handy for debugging, so you can watch the browser
comInvoke(ie, "Navigate2", "http://nces.ed.gov/collegenavigator/?id=165015")
# wait until IE has finished loading and rendering the page
while (comGetProperty(ie, "busy") || comGetProperty(ie, "ReadyState") < 4) {
  Sys.sleep(1)
  print(comGetProperty(ie, "ReadyState"))
}
# pull the repaired HTML back out of the rendered document
myDoc <- comGetProperty(ie, "Document")
webpage1 <- myDoc$getElementsByTagName('html')[[0]][['innerHTML']]
ie$Quit()
doc1 <- htmlParse(webpage1)
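If you are not on Windows (or do not want to drive Internet Explorer), a simpler thing to try first is xml2's parser, which is generally tolerant of badly formed markup; whether it recovers enough of this particular page to be useful is something you would have to check:
library(xml2)
# read_html() uses libxml2, which tries to repair malformed HTML as it parses
doc1 <- read_html("http://nces.ed.gov/collegenavigator/?id=165015")
doc1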