Webscraping "How did you contribute to OpenStreetMap" with rvest - web-scraping

I would like to scrape all the information from "How did you contribute to OpenStreetMap?" (https://hdyc.neis-one.org/). It is necessary to log in to OSM in order to view a user profile.
Since quite a lot of profiles need to be scraped, I want to automate this with the rvest package (https://rvest.tidyverse.org/).
So far I have attempted this:
> library(rvest)
> url <- "https://hdyc.neis-one.org/?mrsensible"
> pgsession <- session(url)
> pgsession
<session> https://hdyc.neis-one.org/?mrsensible
Status: 200
Type: text/html
Size: 4245
When I tried to read my OSM record with read_html(url), this is what it returns:
> read_html(url)
{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<meta ...
[2] <body onload="init();">\n <div class="copyright">Copyright © <a target="_bl ...
So it doesn't really capture the profile information.
Would it be possible to scrape the data with rvest?
Many thanks in advance!
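One possibility (a sketch, not a tested solution) is to authenticate first with rvest's form helpers and then browse to the profile inside that session. The login URL and the field names "username" and "password" below are assumptions for illustration; inspect the real form with html_form() before relying on them. Two caveats: if OSM/hdyc authenticates via OAuth rather than a plain form, this will not be sufficient, and if hdyc.neis-one.org builds its display with JavaScript (which the near-empty body above suggests), even an authenticated session will only see the page skeleton, so the data would have to be fetched from whatever requests the page makes (visible in the browser's network tab).
library(rvest)
# A minimal sketch, assuming a standard HTML login form; the URL and the
# field names are illustrative guesses - check them with html_form() first
login <- session("https://www.openstreetmap.org/login")
form <- html_form(login)[[1]]
filled <- html_form_set(form,
                        username = "your_user",
                        password = "your_password")
login <- session_submit(login, filled)
# Navigate to the profile within the authenticated session
profile <- session_jump_to(login, "https://hdyc.neis-one.org/?mrsensible")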

Related

Web scraping price with the use of xml

I am trying to scrape the following: 13.486 Kč from: https://www.aofis.cz/informace-pro-klienty/elba-opf/
For some reason, the following code does not seem to find the number. I am rather a newbie to this, so perhaps it is because the string in xml_find_all is wrong. Can anyone please have a look at why?
library(rvest)
library(xml2)
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- read_html(page)
Price <- read_page %>%
  rvest::html_nodes('page-content') %>%
  xml2::xml_find_all("//strong[contains(#class 'sg_selected')]") %>%
  rvest::html_text()
Price
Thank you!!
Michael
The HTML you see in your browser's developer panel (or in SelectorGadget) is not the same as the content delivered to your R session. What the server sends is essentially a JavaScript payload that then builds the web page in the browser. This is why your rvest call isn't finding the node: it isn't present in the document you are processing!
There are a few different ways to get the information you want, but perhaps the best is to just get the monetary values out of the JavaScript code using regex:
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- httr::content(httr::GET(page), "text")
stringr::str_extract_all(read_page, "\\d+\\.\\d+ K")[[1]][1]
#> [1] "13.486 K"

Scraping https returns an error on Mac but works on Windows

I would like to scrape a table from this webpage into R on a Mac:
library("rvest")
url <- "https://www.gpw.pl/wskazniki_spolek_full"
gpw <- url %>%
  read_html() %>%
  html_nodes(xpath = '//table') %>%
  html_table(fill = TRUE)
gpw
Error: Failure when receiving data from the peer
Error: object 'gpw' not found
When I download the table from this website in RStudio on Windows, or via Excel (Data tab -> Get External Data -> From Web), it works.
EDIT:
I have a problem accessing this specific webpage; it seems there is no connection at all.
When I try the same code on a different webpage, it works.
For example:
url <- "https://pl.wikipedia.org/wiki/Mapa"
Map <- url %>%
  read_html()
Map
{xml_document}
<html class="client-nojs" lang="pl" dir="ltr">
[1] <head>\n <meta charset="UTF-8"/>\n <title>Mapa – Wikipedia, wolna encyklopedia</title>\n <s ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-Mapa rootpage-Ma ...
I was looking for similar cases, but there were no answers for R (only some for PHP curl).
The following code works, but if you want a clean table, you will have to do some cleaning afterwards:
library("rvest")
url <- "https://www.gpw.pl/wskazniki_spolek_full"
population <- url %>%
  read_html() %>%
  html_nodes(xpath = '//table') %>%
  html_table(fill = TRUE)
population
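If the read_html() call still fails on the Mac with "Failure when receiving data from the peer", one workaround worth trying (a guess at the cause, not part of the original answer) is to fetch the page with httr, sending a browser-like user agent, and then parse the response text:
library(httr)
library(rvest)
url <- "https://www.gpw.pl/wskazniki_spolek_full"
# Some servers reset connections from the default libcurl client;
# a browser-like user agent sometimes helps (an assumption)
resp <- GET(url, user_agent("Mozilla/5.0"))
page <- read_html(content(resp, as = "text"))
tables <- page %>%
  html_nodes(xpath = '//table') %>%
  html_table(fill = TRUE)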

How to use R (rvest or XML or RCurl) to scrape data from a website like makemytrip

For instance, I want to scrape flight data for flights operating between Chicago (ORD) and New Delhi (DEL). I would search for the flights on makemytrip, and this is the URL that gets generated - http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408
When I try to read this HTML page using the rvest package, this is what I get:
htmlpage <- read_html("http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408")
htmlpage
{xml_document}
<html>
[1] <head>\n <meta http-equiv="Content-Type" cont ...
[2] <body onload="done_loading();">\n\n <div id= ...
myhtml <- html_nodes(htmlpage, ".flight_info")
> myhtml
{xml_nodeset (0)}
I need help parsing/scraping this data and understanding what is going wrong here.
Thanks !
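As with the other pages discussed here, a likely explanation (an assumption; the excerpt contains no accepted answer) is that the flight results are injected by JavaScript after the page loads - note the body's onload="done_loading();" hook - so the raw HTML never contains any .flight_info nodes. A quick way to confirm is to search the raw response text for the class name:
library(httr)
# Fetch the raw HTML exactly as served, without executing any JavaScript
raw <- content(GET("http://us.makemytrip.com/international/listing/exUs/RT/ORD-DEL-D-22May2016_JFK-DEL-D-25May2016/A-1/E?userID=90281463121653408"), as = "text")
# FALSE here means the nodes are built client-side, not sent by the server
grepl("flight_info", raw, fixed = TRUE)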

Web scraping Airbnb with R (rvest, XML) - hidden html \n?

I am scraping an Airbnb page using rvest.
My objective is to get the number of listings of a user (on the lower left-hand side of the web page) as well as the links for each listing.
However, it seems that Airbnb is blocking access to the source, or something similar. I am a bit lost...
1) Using SelectorGadget and rvest, I have identified the node I'm interested in. Here is my entire code:
library(rvest)
URL = "https://www.airbnb.com/users/show/..."
--> put any user id instead of ...
source = read_html(URL)
source %>% html_nodes(".row-space-3") %>% .[[1]] %>% html_text()
And here is my (disappointing) output:
[1] "\n "
Looking at the webpage source code, I should get "Listings (2)"; here it is:
<div class="listings row-space-2 row-space-top-4">
<h2 class="row-space-3">
Listings
<small>(2)</small>
</h2>
What is happening?
PS:
2) I noticed that when I try to get the source code by brute force with XML, a whole section is missing compared to the source code in Chrome or Firefox:
library(XML)
library(RCurl)
URL = "https://www.airbnb.com/users/show/..."
parsed <- htmlParse(getURL(URL),asText=TRUE,encoding = "UTF-8")
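A plausible explanation (an assumption; the excerpt contains no answer) is the same JavaScript issue seen above: Airbnb renders the listings section client-side, so the HTML delivered to R holds only an empty placeholder, which is why html_text() returns a bare "\n". A quick check on the raw response:
library(httr)
# The user id is elided here, exactly as in the question
raw <- content(GET("https://www.airbnb.com/users/show/..."), as = "text")
# If "Listings" never appears in the raw source, the section is built
# in the browser, not sent by the server
grepl("Listings", raw, fixed = TRUE)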

Error using XML package in R

I am gathering data about different universities, and I have a question about the following error, which occurs when executing the code below. The problem arises when using htmlParse().
Code:
url1 <- "http://nces.ed.gov/collegenavigator/?id=165015"
webpage1 <- getURL(url1)
doc1 <- htmlParse(webpage1)
Output:
Error in htmlParse(webpage1) : File
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head id="ctl00_hd"><meta http-equiv="Content-type" content="text/html;charset=UTF-8" /><title>
College Navigator - National Center for Education Statistics
</title><link href="css/md0.css" type="text/css" rel="stylesheet" /><meta name="keywords" content="college navigator,college search,postsecondary education,postsecondary statistics,NCES,IPEDS,college locator" /><meta name="description" content="College Navigator is a free consumer information tool designed to help students, parents, high school counselors, and others get information about over 7,000 postsecondary institutions in the United States - such as programs offered, retention and graduation rates, prices, aid available, degrees awarded, campus safety, and accreditation." /><meta name="robots" content="index,nofollow" /> ...
I have web scraped pages before using this package and I never had an issue. Does the name="robots" have anything to do with it? Any help would be greatly appreciated.
The W3C validator (http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fnces.ed.gov%2Fcollegenavigator%2F%3Fid%3D165015) indicates that the webpage is badly formed. Your browser can compensate for this, but your R package is having problems.
If you are using Windows, you can get the IE browser to fix it for you as follows:
library(rcom)
library(XML)
# Drive Internet Explorer via COM and let it repair the malformed HTML
ie <- comCreateObject('InternetExplorer.Application')
ie[["visible"]] <- TRUE  # TRUE for debugging
comInvoke(ie, "Navigate2", "http://nces.ed.gov/collegenavigator/?id=165015")
# Wait until the page has finished loading (ReadyState 4 = complete)
while (comGetProperty(ie, "busy") || comGetProperty(ie, "ReadyState") < 4) {
  Sys.sleep(1)
  print(comGetProperty(ie, "ReadyState"))
}
# Pull the repaired HTML back out of the browser's DOM
myDoc <- comGetProperty(ie, "Document")
webpage1 <- myDoc$getElementsByTagName('html')[[0]][['innerHTML']]
ie$Quit()
doc1 <- htmlParse(webpage1)
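A cross-platform alternative (not from the original answer) is xml2, whose read_html() uses libxml2's recovering HTML parser and usually copes with malformed markup:
library(xml2)
# read_html() recovers from most malformed markup, which may make the
# round-trip through Internet Explorer unnecessary
doc1 <- read_html("http://nces.ed.gov/collegenavigator/?id=165015")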
