I am trying to webscrape this website.
I am applying the same code that I always use to webscrape pages:
url_dv1 <- "https://ec.europa.eu/commission/presscorner/detail/en/qanda_20_171?fbclid=IwAR2GqXLmkKRkWPoy3-QDwH9DzJiexFJ4Sp2ZoWGbfmOR1Yv8POdlLukLRaU"
url_dv1 <- paste(html_text(html_nodes(read_html(url_dv1), "#inline-nav-1 .ecl-paragraph")), collapse = "")
For this website, though, the code doesn't seem to be working. Instead, I get the following error:
Error in UseMethod("read_xml") :
  no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"
Why is this happening? How can I fix it?
Thanks a lot!
The problem is that the web page is rendered dynamically, so the content you want is not in the raw HTML that read_html() fetches. You can work around this with PhantomJS (it can be downloaded here: https://phantomjs.org/download.html), which renders the page and saves the resulting HTML to disk. You will also need a small custom JavaScript script (see below). The following R code works for me.
library(tidyverse)
library(rvest)

dir_js <- "path/to/a/directory"  # directory holding the JS file; the file must be named javascript.js

url <- "https://ec.europa.eu/commission/presscorner/detail/en/qanda_20_171?fbclid=IwAR2GqXLmkKRkWPoy3-QDwH9DzJiexFJ4Sp2ZoWGbfmOR1Yv8POdlLukLRaU"

# let phantomjs render the page and save the result as myhtml.html
system2("path/to/where/you/have/phantomjs.exe",  # path to the phantomjs executable
        args = c(file.path(dir_js, "javascript.js"), url))

# parse the rendered page written by the JS script
read_html("myhtml.html") %>%
  html_nodes("#inline-nav-1 .ecl-paragraph") %>%
  html_text()
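If you need this for more than one press-corner page, the two steps can be wrapped into a small helper. This is only a sketch under the same assumptions as above (the phantomjs path and the javascript.js location are placeholders, and the JS script is assumed to write myhtml.html into the current working directory); scrape_rendered is a hypothetical name:

# hypothetical helper: render a URL with phantomjs, then extract the paragraph text
scrape_rendered <- function(url,
                            phantom = "path/to/where/you/have/phantomjs.exe",
                            js = file.path(dir_js, "javascript.js")) {
  system2(phantom, args = c(js, url))   # writes myhtml.html to the working directory
  read_html("myhtml.html") %>%
    html_nodes("#inline-nav-1 .ecl-paragraph") %>%
    html_text() %>%
    paste(collapse = "")
}

# usage
text_dv1 <- scrape_rendered(url)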
# this is the javascript code, to be saved in dir_js as javascript.js
// create a webpage object
var page = require('webpage').create(),
    system = require('system');

// the url for each country provided as an argument
country = system.args[1];

// include the File System module for writing to files
var fs = require('fs');

// specify the path to the output file
// we'll just overwrite it iteratively with a page in the same directory
var path = 'myhtml.html';

// open the page, write the rendered HTML to the output file, and exit
page.open(country, function (status) {
    var content = page.content;
    fs.write(path, content, 'w');
    phantom.exit();
});
I'm currently trying to access SharePoint folders in R. I read multiple articles addressing the issue, but none of the proposed solutions seem to work in my case.
I first tried to upload a single .txt file using the httr package, as follows:
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- httr::GET(URL, httr::authenticate("username","password",type="any"))
I get the following error:
Error in curl::curl_fetch_memory(url, handle = handle) :
URL using bad/illegal format or missing URL
I then tried another package that uses a similar syntax (RCurl):
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- getURL(URL, userpwd = "username:password")
I get the following error:
Error in function (type, msg, asError = TRUE) :
I tried many other ways of linking R to SharePoint, but these two seemed the most straightforward. (Also, my URL doesn't seem to be the problem, since it works when I open it in my web browser.)
Ultimately, I want to be able to load a whole SharePoint folder into R (not only a single document). Something that would really help is to set my SharePoint folder as my working directory and use the base::list.files() function to list the files in the folder, but I doubt that's possible.
Does anyone have a clue how I can do that?
I created an R library called sharepointr for doing just that.
What I basically did was:
1. Create App Registration
2. Add permissions
3. Get credentials
4. Make REST calls
The Readme.md for the repository has a full description, and here is an example:
# Install
install.packages("devtools")
devtools::install_github("esbeneickhardt/sharepointr")
# Parameters
client_id <- "insert_from_first_step"
client_secret <- "insert_from_first_step"
tenant_id <- "insert_from_fourth_step"
resource_id <- "insert_from_fourth_step"
site_domain <- "yourorganisation.sharepoint.com"
sharepoint_url <- "https://yourorganisation.sharepoint.com/sites/MyTestSite"
# Get Token
sharepoint_token <- get_sharepoint_token(client_id, client_secret, tenant_id, resource_id, site_domain)
# Get digest value
sharepoint_digest_value <- get_sharepoint_digest_value(sharepoint_token, sharepoint_url)
# List folders
sharepoint_path <- "Shared Documents/test"
get_sharepoint_folder_names(sharepoint_token, sharepoint_url, sharepoint_digest_value, sharepoint_path)
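If you prefer to talk to the REST API directly (for example to list the files in a folder, close to the list.files() idea in the question), a minimal sketch with httr could look like this. It assumes the token returned by get_sharepoint_token can be passed on as a Bearer token and uses the standard SharePoint REST endpoint GetFolderByServerRelativeUrl; the folder path is a placeholder for your own site.

library(httr)

# hypothetical example: list the files in a folder via the SharePoint REST API
folder   <- "/sites/MyTestSite/Shared Documents/test"   # server-relative folder path (placeholder)
endpoint <- paste0(sharepoint_url,
                   "/_api/web/GetFolderByServerRelativeUrl('", folder, "')/Files")

resp <- GET(utils::URLencode(endpoint),
            add_headers(Authorization = paste("Bearer", sharepoint_token),
                        Accept = "application/json;odata=verbose"))

files <- content(resp, as = "parsed")
sapply(files$d$results, function(x) x$Name)   # file names in the folder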
I am trying to parse the following webpage to return the links of each result sub-page. However, the 'result' object is just an empty list. What do I need to put into the span clause in order for it to correctly return the header and underlying URL of each result page?
Many thanks.
# load packages
library(RCurl)
library(XML)

# download html
url = "http://www.sportinglife.com/racing/results"
http = htmlParse(url)
result = lapply(http['//span[@class="hdr t2"]'], xmlValue)
Easy. When you look for "hdr t2" in the source code of the page, you'll notice that the tag carrying this class is an h3 tag, while you are querying for a span tag. Replace "span" with "h3" and it'll work. This works for me:
# load packages
library(RCurl)
library(XML)

# download html
url = "http://www.sportinglife.com/racing/results"
http = htmlParse(url)
result = lapply(http['//h3[@class="hdr t2"]'], xmlValue)
I say it's easy, but it's easy to overlook as well :)
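For comparison, the same selection can be written with rvest's CSS selectors instead of XPath; this is just a sketch, assuming the page still has the same structure as when the question was asked:

library(rvest)

# equivalent selection with a CSS selector (both classes: hdr and t2)
read_html("http://www.sportinglife.com/racing/results") %>%
  html_nodes("h3.hdr.t2") %>%
  html_text()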
I'm trying to scrape a website using R. However, for reasons unknown to me, I cannot get all the information from the page. I found a workaround by first downloading the complete web page ("Save as" in the browser). I was wondering whether it would be possible to download complete web pages like that using some function.
I tried download.file and htmlParse, but they seem to only download the source code.
url = "http://www.tripadvisor.com/Hotel_Review-g2216639-d2215212-Reviews-Ayurveda_Kuren_Maho-Yapahuwa_North_Western_Province.html"
download.file(url , "webpage")
doc <- htmlParse(urll)
ratings = as.data.frame(xpathSApply(doc,'//div[#class="rating reviewItemInline"]/span//#alt'))
This worked with rvest on the first go:

library(plyr)     # llply
library(rvest)    # read_html, html_nodes, html_attr, html_text
library(stringi)  # stri_replace_all_regex

llply(read_html(url) %>% html_nodes('div.rating.reviewItemInline'), function(i)
  data.frame(nth_stars = html_nodes(i, 'img') %>% html_attr('alt'),
             date_var  = html_text(i) %>% stri_replace_all_regex('(\n|Reviewed)', '')))
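The llply() call returns a list of small data frames, one per review block; binding them into a single data frame is straightforward. Here reviews is a hypothetical name for the object storing the result of the call above:

# bind the per-review data frames into one
reviews_df <- do.call(rbind, reviews)   # equivalently: plyr::ldply(reviews)
head(reviews_df)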
I'd like to apologise in advance for the lack of a reproducible example. The data my script runs on are not live right now and are confidential on top of that.
I wanted to make a script that finds all links on a certain page. The script works as follows:
* find homepage html to start with
* find all urls on this homepage
* open these urls with Selenium
* save the html of each page in a list
* repeat this (find urls, open urls, save html)
The workhorse of this script is the following function:
function(listofhtmls) {
  urls <- lapply(listofhtmls, scrape)
  urls <- lapply(urls, clean)
  urls <- unlist(urls)
  urls <- urls[-which(duplicated(urls))]
  urls <- paste("base_url", urls, sep = "")
  html <- lapply(urls, savesource)
  result <- list(html, urls)
  return(result)
}
URLs are scraped and cleaned (I don't need all of them), and duplicate URLs are removed.
All of this works fine for most pages but sometimes I get a strange error while using this function:
Error: '' does not exist in current working directory.
Called from: check_path(path)
I don't see any link between the working directory and the parsing that's going on. I'd like to resolve this error, as it's blocking the rest of my script at the moment. Thanks in advance, and once again apologies for not providing a reproducible example.
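For context, read_html()/read_xml() raise exactly this "does not exist in current working directory" error when they are handed a string that is neither a valid URL nor an existing file, for example an empty string. A hedged diagnostic sketch, assuming savesource() ultimately calls read_html(), that could be dropped in just before the lapply(urls, savesource) step:

# hypothetical check: flag cleaned URLs that came back empty or consist only of the prefix
bad <- is.na(urls) | urls == "" | urls == "base_url"   # "base_url" stands in for your real prefix
if (any(bad)) {
  print(urls[bad])     # inspect what scrape()/clean() produced for these entries
  urls <- urls[!bad]   # or stop() here and investigate
}
html <- lapply(urls, savesource)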
I am new to R and would like to seek some advice.
I am trying to download multiple URLs (PDF files, not HTML pages) and save them as PDF files using R.
The links I have are character strings (taken from the HTML code of the website).
I tried the download.file() function, but it takes one specific URL (written into the R script) and therefore only downloads one file per call. I have many URLs, however, and would like some help with doing this.
Thank you.
I believe what you are trying to do is download a list of URLs. You could try something like this approach:
Store all the links in a vector using c(), e.g.:
urls <- c("http://link1", "http://link2", "http://link3")
Iterate through the vector and download each file:
for (url in urls) {
  download.file(url, destfile = basename(url))
}
If you're using Linux/macOS and HTTPS, you may need to specify the method and extra arguments for download.file:
download.file(url, destfile = basename(url), method="curl", extra="-k")
If you want, you can test my proof of concept here: https://gist.github.com/erickthered/7664ec514b0e820a64c8
Hope it helps!
URLs:
url = c('https://cran.r-project.org/doc/manuals/r-release/R-data.pdf',
        'https://cran.r-project.org/doc/manuals/r-release/R-exts.pdf',
        'http://kenbenoit.net/pdfs/text_analysis_in_R.pdf')
Designated names:
names = c('manual1',
          'manual2',
          'manual3')
Iterate through the vector and download each file with its corresponding name:
for (i in seq_along(url)) {
  download.file(url[i], destfile = paste0(names[i], '.pdf'), mode = 'wb')  # add .pdf so the files are saved with an extension
}
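As a side note, the same loop can be written without an explicit index, for example with base R's Map(); this is just an equivalent formulation of the snippet above:

# download each URL to its matching file name in a single call
invisible(Map(function(u, n) download.file(u, destfile = paste0(n, '.pdf'), mode = 'wb'),
              url, names))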