Scraping data in Excel format from a URL into R

I am trying to download data from this URL:
https://migration.iom.int/datasets/europe-%E2%80%94-mixed-migration-flows-europe-quarterly-overview-april-june-2021
The page offers a dataset as an Excel file, and the direct download link is https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261
I want to download all of this data in Excel format directly into R.
library(rvest)
URL <- "https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261"
pg <- read_html(URL)
html_attr(html_nodes(pg, "download"), "href")
But I made some mistake and the download doesn't happen. Can anybody help me download this data into R?

I personally would go about it in the following way:
Download the data to a specified destination, then read the Excel file from that location. An idea would be:
download.file(url, destinationFile, mode = "wb")  # binary mode so the .xlsx survives intact
fileInR <- readxl::read_excel(destinationFile)
Note that read.table() is meant for delimited text, not .xlsx workbooks, which is why an Excel reader such as readxl is needed here. However, a simple Google search for both steps (downloading and reading an Excel file in R) will turn up plenty more options.
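Putting the two steps together for the question's URL, a minimal sketch (assuming the direct .xlsx link is still valid and publicly reachable):
library(readxl)

url <- "https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261"
destfile <- tempfile(fileext = ".xlsx")    # temporary landing spot for the workbook
download.file(url, destfile, mode = "wb")  # binary mode, otherwise Windows corrupts the file
flows <- read_excel(destfile)              # reads the first sheet by default
head(flows)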

Related

opening PDF from a webpage in R

I'm trying to practice text analysis with the Fed FOMC minutes.
I was able to obtain all links to the appropriate pdf files from the link below.
https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm
I tried download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf", "1.pdf").
The download was successful; however, when I click on the downloaded file, it outputs "There was an error opening this document. The file is damaged and could not be repaired."
What are some ways to fix this? Is this a way of preventing web scraping on the Fed's side?
I have 44 links (PDF files) to download and read in R. Is there a way to do this without physically downloading the files?
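As for the "damaged" file: on Windows, download.file() defaults to text mode, which mangles binary formats like PDF, so the usual fix is to pass mode = "wb":
download.file(
  "https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf",
  destfile = "1.pdf",
  mode = "wb"  # write binary; the text-mode default corrupts PDFs on Windows
)
That said, you can also skip the intermediate files entirely, as the answer below does.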
library(stringr)
library(rvest)
library(pdftools)
# Scrape the website with rvest for all href links
p <- rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
pdfs <- p %>% rvest::html_elements("a") %>% html_attr("href")
# Filter selected fomcminute paths and reconstruct html links
pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*pdf")]
pdfs <- pdfs[!is.na(pdfs)]
paths <- paste0("https://www.federalreserve.gov/", pdfs)
# Scrape minutes as list of text files
pdf_data <- lapply(paths, pdftools::pdf_text)
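pdf_text() returns one character string per page, so pdf_data is a list with one character vector per document. A quick sanity check on the result:
cat(substr(pdf_data[[1]][1], 1, 300))  # first 300 characters of page 1 of the first minutes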

How do I read an Excel file through a URL in RStudio? It's https

I have a secure URL which provides data in Excel format. How do I read it in RStudio?
Please mention the necessary packages and functions. I have tried read.xls(),
read_xlsx(), read.URL, and some more. Nothing seems to work.
You can do it in two steps. First, you'll need to download it with something like download.file, then read it with readxl::read_excel:
download.file("https://file-examples.com/wp-content/uploads/2017/02/file_example_XLS_10.xls", destfile = "/tmp/file.xls", mode = "wb")
readxl::read_excel("/tmp/file.xls")
library(readxl)
library(httr)
url <- 'https://......xls'
GET(url, write_disk(TF <- tempfile(fileext = ".xls")))
read_excel(TF)
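A nice property of this httr approach: write_disk(TF <- tempfile(fileext = ".xls")) streams the response straight to a temporary file, and the .xls extension from tempfile(fileext = ...) helps read_excel pick the right parser. To tidy up afterwards:
unlink(TF)  # remove the temp file once you're done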
Have you tried importing it as a .csv dataset into RStudio? Might be worth a try! :)

Scraping Tabular(Equity historical) data from the nse website

I'm trying to web-scrape historical equity data from the NSE website:
https://www.nseindia.com/products/content/equities/equities/eq_security.htm
I tried to scrape the data for a company (symbol name) RELIANCE over a range (time period) of the past two weeks and write the contents to a CSV file.
library(rvest)
url <- "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol=RELIANCE&segmentLink=3&symbolCount=2&series=ALL&dateRange=15days&fromDate=&toDate=&dataType=PRICEVOLUMEDELIVERABLE"
page_html <- read_html(url)
data <- html_nodes(page_html, "p")
data <- html_text(data)
write.csv(data$data, "scrapedData.csv", row.names=FALSE)
It just says character(0).
I know that there is an option to download the CSV file on the website, but I want an automated R script for getting the data.
I know that there are other packages such as quantmod for getting historical stock data, but I need it from this website as it has useful information such as TTQ, turnover, etc.
Why reinvent the wheel?
You can use the nsepy Python module:
https://github.com/swapniljariwala/nsepy
Similar alternatives exist.
You just need to use this:
from nsepy import get_history
from datetime import date
data = get_history(symbol="SBIN", start=date(2015,1,1), end=date(2015,1,31))
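If you would rather stay in R, here is a hedged sketch: the NSE page tends to reject requests without browser-like headers, so sending a User-Agent and parsing whatever tables come back may work. The URL parameters are copied from the question; everything else here is an assumption, not a tested recipe.
library(httr)
library(rvest)

url <- "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol=RELIANCE&segmentLink=3&symbolCount=2&series=ALL&dateRange=15days&fromDate=&toDate=&dataType=PRICEVOLUMEDELIVERABLE"
resp <- GET(url, user_agent("Mozilla/5.0"))  # browser-like header; NSE often blocks bare requests
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
tables <- page %>% html_nodes("table") %>% html_table(fill = TRUE)  # every <table> as a data frame
if (length(tables) > 0) {
  write.csv(tables[[1]], "scrapedData.csv", row.names = FALSE)
}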

download/scraping table from htmlTable

I am trying to download a CSV from
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.html
or to scrape the data from the HTML table output found here:
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]
I have tried to scrape the data using
library(rvest)
url <- read_html("https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-
poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-
02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],analysis_error[(2019-02-
09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:
(238)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)]
[(179):1:(238)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-
09T12:00:00Z)][(-7):1:(42)][(179):1:(238)]")
test <- url %>%
html_nodes(xpath='table.erd.commonBGColor.nowrap') %>%
html_text()
And I have tried to download a CSV with
download.file(url, destfile = "~/Documents/test.csv", mode = 'wb')
But neither worked. The download.file function downloaded a CSV containing the node description, and the rvest method gave me a huge character string on my MacBook and a NULL data frame on my Windows machine. I have also tried SelectorGadget (a Chrome extension) to pick out only the data I need, but it does not seem to work on the htmlTable output.
I managed to find a workaround using the htmltab package; not sure if it's optimal, though. It's a big data frame for a webpage and took a while to load. "//table[2]" picks out the actual data table, as there are two HTML tables at the link you've given.
library(htmltab)

url1 <- "https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]"
tbls <- htmltab(url1, which = "//table[2]")
rdf <- as.data.frame(tbls)
Let me know if it helps.
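A possibly simpler route, sketched under the assumption that this ERDDAP server supports the .csv response type (griddap endpoints generally do): request the same query as CSV and skip HTML parsing entirely. ERDDAP's CSV output puts a units row under the header, so the first data row may need dropping.
csv_url <- sub(".htmlTable?", ".csv?", url1, fixed = TRUE)  # same query, CSV response type
sst <- read.csv(csv_url, check.names = FALSE)
sst <- sst[-1, ]  # assumption: ERDDAP's first data row holds units, not values
head(sst)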

get data from excel online (office 365)

I'm using a Microsoft app (from http://portal.office.com) to translate and store tweets in an online Excel sheet; now I want to read it with R.
The Excel sheet's URL is https://myagency-my.sharepoint.com/.../tweet.xlsx
I tried:
library(readxl)
read_excel('//companySharepointSite/project/.../ExcelFilename.xlsx', sheet = 'Sheet1', skip = 1)
from this post. It gives:
Error in sheets_fun(path) : Evaluation error: zip file
I believe read_excel works only with local files. This might do the trick:
library(xlsx)
library(httr)
url <- 'https://myagency-my.sharepoint.com/.../tweet.xlsx'
GET(url, write_disk("excel.xlsx", overwrite=TRUE))
frmData <- read.xlsx("excel.xlsx", sheetIndex=1, header=TRUE)
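An alternative sketch with readxl instead of the Java-dependent xlsx package, assuming the link is directly downloadable (i.e. no SharePoint authentication in the way):
library(httr)
library(readxl)

url <- 'https://myagency-my.sharepoint.com/.../tweet.xlsx'
GET(url, write_disk(tf <- tempfile(fileext = ".xlsx")))  # stream the workbook to a temp file
frmData <- read_excel(tf, sheet = 1)                     # first sheet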
