Download US census files from the web using R

I am trying to download all 1980 US Census files from the URL https://www2.census.gov/census_1980/ and store in my computer using R.
I have already tried download.file and the downloader package, but the usual commands download only one file, with no format.
Is there an easy way to download all files (including subfolders, etc) at once in R?

You can check whether the data you are interested in is available in FRED | U.S. Census Bureau: https://fred.stlouisfed.org/source?soid=19
If you are interested in something specific, it is easy to get the data with quantmod:
# install.packages("quantmod")
library(quantmod)
retail_sales_total <- getSymbols('MRTSSM44X72USS', src = 'FRED', auto.assign = FALSE)
But if you want to get all the files, it is possible using the xml2 and rvest packages.
# Read the html of the census directory listing
library(rvest)
URL <- "https://www2.census.gov/census_1980/"
page <- read_html(URL)
# Extract the href attribute of every <a> node to get all the download links
links <- html_attr(html_nodes(page, "a"), "href")
and then download them all in a loop, as sketched below.
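A minimal sketch of that loop, assuming the directory listing returns relative file names and ignoring subfolders (those would need to be walked recursively):
# keep only links that look like files (drop the parent-directory and subfolder entries)
files <- links[grepl("\\.[A-Za-z0-9]+$", links)]
dir.create("census_1980", showWarnings = FALSE)
for (f in files) {
  download.file(paste0(URL, f), destfile = file.path("census_1980", f), mode = "wb")
}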

Related

Scraping data in Excel format from URL into R

I am trying to download data from url
https://migration.iom.int/datasets/europe-%E2%80%94-mixed-migration-flows-europe-quarterly-overview-april-june-2021
On this page is available dataset with file into Excel and link for downloading data is https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261
So I want to download all this data in Excel format directly into R.
library(rvest)
URL <- "https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261"
pg <- read_html(URL)
html_attr(html_nodes(pg, "download"), "href")
But I have made some mistake and the download does not work. Can anybody help me download this data into R?
I personally would go about it in the following way.
Download the data to a specified destination, then read the Excel file from that location. An idea would be:
download.file(url, destinationFile)
fileInR <- read.table(file = destinationFile, sep = "\t")
However, a simple Google search for both steps (downloading and reading an Excel file in R) should provide you with plenty more options.
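For an .xlsx file specifically, a hedged variant of that idea is to download in binary mode and read it with readxl (read_excel here is just one option; any Excel reader would do):
library(readxl)
url <- "https://migration.iom.int/system/tdf/datasets/Q2%202021%20Mixed%20Migration%20Flows%20to%20Europe%20%28April%20-%20June%202021%29.xlsx?file=1&type=node&id=12261"
dest <- "mixed_migration_q2_2021.xlsx"
# mode = "wb" keeps the binary workbook intact, especially on Windows
download.file(url, dest, mode = "wb")
flows <- read_excel(dest)
head(flows)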

opening PDF from a webpage in R

I'm trying to practice text analysis with the Fed FOMC minutes.
I was able to obtain all links to the appropriate pdf files from the link below.
https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm
I tried download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf", "1.pdf").
The download was successful; however, when I click on the downloaded file, it outputs "There was an error opening this document. The file is damaged and could not be repaired."
What are some ways to fix this? Is this a way of preventing web scraping on Fed's side?
I have 44 links (pdf files) to download and read in R. Is there a way to do this without physically downloading the files?
library(stringr)
library(rvest)
library(pdftools)
# Scrape the website with rvest for all href links
p <- rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
pdfs <- p %>% rvest::html_elements("a") %>% html_attr("href")
# Filter selected fomcminute paths and reconstruct html links
pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*pdf")]
pdfs <- pdfs[!is.na(pdfs)]
paths <- paste0("https://www.federalreserve.gov/", pdfs)
# Scrape minutes as list of text files
pdf_data <- lapply(paths, pdftools::pdf_text)
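As for the original "file is damaged" error: it is most likely not the Fed blocking scraping, but the result of downloading a binary file in text mode on Windows. A minimal fix (note the quoted URL and mode = "wb"):
download.file(
  "https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf",
  "1.pdf",
  mode = "wb"  # binary mode; without it Windows corrupts PDF downloads
)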

How can I parse an XML file in R which was probably generated using SSRS?

In my job I have to perform some analytics on data shared by an external organisation through user access granted on a web portal. Various reports are available there, which I can view and download in many formats. Two of these formats are very useful, namely MS Excel and 'XML file with report data'. The Excel file is normally heavily formatted (with sub-totals, merged cells, etc.) to suit the purpose of Excel users. Converting these Excel files to a data frame/table is normally a big hassle. I therefore prefer to download the 'xml' file, parse it, save it as csv and then carry out my analysis in R.
However, whenever I try to parse the xml file directly in R (to avoid the intervening convert-to-csv step) I never succeed. So far I have tried the XML and xml2 libraries in R, but to no avail.
Recently I tried this code.
library("XML")
library("methods")
setwd("C:\\Users\\Administrator\\Desktop\\")
res <- xmlParse("Skil.xml")
> res <- xmlParse("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
rootnode <- xmlRoot(res)
rootsize <- xmlSize(rootnode)
> rootsize
[1] 2
xmldataframe <- xmlToDataFrame("Skil.xml")
> xmldataframe <- xmlToDataFrame("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
> xmldataframe
Textbox24 Textbox63 DDOName_Collection
1 <NA> <NA> <NA>
2
Just to mention, the file size of Skil.xml is about 12.1 MB, and it is parsed successfully in Excel.
I have also tried the read_xml() function from xml2, but to no avail.
I would have happily shared a sample file to try, but I am unable to do so. Moreover, I am also unable to generate a sample file in that kind of xml format.
Can someone help?
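Without a sample file it is hard to be definitive, but the "xmlns: URI ... is not absolute" message points to the namespace declaration that SSRS puts on its report XML. A sketch under that assumption strips the namespaces with xml2 and then pulls out the repeating row elements (the node name "Detail" is a guess; SSRS reports often store row values as attributes of Detail elements, so inspect the document first):
library(xml2)
doc <- read_xml("Skil.xml")
xml_ns_strip(doc)  # drop the report namespace so simple XPath works
# inspect the structure to find the repeating row element
xml_name(xml_children(doc))
rows <- xml_find_all(doc, ".//Detail")  # "Detail" is an assumption
skil_df <- do.call(rbind, lapply(rows, function(r)
  as.data.frame(as.list(xml_attrs(r)), stringsAsFactors = FALSE)))
head(skil_df)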

Scraping Tabular(Equity historical) data from the nse website

I'm trying to web scrape historical equity data from the NSE website:
https://www.nseindia.com/products/content/equities/equities/eq_security.htm
I tried to web scrape data for a company (symbol name) named RELIANCE for the range (time period) of the past 2 weeks and transfer the contents to a CSV file.
library(rvest)
url <- "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol=RELIANCE&segmentLink=3&symbolCount=2&series=ALL&dateRange=15days&fromDate=&toDate=&dataType=PRICEVOLUMEDELIVERABLE"
page_html <- read_html(url)
data <- html_nodes(page_html, "p")
data <- html_text(data)
write.csv(data, "scrapedData.csv", row.names = FALSE)
It returns character(0) (empty).
I know that there is an option to download the csv file there in the website but i want an automated R Script for getting the data.
I know that other packages such as quantmod exist for getting historical stock data, but I need it from this website as it has useful information such as TTQ, Turnover, etc.
Why reinvent the wheel?
You can use the nsepy Python module:
https://github.com/swapniljariwala/nsepy
Similar alternatives exist.
You just need to use this:
from nsepy import get_history
from datetime import date
data = get_history(symbol="SBIN", start=date(2015,1,1), end=date(2015,1,31))
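If you would rather stay in R, a hedged option is to drive nsepy through reticulate (this assumes Python and the nsepy module are installed; the call mirrors the Python snippet above):
library(reticulate)
nsepy <- import("nsepy")
datetime <- import("datetime")
# same query as the Python example, executed from R
data <- nsepy$get_history(symbol = "SBIN",
                          start = datetime$date(2015L, 1L, 1L),
                          end = datetime$date(2015L, 1L, 31L))
head(data)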

How can I download GADM data in R?

library(raster)
france<-getData('GADM', country='FRA', level=1)
However, the command is leading me to this error.
trying URL 'http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm1.rds'
Error in utils::download.file(url = aurl, destfile = fn, method = "auto", :
cannot open URL 'http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm1.rds'
First, download the country data you want from the GADM database and save it to your local directory. Be sure that you have chosen the R (SpatialPolygonsDataFrame) format. Several levels are available for France (from level 0 to level 5); choose the one you need.
Second, read the .rds file downloaded from GADM with readRDS() function and transform it into a data.frame with ggplot2::fortify().
library(ggplot2)
library(sp)
# assuming you downloaded it to a path such as '~/Downloads/FRA_adm1.rds':
path <- file.path(Sys.getenv("HOME"), "Downloads", "FRA_adm1.rds")
# FR map (Level 1) from GADM version 2.8
frRDS <- readRDS(path)
# Region names 1 in data frame
frRDS_df <- ggplot2::fortify(frRDS, region = "NAME_1")
head(frRDS_df)
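To quickly check the result, you could plot the fortified data frame (a minimal sketch; the long, lat and group columns are created by ggplot2::fortify()):
ggplot(frRDS_df, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", colour = "grey40") +
  coord_quickmap()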
I am going to improve upon the previous answer to the OP's question.
To answer the OP's question directly: there is nothing wrong with the OP's code. The problem was most likely a temporary internet connection issue, because the OP's code works and retrieves the gadm.org data. Note that the getData() function retrieves the gadm.org website's geodata, which is hosted on the http://biogeo.ucdavis.edu/ website.
The raster package provides the getData() function which is very useful for automatically retrieving the geodata from the internet. This function can also be used to retrieve geodata that is kept locally on a PC.
In years past, the way to use geodata was to first download a file from the gadm.org website, then move that file out of the download folder and save it in a folder on the PC. These files then needed to be unpackaged/unzipped before the geodata was available to R.
Using getData() makes life simpler because it retrieves the desired geodata directly and makes it available for use in R.
The gadm.org website clearly states:
"Downloading by country is the recommended approach"
Even though the large worldwide geodata file can be downloaded directly from the website, doing so is unnecessary and resource intensive. Unless there is a specific reason for it, there is no need to download and keep the entire worldwide geodata database on the PC.
And one last thing about the getData() function: it currently generates a warning when used with recent versions of the raster package. The warning reads:
Warning message in getData("GADM", country = "USA", level = 1):
"getData will be removed in a future version of raster.
Please use the geodata package instead"
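Following that warning, a minimal sketch of the geodata replacement (geodata::gadm() downloads the same GADM boundaries; the path argument only sets where the file is cached):
# install.packages("geodata")
library(geodata)
# France, admin level 1, cached under the working directory
france <- gadm(country = "FRA", level = 1, path = getwd())
france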
