I need to read parts of an Excel file into R. I have some existing code, but the authority changed the source: previously there was a direct URL to the document, whereas now the document can only be reached through a website landing page.
Could someone tell me which package I can use to achieve that? The landing page that links to the Excel file is: http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/
There I am looking for the document "Dati operativi relativi al bilanciamento del sistema post Del. 312/2016/R/gas - Database 2018".
I've added my previous code to give an idea of what I did. As you can see, I only needed read.xlsx() for this first step.
Many thanks in advance!
library(ggplot2)
library(lubridate)
library(openxlsx)
library(reshape2)
library(dplyr)
Bilres <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G", startRow = 1, colNames = TRUE)
# Select the publication timestamp and the residual balancing column (column R) of Storico_G into Bilres_df
Bilres_df <- data.frame(Bilres$pubblicazione, Bilres$BILANCIAMENTO.RESIDUALE)
# Converting pubblicazione to date and time format
Bilres_df$Bilres.pubblicazione <- ymd_h(Bilres_df$Bilres.pubblicazione)
# Keep the most recent observation and make the balancing value numeric
Bilreslast <- tail(Bilres_df, 1)
Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE <- as.numeric(as.character(Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE))
If you copy the URL from the web page, you can use download.file() first to download the workbook as a binary file and then read.xlsx() to read the data. Depending on how frequently the content changes on the web page, you may be better off just copying the URL rather than parsing it from the page.
oldFile <- "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx"
newFile <- "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
if(!dir.exists("./data")) dir.create("./data") # make sure the target folder exists
if(!file.exists("./data/downloadedXlsx.xlsx")){
  download.file(newFile, "./data/downloadedXlsx.xlsx",
                method = "curl", # use "curl" for OS X / Linux, "wininet" for Windows
                mode = "wb")     # "wb" means "write binary"
} else message("file already loaded locally, using disk version")
library(openxlsx)
Bilres <- read.xlsx(xlsxFile = "./data/downloadedXlsx.xlsx",
sheet = "Storico_G", startRow = 1, colNames = TRUE)
head(Bilres[,1:3])
...and the output:
> head(Bilres[,1:3])
pubblicazione aggiornato.il IMMESSO
1 2017_01_01_06 42736.24 1915484
2 2017_01_01_07 42736.28 1915484
3 2017_01_01_08 42736.33 1866326
4 2017_01_01_09 42736.36 1866326
5 2017_01_01_10 42736.41 1866326
6 2017_01_01_11 42736.46 1866326
>
UPDATE: Added logic to avoid downloading the file once it has been downloaded.
You can find the .xlsx links this way:
library(rvest)
library(magrittr)
pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")
# get all the Excel (xlsx) links on that page:
html_nodes(pg, xpath = ".//a[contains(@href, '.xlsx')]") %>%
  html_attr("href") %>%
  sprintf("http://www.snamretegas.it%s", .) -> excel_links
head(excel_links)
## [1] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
## [2] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx"
And pass the one you want to your Excel-reading function:
openxlsx::read.xlsx(excel_links[1], sheet = "Storico_G", startRow = 1, colNames = TRUE)
## data frame output here that I'm not going to show
BUT!!
This is a very selfish and unkind way to do this, since you hit that site for the Excel file every time you want to read it, wasting their CPU and bandwidth as well as your own.
You should use the download.file() technique Len described to cache a local copy and only re-download when necessary.
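For example, a minimal caching sketch along those lines, assuming the excel_links vector scraped above and that the 2018 workbook keeps the same Storico_G sheet; the local file name is just a placeholder:
local_copy <- "DatiOperativi_2018-IT.xlsx" # placeholder local file name
if (!file.exists(local_copy)) {
  download.file(excel_links[2], local_copy, mode = "wb") # the 2018 workbook, per excel_links above
}
Bilres <- openxlsx::read.xlsx(local_copy, sheet = "Storico_G", startRow = 1, colNames = TRUE)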
This should get you going in the right direction. Note that data.table::fread() reads delimited text, not .xlsx workbooks, so use openxlsx::read.xlsx() on the URL (or on a locally cached copy, as above) instead:
library(openxlsx)
mydat <- read.xlsx('http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx', sheet = "Storico_G")
head(mydat)
Thanks to everyone who helped me with my previous query!
I have another question, about how to download these images using a loop.
I would like to download, all at once, the images from my data frame, which consists of URL links that each point directly to a .jpg image.
I've attached my current code below; this is what I use to collect the URLs:
# load libraries and packages
library("rvest")
library("ralger")
library("tidyverse")
library("jpeg")
library("here")
# set the number of pages
num_pages <- 5
# set working directory for photos to be stored
setwd("~/Desktop/lab/male_generic")
# create a list to hold the output
male <- vector("list", num_pages)
# loop over the search pages and scrape the image preview links from istockphoto
for(page_result in 1:num_pages){
  link <- paste0("https://www.istockphoto.com/search/2/image?alloweduse=availableforalluses&mediatype=photography&phrase=man&page=", page_result)
  male[[page_result]] <- images_preview(link)
}
male <- unlist(male)
I only figured out how to download one image at a time, but I would like to learn how to do it all at once:
test = "https://media.istockphoto.com/id/1028900652/photo/man-meditating-yoga-at-sunset-mountains-travel-lifestyle-relaxation-emotional-concept.jpg?s=612x612&w=0&k=20&c=96TlYdSI8POnOrcqH10GlPgOeWFjEIoY-7G_yMV4Eco="
download.file(test,'test.jpg', mode = 'wb')
# build the search-page links in one vectorised call
num_pages <- 10 # write the number of pages you want to download
link <- paste0("https://www.istockphoto.com/search/2/image?alloweduse=availableforalluses&mediatype=photography&phrase=man&page=", 1:num_pages)
# scrape the direct image URLs from every page, then download each one
img_urls <- unlist(lapply(link, images_preview))
sapply(seq_along(img_urls), function(i) {
  download.file(img_urls[i],
                destfile = paste0("C:/Users/USUARIO/Desktop/", # change it to your directory
                                  "img_", i, ".jpg"),
                mode = "wb")
})
I'm trying to check whether a user has an up-to-date version of London's Covid prevalence data file in their working directory and, if not, download it from here:
fileURL <-"https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=region;areaName=London&structure=%7B%22areaType%22:%22areaType%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22date%22:%22date%22,%22newCasesBySpecimenDateRollingSum%22:%22newCasesBySpecimenDateRollingSum%22,%22newCasesBySpecimenDateRollingRate%22:%22newCasesBySpecimenDateRollingRate%22%7D&format=csv"
Pasting that URL into the browser downloads the csv file, but using download.file(fileURL, "data_.csv") creates junk. Why?
So far I have:
library(data.table)
library(lubridate) # for today()
library(magrittr)  # for %>%
#Look for COVID file starting with "data_"
destfile <- list.files(pattern = "data_")
#Find file date
fileDate <- file.info(destfile)$ctime %>% as.Date()
if(!file.exists(destfile) | fileDate != today()){
  res <- tryCatch(download.file(url = fileURL,
                                destfile = paste0("data_", today(), ".csv"),
                                method = "auto"),
                  error = function(e) 1)
  if(res != 1) COVIDdata <- data.table::fread(destfile) # This doesn't read the file
}
The function always downloads a file regardless of the date on it, but it saves it in an unreadable format. I've resorted to downloading the file every time, as follows:
COVIDdata <- data.table::fread(fileURL)
I think this is an encoding issue with the result of download.file(); one way around it is to use fread() to get the data and then write it with fwrite():
#Look for COVID file starting with "data_"
destfile <- list.files(pattern = "data_")
#Find file date
fileDate <- file.info(destfile)$ctime %>% as.Date()
#>[1] "2020-11-06" "2020-11-06" "2020-11-06"
if(!length(destfile) | max(fileDate) != today()){
  COVIDdata <- fread(fileURL)
  fwrite(COVIDdata, file = paste0("data_", today(), ".csv"))
}
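If the local copy is already from today, you could also read it back from disk instead of hitting the API again; a sketch of the full check, re-using destfile, fileDate and fileURL from above (the else branch is an assumption about what you want to happen in that case):
if(!length(destfile) || max(fileDate) != today()){
  COVIDdata <- fread(fileURL)
  fwrite(COVIDdata, file = paste0("data_", today(), ".csv"))
} else {
  # a cached file from today already exists: read the newest local copy
  COVIDdata <- fread(destfile[which.max(fileDate)])
}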
I have been trying to work this out but I have not been able to do it...
I want to create a data frame with four columns: country, number, year, and the content of the .txt file.
There is a .zip file in the following URL:
https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT
The file contains a folder with 49 folders in it, and each of them contains roughly 150 .txt files.
I first tried to download the zip file with get_dataset, but it did not work:
if (!require("dataverse")) devtools::install_github("iqss/dataverse-client-r")
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu")
"Error in get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu") :
Not Found (HTTP 404)."
Then I tried
temp <- tempfile()
download.file("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT",temp)
UNGDC <-unzip(temp, "UNGDC+1970-2018.zip")
It worked up to a point... I downloaded the .zip file and then created UNGDC, but nothing happened, because it only contains the following information:
UNGDC
A connection with
description "/var/folders/nl/ss_qsy090l78_tyycy03x0yh0000gn/T//RtmpTc3lvX/fileab730f392b3:UNGDC+1970-2018.zip"
class "unz"
mode "r"
text "text"
opened "closed"
can read "yes"
can write "yes"
At this point I don't know what to do... I have not found relevant information on how to proceed... Can someone please give me some hints, or point me to a resource where I can learn how to do it?
Thanks for your attention and help!!!
How about this? I used the zip package to unzip, but possibly the base unzip might work as well.
library(zip)
dir.create(temp <- tempfile())
url <- 'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT'
download.file(url, paste0(temp, '/PZUURT.zip'), mode = 'wb')
unzip(paste0(temp, '/PZUURT.zip'), exdir = temp)
Note in particular I had to set the mode = 'wb' as I'm on a Windows machine.
I then saw that the unzipped archive had a _MACOSX folder and a Converted sessions folder. Assuming I don't need the MACOSX stuff, I did the following to get just the files I'm interested in:
root_folder <- paste0(temp,'/Converted sessions/')
filelist <- list.files(path = root_folder, pattern = '\\.txt$', recursive = TRUE)
filenames <- basename(filelist)
'filelist' contains the path of each text file relative to root_folder, while 'filenames' has just each file name, which I'll then break up to get the country, the number and the year:
df <- data.frame(t(sapply(strsplit(filenames, '_'),
                          function(x) c(x[1], x[2], substr(x[3], 1, 4)))))
colnames(df) <- c('Country', 'Number', 'Year')
Finally, I can read the text from each of the files and stick it into the dataframe as a new Text field:
df$Text <- sapply(paste0(root_folder, filelist), function(x) readChar(x, file.info(x)$size))
I have to download a lot of files in .gz format (one file is ~40 MB, ~40k rows).
Each file contains data for many countries; I would like to keep only the data for France (fr), limiting the number of columns.
I am trying to automate this process, but I have problems with unpacking.
The data is on a webpage, and I'm interested in the data from the whole folder.
My approach is to:
create a tempfile
download the .gz file to the tempfile
unzip it, read it and select the rows I need
save the result as a new file and repeat for the next file
I would like to ask if this way of thinking is correct (the code below will be in a for loop):
temp <- tempfile()
temp1 <- "C:/Users/tdo/Desktop/data/test.txt"
download.file("https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz", temp) # example
unzip(files = temp, exdir = temp1)
data <- read.table(..)
data[data$name == 'fr', ]
write.table(...)
In this way I created links:
dumpList <- read_html("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/")
links <- data_frame(filename = html_attr(html_nodes(dumpList, "a"), "href")) %>%
  filter(grepl(x = filename, pattern = "pageviews")) %>% # data by project
  mutate(link = paste0("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/", filename))
Why not directly read the gzipped files? I don't see the need to locally unpack the archives, if all you want to do is subset/filter the data and store as new local files.
I recommend using readr::read_table2 to directly read the gzipped file.
Here is a minimal example:
# List of files to download
# url is the link, target the local filename
lst.files <- list(
  list(
    url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz",
    target = "pageviews-20180601-000000.gz"))
# Download gzipped files (only if file does not exist)
lapply(lst.files, function(x)
  if (!file.exists(x$target)) download.file(x$url, x$target))
# Open files
library(readr)
lst <- lapply(lst.files, function(x) {
  df <- read_table2(x$target)
  # Filter/subset entries
  # Write to file with write_delim
})
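To sketch the filter/write step, assuming the usual pageviews layout (space-delimited columns project, page, views, bytes with no header row) and that you want only the fr project, something like this could work:
library(dplyr) # for filter/select
lst <- lapply(lst.files, function(x) {
  # the pageviews dumps are space-delimited with no header row
  df <- read_table2(x$target,
                    col_names = c("project", "page", "views", "bytes"))
  # keep only the French (fr) project rows and the columns of interest
  df_fr <- df %>% filter(project == "fr") %>% select(project, page, views)
  # write the subset next to the original as a plain delimited file
  write_delim(df_fr, paste0("fr_", sub("\\.gz$", ".txt", x$target)))
  df_fr
})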
I have the below-mentioned code, which downloads a Google Sheet and stores it in Documents.
library(dplyr)
library(data.table)
library(googlesheets)
library(rJava)
t.start<-Sys.Date()
t.start<-as.character(t.start)
#gs_auth(new_user = TRUE)
#gs_ls()
#gs_auth()
as<-gs_title("XYZ")
gs_download(as, overwrite = TRUE)
I want the sheet XYZ stored in a specific location (i.e. E:\My_data\File) with the conditions mentioned below.
I want to run this script twice a day, and I want to rename the file XYZ based on Sys.Date() and a time condition. (For example, if Sys.Date() = 01/01/2017 and the time is < 15:00 hrs, the name should be 01/01/2017_A_XYZ.xlsx; for > 15:00 hrs it should be 01/01/2017_B_XYZ.xlsx.)
I want to automatically create folders in E:\My_data\File based on Sys.Date() (i.e. year and month). If Sys.Date() = 01/01/2017, there should be one folder named 2017 containing a subfolder named Jan-17, and inside that subfolder there should be two subfolders: A (for files < 15:00 hrs for that particular year/month) and B (for files > 15:00 hrs for that particular year/month).
If the year/month changes, a new folder should be created with the same structure.
You can use the following code to do that:
# To handle the googlesheets
require(googlesheets)
# For easier date manipulation
require(lubridate)
# Get current time
t <- Sys.time()
# Set your base path and create the basic file structure
base_path <- "E:/My_data/File/"
dir.create(paste0(base_path, year(t)))
sub_folder_path <- paste0(base_path, year(t), "/", month(t, label = TRUE), "-", format(t, "%y")) # month-year subfolder, e.g. "Jan-17"
dir.create(sub_folder_path)
AB_split <- ifelse(hour(t)<15, "A", "B")
dir.create(paste0(sub_folder_path, "/", AB_split))
# Set your gsheet title and the wanted file-name
ws_title <- "XYZ"
ws_file_name <- paste0(date(t), "_", AB_split, "_", ws_title, ".xlsx")
ws_file_path <- paste0(sub_folder_path, "/", AB_split, "/", ws_file_name)
# Download it
as<-gs_title(ws_title)
gs_download(as, to = ws_file_path, overwrite = TRUE)
Trying to create an existing folder results in a warning. If you want to suppress the warnings, wrap the dir.create calls in suppressWarnings(dir.create(...)).
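For example, re-using base_path and t from above:
suppressWarnings(dir.create(paste0(base_path, year(t)))) # no warning if the year folder already exists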
I would strongly recommend NOT working with the worksheet title but using the key instead. See ?gs_key.