Read in feather file directly from GitHub in R

How can I read in a .feather file from the web (e.g. GitHub) in R? I can read formats such as .csv or .dta directly from GitHub via their raw URLs:
# CSV
coursedata <- read.csv(file = 'https://raw.githubusercontent.com/MarcoKuehne/seminars_in_applied_economics/main/Data/GF_2020.csv')
# DTA
library(haven)
soep <- read_dta("https://github.com/MarcoKuehne/seminars_in_applied_economics/blob/main/Data/soep_lebensz_en.dta?raw=true")
But the same approach fails for read_feather from the arrow package:
library(arrow)
digital <- read_feather("https://github.com/MarcoKuehne/seminars_in_applied_economics/blob/main/Data/Digital_Literacy_EN.feather?raw=true")
Is there a direct way or a nested command? Or am I required to download the file manually or programmatically as a temporary file?
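One possible route is the programmatic temporary file mentioned in the question: download the file in binary mode, then hand the local path to the reader. The sketch below is only an illustration of that idea; the helper name read_remote and its arguments are hypothetical, and it assumes the reader function accepts a local file path as its first argument.

```r
# Hypothetical helper: fetch a remote file into a temporary file and hand the
# local path to `reader`. The temporary file lives in the session temp
# directory, which R cleans up when the session ends.
read_remote <- function(url, reader, ..., fileext = "") {
  tmp <- tempfile(fileext = fileext)
  # mode = "wb" keeps binary formats such as Feather intact on Windows
  utils::download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  reader(tmp, ...)
}

# Intended use with the file from the question (needs arrow and network access):
# digital <- read_remote(
#   "https://github.com/MarcoKuehne/seminars_in_applied_economics/blob/main/Data/Digital_Literacy_EN.feather?raw=true",
#   arrow::read_feather,
#   fileext = ".feather"
# )
```

Passing the reader as an argument keeps the download step independent of the file format, so the same helper could be tried with haven::read_dta or other readers.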

Related

Read .tar.gz file in R

I have a compressed file, cat.txt.tar.gz, that I need to load into R and process as follows:
zip <-("cat.txt.tar.gz")
data <- read.delim(file=(untar(zip,"cat.txt")),sep="\t")
but "data" is empty after running the code. Is there any way to read a file from a .tar.gz archive?

Are you sure your file is named correctly?
Usually compressed files are named cat.tar.gz, excluding the .txt.
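Second, try the following code:
tarfile <- "cat.txt.tar.gz" # Or "cat.tar.gz" if that is right
exdir <- tempdir()
untar(tarfile, files = "cat.txt", exdir = exdir) # extract cat.txt into a temporary directory
data <- read.delim(file = file.path(exdir, "cat.txt"), sep = "\t")
Note that untar() returns a status code rather than a file name, so the file has to be extracted first and then read from the extraction directory.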

To read a particular csv or txt file inside a tar.gz archive without extracting it first, you can use the archive package:
library(archive)
library(readr)
# read_tsv() reads tab-separated data; read_csv() has no sep argument
read_tsv(archive_read("cat.txt.tar.gz", file = 1), col_types = cols())
This should work.
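For comparison, the extract-then-read route can be tried with base R alone. The following is a minimal, self-contained sketch: it first builds a small example archive (the data and the temporary paths are made up for illustration), then lists its members, extracts cat.txt, and reads it.

```r
# Build a small tab-separated file and pack it, so the sketch is self-contained
work <- tempfile("tar_demo_")
dir.create(work)
old_wd <- setwd(work)
write.table(data.frame(id = 1:3, name = c("a", "b", "c")),
            "cat.txt", sep = "\t", row.names = FALSE, quote = FALSE)
tar("cat.txt.tar.gz", files = "cat.txt", compression = "gzip")
setwd(old_wd)

tarfile <- file.path(work, "cat.txt.tar.gz")

# untar(list = TRUE) returns the member names without extracting anything
members <- untar(tarfile, list = TRUE)

# untar() itself returns a status code, so extract first, then read the file
out_dir <- file.path(work, "extracted")
dir.create(out_dir)
untar(tarfile, files = "cat.txt", exdir = out_dir)
data <- read.delim(file.path(out_dir, "cat.txt"), sep = "\t")
```

Listing the members first is a cheap way to check the actual file name inside the archive before trying to read it.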

Batch convert Excel files to PDFs in R

I have a folder of Excel (xlsx) sheets that I want to convert to PDFs in R. I've tried reading the worksheets into R directly (using almost all packages), but the data is never read properly. I'm dealing with Excel spreadsheets from several different people, so I assume this is because of differences in how the files were saved on each person's computer.
I figure that converting these files to PDFs would mean that they are all formatted the same and therefore will be easier to work with.
Is it possible to convert files from excel worksheets to PDFs using R without opening the files/reading them into R as this is where the errors occur?

This should work with R 3.6.2:
# install RDCOMClient for 3.6.2
url <- "http://www.omegahat.net/R/bin/windows/contrib/3.5.1/RDCOMClient_0.93-0.zip"
install.packages(url, repos=NULL, type="binary")
# install R.utils
install.packages("R.utils")
library(RDCOMClient)
library(R.utils)
# list the Excel files in your folder
# replace "Path to your folder" with the path to your folder
folder <- "Path to your folder"
files <- list.files(folder, pattern = "\\.xlsx$", full.names = TRUE)
# Batch convert; each PDF is written into the same folder as its workbook
lapply(files, function(x) {
  ex <- COMCreate("Excel.Application") # create COM object
  file <- getAbsolutePath(x) # convert to absolute path
  book <- ex$workbooks()$Open(file) # open Excel file
  sheet <- book$Worksheets()$Item(1) # pointer to first worksheet
  sheet$Select() # select first worksheet
  ex[["ActiveSheet"]]$ExportAsFixedFormat(Type = 0, # export as PDF
    Filename = file.path(getAbsolutePath(folder), sub("\\.xlsx$", ".pdf", basename(x))),
    IgnorePrintAreas = FALSE)
  ex[["ActiveWorkbook"]]$Save() # save workbook
  ex$Quit() # close Excel
})

Import excel from Azure blob using R

I have the basic setup done following the link below:
http://htmlpreview.github.io/?https://github.com/Microsoft/AzureSMR/blob/master/inst/doc/tutorial.html
There is a method 'azureGetBlob' which allows you to retrieve objects from the containers. However, it seems to only allow the "raw" and "text" formats, which is not very useful for Excel. I've tested the connections, and I can retrieve .txt / .csv files but not .xlsx files.
Does anyone know any workaround for this?
Thanks

Does anyone know any workaround for this?
Azure blob storage has no notion of a file type; a blob only has a name, and the extension is only meaningful to the OS. To open the Excel file in R, we can use a third-party library such as readxl.
Workaround:
You could use the get blob API to download the blob to a local path and then use readxl to read the file. More demo code is also available from this link.
# install
install.packages("readxl")
# Loading
library("readxl")
# xls files
my_data <- read_excel("my_file.xls")
# xlsx files
my_data <- read_excel("my_file.xlsx")

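Solved with the following code. Basically, I read the blob as raw bytes, wrote the bytes to a temporary file on disk, and then read that file into R:
library(xlsx) # read.xlsx() with sheetIndex comes from the xlsx package
excel_bytes <- azureGetBlob(sc, storageAccount = "accountname", container = "containername", blob=blob_name, type="raw")
q <- tempfile() # temporary file to hold the workbook
f <- file(q, 'wb') # open it for binary writing
writeBin(excel_bytes, f)
close(f)
result <- read.xlsx(q, sheetIndex = sheetIndex)
unlink(q) # delete the temporary file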
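The same bytes-to-temporary-file pattern can be wrapped in a small function. This is only a sketch: the helper name read_raw_bytes is hypothetical, and the example uses CSV bytes as a stand-in for what azureGetBlob(..., type = "raw") returns, so that it runs without an Azure account.

```r
# Hypothetical helper: persist a raw vector to a temporary file, read it with
# `reader`, then delete the temporary file again.
read_raw_bytes <- function(bytes, reader, ..., fileext = "") {
  stopifnot(is.raw(bytes))
  tmp <- tempfile(fileext = fileext)
  on.exit(unlink(tmp), add = TRUE)
  writeBin(bytes, tmp) # writeBin() accepts a file name and writes in binary mode
  reader(tmp, ...)
}

# Stand-in for the bytes returned by azureGetBlob(..., type = "raw")
csv_bytes <- charToRaw("id,value\n1,10\n2,20\n")
result <- read_raw_bytes(csv_bytes, utils::read.csv, fileext = ".csv")
```

For a workbook, something like read_raw_bytes(excel_bytes, readxl::read_excel, fileext = ".xlsx") should follow the same pattern, assuming readxl is installed.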

Dynamically converting a list of Excel files to csv files in R

I currently have a folder containing only Excel (.xlsx) files, and using R I would like to automatically convert all of these files to CSV files using the "openxlsx" package (or some variation). I currently have the following code to convert one of the files and place it in the same folder:
convert("team_order\\team_1.xlsx", "team_order\\team_1.csv")
I would like to automate the process so it does it to all the files in the folder, and also removes the current xlsx files, so only the csv files remain. Thanks!

You can try this using rio, since it seems like that's what you're already using:
library("rio")
xls <- dir(pattern = "\\.xlsx$") # only files ending in .xlsx
created <- mapply(convert, xls, sub("\\.xlsx$", ".csv", xls))
unlink(xls) # delete xlsx files

library(readxl)
# Create a vector of Excel files to read
files.to.read = list.files(pattern = "\\.xlsx$")
# Read the first sheet of each file and write it to csv
lapply(files.to.read, function(f) {
  df = read_excel(f, sheet = 1)
  write.csv(df, sub("\\.xlsx$", ".csv", f), row.names = FALSE)
})
You can remove the files with the command below. However, this is dangerous to run automatically right after the previous code. If the previous code fails for some reason, the code below will still delete your Excel files.
lapply(files.to.read, file.remove)
You could wrap it in a try/catch block to be safe.
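A minimal sketch of that idea, assuming a conversion function of the form convert_one(src, dest); the helper name convert_then_remove and its arguments are hypothetical. Each source file is deleted only if its conversion ran without error and the output file exists.

```r
# Hypothetical helper: convert each file with `convert_one(src, dest)` and
# delete the source only when the conversion finished and the output exists.
convert_then_remove <- function(files, convert_one,
                                from = "\\.xlsx$", to = ".csv") {
  vapply(files, function(src) {
    dest <- sub(from, to, src)
    ok <- tryCatch({
      convert_one(src, dest)
      file.exists(dest)
    }, error = function(e) {
      message("Keeping ", src, ": ", conditionMessage(e))
      FALSE
    })
    if (ok) file.remove(src)
    ok
  }, logical(1))
}
```

With readxl, the converter could be function(src, dest) write.csv(readxl::read_excel(src, sheet = 1), dest, row.names = FALSE). The returned logical vector shows which files were converted and removed.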

Using R to download zipped data file, extract, and import data

@EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?

Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
1. Create a temp. file name (e.g. tempfile())
2. Use download.file() to fetch the file into the temp. file
3. Use unz() to extract the target file from the temp. file
4. Remove the temp file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
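The last point can be illustrated with a gzipped file read through a connection. This is a self-contained sketch with made-up data, written to a temporary file rather than fetched from the web.

```r
# Write a small gzip-compressed table so the sketch is self-contained
path <- tempfile(fileext = ".dat.gz")
con <- gzfile(path, "w")
write.table(data.frame(a = 1:2, b = c(3.5, 4.5)), con, row.names = FALSE)
close(con)

# read.table() opens the gzfile() connection, decompresses on the fly,
# and closes the connection again when it is done
data <- read.table(gzfile(path), header = TRUE)
unlink(path)
```

No separate extraction step or target file name is needed here, which is the difference from the zip case above.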

Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)

I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html . Much easier.
library(downloader)
download(url, destfile = "dataset.zip", mode = "wb") # url holds the address of the zip file
unzip("dataset.zip", exdir = "./")

For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the command-line tool funzip in conjunction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")

Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
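library(readxl) # provides read_xls()
url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile() # downloaded zip archive
temp2 <- tempfile() # directory for the extracted files
download.file(url, temp, mode = "wb")
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
unlink(temp)
unlink(temp2, recursive = TRUE) # temp2 is a directory, so recursive = TRUE is needed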

To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
unlink(temp) # delete the downloaded zip; rm(temp) would only remove the variable
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.

Using the archive package, one can also read a particular csv file within the archive without having to unzip it first:
library(archive)
library(readr)
read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
I find this more convenient, and it is quite a bit faster than the base R untar or unz. It also supports all major archive formats: tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz and uuencoded files.
To unzip everything one can use
archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms and, given the superior performance, would be my preferred option.

Try this code. It works for me:
unzip(zipfile="<directory and filename>",
exdir="<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")

The rio package would be very suitable for this: it uses the file extension of a file name to determine what kind of file it is, so it works with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
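# delete the downloaded archive and the extracted files
# (td is the session temp directory itself, so it is left in place)
unlink(c(tf, file.path(td, file_names$Name)))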

I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)
