Cross read parquet files between R and Python - r

We have generated two parquet files, one with Dask (Python) and another with Drill from R (using the sergeant package). They use different implementations of parquet; see my other parquet question.
We are not able to cross read the files (the python can't read the R file and vice versa).
When reading the Python parquet file in the R environment we receive the following error: system error: IllegalStateException: UTF8 can only annotate binary fields.
When reading the R/Drill parquet file in Dask we get FileNotFoundError: [Errno 2] No such file or directory: ...\_metadata (which is self-explanatory).
What are the options to cross read parquet files between R and Python?
Any insights would be appreciated.

To read Drill-like parquet data sets with fastparquet/dask, you need to pass a list of the filenames, e.g.,
import glob
import dask.dataframe as dd

files = glob.glob('mydata/*/*.parquet')
df = dd.read_parquet(files)
The error from going in the other direction might be a bug, or (judging from your other question) it may indicate that you used fixed-length strings, which Drill/R does not support.

Related

Error when trying to read Excel file from web site

I'm trying to download the xlsx file that is available at the following URL. If you go to the website and click the link, it will download as a file on your computer. However, I want to automate this process. I have tried the following:
library(RCurl)
download.file("https://dshs.texas.gov/coronavirus/TexasCOVID19DailyCountyCaseCountData.xlsx", "temp.xlsx")
library(readxl)
tmp <- read_xlsx("temp.xlsx")
# Error: Evaluation error: error reading from the connection.
This method does download a temp.xlsx file to my drive. However, if you try to open it manually, Excel fails to open it. It knows its size, but is unable to open it.
I also tried:
readxl::read_xlsx("https://dshs.texas.gov/coronavirus/TexasCOVID19DailyCountyCaseCountData.xlsx")
# Error: `path` does not exist: ‘https://dshs.texas.gov/coronavirus/TexasCOVID19DailyCountyCaseCountData.xlsx’
Both of these methods are my go-to approaches for downloading Excel files from websites. Is there some specific reason why they don't work here?
When downloading certain file formats on Windows you need to specify that the transfer should be binary rather than the (usual) default of a text transfer. From the download.file() documentation:
The choice of binary transfer (mode = "wb" or "ab") is important on Windows, since unlike Unix-alikes it does distinguish between text and binary files and for text transfers changes \n line endings to \r\n (aka ‘CRLF’).
On Windows, if mode is not supplied (missing()) and url ends in one of .gz, .bz2, .xz, .tgz, .zip, .rda, .rds or .RData, mode = "wb" is set such that a binary transfer is done to help unwary users.
Code written to download binary files must use mode = "wb" (or "ab"), but the problems incurred by a text transfer will only be seen on Windows.
Since .xlsx is not one of the extensions that triggers a binary transfer automatically, set the mode explicitly so that the file is written correctly:
download.file("https://dshs.texas.gov/coronavirus/TexasCOVID19DailyCountyCaseCountData.xlsx",
"temp.xlsx", mode = "wb")

Reading csv files from Microsoft Azure using R

I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks
which outputs many csv files in Azure Storage Explorer under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is to go to the folder p, download all the csv files
by right-clicking the p folder and clicking download to my local drive,
and then read these csv files in R to do any analysis.
My issue is that sometimes my runs generate more than 10000 csv files,
and downloading them to the local drive takes a lot of time.
I wondered if there is a tutorial/R package which helps me read the
csv files from the path above without downloading them. For example,
is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files in the same way I do now?
EDIT: the full URL to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official Azure Databricks document CSV Files, you can read a csv file directly in R in an Azure Databricks notebook, as shown in the Read CSV files notebook example of that document.
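A minimal sketch of that approach, assuming the blob container is mounted into DBFS (the mount point below is hypothetical):
library(SparkR)
# hypothetical mount point; adjust to wherever your container/folder is mounted
df <- read.df("/mnt/mycontainer/myfolder/subfolder/output/old/p/",
              source = "csv", header = "true", inferSchema = "true")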
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to read a csv file directly from a blob URL with a SAS token for Azure Blob Storage.
Here are my steps:
1. I created an R notebook in my Azure Databricks workspace.
2. I installed the R package reticulate via install.packages("reticulate").
3. I installed the Python package azure-storage-blob with the code below.
%sh
pip install azure-storage-blob
4. I ran a Python script to generate a container-level SAS token and used it to build a list of blob URLs with the SAS token, as in the code below.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_service = BaseBlobService(
account_name=account_name,
account_key=account_key
)
sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
5. Now I can use different ways in R to read a csv file from a blob URL with the SAS token, such as the following.
5.1. Using base R
df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using the R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using the R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.

How to load objects from S3 bucket into Spark in RStudio?

The object in the S3 bucket is 5.3 GB in size. In order to convert the object into data, I used get_object("link to bucket path"), but this leads to memory issues.
So I installed Spark 2.3.0 in RStudio and am trying to load this object directly into Spark, but I don't know the command to load an object directly into Spark.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
If I converted the object into a readable data type (such as a data.frame/tbl in R), I would use copy_to to transfer the data into Spark from R, as below:
# Copy data to Spark
spark_tbl <- copy_to(sc, data)
I was wondering how I can load the object directly inside Spark?
Relevant links:
https://github.com/cloudyr/aws.s3/issues/170
Sparklyr connection to S3 bucket throwing up error
Any guidance would be sincerely appreciated.
Solution:
I was trying to read the 5.3 GB csv file from the S3 bucket, but since R is single-threaded, it was giving memory issues (IO exceptions).
However, the solution is to load sparklyr in R (library(sparklyr)), so that all the cores in the computer are utilized.
get_object("link to bucket path") can then be replaced by spark_read_csv(sc, "link to bucket path"). Since the read is handled by Spark rather than being loaded entirely into R's memory, there are no memory issues.
Also, depending on the file format, you can use the corresponding functions:
spark_load_table, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text

Cannot load my CSV file into my R? keep getting error messages

So basically I successfully exported my SQL view data into a csv file, but now when I load it into the RGui software, I get the following error:
> load("C:\\Users\\dachen\\Documents\\vTargetBuyers.csv")
Error: bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning message:
file ‘vTargetBuyers.csv’ has magic number 'Marit'
Use of save versions prior to 2 is deprecated
What should I do? Is my R version installed incorrectly, or is something wrong with my CSV file?
Try using read.csv instead of load. load is for reading files created by save.
Type ?read.csv to access the documentation.
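With the path from the question, a minimal sketch would be:
# read the exported view into a data frame instead of using load()
df <- read.csv("C:\\Users\\dachen\\Documents\\vTargetBuyers.csv")
head(df)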

Intermittant "Permission Denied" error when compiling Latex document with Knitr and openxlsx

I am using a data set that has been provided in an xlsx workbook across several worksheets. I created a function to extract the data from one of the sheets and reformat it into a usable form.
Opening the file twice in one compilation of the LaTeX document often (but not always) triggers the following error:
Error in unzip(xlsxFile, exdir = xmlDir) :
cannot open file 'C:/Users/Jonno/AppData/Local/Temp/RtmpewoYVY/_excelXMLRead/xl/worksheets/sheet5.xml': Permission denied
This suggests that the temporary file is not being closed which prevents openxlsx from unzipping the xlsx file again.
Unfortunately the file needs to be opened several times to extract the data.
Also, unfortunately, I haven't been able to create a reproducible example, which suggests that it may be something to do with the structure of the file I have received.
Is there a way to force the temporary file to be closed and deleted after each time it is opened?
This problem only occurs when knitting; when the code is run directly in the R environment, it never occurs.
