sparklyr spark_read_parquet from s3 error - r

When I read a parquet file on S3 from a sparklyr context like this:
spark_read_parquet(sc, name = "parquet_test", path = "s3a://<path-to-file>")
it throws an error:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=s3a: .....
I was able to read the parquet file in a SparkR session using the read.parquet() function, so there must be some difference in Spark context configuration between SparkR and sparklyr.
Any suggestions on this issue? Thanks.

In yarn-client mode, the file scheme you are using is not correct. You'll need to use s3://<path-to-file>.
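A minimal sketch of that suggestion, reusing the call from the question with only the scheme changed (the path placeholder stays as in the original):

# Same call as in the question, but with the s3:// scheme instead of s3a://
parquet_test <- spark_read_parquet(sc,
                                   name = "parquet_test",
                                   path = "s3://<path-to-file>")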

Related

Error Reading XLS files into R - cannot parse file

I tried installing Java and using the XLConnect package methods (both of them), but I get an error that looks like this:
Error: IOException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream
I am definitely not trying to load an empty file, and I definitely do have access to/permission to read from the file location. What might I be doing wrong?
I've investigated these related questions without success.
Importing Excel files into R, xlsx or xls
R read_excel: libxls error: Unable to parse file
Check the code below; it works with library(readxl). Both xls and xlsx work.
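A minimal sketch of that answer with readxl (the file name and sheet number are placeholders, not from the question):

library(readxl)

# read_excel() handles both .xls and .xlsx files
dat <- read_excel("your_file.xlsx", sheet = 1)
head(dat)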

Why are some jsonl files failing to load in SparklyR

I am currently trying to read in some jsonl files using SparklyR v 1.3.1 with Spark 2.3.3. While some files read in fine, I am struggling with others, using exactly the same code. Long-ish details below including error messages and packages/code being used.
library(sparklyr)
library(sparklyr.nested)
library(dplyr)
sc <- spark_connect(master = "local")
june1 <- spark_read_json(sc, "june1-aa.jsonl")
june <- spark_read_json(sc, "janetweets_june24.jsonl")
Error: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
The first file appears to read in OK, but any attempt to view it meets with the following error, and "no tables" is displayed in the Connections window instead of the file structure shown for similar files that read in fine.
Error in value[[3L]](cond) :
Failed to fetch data: java.lang.NullPointerException
at sparklyr.Collectors$.collectLongArr(collectors.scala:87)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$ColumnCtx.collect(collectors.scala:183)
at sparklyr.Utils$.sparklyr$Utils$$collectRows(utils.scala:90)
at sparklyr.Utils$.collect(utils.scala:114)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke.invoke(invoke.scala:147)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:136)
at sparklyr.StreamHandler.read(stream.scala:61)
at sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(h
Owing to the size of these files, only the first lines of each file are shown, as viewed in the terminal.
Sample data in pastebin: https://pastebin.com/y3Zevnpv.
I have tried updating my package to the latest version and deleting and reinstalling, and these files are definitely in the jsonlines format. These files were pulled from Twitter using the twarc command-line tool on a Windows machine. Please note: I have removed some URLs from the above data owing to Stack Overflow guidelines. Thanks!
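One way to narrow this down (a diagnostic sketch, not a fix confirmed in this thread) is to validate the problem file line by line with jsonlite before handing it to Spark, since the error means Spark could not parse any column other than _corrupt_record out of it:

library(jsonlite)

# Assumes the file (or a sample of it) can be read on the local machine
lines <- readLines("janetweets_june24.jsonl", warn = FALSE)
ok <- vapply(lines, function(x) isTRUE(validate(x)), logical(1), USE.NAMES = FALSE)
which(!ok)  # indices of lines Spark would route to _corrupt_record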

Load h5 keras model file in R

I'm building an R package for binary classification and I'm using OpenCPU to host it. Currently I've saved the h5 file as an .RData file (serialized), which is then loaded into the environment using the .onLoad() function in R. This lets the R script use the environment variable to load the Keras model with keras::unserialized_model().
I've tried using keras::load_model_hdf5() directly in the code, but after building and deploying on OpenCPU, when I hit the prediction API I get this error:
ioerror: unable to open file (unable to open file: name = '/home/modelfile_26feb.h5', errno = 13, error message = 'permission denied', flags = 0, o_flags = 0)
I have changed the permissions on the file (777) and even the group, but I'm still getting the error.
I even tried putting the file in the inst/extdata folder so that it gets included in the package, but I get the same error.
Can anyone help on this, or suggest some alternative to load the h5 model directly?
Which OS does OpenCPU run on? Why does it try to write in /home/? This is very unusual. The best solution is to adapt your code to write to getwd() or tempdir(). Even better is to store the data in a local database or Redis server and let R read it from there, so you don't need disk access at all.
If you run on Ubuntu Server, reading from /home/ is not permitted by default. If you want to allow this, you need to add AppArmor rules; see section 3.5 of the server manual.
Some relevant topics from the opencpu mailing list:
write in home dir: https://groups.google.com/d/msg/opencpu/5vRvgSKY-qE/4xMzZCGJBAAJ
keras in opencpu: https://groups.google.com/d/msg/opencpu/HhRzFVVFdaA/n5Nu1sxyFgAJ
write tmp folder: https://groups.google.com/d/msg/opencpu/Y1tYhaQUzwU/ubSEd_CDCgAJ
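Building on the inst/extdata attempt mentioned in the question, here is a hedged sketch of loading the model from the installed package location instead of a hard-coded /home/ path (the file name is taken from the error message; everything else is illustrative):

# Package-level environment to hold the loaded model
.pkg_env <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # Resolve the installed path of inst/extdata/modelfile_26feb.h5
  model_path <- system.file("extdata", "modelfile_26feb.h5",
                            package = pkgname, mustWork = TRUE)
  .pkg_env$model <- keras::load_model_hdf5(model_path)
}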

Error reading a .nc4 file in R (ncdf4 package)

I am trying to use a data set of .nc4 files downloaded from NASA.
The format NCDF4 is confirmed by this source.
I used download.file() in R to get the dataset and then a simple nc_open() (ncdf4 package) to test the file. Unfortunately the result is an "Unknown file format" error.
Here are my replication file and my script:
download.file(url = "http://hydro1.gesdisc.eosdis.nasa.gov/.../url", destfile = "destination_folder/file.nc4")
All fine till this point, but when testing the files:
library(ncdf4)
setwd('destination_folder')
data <- nc_open('file.nc4')
Error in R_nc4_open: NetCDF: Unknown file format
Error in nc_open("file.nc4") :
Error in nc_open trying to open file file.nc4
Am I missing something?
Thank you.
I do not know what is wrong, but I can add the information that the problem resides in the Windows implementation of the ncdf4 package. With the following statement:
catlg<-nc_open("http://opendap.deltares.nl/thredds/dodsC/opendap/rijkswaterstaat/waterbase/concentration_of_suspended_matter_in_water/catalog.nc")
I have the same problem as described in the question; however, it works perfectly in R under Linux. The file server is an OpenDAP server strictly following netCDF-4 conventions, but maybe some features are not correctly implemented in the ncdf4 package under Windows.
For some reason I get the same error using [64-bit] C:\Program Files\R\R-3.4.2, but when using [64-bit] C:\Program Files\R\R-3.3.3 the ncdf4 package works fine.
Not that this solves the problem, but it provides an easy workaround for the time being.
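One assumption worth ruling out, which is not confirmed anywhere in this thread: on Windows, download.file() defaults to a text-mode transfer, which corrupts binary files such as .nc4 and also produces an "Unknown file format" error from nc_open(). Re-downloading in binary mode checks for that:

# mode = "wb" prevents line-ending translation from corrupting the binary file on Windows
download.file(url      = "http://hydro1.gesdisc.eosdis.nasa.gov/.../url",
              destfile = "destination_folder/file.nc4",
              mode     = "wb")

library(ncdf4)
data <- nc_open("destination_folder/file.nc4")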

SparkR and Packages

How does one call packages from Spark to be used for data operations with R?
For example, I am trying to access my test.csv in HDFS as below:
Sys.setenv(SPARK_HOME="/opt/spark14")
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv", "com.databricks.spark.csv", header = "true")
but I get the error below:
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I tried loading the CSV package with the option below:
Sys.setenv('SPARKR_SUBMIT_ARGS'='--packages com.databricks:spark-csv_2.10:1.0.3')
but I get the error below when loading sqlContext:
Launching java with spark-submit command /opt/spark14/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 /tmp/RtmpuvwOky /backend_port95332e5267b
Error: Cannot load main class from JAR file:/tmp/RtmpuvwOky/backend_port95332e5267b
Any help will be highly appreciated.
So it looks like by setting SPARKR_SUBMIT_ARGS you are overriding the default value, which is sparkr-shell. You could probably do the same thing and just append sparkr-shell to the end of your SPARKR_SUBMIT_ARGS. This seems unnecessarily complex compared to depending on jars, so I've created a JIRA to track this issue (and I'll try and fix it if the SparkR people agree with me): https://issues.apache.org/jira/browse/SPARK-8506
Note: another option would be using the sparkr command + --packages com.databricks:spark-csv_2.10:1.0.3 since that should work.
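A sketch of the first suggestion, keeping the default sparkr-shell by appending it after the --packages flag (paths and host are copied from the question):

# Set the submit args before initializing SparkR so the package is on the classpath
Sys.setenv(SPARKR_SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell")

library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv",
                   "com.databricks.spark.csv", header = "true")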
