Why are some jsonl files failing to load in sparklyr?

I am currently trying to read in some jsonl files using sparklyr 1.3.1 with Spark 2.3.3. While some files read in fine, I am struggling with others, using exactly the same code. Longish details below, including error messages and the packages/code being used.
library(sparklyr)
library(sparklyr.nested)
library(dplyr)
sc <- spark_connect(master = "local")
june1 <- spark_read_json(sc, "june1-aa.jsonl")
june <- spark_read_json(sc, "janetweets_june24.jsonl")
Error: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
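For reference, Spark's suggestion in that message seems to translate into sparklyr roughly as follows (a sketch I have not verified; the option and column names are the Spark defaults, not anything specific to my data):

# Read permissively, cache the parsed result (memory = TRUE), then inspect
# the corrupt-record column, as the error message advises.
june <- spark_read_json(
  sc,
  name = "june",
  path = "janetweets_june24.jsonl",
  options = list(mode = "PERMISSIVE",
                 columnNameOfCorruptRecord = "_corrupt_record"),
  memory = TRUE
)
june %>%
  filter(!is.na(`_corrupt_record`)) %>%
  head(5) %>%
  collect()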
The first file appears to read in OK, but any attempt to view it fails with the following error, and "no tables" is displayed in the Connections pane instead of the file structure shown for similar files that read in fine.
Error in value[[3L]](cond) :
Failed to fetch data: java.lang.NullPointerException
at sparklyr.Collectors$.collectLongArr(collectors.scala:87)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$$anonfun$mkColumnCtx$17.apply(collectors.scala:224)
at sparklyr.Collectors$ColumnCtx.collect(collectors.scala:183)
at sparklyr.Utils$.sparklyr$Utils$$collectRows(utils.scala:90)
at sparklyr.Utils$.collect(utils.scala:114)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke.invoke(invoke.scala:147)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:136)
at sparklyr.StreamHandler.read(stream.scala:61)
at sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(h
Owing to the size of these files, only the first lines of each, as viewed in the terminal, are included as sample data in the pastebin: https://pastebin.com/y3Zevnpv.
I have tried updating the package to the latest version and deleting and reinstalling it, and these files are definitely in the JSON Lines format. The files were pulled from Twitter using the twarc command line tool on a Windows machine. Please note: I have removed some URLs from the data above owing to Stack Overflow guidelines. Thanks!
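One thing I have not yet tried (a sketch only; multiLine is the standard Spark JSON reader option), in case the failing file is actually pretty-printed JSON spread over several lines rather than true JSON Lines:

june <- spark_read_json(
  sc,
  name = "june",
  path = "janetweets_june24.jsonl",
  options = list(multiLine = "true")
)

# It may also be worth validating a few lines locally before involving Spark,
# e.g. with jsonlite (not used in the code above):
# invisible(lapply(readLines("janetweets_june24.jsonl", n = 100),
#                  jsonlite::fromJSON))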

Related

How to solve "bad restore file magic number" when trying to load data?

I tried to load data into my R working directory and received this error:
Error: bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning message:
file ‘classize.RData’ has magic number 'RDX3'
Use of save versions prior to 2 is deprecated
I googled it and tried many options, unsuccessfully.
My RStudio version is 1.2.5033 (the error was happening before updating as well).
I created a new project and put the data file in the new directory.
The data file is "classize.RData".
I also have an alternative file, "classize.RDS", with the suggestion to use readRDS(file = "classize.RDS"). When using this command, I receive this error:
cannot read workspace version 3 written by R 3.6.1; need R 3.5.0 or newer
This is in the context of a statistics course at university and my teaching assistant is unable to help me out; without resolving this issue, I cannot move forward with the required exercises. So please, could you help me resolve this problem?
PS: all the students have access to the same data and it is only not working for me, so the file should not be corrupted.
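Based on my reading of the error messages, this is roughly what should work once the version mismatch is sorted out (a sketch only; the version = 2 argument is a standard save()/saveRDS() option, not something from the course materials):

# .RData files go through load(), .RDS files through readRDS(); both need
# R >= 3.5.0 to read the version-3 format written by R 3.6.1.
load("classize.RData")                  # for the .RData copy
classize <- readRDS("classize.RDS")     # for the .RDS copy

# Whoever produced the files could also re-save them so that older R
# versions can read them:
# save(classize, file = "classize.RData", version = 2)
# saveRDS(classize, "classize.RDS", version = 2)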

sparklyr spark_read_parquet from s3 error

When I read a parquet file on S3 from a sparklyr context like this:
spark_read_parquet(sc, name = "parquet_test", path = "s3a://<path-to-file>")
it throws the following error:
Caused by: java.io.IOException: Could not read footer for file: FileStatus{path=s3a: .....
I was able to read the parquet file in a SparkR session using the read.parquet() function, so there must be some difference in Spark context configuration between SparkR and sparklyr.
Any suggestions on this issue? Thanks.
In yarn-client mode, the s3a file scheme you are using is not correct. You'll need to use s3://<path-to-file>.
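A rough sketch of what that could look like (the fs.s3a.* keys are standard Hadoop settings and an assumption on my part, not something from the question; the placeholder path is kept from the post):

library(sparklyr)

config <- spark_config()
# Only needed if you stay on the s3a:// scheme outside EMR:
config$spark.hadoop.fs.s3a.access.key <- Sys.getenv("AWS_ACCESS_KEY_ID")
config$spark.hadoop.fs.s3a.secret.key <- Sys.getenv("AWS_SECRET_ACCESS_KEY")

sc <- spark_connect(master = "yarn-client", config = config)

# In yarn-client mode on EMR, prefer the s3:// scheme:
parquet_test <- spark_read_parquet(sc, name = "parquet_test",
                                   path = "s3://<path-to-file>")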

Error when trying to load .RData information

I have been trying to save my R environment to load in later sessions (my environment has about 12 data frames).
save.image(file = 'tests.RData')
If I look in my directory, it looks like the data has been saved, since the .RData file is about 40 MB in size.
Now when I try to load the file in the directory...
load('test.RData')
I get this error:
Error: bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning message:
file 'tests.RData' has magic number ''
Use of save versions prior to 2 is deprecated
I tried looking at previous issues with this, and the only advice was to try load() and source(). However, I believe load() is for .rda/.RData files.
Is there any way to save my environment and load it properly? What is causing my issue exactly?
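A minimal sketch of the save/load round trip; note that the filenames above differ ('tests.RData' is saved but 'test.RData' is loaded), and loading a file that was not actually produced by save() gives exactly the "bad restore file magic number" error:

save.image(file = "tests.RData")

file.exists("tests.RData")   # should be TRUE before loading
load("tests.RData")          # load the same file that was saved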

Error reading an xlsx file from Dropbox using the R package 'repmis'

I need to be able to read a sheet from an xlsx workbook into R for use in a Shiny app. (I know it should be a csv file, but that is unfortunately not my decision...) (Edited to add: the file I need to read is on Dropbox.) I am trying to use the repmis package. The code I have tried is simply:
library('repmis')
library('xlsx')
lnk<-"https://www.dropbox.com/s/pzyt86pguko3xg6/TestBook.xlsx?dl=0"
my_data<-source_XlsxData(lnk, sheet="MainData", startRow=1)
Unfortunately I get the following error message:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream
I have no idea what it means... :|
I think a recent update removed the ability to read data files stored on Dropbox from within R. I will go look for confirmation, and delete my comment otherwise.
Source: Ran into the same issue myself a couple of months ago
UPDATE: Confirmation of dropped support within the github package https://github.com/christophergandrud/repmis/commit/f85469f38c6f4e4a5735ecc888b4263b969d4e22
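A sketch of a workaround that avoids repmis entirely (readxl and the ?dl=1 trick are assumptions on my part, not part of the original code): ask Dropbox for the raw file rather than its preview page, then read a local temporary copy.

library(readxl)

lnk <- "https://www.dropbox.com/s/pzyt86pguko3xg6/TestBook.xlsx?dl=1"
tmp <- tempfile(fileext = ".xlsx")
download.file(lnk, tmp, mode = "wb")   # binary mode matters on Windows

my_data <- read_excel(tmp, sheet = "MainData")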

SparkR and Packages

How does one call packages from Spark to be used for data operations with R?
For example, I am trying to access my test.csv in HDFS as below:
Sys.setenv(SPARK_HOME="/opt/spark14")
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv", "com.databricks.spark.csv", header = "true")
but I am getting the error below:
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I tried loading the csv package with the option below:
Sys.setenv('SPARKR_SUBMIT_ARGS'='--packages com.databricks:spark-csv_2.10:1.0.3')
but I get the error below when loading sqlContext:
Launching java with spark-submit command /opt/spark14/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 /tmp/RtmpuvwOky /backend_port95332e5267b
Error: Cannot load main class from JAR file:/tmp/RtmpuvwOky/backend_port95332e5267b
Any help will be highly appreciated.
So it looks like by setting SPARKR_SUBMIT_ARGS you are overriding the default value, which is sparkr-shell. You could probably do the same thing and just append sparkr-shell to the end of your SPARKR_SUBMIT_ARGS. This seems unnecessarily complex compared to depending on jars, so I've created a JIRA to track the issue (and I'll try to add a fix if the SparkR people agree with me): https://issues.apache.org/jira/browse/SPARK-8506.
Note: another option would be to use the sparkr command with --packages com.databricks:spark-csv_2.10:1.0.3, since that should work.
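A sketch of that suggestion: keep the default sparkr-shell at the end of SPARKR_SUBMIT_ARGS so the backend still launches while the spark-csv package is added (paths, versions, and the HDFS URL are the ones from the question).

Sys.setenv(SPARK_HOME = "/opt/spark14")
Sys.setenv("SPARKR_SUBMIT_ARGS" =
             "--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell")

library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

flights <- read.df(sqlContext,
                   "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv",
                   "com.databricks.spark.csv", header = "true")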
