SparkR + Cassandra query with complex data types

We have SparkR set up to connect to Cassandra, and we can successfully connect to and query Cassandra data. However, many of our Cassandra column families have complex data types like MapType, and we get errors when querying those types. Is there a way to coerce these before or during the query using SparkR? For example, a cqlsh query of the same data renders a row of the MapType column b below as a string like "{38: 262, 97: 21, 98: 470}".
Sys.setenv(SPARK_HOME = "/opt/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
mySparkPackages <- "datastax:spark-cassandra-connector:1.6.0-s_2.10"
mySparkEnvironment <- list(
  spark.local.dir = "...",
  spark.eventLog.dir = "...",
  spark.cassandra.connection.host = "...",
  spark.cassandra.auth.username = "...",
  spark.cassandra.auth.password = "...")
sc <- sparkR.init(master = "...", sparkEnvir = mySparkEnvironment, sparkPackages = mySparkPackages)
sqlContext <- sparkRSQL.init(sc)
spark.df <- read.df(sqlContext,
                    source = "org.apache.spark.sql.cassandra",
                    keyspace = "...",
                    table = "...")
spark.df.sub <- subset(spark.df, (...), select = c(1, 2))
schema(spark.df.sub)
StructType
|-name = "a", type = "IntegerType", nullable = TRUE
|-name = "b", type = "MapType(IntegerType,IntegerType,true)", nullable = TRUE
r.df.sub <- collect(spark.df.sub, stringsAsFactors = FALSE)
Here we get this error from the collect():
16/07/13 12:13:50 INFO TaskSetManager: Finished task 1756.0 in stage 0.0 (TID 1756) in 1525 ms on ip-10-225-70-184.ec2.internal (1757/1758)
16/07/13 12:13:50 INFO TaskSetManager: Finished task 1755.0 in stage 0.0 (TID 1755) in 1661 ms on ip-10-225-70-184.ec2.internal (1758/1758)
16/07/13 12:13:50 INFO DAGScheduler: ResultStage 0 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 2587.670 s
16/07/13 12:13:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/13 12:13:50 INFO DAGScheduler: Job 0 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 2588.088830 s
16/07/13 12:13:51 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in readBin(con, raw(), stringLen, endian = "big") :
invalid 'n' argument
Our stack:
Ubuntu 14.04.4 LTS Trusty Tahr
Cassandra v 2.1.14
Scala 2.10.6
Spark 1.6.2 with Hadoop libs 2.6
Spark-Cassandra connector 1.6.0 for Scala 2.10
DataStax Cassandra Java driver v3.0 (v3.0.1 actually)
Microsoft R Open aka Revo R version 3.2.5 with MKL
Rstudio server 0.99.902
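One workaround that may be worth trying (a sketch, untested on this exact stack): cast the MapType column to a string on the Spark side before collect(), so the R-side deserializer only ever sees primitive column types. This assumes Spark 1.6's CAST accepts complex types (its cast rules allow casting any type to string); note the rendered format will be Spark's own, not cqlsh's "{38: 262, ...}" style.
# Sketch: coerce the map column to a string before collecting,
# so collect() only has to deserialize primitive columns.
spark.df.str <- selectExpr(spark.df.sub, "a", "CAST(b AS STRING) AS b")
r.df.sub <- collect(spark.df.str, stringsAsFactors = FALSE)
The same coercion can be written in SQL after registerTempTable(spark.df.sub, "tmp") if selectExpr is inconvenient.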

Related

Read Kudu from SparkR

I am unable to find out how to connect to Kudu from SparkR. If I try the following in Scala:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
// Read kudu table and select data of August 2018
val df = spark.sqlContext.read.options(Map("kudu.master" -> "198.y.x.xyz:7051","kudu.table" -> "table_name")).kudu
df.createOrReplaceTempView("mytable")
it works perfectly. In SparkR I have been trying the following:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sc = sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"), sparkPackages = "org.apache.kudu:kudu-spark2_2.11:1.8.0")
sqlContext <- sparkRSQL.init(sc)
df = read.jdbc(url="198.y.x.xyz:7051",
driver = "jdbc:kudu:sparkdb",
source="jdbc",
tableName = "table_name"
)
I get the following error:
Error in jdbc : java.lang.ClassNotFoundException: jdbc:kudu:sparkdb
Trying the following:
df = read.jdbc(url="jdbc:mysql://198.19.10.103:7051",
tableName = "banglalink_data_table_1"
)
gives:
Error: Error in jdbc : java.sql.SQLException: No suitable driver
I cannot find any help on how to load the correct driver. I think that using the sparkPackages option is correct, as it gives no error. What am I doing wrong?
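A possible direction (a sketch, not verified here): Kudu is not a JDBC source, so read.jdbc is likely the wrong entry point regardless of the driver string. The kudu-spark package registers a Spark SQL data source, and SparkR can reach it through read.df with the same options the Scala snippet passes:
# Sketch: read via the Kudu data source instead of JDBC; the master
# address and table name are the placeholders from the question.
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "2g"),
               sparkPackages = "org.apache.kudu:kudu-spark2_2.11:1.8.0")
df <- read.df(source = "org.apache.kudu.spark.kudu",
              kudu.master = "198.y.x.xyz:7051",
              kudu.table = "table_name")
createOrReplaceTempView(df, "mytable")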

cronR cron job working on Ubuntu/local system but not on shinyapps.io

The cron job is not working after deployment to shinyapps.io.
We are trying to schedule some jobs in our Shiny app through cronR. It works on the local system, but as soon as we deploy it to shinyapps.io the server shows the error below.
An error has occurred
The application failed to start (exited with code 1).
Attaching package: ‘DT’
The following objects are masked from ‘package:shiny’:
dataTableOutput, renderDataTable
Adding cronjob:
---------------
# cronR job
# id: temp_data_fetch
# tags: lab, xyz
# desc: temp Data Loading
0-59 * * * * /opt/R/3.5.3/lib/R/bin/Rscript '/srv/connect/apps/Temp/ETL.R' >> '/srv/connect/apps/Temp/ETL.log' 2>&1
Error in value[[3L]](cond) : error in running command
Calls: local ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
#Cron Script
if (!file.exists("/srv/connect/apps/Temp/scripts_scheduled.rds")) {
  cmd <- cronR::cron_rscript(rscript = '/srv/connect/apps/Temp/ETL.R')
  cronR::cron_add(cmd, frequency = 'minutely', id = 'temp_data_fetch',
                  description = 'temp Data Loading', tags = c('lab', 'xyz'))
  TEXT <- "temp_data_fetch"
  saveRDS(TEXT, "/srv/connect/apps/Temp/scripts_scheduled.rds")
}
#ETL.R
trigger_time <- readRDS(file = "/srv/connect/apps/Temp/trigger_time.rds")
trigger_time <- c(trigger_time,paste(Sys.time()))
saveRDS(trigger_time,file = "/srv/connect/apps/Temp/trigger_time.rds")
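A likely cause, stated as an assumption rather than a confirmed fact: shinyapps.io runs apps in ephemeral containers that do not expose a crontab, so cronR::cron_add has no cron daemon to talk to and fails with "error in running command". One alternative that stays inside the platform is to drive the ETL from the app itself with Shiny's invalidateLater() timer; a minimal sketch:
# Sketch: schedule ETL.R from within the running app instead of via cron.
# Caveat: this only fires while an app instance is awake and serving.
library(shiny)
ui <- fluidPage()
server <- function(input, output, session) {
  observe({
    invalidateLater(60 * 1000, session)  # re-arm the timer every minute
    source("/srv/connect/apps/Temp/ETL.R", local = TRUE)
  })
}
shinyApp(ui, server)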

Can't get data with dbplyr from shiny-server

I'm trying to get data from an AWS SQL Server.
This code works fine from my local PC, but it doesn't work from shiny-server (Ubuntu).
library(dbplyr)
library(dplyr)
library(DBI)
con <- dbConnect(odbc::odbc(),
                 driver = "FreeTDS",
                 server = "aws server",
                 database = "",
                 uid = "",
                 pwd = "")
tbl(con, "shops")
dbGetQuery(con,"SELECT *
FROM shops")
"R version 3.4.2 (2017-09-28)"
packageVersion("dbplyr")
[1] ‘1.2.1.9000’
packageVersion("dplyr")
[1] ‘0.7.4’
packageVersion("DBI")
[1] ‘0.7.15’
I get the following error:
tbl(con, "shops")
Error: <SQL> 'SELECT *
FROM "shops" AS "zzz2"
WHERE (0 = 1)'
nanodbc/nanodbc.cpp:1587: 42000: [FreeTDS][SQL Server]Incorrect syntax near 'shops'.
But dbGetQuery(con,"SELECT * FROM shops") works fine.
Can you explain what's going wrong?
This is most likely because the FreeTDS driver does not return the class that dbplyr expects to see in order to use the MS SQL translation. The workaround is to take the result of class(con) and then add the following lines right after you connect, but before calling tbl(). Replace [your class name] with the result of the class(con) call:
sql_translate_env.[your class name] <- dbplyr:::`sql_translate_env.Microsoft SQL Server`
sql_select.[your class name] <- dbplyr:::`sql_select.Microsoft SQL Server`
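For example, if class(con) came back as "FreeTDS" (a hypothetical value; use whatever your session reports), the full sequence would look like this:
# Hypothetical example: substitute your actual class(con) result for "FreeTDS".
library(DBI); library(dplyr); library(dbplyr)
con <- dbConnect(odbc::odbc(), driver = "FreeTDS",
                 server = "aws server", database = "", uid = "", pwd = "")
class(con)                 # suppose this prints "FreeTDS"
sql_translate_env.FreeTDS <- dbplyr:::`sql_translate_env.Microsoft SQL Server`
sql_select.FreeTDS <- dbplyr:::`sql_select.Microsoft SQL Server`
tbl(con, "shops")          # now uses the MS SQL translation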

SparkR: Cannot read data at deployed workers, but ok with local machine

I'm new to Spark and SparkR. For Hadoop, I only have a file called winutils/bin/winutils.exe.
Running system:
OS: Windows 10
Java: java version "1.8.0_101"; Java(TM) SE Runtime Environment (build 1.8.0_101-b13); Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
R: platform x86_64-w64-mingw32 (arch x86_64, os mingw32)
RStudio: Version 1.0.20 – © 2009-2016 RStudio, Inc.
Spark: 2.0.0
I can read data on my local machine, but not on the deployed workers.
Could anybody help me?
Run locally:
Sys.setenv(SPARK_HOME = "D:/SPARK2")
library(SparkR)
sparkR.session(master = "local[*]", enableHiveSupport = FALSE,sparkConfig = list(spark.driver.memory="4g",spark.sql.warehouse.dir = "d:/winutils/bin",sparkPackages = "com.databricks:spark-avro_2.11:3.0.1"))
Java ref type org.apache.spark.sql.SparkSession id 1
multiPeople <- read.json(c(paste(getwd(), "people.json", sep = "/"), "D:/RwizSpark_Private/people2.json"))
rand_10m_x <- read.df(x = "./demo.csv", source = "csv", inferSchema = "true", na.strings = "NA")
Run on the deployed workers:
Sys.setenv(SPARK_HOME = "D:/SPARK2")
library(SparkR)
sparkR.session(master = "spark:/mymasterIP", enableHiveSupport = FALSE,appName = "sparkRenzhi", sparkConfig = list(spark.driver.memory="6g",spark.sql.warehouse.dir = "d:/winutils/bin",spark.executor.memory = "2g", spark.executor.cores= "2"),sparkPackages = "com.databricks:spark-avro_2.11:3.0.1")
Java ref type org.apache.spark.sql.SparkSession id 1
multiPeople <- read.json(c(paste(getwd(), "people.json", sep = "/"), "D:/RwizSpark_Private/people2.json"))
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.29.110.101): java.io.FileNotFoundException: File file:/D:/RwizSpark_Private/people2.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.&lt;init&gt;(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.&lt;init&gt;(LineRecordReader.java:109)
at org.apache.hadoop.mapre
rand_10m_x <- read.df(x = "./demo.csv", source = "csv", inferSchema = "true", na.strings = "NA")
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 11, 172.29.110.101): java.io.FileNotFoundException: File file:/D:/RwizSpark_Private/demo.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.&lt;init&gt;(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.&lt;init&gt;(LineRecordReader.java:109)
at org.apache.hadoop.mapred.T
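The likely explanation, inferred from the stack trace rather than from the cluster itself: read.json and read.df with a plain local path make each executor open that file on its own filesystem, and the workers have no D:/RwizSpark_Private directory. The usual fix is to point the reads at storage every node can reach; a sketch with a hypothetical HDFS URI:
# Sketch: use paths visible to all workers, not driver-local D:/ paths.
# hdfs://mymasterIP:9000/... is a placeholder; substitute your cluster's URI.
multiPeople <- read.json(c("hdfs://mymasterIP:9000/data/people.json",
                           "hdfs://mymasterIP:9000/data/people2.json"))
rand_10m_x <- read.df("hdfs://mymasterIP:9000/data/demo.csv",
                      source = "csv", inferSchema = "true", na.strings = "NA")
# Alternatively, copy the files to the identical path on every worker node.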

R package (qdapTools) version not getting detected correctly in Azure ML

I'm trying to install the qdap package in Azure ML. The rest of the dependent packages install without any issues. When it comes to qdapTools, I get this error, even though the version I am trying to install is 1.3.1 (verified from the DESCRIPTION file that comes with the R package):
package 'qdapTools' 1.1.0 was found, but >= 1.3.1 is required by 'qdap'
The code in "Execute R Script":
install.packages("src/qdapTools.zip", repos = NULL, verbose = TRUE)
install.packages("src/magrittr.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/stringi.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/stringr.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/qdapDictionaries.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/qdapRegex.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/RColorBrewer.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/qdap.zip", lib = ".", repos = NULL, verbose = TRUE)
library(stringr, lib.loc=".", verbose=TRUE)
library(qdap, lib.loc=".", verbose=TRUE)
And the log:
[ModuleOutput] End R Execution: 9/22/2016 6:44:44 AM
[Stop] DllModuleMethod::Execute. Duration = 00:00:16.7828106
[Critical] Error: Error 0063: The following error occurred during evaluation of R script:
---------- Start of error message from R ----------
package 'qdapTools' 1.1.0 was found, but >= 1.3.1 is required by 'qdap'
package 'qdapTools' 1.1.0 was found, but >= 1.3.1 is required by 'qdap'
----------- End of error message from R -----------
[Critical] {"InputParameters":{"DataTable":[{"Rows":2,"Columns":1,"estimatedSize":11767808,"ColumnTypes":{"System.String":1},"IsComplete":true,"Statistics":{"0":[2,0]}}],"Generic":{"bundlePath":"..\\..\\Script Bundle\\Script Bundle.zip","rLibVersion":"R310"},"Unknown":["Key: rStreamReader, ValueType : System.IO.StreamReader"]},"OutputParameters":[],"ModuleType":"LanguageWorker","ModuleVersion":" Version=6.0.0.0","AdditionalModuleInfo":"LanguageWorker, Version=6.0.0.0, Culture=neutral, PublicKeyToken=69c3241e6f0468ca;Microsoft.MetaAnalytics.LanguageWorker.LanguageWorkerClientRS;RunRSNR","Errors":"Microsoft.Analytics.Exceptions.ErrorMapping+ModuleException: Error 0063: The following error occurred during evaluation of R script:\r\n---------- Start of error message from R ----------\r\npackage 'qdapTools' 1.1.0 was found, but >= 1.3.1 is required by 'qdap'\r\n\r\n\r\npackage 'qdapTools' 1.1.0 was found, but >= 1.3.1 is required by 'qdap'\r\n----------- End of error message from R -----------\r\n at Microsoft.MetaAnalytics.LanguageWorker.LanguageWorkerClientRS.ExecuteR(NewRWorker worker, DataTable dataset1, DataTable dataset2, IEnumerable`1 bundlePath, StreamReader rStreamReader, Nullable`1 seed) in d:\\_Bld\\8831\\7669\\Sources\\Product\\Source\\Modules\\LanguageWorker\\LanguageWorker.Dll\\EntryPoints\\RModule.cs:line 287\r\n at Microsoft.MetaAnalytics.LanguageWorker.LanguageWorkerClientRS._RunImpl(NewRWorker worker, DataTable dataset1, DataTable dataset2, String bundlePath, StreamReader rStreamReader, Nullable`1 seed, ExecuteRScriptExternalResource source, String url, ExecuteRScriptGitHubRepositoryType githubRepoType, SecureString accountToken) in d:\\_Bld\\8831\\7669\\Sources\\Product\\Source\\Modules\\LanguageWorker\\LanguageWorker.Dll\\EntryPoints\\RModule.cs:line 207\r\n at Microsoft.MetaAnalytics.LanguageWorker.LanguageWorkerClientRS.RunRSNR(DataTable dataset1, DataTable dataset2, String bundlePath, StreamReader rStreamReader, Nullable`1 seed, ExecuteRScriptRVersion rLibVersion) in d:\\_Bld\\8831\\7669\\Sources\\Product\\Source\\Modules\\LanguageWorker\\LanguageWorker.Dll\\EntryPoints\\REntryPoint.cs:line 105","Warnings":[],"Duration":"00:00:16.7752607"}
Module finished after a runtime of 00:00:17.1411124 with exit code -2
Module failed due to negative exit code of -2
Record Ends at UTC 09/22/2016 06:44:44.
Editing the code to:
install.packages("src/qdapTools.zip",lib="." , repos = NULL, verbose = TRUE)
install.packages("src/qdapDictionaries.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/qdapRegex.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/RColorBrewer.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/qdap.zip", lib = ".", repos = NULL, verbose = TRUE)
library(qdapTools, lib.loc=".", verbose=TRUE)
library(qdap, lib.loc=".", verbose=TRUE)
throws the following error:
[ModuleOutput] 4: package 'qdapTools' was built under R version 3.3.1
[ModuleOutput]
[ModuleOutput] End R Execution: 9/22/2016 7:11:05 AM
[Stop] DllModuleMethod::Execute. Duration = 00:00:17.0656414
[Critical] Error: Error 0063: The following error occurred during evaluation of R script:
---------- Start of error message from R ----------
package or namespace load failed for 'qdapTools'
package or namespace load failed for 'qdapTools'
----------- End of error message from R -----------
Not sure how to proceed; can someone help, please?
Thanks!
This is kind of a shot in the dark, since I don't know the specifics of your system, but it could be that qdapTools 1.3.1 does not get installed to the same location as the other packages, because the location specification
lib = "."
is missing from the first line of the "Execute R Script" part, where qdapTools gets installed. That could result in R loading an older version of qdapTools that lies somewhere else (did you have an older version installed before?).
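A quick way to test this theory from inside "Execute R Script" (a sketch using base utilities only):
# Sketch: confirm which qdapTools installation R resolves, and its version.
find.package("qdapTools", lib.loc = c(".", .libPaths()))
packageVersion("qdapTools", lib.loc = ".")
# And the proposed fix: install qdapTools into the same library as the rest.
install.packages("src/qdapTools.zip", lib = ".", repos = NULL, verbose = TRUE)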
