I am new here, so apologies if this is a naive question!
I am using SparkR in RStudio.
R version 3.3.2
Spark version 2.0.2
I am able to launch Spark from RStudio successfully, and the web UI at localhost:4040 shows that Spark is up and running.
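For reference, this is roughly how I start the session and create the data frame (SPARK_HOME points at my local Spark 2.0.2 install, and the built-in faithful dataset is just a stand-in for my data):
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]")
df <- as.DataFrame(faithful)   # any attempt to create a SparkDataFrame fails
head(df)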
As soon as I try to create the data frame, I get an error like this:
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:474)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:488)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:480)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7
Can anybody help me with this? Thanks in advance :)
Thank you, guys. I was missing one file (the Hadoop winutils binary for Windows), which can be downloaded from GitHub; here is the link: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin
One of my friends had the same problem; just add this file and it should work fine.
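In case it helps anyone else, this is roughly how I point Spark at the downloaded binary before starting the session (C:/hadoop is just where I chose to put winutils.exe):
# winutils.exe lives under C:/hadoop/bin (an arbitrary location of my choosing)
Sys.setenv(HADOOP_HOME = "C:/hadoop")
Sys.setenv(PATH = paste(Sys.getenv("PATH"), "C:/hadoop/bin", sep = ";"))
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local[*]")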
Related
I'm trying to analyze BigQuery data in RStudio Server running on a Google Dataproc cluster. However, because of RStudio's memory limitations, I intend to run queries on this data through sparklyr, but I haven't had any success importing the data directly from BigQuery into the Spark cluster.
I'm using google's official JDBC connectivity driver:
ODBC and JDBC drivers for BigQuery
I also have the following software versions running:
Google Dataproc: 2.0-Debian 10
Sparklyr: Spark 3.2.1 Hadoop 3.2
R version 4.2.1
I also had to replace the following Spark jars with the versions used by the JDBC driver above, or add them where they were missing:
failureaccess-1.0.1 was added
protobuf-java-3.19.4 replaced 2.5.0
guava 31.1-jre replaced 14.0.1
Below is my code, using the spark_read_jdbc function to retrieve a dataset from BigQuery:
conStr <- "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=xxxx;OAuthType=3;AllowLargeResults=1;"
spark_read_jdbc(sc = spkc,
                name = "events_220210",
                memory = FALSE,
                options = list(url = conStr,
                               driver = "com.simba.googlebigquery.jdbc.Driver",
                               user = "rstudio",
                               password = "xxxxxx",
                               dbtable = "dataset.table"))
The table gets imported into the Spark cluster, but when I try to preview it I receive the following error message:
ERROR sparklyr: Gateway (551) failed calling collect on sparklyr.Utils: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (faucluster1-w-0.europe-west2-c.c.ga4-warehouse-342410.internal executor 2): java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long.
at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.simba.googlebigquery.utilities.conversion.TypeConverter.toLong(Unknown Source)
at com.simba.googlebigquery.jdbc.common.SForwardResultSet.getLong(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9(JdbcUtils.scala:446)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9$adapted(JdbcUtils.scala:445)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:367)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:349)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
When I try to import the data via an SQL query, e.g.
SELECT date, name, age FROM dataset.tablename
I end up with a table looking like this:
date   name   age
-----  -----  -----
date   name   age
date   name   age
date   name   age
I've read in several posts that the solution to this is to register a custom JDBC dialect, but I have no idea how to do this, what platform to do it on, or whether it's possible from within RStudio. Links to any materials that would help me solve this problem would be appreciated.
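From what I've read, the registration itself might look roughly like the sketch below, but it assumes a custom dialect class (the name com.example.BigQueryDialect here is made up, overriding quoteIdentifier to use backticks) has already been compiled into a jar and put on the classpath, which is exactly the part I don't know how to do:
# hypothetical sketch: com.example.BigQueryDialect is a placeholder for a
# compiled JdbcDialect subclass added to the classpath (e.g. through the
# sparklyr.jars.default config option) before spark_connect()
dialect <- sparklyr::invoke_new(spkc, "com.example.BigQueryDialect")
sparklyr::invoke_static(spkc, "org.apache.spark.sql.jdbc.JdbcDialects",
                        "registerDialect", dialect)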
I encounter this error when running Seurat on R.
Error in makeClusterPSOCK(workers, ...) :
Cluster setup failed. 4 of 4 workers failed to connect.
This never happened before I installed R 4.1.
I have tried the following, to no avail:
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
cl <- parallel::makeCluster(2, setup_strategy = "sequential")
Any suggestions (and maybe a little explanation, because I am still relatively new to R)? My computer overheats, and I believe the commands below are not working:
options(future.globals.maxSize = 8000 * 1024^2)
plan("multiprocess", workers = 4)
R 4.1 with RStudio has all sorts of issues with parallel right now. I experienced similar issues with the CB2 package on R 4.1, which also uses parallel for multicore support. This is probably related to an as-yet-unpatched bug in R 4.1 (mentioned here and here), though there is now a specific fix in R-devel 80472. If your issues are unresolved with the advice from those threads, I suggest rolling back to a previous R version that doesn't present the issue.
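If you need to stay on R 4.1 for now, the workaround that circulated in those threads is to force the sequential PSOCK setup strategy whenever you're inside the RStudio IDE, for example from your .Rprofile (a sketch; adapt the condition to your setup):
# force sequential cluster setup only when running under the RStudio IDE on R >= 4.1
if (Sys.getenv("RSTUDIO") == "1" && getRversion() >= "4.1.0") {
  parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
}
Separately, plan("multiprocess") is deprecated in recent versions of the future package; plan("multisession", workers = 4) is the usual replacement.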
Some context on my environment:
I am running RStudio in a Docker container called rocker/verse.
I downloaded this dataset from Kaggle, which is about 470 MB.
When working with it, RStudio restarts at some point. It doesn't happen after a specific call, and I've seen the same problem when working on other projects. Although it is not related to my code, I am posting it below.
library(data.table)
fraud <- fread("path.csv")
# 70/30 split of row indices into training and test sets
fraud1 <- sort(sample(nrow(fraud), nrow(fraud) * 0.7))
train <- fraud[fraud1, ]
test <- fraud[-fraud1, ]
Usually, this message is printed on the console:
Error: Error occurred during transmission
An error pop-up is also shown.
I have no idea what is causing it. I would appreciate any help.
Delete the .Rhistory files associated with the installation and any open project.
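For example, from an R console (the project path is a placeholder):
# remove saved history for the home directory and for an open project
unlink("~/.Rhistory")
unlink("/path/to/your/project/.Rhistory")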
You have a problem with your user data files for RStudio. Follow the hints given here: https://community.rstudio.com/t/rstudio-server-error-occurred-during-transmission/84258 and here: https://support.rstudio.com/hc/en-us/articles/218730228-Resetting-a-user-s-state-on-RStudio-Server.
I am new to R; kindly help me with the error below.
I am calling R code from a batch file (e.g., c:\batchfile\x.bat) on a machine with dynamic memory, i.e., memory and cores increase based on load.
With this approach everything executes without error. The R code uses the RODBCext, koRpus, akmeans, lsa, stringr, topicmodels, RWeka, lda, snowfall, tm, openNLP, reshape2, plyr, and RODBC packages.
But when I call x.bat on the remote server from PowerShell (e.g., Invoke-Command -Computername 'Servername' {Start-Process 'C:\batchfile\x.bat' -wait}), I get the errors below:
LoadLibrary failure: The paging file is too small for this operation to complete.
just-in-time debugging errors
This application has requested the Runtime to terminate it in an unusual way.
Thanks in advance
I'm attempting to run a parallel job in R using snow. I've been able to run extremely similar jobs with no trouble on older versions of R and snow. R package dependencies prevent me from reverting.
What happens: My jobs terminate at the parRapply step, i.e., the first time the nodes have to do anything beyond reporting Sys.info(). The error message reads:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: cannot open the connection
Calls: parRapply ... clusterApply -> staticClusterApply -> checkForRemoteErrors
Specs: R 2.14.0, snow 0.3-8, RedHat Enterprise Linux Client release 5.6. The snow package has been built on the correct version of R.
Details:
The following code appears to execute fine:
library(snow)
cl <- makeCluster(3)
clusterEvalQ(cl, library(deSolve, lib = "~/R/library"))
clusterCall(cl, function() Sys.info()[c("nodename", "machine")])
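The failure then comes at the parRapply call, which has roughly this shape (param_matrix and run_model are placeholders for my actual objects):
# apply a function over the rows of a parameter matrix in parallel
results <- parRapply(cl, param_matrix, function(row) run_model(row))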
I'm an end-user, not a system admin, but I'm desperate for suggestions and insights into what could be going wrong.
This cryptic error appeared because an input file that's requested during program execution wasn't actually present. Each node would attempt to load this file and then fail, but this would result only in a "cannot open the connection" message.
What this means is that almost anything can cause a "connection" error. Incredibly annoying!
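In hindsight, a quick sanity check is to ask each worker whether it can actually see the file before running the real job, for example (the filename is just a placeholder):
# returns TRUE/FALSE from each node; any FALSE explains the "connection" error
clusterCall(cl, function() file.exists("input_data.csv"))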