SparkR doubt and Broken pipe exception - R

Hi, I'm working with SparkR in distributed mode on a YARN cluster.
I have two questions:
1) If I write a script that contains both plain R code and SparkR code, will it distribute only the SparkR code, or the plain R code too?
This is the script. I read a CSV and take just the first 100k records.
I clean it (with R functions), deleting NA values, and create a SparkR DataFrame.
This is what it does: for each LINESET, take each TimeInterval where that LINESET appears, sum some numeric attributes, and put them all into a matrix.
This is the script with R and SparkR code. It takes 7 hours in standalone mode and about 60 hours in distributed mode (killed by java.net.SocketException: Broken pipe).
LineSmsInt<-fread("/home/sentiment/Scrivania/LineSmsInt.csv")
Short<-LineSmsInt[1:100000,]
Short[is.na(Short)] <- 0
Short$TimeInterval<-Short$TimeInterval/1000
ShortDF<-createDataFrame(sqlContext,Short)
UniqueLineSet<-unique(Short$LINESET)
UniqueTime<-unique(Short$TimeInterval)
UniqueTime<-as.numeric(UniqueTime)
Row<-length(UniqueLineSet)*length(UniqueTime)
IntTemp<-matrix(nrow =Row,ncol=7)
k<-1
colnames(IntTemp)<-c("LINESET","TimeInterval","SmsIN","SmsOut","CallIn","CallOut","Internet")
Sys.time()
for (i in 1:length(UniqueLineSet)) {
  SubSetID <- filter(ShortDF, ShortDF$LINESET == UniqueLineSet[i])
  for (j in 1:length(UniqueTime)) {
    SubTime <- filter(SubSetID, SubSetID$TimeInterval == UniqueTime[j])
    IntTemp[k,1] <- UniqueLineSet[i]
    IntTemp[k,2] <- as.numeric(UniqueTime[j])
    k3 <- collect(select(SubTime, sum(SubTime$SmsIn)))
    IntTemp[k,3] <- k3[1,1]
    k4 <- collect(select(SubTime, sum(SubTime$SmsOut)))
    IntTemp[k,4] <- k4[1,1]
    k5 <- collect(select(SubTime, sum(SubTime$CallIn)))
    IntTemp[k,5] <- k5[1,1]
    k6 <- collect(select(SubTime, sum(SubTime$CallOut)))
    IntTemp[k,6] <- k6[1,1]
    k7 <- collect(select(SubTime, sum(SubTime$Internet)))
    IntTemp[k,7] <- k7[1,1]
    k <- k + 1
  }
  print(UniqueLineSet[i])
  print(i)
}
This is the plain R version of the script; the only thing that changes is the subset function and, of course, it uses a normal R data.frame rather than a SparkR DataFrame.
It takes 1.30 minutes in standalone mode.
Why is it so fast in plain R and so slow in SparkR?
for (i in 1:length(UniqueLineSet)) {
  SubSetID <- subset.data.frame(LineSmsInt, LINESET == UniqueLineSet[i])
  for (j in 1:length(UniqueTime)) {
    SubTime <- subset.data.frame(SubSetID, TimeInterval == UniqueTime[j])
    IntTemp[k,1] <- UniqueLineSet[i]
    IntTemp[k,2] <- as.numeric(UniqueTime[j])
    IntTemp[k,3] <- sum(SubTime$SmsIn, na.rm = TRUE)
    IntTemp[k,4] <- sum(SubTime$SmsOut, na.rm = TRUE)
    IntTemp[k,5] <- sum(SubTime$CallIn, na.rm = TRUE)
    IntTemp[k,6] <- sum(SubTime$CallOut, na.rm = TRUE)
    IntTemp[k,7] <- sum(SubTime$Internet, na.rm = TRUE)
    k <- k + 1
  }
  print(UniqueLineSet[i])
  print(i)
}
2) The first script, in distributed mode, was killed by:
java.net.SocketException: Broken pipe
and sometimes this appears too:
java.net.SocketTimeoutException: Accept timed out
Could it come from a bad configuration? Any suggestions?
Thanks.

Don't take this the wrong way, but it is not a particularly well written piece of code. It is already inefficient using core R, and adding SparkR to the equation makes it even worse.
If I made a script that contains R line code and SparkR line code, it will distribute just the SparkR code or simple R too?
Unless you're using distributed data structures and functions which operate on those structures, it is just plain R code executed in a single thread on the master.
Why it's so fast just in R and it's so slowly in SparkR?
For starters, you execute a single job for each combination of LINESET, UniqueTime and column. Each time, Spark has to scan all records and fetch the data to the driver.
Moreover, using Spark to handle data that can easily be handled in the memory of a single machine simply doesn't make sense. The cost of running a job in a case like this is usually much higher than the cost of the actual processing.
suggestion?
If you really want to use SparkR, just use groupBy and agg on the SparkR DataFrame:
group_by(ShortDF, ShortDF$LINESET, ShortDF$TimeInterval) %>% agg(
  sum(ShortDF$SmsIn), sum(ShortDF$SmsOut), sum(ShortDF$CallIn),
  sum(ShortDF$CallOut), sum(ShortDF$Internet))
If you care about missing (LINESET, TimeInterval) pairs, fill these in using either join or unionAll.
In practice I would simply stick with data.table and aggregate locally:
Short[, lapply(.SD, sum, na.rm=TRUE), by=.(LINESET, TimeInterval)]
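If you also need the missing (LINESET, TimeInterval) pairs on the data.table side, a minimal sketch (untested, and assuming the column names used above) is to aggregate first and then merge onto the full cross join of the observed values:

library(data.table)

# Aggregate locally, as above
agg <- Short[, lapply(.SD, sum, na.rm = TRUE), by = .(LINESET, TimeInterval)]

# CJ() builds the full grid of observed LINESET / TimeInterval values;
# the left merge leaves NA where a pair never occurred (fill with 0 if preferred)
grid <- CJ(LINESET = unique(Short$LINESET),
           TimeInterval = unique(Short$TimeInterval))
full <- merge(grid, agg, by = c("LINESET", "TimeInterval"), all.x = TRUE)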

Related

Spark DataFrame does not persist when running all commands in a databricks notebook

In Databricks I am using Spark (in R) to import data into a Spark DataFrame
sc <- spark_connect(method="databricks")
tbl_change_db(sc, "database_name")
some_var <- spark_read_table(sc,"table_name")
In a subsequent command I reference the DataFrame using sparklyr operators
refined_data_set <- sdf_read_column(some_var, 'column_name') %>% unique()
When running each command one by one everything works as expected. However, if I try to run the whole notebook, the refined_data_set command fails with:
error in evaluating the argument 'x' in selecting a method for function 'unique': org.apache.spark.sql.AnalysisException: Table or view not found: table_name
If I run the refined_data_set cell on its own immediately after the error, it works. If I include the DataFrame declaration and the refining step in one cell and run the whole notebook, it also works. Why does the DataFrame not persist across all cells in the notebook when running them as one?
EDIT:
The problem seems to stem from the use of spark_read_table a second time in a command between these two. The command references a completely different database:
scp <- spark_connect(method="databricks")
tbl_change_db(scp, other_database_name)
other_var <- spark_read_table(scp,"other_table_name")
Removing this command fixes the issue, but why is it causing the issue? How can we use two DataFrames in the same notebook?
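One possible explanation (an assumption, not confirmed here): both spark_connect(method="databricks") calls attach to the same underlying Spark session, so the second tbl_change_db() changes that session's current database, and a lazily evaluated reference to "table_name" later resolves against the wrong database. A sketch of one way around this is to reference the table by a fully qualified name so it no longer depends on the current database:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# dbplyr::in_schema() qualifies the table as database_name.table_name, so later
# lazy evaluation does not depend on whichever database was selected last
some_var <- tbl(sc, dbplyr::in_schema("database_name", "table_name"))
refined_data_set <- sdf_read_column(some_var, "column_name") %>% unique()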

R tryCatch RODBC function issue

We have a number of MS Access databases on a server which are copies from remote locations which are updated overnight. We collate some of the data from these machines for reporting purposes on a daily basis. Sometimes the overnight update fails, meaning we don’t have access to all of the databases, so I am attempting to write an R script which will test if we can connect (using a list of the database paths), and output an updated version of the list including only those which we can connect to. This will then be used to run a further script which will only update the data related to the available databases.
This is what I have so far (I am new to R but reasonably proficient in SAS and SQL – attempting to use R both as a learning exercise and for potential cost savings);
{
  # Create Store data locations listing
  A <- matrix(c(1000, 1, "One",   "//Server/Comms1/Access.mdb",
                2000, 2, "Two",   "//Server/Comms2/Access.mdb",
                3000, 3, "Three", "//Server/Comms3/Access.mdb"),
              nrow = 3, ncol = 4, byrow = TRUE)
  # Add column names
  colnames(A) <- c("Ref1", "Ref2", "Ref3", "Location")

  # Create summary for testing connections (Ref1 and Location)
  B <- A[, c(1, 4)]

  ConnectionTest <- function(Ref1, Location)
  {
    out <- tryCatch({
      ch <- odbcDriverConnect(paste("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", Location))
      sqlQuery(ch, paste("select ", Ref1, " as Ref1,COUNT(variable) as Count from table"))
    },
    error = matrix(c(Ref1, 0), nrow = 1, ncol = 2, byrow = TRUE)
    )
    return(out)
  }

  # Run function, using 'B' to provide arguments
  C <- apply(B, 1, function(x) do.call(ConnectionTest, as.list(x)))

  # Convert to matrix and add column names
  D <- matrix(unlist(C), ncol = 2, byrow = T)
  colnames(D) <- c("Ref1", "Count")
}
When I run the script I get the following error message:
Error in value[3L] : attempt to apply non-function
I am guessing this is because I am using tryCatch incorrectly inside the UDF?
Does anyone have any advice on what I am doing incorrectly, or even if this is the best way to do what I am attempting?
Thanks
(apologies if this is formatted incorrectly, having to post on my phone due to Stackoverflow posting being blocked)
Edit - I think I fixed the 'Error in value[3L]' issue by adding function(e) {} around the matrix function in the error part of the tryCatch.
The issue now is that the script just fails if it can't reach one of the databases, rather than doing the matrix function. Do I need to add something else to make it ignore the error?
Edit 2 - it seems tryCatch does now work: it processes the alternate function upon error, but also shows warnings about the error, which makes sense.
As mentioned in the edit above, using 'function(e) {}' to wrap the Matrix function in the error section of the tryCatch fixed the 'Error in value[3L]' issue, so the script now works, but displays error messages if it can't access a particular channel. I am guessing the 'warning' section of the tryCatch can be used to adjust these as necessary.
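For reference, a sketch of the corrected handlers described in the edits above (the error and warning arguments of tryCatch must be functions taking the condition object; the query and table names are the placeholders from the original code):

ConnectionTest <- function(Ref1, Location) {
  out <- tryCatch({
    ch <- odbcDriverConnect(paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", Location))
    res <- sqlQuery(ch, paste0("select ", Ref1, " as Ref1, COUNT(variable) as Count from table"))
    odbcClose(ch)
    res
  },
  # Return the fallback matrix if the connection or query fails;
  # odbcDriverConnect tends to signal failures as warnings, so handle both
  error   = function(e) matrix(c(Ref1, 0), nrow = 1, ncol = 2, byrow = TRUE),
  warning = function(w) matrix(c(Ref1, 0), nrow = 1, ncol = 2, byrow = TRUE)
  )
  return(out)
}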

parallel reads from Athena (AWS) database, via R

I've got a largish dataset on an Athena database on AWS. I'd like to read from it in parallel, and I'm accustomed to the foreach package's approach to forking from within R.
I'm using RJDBC
Here's what I am trying:
out <- foreach(i = 1:length(fipsvec), .combine = rbind, .errorhandling = "remove") %dopar% {
  coni <- dbConnect(driver, "jdbc:awsathena://<<location>>/",
                    s3_staging_dir = "my_directory",
                    user = "...",
                    password = "...")
  print(paste0("starting ", i))
  sqlstring <- paste0("SELECT ",
                      "My_query_body",
                      fipsvec[i])
  row <- fetch(dbSendQuery(coni, sqlstring), -1, block = 999)
  print(i)
  dbDisconnect(coni)
  rm(coni)
  gc()
  return(row)
}
(Sorry I can't make this reproducible -- I obviously can't hand out the keys to the DB online.)
When I run this, the first c = number of cores steps run fine, but then it hangs and does nothing -- indefinitely as far as I can tell. htop shows no activity on any of the cores. And when I change the for loop to only loop over c entries, the output is what I expect. When I change from parallel to serial (%do% instead of %dopar%), it also works fine.
Does this have something to do with the connection not being closed properly, or somehow being defined redundantly? I've placed the connection within the parallel loop, so each core should have its own connection in its own environment. But I don't know enough about databases to tell whether this is sufficiently distinct.
I'd appreciate answers that help me understand what's going on under the hood here -- it's all voodoo to me at this point.
Are you passing the RJDBC package (and its dependencies: methods, DBI, and rJava) into the cluster anywhere?
If not, the first line of your code should look something like this:
results <- foreach(i = 1:length(fipsvec),
                   .combine = rbind,
                   .errorhandling = "remove",
                   .packages = c('methods', 'DBI', 'rJava', 'RJDBC')) %dopar% {
One thing that I suspect (but don't know) might make things a little hairier is that RJDBC uses a JVM to execute the queries. Not super knowledgeable about how rJava handles JVM initialization, and if each of the threads may be trying to re-use the same JVM simultaneously, or if they have enough information about the external environment to properly initialize one in the first place.
Another troubleshooting step if the above doesn't work might be to move the assignment step for driver into the %dopar% environment.
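A minimal sketch of that last suggestion (the driver class and jar path below are placeholders, not the real Athena values, and the query string is the placeholder from the question):

out <- foreach(i = 1:length(fipsvec),
               .combine = rbind,
               .errorhandling = "remove",
               .packages = c('methods', 'DBI', 'rJava', 'RJDBC')) %dopar% {
  # Build the JDBC driver inside the worker so each process initializes its own
  # rJava/JVM state instead of sharing a handle created on the master
  drv <- JDBC(driverClass = "your.athena.jdbc.DriverClass",  # placeholder
              classPath = "/path/to/AthenaJDBC.jar")         # placeholder
  coni <- dbConnect(drv, "jdbc:awsathena://<<location>>/",
                    s3_staging_dir = "my_directory",
                    user = "...", password = "...")
  sqlstring <- paste0("SELECT ", "My_query_body", fipsvec[i])
  row <- fetch(dbSendQuery(coni, sqlstring), -1)
  dbDisconnect(coni)
  row
}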
On another track, how many rows are in your result set? If the result set is in the million+ row range and can be returned with a single query, I actually came across an opportunity for optimization within the RJDBC package and have an open pull request on github ( https://github.com/s-u/RJDBC/pull/50 ) that I haven't heard anything on but have been using myself for a couple months. There's a basic benchmark documented in the pull request, I found the speedup to be substantial on the particular query I was running.
If it seems applicable you can install the branch with:
library(devtools)
devtools::install_github("msummersgill/RJDBC",ref = "harmonize", force = TRUE)

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
What they all seem to assume is that I have some kind of loop function operating on e.g. every Nth row of a data set divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation, though, is that what I have is of the form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent, and could in principle be run in 8 separate instances of R (although that would be a bother to manage, I guess). How can I farm the first 8 jobs out to 8 cores, so that whenever a core finishes its job and returns a treated dataset to the global environment, it simply takes whichever job is next in line?
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
  list(filePath1, stringArgs11, stringArgs21),
  list(filePath2, stringArgs12, stringArgs22),
  ...
  list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.
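A sketch of the parLapply variant mentioned above (assuming myFunction and the args list defined earlier; do.call() is used because each element of args is itself a list of arguments):

library(parallel)
library(data.table)

cl <- makeCluster(detectCores())
# myFunction (and anything it needs) may have to be exported first with
# clusterExport(cl, "myFunction") on a PSOCK cluster
res <- parLapply(cl, args, function(a) do.call(myFunction, a))
stopCluster(cl)

# Stack the returned data sets, as in the original workflow
df <- rbindlist(res)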

R parallel computing with snowfall - writing to files from separate workers

I am using the snowfall 1.84 package for parallel computing and would like each worker to write data to its own separate file during the computation. Is this possible? If so, how?
I am using the "SOCK" type connection, e.g. sfInit(parallel=TRUE, ..., type="SOCK"), and would like the code to be platform independent (Unix/Windows).
I know it is possible to use the "slaveOutfile" option in sfInit to define a file to write the log files to. But this is intended for debugging purposes, and all slaves/workers must use the same file. I need each worker to have its OWN output file!
The data I need to write are large data frames, NOT simple diagnostic messages. These data frames need to be output by the slaves and cannot be sent back to the master process.
Does anyone know how I can get this done?
Thanks
A simple solution is to use sfClusterApply to execute a function that opens a different file on each of the workers, assigning the resulting file object to a global variable so you can write to it in subsequent parallel operations:
library(snowfall)

nworkers <- 3
sfInit(parallel=TRUE, cpus=nworkers, type='SOCK')

workerinit <- function(datfile) {
  fobj <<- file(datfile, 'w')
  NULL
}
sfClusterApply(sprintf('worker_%02d.dat', seq_len(nworkers)), workerinit)

work <- function(i) {
  write.csv(data.frame(x=1:3, i=i), file=fobj)
  i
}
sfLapply(1:10, work)

sfStop()
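One possible refinement (a sketch, assuming the fobj global created in workerinit above): close each worker's file connection before the final sfStop(), for example with sfClusterEval:

# Evaluate on all workers: flush and close the per-worker file connection
sfClusterEval(close(fobj))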

Resources