R GBM versus Spark GBT performance - r

I'm trying to compare performance between R and Spark ML. My initial testing tells me that Spark ML performs better than R in most cases and scales much better as the dataset grows.
However, I'm getting strange results with Gradient Boosted Trees: R takes 3 minutes where Spark takes 15, on the same dataset and the same computer.
Here is the R code:
train <- read.table("c:/Path/to/file.csv", header=T, sep=";",dec=".")
train$X1 <- factor(train$X1)
train$X2 <- factor(train$X2)
train$X3 <- factor(train$X3)
train$X4 <- factor(train$X4)
train$X5 <- factor(train$X5)
train$X6 <- factor(train$X6)
train$X7 <- factor(train$X7)
train$X8 <- factor(train$X8)
train$X9 <- factor(train$X9)
library(gbm)
boost <- gbm(Freq ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + Y1,
             distribution = "gaussian", data = train,
             n.trees = 2000, bag.fraction = 1, shrinkage = 1,
             interaction.depth = 1, n.minobsinnode = 50,
             train.fraction = 1.0, cv.folds = 0, keep.data = TRUE)
And here is the Scala code for Spark:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.GBTRegressor
val conf = new SparkConf()
  .setAppName("GBTExample")
  .set("spark.driver.memory", "8g")
  .set("spark.executor.memory", "8g")
  .set("spark.network.timeout", "120s")
val sc = SparkContext.getOrCreate(conf.setMaster("local[8]"))
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val sourceData = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .load("c:/Path/to/file.csv")
val data = sourceData.select($"X1", $"X2", $"X3", $"X4", $"X5", $"X6", $"X7", $"X8", $"X9", $"Y1".cast("double"), $"Freq".cast("double"))
val X1Indexer = new StringIndexer().setInputCol("X1").setOutputCol("X1Idx")
val X2Indexer = new StringIndexer().setInputCol("X2").setOutputCol("X2Idx")
val X3Indexer = new StringIndexer().setInputCol("X3").setOutputCol("X3Idx")
val X4Indexer = new StringIndexer().setInputCol("X4").setOutputCol("X4Idx")
val X5Indexer = new StringIndexer().setInputCol("X5").setOutputCol("X5Idx")
val X6Indexer = new StringIndexer().setInputCol("X6").setOutputCol("X6Idx")
val X7Indexer = new StringIndexer().setInputCol("X7").setOutputCol("X7Idx")
val X8Indexer = new StringIndexer().setInputCol("X8").setOutputCol("X8Idx")
val X9Indexer = new StringIndexer().setInputCol("X9").setOutputCol("X9Idx")
val assembler = new VectorAssembler()
  .setInputCols(Array("X1Idx", "X2Idx", "X3Idx", "X4Idx", "X5Idx", "X6Idx", "X7Idx", "X8Idx", "X9Idx", "Y1"))
  .setOutputCol("features")
val dt = new GBTRegressor()
  .setLabelCol("Freq")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxIter(2000)
  .setMinInstancesPerNode(50)
  .setMaxDepth(1)
  .setStepSize(1)
  .setSubsamplingRate(1)
  .setMaxBins(32)
val pipeline = new Pipeline()
  .setStages(Array(X1Indexer, X2Indexer, X3Indexer, X4Indexer, X5Indexer, X6Indexer, X7Indexer, X8Indexer, X9Indexer, assembler, dt))
val model = pipeline.fit(data)
I have the feeling that I'm not comparing the same methods here, but the documentation that I could find did not clarify the situation.
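As a reference point for checking whether the two snippets describe the same method: on paper, both configure least-squares boosting of depth-1 trees (stumps) with a learning rate of 1, no row subsampling, and a minimum leaf size. The Python sketch below shows that procedure in its plainest form. It is an illustration under simplifying assumptions (every feature treated as numeric, exhaustive split search, hypothetical helper names fit_stump, boost and predict), not a description of how gbm or Spark implement it internally.

```python
def fit_stump(rows, residuals, min_leaf):
    """Return (feature, threshold, left_mean, right_mean) for the best
    single split under squared error, or None if no split is allowed."""
    n = len(rows)
    total = sum(residuals)
    best_gain, best = None, None
    for j in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = rows[order[k - 1]][j], rows[order[k]][j]
            # Skip splits between equal values or that leave a leaf too small.
            if lo == hi or k < min_leaf or n - k < min_leaf:
                continue
            right_sum = total - left_sum
            # Maximising this quantity is equivalent to minimising the SSE.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
            if best_gain is None or gain > best_gain:
                best_gain = gain
                best = (j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best


def boost(rows, y, n_trees, shrinkage=1.0, min_leaf=1):
    """Least-squares boosting: each stump is fitted to the current residuals."""
    f0 = sum(y) / len(y)          # initial prediction: the mean of the target
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(rows, residuals, min_leaf)
        if stump is None:
            break
        j, thr, left, right = stump
        left, right = shrinkage * left, shrinkage * right
        stumps.append((j, thr, left, right))
        pred = [p + (left if row[j] <= thr else right)
                for p, row in zip(pred, rows)]
    return f0, stumps


def predict(f0, stumps, rows):
    """Sum the initial value and the (already shrunk) leaf values."""
    return [f0 + sum(left if row[j] <= thr else right
                     for j, thr, left, right in stumps)
            for row in rows]
```

With a learning rate of 1, each stump removes as much residual error as a single split can, so the training error can flatten out long before 2000 iterations; how each library searches for that split is where the run time is likely to differ.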

Related

How to create a new data file from an existing dataset to load into Rattle?

My goal is to create a decision tree model in Rattle for a school project. I've been able to determine the variables that I need for my research question and created a new dataset from the original .csv file. After saving the new dataset as both an .xls file and an .rdata file, I received an error message when loading the file into Rattle. This is my first time creating a decision tree model, so I'm struggling a bit. Thanks in advance for your help!
Here's what I have so far:
install.packages("readxl")
library(readxl)
library(rattle)
setwd("C:/Users/river/OneDrive/Documents/Random Data")
edu <- read_excel('pfi_pu.xlsx')
eduu <- data.frame(c("P1HRSWK" = c(edu$P1HRSWK),
"P1EMPL" = c(edu$P1EMPL),
"P2HRSWK" = c(edu$P2HRSWK),
"P2EMPL" = c(edu$P2EMPL),
"P1ENRL" = c(edu$P1ENRL),
"P2ENRL" = c(edu$P2ENRL),
"P1EDUC" = c(edu$P1EDUC),
"P2EDUC" = c(edu$P2EDUC),
"P1HISPRM" = c(edu$P1HISPRM),
"P2HISPRM" = c(edu$P2HISPRM),
"P1PACI" = c(edu$P1PACI),
"P2PACI" = c(edu$P2PACI),
"P1BLACK" = c(edu$P1BLACK),
"P2BLACK" = c(edu$P2BLACK),
"P1ASIAN" = c(edu$P1ASIAN),
"P2ASIAN" = c(edu$P2ASIAN),
"P1AMIND" = c(edu$P1AMIND),
"P2AMIND" = c(edu$P2AMIND),
"P1HISPAN" = c(edu$P1HISPAN),
"P2HISPAN" = c(edu$P2HISPAN),
"P1LKWRK" = c(edu$P1LKWRK),
"P2LKWRK" = c(edu$P2LKWRK),
"P1MTHSWRK" = c(edu$P1MTHSWRK),
"P1REL" = c(edu$P1REL),
"P2REL" = c(edu$P2REL),
"P1SEX" = c(edu$P1SEX),
"P2SEX" = c(edu$P2SEX),
"P1MRSTA" = c(edu$P1MRSTA),
"SEFUTUREX" = c(edu$SEFUTUREX),
"HSFUTUREX" = c(edu$HSFUTUREX),
"PARGRADEX" = c(edu$PARGRADEX),
"TTLHHINC" = c(edu$TTLHHINC),
"PAR1EMPL" = c(edu$PAR1EMPL),
"PAR2EMPL" = c(edu$PAR2EMPL),
"SEEXPEL" = c(edu$SEEXPEL),
"SESUSPIN" = c(edu$SESUSPIN),
"SESUSOUT" = c(edu$SESUSOUT),
"SEGRADEQ" = c(edu$SEGRADEQ)
,dim = c(14075,38,1))
save(eduu,file="eduu.xls")
error message
It seems your problem is with writing the file. The save() function writes R's own .RData format, not Excel files, so giving the output an .xls name does not make it an Excel workbook. According to this post, you can try one of:
openxlsx::write.xlsx(eduu, 'eduu.xlsx')
xlsx::write.xlsx(eduu, 'eduu.xlsx')
writexl::write_xlsx(eduu, 'eduu.xlsx')

How to get/build a JavaRDD[DataSet]?

When I use deeplearning4j to train a model in Spark, I call:
public MultiLayerNetwork fit(JavaRDD<DataSet> trainingData)
fit() needs a JavaRDD<DataSet> parameter.
I tried to build one like this:
val totalDataset = csv.map(row => {
val features = Array(
row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
)
val labels = Array(row.getAs[String](21).toDouble)
val featuresINDA = Nd4j.create(features)
val labelsINDA = Nd4j.create(labels)
new DataSet(featuresINDA, labelsINDA)
})
but IntelliJ IDEA reports: No implicit arguments of type: Encoder[DataSet]
This is an error and I don't know how to solve it.
I know a Spark RDD can be converted to a JavaRDD, but I don't know how to build a Spark RDD[DataSet].
DataSet is imported from org.nd4j.linalg.dataset.DataSet.
Its constructor is:
public DataSet(INDArray first, INDArray second) {
this(first, second, (INDArray)null, (INDArray)null);
}
This is my code:
val spark: SparkSession = {
  SparkSession
    .builder()
    .master("local")
    .appName("Spark LSTM Emotion Analysis")
    .getOrCreate()
}
import spark.implicits._
val JavaSC = JavaSparkContext.fromSparkContext(spark.sparkContext)
val csv = spark.read.format("csv")
  .option("header", "true")
  .option("sep", ",")
  .load("/home/hadoop/sparkjobs/LReg/data.csv")
val totalDataset = csv.map(row => {
  val features = Array(
    row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
  )
  val labels = Array(row.getAs[String](21).toDouble)
  val featuresINDA = Nd4j.create(features)
  val labelsINDA = Nd4j.create(labels)
  new DataSet(featuresINDA, labelsINDA)
})
val data = totalDataset.toJavaRDD
The official deeplearning4j guide creates a JavaRDD<DataSet> in Java like this:
String filePath = "hdfs:///your/path/some_csv_file.csv";
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<String> rddString = sc.textFile(filePath);
RecordReader recordReader = new CSVRecordReader(',');
JavaRDD<List<Writable>> rddWritables = rddString.map(new StringToWritablesFunction(recordReader));
int labelIndex = 5; //Labels: a single integer representing the class index in column number 5
int numLabelClasses = 10; //10 classes for the label
JavaRDD<DataSet> rddDataSetClassification = rddWritables.map(new DataVecDataSetFunction(labelIndex, numLabelClasses, false));
I tried to do the same in Scala:
val JavaSC: JavaSparkContext = new JavaSparkContext()
val rddString: JavaRDD[String] = JavaSC.textFile("/home/hadoop/sparkjobs/LReg/hf-data.csv")
val recordReader: CSVRecordReader = new CSVRecordReader(',')
val rddWritables: JavaRDD[List[Writable]] = rddString.map(new StringToWritablesFunction(recordReader))
val featureColnum = 3
val labelColnum = 1
val d = new DataVecDataSetFunction(featureColnum,labelColnum,true,null,null)
// val rddDataSet: JavaRDD[DataSet] = rddWritables.map(new DataVecDataSetFunction(featureColnum,labelColnum, true,null,null))
// cannot resolve overloaded method 'map'
Debug error information:
A DataSet is just a pair of INDArrays (inputs and labels).
Our docs cover this in depth:
https://deeplearning4j.konduit.ai/distributed-deep-learning/data-howto
For Stack Overflow's sake, I'll summarize what's there, since there is no single way to create a data pipeline; it depends on your problem. It's very similar to how you would create a dataset locally: generally you take whatever you do locally and put it into Spark inside a function.
CSVs and images, for example, are going to be very different, but generally you use the DataVec library for this. The docs summarize the approach for each kind.

Choosing initial population of mlr genetic algorithm

I would like to choose the initial population of the genetic algorithm used for variable selection by the makeFeatSelControlGA() function from the mlr package.
Is it doable?
Edit
Since it is not implemented, I made a quick-and-dirty edit of the selectFeaturesGA function for this purpose. Here is the code, in case it helps anyone.
library(R.utils)
reassignInPackage("selectFeaturesGA", "mlr", function(learner, task, resampling, measures, bit.names, bits.to.features,
  control, opt.path, show.info)
{
  # Initial population: a list named `states` taken from the global environment.
  states = get("states", envir = .GlobalEnv)
  mu = control$extra.args$mu
  lambda = control$extra.args$lambda
  yname = opt.path$y.names[1]
  minimize = opt.path$minimize[1]
  # Fall back to a random population when `states` is empty.
  if (!length(states))
  {
    for (i in seq_len(mu)) {
      while (TRUE) {
        states[[i]] = rbinom(length(bit.names), 1, 0.5)
        # Accept the candidate only if it respects max.features (when set).
        if (is.na(control$max.features) || sum(states[[i]]) <=
          control$max.features)
          break
      }
    }
  }
  # Evaluate the initial population.
  evalOptimizationStatesFeatSel(learner, task, resampling,
    measures, bits.to.features, control, opt.path, show.info,
    states, 0L, NA_integer_)
  pop.inds = seq_len(mu)
  for (i in seq_len(control$maxit)) {
    pop.df = as.data.frame(opt.path)[pop.inds, , drop = FALSE]
    pop.featmat = as.matrix(pop.df[, bit.names, drop = FALSE])
    mode(pop.featmat) = "integer"
    pop.y = pop.df[, yname]
    # Create and evaluate lambda offspring.
    kids.list = replicate(lambda, generateKid(pop.featmat,
      control), simplify = FALSE)
    kids.evals = evalOptimizationStatesFeatSel(learner, task,
      resampling, measures, bits.to.features, control,
      opt.path, show.info, states = kids.list, i, as.integer(NA))
    kids.y = extractSubList(kids.evals, c("y", yname))
    oplen = getOptPathLength(opt.path)
    kids.inds = seq(oplen - lambda + 1, oplen)
    if (control$extra.args$comma) {
      # (mu, lambda): only the offspring compete for survival.
      setOptPathElEOL(opt.path, pop.inds, i - 1)
      pool.inds = kids.inds
      pool.y = kids.y
    }
    else {
      # (mu + lambda): parents and offspring compete together.
      pool.inds = c(pop.inds, kids.inds)
      pool.y = c(pop.y, kids.y)
    }
    # Keep the mu best individuals as the next population.
    pop.inds = pool.inds[order(pool.y, decreasing = !minimize)[seq_len(mu)]]
    setOptPathElEOL(opt.path, setdiff(pool.inds, pop.inds),
      i)
  }
  makeFeatSelResultFromOptPath(learner, measures, resampling,
    control, opt.path)
})
In the global environment, you then need to define a list named states with length equal to mu (an empty list falls back to a random initial population).
This is currently not supported by mlr, so you would have to modify the source code to achieve that.
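To make the idea of seeding the population concrete, here is a minimal, language-neutral sketch in Python of a (mu + lambda) genetic algorithm over feature bit vectors that accepts a caller-supplied initial population. Everything in it is an assumption made for illustration: the names ga_select, score and init_pop are hypothetical, the crossover and mutation operators are deliberately simple and are not mlr's generateKid, and, unlike the edit above, a short initial population is padded with random individuals.

```python
import random


def ga_select(score, n_bits, mu, lam, maxit, init_pop=None, rng=None):
    """(mu + lambda) GA over 0/1 feature masks; a higher score is better.

    init_pop is an optional list of bit vectors used to seed the population.
    Requires mu >= 2 so that two distinct parents can be drawn.
    """
    rng = rng or random.Random(0)
    # Start from the supplied individuals, padding with random ones if needed.
    pop = [list(s) for s in (init_pop or [])][:mu]
    while len(pop) < mu:
        pop.append([rng.randint(0, 1) for _ in range(n_bits)])
    for _ in range(maxit):
        kids = []
        for _ in range(lam):
            a, b = rng.sample(pop, 2)
            # Uniform crossover: take each bit from one of the two parents.
            kid = [rng.choice(pair) for pair in zip(a, b)]
            # Bit-flip mutation with a small, fixed probability.
            kid = [1 - bit if rng.random() < 0.05 else bit for bit in kid]
            kids.append(kid)
        # Plus-selection: parents and offspring compete, the best mu survive.
        pool = sorted(pop + kids, key=score, reverse=True)
        pop = pool[:mu]
    return pop[0]


# Usage: seed the search with three hand-picked feature subsets.
target = [1, 0, 1, 1, 0, 0]
score = lambda bits: sum(b == t for b, t in zip(bits, target))
best = ga_select(score, n_bits=6, mu=3, lam=4, maxit=5,
                 init_pop=[target, [0] * 6, [1] * 6])
```

Because plus-selection never discards the best individual, a good seed can only be matched or improved upon, which is the main practical reason to supply one.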

How to get transaction history without certain state

I'm trying to get the transaction history on Corda.
I need to get the amount of the transactions for a certain period.
My API for this:
@GET
@Path("transactions")
@Produces(MediaType.APPLICATION_JSON)
fun getTransactions(): List<StateAndRef<ContractState>> {
    val TODAY = Instant.now()
    val pagingSpec = PageSpecification(DEFAULT_PAGE_NUM, 100)
    val start = TODAY.minus(1, ChronoUnit.HOURS)
    val end = TODAY.plus(1, ChronoUnit.HOURS)
    val recordedBetweenExpression = QueryCriteria.TimeCondition(
        QueryCriteria.TimeInstantType.RECORDED,
        ColumnPredicate.Between(start, end))
    val criteria = QueryCriteria.VaultQueryCriteria(timeCondition = recordedBetweenExpression, status = Vault.StateStatus.ALL)
    val results = rpcOps.vaultQueryBy<ContractState>(criteria, paging = pagingSpec)
    val size = results.states.count()
    return rpcOps.vaultQueryBy<ContractState>().states
}
where:
val rpcOps: CordaRPCOps
I can explicitly specify the states for which to receive transactions, like:
val criteria = VaultQueryCriteria(contractStateTypes = setOf(Cash.State::class.java, DealState::class.java))
but I need to get transactions across all states except for a certain one.
Does Corda have any mechanism for this?
There is no type of query criteria that specifically excludes certain states. However, you can define a query criteria that specifically includes certain states, then combine that with your existing criteria using an AND composition:
val TODAY = Instant.now()
val pagingSpec = PageSpecification(DEFAULT_PAGE_NUM, 100)
val start = TODAY.minus(1, ChronoUnit.HOURS)
val end = TODAY.plus(1, ChronoUnit.HOURS)
val recordedBetweenExpression = QueryCriteria.TimeCondition(
QueryCriteria.TimeInstantType.RECORDED,
ColumnPredicate.Between(start, end))
val timeCriteria = QueryCriteria.VaultQueryCriteria(timeCondition = recordedBetweenExpression, status = Vault.StateStatus.ALL)
val typeCriteria = QueryCriteria.VaultQueryCriteria(contractStateTypes = setOf(State1::class.java, State2::class.java), status = Vault.StateStatus.ALL)
val combinedCriteria = timeCriteria.and(typeCriteria)
val results = rpcOps.vaultQueryBy<ContractState>(combinedCriteria, paging = pagingSpec)
This will retrieve all the states that meet both your time criteria and your type criteria.
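The include-and-combine idea can be sketched outside Corda as well. In the minimal Python illustration below, the "all states except one" requirement becomes an explicit include set (every known type minus the excluded one), which is then ANDed with the time window. The record layout, the type names and the helper names include_set and query are hypothetical stand-ins for vault states; none of this is Corda API.

```python
from datetime import datetime, timedelta


def include_set(known_types, excluded):
    """Turn an exclusion into an explicit inclusion list."""
    return set(known_types) - set(excluded)


def query(records, start, end, types):
    """Keep records that satisfy BOTH the time window and the type filter."""
    return [r for r in records
            if start <= r["recorded"] <= end and r["type"] in types]


# Usage with made-up records: everything from the last hour except "Cash".
now = datetime(2020, 1, 1, 12, 0, 0)
records = [
    {"type": "Cash", "recorded": now - timedelta(minutes=10)},
    {"type": "Deal", "recorded": now - timedelta(minutes=20)},
    {"type": "Deal", "recorded": now - timedelta(hours=3)},
    {"type": "Loan", "recorded": now - timedelta(minutes=30)},
]
wanted = include_set({"Cash", "Deal", "Loan"}, {"Cash"})
recent = query(records, now - timedelta(hours=1), now, wanted)
```

The cost of this approach is that the list of known types has to be maintained by hand whenever a new state type is added.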

Trouble with a function in R, "BinHist"

I'm trying to use a bit of code that I found in an academic journal. I'm new-ish to R. I keep getting an error when I reach the line that calls the function binhist; it says 'could not find the function "binhist"'. I can't figure out whether it's in a library/package I need to install or whether there's another problem with the code. Any help would be much appreciated. Here's the code I extracted from the article:
whichData = yourData
baseH = data.frame()
RunningSum = 0
for (i in 2:16) {
tempBin = NULL
tempBin = binhist(i, whichData$rt)
theMean = sum(tempBin)/(i)
Divisor = sum(tempBin)
new = data.frame()
for (j in 1:ncol(tempBin)) {
grabVal = (tempBin[j] - theMean)^2
names(grabVal) <- NULL
new = c(new,grabVal)
}
extra = i - ncol(tempBin)
NewSum = Reduce("+",new) + extra*((0 - theMean)^2)
StdDev = sqrt(NewSum /(i-1))
RowVal = StdDev /Divisor
RunningSum = RunningSum + RowVal
baseH = c(baseH, list(tempBin))
}
paste("Number of Trials:",Divisor)
paste("Modulo-Binning Score (MBS): ",RunningSum)
library(plyr)
baseNow = do.call(rbind.fill,baseH)
