I'm trying to compare performance between R and Spark-ML, and my initial testing tells me that Spark-ML beats R in most cases and scales much better as the dataset grows.
However, I'm seeing strange results with Gradient Boosted Trees: R takes 3 minutes where Spark takes 15, on the same dataset and the same computer.
Here is the R code:
train <- read.table("c:/Path/to/file.csv", header=T, sep=";",dec=".")
train$X1 <- factor(train$X1)
train$X2 <- factor(train$X2)
train$X3 <- factor(train$X3)
train$X4 <- factor(train$X4)
train$X5 <- factor(train$X5)
train$X6 <- factor(train$X6)
train$X7 <- factor(train$X7)
train$X8 <- factor(train$X8)
train$X9 <- factor(train$X9)
library(gbm)
boost <- gbm(Freq ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + Y1,
             distribution = "gaussian", data = train, n.trees = 2000,
             bag.fraction = 1, shrinkage = 1, interaction.depth = 1,
             n.minobsinnode = 50, train.fraction = 1.0, cv.folds = 0,
             keep.data = TRUE)
And here is the Scala code for Spark:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.GBTRegressor
val conf = new SparkConf()
.setAppName("GBTExample")
.set("spark.driver.memory", "8g")
.set("spark.executor.memory", "8g")
.set("spark.network.timeout", "120s")
val sc = SparkContext.getOrCreate(conf.setMaster("local[8]"))
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val sourceData = spark.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ";")
.option("inferSchema", "true")
.load("c:/Path/to/file.csv")
val data = sourceData.select($"X1", $"X2", $"X3", $"X4", $"X5", $"X6", $"X7", $"X8", $"X9", $"Y1".cast("double"), $"Freq".cast("double"))
val X1Indexer = new StringIndexer().setInputCol("X1").setOutputCol("X1Idx")
val X2Indexer = new StringIndexer().setInputCol("X2").setOutputCol("X2Idx")
val X3Indexer = new StringIndexer().setInputCol("X3").setOutputCol("X3Idx")
val X4Indexer = new StringIndexer().setInputCol("X4").setOutputCol("X4Idx")
val X5Indexer = new StringIndexer().setInputCol("X5").setOutputCol("X5Idx")
val X6Indexer = new StringIndexer().setInputCol("X6").setOutputCol("X6Idx")
val X7Indexer = new StringIndexer().setInputCol("X7").setOutputCol("X7Idx")
val X8Indexer = new StringIndexer().setInputCol("X8").setOutputCol("X8Idx")
val X9Indexer = new StringIndexer().setInputCol("X9").setOutputCol("X9Idx")
val assembler = new VectorAssembler()
.setInputCols(Array("X1Idx", "X2Idx", "X3Idx", "X4Idx", "X5Idx", "X6Idx", "X7Idx", "X8Idx", "X9Idx", "Y1"))
.setOutputCol("features")
val dt = new GBTRegressor()
.setLabelCol("Freq")
.setFeaturesCol("features")
.setImpurity("variance")
.setMaxIter(2000)
.setMinInstancesPerNode(50)
.setMaxDepth(1)
.setStepSize(1)
.setSubsamplingRate(1)
.setMaxBins(32)
val pipeline = new Pipeline()
.setStages(Array(X1Indexer, X2Indexer, X3Indexer, X4Indexer, X5Indexer, X6Indexer, X7Indexer, X8Indexer, X9Indexer, assembler, dt))
val model = pipeline.fit(data)
I have the feeling that I'm not comparing the same methods here, but the documentation that I could find did not clarify the situation.
Related
My goal is to create a decision tree model in Rattle for a school project. I've been able to determine the variables I need for my research question and created a new dataset from the original .csv file. After saving the new dataset as both an .xls file and an .rdata file, I received an error message when loading the file into Rattle. This is my first time creating a decision tree model, so I'm struggling a bit. Thanks in advance for your help!
Here's what I have so far:
install.packages("readxl")
library(readxl)
library(rattle)
setwd("C:/Users/river/OneDrive/Documents/Random Data")
edu <- read_excel('pfi_pu.xlsx')
eduu <- data.frame(c("P1HRSWK" = c(edu$P1HRSWK),
"P1EMPL" = c(edu$P1EMPL),
"P2HRSWK" = c(edu$P2HRSWK),
"P2EMPL" = c(edu$P2EMPL),
"P1ENRL" = c(edu$P1ENRL),
"P2ENRL" = c(edu$P2ENRL),
"P1EDUC" = c(edu$P1EDUC),
"P2EDUC" = c(edu$P2EDUC),
"P1HISPRM" = c(edu$P1HISPRM),
"P2HISPRM" = c(edu$P2HISPRM),
"P1PACI" = c(edu$P1PACI),
"P2PACI" = c(edu$P2PACI),
"P1BLACK" = c(edu$P1BLACK),
"P2BLACK" = c(edu$P2BLACK),
"P1ASIAN" = c(edu$P1ASIAN),
"P2ASIAN" = c(edu$P2ASIAN),
"P1AMIND" = c(edu$P1AMIND),
"P2AMIND" = c(edu$P2AMIND),
"P1HISPAN" = c(edu$P1HISPAN),
"P2HISPAN" = c(edu$P2HISPAN),
"P1LKWRK" = c(edu$P1LKWRK),
"P2LKWRK" = c(edu$P2LKWRK),
"P1MTHSWRK" = c(edu$P1MTHSWRK),
"P1REL" = c(edu$P1REL),
"P2REL" = c(edu$P2REL),
"P1SEX" = c(edu$P1SEX),
"P2SEX" = c(edu$P2SEX),
"P1MRSTA" = c(edu$P1MRSTA),
"SEFUTUREX" = c(edu$SEFUTUREX),
"HSFUTUREX" = c(edu$HSFUTUREX),
"PARGRADEX" = c(edu$PARGRADEX),
"TTLHHINC" = c(edu$TTLHHINC),
"PAR1EMPL" = c(edu$PAR1EMPL),
"PAR2EMPL" = c(edu$PAR2EMPL),
"SEEXPEL" = c(edu$SEEXPEL),
"SESUSPIN" = c(edu$SESUSPIN),
"SESUSOUT" = c(edu$SESUSOUT),
"SEGRADEQ" = c(edu$SEGRADEQ)
,dim = c(14075,38,1))
save(eduu,file="eduu.xls")
[error message screenshot]
It seems your problem is with writing the file. The save command is for saving .RData files, not Excel files. According to this post, you may try:
openxlsx::write.xlsx(eduu, 'eduu.xlsx')
xlsx::write.xlsx(eduu, 'eduu.xlsx')
writexl::write_xlsx(eduu, 'eduu.xlsx')
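Conversely, if you want a file that can be loaded back into R (which is what an .RData file is for), give save a .RData name; a minimal sketch:
# save() writes R's native serialized format, so use an .RData extension
save(eduu, file = "eduu.RData")
# later, restore the object into the workspace before opening Rattle
load("eduu.RData")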
When I use deeplearning4j and try to train a model in Spark with
public MultiLayerNetwork fit(JavaRDD<DataSet> trainingData)
fit() needs a JavaRDD<DataSet> parameter. I tried to build one like this:
val totalDaset = csv.map(row => {
val features = Array(
row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
)
val labels = Array(row.getAs[String](21).toDouble)
val featuresINDA = Nd4j.create(features)
val labelsINDA = Nd4j.create(labels)
new DataSet(featuresINDA, labelsINDA)
})
but IDEA's inspection flags it with: No implicit arguments of type: Encoder[DataSet]
It's an error and I don't know how to solve it. I know a Spark RDD can be converted to a JavaRDD, but I don't know how to build a Spark RDD[DataSet].
DataSet comes from org.nd4j.linalg.dataset.DataSet. Its constructor is:
public DataSet(INDArray first, INDArray second) {
this(first, second, (INDArray)null, (INDArray)null);
}
This is my code:
val spark:SparkSession = {SparkSession
.builder()
.master("local")
.appName("Spark LSTM Emotion Analysis")
.getOrCreate()
}
import spark.implicits._
val JavaSC = JavaSparkContext.fromSparkContext(spark.sparkContext)
val csv=spark.read.format("csv")
.option("header","true")
.option("sep",",")
.load("/home/hadoop/sparkjobs/LReg/data.csv")
val totalDataset = csv.map(row => {
val features = Array(
row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
)
val labels = Array(row.getAs[String](21).toDouble)
val featuresINDA = Nd4j.create(features)
val labelsINDA = Nd4j.create(labels)
new DataSet(featuresINDA, labelsINDA)
})
val data = totalDataset.toJavaRDD
Creating a JavaRDD<DataSet> in Java, from the deeplearning4j official guide:
String filePath = "hdfs:///your/path/some_csv_file.csv";
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<String> rddString = sc.textFile(filePath);
RecordReader recordReader = new CSVRecordReader(',');
JavaRDD<List<Writable>> rddWritables = rddString.map(new StringToWritablesFunction(recordReader));
int labelIndex = 5; //Labels: a single integer representing the class index in column number 5
int numLabelClasses = 10; //10 classes for the label
JavaRDD<DataSet> rddDataSetClassification = rddWritables.map(new DataVecDataSetFunction(labelIndex, numLabelClasses, false));
I tried to create it in Scala:
val JavaSC: JavaSparkContext = new JavaSparkContext()
val rddString: JavaRDD[String] = JavaSC.textFile("/home/hadoop/sparkjobs/LReg/hf-data.csv")
val recordReader: CSVRecordReader = new CSVRecordReader(',')
val rddWritables: JavaRDD[List[Writable]] = rddString.map(new StringToWritablesFunction(recordReader))
val featureColnum = 3
val labelColnum = 1
val d = new DataVecDataSetFunction(featureColnum,labelColnum,true,null,null)
// val rddDataSet: JavaRDD[DataSet] = rddWritables.map(new DataVecDataSetFunction(featureColnum, labelColnum, true, null, null))
// cannot resolve overloaded method 'map'
[debug error screenshot]
A DataSet is just a pair of INDArrays. (inputs and labels)
Our docs cover this in depth:
https://deeplearning4j.konduit.ai/distributed-deep-learning/data-howto
For Stack Overflow's sake, I'll summarize what's there, since there's no single way to create a data pipeline; it depends on your problem. It's very similar to how you would create a dataset locally: generally, you take whatever you do locally and put it into Spark in a function.
CSVs and images, for example, are going to be very different. But generally you use the DataVec library for this; the docs summarize the approach for each kind.
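For the Encoder[DataSet] error specifically: Dataset.map needs an implicit Encoder for its result type, and Spark has none for ND4J's DataSet. One workaround, sketched below under the assumption of Spark 2.x and the csv DataFrame from the code above, is to drop down to the RDD API first, because RDD.map does not need an encoder:
import org.apache.spark.api.java.JavaRDD
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j
// csv.rdd yields an RDD[Row]; mapping an RDD requires no Encoder
val totalDataset: JavaRDD[DataSet] = csv.rdd.map { row =>
  val features = Array(row.getAs[String](0).toDouble, row.getAs[String](1).toDouble)
  val labels = Array(row.getAs[String](21).toDouble)
  new DataSet(Nd4j.create(features), Nd4j.create(labels))
}.toJavaRDD()
As for the second attempt: StringToWritablesFunction produces a java.util.List<Writable>, not Scala's List, so typing rddWritables as JavaRDD[List[Writable]] with Scala's List in scope is likely why the map overload failed to resolve.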
I would like to choose the initial population of the genetic algorithm for variable selection with the makeFeatSelControlGA() function from the mlr package.
Is it doable?
Edit
Since it is not implemented, I did a dirty edit of the selectFeaturesGA function for this purpose. Here is the code, in case it helps anyone.
library(R.utils)
reassignInPackage("selectFeaturesGA", "mlr",
  function(learner, task, resampling, measures, bit.names, bits.to.features,
           control, opt.path, show.info) {
    # Pull the user-supplied initial population from the global environment
    states = get("states", envir = .GlobalEnv)
    mu = control$extra.args$mu
    lambda = control$extra.args$lambda
    yname = opt.path$y.names[1]
    minimize = opt.path$minimize[1]
    # Fall back to random initialization if no initial population was supplied
    if (!length(states)) {
      for (i in seq_len(mu)) {
        while (TRUE) {
          states[[i]] = rbinom(length(bit.names), 1, 0.5)
          if (is.na(control$max.features) ||
              sum(states[[i]]) <= control$max.features)
            break
        }
      }
    }
    evalOptimizationStatesFeatSel(learner, task, resampling, measures,
      bits.to.features, control, opt.path, show.info, states, 0L, NA_integer_)
    pop.inds = seq_len(mu)
    for (i in seq_len(control$maxit)) {
      pop.df = as.data.frame(opt.path)[pop.inds, , drop = FALSE]
      pop.featmat = as.matrix(pop.df[, bit.names, drop = FALSE])
      mode(pop.featmat) = "integer"
      pop.y = pop.df[, yname]
      kids.list = replicate(lambda, generateKid(pop.featmat, control), simplify = FALSE)
      kids.evals = evalOptimizationStatesFeatSel(learner, task, resampling,
        measures, bits.to.features, control, opt.path, show.info,
        states = kids.list, i, as.integer(NA))
      kids.y = extractSubList(kids.evals, c("y", yname))
      oplen = getOptPathLength(opt.path)
      kids.inds = seq(oplen - lambda + 1, oplen)
      if (control$extra.args$comma) {
        setOptPathElEOL(opt.path, pop.inds, i - 1)
        pool.inds = kids.inds
        pool.y = kids.y
      } else {
        pool.inds = c(pop.inds, kids.inds)
        pool.y = c(pop.y, kids.y)
      }
      pop.inds = pool.inds[order(pool.y, decreasing = !minimize)[seq_len(mu)]]
      setOptPathElEOL(opt.path, setdiff(pool.inds, pop.inds), i)
    }
    makeFeatSelResultFromOptPath(learner, measures, resampling, control, opt.path)
  })
And in the global environment, you need to define a list named states with length equal to mu.
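For example, a minimal sketch with mu = 2 (the 0/1 vectors are purely illustrative; each one needs one entry per feature, in the order of bit.names):
# One element per individual; length(states) must equal mu
states <- list(
  c(1, 0, 1, 1, 0),  # individual 1: a hand-picked starting subset
  c(0, 1, 1, 0, 1)   # individual 2
)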
This is currently not supported by mlr, so you would have to modify the source code to achieve that.
I'm trying to get the transaction history on Corda. I need to get the amounts of the transactions for a certain period.
My API for this:
@GET
@Path("transactions")
@Produces(MediaType.APPLICATION_JSON)
fun getTransactions(): List<StateAndRef<ContractState>> {
    val TODAY = Instant.now()
    val pagingSpec = PageSpecification(DEFAULT_PAGE_NUM, 100)
    val start = TODAY.minus(1, ChronoUnit.HOURS)
    val end = TODAY.plus(1, ChronoUnit.HOURS)
    val recordedBetweenExpression = QueryCriteria.TimeCondition(
        QueryCriteria.TimeInstantType.RECORDED,
        ColumnPredicate.Between(start, end))
    val criteria = QueryCriteria.VaultQueryCriteria(timeCondition = recordedBetweenExpression, status = Vault.StateStatus.ALL)
    val results = rpcOps.vaultQueryBy<ContractState>(criteria, paging = pagingSpec)
    // Return the page of states matched by the time criteria
    return results.states
}
where:
val rpcOps: CordaRPCOps
I can explicitly specify States for which to receive transactions like:
val criteria = VaultQueryCriteria(contractStateTypes = setOf(Cash.State::class.java, DealState::class.java))
but I need to get transactions across all states except for certain ones. Does Corda have any mechanism for this?
There is no type of query criteria that specifically excludes certain states. However, you can define a query criteria that specifically includes certain states, then combine that with your existing criteria using an AND composition:
val TODAY = Instant.now()
val pagingSpec = PageSpecification(DEFAULT_PAGE_NUM, 100)
val start = TODAY.minus(1, ChronoUnit.HOURS)
val end = TODAY.plus(1, ChronoUnit.HOURS)
val recordedBetweenExpression = QueryCriteria.TimeCondition(
QueryCriteria.TimeInstantType.RECORDED,
ColumnPredicate.Between(start, end))
val timeCriteria = QueryCriteria.VaultQueryCriteria(timeCondition = recordedBetweenExpression, status = Vault.StateStatus.ALL)
val typeCriteria = QueryCriteria.VaultQueryCriteria(contractStateTypes = setOf(State1::class.java, State2::class.java), status = Vault.StateStatus.ALL)
val combinedCriteria = timeCriteria.and(typeCriteria)
val results = rpcOps.vaultQueryBy<ContractState>(combinedCriteria, paging = pagingSpec)
This will retrieve all the states that meet both your time criteria and your type criteria.
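If you genuinely need exclusion rather than inclusion (for example, because the set of state types is open-ended), one workaround is to filter the returned page on the client side. A minimal sketch, where ExcludedState is a hypothetical placeholder for whatever type you want to drop:
val results = rpcOps.vaultQueryBy<ContractState>(timeCriteria, paging = pagingSpec)
// Drop unwanted state types after the query returns; ExcludedState is a placeholder
val filtered = results.states.filterNot { it.state.data is ExcludedState }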
I'm trying to use a bit of code that I found in an academic journal. I'm new-ish to R. I keep getting an error when I reach the code calling the function binhist; it says: could not find function "binhist". I can't figure out whether it's in a library/package I need to install or whether there's another problem with the code. Any help would be much appreciated. Here's the code I extracted from the article:
whichData = yourData
baseH = data.frame()
RunningSum = 0
for (i in 2:16) {
  tempBin = NULL
  tempBin = binhist(i, whichData$rt)
  theMean = sum(tempBin) / i
  Divisor = sum(tempBin)
  new = data.frame()
  for (j in 1:ncol(tempBin)) {
    grabVal = (tempBin[j] - theMean)^2
    names(grabVal) <- NULL
    new = c(new, grabVal)
  }
  extra = i - ncol(tempBin)
  NewSum = Reduce("+", new) + extra * ((0 - theMean)^2)
  StdDev = sqrt(NewSum / (i - 1))
  RowVal = StdDev / Divisor
  RunningSum = RunningSum + RowVal
  baseH = c(baseH, list(tempBin))
}
paste("Number of Trials:", Divisor)
paste("Modulo-Binning Score (MBS): ", RunningSum)
library(plyr)
baseNow = do.call(rbind.fill, baseH)
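As for binhist itself: it does not appear to be in any CRAN package, so it is presumably a helper defined in the article's supplementary code. Judging only from how the loop uses it (a one-row table of counts indexed by column, i bins, and the "Modulo-Binning Score" label), a guessed reconstruction might look like the sketch below; treat it as an assumption and check the article's supplement for the real definition:
# Hypothetical reconstruction -- not necessarily the article's definition.
# Counts the values of x in each remainder class modulo i and returns them
# as a one-row data frame, so that ncol() and [j] indexing work above.
binhist <- function(i, x) {
  counts <- table(x %% i)
  as.data.frame(t(as.matrix(counts)))
}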