How to get/build a JavaRDD[DataSet]?

When I use deeplearning4j and try to train a model in Spark with
public MultiLayerNetwork fit(JavaRDD<DataSet> trainingData)
fit() needs a JavaRDD<DataSet> parameter. I tried to build one like this:
val totalDataset = csv.map(row => {
  val features = Array(
    row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
  )
  val labels = Array(row.getAs[String](21).toDouble)
  val featuresINDA = Nd4j.create(features)
  val labelsINDA = Nd4j.create(labels)
  new DataSet(featuresINDA, labelsINDA)
})
but IDEA flags it with the error: No implicit arguments of type: Encoder[DataSet]
I don't know how to solve this problem. I know a Spark RDD can be converted to a JavaRDD, but I don't know how to build a Spark RDD[DataSet] in the first place.
DataSet comes from org.nd4j.linalg.dataset.DataSet. Its constructor is:
public DataSet(INDArray first, INDArray second) {
    this(first, second, (INDArray)null, (INDArray)null);
}
This is my code:
val spark: SparkSession = SparkSession
  .builder()
  .master("local")
  .appName("Spark LSTM Emotion Analysis")
  .getOrCreate()
import spark.implicits._
val JavaSC = JavaSparkContext.fromSparkContext(spark.sparkContext)

val csv = spark.read.format("csv")
  .option("header", "true")
  .option("sep", ",")
  .load("/home/hadoop/sparkjobs/LReg/data.csv")

val totalDataset = csv.map(row => {
  val features = Array(
    row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
  )
  val labels = Array(row.getAs[String](21).toDouble)
  val featuresINDA = Nd4j.create(features)
  val labelsINDA = Nd4j.create(labels)
  new DataSet(featuresINDA, labelsINDA)
})
val data = totalDataset.toJavaRDD
The deeplearning4j official guide creates a JavaRDD<DataSet> in Java like this:
String filePath = "hdfs:///your/path/some_csv_file.csv";
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<String> rddString = sc.textFile(filePath);
RecordReader recordReader = new CSVRecordReader(',');
JavaRDD<List<Writable>> rddWritables = rddString.map(new StringToWritablesFunction(recordReader));
int labelIndex = 5; //Labels: a single integer representing the class index in column number 5
int numLabelClasses = 10; //10 classes for the label
JavaRDD<DataSet> rddDataSetClassification = rddWritables.map(new DataVecDataSetFunction(labelIndex, numLabelClasses, false));
I tried to recreate it in Scala:
val JavaSC: JavaSparkContext = new JavaSparkContext()
val rddString: JavaRDD[String] = JavaSC.textFile("/home/hadoop/sparkjobs/LReg/hf-data.csv")
val recordReader: CSVRecordReader = new CSVRecordReader(',')
val rddWritables: JavaRDD[List[Writable]] = rddString.map(new StringToWritablesFunction(recordReader))
val featureColnum = 3
val labelColnum = 1
val d = new DataVecDataSetFunction(featureColnum, labelColnum, true, null, null)
// val rddDataSet: JavaRDD[DataSet] = rddWritables.map(new DataVecDataSetFunction(featureColnum, labelColnum, true, null, null))
// cannot resolve overloaded method 'map'
Debug error information:

A DataSet is just a pair of INDArrays (inputs and labels).
Our docs cover this in depth:
https://deeplearning4j.konduit.ai/distributed-deep-learning/data-howto
For Stack Overflow's sake I'll summarize what's there, since there is no single way to create a data pipeline; it depends on your problem. It's very similar to how you would create a dataset locally: generally you take whatever you do locally and put it into Spark in a function.
CSVs and images, for example, are handled very differently, but in general you use the DataVec library to do it. The docs summarize the approach for each kind of data.
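To make that concrete for the code in the question, here is a minimal sketch (my own illustration, not an official DL4J pipeline) that avoids the Encoder[DataSet] error by dropping from the typed Dataset API to the plain RDD API, which needs no Encoder; the column indices are the ones from the question:
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.sql.Row
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j

// Assuming `csv` is the DataFrame loaded in the question. Dropping to .rdd
// sidesteps Spark's Encoder machinery, which cannot derive an
// Encoder[DataSet] for ND4J types.
val totalDataset: JavaRDD[DataSet] = csv.rdd.map { row: Row =>
  val features = Array(
    row.getAs[String](0).toDouble,
    row.getAs[String](1).toDouble
  )
  val labels = Array(row.getAs[String](21).toDouble)
  // shape both arrays as 1-row matrices, which DataSet expects
  new DataSet(
    Nd4j.create(features, Array(1, features.length)),
    Nd4j.create(labels, Array(1, labels.length))
  )
}.toJavaRDD()
As for the Scala port of the Java guide: StringToWritablesFunction returns a java.util.List<Writable>, so typing rddWritables as JavaRDD[java.util.List[Writable]] (rather than Scala's own List) is the likely fix for the "cannot resolve overloaded method 'map'" error.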

Related

Access the contents of the row inserted into Dynamodb using Pynamodb save method

I have the below model for my DynamoDB table, using PynamoDB:
from pynamodb.models import Model
from pynamodb.attributes import (
    UnicodeAttribute, UTCDateTimeAttribute, UnicodeSetAttribute, BooleanAttribute
)

class Reminders(Model):
    """Model class for the Reminders table."""

    # Information on global secondary index for the table:
    # user_id (hash key) + reminder_id+reminder_title (sort key)
    class Meta:
        table_name = 'Reminders'
        region = 'eu-central-1'

    reminder_id = UnicodeAttribute(hash_key=True)
    user_id = UnicodeAttribute(range_key=True)
    reminder_title = UnicodeAttribute()
    reminder_tags = UnicodeSetAttribute()
    reminder_description = UnicodeAttribute()
    reminder_frequency = UnicodeAttribute(default='Only once')
    reminder_tasks = UnicodeSetAttribute(default=set())
    reminder_expiration_date_time = UTCDateTimeAttribute(null=True)
    reminder_title_reminder_id = UnicodeAttribute()
    next_reminder_date_time = UTCDateTimeAttribute()
    should_expire = BooleanAttribute()
When I want to create a new reminder I do it through the below code:
class DynamoBackend:
    @staticmethod
    def create_a_new_reminder(new_reminder: NewReminder) -> Dict[str, Any]:
        """Create a new reminder using pynamodb."""
        new_reminder = models.Reminders(**new_reminder.dict())
        return new_reminder.save()
In this case NewReminder is an instance of a pydantic base model, like so:
class NewReminder(pydantic.BaseModel):
    reminder_id: str
    user_id: str
    reminder_title: str
    reminder_description: str
    reminder_tags: Sequence[str]
    reminder_frequency: str
    should_expire: bool
    reminder_expiration_date_time: Optional[datetime.datetime]
    next_reminder_date_time: datetime.datetime
    reminder_title_reminder_id: str
When I call the save method on the model object I receive the below response:
{
    "ConsumedCapacity": {
        "CapacityUnits": 2.0,
        "TableName": "Reminders"
    }
}
Now my question: the save method is called directly by a Lambda function, which is in turn invoked by an API Gateway POST endpoint. Ideally the response should be a 201 Created, and instead of returning the consumed capacity and table name it would be great to return the item inserted into the database. Below is my route code:
def create_a_new_reminder():
    """Creates a new reminder in the database."""
    request_context = app.current_request.context
    request_body = json.loads(app.current_request.raw_body.decode())
    request_body["reminder_frequency"] = data_structures.ReminderFrequency[request_body["reminder_frequency"]]
    reminder_details = data_structures.ReminderDetailsFromRequest.parse_obj(request_body)
    user_details = data_structures.UserDetails(
        user_name=request_context["authorizer"]["claims"]["cognito:username"],
        user_email=request_context["authorizer"]["claims"]["email"]
    )
    reminder_id = str(uuid.uuid1())
    new_reminder = data_structures.NewReminder(
        reminder_id=reminder_id,
        user_id=user_details.user_name,
        reminder_title=reminder_details.reminder_title,
        reminder_description=reminder_details.reminder_description,
        reminder_tags=reminder_details.reminder_tags,
        reminder_frequency=reminder_details.reminder_frequency.value[0],
        should_expire=reminder_details.should_expire,
        reminder_expiration_date_time=reminder_details.reminder_expiration_date_time,
        next_reminder_date_time=reminder_details.next_reminder_date_time,
        reminder_title_reminder_id=f"{reminder_details.reminder_title}-{reminder_id}"
    )
    return DynamoBackend.create_a_new_reminder(new_reminder=new_reminder)
I am very new to REST API creation and best practices, so it would be great if someone could guide me here. Thanks in advance!
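This thread carries no answer here, but as a hedged sketch of one common approach (my addition, not from the original post; NewReminder and models are the question's own names, and the route appears to be Chalice, given app.current_request): save() only returns write metadata, so return the model instance's own data and let the route wrap it in an explicit 201.
from typing import Any, Dict
from chalice import Response

class DynamoBackend:
    @staticmethod
    def create_a_new_reminder(new_reminder: NewReminder) -> Dict[str, Any]:
        """Create a new reminder and return the stored item, not write metadata."""
        reminder = models.Reminders(**new_reminder.dict())
        reminder.save()
        # attribute_values is PynamoDB's dict of attribute names -> python values
        # for this instance; datetime attributes may still need custom JSON
        # serialization before being returned over HTTP
        return reminder.attribute_values

# In the route, wrap the created item in a 201 Created:
# item = DynamoBackend.create_a_new_reminder(new_reminder=new_reminder)
# return Response(body=item, status_code=201)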

How to create a custom transformer without any input column?

We have a requirement where we want to generate scores for our model with random values between 0 and 1.
To do that, we want a custom transformer that generates random numbers without any input fields.
So, can we build a transformer without input fields in MLeap?
Usually we create one as below:
import ml.combust.mleap.core.Model
import ml.combust.mleap.core.types._

case class RandomNumberModel() extends Model {
  private val rnd = scala.util.Random
  def apply(): Double = rnd.nextFloat
  override def inputSchema: StructType = StructType("input" -> ScalarType.String).get
  override def outputSchema: StructType = StructType("output" -> ScalarType.Double).get
}
How can I make it so that no input schema is necessary?
I have never tried that, but here is how I managed to build a custom transformer with multiple input fields ...
package org.apache.spark.ml.feature.mleap

import ml.combust.mleap.core.Model
import ml.combust.mleap.core.types._
import org.apache.spark.ml.linalg._

case class PropertyGroupAggregatorBaseModel(props: Array[String],
                                            aggFunc: String) extends Model {
  val outputSize = props.size

  // having multiple inputs, apply takes a Seq[Any] parameter
  def apply(features: Seq[Any]): Vector = {
    val properties = features(0).asInstanceOf[Seq[String]]
    val values = features(1).asInstanceOf[Seq[Double]]
    val mapping = properties.zip(values)
    val histogram = props.foldLeft(Array.empty[Double]) {
      (acc, property) =>
        val newValues = mapping.filter(x => x._1 == property).map(x => x._2)
        val newAggregate = aggFunc match {
          case "sum" => newValues.sum.toDouble
          case "count" => newValues.size.toDouble
          case "avg" => (newValues.sum / Math.max(newValues.size, 1)).toDouble
        }
        acc :+ newAggregate
    }
    Vectors.dense(histogram)
  }

  override def inputSchema: StructType = {
    // here you define the input
    val inputFields = Seq(
      StructField("input1" -> ListType(BasicType.String)),
      StructField("input2" -> ListType(BasicType.Double))
    )
    StructType(inputFields).get
  }

  override def outputSchema: StructType = StructType(StructField("output" -> TensorType.Double(outputSize))).get
}
My suggestion would be that your apply might already work for you. I guess if you define inputSchema as follows, it might work:
override def inputSchema: StructType = {
  // here you define the input
  val inputFields = Seq.empty[StructField]
  StructType(inputFields).get
}
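Putting the two pieces together, a minimal sketch of the no-input model under the same untested assumption (the empty-schema trick above) might look like:
import ml.combust.mleap.core.Model
import ml.combust.mleap.core.types._

case class RandomNumberModel() extends Model {
  private val rnd = scala.util.Random

  // no input columns are consumed, so apply takes no arguments
  def apply(): Double = rnd.nextDouble()

  // an empty field list declares that the transformer reads nothing
  override def inputSchema: StructType = StructType(Seq.empty[StructField]).get
  override def outputSchema: StructType = StructType("output" -> ScalarType.Double).get
}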

How to get transaction history without certain state

I am trying to get transaction history on Corda.
I need to get the amount of the transactions for a certain period.
My API for this:
@GET
@Path("transactions")
@Produces(MediaType.APPLICATION_JSON)
fun getTransactions(): List<StateAndRef<ContractState>> {
    val TODAY = Instant.now()
    val pagingSpec = PageSpecification(DEFAULT_PAGE_NUM, 100)
    val start = TODAY.minus(1, ChronoUnit.HOURS)
    val end = TODAY.plus(1, ChronoUnit.HOURS)
    val recordedBetweenExpression = QueryCriteria.TimeCondition(
        QueryCriteria.TimeInstantType.RECORDED,
        ColumnPredicate.Between(start, end))
    val criteria = QueryCriteria.VaultQueryCriteria(timeCondition = recordedBetweenExpression, status = Vault.StateStatus.ALL)
    val results = rpcOps.vaultQueryBy<ContractState>(criteria, paging = pagingSpec)
    val size = results.states.count()
    return rpcOps.vaultQueryBy<ContractState>().states
}
where:
val rpcOps: CordaRPCOps
I can explicitly specify the states for which to receive transactions, like:
val criteria = VaultQueryCriteria(contractStateTypes = setOf(Cash.State::class.java, DealState::class.java))
but I need to get transactions across all states except certain ones.
Does Corda have any mechanism for this?
There is no type of query criteria that specifically excludes certain states. However, you can define a query criteria that specifically includes certain states, then combine that with your existing criteria using an AND composition:
val TODAY = Instant.now()
val pagingSpec = PageSpecification(DEFAULT_PAGE_NUM, 100)
val start = TODAY.minus(1, ChronoUnit.HOURS)
val end = TODAY.plus(1, ChronoUnit.HOURS)
val recordedBetweenExpression = QueryCriteria.TimeCondition(
    QueryCriteria.TimeInstantType.RECORDED,
    ColumnPredicate.Between(start, end))
val timeCriteria = QueryCriteria.VaultQueryCriteria(timeCondition = recordedBetweenExpression, status = Vault.StateStatus.ALL)
val typeCriteria = QueryCriteria.VaultQueryCriteria(contractStateTypes = setOf(State1::class.java, State2::class.java), status = Vault.StateStatus.ALL)
val combinedCriteria = timeCriteria.and(typeCriteria)
val results = rpcOps.vaultQueryBy<ContractState>(combinedCriteria, paging = pagingSpec)
This will retrieve all the states that meet both your time criteria and your type criteria.

How to get the class reference from KParameter in kotlin?

The code below is about reflection.
It tries to do two things:
case1() creates an instance from the SimpleStudent class; it works.
case2() creates an instance from the Student class; it does not work.
The reason case2() does not work, and my question, is that inside generateValue():
I don't know how to check whether a parameter is a Kotlin type or one of my own types (I have a dirty way, checking that param.type.toString() does not contain "kotlin", but I wonder if there is a better solution).
I don't know how to get its class reference when it's a custom class. The problem is that even though param.type.toString() == "Lesson", when I try to get param.type::class, it is class kotlin.reflect.jvm.internal.KTypeImpl.
So, how do I solve this? Thanks.
==============
import kotlin.reflect.KParameter
import kotlin.reflect.full.primaryConstructor
import kotlin.test.assertEquals

data class Lesson(val title: String, val length: Int)
data class Student(val name: String, val major: Lesson)
data class SimpleStudent(val name: String, val age: Int)

fun generateValue(param: KParameter, originalValue: Map<*, *>): Any? {
    var value = originalValue[param.name]
    // if (param.type is not a Kotlin type) {
    //     // Get its ::class so that we can create an instance of it, here the Lesson class
    // }
    return value
}

fun case1() {
    val classDesc = SimpleStudent::class
    val constructor = classDesc.primaryConstructor!!
    val value = mapOf<Any, Any>(
        "name" to "Tom",
        "age" to 16
    )
    val params = constructor.parameters.associateBy(
        { it },
        { generateValue(it, value) }
    )
    val result: SimpleStudent = constructor.callBy(params)
    assertEquals("Tom", result.name)
    assertEquals(16, result.age)
}

fun case2() {
    val classDesc = Student::class
    val constructor = classDesc.primaryConstructor!!
    val value = mapOf<Any, Any>(
        "name" to "Tom",
        "major" to mapOf<Any, Any>(
            "title" to "CS",
            "length" to 16
        )
    )
    val params = constructor.parameters.associateBy(
        { it },
        { generateValue(it, value) }
    )
    val result: Student = constructor.callBy(params)
    assertEquals("Tom", result.name)
    assertEquals(Lesson::class, result.major::class)
    assertEquals("CS", result.major.title)
}

fun main(args: Array<String>) {
    case1()
    case2()
}
Problem solved:
You can get that ::class by using param.type.classifier as KClass<T>, where param is a KParameter.
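A sketch of how that classifier can be used inside generateValue() (my own completion of the asker's commented-out branch, assuming every custom-class argument arrives as a nested Map, as in case2()):
import kotlin.reflect.KClass
import kotlin.reflect.KParameter
import kotlin.reflect.full.primaryConstructor

fun generateValue(param: KParameter, originalValue: Map<*, *>): Any? {
    val value = originalValue[param.name]
    val classifier = param.type.classifier
    // For a regular class, classifier is its KClass; recurse when the raw
    // value is still a nested Map that has to become an instance (case2's Lesson).
    if (classifier is KClass<*> && value is Map<*, *>) {
        val constructor = classifier.primaryConstructor!!
        val args = constructor.parameters.associateWith { generateValue(it, value) }
        return constructor.callBy(args)
    }
    return value
}
Since built-in values like String or Int never arrive as a Map, the value is Map<*, *> check also doubles as the "is this my own type?" test the asker wanted.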

iOS - Swift 3 - Nested sorting and union with flatMap, map, filter or formUnion

class Flight {
    var name: String?
    var vocabulary: Vocabulary?
}
class Vocabulary {
    var seatMapPlan: [Plan] = []
    var foodPlan: [Plan] = []
}
class Plan {
    var planName: String?
    var planId: String?
}
var flightList:[Flight] = []
var plan1 = Plan()
plan1.planId = "planId1"
plan1.planName = "Planname1"
var plan2 = Plan()
plan2.planId = "planId2"
plan2.planName = "Planname2"
var plan3 = Plan()
plan3.planId = "planId3"
plan3.planName = "Planname3"
var plan4 = Plan()
plan4.planId = "planId4"
plan4.planName = "Planname4"
var plan5 = Plan()
plan5.planId = "planId5"
plan5.planName = "Planname5"
var plan6 = Plan()
plan6.planId = "planId6"
plan6.planName = "Planname6"
var flight1 = Flight()
flight1.name = "Flight1"
flight1.vocabulary = Vocabulary()
flight1.vocabulary?.seatMapPlan = [plan1, plan2]
flight1.vocabulary?.foodPlan = [plan3, plan4, plan5]
var flight2 = Flight()
flight2.name = "Flight2"
flight2.vocabulary = Vocabulary()
flight2.vocabulary?.seatMapPlan = [plan2, plan3]
flight2.vocabulary?.foodPlan = [plan3, plan4, plan5]
flightList=[flight1, flight2]
Problem 1:
I want to use flatMap, filter, a custom unique func, or Set.formUnion to achieve a union of the seatMapPlans. For this particular example it is
seatMapUnion = [plan1, plan2, plan3]
Because of the nesting, and despite the help of previously answered questions, I am unable to achieve this.
Please give me a combination of filter, flatMap and map that resolves this particular problem.
Problem 2:
I have vice-versa scenarios too, where I have to filter the flightList array on the basis of the selected plan(s) (plan1, or multiple). I want to do this with filter and map, but the nesting makes it difficult to achieve.
e.g. 1:
If the search parameter is plan1 for seatMapPlan, then the result is flight1.
e.g. 2:
And if the search parameter is plan2 for seatMapPlan, then the result is flight1, flight2.
For the first problem I would use sets. So first make Plan implement Hashable:
class Plan: Hashable {
    var planName: String?
    var planId: String?
    // hash on planId so that the hash is consistent with ==
    public var hashValue: Int { return planId?.hashValue ?? 0 }
    public static func ==(lhs: Plan, rhs: Plan) -> Bool { return lhs.planId == rhs.planId }
}
Then it's straightforward:
let set1 = Set<Plan>(flight1.vocabulary!.seatMapPlan)
let set2 = Set<Plan>(flight2.vocabulary!.seatMapPlan)
let union = set1.union(set2)
print(union.map { $0.planName! })
It'll print (in some order, since sets are unordered):
["Planname2", "Planname1", "Planname3"]
Not sure I understand your second problem.
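If the second problem means filtering flightList down to the flights whose seatMapPlan contains the selected plan(s), here is a hedged sketch (my reading of the question, reusing the Hashable Plan above; the flights(matching:in:) helper is hypothetical):
// Union of seatMapPlan across every flight, via flatMap into a Set
let seatMapUnion = Set(flightList.flatMap { $0.vocabulary?.seatMapPlan ?? [] })

// Hypothetical helper: keep the flights whose seatMapPlan contains
// every selected plan (plan equality is by planId, as defined above)
func flights(matching selection: [Plan], in flights: [Flight]) -> [Flight] {
    let wanted = Set(selection)
    return flights.filter { flight in
        let available = Set(flight.vocabulary?.seatMapPlan ?? [])
        return wanted.isSubset(of: available)
    }
}

flights(matching: [plan1], in: flightList).map { $0.name! }  // ["Flight1"]
flights(matching: [plan2], in: flightList).map { $0.name! }  // ["Flight1", "Flight2"]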
