In Flink, is it possible to have a DataStream<Tuple> where Tuple is the base class of all known tuples like Tuple2, Tuple3, etc.?

I am creating a Flink application that reads strings from a Kafka topic; for example, "2 5 9" is a value. I then split the string on the " " delimiter and map it to a tuple. In this case the result of the map function would be a DataStream<Tuple3<Integer,Integer,Integer>>, which is simple. The problem is that I want my app to be parameterised, meaning that sometimes the data it reads will have 3 dimensions (like "2 5 9") and at other times maybe 2 dimensions, in which case I would need Tuple2.
I thought that I could use the Tuple base class as shown here, but it didn't work:
DataStream<String> strData = env.fromSource(...);
DataStream<Tuple> tupleData = strData.map(new MapFunction<String, Tuple>() {
@Override
public Tuple map(String s) throws Exception {
String[] tokens = s.split(" ");
int numOfDimensions = tokens.length;
Tuple tuple = Tuple.newInstance(numOfDimensions);
for(int i=0; i<numOfDimensions; i++){
tuple.setField(Integer.valueOf(tokens[i]), i);
}
return tuple;
}
});
I'm getting this error:
InvalidTypesException: Usage of class Tuple as a type is not allowed. Use a concrete subclass (e.g. Tuple1, Tuple2, etc.) instead.
So this solution doesn't seem to work. Is there any alternative for this purpose or maybe I am missing something here?
Thanks
Edit due to Bartosz Mikulski's comment:
1. The input data are always integers.
2. I am planning to run one job at a time, and each job receives input of a fixed length. For example, today I want to run a job with an input length of 2, so the parameter that defines the input length is 2.
Tomorrow maybe I want to run a job with input length=3 so I will run the job with input length parameter=3.

Thank you for your clarification. In this case, you need to have a variable-length data structure that always holds Integers. The simplest solution would be to use List<Integer> instead of Tuple. So your map function can look like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStream<String> s1 = env.fromElements("1 2 3", "4 5 6", "8 9 10 11 12");
DataStream<List<Integer>> s2 = s1.map(text -> Arrays.stream(text.split(" "))
        .map(Integer::parseInt)
        .collect(Collectors.toList()),
    new TypeHint<List<Integer>>() {}.getTypeInfo()
);
s2.print();
env.execute("Flink Java API Skeleton");
General information on handling stream elements
How to handle homogeneous data (variable or fixed length collection of the same type) - your case
In this case, use any Java collection that can be serialized, for example List (both ArrayList and LinkedList are serializable).
Reasons for using List<Integer> instead of Tuple:
Lists are used for storing data of the same type. In your case Integer.
Lists can be adapted to any input length
You can use .size() method to check the length
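As a rough illustration of the points above, here is a minimal plain-Java sketch (no Flink dependency) of the parsing step with the job's dimension parameter checked via the list length. The class name DimensionParser and the method parse are hypothetical names chosen for this example, not part of any Flink API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch; DimensionParser and parse are hypothetical names.
class DimensionParser {

    // Splits a space-delimited line such as "2 5 9" into a List<Integer>
    // and checks it against the dimension count the job was started with.
    static List<Integer> parse(String line, int expectedDimensions) {
        String[] tokens = line.trim().split(" ");
        if (tokens.length != expectedDimensions) {
            throw new IllegalArgumentException(
                "expected " + expectedDimensions + " values but got " + tokens.length);
        }
        List<Integer> values = new ArrayList<>(tokens.length);
        for (String token : tokens) {
            values.add(Integer.valueOf(token));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("2 5 9", 3)); // [2, 5, 9]
        System.out.println(parse("4 7", 2));   // [4, 7]
    }
}
```

The same method body could be used inside a Flink map function; the element type stays List<Integer> regardless of the configured dimension.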
How to handle fixed-length heterogeneous data (can store different types)
You can use tuples, but a custom POJO (Plain Old Java Object) would probably be better. Tuples are fixed-length storage for heterogeneous types. This means that Tuple2<Integer, Integer> is a different type than Tuple3<Integer, Integer, Integer> or even Tuple2<Integer, Long> (of course, on the JVM we have type erasure, so at runtime both Tuple2 variants are of "the same" type).
Flink disallows the usage of bare Tuple as a stream element because it has no semantic meaning. What operator could be applied to a DataStream<Tuple>? Maybe one that accepts Tuple, but there can be 25 tuples of different lengths, and each tuple field can have a different type.
So why do we have tuples in Flink? Just for convenience. If you need to return, for example, 2 or 3 elements in an ad hoc manner from a function, use a tuple. You don't need to create your own class to handle that case. However, if you model some data type (for example a student record), then I would just use a POJO instead (so implement a class Student for this scenario).
The best approach I find with Flink is to model data as classes at the highest level, as it is more readable: DataStream<StudentGrade> rather than DataStream<Tuple3<String, String, Float>>, where StudentGrade stores the name, subject, and grade itself.
How to handle variable-length heterogeneous data of known subtypes
Let's say that you know that you can allow different types of events in your application (different subtypes). For example your application reads sensor data from the stream and you accept events like SensorValue(sensorId, value) and SensorMean(sensorId, n). SensorValue is information about readout from the sensor and SensorMean is a command that will calculate the mean of the last n sensor readings and output it to the sink. In that case we know that we have only two possible messages and let's say they are modeled as: value 123 0.34 for SensorValue and mean 123 5 for SensorMean.
For this scenario the best approach would be to use an algebraic data type, as available in Scala. In Java we can model it with an interface and inheritance (less powerful, but it will work), so we would have interface SensorEvent, class SensorValue implements SensorEvent, and class SensorMean implements SensorEvent. Then we can have DataStream<SensorEvent>, and Flink will not complain as it did in the tuple case.
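A minimal plain-Java sketch of that interface-plus-subclasses model might look like the following. The field names, the SensorEventParser helper, and the error handling are assumptions made for illustration; only the names SensorEvent, SensorValue, and SensorMean and the two message formats come from the text above.

```java
// Hypothetical helper that turns the two message formats into events.
class SensorEventParser {

    // "value 123 0.34" -> SensorValue, "mean 123 5" -> SensorMean
    static SensorEvent parse(String line) {
        String[] tokens = line.trim().split(" ");
        if (tokens.length != 3) {
            throw new IllegalArgumentException("malformed event: " + line);
        }
        int sensorId = Integer.parseInt(tokens[1]);
        switch (tokens[0]) {
            case "value":
                return new SensorValue(sensorId, Double.parseDouble(tokens[2]));
            case "mean":
                return new SensorMean(sensorId, Integer.parseInt(tokens[2]));
            default:
                throw new IllegalArgumentException("unknown event kind: " + tokens[0]);
        }
    }

    public static void main(String[] args) {
        SensorEvent first = parse("value 123 0.34");
        SensorEvent second = parse("mean 123 5");
        System.out.println(first instanceof SensorValue); // true
        System.out.println(second instanceof SensorMean); // true
    }
}

// Common supertype, usable as the stream element type.
interface SensorEvent {
    int sensorId();
}

// A single readout from a sensor.
final class SensorValue implements SensorEvent {
    final int sensorId;
    final double value;

    SensorValue(int sensorId, double value) {
        this.sensorId = sensorId;
        this.value = value;
    }

    public int sensorId() { return sensorId; }
}

// A command asking for the mean of the last n readings.
final class SensorMean implements SensorEvent {
    final int sensorId;
    final int n;

    SensorMean(int sensorId, int n) {
        this.sensorId = sensorId;
        this.n = n;
    }

    public int sensorId() { return sensorId; }
}
```

Downstream code can then branch on the concrete subtype with instanceof, which is the Java stand-in for pattern matching on an algebraic data type.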
How to handle variable-length heterogeneous data of unknown subtypes
In this case, the easiest solution would be to represent that element as some kind of dynamic object. For example Jackson ObjectNode or Json4s JValue. Try not to use List<Object> or Map<String, Object> as this will be harder to manage down the line.

Related

What's the purpose of Just in Elm?

So, I have been doing the Elm track on Exercism.org and I just finished the exercise about the Maybe concept, but one thing is not clear to me yet. What is the purpose of the Just in the definition of Maybe?
type Maybe a = Nothing | Just a
For example, what's the difference between Int and Just Int, and why is an integer not considered a Just Int if I don't add the word Just before it?
More concretely, when I was trying to solve the RPG problem, my first attempt resulted in something like this:
type alias Player =
    { name : Maybe String
    , level : Int
    , health : Int
    , mana : Maybe Int
    }
revive : Player -> Maybe Player
revive player =
    case player.health of
        0 ->
            if player.level >= 10 then
                Player player.name player.level 100 100
            else
                Player player.name player.level 100 Nothing
        _ ->
            Nothing
Only to find out that my mistake was in the if expression, which should return Just Player, i.e.:
            if player.level >= 10 then
                Just (Player player.name player.level 100 (Just 100))
            else
                Just (Player player.name player.level 100 Nothing)
If you're coming from a background of dynamic typing like Python then it's easy to see it as pointless. In Python, if you have an argument and you want it to be either an integer or empty, you pass either an integer or None. And everyone just understands that None is the absence of an integer.
Even if you're coming from a poorly-done statically typed language, you may still see it as odd. In Java, every reference datatype is nullable, so String is really "eh, there may or may not be a String here" and MyCustomClass is really "eh, there may or may not really be an instance here". Everything can be null, which results in everyone constantly checking whether things are null at every turn.
There are, broadly speaking, two solutions to this problem: nullable types and optional types. In a language like Kotlin with nullable types, Int is the type of integers. Int can only contain integers. Not null, not a string, not anything else. However, if you want to allow null, you use the type Int?. The type Int? is either an integer or a null value, and you can't do anything integer-like with it (such as add it to another integer) unless you check for null first. This is the most straightforward solution to the null problem, for people coming from a language like Java. In that analogy, Int really is a subtype of Int?, so every integer is an instance of Int?. 3 is an instance of both Int and Int?, and it means both "this is an integer" and also "this is an integer which is optional but exists".
That approach works fine in languages with subtyping. If your language is built up from a typical OOP hierarchy, it's easy to say "well, T is clearly a subtype of T?" and move on. But Elm isn't built that way. There are no subtyping relationships in Elm (there's unification, which is a different thing). Elm is based on Haskell, which is built up from the Hindley-Milner model. In this model, every value has a unique type.
Whereas in Kotlin, 3 is an instance of Int, and also Int?, and also Number, and also Number?, and so on all the way up to Any? (the top type in Kotlin), there is no equivalent in Elm. There is no "top type" that everything inherits from, and there is no subtyping. So it's not meaningful to say that 3 is an instance of multiple types. In Elm, 3 is an instance of Int. End of story. That's it. If a function takes an argument of type Int, it must be an integer. And since 3 can't be an instance of some other type, we need another way to represent "an integer that may or may not be there".
type Maybe a = Nothing | Just a
Enter optional typing. 3 can't be an optional integer, since it's an Int and nothing else. But Just 3, on the other hand... Just 3 is an entirely different value and its type is Maybe Int. A Just 3 is only valid in situations where an optional integer is expected, since it's not an Int. Maybe a is what's called an optional type; it's a completely separate type which represents the type a, but optional. It serves the same purpose as T? in a language like Kotlin, but it's built up from different foundations.
Getting into which one is better would derail this post, and I don't think that's important here. I have my opinions, but others have theirs as well. Optional typing and nullable typing are two different approaches to dealing with values that may or may not exist. Elm (and Haskell-like languages) use one, and other languages might use the other. A well-rounded programmer should become comfortable with both.
Why an integer is not considered a Just Int if I don't add the Just word before?
Simply because without the constructor (Just), it's only an integer and not something else. There's no automatic type conversion; you have to be explicit about what you want. Would you also consider allowing 100 to be written when you meant the single-element list [100]? Soon, you would have no idea what it meant if someone wrote 100.
This is not specific to Maybe and its Just variant; this is the rule for all data types. There is no exception for Maybes, even if the language is confusing: an Int is just an Int, but not a Just Int.
Just in Elm is a tag, but in this context you can think of it like a function that takes a value of type Int and returns something of the type Maybe Int.
type Maybe a = Nothing | Just a
---
Just 123 -- is a `Maybe Int`
This means Maybe is a type with an associated generic type a, similar to C++'s T:
template <class T>
class Maybe;
using MaybeInt = Maybe<int>;
Both Nothing and Just a are constructors (which you can think of as functions) that make a Maybe. In Python it might look like:
def Nothing() -> Maybe:
    return Maybe()  # except in Elm, it knows the returned Maybe came from a
                    # Nothing, so there's some machinery missing here
def Just(some_val) -> Maybe:
    return Maybe(some_val)
So if a function returns a Maybe, the returned value has to be passed through one of the two Nothing or Just constructors.
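For readers more at home in Java, java.util.Optional plays a loosely similar role, and the same "you must wrap explicitly" rule applies: an int is never silently an Optional<Integer>. The sketch below reworks the revive example under that analogy. The ReviveDemo and Player classes are hypothetical, and the analogy is imperfect, since a Java reference (including an Optional itself) can still be null.

```java
import java.util.Optional;

// Hypothetical Java analogue of the Elm revive example.
class ReviveDemo {

    static final class Player {
        final Optional<String> name;
        final int level;
        final int health;
        final Optional<Integer> mana;

        Player(Optional<String> name, int level, int health, Optional<Integer> mana) {
            this.name = name;
            this.level = level;
            this.health = health;
            this.mana = mana;
        }
    }

    static Optional<Player> revive(Player player) {
        if (player.health != 0) {
            return Optional.empty();                  // plays the role of Nothing
        }
        // Optional.of(100) plays the role of Just 100; a bare 100 would not compile here.
        Optional<Integer> mana = player.level >= 10 ? Optional.of(100) : Optional.empty();
        // The result must also be wrapped explicitly, like Just (Player ...).
        return Optional.of(new Player(player.name, player.level, 100, mana));
    }

    public static void main(String[] args) {
        Player fallen = new Player(Optional.of("Ann"), 12, 0, Optional.empty());
        Player alive = new Player(Optional.empty(), 3, 40, Optional.empty());
        System.out.println(revive(fallen).isPresent()); // true
        System.out.println(revive(alive).isPresent());  // false
    }
}
```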

Does Kotlin have pointers?

Does Kotlin have pointers?
If yes,
How to increment a Pointer?
How to decrement a Pointer?
How to do Pointer Comparisons?
It has references, and it doesn't support pointer arithmetic (so you can't increment or decrement).
Note that the only thing that "having pointers" allows you is the ability to create a pointer and to dereference it.
The closest thing to a "pointer comparison" is referential equality, which is performed with the === operator.
There are no pointers in Kotlin for low-level processing as in C.
However, it's possible to emulate pointers in high-level programming.
For low-level programming, it is necessary to use special system APIs to simulate arrays in memory, which exist in Windows, Linux, etc. Read about memory-mapped files here and here. Java has a library to read and write directly in memory.
Simple types (numeric, string and boolean) behave as values; other types are references (high-level pointers) in Kotlin, which one can compare, assign, etc.
If one needs to increment or decrement pointers, just encapsulate the desired data in an array.
To simulate pointers to simple values, just wrap the value in a class:
data class pStr( // Pointer to a String
    var s: String = ""
)
fun main() {
    var st = pStr("banana")
    var tt = st          // tt and st refer to the same object
    tt.s = "melon"
    println(st.s)        // displays "melon"
    var s: String = "banana"
    var t: String = s
    t = "melon"          // rebinding t does not affect s
    println(s)           // displays "banana"
}
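The array suggestion above can be sketched as an index that moves over a backing array. The example is written in Java (the same shape carries over to Kotlin on the JVM), and the ArrayCursor class with its method names is hypothetical, invented only to illustrate the idea.

```java
// Hypothetical "pointer into an array": a backing array plus a movable index.
class ArrayCursor {
    private final int[] data;
    private int index;

    ArrayCursor(int[] data, int index) {
        this.data = data;
        this.index = index;
    }

    int get() { return data[index]; }             // like *p
    void set(int value) { data[index] = value; }  // like *p = value
    void increment() { index++; }                 // like p++
    void decrement() { index--; }                 // like p--

    // "Pointer comparison": same backing array (reference equality) and same position.
    boolean pointsToSameSlot(ArrayCursor other) {
        return data == other.data && index == other.index;
    }

    public static void main(String[] args) {
        int[] numbers = {10, 20, 30};
        ArrayCursor p = new ArrayCursor(numbers, 0);
        ArrayCursor q = new ArrayCursor(numbers, 1);
        p.increment();
        System.out.println(p.get());               // 20
        System.out.println(p.pointsToSameSlot(q)); // true
        p.set(99);
        System.out.println(numbers[1]);            // 99
    }
}
```

Unlike a C pointer, stepping past either end of the array fails with an ArrayIndexOutOfBoundsException on the next get or set rather than reading arbitrary memory.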
I found this question while googling over some interesting code I found and thought that I would contribute my own proverbial "two cents". So Kotlin does have an operator which might be confused as a pointer, based on syntax, the spread operator. The spread operator is often used to pass an array as a vararg parameter.
For example, one might see something like the following line of code which looks suspiciously like the use of a pointer:
val process = ProcessBuilder(*args.toTypedArray()).start()
This line isn't calling the toTypedArray() method on a pointer to the args array, as you might expect if you come from a C/C++ background like me. Rather, this code is actually just calling the toTypedArray() method on the args array (as one would expect) and then passing the elements of the array as an arbitrary number of varargs arguments. Without the spread operator (i.e. *), a single argument would be passed, which would be the typed args array, itself.
That's the key difference: the spread operator enables the developer to pass the elements of the array as a list of varargs as opposed to passing a pointer to the array, itself, as a single argument.
I hope that helps.

How to obtain a KType in Kotlin?

I'm experimenting with the reflection functionality in Kotlin, but I can't seem to understand how to obtain a KType value.
Suppose I have a class that maps phrases to object factories. In case of ambiguity, the user can supply a type parameter that narrows the search to only factories that return that type of object (or some sub-type).
fun mapToFactory(phrase: Phrase,
type: KType = Any::class): Any {...}
type needs to accept just about anything, including Int, which from my experience seems to be treated somewhat specially. By default, it should be something like Any, which means "do not exclude any factories".
How do I assign a default value (or any value) to type?
From your description, it sounds like your function should take a KClass parameter, not a KType, and check the incoming objects with isSubclassOf, not isSubtypeOf.
Types (represented by KType in kotlin-reflect) usually come from signatures of declarations in your code; they denote a broad set of values which functions take as parameters or return. A type consists of the class, generic arguments to that class, and nullability. The problem with types at runtime on JVM is that because of erasure, it's impossible to determine the exact type of a variable of a generic class. For example if you have a list, you cannot determine the generic type of that list at runtime, i.e. you cannot differentiate between List<String> and List<Throwable>.
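The erasure point can be observed directly on the JVM. The following small Java sketch (the ErasureDemo class and its helper method are hypothetical names) shows that two lists declared with different type arguments report the same runtime class, so the element type cannot be recovered from the object itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of type erasure for generic classes on the JVM.
class ErasureDemo {

    // Compares only what is available at runtime: the erased class.
    static boolean sameRuntimeClass(Object a, Object b) {
        return a.getClass() == b.getClass();
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Throwable> errors = new ArrayList<>();
        System.out.println(sameRuntimeClass(strings, errors)); // true
        System.out.println(strings.getClass().getName());      // java.util.ArrayList
    }
}
```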
To answer your initial question though, you can create a KType out of a KClass with createType():
val type: KType = Any::class.createType()
Note that if the class is generic, you need to pass type projections of generic arguments. In simple cases (all type variables can be replaced with star projections), starProjectedType will also work. For more info on createType and starProjectedType, see this answer.
Since Kotlin 1.3.40, you can use the experimental function typeOf<T>() to obtain the KType of any type:
val int: KType = typeOf<Int>()
In contrast to T::class.createType(), this supports nested generic arguments:
val listOfString: KType = typeOf<List<String>>()
The typeOf<T>() function is particularly useful when you want to obtain a KType from a reified type parameter:
inline fun <reified T> printType() {
    val type = typeOf<T>()
    println(type.toString())
}
Example usage:
fun main(args: Array<String>) {
    printType<Map<Int, String>>()
    // prints: kotlin.collections.Map<kotlin.Int, kotlin.String>
}
Since this feature is still in experimental status, you need to opt in with @UseExperimental(ExperimentalStdlibApi::class) around your function that uses typeOf<T>(). As the feature becomes more stable (possibly in Kotlin 1.4), this can be omitted. Also, at this time it is only available for Kotlin/JVM, not Kotlin/Native or Kotlin/JS.
See also:
Release announcement
API Doc (very sparse currently)

Difference between List, Tuple, Sequence, Sequential, Iterable, Array, etc. in Ceylon

Ceylon has several different concepts for things that might all be considered some kind of array: List, Tuple, Sequence, Sequential, Iterable, Array, Collection, Category, etc. What is different about these types and when should I use them?
The best place to start to learn about these things at a basic level is the Ceylon tour. And the place to learn about these things in depth is the module API. It can also be helpful to look at the source files for these.
Like all good modern programming languages, the first few interfaces are super abstract. They are built around one formal member and provide their functionality through a bunch of default and actual members. (In programming languages created before Java 8, you may have heard these called "traits" to distinguish them from traditional interfaces which have only formal members and no functionality.)
Category
Let's start by talking about the interface Category. It represents types of which you can ask "does this collection contain this object", but you may not necessarily be able to get any of the members out of the collection. Its formal member is:
shared formal Boolean contains(Element element)
An example might be the set of all the factors of a large number—you can efficiently test if any integer is a factor, but not efficiently get all the factors.
Iterable
A subtype of Category is the interface Iterable. It represents types from which you can get each element one at a time, but not necessarily index the elements. The elements may not have a well-defined order. The elements may not even exist yet but are generated on the fly. The collection may even be infinitely long! Its formal member is:
shared formal Iterator<Element> iterator()
An example would be a stream of characters like standard out. Another example would be a range of integers provided to a for loop, for which it is more memory efficient to generate the numbers one at a time.
This is a special type in Ceylon and can be abbreviated {Element*} or {Element+} depending on if the iterable might be empty or is definitely not empty, respectively.
Collection
One of Iterable's subtypes is the interface Collection. It has one formal member:
shared formal Collection<Element> clone()
But that doesn't really matter. The important thing that defines a Collection is this line in the documentation:
All Collections are required to support a well-defined notion of value
equality, but the definition of equality depends upon the kind of
collection.
Basically, a Collection is a collection whose structure is well-defined enough that instances can be compared for equality and cloned. This requirement for a well-defined structure means that this is the last of the super abstract interfaces, and the rest are going to look like more familiar collections.
List
One of Collection's subtypes is the interface List. It represents a collection whose elements we can get by index (myList[42]). Use this type when your function requires an array to get things out of, but doesn't care if it is mutable or immutable. It has a few formal methods, but the important one comes from its other supertype Correspondence:
shared formal Item? get(Integer key)
Sequential, Sequence, Empty
The most important of List's subtypes is the interface Sequential. It represents an immutable List. Ceylon loves this type and builds a lot of syntax around it. It is known as [Element*] and Element[]. It has exactly two subtypes:
Empty (aka []), which represents empty collections
Sequence (aka [Element+]), which represents nonempty collections.
Because the collections are immutable, there are lots of things you can do with them that you can't do with mutable collections. For one, numerous operations can fail with null on empty lists, like reduce and first, but if you first test that the type is Sequence then you can guarantee these operations will always succeed because the collection can't become empty later (they're immutable after all).
Tuple
A very special subtype of Sequence is Tuple, the first true class listed here. Unlike Sequence, where all the elements are constrained to one type Element, a Tuple has a type for each element. It gets special syntax in Ceylon, where [String, Integer, String] is an immutable list of exactly three elements with exactly those types in exactly that order.
Array
Another subtype of List is Array, also a true class. This is the familiar Java array, a mutable fixed-size list of elements.
drhagen has already answered the first part of your question very well, so I’m just going to say a bit on the second part: when do you use which type?
In general: when writing a function, make it accept the most general type that supports the operations you need. So far, so obvious.
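The same guideline applies outside Ceylon. As a rough Java illustration (the GeneralParameterDemo class and commaSeparated method are hypothetical names), a function that only iterates can accept Iterable rather than List, and then works for lists and sets alike:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo: accept the most general type that supports the needed operations.
class GeneralParameterDemo {

    // Only iteration is needed, so Iterable is enough; List would be needlessly narrow.
    static String commaSeparated(Iterable<String> items) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (String item : items) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(item);
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("a", "b", "c");
        Set<String> set = new LinkedHashSet<>(list); // insertion-ordered set
        System.out.println(commaSeparated(list)); // a, b, c
        System.out.println(commaSeparated(set));  // a, b, c
    }
}
```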
Category is very abstract and rarely useful.
Iterable should be used if you expect some stream of elements which you’re just going to iterate over (or use stream operations like filter, map, etc.).
Another thing to consider about Iterable is that it has some extra syntax sugar in named arguments:
void printAll({Anything*} things, String prefix = "") {
    for (thing in things) {
        print(prefix + (thing?.string else "<null>"));
    }
}
printAll { "a", "b", "c" };
printAll { prefix = "X"; "a", "b", "c" };
Any parameter of type Iterable can be supplied as a list of comma-separated arguments at the end of a named argument list. That is,
printAll { "a", "b", "c" };
is equivalent to
printAll { things = { "a", "b", "c" }; };
This allows you to craft DSL-style expressions; the tour has some nice examples.
Collection is, like Correspondence, fairly abstract and in my experience rarely used directly.
List sounds like it should be an often-used type, but actually I don’t recall using it a lot. I’m not sure why. I seem to skip over it and declare my parameters as either Iterable or Sequential.
Sequential and Sequence are for when you want an immutable, fixed-length list. They also have some syntax sugar: variadic methods like void foo(String* bar) are a shortcut for a Sequential or Sequence parameter. Sequential also allows you to use the nonempty operator, which often works out nicely in combination with first and rest:
String commaSeparated(String[] strings) {
    if (nonempty strings) {
        value sb = StringBuilder();
        sb.append(strings.first); // known to exist
        for (string in strings.rest) { // skip the first one
            sb.append(", ").append(string);
            // we don’t need a separate boolean to track if we need the comma or not :)
        }
        return sb.string;
    } else {
        return "";
    }
}
I usually use Sequential and Sequence when I’m going to iterate over a stream several times (which could be expensive for a generic Iterable), though List might be the better interface for that.
Tuple should never be used as Tuple (except in the rare case where you’re abstracting over them), but with the [X, Y, Z] syntax sugar it’s often useful. You can often refine a Sequential member to a Tuple in a subclass, e. g. the superclass has a <String|Integer>[] elements which in one subclass is known to be a [String, Integer] elements.
Array I’ve never used as a parameter type, only rarely as a class to instantiate and use.

What is the model of value vs. reference in Nim?

NOTE: I am not asking about difference between pointer and reference, and for this question it is completely irrelevant.
One thing I couldn't find explicitly stated -- what model does Nim use?
Like C++ -- where you have values and with new you create pointers to data (in such case the variable could hold pointer to a pointer to a pointer to... to data)?
Or like C# -- where you have POD types as values, but user-defined objects are referenced (implicitly)?
I spotted only dereferencing is automatic, like in Go.
Rephrase. You define your new type, let's say Student (with name, university, address). You write:
var student ...?
to make student hold actual data (of Student type/class)
to make student hold a pointer to the data
to make student hold a pointer to a pointer to the data
Or are some of those points impossible?
By default the model is of passing data by value. When you create a var of a specific type, the compiler will allocate on the stack the required space for the variable. Which is expected, as Nim compiles to C, and complex types are just structures. But like in C or C++, you can have pointers too. There is the ptr keyword to get an unsafe pointer, mostly for interfacing to C code, and there is a ref to get a garbage collected safe reference (both documented in the References and pointer types section of the Nim manual).
However, note that even when you specify a proc to pass a variable by value, the compiler is free to decide to pass it internally by reference if it considers it can speed execution and is safe at the same time. In practice the only time I've used references is when I was exporting Nim types to C and had to make sure both C and Nim pointed to the same memory. Remember that you can always check the generated C code in the nimcache directory. You will see then that a var parameter in a proc is just a pointer to its C structure.
Here is an example of a type with constructors to be created on the stack and passed in by value, and the corresponding pointer like version:
type
  Person = object
    age: int
    name: string
proc initPerson(age: int, name: string): Person =
  result.age = age
  result.name = name
proc newPerson(age: int, name: string): ref Person =
  new(result)
  result.age = age
  result.name = name
when isMainModule:
  var
    a = initPerson(3, "foo")
    b = newPerson(4, "bar")
  echo a.name & " " & $a.age
  echo b.name & " " & $b.age
As you can see the code is essentially the same, but there are some differences:
The typical way to differentiate initialisation is to use init for value types and new for reference types. Note, however, that Nim's own standard library does not always follow this convention, since some of the code predates it (e.g. newStringOfCap does not return a reference to a string type).
Depending on what your constructors actually do, the ref version allows you to return a nil value, which you can treat as an error, while the value constructor forces you to raise an exception or change the constructor to use the var form mentioned below so you can return a bool indicating success. Failure tends to be treated in different ways.
In C-like languages there is an explicit syntax to access either the memory value of a pointer or the memory value pointed to by it (dereferencing). In Nim there is as well: the empty subscript notation ([]). However, the compiler will attempt to insert those automatically to avoid cluttering the code. Hence, the example doesn't use them. To prove this you can change the code to read:
echo b[].name & " " & $b[].age
Which will work and compile as expected. But the following change will yield a compiler error because you can't dereference a non reference type:
echo a[].name & " " & $a[].age
The current trend in the Nim community is to get rid of single letter prefixes to differentiate value vs reference types. In the old convention you would have a TPerson and an alias for the reference value as PPerson = ref TPerson. You can find a lot of code still using this convention.
Depending on what exactly your object and constructor need to do, instead of having an initPerson returning the value you could also have an init(x: var Person, ...). But the use of the implicit result variable allows the compiler to optimise this, so it is mostly a matter of taste, or of needing to pass a bool back to the caller.
It can be either.
type Student = object ...
is roughly equivalent to
typedef struct { ... } Student;
in C, while
type Student = ref object ...
or
type Student = ptr object ...
is roughly equivalent to
typedef struct { ... } *Student;
in C (with ref denoting a reference that is traced by the garbage collector, while ptr is not traced).
