Spark: value reduceByKey is not a member - vector

After clustering some sparse vectors, I need to find the intersection vector in every cluster. To achieve this, I try to reduce MLlib vectors as in the following example:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
//For Sparse Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object Recommend {
  def main(args: Array[String]) {
    // set up environment
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    // Some vectors
    val vLen = 1800
    val sv11: Vector = Vectors.sparse(vLen,Seq( (100,1.0), (110,1.0), (120,1.0), (130, 1.0) ))
    val sv12: Vector = Vectors.sparse(vLen,Seq( (100,1.0), (110,1.0), (120,1.0), (130, 1.0), (140, 1.0) ))
    val sv13: Vector = Vectors.sparse(vLen,Seq( (100,1.0), (120,1.0), (130,1.0) ))
    val sv14: Vector = Vectors.sparse(vLen,Seq( (110,1.0), (130, 1.0) ))
    val sv15: Vector = Vectors.sparse(vLen,Seq( (140, 1.0) ))
    val sv21: Vector = Vectors.sparse(vLen,Seq( (200,1.0), (210,1.0), (220,1.0), (230, 1.0) ))
    val sv22: Vector = Vectors.sparse(vLen,Seq( (200,1.0), (210,1.0), (220,1.0), (230, 1.0), (240, 1.0) ))
    val sv23: Vector = Vectors.sparse(vLen,Seq( (200,1.0), (220,1.0), (230,1.0) ))
    val sv24: Vector = Vectors.sparse(vLen,Seq( (210,1.0), (230, 1.0) ))
    val sv25: Vector = Vectors.sparse(vLen,Seq( (240, 1.0) ))
    val sv31: Vector = Vectors.sparse(vLen,Seq( (300,1.0), (310,1.0), (320,1.0), (330, 1.0) ))
    val sv32: Vector = Vectors.sparse(vLen,Seq( (300,1.0), (310,1.0), (320,1.0), (330, 1.0), (340, 1.0) ))
    val sv33: Vector = Vectors.sparse(vLen,Seq( (300,1.0), (320,1.0), (330,1.0) ))
    val sv34: Vector = Vectors.sparse(vLen,Seq( (310,1.0), (330, 1.0) ))
    val sv35: Vector = Vectors.sparse(vLen,Seq( (340, 1.0) ))
    val sparseData = sc.parallelize(Seq(
      sv11, sv12, sv13, sv14, sv15,
      sv21, sv22, sv23, sv24, sv25,
      sv31, sv32, sv33, sv34, sv35
    ))
    // Cluster the data into three classes using KMeans
    val numClusters = 3
    val numIterations = 20
    test(numClusters, numIterations, sparseData)
  }
  def test(numClusters:Int, numIterations:Int,
           data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]) = {
    val clusters = KMeans.train(data, numClusters, numIterations)
    val predictions = data.map(v => (clusters.predict(v), v) )
    predictions.reduceByKey((v1, v2) => v1)
  }
}
The line predictions.reduceByKey((v1, v2) => v1) results in error:
value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, org.apache.spark.mllib.linalg.Vector)]
What is the reason for that?

Your code is missing, as you've already guessed, this import:
import org.apache.spark.SparkContext._
Why? Because it brings a few implicit conversions into scope, the most important one (for your case) being the pair-RDD conversion.
When you have an RDD of tuples, Spark treats the first element of each tuple as a key and therefore gives you access to convenient transformations and actions such as reduceByKey.
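For the stated goal (the intersection vector of each cluster), the function passed to reduceByKey would have to combine its two arguments rather than return v1. The following is a plain-Python sketch of that reduce-by-key logic, not Spark code. Sparse vectors are modelled as dicts from index to value, the cluster ids are made up, and `intersect` and `reduce_by_key` are hypothetical helper names.

```python
def intersect(v1, v2):
    """Keep only the indices that are present in both sparse vectors."""
    return {i: v1[i] for i in v1 if i in v2}


def reduce_by_key(pairs, func):
    """Fold all values that share a key into a single value."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# (cluster id, sparse vector) pairs; the ids are assumed for illustration.
predictions = [
    (0, {100: 1.0, 110: 1.0, 120: 1.0, 130: 1.0}),
    (0, {100: 1.0, 120: 1.0, 130: 1.0}),
    (1, {200: 1.0, 210: 1.0, 220: 1.0, 230: 1.0}),
    (1, {210: 1.0, 230: 1.0}),
]
common = reduce_by_key(predictions, intersect)
print(common)
```

Because intersection is associative and commutative, the order in which vectors of one cluster are combined does not matter, which is the property a reduce-by-key function needs.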

Related

Iterate over R object using SEXP in Rcpp

I'd like to subset an xts object in an Rcpp function and return the subset.
If the xts object has an index of class Date, extracting the index via Rcpp corrupts the xts object. See Dirk's answer to this question, where he demonstrates that getting a pointer to the Date indices from the xts (what I call the SEXP approach) doesn't lead to corruption.
Say that I have a pointer s to the SEXP in Rcpp. How do I iterate over the underlying object using that SEXP? Can it be done?
I'd like to iterate over the underlying object, and return a subset of that object.
The below R code does what I require:
set.seed(1)
require(xts)
xx_date <- xts(round(runif(100, min = 0, max = 20), 0),
               order.by = seq.Date(Sys.Date(), by = "day", length.out = 100))
subXts_r <- function(Xts) {
  i = 2
  while( as.numeric(Xts[i, ]) != as.numeric(Xts[i-1, ])) {
    if (i == nrow(Xts)) break else i = i+1
  }
  Xts[1:i,]
}
subXts_r(xx_date)
This Rcpp code also does what I want, but it uses a clone of the index (second line) to prevent corruption. My idea is to replace the second line with SEXP s = X.attr("index"), but I don't know how to iterate over s once I have it.
cppFunction("NumericVector subXts_cpp(NumericMatrix X) {
  DatetimeVector v = clone(NumericVector(X.attr(\"index\"))); // need to clone else xx_date is corrupted
  double * p_dt = v.begin() +1;
  double * p_value = X.begin() +1;
  // check the bound first (short-circuit &&) so the pointer is never read past the end
  while( (p_value < X.end()) && (*p_value != *(p_value -1))) {
    p_value++;
    p_dt++;
  }
  Rcpp::NumericVector toDoubleValue(X.begin(), p_value);
  Rcpp::NumericVector toDoubleDate(v.begin(), p_dt);
  int rows = toDoubleValue.size(); // find length of xts object
  toDoubleDate.attr(\"tzone\") = \"UTC\"; // the index has attributes
  CharacterVector t_class = CharacterVector::create(\"POSIXct\", \"POSIXt\");
  toDoubleDate.attr(\"tclass\") = t_class;
  // now modify dataVec to make into an xts
  toDoubleValue.attr(\"dim\") = IntegerVector::create(rows,1);
  toDoubleValue.attr(\"index\") = toDoubleDate;
  CharacterVector d_class = CharacterVector::create(\"xts\", \"zoo\");
  toDoubleValue.attr(\"class\") = d_class;
  toDoubleValue.attr(\".indexCLASS\") = t_class;
  toDoubleValue.attr(\"tclass\") = t_class;
  toDoubleValue.attr(\".indexTZ\") = \"UTC\";
  toDoubleValue.attr(\"tzone\") = \"UTC\";
  return toDoubleValue;}")

Optimize parameters of a function (e.g., capacitor-charging) curve to fit data

I'm a bit lost in my attempt to fit a function of the form y = a * (1 - exp(-x / b)) to some given data. I suspect the optimization package of Apache Commons Math might be of help, but I've not yet managed to use it successfully. Below is some code explaining what I'd like to achieve.
import kotlin.math.exp
import kotlin.random.Random
// Could be interpreted as a capacitor-charging curve with Vs = a and t = b
fun fGeneric(a: Double, b: Double, x: Double) = a * (1 - exp(-x / b))
fun fGiven(x: Double) = fGeneric(a = 10.0, b = 200.0, x = x)
fun fGivenWithNoise(x: Double) = fGiven(x) + Random.nextDouble(-0.1, 0.1)
fun main() {
    val xs = (0..1000).map(Int::toDouble).toDoubleArray()
    val ys = xs.map { x -> fGivenWithNoise(x) }.toDoubleArray()
    // todo: From data, find a and b, such that fGeneric fits optimally.
}
Do I need to provide an implementation of the MultivariateDifferentiableVectorFunction interface? And if so, what would it need to look like?
Found a solution by using lbfgs4j instead:
package com.jaumo.ml.lifetimevalue
import com.github.lbfgs4j.LbfgsMinimizer
import com.github.lbfgs4j.liblbfgs.Function
import kotlin.math.exp
import kotlin.random.Random
// Could be interpreted as a capacitor-charging curve with Vs = a and t = b
fun fGeneric(a: Double, b: Double, x: Double) = a * (1 - exp(-x / b))
fun fGiven(x: Double) = fGeneric(a = 10.0, b = 200.0, x = x)
fun fGivenWithNoise(x: Double) = fGiven(x) + Random.nextDouble(-0.1, 0.1)
// Element-wise difference a - b of two equally sized arrays
private fun subtractVectors(a: DoubleArray, b: DoubleArray): DoubleArray {
    assert(a.size == b.size)
    val result = DoubleArray(a.size)
    (a.indices).forEach { dim ->
        result[dim] = a[dim] - b[dim]
    }
    return result
}
fun main() {
    val xs = (0..1000).map(Int::toDouble).toDoubleArray()
    val ys = xs.map { x -> fGivenWithNoise(x) }.toDoubleArray()
    val f = object : Function {
        // Two parameters to optimize: x[0] = a (maxVal), x[1] = b (slowness)
        override fun getDimension(): Int {
            return 2
        }
        // Objective: sum of squared residuals between the model and the data
        override fun valueAt(x: DoubleArray): Double {
            val maxVal = x[0]
            val slowness = x[1]
            val capacitorFunc = { x0: Double ->
                maxVal * (1 - exp(-x0 / slowness))
            }
            return subtractVectors(xs.map(capacitorFunc).toDoubleArray(), ys)
                .map { it * it }
                .sum()
        }
        // Central differences with step h = 0.001. The differences are not divided
        // by 2h, so this is the gradient scaled by 0.002; its direction is unchanged.
        override fun gradientAt(x: DoubleArray): DoubleArray {
            val a = valueAt(doubleArrayOf(x[0] - 0.001, x[1]))
            val b = valueAt(doubleArrayOf(x[0] + 0.001, x[1]))
            val c = valueAt(doubleArrayOf(x[0], x[1] - 0.001))
            val d = valueAt(doubleArrayOf(x[0], x[1] + 0.001))
            return doubleArrayOf(b - a, d - c)
        }
    }
    val minimizer = LbfgsMinimizer()
    val x = minimizer.minimize(f, doubleArrayOf(1.0, 10.0))
    println(x[0])
    println(x[1])
}
The result looks good:
9.998170586347115
200.14238710377768
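The gradient in the answer above is approximated numerically. For this particular model the partial derivatives are simple to write down, so as an alternative illustration here is a small dependency-free Python sketch that fits a and b by Gauss-Newton least squares. This is a different technique from the L-BFGS call in the answer, not a port of it. The names `model` and `fit_curve`, the starting-point heuristic, and the fixed random seed are all assumptions made for the example.

```python
import math
import random


def model(a, b, x):
    """Charging curve y = a * (1 - exp(-x / b))."""
    return a * (1.0 - math.exp(-x / b))


def fit_curve(xs, ys, iterations=30):
    """Least-squares fit of (a, b) by Gauss-Newton with analytic derivatives."""
    # Rough starting point: a is the plateau, b is where y first reaches ~63% of it.
    a = max(ys)
    b = next((x for x, y in zip(xs, ys) if x > 0 and y >= 0.632 * a), 1.0)
    for _ in range(iterations):
        saa = sab = sbb = ga = gb = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(-x / b)
            r = y - a * (1.0 - e)       # residual at the current parameters
            da = 1.0 - e                # d(model)/da
            db = -a * x * e / (b * b)   # d(model)/db
            saa += da * da
            sab += da * db
            sbb += db * db
            ga += da * r
            gb += db * r
        det = saa * sbb - sab * sab
        if det == 0.0:
            break
        # Solve the 2x2 normal equations (J^T J) * step = J^T r.
        step_a = (sbb * ga - sab * gb) / det
        step_b = (saa * gb - sab * ga) / det
        a += step_a
        b += step_b
        if abs(step_a) < 1e-12 and abs(step_b) < 1e-12:
            break
    return a, b


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [float(i) for i in range(1001)]
ys = [model(10.0, 200.0, x) + rng.uniform(-0.1, 0.1) for x in xs]
print(fit_curve(xs, ys))
```

Plain Gauss-Newton has no step-size control, so it relies on the starting point being reasonably close; the heuristic above is only adequate for data that actually reaches its plateau.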

Propositional Logic Valuation in SML

I'm trying to define a propositional logic valuation using an SML structure. A valuation in propositional logic maps named variables (i.e., strings) to Boolean values.
Here is my signature:
signature VALUATION =
sig
  type T
  val empty: T
  val set: T -> string -> bool -> T
  val value_of: T -> string -> bool
  val variables: T -> string list
  val print: T -> unit
end;
Then I defined a matching structure:
structure Valuation :> VALUATION =
struct
  type T = (string * bool) list
  val empty = []
  fun set C a b = (a, b) :: C
  fun value_of [] x = false
    | value_of ((a,b)::d) x = if x = a then b else value_of d x
  fun variables [] = []
    | variables ((a,b)::d) = a::(variables d )
  fun print valuation =
    (
      List.app
        (fn name => TextIO.print (name ^ " = " ^ Bool.toString (value_of valuation name) ^ "\n"))
        (variables valuation);
      TextIO.print "\n"
    )
end;
So the valuations should look like [("s",true), ("c", false), ("a", false)].
But I can't declare a value of the structure's type directly, or write an expression like [("s",true)]: Valuation.T; when I try to use such a valuation in a function, I get errors like:
Can't unify (string * bool) list (*In Basis*) with
Valuation.T
Could someone help me? Thanks.
The type Valuation.T is opaque (hidden).
All you know about it is that it's called "T".
You can't do anything with it except through the VALUATION signature, and that signature makes no mention of lists.
You can only build Valuations using the constructors empty and set, and you must start with empty.
- val e = Valuation.empty;
val e = - : Valuation.T
- val v = Valuation.set e "x" true;
val v = - : Valuation.T
- val v2 = Valuation.set v "y" false;
val v2 = - : Valuation.T
- Valuation.value_of v2 "x";
val it = true : bool
- Valuation.variables v2;
val it = ["y","x"] : string list
- Valuation.print v2;
y = false
x = true
val it = () : unit
Note that every Valuation.T value is printed as "-" since the internal representation isn't exposed.
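To turn a list of pairs into a valuation, the pairs therefore have to be fed through set one at a time, starting from empty. The sketch below illustrates that pattern in Python rather than SML. The `Valuation` class is only an analogue of the structure above (Python cannot enforce the opacity, so the leading underscore is a convention), and `from_pairs` is a hypothetical helper that is not part of the original signature.

```python
from functools import reduce


class Valuation:
    """Analogue of the opaque structure: callers use only empty/set/value_of/variables."""

    def __init__(self, _bindings=()):
        self._bindings = tuple(_bindings)  # newest binding first, like the SML list

    @classmethod
    def empty(cls):
        return cls()

    def set(self, name, value):
        # Returns a new valuation; the existing one is left unchanged.
        return Valuation(((name, value),) + self._bindings)

    def value_of(self, name):
        for bound_name, value in self._bindings:
            if bound_name == name:
                return value
        return False  # unbound variables default to false, as in value_of [] x

    def variables(self):
        return [bound_name for bound_name, _ in self._bindings]


def from_pairs(pairs):
    """Build a valuation by folding set over the pairs, starting from empty."""
    return reduce(lambda v, pair: v.set(pair[0], pair[1]), pairs, Valuation.empty())


v = from_pairs([("s", True), ("c", False), ("a", False)])
print(v.variables(), v.value_of("s"))
```

In SML the same idea would be a fold of Valuation.set over the list, which works because it goes through the signature instead of assuming the list representation.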

How to modify drawdown functions in PerformanceAnalytics package for value

I am calculating the average drawdown, average length, recovery length, etc. in R for a PnL data series rather than return data. The data frame looks like this:
PNL
2008-11-03 3941434
2008-11-04 4494446
2008-11-05 2829608
2008-11-06 2272070
2008-11-07 -2734941
2008-11-10 -2513580
I used the maxDrawDown function from the fTrading package and it worked. How can I get the other drawdown functions? If I run AverageDrawdown(quantbook) directly, it gives this error message:
Error in if (thisSign == priorSign) { : missing value where TRUE/FALSE needed
I checked the documentation for AverageDrawdown and it is as below:
findDrawdowns(R, geometric = TRUE, ...)
R an xts, vector, matrix, data frame, timeSeries or zoo object of asset returns
My quantbook is a data frame, but it doesn't work with this function.
Or, if you know of any other packages that provide the same functionality, please advise.
I've modified the package's functions. Here is one solution for the PnL case (or any other case where you want values rather than returns); I hope you find it useful. The parameter x is a data frame whose row.names are dates, so you don't need to convert between different data types (which caused me a lot of trouble). With the output of findPnLDrawdown, you can calculate many other statistics such as averageDrawDown, averageLength, recovery, etc.
library(timeSeries)  # provides timeSeries()
# Drawdown series: cumulative PnL minus its running maximum
PnLDrawdown <- function(x) {
  ts = as.vector(x[,1])
  cumsum = cumsum(c(0, ts))
  cmaxx = cumsum - cummax(cumsum)
  cmaxx = cmaxx[-1]
  cmaxx = as.matrix(cmaxx)
  row.names(cmaxx) = row.names(x)
  cmaxx = timeSeries(cmaxx)
  cmaxx
}
# Split the drawdown series into runs of the same sign and record each run
findPnLDrawdown <- function(R) {
  drawdowns = PnLDrawdown(R)
  draw = c()
  begin = c()
  end = c()
  length = c(0)
  trough = c(0)
  index = 1
  if (drawdowns[1] >= 0) {
    priorSign = 1
  } else {
    priorSign = 0
  }
  from = 1
  sofar = as.numeric(drawdowns[1])
  to = 1
  dmin = 1
  for (i in 1:length(drawdowns)) {
    thisSign = ifelse(drawdowns[i] < 0, 0, 1)
    if (thisSign == priorSign) {
      if (as.numeric(drawdowns[i]) < as.numeric(sofar)) {
        sofar = drawdowns[i]
        dmin = i
      }
      to = i + 1
    }
    else {
      draw[index] = sofar
      begin[index] = from
      trough[index] = dmin
      end[index] = to
      from = i
      sofar = drawdowns[i]
      to = i + 1
      dmin = i
      index = index + 1
      priorSign = thisSign
    }
  }
  draw[index] = sofar
  begin[index] = from
  trough[index] = dmin
  end[index] = to
  list(pnl = draw, from = begin, trough = trough, to = end,
       length = (end - begin + 1),
       peaktotrough = (trough - begin + 1),
       recovery = (end - trough))
}
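The core of PnLDrawdown is the running total of PnL minus its running maximum. As a cross-check of that arithmetic, here is a minimal Python sketch applied to the six sample rows from the question. The function name `pnl_drawdown` is an assumption, and the sketch does not reproduce the date index or the timeSeries conversion.

```python
from itertools import accumulate


def pnl_drawdown(pnl):
    """Drawdown of cumulative PnL: running total minus its running maximum."""
    equity = list(accumulate([0.0] + list(pnl)))  # cumulative PnL, starting from 0
    peaks = list(accumulate(equity, max))         # running maximum of the total
    # Drop the leading element, which belongs to the artificial starting 0.
    return [e - p for e, p in zip(equity, peaks)][1:]


sample = [3941434, 4494446, 2829608, 2272070, -2734941, -2513580]
drawdowns = pnl_drawdown(sample)
print(drawdowns)
# [0.0, 0.0, 0.0, 0.0, -2734941.0, -5248521.0]
```

The first four days set new highs, so the drawdown stays at zero; the two losing days then accumulate below the peak of 13537558.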

Difficulty solving this with recursive code

I need to find the length of the longest common subsequence.
s and t are strings, and n and m are their lengths. I would like to write recursive code.
This is what I have so far, but I can't make any progress:
def lcs_len_v1(s, t):
    n = len(s)
    m = len(t)
    return lcs_len_rec(s,n,t,m)
def lcs_len_rec(s,size_s,t,size_t):
    cnt= 0
    if size_s==0 or size_t==0:
        return 0
    elif s[0]==t[0]:
        cnt= +1
    return cnt, lcs_len_rec(s[1:], len(s[1:]), t[1:], len(t[1:]))
This works:
def lcs(xstr, ystr):
    if not xstr or not ystr:
        return ""
    x, xs, y, ys = xstr[0], xstr[1:], ystr[0], ystr[1:]
    if x == y:
        return x + lcs(xs, ys)
    else:
        return max(lcs(xstr, ys), lcs(xs, ystr), key=len)
print(lcs("AAAABCC","AAAACCB"))
# AAAACC
You should know that a purely recursive approach only works with relatively short strings; the running time grows very rapidly with longer strings.
This is my code. How can I apply the memoization technique to it?
def lcs_len_v1(s, t):
    n = len(s)
    m = len(t)
    return lcs_len_rec(s,n,t,m)
def lcs_len_rec(s,size_s,t,size_t):
    if size_s==0 or size_t==0:
        return 0
    elif s[0]==t[0]:
        cnt=0
        cnt+= 1
        return cnt+ lcs_len_rec(s[1:], size_s-1, t[1:], size_t-1)
    else:
        return max(lcs_len_rec(s[1:], size_s-1, t, size_t), lcs_len_rec(s, size_s, t[1:], size_t-1))
Using the memoization technique, you can also run the algorithm on much longer strings. In fact, there are only O(n*m) distinct subproblems, one per pair of suffixes:
def recursiveLCS(table, s1, s2):
    # s1 and s2 are always suffixes, so their lengths identify the subproblem
    if table[len(s1)][len(s2)] is not None:
        return table[len(s1)][len(s2)]
    elif len(s1) == 0 or len(s2) == 0:
        val = ""
    elif s1[0] == s2[0]:
        val = s1[0] + recursiveLCS(table, s1[1:], s2[1:])
    else:
        res1 = recursiveLCS(table, s1[1:], s2)
        res2 = recursiveLCS(table, s1, s2[1:])
        val = res2
        if len(res1) > len(res2):
            val = res1
    table[len(s1)][len(s2)] = val
    return val
def computeLCS(s1, s2):
    # None marks "not computed yet"; an empty string is a valid cached result
    table = [[None for col in range(len(s2) + 1)] for row in range(len(s1) + 1)]
    return recursiveLCS(table, s1, s2)
print(computeLCS("testistest", "this_is_a_long_testtest_for_testing_the_algorithm"))
Output:
teststest
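Since the question asks only for the length, the same memoization idea can be applied to the length-returning recursion directly. The sketch below is one possible way to do it, using functools.lru_cache keyed on positions instead of string slices; the function name `lcs_len` is an assumption and is not taken from either answer.

```python
from functools import lru_cache


def lcs_len(s, t):
    """Length of the longest common subsequence of s and t."""

    @lru_cache(maxsize=None)
    def rec(i, j):
        # rec(i, j) is the LCS length of the suffixes s[i:] and t[j:].
        if i == len(s) or j == len(t):
            return 0
        if s[i] == t[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))

    return rec(0, 0)


print(lcs_len("AAAABCC", "AAAACCB"))
# 6
```

Each (i, j) pair is computed once, so the work is proportional to len(s) * len(t). The recursion depth can still reach len(s) + len(t), so very long inputs may hit Python's recursion limit; an iterative table avoids that.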
