Using R to verify the coefficients computed from spark.ml.regression.LinearRegressionModel
Any idea how to use R to verify the coefficients computed by spark.ml.regression.LinearRegressionModel? I've tried the lm() function in R, but the two sets of coefficients from R and Spark are quite different. Maybe I should use another function in R?
// imports assumed for this snippet
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{DenseVector, Vectors}
import org.apache.spark.ml.regression.LinearRegression

// transform dataframe
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = dataRDD.map{case(fdate, adHashValue, effectDummyArray, timeDummyArray, label) =>
val features = Vectors.dense(effectDummyArray ++ timeDummyArray)
(label, features)
}.toDF("label", "features")
// set up model
val lr = new LinearRegression().setRegParam(0.3)
val lr_model = lr.fit(df)
val summary = lr_model.summary
val PValues = summary.pValues
val Variance = summary.coefficientStandardErrors.map{x => x * x}
val coefficients: Array[Double] = lr_model.coefficients.asInstanceOf[DenseVector].values
Okay. Setting regParam close to 0 does just fine: with (almost) no regularization, Spark's fit matches the ordinary least squares solution that R's lm() computes.
setRegParam(0.00001).setSolver("normal")
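For reference, a minimal R sketch of the comparison, assuming the label and dummy features have been exported to a CSV file (the file name and column layout here are hypothetical):

# hypothetical export: first column "label", remaining columns the dummy features
df <- read.csv("spark_training_data.csv")

# ordinary least squares over all feature columns; comparable to Spark with regParam ~ 0
fit <- lm(label ~ ., data = df)

coef(fit)                                        # intercept + coefficients
summary(fit)$coefficients[, "Std. Error"]^2      # variances, cf. coefficientStandardErrors squared
summary(fit)$coefficients[, "Pr(>|t|)"]          # p-values, cf. summary.pValues

If you keep a non-zero regParam, the comparison would instead be against a ridge fit (for example glmnet with alpha = 0), and Spark's default feature standardization makes an exact match harder.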
Related
Softmax on logits
I am trying to apply softmax over the logits, but it is not working: the softmax outputs are not probabilities that sum to 1. I've tried multiple methods and still get the same result. These are the logits:

[-6.4912415 17.46841 -6.2352724 -5.43603 0.64098835]
[-4.355269 16.009415 -4.4656963 -6.2603498 -0.8400534]
[-4.6737485 -5.791109 -0.5734087 23.885012 -6.231243 ]
[ -4.664542 19.617783 -5.443952 -10.097789 -0.12621117]

This is the code:

smm = scipy.special.softmax(test_logits[1])
print(smm)

sms = np.exp(test_logits) / np.sum(np.exp(test_logits[1]), axis=0)
print(sms)

def soft_max(z):
    t = np.exp(z)
    a = np.exp(z) / np.sum(t, axis=-1, keepdims=True)
    return a
print(soft_max(test_logits[1]))

so = np.exp(test_logits[1])
sd = np.exp(test_logits[1]) / np.sum(so)
print(sd)

sess = tf.compat.v1.Session()
print(sess.run(tf.nn.softmax(test_logits[1])))

These are the outputs:

[1.21089486e-10 2.59700234e-10 2.09438869e-10 1.61381047e+05 1.35572924e-11]
[7.21880611e-10 3.91965869e-11 1.53963411e-10 3.64820341e-10 6.75980139e+00]
[9.13843152e-08 1.24351451e-09 1.27243671e-09 1.18640177e-02 2.31230979e-10]]
[3.9305680e-11 1.0000000e+00 5.0771651e-11 1.1290883e-10 4.9197542e-08]
[3.9305680e-11 1.0000000e+00 5.0771651e-11 1.1290883e-10 4.9197542e-08]
[3.9305684e-11 1.0000000e+00 5.0771703e-11 1.1290889e-10 4.9197496e-08]

I do not know what is wrong. Maybe the logits themselves?
How can I access the trained parameters of a Neural ODE in Julia?
I'm trying to fit a Neural ODE to a time series using Julia's DiffEqFlux. Here is my code:

u0 = Float32[2.;0]
train_size = 15
tspan_train = (0.0f0,0.75f0)

function trueODEfunc(du,u,p,t)
    true_A = [-0.1 2.0; -2.0 -0.1]
    du .= ((u.^3)'true_A)'
end

t_train = range(tspan_train[1],tspan_train[2],length = train_size)
prob = ODEProblem(trueODEfunc, u0, tspan_train)
ode_data_train = Array(solve(prob, Tsit5(),saveat=t_train))

dudt = Chain(
    Dense(2,50,tanh),
    Dense(50,2))
ps = Flux.params(dudt)
n_ode = NeuralODE(dudt, tspan_train, Tsit5(), saveat = t_train, reltol=1e-7, abstol=1e-9)

n_ode.p

function predict_n_ode(p)
    n_ode(u0,p)
end

function loss_n_ode(p)
    pred = predict_n_ode(p)
    loss = sum(abs2, ode_data_train .- pred)
    loss,pred
end

final_p = []
losses = []

cb = function(p,l,pred)
    display(l)
    display(p)
    push!(final_p, p)
    push!(losses,l)
    pl = scatter(t_train, ode_data_train[1,:],label="data")
    scatter!(pl,t_train,pred[1,:],label="prediction")
    display(plot(pl))
end

DiffEqFlux.sciml_train!(loss_n_ode, n_ode.p, ADAM(0.05), cb = cb, maxiters = 100)

n_ode.p

The problem is that calling n_ode.p (or Flux.params(dudt)) before and after the train function gives me back the same values. I would have expected to receive the latest updated values from the training. That's why I've created an array to gather all parameter values during the training and then access it to get the updated parameters. Am I doing something wrong in the code? Does the train function automatically update the parameters? If not, how do I enforce it? Thanks in advance!
The result is an object that holds the best parameters. Here's a complete example:

using DiffEqFlux, OrdinaryDiffEq, Flux, Optim, Plots

u0 = Float32[2.; 0.]
datasize = 30
tspan = (0.0f0,1.5f0)

function trueODEfunc(du,u,p,t)
    true_A = [-0.1 2.0; -2.0 -0.1]
    du .= ((u.^3)'true_A)'
end

t = range(tspan[1],tspan[2],length=datasize)
prob = ODEProblem(trueODEfunc,u0,tspan)
ode_data = Array(solve(prob,Tsit5(),saveat=t))

dudt2 = FastChain((x,p) -> x.^3,
                  FastDense(2,50,tanh),
                  FastDense(50,2))
n_ode = NeuralODE(dudt2,tspan,Tsit5(),saveat=t)

function predict_n_ode(p)
    n_ode(u0,p)
end

function loss_n_ode(p)
    pred = predict_n_ode(p)
    loss = sum(abs2,ode_data .- pred)
    loss,pred
end

loss_n_ode(n_ode.p) # n_ode.p stores the initial parameters of the neural ODE

cb = function (p,l,pred;doplot=false) # callback function to observe training
    display(l)
    # plot current prediction against data
    if doplot
        pl = scatter(t,ode_data[1,:],label="data")
        scatter!(pl,t,pred[1,:],label="prediction")
        display(plot(pl))
    end
    return false
end

# Display the ODE with the initial parameter values.
cb(n_ode.p,loss_n_ode(n_ode.p)...)

res1 = DiffEqFlux.sciml_train(loss_n_ode, n_ode.p, ADAM(0.05), cb = cb, maxiters = 300)
cb(res1.minimizer,loss_n_ode(res1.minimizer)...;doplot=true)

res2 = DiffEqFlux.sciml_train(loss_n_ode, res1.minimizer, LBFGS(), cb = cb)
cb(res2.minimizer,loss_n_ode(res2.minimizer)...;doplot=true)

# result is res2 as an Optim.jl object
# res2.minimizer are the best parameters
# res2.minimum is the best loss

At the end, the sciml_train function returns a result object that holds information about the optimization, including the final parameters as .minimizer.
R fit user defined distribution
I am trying to fit my own distribution to my data, find the optimum parameters of the distribution to match the data, and ultimately find the FWHM of the peak in the distribution. From what I've read, the package fitdistrplus is the way to do this. I know the data takes the shape of a Lorentzian peak on a quadratic background.

(Plot of the raw data.)

The raw data used:

data = c(0,2,5,4,5,4,3,3,2,2,0,4,4,2,5,5,3,3,4,4,4,3,3,5,5,6,6,8,4,0,6,5,7,5,6,3,2,1,7,0,7,9,5,7,5,3,5,5,4,1,4,8,10,2,5,8,7,14,7,5,8,4,2,2,6,5,4,6,5,7,5,4,8,5,4,8,11,9,4,8,11,7,8,6,9,5,8,9,10,8,4,5,8,10,9,12,10,10,5,5,9,9,11,19,17,9,17,10,17,18,11,14,15,12,11,14,12,10,10,8,7,13,14,17,18,16,13,16,14,17,20,15,12,15,16,18,24,23,20,17,21,20,20,23,20,15,20,28,27,26,20,17,19,27,21,28,32,29,20,19,24,19,19,22,27,28,23,37,41,42,34,37,29,28,28,27,38,32,37,33,23,29,55,51,41,50,44,46,53,63,49,50,47,54,54,43,45,58,54,55,67,52,57,67,69,62,62,65,56,72,75,88,87,77,70,71,84,85,81,84,75,78,80,82,107,102,98,82,93,98,90,94,118,107,113,103,99,103,96,108,114,136,126,126,124,130,126,113,120,107,107,106,107,136,143,135,151,132,117,118,108,120,145,140,122,135,153,157,133,130,128,109,106,122,133,132,150,156,158,150,137,147,150,146,144,144,149,171,185,200,194,204,211,229,225,235,228,246,249,238,214,228,250,275,311,323,327,341,368,381,395,449,474,505,529,585,638,720,794,896,919,1008,1053,1156,1134,1174,1191,1202,1178,1236,1200,1130,1094,1081,1009,949,890,810,760,690,631,592,561,515,501,489,467,439,388,377,348,345,310,298,279,253,257,259,247,237,223,227,217,210,213,197,197,192,195,198,201,202,211,193,203,198,202,174,164,162,173,170,184,170,168,175,170,170,168,162,149,139,145,151,144,152,155,170,156,149,147,158,171,163,146,151,150,147,137,123,127,136,149,147,124,137,133,129,130,128,139,137,147,141,123,112,136,147,126,117,116,100,110,120,105,91,100,100,105,92,88,78,95,75,75,82,82,80,83,83,66,73,80,76,69,81,93,79,71,80,90,72,72,63,57,53,62,65,49,51,57,73,54,56,78,65,52,58,49,47,56,46,43,50,43,40,39,36,45,28,35,36,43,48,37,36,35,39,31,24,29,37,26,22,36,33,24,31,31,20,30,28,23,21,27,26,29,21,20,22,18,19,19,20,21,20,25,18,12,18,20,20,13,14,21,20,16,18,12,17,20,24,21,20,18,11,17,12,5,11,13,16,13,13,12,12,9,15,13,15,11,12,11,8,13,16,16,16,14,8,8,10,11,11,17,15,15,9,9,13,12,3,11,14,11,14,13,8,7,7,15,12,8,12,14,9,5,2,10,8)

I have calculated the equations which define the distribution and the cumulative distribution:

dFF <- function(x,a,b,c,A,gamma,pos) a + b*x + (c*x^2) + ((A/pi)*(gamma/(((x-pos)^2) + (gamma^2))))
pFF <- function(x,a,b,c,A,gamma,pos) a*x + (b/2)*(x^2) + (c/3)*(x^3) + A/2 + (A/pi)*(atan((x - pos)/gamma))

I believe these to be correct. From what I understand, a distribution fit should be possible using just these definitions with the fitdist (or mledist) method:

fitdist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331))
mledist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331))

In the first case this returns:

function cannot be evaluated at initial parameters
Error in fitdist(data, "FF", start = list(0, 0.3, -4e-04, 70000, 13, 331)):
  the function mle failed to estimate the parameters, with the error code 100

and in the second I just get a list of NA values for the estimates.
I then calculated a function to give the quantile distribution values, in order to use the other fitting methods (qme):

qFF <- function(p,a,b,c,A,gamma,pos) {
  qList = c()
  axis = seq(1,600,1)
  aF = dFF(axis,a,b,c,A,gamma,pos)
  arr = histogramCpp(aF) # change data to a histogram format
  for(element in 1:length(p)){
    q = quantile(arr,p[element], names=FALSE)
    qList = c(qList,q)
  }
  return(qList)
}

Part of this code calls a C++ function (via the Rcpp library):

#include <Rcpp.h>
#include <vector>
#include <math.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<int> histogramCpp(NumericVector x) {
  std::vector<int> arr;
  double number, fractpart, intpart;
  for(int i = 0; i <= 600; i++){
    number = (x[i]);
    fractpart = modf(number , &intpart);
    if(fractpart < 0.5){
      number = (int) intpart;
    }
    if(fractpart >= 0.5){
      number = (int) (intpart+1);
    }
    for(int j = 1; j <= number; j++){
      arr.push_back(i);
    }
  }
  return arr;
}

This C++ function just turns the data into a histogram format: if the first element of the vector describing the data is 4, then '1' is added 4 times to the returned vector, and so on. This also seems to work, as sensible values are returned.

(Plot of the quantiles returned for probabilities from 0 to 1 in steps of 0.001.)

The 'qme' method can then be attempted through the fitdist function:

fitdist(data,'FF', start = list(0,0.3,-0.0004,70000,13,331), method = 'qme', probs = c(0,0.3,0.4,0.5,0.7,0.9))

I chose the 'probs' values randomly, as I don't fully understand their meaning. This either crashes the R session outright or, after a brief stutter, returns a list of NA values as estimates along with the line

<std::bad_alloc : std::bad_alloc>

I am not sure if I am making a basic mistake here, and any help or recommendations are appreciated.
In the end I managed to find a workaround for this using the rPython package and lmfit from Python. It solved my issue and might be useful for others with the same problem. The R code was as follows:

library(rPython)

python.load("pyFit.py")
python.assign("row", pos)
python.assign("vals", vals)
python.exec("FWHM,ERROR,FIT = fitDist(row,vals)")
FWHM = python.get("FWHM")
ERROR = python.get("ERROR")
cFIT = python.get("FIT")

and the called Python code was:

from lmfit import Model, minimize, Parameters, fit_report
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
import math

def cauchyDist(x,a,b,c,d,e,f,g,A,gamma,pos):
    return a + b*x + c*pow(x,2) + d*pow(x,3) + e*pow(x,4) + f*pow(x,5) + g*pow(x,6) + (A/np.pi)*(gamma/((pow((x-pos),2)) + (pow(gamma,2))))

def fitDist(row, vals):
    gmod = Model(cauchyDist)
    x = np.arange(0,600)
    result = gmod.fit(vals, x=x, a = 0, b = 0.3, c = -0.0004, d = 0, e = 0, f = 0, g = 0, A = 70000, gamma = 13, pos = row)
    newFile = open('fitData.txt', 'w')
    newFile.write(result.fit_report())
    newFile.close()
    with open('fitData.txt', 'r') as inF:
        for line in inF:
            if 'gamma:' in line:
                j = line.split()
    inF.close()
    FWHM = float(j[1])
    error = float(j[3])
    fit = result.best_fit
    fit = fit.tolist()
    return FWHM, error, fit

I increased the order of the polynomial to obtain a better fit for the data, and returned the FWHM, its error, and the values of the fit. There are likely much better ways of achieving this, but the final fit is as I needed.

(Final fit: red data points are the raw data, the black line is the fitted distribution.)
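A further note: fitdist() expects raw samples drawn from a normalized density, while the data here are binned counts, so a direct curve fit to the counts may be more appropriate if you want to stay in R. A minimal sketch using nls() with the dFF formula and starting values from the question (convergence is not guaranteed and the starting values may need tuning):

# x = channel index, y = counts in that channel
df  <- data.frame(x = seq_along(data), y = data)

# Lorentzian peak on a quadratic background, least-squares fit to the counts
fit <- nls(y ~ a + b*x + c*x^2 + (A/pi) * (gamma / ((x - pos)^2 + gamma^2)),
           data  = df,
           start = list(a = 0, b = 0.3, c = -0.0004, A = 70000, gamma = 13, pos = 331))

coef(fit)
2 * coef(fit)[["gamma"]]   # FWHM of the Lorentzian component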
Evaluation Metrics for Binary Classification in Spark: AUC and PR curve
I was trying to calculate precision and recall by threshold for LogisticRegressionWithLBFGS using BinaryClassificationMetrics, and I got all of those. Now I am trying to figure out whether I can get a graphical output of the PR and ROC (AUC) curves. Pasting my code below:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object log_reg_eval_metric {

  def main(args: Array[String]): Unit = {

    System.setProperty("hadoop.home.dir", "c:\\winutil\\")
    val sc = new SparkContext(new SparkConf().setAppName("SparkTest").setMaster("local[*]"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val data: RDD[String] = sc.textFile("C:/Users/user/Documents/spark-1.5.1-bin-hadoop2.4/data/mllib/credit_approval_2_attr.csv")

    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    // Splitting the data
    val splits: Array[RDD[LabeledPoint]] = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
    val training: RDD[LabeledPoint] = splits(0).cache()
    val test: RDD[LabeledPoint] = splits(1)

    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)

    // Clear the prediction threshold so the model will return probabilities
    model.clearThreshold

    // Compute raw scores on the test set
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      val prediction = model.predict(features)
      (prediction, label)
    }

    // Instantiate metrics object
    val metrics = new BinaryClassificationMetrics(predictionAndLabels)

    // Precision by threshold
    val precision = metrics.precisionByThreshold
    precision.foreach { case (t, p) =>
      println(s"Threshold: $t, Precision: $p")
    }

    // Precision-Recall Curve
    val PRC = metrics.pr

    print(PRC)
  }
}

Output from print(PRC):

UnionRDD[39] at union at BinaryClassificationMetrics.scala:108

I am not sure what a UnionRDD is or how to use it. Is there another way to get the graphical output? I am still researching this; any suggestion would be great.
You can use BinaryLogisticRegressionTrainingSummary from the spark.ml package. It provides the PR and ROC values out of the box as DataFrames. You can feed these values into any rendering utility to see the specific curves (any multi-line plot with x and y values will display them).
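For example, if the summary's pr and roc DataFrames are exported to CSV (the file paths below are hypothetical; recall/precision and FPR/TPR are the column names those DataFrames use), the curves can be drawn with a few lines of R:

# assumes pr.csv has columns "recall" and "precision", roc.csv has "FPR" and "TPR"
pr  <- read.csv("pr.csv")
roc <- read.csv("roc.csv")

par(mfrow = c(1, 2))
plot(pr$recall, pr$precision, type = "l", xlab = "Recall", ylab = "Precision", main = "PR curve")
plot(roc$FPR, roc$TPR, type = "l", xlab = "False positive rate", ylab = "True positive rate", main = "ROC curve")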
Catching the print of the function
I am using the package fda, in particular the function fRegress. This function includes another function, called eigchk, which checks whether the coefficient matrix is singular. Here is the function as the package authors (J. O. Ramsay, Giles Hooker, and Spencer Graves) wrote it:

eigchk <- function(Cmat) {
  # check Cmat for singularity
  eigval <- eigen(Cmat)$values
  ncoef <- length(eigval)
  if (eigval[ncoef] < 0) {
    neig <- min(length(eigval),10)
    cat("\nSmallest eigenvalues:\n")
    print(eigval[(ncoef-neig+1):ncoef])
    cat("\nLargest eigenvalues:\n")
    print(eigval[1:neig])
    stop("Negative eigenvalue of coefficient matrix.")
  }
  if (eigval[ncoef] == 0) stop("Zero eigenvalue of coefficient matrix.")
  logcondition <- log10(eigval[1]) - log10(eigval[ncoef])
  if (logcondition > 12) {
    warning("Near singularity in coefficient matrix.")
    cat(paste("\nLog10 Eigenvalues range from\n",
              log10(eigval[ncoef])," to ",log10(eigval[1]),"\n"))
  }
}

As you can see, the last if condition checks whether logcondition is bigger than 12 and, if so, prints the range of the eigenvalues. The following code implements regression with a roughness penalty; it is taken from the book "Functional Data Analysis with R and MATLAB":

annualprec = log10(apply(daily$precav,2,sum))
tempbasis = create.fourier.basis(c(0,365),65)
tempSmooth = smooth.basis(day.5,daily$tempav,tempbasis)
tempfd = tempSmooth$fd

templist = vector("list",2)
templist[[1]] = rep(1,35)
templist[[2]] = tempfd

conbasis = create.constant.basis(c(0,365))
betalist = vector("list",2)
betalist[[1]] = conbasis

SSE = sum((annualprec - mean(annualprec))^2)

Lcoef = c(0,(2*pi/365)^2,0)
harmaccelLfd = vec2Lfd(Lcoef, c(0,365))
betabasis = create.fourier.basis(c(0, 365), 35)
lambda = 10^12.5
betafdPar = fdPar(betabasis, harmaccelLfd, lambda)
betalist[[2]] = betafdPar

annPrecTemp = fRegress(annualprec, templist, betalist)
betaestlist2 = annPrecTemp$betaestlist
annualprechat2 = annPrecTemp$yhatfdobj

SSE1.2 = sum((annualprec-annualprechat2)^2)
RSQ2 = (SSE - SSE1.2)/SSE
Fratio2 = ((SSE-SSE1.2)/3.7)/(SSE1.2/30.3)

resid = annualprec - annualprechat2
SigmaE. = sum(resid^2)/(35-annPrecTemp$df)
SigmaE = SigmaE.*diag(rep(1,35))
y2cMap = tempSmooth$y2cMap

stderrList = fRegress.stderr(annPrecTemp, y2cMap, SigmaE)

betafdPar = betaestlist2[[2]]
betafd = betafdPar$fd
betastderrList = stderrList$betastderrlist
betastderrfd = betastderrList[[2]]

The authors use a certain lambda as the penalty factor. The following code implements the search for an appropriate lambda:

loglam = seq(5,15,0.5)
nlam = length(loglam)
SSE.CV = matrix(0,nlam,1)
for (ilam in 1:nlam) {
  lambda = 10^loglam[ilam]
  betalisti = betalist
  betafdPar2 = betalisti[[2]]
  betafdPar2$lambda = lambda
  betalisti[[2]] = betafdPar2
  fRegi = fRegress.CV(annualprec, templist, betalisti)
  SSE.CV[ilam] = fRegi$SSE.CV
}

By changing the values of loglam and cross-validating, I am supposed to find the best lambda. Yet if the range of loglam is too wide, or its values lead the coefficient matrix to singularity, I receive the following message, produced by the eigchk function I mentioned above:

Log10 Eigenvalues range from
 -5.44495317739048  to  6.78194912518214

Now my question is: is there any way to catch this so-called warning? By "catch" I mean some function or method that tells me when this has happened, so that I can adjust the values of loglam. Since there is no actual warning definition in the function besides this printed message, I have run out of ideas. Thank you all a lot for your suggestions.
If by "catch the warning" you mean being alerted that there is a potential problem with loglam, then you might want to look at the try and tryCatch functions. You can then define the behavior you want whenever a warning condition is raised. If you just want to store the output of the warning (which might be assumed from the question title, but may not be what you want), look into capture.output.
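For example, a minimal sketch of both approaches, reusing the cross-validation loop from the question (the warning and the printed eigenvalue message both come from eigchk, as shown above):

# Catch warnings raised inside the loop and react to them
for (ilam in 1:nlam) {
  lambda = 10^loglam[ilam]
  betalisti = betalist
  betafdPar2 = betalisti[[2]]
  betafdPar2$lambda = lambda
  betalisti[[2]] = betafdPar2
  fRegi <- tryCatch(
    fRegress.CV(annualprec, templist, betalisti),
    warning = function(w) {
      message("Problem at loglam = ", loglam[ilam], ": ", conditionMessage(w))
      NULL   # skip this lambda
    }
  )
  if (!is.null(fRegi)) SSE.CV[ilam] <- fRegi$SSE.CV
}

# Alternatively, capture everything printed via cat()/print() and inspect it
printed <- capture.output(fRegi <- fRegress.CV(annualprec, templist, betalisti))
if (any(grepl("Eigenvalues range", printed))) {
  message("Near-singular coefficient matrix for this lambda")
}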