Python equivalent to R caTools random 'sample.split'

Is there a Python (perhaps pandas) equivalent to R's
install.packages("caTools")
library(caTools)
set.seed(88)
split = sample.split(df$col, SplitRatio = 0.75)
that will generate exactly the same split?
My current context for this is, as an example, getting pandas DataFrames that correspond exactly to the R data frames (qualityTrain, qualityTest) created by:
# https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv
quality = read.csv("quality.csv")
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)

I think scikit-learn's train_test_split function might work for you.
import pandas as pd
from sklearn.cross_validation import train_test_split
url = 'https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv'
quality = pd.read_csv(url)
train, test = train_test_split(quality, train_size=0.75, random_state=88)
qualityTrain = pd.DataFrame(train, columns=quality.columns)
qualityTest = pd.DataFrame(test, columns=quality.columns)
Unfortunately I don't get the same rows as the R function. I'm guessing it's the seeding, but could be wrong.

Splitting with sample.split from the caTools library means the class distribution is preserved. Scikit-learn's train_test_split does not guarantee that (it splits the dataset into random train and test subsets).
You can get an equivalent result to the R caTools library (regarding class distribution) by using sklearn.cross_validation.StratifiedShuffleSplit instead:
from sklearn.cross_validation import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(quality['PoorCare'], n_iter=1, test_size=0.25, random_state=0)
for train_index, test_index in sss:
    qualityTrain = quality.iloc[train_index, :]
    qualityTest = quality.iloc[test_index, :]

I know this is an old thread, but I just found it while looking for a potential solution. Many online stats and machine learning classes are taught in R, and if you want to follow along in Python you run into this issue: the course tells you to call set.seed() in R and then use something like caTools' sample.split, and you must reproduce that exact split or your later results won't match and you can't get the right answer for a quiz or exercise question. One of the main issues is that although both Python and R use the Mersenne Twister algorithm for their pseudo-random number generation by default, I discovered, by looking at the random states of their respective PRNGs, that they won't produce the same result given the same seed. And one of them (I forget which) uses signed numbers and the other unsigned, so there seems to be little hope of finding a seed for Python that would produce the same series of numbers as R.
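As a quick illustration (this snippet is not from the original post, and the exact numbers printed depend on your R and Python versions), drawing a few uniforms with the same seed in each language shows the streams diverge:
set.seed(88)
runif(3)
# Compare with Python:
#   import random; random.seed(88)
#   [random.random() for _ in range(3)]
# The two sequences do not match, so reproducing R's sample.split() draws from
# Python by picking the "right" seed is not practical.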

A small correction to the above: StratifiedShuffleSplit is now part of sklearn.model_selection.
I have some data with X and Y in different numpy arrays. The distribution of 1s against 0s in my Y array is about 4.1%. If I use StratifiedShuffleSplit, it maintains this distribution in the test and train sets made afterwards. See below.
full_data_Y_np.sum() / len(full_data_Y_np)
0.041006701187937859

# sss is a StratifiedShuffleSplit instance created beforehand, e.g.
# from sklearn.model_selection import StratifiedShuffleSplit
# sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(full_data_X_np, full_data_Y_np):
    X_train = full_data_X_np[train_index]
    Y_train = full_data_Y_np[train_index]
    X_test = full_data_X_np[test_index]
    Y_test = full_data_Y_np[test_index]

Y_train.sum() / len(Y_train)
0.041013925152306355
Y_test.sum() / len(Y_test)
0.040989847715736043

Related

xgboost in R providing unexpected prediction

Below is code that produces a simple xgboost model to show the issue I've been seeing. Once the model has been built, we predict with it and take the second row of our data. If we take the log of the ratio between the predictions using 10 trees and 9 trees, it should give us the leaf value contributed by the 10th tree: 0.00873184 in this case.
Now if we take the input to the tree (matrix "a", which has the value 0.1234561702 for row 2) and run it through the plotted tree by hand, we expect a prediction of 0.0121501638. However, it looks like after the second split (< 0.123456173) the model takes the wrong direction and ends up at the node with value 0.00873187464 - very close to the value we computed from the predictions above.
Does anyone have an idea what is going on?
[Image: plot of the 10th tree]
Versions:
R: 4.1.0
xgboost: 1.4.1.1
dplyr: 1.0.7
data.table: 1.14.0
library(xgboost)
library(dplyr)
library(data.table)
set.seed(2)
a <- matrix(runif(1000, 0.1234561, 0.1234562),
            ncol = 1, nrow = 1000)
colnames(a) <- c("b")
d <- abs(rnorm(1000,3*a[,1]))
d2 <- xgb.DMatrix(data = a,label = d)
e <- xgboost::xgboost(data=d2,nrounds=10,method="hist",objective="reg:gamma")
xgb.plot.tree(e$feature_names,e,trees=9)
x <- 2
log((predict(e,a,ntreelimit = 10)/predict(e,a,ntreelimit = 9)))[x]
format(a[x,],nsmall=10)
For anyone interested in the answer, the xgboost team provided it here:
https://github.com/dmlc/xgboost/issues/7294
In short, xgboost converts the input data into float32 before training, whereas R uses double by default. Hence, what should be done is to convert 0.1234561702 to float32 before tracing it through the model. Doing that gives the value 0.123456173, which now takes the right path.
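For illustration only (this is not from the linked issue), you can reproduce that single-precision rounding in base R by round-tripping the value through a 4-byte float with writeBin()/readBin():
# round-trip a double through single precision (4-byte float)
as_float32 <- function(x) readBin(writeBin(x, raw(), size = 4), "double", size = 4)
x <- 0.1234561702
format(as_float32(x), nsmall = 10)
# prints roughly 0.123456173, the value xgboost actually compares at the split,
# so the "< 0.123456173" branch is no longer taken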

Converting a R2jags object into a Stanreg (rstanarm) object

I made a model using R2jags. I like the jags syntax but I find the output produced by R2jags not easy to use. I recently read about the rstanarm package. It has many useful functions and is well supported by the tidybayes and bayesplot packages for easy model diagnostics and visualisation. However, I'm not a fan of the syntax used to write a model in rstanarm. Ideally, I would like to get the best of the two worlds, that is writing the model in R2jags and convert the output into a Stanreg object to use rstanarm functions.
Is that possible? If so, how?
I think the question isn't necessarily whether or not it's possible - I suspect it probably is. The question really is how much time you're prepared to spend doing it. All you'd have to do is try to replicate the structure of the object that gets created by rstanarm, to the extent that it's possible with the R2jags output. That would make it so that some post-processing tasks would probably work.
If I might be so bold, I suspect a better use of your time would be to turn the R2jags object into something that could be used with the post-processing functions you want to use. For example, it only takes a small modification to the JAGS output to make all of the mcmc_*() plotting functions from bayesplot work. Here's an example. Below is the example model from the jags() function help.
# An example model file is given in:
model.file <- system.file(package="R2jags", "model", "schools.txt")
# data
J <- 8.0
y <- c(28.4,7.9,-2.8,6.8,-0.6,0.6,18.0,12.2)
sd <- c(14.9,10.2,16.3,11.0,9.4,11.4,10.4,17.6)
jags.data <- list("y","sd","J")
jags.params <- c("mu","sigma","theta")
jags.inits <- function(){
  list("mu" = rnorm(1), "sigma" = runif(1), "theta" = rnorm(J))
}
jagsfit <- jags(data = jags.data, inits = jags.inits, jags.params,
                n.iter = 5000, model.file = model.file, n.chains = 2)
Now, what the mcmc_*() plotting functions from bayesplot expect is a list of matrices of MCMC draws, where the column names give the names of the parameters. By default, jags() puts all of the draws into a single matrix. In the above case, there are 5000 iterations in total, with 2500 as burn-in (leaving 2500 sampled), and n.thin is set to 2 (jags() has an algorithm for identifying the thinning parameter). In any case, the jagsfit$BUGSoutput$n.keep element identifies how many iterations are kept per chain; here it's 1250. So you could use that to make a list of two matrices (one per chain) from the output.
jflist <- list(jagsfit$BUGSoutput$sims.matrix[1:jagsfit$BUGSoutput$n.keep, ],
               jagsfit$BUGSoutput$sims.matrix[(jagsfit$BUGSoutput$n.keep+1):(2*jagsfit$BUGSoutput$n.keep), ])
Now, you'd just have to call some of the plotting functions:
mcmc_trace(jflist, regex_pars="theta")
or
mcmc_areas(jflist, regex_pars="theta")
So, instead of trying to replicate all of the output that rstanarm produces, it might be a better use of your time to try to bend the jags output into a format that would be amenable to the post-processing functions you want to use.
EDIT - added possibility for pp_check() from bayesplot.
The posterior draws of y in this case are in the theta parameters. So, we make an object that has elements y and yrep and give it class foo.
x <- list(y = y, yrep = jagsfit$BUGSoutput$sims.list$theta)
class(x) <- "foo"
We can then write a pp_check method for objects of class foo. This comes straight out of the help file for bayesplot::pp_check().
pp_check.foo <- function(object, ..., type = c("multiple", "overlaid")) {
  y <- object[["y"]]
  yrep <- object[["yrep"]]
  switch(match.arg(type),
         multiple = ppc_hist(y, yrep[1:min(8, nrow(yrep)), , drop = FALSE]),
         overlaid = ppc_dens_overlay(y, yrep[1:min(8, nrow(yrep)), , drop = FALSE]))
}
Then, just call the function:
pp_check(x, type="overlaid")

Kaggle Digit Recognizer Using SVM (e1071): Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty

I am trying to solve the Digit Recognizer competition on Kaggle and I ran into this error.
I loaded the training data and adjusted its values by dividing by the maximum pixel value, which is 255. After that, I am trying to build my model.
Here goes my code:
Given_Training_data <- get(load("Given_Training_data.RData"))
Given_Testing_data <- get(load("Given_Testing_data.RData"))
Maximum_Pixel_value = max(Given_Training_data)
Tot_Col_Train_data = ncol(Given_Training_data)
training_data_adjusted <- Given_Training_data[, 2:ncol(Given_Training_data)]/Maximum_Pixel_value
testing_data_adjusted <- Given_Testing_data[, 2:ncol(Given_Testing_data)]/Maximum_Pixel_value
label_training_data <- Given_Training_data$label
final_training_data <- cbind(label_training_data, training_data_adjusted)
smp_size <- floor(0.75 * nrow(final_training_data))
set.seed(100)
training_ind <- sample(seq_len(nrow(final_training_data)), size = smp_size)
training_data1 <- final_training_data[training_ind, ]
train_no_label1 <- as.data.frame(training_data1[,-1])
train_label1 <-as.data.frame(training_data1[,1])
svm_model1 <- svm(train_label1,train_no_label1) #This line is throwing an error
Error : Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
Please kindly share your thoughts. I am not looking for an answer but rather some idea that guides me in the right direction, as I am in a learning phase.
Thanks.
Update to the question:
trainlabel1 <- train_label1[sapply(train_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
trainnolabel1 <- train_no_label1[sapply(train_no_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
svm_model2 <- svm(trainlabel1,trainnolabel1,scale = F)
It didn't help either.
Read the manual (https://cran.r-project.org/web/packages/e1071/e1071.pdf):
svm(x, y = NULL, scale = TRUE, type = NULL, ...)
...
Arguments:
...
x: a data matrix, a vector, or a sparse matrix (object of class Matrix
   provided by the Matrix package, or of class matrix.csr provided by the
   SparseM package, or of class simple_triplet_matrix provided by the slam
   package).
y: a response vector with one label for each row/component of x. Can be
   either a factor (for classification tasks) or a numeric vector (for
   regression).
Therefore, the main problems are that your call to svm switches the data matrix and the response vector, and that you are passing the response vector as an integer, resulting in a regression model. Furthermore, you are also passing the response vector as a single-column data frame, which is not exactly how you are supposed to do it. Hence, if you change the call to:
svm_model1 <- svm(train_no_label1, as.factor(train_label1[, 1]))
it will work as expected. Note that training will take some minutes to run.
You may also want to remove features that are constant (where the values in the respective column of the training data matrix are all identical) in the training data, since these will not influence the classification.
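A minimal sketch of that filtering, reusing the objects from the question (the helper vector keep is a name introduced here for illustration):
# keep only features that take more than one value in the training data
keep <- vapply(train_no_label1, function(col) length(unique(col)) > 1, logical(1))
train_no_label1 <- train_no_label1[, keep, drop = FALSE]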
I don't think you need to scale it manually, since svm does that itself, unlike most neural network packages.
You can also use the formula interface of svm instead of the matrix-and-vector form, which is
svm(result ~ ., data = your_training_set)
In your case, I guess you want to make sure the result is used as a factor, because you want a label like 1, 2, 3, not something like 1.5467 from a regression.
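For example, a sketch of that formula call with the objects from the question (this assumes final_training_data is a data frame whose first column is named label_training_data, as produced by the cbind() above; svm_model_formula is a name introduced here):
library(e1071)
# make the label a factor so svm fits a classifier rather than a regression
training_data1$label_training_data <- as.factor(training_data1$label_training_data)
svm_model_formula <- svm(label_training_data ~ ., data = training_data1)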
I can debug it if you can share the data: Given_Training_data.RData

Extracting Class Probabilities from SparkR ML Classification Functions

I'm wondering if it's possible (using the built-in features of SparkR or any other workaround) to extract the class probabilities from some of the classification algorithms included in SparkR. Particular ones of interest are:
spark.gbt()
spark.mlp()
spark.randomForest()
Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence."
I've seen several other questions that are similar to this topic, but none that are specific to SparkR, and many have not been answered with regard to Spark's most recent updates.
I ran into the same problem, and following this answer I now use SparkR:::callJMethod to transform the probability DenseVector (which R cannot deserialize) to an Array (which R reads as a list). It's not very elegant or fast, but it does the job:
denseVectorToArray <- function(dv) {
  SparkR:::callJMethod(dv, "toArray")
}
e.g.:
start your spark session
#library(SparkR)
#sparkR.session(master = "local")
generate toy data
data <- data.frame(clicked = base::sample(c(0, 1), 100, replace = TRUE),
                   someString = base::sample(c("this", "that"),
                                             100, replace = TRUE),
                   stringsAsFactors = FALSE)
trainidxs <- base::sample(nrow(data), nrow(data)*0.7)
traindf <- as.DataFrame(data[trainidxs,])
testdf <- as.DataFrame(data[-trainidxs,])
train a random forest and run predictions:
rf <- spark.randomForest(traindf,
                         clicked ~ .,
                         type = "classification",
                         maxDepth = 2,
                         maxBins = 2,
                         numTrees = 100)
predictions <- predict(rf, testdf)
collect your predictions:
collected = SparkR::collect(predictions)
now extract the probabilities:
collected$probabilities <- lapply(collected$probability, function(x) denseVectorToArray(x))
str(collected$probabilities)
Of course, the function wrapper around SparkR:::callJMethod is a bit of overkill. You can also use it directly, e.g. with dplyr:
withprobs <- collected %>%
  rowwise() %>%
  mutate("probabilities" = list(SparkR:::callJMethod(probability, "toArray"))) %>%
  mutate("prob0" = probabilities[[1]], "prob1" = probabilities[[2]])

Export Linear Mixed Effects Model Outputs in csv using Julia Language

I am new to the Julia programming language; however, I am fitting a linear mixed-effects model and I find it difficult to save the fixed- and random-effects estimates to .csv files.
Example code:
using MixedModels
@time modelOutput = fit(lmm(Y ~ A + B + (0 + A | group), data))
There are references available on how to obtain the fixed (fixef(modelOutput)) and random (ranef(modelOutput)) effects; however, when putting them into a DataFrame I am facing errors.
Any advice is appreciated.
Okay, I actually took the time to do this for you. A CoefTable is a type defined in StatsBase (in statmodels.jl). Given this information, we can extract the relevant information from the CoefTable instance as follows:
using DataFrames

ct = coeftable(modelOutput)   # CoefTable for the fixed effects
df = DataFrame(variable = ct.rownms,
               Estimate = ct.mat[:, 1],
               StdError = ct.mat[:, 2],
               z_val = ct.mat[:, 3])
This will give an nvar-by-4 DataFrame, which you can then write to CSV as described earlier using writetable("output.csv", df).
I had a number of problems getting the accepted answer to work; Julia has evolved a lot since then. I rewrote it based primarily on code from the jglmm R package, with some adaptation/cobbling-together from other sources ...
"""
outfun(m, outfn="output.csv")
output the coefficient table of a fitted model to a file
"""
outfun = function(m, outfn="output.csv")
ct = coeftable(m)
coef_df = DataFrame(ct.cols);
rename!(coef_df, ct.colnms, makeunique = true)
coef_df[!, :term] = ct.rownms;
CSV.write(outfn, coef_df);
end
