I'm wondering if it's possible (using the built in features of SparkR or any other workaround), to extract the class probabilities of some of the classification algorithms that included in SparkR. Particular ones of interest are.
spark.gbt()
spark.mlp()
spark.randomForest()
Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence."
I've seen several other questions that are similar to this topic, but none that are specific to SparkR, and many have not been answered in regards to Spark's most recent updates.
i ran into the same problem, and following this answer now use SparkR:::callJMethod to transform the probability DenseVector (which R cannot deserialize) to an Array (which R reads as a List). It's not very elegant or fast, but it does the job:
denseVectorToArray <- function(dv) {
SparkR:::callJMethod(dv, "toArray")
}
e.g.:
start your spark session
#library(SparkR)
#sparkR.session(master = "local")
generate toy data
data <- data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
someString = base::sample(c("this", "that"),
100, replace=TRUE),
stringsAsFactors=FALSE)
trainidxs <- base::sample(nrow(data), nrow(data)*0.7)
traindf <- as.DataFrame(data[trainidxs,])
testdf <- as.DataFrame(data[-trainidxs,])
train a random forest and run predictions:
rf <- spark.randomForest(traindf,
clicked~.,
type = "classification",
maxDepth = 2,
maxBins = 2,
numTrees = 100)
predictions <- predict(rf, testdf)
collect your predictions:
collected = SparkR::collect(predictions)
now extract the probabilities:
collected$probabilities <- lapply(collected$probability, function(x) denseVectorToArray(x))
str(probs)
ofcourse, the function wrapper around SparkR:::callJMethod is a bit of an overkill. You can also use it directly, e.g. with dplyr:
withprobs = collected %>%
rowwise() %>%
mutate("probabilities" = list(SparkR:::callJMethod(probability,"toArray"))) %>%
mutate("prob0" = probabilities[[1]], "prob1" = probabilities[[2]])
Related
I'm working on a project where I need to construct a knn model using R. The professor provided an article with step-by-step instructions (link to article) and some datasets to choose from (link to the data I'm using). I'm getting stuck on step 3 (creating the model from the training data).
Here's my code:
data <- read.delim("data.txt", header = TRUE, sep = "\t", dec = ".")
set.seed(2)
part <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
training_data <- data[part==1,]
testing_data <- data[part==2,]
outcome <- training_data[,2]
model <- knn(train = training_data, test = testing_data, cl = outcome, k=10)
Here's the error message I'm getting:
I checked and found that training_data, testing_data, and outcome all look correct, the issue seems to only be with the knn model.
The issue is with your data and the knn function you are using; it can't handle characters or factor variable
We can force this to work doing something like this first:
library(tidyverse)
data <- data %>%
mutate(Seeded = as.numeric(as.factor(Seeded))-1) %>%
mutate(Season = as.numeric(as.factor(Season)))
But this is a bad idea in general, since Season is not ordered naturally. A better approach would be to instead treat it as a set of dummies.
See this link for examples:
R - convert from categorical to numeric for KNN
I'm new in R and I'm currently working with RandomForest Analysis.
I need to create at least 100 replicates of a RF model, each one with different test/train data.
I would like to automate the task wrapping the code into a loop if that's possible, and save the results of every model.
Without a loop, I have to run the code every time and manually write the output.
This is my code:
#split data into 80 for training/20 for testing
obs_split <- obs_split %>%
split(if_else(runif(nrow(.)) <= 0.8, "train", "test"))
map_int(obs_split, nrow)
# grow random forest with ranger package
detection_freq <- mean(obs_split$train$species_observed)
# ranger requires a factor response to do classification
obs_split$train$species_observed <- factor(obs_split$train$species_observed)
rf <- ranger(formula = species_observed ~ .,
data = obs_split$train,
importance = "impurity",
probability = TRUE,
replace = TRUE,
sample.fraction = c(detection_freq, detection_freq))
I would appreciate any solution! Thank you
I have a dataset containing missing values. I have imputed this dataset, as follows:
library(mice)
id <- c(1,2,3,4,5,6,7,8,9,10)
group <- c(0,1,1,0,1,1,0,1,0,1)
measure_1 <- c(60,80,90,54,60,61,77,67,88,90)
measure_2 <- c(55,NA,88,55,70,62,78,66,65,92)
measure_3 <- c(58,88,85,56,68,62,89,62,70,99)
measure_4 <- c(64,80,78,92,NA,NA,87,65,67,96)
measure_5 <- c(64,85,80,65,74,69,90,65,70,99)
measure_6 <- c(70,NA,80,55,73,64,91,65,91,89)
dat <- data.frame(id, group, measure_1, measure_2, measure_3, measure_4, measure_5, measure_6)
dat$group <- as.factor(dat$group)
imp_anova <- mice(dat, maxit = 0)
meth <- imp_anova$method
pred <- imp_anova$predictorMatrix
imp_anova <- mice(dat, method = meth, predictorMatrix = pred, seed = 2018,
maxit = 10, m = 5)
This creates five imputed datasets. Then, I created the complete datasets (example dataset 1):
impute_1 <- mice::complete(imp_anova, 1) # complete set 1
And then I performed the desired analysis:
library(reshape)
library(reshape2)
datLong <- melt(impute_1, id = c("id", "group"), measure.vars = c("measure_1", "measure_2", "measure_3", "measure_4", "measure_5", "measure_6"))
colnames(datLong) <- c("ID", "Gender", "Time", "Value")
table(datLong$Time) # To check if correct
datLong$ID <- as.factor(datLong$ID)
library(ez)
model_mixed_1 <- ezANOVA(data = datLong,
dv = Value,
wid = ID,
within = Time,
between = Gender,
detailed = TRUE,
type = 3,
return_aov = TRUE)
I did this for all the five datasets, resulting in five models:
model_mixed_1
model_mixed_2
model_mixed_3
model_mixed_4
model_mixed_5
Now I want to combine the results of this models, to generate one results.
I have asked a similar question before, but there I focused on the models. Here I just want to ask how I can simply combine five models. Hope someone can help me!
You understood the basic multiple imputation process right. The process is like:
First your create your m imputed datasets. (mice() - function)
Then you do your analysis on each of these datasets. (with() - function)
In the end you combine these results together. (pool() - function)
This is a quite often misunderstand process (often people assume you have to combine your m imputed datasets together to one dataset - which is wrong)
Here is a picture of this process:
Now you have to follow these steps within the mice framework - you did this only till step 1.
Here an excerpt from the mice help:
The pool() function combines the estimates from m repeated complete data analyses. The typical sequence of steps to do a multiple imputation analysis is:
Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids);
Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting an object of class mira;
Pool the estimates from each model into a single set of estimates and standard errors, resulting is an object of class mipo;
Optionally, compare pooled estimates from different scientific models by the pool.compare() function.
Code wise this can look for example like this:
imp <- mice(nhanes, maxit = 2, m = 5)
fit <- with(data=imp,exp=lm(bmi~age+hyp+chl))
summary(pool(fit))
This is my first attempt using a machine learning paradigm in R. I'm using a planet data set (url: https://www.kaggle.com/mrisdal/open-exoplanet-catalogue) and I simply want to predict a planet's size based on the size of its Sun. This is the code I currently have, using nnet():
library(nnet)
#Organize data:
cols_to_keep = c(1,4,21)
full_data <- na.omit(read.csv('Planet_Data.csv')[, cols_to_keep])
#Split data:
train_data <- full_data[sample(nrow(full_data), round(nrow(full_data)/2)),]
rownames(train_data) <- 1:nrow(train_data)
test_data <- full_data[!rownames(full_data) %in% rownames(data1),]
rownames(test_data) <- 1:nrow(test_data)
#nnet
nnet_attempt <- nnet(RadiusJpt~HostStarRadiusSlrRad, data=train_data, size=0, linout=TRUE, skip=TRUE, maxNWts=10000, trace=FALSE, maxit=1000, decay=.001)
nnet_newdata <- predict(nnet_attempt, newdata=test_data)
nnet_newdata
When I print nnet_newdata I get a value for each row in my data, but I don't really understand what these values mean. Is this a proper way to use the nnet() package to predict a simple regression?
Thanks
When predict is called for an object with class nnet you will get, by default, the raw output from the nnet model applied to your new dataset. If, instead, yours is a classification problem, you can use type = "class".
See here.
I have a problem with performing statistical analyses of longitudinal data after
the imputation of missing values using mice. After the imputation of missings in the wide
data-format I convert the extracted data to the longformat. Because of the longitudinal
data participants have duplicate rows (3 timepoints) and this causes problems when converting the long-formatted data set into a type mids object.
Does anyone know how to create a mids object or something else appropriate after the imputation? I want to use lmer,lme for pooled fixed effects afterwards.
I tried a lot of different things, but still cant figure it out.
Thanks in advance and see the code below:
# minimal reproducible example
## Make up some data
set.seed(2)
# ID Variable, Group, 3 Timepoints outcome measure (X1-X3)
Data <- data.frame(
ID = sort(sample(1:100)),
GROUP = sample(c(0, 1), 100, replace = TRUE),
matrix(sample(c(1:5,NA), 300, replace=T), ncol=3)
)
# install.packages("mice")
library(mice)
# Impute the data in wide format
m.out <- mice(Data, maxit = 5, m = 2, seed = 9, pred=quickpred(Data, mincor = 0.0, exclude = c("ID","GROUP"))) # ignore group here for easiness
# mids object?
is.mids(m.out) # TRUE
# Extract imputed data
imp_data <- complete(m.out, action = "long", include = TRUE)[, -2]
# Converting data into long format
# install.packages("reshape")
library(reshape)
imp_long <- melt(imp_data, id=c(".imp","ID","GROUP"))
# sort data
imp_long <- imp_long[order(imp_long$.imp, imp_long$ID, imp_long$GROUP),]
row.names(imp_long)<-NULL
# save as.mids
as.mids(imp_long,.imp=1, .id=2) # doesnt work
as.mids(imp_long) # doesnt work
Best,
Julian
I hope I can answer your question with this small example. I don't really see why conversion back to the mids class is necessary. Usually when I use mice I convert the imputed data to a list of completed datasets, then analyse that list using apply.
library(mice)
library(reshape)
library(lme4)
Data <- data.frame(
ID = sort(sample(1:100)),
GROUP = sample(c(0, 1), 100, replace = TRUE),
matrix(sample(c(1:5,NA), 300, replace=T), ncol=3)
)
# impute
m.out <- mice(Data, pred=quickpred(Data, mincor=0, exclude=c("ID","GROUP")))
# complete
imp.data <- as.list(1:5)
for(i in 1:5){
imp.data[[i]] <- complete(m.out, action=i)
}
# reshape
imp.data <- lapply(imp.data, melt, id=c("ID","GROUP"))
# analyse
imp.fit <- lapply(imp.data, FUN=function(x){
lmer(value ~ as.numeric(variable)+(1|ID), data=x)
})
imp.res <- sapply(imp.fit, fixef)
Keep in mind, however, that single-level imputation is not a good idea when you're interested in relationships of variables that vary at different levels.
For these tasks you should use procedures that maintain the two-level variation and do not suppress it as mice does in this configuration.
There are workarounds for mice, but for example Mplus and the pan package in R are specifically designed for two-level MI.
No sure how relevant my answer is since you have asked a question long time ago, but in any case... In this slide deck toward the end, on the slide titled "Method POST" the author uses function long2mids():
imp1 <- mice(boys)
long <- complete(imp1, "long", inc = TRUE)
long$whr <- with(long, wgt / (hgt / 100))
imp2 <- long2mids(long)
However, long2mids() has been deprecated in favor of as.mids() since version 2.22.
as.mids() from the miceadds package will work here