I am using the R programming language. I am trying to replicate the plots from the following stackoverflow post using the "mlr" library: R: multiplot for plotLearnerPrediction ggplot objects of MLR firing errors in RStudio
(I am also using this site here: https://www.analyticsvidhya.com/blog/2016/08/practicing-machine-learning-techniques-in-r-with-mlr-package/)
First, I created the data for this exercise ("response variable" is the response, all other variables are the predictors)
#load libraries
library(mlr)
library(gridExtra)
library(ggplot2)
library(rpart)
#create data
a = rnorm(1000, 10, 10)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
d <- sample( LETTERS[1:3], 1000, replace=TRUE, prob=c(0.2, 0.6, 0.2) )
response_variable <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.3, 0.7) )
data <- data.frame(a, b, c, d, response_variable)
data$d = as.factor(data$d)
data$response_variable = as.factor(data$response_variable)
From here, I tried to follow the "mlr" part of the tutorial (only with the "decision tree" and the "random forest" algorithm):
task <- makeClassifTask(data = data, target = "response_variable")
learners = list(
"classif.randomForest",
"classif.rpart" )
p1<-plotLearnerPrediction(learner = learners[[1]], task = task)
p2<-plotLearnerPrediction(learner = learners[[2]], task = task)
Can someone please tell me whether the plots I have produced are what this code is intended to produce?
Thanks
Yes, the plots are what the code is intended to produce. To see this, you can run the same commands on the toy data below, where the response actually depends on the predictors. From this, you will see that the classification is correct. The only thing is that in your data the response has absolutely nothing to do with the predictors, so the classification sucks (in fact, it seems to be predicting everything as "B").
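library(mlr)
library(dplyr)
#toy data in which the response depends on the predictors
a = rnorm(100, 10, 10)
b = rnorm(100, 10, 5)
data <- data.frame(a, b)
data = mutate(data, response_variable = as.factor(ifelse(a > mean(a) | b < mean(b), "A", "B")))
#same commands as in the question
task <- makeClassifTask(data = data, target = "response_variable")
plotLearnerPrediction(learner = "classif.rpart", task = task)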
I am using the R programming language. On some bigger data, I tried the following code (make a decision tree):
#load library
library(rpart)
#generate data
a = rnorm(100, 7000000, 10)
b = rnorm(100, 5000000, 5)
c = rnorm(100, 400000, 10)
group <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5,0.5) )
group_1 <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
d = data.frame(a,b,c, group, group_1)
d$group = as.factor(d$group)
d$group_1 = as.factor(d$group_1)
#fit model
tree <- rpart(group ~ ., d)
#visualize results
plot(tree)
text(tree, use.n=TRUE, minlength = 0, xpd=TRUE, cex=.8)
In the visual output, the numbers are displayed in scientific notation (e.g. 4.21e+06). Is there a way to disable this?
I consulted this previous answer on Stack Overflow: How to disable scientific notation?
I then tried the following command : options(scipen=999)
But this did not seem to fix the problem.
Can someone please tell me what I am doing wrong?
Thanks
I think the labels.rpart function has scientific notation hard-coded in: it uses a private function called formatg to do the formatting using sprintf() with a %g format, and that function ignores options(scipen). You can override this by replacing formatg with a better function. Here's a dangerous way to do that:
oldformatg <- rpart:::formatg
assignInNamespace("formatg", format, "rpart")
which replaces formatg with the standard format function. This will definitely have dangerous side effects, so afterwards you should change it back using:
assignInNamespace("formatg", oldformatg, "rpart")
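As a quick check of why options(scipen = 999) does not help here, the sketch below compares format(), which honours scipen, with sprintf() and a %g format, which does not. The value 7000000 is only an illustrative number of the same magnitude as the data in the question; it is not taken from the rpart internals.

```r
# Illustrative value, similar in size to the predictors in the question.
x <- 7000000

# Set scipen as in the question, remembering the old options so they can be restored.
old_opts <- options(scipen = 999)

with_format <- format(x)          # format() respects scipen: "7000000"
with_g      <- sprintf("%g", x)   # %g switches to scientific notation regardless of scipen

print(with_format)
print(with_g)

# Restore the previous options.
options(old_opts)
```

So any label built with a %g format will stay in scientific notation no matter what scipen is set to.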
A better solution would be to rescale your data. rpart switches to scientific notation only for big numbers, so you could divide the bad numbers by something like 1000 or 1000000, and describe them as being in different units. For your example, this works for me:
library(rpart)
#generate data
set.seed(123)
a = rnorm(100, 7000000, 10)/1000
b = rnorm(100, 5000000, 5)/1000
c = rnorm(100, 400000, 10)/1000
group <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5,0.5) )
group_1 <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
d = data.frame(a,b,c, group, group_1)
d$group = as.factor(d$group)
d$group_1 = as.factor(d$group_1)
#fit model
tree <- rpart(group ~ ., d)
#visualize results
plot(tree)
text(tree, use.n=TRUE, minlength = 0, xpd=TRUE, cex=.8)
Created on 2021-01-27 by the reprex package (v0.3.0)
I am using the following code to generate data, and I am estimating regression models across a list of variables (covar1 and covar2). I have also created confidence intervals for the coefficients and merged them together.
I have been examining all sorts of examples here and on other sites, but I can't seem to accomplish what I want. I want to stack the results for each covar into a single data frame, labeling each cluster of results by the covar it is attributable to (i.e., "covar1" and "covar2"). Here is the code for generating data and results using lapply:
##creating a fake dataset (N=1000, 500 at treated, 500 at control group)
#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 70, sd = 10))
#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = T)
##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]
#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))
#create covariates
set.seed(123)
covar1 <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 50, sd = 20))
covar2 <- c(rnorm(500, mean = 10, sd = 20), rnorm(500, mean = 10, sd = 30))
data <- data.frame(cbind(outcome, running.var, treat.ind, covar1, covar2))
data$treat.ind <- as.factor(data$treat.ind)
#Bundle the covariates names together
covars <- c("covar1", "covar2")
#loop over them using a convenient feature of the "as.formula" function
models <- lapply(covars, function(x){
regres <- lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = data)
ci <-confint(regres, level=0.95)
regres_ci <- cbind(summary(regres)$coefficient, ci)
})
names(models) <- covars
print(models)
Any nudge in the right direction, or link to a post I just haven't come across, is greatly appreciated.
You can use do.call, where the second argument is a list (as here):
do.call(rbind, models)
I made a (possible) improvement to your lapply function. This way you can save the estimated parameters and the variables in a data.frame:
models <- lapply(covars, function(x){
regres <- lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = data)
ci <-confint(regres, level=0.95)
regres_ci <- data.frame(covar=x,param=rownames(summary(regres)$coefficient),
summary(regres)$coefficient, ci)
})
do.call(rbind,models)
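If the list of coefficient matrices has already been built and named, as in the question, the labels can also be taken from the list names when stacking. The following is only a sketch under assumptions: it uses the built-in mtcars data as a stand-in, with "mpg" and "hp" playing the role of covar1 and covar2, and "wt" and "am" standing in for running.var and treat.ind.

```r
# Hypothetical stand-in for the question's data: mtcars ships with R.
covars <- c("mpg", "hp")

# Same kind of result as in the question: one matrix of coefficients plus CI per covariate.
models <- lapply(covars, function(x) {
  fit <- lm(as.formula(paste(x, "~ wt + am")), data = mtcars)
  cbind(summary(fit)$coefficients, confint(fit, level = 0.95))
})
names(models) <- covars

# Convert each matrix to a data frame that carries its list name, then stack them.
stacked <- do.call(rbind, lapply(names(models), function(nm) {
  m <- models[[nm]]
  data.frame(covar = nm, param = rownames(m), m,
             check.names = FALSE, stringsAsFactors = FALSE)
}))

# Drop the row names inherited from the matrices; the "param" column already holds them.
rownames(stacked) <- NULL
print(stacked)
```

Each covariate contributes three rows here (intercept plus two predictors), and the covar column records which model a row came from.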
So I am trying to use image recognition to output a regression style number using the mxnet package in R using a CNN.
I have used this as the basis of my analysis: https://rstudio-pubs-static.s3.amazonaws.com/236125_e0423e328e4b437888423d3821626d92.html
It is an image recognition analysis using a CNN with mxnet in R, so I followed the same preprocessing steps (resizing, grayscaling) to prepare my data.
My "image" dataset looks like this: I have 784 columns of pixels, and the last column is a numeric "label" column that I am trying to predict, with values such as 1132, 1491, 845, etc.
From there, I create training and testing sets:
library(pbapply)
library(caret)
## test/training partitions
training_index <- createDataPartition(image$STOPPING_TIME, p = .9, times = 1)
training_index <- unlist(training_index)
train_set <- image[training_index,]
dim(train_set)
test_set <- image[-training_index,]
dim(test_set)
## Fix train and test datasets
train_data <- data.matrix(train_set)
train_x <- t(train_data[, -785])
train_y <- train_data[,785]
train_array <- train_x
dim(train_array) <- c(28, 28, 1, ncol(train_x))
test_data <- data.matrix(test_set)
test_x <- t(test_data[, -785])
test_y <- test_data[,785]
test_array <- test_x
dim(test_array) <- c(28, 28, 1, ncol(test_x))
Now I get onto using the mxnet, which is what is causing problems, not sure what I am doing wrong:
library(mxnet)
## Model
mx_data <- mx.symbol.Variable('data')
## 1st convolutional layer 5x5 kernel and 20 filters.
conv_1 <- mx.symbol.Convolution(data = mx_data, kernel = c(5, 5), num_filter = 20)
tanh_1 <- mx.symbol.Activation(data = conv_1, act_type = "tanh")
pool_1 <- mx.symbol.Pooling(data = tanh_1, pool_type = "max", kernel = c(2, 2), stride = c(2,2 ))
## 2nd convolutional layer 5x5 kernel and 50 filters.
conv_2 <- mx.symbol.Convolution(data = pool_1, kernel = c(5,5), num_filter = 50)
tanh_2 <- mx.symbol.Activation(data = conv_2, act_type = "tanh")
pool_2 <- mx.symbol.Pooling(data = tanh_2, pool_type = "max", kernel = c(2, 2), stride = c(2, 2))
## 1st fully connected layer
flat <- mx.symbol.Flatten(data = pool_2)
fcl_1 <- mx.symbol.FullyConnected(data = flat, num_hidden = 500)
tanh_3 <- mx.symbol.Activation(data = fcl_1, act_type = "tanh")
## 2nd fully connected layer
fcl_2 <- mx.symbol.FullyConnected(data = tanh_3, num_hidden = 2)
## Output
label <- mx.symbol.Variable("label")
NN_model <- mx.symbol.MakeLoss(mx.symbol.square(mx.symbol.Reshape(fcl_2, shape = 0) - label))
## Set seed for reproducibility
mx.set.seed(100)
## Train on 1200 samples
model <- mx.model.FeedForward.create(NN_model, X = train_array, y = train_y,
num.round = 30,
array.batch.size = 100,
initializer=mx.init.uniform(0.002),
learning.rate = 0.05,
momentum = 0.9,
wd = 0.00001,
eval.metric = mx.metric.rmse,
epoch.end.callback = mx.callback.log.train.metric(100))
I get the error:
[00:30:08] D:\Program Files (x86)\Jenkins\workspace\mxnet\mxnet\dmlc-core\include\dmlc/logging.h:308: [00:30:08] d:\program files (x86)\jenkins\workspace\mxnet\mxnet\src\operator\tensor\./matrix_op-inl.h:134: Check failed: oshape.Size() == dshape.Size() (100 vs. 200) Target shape size is different to source. Target: (100,)
Source: (100,2)
Error in symbol$infer.shape(list(...)) :
Error in operator reshape9: [00:30:08] d:\program files (x86)\jenkins\workspace\mxnet\mxnet\src\operator\tensor\./matrix_op-inl.h:134: Check failed: oshape.Size() == dshape.Size() (100 vs. 200) Target shape size is different to source. Target: (100,)
Source: (100,2)
I can get it to work if I use
NN_model <- mx.symbol.SoftmaxOutput(data = fcl_2)
and keep the rmse there, but it doesn't improve performance of my model after 30 iterations.
Thanks!
Your last fully connected layer, fcl_2 <- mx.symbol.FullyConnected(data = tanh_3, num_hidden = 2), creates an output of shape (batch_size, 2), i.e. 200 values for a batch of 100.
The label, however, holds only batch_size = 100 values, so in (mx.symbol.Reshape(fcl_2, shape = 0) - label) the network output cannot be reshaped to match the label and subtracted from it. This is the "100 vs. 200" mismatch reported in the error.
Instead what you likely want to do is change your last fully connected layer to have only one hidden unit fcl_2 <- mx.symbol.FullyConnected(data = tanh_3, num_hidden = 1), as you say that you are trying to learn a network that predicts a single scalar output.
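The element counts behind that error can be checked in plain R without mxnet. This is only an illustration with placeholder arrays (the names out_two and out_single are made up for the example), not a reproduction of the network:

```r
batch_size <- 100

# Placeholder arrays with the shapes discussed above (the values are irrelevant here).
label      <- numeric(batch_size)                     # one target per image
out_two    <- matrix(0, nrow = batch_size, ncol = 2)  # output with num_hidden = 2
out_single <- matrix(0, nrow = batch_size, ncol = 1)  # output with num_hidden = 1

length(out_two)    == length(label)   # FALSE: 200 values vs. 100 labels
length(out_single) == length(label)   # TRUE: one prediction per label
```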
I have implemented the LDA model with rjags. And I successfully got the final samples with:
jags <- jags.model('../lda_jags.bug',
data = data,
n.chains = 1,
n.adapt = 100)
update(jags, 2000)
samples <- jags.samples(jags,
c('theta', 'phi', 'z'),
1000)
Now I can use samples$theta or samples$phi to get the results for theta and phi. But how can I find out how long the sampling took? Thanks!
As @eipi10 states, you can use system.time() around the update() call to time the process within R. Or you can use the runjags package, which prints the (total) time taken in updating the model, including all previous calls to extend.jags:
library('runjags')
results <- run.jags('../lda_jags.bug', monitor = c('theta', 'phi', 'z'),
data = data, n.chains = 1, adapt = 100, burnin = 2000, sample = 1000)
results
# or:
jags <- jags.model('../lda_jags.bug',
data = data,
n.chains = 1,
n.adapt = 0)
runjags <- as.runjags(jags, monitor = c('theta', 'phi', 'z'))
results <- extend.jags(runjags, adapt = 100, burnin = 2000, sample = 1000)
results
results <- extend.jags(runjags, sample = 1000)
results
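For the system.time() route, a minimal sketch follows. The function slow_step is a hypothetical stand-in workload so that the example runs without JAGS; in the question's code the timed expression would instead be the update(jags, 2000) or jags.samples(...) call.

```r
# Hypothetical stand-in workload; replace with update(jags, 2000) or the
# jags.samples(...) call from the question.
slow_step <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

# system.time() evaluates the expression and returns user, system and elapsed times.
# Assigning inside the call keeps the result of the timed expression.
timing <- system.time(result <- slow_step(1e6))

print(timing)
elapsed_seconds <- timing[["elapsed"]]
```

The elapsed entry is the wall-clock time in seconds, which is usually the number of interest for a sampling run.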
I am using RSNNS to build a model with the Quickprop algorithm. Here's my neural network:
library(RSNNS)
mydata1 <- read.csv("-1-5_rand1.csv")
mydata <- mydata1[1:151, ]
test_set <- mydata1[152:168, ]
test_set1 <- test_set[c(-7)]
a <- SnnsRObjectFactory()
input <- mydata[c(-7)]
output <- mydata[c(7)]
b <- splitForTrainingAndTest(input, output, ratio = 0.22)
a <- mlp(b$inputsTrain, b$targetsTrain, size = 9, maxit = 650, learnFunc = "Quickprop", learnFuncParams = c(0.01, 2.5, 0.0001, 0, 0), updateFunc = "Topological_Order",
updateFuncParams = c(0.0), hiddenActFunc = "Act_TanH", computeError=TRUE, initFunc = "Randomize_Weights", initFuncParams = c(-1,1),
shufflePatterns = TRUE, linOut = FALSE, inputsTest = b$inputsTest, targetsTest = b$targetsTest)
I am predicting on the test set as follows:
predictions <- predict(a, test_set1)
Is it possible in RSNNS to predict on the test set after every 50 cycles, instead of only after all 650 cycles?
The answer is that you can't do it with the high-level interface, but you can with the low-level interface. Have a look, e.g., at the mlp_irisSnnsR.R demo that is included in RSNNS.