Currently I am working on item-item based recommendation system using r. The package which I have used is arules. I have done my basic models but I want to modify my model with following criteria:
In the apriori algo. We will receive only one output, not multiple output. I want multiple output value in the rhs side. For example:
lhs rhs
{GH DAILY MOONG DAL PREMIUM 1kg,
MDH POW SPICE DEGHI CHILLI 100g,PREM 1kg} => {DAILY OTH PULSE CHANA DAL...
Rice}
My recommendation system totally based on item-item. Is there any other algorithm or package exist in r which will give me better business output?
How to calculate confidence and support value? For my case I am using default values.
My code is given below:
#Create Sparse Matrix
dataset = read.transactions('/Users/Nikita/Downloads/Reco_System/market_basket_before_model.csv', sep = ',', rm.duplicates = TRUE)
summary(dataset)
itemFrequencyPlot(dataset, topN = 20, type = 'absolute')
#1st cut
# Training Apriori on the dataset
rules = apriori(data = dataset, parameter = list(support = 0.001, confidence = 0.8))
# Visualising the results
inspect(sort(rules, by = 'lift')[1:30])
Thanks in advance.
Most implementations of association rule mining algorithms restrict the RHS of the rules to a single item to avoid further combinatorial explosion.
Related
So I have two datasets, og.data and newdata.df. I have matched their features and I want to use a feature from og.data to train a model so I can identify cases of this class in newdata.df. I am using the randomForest package in R documentation for it is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
split <- sample.split(og.data$class_label, SplitRatio = 0.7)
training_set = subset(og.data$class_label, split == TRUE)
test_set = subset(og.data$class_label, split == FALSE)
rf.classifier.object = randomForest(x = training_set[-1],
y = training_set$Engramcell,
ntree = 500)
I then use the test set to calculate the AUC, visualize ROC, precision, recall etc etc.
I do that using prediction probability generated like so...
predictions.df <- as.data.frame(predict(rf.classifier.object,
test_set,
type = "prob")
)
All is good I proceed to try to use the classifier I've trained on new data and now I am encountering a problem because the new data does not contain the feature class label. Whihc is annoying as the entire purpose of training the classifier to to label this newdata.
predictions.df <- as.data.frame(predict(rf.classifier.object,
newdata.df,
type = "prob")
)
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
variables in the training data missing in newdata
As per this stack post predict.randomForest(), called here as predict(), uses rownames of feature importance to make its precitions. And when I checked with a search of the feature names I find that it is infact the class label preventing me from making the test as I show bellow.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata) )]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on other data than the testing set. I have followed the instructions exactly this seems like a bad design choice to have made the function this way. The class label should not be used for calculating feature importance at all and should not even be considered a feature imo.
I have completed an eqtl analysis using the MatrixEQTL package.
me <- Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = 0.05,
useModel = modelLINEAR,
errorCovariance = errorCovariance,
verbose = TRUE,
pvalue.hist = TRUE,
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)
The code is taken from ?Matrix_eQTL_engine, which also contains the code to set things up.
I set a threshold pvalue as 0.05 but I am wondering if there is a way to find a threshold value by a test to include in this script. I also want to correct for multiple testing and as per matrixeqtl manual, it does correct for False Discovery Rate. But I want to find a threshold pvalue for multiple testing for Bonferroni correction and then used that pvalue. I hope to add that to this code as well. Thank you for any suggestion/help!
the author of the package here.
First, I would NOT suggest using pvOutputThreshold = 1 as it can result in storing billions of p-values and make R run out of memory.
Second, Matrix eQTL returns the number of tests it has conducted, so that you can do Bonferroni correction yourself. You can get it with this line:
me$all$ntests
or, if you differentiate local and distant tests:
me$cis$ntests
me$trans$ntests
surv.gbm in the mlr3 framework outputs linear predictors, however what I'm really interested in are predicted survival times per case, which I want to compare with the actual survival times. Is there a way to obtain actual survival times?
In the mlr3 book, there is an example of a transformation between linear predictors and a distribution.
pod = po("distrcompose", param_vals = list(form = "ph", overwrite = FALSE))
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output
Is there a way to change this pipeline so that it converts "lp" to "response" ?
Any help would be appriciated.
Yes this is definitely possible it just requires another transformation. Your first step is correct to compose a distribution from a linear predictor; as you're using surv.gbm only Cox PH is possible as the underlying model so default for distrcompose works for this.
Now you need to use crankcompose in order to create a survival time prediction from the distribution, you could use the mean, median, or mode of the distribution, people usually pick mean or median but that's your choice! Just make sure to include response = TRUE, overwrite = FALSE. Example code below, includes creating predictions and scoring with RMSE (surprisingly quite good!). I think the book may need updating...
Thanks,
Raphael
library(mlr3extralearners)
library(mlr3proba)
library(mlr3pipelines)
library(mlr3)
learn = ppl("crankcompositor", ppl("distrcompositor", lrn("surv.gbm")),
response = TRUE, overwrite = FALSE, method = "mean",
graph_learner = TRUE)
set.seed(1)
task = tgen("simsurv")$generate(50)
learn$train(task)
p = learn$predict(task)
p$score(msr("surv.rmse"))
I am trying to apply weights given with NIS data using the R package "survey", but I have been unsuccessful. I am fairly new to R and survey commands.
This is what I have tried:
# Create the unweighted dataset
d <- read.dta13(path)
# This produces the correct weighted amount of cases I need.
sum(d$DISCWT) # This produces the correct weighted amount of cases I need.
library(survey)
# Create survey object
dsvy <- svydesign(id = ~ d$HOSP_NIS, strata = ~ d$NIS_STRATUM, weights = ~ d$DISCWT, nest = TRUE, data = d)
d$count <- 1
svytotal(~d$count, dsvy)
However I get the following error after running the survey total:
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1], :
Stratum (1131) has only one PSU at stage 1
Any help would be greatly appreciated, thank you!
The error indicates that you have specified a design where one of the strata has just a single primary sampling unit. It's not possible to get an unbiased estimate of variance for a design like that: the contribution of stratum 1131 will end up as 0/0.
As you see, R's default response is to give an error, because a reasonably likely explanation is that the data or the svydesign statement is wrong. Sometimes, as here, that's not what you want, and the global option 'survey.lonely.psu' describes other ways to respond. You want to set
options(survey.lonely.psu = "adjust")
This and other options are documented at help(surveyoptions)
I am new to neural networks and the mxnet package in R. I want to do a logistic regression on my predictors since my observations are probabilities varying between 0 and 1. I'd like to weight my observations by a vector obsWeights I have, but I'm not sure where to implement the weights. There seems to be a weight= option in mx.symbol.FullyConnected but if I try weight=obsWeights I get the following error message
Error in mx.varg.symbol.FullyConnected(list(...)) :
Cannot find argument 'weight', Possible Arguments:
----------------
num_hidden : int, required
Number of hidden nodes of the output.
no_bias : boolean, optional, default=False
Whether to disable bias parameter.
How should I proceed to weight my observations? Here is my code at the moment.
# Prepare data
train.mm = model.matrix(obs ~ . , data = train_data)
train_label = train_data$obs
# Normalize
train.mm = apply(train.mm, 2, function(x) (x-min(x))/(max(x)-min(x)))
# Create MXDataIter compatible iterator
batch_size = 128
train.iter = mx.io.arrayiter(data=t(train.mm), label=train_label,
batch.size=batch_size, shuffle=T)
# Symbolic model definition
data = mx.symbol.Variable('data')
fc1 = mx.symbol.FullyConnected(data=data, num.hidden=128, name='fc1')
act1 = mx.symbol.Activation(data=fc1, act.type='relu', name='act1')
final = mx.symbol.FullyConnected(data=act1, num.hidden=1, name='final')
logistic = mx.symbol.LogisticRegressionOutput(data=final, name='logistic')
# Run model
mxnet_train = mx.model.FeedForward.create(
symbol = logistic,
X = train.iter,
initializer = mx.init.Xavier(rnd_type = 'gaussian', factor_type = 'avg', magnitude = 2),
num.round = 25)
Assigning the fully connected weight argument is not what you want to do at any rate. That weight is a reference to parameters of the layer; i.e., what you multiply in the inputs by to get output values These are the parameter values you're trying to learn.
If you want to make some samples matter more than others, then you'll need to adjust the loss function. For example, multiply the usual loss function by your weights so that they do not contribute as much to the overall average loss.
I do not believe the standard Mxnet loss functions have a spot for assigning weights (that is LogisticRegressionOutput won't cover this). However, you can make your own cost function that does. This would involve passing your final layer through a sigmoid activation function to first generate the usual logistic regression output value. Then pass that into the loss function you define. You could do squared error, but for logistic regression you'll probably want to use the cross entropy function:
l * log(y) + (1 - l) * log(1 - y),
where l is the label and y is the predicted value.
Ideally, you'd write a symbol with an efficient definition of the gradient (Mxnet has a cross entropy function, but its for softmax input, not a binary output. You could translate your output to two outputs with softmax as an alternative, but that seems less easy to work with in this case), but the easiest path would be to let Mxnet do its autodiff on it. Then you multiply that cross entropy loss by the weights.
I haven't tested this code, but you'd ultimately have something like this (this is what you'd do in python, should be similar in R):
label = mx.sym.Variable('label')
out = mx.sym.Activation(data=final, act_type='sigmoid')
ce = label * mx.sym.log(out) + (1 - label) * mx.sym.log(1 - out)
weights = mx.sym.Variable('weights')
loss = mx.sym.MakeLoss(weigths * ce, normalization='batch')
Then you want to input your weight vector into the weights Variable along with your normal input data and labels.
As an added tip, the output of an mxnet network with a custom loss via MakeLoss outputs the loss, not the prediction. You'll probably want both in practice, in which case its useful to group the loss with a gradient-blocked version of the prediction so that you can get both. You'd do that like this:
pred_loss = mx.sym.Group([mx.sym.BlockGrad(out), loss])