How to assign new text to a built model (text mining) in R

Yesterday I found some good R code for emotion classification and took part of it:
happy = readLines("./happy.txt")
sad = readLines("./sad.txt")
happy_test = readLines("./happy_test.txt")
sad_test = readLines("./sad_test.txt")
tweet = c(happy, sad)
tweet_test= c(happy_test, sad_test)
tweet_all = c(tweet, tweet_test)
sentiment = c(rep("happy", length(happy)),
              rep("sad", length(sad)))
sentiment_test = c(rep("happy", length(happy_test)),
                   rep("sad", length(sad_test)))
sentiment_all = as.factor(c(sentiment, sentiment_test))
library(RTextTools)
mat = create_matrix(tweet_all, language = "english",
                    removeStopwords = FALSE, removeNumbers = TRUE,
                    stemWords = FALSE, weighting = tm::weightTfIdf)
container = create_container(mat, as.numeric(sentiment_all),
                             trainSize = 1:160, testSize = 161:180, virgin = FALSE)
models = train_models(container, algorithms = c("MAXENT",
                                                "SVM",
                                                # "GLMNET", "BOOSTING",
                                                "SLDA", "BAGGING",
                                                "RF", # "NNET",
                                                "TREE"
                                                ))
# test the model
results = classify_models(container, models)
table(as.numeric(sentiment_all[161:180]), results[,"FORESTS_LABEL"])
All is good, but I have one question. Here we work with data where we ourselves tell the machine which texts are sad and which are happy. If I have new documents with no indication of which are sad and which are happy (or which are positive and which are negative), say loaded with n = read.csv("C:/1/ttt.csv"), how can the built model decide which phrases are negative and which are positive?

Well, what was the purpose of building a model to detect what is sad and what is happy in the first place? What do you want to achieve? (Also, this does not really look like a SO question/answer.)
You are using supervised learning on labeled data (you already know whether each text is sad or happy) to learn what defines those classes, so that later on you can use the models you built to predict new content for which you do not have labels.
Any transformation applied to the training data also has to be applied to the new incoming data, and you then ask your model to predict (evaluate) on this new input. You use the prediction as the result. This does not change your model; it just evaluates it on new data.
Another scenario is that you obtain new labeled data and want to update your model: you can retrain it on the new data and possibly learn new models with more features.
In your case you should look at the classify_model and classify_models functions in that package, as sketched below.
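A minimal sketch of how that could look with the code above (illustrative only; the column name text in ttt.csv is an assumption, so adjust it to your file):

library(RTextTools)

# New, unlabeled documents
new_docs = read.csv("C:/1/ttt.csv", stringsAsFactors = FALSE)$text

# Build the document-term matrix with the SAME options as for training and
# align it to the training matrix via originalMatrix so the terms match.
new_mat = create_matrix(new_docs, language = "english",
                        removeStopwords = FALSE, removeNumbers = TRUE,
                        stemWords = FALSE, weighting = tm::weightTfIdf,
                        originalMatrix = mat)

# virgin = TRUE tells RTextTools the true labels are unknown; the zeros are
# only placeholders because create_container requires a label vector.
new_container = create_container(new_mat, labels = rep(0, length(new_docs)),
                                 testSize = 1:length(new_docs), virgin = TRUE)

new_results = classify_models(new_container, models)

# Map the numeric predictions back to "happy"/"sad" using the factor levels
pred_codes = as.numeric(as.character(new_results[, "FORESTS_LABEL"]))
levels(sentiment_all)[pred_codes]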

Related

How to fix unbalanced data in the synthetic control method?

I am currently writing a research project on the effects of mine closures on voting behaviour in a given area. For this research I have chosen the synthetic control method. Now I have run into trouble with the synth package: each time I try to use dataprep on the data to create the synthetic control unit, I get error messages. The messages show the following:
"Your panel, as described by unit.variable and time.variable, is unbalanced. Balance it and run again."
I have modelled my data after Abadie's dataset used in his study on terrorism in the Basque region. I should also note that there is no missing data in my dataset, nor are there outliers.
I have tried to make several changes to my code; however, each time I try this, I run into trouble. Moreover, I have tried copying code from others who came up with a solution, but this did not work either. I would be very thankful if someone could help me with my problem.
Some other lovely person has helped me with my previous problem, for which I am very grateful. However, being new to coding, I do not really have any idea as to how to solve my problem.
dataprep_outcomes <- dataprep(foo = dataset[dataset$Year %in% c(1948:1986), ],
                              predictors = c("Income", "Distance", "Gini", "Percentage_voted", "Protest"),
                              dependent = c("Percentage_voted"),
                              unit.variable = c("Municipality_No"),
                              time.variable = c("Year"),
                              treatment.identifier = 1,
                              controls.identifier = c(2:14),
                              time.predictors.prior = intersect(1948:1965, dataset$Year),
                              time.optimize.ssr = intersect(1948:1986, dataset$Year),
                              unit.names.variable = c("Municipality_ID"),
                              time.plot = intersect(1948:1986, dataset$Year))
I would like to run my dataprep. If one has suggestions regarding the manner in which I can alter my data, that would be welcome as well!
Thank you in advance.
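The "unbalanced" message from dataprep usually means that some unit-year combinations are missing or duplicated. A minimal diagnostic sketch (assuming the column names used in the dataprep call above) to find which combinations break the balance:

tab <- with(dataset, table(Municipality_No, Year))
# Every municipality should appear exactly once per year; any 0 (missing year)
# or value greater than 1 (duplicate row) makes the panel unbalanced.
which(tab != 1, arr.ind = TRUE)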

Explanation of output for the Naive Bayes algorithm in R

I am new to statistics and data analysis in R.
Today I was trying the Naive Bayes algorithm in R.
The problem I am facing is that I am unable to understand the output of the prediction.
The code is as follows:
install.packages('ElemStatLearn')
library('ElemStatLearn')
library("klaR")
library("caret")

# 90/10 train/test split; column 58 of the spam data is the class label
sub = sample(nrow(spam), floor(nrow(spam) * 0.9))
train = spam[sub, ]
test = spam[-sub, ]
xTrain = train[, -58]
yTrain = train$spam
xTest = test[, -58]
yTest = test$spam

# Naive Bayes model with 10-fold cross-validation
model = train(xTrain, yTrain, 'nb', trControl = trainControl(method = 'cv', number = 10))
prop.table(table(predict(model$finalModel, xTest)$class, yTest))
The result displayed is as follows:

             yTest
                 email        spam
  email     0.33405640  0.02603037
  spam      0.24945770  0.39045553
You can refer to this link: http://joshwalters.com/2012/11/27/naive-bayes-classification-in-r.html
The result that you have displayed is called a 'confusion matrix'. It is used to verify how well your classifier has worked.
You will need to understand a few terms here: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Compare the generic confusion matrix layout with the one in your case.
The diagonal from top left to bottom right gives you the percentage of correct predictions, and the other two values indicate the percentage of cases where your classifier got "confused".
Hope this gives an initial idea.
Google for confusion matrix and you can find more.
One good link is here: https://classeval.wordpress.com/introduction/basic-evaluation-measures/
This is not the naive Bayes model's output.
Once you have used predict, you don't really "care" about the model any more, because you have already obtained the prediction.
prop.table turns the counts of each combination into proportions of the entire population. You might want to look at the table without the proportion part to see the actual counts, as shown below.
For example, 33.4% of the cases are predicted as email and actually are email, while 2.6% are predicted as email but are actually spam.
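For instance, to see the raw counts and an overall accuracy instead of proportions (using the same objects as in the question):

# Raw counts of predicted class (rows) versus actual class (columns)
cm <- table(predict(model$finalModel, xTest)$class, yTest)
cm

# Overall accuracy: correct predictions (the diagonal) over all test cases
sum(diag(cm)) / sum(cm)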

Exclude specific tensors from being updated by the optimizer in TensorFlow

I have two graphs that I want to train independently, which means I have two different optimizers, but one of them uses tensor values from the other graph. As a result, I need to be able to stop specific tensors from being updated while training one of the graphs. I have assigned two different name scopes to my tensors and use this code to control which tensors are updated by which optimizer:
mentor_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentor")
train_op_mentor = mnist.training(loss_mentor, FLAGS.learning_rate, mentor_training_vars)
mentee_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentee")
train_op_mentee = mnist.training(loss_mentee, FLAGS.learning_rate, mentee_training_vars)
The var_list argument is used as below, in the training method of the mnist module:
def training(loss, learning_rate, var_list):
    # Add a scalar summary for the snapshot loss.
    tf.summary.scalar('loss', loss)
    # Create the gradient descent optimizer with the given learning rate.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # Create a variable to track the global step.
    global_step = tf.Variable(0, name='global_step', trainable=False)
    # Use the optimizer to apply the gradients that minimize the loss
    # (and also increment the global step counter) as a single training step.
    train_op = optimizer.minimize(loss, global_step=global_step, var_list=var_list)
    return train_op
I'm using the var_list argument of the optimizer's minimize method in order to control which variables are updated by the optimizer.
Right now I'm not sure whether I have done this appropriately, and whether there is any way to check that an optimizer only updates part of a graph.
I would appreciate it if anyone could help me with this issue.
Thanks!
I have had a similar problem and used the same approach as you, i.e. via the var_list argument of the optimizer. I then checked whether the variables not intended for training stayed the same using:
the_var_np = sess.run(tf.get_default_graph().get_tensor_by_name('the_var:0'))
assert np.equal(the_var_np, pretrained_weights['the_var']).all()
pretrained_weights is a dictionary returned by np.load('some_file.npz') which I used to store the pre-trained weights to disk.
Just in case you need that as well, here is how you can override a tensor with a given value:
value = pretrained_weights['the_var']
variable = tf.get_default_graph().get_tensor_by_name('the_var:0')
sess.run(tf.assign(variable, value))

Run 3000+ Random Forest Models By Group Using Spark MLlib Scala API

I am trying to build random forest models by group (School_ID, more than 3,000 groups) on a large model-input CSV file using the Spark Scala API. Each group contains about 3000-4000 records. The resources I have at my disposal are 20-30 AWS m3.2xlarge instances.
In R, I can construct models by group and save them to a list like this-
library(dplyr); library(randomForest)
Rf_model <- train %>% group_by(School_ID) %>%
  do(school = randomForest(formula = Rf_formula, data = ., importance = TRUE))
The list can be stored somewhere and I can load it when I need to use it, like below:
save(Rf_model, file = paste0(Modelpath, "Rf_model.dat"))
load(file = paste0(Modelpath, "Rf_model.dat"))
pred <- predict(Rf_model$school[school_index][[1]], newdata = test)
I was wondering how to do that in Spark, whether or not I need to split the data by group first and how to do it efficiently if it's necessary.
I was able to split up the file by School_ID with the code below, but it seems to create one individual job to subset the data for each iteration and takes a long time to finish. Is there a way to do it in one pass?
model_input.cache()

val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID))

for (i <- 0 to schools.length - 1) {
  bySchoolArray(i).
    write.format("com.databricks.spark.csv").
    option("header", "true").
    save("model_input_bySchool/model_input_" + schools(i))
}
Source:
How can I split a dataframe into dataframes with same column values in SCALA and SPARK
Edit 8/24/2015
I'm trying to convert my dataframe into a format that is accepted by the random forest model. I'm following the instructions in this thread:
How to create correct data frame for classification in Spark ML
Basically, I create a new variable "label" and store my class as a Double. Then I combine all my features using the VectorAssembler function and transform my input data as follows:
val assembler = new VectorAssembler().
  setInputCols(Array("COL1", "COL2", "COL3")).
  setOutputCol("features")

val model_input = assembler.transform(model_input_raw).
  select("SCHOOL_ID", "label", "features")
Partial error message (let me know if you need the complete log message):
scala.MatchError: StringType (of class
org.apache.spark.sql.types.StringType$)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:57)
This is resolved after converting all the variables to numeric types.
Edit 8/25/2015
The ml model doesn't accept the label I coded manually, so I need to use StringIndexer to work around the problem, as indicated here. According to the official documentation, the most frequent label gets 0, which causes inconsistent labels across School_ID. I was wondering if there's a way to create the labels without reordering the values.
val indexer = new StringIndexer().
  setInputCol("label_orig").
  setOutputCol("label")
Any suggestions or directions would be helpful and feel free to raise any questions. Thanks!
Since you already have a separate data frame for each school, there is not much to be done here. Since you use data frames, I assume you want to use ml.classification.RandomForestClassifier. If so, you can try something like this:
1. Extract the pipeline logic. Adjust the RandomForestClassifier parameters and transformers according to your requirements:
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}

def trainModel(df: DataFrame): PipelineModel = {
  val rf = new RandomForestClassifier()
  val pipeline = new Pipeline().setStages(Array(rf))
  pipeline.fit(df)
}
2. Train models on each subset:
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
3. Save the models:
import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}
Optional: since the individual subsets are rather small, you can try an approach similar to the one I've described here to submit multiple tasks at the same time.
If you use mllib.tree.model.RandomForestModel, you can omit step 3 and use model.save directly. Since there seem to be some problems with deserialization (see How to deserialize Pipeline model in spark.ml?; as far as I can tell it works just fine, but better safe than sorry, I guess), it could be the preferred approach.
Edit
According to the official documentation:
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.
Since the error indicates that your column is a String, you should transform it first, for example using StringIndexer.

Hidden Markov models package in R

I need some help implementing an HMM module in R. I'm new to R and don't have a lot of knowledge of it.
I have to implement an information extraction (IE) system using an HMM. I have two folders with files: one with the sentences and the other with the corresponding tags I want to learn from each sentence.
folder1 > event1.txt: "2013 2nd International Conference on Information and Knowledge Management (ICIKM 2013) will be held in Chengdu, China during July 20-21, 2013."
folder2 > event1.txt:
"N: 2nd International Conference on Information and Knowledge Management (ICIKM 2013)
D: July 20-21, 2013
L: Chengdu, China"
N -> Name; D -> Date; L -> Location
My question is how to implement this in R: how do I initialize the model, and how do I train it? And then how do I apply it to a new sentence to extract the information?
Thanks in advance for all the help!
If you run the following command:
RSiteSearch('hidden markov model')
Then it finds 4 task views, 40 vignettes, and 255 functions (as of when I ran it; there could be more by the time you run it).
I would suggest looking through those results (probably start with the views and vignettes) to see if anything there works for you. If not, then tell us what you have tried and what you need that is not provided there.
I'm not sure what exactly you want to do, but you might find this excellent tutorial on hidden Markov models using R useful. You build the functions and Markov models from scratch starting from regular Markov models and then moving to hidden Markov models. That's really valuable to understand how they work.
There is also the R package depmixS4 for specifying and fitting hidden Markov models. Its documentation is pretty solid, and going through the example code might help you.
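For orientation, here is a minimal sketch of the depmixS4 workflow on toy data (the response column, family, and number of states are placeholders to adapt to your own data):

library(depmixS4)

# Toy data: a single numeric observation sequence drawn from two regimes
set.seed(1)
obs <- data.frame(y = c(rnorm(100, 0), rnorm(100, 3)))

# Specify a 2-state HMM with a Gaussian response, fit it, then decode states
mod <- depmix(y ~ 1, data = obs, nstates = 2, family = gaussian())
fitted_mod <- fit(mod, verbose = FALSE)

summary(fitted_mod)           # estimated transition and response parameters
head(posterior(fitted_mod))   # most likely state and state probabilities per row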
depmixS4 is the most general and a reasonably good package, if you can get it to work on your data. It checked out on dummy data for me but gave an error on real data. The HMM package also works, but only if you have discrete variables and not continuous ones.
DepmixS4 is what you are looking for.
First of all, you need to identify the best number of hidden states for your model. This can be done by fitting models with different numbers of hidden states and taking the one with the lowest AIC.
I have created a function HMM_model_execution which returns the fitted model and the number of hidden states.
library(depmixS4)

# The first column of doc_data should be the visible (observed) state; the
# remaining columns are external variables (covariates).
# k is the total number of hidden states to compare (models with 2..k states are fitted).
HMM_model_execution <- function(doc_data, k)
{
  aic_values <- vector(mode = "numeric", length = k - 1)  # to store AIC values
  for (i in 2:k)
  {
    print(paste("loop counter", i))
    mod <- depmix(response = doc_data$numpresc ~ 1, data = doc_data, nstates = i)
    fm <- fit(mod, verbose = FALSE)
    aic_values[i - 1] <- AIC(fm)
    # print(paste("AIC value at this index", aic_values[i - 1]))
  }
  min_index <- which.min(aic_values)  # min_index + 1 is the number of hidden states of the best model
  # print(paste("index of minimum AIC", min_index))

  # Refit the best model
  mod <- depmix(response = doc_data$numpresc ~ 1, data = doc_data, nstates = (min_index + 1))
  fm <- fit(mod, verbose = FALSE)
  print(paste("best model with number of hidden states", min_index + 1))
  return(list(fm, min_index + 1))
}
External variables (covariates) can be passed in the depmix function. summary(fm) will give you all model parameters.
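A hypothetical call of the function above, assuming doc_data is a data frame with a numeric column numpresc as used inside the function:

# Compare models with 2..5 hidden states and keep the one with the lowest AIC
result <- HMM_model_execution(doc_data, k = 5)
best_model <- result[[1]]   # fitted depmixS4 model
n_states   <- result[[2]]   # chosen number of hidden states
summary(best_model)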
