I'm new to R, and it's the first time I'm using SOM.
I want to predict survival using a Self-Organizing Map.
The following is the code I used to ingest the data:
# Load raw data
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)
# Add a "Survived" variable to the test set to allow for combining data sets
test.survived <- data.frame(survived = rep("None", nrow(test)), test[,])
# Combine data sets
data.combined <- rbind(train, test.survived)
# Convert the variables to factors
data.combined$Survived <- as.factor(data.combined$survived)
data.combined$Pclass <- as.factor(data.combined$pclass)
# Fit the data to the SOM model
library(kohonen)
# Train SOM
som.train.1 <- data.combined[1:891, c("pclass", "title")]
som.label <- as.factor(train$survived)
table(som.train.1)
table(som.label)
som.train.1.grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")
set.seed(1234)
som.model <- som(som.label,
grid=som.train.1.grid,
rlen = 100,
alpha = c(0.05, 0.01),
keep.data = TRUE,
normalizeDataLayers = TRUE)
plot(som.model)
I get an error that says: sort.list(y): 'x' must be atomic for 'sort.list'
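One thing that may be worth checking, offered as a sketch rather than a diagnosis: as far as I know, som() from kohonen expects a numeric matrix as its data argument, whereas the call above passes the factor som.label. The base-R snippet below shows one way to turn two categorical columns into a numeric indicator matrix with model.matrix(). The toy data frame and its values are invented for illustration; only the column names pclass and title come from the question.

```r
# Toy stand-in for data.combined[1:891, c("pclass", "title")] (values are invented).
toy <- data.frame(
  pclass = factor(c("1", "3", "2", "3", "1", "2")),
  title  = factor(c("Mr.", "Mrs.", "Miss.", "Mr.", "Miss.", "Mrs."))
)

# Indicator (0/1) columns for the factor levels. With "- 1" (no intercept) the
# first factor keeps all of its levels, while later factors drop a reference level.
som.input <- model.matrix(~ pclass + title - 1, data = toy)

# The result is a plain numeric matrix with one row per observation.
str(som.input)
```

A matrix like this could then be tried as the first argument to som() in place of som.label, with the labels kept aside for inspecting the trained map.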
#data splicing
set.seed(12345)
train <- sample(1:nrow(student.mat.pass.or.fail),size =
ceiling(0.80*nrow(student.mat.pass.or.fail)),replace = FALSE)
# training set
students_train <- student.mat.pass.or.fail[train,]
# test set
students_test <- student.mat.pass.or.fail[-train,]
# penalty matrix
penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)
# building the classification tree with rpart
library(rpart)
tree <- rpart(class~.,
data = students_train, # as.matrix(students_train)
parms = list(loss = penalty.matrix),
method = "class")
I get the error "object is not a matrix". Can someone help me? I'm new to R. I also tried as.matrix(students_train), but it still shows the same problem.
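For comparison, here is a call of the same shape that runs on the kyphosis data bundled with rpart, where the response is a two-level factor and the loss matrix is 2 x 2. It is only a reference point: it assumes, without being able to check, that class in the question should likewise be a factor with as many levels as the penalty matrix has rows.

```r
library(rpart)  # rpart ships with standard R installations as a recommended package

# Same penalty matrix as in the question: zero diagonal, positive off-diagonal costs.
penalty.matrix <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2)

# kyphosis$Kyphosis is a factor with two levels, matching the 2 x 2 matrix.
tree <- rpart(Kyphosis ~ Age + Number + Start,
              data = kyphosis,
              parms = list(loss = penalty.matrix),
              method = "class")

# Quick checks to repeat on your own data.
is.factor(kyphosis$Kyphosis)
nlevels(kyphosis$Kyphosis) == nrow(penalty.matrix)
```

Running the same two checks on students_train$class may help narrow down where the two setups differ.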
I have a dataset of 25 variables and 248 rows.
There are 8 factor variables, and the rest are integer or numeric.
I am trying to run XGBoost.
I have written the following code:
# Partition Data
set.seed(1234)
ind <- sample(2, nrow(mission), replace = T, prob = c(0.7,0.3))
train <- mission[ind == 1,]
test <- mission[ind == 2,]
# Create matrix - One-Hot Encoding for Factor variables
trainm <- sparse.model.matrix(GRL ~ .-1, data = train)
head(trainm)
train_label <- train[,"GRL"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(GRL~.-1, data = test)
test_label <- test[,"GRL"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm),label = test_label)
The response variable here is "GRL", and I extract the test labels with test_label <- test[,"GRL"].
The code above runs, but when I try to use the result in xgb.DMatrix I encounter the following error:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I have partitioned the data into 70:30.
test[,"GRL"] returns a data.frame (this happens when test is a tibble or data.table; a plain base data.frame would drop to a vector), and XGBoost needs the label to be a vector.
Just use test$GRL or test[["GRL"]] instead. You also need to do the same for the training dataset.
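A small base-R illustration of the difference, using an invented toy data frame. Here drop = FALSE is used to reproduce the one-column data frame that a tibble would return; note that length() of such an object is the number of columns, which would explain a label-length mismatch.

```r
# Toy data frame; GRL is the response column named in the question, values invented.
test <- data.frame(GRL = c(0, 1, 1, 0), x = c(2.5, 1.0, 3.2, 0.7))

# drop = FALSE keeps a one-column data frame, which is what tibbles return by default.
as_frame <- test[, "GRL", drop = FALSE]

# [[ ]] and $ always return the underlying vector.
as_vector <- test[["GRL"]]

length(as_frame)   # counts columns, not rows
length(as_vector)  # counts rows
```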
I use PCA on my divided train dataset and project the test dataset to the results after removing irrelevant columns.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers, and combine the components and labels into xgb.DMatrix form.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as follows:
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
When I run this, there is a warning, but the training still runs:
xgboost: label will be ignored
I can predict on the training dataset using the model, but when I try to predict on the test dataset I get an error:
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
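One detail in the projection step above may be relevant: traincom keeps only the first 500 columns of cuisine.pca$x, while testcom is multiplied by the full rotation matrix and so keeps every component. The base-R sketch below, on invented synthetic data and with a hypothetical k standing in for 500, keeps the same components on both sides and checks that the column names agree. It is a sketch of the general technique, not a confirmed fix for the error.

```r
set.seed(1)
# Synthetic numeric data standing in for the feature columns (invented values).
X <- matrix(rnorm(40 * 6), nrow = 40, dimnames = list(NULL, paste0("f", 1:6)))
train_rows <- 1:30
k <- 3  # hypothetical number of components to keep (500 in the question)

pca <- prcomp(X[train_rows, ])

# Keep the same k components for both sets.
train_pc <- pca$x[, 1:k]
test_pc  <- predict(pca, newdata = X[-train_rows, ])[, 1:k]

# Equivalent manual projection: centre with the training means, then rotate.
test_manual <- scale(X[-train_rows, ], center = pca$center, scale = FALSE) %*%
  pca$rotation[, 1:k]

identical(colnames(train_pc), colnames(test_pc))
```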
I'm trying to implement a Naive Bayes classifier on a data set that contains text data in the form of customer complaints (Complaint) and Reddit comments (General_Text). The whole set has 250,000 texts per category; however, I use only 1,000 texts per category in the example posted here, and I get the same result with the whole data set. I did the text preprocessing with the "tm" package beforehand, so that should not be the issue.
The data frame is structured as follows with 1000 entries for Complaint and General_Text:
type text
"General_Text" "random words"
"Complaint" "other random words"
For the classification task I split the data into a training set, on which the algorithm should learn, and a test set to measure the accuracy. The naive Bayes algorithm is from the "e1071" library.
library(tm) # provides Corpus, DocumentTermMatrix and removeSparseTerms used below
library(plyr)
library(e1071)
library(caret)
library(MLmetrics)
# Import data and rename columns to $type and $text
General_Text<- read.csv("General_Text.csv", sep=";", head=T, stringsAsFactors = F)
Complaints<- read.csv("Complaints.csv", sep=";", head=T, stringsAsFactors = F)
Data <- rbind(General_Text, Complaints)
colnames(Data) <- c("type", "text")
# $type as factor and $text as string
Data$text <- iconv(Data$text, to = "UTF-8")
Data$type <- factor(Data$type)
# Split the data into training set (1400 texts) and test set (600 texts)
set.seed(1234)
trainIndex <- createDataPartition(Data$type, p = 0.7, list = FALSE, times = 1)
trainData <- Data[trainIndex,]
testData <- Data[-trainIndex,]
# Create corpus for training data
corpus<- Corpus(VectorSource(trainData$text))
# Create Document Term Matrix for training data
docs_dtm <- DocumentTermMatrix(corpus, control = list(global = c(2, Inf)))
# Remove Sparse Terms in DTM
docs_dtm_train <- removeSparseTerms(docs_dtm , 0.97)
# Convert counts into "Yes" or "No"
convert_counts <- function(x){
x <- ifelse(x > 0, 1, 0)
x <- factor(x, levels = c(0,1), labels = c("No", "Yes"))
return (x)
}
# Apply convert_counts function to the training data
docs_dtm_train <- apply(docs_dtm_train, MARGIN = 2, convert_counts)
# Create Corpus for test set
corpus_2 <- Corpus(VectorSource(testData$text))
# Create Document Term Matrix for test data
docs_dtm_2 <- DocumentTermMatrix(corpus_2, list(global = c(2, Inf)))
# Remove Sparse Terms in DTM
docs_dtm_test <- removeSparseTerms(docs_dtm_2, 0.97)
# Apply convert_counts function to the test data
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)
# Naive Bayes Classification
nb_classifier <- naiveBayes(docs_dtm_train, trainData$type)
nb_test_pred <- predict(nb_classifier, newdata = docs_dtm_test)
# Output as Confusion Matrix
ConfusionMatrix(nb_test_pred, testData$type)
I'm sorry that I cannot deliver the data and thus a reproducible example. The result which the code delivers is pretty demoralizing: It identifies all the texts as Complaints and none as General Texts.
> ConfusionMatrix(nb_test_pred, testData$type)
              y_pred
y_true         Complaint General_Text
  Complaint          300            0
  General_Text       300            0
I also get the following warning message: In data.matrix(newdata) : NAs introduced by coercion
Could anyone tell me whether I made a mistake in my code, or let me know if you have had a similar issue?
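One thing that may be worth checking: the training and test document-term matrices above are built independently, so their term columns need not match. The base-R sketch below shows how a test matrix could be reshaped to the training vocabulary. The helper align_terms and the toy terms and counts are hypothetical, and whether this accounts for the all-Complaint predictions is an assumption that cannot be verified without the data.

```r
# Hypothetical helper: reshape a test term matrix so it has exactly the
# training columns, in the same order. Terms unseen in training are dropped;
# training terms absent from the test set are filled with zero counts.
align_terms <- function(test_mat, train_terms) {
  out <- matrix(0, nrow = nrow(test_mat), ncol = length(train_terms),
                dimnames = list(rownames(test_mat), train_terms))
  shared <- intersect(colnames(test_mat), train_terms)
  out[, shared] <- test_mat[, shared, drop = FALSE]
  out
}

# Toy count matrices (invented terms and counts).
train_counts <- matrix(c(1, 0, 2, 0, 1, 1), nrow = 2,
                       dimnames = list(NULL, c("refund", "late", "broken")))
test_counts  <- matrix(c(0, 3, 1, 0), nrow = 2,
                       dimnames = list(NULL, c("late", "lol")))

test_aligned <- align_terms(test_counts, colnames(train_counts))
test_aligned
```

In tm itself, a similar effect is usually obtained by passing the training terms as a dictionary in the control list when building the test DocumentTermMatrix; the tm documentation has the exact form.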
I have a bnlearn model in R that is learned using the gs function with 4 categorical variables and 8 numerical variables.
When I try to validate my model with a test set, I get this error when trying to predict some of the nodes:
Error in check.fit.vs.data(fitted = object, data = data, subset = object[[node]]$parents) :
'Keyword' has different number of levels in the node and in the data.
Is it not possible to use both numerical and categorical variables with bnlearn? If it is possible, what am I doing wrong?
mydata$A <- as.factor(mydata$A)
mydata$B <- as.numeric(mydata$B)
mydata$C <- as.numeric(mydata$C)
mydata$D <- as.numeric(mydata$D)
mydata$E <- as.factor(mydata$E)
mydata$F <- as.numeric(mydata$F)
mydata$G <- as.numeric(mydata$G)
mydata$H <- as.numeric(mydata$H)
mydata$I <- as.numeric(mydata$I)
mydata$J <- as.numeric(mydata$J)
mydata$K <- as.numeric(mydata$K)
mydata$L <- as.numeric(mydata$L)
mydata$M <- as.numeric(mydata$M)
mydata$N <- as.numeric(mydata$N)
mydata$O <- as.numeric(mydata$O)
mydata$P <- as.numeric(mydata$P)
mydata$Q <- as.numeric(mydata$Q)
#create vector of black arcs
temp1=vector(mode = "character", length = 0)
for (i in 1:length(varnames)){
for (j in 1:length(varnames)){
temp1 <- c(temp1,varnames[i])
}
}
temp2=vector(mode = "character", length = 0)
for (i in 1:length(varnames)){
temp2 <- c(temp2,varnames)
}
# create the whitelisted arcs of the model
arcdata = read.csv("C:/users/asaf/desktop/in progress/whitearcs.csv", header = T)
wfrom=arcdata[,1]
wto=arcdata[,2]
whitelist = data.frame(from = wfrom,to =wto)
#block unwanted arcs
blacklist = data.frame(from = temp1, to = temp2)
#fit and plot the model
# structure learning with the grow-shrink (gs) algorithm
model = gs(mydata, whitelist = whitelist, blacklist = blacklist)
#inference procedure
learntmodel = bn.fit(model,mydata,method = "mle",debug = F)
graphviz.plot(learntmodel)
myvalidation=read.csv("C:/users/asaf/desktop/in progress/val.csv", header = T)
# predict A
pred = predict(learntmodel, node="A", myvalidation)
myvalidation$A <- pred
# predict B
pred = predict(learntmodel, node="B", myvalidation)
myvalidation$B <- pred
At this point it throws the following error:
Error in check.fit.vs.data(fitted = object, data = data, subset = object[[node]]$parents) :
'A' has different number of levels in the node and in the data.
bnlearn can't work with mixed variables (qualitative and quantitative) at the same time; I read that this is possible in the deal package.
Another possibility is to use discretize to transform your continuous variables into discrete variables:
dmydata <- discretize(mydata, breaks = 2, method = "interval")
model <- gs(dmydata, whitelist = whitelist, blacklist = blacklist)
... and continue your code.
Actually, I had the same problem today. I resolved it by ensuring that the other nodes connected to the one in question (i.e. $A) also had the same number of levels.
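A minimal base-R sketch of one way to make the levels agree, assuming the mismatch arises because the validation file is read separately and so gets its own set of levels (or plain character columns). The toy values are invented; mydata, myvalidation and A are the names from the question.

```r
# Toy training and validation data (invented values); A is a categorical node.
mydata       <- data.frame(A = factor(c("low", "mid", "high", "mid")))
myvalidation <- data.frame(A = c("mid", "low", "mid"), stringsAsFactors = FALSE)

# Rebuild the validation column as a factor with the training level set, so both
# have the same levels in the same order even if some values never occur.
myvalidation$A <- factor(myvalidation$A, levels = levels(mydata$A))

nlevels(mydata$A)
nlevels(myvalidation$A)
```

The same line would need to be repeated for every categorical column, including the parents of the node being predicted. Values in the validation data that never appeared in training become NA under this approach, so they are worth checking for.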