I have a training data set with 28 variables (13 labels and 15 features) and a test data set with the same 15 features, and I have to predict the labels for the test set based on those features. I have built KNN classifiers for all 13 labels individually.
Is there a way to combine these 13 individual single-label KNN classifiers into one multi-label classifier?
My current code for a single label:
library(class)
train_from_train <- train[1:600, 2:16]
target_a_train_from_train <- train[1:600, 17]
test_from_train <- train[601:800, 2:16]
target_a_test_from_train <- train[601:800, 17]
knn_pred_a <- knn(train = train_from_train, test = test_from_train, cl = target_a_train_from_train, k = 29)
table(knn_pred_a, target_a_test_from_train)
mean(knn_pred_a != target_a_test_from_train)
knn_pred_a_ON_TEST <- knn(train = train[, 2:16], test = test[, 2:16], cl = train[, 17], k = 29)
knn_pred_a_ON_TEST
I scoured the internet and the mldr package seems to be an option, but I couldn't adapt it to my needs.
You can use the RANN package for this. However, it is not exact as far as I know.
library(RANN)
library(reshape2)
####
## generate some sample data and randomize order
iris.knn <- iris[sample(1:150,150),]
#add a second class
iris.knn["Class2"] <- iris.knn[,5]=="versicolor"
iris.knn$org.row.id <- 1:nrow(iris.knn)
train <- iris.knn[1:100,]
test <- iris.knn[101:150,]
##
#####
## get nearest neighbours
nn.idx <- as.data.frame(nn2(train[1:4], query = test[1:4], k = 4)$nn.idx)
## add the original row id of each test observation
nn.idx$test.row.id <- test$org.row.id
# classes and row id
multiclass.vec <- data.frame(org.row.id = 1:150, iris.knn[, 5:6])
# 1 row per nearest neighbour
melted <- melt(nn.idx, id.vars = "test.row.id")
merged <- merge(melted, multiclass.vec, by.x = "value", by.y = "org.row.id")
# aggregate a single class
aggregate(merged$Species, list(merged$test.row.id), function(x) names(which.max(table(x))))
#### aggregate for all classes
all.classes <- melt(merged[c(2, 4, 5)], id.vars = "test.row.id")
fun.agg <- function(x) {
  if (length(x) == 0) {
    "" # <-- default value; adapt if needed.
  } else {
    names(which.max(table(x)))
  }
}
dcast(all.classes, test.row.id ~ variable, fun.aggregate = fun.agg, fill = NULL)
I showed the plain aggregate() call for a single class only; doing this step for all classes at once requires the second melt operation, which makes the code pretty messy.
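For completeness, a minimal alternative sketch (my own, not from the answer above) that stays with class::knn, using the train and test objects from the question (not the iris subsets above): simply run knn once per label column and bind the predictions together. The label column positions 17:29 are an assumption based on the layout in the question (features in columns 2:16, first label in column 17).
# Sketch only: loop class::knn over the 13 label columns and cbind the results.
label_cols <- 17:29  # assumed positions of the 13 label columns
multi_pred <- sapply(label_cols, function(j) {
  as.character(class::knn(train = train[, 2:16], test = test[, 2:16],
                          cl = train[, j], k = 29))
})
colnames(multi_pred) <- names(train)[label_cols]
head(multi_pred)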
I have already written the following code, all of which works OK:
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with just the "n-n-n-n" part of the name of each dataset
DS_name_list <- stri_sub(file_list, 49, 55)  # stri_sub() comes from the stringi package
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# This command reads all the data in each of the N csv files via their names
# stored in the 'file_list' list of characters.
csvs <- lapply(file_list, read.csv)
### Run a Backward Elimination Stepwise Regression on each of the N csvs.
# Assign the full model (meaning the one with all 30 candidate regressors
# included as the initial model in step 1).
# This is crucial: if the initial model contains fewer than the total number
# of candidate factors available for Stepwise to select from in the datasets,
# then it could miss 1 or more of the true factors.
full_model <- lapply(csvs, function(i) {
  lm(formula = Y ~ ., data = i)
})
# my failed attempt at figuring it out myself
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {step(object = i[["coefficients"]],
direction = 'backward', scope = formula(full_model), trace = 0)})
When I hit run on the above 2 lines of code after setting the seed, I get
the following error message in the Console:
Error in terms(object) : object 'i' not found
To briefly elaborate on why it is absolutely essential that the initial model include all candidate regressors when running a Backward Elimination version of Stepwise Regression, consider the following example:
Say we start out with an initial model containing only 25 of the 30 candidate regressors, i.e. X1:X25 instead of X1:X30. In that case, Stepwise Regression could never select/choose any of the IVs/factors X26 through X30 for dataset j, even if 1 or more of those really are included in the true underlying population model that characterizes dataset j.
Instead of two lapply loops, one to fit the models and a second to run the stepwise regressions, use a single for loop that does both operations one after the other. This is an environments thing: step() does not find the data when it is run inside the environment of the lapply function.
I have also changed the code that creates DS_name_list. The version below processes the full file names without any code that depends on string positions.
DS_name_list <- basename(file_list)
DS_name_list <- tools::file_path_sans_ext(DS_name_list)
head(DS_name_list, n = 2)
And here is the regression code.
csvs <- lapply(file_list, read.csv)
names(csvs) <- DS_name_list
set.seed(50) # for reproducibility
full_model <- vector("list", length = length(csvs))
BE_fits3 <- vector("list", length = length(csvs))
for(i in seq_along(csvs)) {
full_model[[i]] <- lm(formula = Y ~ ., data = csvs[[i]])
BE_fits3[[i]] <- step(object = full_model[[i]],
scope = formula(full_model[[i]]),
direction = 'backward',
trace = 0)
}
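If you prefer to stay with lapply, here is a minimal sketch of a workaround (my own addition, not part of the answer above): build the lm call with do.call(), so the actual data frame is embedded in the call that step() later re-evaluates.
# Sketch only: do.call() stores the data frame itself in the fitted model's
# call, so step() can re-evaluate it regardless of the calling environment.
BE_fits_lapply <- lapply(csvs, function(df) {
  fm <- do.call("lm", list(formula = Y ~ ., data = df))
  step(fm, direction = "backward", trace = 0)
})
names(BE_fits_lapply)  # lapply() preserves the dataset names from csvs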
I am trying to plot trait data on a phylogeny using the phytools package. I'm sure this should be simple, but I'm getting an unhelpful error message and I don't know what to try.
Here is my code, including the data download.
# General
library(dplyr)
# Phylogenetic libraries.
library(caper)
library(phytools)
#+ data_read
p <- read.table(file = 'http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR05_Aug2008.txt',
header = TRUE, sep = "\t", na.strings = c("-999", "-999.00"))
## Some data cleaning
# Remove NAs in the response, and rows where litter size is less than one (doesn't make sense).
p <- p %>%
filter(!is.na(X15.1_LitterSize)) %>%
filter(X15.1_LitterSize >= 1) %>%
mutate(y = log1p(X15.1_LitterSize)) %>%
dplyr::select(-X15.1_LitterSize, -References, -X24.1_TeatNumber)
## Get phylogeny data.
### read in phylogeny data.
# Read in trees
tree <- read.nexus('https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2Fj.1461-0248.2009.01307.x&file=ELE_1307_sm_SA1.tre')
# Select best supported tree
tree <- tree[[1]]
tree$tip.label <- gsub('_', ' ', tree$tip.label)
# Check if species are available.
mean(p$MSW05_Binomial %in% tree$tip.label)
in_phylo <- p$MSW05_Binomial %in% tree$tip.label
# Remove data that is not in the phylogeny.
p <- p %>% filter(in_phylo)
# Try just vulpes.
unneededTips <- tree$tip.label[!grepl('Vulpes', tree$tip.label) | !(tree$tip.label %in% p$MSW05_Binomial)]
# Prune tree down to only needed tips.
pruneTree <- drop.tip(tree, unneededTips)
dotTree(pruneTree, p$y[grepl('Vulpes', p$MSW05_Binomial)])
# Try all species
unneededTips <- tree$tip.label[!(tree$tip.label %in% p$MSW05_Binomial)]
# Prune tree down to only needed tips.
pruneTree <- drop.tip(tree, unneededTips)
dotTree(pruneTree, p$y)
I have tried plotting a smaller subset of the tree as well as the full tree, but in both cases I get the error:
Error in if (k <= 0.8 && any(rr > (strwidth("W") * fsize/2))) rr <- rr/max(rr) * :
missing value where TRUE/FALSE needed
For dotTree and similar functions in phytools (e.g. contMap), your trait values must be a named vector, with the names corresponding to the tips in your tree.
In your example you need to make sure p$y is a named vector (!is.null(names(p$y)) should be TRUE):
## Prune down the non Vulpes tips
vulpes_tree <- drop.tip(tree, tree$tip.label[-grep("Vulpes", tree$tip.label)])
## Naming the variables in p$y
all_vulpes <- grepl('Vulpes', p$MSW05_Binomial)
traits_to_plot <- p$y[all_vulpes]
names(traits_to_plot) <- p$MSW05_Binomial[all_vulpes]
## Plotting the Vulpes and the traits
dotTree(vulpes_tree, traits_to_plot)
You can apply the same procedure for your bigger tree.
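For instance, a short sketch using the objects from the question (pruneTree being the all-species pruned tree built there):
## Naming the full trait vector before plotting the complete pruned tree
all_traits <- p$y
names(all_traits) <- p$MSW05_Binomial
dotTree(pruneTree, all_traits)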
I suggest you use the function clean.data from the dispRity package to match your tree and your dataset:
## Matching the tree and the data (using the dispRity package)
library(dispRity)
## Attributing rownames to the dataset
rownames(p) <- p$MSW05_Binomial
## Cleaning both the data and the tree
cleaned_data <- dispRity::clean.data(p, tree)
## Extracting the cleaned dataset and the cleaned tree
clean_p <- cleaned_data$data
clean_tree <- cleaned_data$tree
## Same for the complete tree
all_traits <- clean_p$y
names(all_traits) <- clean_p$MSW05_Binomial
## Plotting all species and their traits
dotTree(clean_tree, all_traits)
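As a quick sanity check (my addition, not in the original answer), every trait name should now match a tip label in the cleaned tree:
## Should be TRUE if the data and tree are properly matched
all(names(all_traits) %in% clean_tree$tip.label)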
I made a train and test set using my self-made function:
library(caret)  # createDataPartition() comes from caret
splitter <- function(dataset, number1, number2 = 0.7){
trn_index <- createDataPartition(y = dataset[,number1],
p = number2, list = FALSE)
trn_set = dataset[trn_index,]
tst_set = dataset[-trn_index,]
data <- list(trn_set = as.list(trn_set),
tst_set = as.list(tst_set))
}
general_splits <- splitter(general, 2, 0.6)
table(general_splits$trn_set$Attrition)
The purpose was to make a function that splits a dataset into a train and a test set based on three arguments: the dataset, the column to base the split on (number1), and the fraction of rows that goes into the train set (number2). It returns the two sets as a list object. I want to split the general dataset.
When I try to check if it worked out, I see great results:
> table(general_splits$trn_set$Attrition)
No Yes
2185 419
> table(general_splits$tst_set$Attrition)
No Yes
1456 278
However, when I want to check the dimensions of the sets in my nested dataframe, the result is this:
> dim(general_splits$trn_set)
NULL
What am I missing here? The dataset has 3,500+ rows and 21 variables, and this is visible from the tables.
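For what it's worth, a minimal self-contained illustration (my own example, not from the code above) of how this can happen: as.list() turns a data frame into a plain list, and dim() of a plain list is NULL even though the columns survive:
# Illustration: a plain list has no dim attribute, but its columns still work
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
dim(df)           # 3 2
lst <- as.list(df)
dim(lst)          # NULL
table(lst$b)      # still tabulates fine, which is why the tables look right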
I'm trying to implement a Naive Bayes classifier on a data set that contains text data in the form of complaints from customers (Complaint) and Reddit comments (General_Text). The whole set has 250,000 texts for each category, but I use only 1,000 texts per category in the example posted here; I get the same result with the whole data set. I did the text preprocessing with the "tm" package beforehand, so it should not be an issue.
The data frame is structured as follows, with 1000 entries each for Complaint and General_Text:
type text
"General_Text" "random words"
"Complaint" "other random words"
For the classification task I split the data into a training set, on which the algorithm learns, and a test set to measure the accuracy. The Naive Bayes algorithm is from the "e1071" library.
library(plyr)
library(tm)  # Corpus() and DocumentTermMatrix() below come from tm
library(e1071)
library(caret)
library(MLmetrics)
# Import data and rename the columns to $type and $text
General_Text<- read.csv("General_Text.csv", sep=";", head=T, stringsAsFactors = F)
Complaints<- read.csv("Complaints.csv", sep=";", head=T, stringsAsFactors = F)
Data <- rbind(General_Text, Complaints)
colnames(Data) <- c("type", "text")
# $type as factor and $text as string
Data$text <- iconv(Data$text, encoding = "UTF-8")
Data$type <- factor(Data$type)
# Split the data into training set (1400 texts) and test set (600 texts)
set.seed(1234)
trainIndex <- createDataPartition(Data$type, p = 0.7, list = FALSE, times = 1)
trainData <- Data[trainIndex,]
testData <- Data[-trainIndex,]
# Create corpus for training data
corpus<- Corpus(VectorSource(trainData$text))
# Create Document Term Matrix for training data
docs_dtm <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(2, Inf))))
# Remove Sparse Terms in DTM
docs_dtm_train <- removeSparseTerms(docs_dtm , 0.97)
# Convert counts into "Yes" or "No"
convert_counts <- function(x){
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
# Apply convert_counts function to the training data
docs_dtm_train <- apply(docs_dtm_train, MARGIN = 2, convert_counts)
# Create Corpus for test set
corpus_2 <- Corpus(VectorSource(testData$text))
# Create Document Term Matrix for test data
docs_dtm_2 <- DocumentTermMatrix(corpus_2, control = list(bounds = list(global = c(2, Inf))))
# Remove Sparse Terms in DTM
docs_dtm_test <- removeSparseTerms(docs_dtm_2, 0.97)
# Apply convert_counts function to the test data
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)
# Naive Bayes Classification
nb_classifier <- naiveBayes(docs_dtm_train, trainData$type)
nb_test_pred <- predict(nb_classifier, newdata = docs_dtm_test)
# Output as Confusion Matrix
ConfusionMatrix(nb_test_pred, testData$type)
I'm sorry that I cannot deliver the data and thus a reproducible example. The result the code delivers is pretty demoralizing: it identifies all the texts as Complaints and none as General Texts.
> ConfusionMatrix(nb_test_pred, testData$type)
y_pred
y_true Complaint General_Text
Complaint 300 0
General_Text 300 0
I also get the following warning message: In data.matrix(newdata) : NAs introduced by coercion
Could anyone clarify whether I made any mistakes in my code, or give me a heads-up if someone has had a similar issue?
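One pattern worth checking here (a sketch of my own, not from the code above): the training and test document-term matrices are built independently, so they can end up with different columns, and predict() then sees terms it was never trained on. tm's dictionary control option can force the test DTM onto the training vocabulary:
# Sketch only: restrict the test DTM to the training vocabulary so both
# matrices share the same columns before prediction.
train_terms <- Terms(removeSparseTerms(docs_dtm, 0.97))
docs_dtm_test <- DocumentTermMatrix(corpus_2, control = list(dictionary = train_terms))
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)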
I have code which predicts the change in the sign of future returns.
library(quantmod)
library(PerformanceAnalytics)
library(forecast)
library(e1071)
library(caret)
library(kernlab)
library(dplyr)
library(roll)
# get data from Yahoo Finance
getSymbols("^GSPC", from = "1990-01-01", to = "2017-12-01")
# compute log returns
rnull <- CalculateReturns(prices = GSPC$GSPC.Adjusted, method = "log")
# lags 1, 2, 3, 4, 5 as features
feat <- merge(na.trim(lag(rnull,1)),na.trim(lag(rnull,2)),na.trim(lag(rnull,3)),na.trim(lag(rnull,4)),na.trim(lag(rnull,5)),all=FALSE)
# create the dataset: the 6th column is the actual return, the previous ones are its lags
dataset <- merge(feat,rnull,all=FALSE)
# set columns' names
colnames(dataset) = c("lag.1", "lag.2", "lag.3","lag.4","lag.5","TARGET")
# get signs and make a data.frame
x <- sign(dataset) %>% as.data.frame
# exclude 0 sign and assume that these values are positive
x[x==0] <- 1
# for svm purposes the dependent variable must be a factor with interpretable levels
x$TARGET <- as.factor(as.character(x$TARGET))
levels(x$TARGET) <- list(positive = "1", negative = "-1")
# divide the sample into training and test subsamples
trainindex <- x[1:5792,]
testindex <- x[5793:7030,]
# run svm
svmFit <- ksvm(TARGET~.,data=trainindex,type="C-svc",kernel= "rbfdot")
# prediction
predsvm <- predict(svmFit, newdata=testindex)
# results
confusionMatrix(predsvm, testindex$TARGET)
The next thing I want to do is add a rolling window (one-step-ahead forecast) to my model.
However, basic methods such as rollapply do not seem to work with a data.frame, and the common one-step forecasting methods for time series are also not valid for the data.frame used with the e1071 package.
I wrote the following function:
svm_next_day_prediction <- function(x){
  svmFit <- svm(TARGET ~ ., data = x)
  prediction <- predict(object = svmFit, newdata = tail(x, 1))
  return(prediction)
}
apl = rollapplyr(data = x, width = 180, FUN = svm_next_day_prediction, by.column = TRUE)
but received an error because rollapply does not understand data frames:
Error in terms.formula(formula, data = data) : '.' in formula and
no 'data' argument
Can you please explain how to apply a rolling window to an svm classification model with a data frame?
A few points:
rollapply works with data frames that can be coerced to a matrix, so be sure that your input is entirely numeric -- not a mix of numeric and factor columns. For example, this works using the built-in data frame BOD, which has two numeric columns. Note that the x passed to pred is a matrix here.
pred <- function(x) predict(svm(demand ~ Time, x))
rollapplyr(BOD, 3, FUN = pred, by.column = FALSE)
giving
## 1 2 3
## [1,] 8.868888 10.86889 17.25474
## [2,] 11.661666 17.24870 16.00000
## [3,] 18.328435 16.18583 15.78583
## [4,] 16.230474 15.83247 19.56886
I can't reproduce the error you get. I get a different error.
the code in the question has by.column = TRUE (which is the default anyway), but that has the effect of passing only a single vector to the function, which is not what you want. You want by.column = FALSE.
Try this:
x0 <- data.matrix(x)
rollapplyr(data = x0, width = 180, FUN = svm_next_day_prediction, by.column = FALSE)
You can create a list with the individual data frames and then apply your function. I rename x to df to avoid confusion:
df <- x
rowwindow <- 179
dfList <- lapply(1:(nrow(df) - rowwindow), function(x) df[x:(rowwindow + x), ])
result <- sapply(dfList, svm_next_day_prediction)
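A possible follow-up (my own sketch, not part of the answer): window i spans rows i to i + rowwindow and predicts its last row, so you can label each prediction with the row it corresponds to:
# Each prediction i belongs to row i + rowwindow of df
names(result) <- rownames(df)[(rowwindow + 1):nrow(df)]
head(result)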