Ridge-regression model: glmnet - r

Fitting a linear-regression model using least squares on my training dataset works fine.
library(Matrix)
library(tm)
library(glmnet)
library(e1071)
library(SparseM)
library(ggplot2)
trainingData <- read.csv("train.csv", stringsAsFactors=FALSE,sep=",", header = FALSE)
testingData <- read.csv("test.csv",sep=",", stringsAsFactors=FALSE, header = FALSE)
lm.fit = lm(as.factor(V42)~ ., data = trainingData)
linearMPrediction = predict(lm.fit,newdata = testingData, se.fit = TRUE)
mean((linearMPrediction$fit - testingData[,20:41])^2)
linearMPrediction$residual.scale
However, when I try to fit a ridge-regression model on my training dataset as follows,
x = model.matrix(as.factor(V42)~., data = trainingData)
y = as.factor(trainingData$V42)
ridge = glmnet(x, y, family = "multinomial", alpha = 1, lambda.min.ratio = 1e-2)
I get the following error for both the multinomial and binomial families.
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has 1 or 0 observations; not allowed
Am I missing something? Any comment would be greatly appreciated. By the way, here is a portion of what my data looks like.
> trainingData$V42[1:50]
[1] "normal" "normal" "neptune" "normal" "normal" "neptune" "neptune" "neptune" "neptune" "neptune" "neptune"
[12] "neptune" "normal" "warezclient" "neptune" "neptune" "normal" "ipsweep" "normal" "normal" "neptune" "neptune"
[23] "normal" "normal" "neptune" "normal" "neptune" "normal" "normal" "normal" "ipsweep" "neptune" "normal"
[34] "portsweep" "normal" "normal" "normal" "neptune" "normal" "neptune" "neptune" "neptune" "normal" "normal"
[45] "normal" "neptune" "teardrop" "normal" "warezclient" "neptune"
> x
(Intercept) V1 V2tcp V2udp V3bgp V3courier V3csnet_ns V3ctf V3daytime V3discard V3domain V3domain_u V3echo V3eco_i V3ecr_i V3efs V3exec V3finger V3ftp
1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> y[1:50]
[1] normal normal neptune normal normal neptune neptune neptune neptune neptune neptune neptune normal
[14] warezclient neptune neptune normal ipsweep normal normal neptune neptune normal normal neptune normal
[27] neptune normal normal normal ipsweep neptune normal portsweep normal normal normal neptune normal
[40] neptune neptune neptune normal normal normal neptune teardrop normal warezclient neptune
22 Levels: back buffer_overflow ftp_write guess_passwd imap ipsweep land loadmodule multihop neptune nmap normal phf pod portsweep rootkit satan smurf spy ... warezmaster
> table(y)
y
back buffer_overflow ftp_write guess_passwd imap ipsweep land loadmodule multihop neptune
196 6 1 10 5 710 1 1 2 8282
nmap normal phf pod portsweep rootkit satan smurf spy teardrop
301 13449 2 38 587 4 691 529 1 188
warezclient warezmaster
181 7

You have classes with only a single observation (ftp_write, land, loadmodule and spy each appear exactly once in table(y)), which is not allowed, exactly as the error message states.
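One way around this (a minimal sketch, assuming you are happy to drop the rarest classes rather than merge them; the threshold of 10 is arbitrary) is to filter the training data before building the model matrix. Note also that in glmnet alpha = 1 gives the lasso penalty; for ridge regression use alpha = 0.
# Keep only classes with enough observations (threshold is arbitrary here;
# merging rare classes into an "other" level is another option).
keep <- names(which(table(trainingData$V42) >= 10))
trainSub <- trainingData[trainingData$V42 %in% keep, ]
trainSub$V42 <- factor(trainSub$V42)                # drop the now-unused levels
x <- model.matrix(V42 ~ ., data = trainSub)[, -1]   # drop the intercept column
y <- trainSub$V42
ridge <- glmnet(x, y, family = "multinomial", alpha = 0)  # alpha = 0 is the ridge penalty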

Related

Genetic Algorithm in R: Specify number of 1s in binary chromosomes

I am using the rbga.bin function from the genalg package, but my question also applies to other genetic algorithm implementations in R. Is there a way to specify the number of 1s in binary chromosomes?
Here is the example provided in the package documentation.
data(iris)
library(MASS)
library(genalg)   # provides rbga.bin
X <- as.data.frame(cbind(scale(iris[,1:4]), matrix(rnorm(36*150), 150, 36)))
Y <- iris[,5]
iris.evaluate <- function(indices) {
  print("Chromosome")
  print(indices)
  print("================================")
  result = 1
  if (sum(indices) > 2) {
    huhn <- lda(X[, indices == 1], Y, CV = TRUE)$posterior
    result = sum(Y != dimnames(huhn)[[2]][apply(huhn, 1,
                 function(x) which(x == max(x)))]) / length(Y)
  }
  result
}
monitor <- function(obj) {
  minEval = min(obj$evaluations)
  plot(obj, type = "hist")
}
woppa <- rbga.bin(size = 40, mutationChance = 0.05, zeroToOneRatio = 10,
                  evalFunc = iris.evaluate, showSettings = TRUE, verbose = TRUE)
Here are some of the chromosomes.
"Chromosome"
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
"================================"
The numbers of 1s (i.e., of chosen characteristics) are 5, 8, 5 and 4, respectively.
I am trying to follow the technique described in a paper, where the authors apply a genetic algorithm and end up with a specific number of characteristics.
Is it possible to specify in a genetic algorithm the number of characteristics that I want my solution(s)/chromosome(s) to have?
Could this be done on the final solution/chromosome and if yes how?
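As far as I know, rbga.bin does not expose such an option. One common workaround (sketched below, reusing X, Y and the lda call from the example above, with a hypothetical target of 5 features) is to penalise chromosomes that do not have the desired number of 1s inside evalFunc, so the GA is steered toward solutions with exactly that many features:
target <- 5  # desired number of selected features (hypothetical value)
iris.evaluate.k <- function(indices) {
  # A chromosome with the wrong number of 1s gets a score worse than any
  # possible error rate (error rates lie in [0, 1]); genalg minimises evalFunc.
  if (sum(indices) != target) {
    return(1 + abs(sum(indices) - target))
  }
  huhn <- lda(X[, indices == 1], Y, CV = TRUE)$posterior
  sum(Y != dimnames(huhn)[[2]][apply(huhn, 1, which.max)]) / length(Y)
}
woppa <- rbga.bin(size = 40, mutationChance = 0.05, zeroToOneRatio = 10,
                  evalFunc = iris.evaluate.k, verbose = TRUE)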

Counting repeated 5-mers in 100 DNA sequence samples

I am a beginner in R and have been struggling with this for a few days already. Please help a newbie out.
I extracted 100 samples, each of length 1000, from a DNA sequence of length 100,000. Now I want to count how many times "AATAA" appears in each of the samples.
dog_100
# [1] "GGGTCCTTGAAAGAAGCACAGGGTGGGGGTGGGGGTGGGGGTGGGGGAAGGCAGAGAGGAGGAAACAGGTTTTTGTCCTCAGGGCGTTGCCAGTCTGAAGGAGGTGATGGGATAATTATTTATGAGAGTTCAGGAATGCCAGGCATGGATTAAATGCAAACTAATGGAAATGACACAGAACAATACATTACAC......................................"
#[2] "CCAGGCCAGAACTGAGGCCCTCAGGGCCCCCCAGAATTCCTCATTTGCAGGATAAAAATATACTCAGCTCTTCAATCTTGGTTCTTGCTACTGCACCATGTGCTTCCTGGACTCTGGGAGGCCAGGGGTTAAGTGGGAGTGTTTGAATAAGGGAAAGGATGAGCCCTTTCCCCACACTTTGCCCCAAATAAC......................................"
#[3]
#........
# [4]
#........
# [100]
#........
I wrote a function to identify and count the "AATAA".
library(stringr)
cal_AATAA <- function(DNA) {
  sam_pro <- numeric(length(DNA))
  k <- 5
  sam_code <- "AATAA"
  for (i in 1:length(DNA)) {
    Num <- str_length(DNA[i])
    for (j in 1:(Num - k + 1)) {
      if (str_sub(DNA[i], j, j + k - 1) == sam_code) {
        sam_pro[i] <- sam_pro[i] + 1
      } else {
        sam_pro[i] <- sam_pro[i]
      }
    }
    return(sam_pro)
  }
}
sample_100 <- cal_AATAA(dog_100)
What I got after running the function is
> sample_100
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[46] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[91] 0 0 0 0 0 0 0 0 0 0
I tried to debug my code but don't know where it went wrong. I would appreciate any tips or guidance.
R has a built-in function called gregexpr which can be used to count patterns in a string. It returns a list, so we use sapply to loop over the elements of the output. For each element we count the values that are greater than zero, because a value of -1 indicates that no match was found. Look at the output of gregexpr("ap", c("appleap", "orange")) as an example.
dna = c("AGTACGTGCATAGC", "GTAGCTAGCTAGCAT")
sam = "AGC"
sapply(gregexpr(sam, dna), function(x) sum(x > 0))
#[1] 1 3
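Applied to the data in the question (assuming dog_100 is the character vector of 100 samples), the per-sample counts become a one-liner. Note that gregexpr counts non-overlapping matches, whereas a sliding-window loop like the one above would also count overlapping occurrences of "AATAA":
sample_100 <- sapply(gregexpr("AATAA", dog_100, fixed = TRUE),
                     function(x) sum(x > 0))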

probability and classification in svm function of e1071 package in R

I'm using SVM in e1071 package for binary classification.
I'm using both the probability attribute and the class returned by predict, so I can compare the results. What puzzles me is that the predicted class (0 or 1) does not seem consistent with the probabilities listed in the attribute: for some very high probabilities of level 1 the SVM predicts level 0, and for some low probabilities of level 1 it predicts level 1.
Here is some sample code and the results:
svm_model <- svm(as.factor(CHURNED) ~ .,
                 scale = FALSE,
                 data = train,
                 cost = 1,
                 gamma = 0.1,
                 kernel = "radial",
                 probability = TRUE)
test$Pred_Class <- predict(svm_model, test, probability = TRUE)
test$Pred_Prob <- attr(test$Pred_Class, "probabilities")[,1]
Here are the results (the rows have been reordered to show various examples):
CHURNED: the response variable being predicted
Pred_Class: the class predicted by the SVM
Pred_Prob: the predicted probability, on which (I assumed) the SVM bases its classification
CHURNED Pred_Class Pred_Prob
1 0 0.03968526 # --> makes sense
1 0 0.03968526
1 0 0.07033222
1 0 0.11711195
1 0 0.12477983
1 0 0.12827296
1 0 0.12829345
1 0 0.12829345
1 0 0.12829345
1 0 0.12829444
1 0 0.12829927
1 0 0.12829927
1 0 0.12831169
1 0 0.12831169
1 0 0.12831428
1 1 0.13053475 # --> doesn't make sense. Prob is less than 0.5
1 1 0.13053475
1 1 0.13053475
1 1 0.1305348
1 1 0.1305348
1 1 0.1305348
1 1 0.1690807
1 1 0.2206993
1 1 0.2321171
0 0 0.998289 # --> doesn't make sense. Prob is almost 1!
0 0 0.9982887
0 0 0.993133
0 0 0.9898889
1 0 0.9849951
0 0 0.9849951
1 0 0.546427
0 0 0.5440994 # --> doesn't make sense. Prob is more than 0.5
0 0 0.5437889
1 0 0.5417848
0 0 0.5284112
0 0 0.5252177
0 1 0.5180776 # --> makes sense but is not consistent with above example
0 1 0.5180704
1 1 0.5180436
1 1 0.5180436
0 1 0.518043
This result doesn't make sense to me at all. The predicted class and predicted probabilities don't match. I've checked to make sure that I'm referencing the right column from the "probabilities" attribute matrix:
test$Pred_Class
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[98] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
attr(,"probabilities")
1 0
6442 0.2369796 0.7630204
6443 0.2520246 0.7479754
6513 0.2322581 0.7677419
6801 0.2309437 0.7690563
6802 0.2244768 0.7755232
6954 0.2322450 0.7677550
6968 0.2537544 0.7462456
6989 0.2352477 0.7647523
7072 0.2322308 0.7677692
...
...
...
Maybe I am interpreting the probability incorrectly?
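One sanity check worth running (a sketch, reusing svm_model and test from the code above; I'm not certain this explains the discrepancy) is to print the column names of the probability matrix and compare the class with the highest probability against the class returned by predict, which makes any mismatched column immediately visible:
pred <- predict(svm_model, test, probability = TRUE)
prob <- attr(pred, "probabilities")
colnames(prob)                                   # which column is which class?
check <- data.frame(class_pred  = pred,
                    prob_argmax = colnames(prob)[max.col(prob)],
                    prob, check.names = FALSE)
head(check)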

R igraph adjacency matrix weighted graph – plot is not weighted

I am trying to plot a weighted graph of terms used in tweets. Basically I built a term-document matrix, removed sparse terms, built an adjacency matrix of the remaining words, and would like to plot it.
I can't figure out where the problem is. I tried to do it exactly as described at http://www.rdatamining.com/examples/text-mining
Here's my code:
tweet_corpus <- Corpus(VectorSource(df$CONTENT))
tdm <- TermDocumentMatrix(
  tweet_corpus,
  control = list(
    removePunctuation = TRUE,
    stopwords = c("hehe", "haha", stopwords_phil, stopwords("english"), stopwords("spanish")),
    removeNumbers = TRUE, tolower = TRUE)
)
m = as.matrix(tdm)
termDocMatrix <- m
termDocMatrix[5:10,1:20]
Docs
Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
aabutin 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aad 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aaf 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aali 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aannacm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
aantukin 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
myTdm2 <- removeSparseTerms(tdm, sparse =0.98)
m2 <- as.matrix(myTdm2)
m2[5:10,1:20]
Docs
Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
filipino 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
give 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
god 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
good 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
guy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
haiyan 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
myTdm2
<<TermDocumentMatrix (terms: 34, documents: 27395)>>
Non-/sparse entries: 39769/891661
Sparsity : 96%
Maximal term length: 9
Weighting : term frequency (tf)
termDocMatrix2 <- m2
termDocMatrix2[termDocMatrix2>=1] <- 1
termMatrix2 <- termDocMatrix2 %*% t(termDocMatrix2)
termMatrix2[5:10,5:10]
Terms
Terms disaster give god good guy test
disaster 623 6 53 11 4 19
give 6 592 98 16 8 6
god 53 98 2679 135 38 29
good 11 16 135 816 21 5
guy 4 8 38 21 637 5
test 19 6 29 5 5 610
g2 <- graph.adjacency(termMatrix2, weighted=T, mode="undirected")
g2 <- simplify(g2)
V(g2)$label <- V(g2)$name
V(g2)$degree <- degree(g2)
set.seed(3952)
layout1 <- layout.fruchterman.reingold(g2)
plot(g2, layout=layout1)
plot(g2, layout=layout.kamada.kawai)
V(g2)$label.cex <- 2.2 * V(g2)$degree / max(V(g2)$degree)+ .2
V(g2)$label.color <- rgb(0, 0, .2, .8)
V(g2)$frame.color <- NA
egam <- (log(E(g2)$weight)+.4) / max(log(E(g2)$weight)+.4)
E(g2)$color <- rgb(.5, .5, 0, egam)
E(g2)$width <- egam
plot(g2, layout=layout1)
This then looks like:
but I would like to have something like this:
Apparently the weighting doesn't work – but why?
Thank you guys in advance!
Even though your graph is weighted, the layout algorithm does not use the weights unless you explicitly tell it to do so. Try this:
layout1 <- layout.fruchterman.reingold(g2, weights=E(g2)$weight)
However, if your weights vary wildly in magnitude, it is usually better to use the logarithm of the weights (plus some constant to make all of them strictly positive) as the input to the layout algorithm.
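For example (a sketch along those lines; the co-occurrence counts above are all at least 1, so adding 1 after the log keeps the weights strictly positive):
# Log-transformed weights are used only for the layout; the plot still uses
# the original weights for edge width and colour.
layout1 <- layout.fruchterman.reingold(g2, weights = log(E(g2)$weight) + 1)
plot(g2, layout = layout1)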

How do you silently save an inspect object in R's tm package?

When I save the output of inspect() in R's tm package, it prints to the screen. It does save the data that I want in the data.frame, but I have thousands of documents to analyze, and the printing to the screen is eating up my memory.
library(tm)
data("crude")
matrix <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
                                                   stopwords = TRUE))
out <- data.frame(inspect(matrix))
I have tried every trick that I can think of. capture.output() changes the object (not the desired effect), as does sink(). dev.off() does not work. invisible() does nothing. suppressWarnings(), suppressMessages(), and try() unsurprisingly do nothing. There are no silent or quiet options in the inspect command.
The closest that I can get is
out= capture.output(inspect(matrix))
out= data.frame(out)
which notably does not give the same data.frame, but could fairly easily be made to if I need to go down this route. Any other (less hacky) suggestions would be helpful. Thanks.
Windows 7
64-bit R 3.0.1
tm package is the most recent version (0.5-9.1).
Assign inside the capture then:
capture.output(out <- data.frame(inspect(matrix))) -> .null # discarding this
But really, inspect is for visual inspection, so maybe try
as.data.frame(as.matrix(matrix))
instead (btw matrix is a very unfortunate name for a variable, as that's a base function).
Using this input (variable name changed from your question, since using a variable named "matrix" can be confusing):
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
                                                stopwords = TRUE))
Then this will avoid printing to the screen:
m <- as.matrix(tdm)
and then I would personally do something like
require(data.table)
data.table(m, keep.rownames=TRUE)
# rn 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
# 1: 100000 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
# 2: 108 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 3: 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 4: 115 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 5: 12217 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
# ---
# 996: yesterday 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0 0 0 0
# 997: yesterdays 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
# 998: york 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
# 999: zero 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 1000: zone 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
