How to overcome Naive Bayes classification error - r

I downloaded the emails from my email account and want to classify each one as promotional or not, using a Naive Bayes classifier.
I am using the code given below:
> classify.trip<- function(path, training.df, prior = 0.5, c=1e-6)
+ {
+ # Here, we use many of the support functions to get the
+ # trip file in a workable format
+ msg <- get.msg(path)
+ msg.tdm <- get.tdm(msg)
+ msg.tdm<-removeSparseTerms(msg.tdm,0.8)
+ msg.freq <- rowSums(as.matrix(msg.tdm))
+ # Find intersections of words
+ msg.match <- intersect(names(msg.freq), training.df$term)
+ # Now, we just perform the naive Bayes calculation
+ if(length(msg.match) < 1)
+ {
+ return(prior * c ^ (length(msg.freq)))
+ }
+ else
+ {
+ match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
+ return(prior * prod(match.probs) * c ^ (length(msg.freq) - length(msg.match)))
+ }
+ }
> promo.test <-sapply(ptest.docs,
+ function(p) classify.trip(file.path(ptest.path, p), training.df = promo.df))
> promo.test
76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
but it returns a probability of zero for every file.
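One plausible cause, offered as a guess rather than a confirmed diagnosis: multiplying many small numbers underflows double precision. With c = 1e-6, c ^ 60 is already smaller than the smallest representable double, so any message with enough unmatched terms scores exactly 0. A common workaround is to sum logarithms instead of multiplying probabilities. The sketch below is a minimal illustration; `log.score` is a hypothetical helper, not part of the code above.

```r
# Hypothetical log-space version of the scoring step.
# match.probs: occurrence probabilities of the matched terms
# n.terms:     total number of terms in the message
log.score <- function(match.probs, n.terms, prior = 0.5, c = 1e-6) {
  n.unmatched <- n.terms - length(match.probs)
  log(prior) + sum(log(match.probs)) + n.unmatched * log(c)
}

# Direct product for 60 unmatched terms underflows to zero ...
direct <- 0.5 * (1e-6)^60
# ... while the log score stays finite and can still be compared across classes.
logged <- log.score(numeric(0), n.terms = 60)
print(direct)
print(logged)
```

Log scores are only meaningful relative to each other, so the same function would have to be used for both the promotional and the non-promotional training sets before comparing.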

Related

UpSetR error when using queries: replacement has 1 row, data has 0

I have tried to use UpsetR to visualize the input file which can be found here
> library("UpSetR")
> orthogroups_df<- read.table("orthogroups.GeneCount.tsv", header=T, stringsAsFactors = F)
> #All species
> selected_species <- colnames(orthogroups_df)[2:(ncol(orthogroups_df) -1)]
> selected_species
[1] "Atha" "Cann" "NQLD" "Natt" "Ngla" "Nlab" "Nsyl" "Ntab" "Ntom" "Slyc" "Stub" "Vvin"
> head(orthogroups_df)
Orthogroup Atha Cann NQLD Natt Ngla Nlab Nsyl Ntab Ntom Slyc Stub Vvin Total
1 OG0000000 0 0 965 0 0 3 0 0 0 0 0 0 968
2 OG0000001 0 1 3 0 0 448 0 0 0 0 0 0 452
3 OG0000002 0 1 313 0 0 120 1 0 1 0 0 0 436
4 OG0000003 0 93 15 21 46 16 33 63 36 25 39 26 413
5 OG0000004 1 42 2 34 109 6 8 154 11 9 4 0 380
6 OG0000005 0 2 61 1 34 44 91 70 43 20 1 0 367
> ncol(orthogroups_df)
[1] 14
> orthogroups_df[orthogroups_df > 0] <- 1
> upset(orthogroups_df,
+ nsets = ncol(orthogroups_df),
+ sets = rev(c(selected_species)),
+ queries = list(list(query = intersects, params = list("NQLD", "Nlab", "Nsyl"), color = "#238c45", active = T),
+ list(query = intersects, params = list("NQLD", "Nlab"), color = "#ffd977", active = T)))
Error in `$<-.data.frame`(`*tmp*`, "freq", value = 45L) :
replacement has 1 row, data has 0
How is it possible to fix the above error?
We need to raise the number of intersections that are plotted, nintersects, so that the sets named in the query params can be shown.
By default nintersects is 40, but the intersection list("NQLD", "Nlab", "Nsyl") ranks 90th, beyond that cutoff, so a larger value is needed. Here I used 90:
upset(orthogroups_df,
      nsets = ncol(orthogroups_df),
      sets = rev(c(selected_species)),
      nintersects = 90,
      queries = list(
        list(query = intersects,
             params = list("NQLD", "Nlab", "Nsyl"),
             color = "red",
             active = TRUE),
        list(query = intersects,
             params = list("NQLD", "Nlab"),
             color = "blue",
             active = TRUE)))
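To estimate how large nintersects has to be, one can rank the intersections by size in base R. This is only a rough sketch: it assumes the plot orders exclusive intersections (rows that belong to exactly the named sets) by decreasing frequency, and UpSetR's handling of ties may differ. The data frame `toy` and the helper `rank_of` are made up for illustration.

```r
# Toy presence/absence data standing in for orthogroups_df (hypothetical values)
toy <- data.frame(A = c(1, 1, 1, 0, 1, 0),
                  B = c(1, 1, 0, 1, 1, 0),
                  C = c(0, 0, 0, 1, 1, 1))
sets <- c("A", "B", "C")

# Label each row with the exact combination of sets it belongs to
pattern <- apply(toy[sets], 1, function(r) paste(sets[r == 1], collapse = "&"))

# Count rows per combination and sort from most to least frequent
counts <- sort(table(pattern), decreasing = TRUE)

# Position of a given combination in that ranking
# (members must be listed in the same order as `sets`)
rank_of <- function(members) which(names(counts) == paste(members, collapse = "&"))

print(counts)
print(rank_of(c("A", "B")))
```

If the rank of a queried combination exceeds nintersects, that combination is not drawn and the query has nothing to highlight.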

Generating 99 datasets at once in R

#Generating 99 pairs of datasets
for (i in 1:99) {
  set.seed(i)
  X1 <- rnorm(100, mean=3, sd=sqrt(10))
  X2 <- rnorm(100, mean=-2, sd=sqrt(3))
  X1sq <- X1^2
  X2sq <- X2^2
  X1X2 <- X1*X2
  a <- exp(0.2*X1sq + 0.3*X2sq + 0.7*X1X2)
  b <- 1 + exp(0.2*X1sq + 0.3*X2sq + 0.7*X1X2)
  px3 <- a/b
  Y <- rbinom(100, 1, px3)
  newtest <- data.frame(X1, X2, Y)
}
Hi guys, essentially I would like to generate 99 pairs of data sets with a loop. Each newtest should be a data frame holding a fresh set of values for X1, X2 and Y. I am not sure whether the loop above is correct (as written, newtest is overwritten on every iteration), so I would like to ask for help if possible!
You can add data frames as list elements to an empty list like this:
n = 99
dfs <- list()
for (i in 1:n) {
  set.seed(i)
  X1 <- rnorm(100, mean=3, sd=sqrt(10))
  X2 <- rnorm(100, mean=-2, sd=sqrt(3))
  X1sq <- X1^2
  X2sq <- X2^2
  X1X2 <- X1*X2
  a <- exp(0.2*X1sq + 0.3*X2sq + 0.7*X1X2)
  b <- 1 + exp(0.2*X1sq + 0.3*X2sq + 0.7*X1X2)
  px3 <- a/b
  Y <- rbinom(100, 1, px3)
  newtest <- data.frame(X1, X2, Y)
  dfs[[i]] <- newtest
}
Output:
> length(dfs)
[1] 99
> dfs[[1]]
X1 X2 Y
1 1.01897911 -3.07450660 1
2 3.58073118 -1.92705317 0
3 0.35751031 -3.57776258 0
4 8.04472084 -1.72628614 1
5 4.04199507 -3.13377386 1
6 0.40545116 1.06103134 0
7 4.54138600 -0.75862624 1
8 5.33478772 -0.42353199 1
9 4.82078051 -1.33457144 1
10 2.03427713 0.91361444 0
11 7.78067182 -3.10112784 1
12 4.23279256 -2.79959213 0
13 1.03546479 0.48078561 0
14 -4.00349598 -3.12703914 1
15 6.55734391 -2.35919398 1
16 2.85790745 -2.68036329 0
17 2.94880189 -2.55424391 0
18 5.98467216 -2.48343842 0
19 5.59692944 -1.14404070 1
20 4.87808088 -2.30714541 1
21 5.90606161 -2.87634403 0
22 5.47333215 0.32621148 1
23 3.23579518 -2.37166244 0
24 -3.29088243 -2.31100103 1
25 4.96006112 -2.17353545 0
26 2.82250534 -0.76562575 1
27 2.50733135 -2.12741729 0
28 -1.65092741 -2.06518430 1
29 1.48795676 -3.18067058 0
30 4.32164726 -2.56165259 0
31 7.29652199 -1.89579906 0
32 2.67495667 -3.01999517 0
33 4.22592528 -1.07942159 1
34 2.82985352 -4.62993570 0
35 -1.35464467 -1.46902621 1
36 1.68767196 -4.66120916 1
37 1.75314569 -2.52130594 1
38 2.81243457 -2.91500764 0
39 6.47858566 -3.12946129 0
40 5.41337362 -2.09854811 1
41 2.47973071 -5.31576779 1
42 2.19880002 0.03790208 1
43 5.20399171 -4.88381685 0
44 4.76032360 -2.80285821 0
45 0.82196325 -3.93283032 1
46 0.76270387 -3.30045666 1
47 4.15290939 1.61507850 1
48 5.43031450 -1.96986990 1
49 2.64473008 -4.22793787 0
50 5.78630728 -4.84161214 0
51 4.25892133 -1.22025307 0
52 1.06460261 -2.03214657 0
53 4.07871518 -2.55091058 0
54 -0.57135969 -3.60970246 1
55 7.53161884 -4.57635683 0
56 9.26257436 -3.86228769 0
57 1.83874373 -0.26789930 1
58 -0.30184360 -3.07606548 0
59 4.80161165 -4.39789764 0
60 2.57291984 1.23770633 1
61 10.59458219 -1.26370455 1
62 2.87591222 -2.41334890 0
63 5.18114738 -0.16665358 1
64 3.08855060 -0.46467093 1
65 0.64956374 -3.07256042 1
66 3.59701367 1.82108156 1
67 -2.70778035 -2.44171977 1
68 7.63449140 -4.46729711 0
69 3.48462961 -2.25010745 0
70 9.87040135 -1.64053305 1
71 4.50369316 1.99753585 1
72 0.75495226 -1.81674492 0
73 4.93128630 -1.20845485 1
74 0.04612393 -2.13363280 1
75 -0.96433690 -2.57850643 1
76 3.92163392 -2.06014725 0
77 1.59818801 -0.63576818 1
78 3.00349543 1.59442979 1
79 3.23508791 -0.22050410 1
80 1.13577108 0.09215872 0
81 1.20171157 -4.13271473 1
82 2.57252769 -0.29584288 1
83 6.72543819 -1.61907907 1
84 -1.81794126 -4.54135160 1
85 4.87822276 -1.09756214 1
86 4.05288152 -2.27497104 0
87 6.36181687 0.53673964 1
88 2.03808597 -3.32689295 0
89 4.17010222 -2.74514862 0
90 3.84464054 -3.60406870 0
91 1.28440103 -2.30675306 0
92 6.81961338 -1.30369517 1
93 6.66951527 -3.26742501 0
94 5.21426998 -0.56175148 1
95 8.01800798 -4.09246077 0
96 4.76608915 -3.81516225 0
97 -1.03693902 0.49615837 0
98 1.18717559 -3.75949942 1
99 -0.87256511 -1.28643887 0
100 1.50297574 -2.66004308 0
Here is a solution with replicate. From the documentation, my emphasis:
replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).
set.seed(1)
n <- 99L
dfs <- replicate(n, {
  X1 <- rnorm(100, mean=3, sd=sqrt(10))
  X2 <- rnorm(100, mean=-2, sd=sqrt(3))
  X1sq <- X1^2
  X2sq <- X2^2
  X1X2 <- X1*X2
  a <- exp(0.2*X1sq + 0.3*X2sq + 0.7*X1X2)
  b <- 1 + exp(0.2*X1sq + 0.3*X2sq + 0.7*X1X2)
  px3 <- a/b
  Y <- rbinom(100, 1, px3)
  data.frame(X1, X2, Y)
}, simplify = FALSE)
Created on 2022-05-21 by the reprex package (v2.0.1)
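A third variant, sketched here as one possible arrangement rather than the recommended one, wraps the simulation in a function and calls it with lapply, keeping one seed per data set as in the original loop. The name `gen_one` is hypothetical. It uses plogis(eta), which computes the same quantity as a/b above, exp(eta) / (1 + exp(eta)), without forming the two exponentials separately; results may differ from the loop in the last floating-point digit.

```r
# Hypothetical helper: simulate one data set for a given seed
gen_one <- function(seed, n = 100) {
  set.seed(seed)
  X1 <- rnorm(n, mean = 3, sd = sqrt(10))
  X2 <- rnorm(n, mean = -2, sd = sqrt(3))
  # Linear predictor with the squared and interaction terms from the question
  eta <- 0.2 * X1^2 + 0.3 * X2^2 + 0.7 * X1 * X2
  # plogis(eta) is the logistic function exp(eta) / (1 + exp(eta))
  Y <- rbinom(n, 1, plogis(eta))
  data.frame(X1, X2, Y)
}

# One data frame per seed, collected in a list
dfs <- lapply(1:99, gen_one)
print(length(dfs))
print(head(dfs[[1]]))
```

Because each element depends only on its seed, any single data set can be regenerated later with gen_one(i).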

What is causing this error? Coefficients not defined because of singularities

I'm trying to fit a model to my data but I get the message "Coefficients: (3 not defined because of singularities)".
These occur for winter, large and high_flow.
I found this:
https://stats.stackexchange.com/questions/13465/how-to-deal-with-an-error-such-as-coefficients-14-not-defined-because-of-singu
which said it may be caused by incorrect dummy variables, but I've checked that none of my columns are duplicates.
When I use the function alias() I get:
Model :
S ~ A + B + C + D + E + F + G + spring + summer + autumn + winter + small + medium + large + low_flow + med_flow + high_flow
Complete :
(Intercept) A B C D E F G spring summer autumn small medium
winter 1 0 0 0 0 0 0 0 -1 -1 -1 0 0
large 1 0 0 0 0 0 0 0 0 0 0 -1 -1
high_flow 1 0 0 0 0 0 0 0 0 0 0 0 0
low_flow med_flow
winter 0 0
large 0 0
high_flow -1 -1
Columns A-H of my data contain numeric values.
The remaining columns take 0 or 1, and I have checked that there are no conflicting values (i.e. if spring = 1 for a case, then autumn = summer = winter = 0).
model_1 <- lm(S ~ A+B+C+D+E+F+G+spring+summer+autumn+winter+small+medium+large+low_flow+med_flow+high_flow, data = trainOne)
summary(model_1)
Can someone explain the error please?
EDIT: example of my data before I changed it to binary
season size flow A B C D E F G S
spring small medium 52 72 134 48 114 114 142 11
autumn small medium 43 21 98 165 108 23 60 31
spring medium medium 41 45 161 86 177 145 32 12
autumn large medium 40 86 132 80 82 138 186 16
winter medium high 49 32 147 189 125 43 144 67
summer large high 43 9 158 64 14 146 15 71
The issue is perfect collinearity. Namely,
spring + summer + autumn + winter == 1
small + medium + large == 1
low_flow + med_flow + high_flow == 1
Constant term == 1
By this I mean that those identities hold for each observation individually. (E.g., only one of the seasons is equal to one.)
So, for instance, lm cannot distinguish between the intercept and the sum of all the seasons' effects. Perhaps this or this will help to get the idea better. More technically, the OLS estimates involve a certain matrix that is not invertible in this case.
To fix this, you may run, e.g.,
model_1 <- lm(S ~ A + B + C + D + E + F + G + spring + summer + autumn + small + medium + low_flow + med_flow, data = trainOne)
Also see this question.
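The effect can be reproduced on a small simulated data set. The sketch below uses made-up data (only the season dummies and one numeric predictor A) purely to illustrate the point: with an intercept and all four season dummies, lm cannot estimate the last dummy, and dropping one dummy removes the problem.

```r
set.seed(42)
n <- 40
# Every season appears equally often, so exactly one dummy is 1 per row
season <- rep(c("spring", "summer", "autumn", "winter"), length.out = n)
d <- data.frame(
  spring = as.integer(season == "spring"),
  summer = as.integer(season == "summer"),
  autumn = as.integer(season == "autumn"),
  winter = as.integer(season == "winter"),
  A = rnorm(n)
)
d$S <- 2 + 0.5 * d$A + 3 * d$spring + rnorm(n)

# All four dummies plus the intercept: perfectly collinear
full <- lm(S ~ A + spring + summer + autumn + winter, data = d)
# One dummy dropped: winter becomes the reference level
reduced <- lm(S ~ A + spring + summer + autumn, data = d)

print(coef(full))
print(coef(reduced))
```

In the reduced model the intercept is the expected value of S for a winter case with A = 0, and each remaining season coefficient is a difference from winter.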
@JuliusVainora has already given you a good explanation of how the error occurs, which I will not repeat. However, Julius' answer is only one method and might not be satisfying if you don't understand that there really is a value for cases where winter = 1, large = 1 and high_flow = 1. It can readily be seen in the display as the value for "(Intercept)". You might be able to make the result more interpretable by adding +0 to your formula. (Or it might not, depending on the data situation.)
However, I think that you really should re-examine how your categorical variables are coded. You are using one dummy variable per level, a method you are probably copying from some other system, perhaps SAS or SPSS? That will predictably cause problems for you in the future, as well as being painful to code and maintain. R's data.frame function already automagically creates factors that encode multiple levels in a single variable (by default in versions before R 4.0.0; in later versions lm still treats character columns as factors). (Read ?factor.) So your formula would become:
S ~ A + B + C + D + E + F + G + season + size + flow
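To see what R does with such factor columns, one can inspect the design matrix. The following is a small sketch that uses the season and size values from the six example rows in the question; model.matrix shows that each factor with k levels is expanded into k - 1 dummy columns, with the alphabetically first level absorbed into the intercept.

```r
# season and size taken from the six example rows in the question
df <- data.frame(
  season = factor(c("spring", "autumn", "spring", "autumn", "winter", "summer")),
  size   = factor(c("small", "small", "medium", "large", "medium", "large"))
)

# Design matrix that lm would build for these two factors
mm <- model.matrix(~ season + size, data = df)
print(colnames(mm))
```

Here autumn and large become the reference levels, so no column is created for them and the collinearity with the intercept never arises. A different reference level can be chosen with relevel().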
Some of your variables could be perfectly collinear. Take a look at the variables and how they correlate with each other. You can start inspecting the data with cor(dataset), which returns the correlation matrix of your dataset.

Nested xtab tables

I would like to produce nested tables for a multilevel factorial experiment. I have 10 paints examined for time to reach an end point under 4 levels of humidity, 3 temperatures and 2 wind speeds. Of course I have searched online, but without success.
Some sample code can be generated using:
## Made Up Data # NB the data is continuous whereas observations were made 40/168 so data is censored.
time3 <- 4*seq(1:24) # Dependent: times in hrs, runif is not really representative but will do
wind <- c(1,2) # Independent: factor draught on or off
RH <- c(0,35,75,95) # Independent: value for RH but can be processed as a factor
temp <- c(5,11,20) # Independent: value for temperature but can be processed as a factor
paint <- c("paintA", "paintB", "paintC") # Independent: Experimental material
# Combine into dataframe
dfa <- data.frame(rep(temp,8))
dfa$RH <- rep(RH,6)
dfa$wind <- rep(wind,12)
dfa$time3 <- time3
dfa$paint <- rep(paint[1],24)
# Replicate for different paints
dfb <- dfa
dfb$paint <- paint[2]
dfc <- dfa
dfc$paint <- paint[3]
dfx <- do.call("rbind", list(dfa,dfb,dfc))
# Rename first col
colnames(dfx)[1] <- "temp"
# Prepare xtab tables
tx <- xtabs(dfx$time3 ~ dfx$wind + dfx$RH + dfx$temp + dfx$paint)
tx
And the target I hope to obtain would be like this xtab example
This
tx <- xtabs(dfx$time3 ~ dfx$wind + dfx$RH + dfx$temp)
does not work well enough. I would also like to write to C:\file.csv for printing and reporting etc. Please advise on how to achieve the desired output.
You can paste the two variables you want to nest together. Since the items will be ordered lexicographically, you will need to zero-pad the temp variable, to get numerical ordering.
xtabs(time3~wind+paste(sprintf("%02d",temp),RH,sep=":")+paint,dfx)
, , paint = paintA
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
, , paint = paintB
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
, , paint = paintC
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
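Another option, shown here only as a sketch, is ftable from the stats package, which flattens a multi-way table into a single two-dimensional layout with nested row and column headers. The example uses a small made-up data frame rather than dfx, and writes to a temporary file instead of C:\file.csv; which variables go in rows and which in columns is an arbitrary choice for illustration.

```r
# Small made-up design: 2 wind x 2 RH x 2 temp x 2 paints
dfx <- expand.grid(wind = c(1, 2), RH = c(0, 35), temp = c(5, 11),
                   paint = c("paintA", "paintB"))
dfx$time3 <- seq_len(nrow(dfx))

# Four-way table of summed times
tx <- xtabs(time3 ~ wind + RH + temp + paint, data = dfx)

# Nest wind within paint on the rows, RH within temp on the columns
ft <- ftable(tx, row.vars = c("paint", "wind"), col.vars = c("temp", "RH"))
print(ft)

# format() turns the flat table, headers included, into a character matrix
out <- tempfile(fileext = ".csv")
write.table(format(ft, quote = FALSE), out, sep = ",",
            row.names = FALSE, col.names = FALSE)
```

Since the levels keep their numeric order inside ftable, no zero-padding of temp is needed with this approach.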

R populating columns based on previous values

I am trying to populate a series like this.
                                        Actual (my result)        Expected
FWK_SEQ_NBR      a  initial_d  initial_c     b     c     d       b     c     d
914          9.161        131         62     0    62    69       0    62    69
915          9.087        131          0     0    53    78       0    53    78
916          8.772        131          0     0    44   140       0    44    87
917          8.698        131          0     0     0   140       0    35    96
918          7.985        131          0    69    52   139      69    96    35
919          6.985        131          0    78    63   138      78   168     0
920          7.077        131          0   140   126   138      87   247     0
921          6.651        131          0   140   126   138      96   336     0
922          6.707        131          0   139   125   138      35   364     0
Logic
a given
b lag of d by 4
c initial c for first week thereafter (c previous row + b current - a current)
d initial d - c current
Here is the code I used:
DS1 = DS %>%
  mutate(c = ifelse(FWK_SEQ_NBR == min(FWK_SEQ_NBR), initial_c, 0)) %>%
  mutate(c = lag(c) + b - a) %>%
  mutate(d = initial_d - c) %>%
  mutate(d = ifelse(d<0,0,d)) %>%
  mutate(b = shift(d, n=4, fill=0, type="lag"))
I am not getting c right. Do you know what I am missing? I have also attached an image of the actual and expected output. Thank you for your help!
Actual and Expected values Image
Second Image - Added Product and Store to the list of columns
Image - Product and Store as the first two columns- please help
Below is the actual code, I have also copied the image of the expected and actual output. thank you!
Your example is not what I would call reproducible, and the code snippet did not give much insight into what you were trying to do. However, the screen image from Excel was very helpful. Because each row depends on the previous row, I used a loop. Here is my solution:
df <- as.data.frame(cbind(a = c(1:9), b = 0, c = 0, d = NA)) # placeholder values for a
c_init = 62
d_init = 131
df$d <- d_init
df$c[1] <- c_init # initial data frame is ready at this stage
iter <- dim(df)[1] # number of rows, so the loop runs iter times
for(i in 1:iter){
  if(i>4){
    df[i, "b"] = df[i-4,"d"] # calculate b as d lagged by 4 rows
  }
  if(i>1){
    df[i, "c"] = df[i-1, "c"] + df[i, "b"] - df[i, "a"] # calc c
  }
  df[i, "d"] <- d_init - df[i, "c"] # calc d
  if(df[i, "d"] < 0) {
    df[i, "d"] <- 0 # reset negative d values
  }
}
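The same recurrence can be wrapped in a function and checked against the question's numbers. This is a sketch under the stated logic (b is d lagged by 4, c carries forward as previous c + b - a, d is initial_d - c floored at 0); `fill_series` is a hypothetical name, and the a values are the nine shown in the question's table.

```r
# Hypothetical helper implementing the row-by-row logic from the question
fill_series <- function(a, c_init, d_init, lag_n = 4) {
  n <- length(a)
  b <- numeric(n)
  cc <- numeric(n)
  d <- numeric(n)
  for (i in seq_len(n)) {
    if (i > lag_n) b[i] <- d[i - lag_n]            # b: d lagged by lag_n rows
    cc[i] <- if (i == 1) c_init else cc[i - 1] + b[i] - a[i]
    d[i] <- max(d_init - cc[i], 0)                 # d: never below zero
  }
  data.frame(a = a, b = b, c = cc, d = d)
}

# a values from the question's table (rows 914 to 922)
a <- c(9.161, 9.087, 8.772, 8.698, 7.985, 6.985, 7.077, 6.651, 6.707)
res <- fill_series(a, c_init = 62, d_init = 131)
print(round(res[c("b", "c", "d")]))
```

After rounding, the b, c and d columns agree with the Expected columns in the question. For several products and stores, the function could be applied per group, for example with split() and lapply().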
