classification in R - r

I am trying to do naive bayes classification in R. I have seen this example in following link.
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Na%C3%AFve_Bayes
Only 2 lines are there. First classify and then predict.
> classifier<-naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
This same code on "iris dataset" working fine. But when i applied the same on my dataset, I am getting some errors.
My dataset contains 4 attributes and 4th attribute the class attribute.
> str(data1)
'data.frame': 1370 obs. of 4 variables:
$ TenScore : num 85 84.2 67.2 91.5 79.3 ...
$ TwelthScore : num 69 87.9 67.5 82.7 72.4 ...
$ GDegreeScore : num 63.3 70.7 61.3 78.2 62.1 ...
$ Got_Admission: chr "No" "No" "No" "No" ...
So, I tried this.
> classifier<-naiveBayes(data1[,1:3], data1[,4])
> table(predict(classifier, data1[,-4]), data1[,4])
Error in table(predict(classifier, data1[, -4]), data1[, 4]) :
all arguments must have the same length
I am getting above error when I am executing the command. When I just use predict, its giving me following output.
> predict(classifier, data1[,-4])
factor(0)
Levels:
str(data1) 'data.frame': 1370 obs. of 4 variables:
$ TenScore : num 85 84.2 67.2 91.5 79.3 ...
$ TwelthScore : num 69 87.9 67.5 82.7 72.4 ...
$ GDegreeScore : num 63.3 70.7 61.3 78.2 62.1 ...
$ Got_Admission: chr "No" "No" "No" "No" ...
Please explain me whats the errors about and how to solve?

I can produce the same error by changing the 5th column of iris to character:
> iris[ , 5] <- as.character(iris[ , 5] )
> classifier<-naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
Error in table(predict(classifier, iris[, -5]), iris[, 5]) :
all arguments must have the same length
# The fix -------->
iris[ , 5] <- factor(as.character(iris[ , 5] ))
classifier<-naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5])
# ---- output--------
setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
So you should probably do this:
data1$ Got_Admission <- factor(data1$ Got_Admission)
If your 'Got_Admission' column is not in good order you will get confusing results (the GIGO effect). You should first look at the contents with:
table(data1$ Got_Admission)

Related

Loop in R not working, generating single Value

I have some metabolomics data I am trying to process (validate the compounds that are actually present).
`'data.frame': 544 obs. of 48 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ No. : int 2 32 34 95 114 141 169 234 236 278 ...
$ RT..min. : num 0.89 3.921 0.878 2.396 0.845 ...
$ Molecular.Weight : num 70 72 72 78 80 ...
$ m.z : num 103 145 114 120 113 ...
$ HMDB.ID : chr "HMDB0006804" "HMDB0031647" "HMDB0006112" "HMDB0001505" ...
$ Name : chr "Propiolic acid" "Acrylic acid" "Malondialdehyde" "Benzene" ...
$ Formula : chr "C3H2O2" "C3H4O2" "C3H4O2" "C6H6" ...
$ Monoisotopic_Mass: num 70 72 72 78 80 ...
$ Delta.ppm. : num 1.295 0.833 1.953 1.023 0.102 ...
$ X1 : num 288.3 16.7 1130.9 3791.5 33.5 ...
$ X2 : num 276.8 13.4 1069.1 3228.4 44.1 ...
$ X3 : num 398.6 19.3 794.8 2153.2 15.8 ...
$ X4 : num 247.6 100.5 1187.5 1791.4 33.4 ...
$ X5 : num 98.4 162.1 1546.4 1646.8 45.3 ...`
I tried to write a loop so that if the Delta.ppm value is larger than (m/z - molecular weight)/molecular weight, the entire row is deleted in the subsequent dataframe.
for (i in 1:nrow(rawdata)) {
ppm <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
rawdata$Molecular.Weight[i]
if (ppm > rawdata$Delta.ppm[i]) {
filtered_data <- rbind(filtered_data, rawdata[i,])
}
}
Instead of giving me a new df with the validated compounds, under the 'Values' section, it generates a single number for 'ppm'.
Still very new to R, any help is super appreciated!
No need to do this row-by-row, we can remove all undesired rows in one operation:
## base R
good <- with(rawdat, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
newdat <- rawdat[good, ]
## dplyr
newdat <- filter(rawdat, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
Iteratively adding rows to a frame using rbind(old, newrow) works in practice but scales horribly, see "Growing Objects" in The R Inferno. For each row added, it makes a complete copy of all rows in old, which works but starts to slow down a lot. It is far better to produce a list of these new rows and then rbind them at one time; e.g.,
out <- list()
for (...) {
# ... newrow ...
out <- c(out, list(newrow))
}
alldat <- do.call(rbind, out)
ppm[i] <- NULL
for (i in 1:nrow(rawdata)) {
ppm[i] <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
rawdata$Molecular.Weight[i]
if (ppm[i] > rawdata$Delta.ppm[i]) {
filtered_data <- rbind(filtered_data, rawdata[i,])
}
}

Error when running boxcox on response variable

I'm using the following code to try to transform my response variable for regression. Seems to need a log transformation.
bc = boxCox(auto.tf.lm)
lambda.mpg = bc$x[which.max(bc$y)]
auto.tf.bc <- with(auto_mpg, data.frame(log(mpg), as.character(cylinders), displacement**.2, log(as.numeric(horsepower)), log(weight), log(acceleration), model_year))
auto.tf.bc.lm <- lm(log(mpg) ~ ., data = auto.tf.bc)
view(auto.tf.bc)
I am receiving this error though.
Error in Math.data.frame(mpg) :
non-numeric variable(s) in data frame: manufacturer, model, trans, drv, fl, class
Not sure how to resolve this. The data is in a data frame, not csv.
Here's the output from str(auto.tf.bc). Sorry for such bad question formatting.
'data.frame': 392 obs. of 7 variables:
$ log.mpg. : num 2.89 2.71 2.89 2.77 2.83 ...
$ as.character.cylinders.: chr "8" "8" "8" "8" ...
$ displacement.0.2 : num 3.14 3.23 3.17 3.14 3.13 ...
$ log.horsepower. : num 4.87 5.11 5.01 5.01 4.94 ...
$ log.weight. : num 8.16 8.21 8.14 8.14 8.15 ...
$ log.acceleration. : num 2.48 2.44 2.4 2.48 2.35 ...
$ model_year : num 70 70 70 70 70 70 70 70 70 70 ...
removing the cylinders doesn't change anything.

Calculation of Geometric Mean of Data that includes NAs

EDIT: The problem was not within the geoMean function, but with a wrong use of aggregate(), as explained in the comments
I am trying to calculate the geometric mean of multiple measurements for several different species, which includes NAs. An example of my data looks like this:
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)
When I try to calculate the geometric mean for the species Ae with the built-in function geoMean from the package EnvStats like this
library("EnvStats")
aggregate(df[, 3:3], list(df1$Sp), geoMean, na.rm=TRUE)
it works wonderful and skips the NAs to give me the geometric means per species.
Group.1 phen hveg
1 Ae 4.238536 50.555696
2 Be 1.414214 1.414214
When I do this with my large dataset, however, the function stumbles over NAs and returns NA as result even though there are e.g 10 numerical values and only one NA. This happens for example with the column SLA_mm2/mg.
My large data set looks like this:
> str(cut2trait1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 22 obs. of 19 variables:
$ Cut : chr "15_08" "15_08" "15_08" "15_08" ...
$ Block : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID : num 451 512 431 531 591 432 551 393 511 452 ...
$ Plot : chr "1_1" "1_1" "1_1" "1_1" ...
$ Grazing : chr "n" "n" "n" "n" ...
$ Acro : chr "Leuc.vulg" "Dact.glom" "Cirs.arve" "Trif.prat" ...
$ Sp : chr "Lv" "Dg" "Ca" "Tp" ...
$ Label_neu : chr "Lv021" "Dg022" "Ca021" "Tp021" ...
$ PlantFunctionalType: chr "forb" "grass" "forb" "forb" ...
$ PlotClimate : chr "AC" "AC" "AC" "AC" ...
$ Season : chr "Aug" "Aug" "Aug" "Aug" ...
$ Year : num 2015 2015 2015 2015 2015 ...
$ Tiller : num 6 3 3 5 6 8 5 2 1 7 ...
$ Hveg : num 25 38 70 36 68 65 23 58 71 27 ...
$ Hrep : num 39 54 77 38 76 70 65 88 98 38 ...
$ Phen : num 8 8 7 8 8 7 6.5 8 8 8 ...
$ SPAD : num 40.7 42.4 48.7 43 31.3 ...
$ TDW_in_g : num 4.62 4.85 11.86 5.82 8.99 ...
$ SLA_mm2/mg : num 19.6 19.8 20.3 21.2 21.7 ...
and the result of my code
gm_cut2trait1 <- aggregate(cut2trait1[, 13:19], list(cut2trait1$Sp), geoMean, na.rm=TRUE)
is (only the first two rows):
Group.1 Tiller Hveg Hrep Phen SPAD TDW_in_g SLA_mm2/mg
1 Ae 13.521721 73.43485 106.67933 NA 28.17698 1.2602475 NA
2 Be 8.944272 43.95452 72.31182 5.477226 20.08880 0.7266361 9.309672
Here, the geometric mean of SLA for Ae is NA, even though there are 9 numeric measurements and only one NA in the column used to calculate the geometric mean.
I tried to use the geometric mean function suggested here:
Geometric Mean: is there a built-in?
But instead of NAs, this returned the value 1.000 when used with my big dataset, which doesn't solve my problem.
So my question is: What is the difference between my example df and the big dataset that throws the geoMean function off the rails?

Apply a loop to a data frame

I'm trying to apply the function RE.Johnson from the Johnson package to a whole data frame df that contains 157 observations of 16 variables and i'd like to loop trough all the dataframe instead of doing it manually.
I've tried the following code but it doesn't work.
lapply(df[1:16], function(x) RE.Johnson(x))
I know it might seem easy for you guys but I'm juste starting with R.
Thanks
EDIT
R provides me the answer Error in RE.ADT(xsl[, i]) : object 'p' not found and the data are not transformed.
And here is a summary of the data:
data.frame': 157 obs. of 16 variables:
$ X : num 786988 781045 777589 775266 786843 ...
$ Y : num 486608 488691 490089 489293 488068 ...
$ Z : num 182 128 191 80 131 ...
$ pH : num 7.93 7.69 7.49 7.66 7.92 7.08 7.24 7.19 7.44 7.37 ...
$ CE : num 0.775 3.284 3.745 4.072 0.95 ...
$ Nitrate : int 21 14 18 83 30 42 47 101 85 15 ...
$ NP : num 19.6 43.6 31.7 18.6 31.7 ...
$ Cl : num 1.9 21.3 2.56 21.5 3.2 ...
$ HCO3 : num 6.65 4.85 4.4 7.72 4.1 ...
$ CO3 : num 0 0 0 0 0.0736 ...
$ Ca : num 4.12 7.52 3.48 7.58 4.8 10 4.4 4.6 4.2 7.4 ...
$ Mg : num 3.94 8.92 2.34 7.1 2.5 ...
$ K : num 0.1442 0.0759 0.0709 0.3691 0.07 ...
$ Na : num 2.41 34.55 2.51 44.01 2.1 ...
$ SO4 : num 1.45 23.6 1.2 26.66 2 ...
$ Residu_sec: num 0.496 2.102 2.397 2.606 0.608 ...
Not a complete solution, just some information for others.
I tried the Johnson::RE.Johnson manually on the columns in the iris data frame. It seems to be work fine for Sepal.Length and Petal.Length only:
lapply(iris[c(1,3)], Johnson::RE.Johnson)
... and it returns the error you mentioned for Sepal.Width and Petal.Width.
lapply(iris[c(2,4)], Johnson::RE.Johnson)
Error in RE.ADT(xsl[, i]) : object 'p' not found
This seems odd because all of those columns have a data type of num. The iris data frame doesn't appear to have any missing values or extra character values hidden anywhere, so I'm not sure why the calculation is working for those columns but not others.
Without understanding too much about what the Johnson::RE.Johnson is doing to the data, it looks like it is unable to calculate a value for p and is unable to complete the iteration for those columns.
From exploring the source code, the function appears to break down at this point:
if (xsb.valida[1, i] == 0)
xsb.adtest[1, i] <- (Johnson::RE.ADT(xsb[, i])$p) # succeeds
if (xsl.valida[1, i] == 0)
xsl.adtest[1, i] <- (Johnson::RE.ADT(xsl[, i])$p) # fails
if (xsu.valida[1, i] == 0)
xsu.adtest[1, i] <- (Johnson::RE.ADT(xsu[, i])$p) # fails
The function attempts to run Johnson::RE.ADT on xsl, which at this point is a vector of just 0's. The RE.ADT returns the same error with the p value not being found.
The problem is when the function try to perform the Anderson-Darling test to a vector of equals values. If you do this, you will get the error:
require(Johnson)
x = rep(1,n=100)
RE.ADT(x)
So, to solve this problem you could check it in the IF session inside the function RE.Johnson:
if (xsb.valida[1, i] == 0 & any(xsb[, i]!=xsb[1, i])){
xsb.adtest[1, i] <- (RE.ADT(xsb[, i])$p)
}else{
xsb.adtest[1, i] <- 0
}
if (xsl.valida[1, i] == 0 & any(xsl[, i]!=xsl[1, i])) {
xsl.adtest[1, i] <- (RE.ADT(xsl[, i])$p)
}else{
xsl.adtest[1, i] <- 0
}
if (xsu.valida[1, i] == 0 & any(xsu[, i]!=xsu[1, i])) {
xsu.adtest[1, i] <- (RE.ADT(xsu[, i])$p)
}else{
xsu.adtest[1, i] <- 0
}

Analysis of PCA

I'm using the rela package to check whether I can use PCA in my data.
paf.neur2 <- paf(neur2)
summary(paf.neur2)
# [1] "Your dataset is not a numeric object."
I want to see the KMO (The Kaiser-Meyer-Olkin measure of sampling adequacy test). How to do that?
Output of str(neur2)
'data.frame': 1457 obs. of 66 variables:
$ userid : int 200 387 458 649 931 991 1044 1075 1347 1360 ...
$ funct : num 3.73 3.79 3.54 3.04 3.81 ...
$ pronoun: num 2.26 2.55 2.49 1.98 2.71 ...
.
.
.
$ time : num 1.68 1.87 1.51 1.03 1.74 ...
$ work : num 0.7419 0.2311 -0.1985 -1.6094 -0.0619 ...
$ achieve: num 0.174 0.2469 0.1823 -0.478 -0.0513 ...
$ leisure: num 0.2852 0.0296 0.0583 -0.3567 -0.0408 ...
$ home : num -0.844 -0.58 -0.844 -2.207 -1.079 ...
.
Variables are all numeric.
According to ?paf, object is a numeric dataset (usually a coerced matrix from a prior data frame)
So you need to turn your data.frame neur2 into a matrix: as.matrix(neur2).
Here is a reproduction of your problem using the Seatbelts dataset:
library(rela)
Belts <- Seatbelts[,1:7]
class(Belts)
# [1] "mts" "ts" "matrix"
Belts <- as.data.frame(Belts)
# [1] "data.frame"
paf.belt <- paf(Belts)
[1] "Your dataset is not a numeric object."
Belts <- as.matrix(Belts)
class(Belts)
# [1] "matrix"
paf.belt <- paf(Belts) # Works
Two options which can do it for you:
kmo_DIY <- function(df){
csq = cor(df)^2
csumsq = (sum(csq)-dim(csq)[1])/2
library(corpcor)
pcsq = cor2pcor(cor(df))^2
pcsumsq = (sum(pcsq)-dim(pcsq)[1])/2
kmo = csumsq/(csumsq+pcsumsq)
return(kmo)
}
or
the function KMO() from the psych package.

Resources