How to load data in R with the 'caret' library? - r

I'm trying to evaluate several cross-validated predictive models in R with the 'caret' library, but an error message appears every time I try to train on the dataset. I know I'm doing something wrong, but I can't see the mistake (I summarized the code below):
> ticData<-read.table("tic-tac-toe.data.txt",header=F, sep=',')
> ticData
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 x x x x o o x o o positive
2 x x x x o o o x o positive
3 x x x x o o o o x positive
4 x x x x o o o b b positive
5 x x x x o o b o b positive
6 x x x x o o b b o positive
[…]
> colnames(ticData) <- c("topl","topm","topr","midl","midm","midr","botl","botm","botr","class")
> head(ticData)
topl topm topr midl midm midr botl botm botr class
1 x x x x o o x o o positive
2 x x x x o o o x o positive
3 x x x x o o o o x positive
4 x x x x o o o b b positive
5 x x x x o o b o b positive
6 x x x x o o b b o positive
> complete.cases("tic-tac-toe.data")
[1] TRUE
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> set.seed(100)
> trainIndex <- createDataPartition(ticData$class, p=0.7, list=FALSE, times=1)
> head(trainIndex)
Resample1
[1,] 1
[2,] 2
[3,] 4
[4,] 6
[5,] 7
[6,] 9
> Train <- ticData[ trainIndex,]
> Test <- ticData[-trainIndex,]
> install.packages("klaR")
> install.packages("C50")
> install.packages("nnet")
> install.packages("kernlab")
> library(caret)
> ticData$class=as.factor(ticData$class)
> head(ticData)
topl topm topr midl midm midr botl botm botr class
1 x x x x o o x o o positive
2 x x x x o o o x o positive
3 x x x x o o o o x positive
4 x x x x o o o b b positive
5 x x x x o o b o b positive
6 x x x x o o b b o positive
> fitControl<-trainControl(
+ method="nb",
+ number=10)
> set.seed(100)
> nbFit<-train(class~.,data=ticData,
+ method="nb",
+ trControl=fitControl,
+ verbose=FALSE)
Error: Not a recognized resampling method.
I'm still learning, many thanks in advance :)
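A likely cause, judging from the error text: in trainControl(), the method argument names the resampling scheme (for example "cv", "repeatedcv" or "boot"), while the model type ("nb") belongs only in train(). The sketch below shows that split. The toy data frame is a hypothetical stand-in for ticData, since the original file is not available here, and the "nb" model assumes the klaR package is installed. Note also that complete.cases("tic-tac-toe.data") checks the string itself, not the data; complete.cases(ticData) checks the rows.

```r
library(caret)  # method = "nb" also needs the klaR package to be installed

set.seed(100)

# Hypothetical stand-in for ticData (assumption: nine board cells with
# values x / o / b, plus a two-level class column).
cells <- c("topl", "topm", "topr", "midl", "midm", "midr", "botl", "botm", "botr")
n <- 200
toy <- as.data.frame(
  lapply(setNames(cells, cells),
         function(cell) factor(sample(c("x", "o", "b"), n, replace = TRUE)))
)
toy$class <- factor(sample(c("positive", "negative"), n, replace = TRUE))

# Check for missing values on the data frame, not on the file name.
all(complete.cases(toy))

# Resampling is configured here: "cv" with number = 10 is 10-fold cross-validation.
fitControl <- trainControl(method = "cv", number = 10)

# The model type ("nb" = naive Bayes) goes in train(), not in trainControl().
nbFit <- train(class ~ ., data = toy, method = "nb", trControl = fitControl)
nbFit
```

With the real data, the same two calls should work with data = ticData in place of the toy data frame.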

Related

Statistical testing for multiple columns from a dataframe

For the data frame below I want to perform Kolmogorov-Smirnov tests for multiple columns. Column ID is the record ID; A-D are factors consisting of 2 levels ('Other' and A, B, C, D respectively). My test variable is in column E.
Now I would like to perform 4 KS tests:
Distributions of E for column A (A vs O)
Distributions of E for column B (B vs O)
Distributions of E for column C (C vs O)
Distributions of E for column D (D vs O)
In reality, I have 80 columns, so I'm looking for a way to perform these 80 tests 'Simultaneously'
ID A B C D E
1 1 O B C O 1
2 2 O O O O 3
3 3 O O O D 2
4 4 A O C D 7
5 5 A B O O 12
6 6 O O O O 4
7 7 O B O O 8
I hope this solves your problem:
dat <- read.table("path/data.txt")  # your data imported into my session
cols <- c("A", "B", "C", "D")  # the columns with categories; we leave the others out
E <- dat$E  # but save the E variable
lapply(cols, function(i) {  # evaluate E at each level of each column
  x <- factor(dat[, i])
  a <- E[x == levels(x)[1]]
  b <- E[x == levels(x)[2]]
  ks.test(a, b)
})  # you get a list with the results for each column
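With 80 columns, a list of 80 printed test objects is hard to scan, so it may help to name the results and pull out just the p-values. A minimal sketch, assuming the sample data from the question (typed in without the row numbers) and assuming every column other than ID and E is a two-level grouping column; the names ks_results and p_values are made up for illustration:

```r
# Sample data from the question, without the row-number column.
dat <- read.table(header = TRUE, text = "
ID A B C D E
1 O B C O 1
2 O O O O 3
3 O O O D 2
4 A O C D 7
5 A B O O 12
6 O O O O 4
7 O B O O 8
")

# Assumption: everything except ID and E is a two-level grouping column.
cols <- setdiff(names(dat), c("ID", "E"))

# One KS test per column; split() separates E by the two levels of that column.
ks_results <- lapply(setNames(cols, cols), function(col) {
  groups <- split(dat$E, dat[[col]])
  ks.test(groups[[1]], groups[[2]])
})

# Named numeric vector with one p-value per column.
p_values <- sapply(ks_results, function(res) res$p.value)
p_values
```

Since this runs many tests at once, p.adjust(p_values) is the usual base R way to correct for multiple comparisons.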

Whole dataset shows up, although a subset has been selected and newly defined

I have a data frame which I have subsetted using normal indexing. Code below.
dframe <- dframe[1:10, c(-3,-7:-10)]
But when I write dframe$Symbol I get the output.
BABA ORCL LFC TSM ACT ABBV MA ABEV KMI UPS
3285 Levels: A AA AA^B AAC AAN AAP AAT AAV AB ABB ABBV ABC ABEV ABG ABM ABR ABR^A ABR^B ABR^C ABRN ABT ABX ACC ACCO ACE ACG ACH ACI ACM ACN ACP ACRE ACT ACT^A ACW ADC ADM ADPT ADS ADT ADX AEB AEC AED AEE AEG AEH AEK AEL AEM AEO AEP AER AES AES^C AET AF AF^C ... ZX
I'm wondering what is happening here. Does dframe now contain only 10 rows, or does it still contain all rows and only print 10 of them?
Thanks
That's just the way factors work. When you subset a factor, it preserves all levels, even those that are no longer represented in the subset. For example:
f1 <- factor(letters)
f1
## [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
f2 <- f1[1:10]
f2
## [1] a b c d e f g h i j
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
To answer your question, it's actually slightly tricky to append all missing levels to a factor. You have to combine the existing factor data with all missing indexes (here I'm referring to the integer indexes that the factor class internally uses to map the actual factor data to its levels vector, which is stored as an attribute on the factor object), and then rebuild a factor (using the original levels) from that combined data. Below I demonstrate this, now randomizing the subset taken from f1 to demonstrate that order does not matter:
set.seed(1)
f3 <- sample(f1, 10)
f3
## [1] g j n u e s w m l b
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
factor(c(f3, setdiff(1:nlevels(f3), as.integer(f3))), labels = levels(f3))
## [1] g j n u e s w m l b a c d f h i k o p q r t v x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
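For the original question, the subset really does contain only 10 rows; only the level set is left over from the full data. If the unused levels get in the way, base R's droplevels() removes them, either on a single factor or on a whole data frame. A small sketch, where the dframe built here is a hypothetical stand-in for the asker's data:

```r
f1 <- factor(letters)
f2 <- f1[1:10]
nlevels(f2)                  # 26: subsetting keeps every original level

f2_clean <- droplevels(f2)
nlevels(f2_clean)            # 10: only the levels still in use

# Hypothetical stand-in for the asker's dframe (column names are assumptions).
dframe <- data.frame(Symbol = factor(letters), Price = seq_along(letters))
dframe <- dframe[1:10, ]
nrow(dframe)                 # 10: the data frame itself has only 10 rows

# droplevels() also works on a data frame and cleans every factor column.
dframe <- droplevels(dframe)
levels(dframe$Symbol)
```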

Subsetting character data in R

I have a data frame with several columns of varied character data. I want to find the average of each combination of that character data. I think I'm closing in on a solution, but am having trouble figuring out how to loop over characters. An example bit of data would be like:
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11
And trying to get it in the form of:
P1 P2 P3 Avg
a w j 11.667
a w k 10
a d j 20
a d k 7.5
a d l 23
b x L 20
b y k 17.5
c z j 15
c z k 11
c w l 45
I think the idea is something like:
test <- read.table("clipboard",header=T)
newdata <- subset(test,
Var1=='a'
& Var2=='w'
& Var3=='j',
select=M1
)
row.names(newdata)<-NULL
newdata2 <- as.data.frame(matrix(data=NA,nrow=3,ncol=4))
names(newdata2) <- c("P1","P2","P3","Avg")
newdata2[1,1] <- 'a'
newdata2[1,2] <- 'w'
newdata2[1,3] <- 'j'
newdata2[1,4] <- mean(newdata$M1)
Which works for the first line, but I'm not entirely sure how to automate this to loop over each character combination across the columns. Unless, of course, there's a similar apply-like function to use in this case?
library(dplyr)
newdata2 = summarise(group_by(test,Var1,Var2,Var3),Avg=mean(M1))
And the result:
> newdata2
Source: local data frame [10 x 4]
Groups: Var1, Var2
Var1 Var2 Var3 Avg
1 a d j 20.00000
2 a d k 7.50000
3 a d l 23.00000
4 a w j 11.66667
5 a w k 10.00000
6 b x L 20.00000
7 b y k 17.50000
8 c w l 45.00000
9 c z j 15.00000
10 c z k 11.00000
Using the base aggregate function:
mydata <- read.table(header=TRUE, text="
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11")
aggdata <-aggregate(mydata$M1, by=list(mydata$Var1,mydata$Var2,mydata$Var3) , FUN=mean, na.rm=TRUE)
output:
> aggdata
Group.1 Group.2 Group.3 x
1 a d j 20.00000
2 a w j 11.66667
3 c z j 15.00000
4 a d k 7.50000
5 a w k 10.00000
6 b y k 17.50000
7 c z k 11.00000
8 a d l 23.00000
9 c w l 45.00000
10 b x L 20.00000
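The formula interface of aggregate() is a variant that keeps the original column names instead of Group.1, Group.2 and x. A minimal sketch using the same sample data; renaming to P1, P2, P3 and Avg follows the layout asked for in the question, and the name avg is made up for illustration:

```r
mydata <- read.table(header = TRUE, text = "
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11")

# Mean of M1 for every combination of Var1, Var2 and Var3 present in the data.
avg <- aggregate(M1 ~ Var1 + Var2 + Var3, data = mydata, FUN = mean)

# Rename to the layout requested in the question and sort for readability.
names(avg) <- c("P1", "P2", "P3", "Avg")
avg <- avg[order(avg$P1, avg$P2, avg$P3), ]
avg
```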

Remove variables with factor level 1

I am using the program gs in the bnlearn package for my data frame EMGbin. The dataframe EMGbin contains all factors, ranging from A to Z. EMGbin has 600000 columns and 130 rows. Here is a sample of EMGbin:
V101 V102 V103 V104 V105 V106
1 L M D S O O
2 L M C P A O
3 J M C O O O
4 L N D R A O
5 K M D O A O
6 K M C P O O
7 K N D Q O O
8 L N D R O O
9 L M D O O O
10 K M D S A O
When I run the program gs(EMGbin), I get the error:
Error in check.data(x) : all factors must have at least two levels.
When I run sapply(EMGbin, nlevels), I get the number of factor levels for each of the 600,000 variables, and some of them have only 1 level. Would removing the variables with 1 factor level help? So far, the only way I know how to do this is x[, sapply(x, fun) != 1], but I don't know what to substitute for fun.
Use this:
x[, sapply(x, nlevels) > 1]
You can check the number of levels in a factor with the nlevels function.
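Here is the same idea on the sample rows from the question, as a small self-contained sketch. Adding drop = FALSE is my own precaution, not part of the answer above: it keeps the result a data frame even if only one column survives. The names n_levels and EMGclean are made up for illustration.

```r
# Sample rows from the question, read in as factors.
EMGbin <- read.table(header = TRUE, stringsAsFactors = TRUE, text = "
V101 V102 V103 V104 V105 V106
L M D S O O
L M C P A O
J M C O O O
L N D R A O
K M D O A O
K M C P O O
K N D Q O O
L N D R O O
L M D O O O
K M D S A O
")

# Number of levels per column; V106 only ever takes the value O.
n_levels <- sapply(EMGbin, nlevels)
n_levels

# Keep only the columns with at least two levels.
EMGclean <- EMGbin[, n_levels > 1, drop = FALSE]
names(EMGclean)
```

If the data frame was subsetted earlier, running droplevels() on it first makes sure nlevels() counts only levels that actually occur.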

Transforming data.frame in R

I have the following data frame:
foo <- data.frame( abs( cbind(rnorm(3),rnorm(3, mean=.8),rnorm(3, mean=.9),rnorm(3, mean=1))))
colnames(foo) <- c("w","x","y","z")
rownames(foo) <- c("n","q","r")
foo
# w x y z
# n 1.51550092 1.4337572 1.2791624 1.1771230
# q 0.09977303 0.8173761 1.6123402 0.1510737
# r 1.17083866 1.2469347 0.8712135 0.8488029
What I want to do is to change it into :
newdf
# 1 n w 1.51550092
# 2 q w 0.09977303
# 3 r w 1.17083866
# 4 n x 1.43375725
# 5 q x 0.81737606
# 6 r x 1.24693468
# 7 n y 1.27916241
# 8 q y 1.61234016
# 9 r y 0.87121353
# 10 n z 1.17712302
# 11 q z 0.15107369
# 12 r z 0.84880292
What's the way to do it?
There are several ways to do this. Here's one:
set.seed(1)
foo <- data.frame( abs( cbind(rnorm(3),
rnorm(3, mean=.8),
rnorm(3, mean=.9),
rnorm(3, mean=1))))
colnames(foo) <- c("w","x","y","z")
rownames(foo) <- c("n","q","r")
foo
# w x y z
# n 0.6264538 2.39528080 1.387429 0.6946116
# q 0.1836433 1.12950777 1.638325 2.5117812
# r 0.8356286 0.02046838 1.475781 1.3898432
data.frame(rows = row.names(foo), stack(foo))
# rows values ind
# 1 n 0.62645381 w
# 2 q 0.18364332 w
# 3 r 0.83562861 w
# 4 n 2.39528080 x
# 5 q 1.12950777 x
# 6 r 0.02046838 x
# 7 n 1.38742905 y
# 8 q 1.63832471 y
# 9 r 1.47578135 y
# 10 n 0.69461161 z
# 11 q 2.51178117 z
# 12 r 1.38984324 z
reshape2::melt() is particularly well suited to this transformation:
library(reshape2)
foo <- cbind(ID=rownames(foo), foo)
melt(foo)
# Using ID as id variables
# ID variable value
# 1 n w 1.7337416
# 2 q w 0.5890877
# 3 r w 0.2245508
# 4 n x 0.5237346
# 5 q x 0.9320455
# 6 r x 0.8156573
# 7 n y 1.9287306
# 8 q y 1.1604229
# 9 r y 1.7631215
# 10 n z 0.3591350
# 11 q z 0.9740170
# 12 r z 0.5621968
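If an extra package is not wanted, another base R route is to go through a table: as.table() on the matrix form of the data frame, followed by as.data.frame(), gives one row per cell with the row names varying fastest, which matches the requested order. A sketch under the same set.seed(1) setup as the first answer; the column names row, column and value are my own choice:

```r
set.seed(1)
foo <- data.frame(abs(cbind(rnorm(3),
                            rnorm(3, mean = .8),
                            rnorm(3, mean = .9),
                            rnorm(3, mean = 1))))
colnames(foo) <- c("w", "x", "y", "z")
rownames(foo) <- c("n", "q", "r")

# Matrix -> table -> long data frame (one row per original cell).
long <- as.data.frame(as.table(as.matrix(foo)), responseName = "value")

# The first two columns hold the former row and column names.
names(long)[1:2] <- c("row", "column")
long
```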
