How to reveal the content of a built-in dataset? - r

Many packages in R come with built-in datasets (for example, “Vehicle” in “mlbench” and “churn” in C50). We can use the data() function to load these datasets. Sometimes I want to check the structure and content of such a dataset in order to construct a new dataset for further analysis. But the View() function often fails to do this job; summary() works in some cases, but if you use summary(churn), the only result you get is an error: Error in summary(churn) : object 'churn' not found.
Are there any common methods to reveal a part of a built-in dataset?

Despite the fact that churn.RData is in the ../data/ directory of the C50 library, loading it shows that there is no 'churn' object in it. There are, however, both 'churnTest' and 'churnTrain' datasets, and you can see their structure with str():
load('/path/to/my/current_R/Resources/library/C50/data/churn.RData')
ls(patt='churn')
#[1] "churnTest" "churnTrain"
str(churnTest)
'data.frame': 1667 obs. of 20 variables:
$ state : Factor w/ 51 levels "AK","AL","AR",..: 12 27 36 33 41 13 29 19 25 44 ...
$ account_length : int 101 137 103 99 108 117 63 94 138 128 ...
$ area_code : Factor w/ 3 levels "area_code_408",..: 3 3 1 2 2 2 2 1 3 2 ...
$ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ voice_mail_plan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 2 ...
# snipped remainder of output
You would also have gotten some sort of response from:
data(package="C50")
I get a panel that pops up with:
Data sets in package ‘C50’:
churnTest (churn) Customer Churn Data
churnTrain (churn) Customer Churn Data
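More generally, data() both lists and loads packaged datasets, and str()/head() then reveal their content. A minimal sketch using a dataset from the built-in datasets package (so it runs anywhere):

```r
# List the datasets a package ships with (pops up a panel/page)
# data(package = "C50")

# Load a packaged dataset into the workspace by name, then inspect it
data(mtcars, package = "datasets")
str(mtcars)    # structure: column types and a preview of values
head(mtcars)   # first six rows
```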

Related

Grouping categorical data with a factor (yes/no) for each observation

I am working on a project in R but can't figure out how to create grouped data based on the categorical variable (Occupation, with 10 levels) and Died (a yes/no factor variable).
I've looked at numerous articles, but every time I try to count the number of "yes" and "no" values of a single column (Died) I get a dimension error.
'data.frame': 2571 obs. of 4 variables:
$ Occupation: Factor w/ 10 levels "business/service",..: 3 2 2 2 2 2 2 2 2 5 ...
$ Education : Factor w/ 5 levels "iliterate","primary",..: 3 2 2 2 3 1 1 3 2 3 ...
$ Age : int 39 83 60 73 51 62 90 54 66 30 ...
$ Died : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 1 ...
This is the summary of my data. So I am looking to group each of the 10 occupation levels with the number of people who died.
This is the code I was trying:
dperoccu <- summarise(occu, count = n(), deaths = count(SuicideChina$Died, "yes"))
but it produced the following error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "factor"
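The error arises because summarise() is being applied to a single factor rather than to a grouped data frame. A base-R sketch with a small made-up data frame (the column names follow the str() output above; the values are invented):

```r
# Made-up data mirroring the structure in the question
df <- data.frame(
  Occupation = factor(c("farming", "farming", "business/service", "others")),
  Died       = factor(c("yes", "no", "yes", "yes"))
)

# Count deaths ("yes") per occupation
deaths <- tapply(df$Died == "yes", df$Occupation, sum)

# The dplyr equivalent groups the data frame first, then summarises:
# library(dplyr)
# df %>% group_by(Occupation) %>%
#   summarise(count = n(), deaths = sum(Died == "yes"))
```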

How to automatically convert a numeric column to categorical data using statistical techniques

>data
ACC_ID REG PRBLT OPP_TYPE_DESC PARENT_ID ACCT_NM INDUSTRY_ID BUY PWR REV QTY
11316456 No 90 A 2122628569 INF 7379 10190.82 6500 1
11456476 Yes 1 I 2385888136 Module 9199 17441.72 466.5 31
13453245 No 10 D 2122628087 Wooden 3559 44279.21 2500 500
15674568 No 1 I 2702074521 Nine 7379 183218.8 25.91 1
Above is the given dataset
When I load the same in R, I have the following structure
>str(data)
$ ACC_ID : int 11316974 11620677 11865091 ...
$ REG : Factor w/ 2 levels "No ","Yes ": 1 2 1 1 1 1 1 1 1 1 ...
$ PRBLT : int 90 1 10 1 30 30 10 1 60 1 ...
$ OPP_TYPE_DESC : Factor w/ 3 levels "D",..: 3 2 1 2 1 1 1 3 3 2 ...
$ PARENT_ID : num 2.12e+09 2.39e+09 2.12e+09 2.70e+09 2.12e+09 ...
$ ACCT_NM : Factor w/ 20 levels "Marketing Vertical",..: 10 15 20 17 8 16 2 14 7 11 ...
$ INDUSTRY_ID : int 7379 9199 3559 7379 2711 7374 7371 8742 4813 2111 ..
$ BUY PWR : num 1014791 17442 ...
$ REV : num 6500 46617 250000 25564 20000 ...
$ QTY : int 1 31 500 1 6 100 ...
But I would like R to automatically output the fields below as factors instead of int (with the help of statistical modelling or any other technique). Ideally, these are not continuous fields but categorical nominal fields:
ACC_ID
PARENT_ID
INDUSTRY_ID
Whereas the REV and QTY columns should be left as is.
Also, the analysis should not be specific to the data and the columns shown here. The logic must be applicable to any dataset (with different columns) that we load into R.
Can there be any method through which this is possible? Any ideas are welcome.
Thank you
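There is no single statistical rule for this, but a common heuristic is to treat a numeric column as categorical when it has few distinct values relative to the number of rows. A sketch (the function name and the 5% threshold are my own choices, not from the question; note it would not catch ID columns such as ACC_ID, where every value is distinct and a name-based or domain rule is needed instead):

```r
# Convert low-cardinality numeric/integer columns to factors.
# max_unique_ratio is an arbitrary tuning knob.
to_factor_heuristic <- function(df, max_unique_ratio = 0.05) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.numeric(col) &&
        length(unique(col)) / length(col) <= max_unique_ratio) {
      df[[nm]] <- factor(col)
    }
  }
  df
}

d  <- data.frame(code = rep(1:3, 20), value = rnorm(60))
d2 <- to_factor_heuristic(d)
str(d2)  # code becomes a factor (3 distinct values in 60 rows); value stays numeric
```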

KNNCAT error "some classes have only one member"

I'm trying to run a KNN analysis on auto data using knncat's knncat function. My training set is around 700,000 observations. The following happens when I try to run the analysis. I've attempted to remove NAs using the complete-cases method while reading the data in. I'm not sure exactly how to resolve the errors or what they mean.
kdata.training = kdataf[ind==1,]
kdata.test = kdataf[ind==2,]
kdata_pred = knncat(train = kdata.training, test = kdata.test, classcol = 4)
Error in knncat(train = kdata.training, test = kdata.test, classcol = 4) :
Some classes have only one member. Check "classcol"
When I attempt to run a small subsection of the training and test sets (200 and 70 observations, respectively) I get the following error:
kdata_strain = kdata.training[1:200,]
kdata_stest = kdata.test[1:70,]
kdata_pred = knncat(train = kdata_strain, test = kdata_stest, classcol = 4)
Error in knncat(train = kdata_strain, test = kdata_stest, classcol = 4) :
Some factor has empty levels
Here is the str method called on kdataf, the data frame from which the above data was sampled:
str(kdataf)
'data.frame': 1159712 obs. of 9 variables:
$ vehicle_sales_price: num 13495 11999 14499 12495 14999 ...
$ week_number: Factor w/ 27 levels "1","2","3","4",..: 11 10 13 10 10 9 18 10 10 10 ...
$ county: Factor w/ 219 levels "Anderson","Andrews",..: 49 49 49 49 49 49 49 49 49 49 ...
$ ownership_code : Factor w/ 23 levels "1","2","3","4",..: 11 11 3 1 11 11 11 11 11 11 ...
$ X30_days_late : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
$ X60_days_late : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ penalty : num 0 0 55.3 0 0 ...
$ processing_time : int 28 24 32 29 19 20 63 27 28 24 ...
$ transaction_code : Factor w/ 2 levels "TITLE","WDTA": 2 2 2 2 2 2 2 2 2 2 ...
The seed was set to '1234' and the ratio of the training to test data was 2:1
First, I know very little about R, so take my answer with a grain of salt.
I had the same problem, which made no sense because there were no NAs. At first I thought it was caused by strange characters like ', /, etc. that I had in my data. But no: the knncat algorithm works with those characters once I put the following three lines of code after defining my train sets (I use data.table because my data are huge):
write.csv(train, file="train.csv")
train <- fread("train.csv", sep=",", header=T, stringsAsFactors=T)
train[,V1:=NULL]
Then, there are no more messages 'Some factor has empty levels' or 'Some classes have only one member. Check "classcol"'.
I know this is not a real solution to the problem, but at least, you can finish your work.
Hope it helps.
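The CSV round-trip most likely works because re-reading the file rebuilds each factor from the values actually present, discarding empty levels. Base R's droplevels() does the same thing directly (a sketch, not part of the original answer):

```r
# Subsetting a factor keeps all original levels, even unobserved ones
f   <- factor(c("a", "b", "c"))
sub <- f[f != "c"]
levels(sub)        # still "a" "b" "c": "c" is now an empty level

# droplevels() removes the unused levels -- the same effect the
# write.csv()/fread() round-trip achieves implicitly
sub <- droplevels(sub)
levels(sub)        # "a" "b"

# Applied to a whole data frame before calling knncat():
# kdata_strain <- droplevels(kdata_strain)
```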

Modeling a very big data set (1.8 Million rows x 270 Columns) in R

I am working on a Windows 8 OS with 8 GB of RAM. I have a data.frame of 1.8 million rows x 270 columns on which I have to perform a glm (logit or any other classification).
I've tried using ff and bigglm packages for handling the data.
But I am still facing the error "Error: cannot allocate vector of size 81.5 Gb".
So I decreased the number of rows to 10 and tried the steps for bigglm on an object of class ffdf. However, the error still persists.
Can anyone suggest a solution to this problem of building a classification model with this many rows and columns?
**EDITS**:
I am not using any other program when I am running the code.
The RAM on the system is 60% free before I run the code, and that usage is due to the R program. When I terminate R, the RAM is 80% free.
I am adding some of the columns I am working with, as suggested by the commenters, for reproduction.
OPEN_FLG is the DV and the others are IDVs.
str(x[1:10,])
'data.frame': 10 obs. of 270 variables:
$ OPEN_FLG : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1
$ new_list_id : Factor w/ 9 levels "0","3","5","6",..: 1 1 1 1 1 1 1 1 1 1
$ new_mailing_id : Factor w/ 85 levels "1398","1407",..: 1 1 1 1 1 1 1 1 1 1
$ NUM_OF_ADULTS_IN_HHLD : num 3 2 6 3 3 3 3 6 4 4
$ NUMBER_OF_CHLDRN_18_OR_LESS: Factor w/ 9 levels "","0","1","2",..: 2 2 4 7 3 5 3 4 2 5
$ OCCUP_DETAIL : Factor w/ 49 levels "","00","01","02",..: 2 2 2 2 2 2 2 21 2 2
$ OCCUP_MIX_PCT : num 0 0 0 0 0 0 0 0 0 0
$ PCT_CHLDRN : int 28 37 32 23 36 18 40 22 45 21
$ PCT_DEROG_TRADES : num 41.9 38 62.8 2.9 16.9 ...
$ PCT_HOUSEHOLDS_BLACK : int 6 71 2 1 0 4 3 61 0 13
$ PCT_OWNER_OCCUPIED : int 91 66 63 38 86 16 79 19 93 22
$ PCT_RENTER_OCCUPIED : int 8 34 36 61 14 83 20 80 7 77
$ PCT_TRADES_NOT_DEROG : num 53.7 55 22.2 92.3 75.9 ...
$ PCT_WHITE : int 69 28 94 84 96 79 91 29 97 79
$ POSTAL_CD : Factor w/ 104568 levels "010011203","010011630",..: 23789 45173 32818 6260 88326 29954 28846 28998 52062 47577
$ PRES_OF_CHLDRN_0_3 : Factor w/ 4 levels "","N","U","Y": 2 2 3 4 2 4 2 4 2 4
$ PRES_OF_CHLDRN_10_12 : Factor w/ 4 levels "","N","U","Y": 2 2 4 3 3 2 3 2 2 3
[list output truncated]
And this is the example of code which I am using.
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
require(ff)
x$id <- ffseq_len(nrow(x))
xex <- expand.ffgrid(x$id, ff(1:100))
colnames(xex) <- c("id","explosion.nr")
xex <- merge(xex, x, by.x="id", by.y="id", all.x=TRUE, all.y=FALSE)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = xex)
The problem is both times I get the same error "Error: cannot allocate vector of size 81.5 Gb".
Please let me know if this is enough or whether I should include any more details about the problem.
I have the impression you are not using ffbase::bigglm.ffdf, but you want to. Namely, the following will put all your data in RAM and will use biglm::bigglm.function, which is not what you want:
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
You need to use ffbase::bigglm.ffdf, which works chunkwise on an ffdf. So load package ffbase which exports bigglm.ffdf.
If you use ffbase, you can use the following:
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
Explanation:
Because you don't limit yourself to the columns you use in the model, you will get all the columns of your xex ffdf in RAM, which is not needed. You were also using a gaussian model on a factor response, which is bizarre; I believe you were trying to do a logistic regression, so use the appropriate family argument. And this code will use ffbase::bigglm.ffdf and not biglm::bigglm.function.
If that does not work - which I doubt - it is because you have other things in RAM which you are not aware of. In that case, do:
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
ffsave(mymodeldataset, file = "mymodeldataset")
## Open R again
require(ffbase)
require(biglm)
ffload("mymodeldataset")
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
And off you go.
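The family point can be seen with base glm() on a toy dataset; bigglm() accepts the same family argument (the data below are invented for illustration):

```r
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(d$x))   # binary 0/1 response

# Default family is gaussian: this fits ordinary least squares,
# usually wrong for a 0/1 response
m_lm <- glm(y ~ x, data = d)

# Logistic regression: state the family explicitly
m_logit <- glm(y ~ x, data = d, family = binomial())
m_logit$family$family   # "binomial"
```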

Getting an error "(subscript) logical subscript too long" while training SVM from e1071 package in R

I am training an SVM on my training data (e1071 package in R). Following is the information about my data:
> str(train)
'data.frame': 891 obs. of 10 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 15 levels "capt","col","countess",..: 12 13 9 13 12 12 12 8 13 13
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 ...
$ ticket : Factor w/ 533 levels "110152","110413",..: 516 522 531 50 473 276 86 396
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 9 levels "a","b","c","d",..: 9 3 9 3 9 9 5 9 9 9 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ family : int 1 1 0 1 0 0 0 4 2 1 ...
I train it as follows:
library(e1071)
model1 <- svm(survived~.,data=train, type="C-classification")
No problem here. But when I predict as:
pred <- predict(model1,test)
I get the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
I also tried removing the "ticket" predictor from both the train and test data, but I still get the same error. What is the problem?
There might be a difference in the number of levels in one of the factors in the 'test' dataset.
Run str(test) and check that the factor variables have the same levels as the corresponding variables in the 'train' dataset.
I.e., the example below shows my.test$foo only has 4 levels:
str(my.train)
'data.frame': 554 obs. of 7 variables:
....
$ foo: Factor w/ 5 levels "C","Q","S","X","Z": 2 2 4 3 4 4 4 4 4 4 ...
str(my.test)
'data.frame': 200 obs. of 7 variables:
...
$ foo: Factor w/ 4 levels "C","Q","S","X": 3 3 3 3 1 3 3 3 3 3 ...
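If the levels really do differ, one way to reconcile them is to re-declare the test factor with the training levels (a sketch with made-up values; factor()'s levels argument is base R):

```r
# Hypothetical factors: the test column is missing the "Z" level
train_foo <- factor(c("C", "Q", "S", "X", "Z"))
test_foo  <- factor(c("S", "C", "X"))

# Re-declare the test factor using the training levels so both agree
test_foo <- factor(test_foo, levels = levels(train_foo))
levels(test_foo)   # "C" "Q" "S" "X" "Z"
```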
That's correct: the train data contains 2 blanks for embarked. Because of this there is one extra categorical level for the blanks, and that is why you are getting this error.
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The first level is the blank.
I encountered the same problem today. It turned out that the svm model in the e1071 package treats rows as the observations, meaning one row is one sample rather than one column. If you use columns as samples and rows as variables, this error will occur.
Probably your data is fine (no new levels in the test data), and you just need a small trick; then prediction works:
test.df = rbind(train.df[1,],test.df)
test.df = test.df[-1,]
This trick is from R Random Forest - type of predictors in new data do not match.
I encountered this problem today, used the trick above, and it solved the problem.
I have been playing with that data set as well. I know this was a long time ago, but one of the things you can do is explicitly include only the columns you feel will add to the model, like so:
fit <- svm(Survived~Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train)
This eliminated the problem for me by excluding columns (like ticket number) that contribute nothing relevant to the model.
Another issue I had to fix was that I had forgotten to convert some of my independent variables to factors.
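Converting the stray character columns to factors can be done in one pass over the data frame (a sketch; the toy columns are invented):

```r
# Toy data frame with a character column that should be a factor
train <- data.frame(age = c(22, 38, 26),
                    sex = c("male", "female", "female"),
                    stringsAsFactors = FALSE)

# Convert every character column to a factor, leave the rest alone
train[] <- lapply(train, function(col) {
  if (is.character(col)) factor(col) else col
})
str(train)   # sex is now Factor w/ 2 levels
```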
