Rounding the output of the correlate function from the corrr package in R

I'm creating a correlation table using the correlate function in the corrr package. Here is my code and a screenshot of the output.
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
I think this would look better and be easier to read if I could round off the values in the correlation table. I tried this code:
correlation_table <- round(corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson"),2)
But I get this error:
Error in Math.data.frame(list(term = c("prof_rank_factor", "yrs.since.phd", : non-numeric variable(s) in data frame: term
The non-numeric variables part of this error message doesn't make sense to me. When I look at the structure I only see integer or numeric variable types.
'data.frame': 397 obs. of 6 variables:
$ prof_rank_factor : num 3 3 1 3 3 2 3 3 3 3 ...
$ yrs.since.phd : int 19 20 4 45 40 6 30 45 21 18 ...
$ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
$ salary : num 139750 173200 79750 115000 141500 ...
$ sex_factor : num 1 1 1 1 1 1 1 1 1 2 ...
$ discipline_factor: num 2 2 2 2 2 2 2 2 2 2 ...
How can I clean up this correlation table with rounded values?

correlate() returns a tibble whose first column, term, is character, which is why round() fails on the whole object. Instead, round only the numeric columns of that tibble:
library(dplyr)
corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson") %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 2)))
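To see why the original round() call fails and what the across() step does, here is a minimal base-R sketch. The data frame toy and its values are made up as a stand-in for the tibble that correlate() returns (a character term column plus numeric columns); they are not the asker's data.

```r
# Hypothetical stand-in for correlate() output: one character column, two numeric ones.
toy <- data.frame(
  term = c("salary", "yrs.service"),
  salary = c(NA, 0.334745),
  yrs.service = c(0.334745, NA),
  stringsAsFactors = FALSE
)

# round() on the whole data frame fails because `term` is not numeric.
whole_frame <- tryCatch(round(toy, 2), error = function(e) "error")

# Round only the numeric columns, leaving `term` untouched.
is_num <- vapply(toy, is.numeric, logical(1))
toy[is_num] <- lapply(toy[is_num], round, digits = 2)
toy
```

The same idea is what where(is.numeric) expresses in the dplyr version: select the numeric columns first, then apply round() to each.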

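Alternatively, we can reduce the number of significant digits that R prints. Note that options(digits = 2) is a session-wide display setting: it does not change the stored values, and tibble printing may follow its own pillar.sigfig option instead: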
options(digits=2)
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table

Related

Recursive partitioning for factors/characters problem

Currently I am working with the dataset predictions. In this data I have converted clear character type variables into factors because I think factors work better than characters for glmtree() code (tell me if I am wrong with this):
> str(predictions)
'data.frame': 43804 obs. of 14 variables:
$ month : Factor w/ 7 levels "01","02","03",..: 6 6 6 6 1 1 2 2 3 3 ...
$ pred : num 0.21 0.269 0.806 0.945 0.954 ...
$ treatment : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ type : Factor w/ 4 levels "S","MS","ML",..: 1 1 4 4 4 4 4 4 4 4 ...
$ i_mode : Factor w/ 143 levels "AAA","ABC","CBB",..: 28 28 104 104 104 104 104 104 104 104 ...
$ r_mode : Factor w/ 29 levels "0","5","8","11",..: 4 4 2 2 2 2 2 2 2 2 ...
$ in_mode: Factor w/ 22 levels "XY",..: 11 11 6 6 6 6 6 6 6 6 ...
$ v_mode : Factor w/ 5 levels "1","3","4","7",..: 1 1 1 1 1 1 1 1 1 1 ...
$ di : num 1157 1157 1945 1945 1945 ...
$ cont : Factor w/ 5 levels "AN","BE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ hk : num 0.512 0.512 0.977 0.977 0.941 ...
$ np : num 2 2 2 2 2 2 2 2 2 2 ...
$ hd : num 1 1 0.408 0.408 0.504 ...
$ nd : num 1 1 9 9 9 9 7 7 9 9 ...
I want to estimate a recursive partitioning model of this kind:
library("partykit")
glmtr <- glmtree(formula = pred ~ treatment + 1 | (month+type+i_mode+r_mode+in_mode+v_mode+di+cont+np+nd+hd+hk),
data = predictions,
maxdepth=6,
family = quasibinomial)
My data do not contain any NAs. However, the following error arises (even after converting characters to factors):
Error in matrix(0, nrow = mi, ncol = nl) :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range
Any clue?
Thank you
You are right that glmtree() and the underlying mob() function expect the split variables to be factors in case of nominal information. However, computationally this is only feasible for factors that have a limited number of levels, because the algorithm will try all possible partitions of the levels into two groups. Thus, for your i_mode factor with nl levels this necessitates going through mi splits into two groups:
nl <- 143
mi <- 2^(nl - 1L) - 1L
mi
## [1] 5.575186e+42
Internally, mob() tries to create a matrix for storing all log-likelihoods associated with the corresponding partitioned models. And this is not possible because such a matrix cannot be represented. (And even if you could, then you wouldn't finish fitting all the associated models.) Admittedly, the error message is not very useful and should be improved. We will look into that for the next revision of the package.
To solve the problem, I would recommend turning the variables i_mode, r_mode, and in_mode into variables that are more suitable for binary splitting with exhaustive search. Maybe some of the variables are actually ordinal? If so, I would recommend turning them into ordered factors or, in the case of i_mode, even into a numeric variable because the number of levels is large enough. Alternatively, you may be able to create several factors that each capture a different property of the levels and use those for partitioning.
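The split count above can be checked for the factors in the question with a few lines of base R. This is only a sketch of the arithmetic; the helper n_binary_splits is a name introduced here for illustration and is not part of partykit.

```r
# Number of ways to split an unordered factor with nl levels into two non-empty groups.
# (Hypothetical helper, not a partykit function.)
n_binary_splits <- function(nl) 2^(nl - 1) - 1

# Level counts taken from the str() output in the question.
levels_in_question <- c(i_mode = 143, r_mode = 29, in_mode = 22, type = 4)
splits <- n_binary_splits(levels_in_question)
splits

# Which counts are larger than the biggest value an R integer can hold?
too_large <- splits > .Machine$integer.max
too_large

# An ordered factor (or a numeric variable) only needs nl - 1 cut points.
ordered_splits <- levels_in_question - 1
ordered_splits
```

By this count only i_mode exceeds the integer range, which is consistent with the "NAs introduced by coercion to integer range" warning. r_mode stays representable but still implies hundreds of millions of candidate splits, whereas treating a variable as ordered reduces the count to nl - 1.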

lattice plot error: need finite xlim values calls

Whenever I try to plot across factors I keep getting an error.
Here is what my data look like:
str(dataWithNoNa)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ dayType : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
I am trying to plot using the lattice plotting system using Weekday/Weekend as a factor.
Here is what I tried:
plot(dataWithNoNa$steps~ dataWithNoNa$interval | dataWithNoNa$dayType, type="l")
Error in plot.window(...) : need finite 'xlim' values
I even checked to make sure my data had no NAs:
sum(is.na(dataWithNoNa$interval))
## [1] 0
sum(is.na(dataWithNoNa$steps))
## [1] 0
What am I doing wrong?
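Base plot() does not support the conditioning operator | in a formula; that syntax belongs to lattice, so use xyplot() instead. Try this (df is the sample data defined below, and type = "l" draws lines as in your attempt):
library(lattice)
xyplot(steps ~ interval | factor(dayType), data=df, type="l")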
Sample data:
df <- data.frame(
steps=c(1.717,0.3396,0.1321,0.1509,0.0755),
interval=c(0,5,10,15,20),
dayType=c(1,1,1,2,2)
)
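A likely explanation for the "need finite 'xlim' values" error is that a base R formula evaluates interval | dayType as a logical OR, and | is not meaningful for factors, so the whole x term becomes NA. The sketch below illustrates this with made-up data (the data frame toy is an assumption, shaped like the data in the question) and contrasts it with lattice, where | means "condition on".

```r
library(lattice)

# Made-up data in the shape of the question's data frame.
toy <- data.frame(
  steps = c(1.717, 0.3396, 0.1321, 0.1509, 0.0755, 0.2),
  interval = c(0, 5, 10, 0, 5, 10),
  dayType = factor(rep(c("Weekday", "Weekend"), each = 3))
)

# In a base formula, `interval | dayType` is evaluated as a logical OR.
# `|` is not meaningful for factors, so the result is all NA (with a warning).
x_term <- suppressWarnings(toy$interval | toy$dayType)
all(is.na(x_term))

# lattice treats `|` as "condition on", giving one panel per factor level.
p <- xyplot(steps ~ interval | dayType, data = toy, type = "l")
print(p)  # trellis objects must be printed explicitly inside scripts or functions
```

With an x variable that is entirely NA, base graphics has no finite range to set up the axes, which matches the error from plot.window().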

KNNCAT error "some classes have only one member"

I'm trying to run a KNN analysis on auto data using knncat's knncat function. My training set is around 700,000 observations. The following happens when I try to implement the analysis. I've attempted to remove NA using the complete cases method while reading the data in. I'm not sure exactly how to take care of the errors or what they mean.
kdata.training = kdataf[ind==1,]
kdata.test = kdataf[ind==2,]
kdata_pred = knncat(train = kdata.training, test = kdata.test, classcol = 4)
Error in knncat(train = kdata.training, test = kdata.test, classcol = 4) :
Some classes have only one member. Check "classcol"
When I attempt to run a small subset of the training and test sets (200 and 70 observations respectively), I get the following error:
kdata_strain = kdata.training[1:200,]
kdata_stest = kdata.test[1:70,]
kdata_pred = knncat(train = kdata_strain, test = kdata_stest, classcol = 4)
Error in knncat(train = kdata_strain, test = kdata_stest, classcol = 4) :
Some factor has empty levels
Here is the output of str() called on kdataf, the data frame from which the above data were sampled:
str(kdataf)
'data.frame': 1159712 obs. of 9 variables:
$ vehicle_sales_price: num 13495 11999 14499 12495 14999 ...
$ week_number: Factor w/ 27 levels "1","2","3","4",..: 11 10 13 10 10 9 18 10 10 10 ...
$ county: Factor w/ 219 levels "Anderson","Andrews",..: 49 49 49 49 49 49 49 49 49 49 ...
$ ownership_code : Factor w/ 23 levels "1","2","3","4",..: 11 11 3 1 11 11 11 11 11 11 ...
$ X30_days_late : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
$ X60_days_late : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ penalty : num 0 0 55.3 0 0 ...
$ processing_time : int 28 24 32 29 19 20 63 27 28 24 ...
$ transaction_code : Factor w/ 2 levels "TITLE","WDTA": 2 2 2 2 2 2 2 2 2 2 ...
The seed was set to '1234' and the ratio of the training to test data was 2:1
First, I know very little about R, so take my answer with a grain of salt.
I had the same problem, which made no sense because there were no NAs. At first I thought the cause was unusual characters such as ', / and so on in my data. But no: the knncat algorithm works with those characters once I put the following lines of code after defining my training sets (I use data.table because my data are huge):
library(data.table)
write.csv(train, file="train.csv")
train <- fread("train.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)
train[,V1:=NULL]  # drop the row-name column that write.csv added
Then, there are no more messages 'Some factor has empty levels' or 'Some classes have only one member. Check "classcol"'.
I know this is not a real solution to the problem, but at least, you can finish your work.
Hope it helps.
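One plausible reading of why the CSV round trip helps: subsetting a data frame keeps every original factor level, including levels that no longer have any rows, while re-reading the file rebuilds each factor from the values actually present. If that is the cause, droplevels() should have the same effect without touching the disk. This is an assumption about the cause, not something checked against knncat's source, and the data below are made up.

```r
# Made-up data: a factor class column in which one level is rare.
full <- data.frame(
  price = c(10, 12, 11, 13, 15, 14),
  code = factor(c("A", "A", "B", "B", "B", "C"))
)

# Subsetting keeps every original level, even those with no rows left.
train <- full[1:5, ]
table(train$code)  # level "C" is still listed, with a count of 0

# droplevels() removes the unused levels from all factor columns.
train_clean <- droplevels(train)
levels(train_clean$code)

# Classes with a single member can be found by tabulating the class column.
class_counts <- table(full$code)
singletons <- names(class_counts)[class_counts == 1]
singletons
```

Tabulating the class column in this way (column 4, ownership_code, in the question) would show whether any class really has only one member in the training set; such classes would have to be merged or removed before fitting.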

Modeling a very big data set (1.8 Million rows x 270 Columns) in R

I am working on Windows 8 with 8 GB of RAM. I have a data.frame of 1.8 million rows x 270 columns on which I have to fit a glm (logit or any other classification model).
I've tried using ff and bigglm packages for handling the data.
But I am still facing a problem with the error "Error: cannot allocate vector of size 81.5 Gb".
So I decreased the number of rows to 10 and tried the steps for bigglm on an object of class ffdf. However, the error persists.
Can anyone suggest a way to build a classification model with this many rows and columns?
**EDITS**:
I am not using any other program when I am running the code.
The RAM on the system is 60% free before I run the code, and that is because of the R program. When I terminate R, the RAM is 80% free.
I am adding some of the columns which I am working with now as suggested by the commenters for reproduction.
OPEN_FLG is the DV and others are IDVs
str(x[1:10,])
'data.frame': 10 obs. of 270 variables:
$ OPEN_FLG : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1
$ new_list_id : Factor w/ 9 levels "0","3","5","6",..: 1 1 1 1 1 1 1 1 1 1
$ new_mailing_id : Factor w/ 85 levels "1398","1407",..: 1 1 1 1 1 1 1 1 1 1
$ NUM_OF_ADULTS_IN_HHLD : num 3 2 6 3 3 3 3 6 4 4
$ NUMBER_OF_CHLDRN_18_OR_LESS: Factor w/ 9 levels "","0","1","2",..: 2 2 4 7 3 5 3 4 2 5
$ OCCUP_DETAIL : Factor w/ 49 levels "","00","01","02",..: 2 2 2 2 2 2 2 21 2 2
$ OCCUP_MIX_PCT : num 0 0 0 0 0 0 0 0 0 0
$ PCT_CHLDRN : int 28 37 32 23 36 18 40 22 45 21
$ PCT_DEROG_TRADES : num 41.9 38 62.8 2.9 16.9 ...
$ PCT_HOUSEHOLDS_BLACK : int 6 71 2 1 0 4 3 61 0 13
$ PCT_OWNER_OCCUPIED : int 91 66 63 38 86 16 79 19 93 22
$ PCT_RENTER_OCCUPIED : int 8 34 36 61 14 83 20 80 7 77
$ PCT_TRADES_NOT_DEROG : num 53.7 55 22.2 92.3 75.9 ...
$ PCT_WHITE : int 69 28 94 84 96 79 91 29 97 79
$ POSTAL_CD : Factor w/ 104568 levels "010011203","010011630",..: 23789 45173 32818 6260 88326 29954 28846 28998 52062 47577
$ PRES_OF_CHLDRN_0_3 : Factor w/ 4 levels "","N","U","Y": 2 2 3 4 2 4 2 4 2 4
$ PRES_OF_CHLDRN_10_12 : Factor w/ 4 levels "","N","U","Y": 2 2 4 3 3 2 3 2 2 3
[list output truncated]
And this is the example of code which I am using.
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
require(ff)
x$id <- ffseq_len(nrow(x))
xex <- expand.ffgrid(x$id, ff(1:100))
colnames(xex) <- c("id","explosion.nr")
xex <- merge(xex, x, by.x="id", by.y="id", all.x=TRUE, all.y=FALSE)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = xex)
The problem is that both times I get the same error: "Error: cannot allocate vector of size 81.5 Gb".
Please let me know if this is enough or if I should include any more details about the problem.
I have the impression you are not using ffbase::bigglm.ffdf but you want to. Namely the following will put all your data in RAM and will use biglm::bigglm.function, which is not what you want.
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
You need to use ffbase::bigglm.ffdf, which works chunkwise on an ffdf. So load package ffbase which exports bigglm.ffdf.
If you use ffbase, you can use the following:
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
Explanation:
Because you don't limit yourself to the columns you use in the model, all columns of your xex ffdf are pulled into RAM, which is not needed. You were also fitting a gaussian model (the default family) to a factor response, which is unusual. I believe you were trying to do a logistic regression, so use the appropriate family argument. With ffbase loaded, the call will dispatch to ffbase::bigglm.ffdf and not biglm::bigglm.function.
If that does not work, which I doubt, it is because you have other things in RAM that you are not aware of. In that case, do the following:
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
ffsave(mymodeldataset, file = "mymodeldataset")
## Open R again
require(ffbase)
require(biglm)
ffload("mymodeldataset")
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
And off you go.
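The two modelling points in this answer (keep only the columns the model needs, and use a binomial family with a logical response) can be tried on a small scale in base R. In the sketch below stats::glm stands in for bigglm and the data are simulated; the column unused_column is hypothetical and only represents the many columns that are not in the formula. It does not reproduce the chunked, out-of-memory fitting that ffbase provides.

```r
set.seed(1)

# Simulated stand-in for the asker's data; most column names follow the question.
n <- 200
x <- data.frame(
  OPEN_FLG = factor(sample(c("N", "Y"), n, replace = TRUE)),
  NUM_OF_ADULTS_IN_HHLD = sample(1:6, n, replace = TRUE),
  OCCUP_MIX_PCT = runif(n, 0, 100),
  unused_column = rnorm(n)  # hypothetical column that the model does not use
)

# Keep only the columns the model needs, then recode the response as logical.
model_data <- x[c("OPEN_FLG", "NUM_OF_ADULTS_IN_HHLD", "OCCUP_MIX_PCT")]
model_data$OPEN_FLG <- model_data$OPEN_FLG == "Y"

# stats::glm stands in for bigglm here; the family argument plays the same role.
fit <- glm(OPEN_FLG ~ NUM_OF_ADULTS_IN_HHLD + OCCUP_MIX_PCT,
           data = model_data, family = binomial())
coef(fit)
```

Leaving out family would fall back to the gaussian default, which is the mismatch pointed out in the explanation above.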

Factors in aov()

I ran into a weird problem while following example 1 in the R Guide.
Here is the example
> datafilename="http://personality-project.org/r/datasets/R.appendix1.data"
> data.ex1 = read.table(datafilename,header=T) #read the data into a table
> aov.ex1 = aov(Alertness~Dosage,data=data.ex1) #do the analysis of variance
> summary(aov.ex1) #show the summary table
But, when I applied aov on my own data, things changed.
> test.data <- data.frame(fac=letters[c(1:3,1:3)], x=1:6)
> test.result <- aov(fac~x, data=test.data)
Error in storage.mode(y) <- "double" :
invalid to change the storage mode of a factor
In addition: Warning message:
In model.response(mf, "numeric") :
using type="numeric" with a factor response will be ignored
I'm totally confused. What's the difference between test.data and data.ex1 in the R Guide example?
> str(test.data)
'data.frame': 6 obs. of 2 variables:
$ fac: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3
$ x : int 1 2 3 4 5 6
> str(data.ex1)
'data.frame': 18 obs. of 2 variables:
$ Dosage : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 2 2 2 2 ...
$ Alertness: int 30 38 35 41 27 24 32 26 31 29 ...
It should be aov(x ~ fac, data = test.data), which works. The formula needs to be response ~ factor, not factor ~ response: aov() requires a numeric response, and in your call the factor fac was on the left-hand side.
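For completeness, here is the corrected call run on the question's own test data. fac is wrapped in factor() explicitly so that the example behaves the same regardless of the stringsAsFactors default in the installed R version.

```r
# Same data as in the question, with `fac` made a factor explicitly.
test.data <- data.frame(fac = factor(letters[c(1:3, 1:3)]), x = 1:6)

# Numeric response on the left, grouping factor on the right.
fit <- aov(x ~ fac, data = test.data)
summary(fit)
```

This mirrors the R Guide example, where the numeric Alertness is the response and the factor Dosage is the predictor.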
