Why is this error happening with the imputation function in R? - r

I am trying to do an imputation based on this example: impute example
data(airquality)
summary(airquality)
airq = airquality
ind = sample(nrow(airq), 10)
airq$Wind[ind] = NA
airq$Wind = cut(airq$Wind, c(0,8,16,24))
summary(airq)
imp = impute(airq, classes = list(integer = imputeMean(), factor = imputeMode()),
dummy.classes = "integer")
which gives me a warning:
Warning message:
In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
However, when I try looking at the returned dataframe, I get:
> head(imp, 10)
Error in x[..., drop = drop] : incorrect number of dimensions
> head(imp$data, 10)
NULL
and desc gives:
> imp$desc
NULL
I had initially done the above using my actual data, and was getting these errors, so I tried the above example for a sanity check.
I've tried this in Windows both from RStudio and from the command line interface, all with same results on the example and my actual data. Also, tried using version 3.63 and 4.03, again with the same results.
I've also tried this on two fresh installs on Ubuntu, with the same results.
Interestingly, when I do names the dummy variable are not there:
> names(imp)
[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
str(imp) gives:
> str(imp)
Classes ‘impute’ and 'data.frame': 153 obs. of 6 variables:
$ Ozone : num 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: num 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : Factor w/ 3 levels "(0,8]","(8,16]",..: 1 1 2 NA 2 2 2 2 3 2 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "imputed")= int [1:54] 5 NA NA NA NA NA NA NA NA NA ...
and looking at one of the columns on which imputation should have taken place:
> head(imp$Solar.R)
[1] 190 118 149 313 NA NA
(my actual data replaced NA with all 0's, even though it should have been the column mean)
UPDATE: I tested this just now on my local machine running MacOS and getting the exact same error.

I figured it out. I was using whichever version of impute was in the Hmisc library. Using mlr did the trick.
> imp$desc
Imputation description
Target:
Features: 6; Imputed: 6
impute.new.levels: TRUE
recode.factor.levels: TRUE
dummy.type: factor

Related

unable to write to the csv file [duplicate]

I am trying to write a dataframe in R to a text file, however it is returning to following error:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
I used the following command for the export:
write.table(df, file ='dfname.txt', sep='\t' )
I have no idea what the problem could stem from. As far as "missing data where TRUE/FALSE is needed", I have only one column which contains TRUE/FALSE values, and none of these values are missing.
Contents of the dataframe:
> str(df)
'data.frame': 776 obs. of 15 variables:
$ Age : Factor w/ 4 levels "","A","J","SA": 2 2 2 2 2 2 2 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Rep : Factor w/ 11 levels "L","NR","NRF",..: 1 1 4 4 2 2 2 2 2 2 ...
$ FA : num 61.5 62.5 60.5 61 59.5 59.5 59.1 59.2 59.8 59.9 ...
$ Mass : num 20 19 16.5 17.5 NA 14 NA 23 19 18.5 ...
$ Vir1 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir2 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir3 : num 40 999 999 999 999 999 999 999 999 999 ...
$ Location : Factor w/ 4 levels "Loc1",..: 4 4 4 4 4 4 2 2 2 2 ...
$ Site : Factor w/ 6 levels "A","B","C",..: 5 5 5 5 5 5 3 3 3 3 ...
$ Date : Date, format: "2010-08-30" "2010-08-30" ...
$ Record : int 35 34 39 49 69 38 145 112 125 140 ...
$ SampleID : Factor w/ 776 levels "AT1-A-F1","AT1-A-F10",..: 525 524 527 528
529 526 111 78
88 110 ...
$ Vir1Inc : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Month :'data.frame': 776 obs. of 2 variables:
..$ Dates: Date, format: "2010-08-30" "2010-08-30" ...
..$ Month: Factor w/ 19 levels "Apr-2011","Aug-2010",..: 2 2 2 2
2 2 18 18 18 18 ...
I hope I've given enough/the right information ...
Many thanks,
Heather
An example to reproduce the error. I create a nested data.frame:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
str(dd)
'data.frame': 15 obs. of 2 variables:
$ Age : int 1 2 3 4 5 6 7 8 9 10 ...
$ Month:'data.frame': 15 obs. of 2 variables:
..$ Dates: Date, format: "2003-02-02" "2003-02-03" "2003-02-04" ...
..$ Month: Factor w/ 12 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
No I try to save it , I reproduce the error :
write.table(dd)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) : missing value where TRUE/FALSE needed
Without inverstigating, one option to remove the nested data.frame:
write.table(data.frame(subset(dd,select=-c(Month)),unclass(dd$Month)))
The solution by agstudy provides a great quick fix, but there is a simple alternative/general solution for which you do not have to specify the element(s) in your data.frame that was(were) nested:
The following bit is just copied from agstudy's solution to obtain the nested data.frame dd:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
You can use akhilsbehl's LinearizeNestedList() function (which mrdwab made available here) to flatten (or linearize) the nested levels:
library(devtools)
source_gist(4205477) #loads the function
ddf <- LinearizeNestedList(dd, LinearizeDataFrames = TRUE)
# ddf is now a list with two elements (Age and Month)
ddf <- LinearizeNestedList(ddf, LinearizeDataFrames = TRUE)
# ddf is now a list with 3 elements (Age, `Month/Dates` and `Month/Month`)
ddf <- as.data.frame.list(ddf)
# transforms the flattened/linearized list into a data.frame
ddf is now a data.frame without nesting. However, it's column names still reflect the nested structure:
names(ddf)
[1] "Age" "Month.Dates" "Month.Month"
If you want to change this (in this case it seems redundant to have Month. written before Dates, for example) you can use gsub and some regular expression that I copied from Sacha Epskamp to remove all text in the column names before the ..
names(ddf) <- gsub(".*\\.","",names(ddf))
names(ddf)
[1] "Age" "Dates" "Month"
The only thing left now is exporting the data.frame as usual:
write.table(ddf, file="test.txt")
Alternatively, you could use the "flatten" function from the jsonlite package to flatten the dataframe before export. It achieves the same result of the other functions mentioned and is much easier to implement.
jsonlite::flatten
https://rdrr.io/cran/jsonlite/man/flatten.html

How to remove $ from all values in a data frame column in a character vector?

I have a data frame in R that has information about NBA players, including salary information. All the data in the salary column have a "$" before the value and I want to convert the character data to numeric for the purpose of analysis. So I need to remove the "$" in this column. However, I am unable to subset or parse any of the values in this column. It seems that each value is a vector of 1. I've included below the structure of the data and what I have tried in my attempt at removing the "$".
> str(combined)
'data.frame': 588 obs. of 9 variables:
$ Player: chr "Aaron Brooks" "Aaron Gordon" "Aaron Gray" "Aaron Harrison" ...
$ Tm : Factor w/ 30 levels "ATL","BOS","BRK",..: 4 22 9 5 9 18 1 5 25 30 ...
$ Pos : Factor w/ 5 levels "C","PF","PG",..: 3 2 NA 5 NA 2 1 1 4 5 ...
$ Age : num 31 20 NA 21 NA 24 29 31 25 33 ...
$ G : num 69 78 NA 21 NA 52 82 47 82 13 ...
$ MP : num 1108 1863 NA 93 NA ...
$ PER : num 11.8 17 NA 4.3 NA 5.6 19.4 18.2 12.7 9.2 ...
$ WS : num 0.9 5.4 NA 0 NA -0.5 9.4 2.8 4 0.3 ...
$ Salary: chr "$2000000" "$4171680" "$452059" "$525093" ...
combined[, "Salary"] <- gsub("$", "", combined[, "Salary"])
The last line of code above is able to run successfully but it doesn't augment the "Salary" column.
I am able to successfully augment it by running the code listed below, but I need to find a way to automize the replacement process for the whole data set instead of doing it row by row.
combined[, "Salary"] <- gsub("$2000000", "2000000", combined[, "Salary"])
How can I subset the character vectors in this column to remove the "$"? Apologies for any formatting faux pas ahead of time, this is my first time asking a question. Cheers,
The $ is a metacharacter which means the end of the string. So, we need to either escape (\\$) or place it in square brackets ("[$]") or use fixed = TRUE in the sub. We don't need gsub as there seems to be only a single $ character in each string.
combined[, "Salary"] <- as.numeric(sub("$", "", combined[, "Salary"], fixed=TRUE))
Or as #gung mentioned in the comments, using substr would be faster
as.numeric(substr(d$Salary, 2, nchar(d$Salary)))

Modeling a very big data set (1.8 Million rows x 270 Columns) in R

I am working on a Windows 8 OS with a RAM of 8 GB . I have a data.frame of 1.8 million rows x 270 columns on which I have to perform a glm. (logit/any other classification)
I've tried using ff and bigglm packages for handling the data.
But I am still facing a problem with the error "Error: cannot allocate vector of size 81.5 Gb".
So, I decreased the number of rows to 10 and tried the steps for bigglm on an object of class ffdf. However the error still is persisting.
Can any one suggest me the solution of this problem of building a classification model with these many rows and columns?
**EDITS**:
I am not using any other program when I am running the code.
The RAM on the system is 60% free before I run the code and that is because of the R program. When I terminate R, the RAM 80% free.
I am adding some of the columns which I am working with now as suggested by the commenters for reproduction.
OPEN_FLG is the DV and others are IDVs
str(x[1:10,])
'data.frame': 10 obs. of 270 variables:
$ OPEN_FLG : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1
$ new_list_id : Factor w/ 9 levels "0","3","5","6",..: 1 1 1 1 1 1 1 1 1 1
$ new_mailing_id : Factor w/ 85 levels "1398","1407",..: 1 1 1 1 1 1 1 1 1 1
$ NUM_OF_ADULTS_IN_HHLD : num 3 2 6 3 3 3 3 6 4 4
$ NUMBER_OF_CHLDRN_18_OR_LESS: Factor w/ 9 levels "","0","1","2",..: 2 2 4 7 3 5 3 4 2 5
$ OCCUP_DETAIL : Factor w/ 49 levels "","00","01","02",..: 2 2 2 2 2 2 2 21 2 2
$ OCCUP_MIX_PCT : num 0 0 0 0 0 0 0 0 0 0
$ PCT_CHLDRN : int 28 37 32 23 36 18 40 22 45 21
$ PCT_DEROG_TRADES : num 41.9 38 62.8 2.9 16.9 ...
$ PCT_HOUSEHOLDS_BLACK : int 6 71 2 1 0 4 3 61 0 13
$ PCT_OWNER_OCCUPIED : int 91 66 63 38 86 16 79 19 93 22
$ PCT_RENTER_OCCUPIED : int 8 34 36 61 14 83 20 80 7 77
$ PCT_TRADES_NOT_DEROG : num 53.7 55 22.2 92.3 75.9 ...
$ PCT_WHITE : int 69 28 94 84 96 79 91 29 97 79
$ POSTAL_CD : Factor w/ 104568 levels "010011203","010011630",..: 23789 45173 32818 6260 88326 29954 28846 28998 52062 47577
$ PRES_OF_CHLDRN_0_3 : Factor w/ 4 levels "","N","U","Y": 2 2 3 4 2 4 2 4 2 4
$ PRES_OF_CHLDRN_10_12 : Factor w/ 4 levels "","N","U","Y": 2 2 4 3 3 2 3 2 2 3
[list output truncated]
And this is the example of code which I am using.
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
require(ff)
x$id <- ffseq_len(nrow(x))
xex <- expand.ffgrid(x$id, ff(1:100))
colnames(xex) <- c("id","explosion.nr")
xex <- merge(xex, x, by.x="id", by.y="id", all.x=TRUE, all.y=FALSE)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = xex)
The problem is both times I get the same error "Error: cannot allocate vector of size 81.5 Gb".
Please let me know if this is enough or should I include anymore details about the problem.
I have the impression you are not using ffbase::bigglm.ffdf but you want to. Namely the following will put all your data in RAM and will use biglm::bigglm.function, which is not what you want.
require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
You need to use ffbase::bigglm.ffdf, which works chunkwise on an ffdf. So load package ffbase which exports bigglm.ffdf.
If you use ffbase, you can use the following:
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
Explanation:
Because you don't limit yourself to the columns you use in the model, you will get all your columns of your xex ffdf in RAM which is not needed. You were using a gaussian model on a factor response, bizarre? I believe you were trying to do a logistic regression, so use the appropriate family argument? And it will use ffbase::bigglm.ffdf and not biglm::bigglm.function.
If that does not work - which I doubt, it is because you have other things in RAM which you are not aware of. In that case do.
require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
ffsave(mymodeldataset, file = "mymodeldataset")
## Open R again
require(ffbase)
require(biglm)
ffload("mymodeldataset")
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
And off you go.

Error when exporting dataframe to text file in R

I am trying to write a dataframe in R to a text file, however it is returning to following error:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
I used the following command for the export:
write.table(df, file ='dfname.txt', sep='\t' )
I have no idea what the problem could stem from. As far as "missing data where TRUE/FALSE is needed", I have only one column which contains TRUE/FALSE values, and none of these values are missing.
Contents of the dataframe:
> str(df)
'data.frame': 776 obs. of 15 variables:
$ Age : Factor w/ 4 levels "","A","J","SA": 2 2 2 2 2 2 2 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Rep : Factor w/ 11 levels "L","NR","NRF",..: 1 1 4 4 2 2 2 2 2 2 ...
$ FA : num 61.5 62.5 60.5 61 59.5 59.5 59.1 59.2 59.8 59.9 ...
$ Mass : num 20 19 16.5 17.5 NA 14 NA 23 19 18.5 ...
$ Vir1 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir2 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir3 : num 40 999 999 999 999 999 999 999 999 999 ...
$ Location : Factor w/ 4 levels "Loc1",..: 4 4 4 4 4 4 2 2 2 2 ...
$ Site : Factor w/ 6 levels "A","B","C",..: 5 5 5 5 5 5 3 3 3 3 ...
$ Date : Date, format: "2010-08-30" "2010-08-30" ...
$ Record : int 35 34 39 49 69 38 145 112 125 140 ...
$ SampleID : Factor w/ 776 levels "AT1-A-F1","AT1-A-F10",..: 525 524 527 528
529 526 111 78
88 110 ...
$ Vir1Inc : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Month :'data.frame': 776 obs. of 2 variables:
..$ Dates: Date, format: "2010-08-30" "2010-08-30" ...
..$ Month: Factor w/ 19 levels "Apr-2011","Aug-2010",..: 2 2 2 2
2 2 18 18 18 18 ...
I hope I've given enough/the right information ...
Many thanks,
Heather
An example to reproduce the error. I create a nested data.frame:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
str(dd)
'data.frame': 15 obs. of 2 variables:
$ Age : int 1 2 3 4 5 6 7 8 9 10 ...
$ Month:'data.frame': 15 obs. of 2 variables:
..$ Dates: Date, format: "2003-02-02" "2003-02-03" "2003-02-04" ...
..$ Month: Factor w/ 12 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
No I try to save it , I reproduce the error :
write.table(dd)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) : missing value where TRUE/FALSE needed
Without inverstigating, one option to remove the nested data.frame:
write.table(data.frame(subset(dd,select=-c(Month)),unclass(dd$Month)))
The solution by agstudy provides a great quick fix, but there is a simple alternative/general solution for which you do not have to specify the element(s) in your data.frame that was(were) nested:
The following bit is just copied from agstudy's solution to obtain the nested data.frame dd:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
You can use akhilsbehl's LinearizeNestedList() function (which mrdwab made available here) to flatten (or linearize) the nested levels:
library(devtools)
source_gist(4205477) #loads the function
ddf <- LinearizeNestedList(dd, LinearizeDataFrames = TRUE)
# ddf is now a list with two elements (Age and Month)
ddf <- LinearizeNestedList(ddf, LinearizeDataFrames = TRUE)
# ddf is now a list with 3 elements (Age, `Month/Dates` and `Month/Month`)
ddf <- as.data.frame.list(ddf)
# transforms the flattened/linearized list into a data.frame
ddf is now a data.frame without nesting. However, it's column names still reflect the nested structure:
names(ddf)
[1] "Age" "Month.Dates" "Month.Month"
If you want to change this (in this case it seems redundant to have Month. written before Dates, for example) you can use gsub and some regular expression that I copied from Sacha Epskamp to remove all text in the column names before the ..
names(ddf) <- gsub(".*\\.","",names(ddf))
names(ddf)
[1] "Age" "Dates" "Month"
The only thing left now is exporting the data.frame as usual:
write.table(ddf, file="test.txt")
Alternatively, you could use the "flatten" function from the jsonlite package to flatten the dataframe before export. It achieves the same result of the other functions mentioned and is much easier to implement.
jsonlite::flatten
https://rdrr.io/cran/jsonlite/man/flatten.html

Wrong R data type or bad data?

I'm having trouble doing simple functions on a data frame and am unsure whether it's the data type of the column, or bad data in the data frame.
I exported a SQL query into a CSV file, then loaded it into a data frame, then attached it.
df <-read.csv("~/Desktop/orders.csv")
Attach(df)
When I am done, and run str(df), here is what I get:
$ AccountID: Factor w/ 18093 levels "(819947 row(s) affected)",..: 10 97 167 207 207 299 299 309 352 573 ...
$ OrderID : int 1874197767 1874197860 1874196789 1874206918 1874209100 1874207018 1874209111 1874233050 1874196791 1875081598 ...
$ OrderDate : Factor w/ 280 levels "","2010-09-24",..: 2 2 2 2 2 2 2 2 2 2 ...
$ NumofProducts : int 16 6 4 6 10 4 2 4 6 40 ...
$ OrderTotal : num 20.3 13.8 12.5 13.8 16.4 ...
$ SpecialOrder : int 1 1 1 1 1 1 1 1 1 1 ...
Trying to run the following functions, here is what I get:
> length(OrderID)
[1] 0
> min(OrderTotal)
[1] NA
> min(OrderTotal, na.rm=TRUE)
[1] 5.00
> mean(NumofProducts)
[1] NA
> mean(NumofProducts, na.rm=TRUE)
[1] 3.462902
I have two questions related to this data frame:
Do I have the right data types for the columns? Nums versus integers versus decimals.
Is there a way to review the data set to find the rows that are driving the need to use na.rm=TRUE to make the function work? I'd like to know how many there are, etc.
The difference between num and int is pretty irrelevant at this stage.
See help(is.na) for starters on NA handling. Do things like:
sum(is.na(foo))
to see how many foo's are NA values. Then things like:
df[is.na(df$foo),]
to see the rows of df where foo is NA.

Resources