I'm trying to convert all the NULL values in my dataset to NA.
Explanation of question
My data set looks like below:
One thing I noticed is that when I count the empty values, the result covers only the NA values in my dataset and does not include the NULL values. I would like to convert the NULL values to NA so that I can remove them.
So I counted the missing values in the complete dataset and then per column, as follows:
> dim(raw_data)
[1] 80983 16
> # Count missing values in entire data set
> table(is.na(raw_data))
FALSE TRUE
1247232 48496
> # Count na 's column wise
> na_count <-sapply(raw_data, function(y) sum(length(which(is.na(y)))))
> na_count <- data.frame(na_count)
> na_count
na_count
Merchant_Id 1
Tran_Date 1
Military_Time 1
Terminal_Id_Key 1
Amount 1
Card_Amount_Paid 1
Merchant_Name 1
Town 1
Area_Code 1
Client_ID 48481
Age_Band 1
Gender_code 1
Province 1
Avg_Income_3M 1
Value_Spent 1
Number_Spent 1
As you can see it does not show the NULL as NA so I tried to convert it as:
> # Turn Null to NA
> temp_data <- raw_data
>
> temp_data[temp_data == ''] = NA
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
I also tried
> # Turn Null to NA
> temp_data <- raw_data
> temp_data[temp_data == 'NULL'] = NA
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
But I am getting the error above. This was followed by the last one below (which was better because I did not have an error but I still got NULL values in my data set).
> raw_data[is.null(raw_data)] <- NA
> table(is.na(raw_data))
FALSE TRUE
1247232 48496
Could you perhaps suggest ways to deal with this error?
I also tried to get rid of the date and got this different error when I once again tried to remove the NULL values:
> df <- raw_data
>
> df1 <- transform(df, date = as.Date(df$Tran_Date), time = format(df$Tran_Date, "%T"))
>
> df1[df1 == NULL] = NA
Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn, :
length of 'dimnames' [2] not equal to array extent
This solved my issue. Instead of converting the NULL values to NA after loading, I imported them from the GitHub account as NA values.
I added
na = c("","NA","NULL",NULL)
to the read_tsv call from the readr package (the equivalent argument for base read.table is na.strings). This did the trick and changed my NULL values to NA.
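For base R, the same idea can be sketched with the na.strings argument of read.table. The inline sample text and its column values below are made up for illustration and only stand in for the real file.

```r
# Hypothetical tab-separated sample standing in for the real file
txt <- "Merchant_Id\tClient_ID\tTown\n1\tNULL\tDurban\n2\t17\t\n3\tNULL\tNULL"

# Treat empty fields, "NA" and the literal string "NULL" as missing on import
dat <- read.table(text = txt, header = TRUE, sep = "\t",
                  na.strings = c("", "NA", "NULL"),
                  stringsAsFactors = FALSE)

# Count missing values per column; the former "NULL" strings are now counted
colSums(is.na(dat))
```

Because the conversion happens at import time, no comparison against date-time columns is needed afterwards, which is presumably what triggered the as.POSIXlt error in the attempts above.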
I have a dataframe, which includes a corrupt row with NAs and "". I cannot remove this from the .csv file I am importing into R since Excel cannot deal with (opening) the size of the .csv document.
I do a check when I first read.csv() like below to remove the row with NA:
if ( any( is.na(unique(data$A)) ) ){
  print("WARNING: data has a corrupt row in it!")
  data <- data[ !is.na(data$A) , ]
}
However, as if it were a factor, the A column remembers NA as a level:
> summary(data$A)
Mode FALSE TRUE NA's
logical 185692 36978 0
This obviously causes issues when I am trying to fit a linear model. How can I get rid of the NA as a logical level here?
I tried this, but it doesn't seem to work:
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
Mode FALSE TRUE NA's
logical 185692 36978 0
unique(A)
[1] FALSE TRUE
First, your data$A is not a factor, it's a logical. The summary print methods are not the same for factors and logicals. Logicals use summary.default while factors dispatch to summary.factor. Plus it tells you in the result that the variable is a logical.
fac <- factor(c(NA, letters[1:4]))
log <- c(NA, logical(4), !logical(2))
summary(fac)
# a b c d NA's
# 1 1 1 1 1
summary(log)
# Mode FALSE TRUE NA's
# logical 4 2 1
See ?summary for the differences.
Second, your call
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
is also calling summary.default because you wrapped droplevels with as.logical (why?). So don't change data_combine$A at all, and just try
summary(data_combine$A)
and see how that goes. For more information, please provide a sample of your data.
As mentioned in my other answer, those actually are not factor levels. Since you asked how to remove the NA printing on summary, I'm undeleting this answer.
The NA printing is hard-coded into a summary for a logical vector. Here's the relevant code from summary.default.
# value <- if (is.logical(object))
# c(Mode = "logical", {
# tb <- table(object, exclude = NULL)
# if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
# dimnames(tb)[[1L]][iN] <- "NA's"
# tb
# })
The exclude = NULL in table is the problem. If we look at the exclude argument in table with a logical vector log, we can see that when it is NULL the NAs always print out.
log <- c(NA, logical(4), NA, !logical(2), NA)
table(log, exclude = NULL) ## with NA values
# log
# FALSE TRUE <NA>
# 4 2 3
table(log[!is.na(log)], exclude = NULL) ## NA values removed
#
# FALSE TRUE <NA>
# 4 2 0
To make your summary print the way you want it, we can write a summary method based on the original source code.
summary.logvec <- function(object, exclude = NA) {
  stopifnot(is.logical(object))
  value <- c(Mode = "logical", {
    tb <- table(object, exclude = exclude)
    if(is.null(exclude)) {
      if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
        dimnames(tb)[[1L]][iN] <- "NA's"
    }
    tb
  })
  class(value) <- c("summaryDefault", "table")
  print.summary.logvec <- function(x) {
    UseMethod("print.summaryDefault")
  }
  value
}
And then here are the results. Since exclude defaults to NA in our summary method, the NAs will not print unless we set it to NULL.
summary(log) ## original vector
# Mode FALSE TRUE NA's
# logical 4 2 3
class(log) <- "logvec"
summary(log, exclude = NULL) ## prints NA when exclude = NULL
# Mode FALSE TRUE NA's
# logical 4 2 3
summary(log) ## NA's don't print
# Mode FALSE TRUE
# logical 4 2
Now that I've done all this I'm wondering if you have tried to run your linear model.
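As a rough check of that point, here is a minimal sketch with made-up data (the data frame d and its columns are hypothetical). It assumes the default setting options(na.action = "na.omit"), under which lm drops incomplete rows and treats a logical predictor as a two-level variable.

```r
set.seed(1)
# Hypothetical data: a logical predictor A with one missing value
d <- data.frame(y = rnorm(6),
                A = c(TRUE, FALSE, NA, TRUE, FALSE, TRUE))

fit <- lm(y ~ A, data = d)

nobs(fit)          # number of rows actually used in the fit
names(coef(fit))   # coefficient names; there is no separate NA term
```

If this holds for the real data, the NA's entry shown by summary is only a display detail and does not add a level to the model.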
This is what my data looks like
> d[1,]
Date sulfate nitrate ID
1 2003-01-01 NA NA 1
>
Total observations
> dim(d)
[1] 772087 4
I want to get the rows where ID is in range 70:72 (this is coming from parameter)
What I do
d[d$ID==(70:71),]
What I get back is
Warning message:
In d$ID == (70:71) :
longer object length is not a multiple of shorter object length
Run d[d$ID %in% 70:71, ] to subset your data frame.
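The difference between the two operators can be sketched on a small made-up data frame (the values below are hypothetical). With ==, the shorter vector 70:71 is recycled and compared position by position, so matching rows can be missed; %in% tests each ID for membership in the set.

```r
# Hypothetical stand-in for the real data frame d
d <- data.frame(ID = c(70, 71, 71, 70, 72),
                nitrate = c(0.4, NA, 1.1, 0.9, 2.3))
wanted <- 70:71

# == recycles 70:71 as 70, 71, 70, 71, 70 and compares element by element
eq_rows <- suppressWarnings(d$ID == wanted)

# %in% asks, for every ID, whether it occurs anywhere in wanted
in_rows <- d$ID %in% wanted

which(eq_rows)   # rows found by ==
which(in_rows)   # rows found by %in%
d[in_rows, ]
```

Here == only finds rows 1 and 2, while %in% finds rows 1 to 4.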
I have a data frame with numbers like :
28521 59385 58381
V7220 25050 V7231
I need to replace them based on conditions like:
if the number is bigger than 59380 and smaller than 59390, then code it as 1
delete values that start with "V"
so the data frame will look like
28521 1 1
NA 25050 NA
How can I do this quickly for a huge data frame?
x <- c(28521, 59385, 58381, 'V7220', 25050, 'V7231')
as.numeric(ifelse(as.numeric(x) > 59380 & as.numeric(x) < 59390, 1, x))
This will return a warning message about NA values, but if you wrap it with suppressWarnings, you'll get what you want.
> suppressWarnings(as.numeric(ifelse(as.numeric(x) > 59380 & as.numeric(x) < 59390, 1, x)))
[1] 28521 1 58381 NA 25050 NA
Write a function then apply it to the columns of the matrix/data.frame after you convert to numeric to get rid of those V entries.
sapply(df,as.numeric)
# If you have factor instead of character
sapply(df,function(x) as.numeric(as.character(x)))
replace <- function(x) {
  # strictly between 59380 and 59390, as stated in the question
  x[x > 59380 & x < 59390] <- 1
  return(x)
}
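Putting the pieces together, a minimal sketch might look like the following. The data frame and the helper name recode_column are made up for illustration, and the bounds are taken as strict because the question says "bigger than" and "smaller than".

```r
# Hypothetical data frame mirroring the example in the question
df <- data.frame(a = c("28521", "V7220"),
                 b = c("59385", "25050"),
                 c = c("58381", "V7231"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: convert to numeric (V-codes become NA),
# then recode values strictly inside (lower, upper) as 1
recode_column <- function(x, lower = 59380, upper = 59390) {
  x <- suppressWarnings(as.numeric(as.character(x)))
  x[!is.na(x) & x > lower & x < upper] <- 1
  x
}

# df[] <- lapply(...) keeps df a data frame instead of returning a matrix
df[] <- lapply(df, recode_column)
df
```

The work is done column by column on whole vectors, so it should scale reasonably to a large data frame.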
I'm trying to write a table from a SQLite database into an R data frame and have hit upon a problem that has me stumped. Here are the three first entries in the SQLite table I would like to import:
1|10|0|0|0|0|10|10|0|0|0|6|8|6|20000|30000|2012-02-29 21:27:07.239091|2012-02-29 21:28:24.815385|6|80.67.28.161|||||||||||||||||||||||||||||||33|13.4936||t|t|f||||||||||||||||||4|0|0|7|7|2
2|10|0|0|0|0|0|0|0|2|2|4|5|4|20000|30000|2012-02-29 22:00:30.618726|2012-02-29 22:04:09.629942|5|80.67.28.161|3|7||0|1|3|0|||4|3|4|5|5|5|5|4|5|4|4|0|0|0|0|0|9|9|9|9|9|||1|f|t|f|||||||||||||k|text|l|||-13|0|3|10||2
3|13|2|4|4|4|4|1|1|2|5|6|3|2|40000|10000|2012-03-01 09:07:52.310033|2012-03-01 09:21:13.097303|6|80.67.28.161|2|2||30|1|1|0|||4|2|1|6|8|3|5|6|6|7|6|||||||||||26|13.6336|4|f|t|f|t|f|f|f|f|||||||||some text||||10|1|1|3|2|3
What I'm interested in are columns 53 through 60, which, to save you the trouble of counting in the above, look like this:
|t|t|f||||||
|f|t|f||||||
|f|t|f|t|f|f|f|f|
As you can see for the first two entries only the first three of those columns are not NULL while for the third entry all eight columns have values assigned to them.
Here's the SQLite table info for those columns
sqlite> PRAGMA table_info(observations);
0|id|INTEGER|1||1
** snip **
53|understanding1|boolean|0||0
54|understanding2|boolean|0||0
55|understanding3|boolean|0||0
56|understanding4|boolean|0||0
57|understanding5|boolean|0||0
58|understanding6|boolean|0||0
59|understanding7|boolean|0||0
60|understanding8|boolean|0||0
** snip **
Now, when I try to read this into R here's what those same columns end up becoming:
> library('RSQLite')
> con <- dbConnect("SQLite", dbname = 'db.sqlite3')
> obs <- dbReadTable(con,'observations')
> obs[1:3,names(obs) %in% paste0('understanding',1:8)]
understanding1 understanding2 understanding3 understanding4 understanding5 understanding6 understanding7
1 t t f NA NA NA NA
2 f t f NA NA NA NA
3 f t f 0 0 0 0
understanding8
1 NA
2 NA
3 0
As you can see, while the first three columns contain values that are either 't' or 'f' the other columns are NA where the corresponding values in the SQLite table are NULL and 0 where they are not - irrespective of whether the corresponding values in the SQLite table are t or f. Needless to say this is not the behavior I expected. The problem is, I think, that these columns are typecasted incorrectly:
> sapply(obs[1:3,names(obs) %in% paste0('understanding',1:8)], class)
understanding1 understanding2 understanding3 understanding4 understanding5 understanding6 understanding7
"character" "character" "character" "numeric" "numeric" "numeric" "numeric"
understanding8
"numeric"
Could it be that RSQLite sets the first three columns to the character type upon seeing t and f as values in the corresponding columns in the first entry but goes with numeric because in these columns the first entry just happens to be NULL?
If this is indeed what's happening is there any way of working around this and casting all these columns into character (or, even better, logical)?
The following is hacky, but it works:
# first make a copy of the DB and work with it instead of changing
# data in the original
original_file <- "db.sqlite3"
copy_file <- "db_copy.sqlite3"
file.copy(original_file, copy_file) # duplicate the file
con <- dbConnect("SQLite", dbname = copy_file) # establish a connection to the copied DB
# put together a query to replace all NULLs by 'NA' and run it
columns <- c(paste0('understanding',1:15))
columns_query <- paste(paste0(columns,' = IfNull(',columns,",'NA')"),collapse=",")
query <- paste0("UPDATE observations SET ",columns_query)
dbSendQuery(con, query)
# Now that all columns have string values RSQLite will infer the
# column type to be `character`
df <- dbReadTable(con,'observations') # read the table
file.remove(copy_file) # delete the copy
# replace all 'NA' strings with proper NAs
df[names(df) %in% paste0('understanding',1:15)][df[names(df) %in% paste0('understanding',1:15)] == 'NA'] <- NA
# convert 't' to boolean TRUE and 'f' to boolean FALSE
df[ ,names(df) %in% paste0('understanding',1:15)] <- sapply( df[ ,names(df) %in% paste0('understanding',1:15)], function(x) {x=="t"} )
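The last two steps can also be sketched per column with lapply, which avoids repeating the column selection and keeps the result a data frame rather than going through a matrix. The small data frame below is a made-up stand-in for the result of dbReadTable, so no database is needed to try it.

```r
# Hypothetical stand-in for the table read from the copied database
df <- data.frame(id = 1:3,
                 understanding1 = c("t", "f", "f"),
                 understanding4 = c("NA", "NA", "t"),
                 stringsAsFactors = FALSE)

# Select the understanding* columns once
cols <- grep("^understanding[0-9]+$", names(df))

# Turn 'NA' strings into real NAs, then map 't' to TRUE and 'f' to FALSE
df[cols] <- lapply(df[cols], function(x) {
  x[x == "NA"] <- NA
  x == "t"
})

sapply(df[cols], class)
```

Comparing NA with "t" yields NA, so missing entries stay missing after the conversion to logical.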
I have the piece to display NAs, but I can't figure it out.
try(na.fail(x))
> Error in na.fail.default(x) : missing values in object
# display NAs
myvector[is.na(x)]
# returns
NA NA NA NA
The only thing I get from this is the length of the NA vector, which is not very helpful when the NAs were caused by a bug in my code that I am trying to track down. How can I get the index of the NA element(s)?
I also tried:
subset(x,is.na(x))
which has the same effect.
EDIT:
y <- complete.cases(x)
x[!y]
# just returns another
NA NA NA NA
You want the which function:
which(is.na(arr))
is.na() will return a boolean index of the same shape as the original data frame.
In other words, any cells in that m x n index with the value TRUE correspond to NA values in the original data frame.
You can then use this to change the NAs, if you wish:
DF[is.na(DF)] = 999
To get the total number of data rows with at least one NA:
cc = complete.cases(DF)
num_missing = nrow(DF) - sum(cc)
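To get positions rather than counts, which can be applied to the logical result of is.na. The sketch below uses made-up objects x and DF. For a data frame, arr.ind = TRUE returns the row and column of each NA.

```r
# Vector case: positions of the NA elements
x <- c(3, NA, 7, NA)
which(is.na(x))

# Data frame case (hypothetical data)
DF <- data.frame(a = c(1, NA, 3),
                 b = c("u", "v", NA),
                 stringsAsFactors = FALSE)

# One line per NA cell, with its row and column index
na_pos <- which(is.na(DF), arr.ind = TRUE)
na_pos

# Row numbers that contain at least one NA
which(!complete.cases(DF))
```

In this example the vector has NAs at positions 2 and 4, and the data frame has NAs in row 2 of column 1 and row 3 of column 2.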
which(Dataset$variable=="") will return the corresponding row numbers in a particular column
R Code using loop and condition :
# Testing for missing values
is.na(x) # returns TRUE if x is missing
y <- c(1,NA,3,NA)
is.na(y)
# returns a logical vector (FALSE TRUE FALSE TRUE)
# Print the index of NA values
for(i in seq_along(y)) {
  if(is.na(y[i])) {
    cat(i, ' ')
  }
}
Output is:
2 4
Also :
which(is.na(y))