R: replace numbers in a data frame based on their value

I have a data frame with numbers like :
28521 59385 58381
V7220 25050 V7231
I need to replace them based on conditions like:
if the number is bigger than 59380 and smaller than 59390, then code it as 1
delete the values that start with "V"
so that the data frame will look like
28521 1 1
NA 25050 NA
How can I do this quickly for a huge data frame?

x <- c(28521, 59385, 58381, 'V7220', 25050, 'V7231')
as.numeric(ifelse(as.numeric(x) > 59380 & as.numeric(x) < 59390, 1, x))
This will return a warning message about NA values, but if you wrap it with suppressWarnings, you'll get what you want.
> suppressWarnings(as.numeric(ifelse(as.numeric(x) > 59380 & as.numeric(x) < 59390, 1, x)))
[1] 28521 1 58381 NA 25050 NA

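Write a function, then apply it to the columns of the matrix/data.frame after you convert them to numeric, which turns those V entries into NA.
df[] <- lapply(df, as.numeric)
# If you have factor instead of character
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
# Named recode_range here so that it does not mask base::replace
recode_range <- function(x) {
  # strictly between 59380 and 59390, as asked; which() skips the NAs
  x[which(x > 59380 & x < 59390)] <- 1
  return(x)
}
df[] <- lapply(df, recode_range)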
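Putting the pieces of this thread together, a minimal end-to-end sketch might look like the following. The data frame, its column names (a, b, c) and the helper name recode_column are made up for illustration; the conversion warning is suppressed on purpose because the "V" codes are expected to become NA.

```r
# Hypothetical data frame mirroring the values in the question
df <- data.frame(
  a = c("28521", "V7220"),
  b = c("59385", "25050"),
  c = c("58381", "V7231"),
  stringsAsFactors = FALSE
)

# Convert one column to numeric ("V" codes become NA),
# then recode values strictly between 59380 and 59390 as 1
recode_column <- function(x) {
  num <- suppressWarnings(as.numeric(as.character(x)))
  num[which(num > 59380 & num < 59390)] <- 1
  num
}

# df[] keeps the data.frame structure while replacing every column
df[] <- lapply(df, recode_column)
```

Because the work is done one whole column at a time, this should scale reasonably to a large data frame.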

Related

Turn NULL to NA in R

In short: I'm trying to convert all the NULL values in my data set to NA.
Explanation of question
My data set looks like below:
One thing that I noticed though is that when I try to find the number of empty values it shows the number of NA values in my dataset not including the NULL values. I would like to convert the NULL values to NA in order to remove them.
So I counted the number of missing values in my complete dataset then in the columns as
> dim(raw_data)
[1] 80983 16
> # Count missing values in entire data set
> table(is.na(raw_data))
FALSE TRUE
1247232 48496
> # Count na 's column wise
> na_count <-sapply(raw_data, function(y) sum(length(which(is.na(y)))))
> na_count <- data.frame(na_count)
> na_count
na_count
Merchant_Id 1
Tran_Date 1
Military_Time 1
Terminal_Id_Key 1
Amount 1
Card_Amount_Paid 1
Merchant_Name 1
Town 1
Area_Code 1
Client_ID 48481
Age_Band 1
Gender_code 1
Province 1
Avg_Income_3M 1
Value_Spent 1
Number_Spent 1
As you can see it does not show the NULL as NA so I tried to convert it as:
> # Turn Null to NA
> temp_data <- raw_data
>
> temp_data[temp_data == ''] = NA
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
I also tried
> # Turn Null to NA
> temp_data <- raw_data
> temp_data[temp_data == 'NULL'] = NA
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
But I am getting the error above. This was followed by the last one below (which was better because I did not have an error but I still got NULL values in my data set).
> raw_data[is.null(raw_data)] <- NA
> table(is.na(raw_data))
FALSE TRUE
1247232 48496
Could you perhaps suggest ways to deal with this error?
I also tried to get rid of the date and got this different error when I once again tried to remove the NULL values:
> df <- raw_data
>
> df1 <- transform(df, date = as.Date(df$Tran_Date), time = format(df$Tran_Date, "%T"))
>
> df1[df1 == NULL] = NA
Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn, :
length of 'dimnames' [2] not equal to array extent
This solved my issue: instead of changing the NULL values to NA afterwards, I imported them as NA when reading the data in from the GitHub account.
I added
na = c("", "NA", "NULL")
to my import call. That is the argument name for read_tsv from the readr package; the equivalent argument of read.table is na.strings. This did the trick and changed my NULL values to NA.
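As a rough sketch of both routes (the inline text and the tiny data frames below are made up; only the column names are borrowed from the question): the first part marks "NULL" and empty fields as NA at import time with base read.table, the second part fixes an already loaded data frame column by column. Restricting the replacement to character columns avoids comparing the date-time column with a string, which is what the as.POSIXlt error above appears to come from. Note also that is.null() tests whether the whole object is NULL, so it never matches the text "NULL" inside a column.

```r
# Route 1: declare the NA markers while importing (base R uses na.strings)
raw_text <- paste0(
  "Merchant_Id\tClient_ID\tTown\n",
  "m1\tc101\tA\n",
  "m2\tNULL\tB\n",
  "m3\t\tC\n"
)
imported <- read.table(text = raw_text, header = TRUE, sep = "\t",
                       na.strings = c("", "NA", "NULL"),
                       stringsAsFactors = FALSE)

# Route 2: clean a data frame that is already in memory
loaded <- data.frame(
  Tran_Date = as.POSIXct(c("2017-01-01 10:00:00", "2017-01-02 11:30:00"),
                         tz = "UTC"),
  Client_ID = c("NULL", "c7"),
  stringsAsFactors = FALSE
)

# Only touch character columns, so date-time columns are left alone
is_text <- vapply(loaded, is.character, logical(1))
loaded[is_text] <- lapply(loaded[is_text], function(col) {
  col[col %in% c("", "NULL")] <- NA
  col
})
```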

Subset NAs between values

I would like to know how I could only subset NAs excluding those that are on the extremes of a vector.
For instance,
vector <- c(NA,NA,1,3,5,NA,3,NA,7,NA,NA,NA)
How could I only subset the NAs vector[6] and vector[8]?
Thank you very much for your help!
One way to get indices which are not on extremes is
non_NA_inds <- which(!is.na(vector))
NA_inds <- which(is.na(vector))
NA_inds[NA_inds > min(non_NA_inds) & NA_inds < max(non_NA_inds)]
#[1] 6 8
You can try the following code
idx <- which(!is.na(vector))
res <- setdiff(min(idx):max(idx),idx)
which gives:
> res
[1] 6 8
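The setdiff idea can be wrapped in a small helper that also copes with a vector containing nothing but NAs, where min() and max() would otherwise be called on an empty set. The function name interior_na is an assumption made for this sketch.

```r
# Return the positions of NAs that lie strictly between the first and
# last non-NA value of a vector
interior_na <- function(v) {
  observed <- which(!is.na(v))
  # No observed values at all means there is no "inside" to speak of
  if (length(observed) == 0) {
    return(integer(0))
  }
  setdiff(seq(min(observed), max(observed)), observed)
}

vector <- c(NA, NA, 1, 3, 5, NA, 3, NA, 7, NA, NA, NA)
inner <- interior_na(vector)   # positions 6 and 8
vector[inner]                  # the interior NAs themselves
```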

Finding the unique identifier for a row with NA in a particular column in R

I have data in the following format:
ID Species Side_of_boat
1 spA Port
2 spB Starboard
3 spA NA
I would like to write a line of code that gives me the unique ID for all rows that have NA in 'side of boat'.
I have tried:
unique(df$ID[df$side_of_boat == "NA"])
But it doesn't give me the output I want. I would like the output to be:
"3"
Thanks!
Try
unique(df$ID[is.na(df$Side_of_boat)])
instead. NA is a special value in R and also has its own special function is.na() to test if an entry is NA. Check ?NA for more information.
#Method 1
n <- which(is.na(df$Side_of_boat))
You can also use lapply to get the NA positions of every column as text, e.g.
lapply(lapply(df, function(x) which(is.na(x))), paste, collapse=", ")
#Method 2
new_DF <- subset(df, is.na(Side_of_boat))
#Method 3
You could also write a function to do this for you:
getNa <- function(dfrm) lapply(dfrm, function(x) which(is.na(x) ) )
#Note
In case you have "NA" character values, first run
df$Side_of_boat[df$Side_of_boat=='NA'] <- NA
Try:
df$ID[which(is.na(df$Side_of_boat))]
It should give you a vector of the IDs, regardless of whether they are numbers or characters.
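To see why the == "NA" attempt does not work, here is a small reproducible sketch built from the three rows in the question (the data frame is reconstructed by hand, so the column types are an assumption). Comparing a missing value with anything yields NA rather than TRUE, and indexing with NA returns NA.

```r
df <- data.frame(
  ID = 1:3,
  Species = c("spA", "spB", "spA"),
  Side_of_boat = c("Port", "Starboard", NA),
  stringsAsFactors = FALSE
)

# The comparison is NA for the missing row, never TRUE
cmp <- df$Side_of_boat == "NA"

# Indexing with that result gives NA instead of the ID
wrong <- unique(df$ID[cmp])

# is.na() tests for missingness directly
ids <- unique(df$ID[is.na(df$Side_of_boat)])
```

Here ids holds the ID 3, while wrong holds a single NA.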

Trimming NAs based on column subset - a more elegant solution?

A New Year's quandary for the stackoverflow community which has been quite the help by reading posts and answers in the past (this is my first question). I've found a work around, but I'm wondering if other approaches/solutions might be suggested.
I am attempting to remove trailing NA's from a large data.frame, but those NA's are only found in a few of the columns of the data.frame and I would like to retain all columns in the output. Here is a representative data subset.
df=data.frame(var1=rep("A", 8), var2=c("a","b","c","d","e","f","g","h"), var3=c(0,1,NA,2,3,NA,NA,NA), var4=c(0,0,NA,4,5,NA,NA,NA), var5=c(0,0,NA,0,2,4,NA,NA))
Goals of the process:
Trim trailing NAs based on NA presence in var3,var4 and var5
Retain all columns in final output
Only remove trailing NAs (i.e. row 3 remains in record as a placeholder)
Only trim if all columns have an NA (i.e. row 7 and 8, but not row 6)
Based on these goals, the solution should remove the last two rows of df:
df.output = df[-c(7,8),]
The behaviour of na.trim (in the zoo package) is ideal (as it limits removal to those NA's at the end of the data.frame, with sides="right"), and my work-around involved altering the na.trim.default function to include a subset term.
Any suggestions? Many thanks for any help.
EDIT: Just to complete this question, below is the function I created from the na.trim.default code which also works, but as noted, does require loading the zoo package.
na.trim.multiplecols <- function (object, colrange, sides = c("both", "left", "right"), is.na = c("any","all"),...)
{
is.na <- match.arg(is.na)
nisna <- if (is.na == "any" || length(dim(object[,colrange])) < 1) {
complete.cases(object[,colrange])
}
else rowSums(!is.na(object[,colrange])) > 0
idx <- switch(match.arg(sides), left = cumsum(nisna) > 0,
right = rev(cumsum(rev(nisna) > 0) > 0), both = (cumsum(nisna) >
0) & rev(cumsum(rev(nisna)) > 0))
if (length(dim(object)) < 2)
object[idx]
else object[idx, , drop = FALSE]
}
Something based on max(which(!is.na())) will work. We use this to find the largest index of non-missing data from the columns of interest.
Using your df
ind <- max(max(which(!is.na(df$var3))),
max(which(!is.na(df$var4))),
max(which(!is.na(df$var5))))
df[1:ind, ]
var1 var2 var3 var4 var5
1 A a 0 0 0
2 A b 1 0 0
3 A c NA NA NA
4 A d 2 4 0
5 A e 3 5 2
6 A f NA NA 4
Edit: First solution using base rle and apply
t <- rle(apply(as.matrix(df[,3:5]), 1, function(x) all(is.na(x))))
r <- ifelse(t$values[length(t$values)] == TRUE, t$lengths[length(t$lengths)], 0)
# head(df, -r) would return zero rows when r is 0, so count the rows to keep instead
head(df, nrow(df) - r)
Second solution using Rle from package IRanges:
require(IRanges)
t <- min(sapply(df[,3:5], function(x) {
o <- Rle(x)
val <- runValue(o)
if (is.na(val[length(val)])) {
len <- runLength(o)
out <- len[length(len)]
} else {
out <- 0
}
}))
# As above, avoid head(df, -t), which returns zero rows when t is 0
head(df, nrow(df) - t)
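For comparison, a base R sketch without rle or extra packages: mark the rows that have at least one non-missing value in the columns of interest and keep everything up to the last such row. The names cols, has_data, last_row and trimmed are made up for this example; the data frame is the one from the question.

```r
df <- data.frame(
  var1 = rep("A", 8),
  var2 = c("a", "b", "c", "d", "e", "f", "g", "h"),
  var3 = c(0, 1, NA, 2, 3, NA, NA, NA),
  var4 = c(0, 0, NA, 4, 5, NA, NA, NA),
  var5 = c(0, 0, NA, 0, 2, 4, NA, NA)
)

cols <- c("var3", "var4", "var5")

# TRUE for rows with at least one non-NA value in the chosen columns
has_data <- rowSums(!is.na(df[, cols, drop = FALSE])) > 0

# Last row worth keeping; 0 if the chosen columns are entirely NA
last_row <- if (any(has_data)) max(which(has_data)) else 0

# seq_len() handles last_row == 0 safely, unlike 1:last_row
trimmed <- df[seq_len(last_row), , drop = FALSE]
```

Row 3 (all NA but not trailing) and row 6 (only partly NA) are kept, while rows 7 and 8 are dropped.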

How to show indexes of NAs?

I have the piece to display NAs, but I can't figure it out.
try(na.fail(x))
> Error in na.fail.default(x) : missing values in object
# display NAs
myvector[is.na(x)]
# returns
NA NA NA NA
The only thing I get from this is the length of the NA vector, which is not too helpful when the NAs were caused by a bug in my code that I am trying to track down. How can I get the index of the NA element(s)?
I also tried:
subset(x,is.na(x))
which has the same effect.
EDIT:
y <- complete.cases(x)
x[!y]
# just returns another
NA NA NA NA
You want the which function:
which(is.na(arr))
is.na() will return a boolean index of the same shape as the original data frame.
In other words, any cells in that m x n index with the value TRUE correspond to NA values in the original data frame.
You can then use this to change the NAs, if you wish:
DF[is.na(DF)] = 999
To get the total number of data rows with at least one NA:
cc = complete.cases(DF)
num_missing = nrow(DF) - sum(cc)
which(Dataset$variable=="") will return the row numbers of the empty strings in a particular column
R code using a loop and a condition:
# Testing for missing values
y <- c(1,NA,3,NA)
is.na(y) # TRUE where y is missing
# returns a vector (F T F T)
# Print the index of NA values
for(i in seq_along(y)) {
  if(is.na(y[i])) {
    cat(i, ' ')
  }
}
Output is:
2  4
Also :
which(is.na(y))
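For a data frame or matrix, which() can also report row and column positions through its arr.ind argument, which tends to be more useful for tracking down a bug than a flat index. A small sketch with a made-up DF:

```r
DF <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, "x", "y"),
  stringsAsFactors = FALSE
)

# One line per NA cell, with its row and column number
na_pos <- which(is.na(DF), arr.ind = TRUE)

# Row numbers that contain at least one NA
na_rows <- which(!complete.cases(DF))

# Column names that contain at least one NA
na_cols <- names(DF)[colSums(is.na(DF)) > 0]
```

Here na_pos has two rows: the NA in row 2 of column 1 and the NA in row 1 of column 2.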
