Is there a way to identify where NAs are introduced? - r

Recently went through my fairly large dataset and realized someone decided to use commas in the numbers. Trying to convert it all to numeric. Used a nice little gsub to get rid of those pesky commas, but I'm still finding NAs introduced by coercion. Is there a way to identify, by row and column, where those NAs are being introduced so I can see why that is occurring?
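For concreteness, here is a rough sketch of the kind of check being asked about (the data frame dat and its values are made up for illustration): after stripping commas, compare where as.numeric produces an NA against where the cleaned string was already NA.
dat <- data.frame(x = c("1,000", "2500", "3O0", NA), stringsAsFactors = FALSE)
cleaned <- gsub(",", "", dat$x)    # strip the commas
converted <- as.numeric(cleaned)   # "3O0" (letter O, not zero) still coerces to NA, with a warning
which(is.na(converted) & !is.na(cleaned))   # positions where coercion, not missing data, made the NA
# [1] 3
cleaned[is.na(converted) & !is.na(cleaned)]
# [1] "3O0"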

Use the is.na() function. Consider the following data frame, which contains NA values, as an example:
> df <- data.frame(v1=c(1,2,NA,4), v2=c(NA,6,7,8), v3=c(9,NA,NA,12))
> df
  v1 v2 v3
1  1 NA  9
2  2  6 NA
3 NA  7 NA
4  4  8 12
You can use is.na along with sapply to get the following result:
> sapply(df, function(x) { c(1:length(x))[is.na(x)] })
$v1
[1] 3

$v2
[1] 1

$v3
[1] 2 3
Each column will come back along with the rows where NA values occurred.

I would also use which with arr.ind=TRUE to get the row/column indices (using 'df' from @Tim Biegeleisen's answer):
which(is.na(df), arr.ind=TRUE)
#      row col
# [1,]   3   1
# [2,]   1   2
# [3,]   2   3
# [4,]   3   3
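To connect this back to the coercion problem in the question: keep the character version of the data around, coerce a copy, and use the arr.ind matrix to pull out the raw strings that failed. A small sketch (df_chr and df_num are hypothetical names; the real data will differ):
df_chr <- data.frame(v1 = c("1,0OO", "2", "3"), v2 = c("10", "x", "30"), stringsAsFactors = FALSE)
df_num <- as.data.frame(lapply(df_chr, function(x) as.numeric(gsub(",", "", x))))   # expect coercion warnings
bad <- which(is.na(df_num) & !is.na(df_chr), arr.ind = TRUE)   # row/col of the failed conversions
as.matrix(df_chr)[bad]   # the raw strings that refused to convert
# [1] "1,0OO" "x"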

Related

Convert entire data frame into one long column (vector)

I want to turn the entire content of a numeric (incl. NA's) data frame into one column. What would be the smartest way of achieving the following?
> df <- data.frame(C1=c(1,NA,3), C2=c(4,5,NA), C3=c(NA,8,9))
> df
  C1 C2 C3
1  1  4 NA
2 NA  5  8
3  3 NA  9
> x <- mysterious_operation(df)
> x
[1]  1 NA  3  4  5 NA NA  8  9
I want to calculate the mean of this vector, so ideally I'd want to remove the NA's within the mysterious_operation; the data frame I'm working on is very large, so dropping them early is probably a good idea.
Here's a couple ways with purrr:
# using invoke, a wrapper around do.call
purrr::invoke(c, df, use.names = FALSE)
# similar to unlist, reduce list of lists to a single vector
purrr::flatten_dbl(df)
Both return:
[1] 1 NA 3 4 5 NA NA 8 9
The mysterious operation you are looking for is called unlist:
> df <- data.frame(C1=c(1,NA,3),C2=c(4,5,NA),C3=c(NA,8,9))
> unlist(df, use.names = F)
[1] 1 NA 3 4 5 NA NA 8 9
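Since the stated goal was a mean with the NA's dropped, it all collapses to one line once you have the vector (using the df above):
mean(unlist(df, use.names = FALSE), na.rm = TRUE)
# [1] 5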
We can use unlist and create a single column data.frame
df1 <- data.frame(col =unlist(df))
Just for fun. Of course unlist is the most appropriate function.
alternative
stack(df)[,1]
alternative
do.call(c,df)
do.call(c,c(df,use.names=F)) #unnamed version
Maybe they are more mysterious.

How can I find out the names of columns that satisfy a condition in a data frame

I wish to know (by name) which columns in my data frame satisfy a particular condition. For example, if I was looking for the names of any columns that contained more than 3 NA, how could I proceed?
>frame
  m  n  o  p
1 0 NA NA NA
2 0  2  2  2
3 0 NA NA NA
4 0 NA NA  1
5 0 NA NA NA
6 0  1  2  3
> for (i in frame){
    na <- is.na(i)
    as.numeric(na)
    total <- sum(na)
    if (total > 3){
      print(i)
    }
  }
[1] NA  2 NA NA NA  1
[1] NA  2 NA NA NA  2
So this actually succeeds in evaluating which columns satisfy the condition; however, it does not display the column names. Perhaps subsetting the columns which interest me would be another way to do it, but I'm not sure how to solve it that way either. Plus I'd prefer to know if there's a way to just get the names directly.
I'll appreciate any input.
We can use colSums on a logical matrix (is.na(frame)), check whether it is greater than 3 to get a logical vector and then subset the names of 'frame' based on that.
names(frame)[colSums(is.na(frame))>3]
#[1] "n" "o"
If we are using dplyr, one way is
library(dplyr)
frame %>%
  summarise_each(funs(sum(is.na(.)) > 3)) %>%
  unlist() %>%
  names(.)[.]
#[1] "n" "o"

Fill in-between entries in an ID vector

Looking for a quick-and-easy solution to a problem which I have only been able to solve inelegantly, by looping. I have an ID vector which looks something like this:
id<-c(NA,NA,1,1,1,NA,1,NA,2,2,2,NA,3,NA,3,3,3)
The NA's that fall in-between a sequence of a single number (id[6], id[14]) need to be replaced by that number. However, the NA's that don't meet this condition (those between sequences of two different numbers) need to be left alone (i.e., id[1],id[2],id[8],id[12]). The target vector is therefore:
id.target<-c(NA,NA,1,1,1,1,1,NA,2,2,2,NA,3,3,3,3,3)
This is not difficult to do by looping through each value, but I am looking to do this to many very long vectors, and was hoping for a neater solution. Thanks for any suggestions.
This seems to work. The idea is to use zoo::na.locf in order to fill the NAs and then re-insert NAs where they sit between different numbers:
id.target <- zoo::na.locf(id, na.rm = FALSE)
id.target[(c(diff(id.target), 1L) > 0L) & is.na(id)] <- NA
id.target
## [1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
Here is a base R option
d1 <- do.call(rbind, lapply(split(seq_along(id), id), function(x) {
  i1 <- min(x):max(x)
  data.frame(val = unique(id[x]), i1)
}))
id[seq_along(id) %in% d1$i1 ] <- d1$val
id
#[1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
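Another way to think about it, again with zoo: fill forward and backward separately, and only accept the fill where the two directions agree, so an NA sandwiched between two different numbers stays NA. A sketch using the original id vector from the question:
fwd <- zoo::na.locf(id, na.rm = FALSE)                   # carry the last value forward
bwd <- zoo::na.locf(id, na.rm = FALSE, fromLast = TRUE)  # carry the next value backward
ifelse(!is.na(fwd) & !is.na(bwd) & fwd == bwd, fwd, id)
# [1] NA NA  1  1  1  1  1 NA  2  2  2 NA  3  3  3  3  3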

Incorrect vector length after replacing certain values with NA

I encountered this very strange problem:
Say I make the following data frame
test<-as.data.frame(matrix(c(2,4,5,2,4,6),2,3,byrow=T))
#   V1 V2 V3
# 1  2  4  5
# 2  2  4  6
Then I replace the number 5 in column V3 row 1 with NA:
test$V3[test$V3==5]<-NA
#   V1 V2 V3
# 1  2  4 NA
# 2  2  4  6
Strangely, now the length of the vector of values equal to 6 is incorrect:
length(test$V3[test$V3==6])
# 2
How come the output is 2 instead of 1?
You can take apart the expression to see what's happening:
test$V3==6
# [1] NA TRUE
As you can see, there is an NA value for the missing element. This causes an NA when subsetting test$V3:
test$V3[test$V3==6]
# [1] NA 6
Since this is a vector of length 2, this explains why your code returns 2.
It sounds like you actually want to count the number of elements equal to 6, ignoring missing values. You could do this with:
sum(test$V3 == 6, na.rm=TRUE)
# [1] 1
or
sum(!is.na(test$V3) & test$V3 == 6)
# [1] 1
Besides the two methods offered so far I will offer a couple more. The first one does the NA removal for you, and I find it useful when selecting rows from data.frames where I don't want all the garbage rows that "[" drags along with the NA selections:
> length(which(test$V3 == 6))
[1] 1
> length(subset(test, V3 == 6, V3))
[1] 1
The second one, with two "V3" tokens, might seem a bit redundant until you realize that without that second "V3" you would get all 3 columns in the one-row data frame.
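To see what that second "V3" is doing, compare (same test data frame as above):
subset(test, V3 == 6)           # no select argument: all three columns come back
#   V1 V2 V3
# 2  2  4  6
length(subset(test, V3 == 6))   # length of a data frame = its number of columns
# [1] 3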

Length of columns excluding NA in r

Suppose that I have a data.frame as follows:
   a  b c
1  5 NA 6
2 NA NA 7
3  6  5 8
I would like to find the length of each column, excluding NA's. The answer should look like
a b c
2 1 3
So far, I've tried:
!is.na(dat)          # gives a TRUE/FALSE matrix
length(!is.na(dat))  # 9 -> length of the whole matrix
dim(!is.na(dat))     # 3 x 3 -> dimensions of the matrix
na.omit(dat)         # removes rows with any NA in them
Please tell me how can I get the required answer.
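For reproducibility, the data frame the answers below call dat could be built like this (a guess at the underlying values, reconstructed from the printed table):
dat <- data.frame(a = c(5, NA, 6), b = c(NA, NA, 5), c = c(6, 7, 8))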
A fast, vectorised way:
colSums(!is.na(dat))
a b c
2 1 3
Though the sum is probably a faster solution, I think that length(x[!is.na(x)]) is more readable.
> apply(dat, 2, function(x) length(x[!is.na(x)]))
a b c
2 1 3
I tried NCOL instead of ncol and it worked. For a plain vector (such as a single column pulled out of a data frame or list), nrow() and ncol() return NULL, whereas NROW() and NCOL() treat it as a one-column matrix:
> nrow(tsa$Region)
NULL
> NROW(tsa$Region)
[1] 27457
> ncol(tsa$Region)
NULL
> NCOL(tsa$Region)
[1] 1
