Unexpected row(s) of NAs when selecting subset of dataframe - r

When selecting a subset of data from a dataframe, I get row(s) entirely made up of NA values that were not present in the original dataframe. For example:
example.df[example.df$census_tract == 27702, ]
returns:
census_tract number_households_est
NA NA NA
23611 27702 2864
Where did that first row of NAs come from? And why is it returned even though example.df$census_tract != 27702 for that row?

That is because there is a missing observation
> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
census_tract number_households_est
64 NA NA
When == evaluates the 64th row it returns NA, because there is no way to know whether 27702 is equal to a missing value, so the result of the comparison is itself missing (NA). That NA ends up in the logical vector used for indexing, and indexing a row with NA returns a row full of NAs: we are asking for a row, but "we don't know which one".
One way to avoid this is %in%, which returns FALSE rather than NA for missing values:
> example.df[example.df$census_tract %in% 27702, ]
census_tract number_households_est
23611 27702 2864
HTH, Luca
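The behaviour described in this answer can be reproduced on a small made-up data frame. This is only a sketch: the data and the name toy.df are invented for illustration and are not the asker's data. Besides %in%, it also shows which() and an explicit is.na() check as alternatives.

```r
# Toy data (made up): one row has a missing census_tract
toy.df <- data.frame(
  census_tract = c(27701, NA, 27702),
  number_households_est = c(1500, NA, 2864)
)

# == propagates NA, so the logical index itself contains an NA
idx <- toy.df$census_tract == 27702
print(idx)

# An NA in a logical row index yields a row filled with NA
with_na_row <- toy.df[idx, ]

# Three ways to drop the spurious row
by_in    <- toy.df[toy.df$census_tract %in% 27702, ]  # %in% never returns NA
by_which <- toy.df[which(idx), ]                      # which() ignores NA
by_isna  <- toy.df[!is.na(idx) & idx, ]               # explicit NA check
```

All three variants return only the row whose census_tract is 27702.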

Related

Set character column of dataframe to missing if length<n

Very new to R here.
I have a dataframe with a character column "col1":
col1 <- c("org","blorg","forg","chorg","horg","blorg","horg","phthorg")
col2 <- c("a","b","c","d","a","b","e","f")
df<-data.frame(col1, col2)
I would like to set the values with fewer than 5 characters to missing so I end up with:
c(NA,"blorg",NA,"chorg",NA,"blorg",NA,"phthorg")
I have tried the following:
if(nchar(as.character(df$col1))<5) {df$col1<-NA}
but I get the error "the condition has length > 1".
nchar is vectorized, so it can be applied to the whole column directly. if/else expects a single TRUE/FALSE as its condition, which is why it fails here. The vectorized ifelse, replace, or case_when would also work, but they are not needed, since we only have to replace the values that meet the condition with NA:
df$col1[nchar(df$col1) <5] <- NA
-output
> df$col1
[1] NA "blorg" NA "chorg" NA "blorg" NA "phthorg"
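If col1 is stored as a factor (the default for data.frame() before R 4.0, and presumably why the question uses as.character), nchar() refuses it, so the column has to be converted first. A minimal sketch of that case, where stringsAsFactors = TRUE is set only to force the factor situation:

```r
col1 <- c("org", "blorg", "forg", "chorg", "horg", "blorg", "horg", "phthorg")
col2 <- c("a", "b", "c", "d", "a", "b", "e", "f")

# Force factor columns to mimic the pre-4.0 default (assumption for illustration)
df <- data.frame(col1, col2, stringsAsFactors = TRUE)

# Convert to character, then blank out the short values in one vectorized step
chars <- as.character(df$col1)
chars[nchar(chars) < 5] <- NA
df$col1 <- chars
```

After this, df$col1 is a character column with NA in place of every value shorter than five characters.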

doing for loop in R

I have a file in which I have filtered my SNPs for LD (my.filtered.snp.id in the example below). I want to keep only these SNPs in my genotype matrix (geno_snp). I am trying to write a for loop in R, and I would appreciate any help fixing my code. I want to keep those lines (the whole line, including snp.id and genotype information) in the genotype matrix where snp.id matches a snp.id in my.filtered.snp.id, and delete those that do not match.
head(my.filtered.snp.id)
Chr10_31458
Chr10_31524
Chr10_45901
Chr10_102754
Chr10_102828
Chr10_103480
head (geno_snp)
XRQChr10_103805 NA NA NA 0 NA 0 NA NA NA NA NA 0 0
XRQChr10_103937 NA NA NA 0 NA 1 NA NA NA NA NA 0 2
XRQChr10_103990 NA NA NA 0 NA 0 NA NA NA NA NA 0 NA
I am trying something like this:
for (i in 1:length(geno_snp[,1])){
for (j in 1:length(my.filtered.snp.id)){
if geno_snp[i,] == my.filtered.snp.i[j]
print (the whole line in geno_snp)
}
else (remove the line)
}
If I understood it correctly, you want a subset of your data.frame geno_snp in which the row names must match the selected SNP IDs from the vector my.filtered.snp.id.
Please check if this solution works for you:
index <- which(sapply(row.names(geno_snp), function(x) any(grepl(pattern = x, x = my.filtered.snp.id, fixed = TRUE))))
selected_subset <- geno_snp[index,]
What I did was to create an index addressing the rows whose names match any value in my.filtered.snp.id, and then use that index to subset the data frame. sapply applies grepl to every row name and returns one TRUE/FALSE per row; which turns that logical vector into row positions.
EDIT:
I noticed you had some row.names that weren't an exact match with your original my.filtered.snp.id values. In this case, maybe what you wanna do is:
index <- unlist(sapply(my.filtered.snp.id, function(x) grep(pattern = x, x = row.names(geno_snp))))
selected_subset <- geno_snp[index,]
The thing is that you have row.names beginning with XRQ..., so in this last case the code uses the reference values from my.filtered.snp.id to detect matches in row.names(geno_snp), even if there is this XRQ string at the beginning.
Finally, in the case I have misunderstood your data and what I'm calling row names here are, in fact, data in a column (the SNP IDs), just use geno_snp[,1] instead of row.names(geno_snp) in both codes above.
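Because grep does partial matching, an ID such as Chr10_103805 would also match a longer row name that merely contains it. If exact matches are wanted, one alternative is to strip the prefix and use %in%. The sketch below assumes the prefix is always the literal string XRQ and uses made-up data; the column names s1 and s2 are hypothetical.

```r
# Made-up data for illustration
my.filtered.snp.id <- c("Chr10_31458", "Chr10_103805", "Chr10_103990")
geno_snp <- data.frame(
  s1 = c(0, 1, 0, 2),
  s2 = c(NA, 0, 2, 1),
  row.names = c("XRQChr10_103805", "XRQChr10_103937",
                "XRQChr10_103990", "XRQChr10_1038050")
)

# Remove the leading "XRQ" (assumed prefix) so the names are comparable
ids <- sub("^XRQ", "", row.names(geno_snp))

# Exact matching: TRUE only where the whole ID is in the filtered list
keep <- ids %in% my.filtered.snp.id
selected_subset <- geno_snp[keep, , drop = FALSE]
```

Here the row XRQChr10_1038050 is left out, whereas a grep on Chr10_103805 would have picked it up as well.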

Function to change blanks to NA

I'm trying to write a function that turns empty strings into NA. A summary of one of my columns looks like this:
      a   b 
 12 210 468 
I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to change empty values to NA, so I borrowed some stuff from here and there to come up with this:
# change nulls to NAs
nullToNA <- function(df){
# split df into numeric & non-numeric functions
a<-df[,sapply(df, is.numeric), drop = FALSE]
b<-df[,sapply(df, Negate(is.numeric)), drop = FALSE]
# Change empty strings to NA
b<-b[lapply(b,function(x) levels(x) <- c(levels(x), NA) ),] # add NA level
b<-b[lapply(b,function(x) x[x=="",]<- NA),] # change Null to NA
# Put the columns back together
d<-cbind(a,b)
d[, names(df)]
}
However, I'm getting this error:
> foo<-nullToNA(bar)
Error in x[x == "", ] <- NA : incorrect number of subscripts on matrix
Called from: FUN(X[[i]], ...)
I have tried the answer found here: Replace all 0 values to NA but it changes all my columns to numeric values.
You can directly index fields that match a logical criterion. So you can just write:
df[is_empty(df)] = NA
Where is_empty is your comparison, e.g. df == "":
df[df == ""] = NA
But note that is.null(df) won’t work, and would be weird anyway [1]. I would advise against merging the logic for columns of different types, though! Instead, handle them separately.
[1] You’ll almost never encounter NULL inside a table, since that only works if the underlying vector is a list. You can create matrices and data.frames with this constraint, but then is.null(df) will never be TRUE, because the NULL values are wrapped inside the list.
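Handling the column types separately could look roughly like the sketch below. The function name blank_to_na and the data frame toy are made up for illustration. Numeric columns are left untouched, and character and factor columns get their empty strings replaced.

```r
# Made-up example data: one numeric, one character, one factor column
toy <- data.frame(
  id = c(1, 2, 3),
  a  = c("x", "", "y"),
  b  = factor(c("", "u", "")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace "" with NA in character and factor columns only
blank_to_na <- function(d) {
  for (nm in names(d)) {
    col <- d[[nm]]
    if (is.character(col) || is.factor(col)) {
      col[!is.na(col) & col == ""] <- NA
      # drop the now unused "" level so it no longer shows up in summaries
      if (is.factor(col)) col <- droplevels(col)
      d[[nm]] <- col
    }
  }
  d
}

cleaned <- blank_to_na(toy)
```

The column types are preserved: id stays numeric, a stays character, and b stays a factor without the empty level.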
This worked for me
df[df == 'NULL'] <- NA
How about just:
df[apply(df, 2, function(x) x=="")] = NA
Works fine for me, at least on simple examples.
This is the function I used to solve this issue.
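null_na <- function(vector) {
  # start with all NA and copy over only the non-missing, non-empty values
  new_vector <- rep(NA, length(vector))
  for (i in seq_along(vector)) {
    # test is.na() first: comparing NA with "" gives NA, which if() rejects
    if (!is.na(vector[i]) && vector[i] != "") {
      new_vector[i] <- vector[i]
    }
  }
  return(new_vector)
}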
Just plug in the column or vector you are having an issue with.

Iterating through all rows in R, removing those that fit criteria

I have an R data frame with about a dozen columns and 150 or so rows. I want to iterate through the rows and remove a row when both of these conditions hold:
Its value in column 8 is undefined (NA)
The value in column 8 for the row ABOVE it IS defined
My code looks like this, but it keeps crashing. It's gotta be a dumb mistake, but I can't figure it out.
for (i in 2:nrow(newfile)){
if (is.na(newfile[i,8]) && !is.na(newfile[(i-1),8]){
newfile<-newfile[-i,]
}
}
Obviously in this example, newfile is my dataframe.
The error I get
Error in [.data.frame(newfile, -i, ) : object 'i' not found
Problem solved, but some test data if you guys wanted to muck around:
23 L8 29141078 744319 27165443
24 L8 27165443 NA NA
25 L8 28357836 8293 25116398
26 L8 25116398 NA NA
27 L8 28357836 21600 25116398
28 L8 25116398 NA NA
29 L8 40929564 NA NA
30 L8 40929564 NA NA
31 L8 41917264 33234 39446503
32 L8 39446503 NA NA
33 L8 41917264 33981 39446503
34 L8 39446503 NA NA
Obviously a little modified here, so now you are comparing column 4 with the one above it (or you can use column 5, either way)
The problem is that you're changing the data frame out from under yourself; the original evaluation of nrow(newfile) doesn't get updated as you go along (it would if you had a C-style loop for (i=1; i<=nrow(newfile); i++) ...). In a while loop, on the other hand, the condition will get re-evaluated every time through the loop, so I think this will work.
i <- 2
while (i<=nrow(newfile)){
if (is.na(newfile[i,8]) && !is.na(newfile[i-1,8])) {
newfile<-newfile[-i,]
}
i <- i+1
}
You didn't give us an easily reproducible example (i.e. a test dataset with answers), so I'm not going to test this right now.
Careful thought (which I don't have time to give this at the moment) might lead to a non-iterative (and hence perhaps very much faster, if that matters) way to do this.
Hmm, if I do this, I get
Error in if (is.na(newfile[i,8]) && !is.na(newfile[(i-1),8]) { :
missing value where TRUE/FALSE needed
This is because you're removing rows while you're iterating through them, so by the time you get to nrow(newfile) (which is the original number of rows, since nrow(newfile) is evaluated once at the beginning of the for loop), that row may not exist any more because rows have been removed.
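The point that the loop sequence is fixed up front can be seen on a plain vector. This is a toy sketch with made-up values, not the asker's data:

```r
x <- c(10, 20, 30, 40, 50)
visited <- integer(0)

# seq_along(x) is evaluated once, before the first iteration
for (i in seq_along(x)) {
  visited <- c(visited, i)
  if (i == 2) x <- x[-2]   # shrink x while the loop is running
}

# The loop still ran up to the original length,
# although x now has only four elements
last_value <- x[5]         # indexing past the end gives NA
```

The loop visits indices 1 to 5 even though x has shrunk, and the final index points past the end of the data.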
You can avoid looping altogether by constructing a logical index of which rows to keep (i.e. a vector of length nrow(newfile) that is TRUE for rows you want to keep and FALSE otherwise):
n <- nrow(newfile)
# first part: "is the value in this row NA?" (for rows 2:n)
# second part: "is the value in the row above *not* NA?" (for rows 1:(n-1))
# the & finds rows satisfying *both* conditions (the first row is always kept)
toRemove <- c(FALSE,is.na(newfile[-1,8])) & c(FALSE,!is.na(newfile[-n,8]))
toKeep <- !toRemove
newfile <- newfile[toKeep,]
You could do it all in one line if that's your thing:
newfile <- newfile[ !(c(FALSE,is.na(newfile[-1,8])) & c(FALSE,!is.na(newfile[-nrow(newfile),8]))), ]
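As a quick check, the same logical-index idea can be applied to the fourth column of the test data posted in the question. The two-column data frame below is a reduced stand-in built from those values; the column names id and value are made up.

```r
# Row labels and fourth-column values taken from the posted test data
newfile <- data.frame(
  id    = 23:34,
  value = c(744319, NA, 8293, NA, 21600, NA, NA, NA, 33234, NA, 33981, NA)
)

n <- nrow(newfile)
is_missing <- is.na(newfile$value)

# Remove a row if it is NA and the row above it is not NA;
# the first row has no row above it, so it is never removed
toRemove <- c(FALSE, is_missing[-1] & !is_missing[-n])
result <- newfile[!toRemove, ]
```

Rows 24, 26, 28, 32 and 34 are dropped, while rows 29 and 30 stay because the value above each of them is itself NA.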
Here is another solution. But it keeps NA values if the previous value is also NA.
#create some dummy data
newfile <- matrix(runif(800), ncol = 8)
newfile[rbinom(100, 1, 0.25) == 1, 8] <- NA
#the selection
newfile[-which(diff(is.na(newfile[, 8])) == 1) - 1, ]

How to show indexes of NAs?

I have the piece to display NAs, but I can't figure it out.
try(na.fail(x))
> Error in na.fail.default(x) : missing values in object
# display NAs
myvector[is.na(x)]
# returns
NA NA NA NA
The only thing I get from this is the length of the NA vector, which is not very helpful when the NAs were caused by a bug in my code that I am trying to track down. How can I get the index of the NA element(s)?
I also tried:
subset(x,is.na(x))
which has the same effect.
EDIT:
y <- complete.cases(x)
x[!y]
# just returns another
NA NA NA NA
You want the which function:
which(is.na(arr))
is.na() will return a boolean index of the same shape as the original data frame.
In other words, any cells in that m x n index with the value TRUE correspond to NA values in the original data frame.
You can then use this to change the NAs, if you wish:
DF[is.na(DF)] = 999
To get the total number of data rows with at least one NA:
cc = complete.cases(DF)
num_missing = nrow(DF) - sum(cc)
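For a data frame, which() can also report the row and column of every NA when called with arr.ind = TRUE. A small sketch on made-up data:

```r
# Made-up data frame with one NA in each column
DF <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# Positions of NAs within a single column
na_in_a <- which(is.na(DF$a))

# Row/column positions of every NA in the whole data frame
pos <- which(is.na(DF), arr.ind = TRUE)

# Indices and count of rows that contain at least one NA
rows_with_na <- which(!complete.cases(DF))
num_missing  <- sum(!complete.cases(DF))
```

pos is a two-column matrix (row, col), which is handy when tracking down where in a table the missing values come from.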
which(Dataset$variable=="") will return the corresponding row numbers in a particular column
R Code using loop and condition :
# Testing for missing values
is.na(x) # returns TRUE if x is missing
y <- c(1,NA,3,NA)
is.na(y)
# returns a vector (F T F T)
# Print the index of NA values
for(i in 1:length(y)) {
if(is.na(y[i])) {
cat(i, ' ')
}
}
Output is:
2  4
Also :
which(is.na(y))
