How can I count the number of NA values in a dataset? - r

How can I find out how many values in a dataset are NA, or whether the dataset contains any NAs or NaNs at all?

This may also work fine:
sum(is.na(df)) # for the entire dataset
For a particular column in the dataset:
sum(is.na(df$col1))
Or, to check all the columns at once, as mentioned by @nicola:
colSums(is.na(df))
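For instance, a minimal sketch on a made-up data frame (the df, col1 and col2 names are illustrative, not from the question):
df <- data.frame(col1 = c(1, NA, 3, NA), col2 = c("a", "b", NA, "d"))
sum(is.na(df))      # total NAs in the whole data frame: 3
sum(is.na(df$col1)) # NAs in one column: 2
colSums(is.na(df))  # NAs per column: col1 = 2, col2 = 1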

As @Roland noted, there are multiple functions for finding and dealing with missing values in R (see help("NA") and here).
Example:
Create a fake dataset with some NA's:
data <- matrix(1:300, ncol = 3)  # a 100 x 3 matrix
data[sample(300, 40)] <- NA      # sprinkle 40 NAs at random positions
Check if there are any missing values:
anyNA(data)
Columnwise check if there are any missing values:
apply(data, 2, anyNA)
Check percentages and counts of missing values in columns:
colMeans(is.na(data))*100
colSums(is.na(data))
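As an extra check (not part of the original answer), which() with arr.ind = TRUE lists the exact positions of the missing cells:
which(is.na(data), arr.ind = TRUE) # row/column index of every NA
sum(is.na(data))                   # total count, 40 by construction here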

For a data frame it is:
sum(is.na(df))
where df is the data frame.
For a particular column in the data frame you can use:
sum(is.na(df$col))
or
cnt = 0
for (i in df$col) {
  if (is.na(i)) {
    cnt = cnt + 1
  }
}
cnt
where cnt gives the number of NAs in the column.

You can also get the number of NAs in each column of a dataset with summary().
For a vector x
summary(x)
For a data frame df
summary(df)
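For example (a small sketch with made-up data; summary() only prints an "NA's" entry when a column actually contains missing values):
x <- c(1, 2, NA, 4, NA)
summary(x)                              # ends with: NA's  2
df <- data.frame(a = x, b = c(NA, 2:5))
summary(df)                             # one "NA's" line per column with missing values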

Related

Interpolate across row NA's of dataframe

I have a dataframe with NA values peppered in that I want to interpolate.
Here is a reproducible example:
A <- as.data.frame(c(1:6))
A$b <- NA
A$c <- 2:7
library(zoo)
na.approx(A)
#expectation
A$b <- seq(1.5, 6.5, 1)
Obviously na.approx() isn't doing it for me; is there a function that will interpolate by row?
na.approx also works column-wise on a matrix, so transpose, interpolate, and transpose back:
t(na.approx(t(A)))
How about:
t(apply(A, 1, na.approx))
Here is a solution that enables you to keep the original data type:
library(imputeTS)
as.data.frame(t(na.interpolation(t(A))))
This does the same calculation as the na.approx solutions mentioned above
(but this way you'll still have a data.frame and retain your column names)
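Putting it together on the question's example, a sketch assuming the zoo package (with tidier column names than the original A):
library(zoo)
A <- data.frame(a = 1:6, b = NA, c = 2:7)
res <- t(na.approx(t(A)))  # row-wise interpolation; returns a matrix
as.data.frame(res)         # b is now 1.5, 2.5, 3.5, 4.5, 5.5, 6.5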

Condensing code: Check that multiple columns follow a boolean in data frame in R

I have a data frame (df) that has some NA values. I wanted to extract the rows where there are NA values across multiple columns (in the example below, I am doing so for columns 12-20):
NArows = which(is.na(df[,20]) & is.na(df[,19]) & is.na(df[,18]) & is.na(df[,17]) & is.na(df[,16]) & is.na(df[,15]) & is.na(df[,14]) & is.na(df[,13]) & is.na(df[,12]))
Is there a more readable (and condensed) way to accomplish this, without chaining every column's condition together with &?
Thank you for any help...
Try this:
cols <- 12:20
NArows <- which(apply(df[cols], 1, function(y) sum(!is.na(y)) == 0))
This slices your df down to just the cols you care about, then applies the is.na() test to each row; if every value in those columns is NA, that row number goes into NArows.
Or, based on David Arenburg's answer, this flags any rows with 2 or more NAs:
NArows <- which(rowSums(is.na(df[12:20])) > 1L)
Adapting it to more closely match your requirements, this flags only rows where all are NAs:
cols <- 12:20
NArows <- which(rowSums(is.na(df[cols])) == ncol(df[cols]))
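A quick self-contained check of the rowSums() approach (the toy df and columns x, y, z are made up for illustration):
df <- data.frame(x = c(NA, 1, NA), y = c(NA, NA, 2), z = c(NA, 3, 4))
cols <- 1:3
which(rowSums(is.na(df[cols])) == length(cols)) # all inspected columns NA: row 1
which(rowSums(is.na(df[cols])) > 0)             # at least one NA: rows 1, 2, 3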

Replacing outliers from multiple columns in a dataframe containing NAs using R

I am trying to replace outliers in a big dataset (more than 3000 columns and 250,000 rows) with NA. I want to replace observations that are more than 3 standard deviations from the mean with NA. I got it working column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) < 3*sd(height,na.rm=TRUE),height,NA)
However, I would like to create a function that does this for a subset of columns. To do that, I created a list with the names of the columns whose outliers I want to replace, but it is not working.
Anyone could help me, please?
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
This was my last try:
d1 = names(data)
list = c("age", "height", "mark")
ntraits = length(list)
nrows = dim(data)[1]
for (i in 1:ntraits) {
  a = list[i]
  b = which(d1 == a)
  d2 = data[, b]
  for (j in 1:nrows) {
    d2[j] = ifelse(abs(d2[j] - mean(d2, na.rm = TRUE)) < 3 * sd(d2, na.rm = TRUE), d2[j], NA)
  }
}
Sorry, I am still learning how to program in R. Thank you very much.
Cheers.
I would look into using apply and scale; scale will handle the NAs. The following code should work:
# get sd for a subset of the columns
data.scale <- scale(data[ ,c("age","height","mark") ])
# set outliers to NA
data.scale[ abs(data.scale) > 3 ] <- NA
# write back to the data set
data[ ,c("age","height","mark") ] <- data.scale
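Since the question asks for a function over a subset of columns, here is a minimal sketch wrapping the same idea (the name replace_outliers is hypothetical, not from the answer). Unlike writing data.scale straight back, it keeps the original units and only blanks the outliers:
replace_outliers <- function(df, cols, k = 3) {
  z <- scale(df[, cols])             # column-wise z-scores; scale() handles NAs
  vals <- as.matrix(df[, cols])
  vals[!is.na(z) & abs(z) > k] <- NA # blank values more than k SDs from the mean
  df[, cols] <- vals                 # everything else keeps its original value
  df
}
clean <- replace_outliers(data, c("age", "height", "mark"))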

Is it possible to report NAs in stargazer table?

Stargazer package gives me really nice descriptive table to include in latex document.
library(stargazer)
stargazer(attitude)
Is it possible to add a column reporting number of NAs for each of the variables?
This snippet will give you the number of NAs per row and per column of a data frame:
# make some data
count <- 10
data <- data.frame( a=runif(count), b=runif(count))
# add some NAs
data[data$a>0.5,]$a <- NA
data[data$b>0.5,]$b <- NA
# NAs per row
data$NACount <- apply(data, 1, function(x) {length(x[is.na(x)])})
# NAs per column
NACountsByColumn <- lapply(data, function(x) {length(x[is.na(x)])} )
The "N" column of stargazer does not count the NAs. Alhough there is not direct way to input NAs in the output table, you can simply calculate it by
nrow(dataset) - the number in "N" column
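For example, building on the snippet above, colSums() gives those per-column NA counts in one call, and nrow() minus the NA count is what stargazer reports as "N" (a sketch, not a built-in stargazer option):
colSums(is.na(data))              # NAs per column of the example data
nrow(data) - colSums(is.na(data)) # non-NA observations, i.e. stargazer's "N"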

!is.na creates NAs in other columns

In the midst of merging several data sets, I'm trying to remove all rows of a data frame that have a missing value for one particular variable (I want to keep the NAs in some of the other columns for the time being). I used the following line:
data.frame <- data.frame[!is.na(data.frame$year),]
This successfully removes all rows with NAs for year (and no others), but the other columns, which previously had data, are now entirely NAs. In other words, non-missing values are being converted to NA. Any ideas as to what's going on here? I've tried these alternatives and got the same outcome:
data.frame <- subset(data.frame, !is.na(year))
data.frame$x <- ifelse(is.na(data.frame$year) == T, 1, 0);
data.frame <- subset(data.frame, x == 0)
Am I using is.na incorrectly? Are there any alternatives to is.na in this scenario? Any help would be greatly appreciated!
Edit: Here is code that should reproduce the issue:
#data
tc <- read.csv("http://dl.dropbox.com/u/4115584/tc2008.csv")
frame <- read.csv("http://dl.dropbox.com/u/4115584/frame.csv")
#standardize NA codes
tc[tc == "."] <- NA
tc[tc == -9] <- NA
#standardize spatial units
colnames(frame)[1] <- "loser"
colnames(frame)[2] <- "gainer"
frame$dyad <- paste(frame$loser,frame$gainer,sep="")
tc$dyad <- paste(tc$loser,tc$gainer,sep="")
drops <- c("loser","gainer")
tc <- tc[,!names(tc) %in% drops]
frame <- frame[,!names(frame) %in% drops]
rm(drops)
#merge tc into frame
data <- merge(tc, frame, by.x = "year", by.y = "dyad", all.x=T, all.y=T) #year column is duplicated in this process. I haven't had this problem with nearly identical code using other data.
rm(tc,frame)
#the first column in the new data frame is the duplicate year, which does not actually contain years. I'll rename it.
colnames(data)[1] <- "double"
summary(data$year) #shows 833 NA's
summary(data$procedur) #note that at this point there are non-NA values
#later, I want to create 20 year windows following the events in the tc data. For simplicity, I want to remove cases with NA in the year column.
new.data <- data[!is.na(data$year),]
#now let's see what the above operation did
summary(new.data$year) #missing years were successfully removed
summary(new.data$procedur) #this variable is now entirely NA's
I think the actual problem is with your merge.
After you merge and have the data in data, if you do:
table(data$procedur, useNA = "always")
#      1      2      3      4      5      6   <NA>
#    122    112    356     59     39     19 192258
You can see there are only that many (122 + 112 + ... + 19) non-NA values for data$procedur, and all of them correspond to rows where data$year is NA.
> all(is.na(data$year[!is.na(data$procedur)]))
# [1] TRUE # every value of procedur occurs where year = NA
So, basically, all the non-NA values of procedur are removed as well when you drop the rows where year is NA.
To solve this problem, I think you should use merge as:
merge(tc, frame, all=T) # it'll automatically calculate common columns
# also this will not result in duplicated year column.
Check if this merge gives you the desired result.
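A small self-contained sketch of the difference (the toy tc and frame here are made up; the real data come from the Dropbox files above):
tc <- data.frame(dyad = c("AB", "CD"), year = c(1990, 1995), procedur = c(1, 2))
frame <- data.frame(dyad = c("AB", "CD", "EF"))
bad  <- merge(tc, frame, by.x = "year", by.y = "dyad", all = TRUE) # mismatched keys: rows never line up
good <- merge(tc, frame, all = TRUE)                               # joins on the shared "dyad" column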
Try complete.cases:
data.frame.clean <- data.frame[complete.cases(data.frame$year),]
...though you may want to pick a more descriptive name than data.frame for your data frame.
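A tiny illustration of that answer (the toy d data frame is made up): complete.cases() on a single column behaves like !is.na() here, but also generalises to several columns at once:
d <- data.frame(year = c(1990, NA, 1995), procedur = c(1, 2, NA))
d[complete.cases(d$year), ] # drops only the row with a missing year
d[!is.na(d$year), ]         # same result for a single column
d[complete.cases(d), ]      # drops rows with an NA in any column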
