Determine the number of NA values in a column - r

I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is the following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?

You're over-thinking the problem:
sum(is.na(df$col))
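is.na() returns a logical vector and sum() treats TRUE as 1, so this counts the NAs directly. A minimal illustration with made-up data:
x <- c(1, NA, 3, NA)  # illustrative vector, not from the question
sum(is.na(x))
#> [1] 2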

If you are looking for NA counts for each column in a dataframe then:
na_count <- sapply(df, function(y) sum(length(which(is.na(y)))))
should give you a named vector with the counts for each column.
na_count <- data.frame(na_count)
Should output the data nicely in a dataframe like:
row.names | na_count
----------|---------
column_1  | count
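A runnable sketch of the whole recipe, using a small made-up data frame:
df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, NA, "a"))  # illustrative data
na_count <- sapply(df, function(y) sum(length(which(is.na(y)))))
data.frame(na_count)
#>      na_count
#> col1        1
#> col2        2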

Try the colSums function
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
# x y
# 1 3

A quick and easy tidyverse solution to get an NA count for all columns is to use summarise_all(), which I think makes for a much easier-to-read solution than purrr or sapply:
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
             col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#>    col1  col2
#>   <int> <int>
#> 1     1     2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))

If you are looking to count the number of NAs in the entire dataframe you could also use
sum(is.na(df))

The summary() function also counts NAs in its output, so you can use it if you want NA counts for several variables at once.
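For example (illustrative data; the exact layout of summary() output varies with the column types):
df <- data.frame(x = c(1, 2, NA), y = c(NA, NA, 3))  # made-up data
summary(df)  # each column's summary includes a line like "NA's   :1" when NAs are present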

A tidyverse way to count the number of NAs in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  print(n = 100)
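As a side note, gather() has since been superseded; assuming tidyr >= 1.0.0 is available, a sketch of the same pipeline with pivot_longer():
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "num_nulls")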

This form, slightly changed from Kevin Ogoros's one:
na_count <- function(x) sapply(x, function(y) sum(is.na(y)))
returns NA counts as a named integer vector.
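Hypothetical usage, with a made-up data frame:
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, "x"))  # illustrative data
na_count(df)
#> a b
#> 1 2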

sapply(df, function(x) sum(is.na(x)))  # replace df with the name of your data frame

Try this:
length(df$col[is.na(df$col)])

User rrs's answer is right, but it only tells you the number of NA values in the particular column of the data frame that you are passing in. To get the number of NA values for every column of the data frame, try this:
apply(df, 2, function(x) sum(is.na(x)))  # MARGIN = 2 applies the function over columns
This does the trick

I read a csv file from a local directory. The following code works for me.
# number of rows in which the column contains NA
sum(is.na(df[, columnName]))
# number of rows in which the column does not contain NA
sum(!is.na(df[, columnName]))
# here columnName is your desired column name

Similar to hute37's answer but using the purrr package. I think this tidyverse approach is simpler than the answer proposed by AbiK.
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function, and the '.' refers to the input for the anonymous function, in this case each column of the data.frame df.

If you're looking for the NA counts of each column to be printed one after the other, then you can use this. Simple solution:
lapply(df, function(x) length(which(is.na(x))))

Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#> col
#> 1 1
#> 2 2
#> 3 NA
sum(!complete.cases(df$col))
#> [1] 1
Created on 2022-08-27 with reprex v2.0.2

You can use this to count number of NA or blanks in every column
colSums(is.na(data_set_name)|data_set_name == '')
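A minimal sketch of how this behaves, with made-up data (a cell is counted if it is NA or an empty string):
data_set_name <- data.frame(a = c("x", "", NA), b = c(NA, "y", "z"))  # illustrative data
colSums(is.na(data_set_name) | data_set_name == '')
#> a b
#> 2 1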

In the interests of completeness you can also use the useNA argument in table(). For example, table(df$col, useNA = "always") will count all of the non-NA cases as well as the NA ones.
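For example, with a small made-up vector:
x <- c("a", "b", "a", NA)  # illustrative data
table(x, useNA = "always")
#> x
#>    a    b <NA>
#>    2    1    1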

Related

In R, sample n rows from a df in which a certain column has non-NA values (sample conditionally)

Background
Here's a toy df:
df <- data.frame(ID = c("a","b","c","d","e","f"),
                 gender = c("f","f","m","f","m","m"),
                 zip = c(48601,NA,29910,54220,NA,44663),
                 stringsAsFactors = FALSE)
As you can see, I've got a couple of NA values in the zip column.
Problem
I'm trying to randomly sample 2 entire rows from df -- but I want them to be rows for which zip is not NA.
What I've tried
This code gets me a basic (i.e. non-conditional) random sample:
df2 <- df[sample(nrow(df), 2), ]
But of course, that only gets me halfway to my goal -- a bunch of the time it's going to return a row with an NA value in zip. This code attempts to add the condition:
df2 <- df[sample(nrow(df$zip != NA), 2), ]
I think I'm close, but this yields the error "invalid first argument".
Any ideas?
We can use is.na
tmp <- df[!is.na(df$zip), ]
tmp[sample(nrow(tmp), 2), ]
We can use rownames + na.omit to sample the rows
> df[sample(rownames(na.omit(df["zip"])), 2),]
ID gender zip
3 c m 29910
4 d f 54220
Here is a base R solution with complete.cases()
# define a logical vector to identify NA
x <- complete.cases(df)
# subset only rows without NA values
df_no_na <- df[x,]
# do the sample
df_no_na[sample(nrow(df_no_na), 2),]
Output:
ID gender zip
3 c m 29910
6 f m 44663
For the tidyverse lovers out there...
library("dplyr")
df %>%
  tidyr::drop_na() %>%
  dplyr::slice_sample(n = 2)
If it's only NA in the zip column you care about, then:
df %>%
  tidyr::drop_na(zip) %>%
  dplyr::slice_sample(n = 2)
The important thing here is to avoid creating an unnecessary second data frame with the NA values dropped. You could use the solution using na.omit given in another answer, but alternatively you can use which to return a list of valid rows to sample from. For example:
nsamp <- 2
df[sample(which(!is.na(df$zip)), nsamp), ]
The advantage to doing it this way is that the condition inside the which can be anything you like, whether or not it involves missing values. For example this version will sample from all the rows with female gender in zip codes starting with 336:
df[sample(which(df$gender=='f' & grepl('^336', df$zip)), nsamp), ]

How to get the highest value of a table per row on R [duplicate]

I have a dataframe as below. I want to get a column of maximums for each row. But that column should ignore value 9 if it is present in that row.
How can I achieve that efficiently?
df <- data.frame(age=c(5,6,9), marks=c(1,2,7), story=c(2,9,1))
df$max <- apply(df, 1, max)
df
Here's one possibility:
df$colMax <- apply(df, 1, function(x) max(x[x != 9]))
The pmax function would be useful here. The only catch is that it takes a bunch of vectors as parameters. You can convert a data.frame to parameters with do.call. I also set the 9 values to NA as suggested by others, but do so using the somewhat unconventional `is.na<-` command.
do.call(pmax, c(`is.na<-`(df, df==9), na.rm=T))
# [1] 5 6 7
Substitute 9 with NA and then use pmax as suggested by #MrFlick in his deleted answer:
df2 <- df #copy df because we are going to change it
df2[df2==9] <- NA
do.call(function(...) pmax(..., na.rm=TRUE), df2)
#[1] 5 6 7
#make a copy of your data.frame
tmp.df <- df
#replace the 9s with NA
tmp.df[tmp.df==9] <- NA
#Use apply to process the data one row at a time through the max function, removing NA values first
apply(tmp.df, 1, max, na.rm = TRUE)
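For dplyr users, a sketch of the same idea with rowwise() and c_across(), assuming dplyr >= 1.0.0 (where those functions are available):
library(dplyr)
df %>%
  rowwise() %>%
  mutate(max = { v <- c_across(age:story); max(v[v != 9]) }) %>%  # drop the 9s, then take the max
  ungroup()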

Using dplyr, Remove all strings from a data frame

I have a data frame with 300 columns which has a string variable somewhere which I am trying to remove. I have found this solution on Stack Overflow using lapply (see below), which is what I want to do, but using the dplyr package. I have tried using the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage then then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non numeric character strings turning into NAs.
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
  as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
  if (is.numeric(.)) return(.)
  as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily string hunting, so much as trawling for non-integers.)
StrScrub <- function(x) {
  as.integer(gsub("^\\D+$", NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson

Unique values in each of the columns of a data frame

I want to get the number of unique values in each of the columns of a data frame.
Let's say I have the following data frame:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
then it should return that there are 3 distinct values for v1, and 2 for v2.
I tried unique(DF), but it does not work, as each row is different.
Or using unique:
rapply(DF,function(x)length(unique(x)))
v1 v2
3 2
sapply(DF, function(x) length(unique(x)))
In dplyr:
DF %>% summarise_all(funs(n_distinct(.)))
Here's one approach:
> lapply(DF, function(x) length(table(x)))
$v1
[1] 3
$v2
[1] 2
This basically tabulates the unique values per column. Using length on that tells you the number. Removing length will show you the actual table of unique values.
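For example, with the DF defined above:
lapply(DF, table)
#> $v1
#>
#> 1 2 3
#> 1 2 1
#>
#> $v2
#>
#> a b
#> 2 2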
For the sake of completeness: Since CRAN version 1.9.6 of 19 Sep 2015, the data.table package includes the helper function uniqueN() which saves us from writing
function(x) length(unique(x))
when calling one of the siblings of apply():
sapply(DF, data.table::uniqueN)
v1 v2
3 2
Note that the data.table package does not need to be loaded, nor DF coerced to class data.table, in order to use uniqueN() here.
In dplyr (>= 1.0.0, June 2020):
DF %>% summarize_all(n_distinct)
v1 v2
1 3 2
I think a function like this would give you what you are looking for. It shows the number of unique values, in addition to how many NAs there are in each of the dataframe's columns. Simply plug in your dataframe, and you are good to go.
totaluniquevals <- function(df) {
  x <- data.frame("Row Name" = numeric(0), "TotalUnique" = numeric(0), "IsNA" = numeric(0))
  result <- sapply(df, function(x) length(unique(x)))
  isnatotals <- sapply(df, function(x) sum(is.na(x)))
  # now create the row names
  for (i in 1:length(colnames(df))) {
    x[i, 1] <- names(result[i])
    x[i, 2] <- result[[i]]
    x[i, 3] <- isnatotals[[i]]
  }
  return(x)
}
Test:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
totaluniquevals(DF)
Row.Name TotalUnique IsNA
1 v1 3 0
2 v2 2 0
You can then use unique on whatever column, to see what the specific unique values are.
unique(DF$v2)
[1] a b
Levels: a b
This should work for getting the number of unique values for a single variable:
length(unique(datasetname$variablename))
This will give you the unique values of column 1 of the DF dataframe:
unique(DF[, 1])
