Defining missing values in R? [duplicate] - r

I have a dataframe with some numeric columns. Some row has a 0 value which should be considered as null in statistical analysis. What is the fastest way to replace all the 0 value to NULL in R?

Replacing all zeroes to NA:
df[df == 0] <- NA
Explanation
1. It is not NULL what you should want to replace zeroes with. As it says in ?'NULL',
NULL represents the null object in R
which is unique and, I guess, can be seen as the most uninformative and empty object.1 Then it becomes not so surprising that
data.frame(x = c(1, NULL, 2))
# x
# 1 1
# 2 2
That is, R does not reserve any space for this null object.2 Meanwhile, looking at ?'NA' we see that
NA is a logical constant of length 1 which contains a missing value
indicator. NA can be coerced to any other vector type except raw.
Importantly, NA is of length 1 so that R reserves some space for it. E.g.,
data.frame(x = c(1, NA, 2))
# x
# 1 1
# 2 NA
# 3 2
Also, the data frame structure requires all the columns to have the same number of elements so that there can be no "holes" (i.e., NULL values).
Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var, cov, or cor, that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.obs". Typically, however, this is unsatisfactory as it leads to extra information loss.
2. Instead of running some sort of loop, in the solution I use df == 0 vectorization. df == 0 returns (try it) a matrix of the same size as df, with the entries TRUE and FALSE. Further, we are also allowed to pass this matrix to the subsetting [...] (see ?'['). Lastly, while the result of df[df == 0] is perfectly intuitive, it may seem strange that df[df == 0] <- NA gives the desired effect. The assignment operator <- is indeed not always so smart and does not work in this way with some other objects, but it does so with data frames; see ?'<-'.
1 The empty set in the set theory feels somehow related.
2 Another similarity with the set theory: the empty set is a subset of every set, but we do not reserve any space for it.

Let me assume that your data.frame is a mix of different datatypes and not all columns need to be modified.
to modify only columns 12 to 18 (of the total 21), just do this
df[, 12:18][df[, 12:18] == 0] <- NA

dplyr::na_if() is an option:
library(dplyr)
df <- data_frame(col1 = c(1, 2, 3, 0),
col2 = c(0, 2, 3, 4),
col3 = c(1, 0, 3, 0),
col4 = c('a', 'b', 'c', 'd'))
na_if(df, 0)
# A tibble: 4 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <chr>
1 1 NA 1 a
2 2 2 NA b
3 3 3 3 c
4 NA 4 NA d

An alternative way without the [<- function:
A sample data frame dat (shamelessly copied from #Chase's answer):
dat
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
Zeroes can be replaced with NA by the is.na<- function:
is.na(dat) <- !dat
dat
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA

#Sample data
set.seed(1)
dat <- data.frame(x = sample(0:2, 5, TRUE), y = sample(0:2, 5, TRUE))
#-----
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
#replace zeros with NA
dat[dat==0] <- NA
#-----
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA

Because someone asked for the Data.Table version of this, and because the given data.frame solution does not work with data.table, I am providing the solution below.
Basically, use the := operator --> DT[x == 0, x := NA]
library("data.table")
status = as.data.table(occupationalStatus)
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 0
8: 8 1 0
9: 1 2 19
10: 2 2 40
status[N == 0, N := NA]
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 NA
8: 8 1 NA
9: 1 2 19
10: 2 2 40

In case anyone arrives here via google looking for the opposite (i.e. how to replace all NAs in a data.frame with 0), the answer is
df[is.na(df)] <- 0
OR
Using dplyr / tidyverse
library(dplyr)
mtcars %>% replace(is.na(.), 0)

You can replace 0 with NA only in numeric fields (i.e. excluding things like factors), but it works on a column-by-column basis:
col[col == 0 & is.numeric(col)] <- NA
With a function, you can apply this to your whole data frame:
changetoNA <- function(colnum,df) {
col <- df[,colnum]
if (is.numeric(col)) { #edit: verifying column is numeric
col[col == -1 & is.numeric(col)] <- NA
}
return(col)
}
df <- data.frame(sapply(1:5, changetoNA, df))
Although you could replace the 1:5 with the number of columns in your data frame, or with 1:ncol(df).

Here is my contribution for those who are struggling with datasets with different types of columns with multiple values representing missing data.
dat <- data_frame(numA = c(1, 0, 3, 4),
numB = c(NA, 2, 3, 4),
strC = c("0", "1.2", "NA", "2.4"),
strD = c("Yes", "Yes", "missing", "No"))
Let's say in this data we want to replace 0 in numeric columns with NA as well as 'NA' and 'missing' values in character/string values with NA. Notice that 'NA' in strC column is a character type value, not the desired NA.
dat
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 0 2 1.2 Yes
3 3 3 'NA' missing
4 4 4 2.4 No
First, an obvious case, notice that when converting a character column to numeric values any non-numeric string value is coerced to NA.
as.numeric(dat$strC)
[1] 0.0 1.2 NA 2.4
Answer with indexing:
dat[dat == "NA" | dat =="missing"] <- NA
However, do NOT use that for 0 because it changes both numeric and character 0s to NA. This is because "0" == 0 returns TRUE in R.
dplyr::na_if method:
library(dplyr)
dat %>%
lapply(na_if, y = "missing") %>%
lapply(na_if, y = "NA") %>%
lapply(na_if, y = 0) %>% # DONT DO THIS! It converts string 0s to NA as well!
data.frame()
Here we apply na_if function to each column of the data. Since na_if does not accept multiple values to be converted to NA we need to write multiple lines of code for each value to be converted into NA. However, simple usage of this function with 0 converts both the numeric and character 0s into NA. We need to do something else!
Using mutate across method with na_if function:
This is my favorite solution. Here we check the column type and apply na_if function as necessary. The character 0 is untouched, whereas all desired values are converted into NA.
dat %>%
mutate(across(where(is.numeric), ~na_if(., 0))) %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
mutate(across(where(is.character), ~na_if(., "missing")))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
Finally, nariar package can be used
nariar is a recent package that introduces a variety of replace_with_ functions.
library(naniar)
Replace all 'NA' and 'missing' values to NA:
dat %>%
replace_with_na_all(~.x %in% c("NA", "missing"))
but if you use this with 0s, it still erroneously converts the character 0 to NA:
dat %>%
replace_with_na_all(~.x %in% c(0, "NA", "missing"))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA NA Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
#strC's first element should not be NA here!
So, we have to specify column type using replace_with_na_if:
dat %>%
replace_with_na_if(is.character, ~.x %in% c("NA", "missing")) %>%
replace_with_na_if(is.numeric, ~.x %in% c(0))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
We achieved the desired outcome. I hope all this is helpful :)

If you are like me and landed here while wondering how to replace ALL values in a dataframe with NA, it's just:
df[,] <- NA

Another option is to replace all 0 with NA using mutate_all like this:
library(dplyr)
df <- data.frame(v1 = c(1,0,4,2),
v2 = c(3,1,0,0))
df
#> v1 v2
#> 1 1 3
#> 2 0 1
#> 3 4 0
#> 4 2 0
mutate_all(df, ~replace(., .==0, NA))
#> v1 v2
#> 1 1 3
#> 2 NA 1
#> 3 4 NA
#> 4 2 NA
Created on 2022-07-10 by the reprex package (v2.0.1)

Related

How to change NA into 0 based on other variable / how many times it was recorded

I am still new to R and need help. I want to change the NA value in variables x1,x2,x3 to 0 based on the value of count. Count specifies the number of observations, and the x1,x2,x3 stand for the visit to the site (or replication). The value in each 'X' variable is the number of species found. However, not all sites were visited 3 times. The variable count is telling us how many times the site was actually visited. I want to identify the actual NA and real 0 (which means no species found). I want to change the NA into 0 if the site is actually visited and keep it NA if the site is not visited. For example from the dummy data, 'zhask' site is visited 2 times, then the NA in x1 of zhask needs to be replaced with 0.
This is the dummy data:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask NA 1 NA 2
3 balmond 3 NA 2 3
4 layla NA 1 NA 2
5 angela NA 3 NA 2
So, it the table need to be changed into:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
I've tried many things and try to make my own function, however, it is not working:
for(i in 1:nrow(df))
{
if( is.na(df$x1[i]) && (i < df$count[i]))
{df$x1[i]=0}
else
{df$x1[i]=df$x1[i]}
}
this is the script for the dummy dataframe:
x1= c(1,NA,3, NA, NA)
x2= c(2,1, NA, 1, 3)
x3 = c(1, NA, 2, NA, NA)
count=c(3,2,3,2,2)
site=c("miya", "zhask", "balmond", "layla", "angela")
df=data.frame(site,x1,x2,x3,count)
Any help will be very much appreciated!
One way to be to apply a function over all of your count columns. Here's a way to do that.
cols <- c("x1", "x2", "x3")
df[, cols] <- mapply(function(col, idx, count) {
ifelse(idx <=count & is.na(col), 0, col)
}, df[,cols], seq_along(cols), MoreArgs=list(count=df$count))
# site x1 x2 x3 count
# 1 miya 1 2 1 3
# 2 zhask 0 1 NA 2
# 3 balmond 3 0 2 3
# 4 layla 0 1 NA 2
# 5 angela 0 3 NA 2
We use mapply to iterate over the columns and the index of the column. We also pass in the count value each time (since it's the same for all columns, it goes in the MoreArgs= parameter). This mapply will return a list and we can use that to replace the columns with the updated values.
If you wanted to use dplyr, that might look more like
library(dplyr)
cols <- c("x1"=1, "x2"=2, "x3"=3)
df %>%
mutate(across(starts_with("x"), ~if_else(cols[cur_column()]<count & is.na(.x), 0, .x)))
I used the cols vector to get the index of the column which doesn't seem to be otherwise available when using across().
But a more "tidy" way to tackle this problem would be to pivot your data first to a "tidy" format. Then you can clean the data more easily and pivot back if necessary
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols=starts_with("x")) %>%
mutate(index=readr::parse_number(name)) %>%
mutate(value=if_else(index < count & is.na(value), 0, value)) %>%
select(-index) %>%
pivot_wider(names_from=name, values_from=value)
# site count x1 x2 x3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 miya 3 1 2 1
# 2 zhask 2 0 1 NA
# 3 balmond 3 3 0 2
# 4 layla 2 0 1 NA
# 5 angela 2 0 3 NA
Via some indexing of the columns:
vars <- c("x1","x2","x3")
df[vars][is.na(df[vars]) & (col(df[vars]) <= df$count)] <- 0
# site x1 x2 x3 count
#1 miya 1 2 1 3
#2 zhask 0 1 NA 2
#3 balmond 3 0 2 3
#4 layla 0 1 NA 2
#5 angela 0 3 NA 2
Essentially this is:
selecting the variables/columns and storing in vars
flagging the NA cells within those variables with is.na(df[vars])
col(df[vars]) returns a column number for every cell, which can be checked if it is less than the df$count in each corresponding row
the values meeting both the above criteria are overwritten <- with 0
This could be yet another solution using purrr::pmap:
purrr::pmap is used for row-wise operations when applied on a data frame. It enables us to iterate over multiple arguments at the same time. So here c(...) refers to all corresponding elements of the selected variable (all except site) in each row
I think the rest of the solution is pretty clear but please let me know if I need to explain more about this.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(output = pmap(df[-1], ~ {x <- head(c(...), -1)
inds <- which(is.na(x))
req <- tail(c(...), 1) - sum(!is.na(x))
x[inds[seq_len(req)]] <- 0
x})) %>%
select(site, output, count) %>%
unnest_wider(output)
# A tibble: 5 x 5
site x1 x2 x3 count
<chr> <dbl> <dbl> <dbl> <dbl>
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2

How to substitute zeros for NAs in R [duplicate]

I have a dataframe with some numeric columns. Some row has a 0 value which should be considered as null in statistical analysis. What is the fastest way to replace all the 0 value to NULL in R?
Replacing all zeroes to NA:
df[df == 0] <- NA
Explanation
1. It is not NULL what you should want to replace zeroes with. As it says in ?'NULL',
NULL represents the null object in R
which is unique and, I guess, can be seen as the most uninformative and empty object.1 Then it becomes not so surprising that
data.frame(x = c(1, NULL, 2))
# x
# 1 1
# 2 2
That is, R does not reserve any space for this null object.2 Meanwhile, looking at ?'NA' we see that
NA is a logical constant of length 1 which contains a missing value
indicator. NA can be coerced to any other vector type except raw.
Importantly, NA is of length 1 so that R reserves some space for it. E.g.,
data.frame(x = c(1, NA, 2))
# x
# 1 1
# 2 NA
# 3 2
Also, the data frame structure requires all the columns to have the same number of elements so that there can be no "holes" (i.e., NULL values).
Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var, cov, or cor, that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.obs". Typically, however, this is unsatisfactory as it leads to extra information loss.
2. Instead of running some sort of loop, in the solution I use df == 0 vectorization. df == 0 returns (try it) a matrix of the same size as df, with the entries TRUE and FALSE. Further, we are also allowed to pass this matrix to the subsetting [...] (see ?'['). Lastly, while the result of df[df == 0] is perfectly intuitive, it may seem strange that df[df == 0] <- NA gives the desired effect. The assignment operator <- is indeed not always so smart and does not work in this way with some other objects, but it does so with data frames; see ?'<-'.
1 The empty set in the set theory feels somehow related.
2 Another similarity with the set theory: the empty set is a subset of every set, but we do not reserve any space for it.
Let me assume that your data.frame is a mix of different datatypes and not all columns need to be modified.
to modify only columns 12 to 18 (of the total 21), just do this
df[, 12:18][df[, 12:18] == 0] <- NA
dplyr::na_if() is an option:
library(dplyr)
df <- data_frame(col1 = c(1, 2, 3, 0),
col2 = c(0, 2, 3, 4),
col3 = c(1, 0, 3, 0),
col4 = c('a', 'b', 'c', 'd'))
na_if(df, 0)
# A tibble: 4 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <chr>
1 1 NA 1 a
2 2 2 NA b
3 3 3 3 c
4 NA 4 NA d
An alternative way without the [<- function:
A sample data frame dat (shamelessly copied from #Chase's answer):
dat
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
Zeroes can be replaced with NA by the is.na<- function:
is.na(dat) <- !dat
dat
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
#Sample data
set.seed(1)
dat <- data.frame(x = sample(0:2, 5, TRUE), y = sample(0:2, 5, TRUE))
#-----
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
#replace zeros with NA
dat[dat==0] <- NA
#-----
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
Because someone asked for the Data.Table version of this, and because the given data.frame solution does not work with data.table, I am providing the solution below.
Basically, use the := operator --> DT[x == 0, x := NA]
library("data.table")
status = as.data.table(occupationalStatus)
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 0
8: 8 1 0
9: 1 2 19
10: 2 2 40
status[N == 0, N := NA]
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 NA
8: 8 1 NA
9: 1 2 19
10: 2 2 40
In case anyone arrives here via google looking for the opposite (i.e. how to replace all NAs in a data.frame with 0), the answer is
df[is.na(df)] <- 0
OR
Using dplyr / tidyverse
library(dplyr)
mtcars %>% replace(is.na(.), 0)
You can replace 0 with NA only in numeric fields (i.e. excluding things like factors), but it works on a column-by-column basis:
col[col == 0 & is.numeric(col)] <- NA
With a function, you can apply this to your whole data frame:
changetoNA <- function(colnum,df) {
col <- df[,colnum]
if (is.numeric(col)) { #edit: verifying column is numeric
col[col == -1 & is.numeric(col)] <- NA
}
return(col)
}
df <- data.frame(sapply(1:5, changetoNA, df))
Although you could replace the 1:5 with the number of columns in your data frame, or with 1:ncol(df).
Here is my contribution for those who are struggling with datasets with different types of columns with multiple values representing missing data.
dat <- data_frame(numA = c(1, 0, 3, 4),
numB = c(NA, 2, 3, 4),
strC = c("0", "1.2", "NA", "2.4"),
strD = c("Yes", "Yes", "missing", "No"))
Let's say in this data we want to replace 0 in numeric columns with NA as well as 'NA' and 'missing' values in character/string values with NA. Notice that 'NA' in strC column is a character type value, not the desired NA.
dat
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 0 2 1.2 Yes
3 3 3 'NA' missing
4 4 4 2.4 No
First, an obvious case, notice that when converting a character column to numeric values any non-numeric string value is coerced to NA.
as.numeric(dat$strC)
[1] 0.0 1.2 NA 2.4
Answer with indexing:
dat[dat == "NA" | dat =="missing"] <- NA
However, do NOT use that for 0 because it changes both numeric and character 0s to NA. This is because "0" == 0 returns TRUE in R.
dplyr::na_if method:
library(dplyr)
dat %>%
lapply(na_if, y = "missing") %>%
lapply(na_if, y = "NA") %>%
lapply(na_if, y = 0) %>% # DONT DO THIS! It converts string 0s to NA as well!
data.frame()
Here we apply na_if function to each column of the data. Since na_if does not accept multiple values to be converted to NA we need to write multiple lines of code for each value to be converted into NA. However, simple usage of this function with 0 converts both the numeric and character 0s into NA. We need to do something else!
Using mutate across method with na_if function:
This is my favorite solution. Here we check the column type and apply na_if function as necessary. The character 0 is untouched, whereas all desired values are converted into NA.
dat %>%
mutate(across(where(is.numeric), ~na_if(., 0))) %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
mutate(across(where(is.character), ~na_if(., "missing")))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
Finally, nariar package can be used
nariar is a recent package that introduces a variety of replace_with_ functions.
library(naniar)
Replace all 'NA' and 'missing' values to NA:
dat %>%
replace_with_na_all(~.x %in% c("NA", "missing"))
but if you use this with 0s, it still erroneously converts the character 0 to NA:
dat %>%
replace_with_na_all(~.x %in% c(0, "NA", "missing"))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA NA Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
#strC's first element should not be NA here!
So, we have to specify column type using replace_with_na_if:
dat %>%
replace_with_na_if(is.character, ~.x %in% c("NA", "missing")) %>%
replace_with_na_if(is.numeric, ~.x %in% c(0))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
We achieved the desired outcome. I hope all this is helpful :)
If you are like me and landed here while wondering how to replace ALL values in a dataframe with NA, it's just:
df[,] <- NA
Another option is to replace all 0 with NA using mutate_all like this:
library(dplyr)
df <- data.frame(v1 = c(1,0,4,2),
v2 = c(3,1,0,0))
df
#> v1 v2
#> 1 1 3
#> 2 0 1
#> 3 4 0
#> 4 2 0
mutate_all(df, ~replace(., .==0, NA))
#> v1 v2
#> 1 1 3
#> 2 NA 1
#> 3 4 NA
#> 4 2 NA
Created on 2022-07-10 by the reprex package (v2.0.1)

Replace certain values in all tibble columns with NA [duplicate]

I have a dataframe with some numeric columns. Some row has a 0 value which should be considered as null in statistical analysis. What is the fastest way to replace all the 0 value to NULL in R?
Replacing all zeroes to NA:
df[df == 0] <- NA
Explanation
1. It is not NULL what you should want to replace zeroes with. As it says in ?'NULL',
NULL represents the null object in R
which is unique and, I guess, can be seen as the most uninformative and empty object.1 Then it becomes not so surprising that
data.frame(x = c(1, NULL, 2))
# x
# 1 1
# 2 2
That is, R does not reserve any space for this null object.2 Meanwhile, looking at ?'NA' we see that
NA is a logical constant of length 1 which contains a missing value
indicator. NA can be coerced to any other vector type except raw.
Importantly, NA is of length 1 so that R reserves some space for it. E.g.,
data.frame(x = c(1, NA, 2))
# x
# 1 1
# 2 NA
# 3 2
Also, the data frame structure requires all the columns to have the same number of elements so that there can be no "holes" (i.e., NULL values).
Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var, cov, or cor, that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.obs". Typically, however, this is unsatisfactory as it leads to extra information loss.
2. Instead of running some sort of loop, in the solution I use df == 0 vectorization. df == 0 returns (try it) a matrix of the same size as df, with the entries TRUE and FALSE. Further, we are also allowed to pass this matrix to the subsetting [...] (see ?'['). Lastly, while the result of df[df == 0] is perfectly intuitive, it may seem strange that df[df == 0] <- NA gives the desired effect. The assignment operator <- is indeed not always so smart and does not work in this way with some other objects, but it does so with data frames; see ?'<-'.
1 The empty set in the set theory feels somehow related.
2 Another similarity with the set theory: the empty set is a subset of every set, but we do not reserve any space for it.
Let me assume that your data.frame is a mix of different datatypes and not all columns need to be modified.
to modify only columns 12 to 18 (of the total 21), just do this
df[, 12:18][df[, 12:18] == 0] <- NA
dplyr::na_if() is an option:
library(dplyr)
df <- data_frame(col1 = c(1, 2, 3, 0),
col2 = c(0, 2, 3, 4),
col3 = c(1, 0, 3, 0),
col4 = c('a', 'b', 'c', 'd'))
na_if(df, 0)
# A tibble: 4 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <chr>
1 1 NA 1 a
2 2 2 NA b
3 3 3 3 c
4 NA 4 NA d
An alternative way without the [<- function:
A sample data frame dat (shamelessly copied from #Chase's answer):
dat
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
Zeroes can be replaced with NA by the is.na<- function:
is.na(dat) <- !dat
dat
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
#Sample data
set.seed(1)
dat <- data.frame(x = sample(0:2, 5, TRUE), y = sample(0:2, 5, TRUE))
#-----
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
#replace zeros with NA
dat[dat==0] <- NA
#-----
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
Because someone asked for the Data.Table version of this, and because the given data.frame solution does not work with data.table, I am providing the solution below.
Basically, use the := operator --> DT[x == 0, x := NA]
library("data.table")
status = as.data.table(occupationalStatus)
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 0
8: 8 1 0
9: 1 2 19
10: 2 2 40
status[N == 0, N := NA]
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 NA
8: 8 1 NA
9: 1 2 19
10: 2 2 40
In case anyone arrives here via google looking for the opposite (i.e. how to replace all NAs in a data.frame with 0), the answer is
df[is.na(df)] <- 0
OR
Using dplyr / tidyverse
library(dplyr)
mtcars %>% replace(is.na(.), 0)
You can replace 0 with NA only in numeric fields (i.e. excluding things like factors), but it works on a column-by-column basis:
col[col == 0 & is.numeric(col)] <- NA
With a function, you can apply this to your whole data frame:
changetoNA <- function(colnum,df) {
col <- df[,colnum]
if (is.numeric(col)) { #edit: verifying column is numeric
col[col == -1 & is.numeric(col)] <- NA
}
return(col)
}
df <- data.frame(sapply(1:5, changetoNA, df))
Although you could replace the 1:5 with the number of columns in your data frame, or with 1:ncol(df).
Here is my contribution for those who are struggling with datasets with different types of columns with multiple values representing missing data.
dat <- data_frame(numA = c(1, 0, 3, 4),
numB = c(NA, 2, 3, 4),
strC = c("0", "1.2", "NA", "2.4"),
strD = c("Yes", "Yes", "missing", "No"))
Let's say in this data we want to replace 0 in numeric columns with NA as well as 'NA' and 'missing' values in character/string values with NA. Notice that 'NA' in strC column is a character type value, not the desired NA.
dat
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 0 2 1.2 Yes
3 3 3 'NA' missing
4 4 4 2.4 No
First, an obvious case, notice that when converting a character column to numeric values any non-numeric string value is coerced to NA.
as.numeric(dat$strC)
[1] 0.0 1.2 NA 2.4
Answer with indexing:
dat[dat == "NA" | dat =="missing"] <- NA
However, do NOT use that for 0 because it changes both numeric and character 0s to NA. This is because "0" == 0 returns TRUE in R.
dplyr::na_if method:
library(dplyr)
dat %>%
lapply(na_if, y = "missing") %>%
lapply(na_if, y = "NA") %>%
lapply(na_if, y = 0) %>% # DONT DO THIS! It converts string 0s to NA as well!
data.frame()
Here we apply na_if function to each column of the data. Since na_if does not accept multiple values to be converted to NA we need to write multiple lines of code for each value to be converted into NA. However, simple usage of this function with 0 converts both the numeric and character 0s into NA. We need to do something else!
Using mutate across method with na_if function:
This is my favorite solution. Here we check the column type and apply na_if function as necessary. The character 0 is untouched, whereas all desired values are converted into NA.
dat %>%
mutate(across(where(is.numeric), ~na_if(., 0))) %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
mutate(across(where(is.character), ~na_if(., "missing")))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
Finally, nariar package can be used
nariar is a recent package that introduces a variety of replace_with_ functions.
library(naniar)
Replace all 'NA' and 'missing' values to NA:
dat %>%
replace_with_na_all(~.x %in% c("NA", "missing"))
but if you use this with 0s, it still erroneously converts the character 0 to NA:
dat %>%
replace_with_na_all(~.x %in% c(0, "NA", "missing"))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA NA Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
#strC's first element should not be NA here!
So, we have to specify column type using replace_with_na_if:
dat %>%
replace_with_na_if(is.character, ~.x %in% c("NA", "missing")) %>%
replace_with_na_if(is.numeric, ~.x %in% c(0))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
We achieved the desired outcome. I hope all this is helpful :)
If you are like me and landed here while wondering how to replace ALL values in a dataframe with NA, it's just:
df[,] <- NA
Another option is to replace all 0 with NA using mutate_all like this:
library(dplyr)
df <- data.frame(v1 = c(1,0,4,2),
v2 = c(3,1,0,0))
df
#> v1 v2
#> 1 1 3
#> 2 0 1
#> 3 4 0
#> 4 2 0
mutate_all(df, ~replace(., .==0, NA))
#> v1 v2
#> 1 1 3
#> 2 NA 1
#> 3 4 NA
#> 4 2 NA
Created on 2022-07-10 by the reprex package (v2.0.1)

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replcae NAs in column v1 with previous Non NA if the next Non NA matches the previous one. That been said, I want the result to be like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you may see, rows 2 and 3 are replaced with number "1" because row 1 and 4 had an identical number but rows 5,6 stays the same because the non na values in rows 4 and 7 are not identical. I have been twicking a lot but so far no luck. Thanks
Here is an idea using zoo package. We basically fill NAs in both directions and set NA the values that are not equal between those directions.
library(zoo)
ind1 <- na.locf(df$v1, fromLast = TRUE)
df$v1 <- na.locf(df$v1)
df$v1[df$v1 != ind1] <- NA
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in tidyverse using fill
library(tidyverse)
df1 %>%
mutate(vNew = v1) %>%
fill(vNew, .direction = 'up') %>%
fill(v1) %>%
mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution, the logic is almost the same as Sotos's one:
replace_na <- function(x){
f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
y <- f(x)
yp <- rev(f(rev(x)))
ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I could use na.locf function to do so. Basically, I use the normal na.locf function package zoo to replace each NA with the latest previous non NA and store the data in a column. by using the same function but fixing fromlast=TRUE NAs are replaces with the first next nonNA and store them in another column. I checked these two columns and if the results in each row for these two columns are not matching I replace them with NA.

Replace all 0 values to NA

I have a dataframe with some numeric columns. Some row has a 0 value which should be considered as null in statistical analysis. What is the fastest way to replace all the 0 value to NULL in R?
Replacing all zeroes to NA:
df[df == 0] <- NA
Explanation
1. It is not NULL what you should want to replace zeroes with. As it says in ?'NULL',
NULL represents the null object in R
which is unique and, I guess, can be seen as the most uninformative and empty object.1 Then it becomes not so surprising that
data.frame(x = c(1, NULL, 2))
# x
# 1 1
# 2 2
That is, R does not reserve any space for this null object.2 Meanwhile, looking at ?'NA' we see that
NA is a logical constant of length 1 which contains a missing value
indicator. NA can be coerced to any other vector type except raw.
Importantly, NA is of length 1 so that R reserves some space for it. E.g.,
data.frame(x = c(1, NA, 2))
# x
# 1 1
# 2 NA
# 3 2
Also, the data frame structure requires all the columns to have the same number of elements so that there can be no "holes" (i.e., NULL values).
Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var, cov, or cor, that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.obs". Typically, however, this is unsatisfactory as it leads to extra information loss.
2. Instead of running some sort of loop, in the solution I use df == 0 vectorization. df == 0 returns (try it) a matrix of the same size as df, with the entries TRUE and FALSE. Further, we are also allowed to pass this matrix to the subsetting [...] (see ?'['). Lastly, while the result of df[df == 0] is perfectly intuitive, it may seem strange that df[df == 0] <- NA gives the desired effect. The assignment operator <- is indeed not always so smart and does not work in this way with some other objects, but it does so with data frames; see ?'<-'.
1 The empty set in the set theory feels somehow related.
2 Another similarity with the set theory: the empty set is a subset of every set, but we do not reserve any space for it.
Let me assume that your data.frame is a mix of different datatypes and not all columns need to be modified.
to modify only columns 12 to 18 (of the total 21), just do this
df[, 12:18][df[, 12:18] == 0] <- NA
dplyr::na_if() is an option:
library(dplyr)
df <- data_frame(col1 = c(1, 2, 3, 0),
col2 = c(0, 2, 3, 4),
col3 = c(1, 0, 3, 0),
col4 = c('a', 'b', 'c', 'd'))
na_if(df, 0)
# A tibble: 4 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <chr>
1 1 NA 1 a
2 2 2 NA b
3 3 3 3 c
4 NA 4 NA d
An alternative way without the [<- function:
A sample data frame dat (shamelessly copied from #Chase's answer):
dat
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
Zeroes can be replaced with NA by the is.na<- function:
is.na(dat) <- !dat
dat
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
#Sample data
set.seed(1)
dat <- data.frame(x = sample(0:2, 5, TRUE), y = sample(0:2, 5, TRUE))
#-----
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
#replace zeros with NA
dat[dat==0] <- NA
#-----
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
Because someone asked for the Data.Table version of this, and because the given data.frame solution does not work with data.table, I am providing the solution below.
Basically, use the := operator --> DT[x == 0, x := NA]
library("data.table")
status = as.data.table(occupationalStatus)
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 0
8: 8 1 0
9: 1 2 19
10: 2 2 40
status[N == 0, N := NA]
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 NA
8: 8 1 NA
9: 1 2 19
10: 2 2 40
In case anyone arrives here via google looking for the opposite (i.e. how to replace all NAs in a data.frame with 0), the answer is
df[is.na(df)] <- 0
OR
Using dplyr / tidyverse
library(dplyr)
mtcars %>% replace(is.na(.), 0)
You can replace 0 with NA only in numeric fields (i.e. excluding things like factors), but it works on a column-by-column basis:
col[col == 0 & is.numeric(col)] <- NA
With a function, you can apply this to your whole data frame:
changetoNA <- function(colnum,df) {
col <- df[,colnum]
if (is.numeric(col)) { #edit: verifying column is numeric
col[col == -1 & is.numeric(col)] <- NA
}
return(col)
}
df <- data.frame(sapply(1:5, changetoNA, df))
Although you could replace the 1:5 with the number of columns in your data frame, or with 1:ncol(df).
Here is my contribution for those who are struggling with datasets with different types of columns with multiple values representing missing data.
dat <- data_frame(numA = c(1, 0, 3, 4),
numB = c(NA, 2, 3, 4),
strC = c("0", "1.2", "NA", "2.4"),
strD = c("Yes", "Yes", "missing", "No"))
Let's say in this data we want to replace 0 in numeric columns with NA as well as 'NA' and 'missing' values in character/string values with NA. Notice that 'NA' in strC column is a character type value, not the desired NA.
dat
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 0 2 1.2 Yes
3 3 3 'NA' missing
4 4 4 2.4 No
First, an obvious case, notice that when converting a character column to numeric values any non-numeric string value is coerced to NA.
as.numeric(dat$strC)
[1] 0.0 1.2 NA 2.4
Answer with indexing:
dat[dat == "NA" | dat =="missing"] <- NA
However, do NOT use that for 0 because it changes both numeric and character 0s to NA. This is because "0" == 0 returns TRUE in R.
dplyr::na_if method:
library(dplyr)
dat %>%
lapply(na_if, y = "missing") %>%
lapply(na_if, y = "NA") %>%
lapply(na_if, y = 0) %>% # DONT DO THIS! It converts string 0s to NA as well!
data.frame()
Here we apply na_if function to each column of the data. Since na_if does not accept multiple values to be converted to NA we need to write multiple lines of code for each value to be converted into NA. However, simple usage of this function with 0 converts both the numeric and character 0s into NA. We need to do something else!
Using mutate across method with na_if function:
This is my favorite solution. Here we check the column type and apply na_if function as necessary. The character 0 is untouched, whereas all desired values are converted into NA.
dat %>%
mutate(across(where(is.numeric), ~na_if(., 0))) %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
mutate(across(where(is.character), ~na_if(., "missing")))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
Finally, nariar package can be used
nariar is a recent package that introduces a variety of replace_with_ functions.
library(naniar)
Replace all 'NA' and 'missing' values to NA:
dat %>%
replace_with_na_all(~.x %in% c("NA", "missing"))
but if you use this with 0s, it still erroneously converts the character 0 to NA:
dat %>%
replace_with_na_all(~.x %in% c(0, "NA", "missing"))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA NA Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
#strC's first element should not be NA here!
So, we have to specify column type using replace_with_na_if:
dat %>%
replace_with_na_if(is.character, ~.x %in% c("NA", "missing")) %>%
replace_with_na_if(is.numeric, ~.x %in% c(0))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
We achieved the desired outcome. I hope all this is helpful :)
If you are like me and landed here while wondering how to replace ALL values in a dataframe with NA, it's just:
df[,] <- NA
Another option is to replace all 0 with NA using mutate_all like this:
library(dplyr)
df <- data.frame(v1 = c(1,0,4,2),
v2 = c(3,1,0,0))
df
#> v1 v2
#> 1 1 3
#> 2 0 1
#> 3 4 0
#> 4 2 0
mutate_all(df, ~replace(., .==0, NA))
#> v1 v2
#> 1 1 3
#> 2 NA 1
#> 3 4 NA
#> 4 2 NA
Created on 2022-07-10 by the reprex package (v2.0.1)

Resources