How to change NA into 0 based on other variable / how many times it was recorded - r

I am still new to R and need help. I want to change the NA value in variables x1,x2,x3 to 0 based on the value of count. Count specifies the number of observations, and the x1,x2,x3 stand for the visit to the site (or replication). The value in each 'X' variable is the number of species found. However, not all sites were visited 3 times. The variable count is telling us how many times the site was actually visited. I want to identify the actual NA and real 0 (which means no species found). I want to change the NA into 0 if the site is actually visited and keep it NA if the site is not visited. For example from the dummy data, 'zhask' site is visited 2 times, then the NA in x1 of zhask needs to be replaced with 0.
This is the dummy data:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask NA 1 NA 2
3 balmond 3 NA 2 3
4 layla NA 1 NA 2
5 angela NA 3 NA 2
So the table needs to be changed into:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
I've tried many things and try to make my own function, however, it is not working:
for(i in 1:nrow(df))
{
if( is.na(df$x1[i]) && (i < df$count[i]))
{df$x1[i]=0}
else
{df$x1[i]=df$x1[i]}
}
this is the script for the dummy dataframe:
x1= c(1,NA,3, NA, NA)
x2= c(2,1, NA, 1, 3)
x3 = c(1, NA, 2, NA, NA)
count=c(3,2,3,2,2)
site=c("miya", "zhask", "balmond", "layla", "angela")
df=data.frame(site,x1,x2,x3,count)
Any help will be very much appreciated!

One way would be to apply a function over all of your visit columns (x1, x2, x3). Here's a way to do that.
cols <- c("x1", "x2", "x3")
df[, cols] <- mapply(function(col, idx, count) {
ifelse(idx <=count & is.na(col), 0, col)
}, df[,cols], seq_along(cols), MoreArgs=list(count=df$count))
# site x1 x2 x3 count
# 1 miya 1 2 1 3
# 2 zhask 0 1 NA 2
# 3 balmond 3 0 2 3
# 4 layla 0 1 NA 2
# 5 angela 0 3 NA 2
We use mapply to iterate over the columns together with each column's index. The count vector is the same for every column, so it goes in the MoreArgs= parameter. Because every result has the same length, mapply simplifies the results into a matrix with one column per visit, and we assign that back to the selected columns.
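For comparison with the loop in the question, which compared the row index i against count, here is a minimal base R sketch that loops over the visit columns instead. The helper name fill_visited_na is made up for illustration, and it assumes the columns in cols are listed in visit order.

```r
# Hypothetical helper (not from the answers above): set NA to 0 only for
# visits that actually took place, i.e. column index <= count.
fill_visited_na <- function(data, cols, count_col = "count") {
  for (j in seq_along(cols)) {
    visited <- j <= data[[count_col]]            # was visit j made at each site?
    fill <- visited & is.na(data[[cols[j]]])     # visited, but nothing recorded
    data[[cols[j]]][fill] <- 0
  }
  data
}

df <- data.frame(
  site  = c("miya", "zhask", "balmond", "layla", "angela"),
  x1    = c(1, NA, 3, NA, NA),
  x2    = c(2, 1, NA, 1, 3),
  x3    = c(1, NA, 2, NA, NA),
  count = c(3, 2, 3, 2, 2)
)
res <- fill_visited_na(df, c("x1", "x2", "x3"))
```

The key difference from the original loop is that the comparison uses the column position j, not the row position i.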
If you wanted to use dplyr, that might look more like
library(dplyr)
cols <- c("x1"=1, "x2"=2, "x3"=3)
df %>%
mutate(across(starts_with("x"), ~if_else(cols[cur_column()] <= count & is.na(.x), 0, .x)))
I used the cols vector to get the index of the column which doesn't seem to be otherwise available when using across().
But a more "tidy" way to tackle this problem would be to pivot your data first to a "tidy" format. Then you can clean the data more easily and pivot back if necessary
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols=starts_with("x")) %>%
mutate(index=readr::parse_number(name)) %>%
mutate(value=if_else(index <= count & is.na(value), 0, value)) %>%
select(-index) %>%
pivot_wider(names_from=name, values_from=value)
# site count x1 x2 x3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 miya 3 1 2 1
# 2 zhask 2 0 1 NA
# 3 balmond 3 3 0 2
# 4 layla 2 0 1 NA
# 5 angela 2 0 3 NA

Via some indexing of the columns:
vars <- c("x1","x2","x3")
df[vars][is.na(df[vars]) & (col(df[vars]) <= df$count)] <- 0
# site x1 x2 x3 count
#1 miya 1 2 1 3
#2 zhask 0 1 NA 2
#3 balmond 3 0 2 3
#4 layla 0 1 NA 2
#5 angela 0 3 NA 2
Essentially this is:
selecting the variables/columns and storing in vars
flagging the NA cells within those variables with is.na(df[vars])
col(df[vars]) returns a column number for every cell, which can be checked against df$count in the corresponding row (the cell counts as visited when its column number is less than or equal to count)
the values meeting both the above criteria are overwritten <- with 0
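To make the steps above concrete, here is a small sketch on a two-row toy data frame. The object names small, count, col_idx and visited are illustrative only.

```r
# Toy data: two sites, three possible visits
small <- data.frame(x1 = c(NA, 3), x2 = c(1, NA), x3 = c(NA, 2))
count <- c(2, 3)               # site 1 visited twice, site 2 three times

col_idx <- col(small)          # integer matrix: each cell holds its column number
visited <- col_idx <= count    # count is recycled down each column, so it lines up with rows

# Overwrite only cells that are NA and belong to a visit that happened
small[is.na(small) & visited] <- 0
```

The recycling works here because count has exactly one entry per row.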

This could be yet another solution using purrr::pmap:
purrr::pmap is used for row-wise operations when applied on a data frame. It enables us to iterate over multiple arguments at the same time. So here c(...) refers to all corresponding elements of the selected variable (all except site) in each row
I think the rest of the solution is pretty clear but please let me know if I need to explain more about this.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(output = pmap(df[-1], ~ {x <- head(c(...), -1)
inds <- which(is.na(x))
req <- tail(c(...), 1) - sum(!is.na(x))
x[inds[seq_len(req)]] <- 0
x})) %>%
select(site, output, count) %>%
unnest_wider(output)
# A tibble: 5 x 5
site x1 x2 x3 count
<chr> <dbl> <dbl> <dbl> <dbl>
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2

Related

Creating a count variable for NA cases in data frame

I have an R data frame including a few columns of numerical data with NA values too. See the example with first 2 columns below. I want to create a new column (3rd one below called output) which shows an incremental count of NA values for each of my group variables. For example, region A has 2 NA values so it will show 1 and 2 next to the relevant rows. Region B has only one NA value so will show 1 next to it. If a region X has 10 NA values it should show 1,2,3 ... , 10 next to each case, as move down the data frame.
Region
Value
Output
Region A
5
0
Region B
2
0
Region B
NA
1
Region A
NA
1
Region A
9
0
Region A
NA
2
Region A
4
0
I am familiar with dplyr so happy to see a solution around it. Ideally i don't want to use a for loop, but could do if the best solution. In my example above i used zero values for my non-NA cases, that can be anything, doesn't have to be 0.
thanks! :)
You can use cumsum to count up the NA values within each group. The ifelse assigns these running counts only to the NA rows; all other rows get 0 in the output.
library(dplyr)
df %>%
group_by(Region) %>%
mutate(Output = ifelse(is.na(Value), cumsum(is.na(Value)), 0))
Output
Region Value Output
<chr> <int> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0
You could create a new column flagging is.na(Value), then group by region and the flag, and then use cumsum() to create your desired output
df%>%mutate(output=ifelse(!is.na(Value), 0, 1))%>%group_by(Region, output)%>%mutate(output=cumsum(output))
# A tibble: 7 x 3
# Groups: Region, output [5]
Region Value output
<fct> <dbl> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0
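If you would rather avoid dplyr, the same running count can be sketched in base R with ave(), which applies a function within groups. This is only an illustration of the idea; the names na_flag and running are made up.

```r
df <- data.frame(
  Region = c("A", "B", "B", "A", "A", "A", "A"),
  Value  = c(5, 2, NA, NA, 9, NA, 4)
)

na_flag <- is.na(df$Value)
# Running total of NA values within each Region, in row order
running <- ave(as.integer(na_flag), df$Region, FUN = cumsum)
# Keep the running count on NA rows only, 0 elsewhere
df$Output <- ifelse(na_flag, running, 0L)
```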

create new column (with outcome min or NA) from multiple selected columns

My data has many columns and subjects, but to illustrate it simpler, lets say I have 7 subjects with 3 variables/columns called x1, x2 and x3 (values range from 1 to 3 and NAs). In the analysis that I want it is important I actually call the columns I want to use (since I cannot just use the whole dataframe in my analysis because there are more variables/columns there)
data <- data.frame(id=c(1,2,3,4,5,6,7), x1=c(1,2,2,NA,3,3,1), x2=c(NA,3,1,NA,2,3,2), x3=c(NA,2,NA,NA,3,NA,1))
id x1 x2 x3
1 1 NA NA
2 2 3 2
3 2 1 NA
4 NA NA NA
5 3 2 3
6 3 3 NA
7 1 2 1
The class of x1 x2 and x3 are numeric.
Out of that, I want to create a variable/column called ‘x4’ that:
- gives me the lowest number in each row across x1, x2 and x3.
- If there is an NA in a row of x1, x2, x3, the NA shall be ignored.
- If they are ALL NAs, however, I want the outcome to be NA (NOT Inf, which is what my code gives now).
- If two lowest numbers are the same, just display either one of them. So like this:
data <- data.frame(id=c(1,2,3,4,5,6,7), x1=c(1,2,2,NA,3,3,1), x2=c(NA,3,1,NA,2,3,2), x3=c(NA,2,NA,NA,3,NA,1), x4=c(1,2,1,NA,2,3,1))
id x1 x2 x3 x4
1 1 NA NA 1
2 2 3 2 2
3 2 1 NA 1
4 NA NA NA NA
5 3 2 3 2
6 3 3 NA 3
7 1 2 1 1
I managed to find a very similar question, and I can mostly make it work: min for each row with dataframe in R
data$x4 <- apply(data[, c("x1","x2","x3")],1, FUN=min, na.rm = TRUE)
the problem I have now is that in case of all NAs (so id number 4), my outcome is not NA, but it is 'Inf'.
Question 1:How can I make it so it becomes an NA instead of Inf? I can of course do that afterwards like this:
is.na(data$x4) <- sapply(data$x4, is.infinite)
But I wonder if there is a nice way to do that already with/inside the previous code?
Also, rather than using apply with FUN = min, I would like to try code like the line below. Question 2: is using this other code possible?
data$x4 <- min(data[, c("x1","x2","x3")],1 , na.rm = TRUE)
With this, x4 is '1' in every row. I guess it just shows the lowest number (1) of the whole selection? I don't understand why. I am already using ',1' but it doesn't help.
I hope somebody can help me(r and stackoverflow newbie) out, thanks!
You are looking for the pmin function, which returns the parallel (element-wise) minima of the input vectors. Below are two approaches using pmin (df here is the question's data frame):
df$minIget <- do.call(pmin, c(df[,-1], na.rm = TRUE)) # Approach 1: using do.call
df %>% rowwise() %>% mutate(minIget = pmin(x1, x2, x3, na.rm = TRUE)) # Approach 2: using tidyverse.
output:
A tibble: 7 x 5
# Rowwise:
id x1 x2 x3 minIget
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA 1
2 2 2 3 2 2
3 3 2 1 NA 1
4 4 NA NA NA NA
5 5 3 2 3 2
6 6 3 3 NA 3
7 7 1 2 1 1
You can test if all are NA before you call min like:
apply(data[, c("x1","x2","x3")], 1, function(x)
if(all(is.na(x))) NA else min(x, na.rm=TRUE))
#[1] 1 2 1 NA 2 3 1
min(data[, c("x1","x2","x3")],1 , na.rm = TRUE) gives you the minimum of 1 and data[, c("x1","x2","x3")].
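A short base R sketch contrasting the two calls may help here. It assumes the question's data frame; the names overall and row_min are illustrative.

```r
data <- data.frame(
  id = 1:7,
  x1 = c(1, 2, 2, NA, 3, 3, 1),
  x2 = c(NA, 3, 1, NA, 2, 3, 2),
  x3 = c(NA, 2, NA, NA, 3, NA, 1)
)
cols <- c("x1", "x2", "x3")

# min() pools everything it is given into one set, so the extra 1 is just
# another value: the result is a single number, recycled into every row.
overall <- min(data[, cols], 1, na.rm = TRUE)

# pmin() compares the columns element by element, giving one minimum per row;
# a row that is entirely NA stays NA even with na.rm = TRUE.
row_min <- do.call(pmin, c(data[cols], na.rm = TRUE))
```

So the ",1" in the min() call is not a margin argument as it is in apply(); it is treated as data.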

Conditionally set value in previous row, within group

I have a data frame "df" like this, grouped by "nest."
nest laid stage
1 NA 2
1 5 4
1 -10 NA
2 NA 1
2 3 1
2 -8 NA
I want to make a condition so that if "laid" is > 0, the "stage" of that nest at the previous visit is set to 0. If "laid" is not greater than 0, I want no change in "stage".
Desired outcome:
nest laid stage
1 NA 0
1 5 4
1 -10 NA
2 NA 0
2 3 1
2 -8 NA
I've tried different versions of code below (dplyr and tidyr), with various errors:
df1 <- df %>%
group_by(nest) %>%
mutate(stage, if(laid > 0){stage = 0}) %>%
fill(stage, .direction = "up")
I've gone over similar questions, but they all use ifelse. Any tips are much appreciated!
You can use if_else (or ifelse if you are not certain of the column data types), which is a vectorized version of if/else; To check the next laid value, use lead:
df %>%
group_by(nest) %>%
mutate(stage = if_else(lead(laid) > 0, 0L, stage))
# A tibble: 6 x 3
# Groups: nest [2]
# nest laid stage
# <int> <int> <int>
#1 1 NA 0
#2 1 5 4
#3 1 -10 NA
#4 2 NA 0
#5 2 3 1
#6 2 -8 NA
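The same idea can be sketched in base R, assuming the rows are already in visit order within each nest. The names next_laid and hit are illustrative; the explicit !is.na() check means rows with no following visit are left unchanged.

```r
df <- data.frame(
  nest  = c(1, 1, 1, 2, 2, 2),
  laid  = c(NA, 5, -10, NA, 3, -8),
  stage = c(2, 4, NA, 1, 1, NA)
)

# For each row, the value of laid at the next visit to the same nest
next_laid <- ave(df$laid, df$nest, FUN = function(v) c(v[-1], NA))

# Rows whose following visit has laid > 0; NA comparisons count as "no"
hit <- !is.na(next_laid) & next_laid > 0
df$stage[hit] <- 0
```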

Count occurrences of value in a set of variables in R (per row)

Let's say I have a data frame with 10 numeric variables V1-V10 (columns) and multiple rows (cases).
What I would like R to do is: For each case, give me the number of occurrences of a certain value in a set of variables.
For example the number of occurrences of the numeric value 99 in that single row for V2, V3, V6, which obviously has a minimum of 0 (none of the three have the value 99) and a maximum of 3 (all of the three have the value 99).
I am really looking for an equivalent to the SPSS function COUNT: "COUNT creates a numeric variable that, for each case, counts the occurrences of the same value (or list of values) across a list of variables."
I thought about table() and library plyr's count(), but I cannot really figure it out. Vectorized computation preferred. Thanks a lot!
If you need to count a particular word or letter in each row:
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,"L"), V2=c(1,"L",2,2,"L"),
V3=c(1,2,2,1,"L"), V4=c("L","L",1,2,"L"))
To count the number of "L" values in each row, use
#This is how to compute a new variable counting occurrences of "L" in V1-V4.
df$count.L <- apply(df, 1, function(x) length(which(x=="L")))
The result looks like this:
> df
V1 V2 V3 V4 count.L
1 1 1 1 L 1
2 1 L 2 L 2
3 2 2 2 1 0
4 1 2 1 2 0
5 L L L L 4
I think that there ought to be a simpler way to do this, but the best way that I can think of to get a table of counts is to loop (implicitly using sapply) over the unique values in the dataframe.
#Some example data
df <- data.frame(a=c(1,1,2,2,3,9),b=c(1,2,3,2,3,1))
df
# a b
#1 1 1
#2 1 2
#3 2 3
#4 2 2
#5 3 3
#6 9 1
levels=unique(do.call(c,df)) #all unique values in df
out <- sapply(levels,function(x)rowSums(df==x)) #count occurrences of x in each row
colnames(out) <- levels
out
# 1 2 3 9
#[1,] 2 0 0 0
#[2,] 1 1 0 0
#[3,] 0 1 1 0
#[4,] 0 2 0 0
#[5,] 0 0 2 0
#[6,] 1 0 0 1
Try
apply(df,MARGIN=1,table)
Where df is your data.frame. This will return a list of the same length of the amount of rows in your data.frame. Each item of the list corresponds to a row of the data.frame (in the same order), and it is a table where the content is the number of occurrences and the names are the corresponding values.
For instance:
df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
#create a data.frame containing some data
df #show the data.frame
V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
[[1]]
10 20
1 2
[[2]]
10 20 30
1 1 1
[[3]]
10 20
1 2
[[4]]
10 20 30
1 1 1
#desired result
Here is another straightforward solution that comes closest to what the COUNT command in SPSS does — creating a new variable that, for each case (i.e., row) counts the occurrences of a given value or list of values across a list of variables.
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
#This is how to compute a new variable counting occurrences of value "1" in V1-V4.
df$count.1 <- apply(df, 1, function(x) length(which(x==1)))
The updated data frame contains the new variable count.1 exactly as the SPSS COUNT command would do.
> df
V1 V2 V3 V4 count.1
1 1 1 1 NA 3
2 1 NA 2 NA 1
3 2 2 2 1 1
4 1 2 1 2 2
5 NA NA NA NA 0
You can do the same to count how many times the value "2" occurs per row in V1-V4. Note that you need to select the columns (variables) in df to which the function is applied, because df now also contains count.1.
df$count.2 <- apply(df[1:4], 1, function(x) length(which(x==2)))
You can also apply a similar logic to count the number of missing values in V1-V4.
df$count.na <- apply(df[1:4], 1, function(x) sum(is.na(x)))
The final result should be exactly what you wanted:
> df
V1 V2 V3 V4 count.1 count.2 count.na
1 1 1 1 NA 3 0 1
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 0
4 1 2 1 2 2 2 0
5 NA NA NA NA 0 0 4
This solution can easily be generalized to a range of values.
Suppose we want to count how many times a value of 1 or 2 occurs in V1-V4 per row:
df$count.1or2 <- apply(df[1:4], 1, function(x) sum(x %in% c(1,2)))
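Since the question asked for a vectorized computation, the apply() calls above can also be sketched with rowSums() on a logical matrix. This is an alternative formulation, not part of the original answer; vars is an illustrative name.

```r
df <- data.frame(V1 = c(1, 1, 2, 1, NA), V2 = c(1, NA, 2, 2, NA),
                 V3 = c(1, 2, 2, 1, NA), V4 = c(NA, NA, 1, 2, NA))
vars <- c("V1", "V2", "V3", "V4")

# df[vars] == 1 is a logical matrix; rowSums counts the TRUE cells per row
df$count.1 <- rowSums(df[vars] == 1, na.rm = TRUE)

# %in% never returns NA, so no na.rm is needed for a list of values
df$count.1or2 <- rowSums(sapply(df[vars], `%in%`, c(1, 2)))

# Missing values per row
df$count.na <- rowSums(is.na(df[vars]))
```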
A solution with functions from the dplyr package would be the following:
Using the example data set from LechAttacks answer:
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
Count the appearances of "1" and "2" each and both combined:
df %>%
rowwise() %>%
mutate(count_1 = sum(c_across(V1:V4) == 1, na.rm = TRUE),
count_2 = sum(c_across(V1:V4) == 2, na.rm = TRUE),
count_12 = sum(c_across(V1:V4) %in% 1:2, na.rm = TRUE)) %>%
ungroup()
which gives the table:
V1 V2 V3 V4 count_1 count_2 count_12
1 1 1 1 NA 3 0 3
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 4
4 1 2 1 2 2 2 4
5 NA NA NA NA 0 0 0
My attempt at something similar to SPSS's COUNT in R is as follows. Note that rowSums() here adds up the values in the selected columns; it does not count occurrences of one particular value.
library(dplyr)
df <- data.frame(a=c(1,1,NA,2,3,9),b=c(1,2,3,2,NA,1)) #Dummy data with NAs
df %>%
dplyr::mutate(count = rowSums( #this calculates the sum across rows
dplyr::select(., #slicing on .
dplyr::one_of( #within select, use one_of to specify which columns you want
c('a','b'))), na.rm = T)) #once the columns are specified, that's all you need; na.rm is the cherry on top
This is what the output looks like:
>df
a b count
1 1 1 2
2 1 2 3
3 NA 3 3
4 2 2 4
5 3 NA 3
6 9 1 10
Hope it helps :-)

Replace all 0 values to NA

I have a dataframe with some numeric columns. Some row has a 0 value which should be considered as null in statistical analysis. What is the fastest way to replace all the 0 value to NULL in R?
Replacing all zeroes to NA:
df[df == 0] <- NA
Explanation
1. It is not NULL what you should want to replace zeroes with. As it says in ?'NULL',
NULL represents the null object in R
which is unique and, I guess, can be seen as the most uninformative and empty object.1 Then it becomes not so surprising that
data.frame(x = c(1, NULL, 2))
# x
# 1 1
# 2 2
That is, R does not reserve any space for this null object.2 Meanwhile, looking at ?'NA' we see that
NA is a logical constant of length 1 which contains a missing value
indicator. NA can be coerced to any other vector type except raw.
Importantly, NA is of length 1 so that R reserves some space for it. E.g.,
data.frame(x = c(1, NA, 2))
# x
# 1 1
# 2 NA
# 3 2
Also, the data frame structure requires all the columns to have the same number of elements so that there can be no "holes" (i.e., NULL values).
Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var, cov, or cor, that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.obs". Typically, however, this is unsatisfactory as it leads to extra information loss.
2. Instead of running some sort of loop, in the solution I use df == 0 vectorization. df == 0 returns (try it) a matrix of the same size as df, with the entries TRUE and FALSE. Further, we are also allowed to pass this matrix to the subsetting [...] (see ?'['). Lastly, while the result of df[df == 0] is perfectly intuitive, it may seem strange that df[df == 0] <- NA gives the desired effect. The assignment operator <- is indeed not always so smart and does not work in this way with some other objects, but it does so with data frames; see ?'<-'.
1 The empty set in the set theory feels somehow related.
2 Another similarity with the set theory: the empty set is a subset of every set, but we do not reserve any space for it.
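To see the vectorization described in point 2 step by step, here is a tiny sketch on made-up data; mask is an illustrative name.

```r
df <- data.frame(a = c(0, 1, 2), b = c(3, 0, 0))

mask <- df == 0   # logical matrix with the same shape as df
df[mask] <- NA    # assignment through the matrix replaces only the TRUE cells
```

Inspecting mask before the assignment shows exactly which cells will be changed.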
Let me assume that your data.frame is a mix of different datatypes and not all columns need to be modified.
to modify only columns 12 to 18 (of the total 21), just do this
df[, 12:18][df[, 12:18] == 0] <- NA
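dplyr::na_if() is an option. Older dplyr versions accepted a whole data frame, as in na_if(df, 0); since dplyr 1.1.0 na_if() expects a vector, so it is applied per numeric column here:
library(dplyr)
df <- tibble(col1 = c(1, 2, 3, 0),
col2 = c(0, 2, 3, 4),
col3 = c(1, 0, 3, 0),
col4 = c('a', 'b', 'c', 'd'))
df %>% mutate(across(where(is.numeric), ~ na_if(.x, 0)))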
# A tibble: 4 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <chr>
1 1 NA 1 a
2 2 2 NA b
3 3 3 3 c
4 NA 4 NA d
An alternative way without the [<- function:
A sample data frame dat (shamelessly copied from @Chase's answer):
dat
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
Zeroes can be replaced with NA by the is.na<- function:
is.na(dat) <- !dat
dat
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
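Why this works: applying ! to a numeric data frame treats 0 as FALSE and any other number as TRUE, so !dat is TRUE exactly where the value is 0. A minimal sketch, with the data typed in by hand and mask as an illustrative name:

```r
dat <- data.frame(x = c(0, 1, 1, 2, 0), y = c(2, 2, 1, 1, 0))

mask <- !dat          # logical matrix: TRUE where the cell equals 0
is.na(dat) <- mask    # sets the flagged cells to NA
```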
#Sample data
set.seed(1)
dat <- data.frame(x = sample(0:2, 5, TRUE), y = sample(0:2, 5, TRUE))
#-----
x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0
#replace zeros with NA
dat[dat==0] <- NA
#-----
x y
1 NA 2
2 1 2
3 1 1
4 2 1
5 NA NA
Because someone asked for the Data.Table version of this, and because the given data.frame solution does not work with data.table, I am providing the solution below.
Basically, use the := operator --> DT[x == 0, x := NA]
library("data.table")
status = as.data.table(occupationalStatus)
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 0
8: 8 1 0
9: 1 2 19
10: 2 2 40
status[N == 0, N := NA]
head(status, 10)
origin destination N
1: 1 1 50
2: 2 1 16
3: 3 1 12
4: 4 1 11
5: 5 1 2
6: 6 1 12
7: 7 1 NA
8: 8 1 NA
9: 1 2 19
10: 2 2 40
In case anyone arrives here via google looking for the opposite (i.e. how to replace all NAs in a data.frame with 0), the answer is
df[is.na(df)] <- 0
OR
Using dplyr / tidyverse
library(dplyr)
mtcars %>% replace(is.na(.), 0)
You can replace 0 with NA only in numeric fields (i.e. excluding things like factors), but it works on a column-by-column basis:
col[col == 0 & is.numeric(col)] <- NA
With a function, you can apply this to your whole data frame:
changetoNA <- function(colnum, df) {
col <- df[, colnum]
if (is.numeric(col)) { # only touch numeric columns
col[col == 0] <- NA
}
return(col)
}
df[] <- lapply(seq_along(df), changetoNA, df)
Assigning into df[] keeps the column names and types. seq_along(df) covers every column; you could pass a subset of column numbers instead, but then assign the result back to those same columns.
Here is my contribution for those who are struggling with datasets with different types of columns with multiple values representing missing data.
dat <- tibble(numA = c(1, 0, 3, 4),
numB = c(NA, 2, 3, 4),
strC = c("0", "1.2", "NA", "2.4"),
strD = c("Yes", "Yes", "missing", "No"))
Let's say in this data we want to replace 0 in numeric columns with NA as well as 'NA' and 'missing' values in character/string values with NA. Notice that 'NA' in strC column is a character type value, not the desired NA.
dat
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 0 2 1.2 Yes
3 3 3 'NA' missing
4 4 4 2.4 No
First, an obvious case, notice that when converting a character column to numeric values any non-numeric string value is coerced to NA.
as.numeric(dat$strC)
[1] 0.0 1.2 NA 2.4
Answer with indexing:
dat[dat == "NA" | dat =="missing"] <- NA
However, do NOT use that for 0 because it changes both numeric and character 0s to NA. This is because "0" == 0 returns TRUE in R.
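The coercion behind this can be checked directly, and one way around it in base R is to restrict the comparison to numeric columns. This is a sketch on a cut-down version of dat; is_num is an illustrative name.

```r
# Comparing a string with a number converts the number to a string first
same <- "0" == 0        # TRUE: 0 becomes "0"
different <- "0.0" == 0 # FALSE: "0.0" is not the string "0"

dat <- data.frame(numA = c(1, 0, 3, 4),
                  strC = c("0", "1.2", "NA", "2.4"),
                  stringsAsFactors = FALSE)

# Only compare against 0 inside the numeric columns
is_num <- vapply(dat, is.numeric, logical(1))
dat[is_num][dat[is_num] == 0] <- NA
```

The character "0" in strC is left alone because that column is never compared.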
dplyr::na_if method:
library(dplyr)
dat %>%
lapply(na_if, y = "missing") %>%
lapply(na_if, y = "NA") %>%
lapply(na_if, y = 0) %>% # DONT DO THIS! It converts string 0s to NA as well!
data.frame()
Here we apply na_if function to each column of the data. Since na_if does not accept multiple values to be converted to NA we need to write multiple lines of code for each value to be converted into NA. However, simple usage of this function with 0 converts both the numeric and character 0s into NA. We need to do something else!
Using mutate across method with na_if function:
This is my favorite solution. Here we check the column type and apply na_if function as necessary. The character 0 is untouched, whereas all desired values are converted into NA.
dat %>%
mutate(across(where(is.numeric), ~na_if(., 0))) %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
mutate(across(where(is.character), ~na_if(., "missing")))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
Finally, the naniar package can be used
naniar is a recent package that introduces a variety of replace_with_na_*() functions.
library(naniar)
Replace all 'NA' and 'missing' values to NA:
dat %>%
replace_with_na_all(~.x %in% c("NA", "missing"))
but if you use this with 0s, it still erroneously converts the character 0 to NA:
dat %>%
replace_with_na_all(~.x %in% c(0, "NA", "missing"))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA NA Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
#strC's first element should not be NA here!
So, we have to specify column type using replace_with_na_if:
dat %>%
replace_with_na_if(is.character, ~.x %in% c("NA", "missing")) %>%
replace_with_na_if(is.numeric, ~.x %in% c(0))
# A tibble: 4 x 4
numA numB strC strD
<dbl> <dbl> <chr> <chr>
1 1 NA 0 Yes
2 NA 2 1.2 Yes
3 3 3 NA NA
4 4 4 2.4 No
We achieved the desired outcome. I hope all this is helpful :)
If you are like me and landed here while wondering how to replace ALL values in a dataframe with NA, it's just:
df[,] <- NA
Another option is to replace all 0 with NA using mutate_all like this:
library(dplyr)
df <- data.frame(v1 = c(1,0,4,2),
v2 = c(3,1,0,0))
df
#> v1 v2
#> 1 1 3
#> 2 0 1
#> 3 4 0
#> 4 2 0
mutate_all(df, ~replace(., .==0, NA))
#> v1 v2
#> 1 1 3
#> 2 NA 1
#> 3 4 NA
#> 4 2 NA
Created on 2022-07-10 by the reprex package (v2.0.1)
