How to replace 0 or missing value with NA in R [duplicate] - r

This is what I have done so far.
data is of numeric type.
if (is.na(data) || attribute==0){replace(data,NA)}
It gives me this error message:
Error in replace(attribute, NA) : argument "values" is missing, with no default
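For context on the error: base R's replace() takes three arguments, replace(x, list, values), so a call with only two leaves values missing. Note also that || is meant for length-one conditions, whereas == on a vector works elementwise. A minimal sketch on a made-up vector (the name x and its values are illustrative assumptions, not from the question):

```r
# Hypothetical numeric vector with zeros and one existing missing value
x <- c(3, 0, NA, 7, 0)

# replace(x, list, values): 'list' selects the positions, 'values' is what goes there.
# x == 0 is NA where x is NA; with a length-one replacement that position is
# simply not selected, so the existing NA stays as it is.
cleaned <- replace(x, x == 0, NA)
print(cleaned)
```

Because the replacement value is NA anyway, the result is the same whether or not existing NAs are selected.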

With mutate_all:
library(dplyr)
df %>%
mutate_all(~replace(., . == 0, NA))
or with mutate_if to be safe:
df %>%
mutate_if(is.numeric, ~replace(., . == 0, NA))
Note that there is no need to check for NA's, because we are replacing with NA anyway.
Output:
> df %>%
+ mutate_all(~replace(., . == 0, NA))
X Y Z
1 1 5 <NA>
2 4 4 2
3 2 3 2
4 5 5 2
5 5 3 <NA>
6 NA 4 <NA>
7 3 3 1
8 5 3 2
9 3 1 1
10 2 NA 5
11 5 5 <NA>
12 2 5 2
13 4 4 4
14 3 4 <NA>
15 NA NA 3
16 5 2 1
17 1 4 <NA>
18 NA 1 4
19 1 1 5
20 5 1 2
> df %>%
+ mutate_if(is.numeric, ~replace(., . == 0, NA))
X Y Z
1 1 5 0
2 4 4 2
3 2 3 2
4 5 5 2
5 5 3 0
6 NA 4 0
7 3 3 1
8 5 3 2
9 3 1 1
10 2 NA 5
11 5 5 0
12 2 5 2
13 4 4 4
14 3 4 0
15 NA NA 3
16 5 2 1
17 1 4 0
18 NA 1 4
19 1 1 5
20 5 1 2
Data:
set.seed(123)
df <- data.frame(X = sample(0:5, 20, replace = TRUE),
Y = sample(0:5, 20, replace = TRUE),
Z = as.character(sample(0:5, 20, replace = TRUE)))
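The same "numeric columns only" idea can be sketched without dplyr. This is an illustrative base R variant, not part of the answer above; the small data frame toy and its values are assumptions made up for the example:

```r
# Hypothetical data: two numeric columns and one character column
toy <- data.frame(X = c(0, 1, 2),
                  Y = c(5, 0, NA),
                  Z = c("0", "a", "b"),
                  stringsAsFactors = FALSE)

# Logical flag per column: TRUE where the column is numeric
num_cols <- vapply(toy, is.numeric, logical(1))

# Replace zeros with NA in the numeric columns only; Z is left untouched
toy[num_cols] <- lapply(toy[num_cols], function(col) replace(col, col == 0, NA))
print(toy)
```

Restricting the replacement to numeric columns keeps the character "0" in Z as it is, which mirrors the mutate_if output shown above.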

You could just use replace without any additional function / package:
data <- replace(data, data == 0, NA)
This assumes that data is your data frame.
Otherwise, you can apply it to a single column, e.g. if your data frame is df and the column is named data:
df$data <- replace(df$data, df$data == 0, NA)

Assuming that data is a data frame, you could use sapply to replace the values that match a set of conditions:
new.data = as.data.frame(sapply(data, FUN = function(x) replace(x, is.na(x) | x == 0, NA)))

Related

Custom data frame in R

I have a below data frame
df <- data.frame(a = c(1,3,4,5,8,9), b = c("","",0,0,"",""))
df$b <- as.numeric(df$b)
df
a b
1 1 NA
2 3 NA
3 4 0
4 5 0
5 8 NA
6 9 NA
Is there a way to fill column b with the value from column a, but only at specific positions?
Example of the expected output (the cell just before the 0s and the cell just after the 0s in column b should be filled with the value from column a):
df1
a b
1 1 NA
2 3 3
3 4 0
4 5 0
5 8 8
6 9 NA
I think the following solution will help you:
library(dplyr)
df %>%
mutate(b = ifelse(is.na(b) & lead(b) == 0 | is.na(b) & lag(b) == 0, a, b))
a b
1 1 NA
2 3 3
3 4 0
4 5 0
5 8 8
6 9 NA
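For readers without dplyr, the lead/lag logic can be sketched in base R by building the shifted vectors by hand. This is an illustrative variant under that assumption; the helper names next_b, prev_b and fill_here are made up for the example:

```r
df <- data.frame(a = c(1, 3, 4, 5, 8, 9),
                 b = c(NA, NA, 0, 0, NA, NA))
n <- nrow(df)

next_b <- c(df$b[-1], NA)   # value of b one row below (like lead)
prev_b <- c(NA, df$b[-n])   # value of b one row above (like lag)

# %in% never returns NA, so the condition is a clean TRUE/FALSE vector
fill_here <- is.na(df$b) & (next_b %in% 0 | prev_b %in% 0)
df$b[fill_here] <- df$a[fill_here]
print(df)
```

Using %in% instead of == avoids NA in the condition, so no row is selected by accident at the edges of the data frame.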

Replace NA values when they are in two adjacent columns

Hi this is an example of a similar dataframe I am working with. I have an experiment with 10 samples and two replicates
df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"Rep1" = c(6,5,3,"Na","Na",9,4,"Na","Na",2),
"Rep2" = c(8,4,4,"Na",3,"Na",6,"Na",2,1))
I have several Na values; however, I only want to replace them with zeros in samples 4 and 8, because they are the only ones that have Na in both replicates. The other samples should keep their "Na".
You can also use the following solution. Here we iterate over each row and find the index or indices of the values equal to "Na"; if there is more than one such index we replace those values with "0", otherwise the row remains as it is:
library(dplyr)
library(purrr)
df %>%
pmap_df(., ~ {ind <- which(c(...) == "Na");
if(length(ind) > 1) {
replace(c(...), ind, "0")
} else {
c(...)
}
}
) %>%
mutate(across(ID, as.integer))
# A tibble: 10 x 3
ID Rep1 Rep2
<int> <chr> <chr>
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
P.S. I almost went crazy wondering why I could not get it to work, only to realize your NAs are in fact "Na".
We create an index where the 'Rep' columns are both "Na" with rowSums on a logical matrix. Use the row, column index/names to subset the data and assign the values to 0
nm1 <- grep("Rep", names(df), value = TRUE)
i1 <- rowSums(df[nm1] == "Na") == length(nm1)
df[i1, nm1] <- 0
-output
df
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
As the OP created the string "Na", the column types are not numeric. We can convert them to numeric with
df[-1] <- lapply(df[-1], as.numeric)
which forces the "Na" strings to be converted to NA (with a coercion warning).
-output
df
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 NA 3
6 6 9 NA
7 7 4 6
8 8 0 0
9 9 NA 2
10 10 2 1
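The two steps can also be done in the opposite order: convert to numeric first, then test for real NA values. A minimal sketch of that variant, which is an assumption about ordering and not the code from the answer above (rep_cols and both_missing are illustrative names):

```r
df <- data.frame(ID = 1:10,
                 Rep1 = c(6, 5, 3, "Na", "Na", 9, 4, "Na", "Na", 2),
                 Rep2 = c(8, 4, 4, "Na", 3, "Na", 6, "Na", 2, 1),
                 stringsAsFactors = FALSE)

rep_cols <- grep("^Rep", names(df), value = TRUE)

# "Na" is not a number, so as.numeric() turns it into NA (the warning is expected)
df[rep_cols] <- lapply(df[rep_cols], function(col) suppressWarnings(as.numeric(col)))

# TRUE for rows where every replicate is missing
both_missing <- rowSums(is.na(df[rep_cols])) == length(rep_cols)
df[both_missing, rep_cols] <- 0
print(df)
```

Converting first means the comparison no longer depends on the exact spelling of the "Na" string.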
With dplyr we could:
library(dplyr)
df %>%
mutate(across(starts_with("Rep"), ~case_when(. == "Na" & ID %in% c(4, 8) ~ "0",
TRUE ~ .)))
Output:
ID Rep1 Rep2
1 1 6 8
2 2 5 4
3 3 3 4
4 4 0 0
5 5 Na 3
6 6 9 Na
7 7 4 6
8 8 0 0
9 9 Na 2
10 10 2 1
Although this has been marked as solved, here is another simple answer:
df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"Rep1" = c(6,5,3,"Na","Na",9,4,"Na","Na",2),
"Rep2" = c(8,4,4,"Na",3,"Na",6,"Na",2,1))
library(dplyr)
df %>% group_by(ID) %>%
mutate(replace(cur_data(), all(cur_data() == 'Na'), '0'))
#> # A tibble: 10 x 3
#> # Groups: ID [10]
#> ID Rep1 Rep2
#> <dbl> <chr> <chr>
#> 1 1 6 8
#> 2 2 5 4
#> 3 3 3 4
#> 4 4 0 0
#> 5 5 Na 3
#> 6 6 9 Na
#> 7 7 4 6
#> 8 8 0 0
#> 9 9 Na 2
#> 10 10 2 1
OR
df %>% rowwise() %>%
mutate(replace(cur_data()[-1], all(cur_data()[-1] == 'Na'), '0'))

R: creating multiple new variables based on conditions of selection of other variables with similar names

I have a data frame where each condition (in the example: hope, dream, joy) has 5 variables (in the example, coded with suffixes x, y, z, a, b; they are the same for each condition).
df <- data.frame(matrix(1:16,5,16))
names(df) <- c('ID','hopex','hopey','hopez','hopea','hopeb','dreamx','dreamy','dreamz','dreama','dreamb','joyx','joyy','joyz','joya','joyb')
df[1,2:6] <- NA
df[3:5,c(7,10,14)] <- NA
This is what the data look like:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16
I want to create a new variable for each condition (hope, dream, joy) that codes whether all of the variables x...b for that condition are NA (0 if all are NA, 1 if any is non-NA). And I want the new variables to be stored in the data frame. Thus, the output should be this:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope joy dream
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12 0 1 1
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13 1 1 1
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14 1 1 1
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15 1 1 1
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16 1 1 1
The code below does it, but I'm looking for a more elegant solution (e.g., for a case where I have even more conditions). I've tried with various combinations of all(), select(), mutate(), but while they all seem useful, I cannot figure out how to combine them to get what I want. I'm stuck and would be interested in learning to code more efficiently. Thanks in advance!
df$hope <- 0
df[is.na(df$hopex) == FALSE | is.na(df$hopey) == FALSE | is.na(df$hopez) == FALSE | is.na(df$hopea) == FALSE | is.na(df$hopeb) == FALSE, "hope"] <- 1
df$dream <- 0
df[is.na(df$dreamx) == FALSE | is.na(df$dreamy) == FALSE | is.na(df$dreamz) == FALSE | is.na(df$dreama) == FALSE | is.na(df$dreamb) == FALSE, "dream"] <- 1
df$joy<- 0
df[is.na(df$joyx) == FALSE | is.na(df$joyy) == FALSE | is.na(df$joyz) == FALSE | is.na(df$joya) == FALSE | is.na(df$joyb) == FALSE, "joy"] <- 1
Here is an option with tidyverse
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(hope = select(., starts_with('hope')) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer)
# hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope
#1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
#2 1 1 4 3 2 3 5 4 5 2 5 NA 4 3 1 1
#3 2 NA 4 4 4 3 5 NA 5 5 4 NA 4 5 1 1
#4 4 3 NA 1 1 1 5 2 NA 5 1 2 1 1 1 1
#5 1 NA 4 NA NA 2 1 5 1 2 NA 3 1 2 5 1
Or with rowSums
df %>%
mutate(hope = +(rowSums(!is.na(select(., starts_with('hope'))))!= 0))
For multiple columns, we can create a function
f1 <- function(dat, colSubstr) {
dplyr::select(dat, starts_with(colSubstr)) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer
}
df %>%
mutate(hope = f1(., 'hope'),
dream = f1(., 'dream'),
joy = f1(., 'joy'))
Or using base R
cbind(df, sapply(split.default(df, sub(".$", "", names(df))),
function(x) +(rowSums(!is.na(x)) != 0)))
If we want to subset columns
nm1 <- setdiff(names(df), "ID")
cbind(df, sapply(split.default(df[nm1], sub(".$", "", names(df[nm1]))),
function(x) +(rowSums(!is.na(x)) != 0)))
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 5 * 15, replace = TRUE),
ncol = 15, dimnames = list(NULL, paste0(rep(c("hope", "dream", "joy"),
each = 5), c('x', 'y', 'z', 'a', 'b')))))
df[1,] <- NA
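When the list of conditions grows, a plain loop over the prefixes is another option. The following is a small sketch on made-up data; the data frame toy, its columns and the prefixes vector are assumptions for illustration only:

```r
# Hypothetical data with two conditions, two variables each
toy <- data.frame(ID = 1:3,
                  hopex = c(NA, 1, 2), hopey = c(NA, NA, 3),
                  joyx  = c(4, NA, NA), joyy = c(5, NA, 6))

prefixes <- c("hope", "joy")
orig_names <- names(toy)   # fixed up front so new columns are never matched

for (p in prefixes) {
  cols <- grep(paste0("^", p), orig_names, value = TRUE)
  # 1 if any variable of this condition is non-NA in the row, 0 if all are NA
  toy[[p]] <- as.integer(rowSums(!is.na(toy[cols])) > 0)
}
print(toy)
```

Adding a condition then only requires adding its prefix to the vector.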

Fill subset of rows with values from row above

I have a long format dataset with longitudinal data and for one variable I want to fill in the missings in timepoint 0 with the values in timepoint 1, but I do not want to fill in the missings from timepoint 1 with values from timepoint 2 and so on.
My dataset is ordered by id and timepoint.
I have used the fill function successfully in cases where I just needed to fill missings from all timepoints from a specific id.
Example dataframe:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
timepoint=c(0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3),
var1=c(NA,9,8,10, NA, 10, NA, 12, NA, NA, 12, 11, NA, 12, 12, NA))
> df
id timepoint var1
1 1 0 NA
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 NA
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 NA
14 4 1 12
15 4 2 12
16 4 3 NA
This is what works when I just need to fill any missing no matter the timepoint:
library(dplyr)
library(tidyr)
df <- df %>%
group_by(id) %>%
fill(`var9`:`var12`, .direction = "up") %>%
as.data.frame
But now I have trouble specifying to only fill in the missings in rows at timepoint 0. Any help is appreciated.
My expected output:
> df
id timepoint var1
1 1 0 9
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 10
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 12
14 4 1 12
15 4 2 12
16 4 3 NA
This might be an oversimplification, but you can just call the fill function again, but this time with direction down. Then your entire data frame will be complete.
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
timepoint=c(0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3),
var1=c(NA,9,8,10, NA, 10, NA, 12, NA, NA, 12, 11, NA, 12, 12, NA))
In this case I will use an ifelse statement together with the lead function.
library(dplyr); library(tidyr);
df %>%
group_by(id) %>%
mutate(var1 = ifelse(is.na(var1) & timepoint == 0,
lead(var1, 1), var1))
Yields:
# A tibble: 16 x 3
# Groups: id [4]
id timepoint var1
<dbl> <dbl> <dbl>
1 1 0 9
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 10
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 12
14 4 1 12
15 4 2 12
16 4 3 NA
We can group_by id and use replace to change the values where timepoint == 0 and var1 is NA to the corresponding value of var1 at timepoint == 1 in each group.
library(dplyr)
df %>%
group_by(id) %>%
mutate(var2 = replace(var1, timepoint == 0 & is.na(var1), var1[timepoint == 1]))
# id timepoint var1 var2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 NA 9
# 2 1 1 9 9
# 3 1 2 8 8
# 4 1 3 10 10
# 5 2 0 NA 10
# 6 2 1 10 10
# 7 2 2 NA NA
# 8 2 3 12 12
# 9 3 0 NA NA
#10 3 1 NA NA
#11 3 2 12 12
#12 3 3 11 11
#13 4 0 NA 12
#14 4 1 12 12
#15 4 2 12 12
#16 4 3 NA NA
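A base R sketch of the same idea, looking up the timepoint 1 value by id with match() so that it does not depend on row order. This is an illustrative alternative, not taken from the answers above; tp1, value_at_tp1 and needs_fill are made-up names:

```r
df <- data.frame(id = rep(1:4, each = 4),
                 timepoint = rep(0:3, times = 4),
                 var1 = c(NA, 9, 8, 10, NA, 10, NA, 12,
                          NA, NA, 12, 11, NA, 12, 12, NA))

# One row per id holding the value observed at timepoint 1
tp1 <- df[df$timepoint == 1, c("id", "var1")]
value_at_tp1 <- tp1$var1[match(df$id, tp1$id)]

# Only missing values at timepoint 0 are filled; later timepoints are left alone
needs_fill <- df$timepoint == 0 & is.na(df$var1)
df$var1[needs_fill] <- value_at_tp1[needs_fill]
print(df)
```

For id 3 the timepoint 1 value is itself NA, so the timepoint 0 value stays NA, as in the expected output.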

Applying custom function to each row uses only first value of argument

I am trying to recode NA values to 0 in a subset of columns using the following dataset:
set.seed(1)
df <- data.frame(
id = c(1:10),
trials = sample(1:3, 10, replace = T),
t1 = c(sample(c(1:9, NA), 10)),
t2 = c(sample(c(1:7, rep(NA, 3)), 10)),
t3 = c(sample(c(1:5, rep(NA, 5)), 10))
)
Each row has a certain number of trials associated with it (between 1 and 3), specified by the trials column. Columns t1-t3 represent the scores for each trial.
The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.
So, I tried using this function:
replace0 <- function(x, num.sun) {
x[which(is.na(x[1:(num.sun + 2)]))] <- 0
return(x)
}
This works well for single vectors. When I try applying the same function to a data frame with apply(), though:
apply(df, 1, replace0, num.sun = df$trials)
I get a warning saying:
In 1:(num.sun + 2) :
numerical expression has 10 elements: only the first used
The result is that instead of having the value of num.sun change every row according to the value in trials, apply() simply uses the first value in the trials column for every single row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
Thanks!
Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:
df <- data.frame(
id = c(1:5),
trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
t1 = c(NA, 7, NA, 6, NA),
t2 = c(NA, NA, 3, 7, 12),
t3 = c(NA, NA, NA, 4, NA)
)
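The warning arises because apply() hands the whole df$trials vector to every call instead of one value per row. One way around it, sketched here as an assumption and not as the only fix, is to skip apply() and compare each score column's position with trials in a single vectorized step (score_cols, scores and within_trials are illustrative names):

```r
df <- data.frame(id = 1:5,
                 trials = c(1, 1, 2, 3, 3),
                 t1 = c(NA, 7, NA, 6, NA),
                 t2 = c(NA, NA, 3, 7, 12),
                 t3 = c(NA, NA, NA, 4, NA))

score_cols <- c("t1", "t2", "t3")
scores <- as.matrix(df[score_cols])

# col(scores) holds the trial number of each cell; df$trials is recycled
# down the columns, so each cell is compared with its own row's trial count
within_trials <- col(scores) <= df$trials

scores[is.na(scores) & within_trials] <- 0
df[score_cols] <- as.data.frame(scores)
print(df)
```

NAs beyond a row's trial count are left as NA, while NAs within it become 0.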
Another approach:
# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)
# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# subset 'w' such that only the NA's which fall in the scope of 'm' remain
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
# use 'i' to replace the allowed NA's with a zero
df[i] <- 0
which gives:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
You could easily wrap this in a function:
replace.NA.with.0 <- function(df) {
w <- which(is.na(df), arr.ind = TRUE)
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
df[i] <- 0
return(df)
}
Now, using replace.NA.with.0(df) will produce the above result.
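As noted by others, some rows (1, 3 & 10) have more values than trials. You could tackle that problem by rewriting the above function to:
replace.with.NA.or.0 <- function(df) {
# replace every NA with a zero first
w <- which(is.na(df), arr.ind = TRUE)
df[w] <- 0
# per row, the last column that belongs to the trials (was missing in the function body)
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# per row, the columns that lie beyond the number of trials
v <- tapply(m[,2], m[,1], FUN = function(x) tail(x:5,-1))
ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
# set everything outside the trials back to NA
df[ina] <- NA
return(df)
}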
Now, using replace.with.NA.or.0(df) produces the following result:
id trials t1 t2 t3
1 1 1 3 NA NA
2 2 2 2 2 NA
3 3 2 6 6 NA
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 NA
9 9 2 1 3 NA
10 10 1 9 NA NA
Here I just rewrite your function using double subsetting, x[paste0('t', x['trials'])], which overcomes the problem in the other two solutions with row 6:
replace0 <- function(x){
x_na <- x[paste0('t',x['trials'])]
if(is.na(x_na)){x[paste0('t',x['trials'])] <- 0}
return(x)
}
t(apply(df, 1, replace0))
id trials t1 t2 t3
[1,] 1 1 3 NA 5
[2,] 2 2 2 2 NA
[3,] 3 2 6 6 4
[4,] 4 3 NA 1 2
[5,] 5 1 5 NA NA
[6,] 6 3 7 NA 0
[7,] 7 3 8 7 0
[8,] 8 2 4 5 1
[9,] 9 2 1 3 NA
[10,] 10 1 9 4 3
Here is a way to do it:
x <- is.na(df)
df[x & t(apply(x, 1, cumsum)) > 3 - df$trials] <- 0
The output looks like this:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
Note: rows 1, 3 and 10 are problematic, since they have more non-NA values than trials.
Here's a tidyverse way; note that it doesn't give the same output as the other solutions.
Your example data shows results for trials that "didn't happen"; I assumed your real data doesn't.
library(tidyverse)
df %>%
nest(matches("^t\\d")) %>%
mutate(data = map2(data,trials,~mutate_all(.,replace_na,0) %>% select(.,1:.y))) %>%
unnest
# id trials t1 t2 t3
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
Using the more commonly used gather strategy this would be:
df %>%
gather(k,v,matches("^t\\d")) %>%
arrange(id) %>%
group_by(id) %>%
slice(1:first(trials)) %>%
mutate_at("v",~replace(.,is.na(.),0)) %>%
spread(k,v)
# # A tibble: 10 x 5
# # Groups: id [10]
# id trials t1 t2 t3
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA

Resources