Replace Na based on condition - r

id var1 var2 var3 var4
1 3 5 NA 10
2 0 NA 7 NA
3 1 3 NA 6
4 0 NA NA 6
Hello, I have this example data set. If var1 == 0, I want to replace every NA in that row with 0, while leaving the NAs in all other rows untouched.
I have tried the following:
mydf <- replace(mydf, is.na(mydf), 0)
but, as you can see, this replaces all NA values.
I want to replace all NAs in the row based on my condition, not just in one column.
Could you provide me with some help please? Thank you

We may create a condition with the var1 column as well to only consider those rows where 'var1' is 0
i1 <- is.na(mydf[-c(1, 2)])
i2 <- (mydf$var1 == 0)[row(mydf[-c(1,2)])]
mydf[-c(1,2)][i1 & i2] <- 0
-output
> mydf
id var1 var2 var3 var4
1 1 3 5 NA 10
2 2 0 0 7 0
3 3 1 3 NA 6
4 4 0 0 0 6
Or instead of subsetting the data, it can be applied to the whole data as well
replace(mydf, is.na(mydf) & mydf$var1 == 0, 0)
id var1 var2 var3 var4
1 1 3 5 NA 10
2 2 0 0 7 0
3 3 1 3 NA 6
4 4 0 0 0 6
Or using dplyr
library(dplyr)
mydf %>%
mutate(across(var2:var4, ~ replace(.x, is.na(.x) & var1 == 0, 0)))
-output
id var1 var2 var3 var4
1 1 3 5 NA 10
2 2 0 0 7 0
3 3 1 3 NA 6
4 4 0 0 0 6
data
mydf <- structure(list(id = 1:4, var1 = c(3L, 0L, 1L, 0L), var2 = c(5L,
NA, 3L, NA), var3 = c(NA, 7L, NA, NA), var4 = c(10L, NA, 6L,
6L)), class = "data.frame", row.names = c(NA, -4L))
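For comparison, here is a minimal base R sketch of the same idea that first picks the qualifying rows and then fills only those. It is not taken from the answers above: the helper name fill_na_where_zero is made up for illustration, and the data frame is rebuilt from the question so the block runs on its own.

```r
# Example data, copied from the question
mydf <- data.frame(id   = 1:4,
                   var1 = c(3L, 0L, 1L, 0L),
                   var2 = c(5L, NA, 3L, NA),
                   var3 = c(NA, 7L, NA, NA),
                   var4 = c(10L, NA, 6L, 6L))

# Hypothetical helper: set NA to 0 only in rows where `cond_col` equals 0
fill_na_where_zero <- function(df, cond_col) {
  rows <- which(df[[cond_col]] == 0)  # which() ignores rows where the condition is NA
  sub <- df[rows, , drop = FALSE]     # the qualifying rows only
  sub[is.na(sub)] <- 0                # fill every NA in those rows
  df[rows, ] <- sub                   # write the filled rows back
  df
}

res <- fill_na_where_zero(mydf, "var1")
res
```

Using which() rather than the raw logical vector is a deliberate choice here: if var1 itself contained NA, a logical index with NA would fail on assignment, whereas which() simply skips those rows.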

Related

Getting a NA for sum across rows if any variable values is NA

I have a dataset with var1, var2, var3, var4, and I am calculating a sum var_total <- var1 + var2 + var3 + var4. I want a missing value in var_total if any of var1, var2, var3 or var4 is missing.
Have:
var1 var2 var3 var4 var_total
1 0 0 0 1
1 NA 2 0 3
1 0 0 NA 1
Want:
var1 var2 var3 var4 var_total
1 0 0 0 1
1 NA 2 0 NA
1 0 0 NA NA
I assume something involving ifelse().
Libraries
library(dplyr)
Data
data <-
tibble::tribble(
~var1, ~var2, ~var3, ~var4, ~var_total,
1L, 0L, 0L, 0L, 1L,
1L, NA, 2L, 0L, 3L,
1L, 0L, 0L, NA, 1L
)
Code
data %>%
rowwise() %>%
mutate(var_total = sum(c_across(cols = var1:var4),na.rm = FALSE))
Output
# A tibble: 3 x 5
# Rowwise:
var1 var2 var3 var4 var_total
<int> <int> <int> <int> <int>
1 1 0 0 0 1
2 1 NA 2 0 NA
3 1 0 0 NA NA
Using rowSums() in base R:
data$var_total <- rowSums(data[ , 1:4])
Or with dplyr:
library(dplyr)
data %>%
mutate(var_total = rowSums(across(var1:var4)))
Result from either approach:
var1 var2 var3 var4 var_total
1 1 0 0 0 1
2 1 NA 2 0 NA
3 1 0 0 NA NA
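To make the role of na.rm explicit, the short sketch below compares the default rowSums() behaviour, the na.rm = TRUE variant, and plain `+` on the same three rows. The object names strict, lenient and plus are only illustrative, and the data frame is rebuilt from the question.

```r
# Same three rows as in the question
data <- data.frame(var1 = c(1L, 1L, 1L),
                   var2 = c(0L, NA, 0L),
                   var3 = c(0L, 2L, 0L),
                   var4 = c(0L, 0L, NA))
cols <- c("var1", "var2", "var3", "var4")

strict  <- rowSums(data[cols])                 # na.rm = FALSE is the default: any NA gives NA
lenient <- rowSums(data[cols], na.rm = TRUE)   # NAs are dropped before summing
plus    <- with(data, var1 + var2 + var3 + var4)  # `+` also propagates NA

unname(strict)   # 1 NA NA
unname(lenient)  # 1 3 1
plus             # 1 NA NA
```

So the "wanted" result is what R gives by default; a total of 3 in the second row only appears when NAs are removed somewhere along the way.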

Adding condition when counting NA value by a group

I am counting NA values per row and then summing those counts by the col1 variable. I want to add a condition to this query.
When calculating the number of NAs:
for col2 = a and b, also look at the col4 column; for col2 = c, do not look at the col4 column.
library(dplyr)
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
col2 = letters[1:3],
col3 = c(1,NA,NA,1,NA,NA,2,NA,2),
col4 = c(1,4,NA,1,NA,NA,NA,1,2))
data_frame = data_frame %>%
rowwise() %>%
mutate(Count_NA = sum(is.na(cur_data()))) %>%
ungroup
#print (data_frame)
data_frame %>% group_by(col1) %>%
summarize(Sum_Count_NA=sum(Count_NA))
The output I want is:
col1 col2 col3 col4 Count_NA
8 a 1 1 0
6 b NA 4 1
8 c NA NA 2
7 a 1 1 0
8 b NA NA 2
8 c NA NA 2
8 a 2 NA 1
8 b NA 1 1
9 c 2 2 0
After adding the condition, the output I want is (NA values in col4 are not counted when col2 = c):
col1 col2 col3 col4 Count_NA
8 a 1 1 0
6 b NA 4 1
8 c NA NA 1
7 a 1 1 0
8 b NA NA 2
8 c NA NA 1
8 a 2 NA 1
8 b NA 1 1
9 c 2 2 0
An option is also to replace the NA elements in the 'col4' with non-NA when 'col2' is 'c' and then do the rowSums on the logical matrix
library(dplyr)
data_frame %>%
mutate(Count_Na = rowSums(is.na(cbind(col3, replace(col4, col2 == 'c', 999)))))
-output
col1 col2 col3 col4 Count_Na
1 7 a 1 1 0
2 9 b NA 4 1
3 9 c NA NA 1
4 7 a 1 1 0
5 7 b NA NA 2
6 7 c NA NA 1
7 7 a 2 NA 1
8 9 b NA 1 1
9 7 c 2 2 0
You can do this (col4 only contributes to the count when col2 is not "c"):
library(dplyr)
data_frame %>%
mutate(sum = rowSums(is.na(select(., contains("col3")))) + (col2 != "c" & is.na(col4)))
col1 col2 col3 col4 sum
1 8 a 1 1 0
2 6 b NA 4 1
3 9 c NA NA 1
4 8 a 1 1 0
5 7 b NA NA 2
6 7 c NA NA 1
7 7 a 2 NA 1
8 9 b NA 1 1
9 7 c 2 2 0
data
data_frame <- structure(list(col1 = c(8L, 6L, 9L, 8L, 7L, 7L, 7L, 9L, 7L),
col2 = c("a", "b", "c", "a", "b", "c", "a", "b", "c"), col3 = c(1,
NA, NA, 1, NA, NA, 2, NA, 2), col4 = c(1, 4, NA, 1, NA, NA,
NA, 1, 2)), class = "data.frame", row.names = c(NA, -9L))
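The same conditional count can also be sketched in base R by adding two logical vectors, followed by the per-col1 total the question was ultimately after. This is an illustration under the assumption that only col3 and col4 can contain NA; the name totals is made up here.

```r
# Data as given in the data block
data_frame <- data.frame(col1 = c(8L, 6L, 9L, 8L, 7L, 7L, 7L, 9L, 7L),
                         col2 = rep(c("a", "b", "c"), 3),
                         col3 = c(1, NA, NA, 1, NA, NA, 2, NA, 2),
                         col4 = c(1, 4, NA, 1, NA, NA, NA, 1, 2))

# NA in col3 always counts; NA in col4 counts only when col2 is not "c"
data_frame$Count_NA <- with(data_frame,
                            is.na(col3) + (is.na(col4) & col2 != "c"))

# Sum of the row counts within each col1 value
totals <- tapply(data_frame$Count_NA, data_frame$col1, sum)

data_frame$Count_NA  # 0 1 1 0 2 1 1 1 0
totals               # col1 6: 1, 7: 4, 8: 0, 9: 2
```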

compare ratings involving integers and NA

I have ratings by different raters:
df <- structure(list(SZ = c(1, 1, NA, 0, NA, 1, 1),
SZ_ptak = c(1, 1, NA, NA, NA, 1, 0)),
row.names = c(NA, 7L), class = "data.frame")
I need to compare them to find ratings that differ. This code works fine as long as both raters assigned either 1 or 0. If one rating is NA and the other is 1 or 0, I also want to obtain the value 1 in column diff_SZ - how can that be done?
df %>%
mutate(diff_SZ = +(SZ != SZ_ptak))
SZ SZ_ptak diff_SZ
1 1 1 0
2 1 1 0
3 NA NA NA
4 0 NA NA
5 NA NA NA
6 1 1 0
7 1 0 1
Desired:
SZ SZ_ptak diff_SZ
1 1 1 0
2 1 1 0
3 NA NA NA
4 0 NA 1 <--
5 NA NA NA
6 1 1 0
7 1 0 1
Maybe it would be easier to understand if you list out the conditions explicitly.
library(dplyr)
df %>%
mutate(diff_SZ = case_when(is.na(SZ) & is.na(SZ_ptak) ~ NA_real_,
is.na(SZ) | is.na(SZ_ptak) ~ 1,
SZ != SZ_ptak ~ 1,
TRUE ~ 0))
# SZ SZ_ptak diff_SZ
#1 1 1 0
#2 1 1 0
#3 NA NA NA
#4 0 NA 1
#5 NA NA NA
#6 1 1 0
#7 1 0 1
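A base R sketch of the same rule, for readers who prefer to avoid case_when(): start from the plain comparison and then overwrite the rows where exactly one of the two ratings is missing. The use of xor() here is my own choice for illustration, not something taken from the answer above.

```r
# Ratings from the question
df <- data.frame(SZ      = c(1, 1, NA, 0, NA, 1, 1),
                 SZ_ptak = c(1, 1, NA, NA, NA, 1, 0))

# Plain comparison: gives NA whenever either rating is NA
df$diff_SZ <- as.numeric(df$SZ != df$SZ_ptak)

# Exactly one rating missing counts as a difference
one_missing <- xor(is.na(df$SZ), is.na(df$SZ_ptak))
df$diff_SZ[one_missing] <- 1

df$diff_SZ  # 0 0 NA 1 NA 0 1
```

Rows where both ratings are NA are left as NA because xor() is FALSE there and the comparison result is kept.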

R - dplyr - Group by column and calculate the sum keeping NA's if only NA's present for a given group

I have a dataframe with duplicated ids in the first column and different values in the subsequent columns. I would like to collapse this data to one record per unique id, with each subsequent column holding the sum of its values. I can do this with dplyr::summarise, but with na.rm = TRUE it replaces NAs with 0 (if all the records were NA), and without na.rm = TRUE it returns NA (if any NA was present).
How can I get it to retain NA as the new value if all the values were NA, and return the sum if there were numeric values alongside an NA?
Apologies for the bad explanation. Not sure how to better word it.
A mock dataframe would look like this:
df <- structure(
list(
id = structure(
c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 6L, 7L, 7L),
.Label = c("a", "b", "c", "d", "e", "f", "g"),
class = "factor"
),
`1` = c(NA, NA, NA, 1, 1, 0, 1, 1, 0, 1, NA, 1, NA, 0, 1, 0),
`2` = c(NA, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, NA, 0),
`3` = c(NA, 1, 1, 0, 1, 1, 0, 1, 0, 1, NA, 1, 0, 0, NA, NA)
),
row.names = c(NA, -16L),
class = "data.frame"
)
which would print out looking like this:
> df
id 1 2 3
1 a NA NA NA
2 a NA 0 1
3 a NA 1 1
4 b 1 0 0
5 b 1 1 1
6 c 0 0 1
7 c 1 1 0
8 c 1 0 1
9 c 0 1 0
10 c 1 1 1
11 c NA 0 NA
12 d 1 1 1
13 e NA 0 0
14 f 0 0 0
15 g 1 NA NA
16 g 0 0 NA
I would like to group by the 'id' column, and then sum it to get something like this:
id 1 2 3
1 a NA 1 2
2 b 2 1 1
3 c 3 3 3
4 d 1 1 1
5 e NA 0 0
6 f 0 0 0
7 g 1 0 NA
I have tried using summarise with and without na.rm=T but it does not provide what I need.
df %>%
group_by(
id
) %>%
summarise_at(
c(
1,2,3
),
sum,
na.rm = T
)
# A tibble: 7 x 4
id `1` `2` `3`
<fct> <dbl> <dbl> <dbl>
1 a 0 1 2
2 b 2 1 1
3 c 3 3 3
4 d 1 1 1
5 e 0 0 0
6 f 0 0 0
7 g 1 0 0
Without na.rm = T:
df %>%
group_by(
id
) %>%
summarise_at(
c(
1,2,3
),
sum
)
# A tibble: 7 x 4
id `1` `2` `3`
<fct> <dbl> <dbl> <dbl>
1 a NA NA NA
2 b 2 1 1
3 c NA 3 NA
4 d 1 1 1
5 e NA 0 0
6 f 0 0 0
7 g 1 NA NA
I am not sure what else to try. Any advice would be greatly appreciated. Thank you very much
You can check for the values in each id and if all the values are NA return NA.
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(`1`:`3`, ~if(all(is.na(.))) NA else sum(., na.rm = TRUE)))
#summarise_at(vars(`1`:`3`), ~if(all(is.na(.))) NA else sum(., na.rm = TRUE))
# id `1` `2` `3`
# <fct> <dbl> <dbl> <dbl>
#1 a NA 1 2
#2 b 2 1 1
#3 c 3 3 3
#4 d 1 1 1
#5 e NA 0 0
#6 f 0 0 0
#7 g 1 0 NA
We can use
library(dplyr)
df %>%
group_by(id) %>%
# the grouping column id is excluded from across() automatically
summarise(across(everything(), ~ if (sum(is.na(.)) == n()) NA else sum(., na.rm = TRUE)))
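The rule both answers rely on can be pulled out into a small helper, which makes it easy to check in isolation. The sketch below uses a made-up helper name, sum_or_na, and a tiny toy vector rather than the question's data, and applies it per group with base R's tapply().

```r
# Hypothetical helper: NA if every value is missing, otherwise the sum of the rest
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# Toy data (not from the question): group "a" is all NA, group "b" is mixed
id    <- c("a", "a", "b", "b", "c")
value <- c(NA, NA, 1, NA, 2)

res <- tapply(value, id, sum_or_na)
res  # a: NA, b: 1, c: 2
```

Returning NA_real_ instead of a bare NA keeps the result type consistent (always double), which can matter when the per-group results are combined. The same function can be passed to across() as `~ sum_or_na(.x)`.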

R Add missing columns AND rows of data (Dplyr/TidyR & Complete?)

I'm fairly used to adding in missing cases for data but this use case escapes me.
I have a number of dataframes (which differ slightly), an example would be:
> t1
3 4 5
2 1 0 0
3 0 2 2
4 2 6 4
5 1 2 1
t1 <- structure(list(`3` = c(1L, 0L, 2L, 1L), `4` = c(0L, 2L, 6L, 2L
), `5` = c(0L, 2L, 4L, 1L)), .Names = c("3", "4", "5"), row.names = c("2",
"3", "4", "5"), class = "data.frame")
Row names and column names should both run from 1 to 5, with the cell value set to NA wherever a row or column was missing. For the example above this would give:
> t1
1 2 3 4 5
1 NA NA NA NA NA
2 NA NA 1 0 0
3 NA NA 0 2 2
4 NA NA 2 6 4
5 NA NA 1 2 1
In each case ANY one or more rows AND/OR columns might be missing.
I can readily get the missing columns using the method described by Josh O'Brien here but am missing the row method.
Can anyone help?
We can do this in a much easier way with base R by creating a matrix of NAs of the required dimensions and then assign the values of 't1' based on the row names and column names of 't1'
m1 <- matrix(NA, ncol=5, nrow=5, dimnames = list(1:5, 1:5))
m1[row.names(t1), colnames(t1)] <- unlist(t1)
m1
# 1 2 3 4 5
#1 NA NA NA NA NA
#2 NA NA 1 0 0
#3 NA NA 0 2 2
#4 NA NA 2 6 4
#5 NA NA 1 2 1
Or using tidyverse
library(tidyverse)
rownames_to_column(t1, "rn") %>%
gather(Var, Val, -rn) %>%
mutate_at(vars(rn, Var), as.integer) %>%
complete(rn = seq_len(max(rn)), Var = seq_len(max(Var))) %>%
spread(Var, Val)
# A tibble: 5 × 6
# rn `1` `2` `3` `4` `5`
#* <int> <int> <int> <int> <int> <int>
#1 1 NA NA NA NA NA
#2 2 NA NA 1 0 0
#3 3 NA NA 0 2 2
#4 4 NA NA 2 6 4
#5 5 NA NA 1 2 1
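Based on the solution you mentioned by Josh O'Brien, you can do the same but use rownames instead of names. Take a look at the code below. Note that this example fills the missing cells with 0; assign NA instead to match the output in the question.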
df <- data.frame(a=1:4, e=4:1)
colnms <- c("a", "b", "d", "e")
rownms <- c("1", "2", "3", "4", "5")
rownames(df) <- c("1", "3", "4", "5")
## find missing columns and replace with zero, and order them
Missing <- setdiff(colnms, names(df))
df[Missing] <- 0
df <- df[colnms]
df
## do the same for rows
MissingR <- setdiff(rownms, rownames(df))
df[MissingR,] <- 0
df <- df[rownms,]
df
# > df
# a b d e
#1 1 0 0 4
#2 0 0 0 0
#3 2 0 0 3
#4 3 0 0 2
#5 4 0 0 1
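Since the question mentions several data frames that each need the same treatment, the matrix approach from the first answer can be wrapped in a function. The following is only a sketch: the name pad_to_grid and its default arguments are invented for illustration, and it returns a matrix rather than a data frame.

```r
# Example input from the question
t1 <- data.frame(`3` = c(1L, 0L, 2L, 1L),
                 `4` = c(0L, 2L, 6L, 2L),
                 `5` = c(0L, 2L, 4L, 1L),
                 row.names = c("2", "3", "4", "5"),
                 check.names = FALSE)

# Hypothetical helper: place x into an all-NA grid with the requested row/column names
pad_to_grid <- function(x, rows = 1:5, cols = 1:5) {
  out <- matrix(NA, nrow = length(rows), ncol = length(cols),
                dimnames = list(as.character(rows), as.character(cols)))
  # Index by name, so existing cells land in the right position
  out[rownames(x), colnames(x)] <- as.matrix(x)
  out
}

m1 <- pad_to_grid(t1)
m1
```

Several inputs could then be handled with something like lapply(list_of_dfs, pad_to_grid), where list_of_dfs is a placeholder for your own list. The function will stop with a subscript error if an input has a row or column name outside the requested grid.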
