Replace values outside range with NA using replace_with_na function - r

I have the following dataset
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 34
3 9 1 77
4 2 9 88
5 9 12 33
6 8 NA 60
From column b I only want to keep values between 4 and 9, and from column c values between 50 and 80. Values outside those ranges should be replaced with NA, resulting in
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, NA, 9, NA,
NA), c = c(50, NA, 77, NA, NA, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 NA
3 9 NA 77
4 2 9 NA
5 9 NA NA
6 8 NA 60
I've tried several things with replace_with_na_at function where this seemed most logical:
test <- replace_with_na_at(data = test, .vars="c",
condition = ~.x < 2 & ~.x > 2)
However, nothing I tried works. Does somebody know why? Thanks in advance! :)

You can subset with a logical vector testing your conditions.
x$b[x$b < 4 | x$b > 9] <- NA
x$c[x$c < 50 | x$c > 80] <- NA
x
# a b c
#1 2 4 50
#2 1 5 NA
#3 9 NA 77
#4 2 9 NA
#5 9 NA NA
#6 8 NA 60
Data:
x <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))

Yet another base R solution, this time with function is.na<-
is.na(test$b) <- with(test, b < 4 | b > 9)
is.na(test$c) <- with(test, c < 50 | c > 80)
A solution using the naniar package and a pipe could be
library(naniar)
library(magrittr)
test %>%
replace_with_na_at(
.vars = 'b',
condition = ~(.x < 4 | .x > 9)
) %>%
replace_with_na_at(
.vars = 'c',
condition = ~(.x < 50 | .x > 80)
)

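You should mention the packages you are using. From googling, I'm guessing you are using naniar. The problem is that the condition was not specified properly: the formula takes a single ~, and the two tests must be joined with | rather than &, because no value can be both below and above the range. The following should work: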
library(naniar)
test <- structure(list(a = c(2, 1, 9, 2, 9, 8),
b = c(4, 5, 1, 9, 12, NA),
c = c(50, 34, 77, 88, 33, 60)),
class = "data.frame",
row.names = c(NA, -6L))
replace_with_na_at(test, "c", ~.x < 50 | .x > 80)
#> a b c
#> 1 2 4 50
#> 2 1 5 NA
#> 3 9 1 77
#> 4 2 9 NA
#> 5 9 12 NA
#> 6 8 NA 60
Created on 2020-06-02 by the reprex package (v0.3.0)
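To illustrate why joining the two tests with & can never flag anything, here is a minimal base R sketch using the values of column c from the question. The variable names both and either are only illustrative.

```r
# Values of column c from the question
x <- c(50, 34, 77, 88, 33, 60)

# No number is below 50 AND above 80 at the same time, so this is all FALSE
both <- x < 50 & x > 80

# Joining the tests with OR flags every value outside the 50-80 range
either <- x < 50 | x > 80

x[either] <- NA
x
```

The same reasoning applies to the formula passed to replace_with_na_at: the condition has to be TRUE for the values that should become NA.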

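You could simply use Map to replace your values with NA. Note that %in% tests membership in 4:9 and 50:80, so this only works when the values are whole numbers.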
dat[2:3] <- Map(function(x, y) {x[!x %in% y] <- NA;x}, dat[2:3], list(4:9, 50:80))
dat
# a b c
# 1 2 4 50
# 2 1 5 NA
# 3 9 NA 77
# 4 2 9 NA
# 5 9 NA NA
# 6 8 NA 60
Data:
dat <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))

We can use map2
library(purrr)
library(dplyr)
df1[c('b', 'c')] <- map2(df1 %>%
select(b, c), list(c(4, 9), c(50,80)), ~
replace(.x, .x < .y[1]|.x > .y[2], NA))
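If many columns each have their own allowed range, the answers above can be wrapped in a small helper. This is only a sketch in base R: the function name replace_outside and the named-list format for the ranges are assumptions made for illustration, not part of naniar or any other package.

```r
# Hypothetical helper: `ranges` is a named list, one c(lower, upper) per column
replace_outside <- function(data, ranges) {
  for (col in names(ranges)) {
    lower <- ranges[[col]][1]
    upper <- ranges[[col]][2]
    outside <- data[[col]] < lower | data[[col]] > upper
    # which() drops the NA produced by values that are already missing
    data[[col]][which(outside)] <- NA
  }
  data
}

test <- data.frame(a = c(2, 1, 9, 2, 9, 8),
                   b = c(4, 5, 1, 9, 12, NA),
                   c = c(50, 34, 77, 88, 33, 60))

res <- replace_outside(test, list(b = c(4, 9), c = c(50, 80)))
res
```

Columns not named in the list (here a) are left untouched.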

Related

Routine for non-manual argument of a set of variables in coalesce() dplyr function [duplicate]

I have a list of dfs to be combined into one. These dfs have some matching columns and rows, and some that are distinct or missing.
Here is a minimal version of the first two dfs.
df1:
df1 <- structure(list(id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6),
Name = c("LI","NO","WH","MA","BU","SO","FO","AT","CO","IN","SP","CE"),
H_A = c("H", "A", "H", "A", "H", "A", "H", "A", "H", "A", "H", "A"),
W = c(15, 13, 5, 13, 9, 12, 10, 13, 1, 8, 4, 2),
X = c(NA, NA, NA, NA, NA, NA, 12, 7, 5, 13, 1, 3),
Y = c(0, 0, 0, 0, 0,0, NA, NA, NA, NA, NA, NA)),
row.names = c(NA,-12L), class = c("tbl_df","tbl", "data.frame"))
df2:
df2 <- structure(list(id = c(1, 1, 2, 2, 3, 3),
Name = c("LI","NO", "WH", "MA", "BU", "SO"),
H_A = c("H", "A", "H", "A", "H", "A"),
W = c(15, 13, 5, 13, 9, 12),
X = c(10, 12, 11, 15, 6, 14),
Z = c(4, 14, 16, 16, 25, 30)),
row.names = c(NA,-6L),class = c("tbl_df", "tbl", "data.frame"))
For these two dfs it can be solved manually like this:
df_combined <- full_join(df1, df2, by = c("id", "Name", "H_A")) %>%
mutate(X = coalesce(X.x, X.y),
W = coalesce(W.x, W.y)) %>%
select(-contains("."))
I would like to automate this routine so that the variables do not have to be typed manually into the mutate/coalesce call. In the real data there are several overlapping variables, not just X and W as above. I will also need to repeat the routine for df3, df4 and df5, which share the same key columns with df1.
Joins do not fill in missing values by themselves, so a fix has to be applied after the join. Although you can use if_else statements as shown in another answer, coalesce() is a much cleaner function to use.
See this post for another example (this could potentially be seen as a repeated question):
Using dplyr to fill in missing values (through a join?)
library(tidyverse)
df_test <- full_join(df1, df2, by = c("id", "Name", "H_A")) %>%
mutate(X = coalesce(X.x, X.y),
W = coalesce(W.x, W.y)) %>%
select(id, Name, H_A, W, X, Y, Z)
df_test == df_combined
id Name H_A W X Y Z
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[7,] TRUE TRUE TRUE TRUE TRUE NA NA
[8,] TRUE TRUE TRUE TRUE TRUE NA NA
[9,] TRUE TRUE TRUE TRUE TRUE NA NA
[10,] TRUE TRUE TRUE TRUE TRUE NA NA
[11,] TRUE TRUE TRUE TRUE TRUE NA NA
[12,] TRUE TRUE TRUE TRUE TRUE NA NA
The NA entries are expected: comparing two NAs with == returns NA rather than TRUE.
You can use left_join from dplyr and substitute the NAs like this, assuming that id and H_A together form the key:
library(dplyr)
df1 <- structure(list(id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6),
Name = c("LI","NO","WH","MA","BU","SO","FO","AT","CO","IN","SP","CE"),
H_A = c("H", "A", "H", "A", "H", "A", "H", "A", "H", "A", "H", "A"),
W = c(15, 13, 5, 13, 9, 12, 10, 13, 1, 8, 4, 2),
X = c(NA, NA, NA, NA, NA, NA, 12, 7, 5, 13, 1, 3),
Y = c(0, 0, 0, 0, 0,0, NA, NA, NA, NA, NA, NA)),
row.names = c(NA,-12L), class = c("tbl_df","tbl", "data.frame"))
df2 <- structure(list(id = c(1, 1, 2, 2, 3, 3),
Name = c("LI","NO", "WH", "MA", "BU", "SO"),
H_A = c("H", "A", "H", "A", "H", "A"),
W = c(15, 13, 5, 13, 9, 12),
X = c(10, 12, 11, 15, 6, 14),
Z = c(4, 14, 16, 16, 25, 30)),
row.names = c(NA,-6L),class = c("tbl_df", "tbl", "data.frame"))
df_combined <- left_join(df1,
df2 %>%
select(id, H_A, "df2_X" = X, Z)) %>%
mutate(X = if_else(is.na(X), df2_X, X)) %>%
select(-df2_X)
#> Joining, by = c("id", "H_A")
df_combined
#> # A tibble: 12 × 7
#> id Name H_A W X Y Z
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 LI H 15 10 0 4
#> 2 1 NO A 13 12 0 14
#> 3 2 WH H 5 11 0 16
#> 4 2 MA A 13 15 0 16
#> 5 3 BU H 9 6 0 25
#> 6 3 SO A 12 14 0 30
#> 7 4 FO H 10 12 NA NA
#> 8 4 AT A 13 7 NA NA
#> 9 5 CO H 1 5 NA NA
#> 10 5 IN A 8 13 NA NA
#> 11 6 SP H 4 1 NA NA
#> 12 6 CE A 2 3 NA NA
data.table approach
library(data.table)
# set to data.table format
setDT(df1); setDT(df2)
# perform an update join, overwriting NA-values in W, X and Y, and
# adding Z, based on key-columns ID, Name and H_A
df1[df2, `:=`(W = ifelse(is.na(W), i.W, W),
X = ifelse(is.na(X), i.X, X),
Y = ifelse(is.na(Y), i.Y, Y),
Z = i.Z),
on = .(id, Name, H_A)][]
# id Name H_A W X Y Z
# 1: 1 LI H 15 10 0 4
# 2: 1 NO A 13 12 0 14
# 3: 2 WH H 5 11 0 16
# 4: 2 MA A 13 15 0 16
# 5: 3 BU H 9 6 0 25
# 6: 3 SO A 12 14 0 30
# 7: 4 FO H 10 12 NA NA
# 8: 4 AT A 13 7 NA NA
# 9: 5 CO H 1 5 NA NA
#10: 5 IN A 8 13 NA NA
#11: 6 SP H 4 1 NA NA
#12: 6 CE A 2 3 NA NA
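None of the answers above removes the need to list the overlapping columns by hand, which is what the question asks for. The following base R sketch is one possible way to do it. The helper name coalesce_join is hypothetical, the data is a trimmed toy version of df1 and df2 with only id as key, and ifelse() stands in for dplyr::coalesce() to keep the example free of dependencies.

```r
# Hypothetical helper: full join on `keys`, then coalesce every non-key
# column that exists in both inputs, keeping the left value unless it is NA
coalesce_join <- function(x, y, keys) {
  shared <- setdiff(intersect(names(x), names(y)), keys)
  out <- merge(x, y, by = keys, all = TRUE, suffixes = c(".x", ".y"))
  for (v in shared) {
    left  <- out[[paste0(v, ".x")]]
    right <- out[[paste0(v, ".y")]]
    out[[v]] <- ifelse(is.na(left), right, left)
    out[[paste0(v, ".x")]] <- NULL
    out[[paste0(v, ".y")]] <- NULL
  }
  out
}

# Toy data (assumption: simplified from the question)
df1 <- data.frame(id = c(1, 2, 3), W = c(15, 5, 9),
                  X = c(NA, NA, 12), Y = c(0, 0, NA))
df2 <- data.frame(id = c(1, 2), W = c(15, 5),
                  X = c(10, 11), Z = c(4, 16))

combined <- coalesce_join(df1, df2, keys = "id")
combined
```

Further data frames could be folded in with Reduce(function(a, b) coalesce_join(a, b, keys = "id"), list_of_dfs), where list_of_dfs is a list holding df1, df2 and the remaining dfs. Note that ifelse() drops attributes, so factor or date columns would need extra care.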

How to replace missing data of questionnaire items with row means in R?

df <- data.frame(A1 = c(6, 8, NA, 1, 5),
A2 = c(NA, NA, 9, 3, 6),
A3 = c(9, NA, 1, NA, 4),
B1 = c(NA, NA, 9, 3, 6),
B2 = c(9, NA, 1, NA, 4),
B3 = c(NA, NA, 9, 3, 6)
)
I have a dataset with multiple questionnaires that each have multiple items. I would like to replace the missing data with the row mean of the observed values for each questionnaire (missing values in A items replaced by the row mean of A1 to A3, and missing values in B items replaced by the row mean of B1 to B3). What is the best way to do that?
You may try
library(dplyr)
df <- data.frame(A1 = c(6, 8, NA, 1, 5),
A2 = c(NA, NA, 9, 3, 6),
A3 = c(9, NA, 1, NA, 4),
B1 = c(NA, NA, 9, 3, 6),
B2 = c(9, NA, 1, NA, 4),
B3 = c(NA, NA, 9, 3, 6)
)
df1 <- df %>%
select(starts_with("A"))
df2 <- df %>%
select(starts_with("B"))
x1 <- which(is.na(df1), arr.ind = TRUE)
df1[x1] <- rowMeans(df1, na.rm = TRUE)[x1[, 1]]
x2 <- which(is.na(df2), arr.ind = TRUE)
df2[x2] <- rowMeans(df2, na.rm = TRUE)[x2[, 1]]
df <- cbind(df1, df2)
df
A1 A2 A3 B1 B2 B3
1 6 7.5 9 9 9 9
2 8 8.0 8 NaN NaN NaN
3 5 9.0 1 9 1 9
4 1 3.0 2 3 3 3
5 5 6.0 4 6 4 6
You may use split.default to split the data into groups of columns and replace NA with the row-wise mean (taken from this answer https://stackoverflow.com/a/6918323/3962914 )
as.data.frame(lapply(split.default(df, sub('\\d+', '', names(df))), function(x) {
k <- which(is.na(x), arr.ind = TRUE)
x[k] <- rowMeans(x, na.rm = TRUE)[k[, 1]]
x
})) -> result
names(result) <- names(df)
result
# A1 A2 A3 B1 B2 B3
#1 6 7.5 9 9 9 9
#2 8 8.0 8 NaN NaN NaN
#3 5 9.0 1 9 1 9
#4 1 3.0 2 3 3 3
#5 5 6.0 4 6 4 6
You could also do:
library(dplyr)
library(tidyr)  # needed for pivot_wider()
df %>%
  reshape(names(.), dir='long', sep="")%>%
  group_by(id) %>%
  mutate(across(A:B, ~replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))%>%
  pivot_wider(id_cols = id, names_from = time, values_from = A:B, names_sep = "") %>%
ungroup() %>%
select(-id)
# A tibble: 5 x 6
A1 A2 A3 B1 B2 B3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 7.5 9 9 9 9
2 8 8 8 NaN NaN NaN
3 5 9 1 9 1 9
4 1 3 2 3 3 3
5 5 6 4 6 4 6
We can use split.default with na.aggregate
library(purrr)
library(zoo)
library(dplyr)
library(stringr)
map_dfc(split.default(df, str_remove(names(df), "\\d+")), ~
as_tibble(t(na.aggregate(t(.x)))))
# A tibble: 5 × 6
A1 A2 A3 B1 B2 B3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 7.5 9 9 9 9
2 8 8 8 NaN NaN NaN
3 5 9 1 9 1 9
4 1 3 2 3 3 3
5 5 6 4 6 4 6
Build a matrix of row means that matches the shape of the data and use it to replace the NAs, inside an lapply that greps the columns of each questionnaire.
do.call(cbind, lapply(c('A', 'B'), function(q) {
s <- df[, grep(q, names(df))]
na <- is.na(s)
replace(s, na, rowMeans(s, na.rm=TRUE)[row(s)][na])
}))
# A1 A2 A3 B1 B2 B3
# 1 6 7.5 9 9 9 9
# 2 8 8.0 8 NaN NaN NaN
# 3 5 9.0 1 9 1 9
# 4 1 3.0 2 3 3 3
# 5 5 6.0 4 6 4 6
Data:
df <- structure(list(A1 = c(6, 8, NA, 1, 5), A2 = c(NA, NA, 9, 3, 6
), A3 = c(9, NA, 1, NA, 4), B1 = c(NA, NA, 9, 3, 6), B2 = c(9,
NA, 1, NA, 4), B3 = c(NA, NA, 9, 3, 6)), class = "data.frame", row.names = c(NA,
-5L))
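All of the outputs above show NaN in row 2 of the B items. That row has no observed B value at all, so its row mean is the mean of an empty set. A short sketch of where the NaN comes from and how it could be turned back into NA, if that is preferred (the variable names are illustrative):

```r
# Row 2 of the B items: every value is missing
row_b <- c(NA_real_, NA_real_, NA_real_)

# After dropping the NAs nothing is left, and the mean of nothing is NaN
m <- mean(row_b, na.rm = TRUE)
is.nan(m)

# Optional clean-up on an imputed column: turn NaN back into NA
filled <- c(9, NaN, 9, 3, 6)
filled[is.nan(filled)] <- NA
filled
```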

remove rows containing NA based on condition

df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
From df I want to remove the rows that contain NA in y, but only where the x value in that row is smaller than the x value in the row with the minimum y value. The result should be this data frame:
data.frame(x = 3:7, y = c(5, 10, NA, 20, 30))
dplyr solutions preferred!
We could use which.min to get the index of the minimum 'y' value and use it to pick the corresponding 'x'. Comparing all 'x' values against that value, combining the comparison with the test for NA elements in 'y', and negating (!) the result gives the rows to keep.
subset(df, !(x < x[which.min(y)] & is.na(y)))
-output
x y
3 3 5
4 4 10
5 5 NA
6 6 20
7 7 30
Or the same logic can be applied with dplyr::filter
library(dplyr)
df %>%
  filter(!(x < x[which.min(y)] & is.na(y)))
-output
x y
1 3 5
2 4 10
3 5 NA
4 6 20
5 7 30
data
df <- structure(list(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30)),
class = "data.frame", row.names = c(NA,
-7L))
Use logical indices for each of the conditions, combine them with logical AND, &, and negate the result:
df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
i <- is.na(df$y)
j <- df$x < df$x[which.min(df$y)]
df[!(i & j), ]
#  x  y
#3 3  5
#4 4 10
#5 5 NA
#6 6 20
#7 7 30

Check whether names(df) is in other character list giving both true and false values back

I have the following
structure(list(id = c(14, 15, 16, 17, 18), a = c(1, 2, 3, 5,
6), b = c(3, NA, 2, 5, 7), c = c(1, 2, 3, 4, 5)), row.names = c(NA,
-5L), class = "data.frame")
id a b c
1 14 1 3 1
2 15 2 NA 2
3 16 3 2 3
4 17 5 5 4
5 18 6 7 5
library(caret)
corr <- cor(na.omit(df))
highcorr <- findCorrelation(corr, cutoff = 0.9, names=TRUE)
highcorr
[1] "a" "id"
I would like to get a new data frame with one row per column name, containing TRUE if the column name is in highcorr and FALSE otherwise. The new data frame would look like this
col result
1 id TRUE
2 a TRUE
3 b FALSE
4 c FALSE
I think I'm overcomplicating this. I tried things with %in%, but then I only got the TRUE values. Any suggestion would be appreciated :)!
You can use below code:
library(reshape2)
library(dplyr)
a<-melt(corr,value.name = "corr")
a<-a[!duplicated(a$corr),]
a<- a %>% select(Var1, corr)%>% mutate(result = ifelse(corr > 0.9,T,F )) %>% select(Var1, result)
Var1 result
1 id TRUE
2 a TRUE
3 b FALSE
4 b FALSE
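For the table asked for in the question, %in% used on the full vector of column names returns one TRUE or FALSE per name. A minimal sketch, assuming the data frame from the question is stored as df and with highcorr copied from the output shown in the question rather than recomputed with caret:

```r
df <- data.frame(id = c(14, 15, 16, 17, 18),
                 a  = c(1, 2, 3, 5, 6),
                 b  = c(3, NA, 2, 5, 7),
                 c  = c(1, 2, 3, 4, 5))

# Assumption: copied from the findCorrelation() output in the question
highcorr <- c("a", "id")

# names(df) %in% highcorr has the same length as names(df)
result <- data.frame(col = names(df), result = names(df) %in% highcorr)
result
```

Getting only the TRUE values usually means the logical vector was used as an index, as in names(df)[names(df) %in% highcorr], instead of being kept as a column.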

R Recode Variables In A Loop

Ciao,
Here is a reproducible example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1"=c(6,88,17,5,18),
"TEST2"=c(34,NA,87,88,82),
"TEST3"=c(87,62,13,8,71),
"TEST1NEW"=c(0,1,0,0,0),
"TEST2NEW"=c(0,NA,1,1,1),
"TEST3NEW"=c(1,1,0,0,1))
Given a data frame df with STUDENT, TEST1, TEST2 and TEST3, I want to create TEST1NEW, TEST2NEW and TEST3NEW such that each new variable equals 1 when the corresponding TEST variable is 50 or more, and 0 when it is below 50. My attempt is below, but it does not work, and I believe this may require a loop.
COLUMNS <- c("TEST1", "TEST2", "TEST3")
df[paste0(COLUMNS)] <- replace(df[COLUMNS],df[COLUMNS] < 50, 0 , 1, NA)
You could do
df[, paste0("TEST", 1:3, "_NEW")] <- as.integer(df[,-1] >= 50)
df
# STUDENT TEST1 TEST2 TEST3 TEST1_NEW TEST2_NEW TEST3_NEW
#1 1 6 34 87 0 0 1
#2 2 88 NA 62 1 NA 1
#3 3 17 87 13 0 1 0
#4 4 5 88 8 0 1 0
#5 5 18 82 71 0 1 1
data
df <- data.frame(
"STUDENT" = c(1, 2, 3, 4, 5),
"TEST1" = c(6, 88, 17, 5, 18),
"TEST2" = c(34, NA, 87, 88, 82),
"TEST3" = c(87, 62, 13, 8, 71)
)
In case where the assignment is more complex we can make use of dplyr::case_when
library(dplyr)
df[, paste0("TEST", 1:3, "_NEW")] <- case_when(df[,-1] < 20 ~ 4L,
df[,-1] >= 65 ~ 8L,
is.na(df[,-1]) ~ NA_integer_,
TRUE ~ 7L)
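Since the question asks about a loop, here is a base R sketch that recodes one column at a time, followed by an lapply version that does the same thing. It assumes the new columns should be named TEST1NEW and so on, as in the question, and the names cols and col are only illustrative.

```r
df <- data.frame(STUDENT = c(1, 2, 3, 4, 5),
                 TEST1 = c(6, 88, 17, 5, 18),
                 TEST2 = c(34, NA, 87, 88, 82),
                 TEST3 = c(87, 62, 13, 8, 71))

cols <- c("TEST1", "TEST2", "TEST3")

# Explicit loop: 1 when the score is 50 or more, 0 when below, NA stays NA
for (col in cols) {
  df[[paste0(col, "NEW")]] <- as.integer(df[[col]] >= 50)
}

# Equivalent without a loop, applying the same recode to each column
df[paste0(cols, "NEW")] <- lapply(df[cols], function(v) as.integer(v >= 50))
df
```

Working column by column keeps each recode on a plain vector, which also makes it straightforward to swap in a more complex rule per column.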

Resources