I need to create a variable that equals var1 if var2 is missing, equals var2 if var1 is missing, is the mean of var1 and var2 ((var1+var2)/2) if neither is missing, and is NA if both var1 and var2 are missing.
I have data like:
library(tidyverse)
df <- tibble(
var1 = c(1, 2, 3, 4, NA, NA, 3, 2),
var2 = c(3, 5, NA, 2, 3, NA, 4, NA)
)
The result should be:
var1 var2 newvar
1 3 2
2 5 3.5
3 NA 3
4 2 3
NA 3 3
NA NA NA
3 4 3.5
2 NA 2
I have tried base R recoding functions, and also case_when:
df <- df %>% mutate (
newvar = case_when(
var1 == NA ~ var2,
var2 == NA ~ var1,
TRUE ~ (var1+var2)/2
)
)
I am not sure whether the last line is correct, but in any case the code didn't work because of the missing values. It says:
Error in mutate_impl(.data, dots) :
Evaluation error: NAs are not allowed in subscripted assignments.
df %>% mutate (
newvar = case_when(
xor(is.na(var1), is.na(var2)) ~ pmax(var1, var2, na.rm = TRUE),
!is.na(var1) & !is.na(var2) ~ (var1 + var2)/2,
TRUE ~ NA_real_
)
)
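As a possible alternative to spelling out each case, the same rule can be sketched in base R with rowMeans(), which skips missing values when na.rm = TRUE. This is only a sketch on the question's toy data; the conversion of NaN to NA for the all-missing row is an extra step added here, not part of the answers.

```r
# Toy data copied from the question
df <- data.frame(
  var1 = c(1, 2, 3, 4, NA, NA, 3, 2),
  var2 = c(3, 5, NA, 2, 3, NA, 4, NA)
)

# Row-wise mean that ignores missing values; unname() drops the row names
newvar <- unname(rowMeans(df[, c("var1", "var2")], na.rm = TRUE))

# A row with no valid value yields NaN, so turn it into NA
newvar[is.nan(newvar)] <- NA

df$newvar <- newvar
df$newvar
# 2.0 3.5 3.0 3.0 3.0 NA 3.5 2.0
```

With a single valid value the mean is that value itself, which covers the "use the other variable" cases without any explicit branching.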
Almost there; with some minor edits it works on my end. Use is.na(x) instead of x == NA, because any comparison with NA returns NA. Also, the TRUE at the end should check what you actually want: the case where neither of them is NA.
df %>% mutate (
newvar = case_when(
is.na(var1) ~ var2,
is.na(var2) ~ var1,
!is.na(var1) & !is.na(var2) ~ (var1+var2)/2
)
)
Produces
# A tibble: 8 x 3
var1 var2 newvar
<dbl> <dbl> <dbl>
1 1 3 2
2 2 5 3.5
3 3 NA 3
4 4 2 3
5 NA 3 3
6 NA NA NA
7 3 4 3.5
8 2 NA 2
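To illustrate why the original attempt failed, here is a small base R sketch (the vectors are made up for illustration) showing how comparison with NA behaves, and the element-wise operator that vectorised conditions need.

```r
var1 <- c(1, NA, 3)
var2 <- c(NA, NA, 4)

# Comparing anything with NA gives NA, never TRUE or FALSE
var1 == NA
# NA NA NA

# is.na() is the test that actually detects missing values
is.na(var1)
# FALSE TRUE FALSE

# `&` works element by element, which suits vectorised conditions;
# `&&` is meant for a single TRUE/FALSE, e.g. inside if ()
both_present <- !is.na(var1) & !is.na(var2)
both_present
# FALSE FALSE TRUE
```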
I am currently working on data structured like this:
library(tibble)
df <- tibble(
id = c("1", "2", "3", "4", "5"),
var1 = c(2, NA, 3, 1, 2),
var2 = c(1, 2, NA, NA, 2),
var3 = c(5, 8, 6, NA, NA),
var4 = c(11, 22, 33, 44, 55)
)
> df
# A tibble: 5 × 5
id var1 var2 var3 var4
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 5 11
2 2 NA 2 8 22
3 3 3 NA 6 33
4 4 1 NA NA 44
5 5 2 2 NA 55
I need to compute the unstandardised mean difference between each pair of valid values across var1, var2, and var3, separately for each row.
I would like to get a new variable in the tibble holding the mean absolute difference over all pairs of these three variables.
If I were to do it by hand for the first row, I would calculate the differences first
2 - 1 = 1
1 - 5 = -4
2 - 5 = -3
then take the absolute value, as I am interested in the distances only
|1| = 1
|-4| = 4
|-3| = 3
and then compute the average of the differences
(1 + 4 + 3) / 3 = 2.67
An important exception: if one or more NAs are present, they should not be counted, neither in the differences nor in the average. E.g. in the 2nd row, I'd need the result to be 6, not NA.
With 2 NAs in a row, the expected average difference would be 0, but NA would also be acceptable.
What I tried so far didn't work, as it does not sum by row:
df %>%
mutate(meandiff = sum(
abs(sum(var1, -var2, na.rm = TRUE)),
abs(sum(var2, -var3, na.rm = TRUE)),
abs(sum(var1, -var3, na.rm = TRUE)),
na.rm = TRUE
) / 3)
I was thinking of using the function rowsum(), but I need the pairwise differences, not a sum over all three variables at the same time.
Would you be able to help me find out a way to compute it in R?
Thank you!
Something like this?
func <- function(...) {
dots <- na.omit(c(...))
sum(abs(diff(c(dots, dots[1]))), na.rm = TRUE) / length(dots)
}
df %>%
mutate(meandiff = mapply(func, var1, var2, var3))
# # A tibble: 5 x 6
# id var1 var2 var3 var4 meandiff
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 1 5 11 2.67
# 2 2 NA 2 8 22 6
# 3 3 3 NA 6 33 3
# 4 4 1 NA NA 44 0
# 5 5 2 2 NA 55 0
(This calculates var3 - var1 for the third mid-sum value instead of your var1 - var3, but since you use abs it should not matter.)
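The diff() trick above covers exactly three variables. A more general sketch, assuming any number of columns, enumerates every pair with combn(). The helper name pair_mean_diff is hypothetical, and returning NA when fewer than two values are present is a choice made here (the question accepts either 0 or NA).

```r
# Hypothetical helper: mean absolute difference over all pairs of
# non-missing values; NA when fewer than two values are available
pair_mean_diff <- function(...) {
  x <- c(...)
  x <- x[!is.na(x)]
  if (length(x) < 2) return(NA_real_)
  # combn() applies diff() to each pair of values
  mean(abs(combn(x, 2, FUN = diff)))
}

# Columns copied from the question
var1 <- c(2, NA, 3, 1, 2)
var2 <- c(1, 2, NA, NA, 2)
var3 <- c(5, 8, 6, NA, NA)

# mapply() calls the helper once per row
meandiff <- mapply(pair_mean_diff, var1, var2, var3)
round(meandiff, 2)
# 2.67 6.00 3.00 NA 0.00
```

Inside a pipeline this would be used as mutate(meandiff = mapply(pair_mean_diff, var1, var2, var3)).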
If I understand you correctly, you want something like
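df %>%
mutate(
meandiff = rowMeans(
cbind(
abs(var1 - var2),
abs(var2 - var3),
abs(var1 - var3)
), na.rm = TRUE)
)
Btw, once the NAs are removed, dividing by a fixed 3 is no longer correct. rowMeans() with na.rm = TRUE divides by the number of valid pairs instead, and returns NaN for a row with no valid pair (row 4 here).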
I want to indicate if all observations in a given row are NA. For example, with the following data:
dat <- tibble::tribble(
~x, ~y, ~z,
1, 2, NA,
1, 2, 3,
NA, NA, NA,
NA, NA, NA
)
dat
# A tibble: 4 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 NA
2 1 2 3
3 NA NA NA
4 NA NA NA
I want to create a new column (allisna) to indicate if all observations are NA. Note: I want to do this using dplyr (if needed, I can use other tidyverse functions, but not base R functions like apply()).
I have the following solution, but I prefer a solution that uses rowwise() and another dplyr function call inside of mutate.
library(dplyr)
dat %>%
mutate(allisna = apply(dat, 1, function(x){
case_when(all(is.na(x)) ~ 1,
TRUE ~ 0)
}))
The final product should be:
# A tibble: 4 x 4
x y z allisna
<dbl> <dbl> <dbl> <dbl>
1 1 2 NA 0
2 1 2 3 0
3 NA NA NA 1
4 NA NA NA 1
In base R, without using apply, you can do
dat$allisna <- +(rowSums(!is.na(dat)) == 0)
Figured out one solution (but am open to and will reward more!):
dat %>%
rowwise() %>%
mutate(allisna = case_when(all(is.na(c_across(everything()))) ~ 1,
TRUE ~ 0))
EDIT to include @RonakShah's answer
dat %>%
mutate(allisna = as.numeric(rowSums(!is.na(.)) == 0))
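Another dplyr-only option, sketched here on the assumption that dplyr 1.0.4 or later is installed, is if_all(), which applies a predicate to the selected columns and combines the results row by row, so neither rowwise() nor apply() is needed.

```r
library(dplyr)

# Data copied from the question
dat <- tibble::tribble(
  ~x, ~y, ~z,
  1,  2,  NA,
  1,  2,  3,
  NA, NA, NA,
  NA, NA, NA
)

# if_all() is TRUE for a row only when is.na() holds in every selected column
out <- dat %>%
  mutate(allisna = as.numeric(if_all(everything(), is.na)))

out$allisna
# 0 0 1 1
```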
If you have a complete data frame, it is easy to multiply values based on a logical condition:
df = data.frame(
var1 = c(1, 2, 3, 4, 5),
var2 = c(1, 2, 3, 2, 1),
var3 = c(5, 4, 3, 4, 5)
)
> df
var1 var2 var3
1 1 1 5
2 2 2 4
3 3 3 3
4 4 2 4
5 5 1 5
> df[df > 2] <- df[df > 2] * 10
> df
var1 var2 var3
1 1 1 50
2 2 2 40
3 30 30 30
4 40 2 40
5 50 1 50
However, if you have NA values in the data frame, the operation fails:
> df_na = data.frame(
var1 = c(NA, 2, 3, 4, 5),
var2 = c(1, 2, 3, 1, NA),
var3 = c(5, NA, 3, 4, 5)
)
> df_na
var1 var2 var3
1 NA 1 5
2 2 2 NA
3 3 3 3
4 4 1 4
5 5 NA 5
> df_na[df_na > 2] <- df_na[df_na > 2] * 10
Error in `[<-.data.frame`(`*tmp*`, df_na > 2, value = c(NA, 30, 40, 50, :
'value' is the wrong length
I tried, for example, some na.omit() tactics but could not make them work. I also could not find an appropriate question here on Stack Overflow.
So, how should I do it?
You can add !is.na() as an additional logical argument to subset by:
df_na[df_na > 2 & !is.na(df_na)] <- df_na[df_na > 2 & !is.na(df_na)] * 10
# > df_na
# var1 var2 var3
# 1 NA 1 50
# 2 2 2 NA
# 3 30 30 30
# 4 40 1 40
# 5 50 NA 50
Alternatively, a dplyr / tidyverse solution would be:
library(dplyr)
df_na %>%
mutate_all(.funs = ~ ifelse(!is.na(.x) & .x > 2, .x * 10, .x))
Added based on OP comment:
If you want to subset by values based on the %in% operator, opt for the dplyr solution (the %in% operator won't work the same way here as explained in this post):
df_na %>%
mutate_all(.funs = ~ ifelse(!is.na(.x) & .x %in% c(3, 4), .x * 10, .x))
# var1 var2 var3
# 1 NA 1 5
# 2 2 2 NA
# 3 30 30 30
# 4 40 1 40
# 5 5 NA 5
This approach generally lends itself to more complex manipulation tasks. You may, for instance, also define additional conditions with the help of dplyr::case_when() instead of the one-alternative ifelse.
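In current dplyr versions, mutate_all() is superseded by across(). A sketch of the same operation in that style, assuming dplyr 1.0.0 or later, could look like this:

```r
library(dplyr)

# Data copied from the question
df_na <- data.frame(
  var1 = c(NA, 2, 3, 4, 5),
  var2 = c(1, 2, 3, 1, NA),
  var3 = c(5, NA, 3, 4, 5)
)

# Multiply non-missing values greater than 2 by 10, leave the rest untouched
out <- df_na %>%
  mutate(across(everything(), ~ ifelse(!is.na(.x) & .x > 2, .x * 10, .x)))

out$var1
# NA 2 30 40 50
```

The !is.na(.x) guard makes the condition FALSE rather than NA for missing cells, so ifelse() returns the original NA unchanged.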
Does this work? Using base R:
df_na[] <- lapply(df_na, function(x) ifelse(!is.na(x) & x > 2, x * 10, x))
df_na
var1 var2 var3
1 NA 1 50
2 2 2 NA
3 30 30 30
4 40 1 40
5 50 NA 50
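The problem is not with the multiplication but with the array indexing: df_na > 2 returns NA for the missing cells, and NA is not a valid index in an assignment. which() keeps only the positions that are TRUE and drops the NAs.
You can combine the two lines below into one if you like: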
inds <- which(df_na > 2, arr.ind = TRUE)
df_na[inds] <- df_na[inds] * 10
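The effect of which() is easier to see on a plain vector. The following base R sketch (using the values of var1 from the question) shows the NA in the logical index and how which() removes it:

```r
x <- c(NA, 2, 3, 4, 5)

# The comparison propagates the missing value into the index
x > 2
# NA FALSE TRUE TRUE TRUE

# which() returns only the positions that are TRUE, dropping NA
idx <- which(x > 2)
idx
# 3 4 5

# Indexing by position is safe for assignment
x[idx] <- x[idx] * 10
x
# NA 2 30 40 50
```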
Consider this simple dataframe
> data_frame(var1 = c(NA, 1 , NA),
+ var2 = c (NA, 3, NA))
# A tibble: 3 x 2
var1 var2
<dbl> <dbl>
1 NA NA
2 1 3
3 NA NA
I want to fill forward all the missing values for all columns in the dataframe. It seems to me that tidyr::fill() can do that, but I am unable to make it work without specifying the columns one at a time.
> data_frame(var1 = c(NA, 1 , NA),
+ var2 = c (NA, 3, NA)) %>% tidyr::fill(.direction = 'down')
# A tibble: 3 x 2
var1 var2
<dbl> <dbl>
1 NA NA
2 1 3
3 NA NA
while entering the column name seems to work
> data_frame(var1 = c(NA, 1 , NA),
+ var2 = c (NA, 3, NA)) %>% tidyr::fill(var1, .direction = 'down')
# A tibble: 3 x 2
var1 var2
<dbl> <dbl>
1 NA NA
2 1 3
3 1 NA
what am I missing here?
Thanks
tidyr verbs accept dplyr::select column specifications, so you can use everything():
library(tidyverse)
df <- data_frame(var1 = c(NA, 1 , NA),
var2 = c (NA, 3, NA))
df %>% fill(everything())
#> # A tibble: 3 x 2
#> var1 var2
#> <dbl> <dbl>
#> 1 NA NA
#> 2 1 3
#> 3 1 3
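Calling fill() with no column selection leaves the data unchanged, which is why the first attempt in the question had no effect. As a small extension, and assuming tidyr 1.0.0 or later, .direction = "downup" also fills the leading NA by filling down first and then up:

```r
library(tidyr)

df <- tibble::tibble(
  var1 = c(NA, 1, NA),
  var2 = c(NA, 3, NA)
)

# Fill every column down, then up, so leading NAs are filled as well
filled <- fill(df, everything(), .direction = "downup")

filled$var1
# 1 1 1
filled$var2
# 3 3 3
```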
We can convert the column names to symbols with syms and splice them into the call with !!!
d1 %>%
tidyr::fill(!!! rlang::syms(names(.)), .direction = 'down')
# A tibble: 3 x 2
# var1 var2
# <dbl> <dbl>
#1 NA NA
#2 1 3
#3 1 3
data
d1 <- data_frame(var1 = c(NA, 1 , NA), var2 = c (NA, 3, NA))
I am aware of all the questions about filtering on multiple conditions, which have very comprehensive answers, such as Q1, Q2, or even those on removing NA values, Q3, Q4.
But I have a different question: how can I filter using dplyr or even data.table functions to keep both the NA values and the rows matching a condition?
As an example, in the following I'd like to keep all of the rows where Var3 is less than 5, PLUS the rows where Var3 is NA.
library(data.table)
library(dplyr)
Var1<- seq(1:5)
Var2<- c("s", "a", "d", NA, NA)
Var3<- c(NA, NA, 2, 5, 2)
Var4<- c(NA, 5, 1, 3,4)
DT <- data.table(Var1,Var2,Var3, Var4)
DT
Var1 Var2 Var3 Var4
1: 1 s NA NA
2: 2 a NA 5
3: 3 d 2 1
4: 4 NA 5 3
5: 5 NA 2 4
The Expected results:
Var1 Var2 Var3 Var4
1: 1 s NA NA
2: 2 a NA 5
3: 3 d 2 1
4: 5 NA 2 4
I have tried the following, without success:
##Using dplyr::filter
DT %>% filter(!Var3 ==5)
Var1 Var2 Var3 Var4
1 3 d 2 1
2 5 <NA> 2 4
# or
DT %>% filter(Var3 <5 & is.na(Var3))
[1] Var1 Var2 Var3 Var4
<0 rows> (or 0-length row.names)
## using data.table
DT[DT[,.I[Var3 <5], Var1]$V1]
Var1 Var2 Var3 Var4
1: NA NA NA NA
2: NA NA NA NA
3: 3 d 2 1
4: 5 NA 2 4
Any help with explanation is highly appreciated!
I think this will work. Use | to express "or" between the filter conditions. dt2 is the expected output.
library(dplyr)
Var1 <- seq(1:5)
Var2 <- c("s", "a", "d", NA, NA)
Var3 <- c(NA, NA, 2, 5, 2)
Var4 <- c(NA, 5, 1, 3, 4)
dt <- data_frame(Var1, Var2, Var3, Var4)
dt2 <- dt %>% filter(Var3 < 5 | is.na(Var3))
With data.table, we use the following logic to keep the rows where 'Var3' is less than 5 and not NA (!is.na(Var3)), or (|) where it is NA
DT[(Var3 < 5 & !is.na(Var3)) | is.na(Var3)]
# Var1 Var2 Var3 Var4
#1: 1 s NA NA
#2: 2 a NA 5
#3: 3 d 2 1
#4: 5 NA 2 4
If we need dplyr, just use the same logic in filter
DT %>%
filter((Var3 <5 & !is.na(Var3)) | is.na(Var3))
As @ycw mentioned, the & !is.na(Var3) is not really needed here, but it becomes important if we remove the | is.na(Var3) part
DT[, Var3 < 5 ]
#[1] NA NA TRUE FALSE TRUE
DT[, Var3 < 5 & !is.na(Var3)]
#[1] FALSE FALSE TRUE FALSE TRUE
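The reason Var3 < 5 | is.na(Var3) is enough comes from R's three-valued logic: NA | TRUE is TRUE, so the NA produced by the comparison is overridden wherever is.na() is TRUE. A minimal base R sketch, using a plain data.frame d as a stand-in for DT:

```r
Var3 <- c(NA, NA, 2, 5, 2)

# NA combined with TRUE through | gives TRUE
NA | TRUE
# TRUE

# So missing values are kept, and 5 is dropped
keep <- Var3 < 5 | is.na(Var3)
keep
# TRUE TRUE TRUE FALSE TRUE

# Stand-in data frame with two of the question's columns
d <- data.frame(Var1 = 1:5, Var3 = Var3)
d[keep, "Var1"]
# 1 2 3 5
```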