I am currently working on data structured like this:
library(tibble)
df <- tibble(
id = c("1", "2", "3", "4", "5"),
var1 = c(2, NA, 3, 1, 2),
var2 = c(1, 2, NA, NA, 2),
var3 = c(5, 8, 6, NA, NA),
var4 = c(11, 22, 33, 44, 55)
)
> df
# A tibble: 5 × 5
id var1 var2 var3 var4
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 5 11
2 2 NA 2 8 22
3 3 3 NA 6 33
4 4 1 NA NA 44
5 5 2 2 NA 55
For each row separately, I need to compute the unstandardised mean absolute difference between every pair of non-missing values among var1, var2, and var3.
I would like to store the result as a new variable in the tibble: the mean of the differences between each pair of those three variables.
If I were to do it by hand for the first row, I would calculate the differences first
2 - 1 = 1
1 - 5 = -4
2 - 5 = -3
then take the absolute value, as I am interested in the distances only
|1| = 1
|-4| = 4
|-3| = 3
and then compute the average of the absolute differences
(1 + 4 + 3) / 3 = 2.67
An important exception: if one or more NAs are present, they should be excluded from both the differences and the average. E.g. in the 2nd row, I'd need the result to be 6, not NA.
With 2 NAs (only one valid value left), I would expect the average difference to be 0, but NA would be acceptable.
What I tried so far didn't work, as it does not sum by row:
df %>%
mutate(meandiff = sum(
abs(sum(var1, -var2, na.rm = TRUE)),
abs(sum(var2, -var3, na.rm = TRUE)),
abs(sum(var1, -var3, na.rm = TRUE)),
na.rm = TRUE
) / 3)
I was thinking of using the function rowsum(), but I need the pairwise differences, not a sum over all three variables at the same time.
Would you be able to help me find a way to compute it in R?
Thank you!
Something like this?
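library(dplyr)
# Mean absolute pairwise difference of the non-missing values in one row
func <- function(...) {
  dots <- na.omit(c(...))
  # Appending dots[1] closes the cycle: with three values diff() covers all three pairs
  sum(abs(diff(c(dots, dots[1]))), na.rm = TRUE) / length(dots)
}
df %>%
  mutate(meandiff = mapply(func, var1, var2, var3))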
# # A tibble: 5 x 6
# id var1 var2 var3 var4 meandiff
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 1 5 11 2.67
# 2 2 NA 2 8 22 6
# 3 3 3 NA 6 33 3
# 4 4 1 NA NA 44 0
# 5 5 2 2 NA 55 0
(This calculates var3 - var1 for the third mid-sum value instead of your var1 - var3, but since you use abs it should not matter.)
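The cyclic diff() trick above covers every pair only when a row has at most three values. As a minimal sketch for any number of columns, using base R only, combn() can enumerate the pairs explicitly. The helper name pairwise_mean_abs_diff is hypothetical, and returning NA for rows with fewer than two valid values is an assumption (the question accepts either 0 or NA there).

```r
# Hypothetical helper: mean absolute difference over all pairs of non-missing values
pairwise_mean_abs_diff <- function(...) {
  x <- c(...)
  x <- x[!is.na(x)]
  # Assumption: with fewer than two valid values no pair exists, so return NA
  if (length(x) < 2) return(NA_real_)
  # combn() applies FUN to every pair (each column of the 2 x k combination matrix)
  mean(combn(x, 2, FUN = function(p) abs(p[1] - p[2])))
}

df <- data.frame(
  id   = c("1", "2", "3", "4", "5"),
  var1 = c(2, NA, 3, 1, 2),
  var2 = c(1, 2, NA, NA, 2),
  var3 = c(5, 8, 6, NA, NA)
)

# mapply() walks the three columns in parallel, one row at a time
df$meandiff <- mapply(pairwise_mean_abs_diff, df$var1, df$var2, df$var3)
```

To pass more columns, add them as further arguments to mapply(); the helper itself does not depend on the number of values.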
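If I understand you correctly, you want something like
df %>%
  mutate(
    meandiff = rowMeans(
      cbind(
        abs(var1 - var2),
        abs(var2 - var3),
        abs(var1 - var3)
      ), na.rm = TRUE)
  )
Btw, once the NA pairs are removed, dividing by a fixed 3 is no longer proper. rowMeans() with na.rm = TRUE divides by the number of non-missing pairs instead, so row 2 gives 6; a row with fewer than two valid values gives NaN.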
I have made a very complex solution to something I feel should have a much simpler solution.
In short what I want:
I want to compute a new column containing the minimum value across 3 columns
I want to ignore zeros and NAs
If I only have zeros and NAs I want a zero
If I have only NAs I want a NA
Here is my solution, it works, but it is very complex and produces a warning.
> library(dplyr)
> df <- data.frame(
+ id = c(1, 2, 3, 4),
+ test1 = c( NA, NA, 2 , 3),
+ test2 = c( NA, 0, 1 , 1),
+ test3 = c(NA, NA, 0 , 2)
+ )
> df2 <- df %>%
+ mutate(nieuw = apply(across(test1:test3), 1, function(x) min(x[x>0]))) %>%
+ rowwise() %>%
+ mutate(nieuw = if_else(is.na(nieuw), max(across(test1:test3), na.rm = TRUE), nieuw)) %>%
+ mutate(nieuw = ifelse(is.infinite(nieuw), NA, nieuw))
> df
id test1 test2 test3
1 1 NA NA NA
2 2 NA 0 NA
3 3 2 1 0
4 4 3 1 2
> df2
# A tibble: 4 x 5
# Rowwise:
id test1 test2 test3 nieuw
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA NA
2 2 NA 0 NA 0
3 3 2 1 0 1
4 4 3 1 2 1
Warning message:
Problem while computing `nieuw = if_else(...)`.
i no non-missing arguments to max; returning -Inf
i The warning occurred in row 1.
You can create a helper function and then apply it rowwise:
library(dplyr)
safe <- function(x, f, ...) ifelse(all(is.na(x)), NA,
ifelse(all(is.na(x) | x == 0),
0, f(x[x > 0], na.rm = TRUE, ...)))
df %>%
rowwise() %>%
mutate(a = safe(c_across(test1:test3), min))
# A tibble: 4 × 5
# Rowwise:
id test1 test2 test3 a
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA NA
2 2 NA 0 NA 0
3 3 2 1 0 1
4 4 3 1 2 1
Here is another option. It replaces zeros and NAs with very large sentinel values and then recodes them at the end (this assumes all real values are smaller than 1e5):
library(tidyverse)
get_min <- function(data, cols){
data[is.na(data)] <- 1e6
data[data == 0] <- 1e5
nums <- do.call(pmin, select(data, all_of(cols)))
recode(nums, `1e+06` = NA_real_, `1e+05` = 0.)
}
df %>%
mutate(nieuw = get_min(., c("test1", "test2", "test3")))
#> id test1 test2 test3 nieuw
#> 1 1 NA NA NA NA
#> 2 2 NA 0 NA 0
#> 3 3 2 1 0 1
#> 4 4 3 1 2 1
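The three rules from the question can also be spelled out one by one without any packages. This is only a sketch: min_positive is a hypothetical helper name, and the example applies it to each row with base R's apply().

```r
# Hypothetical helper implementing the rules from the question for one row
min_positive <- function(x) {
  # Only NAs: return NA
  if (all(is.na(x))) return(NA_real_)
  pos <- x[!is.na(x) & x > 0]
  # Only zeros and NAs (no positive value left): return 0
  if (length(pos) == 0) return(0)
  # Otherwise: the smallest value, ignoring zeros and NAs
  min(pos)
}

df <- data.frame(
  id    = c(1, 2, 3, 4),
  test1 = c(NA, NA, 2, 3),
  test2 = c(NA, 0, 1, 1),
  test3 = c(NA, NA, 0, 2)
)

# apply(..., 1, f) calls f once per row of the selected columns
df$nieuw <- apply(df[c("test1", "test2", "test3")], 1, min_positive)
```

Because min() is never called on an empty vector, this version does not produce the "no non-missing arguments" warning.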
I have questionnaire data (rows = individuals, columns = scores on questions) and would like to compute a sum score for each individual who answered at least a given number of questions; otherwise the sum score should be NA. The code below computes row sums, counts the NAs per row, assigns a sentinel value that does not otherwise occur to the row sum when there are too many NAs, and then replaces that sentinel with NA. The code works, but I bet there is a more elegant way. Suggestions much appreciated.
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum<-dum %>%
mutate(sumsum = rowSums(select(., x:z), na.rm = TRUE))
dum<-dum %>%
mutate(countna=rowSums(is.na(select(.,x:z))))
dum<-dum %>%
mutate(sumsum=case_when(countna>=2 ~ 100,TRUE~sumsum))
dum<-dum %>%
mutate(sumsum = na_if(sumsum, 100))
You may combine your code in one statement -
library(dplyr)
dum <- tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum <- dum %>%
mutate(sumsum = replace(rowSums(select(., x:z), na.rm = TRUE),
rowSums(is.na(select(., x:z))) >= 2, NA))
dum
# A tibble: 5 × 4
# x y z sumsum
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 3
#2 NA 2 NA NA
#3 2 3 2 7
#4 3 NA 3 6
#5 4 5 4 13
You can also try this:
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum2 <- dum %>% mutate(sumsum = ifelse(rowSums(is.na(select(.,x:z)))>=2, NA,rowSums(select(., x:z), na.rm = TRUE)))
dum2
# A tibble: 5 × 4
x y z sumsum
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 3
2 NA 2 NA NA
3 2 3 2 7
4 3 NA 3 6
5 4 5 4 13
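If the threshold needs to vary between questionnaires, the same logic can be wrapped in a small function. A minimal sketch in base R, where sum_if_enough and its max_missing argument are hypothetical names; the default of 1 reproduces the "2 or more NAs gives NA" rule used above.

```r
# Hypothetical helper: row sums, set to NA when a row has too many missing items
sum_if_enough <- function(data, max_missing = 1) {
  m <- as.matrix(data)
  n_missing <- rowSums(is.na(m))
  # Rows above the threshold become NA; the rest are summed ignoring NAs
  ifelse(n_missing > max_missing, NA_real_, rowSums(m, na.rm = TRUE))
}

dum <- data.frame(
  x = c(1, NA, 2, 3, 4),
  y = c(1, 2, 3, NA, 5),
  z = c(1, NA, 2, 3, 4)
)

dum$sumsum <- sum_if_enough(dum[c("x", "y", "z")])
```

Setting max_missing = 0 would require complete answers on every item.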
If you have a complete data frame, it is easy to multiply values based on a logical condition:
df = data.frame(
var1 = c(1, 2, 3, 4, 5),
var2 = c(1, 2, 3, 2, 1),
var3 = c(5, 4, 3, 4, 5)
)
> df
var1 var2 var3
1 1 1 5
2 2 2 4
3 3 3 3
4 4 2 4
5 5 1 5
> df[df > 2] <- df[df > 2] * 10
> df
var1 var2 var3
1 1 1 50
2 2 2 40
3 30 30 30
4 40 2 40
5 50 1 50
However, if you have NA values in the data frame, the operation fails:
> df_na = data.frame(
var1 = c(NA, 2, 3, 4, 5),
var2 = c(1, 2, 3, 1, NA),
var3 = c(5, NA, 3, 4, 5)
)
> df_na
var1 var2 var3
1 NA 1 5
2 2 2 NA
3 3 3 3
4 4 1 4
5 5 NA 5
> df_na[df_na > 2] <- df_na[df_na > 2] * 10
Error in `[<-.data.frame`(`*tmp*`, df_na > 2, value = c(NA, 30, 40, 50, :
'value' is the wrong length
I tried, for example, some na.omit() tactics but could not make them work. I also could not find an appropriate question here on Stack Overflow.
So, how should I do it?
You can add !is.na() as an additional logical argument to subset by:
df_na[df_na > 2 & !is.na(df_na)] <- df_na[df_na > 2 & !is.na(df_na)] * 10
# > df_na
# var1 var2 var3
# 1 NA 1 50
# 2 2 2 NA
# 3 30 30 30
# 4 40 1 40
# 5 50 NA 50
Alternatively, a dplyr / tidyverse solution would be:
library(dplyr)
df_na %>%
mutate_all(.funs = ~ ifelse(!is.na(.x) & .x > 2, .x * 10, .x))
Added based on OP comment:
If you want to subset by values based on the %in% operator, opt for the dplyr solution (the %in% operator won't work the same way here as explained in this post):
df_na %>%
mutate_all(.funs = ~ ifelse(!is.na(.x) & .x %in% c(3, 4), .x * 10, .x))
# var1 var2 var3
# 1 NA 1 5
# 2 2 2 NA
# 3 30 30 30
# 4 40 1 40
# 5 5 NA 5
This approach generally lends itself to more complex manipulation tasks. You may, for instance, also define additional conditions with the help of dplyr::case_when() instead of the one-alternative ifelse.
Does this work? Using base R:
df_na[] <- lapply(df_na, function(x) ifelse(!is.na(x) & x > 2, x * 10, x))
df_na
var1 var2 var3
1 NA 1 50
2 2 2 NA
3 30 30 30
4 40 1 40
5 50 NA 50
The problem is not with the multiplication, it is with the array indexing.
(df_na > 2 returns NAs).
You can combine the two lines below into one if you like:
inds <- which(df_na > 2, arr.ind = TRUE)
df_na[inds] <- df_na[inds] * 10
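To see why the NAs matter, it can help to build the logical mask as its own object. The sketch below, in base R, sets the NA entries of the mask to FALSE before indexing, which has the same effect as adding !is.na(); the variable name mask is only illustrative.

```r
df_na <- data.frame(
  var1 = c(NA, 2, 3, 4, 5),
  var2 = c(1, 2, 3, 1, NA),
  var3 = c(5, NA, 3, 4, 5)
)

# Comparing a data frame with a number gives a logical matrix;
# it is NA wherever the data frame itself is NA
mask <- df_na > 2

# Treat "unknown" comparisons as "do not select"
mask[is.na(mask)] <- FALSE

# With a mask free of NAs, both the extraction and the assignment line up
df_na[mask] <- df_na[mask] * 10
```

The missing cells are left untouched because they are never selected.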
I need to create a variable that equals var1 if var2 is missing, equals var2 if var1 is missing, is the mean of var1 and var2, (var1 + var2) / 2, if neither is missing, and is NA if both var1 and var2 are missing.
I have data like:
library(tidyverse)
df <- tibble(
var1 = c(1, 2, 3, 4, NA, NA, 3, 2),
var2 = c(3, 5, NA, 2, 3, NA, 4, NA)
)
The result should be:
var1 var2 newvar
1 3 2
2 5 3.5
3 NA 3
4 2 3
NA 3 3
NA NA NA
3 4 3.5
2 NA 2
I have tried using main R recoding functions, also tried using case_when:
df <- df %>% mutate (
newvar = case_when(
var1 == NA ~ var2,
var2 == NA ~ var1,
TRUE ~ (var1+var2)/2
)
)
Not sure whether the last line would be correct but anyway the code didn't work due to missings, it says:
Error in mutate_impl(.data, dots) :
Evaluation error: NAs are not allowed in subscripted assignments.
df %>% mutate (
newvar = case_when(
xor(is.na(var1), is.na(var2)) ~ pmax(var1, var2, na.rm = TRUE),
!is.na(var1) & !is.na(var2) ~ (var1 + var2)/2,
TRUE ~ NA_real_
)
)
Almost there, just some minor edits and it's working on my end. It's usually better to use is.na(x) instead of x == NA. Also, your TRUE at the end should check what you actually want, the case where none of them are NA.
df %>% mutate (
newvar = case_when(
is.na(var1) ~ var2,
is.na(var2) ~ var1,
!is.na(var1) & !is.na(var2) ~ (var1+var2)/2
)
)
Produces
# A tibble: 8 x 3
var1 var2 newvar
<dbl> <dbl> <dbl>
1 1 3 2
2 2 5 3.5
3 3 NA 3
4 4 2 3
5 NA 3 3
6 NA NA NA
7 3 4 3.5
8 2 NA 2
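Since all four cases amount to "the mean of whatever is not missing", a row mean that ignores NAs gives the same result. A minimal sketch in base R; the only extra step is turning the NaN that rowMeans() returns for rows with no valid value into NA.

```r
df <- data.frame(
  var1 = c(1, 2, 3, 4, NA, NA, 3, 2),
  var2 = c(3, 5, NA, 2, 3, NA, 4, NA)
)

# Mean of the non-missing values in each row
newvar <- rowMeans(df[c("var1", "var2")], na.rm = TRUE)

# Rows where both values are missing come back as NaN; recode them to NA
newvar[is.nan(newvar)] <- NA

df$newvar <- newvar
```

This also extends to more than two columns without adding further conditions.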
Consider this simple dataframe
> data_frame(var1 = c(NA, 1 , NA),
+ var2 = c (NA, 3, NA))
# A tibble: 3 x 2
var1 var2
<dbl> <dbl>
1 NA NA
2 1 3
3 NA NA
I want to fill forward all the missing values for all columns in the dataframe. It seems to me that tidyr::fill() can do that, but I am unable to make it work without specifying the columns one at a time.
> data_frame(var1 = c(NA, 1 , NA),
+ var2 = c (NA, 3, NA)) %>% tidyr::fill(.direction = 'down')
# A tibble: 3 x 2
var1 var2
<dbl> <dbl>
1 NA NA
2 1 3
3 NA NA
while entering the column name seems to work
> data_frame(var1 = c(NA, 1 , NA),
+ var2 = c (NA, 3, NA)) %>% tidyr::fill(var1, .direction = 'down')
# A tibble: 3 x 2
var1 var2
<dbl> <dbl>
1 NA NA
2 1 3
3 1 NA
what am I missing here?
Thanks
tidyr verbs accept dplyr::select column specifications, so you can use everything():
library(tidyverse)
df <- tibble(var1 = c(NA, 1, NA),
             var2 = c(NA, 3, NA))
df %>% fill(everything())
#> # A tibble: 3 x 2
#> var1 var2
#> <dbl> <dbl>
#> 1 NA NA
#> 2 1 3
#> 3 1 3
We can convert the column names to symbols with rlang::syms() and splice them into the call with !!!:
d1 %>%
tidyr::fill(!!! rlang::syms(names(.)), .direction = 'down')
# A tibble: 3 x 2
# var1 var2
# <dbl> <dbl>
#1 NA NA
#2 1 3
#3 1 3
data
d1 <- data_frame(var1 = c(NA, 1 , NA), var2 = c (NA, 3, NA))
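For readers who want to see what a downward fill does, here is a rough base R sketch of the same idea. fill_down is a hypothetical helper, not part of tidyr; it carries the last non-missing value forward and leaves leading NAs in place.

```r
# Hypothetical helper: carry the last non-missing value forward
fill_down <- function(x) {
  # Number of non-missing values seen so far (0 before the first one)
  seen <- cumsum(!is.na(x))
  # Position 1 of the lookup vector is NA, so leading gaps stay NA
  c(NA, x[!is.na(x)])[seen + 1]
}

d1 <- data.frame(var1 = c(NA, 1, NA), var2 = c(NA, 3, NA))

# Apply the helper to every column while keeping the data frame structure
d1[] <- lapply(d1, fill_down)
```

Applying the helper with lapply() plays the role of everything(): no column has to be named explicitly.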