What's the easiest way to handle multiple if-else conditions? - r

I have a data set something like this:
library(tibble)

df_1 <- tribble(
  ~A, ~B, ~C,
  10, 10, NA,
  NA, 34, 15,
  40, 23, NA,
   4, 12, 18
)
Now I just want to compare A, B, and C for each row and add a new column that shows which column holds the minimum. The desired data looks like this:
df_2 <- tribble(
  ~A, ~B, ~C, ~Winner,
  10, 10, NA, "Same",
  NA, 34, 15, "C",
  40, 23, NA, "B",
   4, 12, 18, "A"
)
There are four possible outputs: Same, A, B, and C.
How would you write the code to get this result?
Thanks in advance.

Here is one approach:
foo <- function(x) {
  rmin <- which(x == min(x, na.rm = TRUE))
  if (length(rmin) > 1) "same" else names(rmin)
}
apply(df_1, 1, foo)
[1] "same" "C" "B" "A"
You can add this as a column to your data.frame with:
df_1$winner <- apply(df_1, 1, foo)
# A tibble: 4 x 4
A B C winner
<dbl> <dbl> <dbl> <chr>
1 10 10 NA same
2 NA 34 15 C
3 40 23 NA B
4 4 12 18 A
If you have more variables and only want to use some of them, you can use a character vector:
vars <- c("A", "B", "C")
apply(df_1[vars], 1, foo)
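For comparison, here is a tidyverse-only sketch of the same idea. It is only a sketch, not a drop-in answer: it assumes dplyr >= 1.0 for rowwise() and c_across(), and it spells out the column names instead of taking them from the data.
library(dplyr)

df_1 %>%
  rowwise() %>%
  mutate(
    row_min = min(c_across(A:C), na.rm = TRUE),
    n_min   = sum(c_across(A:C) == row_min, na.rm = TRUE),
    # report "same" on a tie, otherwise the name of the column holding the minimum
    winner  = if (n_min > 1) "same" else c("A", "B", "C")[which.min(c_across(A:C))]
  ) %>%
  ungroup() %>%
  select(-row_min, -n_min)
rowwise() tends to be slow on large data, but it keeps everything in one pipe.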

library(dplyr)

df_1 <- tribble(
  ~A, ~B, ~C,
  10, 10, NA,
  NA, 34, 15,
  40, 23, NA,
   4, 12, 18
)
df_1 %>%
  mutate(
    winner = colnames(df_1)[apply(df_1, 1, which.min)],
    winner = if_else(A == B | B == C | A == C, 'same', winner, missing = winner)
  )
# A tibble: 4 x 4
A B C winner
<dbl> <dbl> <dbl> <chr>
1 10 10 NA same
2 NA 34 15 C
3 40 23 NA B
4 4 12 18 A
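One caveat, for what it's worth: the condition A == B | B == C | A == C flags a row as 'same' whenever any two values are equal, even if neither of them is the row minimum. A sketch of a stricter check that only counts ties at the minimum (same df_1 as above, dplyr >= 1.0 assumed):
df_1 %>%
  mutate(
    # how many columns equal the row-wise minimum
    n_at_min = rowSums(across(A:C, ~ .x == pmin(A, B, C, na.rm = TRUE)), na.rm = TRUE),
    winner   = if_else(n_at_min > 1, "same", colnames(df_1)[apply(df_1, 1, which.min)])
  ) %>%
  select(-n_at_min)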

Routine for non-manual argument of a set of variables in coalesce() dplyr function

I have a list of dfs to be combined into one. These dfs have some matching columns and rows and some distinct or missing ones.
Here is the minimal structure of the first two dfs, for illustration.
df1:
df1 <- structure(list(id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6),
Name = c("LI","NO","WH","MA","BU","SO","FO","AT","CO","IN","SP","CE"),
H_A = c("H", "A", "H", "A", "H", "A", "H", "A", "H", "A", "H", "A"),
W = c(15, 13, 5, 13, 9, 12, 10, 13, 1, 8, 4, 2),
X = c(NA, NA, NA, NA, NA, NA, 12, 7, 5, 13, 1, 3),
Y = c(0, 0, 0, 0, 0,0, NA, NA, NA, NA, NA, NA)),
row.names = c(NA,-12L), class = c("tbl_df","tbl", "data.frame"))
df2:
df2 <- structure(list(id = c(1, 1, 2, 2, 3, 3),
Name = c("LI","NO", "WH", "MA", "BU", "SO"),
H_A = c("H", "A", "H", "A", "H", "A"),
W = c(15, 13, 5, 13, 9, 12),
X = c(10, 12, 11, 15, 6, 14),
Z = c(4, 14, 16, 16, 25, 30)),
row.names = c(NA,-6L),class = c("tbl_df", "tbl", "data.frame"))
This can be solved manually like this:
library(dplyr)

df_combined <- full_join(df1, df2, by = c("id", "Name", "H_A")) %>%
  mutate(X = coalesce(X.x, X.y),
         W = coalesce(W.x, W.y)) %>%
  select(-contains("."))
I would like to automate this routine so that the variables do not have to be typed into mutate()/coalesce() by hand; besides X and W above, the real data has several more such variables. I will also continue the routine for df3, df4 and df5, which overlap with df1 in the same way.
Joins by their nature don't fill in missing positions, so we have to implement a fix ourselves; and although you can use if-else statements, as shown in the left_join answer below, coalesce() is a much cleaner function to use.
See this post for another example (it could potentially be seen as a repeated question):
Using dplyr to fill in missing values (through a join?)
library(tidyverse)
df_test <- full_join(df1, df2, by = c("id", "Name", "H_A")) %>%
  mutate(X = coalesce(X.x, X.y),
         W = coalesce(W.x, W.y)) %>%
  select(id, Name, H_A, W, X, Y, Z)
df_test == df_combined
df_test == df_combined
id Name H_A W X Y Z
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[7,] TRUE TRUE TRUE TRUE TRUE NA NA
[8,] TRUE TRUE TRUE TRUE TRUE NA NA
[9,] TRUE TRUE TRUE TRUE TRUE NA NA
[10,] TRUE TRUE TRUE TRUE TRUE NA NA
[11,] TRUE TRUE TRUE TRUE TRUE NA NA
[12,] TRUE TRUE TRUE TRUE TRUE NA NA
The NAs expectedly return NA, since comparing two NAs with a simple == yields NA rather than TRUE.
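To avoid typing out each coalesce() pair by hand, one option is a small helper that coalesces every column the two inputs share after the join. This is only a sketch: the name coalesce_join is made up here, and it assumes the default .x/.y suffixes that dplyr joins add to duplicated column names.
library(dplyr)

coalesce_join <- function(x, y, by) {
  joined <- full_join(x, y, by = by)
  # columns present in both inputs (other than the join keys) get suffixed,
  # so coalesce each .x/.y pair back into a single column
  shared <- setdiff(intersect(names(x), names(y)), by)
  for (v in shared) {
    joined[[v]] <- coalesce(joined[[paste0(v, ".x")]], joined[[paste0(v, ".y")]])
  }
  select(joined, -ends_with(".x"), -ends_with(".y"))
}

# the same helper could then be folded over a whole list of data frames, e.g.
# purrr::reduce(list(df1, df2, df3), coalesce_join, by = c("id", "Name", "H_A"))
coalesce_join(df1, df2, by = c("id", "Name", "H_A"))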
You can use left_join from dplyr and substitute the NAs like this, where I am guessing that id and H_A together form a key:
library(dplyr)
df_combined <- left_join(df1,
                         df2 %>%
                           select(id, H_A, "df2_X" = X, Z)) %>%
  mutate(X = if_else(is.na(X), df2_X, X)) %>%
  select(-df2_X)
#> Joining, by = c("id", "H_A")

df_combined
#> # A tibble: 12 × 7
#> id Name H_A W X Y Z
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 LI H 15 10 0 4
#> 2 1 NO A 13 12 0 14
#> 3 2 WH H 5 11 0 16
#> 4 2 MA A 13 15 0 16
#> 5 3 BU H 9 6 0 25
#> 6 3 SO A 12 14 0 30
#> 7 4 FO H 10 12 NA NA
#> 8 4 AT A 13 7 NA NA
#> 9 5 CO H 1 5 NA NA
#> 10 5 IN A 8 13 NA NA
#> 11 6 SP H 4 1 NA NA
#> 12 6 CE A 2 3 NA NA
data.table approach
library(data.table)
# set to data.table format
setDT(df1); setDT(df2)
# perform an update join, overwriting NA values in W and X and adding Z,
# based on the key columns id, Name and H_A (Y exists only in df1, so it is
# left as is)
df1[df2, `:=`(W = ifelse(is.na(W), i.W, W),
              X = ifelse(is.na(X), i.X, X),
              Z = i.Z),
    on = .(id, Name, H_A)][]
# id Name H_A W X Y Z
# 1: 1 LI H 15 10 0 4
# 2: 1 NO A 13 12 0 14
# 3: 2 WH H 5 11 0 16
# 4: 2 MA A 13 15 0 16
# 5: 3 BU H 9 6 0 25
# 6: 3 SO A 12 14 0 30
# 7: 4 FO H 10 12 NA NA
# 8: 4 AT A 13 7 NA NA
# 9: 5 CO H 1 5 NA NA
#10: 5 IN A 8 13 NA NA
#11: 6 SP H 4 1 NA NA
#12: 6 CE A 2 3 NA NA
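A small variant of the same update join, shown only as a sketch (it assumes data.table >= 1.12.4, where fcoalesce() is available): fcoalesce() returns the first non-NA value element-wise, so it can stand in for the ifelse(is.na(...), ...) pattern.
library(data.table)
setDT(df1); setDT(df2)

# coalesce the columns present in both tables, copy Z over from df2
df1[df2, `:=`(W = fcoalesce(W, i.W),
              X = fcoalesce(X, i.X),
              Z = i.Z),
    on = .(id, Name, H_A)][]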

Take Symmetrical Mean of a tibble (ignoring the NAs)

I have a tibble where the rows and columns are the same IDs, and I would like to take the mean of each pair of mirrored entries (ignoring the NAs) to make the data symmetrical. I am struggling to see how.
library(tibble)

data <- tibble(group = LETTERS[1:4],
               A = c(NA, 10, 20, NA),
               B = c(15, NA, 25, 30),
               C = c(20, NA, NA, 10),
               D = c(10, 12, 15, NA))
I would normally do
A <- as.matrix(data[-1])
(A + t(A))/2
But this does not work because of the NAs.
Edit: below is the expected output.
output <- tibble(group = LETTERS[1:4],
A = c(NA, 12.5, 20, 10),
B = c(12.5, NA, 25, 21),
C = c(20, 25, NA, 12.5),
D = c(10, 21, 12.5, NA))
Here is a suggestion using tidyverse code.
library(tidyverse)
A <- data %>%
  pivot_longer(-group, values_to = "x")

B <- t(data[-1]) %>%   # transpose only the numeric columns, not group
  as.data.frame() %>%
  setNames(LETTERS[1:4]) %>%
  rownames_to_column("group") %>%
  pivot_longer(-group, values_to = "y") %>%
  left_join(A, by = c("group", "name")) %>%
  mutate(
    mean = if_else(!(is.na(x) | is.na(y)), (x + y) / 2, x),
    mean = if_else(is.na(mean) & !is.na(y), y, mean)
  ) %>%
  select(-x, -y) %>%
  pivot_wider(names_from = name, values_from = mean)
B
# A tibble: 4 x 5
# group A B C D
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A NA 12.5 20 10
#2 B 12.5 NA 25 21
#3 C 20 25 NA 12.5
#4 D 10 21 12.5 NA
Okay, so this is how I ended up doing it. I would have preferred not to use a for loop, because the actual data I have is much bigger, but beggars can't be choosers!
A <- as.matrix(data[-1])
for (i in 1:nrow(A)) {
  for (j in 1:ncol(A)) {
    if (is.na(A[i, j])) {
      A[i, j] <- A[j, i]
    }
  }
}
output <- (A + t(A)) / 2
output %>%
  as_tibble() %>%
  mutate(group = data$group) %>%
  select(group, everything())
# A tibble: 4 x 5
group A B C D
<chr> <dbl> <dbl> <dbl> <dbl>
1 A NA 12.5 20 10
2 B 12.5 NA 25 21
3 C 20 25 NA 12.5
4 D 10 21 12.5 NA
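For larger data, a vectorized sketch that avoids the double loop could look like the following. It assumes the intended behaviour is the one in the expected output above: average a mirrored pair when both values are present, keep the single available value when only one is, and leave NA only when both are missing.
library(tibble)
library(dplyr)

A <- as.matrix(data[-1])
tA <- t(A)

# sum whatever values are available and divide by how many of the two
# mirrored entries are non-NA; cells where both entries are NA stay NA
num <- ifelse(is.na(A), 0, A) + ifelse(is.na(tA), 0, tA)
den <- (!is.na(A)) + (!is.na(tA))
sym <- num / den
sym[den == 0] <- NA

sym %>%
  as_tibble() %>%
  mutate(group = data$group) %>%
  select(group, everything())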

Replace values outside range with NA using replace_with_na function

I have the following dataset
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 34
3 9 1 77
4 2 9 88
5 9 12 33
6 8 NA 60
From column b I only want values between 4 and 9, and from column c values between 50 and 80; values outside those ranges should be replaced with NA, resulting in
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, NA, 9, NA,
NA), c = c(50, NA, 77, NA, NA, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 NA
3 9 NA 77
4 2 9 NA
5 9 NA NA
6 8 NA 60
I've tried several things with the replace_with_na_at function, where this seemed most logical:
test <- replace_with_na_at(data = test, .vars="c",
condition = ~.x < 2 & ~.x > 2)
However, nothing I tried works. Does somebody know why? Thanks in advance! :)
You can subset with a logical vector testing your conditions.
x$b[x$b < 4 | x$b > 9] <- NA
x$c[x$c < 50 | x$c > 80] <- NA
x
# a b c
#1 2 4 50
#2 1 5 NA
#3 9 NA 77
#4 2 9 NA
#5 9 NA NA
#6 8 NA 60
Data:
x <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
Yet another base R solution, this time with the replacement function is.na<-
is.na(test$b) <- with(test, b < 4 | b > 9)
is.na(test$c) <- with(test, c < 50 | c > 80)
A naniar solution with a pipe could be:
library(naniar)
library(magrittr)
test %>%
replace_with_na_at(
.vars = 'b',
condition = ~(.x < 4 | .x > 9)
) %>%
replace_with_na_at(
.vars = 'c',
condition = ~(.x < 50 | .x > 80)
)
You should mention the packages you are using. From googling, I'm guessing you are using naniar. The problem appears to be that you did not specify the condition properly, but the following should work:
library(naniar)
test <- structure(list(a = c(2, 1, 9, 2, 9, 8),
b = c(4, 5, 1, 9, 12, NA),
c = c(50, 34, 77, 88, 33, 60)),
class = "data.frame",
row.names = c(NA, -6L))
replace_with_na_at(test, "c", ~.x < 50 | .x > 80)
#> a b c
#> 1 2 4 50
#> 2 1 5 NA
#> 3 9 1 77
#> 4 2 9 NA
#> 5 9 12 NA
#> 6 8 NA 60
Created on 2020-06-02 by the reprex package (v0.3.0)
You could simply use Map to replace your values with NA. (Note that %in% 4:9 only matches whole numbers, which is fine for this data.)
dat[2:3] <- Map(function(x, y) {x[!x %in% y] <- NA;x}, dat[2:3], list(4:9, 50:80))
dat
# a b c
# 1 2 4 50
# 2 1 5 NA
# 3 9 NA 77
# 4 2 9 NA
# 5 9 NA NA
# 6 8 NA 60
Data:
dat <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
We can use map2:
library(purrr)
library(dplyr)
df1[c('b', 'c')] <- map2(df1 %>% select(b, c),
                         list(c(4, 9), c(50, 80)),
                         ~ replace(.x, .x < .y[1] | .x > .y[2], NA))
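For completeness, a dplyr-only sketch with between() and if_else(), assuming the bounds 4-9 and 50-80 are meant to be inclusive. NA_real_ is used because if_else() is strict about types, and existing NAs stay NA since a missing condition yields NA.
library(dplyr)

test %>%
  mutate(b = if_else(between(b, 4, 9), b, NA_real_),
         c = if_else(between(c, 50, 80), c, NA_real_))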

Pivot_wider introduces NA's

I am doing data management for a project and I am running into difficulties with what I thought would be a basic reshape from long format to wide.
The data looks something like this:
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
Time = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 1, 1, 2, 2),
Type = c("A", "B", "C", "D", "A", "B","C", "D", "A", "A", "B", "C", "D", "A", "B"),
Value = c(100, NA, 40, 123, 95, NA, 45, 1234, 100, 70, NA, 50, 12345, 75, NA)),
row.names = c(NA, 15L), class = "data.frame")
Based on previous Stack Overflow answers I am trying to use pivot_wider like this:
library(dplyr)
library(tidyr)

df.wide <- df %>%
  group_by(ID, Type) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = Type, values_from = Value)
However, this returns a data frame with NA values at max(Time) for each ID, which looks like this:
# A tibble: 5 x 7
ID Time row A B C D
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 100 NA 40 123
2 1 2 2 95 NA 45 1234
3 1 3 3 100 NA NA NA
4 2 1 1 70 NA 50 12345
5 2 2 2 75 NA NA NA
What am I doing wrong? My Google and Stack Overflow-fu have not been able to help me.
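A possible reading, based only on the data shown above: pivot_wider() is not so much introducing NAs as exposing Type/Time combinations that never occur in the long data (for example, ID 1 at Time 3 only has a Type A row), and the row helper is not needed because each (ID, Time, Type) combination is already unique. A minimal sketch of the plain reshape, under those assumptions:
library(dplyr)
library(tidyr)

# Value can be spread directly; cells with no corresponding row in the long
# data come out as NA, which is pivot_wider()'s default fill
df %>%
  pivot_wider(names_from = Type, values_from = Value)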

Calculate median for multiple columns by group based on subsets defined by other columns

I am trying to calculate the median (but that could be substituted by a similar metric) by group for multiple columns, based on subsets defined by other columns. This is a direct follow-on question from this previous post of mine. I have attempted to incorporate calculating the median via aggregate into the Map(function(x,y) dosomething, x, y) solution kindly provided by @Frank, but that didn't work. Let me illustrate:
Calculate median for A and B by groups GRP1 and GRP2
df <- data.frame(GRP1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
                 GRP2 = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                 A = c(0,4,6,7,0,1,9,0,0,8,3,4),
                 B = c(6,0,4,8,6,7,0,9,9,7,3,0))
med <- aggregate(. ~ GRP1 + GRP2, df, FUN = median)
Simple. Now add columns defining which rows are to be used for calculating the median, i.e. rows with NAs should be dropped: column a defines which rows are used for the median of column A, and likewise column b for column B:
a <- c(1,4,7,3,NA,3,7,NA,NA,4,8,1)
b <- c(5,NA,7,9,5,6,NA,8,1,7,2,9)
df1 <- cbind(df,a,b)
As mentioned above, I have tried combining Map and aggregate, but that didn't work. I assume that Map doesn't know what to do with GRP1 and GRP2.
med1 <- Map(function(x,y) aggregate(.~GRP1+GRP2,df1[!is.na(y)],FUN=median), x=df1[,3:4], y=df1[, 5:6])
This is the result I'm looking for:
GRP1 GRP2 A B
1 A A 4 5
2 B A 9 9
3 A B 4 7
4 B B 4 3
Any help will be much appreciated!
Using data.table
library(data.table)
setDT(df1)
df1[, .(A = median(A[!is.na(a)]), B = median(B[!is.na(b)])), by = .(GRP1, GRP2)]
GRP1 GRP2 A B
1: A A 4 5
2: A B 4 7
3: B A 9 9
4: B B 4 3
Same logic in dplyr
library(dplyr)
df1 %>%
  group_by(GRP1, GRP2) %>%
  summarise(A = median(A[!is.na(a)]), B = median(B[!is.na(b)]))
The original df1:
df1 <- data.frame(
GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)
With dplyr:
library(dplyr)
df1 %>%
  # I use this to put as NA the values we don't want to include
  mutate(A = ifelse(is.na(a), NA, A),
         B = ifelse(is.na(b), NA, B)) %>%
  group_by(GRP1, GRP2) %>%
  summarise(A = median(A, na.rm = TRUE),
            B = median(B, na.rm = TRUE))
# A tibble: 4 x 4
# Groups: GRP1 [?]
GRP1 GRP2 A B
<fct> <fct> <dbl> <dbl>
1 A A 4 5
2 A B 4 7
3 B A 9 9
4 B B 4 3
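If there are many such value/selector pairs, here is a sketch that generalizes the dplyr answer. It assumes dplyr >= 1.0 and that every value column A, B, ... has a matching lower-case selector column a, b, ...
library(dplyr)

value_cols <- c("A", "B")

df1 %>%
  group_by(GRP1, GRP2) %>%
  summarise(
    # for each value column, keep only rows whose matching selector is non-NA
    across(all_of(value_cols),
           ~ median(.x[!is.na(cur_data()[[tolower(cur_column())]])])),
    .groups = "drop"
  )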
