R maximum with NAs in a vector - r

I have a problem. Lines where PROJECT="SNOP" are missing from df6 although they were present in df5. The PROJECT = "SNOP" lines contain only NAs for the VERSION2 variable. Can someone help me? Here is my code:
which(df5$PROJECT=="SNOP") #200 lines appear
df6 <- df5 %>%
group_by(PROJECT) %>%
filter(VERSION2 == ifelse(!all(is.na(VERSION2)), max(VERSION2, na.rm=T), NA)) %>%
ungroup()
which(df6$PROJECT=="SNOP") # lines with PROJECT="SNOP" are missing; returns integer(0)

That is probably because every VERSION2 value for "SNOP" is NA: the comparison VERSION2 == NA evaluates to NA, and filter() drops rows whose condition is NA.
Consider this small example.
library(dplyr)
df <- data.frame(a = rep(c(1:2), each =3), b = c(1:3, NA, NA, NA))
Using your code, we do :
df %>%
group_by(a) %>%
filter(b == ifelse(!all(is.na(b)), max(b, na.rm=TRUE), NA))
# a b
# <int> <int>
#1 1 3
Notice how a = 2 is dropped.
Now you can decide what to do with the groups where all values are NA. For example, the code below keeps every row of a group whose values are all NA.
df %>%
group_by(a) %>%
filter(if(!all(is.na(b))) b == max(b, na.rm=T) else TRUE)
# a b
# <int> <int>
#1 1 3
#2 2 NA
#3 2 NA
#4 2 NA
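To illustrate why the all-NA group disappears, here is a minimal sketch of the underlying behaviour. The data frame `toy` and the names `dropped` and `kept` are made up for this illustration; they are not from the question.

```r
library(dplyr)

# Comparing anything with NA yields NA, not TRUE or FALSE
NA == NA

# Hypothetical two-row data frame, used only for this illustration
toy <- data.frame(a = c(1, 2), b = c(3, NA))

# max(b) is NA here, so every comparison is NA and filter() keeps no rows
dropped <- filter(toy, b == max(b))

# Asking for the NA rows explicitly keeps them
kept <- filter(toy, b == 3 | is.na(b))

nrow(dropped)
nrow(kept)
```

filter() keeps only rows where the condition is TRUE, so a condition that evaluates to NA behaves like FALSE.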

Related

Combining two variables to create new variable

I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that would have the 1 and 2 results from both individual category.
I have tried using ifelse command but it only shows me the value of IPV_YES.
Dataset I have
My desired outcome
my answer
df %>% mutate(across(everything(), ~ifelse(. == "", NA, as.numeric(.)))) %>%
group_by(ID) %>%
rowwise() %>%
transmute(IPV = sum(c_across(everything()), na.rm = T))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
We can use coalesce after converting the '' to NA
library(dplyr)
df <- df %>%
transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
type.convert(as.is = TRUE)
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
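df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, then give the value in df$IPV_YES, else give the value in df$IPV_NO from the same row. (Subsetting df$IPV_NO to its non-blank values first would produce a shorter vector that ifelse recycles, which only works by coincidence when all the non-blank values are identical.)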
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
Maybe you can try the code below
replace(df, df == "", NA) %>%
mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
select(ID, IPV) %>%
type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
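As a small illustration of what coalesce does in the answers above, the sketch below runs it on two standalone vectors. The vector names `yes`, `no` and `combined` are chosen for this example only.

```r
library(dplyr)

# Stand-ins for the IPV_YES and IPV_NO columns
yes <- c("1", "", "1", "")
no  <- c("", "2", "", "2")

# na_if() turns "" into NA; coalesce() then takes, position by position,
# the first value that is not NA
combined <- coalesce(na_if(yes, ""), na_if(no, ""))
combined
```

The result is still character at this point, which is why the answers finish with type.convert(as.is = TRUE).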

How to use slice in dplyr to keep the rows with NA values in R

I have the following dataset, and I want to know the min word for each group, and if there is no min word (it is NA), I still want to display it
df=data.frame(
key=c("A","A","B","B","C"),
word=c(1,2,3,5,NA))
df%>%group_by(key)%>%slice(which.min(word))
This excludes key=C, word=NA which I would want:
df_out=data.frame(
key=c("A","B","C"),
word=c(1,3,NA))
We can create a logical condition with is.na in filter and return the NA rows as well after doing the grouping by 'key'
library(dplyr)
df %>%
group_by(key) %>%
filter(word == min(word)|is.na(word))
Or using slice. We don't need any if/else condition
df %>%
group_by(key) %>%
slice(which(word ==min(word)|is.na(word)))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Or more compactly
df %>%
group_by(key) %>%
slice(match(min(word), word))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
NOTE: Using match returns the index of the first match.
which.min removes the NA
which.min(c(NA, 1, 3))
#[1] 2
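The difference between the two approaches shows up when every value is NA, as in group C. A minimal sketch, using a made-up vector `x`:

```r
# A group in which every value is missing
x <- c(NA_real_, NA_real_)

# which.min() finds no minimum, so it returns an empty index
# and slice() keeps nothing for that group
idx_which <- which.min(x)

# min(x) is NA, and match() locates the first NA, so the first row is kept
idx_match <- match(min(x), x)

length(idx_which)
idx_match
```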
We can check the condition with if: if all the word values in a group are NA, we return the first row; otherwise we return the row with the minimum.
library(dplyr)
df %>%
group_by(key)%>%
slice(if(all(is.na(word))) 1L else which.min(word))
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Another option is to arrange the data by word and select the 1st row in each group.
df %>% arrange(key, word) %>% group_by(key) %>% slice(1L)
You can create a modified slice-function using the tidyverse-package, which returns NA's:
slice_uneven = function(.data, .idx) {
.data_ = .data %>% add_row() # Add an extra row
.idx_ = .idx %>% c(NA) %>% replace_na(nrow(.data_)) # Replace NA with index of the extra row
.data_[.idx_,] %>% head(-1) %>% remove_rownames() %>% return() # Subset, remove extra row, and reset rownames before returning data
}
slice_uneven(cars, c(1, 2, 3, NA, NA, 3, 2))
You can also arrange by word and use distinct from dplyr to get the desired output.
library(dplyr)
df %>%
arrange(word) %>%
distinct(key, .keep_all = TRUE)
# key word
#1 A 1
#2 B 3
#3 C NA

filter() but keep groups without value

I am trying to condense a grouped df, pulling out only rows that contain a certain value, but that value isn't reflected in all groups. I want to find a way to pull out all rows with that value, but also make a NA or 0 row for groups not containing that value.
Ex:
x1 <- c('1','1','1','1','1','2','2','2','2','2','3','3','3','3','3')
x2 <- c('a','b','c','d','e','b','c','d','e','f','a','b','d','e','f')
df <- data.frame(x1,x2)
df %>% group_by(x1) %>%
filter(x2 =="a")
this returns:
x1 x2
<fct> <fct>
1 1 a
2 3 a
but I want it to return:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
Obviously the real code is much more complicated, so I'm looking for the best way to keep these empty groups in a reproducible way.
PS - I would like to stay in dplyr to keep smooth in a function chain
Thanks!
One dplyr option could be:
df %>%
group_by(x1) %>%
slice(which.max(x2 == "a")) %>%
mutate(x2 = replace(x2, x2 != "a", NA_character_))
x1 x2
<fct> <fct>
1 1 a
2 2 <NA>
3 3 a
If it's relevant to have multiple target values per group:
df %>%
group_by(x1) %>%
filter(x2 == "a") %>%
bind_rows(df %>%
group_by(x1) %>%
filter(all(x2 != "a")) %>%
slice(1) %>%
mutate(x2 = replace(x2, x2 != "a", NA_character_)))
As you did not specify dplyr solutions only, here's one option with library(data.table)
setDT(df)
df[, .(x2 = x2[match('a', x2)]), x1]
# x1 x2
# 1: 1 a
# 2: 2 <NA>
# 3: 3 a
This happens because of the way dplyr was written.
According to Hadley Wickham (the package creator), to keep NA values you must ask for them explicitly. As he said in this issue on GitHub, you should use filter(a == x | is.na(a)). In your case that would be:
df %>% group_by(x1) %>%
filter(x2 == "a" | is.na(x2))
Note, however, that this only keeps rows in which x2 is already NA; it does not create a row for a group that has no "a". With the example data, where x2 contains no NA, it returns the same rows as the original filter:
x1 x2
<fct> <fct>
1 1 a
2 3 a
This code asks R for all rows in which x2 is equal to "a", plus those in which x2 is NA.
We can use complete after the filter step to get the missing combinations. By default, all the other columns will be filled with NA (it can be made to custom value with fill argument)
library(dplyr)
library(tidyr)
df %>%
filter(x2 == 'a') %>%
complete(x1 = unique(df$x1))
# A tibble: 3 x 2
# x1 x2
# <fct> <fct>
#1 1 a
#2 2 <NA>
#3 3 a
Another option is match
df %>%
group_by(x1) %>%
summarise(x2 = x2[match('a', x2)])
If there are many columns, then mutate 'x2' with match and then slice the first row
df %>%
group_by(x1) %>%
mutate(x2 = x2[match('a', x2)]) %>%
slice(1)
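The match-based answers rely on two base R behaviours, sketched below on a standalone vector. The values mirror group 2 from the question, which has no "a"; the names `pos` and `picked` are made up for this example.

```r
# Values of x2 for a group without "a"
x2 <- c("b", "c", "d", "e", "f")

# match() returns NA when the target is absent
pos <- match("a", x2)

# Indexing with NA yields NA, which becomes the placeholder value for the group
picked <- x2[pos]

pos
picked
```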
How about the base R solution using aggregate() like below?
dfout <- aggregate(x2~x1,df,function(v) ifelse("a" %in% v,"a",NA))
or
dfout <- aggregate(x2~x1,df,function(v) v[match("a", v)])
such that
> dfout
x1 x2
1 1 a
2 2 <NA>
3 3 a

dplyr summarise keep NA if all summarised values are NA

I want to use dplyr summarise to sum counts by groups. Specifically I want to remove NA values if not all summed values are NA, but if all summed values are NA, I want to display NA. For example:
name <- c("jack", "jack", "mary", "mary", "ellen", "ellen")
number <- c(1,2,1,NA,NA,NA)
df <- data.frame(name,number)
In this case I want the following result:
Jack = 3
Mary = 1
Ellen = NA
However if I set na.rm = F:
df %>% group_by(name) %>% summarise(number = sum(number, na.rm = F))
The result is:
Jack = 3
Mary = NA
Ellen = NA
And if I set na.rm = T:
df %>% group_by(name) %>% summarise(number = sum(number, na.rm = T))
The result is
Jack = 3
Mary = 1
Ellen = 0
How can I solve this so that the cases with numbers and NA's get a number as output, but the cases with only NA's get NA as output.
We can use an if/else condition: if all the values in 'number' are NA, return NA; otherwise return the sum.
library(dplyr)
df %>%
group_by(name) %>%
summarise(number = if(all(is.na(number))) NA_real_ else sum(number, na.rm = TRUE))
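The same condition can be wrapped in a small helper so that it is reusable across columns. This is only a sketch; `sum_or_na` is a hypothetical name and not part of dplyr or base R.

```r
# Hypothetical helper: NA when every value is missing, otherwise the sum of the rest
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# Plain sum() with na.rm = TRUE returns 0 for an all-NA vector
plain_sum <- sum(c(NA_real_, NA_real_), na.rm = TRUE)

all_missing  <- sum_or_na(c(NA_real_, NA_real_))
some_missing <- sum_or_na(c(1, NA))

plain_sum
all_missing
some_missing
```

Inside the pipeline it would be called as summarise(number = sum_or_na(number)).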
I was struggling with the same thing, so I wrote a solution into the package hablar. Try:
library(hablar)
df %>% group_by(name) %>%
summarise(number = sum_(number))
which gives you:
# A tibble: 3 x 2
name number
<fct> <dbl>
1 ellen NA
2 jack 3.
3 mary 1.
Note that the only syntax difference is sum_, a function that returns NA if all values are NA; otherwise it removes the NAs and calculates the sum of the non-missing values.

How to mutate() a list of columns using map2() in dplyr

I recently had to compile a data frame of student scores (one row per student, id column and several integer-valued columns, one per score component). I had to combine a "master" data frame and several "correction" data frames (containing mostly NA and some updates to the master), so that the result contains the maximum values from the master, and all corrections.
I succeeded by copy-pasting a sequence of mutate() calls, which works (see example below), but is not elegant in my opinion. What I would have wanted to do, was instead of copying and pasting, to use something along the lines of map2 and two lists of columns to compare the columns pair-wise. Something like (which obviously does not work as such):
list_of_cols1 <- list(col1.x, col2.x, col3.x)
list_of_cols2 <- list(col1.y, col2.y, col3.y)
map2(list_of_cols1, list_of_cols2, ~ column = pmax(.x, .y, na.rm=T))
I can't seem to figure out how to do it. My question is: how do I specify such lists of columns and mutate them in one map2() call in a dplyr pipe? Is it even possible, or have I gotten it all wrong?
Minimum working example
library(tidyverse)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- reduce(
# Ultimately there would be several correction data frames
list(master, correction1),
function(x,y) {
x <- x %>%
left_join(
y,
by = c("id")
) %>%
# Wish I knew how to do this mutate call with map2
mutate(
col1 = pmax(col1.x, col1.y, na.rm=T),
col2 = pmax(col2.x, col2.y, na.rm=T),
col3 = pmax(col3.x, col3.y, na.rm=T)
) %>%
select(id, col1:col3)
}
)
The result is
> result
# A tibble: 3 x 4
id col1 col2 col3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
Rather than do a left_join, just bind the rows then summarize. For example
result <- reduce(
list(master, correction1),
function(x,y) {
bind_rows(x, y) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
}
)
result
# id col1 col2 col3
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 3
# 2 2 1 2 3
# 3 3 2 3 3
Actually, you don't even need reduce as bind_rows can take a list
Adding another table
correction2 <- tibble(id=2,col1=NA,col2=8,col3=NA)
bind_rows(master, correction1, correction2) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
Sorry, this doesn't answer your question about map2, but I find it's easier to aggregate over rows than over columns in tidy R:
library(dplyr)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- list(master, correction1) %>%
bind_rows() %>%
group_by(id) %>%
summarise_all(max, na.rm = TRUE)
result
#> # A tibble: 3 x 4
#> id col1 col2 col3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 2 3
#> 3 3 2 3 3
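One caveat with max(..., na.rm = TRUE): if a cell is NA in every table, it returns -Inf with a warning. The sketch below guards against that and uses across(), which dplyr 1.0 and later offers in place of summarise_all. The helper name `max_or_na` is hypothetical.

```r
library(dplyr)

master <- tibble(id = c(1, 2, 3), col1 = c(1, 1, 1), col2 = c(2, 2, 2), col3 = c(3, 3, 3))
correction1 <- tibble(
  id = c(1, 2, 3),
  col1 = c(NA, NA, 2),
  col2 = c(1, NA, 3),
  col3 = rep(NA_real_, 3)
)

# Hypothetical helper: NA instead of -Inf when every value is missing
max_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

# Stack the tables, then take the guarded maximum per id and column
result <- bind_rows(master, correction1) %>%
  group_by(id) %>%
  summarise(across(everything(), max_or_na))

result
```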
If correction tables will always have the same structure as master, you can do something like the following:
library(dplyr)
library(purrr)
update_master = function(...){
map(list(...), as.matrix) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame()
}
update_master(master, correction1)
To allow id to take character values, make the following modification:
update_master = function(x, ...){
map(list(x, ...), function(x) as.matrix(x[-1])) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame(id = x[[1]], .)
}
update_master(master, correction1)
Result:
id col1 col2 col3
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
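For completeness, the pairwise idea from the question can be sketched with map2 after the join. This is one possible approach, not something taken from the answers above; it assumes the score columns share names across tables so that the join produces matching .x and .y suffixes. The names `cols`, `joined` and `maxed` are made up for this example.

```r
library(dplyr)
library(purrr)

master <- tibble(id = c(1, 2, 3), col1 = c(1, 1, 1), col2 = c(2, 2, 2), col3 = c(3, 3, 3))
correction1 <- tibble(
  id = c(1, 2, 3),
  col1 = c(NA, NA, 2),
  col2 = c(1, NA, 3),
  col3 = rep(NA_real_, 3)
)

# Score columns to compare pairwise (assumed to be known up front)
cols <- c("col1", "col2", "col3")

joined <- left_join(master, correction1, by = "id")

# Walk over the .x and .y columns in parallel and take the element-wise maximum
maxed <- map2(
  joined[paste0(cols, ".x")],
  joined[paste0(cols, ".y")],
  ~ pmax(.x, .y, na.rm = TRUE)
) %>%
  set_names(cols)

result <- bind_cols(joined["id"], as_tibble(maxed))
result
```

With several correction tables, this body could serve as the function passed to reduce() in the question's code.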
