I want to use dplyr summarise to sum counts by groups. Specifically I want to remove NA values if not all summed values are NA, but if all summed values are NA, I want to display NA. For example:
name <- c("jack", "jack", "mary", "mary", "ellen", "ellen")
number <- c(1,2,1,NA,NA,NA)
df <- data.frame(name,number)
In this case I want the following result:
Jack = 3
Mary = 1
Ellen = NA
However if I set na.rm = F:
df %>% group_by(name) %>% summarise(number = sum(number, na.rm = F))
The result is:
Jack = 3
Mary = NA
Ellen = NA
And if I set na.rm = T:
df %>% group_by(name) %>% summarise(number = sum(number, na.rm = T))
The result is
Jack = 3
Mary = 1
Ellen = 0
How can I solve this so that groups containing both numbers and NAs return a number, while groups containing only NAs return NA?
We can use an if/else condition: if all the values in 'number' are NA, return NA; otherwise return the sum.
library(dplyr)
df %>%
group_by(name) %>%
summarise(number = if(all(is.na(number))) NA_real_ else sum(number, na.rm = TRUE))
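The same rule can be wrapped in a small helper so it is reusable outside summarise. This is only a sketch: the name sum_or_na is made up for illustration, and the check uses base R's tapply rather than dplyr.

```r
# Hypothetical helper (name is an assumption): sum that ignores NA
# unless every value in the group is NA
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

df <- data.frame(
  name   = c("jack", "jack", "mary", "mary", "ellen", "ellen"),
  number = c(1, 2, 1, NA, NA, NA)
)

# Base R check of the per-group result (groups come back in alphabetical order)
res <- tapply(df$number, df$name, sum_or_na)
res
```

Inside a dplyr pipeline the helper would be called as summarise(number = sum_or_na(number)).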
I was struggling with the same thing, so I wrote a solution into the package hablar. Try:
library(hablar)
df %>% group_by(name) %>%
summarise(number = sum_(number))
which gives you:
# A tibble: 3 x 2
name number
<fct> <dbl>
1 ellen NA
2 jack 3.
3 mary 1.
Note that the only syntax difference is sum_, a function that returns NA if all values are NA; otherwise it removes the NAs and calculates the sum of the non-missing values.
I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that would hold the 1 and 2 results from both individual columns.
I have tried using ifelse, but it only shows me the value of IPV_YES.
Dataset I have
My desired outcome
my answer
df %>% mutate(across(everything(), ~ifelse(. == "", NA, as.numeric(.)))) %>%
group_by(ID) %>%
rowwise() %>%
transmute(IPV = sum(c_across(everything()), na.rm = T))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
We can use coalesce after converting the '' to NA
library(dplyr)
df <- df %>%
transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
type.convert(as.is = TRUE)
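For readers who want to see the logic without dplyr, here is a rough base R sketch of the same "first non-missing value" idea. The helper name blank_to_na is an assumption made for this example, not a dplyr function.

```r
# Hypothetical helper (name is an assumption): turn empty strings into NA
blank_to_na <- function(x) replace(x, which(x == ""), NA)

yes <- blank_to_na(c("1", "", "1", ""))
no  <- blank_to_na(c("", "2", "", "2"))

# Take the value from `yes`; where it is missing, fall back to `no`
ipv <- yes
missing_yes <- is.na(ipv)
ipv[missing_yes] <- no[missing_yes]

# Convert to integer once the blanks have been filled
ipv <- as.integer(ipv)
ipv
```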
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, then give the value in df$IPV_YES, else give the value in the same row of df$IPV_NO.
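A small sketch of how ifelse() picks values position by position may help here. The vectors below are illustrative stand-ins for the two columns; the second call shows what happens when the "no" vector is shorter than the test and gets recycled.

```r
yes_col <- c("1", "", "1", "")
no_col  <- c("", "2", "", "2")

# Element i comes from `yes` where test[i] is TRUE, otherwise from `no`
ipv <- ifelse(yes_col != "", yes_col, no_col)
ipv

# A shorter `no` is recycled to full length BEFORE positions are picked,
# so positions 2 and 4 both receive "9" and the "2" is never used
recycled <- ifelse(yes_col != "", yes_col, c("2", "9"))
recycled
```

Passing full-length vectors for both branches avoids relying on recycling.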
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
Maybe you can try the code below
replace(df, df == "", NA) %>%
mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
select(ID, IPV) %>%
type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
How can I use a for loop to sum data by group, then break and print the accumulated sum of A and B respectively?
ie:
Type value
A 2
A NA
A 13 15
B 565
B 245
B 578 1388
library(dplyr)
df %>%
group_by(Type) %>%
mutate(cs = cumsum(value))
but it only shows the whole table, and the sum for A, which should be 15, ends up as NA.
Type value cs
A 2 2
A NA NA
A 13 NA
B 565 565
B 245 810
B 578 1388
Using dplyr you can try
library(dplyr)
df %>%
group_by(Type) %>%
mutate(cs = last(sum(value, na.rm = TRUE))) %>%
mutate(id = row_number()) %>% # Creating a dummy id column
mutate(cs= replace(cs, id!= max(id),NA)) %>% # replace all rows of cs that are not the last within group Type
select(-id) # removing id column
#Output
# A tibble: 6 x 3
# Groups: Type [2]
Type value cs
<chr> <int> <int>
1 A 2 NA
2 A NA NA
3 A 13 15
4 B 565 NA
5 B 245 NA
6 B 578 1388
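As a side note on the running total in the question: cumsum() takes a single argument and has no na.rm option, so an NA propagates to every later element. A minimal sketch of one workaround, assuming NA should count as zero:

```r
value <- c(2, NA, 13)

# Plain cumsum(): everything after the first NA becomes NA
plain <- cumsum(value)
plain

# Treat NA as 0 before accumulating (an assumption about the desired behaviour)
running <- cumsum(replace(value, is.na(value), 0))
running
```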
If I understand correctly, the OP expects that all rows of the new column cs are blank except for the last row of each group where the sum of the values belonging to the group should be printed.
A blank row is only possible if the new column cs is of type character. If cs is expected to be numeric, there is no other choice than to print 0, NA, or some other numeric value, but not "" (empty string).
So, below there are suggestions to create a character column either by using
ifelse(), or
replace() and rep(), or
c() and rep().
in data.table and dplyr syntax, resp.
Note that no for loop is required at all.
data.table
library(data.table)
setDT(df)[, cs := fifelse(1:.N == .N, as.character(sum(value, na.rm = TRUE)), ""), by = Type][]
or
setDT(df)[, cs := replace(rep("", .N), .N, sum(value, na.rm = TRUE)), by = Type][]
or
setDT(df)[, cs := c(rep("", .N - 1L), sum(value, na.rm = TRUE)), by = Type][]
Type value cs
1: A 2
2: A NA
3: A 13 15
4: B 565
5: B 245
6: B 578 1388
dplyr
library(dplyr)
df %>%
group_by(Type) %>%
mutate(cs = ifelse(row_number() == n(), as.character(sum(value, na.rm = TRUE)), ""))
or
df %>%
group_by(Type) %>%
mutate(cs = replace(rep("", n()), n(), sum(value, na.rm = TRUE)))
or
df %>%
group_by(Type) %>%
mutate(cs = c(rep("", n() - 1L), sum(value, na.rm = TRUE)))
# A tibble: 6 x 3
# Groups: Type [2]
Type value cs
<chr> <int> <chr>
1 A 2 ""
2 A NA ""
3 A 13 "15"
4 B 565 ""
5 B 245 ""
6 B 578 "1388"
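The same idea can also be sketched in base R with ave(), which applies a function per group and returns a vector of the original length. The data frame below is rebuilt from the table in the question; treat it as an illustrative reconstruction.

```r
df <- data.frame(
  Type  = rep(c("A", "B"), each = 3),
  value = c(2, NA, 13, 565, 245, 578)
)

# Numeric version: NA everywhere except the last row of each group
df$cs <- ave(df$value, df$Type, FUN = function(v) {
  c(rep(NA_real_, length(v) - 1L), sum(v, na.rm = TRUE))
})

# Character version: blanks instead of NA
df$cs_chr <- ifelse(is.na(df$cs), "", as.character(df$cs))
df
```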
I need to create an id that defines the relationship between contact_id and relationship_id into a common household_id if and where the combination of contact_id and relationship_id are the same.
Sample Data
account_id <- c(1,1,1,1)
contact_id <- c(1234,2345,3456,4567)
relationship_id <- c(2345,1234,NA,"")
ownership_percent <- c(26,22,40,12)
score <- c(500,300,700,600)
testdata <- data.frame(account_id,contact_id,relationship_id,ownership_percent,score)
Have been using combinations of mutate, paste0, min, max, group_indices - have not found the right combination, getting tripped up by NA and order output of new household_id
Approach 1
library(dplyr)
testdata %>%
mutate(col1 = pmin(contact_id, relationship_id),
col2 = pmax(contact_id, relationship_id),
household_id = paste0(col1,col2))
Approach 2
testdata %>%
mutate(household_id = sort(paste0(c(contact_id, relationship_id))), collapse = "")
Error: Column household_id must be length 4 (the number of rows) or one, not 8
Expected Outcome
library(dplyr)
# replace missing or NA values with 1
testdata$relationship_id <- type.convert(testdata$relationship_id)
testdata$relationship_id[relationship_id == ""] <- 1
testdata$relationship_id[is.na(testdata$relationship_id)] <- 1
# Create household_id
testdata %>%
mutate(group = paste0(pmin(contact_id, relationship_id), pmax(contact_id, relationship_id)),
household_id = match(group, unique(group)))
You can use -
library(dplyr)
testdata %>%
mutate(col1 = pmin(contact_id, relationship_id, na.rm = TRUE),
col2 = pmax(contact_id, relationship_id, na.rm = TRUE)) %>%
rowwise() %>%
mutate(household_id = paste0(unique(c(col1, col2)), collapse = '')) %>%
ungroup %>%
select(-col1, -col2)
# account_id contact_id relationship_id ownership_percent score household_id
# <dbl> <dbl> <chr> <dbl> <dbl> <chr>
#1 1 1234 "2345" 26 500 12342345
#2 1 2345 "1234" 22 300 12342345
#3 1 3456 NA 40 700 3456
#4 1 4567 "" 12 600 4567
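If a sequential integer household_id is wanted, as in the expected outcome of the question, the order-independent key can be numbered with match(). This is a base R sketch under the assumption that blank relationship ids are treated as NA; the variable names lo, hi and key are made up for the example.

```r
contact_id      <- c(1234, 2345, 3456, 4567)
relationship_id <- c(2345, 1234, NA, NA)  # assumption: "" already converted to NA

# Smaller and larger id of each pair, ignoring a missing partner
lo <- pmin(contact_id, relationship_id, na.rm = TRUE)
hi <- pmax(contact_id, relationship_id, na.rm = TRUE)

# Same key regardless of which side of the relationship a row sits on
key <- ifelse(lo == hi, as.character(lo), paste0(lo, "_", hi))

# Number the keys in order of first appearance
household_id <- match(key, unique(key))
household_id
```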
I have the following dataset, and I want to know the min word for each group, and if there is no min word (it is NA), I still want to display it
df=data.frame(
key=c("A","A","B","B","C"),
word=c(1,2,3,5,NA))
df%>%group_by(key)%>%slice(which.min(word))
This excludes key=C, word=NA which I would want:
df_out=data.frame(
key=c("A","B","C"),
word=c(1,3,NA))
We can create a logical condition with is.na in filter and return the NA rows as well after doing the grouping by 'key'
library(dplyr)
df %>%
group_by(key) %>%
filter(word == min(word)|is.na(word))
Or using slice. We don't need any if/else condition
df %>%
group_by(key) %>%
slice(which(word ==min(word)|is.na(word)))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Or more compactly
df %>%
group_by(key) %>%
slice(match(min(word), word))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
NOTE: Using match returns the index of the first match.
which.min removes the NA
which.min(c(NA, 1, 3))
#[1] 2
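A short base R sketch of why the all-NA group disappears with which.min() but survives with match(): which.min() returns an empty index when every value is NA, while match() treats NA as matching NA.

```r
# NA is skipped, so the index of the smallest non-missing value is returned
idx_mixed <- which.min(c(NA, 1, 3))
idx_mixed

# All values NA: an empty integer vector, so slice() has nothing to keep
idx_all_na <- which.min(NA_real_)
length(idx_all_na)

# min() of an all-NA vector is NA, and match() finds that NA at position 1
idx_match <- match(min(NA_real_), NA_real_)
idx_match
```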
We can check the condition with if: if all the word values in a group are NA, we return the first row; otherwise we return the row with the minimum.
library(dplyr)
df %>%
group_by(key)%>%
slice(if(all(is.na(word))) 1L else which.min(word))
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Another option is to arrange the data by word and select the 1st row in each group.
df %>% arrange(key, word) %>% group_by(key) %>% slice(1L)
You can write a modified slice function with the tidyverse package that returns NA rows for NA indices:
library(tidyverse)
slice_uneven = function(.data, .idx) {
.data_ = .data %>% add_row() # Add an extra row
.idx_ = .idx %>% c(NA) %>% replace_na(nrow(.data_)) # Replace NA with index of the extra row
.data_[.idx_,] %>% head(-1) %>% remove_rownames() %>% return() # Subset, remove extra row, and reset rownames before returning data
}
slice_uneven(cars, c(1, 2, 3, NA, NA, 3, 2))
You can also arrange by word and use distinct from dplyr to get the desired output.
library(dplyr)
df %>%
arrange(word) %>%
distinct(key, .keep_all = TRUE)
# key word
#1 A 1
#2 B 3
#3 C NA
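The sort-then-keep-first approach can be sketched in base R as well. It relies on order() placing NA last by default, so an NA row is only picked when the group has nothing else.

```r
df <- data.frame(
  key  = c("A", "A", "B", "B", "C"),
  word = c(1, 2, 3, 5, NA)
)

# Sort by key, then by word; order() puts NA last by default
sorted <- df[order(df$key, df$word), ]

# Keep the first row of each key
first_per_key <- sorted[!duplicated(sorted$key), ]
first_per_key
```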
I have a problem. Lines where PROJECT="SNOP" are missing (they don't appear in df6) although they were present in df5. The PROJECT = "SNOP" lines contain only NAs for the VERSION2 variable. Can someone help me? Here is my code:
which(df5$PROJECT=="SNOP") #200 lines appear
df6 <- df5 %>%
group_by(PROJECT) %>%
filter(VERSION2 == ifelse(!all(is.na(VERSION2)), max(VERSION2, na.rm=T), NA)) %>%
ungroup()
which(df6$PROJECT=="SNOP") #missing lines PROJECT="Snop" #answer: integer(0)
That is probably because "SNOP" has all NA values and filter drops them.
Consider this small example.
library(dplyr)
df <- data.frame(a = rep(c(1:2), each =3), b = c(1:3, NA, NA, NA))
Using your code, we do :
df %>%
group_by(a) %>%
filter(b == ifelse(!all(is.na(b)), max(b, na.rm=TRUE), NA))
# a b
# <int> <int>
#1 1 3
Notice how a = 2 is dropped.
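The reason can be seen in a few lines of base R: comparing anything with NA gives NA, and filter() keeps only rows where the condition is TRUE. The vector below stands in for b within the all-NA group.

```r
b <- c(NA_integer_, NA_integer_, NA_integer_)

# For an all-NA group the comparison target is NA
target <- ifelse(!all(is.na(b)), max(b, na.rm = TRUE), NA)

# Every comparison with NA is itself NA, never TRUE
cond <- b == target
cond

# No row satisfies the condition, so nothing is kept
kept <- b[which(cond)]
length(kept)
```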
Now you can decide what you want to do to those groups where all values are NA. For example, the below keeps all the rows where there are NA's.
df %>%
group_by(a) %>%
filter(if(!all(is.na(b))) b == max(b, na.rm=T) else TRUE)
# a b
# <int> <int>
#1 1 3
#2 2 NA
#3 2 NA
#4 2 NA