dplyr: how to ignore NA in grouping variable

Using dplyr, I'm trying to group by two variables. If there is an NA in one variable but the other variable matches, I'd still like to see those rows grouped, with the NA taking on the value of the non-NA entry. So if I have a data frame like this:
variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)
If I group by variable_A and variable_B, rows 1 and 4 normally wouldn't end up in the same group, but I'd like them to, with the NA overridden to "a". How can I achieve this? The code below doesn't do the job.
df2 <- df %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))

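You can group by B first, and then fill in the missing A values. Wrapping the fill in coalesce() replaces only the NA entries, so non-missing values such as "f" are left alone. Then proceed with what you wanted to do:
df_filled = df %>%
group_by(variable_B) %>%
mutate(variable_A = coalesce(variable_A, first(na.omit(variable_A))))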
df_filled %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
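To check the fill step on the question's data, here is a minimal self-contained sketch. It assumes that only the NA entries should be replaced, so the existing "f" row keeps its own value, which is why the fill is wrapped in coalesce(). The `.groups = "drop"` argument and the name `result` are additions for illustration, not part of the original answer.

```r
library(dplyr)

df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f"),
  variable_B = c("c", "d", "e", "c", "c"),
  variable_C = c(10, 20, 30, 40, 50)
)

result <- df %>%
  group_by(variable_B) %>%
  # coalesce() keeps existing values and fills only the NA entries
  # with the first non-missing variable_A in the same variable_B group
  mutate(variable_A = coalesce(variable_A, first(na.omit(variable_A)))) %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C), .groups = "drop")

print(result)
```

With this data, rows 1 and 4 land in the same (a, c) group with a total of 50, while the (f, c) row stays separate.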

You could do the missing value imputation using base R as follows:
ii <- which(is.na(df$variable_A)) # rows with a missing variable_A
donors <- df[!is.na(df$variable_A), ] # rows that can supply a value
df_filled <- df
df_filled$variable_A[ii] <- donors$variable_A[match(df$variable_B[ii], donors$variable_B)]
Then group and summarize as planned with dplyr:
df_filled %>%
group_by(variable_A, variable_B) %>%
dplyr::summarise(total=sum(variable_C))

Related

concatenate 2 vectors to string by common element

I have a data.frame with 2 columns. If an element appears in both columns, this should be the grouping criterion. I then want to create a new column which concatenates all elements by group into a single, sorted string.
df <- tibble::tribble(
~col1, ~col2,
"a", "b",
"b","c",
"c","b",
"d",NA,
"e","d",
"f","d",
"g","d",
"h","i",
"i","h",
"j", NA
)
outcome <- tibble::tribble(
~result,
c("a_b_c"),
c("d_e_f_g"),
c("h_i"),
c("j")
)
Any help is appreciated, since I have not yet found a starting point to solve the question. Thanks!
Get the connected components from igraph and paste.
library(dplyr)
library(igraph)
df %>%
mutate(col2 = coalesce(col2, col1)) %>%
as.matrix %>%
graph_from_edgelist %>%
components %>%
groups %>%
sapply(paste, collapse = "_") %>%
stack
giving:
values ind
1 a_b_c 1
2 d_e_f_g 2
3 h_i 3
4 j 4
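The pipeline above can also be unrolled into named steps, which makes it easier to see where the groups come from. This is a sketch under a few assumptions: the graph is treated as undirected, NA in col2 is replaced by col1 in base R (a self-loop, so isolated elements still become vertices), and the names `g`, `comp`, `members` and `result` are made up for illustration.

```r
library(igraph)

df <- data.frame(
  col1 = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  col2 = c("b", "c", "b", NA, "d", "d", "d", "i", "h", NA),
  stringsAsFactors = FALSE
)

# Replace NA by the row's own col1 so every row is a valid edge
df$col2 <- ifelse(is.na(df$col2), df$col1, df$col2)

# Build the graph from the two-column character matrix of edges
g <- graph_from_edgelist(as.matrix(df[, c("col1", "col2")]), directed = FALSE)

# Connected components: one membership id per vertex
comp <- components(g)

# Collect vertex names per component, then sort and paste each group
members <- split(V(g)$name, comp$membership)
result <- sort(unname(vapply(
  members,
  function(v) paste(sort(v), collapse = "_"),
  character(1)
)))

print(result)
```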

Convert categorical "other" value to NA in a data frame using dplyr

trial <- data.frame(c("A", "B", "C", "other"), c("a","b","Others","d"))
There are 2 categorical variables (attributes) in the data frame. I want to recode the value "other" as NA. I follow the link here: https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html
na_strings <- c("other", "Others")
trial %>%
replace_with_na_all(condition = ~.x %in% na_strings)
However, the "other" value does change to NA, but all other characters are turned into numbers. I want the rest of the values to remain character.
What should I do? Thanks in advance.
Here is a simple dplyr solution:
library(dplyr)
library(naniar)
trial %>%
mutate_if(is.factor,as.character) %>%
replace_with_na_all(condition = ~.x %in% na_strings)
You just need to change your variable class from factor to character before the replace_with_na_all function.
You can use base R:
trial[sapply(trial, `%in%`, na_strings)] <- NA
Or use only dplyr to do this:
library(dplyr)
trial %>% mutate_all(~replace(., . %in% na_strings, NA))
# col1 col2
#1 A a
#2 B b
#3 C <NA>
#4 <NA> d
data
trial <- data.frame(col1 = c("A", "B", "C", "other"),
col2 = c("a","b","Others","d"))
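In dplyr 1.0.0 and later, mutate_all() is superseded by across(). The following is a minimal sketch of the same replacement written that way; it assumes dplyr >= 1.0.0, and the name `out` is only for illustration.

```r
library(dplyr)

trial <- data.frame(
  col1 = c("A", "B", "C", "other"),
  col2 = c("a", "b", "Others", "d")
)
na_strings <- c("other", "Others")

out <- trial %>%
  # apply the same replacement to every column;
  # values listed in na_strings become NA, everything else is untouched
  mutate(across(everything(), ~ replace(.x, .x %in% na_strings, NA)))

print(out)
```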

Using dplyr collapse rows taking condition from another numeric column

An example df:
experiment = c("A", "A", "A", "A", "A", "B", "B", "B")
count = c(1,2,3,4,5,1,2,1)
df = cbind.data.frame(experiment, count)
Desired output:
experiment_1 = c("A", "A", "A", "B", "B")
freq = c(1,1,3,2,1) # frequency
freq_per = c(20,20,60,66.6,33.3) # frequency percent
df_1 = cbind.data.frame(experiment_1, freq, freq_per)
I want to do the following:
1. Group df using experiment
2. Calculate freq using the count column
3. Calculate freq_per
4. Calculate sum of freq_per for all observations with count >= 3
I have the following code. How do I do step 4?
freq_count = df %>% dplyr::group_by(experiment, count) %>% summarize(freq=n()) %>% na.omit() %>% mutate(freq_per=freq/sum(freq)*100)
Thank you very much.
There may be a more concise approach but I would suggest collapsing your count in a new column using mutate() and ifelse() and then summarising:
freq_count %>%
mutate(collapsed_count = ifelse(count >= 3, 3, count)) %>%
group_by(collapsed_count, .add = TRUE) %>% # adds a 2nd grouping var (.add replaces the deprecated add argument in dplyr >= 1.0.0)
summarise(freq = sum(freq), freq_per = sum(freq_per)) %>%
select(-collapsed_count) # dropped to match your df_1.
Also, just fyi, for step 2 you might consider the count() function if you're keen to save some keystrokes. Also tibble() or data.frame() are likely better options than calling the dataframe method of cbind explicitly to create a data frame.
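As a rough illustration of the count() and tibble() suggestions, the whole task could be sketched as below. It collapses the counts before counting, which is a slightly different route from the answer's code; pmin() as the collapsing step and the `name = "freq"` argument are choices made for this sketch.

```r
library(dplyr)

df <- tibble(
  experiment = c("A", "A", "A", "A", "A", "B", "B", "B"),
  count = c(1, 2, 3, 4, 5, 1, 2, 1)
)

df_1 <- df %>%
  # collapse every count >= 3 into a single bucket labelled 3
  mutate(count = pmin(count, 3)) %>%
  # count() groups, tallies and ungroups in one call
  count(experiment, count, name = "freq") %>%
  group_by(experiment) %>%
  mutate(freq_per = freq / sum(freq) * 100) %>%
  ungroup()

print(df_1)
```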

Drop unused levels from a factor after filtering data frame using dplyr

I used dplyr functions to create a new data set which contains only the names that have fewer than 4 rows.
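df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9, stringsAsFactors = TRUE) # name must be a factor to reproduce this (no longer the default since R 4.0.0)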
aa = df %>%
group_by(name) %>%
filter(n() < 4)
But when I type
table(aa$name)
I get,
a b c
3 2 0
I would like to have my output as follow
a b
3 2
How can I completely separate the new data frame aa from df?
To complete your answer and KoenV's comment: you can write your solution in one line, or apply the function factor(), which also removes the unused levels:
table(droplevels(aa$name))
table(factor(aa$name))
Or, because you are using dplyr, add droplevels() at the end:
aa <- df %>%
group_by(name) %>%
filter(n() < 4) %>%
droplevels()
table(aa$name)
# Without using table
df %>%
group_by(name) %>%
summarise(count = n()) %>%
filter(count < 4)
aaNew <- droplevels(aa)
table(aaNew$name)
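The reason table() still shows c is that subsetting a factor keeps its full set of levels. A small base R sketch of that behaviour follows; it builds the factor explicitly and uses base subsetting instead of dplyr, and `keep` and `aa_clean` are names chosen for this sketch.

```r
df <- data.frame(
  name = factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c")),
  x = 1:9
)

# Names that occur fewer than 4 times
keep <- names(which(table(df$name) < 4))
aa <- df[df$name %in% keep, ]

# Subsetting drops the rows for "c" but not the level itself
levels(aa$name)        # "a" "b" "c"

# droplevels() removes levels that no longer occur in the data
aa_clean <- droplevels(aa)
levels(aa_clean$name)  # "a" "b"
```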

How to get count-of-a-count with dplyr?

Let's say we have the data frame
df <- data.frame(x = c("a", "a", "b", "a", "c"))
Using dplyr count, we get
df %>% count(x)
x n
1 a 3
2 b 1
3 c 1
I now want to do a count on the resulting n column. If the n column were named m, the result I'm looking for is
m n
1 1 2
2 3 1
How can this be done with dplyr?
Thank you very much!
dplyr seems to have trouble with count(n).
For instance:
d <- data.frame(n = sample(1:2, 10, TRUE), x = 1:10)
d %>% count(n)
A workaround is to rename n:
df %>% # using data defined in question
count(x) %>%
rename(m = n) %>%
count(m)
EDIT: I was wrong. Didn't have the newest version of dplyr so I didn't have the count function.
With dplyr, one way to count is with n(). In your example you would do the following to obtain the first counts:
df <- data.frame(x = c("a", "a", "b", "a", "c"))
df %>% group_by(x) %>% summarise(count=n())
Then if you want to count the occurrences of particular counts you can do:
df %>% group_by(x) %>% summarise(count=n()) %>% group_by(count) %>% summarise(newCount=n())
This is a dplyr way.
sum((df %>% count(x))$n)
##[1] 5
If you are willing to give data.table a try, it could be quite straightforward.
df <- data.frame(x = c("a", "a", "b", "a", "c"))
library(data.table)
setDT(df)[, .N, by=x][, list(count_of_N=.N), by=N]
# N count_of_N
# 1: 3 1
# 2: 1 2
If you want to count:
df %>% count(x) %>% summarise(length(n))
# length(n)
#1 3
If you want the sum:
df %>% count(x) %>% summarise(sum(n))
# sum(n)
#1 5
It's not pure plyr but this may work:
countr<-function(x){data.frame(table(x))}
t<-count(df,x)
countr(t[,2])
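For comparison, here is a short sketch that produces the m / n layout asked for in the question. It assumes a dplyr version in which count() accepts the `name` argument, so the first count can be stored as m directly; the base R line is an alternative that needs no packages. The names `count_of_counts` and `base_version` are only for illustration.

```r
library(dplyr)

df <- data.frame(x = c("a", "a", "b", "a", "c"))

count_of_counts <- df %>%
  count(x, name = "m") %>%  # m = how often each x occurs
  count(m)                  # n = how many x values share that m

print(count_of_counts)

# Base R alternative: tabulate the table of counts
base_version <- table(table(df$x))
print(base_version)
```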
