filter() but keep groups without value - r

I am trying to condense a grouped df, pulling out only rows that contain a certain value, but that value doesn't appear in every group. I want to pull out all rows with that value, but also produce an NA or 0 row for each group that doesn't contain it.
Ex:
library(dplyr)
x1 <- c('1','1','1','1','1','2','2','2','2','2','3','3','3','3','3')
x2 <- c('a','b','c','d','e','b','c','d','e','f','a','b','d','e','f')
df <- data.frame(x1, x2)
df %>%
  group_by(x1) %>%
  filter(x2 == "a")
this returns:
x1 x2
<fct> <fct>
1 1 a
2 3 a
but I want it to return:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
Obviously the real code is much more complicated, so I'm looking for the best way to keep these empty groups in a reproducible way.
PS - I would like to stay in dplyr so this fits smoothly into a function chain.
Thanks!

One dplyr option could be:
df %>%
  group_by(x1) %>%
  slice(which.max(x2 == "a")) %>%
  mutate(x2 = replace(x2, x2 != "a", NA_character_))
x1 x2
<fct> <fct>
1 1 a
2 2 <NA>
3 3 a
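This works because which.max() on a logical vector returns the position of the first TRUE, and falls back to the first element when there is no TRUE, so slice() keeps exactly one row per group:
which.max(c(FALSE, TRUE, TRUE))   # 2: position of the first TRUE
which.max(c(FALSE, FALSE, FALSE)) # 1: no TRUE, so the first position is returned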
If it's relevant to have multiple target values per group:
df %>%
  group_by(x1) %>%
  filter(x2 == "a") %>%
  bind_rows(df %>%
              group_by(x1) %>%
              filter(all(x2 != "a")) %>%
              slice(1) %>%
              mutate(x2 = replace(x2, x2 != "a", NA_character_)))

As you did not ask for dplyr solutions only, here's one option with data.table:
library(data.table)
setDT(df)
df[, .(x2 = x2[match('a', x2)]), x1]
# x1 x2
# 1: 1 a
# 2: 2 <NA>
# 3: 3 a
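This relies on base R's match(): it returns the position of the first match, or NA when there is none, and indexing with NA yields NA. The dplyr match() answer below uses the same idea:
match('a', c('b', 'a', 'a'))          # 2: position of the first 'a'
match('a', c('b', 'c'))               # NA: 'a' is absent
c('b', 'c')[match('a', c('b', 'c'))]  # NA: indexing with NA gives NA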

This happens because of the way dplyr was designed.
According to Hadley Wickham (the package creator), to keep NA values you have to ask for them explicitly. As he said in this issue on GitHub, you should use filter(a == x | is.na(a)). In your case that would be:
df %>%
  group_by(x1) %>%
  filter(x2 == "a" | is.na(x2))
If x2 already contained NA rows for those groups, this would return:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
In this code you're asking R for all rows in which x2 equals "a" as well as those in which x2 is NA. Note that this keeps existing NA rows rather than creating them, so with the example df (which has no NAs in x2) you would still need one of the other approaches.
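To see the behaviour being described, a minimal sketch (with dplyr loaded) using a column that actually contains an NA:
tmp <- data.frame(x = c('a', 'b', NA))
tmp %>% filter(x == 'a')             # the NA row is silently dropped
tmp %>% filter(x == 'a' | is.na(x))  # the NA row is kept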

We can use complete after the filter step to get the missing combinations. By default, all the other columns will be filled with NA (they can be filled with a custom value via the fill argument).
library(dplyr)
library(tidyr)
df %>%
  filter(x2 == 'a') %>%
  complete(x1 = unique(df$x1))
# A tibble: 3 x 2
# x1 x2
# <fct> <fct>
#1 1 a
#2 2 <NA>
#3 3 a
Another option is match:
df %>%
  group_by(x1) %>%
  summarise(x2 = x2[match('a', x2)])
If there are many columns, then mutate 'x2' with match and slice the first row:
df %>%
  group_by(x1) %>%
  mutate(x2 = x2[match('a', x2)]) %>%
  slice(1)

How about the base R solution using aggregate() like below?
dfout <- aggregate(x2 ~ x1, df, function(v) ifelse("a" %in% v, "a", NA))
or
dfout <- aggregate(x2 ~ x1, df, function(v) v[match("a", v)])
such that
> dfout
x1 x2
1 1 a
2 2 <NA>
3 3 a


Select second largest row by group in r

I have this problem
library(dplyr)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
For each id, I want to only keep row with second largest value of var1. I tried different ways (dplyr, base R):
problem %>%
  group_by(id) %>%
  slice_tail(2, -var1)
problem[with(problem, ave(var1, id, FUN = function(x) x == tail(sort(x), 2)[1])), ]
The first code doesn't work, and the second gives the wrong answer.
What am I doing wrong?
problem %>% group_by(id) %>% arrange(var1) %>% slice(n() - 1)
n() counts the number of rows in each group, and slice(n() - 1) takes the (n-1)th element, i.e. the second largest after sorting in ascending order. Note this will cause issues for groups with fewer than 2 members - you may wish to allow for that (see the sketch below).
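A minimal sketch of one way to allow for that, keeping only groups that actually have at least two rows:
problem %>%
  group_by(id) %>%
  filter(n() >= 2) %>%
  arrange(var1, .by_group = TRUE) %>%
  slice(n() - 1)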
If you wish to use slice, you can first slice_max() the largest two rows, then slice_tail() to drop the larger of the two:
library(dplyr)
problem %>%
  group_by(id) %>%
  slice_max(var1, n = 2) %>%
  slice_tail(n = 1)
Or you can use a single filter:
problem %>%
  group_by(id) %>%
  filter(var1 == max(var1[var1 != max(var1)]))
Output
# A tibble: 2 × 3
# Groups: id [2]
id var1 var2
<dbl> <dbl> <dbl>
1 1 4 12
2 2 5 9
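Note that the slice-based answers and this filter() treat ties differently: with a duplicated maximum, slicing the second row after sorting returns the repeated maximum, while the filter() version returns the largest value strictly below the maximum. A small illustration:
ties <- data.frame(id = c(1, 1, 1), var1 = c(5, 5, 4))
ties %>% arrange(desc(var1)) %>% slice(2)              # var1 = 5 (the tied maximum)
ties %>% filter(var1 == max(var1[var1 != max(var1)]))  # var1 = 4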
In case you have a large volume of data, here is a data.table approach:
library(data.table)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3,6,5,4), var2 = c(99,12,32,88,9,8))
setDT(problem)
setorder(problem, id, -var1)
problem[, .SD[2], by = id]
As per Paul Stafford Allen's comment, you will have an issue for groups of size 1.
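If that is a concern, a small data.table sketch that simply drops such groups (returning nothing for groups with a single row):
problem[, if (.N >= 2) .SD[2], by = id]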
After arranging 'var1' in descending order, use slice with 2:
library(dplyr)
problem %>%
  arrange(id, desc(var1)) %>%
  group_by(id) %>%
  slice(2) %>%
  ungroup()
Output
# A tibble: 2 × 3
id var1 var2
<dbl> <dbl> <dbl>
1 1 4 12
2 2 5 9

New column with the count of columns that meet certain criteria

I searched a lot and I could not find a good solution for this simple problem. I tried rowSums, but with no success.
I have a df like the example below. I want to create a new column (V4), preferably using tidyverse, with the count of columns in each row that meet a certain condition. In this example, the condition would be . == 5: how many times the number 5 appears across the other columns.
Example df:
df <- data.frame(V1 = c(1,2,5,5,3),
                 V2 = c(1,5,5,5,5),
                 V3 = c(1,3,4,5,1))
We could use rowSums on a logical matrix
df$V4 <- rowSums(df == 5)
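If the data may contain NAs, df == 5 propagates them into the row sums; adding na.rm = TRUE treats NA as a non-match (an assumption about what you want):
df$V4 <- rowSums(df == 5, na.rm = TRUE)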
If we want a dplyr solution (note that in dplyr >= 1.1.0, pick(everything()) replaces the now-deprecated cur_data()):
library(dplyr)
df <- df %>%
  mutate(V4 = rowSums(cur_data() == 5))
Or we may also use reduce:
library(purrr)
df %>%
  mutate(V4 = across(everything(), ~ .x == 5) %>%
           reduce(`+`))
Here is another dplyr option:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(V4 = sum(c_across(V1:V3) == 5, na.rm = TRUE))
Output
V1 V2 V3 V4
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 2 5 3 1
3 5 5 4 2
4 5 5 5 3
5 3 5 1 1
Or another option using purrr:
library(tidyverse)
df %>%
  mutate(V4 = pmap_int(select(., everything()), ~ sum(c(...) == 5, na.rm = TRUE)))

How to count the number of times a specified variable appears in a dataframe column using dplyr?

Suppose we start with this very simple dataframe called myData:
> myData
Element Class
1 A 0
2 A 0
3 C 0
4 A 0
5 B 1
6 B 1
7 A 2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData dataframe? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because among other things it yields another dataframe with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count, which simplifies the group_by + summarise step:
library(dplyr)
myData %>%
  filter(Element == 'A') %>%
  count(Element, name = 'counted')
Or with just summarise and sum:
myData %>%
  summarise(counted = sum(Element == 'A'), Element = 'A') %>%
  relocate(Element, .before = 1)
Element counted
1 A 4
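If you just want the bare number rather than a one-row data frame, a couple of direct options (the first stays in dplyr via pull()):
myData %>% filter(Element == 'A') %>% summarise(counted = n()) %>% pull(counted)  # 4
sum(myData$Element == 'A')                                                        # 4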
Another option using tally, like this:
myData = data.frame(Element = c("A","A","C","A","B","B","A"), Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
  filter(Element == "A") %>%
  group_by(Element) %>%
  tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)

Replacing NA values with mode from multiple imputation in R

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:
df <- data.frame(ID = c(1,2,3,4,5),
                 var1 = c(1,NA,3,6,NA),
                 var2 = c(NA,1,2,6,6),
                 var3 = c(NA,2,NA,4,3))
imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
                  var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
                  var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
                  var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
I have a method that works, but it involves a ton of manual coding as I have ~200 variables total (I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:
library(dplyr)
mode <- function(codes){
  which.max(tabulate(codes))
}
var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))
df3 <- df %>%
  left_join(var1, by = "ID") %>%
  mutate(var1 = coalesce(var1.x, var1.y)) %>%
  select(-var1.x, -var1.y)
Thus, the original value in df is replaced with the mode only if the value was NA.
It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way of calculating the mode from the imputed data set for each variable by ID and then replacing the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them with one piece of code where i changes to each variable name, but I didn't know where to go with that idea.
x <- colnames(df)
# Attempting to iterate through variable names using i
i = as.factor(x[[2]])
This is where I am stuck. Any help is much appreciated!
Here is one option using tidyverse. Essentially, we can pivot both dataframes long, then join together and coalesce in one step rather than column by column. Mode function taken from here.
library(tidyverse)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
imp_long <- imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode)) %>%
  pivot_longer(-ID)
df %>%
  pivot_longer(-ID) %>%
  left_join(imp_long, by = c("ID", "name")) %>%
  mutate(value = coalesce(value.x, value.y)) %>%
  select(-c(value.x, value.y)) %>%
  pivot_wider(names_from = "name", values_from = "value")
Output
# A tibble: 5 × 4
ID var1 var2 var3
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 6
2 2 5 1 2
3 3 3 2 5
4 4 6 6 4
5 5 3 6 3
You can use -
library(dplyr)
mode_data <- imp %>%
  group_by(ID) %>%
  summarise(across(starts_with('var'), Mode))
df %>%
  left_join(mode_data, by = 'ID') %>%
  transmute(ID,
            across(matches('\\.x$'),
                   function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]),
                   .names = '{sub(".x$", "", .col)}'))
# ID var1 var2 var3
#1 1 1 3 6
#2 2 5 1 2
#3 3 3 2 5
#4 4 6 6 4
#5 5 3 6 3
mode_data has the Mode value for each of the var columns.
Join df and mode_data by ID.
Since all the pairs are named name.x and name.y, we can take each name.x column and replace x with y to get the corresponding paired column (.[[sub('x$', 'y', cur_column())]]).
Use coalesce to select the non-NA value in each pair.
Change the column name by removing .x from it ({sub(".x$", "", .col)}), so var1.x becomes just var1.
where the Mode function is taken from here:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
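As a quick sanity check of this helper, note that ties go to the value that appears first:
Mode(c(1, 2, 2, 3))          # 2
Mode(c('a', 'b', 'b', 'a'))  # "a" (tie between 'a' and 'b'; 'a' appears first)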
Using the same Mode helper defined above:
library(dplyr, warn.conflicts = FALSE)
imp %>%
  group_by(ID) %>%
  # collapse the 5 imputations to a single mode row per ID
  summarise(across(everything(), Mode)) %>%
  # stack the original data under the mode rows: per ID, the mode row
  # comes first and the original (possibly NA) row comes last
  bind_rows(df) %>%
  group_by(ID) %>%
  # keep the original value (last) and fall back to the mode (first) when NA
  summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#> ID var1 var2 var3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 6
#> 2 2 5 1 2
#> 3 3 3 2 5
#> 4 4 6 6 4
#> 5 5 3 6 3
Created on 2022-01-03 by the reprex package (v2.0.1)

R Summarize With DataTable

data = data.frame("StudentID" = c(1,2,3,4,5),
                  "Class" = c(1,2,2,3,3),
                  "Type" = c('A','A','B','B','B'))
Say you have the data above and you wish for a combined summary like this: the count and proportion for each value of Class and for each value of Type.
What is an effective way to produce this and write it to a CSV in an organized form?
Example data if weights are involved and you wanted weighted counts and proportions (see the sketch at the end):
data1 = data.frame("StudentID" = c(1,2,3,4,5),
                   "Class" = c(1,2,2,3,3),
                   "Type" = c('A','A','B','B','B'),
                   "Weighting" = c(10,6,13,12,2))
One option is map (using all_of() to avoid the deprecated use of an external character vector inside select()):
library(dplyr)
library(purrr)
map_dfr(names(data)[2:3], ~
  data %>%
    select(all_of(.x)) %>%
    group_by_at(.x) %>%
    summarise(COUNT = n()) %>%
    mutate(PROP = COUNT/sum(COUNT)))
# A tibble: 5 x 4
# Class COUNT PROP Type
#* <dbl> <int> <dbl> <fct>
#1 1 1 0.2 <NA>
#2 2 2 0.4 <NA>
#3 3 2 0.4 <NA>
#4 NA 2 0.4 A
#5 NA 3 0.6 B
Or with data.table by melting into 'long' format:
library(data.table)
melt(setDT(data), id.var = 'StudentID')[, .(COUNT = .N),
  .(variable, value)][, PROP := COUNT/sum(COUNT), .(variable)][]
Or with base R using table and prop.table:
lapply(data[-1], function(x) {
  x1 <- table(x)
  x2 <- prop.table(x1)
  cbind(COUNT = x1, PROP = x2)
})
Both summaries are simple; here I use dplyr. To combine them in the way you want, they need to be slapped together in a somewhat inelegant way. You can remove the col1 name if you want:
library(dplyr)
df1 <- data %>%
  group_by(Class) %>%
  summarise(Count = n(), Prop = n() / nrow(data))
df2 <- data %>%
  group_by(Type) %>%
  summarise(Count = n(), Prop = n() / nrow(data))
names(df1)[1] <- 'col1'
names(df2)[1] <- 'col1'
rbind(
  c('Class', '', ''),
  df1,
  c('Type', '', ''),
  df2
)
# A tibble: 7 x 3
col1 Count Prop
<chr> <chr> <chr>
1 Class "" ""
2 1 1 0.2
3 2 2 0.4
4 3 2 0.4
5 Type "" ""
6 A 2 0.4
7 B 3 0.6
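As for the weighted variant with data1: none of the answers above use the Weighting column, but a minimal dplyr sketch could be (assuming the weighted count is the sum of Weighting within each level; the same pattern applies to Type):
library(dplyr)
data1 %>%
  group_by(Class) %>%
  summarise(COUNT = sum(Weighting)) %>%
  mutate(PROP = COUNT / sum(COUNT))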
