Select column that has the fewest NA values - r

I am working with a data frame that produces two output columns. One column always has more NA values than the other, but not in any predictable fashion. How can I use dplyr to select the column with the fewest NA values? I was thinking of using which.min to decide, but I'm not sure how to put it all together. Note that both columns contain NA values, and I want to select the one with fewer of them.

You can do this with dplyr and purrr.
Inside which.min, you first calculate the number of NAs in each column with map (this works for as many columns as you have in your data.frame). The keep part returns only those columns that actually have NAs. which.min then returns a named vector, from which we take the name and supply it to dplyr's select function.
I have indented the code a bit so you can easily see which parts belong where.
library(purrr)
library(dplyr)
df %>%
  select(
    names(
      which.min(
        df %>%
          map(function(x) sum(is.na(x))) %>%
          keep(~ .x > 0)
      )
    )
  )
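The same steps can be sketched in base R, which makes each intermediate value easy to inspect. The example data frame and the variable names (`na_counts`, `with_na`, `best_col`) are assumptions made for illustration, not part of the answer above.

```r
# Example data (assumed): column a has 4 NAs, column b has 8
df <- data.frame(
  a = rep(c(NA, 1:5), 4),
  b = rep(c(NA, NA, 2:5), 4)
)

na_counts <- colSums(is.na(df))          # named vector: NA count per column
with_na   <- na_counts[na_counts > 0]    # same role as keep(~ .x > 0)
best_col  <- names(which.min(with_na))   # name of the column with the fewest NAs
result    <- df[best_col]                # one-column data frame
```

Note that the filtering step means a column with no NAs at all can never be selected; drop that step if a complete column should win. When counts are tied, which.min returns the first column.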

library(dplyr)
df <- tibble(a = c(rep(c(NA, 1:5), 4)), # df with different NA counts/col
             b = c(rep(c(NA, NA, 2:5), 4)))
df %>%
  summarise_all(funs(sum(is.na(.)))) # NA counts
#> # A tibble: 1 x 2
#>       a     b
#>   <int> <int>
#> 1     4     8
df %>% # answer
  select_if(funs(which.min(sum(is.na(.)))))
#> # A tibble: 24 x 1
#> a
#> <int>
#> 1 NA
#> 2 1
#> 3 2
#> 4 3
#> 5 4
#> 6 5
#> 7 NA
#> 8 1
#> 9 2
#> 10 3
#> # ... with 14 more rows
Created on 2018-05-25 by the reprex package (v0.2.0).
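funs() has been deprecated since dplyr 0.8.0, and the scoped verbs summarise_all() and select_if() are superseded by across(). A minimal sketch of the same idea with the current interface, assuming the same example tibble (the names `na_counts` and `fewest` are illustrative), might look like this:

```r
library(dplyr)

df <- tibble(
  a = rep(c(NA, 1:5), 4),
  b = rep(c(NA, NA, 2:5), 4)
)

# One-row tibble holding the NA count of every column
na_counts <- df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Name of the column with the smallest count, then select it
fewest <- names(which.min(unlist(na_counts)))
result <- df %>% select(all_of(fewest))
```

Here the comparison across columns happens explicitly in which.min(), rather than inside a per-column predicate.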

Related

Mutate All columns in a list of tibbles

Let's suppose I have the following list of tibbles:
a_list_of_tibbles <- list(
a = tibble(a = rnorm(10)),
b = tibble(a = runif(10)),
c = tibble(a = letters[1:10])
)
Now I want to map them all into a single dataframe/tibble, which is not possible due to the differing column types.
How would I go about this?
I have tried this, but I want to get rid of the for loop
for(i in 1:length(a_list_of_tibbles)){
a_list_of_tibbles[[i]] <- a_list_of_tibbles[[i]] %>% mutate_all(as.character)
}
Then I run:
map_dfr(.x = a_list_of_tibbles, .f = as_tibble)
We could do the computation within the map: use across() instead of the _all suffix variants (which are superseded) to loop over the columns of each dataset.
library(dplyr)
library(purrr)
map_dfr(a_list_of_tibbles,
        ~ .x %>%
          mutate(across(everything(), as.character)))
Output:
# A tibble: 30 × 1
a
<chr>
1 0.735200825884485
2 1.4741501589461
3 1.39870958697574
4 -0.36046362308853
5 -0.893860999301402
6 -0.565468636033674
7 -0.075270267983768
8 2.33534260196058
9 0.69667906338348
10 1.54213170143702
# … with 20 more rows
Another alternative is to use:
library(tidyverse)
map_depth(a_list_of_tibbles, 2, as.character) %>%
  bind_rows()
#> # A tibble: 30 × 1
#> a
#> <chr>
#> 1 0.0894618169853206
#> 2 -1.50144637645091
#> 3 1.44795821718513
#> 4 0.0795342912030257
#> 5 -0.837985570593029
#> 6 -0.050845557103668
#> 7 0.031194556366589
#> 8 0.0989551909839589
#> 9 1.87007290229274
#> 10 0.67816212007413
#> # … with 20 more rows
Created on 2021-12-20 by the reprex package (v2.0.1)
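If it matters which tibble each row came from, one possible variation is to convert the columns first and then bind with an identifier column. This is a sketch under the assumption that the list is named as in the question; the column name `source` is made up for illustration.

```r
library(dplyr)
library(purrr)

a_list_of_tibbles <- list(
  a = tibble(a = rnorm(10)),
  b = tibble(a = runif(10)),
  c = tibble(a = letters[1:10])
)

combined <- a_list_of_tibbles %>%
  # convert every column of every tibble to character
  map(~ mutate(.x, across(everything(), as.character))) %>%
  # stack the tibbles; .id stores the list names in a new column
  bind_rows(.id = "source")
```

The result has 30 rows, a character column `a`, and a `source` column holding "a", "b" or "c".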

How to create a one-row data.frame (tibble) from 2 vectors. One with the desired column names, and another with the values for the row?

I have two character vectors in R of the same size. Let's call them variables and values.
I wish to create a one-row data.frame (tibble) from these 2 vectors, where the column names are taken from variables and the values in the row are taken from values.
How can I do this?
library(tidyverse)
variables <- letters[1:3]
variables
#> [1] "a" "b" "c"
values <- seq(3)
values
#> [1] 1 2 3
data.frame(variables, values) %>%
  pivot_wider(names_from = variables, values_from = values)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     2     3
Created on 2021-11-08 by the reprex package (v2.0.1)
You can create a named vector and splice it for use with tibble().
values <- 1:3
variables <- LETTERS[1:3]
library(tibble)
tibble(!!!setNames(values, variables))
# A tibble: 1 x 3
A B C
<int> <int> <int>
1 1 2 3
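For comparison, a base R sketch of the same named-vector idea, with no tidyverse dependency, could look like the following. The name `one_row` is an assumption for illustration.

```r
values    <- 1:3
variables <- LETTERS[1:3]

# Name the values, turn them into a list (one element per column),
# then convert the list to a one-row data frame
one_row <- as.data.frame(as.list(setNames(values, variables)))
```

Going through a list rather than a vector keeps each value in its own column, so columns of different types would also be preserved if `values` were a list.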

Find cohorts in dataset in r dataframe

I'm trying to find the biggest cohort in a dataset of about 1000 candidates and 100 test questions. Every candidate is asked 15 questions out of a pool of 100 test questions. Candidates in the same cohort take the same set of randomly sampled questions. I'm trying to find the largest group of candidates who all took the same test.
I'm working in R. The data.frame has about 1000 rows and 100 columns. Each column corresponds to one test question. For each row (candidate), all entries are NA except for the questions that candidate was shown and answered; those entries are either 0 or 1 (see picture).
Is there an elegant way to solve this? The only thing I could think of was using dplyr to filter per 15-question subset and check how many rows remain. However, with 100 columns this means checking (I think) 100 choose 15 different possibilities. Many thanks!
(Image: data.frame structure)
We can infer the cohort based on the NA pattern:
library(tidyverse)
answers <- tribble(
  ~candidate, ~q1, ~q2, ~q3,
  1, 0, NA, NA,
  2, 1, NA, NA,
  3, 0, 0, 0
)
answers
#> # A tibble: 3 x 4
#> candidate q1 q2 q3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA
#> 2 2 1 NA NA
#> 3 3 0 0 0
# infer cohort by NA pattern
cohorts <-
  answers %>%
  group_by(candidate) %>%
  mutate_at(vars(-group_cols()), ~ ifelse(is.na(.x), NA, TRUE)) %>%
  unite(-candidate, col = "cohort")
cohorts
#> # A tibble: 3 x 2
#> # Groups: candidate [3]
#> candidate cohort
#> <dbl> <chr>
#> 1 1 TRUE_NA_NA
#> 2 2 TRUE_NA_NA
#> 3 3 TRUE_TRUE_TRUE
answers %>%
  pivot_longer(-candidate) %>%
  left_join(cohorts) %>%
  # count filled answers per candidate and cohort
  group_by(cohort, candidate) %>%
  filter(! is.na(value)) %>%
  count() %>%
  # get the largest cohort
  arrange(-n) %>%
  pull(cohort) %>%
  first()
#> Joining, by = "candidate"
#> [1] "TRUE_TRUE_TRUE"
Created on 2021-09-21 by the reprex package (v2.0.1)
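Note that the pipeline above ranks cohorts by how many questions a candidate answered, which is why the three-question pattern comes out first even though only one candidate has it. If "largest cohort" means the pattern shared by the most candidates, a minimal base R sketch could count candidates per pattern instead. The `+`-joined key and the variable names are illustrative assumptions.

```r
# Same toy data as above, as a plain data frame
answers <- data.frame(
  candidate = 1:3,
  q1 = c(0, 1, 0),
  q2 = c(NA, NA, 0),
  q3 = c(NA, NA, 0)
)

question_cols <- setdiff(names(answers), "candidate")

# Logical matrix: TRUE where a candidate was shown (answered) a question
shown <- !is.na(answers[question_cols])

# One key per candidate, built from the names of the questions they saw
cohort_key <- apply(shown, 1, function(r) paste(question_cols[r], collapse = "+"))

# Number of candidates per key, largest first
cohort_sizes <- sort(table(cohort_key), decreasing = TRUE)
largest      <- names(cohort_sizes)[1]
members      <- answers$candidate[cohort_key == largest]
```

For this toy data the largest cohort is the one that only saw q1, with candidates 1 and 2. Because the key is computed once per row, no enumeration of question subsets is needed.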

Sum() in dplyr and aggregate: NA values

I have a dataset with about 3,000 rows. The data can be accessed via https://pastebin.com/i4dYCUQX
Problem: NA results in the output, though there appear to be no NA in the data. Here is what happens when I try to sum the total value in each category of a column via dplyr or aggregate:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
example
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
Out:
# A tibble: 4 x 2
size volume
<fctr> <int>
1 Extra Large NA
2 Large NA
3 Medium 937581572
4 Small NA
# aggregate
aggregate(volume ~ size, data=example, FUN=sum)
Out:
size volume
1 Extra Large NA
2 Large NA
3 Medium 937581572
4 Small NA
When trying to access the value via colSums, it seems to work:
# Colsums
small <- example %>% filter(size == "Small")
colSums(small["volume"], na.rm = FALSE, dims = 1)
Out:
volume
3869267348
Can anyone imagine what the issue could be?
The first thing to note is that, running your example, I get:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> # A tibble: 4 × 2
#> size volume
#> <fctr> <int>
#> 1 Extra Large NA
#> 2 Large NA
#> 3 Medium 937581572
#> 4 Small NA
which clearly states that your sums are overflowing the integer type. If we do as the warning message suggests, we can convert the integers to numerics and then sum:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum(as.numeric(.))))
#> # A tibble: 4 × 2
#> size volume
#> <fctr> <dbl>
#> 1 Extra Large 3609485056
#> 2 Large 11435467097
#> 3 Medium 937581572
#> 4 Small 3869267348
Here, funs(sum) has been replaced by funs(sum(as.numeric(.))), which does the same thing (runs sum on each group) but converts to numeric first.
It's because volume is stored as an integer and not as a numeric:
example$volume <- as.numeric(example$volume)
aggregate(volume ~ size, data=example, FUN=sum)
size volume
1 Extra Large 3609485056
2 Large 11435467097
3 Medium 937581572
4 Small 3869267348
For more check here:
What is integer overflow in R and how can it happen?
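The behaviour can be reproduced without the dataset. The following sketch uses a made-up two-element vector whose sum is just above the largest representable integer; it also suggests why colSums appeared to work, since colSums returns a double.

```r
# Two integers whose sum exceeds .Machine$integer.max (2147483647)
x <- c(.Machine$integer.max, 1L)

int_sum <- suppressWarnings(sum(x))   # integer overflow: result is NA
dbl_sum <- sum(as.numeric(x))         # computed in double: 2147483648

# colSums() returns a double, so it does not overflow here
col_sum <- colSums(data.frame(volume = x))
```

Without suppressWarnings(), the first call emits the same "integer overflow - use sum(as.numeric(.))" warning shown above.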

dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output

When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?
Here's an example with fake data.
library(dplyr)
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)
# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)
b count_a
1 1 6
2 2 6
3 3 0
# Now try it with dplyr
df %>%
  group_by(b) %>%
  summarise(count_a=length(a), .drop=FALSE)
b count_a .drop
1 1 6 FALSE
2 2 6 FALSE
Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?
The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:
library(tidyr)
df %>%
  group_by(b) %>%
  summarise(count_a=length(a)) %>%
  complete(b)
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (int)
# 1 1 6
# 2 2 6
# 3 3 NA
If you wanted the replacement value to be zero, you need to specify that with fill:
df %>%
  group_by(b) %>%
  summarise(count_a=length(a)) %>%
  complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (dbl)
# 1 1 6
# 2 2 6
# 3 3 0
Since dplyr 0.8, group_by has a .drop argument that does just what you asked for:
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)
df %>%
  group_by(b, .drop=FALSE) %>%
  summarise(count_a=length(a))
#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
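When the summary is a plain row count, count() accepts the same .drop argument, so the grouping and summarising steps can be collapsed into one call. A minimal sketch on the question's data (the result name `counts` is an assumption):

```r
library(dplyr)

df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b <- factor(df$b, levels = 1:3)   # level 3 never occurs in the data

# .drop = FALSE keeps the empty factor level with a count of zero
counts <- df %>% count(b, name = "count_a", .drop = FALSE)
```

As with group_by(), this only keeps empty groups for grouping variables that are factors.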
One additional note to go with @Moody_Mudskipper's answer: using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See the examples below:
library(dplyr)
data(iris)
# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))
# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally
#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0
# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))
# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0
# Turn group2 into a factor
iris$group2 = factor(iris$group2)
# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0
Created on 2019-03-13 by the reprex package (v0.2.1)
dplyr solution:
First, make a grouped data frame:
by_b <- as_tibble(df) %>% group_by(b)
Then summarise the levels that occur by counting with n():
res <- by_b %>% summarise(count_a = n())
Then merge the results into a data frame that contains all factor levels:
expanded_res <- left_join(expand.grid(b = levels(df$b)), res, by = "b")
Finally, since we are looking at counts in this case, change the NA values to 0:
final_counts <- expanded_res
final_counts[is.na(final_counts)] <- 0
This can also be implemented functionally, see answers:
Add rows to grouped data with dplyr?
A hack:
I thought I would post a terrible hack that works in this case, for interest's sake. I seriously doubt you should ever actually do this, but it shows how group_by() generates the attributes as if df$b were a character vector rather than a factor with levels. Also, I don't pretend to understand this properly, but I am hoping this helps me learn; this is the only reason I'm posting it!
by_b <- tbl_df(df) %>% group_by(b)
Define an "out-of-bounds" value that cannot exist in the dataset:
oob_val <- nrow(by_b)+1
Modify the attributes to "trick" summarise():
attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3
Do the summary:
res <- by_b %>% summarise(count_a = n())
Index and replace all occurrences of oob_val:
res[res == oob_val] <- 0
which gives the intended:
> res
Source: local data frame [3 x 2]
b count_a
1 1 6
2 2 6
3 3 0
This is not exactly what was asked in the question, but at least for this simple example you could get the same result using xtabs:
Using dplyr:
df %>%
  xtabs(formula = ~ b) %>%
  as.data.frame()
Or shorter:
as.data.frame(xtabs( ~ b, df))
Result (equal in both cases):
b Freq
1 1 6
2 2 6
3 3 0

Resources