Removing NULL values from the table - r

I have a table with NULL values in one column, and those NULL values show up as an extra label in the Highchart graph. How can I use dplyr to get rid of the rows that have NULL values in that specific column?
I was thinking of changing the backend SQL queries to filter the result for the desired output, but that is not an appropriate approach here.
This is not working:
dplyr::filter(!is.na(ColumnWithNullValues)) %>%
Actual code:
df <- data() %>%
dplyr::filter(CreatedBy == 'owner') %>%
dplyr::group_by(`Reason for creation`) %>%
dplyr::arrange(ReasonOrder) %>%
Here, ColumnWithNullValues is the column that has NULL values.

Here is one option with base R
df[!sapply(df$ColumnWithNullValues, is.null),]
data
library(tibble)
df <- tibble(
ColumnWithNullValues = list(c(1:5), NULL, c(6:10)))

Here is a small example df:
library(dplyr)
library(purrr)
df <- tibble(
ColumnWithNullValues = list(c(1:5), NULL, c(6:10))
)
df
#> # A tibble: 3 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <NULL>
#> 3 <int [5]>
In this case, the most obvious approach might seem to be:
df %>%
filter(!is.null(ColumnWithNullValues))
#> # A tibble: 3 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <NULL>
#> 3 <int [5]>
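But as you can see, that does not work: is.null() is applied once to the whole column, which is a list and therefore never NULL, so every row is kept. Instead, we need to use map/sapply/vapply to get inside the list. For example, like this: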
df %>%
filter(map_lgl(ColumnWithNullValues, function(x) !all(is.null(x))))
#> # A tibble: 2 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <int [5]>
But as @akrun has explained in the comments, an element of the list cannot contain NULL alongside other values. So we can simplify the code to this:
df %>%
filter(!map_lgl(ColumnWithNullValues, is.null))
#> # A tibble: 2 x 1
#> ColumnWithNullValues
#> <list>
#> 1 <int [5]>
#> 2 <int [5]>
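As a rough illustration of why the plain is.null() filter keeps every row, the sketch below reuses the example df from above and compares the test on the whole column with the test on each element. The names whole_column, per_element and kept are made up for this sketch.

```r
library(tibble)

# Same toy data as in the answer above
df <- tibble(ColumnWithNullValues = list(1:5, NULL, 6:10))

# Tested as a whole, the column is a list, so it is never NULL:
# this gives a single FALSE, which filter() would recycle to every row
whole_column <- is.null(df$ColumnWithNullValues)

# Tested element by element, we get one logical per row
per_element <- vapply(df$ColumnWithNullValues, is.null, logical(1))

# Base R subsetting that keeps only the rows whose element is not NULL
kept <- df[!per_element, ]
```

Here whole_column is a single value while per_element has one entry per row, which is why only the element-wise version can drop the second row.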

Related

How to group_by and summarise vectors inside a tibble into a single vector?

tibble(x = rep(1:3, 2),
y = list(1:5, 1:10, 10:20, 20:40, 1:50, 5:10)) -> df
df
#> # A tibble: 6 × 2
#> x y
#> <int> <list>
#> 1 1 <int [5]>
#> 2 2 <int [10]>
#> 3 3 <int [11]>
#> 4 1 <int [21]>
#> 5 2 <int [50]>
#> 6 3 <int [6]>
I want to group_by 'x' and summarise the vectors of each group into a single vector. I tried using c(), but it didn't help.
df %>%
group_by(x) %>%
summarise(z = c(y))
#> `summarise()` has grouped output by 'x'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 2
#> # Groups: x [3]
#> x z
#> <int> <list>
#> 1 1 <int [5]>
#> 2 1 <int [21]>
#> 3 2 <int [10]>
#> 4 2 <int [50]>
#> 5 3 <int [11]>
#> 6 3 <int [6]>
I also want a union of elements in a group or any other similar function applied to these kinds of datasets.
df %>%
group_by(x) %>%
summarise(z = union(y))
#> Error in `summarise()`:
#> ! Problem while computing `z = union(y)`.
#> ℹ The error occurred in group 1: x = 1.
#> Caused by error in `base::union()`:
#> ! argument "y" is missing, with no default
If you want the data to remain nested, you can do
df %>%
group_by(x) %>%
summarise(z = list(unlist(y)))
The c() function won't work because it doesn't unnest lists. For example, compare
c(list(1:3, 4:5))
unlist(list(1:3, 4:5))
The c function doesn't return a single vector, but unlist does. This matters because your function will receive a list of the matching row values when you use summarise.
Also note that if you leave off the list(), the values won't be nested anymore
df %>%
group_by(x) %>%
summarise(z = unlist(y))
# x z
# <int> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 1 20
# 7 1 21
# ...
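The question also asks about a union per group, which the answer above does not show. One possible sketch, assuming a Reduce() over base::union is acceptable (this is an illustration, not something taken from the original answer), folds the vectors of each group pairwise and wraps the result in list() so it stays nested:

```r
library(dplyr)
library(tibble)

# Same data as in the question
df <- tibble(x = rep(1:3, 2),
             y = list(1:5, 1:10, 10:20, 20:40, 1:50, 5:10))

# Within each group, y is a list of vectors; Reduce() applies union()
# pairwise across them, and list() keeps the result as one list element
res <- df %>%
  group_by(x) %>%
  summarise(z = list(Reduce(base::union, y)))

lengths(res$z)
#> [1] 26 50 16
```

base::union is spelled out because dplyr masks union. For a plain union, list(unique(unlist(y))) should give the same elements.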

preserve column-of-tibbles when using grouped summarise with reduce

Question
When using tibble %>% group_by() %>% summarise(...=reduce(...)) on a column containing tibbles, I would like the output to remain a column of tibbles. How do I do that most efficiently?
Minimal Example:
Setup
library(tidyverse) # provides tibble(), group_by(), summarise() and reduce()
vec1 = rnorm(10)
vec2 = rnorm(10)
vec3 = rnorm(10)
vec4 = rnorm(10)
tib=tibble(grpvar=factor(c('a','a','b','b')))
tib$col2=1
tib$col2[1]=tibble(vec1)
tib$col2[2]=tibble(vec2)
tib$col2[3]=tibble(vec3)
tib$col2[4]=tibble(vec4)
This is what it looks like:
grpvar col2
<fct> <list>
1 a <dbl [10]>
2 a <dbl [10]>
3 b <dbl [10]>
4 b <dbl [10]>
A very minimal tibble with a variable that will be used for grouping, and another column containing tibbles which contain vectors of length 10.
Problem
Using reduce within summarise simplifies the output...
tib %>% group_by(grpvar) %>% summarise(aggr=reduce(col2,`+`))
yields:
grpvar aggr
<fct> <dbl>
1 a -0.0206
...
10 a -0.101
...
20 b 0.520
Here, the tibble becomes very long ... I don't want 10 rows per group variable, but instead just one tibble containing the 10 values.
Desired output:
This is what it should look like
desired_output<-tibble(grpvar=c('a','b'),aggr=NA)
desired_output$aggr[1]=tibble(reduce(tib$col2[1:2],`+`))
desired_output$aggr[2]=tibble(reduce(tib$col2[3:4],`+`))
which looks like:
# A tibble: 2 x 2
grpvar aggr
<chr> <list>
1 a <dbl [10]>
2 b <dbl [10]>
i.e., it retains the column-of-tibbles structure (which internally, I believe is a list of vectors)
Wrap reduce with list:
tib %>% group_by(grpvar) %>% summarise(aggr=list(reduce(col2,`+`)))
Output:
# A tibble: 2 x 2
grpvar aggr
<fct> <list>
1 a <dbl [10]>
2 b <dbl [10]>
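A small self-contained check of this approach might look like the following. It builds the list column with replicate() and a fixed seed instead of the question's setup, and both of those choices are assumptions made only to keep the sketch short and reproducible.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(42)  # arbitrary seed, only for reproducibility

# Four vectors of length 10, two per group
tib <- tibble(grpvar = factor(c("a", "a", "b", "b")),
              col2 = replicate(4, rnorm(10), simplify = FALSE))

# list() wraps the reduced vector so each group yields exactly one element
out <- tib %>%
  group_by(grpvar) %>%
  summarise(aggr = list(reduce(col2, `+`)))

# The first element is the element-wise sum of the first two vectors
all.equal(out$aggr[[1]], tib$col2[[1]] + tib$col2[[2]])
```

Without list(), summarise() receives a length-10 vector per group and spreads it over ten rows, which is the long output shown in the question.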

summarize to vector output

Let's say I have the following (simplified) tibble containing a group and values in vectors:
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
# A tibble: 5 x 2
group values
<fct> <list>
1 A <int [3]>
2 A <int [3]>
3 B <int [3]>
4 B <int [3]>
5 B <int [3]>
tb_vec[[1,2]]
[1] 1 3 2
I would like to summarize the values vectors per group by summing them (vectorized) and tried the following:
tb_vec %>% group_by(group) %>%
summarize(vec_sum = colSums(purrr::reduce(values, rbind)))
Error: Column vec_sum must be length 1 (a summary value), not 3
The error surprises me, because tibbles (the output format) can contain vectors as well.
My expected output would be the following summarized tibble:
# A tibble: 2 x 2
group vec_sum
<fct> <list>
1 A <dbl [3]>
2 B <dbl [3]>
Is there a tidyverse solution to accommodate the vector output of summarize? I want to avoid splitting the tibble, because then I lose the factor.
You just need to add list(.) within summarise in your solution, in order to be able to have a column with 2 elements, where each element is a vector of 3 values:
library(tidyverse)
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
tb_vec %>%
group_by(group) %>%
summarize(vec_sum = list(colSums(purrr::reduce(values, rbind)))) -> res
res$vec_sum
# [[1]]
# [1] 2 4 6
#
# [[2]]
# [1] 6 5 7
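As a possible variation (not part of the original answer), the element-wise sum can also be written with Reduce() and `+`, which avoids building a matrix first. The sketch below uses hand-written values instead of sample() so that the result does not depend on the random number generator; the name tb is made up for this example.

```r
library(dplyr)
library(tibble)

# Hand-written stand-in for tb_vec, so the sums are easy to verify
tb <- tibble(group = factor(c("A", "A", "B", "B", "B")),
             values = list(c(1L, 3L, 2L), c(3L, 1L, 2L),
                           c(2L, 1L, 3L), c(1L, 2L, 3L), c(3L, 2L, 1L)))

# Reduce() adds the vectors of each group element-wise;
# list() keeps the summed vector as a single list element per group
res <- tb %>%
  group_by(group) %>%
  summarise(vec_sum = list(Reduce(`+`, values)))

res$vec_sum
#> [[1]]
#> [1] 4 4 4
#>
#> [[2]]
#> [1] 6 5 7
```

One difference worth noting: for a group with a single row, reduce(values, rbind) returns a plain vector and colSums() then fails because it needs at least two dimensions, whereas Reduce(`+`, values) simply returns that one vector.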

purrr: using %in% with a list-column

I have a column of question responses and a column of possible correct_answers. I'd like to create a third (logical) column (correct) to show whether a response matches one of the possible correct answers.
I think I may need to use a purrr function but I'm not sure how to use one of the map functions with %in%, for example.
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(purrr)
data <- tibble(
response = c('a', 'b', 'c'),
correct_answers = rep(list(c('a', 'b')), 3)
)
# works but correct answers specified manually
data %>%
mutate(correct = response %in% c('a', 'b'))
#> # A tibble: 3 x 3
#> response correct_answers correct
#> <chr> <list> <lgl>
#> 1 a <chr [2]> TRUE
#> 2 b <chr [2]> TRUE
#> 3 c <chr [2]> FALSE
# doesn't work
data %>%
mutate(correct = response %in% correct_answers)
#> # A tibble: 3 x 3
#> response correct_answers correct
#> <chr> <list> <lgl>
#> 1 a <chr [2]> FALSE
#> 2 b <chr [2]> FALSE
#> 3 c <chr [2]> FALSE
Created on 2018-11-05 by the reprex package (v0.2.1)
%in% doesn't check nested elements inside a list. Use mapply (base R) or map2 (purrr) to loop through the two columns in parallel and check each pair:
data %>% mutate(correct = mapply(function (res, ans) res %in% ans, response, correct_answers))
# A tibble: 3 x 3
# response correct_answers correct
# <chr> <list> <lgl>
#1 a <chr [2]> TRUE
#2 b <chr [2]> TRUE
#3 c <chr [2]> FALSE
Use map2_lgl:
library(purrr)
data %>% mutate(correct = map2_lgl(response, correct_answers, ~ .x %in% .y))
# A tibble: 3 x 3
# response correct_answers correct
# <chr> <list> <lgl>
#1 a <chr [2]> TRUE
#2 b <chr [2]> TRUE
#3 c <chr [2]> FALSE
Or, as @thelatemail commented, both can be simplified:
data %>% mutate(correct = mapply(`%in%`, response, correct_answers))
data %>% mutate(correct = map2_lgl(response, correct_answers, `%in%`))
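To make the row-wise pairing more visible, here is a small sketch in which each row has a different set of correct answers. The data is made up for illustration and is not from the question.

```r
library(purrr)

response <- c("a", "b", "c")
# Hypothetical per-row answer sets, one list element per response
correct_answers <- list(c("a", "b"), "c", c("c", "d"))

# map2_lgl() pairs response[i] with correct_answers[[i]]
correct_purrr <- map2_lgl(response, correct_answers, `%in%`)

# Base R equivalent; USE.NAMES = FALSE stops mapply() from naming
# the result after the character vector in its first argument
correct_base <- mapply(`%in%`, response, correct_answers, USE.NAMES = FALSE)

correct_purrr
#> [1]  TRUE FALSE  TRUE
```

Each response is checked only against its own row's answers, so "b" is FALSE here even though "b" appears in the first row's answer set.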

How to calculate length of vector within a list column (nested)

I have the following code
library(tidyverse)
dat <- iris %>%
group_by(Species) %>%
summarise(summary = list(fivenum(Petal.Width)))
dat
#> # A tibble: 3 x 2
#> Species summary
#> <fct> <list>
#> 1 setosa <dbl [5]>
#> 2 versicolor <dbl [5]>
#> 3 virginica <dbl [5]>
Basically I used the Iris data, grouped it by Species and then calculated fivenum().
What I want to do is simply calculate the length of each summary value.
This is what I have tried, but it doesn't produce what I expect:
dat %>%
mutate(nof_values = length(summary))
# A tibble: 3 x 3
# Species summary nof_values
# <fct> <list> <int>
#1 setosa <dbl [5]> 3
#2 versicolor <dbl [5]> 3
#3 virginica <dbl [5]> 3
The nof_values should all be equal to 5. What's the right way to do it?
We can use lengths to calculate the length of each element of a list column
library(tidyverse)
dat %>%
mutate(nof_values = lengths(summary))
# Species summary nof_values
# <fct> <list> <int>
#1 setosa <dbl [5]> 5
#2 versicolor <dbl [5]> 5
#3 virginica <dbl [5]> 5
whose equivalent in base R is
dat$nof_values <- lengths(dat$summary)
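Side note: length is different from lengths. length returns the number of elements in the list, whereas lengths returns the length of each element.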
length(dat$summary)
#[1] 3
lengths(dat$summary)
#[1] 5 5 5
You can use the map_int function from the purrr package (which is part of the tidyverse)
dat <- iris %>%
group_by(Species) %>%
summarise(summary = list(fivenum(Petal.Width))) %>%
mutate(nof_values = map_int(summary, length))

Resources