Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr select, similar to dropping a single column?
library(dplyr)
df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list('bob', 'cratchit', 'rules!', 'and', 'tiny tim too')
)
df %>%
  select_if(-is.list)
Error in -is.list : invalid argument to unary operator
The following seems to be a workable workaround, but I wanted to know whether it can be done with select_if.
df %>%
  select(-which(purrr::map(df, class) == 'list'))

Use Negate
df %>%
  select_if(Negate(is.list))
# A tibble: 5 x 2
  a         b
  <chr> <int>
1 A         1
2 B         2
3 C         3
4 D         4
5 E         5
There is also purrr::negate that would give the same result.
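For completeness, a minimal sketch of that purrr variant (same output; assumes purrr is attached):
library(purrr)
df %>%
  select_if(negate(is.list))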

We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
#   a         b
#   <chr> <int>
# 1 A         1
# 2 B         2
# 3 C         3
# 4 D         4
# 5 E         5
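On newer dplyr (1.0.0 and later), where select_if is superseded, the same drop can be sketched with select() plus where():
library(dplyr)
df %>%
  select(!where(is.list))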

Related

Find rows that occur only once, in two datasets

I have data as follows:
library(data.table)
datA <- fread("A B C
1 1 1
2 2 2")
datB <- fread("A B C
1 1 1
2 2 2
3 3 3")
I want to figure out which rows occur only once (here the row 3 3 3, because all the other rows occur more than once).
I tried:
dat <- rbind(datA, datB)
unique(dat)
!duplicated(dat)
I also tried
setDT(dat)[,if(.N ==1) .SD,]
But that is NULL.
How should I do this?
You can use fsetdiff (with all = TRUE, so duplicated rows are counted rather than collapsed):
rbind.data.frame(fsetdiff(datA, datB, all = TRUE),
                 fsetdiff(datB, datA, all = TRUE))
In general, this is called an anti_join:
library(dplyr)
bind_rows(anti_join(datA, datB),
          anti_join(datB, datA))
   A B C
1: 4 4 4
2: 3 3 3
Data: I added a row in datA to show how to keep rows from both data sets (a simple anti-join does not work otherwise):
library(data.table)
datA <- fread("A B C
1 1 1
2 2 2
4 4 4")
datB <- fread("A B C
1 1 1
2 2 2
3 3 3")
One possible solution is a data.table anti-join:
library(data.table)
datB[!datA, on=c("A", "B", "C")]
       A     B     C
   <int> <int> <int>
1:     3     3     3
Or (if you are interested in the symmetric difference)
funion(fsetdiff(datB, datA), fsetdiff(datA, datB))
       A     B     C
   <int> <int> <int>
1:     3     3     3
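As an aside, recent dplyr (1.1.0 and later, if available) also ships a symdiff() set operation, so the symmetric difference can be sketched as a one-liner:
library(dplyr)
symdiff(datB, datA)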
Another dplyr option: filter the rows that appear only once, using group_by and filter:
library(data.table)
library(dplyr)
datA %>%
  bind_rows(., datB) %>%
  group_by(across(everything())) %>%
  filter(n() == 1)
#> # A tibble: 1 × 3
#> # Groups:   A, B, C [1]
#>       A     B     C
#>   <int> <int> <int>
#> 1     3     3     3
Created on 2022-11-09 with reprex v2.0.2
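As a footnote, the OP's data.table attempt returned NULL only because it had no grouping, so .N was the full row count. Counting by all columns makes the same idea work; a minimal sketch with the original datA/datB:
library(data.table)
dat <- rbind(datA, datB)
dat[, .N, by = names(dat)][N == 1, !"N"]
#    A B C
# 1: 3 3 3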

Apply function to a row in a data.frame using dplyr

In base R I would do the following:
d <- data.frame(a = 1:4, b = 4:1, c = 2:5)
apply(d, 1, which.max)
With dplyr I could do the following:
library(dplyr)
d %>% mutate(u = purrr::pmap_int(list(a, b, c), function(...) which.max(c(...))))
If there's another column in d I need to specify it, but I want this to work with an arbitrary number of columns.
Conceptually, I’d like something like
pmap_int(list(everything()), ...)
pmap_int(list(.), ...)
But this obviously does not work. How would I solve this canonically with dplyr?
We just need the data to be specified as ., since a data.frame is a list with columns as list elements. If we wrap it in list(.), it becomes a nested list.
library(dplyr)
library(purrr)
d %>%
  mutate(u = pmap_int(., ~ which.max(c(...))))
# a b c u
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3
Or we can use cur_data():
d %>%
  mutate(u = pmap_int(cur_data(), ~ which.max(c(...))))
Or, if we want to use everything(), place it inside select(), since list(everything()) doesn't specify the data from which everything should be selected:
d %>%
  mutate(u = pmap_int(select(., everything()), ~ which.max(c(...))))
Or using rowwise:
d %>%
  rowwise() %>%
  mutate(u = which.max(cur_data())) %>%
  ungroup()
# A tibble: 4 x 4
#      a     b     c     u
#  <int> <int> <int> <int>
#1     1     4     2     2
#2     2     3     3     2
#3     3     2     4     3
#4     4     1     5     3
Or, more efficiently, with max.col:
max.col(d, 'first')
#[1] 2 2 3 3
Or with collapse
library(collapse)
dapply(d, which.max, MARGIN = 1)
#[1] 2 2 3 3
which can be included in dplyr as:
d %>%
  mutate(u = max.col(cur_data(), 'first'))
Here are some data.table options
setDT(d)[, u := which.max(unlist(.SD)), 1:nrow(d)]
or
setDT(d)[, u := max.col(.SD, "first")]
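On dplyr 1.1.0 and later, where cur_data() is deprecated in favour of pick(), the same idea can be sketched as:
library(dplyr)
d %>%
  mutate(u = max.col(pick(everything()), 'first'))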

Keep last non-missing observation for all variables by group

My data has multiple columns, and some of those columns have missing values in different rows. I would like to group (collapse) the data by the variable "g", keeping the last non-missing observation of each variable.
Input:
d <- data.table(a=c(1,NA,3,4),b=c(1,2,3,4),c=c(NA,NA,'c',NA),g=c(1,1,2,2))
Desired output
d_g <- data.table(a=c(1,4),b=c(2,4),c=c(NA,'c'),g=c(1,2))
A data.table (or dplyr) solution is preferred here.
Note: this is related to this question, but the main answers there seem to produce unnecessary NAs in some groups.
Using data.table :
library(data.table)
d[, lapply(.SD, function(x) last(na.omit(x))), g]
#   g a b    c
#1: 1 1 2 <NA>
#2: 2 4 4    c
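One caveat worth hedging: if last() here resolves to data.table::last(), an all-NA group returns a zero-length vector rather than the <NA> shown. A sketch that handles the all-NA case explicitly (x[NA_integer_] yields an NA of the column's type):
d[, lapply(.SD, function(x) {
  y <- na.omit(x)
  if (length(y)) y[length(y)] else x[NA_integer_]
}), g]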
One option using dplyr could be:
d %>%
  group_by(g) %>%
  summarise(across(everything(), ~ if(all(is.na(.))) NA else last(na.omit(.))))
      g     a     b c
  <dbl> <dbl> <dbl> <chr>
1     1     1     2 <NA>
2     2     4     4 c
In base R, aggregate could be used.
aggregate(. ~ g, d, function(x) tail(x[!is.na(x)], 1), na.action = NULL)
#  g a b c
#1 1 1 2
#2 2 4 4 c
(The blank c cell in group 1 comes from the zero-length result that tail(x[!is.na(x)], 1) returns for an all-NA group.)
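Another option, sketched here as an alternative: fill the NAs downward within each group with tidyr::fill() and keep each group's last row.
library(dplyr)
library(tidyr)
d %>%
  group_by(g) %>%
  fill(a, b, c, .direction = "down") %>%
  slice_tail(n = 1) %>%
  ungroup()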

How to Pass column name in group by from a variable

I want to extract the max value of a column for each group of a data frame.
I have the column name stored in a variable, which I want to pass to group_by, but it fails.
I have below data frame:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
The column names are stored in the variables below:
columnselected <- c("Value")
groupbycol <- c("Gene")
My code is:
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code gives an error. The expected output is:
Gene Value
A 12
B 6
C 1
D 4
You need to convert the column names to symbols using sym and then evaluate them with !!:
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
#  Gene  Value
#  <fct> <int>
#1 A        12
#2 B         6
#3 C         1
#4 D         4
We can use group_by_at, without needing an additional package:
library(dplyr)
df %>%
  group_by_at(groupbycol) %>%
  top_n(1, !! as.name(columnselected))
# A tibble: 4 x 2
# Groups:   Gene [4]
#  Gene  Value
#  <fct> <int>
#1 A        12
#2 B         6
#3 C         1
#4 D         4
NOTE: There would be many dupes for this post :=)
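On current dplyr, the same thing can be sketched with the .data pronoun and slice_max (both available since dplyr 1.0.0, where slice_max superseded top_n):
library(dplyr)
df %>%
  group_by(.data[[groupbycol]]) %>%
  slice_max(.data[[columnselected]], n = 1)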

Create long data format based on strings of sequences defined by colons and concatenated vectors

I have data where the IDs of each observation are numbers stored as character sequences, usually in the form X:Y but sometimes as concatenated vectors. I would like to tidy the data so each observation has its own row, so that I can then use a join function to add more descriptive IDs. Normally I'd use the gather() function from tidyr for this, but I'm having trouble unpacking the IDs since they are characters.
The data looks like this:
example <- data_frame(x = LETTERS[1:3],
                      y = c("Condition 1", "Condition 2", "Condition 3"),
                      z = c("1:3", "4:6", "c(7,9,10)"))
example
# A tibble: 3 × 3
  x     y           z
  <chr> <chr>       <chr>
1 A     Condition 1 1:3
2 B     Condition 2 4:6
3 C     Condition 3 c(7,9,10)
However these do not work and all produce NA:
as.numeric("1:3")
as.integer("1:3")
as.numeric("c(7,9,10)")
as.integer("c(7,9,10)")
There must be a simple way to do this, but I thought one long way might be to extract the numbers and store them as a list first. For the X:Y IDs I could do this by splitting the string at ":" and then creating a sequence from one number to the other, like so:
example[1:2,] %>%
  separate(z, c("a", "b"), sep = ":") %>%
  mutate(a = as.numeric(a), b = as.numeric(b), new = list(seq(a, b)))
Error in eval(expr, envir, enclos) : 'from' must be of length 1
However this did not work.
What I'm aiming for looks like this:
# A tibble: 9 × 3
  x     y               z
  <chr> <chr>       <dbl>
1 A     Condition 1     1
2 A     Condition 1     2
3 A     Condition 1     3
4 B     Condition 2     4
5 B     Condition 2     5
6 B     Condition 2     6
7 C     Condition 3     7
8 C     Condition 3     9
9 C     Condition 3    10
What is the simplest way of achieving it?
We can use the tidyverse:
library(tidyverse)
example %>%
group_by(x) %>%
mutate(z = list(eval(parse(text=z)))) %>%
unnest
# x y z
# <chr> <chr> <dbl>
#1 A Condition 1 1
#2 A Condition 1 2
#3 A Condition 1 3
#4 B Condition 2 4
#5 B Condition 2 5
#6 B Condition 2 6
#7 C Condition 3 7
#8 C Condition 3 9
#9 C Condition 3 10
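Since eval(parse()) runs arbitrary R code from the strings, here is a hedged, parse-free sketch covering just the two known patterns (X:Y ranges and c(...) lists); expand_ids is a hypothetical helper name:
library(tidyverse)
# hypothetical helper: pull the digits out of the string,
# then expand a:b ranges into full sequences
expand_ids <- function(s) {
  nums <- as.numeric(str_extract_all(s, "[0-9]+")[[1]])
  if (str_detect(s, ":")) seq(nums[1], nums[2]) else nums
}
example %>%
  mutate(z = map(z, expand_ids)) %>%
  unnest(z)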
