Assert that a combination of columns is unique (using `assertr`) - r

I am looking for a "tidy" and concise way to make sure that a combination of columns is unique in a tibble using assertr.
So far, this is the best I could come up with:
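library(dplyr)
library(assertr)
# Collapse each row into a single string so rows can be compared as a whole
PasteRows <- function(df) {
  apply(df, 1, paste, collapse = '')
}
tib <- tibble(a = c(1, 1, 3), b = c('a', 'b', 'b'))
tib %>%
  assert_rows(PasteRows, is_uniq, a, b)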
... but I first have to define PasteRows. Also, I am not sure if apply has a performance penalty, because it converts the tibble into a matrix.
How can I improve and shorten this?
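One possible sketch in base R, without assertr: `anyDuplicated()` applied to the selected columns returns 0 when every combination is unique, so it needs neither a helper function nor a row-wise `apply()`. The `cols` vector and the plain data frame are assumptions made for illustration, not part of the question.

```r
# Same data as in the question, built as a base data frame.
tib <- data.frame(a = c(1, 1, 3), b = c("a", "b", "b"))

# Columns whose combination should be unique (hypothetical name).
cols <- c("a", "b")

# anyDuplicated() returns the index of the first duplicated row, or 0 if none.
first_dup <- anyDuplicated(tib[cols])

# Fails with an error if any combination of a and b occurs more than once.
stopifnot(first_dup == 0)
```

This checks the rows as they are, rather than pasting values into strings, so it sidesteps collisions such as `paste("ab", "c")` versus `paste("a", "bc")` with an empty separator.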

Related

R; How to select() columns that contains() strings where the string is any element of a list

I want to subset a data frame by selecting columns depending on whether the column name contains a certain string. The strings to look for are stored in a separate vector.
This is what I have now:
colstrings <- c('A', 'B', 'C')
for (i in colstrings){
df <- df %>% select(-contains(i))
}
However, it feels like this shouldn't be done with a for loop. Any suggestions on how to make this code shorter?
Here's an answer adapted from a previous SO post:
library(dplyr)
df <-
tibble(
ash = c(1, 2),
bet = c(2, 3),
can = c(3, 4)
)
df
substr_list <- c("sh", "an")
df %>%
select(matches(paste(substr_list, collapse="|")))
See more here: select columns based on multiple strings with dplyr contains()
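A base R sketch of the same idea, in case a literal (non-regex) match is wanted: `matches()` interprets the pasted string as a regular expression, whereas `grepl(..., fixed = TRUE)` treats each substring literally. The names `hits`, `keep`, `selected`, and `dropped` are introduced here for illustration only.

```r
df <- data.frame(ash = c(1, 2), bet = c(2, 3), can = c(3, 4))
substr_list <- c("sh", "an")

# One logical vector per substring; fixed = TRUE treats each string literally.
hits <- lapply(substr_list, grepl, x = names(df), fixed = TRUE)

# A column is flagged if any of the substrings occurs in its name.
keep <- Reduce(`|`, hits)

selected <- df[, keep, drop = FALSE]   # columns whose names contain a substring
dropped  <- df[, !keep, drop = FALSE]  # the question's -contains() variant
```

`drop = FALSE` keeps the result a data frame even when only one column remains.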

Puzzling behavior of arrange_at with .funs

I don't understand what dplyr::arrange_at is doing when passed a .funs parameter.
For example suppose we create a data frame Z:
library(dplyr)
Z <- expand.grid(A = c(1:2, NA), B = c(1:2, NA))
and suppose we want to sort it by A (with NA first) and then by B. Then we might try target or current below:
all_equal(
target = Z %>% arrange(!is.na(A), A, B),
current = Z %>% arrange_at(.vars = c("A", "A", "B"),
.funs = list(function(x)!is.na(x), identity, identity)),
ignore_row_order = FALSE)
which returns "Same row values, but different order". The first version (target) is what I expected, but the second (current) is puzzling. What I expected is that each function in .funs would be applied to the corresponding column in .vars and then the result would be sorted much like arrange().
Ultimately I want to sort in a very dynamic way and hence want the full power of arrange_at.
Update
As @akrun says in a comment, the _at family of dplyr functions create the Cartesian product of all .vars and all .funs. Therefore what I need is an arrange_parallel_at function that expects .vars and .funs to have the same length and where the kth function is evaluated on the column whose name is the kth entry in .vars (and only that column). Then all these columns in the same order become the argument to arrange.
Below is an answer to my own question. While it works (and in particular does what I asked in the Update) it's almost certainly sub-optimal since I suspect there are better solutions based on tidy eval.
Therefore I'm reluctant to accept it.
library(tidyverse)
arrange_parallel_at <- function(.data, .vars, .funs) {
  stopifnot(length(.vars) == length(.funs), is.character(.vars), is.list(.funs))
  tmp_cols <- paste0('.tmp', seq_along(.vars))
  # Apply the i-th function to the i-th column and store it as a temporary sort key
  for (i in seq_along(.vars)) {
    .data[[tmp_cols[i]]] <- .funs[[i]](.data[[.vars[i]]])
  }
  # Sort by the temporary keys, then drop them again
  .data <- arrange_at(.data, tmp_cols)
  .data[tmp_cols] <- NULL
  .data
}
Below is some test code.
tibble(A = c(1:2, NA)) %>%
crossing(B = c(1:2, NA)) ->
Z
na_first <- function(x) !is.na(x)
all_equal(
Z %>% arrange(!is.na(A), A, !is.na(B), desc(B)),
Z %>% arrange_parallel_at( c( 'A', 'A', 'B', 'B'),
list(na_first, identity, na_first, desc)),
ignore_row_order = FALSE) # returns TRUE
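For comparison, a minimal base R sketch of the same "parallel" pairing, assuming numeric columns so that negation can stand in for `desc()`. The helper `negate` and the variable names are hypothetical; this is not the dplyr implementation, only an illustration of applying the kth function to the kth column and sorting on the results.

```r
Z <- expand.grid(A = c(1:2, NA), B = c(1:2, NA))

na_first <- function(x) !is.na(x)   # FALSE (i.e. NA) sorts before TRUE
negate   <- function(x) -x          # stand-in for desc() on numeric columns

vars <- c("A", "A", "B", "B")
funs <- list(na_first, identity, na_first, negate)

# Apply the k-th function to the k-th named column only (no Cartesian product).
keys <- Map(function(f, v) f(Z[[v]]), funs, vars)

# order() uses the keys as successive tie-breakers, like arrange().
sorted <- Z[do.call(order, unname(keys)), ]
```

Because the sort keys live in a separate list, no temporary columns have to be added to and removed from the data.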

Renaming Several Columns in Data Frames Stored in a List Simultaneously

I have the following list, which contains several dataframes that all have the same column names:
my_list <- list(df1 = data.frame(A = c(1:3), B = c(4:6), C = c(7:9)),
df2 = data.frame(A = c(1:4), B = c(5:8), C = c(9:12)),
df3 = data.frame(A = c(1:5), B = c(6:10), C = c(11:15)))
Is there an efficient way to rename all of the column As in each data frame in the list simultaneously using base R functions?
I was thinking that something like
names(lapply(my_list, `[[`, "A")) <- "new_name"
may work, but I think I'm off track - the lapply function returns an object that might not work for what I'm trying to do.
Thanks!
A few more base options:
# rename first column name
lapply(my_list, function(x) setNames(x, replace(names(x), 1, "new_name_for_A")))
# rename column named "A"
lapply(my_list, function(x) setNames(x, replace(names(x), names(x) == "A", "new_name_for_A")))
# lowly for loop
for (i in seq_along(my_list)) {
names(my_list[[i]])[names(my_list[[i]]) == "A"] = "new_name_for_A"
}
We can use map to loop over the list and rename the column named 'A' to 'new_name' with rename_at
library(purrr)
library(dplyr)
map(my_list, ~ .x %>%
rename_at(vars("A"), ~ "new_name"))
Or with base R by making use of anonymous function call
lapply(my_list, function(x) {names(x)[names(x) == "A"] <- "new_name"; x})
How about
new.names = c('New', 'B', 'C')
lapply(my_list, `names<-`, new.names)
For the added example in your edit, you would simply change this to
new.names = sub('B', 'New', names(my_list[[1]]))

Filter Data Frame by Matching Multiple String in Multiple Columns

I have been trying, unsuccessfully, to filter my data frame with dplyr and grep, using a list of strings across multiple columns of the data frame. I assumed this would be a simple task, but either nobody has asked this specific question or it's not as easy as I originally thought.
For the following data frame...
foo <- data.frame(var.1 = c('a', 'b', 'c'),
var.2 = c('b', 'd', 'e'),
var.3 = c('c', 'f', 'g'),
var.4 = c('z', 'a', 'b'))
... I would like to filter row-wise to find the rows that contain all three values a, b, and c. The answer I'm after would return only row 1, since it contains a, b, and c. It would not return rows 2 and 3: they each contain two of the three values, but not all three in the same row.
I'm running into issues because grep only accepts a vector, or one column at a time, when I really just care about finding strings across many columns in the same row.
I've also used dplyr to filter using %in%, but it just returns when any of the variables are present:
foo %>%
filter(var.1 %in% c('a', 'b', 'c') |
var.2 %in% c('a', 'b', 'c') |
var.3 %in% c('a', 'b', 'c'))
Thanks for any and all help and please, let me know if you need any clarification!
Here's an approach in base R where we check if the elements of foo are equal to "a", "b", or "c" successively, add the Booleans and check if the sum of those Booleans for each row is greater than or equal to 3
Reduce("+", lapply(c("a", "b", "c"), function(x) rowSums(foo == x) > 0)) >=3
#[1] TRUE FALSE FALSE
Timings
foo = matrix(sample(letters[1:26], 1e7, replace = TRUE), ncol = 5)
system.time(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20)
# user system elapsed
# 3.26 0.48 3.79
system.time(apply(foo, 1, function(x) all(letters[1:20] %in% x)))
# user system elapsed
# 18.86 0.00 19.19
identical(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20,
apply(foo, 1, function(x) all(letters[1:20] %in% x)))
#[1] TRUE
Your problem arises from trying to apply "tidyverse" solutions to data that isn't tidy. Here's the tidy solution, which uses melt to make your data tidy. See how much tidier this solution is?
> library(reshape2)
> rows = foo %>%
mutate(id=1:nrow(foo)) %>%
melt(id="id") %>%
filter(value=="a" | value=="b" | value=="c") %>%
group_by(id) %>%
summarize(N=n()) %>%
filter(N==3) %>%
select(id) %>%
unlist
Warning message:
attributes are not identical across measure variables; they will be dropped
That gives you a vector of matching row indexes, which you can then subset your original data frame with:
> foo[rows,]
var.1 var.2 var.3 var.4
1 a b c z

Using list's elements in loops in r (example: setDT)

I have multiple data frames and I want to perform the same action on all of them, for example transforming them all into data.tables (this is just an example; I want to apply other functions too).
A simple example can be (df1=df2=df3, without loss of generality here)
df1 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
My approach was: (i) to create a list of the data frames (list.df), (ii) to create a list of how they should be called afterwards (list.dt) and (iii) to loop into those two lists:
list.df:
list.df<-vector('list',3)
for(j in 1:3){
name <- paste('df',j,sep='')
list.df[j] <- name
}
list.dt:
list.dt<-vector('list',3)
for(j in 1:3){
name <- paste('dt',j,sep='')
list.dt[j] <- name
}
Loop (to make all data frames into data tables):
for(i in 1:3){
name<-list.dt[i]
assign(unlist(name), setDT(list.df[i]))
}
I am definitely doing something wrong, as the result is three data tables with 1 variable and 1 observation each (exactly the name stored in list.df[i]).
I've tried to unlist list.df[i], thinking R would then recognize it as an entire data frame and not only as a string:
for(i in 1:3){
name<-list.dt[i]
assign(unlist(name), setDT(unlist(list.df[i])))
}
But I get the error message:
Error in setDT(unlist(list.df[i])) :
Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table'
Any suggestions?
You can just put all the data into one dataframe. Then, if you want to iterate through dataframes, use dplyr::do or, preferably, other dplyr functions
library(dplyr)
data =
list(df1 = df1, df2 = df2, df3 = df3) %>%
bind_rows(.id = "source") %>%
group_by(source)
Change your last snippet to this:
for(i in 1:3){
name <- list.dt[i]
assign(unlist(name), setDT(get(list.df[[i]])))
}
# Alternative to using lists
list.df <- paste0("df", 1:3)
# For loop that works with the length of the input 'list'/vector
# Creates the 'dt' objects on the fly
for(i in seq_along(list.df)){
assign(paste0("dt", i), setDT(get(list.df[i])))
}
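If the results do not have to live in separate global objects, a hedged alternative is to keep the data frames in one named list and apply each step with `lapply()`. The sketch below uses base R only; the two small data frames and the `total` column are made up for illustration, and any function (including `setDT`) could take the place of the anonymous one.

```r
df1 <- data.frame(var1 = 1:3, var2 = c(1, 2, 2))
df2 <- data.frame(var1 = 4:6, var2 = c(1, 1, 2))

# mget() looks the objects up by name and returns a named list of data frames.
frames <- mget(paste0("df", 1:2))

# Apply the same (hypothetical) step to every data frame in the list.
frames <- lapply(frames, function(d) transform(d, total = var1 + var2))

# Individual results remain reachable by name.
frames$df2
```

This avoids `assign()` and `get()` entirely, because the list names carry the information that the separate name lists were holding.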
Using data.table (which deserves far more advertising):
a) If you need all your data.frames converted to data.tables, then, as was already suggested in the comments by @A5C1D2H2I1M1N2O1R2T1, iterate over your data.frames with setDT
library(data.table)
lapply(mget(paste0("df", 1:3)), setDT)
# or, if you wish to type them one by one:
lapply(list(df1, df2, df3), setDT)
class(df1) # check if coercion took place
# [1] "data.table" "data.frame"
b) If you need to bind your data.frames by rows, then use data.table::rbindlist
data <- rbindlist(mget(paste0("df", 1:3)), idcol = TRUE)
# or, if you wish to type them one by one:
data <- rbindlist(list(df1 = df1, df2 = df2, df3 = df3), idcol = TRUE)
Side note: If you like chaining/piping with the magrittr package (which you see almost always in combination with dplyr syntax), then it goes like:
library(data.table)
library(magrittr)
# for a)
mget(paste0("df", 1:3)) %>% lapply(setDT)
# for b)
data <- mget(paste0("df", 1:3)) %>% rbindlist(idcol = TRUE)
