tidyr::unnest() with different column types

Since updating to tidyr 1.0.0, I get an error when unnesting a list of data frames.
The error occurs because some data frames in the list contain a column consisting entirely of NA values, which R stores as logical, while other data frames contain the same column with some character values, which R stores as character.
Earlier versions of tidyr handled the differing column types without problems (at least I didn't get this error when running the script).
Can I solve this issue from inside tidyr::unnest()?
Reproducible example:
library(tidyr)
a <- tibble(
value = rnorm(3),
char_vec = c(NA, "A", NA))
b <- tibble(
value = rnorm(2),
char_vec = c(NA, "B"))
c <- tibble(
value = rnorm(3),
char_vec = c(NA, NA, NA))
tibble(
file = list(a, b, c)) %>%
unnest(cols = c(file))
#> No common type for `..1$file$char_vec` <character> and `..3$file$char_vec`
#> <logical>.
Created on 2019-10-11 by the reprex package (v0.3.0)
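The type mismatch described above can be reproduced with base R alone. The following is only a small illustration of how R types the vectors from the reprex; the variable names are made up for the example.

```r
# A vector that contains only NA is logical by default
all_na <- c(NA, NA, NA)
class(all_na)     # "logical"

# As soon as one element is a string, the whole vector is character
some_chr <- c(NA, "A", NA)
class(some_chr)   # "character"

# NA_character_ is a missing value that is already typed as character
typed_na <- rep(NA_character_, 3)
class(typed_na)   # "character"
```

If the data frames are built by hand, using NA_character_ for the missing values avoids the mismatch in the first place.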

You can convert all relevant columns to character one step before unnesting.
library(dplyr)
library(purrr)
tibble(
  file = list(a, b, c)) %>%
  mutate(file = map(file, ~ mutate(.x, char_vec = as.character(char_vec)))) %>%
  unnest(cols = c(file))
If several columns need the same treatment, you can do:
tibble(
  file = list(a, b, c)) %>%
  mutate(file = map(file, ~ mutate_at(.x, vars(starts_with("char")), ~as.character(.)))) %>%
  unnest(cols = c(file))
Data for the latter example:
a <- tibble(
value = rnorm(3),
char_vec = c(NA, "A", NA),
char_vec2 = c(NA, NA, NA))
b <- tibble(
value = rnorm(2),
char_vec = c(NA, "B"),
char_vec2 = c("C", "A"))
c <- tibble(
value = rnorm(3),
char_vec = c(NA, NA, NA),
char_vec2 = c("B", NA, "A"))
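If the affected column names are not known in advance, one option is to convert every all-NA logical column before unnesting. The helper below is a hypothetical sketch in base R, not a tidyr feature, and it assumes that any logical column containing only NA is really meant to be character.

```r
# Hypothetical helper (not part of tidyr): turn every all-NA logical column
# of a data frame into a character column and leave other columns untouched.
na_cols_to_character <- function(df) {
  is_all_na_lgl <- vapply(
    df,
    function(col) is.logical(col) && all(is.na(col)),
    logical(1)
  )
  for (nm in names(df)[is_all_na_lgl]) {
    df[[nm]] <- as.character(df[[nm]])
  }
  df
}

# Example with a plain data.frame
x <- data.frame(value = c(1.5, 2.5, 3.5), char_vec = c(NA, NA, NA))
fixed <- na_cols_to_character(x)
sapply(fixed, class)
```

In the pipeline above it could be applied with mutate(file = map(file, na_cols_to_character)). The assumption would be wrong for a column that is genuinely logical or numeric but happens to be empty in one file.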

Related

Overriding data.table key order causes incorrect merge results

In the following example I use a dplyr::arrange on a data.table with a key. This overrides the sort on that column:
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setkey(x, "a")
# lose order on datatable key
x <- dplyr::arrange(x, b)
y <- data.table(a = sample(1000:1100), f = c(letters, NA), g = c("AA", "BB", NA, NA, NA, NA))
setkey(y, "a")
res <- merge(x, y, by = c("a"), all.x = TRUE)
# try merge with key removed
res2 <- merge(x %>% as.data.frame() %>% as.data.table(), y, by = c("a"), all.x = TRUE)
# merge results are inconsistent
identical(res, res2)
I can see that if I ordered with x <- x[order(b)], I would maintain the sort on the key and the results would be consistent.
I am not sure why I cannot use dplyr::arrange and what relationship the sort key has with the merge. Any insight would be appreciated.

The problem is that dplyr::arrange(x, b) reorders the rows but does not remove the sorted attribute (the key) from your data.table, whereas x <- x[order(b)] and setorder(x, "b") do.
The data.table way would be to use setorder in the first place, e.g.
library(data.table)
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setorder(x, "b", "a", na.last=TRUE)
Wrong join results on data.tables that have a key but are not actually sorted by it are a known bug (see also #5361 in the data.table bug tracker).
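To guard against a stale key, you can check whether the rows really are in key order before joining. The helper below is a hypothetical sketch, not a data.table function; it simulates the stale key with setattr() rather than dplyr::arrange(), so it only needs data.table.

```r
library(data.table)

# Hypothetical helper: TRUE when a data.table has no key, or when its rows
# really are sorted by the key columns.
key_matches_row_order <- function(dt) {
  k <- key(dt)
  if (is.null(k)) return(TRUE)
  cols <- lapply(k, function(nm) dt[[nm]])
  # radix ordering with NAs first is intended to mirror data.table's key order
  ord <- do.call(order, c(cols, list(method = "radix", na.last = FALSE)))
  identical(ord, seq_len(nrow(dt)))
}

x <- data.table(a = c(3L, 1L, 2L), b = c("u", "v", "w"))
setkey(x, a)                       # sorts x by a and records the key
ok_after_setkey <- key_matches_row_order(x)

# Simulate a stale key: mark an unsorted table as keyed without sorting it
y <- data.table(a = c(3L, 1L, 2L), b = c("u", "v", "w"))
setattr(y, "sorted", "a")
ok_with_stale_key <- key_matches_row_order(y)

# Remove the key so it no longer claims an order that the rows do not have
setkey(y, NULL)
```

When the check fails, dropping the key with setkey(y, NULL) or re-sorting with setkey() or setorder() puts the table back into a consistent state.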

Specify nesting columns by using character vector in tidyr::complete

How can I define the columns I want to use for nesting in the tidyr::complete function?
one_of or as.name are not working.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
df <- tibble(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
char_vec <- c("item_id", "item_name")
df %>% complete(group, nesting(char_vec))
Error: `by` can't contain join column `char_vec` which is missing from RHS
Run `rlang::last_error()` to see where the error occurred.

An up-to-date solution with dplyr version 1.0.6 is !!!syms():
library(dplyr)
library(tidyr)
df %>%
  complete(group, nesting(!!!syms(char_vec)))

Ok, I figured it out.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
df <- tibble(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
char_vec <- c("item_id", "item_name")
# Note: as.symbol() uses only the first element of char_vec, so this nests item_id alone
df %>% complete(group, nesting(!!as.symbol(char_vec)))
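The difference between the two answers comes down to how many symbols are created. The sketch below uses the data from the question and assumes rlang is installed for syms(); the intermediate variable names are only for illustration.

```r
library(tidyr)   # complete(), nesting(); also re-exports tibble() and %>%
library(rlang)   # syms()

char_vec <- c("item_id", "item_name")

# as.symbol() looks only at the first element of a character vector
first_only <- as.symbol(char_vec)

# syms() returns one symbol per element, which !!! then splices into the call
all_syms <- syms(char_vec)

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)

# Nest on both columns: every group is combined with each observed
# (item_id, item_name) pair
completed <- df %>% complete(group, nesting(!!!syms(char_vec)))
```

With this data, completed should have four rows: two groups times the two observed item pairs.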

set names with magrittr where both name and value are variable of data.frame?

Let's say I have the following data:
> data.frame(value = 1:2, name = c("a", "b"))
  value name
1     1    a
2     2    b
Goal:
Can I give it as input to the pipe operator and "send" it to setNames (or magrittr::set_names)?
What I have tried:
library(magrittr)
data.frame(value = 1:2, name = c("a", "b")) %>%
setNames(object = .$value, nm = .$name)
That doesn't work, I guess, because the pipe wants to hand over the whole data.frame and use it as the first argument. That got me interested in whether I can skip this behaviour and use two subsets instead.
(So that data.frame(value = 1:2, name = c("a", "b")) %>% is fixed and not replaced by a variable.)
Desired output:
How it would look without the pipe operator:
> a <- data.frame(value = 1:2, name = c("a", "b"))
> setNames(object = a$value, nm = a$name)
a b
1 2

For this case, we can simply wrap the call in {}, which stops the pipe from inserting the left-hand side as the first argument:
library(dplyr)
data.frame(value = 1:2, name = c("a", "b")) %>%
{ setNames(object = .$value, nm = .$name)}
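With the tidyverse, there is also deframe(), which turns a two-column data frame into a named vector (names from the first column, values from the second, hence the select(2:1)):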
library(tibble)
data.frame(value = 1:2, name = c("a", "b")) %>%
select(2:1) %>%
deframe
#a b
#1 2
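Two other ways to expose the columns to setNames() are base R's with() and magrittr's exposition pipe %$%. This is a minimal sketch; the data frame is stored in a variable named dat only to keep the example short, and stringsAsFactors = FALSE is set so it behaves the same on older R versions.

```r
library(magrittr)

dat <- data.frame(value = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)

# with() receives the data frame as its first argument and evaluates
# the expression with the columns in scope
via_with <- dat %>% with(setNames(value, name))

# magrittr's exposition pipe %$% exposes the columns to the right-hand side
via_expose <- dat %$% setNames(value, name)
```

Both forms also work when the data.frame(...) call is written directly on the left-hand side of the pipe.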

Re-cast column types to a data frame which has already been read

I have a data frame df1 (with many columns) which I want to join with another data frame df2 that is supposed to have the same column types. However, for some reason when written and re-read they have acquired different types.
When I want to join these data frames, due to some of the columns which do not have the same type (but should have had), it refuses to join.
How can I force R to re-cast the classes of df2 to those of df1?
For example:
df1 <- data.frame(x = c(NA, NA, "3", "3"), y = c(NA, NA, "a", "b"))
df1_class <- sapply(df1, class) #first, determine the different classes of df1
df2 <- data.frame(x = c(NA, NA, 3, 3), y = c(NA, NA, "a", "b"))
# df2 is equal to df1 but has a different class in column x
# now cast column x of df2 as class "character" - but do this for all
# columns together because there are many columns....

Using the purrr package, the following will update df2 to match the classes of df1:
df1_class <- sapply(df1, class)
df2 <-
purrr::map2_df(
df2,
df1_class,
~ do.call(paste0('as.', .y), list(.x))
)
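The same idea can be written in base R without purrr. The helper below is a hypothetical sketch; it assumes that an as.<class>() function exists for the first class of every template column (as.character, as.numeric, as.integer and so on) and matches columns by name.

```r
# Hypothetical helper: coerce each column of `target` to the class of the
# same-named column in `template`, using the matching as.<class>() function.
cast_like <- function(target, template) {
  for (nm in intersect(names(target), names(template))) {
    cls <- class(template[[nm]])[1]
    converter <- match.fun(paste0("as.", cls))
    target[[nm]] <- converter(target[[nm]])
  }
  target
}

df1 <- data.frame(x = c(NA, NA, "3", "3"), y = c(NA, NA, "a", "b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(NA, NA, 3, 3), y = c(NA, NA, "a", "b"),
                  stringsAsFactors = FALSE)

df2 <- cast_like(df2, df1)
sapply(df2, class)
```

Matching by name rather than by position means the two data frames may list their columns in a different order.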

You could change the ?mode of each column using "mode<-" via Map.
df2[] <- Map(f = "mode<-", x = df2, value = df1_class)
df2
# A tibble: 4 x 3
#   x     y         z
#   <chr> <chr> <dbl>
# 1 NA    NA        2
# 2 NA    NA        2
# 3 3     a         2
# 4 3     b         2
The data, extended by a third column for illustration:
library(tibble)
df1 <- tibble(x = c(NA, NA, "3", "3"), y = c(NA, NA, "a", "b"), z = 1)
df2 <- tibble(x = c(NA, NA, 3, 3), y = c(NA, NA, "a", "b"), z = 2L)
(df1_class <- sapply(df1, class))
#           x           y           z
# "character" "character"   "numeric"

discard last or first group after group_by by referencing group directly

Data:
df <- data.frame(A = c(rep(letters[1], 3), rep(letters[2], 3), rep(letters[3], 3)),
                 B = rnorm(9),
                 stringsAsFactors = FALSE)
I don't know if there's a way to do this, but I'd like to know whether I can discard the last group by directly referencing the groups after group_by(A), to get this desired output:
  A          B
1 a -0.4900863
2 a  1.4106594
3 a -0.2245738
4 b -0.2124955
5 b  0.6963785
6 b  0.9151825
I AM INTERESTED IN SOLUTIONS THAT DIRECTLY WORK AT THE GROUPS LEVEL
For instance, something like:
df %>% group_by(A) %>% head(.Groups,-1)
or
df %>% group_by(A) %>% Groups[1:2]
I AM NOT INTERESTED IN THE FOLLOWING KINDS OF SOLUTIONS
df %>% filter(!(A == max(A)))
df %>% filter(!(A %in% max(A)))
OR OTHER SOLUTIONS THAT DO NOT REQUIRE group_by TO WORK

I assumed that we are not supposed to know in advance how many groups there might be. Try using the labels attribute:
all_but_last <- df %>% group_by(A) %>% attr("labels") %>% head(-1)
  A
1 a
2 b
... to extract the desired rows:
> df %>% filter(A %in% all_but_last[[1]])
  A            B
1 a -0.799026840
2 a -0.712402478
3 a  0.685320094
4 b  0.971492883
5 b -0.001479117
6 b -0.817766296
It helps to use dput to look at the actual contents of a "grouped_df":
dput( df %>% group_by(A) )
structure(list(A = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), B = c(-0.799026840397576, -0.712402478350695, 0.685320094252465,
0.971492883452258, -0.00147911717469651, -0.817766295631676,
-1.00112471676908, 1.88145909873596, -0.305560178617216)), .Names = c("A",
"B"), row.names = c(NA, -9L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = "A", drop = TRUE, indices = list(
0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 3L), biggest_group_size = 3L,
labels = structure(list(
A = c("a", "b", "c")),
row.names = c(NA, -3L),
class = "data.frame",
vars = "A", drop = TRUE, .Names = "A"))
Note that labels is a data.frame, so you could have applied unlist to the result that became all_but_last; you then would not have needed to extract its values with "[[".
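The labels attribute shown above belongs to older dplyr releases. As far as I know, from dplyr 0.8.0 onward the grouping data is stored differently and group_keys() is the exported way to read it. The following is a minimal sketch of the same approach under that assumption, with made-up values for B.

```r
library(dplyr)

df <- data.frame(A = rep(c("a", "b", "c"), each = 3),
                 B = 1:9,
                 stringsAsFactors = FALSE)

grouped <- df %>% group_by(A)

# One row per group, in group order
keys <- group_keys(grouped)

# Drop the last group by working on the keys, then keep only matching rows
all_but_last <- head(keys, -1)
result <- grouped %>% semi_join(all_but_last, by = "A")
```

Using head(keys, 1) or tail(keys, -1) instead selects the first group or drops it, without knowing the number of groups beforehand.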

Perhaps this helps
library(dplyr)
df %>%
group_by(A) %>%
group_indices(.) %in% 1:2 %>%
df[.,]
Or with data.table
library(data.table)
setDT(df)[, grp := .GRP, A][grp %in% unique(grp)[1:2]][, grp := NULL][]
