Convert tidy dataframe to dataframe with list of lists - r

Here is a tidy dataframe
df_tidy <- tibble(
company = c("A", "B", "A", "B", "A", "B"),
line_data = c(1, 2, 2, 2, 1, 1)
)
The format required is:
df_ll <- structure(list(company = c("A", "B"), line_data = list(list(c(1, 2, 1)), list(c(2, 2, 1)))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
How do I transform df_tidy into df_ll?

Group by 'company' and summarise 'line_data' into a nested list:
df_ll2 <- df_tidy %>%
group_by(company) %>%
summarise(line_data = list(list(line_data)))
Checking against the expected output:
all.equal(df_ll, df_ll2)
[1] TRUE
Another option is to use nest() or nest_by() and then convert the nested tibble to a list:
df_tidy %>%
nest_by(company, .key = "line_data") %>%
mutate(line_data = list(list(unlist(line_data)))) %>%
ungroup

You can also use the plyr package. Note that dlply() returns a named list keyed by company rather than a tibble:
library(plyr)
df_ll <- dlply(df_tidy, .(company), c)

library(tidyverse)
# Create a tibble comprised of: df_ll2 => tibble
df_ll2 <- tibble(
# Uniquify the company vector: company => character vector
company = unique(df_tidy$company),
# Split the data into a list by the company vector, coerce each
# element to an unnamed list:
line_data = unname(
lapply(
with(df_tidy, split(line_data, company)),
list
)
)
)
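One caveat with the base R approach above: split() orders its pieces by the sorted group labels, while unique() keeps the order of first appearance. In the question's data both orders agree, but they can diverge. The following is a minimal sketch, using hypothetical data in which "B" appears first, that takes the company labels from names() of the split result so the two columns stay aligned.

```r
library(tibble)

# Hypothetical data (not from the question): "B" appears before "A"
df_tidy <- tibble(
  company = c("B", "A", "B", "A"),
  line_data = c(1, 2, 3, 4)
)

# split() returns one numeric vector per company, ordered by sorted label
pieces <- split(df_tidy$line_data, df_tidy$company)

# Take the labels from the split result itself so that they match
# the order of the list elements
df_ll <- tibble(
  company = names(pieces),
  line_data = unname(lapply(pieces, list))
)

str(df_ll$line_data)
```

Using names(pieces) instead of unique(df_tidy$company) avoids pairing a company with another company's data when the input is not already sorted.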

Related

Transform tidy dataframe into form for sparklines (dataui)

I have some tidy data and need to transform it into a format that works for building small graphs (sparklines) using the dataui package. You can see the required dataframe format in the code example below, df_sparkline.
My tidy data covers about 30 companies over one year, fewer than 10,000 rows in total. What is the best way to transform df_tidy into df_sparkline? Clarity matters more to me than raw speed.
library("dataui")
library("reactable")
library("tidyverse")
df_tidy <- tibble(
company = c("A", "B", "A", "B", "A", "B"),
line_data = c(1, 2, 2, 2, 1, 1),
date = c(as.Date("2021-01-01"), as.Date("2021-01-01"), as.Date("2021-01-02"), as.Date("2021-01-02"), as.Date("2021-01-03"), as.Date("2021-01-03"))
)
df_sparkline <- structure(list(company = c("A", "B"), line_data = list(list(c(1, 2, 1)), list(c(2, 2, 1)))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
rt1 <- reactable(
df_sparkline,
columns = list(
line_data = colDef(
cell = function(value, index) {
dui_sparkline(
data = value[[1]],
height = 80,
components = dui_sparklineseries(curve = "linear") # https://github.com/williaster/data-ui/tree/master/packages/sparkline#series
)
}
)
)
)
rt1
All you need is group_by() and summarise():
df_sparkline2 = df_tidy %>%
group_by(company) %>%
summarise(line_data=list(list(line_data)))
waldo::compare(df_sparkline, df_sparkline2)
# ✔ No differences
The key here is to call list() inside summarise().
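To see why the nesting level matters, here is a small sketch comparing a single list() with list(list()) inside summarise(). It reuses the question's df_tidy without the date column; the names one_level and two_level are only illustrative.

```r
library(dplyr)

df_tidy <- tibble(
  company = c("A", "B", "A", "B", "A", "B"),
  line_data = c(1, 2, 2, 2, 1, 1)
)

# A single list() yields a list-column whose elements are numeric vectors
one_level <- df_tidy %>%
  group_by(company) %>%
  summarise(line_data = list(line_data))

# list(list()) wraps each vector in one more list, which is the shape
# of df_sparkline in the question
two_level <- df_tidy %>%
  group_by(company) %>%
  summarise(line_data = list(list(line_data)))

str(one_level$line_data[[1]])
str(two_level$line_data[[1]])
```

The cell function in the reactable example reads value[[1]], which is why the extra level of nesting is needed there.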

Pass a concatenated string as column name in dplyr::summarise

I am trying to run dplyr summarize iteratively, using concatenated strings as column names:
Category=c("a","a","b","b","b","c","c","c")
A1=c(1,2,3,4,3,2,1,2)
A2=c(10,11,12,13,14,15,16,17)
tt=cbind(Category,A1,A2)
tdat=data.frame(tt)
colnames(tdat)=c("Category","M1","M2")
ll=matrix(1:2,nrow=2)
for(i in 1:nrow(ll)) {
Aone=tdat %>% group_by(Category) %>%
summarize(Msum=sum(paste("M",i,sep="")))
}
I end up with the following error:
x invalid 'type' (character) of argument
ℹ Input Msum is sum(paste("M", i, sep = "")).
ℹ The error occurred in group 1: Category = "A".
Run rlang::last_error() to see where the error occurred.
The goal is to apply arithmetic functions inside dplyr's summarize iteratively, but the concatenated string is not recognized as a column name.
To pass a string as a column name, convert it to a symbol with rlang::sym() and unquote it with !!:
library(dplyr)
Aone <- vector('list', nrow(ll))
for(i in seq_len(nrow(ll))) {
Aone[[i]] <- tdat %>%
group_by(Category) %>%
summarize(Msum = sum(!! rlang::sym(paste("M", i, sep=""))))
}
Or, assuming the columns are named 'M-1', 'M-2', etc., the same approach works:
Aone <- vector('list', 2)
for(i in seq_along(Aone)) {
Aone[[i]] <- tdat %>%
group_by(Category) %>%
summarise(Msum = sum(!! rlang::sym(paste("M-", i, sep=""))),
.groups = 'drop')
}
NOTE: The purpose of ll was not clear in the original post. Here, we create a list whose length equals the number of 'M-' columns and assign each result to the corresponding list element while looping over the sequence of that list.
data
# A1 and A2 are the numeric vectors defined in the question
tdat <- data.frame(Category, M1 = A1, M2 = A2)
tdat <- structure(list(Category = c("A", "A", "A", "A", "B", "B", "B",
"B"), `M-1` = c(1, 2, 3, 4, 3, 2, 1, 2), `M-2` = c(10, 11, 12,
13, 14, 15, 16, 17)), class = "data.frame", row.names = c(NA,
-8L))
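As an alternative to converting strings to symbols, dplyr's .data pronoun accepts a string directly, and across() with all_of() can summarise several named columns in one call. The sketch below assumes dplyr 1.0.0 or later and rebuilds the question's data with numeric columns; the names cols, sums and wide are illustrative only.

```r
library(dplyr)

# The question's data, built directly as a data frame so that
# M1 and M2 stay numeric (cbind() would turn them into character)
tdat <- data.frame(
  Category = c("a", "a", "b", "b", "b", "c", "c", "c"),
  M1 = c(1, 2, 3, 4, 3, 2, 1, 2),
  M2 = c(10, 11, 12, 13, 14, 15, 16, 17)
)

cols <- paste0("M", 1:2)

# One summary per column, looked up by its string name via .data[[ ]]
sums <- lapply(cols, function(col) {
  tdat %>%
    group_by(Category) %>%
    summarise(Msum = sum(.data[[col]]), .groups = "drop")
})
names(sums) <- cols

# All columns at once, selected from the character vector with all_of()
wide <- tdat %>%
  group_by(Category) %>%
  summarise(across(all_of(cols), sum), .groups = "drop")

sums$M1
wide
```

The loop version returns one tibble per column, while the across() version returns a single tibble with one summed column per name.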

Specify nesting columns by using character vector in tidyr::complete

How can I define the columns I want to use for nesting in the tidyr::complete function?
one_of or as.name are not working.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)  # complete() and nesting() come from tidyr
df <- tibble(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
char_vec <- c("item_id", "item_name")
df %>% complete(group, nesting(char_vec))
Error: `by` can't contain join column `char_vec` which is missing from RHS
Run `rlang::last_error()` to see where the error occurred.
An up-to-date solution (as of dplyr 1.0.6) is !!!syms():
library(dplyr)
library(tidyr)
df %>%
complete(group, nesting(!!!syms(char_vec)))
Ok, I figured it out.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
df <- tibble(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
char_vec <- c("item_id", "item_name")
# Caution: as.symbol() uses only the first element of char_vec,
# so this nests on item_id alone
df %>% complete(group, nesting(!!as.symbol(char_vec)))
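The difference between the two answers can be checked directly. The sketch below shows that as.symbol() keeps only the first element of a character vector, whereas syms() returns one symbol per element and !!! splices them all into the call. It assumes the rlang, dplyr and tidyr packages are installed; the name completed is illustrative.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
char_vec <- c("item_id", "item_name")

# as.symbol() silently drops everything after the first element
first_only <- as.symbol(char_vec)

# syms() builds a list of symbols; !!! splices them in as separate arguments
spliced <- rlang::expr(nesting(!!!rlang::syms(char_vec)))
print(first_only)
print(spliced)

# Each group is completed with every observed (item_id, item_name) pair
completed <- df %>%
  complete(group, nesting(!!!rlang::syms(char_vec)))
completed
```

With two groups and two observed item pairs, the completed table has four rows, one of which is filled with NA for value1 and value2.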

tidyr::unnest() with different column types

Since updating to tidyr 1.0.0, I get an error when unnesting a list of data frames.
The error arises because, in some data frames in the list, a column contains only NA values and is therefore stored as logical, while in others the same column contains some character values and is stored as character.
Earlier versions of tidyr handled the differing column types without problems (at least I didn't get this error when running the script).
Can I solve this issue from inside tidyr::unnest()?
Reproducible example:
library(tidyr)
a <- tibble(
value = rnorm(3),
char_vec = c(NA, "A", NA))
b <- tibble(
value = rnorm(2),
char_vec = c(NA, "B"))
c <- tibble(
value = rnorm(3),
char_vec = c(NA, NA, NA))
tibble(
file = list(a, b, c)) %>%
unnest(cols = c(file))
#> No common type for `..1$file$char_vec` <character> and `..3$file$char_vec`
#> <logical>.
Created on 2019-10-11 by the reprex package (v0.3.0)
You can convert all relevant columns to character one step before unnesting.
library(dplyr)  # mutate()
library(purrr)  # map()
tibble(
file = list(a, b, c)) %>%
mutate(file = map(file, ~ mutate(.x, char_vec = as.character(char_vec)))) %>%
unnest(cols = c(file))
If there are several columns that need treatment you can do:
tibble(
file = list(a, b, c)) %>%
mutate(file = map(file, ~ mutate_at(.x, vars(starts_with("char")), ~as.character(.)))) %>%
unnest(cols = c(file))
Data for the latter example:
a <- tibble(
value = rnorm(3),
char_vec = c(NA, "A", NA),
char_vec2 = c(NA, NA, NA))
b <- tibble(
value = rnorm(2),
char_vec = c(NA, "B"),
char_vec2 = c("C", "A"))
c <- tibble(
value = rnorm(3),
char_vec = c(NA, NA, NA),
char_vec2 = c("B", NA, "A"))
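In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The sketch below applies the same idea with across(), assuming the columns to convert share the "char" prefix as in the data above. Fixed numbers replace rnorm() so the result is reproducible; they are not values from the original post.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(tidyr)

# Same shapes as the data above, with fixed values instead of rnorm()
a <- tibble(value = c(0.1, 0.2, 0.3),
            char_vec = c(NA, "A", NA),
            char_vec2 = c(NA, NA, NA))
b <- tibble(value = c(0.4, 0.5),
            char_vec = c(NA, "B"),
            char_vec2 = c("C", "A"))
c <- tibble(value = c(0.6, 0.7, 0.8),
            char_vec = c(NA, NA, NA),
            char_vec2 = c("B", NA, "A"))

result <- tibble(file = list(a, b, c)) %>%
  # Coerce every "char*" column to character in each nested data frame,
  # so all-NA (logical) columns match their character counterparts
  mutate(file = map(file, ~ mutate(.x, across(starts_with("char"), as.character)))) %>%
  unnest(cols = c(file))

result
```

After the conversion every nested data frame has the same column types, so unnest() can combine them without a type conflict.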

discard last or first group after group_by by referencing group directly

Data:
df <- data.frame(A=c(rep(letters[1],3),rep(letters[2],3),rep(letters[3],3)),
B=rnorm(9),
stringsAsFactors=FALSE)
Is there a way to discard the last group by directly referencing the groups after group_by(A), so as to get the following desired output?
A B
1 a -0.4900863
2 a 1.4106594
3 a -0.2245738
4 b -0.2124955
5 b 0.6963785
6 b 0.9151825
I AM INTERESTED IN SOLUTIONS THAT DIRECTLY WORK AT THE GROUPS LEVEL
For instance, something like:
df %>% group_by(A) %>% head(.Groups,-1)
or
df %>% group_by(A) %>% Groups[1:2]
I AM NOT INTERESTED IN THE FOLLOWING KINDS OF SOLUTIONS
df %>% filter(!(A == max(A)))
df %>% filter(!(A %in% max(A)))
OR OTHER SOLUTIONS THAT DO NOT REQUIRE group_by TO WORK
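I assume we are not supposed to know in advance how many groups there are. Try using the labels attribute (this applies to the older dplyr version used here; from dplyr 0.8.0 onward the group data is stored in the "groups" attribute and is available via group_keys()):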
all_but_last <- df %>% group_by(A) %>% attr("labels") %>% head(-1)
A
1 a
2 b
... to extract the desired rows:
> df %>% filter(A %in% all_but_last[[1]])
A B
1 a -0.799026840
2 a -0.712402478
3 a 0.685320094
4 b 0.971492883
5 b -0.001479117
6 b -0.817766296
It helps to use dput to look at the actual contents of a "grouped_df":
dput( df %>% group_by(A) )
structure(list(A = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), B = c(-0.799026840397576, -0.712402478350695, 0.685320094252465,
0.971492883452258, -0.00147911717469651, -0.817766295631676,
-1.00112471676908, 1.88145909873596, -0.305560178617216)), .Names = c("A",
"B"), row.names = c(NA, -9L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = "A", drop = TRUE, indices = list(
0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 3L), biggest_group_size = 3L,
labels = structure(list(
A = c("a", "b", "c")),
row.names = c(NA, -3L),
class = "data.frame",
vars = "A", drop = TRUE, .Names = "A"))
Note that the labels are a data.frame, so you could apply unlist to the result stored in all_but_last; you would then not need to extract its values with "[[".
Perhaps this helps
library(dplyr)
df %>%
group_by(A) %>%
group_indices(.) %in% 1:2 %>%
df[.,]
Or with data.table
library(data.table)
setDT(df)[, grp := .GRP, A][grp %in% unique(grp)[1:2]][, grp := NULL][]
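For recent dplyr versions (1.0.0 or later is assumed here), the group keys are available through group_keys(), and cur_group_id() gives the index of the current group inside a grouped verb. The following sketch drops the last group using each in turn, without knowing the number of groups in advance. The data uses fixed values in place of rnorm(), and the names grouped, keep, result and result2 are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(A = rep(c("a", "b", "c"), each = 3),
                 B = 1:9,
                 stringsAsFactors = FALSE)

grouped <- df %>% group_by(A)

# Option 1: take the group keys, drop the last one, and keep matching rows
keep <- grouped %>% group_keys() %>% head(-1)
result <- grouped %>%
  semi_join(keep, by = "A") %>%
  ungroup()

# Option 2: filter on the group index; cur_group_id() is evaluated per group
result2 <- grouped %>%
  filter(cur_group_id() < n_groups(grouped)) %>%
  ungroup()

result
```

Replacing head(-1) with tail(-1), or the comparison with cur_group_id() > 1, would drop the first group instead.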
