Create a list column with ranges set by existing columns - r

I am trying to create a list column within a data frame, specifying the range using existing columns, something like:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 c(1, 2, 3, 4, 5, 6)
2 2 5 c(2, 3, 4, 5)
3 3 4 c(3, 4)
The catch is that it would need to be created as follows:
df %>% mutate(C = c(A:B))
I have a dataset containing integers entered as ranges, i.e. someone has entered "7 to 26". I've separated the ranges into two columns A & B, or "start" and "end", and was hoping to use c(A:B) to create a list, but using dplyr I keep getting:
Warning messages:
1: In a:b : numerical expression has 3 elements: only the first used
2: In a:b : numerical expression has 3 elements: only the first used
Which gives:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 list(1:6)
2 2 5 list(1:6)
3 3 4 list(1:6)
Has anyone had a similar issue and found a workaround?

You can use map2() from purrr, which calls seq() once for each pair of A and B:
library(dplyr)
df %>%
mutate(C = purrr::map2(A, B, seq))
Or call rowwise() before mutate(), so that A:B is evaluated one row at a time:
df %>%
rowwise() %>%
mutate(C = list(A:B)) %>%
ungroup()
Both methods give
# # A tibble: 3 x 3
# A B C
# <int> <int> <list>
# 1 1 6 <int [6]>
# 2 2 5 <int [4]>
# 3 3 4 <int [2]>
Data
df <- tibble::tibble(A = 1:3, B = 6:4)
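For comparison, the same per-row ranges can be built in base R without dplyr or purrr. This is only a sketch, assuming A and B are plain vectors of equal length; the helper name make_range is made up for illustration. It also shows why the original attempt fails: `:` only looks at the first element of each side, so it has to be called once per pair.

```r
# Start and end points, matching the example data above.
A <- 1:3
B <- 6:4

# Hypothetical helper: build one range from a single start/end pair.
make_range <- function(from, to) from:to

# Map() calls make_range() once per (A[i], B[i]) pair and returns a list.
ranges <- Map(make_range, A, B)

lengths(ranges)
#> [1] 6 4 2
```

The resulting list has one element per row, so it can be stored as a list column.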

Related

tidy way to remove duplicates per row

I've seen different solutions to remove rowwise duplicates with base R solutions, e.g. R - find all duplicates in row and replace.
However, I'm wondering if there's a more tidy way. I tried several ways of using across or a combination of rowwise with c_across, but can't get it to work.
df <- data.frame(x = c(1, 2, 3, 4),
y = c(1, 3, 4, 5),
z = c(2, 3, 5, 6))
Expected output:
x y z
1 1 NA 2
2 2 3 NA
3 3 4 5
4 4 5 6
My ideas so far (not working):
df |>
mutate(apply(across(everything()), 1, function(x) replace(x, duplicated(x), NA)))
df |>
mutate(apply(across(everything()), 1, function(x) {x[duplicated(x)] <- NA}))
I got somewhat along the way by creating a list column that contains the column positions of the duplicates (but it also has the ugly warning about the usual "new names" problem). I'm unsure how to proceed from there (if that's a promising way), i.e. I guess it requires some form of purrr magic?
df |>
rowwise() |>
mutate(test = list(duplicated(c_across(everything())))) |>
unnest_wider(test)
# A tibble: 4 × 6
x y z ...1 ...2 ...3
<dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 1 1 2 FALSE TRUE FALSE
2 2 3 3 FALSE FALSE TRUE
3 3 4 5 FALSE FALSE FALSE
4 4 5 6 FALSE FALSE FALSE
Maybe you want something like this:
library(dplyr)
df %>%
rowwise() %>%
do(data.frame(replace(., duplicated(unlist(.)), NA)))
Output:
# A tibble: 4 × 3
# Rowwise:
x y z
<dbl> <dbl> <dbl>
1 1 NA 2
2 2 3 NA
3 3 4 5
4 4 5 6
I wouldn't say it's tidy, but it is a solution using map:
library(tidyverse)
df %>%
group_nest(row_number()) %>%
pull(data) %>%
map(function(x) as.numeric(x) %>% replace(., duplicated(.), NA) %>% setNames(names(df))) %>%
bind_rows()
# # A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
# 1 1 NA 2
# 2 2 3 NA
# 3 3 4 5
# 4 4 5 6
Just for completeness, after a bit of trial and error, I also got the same result as provided by @Quinten, just in a much, much uglier way!
df |>
rowwise() |>
mutate(pos = list(which(duplicated(c_across(everything()))))) |>
mutate(across(-pos, ~ ifelse(which(names(df) == cur_column()) %in% unlist(pos), NA, .))) |>
select(-pos)
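For reference, the base R route mentioned in the question can be sketched as follows. This is not one of the tidyverse answers above, just a minimal illustration; the helper name dedup_row is made up. Note that apply() returns its per-row results as columns, so the result has to be transposed back.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(1, 3, 4, 5),
                 z = c(2, 3, 5, 6))

# Hypothetical helper: blank out values already seen earlier in the same row.
dedup_row <- function(r) replace(r, duplicated(r), NA)

# apply() over rows gives a 3 x 4 matrix (one column per original row),
# so t() restores the original 4 x 3 orientation.
res <- as.data.frame(t(apply(df, 1, dedup_row)))
res
```

This assumes all columns share one type, because apply() first converts the data frame to a matrix.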

Simplifying the list for nested data frame

Sorry, I am new to R.
I need to get a data frame ready for a JSON format, but I have trouble putting the variable back into its original format c(1,2,3,...). For example:
library(tidyr)
x<-tibble(x = 1:3, y = list(c(1,5), c(1,5,10), c(1,2,3,20)))
View(x)
This shows
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
x1<-x %>% unnest(y)
x2<-x1 %>% nest(data=c(y))
View(x2)
This shows
1 1 1 variable
2 2 1 variable
3 3 1 variable
The desired format is c(...) rather than "1 variable", so that the data is ready for the JSON data file:
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
Please help
x$y is a list-column of doubles, whereas x2$data is a list-column of tibbles.
Use map and unlist to turn the tibbles into doubles.
library(tidyverse)
x2 %>%
mutate(data = map(data, unlist))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
Alternatively, instead of nesting, you can use summarise.
x1 %>%
group_by(x) %>%
summarise(data = list(y))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
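The regrouping step can also be sketched in base R with split(), which collects the y values belonging to each x into one vector. This is a minimal illustration that assumes the unnested data is a plain data frame with the same columns as x1 above.

```r
# Long-format data, equivalent to x1 after unnest(y).
x1 <- data.frame(x = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
                 y = c(1, 5, 1, 5, 10, 1, 2, 3, 20))

# split() returns one numeric vector per value of x;
# unname() drops the "1", "2", "3" labels it adds.
y_list <- unname(split(x1$y, x1$x))

lengths(y_list)
#> [1] 2 3 4
```

Each element of y_list is a plain double vector, i.e. the same shape as the original list column.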

When I don't know column names in data.frame, when I use dplyr mutate function

I would like to know how I can use the dplyr mutate function when I don't know the column names. Here is my example code:
library(dplyr)
w<-c(2,3,4)
x<-c(1,2,7)
y<-c(1,5,4)
z<-c(3,2,6)
df <- data.frame(w,x,y,z)
df %>% rowwise() %>% mutate(minimum = min(x,y,z))
Source: local data frame [3 x 5]
Groups: <by row>
# A tibble: 3 x 5
w x y z minimum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 3 1
2 3 2 5 2 2
3 4 7 4 6 4
This code finds the row-wise minimum. Yes, "df %>% rowwise() %>% mutate(minimum = min(x,y,z))" works because I typed the column names x, y, z. But let's assume that I have a really big data.frame with several hundred columns, and I don't know all of the column names. Or I have multiple data.frames that all have different column names, and I just want to find the minimum value from the 10th column to the 20th column in each row of each data.frame.
In the example data.frame above, let's assume that I don't know the column names, but I just want the minimum value from the 2nd column to the 4th column in each row. Of course, this doesn't work, because df[,2] is the whole column rather than the current row's value, so every row gets the overall minimum:
df %>% rowwise() %>% mutate(minimum=min(df[,2],df[,3], df[,4]))
Source: local data frame [3 x 5]
Groups: <by row>
# A tibble: 3 x 5
w x y z minimum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 3 1
2 3 2 5 2 1
3 4 7 4 6 1
The two attempts below also don't work.
df %>% rowwise() %>% mutate(average=min(colnames(df)[2], colnames(df)[3], colnames(df)[4]))
df %>% rowwise() %>% mutate(average=min(noquote(colnames(df)[2]), noquote(colnames(df)[3]), noquote(colnames(df)[4])))
I know that I can get the minimum value using apply or another method when I don't know the column names. But I would like to know whether the dplyr mutate function can do that without known column names.
Thank you,
With apply:
library(dplyr)
library(purrr)
df %>%
mutate(minimum = apply(df[,2:4], 1, min))
or with pmap_dbl (plain pmap would return a list column rather than a numeric one):
df %>%
mutate(minimum = pmap_dbl(.[2:4], min))
Also with by_row from purrrlyr:
df %>%
purrrlyr::by_row(~min(.[2:4]), .collate = "rows", .to = "minimum")
Output:
# tibble [3 x 5]
w x y z minimum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 3 1
2 3 2 5 2 2
3 4 7 4 6 4
A vectorized option would be pmin. Convert the column names to symbols with syms and splice them in with !!!, so that pmin is applied to the values of those columns.
library(dplyr)
df %>%
mutate(minimum = pmin(!!! rlang::syms(names(.)[2:4])))
# w x y z minimum
#1 2 1 1 3 1
#2 3 2 5 2 2
#3 4 7 4 6 4
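The same pmin idea also works without tidy evaluation. The following is a base R sketch, not part of the answer above: it selects the columns by position and passes them to pmin through do.call.

```r
w <- c(2, 3, 4)
x <- c(1, 2, 7)
y <- c(1, 5, 4)
z <- c(3, 2, 6)
df <- data.frame(w, x, y, z)

# Select columns 2 to 4 by position; the names are dropped so that a
# column name can never be mistaken for a named argument of pmin().
cols <- unname(as.list(df[2:4]))

# do.call() passes each column as a separate argument: pmin(x, y, z).
df$minimum <- do.call(pmin, cols)

df$minimum
#> [1] 1 2 4
```

Because pmin is vectorized over whole columns, no row-wise iteration is needed.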
Here is a tidyeval approach along the lines of the suggestion from aosmith. If you don't know the column names, you can make a function that accepts the desired positions as inputs and finds the column names itself. Here, rlang::syms() takes the column names as strings and turns them into symbols, and !!! unquotes and splices the symbols into the function.
library(dplyr)
w<-c(2,3,4)
x<-c(1,2,7)
y<-c(1,5,4)
z<-c(3,2,6)
df <- data.frame(w,x,y,z)
rowwise_min <- function(df, min_cols){
cols <- df[, min_cols] %>% colnames %>% rlang::syms()
df %>%
rowwise %>%
mutate(minimum = min(!!!cols))
}
rowwise_min(df, 2:4)
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#>
#> # A tibble: 3 x 5
#> w x y z minimum
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 1 3 1
#> 2 3 2 5 2 2
#> 3 4 7 4 6 4
rowwise_min(df, c(1, 3))
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#>
#> # A tibble: 3 x 5
#> w x y z minimum
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 1 3 1
#> 2 3 2 5 2 3
#> 3 4 7 4 6 4
Created on 2018-09-04 by the reprex package (v0.2.0).

How to perform a group_by with elements that are contiguous in R and dplyr

Suppose we have this tibble:
group item
x 1
x 2
x 2
y 3
z 2
x 2
x 2
z 1
I want to perform a group_by by group. However, I'd rather group only the elements that are adjacent. For example, in my case, I'd have two separate 'x' groups, each summing its own 'item' elements. The result would be something like:
group item
x 5
y 3
z 2
x 4
z 1
I know how to solve this problem using 'for' loops. However, this is not fast and doesn't seem straightforward. I'd rather use some dplyr or tidyverse function with easy logic.
This question is not a duplicate. I know there's already a question about rle on SO, but my question is more general than that. I am asking for general solutions.
If you want to use only base R + tidyverse, this code exactly replicates your desired results:
library(dplyr)
mydf <- tibble(group = c("x", "x", "x", "y", "z", "x", "x", "z"),
item = c(1, 2, 2, 3, 2, 2, 2, 1))
mydf
# A tibble: 8 × 2
group item
<chr> <dbl>
1 x 1
2 x 2
3 x 2
4 y 3
5 z 2
6 x 2
7 x 2
8 z 1
runs <- rle(mydf$group)
mydf %>%
mutate(run_id = rep(seq_along(runs$lengths), runs$lengths)) %>%
group_by(group, run_id) %>%
summarise(item = sum(item)) %>%
arrange(run_id) %>%
select(-run_id)
Source: local data frame [5 x 2]
Groups: group [3]
group item
<chr> <dbl>
1 x 5
2 y 3
3 z 2
4 x 4
5 z 1
You can construct group identifiers with rle, but the easier route is to just use data.table::rleid, which does it for you:
library(dplyr)
df %>%
group_by(group,
group_run = data.table::rleid(group)) %>%
summarise_all(sum)
#> # A tibble: 5 x 3
#> # Groups: group [?]
#> group group_run item
#> <fctr> <int> <int>
#> 1 x 1 5
#> 2 x 4 4
#> 3 y 2 3
#> 4 z 3 2
#> 5 z 5 1
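The run identifier itself needs neither rle nor data.table. As a base R sketch (an alternative to the answers above, not taken from them), a new run starts wherever a value differs from the one before it, and a cumulative sum over those change points numbers the runs:

```r
group <- c("x", "x", "x", "y", "z", "x", "x", "z")
item  <- c(1, 2, 2, 3, 2, 2, 2, 1)

# TRUE at the first element and wherever the value changes; cumsum()
# then turns the change points into run numbers: 1 1 1 2 3 4 4 5.
run_id <- cumsum(c(TRUE, group[-1] != group[-length(group)]))

# One row per run: the run's label and the sum of its items.
res <- data.frame(group = group[!duplicated(run_id)],
                  item  = as.vector(tapply(item, run_id, sum)))
res
```

This sketch assumes group contains no NA values. dplyr 1.1.0 and later also ship consecutive_id(), which serves the same purpose inside group_by().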

How to separate a column list of fixed size X to X different columns?

I have a tibble with one column being a list column, always having two numeric values named a and b (e.g. as a result of using purrr::map with a function that returns a list), say:
df <- tibble(x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))
df
# A tibble: 3 × 2
x y
<int> <list>
1 1 <list [2]>
2 2 <list [2]>
3 3 <list [2]>
How do I separate the list column y into two columns a and b, and get:
df_res <- tibble(x = 1:3, a = c(1,3,5), b = c(2,4,6))
df_res
# A tibble: 3 × 3
x a b
<int> <dbl> <dbl>
1 1 1 2
2 2 3 4
3 3 5 6
Looking for something like tidyr::separate to deal with a list instead of a string.
Using dplyr (current release: 0.7.0):
bind_cols(df[1], bind_rows(df$y))
# # A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
# 1 1 1 2
# 2 2 3 4
# 3 3 5 6
edit based on OP's comment:
To embed this in a pipe and in case you have many non-list columns, we can try:
df %>% select(-y) %>% bind_cols(bind_rows(df$y))
We could also make use of map_df from purrr:
library(tidyverse)
df %>%
summarise(x = list(x), new = list(map_df(.$y, bind_rows))) %>%
unnest
# A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
#1 1 1 2
#2 2 3 4
#3 3 5 6
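A base R version of the same reshaping can be sketched as below. It is only an illustration and assumes, as in the question, that every element of the list column has the same names (a and b here).

```r
x <- 1:3
y <- list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6))

# Turn each inner list into a one-row data frame, then stack the rows.
wide <- do.call(rbind, lapply(y, as.data.frame))

# Put the ordinary column back in front of the new a and b columns.
res <- cbind(x = x, wide)
res
```

Recent tidyr versions also provide unnest_wider(), which may be the closest match to "separate for list columns".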
