leading / lagging observations together with nested data


I want to generate a new variable based on
1. a nested vector of the current observation
2. values from the current and other observations.
Here's my example:
D <- tibble(team = c(101, 101, 101, 102, 102, 102),
            id = c(1, 2, 3, 1, 2, 3),
            x = c(3, 7, 5, 1, 4, 10),
            y = list(c(5,5,5), c(8,5,2), c(6,2,7), c(3,9,3), c(8,3,4), c(4,4,7)))
I want to create a new variable which equals
abs(y[1] - x[id==1]) + abs(y[2] - x[id==2]) + abs(y[3] - x[id==3])
This is obviously not valid syntax; it only demonstrates what I want to compute. I need to use the current as well as leading or lagging (or both) observations of x, depending on the value of id.
The expected result in this example would be z = c(4, 10, 10, 14, 14, 6)
I have tried something along the lines of group_by(team) followed by an attempt to use map() but I can't find anything promising.
What is the most elegant solution? I'd really appreciate your help!

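We can use map to loop through the list column after grouping by 'team' and then get the sum of the absolute differences between each element of that column and the group's 'x'. This assumes the rows within each team are ordered by id, so that the k-th element of y lines up with x[id == k].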
library(dplyr)
library(purrr)
D %>%
  group_by(team) %>%
  mutate(z = map_dbl(y, ~ sum(abs(.x - x))))
# A tibble: 6 x 5
# Groups: team [2]
# team id x y z
# <dbl> <dbl> <dbl> <list> <dbl>
#1 101 1 3 <dbl [3]> 4
#2 101 2 7 <dbl [3]> 10
#3 101 3 5 <dbl [3]> 10
#4 102 1 1 <dbl [3]> 14
#5 102 2 4 <dbl [3]> 14
#6 102 3 10 <dbl [3]> 6
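If the rows of a team are not guaranteed to be sorted by id, the positional match above can silently give wrong numbers. A minimal sketch of a more defensive variant follows; the shuffled data D2 and the match() lookup are my own illustration and not part of the original answer. It looks up the x that belongs to each position of y explicitly:

```r
library(dplyr)
library(purrr)

# Same values as in the question, but the rows of team 101 are shuffled
# (D2 is a hypothetical example, not from the original post)
D2 <- tibble(team = c(101, 101, 101, 102, 102, 102),
             id   = c(3, 1, 2, 1, 2, 3),
             x    = c(5, 3, 7, 1, 4, 10),
             y    = list(c(6,2,7), c(5,5,5), c(8,5,2), c(3,9,3), c(8,3,4), c(4,4,7)))

res <- D2 %>%
  group_by(team) %>%
  # match() finds, for positions 1..length(y), the row whose id equals that position,
  # so x is reordered by id before taking the differences
  mutate(z = map_dbl(y, ~ sum(abs(.x - x[match(seq_along(.x), id)])))) %>%
  ungroup()

res$z
```

Each row keeps the z it had in the question (the id 1, 2, 3 rows of team 101 still get 4, 10, 10), regardless of where the row sits in the table.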

Related

Simplifying the list for nested data frame

Sorry, I am new to R.
I need to get a data frame ready for JSON output, but I have trouble putting the variable back into its original format c(1,2,3,...). For example:
library(tidyr)
x<-tibble(x = 1:3, y = list(c(1,5), c(1,5,10), c(1,2,3,20)))
View(x)
This shows
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
x1<-x %>% unnest(y)
x2<-x1 %>% nest(data=c(y))
View(x2)
This shows
1 1 1 variable
2 2 1 variable
3 3 1 variable
The desired format is c(...) rather than "1 variable", so that the data is ready for the JSON file:
1 1 c(1, 5)
2 2 c(1, 5, 10)
3 3 c(1, 2, 3, 20)
Please help

x$y is a list-column of doubles, whereas x2$data is a list-column of tibbles.
Use map and unlist to turn the tibbles into doubles.
library(tidyverse)
x2 %>%
mutate(data = map(data, unlist))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
Alternatively, instead of nesting, you can use summarise.
x1 %>%
group_by(x) %>%
summarise(data = list(y))
#> # A tibble: 3 x 2
#> x data
#> <int> <list>
#> 1 1 <dbl [2]>
#> 2 2 <dbl [3]>
#> 3 3 <dbl [4]>
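One thing to keep in mind with unlist(): it names the elements after the column (y1, y2, ...), which may or may not matter for the JSON step. A small sketch that instead pulls the y column out of each nested tibble directly, assuming x2 is built as in the question (the name x3 is just for illustration):

```r
library(dplyr)
library(purrr)
library(tidyr)

x  <- tibble(x = 1:3, y = list(c(1, 5), c(1, 5, 10), c(1, 2, 3, 20)))
x2 <- x %>% unnest(y) %>% nest(data = c(y))

# unlist() on a one-column tibble produces a named vector
names(unlist(x2$data[[1]]))

# Extracting the column itself gives back plain, unnamed double vectors
x3 <- x2 %>%
  mutate(y = map(data, ~ .x$y)) %>%
  select(-data)

x3$y
```

With this, x3 should have the same shape as the original x.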

Create a list column with ranges set by existing columns

I am trying to create a list column within a data frame, specifying the range using existing columns, something like:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 c(1, 2, 3, 4, 5, 6)
2 2 5 c(2, 3, 4, 5)
3 3 4 c(3, 4)
The catch is that it would need to be created as follows:
df %>% mutate(C = c(A:B))
I have a dataset containing integers entered as ranges, i.e. someone has entered "7 to 26". I've separated the ranges into two columns, A and B (or "start" and "end"), and was hoping to use c(A:B) to create a list column, but using dplyr I keep getting:
Warning messages:
1: In a:b : numerical expression has 3 elements: only the first used
2: In a:b : numerical expression has 3 elements: only the first used
Which gives:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 list(1:6)
2 2 5 list(1:6)
3 3 4 list(1:6)
Has anyone had a similar issue and found a workaround?

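Inside mutate(), A:B sees the whole columns, and `:` only uses the first element of each, which is why every row ends up with 1:6. You can use map2() in purrr to build one sequence per row instead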
library(dplyr)
df %>%
mutate(C = purrr::map2(A, B, seq))
or do rowwise() before mutate()
df %>%
rowwise() %>%
mutate(C = list(A:B)) %>%
ungroup()
Both methods give
# A tibble: 3 x 3
# A B C
# <int> <int> <list>
# 1 1 6 <int [6]>
# 2 2 5 <int [4]>
# 3 3 4 <int [2]>
Data
df <- tibble::tibble(A = 1:3, B = 6:4)
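For the situation described in the question, where the ranges start out as text such as "7 to 26", the splitting and the sequence step can be chained. This is only a sketch under assumptions: the input tibble raw and its column name range are made up for illustration, and the separator is assumed to be exactly " to ".

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical raw input; the column name `range` is an assumption
raw <- tibble(range = c("1 to 6", "2 to 5", "3 to 4"))

out <- raw %>%
  # split "1 to 6" into A = 1 and B = 6; convert = TRUE turns them into integers
  separate(range, into = c("A", "B"), sep = " to ", convert = TRUE) %>%
  # one call to seq() per row
  mutate(C = map2(A, B, seq))

out$C
```

If the entries are less regular (extra spaces, "7-26", and so on), the sep pattern would need to be adapted.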

How can I use the seq function on two columns in a dataframe?

Suppose I have:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 6, 8, 10)
my.list <- list(start = x, end = y) %>% as.data.frame()
I need to define a new variable that stores seq(start, end) (or start:end) for each row. In other words, I want the sequence of numbers row by row: for example, 1 2 for the first row and 3 4 5 6 for the third row.
Many thanks

We can use map2 to get the sequence of corresponding values of 'start', 'end' to create a list of vectors
library(dplyr)
library(purrr)
my.list %>%
mutate(new = map2(start, end, `:`))
# start end new
#1 1 2 1, 2
#2 2 3 2, 3
#3 3 6 3, 4, 5, 6
#4 4 8 4, 5, 6, 7, 8
#5 5 10 5, 6, 7, 8, 9, 10
Another option is rowwise
my.list %>%
  rowwise() %>%
  mutate(new = list(start:end))
# A tibble: 5 x 3
# Rowwise:
# start end new
# <dbl> <dbl> <list>
#1 1 2 <int [2]>
#2 2 3 <int [2]>
#3 3 6 <int [4]>
#4 4 8 <int [5]>
#5 5 10 <int [6]>
Or with data.table, as @markus mentioned in the comments
library(data.table)
setDT(my.list)[, V3 := Map(`:`, start, end)]
Or with Map from base R
Map(`:`, my.list$start, my.list$end)

Filter a tibble in mutate based on another tibble?

I have two tibbles, ranges and sites. The first contains a set of coordinates (region, start, end, plus other character variables) and the other contains sites (region, site). I need to get all sites in the second tibble that fall within a given range (row) in the first tibble. Complicating matters, the ranges in the first tibble overlap.
# Range tibble
region start end var_1 ... var_n
1 A 1 5
2 A 3 10
3 B 20 100
# Site tibble
region site
1 A 4
2 A 8
3 B 25
The ~200,000 ranges can each be 100,000s long, over about a billion sites, so I don't love my idea of making a list of all values in each range, unnesting, semi_join'ing, grouping, and summarise(a_list = list(site))'ing.
I was hoping for something along the lines of:
range_tibble %>%
  rowwise %>%
  mutate(site_list = site_tibble %>%
    filter(region.site == region.range, site > start, site < end) %>%
    .$site %>% as.list)
to produce a tibble like:
# Final tibble
region start end site_list var_1 ... var_n
<chr> <dbl> <dbl> <list> <chr> <chr>
1 A 1 5 <dbl [1]>
2 A 3 10 <dbl [2]>
3 B 20 100 <dbl [1]>
I've seen answers using "gets" of an external variable (i.e. filter(b == get("b"))), but how would I get the variable from the current row in the range tibble? Any clever pipes or syntax I'm not thinking of? A totally different approach is great, too, as long as it plays well with big data and can be turned back into a tibble.

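Use left_join() to merge the two data frames by region, filter() to keep the sites that fall within each range, and summarise() to collect them into a list column. Note that the bounds are inclusive here, and that ranges with no matching site are dropped by the filter.
library(dplyr)
range %>%
  left_join(site, by = "region") %>%
  filter(site >= start & site <= end) %>%
  group_by(region, start, end) %>%
  summarise(site = list(site))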
# region start end site
# <fct> <dbl> <dbl> <list>
# 1 A 1 5 <dbl [1]>
# 2 A 3 10 <dbl [2]>
# 3 B 20 100 <dbl [1]>
Data
range <- data.frame(region = c("A", "A", "B"), start = c(1, 3, 20), end = c(5, 10, 100))
site <- data.frame(region = c("A", "A", "B"), site = c(4, 8, 25))
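Joining on region alone first pairs every range with every site in that region, which may be too much for the data sizes mentioned in the question. As a possible alternative, and assuming dplyr 1.1.0 or later is available, the range condition can go into the join itself with join_by(). This is a sketch, not something from the original answer; the names ranges, sites and site_list are illustrative.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

ranges <- tibble(region = c("A", "A", "B"), start = c(1, 3, 20), end = c(5, 10, 100))
sites  <- tibble(region = c("A", "A", "B"), site = c(4, 8, 25))

res <- ranges %>%
  # equality on region, plus start <= site and end >= site (inclusive bounds)
  inner_join(sites, by = join_by(region, start <= site, end >= site)) %>%
  group_by(region, start, end) %>%
  summarise(site_list = list(site), .groups = "drop")

lengths(res$site_list)
```

As with the filter() version, ranges without any site do not appear in the result; a left_join() with the same join_by() would keep them with an NA site. Whether this scales to a billion sites is something I have not tested.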

purrr to replace split, apply, output nested column

I understand how to use split and lapply and then combine the list outputs back together using base R. I'm trying to understand the purrr way to do this. I can do it with base R and even with purrr, but since I seem to be duplicating the order variable, I'm guessing I'm doing it wrong. It feels clunky, so I don't think I get it.
What is the tidyverse approach to using info from data subsets to create a nested output column?
Base R approach to make nested column in a data frame
library(tidyverse)
set.seed(10)
# tibble() replaces the deprecated data_frame()
dat2 <- dat1 <- tibble(
  v1 = LETTERS[c(1, 1, 1, 1, 2, 2, 2, 2)],
  v2 = rep(1:4, 2),
  from = c(1, 3, 2, 1, 3, 5, 2, 1),
  to = c(1, 3, 2, 1, 3, 5, 2, 1) + sample(1:3, 8, TRUE)
)
dat1 <- split(dat1, dat1[c('v1', 'v2')]) %>%
  lapply(function(x){
    x$order <- list(seq(x$from, x$to))
    x
  }) %>%
  {do.call(rbind, .)}
dat1
unnest(dat1, cols = c(order))
My purrr approach (what is the right way?)
dat2 %>%
group_by(v1, v2) %>%
nest() %>%
mutate(order = purrr::map(data, ~ with(., seq(from, to)))) %>%
select(-data)
Desired output
v1 v2 from to order
* <chr> <int> <dbl> <dbl> <list>
1 A 1 1 3 <int [3]>
2 B 1 3 4 <int [2]>
3 A 2 3 4 <int [2]>
4 B 2 5 6 <int [2]>
5 A 3 2 4 <int [3]>
6 B 3 2 3 <int [2]>
7 A 4 1 4 <int [4]>
8 B 4 1 2 <int [2]>

In this particular case it seems you're looking for:
mutate(dat2, order = map2(.x = from, .y = to, .f = seq))
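To see why no nesting is needed here: each row already carries its own from and to, so a row-by-row call is enough. The sketch below uses a small made-up tibble (dat is hypothetical, so it does not depend on the random seed in the question) and checks that map2() and rowwise() give the same list column:

```r
library(dplyr)
library(purrr)

# Small illustrative data, not the data from the question
dat <- tibble(v1 = c("A", "A", "B"),
              v2 = c(1L, 2L, 1L),
              from = c(1, 3, 3),
              to = c(3, 4, 4))

# purrr: iterate over from and to in parallel
a <- dat %>% mutate(order = map2(from, to, seq))

# dplyr only: evaluate the expression one row at a time
b <- dat %>%
  rowwise() %>%
  mutate(order = list(seq(from, to))) %>%
  ungroup()

all.equal(a$order, b$order)
```

Both versions keep the original columns, so there is nothing to select away afterwards.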

Using the new, experimental, rap package:
remotes::install_github("romainfrancois/rap")
library(rap)
dat2 %>%
rap(order = ~ seq(from, to))
