Remove duplicates from lists within a vector in R

I have a vector of lists like the following sample:
library(tidyverse)
z <- tribble(
~x,
c(10, 10, 64),
c(22, 22),
c(5, 9, 9),
c(55, 55),
c(76, 65)
)
I'm trying to reduce each list to include only cases with unique values. Here's the output I'm looking for:
y <- tribble(
~x,
c(10, 64),
c(22),
c(5, 9),
c(55),
c(76, 65)
)
I can't post the actual output directly and had to write it out as a new data set for this example, because a printed tibble only shows a summary of the list column:
# A tibble: 5 x 1
x
<list>
1 <dbl [3]>
2 <dbl [2]>
3 <dbl [3]>
4 <dbl [2]>
5 <dbl [2]>

We can loop over the list column with map() and apply unique() to each element:
library(dplyr)
library(purrr)
z %>%
mutate(x = map(x, unique))
In base R, the equivalent is:
z$x <- lapply(z$x, unique)
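As a quick check of what unique() does to each element, here is a minimal base R sketch on a plain list holding the same values as the sample. The names `x` and `deduped` are illustrative and do not come from the original post.

```r
# Plain list with the same values as the sample list column
x <- list(c(10, 10, 64), c(22, 22), c(5, 9, 9), c(55, 55), c(76, 65))

# unique() keeps the first occurrence of each value, in the original order
deduped <- lapply(x, unique)

# Inspect the result: each element now holds only distinct values
str(deduped)
```

Because a printed tibble hides the contents of a list column, str() (or pulling the column out with z$x) is a convenient way to confirm the result.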

Related

Concatenate/merge dataframes in R into vector type cells

I would like to merge two data frames into one, with each cell becoming a vector or a list.
The columns have the same names in both data frames. Some columns hold numerical values, which I want to keep numerical in the merged data frame. Other columns hold characters.
For example, starting from these two data frames:
DF1 <- data.frame(
xx = c(1:5),
yy = c(2:6),
zz = c("a","b","c","d","e"))
DF2 <- data.frame(
xx = c(3:7),
yy = c(5:9),
zz = c("a","i","h","g","f"))
Which look like this:
DF1
xx yy zz
 1  2  a
 2  3  b
 3  4  c
 4  5  d
 5  6  e
DF2
xx yy zz
 3  5  a
 4  6  i
 5  7  h
 6  8  g
 7  9  f
I would like to get a data frame looking like this:
xx      yy      zz
c(1,3)  c(2,5)  c(a,a)
c(2,4)  c(3,6)  c(b,i)
c(3,5)  c(4,7)  c(c,h)
c(4,6)  c(5,8)  c(d,g)
c(5,7)  c(6,9)  c(e,f)
I have tried paste() and str_c(), but they always convert my numerical values to character and do not create the list or vector cells I want.
Do you know of any functions that could help me do that?
With a few tidyverse functions, you can turn each pair of columns inside-out and then build everything back together.
library(purrr)
library(dplyr)
as_tibble(map2(DF1, DF2, ~ map(transpose(list(.x, .y)), unlist)))
This gets you your data frame of vectors.
# A tibble: 5 x 3
xx yy zz
<list> <list> <list>
1 <int [2]> <int [2]> <chr [2]>
2 <int [2]> <int [2]> <chr [2]>
3 <int [2]> <int [2]> <chr [2]>
4 <int [2]> <int [2]> <chr [2]>
5 <int [2]> <int [2]> <chr [2]>
Breaking this down:
1. transpose(list(.x, .y)) flips a pair of columns inside-out, from a list of two vectors to a list of 5 elements (one per row, each holding two list elements).
2. map(transpose(list(.x, .y)), unlist) iterates over each of those 5 lists and unlists it from a list of 2 back to a vector of 2.
3. map2(DF1, DF2, ~ map(transpose(list(.x, .y)), unlist)) iterates over each column pair from DF1 and DF2 (xx, yy, zz), applying steps 1 and 2.
4. as_tibble(map2(DF1, DF2, ~ map(transpose(list(.x, .y)), unlist))) converts the resulting list to a tibble (basically a data.frame).
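The first two steps can be inspected on a single column pair. This is a small sketch, assuming purrr is installed; the vectors `a` and `b` are stand-ins for DF1$xx and DF2$xx, and the names `pairs` and `vecs` are illustrative.

```r
library(purrr)

# One column pair, standing in for DF1$xx and DF2$xx
a <- 1:5
b <- 3:7

# Step 1: flip a list of 2 vectors into a list of 5 two-element lists
pairs <- transpose(list(a, b))
length(pairs)       # one element per row
length(pairs[[1]])  # one entry per data frame

# Step 2: collapse each two-element list back into a plain vector
vecs <- map(pairs, unlist)
vecs[[1]]
```

Since unlist() works on one column pair at a time, integer columns stay integer and character columns stay character, which is what the original question asked for.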
Another thing you can do is stack the data and then nest() it. You again need a few steps to do it. This would scale better because you could do this with more than 2 data frames.
library(dplyr)
library(tibble)
library(tidyr)
bind_rows(rowid_to_column(DF1),
rowid_to_column(DF2)) %>%
group_by(rowid) %>%
nest(nest_data = -rowid) %>%
unnest_wider(nest_data) %>%
ungroup() %>%
select(-rowid)
This also gets you your data frame of vectors.
# A tibble: 5 x 3
xx yy zz
<list> <list> <list>
1 <int [2]> <int [2]> <chr [2]>
2 <int [2]> <int [2]> <chr [2]>
3 <int [2]> <int [2]> <chr [2]>
4 <int [2]> <int [2]> <chr [2]>
5 <int [2]> <int [2]> <chr [2]>
This gives you matrices in a list:
res <- setNames(
lapply( colnames(DF1), function(x) cbind(DF1[[x]], DF2[[x]]) ),
colnames(DF1) )
To convert the result into a data frame you can use this:
data.frame( sapply(
names(res), function(x){ sapply(
1:nrow(res$xx), function(y){ list(res[[x]][y,1:ncol(res$xx)]) }
) }
) )
xx yy zz
1 1, 3 2, 5 a, a
2 2, 4 3, 6 b, i
3 3, 5 4, 7 c, h
4 4, 6 5, 8 d, g
5 5, 7 6, 9 e, f
Put together in a function:
EDIT: Added functionality to apply any number of DFs
(against what the question demands, but seemed to be necessary)
morph <- function(...){
abc <- list(...)
res <- sapply( colnames(abc[[1]]), function(col) list(
sapply( abc, function(dfr) dfr[[col]] ) ) )
data.frame( sapply(
names(res), function(x){ sapply(
1:nrow(res[[1]]), function(y){ list(res[[x]][y,1:ncol(res[[1]])]) }
) }
) )
}
morph(DF1, DF2, DF2)
xx yy zz
1 1, 3, 3 2, 5, 5 a, a, a
2 2, 4, 4 3, 6, 6 b, i, i
3 3, 5, 5 4, 7, 7 c, h, h
4 4, 6, 6 5, 8, 8 d, g, g
5 5, 7, 7 6, 9, 9 e, f, f
Because your data consists of different types, there is no straightforward answer. However, here is a solution that might do the trick by creating a nested list. Let me know if this is what you need:
library(BBmisc)
library(dplyr)
colvec <- c("xx2","yy2","zz2")
colnames(DF2) <- colvec
DF <- bind_cols(DF1,DF2)
cols.num <- c("xx","xx2","yy","yy2")
DF[cols.num] <- sapply(DF[cols.num],as.character)
DF <- DF[,c(1,4,2,5,3,6)]
xx <- convertRowsToList(DF[,1:2])
yy <- convertRowsToList(DF[,3:4])
zz <- convertRowsToList(DF[,5:6])
final_list <- list(xx,yy,zz)
Try the following base R option:
data.frame(Map(function(x, y) asplit(cbind(x, y), 1), DF1, DF2))
xx yy zz
1 1, 3 2, 5 a, a
2 2, 4 3, 6 b, i
3 3, 5 4, 7 c, h
4 4, 6 5, 8 d, g
5 5, 7 6, 9 e, f

Subset a vector of lists in R

Let's say I have a vector of lists:
library(tidyverse)
d <- tribble(
~x,
c(10, 20, 64),
c(22, 11),
c(5, 9, 99),
c(55, 67),
c(76, 65)
)
How can I subset this vector so that, for example, I keep only the rows whose lists have a length greater than 2? Here is my unsuccessful attempt using the tidyverse:
filter(d, length(x) > 2)
# A tibble: 5 x 1
x
<list>
1 <dbl [3]>
2 <dbl [2]>
3 <dbl [3]>
4 <dbl [2]>
5 <dbl [2]>
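Use lengths() rather than length(), because 'x' is a list column: inside filter(), length(x) returns the number of rows (5), so the condition is TRUE for every row, whereas lengths(x) returns the length of each element.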
library(dplyr)
d %>%
filter(lengths(x) > 2)
You can use subset() + lengths()
subset(d,lengths(x)>2)
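The difference between the two functions is easy to see on a plain list. A minimal base R sketch, using the same values as the question (the names `x` and `kept` are illustrative):

```r
# Plain list with the same values as the list column in the question
x <- list(c(10, 20, 64), c(22, 11), c(5, 9, 99), c(55, 67), c(76, 65))

# length() counts the elements of the list itself
length(x)    # 5

# lengths() returns the length of each element
lengths(x)   # 3 2 3 2 2

# Logical indexing with lengths() keeps only the longer elements
kept <- x[lengths(x) > 2]
length(kept) # 2
```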

leading / lagging observations together with nested data

I want to generate a new variable based on
1. a nested vector of the current observation
2. values from the current and other observations.
Here's my example:
D <- tibble(team = c(101, 101, 101, 102, 102, 102),
id = c(1, 2, 3, 1, 2, 3),
x = c(3, 7, 5, 1, 4, 10),
y = list(c(5,5,5), c(8,5,2), c(6,2,7), c(3,9,3), c(8,3,4), c(4,4,7)))
I want to create a new variable which equals
abs(y[1] - x[id==1]) + abs(y[2] - x[id==2]) + abs(y[3] - x[id==3])
This is obviously not valid syntax; it only demonstrates what I want to compute. I need to use the current as well as leading or lagging (or both) observations of x, depending on the value of id.
The expected result in this example would be z = c(4, 10, 10, 14, 14, 6)
I have tried something along the lines of group_by(team) followed by an attempt to use map() but I can't find anything promising.
What is the most elegant solution? I'd really appreciate your help!
We can group by 'team' and then use map_dbl() to loop over the list column 'y', taking the sum of absolute differences between each element and the group's 'x' values (this relies on the rows within each team being ordered by id):
library(dplyr)
library(purrr)
D %>%
group_by(team) %>%
mutate(z = map_dbl(y, ~ sum(abs(.x -x))))
# A tibble: 6 x 5
# Groups: team [2]
# team id x y z
# <dbl> <dbl> <dbl> <list> <dbl>
#1 101 1 3 <dbl [3]> 4
#2 101 2 7 <dbl [3]> 10
#3 101 3 5 <dbl [3]> 10
#4 102 1 1 <dbl [3]> 14
#5 102 2 4 <dbl [3]> 14
#6 102 3 10 <dbl [3]> 6
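To see why this works, the calculation for a single team can be reproduced in base R. A small sketch using the values of team 101 from the question; the variable names are illustrative:

```r
# x values of team 101, ordered by id
x <- c(3, 7, 5)

# Nested y vectors of the same three rows
y <- list(c(5, 5, 5), c(8, 5, 2), c(6, 2, 7))

# For each row: |y[1] - x[1]| + |y[2] - x[2]| + |y[3] - x[3]|
z <- vapply(y, function(v) sum(abs(v - x)), numeric(1))
z  # 4 10 10
```

The subtraction v - x is element-wise, so position k of each nested vector is matched with the x value of the k-th row in the group. If the rows were not sorted by id, an arrange(team, id) before the mutate() would be needed.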

R sum a twice nested list using purrr

I have a data.frame with the following structure:
as_tibble(data2)
lamda meanlog sdlog freq freqsev
<dbl> <dbl> <dbl> <list> <list>
1 5 9 2 <int [4]> <list [4]>
2 2 10 2.1 <int [4]> <list [4]>
3 3 11 2.2 <int [4]> <list [4]>
Here freq is a list column holding s simulated counts per row (s is the number of simulations), and freqsev holds, for each of those counts, a vector with that many values.
library(tidyverse)
set.seed(123)
s <- 5
data <- data.frame(lamda = c(5, 2, 3), meanlog = c(9, 10, 11), sdlog = c(2, 2.1, 2.2))
data2 <- data %>% mutate(
freq = map(lamda, ~rpois(s, .x)),
freqsev = map(freq, ~map(.x, function(k) rlnorm(k, meanlog, sdlog)))
)
I would like to sum each vector inside freqsev, producing a <dbl [4]> per row (one sum per simulation), i.e. a sum over the freq occurrences.
For example, for data2$freqsev[[1]][[1]] I would expect its sum.
How can this be achieved? Thank you.
To be honest, this is a really complicated way of storing your data and you would probably be better off using unnest() after creating the freq column. However, you can get the sums of the freqsev vectors like this:
data2 <- data %>% mutate(
freq = map(lamda, ~rpois(s, .x)),
freqsev = map(freq, ~map(.x, function(k) rlnorm(k, meanlog, sdlog))),
freqsum = map(freqsev, ~map_dbl(.x, ~sum(.x)))
)
Because freqsev is a double-nested list, you also need to double-map the sum operation.
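The double mapping can be illustrated on a small hand-made nested list in base R. This is only a sketch: the list `nested` is made-up toy data with the same shape as freqsev, not values from the question.

```r
# Toy data shaped like freqsev: a list (rows) of lists (simulations) of vectors
nested <- list(
  list(c(1, 2), c(3, 4, 5)),
  list(numeric(0), c(10))
)

# Outer loop over rows, inner loop over simulations, summing each vector
sums <- lapply(nested, function(inner) vapply(inner, sum, numeric(1)))

sums[[1]]  # 3 12
sums[[2]]  # 0 10
```

Note that a simulation with a count of zero yields an empty vector, whose sum is 0, so it does not need special handling.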

purrr to replace split, apply, output nested column

I understand how to use split and lapply and then combine the list outputs back together using base R. I'm trying to understand the purrr way to do this. I can do it with base R and even with purrr, but since I seem to be duplicating the order variable, I'm guessing that I'm doing it wrong. It feels clunky, so I don't think I get it.
What is the tidyverse approach to using info from data subsets to create a nested output column?
Base R approach to make nested column in a data frame
library(tidyverse)
set.seed(10)
dat2 <- dat1 <- tibble(
v1 = LETTERS[c(1, 1, 1, 1, 2, 2, 2, 2)],
v2 = rep(1:4, 2),
from = c(1, 3, 2, 1, 3, 5, 2, 1),
to = c(1, 3, 2, 1, 3, 5, 2, 1) + sample(1:3, 8, TRUE)
)
dat1 <- split(dat1, dat1[c('v1', 'v2')]) %>%
lapply(function(x){
x$order <- list(seq(x$from, x$to))
x
}) %>%
{do.call(rbind, .)}
dat1
unnest(dat1)
My purrr approach (what is the right way?)
dat2 %>%
group_by(v1, v2) %>%
nest() %>%
mutate(order = purrr::map(data, ~ with(., seq(from, to)))) %>%
select(-data)
Desired output
v1 v2 from to order
* <chr> <int> <dbl> <dbl> <list>
1 A 1 1 3 <int [3]>
2 B 1 3 4 <int [2]>
3 A 2 3 4 <int [2]>
4 B 2 5 6 <int [2]>
5 A 3 2 4 <int [3]>
6 B 3 2 3 <int [2]>
7 A 4 1 4 <int [4]>
8 B 4 1 2 <int [2]>
In this particular case, it seems you're looking for:
mutate(dat2, order = map2(.x = from, .y = to, .f = seq))
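map2() walks through from and to in parallel and calls seq() on each pair. Base R's Map() does the same thing, which makes the idea easy to try without any packages. A minimal sketch; the values of `from` and `to` are illustrative and not the seeded data from the question:

```r
# Illustrative start and end points, one pair per row
from <- c(1, 3, 2)
to   <- c(3, 4, 4)

# Call seq(from[i], to[i]) for each i and collect the results in a list
ord <- Map(seq, from, to)

lengths(ord)  # 3 2 3
ord[[1]]      # 1 2 3
```

The resulting list has one element per row, so it can be assigned directly as a list column, which avoids the group_by() and nest() round trip.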
Using the new, experimental, rap package:
remotes::install_github("romainfrancois/rap")
library(rap)
dat2 %>%
rap(order = ~ seq(from, to))
