Creating combinations with replacement in R

I have a small example like the following:
df1 = data.frame(Id1=c(1,2,3))
I want to obtain the list of all combinations with replacement which would look like this:
So far I have seen the following functions which produces some parts of the above table:
a) combn function
t(combn(df1$Id1,2))
# Does not create rows 1, 4 and 5 in the above image
b) expand.grid function
expand.grid(df1$Id1,df1$Id1)
# Duplicates rows 2, 3 and 5. In my case the combinations 1,2 and 2,1
# are the same, so I do not need both of them.
c) CJ function (from data.table)
# install.packages("data.table")
library(data.table)
CJ(df1$Id1, df1$Id1)
# Same problem as the previous function
For reference, I know that in Python I could do the same using the itertools package (link here: https://www.hackerrank.com/challenges/itertools-combinations-with-replacement/problem)
Is there a way to do this in R?

Here's an alternative using expand.grid: create a unique key for every combination, then remove the duplicates.
library(dplyr)
expand.grid(df1$Id1, df1$Id1) %>%
  mutate(key = paste(pmin(Var1, Var2), pmax(Var1, Var2), sep = "-")) %>%
  filter(!duplicated(key)) %>%
  select(-key) %>%
  mutate(row = row_number())
# Var1 Var2 row
#1 1 1 1
#2 2 1 2
#3 3 1 3
#4 2 2 4
#5 3 2 5
#6 3 3 6
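As a base R variation on the same idea (a sketch, not part of the original answer), the key can be skipped by keeping only the rows of expand.grid where the first value is not larger than the second. This assumes the ids can be compared with <=, which holds for the numeric Id1 used here.

```r
df1 <- data.frame(Id1 = c(1, 2, 3))

# All ordered pairs, including (x, x)
g <- expand.grid(Var1 = df1$Id1, Var2 = df1$Id1)

# Keep one representative per unordered pair: the one with Var1 <= Var2
res <- g[g$Var1 <= g$Var2, ]
rownames(res) <- NULL
res
#   Var1 Var2
# 1    1    1
# 2    1    2
# 3    2    2
# 4    1    3
# 5    2    3
# 6    3    3
```

For n ids this yields n * (n + 1) / 2 rows, the number of combinations of size 2 with replacement.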

Related

How to iterate column values to find out all possible combinations in R? [duplicate]

Suppose you have a dataframe with ids and the elements assigned to each id. For example:
example <- data.frame(id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c',
'd','f','g','h','a','k','l','m', 'a',
'b', 'c'))
I want to find all possible pair combinations. The main struggle here is not which R functions to use, but the logic. How can I iterate through all elements and find patterns? For instance, a was picked together with b 3 times in my sample dataframe. But the original dataframe has more than 30k rows, so I cannot count these combinations manually. How do I automate counting how often each pair of elements is picked together?
I was thinking about widening my df with pivot_wider and then using map_lgl to find matches. But applying map_lgl to every pair of elements would take a lot of time.
I asked nearly the same question less than a month ago; fellow users answered it, but the result is not what I really need.
Do you have any ideas how to create a dataframe with all possible combinations of values for all ids?
I understand that this code is slow, but here is another example that gets the expected output using the tidyverse package.
What I do here is first create a nested dataframe by id, then produce all pair combinations for each id, unnest the dataframe, and finally count the pairs.
library(tidyverse)
example <- data.frame(
id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c','d','f','g','h','a','k','l','m','a','b', 'c')
)
example %>%
  nest(dataset = -id) %>%
  mutate(dataset = map(dataset, function(dataset) {
    if (nrow(dataset) > 1) {
      # all pairs of this id's values, one pair per row
      dataset$vals %>% combn(2) %>% t() %>% as_tibble(.name_repair = ~ c("val1", "val2"))
    } else {
      NULL
    }
  })) %>%
  unnest(cols = dataset) %>%
  group_by(val1, val2) %>%
  summarize(n = n(), .groups = "drop") %>%
  arrange(desc(n), val1, val2)
#> # A tibble: 34 x 3
#> val1 val2 n
#> <chr> <chr> <int>
#> 1 a b 3
#> 2 a c 2
#> 3 a d 2
#> 4 b c 2
#> 5 b d 2
#> 6 a e 1
#> 7 a k 1
#> 8 a l 1
#> 9 b e 1
#> 10 c d 1
#> # … with 24 more rows
Created on 2021-03-04 by the reprex package (v1.0.0)
This won't (can't) be fast for many IDs. If it is too slow, you need to parallelize or implement it in a compiled language (e.g., using Rcpp).
We sort vals. We can then create all combinations of two items grouped by ID. We exclude IDs with only 1 item. Finally we tabulate the result.
library(data.table)
setDT(example)
setorder(example, id, vals)
example[, if (.N > 1) split(combn(vals, 2), 1:2), by = id][, .N, by = c("1", "2")]
# 1 2 N
# 1: a b 3
# 2: a c 2
# 3: a d 3
# 4: a e 1
# 5: b c 2
# 6: b d 2
# 7: b e 1
#<...>
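The sorting step matters: combn pairs elements in the order they appear, so without sorting, id 4 contributes the pair d, a while ids 1 and 2 contribute a, d, and the two are counted separately. A minimal base R sketch of the sorted count follows. The names pairs_per_id and pair_counts are only for illustration, and applying unique() within each id is an assumption that an element counts once per id.

```r
example <- data.frame(
  id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
  vals = c("a","b","c","d","e","a","b","d","c","d",
           "f","g","h","a","k","l","m","a","b","c")
)

# For each id, build every unordered pair of its (sorted) values as "x-y"
pairs_per_id <- lapply(split(example$vals, example$id), function(v) {
  v <- sort(unique(v))
  if (length(v) < 2) return(NULL)        # ids with one item have no pairs
  apply(combn(v, 2), 2, paste, collapse = "-")
})

# Tabulate how often each pair occurs across ids
pair_counts <- sort(table(unlist(pairs_per_id)), decreasing = TRUE)
head(pair_counts)
```

Here pair_counts[["a-d"]] is 3, matching the data.table result above.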

Setdiff within mutate function

I have a data frame with three columns. Each row contains three unique numbers between 1 and 5 (inclusive).
df <- data.frame(a=c(1,4,2),
b=c(5,3,1),
c=c(3,1,5))
I want to use mutate to create two additional columns that, for each row, contain (in ascending order) the two numbers between 1 and 5 that do not appear in the initial three columns. The desired data frame in the example would be:
df2 <- data.frame(a=c(1,4,2),
b=c(5,3,1),
c=c(3,1,5),
d=c(2,2,3),
e=c(4,5,4))
I tried the mutate call below, using setdiff, to accomplish this, but it returned NAs rather than the values I was looking for:
df <- df %>% mutate(d=setdiff(c(a,b,c),c(1:5))[1],
e=setdiff(c(a,b,c),c(1:5))[2])
I can get around this by looping through each row (or using an apply function) but would prefer a mutate approach if possible.
Thank you for your help!
Base R:
cbind(df, t(apply(df, 1, setdiff, x = 1:5)))
# a b c 1 2
# 1 1 5 3 2 4
# 2 4 3 1 2 5
# 3 2 1 5 3 4
Warning: if there are any non-numerical columns, apply will happily up-convert things (converting to a matrix internally).
We can use pmap to loop over the rows, create a list column and then unnest it to create two new columns:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(out = pmap(., ~ setdiff(1:5, c(...)) %>%
                         as.list %>%
                         set_names(c('d', 'e')))) %>%
  unnest_wider(c(out))
# A tibble: 3 x 5
# a b c d e
# <dbl> <dbl> <dbl> <int> <int>
#1 1 5 3 2 4
#2 4 3 1 2 5
#3 2 1 5 3 4
Or using base R
df[c('d', 'e')] <- do.call(rbind, lapply(asplit(df, 1), function(x) setdiff(1:5, x)))
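The attempt in the question returns NA for two reasons that the answers fix implicitly: setdiff(x, y) keeps the elements of x that are not in y, so the arguments were reversed, and inside mutate the names a, b and c refer to whole columns rather than to one row. A minimal base R sketch of both points (the names wrong and missing_vals are only for illustration):

```r
df <- data.frame(a = c(1, 4, 2),
                 b = c(5, 3, 1),
                 c = c(3, 1, 5))

# What the original call computed: every value in the columns is in 1:5,
# so the difference is empty and indexing it gives NA
wrong <- setdiff(c(df$a, df$b, df$c), 1:5)
wrong[1]
# [1] NA

# Row by row, with the arguments in the right order
missing_vals <- t(mapply(function(a, b, c) sort(setdiff(1:5, c(a, b, c))),
                         df$a, df$b, df$c))
df$d <- missing_vals[, 1]
df$e <- missing_vals[, 2]
df
#   a b c d e
# 1 1 5 3 2 4
# 2 4 3 1 2 5
# 3 2 1 5 3 4
```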

How to keep one instance or more of the values in one column when removing duplicate rows?

I'm trying to remove rows with duplicate values in one column of a data frame. I want every existing value in that column to be represented: more than once if its values in one other column are distinct and non-missing, and only once if the values in that other column are all missing. Take for example the following data frame:
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7), Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
I would like to end up with this:
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7), Class = c("a","a","b",NA,NA,"a","b","a","a"))
I tried transforming the data frame into a data table and following the advice here, like this:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if(all(is.na(Class))) NA_character_ else na.omit(Class)), by = Group]
but duplicates weren't handled as needed: value 7 in the column 'Group' should appear only once in the resulting data.
It would be a bonus if the solution doesn't require transforming the data into a data table.
Here is one way using base R. We first drop NA rows in toy and select only unique rows. We can then left join it with unique Group values to get the rows which are NA for the group.
df1 <- unique(na.omit(toy))
merge(unique(subset(toy, select = Group)), df1, all.x = TRUE)
# Group Class
#1 1 a
#2 2 a
#3 2 b
#4 3 <NA>
#5 4 <NA>
#6 5 a
#7 5 b
#8 6 a
#9 7 a
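As a quick check (a sketch, not part of the original answer), the result can be compared column by column against the ideal data frame from the question. The name res is only for illustration, and as.character is used so the comparison also works if Class is stored as a factor.

```r
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7),
                  Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7),
                    Class = c("a","a","b",NA,NA,"a","b","a","a"))

# Distinct non-missing rows, left-joined onto the distinct groups
res <- merge(unique(subset(toy, select = Group)),
             unique(na.omit(toy)),
             all.x = TRUE)

# Compare with the desired output
identical(res$Group, ideal$Group)
identical(as.character(res$Class), as.character(ideal$Class))
```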
The same logic using dplyr functions:
library(dplyr)
toy %>%
  na.omit() %>%
  distinct() %>%
  right_join(toy %>% distinct(Group), by = "Group")
If you would like to try a tidyverse approach:
library(tidyverse)
toy %>%
group_by(Group) %>%
filter(!(is.na(Class) & sum(!is.na(Class)) > 0)) %>%
distinct()
Output
# A tibble: 9 x 2
# Groups: Group [7]
Group Class
<dbl> <chr>
1 1 a
2 2 a
3 2 b
4 3 NA
5 4 NA
6 5 a
7 5 b
8 6 a
9 7 a

dplyr Update a cell in a data.frame

df <-data.frame(x=c(1:5),y=c(letters[1:5]))
Let's say I want to modify the last row,
update.row<-filter(df,x==5) %>% mutate(y="R")
How do I update this row into the data.frame ?
The only way I found, albeit a strange one, is to do an
anti-join and append the results.
df <-anti_join(df,update.row,by="x") %>%
bind_rows(update.row)
However, it seems like a very inelegant way to achieve a simple task.
Any ideas are much appreciated...
With data.table, we can assign (:=) the value to the rows where i is TRUE. It is very efficient as the assignment is done in place.
library(data.table)
setDT(df)[x==5, y:="R"]
df
# x y
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: 5 R
Since the OP mentioned the last row, a more general way is
setDT(df)[.N, y:= "R"]
Or, as @thelatemail mentioned, if we want to replace any row, we just give the row index in i, i.e. 5 in this case.
setDT(df)[5, y:="R"]
If you are insistent on dplyr, perhaps
df <-data.frame(x=c(1:5),y=c(letters[1:5]))
library(dplyr)
df %>%
mutate(y = as.character(y)) %>%
mutate(y = ifelse(row_number()==n(), "R", y))
# x y
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 R
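For comparison, a plain base R sketch of the same update, assuming y is stored as character (stringsAsFactors = FALSE is set explicitly so this also holds on R versions before 4.0):

```r
df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)

# Update by condition, as in the filter(x == 5) example
df$y[df$x == 5] <- "R"

# Or update the last row by position:
# df$y[nrow(df)] <- "R"

df
#   x y
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 R
```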

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on the column Initials). I am basically trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt failed completely. Does anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
# table() works for both factor and character columns; tabulate() needs a factor or integers
my_rows <- seq.int(max(table(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
# table() works for both factor and character columns; tabulate() needs a factor or integers
my_rows <- seq.int(max(table(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows needed for each initial, combine those initials with NA values, then rbind the result to the data frame.
max(table(df$Initials)) gives the count of the most frequent initial, in this case 2 (for a). Subtracting each initial's count, table(df$Initials), from that maximum gives a vector with the number of rows to add per initial. As an added bonus, because we use table, the vector is automatically named.
We use the names of the new vector to know 1) which initials to repeat, and 2) how many times each should be repeated.
Because the appended row is a character vector, the data column ends up as character (note the <NA> in the output). To restore its class, assign the result to a variable (say newdf) and run newdf$data <- as.numeric(newdf$data).
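A base R sketch of the same padding that keeps the column classes intact relies on the fact that indexing a data frame past its last row yields NA rows. The names max_rows and padded are only for illustration.

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4),
                 stringsAsFactors = FALSE)

# Size of the largest group
max_rows <- max(table(df$Initials))

padded <- do.call(rbind, lapply(split(df, df$Initials), function(g) {
  out <- g[seq_len(max_rows), ]    # rows beyond nrow(g) come back as NA
  out$Initials <- g$Initials[1]    # restore the group label on padded rows
  out
}))
rownames(padded) <- NULL
padded
#   Initials data
# 1        a    2
# 2        a    3
# 3        b    4
# 4        b   NA
```

Since no character vector is bound to the data frame, data stays numeric and no as.numeric conversion is needed afterwards.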

Resources