R: how to obtain unique pairwise combinations of 2 vectors [duplicate] - r

This question already has answers here:
How to generate permutations or combinations of object in R?
(3 answers)
Closed 2 years ago.
x = 1:3
y = 1:3
> expand.grid(x = 1:3, y = 1:3)
x y
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
Using expand.grid gives me all of the combinations. However, I want only pairwise comparisons, that is, I don't want a comparison of 1 vs 1, 2 vs, 2, or 3 vs 3. Moreover, I want to keep only the unique pairs, i.e., I want to keep 1 vs 2 (and not 2 vs 1).
In summary, for the above x and y, I want the following 3 pairwise combinations:
x y
1 1 2
2 1 3
3 2 3
Similarly, for x = y = 1:4, I want the following pairwise combinations:
x y
1 1 2
2 1 3
3 1 4
4 2 3
5 2 4
6 3 4

We can use combn
f1 <- function(x) setNames(as.data.frame(t(combn(x, 2))), c("x", "y"))
f1(1:3)
# x y
#1 1 2
#2 1 3
#3 2 3
f1(1:4)
# x y
#1 1 2
#2 1 3
#3 1 4
#4 2 3
#5 2 4
#6 3 4

Using data.table,
library(data.table)
x <- 1:4
y <- 1:4
CJ(x, y)[x < y]
x y
1: 1 2
2: 1 3
3: 1 4
4: 2 3
5: 2 4
6: 3 4

Actually you are already very close to the desired output. You may need subset as well
> subset(expand.grid(x = x, y = y), x < y)
x y
4 1 2
7 1 3
8 2 3
Here is another option but with longer code
v <- letters[1:5] # dummy data vector
mat <- diag(length(v))
inds <- upper.tri(mat)
data.frame(
x = v[row(mat)[inds]],
y = v[col(mat)[inds]]
)
which gives
x y
1 a b
2 a c
3 b c
4 a d
5 b d
6 c d
7 a e
8 b e
9 c e
10 d e

Related

Assign unique non-repeated ID to nested groups with the same values in R

I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design and I need to include a unique non-repeating ID to nested groups that can have identical values. While I regularly conduct this type of data wrangling, both the structure of this data set as well as the required outcome are beyond my skillset at this time.
Below I have provided an example data set (df) and what the results should look like.
I used the below code in my actual data set, but realized that it fails under certain circumstances...which are exaggerated in the example data set provided here. I prefer the ID to be sequentially numbered.
df$ID = cumsum(c(TRUE, diff(df$LENGTH) != 0))
I am open to all options (e.g., library(data.table), library(boot), etc) as it would be great if others find this post useful. However, I prefer solutions that do not require the installation and loading of additional packages.
Thanks in advance for you help.
Take care.
df <- read.table(text = "GROUP REGION TIME LENGTH
a x 1 3
a x 2 3
a x 3 3
a y 4 3
a y 5 3
a y 6 3
a z 7 2
a z 8 2
b z 1 2
b z 2 2
b x 3 2
b x 4 2
c x 1 2
c x 2 2
c y 3 2
c y 4 2
c x 5 2
c x 6 2
c z 7 1", header = TRUE)
result <- read.table(text = "GROUP REGION TIME LENGTH ID
a x 1 3 1
a x 2 3 1
a x 3 3 1
a y 4 3 2
a y 5 3 2
a y 6 3 2
a z 7 2 3
a z 8 2 3
b z 1 2 4
b z 2 2 4
b x 3 2 5
b x 4 2 5
c x 1 2 6
c x 2 2 6
c y 3 2 7
c y 4 2 7
c x 5 2 8
c x 6 2 8
c z 7 1 9", header = TRUE)
Paste GROUP and REGION columns and use rle to create a sequential ID column.
transform(df,ID = with(rle(paste(GROUP, REGION)),rep(seq_along(values),lengths)))
In data.table we can use rleid.
library(data.table)
setDT(df)[, ID := rleid(GROUP, REGION)]
# GROUP REGION TIME LENGTH ID
# 1: a x 1 3 1
# 2: a x 2 3 1
# 3: a x 3 3 1
# 4: a y 4 3 2
# 5: a y 5 3 2
# 6: a y 6 3 2
# 7: a z 7 2 3
# 8: a z 8 2 3
# 9: b z 1 2 4
#10: b z 2 2 4
#11: b x 3 2 5
#12: b x 4 2 5
#13: c x 1 2 6
#14: c x 2 2 6
#15: c y 3 2 7
#16: c y 4 2 7
#17: c x 5 2 8
#18: c x 6 2 8
#19: c z 7 1 9
Another base R option, but without rle
transform(
df,
ID = cumsum(c(1, (s <- paste0(GROUP, REGION))[-1] != head(s, -1)))
)
gives
GROUP REGION TIME LENGTH ID
1 a x 1 3 1
2 a x 2 3 1
3 a x 3 3 1
4 a y 4 3 2
5 a y 5 3 2
6 a y 6 3 2
7 a z 7 2 3
8 a z 8 2 3
9 b z 1 2 4
10 b z 2 2 4
11 b x 3 2 5
12 b x 4 2 5
13 c x 1 2 6
14 c x 2 2 6
15 c y 3 2 7
16 c y 4 2 7
17 c x 5 2 8
18 c x 6 2 8
19 c z 7 1 9
With dplyr
library(dplyr)
library(data.table)
df %>%
mutate(ID = rleid(GROUP, REGION))

How to order column of data.frame by another variable [duplicate]

This question already has answers here:
Sort data frame by two columns (with condition) [duplicate]
(2 answers)
Closed 5 years ago.
mydata <- data.frame(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
score = c(c(1, 2, 3), c(3, 2, 1), c(1, 3, 2)),
location = c(rep(c("X", "Y", "Z"), 3)))
> mydata
id score location
1 1 1 X
2 1 2 Y
3 1 3 Z
4 2 3 X
5 2 2 Y
6 2 1 Z
7 3 1 X
8 3 3 Y
9 3 2 Z
I would like to sort my data.frame according to score from smallest to largest for each id.
Simplying ordering by score ignores the id column.
> mydata[with(mydata, order(score)),]
id score location
1 1 1 X
6 2 1 Z
7 3 1 X
2 1 2 Y
5 2 2 Y
9 3 2 Z
3 1 3 Z
4 2 3 X
8 3 3 Y
Essentially, I want my output to be
id score location
1 1 1 X
2 1 2 Y
3 1 3 Z
4 2 1 Z
5 2 2 Y
6 2 3 X
7 3 1 X
8 3 2 Z
9 3 3 Y
Using base R only.
mydata[order(mydata$id, mydata$score), ]
id score location
1 1 1 X
2 1 2 Y
3 1 3 Z
6 2 1 Z
5 2 2 Y
4 2 3 X
7 3 1 X
9 3 2 Z
8 3 3 Y
You can use dplyr package:
library(dplyr)
mydata %>% arrange(id,score)
# id score location
# 1 1 1 X
# 2 1 2 Y
# 3 1 3 Z
# 4 2 1 Z
# 5 2 2 Y
# 6 2 3 X
# 7 3 1 X
# 8 3 2 Z
# 9 3 3 Y

Selecting top N rows for each group based on value in column

I have dataframe like below :-
x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)
df
x y z
1 3 a 2
2 2 a 2
3 1 a 2
4 8 b 1
5 7 b 1
6 11 c 3
7 10 c 3
8 9 c 3
9 7 c 3
10 5 c 3
11 4 c 3
I want to select top n row for each group by column y where n is provided in column z.
So the output should be like :
output:
x y z
1 3 a 2
2 2 a 2
3 8 b 1
4 11 c 3
5 10 c 3
6 9 c 3
A solution with base R:
# df is split according to y, then we keep only the top "z" value (after ordering x)
# and rbind everything back together:
do.call(rbind,
lapply(split(df, df$y),
function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3
EDIT:
A much more direct way (still in base R) provided in comment by #mt1022:
df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3
One approach with data.table:
library(data.table)
setDT(df)
df[,.(inc=seq_len(.N)<=z,x,z),by=.(y)][inc==T ,-2]
# y x z
#1: a 3 2
#2: a 2 2
#3: b 8 1
#4: c 11 3
#5: c 10 3
#6: c 9 3
A solution with dplyr that uses do:
df %>%
group_by(y) %>%
do(head(.,as.numeric(unique(.$z))))
I'm posting the solution I was looking for using dplyr. It is based on #HNSKD:
library(dplyr)
x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)
df %>% group_by(y) %>% slice(1:2)
Which returns the first two elements for each y:
# A tibble: 6 x 3
# Groups: y [3]
x y z
<dbl> <fct> <dbl>
1 3 a 2
2 2 a 2
3 8 b 1
4 7 b 1
5 11 c 3
6 10 c 3

Create a new variable which count length of duplicate in R

I have a data frame,I want to create a variable z,count duplicate of "y variable", if y have 1,1 set z = 2,2, if y have 3,3,3, set z = 3,3,3.
x = c("a","b","c","d","e","a","b","c","d","e","a","b","c")
y = c(1,1,2,2,2,3,3,4,4,4,5,5,5)
data <- data.frame(x,y)
data
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
Thanks for your help.
You can try the rle:
data$z <- with(data, unlist(mapply(rep, rle(y)$lengths, rle(y)$lengths)))
data
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
If your your variable y is sorted as an increasing sequence as you say, then the following solution will work:
# calculate counts of each level
counts <- table(data$y)
# fill in z
data$z <- counts[match(data$y, names(counts))]
Note, however, that this method will fail if y is not ordered and, since you want to restart the count when a different level occurs. For these purposes, #psidom's solution is more robust to mis-ordered data as rle will reset the count.
This method calculates the total occurrences of a level and then feeds these total counts to the proper location using match.
Here is a quick method using dplyr, and its rather intuitive syntax:
library(dplyr)
left_join(data, data %>%
group_by(y) %>%
summarize(z = n()),
by = "y")
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
We can do this easily with data.table
library(data.table)
setDT(data)[, z := .N , rleid(y)]
data
# x y z
# 1: a 1 2
# 2: b 1 2
# 3: c 2 3
# 4: d 2 3
# 5: e 2 3
# 6: a 3 2
# 7: b 3 2
# 8: c 4 3
# 9: d 4 3
#10: e 4 3
#11: a 5 3
#12: b 5 3
#13: c 5 3
Or using rle from base R without any loops
inverse.rle(within.list(rle(data$y), values <- lengths))
#[1] 2 2 3 3 3 2 2 3 3 3 3 3 3
Or another base R method with ave
with(data, ave(y, cumsum(c(TRUE, y[-1]!= y[-length(y)])), FUN=length))
#[1] 2 2 3 3 3 2 2 3 3 3 3 3 3

how to mutate a column with ID in group

how to mutate a column with ID in group
data.frame like:
a b c
1 a 1 1
2 a 1 2
3 a 2 3
4 b 1 4
5 b 2 5
6 b 3 6
group by a, flag start with 1, if b equals pre b,then flag=1 else flag+=1
a b c flag
1 a 1 1 1 <- group a start with 1
2 a 1 2 1 <-- in group a, 1(in row 2)=1(in row 1)
3 a 2 3 2 <- in group a, 2(in row 3)!=1(in row 2)
4 b 1 4 1 <- group b start with 1
5 b 2 5 2 <- in group b, 2(in row 5)!=1(in row 4)
6 b 3 6 3 <- in group b, 3(in row 6)!=2(in row 5)
i now using this:
for(i in 2:nrow(x)){
x[i, 'flag'] = ifelse(x[i, 'a']!=x[i-1,'a'], 1, ifelse(x[i, 'b']==x[i-1, 'b'], x[i-1, 'flag'], x[i-1,'flag']+1))
}
but it is inefficiency in large dataset
#
UPDATE
dense_rank in dplyr give me the answer
> x %>% group_by(a) %>% mutate(dense_rank(b))
Source: local data frame [10 x 4]
Groups: a
a b c dense_rank(b)
1 a x 1 1
2 a x 2 1
3 a y 3 2
4 b x 4 1
5 b y 5 2
6 b z 6 3
7 c x 7 1
8 c y 8 2
9 c z 9 3
10 c z 10 3
thanks.
I am not entirely sure what you are trying to do. But it seems to me that you are trying to assign index numbers to values in b for each group (a or b).
#I modified your example here.
a <- rep(c("a","b"), each =3)
b <- c(4,4,5,11,12,13)
c <- 1:6
foo <- data.frame(a,b,c, stringsAsFactors = F)
a b c
1 a 4 1
2 a 4 2
3 a 5 3
4 b 11 4
5 b 12 5
6 b 13 6
#Since you referred to dplyr, I will use it.
cats <- list()
for(i in unique(foo$a)){
ana <- foo %>%
filter(a == i) %>%
arrange(b) %>%
mutate(indexInb = as.integer(as.factor(b)))
cats[[i]] <- ana
}
bob <- rbindlist(cats)
a b c indexInb
1: a 4 1 1
2: a 4 2 1
3: a 5 3 2
4: b 11 4 1
5: b 12 5 2
6: b 13 6 3
Hers's a quick vectorized way to solve this without using any for loops
Base R solution using ave and transform
transform(x, flag = ave(b, a, FUN = function(x) cumsum(c(1, diff(x)))))
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
Or a data.table solution (more efficient)
library(data.table)
setDT(x)[, flag := cumsum(c(1, diff(b))), by = a]
x
# a b c flag
# 1: a 1 1 1
# 2: a 1 2 1
# 3: a 2 3 2
# 4: b 1 4 1
# 5: b 2 5 2
# 6: b 3 6 3
Or a dplyr solution (because you tagged it)
library(dplyr)
x %>%
group_by(a) %>%
mutate(flag = cumsum(c(1, diff(b))))
# Source: local data frame [6 x 4]
# Groups: a
#
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3

Resources