R merging with a preference

Suppose you have a matrix that consists of two columns of only 1's and 2's.
A B
1 2
2 2
1 1
2 1
2 1
2 2
2 1
How would you merge these two columns into one so that 2 always overwrites 1?
Desired Output:
C
2
2
1
2
2
2
2

Assuming that the data is stored in a dataframe named df, you can use
df$C <- pmax(df$A, df$B)
to create a new column C with the desired result.
In the case of a matrix m you can use
m <- cbind(m, pmax(m[,1], m[,2]))
colnames(m) <- LETTERS[1:ncol(m)]
#> m
# A B C
#[1,] 1 2 2
#[2,] 2 2 2
#[3,] 1 1 1
#[4,] 2 1 2
#[5,] 2 1 2
#[6,] 2 2 2
#[7,] 2 1 2
#> class(m)
#[1] "matrix"

Without ifelse:
df$C1 <- apply(df[,c("A","B")],1,max)
With ifelse:
df$C2 <- with(df, ifelse(A==1&B==1,1,2))
Result
> df
A B C1 C2
1 1 2 2 2
2 2 2 2 2
3 1 1 1 1
4 2 1 2 2
5 2 1 2 2
6 2 2 2 2
7 2 1 2 2
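As a quick sanity check, the sketch below rebuilds the example data (values copied from the question, with the name df following the answers above) and confirms that the three approaches agree. It also shows do.call(pmax, ...), which is one way to extend the idea to more than two columns; that generalization is an assumption about what you might need, not part of the original question.

```r
# Example data copied from the question
df <- data.frame(A = c(1, 2, 1, 2, 2, 2, 2),
                 B = c(2, 2, 1, 1, 1, 2, 1))

c_pmax   <- pmax(df$A, df$B)                         # vectorised row-wise maximum
c_apply  <- apply(df[, c("A", "B")], 1, max)         # same values, computed row by row
c_ifelse <- with(df, ifelse(A == 1 & B == 1, 1, 2))  # only valid when values are 1 or 2

# pmax accepts any number of vectors, so do.call() passes every column at once
c_all <- do.call(pmax, df)

print(c_pmax)
# [1] 2 2 1 2 2 2 2
```

Note that the ifelse version hard-codes the values 1 and 2, whereas pmax works for any numeric columns.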

Related

expand_grid with identical vectors

Problem:
Is there a simple way to get all combinations of two (or more) identical vectors, but show only the unique combinations?
Reproducible example:
library(tidyr)
x = 1:3
expand_grid(a = x,
b = x,
c = x)
# A tibble: 27 x 3
a b c
<int> <int> <int>
1 1 1 1
2 1 1 2
3 1 1 3
4 1 2 1
5 1 2 2
6 1 2 3
7 1 3 1
8 1 3 2
9 1 3 3
10 2 1 1
# ... with 17 more rows
But, if row 1 2 1 exists, then I do not want to see 1 1 2 or 2 1 1. I.e. show only unique combinations of the three vectors (any order).
library(gtools)
x = 1:3
df <- as.data.frame(combinations(n=3,r=3,v=x,repeats.allowed=T))
df
output
V1 V2 V3
1 1 1 1
2 1 1 2
3 1 1 3
4 1 2 2
5 1 2 3
6 1 3 3
7 2 2 2
8 2 2 3
9 2 3 3
10 3 3 3
You can just sort rowwise and remove duplicates. Continuing from your expand_grid(), then
df <- tidyr::expand_grid(a = x,
b = x,
c = x)
data.frame(unique(t(apply(df, 1, sort))))
X1 X2 X3
1 1 1 1
2 1 1 2
3 1 1 3
4 1 2 2
5 1 2 3
6 1 3 3
7 2 2 2
8 2 2 3
9 2 3 3
10 3 3 3
Use comboGeneral from the RcppAlgos package; it is implemented in C++ and pretty fast.
x <- 1:3
RcppAlgos::comboGeneral(x, repetition=TRUE)
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 1 3
# [4,] 1 2 2
# [5,] 1 2 3
# [6,] 1 3 3
# [7,] 2 2 2
# [8,] 2 2 3
# [9,] 2 3 3
# [10,] 3 3 3
Note: If you're running Linux, you will need gmp installed, e.g. for Ubuntu do:
sudo apt install libgmp3-dev
Base R:
x <- 1:3
df <- expand.grid(a = x,
b = x,
c = x)
# Use a separator so multi-digit values cannot collide (e.g. 1,11 vs 11,1)
df[!duplicated(apply(df, 1, function(x) paste(sort(x), collapse = "_"))), ]
#> a b c
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 1 1
#> 5 2 2 1
#> 6 3 2 1
#> 9 3 3 1
#> 14 2 2 2
#> 15 3 2 2
#> 18 3 3 2
#> 27 3 3 3
Created on 2021-09-09 by the reprex package (v2.0.1)
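The number of rows to expect can be checked by hand: choosing r items from n values with repetition, ignoring order, gives choose(n + r - 1, r) combinations, so n = 3 and r = 3 yields choose(5, 3) = 10, which matches the outputs above. A minimal base R sketch of that check follows; the variable names (grid, key, uniq) are illustrative only.

```r
x <- 1:3
r <- 3

# All ordered tuples: length(x)^r rows
grid <- expand.grid(rep(list(x), r))

# Sort each row so tuples that differ only in order collapse to one key;
# the "_" separator keeps multi-digit values unambiguous
key  <- apply(grid, 1, function(row) paste(sort(row), collapse = "_"))
uniq <- grid[!duplicated(key), ]

nrow(grid)                    # 27
nrow(uniq)                    # 10
choose(length(x) + r - 1, r)  # 10
```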

Producing all combinations of two column values in R

I have a data.frame with two columns
> data.frame(a=c(5,4,3), b =c(1,2,4))
a b
1 5 1
2 4 2
3 3 4
I want to produce a list of data.frames with different combinations of those column values; there should be a total of six possible scenarios for the above example (correct me if I am wrong):
a b
1 5 1
2 4 2
3 3 4
a b
1 5 1
2 4 4
3 3 2
a b
1 5 2
2 4 1
3 3 4
a b
1 5 2
2 4 4
3 3 1
a b
1 5 4
2 4 2
3 3 1
a b
1 5 4
2 4 1
3 3 2
Is there a simple function to do it? I don't think expand.grid worked out for me.
Actually, expand.grid can work here, but it is not recommended: it is rather inefficient when df has many rows, because with n rows it generates n^n index combinations and only n! of them (the true permutations) are kept.
Below is an example using expand.grid
u <- do.call(expand.grid, rep(list(seq(nrow(df))), nrow(df)))
lapply(
asplit(
subset(
u,
apply(u, 1, FUN = function(x) length(unique(x))) == nrow(df)
), 1
), function(v) within(df, b <- b[v])
)
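To make the n! out of n^n point concrete, here is a small sketch that counts both quantities for the example data. It assumes df is the two-column data.frame from the question; the names u and is_perm are illustrative.

```r
df <- data.frame(a = c(5, 4, 3), b = c(1, 2, 4))
n  <- nrow(df)

# Every way of picking a row index for each position: n^n candidates
u <- do.call(expand.grid, rep(list(seq_len(n)), n))

# Keep only rows that use each index exactly once, i.e. true permutations
is_perm <- apply(u, 1, function(idx) length(unique(idx)) == n)

nrow(u)       # 27 = 3^3
sum(is_perm)  # 6  = factorial(3)

# Ratio of generated candidates to kept permutations as n grows
sapply(1:8, function(k) k^k / factorial(k))
```

The ratio grows quickly with n, which is why a dedicated permutation function is usually the better choice.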
One more efficient option is to use perms from package pracma
library(pracma)
> lapply(asplit(perms(df$b),1),function(v) within(df,b<-v))
[[1]]
a b
1 5 4
2 4 2
3 3 1
[[2]]
a b
1 5 4
2 4 1
3 3 2
[[3]]
a b
1 5 2
2 4 4
3 3 1
[[4]]
a b
1 5 2
2 4 1
3 3 4
[[5]]
a b
1 5 1
2 4 2
3 3 4
[[6]]
a b
1 5 1
2 4 4
3 3 2
Use combinat::permn to create all possible permutations of the b values, and bind each permutation with column a.
df <- data.frame(a= c(5,4,3), b = c(1,2,4))
result <- lapply(combinat::permn(df$b), function(x) data.frame(a = df$a, b = x))
result
#[[1]]
# a b
#1 5 1
#2 4 2
#3 3 4
#[[2]]
# a b
#1 5 1
#2 4 4
#3 3 2
#[[3]]
# a b
#1 5 4
#2 4 1
#3 3 2
#[[4]]
# a b
#1 5 4
#2 4 2
#3 3 1
#[[5]]
# a b
#1 5 2
#2 4 4
#3 3 1
#[[6]]
# a b
#1 5 2
#2 4 1
#3 3 4
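If installing a package is not an option, the permutations can also be generated in base R. The sketch below uses a small recursive helper; all_perms is a hypothetical name written for this example, not a function from pracma or combinat, and it is only practical for short vectors because the number of permutations grows as n!.

```r
# Hypothetical helper (not from any package): all permutations of a vector,
# returned as a list, built recursively by fixing each element in turn
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

df <- data.frame(a = c(5, 4, 3), b = c(1, 2, 4))

# One data.frame per permutation of column b, with column a left unchanged
result <- lapply(all_perms(df$b), function(p) data.frame(a = df$a, b = p))

length(result)  # 6
result[[1]]$b   # 1 2 4
```

If b contained repeated values, this helper would return duplicate permutations, so the list would need deduplicating.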

In a large dataset is there any fast way to identify the repeated data records and also bucket the records with similar repeat pattern in R?

Assume I have a data frame as this:
my_df<- data.frame(mat1=c(1,2,2,2,1,2,2),
mat2=c(5,4,3,1,5,4,4),
mat3=c(4,1,6,9,4,1,1),
mat4=c(1,2,6,9,1,2,2))
I actually know how to identify the repeats, which gives me the following:
mat1 mat2 mat3 mat4 Repeat
1 1 5 4 1 TRUE
2 2 4 1 2 TRUE
3 2 3 6 6 FALSE
4 2 1 9 9 FALSE
5 1 5 4 1 TRUE
6 2 4 1 2 TRUE
7 2 4 1 2 TRUE
I want to bucket the similar pattern, to generate the classes as follows:
mat1 mat2 mat3 mat4 Repeat repeat_class
1 1 5 4 1 TRUE 1
2 2 4 1 2 TRUE 2
3 2 3 6 6 FALSE 0
4 2 1 9 9 FALSE 0
5 1 5 4 1 TRUE 1
6 2 4 1 2 TRUE 2
7 2 4 1 2 TRUE 2
where repeat_class=0 marks non-repeated data records, and repeat_class=1, 2, etc. identifies the similar patterns found in the data records.
I can do it in for loops, but for a large dataset it is just taking too long. I'm wondering if there is any faster way to do that in R?
It looks like you want a column with a unique key for each repeat class in the data frame.
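In dplyr, we can use the function group_indices. In dplyr 1.0.0 and later, passing the grouping columns directly to group_indices() is deprecated, so group with group_by() first:
library(dplyr)
my_df$repeat_class <- my_df %>%
group_by(mat1, mat2, mat3, mat4) %>%
group_indices()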
mat1 mat2 mat3 mat4 repeat_class
1 1 5 4 1 1
2 2 4 1 2 4
3 2 3 6 6 3
4 2 1 9 9 2
5 1 5 4 1 1
6 2 4 1 2 4
7 2 4 1 2 4
To match your output, if we want all non-duplicated keys to share one value, we can set them to 0:
my_df$repeat_class[!(duplicated(my_df$repeat_class) | duplicated(my_df$repeat_class, fromLast = TRUE))] <- 0
mat1 mat2 mat3 mat4 repeat_class
1 1 5 4 1 1
2 2 4 1 2 4
3 2 3 6 6 0
4 2 1 9 9 0
5 1 5 4 1 1
6 2 4 1 2 4
7 2 4 1 2 4
Here is one option with .GRP from data.table. We group by all columns of 'my_df' and assign (:=) to 'repeat_class' the group number .GRP for groups with more than one row (.N > 1), and 0 otherwise.
library(data.table)
setDT(my_df)[, repeat_class := .GRP * (.N > 1), by = names(my_df)]
my_df
# mat1 mat2 mat3 mat4 repeat_class
#1: 1 5 4 1 1
#2: 2 4 1 2 2
#3: 2 3 6 6 0
#4: 2 1 9 9 0
#5: 1 5 4 1 1
#6: 2 4 1 2 2
#7: 2 4 1 2 2
Here's my guess (which mirrors my comment):
#Make a character vector that reflects the pattern:
my_df$pat <- apply(my_df,1,paste, collapse="_")
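#Then use ave to measure length of each pattern and subtract 1 from the tally.
#Note: the result is the repeat count minus 1, so two different patterns that
#repeat equally often would share the same value: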
(my_df$repeat_class <- ave( seq(nrow(my_df)), my_df$pat, FUN=length ) - 1 )
#[1] 1 2 0 0 1 2 2
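For comparison, here is a base R sketch that numbers the repeated patterns consecutively in order of first appearance and gives 0 to every non-repeated row. It assumes a "_" separator never occurs inside the values; the names key, n_rep and rep_keys are illustrative only.

```r
my_df <- data.frame(mat1 = c(1, 2, 2, 2, 1, 2, 2),
                    mat2 = c(5, 4, 3, 1, 5, 4, 4),
                    mat3 = c(4, 1, 6, 9, 4, 1, 1),
                    mat4 = c(1, 2, 6, 9, 1, 2, 2))

# One string key per row, built column-wise (no row-by-row loop)
key <- do.call(paste, c(my_df, sep = "_"))

# How many times each key occurs, aligned with the rows
n_rep <- ave(seq_along(key), key, FUN = length)

# Distinct keys that occur more than once, in order of first appearance
rep_keys <- unique(key[n_rep > 1])

# Position of each row's key among the repeated keys; 0 when not repeated
my_df$repeat_class <- match(key, rep_keys, nomatch = 0L)

my_df$repeat_class
# [1] 1 2 0 0 1 2 2
```

Unlike the count-based version, distinct patterns always get distinct class numbers here, even if they repeat equally often.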

In R: How to create a vector of lagged differences but keep the original value for negative differences without using loops

I have a vector in R of the form:
> a <- c(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11)
> a
[1] 1 3 5 7 9 11 1 3 5 7 9 11 1 3 5 7 9 11
I can take the lagged differences like this:
b <- diff(a)
> b
[1] 2 2 2 2 2 -10 2 2 2 2 2 -10 2 2 2 2 2
But I would like the negative differences to be replaced by the original values in the vector a. Or, in this case the -10's to be replaced by the 1's.
Is there a way to do this without looping through the vectors?
Thanks
One possible way:
indices<-which(b<0)
b[indices]<-a[indices+1]
One approach using replacement:
d <- diff(a)
d_neg <- d < 0
d[d_neg] <- a[-1][d_neg]
# [1] 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2
One approach using ifelse:
d <- diff(a)
ifelse(d < 0, a[-1], d)
# [1] 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2
One approach using mathematics and pmax:
d <- diff(a)
(d < 0) * a[-1] + pmax(d, 0)
# [1] 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2
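All of these approaches rely on the same alignment: d[i] is a[i + 1] - a[i], so a[-1][i] is the "original value" that belongs to d[i]. A minimal sketch that wraps the replacement approach in a function and checks it against the other two; the name diff_keep_value is hypothetical, chosen for this example.

```r
a <- c(1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11)

# Hypothetical helper: lagged differences, with each negative difference
# replaced by the later of the two values that produced it
diff_keep_value <- function(x) {
  d   <- diff(x)
  neg <- d < 0
  d[neg] <- x[-1][neg]  # x[-1][i] is x[i + 1], the value aligned with d[i]
  d
}

res <- diff_keep_value(a)
res
# [1] 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2

# The ifelse and arithmetic versions give the same values
all(res == ifelse(diff(a) < 0, a[-1], diff(a)))
all(res == (diff(a) < 0) * a[-1] + pmax(diff(a), 0))
```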

Relevel a factor

I am given an unordered factor ID, a reference vector giving the rank of each level, and a label for each level. I want to order the IDs by the given rank and then override the labels in the factor.
Could you advise whether there is a better way to do this:
ID<-factor(c(1,2,2,3,1,3,3,2,1,1)+10)
Rank<-c("11"=3,"12"=1,"13"=2)
Label<-c("11"="B","12"="A","13"="C")
ID.Rank<-factor(ID, levels=names(Rank),labels=Rank)
ID.Rank<-factor(ID.Rank, levels=sort(Rank),ordered=TRUE)
ID.Label<-factor(ID, levels=names(Label),labels=Label)
data.frame(ID,ID.Rank,ID.Label)
### here it is important that ID.Rank has a certain order.
factor(ID.Rank, labels=Label[match(levels(ID.Rank), Rank)])
If I understood your question correctly, here is how you can solve the problem.
set.seed(2)
ID<-as.numeric(ID)
df1<-as.data.frame(ID)
> df1
ID
1 1
2 1
3 3
4 2
5 3
6 2
7 3
8 3
9 2
10 3
df2<-as.data.frame(Rank)
df2$ID<-rownames(df2)
> df2
Rank ID
1 3 1
2 1 2
3 2 3
df3<-merge(df1,df2,by="ID")
ID Rank
1 1 3
2 1 3
3 2 1
4 2 1
5 2 1
6 3 2
7 3 2
8 3 2
9 3 2
10 3 2
df3$Rank is what you are looking for as the final result. You can convert it to a factor.
Updated as per comments: If you want the original order of ID:
df1$IDo<-rownames(df1)
df3
ID IDo Rank
1 1 1 3
2 1 7 3
3 1 4 3
4 2 3 1
5 2 9 1
6 2 10 1
7 3 2 2
8 3 5 2
9 3 6 2
10 3 8 2
myFac <- factor(ID, levels=Rank, labels=names(Rank) )
myFac
[1] 3 3 2 2 3 1 1 2 2 3
Levels: 1 < 2 < 3
match(levels(myFac), names(Label) )
[1] 1 2 3
Label[match(levels(myFac), names(Label) )]
1 2 3
"B" "A" "C"
levels(myFac) <- Label[match(levels(myFac), names(Label) )]
myFac
#-----
[1] C C A A C B B A A C
Levels: B < A < C
Assuming Rank and Label are always in the same order, you just need to order the labels appropriately and then use them to create the ordered factor.
ID <- factor(c(1,2,2,3,1,3,3,2,1,1)+10)
Rank <- c("11"=3,"12"=1,"13"=2)
Label <- c("11"="B","12"="A","13"="C")
Label <- Label[order(Rank)]
factor(ID, levels=names(Label), labels=Label, ordered=TRUE)
## [1] B A A C B C C A B B
## Levels: A < C < B
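If Rank and Label are not guaranteed to be in the same order, the labels can be looked up by name instead of by position. The sketch below is one way to do that, under the assumption that both vectors are named with the level values; lev is an illustrative variable name.

```r
ID    <- factor(c(1, 2, 2, 3, 1, 3, 3, 2, 1, 1) + 10)
Rank  <- c("11" = 3, "12" = 1, "13" = 2)
Label <- c("11" = "B", "12" = "A", "13" = "C")

# Level names sorted by rank; labels are then looked up by name, not position
lev <- names(sort(Rank))
ID.Label <- factor(ID, levels = lev, labels = unname(Label[lev]), ordered = TRUE)

ID.Label
levels(ID.Label)      # "A" "C" "B"
as.integer(ID.Label)  # integer codes follow the rank: 3 1 1 2 3 2 2 1 3 3
```

Because the ranks here are 1 to 3 with no gaps, the integer codes of the ordered factor coincide with the ranks themselves.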
