R: compare two groups of vectors

I have made two recommendation systems and would like to compare the products they recommend to see how many products are mutual. I joined the two results into a data frame: the columns of one recommendation system start with "z", those of the other with "b".
Example data:
df <- data.frame(z1 = c("a", "s", "d"), z2 = c("z", "x", "c"), z3 = c("q", "w", "e"),
b1 = c("w", "a", "e"), b2 = c("a", "i", "r"), b3 = c("z", "w", "y"))
ID z1 z2 z3 b1 b2 b3
 1  a  z  q  q  a  z
 2  s  x  w  a  i  r
 3  d  c  e  r  e  y
Desired results:
ID z1 z2 z3 b1 b2 b3 mutual_recommendation
 1  a  z  q  q  a  z                     3
 2  s  x  w  a  i  r                     0
 3  d  c  e  r  e  y                     1
The problem is that the order might not be the same, and comparing all the combinations with CASE or ifelse would require a lot of combinations, especially when the number of Top-N recommendations changes to 10.

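We can use apply() to loop over the rows of the dataset without the 'ID' column and, for each row, take the length of the intersect() of the first three and the last three elements. This assumes 'ID' is the first column, as in the printed table; the data.frame() call in the question creates no 'ID' column, in which case use df instead of df[-1].
df$mutual_recommendation <- apply(df[-1], 1, FUN = function(x)
  length(intersect(x[1:3], x[4:6])))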
df$mutual_recommendation
#[1] 3 0 1
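Since the question mentions that Top-N may grow to 10, hard-coded positions such as x[1:3] can become fragile. Below is a minimal sketch that selects the two groups of columns by name instead. It assumes the column names always follow the z1..zN / b1..bN pattern and that the data match the printed table; the ID column and the selection by name are my assumptions, not part of the answer above.

```r
# Data as printed in the question (assumed, including the ID column)
df <- data.frame(
  ID = 1:3,
  z1 = c("a", "s", "d"), z2 = c("z", "x", "c"), z3 = c("q", "w", "e"),
  b1 = c("q", "a", "r"), b2 = c("a", "i", "e"), b3 = c("z", "r", "y"),
  stringsAsFactors = FALSE
)

# Pick the columns of each recommender by prefix, whatever N is
z_cols <- grep("^z[0-9]+$", names(df), value = TRUE)
b_cols <- grep("^b[0-9]+$", names(df), value = TRUE)

zm <- as.matrix(df[z_cols])
bm <- as.matrix(df[b_cols])

# For every row, count the products that appear in both groups
df$mutual_recommendation <- vapply(
  seq_len(nrow(df)),
  function(i) length(intersect(zm[i, ], bm[i, ])),
  integer(1)
)
df$mutual_recommendation
# [1] 3 0 1
```

With ten recommendations per system the same code should work unchanged, as long as the naming pattern holds.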

Here is another solution. Note that I changed the data.frame() code so that it produces the data frame actually shown below it in the question; the two do not match.
> library(dplyr)
> df %>% mutate(mutual_recommendation=apply(df,1,function(x) sum(x[1:3] %in% x[4:6]) ))
z1 z2 z3 b1 b2 b3 mutual_recommendation
1 a z q q a z 3
2 s x w a i r 0
3 d c e r e y 1

Related

Add new column describing unique values per ID

The data I have contain four fields: ID, x1 (numeric), x2 (numeric), and x3 (factor). Some IDs have multiple records, and also some values of x3 are missing (NA). Here is a sample
ID <- c(1,1,1,1,2,2,3,3,3,3,4,4,4,5,6,6)
x1 <- rnorm(16,0,1)
x2 <- rnorm(16,2,2)
x3 <- c("a", "a", "a", NA, "b", "b", "c", "c", "a", "c", "w", "w", "w", "y", NA, NA)
df <- data.frame(ID, x1, x2, x3)
I want to create a new field (let's call it unqind) to check whether each ID has unique values of x3.
For example, ID=1 has four observations of x3 ("a", "a", "a", NA) ... three "a"'s and one NA. Therefore unqind=0.
ID=2 has two observations of x3 (2 "b"s)... therefore, unqind=1.
In case all values of x3 are NAs per ID, then unqind=1.
After creating unqind, df looks like:
ID x1 x2 x3 unqind
1 0.9087691 4.4353865 a 0
1 0.3686852 2.5851186 a 0
1 -1.335171 1.18109 a 0
1 -0.1596629 0.593775 NA 0
2 0.4841148 0.1684549 b 1
2 0.1256352 4.2785666 b 1
3 -0.954508 3.1284599 c 0
3 0.3502183 2.4766285 c 0
3 -1.2365438 1.041901 a 0
3 0.9786498 -0.6517521 c 0
4 1.3426399 1.5733424 w 1
4 -0.3117586 -0.4648479 w 1
4 0.136769 -2.6124866 w 1
5 -1.3295984 6.2783164 y 1
6 -1.1989125 -1.7025381 NA 1
6 -0.8936165 2.3131387 NA 1
You could do this quite easily with the data.table package. uniqueN() is equivalent to length(unique(x)) but much faster. Group by ID and compare the result to 1.
library(data.table)
setDT(df)[, unqind := as.integer(uniqueN(x3) == 1L), by = ID]
Another option, using base R, could be with ave().
df$unqind <- with(df, {
as.integer(ave(as.character(x3), ID, FUN=function(x) length(unique(x))) == 1L)
})
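Both approaches rely on NA being counted as a value of its own by unique() (and by uniqueN() with its default na.rm = FALSE), which is why ID 1 gets 0 while ID 6 gets 1. A small base R sketch to check this on the sample x3 values; the name n_distinct is only illustrative:

```r
ID <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6)
x3 <- c("a", "a", "a", NA, "b", "b", "c", "c", "a", "c",
        "w", "w", "w", "y", NA, NA)

# Number of distinct x3 values per ID; NA counts as one value
n_distinct <- tapply(x3, ID, function(x) length(unique(x)))

# Map the per-ID counts back onto the rows and compare with 1
unqind <- as.integer(n_distinct[as.character(ID)] == 1L)
unqind
# [1] 0 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1
```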

Summing columns of characters in R data frame to create a new column

I currently have a data frame in R, where each entry is a character. However, each character also corresponds to a point value, where: B = 10, S = 1, C = 1, X = 0.
For example, consider the following data frame
> df = data.frame(p1 = c("B", "B", "C", "C", "S", "S", "X"), p2 = c("X", "B", "B", "S", "C", "S", "X"), p3 = c("C", "B", "B", "X", "C", "S", "X"))
> df
p1 p2 p3
1 B X C
2 B B B
3 C B B
4 C S X
5 S C C
6 S S S
7 X X X
I want to create three new columns in R: c1, c2, c3, where these are essentially the "lagged" sum (using the numeric value of each character) of the p1, p2, and p3 values.
p1 p2 p3 c1 c2 c3
1 B X C 0 10 10
2 B B B 0 10 20
3 C B B 0 1 11
4 C S X 0 1 2
5 S C C 0 1 2
6 S S S 0 1 2
7 X X X 0 0 0
For example, c1 is always initialized to 0. c2 will be the point value of p1, and c3 will be the sum of c2 and the point value of p2.
In general c_i = c_{i-1} + p_{i-1}.
Is there an easy way to do this in R? Thank you in advance, as I am a relatively novice R user.
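Something like this would work. matchFun looks up the point value of each letter in a named vector, so it works whether the columns are character or factor.
matchFun <- function(x) unname(c(B = 10, S = 1, C = 1, X = 0)[as.character(x)])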
within(df, {
c3 <- rowSums(sapply(list(p1, p2), matchFun))
c2 <- matchFun(p1)
c1 <- 0L
})
# p1 p2 p3 c1 c2 c3
# 1 B X C 0 10 10
# 2 B B B 0 10 20
# 3 C B B 0 1 11
# 4 C S X 0 1 2
# 5 S C C 0 1 2
# 6 S S S 0 1 2
# 7 X X X 0 0 0
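The general rule c_i = c_{i-1} + p_{i-1} is a running total shifted one column to the right, so it can be written for any number of p columns. A minimal sketch under that reading; the names points, p_cols, vals and lagged are illustrative, not from the answer above:

```r
df <- data.frame(
  p1 = c("B", "B", "C", "C", "S", "S", "X"),
  p2 = c("X", "B", "B", "S", "C", "S", "X"),
  p3 = c("C", "B", "B", "X", "C", "S", "X"),
  stringsAsFactors = FALSE
)

points <- c(B = 10, S = 1, C = 1, X = 0)  # point value of each letter
p_cols <- c("p1", "p2", "p3")

# Numeric matrix of point values, same shape as the p columns
vals <- matrix(points[unlist(df[p_cols], use.names = FALSE)],
               nrow = nrow(df))

# Row-wise cumulative sums, shifted right by one column and padded with 0
running <- t(apply(vals, 1, cumsum))
lagged <- cbind(0, running[, -ncol(running), drop = FALSE])
colnames(lagged) <- paste0("c", seq_along(p_cols))

result <- cbind(df, lagged)
result
```

Adding a p4 column would only require extending p_cols; the shift and padding stay the same.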

How to optimize my R code to eliminate NEIGHBORING duplicates row-wise in a dataframe using vectorization instead of looping

Edited question:
My dataframe looks like this.
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)
df
x1 x2 x3 x4
1 a b b a
2 c c d e
3 f g h i
4 j k <NA> <NA>
I wrote a loop to eliminate NEIGHBORING duplicate values in each row.
for ( i in 1:4 ) {
for ( j in 1:3 ) {
if ( df[i, 4-j+1] == df[i, 4-j] & is.na(df[i, 4-j+1]) == F ) {
df[i, 4-j+1] <- NA
} else {
df[i, 4-j+1] <- df[i, 4-j+1]
}
}
}
The result looks like this.
x1 x2 x3 x4
1 a b <NA> a
2 c <NA> d e
3 f g h i
4 j k <NA> <NA>
However, the original dataframe is quite big so the loop doesn't seem to be an appropriate approach.
Could you please show me how to optimize?
Thank you very much for your help and sorry for not asking more precisely.
Rami
To remove duplicates wherever they occur in a row:
df[t(apply(df, 1, duplicated))] <- NA
To remove only neighbouring duplicates, this should work:
df[] <- t(apply(df, 1, function(rg) {
  if (any(duplicated(rg))) {
    # TRUE where an element equals its left-hand neighbour
    inddupl <- c(FALSE, rg[-1] == rg[-length(rg)])
    # which() skips the NA comparisons produced by missing values
    rg[which(inddupl)] <- NA
  }
  rg
}))
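Since the question asks for vectorization, the neighbour comparison can also be done on whole columns at once rather than row by row. A minimal sketch, assuming all columns are character as in the example; the names m and dup are illustrative:

```r
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

m <- as.matrix(df)

# Compare columns 2..n with columns 1..(n-1) in one step;
# the first column has no left-hand neighbour, so it is never a duplicate
dup <- cbind(FALSE, m[, -1, drop = FALSE] == m[, -ncol(m), drop = FALSE])

# which() ignores the NA results that come from missing values
m[which(dup)] <- NA
df[] <- m
df
```

This compares each cell with the original value to its left, which gives the same result as the loop in the question on this data.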

Extract data from column aggregate function in R

I have a large database from which I have extracted a data value (x) using the aggregate function:
library(plotrix)
aggregate(mydataNC[,c(52)],by=list(patientNC, siteNC, supNC),max)
OUTPUT:
Each (x) value has a corresponding distance value located in a column titled (dist) in this database.
What is the easiest way to extract the dist value and add it to the table?
I'd probably start with merge() first. Here's a small reproducible example you can use to see what's going on and modify it to use your data:
# generate bogus data and view it
x1 <- rep(c("A", "B", "C"), each = 4)
x2 <- rep(c("E", "E", "F", "F"), times = 3)
y1 <- rnorm(12)
y2 <- rnorm(12)
md <- data.frame(x1, x2, y1, y2)
> head(md)
x1 x2 y1 y2
1 A E -1.4603164 -0.9662473
2 A E -0.5247227 1.7970341
3 A F 0.8990502 1.7596285
4 A F -0.6791145 2.2900357
5 B E 1.2894863 0.1152571
6 B E -0.1981511 0.6388998
# aggregate by taking maximum of each unique (x1, x2) combination
md.agg <- with(md, aggregate(y1, by = list(x1, x2), FUN = max))
names(md.agg) <- c("x1", "x2", "y1")
> md.agg
x1 x2 y1
1 A E -0.5247227
2 B E 1.2894863
3 C E 0.9982510
4 A F 0.8990502
5 B F 2.5125956
6 C F -0.5916491
# merge y2 into the aggregated data
md.final <- merge(md, md.agg)
> md.final
x1 x2 y1 y2
1 A E -0.5247227 1.7970341
2 A F 0.8990502 1.7596285
3 B E 1.2894863 0.1152571
4 B F 2.5125956 -0.2217510
5 C E 0.9982510 0.6813261
6 C F -0.5916491 1.0348518
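Note that merge() joins on every shared column here (x1, x2 and y1), so if the maximum occurs more than once within a group, every tied row is returned. If exactly one row per group is wanted, one alternative is to sort and keep the first row of each group. A minimal sketch with the same kind of bogus data; the seed and the names md.sorted and md.max are my own choices:

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
md <- data.frame(
  x1 = rep(c("A", "B", "C"), each = 4),
  x2 = rep(c("E", "E", "F", "F"), times = 3),
  y1 = rnorm(12),
  y2 = rnorm(12)
)

# Sort so that the largest y1 comes first within each (x1, x2) group
md.sorted <- md[order(md$x1, md$x2, -md$y1), ]

# Keep the first row of every group; y2 (or dist) comes along with it
md.max <- md.sorted[!duplicated(md.sorted[c("x1", "x2")]), ]
md.max
```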

Order while splitting (e.g. "TA" should be split into two columns, "A" first and "T" second) in R

I have the following issue, which I could only partly solve:
set.seed (1234)
mydf <- data.frame (var1a = sample (c("TA", "AA", "TT"), 5, replace = TRUE),
varb2 = sample (c("GA", "AA", "GG"), 5, replace = TRUE),
varAB = sample (c("AC", "AA", "CC"), 5, replace = TRUE)
)
mydf
var1a varb2 varAB
1 TA AA CC
2 AA GA AA
3 AA GA AC
4 AA AA CC
5 TT AA AC
I want to split each two-letter value into separate columns and then order the letters alphabetically.
Edit: The ordering can be done before the split, so that the var1a value "TA" becomes "AT", or after the split, so that var1aa is "A" and var1ab is "T" (instead of "T", "A").
So the sorting is within each cell.
split_col <- function(.col, data){
.x <- colsplit( data[[.col]], names = paste0(.col, letters[1:2]))
}
Split each column and combine:
require(reshape)
splitdf <- do.call(cbind, lapply(names(mydf), split_col, data = mydf))
var1aa var1ab varb2a varb2b varABa varABb
1 T A A A C C
2 A A G A A A
3 A A G A A C
4 A A A A C C
5 T T A A A C
But the unsolved part is that I want to order each pair of columns so that the values in columnname"a" and columnname"b" are in alphabetical order. Thus the expected output is:
var1aa var1ab varb2a varb2b varABa varABb
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
How can I order (sort within each pair of variables)?
mylist <-as.list(mydf)
splits <- lapply(mylist, reshape::colsplit, names=c("a", "b"))
rowsort <- lapply(splits, function(x) t(apply(x, 1, sort)))
comb <- do.call(data.frame, rowsort)
comb
var1a.1 var1a.2 varb2.1 varb2.2 varAB.a varAB.b
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
EDIT:
If names are important, you can replace them:
replaceNums <- function(x){
.which <- regmatches(x, regexpr("[[:alnum:]]*(?=\\.)", x, perl=TRUE))
stopifnot(length(x) %% 2 == 0) #checkstep
paste0(.which, c("a", "b"))
}
names(comb) <- replaceNums(names(comb))
comb
var1aa var1ab varb2a varb2b varABa varABb
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
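If depending on the reshape package is not desired, the same split-and-sort can be sketched in base R with strsplit(). The data below are typed in from the printed mydf rather than re-sampled, and the helper name split_sorted is hypothetical:

```r
# Values copied from the printed mydf in the question
mydf <- data.frame(
  var1a = c("TA", "AA", "AA", "AA", "TT"),
  varb2 = c("AA", "GA", "GA", "AA", "AA"),
  varAB = c("CC", "AA", "AC", "CC", "AC"),
  stringsAsFactors = FALSE
)

# Split every cell into single letters, sort them, return two columns
split_sorted <- function(x) {
  chars <- lapply(strsplit(as.character(x), ""), sort)
  data.frame(
    a = vapply(chars, `[`, character(1), 1),
    b = vapply(chars, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )
}

comb <- do.call(cbind, lapply(mydf, split_sorted))

# Name the pairs <column>a / <column>b, as in the expected output
names(comb) <- paste0(rep(names(mydf), each = 2), c("a", "b"))
comb
```

This assumes every cell holds exactly two letters; cells of other lengths would need extra handling.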
