In R, compare one column value to all other columns

I'm very new to R and I have a question which might be very simple for the experts here.
Let's say I have a table "sales", which includes 4 customer IDs (123-126) and 4 products (A, B, C, D).
ID A B C D
123 0 1 1 0
124 1 1 0 0
125 1 1 0 1
126 0 0 0 1
I want to calculate the overlaps between products. So for A, the number of IDs that have both A and B is 2. Similarly, the overlap between A and C is 0 and the overlap between A and D is 1. Here is my code for the A and B overlap:
overlap <- sales[which(sales[, "A"] == 1 & sales[, "B"] == 1), ]
countAB <- count(overlap, "ID")  # count() from the plyr package
I want to repeat this calculation for all 4 products, so A overlaps with B, C, D and B overlaps with A, C, D, etc. How can I change the code to accomplish this?
I want the final output to be the number of IDs for each two-product combination. It's a product affinity exercise and I want to find out, for one product, which products sold the most with it. For example, for A, the products most often sold with it are B, followed by D, then C. Some sorting needs to be added to the code to get to this, I think.
Thanks for your help!

#x1 is your dataframe
x1 <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L),
                     B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L),
                     D = c(0L, 0L, 1L, 1L)),
                .Names = c("ID", "A", "B", "C", "D"),
                class = "data.frame", row.names = c(NA, -4L))
# get all pairwise combinations of colnames, excluding the first ("ID")
k1 <- combn(colnames(x1[, -1]), 2)
# create two lists a1 and a2 so that we can iterate over each element
a1 <- as.list(k1[seq(1, length(k1), 2)])
a2 <- as.list(k1[seq(2, length(k1), 2)])
# your own function with varying i and j
mapply(function(i, j) length(x1[which(x1[, i] == 1 & x1[, j] == 1), 1]), a1, a2)
[1] 2 0 1 1 1 0
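The result above is an unlabeled vector, so it is easy to lose track of which count belongs to which pair. A small self-contained variant (not part of the original answer) that labels each count with its product pair and sorts them, which also covers the ranking the question asks for:

```r
x1 <- data.frame(ID = 123:126,
                 A = c(0, 1, 1, 0), B = c(1, 1, 1, 0),
                 C = c(1, 0, 0, 0), D = c(0, 0, 1, 1))
# all product pairs as columns of a 2-row matrix
k1 <- combn(colnames(x1)[-1], 2)
# count the rows where both products in the pair are 1
counts <- apply(k1, 2, function(p) sum(x1[, p[1]] == 1 & x1[, p[2]] == 1))
names(counts) <- paste(k1[1, ], k1[2, ], sep = "-")
sort(counts, decreasing = TRUE)  # pairs ranked by how often they sell together
```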

Here's a possible solution :
sales <- read.csv(text =
"ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")
# get product names
prods <- colnames(sales)[-1]
# generate all products pairs (and transpose the matrix for convenience)
combs <- t(combn(prods,2))
# turn the combs into a data.frame with column P1,P2
res <- as.data.frame(combs)
colnames(res) <- c('P1','P2')
# for each combination row :
# - subset sales selecting only the products in the row
# - count the number of rows summing to 2 (if sum=2 the 2 products have been sold together)
# N.B.: length(which(logical_condition)) can be implemented with sum(logical_condition)
# since TRUE and FALSE are automatically coerced to 1 and 0
# finally add the resulting vector to the newly created data.frame
res$count <- apply(combs,1,function(comb){sum(rowSums(sales[,comb])==2)})
> res
P1 P2 count
1 A B 2
2 A C 0
3 A D 1
4 B C 1
5 B D 1
6 C D 0

You can use matrix multiplication (melt here comes from the reshape package, which names the molten columns X1 and X2):
library(reshape)
m <- as.matrix(d[-1])
z <- melt(crossprod(m, m))
z[as.integer(z$X1) < as.integer(z$X2),]
# X1 X2 value
# 5 A B 2
# 9 A C 0
# 10 B C 1
# 13 A D 1
# 14 B D 1
# 15 C D 0
where d is your data frame:
d <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", "A", "B", "C", "D"), class = "data.frame", row.names = c(NA, -4L))
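Why crossprod counts co-purchases: entry (i, j) of t(m) %*% m is the dot product of columns i and j, and for 0/1 data that dot product is exactly the number of customers who bought both products (the diagonal holds each product's own total). A quick self-contained check on the same data:

```r
# columns are filled product by product; values match the sales table above
m <- matrix(c(0, 1, 1, 0,   # A
              1, 1, 1, 0,   # B
              1, 0, 0, 0,   # C
              0, 0, 1, 1),  # D
            ncol = 4, dimnames = list(NULL, c("A", "B", "C", "D")))
crossprod(m)
#   A B C D
# A 2 2 0 1
# B 2 3 1 1
# C 0 1 1 0
# D 1 1 0 2
```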
[Update]
To calculate the product affinity, you can do:
z2 <- subset(z,X1!=X2)
do.call(rbind,lapply(split(z2,z2$X1),function(d) d[which.max(d$value),]))
# X1 X2 value
# A A B 2
# B B A 2
# C C B 1
# D D A 1
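which.max only reports each product's single best partner. If the full ranking is wanted (as in the question: for A, B followed by D, then C), here is a self-contained sketch along the same crossprod lines:

```r
d <- data.frame(ID = 123:126,
                A = c(0, 1, 1, 0), B = c(1, 1, 1, 0),
                C = c(1, 0, 0, 0), D = c(0, 0, 1, 1))
cp <- crossprod(as.matrix(d[-1]))
diag(cp) <- NA   # drop each product paired with itself
# one column per product: its partners, best-selling companion first
ranked <- apply(cp, 1, function(r) names(sort(r[!is.na(r)], decreasing = TRUE)))
ranked[, "A"]
# [1] "B" "D" "C"
```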

You might want to take a look at the arules package. It does exactly what you are looking for.
Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.
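A minimal sketch of that route, assuming the arules package is installed (the transactions coercion and crossTable are standard arules API, but check your version's documentation): a logical item matrix is coerced to transactions, and crossTable returns the pairwise co-occurrence counts directly.

```r
library(arules)
m <- matrix(c(0, 1, 1, 0,
              1, 1, 1, 0,
              1, 0, 0, 0,
              0, 0, 1, 1) == 1,
            ncol = 4,
            dimnames = list(as.character(123:126), c("A", "B", "C", "D")))
trans <- as(m, "transactions")        # one transaction per customer ID
crossTable(trans, measure = "count")  # pairwise co-occurrence counts
```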

Related

Summarize Gaps in Binary Data using R

I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the system was on and 0 indicates that the system was off.
I am trying to find a way to summarize the gaps between the on/off transitions of these systems.
For example,
for the first row, it stops working after 'I'
for the second row, it works from 'E' to 'G', then again from 'I' to 'J' and at 'M', but is off otherwise.
Is there a way to summarize this?
I wish to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L
Here's a tidyverse solution:
library(tidyverse)
df %>%
  rowid_to_column() %>%
  gather(col, val, -rowid) %>%
  group_by(rowid) %>%
  # This counts the number of times a new streak starts
  mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
  filter(val == 1) %>%
  group_by(rowid, grp_num) %>%
  summarise(num_1s = n(),
            range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
An option with data.table. Convert the 'data.frame' to 'data.table' while creating a row number column (setDT), melt from 'wide' to 'long' format specifying the id.var as the row number column 'rn', create a run-length-id (rleid) column on the 'value' column grouped by 'rn', subset the rows where 'value' is 1, summarise with the number of rows (.N) and the pasted range of 'variable' values grouped by 'grp' and 'rn', assign the column no longer needed to NULL, and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
  grp := rleid(value), rn][value == 1,
  .(NumberOfOnes = .N,
    Range = paste(range(as.character(variable)), collapse = "-")),
  .(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
do.call(rbind, apply(df1, 1, function(x) {
  rl <- rle(x)
  i1 <- rl$values == 1
  l1 <- rl$lengths[i1]
  nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
                FUN = function(y) paste(range(y), collapse = "-"))[i1]
  data.frame(NumberOfOnes = l1, Range = nm1)
}))
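To make the rle step concrete, here is what it returns for the second row of the data on its own; the solution above then keeps only the runs whose value is 1:

```r
x <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0)  # second row of df1
rle(x)
# Run Length Encoding
#   lengths: int [1:7] 4 3 1 2 2 1 1
#   values : num [1:7] 0 1 0 1 0 1 0
```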
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))

Counting the elements in rows and map to column in r

I would like to summarize my data by counting the entities and creating a counting column for each entity.
Let's say:
df:
id class
1 A
1 B
1 A
1 A
1 B
1 c
2 A
2 B
2 B
2 D
I want to create a table like
id A B C D
1 3 2 1 0
2 1 2 0 1
How can I do this in R using apply function?
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
class = structure(c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 2L, 4L
), .Label = c("A", "B", "C", "D"), class = "factor")), .Names = c("id",
"class"), class = "data.frame", row.names = c(NA, -10L))
with(df, table(id, class))
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
xtabs(~ id + class, df)
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
tapply(rep(1, nrow(df)), df, length, default = 0)
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
This seems like a very strange requirement, but if you insist on using apply, then the function count counts the number of rows for which id equals x and class equals y. It is applied to every combination of id and class via nested apply calls to build the matrix a. Finally we add the row and column names.
uid <- unique(DF$id)
uclass <- unique(DF$class)
count <- function(x, y, DF) sum(x == DF$id & y == DF$class)
a <- apply(matrix(uclass), 1, function(u) apply(matrix(uid), 1, count, u, DF))
dimnames(a) <- list(uid, uclass)
giving:
> a
A B c D
1 3 2 1 0
2 1 2 0 1
Note
We used this for DF
Lines <- "id class
1 A
1 B
1 A
1 A
1 B
1 c
2 A
2 B
2 B
2 D"
DF <- read.table(text = Lines, header = TRUE)

Counting occurrence of a variable without taking account duplicates

I have a big data frame, called data, with 1,004,490 obs, and I want to analyse the success of a treatment.
ID POSITIONS TREATMENT
1 0 A
1 1 A
1 2 B
2 0 C
2 1 D
3 0 B
3 1 B
3 2 C
3 3 A
3 4 A
3 5 B
So firstly, I want to count the number of distinct patients (IDs) each treatment was given to. One treatment can be given several times to the same ID, so do I need to first delete all the duplicates and then count, or is there a function that doesn't take the duplicates into account?
What I want to have :
A : 2
B : 2
C : 2
D : 1
Then, I want to know how many times each treatment was given at the last position, but the last position differs from one ID to another.
What I want to have :
A : 0
B : 2 (for ID = 1 and 3)
C : 0
D : 1 (for ID = 2)
Thanks for your help, I am a new user of R !
Using base R, we can do,
merge(aggregate(ID ~ TREATMENT, df, FUN = function(i) length(unique(i))),
      aggregate(ID ~ TREATMENT, df[!duplicated(df$ID, fromLast = TRUE), ], toString),
      by = 'TREATMENT', all = TRUE)
Which gives,
TREATMENT ID.x ID.y
1 A 2 <NA>
2 B 2 1, 3
3 C 2 <NA>
4 D 1 2
Here is a tidyverse approach, where we get the distinct rows based on 'ID', 'TREATMENT' and get the count of 'TREATMENT'
library(tidyverse)
df1 %>%
  distinct(ID, TREATMENT) %>%
  count(TREATMENT)
# A tibble: 4 x 2
# TREATMENT n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
and for the second output, after grouping by 'ID', slice the last row (n()), create a column 'ind' filled with 1, use complete to fill the missing combinations of 'TREATMENT' with 0, then get the sum of 'ind' after grouping by 'TREATMENT'
df1 %>%
  group_by(ID) %>%
  slice(n()) %>%
  mutate(ind = 1) %>%
  complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind = 0)) %>%
  group_by(TREATMENT) %>%
  summarise(n = sum(ind))
# A tibble: 4 x 2
# TREATMENT n
# <chr> <dbl>
#1 A 0
#2 B 2
#3 C 0
#4 D 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L), POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L
), TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A",
"A", "B")), .Names = c("ID", "POSITIONS", "TREATMENT"),
class = "data.frame", row.names = c(NA, -11L))
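As a compact base R cross-check of the last-position counts (a sketch, not from the original answers; it assumes the rows are already ordered by POSITIONS within each ID, as in df1):

```r
df1 <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3),
                  POSITIONS = c(0, 1, 2, 0, 1, 0, 1, 2, 3, 4, 5),
                  TREATMENT = c("A", "A", "B", "C", "D",
                                "B", "B", "C", "A", "A", "B"))
# keep only the last row of each ID, then tabulate over all treatment levels
last <- df1$TREATMENT[!duplicated(df1$ID, fromLast = TRUE)]
table(factor(last, levels = sort(unique(df1$TREATMENT))))
#
# A B C D
# 0 2 0 1
```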

calculation of allele frequency using wig file in R

I have a matrix (similar to a wig file) like this:
Position reference A C G T N sum(total read counts)
68773265 A 1 0 0 0 0 1
68773266 C 0 1 0 1 0 2
68773267 C 0 1 1 2 0 4
To get the variant (non-reference) allele ratio,
I want to create this: (sum - reference sequence's count) / sum * 100 per position
Position reference frequency(%) sum(total read counts)
68773265 A 0 1
68773266 C 50 2
68773267 C 75 4
Please give me some advice on this problem. Thanks in advance!!
Using the subset of column names "nm1", match the "reference" column against "nm1" to get the column index, and cbind it with 1:nrow(df1) to create a row/column index. Get the rowSums of the "nm1" columns ("Sum1"), and use this to create "frequencyPercent" based on the formula in the post.
nm1 <- c('A', 'C', 'G', 'T') # this could include `N` also
indx <- cbind(1:nrow(df1), match(df1$reference, nm1))
Sum1 <- rowSums(df1[nm1])
data.frame(df1[1:2],
           frequencyPercent = 100 * (Sum1 - df1[nm1][indx]) / Sum1,
           SumTotalCounts = df1[, ncol(df1)])
Or use transform on the original dataset
transform(df1, frequencyPercent = 100 * (Sum1 - df1[nm1][indx]) / Sum1,
          check.names = FALSE)[c(1:2, 8:9)]
# Position reference sum(total read counts) frequencyPercent
#1 68773265 A 1 0
#2 68773266 C 2 50
#3 68773267 C 4 75
data
df1 <- structure(list(Position = 68773265:68773267, reference = c("A",
"C", "C"), A = c(1L, 0L, 0L), C = c(0L, 1L, 1L), G = c(0L, 0L,
1L), T = 0:2, N = c(0L, 0L, 0L), `sum(total read counts)` = c(1L,
2L, 4L)), .Names = c("Position", "reference", "A", "C", "G",
"T", "N", "sum(total read counts)"), class = "data.frame",
row.names = c(NA, -3L))

Select rows with certain value in any of the columns

I've got a data frame with gene expression data.
I'm trying to extract all rows where ANY of the columns has a value >= 2 (the data is already in log2 values), but I can't seem to get there. My data is:
A B C D
Gene1 1 2 3 1
Gene2 2 1 1 4
Gene3 1 1 0 1
Gene4 1 2 0 1
I would only like to retain Gene1, Gene2 and Gene4, without stating all the columns explicitly (as this is just a toy example).
You could use rowSums on a logical matrix derived from df >= 2 and double negation (!!) to get the index of rows to subset.
df[!!rowSums(df >=2),]
# A B C D
#Gene1 1 2 3 1
#Gene2 2 1 1 4
#Gene4 1 2 0 1
Or using the reverse condition df < 2 to get the logical matrix, use rowSums, then check whether the result is less than ncol(df)
df[rowSums(df <2) < ncol(df),]
# A B C D
#Gene1 1 2 3 1
#Gene2 2 1 1 4
#Gene4 1 2 0 1
Or
df[apply(t(df>=2),2, any), ]
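The transpose in that last variant isn't needed: apply can work over rows directly with MARGIN = 1. A self-contained check on the toy data:

```r
df <- data.frame(A = c(1, 2, 1, 1), B = c(2, 1, 1, 2),
                 C = c(3, 1, 0, 0), D = c(1, 4, 1, 1),
                 row.names = paste0("Gene", 1:4))
# TRUE for each row that has at least one value >= 2
df[apply(df >= 2, 1, any), ]
#       A B C D
# Gene1 1 2 3 1
# Gene2 2 1 1 4
# Gene4 1 2 0 1
```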
data
df <- structure(list(A = c(1L, 2L, 1L, 1L), B = c(2L, 1L, 1L, 2L),
C = c(3L, 1L, 0L, 0L), D = c(1L, 4L, 1L, 1L)), .Names = c("A",
"B", "C", "D"), class = "data.frame", row.names = c("Gene1",
"Gene2", "Gene3", "Gene4"))
