Creating a matrix of conditional percentages in R

To begin, I have put together a nested for loop that creates the object I am after, and it works fine for a small toy data set. However, the data I will be working with is generally much larger, so I am trying to determine whether an R package exists with a built-in function for this task.
The final object is a data frame or matrix showing the conditional percentage of a column item given the reference row item. Here is the toy data and the nested for loop that generates the final output object.
mylist <- list(
  ID001 = c("apple", "orange", "grape"),
  ID002 = c("banana", "grape"),
  ID003 = c("apple", "pineapple"),
  ID004 = c("orange", "apple"),
  ID005 = c("orange", "grape", "apple"))
dat <- reshape2::melt(mylist)
names(dat) <- c("fruit", "id")
dat <- dat[, c(2, 1)]
theFruit <- unique(dat$fruit)
n <- length(theFruit)
final.df <- data.frame(matrix(nrow = n, ncol = n, dimnames = list(theFruit, theFruit)))
for(i in theFruit){
  for(j in theFruit){
    tempid1 <- dat[dat$fruit == i, ]$id
    tempid2 <- dat[dat$fruit == j, ]$id
    final.df[i, j] <- round(length(which(tempid1 %in% tempid2)) / length(tempid1), 2)
  }
}
final.df
          apple orange grape banana pineapple
apple      1.00   0.75  0.50   0.00      0.25
orange     1.00   1.00  0.67   0.00      0.00
grape      0.67   0.67  1.00   0.33      0.00
banana     0.00   0.00  1.00   1.00      0.00
pineapple  1.00   0.00  0.00   0.00      1.00
Reading the output we see that, given a person ate an apple (apple row), 75% also ate an orange (orange column). Similarly, given a person ate an orange (orange row), 100% also ate an apple (apple column). The matrix is not intended to be symmetric (it is not the intersection of the two fruits eaten); each entry is the column item conditioned on the row item.
This seems akin to a market basket analysis application, and I have been working with the arules package over the past few days to get at this. In the vernacular of the arules package, I would call the percentages populating the data frame support values, but I have not been able to generate a matrix or data frame of all of them from arules.
The data I will be working with will have a couple million IDs but only about 150 "products", so the output matrix would only be about 150x150. I can use arules to identify the compelling pairwise relationships, but there is interest in seeing ALL of the conditionals.
Does anyone know if arules or another package can accomplish this?
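For reference, the same matrix can also be computed without the nested loop by crossing the ID-by-fruit incidence table (a base-R sketch on the toy data only; with a couple million IDs the incidence matrix would presumably need to be sparse, e.g. built with the Matrix package). Note the rows and columns come out in alphabetical order rather than in the order of theFruit.
inc <- 1 * (table(dat$id, dat$fruit) > 0)   # ID x fruit incidence matrix (0/1)
co  <- crossprod(inc)                       # fruit x fruit co-occurrence counts
round(co / diag(co), 2)                     # each row divided by that fruit's ID count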

You are looking for the confidence values (Wikipedia). You get output similar to yours with arules like this:
library(arules)
library(reshape2)
trans <- as(mylist, "transactions")
rules <- apriori(trans, parameter = list(supp = 0, conf = 0, minlen=2, maxlen=2))
df <- inspect(rules)[, c("lhs", "rhs", "confidence")]
dcast(df, lhs~rhs, value.var="confidence", fill=1)
# lhs {apple} {banana} {grape} {orange} {pineapple}
# 1 {apple} 1.0000000 0.0000000 0.5000000 0.7500000 0.25
# 2 {banana} 0.0000000 1.0000000 1.0000000 0.0000000 0.00
# 3 {grape} 0.6666667 0.3333333 1.0000000 0.6666667 0.00
# 4 {orange} 1.0000000 0.0000000 0.6666667 1.0000000 0.00
# 5 {pineapple} 1.0000000 0.0000000 0.0000000 0.0000000 1.00
Of course, you can turn the first column into row names and convert the data frame to a matrix afterwards. I leave that up to you.
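For completeness, a minimal sketch of that last step (assuming the dcast result is stored in df_conf; the names here are only illustrative):
df_conf <- dcast(df, lhs ~ rhs, value.var = "confidence", fill = 1)
conf_mat <- as.matrix(df_conf[, -1])   # drop the lhs column, keep the numeric part
rownames(conf_mat) <- df_conf$lhs      # use the lhs labels as row names
conf_mat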

Related

Apply two different formulas on four data frame columns

I want to apply two different formulas to four columns of my dataframe df. I have done this manually, but since my original data frame has several columns, I want to be able to use loops or case_when to do this faster.
Here's what the sample dataframe df looks like:
A B C D
20 100 4 1200
40 150 6 2300
34 200 3 1230
32 225 9 1100
12 220 10 1000
Formula 1:
(x-min(x))/(max(x)-min(x))
Formula 2:
(min(x)-x)/(max(x)-min(x))
I'd like to apply formula 1 on columns B and D and formula 2 on columns A and C.
After applying the formula, I want to store the values in a different dataframe but with the same column names.
Here's what I did:
formula_1 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
formula_2 <- function(x) {
  (min(x) - x) / (max(x) - min(x))
}
Create an empty dataframe BI_score
BI_score$B <- formula_1(df$B)
BI_score$D <- formula_1(df$D)
BI_score$A <- formula_2(df$A)
BI_score$C <- formula_2(df$C)
EDIT
As there are some NA and Inf values, if we want to exclude them from the calculation we can handle it by updating the functions as below and then applying them to the columns as shown previously.
formula_1 <- function(x) {
  temp <- x[is.finite(x)]
  replace(x, is.finite(x), (temp - min(temp)) / (max(temp) - min(temp)))
}
formula_2 <- function(x) {
  temp <- x[is.finite(x)]
  replace(x, is.finite(x), (min(temp) - temp) / (max(temp) - min(temp)))
}
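A quick illustrative check (not from the original post): the finite values are rescaled while NA and Inf pass through untouched.
formula_1(c(10, NA, 20, Inf, 30))
# [1] 0.0  NA 0.5 Inf 1.0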
The most straightforward approach would be to use lapply to apply the functions separately to the selected columns.
BI_score <- df
fm1_cols <- c("B", "D")
fm2_cols <- c("A", "C")
BI_score[fm1_cols] <- lapply(df[fm1_cols], formula_1)
BI_score[fm2_cols] <- lapply(df[fm2_cols], formula_2)
BI_score
# A B C D
#1 -0.29 0.00 -0.14 0.154
#2 -1.00 0.40 -0.43 1.000
#3 -0.79 0.80 0.00 0.177
#4 -0.71 1.00 -0.86 0.077
#5 0.00 0.96 -1.00 0.000
As mentioned by @Sotos, if you want to apply the functions to alternating columns you could do
BI_score[c(TRUE, FALSE)] <- lapply(df[c(TRUE, FALSE)], formula_1)
BI_score[c(FALSE, TRUE)] <- lapply(df[c(FALSE, TRUE)], formula_2)
Just for fun, an approach using dplyr:
library(dplyr)
bind_cols(df %>% select(fm1_cols) %>% mutate_all(formula_1),
          df %>% select(fm2_cols) %>% mutate_all(formula_2))
If your goal is to apply the two functions on alternating columns, then you can do it via logical indexing
cbind.data.frame(sapply(df[c(TRUE, FALSE)], formula_2),
                 sapply(df[c(FALSE, TRUE)], formula_1))
# A C B D
#1 -0.2857143 -0.1428571 0.00 0.15384615
#2 -1.0000000 -0.4285714 0.40 1.00000000
#3 -0.7857143 0.0000000 0.80 0.17692308
#4 -0.7142857 -0.8571429 1.00 0.07692308
#5 0.0000000 -1.0000000 0.96 0.00000000
We can use mutate_at from dplyr:
library(dplyr)
df %>%
  mutate_at(vars(B, D), formula_1) %>%
  mutate_at(vars(A, C), formula_2)
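As a side note, mutate_at is superseded in current dplyr (1.0 and later) by across(); a roughly equivalent sketch:
library(dplyr)
df %>%
  mutate(across(c(B, D), formula_1),
         across(c(A, C), formula_2))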

How to build a Markov's chain transition probability matrix

I am learning R on my own and I am having some trouble trying to build a transition probability matrix in RStudio using the markovchain package. First I tried to calculate the transition probabilities of a DNA sequence.
ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT
But how can the transition probability matrix be calculated for a sequence like this? I was thinking of using R indexing, but I don't really know how to compute those transition probabilities.
Is there a way of doing this in R?
I am guessing that the output matrix of probabilities should look something like this:
A T C G
A 0.60 0.10 0.10 0.20
T 0.10 0.50 0.30 0.10
C 0.05 0.20 0.70 0.05
G 0.40 0.05 0.05 0.50
You can use the markovchain package for help with this. First, your data
seq <- "ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT"
Then use the package
library(markovchain)
base_sequence <- strsplit(seq, "")[[1]]
mcX <- markovchainFit(base_sequence)$estimate
mcX
# A C G T
# A 0.3000000 0.2250000 0.2583333 0.2166667
# C 0.2857143 0.2619048 0.2380952 0.2142857
# G 0.3764706 0.1882353 0.2117647 0.2235294
# T 0.3068182 0.2159091 0.1818182 0.2954545
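If you need the estimated probabilities as a plain numeric matrix for further computation, the fitted object keeps them in a slot (this is how current versions of markovchain store it; check str(mcX) if yours differs):
P <- mcX@transitionMatrix   # plain numeric matrix of transition probabilities
P["A", "T"]                 # e.g. estimated probability that T follows A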
Create DNA
DNA <- "ATTCAACACATCCAGCCACATGCTCCGAGAGGAGGCAGAGGGCCCCCGGAATGATGCTTACCGAGATTCTTGTTTTTATCCTCGTGGTTGTTTAAAAACGAGTTGAAACTGACGGCATGTCGGACTATAAGCTACTTACTCACCATAGACGTGACCATAGGCCCTAAAACGTTACCGAGATATTCACTTCTAATAACAGTTGTCGGCAGAGCCAAAAGGCCGGGTGATAATACTTTAAAAAGGGAGTTGATTGTTGTATCTAATCCTAGAATGTCAAGAGCGACCATAACAAGATAATTCGGCAGAGCCAGAAAGCGTTCAAGGACTAGAACCATACCGAGACGCAAACGTTCAGGTCGAACTCTAATACCGATTAGT"
Split it character by character
DNA_list <- unlist(strsplit(DNA, split = ""))
Retrieve unique elements
DNA_unique <- unique(DNA_list)
Create an empty matrix
matrix <- matrix(0, ncol = length(DNA_unique), nrow=length(DNA_unique))
Fill it: for each pair of consecutive elements (element i and element i + 1), add one to the corresponding cell of the matrix.
for (i in 1:(length(DNA_list) - 1)){
  index_of_i <- DNA_unique == DNA_list[i]
  index_of_i_plus_1 <- DNA_unique == DNA_list[i + 1]
  matrix[index_of_i, index_of_i_plus_1] <- matrix[index_of_i, index_of_i_plus_1] + 1
}
Normalize it
matrix <- matrix / rowSums(matrix)
> matrix
[,1] [,2] [,3] [,4]
[1,] 0.3000000 0.2166667 0.2250000 0.2583333
[2,] 0.3068182 0.2954545 0.2159091 0.1818182
[3,] 0.2857143 0.2142857 0.2619048 0.2380952
[4,] 0.3764706 0.2235294 0.1882353 0.2117647
NB: the rows and columns above follow the order of DNA_unique (A, T, C, G). There might be a faster way to do this if you have a really large DNA sequence to process, but here it seems to be fast enough.
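For what it's worth, a loop-free sketch of the same counting step using table() on consecutive bases (my own addition, not from the answers above); it also yields labelled rows and columns:
pairs <- table(head(DNA_list, -1), tail(DNA_list, -1))   # counts of base -> following base
pairs / rowSums(pairs)                                   # row-normalised transition matrix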

Association rules between many continuous variables

I have a large dataset and I'm trying to mine association rules between the variables.
My problem is that I have 160 variables among which I have to look for association rules, and I also have more than 1800 itemsets.
Furthermore, my variables are continuous. To mine association rules I usually use the apriori algorithm, but as is well known, this algorithm requires categorical variables.
Does anyone have any suggestions on what kind of algorithm I can use in this case?
A restricted example of my dataset is the following:
ID_Order  Model  ordered quantity
A.1       typeX  20
A.1       typeZ  10
A.1       typeY   5
B.2       typeX  16
B.2       typeW  12
C.3       typeZ   1
D.4       typeX   8
D.4       typeG   4
...
My goal would be mining association rules and correlations between different products, maybe with a neural network algorithm in R. Does anyone have any suggestions on how to solve this problem?
Thanks in advance
You can create transactions from your dataset like this:
library(dplyr)
This function is used to get the transactions per ID_Order:
concat <- function(x) {
  return(list(as.character(x)))
}
Group df by ID_Order and concatenate. pull() returns the concatenated Models in a list.
a_list <- df %>%
  group_by(ID_Order) %>%
  summarise(concat = concat(Model)) %>%
  pull(concat)
Set names to ID_Order:
names(a_list) <- unique(df$ID_Order)
Then you can use the package arules:
Get object of transactions class:
transactions <- as(a_list, "transactions")
Extract rules. You can set minimum support and minimum confidence in supp and conf resp.
rules <- apriori(transactions,
                 parameter = list(supp = 0.1, conf = 0.5, target = "rules"))
To inspect the rules use:
inspect(rules)
And this is what you get:
lhs rhs support confidence lift count
[1] {} => {typeZ} 0.50 0.50 1.0000000 2
[2] {} => {typeX} 0.75 0.75 1.0000000 3
[3] {typeW} => {typeX} 0.25 1.00 1.3333333 1
[4] {typeG} => {typeX} 0.25 1.00 1.3333333 1
[5] {typeY} => {typeZ} 0.25 1.00 2.0000000 1
[6] {typeZ} => {typeY} 0.25 0.50 2.0000000 1
[7] {typeY} => {typeX} 0.25 1.00 1.3333333 1
[8] {typeZ} => {typeX} 0.25 0.50 0.6666667 1
[9] {typeY,typeZ} => {typeX} 0.25 1.00 1.3333333 1
[10] {typeX,typeY} => {typeZ} 0.25 1.00 2.0000000 1
[11] {typeX,typeZ} => {typeY} 0.25 1.00 4.0000000 1
From the example section of ? transactions:
## example 4: creating transactions from a data.frame with
## transaction IDs and items (by converting it into a list of transactions first)
a_df3 <- data.frame(
  TID = c(1, 1, 2, 2, 2, 3),
  item = c("a", "b", "a", "b", "c", "b")
)
a_df3
trans4 <- as(split(a_df3[,"item"], a_df3[,"TID"]), "transactions")
trans4
inspect(trans4)
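Regarding the continuous quantities in the question: a common workaround (a sketch only, assuming columns named ID_Order, Model and quantity; adjust to your real column names) is to bin the quantity with arules::discretize and fold the bin into the item label, so apriori can still be applied:
library(arules)
df$qty_bin <- discretize(df$quantity, method = "frequency", breaks = 3)  # low/medium/high bins
df$item    <- paste(df$Model, df$qty_bin, sep = "=")                     # e.g. "typeX=[8,16)"
trans_q <- as(split(df$item, df$ID_Order), "transactions")
rules_q <- apriori(trans_q, parameter = list(supp = 0.1, conf = 0.5, target = "rules"))
inspect(head(rules_q, 5))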

Random sample in R when data is in long format

I need to randomly sample a dataset which is arranged in long format. In my dataset, each subject has 4 observations, so if I randomly sample rows I randomly lose one or more observations per subject.
This is simulated data for illustration purposes; my real data is much bigger.
sub sex group dv1 dv2
P1 m A 0.66 0.94
P1 m B 0.98 0.26
P1 m C 0.02 0.03
P1 m D 0.60 0.30
P2 m A 0.92 0.99
P2 m B 0.82 0.09
P2 m C 0.44 0.67
P2 m D 0.53 0.80
P3 f A 0.29 0.22
P3 f B 0.46 0.20
P3 f C 0.37 0.77
P3 f D 0.76 0.54
P4 m A 0.28 0.99
P4 m B 0.16 0.57
P4 m C 0.46 0.75
P4 m D 0.28 0.21
In this example, I need to randomly select 2 males. I tried using the dplyr package (see below), but if I ask for a sample of 2 it just gives me 2 rows for sex == "m" and 2 rows for sex == "f", i.e. 4 randomly chosen rows in total. What I need is 8 rows, where 4 come from one male and 4 from another. Changing the grouping variable to sub doesn't work, as it complains that there are only 2 levels in the group (actually, it would work in this toy example since there are 4 rows for each sub, but note that I will be choosing around 50 samples from a much bigger dataset). Also, it would just give me 2 random rows for each sub, which is not what I need.
library(dplyr)
subset <- data %>%
  group_by(sex) %>%
  sample_n(2)
Please do not suggest reshaping the data to wide format and sampling it there, as I know I can do that. I am sure there must be a way to sample in long format.
I would sample from the patient names and then filter by those sampled names:
Look at all males
male_subset <- data %>% filter(sex == "m")
Look for unique male ID
male_IDs <- unique(male_subset$sub)
Sample from the unique IDs
sampled_IDs <- sample(male_IDs, 2)
Now you subset your data based on these sampled IDs:
data %>% filter(sub %in% sampled_IDs)
This should return all four rows for each of the 2 sampled individuals.
I'm not sure if I've quite understood what you want. Would this do it?
data %>% filter(sex == 'm') %>% filter(sub %in% sample(paste0('P',1:4), 2))
You'd have to change what's in the paste0 function for your real data, of course.
In base R,
set.seed(1)
subset <- sample(data[data$sex == "m", ]$sub, 2)
data_subset <- data[data$sub %in% subset, ]
nrow(data_subset)
# [1] 8
Works, but not flashy.
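Another dplyr variant of the same idea (a sketch; slice_sample() needs dplyr 1.0 or later): sample the male IDs first, then keep all of their rows with a semi-join.
library(dplyr)
set.seed(1)
sampled_ids <- data %>%
  filter(sex == "m") %>%
  distinct(sub) %>%
  slice_sample(n = 2)
data %>% semi_join(sampled_ids, by = "sub")   # all 4 rows for each of the 2 sampled males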

Aggregating columns

I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1 and then aggregate these two columns. The aggregated column becomes the new column 1, and the column that was most correlated with it is removed from the set, so the data is reduced by one column. I then repeat the process until the data frame result has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, and so on. I am therefore wondering if there is an efficient or quicker way to get to the result I'm after. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash comes from memory overheads during the loop and not from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.tables and taking advantage of their "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance and comment on the results. Cheers.
library(data.table)
n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA, nrow = r, ncol = n)))
temp <- copy(df)                        # Create a temporary table in which I calculate the correlations
set(result, j = 1L, value = temp[[1]])  # The first column is the same
for (icol in as.integer(2:n)) {
  mch <- match(c(max(cor(temp)[-1, 1])), cor(temp)[, 1])                            # Determine which column is correlated most with column 1
  set(x = result, i = NULL, j = as.integer(icol), value = temp[[1]] + temp[[mch]])  # Aggregate and place the result in the results table
  set(x = temp, i = NULL, j = 1L, value = result[[icol]])                           # Set the result as the new 1st column
  set(x = temp, i = NULL, j = as.integer(mch), value = NULL)                        # Remove the used column
}
Try
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor]
  temp[, 1] <- result[, i]   # Set result as new 1st column
  temp[, maxcor] <- NULL     # Remove column
}
The error was caused because, in the last iteration, subsetting temp yields a single column, and standard R behavior is to drop the result from data frame to vector in such cases, which makes sapply iterate over the individual values rather than over the remaining column.
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.
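Putting both remarks together, a sketch of the full base-R loop that computes each remaining column's correlation with column 1 only once per iteration (it assumes temp and result both start out as copies of df):
temp <- df
result <- df
for (i in 2:n) {
  cors <- sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))  # computed once per iteration
  maxcor <- names(which.max(cors))   # most positive; use which.max(abs(cors)) for the strongest regardless of sign
  result[, i] <- temp[, 1] + temp[, maxcor]
  temp[, 1] <- result[, i]
  temp[, maxcor] <- NULL
}
result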
