Compare each element of a variable within each group - r

Consider the data frame in R:
set.seed(36)
y <- runif(10,0,200)
group <- sample(rep(1:2, each=5))
d <- data.frame(y, group)
I want to compare all y against all y within each group. The following code does this correctly:
d_split <- split(d, d$group)
a <- with(d_split[[1]],outer(y, y, "<="))
b <- with(d_split[[2]],outer(y, y, "<="))
However, I need to do this inside a function where the number of groups varies (group will be an argument of that function), so I cannot hard-code one line per group. How can I elegantly rewrite the last three lines of code to compare all y against all y within each group?

To perform the same operation for multiple groups we can use lapply and perform the outer operation for every group.
lapply(split(d, d$group), function(x) outer(x[["y"]], x[["y"]], "<="))
#$`1`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE FALSE
#[3,] TRUE TRUE TRUE FALSE TRUE
#[4,] TRUE TRUE TRUE TRUE TRUE
#[5,] TRUE TRUE FALSE FALSE TRUE
#$`2`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE TRUE FALSE
#[2,] FALSE TRUE FALSE TRUE FALSE
#[3,] TRUE TRUE TRUE TRUE TRUE
#[4,] FALSE FALSE FALSE TRUE FALSE
#[5,] TRUE TRUE FALSE TRUE TRUE
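Since the goal is to do this inside a function where the grouping varies, the lapply call can be wrapped as below. This is only a sketch: the function name compare_within_groups and its argument names are made up for illustration, and the small data frame toy is a stand-in for d.

```r
# Hypothetical helper: names and arguments are illustrative, not from the question
compare_within_groups <- function(data, value_col, group_col) {
  # Split the value column by group; works for any number of groups
  values_by_group <- split(data[[value_col]], data[[group_col]])
  # One logical matrix per group: entry [i, j] is value[i] <= value[j]
  lapply(values_by_group, function(v) outer(v, v, "<="))
}

toy <- data.frame(y = c(3, 1, 2, 5, 4), group = c(1, 1, 2, 2, 2))
res <- compare_within_groups(toy, "y", "group")
```

Passing the column names as strings keeps the helper independent of any particular data frame; res is a named list with one matrix per group.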

Here is an option without splitting
library(data.table)
setDT(d)[, as.data.table(outer(y, y, "<=")), group]
# group V1 V2 V3 V4 V5
#1: 1 TRUE TRUE FALSE FALSE FALSE
#2: 1 FALSE TRUE FALSE FALSE FALSE
#3: 1 TRUE TRUE TRUE FALSE TRUE
#4: 1 TRUE TRUE TRUE TRUE TRUE
#5: 1 TRUE TRUE FALSE FALSE TRUE
#6: 2 TRUE TRUE FALSE TRUE FALSE
#7: 2 FALSE TRUE FALSE TRUE FALSE
#8: 2 TRUE TRUE TRUE TRUE TRUE
#9: 2 FALSE FALSE FALSE TRUE FALSE
#10: 2 TRUE TRUE FALSE TRUE TRUE
Or in a 'long' format with CJ
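# Name the CJ columns explicitly; newer data.table versions may auto-name them after the inputs instead of V1/V2
setDT(d)[, CJ(V1 = y, V2 = y, sorted = FALSE), group][, V1 <= V2, group]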

Related

Is it possible to keep memory while using apply()?

I need to run lapply over an activation_status list t times, so that iteration t of the function remembers the results from iteration t-1.
The list is essentially a two-dimensional array holding the status of each item i over multiple periods t, and looks like this:
n_items <- 100
n_iterations <- 10
# One logical vector of length n_items per iteration, all initially FALSE
activation_status <-
  lapply(1:n_iterations,
         FUN = function(t, bool, i) rep(bool, i),
         FALSE, n_items)
Now during each iteration t, I randomly activate (set to TRUE) a number of items within the list but I want all the items already activated at time t-1 to stay active (note that I define activation_status within the update function so that it's accessible in the inner functions).
updateActivation <- function(t) {
activation_status[[t]] <- as.logical(rbinom(n_items, 1, prob = .5))
activation_status[[t]][activation_status[[t-1]] == TRUE] <- TRUE
}
But then
lapply(1:n_iterations, updateActivation)
throws an error:
Error in activation_status[[t - 1]] : attempt to select less than one element in get1index
I know I could use a loop, but I wonder:
1. Is it possible to do something like this with the apply functions?
2. Would it be faster?
I am not sure I fully understood the question, but it seems you are looking for a recursion, where each step depends on the previous result.
In that case Reduce() can be used instead of lapply():
activation_status <- rep(FALSE, 10)
n_iterations <- 5
Reduce(function(y, x) as.logical(rbinom(length(y), 1, prob=0.1)) | y,
x=1:n_iterations, init=activation_status, accumulate=TRUE
)
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[[3]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
[[4]]
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[[5]]
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[[6]]
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
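To make the carry-over explicit, here is a deterministic sketch of the same Reduce() pattern with the random draws replaced by fixed vectors (new_activations is made-up data). Note that the original lapply() attempt fails at t = 1 because activation_status[[0]] is not a valid index, and that assignments made inside updateActivation would only change a local copy anyway.

```r
# Made-up activation draws for three periods (stand-ins for the rbinom() calls)
new_activations <- list(
  c(TRUE,  FALSE, FALSE),
  c(FALSE, FALSE, TRUE),
  c(FALSE, FALSE, FALSE)
)

# Start with nothing active; each step ORs the new draw with the previous state
start <- rep(FALSE, 3)
history <- Reduce(function(prev, new) prev | new,
                  new_activations, accumulate = TRUE, init = start)

# history[[1]] is the initial state; history[[t + 1]] is the state after period t
```

Because each state is the logical OR of the new draw and the previous state, an item that has been activated once stays TRUE in every later element of history.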
We could probably do this without using any apply command.
#Set seed for reproducibility
set.seed(123)
#Create initialization demo data
activation_status <- rep(FALSE, 10)
#Number of values to select
n_iterations <- 5
#Sequence from 1:n_iterations
seq_n_iterations <- seq_len(n_iterations)
#Create matrix to hold output
output <- replicate(n_iterations, activation_status)
#Select n_iterations random values from 1:length(activation_status)
#You can change this if you want to use some specific distribution
points <- sample(length(activation_status), n_iterations)
#Create column indices
cols <- rep(seq_n_iterations, seq_n_iterations)
#Create row indices
rows <- points[ave(cols, cols, FUN = seq_along)]
#Change those values to TRUE
output[cbind(rows, cols)] <- TRUE
output
# [,1] [,2] [,3] [,4] [,5]
# [1,] FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE FALSE TRUE
# [7,] FALSE FALSE FALSE FALSE FALSE
# [8,] FALSE FALSE FALSE TRUE TRUE
# [9,] FALSE FALSE FALSE FALSE FALSE
#[10,] FALSE TRUE TRUE TRUE TRUE
If you want them as a list:
asplit(output, 2)
#[[1]]
# [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[[2]]
# [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[[3]]
# [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[[4]]
# [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
#[[5]]
# [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE

How many elements of a vector are smaller or equal to each element of this vector?

I am interested in writing a program that gives the number of elements of vector x that are smaller or equal to any given value within vector x.
Let's say
x <- c(1, 3, 8, 7, 6, 4, 3, 10, 12)
I want to calculate the number of elements within x which are smaller than or equal to 1, to 3, to 8, etc. For example, the fifth element, x[5], is 6, and the number of elements smaller than or equal to 6 is 5. However, I only know how to do an element-wise comparison, e.g. x[1] <= x[3].
I suppose that I will be using the for loop and have something like this here:
for (i in length(x)){
if (x[i]<=x[i]){
print(x[i])}
# count number of TRUEs
}
However, this code obviously does not do what I want.
Use outer to make all comparisons at once:
outer(x, x, "<=")
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [2,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [3,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
# [4,] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
# [5,] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
# [6,] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
# [7,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
# [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
colSums(outer(x, x, "<="))
#[1] 1 3 7 6 5 4 3 8 9
You can also use the *apply family as follows,
sapply(x, function(i) sum(x <= i))
#[1] 1 3 7 6 5 4 3 8 9
We can use findInterval
findInterval(x, sort(x))
#[1] 1 3 7 6 5 4 3 8 9
Another alternative is to use rank, which ranks the values. Setting the ties.method argument to "max" retrieves the inclusive value ("<=" versus "<").
rank(x, ties.method="max")
#[1] 1 3 7 6 5 4 3 8 9
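The approaches above can be checked against each other, together with a corrected version of the loop from the question (which only ran once because it iterated over length(x) rather than over every index). This is a sketch using the example vector; the variable names are illustrative.

```r
x <- c(1, 3, 8, 7, 6, 4, 3, 10, 12)

# Corrected loop: iterate over every index, not just length(x)
counts_loop <- integer(length(x))
for (i in seq_along(x)) {
  counts_loop[i] <- sum(x <= x[i])
}

counts_outer    <- colSums(outer(x, x, "<="))
counts_sapply   <- sapply(x, function(i) sum(x <= i))
counts_interval <- findInterval(x, sort(x))
counts_rank     <- rank(x, ties.method = "max")

# TRUE if all five approaches give the same counts
all_agree <- all(counts_loop == counts_outer,
                 counts_loop == counts_sapply,
                 counts_loop == counts_interval,
                 counts_loop == counts_rank)
```

Keep in mind that outer builds a full length(x) by length(x) matrix, so for long vectors the other approaches use considerably less memory.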

R arulesSequences - which frequent sequences are present in a transaction?

When doing frequent sequence mining, one typically wants to do the following tasks:
1. Find sequential patterns (frequent sequences).
2. Find out which sequential patterns apply to a transaction. I.e.: given a transaction, which of all of the frequent sequences found is present?
I'm having trouble doing the latter.
Using R, I am applying the cspade-algorithm from the arulesSequences package on the following toy dataset:
data <- data.frame(id = 1:10,
transaction = c("A B B A",
"A B C B D C B B B F A",
"A A B",
"B A B A",
"A B B B B",
"A A A B",
"A B B A B B",
"E F F A C B D A B C D E",
"A B B A B",
"A B"))
Then I split the data using the str_split function from package stringr:
data_for_fseq_mining <- str_split(string = data$transaction, pattern = " ")
Use identifiers to uniquely name the list elements in 'data_for_fseq_mining'. This is a prerequisite for using the function 'as.transactions' as shown below.
names(data_for_fseq_mining) <- data$id
In order to convert this kind of data to a dataset of class 'transactions' I use the following function as.transactions from https://github.com/cran/clickstream/blob/master/R/Clickstream.r.
data_for_fseq_mining_trans <- as.transactions(clickstreamList = data_for_fseq_mining)
Now that the data is in the proper format, I run the cspade algorithm with some parameters:
sequences <- cspade(data = data_for_fseq_mining_trans,
parameter = list(support = 0.3, maxsize = 10, maxlen = 10, mingap = 1, maxgap = 10),
control = list(tidList = TRUE, verbose = TRUE))
Summarizing the results (sequence and relative support):
sequences_df <- cbind(sequence = labels(sequences), support = sequences@quality)
sequence support
1 <{A}> 1.0
2 <{B}> 1.0
3 <{A},{B}> 1.0
4 <{B},{B}> 0.7
5 <{A},{B},{B}> 0.6
6 <{B},{B},{B}> 0.4
7 <{A},{B},{B},{B}> 0.4
8 <{B},{B},{B},{B}> 0.3
9 <{A},{B},{B},{B},{B}> 0.3
10 <{A},{A},{B}> 0.5
11 <{B},{A},{B}> 0.4
12 <{A},{B},{A},{B}> 0.3
13 <{A},{A}> 0.8
14 <{B},{A}> 0.6
15 <{A},{B},{A}> 0.6
16 <{B},{B},{A}> 0.5
17 <{A},{B},{B},{A}> 0.4
That's perfectly fine, but now I would like to know, for each transaction, whether each sequence is present or not (TRUE/FALSE). To do this, I tried to use the tidList:
sequences_score <- as.matrix(sequences@tidLists@data)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
[1,] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[9,] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
From this result, I assume each row corresponds to a transaction and each column to a sequence. But when looking at the 4th column, it says that pattern '<{B},{B}>' is not present in transactions 2, 4 and 7, even though these transactions clearly all contain this pattern. Are my assumptions about the output wrong?
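As a sanity check on that claim, a small base R helper can test whether a pattern occurs in order within a transaction. This is a hypothetical helper that ignores cspade's gap constraints; it is not part of arulesSequences.

```r
# Hypothetical helper (not from arulesSequences): TRUE if `pattern` occurs
# in `items` in the same order, not necessarily contiguously
contains_subsequence <- function(items, pattern) {
  pos <- 0
  for (p in pattern) {
    # Positions of the next pattern item strictly after the last match
    hits <- which(items == p)
    hits <- hits[hits > pos]
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]
  }
  TRUE
}

# Transactions 2, 4 and 7 from the toy dataset
transactions <- c("A B C B D C B B B F A", "B A B A", "A B B A B B")
has_BB <- sapply(strsplit(transactions, " "), contains_subsequence,
                 pattern = c("B", "B"))
```

Here has_BB is TRUE for all three transactions, which suggests the rows of the matrix are not in the same order as the input data.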
An alternative approach is to use this piece of code provided by juliesls: R arulesSequences Find which patterns are supported by a sequence
When applying the following lines of code, an error occurs.
ids <- unique(data_for_fseq_mining_trans@itemsetInfo$sequenceID)
sequences_score <- data.frame()
for (seq_id in 1:length(sequences)){
sequences_score[,labels(sequences[seq_id])] <- logical(0)
}
for (id in ids){
transaction_subset <- data_for_fseq_mining_trans[data_for_fseq_mining_trans@itemsetInfo$sequenceID==id]
sequences_score[id, ] <- as.logical(support(x = sequences, transactions =
transaction_subset, type="absolute"))
}
Any clues?
To see whether each sequence is present or not you can indeed use your provided code:
sequences_score <- as.matrix(sequences@tidLists@data)
However, you have to map the resulting matrix to your data using another property of your sequence object as follows:
# Get mapping ids, change to numeric values
mapping_ids <- as.numeric(sequences@tidLists@transactionInfo$sequenceID)
# Then map your matrix sequence_score to correspond to the order of your data
sequences_score <- sequences_score[order(mapping_ids), ]
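The effect of this reordering can be seen on a toy matrix. The sketch below assumes, for illustration only, that the rows are stored in the order "1", "10", "2", "3"; the real storage order in the tidLists may differ, which is why it is read from transactionInfo rather than guessed.

```r
# Toy stand-in for the tidLists matrix; ids and values are made up
stored_ids <- c("1", "10", "2", "3")
sequences_score <- matrix(c(TRUE, FALSE, TRUE, FALSE,
                            FALSE, TRUE, TRUE, FALSE),
                          nrow = 4,
                          dimnames = list(stored_ids, c("seq_a", "seq_b")))

# Convert the ids to numbers so that 10 sorts after 2 and 3
mapping_ids <- as.numeric(stored_ids)

# drop = FALSE keeps the result a matrix even with a single column
sequences_score <- sequences_score[order(mapping_ids), , drop = FALSE]
rownames(sequences_score)
```

After the reordering, row k of the matrix corresponds to the transaction with id k, so it can be compared directly with the original data frame.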

Create column based on unique values [duplicate]

I'd like to create columns in a data frame based on the unique values from a single column.
E.g.
Column1
A
B
C
Into
A B C
True False False
False True False
False False True
We can use table
!!table(1:nrow(df1), df1$Column1)
# A B C
# 1 TRUE FALSE FALSE
# 2 FALSE TRUE FALSE
# 3 FALSE FALSE TRUE
Or using mtabulate from qdapTools
library(qdapTools)
mtabulate(df1$Column1)!=0
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE
Or using model.matrix
model.matrix(~Column1-1, df1)!=0
# Column1A Column1B Column1C
#1 TRUE FALSE FALSE
#2 FALSE TRUE FALSE
#3 FALSE FALSE TRUE
You could also use a loop,
sapply(df$Column1, function(i) grepl(i, df$Column1))
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE
You can also use dcast from reshape2 package
library(reshape2)
!is.na(dcast(df, Column1 ~ Column1))[, -1]
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE
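None of the snippets above define the data, so here is a self-contained sketch that builds the example column and compares the table approach with an exact-match version based on outer. The exact-match variant is an addition for illustration: unlike grepl, which does pattern matching (a value "A" would also match "AB"), it only flags identical values.

```r
df1 <- data.frame(Column1 = c("A", "B", "C"), stringsAsFactors = TRUE)

# Indicator matrix via table(): one row per observation, one column per level
ind_table <- table(seq_len(nrow(df1)), df1$Column1) > 0

# Same result with exact matching against the factor levels
lv <- levels(df1$Column1)
ind_exact <- outer(as.character(df1$Column1), lv, "==")
colnames(ind_exact) <- lv
```

Both objects are 3 by 3 logical matrices with TRUE on the diagonal for this example.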

Combining a matrix of TRUE/FALSE into one

If I have this matrix (which I named data):
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[2,] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[5,] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[8,] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
And I want to combine the columns into one single column like this: (where one TRUE in the row equals TRUE)
[,1]
[1,] TRUE
[2,] TRUE
[3,] FALSE
[4,] TRUE
[5,] TRUE
[6,] FALSE
[7,] TRUE
[8,] TRUE
[9,] FALSE
I know I could do something like (using the |):
data2[1:9,1]<-data[,1]|data[,2]|data[,3]|data[,4]…
data2 would then contain a single column combining the different columns. But this does not scale well when there are many columns (for example ncol=100).
I guess there is some simple way of doing it?
Thanks
Here is another answer that takes advantage of how R converts between logicals and numerics:
When going from logical to numeric, FALSE becomes 0 and TRUE becomes 1 so rowSums gives you the number of TRUE per row:
rowSums(data)
# [1] 3 3 0 3 3 0 3 3 0
When going from numeric to logical, 0 becomes FALSE, anything else is TRUE, so you can feed the output of rowSums to as.logical and it will indicate if a row has at least one TRUE:
as.logical(rowSums(data))
# [1] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
I like Tyler's answer though, it might be less efficient (to be proven) but I find it more intuitive.
You could use any with apply as in:
mat <- matrix(sample(c(TRUE, FALSE), 100,TRUE), 10)
apply(mat, 1, any)
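Both answers can be tried on the matrix from the question. The sketch below rebuilds that matrix from its pattern (rows 1, 4, 7 are TRUE in columns 2, 5, 8; rows 2, 5, 8 are TRUE in columns 3, 6, 9) and checks that the two approaches agree; the variable names by_rowsums and by_apply are illustrative.

```r
# Rebuild the 9 x 9 example matrix from the question
data <- matrix(FALSE, nrow = 9, ncol = 9)
data[c(1, 4, 7), c(2, 5, 8)] <- TRUE
data[c(2, 5, 8), c(3, 6, 9)] <- TRUE

# A row is TRUE if it contains at least one TRUE
by_rowsums <- rowSums(data) > 0
by_apply   <- apply(data, 1, any)

# Single-column matrix, as requested in the question
data2 <- matrix(by_rowsums, ncol = 1)
```

Using rowSums(data) > 0 is equivalent to as.logical(rowSums(data)) here and states the intent a little more directly.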

Resources