I'm working with a large binary data matrix (4547 x 5415) for association rule mining. As usual, each row is a transaction and each column is an item. Whenever I call the arules package, it fails with an arcane error message referencing the trio library. Does anyone have experience with this type of error?
i[1:10,1:10]
101402 101403 101404 101405 101406 101411 101412 101413 101414 101415
[1,] 0 0 0 1 0 0 1 0 0 0
[2,] 0 1 0 0 0 0 1 0 0 0
[3,] 0 0 0 0 0 0 1 0 0 0
[4,] 0 0 0 1 0 0 0 0 0 1
[5,] 0 0 0 1 0 0 0 0 0 1
[6,] 0 1 0 0 0 1 0 0 0 0
[7,] 0 0 0 0 0 0 1 0 0 0
[8,] 0 0 1 0 0 0 0 0 0 1
[9,] 0 0 0 0 0 1 0 0 0 0
[10,] 0 0 0 0 1 0 1 0 0 0
rules <- apriori(i, parameter=list(support=0.001, confidence=0.5))
parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen target
0.5 0.1 1 none FALSE TRUE 0.001 1 10 rules
ext
FALSE
algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[5415 item(s), 4547 transaction(s)] done [0.47s].
sorting and recoding items ... [4908 item(s)] done [0.18s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2
Error in apriori(i, parameter = list(support = 0.001, confidence = 0.5)) :
  internal error in trio library
Reproducible example:
set.seed(1) # make the random matrix reproducible
y <- matrix(nrow = 4547, ncol = 5415)
y <- apply(y, c(1, 2), function(x) sample(c(0, 1), 1))
rules <- apriori(y, parameter = list(support = 0.001, confidence = 0.5))
The problem is a bug in the error handling of the arules package. You are running out of memory, and when the apriori code tries to create the appropriate error message, it instead makes an invalid call to printf, which under Windows is handled by the trio library. So, in short, you should have gotten an out-of-memory error.
This problem will be resolved in arules version 1.1-4.
To avoid running out of memory, you need to increase the support and/or restrict the length of the itemsets (maxlen in the list passed to parameter).
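For example, a call along these lines keeps the search space small (the exact values are illustrative and need to be tuned to your data):
rules <- apriori(i, parameter = list(support = 0.01, confidence = 0.5, maxlen = 3))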
-Michael
I am trying to run the R code from Network-Analysis on Attitudes: A Brief Tutorial.
You can find it here.
First, we load the cognitive attitude items:
library(foreign) # read.dta() comes from the foreign package
unzip('ANES2012.zip')
ANES2012 <- read.dta('anes_timeseries_2012_Stata12.dta') # loads the data into the object ANES2012
#########################
# Recode variables
# Items regarding Obama
ObamaCog <- data.frame(Mor = as.numeric(ANES2012$ctrait_dpcmoral), # a data frame containing the items tapping beliefs
                       Led = as.numeric(ANES2012$ctrait_dpclead),
                       Car = as.numeric(ANES2012$ctrait_dpccare),
                       Kno = as.numeric(ANES2012$ctrait_dpcknow),
                       Int = as.numeric(ANES2012$ctrait_dpcint),
                       Hns = as.numeric(ANES2012$ctrait_dpchonst))
ObamaCog[ObamaCog < 3] <- NA # values below 3 represent missing values
I had to change the code a little, as the .binarize function didn't work (I couldn't load a package, "cmprsk", that it needed). So I installed the biclust package instead and was able to binarize the data:
ObamaCog <- binarize(ObamaCog, threshold = 5)
Then we did the same for the affective attitudes:
ObamaAff <- data.frame(Ang = as.numeric(ANES2012$candaff_angdpc), # a data frame containing the items tapping feelings
                       Hop = as.numeric(ANES2012$candaff_hpdpc),
                       Afr = as.numeric(ANES2012$candaff_afrdpc),
                       Prd = as.numeric(ANES2012$candaff_prddpc))
ObamaAff[ObamaAff < 3] <- NA # values below 3 represent missing values
ObamaAff <- binarize(ObamaAff, 4) # endorsing a feeling is encoded as 1, not endorsing as 0
And we created one Obama matrix out of them:
Obama <- data.frame(ObamaCog,ObamaAff)
Then we omit the NA values:
Obama <- na.omit(Obama)
And I checked:
write.csv(Obama, file = "Obama-Excel1")
There are no more NA values in my matrix.
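A quicker way to verify this inside R, without writing to a file (anyNA() and dim() are base R):
anyNA(Obama) # FALSE if no missing values remain
dim(Obama)   # rows = observations, columns = variables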
And I think it fits the required structure: nobs x nvars
Mor Led Car Kno Int Hns Ang Hop Afr Prd
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
...
60 0 0 0 0 0 0 0 0 0 0
61 1 1 1 1 1 1 0 0 0 0
62 0 0 0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0 0 0 0
65 0 1 1 0 0 1 0 0 0 0
66 1 1 1 1 1 1 0 0 0 0
67 0 0 0 0 0 0 0 0 0 0
This continues up to row 5914, and any row that previously contained an NA value is now missing (for example, row 64).
When I then try to run the IsingFit function:
ObamaFit <- IsingFit(Obama)
it doesn't work, and I get the error message:
Error in y %*% rep(1, nc) : non-conformable arguments
I am a beginner in R, and I assumed that "non-conformable arguments" referred to NA values, but this doesn't seem to be the case. Can anyone tell me what the error message means and how I might solve the problem, so that I can use the IsingFit function?
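Edit: a minimal sketch (hypothetical dimensions, unrelated to IsingFit) reproduces the same message with plain matrices, which makes me think it is about matrix dimensions rather than NAs:
y <- matrix(1, nrow = 3, ncol = 4)
y %*% rep(1, 3) # Error in y %*% rep(1, 3) : non-conformable arguments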
What I need is a function to find words within a certain 'word distance'. For example, in the sentence "He had a bag of tools in his car.", the pair 'bag' and 'tools' is what interests me. With the quanteda kwic function I can find 'bag' and 'tools' individually, but that often gives me an overload of results. I need, say, 'bag' and 'tools' within five words of each other.
You can use the fcm() function to count co-occurrences within a fixed window, for instance 5 words. This creates a "feature co-occurrence matrix", and the window can be defined for any span of tokens, or as the context of an entire document.
For your example, or at least an example based on my interpretation of your question, this would look like:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
d1 = "He had a bag of tools in his car",
d2 = "bag other other other other tools other"
)
fcm(txt, context = "window", window = 5)
## Feature co-occurrence matrix of: 10 by 10 features.
## 10 x 10 sparse Matrix of class "fcm"
## features
## features He had a bag of tools in his car other
## He 0 1 1 1 1 1 0 0 0 0
## had 0 0 1 1 1 1 1 0 0 0
## a 0 0 0 1 1 1 1 1 0 0
## bag 0 0 0 0 1 2 1 1 1 4
## of 0 0 0 0 0 1 1 1 1 0
## tools 0 0 0 0 0 0 1 1 1 5
## in 0 0 0 0 0 0 0 1 1 0
## his 0 0 0 0 0 0 0 0 1 0
## car 0 0 0 0 0 0 0 0 0 0
## other 0 0 0 0 0 0 0 0 0 10
Here, the term bag occurs within 5 tokens of tools twice: once in the first document, and once in the second, where the two terms are exactly 5 tokens apart. Had they been any further apart in the second document, that pair would not have been counted.
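If you only care about a specific pair, you can index the matrix directly, since an fcm behaves like a sparse matrix (a small sketch):
fcmat <- fcm(txt, context = "window", window = 5)
fcmat["bag", "tools"] # co-occurrence count for this pair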
I want to find ALL the non-negative integer solutions to the equation i+j+k+l+m=n where n is a non-negative integer. That is, I want to find all possible 5-tuples (i,j,k,l,m) with respect to a certain n, in R.
I wrote some code, but it is not working; I suspect there is something wrong in the looping.
For your convenience, I have taken n=3, so I am basically trying to compute all the vectors (i,j,k,l,m), of which there are 35, and the matrix a (35 by 5) is supposed to hold those vectors. The whole thing is in the function "sample(n)", so calling sample(3) should give me the matrix a. Please note that a (35 by 5) is defined beforehand with all entries 0.
sample = function(n) {
  i = 0
  j = 0
  k = 0
  l = 0
  m = 0
  for (p in 1:35) {
    while (i <= 3) {
      while (j <= 3) {
        while (k <= 3) {
          while (l <= 3) {
            m = n - (i + j + k + l)
            if (m > -1) {
              a[p, ] = c(i, j, k, l, m)
            }
            l = l + 1
          }
          k = k + 1
        }
        j = j + 1
      }
      i = i + 1
    }
  }
  return(a)
}
When I call sample(3), I get back my original a, i.e. the matrix with all elements 0. What is wrong with this code? Please rectify it.
I don't think a brute-force approach will bring you much joy for this task. Instead you should look for existing functions that can be used and are efficient (i.e. implemented in C/C++).
n <- 3
library(partitions)
blockparts(rep(n, 5), n)
#[1,] 3 2 1 0 2 1 0 1 0 0 2 1 0 1 0 0 1 0 0 0 2 1 0 1 0 0 1 0 0 0 1 0 0 0 0
#[2,] 0 1 2 3 0 1 2 0 1 0 0 1 2 0 1 0 0 1 0 0 0 1 2 0 1 0 0 1 0 0 0 1 0 0 0
#[3,] 0 0 0 0 1 1 1 2 2 3 0 0 0 1 1 2 0 0 1 0 0 0 0 1 1 2 0 0 1 0 0 0 1 0 0
#[4,] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 3 0 0 0 0 0 0 1 1 1 2 0 0 0 1 0
#[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3
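Each column of the blockparts() output is one solution. If you prefer the 35 x 5 orientation from the question (one row per solution), you can transpose it; a sketch, where as.matrix() first strips the package's own class:
sols <- t(as.matrix(blockparts(rep(n, 5), n)))
dim(sols) # 35 5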
I believe that your code isn't answering your stated problem (as I understand it), on top of possible errors in your code.
One way to think of the problem is that, given the quadruple (i,j,k,l), the value of m = n - (i + j + k + l), while noting that the quadruple (i,j,k,l) is constrained so that n >= i+j+k+l AND i,j,k,l >= 0. For example, consider the following algorithm:
Let i freely take any value between 0 and n.
Given i, j can take values between 0 and n-i.
Given (i,j), k takes values between 0 and n-i-j.
Given (i,j,k), l takes values between 0 and n-i-j-k.
Given (i,j,k,l), m is defined as m = n - i - j - k - l.
The following code ought to answer your question. Please comment if this is not what you were looking for.
sample.example = function(n) {
  a = array(0, c(0, 5)) # start with a 0-row, 5-column matrix and grow it
  for (i in 0:n) {
    for (j in seq(from = 0, to = n - i, by = 1)) {
      for (k in seq(from = 0, to = n - i - j, by = 1)) {
        for (l in seq(from = 0, to = n - i - j - k, by = 1)) {
          m = n - i - j - k - l
          a = rbind(a, c(i, j, k, l, m))
        }
      }
    }
  }
  return(a)
}
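A quick check of the output for n = 3 (35 solutions, matching choose(3 + 4, 4)):
sols <- sample.example(3)
nrow(sols) # 35
head(sols)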
I am new to R, so my question might be really simple.
I have 40 sites with abundances of zooplankton.
My data look like this (columns are species abundances and rows are sites):
0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 85 0
0 0 0 0 0 45 5 57 0
0 0 0 0 0 13 0 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 7 0
0 3 0 0 12 8 0 57 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 59 0 0 0
0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 105 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 100 0
0 35 0 55 0 0 0 0 0
1 4 0 0 0 0 0 0 0
0 0 0 0 0 34 21 0 0
0 0 0 0 0 9 17 0 0
0 54 0 0 0 27 5 0 0
0 1 0 0 0 1 0 0 0
0 17 0 0 0 54 3 0 0
What I would like to do is take a random sub-sample (e.g. 50 individuals) from each site without replacement, several times (a bootstrap), in order to calculate diversity indices on the new standardized abundances afterwards.
Try something like this:
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which rests on some assumptions about the data being sampled:
Each row is a site, each column is a species, and each value is a count.
Individuals are being sampled for the bootstrap, NOT THE COUNTS.
To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:
a b c
2 3 4
The record would be modeled as:
aabbbcccc
Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.
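For intuition, this "string of individuals" can be built literally with rep() (a small illustration, separate from the method below):
individuals <- rep(c("a", "b", "c"), times = c(2, 3, 4))
individuals # "a" "a" "b" "b" "b" "c" "c" "c" "c"
sample(individuals, 5, replace = TRUE) # draw 5 individuals with replacement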
Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.
# Test data.
data <- data.frame(a = 2, b = 3, c = 4)
# Sampling from the first row of data.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, replace = TRUE,
                  prob = as.numeric(data[row, ])) # coerce the row to a plain numeric vector
Converting the sample into the format of the original table: Now we have an array of samples, with each item indicating the column number that the sample belongs to. We can convert back to the original table format in multiple ways; here is a fairly simple one using a counting loop:
# Count the number of each entry and store in a list.
site_sample <- vector("list", ncol(data)) # the list must be initialized first
for (i in 1:ncol(data)) {
  site_sample[[i]] <- sum(samples == i)
}
# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)
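As an aside, base R's tabulate() collapses the counting loop into a single call:
site_sample <- tabulate(samples, nbins = ncol(data))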
Just stumbled upon this thread: the vegan package has a function called 'rrarefy' that does precisely what you're looking to do (and in the same ecological context, too).
This should work. It's a little more complicated than it looks at first, since each cell contains counts of a species. The solution uses the apply function to send each row of the data to the user-defined sample_species function. Then we generate n random numbers and order them. If there are 15 of species 1, 20 of species 2, and 20 of species 3, the random numbers between 1 and 15 signify species 1, those between 16 and 35 signify species 2, and those between 36 and 55 signify species 3.
## Initially takes in a row of the data and the number of samples to take
sample_species <- function(counts,n) {
num_species <- length(counts)
total_count <- sum(counts)
samples <- sample(1:total_count,n,replace=FALSE)
samples <- samples[order(samples)]
result <- array(0,num_species)
total <- 0
for (i in 1:num_species) {
result[i] <- length(which(samples > total & samples <= total+counts[i]))
total <- total+counts[i]
}
return(result)
}
A <- matrix(sample(0:100,10*40,replace=T), ncol=10) ## mock data
B <- t(apply(A,1,sample_species,50)) ## results
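To repeat the subsampling several times per site, as the question asks, you could wrap this in replicate(); a sketch, assuming vegan's diversity() as the index function:
library(vegan) # assumed here for diversity(); any index function would do
B <- 100 # number of bootstrap replicates
boot_div <- replicate(B, diversity(sample_species(A[1, ], 50), index = "shannon"))
mean(boot_div) # bootstrap estimate of the Shannon index for site 1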
My problem is very simple: I need to create an adjacency list/matrix from a list of edges.
I have an edge list stored in a csv document with column1 = node1 and column2 = node2 and I would like to convert this to a weighted adjacency list or a weighted adjacency matrix.
To be more precise, here's what the data look like, where the numbers are simply node IDs:
node1,node2
551,548
510,512
548,553
505,504
510,512
552,543
512,510
512,510
551,548
548,543
543,547
543,548
548,543
548,542
Any tips on how to achieve the conversion from this to a weighted adjacency list/matrix?
This is how I previously tried to do it, without success (code courtesy of Dai Shizuka):
library(igraph) # graph.edgelist() comes from igraph
dat = read.csv(file.choose(), header=TRUE) # choose an edgelist in .csv file format
el = as.matrix(dat) # coerces the data into the two-column matrix format that igraph likes
el[,1] = as.character(el[,1])
el[,2] = as.character(el[,2])
g = graph.edgelist(el, directed=FALSE) # turns the edgelist into a 'graph object'
Thank you!
This response uses base R only. The result is a standard matrix used to represent the adjacency matrix.
el <- cbind(a=1:5, b=5:1) #edgelist (a=origin, b=destination)
mat <- matrix(0, 5, 5)
mat[el] <- 1
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 1
#[2,] 0 0 0 1 0
#[3,] 0 0 1 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
Here mat is your adjacency matrix defined from edgelist el, which is a simple cbind of the vectors 1:5 and 5:1.
If your edgelist includes weights, then you need a slightly different solution.
el <- cbind(a=1:5, b=5:1, c=c(3,1,2,1,1)) # edgelist (a=origin, b=destination, c=weight)
mat<-matrix(0, 5, 5)
for(i in 1:NROW(el)) mat[ el[i,1], el[i,2] ] <- el[i,3] # SEE UPDATE
mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 0 0 0 3
#[2,] 0 0 0 1 0
#[3,] 0 0 2 0 0
#[4,] 0 1 0 0 0
#[5,] 1 0 0 0 0
UPDATE
Some time later I realized that the for loop (3rd line) in the previous weighted edgelist example is unnecessary. You can replace it with the following vectorized operation:
mat[el[,1:2]] <- el[,3]
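Note that with either version, duplicated (a, b) pairs overwrite one another rather than add up. If your edgelist repeats edges, as the one in the question does, you may want to sum the weights first; a sketch using base aggregate():
agg <- aggregate(c ~ a + b, data = as.data.frame(el), FUN = sum) # sum weights of duplicate edges
mat[as.matrix(agg[, c("a", "b")])] <- agg$c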
The post on my website you mention in the question (https://sites.google.com/site/daishizuka/toolkits/sna/sna_data) uses the igraph package, so make sure that is loaded.
Moreover, I recently realized that igraph provides a much easier way to create a weighted adjacency matrix from edgelists, using graph.data.frame(). I've updated this on my site, but here is a simple example:
library(igraph)
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
g=graph.data.frame(el)
get.adjacency(g,sparse=FALSE)
That should do it. The sparse=FALSE argument tells it to show the 0s in the adjacency matrix.
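If your edgelist carries an explicit weight column, graph.data.frame() stores any extra columns as edge attributes, and get.adjacency() can fill the matrix from one of them; a sketch, assuming a column named weight:
el_w <- data.frame(from = c('a','c','a'), to = c('b','d','d'), weight = c(3, 2, 1))
g_w <- graph.data.frame(el_w)
get.adjacency(g_w, sparse=FALSE, attr='weight')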
If you really don't want to use igraph, I think this is a clunky way to do it:
el=matrix(c('a','b','c','d','a','d','a','b','c','d'),ncol=2,byrow=TRUE) #a sample edgelist
lab=names(table(el)) #extract the existing node IDs
mat=matrix(0,nrow=length(lab),ncol=length(lab),dimnames=list(lab,lab)) #create a matrix of 0s with the node IDs as rows and columns
for (i in 1:nrow(el)) mat[el[i,1],el[i,2]]=mat[el[i,1],el[i,2]]+1 #for each row in the edgelist, find the appropriate cell in the empty matrix and add 1.
Start with your data frame edges and use igraph to obtain the adjacency matrix:
head(edges)
node1 node2
1 551 548
2 510 512
3 548 553
4 505 504
5 510 512
6 552 543
library(igraph)
as.matrix(get.adjacency(graph.data.frame(edges)))
551 510 548 505 552 512 543 553 504 547 542
551 0 0 2 0 0 0 0 0 0 0 0
510 0 0 0 0 0 2 0 0 0 0 0
548 0 0 0 0 0 0 2 1 0 0 1
505 0 0 0 0 0 0 0 0 1 0 0
552 0 0 0 0 0 0 1 0 0 0 0
512 0 2 0 0 0 0 0 0 0 0 0
543 0 0 1 0 0 0 0 0 0 1 0
553 0 0 0 0 0 0 0 0 0 0 0
504 0 0 0 0 0 0 0 0 0 0 0
547 0 0 0 0 0 0 0 0 0 0 0
542 0 0 0 0 0 0 0 0 0 0 0
Another possibility is the qdapTools package. This reuses the weighted edgelist el defined above; the %>% pipe comes from magrittr:
library(qdapTools)
library(magrittr) # for %>%
el[rep(seq_len(nrow(el)), el[,'c']), c('a', 'b')] %>%
  {split(.[,'b'], .[,'a'])} %>%
  mtabulate()
## 1 2 3 4 5
## 1 0 0 0 0 3
## 2 0 0 0 1 0
## 3 0 0 2 0 0
## 4 0 1 0 0 0
## 5 1 0 0 0 0