Association analysis with duplicate transactions using arules package in R

I want to create a transaction object in basket format which I can call anytime for my analyses. The data consists of 1001 transactions of comma-separated items. The first 10 transactions look like this:
hering,corned_b,olives,ham,turkey,bourbon,ice_crea
baguette,soda,hering,cracker,heineken,olives,corned_b
avocado,cracker,artichok,heineken,ham,turkey,sardines
olives,bourbon,coke,turkey,ice_crea,ham,peppers
hering,corned_b,apples,olives,steak,avocado,turkey
sardines,heineken,chicken,coke,ice_crea,peppers,ham
olives,bourbon,coke,turkey,ice_crea,heineken,apples
corned_b,peppers,bourbon,cracker,chicken,ice_crea,baguette
soda,olives,bourbon,cracker,heineken,peppers,baguette
corned_b,peppers,bourbon,cracker,chicken,bordeaux,hering
...
I observed that there are duplicated transactions in the data and removed them, but each time I try to read the transactions, I get:
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
Here is my code:
data <- read.csv("AssociationsItemList.txt", header = F)
data <- data[!duplicated(data), ]
pop <- NULL
for (i in 1:length(data)) {
  pop <- paste(pop, data[i], sep = "\n")
}
write(pop, file = "Trans", sep = ",")
transdata <- read.transactions("Trans", format = "basket", sep = ",")
I'm sure there's something little yet important I've missed. Kindly offer your assistance.

The problem is not with duplicated transactions (the same row appearing twice)
but duplicated items (the same item appearing twice, in the same transaction --
e.g., "olives" on line 4).
read.transactions has an rm.duplicates argument to remove those duplicates.
read.transactions("Trans", format = "basket", sep=",", rm.duplicates=TRUE)

Vincent Zoonekynd is right: the problem is caused by duplicated items within a transaction. Here is why arules requires transactions without duplicated items.
Transaction data are stored internally as an ngCMatrix object. The relevant source code:
setClass("itemMatrix",
representation(
data = "ngCMatrix",
...
setClass("transactions",
contains = "itemMatrix",
...
ngCMatrix is a sparse matrix class defined in the Matrix package. Its description from the official documentation:
The nsparseMatrix class is a virtual class of sparse “pattern” matrices, i.e., binary matrices conceptually with TRUE/FALSE entries. Only the positions of the elements that are TRUE are stored.
So ngCMatrix stores the status of an element as a binary indicator: a transactions object in arules can only record whether an item is present or absent in a transaction, and cannot record quantity. That is why duplicated items within a transaction must be removed.
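A toy sketch of this presence/absence behaviour (data made up for illustration, not from the question):
library(arules)
# one transaction containing a repeated item
baskets <- list(c("olives", "olives", "ham"))
# direct coercion fails with the same error as in the question,
# because the underlying ngCMatrix can only hold TRUE/FALSE
# as(baskets, "transactions")
# dropping duplicates first works; the quantity (two olives) is lost
trans <- as(lapply(baskets, unique), "transactions")
inspect(trans)   # a single transaction: {ham, olives}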

I just used the unique() function to remove duplicates. My data was a little different, since I had a data frame (the data was too large for a CSV) with two columns: product_id and trans_id. I know it's not your specific question, but I had to do this to create the transaction dataset and apply association rules.
data   # > 1 million transactions
data <- unique(data[, 1:2])
trans <- as(split(data[, "product_id"], data[, "trans_id"]), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.2))
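To peek at the mined rules afterwards, inspect() and sort() from arules work directly, e.g.:
inspect(head(sort(rules, by = "lift"), 5))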

Related

How to retrieve data using the rentrez package by giving a list of query names instead of a single one?

So I'm trying to use the rentrez package to retrieve DNA sequence data from GenBank, giving as input a list of species.
What I've done is create a vector of the species I want to query, build a search term specifying the types of sequence data I want to retrieve, run a search that retrieves all the occurrences matching my query, and finally fetch the actual sequence data in FASTA format.
library(rentrez)
species<-c("Ablennes hians","Centrophryne spinulosa","Doratonotus megalepis","Entomacrodus cadenati","Katsuwonus pelamis","Lutjanus fulgens","Pagellus erythrinus")
for (x in species) {
  term <- paste(x, "[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])", sep = "", collapse = NULL)
  search <- entrez_search(db = "nuccore", term = term, retmax = 99999)
  data <- entrez_fetch(db = "nuccore", id = search$ids, rettype = "fasta")
}
Basically I'm trying to concatenate the results of the queries for each species into a single variable. I started with a for loop, but in this form it makes no sense, because the data for each newly queried species simply replaces the previous one in data.
For some elements of species, there will be no data to retrieve and R shows this error:
Error: Vector of IDs to send to NCBI is empty, perhaps entrez_search or entrez_link found no hits?
In the cases where this error is shown, and therefore there is no data for that particular species, I want the code to just keep going and ignore it.
My desired output is a variable data containing the sequence data retrieved for all the names in species.
library(rentrez)
species <- c("Ablennes hians","Centrophryne spinulosa","Doratonotus megalepis","Entomacrodus cadenati","Katsuwonus pelamis","Lutjanus fulgens","Pagellus erythrinus")
data <- list()
for (x in species) {
  term <- paste(x, "[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])", sep = "", collapse = NULL)
  search <- entrez_search(db = "nuccore", term = term, retmax = 99999)
  # store each species' result in a named list entry; if NCBI returns no
  # hits, entrez_fetch() errors, so record NA and keep going
  data[[x]] <- tryCatch(entrez_fetch(db = "nuccore", id = search$ids, rettype = "fasta"),
                        error = function(e) NA)
}
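If a single blob of FASTA text is wanted afterwards, one way to get it (assuming the NA placeholders for species without hits should be dropped):
# drop species that had no hits, then concatenate the remaining FASTA text
fasta_all <- paste(unlist(data[!is.na(data)]), collapse = "")
cat(fasta_all, file = "sequences.fasta")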

Converting Table Data to Transaction Data in r

For a current project I am trying to find a way to convert a large amount of table data (300,000+ observations of 19 variables) into transaction data for arules. A large number of the variables are logical (TRUE/FALSE).
I've tried the following from library(arules): newdata <- read.transactions("olddata.csv", format = "basket", rm.duplicates = FALSE, skip = 1)
However I get the following error:
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
I don't want to remove duplicates, since I would lose much of my data: it removes every duplicated logical TRUE/FALSE after the first occurrence.
I figured I could try and accomplish my task using a for loop:
newdata <- ""
for (row in 1:nrow(olddata)) {
if (row !=1) {
newdata <- paste0(newdata, "\n")}
newdata <- paste0(newdata, row,",")
for (col in 2:ncol(olddata)) {
if (col !=2) {
newdata <- paste0(newdata, ",")}
newdata <- paste0(newdata, colnames(olddata),"=", olddata[row,col])}
}
write(newdata,"newdata.csv")`
My goal was to have the value of each variable for each observation look as follows: columnnameA=TRUE, columnnameB=FALSE, etc. This would eliminate "duplicates" for the read.transactions function and retain all of the data.
However my output starts looking like this:
[1] "1,Recipient=Thu Feb 04 21:52:00 UTC 2016,Recipient=TRUE,Recipient=TRUE,Recipient=FALSE,Recipient=TRUE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE\n2,Recipient=Thu Feb 04 21:52:00 UTC 2016,Recipient=TRUE,Recipient=TRUE,Recipient=FALSE,Recipient=TRUE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE\n3
Just a note that Recipient is the first variable name in my olddata object. After it writes every observation as Recipient=X, it moves on to the next variable name and repeats. I end up with a file that has over 5 million observations... oops! This is my first real stab at nested for loops; I'm not sure if this is the best approach or if there is a better one.
Thanks in advance for any thoughts or insights you might have.
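A hedged sketch of the likely fix (not from the original post): colnames(olddata) returns the entire vector of column names, so every cell gets all 19 names pasted in; indexing it with [col] pastes just the one name:
newdata <- paste0(newdata, colnames(olddata)[col], "=", olddata[row, col])
Alternatively (assuming all columns can be treated as factors), arules can coerce a data frame of factors straight to transactions, creating items named column=value and keeping the FALSE levels, with no file round-trip:
olddata[] <- lapply(olddata, as.factor)
trans <- as(olddata, "transactions")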

Creating SpatialLinesDataFrame from SpatialLines object and basic df

Using leaflet, I'm trying to plot some lines and set their color based on a 'speed' variable. My data start at an encoded polyline level (i.e. a series of lat/long points, encoded as an alphanumeric string) with a single speed value for each EPL.
I'm able to decode the polylines to get series of lat/long points (thanks to Max, here), and I'm able to create segments from those series of points and format them as a SpatialLines object (thanks to Kyle Walker, here).
My problem: I can plot the lines properly using leaflet, but I can't join the SpatialLines object to the base data to create a SpatialLinesDataFrame, and so I can't code the line color based on the speed var. I suspect the issue is that the IDs I'm assigning SL segments aren't matching to those present in the base df.
The objects I've tried to join, with SpatialLinesDataFrame():
"sl_object", a SpatialLines object with ~140 observations, one for each segment; I'm using Kyle's code, linked above, with one key change - instead of creating an arbitrary iterative ID value for each segment, I'm pulling the associated ID from my base data. (Or at least I'm trying to.) So, I've replaced:
id <- paste0("line", as.character(p))
with
lguy <- data.frame(paths[[p]][1])
id <- unique(lguy[,1])
"speed_object", a df with ~140 observations of a single speed var and row.names set to the same id var that I thought I created in the SL object above. (The number of observations will never exceed but may be smaller than the number of segments in the SL object.)
My joining code:
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object)
And the result:
row.names of data and Lines IDs do not match
Thanks, all. I'm posting this in part because I've seen some similar questions - including some referring specifically to changing the ID output of Kyle's great tool - and haven't been able to find a good answer.
EDIT: Including data samples.
From sl_obj, a single segment:
print(sl_obj)
...
[[151]]
An object of class "Lines"
Slot "Lines":
[[1]]
An object of class "Line"
Slot "coords":
           lon      lat
1955 -74.05228 40.60397
1956 -74.05021 40.60465
1957 -74.04182 40.60737
1958 -74.03997 40.60795
1959 -74.03919 40.60821

Slot "ID":
[1] "4763655"
And the corresponding record from speed_obj:
row.names speed
... ...
4763657 44.74
4763655 34.8 # this one matches the ID above
4616250 57.79
... ...
To get rid of this error message, either make the row.names of data and Lines IDs match by preparing sl_object and/or speed_object, or, in case you are certain that they should be matched in the order they appear, use
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object, match.ID = FALSE)
This is documented in ?SpatialLinesDataFrame.
All right, I figured it out. The error didn't like the fact that my speed_obj wasn't the same length as my sl_obj, as mentioned here ("data: object of class data.frame; the number of rows in data should equal the number of Lines elements in sl").
Resolution: used a quick loop to pull out all of the unique line IDs, then performed a left join against that list of uniques to create an exhaustive speed_obj (with NAs, which seem to be OK).
ids <- data.frame()
for (i in 1:length(sl_obj)) {
  id <- data.frame(sl_obj@lines[[i]]@ID)
  ids <- rbind(ids, id)
}
colnames(ids)[1] <- "linkId"
speed_full <- join(ids, speed_obj)   # join() from the plyr package (left join by default)
speed_full_short <- data.frame(speed_full[, c(-1)])
row.names(speed_full_short) <- speed_full$linkId
splndf <- SpatialLinesDataFrame(sl_obj, data = speed_full_short, match.ID = T)
Works fine now!
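For what it's worth, the ID-collecting loop can be condensed to a single line (same result, assuming sl_obj is a SpatialLines object from the sp package):
ids <- data.frame(linkId = sapply(sl_obj@lines, function(x) x@ID))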
I may have deciphered the issue.
When I pull in my spatial lines data and check the class, it reads as "SpatialLinesDataFrame" even though I know it's a simple linear shapefile. I'm using readOGR to bring the data in, and I believe this is where the conversion occurs. With that in mind, the speed assignment is relatively easy:
sl_object$speed <- speed_object[ match( sl_object$ID , row.names( speed_object ) ) , "speed" ]
This should do the trick, as I'm willing to bet your class(sl_object) is "SpatialLinesDataFrame".
EDIT: I had received the same error as the OP, which drove me to check class(). I am under the impression that the error was raised because you were trying to coerce a data frame into a data frame, and R wasn't a fan of that.

Pairing qualitative user data with text-mining results

I have pairs of customer feedback data in a CSV, denoting whether the customer recommended the service they received (1 or 0), "rec", and an associated comment, "comment". I am trying to compare the customer feedback between those who recommended the service and those who did not.
I have used the tm package to simply read all the lines in a CSV with only comments and do some follow-on text-mining on all the comments, which worked:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x))
Now I am trying to compare the comments of those customers who recommend and those who do not by including the "rec" column, but I have not been able to create a corpus from a single column CSV - I tried the following:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x$comment))
But I get an error saying:
Error in if (vectorized && (length <= 0)) stop("vectorized sources must have positive length") :
missing value where TRUE/FALSE needed
I also tried binding the "rec" codes to the comments after creating a topic model, but certain comments end up getting filtered by the "topic" function so the "rec" column is longer than the # of documents in the resulting topic model.
If this something I can do with the tm package simply? I haven't worked with the qdap package at all but is that something that is more appropriate here?
... as Ben mentioned:
vec <- as.character(x[,"place of comments"])
Corpus(VectorSource(vec))
Perhaps some customer id as metadata would be nice...
hth
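Building on that, a minimal sketch of the comparison the question asks about (column names rec and comment assumed from the question):
library(tm)
x <- read.csv(file_loc, header = TRUE, stringsAsFactors = FALSE)
# one corpus per group, splitting the comments on the "rec" flag
fdbk_rec   <- Corpus(VectorSource(x$comment[x$rec == 1]))
fdbk_norec <- Corpus(VectorSource(x$comment[x$rec == 0]))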

How to allocate/append a large column of Date objects to a data-frame

I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184 Mb.
(It's backed by SQL; it is the contents of my dataset, which I read in via read.csv.sql.)
Column 2 is tr.sql$visit_date. SQL does not natively allow representing dates as an R Date object, and this matters for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data-frame tr, where the visit_date column can be natively represented as Date (chron::Date?). Trust me, this makes exploratory data analysis easier; for now this is how I want to do it - I might use native SQL eventually, but please don't quibble about that for now.
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
then, as a workaround, copy tr.sql -> tr in chunks of (say) N/8 using a for loop, so that the temporary involved in the string-to-Date conversion does not run out of memory, with a garbage-collect after each chunk:
for (i in 0:7) {
  from <- floor(i * N/8) + 1   # 1-indexed chunk boundaries (R vectors start at 1)
  to <- floor((i + 1) * N/8)
  if (i == 7) to <- N
  print(c("Copying tr.sql$visit_date", from, to, " ..."))
  tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
  gc()
}
rm(tr.sql)
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
The problem is allocating then copying the visit_date column.
Here is the dataset and code, I am having multiple separate problems with this, explanation below:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
and code:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
The line that tries to declare data.frame(visit_date = Date(N)) fails because I don't know the correctly qualified (namespaced) name for the Date object (chron::Date and Dates::Date don't work).
Both the attempt to assign and the attempt to append fail. I'm not even sure whether it is legal, or efficient, to use append on a single large column of a data-frame.
Remember these objects are big, so avoid using temporaries.
Thanks in advance...
Try this, ensuring that you are using the most recent version of sqldf (currently version 0.4-1.2).
(If you find you are running out of memory, try putting the database on disk by adding the dbname = tempfile() argument to the read.csv.sql call. If even that fails, then it's so large relative to available memory that it's unlikely you'll be able to do much analysis with it anyway.)
# create test data file
Lines <-
"customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52"
cat(Lines, file = "trainingtest.csv")
# read it back
library(sqldf)
DF <- read.csv.sql("trainingtest.csv", method = c("integer", "Date2", "numeric"))
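A quick sanity check that the date column really came back as Date:
class(DF$visit_date)   # should print "Date"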
It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:
tr <- tr.sql
tr$visit_date <- as.Date(tr.sql$visit_date)
Or even more efficient:
tr <- data.frame(colOne = tr.sql[,1], visit_date = as.Date(tr.sql$visit_date), colThree = tr.sql[,3])
As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a cbind() operation.
Try this:
tr <- data.frame(visit_date= as.Date(tr.sql$visit_date, origin="1970-01-01") )
This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If it is neither of those formats, post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin, e.g.:
vdfrm <- data.frame(a = as.Date(c(1470, 1475, 1480), origin="1970-01-01") )
vdfrm
           a
1 1974-01-10
2 1974-01-15
3 1974-01-20
