Arules, Support within a range - r

I'm running the Apriori algorithm in R using arules. I have a massive amount of data to mine and I don't want to use a sample if at all possible. I really only need to see rules associated with items that are not sold very often.
The code I'm using now is:
basket_rules <- apriori(data, parameter = list(sup = 0.7, conf = 0.2, target = "rules", minlen = 4, maxlen = 7))
I only want rules with low support, but because of the size and nature of my data I can't get it any lower than 0.7.
Is it possible to return a range of support in order to conserve memory?
For example, something like: list(sup <= 0.05 and >= 0.0001)
Any other ideas for limiting memory usage while running Apriori are really appreciated.

The nature of support (downward closure) does not allow you to efficiently generate only the itemsets/rules with support in a specific range. In the R implementation in arules you always have to create all frequent itemsets first and then filter. There might be implementations of FP-growth or similar algorithms which are more memory-efficient for your problem.
Another way to approach this problem is to look at the data more closely. Maybe you have several items which appear in many transactions. These items might not be interesting to you, and you can remove them before mining rules.
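For illustration, here is a minimal sketch of both ideas using the Groceries data set that ships with arules as a stand-in for your data (the 5% cut-off and the parameter values are arbitrary): drop the most common items before mining, then filter the mined rules down to the support range of interest.

library(arules)
data("Groceries")                                  # example data standing in for your transactions
trans <- Groceries

## drop items that occur in more than 5% of transactions before mining
freq   <- itemFrequency(trans)
trans2 <- trans[, freq <= 0.05]

## mine everything above the lower support bound, then filter to the range
rules <- apriori(trans2, parameter = list(supp = 0.001, conf = 0.2,
                                          minlen = 2, maxlen = 7))
rules_low <- subset(rules, support >= 0.001 & support <= 0.05)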

Related

How to include all elements of a vector in an exams2moodle or exams2pdf output?

I am working on some simple code to find the square root of one of the following elements:
dat <- c(4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225)
num <- sample(dat, 1, replace = FALSE)
Parts of the seed file are configured like this:
examen01 <- c("SinRad.Rmd")
semilla <- sample(100:1000, 1)
set.seed(semilla)
exams2moodle(examen01, n = 14, svg = TRUE, name = "SinRadConRad",
             encoding = "UTF-8", dir = "salida", edir = "ejercicios",
             mchoice = list(shuffle = TRUE, answernumbering = "ABCD",
                            solution = FALSE,
                            eval = list(partial = TRUE, rule = "none")))
My intention is that, with n=14, every element of the "dat" vector appears as the answer to one of the generated exercises, but I see that some answers are repeated.
How to achieve 14 answers for the 14 possibilities, without repeating or missing any?
Thank you very much
The exams2xyz() functions have been written to draw large numbers of random variations from sets of exercises. There is no dedicated functionality that draws a small number of deterministic variations. So in your case I would just draw, say, a hundred variations from the exercise template even if it can only yield 14 distinct versions. Sure, this wastes a bit of memory but not so much that I would worry about this.
Having said that, it is possible to set up a temporary file with a specific version of an exercise by using the expar() function. For example, expar("SinRad.Rmd", num = 4) would yield an exercise where the num parameter has been fixed to 4. In the same way you can then cycle through the other 13 numbers you want. In the following post we also provide an expargrid() function that does this for all possible combinations of parameters: Making deterministic versions of a parametrized question
Then you can run exams2moodle() on the resulting 14 deterministic exercise files.
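A hedged sketch of that workflow, assuming a recent version of exams where expar() is available and that SinRad.Rmd draws num from the dat vector shown in the question (paths and the dir/edir settings may need adjusting for your setup):

library(exams)
dat <- c(4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225)

## write 14 temporary copies of the exercise, each with "num" fixed to one value
files <- sapply(dat, function(v) expar("SinRad.Rmd", num = v))

## one export containing exactly one version of each deterministic exercise
exams2moodle(files, n = 1, svg = TRUE, name = "SinRadConRad",
             encoding = "UTF-8", dir = "salida")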

For memory, what should be done when you need to constantly grow a vector to an unknown upper limit?

Suppose that you are dealing with a potentially infinite amount of data. Suppose further that you do not have this data stored in memory, but can generate individual terms at will. Finally, suppose that you want to do some experiment on this data that will involve checking a large but unknown number of terms in a way that necessitates keeping a great many of them in memory. Toy problems with Recamán's sequence, like "find the minimum number of terms needed in that sequence for the first 25 even numbers to have appeared", are what I have in mind as typical examples.
The obvious solution to this sort of problem would be to write some code like:
list <- c(firstTerm)                            # start from the first term
while (notFoundEnoughTermsYet) {
  nextTerm <- generateNextTerm()                # placeholder for producing the next term
  if (thisTermWorked) list <- c(list, nextTerm) # grow the vector by one element
}
However, building a big vector like this by adding one new term at a time is your memory's worst nightmare. The alternative that I often see suggested is to pre-allocate a big vector in memory by making the first line of your code something like list<-numeric(10^6), but those solutions suppose that we have some rough idea of how many terms we need to check, which isn't always the case. So what can we do when we are dealing with an ever-growing list of unknown required length?
This is a very popular subject in R; check this answer: https://stackoverflow.com/a/45195098/5442527
Summing up:
Do not use c() to grow the vector; assigning by index with [ is much faster. It might seem surprising, but you can still grow a pre-allocated vector. Create an iterator variable before the while loop and increase the index inside the if statement (see the sketch below).
In Python, by comparison, you normally do not have to care about this when using append. Even starting with an empty list is not a problem, because the list's reserved memory grows roughly geometrically (x2, x2, x1.5, x1.2, ...) once it passes certain threshold sizes. Link: Over-allocating
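A minimal sketch of that pattern in R; the stopping condition and the term generation are placeholders to be replaced by your actual experiment:

vals <- numeric(1000)                  # initial allocation; size is only a guess
i <- 0
while (i < 2500) {                     # placeholder stopping condition
  nextTerm <- i + 1                    # placeholder for generating the next term
  i <- i + 1
  if (i > length(vals)) {
    length(vals) <- 2 * length(vals)   # out of room: double the allocation
  }
  vals[i] <- nextTerm                  # assign by index instead of c()
}
vals <- vals[seq_len(i)]               # trim the unused tail at the end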

Incorporating Item quantity in the transactions for Apriori algorithm

I was working on a simple recommender system. I started off with the Apriori algorithm using arules in R. To my surprise I got 0 rules whenever support was greater than 0.0001, which is too low a value for support. I figured out that the reason for this could be that the duplicate items in each transaction are being removed. I tried to solve this by setting rm.duplicates to FALSE:
df = read.transactions("transactions.csv",sep = ',',rm.duplicates = FALSE)
But that didn't work and I got the following:
Warning message:
In asMethod(object) : removing duplicated items in transactions
So is there a way to solve this, or is there a better way to consider the quantity of each item in every transaction? Is there a better option in Python or any other language? It would be great if anyone could help me out on this.
The support is based on the number of transactions.
The quantity of an item thus does not matter for the support by definition.
Your problem is probably that you did not preprocess your data well enough. For association rules, it usually seems to be necessary to work with product groups or classes rather than individual product codes, i.e., find rules with "beer" and "milk" rather than "Wilmaukee's worst 12 oz. can 24 pack" and "FUGGIES UnderNites Diapers, Size 4, 56 ct, BIG PACK". Merging such over-differentiated products does improve the support.
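As a sketch of that point, using the Groceries data that ships with arules, which already stores a product-group hierarchy in its itemInfo() (for your own data you would first have to attach such a grouping):

library(arules)
data("Groceries")
head(itemInfo(Groceries))                    # item labels plus level1/level2 category columns

## collapse the individual products to their coarser level2 product groups
groc_grouped <- aggregate(Groceries, by = "level2")

## support is now computed per product group, so thresholds can be much higher
rules <- apriori(groc_grouped, parameter = list(supp = 0.05, conf = 0.5))
inspect(head(sort(rules, by = "support")))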

What software package can you suggest for a programmer who rarely works with statistics?

Being a programmer I occasionally find the need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something that I expect to be easier.
As an example to put the question in context, let me quickly show you an example from a CSV file I received today (heavily filtered for brevity):
date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296
I have about 100,000 data points like this with different variables that I want to plot in a scatter plot in order to look for correlations. Usually the data needs to be processed in some way for presentation purposes (such as converting nanoseconds to milliseconds and rounding fractional values), some columns may need to be added or inverted, or combined (like the date/time columns).
The usual recommendation for this kind of work is R and I have recently made a serious effort to use it, but after a few days of work my experience has been that most tasks that I expect to be simple seem to require many steps and have special cases; solutions are often non-generic (for example, adding a data set to an existing plot). It just seems to be one of those languages that people love because of all the powerful libraries that have accumulated over the years rather than the quality and usefulness of the core language.
Don't get me wrong, I understand the value of R to people who are using it, it's just that given how rarely I spend time on this kind of thing I think that I will never become an expert on it, and to a non-expert every single task just becomes too cumbersome.
Microsoft Excel is great in terms of usability but it just isn't powerful enough to handle large data sets. Also, both R and Excel tend to freeze completely (!) with no way out other than waiting or killing the process if you accidentally make the wrong kind of plot over too much data.
So, stack overflow, can you recommend something that is better suited for me? I'd hate to have to give up and develop my own tool, I have enough projects already. I'd love something interactive that could use hardware acceleration for the plot and/or culling to avoid spending too much time on rendering.
#flodin It would have been useful for you to provide an example of the code you use to read such a file into R. I regularly work with data sets of the size you mention and do not have the problems you describe. One thing that might be biting you, if you don't use R often, is that if you don't tell R what the column types are, it has to do some snooping on the file first, and that all takes time. Look at the argument colClasses in ?read.table.
For your example file, I would do:
dat <- read.csv("foo.csv", colClasses = c(rep("character",2), rep("integer", 3)))
then post process the date and time variables into an R date-time object class such as POSIXct, with something like:
dat <- transform(dat, dateTime = as.POSIXct(paste(date, time)))
As an example, let's read in your example data set, replicate it 50,000 times and write it out, then time different ways of reading it in, with foo containing your data:
> foo <- read.csv("log.csv")
> foo
date time PS.Eden.Space.used PS.Old.Gen.Used
1 2011-06-28 00:00:03 45004472 184177208
2 2011-06-28 00:00:18 45292232 184177208
PS.Perm.Gen.Used
1 94048296
2 94048296
Replicate that, 50000 times:
out <- data.frame(matrix(nrow = nrow(foo) * 50000, ncol = ncol(foo)))
out[, 1] <- rep(foo[,1], times = 50000)
out[, 2] <- rep(foo[,2], times = 50000)
out[, 3] <- rep(foo[,3], times = 50000)
out[, 4] <- rep(foo[,4], times = 50000)
out[, 5] <- rep(foo[,5], times = 50000)
names(out) <- names(foo)
Write it out:
write.csv(out, file = "bigLog.csv", row.names = FALSE)
Time loading the naive way and the proper way:
system.time(in1 <- read.csv("bigLog.csv"))
system.time(in2 <- read.csv("bigLog.csv",
                            colClasses = c(rep("character", 2),
                                           rep("integer", 3))))
Which is very quick on my modest laptop:
> system.time(in1 <- read.csv("bigLog.csv"))
user system elapsed
0.355 0.008 0.366
> system.time(in2 <- read.csv("bigLog.csv",
colClasses = c(rep("character",2),
rep("integer", 3))))
user system elapsed
0.282 0.003 0.287
For both ways of reading in.
As for plotting, the graphics can be a bit slow, but depending on your OS this can be sped up a bit by changing the device you plot on. On Linux, for example, don't use the default X11() device, which uses Cairo; instead try the old X window device without anti-aliasing. Also, what are you hoping to see in a data set as large as 100,000 observations on a graphics device with not many pixels? Perhaps try to rethink your strategy for data analysis; no stats software will be able to save you from doing something ill-advised.
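For example (a hedged sketch; which devices are available depends on your OS and how R was compiled), on Linux you could open the old Xlib-based device and use a one-pixel plotting symbol for large point clouds:

X11(type = "Xlib")        # old X11 device, no Cairo anti-aliasing
x <- rnorm(1e5)
y <- x + rnorm(1e5)
plot(x, y, pch = ".")     # "." renders 100,000 points far faster than the default circle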
It sounds as if you are developing code/analysis as you go along, on the full data set. It would be far more sensible to just work with a small subset of the data when developing new code or new ways of looking at your data, say with a random sample of 1000 rows, and work with that object instead of the whole data object. That way you guard against accidentally doing something that is slow:
working <- out[sample(nrow(out), 1000), ]
for example. Then use working instead of out. Alternatively, whilst testing and writing a script, set argument nrows to say 1000 in the call to load the data into R (see ?read.csv). That way whilst testing you only read in a subset of the data, but one simple change will allow you to run your script against the full data set.
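For example, while developing (a sketch reusing the bigLog.csv file created above; drop nrows for the full run):

in_dev <- read.csv("bigLog.csv", nrows = 1000,
                   colClasses = c(rep("character", 2), rep("integer", 3)))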
For data sets of the size you are talking about, I see no problem whatsoever in using R. Your point about not becoming expert enough to use R will more than likely apply to other scripting languages that might be suggested, such as Python. There is a barrier to entry, but that is to be expected if you want the power of a language such as Python or R. If you write scripts that are well commented (instead of just plugging away at the command line), and focus on a few key data imports/manipulations, a bit of plotting and some simple analysis, it shouldn't take long to master that small subset of the language.
R is a great tool, but I never had to resort to using it. Instead I find Python to be more than adequate for my needs when I need to pull data out of huge logs. Python really comes with "batteries included", with built-in support for working with CSV files.
The simplest example of reading a CSV file:
import csv
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row
To use another separator (e.g. tab) and extract the n-th column, use:
spamReader = csv.reader(open('spam.csv', 'rb'), delimiter='\t')
for row in spamReader:
    print row[n]
To operate on columns use the built-in list data-type, it's extremely versatile!
To create beautiful plots I use matplotlib
The python tutorial is a great way to get started! If you get stuck, there is always stackoverflow ;-)
There seem to be several questions mixed together:
Can you draw plots quicker and more easily?
Can you do things in R with less learning effort?
Are there other tools which require less learning effort than R?
I'll answer these in turn.
There are three plotting systems in R, namely base, lattice and ggplot2 graphics. Base graphics will render quickest, but making them look pretty can involve pathological coding. ggplot2 is the opposite, and lattice is somewhere in between.
Reading in CSV data, cleaning it and drawing a scatterplot sounds like a pretty straightforward task, and the tools are definitely there in R for solving such problems. Try asking a question here about specific bits of code that feel clunky, and we'll see if we can fix it for you. If your datasets all look similar, then you can probably reuse most of your code over and over. You could also give the ggplot2 web app a try.
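To give a feel for how short that pipeline can be, here is a sketch against the log format shown in the question, using base graphics (the column names follow from that CSV header after read.csv, and the unit conversion assumes the values are bytes):

dat <- read.csv("log.csv",
                colClasses = c(rep("character", 2), rep("integer", 3)))
dat$dateTime <- as.POSIXct(paste(dat$date, dat$time))
dat$eden_mb  <- dat$PS.Eden.Space.used / 2^20          # bytes to MiB (assumed unit)
plot(dat$dateTime, dat$eden_mb, pch = ".",
     xlab = "time", ylab = "PS Eden Space used (MiB)")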
The two obvious alternative languages for data processing are MATLAB (and its derivatives: Octave, Scilab, AcslX) and Python. Either of these will be suitable for your needs, and MATLAB in particular has a pretty shallow learning curve. Finally, you could pick a graph-specific tool like gnuplot or Prism.
SAS can handle larger data sets than R or Excel; however, many (if not most) people, myself included, find it a lot harder to learn. Depending on exactly what you need to do, it might be worthwhile to load the CSV into an RDBMS and do some of the computations (e.g. correlations, rounding) there, and then export only what you need to R to generate graphics.
ETA: There's also SPSS, and Revolution; the former might not be able to handle the size of data that you've got, and the latter is, from what I've heard, a distributed version of R (that, unlike R, is not free).

Efficiency of operations on R data structures

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.
For example:
I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.
The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
Also, I know the main R performance hint is to use vectorized operations whenever possible as opposed to loops.
What about the various flavors of apply?
Are those just hidden loops?
And what about matrices vs. data frames?
Data I/O was one of the features I looked into before I committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:
1. That R doesn't handle big data (>2 GB?). To me this is a misconception. By default, the common data input functions load your data into RAM. Not to be glib, but to me this is a feature, not a bug: anytime my data will fit in my available RAM, that's where I want it. Likewise, one of SQLite's most popular features is the in-memory option, where the user can easily load the entire DB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it: via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via packages that reflect current technology/practices (for instance, I can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
2. The performance of read.table (read.csv, read.delim, et al.), the most common means of getting data into R, can be improved 5x (and often much more, in my experience) just by opting out of a few of read.table's default arguments; the ones with the greatest effect on performance are mentioned in R's help (?read.table). Briefly, the R developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in comment.char = "" if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.
Here are the snippets i use for those parameters:
To get the number of rows in your data file (supply this snippet as an argument to the parameter, 'nrows', in your call to read.table):
as.numeric(gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep = ""), intern = TRUE)))
To get the classes for each column:
function(fname) sapply(read.table(fname, header = TRUE, nrows = 5), class)
Note: you can't pass this snippet in as an argument; you have to call it first and then pass in the value returned. In other words, call the function, bind the returned value to a variable, and then pass the variable as the value of the parameter 'colClasses' in your call to read.table.
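For example (a minimal sketch, assuming a comma-separated file named data.csv with a header row):

## the 'colClasses' and 'nrows' snippets from above, fed back into read.table
col_classes <- sapply(read.table("data.csv", header = TRUE, sep = ",", nrows = 5), class)
n_rows <- as.numeric(gsub("[^0-9]+", "", system("wc -l data.csv", intern = TRUE)))
df <- read.table("data.csv", header = TRUE, sep = ",", comment.char = "",
                 colClasses = col_classes, nrows = n_rows)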
3. Using scan. With only a little more hassle, you can do better than that (optimizing 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually and then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2, ...)).
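A hedged sketch of that approach for the same hypothetical data.csv, assuming one character column followed by two numeric columns:

## scan() reads the fields of each record into the corresponding components of 'what'
cols <- scan("data.csv", sep = ",", skip = 1, quiet = TRUE,
             what = list(date = character(), used = numeric(), max = numeric()))
df <- data.frame(cols, stringsAsFactors = FALSE)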
4. Use R's containers for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file format, '.RData', is a binary format that is a little smaller than a compressed ('.gz') txt data file. You create these files with save(object, file = "..."); you load them back into the R workspace with load(). The difference in load times compared with 'read.table' is dramatic. For instance, with a 25 MB file (uncompressed size):
system.time(read.table("tdata01.txt.gz", sep = ","))
=>    user  system elapsed
     6.173   0.245   6.450
system.time(load("tdata01.RData"))
=>    user  system elapsed
     0.912   0.006   0.912
5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful for getting data out of R. The key point to keep in mind is that, by default, numbers in R expressions are interpreted as double-precision floating point; e.g., typeof(5) returns "double". Compare the object size of a reasonably sized array of each type and you can see the significance (use object.size()). So coerce to integer when you can.
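A quick illustration of the storage difference (sizes are approximate and platform-dependent):

object.size(numeric(1e6))                     # ~8 MB: doubles take 8 bytes each
object.size(integer(1e6))                     # ~4 MB: integers take 4 bytes each
x <- as.integer(round(runif(1e6, 0, 100)))    # coerce to integer where the values allow it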
Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C, which makes a big difference performance-wise. [Edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.]
These things do pop up on the lists, in particular on r-devel. One fairly well-established nugget is that e.g. matrix operations tend to be faster than data.frame operations. Then there are add-on packages that do well -- Matt's data.table package is pretty fast, and Jeff has gotten xts indexing to be quick.
But it "all depends" -- so you are usually best adviced to profile on your particular code. R has plenty of profiling support, so you should use it. My Intro to HPC with R tutorials have a number of profiling examples.
I will try to come back and provide more detail. If you have any question about the efficiency of one operation over another, you would do best to profile your own code (as Dirk suggests). The system.time() function is the easiest way to do this although there are many more advanced utilities (e.g. Rprof, as documented here).
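A minimal sketch of both tools on a throwaway piece of code:

system.time(invisible(replicate(200, sort(runif(1e4)))))   # coarse overall timing

Rprof("profile.out")                                       # sampled profiling of a block of code
invisible(replicate(200, sort(runif(1e4))))
Rprof(NULL)
summaryRprof("profile.out")$by.self                        # where the time was actually spent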
A quick response for the second part of your question:
What about the various flavors of apply? Are those just hidden loops?
For the most part yes, the apply functions are just loops and can be slower than for statements. Their chief benefit is clearer code. The main exception that I have found is lapply which can be faster because it is coded in C directly.
And what about matrices vs. data frames?
Matrices are more efficient than data frames because they require less memory for storage. This is because data frames require additional attribute data. From R Introduction:
A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes
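A quick way to see both claims for yourself (memory footprint and access speed; the numbers will vary by machine):

m <- matrix(runif(1e6), ncol = 10)
d <- as.data.frame(m)

object.size(m)                                 # the matrix carries almost no extra attributes
object.size(d)                                 # the data frame adds names, row names, class, etc.

system.time(for (i in 1:10000) m[500, ])       # matrix row access
system.time(for (i in 1:10000) d[500, ])       # data frame row access, typically much slower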
