Specify end of record (EOL) delimiter while importing from a text file? - r

I'm trying to import into R a large number of pipe-delimited files that were created in a Windows environment, with CR+LF as the end-of-record (i.e. EOL) delimiter. However, they also have CRs scattered about periodically, which results in frequent inappropriately split lines. Ideally, I want an efficient way to solve this problem from within R, either by finding a way to specify the EOL delimiter when I import, or, if necessary, by reading in the text file and excising the CRs before any parsing of lines is done.
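For concreteness, the kind of pre-processing I have in mind for the second option would be something like this (a rough sketch; the file name is just a placeholder):

# Read the raw text, drop any CR not followed by LF, then parse the result.
txt   <- readChar("data.txt", file.info("data.txt")$size, useBytes = TRUE)
clean <- gsub("\r(?!\n)", "", txt, perl = TRUE)  # keep CRLF pairs, excise lone CRs
dat   <- read.table(text = clean, sep = "|", header = TRUE)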
The creators of the files comment on this problem and recommend adding TERMSTR=CRLF to your SAS code, and I can find lots of discussions of how to do this in other languages as well. For R, however, all I can find is this discussion, here on Stack Overflow:
Possible to change the record delimiter in R?
The sample problem given is a great match for my problem. The solution identified is nice for their specific situation of having a single file like this, but for me it would require coding up separate scripts for importing each of the dozens of files, since each has different primary keys that would need to be recognized after the fact to repair the inappropriate import. Alternatively, I could open each file in something like Notepad++ to remove the extra CRs, but again, that seems quite inefficient, and it would have to be repeated by hand every time the initial data set was updated by its producers.
Given how frequent a problem this seems to be for people, and the existence of hard-coded solutions in other programming languages, I'm confused as to why there isn't a fix in R and feel like I must be missing something. It seems clear (I think?) that there's no way to do this directly from read.table or even from readLines, but is there perhaps a way to do this using scan that I'm missing?
Thanks for any thoughts!

Related

R - Read large file with small memory

My data is organized in a CSV file with millions of lines and several columns. This file is too large to read into memory all at once.
Fortunately, I only want to compute some statistics on it, like the mean of each column for every 100 rows and such. My solution, based on other posts, was to use read.csv2 with the options nrows and skip. This works.
However, I realized that when loading from the end of the file this process is quite slow. As far as I can tell, the reader seems to go through the file until it passes all the lines that I tell it to skip, and only then reads. This, of course, is suboptimal, as it keeps re-reading the initial lines every time.
Is there a solution, as with a Python parser, where we can read the file line by line, stop when needed, and then continue, while keeping the nice reading simplicity that comes with read.csv2?
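One approach (a sketch; the file name and chunk size are made up) is to keep a single connection open, so that each read.csv2 call continues from wherever the previous one stopped instead of re-skipping from the top:

con <- file("big.csv", open = "r")
readLines(con, n = 1)                # consume the header line once
repeat {
  chunk <- tryCatch(
    read.csv2(con, header = FALSE, nrows = 100),
    error = function(e) NULL)        # read.csv2 errors when no lines remain
  if (is.null(chunk)) break
  print(colMeans(chunk))             # assumes all columns are numeric
}
close(con)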

Weka Apriori No Large Itemset and Rules Found

I am trying to do Apriori association mining with Weka (I use 3.7) using the given database table.
So I exported two columns (orderLineNumber and productCode) and loaded them into Weka. So far I haven't had any successful attempt; it always ends with "No large itemsets and rules found!"
I then tried converting the CSV into an ARFF file first using the ARFF converter and still got the same message;
I also tried using the database loader in Weka; the data loaded just fine but still gave the same result;
The only filter I've applied in preprocessing is the NumericToNominal filter;
What have I done wrong here? I suspect it is my ARFF format. Thank you.
Update
After further trials, I found out that I had exported the wrong column and was missing one filter step, "Denormalize". I installed the plugin via the package manager and denormalized my data after converting it to nominal first;
I then compared the results with the "Supermarket" sample's results; the only differences are that my output comes with 'f' instead of 't' (as shown below) and that the confidence value always seems to be 100%;
First of all, OrderLine is the wrong column.
Obviously, the position on the printed bill is not very important.
Secondly, the file format is not appropriate.
You want one line for every order and one column for every possible item in the #data section. To save memory, it may be helpful to use a sparse format (do not forget to set the flags appropriately).
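For instance, a sparse ARFF encoding of two small baskets might look like the sketch below (attribute names are made up; in the sparse format, attributes omitted from an instance default to the first nominal value, here f):

@relation orders
@attribute apple   {f, t}
@attribute banana  {f, t}
@attribute milk    {f, t}
@attribute diapers {f, t}
@attribute beer    {f, t}

@data
% order 1: apple, banana
{0 t, 1 t}
% order 2: milk, diapers, beer
{2 t, 3 t, 4 t}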
Other tools, such as ELKI, can process input formats like the following, which may be easier to use (ELKI also was a lot faster than Weka):
apple banana
milk diapers beer
but, last I checked, ELKI would "only" find the frequent itemsets (the harder part), not compute association rules. I then used a tiny Python script to produce the actual association rules as desired.

faster than scan() with Rcpp?

Reading ~5x10^6 numeric values into R from a text file is relatively slow on my machine (a few seconds, and I read several such files), even with scan(..., what="numeric", nmax=5000) or similar tricks. Could it be worthwhile to try an Rcpp wrapper for this sort of task (e.g. Armadillo has a few utilities to read text files)?
Or would I likely be wasting my time for little to no performance gain because of the expected interface overhead? I'm not sure what's currently limiting the speed (intrinsic machine performance, or something else?). It's a task that I repeat many times a day, typically, and the file format is always the same: 1000 columns, around 5000 rows.
Here's a sample file to play with, if needed.
nr <- 5000
nc <- 1000
m  <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
cat(m[1, -1], "\n", file = "test.txt")  # first line is shorter
write.table(m[-1, ], file = "test.txt", append = TRUE,
            row.names = FALSE, col.names = FALSE)
Update: I tried read.csv.sql and also load("test.txt", arma::raw_ascii) using Armadillo and both were slower than the scan solution.
I highly recommend checking out fread in the latest version of data.table. The version on CRAN (1.8.6) doesn't have fread yet (at the time of this post), so to get it you should install from the latest source on R-Forge. See here.
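A minimal usage sketch, assuming the development version is installed:

library(data.table)
DT <- fread("test.txt")  # detects the separator and column types automatically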
Please bear in mind that I'm not an R expert, but maybe the concept applies here too: usually reading binary data is much faster than reading text files. If your source files don't change frequently (e.g. you are running varied versions of your script/program on the same data), read them via scan() once and store them in a binary format (the manual has a chapter on exporting binary files).
From there on you can modify your program to read the binary input.
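For illustration, that caching pattern might look like this (a sketch; the .rds file name is made up):

if (file.exists("test.rds")) {
  m <- readRDS("test.rds")                 # fast binary read on later runs
} else {
  m <- scan("test.txt", what = "numeric")  # slow text parse, done only once
  saveRDS(m, "test.rds")
}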
Regarding Rcpp: scan() and friends are likely to call a native implementation (like fscanf()), so writing your own file-reading functions via Rcpp may not provide a huge performance gain. You can still try it, though (and optimize for your particular data).
Hi Baptiste,
Data Input/Output is a huge topic, so big that R comes with its own manual on data input/output.
R's basic functions can be slow because they are so very generic. If you know your format, you can easily write yourself a faster import adapter. If you know your dimensions too, it is even easier as you need only one memory allocation.
Edit: As a first approximation, I would write a C++ ten-liner. Open a file, read a line, break it into tokens, assign to a vector<vector< double > > or something like that. Even if you use push_back() on individual vector elements, you should be competitive with scan(), methinks.
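For illustration, a rough version of that ten-liner via Rcpp::cppFunction might look like the following (a sketch reading whitespace-separated doubles into one flat vector, not a polished reader):

library(Rcpp)
cppFunction('
NumericVector read_doubles(std::string path) {
  std::ifstream in(path.c_str());
  std::vector<double> v;
  double x;
  while (in >> x) v.push_back(x);  // push_back(), as suggested above
  return wrap(v);                  // copy into an R numeric vector
}', includes = "#include <fstream>\n#include <vector>")
m <- read_doubles("test.txt")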
I once had a neat little csv reader class in C++ based on code by Brian Kernighan himself. Fairly generic (for csv files), fairly powerful.
You can then squeeze performance as you see fit.
Further edit: This SO question has a number of pointers for the csv reading case, and references to the Kernighan and Plauger book.
Yes, you almost certainly can create something that goes faster than read.csv/scan. However, for high performance file reading there are some existing tricks that already let you go much faster, so anything you do would be competing against those.
As Mathias alluded to, if your files don't change very often, then you can cache them by calling save, then restore them with load. (Make sure to use ascii = FALSE, since reading the binary files will be quicker.)
Secondly, as Gabor mentioned, you can often get a substantial performance boost by reading your file into a database and then from that database into R (a sketch of this follows at the end of this answer).
Thirdly, you can use the HadoopStreaming package to use Hadoop's file reading capabilities.
For more thoughts on these techniques, see Quickly reading very large tables as dataframes in R.
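As a sketch of the second point, the database route via sqldf might look like this (note that the question's update found it slower than scan for this particular file):

library(sqldf)
# Imports the file into a temporary SQLite database, runs the query there,
# and returns only the result to R.
df <- read.csv.sql("test.txt", sql = "select * from file",
                   header = FALSE, sep = " ")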

Strategies for repeating a large chunk of analysis

I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.
The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.
I considered:
Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or parent environments. This additional effort seems excessive, makes it harder to debug and may introduce side-effects.
Wrap it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
Creating some preamble code, wrapping the analysis in a separate file, and sourcing it. This works, but seems very ugly and suboptimal.
The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.
What is a good strategy for dealing with this type of problem?
Making code reusable takes some time and effort, and holds a few extra challenges, as you mention yourself.
The question of whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually?
The answer, I believe, is highly personal, and even then it differs case by case. If programming comes easily to you, you may sooner decide to go the reuse route, as the effort will be relatively low for you (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).
That said, in your particular case I'd go with the sourcing option: since you plan to reuse the code only two more times, a greater effort would probably be wasted (you indicate the analysis is rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
If it turns out in a year or so that the reuse is higher than expected, you can then still invest. And by that time, you will also have (at least) three cases for which you can compare the results from the rewritten and funky reusable version of your code with your current results.
If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.
If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.
For instance, JSON works very well for this, and the RJSONIO and rjson packages allow you to load the file into a list. Suppose your parameters live in a file called parametersNN.json. An example is as follows:
{
  "Version": "20110701a",
  "Initialization": {
    "indices": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "step_size": 0.05
  },
  "Stopping": {
    "tolerance": 0.01,
    "iterations": 100
  }
}
Save that as "parameters01.json" and load as:
library(RJSONIO)
Params <- fromJSON("parameters01.json")
and you're off and running. (NB: I like to use unique version #s within my parameters files, just so that I can identify the set later, if I'm looking at the "parameters" list within R.) Just call your script and point to the parameters file, e.g.:
Rscript --vanilla MyScript.R parameters01.json
Then, within the program, identify the parameters file via the commandArgs() function.
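For instance, the top of MyScript.R might look like this (a sketch):

library(RJSONIO)
args   <- commandArgs(trailingOnly = TRUE)
Params <- fromJSON(args[1])              # e.g. "parameters01.json"
tol    <- Params$Stopping$tolerance      # pull out individual settings as needed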
Later, you can break out code into functions and packages, but this is probably the easiest way to make a vanilla script generalizable in the short term, and it's good practice for the long term, as code should be separated from the specification of run-, dataset-, or experiment-dependent parameters.
Edit: To be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?
Update:
Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
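A sketch of what that might look like with RSQLite (table and column names are made up):

library(RSQLite)
con <- dbConnect(SQLite(), "configs.sqlite")
dbWriteTable(con, "configs",
             data.frame(run_id = 1L, step_size = 0.05, tolerance = 0.01),
             append = TRUE)               # one row per run configuration
cfg <- dbGetQuery(con, "SELECT * FROM configs WHERE run_id = 1")
dbDisconnect(con)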
I realize this answer is overkill for the original question - how to repeat work only a couple of times, with a few parameter changes, but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)
In those cases I like to work with a combination of a little shell script, a PDF-cropping program, and Sweave. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :). I have a separate file for the data juggling and separate files for different types of analysis, such as descriptiveStats.R and regressions.R, for example.
By the way, here's my little shell script:
#!/bin/sh
R CMD Sweave docSweave.Rnw
for file in pdfs/*; do
    pdfcrop "$file" "$file"
done
pdflatex docSweave.tex
open docSweave.pdf
The Sweave file typically sources the R files mentioned above when needed. I'm not sure whether that's what you're looking for, but that's my strategy so far. At the very least, I believe that creating transparent, reproducible reports helps you stick to a strategy.
Your third option is not so bad. I do this in many cases. You can build a bit more structure by putting the results of your preamble code in environments and attaching the one you want to use for further analysis.
An example:
setup1 <- local({
  x <- rnorm(50, mean = 2.0)
  y <- rnorm(50, mean = 1.0)
  # ...
  environment()  # the environment is the block's value
})
setup2 <- local({
  x <- rnorm(50, mean = 1.8)
  y <- rnorm(50, mean = 1.5)
  # ...
  environment()
})
attach(setup1) and run/source your analysis code
plot(x, y)
t.test(x, y, paired = TRUE, var.equal = TRUE)
...
When finished, detach(setup1) and attach the second one.
Now, at least you can easily switch between setups. Helped me a few times.
I tend to push such results into a global list.
I use Common Lisp but then R isn't so different.
Too late for you here, but I use Sweave a lot, and most probably I'd have used a Sweave file from the beginning (e.g. if I know that the final product needs to be some kind of report).
For repeating parts of the analysis a second and third time, there are then two options:
If the results are rather "independent" (i.e. the task should produce three reports, and comparison means inspecting the reports side by side) and the changed input comes in the form of new data files, then each new input goes into its own directory together with a copy of the Sweave file, and I create separate reports (similar to source, but this feels more natural for Sweave than for plain source).
If I instead need to do exactly the same thing once or twice more inside one Sweave file, I'd consider reusing code chunks. This is similar to the ugly for-loop.
The reason is that the results are then all together for the comparison, which becomes the last part of the report.
If it is clear from the beginning that there will be several parameter sets and a comparison, I write the code so that, as soon as I'm happy with each part of the analysis, it is wrapped into a function (i.e. I'm actually writing the function in the editor window, but evaluating the lines directly in the workspace while writing the function).
Given that you are in the described situation, I agree with Nick: there is nothing wrong with source, and everything else means much more effort now that you already have it as a script.
I can't comment on Iterator's answer, so I have to post this here. I really like his answer, so I made a short script for creating the parameters and exporting them to external JSON files, which I hope someone finds useful: https://github.com/kiribatu/Kiribatu-R-Toolkit/blob/master/docs/parameter_configuration.md

Length of an XML file

I have an XML file of size 31 GB. I need to find the total number of lines in that file. I know the command wc -l will give me this. However, it's taking too long to perform the operation. Is there any faster mechanism to find the number of lines in a large file?
31 GB is a really big text file. I bet it would compress down to about 1.5 GB. I would create these files in a compressed format to begin with; then you can stream a decompressed version of the file through wc. This will greatly reduce the amount of I/O and memory used to process the file. gzip can read and write compressed streams.
But I would also make the following comments:
Line numbers are not really that informative for XML as whitespace between elements is ignored (except for mixed content). What do you really want to know about the dataset? I bet counting elements would be more useful.
Make sure your XML file is not unnecessarily redundant; for example, are you repeating the same namespace declarations all over the document?
Perhaps XML is not the best way to represent this document; if it is, try looking into something like Fast Infoset.
If all you need is the line count, wc -l will be as fast as anything else.
The problem is the 31GB text file.
If accuracy isn't an issue, find the average line length and divide the file size by that. That way you can get a really fast approximation. (make sure to consider the character encoding used)
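For example, in R (a sketch; the file name is hypothetical):

n_sample    <- 10000
first_lines <- readLines("big.xml", n = n_sample)
avg_len     <- mean(nchar(first_lines, type = "bytes")) + 1  # +1 for the newline
est_lines   <- file.info("big.xml")$size / avg_len           # rough line count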
This falls beyond the point where the code should be refactored to avoid the problem entirely. One way to do this is to place all of the data from the file into a tuple-store database instead. Apache CouchDB and InterSystems Caché are two systems you could use for this, and they will be far better optimized for the type of data you're dealing with.
If you're really stuck with the XML file, then another option is to count all the lines ahead of time and cache this value. Each time a line is added to or removed from the file, you add or subtract one from the cached count. Also, make sure to use a 64-bit integer, since there may be more than 2^32 lines.
No, not really. wc is going to be pretty well optimized. 31GB is a lot of data, and reading it in to count lines is going to take a while no matter what program you use.
Also, this question isn't really appropriate for Stack Overflow, as it's not about programming at all.
Isn't counting lines pretty unreliable, since in XML a newline is basically just a cosmetic thing? It would probably be better to count the number of occurrences of a specific tag.
