I am trying to write code to store a matrix in a variable directly from Matrix Market's website. Below is a sample URL that I'd use:
https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz
The example URL downloads a bcsstk01.mtx.gz file. I need to extract the bcsstk01.mtx file and then use MatrixMarket.mmread() to save it to a variable.
I first tried saving the downloaded file (or URL location) to a variable with A = HTTP.get(), but a lack of online resources and my own lack of knowledge led to no results. Then I used HTTP.download() and got the .mtx.gz file, but I couldn't unzip it. And finally, MatrixMarket.mmread() cannot read .gz files. So I'm stuck with a downloaded file I can't do anything with unless I unzip it manually.
Using the info from the link in the comments and some fiddling, I managed to get the following:
using TranscodingStreams, CodecZlib
using Downloads

# a PipeBuffer holds the raw .gz bytes; wrapping it in a
# TranscodingStream decompresses them on the fly as they are read
stream = PipeBuffer()
openstream = TranscodingStream(GzipDecompressor(), stream)

# download the gzipped .mtx file straight into the buffer
Downloads.download("https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz", stream)

# read the decompressed MatrixMarket contents line by line
for line in eachline(openstream)
    println(line)
end
This prints:
%%MatrixMarket matrix coordinate real symmetric
48 48 224
1 1 2.8322685185200e+06
5 1 1.0000000000000e+06
6 1 2.0833333333300e+06
7 1 -3.3333333333300e+03
...
which I suppose is the desired data.
I am experiencing read speeds that I believe are much slower than expected when reading a fairly large file in R with fread.
The file is ~60 million rows x 147 columns, of which I am selecting only 27 columns directly in the fread call using select; only 23 of the 27 are found in the actual file (I probably typed some of the column names incorrectly, but I guess that matters less).
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
                  verbose = TRUE,
                  select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335, and data.table 1.12.0.
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555  Quote rule 0
  Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
  Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
  Type counts:
       124 : drop      '0'
         3 : int32     '5'
         7 : float64   '7'
        13 : string    'A'
=============================
   0.000s (  0%) Memory map 48.996GB file
   0.035s (  0%) sep=',' ncol=147 and header detection
   0.001s (  0%) Column type detection using 10045 sample rows
   6.000s (  0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s (  1%) Transpose
   +  144.273s (  8%) Waiting
  24.545s (  1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s        Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out whether this is just normal or whether there is anything I can do to improve these reading speeds. Based on various benchmarks I've seen around, and my own experience and intuition with fread on smaller files, I would have expected this to be read much, much quicker.
Also, I was wondering whether the multi-core capabilities are being fully used, as I have heard that under Windows this might not always be straightforward. My knowledge of this topic is pretty limited, unfortunately, but the verbose output does show that fread is detecting 16 cores.
Thoughts:
(1) If you are using Windows, use Microsoft Open R; even more so if the cloud is Azure. Actually, there may be coordination between Open R and the Azure client. Because of Intel's MKL and Microsoft's built-in enhancements, I find Microsoft Open R faster on Windows.
(2) I suspect select and drop work only after a full file read. Maybe read the whole file and subset or filter afterward.
(3) I think a restart is overkill. I run gc three times every so often, like this: gc(); gc(); gc(). I have heard others say this does nothing, but at least it makes me feel better. Actually, I notice it helps me on Windows.
(4) The latest versions of data.table fread are implementing YAML. This looks promising.
(5) setDTthreads(0) uses all the cores. Too much parallelization can work against you; try halving your cores (see the sketch below).
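As a minimal sketch of point (5), assuming data.table >= 1.12 (setDTthreads() and getDTthreads() are part of its API) and the same file and cols2Select as above:

library(data.table)
getDTthreads()    # how many threads data.table will currently use
setDTthreads(8)   # try half of the 16 detected cores
DT <- fread("..\\TOI\\TOI_RAW_APextracted.csv",
            select = cols2Select,
            verbose = TRUE)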
I know there are other tools around like AWStats or Splunk, but I wonder whether any serious (web) server logfile analysis is being done in R. R might not be the first choice for it, but it has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there an R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?
In connection with a project to build an analytics toolbox for our Network Ops guys, I built one of these about two months ago. My employer has no problem if I open source it, so if anyone is interested I can put it up on my github repo. I assume it's most useful to this group if I build an R package. I won't be able to do that straight away, though, because I need to research the docs on package building with non-R code (it might be as simple as tossing the python bytecode files in /exec along with a suitable python runtime, but I have no idea).
I was actually surprised that I needed to undertake a project of this sort. There are at least several excellent open-source and free log file parsers/viewers (including the excellent Webalizer and AWStats), but neither parses server error logs (parsing server access logs is the primary use case for both).
If you are not familiar with error logs or with the difference between them and access logs: in sum, Apache servers (likewise, nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X, that directory is in /var, just below root:
$> pwd
/var/log/apache2
$> ls
access_log error_log
For network diagnostics, error logs are often far more useful than the access logs. They also happen to be significantly more difficult to process, because of the unstructured nature of the data in many of the fields and, more significantly, because the data file you are left with after parsing is an irregular time series: you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
I wanted an app that I could toss raw error logs into (of any size, but usually several hundred MB at a time) and have something useful come out the other end, which in this case had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization I coded in R.
I have been building analytics tools for a long time, but only in the past four years have I been using R. So my first impression, immediately upon parsing a raw log file and loading the data frame in R, was what a pleasure R is to work with and how well suited it is for tasks of this sort. A few welcome surprises:
Serialization. Persisting working data in R is a single command (save). I knew this, but I didn't know how efficient this binary format is. The actual data: for every 50 MB of raw logfiles parsed, the .RData representation was about 500 KB, i.e., 100:1 compression. (Note: I pushed this down further, to about 300:1, by using the data.table library and manually setting the compression level argument to the save function; see the sketch after this list.)

IO. My Data Warehouse relies heavily on a lightweight data-structure server that resides entirely in RAM and writes to disk asynchronously, called redis. The project itself is only about two years old, yet there's already a redis client for R on CRAN (by B.W. Lewis, version 1.6.1 as of this post).

Primary Data Analysis. The purpose of this project was to build a library for our Network Ops guys to use. My goal was a "one command = one data view" type of interface. So, for instance, I used the excellent googleVis package to create professional-looking scrollable/paginated HTML tables with sortable columns, into which I loaded a data frame of aggregated data (>5,000 lines). Just those few interactive elements, e.g., sorting a column, delivered useful descriptive analytics. Another example: I wrote a lot of thin wrappers over some basic data-juggling and table-like functions; each of these functions I would, for instance, bind to a clickable button on a tabbed web page. Again, this was a pleasure to do in R, in part because quite often the function required no wrapper; the single command with the arguments supplied was enough to generate a useful view of the data.
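A minimal sketch of the serialization point, assuming base R's save() (compress and compression_level are documented arguments) and a stand-in data.table DT in place of the real parsed logs:

library(data.table)
DT <- data.table(ts = Sys.time() + 0:99, msg = "connection refused")  # stand-in for parsed log data
# stronger compression than the default gzip; the exact settings the author used aren't stated
save(DT, file = "parsed_logs.RData", compress = "xz", compression_level = 9)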
A couple of examples from the last bullet:
# what are the most common issues that cause an error to be logged?
err_order = function(df) {
  # count log entries per issue description
  t0 = xtabs(~Issue_Descr, df)
  m = cbind(names(t0), t0)
  rownames(m) = NULL
  colnames(m) = c("Cause", "Count")
  # sort by count, descending
  ndx = order(as.numeric(m[, 2]), decreasing = TRUE)
  m = m[ndx, ]
  m1 = data.frame(Cause = m[, 1],
                  Count = as.numeric(m[, 2]),
                  CountAsProp = 100 * as.numeric(m[, 2]) / nrow(df))
  # keep only causes accounting for at least 1% of logged errors
  subset(m1, CountAsProp >= 1)
}
# calling this function, passing in a data frame, returns something like:
Cause Count CountAsProp
1 'connect to unix://var/ failed' 200 40.0
2 'object buffered to temp file' 185 37.0
3 'connection refused' 94 18.8
The primary data cube, displayed for interactive analysis using googleVis:
(Figure: a contingency table, from an xtabs call, displayed using googleVis.)
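A hedged sketch of how such a table can be produced (gvisTable and its page/pageSize options are part of the googleVis API; agg stands in for the aggregated data frame):

library(googleVis)
agg <- err_order(df)   # aggregated causes/counts from the function above
tbl <- gvisTable(agg, options = list(page = "enable", pageSize = 25))
plot(tbl)              # opens the sortable, paginated HTML table in a browser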
It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning algorithms, has three different regexp engines to parse, and so on.
And it may not be a novel idea. A few years ago, I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs, but I no longer have a URL. Even outside of temporal patterns, there is a lot one could do here.
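As a minimal sketch of the regexp and date/time point, in base R only (the pattern and field names are my own; the sample line is the curl request used in the Python answer further down):

line <- '54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173'
# capture IP, timestamp, request, status, and size with one regex
m <- regmatches(line, regexec('^(\\S+) \\S+ \\S+ \\[([^]]+)\\] "([^"]*)" (\\d{3}) (\\S+)',
                              line, perl = TRUE))[[1]]
ts     <- strptime(m[3], format = "%d/%b/%Y:%H:%M:%S %z", tz = "UTC")  # proper date/time object
status <- as.integer(m[5])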
I have used R to load and parse IIS log files with some success. Here is my code.
Load IIS log files:
require(data.table)
setwd("Log File Directory")

# get a list of all the log files
log_files <- Sys.glob("*.log")

# read each log file and concatenate them into one data frame;
# the "#Fields:" header lines are treated as comments and skipped
IIS <- do.call("rbind",
               lapply(log_files, read.csv,
                      sep = " ", header = FALSE,
                      comment.char = "#", na.strings = "-"))

# add field names (copy the "#Fields:" header line from one of the log files)
colnames(IIS) <- c("date", "time", "s_ip", "cs_method", "cs_uri_stem",
                   "cs_uri_query", "s_port", "cs_username", "c_ip",
                   "cs_User_Agent", "sc_status", "sc_substatus",
                   "sc_win32_status", "sc_bytes", "cs_bytes", "time-taken")

# change it to a data.table
IIS <- data.table(IIS)

# query at will
IIS[, .N, by = list(sc_status, cs_username, cs_uri_stem, sc_win32_status)]
I did a logfile analysis recently using R. It was nothing really complex, mostly descriptive tables. R's built-in functions were sufficient for this job.
The problem was the data storage, as my logfiles were about 10 GB. Revolution R does offer new methods to handle such big data, but I ultimately decided to use a MySQL database as a backend (which in fact reduced the size to 2 GB through normalization).
That could also solve your problem in reading logfiles into R.
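A hedged sketch of that backend approach, assuming the DBI and RMySQL packages and a hypothetical local database logs containing an access_log table:

library(DBI)
# connect to the MySQL backend holding the normalized logs
con <- dbConnect(RMySQL::MySQL(), dbname = "logs", host = "localhost",
                 user = "analyst", password = "secret")
# let the database aggregate; pull only the small result into R
errs <- dbGetQuery(con, "SELECT status, COUNT(*) AS n FROM access_log
                         GROUP BY status ORDER BY n DESC")
dbDisconnect(con)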
#!python
# Python 2
import argparse
import csv
import cStringIO as StringIO

# space-delimited, unquoted output; commas act as escape characters,
# which is what turns each space into the ", " seen in the demo output
class OurDialect:
    escapechar = ','
    delimiter = ' '
    quoting = csv.QUOTE_NONE

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--source', type=str, dest='line',
                    default=[['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"'''],
                             ['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"''']])
arguments = parser.parse_args()

try:
    # read raw log lines from the file given via -f/--source
    # (note: opened with 'r', not 'wb': we are reading, not writing)
    with open(arguments.line, 'r') as fin:
        line = [[l] for l in fin.readlines()]
except (IOError, TypeError):
    # no readable file given: fall back to the built-in demo records
    line = arguments.line

header = ['IP', 'Ident', 'User', 'Timestamp', 'Offset', 'HTTP Verb', 'HTTP Endpoint', 'HTTP Version', 'HTTP Return code', 'Size in bytes', 'User-Agent']

# strip trailing newlines and drop the brackets and quotes around fields
lines = [[l.rstrip('\n').replace('[', '').replace(']', '').replace('"', '') for l in l1] for l1 in line]

out = StringIO.StringIO()
# the header row uses the default comma dialect ...
writer = csv.writer(out)
writer.writerow(header)
# ... while data rows use OurDialect, so fields come out comma-space separated
writer = csv.writer(out, dialect=OurDialect)
writer.writerows(lines)

print(out.getvalue())
Demo output:
IP,Ident,User,Timestamp,Offset,HTTP Verb,HTTP Endpoint,HTTP Version,HTTP Return code,Size in bytes,User-Agent
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
This format can easily be read into R using read.csv, and it doesn't require any third-party libraries.
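For example, a minimal sketch, assuming the parser output was saved to a hypothetical file parsed.csv. strip.white trims the blank after each comma, and the header row is skipped because the demo rows carry two more fields than the header has names (the quoted "-" fields get split):

logs <- read.csv("parsed.csv", header = FALSE, skip = 1,
                 strip.white = TRUE, stringsAsFactors = FALSE)
head(logs)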