Loading Wikidata dump

I'm loading all geographic entries (Q56061) from the Wikidata JSON dump.
The whole dump contains about 16M entries according to the Wikidata:Statistics page.
Using Python 3.4 + ijson + libyajl2, just parsing the file takes about 93 hours of CPU time (AMD Phenom II X4 945, 3 GHz).
Using online sequential item queries for the 2.3M entries of interest would take about 134 hours in total.
Is there a more efficient way to perform this task?
(Maybe something like OpenStreetMap's PBF format and the Osmosis tool.)

My loading code and estimates were wrong.
Using ijson.backends.yajl2_cffi brings it down to about 15 hours for full parsing + filtering + storing to the database.
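For reference, a minimal sketch of that streaming approach, assuming the standard dump layout (one large top-level JSON array of entity objects) and assuming the filter is expressed via the "instance of" (P31) claims; only the Q56061 ID comes from the question, everything else is an assumption:

# Minimal sketch: stream the dump with the yajl2_cffi backend so the whole
# file never has to fit in memory. Assumes a single top-level JSON array of
# entity objects and filtering on the P31 ("instance of") claim.
import ijson.backends.yajl2_cffi as ijson

TARGET_CLASS = 'Q56061'

def matching_entities(dump_path):
    with open(dump_path, 'rb') as f:
        for entity in ijson.items(f, 'item'):   # one entity at a time
            for claim in entity.get('claims', {}).get('P31', []):
                value = claim.get('mainsnak', {}).get('datavalue', {}).get('value', {})
                if value.get('id') == TARGET_CLASS:
                    yield entity
                    break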

Related

Rascal MPL get line count of file without loading its contents

Is there a more efficient way than
int fileSize = size(readFileLines(fileLoc));
to get the total number of lines in a file? I presume this code has to read the entire file first, which could become costly for huge files.
I have looked into IO and Loc to see whether some of this information might be stored in conjunction with the file.
This is the way, unless you'd like to call wc -l via util::ShellExec 😁
Apart from streaming the file and saving some memory, counting lines is always linear in the size of the file, so you won't win much time.
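A rough sketch of the streaming idea (in Python rather than Rascal): memory stays constant, but the time is still linear in the file size.

# Rough sketch: count lines while streaming; constant memory, linear time.
def count_lines(path):
    with open(path, 'rb') as f:
        return sum(1 for _ in f)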

Reading MarkLogic logs from Query Console using XQuery

I want to read MarkLogic logs (e.g. ErrorLog.txt) from Query Console using XQuery. I have the code below, but the problem is that the output is not properly formatted.
xquery version "1.0-ml";
for $hid in xdmp:hosts()
let $h := xdmp:host-name($hid)
return
  xdmp:filesystem-file("file://" || $h || "/" || xdmp:data-directory($hid) || "/Logs/ErrorLog.txt")
The problem is that the result comes back host by host: first all log entries of one host, then, starting again at time 00:00:01, the entries of host 2, and then those of host 3. The entries look like this:
2019-07-02 00:00:35.668 Info: Merging 2 MB from /cams/q06data02/testQA2/Forests/testQA2-2.2/0002b4cd to /cams/q06data02/testQA2/Forests/testQA2-2.2/0002b4ce, timestamp=15620394303480170
2019-07-02 00:00:36.007 Info: Merged 3 MB at 9 MB/sec to /cams/q06data02/testQA2/Forests/test2-2.2/0002b4ce
2019-07-02 00:00:38.161 Info: Deleted 3 MB at 399 MB/sec /cams/q06data02/test2/Forests/test2-2.2/0002b4cd
Is it possible to include the hostname with each log entry and to sort the output by timestamp, something like:
host 1 : 2019-07-02 00:00:01 : Info Merging ....
host 2 : 2019-07-02 00:00:02 : Info Deleted 3 MB at 399 MB/sec ...
Log files are text files. You can parse and sort them like any other text file.
They can get very large (GB+), though, so simple methods may not perform well.
You also need to be able to parse the text into fields in order to sort by a field.
Since the first 20 bytes of every line are the timestamp, and that timestamp is in ISO format, which sorts lexically in the same order as the dates, you can split the files into lines and sort them using basic collation to get the entries of multiple files in time order.
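As a rough illustration of that approach (not MarkLogic code; host names and file paths here are hypothetical), each host's file can be tagged with its hostname and the already time-ordered files merged on the leading timestamp:

# Rough sketch: tag each line with its host and merge the (already
# time-ordered) per-host files by the leading ISO-style timestamp,
# which sorts lexically in time order. Paths are hypothetical.
import heapq

logs = {'host1': 'host1_ErrorLog.txt', 'host2': 'host2_ErrorLog.txt'}

def tagged(host, path):
    with open(path) as f:
        for line in f:
            yield (line[:23], host + ' : ' + line.rstrip('\n'))

for _, entry in heapq.merge(*(tagged(h, p) for h, p in logs.items())):
    print(entry)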
In V9 one can use the pair of xdmp:logfile-scan and xdmp:logmessage-parse to efficiently search over log files (remote as well as local) and then transform the results into text, XML (attribute or element format) or JSON.
One can also use the REST API for the same.
see: https://docs.marklogic.com/REST/GET/manage/v2/logs
Once the log files (ideally a selected subset of log messages that is small enough to manage) are converted to a structured format (XML, JSON or text lines), sorting, searching, enriching etc. are easily performed.
For something much better, take a look at Ops Director: https://docs.marklogic.com/guide/opsdir/intro

Batch loading in DynamoDB

Hi, currently I am loading each row individually into DynamoDB, which is very slow.
I have a huge amount of data which I want to load into DynamoDB via the Java API.
But this takes a huge amount of time. For example, loading 1 million items took me 2 days.
Is batch loading possible in DynamoDB? I am not finding any information about bulk load or batch load.
Appreciate any help here.
I know it's an old post, but we came up short exploring how to optimize this, so we embarked on a scientific discovery :)
http://tech.equinox.com/driving-miss-dynamodb/
The long and short of it is that Hive on EMR is an excellent option (I know, it's "old skool").
Using and tuning these parameters does the trick (see the blog for details):
SET dynamodb.throughput.write.percent = x;
SET mapred.reduce.tasks = x;
SET hive.exec.reducers.bytes.per.reducer = x;
SET tez.grouping.split-count = x;
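For what it's worth, DynamoDB also exposes a native batch write API (BatchWriteItem, up to 25 items per request). The question asks about the Java API, so this is only illustrative; a rough sketch with Python's boto3, where the table and attribute names are hypothetical:

# Rough sketch: boto3's batch_writer buffers puts and flushes them as
# BatchWriteItem requests behind the scenes. Table/attribute names are
# hypothetical placeholders.
import boto3

table = boto3.resource('dynamodb').Table('my_table')

with table.batch_writer() as batch:
    for i in range(1000):
        batch.put_item(Item={'pk': str(i), 'payload': 'row %d' % i})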

Powershell, R, Import-Csv, select-object, Export-csv

I'm performing several tests using different approaches for cleaning a big csv file and then importing it into R.
This time I'm playing with Powershell in Windows.
While it works well and is more accurate than using cut() with pipe(), the process is horribly slow.
This is my command:
shell(shell = "powershell",
"Import-Csv In.csv |
select-object col1, col2, etc |
Export-csv new.csv")
And these are the system.time() results:
user system elapsed
0.61 0.42 1568.51
I've seen some other posts that use C# via streaming, taking a couple of dozen seconds, but I don't know C#.
My question is: how can I improve the PowerShell command in order to make it faster?
Thanks,
Diego
There's a fair amount of overhead in reading in the csv, converting the rows to PowerShell objects, and then converting back to csv. Doing it through the pipeline that way also causes it to process one record at a time. You should be able to speed that up considerably if you switch to using Get-Content with a -ReadCount parameter and extract your data with a regular expression in a -replace operator, e.g.:
shell(shell = "powershell",
      "Get-Content In.csv -ReadCount 1000 |
       foreach { $_ -replace '^(.+?,.+?),','$1' |
       Add-Content new.csv }")
This will reduce the number of disk reads, and the -replace will function as an array operator, processing 1000 records at a time.
First and foremost, my first test was wrong in the sense that, due to some errors I had before, several other PowerShell sessions remained open and delayed the whole process.
These are the real numbers:
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.09 0.05 824.53
Now, as I said, I couldn't find a good pattern for splitting the columns of my csv file.
I should add that it is a messy file, with fields containing commas, multiline fields, summary lines, and so on.
I tried other approaches, like a very famous one on Stack Overflow that uses embedded C# code in PowerShell to split csv files.
While it works faster than the more common approach I showed previously, the results are not accurate for these kinds of messy files.
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.01 0.00 212.96
Both approaches showed similar RAM consumption (~40 MB) and CPU usage (~50%) most of the time.
So while the former approach took 4 times as long as the latter, the accuracy of the results, the low cost in terms of resources, and the shorter development time make me consider it the most efficient for big and messy csv files.
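Not from the original thread, but as a point of comparison: Python's csv module streams records and understands quoted, multi-line fields, so a column selection like the one above can be sketched as follows (file and column names are hypothetical):

# Rough sketch: stream a messy CSV and keep only two columns, letting the
# csv module handle embedded commas and quoted multi-line fields.
import csv

with open('In.csv', newline='') as src, open('new.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=['col1', 'col2'], extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)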

Logfile analysis in R?

I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. R might not be the first thing that comes to mind for this, but it has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there an R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?
In connection with a project to build an analytics toolbox for our Network Ops guys, I built one of these about two months ago. My employer has no problem if I open source it, so if anyone is interested I can put it up on my github repo. I assume it's most useful to this group if I build an R package. I won't be able to do that straight away, though, because I need to research the docs on package building with non-R code (it might be as simple as tossing the python bytecode files in /exec along with a suitable python runtime, but I have no idea).
I was actually surprised that I needed to undertake a project of this sort. There are at least several excellent open source and free log file parsers/viewers (including the excellent Webalyzer and AWStats), but neither parses server error logs (parsing server access logs is the primary use case for both).
If you are not familiar with error logs or with the difference between them and access logs: in sum, Apache servers (likewise nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X, that directory is in /var, just below root:
$> pwd
/var/log/apache2
$> ls
access_log error_log
For network diagnostics, error logs are often far more useful than the access logs.
They also happen to be significantly more difficult to process because of the unstructured nature of the data in many of the fields and, more significantly, because the data file you are left with after parsing is an irregular time series: you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
I wanted an app that I could toss raw error logs into (of any size, but usually several hundred MB at a time) and have something useful come out the other end, which in this case had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in Python, while the processor (e.g., gridding the parser output to create a regular time series), all analytics, and the data visualization I coded in R.
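Not the author's code, but a rough sketch of that "gridding" step: turning an irregular stream of error timestamps into a regular per-minute count series, here with pandas (file and column names are hypothetical):

# Rough sketch: resample irregular error timestamps onto a regular 1-minute
# grid so downstream analytics see an evenly spaced time series.
import pandas as pd

events = pd.read_csv('parsed_error_log.csv', parse_dates=['timestamp'])
per_minute = events.set_index('timestamp').resample('1min').size()
print(per_minute.head())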
I have been building analytics tools for a long time, but only in the past four years have I been using R. So my first impression, immediately upon parsing a raw log file and loading the data frame in R, was what a pleasure R is to work with and how well suited it is to tasks of this sort. A few welcome surprises:
Serialization. Persisting working data in R is a single command (save). I knew this, but I didn't know how efficient this binary format is. The actual data: for every 50 MB of raw logfiles parsed, the .RData representation was about 500 KB, i.e. 100:1 compression. (Note: I pushed this down further to about 300:1 by using the data.table library and manually setting the compression level argument to the save function.)
IO. My data warehouse relies heavily on a lightweight data-structure server that resides entirely in RAM and writes to disk asynchronously, called redis. The project itself is only about two years old, yet there's already a redis client for R on CRAN (by B.W. Lewis, version 1.6.1 as of this post).
Primary Data Analysis. The purpose of this project was to build a library for our Network Ops guys to use. My goal was a "one command = one data view" type of interface. So, for instance, I used the excellent googleVis package to create professional-looking scrollable/paginated HTML tables with sortable columns, into which I loaded a data frame of aggregated data (>5,000 lines). Just those few interactive elements, e.g. sorting a column, delivered useful descriptive analytics. Another example: I wrote a lot of thin wrappers over some basic data-juggling and table-like functions; each of these functions I would, for instance, bind to a clickable button on a tabbed web page. Again, this was a pleasure to do in R, in part because quite often the function required no wrapper; the single command with the arguments supplied was enough to generate a useful view of the data.
A couple of examples of the last bullet:
# what are the most common issues that cause an error to be logged?
err_order = function(df){
    t0 = xtabs(~Issue_Descr, df)
    m = cbind(names(t0), t0)
    rownames(m) = NULL
    colnames(m) = c("Cause", "Count")
    x = as.numeric(m[, 2])
    ndx = order(x, decreasing = TRUE)
    m = m[ndx, ]
    m1 = data.frame(Cause = m[, 1], Count = as.numeric(m[, 2]),
                    CountAsProp = 100 * as.numeric(m[, 2]) / dim(df)[1])
    subset(m1, CountAsProp >= 1)
}
# calling this function, passing in a data frame, returns something like:
Cause Count CountAsProp
1 'connect to unix://var/ failed' 200 40.0
2 'object buffered to temp file' 185 37.0
3 'connection refused' 94 18.8
The primary data cube displayed for interactive analysis using googleVis:
(Figure: a contingency table from an xtabs call, displayed using googleVis.)
It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning algorithms, has three different regexp engines to parse, etc.
And it may not be a novel idea. A few years ago I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: Read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs but I no longer have a URL. Even outside of temporal patterns there is a lot one could do here.
I have used R to load and parse IIS log files with some success; here is my code.
Load IIS Log files
require(data.table)
setwd("Log File Directory")
# get a list of all the log files
log_files <- Sys.glob("*.log")
# This line
# 1) reads each log file
# 2) concatenates them
IIS <- do.call( "rbind", lapply( log_files, read.csv, sep = " ", header = FALSE, comment.char = "#", na.strings = "-" ) )
# Add field names - copy the "#Fields:" header line from one of the log files
colnames(IIS) <- c("date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query", "s_port", "cs_username", "c_ip", "cs_User_Agent", "sc_status", "sc_substatus", "sc_win32_status", "sc_bytes", "cs_bytes", "time-taken")
#Change it to a data.table
IIS <- data.table( IIS )
#Query at will
IIS[, .N, by = list(sc_status,cs_username, cs_uri_stem,sc_win32_status) ]
I did a logfile analysis recently using R. It was nothing really complex, mostly descriptive tables. R's built-in functions were sufficient for this job.
The problem was the data storage, as my logfiles were about 10 GB. Revolution R does offer new methods to handle such big data, but in the end I decided to use a MySQL database as a backend (which in fact reduced the size to 2 GB through normalization).
That could also solve your problem of reading logfiles into R.
#!python
import argparse
import csv
import cStringIO as StringIO

# Dialect used when writing out the parsed log lines. With QUOTE_NONE and
# escapechar = ',', every space in a line gets a comma written before it,
# which turns the space-separated log fields into ", "-separated output.
class OurDialect:
    escapechar = ','
    delimiter = ' '
    quoting = csv.QUOTE_NONE

# Two sample Apache access-log lines, used when no file is supplied.
default_lines = [['54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"'],
                 ['54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"']]

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--source', type=str, dest='source', default=None)
arguments = parser.parse_args()

if arguments.source:
    # Read the log file, one entry per line.
    with open(arguments.source, 'r') as fin:
        line = [[l] for l in fin.readlines()]
else:
    line = default_lines

header = ['IP', 'Ident', 'User', 'Timestamp', 'Offset', 'HTTP Verb', 'HTTP Endpoint', 'HTTP Version', 'HTTP Return code', 'Size in bytes', 'User-Agent']

# Strip the trailing character and the [ ] and " delimiters from each entry.
lines = [[l[:-1].replace('[', '"').replace(']', '"').replace('"', '') for l in l1] for l1 in line]

out = StringIO.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer = csv.writer(out, dialect=OurDialect)
writer.writerows([[l1 for l1 in l] for l in lines])

print(out.getvalue())
Demo output:
IP,Ident,User,Timestamp,Offset,HTTP Verb,HTTP Endpoint,HTTP Version,HTTP Return code,Size in bytes,User-Agent
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
This format can easily be read into R using read.csv, and it doesn't require any third-party libraries.
