Load Freebase full dump file into Virtuoso

I have downloaded the full RDF Freebase dump file 'freebase-rdf-2012-12-09-00-00.gz' (7.5 GB) from this link: http://download.freebaseapps.com/
The data dump uses the Turtle RDF syntax as defined here: http://wiki.freebase.com/wiki/Data_dumps
How can I load this file into Virtuoso (06.04.3132)?
I tried to use this command
SQL> DB.DBA.TTLP_MT (file_to_string_output ('freebase-rdf-2012-12-09-00-00.gz'), '', 'http://freebase.com');
but it finished in a very short time. The following query returned only 2 rows (triples) from the source file, with no exceptions in the log:
SELECT ?a ?b ?c from <http://freebase.com> where {?a ?b ?c}
http://rdf.freebase.com/ns/american_football.football_historical_roster_position.number  http://rdf.freebase.com/ns/type.object.name  Number
http://rdf.freebase.com/ns/american_football.football_historical_roster_position.number  http://rdf.freebase.com/ns/type.object.type  http://rdf.freebase.com/ns/type.property.
2 Rows. -- 78 msec.
By the way, how long might it take to load such a big file (with 8 GB or 24 GB of RAM)?
Can this dump file be loaded into TDB (via tdbloader), Sesame OpenRDF (via load) or an OWLIM SE repository without modification?
And after loading, will my (not very complex) SELECT SPARQL queries return results in a reasonable time?
Thank you!

I got a reply from the [freebase-discuss] mailing list:
The Freebase dump should be unpacked, split and run through fix scripts. More details here:
http://people.apache.org/~andy/Freebase20121223
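Once the dump has been unpacked, split and fixed up, a common way to get the resulting Turtle files into Virtuoso is the RDF bulk loader rather than a single TTLP_MT call over the whole archive. This is only a sketch, run from isql: it assumes the bulk loader procedures (ld_dir / rdf_loader_run) are available in your build, that the directory is listed in DirsAllowed in virtuoso.ini, and the path below is a placeholder.
-- register every Turtle file in the directory for loading into the target graph
ld_dir ('/data/freebase/split', '*.ttl', 'http://freebase.com');
-- start the loader; several rdf_loader_run() calls can run in parallel isql sessions
rdf_loader_run ();
-- make the load durable and check per-file status (ll_state = 2 means loaded)
checkpoint;
SELECT ll_file, ll_state FROM DB.DBA.load_list;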

Related

Kusto ingestion error "BadRequest_EmptyArchive: Empty zip archive"

I have a bunch of .csv files in Azure blob storage, and an ingestion rule to pull them into Kusto (Azure Data Explorer). This used to work, but I've been getting a lot of ingestion failures lately. ".show ingestion failures" gives me:
Details FailureKind OperationKind ErrorCode ShouldRetry IngestionProperties IngestionSourcePath
BadRequest_EmptyArchive: Empty zip archive Permanent DataIngestPull BadRequest_EmptyArchive 0 "[Format=Csv/mandatory, IngestionMapping=[{""column"":""CrashSource"",""datatype"":""string"",""Ordinal"":""0""},{""column"":""CrashType"",""datatype"":""string"",""Ordinal"":""1""},{""column"":""ReportId"",""datatype"":""string"",""Ordinal"":""2""},{""column"":""DeviceId"",""datatype"":""string"",""Ordinal"":""3""},{""column"":""DeviceSerialNumber"",""datatype"":""string"",""Ordinal"":""4""},{""column"":""DumpFilePath"",""datatype"":""string"",""Ordinal"":""5""},{""column"":""FailureXmlPath"",""datatype"":""string"",""Ordinal"":""6""},{""column"":""PROCESS_NAME"",""datatype"":""string"",""Ordinal"":""7""},{""column"":""BUILD_VERSION_STRING"",""datatype"":""string"",""Ordinal"":""8""},{""column"":""DUMP_TYPE"",""datatype"":""string"",""Ordinal"":""9""},{""column"":""PRIMARY_PROBLEM_CLASS"",""datatype"":""string"",""Ordinal"":""10""},{""column"":""IMAGE_NAME"",""datatype"":""string"",""Ordinal"":""11""},{""column"":""FAILURE_BUCKET_ID"",""datatype"":""string"",""Ordinal"":""12""},{""column"":""OS_VERSION"",""datatype"":""string"",""Ordinal"":""13""},{""column"":""TARGET_TIME"",""datatype"":""string"",""Ordinal"":""14""},{""column"":""FAILURE_ID_HASH_STRING"",""datatype"":""string"",""Ordinal"":""15""},{""column"":""FAILURE_ID_HASH"",""datatype"":""string"",""Ordinal"":""16""},{""column"":""FAILURE_ID_REPORT_LINK"",""datatype"":""string"",""Ordinal"":""17""}], ValidationPolicy=[Options=ValidateCsvInputConstantColumns, Implications=BestEffort], Tags=[ToStringEmpty], IngestIfNotExists=[ToStringEmpty], ZipPattern=[null]]" https://crashanalysisresults.blob.core.usgovcloudapi.net/datacontainer/Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv
My CSV files are not zipped in blob storage. Do I need to do something with ZipPattern to say so?
Here's what this CSV contains (many strings simplified):
CrashSource,CrashType,ReportId,DeviceId,DeviceSerialNumber,DumpFilePath,FailureXmlPath,PROCESS_NAME,BUILD_VERSION_STRING,DUMP_TYPE,PRIMARY_PROBLEM_CLASS,IMAGE_NAME,FAILURE_BUCKET_ID,OS_VERSION,TARGET_TIME,FAILURE_ID_HASH_STRING,FAILURE_ID_HASH,FAILURE_ID_REPORT_LINK
"source","type","reportid","deviceid","","dumpfilepath","failurexmlpath","process","version","1","problem class","image","bucket","version","2020-07-27T22:36:44.000Z","hash string","{b27ad816-2cb5-c004-d164-516c7a32dcef}","link"
As often happens, I seem to have found my own answer just by asking. While reviewing my question, I realized the string ".zip" is in the middle of my CSV file name (Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv). That made me wonder if Kusto treats it differently because of that. I tested by taking the exact same file, renaming it "Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.crashanalysis.csv", and uploading it. It was successfully ingested.
<forehead smack> I guess I will rename my files to get past this, but this sounds like a Kusto ingestion bug to me.
Thanks for reporting this, Sue.
Yes, Kusto automatically attempts to check files with ".zip" in their names. We'll check why this is "tripping" when the string is in the middle of the file name instead of just at its end.
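For anyone else hitting this, one way to pull out just these failures is to filter the output of .show ingestion failures; a small sketch (the projected columns are the ones visible in the failure row above):
.show ingestion failures
| where ErrorCode == "BadRequest_EmptyArchive"
| project Details, FailureKind, OperationKind, ErrorCode, ShouldRetry, IngestionSourcePath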

tFlowToIterate with BigDataBatch job

I currently have a simple Standard job in Talend doing this:
It simply reads a file of several lines (tHDFSInput), and for each line of this file (tFlowToIterate) I create an "INSERT ... SELECT ... FROM" query based on what I read in the file (tHiveRow). It works well, it's just a bit slow.
I now need to turn my Standard job into a Big Data Batch job in order to make it faster, and also because I have been asked to build only Big Data Batch jobs from now on.
The problem is that there is no tFlowToIterate and no tHiveRow component in Big Data Batch jobs...
How can I do this?
Thanks a lot.
Though I haven't tried this solution, I think it can help you:
Create the Hive table upfront.
Place a tHDFSConfiguration component in the job and provide the cluster details.
Use a tFileInputDelimited component. With its storage configuration set to the tHDFSConfiguration component defined above, it will read from HDFS.
Use a tHiveOutput component and connect tFileInputDelimited to it. In tHiveOutput, you can set the table, format and save mode.
To load HDFS data into Hive without modifying the data, you may be able to use just one component: tHiveLoad.
Insert your HDFS path inside the component.
tHiveLoad documentation: https://help.talend.com/reader/hCrOzogIwKfuR3mPf~LydA/ILvaWaTQF60ovIN6jpZpzg
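For reference, tHiveLoad is essentially a wrapper around a HiveQL LOAD DATA statement, so the generated statement should look roughly like the following (the HDFS path and table name here are placeholders):
LOAD DATA INPATH '/user/talend/input/products' INTO TABLE products;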

Oracle Golden Gate Replicat

Can anyone help me? I want to run the replicat in my GoldenGate for Windows installation. This is my replicat parameter file:
-- Replicat group --
replicat rep1
-- source and target definitions
ASSUMETARGETDEFS
--target database login --
userid ggtarget, password oracle
--file for discarded transaction --
discardfile C:\app\<name>\product\12.1.2ggtarget\oggcore_1\dirdat\rep1_discard.txt, append megabytes 10
--ddl support
DDL
--Specify table mapping --
map EAM.*, target EAM.*;
When I start the replicat, it says that the replicat is starting, but when I type info all, the replicat is stopped and the status says it is not currently running. How can I make it run?
Check the ggserr.log file for error messages.
The line with the MAP command is incomplete. There should be something after the dot, like:
map EAM.*, target EAM.*;
or:
map EAM.table, target EAM.table;
or:
map table, target table;
Check the report file for errors, e.g.:
view report rep1.rpt
It should contain the details of how the replicat parameter file is parsed when the replicat is started.
However, if you don't see any error code in the report file or in ggserr.log, check your environment variables and the location of the parameter files.
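A quick GGSCI checklist for this situation might look like the following (group name as in the question; VIEW GGSEVT shows the contents of ggserr.log):
GGSCI> START REPLICAT REP1
GGSCI> INFO REPLICAT REP1, DETAIL
GGSCI> VIEW REPORT REP1
GGSCI> VIEW GGSEVT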
Take a look at the discard file line.
Example:
DISCARDFILE ./dirrpt/discard.txt, APPEND, MEGABYTES 20
In your config:
discardfile C:\app\<name>\product\12.1.2ggtarget\oggcore_1\dirdat\rep1_discard.txt, append megabytes 10
I think you are missing a comma after append.
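With the comma added, the line from the original parameter file would read (path exactly as in the question):
discardfile C:\app\<name>\product\12.1.2ggtarget\oggcore_1\dirdat\rep1_discard.txt, append, megabytes 10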

Write directly to file from BaseX GUI

I wrote an XQuery expression that produces a large result of about 50 MB and takes a couple of hours to compute. I execute it in the BaseX GUI, but this is a little inconvenient: the GUI crops the result in the result window, which I then have to save manually. While doing so, BaseX becomes unresponsive and may crash.
Is there a way to directly write the result to a file?
Have a look at BaseX's File Module, which provides broad functionality for reading from and writing to files and for traversing the file system.
For you, file:write($path as xs:string, $items as item()*) as empty-sequence() will be of special interest, as it allows you to write an item sequence to a file. For example:
file:write(
  '/tmp/output.xml',
  <root>{
    for $i in 1 to 1000000
    return <some-large-amount-of-data />
  }</root>
)
If your output isn't well-formed XML, consider the file:write-binary, file:write-text and file:write-text-lines functions.
Yet another alternative might be to write to documents in the database instead of to files: db:add and db:create from the Database Module can be used to add the computed results to the current database or to a new one.
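A minimal sketch of that alternative, reusing the example above (the database name 'results-db' and the document path are made up, and the database must already exist for db:add):
(: store the computed result as a new document in an existing database :)
db:add(
  'results-db',
  <root>{
    for $i in 1 to 1000000
    return <some-large-amount-of-data />
  }</root>,
  'output.xml'
)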

avoiding XDMP-EXPNTREECACHEFULL and loading document

I am using MarkLogic 4 and I have some 15,000 documents (each of around 10 KB). I want to load the entire content as one document (i.e., convert all the documents into a single CSV file and write it to the HTTP output stream for downloading). I load the documents this way:
let $uri := cts:uri-match('products/documents/*.xml')
let $doc := fn:doc ($uri)
The URI pattern matches some 15,000 XML documents, so fn:doc throws an XDMP-EXPNTREECACHEFULL error.
Is there any workaround for this? I cannot increase the tree cache size in the admin console, because the number of XML files under products/documents/*.xml may grow.
Thanks.
When you want to export large quantities of XML from MarkLogic, the best technique is to write the query so that results can stream, avoiding the expanded tree cache entirely. It is a very different style of coding, though: you'll have to avoid strong typing of any kind, and refactor your code to remove FLWOR expressions. You won't be able to test any of the code in cq or qconsole, either.
Take a look at http://blakeley.com/blogofile/2012/03/19/let-free-style-and-streaming/ for some tips on how to get there. At a minimum the code sample you posted would have to become:
doc(cts:uri-match('products/documents/*.xml'))
In passing I would try to rework that to avoid the *.xml part, because it will be slower than needed. Maybe something like this?
cts:search(
  collection(),
  cts:directory-query('products/documents/', 'infinity'))
If you need to test for something more than the directory, you could add a cts:and-query with some cts:element-query test.
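A hedged sketch of that combination (the element name product is hypothetical here, and cts:and-query(()) is used as a match-all inner query):
cts:search(
  collection(),
  cts:and-query((
    cts:directory-query('products/documents/', 'infinity'),
    cts:element-query(xs:QName('product'), cts:and-query(()))
  ))
)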
For general information about this error, see the MarkLogic knowledge base article on XDMP-EXPNTREECACHEFULL
