AWS Glue Custom Classifiers Json Path - jsonpath

I have a set of JSON data files that look like this:
[
{"client":"toys",
"filename":"toy1.csv",
"file_row_number":1,
"secondary_db_index":"4050",
"processed_timestamp":1535004075,
"processed_datetime":"2018-08-23T06:01:15+0000",
"entity_id":"4050",
"entity_name":"4050",
"is_emailable":false,
"is_txtable":false,
"is_loadable":false}
]
I have created a Glue Crawler with the following custom classifier JSON path:
$[*]
Glue returns the correct schema with the columns correctly identified.
However, when I query the data in Athena, all the data lands in the first column and the rest of the columns are empty.
How can I get the data split into the correct columns?
[image of Athena query]
Thank you!

It is an issue connected to Hive. I suggest two approaches. First, you can create a new table in Athena with a struct data type, like this:
CREATE EXTERNAL TABLE `example`(
`row` struct<client:string,filename:string,file_row_number:int,secondary_db_index:string,processed_timestamp:int,processed_datetime:string,entity_id:string,entity_name:string,is_emailable:boolean,is_txtable:boolean,is_loadable:boolean> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='example',
'averageRecordSize'='271',
'classification'='json',
'compressionType'='none',
'jsonPath'='$[*]',
'objectCount'='1',
'recordCount'='1',
'sizeKey'='271',
'transient_lastDdlTime'='1535533583',
'typeOfData'='file')
And then you can run the query as follows:
SELECT row.client, row.filename, row.file_row_number FROM "example"
Second, you can re-design your JSON file as below and then run the Crawler again. In this example I used the single-JSON-record-per-line format (one JSON object per line, with no enclosing array and no trailing commas).
{"client":"toys","filename":"toy1.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
{"client":"toys2","filename":"toy2.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}

Related

Is there a way to parse out information from an xcom_pull in Airflow?

So what I'm working with is a DAG that passes specific information between tasks, and everything is working as it should. The file needs to be stored in a reports/ folder for the following tasks to work correctly. I'm pulling the actual name of the report through an xcom_pull, but I also want to parse that value to capture the unique filename itself and use it later in other tasks. I have a later task that inserts this filename into the CSV file, and I need it to match the actual filename so it's a 1:1 match.
I want to parse out information from an xcom_pull and I'm having issues doing so. The example I have is below:
report_filename = "reports/{}_{}".format('report_example', str(uuid.uuid1()))

get_report = GoogleCampaignManagerDownloadReportOperator(
    task_id="get_report",
    profile_id=1234,
    api_version=1234,
    bucket_name=test_bucket,
    report_name=report_filename,
    report_id=report_id,
    file_id=file_id,
)

report_filename_test = xcom_pull(get_report, 'report_name')

sanitize_report = SanitizeReportOperator(
    task_id='sanitize_report',
    dest_bucket=test_bucket,
    dest_object=report_filename_test,
    shared_object=str(report_filename_test).replace('reports/', ''),
    append_timestamp=True,
    append_filename=True
)
As of right now the xcom_pull pulls down the following:
reports/report_example_b3413b62-cc8a-11ec-bded-52e9ae62e477.csv.gz
However, I want to have another xcom_pull that will only pull the following:
report_example_b3413b62-cc8a-11ec-bded-52e9ae62e477.csv.gz
I have tried converting report_filename_test to a string and using the replace function, so for example:
new_test = str(report_filename_test).replace('reports/', '')
But when attempting this, new_test either ends up as NULL or is ignored completely, and the file still gets saved into a reports/ folder later on.
I have also tried passing the report_filename into a list and grabbing the first element, but with how Airflow works from task to task, it creates a new filename with a different uuid each time, which is not what I'm aiming for. I have also tried using a PythonOperator to create a function specifically for naming the file so it can be called later on throughout the DAG, but have not had any luck with this either.
Is there a way to do this where you can parse out the information from a xcom_pull or another way to make this work? The end goal is to essentially have a file name with a specific uuid that I can pass through into the csv file and rename the file to the same specific uuid that is being built without the folder name in front.
I'm just looking to have a unique filename be passed through multiple tasks that is the exact same each time with a uuid format. I'm running out of ideas of how to make this work and have been stuck on this for almost two weeks now.
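Roughly, what I'm hoping for is something like the sketch below (just a sketch on my side: I'm assuming the download operator pushes the full object name under a report_name XCom key, and that dest_object/shared_object are templated fields on my custom operator):
import os
from airflow.operators.python import PythonOperator

def extract_report_basename(ti):
    # Pull the full object name pushed by get_report, e.g.
    # "reports/report_example_<uuid>.csv.gz" (the key name is an assumption).
    full_name = ti.xcom_pull(task_ids="get_report", key="report_name")
    # Return just the filename; the return value becomes this task's XCom.
    return os.path.basename(full_name)

extract_basename = PythonOperator(
    task_id="extract_report_basename",
    python_callable=extract_report_basename,
)

sanitize_report = SanitizeReportOperator(
    task_id="sanitize_report",
    dest_bucket=test_bucket,
    dest_object="{{ ti.xcom_pull(task_ids='get_report', key='report_name') }}",
    shared_object="{{ ti.xcom_pull(task_ids='extract_report_basename') }}",
    append_timestamp=True,
    append_filename=True,
)

get_report >> extract_basename >> sanitize_report
That way the filename would only be generated once, in get_report, and every later task would pull the same value instead of re-running uuid.uuid1().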
Any help with this would be greatly appreciated!

Neo4J Self Lookup

I am struggling to get a simple Neo4J file to map to itself.
I have two CSV files
File A
ID,Name
0,abc
1,def
2,ghi
3,JJK
And File B
ID,Primary_ID,Secondary_ID
0,2,3
What I want is to import File A into Bloom and then link the elements to each other by looking up File B where a relationship exists.
A Neo4J expert could probably tell me what I'm doing wrong.
This is my neo4j command:
neo4j#neo4j> LOAD CSV WITH HEADERS FROM 'file:///FileB.csv' AS row
WITH toInteger(row["ID"]) as ID, row["Primary_ID"] as Primary, row["Secondary_ID"] as Secondary
MATCH (c:item {itemId: Secondary})
MATCH (p:item {itemId: Primary})
MERGE (o)-->(p)
RETURN count(o);
Looks like there's a typo in your query. You used o instead of c in the MERGE clause. Note that MERGE also needs an explicit relationship type; :LINKED_TO below is just a placeholder, so use whatever name fits your model.
LOAD CSV WITH HEADERS FROM 'file:///FileB.csv' AS row
WITH toInteger(row["ID"]) as ID, row["Primary_ID"] as Primary, row["Secondary_ID"] as Secondary
MATCH (c:item {itemId: Secondary})
MATCH (p:item {itemId: Primary})
MERGE (c)-[:LINKED_TO]->(p)
RETURN count(c);
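For the lookups above to match anything, the File A rows need to exist as :item nodes first. A rough sketch (the :item label and itemId property come from your query; keeping the IDs as strings is an assumption, since LOAD CSV reads every column as a string):
LOAD CSV WITH HEADERS FROM 'file:///FileA.csv' AS row
MERGE (i:item {itemId: row.ID})
SET i.name = row.Name;
With the IDs stored as strings, the Primary and Secondary values from File B match directly, without needing toInteger().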

SQLite - Display the content of a list of tables in the terminal and time it

I managed to display a list of the table names of interest in my database with the following query:
SELECT name
FROM sqlite_master
WHERE type = 'table'
AND name LIKE '%#_1' ESCAPE '#';
(That is not the main point, but it returns the list of table names ending in "_1".)
Now what I would like to do is display the content of all these tables in one command (just as if I were using cat *), and I would like to time this command.
So what should the command be?
Thank you for your help.
This is not possible with a single SQL command.
You have to generate a series of SELECT statements, one for each table, and execute all of them.
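If you are working in the sqlite3 command-line shell, one way (a sketch, with a placeholder script name) is to write the generated SELECT statements to a file and then run that file with the built-in timer switched on:
.once dump_tables.sql
SELECT 'SELECT * FROM "' || name || '";'
FROM sqlite_master
WHERE type = 'table'
  AND name LIKE '%#_1' ESCAPE '#';

.timer on
.read dump_tables.sql
.timer on prints a Run Time line after each statement; if you want a single overall figure instead, you could run the script from the OS shell with something like time sqlite3 mydb.db < dump_tables.sql.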

How to get relation from csv in Neo4J/Cypher using CSV LOAD

I use Neo4J Community Edition version 3.2.1.
Consider this CSV-file with edges:
node1,relation,node2,type
1,RELATED_TO,2,Married
2,RELATED_TO,1,Married
1,RELATED_TO,3,Child
2,RELATED_TO,3,Child
3,RELATED_TO,4,Sibling
3,RELATED_TO,5,Sibling
4,RELATED_TO,5,Sibling
I have already created the nodes for this. I then run the following CSV load command:
load csv with headers from
"file:///test_dataset/edges.csv" as line
match (person1:Person {pid:line.node1}),
(person2:Person {pid:line.node2})
create (person1)-[:line.relation {type:line.type}]->(person2)
But this returns the following error:
Invalid input '.': expected an identifier character, whitespace, '|', a length specification, a property map or ']' (line 5, column 24 (offset: 167))
"create (person1)-[:line.relation {type:line.type}]->(person2)"
It seems that I cannot use "line.relation" like this. How can I use the relation from the csv-file (second column) using csv load?
I have seen this answer, but I would like to do this using native query language.
To verify that the rest of the query is correct I have managed to create the edges correctly by hardcoding the relation like this:
load csv with headers from
"file:///test_dataset/edges.csv" as line
match (person1:Person {pid:line.node1}),
(person2:Person {pid:line.node2})
create (person1)-[:RELATED_TO {type:line.type}]->(person2)
Natively it's not possible to create a node with a dynamic label or a relationship with a dynamic type.
That's why there is a procedure for that.
If you want to do it natively and you know all the distinct values of your relation column, you can create one Cypher script like the following per value:
LOAD CSV WITH HEADERS FROM "file:///test_dataset/edges.csv" AS line
WITH line WHERE line.relation ='RELATED_TO'
MATCH (person1:Person {pid:line.node1})
MATCH (person2:Person {pid:line.node2})
CREATE (person1)-[:RELATED_TO {type:line.type}]->(person2)
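For reference, the procedure mentioned above is (I assume) APOC's apoc.create.relationship, which takes the relationship type as a string, so the whole file can be loaded in one pass:
LOAD CSV WITH HEADERS FROM "file:///test_dataset/edges.csv" AS line
MATCH (person1:Person {pid:line.node1})
MATCH (person2:Person {pid:line.node2})
CALL apoc.create.relationship(person1, line.relation, {type: line.type}, person2) YIELD rel
RETURN count(rel)
This requires the APOC plugin to be installed on your Neo4j instance.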

Using Marklogic Xquery data population

I have data in the following form:
<Status>Active Leave Terminated</Status>
<date>05/06/2014 09/10/2014 01/10/2015</date>
I want to get the data in the form below:
<status>Active</status>
<date>05/06/2014</date>
<status>Leave</status>
<date>09/10/2014</date>
<status>Terminated</status>
<date>01/10/2015</date>
Please help me with the query to retrieve the data as specified above.
Well, you have a string and want to split it at the whitespace. That's what tokenize() is for, and \s matches a whitespace character. To get the corresponding date you can use the current position in the for loop via at. Together it looks something like this (note that I assume the input data is the current context item):
let $dates := tokenize(date, "\s+")
for $status at $pos in tokenize(Status, "\s+")
return (
<status>{$status}</status>,
<date>{$dates[$pos]}</date>
)
You did not indicate whether your data is on the file system or already loaded into MarkLogic. It's also not clear if this is something you need to do once on a small set of data or on an on-going basis with a lot of data.
If it's on the file system, you can transform it as it is being loaded. For instance, MarkLogic Content Pump can apply a transformation during load.
If you have already loaded the content and you want to transform it in place, you can use Corb2.
If you have a small amount of data, then you can just loop across it using Query Console.
Regardless of how you apply the transformation code, dirkk's answer shows how you need to change it. If you are updating content already in your database, you'll xdmp:node-delete() the original Status and date elements and xdmp:node-insert-child() the new ones.
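As a rough sketch of that kind of in-place rewrite (the document URI and the <record> wrapper element are assumptions about your data, and I'm swapping the whole parent with xdmp:node-replace() instead of the separate delete/insert-child calls):
xquery version "1.0-ml";
(: Sketch only: the URI and <record> wrapper are assumed, not taken from the question :)
let $old   := fn:doc("/example/record1.xml")/record
let $dates := fn:tokenize($old/date, "\s+")
return
  xdmp:node-replace($old,
    <record>{
      for $status at $pos in fn:tokenize($old/Status, "\s+")
      return (
        <status>{$status}</status>,
        <date>{$dates[$pos]}</date>
      )
    }</record>)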
