Is there a way to parse out information from an xcom_pull in Airflow? - airflow

What I'm working with: I have a DAG that passes specific information between tasks, and everything is working as it should. The file needs to be stored in a reports/ folder for the following tasks to work correctly. I'm retrieving the actual name of the report through an xcom_pull, but I also want to parse information out of that xcom_pull to capture the unique filename itself for use later on in other tasks. A later task inserts this filename into the csv file, and it needs to match the actual filename so it's a 1:1 match.
I want to parse information out of an xcom_pull value and I'm having issues doing so. The example I have is below:
report_filename = "reports/{}_{}".format('report_example', str(uuid.uuid1()))

get_report = GoogleCampaignManagerDownloadReportOperator(
    task_id="get_report",
    profile_id=1234,
    api_version=1234,
    bucket_name=test_bucket,
    report_name=report_filename,
    report_id=report_id,
    file_id=file_id,
)
report_filename_test = xcom_pull(get_report, 'report_name')

sanitize_report = SanitizeReportOperator(
    task_id='sanitize_report',
    dest_bucket=test_bucket,
    dest_object=report_filename_test,
    shared_object=str(report_filename_test).replace('reports/', ''),
    append_timestamp=True,
    append_filename=True
)
As of right now the xcom_pull pulls down the following:
reports/report_example_b3413b62-cc8a-11ec-bded-52e9ae62e477.csv.gz
However, I want to have another xcom_pull that will only pull the following:
report_example_b3413b62-cc8a-11ec-bded-52e9ae62e477.csv.gz
I have tried converting report_filename_test to a string and using the replace function, so for example:
new_test = str(report_filename_test).replace('reports/', '')
But when attempting this, new_test either ends up NULL or the replace is ignored completely, and the file is still saved into a reports/ folder later on.
I have also tried passing the report_filename into a list and grabbing the first element, but with how Airflow works from task to task, it creates a new filename with a different uuid each time, which is not what I'm aiming for. I have also tried a PythonOperator that calls a function specifically to name the file so it can be reused later on throughout the DAG, but have not had any luck with this either.
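A rough sketch of the kind of PythonOperator approach I was going for is below (the task ids, the 'report_name' XCom key from the snippet above, and the downstream template are just placeholders for illustration):

from airflow.operators.python import PythonOperator

def extract_basename(ti, **kwargs):
    # Pull the full path pushed by get_report and strip the folder prefix,
    # so only the bare filename is pushed back to XCom (key 'return_value').
    full_path = ti.xcom_pull(task_ids="get_report", key="report_name")
    return full_path.replace("reports/", "")

extract_filename = PythonOperator(
    task_id="extract_filename",
    python_callable=extract_basename,
)

A downstream operator could then reference it with something like "{{ ti.xcom_pull(task_ids='extract_filename') }}", provided the field it is passed to (for example shared_object) is a templated field of that operator.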
Is there a way to do this where you can parse out the information from an xcom_pull, or another way to make this work? The end goal is essentially to have a file name with a specific uuid that I can pass through into the csv file, and to rename the file to that same specific uuid without the folder name in front.
I'm just looking to have a unique, uuid-based filename passed through multiple tasks that is exactly the same each time. I'm running out of ideas of how to make this work and have been stuck on this for almost two weeks now.
Any help with this would be greatly appreciated!

Related

Read embedded data that starts with numbers?

I have embedded data that I have imported into Qualtrics using a web service block. The data comes from a .json file and reads something like 0.male, 1.male, 2.male, etc.
I have been trying to read this into my survey using the Qualtrics.SurveyEngine.getEmbeddedData method, but without luck.
I'm trying to do something that takes the form:
let n = 2
Qualtrics.SurveyEngine.getEmbeddedData(n + ".male")
but this has been returning a NULL result. Is it possible to read embedded data that starts with a number?
Also see:
https://community.qualtrics.com/XMcommunity/discussion/15991/read-in-embedded-variables-using-a-loop#latest
The issue isn't the number, it is the dot. getEmbeddedData() doesn't work when the name contains a dot. See https://stackoverflow.com/a/51802695/4434072 for possible alternatives.
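For instance, if the embedded data fields can be renamed so the dot becomes an underscore (0_male, 1_male, ...), a lookup along these lines should work; the renamed field names here are an assumption, not what the .json currently provides:

let n = 2
Qualtrics.SurveyEngine.getEmbeddedData(n + "_male")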

Fluent-bit, How can I use strftime in path

My log file name contains the current date, like my_log_210616.log,
and I need to tail the file in fluent-bit. I tried this:
[INPUT]
    Name  tail
    Path  /var/log/my-service/my_log_%y%m%d.log

[OUTPUT]
    Name   stdout
    Match  *
But it doesn't watch the file. If I replace my_log_%y%m%d.log with my_log_210616.log, then it works.
How can I use strftime in the path?
One solution is to use a path that matches any date. Since fluent-bit reads the log files from their tail, you won't get data from the older files.
You could also add 'Ignore_Older 24h' to the input config. This will ignore files with modified times older than 24 hours. Using 'Ignore_Older' with a parser that extracts the event time works even better.
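Putting those two ideas together, a tail input along these lines is one option (the wildcard pattern and the 24h threshold are only examples):

[INPUT]
    Name          tail
    Path          /var/log/my-service/my_log_*.log
    Ignore_Older  24h

[OUTPUT]
    Name   stdout
    Match  *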
You could also do more elaborate filtering by file name in a lua filter.

U-SQL get filename of input and use for output

I have an input filename of test.csv and I want the output to be test.txt.
I can extract the filename of the input, but I don't know how to use it for the output. For example:
OUTPUT @result TO "/output/{filename}.txt"
USING Outputters.Text(outputHeader:false, quoting:false);
The filename is in @result.
This feature isn't supported as of yet.
Does anyone have a workaround? How can I get the current filename being processed and add it to my extract output?
Ideally I would like dd-mm-yy-test.txt. How do I append the day, month and year?
I am using U-SQL for this.
Thanks
Let me address both issues you're laying out in this question:
To use the same output name as the input, there would have to be a way to read rowset values into U-SQL variables, which I'm pretty sure cannot be done, given that the language is built around the need to process many files at once.
To append a date to the output, you only need to declare the current datetime at some point and then use it when building the output file name, like this:
DECLARE @now DateTime = DateTime.Now;
OUTPUT @output TO "/tests/output/" + @now.ToString("dd-MM-yyyy") + "-output.csv"
USING Outputters.Csv();

Using MarkLogic XQuery data population

I have data in the manner below:
<Status>Active Leave Terminated</Status>
<date>05/06/2014 09/10/2014 01/10/2015</date>
I want to get the data in the manner below:
<status>Active</status>
<date>05/06/2014</date>
<status>Leave</status>
<date>09/10/2014</date>
<status>Terminated</status>
<date>01/10/2015</date>
Please help me with the query to retrieve the data as specified above.
Well, you have a string and want to split it at the whitespace. That's what tokenize() is for, and \s matches a whitespace character. To get the corresponding date you can get the current position in the for loop using at. Together it looks something like this (note that I assume the input data is the current context item):
let $dates := tokenize(date, "\s+")
for $status at $pos in tokenize(Status, "\s+")
return (
    <status>{$status}</status>,
    <date>{$dates[$pos]}</date>
)
You did not indicate whether your data is on the file system or already loaded into MarkLogic. It's also not clear if this is something you need to do once on a small set of data or on an on-going basis with a lot of data.
If it's on the file system, you can transform it as it is being loaded. For instance, MarkLogic Content Pump can apply a transformation during load.
If you have already loaded the content and you want to transform it in place, you can use Corb2.
If you have a small amount of data, then you can just loop across it using Query Console.
Regardless of how you apply the transformation code, dirkk's answer shows how you need to change it. If you are updating content already in your database, you'll xdmp:node-delete() the original Status and date elements and xdmp:node-insert-child() the new ones.
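As a rough sketch of that kind of in-place rewrite, here is one way it could look using a single xdmp:node-replace call instead of separate delete/insert calls; the document URI and the <record> parent element are assumptions for illustration:

let $root := fn:doc("/example/record.xml")/record
let $dates := fn:tokenize($root/date, "\s+")
return
    xdmp:node-replace(
        $root,
        <record>{
            for $status at $pos in fn:tokenize($root/Status, "\s+")
            return (<status>{$status}</status>, <date>{$dates[$pos]}</date>)
        }</record>
    )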

Reading a file into R with partly unknown filename

Is there a way to read a file into R when I do not know the complete file name? Something like:
read.csv("abc_*")
In this case I do not know the complete file name after abc_
If you have exactly one file matching your criteria, you can do it like this:
read.csv(dir(pattern='^abc_')[1])
If there is more than one file, this approach would just use the first hit. In a more elaborate version you could loop over all matches and append them to one data frame or something like that (see the sketch below).
Note that the pattern uses regular expressions and thus is a bit different from what you might have expected (and what I wrongly assumed at my first attempt to answer the question). Details can be found using ?regex.
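A sketch of that multi-file version might look like this (it assumes the matching files all share the same columns):

# read every matching file and stack the rows into one data frame
files <- dir(pattern = "^abc_", full.names = TRUE)
all_data <- do.call(rbind, lapply(files, read.csv))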
If you have a directory you want to submit, you have to modify the dir command accordingly:
read.csv(dir('path/to/your/file', full.names=T, pattern="^abc"))
The submitted path in your case may be c:\\users\\user\\desktop, and then the pattern as above. full.names=T forces dir() to output a whole path and not only the file name. Try running dir(...) without the read.csv to understand what is happening there.
If you want to give your path as a complete string, it again gets a bit more complicated:
filepath <- 'path/to/your/file/abc_'
read.csv(dir(dirname(filepath), full.names=T, pattern=paste("^", basename(filepath), sep='')))
That process will fail if your filename contains any regular expression metacharacters. You would have to substitute them with their corresponding escape sequences up front. But that again is another topic.
