DataFactory copies files multiple times when using wildcard

Hi all, complete ADF newbie here - I have a strange issue with Data Factory and surprisingly can't see that anyone else has experienced the same issue.
To summarize:
I have set up a basic copy activity from Blob storage to an Azure SQL database with no transformation steps
I have set up a trigger based on a wildcard name, i.e. any file loaded to blob storage whose name starts with IDT* will be copied to the database
I have loaded a few files to a specific location in Azure Blob storage
The trigger is activated
Just when it looks like it all works, a quick check of the record count shows that the same files have been imported X number of times
I have analysed what is happening: when I load my files to blob storage, they don't technically arrive at exactly the same time. So when file 1 hits the blob, the wildcard search is triggered and it finds 1 file. When the 2nd file hits the blob some milliseconds later, the wildcard search is triggered again and this time it processes 2 files (the first and the second).
The problem keeps compounding with the number of files loaded, as the small sketch below illustrates.
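To make the compounding concrete, here is a purely illustrative Python sketch (not ADF code; the file names are made up) of the behaviour described above: each arriving file fires the trigger once, and each firing copies every file already matching the wildcard, so n files end up copied 1 + 2 + ... + n times in total.

    # Illustrative only: simulates the trigger firing once per arriving file,
    # with each run copying every IDT* file already present in the container.
    files_loaded = ["IDT_001.csv", "IDT_002.csv", "IDT_003.csv"]  # hypothetical names

    present = []
    total_copies = 0
    for f in files_loaded:
        present.append(f)              # the file lands in blob storage
        matched = list(present)        # the wildcard search finds every file so far
        total_copies += len(matched)   # the copy activity copies all of them again

    print(total_copies)  # 1 + 2 + 3 = 6 copies for 3 files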
I have tried multiple things to get this fixed to no avail, because fundamentally it is behaving "correctly".
I have tried:
Deleting each file once it has been processed, but again, due to the millisecond issue, the file is technically still there and can still be picked up
Adding a loop to process 1 file at a time, deleting each file (matched by name in the blob) before the next is loaded, but this hasn't worked (and I can't explain why)
Limiting ADF to only 1 concurrent connection; this reduces the number of duplicates but unfortunately still duplicates
Putting a wait timer at the start of the copy activity, but this causes a resource locking issue: I get an error saying that multiple waits are causing the process to fail
A combination of 1, 2 and 3, but then I end up with an entirely different issue: the pipeline tries to find file X, which no longer exists because it was deleted as part of step 2 above
I am really struggling with something that seems extremely basic, so I am sure I am overlooking something fundamental, as no one else seems to have this issue with ADF.

Related

Is there a limit on number of paths in firebase multi-path updates?

How many different paths are allowed inside a multi-path update (maximum)?
What is the ideal number of different paths that can be updated simultaneously without causing any issues/warnings?
Basically, to summarize: how many locations can be written simultaneously before Firebase can no longer handle it?
I am looking to run a script which resets various paths. The number of locations can be huge; to optimize this operation, I was thinking of using a multi-location update.
If you're running a script which performs a huge number of updates, multi-path updates are exactly what you need. Don't forget that multi-path updates are atomic operations (all or nothing), which means that if one of the operations doesn't succeed, all the others will be cancelled.
As for the number of updates, there is no limit: you can add as many paths as you want.
One last warning: make sure all of the paths are correct and the value you're updating is the one you really want to update. Many developers (beginners and experts alike) make mistakes when specifying the paths and end up deleting the whole database, or a good part of it ends up with data that belongs to another node.
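As a hedged illustration of the API (using the Python Admin SDK, firebase_admin; the service-account file, database URL and node names below are placeholders), a multi-path update is a single update() call on a common ancestor with deep paths as the keys:

    # Sketch of an atomic multi-path update with the Firebase Admin SDK (Python).
    # Credentials path, database URL and node paths are placeholders, not real values.
    import firebase_admin
    from firebase_admin import credentials, db

    cred = credentials.Certificate("service-account.json")
    firebase_admin.initialize_app(cred, {"databaseURL": "https://example-project.firebaseio.com"})

    root = db.reference()

    # One atomic write touching many locations: either all succeed or none do.
    root.update({
        "counters/alice/score": 0,
        "counters/bob/score": 0,
        "counters/carol/score": 0,
    })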

xProc - Pausing a pipeline and continuing it when a certain event occurs

I'm fairly new to xProc and xPath, but I've been asked to solve the following problem:
Step 2 receives data via the secondary port from step 1. Step 2 contains a p:for-each, which saves a document into a folder for each element that passes through the for-each. (Part A)
These documents (let's say the for-each produces 6 documents) sit in the same directory, get filtered by p:directory-list, and are eventually stored in one single document containing the full path of every document the for-each created. (Part B)
So far, so good.
The problem is that Part A seems to be too slow: Part B already tries to read the data Part A stores while the directory is still empty. In other words, I have a performance / synchronization problem.
And now comes the question:
Is it possible to make the pipeline wait and continue as soon as a certain event occurs?
That's what I'm imagining:
Part B waits as long as necessary, until the directory in which Part A stores its data is no longer empty. I read something about dbxml:breakpoint, but unfortunately I couldn't find more information than the name and a short description of what it seems to do:
Set a breakpoint, optionally based upon a condition, that will cause pipeline operation to pause at the breakpoint, possibly requiring user intervention to continue and/or issuing a message.
It would be awesome if you knew more about it and could give an example of how it's used. It would also help if you know a workaround or another way to solve this problem.
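For what it's worth, the "wait until the directory is no longer empty" idea above would, outside of XProc, boil down to a polling loop like the following (Python, purely illustrative; the directory path, interval and timeout are arbitrary placeholders):

    # Illustrative polling wait (not XProc): block until a directory is non-empty,
    # then hand its listing to the next processing step.
    import os
    import time

    def wait_for_directory(path, poll_interval=0.5, timeout=60.0):
        """Return the directory listing once it is non-empty, or raise on timeout."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            entries = os.listdir(path)
            if entries:
                return entries
            time.sleep(poll_interval)
        raise TimeoutError(f"{path} stayed empty for {timeout} seconds")

    docs = wait_for_directory("output/part-a")
    print(docs)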
UPDATE:
After searching Google for half an eternity, I found SMIL, whose timesheets seem to do the trick. Does anyone have experience with combining XML / xProc and SMIL?
Back towards the end of 2009 I proposed the concept of 'Orchestrating XProc with SMIL' (http://broadcast.oreilly.com/2009/09/xproc-and-smil-orchestrating-p.html) in a blog post on the O'Reilly Network.
However, I'm not sure that this (XProc + time) is the solution to your problem. It's not entirely clear to me from your description what's happening. Are you saying that you're trying to write something to disk and then read it back in a subsequent step? You need to keep the data in the pipeline in order to ensure you can connect outputs to subsequent inputs.

Riak: are my 2is broken?

We're having some weird things happening with a cleanup cronjob and Riak:
the objects we store (postboxes) have a 2i for the modification date (a Unix timestamp).
There's a cronjob running frequently that deletes all postboxes that have not been modified within 180 days. However, we've found evidence that some (very few) postboxes that were modified within the last three days were deleted by this cronjob.
After reviewing and debugging every line of code several times over, I am confident that this is not a problem with the cronjob.
I also traced back all delete calls to that bucket - nobody else is deleting objects there.
Of course I also checked with Riak, reading the postboxes with r=ALL: they're definitely gone (and they were stored with w=QUORUM).
I also checked the logs: updating the postboxes did succeed (no errors were reported back from the write operations).
This leaves me with two possible causes:
Riak loses data (which I am not willing to believe that easily)
the secondary indexes are corrupt and queries against them return wrong keys
So my questions are:
Can 2is actually break?
Is it possible to verify that?
Am I missing something completely different?
Cheers,
Matthias
Secondary index queries in Riak are coverage queries, which means that they will only use one of the stored replicas rather than performing a quorum read.
As you are writing with w=QUORUM, it is possible that one (or more) of the replicas does not get updated if you have n_val set to 3 or higher, while the operation is still deemed successful. If that stale replica is the one selected for the coverage query, you could end up deleting based on the old value. In order to avoid this, you will need to perform updates with w=ALL.
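As a rough sketch of what such an update could look like with the official Python client (the riak package; the bucket name, key and index name below are assumptions, adapt them to your schema):

    # Hedged sketch: re-storing a postbox with w=ALL so that every replica is
    # updated before the write is reported successful. All names are placeholders.
    import riak

    client = riak.RiakClient(protocol="pbc", host="127.0.0.1", pb_port=8087)
    bucket = client.bucket("postboxes")

    obj = bucket.get("postbox-123")            # fetch the current object
    obj.data["last_modified"] = 1371204000     # update the unix timestamp
    obj.add_index("modified_int", 1371204000)  # keep the 2i in sync
    # (in real code, the stale index entry should be removed first)
    obj.store(w="all")                         # succeed only when all replicas ack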

Efficiency of reading file names of a directory in ASP.NET

How efficient is reading the names of files in a directory in ASP.NET?
Background: I want to update pictures on a webserver automatically and deploy them in advance. E.g. until 1 April I want to pick 'image1.png'; after 1 April, 'image2.png'. To achieve this I have to map every image name to a date which indicates whether that image should be picked or not.
In order to avoid a mapping between file name and date in a separate file or database, the idea is to put the date in the file name. Iterating over the directory and parsing the dates lets me find my file.
E.g.:
image_2013-01-01.png
image_2013-04-30.png
The second one will be picked from the end of April until eternity, unless an image with a later date is dropped in.
So I wonder how this solution impacts the speed of a website assuming <20 files.
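The question targets ASP.NET, but the selection logic itself is language-agnostic; a minimal Python sketch of "pick the newest image whose embedded date is not in the future" (the directory name and the image_YYYY-MM-DD.png naming convention are assumptions taken from the example above) might look like this:

    # Sketch: pick the image whose embedded date is the latest one not after today.
    import os
    import re
    from datetime import date, datetime

    PATTERN = re.compile(r"image_(\d{4}-\d{2}-\d{2})\.png$")

    def pick_current_image(directory):
        candidates = []
        for name in os.listdir(directory):   # one listing call, cheap for <20 files
            match = PATTERN.match(name)
            if match:
                start = datetime.strptime(match.group(1), "%Y-%m-%d").date()
                if start <= date.today():
                    candidates.append((start, name))
        return max(candidates)[1] if candidates else None

    print(pick_current_image("images"))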
If you are using something like Directory.GetFiles, that is one call to the OS.
It will access the disk to get the listing.
For fewer than 20 files this will be very quick. However, since this data is unlikely to change very often, consider caching the name of your image.
You could store it in the application context to share it among all users of your site.

One-to-one correspondence to files - in Unix - log files

I am writing a Log Unifier program. That is, I have a system that produces logs:
my.log, my.log.1, my.log.2, my.log.3...
I want, on each iteration, to store the number of lines I've read from a certain file, so that on the next iteration I can continue reading from that place.
The problem is that when the files are full, they roll:
The last log is deleted
...
my.log.2 becomes my.log.3
my.log.1 becomes my.log.2
my.log becomes my.log.1
and a new my.log is created
I can of course keep track of them using inodes, which are almost a one-to-one correspondence to files.
I say "almost" because I fear the following scenario: between two of my iterations some files are deleted (let's say the logging is very fast), and then new files are created, some of which get the inodes of the files just deleted. The problem is that I would then mistake these new files for old ones and start reading from line 500 (for example) instead of 0.
So I am hoping to find a way to solve this. Here are a few directions I thought about that may help you help me:
Another 1-to-1 correspondence to files, other than inodes.
An ability to mark a file. I thought about using chmod +x to mark a file as an existing file; new files that don't have these permissions I would then know are new. But if somebody changed the permissions manually, that would confuse my program. So any other way to mark files would help.
I thought about creating soft links to each file that are deleted when the file is deleted. That would allow me to know which files got deleted.
Any way to get the "creation date".
Any idea that comes to mind - maybe using timestamps (atime, ctime, mtime) in some clever way - would be good, as long as it lets me know which files are new, or any other idea for creating a one-to-one correspondence to files.
Thank you
I can think of a few alternatives:
Use POSIX extended attributes to store metadata about each log file that your program can use for its operation.
It should be a safe assumption that the contents of old log files are not modified after being archived, i.e. after my.log becomes my.log.1. You could generate a hash of each file (e.g. SHA-256) to uniquely identify it.
All decent log formats embed a timestamp in each entry. You could use the timestamp of the first entry - or even the whole first entry itself - for identification purposes. Log files are usually rolled on a periodic basis, which ensures a different starting timestamp for each file.
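A rough sketch of the second and third suggestions combined - identifying each rotated file by a SHA-256 hash of its first line instead of by its inode (the file name pattern and the persisted-offset handling are placeholders):

    # Sketch: identify log files by a hash of their first line, so that
    # my.log.1, my.log.2, ... can be matched to previously seen files even
    # after rotation. Persisting the offsets between runs is left out.
    import glob
    import hashlib

    def first_line_id(path):
        # Return a stable identifier for a log file based on its first line.
        with open(path, "rb") as f:
            first_line = f.readline()
        return hashlib.sha256(first_line).hexdigest()

    saved_offsets = {}  # offsets from the previous run, keyed by first-line hash

    for path in sorted(glob.glob("my.log*")):
        file_id = first_line_id(path)
        offset = saved_offsets.get(file_id, 0)   # unknown file -> start at line 0
        print(path, file_id[:8], "resume at line", offset)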
