Transforming the Default URI when using MLCP

I have a delimited file as the input source for ingesting data into MarkLogic using Content Pump (mlcp) on Unix. There is no column in the file that is unique throughout and could serve as the URI. The problem with this is that, since duplicate URIs are not possible, records that map to an already-used URI are skipped or overwritten.
The available options are:
-delimited_uri_id my_column_name
-output_uri_prefix my_prefix_string
-output_uri_suffix my_suffix_string
-output_uri_replace pattern,'string'
The command for mlcp is:
bin/mlcp.sh import -host localhost -port 8042 -username name -password password -input_file_path hdfs://path/to/file -delimiter '|' -delimited_uri_id column_name -input_file_type delimited_text -mode distributed
The problem that lies here is that if I modify the above command and include:
-output_uri_prefix $(date +%s%N)
It takes the time (in nanoseconds) at which this command was executed and prefixes it to every URI. But that doesn't solve my problem, since the same value is repeated for every record. The same would happen with the other options. What could be done to have all records ingested, by constructing a unique URI for every record in some manner?

One way or another it is up to you to provide unique ids. For a delimited file the easiest answer might be to add a new column and populate it with a unique id, generated however you like.
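For example, a minimal sketch with awk (assuming a pipe-delimited file named data.psv with a header row; the file name and the new id column are illustrative):
# Prepend a unique id column: the header row gets "id",
# every data row gets its row number.
awk -F'|' 'NR==1 {print "id|" $0; next} {print (NR-1) "|" $0}' data.psv > data_with_id.psv
You could then pass -delimited_uri_id id to mlcp.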
Or you could use http://marklogic.github.io/recordloader/ DelimitedDataLoader with the special option ID_NAME=#AUTO. But keep in mind that ID_NAME=#AUTO will single-thread ingestion.

Related

Best practice to parse multiple config files

What would be the best practice - if there is any - to parse multiple config files?
I want to parse the MySQL server configuration and also write the configuration back out.
The configuration format allows lines like:
!includedir /etc/mysql.d/
So the interesting thing is that some configuration may be located in the main file, while other parts may be located in a sub-file.
I think pyparsing only works on one single file or one content string.
So I would probably first need to read all the files and maybe restructure the contents, for example by adding headers for the different files...
====main file====
[mysql]
....
!includedir /etc/mysql.d/
====/etc/mysql.d/my.cnf====
[client]
.....
I would only have one pyparsing call.
Then I could parse everything into one big data object, group the file sections and have the file names as keys. This way I could also write the data back to the disk...
The other possibility would be to parse the main file and programmatically parse all other files that were found in the main file.
Thus I would have several pyparsing calls.
What do you think?
In your pyparsing code, attach a parse action to the expression that matches the include statements; have it parse the contents of the referenced file or directory of files, then merge those results into the current parse output. The parse action makes the successive calls to parseString, while your code only makes a single call.
See this new example added to the pyparsing examples directory: https://github.com/pyparsing/pyparsing/blob/master/examples/include_preprocessor.py
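A rough sketch of the idea (a toy grammar for key = value lines plus !includedir; section headers and error handling are omitted, and all names here are illustrative, not taken from the linked example):

import glob
import pyparsing as pp

key = pp.Word(pp.alphanums + "_-")
config_line = pp.Group(key("key") + pp.Suppress("=") + pp.restOfLine("value"))
include_stmt = pp.Suppress("!includedir") + pp.restOfLine("dirname")

def expand_include(tokens):
    # Parse each file in the included directory and splice its results
    # into the enclosing parse, so a single parseString call on the
    # main file covers everything.
    merged = []
    for path in sorted(glob.glob(tokens.dirname.strip() + "/*")):
        with open(path) as f:
            merged.extend(parser.parseString(f.read()))
    return merged

include_stmt.setParseAction(expand_include)
parser = pp.ZeroOrMore(config_line | include_stmt)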

Running Go from the command line with nested JSON

I can think of workarounds on how to get this working however I'm interested in finding out if there's a solution to this specific problem.
I've got a Go program which requires a JSON string argument:
go run main.go "{ \"field\" : \"value\" }"
No problems so far. However, am I able to run it from the command line if one of the JSON values is another JSON string?
go run main.go "{ \"json-string\" : \"{\"nestedfield\" : \"nestedvalue\"}\" }"
It would seem that adding escape characters incorrectly matches up the opening and closing quotes. Am I misunderstanding how this is done, or is it (and this is the side I'm coming down on) simply not possible?
To reiterate, this is a question that has piqued my curiosity - I'm aware of alternative approaches - I'm hoping for input related to this specific problem.
Why don't you just put your JSON config in a file and provide the config file name to your application using the flag package?
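A minimal sketch of that suggestion (the -config flag name and config.json default are illustrative):

package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"os"
)

func main() {
	// Read the config file path from a flag instead of passing raw JSON.
	path := flag.String("config", "config.json", "path to a JSON config file")
	flag.Parse()

	data, err := os.ReadFile(*path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	var cfg map[string]interface{}
	if err := json.Unmarshal(data, &cfg); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Nested JSON is no problem here, since nothing is escaped on the shell.
	fmt.Println(cfg["json-string"])
}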
Based on the feedback from wiredeye I went down the argument route instead. I've modified the program to run on:
go run main.go field:value field2:value json-string:"{\"nestedfield\":nestedvalue}"
I can then iterate over os.Args and get the nested JSON within my program. I'm not using flags directly, as I don't know the number of inputs to the program, which would have required me to use duplicate flags (not supported) or to parse a flag into a collection (doesn't seem to be supported).
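A sketch of that iteration (splitting each key:value argument on the first colon; names are illustrative):

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	args := map[string]string{}
	// os.Args[0] is the program name; the rest are key:value pairs.
	for _, arg := range os.Args[1:] {
		parts := strings.SplitN(arg, ":", 2)
		if len(parts) == 2 {
			args[parts[0]] = parts[1]
		}
	}
	fmt.Println(args["json-string"])
}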
Thanks wiredeye

Specify which resource file a variable is coming from

Robot Framework allows you to import multiple resource files containing keywords with the same names, and to call them using their full name to differentiate between them. For example, if you have Resource1.robot that has a keyword called "Test Keyword" that does some action, and Resource2.robot that also has a keyword called "Test Keyword" that does a different action, when you import both resources into a test suite, your test cases can access those keywords with the syntax Resource1.Test Keyword or Resource2.Test Keyword depending on the functionality that you want.
Is there a way to do that with variables? I have two resource files - patient_records_resource.robot and patient_search_resource.robot. patient_records_resource defines a variable ${LAST NAME EDIT} | name=lname, and patient_search_resource defines a variable with the same name ${LAST NAME EDIT} | id=last-name. I'm running into the problem where a test case imports both of those files, and needs to access both of those edit boxes at different points, and consistently picks the wrong one. I have tried things like patient_search_resource.LAST NAME EDIT with no success, but that's approximately what I'm looking for.
I know I could just rename one of them, but I'd like to use that as a last resort solution. Everyone on my team makes sure to create unique variable names within a single resource file, but coming up with unique variable names across the whole test suite to avoid these collisions would add some overhead that we don't want.
There is no way to do that with variables.
All variables from resource files have the same priority. If multiple variables have the same name then only the one that was imported first is taken into use. [source]
Your only options are:
Splitting the suite with both imports into two sub-suites, ensuring that each sub-suite only imports one of the resources.
Your last resort solution: modifying your variable names to all be unique.
I agree that the latter option adds a bothersome amount of overhead, but, until RF changes the way it handles variables, it's probably your best option. Personally, I prefix all variables with a sequence of letters unique to the resource file (e.g. a resource named "Member_Central_Logging_Functions" might have all variables prefixed with MCLF).
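A sketch of that convention in a resource file named Member_Central_Logging_Functions.robot (the variable names and selector values are made up):

*** Variables ***
${MCLF LAST NAME EDIT}    id=last-name
${MCLF SEARCH BUTTON}     id=search

Tests can then use the MCLF-prefixed names with no risk of colliding with variables from other resource files.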

How to read invariant CSV files using C#

I am working on a Windows application in C#. I want to read a CSV file from a directory and import it into a SQL Server database table. I can successfully read and import the CSV data when the file content is uniform, but I am unable to insert data that is not uniform. My file's delimiter is actually a tab ('\t'), and after splitting into individual fields I have a field that contains data like:
Name
----
xxx
xxx yyy
xx yy zz
and I retrieved data like xxx,yyy and xx,yy,zz, so the insertion becomes a problem.
How could I insert the data uniformly into a database table?
It's pretty easy.
Just read file line-by-line. Example on MSDN here:
How to: Read Text from a File
For each line use String.Split Method with your tab as delimiter. Method documentation and sample are here:
String.Split Method (Char[], StringSplitOptions)
Then insert your data.
If a CSV (or TSV) value contains a delimiter inside of it, then it should be surrounded by quotes. See the spec for more details: https://www.rfc-editor.org/rfc/rfc4180#page-3
So your input file is incorrectly formatted. If you can convince the input provider to fix this issue, that will be the best way to fix the problem. If not, other solutions may include:
visually inspecting and editing the file to fix errors, or
writing your parser program to have enough knowledge of your data expectations that it can correctly "guess" where the real delimiters are.
If I'm understanding you correctly, the problem is that your code is splitting on spaces instead of on tabs. Given you have read in the lines from the file, all you need to do is:
// Read every line of the file (filePath is your input file).
string[] fileLines = System.IO.File.ReadAllLines(filePath);
foreach (string line in fileLines)
{
    // Split each line on the tab character.
    string[] lineParts = line.Split('\t');
}
and then do whatever you want with each lineParts. The \t is the tab character.
If you're also asking about getting the lines into a database...you can just import tab-delimited files with the Import/Export Wizard (assuming you're using SQL Server Management Studio, but I'm sure there are comparable ways to import using other database management software).

File uploading: what should be the name of the file to save to?

I am going to add a file upload control to my ASP.NET 2.0 web page so that users can upload files. Files will be stored on the server in a folder named after the user. I want to know the best way to name the files when saving them to the server, considering security, performance, flexibility in handling the files, etc.
Options I am considering now :
Save with the same name as the uploaded file
User ID + random number + the original file name
A random number + the current time in seconds, saving files under that number, with one table mapping the number to the user's upload
Anything else? What is the best way?
NEVER EVER use user input for file names. Don't use the username. Use the user ID instead (I assume your users have a unique ID).
NEVER use the original filename. Use your solution number 3, plus the user id instead of the username.
For your information, PHP had a vulnerability a few years ago: one could forge an HTTP POST request with a file upload and a file name like "../../anything.php", and the PHP $_FILES array, supposed to contain sanitized values, didn't detect this kind of file name, so one could write files anywhere in the filesystem.
I'd use a combination of
User ID
A random generated string (e.g. a GUID)
Example PDF file name: 23212-dd503cf8-a548-4584-a0a3-39dc8be618df.pdf
This way, the user can upload as many files as he/she wants, without file name conflict, and you are also able to point out which files belong to which users, just by looking at the file names.
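A one-line C# sketch of that naming scheme (userId and postedFile are illustrative variables):

string storedName = userId + "-" + Guid.NewGuid() + System.IO.Path.GetExtension(postedFile.FileName);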
I don't see the need to include any other information in the file name, since upload time/date and such can be retrieved from the file's attributes.
Also, you should store the files in a safe location, which external users, such as visitors of your website, cannot access. Instead, you deliver the file to them through a proxy web page (you read the file from the safe location, and pass the data on to the user). For this solution, a database is needed to keep track of files, their location, etc.
This also makes you able to control which users have access to which files through your code.
Update: Here's a description of how the solution with the proxy web page could be implemented.
Create a Web Form with the name GetFile.aspx
GetFile.aspx takes one query parameter named fileid, which is used to identify the file to get. E.g.: http://www.mypage.com/GetFile.aspx?fileid=100
Use the fileid parameter to lookup the file location in the database, so that it can be read and sent to the user. In the Web Form you use Request.QueryString("fileid") to get the file ID and use it in a query that will look something like this (SQL): SELECT FileLocation FROM UserFiles WHERE FileID = 100
Read the file using a System.IO.FileStream and write its contents to the response output stream. Remember to set the appropriate content type using Response.ContentType first, so that the client browser handles the requested file correctly (see this post on asp.forums.net and the MSDN article which is also referred to in the post, both of which discuss a method of determining the appropriate content type automatically). A sketch of this step follows the list.
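A minimal C# sketch of the proxy page's code-behind (GetFileLocationFromDb is a hypothetical helper standing in for your data access code; error handling is omitted):

protected void Page_Load(object sender, EventArgs e)
{
    int fileId = int.Parse(Request.QueryString["fileid"]);

    // Hypothetical helper that runs the SELECT FileLocation query above.
    string location = GetFileLocationFromDb(fileId);

    Response.ContentType = "application/octet-stream"; // or a more specific type
    byte[] buffer = new byte[8192];
    int bytesRead;
    using (System.IO.FileStream fs = System.IO.File.OpenRead(location))
    {
        // Copy the file from the safe location to the response stream.
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            Response.OutputStream.Write(buffer, 0, bytesRead);
        }
    }
    Response.End();
}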
If you choose this approach, it's easy to implement your own simple security or custom actions later on, such as making sure a user is logged into your web site before you send the file, or that users can only access files they uploaded themselves, or logging which users download which files, etc. The possibilities are endless ;-)
Take a look at the System.IO.Path class as it has lots of useful functions you can utilise, such as:
Get the characters that are invalid in a path:
System.IO.Path.GetInvalidPathChars();
Get a random file name:
System.IO.Path.GetRandomFileName();
Get a unique, random file name in the temporary directory:
System.IO.Path.GetTempFileName();
I would go with option #3. A table mapping these files with users will provide other uses down the road, it always does. If you use the mapping, the only advantage of appending the user name or id to the file is if you are trying to debug a problem.
I'd probably use a GUID instead of a random number but either would work. The important things in my opinion are
No username as any part of the stored file name
Never use the original file name as any part of the stored file name
Use a random number or GUID to ensure there are no duplicate files
Adding a user ID to the file name will help with manual debugging
There is more to this than meets the eye...which I am thinking that you already knew!
What sort of files are you talking about? If they are anything even remotely big, or in such quantity that the group of files could be big, I would immediately suggest that you add some flexibility to your approach.
Create a table that stores the root paths to various file stores (these could be drives, UNC paths, whatever your environment supports). It will initially have one entry, which will be your first storage location. A nice attribute to maintain with this data is how much room is available there.
Maintain a table of file-related data (ID {GUID}, create date, foreign key to the path data, file size); both tables are sketched after this list.
Write the file to a root that still has room on it (query all file sizes stored in a root location and compare the total to that root's capacity).
Write the file using a GUID for the name (this obfuscates the file on the file system); it can be written without the file extension if security requires it (sensitive files).
Write the file according to its create date, starting from the root: root/year{number}/month{number}/day{number}/file.extension
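A sketch of those two tables (SQL Server flavored; all names are illustrative):

-- Root paths of the available file stores, with their capacity.
CREATE TABLE FileStoreRoot (
    RootID     INT IDENTITY PRIMARY KEY,
    RootPath   NVARCHAR(260) NOT NULL,  -- drive or UNC path
    CapacityMB BIGINT NOT NULL
);

-- One row per stored file; the GUID doubles as the on-disk file name.
CREATE TABLE StoredFile (
    FileID     UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
    RootID     INT NOT NULL REFERENCES FileStoreRoot(RootID),
    CreateDate DATETIME NOT NULL DEFAULT GETDATE(),
    FileSize   BIGINT NOT NULL
);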
With a system of this nature in place (even though you won't/don't need it up front) you can now more easily relocate the files. You can better manage the files. You can better manage collections of files. Etc. I have used this system before and found it to be quite flexible. Dealing with files that are stored on a file system but managed from a database can get a bit out of control once the file store becomes large and things need to get moved around. Also, at least in the case of Windows, storing zillions of files in one directory is usually not a good idea (hence breaking things up by create date).
This complexity is only really needed when you have high volumes and large foot prints.
