File splitting to different files - unix

I have a file which has data in the below format
Col1
1,a,b,c
1,e,f,g,h,j
2,r,t,y,u,i.o
2,q,s,d,f
3,q,a,s,l
4,r,y,u,p,o
4,o,l,j,f,c,g,b,c
4,d,f,q,
.
.
.
97,w,e,r
3,f,g
100,q,a,x,c
Now I want to split this file into 100 different files so that each file has data based on the first column. Example: the first file should have only the rows whose first column is 1, the second file should have the rows whose first column is 2, and so on up to 100 files.
Please tell me the approaches in Informatica, Unix or Teradata.

Kindly use a Transaction Control transformation to generate multiple files with respect to the column value (note that the source data must be sorted on Col1 for this to work).
In an Expression transformation, take variable ports evaluated in this order:
V_curr = Col1
V_flag = IIF(V_curr = V_prev, 0, 1)
V_prev = V_curr
Now add a Transaction Control transformation and pass the pipeline through it.
In its properties, set the Transaction Control Condition to:
IIF(V_flag = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)
Once you execute the workflow, multiple files will be generated with respect to Col1.
For ref - https://kb.informatica.com/h2l/HowTo%20Library/1/0114-GeneratingMultipleTargetsFromOneTargetDefinition.pdf
Also check - https://etlinfromatica.wordpress.com/
Thank you

Seems simple enough... use the FileName port on the flat file target and have an Expression transformation with an output port filename_out dynamically created as a derivative of the first column value, e.g. "FileOut" || Port1 || ".dat".
Then connect the output port filename_out to the input port FileName on the target.

Related

Error loading data to neo4j with a CASE statement in a Cypher query

I am trying to load CSV data with a CASE statement in the query, but the following error appears:
Cypher:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line
MATCH(a:test_t{tid:line.pid})
CASE
WHEN line.key !='NA' THEN
WITH split(line.key,",") as name
UNWIND name as x
MERGE(k:test_key{key_term:toLower(x)})
MERGE(a)-[:contains]->(k)
END
Error
Neo.ClientError.Statement.SyntaxError: Invalid input 'S': expected 'l/L' (line 5, column 3 (offset: 137))
"CASE"
Can anyone help me?
CASE is an expression, so it does not support embedding other Cypher clauses (but it can invoke functions). In fact, a CASE expression is not actually needed for your use case.
This query should work (the :auto at the beginning is needed in neo4j 4.0+):
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line FIELDTERMINATOR ';'
WITH line
WHERE line.key <> 'NA'
MATCH (a:test_t {tid: line.pid})
UNWIND split(line.key, ',') as x
MERGE (k:test_key {key_term: toLower(x)})
MERGE (a)-[:contains]->(k)
This query filters out all unwanted lines as soon as they are obtained from the file. Reducing the number of rows of data being worked on as early as possible is good practice.
Also, you have a second issue. Your data file cannot use the comma as both the (default) field terminator AND as the delimiter between your x values.
To resolve this ambiguity, the above query chose to use the FIELDTERMINATOR ';' option to specify that the ";" character will be used as the field terminator. A sample data file would look like this:
pid;key
123;NA
234;Foo,Bar
345;Bar,Baz
456;NA
567;Baz
You are using CASE incorrectly. You cannot have update clauses inside a CASE expression. Instead you can use a WHERE clause to filter the rows of the file. For instance, adding WHERE line.key <> 'NA' while processing the file before you move on to the updates will work. Something like this should fit the bill.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line
MATCH (a:test_t {tid: line.pid})
WITH a, line
WHERE line.key <> 'NA'
WITH a, split(line.key, ",") as name
UNWIND name as x
MERGE (k:test_key {key_term: toLower(x)})
MERGE (a)-[:contains]->(k)
It looks like, from your logic, you could even move the test up above the MATCH. So this might be better (fewer unnecessary matches).
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line
WITH line
WHERE line.key <> 'NA'
MATCH (a:test_t {tid: line.pid})
WITH a, split(line.key, ",") as name
UNWIND name as x
MERGE (k:test_key {key_term: toLower(x)})
MERGE (a)-[:contains]->(k)

R - extract data which changes position from file to file (txt)

I have a folder with tons of txt files from which I have to extract specific data. The problem is that the format of the file has changed once, and the position of the data I need to extract has also changed. So I need to deal with files in different formats.
To make it clearer: in column 4 I have the name of the variable and in column 5 I have the value, but sometimes this is in a different row. Is there a way to find which row the variable name is in and then extract its value?
Thanks in advance
EDITING
In some files I will have the data like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Current--------28
But at some point there was a change in the software to add another variable, and the new file is like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
So I need to deal with these 2 types of data, extracting the same variables which are in different rows.
If these files can't be read with read.table, use readLines and then find the lines that start with the keyword you need.
For example:
Sample file 1 (with the dashes included and extra line breaks):
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
Sample file 2 (with a comma as separator):
Column 1,Column 2.
Device ID,A.
Current,555
Voltage, 500.
Error,5.
For both cases do:
text <- readLines("your filename here")
curr <- text[grepl("^Current", text, ignore.case = TRUE)]
Which returns:
for file 1:
[1] "Current--------28"
for file 2:
[1] "Current,555"
Then use gsub to remove anything that is not a number.
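For example, a minimal follow-on sketch using the curr value from above (assuming the value is a plain positive number):
# strip everything that is not a digit or decimal point, then convert
current_value <- as.numeric(gsub("[^0-9.]", "", curr))
# returns 28 for file 1 and 555 for file 2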

How to map output of Execute R Script to second port in ML Studio using maml.mapOutputPort

Execute R Script box in ML Studio has two output ports. How does one send data to the second output port?
Currently ML Studio does not support the mapping of output data to a second port. This can be verified by executing the following code:
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1", 1);
maml.mapOutputPort("dataset1", 2);
which returns
---------- Start of error message from R ----------
Error: At this time, there is only 1 output dataset port that can be mapped. Please provide 1 as the portNumber
As a workaround you can output your data.frame with an additional column labeling which output set is intended. Then Clean Missing Data with the action Remove entire row can be used to truncate the data to the correct output, and Select Columns in Dataset can be used to remove the extraneous column.
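A rough sketch of that workaround inside the Execute R Script module (the marker column names and the odd/even split rule are purely illustrative):
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
# Mark which output set each row is intended for; NA marks the rows that a
# downstream Clean Missing Data module (Remove entire row) should drop
is_first <- seq_len(nrow(dataset1)) %% 2 == 1   # illustrative split rule
dataset1$set1_marker <- ifelse(is_first, 1, NA)
dataset1$set2_marker <- ifelse(is_first, NA, 1)
# Send the single labeled data.frame to the only supported output port
maml.mapOutputPort("dataset1", 1)
Downstream, each branch runs Clean Missing Data on its own marker column and then Select Columns in Dataset to drop the marker columns.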

looping through paired end fastq reads

How can you loop through paired-end fastq files? For single-end reads you can do the following:
library(ShortRead)
strm <- FastqStreamer("./my.fastq.gz")
repeat {
    fq <- yield(strm)
    if (length(fq) == 0)
        break
    # do things
    writeFasta(fq, 'output.fq', mode="a")
}
However, if I edit one paired-end file, I somehow need to keep track of the second file so that the two files continue to correspond well with each other.
Paired-end fastq files are typically ordered, so you could keep track of the lines that are removed and remove them from the paired file. But this isn't a great method, and if your data is line-wrapped you will be in pain.
A better way would be to use the header information.
The headers for the paired reads in the two files are identical, except for the field that specifies whether the read is reverse or forward (1 or 2)...
first read from file 1:
@M02621:7:000000000-ARATH:1:1101:15643:1043 1:N:0:12
first read from file 2:
@M02621:7:000000000-ARATH:1:1101:15643:1043 2:N:0:12
The numbers 1101:15643:1043 refer to the tile and the x, y coordinates on that tile, respectively.
These numbers uniquely identify each read pair, for the given run.
Using this information, you can remove reads from the second file if they are not in the first file.
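A minimal sketch of that idea with ShortRead, assuming both files list their reads in the same order and in equal-sized chunks (the file names and the filtering step are placeholders):
library(ShortRead)
strm1 <- FastqStreamer("./my_R1.fastq.gz")  # forward reads (placeholder name)
strm2 <- FastqStreamer("./my_R2.fastq.gz")  # reverse reads (placeholder name)
repeat {
    fq1 <- yield(strm1)
    fq2 <- yield(strm2)
    if (length(fq1) == 0)
        break
    # ... filter/edit fq1 here, e.g. fq1 <- fq1[keep] ...
    # pair id = header up to the first space (drops the trailing 1/2 mate field)
    id1 <- sub(" .*", "", as.character(id(fq1)))
    id2 <- sub(" .*", "", as.character(id(fq2)))
    fq2 <- fq2[id2 %in% id1]  # keep only mates still present in file 1
    writeFastq(fq1, "output_R1.fq.gz", mode = "a")
    writeFastq(fq2, "output_R2.fq.gz", mode = "a")
}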
Alternatively, if you are doing quality trimming... Trimmomatic can perform quality/length filtering on paired-end data, and it's fast...

Sequence number inside a txt file in UNIX

I want to generate a unique sequence number for each row in the file in unix. I cannot make an identity column in the database as it has other sources which also insert data into it. I tried using NR in awk, but since I have filters in my script it may skip rows in the file, so I may not get sequential numbers.
My requirements are: the sequence number needs to be persistent, since every day I would receive this file and it should start from where I left off. Also, the number needs to be preceded by "EMP_" for each line in the file.
Please suggest.
Thanks in advance.
To obtain a unique id in UNIX you may use a file to store and read the value; however, this method is tedious and requires a mechanism for file IO locking. The easiest way is to use the date and time to obtain a unique id, for example:
#!/bin/sh
# note: no spaces around '=' in shell variable assignments
uniqueVal=$(date '+%Y%m%d%H%M%S')
