We need to split data received over TCP, so we use # as a delimiter.
But if the received data already contains #, it is split at that # as well, so the messages are not split correctly.
How can we split the data reliably? Do we need to change the delimiter?
socket->write(QByteArray(1,TASK_DELIMITER));
We collect the data with:
QList<QByteArray> commandList = data.split(TASK_DELIMITER);
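One possible approach is to keep # as the delimiter but escape any # that occurs inside a payload. The sketch below is in Python rather than Qt, just to show the idea. The escape byte (a backslash) and the helper names encode_task and split_tasks are assumptions for illustration, not part of the original code.

```python
DELIM = b"#"   # plays the role of TASK_DELIMITER
ESC = b"\\"    # assumed escape byte; it must itself be escaped in payloads


def encode_task(payload: bytes) -> bytes:
    """Escape ESC and DELIM inside the payload, then append the delimiter."""
    escaped = payload.replace(ESC, ESC + ESC).replace(DELIM, ESC + DELIM)
    return escaped + DELIM


def split_tasks(buffer: bytes):
    """Return (complete payloads, unconsumed remainder) for a receive buffer."""
    tasks, current = [], bytearray()
    i = 0
    last_end = 0  # index just past the last complete message
    while i < len(buffer):
        byte = buffer[i:i + 1]
        if byte == ESC:
            if i + 1 >= len(buffer):
                break  # the escape pair is split across two reads
            current += buffer[i + 1:i + 2]  # take the escaped byte literally
            i += 2
        elif byte == DELIM:
            tasks.append(bytes(current))
            current.clear()
            i += 1
            last_end = i
        else:
            current += byte
            i += 1
    return tasks, buffer[last_end:]
```

Because TCP delivers a byte stream rather than whole messages, the remainder returned by split_tasks would be kept and prepended to the next chunk that arrives.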
Related
I have a file which has data in the below format
Col1
1,a,b,c
1,e,f,g,h,j
2,r,t,y,u,i.o
2,q,s,d,f
3,q,a,s,l
4,r,y,u,p,o
4,o,l,j,f,c,g,b,c
4,d,f,q,
.
.
.
97,w,e,r
3,f,g
100,q,a,x,c
Now I want to split this file into 100 different files so that each file holds the rows for one value of the first column. For example, the first file should contain only the rows with value 1 in the first column, the second file should contain the rows with value 2 in the first column, and so on up to 100 files.
Please tell me the approaches in Informatica, Unix or Teradata.
Use a Transaction Control transformation to generate multiple files based on the column value.
Sort the data on Col1 first so that rows with the same value are adjacent, then create variable ports in an Expression transformation, in this order:
V_curr = Col1
V_flag = IIF(V_curr = V_prev, 0, 1)
V_prev = V_curr
Pass V_flag on through an output port, then add a Transaction Control transformation and pass the pipeline through it.
In its properties, set the Transaction Control Condition to:
IIF(V_flag = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)
When you run the workflow, a new file is started each time the value of Col1 changes.
For ref - https://kb.informatica.com/h2l/HowTo%20Library/1/0114-GeneratingMultipleTargetsFromOneTargetDefinition.pdf
Also check - https://etlinfromatica.wordpress.com/
Thank you
This seems simple enough: use the FileName port on the flat file target. Add an Expression transformation with an output port filename_out whose value is derived from the first column, e.g. "FileOut" || Port1 || ".dat".
Then connect the filename_out output port to the FileName input port on the target.
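For comparison, the same idea (derive the output file name from the first column) can be sketched outside Informatica, here in Python. The function name split_by_first_column and the FileOut<key>.dat naming are assumptions for illustration. The sketch assumes comma-separated lines with a numeric first field, and it skips the Col1 header line.

```python
import os


def split_by_first_column(src_path: str, out_dir: str) -> dict:
    """Write each data line to FileOut<key>.dat, keyed on the first field.

    Returns a mapping of key -> number of lines written.
    """
    counts = {}
    handles = {}
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            for line in src:
                line = line.rstrip("\n")
                key = line.split(",", 1)[0].strip()
                if not key.isdigit():
                    continue  # skips the header line and blank lines
                if key not in handles:
                    path = os.path.join(out_dir, f"FileOut{key}.dat")
                    handles[key] = open(path, "w", encoding="utf-8")
                handles[key].write(line + "\n")
                counts[key] = counts.get(key, 0) + 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Unlike the transaction-control approach, this does not need the input to be sorted, because one output file per key stays open until the input is exhausted.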
Execute R Script box in ML Studio has two output ports. How does one send data to the second output port?
Currently ML Studio does not support the mapping of output data to a second port. This can be verified by executing the following code:
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1", 1);
maml.mapOutputPort("dataset1", 2);
which returns
---------- Start of error message from R ----------
Error: At this time, there is only 1 output dataset port that can be mapped. Please provide 1 as the portNumber
As a workaround, you can output your data.frame with an additional column labeling which output set each row is intended for. Then Clean Missing Data with the action Remove entire row can be used to cut the data down to the correct output, and Select Columns in Dataset can be used to remove the extraneous column.
How can you loop through a paired-end fastq file? For single-end reads you can do the following:
library(ShortRead)
strm <- FastqStreamer("./my.fastq.gz")
repeat {
    fq <- yield(strm)
    if (length(fq) == 0)
        break
    #do things
    writeFasta(fq, 'output.fq', mode="a")
}
However, if I edit one paired-end file, I somehow need to keep track of the second file so that the two files still correspond to each other.
Paired-end fastq files are typically ordered, so you could keep track of the records that are removed and remove the same records from the paired file. But this isn't a great method, and if your data is line-wrapped you will be in pain.
A better way would be to use the header information.
The headers for the paired reads in the two files are identical, except for the field that specifies whether the read is reverse or forward (1 or 2)...
First read from file 1:
@M02621:7:000000000-ARATH:1:1101:15643:1043 1:N:0:12
First read from file 2:
@M02621:7:000000000-ARATH:1:1101:15643:1043 2:N:0:12
The numbers 1101:15643:1043 refer to the tile and the x and y coordinates on that tile, respectively.
These numbers uniquely identify each read pair for the given run.
Using this information, you can remove reads from the second file if they are not in the first file.
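A minimal Python sketch of that header-matching idea follows. The helper names read_fastq, pair_key and keep_matching are hypothetical. It assumes the header layout shown above (the read number comes after the first space) and plain 4-line records with no line wrapping.

```python
def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines.

    Assumes the common 4-lines-per-record layout (no line wrapping).
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        yield header, seq, plus, qual


def pair_key(header: str) -> str:
    """Part of the header shared by both mates: everything before the first space."""
    return header.split(" ", 1)[0]


def keep_matching(kept_r1, all_r2):
    """Return the records from file 2 whose mate is present in kept_r1."""
    wanted = {pair_key(rec[0]) for rec in kept_r1}
    return [rec for rec in all_r2 if pair_key(rec[0]) in wanted]
```

This holds the set of surviving keys from file 1 in memory, which is usually acceptable for a single run but may need a different strategy for very large files.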
Alternatively, if you are doing quality trimming... Trimmomatic can perform quality/length filtering on paired-end data, and it's fast...
I want to generate a unique sequence number for each row in a file on Unix. I cannot use an identity column in the database because other sources also insert data into it. I tried using NR in awk, but since my script has filters, it may skip rows in the file, so I may not get sequential numbers.
My requirements are: the sequence number needs to be persistent, since I receive this file every day and the numbering should continue from where it left off. The number also needs to be prefixed with "EMP_" on each line of the file.
Please suggest.
Thanks in advance.
To obtain a unique ID on Unix you can store the last value in a file and read it back. However, this method is tedious and requires a file-locking mechanism. The easiest way is to use the date and time as the unique ID (note that this value only changes once per second, so it identifies a run rather than an individual row), for example:
#!/bin/sh
uniqueVal=$(date '+%Y%m%d%H%M%S')
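The file-based method mentioned above can be sketched as follows, in Python for brevity. The function name next_ids and the counter file path are hypothetical. The sketch assumes only one process runs at a time, so it does no locking.

```python
import os


def next_ids(counter_path: str, count: int, prefix: str = "EMP_"):
    """Reserve `count` consecutive numbers and persist the new last value."""
    last = 0
    if os.path.exists(counter_path):
        with open(counter_path, "r", encoding="utf-8") as fh:
            text = fh.read().strip()
            last = int(text) if text else 0
    ids = [f"{prefix}{n}" for n in range(last + 1, last + count + 1)]
    # Write to a temporary file and rename it, so that a crash cannot
    # leave a half-written counter file behind.
    tmp_path = counter_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(str(last + count))
    os.replace(tmp_path, counter_path)
    return ids
```

Calling it once per day with the number of rows that survive the filters gives sequential IDs that continue from the previous run.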
I need to be able to delimit a stream of binary data. I was thinking of using something like the ASCII EOT (End of Transmission) character to do this.
However I'm a bit concerned -- how can I know for sure that the particular binary sequence used for this (0b00000100) won't appear in my own binary sequences, thus giving a false positive on delimitation?
In other words, how is binary delimiting best handled?
EDIT: ...Without using a length header. Sorry guys, should have mentioned this before.
You've got five options:
1. Use a delimiter character that is unlikely to occur. This runs the risk of you guessing incorrectly. I don't recommend this approach.
2. Use a delimiter character and an escape sequence to include the delimiter. You may need to double the escape character, depending upon what makes for easier parsing. (Think of the C \0 to include an ASCII NUL in some content.)
3. Use a delimiter phrase that you can determine does not occur. (Think of the MIME message boundaries.)
4. Prepend a length field of some sort, so you know to read the following N bytes as data. This has the downside of requiring you to know this length before writing the data, which is sometimes difficult or impossible.
5. Use something far more complicated, like ASN.1, to completely describe all your content for you. (I don't know if I'd actually recommend this unless you can make good use of it -- ASN.1 is awkward to use in the best of circumstances, but it does allow completely unambiguous binary data interpretation.)
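As an illustration of the delimiter-phrase option in the list above, here is a minimal Python sketch. The function name join_with_boundary and the 16-byte boundary size are assumptions. It requires all payloads to be known before the boundary is chosen, much as a MIME boundary is picked per message.

```python
import os


def join_with_boundary(payloads, size: int = 16):
    """Return (boundary, blob) where blob.split(boundary) recovers the payloads.

    A random boundary is retried until the round trip is exact, which also
    rules out accidental matches that straddle a payload edge.
    """
    payloads = list(payloads)
    if not payloads:
        raise ValueError("at least one payload is required")
    while True:
        boundary = os.urandom(size)
        blob = boundary.join(payloads)
        if blob.split(boundary) == payloads:
            return boundary, blob
```

The receiver has to learn the boundary somehow, for example from a small header sent ahead of the data.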
Usually, you wrap your binary data in a well-known format, for example with a fixed header that describes the subsequent data. If you are trying to find delimiters in an unknown stream of data, you usually need an escape sequence. For example, in HDLC-style framing 0x7E is the frame delimiter. Data must be encoded so that any 0x7E inside the data is replaced with 0x7D followed by the original byte XORed with 0x20. A 0x7D in the data stream is escaped in the same way.
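A minimal Python sketch of this kind of byte stuffing follows, assuming the 0x20 XOR mask used by HDLC-style asynchronous framing. The function names stuff and unstuff_frames are hypothetical.

```python
FLAG = 0x7E      # frame delimiter
ESCAPE = 0x7D    # escape byte
XOR_MASK = 0x20  # value XORed into an escaped byte


def stuff(payload: bytes) -> bytes:
    """Escape FLAG and ESCAPE bytes, then terminate the frame with FLAG."""
    out = bytearray()
    for byte in payload:
        if byte in (FLAG, ESCAPE):
            out += bytes([ESCAPE, byte ^ XOR_MASK])
        else:
            out.append(byte)
    out.append(FLAG)
    return bytes(out)


def unstuff_frames(stream: bytes):
    """Split a stream on FLAG and undo the escaping in each complete frame."""
    frames, current, escaped = [], bytearray(), False
    for byte in stream:
        if byte == FLAG:
            frames.append(bytes(current))
            current.clear()
            escaped = False
        elif byte == ESCAPE:
            escaped = True  # the next byte carries the masked original
        elif escaped:
            current.append(byte ^ XOR_MASK)
            escaped = False
        else:
            current.append(byte)
    return frames  # bytes after the last FLAG are an incomplete frame and are dropped
```

After stuffing, 0x7E can only ever appear as a frame boundary, so the receiver can resynchronize at the next delimiter even if it starts reading mid-stream.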
If the binary records can really contain any data, try adding a length before the data instead of a marker after the data. This is sometimes called a length prefix because the length comes before the data.
Otherwise, you'd have to escape the delimiter in the byte stream (and escape the escape sequence).
You can prepend the size of the binary data before it. If you are dealing with streamed data and don't know its size beforehand, you can divide it into chunks and have each chunk begin with size field.
If you set a maximum size for a chunk, you will end up with all but the last chunk the same length which will simplify random access should you require it.
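The chunked variant could look roughly like this in Python. The names write_chunked and read_chunked, the 4-byte little-endian size field, the 4096-byte maximum, and the zero-length chunk used as an end marker are all assumptions made for this sketch.

```python
import struct

MAX_CHUNK = 4096  # assumed maximum chunk size


def write_chunked(source, sink, max_chunk: int = MAX_CHUNK) -> None:
    """Copy `source` to `sink` as [4-byte length][data] chunks, ending with length 0."""
    while True:
        chunk = source.read(max_chunk)
        sink.write(struct.pack("<I", len(chunk)))
        if not chunk:
            break  # the zero-length chunk marks the end of this record
        sink.write(chunk)


def read_chunked(source) -> bytes:
    """Reassemble one record written by write_chunked."""
    parts = []
    while True:
        header = source.read(4)
        if len(header) < 4:
            raise EOFError("stream ended inside a chunk header")
        (size,) = struct.unpack("<I", header)
        if size == 0:
            return b"".join(parts)
        data = source.read(size)
        if len(data) < size:
            raise EOFError("stream ended inside a chunk")
        parts.append(data)
```

The writer never needs to know the total length in advance; it only needs to know the size of the chunk it is about to send.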
As a space-efficient, fixed-overhead alternative to prepending size fields or escaping the delimiter character, escapeless encoding can be used to eliminate the delimiter character, and possibly other characters that should have a special meaning, from your data.
@sarnold's answer is excellent, and here I want to share some code to illustrate it.
First, here is a wrong way to do it: using a \n delimiter. Don't do it! The binary data could contain \n, which would be confused with the delimiters:
import os, random
with open('test', 'wb') as f:
    for i in range(100):                        # create 100 binary sequences of random
        length = random.randint(2, 100)         # length (between 2 and 100)
        f.write(os.urandom(length) + b'\n')     # separated with the character b"\n"
with open('test', 'rb') as f:
    for i, l in enumerate(f):
        print(i, l)                             # oops we get 123 sequences! wrong!
...
121 b"L\xb1\xa6\xf3\x05b\xc9\x1f\x17\x94'\n"
122 b'\xa4\xf6\x9f\xa5\xbc\x91\xbf\x15\xdc}\xca\x90\x8a\xb3\x8c\xe2\x07\x96<\xeft\n'
Now the right way to do it (option #4 in sarnold's answer):
import os, random
with open('test', 'wb') as f:
    for i in range(100):
        length = random.randint(2, 100)
        f.write(length.to_bytes(2, byteorder='little'))  # prepend the data with the length of the next data chunk, packed in 2 bytes
        f.write(os.urandom(length))
with open('test', 'rb') as f:
    i = 0
    while True:
        l = f.read(2)   # read the length of the next chunk
        if l == b'':    # end of file
            break
        length = int.from_bytes(l, byteorder='little')
        s = f.read(length)
        print(i, s)
        i += 1
...
98 b"\xfa6\x15CU\x99\xc4\x9f\xbe\x9b\xe6\x1e\x13\x88X\x9a\xb2\xe8\xb7(K'\xf9+X\xc4"
99 b'\xaf\xb4\x98\xe2*HInHp\xd3OxUv\xf7\xa7\x93Qf^\xe1C\x94J)'