How can you loop through a paired-end FASTQ file? For single-end reads you can do the following:
library(ShortRead)
strm <- FastqStreamer("./my.fastq.gz")
repeat {
    fq <- yield(strm)        # next chunk of reads; empty when the file is exhausted
    if (length(fq) == 0)
        break
    # do things
    writeFastq(fq, "output.fq.gz", mode="a")
}
close(strm)
However, if I edit one paired-end file, I somehow need to keep track of the second file so that the two files continue to correspond with each other.
Paired-end FASTQ files are typically ordered, so you could keep track of the records you remove and remove the same ones from the paired file. But this isn't a great method, and if your data is line-wrapped you will be in pain.
A better way is to use the header information.
The headers for the paired reads in the two files are identical, except for the field that specifies whether the read is forward or reverse (1 or 2):
First read from file 1:
@M02621:7:000000000-ARATH:1:1101:15643:1043 1:N:0:12
First read from file 2:
@M02621:7:000000000-ARATH:1:1101:15643:1043 2:N:0:12
The numbers 1101:15643:1043 refer to the tile and the x and y coordinates on that tile, respectively.
These numbers uniquely identify each read pair for the given run.
Using this information, you can remove reads from the second file if they are not in the first file.
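For example, here is a minimal sketch of that approach, streaming both files in lock-step with FastqStreamer (the file names and the length filter are placeholders, and it assumes both files start with their reads in the same order so each pair of chunks lines up):
library(ShortRead)
strm1 <- FastqStreamer("reads_1.fastq.gz")
strm2 <- FastqStreamer("reads_2.fastq.gz")
repeat {
    fq1 <- yield(strm1)
    fq2 <- yield(strm2)
    if (length(fq1) == 0)
        break
    # strip the mate field (" 1:N:0:12" / " 2:N:0:12") so the IDs are comparable
    ids1 <- sub(" .*", "", as.character(id(fq1)))
    ids2 <- sub(" .*", "", as.character(id(fq2)))
    keep <- width(fq1) >= 50              # placeholder filter applied to file 1
    fq2 <- fq2[ids2 %in% ids1[keep]]      # keep only the mates that survived
    writeFastq(fq1[keep], "filtered_1.fastq.gz", mode="a")
    writeFastq(fq2, "filtered_2.fastq.gz", mode="a")
}
close(strm1)
close(strm2)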
Alternatively, if you are doing quality trimming, Trimmomatic can perform quality/length filtering directly on paired-end data, and it's fast.
I made a CSV and I used it for a data driven test in Katalon.
x,y
house,way
1,2
When I run the test, it runs three times, even though I stored only two valid data rows (house,way and 1,2).
I don't know why this happens.
It's your CSV file: it has an extra line or carriage return at the end. Use Notepad to delete all empty space and lines after the last character of the final data row. Also remember to delete and reload the file in the Data Driven settings.
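For example, a file saved with a trailing newline effectively looks like this to the test runner (a hypothetical reconstruction), and the empty last line is picked up as a third data row:
x,y
house,way
1,2
(empty line)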
I have a folder with tons of txt files from which I have to extract specific data. The problem is that the format of the files changed at some point, and the position of the data I need to extract changed with it. So I need to deal with files in different formats.
To make it clearer: in column 4 I have the name of the variable and in column 5 the value, but sometimes these are in a different row. Is there a way to find which row the variable name is in and then extract its value?
Thanks in advance
EDIT:
In some files I will have the data like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Current--------28
But at some point there was a change in the software to add another variable, and the new file is like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
So I need to deal with these two formats, extracting the same variables even though they sit in different rows.
If these files can't be read with read.table, use readLines and then find the lines that start with the keyword you need.
For example:
Sample file 1 (with the dashes included and extra line breaks):
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
Sample file 2 (with a comma as separator):
Column 1,Column 2.
Device ID,A.
Current,555
Voltage, 500.
Error,5.
For both cases do:
text <- readLines("your filename here")
curr <- text[grepl("^Current", text, ignore.case = TRUE)]
Which returns:
for file 1:
[1] "Current--------28"
for file 2:
[1] "Current,555"
Then use gsub to remove anything that is not a number.
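For example, a one-line sketch of that last step (it assumes the value is the only number on the line):
as.numeric(gsub("[^0-9.]", "", curr))  # strip everything but digits and the decimal point
which gives 28 for file 1 and 555 for file 2.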
I have two files containing biological DNA sequence data. Each of these files is the output of a python script which assigns each DNA sequence to a sample ID based on a DNA barcode at the beginning of the sequence. The output of one of these .txt files looks like this:
>S066_1 IGJRWKL02G0QZG orig_bc=ACACGTGTCGC new_bc=ACACGTGTCGC bc_diffs=0
TTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCGTGAGAAGTTGAGGTTATGGCAAGCATCCATAAGAACCCTATAGCGAGAATAATTACTACGCTTAGAGCCAGATGGCACCGCCACTGATTTTAGGGGCCGCTGAATAGCGAGCTCCAAGACCCCTTGCGGGATTGGTCAAAATAGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTCCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCAGCGTTCTTCATCGATGACGAGTCTAG
>S045_2 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
CTAAGTTCAGCGGGTAGTCTTGTCTGATATCAGGTCCAATTGAGATACCACCGACAATCATTCGATCATCAACGATACAGAATTTCCCAAATAAATCTCTCTACGCAACTAAATGCAGCGTCTCCGTACATCGCGAAATACCCTACTAAACAACGATCCACAGCTCAAACCGACAACCTCCAGTACACCTCAAGGCACACAGGGGATAGG
The first line is the sequence ID, and the second line is the DNA sequence. The S066 in the first part of the ID indicates that the sequence is from sample 066, and the _1 indicates that it's the first sequence in the file (not the first sequence from S066 per se). Because of the nuances of the DNA sequencing technology being used, I need to generate two files like this from the raw sequencing files, which I then merge together with cat. So far so good.
The next downstream step in my workflow does not allow identical sample names. Right now it gets halfway through, errors out, and closes because it encounters some identical sequence IDs. So it must be that, say, the 400th sequence in both files belongs to the same sample, generating identical sequence IDs (i.e. both files might have S066_400).
What I would like to do is use some code to insert a number (1000, 4971, whatever) immediately after the _ on every other line in the second file, starting with the first line. This way the IDs would no longer be confounded and I could proceed. So it would cover S066_2 to S066_24971 or S066_49712. Part of the trouble is that the ID may be variable in length, such that it could begin as S066_ or as 49BBT1_.
Try:
awk '/^>/ {$1=$1 "_13"} {print $0}' filename > tmp.tmp
mv tmp.tmp filename
This appends _13 to the first field of every header line (every line starting with >) and leaves the sequence lines untouched.
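Run it on the second file only. On the sample header shown earlier,
>S045_2 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
becomes
>S045_2_13 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
Note this appends the number after the existing suffix rather than inserting it immediately after the underscore, but it is still enough to make every ID in the merged file unique.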
Setting:
I have (simple) .csv and .dat files created from laboratory devices and other programs storing information on measurements or calculations. I have found solutions for other languages but not for R.
Problem:
Using R, I am trying to extract values to quickly display results without opening the created files. Here I have two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and get a summary (to make the picture complete: I will manage this part).
I would appreciate any form of help.
OK, it works for the key value (although perhaps not very nicely):
variable <- scan("file.csv", what = character(), sep = "")
returns a character vector of everything,
values <- variable[grep("keyword", variable) + 2] # + 2 as the actual value is stored two places ahead
returns the sought values as character strings, and
data <- as.numeric(gsub(",", ".", values))
completes it: the values had to be converted to numeric, and the decimal commas replaced with decimal points first.
In a line:
data <- as.numeric(gsub(",", ".", variable[grep("Ks_Boden", variable) + 2]))
Perseverance is not too bad an asset ;-)
The rest isn't finished yet; I will post once it is.
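In the meantime, here is a minimal sketch for part c), looping over every .csv file in a folder and collecting one value per file (the folder path and the keyword are placeholders):
files <- list.files("path/to/folder", pattern = "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(files, function(f) {
    tokens <- scan(f, what = character(), sep = "", quiet = TRUE)
    raw <- tokens[grep("Ks_Boden", tokens) + 2]   # value two tokens ahead
    data.frame(file = basename(f), value = as.numeric(gsub(",", ".", raw)))
}))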
I have one loop
for (chr in paste('chr', c(seq(1, 22), 'X', 'Y'), sep='')) {
    # ... code that builds 'exp' for this chromosome ...
    write.table(exp, file="list.txt", col.names=FALSE, row.names=TRUE, sep="\t")
}
Right now the loop gives me only one txt file, containing the data from the last iteration (chromosome Y).
What I want is one txt file with the data from all the chromosomes/all the iterations, not 24 different txt files (one per chromosome).
Thank you
Best regards
Anna
What you are not yet recognizing is that 'append' is set to FALSE by default, and you need to explicitly change it to TRUE if you want to record each 'exp' object (which, by the way, is an unfortunate name for a data object, because it's shared by a common math function). If you set append=TRUE, and if the 'exp' object is a full record of one chromosome each time write.table is called (hopefully with a 'chr' identifier column so you can keep track of where the data came from), then you should succeed.
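A minimal sketch of that fix (using 'res' instead of 'exp', per the naming advice above; remove any old list.txt first, or the new rows will just be appended to it):
if (file.exists("list.txt")) file.remove("list.txt")
for (chr in paste('chr', c(seq(1, 22), 'X', 'Y'), sep='')) {
    # ... build the per-chromosome data frame 'res' here ...
    res$chr <- chr  # identifier column so you can tell where each row came from
    write.table(res, file="list.txt", append=TRUE,
                col.names=FALSE, row.names=TRUE, sep="\t")
}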