Perform sequence of edits on a large text file

Perform sequence of edits on a large text file - r

I am hoping to perform a series of edits to a large text file composed almost entirely of single letters, seperated by spaces. The file is about 300 rows by about 400,000 columns, and about 250 MB.
My goal is to tranform this table using a series of steps, for eventual processing with another language (R, probably). I don't have much experience working with big data files, but PERL has been suggested to me as the best way to go about this. Please let me know if there is a better way :).
So, I am hoping to write a PERL script that does the following:
Open file, edit or write to a new file the following:
remove columns 2-6
merge/concatenate pairs of columns, starting with column 2 (so, merge column 2-3,4-5, etc)
replace each character pair according to sequential conditional algorithm running accross each row:
[example PSEUDOCODE: if character 1 of cell = character 2 of cell=a, cell=1
else if character 1 of cell = character 2 of cell=b, cell=2
etc.] such that except for the first column, the table is a numerical matrix
remove every nth column, or keep every nth column and remove all others
I am just starting to learn PERL, so I was wondering if these operations were possible in PERL, whether PERL would be the best way to do them, and if there were any suggestions for syntax on these operations in the context of reading/writing to a file.

I'll start:
use strict;
use warnings;
my #transformed;
while (<>) {
chomp;
my #cols = split(/\s/); # split on whitespace
splice(#cols, 1,6); # remove columns
push #transformed, $cols[0];
for (my $i = 1; $i < #cols; $i += 2) {
push #transformed, "$cols[$i]$cols[$i+1]";
}
# other transforms as required
print join(' ', #transformed), "\n";
}
That should get you on your way.

You need to post some sample input and expected output or we're just guessing what you want but maybe this will be a start:
awk '{
printf "%s ", $1
for (i=7;i<=NF;i+=2) {
printf "%s%s ", $i, $(i+1)
}
print ""
}' file

Related

How to efficiently clean linebreak in entire dataset R data.table

Line break is always the first target to prevent new line splitted in new row when perform "import text file" in Excel.Or export to other application with csv file importing.
(The solution might be able to apply in clean another special mark in dataset.
)
Goal clean all line break into space of entire dataset
dt[,lapply(.SD,gsub("\\n","",.SD))]
Problems
R freezed after applying the script with +50 cols & +3 million rows
What's wrong with the lapply approach above?And what is the preferred approach to clean certain things on entire table ?

chinsoon12 is basically it -- use set for low-overhead by-reference column overwrite; just add fixed=TRUE to make the regex faster too:
for (jj in seq_len(ncol(dt))) set(dt, , jj, gsub('\n', '', dt[[jj]], fixed = TRUE))
BTW, \\n is different from \n. \n is the literal newline character, \\n is the string "\n", i.e., a backslash followed by n. You can see the difference thus:
cat('hey\nyou')
# hey
# you
cat('hey\\nyou')
# hey\nyou

Split file without separating rows beginning with like values in Unix

I have a sorted .csv file that is something like this:
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
I want to split the file into smaller files of roughly two lines each, but I do not want rows with like values in the first column separated.
Here, I would end up with three files:
x00000:
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
x00001:
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
x00002:
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
My actual data is about 7 gigs in size and contains over 100 million lines. I want to split it into files of about 100K lines each or ~6MB. I am fine with using either file size or line numbers for splitting.
I know that I can use "sort" to split, such as:
split -a 5 -d -1 2
Here, that would give me four files, and like values in the first column would be split over files in most cases.
I think I probably need awk, but, even after reading through the manual, I am not sure how to proceed.
Help is appreciated! Thanks!

An awk script:
BEGIN { FS = "," }
!name { name = sprintf("%06d-%s.txt", NR, $1) }
count >= 2 && prev != $1 {
close(name)
name = sprintf("%06d-%s.txt", NR, $1)
count = 0
}
{
print >name
prev = $1
++count
}
Running this on the given data will create three files:
$ awk -f script.awk file.csv
$ cat 000001-AABB1122.txt
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
$ cat 000004-CCDD4444.txt
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
$ cat 000006-CCEE4444.txt
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
I have arbitrarily chosen to use the line number from the original file from where the first line was taken, along with the first field's data on that line as the filename.
The script counts the number of lines printed to the current output file, and if that number is greater than or equal to 2, and if the first field's value is different from the previous line's first field, the current output file is closed, a new output name is constructed, and the count is reset.
The last block simply prints to the current filename, remembers the first field in the prev variable, and increments the count.
The BEGIN block initializes the field delimiter (before the first line is read) and the !name block sets the initial output file name (when reading the very first line).
To get exactly the filenames that you have in the question, use
name = sprintf("x%05d", ++n)
to set the output filename in both places where this is done.

With csplit if available
With the given data
csplit -s infile %^A% /^C/ %^C% /^D/ /^Z/ {*}

Creating a new regex based on the returned results and rules of a previous regex | Indexing a regex and seeing how the regex has matched a substring

I am particularly looking at R, Perl, and shell. But any other programming language would be fine too.
QUESTION
Is there a way to visually or programmatically inspect and index a matched string based on the regex? This is intended for referencing back to the first regex and its results inside of a second regex, so as to be able to modify a part of the matched string and write new rules for that particular part.
https://regex101.com does visualize how a certain string matches the regular expression. But it is far from perfect and is not efficient for my huge dataset.
PROBLEM
I have around 12000 matched strings (DNA sequences) for my first regex, and I want to process these strings and based on some strict rules find some other strings in a second file that go well together with those 12000 matches based on those strict rules.
SIMPLIFIED EXAMPLE
This is my first regex (a simplified, shorter version of my original regex) that runs through my first text file.
[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)
Let's suppose that it finds the following three sub-strings in my large text file:
1. AAACCCGTGTAATAACAGACGTACTGTGTA
2. TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
3. TAACAAGGACCCTGTGTA
Now I have a second file which includes a very large string. From this second file, I am only interested in extracting those sub-strings that match a new (second) regex which itself is dependent on my first regex in few sections. Therefore, this second regex has to take into account the substrings matched in the first file and look at how they have matched to the first regex!
Allow me, for the sake of simplicity, index my first regex for better illustration in this way:
first.regex.p1 = [ACGT]{1,12000}
first.regex.p2 = (AAC)
first.regex.p3 = [AG]{2,5}
first.regex.p4 = [ACGT]{2,5}
first.regex.p5 = (CTGTGTA)
Now my second (new) regex which will search the second text file and will be dependent on the results of the first regex (and how the substrings returned from the first file have matched the first regex) will be defined in the following way:
second.regex = (CTAAA)[AC]{5,100}(TTTGGG){**rule1**} (CTT)[AG]{10,5000}{**rule2**}
In here rule1 and rule2 are dependent on the matches coming from the first regex on the first file. Hence;
rule1 = look at the matched strings from file1 and complement the pattern of first.regex.p3 that is found in the matched substring from file1 (the complement should of course have the same length)
rule2 = look at the matched strings from file1 and complement the pattern of first.regex.p4 that is found in the matched substring from file1 (the complement should of course have the same length)
You can see that second regex has sections that belong to itself (i.e. they are independent of any other file/regex), but it also has sections that are dependent on the results of the first file and the rules of the first regex and how each sub-string in the first file has matched that first regex!
Now again for the sake of simplicity, I use the third matched substring from file1 (because it is shorter than the other two) to show you how a possible match from the second file looks like and how it satisfies the second regex:
This is what we had from our first regex run through the first file:
3. TAACAAGGACCCTGTGTA
So in this match, we see that:
T has matched first.regex.p1
AAC has matched first.regex.p2
AAGGA has matched first.regex.p3
CC first.regex.p4
CTGTGTA has matched first.regex.p5
Now in our second regex for the second file we see that when looking for a substring that matches the second regex, we are dependent on the results coming from the first file (which match the first regex). Particularly we need to look at the matched substrings and complement the parts that matched first.regex.p3 and first.regex.p4 (rule1 and rule2 from second.regex).
complement means:
A will be substituted by T
T -> A
G -> C
C -> G
So if you have TAAA, the complement will be ATTT.
Therefore, going back to this example:
TAACAAGGACCCTGTGTA
We need to complement the following to satisfy the requirements of the second regex:
AAGGA has matched first.regex.p3
CC first.regex.p4
And complements are:
TTCCT (based on rule1)
GG (based on rule2)
So an example of a substring that matches second.regex is this:
CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG
This is only one example! But in my case I have 12000 matched substrings!! I cannot figure out how to even approach this problem. I have tried writing pure regex but I have completely failed to implement anything that properly follows this logic.. Perhaps I shouldn't be even using regex?
Is it possible to do this entirely with regex? Or should I look at another approach? Is it possible to index a regex and in the second regex reference back to the first regex and force the regex to consider the matched substrings as returned by first regex?

This can be done programmatically in Perl, or any other language.
Since you need input from two different files, you cannot do this in pure regex, as regex cannot read files. You cannot even do it in one pattern, as no regex engine remembers what you matched before on a different input string. It has to be done in the program surrounding your matches, which should very well be regex, as that's what regex is meant for.
You can build the second pattern up step by step. I've implemented a more advanced version in Perl that can easily be adapted to suit other pattern combinations as well, without changing the actual code that does the work.
Instead of file 1, I will use the DATA section. It holds all three example input strings. Instead of file 2, I use your example output for the third input string.
The main idea behind this is to split up both patterns into sub-patterns. For the first one, we can simply use an array of patterns. For the second one, we create anonymous functions that we will call with the match results from the first pattern to construct the second complete pattern. Most of them just return a fixed string, but two actually take a value from the arguments to build the complements.
use strict;
use warnings;
sub complement {
my $string = shift;
$string =~ tr/ATGC/TACG/; # this is a transliteration, faster than s///
return $string;
}
# first regex, split into sub-patterns
my #first = (
qr([ACGT]{1,12000}),
qr(AAC),
qr([AG]{2,5}),
qr([ACGT]{2,5}),
qr(CTGTGTA),
);
# second regex, split into sub-patterns as callbacks
my #second = (
sub { return qr(CTAAA) },
sub { return qr([AC]{5,100}) },
sub { return qr(TTTGGG) },
sub {
my (#matches) = #_;
# complement the pattern of first.regex.p3
return complement( $matches[3] );
},
sub { return qr(CTT) },
sub { return qr([AG]{10,5000}) },
sub {
my (#matches) = #_;
# complement the pattern of first.regex.p4
return complement( $matches[4] );
},
);
my $file2 = "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG";
while ( my $file1 = <DATA> ) {
# this pattern will match the full thing in $1, and each sub-section in $2, $3, ...
# #matches will contain (full, $2, $3, $4, $5, $6)
my #matches = ( $file1 =~ m/(($first[0])($first[1])($first[2])($first[3])($first[4]))/g );
# iterate the list of anonymous functions and call each of them,
# passing in the match results of the first match
my $pattern2 = join q{}, map { '(' . $_->(#matches) . ')' } #second;
my #matches2 = ( $file2 =~ m/($pattern2)/ );
}
__DATA__
AAACCCGTGTAATAACAGACGTACTGTGTA
TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
TAACAAGGACCCTGTGTA
These are the generated second patterns for your three input substrings.
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TCT)((?^:CTT))((?^:[AG]{10,5000}))(GCAT)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(CC)((?^:CTT))((?^:[AG]{10,5000}))(AA)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TTCCT)((?^:CTT))((?^:[AG]{10,5000}))(GG)
If you're not familiar with this, it's what happens if you print a pattern that was constructed with the quoted regex operator qr//.
The pattern matches your example output for the third case. The resulting #matches2 looks like this when dumped out using Data::Printer.
[
[0] "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG",
[1] "CTAAA",
[2] "ACACC",
[3] "TTTGGG",
[4] "TTCCT",
[5] "CTT",
[6] "AAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAG",
[7] "GG"
]
I cannot say anything about speed of this implementation, but I believe it will be reasonable fast.
If you wanted to find other combinations of patterns, all you had to do was replace the sub { ... } entries in those two arrays. If there is a different number than five of them for the first match, you'd also construct that pattern programmatically. I've not done that above to keep things simpler. Here's what it would look like.
my #matches = ( $file1 =~ join q{}, map { "($_)" } #first);
If you want to learn more about this kind of strategy, I suggest you read Mark Jason Dominus' excellent Higher Order Perl, which is available for free as a PDF here.

Using stringr in R
Extract matches to regex_1: "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)"
reg_1_matches = stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)")
reg_1_matches = unlist(reg_1_matches)
lets assume the matches were:
reg_1_matches = c("TTTTTTTGCGACCGAGAAACGGTTCTGTGTA", "TAACAAGGACCCTGTGTA")
Use stringr::str_match with capturing groups (...)
df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
p3 = df_ps[,2]
p4 = df_ps[,3]
Complement
rule_1 = chartr(old= "ACGT", "TGCA", p3)
rule_2 = chartr(old= "ACGT", "TGCA", p4)
Construct regex_2
paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="")
all in one go:
reg_1_matches = stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)")
df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
p3 = df_ps[,2]
p4 = df_ps[,3]
rule_1 = chartr(old= "ACGT", "TGCA", p3)
rule_2 = chartr(old= "ACGT", "TGCA", p4)
paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="")

This question really brings to mind the old saying about regular expressions, though in this case the languages you're matching against are regular, so RE is a good fit for this.
Unfortunately, my Perl is somewhat lacking, but fundamentally this sounds like a Regex problem rather than an R or Perl problem, so I'll do my best to answer it on that basis.
Perl's regex engine supports capture groups. The substrings matching bracketed subexpressions in your regex can be made available after matching:
use feature qw(say);
$foo = 'foo';
'aaa' =~ /(a)(a+)/;
say($1); # => 'a'
say($2); # => 'aa'
say("Matched!") if 'aaaa' =~ /${2}/;
What I'd suggest doing is bracketing your regex up properly, picking apart the capture groups after matching, and then sticking them together into a new regex, say...
use feature qw(say);
'ACGTAACAGAGATCTGTGTA' =~ /([ACGT]{1,12000})(AAC)([AG]{2,5})([ACGT]{2,5})(CTGTGTA)/ ; # Note that I've added a lot of (s and )s here so that the results get sorted into nice groups
say($1); # => 'ACGT'
say($2); # => 'AAC'
say($3); # => 'AGAG'
say($4); # => 'AT'
say($5); # => 'CTGTGTA'
$complemented_3 = complement($3); # You can probably implement these yourself...
$complemented_4 = complement($4);
$new_regex = /${complemented_3}[ACGT]+${complemented_4}/;
If the sections have actual meaning, then I'd also advise looking up named capture groups, and giving the results decent names rather than $1, $2, $3....

awk solution.
The requirements are not that complicated: a simple script can do the trick. There's just one complication: every regex that is a result from your first match has to be matched against all lines of the second file. Here's where we use xargs to solve that.
Now, whatever language you pick, it looks like the number of matches being made is going to be extensive, so some remarks about the regexes need to be made first.
The regex for the first file is going to be slow, because in
[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)
the number of possibilities for the first part [AGCT]{1,12000} are huge. Actuallly it only says pick any element from A, C, G, T and to that between 1 and 12000 times. Then match the rest. Couldn't we do a
AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$
instead? The speed gain is considerable.
A similar remark can be made to the regex for the second file. If you replace
(CTAAA)[AC]{5,100}(TTTGGG){**rule1**}(CTT)[AG]{10,5000}{**rule2**}
with
(CTAAA)[AC]{5,100}(TTTGGG){**rule1**}(CTT)[AG]*{**rule2**}$
you will experience some improvement.
Because I started this answer with the low complication-factor of the requirements, let's see some code:
$ cat tst.awk
match($0, /AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$/, a) {
r = sprintf("(CTAAA)[AC]{5,100}(TTTGGG)(%s)(CTT)[AG]*(%s)$",
translate(a[1]),
translate(a[2]));
print r
}
function translate(word) {
cmd = "echo '" word "' | tr 'ACGT' 'TGCA'";
res = ((cmd | getline line) > 0 ? line : "");
close(cmd);
return res
}
What this will do is produce the regex for your second file. (I've added extra grouping for demo purposes). Now, let's take a look at the second script:
$ cat tst2.awk
match($0, regex, a){ printf("Found matches %s and %s\n", a[3], a[5]) }
What this will do is get a regex and matches it with every line read from the second input file. We need to provide this script with a value for regex, like this:
$ awk -f tst.awk input1.txt | xargs -I {} -n 1 awk -v regex={} -f tst2.awk input2.txt
The -v option of awk let's us define a regex, which is fed into this call by the first script.
$ cat input1.txt
TAACAAGGACCCTGTGTA
$ cat input2.txt
CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG
and the result is:
$ awk -f tst.awk input1.txt | xargs -I {} -n 1 awk -v regex={} -f tst2.awk input2.txt
Found matches TTCCT and GG
In conclusion: should you use regexes to solve your problem? Yes, but you need to be not too ambitious to match the whole string in one time. Quantifiers like {1,12000} are going to slow you down, whatever language you pick.

R: Create dataframe from paste0 content

I am manually creating a "free text" table using cat and paste0 like so:
tab < - cat(paste0("Stage","\t","Number,"\n",
"A","\t",nrow(df[df$stage == "A",]),"\n",
"B","\t",nrow(df[df$stage == "B",]),"\n"
))
i.e.
Stage Number
A 54
B 85
where I want to be able to create a publication ready table (i.e. looks good and probably generated by r markdown.
The xtable() function can do this, but only accepts a dataframe. So my question is how to I get some free text, delimited by column using "\t" and by rows "\n" into a dataframe?
I have tried:
data.frame(do.call(rbind,strsplit(as.character(tab),'\t')))
But get "dataframe with zero columns and zero rows". I think this has to do with the fact I am not declaring "\" to be a new line.
By the way, if this way seems long-winded and there is an easier way, I am happy to take suggestions.

shell - Insert a character at different indexes in a string

It will eventually be part of a larger script so it needs to be shell scripted. A simple task in other languages, but I'm having trouble accomplishing it in shell. Basically I have a string and I want to insert a "." at all possible indices within the string. The output can be on newlined or separated by spaces. Can anyone help?
Example:
input: "abcd"
output: ".abcd
a.bcd
ab.cd
abc.d
abcd."
OR
output: ".abcd a.bcd ab.cd abc.d abcd."

A simple for loop would do:
input=abcd
for ((i=0; i<${#input}+1; i++))
do
echo ${input::$i}.${input:$i}
done
This just slices up the string at each index and inserts a .. You can change the echo to something else like appending to an array if you want to store them instead ouf output them, of course.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Perform sequence of edits on a large text file - r

You need to post some sample input and expected output or we're just guessing what you want but maybe this will be a start: awk '{ printf "%s ", $1 for (i=7;i<=NF;i+=2) { printf "%s%s ", $i, $(i+1) } print "" }' file

Related

How to efficiently clean linebreak in entire dataset R data.table

Split file without separating rows beginning with like values in Unix

Creating a new regex based on the returned results and rules of a previous regex | Indexing a regex and seeing how the regex has matched a substring

R: Create dataframe from paste0 content

shell - Insert a character at different indexes in a string

Categories

Resources