I have to process many ('00s) two-column delimited files that are numerically sorted by their first column (a long int that can range from 857 to 293823421 for example).
The processing is simple enough: iterate through a loop to left-join the files using one of them as 'anchor' (the 'left' file in the join), using join's -e and -o options to fill in the NULLs.
Question: is there any way join (from coreutils 8.13) can process these joins as-is, or must I add a sort -k1,1 step to ensure lexical order prior to each join?
Everything I've read searching this tells me I have to, but I wanted to make sure I wasn't missing some clever trick to avoid the extra sorting. Thank you.
Indeed, join does not support numeric comparisons. However, from your description, it sounds like you can convert your first field into an already-string-sorted form by zero-padding it, and then convert it back by stripping the zero-padding afterwards. For example, here is a function that performs a join -e NULL on two files that match your description (as I understand it):
function join_by_numeric_first_field () {
    local file1="$1"
    local file2="$2"
    join -e NULL <(awk '{printf("%020d\t%s\n", $1, $2)}' "$file1") \
                 <(awk '{printf("%020d\t%s\n", $1, $2)}' "$file2") \
        | awk 'BEGIN { OFS = "\t" } { $1 = $1 + 0; print }'
}
(The awk '{printf("%020d\t%s\n", $1, $2)}' reads each line of a two-column input and re-prints the two columns, tab-separated, treating the first column as a decimal integer and zero-padding it out to twenty characters. The final awk undoes that: forcing the first field back to a number with $1 = $1 + 0 strips the zero-padding, and print then re-emits the whole line, including any fields contributed by the join, tab-separated.)
Whether this is a better approach than sorting will depend on the size of your files, and on how flexible you need to be in supporting files that don't quite match your description. This approach scales linearly with the file size, but is significantly more complicated, and is also a bit more fragile, in that the awk commands expect a fairly specific input format. The sort approach is much simpler, but will not perform as well for large files.
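For example, a quick sketch with two made-up tab-separated files (left.txt and right.txt are just illustrative names):
printf '857\ta\n1200\tb\n293823421\tc\n' > left.txt
printf '857\tx\n293823421\ty\n' > right.txt
join_by_numeric_first_field left.txt right.txt
This should print the paired keys 857 and 293823421 with their values from both files. To get the left-join behaviour from your question (keep every line of the first file and fill missing fields with NULL), you would presumably add -a 1 and -o auto next to -e NULL inside the function, assuming your join supports -o auto (recent GNU coreutils versions do).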
We have a requirement to process a CSV file that uses the custom delimiter '||' (double pipes). The file has 40 columns and is approximately 400 to 500 MB in size.
We need to sort the file based on two columns: first on column 4 and then on column 17.
We found a command with which we can sort on one column, but we were not able to find one that sorts on both columns.
Since the delimiter is two characters long, we are using an awk command for the sorting.
Command:
awk -F \|\| '{print $4}' abc.csv | sort > output.csv
Please advise.
If your inputs are not too fancy (no newlines in the middle of a record, for instance), the sort utility can almost do what you want, but it supports only one-character field separators. So || would not work. But wait, if you do not have other | characters in your files, we could just consider | as the field separator and account for the extra empty fields:
sort -t'|' -k7 -k33 foo.csv
We sort by fields 7 (instead of 4) and then 33 (instead of 17) because of these extra empty fields. The formula that gives the new field number is simply 2*N-1 where N is the original field number.
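As a quick illustration of those extra empty fields (the sample line is made up), splitting on a single | turns every || into an empty field:
echo 'a||b||c||d' | awk -F'|' '{ print NF; print $7 }'
This prints 7 and then d: the fourth ||-separated column has landed in position 2*4-1 = 7, which is exactly the -k7 used above.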
If you do have | characters inside your fields a simple solution is to substitute them all by one unused character, sort, and restore the original ||. Example with tabs:
sed 's/||/\t/g' foo.csv | sort -t$'\t' -k4 -k17 | sed 's/\t/||/g'
If tab is also used in your fields, choose any unused character instead. Form feed (\f) or the file separator (ASCII code 28, that is, replace the 3 \t with \x1c) are good candidates.
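For example, the \x1c variant would look like this (assuming GNU sed, which understands the \xHH escape):
sed 's/||/\x1c/g' foo.csv | sort -t$'\x1c' -k4 -k17 | sed 's/\x1c/||/g'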
Using PROCINFO in gnu-awk, you can sort on a multi-character delimiter like this:
awk -F '\\|\\|' '{a[$4,$17] = $0} END {
  PROCINFO["sorted_in"]="#ind_str_asc"; for (i in a) print a[i]}' file.csv
(Note that records sharing the same column-4/column-17 pair overwrite each other in the array, so this assumes that pair is unique per record.)
You could try the following awk code, written along the lines of your own attempt. It sets OFS to | (that is, | becomes the output field separator; if you want a comma or anything else, change the OFS value accordingly) and also prints the 17th field, as per your requirement. The sort then uses the 1st and 2nd fields, because the 4th and 17th fields of the input have become the 1st and 2nd fields of the awk output.
awk -F'\\|\\|' -v OFS='|' '{print $4,$17}' abc.csv | sort -t'|' -k1,1 -k2,2 > output.csv
The sort command works on physical lines, which may or may not be acceptable. CSV files can contain quoted fields which contain newlines, which will throw off sort (and most other Unix line-oriented utilities; it's hard to write a correct Awk script for this scenario, too).
If you need to be able to manipulate arbitrary CSV files, you should probably look to a dedicated utility, or use a scripting language with proper CSV support. For example, assume you have a file like this:
Title,Number,Arbitrary text
"He said, ""Hello""",2,"There can be
newlines and
stuff"
No problem,1,Simple undramatic single-line CSV
In case it's not obvious, CSV is fundamentally just a text file, with some restrictions on how it can be formatted. To be valid CSV, every record should be comma-separated; any literal commas or newlines in the data need to be quoted, and any literal quotes need to be doubled. There are many variations; different tools accept slightly different dialects. One common variation is TSV, which uses tabs instead of commas as delimiters.
Here is a simple Python script which sorts the above file on the second field.
import csv
import sys

with open("test.csv", "r") as csvfile:
    csvdata = csv.reader(csvfile)
    lines = [line for line in csvdata]

titles = lines.pop(0)  # comment out if you don't have a header

writer = csv.writer(sys.stdout)
writer.writerow(titles)  # comment out if you don't have a header
writer.writerows(sorted(lines, key=lambda x: x[1]))
Using sys.stdout for output is slightly unconventional; obviously, adapt to suit your needs. The Python csv library documentation is not primarily designed to be beginner-friendly, but it should not be impossible to figure out, and it's not hard to find examples of working code.
In Python, sorted() returns a copy of a list in sorted order. There is also sort() which sorts a list in-place. Both functions accept an optional keyword parameter to specify a custom sort order. To sort on the 4th and 17th fields, use
sorted(lines, key=lambda x: (x[3], x[16]))
(Python's indexing is zero-based, so [3] is the fourth element.)
To use | as a delimiter, specify delimiter='|' in the csv.reader() and csv.writer() calls. Unfortunately, Python doesn't easily let you use a multi-character delimiter, so you might have to preprocess the data to switch to a single-character delimiter which does not occur in the data, or properly quote the fields which contain the character you selected as your delimiter.
I have a huge table I want to extract information from. Firstly, I want to extract a certain line based on a pattern; I've done that successfully with grep. However, this line has loads of columns, and I'm interested only in a couple of them that have a certain pattern in them (a partial match at the beginning of the string). Is it possible to extract only those columns, together with their column numbers (the nth column), for such partial matches? I hope I was clear enough.
Languages: Preferably in bash but I can also work in R, alternatively I'm open to suggestions if you think another language can be more helpful.
Thanks!
Awk is perfect for stuff like this. To help you write a script I think we need more details. But I'm guessing you'll want to use awk's print feature. To print out the nth column of a file "your_file" do:
awk '{print $n}' your_file
(where n is replaced by an actual column number, e.g. $3 for the third column, or passed in with -v as in the loop below).
In solving your problem you may also want to loop over all N columns which you can do via:
for i in {1..N}; do
    awk -v col="${i}" '{print $col}' your_file
done
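Going a bit further toward the partial-match part of your question: assuming the partial match means a column starts with some prefix (your_pattern and chr below are just placeholders), a single awk pass over the grep-selected line can print both the column number and the column value for every match:
grep 'your_pattern' your_file | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^chr/) print i, $i }'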
I want to mask only the 2nd column of the data.
Input:
First_name,second_name,phone_number
ram,prakash,96174535
hari,pallavi,98888234
anurag,aakash,82783784
Output Expected:
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
The sed program will do this just fine:
sed '2,$s/,[^,]*,/,*****,/'
The 2,$ only operates on lines 2 through to the end of the file (to leave the header line alone) and the substitute command s/,[^,]*,/,*****,/ will replace anything between the first and second comma with the mask *****.
Note that I've specifically used a fixed number of asterisks in the replacement string. Whether you're hiding passwords or anonymising data (as seems to be the case here), you don't want to leak any information, including the size of the names being replaced.
If you really want to use the same number of characters as in the original data, and you also want to cater for the possibility of replacing multiple fields, you can use something like:
awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
This will also leave the first line untouched but will anonymise columns two and four (albeit with the information leakage previously mentioned):
echo 'First_name,second_name,phone_number,other
ram,prakash,96174535,abc
hari,pallavi,98888234,def
anurag,aakash,82783784,g
bob,santamaria,124,xyzzy' | awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
First_name,second_name,phone_number,other
ram,*******,96174535,***
hari,*******,98888234,***
anurag,******,82783784,*
bob,**********,124,*****
Doing multiple columns with full anonymising would entail using $2="*****" rather than the gsub (for both columns of course).
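In other words, something along these lines (a sketch of that variant, keeping the same structure as above; file.csv is a placeholder):
awk -F, 'BEGIN{OFS=FS} NR==1{print; next} {$2="*****"; $4="*****"; print}' file.csv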
Another in awk. Using gsub to replace every char in $2 with an *:
$ awk 'BEGIN{FS=OFS=","}NR>1{gsub(/./,"*",$2)}1' file
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
You could also try the following and see whether it helps:
awk -F"," 'NR>1{$2="*******"} 1' OFS=, Input_file
grep -F -f file1 file2
file1 is 90 Mb (2.5 million lines, one word per line)
file2 is 45 Gb
That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.
It seems grep can't handle that many queries from the -f option. However, the following command does produce the desired result:
head file1 > file3
grep -F -f file3 file2
I have doubts about whether sed or awk would be appropriate alternatives either, given the file sizes.
I am at a loss for alternatives... please help. Is it worth it to learn some SQL commands? Is it easy? Can anyone point me in the right direction?
Try using LC_ALL=C. It switches the pattern matching from UTF-8 to plain ASCII byte comparison, which in my case sped up the search by a factor of about 140: a 26 GB file that would have taken around 12 hours came down to a couple of minutes.
Source: Grepping a huge file (80GB) any way to speed it up?
So what I do is:
LC_ALL=C fgrep "pattern" <input >output
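Applied to the setup in the question, that would presumably be something along these lines (matches is just an example output name):
LC_ALL=C grep -F -f file1 file2 > matches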
I don't think there is an easy solution.
Imagine you write your own program to do this: you will end up with a nested loop, where the outer loop iterates over the lines of file2 and the inner loop iterates over file1 (or vice versa). The number of iterations grows with size(file1) * size(file2), which is a very large number when both files are large. Making one file smaller using head apparently resolves this issue, at the cost of no longer giving the correct result.
A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and for each word you can determine whether or not it is in the pattern file without having to fully traverse the pattern file, then you are much better off. This assumes that you do a word-by-word comparison. If the pattern file contains not only full words, but also substrings, then this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
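As a rough illustration of that idea for the exact, whole-word case: if every line of both files is a single word, sorting both files once and letting comm do a linear merge avoids the nested loop entirely (file names as in the question; sort spills to temporary files, so it copes with inputs larger than memory):
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -12 file1.sorted file2.sorted > matches
comm -12 prints only the lines common to both sorted files, i.e. the exact matches.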
Learning SQL is certainly a good idea, because learning something is always good. It will, however, not solve your problem, because SQL will suffer from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
Your best bet is probably taking a step back and rethinking your problem.
You can try ack. It is said to be faster than grep.
You can try parallel:
parallel --progress -a file1 'grep -F {} file2'
Parallel has many other useful switches to make computations faster.
Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f bug that makes it so unbearably slow.
Are both file1 and file2 composed of one word per line? That means you're looking for exact matches, which we can do really quickly with awk:
awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2
NR (number of records, the line number) is only equal to the FNR (file-specific number of records) for the first file, where we populate the hash and then move onto the next line. The second clause checks the other file(s) for whether the line matches one saved in our hash and then prints the matching lines.
Otherwise, you'll need to iterate:
awk 'NR == FNR { query[$0]=1; next }
{ for (q in query) if (index($0, q)) { print; next } }' file1 file2
Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0). This is much slower, but unfortunately necessary (though we're at least matching plain strings without using regexes, so it could be slower). The loop stops when we have a match.
If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q instead of the faster index($0, q). Note that this uses POSIX extended regular expressions, roughly the same as grep -E or egrep but without bounded quantifiers ({1,7}) or the GNU extensions for word boundaries (\b) and shorthand character classes (\s,\w, etc).
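That regex variant would look like this; only the comparison changes:
awk 'NR == FNR { query[$0] = 1; next }
     { for (q in query) if ($0 ~ q) { print; next } }' file1 file2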
These should work as long as the hash doesn't exceed what awk can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.
I have two files ...
file1:
002009092312291100098420090922111
010555101070002956200453T+00001190.81+00001295.920010.87P
010555101070002956200449J+00003128.85+00003693.90+00003128
010555101070002956200176H+00000281.14+00000300.32+00000281
file2:
002009092410521000098420090709111
010560458520002547500432M+00001822.88+00001592.96+00001822
010560458520002547500432D+00000106.68+00000114.77+00000106
In both files, in every record starting with 01, the string from the 3rd character to the 25th character (i.e., up to the alphabetic character) is the key.
Based on this key I have to compare the two files: if a record in file2 matches one in file1, it should replace that record in file1; otherwise it should be appended.
Well, this is a fairly unspecific (and basic) programming question. We'll be better able to help you if you explain exactly what you did and where you got stuck.
Also, it looks a bit like homework, and people are wary of giving too much help on homework problems, as it might look like cheating.
To get you started:
I'd recommend Perl to solve this, but awk or another scripting language will also do. I'd recommend against sh/bash, as they are weak on text manipulation; also combining grep et al will become rather cumbersome.
First write a Perl program that filters records starting with 01. Then extract the key and put it into a hash (a Perl structure). Then output a new, combined file as required.
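If Perl isn't a hard requirement, here is a rough awk sketch of that same hash idea, under my reading of the requirements (key = characters 3 to 25, i.e. substr($0, 3, 23); merged_file is a made-up output name; treat this as a starting point rather than a drop-in solution):
awk 'NR == FNR { if (/^01/) repl[substr($0, 3, 23)] = $0; next }   # pass 1: remember file2 records by key
     /^01/ && (substr($0, 3, 23) in repl) {                        # pass 2: key also present in file2?
         k = substr($0, 3, 23)
         print repl[k]                                             # replace the file1 record
         delete repl[k]                                            # so it is not appended again
         next
     }
     { print }                                                     # all other file1 lines pass through
     END { for (k in repl) print repl[k] }                         # append file2 records with no match
' file2 file1 > merged_file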
Using awk, get the key (characters 3 to 25) from every record that starts with 01, with something like
awk '/^01/ { print }' file_name | cut -c 3-25
Do this for both files, collect the resulting lines in two buffers, and compare the buffers with a for line in ... loop in a shell script. Whenever a line in the second buffer matches one in the first, grep for it in the first file and replace that record in the first file with the record from the second. You will still need to work out the details of the logic.