Search for lines matching pattern in specific column in UNIX

I am using HDFS to get data that matches a pattern in a specific column and want to output the entire line (expecting roughly 2 million of the 7 million lines in the output).
Here is my exact situation:
I would like the entire line of a file where the data in the 4th column starts with a "5".
For example, my data set:
HK|20151010|65|5005
KR|20151009|38|5092
MD|20150925|98|1943
BG|20150826|82|4892
HK|20151017|14|5002
I want the command to yield the following results:
HK|20151010|65|5005
KR|20151009|38|5092
HK|20151017|14|5002
Thank you so much! (Note: I cannot search the entire line because there are matches in other columns where the column data will begin with a 5)

How about:
awk -F'|' '$4~/^5/' file
If the 4th column is always the last column, this should work too:
grep '|5[^|]*$' file
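As a quick check, running the awk one-liner over the sample rows from the question (assuming they are saved as data.txt) reproduces the expected output:
$ awk -F'|' '$4~/^5/' data.txt
HK|20151010|65|5005
KR|20151009|38|5092
HK|20151017|14|5002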

grep can be used to do this with some [^x]+x magic. Here is the regular expression in both basic (BRE) and extended (ERE) forms:
grep '^\([^|]\+|\)\{3\}5'
egrep '^([^|]+\|){3}5'

Related

Can sort command be used to sort file based on multiple columns in a csv file

We have a requirement where we have a csv file with the custom delimiter '||' (double pipes). The file has 40 columns and is approximately 400 to 500 MB in size.
We need to sort the file based on 2 columns, first on column 4 and then by column 17.
We found a command with which we can sort on one column, but we could not find one that sorts on both columns.
Since our delimiter is 2 characters, we are using the awk command for sorting.
Command:
awk -F \|\| '{print $4}' abc.csv | sort > output.csv
Please advise.
If your inputs are not too fancy (no newlines in the middle of a record, for instance), the sort utility can almost do what you want, but it supports only one-character field separators. So || would not work. But wait, if you do not have other | characters in your files, we could just consider | as the field separator and account for the extra empty fields:
sort -t'|' -k7,7 -k33,33 foo.csv
We sort by fields 7 (instead of 4) and then 33 (instead of 17) because of these extra empty fields. The formula that gives the new field number is simply 2*N-1 where N is the original field number.
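To see why the field numbers double, here is what a ||-delimited record looks like when | alone is treated as the separator (a quick check with awk and made-up values):
$ echo 'a||b||c||d' | awk -F'|' '{print NF; print $7}'
7
d
The empty strings between each pair of pipes count as fields, so original field 4 ('d') shows up as field 7 = 2*4-1.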
If you do have | characters inside your fields, a simple solution is to replace them all with one unused character, sort, and restore the original ||. Example with tabs:
sed 's/||/\t/g' foo.csv | sort -t$'\t' -k4,4 -k17,17 | sed 's/\t/||/g'
If tab is also used in your fields, choose any unused character instead. Form feed (\f) or the file separator control character (ASCII code 28, that is, replace the 3 \t with \x1c) are good candidates.
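Spelled out with the ASCII 28 variant (assuming GNU sed, which understands \xHH escapes, and bash's $'...' quoting), that would look something like:
sed 's/||/\x1c/g' foo.csv | sort -t$'\x1c' -k4,4 -k17,17 | sed 's/\x1c/||/g'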
Using PROCINFO in GNU awk, you can use this solution to sort on a multi-character delimiter:
awk -F '\\|\\|' '{a[$4,$17] = $0} END {
PROCINFO["sorted_in"]="#ind_str_asc"; for (i in a) print a[i]}' file.csv
You could try the following awk code, written as per your shown attempt. It sets OFS to | (this puts | as the output field separator; if you want a comma etc. instead, change the OFS value accordingly in the program) and also prints the 17th field as per your requirement. In sort, the 1st and 2nd fields are used as keys, because the 4th and 17th fields of the input have become the 1st and 2nd fields respectively for sort.
awk -F'\\|\\|' -v OFS='|' '{print $4,$17}' abc.csv | sort -t'|' -k1,1 -k2,2 > output.csv
The sort command works on physical lines, which may or may not be acceptable. CSV files can contain quoted fields which contain newlines, which will throw off sort (and most other Unix line-oriented utilities; it's hard to write a correct Awk script for this scenario, too).
If you need to be able to manipulate arbitrary CSV files, you should probably look to a dedicated utility, or use a scripting language with proper CSV support. For example, assume you have a file like this:
Title,Number,Arbitrary text
"He said, ""Hello""",2,"There can be
newlines and
stuff"
No problem,1,Simple undramatic single-line CSV
In case it's not obvious, CSV is fundamentally just a text file, with some restrictions on how it can be formatted. To be valid CSV, the fields of every record are comma-separated; any literal commas or newlines in the data need to be quoted, and any literal quotes need to be doubled. There are many variations; different tools accept slightly different dialects. One common variation is TSV, which uses tabs instead of commas as delimiters.
Here is a simple Python script which sorts the above file on the second field.
import csv
import sys

with open("test.csv", "r") as csvfile:
    csvdata = csv.reader(csvfile)
    lines = [line for line in csvdata]
    titles = lines.pop(0)  # comment out if you don't have a header
    writer = csv.writer(sys.stdout)
    writer.writerow(titles)  # comment out if you don't have a header
    writer.writerows(sorted(lines, key=lambda x: x[1]))
Using sys.stdout for output is slightly unconventional; obviously, adapt to suit your needs. The Python csv library documentation is admittedly not designed primarily to be friendly for beginners, but it should not be impossible to figure out, and it's not hard to find examples of working code.
In Python, sorted() returns a copy of a list in sorted order. There is also sort() which sorts a list in-place. Both functions accept an optional keyword parameter to specify a custom sort order. To sort on the 4th and 17th fields, use
sorted(lines, key=lambda x: (x[3], x[16]))
(Python's indexing is zero-based, so [3] is the fourth element.)
To use | as a delimiter, specify delimiter='|' in the csv.reader() and csv.writer() calls. Unfortunately, Python doesn't easily let you use a multi-character delimiter, so you might have to preprocess the data to switch to a single-character delimiter which does not occur in the data, or properly quote the fields which contain the character you selected as your delimiter.

Bash/R searching columns in a huge table

I have a huge table I want to extract information from. Firstly, I want to extract a certain line based on a pattern; I've done that successfully with grep. However, this line has loads of columns and I'm interested only in a couple of them that contain a certain pattern (a partial match at the beginning of the string). Is it possible to extract only those columns, along with their column numbers (the nth column), for such partial matches? Hope I was clear enough.
Languages: Preferably bash, but I can also work in R; alternatively, I'm open to suggestions if you think another language would be more helpful.
Thanks!
Awk is perfect for stuff like this. To help you write a script I think we need more details, but I'm guessing you'll want to use the print feature of awk. To print out the nth column of a file "your_file", pass the column number in as an awk variable (here n=3 for the third column):
awk -v n=3 '{print $n}' your_file
In solving your problem you may also want to loop over all N columns, which you can do via (replace N with the actual number of columns; brace expansion does not expand a variable):
for i in {1..N} ;
do
awk -v col=${i} '{print $col}' your_file ;
done
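If the goal is to pull out both the matching columns and their positions from a line, a small awk field loop can do it in one pass. Here is a sketch, assuming whitespace-separated columns and a hypothetical prefix "foo" (adjust the regex to your actual pattern); you can pipe the single grep-matched line into it:
awk '{for (i = 1; i <= NF; i++) if ($i ~ /^foo/) print i, $i}' your_file
This prints the column number followed by the column value for every field whose content starts with "foo".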

Unix redact data

I want to mask only the 2nd column of the data.
Input:
First_name,second_name,phone_number
ram,prakash,96174535
hari,pallavi,98888234
anurag,aakash,82783784
Output Expected:
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
The sed program will do this just fine:
sed '2,$s/,[^,]*,/,*****,/'
The 2,$ only operates on lines 2 through to the end of the file (to leave the header line alone) and the substitute command s/,[^,]*,/,*****,/ will replace anything between the first and second comma with the mask *****.
Note that I've specifically used a fixed number of asterisks in the replacement string. Whether you're hiding passwords or anonymising data (as seems to be the case here), you don't want to leak any information, including the size of the names being replaced.
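Applied to the sample input above (assuming it is saved as names.csv), that gives:
$ sed '2,$s/,[^,]*,/,*****,/' names.csv
First_name,second_name,phone_number
ram,*****,96174535
hari,*****,98888234
anurag,*****,82783784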
If you really want to use the same number of characters as in the original data, and you also want to cater for the possibility of replacing multiple fields, you can use something like:
awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
This will also leave the first line untouched but will anonymise columns two and four (albeit with the information leakage previously mentioned):
echo 'First_name,second_name,phone_number,other
ram,prakash,96174535,abc
hari,pallavi,98888234,def
anurag,aakash,82783784,g
bob,santamaria,124,xyzzy' | awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
First_name,second_name,phone_number,other
ram,*******,96174535,***
hari,*******,98888234,***
anurag,******,82783784,*
bob,**********,124,*****
Doing multiple columns with full anonymising would entail using $2="*****" rather than the gsub (for both columns of course).
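Spelled out, that fixed-mask variant for columns two and four would look something like this (same structure as the gsub version above, just with plain assignments):
awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{$2="*****";$4="*****";print}' file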
Another in awk. Using gsub to replace every char in $2 with an *:
$ awk 'BEGIN{FS=OFS=","}NR>1{gsub(/./,"*",$2)}1' file
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
Try the following too, and let me know if it helps you.
awk -F"," 'NR>1{$2="*******"} 1' OFS=, Input_file

awk find patterns (stored in a file) in a file

I am learning awk and I am having a hard time trying to do this:
I have a file, let's name it pattern_file.txt, which contains multiple patterns, one per line. For example, it looks like this:
pattern_file.txt
PATTERN1
PATTERN2
PATTERN3
PATTERN4
I have a second file, containing some text. Let's name it text_file.txt. It looks like this:
text_file.txt
xxxxxxxxPATTERN1xxxxxxx
yyyyyPATTERN2yyyy
zzzzzzzzzPATTERN3zzzzzz
What I am trying to do is: if one of the patterns in pattern_file.txt is present in the current line read from text_file.txt, print the line.
I know how to print a line with awk; what gives me a hard time is using the patterns stored in pattern_file.txt and checking whether one of them is present.
In awk, using index():
awk 'NR==FNR{a[$1];next}{for(i in a)if(index($0,i)) print}' pattern text
xxxxxxxxPATTERN1xxxxxxx
yyyyyPATTERN2yyyy
zzzzzzzzzPATTERN3zzzzzz
Store "patterns" to a hash and for each record use index to try to find the "patterns" from the record.
A variation of this helpful answer from James Brown, using match(), which does a regex match and also returns the starting index of the matching string,
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print}' pattern_file.txt text_file.txt
which returns me the lines needed.
Printing the start position (RSTART) set by the match() function:
awk 'FNR==NR{a[$0]; next}{for (i in a) if (match($0,i)) print RSTART}' pattern_file.txt text_file.txt
gives output like:
9 # Meaning 'PATTERN1' match started at index 9 in 'xxxxxxxxPATTERN1xxxxxxx'
6
10

Search for multiple patterns in a file not necessarily on the same line

I need a unix command that would search for multiple patterns (basically an AND); however, those patterns need not be on the same line (otherwise I could use a grep AND command). For example, suppose I have a file like the following:
This is first line.
This is second line.
This is last line.
If I search for the words 'first' and 'last', the above file should be included in the result.
Try this question; it seems to be the same as yours, with plenty of solutions: How to find patterns across multiple lines using grep?
I think instead of AND you actually mean OR:
grep 'first\|last' file.txt
Results:
This is first line.
This is last line.
If you have a large number of patterns, add them to a file; for example if patterns.txt contains:
first
last
Run:
grep -f patterns.txt file.txt
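If you really do need the AND behaviour from the question (every pattern must appear somewhere in the file, though not necessarily on one line), a simple sketch is to chain grep -q checks, one per pattern (using the file and words from the example):
grep -q 'first' file.txt && grep -q 'last' file.txt && echo "file.txt contains both"
This scans the file once per pattern, which is fine for a handful of patterns.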
