Remove duplicates from a large file - unix

I have a ~20GB CSV file.
Sample file:
1,a@a.com,M
2,b@b.com,M
1,c@c.com,F
3,d@d.com,F
The primary key in this file is the first column.
I need to write two files, uniq.csv and duplicates.csv.
uniq.csv should contain all non-duplicate records, and duplicates.csv should contain all duplicate records prefixed with the current timestamp.
uniq.csv
1,a@a.com,M
2,b@b.com,M
3,d@d.com,F
duplicates.csv
2012-06-29 01:53:31 PM, 1,c@c.com,F
I am using Unix sort so that I can take advantage of its external R-way merge sort algorithm.
To identify unique records:
tail -n+2 data.txt | sort -t, -k1 -un > uniq.csv
To identify duplicate records:
awk 'x[$1]++' FS="," data.txt | awk '{print d,$1}' "d=$(date +'%F %r')," > duplicates.csv
I was wondering if there is any way to find both the duplicates and the unique records in a single scan of this large file?

Your awk script is nearly there. To find the unique lines, you merely need to use the in operator to test whether the entry is in the associative array or not. This lets you collect the data in one pass through the data file and avoid having to call sort.
tail -n +2 data.txt | \
awk '
BEGIN { OFS = FS = "," }
{
    # First occurrence of this key: write the record to file
    # descriptor 3, which the shell redirects to uniq.csv below.
    if (!($1 in x)) {
        print $0 > "/dev/fd/3"
    }
    x[$1]++
}
END {
    # Keys seen more than once are duplicates; print each with the
    # timestamp passed in via d, plus its occurrence count.
    for (t in x) {
        if (x[t] > 1) {
            print d, t, x[t]
        }
    }
}' d="$(date +'%F %r')" 3> uniq.csv > duplicates.csv

I got this question in an interview, a couple of jobs ago.
One answer is to use uniq with the "-c" (count) option. An entry with a count of "1" is unique, and otherwise not unique. Sketched out:
sort foo | uniq -c | awk '{ n = $1; sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); if (n == 1) print > "uniq.csv"; else print > "duplicates.csv" }'
If you want to write a special-purpose program and/or avoid the delay caused by sorting, I would use Python.
Read the input file, hashing each entry and incrementing an integer count for each key that you encounter. Remember that hash values can collide even when two items are not equal, so keep each key individually along with its count.
At EOF on the input, traverse the hash structure and write each entry into one of the two files.
You seem not to need sorted output, only categorized output, so the hashing should be faster: building the hash table is O(1) per entry, O(N) overall, while a comparison sort such as Unix sort is O(N log N).

Here is Perl code that will do the processing in one scan:
#!/usr/bin/perl
# Let sort group the records by key; duplicates then appear on adjacent lines.
open(FI, "sort -t, -k1 < file.txt |");
open(FD, ">duplicates.txt");
open(FU, ">uniques.txt");
my @prev;
while (<FI>) {
    my (@cur) = split(',');
    # Same key as the previous line: this record is a duplicate.
    if ($prev[0] && $prev[0] == $cur[0]) {
        print FD localtime() . " $_";
    }
    else {
        print FU $_;
    }
    @prev = @cur;
}


How can one filter a JSON object to select only specific key/values using jq?

I'm trying to validate all versions in a versions.json file and get, as the output, a JSON object containing only the invalid versions.
Here's a sample file:
{
  "slamx": "16.4.0 ",
  "sdbe": null,
  "mimir": null,
  "thoth": null,
  "quasar": null,
  "connectors": {
    "s3": "16.0.17",
    "azure": "6.0.17",
    "url": "8.0.2",
    "mongo": "7.0.15"
  }
}
I can use the following jq script line to do what I want:
delpaths([paths(type == "string" and contains(" ") or type == "object" | not)])
| delpaths([paths(type == "object" and (to_entries | length == 0))])
And use it on a shell like this:
BAD_VERSIONS=$(jq 'delpaths([paths(type == "string" and contains(" ") or type == "object" | not)]) | delpaths([paths(type == "object" and (to_entries | length == 0))])' versions.json)
if [[ $BAD_VERSIONS != "{}" ]]; then
echo >&2 $'Bad versions detected in versions.json:\n'"$BAD_VERSIONS"
exit 1
fi
and get this as the output:
Bad versions detected in versions.json:
{
  "slamx": "16.4.0 "
}
However, that's a very convoluted way of doing the filtering. Instead of just walking the paths tree and saying "keep this, keep that", I need to create a list of things I do not want and remove them, twice.
Given all the path-handling builtins and recursive processing, I can't help but feel that there has to be a better way of doing this, something akin to select but working recursively across the object. The best I could do was this:
. as $input |
[path(recurse(.[]?)|select(strings|contains("16")))] as $paths |
reduce $paths[] as $x ({}; . | setpath($x; ($input | getpath($x))))
I don't like that for two reasons. First, I'm creating a new object instead of "editing" the old one. Second and foremost, it's full of variables, which points to a severe flow inversion issue, and adds to the complexity.
Any ideas?
Thanks to @jhnc's comment, I found a solution. The trick was using streams, which makes nesting irrelevant -- I can apply filters based solely on the value, and the objects will be recomposed given the key paths.
The first thing I tried did not work, however. This:
jq -c 'tostream|select(.[-1] | type=="string" and contains(" "))' versions.json
returns [["slamx"],"16.4.0 "], which is what I'm searching for. However, I could not fold it back into an object. For that to happen, the stream has to have the "close object" markers -- arrays with just one element, corresponding to the last key of the object being closed. So I changed it to this:
jq -c 'tostream|select((.[-1] | type=="string" and contains(" ")) or length==1)' versions.json
Breaking it down, .[-1] selects the last element of the array, which will be the value. Next, type=="string" and contains(" ") will select all values which are strings and contain spaces. The last part of the select, length==1, keeps all the "end" markers. Interestingly, it works even if the end marker does not correspond to the last key, so this might be brittle.
With that done, I can de-stream it:
jq -c 'fromstream(tostream|select((.[-1] | type=="string" and contains(" ")) or length==1))' versions.json
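For the sample file, this should print the re-folded object:
{"slamx":"16.4.0 "}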
Pretty-printed, the jq expression is as follows:
fromstream(
  tostream |
  select(
    (
      .[-1] |
      type=="string" and contains(" ")
    ) or
    length==1
  )
)
For objects, the test to_entries|length == 0 can be abbreviated to length==0.
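A quick way to convince yourself of that equivalence on an empty object:
$ jq -n '({} | to_entries | length) == ({} | length)'
true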
If I understand the goal correctly, you could just use .., perhaps along the following lines:
..
| objects
| with_entries(
    select((.value|type == "string" and contains(" ")) or
           (.value|type == "object" and length==0)))
| select(length>0)
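For instance, with that filter saved in a file (call it check.jq; the name is just an assumption here), running it against the sample versions.json above should produce:
$ jq -f check.jq versions.json
{
  "slamx": "16.4.0 "
}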
paths
If you want the paths, then consider:
([], paths) as $p
| getpath($p)
| objects
| with_entries(
    select((.value|type == "string" and contains(" ")) or
           (.value|type == "object" and length==0)))
| select(length>0) as $x
| {} | setpath($p; $x)
With your input modified so that s3 has a trailing blank, the above produces:
{"slamx":"16.4.0 "}
{"connectors":{"s3":"16.0.17 "}}

Match function in Unix to find if a string ends with a particular input value?

I need a regular expression that I can use in the match() function to see if the value given as a command-line argument exists at the end of a given string.
I am using awk and trying to use the match function to get the above result:
while read line; do
cat $line | awk -v value="$2.$" '{ if ( match("$1,value) != 0 ) print match("arvind","ind.$") " " "arvind" }'
done < xreffilelist.txt
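For what it's worth, a minimal sketch of one way to do this, assuming the suffix comes in as the second command-line argument and each line of xreffilelist.txt names a file to scan (both assumptions based on the snippet above): build the regex dynamically and anchor it with $.
while read -r line; do
  awk -v value="$2" 'match($0, value "$") { print FILENAME ": " $0 }' "$line"
done < xreffilelist.txt
Since value is used as a regular expression, any regex metacharacters in it would need escaping first.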

How do I fetch this substring using awk?

I have a string, let's say:
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
Now I want to fetch only CUSTOM_executable from the above string. This is what I have tried so far in Unix:
echo $k | awk -F '_' '{print $2}'
Can you explain how I can do this?
Try this:
$ echo "$k"
CHECK_111_CUSTOM_executable.acs
code:
echo "$k" | awk 'BEGIN{FS=OFS="_"}{sub(/\.acs$/, ""); print $3, $4}'
Assume the variable ${SOMETHING} has the value SOMETHING just for simplicity.
The following assignment, therefore,
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
sets the value of k to CHECK_SOMETHING_CUSTOM_executable.acs.
When the string is split into fields on _ by awk -F '_' (note the single quotes aren't necessary here), you get the following fields:
$ echo "$k" | awk -F _ '{for (i=0; i<=NF; i++) {print i"="$i}}'
0=CHECK_SOMETHING_CUSTOM_executable.acs
1=CHECK
2=SOMETHING
3=CUSTOM
4=executable.acs
So to get the output you want, simply use:
echo "$k" | awk -F _ -v OFS=_ '{print $3,$4}'
Suppose the SOMETHING variable contains 111_222_333 (or 111_222_333_444).
Use this:
$ k=CHECK_${SOMETHING}_CUSTOM_executable.acs
$ echo "$k" | awk 'BEGIN{FS=OFS="_"}{ sub(/\.acs$/, ""); print $(NF-1), $NF }'
(Or)
echo "$k" | awk -F_ '{ sub(/\.acs$/, ""); print $(NF-1), $NF }' OFS=_
Explanation:
NF is the number of fields in the current input record, so $(NF-1) and $NF are the last two fields, and the sub() strips the .acs extension before they are printed.
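For instance, to see the field numbering these expressions rely on:
$ echo "CHECK_111_CUSTOM_executable.acs" | awk -F_ '{print NF, $(NF-1), $NF}'
4 CUSTOM executable.acs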
Try this simple awk:
awk -F[._] '{print $3"_"$4}' <<<"$k"
CUSTOM_executable
The -F[._] defines both dot and underscore as field separators. Then awk prints fields 3 and 4 of $k.
If k contains k='CHECK_${111_111}_CUSTOM_executable.acs', then use field numbers $4 and $5:
awk -F[._] '{print $4"_"$5}' <<<"$k"
CUSTOM_executable
The string CHECK_${111_111}_CUSTOM_executable.acs splits as: $1=CHECK, $2=${111, $3=111}, $4=CUSTOM, $5=executable, $6=acs.
You do not need to use awk; it can be done in bash easily. I assume that $SOMETHING does not contain _ characters (and that the CUSTOM and executable parts are just some text that also does not contain _). Then:
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
l=${k#*_}; l=${l#*_}; l=${l%.*};
This cuts anything from the beginning to the 2nd _ character, and chomps off anything after the last . character. The result is put into the shell variable l.
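For example (FOO is just a placeholder value):
$ SOMETHING=FOO
$ k=CHECK_${SOMETHING}_CUSTOM_executable.acs
$ l=${k#*_}; l=${l#*_}; l=${l%.*}
$ echo "$l"
CUSTOM_executable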
If $SOMETHING may contain _, then a little more work has to be done (I assume the CUSTOM and executable parts do not contain _):
k=CHECK_${SOMETHING}_CUSTOM_executable.acs
l=${k%_*}; l=${l%_*}; l=${k#${l}_*}; l=${l%.*};
This chomps off everything after the last-but-one _ character, then cuts that prefix off from the original string. The last statement chomps the extension off. The result is in the shell variable l.
Or it can be done using regex:
[[ $k =~ ([^_]+_[^_]+)\.[^.]+$ ]] && l=${BASH_REMATCH[1]}
This matches any string ending with two _-free words separated by _ and finished with .<extension>. The extension part is chomped off and the result is in the shell variable l.
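For example:
$ k=CHECK_111_222_CUSTOM_executable.acs
$ [[ $k =~ ([^_]+_[^_]+)\.[^.]+$ ]] && echo "${BASH_REMATCH[1]}"
CUSTOM_executable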
I hope this helps!

AWK: Comparing difference of files with different delimiters

1.txt
1|2|3
4|5|6
7|3|6
2.txt (double pipe)
1||2||3
4||5||6
expected output:
7|3|6
I want to compare 1.txt and 2.txt and print the difference. Note that the number of columns can vary each time.
awk -F"|" 'NR==FNR{a[$0]++;next} !(a[$0])' 2.txt 1.txt
How can I modify the code to handle the different delimiters in the two files?
The code below works for the first field alone, but I am not sure how it separates the fields on the double pipe:
awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' 2.txt 1.txt
One simple workaround would be to squeeze the double delimiters in the second file before feeding to awk:
awk -F"|" 'NR==FNR{a[$0]++;next} !(a[$0])' <(tr -s '|' < 2.txt) 1.txt
For your sample input, it'd produce:
7|3|6
EDIT: You assert that
awk -F"|" 'NR==FNR{a[$1]++;next} !(a[$1])' 2.txt 1.txt
works. It doesn't do what you expect. It compares only the first field and not the entire line.
You can use this awk:
awk -F"|" 'NR==FNR{gsub(/\|\|/,"|",$0);a[$0]++;next} !(a[$0])' 2.txt 1.txt
I typically use bash features to accomplish this:
diff 1.txt <(sed 's/||/|/g' < 2.txt)
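For the sample files, this should report the line unique to 1.txt in diff's normal format:
3d2
< 7|3|6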
You can use a regexp as the field separator in gawk. If you don't mind your output being unsorted (awk arrays are unordered), you can do it with a single command:
gawk 'BEGIN {FS="\\|\\|*"} {gsub(FS,"|") ; a[$0]++} END {for (k in a) {if ( a[k] == 1 ) { print k } } }' 1.txt 2.txt
BEGIN {FS="\\|\\|*"} ==> The field separator is one or more | characters.
{gsub(FS,"|") ; a[$0]++} ==> On every line, normalize the runs of separators to a single | and count the normalized line in an array.
END {for (k in a) {if ( a[k] == 1 ) { print k } } } ==> Finally, print every array element that was found exactly once. (Strictly, this is the symmetric difference: a line appearing only in 2.txt would be printed too, though none does in the sample.)

How to Use Chaining Translation in Unix

I want to translate the word "abcd" into upper case "ABCD" using the tr command, then translate "ABCD" to digits, e.g. 1234.
I want to chain the two translations together (lower case to upper case, then upper case to 1234) using pipes, and also pipe the final output into more.
I'm not able to chain the second part.
echo "abcd" | tr '[:lower:]' '[:upper:]' > file1
Here I'm not sure how to add the second translation in the same command.
You can't do it in a single tr command; you can do it in a single pipeline:
echo "abcd" | tr '[:lower:]' '[:upper:]' | tr 'ABCD' '1234'
Note that your [:lower:] and [:upper:] notation will translate more than abcd to ABCD. If you want to extend the mapping of digits so A-I map to 1-9, that's doable; what maps to 0?
If you want to do it in a single command, then you could write:
echo "abcdABCD" | tr 'abcdABCD' '12341234'
Or, abbreviated slightly:
$ echo 'abecedenarian-DIABOLICALISM' | tr 'a-dA-D' '1-41-4'
12e3e4en1ri1n-4I12OLI31LISM
$
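To finish the pipeline from the question, the final output can be piped straight into more:
echo "abcd" | tr '[:lower:]' '[:upper:]' | tr 'ABCD' '1234' | more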
