Deleting duplicate lines based on a criterion in a file in Unix

I have a data file extracted from a table. If an id has multiple entries, I have to retain only the line with the latest timestamp for that id.
The other, older entry lines need to be deleted; the file needs to be cleaned up.
Id | timestamp | status
3 | 17-Feb-20 12:30:00:00 PM | E
1 | 16-feb-20 09:30:00:00 Am | L
3 | 17-Feb-20 15:30:00:00 PM | N
2 | 17-Feb-20 10:12:00:00 Am | L
I need to retain ids 1 and 2 as they each have only one entry. Id 3 has two rows, but I should retain only the one with the 15:30 timestamp, since that is the latest.
I found references to using the 'sed' command to delete a particular line number or a particular line. The thing is, I will be reading the file line by line.
Say on looping the first line is id 3; I would grep the same file for a matching id. For id 3 I get another match, so I compare the timestamps. Since the line I am currently on has the older timestamp, I need to delete that line and retain the latest one.
Is it possible to do this operation on the same file I am reading through, or should I obviously go with another file?
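For what it's worth, here is a minimal awk sketch of the "keep the latest row per id" idea. It writes to a new file and then moves it over the original, rather than editing in place; datafile is a placeholder name, and the to_epoch() helper is a hypothetical stub, since the sample timestamp format (17-Feb-20 15:30:00:00 PM) needs site-specific parsing:
awk -F'|' '
    # to_epoch() is a stub: turn "17-Feb-20 12:30:00:00 PM" into something
    # comparable (e.g. an epoch number); fill in parsing for your real format.
    function to_epoch(ts) { return ts }
    NR == 1 { print; next }                 # keep the header line, if present
    {
        id = $1; t = to_epoch($2)
        if (!(id in best) || t > best[id]) { best[id] = t; line[id] = $0 }
    }
    END { for (id in line) print line[id] } # note: output order is not preserved
' datafile > datafile.clean && mv datafile.clean datafile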

Related

Unix command for the file below

I have a CSV file like below
05032020
Col1|col2|col3|col4|col5
Infosys
Tcs
Wipro
Accenture
Deloitte
I want the record count, skipping the date line and the header line.
O/p: Record count 5, including line numbers
cat FF_Json_to_CSV_MAY03.txt
05032020
requestId|accountBranch|accountNumber|guaranteeGuarantor|accountPriority|accountRelationType|accountType|updatedDate|updatedBy
0000000001|5BW|52206|GG1|02|999|CHECKING|20200503|BTCHLCE
0000000001|55F|80992|GG2|02|1999|IRA|20200503|0QLC
0000000001|55F|24977|CG|01|3999|CERTIFICAT|20200503|SRIKANTH
0000000002|5HJ|03349|PG|01|777|SAVINGS|20200503|BTCHLCE
0000000002|5M8|999158|GG3|01|900|CORPORATE|20200503|BTCHLCE
0000000002|5LL|49345|PG|01|999|CORPORATE|20200503|BTCHLCE
0000000002|5HY|15786|PG|01|999|CORPORATE|20200503|BTCHLCE
0000000003|55F|34956|CG|01|999|CORPORATE|20200503|SRIKANTH
0000000003|5BY|14399|GG10|03|10|MONEY MARK|20200503|BTCHLCE
0000000003|5PE|32100|PG|04|999|JOINT|20200503|BTCHLCE
0000000003|5LB|07888|GG25|02|999|BROKERAGE|20200503|BTCHLCE
0000000004|55F|36334|CG|02|999|JOINT|20200503|BTCHLCE
0000000005|55F|06739|GG9|02|999|SAVINGS|20200503|BTCHLCE
0000000005|5CP|39676|PG|01|999|SAVINGS|20200503|BTCHLCE
0000000006|55V|62452|CG|01|10|CORPORATE|20200503|SRIKANTH
0000000007|55V|H9889|CG|01|999|SAVINGS|20200503|BTCHLCE
0000000007|5L2|03595|PG|02|999|CORPORATE|20200503|BTCHLCE
0000000007|55V|C1909|GG8|01|10|JOINT|20200503|BTCHLCE
I need line numbers starting from the 0000000001 records.
There are two ways to solve your issue:
Count only the records you want to count
Count all records and remove the ones you don't want to count
From your example, it's not possible to know how to do it, but let me give you some ideas:
Imagine that your file starts with 3 header lines, then you can do something like:
wc -l inputfile | awk '{print $1-3}'
Imagine that the lines you want to count all start with a number and a dot, then you can do something like:
grep "^[0-9][0-9]*\." inputfile | wc -l
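For the sample file shown above, assuming it always starts with exactly one date line followed by one header line, an awk sketch that prints each remaining record with its line number and then the total could look like:
awk 'NR > 2 { print NR": "$0; count++ } END { print "Record count: " count }' FF_Json_to_CSV_MAY03.txt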

Merging(Joining) 2 huge flat files in Solaris, using an index column(first field)

I have 2 huge flat files in Unix (Solaris), each about 500-600 GB. I need to join and merge the 2 files into a single flat file using the first column, which is a key index column. How could I do it in an optimized way?
Basically it should be an inner join between the 2 flat files. The reason for using flat files is that a huge table has been split into 2 separate tables, each extracted into a flat file, and I am trying to join them at the Unix level instead of at the database level.
I used the commands below:
sort -n file1 > file_temp1;
sort -n file2 > file_temp2;
join -j 1 -t';' file_temp1 file_temp2 > Final
The sort works fine, as the 1st field is the index column. However, when the join happens, I get hardly 2% of the data in the Final file, so I am trying to understand what mistake I am making in the join command. Both files contain about 0.2 million records, and all of the records match between the 2 files. I want to check whether a join done at the Unix level performs better than one done at the database level. Sorry for the incomplete question! The first field is a numeric index field; is there something like a "-n" switch to tell join that the first field is a numeric index?
You should not use sort -n, since join has no corresponding -n flag; join expects the files to be in plain lexicographic order on the join field. Just keep all the leading/trailing whitespace as it is:
#!/bin/sh
# Sort both files on the first ';'-separated field, then inner-join them on that field.
sort -t';' -k 1 file1 > file1.srt
sort -t';' -k 1 file2 > file2.srt
join -t';' -1 1 -2 1 file1.srt file2.srt > both
# cat both    # inspect the result
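One extra precaution, not part of the answer above: sort and join must agree on the collating order, or join can miss matches or complain that the input is not sorted. Forcing the same locale for both, and limiting the sort key to the first field only, is a common safeguard:
export LC_ALL=C
sort -t';' -k 1,1 file1 > file1.srt
sort -t';' -k 1,1 file2 > file2.srt
join -t';' -1 1 -2 1 file1.srt file2.srt > both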

Sorting a single column of a file without disturbing the others columns

I have a file trial.txt whose fourth column I want to sort, without changing the order of the other columns.
My file contents are like this:
A user 9 12 ab
B user 2 9 cd
C user 5 13 da
I want my output to look like this
A user 9 13 ab
B user 2 12 cd
C user 5 9 da
I have tried this:
sort -k 4 trial.txt
but it does not give the expected output.
use:
sort -k4 -n -s trial.txt
Apart from the -k option, it is advisable to use -n for comparing numbers; the -s option suppresses the last-resort comparison. Check the manual for more info.
Also, your required output is in descending order; in that case use the -r option to reverse the sort.
I am sorry for giving a wrong interpretation of the question. So here is the final answer. It's simple:
First use awk to get the fourth column of the file trial.txt and save it in a temporary file.
Sort this temporary file.
Read every line from trial.txt and the temporary file simultaneously. To do this, read https://unix.stackexchange.com/questions/82541/loop-through-the-lines-of-two-files-in-parallel. Then, for each line read from trial.txt, replace its 4th column using awk with the line read from the temporary file. For that, take help from How to replace the nth column/field in a comma-separated string using sed/awk?. Store this in a new file and delete the temporary file afterwards. Done!
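A rough sketch of those steps, assuming the 5-column sample trial.txt above (the temporary and output file names are arbitrary); paste stands in for the parallel read of the two files, and sort -nr matches the descending order of the expected output:
awk '{print $4}' trial.txt | sort -nr > col4.tmp                 # extract and sort the 4th column
paste trial.txt col4.tmp | awk '{print $1, $2, $3, $6, $5}' > trial.sorted
rm col4.tmp                                                      # clean up the temporary file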

finding least date and unique content in UNIX

I am getting data from the server in a file (in format1) every day; however, the data covers the last one week.
I have to archive the data for exactly 1.5 months, because this data is picked up to make a graphical representation.
I have tried to merge the files of 2 days and sort them uniquely (code1), but it didn't work because the name of the raw file changes every day. The timestamp is unique in this file, but I am not sure how to get unique data based on a specific column. Also, is there any way to delete the data older than 1.5 months?
For deletion, the logic I thought of is: today's date minus the least (earliest) date in the file. But again, I am unable to fetch the least date.
Format1
r01/WAS2/oss_change0_5.log:2016-03-21T11:13:36.354+0000 | (307,868,305) | OSS_CHANGE |
com.nokia.oss.configurator.rac.provisioningservices.util.Log.logAuditSuccessWithResources | RACPRS RNC 6.0 or
newer_Direct_Activation: LOCKING SUCCEEDED audit[ | Source='Server' | User identity='vpaineni' | Operation
identifier='CMNetworkMOWriterLocking' | Success code='T' | Cause code='N/A' | Identifier='SUCCESS' | Target element='PLMN-
PLMN/RNC-199/WBTS-720' | Client address='10.7.80.21' | Source session identifier='' | Target session identifier='' |
Category code='' | Effect code='' | Network Transaction identifier='' | Source user identity='' | Target user identity='' |
Timestamp='1458558816354']
Code1
cat file1 file2 |sort -u > file3
Data on day 2; the input file name differs:
r01/WAS2/oss_change0_11.log:2016-03-21T11:13:36.354+0000 | (307,868,305) | OSS_CHANGE |
com.nokia.oss.configurator.rac.provisioningservices.util.Log.logAuditSuccessWithResources | RACPRS RNC 6.0 or
newer_Direct_Activation: LOCKING SUCCEEDED audit[ | Source='Server' | User identity='vpaineni' | Operation
identifier='CMNetworkMOWriterLocking' | Success code='T' | Cause code='N/A' | Identifier='SUCCESS' | Target element='PLMN-
PLMN/RNC-199/WBTS-720' | Client address='10.7.80.21' | Source session identifier='' | Target session identifier='' |
Category code='' | Effect code='' | Network Transaction identifier='' | Source user identity='' | Target user identity='' |
Timestamp='1458558816354']
I wrote an almost similar kind of code a week back.
Awk is a good tool if you want to do any operation column-wise.
Also, sort -u will not work here: each line begins with the raw file name, which changes every day, so otherwise identical records are never exact duplicates.
Both the unique rows and the least (earliest) date can be found using awk.
1. To get unique file content:
cat file1 file2 | awk -F "\|" '!repeat[$21]++' > file3;
Here -F specifies your field separator.
repeat[] is keyed on the 21st field, which is the timestamp, so only the first occurrence of each timestamp is printed and the rest are ignored.
Finally, the unique content of file1 and file2 is available in file3.
2. To get the least (earliest) date and find the difference between 2 dates:
Least_Date=`awk -F: '{print substr($2,1,10)}' RMCR10.log | sort | head -1`;
Today_Date=`date +%F`;
Diff=`echo "( \`date -d $Today_Date +%s\` - \`date -d $Least_Date +%s\`) / (24*3600)" | bc -l`;
Diff1=${Diff/.*};
if [ "$Diff1" -ge "90" ]
then
    # archive or delete the old data here
fi
Here we have used ':' as the field separator, then substr to get the exact date field, then sorted and took the first line to find the least value.
We subtract it from today's date using bc (the arbitrary-precision calculator) and then strip the decimals.
Hope it helps.
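As an addition for the "older than 1.5 months" part: the date extracted by substr($2,1,10) above is in YYYY-MM-DD form, so it compares correctly as a plain string, and the merged file can be trimmed directly. A sketch assuming GNU date, the Format1 layout above, and 45 days as roughly 1.5 months (file3.trimmed is an arbitrary name):
cutoff=`date -d "45 days ago" +%F`;
awk -F: -v cutoff="$cutoff" 'substr($2,1,10) >= cutoff' file3 > file3.trimmed;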

Selecting a range of records from a file in Unix

I have 4,930,728 records in a text file in Unix. This file is used to ingest images into Oracle WebCenter Content using BatchLoader. <<EOD>> indicates the end of a record, as per the sample below.
I have two questions:
After processing 4,300,846 of 4,930,728 record(s), the batchloader fails for whatever reason. Now I want to create a new file with records from 4,300,846 to 4,930,728. How do I achieve that?
I want to split this text file containing 4,930,728 records into multiple files, each containing a range of 1,000,000 records, e.g. file 1 contains records 1 to 1,000,000, the second file contains records 1,000,001 to 2,000,000, and so on. How do I achieve this?
filename: load_images.txt
Action = insert
DirectReleaseNewCheckinDoc=1
dUser=Biometric
dDocTitle=333_33336145454_RT.wsq
dDocType=Document
dDocAuthor=Biometric
dSecurityGroup=Biometric
dDocAccount=Biometric
xCUSTOMER_MSISDN=33333
xPIN_REF=64343439
doFileCopy=1
fParentGUID=2CBC11DF728D39AEF91734C58AE5E4A5
fApplication=framework
primaryFile=647229_234343145454_RT.wsq
primaryFile:path=/ecmmigration_new/3339_2347333145454_RT.wsq
xComments=Biometric Migration from table OWCWEWW_MIG_3007
<<EOD>>
Answer #1:
head -n 4930728 myfile.txt | tail -n $(echo "4930728 - 4300846" | bc)
Answer #2 - to split the file into chunks of 10,000,000 lines:
split -l 10000000 myfile.txt   # creates files like xaa, xab, and so on
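Both answers above work on lines, but in this file each record spans several lines and ends with <<EOD>>, so a record-aware variant may be closer to what was asked. A hedged awk sketch (the chunk file names and remaining.txt are my own choices):
# 1. Restart from the failure point: keep records 4,300,846 onward (adjust the starting record as needed).
awk 'rec >= 4300845 { print } /^<<EOD>>$/ { rec++ }' load_images.txt > remaining.txt
# 2. Split into chunks of 1,000,000 records each (chunk_000, chunk_001, ...).
awk '{ f = sprintf("chunk_%03d", int(rec / 1000000)); print > f } /^<<EOD>>$/ { rec++ }' load_images.txt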
