Remove duplicate line based on column for fixed length file - unix

I have a fixed-length file whose primary key occupies columns 1-8. There is no delimiter. I want to eliminate duplicates by removing the second occurrence of each key. I am looking for a solution in Unix.
The file looks like this:
A00991CCAGXCD K 9999PHLX CANADIAN DOLLAR F
G0084W10%AEURN 4612EURONAV NV ANTWERPEN F
D1819089%ADB 6021DEUTSCHE BANK AG F
G0084W10GAADNT 6799ADIENT PLC F
D1F19089NADB 6021DEUTSCHE BANK AG F
Expected output:
A00991CCAGXCD K 9999PHLX CANADIAN DOLLAR F
G0084W10%AEURN 4612EURONAV NV ANTWERPEN F
D1819089%ADB 6021DEUTSCHE BANK AG F
D1F19089NADB 6021DEUTSCHE BANK AG F

Short awk solution:
awk '!a[substr($1,1,8)]++' file
The output:
A00991CCAGXCD K 9999PHLX CANADIAN DOLLAR F
G0084W10%AEURN 4612EURONAV NV ANTWERPEN F
D1819089%ADB 6021DEUTSCHE BANK AG F
D1F19089NADB 6021DEUTSCHE BANK AG F
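substr($1,1,8) - returns the first 8 characters of the first field $1. This works here because the key contains no whitespace; if a key could contain blanks, take it from the whole record with substr($0,1,8) instead.
!a[...]++ - true only the first time a given key is used as an array index, so only the first occurrence of each key is printed.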
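For keys that may contain blanks, a minimal sketch of the same idea reads the key from the whole record rather than from the first whitespace-separated field. It assumes the key occupies columns 1-8, as stated in the question; the helper name dedupe_by_key is hypothetical.

```shell
#!/bin/sh
# Sketch: keep only the first record seen for each 8-character key.
# dedupe_by_key is a hypothetical helper; it reads stdin or the files given.
dedupe_by_key() {
    # substr($0, 1, 8) takes columns 1-8 of the raw record, so blanks inside
    # the key are preserved. seen[key]++ evaluates to 0 (false) only the
    # first time a key appears, so !seen[key]++ is true exactly once per key.
    awk '!seen[substr($0, 1, 8)]++' "$@"
}

# Sample records from the question.
dedupe_by_key <<'EOF'
A00991CCAGXCD K 9999PHLX CANADIAN DOLLAR F
G0084W10%AEURN 4612EURONAV NV ANTWERPEN F
D1819089%ADB 6021DEUTSCHE BANK AG F
G0084W10GAADNT 6799ADIENT PLC F
D1F19089NADB 6021DEUTSCHE BANK AG F
EOF
```

Run on the sample, this drops only the second G0084W10 record, which matches the expected output in the question.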

Transposition Cipher: How to solve

I have the following ciphertext, for which I do not know the key length:
wlna evesy ehudre thnma upbum w onaw-dino olsile tf hndcseoorl foouA. bnsst uho,et r,vweeirh teorf efer tsw lae tsutas sfeccsan,ul eytd hduu sbhe edtmel faut s,b bo nte oefrroeth ad ofhlea fl, in nthdan olwe hacpe lenbe euce rdo edt acsuhutobslinre ut,h tae a sv tmsoeedswitiny cl ardesro ndipipn ino,ng trat reeac edimanthf o chme ay einrh iwhccodha ur stoorn uftentuauac aqnctina dse oy.realgea Lrsea ms nos fl kiceofdan wi tndieer eroscvto edsindre oun a usht-out e,bcoo n wesin o retou befwh,nd mahic vehy alax ep teindre hepe nsecho oftul sebox kybhi eswav chhenbe eeal aref dyrd rereHo.to r ow uaudhyrencli erngie ba hdconee edenvym r fogaeth terdne to h wospt hrheecore ed rveeseshi menss hhigtreeav edimanevo fr m erarytysee e wrot itn to frof hesulmt ohi d,wol cht aud sy e vrn apli. ltaead Hehdev ei blntycanee d irre bwdono ty wonrpesne s,owhf o ad omhare rmy bkall asml aefethe ndtert ohsun uu llaly ogare Osne.e tn he,owhlwat i stms obar pothebl he atteni slglEt nanhisminb, eg le
How can I decipher this transposed ciphertext in a non-manual way, like on https://tholman.com/other/transposition/?
I believe that the punctuation and spaces matter as well in this ciphertext.

R extract specific word after keyword

How do I extract a specific word after a keyword in R?
I have the following input text, which contains details about a policy. I need to extract the values of specific fields such as FirstName, SurName, FatherName and dob.
input.txt
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The code below works for me, but if the line order changes, every result changes with it. Is there a way to extract a keyword's value irrespective of line order?
Current Code
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))
You can collapse the text into a single string and extract the values by pattern. This works regardless of line number, provided the same patterns hold in every file.
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"

Extract date from a text document in R

I am again here with an interesting problem.
I have a document like shown below:
"""UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n""""
From the above document, I need to extract the date (29-MAY-2019).
I tried with strpdate function but did not get the desired results.
Any help will be greatly appreciated.
Thanks in advance.
Assuming you only want to capture a single date, you may use sub here:
text <- "UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n"
date <- sub("^.*\\b(\\d{2}-[A-Z]+-\\d{4})\\b.*", "\\1", text)
date
[1] "29-MAY-2019"
If you need to match multiple such dates in your text, you may use regmatches along with gregexpr, which returns every match rather than only the first:
text <- "Hello World 29-MAY-2019 Goodbye World 01-JAN-2018"
regmatches(text, gregexpr("\\b\\d{2}-[A-Z]+-\\d{4}\\b", text))[[1]]
[1] "29-MAY-2019" "01-JAN-2018"

Syntax error when using count in loop

I am trying to run a loop where I count the total in each file under the variable _merge, and then count certain outcomes of _merge, such as _merge=1 and so on. I then want to calculate percentages by dividing each instance of _merge by the total under _merge.
Below is my code:
/*define local list*/
local ward_names B C D E FN FS GS HE
/*loop for each dbase*/
foreach file of local ward_names {
use "../../../cleaning/sra/output/`file'_ward_CTS_Merged.dta", clear
count if _merge
local ward_count=r(N)
count if _merge==1
local count_master=r(N)
count if _merge==2
local count_using=r(N)
count if _merge==3
local count_match=r(N)
clear
set obs 1
g ward_count='ward_count'
g count_master=`count_master'
g count_using=`count_using'
g count_match=`count_match'
g ward= "`file'"
save "../temp/`file'_collapsed_diagnostics.dta", replace
clear
The code was running fine until I tried to add the total count for each ward file:
g ward_count='ward_count'
'ward_count' invalid name
Is this a syntax error or something more severe?
A local macro reference opens with a backtick (`) and closes with an apostrophe ('). You used an apostrophe on both sides, so use ` as the opening quote:
generate ward_count = `ward_count'
EDIT:
As per @NickCox's recommendation, you can improve your code by using the tabulate command with its matcell() option to get the counts all at once:
tabulate _merge, matcell(A)
                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        master only (1) |          1       16.67       16.67
            matched (3) |          5       83.33      100.00
------------------------+-----------------------------------
                  Total |          6      100.00
matrix list A
A[2,1]
    c1
r1   1
r2   5
So you could then do the following:
generate count_master = A[1,1]
generate count_match = A[2,1]

Drop or remove column using awk

I want to drop the first 3 columns.
This is my data:
DETAIL 02032017
Name Gender State School Class
A M Melaka SS D
B M Johor BB E
C F Pahang AA F
EOF 3
I want my data like this:
DETAIL 02032017
School Class
SS D
BB E
AA F
EOF 3
This is the command I currently use to produce my output:
awk -v date="$(date +"%d%m%Y")" -F\| 'NR==1 {h=$0; next}
{file="TEST_"$1"_"$2"_"date".csv";
print (a[file]++?"": "DETAIL"date"" ORS h ORS) $0 > file} END{for(file in a) print "EOF " a[file] > file}' testing.csv
Can anyone help me?
Thank you :)
I want to remove the first three columns.
If you just want to remove the first three columns, you can just set them to empty strings, leaving alone those that don't have three columns, something like:
awk 'NF>=3 {$1=""; $2=""; $3=""; print; next}{print}'
That has the potentially annoying habit of still having the field separators between those empty fields but, since modifying columns will reformat the line anyway, I assume that's okay:
DETAIL 02032017
   School Class
   SS D
   BB E
   AA F
EOF 3
If awk is the only tool being used to process them, the spacing won't matter. If you do want to preserve formatting (meaning that the columns are at very specific locations on the line), you can just get a substring of the entire line:
awk '{if (NF>=3) {$0 = substr($0,25)}; print}'
Here 25 must be the character position at which the fourth column starts in the input. Since that doesn't modify individual fields, it won't trigger a recalculation of the line that would change its format:
DETAIL 02032017
School Class
SS D
BB E
AA F
EOF 3
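Two hedged variants of the commands above, as a sketch rather than a definitive fix. The first rebuilds each line from field 4 onward, so no leftover separators are printed. The second finds the cut position from the header line instead of hard-coding 25. It assumes the input is space-aligned and that the header contains the name of the first column to keep (School here). The helper names drop_fields and drop_prefix, and the 8-character column widths in the sample, are assumptions for illustration.

```shell
#!/bin/sh
# Variant (a): print fields 4..NF joined by OFS, avoiding empty leading fields.
# drop_fields is a hypothetical helper name; it reads stdin.
drop_fields() {
    awk 'NF >= 3 {
             out = ""
             for (i = 4; i <= NF; i++) out = out (i > 4 ? OFS : "") $i
             print out
             next
         }
         { print }'
}

# Variant (b): keep the original spacing by cutting every line with at least
# three fields at the position where the header text in $1 first appears.
# drop_prefix is a hypothetical helper name; it reads stdin.
drop_prefix() {
    awk -v col="$1" '
        pos == 0 && index($0, col) { pos = index($0, col) }  # locate column start once
        NF >= 3 && pos > 0         { $0 = substr($0, pos) }  # cut without touching fields
        { print }'
}

# Sample input with assumed 8-character column widths (School starts at 25).
cat <<'EOF' | drop_prefix School
DETAIL 02032017
Name    Gender  State   School  Class
A       M       Melaka  SS      D
B       M       Johor   BB      E
C       F       Pahang  AA      F
EOF 3
EOF
```

Piping the same input through drop_fields instead gives single-space-separated output (School Class, SS D, and so on), while drop_prefix keeps the original column alignment.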
