Fuzzy match of strings using Python - python-3.4

I have a record set as below.
"product_id"|"prod_descr"|"status"|"last_upd_time"
"102317"|"TELMINORM CH 40/12.5MG TAB 10'S"|"A"|"2016-08-31 15:02:06.609879"
"99996"|"BECOSTAR TAB 15'S"|"A"|"2016-09-05 18:20:25"
"99997"|"SUPRADYN TABLET15S"|"A"|"2016-09-06 09:05:24"
"120138"|"LASILACTONE 50MG TABLET 10'S"|"A"|"2016-09-07 12:01:05"
"101921"|"TELMA 20MG TABLET 15S"|"A"|"2016-08-31 15:02:06.609879"
"1220"|"ACNESTAR SOAP 75GM"|"A"|"2016-08-31 15:02:06.609879"
"120147"|"AMANTREL CAPSULES 15S"|"A"|"2016-09-09 09:54:35"
"113446"|"VOLIX 0 3MG TABLET 15S"|"A"|"2016-08-31 15:02:06.609879"
"121294"|"maxifer xt syrup "|"A"|"2016-09-29 15:32:40"
"120151"|"PIRITON CS SYRUP 100ML"|"A"|"2016-09-09 14:30:46"
"103481"|"TERBICIP SPRAY 30ML"|"A"|"2016-08-31 15:02:06.609879"
"96175"|"SORBITRATE 5MG TABLET 50S"|"A"|"2016-08-31 15:02:06.609879"
The set is huge: about a million records. I want to take the second field of each record, say "TELMINORM CH 40/12.5MG TAB 10'S" on row 2, fuzzily compare it against the rest of the records, and find whether similar records exist.
An example would be:
TELMINORM CH 40/12.5MG TAB 10'S is the same as TELMINORM CH 40/12.5MG CAP 10'S; TAB/CAP are abbreviations for tablet/capsule. In this case it's a duplicate record.
To eliminate such duplicates I used the distance module: if the edit distance between two strings is less than 5, I write them to a file in the format below.
TELMINORM CH 40/12.5MG TAB 10'S - TELMINORM CH 80/12.5MG TAB 10'S, TELMINORM CH 40/12.5MG TAB 10'S, TELMINORM CH 40/12.5MG CAP 10'S
The logic I used does the trick but is slow: it processes about 150 records per hour, which is far too slow at this scale.

I have also used something like this:
from fuzzywuzzy import fuzz
rank = fuzz.ratio(str_1, str_2)  # similarity score from 0 to 100
Then I check whether rank > 80 and proceed. This approach seems faster than the distance module.
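For a million records the bottleneck is the all-pairs comparison, not the scorer itself. A common way to cut the quadratic comparison count is blocking: group records by a cheap key and only fuzzy-compare within each group. Below is a minimal sketch assuming the first word of prod_descr (usually the brand name) is a reasonable blocking key; the helper find_candidate_duplicates is a made-up name for illustration, not part of any library. Installing python-Levenshtein alongside fuzzywuzzy also speeds up fuzz.ratio considerably.

from collections import defaultdict
from fuzzywuzzy import fuzz

def find_candidate_duplicates(descriptions, threshold=80):
    # Blocking: bucket descriptions by their first word so each string is
    # only compared against others sharing that word, instead of against
    # all million records.
    blocks = defaultdict(list)
    for desc in descriptions:
        blocks[desc.split()[0].upper()].append(desc)

    duplicates = []
    for group in blocks.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if fuzz.ratio(a, b) > threshold:
                    duplicates.append((a, b))
    return duplicates

# Example: the TAB/CAP pair from above lands in the same "TELMINORM" block,
# while BECOSTAR never gets compared against it.
print(find_candidate_duplicates([
    "TELMINORM CH 40/12.5MG TAB 10'S",
    "TELMINORM CH 40/12.5MG CAP 10'S",
    "BECOSTAR TAB 15'S",
]))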

Related

Best way to import words from a text file into a data frame in R

I have a text file filled with words separated by spaces as seen below:
ACNES ACOCK ACOLD ACORN ACRED ACRES ACRID ACTED ACTIN ACTON ACTOR ACUTE ACYLS ADAGE ADAPT ADAWS ADAYS ADDAX ADDED ADDER ADDIO ADDLE ADEEM ADEPT ADHAN ADIEU ADIOS ADITS ADMAN ADMEN ADMIN ADMIT ADMIX ADOBE ADOBO ADOPT ADORE ADORN ADOWN ADOZE ADRAD ADRED ADSUM ADUKI ADULT ADUNC ADUST ADVEW ADYTA ADZED ADZES AECIA AEDES AEGIS AEONS AERIE AEROS AESIR AFALD AFARA AFARS AFEAR AFFIX AFIRE AFLAJ AFOOT AFORE AFOUL AFRIT AFROS AFTER AGAIN AGAMA AGAMI AGAPE AGARS AGAST AGATE AGAVE AGAZE AGENE AGENT AGERS AGGER AGGIE AGGRI AGGRO AGGRY AGHAS AGILA AGILE AGING AGIOS AGISM AGIST AGITA AGLEE AGLET AGLEY AGLOO AGLOW AGLUS AGMAS AGOGE AGONE AGONS AGONY AGOOD AGORA AGREE AGRIA AGRIN
What's the best way to import all these words into a one-column data frame?

(gnu) diff - display corresponding line numbers

I'm trying to build a diff viewer plugin for my text editor (Kakoune). I'm comparing two files and marking any lines that are different across two panes.
However the two views don't scroll simultaneously. My idea is to get a list of line numbers that correspond to each other, so I can place the cursor correctly when switching panes, or scroll the secondary window when the primary one scrolls, etc.
So: is there a way to get a list of corresponding line numbers from command-line diff?
I'm hoping for something along the lines of the following example: given files A and B, the output should tell me which of the unchanged line numbers correspond.
File A       File B        Output
1: hello     1: hello      1:1
2: world     2: another    2:3
3: this      3: world      3:4
4: is        4: this       4:5
5: a         5: is
6: test      6: eof
The goal is that when I scroll to line 4 in file A, I'll know to scroll file B so that its line 5 is rendered at the same position.
It doesn't have to be GNU diff, but it should only use tools that are available on most/all Linux boxes.
I can get some of the way using GNU diff, but I then need a Python script to post-process its group-based output into line-based output.
#!/usr/bin/env python
import sys
import subprocess

file1, file2 = sys.argv[1:]
cmd = ["diff",
       "--changed-group-format=",
       "--unchanged-group-format=%df %dl %dF %dL,",
       file1,
       file2]
# universal_newlines=True makes communicate() return str instead of bytes,
# so the split(",") below works under Python 3.
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
output = p.communicate()[0]
for item in output.split(",")[:-1]:
    start1, end1, start2, end2 = [int(s) for s in item.split()]
    n = end1 - start1 + 1
    assert n == end2 - start2 + 1  # unchanged group, so same length in each file
    for i in range(n):
        print("{}:{}".format(start1 + i, start2 + i))
gives:
$ ./compare.py fileA fileB
1:1
2:3
3:4
4:5
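For what it's worth, the same mapping can be produced without GNU diff at all, using Python's standard-library difflib. This is a sketch of the idea rather than a drop-in replacement, since difflib's diff can differ slightly from GNU diff's:

#!/usr/bin/env python
import difflib
import sys

file1, file2 = sys.argv[1:]
with open(file1) as fa, open(file2) as fb:
    a, b = fa.readlines(), fb.readlines()

# get_matching_blocks() yields (a, b, size) triples describing the unchanged
# regions; the final block is a zero-size sentinel, which range() skips.
for block in difflib.SequenceMatcher(None, a, b).get_matching_blocks():
    for i in range(block.size):
        print("{}:{}".format(block.a + i + 1, block.b + i + 1))

On the example files above this prints the same 1:1, 2:3, 3:4, 4:5 mapping.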

Data scraping with a list in Excel

I have a list in Excel: one code in column A and another in column B.
There is a website where I need to input both values into two different boxes, which then takes me to another page.
That page contains certain details which I need to scrape into Excel.
Any help with this?
Ok. Give this a shot:
import pandas as pd
import requests

df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'
results = pd.DataFrame()
for row in df.itertuples():
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except ValueError:
        # pd.read_html raises ValueError when no table is found in the response
        print('The name given by you does not match the data, or you entered fewer than three letters.')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        continue

    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)

    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]

    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except IndexError:
        # Some parties have no branch table at all
        branchData = pd.DataFrame()
        print('No branch data.')

    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = results.append(temp_df, sort=False).reset_index(drop=True)

results.to_excel('path.new_file.xlsx', index=False)
Output:
print(results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...
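One caveat if you run this on a recent pandas: DataFrame.append was deprecated and has been removed in pandas 2.0, so the results = results.append(...) lines will fail there. The usual replacement is to collect the per-row frames in a list and concatenate once at the end; a minimal sketch of the pattern (with stand-in data instead of the scrape loop):

import pandas as pd

frames = []  # collect one small DataFrame per processed row
for code in ['0000000001', '0000000002']:  # stand-in for the scrape loop above
    frames.append(pd.DataFrame([[code, 'OK']], columns=['IEC', 'Status']))
results = pd.concat(frames, ignore_index=True)  # one concat replaces repeated append
print(results)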

R data.table fread fails on special character

I can only give you a picture of the data I'm working with, or of the character that creates the problems in the .csv file; I don't know how to reproduce that character here.
This pillar character stops fread from working. Is there a way to escape it? readr's read_csv gets through it with no problem. I have tried dropping the column, reading it as a character column, and using comment.char = "", but nothing seems to work.
Here is what I'm hoping to get out (and what I do get out with read_csv):
# A tibble: 5 x 4
X1 trade date trade_condition
<dbl> <dbl> <date> <chr>
1 2902 28.3 2019-01-14 -12------P----
2 2903 28.0 2019-01-14 P
3 2904 28.0 2019-01-14 P
4 2905 28.0 2019-01-14 P
5 2906 28.1 2019-01-14 P
I'm using data.table_1.12.0.
Here is the verbose = TRUE output:
omp_get_max_threads() = 8
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 8 threads (omp_get_max_threads()=8, nth=8)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file C:/Users/Markku/Desktop/KONECRANES_2019.01.14/trades.csv
File opened, size = 592KB (606768 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,trade,date,trade_condition,sy>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 9 fields using quote rule 0
Detected 9 columns on line 1. This line is either column names or first data row. Line starts as: <<,trade,date,trade_condition,sy>>
Quote rule picked = 0
fill=false and the most number of columns found is 9
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (606767 bytes from row 1 to eof) / (2 * 27623 jump0size) == 10
Type codes (jump 000) : 57AAAA5AA Quote rule 0
A line with too-few fields (4/9) was found on line 4 of sample jump 7. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-few fields (4/9) was found on line 13 of sample jump 9. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 010) : 57AAAA5AA Quote rule 0
'header' determined to be true due to column 2 containing a string on row 1 and a lower type (float64) in the rest of the 858 sample rows
=====
Sampled 858 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 606683
Line length: mean=213.01 sd=86.78 min=59 max=372
Estimated number of rows: 606683 / 213.01 = 2849
Initial alloc = 5698 rows (2849 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 57AAAA5AA
[10] Allocate memory for the datatable
Allocating 9 column slots (9 - 0 dropped) with 5698 rows
[11] Read the data
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==1
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==2
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==3
jumps=[0..1), chunk_size=606683, total_size=606683
Read 2903 rows x 9 columns from 592KB (606768 bytes) file in 00:00.014 wall clock time
[12] Finalizing the datatable
Type counts:
2 : int32 '5'
1 : float64 '7'
6 : string 'A'
=============================
0.003s ( 21%) Memory map 0.001GB file
0.007s ( 50%) sep=',' ncol=9 and header detection
0.000s ( 0%) Column type detection using 858 sample rows
0.000s ( 0%) Allocation of 5698 rows x 9 cols (0.000GB) of which 2903 ( 51%) rows used
0.004s ( 29%) Reading 1 chunks (0 swept) of 0.579MB (each chunk 2903 rows) using 1 threads
+ 0.000s ( 0%) Parse to row-major thread buffers (grown 0 times)
+ 0.002s ( 14%) Transpose
+ 0.002s ( 14%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.014s Total
Warning message:
In fread(trades_file, verbose = T) :
Stopped early on line 2905. Expected 9 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2903,28.04,2019-01-14,"P>>

After using SELECT DISTINCT with inner join, still getting duplicate records

I am using an inner join across three tables. I want to get only unique records, not duplicates, so I used SELECT DISTINCT with the inner join. But I am still getting duplicate records.
My code:
SELECT DISTINCT Submission.MID AS Expr16,
RevAssaignments.Rev1Name AS Expr18,
RevAssaignments.Rev2Name AS Expr19,
RevAssaignments.Rev3Name AS Expr20,
RevAssaignments.Rev1Status AS Expr21,
RevAssaignments.Rev2Status AS Expr22,
RevAssaignments.Rev3Status AS Expr23,
Submission.Title AS Expr2,
Submission.SID AS Expr1,
Files.PaperKey AS Expr7,
Submission.CoAuth AS Expr3,
Submission.Email AS Expr4,
Submission.CopyRightDate AS Expr5,
Submission.Status AS Expr6,
Files.*
FROM RevAssaignments
INNER JOIN Submission ON RevAssaignments.SID = Submission.SID
INNER JOIN Files ON Submission.SID = Files.SID
WHERE (Submission.Status = 'ACCEPTED ')
ORDER BY Expr16
Output
I-2012-10-355 Modified Sierpinski Carpet Fractal Antenna for Wireless Applications 354 2701318277.pdf Kuldip Pahwa
I-2012-10-355 Modified Sierpinski Carpet Fractal Antenna for Wireless Applications 354 1488315706.pdf Kuldip Pahwa
I-2012-10-355 Modified Sierpinski Carpet Fractal Antenna for Wireless Applications 354 3539969905.pdf Kuldip Pahwa
I-2012-12-379 Modified Dither Optical Phase Locked Loop for Inter-satellite Communications 378 1978719613.pdf A.BANERJEE
I-2012-12-379 Modified Dither Optical Phase Locked Loop for Inter-satellite Communications 378 1063820967.pdf A.BANERJEE
I-2012-12-379 Modified Dither Optical Phase Locked Loop for Inter-satellite Communications 378 9443420594.pdf A.BANERJEE
I-2012-12-385 A Sampling Oscilloscope Based System with Active RF/IF Load-pull for Multi-Tone Non-linear Device Characterization 384 1383013331.pdf Dr. Muhammad Akmal Chaudhary
I-2013-4-435 DESIGN OF MICROSTRIP YAGI UDA ANTENNA WITH THREE PARASITIC ELEMENTS AT 2.5 GHz 434 2012614214.pdf satyandra singh lodhi
I-2013-4-435 DESIGN OF MICROSTRIP YAGI UDA ANTENNA WITH THREE PARASITIC ELEMENTS AT 2.5 GHz 434 1349118729.pdf satyandra singh lodhi
Desired Output
I-2012-10-355 Modified Sierpinski Carpet Fractal Antenna for Wireless Applications 354 3539969905.pdf Kuldip Pahwa
I-2012-12-379 Modified Dither Optical Phase Locked Loop for Inter-satellite Communications 378 9443420594.pdf A.BANERJEE
I-2012-12-385 A Sampling Oscilloscope Based System with Active RF/IF Load-pull for Multi-Tone Non-linear Device Characterization 384 1383013331.pdf Dr. Muhammad Akmal Chaudhary
I-2013-4-435 DESIGN OF MICROSTRIP YAGI UDA ANTENNA WITH THREE PARASITIC ELEMENTS AT 2.5 GHz 434 1349118729.pdf satyandra singh lodhi
As you may have noticed, the file names are different (2701318277.pdf, 1488315706.pdf, 9443420594.pdf); DISTINCT only collapses records that share the same values for all selected fields.
That is because your join finds multiple matching rows in the Files table. You can reach your desired output if you join on only one row from the Files table.
SELECT DISTINCT Submission.MID AS Expr16
, RevAssaignments.Rev1Name AS Expr18
, RevAssaignments.Rev2Name AS Expr19
, RevAssaignments.Rev3Name AS Expr20
, RevAssaignments.Rev1Status AS Expr21
, RevAssaignments.Rev2Status AS Expr22
, RevAssaignments.Rev3Status AS Expr23
, Submission.Title AS Expr2
, Submission.SID AS Expr1
, Files.PaperKey AS Expr7
, Submission.CoAuth AS Expr3
, Submission.Email AS Expr4
, Submission.CopyRightDate AS Expr5
, Submission.Status AS Expr6
, Files.*
FROM RevAssaignments
INNER JOIN Submission ON RevAssaignments.SID = Submission.SID
INNER JOIN Files ON Files.FID = ( SELECT f.FID
                                  FROM Files AS f
                                  WHERE f.SID = Submission.SID
                                  -- add an ORDER BY here if a specific file is wanted
                                  LIMIT 1 )
WHERE (Submission.Status = 'ACCEPTED ') ORDER BY Expr16
