find duplicate records by searching specific column values in a file - unix

I am new to Unix. Can you please help me find duplicate records?
Records count as duplicates when Name, EmpId and designation all match.
Input File:
"Name" , "Address", "EmpId"," designation", "office location"
"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"
"NameValue1","AddressValue1","EmpIdValue1","designationValue1","office locationValue1"
"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressVal4ue","EmpIdValue1","designationValue","office locationValue"
Output file:
"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"

A Python script is probably the best fit for this:
import fileinput

seen = {}  # key -> first line with that key ("" once it has been printed)
for line in fileinput.input():
    tokens = line.split(",")
    key = tokens[0] + "###" + tokens[2] + "###" + tokens[3]
    if key in seen:
        # print the previous duplicate, if it wasn't printed yet
        if len(seen[key]):
            print(seen[key], end="")
            seen[key] = ""
        print(line, end="")
    else:
        seen[key] = line
For production use you will probably want a more robust way of building unique keys, but the general idea is the same.
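As a hedged illustration of a more robust key, here is a minimal Python 3 sketch that parses the fields with the standard csv module and uses a tuple of the key columns instead of a "###"-joined string. The function name find_duplicates, the default column positions (0, 2, 3) and the shortened sample rows are assumptions made for this example. Unlike the script above, it prints duplicates grouped by key rather than in input order.

```python
import csv
import io


def find_duplicates(lines, key_columns=(0, 2, 3)):
    """Return every row whose key columns match at least one other row."""
    groups = {}  # key tuple -> rows sharing that key, in input order
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        key = tuple(row[i].strip() for i in key_columns)
        groups.setdefault(key, []).append(row)
    # Keep only the groups that contain more than one row
    return [row for rows in groups.values() if len(rows) > 1 for row in rows]


# Shortened stand-in for the input file (hypothetical values)
sample = io.StringIO(
    '"n","a","e","d","o"\n'
    '"n1","a1","e1","d1","o1"\n'
    '"n","a1","e","d","o"\n'
    '"n","a4","e1","d","o"\n'
)
for row in find_duplicates(sample):
    print(",".join('"%s"' % field for field in row))
```

A tuple key cannot collide the way a joined string can when a field happens to contain the separator.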

Related

Judge whitespace or number using pyparsing

I am parsing structured text files with pyparsing and have trouble distinguishing a whitespace character from a digit. My files look like this:
RECORD 0001
TITLE (Main Reference Title)
AUTHOR (M.Brown)
Some files have more than one author:
RECORD 0002
TITLE (Main Reference Title 1)
AUTHOR 1(S.Red)
2(B.White)
I would like to parse files and convert them into dictionary format.
{"RECORD": "001",
"TITLE": "Main Reference Title 1",
"AUTHOR": {"1": "M.Brown"}
}
{"RECORD": "002",
"TITLE": "Main Reference Title 2",
"AUTHOR": {"1": "S.Red", "2": "B.White"}
}
I tried to parse the AUTHOR field by pyparsing (tried both 2.4.7 and 3.0.0b3). Following is the simplified version of my code.
from pyparsing import *

flag = White(" ", exact=1).set_parse_action(replace_with("1")) | Word(nums, exact=1)
flaged_field = Group(flag + restOfLine)
next_line = White(" ", exact=8).suppress() + flaged_field
authors_columns = (
    Keyword("AUTHOR").suppress()
    + White(" ", exact=2).suppress()
    + flaged_field            # parse first row
    + ZeroOrMore(next_line)   # parse next row
)
authors = authors_columns.search_string(f)
Here f contains all lines read from the file. With this code, I could only parse author names that carry numbering flags:
[]
[[['1', '(S.Red)'],['2','(B.White)']]]
However, if I only parse with whitespace
flag = White(" ",exact=1).set_parse_action(replace_with("1"))
it worked correctly for the files without numbering flags.
['1', '(M.Brown)']
[]
The character at [9:10] carries meaning in my format, and I want to determine whether it is a whitespace or a digit (at most 9). I also tried replacing "|" with "^", swapping the order of the alternatives, and using
flag = Word(nums+" ")
but none of these worked. Why does choosing between White(" ") and Word(nums) not work in my code? Could someone help me or suggest a way to solve this?
This was solved by adding leave_whitespace().
flag = (White(" ",exact=1).set_parse_action(replace_with("0")) | Word(nums,exact=1)).leave_whitespace()
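For comparison, the same fixed-column rule can be checked without pyparsing. The following is a plain string-slicing sketch, not the pyparsing solution above. It assumes the layout described in the question: an 8-character field holding AUTHOR or blanks, then a one-character flag that is either a space or a digit, then the name in parentheses. The function name parse_authors and the choice to map a blank flag to "1" are illustrative.

```python
def parse_authors(lines):
    """Collect {flag: name} from an AUTHOR row and its continuation rows."""
    authors = {}
    in_author_block = False
    for line in lines:
        label, flag, rest = line[:8], line[8:9], line[9:].strip()
        if label.strip() == "AUTHOR":
            in_author_block = True
        elif label.strip() or not in_author_block:
            # A different keyword (or a stray line) ends the AUTHOR block
            in_author_block = False
            continue
        # Rows reaching this point are AUTHOR rows or blank-label continuations
        if flag == " ":
            flag = "1"      # blank flag: treat as a single, unnumbered author
        elif not flag.isdigit():
            continue        # neither blank nor a digit: not an author row
        authors[flag] = rest.strip("()")
    return authors


print(parse_authors(["AUTHOR   (M.Brown)"]))
print(parse_authors(["AUTHOR  1(S.Red)", "        2(B.White)"]))
```

Slicing reads the flag column directly, so whitespace is never skipped before the flag can be inspected.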

Matching dict key to text file and returning Test Pass/fail

I'm a novice at Python, and am currently working on a small test case assignment where I am to find and match the dictionary keys to a small text file, and see if the keys are present in the text file.
As follows, the dictionary goes:
key_words = {"description, translation": "test_translation(serial,",
             "unit": "test_unit(",}
The text file, henceforth called "requirement.txt", contains:
The description shall display the translation of XXX.
The unit shall be hidden.
The value is read from the file "version.txt".
For each key, I need to check whether it is present in the text: a match should return a "test pass", and no match should return a skip.
The dictionary keys are to be sorted into a list, then iterated over and matched against the text. (The dictionary values are to be sorted into a separate list and matched against a separate file, which I won't go into here.)
This is the code I currently have (and where I am stuck):
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        line_strings = line.split(' ')
        for word in list:
            if word in line:
                print("Test Pass")
                print(word)
                break
            else:
                print("Test Fail")
        print(line + "\n")
Result currently obtained:
Test Fail
Test Pass
display
The description shall display the translation of XXX.
Test Fail
Test Fail
Test Fail
Test Pass
unit
The unit shall be hidden.
Test Fail
Test Fail
Test Fail
Test Fail
The value is read from the file "version.txt".
Running my current code prints "Test Pass" and "Test Fail" several times per line, which suggests that every key is checked against each line and a result is printed for every check.
I am stuck on two fronts:
1. After separating the keys into a list, how do I order them in the sequence "description, translation", "unit"?
2. How do I modify the code so that each line produces a single "Test Pass" or "Test Fail"?
The results should ideally be printed in the following format:
Ideal outcome:
('Text:', 'The description shall display the translation of XXX.')
('Key:', 'description, translation')
Test Pass
('Text:', 'The unit shall be hidden.')
('Key:', 'unit')
Test Pass
('Text:', 'The value is read from the file "version.txt".')
('Key:', (none))
Test Fail
For your kind enlightenment please, thank you!
Try this:
keys = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        # Create an empty list which will contain all the words that match
        words_found = []
        for word in keys:
            # if the word matches then add it to the list words_found
            if word in line:
                words_found.append(word)
        print("('Text:', \"{}\")".format(line))
        print("('Keys:', \"{}\")".format(words_found))
        # if the list of words found is not empty then the test passed
        if words_found:
            print("Test Passed")
        else:
            print("Test Failed")
The idea is to collect the matching words in a list and then print them all.
The output is built with the string format() method, and the final if statement on words_found checks whether the list is empty.
Additional Notes
You won't need it in this case, but if you only wanted to solve the second point, you could use the for ... else statement, as explained in the Python docs:
4.4 break and continue Statements, and else Clauses on Loops
Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a break statement.
If you reduce the indentation of the else by one level, it stops belonging to the if statement and becomes the else of the for loop, so it runs only when the loop finishes without hitting break. That solves the problem:
keys = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        line_strings = line.split(' ')
        for word in keys:
            if word in line:
                print(word)
                print("Test Pass")
                break
        else:
            print("Test Fail")
        print(line + "\n")
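The for ... else behaviour can also be seen in isolation. The following minimal sketch uses in-memory strings instead of requirement.txt; the function name first_matching_key and the sample keys are assumptions made for this illustration.

```python
def first_matching_key(line, keys):
    """Return the first key found in line, or None if no key matches."""
    for key in keys:
        if key in line:
            found = key
            break          # leaving via break skips the else clause
    else:
        found = None       # runs only when the loop finished without break
    return found


for text in ["The unit shall be hidden.", "The value is read from a file."]:
    key = first_matching_key(text, ["description", "unit"])
    print(text, "->", "Test Pass" if key else "Test Fail")
```

Each line produces exactly one result, because the else clause runs at most once per loop rather than once per key.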
Edit
To split the key into description and translation, split it at the comma with the built-in split method:
keys = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        # Create an empty list which will contain all the words that match
        words_found = []
        for word in keys:
            # take the part before the comma; keys without a comma
            # (such as "unit") are used whole
            description = word.split(",")[0].strip()
            # if the word matches then add it to the list words_found
            if description in line:
                words_found.append(description)
        print("('Text:', \"{}\")".format(line))
        print("('Keys:', \"{}\")".format(words_found))
        # if the list of words found is not empty then the test passed
        if words_found:
            print("Test Passed")
        else:
            print("Test Failed")
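The code above checks only the first term of each key. A variant that requires every comma-separated term of a key to appear in the line, and prints one result per line in the format the question asks for, might look like the following sketch. The function name check_line is hypothetical, and the requirement lines are held in a list here instead of being read from requirement.txt.

```python
def check_line(line, keys):
    """Return the keys whose comma-separated terms all occur in line."""
    matched = []
    for key in keys:
        terms = [term.strip() for term in key.split(",")]
        if all(term in line for term in terms):
            matched.append(key)
    return matched


keys = ["description, translation", "unit"]
requirements = [
    "The description shall display the translation of XXX.",
    "The unit shall be hidden.",
    'The value is read from the file "version.txt".',
]
for line in requirements:
    matched = check_line(line, keys)
    print(("Text:", line))
    print(("Key:", matched[0] if matched else None))
    print("Test Pass" if matched else "Test Fail")
```

Printing a tuple reproduces the ('Text:', '...') layout of the ideal outcome without hand-built format strings.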

Does U-SQL support extracting files based on date of creation in ADLS

We know U-SQL supports directory and filename pattern matching when extracting files. What I want to know is whether it also supports pattern matching based on a file's creation date in ADLS (without implementing custom extractors).
Say a folder contains files created across several months (the filenames don't include a date). Is there a way to pull only the files from a particular month?
The U-SQL EXTRACT operator is not aware of any metadata (such as create date) about a file - only the filename.
You could probably build a solution using the .NET SDK. For something rather simple you could use PowerShell to create a file which will contain all the files that meet your date time criteria. Then consume the content as desired.
# Log in to your Azure account
Login-AzureRmAccount
# Modify variables as required
$DataLakeStoreAccount = "<yourDataLakeStoreAccountNameHere>";
$DataLakeAnalyticsAccount = "<yourDataLakeAnalyticsAccountNameHere>";
$DataLakeStorePath = "/Samples/Data/AmbulanceData/"; #modify as desired
$outputFile = "Samples/Outputs/ReferenceGuide/filteredFiles.csv"; #modify as desired
$filterDate = "2016-11-22";
$jobName = "GetFiles";
# Query directory and build main body of script. Note, there is a csv filter.
[string]$body =
"@initial =
    SELECT * FROM
    (VALUES
    " +
    (Get-AzureRmDataLakeStoreChildItem -Account $DataLakeStoreAccount -Path $DataLakeStorePath |
        Where {$_.Name -like "*.csv" -and $_.Type -eq "FILE"} | foreach {
        "(""" + $DataLakeStorePath + $_.Name + """, (DateTime)FILE.CREATED(""" + $DataLakeStorePath + $_.Name + """)), `r`n" });
# formatting, add column names
$body =
    $body.Substring(0,$body.Length-4) + "
    ) AS T(fileName, createDate);";
# U-SQL query and OUTPUT statement
[string]$output =
"
// filter results based on desired time frame
@filtered =
    SELECT fileName
    FROM @initial
    WHERE createDate.ToString(""yyyy-MM-dd"") == ""$filterDate"";
OUTPUT @filtered
TO ""$outputFile""
USING Outputters.Csv();";
# bring it all together
$script = $body + $output;
#Execute job
$jobInfo = Submit-AzureRmDataLakeAnalyticsJob -Account $DataLakeAnalyticsAccount -Name $jobName -Script $script -DegreeOfParallelism 1
#check job progress
Get-AzureRmDataLakeAnalyticsJob -Account $DataLakeAnalyticsAccount -JobId $jobInfo.JobId -ErrorAction SilentlyContinue;
Write-Host "You now have a list of desired files to check @ " $outputFile
Currently there is no way to access or use file meta data properties. Please add your vote and use case to the following feedback item: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr
It's been a while since this question was asked, and I'm not sure whether this is what you were originally looking for, but you can now use the FILE.MODIFIED U-SQL function:
DECLARE @watermark string = "2018-08-16T18:12:03";
SET @@FeaturePreviews = "InputFileGrouping:on";
DECLARE @file_set_path string = "adl://adls.azuredatalakestore.net/stage/InputSample.tsv";
@input =
    EXTRACT [columnA] int?,
            [columnB] string
    FROM @file_set_path
    USING Extractors.Tsv(skipFirstNRows : 1, silent : true);
@result =
    SELECT *, FILE.MODIFIED(@file_set_path) AS FileModifiedDate
    FROM @input
    WHERE FILE.MODIFIED(@file_set_path) > DateTime.ParseExact(@watermark, "yyyy-MM-ddTHH:mm:ss", NULL);
OUTPUT @result TO "adl://ADLS.azuredatalakestore.net/stage/OutputSample.tsv" USING Outputters.Tsv(outputHeader:true);
The U-SQL built-in function is documented here:
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/file-modified-u-sql

Is there anything like MessageBox/SQL binds for normal strings?

Context
In PeopleCode, the following declaration with MessageBox is valid:
MessageBox(0, "", 0, 0, "Something = %1, Something else = %2.", &something, &somethingElse);
This allows me to use bind variables for the MessageBox. The same is also true with SQL:
SQLExec("SELECT * FROM %Table(:1) WHERE VALUE = :2", Record.Y_SOMETHING, &value);
Question
Is there a way to do that with normal strings? I've never liked having to "pipe" strings together like this: &string = "Something = " | &something | ", Something else = " | &somethingElse | ".";
Is there a way to use this format for regular strings? I've looked through several of Oracle's PeopleBooks, but I haven't found anything.
Maybe this is what you are looking for:
Local number &message_set, &message_num;
Local string &default_msg_txt = "%1 %2 %3";
Local string &l_result= MsgGetText(&message_set, &message_num, &default_msg_txt, "hallo", "Frank", "!");
Result:
"hallo Frank !"
The MsgGetText function retrieves a message from the message catalog. If the message is not found, the default text is used.

How can I model a scalable set of definition/term pairs?

Right now my flashcard game uses a prepvocab() method where I define the terms and translations for a week's worth of terms as a dictionary, add a description of that week's terms, and lump them into a list of dictionaries from which the user selects the "weeks" to study.
Every time I add a new week's worth of terms and translations, I'm stuck adding another element to the list of available dictionaries. I can definitely see this as not being a Good Thing.
class Vocab(object):
    def __init__(self):
        vocab = {}
        self.new_vocab = vocab
        self.prepvocab()

    def prepvocab(self):
        week01 = {"term":"translation"} #and many more...
        week01d = "Simple Latvian words"
        week02 = {"term":"translation"}
        week02d = "Simple Latvian colors"
        week03 = {"I need to add this":"to self.selvocab below"}
        week03d = "Body parts"
        self.selvocab = [week01, week02] #, week03, weekn]
        self.descs = [week01d, week02d] #, week03, weekn]
        Vocab.selvocab(self)

    def selvocab(self):
        """I like this because as long as I maintain self.selvocab,
        the for loop cycles through the options just fine"""
        for x in range(len(self.selvocab)):
            YN = input("Would you like to add week " \
                + repr(x + 1) + " vocab? (y or n) \n" \
                "Description: " + self.descs[x] + " ").lower()
            if YN in "yes":
                self.new_vocab.update(self.selvocab[x])
        self.makevocab()
I can definitely see that this is going to be a pain with 20+ yes no questions. I'm reading up on curses at the moment, and was thinking of printing all the descriptions at once, and letting the user pick all that they'd like to study for the round.
How do I keep this part of my code better maintained? Anybody got a radical overhaul that isn't so....procedural?
You should store your term:translation pairs and descriptions in a text file in some manner. Your program should then parse the text file and discover all available lessons. This will allow you to extend the set of lessons available without having to edit any code.
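One way to sketch this is shown below, assuming the lessons are kept in a JSON file whose top level is a list of objects with "description" and "terms" fields. The file format, the field names, and the helper names load_lessons and build_vocab are all assumptions made for this illustration; any other text format would work as well. An in-memory StringIO stands in for the file on disk.

```python
import io
import json


def load_lessons(fileobj):
    """Parse a JSON list of lessons into {lesson_number: lesson_dict}."""
    lessons = {}
    for number, lesson in enumerate(json.load(fileobj), start=1):
        # Lesson numbers are strings so they compare directly with input()
        lessons[str(number)] = {
            "description": lesson["description"],
            "terms": dict(lesson["terms"]),
        }
    return lessons


def build_vocab(lessons, selected):
    """Merge the term dictionaries of the selected lesson numbers."""
    vocab = {}
    for number in selected:
        vocab.update(lessons[number]["terms"])
    return vocab


# Stand-in for a lessons file on disk (hypothetical content)
sample_file = io.StringIO("""
[
  {"description": "Simple Latvian words", "terms": {"term1": "translation1"}},
  {"description": "Simple Latvian colors", "terms": {"term2": "translation2"}}
]
""")
lessons = load_lessons(sample_file)
for number, lesson in lessons.items():
    print(number, lesson["description"])
print(build_vocab(lessons, ["1", "2"]))
```

With this layout, adding a new week means appending one entry to the data file; no code changes are needed.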
As for your selection of lessons, write a print_lesson_choices function that displays the available lessons and descriptions to the user, and then ask for their input in selecting them. Instead of asking a question of them for every lesson, why not make your prompt something like:
# in __init__:
self.selected_weeks = []

def selvocab(self):
    self.print_lesson_choices()
    selection = input("Select a lesson number or leave blank if done selecting: ")
    if selection == "":  # Done selecting
        self.makevocab()
    elif selection in self.available_lessons:
        if selection not in self.selected_weeks:
            self.selected_weeks.append(selection)
            print("Added lesson %s" % selection)
        self.selvocab()  # Display the list of options so the user can select again
    else:
        print("Bad selection, try again.")
        self.selvocab()
Pickling objects into a database means it'll take some effort to create an interface to modify the weekly lessons from the front end, but is well worth the time.
