Judge whitespace or number using pyparsing - pyparsing

I am parsing structured text files with pyparsing and have a problem deciding whether a character is whitespace or a digit. My file looks like this:
RECORD 0001
TITLE (Main Reference Title)
AUTHOR (M.Brown)
Some files have more than one author:
RECORD 0002
TITLE (Main Reference Title 1)
AUTHOR 1(S.Red)
2(B.White)
I would like to parse the files and convert them into dictionary format.
{"RECORD": "0001",
"TITLE": "Main Reference Title",
"AUTHOR": {"1": "M.Brown"}
}
{"RECORD": "0002",
"TITLE": "Main Reference Title 1",
"AUTHOR": {"1": "S.Red", "2": "B.White"}
}
I tried to parse the AUTHOR field with pyparsing (I tried both 2.4.7 and 3.0.0b3). The following is a simplified version of my code.
from pyparsing import *
flag = White(" ", exact=1).set_parse_action(replace_with("1")) | Word(nums, exact=1)
flaged_field = Group(flag + restOfLine)
next_line = White(" ", exact=8).suppress() + flaged_field
authors_columns = (
    Keyword("AUTHOR").suppress()
    + White(" ", exact=2).suppress()
    + flaged_field            # parse first row
    + ZeroOrMore(next_line)   # parse next row
)
authors = authors_columns.search_string(f)
where 'f' contains all lines read from the file. With this code, I could only parse the author names that have numbering flags.
[]
[[['1', '(S.Red)'],['2','(B.White)']]]
However, if I only parse with whitespace
flag = White(" ",exact=1).set_parse_action(replace_with("1"))
it worked correctly for the files without numbering flags.
['1', '(M.Brown)']
[]
The character at [9:10] (a digit or whitespace) has a meaning in my format, and I want to determine whether it is whitespace or a digit (up to 9). I also replaced "|" with "^", swapped the order of the alternatives, and tried
flag = Word(nums+" ")
as well, but none of these worked. Why doesn't choosing between White(" ") and Word(nums) work in my code? Could someone help me or give me an idea for solving this?

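This was solved by adding leave_whitespace(). By default pyparsing skips leading whitespace before it tries to match an expression; leave_whitespace() turns that skipping off, so the flag column is examined exactly where it stands.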
flag = (White(" ",exact=1).set_parse_action(replace_with("0")) | Word(nums,exact=1)).leave_whitespace()
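For comparison, the same decision (blank or digit in the flag column) can be sketched without pyparsing by slicing the fixed columns directly. This is only an illustration under assumptions that the post does not state: the flag is taken to sit at 0-based index 8 (after AUTHOR plus two spaces, or after eight spaces on a continuation line), a blank flag is mapped to "1", and parse_authors and flag_col are hypothetical names.

```python
def parse_authors(lines, flag_col=8):
    """Collect AUTHOR entries keyed by their flag; a blank flag counts as "1"."""
    authors = {}
    in_author = False
    for line in lines:
        head = line[:flag_col]
        flag = line[flag_col:flag_col + 1]
        rest = line[flag_col + 1:]
        if head.startswith("AUTHOR"):
            in_author = True
        elif head.strip():
            # Any other keyword in the leading columns ends the AUTHOR field.
            in_author = False
        if in_author and (flag == " " or flag.isdigit()):
            key = "1" if flag == " " else flag
            authors[key] = rest.strip().strip("()")
    return authors


# Assumed layout: keyword, padding up to column 8, flag, then the value.
sample = [
    "RECORD  0002",
    "AUTHOR  1(S.Red)",
    "        2(B.White)",
]
print(parse_authors(sample))  # {'1': 'S.Red', '2': 'B.White'}
```

Slicing keeps the blank flag visible, which is the same effect leave_whitespace() has on the pyparsing expression.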

Related

Why are some strings in quotes but others aren't when creating a .YAML file from R?

I'm trying to create the following .YAML file:
summary:
  title: "Table tabs"
  link: ~
  blocks: []
  nested: nav-pills
  nested_names: yes
(note there are no quotes around the tilde, square brackets or yes).
I write the code to create it in R:
tabs <- list(
  summary =
    list(
      title = "Table tabs",
      link = "~",
      blocks = "[]",
      nested = "nav-pills",
      nested_names = "yes"
    )
)
write(yaml::as.yaml(tabs), file = "myfile.yaml")
But when I write it out to .YAML, it looks like this:
summary:
  title: Table tabs
  link: '~'
  blocks: '[]'
  nested: nav-pills
  nested_names: 'yes'
i.e. There are quotations around the tilde, square brackets and yes.
Why does this happen, and what can I do to prevent it?
This is already covered by existing Stack Overflow answers, so I will point you to them:
General considerations on quoting in YAML are discussed in the question "YAML: Do I need quotes for strings in YAML?"
The difference between ' and " in YAML is discussed here:
"What is the difference between a single quote and double quote in Yaml header for r Markdown?"
The tilde sign specifically is discussed here:
"What is the purpose of tilde character ~ in YAML?"
To summarise,
The tilde is one of the ways the null value can be written. Most parsers also accept an empty value for null, and of course null, Null and NULL
The same reasoning applies to the other two values: unquoted, [] is an empty sequence and yes is read as a boolean by YAML 1.1 parsers, so as.yaml quotes the R strings "~", "[]" and "yes" to keep them strings.
Based on the answer from TarJae, the solution is as follows:
tabs <- list(
  summary =
    list(
      title = "Table tabs",
      link = NULL,
      blocks = list(),
      nested = "nav-pills",
      nested_names = TRUE
    )
)

Why are keywords not parsed first and omitted from free text matching?

I thought I understood pyparsing's logic, but cannot figure out why the bottom example is failing.
I'm trying to parse open text comments where a product or set of products can be mentioned either in the beginning or the end of the comment. Product names can also be omitted from the comment.
The output should be a list of the mentioned products and the description regarding them.
Below are some test cases. The parser identifies everything as 'description' instead of first picking up the products (isn't that what the negative lookahead is supposed to do?)
What's wrong in my understanding?
import pyparsing as pp
products_list = ['aaa', 'bbb', 'ccc']
products = pp.OneOrMore(' '.join(products_list))
word = ~products + pp.Word(pp.alphas)
description = pp.OneOrMore(word)
comment_expr = (pp.Optional(products("product1")) + description("description") + pp.Optional(products("product2")))
matches = comment_expr.scanString("""\
aaa is a good product
I prefer aaa
No comment
aaa bbb are both good products""")
for match in matches:
    print(match)
The expected results would be:
product1: aaa, description: is a good product
product2: aaa, description: I prefer
description: No comment
product1: [aaa, bbb] description: are both good products
Pyparsing's shortcut equivalence between strings and Literals is intended to be a convenience, but sometimes it results in unexpected and unwanted circumstances. In these lines:
products_list = ['aaa', 'bbb', 'ccc']
products = pp.OneOrMore(' '.join(products_list))
I'm pretty sure you wanted products to match on any product. But instead, OneOrMore gets passed this as its argument:
' '.join(products_list)
This is purely a string expression, resulting in the string "aaa bbb ccc". Passing this to OneOrMore, you are saying that products is one or more instances of the string "aaa bbb ccc".
To get the lookahead to work, you need to change products to:
products = pp.oneOf(products_list)
or even better:
products = pp.MatchFirst(pp.Keyword(p) for p in products_list)
Then your negative lookahead will work better.
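To make the intended result concrete, here is a plain-Python sketch (deliberately not pyparsing) of the split the question asks for: product names are peeled off the start and the end of the comment, and whatever remains is the description. Whole-word comparison plays the role that Keyword plays in the grammar above. The function name split_comment and the returned dictionary layout are assumptions made for this illustration.

```python
PRODUCTS = {"aaa", "bbb", "ccc"}


def split_comment(comment):
    """Split a comment into leading products, description, trailing products."""
    words = comment.split()
    # Consume product names from the front of the comment.
    start = 0
    while start < len(words) and words[start] in PRODUCTS:
        start += 1
    # Consume product names from the back, never crossing the front marker.
    end = len(words)
    while end > start and words[end - 1] in PRODUCTS:
        end -= 1
    return {
        "product1": words[:start],
        "description": " ".join(words[start:end]),
        "product2": words[end:],
    }


for text in ("aaa is a good product", "I prefer aaa",
             "No comment", "aaa bbb are both good products"):
    print(split_comment(text))
```

Because the comparison is on whole words, a word that merely starts with a product name is left in the description.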

Matching dict key to text file and returning Test Pass/fail

I'm a novice at Python, and am currently working on a small test case assignment where I am to find and match the dictionary keys to a small text file, and see if the keys are present in the text file.
As follows, the dictionary goes:
key_words = {"description, translation": "test_translation(serial,",
             "unit": "test_unit(",}
The text in the text file, henceforth called "requirement.txt", is as follows:
The description shall display the translation of XXX.
The unit shall be hidden.
The value is read from the file "version.txt".
For each key, I need to check whether it is present in the text or absent: a match should return a "test pass", no match would return a skip.
The keys from the dictionary are to be sorted into a list, then iterated over and matched against the text. (The values from the dictionary are to be sorted into a separate list and iterated over a separate file, which I shall not go into here.)
This is the code that I currently have (and I am stuck):
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
for line in lines:
    line = line.strip()
    if line == '':
        continue
    line_strings = line.split(' ')
    for word in list:
        if word in line:
            print("Test Pass")
            print(word)
            break
        else:
            print("Test Fail")
    print(line + "\n")
Result currently obtained:
Test Fail
Test Pass
display
The description shall display the translation of XXX.
Test Fail
Test Fail
Test Fail
Test Pass
unit
The unit shall be hidden.
Test Fail
Test Fail
Test Fail
Test Fail
The value is read from the file "version.txt".
Running this code prints "Test Pass" and "Test Fail" several times per line, which suggests that every key is checked against each line and a result is printed for each check.
I am stuck on two fronts:
After separating the keys into a list, how do I order them in the sequence "description, translation", "unit"?
How do I modify the code so that the result is returned only once per line, as "Test Pass" or "Test Fail"?
Results should ideally return in the following format:
Ideal outcome:
('Text:', "The description shall display the translation of XXX.")
('Key:', 'description, translation')
Test Pass
('Text:', 'The unit shall be hidden.')
('Key:', 'unit')
Test Pass
('Text:', 'The value is read from the file "version.txt".')
('Key:', (none))
Test Fail
For your kind enlightenment please, thank you!
Try with this:
keys = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
for line in lines:
    line = line.strip()
    if line == '':
        continue
    # Create an empty list which will contain all the words that match
    words_found = []
    for word in keys:
        # if the word matches then add it to the list words_found
        if word in line:
            words_found.append(word)
    print("('Text:', \"{}\")".format(line))
    print("('Keys:', \"{}\")".format(words_found))
    # if the list of words found is not empty then the test passed
    if words_found:
        print("Test Passed")
    else:
        print("Test Failed")
The idea is to create a list of the words found and then print them all.
I'm using the format operation, and you can find a guide on how to use it here. The final if statement checks whether the list words_found is non-empty.
Additional Notes
In this case you won't need it, but if you only wanted to solve the second point you could use the for ... else statement, as explained in the docs:
4.4 break and continue Statements, and else Clauses on Loops
Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a break statement.
By reducing the indentation of the else by one level, it stops being the else of your if statement and becomes the else of the for statement, so it is executed only if the for loop never hit a break, and the problem is solved.
keys = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
for line in lines:
    line = line.strip()
    if line == '':
        continue
    line_strings = line.split(' ')
    for word in keys:
        if word in line:
            print(word)
            print("Test Pass")
            break
    else:
        print("Test Fail")
    print(line + "\n")
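The for ... else behaviour is easier to see in isolation. The sketch below uses a hypothetical helper, first_match, and in-memory strings instead of requirement.txt; it returns the first key found in a line, and the else branch runs only when the loop ends without a break.

```python
def first_match(line, keys):
    """Return the first key contained in the line, or None if there is none."""
    for key in keys:
        if key in line:
            result = key
            break
    else:
        # Reached only when the loop was exhausted without hitting `break`.
        result = None
    return result


keys = ["description", "unit"]
for line in ("The unit shall be hidden.",
             'The value is read from the file "version.txt".'):
    match = first_match(line, keys)
    print(line, "->", "Test Pass" if match else "Test Fail")
```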
Edit
To split a key such as "description, translation" into its individual words, split it at the comma with the built-in string method split:
keys = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
for line in lines:
    line = line.strip()
    if line == '':
        continue
    # Create an empty list which will contain all the words that match
    words_found = []
    for word in keys:
        # A key may hold one or several comma-separated words ("unit" has only one)
        for part in word.split(","):
            part = part.strip()
            # if the word matches then add it to the list words_found
            if part in line:
                words_found.append(part)
    print("('Text:', \"{}\")".format(line))
    print("('Keys:', \"{}\")".format(words_found))
    # if the list of words found is not empty then the test passed
    if words_found:
        print("Test Passed")
    else:
        print("Test Failed")
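A self-contained variant of the same idea, as a sketch: it uses the dictionary and the three requirement lines from the question as in-memory data, so no file path is needed. The helper name match_keys is hypothetical, and requiring every comma-separated part of a key to appear in the line is an assumption about what a "match" should mean. Since Python 3.7 a dict keeps its insertion order, so iterating over key_words already yields "description, translation" before "unit" without sorting.

```python
def match_keys(line, key_words):
    """Return the keys whose comma-separated parts all occur in the line."""
    found = []
    for key in key_words:
        parts = [part.strip() for part in key.split(",")]
        if all(part in line for part in parts):
            found.append(key)
    return found


key_words = {"description, translation": "test_translation(serial,",
             "unit": "test_unit("}
requirement_lines = [
    "The description shall display the translation of XXX.",
    "The unit shall be hidden.",
    'The value is read from the file "version.txt".',
]

for line in requirement_lines:
    keys = match_keys(line, key_words)
    print("Text:", line)
    print("Key:", keys)
    # One verdict per line, printed after all keys have been checked.
    print("Test Pass" if keys else "Test Fail")
```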

find duplicate records by searching specific column values in a file

I am new to Unix. Can you please help me find duplicate records?
Records count as duplicates when Name, EmpId and designation match.
Input File:
"Name" , "Address", "EmpId"," designation", "office location"
"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"
"NameValue1","AddressValue1","EmpIdValue1","designationValue1","office locationValue1"
"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressVal4ue","EmpIdValue1","designationValue","office locationValue"
Output file:
"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"
A Python script is probably the best tool for this:
import fileinput
seen = {}
for line in fileinput.input():
    tokens = line.split(",")
    # Key on Name, EmpId and designation (columns 0, 2 and 3)
    key = tokens[0] + "###" + tokens[2] + "###" + tokens[3]
    if key in seen:
        # print the previous duplicate, if it wasn't printed yet
        if len(seen[key]):
            print(seen[key], end="")
            seen[key] = ""
        print(line, end="")
    else:
        seen[key] = line
For production use you will probably want a more robust way to build unique keys, but the general idea is the same.
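One possible way to make the keys more robust, sketched here under assumptions: the standard csv module handles the quoting, and a tuple of the three fields replaces the "###"-joined string. The function name find_duplicates and the key_columns parameter are hypothetical, and unlike the script above this version returns the duplicate rows grouped by key instead of printing the raw lines.

```python
import csv


def find_duplicates(lines, key_columns=(0, 2, 3)):
    """Return every row whose key columns occur in more than one row."""
    groups = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        key = tuple(row[i].strip() for i in key_columns)
        groups.setdefault(key, []).append(row)
    return [row for rows in groups.values() if len(rows) > 1 for row in rows]


records = [
    '"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"',
    '"NameValue1","AddressValue1","EmpIdValue1","designationValue1","office locationValue1"',
    '"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"',
    '"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"',
    '"NameValue","AddressVal4ue","EmpIdValue1","designationValue","office locationValue"',
]
for row in find_duplicates(records):
    print(row)
```

A tuple key cannot collide the way a joined string can when a field itself contains the separator.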

How can I model a scalable set of definition/term pairs?

Right now my flashcard game is using a prepvocab() method where I
define the terms and translations for a week's worth of terms as a dictionary
add a description of that week's terms
lump them into a list of dictionaries, where a user selects their "weeks" to study
Every time I add a new week's worth of terms and translations, I'm stuck adding another element to the list of available dictionaries. I can definitely see this as not being a Good Thing.
class Vocab(object):
    def __init__(self):
        vocab = {}
        self.new_vocab = vocab
        self.prepvocab()
    def prepvocab(self):
        week01 = {"term":"translation"} #and many more...
        week01d = "Simple Latvian words"
        week02 = {"term":"translation"}
        week02d = "Simple Latvian colors"
        week03 = {"I need to add this":"to self.selvocab below"}
        week03d = "Body parts"
        self.selvocab = [week01, week02] #, week03, weekn]
        self.descs = [week01d, week02d] #, week03, weekn]
        Vocab.selvocab(self)
    def selvocab(self):
        """I like this because as long as I maintain self.selvocab,
        the for loop cycles through the options just fine"""
        for x in range(len(self.selvocab)):
            YN = input("Would you like to add week "
                       + repr(x + 1) + " vocab? (y or n) \n"
                       "Description: " + self.descs[x] + " ").lower()
            if YN in "yes":
                self.new_vocab.update(self.selvocab[x])
        self.makevocab()
I can definitely see that this is going to be a pain with 20+ yes no questions. I'm reading up on curses at the moment, and was thinking of printing all the descriptions at once, and letting the user pick all that they'd like to study for the round.
How do I keep this part of my code better maintained? Anybody got a radical overhaul that isn't so....procedural?
You should store your term:translation pairs and descriptions in a text file in some manner. Your program should then parse the text file and discover all available lessons. This will allow you to extend the set of lessons available without having to edit any code.
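As a rough sketch of that idea, the lessons could live in a JSON file that the program reads at start-up. The file layout, the names LESSONS_JSON, load_lessons and build_vocab, and the placeholder terms are all assumptions made for illustration; the JSON is inlined as a string here only so the example runs without a file.

```python
import json

# Stand-in for the contents of a lessons file; adding a week means adding
# one more entry here, with no change to the code below.
LESSONS_JSON = """
[
  {"description": "Simple Latvian words",  "terms": {"term1": "translation1"}},
  {"description": "Simple Latvian colors", "terms": {"term2": "translation2"}}
]
"""


def load_lessons(text):
    """Parse the lesson definitions; position in the file is the lesson number."""
    return {str(number): lesson
            for number, lesson in enumerate(json.loads(text), start=1)}


def build_vocab(lessons, selected):
    """Merge the term/translation pairs of the selected lesson numbers."""
    vocab = {}
    for number in selected:
        vocab.update(lessons[number]["terms"])
    return vocab


lessons = load_lessons(LESSONS_JSON)
for number, lesson in lessons.items():
    print(number, lesson["description"])
print(build_vocab(lessons, ["1", "2"]))
```

Keying the lessons by a string number means the value typed at an input() prompt can be looked up directly.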
As for your selection of lessons, write a print_lesson_choices function that displays the available lessons and descriptions to the user, and then ask for their input in selecting them. Instead of asking a question of them for every lesson, why not make your prompt something like:
self.selected_weeks = []  # set once, e.g. in __init__
def selvocab(self):
    self.print_lesson_choices()
    selection = input("Select a lesson number or leave blank if done selecting: ")
    if selection == "": #Done selecting
        self.makevocab()
    elif selection in self.available_lessons:
        if selection not in self.selected_weeks:
            self.selected_weeks.append(selection)
            print("Added lesson %s" % selection)
        self.selvocab() #Display the list of options so the user can select again
    else:
        print("Bad selection, try again.")
        self.selvocab()
Pickling objects into a database means it'll take some effort to create an interface to modify the weekly lessons from the front end, but is well worth the time.
