Why am I getting KeyError in Python3 in the following piece of code? - python-3.6

d = { ' dog ' : ' has a tail and goes woof! ' ,' cat ' : ' says meow ' ,' mouse':' chased by cats ' }
word=input('Enter a word: ')
print('The definition is: ', d[word])
Traceback (most recent call last):
File "<pyshell#429>", line 1, in <module>
print('The definition is: ', d[word])
KeyError: 'dog'
I entered dog as my key, expecting it to print out ' has a tail and goes woof! ', but instead I got a KeyError. How do I resolve this?

There are leading and trailing whitespace characters in your keys, e.g. ' dog ', so you will never find the key 'dog' that way. This code should work fine:
d = {'dog': 'has a tail and goes woof!', 'cat': 'says meow', 'mouse': 'chased by cats'}
word=input('Enter a word: ')
print('The definition is:', d.get(word.strip()))
I've added word.strip() to ignore leading and trailing whitespace in the input.
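For illustration, here is a quick check (run via python3 from the shell; the dictionary and the simulated input just mirror the question) showing that .strip() plus .get() with a default avoids the KeyError entirely:

```shell
# Sketch: .strip() normalizes the input, .get() with a default never raises KeyError
out=$(python3 -c '
d = {"dog": "has a tail and goes woof!", "cat": "says meow", "mouse": "chased by cats"}
word = "  dog  "  # simulated user input with stray whitespace
print(d.get(word.strip(), "not found"))
print(d.get("bird", "not found"))  # missing key falls back to the default
')
echo "$out"
```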


Implement tr and sed functions in awk

I need to process a text file - a big CSV - to correct its format. This CSV has a field which contains XML data, formatted to be human readable: broken up into multiple lines and indented with spaces. I need to have every record on one line, so I am using awk to join lines, after that sed to get rid of extra spaces between XML tags, and then tr to eliminate unwanted "\r" characters.
(The first field of every record is always 8 numbers, and the field separator is the pipe character: "|".)
The awk script is (join4.awk):
BEGIN {
    # initialise "line" variable. Maybe unnecessary
    line = ""
}
{
    # check if this line is the beginning of a new record
    if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\\|" ) {
        # if it is a new record, then print the stuff already collected,
        # then update the line variable with $0
        print line
        line = $0
    } else {
        # if it is not, then just append $0 to the line
        line = line $0
    }
}
END {
    # print out the last record kept in the line variable
    if (line) print line
}
and the command line is:
cat inputdata.csv | awk -f join4.awk | tr -d "\r" | sed 's/> *</></g' > corrected_data.csv
My question is whether there is an efficient way to implement the tr and sed functionality inside the awk script. This is not Linux, so I have no gawk, just plain old awk and nawk.
thanks,
--Trifo
tr -d "\r"
is just gsub(/\r/, "").
sed 's/> *</></g'
is just gsub(/> *</, "><").
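If you want to sanity-check the equivalence, both gsub() calls can be exercised in isolation (the sample input here is invented):

```shell
# The two gsub() calls reproduce tr -d "\r" and sed 's/> *</></g' in one awk pass
out=$(printf 'a\r<x> <y>\r\n' | awk '{ gsub(/\r/, ""); gsub(/> *</, "><"); print }')
echo "$out"
```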
mawk NF=NF RS='\r?\n' FS='> *<' OFS='><'
Thank you all folks!
You gave me the inspiration to get to a solution. It is like this:
BEGIN {
    # initialize "line" variable. Maybe unnecessary.
    line = ""
}
{
    # if the line begins with 8 numbers and a pipe char (the format of the first field)...
    if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\\|" ) {
        # ... then the previous record is ready. We can post-process it, then print it out.
        # workarounds for the missing gsub function:
        # removing extra \r characters
        while ( line ~ "\r" ) { sub( /\r/, "", line ) }
        # removing extra spaces between xml tags the same way:
        # "<text text> <tag tag>" should look like "<text text><tag tag>"
        while ( line ~ "> *<" ) { sub( /> *</, "><", line ) }
        # then print the record and update the line var with the beginning of the new record
        print line
        line = $0
    } else {
        # just keep extending the record with the actual line
        line = line $0
    }
}
END {
    # print the last record kept in the line var
    if (line) {
        while ( line ~ "\r" ) { sub( /\r/, "", line ) }
        while ( line ~ "> *<" ) { sub( /> *</, "><", line ) }
        print line
    }
}
And yes, it is efficient: the embedded version runs about 33% faster.
And yes, it would be nicer to create a function for the post-processing of the records in the "line" variable. Now I have to write the same code twice to process the last record in the END section. But it works, it creates the same output as the chained commands, and it is way faster.
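For what it's worth, classic awk/nawk does support user-defined functions, so the duplicated post-processing could be factored out. A sketch (the sample data is made up, and the function name clean is hypothetical):

```shell
# Sketch: the post-processing factored into a user-defined awk function,
# keeping the while/sub workaround. Sample input is invented.
out=$(printf '12345678|<a> <b>\r\nmore\r\n87654321|<c>  <d>\r\n' | awk '
function clean(s) {
    # same while/sub workarounds as in the main script, in one place
    while (s ~ /\r/)   sub(/\r/, "", s)
    while (s ~ /> *</) sub(/> *</, "><", s)
    return s
}
/^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][|]/ {
    if (line != "") print clean(line)   # flush the previous record
    line = $0
    next
}
{ line = line $0 }                      # continuation line: append it
END { if (line != "") print clean(line) }
')
echo "$out"
```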
So, thanks for the inspiration again!
--Trifo

Adding previous lines to current line unless pattern is found in unix shell script

I am facing an issue while adding previous lines to the current line until a pattern is found. I have a 43 MB file on Unix. A snippet is shown below:
AAA7034 new value and a old value
A
78698 new line and old value
BCA0987 old value and new value
new value
What I want is:
AAA7034 new value and a old value A 78698 new line and old value
BCA0987 old value and new value new value
That means I have to append all the lines until the next pattern is found (the first pattern is AAA and the next pattern is BCA).
Because of the large file size, I'm not sure whether awk/sed will work. Any bash script is appreciated.
You can combine all patterns and perform a regex match. Try something like this (it is just a sketch; you should trim the output if you need to):
#!/bin/bash
patterns="^(AAA|BCS|BABA|BCA)"
file="$1"
while IFS= read -r line; do
    if [[ "$line" =~ $patterns ]] ; then
        echo # prints a newline before each new record
    fi
    echo -n "$line " # prints the line itself plus a space separator
done < "$file"
You can redirect the output to a file, of course.
It's not really clear precisely what you want. You've stated that you want to match the patterns 'AAA' and 'BCA', and later expanded that to "pattern shall be like: AAA, BCS, BABA, BCA". I don't know whether that means you only want to match the literal strings 'AAA', 'BCA', 'BCS', and 'BABA', or whether you want to match any 3- or 4-character string containing only 'A', 'B', 'C', and 'S', but it sounds like you are just looking for:
awk '/[A-Z]{3,4}/{printf "\n"} { printf "%s ", $0} END {printf "\n"}' input-file
Change the pattern as needed when your requirements are made more precise.
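For instance, against the sample data from the question (GNU awk assumed here for the {3,4} interval), the one-liner joins the continuation lines as requested; note that the very first match also emits its leading newline, so the output starts with a blank line:

```shell
# Demo of the awk one-liner on the question's sample data (GNU awk assumed)
printf '%s\n' 'AAA7034 new value and a old value' 'A' \
    '78698 new line and old value' 'BCA0987 old value and new value' 'new value' > /tmp/sample.$$
out=$(awk '/[A-Z]{3,4}/{printf "\n"} { printf "%s ", $0} END {printf "\n"}' /tmp/sample.$$)
rm -f /tmp/sample.$$
echo "$out"
```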
Based on the comment, it is trivial to convert any awk program to perl. Here is (basically) the output of a2p on the above awk script, with changes to reflect the stated pattern:
#!/usr/bin/env perl
while (<>) {
    chomp;
    if (/AAA|BCA|BCS|BABA/) {
        printf "\n";
    }
    printf '%s ', $_;
}
printf "\n";
You can simplify that a bit (note the -n switch rather than -p, since the script does all of its own printing):
perl -ne 'chomp; printf "\n" if /AAA|BCA|BCS|BABA/; printf "%s ", $_' input-file; echo

Use sed to replace all occurrences of strings which start with 'xy' and of length 5 or more

I am running AIX 6.1
I have a file which contains strings/words starting with some specific characters, say 'xy' or 'Xy' or 'xY' or 'XY' (case insensitive), and I need to mask the entire word/string with asterisks '*' if the word is longer than, say, 5 characters.
e.g. I need a sed command which when run against a file containing the below line...
This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings
should give below as the output
This is a test line xy12 which I need to replace specific strings
I tried the commands below (I did not yet get to the stage where I restrict the word length), but they do not work and display the full line without any substitutions.
I tried using \< and \> as well as \b for word identification.
sed 's/\<xy\(.*\)\>/******/g' result2.csv
sed 's/\bxy\(.*\)\b/******/g' result2.csv
You can try with awk:
echo 'This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings' | awk 'BEGIN{RS=ORS=" "} !(/^[xX][yY]/ && length($0)>=5)'
The awk record separator is set to a space in order to be able to get the length of each word.
This works with GNU awk in --posix and --traditional modes.
With sed, for the mental exercise:
sed -E '
s/(^|[[:blank:]])([xyXY])([xyXY].{2}[^[:space:]]*)([^[:space:]])/\1#\3#/g
:A
s/(#[^#[:blank:]]*)[^#[:blank:]](#[#]*)/\1#\2/g
tA
s/#/*/g'
This requires that the text contain no # characters.
A simple POSIX awk version:
awk '{for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i)}1'
This, however, does not keep the spacing intact (multiple spaces are converted to a single one), the following does:
awk 'BEGIN{RS=ORS=" "}(/^[xX][yY]/ && length($0)>=5){gsub(/./,"*")}1'
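A quick demonstration of the field-loop version on the question's sample line (the input is already single-spaced, so the spacing caveat does not show up here):

```shell
# Demo: words starting with xy (any case) and at least 5 chars long get masked
s='This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings'
out=$(printf '%s\n' "$s" |
    awk '{for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i)}1')
echo "$out"
```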
You may use awk:
s='This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings xy123 xy1234 xy12345 xy123456 xy1234567'
echo "$s" | awk 'BEGIN {
    ORS=RS=" "
}
{
    for(i=1;i<=NF;i++) {
        if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/)
            gsub(/./,"*", $i);
        print $i;
    }
}'
A one liner:
awk 'BEGIN {ORS=RS=" "} { for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } }'
# => This is a test line ******* xy12 ***** ******* which I need to replace specific strings ***** ****** ******* ******** *********
See the online demo.
Details
BEGIN {ORS=RS=" "} - start of the awk program: set the output record separator (ORS) equal to the space record separator (RS)
{ for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } } - iterate over each field (with for(i=1;i<=NF;i++)); if the current field ($i) length is 5 or more (length($i) >= 5) and it matches Xy in any case followed by one or more alphanumeric chars ($i~/^[Xx][Yy][a-zA-Z0-9]+$/), then replace each char with * (with gsub(/./,"*", $i)) and print the current field value.
This might work for you (GNU sed):
sed -r ':a;/\bxy\S{5,}\b/I!b;s//\n&\n/;h;s/[^\n]/*/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/;ta' file
If the current line does not contain a string which begins with xy case insensitive and 5 or more following characters, then there is no work to be done.
Otherwise:
Surround the string by newlines
Copy the pattern space (PS) to the hold space (HS)
Replace all characters other than newlines with *'s
Append the PS to the HS
Replace the PS with the HS
Swap the strings between the newlines retaining the remainder of the first line
Repeat

undesired multiple matches creating links with regex matches in asp.net

I am creating a link out of certain characters that match my regex, and all works as expected as long as a specific match does not appear more than once in the input string.
I want to find any instance of "RTR-" or "RO-" followed by 2 to 4 numbers, and convert that to a link. ex. "This is RTR-1234" becomes
"This is <a href='http://server/browse/RTR-1234'>RTR-1234</a>"
I pass my string to:
Function linkifyText(ByVal txt As String) As String
    Dim regx As New Regex("\b(RTR-|RO-)\d{2,4}\b", RegexOptions.IgnoreCase)
    Dim matches As MatchCollection = regx.Matches(txt)
    For Each match As Match In matches
        txt = txt.Replace(match.Value, "<a href='http://server/browse/" & match.Value & "'>" & match.Value & "</a>")
    Next
    Return txt
End Function
This seems to work fine, even when there are multiple differing matches. For example, "This is RTR-1234, and this is RTR-4321" becomes
This is <a href='http://server/browse/RTR-1234'>RTR-1234</a> and this is <a href='http://server/browse/RTR-4321'>RTR-4321</a>
I run into problems however, when the same match occurs more than once in the input string. For example, "This is RTR-1234 again this is RTR-1234" becomes
This is <a href='http://server/browse/<a href='http://server/browse/RTR-1234'>RTR-1234</a>'><a href='http://server/browse/RTR-1234'>RTR-1234</a></a> again this is <a href='http://server/browse/<a href='http://server/browse/RTR-1234'>RTR-1234</a>'><a href='http://server/browse/RTR-1234'>RTR-1234</a></a>
Description
Why not just do this in one search and replace operation?
\b((?:R(?:TR|O))-[0-9]{2,4})\b
Replace With: <a href='http://server/browse/$1'>$1</a>
This regular expression will do the following:
find any instance of RTR- or RO- followed by 2 to 4 numbers
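The single-pass idea is easy to verify outside .NET; here is the same pattern and replacement exercised with GNU sed (\b as a word boundary is a GNU extension). Because each match is replaced exactly once in one pass over the input, already-inserted link text is never rescanned:

```shell
# Single-pass replacement demo (GNU sed); repeated matches are handled safely
s='This is RTR-1234 again this is RTR-1234'
out=$(printf '%s\n' "$s" |
    sed -E "s#\b((RTR|RO)-[0-9]{2,4})\b#<a href='http://server/browse/\1'>\1</a>#g")
echo "$out"
```

In the asker's VB.NET code, the same effect comes from a single Regex.Replace(txt, pattern, replacement) call instead of the Matches/Replace loop.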
Example
Live Demo
https://regex101.com/r/qD4kC4/1
Sample text
I want to find any instance of "RTR-" or "RO-" followed by 2 to 4 numbers, and convert that to a link.
ex. "This is RTR-1234"
For example, "This is RTR-1234 again this is RTR-1234" becomes
After Replacement
I want to find any instance of "RTR-" or "RO-" followed by 2 to 4 numbers, and convert that to a link.
ex. "This is <a href='http://server/browse/RTR-1234'>RTR-1234</a>"
For example, "This is <a href='http://server/browse/RTR-1234'>RTR-1234</a> again this is <a href='http://server/browse/RTR-1234'>RTR-1234</a>" becomes
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
R 'R'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
TR 'TR'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
O 'O'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
[0-9]{2,4} any character of: '0' to '9' (between 2
and 4 times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------

Lexical error at line 0, column 0

In the grammar below, I am trying to configure any line that starts with ' as a single-line comment, and anything between /' and '/ as a multiline comment. The single-line comment works OK. But for some reason, as soon as I press /, ', ;, <, or >, I get the errors below. I don't have the above characters configured. Shouldn't they be considered default and skipped during parsing?
Error
Lexical error at line 0, column 0. Encountered: "\"" (34), after : ""
Lexical error at line 0, column 0. Encountered: ">" (62), after : ""
Lexical error at line 0, column 0. Encountered: "\n" (10), after : "-"
I have only included part of the code below for conciseness. For full Lexer definition please visit the link
TOKEN :
{
< WHITESPACE:
" "
| "\t"
| "\n"
| "\r"
| "\f">
}
/* COMMENTS */
MORE :
{
<"/'"> { input_stream.backup(1); } : IN_MULTI_LINE_COMMENT
}
<IN_MULTI_LINE_COMMENT>
TOKEN :
{
<MULTI_LINE_COMMENT: "'/" > : DEFAULT
}
<IN_MULTI_LINE_COMMENT>
MORE :
{
< ~[] >
}
TOKEN :
{
<SINGLE_LINE_COMMENT: "'" (~["\n", "\r"])* ("\n" | "\r" | "\r\n")?>
}
I can't reproduce every aspect of your problem. You say there is an error "as soon as" you enter certain characters. Here is what I get:
/ There is an error only if the next character is not '.
' I see no error. This is correctly treated as the start of a comment.
; There is always an error. No token can start with ;.
< There is an error only if the next characters are not - or <-.
> There is always an error. No token can start with >.
I'm not exactly sure why you would expect these not to be errors, since your lexer has no rules to cover these cases. Generally, when there is no rule to match a prefix of the input and the input is not exhausted, a TokenMgrError is thrown.
If you want to eliminate all these TokenMgrErrors, make a catch-all rule (as explained in the FAQ):
TOKEN: { <UNEXPECTED_CHARACTER: ~[] > }
Make sure this is the very last rule in the .jj file. This rule says that, when no other rule applies, the next character is treated as an UNEXPECTED_CHARACTER token. Of course, this just pushes the problem up to the parsing level. If you really want the tokenizer to skip all characters that don't belong, use the following as the very last rule:
SKIP : { < ~[] > }
For most languages, that would be an odd thing to do, which is why it is not the default.
