scanString end location: why it is end_index+1? - pyparsing

When I use the scanString method, it gives the start and end location of each matched token in the text.
e.g.
from pyparsing import Word, alphas
line = "cat bat"
pat = Word(alphas)
for i in pat.scanString(line):
    print(i)
I get the following:
((['cat'], {}), 0, 3)
((['bat'], {}), 4, 7)
But the end location of cat should be 2, right? Why is it reporting the next location as the end location?

This is consistent with Python's [begin:end] slicing conventions, where the "end" is the index of the next character. By putting the end as the next location, it is very straightforward to extract the matching substring using the returned values:
for t, start, end in pat.scanString(line):
    print(line[start:end])
You can see how this is used if you look in the pyparsing source code for the implementation of transformString.
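The same half-open convention can be seen without pyparsing. The sketch below is only an illustration: it uses the standard library's re module instead of scanString (an assumption made to keep it dependency-free), and the pattern [A-Za-z]+ stands in for Word(alphas).

```python
import re

line = "cat bat"

# Each match reports a half-open span [start, end): `end` is one past the last character.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+", line)]

for text, start, end in spans:
    # Slicing with the reported span recovers the match, with no "+1" needed.
    assert line[start:end] == text
    # The length of the match is simply end - start.
    assert end - start == len(text)
    print(text, start, end)
```

If the end were reported as the index of the last character instead, every slice and every length computation would need a +1 correction.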

Related

Received MethodError when cleaning String

I have data in a .txt file that looks like this:
04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]
25:31 Heatbeat & Quilla – Secret (Original Mix) [ARMADA CAPTIVATING]
All of them have this pattern:
00:00 artist - title [studio]
I want to remove the time stamp and the studio, so the output looks like this:
1. Yuri Kane feat Jeza – Love Comes (Original Mix)
Here is what I tried:
function remove_time_from(str::String)
    return last(split(str,"0 "))
end
function remove_url(str::String)
    return first(rsplit(str,"["))
end
function main()
    tracks = String[]
    local number = 0
    for line in eachline("track-list.txt")
        number += 1
        removed_time = remove_time_from(line)
        cleaned = remove_url(removed_time)
        push!(tracks,"$number.$cleaned")
    end
    open("track-list-cleaned.txt", "w") do io
        for line in tracks
            write(io, "$line\n")
        end
    end
end
main()
but it returns:
MethodError: no method matching remove_url(::SubString{String})
When you use the function remove_time_from(), it calls last(split(...)), and split() returns SubString{String} elements rather than String:
track = "04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]"
println(typeof(remove_time_from(track))) # Output: SubString{String}
You have 2 ways to fix it:
Have both remove_time_from() and remove_url() convert the SubString to String before returning it. This way, no matter which function you use first, you'll get a String:
return convert(String,last(split(str,"0 ")))
Use AbstractString instead of String as the function parameter, because SubString is a subtype of AbstractString:
println(SubString <: AbstractString) # Output: true
This way, no matter which function you use first, it would accept a String (the variable type of line) or SubString (the type you end up with after using one of the functions).
Suggestions:
Using split(str,"0 ") won't remove the time stamp:
last(split("04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]", "0 "))
Output: 04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]
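What you need is chop(), which lets you specify how many characters to drop from the head: in this case 5, the length of the 00:00 time stamp (the space after it can be removed with strip()). Pass tail = 0 as well, because chop() drops the last character by default.
chop(str, head = 5, tail = 0)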
You don't need to read in the lines, clean it, and then store it in a Vector to write later. You can clean it (do it in one line), and write it out to the file:
open("track-list-cleaned.txt", "w") do io
    for line in eachline("track-list.txt")
        number += 1
        cleaned = remove_url(remove_time_from(line))
        write(io, "$number.$cleaned\n")
    end
end
Use enumerate() to number the lines as you're reading them in:
for (number,line) in enumerate(eachline("track-list.txt"))
Code:
# Using the assignment form because each function has only one line.
# tail = 0 keeps the last character, which chop() would otherwise drop by default.
remove_time_from(str::AbstractString) = chop(str, head = 5, tail = 0)
# Split once from the right so that only the trailing [studio] tag is dropped.
remove_url(str::AbstractString) = first(rsplit(str, " ["; limit = 2))
function main()
    open("track-list-cleaned.txt", "w") do io
        for (number, line) in enumerate(eachline("track-list.txt"))
            cleaned = strip(remove_url(remove_time_from(line)))
            write(io, "$number. $cleaned\n")
        end
    end
end
main()

Matching dict key to text file and returning Test Pass/fail

I'm a novice at Python, and am currently working on a small test case assignment where I am to find and match the dictionary keys to a small text file, and see if the keys are present in the text file.
The dictionary is as follows:
key_words = {"description, translation": "test_translation(serial,",
             "unit": "test_unit(",}
The text in the text file, henceforth called "requirement.txt", is as follows:
The description shall display the translation of XXX.
The unit shall be hidden.
The value is read from the file "version.txt".
For each key, I am to check whether it is present or absent in the text: a match should return a "test pass", and no match should return a skip.
The keys from the dictionary are to be sorted into a list, then iterated over and matched against the text. (The values from the dictionary are to be sorted into a separate list and iterated over a separate file, which I shall not delve into here.)
This is the code that I currently have (and where I am stuck):
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        line_strings = line.split(' ')
        for word in list:
            if word in line:
                print("Test Pass")
                print(word)
                break
            else:
                print("Test Fail")
        print(line + "\n")
Result currently obtained:
Test Fail
Test Pass
display
The description shall display the translation of XXX.
Test Fail
Test Fail
Test Fail
Test Pass
unit
The unit shall be hidden.
Test Fail
Test Fail
Test Fail
Test Fail
The value is read from the file "version.txt".
Running my current code returns "Test Pass" and "Test Fail" multiple times, suggesting that the keys are iterated over each line and a result is printed for every iteration.
I am stuck on two fronts:
After separating the keys into a list, how do I order them in the sequence "description, translation", "unit"?
How do I modify the code so that the result is returned only once per line, as either "Test Pass" or "Test Fail"?
Results should ideally return in the following format:
Ideal outcome:
('Text:', "The description shall display the translation of XXX.")
('Key:', 'description, translation')
Test Pass
('Text:', 'The unit shall be hidden.')
('Key:', 'unit')
Test Pass
('Text:', 'The value is read from the file "version.txt".')
('Key:', (none))
Test Fail
For your kind enlightenment please, thank you!
Try with this:
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        # Create an empty list which will contain all the words that match
        words_found = []
        for word in list:
            # if the word matches then add it to the list words_found
            if word in line:
                words_found.append(word)
        print("('Text:', \"{}\")".format(line))
        print("('Keys:', \"{}\")".format(words_found))
        # if the list of words found is not empty then the test passed
        if words_found:
            print("Test Passed")
        else:
            print("Test Failed")
The idea is to create a list of the words found and then print them all.
I'm using the format operation, and you can find a guide on how to use it here. The line if(words_found): checks whether the list is non-empty.
Additional Notes
In this case you won't need it, but if you only wanted to solve the second point, you can use the for ... else statement as explained in the docs:
4.4 break and continue Statements, and else Clauses on Loops
Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a break statement.
If you reduce the indentation of the else of your if statement by one level, it becomes the else of the for statement, so it is executed only if the for loop finished without hitting a break, and the problem is solved.
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        line_strings = line.split(' ')
        for word in list:
            if word in line:
                print(word)
                print("Test Pass")
                break
        else:
            print("Test Fail")
        print(line + "\n")
Edit
To split the key into description and translation, we just have to split it at the comma with the built-in split method. Unpacking the result into exactly two names would fail for a key without a comma, such as "unit", so the code below splits each key into a list of terms and requires every term to appear in the line:
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        # Create an empty list which will contain all the keys that match
        words_found = []
        for word in list:
            # a key may hold several comma-separated terms
            terms = [term.strip() for term in word.split(",")]
            # the key matches only if every one of its terms is in the line
            if all(term in line for term in terms):
                words_found.append(word)
        print("('Text:', \"{}\")".format(line))
        print("('Keys:', \"{}\")".format(words_found))
        # if the list of keys found is not empty then the test passed
        if words_found:
            print("Test Passed")
        else:
            print("Test Failed")

How to Store All Text in Between Two Index Positions of Same String in VBScript?

So I am going off memory here because I cannot see the code I am trying to figure this out for at the moment, but I am working with some old VB Script code where there is a data connection that is set like this:
set objCommand = Server.CreateObject("ADODB.command")
and I have a field from the database that is being stored in a variable like this:
Items = RsData("Item")
This specific field in the database is a long string of text:
(i.e. “This is part of a string of text…Header One: Here is text after header one… Header Two: Here is more text after header two”).
There are certain parts of the text that I wish to store as a variable that are between two index positions in the long string of text within that field. They are separated by headers that are stored in the text field above like this: “Header One:” and “Header Two:”, and I want to capture all text that occurs in between those two headers of text and store them into their own variable (i.e. “Here is text after header one…”).
How do I achieve this? I have tried to use the InStr function to find the index, but as I understand it, it only returns the position where a specific part of the string begins. Am I wrong in my thinking? Because of that, I am also having trouble getting the Mid function to work. Can someone please show me an example of how this is supposed to work? Remember, I am only going off memory, so please forgive me that I am unable to provide better code examples now. I hope my question makes sense!
I am hopeful that someone can help me with an answer tonight so I can try this out tomorrow when I am near the code again! Thank you for your efforts and any help offered!
You can extract all the substrings that start after a Header label and end just before either the next Header or the end of the string. I have used a regular expression to implement that and it is working for me. Have a look at the code below. If I find a simpler (non-regex) solution, I will update the answer.
Code:
strTest = "Header One: Some random text Header Two: Some more text Header One: Some random textwerwerwefvxcf234234 Header Three: Some more t2345fsdfext Header Four: Some randsdfsdf3w42343om text Header Five: Some more text 123213"
set objReg = new Regexp
objReg.Global = true
objReg.IgnoreCase = false
objReg.pattern = "Header[^:]+:([\s\S]*?)(?=Header|$)" '<---Regex Pattern. Explained later.
set objMatches = objReg.Execute(strTest)
Dim arrHeaderValues() '<-----This array contains all the required values
i=-1
for each objMatch in objMatches
    i = i+1
    Redim Preserve arrHeaderValues(i)
    arrHeaderValues(i) = objMatch.subMatches.item(0) '<---item(0) indicates the 1st group of each match
next
'Displaying the array values
for i=0 to ubound(arrHeaderValues)
    msgbox arrHeaderValues(i)
next
set objReg = Nothing
Regex Explanation:
Header - matches Header literally
[^:]+: - matches 1+ occurrences of any character that is not a :. This is then followed by matching a :. So far, keeping the above 2 points in mind, we have matched strings like Header One:, Header Two:, Header blabla123: etc. Now, whatever comes after this match is relevant to us. So we will capture that inside a Group as shown in the next breakup.
([\s\S]*?)(?=Header|$) - matches and captures everything (including newlines) until either the next Header or the end of the string (represented by $):
    ([\s\S]*?) - lazily matches 0+ occurrences of any character and captures the match in Group 1
    (?=Header|$) - a positive lookahead that stops the lazy match just before the next instance of the string Header or at the end of the string; the lookahead itself consumes and captures nothing
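For readers who want to try the pattern outside VBScript, here is a minimal sketch using Python's re module. The sample text and the strip() call are assumptions added for illustration; the pattern itself is the one explained above.

```python
import re

text = ("Header One: Some random text Header Two: Some more text "
        "Header Three: last part")

# Skip the "Header ...:" label, lazily capture what follows, and stop
# just before the next "Header" or at the end of the string.
pattern = re.compile(r"Header[^:]+:([\s\S]*?)(?=Header|$)")

# group(1) is the captured text; strip() trims the surrounding spaces.
values = [m.group(1).strip() for m in pattern.finditer(text)]
print(values)
```

Because the quantifier is lazy, each capture ends at the first following Header rather than running on to the last one.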
Alternative Solution(non-regex):
strTest = "Header One: Some random text Header Two: Some more text Header One: Some random textwerwerwefvxcf234234 Header Three: Some more t2345fsdfext Header Four: Some randsdfsdf3w42343om text Header Five: Some more text 123213"
arrTemp = split(strTest,"Header") 'Split using the text Header
j=-1
Dim arrHeaderValues()
for i=0 to ubound(arrTemp)
    strTemp = arrTemp(i)
    intTemp = instr(1,strTemp,":") 'Find the position of : in each array value
    if(intTemp>0) then
        j = j+1
        Redim preserve arrHeaderValues(j)
        arrHeaderValues(j) = mid(strTemp,intTemp+1) 'Store the desired value in array
    end if
next
'Displaying the array values
for i=0 to ubound(arrHeaderValues)
    msgbox arrHeaderValues(i)
next
If you don't want to store the values in an array, you can use Execute statement to create variables with different names during run-time and store the values in them. See this and this for reference.

How do I delete characters in a string up to a certain point in classic asp?

I have a string that at any point may or may not contain one or more / characters. I'd like to be able to create a new string based on this string. The new string would include every character after the very last / in the original string.
Sounds like you're wanting the file name from a URL. In any case, it's the same function. The key is using the InStrRev function to find the first / char, but starting from the right. Here's the function:
Function GetFilename(URL)
    Dim I
    I = InStrRev(URL, "/")
    If I > 0 Then
        GetFilename = Mid(URL, I + 1)
    Else
        GetFilename = URL
    End If
End Function
Split it up into parts and get the last part:
a = split("my/string/thing", "/")
wscript.echo a(ubound(a))
note: Not safe when the string is empty.

xQuery substring problem

I now have a full path for a file as a string like:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
However, now I need to extract only the folder path, that is, the above string without the content after the last slash:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/"
But it seems that the substring() function in XQuery only has substring(string,start,len) or substring(string,start). I am trying to figure out a way to locate the last occurrence of the slash, but no luck.
Could experts help? Thanks!
Try out the tokenize() function (for splitting a string into its component parts) and then re-assembling it, using everything but the last part.
let $full-path := "/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
    $segments := tokenize($full-path,"/")[position() ne last()]
return
    concat(string-join($segments,'/'),'/')
For more details on these functions, check out their reference pages:
fn:tokenize()
fn:string-join()
fn:replace can do the job with a regular expression:
replace("/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
"[^/]+$",
"")
This can be done even with a single XPath 2.0 (subset of XQuery) expression:
substring($fullPath,
          1,
          string-length($fullPath) - string-length(tokenize($fullPath, '/')[last()])
         )
where $fullPath should be substituted with the actual string, such as:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
The following code tokenizes, removes the last token, replaces it with an empty string, and joins back.
string-join(
  (
    tokenize(
      "/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
      "/"
    )[position() ne last()],
    ""
  ),
  "/"
)
It seems to return the desired result on try.zorba-xquery.com. Does this help?
