pyparsing - parse simple line - pyparsing

I'm scratching my head over how to parse this line completely.
I'm having trouble with the '( 4801)' part; all the other elements are being grabbed OK.
# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00
This is what I have so far
from pyparsing import nums, Word, Literal, White, SkipTo, Optional, Suppress, OneOrMore, Group, Combine, ParseException
unparsed_log_data = "# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00.007 Type: Periodic"
binary_name = "# MAIN_PROG"
pid = Literal("(" + nums + ")")
report_id = Combine(Suppress(binary_name) + pid)
year = Word(nums, max=4)
month = Word(nums, max=2)
day = Word(nums, max=2)
yearly_day = Combine(year + "-" + month + "-" + day)
clock24h = Combine(Word(nums, max=2) + ":" + Word(nums, max=2) + ":" + Word(nums, max=2) + Suppress("."))
timestamp = Combine(yearly_day + White(' ') + clock24h).setResultsName("timestamp")
time_bnf = report_id + Suppress("Generated at") + timestamp
time_bnf.searchString(unparsed_log_data)
EDIT:
Paul, if you have the patience,
how would I filter
unparsed_log_data = """
# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00
bla bla bla
multi line garbage
bla bla
Efficiency | 38 38 100 | 3497061 3497081 99 |
more garbage
"""
time_bnf = report_id + Suppress("Generated at") + timestamp
partial_report_ignore = Suppress(SkipTo("Efficiency"))
efficiency_bnf = Suppress("|") + integer.setResultsName("Efficiency") + Suppress(integer) + integer.setResultsName("EfficiencyPercent")
Both
efficiency_bnf.searchString(unparsed_log_data) and
report_and_effic.searchString(unparsed_log_data)
return data as expected,
but if I try
report_and_effic = report_bnf + partial_report_ignore + efficiency_bnf
report_and_effic.searchString(unparsed_log_data)
returns ([], {})
EDIT2:
one should read in the code,
partial_report_ignore = Suppress(SkipTo("Efficiency", include=True))

pid = Literal("(" + nums + ")")
should be
pid = "(" + Word(nums) + ")"
Pyparsing allows you to add strings to expression objects using '+', like:
expr + "some string"
Which gets interpreted as:
expr + Literal("some string")
You wrote Literal("(" + nums + ")"). nums is the string "0123456789", meant to be used when creating Words, as in Word(nums). So what you were trying to match was not "left-paren followed by a word composed of nums followed by right-paren"; you were trying to match the literal string "(0123456789)".
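A minimal sketch of the difference, assuming pyparsing is installed. The sample line comes from the question; the names wrong_pid and right_pid are only for illustration.

```python
from pyparsing import Literal, Word, nums

line = "# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00"

# nums is the plain string "0123456789", so this is Literal("(0123456789)").
wrong_pid = Literal("(" + nums + ")")

# Strings added to an expression are promoted to Literal automatically.
right_pid = "(" + Word(nums) + ")"

wrong_hits = wrong_pid.searchString(line)
right_hits = right_pid.searchString(line)

print(len(wrong_hits))      # the text "(0123456789)" never occurs in the line
print(list(right_hits[0]))  # tokens of the first match
```

pyparsing skips leading whitespace before each token by default, which is why the space in "( 4801)" does not need special handling here.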

Related

How to fix python returning multiple lines in a .csv document instead of one?

I am trying to scrape data from a public forum for a school project, but every time I run the code, the resulting .csv file shows multiple rows for the text variable instead of just one.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = 'https://www.emimino.cz/diskuse/1ivf-repromeda-56566/'
uClient = uReq(my_url)
page_soup = soup(uClient.read(),"html.parser")
uClient.close()
containers = page_soup.findAll("div",{"class":"discussion_post"})
out_filename = "Repromeda.csv"
headers = "text,user_name,date \n"
f = open(out_filename, "w")
f.write(headers)
for container in containers:
    text1 = container.div.p
    text = text1.text
    user_container = container.findAll("span",{"class":"user_category"})
    user_id = user_container[0].text
    date_container = container.findAll("span",{"class":"date"})
    date = date_container[1].text
    print("text: " + text + "\n" )
    print("user_id: " + user_id + "\n")
    print("date: " + date + "\n")
    # writes the dataset to file
    f.write(text.replace(",", "|") + ", " + user_id + ", " + date + "\n")
f.close()
Ideally, I want one row per data entry (i.e. text, user_id, and date in one row), but instead I get multiple rows for one text entry and only one row for the user_id and date entries.
this is the actual output
this is the expected output
The post text contains newline characters, and each one starts a new row in the file. Replace them with a space:
for container in containers:
    text1 = container.div.p
    text = text1.text.replace('\n', ' ')
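As an alternative sketch, the standard csv module can do the quoting, so commas in the text do not have to be swapped for "|". The sample posts and the flatten helper below are hypothetical stand-ins for the scraped values, and an in-memory buffer replaces the real output file.

```python
import csv
import io

def flatten(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())

# Hypothetical rows standing in for the scraped (text, user_name, date) values.
posts = [
    ("First line\nsecond line, with a comma", "user_a", "01.02.2019"),
    ("Single line post", "user_b", "03.02.2019"),
]

buffer = io.StringIO()  # stands in for open("Repromeda.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["text", "user_name", "date"])
for text, user_name, date in posts:
    writer.writerow([flatten(text), user_name, date])

rows = buffer.getvalue().splitlines()
print(len(rows))  # header plus one line per post
print(rows[1])
```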

openxlsx add multiple lines on header/footer

I am trying to build the headers of an excel xlsx workbook with some format.
Some strings are too long and need to be wrapped.
But when I wrap them with strwrap, adding the \n character, setHeaderFooter tells me that I have more than three parameters.
for example:
require (openxlsx)
require (tidyverse)
wb0 <- createWorkbook()
addWorksheet(wb0, "Sheet 1")
label <- "&B EJECUTIVOS DE CUENTA: &B"
value <- "FANARA ARIEL GUSTAVO:476, BARETTO ANA SILVINA:34, NO APLICA:32, SANTOS MAXIMILIANO ARIEL:21, AVILA ROBERTO:19, REGGI PABLO:9, Otros:51"
setHeaderFooter (
wb0, sheet = 1
, header = c (
# "left side / izquierda"
strwrap (
x = paste0 (
label, " "
, substr (value, start = 1, stop = 360)
)
, width = 45
, prefix = "\n", initial = ""
)
, "center header / centro"
, "rigth side / derecha"
)
, footer = c (
"&[Date]"
, NA
, "&[File]"
)
)
strwrap is creating a vector of length 4, so the total number of elements in your header is 6, rather than the 3 that setHeaderFooter requires.
strwrap(x = paste0 (label, " ", substr (value, start = 1, stop = 360)),
width = 45, prefix = "\n", initial = "")
[1] "&B EJECUTIVOS DE CUENTA: &B FANARA ARIEL" "\nGUSTAVO:476, BARETTO ANA SILVINA:34, NO" "\nAPLICA:32, SANTOS MAXIMILIANO ARIEL:21,"
[4] "\nAVILA ROBERTO:19, REGGI PABLO:9, Otros:51"
So let's collapse this into a single string vector by wrapping it in paste:
paste(strwrap(x = paste0 (label, " ", substr (value, start = 1, stop = 360)),
width = 45, prefix = "\n", initial = ""),
collapse="")
[1] "&B EJECUTIVOS DE CUENTA: &B FANARA ARIEL\nGUSTAVO:476, BARETTO ANA SILVINA:34, NO\nAPLICA:32, SANTOS MAXIMILIANO ARIEL:21,\nAVILA ROBERTO:19, REGGI PABLO:9, Otros:51"
Now you'll get a header with the 3 elements that the function expects.

What is the New line in reports for dBase III?

In the generated reports I cannot move to a new line. I can only place 4 fields side by side, but I want to put further fields on a new line.
If you are just printing to your printer (LPT1) as a device, after entering the code to switch devices from the screen to the printer just reference what line number you want to print on. Here's some code from an old program I used to print the page header, and subsequent headers as needed.
Early in your code:
SET CONSOLE OFF && so your output doesn't echo to the screen while printing.
SET PRINTER ON
SET PRINTER TO LPT1
Then call the Prt_Header() function to print the first page header. You must keep track of the line numbers as you print your detail records, and when you get to the bottom of the page, use the EJECT command to kick out that page and call Prt_Header() again.
****************************
STATIC FUNCTION Prt_Header()
****************************
nPage += 1
@ 1, 4 SAY DATE()
@ 1, 55 SAY "MyCompany INTERNATIONAL, INC."
@ 1,121 SAY "Page " + STR( nPage, 4, 0)
@ 2, 51 SAY "MY Report Name"
@ 3, 4 SAY "Pay Group: " + cPayGroup
@ 3, 58 SAY "For Period: " + cPeriodMon + "/" + cPeriodYr
@ 4, 4 SAY cLines
@ 5, 4 SAY "EXECUTIVE " + "(" + cParTitle + "): " + cName
@ 5, 70 SAY "Member #:" + cDist
@ 5,100 SAY "Sponsored: " + STR( nNoSponsored, 5, 0 )
@ 6, 21 SAY cAddress
@ 6,100 SAY "Qualified: " + STR( nQualified, 5, 0 )
if .not. empty( cAddress2 )
   @ 7, 21 SAY cAddress2
   nLine_no := 8
else
   nLine_no := 7
endif
@ nLine_no, 21 SAY TRIM(cCity) + ", "+ cState + " " + cZip + " " + cFullName
nLine_no += 2
@ nLine_no, 4 SAY "LN LEVEL I. D. NAME"
@ nLine_no, 70 SAY "SALES BONUS PCT"
@ nLine_no, 93 SAY "PHONE LAST ORDER STATUS"
@ nLine_no + 1, 4 SAY cLines
nLine_no += 2
nItem := 0
nItem := 0
RETURN NIL
* EOP: Prt_Header()
But, if you're using a report generator this is not what you're looking for.

Recover email address from special application of MD5 hash function

First, we segment the email address into 2-character strings.
Then, for every segment s, we compute the following hash J:
md5(md5(s) + s + md5(s)) [where + is the string concatenation operator].
Finally, we concatenate all hash strings J to form the long hash below.
For example: for an input of helloworld#company.com, we would compute:
md5(md5('he') + 'he' + md5('he')) +
md5(md5('ll') + 'll' + md5('ll')) +
md5(md5('ow') + 'ow' + md5('ow')) +
...
Long Hash:
f894e71e1551d1833a977df952d0cc9de44a1f9669fbf97d51309a2c6574d5eaa746cdeb9ee1a5df
c771d280d33e5672bf024973657c99bf80cb242d493d5bacc771b3b0b422d5c13595cf3e73cfb1df
91caedee7a6c5f3ce2c283564a39c52d3306d60cbc0e3e33d7ed01e780acb1ccd9174cfea4704eb2
33b0f06e52f6d5aba5a5a89e6122dd55f8efcf024961c1003d116007775d60a0d5781d2e35d747b5
dece2e0e3d79d272e40c8c66555f5525
How can I recover the email address from the hash? As I understand it, a "Hash" is a One Way Function. I can only compare it to another hash to see if they match or generate a Hash of the original text.
While it may be true in general that it is impractical to extract the original message from a hash, this clearly looks like an exercise with conditions carefully crafted to make it possible to break the "encryption".
Consider that the email address is broken up into two-character segments. If you limit yourself to just lowercase letters and two symbols (26 letters + 2 symbols, # and .), there are only 28 * 28 = 784 possible two-character combinations. Even if the emails have lowercase and uppercase letters, digits, and those two symbols, there are only 64 * 64 = 4096 combinations -- well within computational limits.
The thing to do is to pre-compute a rainbow table, or table of all possible hash values in your search space. You could do this with a matrix:
  +-----------------------------------+-----------------------------------+-----------------------------------+-----+
  | a                                 | b                                 | c                                 | ... |
  +-----------------------------------+-----------------------------------+-----------------------------------+-----+
a | md5(md5('aa') + 'aa' + md5('aa')) | md5(md5('ba') + 'ba' + md5('ba')) | md5(md5('ca') + 'ca' + md5('ca')) | ... |
  +-----------------------------------+-----------------------------------+-----------------------------------+-----+
b | md5(md5('ab') + 'ab' + md5('ab')) | md5(md5('bb') + 'bb' + md5('bb')) | md5(md5('cb') + 'cb' + md5('cb')) | ... |
  +-----------------------------------+-----------------------------------+-----------------------------------+-----+
c | md5(md5('ac') + 'ac' + md5('ac')) | md5(md5('bc') + 'bc' + md5('bc')) | md5(md5('cc') + 'cc' + md5('cc')) | ... |
  +-----------------------------------+-----------------------------------+-----------------------------------+-----+
  | ...                               | ...                               | ...                               | ... |
  +-----------------------------------+-----------------------------------+-----------------------------------+-----+
but then you would have to traverse the matrix each time to find a match -- slow!
An alternative is to use a dictionary with the key being the hash, and the value being the 'decoded' letters:
{
md5(md5('aa') + 'aa' + md5('aa')): 'aa',
md5(md5('ab') + 'ab' + md5('ab')): 'ab',
md5(md5('ac') + 'ac' + md5('ac')): 'ac',
...
}
Either way, you will now have the hashes for all possible two-letter combinations. Now you process the input string. Since MD5 produces 32-character long hashes, break the input up into 32-character strings, and perform lookups against your table:
'f894e71e1551d1833a977df952d0cc9d' => 'he'
'e44a1f9669fbf97d51309a2c6574d5ea' => 'll'
...
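The split-and-lookup step can be sketched as below. This is a rough illustration, not a definitive solution: the symbol set in ALPHABET is an assumption, the helper names are made up for this example, and the check at the end only round-trips the sample address from the question rather than decoding the long hash above.

```python
import hashlib
import string

def md5_hex(text):
    """Hex MD5 digest of a string."""
    return hashlib.md5(text.encode()).hexdigest()

def segment_hash(segment):
    """md5(md5(s) + s + md5(s)) for one two-character segment."""
    inner = md5_hex(segment)
    return md5_hex(inner + segment + inner)

def encode(address):
    """Hash an address two characters at a time and concatenate the digests."""
    return "".join(segment_hash(address[i:i + 2]) for i in range(0, len(address), 2))

# Assumed symbol set; widen it if a block fails to decode.
ALPHABET = string.ascii_letters + string.digits + "_+.#@-"
TABLE = {segment_hash(a + b): a + b for a in ALPHABET for b in ALPHABET}

def decode(long_hash):
    """Split into 32-character blocks and look each one up; '??' marks a miss."""
    blocks = [long_hash[i:i + 32] for i in range(0, len(long_hash), 32)]
    return "".join(TABLE.get(block, "??") for block in blocks)

sample = "helloworld#company.com"  # even length, so every segment has two characters
print(decode(encode(sample)) == sample)
```

Using TABLE.get with a placeholder means an unexpected character shows up as "??" in the output instead of raising KeyError, which makes it easier to see which symbols are missing from the assumed alphabet.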
Here is an implementation of this approach in Python:
import hashlib, string
# lambda function for plain MD5, returned as a hex digest
md5hashFunction = lambda data: hashlib.md5(data.encode()).hexdigest()
# lambda function for md5(md5(data) + data + md5(data))
finalHash = lambda data: md5hashFunction(
    md5hashFunction(data) + data + md5hashFunction(data)
)
# Every MD5 hex digest is 32 characters long, so the long hash is divided into fixed 32-character parts
hashes = [
    "f894e71e1551d1833a977df952d0cc9d",
    "e44a1f9669fbf97d51309a2c6574d5ea",
    "a746cdeb9ee1a5dfc771d280d33e5672",
    "bf024973657c99bf80cb242d493d5bac",
    "c771b3b0b422d5c13595cf3e73cfb1df",
    "91caedee7a6c5f3ce2c283564a39c52d",
    "3306d60cbc0e3e33d7ed01e780acb1cc",
    "d9174cfea4704eb233b0f06e52f6d5ab",
    "a5a5a89e6122dd55f8efcf024961c100",
    "3d116007775d60a0d5781d2e35d747b5",
    "dece2e0e3d79d272e40c8c66555f5525",
]
# Enumerate all letters, digits, and the extra characters "_+.#" used for decoding
alphabet = list(
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "_+.#"
)
# Create a Python dictionary that maps each hash back to its two-character string
rainbowTable = {finalHash(x + y): x + y for x in alphabet for y in alphabet}
"""
rainbowTable
'31453dd786a8c6f6c7c8860d5fcea4be': 'aa',
'857dce5bcf6b6b32bec281207b2dba80': 'ab',
'e90d94b4b65ac19188fdae82acf7fbbc': 'ac',
'67299b8cedc5eafea7dda1daf9356b54': 'ad',
'40fca4e80bfc6e1faa2c4e2b7e0929f0': 'ae',
'de48fc1bd98f5508c513f9947a514ce8': 'af',
'4852089b1b43b45204907df0066c0edf': 'ag',
'e1b82a5fe4fdcf73d034a0d5063ffe3f': 'ah',
...... Continues....
"""
# Look up each hash and join the decoded segments into a single string
print("".join([rainbowTable[h] for h in hashes]))
"""
f894e71e1551d1833a977df952d0cc9de44a1f9669fbf97d51309a2c6574d5eaa746cdeb9ee1a5df
c771d280d33e5672bf024973657c99bf80cb242d493d5bacc771b3b0b422d5c13595cf3e73cfb1df
91caedee7a6c5f3ce2c283564a39c52d3306d60cbc0e3e33d7ed01e780acb1ccd9174cfea4704eb2
33b0f06e52f6d5aba5a5a89e6122dd55f8efcf024961c1003d116007775d60a0d5781d2e35d747b5
dece2e0e3d79d272e40c8c66555f5525
"""
"""
Output ==> secret_jobs#anvato.com
"""
Here is what you can do:
Step 1: Divide the hash string into 32-character blocks (each MD5 hex digest is 32 characters long).
Step 2: Enumerate all possible 2-character strings over the characters that may appear: letters, digits, and any special characters.
Step 3: For each candidate segment, generate its MD5 hash, concatenate hash + plain-text segment + hash, and generate the MD5 hash of that string.
Step 4: Compare the generated hash with the current block. If it matches, append the segment to a string buffer. Repeat this process until all the blocks are decoded, and you will have your answer.

Python 2.7.3 Math Flaw (40 million is less than six hundred thousand)

SOLVED
My program thinks that 40 million is less than 600,000.
Here is the code:
(Stop it after it loops 20 times)
import re
import urllib2
x = 0
d = 1
c = 1
highestmemberid = 1
highestmembercash = 4301848
while (d==1):
    x = float(x + 1)
    if (x==14 or x==3 or x==11 or x==13 or x==15):
        x = x + 1
    print x,
    url = "http://www.marapets.com/profile.php?id=" + str(x)
    home = opener.open(url)
    matchpoint = re.compile("<td align='left' valign=middle><B style='color:#(......);'>(.*?)</B></td>")
    home = home.read()
    home = home
    points = re.findall(matchpoint,home)
    if ("http://images.marapets.com/stuff/banned.gif" in home or "This user does not exist" in home):
        print "banned/dosen't exist"
    else:
        mp = points[0][1]
        mp = mp.replace(" MP","")
        mpcheck= mp.replace(",","")
        mp = float(mpcheck)
        if (mpcheck > highestmembercash):
            highestmembercash = mpcheck
            highestmemberid = x
            print "richer"
        else:
            print "Not richer!"
        print mp
        print "The richest player in marapets is player id #: " + str(highestmemberid) + "Who has: " + str(highestmembercash) + " MP."
    if(x == 5368561):
        print "The richest player in marapets is player id #: " + str(highestmemberid) + "Who has: " + str(highestmembercash) + " MP."
What the program does is grab cash amounts from the page, and then sees if this is the highest amount. It loops about 5 million times.
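mpcheck is a string. Once it has been assigned to highestmembercash, later comparisons are string-to-string and go character by character, so "40000000" sorts before "600000" because '4' comes before '6'. You want to check that mp > highestmembercash and assign highestmembercash = mp.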
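A small sketch of the difference, written for Python 3 rather than the Python 2 of the question. The parse_mp helper and the sample amounts are made up for illustration.

```python
def parse_mp(raw):
    """Turn a scraped amount such as '40,000,000 MP' into an int."""
    return int(raw.replace(" MP", "").replace(",", ""))

# Strings compare character by character, so '4' < '6' decides the result.
print("40000000" < "600000")                               # True
print(parse_mp("40,000,000 MP") > parse_mp("600,000 MP"))  # True

# Track the richest member using numbers only.
highest_cash, highest_id = 4301848, 1
for member_id, raw in [(2, "600,000 MP"), (3, "40,000,000 MP")]:
    cash = parse_mp(raw)
    if cash > highest_cash:
        highest_cash, highest_id = cash, member_id

print(highest_id, highest_cash)  # 3 40000000
```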
