UnicodeEncodeError: "ascii" can't encode character '\xe0' while parsing HTML (Python) - python-3.4

I'm building a web scraper that parses HTML by subclassing HTMLParser from the html.parser library, with "convert_charrefs" set to True. The program downloads a page with "downloadPage(url)" and passes it to myParser (I think it's better not to paste all my code here). When the parser finds a link I'm interested in (e.g. Attività e procedimenti) on a web site, the program gets the value of its "href" attribute, downloads the linked page with "downloadPage(href)", passes it to myParser, and so on.
The code for downloadPage(href) is the following:
import re
import urllib.request

def getCharset(response):
    # Read the charset from the Content-Type header; fall back to ASCII.
    contentType = response.info()["Content-type"]
    if contentType:
        match = re.search(r'charset="?([\w.-]+)', contentType)
        if match:
            return match.group(1)
    return "ascii"

def downloadPage(url):
    response = urllib.request.urlopen(url)
    charset = getCharset(response)
    return response.read().decode(charset)
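As an alternative to the regular expression, the standard library can parse the header itself: the object returned by response.info() is an email.message.Message subclass, which provides get_content_charset(). The sketch below shows that method on a hand-built message so it runs without a network connection; charset_from_content_type and the UTF-8 default are assumptions made for illustration, not part of the original code.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lower case, or `default` when none is declared.
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

With a real response, response.info().get_content_charset(failobj="utf-8") should perform the same lookup.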
Now, the problem is that some links contain accented vowels, such as "http://città.it/" (this URL is made up). Not every link found in a web page contains non-ASCII characters, so the following call only sometimes raises UnicodeEncodeError:
urllib.request.urlopen(url)
Note that I can't know in advance how each link is composed.

I have solved this problem in this way:
import urllib.parse

def fromIriToUri(iri):
    # Percent-encode every URL component that contains non-ASCII characters.
    myUri = []
    for part in urllib.parse.urlsplit(iri):
        try:
            part.encode("ascii")
            myUri.append(part)
        except UnicodeEncodeError:
            myUri.append(urllib.parse.quote(part))
    return urllib.parse.urlunsplit(myUri)
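
One caveat with the function above: it percent-encodes the host name too, whereas DNS lookups generally expect internationalized host names in IDNA (Punycode) form. Here is a minimal sketch of a variant, assuming the URL has no user-info part and that UTF-8 percent-encoding is what the server expects for the path and query. The name iri_to_uri is hypothetical, not part of the original code.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (sketch; assumes no user-info)."""
    parts = urlsplit(iri)
    # Host name: IDNA encoding via Python's built-in "idna" codec.
    netloc = parts.netloc.encode("idna").decode("ascii")
    # Other components: UTF-8 percent-encoding. "%" stays safe so that
    # already-encoded sequences are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+/?:@,;")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://example.com/attività"))
print(iri_to_uri("http://città.it/"))
```

URLs that are already plain ASCII pass through unchanged, so the function can be applied to every link without checking it first.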

Related

Why is a Base64 string displayed as empty in Message Box?

I have to encode some HTML source code as base64 before form submission, and then decode it back to the original in the code-behind. Here is my test code using MsgBox:
MsgBox(HttpContext.Current.Request.Form("encodedSourceCode"))
MsgBox(Convert.ToString(HttpContext.Current.Request.Form("encodedSourceCode").GetType()))
Dim b = Convert.FromBase64String(HttpContext.Current.Request.Form("encodedSourceCode"))
Dim html = System.Text.Encoding.UTF8.GetString(b)
MsgBox(html)
And I have added an alert() for encodedSourceCode in client script.
The results turn out to be:
First MsgBox: Empty
Second MsgBox: "System.String"
Last MsgBox: Original HTML source code
And the JS alert dialog shows the base64 string, which consists of a bunch of digits and letters.
In short, everything is fine, except the first MsgBox, which is supposed to be base64 encoded string but turns out to be empty. Why? Is it normal?
Actually it does not matter much because even the final result (after decoding) seems to have no problem, but I'm just curious why the interim result is not shown as what it's supposed to be.
It seems that the string is simply too long without 'wrappable' characters, I suppose. MsgBox cuts out the 'last word' and shows nothing.
This may confirm it:
dim test = HttpContext.Current.Request.Form("encodedSourceCode")
MsgBox(test) ' empty
test = test.Substring(0, 20)
MsgBox(test) ' shows the first 20 characters
Testing in LINQPad, I get a limit of around 43,000 characters:
MsgBox("".PadLeft(43000, "a"))
MsgBox("".PadLeft(44000, "a"))
MsgBox("".PadLeft(43000, "a") & " " & "".PadLeft(1000, "a"))
1st: shows the text.
2nd: shows an empty box (length = 44,000).
3rd: shows the text, although the total length is 44,001, because it can wrap at the space.
It definitely has nothing to do with base64 strings, as they are simple strings. Here is the proof:
Dim myString = "Hello world, this is just an ɇxâmpŀƏ ʬith some non-ansi characters..."
Dim myEncoding As Encoding = Encoding.UTF8
MsgBox(myString)
Dim myBase64 = Convert.ToBase64String(myEncoding.GetBytes(myString))
MsgBox(myBase64)
Dim myStringAgain = myEncoding.GetString(Convert.FromBase64String(myBase64))
MsgBox(myStringAgain)
MsgBox(If(StringComparer.Ordinal.Equals(myString, myStringAgain), "same", "different"))
The line
MsgBox(Convert.ToString(HttpContext.Current.Request.Form("encodedSourceCode").GetType()))
results in "System.String" because you convert the name of the type to a string (see xxx.GetType()).

python-requests %2F characters

I am building a request containing a list of parameters: a list of endpoints read from a file, all of which contain "/" characters.
First the file is read as:
pointRef = []
with open("myfolder/" + scope, 'r') as f:
    for line in f:
        pointRef.append(line.strip())
then passing
params = {'endDate': endDate, 'startDate': startDate, 'pointRef': pointRef}
and executing
r = requests.get(url=url_ranged_multiple, headers=headers, params=params)
This gives an error (other requests I tried by hand work). I noticed that the final URL composed by "requests.get" contains "%2F" instead of "/".
I wonder whether this is the problem and, if so, how I can correct it.
Many thanks in advance.
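For context, "%2F" is the standard percent-encoding of "/" inside a query string, so a server that decodes the query normally would see the same value. If this API really does require literal slashes (an assumption, since the question does not show the error), one possible approach is to build the query string with urllib.parse.urlencode and safe="/". The dates and point references below are made-up placeholders.

```python
from urllib.parse import urlencode

# Placeholder values; the real ones come from the file and the caller.
params = {
    "endDate": "2020-01-31",
    "startDate": "2020-01-01",
    "pointRef": ["site/a", "site/b"],
}

# doseq=True repeats the key for each list item; safe="/" leaves slashes as-is.
query = urlencode(params, doseq=True, safe="/")
print(query)
```

requests also accepts an already-encoded string for params, so requests.get(url, params=query) should send the slashes unescaped.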

How do i decode this string? \xc3\x99\xc3\xa9\xc2\x87-B[x\xc2

This is what I need to decode
\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab
It is generated by String.fromCharCode(arrayPw[i]);
but I don't understand how to decode it :(
Please help.
Python:
# A bytes literal, so the decode call also works on Python 3.
data = b"\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab"
udata = data.decode("utf-8")
asciidata = udata.encode("ascii", "ignore")
JavaScript:
function decode_utf8(s) {
    return decodeURIComponent(escape(s));
}
Otherwise do more research about decoding UTF-8.
https://gist.github.com/chrisveness/bcb00eb717e6382c5608
There's also an online UTF-8 decoder/encoder:
https://mothereff.in/utf-8
HINT: ÙÙé-B[x¾æEz«
Duplicate of this: https://stackoverflow.com/a/70815136/5902698
You load a dataset and find some strange characters in it.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese. So I assume that whoever sent me the data encoded it as UTF-8, but it was then read as 'ISO-8859-1'.
So as a first step I encode the string back to ISO-8859-1, and then I decode it as UTF-8.
My lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I don't think I fully understand what happens under the hood, so feel free to tell me if you have further information.
Bonus: I'll try to detect when the string is in the first strange format, because some of my entries are in Chinese but others are in English.
EDIT: The bonus is unnecessary. I just use a lambda on my column to encode and decode regardless of format. So I changed the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))
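A small round trip may make the mechanics clearer: text that was encoded as UTF-8 and then decoded as ISO-8859-1 turns into mojibake, and reversing those two steps restores it. The sketch below also leaves alone any string that was never garbled, which is one way to handle mixed Chinese and English entries. fix_mojibake is a hypothetical helper, and it assumes the wrong decoding really was ISO-8859-1.

```python
def fix_mojibake(text, wrong="iso-8859-1", right="utf-8"):
    """Undo a wrong decoding; return the text unchanged if that is impossible."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside `wrong` (it was never
        # garbled) or its bytes are not valid `right` data: leave it alone.
        return text


original = "礼盒"
# Simulate the damage: UTF-8 bytes read as ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

print(fix_mojibake(garbled) == original)
print(fix_mojibake("Dyson Airwrap"))
```

Unlike the 'ignore' variant, this keeps a string as it is when it cannot be converted cleanly, instead of silently dropping characters.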

Sign a soap Message through Python

I am struggling to sign an XML SOAP message with a private key. I have done it in Java in the past, but I am having a very hard time doing it in Python. I keep a template XML in my home directory, with values filled in for the "BinarySecurityToken" and "KeyInfo" tags. These values were generated through SoapUI using the same private key (the URI that points to the Body tag is always the same). After that I calculate the digest value of the whole Body tag and put it in the "DigestValue" tag inside "SignedInfo". Then I canonicalize the SignedInfo tag and calculate the "SignatureValue" over it. But ultimately, when I pass this SOAP XML to the web service, I get a policy failure message (because of wrong signature generation). Below is my code:
body = etree.tostring(root.find('.//{http://schemas.xmlsoap.org/soap/envelope/}Body'))
c14n_exc = True
ref_xml = canonicalize(body, c14n_exc)
digest_value = sha1_hash_digest(ref_xml)
# Insert the digest value
for soapheader in root.xpath('soapenv:Header/wsse:Security/ds:Signature/ds:SignedInfo/ds:Reference', namespaces=ns):
    soaptag = etree.XPathEvaluator(soapheader, namespaces=ns)
    soaptag('ds:DigestValue')[0].text = digest_value
signed_info_xml = etree.tostring(root.find('.//{http://www.w3.org/2000/09/xmldsig#}SignedInfo'))
signed_info = canonicalize(signed_info_xml, c14n_exc)
pkey = RSA.load_key("privkeyifind.pem", lambda *args, **kwargs: "nopass")
signature = pkey.sign(hashlib.sha1(signed_info).digest())
signature_value = base64.b64encode(signature)
# Insert the signature value
for signedInfo in root.xpath('soapenv:Header/wsse:Security/ds:Signature', namespaces=ns):
    signtag = etree.XPathEvaluator(signedInfo, namespaces=ns)
    signtag('ds:SignatureValue')[0].text = signature_value
canonReq = canonicalize(etree.tostring(root), c14n_exc)
# Post the canonicalized request with curl
proc = Popen(["curl", "-k", "-s", "--connect-timeout", '3', '--data-binary', canonReq, "https://world-service-dev.intra.aexp.com:4414/worldservice/CLIC/CaseManagementService/V1"], stdout=PIPE, stderr=PIPE)
response, err = proc.communicate()
#######################################################
#Method to generate the digest value of the xml message
#######################################################
def sha1_hash_digest(payload):
    "Create a SHA1 hash and return the base64 string"
    return base64.b64encode(hashlib.sha1(payload).digest())
#####################################
#Method to canonicalize a request XML
#to remove tabs, line feeds/spaces,
#quoting, attribute ordering and form
#a proper XML
#####################################
def canonicalize(xml, c14n_exc=True):
    "Return the canonical (c14n) form of the xml document for hashing"
    # UTF8, normalization of line feeds/spaces, quoting, attribute ordering...
    output = StringIO()
    # use faster libxml2 / lxml canonicalization function if available
    et = lxml.etree.parse(StringIO(xml))
    et.write_c14n(output, exclusive=c14n_exc)
    return output.getvalue()
I can only use what is available with Python 2.6.6; I cannot download message-signing libraries such as signxml (due to restrictions on the environment).
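When debugging a signature mismatch like this, it can help to check each stage in isolation. As a minimal sketch, the digest helper can be compared against a well-known value: the SHA-1 digest of empty input, base64-encoded. The check below only uses hashlib and base64, and it says nothing about whether the canonicalization or signing steps are correct.

```python
import base64
import hashlib


def sha1_hash_digest(payload):
    """Return the base64-encoded SHA-1 digest of a byte string."""
    return base64.b64encode(hashlib.sha1(payload).digest())


# SHA-1 of empty input is da39a3ee5e6b4b0d3255bfef95601890afd80709 in hex.
assert sha1_hash_digest(b"") == b"2jmj7l5rSw0yVb/vlWAYkK/YBwk="
```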

Exploding a string ASP

I have the following which is returned from an api call:
<WORST>0</WORST>
<AVERAGE>93</AVERAGE>
<START>1</START>
I need to parse this to just give me the <AVERAGE></AVERAGE> number, 93.
Here's what I'm trying, but I get an error:
res = AjaxGet(url)
myArray = split(res,"AVERAGE>")
myArray2 = split(myArray[1],"</AVERAGE>")
response.write myArray2[0]
I'm brand new to ASP; I normally code in PHP.
VBScript doesn't recognise square brackets [] for accessing array elements; they produce a syntax error in the VBScript engine. Use parentheses instead.
Also note that splitting on "AVERAGE>" matches the closing tag as well, so the result would come back as "93</"; split on "<AVERAGE>" instead.
Try making the following changes to the code snippet to fix this problem:
res = AjaxGet(url)
myArray = split(res,"<AVERAGE>")
myArray2 = split(myArray(1),"</AVERAGE>")
response.write myArray2(0)
On a side note:
Parsing XML data in this way is really inefficient. If the AjaxGet() function returns an XML response, you could use the XML DOM / XPath to locate the node and access its value.
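The DOM approach can be illustrated as follows. This is a sketch in Python rather than VBScript, using the standard xml.etree.ElementTree module. It assumes the response is exactly the fragment shown in the question, which has several top-level elements and therefore needs a wrapper element (here the made-up name root) before it can be parsed.

```python
import xml.etree.ElementTree as ET

# The API response from the question, used here as a literal string.
response_text = "<WORST>0</WORST>\n<AVERAGE>93</AVERAGE>\n<START>1</START>"

# Wrap the fragment so the parser sees a single root element.
root = ET.fromstring("<root>" + response_text + "</root>")

# Look the element up by name instead of splitting strings.
average = int(root.findtext("AVERAGE"))
print(average)
```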

Resources