How to specify encoding in Python HTTP POST?

When I call the imgflip caption API (see below), it doesn't support non-English words. However, it does support non-English words if I use the web generator here (https://imgflip.com/memegenerator). I guess this is because of a string-encoding inconsistency between Python and imgflip. How can I specify the encoding in a Python HTTP POST? Thanks.
import requests
import json

r = requests.post(
    "https://api.imgflip.com/caption_image",
    data={'template_id': 405658,
          'username': '[Username]',
          'password': '[Password]',
          'boxes[0][text]': "カジュアルなこんにちは",
          'boxes[1][text]': "",
          'boxes[2][text]': ""})
print(r.status_code, r.reason)
res = json.loads(r.text)
print(res)
By the way, a valid username and password are required to actually run the API.
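For what it's worth, one way to take control of the encoding explicitly is to URL-encode the form body yourself as UTF-8 and state the charset in the Content-Type header. This is only a minimal sketch under that assumption, not a confirmed fix for the imgflip API:

import urllib.parse
import requests

payload = {'template_id': 405658,
           'username': '[Username]',
           'password': '[Password]',
           'boxes[0][text]': "カジュアルなこんにちは"}
# Encode the form body as UTF-8 ourselves so there is no ambiguity
body = urllib.parse.urlencode(payload, encoding='utf-8')
r = requests.post(
    "https://api.imgflip.com/caption_image",
    data=body,
    headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'})
print(r.status_code, r.json())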

Related

unknown characters "سقوط" are scraped instead of encoding utf-8

I'm trying to scrape a non-English website (https://arzdigital.com/). Here is my spider code. The problem is that although I import "urllib.parse" at the beginning and in the settings.py file I wrote
FEED_EXPORT_ENCODING = 'utf-8'
the spider doesn't encode the output properly (the output looks like this: "سقوط ۱۰ هزار دلاری بیت کوین در عرض یک ساعت؛ علت چه بود؟"). Even using the .encode() function didn't work.
So, here is my spider code:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse

# Percent-encode a non-ASCII URL path
parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
# expected: 'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'


class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                yield {
                    'post_title': post_title
                }
        except AttributeError:
            logging.error("The element didn't exist")
Can anybody tell me where the problem is? Thank you so much!
If there is a charset in the response headers, use it; otherwise HTML defaults to Windows-1252.
If you find the charset ISO-8859-1, substitute it with Windows-1252.
Now you have the right encoding to read the page.
It is best to store everything in full Unicode, UTF-8, so every script is possible.
It may also be that you are looking at the output in a console (on Windows most likely not UTF-8), in which case multi-byte sequences show up as two odd characters. Store the output in a file and open it with Notepad++ or a similar editor, where you can see the encoding and change it. Nowadays even Windows Notepad sometimes recognizes UTF-8.
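A minimal sketch of that advice using requests (the URL and the ISO-8859-1 → Windows-1252 substitution come from the answer; the filename and fallback are assumptions):

import requests

resp = requests.get('https://arzdigital.com/')
charset = resp.encoding or 'utf-8'       # charset from the response headers, if any
if charset.lower() in ('iso-8859-1', 'latin-1'):
    charset = 'windows-1252'             # treat ISO-8859-1 labels as Windows-1252
text = resp.content.decode(charset, errors='replace')

# Store everything as UTF-8 so every script round-trips safely
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(text)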

How to update payload info for python scraping

I have a Python scraper that works for this site:
https://dhhr.wv.gov/COVID-19/Pages/default.aspx
It scrapes the tooltips from one of the graphs reached by clicking the "Positive Case Trends" link at the above URL.
Here is my code:
import re
import requests
import json
from datetime import date
url4 = 'https://wabi-us-gov-virginia-api.analysis.usgovcloudapi.net/public/reports/querydata?synchronous=true'
# payload:
x=r'{"version":"1.0.0","queries":[{"Query":{"Commands":[{"SemanticQueryDataShapeCommand":{"Query":{"Version":2,"From":[{"Name":"c","Entity":"Case Data"}],"Select":[{"Column":{"Expression":{"SourceRef":{"Source":"c"}},"Property":"Lab Report Date"},"Name":"Case Data.Lab Add Date"},{"Aggregation":{"Expression":{"Column":{"Expression":{"SourceRef":{"Source":"c"}},"Property":"Daily Confirmed Cases"}},"Function":0},"Name":"Sum(Case Data.Daily Confirmed Cases)"},{"Aggregation":{"Expression":{"Column":{"Expression":{"SourceRef":{"Source":"c"}},"Property":"Daily Probable Cases"}},"Function":0},"Name":"Sum(Case Data.Daily Probable Cases)"}]},"Binding":{"Primary":{"Groupings":[{"Projections":[0,1,2]}]},"DataReduction":{"DataVolume":4,"Primary":{"BinnedLineSample":{}}},"Version":1}}}]},"CacheKey":"{\"Commands\":[{\"SemanticQueryDataShapeCommand\":{\"Query\":{\"Version\":2,\"From\":[{\"Name\":\"c\",\"Entity\":\"Case Data\"}],\"Select\":[{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"c\"}},\"Property\":\"Lab Report Date\"},\"Name\":\"Case Data.Lab Add Date\"},{\"Aggregation\":{\"Expression\":{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"c\"}},\"Property\":\"Daily Confirmed Cases\"}},\"Function\":0},\"Name\":\"Sum(Case Data.Daily Confirmed Cases)\"},{\"Aggregation\":{\"Expression\":{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"c\"}},\"Property\":\"Daily Probable Cases\"}},\"Function\":0},\"Name\":\"Sum(Case Data.Daily Probable Cases)\"}]},\"Binding\":{\"Primary\":{\"Groupings\":[{\"Projections\":[0,1,2]}]},\"DataReduction\":{\"DataVolume\":4,\"Primary\":{\"BinnedLineSample\":{}}},\"Version\":1}}}]}","QueryId":"","ApplicationContext":{"DatasetId":"fb9b182d-de95-4d65-9aba-3e505de8eb75","Sources":[{"ReportId":"dbabbc9f-cc0d-4dd0-827f-5d25eeca98f6"}]}}],"cancelQueries":[],"modelId":339580}'
x=x.replace("\\\'","'")
json_data = json.loads(x)
final_data2 = requests.post(url4, json=json_data, headers={'X-PowerBI-ResourceKey': 'ab4e5874-7bbf-44c9-9443-0701abdee612'}).json()
print(json.dumps(final_data2))
The issue is that some days it stops working because the payload and the X-PowerBI-ResourceKey header values change, and I have to manually copy and paste the new values from the browser's network inspector into my source. Is there a way to programmatically obtain these from the webpage and construct them in my code?
I'm pretty sure the resource key is part of the iframe URL, encoded as Base64.
from base64 import b64decode
from bs4 import BeautifulSoup
import json
import requests

resp = requests.get('https://dhhr.wv.gov/COVID-19/Pages/default.aspx')
soup = BeautifulSoup(resp.text, 'html.parser')
# the last query-string component of the iframe src is the Base64-encoded embed config
data = soup.find_all('iframe')[0]['src'].split('=').pop()
decoded = json.loads(b64decode(data).decode())
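From there, one could feed the recovered key back into the original request (reusing the url4 and json_data objects from the question). Note that the 'k' field name inside the decoded embed config is an assumption here; print decoded first to verify it:

print(decoded)                    # inspect the decoded embed config
resource_key = decoded.get('k')   # assumed field name for the resource key
final_data2 = requests.post(
    url4,                         # same query URL and json_data as in the question
    json=json_data,
    headers={'X-PowerBI-ResourceKey': resource_key}).json()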

Using httr POST to Create an Azure DevOps Work Item

I'm attempting to set up some R code to create a new work item task in Azure DevOps. I'm okay with a mostly empty work item to start with, if that's possible (my example code only tries to create a work item with a title).
I receive a 203 response but the work item doesn't appear in DevOps.
I've been following this documentation from Microsoft, and I suspect that I might be formatting the body incorrectly:
https://learn.microsoft.com/en-us/rest/api/azure/devops/wit/work%20items/create?view=azure-devops-rest-5.1
I've tried updating different fields and formatting the body differently with no success. I have attempted to create either a bug or a feature work item, but both return the same 203 response.
To validate that my token is working, I can GET work item data by ID, but the POST continues to return a 203.
require(httr)
require(jsonlite)

url <- 'https://dev.azure.com/{organization}/{project}/_apis/wit/workitems/$bug?api-version=5.1'

headers = c(
  'Authorization' = sprintf('basic %s', token),
  'Content-Type' = 'application/json-patch+json',
  'Host' = 'dev.azure.com'
)

data <- toJSON(list('body' = list("op" = "add",
                                  "path" = "/fields/System.AreaPath",
                                  "value" = "Sample task")),
               auto_unbox = TRUE, pretty = TRUE)

res <- httr::POST(url,
                  httr::add_headers(.headers = headers),
                  httr::verbose(),
                  body = data)
I'm expecting a 200 response (similar to the example in the link above) and a work item task in Azure DevOps Services when I navigate to the website.
I'm not the best with R, so please be detailed. Thank you in advance!
The POST continues to return a 203.
The HTTP response code 203 means Non-Authoritative Information; it is most likely caused by your token being converted to the wrong format.
If you wish to provide the personal access token through an HTTP header, you must first convert it to a Base64 string.
As that doc describes, if you want to use the Azure DevOps (VSTS) REST API, you must convert your token to a Base64 string, but your script never performs this conversion.
So please try the following script to convert the token so that the key conforms to the requirements (load the base64enc package first):
require(base64enc)
key <- token
keys <- charToRaw(paste0(key,":token"))
auth <- paste0("Basic ",base64encode(keys))
Hope this helps you get a 200 response code.
I know this question is fairly old, but I cannot seem to find a good solution posted yet, so I will add mine in case others find themselves in this situation. Note, this did take some reading through other SO posts and trial and error.
Mengdi is correct that you do need to convert your token to a Base64 string.
Additionally, Daniel from this SO question pointed out that:
In my experience with doing this via other similar mechanisms, you have to include a leading colon on the PAT, before base64 encoding.
Mengdi also came up big in another SO solution:
Please try with adding [{ }] outside your request body.
From there, I just made slight modifications to your headers and data objects: I removed 'body' from your JSON and used paste to add the square brackets. I found that the RCurl package made Base64 encoding a breeze. I was then able to successfully create a blank (title-only) work item through the API. Hope this helps someone!
library(httr)
library(jsonlite)
library(RCurl)

# user and PAT for the API
userid <- ''
token <- 'whateveryourtokenis'

url <- 'https://dev.azure.com/{organization}/{project}/_apis/wit/workitems/$bug?api-version=5.1'

# create a combined user/PAT string;
# the user id can actually be a blank string.
# RCurl's base64 seemed to work well
c_id <- RCurl::base64(txt = paste0(userid, ":", token),
                      mode = "character")

# headers for the API call
headers <- c(
  "Authorization" = paste("Basic", c_id, sep = " "),
  'Content-Type' = 'application/json-patch+json',
  'Host' = 'dev.azure.com'
)

# body: a JSON Patch array, hence the added square brackets
data <- paste0("[",
               toJSON(list("op" = "add",
                           "path" = "/fields/System.Title",
                           "value" = "API test - please ignore"),
                      auto_unbox = TRUE,
                      pretty = TRUE),
               "]")

# make the call
res <- httr::POST(url,
                  httr::add_headers(.headers = headers),
                  httr::verbose(),
                  body = data)

# check status
status <- res$status_code

# check the content of the response
check <- content(res)

Python Requests taking a long time

Basically, I am working on a Python project where I download and index files from the SEC EDGAR database. The problem, however, is that when using the requests module, it takes a very long time to save the text in a variable (between ~130 and 170 seconds for one file).
The file has roughly 16 million characters, and I wanted to see if there is any way to easily lower the time it takes to retrieve the text. -- Example:
import requests
url ="https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
The problem is in the code behind r.text, specifically when no encoding was given (r.encoding is None). The time spent detecting the encoding was 20 seconds; I was able to skip it by defining the encoding.
...
r.encoding = 'utf-8'
...
Additional details
In my case, my request was not returning an encoding type. The response was 256 KB in size, and r.apparent_encoding was taking 20 seconds.
Looking into the text property function: it tests whether there is an encoding. If there is none, it calls the apparent_encoding function, which scans the text to autodetect the encoding scheme.
On a long string this takes time. By defining the encoding of the response (as described above), you skip the detection.
Validate that this is your issue
In your example above:
from datetime import datetime
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)
print(datetime.now())
enc = r.apparent_encoding
print(enc)
print(datetime.now())
print(r.text)
print(datetime.now())
r.encoding = enc
print(r.text)
print(datetime.now())
Of course, the output may get lost in the printing, so I recommend you run the above in an interactive shell; it may become more apparent where you are losing the time, even without printing datetime.now().
From @martijn-pieters:
Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
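Putting both suggestions together, here is a minimal sketch that sets the encoding up front and streams the response straight to a file instead of printing it (the output filename is arbitrary):

import requests

url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
r.encoding = 'utf-8'  # only needed if you later access r.text; skips apparent_encoding

# write the raw bytes to disk instead of decoding and printing ~16 million characters
with open('goog10-kq42016.htm', 'wb') as f:
    for chunk in r.iter_content(chunk_size=64 * 1024):
        f.write(chunk)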

python3 imaplib search function encoding

Can someone point out how to properly search using imaplib in Python? The email server is Microsoft Exchange, which seems to have problems, but I would want a solution from the Python/imaplib side.
https://github.com/barbushin/php-imap/issues/128
I so far use:
import imaplib
M = imaplib.IMAP4_SSL(host_name, port_name)
M.login(u, p)
M.select()
s_str = 'hello'
M.search(s_str)
And I get the following error:
>>> M.search(s_str)
('NO', [b'[BADCHARSET (US-ASCII)] The specified charset is not supported.'])
search takes two or more parameters: a charset and the search criteria. You can pass None as the charset to not specify one; hello is not a valid charset.
You also need to specify what you are searching for: IMAP has a complex search language, detailed in RFC 3501 §6.4.4, and imaplib does not provide a high-level interface for it.
So, with both of those in mind, you need to do something like:
search(None, 'BODY', '"HELLO"')
or
search(None, 'FROM', '"HELLO"')
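Put together with the question's snippet, a minimal sketch would look something like this (host_name, port_name, u, and p are the placeholders from the question):

import imaplib

M = imaplib.IMAP4_SSL(host_name, port_name)
M.login(u, p)
M.select()

# charset of None, then the criteria: search message bodies for the word "hello"
typ, msg_ids = M.search(None, 'BODY', '"hello"')
print(typ, msg_ids)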
