Goutte / Symfony DOM Crawler download file from form

Goutte / Symfony DOM Crawler download file from form - web-scraping

There is a form in the remote page, which, after submitted, automatically download specific file to your computer. How could I grab that file and store it on server using Goutte or native Symfony DOM Crawler?
Currently I have this code:
$client = new Client();
$client->setHeader('user-agent', "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36");
$crawler = $client->request('GET', 'ADDRESS');
$form = $crawler->selectButton('Get Results')->form();
$crawler = $client->submit($form);
If Goutte does not allow to do this, which technology would?

Related

I am loging into the following website via splash lua script but can't logg in

I want to log in into the https://login.starcitygames.com/ website by using splash integration with lua script. i first check in locallhost for testing them.
when detecting all the form css tags and entering log in credentials i failed to logged in.
The code are here:
function main(splash)
splash:set_custom_headers({
["user-agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36",
})
local url = splash.args.url
assert(splash:go(url))
assert(splash:wait(10))
splash:set_viewport_full()
local search_input = splash:select('input[type=text]')
search_input:send_text("(censored)#gmail.com")
local search_input = splash:select('input[name=password]')
search_input:send_text("(censored)")
assert(splash:wait(5))
local submit_button = splash:select('button[type=submit]')
submit_button:click()
assert(splash:wait(15))
return {
html = splash:html(),
png = splash.png(),
}
end
After when i run it on localhost 'http://0.0.0.0:8050/' then the following results is come and can't logged in .
May be the css tags i use is incorrect or anything .
I am new to splash lua so don't understand it.
the output is:

Try to replace local search_input = splash:select('input[type=text]') with local search_input = splash:select('input[name=username]').

How to get a response cookie with flurl

How to get a response cookie with flurl? I've searched for some references and studied on flurl.dev but still confused how to apply them. sorry I am not a programmer, I still have a lot to learn.
Simple code that i use :
var strUrl = await url
.WithHeaders(new
{
user_agent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
content_type = "application/json",
cookie = "cookie"
})
.PostJsonAsync(new
{
user = "user",
password = "password"
})
.ReceiveString();
Result : 200 OK

The problem here is that when you call ReceiveString(), you're getting the response body and effectively discarding the other aspects of the response message returned by PostJsonAsync. You can get the response, cookies, and body in separate steps like this:
var resp = await url
.WithHeaders(...)
.PostJsonAsync(...);
var cookies = resp.Cookies; // list of FlurlCookie objects
var body = await resp.GetStringAsync();

extract email from craiglists post

Is there a way to find the email from listing on craigslist without the use of selenium
import requests,re
from bs4 import BeautifulSoup as bs
url='https://newyork.craigslist.org/wch/prk/d/hawthorne-10x15-drive-up-storage-unit/7122801839.html' #example url
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
res=requests.get(url,headers=headers)
the email changes with each request made (I assume), I tried x=re.findall('(\w{32})',res.text) but it doesn't work

Craigslist fetches the email address by sending a POST request to this special URL:
https://newyork.craigslist.org/contactinfo/nyc/prk/U_ID
The value of this U_ID is 7122801839 in this case (from the URL you provided).
You can replicate this request like this:
from bs4 import BeautifulSoup
import requests
import json
U_ID = "7122801839"
URL = f"https://newyork.craigslist.org/contactinfo/nyc/prk/{U_ID}"
COOKIE_VALUE = "cookie" # Replace this with a valid cookie
HEADERS = {
'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
'Accept':'*/*',
'Accept-Language':'en-us',
'Accept-Encoding':'gzip, deflate, br',
'Host':'newyork.craigslist.org',
'Origin':'https',
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15',
'Connection':'keep-alive',
'Referer':'https',
'Content-Length':'44816',
'Cookie':COOKIE_VALUE,
'X-Requested-With':'XMLHttpRequest',
}
PAYLOAD = {
'MIME Type':'application/x-www-form-urlencoded; charset=UTF-8',
}
response = requests.request(
method='POST',
url=URL,
headers=HEADERS,
data=PAYLOAD
)
html = json.loads(response.text)['replyContent']
soup = BeautifulSoup(html,'html.parser')
email = soup.find(class_='mailapp').get('href')
email = email.split('?subject')[0].replace('mailto:','')
print(email)
Please note that this code won't work without a cookie, so you will need to copy the cookie from your browser.

Losing information when using BeautifulSoup

I am following the guide of 'Automate the Boring Stuff with Python'
practicing a project called 'Project: “I’m Feeling Lucky” Google Search'
but the CSS selector returns nothing
import requests,sys,webbrowser,bs4,pyperclip
if len(sys.argv) > 1:
address = ' '.join(sys.argv[1:])
else:
address = pyperclip.paste()
res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,"html.parser")
linkElems = soup.select('.r a')
for i in range (5):
webbrowser.open('http://google.com' + linkElems[i].get('href'))**
I already tested the same code in the IDLE shell
It seems that
linkElems = soup.select('.r')
returns nothing
and after I checked the value returned by beautiful soup
soup = bs4.BeautifulSoup(res.text,"html.parser")
I found all class='r' and class='rc' is gone for no reason.
But they were there in the raw HTML file.
Please tell me why and how to avoid such problems

To get version of HTML where it's defined class r, it's necessary to set User-Agent in headers:
import requests
from bs4 import BeautifulSoup
address = 'linux'
headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text,"html.parser")
linkElems = soup.select('.r a')
for a in linkElems:
if a.text.strip() == '':
continue
print(a.text)
Prints:
Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux
...and so on.

The reason why Google blocks your request is because default requests user-agent is python-requests. Check what's your user-agent thus blocking your request and resulting in completely different HTML with different elements and selectors. But sometimes you can receive a different HTML, with different selectors when using user-agent.
Learn more about user-agent and HTTP request headers.
Pass user-agent into request headers:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Try to use lxml parser instead, it's faster.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "My query goes here"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
print(link)
-----
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from JSON string rather than figuring out how to extract, maintain or bypass blocks from Google.
Code to integrate:
params = {
"engine": "google",
"q": "My query goes here",
"hl": "en",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['link'])
-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Disclaimer, I work for SerpApi.

POST data into ASPX with requests. How to fix "remote mashine error" message

I'm trying to scrap data from aspx page using request with POST data.
On parsed html I'm getting an error "An application error occurred on the server. The current custom error settings for this application prevent the details of the application error from being viewed remotely (for security reasons). It could, however, be viewed by browsers running on the local server machine."
I was searching for solutions a while but frankly I'm new in Python and can't really figure out what's wrong.
The ASPX has javaonclick function which opens a new window with data in html.
The code I've created is below.
Any help or suggestions would be greatly welcomed. Thank you!
import requests
from bs4 import BeautifulSoup
session = requests.Session()
url = 'http://ws1.osfi-bsif.gc.ca/WebApps/FINDAT/Insurance.aspx?T=0&LANG=E'
r=session.get(url)
soup = BeautifulSoup(r.content,'lxml')
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
payload = {
r'__EVENTTARGET': r'',
r'__EVENTARGUMENT': r'',
r'__LASTFOCUS': r'',
r'__VIEWSTATE': viewstate,
r'__VIEWSTATEGENERATOR': r'B2E4460D',
r'__EVENTVALIDATION': eventvalidation,
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionType': r'radioButton1',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionDropDownList': r'F018',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportTemplateDropDownList': r'C_LIFE-1',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportDateDropDownList': r'3+-+2015',
r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$submitButton': r'Submit'
}
HEADER = {
"Content-Type":"application/x-www-form-urlencoded",
"Content-Length":"11759",
"Host":"ws1.osfi-bsif.gc.ca",
"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language":"en-US,en;q=0.5",
"Cache-Control": "max-age=0",
"Accept-Encoding":"gzip, deflate",
"Connection":"keep-alive",
}
df = session.post(url, data=payload, headers=HEADER)
print df.text

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Goutte / Symfony DOM Crawler download file from form - web-scraping

Related

I am loging into the following website via splash lua script but can't logg in

How to get a response cookie with flurl

extract email from craiglists post

Losing information when using BeautifulSoup

POST data into ASPX with requests. How to fix "remote mashine error" message

Categories

Resources