I am trying to log into this website, https://lse.co.uk, but have been unsuccessful. I have looked on StackOverflow and read multiple questions and answers, but they are all different; either that, or I missed one that matches this case.
This is what I have:
import requests

login_url = "https://www.lse.co.uk/login.html"
s = requests.Session()
payload = {
    "txtEmail": "some@email.co.uk",
    "txtPassword": "somepassword"
}
r = s.post(login_url, data=payload)
I also tried the above, but with the credentials encoded in Base64.
Inspecting the HTML in Chrome, I can see a Base64 string. Should I capture this and encode both username and password with it? The Base64 string is not visible in the output of r.content, so I am not sure how to do this either.
Looking at the form, it's likely you are not submitting all of its inputs; simply sending the two fields you filled in isn't enough.
The code reading the form is probably expecting more from your request. First, there are two hidden inputs that give some context:
<input type="hidden" name="txtFormType" value="LOGIN">
<input type="hidden" name="txtLoginSource" value="MAIN">
So you should add them to your scraping code:
>>> payload = {
...     "txtEmail": "some@email.co.uk",
...     "txtPassword": "somepassword",
...     "txtFormType": "LOGIN",
...     "txtLoginSource": "MAIN"
... }
If you're lucky, that's all it's looking for, and the form will work.
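Putting it together with the session from the question (a sketch; the success check at the end is an assumption, since we don't know what the site returns on a good login):

import requests

s = requests.Session()
r = s.post("https://www.lse.co.uk/login.html", data=payload)
# crude check: a failed login usually bounces you back to the login page
print(r.status_code, "login" in r.url)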
If you're not that lucky, it means you need to provide the recaptcha hidden element, which is there to prevent scripted access to the login page (mostly to stop brute force by bots, with the side effect of being a brain fsck for people writing legitimate scripts).
So let's check that:
>>> result = requests.get(login_url)
Then you need to use an HTML parser, like lxml:
>>> from lxml import etree
and you have to parse the HTML:
>>> page = etree.fromstring(result.text, etree.HTMLParser())
and there you try to fetch it:
>>> page.xpath("//form[@class='login__form']/input[@name='g-recaptcha-response-v3']")
[]
heck, it's not there! 😑
That's because it is most likely handled by a script that adds the hidden input using javascript when the page is loaded. So there you're doomed; there's no easy solution.
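You can double-check that the token is injected client-side: the static HTML should reference the recaptcha script but never contain the input itself. A quick probe (assuming the site loads Google's script by URL):
>>> page.xpath("//script[contains(@src, 'recaptcha')]")
If that returns a script element while the input query above returns an empty list, the value is generated by javascript at load time.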
One of the solutions is to pull out the big guns: use a real browser to open the page, let the google javascript run, do a few things to make sure you're not being detected as a bot (like resizing the window when loading the page), and fetch that hidden input's value.
Fortunately, you can use selenium to do that, cf. that answer. I won't get into how to install selenium, but your code might look like:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(service=Service(r'/path/to/chromedriver'),
                          options=options)
driver.get(login_url)

# wait for the script to inject the hidden input, then fetch its value
# so you can add it to the payload
token = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, "g-recaptcha-response-v3"))
).get_attribute("value")
payload["g-recaptcha-response-v3"] = token
I'm sorry I'm not going deep into that solution, but you should have enough to get started and explore it.
I'm not that good with Python; I'm also still learning requests too.
I can try to help you by looking at the response. You can try:
print(r.text)
You will see the website's response.
This is not a fix, but more a way to see if something went wrong.
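For instance, two quick things to look at (the marker string is just a guess; check what the page actually shows on a failed login):

print(r.status_code)       # often still 200 even when the login failed
print("Log in" in r.text)  # does the response still show the login form?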
I'm trying to get data from this page:
https://bscscan.com/tokenholdings?a=0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d
But the website owner doesn't provide API endpoints for this purpose, so I tried to achieve it in different ways:
- using dryscrape, but the library seems to be abandoned;
- using requests, but the data is provided dynamically by javascript;
- using requests-html, but even in this case the data doesn't seem to be loaded.
I would like to avoid selenium because it's slow, but I don't know how to solve this issue. Does anyone have a solution that could work? The data I need is the table containing the tokens of the wallet. Thank you in advance and have a nice day.
You can do it with requests-html; for example, let's grab the symbol of the first row:

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://bscscan.com/tokenholdings'
token = {'a': '0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d'}
r = session.get(url, params=token)
r.html.render(sleep=2)  # let the javascript fill in the table
binance_row = r.html.find('tbody tr', first=True)
symbol = binance_row.find('td')[2].text
print(symbol)
Output:
BNB
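If you need the whole table rather than a single cell, the same handles iterate over every row (a quick sketch using the same selectors):

for row in r.html.find('tbody tr'):
    cells = [td.text for td in row.find('td')]
    print(cells)  # one list of column values per token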
https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date in HISTORIC PRICES, then click 'GO'. The table is updated, but I don't see any relevant HTTP requests sent in devtools.
So this means the data has already been downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract it? (Note that I don't want to use methods like selenium; I want to stay with raw HTTP requests to get the raw data.)
EDIT: websockets are mentioned in the comments, but I can't see one in Developer Tools. I am adding the websocket tag anyway, in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract javascript-rendered content without selenium. You can always make use of a headless browser (you won't see any window on your screen; the only pitfall is that you have to wait until the page fully loads) and it won't bother you anymore.
In other words, all the other scraping libs are based on URLs and forms. Scrapy can post forms but cannot run javascript.
Selenium will save the day; all you lose is a couple of seconds per attempt (it would take milliseconds if it were run in the frontend). You can grab the page source with driver.page_source and parse it directly (as HTML text) with BeautifulSoup or whatever.
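As a minimal sketch of that approach, assuming Chrome and a fixed wait (the '.flex_tr' row class is borrowed from the requests-html answer below; adjust the selector if the page changes):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # no visible browser window
driver = webdriver.Chrome(options=options)
driver.get("https://www.nyse.com/quote/XNYS:A")
time.sleep(7)  # crude: give the javascript time to fill the table

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
for row in soup.select(".flex_tr"):
    print(row.get_text(" ", strip=True))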
You can do it with requests-html; for example, let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)  # wait for the javascript to render the table
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (here 7 seconds, but maybe less); if you want to make multiple requests, though, you can do them asynchronously!
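A sketch of the async variant with requests-html (the second ticker is made up for the example):

from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def get_first_row(ticker):
    r = await asession.get(f'https://www.nyse.com/quote/XNYS:{ticker}')
    await r.html.arender(sleep=7)  # async counterpart of render()
    return r.html.find('.flex_tr', first=True).text

# run both coroutines concurrently on the session's event loop
results = asession.run(lambda: get_first_row('A'), lambda: get_first_row('IBM'))
print(results)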
I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful, but I am open to any other approach.
Thank you in advance.
This is just pseudo-code to give you an idea of what you are looking for.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically, this is just raw

for pep_id in page_ids:
    if pep_id == '0':
        # the initial page takes no offset parameter
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        ## Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        ## Enter some parsing logic
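The endpoint looks like it returns JSON, so the parsing step could be as simple as this (the 'jobs' and 'url' keys are assumptions; print the raw response once to see the real field names):

data = page.json()
for job in data.get('jobs', []):
    print(job.get('url'))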
I am pretty new to web scraping, and recently I have been trying to automatically scrape the phone number on pages like this one. I am not supposed to use selenium or headless browser libraries, and I am trying to find a way to request the phone number using, say, a web service or any other possible solution that could return the phone number directly, without having to go through the actual button press with selenium.
I totally understand that it may not even be possible to reveal the phone number in one shot, as it is meant not to be accessible to a nosy newbie web scraper like me; but I would still like to raise the question, to get a detailed answer from an expert's point of view.
If I inspect the 'Reveal' button's DOM element, it shows some attributes I have never seen before. I have two main questions which I believe could be helpful for newbies like me.
1) Given a set of unknown tags/attributes (i.e. data-q and data-reveal in the button below), how can one find out which scripts on the page are actually using them?
2) I googled the button element's attributes, data-q and data-reveal; the only relevant result I could find was this, which for some reason I cannot access, even if I use a proxy.
Any clue, particularly on the first question, is much appreciated.
Regards,
Below is the button's code:
Reveal
OK, according to your requirements, there are several steps before you finally get a solution.
1st step: open your own browser and go to your target page (https://www.gumtree.com/p/vans/2015-ford-transit-custom-2.2tdci-290-l1-h1/1190345514).
2nd step: (assuming you are using Chrome as your favorite browser) press Ctrl+Shift+I to open the console, then select the 'Network' tab.
3rd step: press the 'Reveal' button on that page and watch the console carefully to catch the HTTP request which is sent immediately when you press the button. You can see the request contains a long string of digits in its Query String Parameters; it is actually a timestamp.
4th step: in the 'Request Headers' part of that HTTP request, copy the values of referer, user-agent and x-gumtree-token.
5th step: construct your own request (I am a fan of Python, so I am going to show you my example code in Python):
import time
import json
import requests

headers = {
    'referer': 'please enter the value you just copied from that specific request',
    'user-agent': 'please enter the value you just copied from that specific request',
    'x-gumtree-token': 'please enter the value you just copied from that specific request'
}

url = 'https://www.gumtree.com/ajax/account/seller/reveal/number/1190345514?_='
# the '_' query parameter is a millisecond timestamp used for cache busting
url += str(int(time.time() * 1000))

response = requests.get(url=url, headers=headers)
response_result = json.loads(response.content)
phone_number = response_result['data']
Does Google have an API with a function which will verify if a specific phrase can be found at a given url?
Say I have a webpage url: www.mysite/2011/01/check-if-phrase-exists
I want to know if the phrase foobar exists somewhere in that document (it can be anywhere in the HTML document, not just in the "readable text").
The function/api would return True or False.
Update: the method should not require me to retrieve the entire page to my server and search it myself; it is the fetching of the webpage to my server that I am trying to avoid (to cut down on bandwidth).
I don't think they do, but you could do this yourself without much code (this is adapted from the App Engine docs):
from urllib.request import urlopen
from urllib.error import URLError

url = "http://www.google.com/"
try:
    result = urlopen(url)
    my_search_function(result.read().decode('utf-8'))
except URLError as e:
    handleError(e)
Then you can just define my_search_function(text) to do what you need.
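For example, a minimal version could just test for the phrase in the raw HTML (a sketch; adapt the matching and error handling to your needs):

def my_search_function(text):
    # True if the phrase appears anywhere in the document source
    print("foobar" in text)

def handleError(e):
    # hypothetical handler; log or re-raise as appropriate
    print("fetch failed:", e)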