Cycling through IP addresses in asynchronous web scraping

I am using relatively cookie-cutter code to asynchronously request the HTML of a few hundred URLs that I scraped with another piece of code. The code works perfectly.
Unfortunately, this is causing my IP to be blocked due to the high number of requests.
My thought is to write some code to grab some proxy IP addresses, place them in a list, and cycle through them randomly as the requests are sent. Assuming I have no problem creating this list, I am having trouble conceptualising how to splice the random rotation of these proxy IPs into my asynchronous request code. This is my code so far:
import asyncio

import aiohttp

async def download_file(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            content = await resp.read()
            return content

async def write_file(n, content):
    filename = f'sync_{n}.html'
    with open(filename, 'wb') as f:
        f.write(content)

async def scrape_task(n, url):
    content = await download_file(url)
    await write_file(n, content)

async def main():
    tasks = []
    for n, url in enumerate(open('links.txt').readlines()):
        tasks.append(scrape_task(n, url))
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())
I am thinking that I need to put:
conn = aiohttp.TCPConnector(local_addr=(x, 0), loop=loop)
async with aiohttp.ClientSession(connector=conn) as session:
    ...
as the second and third lines of my code, where x would be one of the random IP addresses from a previously defined list. How would I go about doing this? I am unsure whether placing the whole code in a simple synchronous loop would defeat the purpose of using asynchronous requests.
If there is a simpler solution to the problem of being blocked by a website for rapid-fire requests, that would be very helpful too. Please note I am very new to coding.
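A minimal sketch of one way to wire this in, assuming a proxies list of http://host:port strings has already been built (the list contents and file names below are illustrative): aiohttp accepts a proxy argument on each request, so every task can pick a random proxy on its own, and no synchronous outer loop is needed. Note that TCPConnector's local_addr binds a local network interface; it does not route traffic through a remote proxy.

import asyncio
import random

import aiohttp

proxies = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']  # hypothetical proxy list

async def download_file(session, url):
    # Each request picks a random proxy; aiohttp routes just this request through it.
    proxy = random.choice(proxies)
    async with session.get(url, proxy=proxy) as resp:
        return await resp.read()

async def scrape_task(session, n, url):
    content = await download_file(session, url)
    with open(f'async_{n}.html', 'wb') as f:
        f.write(content)

async def main():
    # One shared session; the proxy varies per request, not per session.
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_task(session, n, url.strip())
                 for n, url in enumerate(open('links.txt'))]
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())

If rotating proxies turns out to be overkill, throttling the tasks with an asyncio.Semaphore and a short await asyncio.sleep() before each request is often enough to stay under a site's rate limit.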

Related

corrupted file when sending it via POST with reqwest

I'm trying to send a POST request to my server, including a file. I can do that with curl, following this example, with no problems: https://gokapi.readthedocs.io/en/latest/advanced.html#interacting-with-the-api, but I can't with Rust.
When I try to implement the request in Rust I have issues; namely, the file is corrupted, as if it was sent the wrong way. I tried to get it working with this code,
use reqwest::blocking::multipart;
use reqwest::header::ACCEPT;

fn upload(file: &String) -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let form = multipart::Form::new()
        .file("file", file)
        .unwrap()
        .text("allowedDownloads", "0")
        .text("expiryDays", "2")
        .text("password", "");
    let res = client
        .post("http://myserver.com/api/files/add")
        .header(ACCEPT, "application/json")
        .header("apikey", "secret")
        .header("Accept-Encoding", "gzip, deflate, br")
        .multipart(form)
        .send();
    let response_json = json::parse(&res.unwrap().text().unwrap()).unwrap();
    let id = &response_json["FileInfo"]["Id"];
    print!("http://myserver.com/downloadFile?id={}", id);
    Ok(())
}
but the server receives a bad file, and 7zip gives me an error when I open it.
I tried doing the same thing in a Python script, and I got it working in 3 lines:
import requests
files = {'file': ("1398608 Fractal Dreamers - Gardens Under a Spring Sky.osz", open("1398608 Fractal Dreamers - Gardens Under a Spring Sky.osz", "rb"), "application/octet-stream")}
request = requests.post("http://myserver/api/files/add", files=files, headers={'apikey': 'api'})
The file uploaded from the Python script works flawlessly, while the Rust one doesn't.
Any help is appreciated, as I'm still a beginner with Rust.
I also tried "Sending attachment with reqwest", but I get
{"Result":"error","ErrorMessage":"multipart: NextPart: bufio: buffer full"}
EDIT: the issue looks like it's related to file (the filename) including some strange characters. The test subject file was "1398608 Fractal Dreamers - Gardens Under a Spring Sky.osz", but changing it to "a.osz" made the issue disappear. I have no clue why or how that is.
The content of the zip is:
"Fractal Dreamers - Gardens Under a Spring Sky ([Crz]xz1z1z) [Vernal].osu"
"audio.mp3"
I get the error with the full name, but "1398608 Fractal Dreamers - Gardens Under a Spring Sky.zip" works as well. What's the issue with .osz?

Lua - Download file asynchronously via HTTP

I just finished reading the copas core code, and I want to write code that downloads a file from a website asynchronously, but copas seems to only support socket IO.
Since Lua does not provide async syntax, other packages will surely have their own event loops which, I think, cannot run alongside copas' loop.
So to asynchronously download a file via HTTP, do I have to find a package that supports async HTTP and async file IO at the same time? Or any other ideas?
After reading bunches of code, I can finally answer my own question.
As I mentioned in my comment on the question, one can make use of the step function exported by an async IO library and merge multiple steppings into one bigger loop.
In the case of luv, it uses an external thread pool in C to manage file IO and a single-threaded loop to call pending callbacks and manage IO polling (polling is not needed in my use case).
One can simply call the file operation functions provided by luv to do async file IO, but one still needs to step luv's loop so that the callbacks bound to the IO operations get called.
The integrated main loop goes like this:
local function main_loop()
    copas.running = true
    while not copas.finished() or uv.loop_alive() do
        if not copas.finished() then
            copas.step()
        end
        if uv.loop_alive() then
            uv.run("nowait")
        end
    end
end
copas.step() is the stepping function of copas, and uv.run("nowait") makes luv run just one pass of its event loop without blocking when no IO is ready at polling time.
A working solution looks like this:
local copas = require "copas"
local http = require "copas.http"
local uv = require "luv"

local urls = {
    "http://example.com",
    "http://example.com"
}

local function main_loop()
    copas.running = true
    while not copas.finished() or uv.loop_alive() do
        if not copas.finished() then
            copas.step()
        end
        if uv.loop_alive() then
            uv.run("nowait")
        end
    end
end

local function write_file(file_path, data)
    -- ** call to luv async file IO **
    uv.fs_open(file_path, "w+", 438, function(err, fd)  -- 438 is 0o666 in decimal
        assert(not err, err)
        uv.fs_write(fd, data, nil, function(err_o, _)
            assert(not err_o, err_o)
            uv.fs_close(fd, function(err_c)
                assert(not err_c, err_c)
                print("finished:", file_path)
            end)
        end)
    end)
end

local function dl_url(url)
    local content, _, _, _ = http.request(url)
    write_file("foo.txt", content)
end

-- adding tasks to copas' loop
for _, url in ipairs(urls) do
    copas.addthread(dl_url, url)
end

main_loop()

How to get discord bot to handle separate processes/ link to another bot

I am trying to create something of an application bot. I need the bot to be triggered in a generic channel and then continue the application process in a private DM channel with the applicant.
My issue is this: the bot can have only one on_message function defined. I find it extremely complicated (and inefficient) to check every time whether on_message was triggered by a message from a DM channel vs the generic channel. It also makes it difficult to keep track of an applicant's answers. I want to check if the following is possible: have the bot respond to messages from the generic channel as usual, and if it receives an application prompt, start a new subprocess (or bot?) that handles the DMs with the applicant separately.
Is the above possible? If not, is there an alternative way of handling this better?
@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.channel.type == discord.ChannelType.private:
        # in a DM, message.channel is already the DM channel
        await message.channel.send("Whats your age?")  ## Question 2
    elif message.channel.type == discord.ChannelType.text:
        if message.content.startswith('$h'):
            member = message.author
            if "apply" in message.content:
                await startApply(member)
            else:
                await message.channel.send('Hello!')
                # await message.reply('Hello!', mention_author=True)

async def startApply(member):
    dm_channel = await member.create_dm()
    await dm_channel.send("Whats your name?")  ## Question 1
I have the above code as of now. I want the startApply function to trigger a new bot/subprocess to handle the DMs with an applicant.
Option 1
Comparatively speaking, a single if check like that is not too much overhead, but there are a few different solutions. First, you could try your hand at slash commands. There is a library built as an extension of the discord.py library for slash commands. You could make a command that only works in DMs, and then have it run from there with continuous slash commands.
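As an illustration, newer versions of discord.py (2.x) ship application commands built in, which can stand in for that extension library; a minimal hypothetical sketch of a slash command kicking off the DM flow (the command tree still has to be synced once, e.g. in setup_hook):

import discord
from discord import app_commands

intents = discord.Intents.default()
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

@tree.command(name="apply", description="Start an application")
async def apply(interaction: discord.Interaction):
    # Acknowledge in the channel, then move the conversation into DMs.
    await interaction.response.send_message("Check your DMs!", ephemeral=True)
    dm_channel = await interaction.user.create_dm()
    await dm_channel.send("Whats your name?")  ## Question 1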
Option 2
Use a webhook to start up a new bot. This is most likely more complicated, as you'll have to get a domain or find some sort of free service to catch webhooks. You could use a webhook like this, though, to 'wake up' a bot and have it chat with the user in DMs.
Option 3 (Recommended)
Create functions that handle the text depending on the channel, and keep that if - elif in there. As I said, one if isn't that bad. If you have functions called from your handler that deal with everything, it should actually be fairly easy to manage:
@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.channel.type == discord.ChannelType.private:
        await respondToPrivate(message)
    elif message.channel.type == discord.ChannelType.text:
        await respondToText(message)
In terms of keeping track of the data, if this is a smaller personal project, MySQL is great and easy to learn. You can have each function store whatever data it needs in the database, so the answers can be looked at later, are safe in case of a bot crash, and stay out of memory as well.
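A minimal sketch of what those two handlers could look like, assuming startApply also seeds the state with applications[member.id] = [] when it sends the first question; the in-memory dict and the question list below are illustrative, and the dict would be swapped for the database in practice:

applications = {}  # user id -> answers collected so far (hypothetical store)
questions = ["Whats your name?", "Whats your age?"]

async def respondToPrivate(message):
    if message.author.id not in applications:
        return  # no application in progress for this user
    answers = applications[message.author.id]
    answers.append(message.content)
    if len(answers) < len(questions):
        # Ask the next question in the sequence.
        await message.channel.send(questions[len(answers)])
    else:
        await message.channel.send("Application complete, thanks!")
        # Persist `answers` to the database here, then:
        del applications[message.author.id]

async def respondToText(message):
    if message.content.startswith('$h'):
        if "apply" in message.content:
            await startApply(message.author)
        else:
            await message.channel.send('Hello!')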

Why is Rust's std::thread::sleep allowing my HTTP response to return the correct body?

I am working on the beginning of the final chapter of The Rust Programming Language, which teaches how to write an HTTP response with Rust.
For some reason, the HTML file being sent does not display in the browser unless I have Rust wait before calling TcpStream::flush().
Here is the code:
use std::io::prelude::*;
use std::net::TcpListener;
use std::net::TcpStream;
use std::fs;
use std::thread::sleep;
use std::time::Duration;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:7878").unwrap();
    for stream in listener.incoming() {
        let stream = stream.unwrap();
        handle_connection(stream);
    }
}

fn handle_connection(mut stream: TcpStream) {
    let mut buffer = [0; 1024];
    stream.read(&mut buffer).unwrap();
    let contents = fs::read_to_string("hello.html").unwrap();
    let response = format!(
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n{}",
        contents.len(),
        contents
    );
    stream.write(response.as_bytes()).unwrap();
    // let i = stream.write(response.as_bytes()).unwrap();
    // println!("{} bytes written to the stream", i);
    // ^^ using this code instead will sometimes make it display properly
    sleep(Duration::from_secs(1));
    // ^^ uncommenting this will cause a blank page to load.
    stream.flush().unwrap();
}
I observe the same behavior in multiple browsers.
According to the Rust book, calling TcpStream::flush should ensure that the bytes finish writing to the stream. So why would I be unable to view the HTML file in the browser unless I sleep the thread before flushing?
I have done hard reloading and restarted the server with cargo run multiple times and the behavior is the same. I have also printed out the file contents to the terminal, and the contents are being read fine under either condition (of course they are).
I wonder if this is a problem with my operating system. I'm on Windows 10.
It isn't really holding the project up, as I can continue learning (and I'm not planning on putting an actual web project into production right now), but I would appreciate any insight anyone has into this issue. There must be something about Rust's handling of the stream, or about the environment, that I am not understanding.
Thanks for your time!
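One detail worth double-checking against the book's listing: an HTTP/1.1 response separates the headers from the body with a blank line, i.e. a second \r\n after the last header, and the format string above has only a single \r\n after Content-Length. A quick sketch of the expected framing, written in Python purely for illustration:

contents = "<html><body>Hello!</body></html>"  # hypothetical page body
response = (
    "HTTP/1.1 200 OK\r\n"
    + "Content-Length: " + str(len(contents)) + "\r\n"
    + "\r\n"  # blank line: end of headers, start of body
    + contents
)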

urequests micropython problem (multiple POST requests to google forms)

I'm trying to send data to Google Forms directly (without an external service like IFTTT) using an ESP8266 with MicroPython. I've already used IFTTT, but at this point it is not useful for me: I need a sampling rate of 100 Hz or more, and as you know this exceeds IFTTT's usage limit. I've tried making a RAM buffer, but I got an error saying that the buffer exceeded the RAM size (4 MB), so that's why I'm trying to do it directly.
After trying for some time I got it working, partially. I say "partially" because I have to do a random GET request after the POST request; I don't know why it works, but it works (this way I can send data to Google Forms every second approximately, or maybe less). I guess the problem is that the ESP8266 can't close the connection with Google Forms and gets stuck when it tries to do a new POST request. If this is the problem, I don't know how to fix it another way; any suggestions? The complete code is here:
ssid = 'my_network'
password = 'my_password'

import urequests

def do_connect():
    import network
    sta_if = network.WLAN(network.STA_IF)
    if not sta_if.isconnected():
        print('connecting to network...')
        sta_if.active(True)
        sta_if.connect(ssid, password)
        while not sta_if.isconnected():
            pass
    print('network config:', sta_if.ifconfig())

def main():
    do_connect()
    print("CONNECTED")
    url = 'url_of_my_google_form'
    form_data = 'entry.61639300=example'  # have to change the entry
    user_agent = {'Content-Type': 'application/x-www-form-urlencoded'}
    while True:
        response = urequests.post(url, data=form_data, headers=user_agent)
        print("DATA HAVE BEEN SENT")
        response.close
        print("TRYING TO SEND ANOTHER ONE...")
        response = urequests.get("http://micropython.org/ks/test.html")  # <------ RANDOM URL, I DON'T KNOW WHY THIS CODE WORKS CORRECTLY IN THIS WAY
        print("RANDOM GET:")
        print(response.text)
        response.close

if __name__ == '__main__':
    main()
Thank you for your time, guys. Also, I tried this code before, but it DOESN'T WORK; without the random GET request, it gets stuck after posting once or twice:
while True:
    response = urequests.post(url, data=form_data, headers=user_agent)
    print("DATA HAVE BEEN SENT")
    response.close
    print("TRYING TO SEND ANOTHER ONE...")
Shouldn't it be response.close() (with brackets)? 🤔
Without brackets you merely access the method close of the response object instead of calling close(), so the connection is never actually closed. This could lead to a memory overflow.
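For reference, the corrected loop, where the only change is the brackets on close():

while True:
    response = urequests.post(url, data=form_data, headers=user_agent)
    print("DATA HAVE BEEN SENT")
    response.close()  # actually calls close(), freeing the underlying socket
    print("TRYING TO SEND ANOTHER ONE...")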
