YouTube scraping with Colab - web-scraping

I need to scrape videos of particular car makes from YouTube in Google Colab, searching by tags such as those in this list:
Abarth
AC
Acura
Adam
Adler
AEC
Aero
Aixam
Albion
So I have tried these two pieces of code to find videos by tag (for example, tag='Peugeot') in Google Colab:
!pip install youtube-search-python
from youtubesearchpython import SearchVideos
search = SearchVideos("NoCopyrightSounds", offset = 1, mode = "json", max_results = 20)
print(search.result())
and
!pip install youtube-dl
!echo '' > ford_video_list.txt
!chmod 755 ford_video_list.txt
!youtube-dl --match-title 'ford' --add-metadata --write-thumbnail --list-thumbnails --mark-watched --write-info-json 'ford_video_description_json.txt' --write-description 'ford_video_description.txt' --cookies='Search-youtube-url-file.txt' --ignore-errors --skip-download --get-url -f bestvideo+bestaudio/best --default-search "ytsearch2000:" "Ford Festiva" >> ford_video_list.txt
!echo '*****End of test 1 ******'
But this code does not show any results:
import urllib.request
from bs4 import BeautifulSoup
textToSearch = 'python tutorials'
query = urllib.parse.quote(textToSearch)
url = "https://www.youtube.com/results?search_query=" + query
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
for vid in soup.findAll(attrs={'class':'yt-uix-tile-link'}):
    if not vid['href'].startswith("https://googleads.g.doubleclick.net/"):
        print('https://www.youtube.com' + vid['href'])
So I guess the class name is not correct, and I am asking here for help debugging it.
Update:
I have made a Google Colab page (linked below) to test this code. The youtube-dl command also shows the error that follows the link:
https://colab.research.google.com/drive/1bZQ68gLggTQHCG_5fQQJJTICHA4K3HJ3?usp=sharing
ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests
(caused by <HTTPError 429: 'Too Many Requests'>); please report this
issue on https://yt-dl.org/bug . Make sure you are using the latest
version; see https://yt-dl.org/update on how to update. Be sure to
call youtube-dl with the --verbose flag and include its complete
output.
I understand that the error occurs because Google does not like too many requests from one IP address.
So I tried adding these flags (--rm-cache-dir --force-ipv4 --verbose) to the youtube-dl command (based on references 1, 2 and 3), with the result you can see below:
ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests (caused by <HTTPError 429: 'Too Many Requests'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
File "/usr/local/lib/python3.6/dist-packages/youtube_dl/extractor/common.py", line 632, in _request_webpage
return self._downloader.urlopen(url_or_request)
File "/usr/local/lib/python3.6/dist-packages/youtube_dl/YoutubeDL.py", line 2238, in urlopen
return self._opener.open(req, timeout=self._socket_timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 564, in error
result = self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 756, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
Thanks.

It worked after changing "ytsearch2000:" "Ford Festiva" to "ytsearch50":"Ford Festiva", as you can see below:
!pip install youtube-dl
# !youtube-dl --default-search gvsearch5:how to develop for android --no-playlist --write-info-json --write-annotation --write-thumbnail --write-sub --skip-download
!youtube-dl --match-title 'ford' "ytsearch50":"Ford Festiva"+"peugeot 405" --write-info-json --write-annotation --write-thumbnail --write-sub -f 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4'
The problem was caused by misplacing the : and " characters.
The entire code for scraping the car videos can be seen here:
https://colab.research.google.com/github/CAR-Driving/yoloOnGoogleColab/blob/master/database_creating/Yoututbe_scraping_by_colab.ipynb
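As a minimal sketch of the search syntax discussed above: youtube-dl accepts a search of the form ytsearchN:query as a single argument, where N is the number of results wanted. The helper names build_search_query and build_command, and the result count of 5, are illustrative assumptions rather than part of the original notebook; the car makes are the first entries of the list in the question.

```python
import shlex

CAR_MAKES = ["Abarth", "AC", "Acura"]  # first entries of the list in the question


def build_search_query(term, max_results=50):
    """Return a youtube-dl search spec such as 'ytsearch50:Ford Festiva'."""
    if max_results < 1:
        raise ValueError("max_results must be positive")
    # Prefix, count, colon and search terms form ONE argument.
    return "ytsearch{}:{}".format(max_results, term)


def build_command(term, max_results=50):
    """Build an argument list for youtube-dl that only writes metadata."""
    return [
        "youtube-dl",
        "--ignore-errors",
        "--skip-download",
        "--write-info-json",
        build_search_query(term, max_results),
    ]


for make in CAR_MAKES:
    # shlex.join quotes the query so it stays a single shell argument
    print(shlex.join(build_command(make, max_results=5)))
```

The printed lines can be pasted after a ! in a Colab cell, or the list can be passed to subprocess.run. A small result count per query also keeps the number of requests down, which may help with the HTTP 429 error.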

Related

problem in Change Directory in colab for google drive

I want to download a video from YouTube with youtube_dl in Colab and save it to Google Drive. I create a directory named after the video title and save the video in that folder. I use this code:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
URL = "https://www.youtube.com/watch?v=QTPP-iaF7BY&t=1955s"
!pip install youtube_dl
import youtube_dl
with youtube_dl.YoutubeDL({"ignoreerrors": True, "quiet": True}) as ydl:
    playlist_dict = ydl.extract_info(URL, download=False)
print('\n', playlist_dict['title'], '\n')
import os
new_folder = playlist_dict['title']
path = f"//content//drive//MyDrive//{new_folder}//".replace("'"," ").replace(".","-").replace(":","-")
os.makedirs(path, exist_ok=True)
print('\n', path, '\n')
%cd {path}
But for the URL specified in the code above, I get this error:
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
The folder you are executing pip from can no longer be found.
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
Pillai "Hoeffding's Inequality"
//content//drive//MyDrive//Pillai "Hoeffding's Inequality"//
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-13-dd9eae6c92da>", line 20, in <module>
get_ipython().magic('cd {path}')
File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2160, in magic
return self.run_line_magic(magic_name, magic_arg_s)
File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2081, in run_line_magic
result = fn(*args,**kwargs)
File "<decorator-gen-84>", line 2, in cd
File "/usr/local/lib/python3.7/dist-packages/IPython/core/magic.py", line 188, in <lambda>
call = lambda f, *a, **k: f(*a, **k)
File "/usr/local/lib/python3.7/dist-packages/IPython/core/magics/osm.py", line 288, in cd
oldcwd = py3compat.getcwd()
OSError: [Errno 107] Transport endpoint is not connected
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value._render_traceback_()
AttributeError: 'OSError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.7/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.7/inspect.py", line 1502, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.7/inspect.py", line 1460, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.7/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), '__loader__', None) is not None:
File "/usr/lib/python3.7/inspect.py", line 725, in getmodule
file = getabsfile(object, _filename)
File "/usr/lib/python3.7/inspect.py", line 709, in getabsfile
return os.path.normcase(os.path.abspath(_filename))
File "/usr/lib/python3.7/posixpath.py", line 383, in abspath
cwd = os.getcwd()
OSError: [Errno 107] Transport endpoint is not connected
With other YouTube URLs I don't have this problem; the video downloads and is saved correctly in Google Drive.
EDIT
Changing %cd {path} to os.chdir(path) solved the problem. But
I don't understand why %cd {path} works for some URLs and doesn't work
for others.
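The traceback above shows the failure inside the %cd magic at oldcwd = py3compat.getcwd(), that is, while looking up the current directory rather than while entering the new one. os.chdir() with an absolute path does not need that lookup, which may be why it succeeds. The sketch below builds a folder name from a title and changes into it with os.chdir(). The helpers safe_folder_name and enter_folder are hypothetical, the set of replaced characters is an assumption, and a temporary directory stands in for the Drive mount.

```python
import os
import re
import tempfile


def safe_folder_name(title):
    """Replace characters that are awkward in paths or shell commands."""
    cleaned = re.sub(r"""["'/\\:.]""", "-", title)
    return cleaned.strip() or "untitled"


def enter_folder(base_dir, title):
    """Create base_dir/<safe title> and make it the working directory."""
    path = os.path.join(os.path.abspath(base_dir), safe_folder_name(title))
    os.makedirs(path, exist_ok=True)
    os.chdir(path)  # absolute target, so the old working directory is not needed
    return path


base = tempfile.mkdtemp()  # stand-in for /content/drive/MyDrive (assumption)
target = enter_folder(base, 'Pillai "Hoeffding\'s Inequality"')
print(target)
```

Removing quotes from the folder name also avoids surprises when the path is later interpolated into a shell or magic command.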

How can I send a file to an HTTP server and read it?

So, I created the following HTTP server tunneled via ngrok, and I am trying to send a file to the server, to then read it and display it on the web page of the server.
Here's the code for the server:
import os
from http.server import HTTPServer, BaseHTTPRequestHandler
from pyngrok import ngrok
import time
port = os.environ.get("PORT", 80)
server_address = ("127.0.0.1", port)
class MyServer(BaseHTTPRequestHandler):
    def _set_headers(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
    def do_GET(self):
        self._set_headers()
        self.wfile.write(bytes("<html><head><title>https://pythonbasics.org</title></head>", "utf-8"))
        self.wfile.write(bytes("<p>Request: %s</p>" % self.path, "utf-8"))
        self.wfile.write(bytes("<body>", "utf-8"))
        self.wfile.write(bytes("<p>This is an example web server.</p>", "utf-8"))
        self.wfile.write(bytes("</body></html>", "utf-8"))
    def do_POST(self):
        '''Reads post request body'''
        self._set_headers()
        content_len = int(self.headers.getheader('content-length', 0))
        post_body = self.rfile.read(content_len)
        self.wfile.write("received post request:<br>{}".format(post_body))
    def do_PUT(self):
        self.do_POST()
httpd = HTTPServer(server_address, MyServer)
public_url = ngrok.connect(port).public_url
print("ngrok tunnel \"{}\" -> \"http://127.0.0.1:{}\"".format(public_url, port))
try:
    # Block until CTRL-C or some other terminating event
    httpd.serve_forever()
except KeyboardInterrupt:
    print(" Shutting down server.")
    httpd.socket.close()
And I have been trying to send a file using POST as follows:
>>> url = 'https://httpbin.org/post'
>>> files = {'file': open('report.xls', 'rb')}
>>> r = requests.post(url, files=files)
>>> r.text
I imported requests of course, and here's what I get
Exception occurred during processing of request from ('127.0.0.1', 60603)
Traceback (most recent call last):
File "C:\Program Files\Python39\lib\socketserver.py", line 316, in _handle_request_noblock
self.process_request(request, client_address)
File "C:\Program Files\Python39\lib\socketserver.py", line 347, in process_request
self.finish_request(request, client_address)
File "C:\Program Files\Python39\lib\socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "C:\Program Files\Python39\lib\socketserver.py", line 720, in __init__
self.handle()
File "C:\Program Files\Python39\lib\http\server.py", line 427, in handle
self.handle_one_request()
File "C:\Program Files\Python39\lib\http\server.py", line 415, in handle_one_request
method()
File "C:\Users\pierr\OneDrive\Desktop\SpyWare-20210104T124335Z-001\SpyWare\Ngrok_Test.py", line 28, in do_POST
content_len = int(self.headers.getheader('content-length', 0))
AttributeError: 'HTTPMessage' object has no attribute 'getheader'
Could someone please help me fix this error ? I don't get where it comes from.
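The API changed in Python 3: self.headers is now an HTTPMessage object, which has no getheader() method. You need to use
content_len = int(self.headers.get('Content-Length', 0))
instead of
content_len = int(self.headers.getheader('content-length', 0))
The rest can stay the same, except that self.wfile.write() expects bytes in Python 3, so the response string in do_POST must be encoded before it is written.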
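Putting the fix above into a complete handler, here is a minimal sketch that reads the POST body with headers.get() and writes the reply as bytes. It leaves out ngrok and runs the server on a free local port in a background thread; EchoHandler and post_once are hypothetical names introduced only for this illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Python 3: look headers up with .get(); default to 0 if absent
        content_len = int(self.headers.get("Content-Length", 0))
        post_body = self.rfile.read(content_len)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        # wfile expects bytes, so the reply is built as bytes
        self.wfile.write(b"received post request:<br>" + post_body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def post_once(payload):
    """Start a local server, POST payload to it once, return the reply body."""
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        req = urllib.request.Request(url, data=payload, method="POST")
        # Empty ProxyHandler: talk to localhost directly, ignoring proxy settings
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(req) as resp:
            return resp.read()
    finally:
        server.shutdown()
        server.server_close()


print(post_once(b"file contents"))
```

A file sent with requests.post(url, files=...) arrives as a multipart/form-data body, so post_body would contain the multipart boundaries as well as the file bytes.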

Python 3.6.5 urllib.error.HTTPError: HTTP Error 403: Forbidden

I am trying to download a csv file from the internet. Here is my code using urllib. But I get HTTP Error 403.
Program-1:
from urllib import request
nse_stocks = 'https://www.nseindia.com/products/content/sec_bhavdata_full.csv'
def download_file(url):
    connection = request.urlopen(url)
    file_read = connection.read()
    file_str = str(file_read)
    lines_file_str = file_str.split('\\n')
    file = open(r'downloaded_file.csv', 'w')
    for line in lines_file_str:
        file.write(line + '\n')
    file.close()
download_file(nse_stocks)
Response:
Traceback (most recent call last):
File "C:/Users/sg0205481/Documents/Krishna/eBooks/Python/TheNewBoston/Python/downloadWebFile2.py", line 17, in <module>
download_file(nse_stocks)
File "C:/Users/sg0205481/Documents/Krishna/eBooks/Python/TheNewBoston/Python/downloadWebFile2.py", line 7, in download_file
connection = request.urlopen(url)
File "C:\Users\sg0205481\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\sg0205481\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\sg0205481\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\sg0205481\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\sg0205481\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\sg0205481\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
But I don't get the error with Program-2, which uses the requests module. The file is downloaded successfully.
Program-2:
import requests
def download_file(url):
    file_data = requests.get(url)
    filename = 'downloaded_file.csv'
    with open(filename, 'wb') as file:
        file.write(file_data.content)
download_file('https://www.nseindia.com/products/content/sec_bhavdata_full.csv')
What is the problem with Program-1? What makes Program-2 pass successfully?
You can download a file (of a given MIME type) from a web server with the GET method, not the POST method. If the web server has a handler for this extension and accepts POST, then you may try that.
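One difference between the two programs that is worth checking is the User-Agent header: urllib identifies itself as Python-urllib/<version> and requests as python-requests/<version>, and some servers reject one but not the other. Whether that is the cause of the 403 here is an assumption. The sketch below builds a urllib request with an explicit User-Agent; the helper build_request and the header value "Mozilla/5.0" are illustrative choices.

```python
from urllib import request

NSE_STOCKS = "https://www.nseindia.com/products/content/sec_bhavdata_full.csv"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying an explicit User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def download_file(url, filename="downloaded_file.csv"):
    """Fetch url and write the raw bytes to filename (needs network access)."""
    with request.urlopen(build_request(url)) as response:
        data = response.read()
    # Write the bytes directly instead of round-tripping through str()
    with open(filename, "wb") as fh:
        fh.write(data)


req = build_request(NSE_STOCKS)
# urllib stores header names capitalised, hence "User-agent"
print(req.get_header("User-agent"))
```

download_file is not called here because it needs network access; calling download_file(NSE_STOCKS) would perform the actual request.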

telegram bot keeps replying rather than executing the following code (webhook is used)

I am trying to host my Telegram bot for a multiplayer game on GAE, using a webhook. This is what my datastore models look like:
class Game(ndb.Model):
    chat_id = ndb.IntegerProperty(required = True)
    mission_num = ndb.IntegerProperty(default =1)
    round_num = ndb.IntegerProperty(default =1)
class Player(ndb.Model):
    user_id = ndb.IntegerProperty(required=True)
    player_role = ndb.StringProperty (
        choices = ['spy','resistance'])
The part of code under web hook handler:
if text.startswith('/'):
    if text == '/start':
        reply('Bot enabled')
        setEnabled(chat_id, True)
    elif text == '/stop':
        reply('Bot disabled')
        setEnabled(chat_id, False)
    elif text == '/newgame':
        if chat_type == 'group':
            existing_game = Game.query (Game.chat_id == chat_id).get()
            if existing_game:
                reply ("game is alr intitiated liao")
            else:
                ##create a new game here
                #still stuck here
                ##========================##
                #reply("keep replying this line")
                ##========================##
                new_game = Game (
                    chat_id = chat_id,
                    id = chat_id
                )
                curr_game_key = new_game.put()
                new_player = Player (
                    parent = curr_game_key,
                    user_id = fr_user_id,
                    id = fr_user_id)
                new_player.put()
                reply("waiting for more friends to join")
        else:
            reply('game must be conducted within a group chat! jio more friends!')
    else:
        reply('What command?')
else:
    if getEnabled(chat_id):
        reply('I got your message! (but I do not know how to answer)')
    else:
        logging.info('not enabled for chat_id {}'.format(chat_id))
The problem is that when I send '/newgame' in a group chat, nothing is sent back to me. If I uncomment the following line, my bot keeps sending me "keep replying this line" like crazy:
#reply("keep replying this line")
The reply function:
def reply(msg=None, img=None):
    if msg:
        resp = urllib2.urlopen(BASE_URL + 'sendMessage', urllib.urlencode({
            'chat_id': str(chat_id),
            'text': msg.encode('utf-8'),
            'disable_web_page_preview': 'true',
            'reply_to_message_id': str(message_id),
        })).read()
    elif img:
        resp = multipart.post_multipart(BASE_URL + 'sendPhoto', [
            ('chat_id', str(chat_id)),
            ('reply_to_message_id', str(message_id)),
        ], [
            ('photo', 'image.jpg', img),
        ])
    else:
        logging.error('no msg or img specified')
        resp = None
    logging.info('send response:')
    logging.info(resp)
The error:
Internal Server Error
The server has either erred or is incapable of performing the requested operation.
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
rv = self.handle_exception(request, response, e)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~orbitaltest2/1.393734187774164753/main.py", line 66, in get
self.response.write(json.dumps(json.load(urllib2.urlopen(BASE_URL + 'getUpdates'))))
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 409: Conflict
code for the handler that line 66 belongs to:
class GetUpdatesHandler(webapp2.RequestHandler):
    def get(self):
        urlfetch.set_default_fetch_deadline(60)
        self.response.write(json.dumps(json.load(urllib2.urlopen(BASE_URL + 'getUpdates'))))
Totally newbie, any suggestion is appreciated!
You should check what status code and what content your webhook returns.
You have two options for replying:
Call the Telegram API
Return JSON as the response to the webhook call
As you have not provided the source code of reply(), it's hard to tell exactly what is wrong.
In any case, your webhook should return HTTP status code 200. If it does not, Telegram treats it as an error and tries to resend the message to you. This is why you are getting repeated calls and the bot is "replying like crazy".
Most probably the call reply("keep replying this line") succeeds, but then something goes wrong afterwards and Telegram receives an error response.
Add try/except blocks and log exceptions.
Check your logs and add more logging if needed. For example, I log the HTTP response content from my webhook. It helps.
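The advice above (always answer 200, wrap the processing in try/except, log exceptions) can be sketched independently of any web framework. The names handle_webhook, process_update and failing_processor are hypothetical; in the question's code, the webapp2 handler would call something like handle_webhook and write the returned status and body to its response.

```python
import json
import logging


def handle_webhook(raw_body, process_update):
    """Run process_update on the decoded update and always answer 200.

    Returns a (status_code, response_body) pair for the web framework to send.
    """
    try:
        update = json.loads(raw_body)
        process_update(update)
    except Exception:
        # Log the full traceback, but still acknowledge the update so that
        # it is not delivered again.
        logging.exception("webhook processing failed")
    return 200, "ok"


def failing_processor(update):
    # Stand-in for game logic that raises, e.g. a failed datastore write
    raise RuntimeError("datastore write failed")


status, body = handle_webhook('{"update_id": 1}', failing_processor)
print(status, body)
```

The trade-off is that an update whose processing fails is dropped rather than retried, so the logged traceback is the only record of it.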

Determine difference in http request between Python2 and Python3

I am attempting to use Python3 to send metrics to Hosted Graphite. The examples given on the site are Python2, and I have successfully ported the TCP and UDP examples to Python3 (despite my inexperience, and have submitted the examples so the docs may be updated), however I have been unable to get the HTTP method to work.
The Python2 example looks like this:
import urllib2, base64
url = "https://hostedgraphite.com/api/v1/sink"
api_key = "YOUR-API-KEY"
request = urllib2.Request(url, "foo 1.2")
request.add_header("Authorization", "Basic %s" % base64.encodestring(api_key).strip())
result = urllib2.urlopen(request)
This works successfully, returning a HTTP 200.
So far I have ported this much to Python3, and while I was (finally) able to get it to make a valid HTTP request (i.e. no syntax errors), the request fails, returning HTTP 400
import urllib.request, base64
url = "https://hostedgraphite.com/api/v1/sink"
api_key = b'YOUR-API-KEY'
metric = "testing.python3.http 1".encode('utf-8')
request = urllib.request.Request(url, metric)
request.add_header("Authorization", "Basic %s" % base64.encodestring(api_key).strip())
result = urllib.request.urlopen(request)
The full result is:
>>> result = urllib.request.urlopen(request)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python3/3.3.1/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python3/3.3.1/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 479, in open
response = meth(req, response)
File "/usr/local/Cellar/python3/3.3.1/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 591, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/local/Cellar/python3/3.3.1/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 517, in error
return self._call_chain(*args)
File "/usr/local/Cellar/python3/3.3.1/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 451, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python3/3.3.1/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 599, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Is it obvious what I am doing wrong? Are there any suggestions on how I might capture and compare what the successful (python2) and failing (python3) requests are actually sending?
Don't mix Unicode strings and bytes:
>>> "abc %s" % b"def"
"abc b'def'"
You could construct the header as follows:
from base64 import b64encode
headers = {'Authorization': b'Basic ' + b64encode(api_key)}
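To make the difference concrete, the short sketch below builds the header value both ways in Python 3. It uses the placeholder key from the question and b64encode in place of encodestring; decoding to ASCII text is one option, alongside the all-bytes construction shown above.

```python
from base64 import b64encode

api_key = b"YOUR-API-KEY"  # placeholder, as in the question

# Formatting bytes into a str embeds the bytes repr, b'...'
broken = "Basic %s" % b64encode(api_key)

# Decode the base64 bytes to text before building the header value
fixed = "Basic " + b64encode(api_key).decode("ascii")

print(broken)
print(fixed)
```

The first value contains a literal b'...' wrapper around the credentials, which the server cannot parse as valid Basic authentication.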
A quick way to see the request is to change the host in the URL to localhost:8888 (using http:// rather than https://, so that the request is sent in plain text) and, before making the request, run:
$ nc -l 8888
You could also use wireshark to see the requests.
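If nc is not available, the same idea can be sketched in Python: listen on a local port, let the client send its request there, and print the raw bytes. The helpers capture_request and send_metric are hypothetical, and the Authorization value is a dummy; the same capture can be run under each Python version to compare the two requests.

```python
import socket
import threading
import urllib.request


def capture_request(send):
    """Listen on a free local port, call send(url), return the raw request."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    captured = []

    def serve():
        conn, _ = listener.accept()
        with conn:
            data = b""
            while b"\r\n\r\n" not in data:  # read until the headers end
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            head, _, body = data.partition(b"\r\n\r\n")
            length = 0
            for line in head.split(b"\r\n")[1:]:
                name, _, value = line.partition(b":")
                if name.strip().lower() == b"content-length":
                    length = int(value.strip())
            while len(body) < length:  # read the rest of the body
                chunk = conn.recv(4096)
                if not chunk:
                    break
                body += chunk
            captured.append(head + b"\r\n\r\n" + body)
            conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 0\r\n\r\n")

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    send("http://127.0.0.1:%d/" % port)
    thread.join(timeout=5)
    listener.close()
    return captured[0]


def send_metric(url):
    # Dummy header value; substitute the real Authorization header to inspect it
    req = urllib.request.Request(
        url, data=b"foo 1.2", headers={"Authorization": "Basic abc"})
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    opener.open(req).close()


raw = capture_request(send_metric)
print(raw.decode("latin-1"))
```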
