user id not found even when user exists - web-scraping

I have been trying to retrieve data from Instagram using instagramy, but I keep getting errors saying the profile doesn't exist, even though the profile username is copied from Instagram itself.
from instagramy.plugins.analysis import analyze_users_popularity
import pandas as pd
session_id = "58094758320%3AcPGOQQFP3YK2sq%3A0%3AAYeC7jQGmpaEYf2il0Evg60SXDeJarTsjUjc5TG7RQ"
# Instagram user_id of ipl teams
teams = ["chennaiipl", "mumbaiindians",
"royalchallengersbangalore", "kkriders",
"delhicapitals", "sunrisershyd",
"kxipofficial"]
data = analyze_users_popularity(teams ,session_id)
df= pd.DataFrame(data)
df
I keep getting this error:
HTTPError: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\siddh\OneDrive\Desktop\instagram scrapper.py", line 17, in <module>
data = analyze_users_popularity(teams ,session_id)
File "C:\Users\siddh\anaconda3\lib\site-packages\instagramy\plugins\analysis.py", line 16, in analyze_users_popularity
user = InstagramUser(username, sessionid)
File "C:\Users\siddh\anaconda3\lib\site-packages\instagramy\InstagramUser.py", line 61, in __init__
data = self.get_json()
File "C:\Users\siddh\anaconda3\lib\site-packages\instagramy\InstagramUser.py", line 83, in get_json
raise UsernameNotFound(self.url.split("/")[-2])
UsernameNotFound: InstagramUser('chennaiipl') not Found

Related

Slow data reading from Google BigTable

I am using Airflow 1.10.14 with Composer 1.15.2 and Google Bigtable on GCP.
I'm getting this issue:
{taskinstance.py:1152} ERROR - <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.ABORTED
details = "Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)
debug_error_string = "{"created":"#1649401290.125577144","description":"Error received from peer ipv4:142.251.6.95:443","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)","grpc_status":10}
>
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 980, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/airflow/gcs/plugins/minutebardata.py", line 55, in execute
tk.get_data(dt=self.dt)
File "/home/airflow/gcs/plugins/scraper/ticker.py", line 46, in get_data
df0 = self.dbhook._load_all_minubebar_tickers(ticker)
File "/home/airflow/gcs/plugins/scraper/db_wrapper.py", line 583, in _load_all_minubebar_tickers
dfs = [self.process(row, ticker) for row in rows if row is not None]
File "/home/airflow/gcs/plugins/scraper/db_wrapper.py", line 583, in <listcomp>
dfs = [self.process(row, ticker) for row in rows if row is not None]
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 485, in __iter__
response = self._read_next_response()
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 474, in _read_next_response
return self.retry(self._read_next, on_error=self._on_error)()
File "/opt/python3.6/lib/python3.6/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/opt/python3.6/lib/python3.6/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/opt/python3.6/lib/python3.6/site-packages/google/cloud/bigtable/row_data.py", line 470, in _read_next
return six.next(self.response_iterator)
File "/opt/python3.6/lib/python3.6/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/opt/python3.6/lib/python3.6/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.ABORTED
details = "Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)
debug_error_string = "{"created":"#1649401290.125577144","description":"Error received from peer ipv4:142.251.6.95:443","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Error while reading table 'projects/pyproject/instances/bigtable-02/tables/mytable' : Response was not consumed in time; terminating connection.(Possible causes: slow client data read or network problems)","grpc_status":10}
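I use the following approach:
from google.cloud.bigtable.row_set import RowSet
# Read only the rows for one ticker between startdt and enddt
row_set = RowSet()
row_set.add_row_range_from_keys(
    start_key=ticker + b"#" + startdt,
    end_key=ticker + b"#" + enddt)
rows = table.read_rows(
    row_set=row_set)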
This code works properly when I run this locally. However, when I try to run this in GCP I get this issue.
Could you give me any hints to find the solution?

geograpy3: sqlite3.OperationalError: no such table

I would like to use the geograpy3 package to map location strings (like 'Roma, Italy' or just 'Timișoara') to cities and countries. It runs in my venv under openSUSE 15.3.
Unfortunately, I can't get the SQLite DB to work. My test files always end with errors like:
sqlite3.OperationalError: no such table ...
In detail:
import geograpy
url='https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
places = geograpy.get_geoPlace_context(url = url)
print(places)
ends with:
Traceback (most recent call last):
File "/path/to/geograpy3/examples/example1.py", line 3, in <module>
places = geograpy.get_geoPlace_context(url = url)
File "/home/axel/.local/share/virtualenvs/myProject-1oLtBMPc/lib/python3.9/site-packages/geograpy/__init__.py", line 24, in get_geoPlace_context
places=get_place_context(url, text, labels=Labels.geo, debug=debug)
File "/home/axel/.local/share/virtualenvs/myProject-1oLtBMPc/lib/python3.9/site-packages/geograpy/__init__.py", line 46, in get_place_context
pc = PlaceContext(places)
File "/home/axel/.local/share/virtualenvs/myProject-1oLtBMPc/lib/python3.9/site-packages/geograpy/places.py", line 32, in __init__
self.setAll()
File "/home/axel/.local/share/virtualenvs/myProject-1oLtBMPc/lib/python3.9/site-packages/geograpy/places.py", line 87, in setAll
self.set_countries()
File "/home/axel/.local/share/virtualenvs/myProject-1oLtBMPc/lib/python3.9/site-packages/geograpy/places.py", line 98, in set_countries
country=self.getCountry(place)
File "/home/axel/.local/share/virtualenvs/myProject-1oLtBMPc/lib/python3.9/site-packages/geograpy/locator.py", line 1162, in getCountry
countryRecords=self.sqlDB.query(query,params)
File "/home/axel/.local/share/virtualenvs/myProject-1oLtBMPc/lib/python3.9/site-packages/lodstorage/sql.py", line 183, in query
query = cur.execute(sqlQuery,params)
sqlite3.OperationalError: no such table: countries
What have I missed?
Check $HOME/.geograpy3/locations.db. If the file does not exist or is empty, download it again.
For more information look at this issue: https://github.com/somnathrakshit/geograpy3/issues/59
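A quick way to check whether the database file is present and actually contains the countries table is sketched below. It uses only the standard library. The path follows the location mentioned above, the table name comes from the error message, and the helper name table_exists is made up for illustration.

```python
import os
import sqlite3


def table_exists(db_path, table_name):
    """Return True if db_path is a non-empty SQLite file containing table_name."""
    # A missing or zero-byte file cannot contain any table.
    if not os.path.isfile(db_path) or os.path.getsize(db_path) == 0:
        return False
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?",
            (table_name,),
        ).fetchone()
        return row is not None
    except sqlite3.DatabaseError:
        # The file exists but is not a readable SQLite database.
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    # Location taken from the answer above; adjust if your setup differs.
    db_path = os.path.join(os.path.expanduser("~"), ".geograpy3", "locations.db")
    print(db_path, table_exists(db_path, "countries"))
```

If this prints False, the database is missing, empty, or incomplete, which would match the "no such table: countries" error.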

BS4: AttributeError: 'NoneType' object stops the parser from working

I'm currently working on a parser to make a small preview of a page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page and a small chunk of information (a bit of text).
The project: I want a list of metadata for popular WordPress plugins, starting with the first 50 URLs, that is, 50 plugins of interest. The challenge: I want to fetch the metadata of all existing plugins and then filter for those with the newest timestamp, i.e. the most recently updated ones. It is all about being up to date.
https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor
url = "https://wordpress.org/plugins/browse/popular/{}"
def main(url, num):
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""]+[f"page/{x}/" for x in range(2, 50)]]
allin = []
for future in futures:
    allin.extend(future.result())
def parser(url):
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new
with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]
    for future in futures1:
        print(future.result())
Here are the results:
Extracting https://wordpress.org/plugins/tuxedo-big-file-uploads/Extracting https://wordpress.org/plugins/cherry-sidebars/
Extracting https://wordpress.org/plugins/meks-smart-author-widget/
Extracting https://wordpress.org/plugins/wp-limit-login-attempts/
Extracting https://wordpress.org/plugins/automatic-translator-addon-for-loco-translate/
Extracting https://wordpress.org/plugins/event-organiser/
Traceback (most recent call last):
File "/home/martin/unbenannt0.py", line 45, in <module>
print(future.result())
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/martin/unbenannt0.py", line 34, in parser
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
AttributeError: 'NoneType' object has no attribute 'find_next'
So I have a serious error:
AttributeError: 'NoneType' object has no attribute 'find_next'
It looks like soup.find("h3", class_="screen-reader-text") did not find anything.
We could either break this line up and call find_next only if there was a result, or use a try/except that catches the AttributeError.
At the moment I do not know how to fix this properly, only that the offending code can be surrounded with:
try:
    code that causes error
except AttributeError:
    print(f"Attribute error on {some data here}, {whatever else would be of value}, {...}")
    ... whatever action makes sense here.
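The first option, calling find_next only when find returned something, could look roughly like the sketch below. The function name extract_plugin_fields is made up for illustration; it expects an already parsed BeautifulSoup object and returns None when the page does not have the expected layout, which is an assumption about how such pages should be handled.

```python
def extract_plugin_fields(soup):
    """Return [title, *selected meta items], or None if the page layout differs.

    `soup` is expected to be a parsed BeautifulSoup document, for example
    BeautifulSoup(r.content, "html.parser").
    """
    # find() returns None when nothing matches, so check before chaining.
    heading = soup.find("h3", class_="screen-reader-text")
    if heading is None:
        return None
    meta_list = heading.find_next("ul")
    if meta_list is None:
        return None
    title = soup.find("h1", class_="plugin-title")
    if title is None:
        return None
    items = [li.get_text(strip=True, separator=" ")
             for li in meta_list.find_all("li")[:8]]
    # Same prefix filter as in the original parser().
    wanted = [x for x in items
              if x.startswith(("V", "Las", "Ac", "W", "T", "P"))]
    return [title.text] + wanted
```

Inside parser(), the body after building soup would then reduce to a single call to extract_plugin_fields(soup), and the caller can skip or log URLs for which it returns None.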
By the way, besides this error, I want to add an option that still gives the results back. See the complete and unaltered error traceback below; it contains valuable call stack information.
Extracting https://wordpress.org/plugins/automatic-translator-addon-for-loco-translate/
Extracting https://wordpress.org/plugins/wpforo/Extracting https://wordpress.org/plugins/accesspress-social-share/
Extracting https://wordpress.org/plugins/mailoptin/
Extracting https://wordpress.org/plugins/tuxedo-big-file-uploads/
Extracting https://wordpress.org/plugins/post-snippets/
Extracting https://wordpress.org/plugins/woocommerce-payfast-gateway/Extracting https://wordpress.org/plugins/woocommerce-grid-list-toggle/
Extracting https://wordpress.org/plugins/goodbye-captcha/
Extracting https://wordpress.org/plugins/gravity-forms-google-analytics-event-tracking/
Traceback (most recent call last):
File "/home/martin/dev/wordpress_plugin.py", line 44, in <module>
print(future.result())
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/martin/dev/wordpress_plugin.py", line 33, in parser
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
AttributeError: 'NoneType' object has no attribute 'find_next'
I hope this was not too long and complex. Thank you for the help!
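Since future.result() re-raises whatever the worker raised, one failing URL aborts the whole loop, as the tracebacks show. A minimal sketch of collecting results and full tracebacks per item instead, using only the standard library, is given here. The helper name run_all and its return shape are assumptions for illustration, not part of the original script.

```python
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(func, items, max_workers=8):
    """Run func(item) for every item; return (results, failures).

    results maps item -> return value, failures maps item -> formatted
    traceback string. Items must be hashable (URLs as strings are fine).
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(func, item): item for item in items}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # Keep the complete traceback of the worker for later inspection.
                failures[item] = "".join(
                    traceback.format_exception(type(exc), exc, exc.__traceback__)
                )
    return results, failures
```

With the script above this would be called as run_all(parser, allin, max_workers=50); the successful rows end up in results and each failing URL keeps its unaltered traceback in failures.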

Openstack Not able to connect to using rest api

I have a local install of OpenStack on my VirtualBox. I am trying to use the Libcloud API to connect and get a list of images, flavors, etc.
Below is the code that I am trying to execute:
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver
# Authentication information so you can authenticate to DreamCompute
# copy the details from the OpenStack RC file
# https://dashboard.dreamcompute.com/project/access_and_security/api_access/openrc/
auth_username = 'admin'
auth_password = 'f882e2f4eaad434c'
TENANT_NAME = 'admin'
project_name = 'admin'
auth_url = 'http://192.168.56.101:5000/v3/tokens'
region_name = 'RegionOne'
provider = get_driver(Provider.OPENSTACK)
conn = provider(auth_username,
auth_password,
ex_force_auth_url=auth_url,
ex_force_auth_version='2.0_password',
ex_tenant_name=project_name,
ex_force_service_type='compute',
ex_force_service_name='compute',
ex_force_base_url='http://192.168.56.101:8774/v2.1/29a8949bc3a04bfead0654be8e552017')
# Get the image that we want to use by its id
# NOTE: the image_id may change. See the documentation to find
# all the images available to your user
image_id = '4525d442-e9f4-4d19-889f-49ab03be93df'
image = conn.get_image(image_id)
# Get the flavor that we want to use by its id
flavor_id = '100'
flavor = conn.ex_get_size(flavor_id)
# Create the node with the name “PracticeInstance”
# and the size and image we chose above
instance = conn.create_node(name='PracticeInstance', image=image, size=flavor)
When I run the above code I get the error below:
C:\Python dev\website\music\openstack>python openstack.py
Traceback (most recent call last):
File "openstack.py", line 30, in <module>
image = conn.get_image(image_id)
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\compute\drivers\openstack.py", line 2028, in get_image
'/images/%s' % (image_id,)).object['image'])
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\openstack.py", line 223, in request
raw=raw)
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\base.py", line 536, in request
action = self.morph_action_hook(action)
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\openstack.py", line 290, in morph_action_hook
self._populate_hosts_and_request_paths()
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\openstack.py", line 324, in _populate_hosts_and_request_paths
osa = osa.authenticate(**kwargs) # may throw InvalidCreds
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\openstack_identity.py", line 855, in authenticate
return self._authenticate_2_0_with_password()
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\openstack_identity.py", line 880, in _authenticate_2_0_with_password
return self._authenticate_2_0_with_body(reqbody)
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\openstack_identity.py", line 885, in _authenticate_2_0_with_body
method='POST')
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\base.py", line 637, in request
response = responseCls(**kwargs)
File "C:\Users\C5265680\AppData\Local\Programs\Python\Python36\lib\site-packages\libcloud\common\base.py", line 157, in __init__
message=self.parse_error())
libcloud.common.exceptions.BaseHTTPError: {"error": {"message": "The resource could not be found.", "code": 404, "title": "Not Found"}}
I have checked the logs on my server at /var/log/keystone and they do not show any error, so I am guessing that I am able to log in.
Also, there are many examples showing that after the above steps I should be able to get a list of images/flavors/servers.
I am not sure why I am not able to connect. Can someone please help me with this?
I think you should modify the auth_url and provider arguments.
The following set of arguments worked in my environment:
auth_username = 'admin'
auth_password = 'f882e2f4eaad434c'
TENANT_NAME = 'admin'
project_name = 'admin'
auth_url = 'http://192.168.56.101:5000'
region_name = 'RegionOne'
provider = get_driver(Provider.OPENSTACK)
conn = provider(auth_username,
auth_password,
ex_force_auth_url=auth_url,
ex_force_auth_version='2.0_password',
ex_tenant_name=project_name)
Updated 2017.11.16
@user8040338 Your error message is the first clue:
"The resource could not be found." with status code 404. The most common cause of this message and status code is a wrongly formatted REST API URL.
First, check the Keystone v2.0 REST API format.
Also check the Libcloud reference again.
For instance, the argument ex_force_auth_version specified API version 2.0_password (v2), but the auth_url variable included the version resource /v3. That version does not match the Libcloud argument you specified. auth_url should be the base URL of the API.
Repeat the same check for each API argument until the issue is solved.
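As a small illustration of the "base URL" point, the sketch below strips the path (such as /v3/tokens) from an endpoint so that only scheme, host and port remain. The helper name base_auth_url is hypothetical; it only manipulates the string and does not contact Keystone.

```python
from urllib.parse import urlsplit, urlunsplit


def base_auth_url(url):
    """Drop path, query and fragment, keeping only scheme://host:port."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


if __name__ == "__main__":
    # The URL from the question, reduced to the form used in the answer.
    print(base_auth_url("http://192.168.56.101:5000/v3/tokens"))
```

The result can then be passed as ex_force_auth_url, as in the argument set shown above.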

Why do I get the errors PartitionOwnedError and ConsumerStoppedException when starting a few consumers?

I use pykafka to fetch messages from a Kafka topic, process them, and write the updates to MongoDB. Since pymongo can update only one item at a time, I start 100 processes. But on startup, some processes raise the errors PartitionOwnedError and ConsumerStoppedException. I don't know why.
Thank you.
# Python 2 code
kafka_cfg = conf['kafka']
kafka_client = KafkaClient(kafka_cfg['broker_list'])
topic = kafka_client.topics[topic_name]
balanced_consumer = topic.get_balanced_consumer(
    consumer_group=group,
    auto_commit_enable=kafka_cfg['auto_commit_enable'],
    zookeeper_connect=kafka_cfg['zookeeper_list'],
    zookeeper_connection_timeout_ms=kafka_cfg['zookeeper_conn_timeout_ms'],
    consumer_timeout_ms=kafka_cfg['consumer_timeout_ms'],
)
while(1):
    for msg in balanced_consumer:
        if msg is not None:
            try:
                value = eval(msg.value)
                id = long(value.pop("id"))
                value["when_update"] = datetime.datetime.now()
                query = {"_id": id}
                result = collection.update_one(query, {"$set": value}, True)
            except Exception, e:
                log.error("Fail to update: %s, msg: %s", e, msg.value)
Traceback (most recent call last):
File "dump_daily_summary.py", line 182, in <module>
dump_daily_summary.run()
File "dump_daily_summary.py", line 133, in run
for msg in self.balanced_consumer:
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 745, in __iter__
message = self.consume(block=True)
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 734, in consume
raise ConsumerStoppedException
pykafka.exceptions.ConsumerStoppedException
Traceback (most recent call last):
File "dump_daily_summary.py", line 182, in <module>
dump_daily_summary.run()
File "dump_daily_summary.py", line 133, in run
for msg in self.balanced_consumer:
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 745, in __iter__
message = self.consume(block=True)
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 726, in consume
self._raise_worker_exceptions()
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 271, in _raise_worker_exceptions
raise ex
pykafka.exceptions.PartitionOwnedError
PartitionOwnedError: check whether background processes are still consuming in the same consumer_group; there may not be enough available partitions to start another consumer.
ConsumerStoppedException: you can try upgrading your pykafka version (https://github.com/Parsely/pykafka/issues/574)
I ran into the same problem. However, I was confused by the other solutions, such as adding enough partitions for the consumers or upgrading pykafka.
In fact, my setup already satisfied those conditions.
Here are the versions of the tools:
python 2.7.10
kafka 2.11-0.10.0.0
zookeeper 3.4.8
pykafka 2.5.0
Here is my code:
class KafkaService(object):
    def __init__(self, topic):
        self.client_hosts = get_conf("kafka_conf", "client_host", "string")
        self.topic = topic
        self.con_group = topic
        self.zk_connect = get_conf("kafka_conf", "zk_connect", "string")
    def kafka_consumer(self):
        """kafka-consumer client, using pykafka
        :return: {"id": 1, "url": "www.baidu.com", "sitename": "baidu"}
        """
        from pykafka import KafkaClient
        consumer = ""
        try:
            kafka = KafkaClient(hosts=str(self.client_hosts))
            topic = kafka.topics[self.topic]
            consumer = topic.get_balanced_consumer(
                consumer_group=self.con_group,
                auto_commit_enable=True,
                zookeeper_connect=self.zk_connect,
            )
        except Exception as e:
            logger.error(str(e))
        while True:
            message = consumer.consume(block=False)
            if message:
                print "message:", message.value
                yield message.value
The two exceptions (ConsumerStoppedException and PartitionOwnedError) are raised by the function consume(block=True) in pykafka.balancedconsumer.
I recommend reading the source code of that function.
It takes an argument block=True; after I changed it to False, the program no longer ran into these exceptions.
The Kafka consumers then worked fine.
This behavior is affected by a longstanding bug that was recently discovered and is currently being fixed. The workaround we've used in production at Parse.ly is to run our consumers in an environment that handles automatically restarting them when they crash with these errors until all partitions are owned.
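If no process supervisor is available, the restart workaround can be approximated inside the process. The sketch below is a generic retry wrapper, not Parse.ly's actual setup; the names run_with_restarts, max_restarts and delay_s are made up for illustration. For the case discussed here one would pass retriable=(PartitionOwnedError, ConsumerStoppedException), imported from pykafka.exceptions, and a start_consumer function that creates a fresh balanced consumer and runs the consume loop.

```python
import time


def run_with_restarts(start_consumer, retriable=(Exception,),
                      max_restarts=5, delay_s=1.0, sleep=time.sleep):
    """Call start_consumer(), restarting it when a retriable error is raised.

    Returns whatever start_consumer() returns. After max_restarts failed
    restarts the last exception is re-raised to the caller.
    """
    restarts = 0
    while True:
        try:
            return start_consumer()
        except retriable:
            restarts += 1
            if restarts > max_restarts:
                raise
            # Give other consumers time to release or claim partitions.
            sleep(delay_s)
```

The sleep parameter exists only so the delay can be replaced in tests; in normal use the default time.sleep is fine.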
