How to scrape twitter tweets of a specific language in Python? - web-scraping

I want to scrape only Urdu-language tweets for my project using Python. After some research, I have found three prominent approaches so far:
Tweepy, using the Twitter API
Twint, which scrapes Twitter without the official API
Selenium
However, I still can't figure out how to specifically target Urdu-language tweets. I would be very grateful for any help, guidance, or leads in this regard. Thanks.

After researching the topic further, I found two ways:
First, with Twint you can set the tweet language through the Lang option of the config object (c.Lang = 'tweet_language_code').
import twint
c = twint.Config()
c.Username = "elonmusk"
c.Limit = 100
c.Store_csv = True
c.Output = "none3.csv"
c.Lang = "en" # en code for english
twint.run.Search(c)
(Note: The above method didn't work for me, so I tried the other methods.)
Second, using the snscrape module: set the language in the query. (This works nicely.)
import snscrape.modules.twitter as sntwitter
query = 'lang:ur' #ur is code for urdu
#limit = 10
urduTweets = sntwitter.TwitterSearchScraper(query).get_items()
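get_items() returns a lazy generator, so nothing is fetched until you iterate over it. Below is a minimal sketch of collecting a bounded number of results. The helper names collect_tweets and fetch_urdu_tweets are illustrative, and the attribute names (id, date, rawContent, or content in older releases) are assumptions about the installed snscrape version.

```python
from itertools import islice


def collect_tweets(items, limit=10):
    """Take at most `limit` tweets from an iterable and keep a few fields."""
    rows = []
    for tweet in islice(items, limit):
        # Assumption: newer snscrape releases expose `rawContent`, older
        # ones `content`; fall back to an empty string if neither exists.
        text = getattr(tweet, "rawContent", None) or getattr(tweet, "content", "")
        rows.append({
            "id": getattr(tweet, "id", None),
            "date": getattr(tweet, "date", None),
            "text": text,
        })
    return rows


def fetch_urdu_tweets(limit=10):
    """Run the `lang:ur` search from above; needs snscrape and network access."""
    import snscrape.modules.twitter as sntwitter  # imported lazily on purpose

    scraper = sntwitter.TwitterSearchScraper("lang:ur")
    return collect_tweets(scraper.get_items(), limit)
```

Using islice stops the generator after `limit` items, which plays the role of the commented-out limit variable in the snippet above.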

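for tweet in tweepy.Cursor(api.search_tweets, q=keyword, lang='en', count=100).items(50000):
    print(tweet.text)  # process each tweet here
The above snippet iterates over up to 50K tweets in English; pass lang='ur' to get Urdu tweets instead. count is the page size (at most 100 per request), and since_id expects a tweet ID rather than a date, so it cannot be used to set a start date.
*Note: To access tweets older than 1 week, you need the Twitter API's Academic Access; the general API only returns the past week of data.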

Related

How can I get tweet having only tweet ID without using twitter API?

I have a large number of Tweet IDs that have been collected by other people (https://github.com/echen102/us-pres-elections-2020), and I now want to get the tweets for those IDs. How can I do this without the Twitter API?
Do you want the URL? It is: https://twitter.com/user/status/<tweet_id>
If you want the text of the tweet without using the API, you have to render the page and then scrape it.
You can do it with one module, requests-html:
from requests_html import HTMLSession
session = HTMLSession()
url = "https://twitter.com/user/status/1414963866304458758"
r = session.get(url)
r.html.render(sleep=2)
tweet_text = r.html.find('.css-1dbjc4n.r-1s2bzr4', first=True)
print(tweet_text.text)
Output:
Here’s a serious national security question: Why does the Biden administration want to protect COMMUNISM from blame for the Cuban Uprising? They attribute it to vaccines. Even if the Big Guy can’t comprehend it, Hunter could draw a picture for him.
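Since the question mentions a large number of IDs, the URL pattern from this answer can be applied in bulk before rendering each page. This is a minimal sketch; the helper name tweet_urls is hypothetical, and it assumes the IDs arrive as strings or integers, one per entry.

```python
def tweet_urls(tweet_ids):
    """Yield a status URL for every well-formed tweet ID, skipping the rest."""
    for raw in tweet_ids:
        tweet_id = str(raw).strip()
        if tweet_id.isdigit():
            # Same pattern as in the answer above, with the literal
            # path segment "user" standing in for the author's handle.
            yield f"https://twitter.com/user/status/{tweet_id}"
```

Each resulting URL can then be passed to session.get() and rendered as shown above; rendering is slow, so expect this to take a while for large ID lists.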

Scrape BSCScan Token Holdings Page

I'm trying to get data from this page
https://bscscan.com/tokenholdings?a=0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d
But the website owner doesn't provide API endpoints for this purpose, so I tried several approaches:
- Using dryscrape, but the library seems to be abandoned;
- Using requests, but the data is loaded dynamically by JavaScript;
- Using requests-html, but even in this case the data doesn't seem to be loaded.
I would like to avoid Selenium because it's slow, but I don't know how to solve this issue. Does anyone have a solution that could work? The data I need is the table containing the tokens of the wallet. Thank you in advance and have a nice day.
You can do it with requests-html. For example, let's grab the symbol of the first row:
from requests_html import HTMLSession
session = HTMLSession()
url='https://bscscan.com/tokenholdings'
token={'a': '0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d'}
r = session.get(url, params=token)
r.html.render(sleep=2)
binance_row = r.html.find('tbody tr', first=True)
symbol = binance_row.find('td')[2].text
print(symbol)
Output:
BNB
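To get the whole table rather than one cell, the rendered markup can be walked row by row. The sketch below uses only the standard library's html.parser and assumes a simple, non-nested table like the one on the page; the names TableRows and table_rows are illustrative.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows that contain only <th> cells
                self.rows.append(self._row)
            self._row = None


def table_rows(markup):
    """Return all data rows found in `markup` as lists of cell text."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.rows
```

With the session above, calling table_rows(r.html.html) after r.html.render(sleep=2) should give one list per token, with the symbol at index 2 as in the answer, assuming the page layout has not changed.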

How to get latest news posted on twitter by a website

I am using R and I need to retrieve the few most recent posts from a Twitter user (@ExpressNewsPK) using the twitteR API. I have created an account and have an access token, etc. I have used the following commands to extract the tweets:
setup_twitter_oauth(consumerkey,consumersecret,accesstoken,accesssecret)
express_news_tweets <- searchTwitter("#ExpressNewsPK", n = 10, lang = "en" )
However, the posts that are returned aren't the most recent ones from this user. Where have I made a mistake?
I think searchTwitter searches for the search string provided (here #ExpressNewsPK). So instead of returning tweets by @ExpressNewsPK, it returns tweets that mention or are directed at @ExpressNewsPK.
To get tweets from @ExpressNewsPK, there is a function named userTimeline, which returns the tweets of a particular user.
So after you are done with setup_twitter_oauth, you can try
userTimeline("ExpressNewsPK")
Read more about it at ?userTimeline
When you use searchTwitter(), you call the Twitter Search API. The Search API only returns a sample of recent tweets, not a complete history.
What you really need to do is call the Twitter Streaming API, which lets you download tweets in near real time. You can read more about the Streaming API here: https://dev.twitter.com/streaming/overview
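For readers working in Python rather than R, the same search-versus-timeline distinction exists in Tweepy. The sketch below assumes `api` is an already authenticated tweepy.API instance; the helper name latest_tweets is hypothetical.

```python
def latest_tweets(api, screen_name, n=10):
    """Return the text of the `n` most recent tweets posted by `screen_name`."""
    # `api` is assumed to be an authenticated tweepy.API instance.
    statuses = api.user_timeline(
        screen_name=screen_name, count=n, tweet_mode="extended"
    )
    # With tweet_mode="extended" the untruncated text is in `full_text`.
    return [status.full_text for status in statuses]
```

A call such as latest_tweets(api, "ExpressNewsPK", n=10) reads the user's own timeline instead of running a search for the name.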

Getting the error "ApiKey invalid" for hotel live prices

I'm trying to get a list of current hotel prices but I can't get my API key to work. I've had it for a couple of days, so I know it isn't too new. I even tried the example in the docs (after fixing the dates):
http://partners.api.skyscanner.net/apiservices/hotels/liveprices/v2/UK/EUR/en-GB/27539733/2016-12-04/2016-12-10/2/1?apiKey=myKey
While it worked for the demo key, it wouldn't work for mine. I also tried it with Python on the EC2 micro instance I'm using for testing and got the response u'{"errors":["ApiKey invalid"]}':
SKY_SCAN_URL = "http://partners.api.skyscanner.net/apiservices/hotels/liveprices/v2/"
sky_key = get_sky_scan_key()
def get_hotels(request):
    entityid = request.GET['entityid']
    checkindate = date_formatter(request.GET['start'])
    checkoutdate = date_formatter(request.GET['end'])
    rooms = request.GET['rooms']
    guests = request.GET['guests']
    FINAL_SKY_URL = "%s/%s/%s/%s/%s/%s/%s/%s/%s/?apiKey=%s" % (
        SKY_SCAN_URL, 'US', 'USD', 'en-US', entityid, checkindate, checkoutdate, guests, rooms, sky_key)
    sky_response = requests.get(FINAL_SKY_URL)
This function outputs a get request with a URL like this:
http://partners.api.skyscanner.net/apiservices/hotels/liveprices/v2//US/USD/en-US/20.7983626,-156.3319253-latlong/2016-09-07/2016-09-14/1/1/?apiKey=myKey
Any advice on what the possible issue could be would be awesome, thanks!
Edit:
To be more specific, I'm looking for reasons why my API key is invalid. I'm not familiar with Skyscanner. I added an app from the Skyscanner dashboard by clicking the travel API and copied the key both into my project and directly into a valid URL, yet the key still shows as invalid. Are there any additional steps or things that I need to take into account?
I don't know how you're creating the URL, but it seems like it shouldn't be built that way (most likely due to their misleading documentation).
This:
http://partners.api.skyscanner.net/apiservices/hotels/liveprices/v3/?apiKey=myKey&checkoutdate=2016-09-14&checkindate=2016-09-07&currency=USD&rooms=1&entityid=20.7983626%2C-156.3319253-latlong&local=en-US&market=US&guests=1
Should be:
http://partners.api.skyscanner.net/apiservices/hotels/liveprices/v3/US/USD/en-US/20.7983626,-156.3319253-latlong/2016-09-07/2016-09-14/1/1/?apiKey=myKey
Your code should be something like:
# No trailing slash on the base URL, so the first "/" in the format string does not produce "//"
SKY_SCAN_URL = "http://partners.api.skyscanner.net/apiservices/hotels/liveprices/v3"
FINAL_URL = "%s/%s/%s/%s/%s/%s/%s/%s/%s/?apiKey=%s" % (SKY_SCAN_URL, market, currency, locale, entityid, checkindate, checkoutdate, guests, rooms, apiKey)
sky_response = requests.get(FINAL_URL)
I also suggest you do some tests here.
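As a sketch of the path-style URL described in this answer, the segments can be joined and escaped in one place, which also avoids the doubled slash seen in the question's output. The function name build_live_prices_url and its parameter names are hypothetical, and the endpoint itself may no longer be available.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://partners.api.skyscanner.net/apiservices/hotels/liveprices/v3"


def build_live_prices_url(market, currency, locale, entity_id,
                          checkin, checkout, guests, rooms, api_key):
    """Build the path-style live prices URL in the order the answer gives."""
    segments = [market, currency, locale, entity_id,
                checkin, checkout, guests, rooms]
    # Keep "," unescaped so lat/long entity IDs stay readable in the path.
    path = "/".join(quote(str(segment), safe=",") for segment in segments)
    return f"{BASE_URL}/{path}/?{urlencode({'apiKey': api_key})}"
```

For example, build_live_prices_url("US", "USD", "en-US", "20.7983626,-156.3319253-latlong", "2016-09-07", "2016-09-14", 1, 1, "myKey") reproduces the "Should be" URL shown above.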
From their help site, as of 17 days ago:
https://support.business.skyscanner.net/hc/en-us/articles/209452689-Why-is-my-API-key-returning-no-results-for-hotels-
"Our Hotels API is currently being reworked, and access is not available at present. Apologies for any inconvenience, when the new API is ready for use we will update the Skyscanner for Business site, so please check back there for updates."
It is unclear when this will change.
Since April 2017, Skyscanner has been reworking its Hotels API and has stopped serving calls to the Live Pricing APIs:
https://support.business.skyscanner.net/hc/en-us/articles/209452689-Why-is-my-API-key-returning-no-results-for-hotels-
The Hotels and Flights Cached Pricing and Browse services are still working, though I am not sure whether that is enough for your business case.
It seems that Skyscanner has updated their Hotels API recently and the documentation can be found here: https://skyscanner.github.io/slate/#hotels-live-prices

Twitter Retweet count of a Tweet

I want to get the count of all retweets for a specific tweet on Twitter. I am using Twitterizer.
There's a RetweetCount property in Twitterizer. http://www.twitterizer.net/documentation/html/P_Twitterizer_TwitterStatus_RetweetCount.htm
In C# it would look like:
TwitterStatus Tweet;
Tweet.Id = '2134213'; //set the twitter ID
MessageBox.Show(Tweet.RetweetCount);
Ruel is on the right track, but a little off.
The code (in the latest version: 2.3) would look like this:
TwitterResponse<TwitterStatus> statusResponse = TwitterStatus.Show(123456);
int RetweetsOfThatStatus = statusResponse.ResponseObject.RetweetCount;
Feel free to ask any additional questions on the forums. I usually answer questions there throughout the day.
Ricky
Twitterizer's Architect/Developer
@hotcode: I don't think any Twitter client API provides that feature for you, but you can try searching for:
RT @user: text
on Twitter and then parse the results to check whether each one really is a retweet of your intended tweet. If that doesn't work on twitter you can use this here
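The parsing step for the "RT @user: text" convention could look like the following sketch. The function name is_retweet_of is hypothetical, and the prefix comparison is an assumption meant to tolerate retweets that were cut off with an ellipsis.

```python
import re

# "RT @user: text" as produced by old-style manual retweets.
RT_PATTERN = re.compile(r"^RT @(\w+):\s*(.*)$", re.DOTALL)


def is_retweet_of(candidate, author, original_text):
    """True if `candidate` looks like 'RT @author: original_text'."""
    match = RT_PATTERN.match(candidate.strip())
    if match is None:
        return False
    user, text = match.groups()
    # Long retweets may be truncated, so drop trailing dots or an
    # ellipsis and compare the remainder as a prefix of the original.
    text = text.rstrip("….").strip()
    if not text or user.lower() != author.lower():
        return False
    return original_text.strip().startswith(text)
```

Counting the search results for which this returns True gives a rough retweet count; it only catches manual retweets that follow this text convention.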

Resources