I need to analyze tweets on a specific topic, but my application for Twitter API access was not approved. Instead, I tried to do it manually using Twitter Advanced Search. However, besides being more burdensome than the easy-to-use API, Advanced Search doesn't seem to retrieve all the relevant tweets containing a particular keyword; I tested this for several cases.
So, there are two questions. First, am I right that Advanced Search returns incomplete results? Second, is there another way (or workaround) to use the API without needing approval?
Specifically, is there any limit on the results Advanced Search returns, or does it provide all possible results, just like the API?
I am setting up a database of certain events that have occurred in the past, and I need to search the internet for a number of terms to retrieve as many pages as possible that contain terms related to the events I want to document.
First I looked into achieving this using Google's Custom Search API, after reading this question:
Need to access Google Custom search api through R
I did manage to get a JSON of search results through the browser, but not through R, so I moved on.
When I saw that the Custom Search API was using OpenSearch, and found the rOpenSearch package for R, I wanted to try going down this path:
http://terradue.github.io/rOpenSearch/
After reading through the documentation, I found that it only provides examples of searching sites that publish OpenSearch descriptions. As I need to search as many websites as possible, it seems I would need an OpenSearch description for a search engine like Google, but I can't find one anywhere.
Is there any way to search the internet via R using OpenSearch and collect the results in a data table?
If you know of a better solution to my problem, I'd appreciate if you could point me in another direction.
If I read you correctly, you are looking for something called web scraping via R.
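For example, here is a minimal sketch with the rvest package; the URL and the CSS selector are placeholders for whatever pages you actually want to harvest:

```r
library(rvest)

# Placeholder page; point this at a page relevant to the events you are documenting.
page <- read_html("https://en.wikipedia.org/wiki/Web_scraping")

# Collect every link and its anchor text into a data frame.
links <- html_elements(page, "a[href]")
results <- data.frame(
  text = html_text2(links),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
head(results)
```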
I am curious whether the following automation would be feasible:
search Google for a UPC/EAN code (e.g. 8710103703631)
scrape and parse product data (depending on what is available) from the first-ranked page:
Name
Brand
Model
Picture
Description
Just trying to understand how complicated this might be.
Thank you!
Lookup EAN/UPC codes via API
There are some free web APIs that (reverse-)look up barcodes (EAN/UPC) or provide additional information.
For example, ean-search.org offers a REST API that is queried by EAN and returns XML (e.g. it provides a link to Amazon for your sample, a Philips Sonicare).
The benefit of using an API: ready-to-use data, no scraping needed.
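As a rough sketch, such a lookup could be done from R with httr and xml2; note that the endpoint path, parameter names, and token requirement below are assumptions, so check ean-search.org's documentation for the real interface:

```r
library(httr)
library(xml2)

# Hypothetical endpoint and parameters - verify against the provider's docs.
resp <- GET("https://api.ean-search.org/api",
            query = list(token  = Sys.getenv("EAN_SEARCH_TOKEN"),
                         op     = "barcode-lookup",
                         ean    = "8710103703631",
                         format = "xml"))
stop_for_status(resp)

doc <- read_xml(content(resp, as = "text", encoding = "UTF-8"))
# Element names are also assumptions; inspect the XML to see what is actually returned.
xml_text(xml_find_all(doc, "//name"))
```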
Web scraping of search results
Of course you can also use search engines (like Google, DuckDuckGo, etc.) and search for the barcode using your favorite web-scraping library in your preferred programming language (a small R sketch follows the list below):
JSoup (in Java): see this question
Scrapy or BeautifulSoup (in Python): see this question
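In R, the same idea looks roughly like this with rvest; the search URL and the CSS selector for result links are placeholders that differ per engine, and note that scraping Google's result pages in particular is fragile and may violate its terms of service:

```r
library(rvest)

# Placeholder search-results URL; most engines accept the query in a 'q' parameter.
serp <- read_html("https://html.duckduckgo.com/html/?q=8710103703631")

# Placeholder selector for result links - inspect the page to find the right one.
hits <- html_elements(serp, "a.result__a")
data.frame(
  title = html_text2(hits),
  url   = html_attr(hits, "href"),
  stringsAsFactors = FALSE
)
```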
I've started using Bing Custom Search API and many of the top search results I get are... surprisingly old and irrelevant.
The Custom Search interface allows you to rank slices of websites higher than others and to boost some results, but it remains URL-based and doesn't go into weighting of actual page contents or metadata such as date, keywords, author and so on.
Will "classic" SEO tips such as using one h1, optimizing page title/description/keywords, etc. help improve result relevance?
I guess it boils down to asking "does Bing Custom Search API use the regular Bing search engine behind the scenes?", but if it is more complex than that, any answer to my main problem will do.
Bing Custom Search basically uses an indexing and ranking mechanism similar to the Bing search engine. The only difference is that Bing Custom Search restricts results to certain sites and/or lets you control the ranking of results. So anything that helps improve page quality (and hence ranking in Web Search) will also help improve Bing Custom Search results.
This actually becomes more important as the candidate pool to select from is very small in Custom Search (or any such API) compared to the full-fledged Web Search API, which has billions of pages to select from.
The only caveat is that it takes time to improve page quality and hence ranking, so until then you may have to pin/block/boost results.
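For what it's worth, here is a minimal sketch of querying the Custom Search API from R with httr so you can inspect what your instance actually returns; the endpoint URL and the name of the custom-configuration parameter are assumptions based on the v7 API and should be checked against the current documentation:

```r
library(httr)
library(jsonlite)

# Assumed v7 endpoint; use the endpoint and custom configuration ID shown in your portal.
resp <- GET("https://api.cognitive.microsoft.com/bingcustomsearch/v7.0/search",
            add_headers(`Ocp-Apim-Subscription-Key` = Sys.getenv("BING_CUSTOM_KEY")),
            query = list(q = "your query here",
                         customconfig = Sys.getenv("BING_CUSTOM_CONFIG_ID")))
stop_for_status(resp)

res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# Inspect titles, URLs and snippets of the pages the custom instance returns.
res$webPages$value[, c("name", "url", "snippet")]
```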
I currently use Kimonolabs for scraping data from websites that all share the same goal. To keep it simple, let's say these websites are online shops selling stuff online (they are actually job sites with online application options, but technically they look a lot like webshops).
This works great. For each website a scraper API is created that goes through the available advanced-search pages to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, the product text, and specs like brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
The gathered information for all products is then fetched through the Kimonolabs JSON endpoint by our own service.
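(To make the pattern concrete: outside of any scraping service, those two stages look roughly like the plain R/rvest sketch below; the URLs and CSS selectors are placeholders, not the actual sites involved.)

```r
library(rvest)

# Placeholder paginated search URL; the page parameter and selectors are assumptions.
search_url <- "https://example-shop.com/search?q=keyword&page="

# Stage 1: build the 'URL list' by walking the paginated search results.
product_urls <- unlist(lapply(1:5, function(page) {
  results <- read_html(paste0(search_url, page))
  html_attr(html_elements(results, "a.product-link"), "href")  # placeholder selector
}))

# Stage 2: the 'product API' equivalent - scrape each product detail page.
products <- do.call(rbind, lapply(product_urls, function(url) {
  page <- read_html(url)
  data.frame(
    url   = url,
    title = html_text2(html_element(page, "h1")),
    brand = html_text2(html_element(page, ".brand")),        # placeholder selector
    text  = html_text2(html_element(page, ".description")),  # placeholder selector
    stringsAsFactors = FALSE
  )
}))
```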
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced-search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether paginating the URLs needed for the product API, and keeping everything automatically up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web-scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS Endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me that it allows pagination through bulk input URLs. Read here.
So far there has not been much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with the Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it myself after migrating from Kimonolabs... You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this the scheduler will run daily or weekly.
Is there a way to search for Twitter users who have a certain keyword in their 'description' field? Right now my best idea is to write a loop that sequentially runs through every user ID, searches the 'description' field, and saves only the users who have that keyword.
Looping through every Twitter ID out there seems excessive! Is there a better way or method?
Sub-question: are there packages beyond twitteR and streamR for Twitter analysis in R?
P.S. As this is an entirely conceptual question, I judged that no reproducible code was necessary... some can be provided if the question is unclear.
Thanks!
As you mention, this is an entirely conceptual question:
The Twitter API offers search over users' profile descriptions by keyword via the 'q' parameter: https://dev.twitter.com/rest/reference/get/users/search
You can even authenticate with OAuth at the link above if you have the credentials, and test your query with curl. If you simply don't want to build the query, just for the sake of checking feasibility, I found this site where you can search by keywords in users' profiles: https://moz.com/followerwonk/bio/ (I'm guessing they use Twitter's official API).
As for the R subquestion, I'm afraid I only know the ones you mentioned :-S
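That said, here is a minimal sketch of calling the users/search endpoint from R with httr, assuming you have app credentials for REST API v1.1 (the endpoint linked above); availability of this endpoint may have changed since:

```r
library(httr)
library(jsonlite)

# OAuth 1.0a credentials from your Twitter app (placeholders read from env vars).
app   <- oauth_app("twitter",
                   key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
                   secret = Sys.getenv("TWITTER_CONSUMER_SECRET"))
token <- oauth1.0_token(oauth_endpoints("twitter"), app)

# GET users/search: the 'q' query is matched against user profiles
# (including the description, per the answer above).
resp <- GET("https://api.twitter.com/1.1/users/search.json",
            config(token = token),
            query = list(q = "data science", count = 20, page = 1))
stop_for_status(resp)

users <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# Keep only users whose description actually contains the keyword.
users[grepl("data science", users$description, ignore.case = TRUE),
      c("screen_name", "description")]
```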