I'd like some help with saving Facebook data from search results.
I have 1000 query URLs like:
https://www.facebook.com/search/people/?q=name
https://www.facebook.com/search/people/?q=mobile
How can I quickly scrape data from the resulting web pages?
I have tried to scrape with some scraper programs but could not get them to work. Does anyone have a faster way?
Use the Python requests library. It is a simple, fast, pure-Python library. Keep in mind that scraping speed does not depend only on your code; it also depends on the website you are scraping.
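A minimal fetch loop with requests could look like the sketch below. The file names are placeholders, and note that Facebook search pages normally require a logged-in session and render results with JavaScript, so the raw HTML returned to requests may not contain the data you see in a browser.

# Sketch: fetch each search URL and save the raw HTML to disk.
# "queries.txt" (one URL per line) and the output file names are placeholders.
# Facebook generally needs an authenticated session for real search results.
import requests

with open("queries.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # some sites block the default UA

for i, url in enumerate(urls):
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    with open(f"result_{i}.html", "w", encoding="utf-8") as out:
        out.write(resp.text)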
I've been using the Scraper extension to scrape a website called Flippa and find websites for sale. For example, I'll go to this page with several websites on it and find all the domains for sale:
https://flippa.com/search?sort_alias=most_recent&filter%5Bproperty_type%5D=website,established_website,starter_site&filter%5Bsitetype%5D=content,blog,directory,review,forum-community
I've been using the following XPath code to gather the domains (e.g. blasterpiece.com), but it no longer works:
//div[1]/div[2]/div[1]/a[2]/text()
Any idea what I need to tweak? I'm new to scraping, so I'm pretty stuck.
Thanks!
This XPath should work: //a[contains(@class, "GTM-search-result-card ng-binding")]/text()
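For example, applied with Python's lxml to the saved page source (the class name comes from this thread and may change whenever Flippa updates its markup; if the listing is rendered client-side, you will need the rendered HTML, i.e. what the Scraper extension sees):

# Sketch: run the XPath above over saved page HTML with lxml.
# "flippa.html" is a placeholder for the rendered page source.
from lxml import html

tree = html.parse("flippa.html")
domains = tree.xpath('//a[contains(@class, "GTM-search-result-card ng-binding")]/text()')
print(domains)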
I'm a beginner in web scraping, trying to learn how to implement an automated process for collecting data from the web by submitting search terms.
The specific problem I’m working on is as follows:
Given the Stack Overflow site https://stackoverflow.com/, I submit a search for the term “web scraping” and want to collect, in a list, all the question links and the content of each question.
Is it possible to scrape these results?
My plan is to create a list of terms:
term <- c("web scraping", "crawler", "web spider")
submit a search for each term and collect both the question title and the question content.
Of course the process should be repeated for each page of results.
Unfortunately, being relatively new to web scraping, I'm not sure what to do.
I’ve already downloaded some packages to scrape the web (rvest, RCurl, XML, RCrawler).
Thanks for your help
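A rough sketch of the flow I have in mind, shown here in Python with requests and Beautiful Soup purely for illustration (the same steps map onto rvest's read_html and html_nodes). The /search?q= URL is Stack Overflow's real search endpoint, but the CSS selector is an assumption about the current markup and will likely need adjusting.

# Sketch: search each term and collect question links page by page.
# The "div.s-post-summary a.s-link" selector is an assumption and may need updating.
import requests
from bs4 import BeautifulSoup

terms = ["web scraping", "crawler", "web spider"]
results = []

for term in terms:
    for page in range(1, 4):  # limit to a few result pages for the sketch
        resp = requests.get(
            "https://stackoverflow.com/search",
            params={"q": term, "page": page},
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        links = [a["href"] for a in soup.select("div.s-post-summary a.s-link")]
        if not links:
            break
        results.extend("https://stackoverflow.com" + href for href in links)

print(len(results), "question links collected")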
Is there any way of crawling / monitoring Instagram data for research purposes?
I tried the official API, but it only works in sandbox mode, which makes it impossible to crawl real information like followers. I need to monitor certain accounts, extend the range to their followers / followings, and track their behavior (e.g. how the number of likes progresses over time).
Does anybody have suggestions, or references to related crawling work?
Maybe you can try the Beautiful Soup library and read the book by Ryan Mitchell, Web Scraping with Python. Basically, to get started you should understand the DOM, regexes, and how to jump from page to page algorithmically.
Also, check a site's ToS before you start; it may have guidelines/rules against scraping, and most sites these days have a robots.txt file that specifies how and what you're allowed to scrape.
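For the jump-page-to-page part, the usual requests + Beautiful Soup pattern looks roughly like the sketch below; the start URL and selectors are placeholders, not Instagram-specific (Instagram itself serves most content through JavaScript, so this pattern applies to plain HTML sites).

# Generic pagination pattern: scrape a page, then follow its "next" link.
# The start URL and the selectors are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/listing"  # placeholder start page
while url:
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    for item in soup.select("article"):  # placeholder item selector
        print(item.get_text(strip=True)[:80])

    next_link = soup.select_one('a[rel="next"]')  # follow the next-page link if present
    url = urljoin(url, next_link["href"]) if next_link else None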
I currently use Kimonolabs for scraping data from websites that share the same goal. To keep it simple, let's say these websites are online shops selling stuff (they are actually job sites with online application options, but technically they look a lot like web shops).
This works great. For each website, a scraper API is created that goes through the available advanced search pages to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text and specs like brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
The gathered information for all products is then fetched from the Kimonolabs JSON endpoint by our own service.
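The fetch step on our side is essentially a JSON download and a loop over the results, roughly like the sketch below, where the endpoint URL, API key and field names are placeholders rather than real Kimonolabs values.

# Sketch of the fetch step: pull the product API's JSON endpoint and read its fields.
# Endpoint URL, API key and field names are placeholders.
import requests

ENDPOINT = "https://scraping-service.example.com/api/product-api?apikey=YOUR_KEY"

data = requests.get(ENDPOINT, timeout=30).json()
for product in data.get("results", []):
    print(product.get("title"), product.get("brand"), product.get("category"))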
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether paginating the URLs needed for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it as JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not solve your needs you can checkout these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me it allows pagination through bulk input URLs. Read here.
So far there's not much progress on getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it for myself after migrating from Kimonolabs... You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.
I'm crawling Wikipedia using a website downloader for Windows. I've looked through all the options in this tool to find one that downloads Wikipedia pages for a specific period, for example from 2005 until now.
Does anyone have any idea how to crawl the website for a specific period of time?
Why not download the SQL database containing all of Wikipedia?
You can then query it using SQL.
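For example, assuming you have imported the dump into MySQL with the standard MediaWiki page/revision schema, a query filtered on the revision timestamp could look like this (connection details are placeholders):

# Sketch: query an imported MediaWiki dump for revisions made since 2005.
# Assumes the dump is loaded into MySQL with the standard page/revision tables;
# the connection parameters are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="wiki", password="secret", database="wikipedia")
with conn.cursor() as cur:
    cur.execute(
        "SELECT page_title, rev_timestamp "
        "FROM revision JOIN page ON rev_page = page_id "
        "WHERE rev_timestamp >= '20050101000000' LIMIT 10"
    )
    for title, ts in cur.fetchall():
        print(title, ts)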
Give the Wikipedia API and your programming skills a try.
There should be no need to do web scraping; use the MediaWiki API to directly request the information you want. I'm not sure what you mean by "wikipedia pages for a specific period" - do you mean last edited at a certain time? If so, while skimming, I noticed an API call that lets you get a look at the last n revisions; just ask for the last revision and see what its date is.
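For instance, a single query to the API's revisions property returns the latest revision timestamp for a page (the page title below is just an example):

# Ask the MediaWiki API for the most recent revision of a page and check its date.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "Web scraping",  # example page title
        "rvprop": "timestamp",
        "rvlimit": 1,
        "format": "json",
    },
    timeout=30,
)
for page in resp.json()["query"]["pages"].values():
    print(page["title"], page["revisions"][0]["timestamp"])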
It depends on whether the website in question offers an archive, and most don't, so it's not possible in a straightforward way to crawl a sample starting from a specific date. But you can implement some intelligence in your crawler to read the page creation date or something like that.
But you can also look at the Wikipedia API at http://en.wikipedia.org/w/api.php