How to artificially create ConnectionRefusedError? - web-scraping

I want to debug ConnectionRefusedError handling in Scrapy. I can't do the debugging without the ability to simulate the error. How can I simulate ConnectionRefusedError?

It turns out we can simulate it by trying to scrape a site that is blocked in our country. For example, vimeo.com is blocked in Indonesia, so I scrape that.
This is even easier from China, since many more sites are blocked there.
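For completeness, a minimal sketch of a spider that exercises that error path, assuming the blocked-site trick above actually produces a refused connection from your network; the spider name is made up, and the errback pattern is the standard Scrapy one:
import scrapy
from twisted.internet.error import ConnectionRefusedError

class RefusedTestSpider(scrapy.Spider):
    # Hypothetical spider; vimeo.com is used only because it is blocked from
    # the author's location, which makes the request fail at connect time.
    name = "refused_test"

    def start_requests(self):
        # Route failures to an errback so the exception type can be inspected.
        yield scrapy.Request("https://vimeo.com/", callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        self.logger.info("No connection error, got %s", response.url)

    def on_error(self, failure):
        # failure.check() returns the matching exception class, or None.
        if failure.check(ConnectionRefusedError):
            self.logger.error("Simulated ConnectionRefusedError: %r", failure.value)
        else:
            self.logger.error("Different failure: %r", failure)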

Related

Tracking a Search that leads to a sale in GA

This seems really basic but I am struggling with it.
We have a client who runs a travel website.
They have a few different search bars, e.g. Flights, Hotels, Car hire.
I am trying to track the performance of each: "What % of people who ran a Flight search completed a sale?" Same for Hotels, and for Car hire.
Any ideas for the best way to get this info in GA?
Many thanks
There are a few ways to get this information, each with their pros and cons. The options that I see immediately available are segments and goals.
Segments are great because they are retrospective and generally more flexible, with the ability to be changed if you find your criteria aren't quite right. You create a segment and specify sessions that go through the search results pages, etc.
Then you can create another segment for the booking confirmation page, and any other intermediary steps you'd like to report on. The main con of segments is that you can only pull in 4 at a time, but if you have more you can pull them 4 at a time and copy and paste the data into an Excel sheet or Google Sheet. Segments can also be pulled via the Core Reporting API and Data Studio, which makes them great for automating into dashboards.
Goals are cool because they pull into the default reports, and basically track sessions through a particular page, event or sequence. The main con I see, and the reason I don't use them, is that they only start tracking from the time you create them, and changing the configuration does not affect historical data, so your data can get messed up quickly if you don't have sandbox GA views or sandbox goals for testing before putting a configuration into a dedicated goal slot. You can also only have 10 or 20 goals depending on your plan, and once data is tracked against a goal you can't remove or clear it.
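Since segments can be pulled via the Core Reporting API, here is a rough Python sketch of automating that pull with google-api-python-client; the view ID, segment ID and service-account file are placeholders you would swap in for your own property:
# Sketch: pull sessions and transactions for one segment through the
# Core Reporting API (v3). The view ID and segment ID are placeholders.
from googleapiclient.discovery import build
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analytics", "v3", credentials=creds)

response = analytics.data().ga().get(
    ids="ga:12345678",                      # your view (profile) ID
    start_date="30daysAgo",
    end_date="today",
    metrics="ga:sessions,ga:transactions",  # searches run vs sales completed
    segment="gaid::-1",                     # replace -1 (All Sessions) with your "Flight search" segment ID
).execute()

totals = response["totalsForAllResults"]
print("Sessions:", totals["ga:sessions"], "Transactions:", totals["ga:transactions"])
Repeat this per segment (Flights, Hotels, Car hire) and divide transactions by sessions to get the percentage for each.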

Is the Sabre Advanced Calendar REST test api limited in its results?

The sample POST form on the Sabre Dev Studio page has a query set up for a round trip flight from DFW to LAX. You can submit the form and get results. However, if I change the airport from DFW to LAS, then I don't get any results. I'm wondering if it's due to the test API having a limited amount of flight data loaded.
I was getting results for flights from LAS/OAK and BOS/LAS just a few days ago from the application I'm writing, but now it's a 404 every time, so I figured I'd try the sample page again.
Update: It looks like the 404s happen with one-way flights. My tests with round trips seem to be fine.
Looking at the Sabre FAQ and this other SO answer from Bruno, I think this is the expected behavior on the TEST platform.
The test platform is limited to certain city pairs (Sabre FAQ).
One-way fares aren't loaded into the test platform (Bruno).
My previously working requests were round trip and had changed to one-way.
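One rough way to confirm this from code is to replay the sample round-trip body against the test host and then resubmit it as one-way. This is only a sketch: the endpoint path must be copied from the Dev Studio sample form, and the one-way change below (dropping a hypothetical return-date field) depends on the actual request schema shown there.
# Sketch only: replay the Dev Studio sample request on the test platform,
# then resubmit it as one-way, to confirm the 404 is tied to one-way trips.
# The endpoint path is a placeholder; copy the real one from the sample form.
import json
import requests

TOKEN = "..."  # OAuth bearer token for the Sabre test environment
URL = "https://api.test.sabre.com/..."  # Advanced Calendar Search endpoint from the sample form

headers = {"Authorization": "Bearer " + TOKEN, "Content-Type": "application/json"}

with open("sample_round_trip.json") as f:  # body copied from the sample POST form
    body = json.load(f)

print("round trip:", requests.post(URL, headers=headers, json=body).status_code)

# Hypothetical: remove the return leg to make the query one-way. The real key
# name depends on the schema shown in the sample form.
body.pop("ReturnDate", None)
print("one way:  ", requests.post(URL, headers=headers, json=body).status_code)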

How to spoof location so the Google autocomplete API will provide local results, ideally with R

Google has an API for downloading search suggestions:
https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/xml_reference/query_suggestion.html
Unfortunately, as far as I can tell, these results are specific to your location. For an analysis, I would like to be able to define the city/location that Google thinks it is making the suggestion for. Here's what happens when I scrape from Dar es Salaam, Tanzania:
http://suggestqueries.google.com/complete/search?client=firefox&q=insurance
["insurance",["insurance","insurance companies in tanzania","insurance group of tanzania","insurance principles","insurance act","insurance policy","insurance act tanzania","insurance act 2009","insurance definition","insurance industry in tanzania"]]
I understand that a VPN would partially solve this issue, but only by giving me a different location, not lots of locations. Is there a reasonable way to replicate this sort of thing quickly and easily from, say, the 100 largest cities in the United States?
Confirmation that results differ within the USA -
Thanks!
Google will use your IP and your location history (if turned on) to determine your location.
To get around it, you can spoof your IP while logged out of your Google account (though I don't know whether Google will consider it an attempt at hacking, no matter what your intentions are).
Another way is to use the Tor Browser (even though that is not its original purpose). You can configure Tor to exit from a certain country using the ExitNodes parameter in the torrc config file.
As found in the docs:
ExitNodes node,node,…
A list of identity fingerprints, country codes, and address patterns of nodes to use as exit node
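For example, a minimal torrc sketch that pins exits to a single country (two-letter country codes go in braces; StrictNodes stops Tor from falling back to other exits):
# torrc sketch: only use exit relays in the given country
ExitNodes {us}
StrictNodes 1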
But if you want a fast way to do it, I don't think that's possible, since Google wants to know the real location of its users and has put a lot of effort into making such tricks fail.
The hl param for interface language changes the search results, but I can't tell if it's actually changing the location. For example:
http://suggestqueries.google.com/complete/search?client=chrome&q=why&hl=FR
Here's an example with 5 different values of hl:
http://jsbin.com/tusacufaza/edit?js,output
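For a quick comparison in Python rather than the jsbin above, you can loop over a few hl values against the same suggest endpoint; note this only varies the interface language, while the location is still inferred from the requesting IP:
# Sketch: fetch suggestions for one query under several hl values and print
# the top few, to see how much hl alone shifts the results.
import json
import requests

query = "insurance"
for hl in ["en", "fr", "de", "es", "id"]:
    r = requests.get(
        "http://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": query, "hl": hl},
    )
    term, suggestions = json.loads(r.text)[:2]
    print(hl, suggestions[:5])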

Scrape all google search result for a specific name

I think this question has been answered here before, but I could not find the desired topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name. Then it will grab the related data against that name, and if more than one match is found, the data will be grouped according to their names.
All I know is that Google has some kind of restriction on scraping. They provide a custom search API. I have not used that API yet, but I am hoping to get all the resulting links for a query from it. However, I could not understand what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have provided a bit more about what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can up that to around 3k a day for 2000 USD/year, and more by getting in contact with them directly.
You will not be able to get accurate ranking positions from this method; if you only need a low number of requests and are mainly interested in getting some websites for a keyword, that's the choice to make.
Starting point would be here: https://code.google.com/apis/console/
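As a starting point for option (a), here is a rough Python sketch against the Custom Search JSON API that the question mentions; the API key and search engine ID (cx) are placeholders you create through the console linked above:
# Rough sketch of option (a): query the Custom Search JSON API and collect the
# result links. API_KEY and CX are placeholders set up via the Google console.
import requests

API_KEY = "your-api-key"
CX = "your-search-engine-id"

def google_links(query, start=1):
    # 'start' paginates in steps of 10 results per request
    r = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "start": start},
    )
    r.raise_for_status()
    return [item["link"] for item in r.json().get("items", [])]

print(google_links("some specific name"))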
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also allows you to get a large number of results, if done right.
You can Google for code; the most advanced free (PHP) code I know of is at http://scraping.compunect.com
However, there are other projects and code snippets around.
You can start off at 300-500 requests per day, and this can be multiplied with multiple IPs. Look at the linked code if you want to go that route; it explains things in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms of service, so either do not accept them or make sure you are not detected. If Google detects you, your script will be blocked by IP ban or captcha. Not getting detected should be a priority.

Need help in Web scraping webpages and its links by automatic funciton in R

I am interested in extracting the data on paranormal activity reported in the news, so that I can analyze the space and time of the appearances for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and temporal correlation analysis. So please forgive me for deciding on this topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of the reported paranormal incidents; they have collections for 2009, 2010, 2011 and 2012.
The structure of the website is that every year has pages 1 to 10, and the links go like this,
for year 2009:
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page they have collected the stories under headings with this internal structure:
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it, which go like this:
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages they have the actual reported stories collected under various headlines, along with links to the original websites for those stories. I am interested in collecting the reported text and extracting information about the kind of paranormal activity (ghost, demon or UFO) and the time, date and place of the incidents. I wish to analyze this data for any spatial and temporal correlations. If UFOs or ghosts are real, they must have some behavior and correlations in space or time in their movements. That is the long shot of the story...
I need help web scraping the text from the above pages. Here I have written the code to follow one page and its links down to the final text I want. Can anyone let me know if there is a better and more efficient way to get the clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)

# Source of paranormal news from about.com; first page for 2009:
# http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
pn.url <- "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html <- htmlTreeParse(pn.url, useInternalNodes = TRUE)
pn.h3 <- xpathSApply(pn.html, "//h3", xmlValue)

# Extract the headline links to follow to the story pages
# (note: XPath attributes use @href, not #href)
pn.h3.links <- xpathSApply(pn.html, "//h3/a/@href")

# Follow the first headline, e.g. "Paranormal Activity, Posted 03-14-09"
# http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url <- pn.h3.links[1]
pn.l1.html <- htmlTreeParse(pn.l1.url, useInternalNodes = TRUE)
pn.l1.links <- xpathSApply(pn.l1.html, "//p/a/@href")

# Follow one of the external story links, e.g.
# "British couple has 'black-and-white twins' twice"
# http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url <- pn.l1.links[7]
pn.l1.f1.html <- htmlTreeParse(pn.l1.f1.url, useInternalNodes = TRUE)

# Grab all visible text, excluding script and style nodes
pn.l1.f1.text <- xpathSApply(pn.l1.f1.html,
  "//text()[not(ancestor::script)][not(ancestor::style)]", xmlValue)
I sincerely thank you in advance for reading my post and for your time helping me.
I would be grateful to any expert who would like to mentor me through this whole project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Despite being Python based, they are considered among the best in the scraping domain. You can use the command line interface to connect the two; for more details about connecting R and Python, have a look here.
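To make that suggestion concrete, here is a rough Python sketch of the same headline-and-text extraction with requests and BeautifulSoup; the URL and the h3/a structure are taken from the question and may have changed since:
# Sketch mirroring the R/XPath version above with requests + BeautifulSoup.
import requests
from bs4 import BeautifulSoup

start_url = "http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"

def headline_links(url):
    # Headlines sit in <h3><a href=...> on the index pages, as in the R code.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [a["href"] for h3 in soup.find_all("h3") for a in h3.find_all("a", href=True)]

def page_text(url):
    # Visible text only: drop script and style elements before extracting.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

for link in headline_links(start_url):
    print(link)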
