Advanced web scraping

I want to scrape a part of a website, let's say: www.mywebsite.com/x1/x2
The website requires login information.
You need to open x1 first, then click a button, and x2 opens as a popup window. If you close x1, you lose access to x2.
I used Internet Download Manager, and I think I entered the login information correctly, but it failed because x1 needs to be open when you access x2.
The website uses JavaScript.

IDM (after I've looked at it) is a download manager, not a web scraping tool. Why not use dedicated web scraping software? Most of it supports logging in and scraping complicated HTML and JavaScript-driven pages. It seems to me your case is not a pure HTML page but a JavaScript-heavy one.
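For instance, a headless-browser library such as Puppeteer can keep x1 open while reading the x2 popup. A minimal sketch, assuming a standard login form - every selector and URL below is a placeholder:

// Node.js sketch using Puppeteer; adjust selectors/URLs to the real site
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // log in first (form field selectors are placeholders)
  await page.goto('https://www.mywebsite.com/login');
  await page.type('#username', 'user');
  await page.type('#password', 'pass');
  await Promise.all([page.waitForNavigation(), page.click('#login-button')]);

  // open x1, then click the button that opens x2 as a popup;
  // x1 stays open underneath, so x2 keeps working
  await page.goto('https://www.mywebsite.com/x1');
  const popupPromise = new Promise(resolve => page.once('popup', resolve));
  await page.click('#open-x2');
  const popup = await popupPromise;
  await popup.waitForSelector('body');

  console.log(await popup.content()); // the x2 markup, ready to scrape
  await browser.close();
})();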

Related

How to track a PDF view (not click) on my website using Google Tag Manager

How can I track that someone visited the following URL of my website: http://www.website.com/mypdf.pdf?
I tried using a Page View trigger on a Page View tag. I'm completely new to Google Analytics, so I'm not sure how to proceed. Most people will go to that PDF directly via its URL, as there is no link to it on my website, but I really want to be able to track how many people view it.
Thanks in advance!
You cannot track PDF views with the help of GTM. GTM for web is a JavaScript injector, and one cannot inject JavaScript into a PDF document from the browser.
One way to circumvent this is to have a gateway page, i.e. have the click go to an HTML page that counts the view before redirecting to the document in question (naturally you could use GTM in that page). Since people go directly to the PDF URL this would require a bit of scripting - you would have to redirect all PDF requests to your gateway page via a server directive, count the view, and then have the page load the respective document.
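A minimal sketch of such a gateway page, assuming the server rewrites PDF requests to gateway.html?file=... and that gtag.js (or an equivalent GTM setup) is already loaded on the page - the rewrite scheme, parameter name, and file name are all hypothetical:

<!-- gateway.html: count the view, then forward to the real document -->
<script>
  var file = new URLSearchParams(location.search).get('file');
  var go = function () { location.replace('/' + file); };
  // send the pageview, then redirect once the hit has gone out
  gtag('event', 'page_view', { page_path: '/' + file, event_callback: go });
  setTimeout(go, 1000); // fallback if analytics is blocked or slow
</script>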
Another, even more roundabout way would be to parse your server log files and send PDF requests to GA via the Measurement Protocol (many servers actually allow log writes to be redirected to another script, so you could do this in real time). I would not really recommend that approach - it's technologically interesting, but probably more effort than it is worth.
The short version is: if you are not comfortable fiddling a little with your server setup, you will probably not be able to track PDF views. GTM does not work on PDF files.
Facing the same issue…
My solution was to use a URL shortener (like bitly.com) that includes click statistics.
Not the perfect solution, but it works for direct PDF access from an external source (outside your site).

Manage ads inside a Single Page Application

I'm developing a Single Page Application (SPA), so I refresh the page's HTML content dynamically using Ajax requests.
I'd like to register for the DoubleClick for Publishers (DFP) program, but I'm wondering if my SPA can integrate advertising, given that its content is loaded dynamically without refreshing the page.
I saw this link: https://support.google.com/dfp_sb/answer/3058726
So I assume it's OK. But I'd like to be certain before starting to use DFP. Could someone confirm, please?
Also, I sometimes use external HTML pages that I still load via Ajax. Should I write the advertising banners' JavaScript inside these external views, or directly inside the master page of my app?
Last question: how can I manage users who have ad-blocking software installed? Am I allowed to detect the presence of an ad blocker using JavaScript and then execute some specific code for those users?
I'm working on an SPA with DFP successfully. Here is my feedback on your questions:
So I assume it's OK. But I'd like to be certain before starting to use DFP. Could someone confirm, please?
Yes, you can refresh the banners using the method you are referring to in the link you shared.
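As a rough sketch of what that looks like with the GPT (googletag) library - the network code /1234567/banner and the div ID are placeholders:

googletag.cmd.push(function () {
  googletag
    .defineSlot('/1234567/banner', [300, 250], 'div-gpt-ad-banner')
    .addService(googletag.pubads());
  googletag.pubads().disableInitialLoad(); // let the SPA decide when ads load
  googletag.enableServices();
});

// call this from your router after each Ajax "page" change
function refreshAds() {
  googletag.cmd.push(function () {
    googletag.pubads().refresh(); // re-requests every defined slot
  });
}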
Then, sometimes I'm using external HTML pages that I still load using Ajax. Should I consider writing the advertising banners' JavaScript inside these external views, or directly inside the master page of my app?
Loading them from the external views will give you worse performance. Control everything from the main page and you will get better results.
Last question: how can I manage users having ad-blocking software installed? Am I allowed to detect the presence of an ad blocker using JavaScript and then execute some specific code for this kind of user?
This is something I have not started working on yet, but you can detect ad blockers (as forbes.com does on its website), and there are also open-source projects dealing with this.
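One common detection heuristic - a sketch only, since the class names blockers hide vary and none of this is bulletproof - is to insert a bait element and check whether the blocker hid it:

function detectAdBlock(callback) {
  var bait = document.createElement('div');
  bait.className = 'adsbox ad-banner'; // class names many blockers target
  bait.style.height = '1px';
  document.body.appendChild(bait);
  setTimeout(function () {
    var blocked = bait.offsetHeight === 0; // hidden/removed by the blocker
    bait.remove();
    callback(blocked);
  }, 100);
}

detectAdBlock(function (blocked) {
  if (blocked) {
    // show a message or alternative content for ad-block users
  }
});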

How to simulate a cross-origin site on a local machine?

I am trying to simulate a cross-origin site, meaning I shouldn't be able to make an Ajax request from site A to site B, since the browser will not naturally allow me to do so because of the same-origin policy.
What are the tools I can use in this regard? Or are there any hacks?
What I've tried so far: I've opened a Visual Studio solution. It has two ASP.NET Web Forms projects. One web project (say A) simply hosts a form with a file input control and a submit button. The other project, B, has a simple aspx page containing an iframe that loads site A inside it.
I ran project B, and, in the browser console window, I did something like this:
// read site A's document through the iframe (only works same-origin)
var ifr = document.getElementById('myiframe');
console.log(ifr.contentWindow.document.body.innerHTML);
The console window displays the markup of site A's page which is loaded in the client's iframe.
Clearly I've failed. But is there a way I can do it on one machine?
Well, a bit of digging shows that you can achieve this by modifying your hosts file (C:\Windows\System32\drivers\etc\hosts), as mentioned in the post below:
How do i map http://localhost:8080 to http://mysites in iis7?
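For example, after mapping two hypothetical names to the loopback address in the hosts file (127.0.0.1 site-a.local and 127.0.0.1 site-b.local), pages served under those names have distinct origins, and a quick console check shows the browser enforcing the policy:

// run this from a page on http://site-a.local (both host names are
// placeholders you added to the hosts file); the request should fail
// with a CORS error unless site-b.local sends Access-Control-Allow-Origin
fetch('http://site-b.local/somepage.aspx')
  .then(function (r) { return r.text(); })
  .then(function (html) { console.log('allowed:', html.length, 'chars'); })
  .catch(function (err) { console.error('blocked cross-origin:', err); });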

Streaming InfoPath form data from a web server - strange behaviour in a Fiddler trace

I have a simple aspx page that streams InfoPath form data to the client. That then opens in the InfoPath template on the client machine. Works fine.
However, when I look at the Fiddler trace I see 7 calls to my aspx page - the original one from the browser, then several more from the InfoPath process. My form data ends up being downloaded 4 times - once by the browser and 3 times by InfoPath!
Here's a link to the Fiddler trace file - if anyone can explain what is going on here I would be grateful.
How does InfoPath even know about the aspx file, and why does it need to call it at all, let alone call it several times?
InfoPath (and actually most MS Office applications) has code which attempts to determine where a given form came from, and then uses the content from that server rather than a local copy. This is typically important for offering the ability to post updates back to the original source (e.g. a SharePoint site). Generally you shouldn't see multiple full downloads, although you may certainly see several HEAD, OPTIONS or other WebDAV verbs.
(The Office team is working to reduce redundant downloads but this isn't trivial for a bunch of reasons).

Need to copy files to the client system - is there any possible way?

I'm developing an Online Examination System in C#/.NET and want to copy files to the client machine as soon as the exam starts, so that even if the internet gets disconnected the examinee can continue with the test.
You may wish to consider a client-server solution, such as WPF or WinForms, as this is more suited to this type of development. You can use ClickOnce deployment to have it still launched from the web and updated on every run.
If you do decide to use ASP.NET, this will result in a very JavaScript-heavy site with a very slow load on the first page.
To do this you would load all your test questions into a JavaScript data structure on the first page. Whenever the user went to the next page you would, using JavaScript, collect all the answers and store them client-side, then re-render the entire page from your JavaScript definition of the test with no trip back to the server. Once the test is complete you would send the results back to the server; the internet must be active again once you've completed the test. A sketch of this follows below.
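A minimal sketch of that client-side approach - the question data and the /api/submit endpoint are made up for illustration:

// all exam state lives in the browser until the final submit
var exam = {
  questions: [
    { id: 1, text: '2 + 2 = ?', choices: ['3', '4', '5'] },
    { id: 2, text: 'Capital of France?', choices: ['Paris', 'Rome'] }
  ],
  answers: {},  // filled in as the user progresses; never sent mid-test
  current: 0
};

function saveAnswer(questionId, value) {
  exam.answers[questionId] = value; // stored in memory, no server trip
}

function renderCurrentQuestion() {
  var q = exam.questions[exam.current];
  document.getElementById('question').textContent = q.text;
  // ...render q.choices and wire their change events to saveAnswer(q.id, v)
}

function submitExam() {
  // connectivity is needed again only here, at the very end of the test
  fetch('/api/submit', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(exam.answers)
  });
}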
You'll have to create a download package and provide a link for the user to click to request the files. You can't force a download.
If your exam is all in one web page, you don't need to do anything. Once the page appears in the user's browser, it has already been "copied locally".
