Just trying to understand why they didn't use a REST API.
In REST, clients initiate requests to servers for resources; servers process those requests and return appropriate responses.
The __utm.gif request isn't really about server-to-client data transfer; it exists to move data in the other direction, from client to server.
Of course REST has HTTP methods for the client to communicate with servers (GET and POST), and indeed Google Analytics directs the client's browser to send all analytics data to the GA servers via a GET Request. More precisely, a GET Request consists of a Request URL and Request Headers (e.g., the Referer and User-Agent Headers).
All GA data (every single item) is assembled and packed into the Request URL's query string (everything after the '?'). But for that data to go from the client (where it is created) to the GA server (where it is logged and aggregated), there must be an HTTP Request. So ga.js (the Google Analytics script, downloaded by the client when a function is called on page load, unless it's cached) directs the client to assemble all of the analytics data (e.g., cookies, the location bar, request headers, etc.), concatenate it into a single string, and append it as a query string to a URL (http://www.google-analytics.com/__utm.gif?); that becomes the Request URL.
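To make that concrete, here is a minimal Python sketch of how such a beacon URL could be assembled; the parameter names shown are an illustrative subset, not everything ga.js actually sends:

import urllib.parse

# Illustrative subset of the analytics payload; real ga.js sends many more items.
payload = {
    'utmac': 'UA-XXXXX-Y',       # account ID (placeholder)
    'utmhn': 'www.example.com',  # hostname of the tracked page
    'utmp': '/some/page',        # page path being viewed
    'utmcc': '__utma=...',       # cookie data (truncated placeholder)
}

# Every item ends up in the query string of the pixel URL.
request_url = ('http://www.google-analytics.com/__utm.gif?' +
               urllib.parse.urlencode(payload))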
Of course there can't be an HTTP Request without a resource; so what resource is the client requesting from the server? It doesn't need anything from the server; instead it wants to send information to the server. So the actual server resource requested by the client is purely pretextual: the resource isn't even needed by the client, it's requested solely to comply with the transfer protocol. Therefore, it makes sense to make that resource as small and as unobtrusive as possible, which is why it's a 1 x 1 transparent pixel in GIF format. That is the smallest possible size in the least dense image format (bytes/pixel); I think it's a little over 30 bytes. A 1 x 1 image in any of the other common formats (e.g., JPEG, PNG, TIFF) would be larger.
This general scheme for transferring data between a client and a server has been around forever; there could very well be a better way of doing this, but it's the only way I know of (that satisfies the constraints imposed by a hosted analytics service).
(Google Analytics does indeed have two APIs--"Data Export" and "Management"--which are both RESTful Web Services.)
You can use __utm.gif in browsers that don't support JavaScript via the <noscript> tag (with some work on the server), as well as in email messages (with some work before sending the email).
How are you gonna make a REST request in an email message?
Because it's an image, you can stick it anywhere you can use an image tag, even if you can't execute JS. Many years back Google pushed this for tracking of email campaigns. You could stick this formatted string in an HTML email message, and then any client that displays the message will send that request to the GA servers. You will get, at a minimum, IP info (which also gets you geolocation); depending on the client you may also get OS, language and all the other browser settings. You don't get all the fancy analytics you get from the modern JS tracking scripts, but it still has its uses.
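As a rough sketch (the parameter values are placeholders), generating such an image tag for an email could look like this in Python:

import urllib.parse

# Placeholder parameters; a real beacon uses the full GA parameter set.
params = urllib.parse.urlencode({
    'utmac': 'UA-XXXXX-Y',           # account ID (placeholder)
    'utmn': '123456789',             # cache-busting random number
    'utmp': '/email/newsletter-42',  # virtual "page" identifying this email
})

# Any mail client that renders this HTML fires the request to the GA servers.
img_tag = ('<img src="http://www.google-analytics.com/__utm.gif?' + params +
           '" width="1" height="1">')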
Here is a site that will help you format the request string and also has some more details.
Google pixel generator
In the case of HTTP requests like HEAD / GET / POST etc., what information about the client is received by the server?
I know some of the info includes the client IP, which can be used to block a user in case of, let's say, too many requests.
Another piece of information of use would be the user-agent, which is different for browsers, scripts, curl, Postman etc. (Of course the client can change the default by setting request headers, but that's alright.)
I want to know which other parameters can be used to identify a client (or define some properties). Does the server get the MAC address somehow?
So, is there a possibility that, just from the request, it is identifiable that the request is being made by a "bot" (Python or Java code, e.g.) vs a genuine user?
Assume there is no token or any such secret shared between client and server, so there is no session; each subsequent request is independent.
The technique you are describing is generally called fingerprinting; the article covers the properties and techniques involved. Depending on the use, there are many criticisms of it, as it bypasses a user's intention of being anonymous. In all cases it is a statistical technique, like most analytics.
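As a deliberately simplistic illustration (real fingerprinting combines many more signals, such as fonts, canvas rendering and TLS details), a server could hash a few request properties together:

import hashlib

def fingerprint(ip, user_agent, accept_language):
    # Combine a few request properties into one string and hash it.
    raw = '|'.join([ip, user_agent, accept_language])
    return hashlib.sha256(raw.encode('utf-8')).hexdigest()

# Two requests with the same properties produce the same (statistical) ID.
print(fingerprint('203.0.113.7', 'Mozilla/5.0 ...', 'en-US,en;q=0.9'))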
Putting your domain behind a service like Cloudflare might help prevent some of those bots from hitting your server. Other than a service like that, setting up a reCAPTCHA would block bots from accessing any pages behind it.
It would be hard to detect bots using HTTP alone, because they can send you whatever headers they want. These services use other techniques to try to detect and filter out the bots, while allowing real users to access the site.
I don't think you can rely on any HTTP request header, because a client might not send it to the server, and/or there might be proxies between the client and the server that strip or alter the request headers.
If you just want to associate a unique ID to an HTTP request, you could generate an ID on your backend. For example, the JavaScript framework Hapi.js computes a request ID using this code:
new Date() + '-' + process.pid + '-' + Math.floor(Math.random() * 0x10000)
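For comparison, a rough Python sketch of the same formula (not Hapi's actual code) could be:

import os
import random
import time

# timestamp + process ID + random suffix, the same idea as the Hapi.js snippet
request_id = '%d-%d-%d' % (int(time.time() * 1000), os.getpid(),
                           random.randrange(0x10000))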
You might not even need to generate an ID manually. For example, if your app is on AWS and there is an Application Load Balancer in front of your backend, the incoming request will have the custom header X-Amzn-Trace-Id.
As for distinguishing between requests made by human clients and bots, I think you could adopt a "time trap" approach like the one described in this answer about honeypots for spambots.
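A minimal sketch of such a time trap (the threshold and storage are arbitrary choices for illustration):

import time

form_rendered_at = {}  # session ID -> time the form was served

def serve_form(session_id):
    # Remember when the form was sent to this client.
    form_rendered_at[session_id] = time.time()

def handle_submission(session_id):
    elapsed = time.time() - form_rendered_at.get(session_id, 0)
    # A human needs at least a few seconds to fill in a form;
    # a bot typically posts back almost instantly.
    if elapsed < 3:
        return 'rejected: probably a bot'
    return 'accepted'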
HTTP request headers are not a good way to track users of your site. This is because users can edit these headers, and the server has no way to verify their authenticity. Also, in the case of the IP address, it can change during a session if, for example, the user is on a mobile network.
My suggestion is using a cookie with a unique, random id, given to the user the first time they land on a page of your site. Keep in mind that the user can still edit/remove this cookie, so it isn't a perfect method. If you can force the user to login, then you could track the user with their session token.
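A minimal Flask sketch of handing out such a cookie (the cookie name is illustrative):

import uuid
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route('/')
def index():
    response = make_response('hello')
    # Only assign a new ID if the client doesn't already have one.
    if 'visitor_id' not in request.cookies:
        response.set_cookie('visitor_id', uuid.uuid4().hex)
    return response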
So I am working on a full-stack website using Flask, Vue.js and SQLite3. I am using axios to call the backend from the frontend. I noticed, though, that if I make a 'GET' request from axios (Vue.js) to a backend route, all of that information can be seen in plain text (JSON) at that route. I tried requiring a 'secret' in the header, which does work, but the header information can be viewed in plain text as well, and it shows the 'secret'. I tried socket.io as well, but those socket.io requests can be viewed in plain text too.
Is there a way to encrypt or hide 'GET' requests while still allowing that information to get from the backend to the frontend for any IP that calls my site?
So for example say I have a database with a column and value of: Header1: 500. On the frontend I want to show that 500 in the HTML. So using axios I use a .get(path) and call the data. In Flask I have a route with method='GET' and return the database in JSON format. On the frontend I would save the axios response in a variable and then show the variable in the HTML. And that's great and all. But the information in the 'GET' request shows 'Header1: 500' in plain text. Now obviously that's not a big deal with that not very important value. But say I am calling an entire database to the frontend to display parts of it or use the values; those values are now all viewable on the Flask route to anyone.
Every network request can be read in plaintext by default. There are various methods of preventing the data from being read this way:
login wall - assumes that the user is authorised to read the data that is sent back to them
limiting the response - the most sane way: just limit the data the user is sent; it usually carries the side benefit of a performance gain, since the responses are smaller (see the sketch after this list)
encryption - you could encrypt data on the backend and decrypt it in the browser, but the end user would still be able to access the frontend decryption algorithm and reverse the process
The general rule in frontend development is that everything sent from the server can be read by the user.
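For instance, a minimal Flask sketch of the "limiting the response" approach, returning only the field the page actually displays instead of the whole record (the names here are hypothetical):

from flask import Flask, jsonify

app = Flask(__name__)

# Imagine this row came from SQLite; the frontend only needs one value.
ROW = {'id': 17, 'Header1': 500, 'internal_notes': 'do not expose'}

@app.route('/api/header1', methods=['GET'])
def header1():
    # Send only what the page displays, not the whole record.
    return jsonify({'Header1': ROW['Header1']})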
I would like to try sending requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but I still failed.
All other websites are OK.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
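For example, httpbin echoes the request back as JSON, which makes comparing requests straightforward:

import requests

# httpbin.org/get reflects the request it received, headers included.
response = requests.get('https://httpbin.org/get',
                        headers={'User-Agent': 'my-test/1.0'})
print(response.json()['headers'])  # see exactly which headers were sent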
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser did); see the sketch after this list.
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
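A minimal sketch of the cookie-capturing approach from the list above (the login URL and form field names are placeholders that depend on the site):

import requests

session = requests.Session()  # persists cookies across requests

# Placeholder login step; the actual fields depend on the site's login form.
session.post('https://example.com/login',
             data={'user': 'me', 'password': 'secret'})

# Subsequent requests carry the captured session cookies automatically.
response = session.get('https://example.com/protected/page')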
In this case the site is filtering on the user agent; it looks like they are blacklisting Python. Setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
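A short sketch using requests-html (note that render() downloads a headless Chromium on first use):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw')
r.html.render()      # executes the page's scripts in headless Chromium
print(r.html.links)  # links present after JavaScript has run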
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
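If that endpoint returns the listing data directly, you may be able to call it with requests, though the site could require additional headers or tokens beyond what is shown in this sketch:

import requests

url = 'https://rent.591.com.tw/home/search/rsList'
params = {'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1}
headers = {'User-Agent': 'Custom'}  # the site blocks the default Python agent

response = requests.get(url, params=params, headers=headers)
print(response.json())  # assuming the endpoint responds with JSON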
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
import requests

# Open the file for reading ('r'); the original 'w' mode would truncate it
# and raise an error on read().
with open("filepath", 'r') as file:
    # splitlines() drops the trailing \n (and \r\n) from each line
    links = file.read().splitlines()

for link in links:
    response = requests.get(link)
In my case this was due to the fact that the website address had recently been changed, and I had been given the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I need to serve different resources (especially images) for the same URLs, depending on complex logic based on different factors (cookie, IP, time, randomness). I want to take advantage of CDNs (caching, availability, proximity). So, I want the CDN to make a call to my server in order to decide which resource to serve for any given request. It is very important not to use redirects, so the user will never see a 30X status code.
For clarification:
User makes a request to http://resources.mydomain.com/img/a.jpg, whose domain is behind the CDN
CDN makes a call to my server, sending url requested, cookies and user IP
My server returns the name of the real resource to serve (http://hidden.mydomain.com/img/a-version3.jpg); see the sketch after this list
CDN requests that image if not in cache
CDN responds to the user's request with the a-version3.jpg data, but without any redirect
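For illustration only, step 3 might be a tiny decision endpoint like this Flask sketch (the routing logic and names are entirely hypothetical):

from flask import Flask, request

app = Flask(__name__)

@app.route('/decide/img/<name>')
def decide(name):
    # Pick a variant based on cookie, IP, time, randomness, etc.
    version = 'version3' if request.cookies.get('segment') == 'b' else 'version1'
    base, ext = name.rsplit('.', 1)
    # Return the real resource name for the CDN to fetch and serve.
    return 'http://hidden.mydomain.com/img/%s-%s.%s' % (base, version, ext)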
Is it possible using any current commercial solution?
Yes, I think this has been supported by CDNetworks for a long time.
It is called "Origin Logic Control" now. You can check the description in http://www.cdnetworks.com/wp-content/uploads/2013/08/CDNetworks-ContentAccel-DS-EN2.pdf:
Allows a customer’s domain to require checking with the origin on every request.
You can return a special HTTP header (or special HTTP body, I am not sure now) to tell CDNetworks to return resources directly (using the cached version if available), rather than a 30x status code.
You can enable Redirect Chasing to get what you are looking for. Alternatively, look at the Akamai blog post on Edge Redirect for a faster option.
Is there any sort of data dump or data set with information from Web Server logs?
The information that I am mainly looking for are:
a) what type of request is it (POST, GET, or some other HTTP method)
b) What type of data is being transferred (image, audio, video or text)
c) what is the size of the data that is being transferred
Information such as IP addresses and URLs can be anonymized.
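For reference, a single line in the common Apache/Nginx "combined" log format carries the method, path, status and response size; a rough Python parse of one such line might look like this:

import re

line = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /images/logo.png HTTP/1.1" 200 2326 '
        '"http://example.com/" "Mozilla/5.0"')

# method, path and protocol sit inside the first quoted field;
# the status code and byte count follow it.
match = re.search(r'"(\w+) (\S+) (\S+)" (\d{3}) (\d+)', line)
if match:
    method, path, protocol, status, size = match.groups()
    print(method, path, status, size)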
Are you using Firefox? If so, you can use the included Web Console tool to view all the HTTP request bodies being sent from your browser to the server, and the response bodies, along with things like the method (GET, POST, etc.). This is the same information that a web server would be logging (except that the IP address of the client is always yours, obviously). You should be able to copy all the data and paste it to a file if you want a data dump.
To use the web console, click the orange Firefox button and then Web Developer > Web Console. Or if you're using an older version or have the Firefox button disabled, it's under the tools menu.
Edit: To get the most out of it, you'll want to right click on the console and select Log Request and Response Bodies. This will get you more information than just the headers.