I was going to post this in the Workbox Github repo, but I doubt it's a bug and more likely my own misunderstanding. I have found some slightly similar questions, but none of the answers seem to clearly explain how I can resolve my issue.
In my sw.js file I am precaching the Home URL and the Start URL. The Start URL is the exact same as the Home URL, except it appends ?utm_source=pwa to the URL. This is a technique I've read that others do to track PWA usage in Google Analytics and I like the idea.
However, now when a new user arrives at the website, they load the initial page and then Workbox fetches the Home URL and then fetches the Start URL. This means that if the user arrives at the homepage of the website they will have loaded that page 3 times. I'd like to figure out how to get Workbox to realize that the Home URL and Start URL are essentially the same and to not need that third fetch request.
I understand that ignoreURLParametersMatching defaults to [/^utm_/], which I would expect to do what I described above, but perhaps I'm misunderstanding it and it does not apply to precached URLs...? Does it apply automatically if I don't explicitly pass it to precacheAndRoute()?
To clarify, my expectation of ignoreURLParametersMatching is that it precaches the Home URL and then, when it attempts to cache the Start URL, it ignores (removes) the UTM parameter, sees that it already has that URL cached and does not fetch. Then, when the Start URL is requested from cache, it would again ignore the UTM parameter and respond with the URL it has in cache. Is this far off from reality? If so, how should I set this up to keep my tracking and avoid the "duplicate" fetch?
Here are some excerpts of my sw.js file:
const HOME_URL = 'https://gearside.com/nebula/';
const START_URL = 'https://gearside.com/nebula/?utm_source=pwa';
workbox.precaching.precacheAndRoute([
    //...other precached files
    {url: HOME_URL, revision: revisionNumber},
    {url: START_URL, revision: revisionNumber},
]);
Both URLs are precached:
Shows both fetch requests:
Note: I've noticed this problem with or without revision numbers.
TL;DR
Do not include https://gearside.com/nebula/?utm_source=pwa in the precache manifest.
Use the workbox-google-analytics module:
import * as googleAnalytics from 'workbox-google-analytics';
googleAnalytics.initialize();
Long version
You should precache based on unique resources. Every entry defined in the precache manifest will be downloaded and cached.
If https://gearside.com/nebula/ and https://gearside.com/nebula/?utm_source=pwa serve the exact same content, only precache one of them (preferably the one without the query string).
The ignoreURLParametersMatching option takes an array of regexes that are tested against the name of each query parameter. Any parameter whose name matches one of them is ignored when the request URL is matched against the precache.
To exemplify,
precacheAndRoute([
    {url: '/styles/main.css', revision: '777'},
], {
    ignoreURLParametersMatching: [/.*/]
});
Will match any of these requests:
/styles/main.css
/styles/main.css?minified=0
/styles/main.css?minified=0&renew=1
and serve /styles/main.css, because the regex .* matches every query parameter name, so all parameters are ignored.
The default value of ignoreURLParametersMatching is [/^utm_/]. If in the example above we skip ignoreURLParametersMatching, any of the following requests would be matched (and resolved with the precached /styles/main.css):
/styles/main.css
/styles/main.css?utm_hello=yes
/styles/main.css?utm_yes_what=dunno&utm_really=yeah
But the following requests will not go through the precache:
/styles/main.css?remodelate=expensive&utm_pwa=no
/styles/main.css?utm_spa=neither&trees=awesome
because each of them contains at least one query parameter that does not start with utm_.
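As a rough illustration of the rule described above, here is a small model of the matching, written in Python rather than taken from Workbox itself. It assumes the behaviour is "drop every query parameter whose name matches one of the regexes, then look up what is left in the precache"; the function names are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Mirrors the default mentioned above, [/^utm_/]; a model, not Workbox code.
DEFAULT_IGNORE = [re.compile(r"^utm_")]


def strip_ignored_params(url, ignore_patterns=DEFAULT_IGNORE):
    """Return url without the query parameters whose names match a pattern."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not any(pattern.search(name) for pattern in ignore_patterns)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def served_from_precache(url, precached, ignore_patterns=DEFAULT_IGNORE):
    """True if url, once ignored parameters are removed, is a precached URL."""
    return strip_ignored_params(url, ignore_patterns) in precached


precached = {"/styles/main.css"}
print(served_from_precache("/styles/main.css?utm_hello=yes", precached))  # True
print(served_from_precache("/styles/main.css?utm_spa=neither&trees=awesome", precached))  # False
```

Under this model, the Start URL with ?utm_source=pwa reduces to the Home URL, which is why a single precache entry is enough to serve both.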
More info about the workbox-google-analytics module can be found in the Workbox Google Analytics documentation.
Related
My Dynamic Link gives the error "Invalid Dynamic Link - Blocked":
We could not match param 'https://www.toppscholars.com?meetingId=546546&pwd=98456' with whitelisted URL patterns in this Google project.
I tried to create a whitelist pattern that goes to the Play Store:
^https://play.google.com/.*id=com.appname$
I am unable to use it for the URL below. I need the redirect to go to this URL so that the app can read the values:
https://www.toppscholars.com?meetingId=546546&pwd=9845
I'm expecting the Dynamic Link to open my app and read the parameters.
If the app is not installed, it should go to the App Store / Play Store.
I expect the link to work across all devices and platforms without error.
The issue was resolved by following the example below.
Link: https://www.example.com/post?postId=hy48ndmFLMdxydT7mGPq
Allowlist URL Pattern: ^https{0,1}:\/\/www\.example\.com\/post([\/#\?].*){0,1}$
Similarly for
Link: https://www.toppscholars.com?meetingId=546546&pwd=9845
Allowlist URL Pattern: ^https{0,1}:\/\/www\.toppscholars\.com([\/#\?].*){0,1}$
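A quick way to sanity-check such a pattern before pasting it into the console is to try it locally. The sketch below uses Python's re module as a stand-in for whatever regex engine the console uses (an assumption), and the helper name is_allowed is hypothetical; the patterns and the link are copied from the text above.

```python
import re

# Patterns copied from the text above; Python's re only approximates the
# engine used by the console.
PLAY_STORE_PATTERN = r"^https://play.google.com/.*id=com.appname$"
SITE_PATTERN = r"^https{0,1}:\/\/www\.toppscholars\.com([\/#\?].*){0,1}$"

link = "https://www.toppscholars.com?meetingId=546546&pwd=9845"


def is_allowed(url, pattern):
    """True if the pattern matches the URL."""
    return re.search(pattern, url) is not None


print(is_allowed(link, PLAY_STORE_PATTERN))  # False: pattern only covers the Play Store
print(is_allowed(link, SITE_PATTERN))        # True: pattern covers the deep link itself
```

The point is that the allowlist pattern has to match the deep link itself, not the store URL the user may end up on.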
I would like to send a requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and I have tried different approaches, but they all failed.
All other websites work fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
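One way to make the comparison step concrete is to diff the two sets of recorded headers. This is a minimal sketch, assuming you have already captured both sets as plain dictionaries (for example from an echo endpoint); diff_headers and the sample values are hypothetical.

```python
def diff_headers(working, failing):
    """Map each differing header name to (working value, failing value).

    Header names are compared case-insensitively; a header that is missing
    on one side is reported as None.
    """
    working = {name.lower(): value for name, value in working.items()}
    failing = {name.lower(): value for name, value in failing.items()}
    return {
        name: (working.get(name), failing.get(name))
        for name in sorted(working.keys() | failing.keys())
        if working.get(name) != failing.get(name)
    }


# Hypothetical header sets recorded from a browser and from a script.
browser_headers = {"User-Agent": "Mozilla/5.0", "Accept": "*/*", "Accept-Language": "en"}
script_headers = {"user-agent": "python-requests/2.31.0", "Accept": "*/*"}

for name, (good, bad) in diff_headers(browser_headers, script_headers).items():
    print(f"{name}: working={good!r} failing={bad!r}")
```

Each reported header is a candidate to copy over to the failing request, one at a time, until the response changes.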
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
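import requests

# Open the file for reading ('r'); mode 'w' would truncate it instead.
with open("filepath", 'r') as file:
    # splitlines() drops the trailing \n (or \r\n) from every line
    links = file.read().splitlines()
for link in links:
    response = requests.get(link)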
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
When I send a one-off document to RightSignature via their API, I'm specifying a callback location in the XML document as specified in RightSignature's schema definition. I then get a signer-link value back from their API for the document. I display the HTML response from the signer-link URL in an iFrame on our website. When our user signs the document in this iFrame, which is rendering the responses from their website, I want their website to post to our callback location.
Can I do this with the RightSignature API and does it make sense?
So far, I'm only getting content in the iFrame that indicates that the signing was successful. The callback location does not seem to be getting called.
I got it solved just now. Basically, I was doing two things wrong. First, you have to go into your RightSignature account and set the callback URL there:
Account > Settings > Advanced Settings
What RightSignature does not mention is that this URL cannot be localhost. It has to be an https URL, i.e. the live URL of your site, like
https://stagingmysite.azurewebsites.net/User/CallBackFunction
Then, in your callback, just write these two lines and you will receive the complete XML, which contains the GUID and the document status as well.
byte[] data = Request.BinaryRead(Request.TotalBytes);
string callBackXML = System.Text.Encoding.UTF8.GetString(data);
I found the answer with some help from the API team at RightSignature. I was using callback_location but what I really wanted is redirect_location. Their online documentation was difficult to follow and did not clearly point out the difference.
I got this working after a lot of trial and error.
I am developing an application that will serve multiple customer-organizations, each of them should be given access based on a fixed url. Example: domain/myapp/CustomerOrg1
Previously I always registered a new WAComponent-subclass for each of these entry-points. That does work, but there has to be a better solution: I would like a single component-class to find out which URL the request uses (and then respond with the customer-org's homepage).
I tried:
registering a WARequestHandler-subclass; and it allows me to find out the full path (incl. /CustomerOrg1) but I am outside of any session and don't know how to get into one.
registering a WAComponent-subclass as /myapp, and it works in that it also handles /myapp/CustomerOrg1 automatically, however when I try to find out the URL used (by self session url inspect) it claims to be only the base-url (/myapp).
Try
self requestContext request uri
and if you are not in a component but in some other object, you can do
WACurrentRequestContext value request uri
Please be aware that, in a production environment, the uri you get with the answer by Norbert has already been processed, and possibly modified, by the web server (Apache, nginx, etc.) that handles static content and load balancing in front of your application.
I want to retrieve the content of sample.html inside the catalog folder in Alfresco using the RESTful API.
From the Alfresco documentation I got the following REST URL to retrieve the content of a document, but I don't know exactly what property, store_type, store_id, id and attach are.
GET /alfresco/service/api/node/content{property}/{store_type}/{store_id}/{id}?a={attach?}
I would be grateful if someone could explain the above REST URL properties and provide an example.
The CMIS Web Scripts Reference and the Repository RESTful API Reference give a little more information (but no examples).
property is the property of the node to follow in order to obtain the content - this will default to cm:content so can generally be omitted
store_type will normally be "workspace" for live application data - see this forum discussion on store types etc
store_id will be "SpacesStore" for normal files - see this forum discussion on other stores
id is the unique identifier for the node (within a given store), e.g. 986b162e-0867-4a7b-9f4f-0e3837cdc97b
attach - if true, force download of content as attachment (defaults to false) - I think this is to trigger "Save as..." in a browser rather than directly streaming the content?
Example GET URL (untested - and of course you'd need to use a valid host, port and id)
http://my.example.com:8080/alfresco/service/api/node/content/workspace/SpacesStore/986b162e-0867-4a7b-9f4f-0e3837cdc97b
Together, the store_type, store_id and id form a NodeRef which uniquely identifies a node, e.g.
workspace://SpacesStore/f1a5e908-80cb-4c6e-b919-cc80fe53b835
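To show how the pieces fit together, here is a small sketch that turns a NodeRef into the content URL from the template above. It is untested against a real Alfresco server; the helper name content_url is hypothetical, the host and port are the placeholder values from the example URL, and the property segment is left out on the assumption that the default (cm:content) is wanted.

```python
def content_url(base_url, node_ref, attach=False):
    """Build the node content URL for a NodeRef such as
    workspace://SpacesStore/<id>, following the URL template above."""
    store_type, rest = node_ref.split("://", 1)
    store_id, node_id = rest.split("/", 1)
    url = (
        f"{base_url.rstrip('/')}/alfresco/service/api/node/content"
        f"/{store_type}/{store_id}/{node_id}"
    )
    if attach:
        # a=true asks the server to send the content as an attachment
        url += "?a=true"
    return url


print(content_url(
    "http://my.example.com:8080",
    "workspace://SpacesStore/986b162e-0867-4a7b-9f4f-0e3837cdc97b",
))
```

The resulting URL can then be fetched with any HTTP client, supplying credentials as your server requires.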
There are a couple of examples (though not of this exact API call) on Jeff Potts' tutorial on Curl and web scripts.
If you want to download a file by name and path (without already knowing the node ID) then you will need to use another API, as the one you are using requires you to know the node ID.
This page mentions a direct download URL that accepts a path and filename, e.g.
/alfresco/download/direct?path=/Company%20Home/My%20Home%20Space/myimage.jpg
Depending on access controls, you may need to add the login ticket parameter to this URL, e.g. &alf_ticket=1234567890, where 1234567890 is the security ticket provided by the login URL.
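The path in that URL has to be percent-encoded, which is easy to get wrong by hand. A minimal sketch, assuming the direct download URL behaves as described above (untested against a real server); direct_download_url is a hypothetical helper, and the host and ticket are the placeholder values used in this answer.

```python
from urllib.parse import quote


def direct_download_url(base_url, repo_path, ticket=None):
    """Build a direct download URL for a repository path, optionally
    appending the login ticket as alf_ticket."""
    # quote() keeps "/" as-is and encodes spaces as %20
    url = f"{base_url.rstrip('/')}/alfresco/download/direct?path={quote(repo_path)}"
    if ticket is not None:
        url += f"&alf_ticket={quote(ticket, safe='')}"
    return url


print(direct_download_url(
    "http://my.example.com:8080",
    "/Company Home/My Home Space/myimage.jpg",
    ticket="1234567890",
))
```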
Note: although I refer to the CMIS Web Scripts Reference above, see also this posting and Jira ticket that state that CMIS web script URLs are deprecated, i.e. ( /alfresco/service/cmis and /alfresco/cmis)