Search a url for unique phrase using Google API - http

Does Google have an API with a function which will verify if a specific phrase can be found at a given url?
Say I have a webpage url: www.mysite/2011/01/check-if-phrase-exists
I want to know if the phrase foobar exists somewhere on that document (it can be anywhere on the html document - not just "readable text").
The function/api would return True or False.
Question update: the "method" should spare me from having to retrieve the entire page to my server and search it myself. It is the fetching of the webpage to my server that I am trying to avoid (to cut down on bandwidth).

I don't think they do, but you could do this yourself without much code (this is adapted from the App Engine docs):
import urllib2

url = "http://www.google.com/"
try:
    result = urllib2.urlopen(url)
    # urlopen() returns a file-like response; read() yields the page body
    my_search_function(result.read())
except urllib2.URLError as e:
    handleError(e)
Then you can just define my_search_function(text) to do what you need.
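As a rough Python 3 sketch of what such a search function could look like, the version below reads the response in chunks and stops as soon as the phrase is seen, so at most the part of the page up to the first match is downloaded. The names phrase_in_chunks and url_contains are hypothetical, and the page still travels to your server, so this only reduces the bandwidth rather than avoiding the fetch.

```python
from typing import Iterable
from urllib.request import urlopen


def phrase_in_chunks(chunks: Iterable[bytes], phrase: bytes) -> bool:
    """Return True as soon as phrase appears in the stream of chunks."""
    if not phrase:
        return True
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if phrase in window:
            return True
        # Keep the last len(phrase) - 1 bytes so that a match split
        # across two chunks is still found on the next iteration.
        tail = window[-(len(phrase) - 1):] if len(phrase) > 1 else b""
    return False


def url_contains(url: str, phrase: str, chunk_size: int = 8192) -> bool:
    """Fetch url in chunks and stop reading once phrase has been found."""
    with urlopen(url) as response:
        chunks = iter(lambda: response.read(chunk_size), b"")
        return phrase_in_chunks(chunks, phrase.encode("utf-8"))
```

A call such as url_contains("http://www.mysite/2011/01/check-if-phrase-exists", "foobar") would then return True or False; the raw bytes are searched, so markup and scripts are included, not just the readable text.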

Related

Understand Dynamic Links Firebase

I would like to understand Firebase Dynamic Links better, because I am very new to this subject.
What I would like to know:
Is FirebaseDynamicLinks.instance.getInitialLink() supposed to return only the last dynamic link created, with the "initial" URL (before it was shortened)?
If not, why doesn't FirebaseDynamicLinks.instance.getInitialLink() take a String url as a parameter?
FirebaseDynamicLinks.instance.getDynamicLink(String url) doesn't read custom parameters if the URL was shortened, so how can we retrieve custom parameters from a shortened link?
My use case is quite simple: I am trying to share an object through messages in my application, so I want to save the dynamic link in my database and be able to read it back to run a query according to specific parameters.
FirebaseDynamicLinks.instance.getInitialLink() returns the link that opened the app and if the app was not opened by a dynamic link, then it will return null.
Future<PendingDynamicLinkData?> getInitialLink()
Attempts to retrieve the dynamic link which launched the app.
This method always returns a Future. That Future completes to null if
there is no pending dynamic link or any call to this method after the
first attempt.
https://pub.dev/documentation/firebase_dynamic_links/latest/firebase_dynamic_links/FirebaseDynamicLinks/getInitialLink.html
FirebaseDynamicLinks.instance.getInitialLink() does not accept a string url as parameter because it is just meant to return the link that opened the app.
Looks like there's no straightforward answer to getting the query parameters back from a shortened link. Take a look at this discussion to see if any of the workarounds fit your use case.
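One possible workaround for the stated use case, sketched here under assumptions: if the database stores the long (unshortened) dynamic link, the custom parameters can be read back on the server by parsing the deep link carried in its link query parameter. The function name deep_link_params and the example domain are hypothetical, and this does not resolve short links.

```python
from urllib.parse import parse_qs, urlparse


def deep_link_params(long_dynamic_link: str) -> dict:
    """Extract the query parameters of the deep link embedded in a long dynamic link."""
    outer = parse_qs(urlparse(long_dynamic_link).query)
    deep_links = outer.get("link")
    if not deep_links:
        return {}
    # parse_qs has already percent-decoded the embedded deep link.
    inner = parse_qs(urlparse(deep_links[0]).query)
    # Keep the first value of each parameter for simplicity.
    return {key: values[0] for key, values in inner.items()}


# Hypothetical long link; the deep link is percent-encoded inside "link".
example = (
    "https://example.page.link/"
    "?link=https%3A%2F%2Fexample.com%2Fitem%3Fid%3D42%26kind%3Dbook"
    "&apn=com.example.app"
)
params = deep_link_params(example)
```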

How to figure out where is the raw data in a table?

https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox. Then change the date in HISTORIC PRICES, then click 'GO'. The table is updated. But I don't see relevant HTTP requests sent in devtools.
So this means that the data has already been downloaded in the first request. But I can not figure out how to extract the raw data of the table. Could anybody take a look at how to extract the raw data from the table? (Note that I don't want to use methods like selenium, I want to stay with raw HTTP requests to get the raw data.)
EDIT: WebSocket was mentioned in a comment, but I can't see it in Developer Tools. I have added the websocket tag anyway in case somebody who knows more about WebSocket can chime in.
I am afraid you cannot extract JavaScript-rendered content without Selenium. You can always use a headless browser (no window appears on your screen; the only pitfall is that you have to wait until the page has fully loaded), and it won't bother you anymore.
In other words, the other scraping libraries work with URLs and forms. Scrapy can post forms but cannot run JavaScript.
Selenium will save the day; all you lose is a couple of seconds per attempt (will be milliseconds if it is run in frontend). You can get the page source with driver.page_source and parse it directly (as HTML text) with BeautifulSoup or whatever.
You can do it with requests-html, for example let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (7 seconds here, but maybe less), but if you want to make multiple requests you can do it asynchronously!
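Once the row text is available, it can be turned into typed values. The sketch below assumes the layout shown in the output above: a date, then price cells, then a comma-grouped volume. PriceRow and parse_row are hypothetical names, and the meaning of each price column is not confirmed by the page, so the prices are kept as a plain list.

```python
from datetime import date, datetime
from typing import List, NamedTuple


class PriceRow(NamedTuple):
    day: date
    prices: List[float]  # column meanings are an assumption, kept unnamed
    volume: int


def parse_row(text: str) -> PriceRow:
    """Parse the newline-separated text of one table row."""
    cells = [line.strip() for line in text.splitlines() if line.strip()]
    day = datetime.strptime(cells[0], "%m/%d/%Y").date()
    prices = [float(cell.replace(",", "")) for cell in cells[1:-1]]
    volume = int(cells[-1].replace(",", ""))
    return PriceRow(day, prices, volume)


row = parse_row("06/18/2021\n146.31\n146.83\n144.94\n145.01\n3,220,680")
```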

Firebase Storage : Get the token of the URL

I currently have an application that works with Firebase.
I repeatedly load profile pictures. However, the link is quite long and consumes a certain amount of data. To reduce this load, I would like to hard-code the fixed part of the link and only load the token that is appended to it.
To explain, a link looks like this: “https://firebasestorage.googleapis.com/v0/b/fir-development.appspot.com/o/9pGveKDGphYVNTzRE5U3KTpSdpl2?alt=media&token=f408c3be-07d2-4ec2-bad7-acafedf59708”
So I would like to hard-code: https://firebasestorage.googleapis.com/v0/b/fir-developpement.appspot.com/o/
Followed by "9pGveKDGphYVNTzRE5U3KTpSdpl2", which is the UID of the user that I already retrieve, and then the part that causes my problem: "alt=media&token=f408c3be-07d2-4ec2-bad7-acafedf59708", which is generated randomly for each photo.
I would like to retrieve only this last random piece.
Is it possible?
Thank you
Update 01/11: still no solution.
It's not supported to break apart and reassemble download URLs. You should be treating these strings as if their implementation details might change without warning.

How to remove _ga query string from URL

I have a multi-domain website with GA tracking. We recently moved to Universal Analytics and noticed that whenever the domain changes (from US to Korean/Japanese), _ga=[random number] is appended to the URL,
i.e. from
abc.com
when I click on the Japanese site, the URL becomes
japanese.abc.com/?_ga=1.3892897.20937502.9237834
Why does this happen?
How can I remove the _ga part of the URL?
Appreciate your help.
This is needed for cross-domain tracking (i.e. tracking people who cross domain boundaries as one visitor rather than as one visitor per domain). If you want cross-domain tracking, you cannot remove this. The _ga part carries the client ID, which identifies the visitor, and since it cannot be shared via cookies (which are domain specific) it has to be passed via the URL when the domain changes.
Since somebody set your site up for cross-domain tracking, I guess you actually want this (it does not happen by default). The parameter is a necessary side effect of cross-domain tracking with Universal Analytics. If you do not want this, look in the tracking code for any of the linker functions mentioned in the documentation and remove them.
Updated to answer the questions from the comment.
Is there no way to remove the _ga string and still have the cross
domain facility?
No, currently not. Browser vendors work on better ways of cross
domain communication so there might be something in the future, but
at the moment the parameter is the best way.
Also, what if some user randomly changes the _ga value and presses
enter? How will GA record that?
If the user happens to create a client id that has been used before
(highly unlikely) his visit would be attributed to another user.
Realistically Google Analytics will just record him as a new user.
Updated
For those who like to play, I did a proof of concept for cross-domain tracking without the _ga parameter. Something along those lines could be developed further; as-is, it is not suitable for production use.
Update: David Vallejo has a Javascript solution where the _ga parameter is removed via the history API (so while it is still added it is for all intents and purposes invisible to the end user). This is a more elaborate version of Michael Hampton's answer below.
I'm using HTML5 history.replaceState() to hide the GA query string in the browser's address bar.
This requires me to construct a new URL with the _ga= value removed (you can do this in your favorite language) and then simply call replaceState() with it.
This only alters the URL in the address bar (and in the browser's history). Google Analytics still gets the information passed in via the query string, so your tracking still works.
I do this in a Go html/template:
{{if .URL.RawQuery}}
<script>
window.history.replaceState({}, document.title, '{{.ReplacedURL}}');
</script>
{{end}}
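Constructing the replacement URL can be done in any server-side language. As an illustration only, here is a minimal Python sketch using the standard urllib.parse module; strip_ga is a hypothetical helper name, and the result would play the role of .ReplacedURL in the template above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_ga(url: str) -> str:
    """Return url with the _ga query parameter removed, other parameters kept."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "_ga"
    ]
    # urlunsplit omits the "?" when the remaining query string is empty.
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = strip_ga("http://japanese.abc.com/?_ga=1.3892897.20937502.9237834")
```

Because only the address bar is rewritten on the client, the original request (including _ga) has already reached the page, so tracking is unaffected.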
I was asked to remove this tag after it started showing up when we split our website between two domain names. With Apache Rewrite Rules:
RewriteCond %{QUERY_STRING} _ga
RewriteRule ^(.*)$ $1? [R=301,NC,L]
This will remove the tag (along with the rest of the query string), but the _ga params will no longer be passed to Google Analytics.
If the user doesn't mind a short refresh, then adding this code to every page
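<?php
// Split the request URI into path and query string (the query may be absent).
list($url, $qs) = array_pad(explode('?', $_SERVER['REQUEST_URI'], 2), 2, '');
// If _ga is present, reload the bare path after one second.
if (preg_match('/(^|&)_ga=/', $qs)) header("Refresh: 1; url={$url}");
?>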
will refresh after a second, removing the query string, but allowing the Google Analytics action to take place. This means that by the time your user has bookmarked or copied your URL, the pesky _ga stuff has long gone.
The above code will throw away ANY query string. This version will just strip out the '_ga' argument.
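$urlA = parse_url($_SERVER['REQUEST_URI']);
$qs = isset($urlA['query']) ? $urlA['query'] : '';
if (preg_match('/(^|&)_ga=/', $qs)) {
    $url = $urlA['path'];
    $newargs = array();
    foreach (explode('&', $qs) as $e) {
        // split on the first "=" only; a bare key gets an empty value
        list($arg, $val) = array_pad(explode('=', $e, 2), 2, '');
        if ($arg == '_ga') continue; # get rid of this one
        // decode here because http_build_query() encodes again
        $newargs[urldecode($arg)] = urldecode($val);
    }
    $nqs = http_build_query($newargs);
    header("Refresh: 1; url={$url}" . ($nqs !== '' ? "?{$nqs}" : ''));
}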
You can't stop Google from adding the tag, but you can tell Analytics to ignore it in your reports. Thanks to Russ Henneberry for this: http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
It was written before Universal was released, so the language is outdated: now you create a new "view" (rather than a "profile"). Creating a new view ensures that you still have the raw data in your default view (just in case you ever need it), so it's really the best solution (keeping in mind that you can't ever apply new settings retroactively in Google Analytics). Good luck!
You can't remove the _ga parameter from the URL on the website...BUT you can use an Advanced filter in Google Analytics to remove the query parameter from the reports!
Like this:
1) Field A: Request URI
Pattern: ^(.+)\?_ga
2) Field B: not needed
3) Output To -> Constructor
Field: Request URI
Pattern: $A1
This filter will strip off all query parameters when _ga is the first parameter shown. You can get a lot fancier with the regex, but this approach should work for most websites.
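To check what this pattern does to a given Request URI before creating the filter, its effect can be emulated offline. This is only a sketch with Python's re module, not Analytics' own regex engine, and filtered_request_uri is a hypothetical helper standing in for the "Field A to $A1" step.

```python
import re

# Same pattern as Field A in the filter above.
FIELD_A = re.compile(r"^(.+)\?_ga")


def filtered_request_uri(request_uri: str) -> str:
    """Return what $A1 would hold, or the URI unchanged when the filter does not match."""
    match = FIELD_A.search(request_uri)
    return match.group(1) if match else request_uri


first_param = filtered_request_uri("/page?_ga=1.2.3.4&a=1")
later_param = filtered_request_uri("/page?a=1&_ga=1.2.3.4")
```

The second call shows the limitation mentioned above: when _ga is not the first parameter, the URI is left untouched.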
See this page: https://support.google.com/tagmanager/answer/6107124?hl=en
and search for "use hash as delimiter".
Setting this value to true allows you to pass the value through a hash fragment instead of through a query parameter.
That should fix it.
One way to handle this is to use the history.replaceState JavaScript function to remove the query string from the URL after the page has finished loading and Google Analytics has done its thing. However, if you remove it too soon, it will affect GA functionality (one visitor will show up as multiple visitors). I've found that the following JavaScript (with a 3-second delay) works:
<script defer src="data:text/javascript,async function main() {await new Promise(r => setTimeout(r, 3000));window.history.replaceState({}, document.title, window.location.pathname);}main();"></script>
I used "window.location.pathname" for convenience so that you can use the same script on many pages. However, you can also do like this (for the top page of the site):
<script defer src="data:text/javascript,async function main() {await new Promise(r => setTimeout(r, 3000));window.history.replaceState({}, document.title, '/');}main();"></script>
Or for a sub-page:
<script defer src="data:text/javascript,async function main() {await new Promise(r => setTimeout(r, 3000));window.history.replaceState({}, document.title, '/something/something.html');}main();"></script>
I did the "data:text/javascript" thing instead of a true in-line script so I could apply "defer" to it, although this probably isn't necessary if you're using a sufficiently long delay value.
You can filter out all (or only include) "?_ga=" parameters in Google Analytics for reporting purposes. I would also highly recommend adding a canonical to the base URL -- or adding the parameters to Google Webmaster Tools -- to avoid duplicate content.

How to modify page URL in Google Analytics

How can you modify the URL for the current page that gets passed to Google Analytics?
(I need to strip the extensions from certain pages because for different cases a page can be requested with or without it and GA sees this as two different pages.)
For example, if the page URL is http://mysite/cake/ilikecake.html, how can I pass to google analytics http://mysite/cake/ilikecake instead?
I can strip the extension fine, I just can't figure out how to pass the URL I want to Google Analytics. I've tried this, but the stats in the Google Analytics console don't show any page views:
pageTracker._trackPageview('cake/ilikecake');
Thanks,
Mike
You could edit the GA profile and add custom filters ...
Create a 'Search and Replace' custom filter, setting the filter field to 'Request URI' and using something like:
Search String: (.*ilikecake)\.html$
Replace String: $1
(was \1)
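To see what a Search and Replace filter of this shape does to a Request URI, the substitution can be emulated offline. This is only a sketch using Python's re module rather than Analytics' own regex engine, and strip_html_extension is a hypothetical helper. It also shows why the escaped dot belongs outside the capture group: if the dot is captured, it is copied back into the result.

```python
import re


def strip_html_extension(request_uri: str, pattern: str) -> str:
    """Apply a search-and-replace with the first capture group as the replacement."""
    return re.sub(pattern, r"\1", request_uri)


# Dot outside the group: the extension and its dot are both removed.
clean = strip_html_extension("/cake/ilikecake.html", r"(.*ilikecake)\.html$")

# Dot inside the group: the dot is kept, leaving a trailing "." behind.
trailing_dot = strip_html_extension("/cake/ilikecake.html", r"(.*ilikecake\.)html$")
```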
Two possibilities come to mind:
1) It can take a while, up to about 24 hours, for visits to be reflected in the Analytics statistics. How long ago did you make your change?
2) Try beginning the pathname with a "/", so
pageTracker._trackPageview('/cake/ilikecake');
and then wait a bit, as per the first item.
Usually you have the ga script code at the end of your file, while special _trackPageviews() calls are often used somewhere else.
Have you made sure that your call to pageTracker._trackPageview() comes after you have defined pageTracker?
Like this:
var pageTracker = _gat._getTracker("UA-XXXXXXX-X");
pageTracker._trackPageview();
otherwise you just get a JavaScript error I suppose.
