I've spent the last few days trying to solve the problem below.
I have a set of URLs for which I would like to request data, mainly pageviews and visits by month over a specific time interval. These URLs make up one web section, and we would like to get statistics for this section. I'm using PHP GAPI.
I am able to construct the correct filter for the URL set:
ga:pagePath==[url1]||ga:pagePath==[url2]||ga:pagePath==[url3]...
But this only works for a few URLs, because the request is sent via GET and GET requests are limited in length.
At first I tried making several requests, each for a few URLs from the whole set, and once I had data for all pages I summed the pageviews and visits. Then I realized that this works for pageviews but not for visits: one particular visit can be counted in more than one response, so the sum counts it multiple times.
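To make the batching idea concrete, here is a minimal sketch in Python (not PHP GAPI) of packing the pagePath clauses into as few filters as fit under a length budget. The function name `build_filters` and the default `max_len` of 1500 characters are assumptions for illustration; the real limit depends on the server and client involved.

```python
def build_filters(paths, max_len=1500):
    """Pack pagePath clauses into as few OR-filters as fit under max_len characters.

    max_len is an assumed budget, not a documented limit.
    """
    filters, current = [], ""
    for path in paths:
        clause = "ga:pagePath==" + path
        candidate = clause if not current else current + "||" + clause
        if current and len(candidate) > max_len:
            # Adding this clause would exceed the budget: close the current filter.
            filters.append(current)
            current = clause
        else:
            current = candidate
    if current:
        filters.append(current)
    return filters
```

Each returned filter would go into its own request. As noted above, summing across batches is only safe for pageviews, since a single visit can touch pages in more than one batch.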
And then I have these limitations:
I can't use a regular expression to shorten the filter. The page URLs are badly designed (not thanks to us :) ), so the pages in a web section don't share a nice URL prefix like /my-section/*
I need historical data (2 years back), so it won't help to start tracking a custom variable or event for pages in a particular web section from now on.
So I tried to make a POST request to the API. I was able to get an auth token, but POSTing the request for the statistics data returns:
403 Forbidden
Target feed is read-only
I tried to find out whether it is actually possible to use the POST method, but had no luck finding exact information (some clues suggest that it is not possible).
Another idea would be to redesign the URLs to have a nice prefix that a regexp could filter on, and somehow change the URLs already stored in GA, but I have a feeling that this is not possible either.
Does anyone have an idea how to solve this?
Thanks for any suggestions :)
Related
Say I have two pages on a site called “Page 1” and “Page 10”. I'd like to be able to see the paths visitors take to get from “Page 1” to “Page 10” with full URLs intact. Many of the URLs (including those for “Page 1” and “Page 10”) will include query strings that are important.
Is this possible? If so, how?
Try using behavior flow reports. The report basically shows you how visitors click through your website. There are a lot of ways to customize the report, and you will need to play around with them to really answer your question. By default, the behavior flow focuses on the entry and exit points of visitors, regardless of how many times they hit the different subpages in between. However, I'm sure you can set appropriate filters and settings to answer your question.
I use two methods for tracking where people have been on my website:
Track and store the information in my own SQL database. (details below)
Lead Forensics (paid subscription, but you can do a trial).
For tracking and storing my own data, I record unique visitors based upon the IP Address they're connecting from and then have a separate table that records all page views that links back to the unique visitor table.
Lead Forensics data simply allows me to link up those unique visitors with actual companies that have viewed my website.
Doing it yourself means your records don't depend on Google working. In my experience Google Analytics tends to round numbers, so you don't get a true indication of the figures. You can also remove bots and website crawlers from your data by tracking the user agent string.
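As a rough illustration of the two-table layout described above, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and the `BOT_MARKERS` substrings are all assumptions made up for this example, not the schema I actually run.

```python
import sqlite3

# Assumed substrings for spotting bots; a real list would be longer.
BOT_MARKERS = ("bot", "crawler", "spider")

def open_db(path=":memory:"):
    """Create the visitors and page_views tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS visitors (
            id INTEGER PRIMARY KEY,
            ip TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS page_views (
            id INTEGER PRIMARY KEY,
            visitor_id INTEGER NOT NULL REFERENCES visitors(id),
            url TEXT NOT NULL,
            viewed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
    """)
    return conn

def record_view(conn, ip, url, user_agent):
    """Store one page view; return False when the user agent looks like a bot."""
    ua = user_agent.lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return False
    # One visitor row per IP address; repeated visits reuse the same row.
    conn.execute("INSERT OR IGNORE INTO visitors (ip) VALUES (?)", (ip,))
    visitor_id = conn.execute(
        "SELECT id FROM visitors WHERE ip = ?", (ip,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO page_views (visitor_id, url) VALUES (?, ?)",
        (visitor_id, url),
    )
    conn.commit()
    return True
```

Keying visitors on IP address alone merges everyone behind the same proxy or office network into one visitor, which is worth keeping in mind when reading the numbers.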
As a somewhat ugly hack you could use transaction tracking. If you use the same transaction ID multiple times, subsequent products are added to the existing data. So assign an ID at the start of the visit, and on each page record a transaction with the current page URL as the product name (and the ID as the transaction ID). This will give you the complete path per user. (I am frankly not too sure how useful this is; at some point you probably want aggregated data. Plus, each transaction and product counts towards your quota of interaction hits, so on a large site you might run over the 10 million hits limit.)
You can do it programmatically:
Keep a map in the backend that stores, for each user ID (assuming you assign each user a unique ID at login), a list of strings, each string being a URL visited by that user.
Whenever the user navigates to another URL from Page 1 (and only from Page 1; check this using JS), send a POST request to the backend with the new URL in its body.
In the backend, check whether the URL is that of Page 10; if not, append the URL to the list stored in the map for that user.
Finally, when the user clicks on the Page 10 URL, you know the URLs visited on the way from Page 1 to Page 10 and can use them.
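The backend side of the steps above could be sketched roughly like this in Python. The class name `PathTracker` and its method are hypothetical, and this variant starts recording as soon as the start page is seen rather than relying on a client-side JS check.

```python
class PathTracker:
    """Collect, per user, the URLs visited between a start page and an end page."""

    def __init__(self, start_url, end_url):
        self.start_url = start_url
        self.end_url = end_url
        self.paths = {}  # user_id -> list of URLs visited since start_url

    def record(self, user_id, url):
        """Record a hit; return the full path once end_url is reached, else None."""
        if url == self.start_url:
            # A (re)visit to the start page begins a fresh path.
            self.paths[user_id] = [url]
            return None
        if user_id not in self.paths:
            return None  # the user has not visited the start page yet
        self.paths[user_id].append(url)
        if url == self.end_url:
            # Path complete: hand it back and forget it.
            return self.paths.pop(user_id)
        return None
```

Because URLs are compared as whole strings, query strings stay intact and count as part of the page identity.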
That said, if JS is an option and I have not misunderstood your question, you can get the previous URL in the browser from document.referrer, which reflects the request's Referer header.
Are you trying to do this from Google Tag Manager? I am not sure whether you are trying to trace the URLs on the client side or the server side.
This is my first time posting here. I do not have much experience (less than a week) with HTML parsing/web scraping and am having difficulty parsing this webpage:
https://www.jobsbank.gov.sg/
What I want to do is parse the content of all available job listings on the site.
My approach:
1. Click search with an empty search bar, which returns all listed records. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do
2. Provide the search result web address to R and identify all the job listing links.
3. Supply the job listing links to R and ask R to go to each listing and extract the content.
4. Look for the next page and repeat steps 2 and 3.
However, the problem is that the web address I got from step 1 does not take me to the search result page. Instead, it sends me back to the home page.
Is there any way to overcome this problem?
Supposing I manage to get the web address for the search result, I intend to use the following code:
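library(RCurl)  # provides getURLContent
base_url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
# Fetch the page over HTTPS, verifying against the local CA bundle
base_html <- getURLContent(base_url, cainfo="cacert.pem")[[1]]
# Crude link extraction: split the HTML on anchor tags
links <- strsplit(base_html, "a href=")[[1]]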
Learn to use the web developer tools in your web browser (hint: Use Chrome or Firefox).
Learn about HTTP GET and HTTP POST requests.
Notice the search box sends a POST request.
See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}: YES and {actionForm.keyWord}: my search string).
Construct a POST request containing that form data using one of the R http packages.
Hope the server doesn't care about cookies; if it does, get the cookies and send them back with the request.
Hence you end up using postForm from the RCurl package:
p = postForm(url, .params=list("{actionForm.checkValidRequest}"="YES", "{actionForm.keyWord}"="finance"))
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
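For readers not using R, the same form POST can be sketched with Python's standard library. This only builds the request and does not send it. The field names are the ones observed in the browser's developer tools above; treating the SearchResult.do address as the POST target is an assumption, and `build_search_request` is a made-up helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed POST target: the search result address from the question.
SEARCH_URL = "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"

def build_search_request(keyword):
    """Build (but do not send) a POST request carrying the search form data."""
    form = {
        "{actionForm.checkValidRequest}": "YES",
        "{actionForm.keyWord}": keyword,
    }
    # Form bodies are URL-encoded, so braces become %7B and %7D.
    body = urlencode(form).encode("ascii")
    return Request(SEARCH_URL, data=body, method="POST")
```

Passing the result to urllib.request.urlopen would send it. If the server insists on cookies, an opener built with urllib.request.HTTPCookieProcessor can carry them between requests.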
Basically, a web request is more than just a URL: there's a whole conversation going on between the browser and the server involving form parameters and cookies, and sometimes AJAX requests made internally by the web page to update parts of it.
There's a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, and Forms, and Cookies, and then you'll understand how to use the tools better.
Note I've never seen a job site or a financial site that likes you scraping its content. Although I can't see a warning about it on this site, that doesn't mean it's not there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.
I think this question has been answered here before, but I could not find the relevant topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name. Then it will grab the related data for that name, and if more than one match is found, the data will be grouped according to their names.
All I know is that Google has some kind of restriction on scraping. They provide a custom search API. I have not used that API yet, but I am hoping to get all the result links for a query from it. However, I could not work out what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have provided a bit more detail on what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can up it to around 3k a day for 2000 USD/year. You can up it more by getting in contact with them directly.
You will not be able to get accurate ranking positions with this method. If you only need a small number of requests and are mainly interested in getting some websites for a keyword, that's the choice.
A starting point would be here: https://code.google.com/apis/console/
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also lets you get a large number of results, if done right.
You can Google for code, the most advanced free (PHP) code I know is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains it in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be banned by IP/captcha. Not getting detected should be a priority.
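For route a), a minimal Python sketch of the request and of pulling the links out of the JSON response could look like this. It assumes the Custom Search JSON API endpoint and its key, cx, q and start parameters; the helper names and placeholder credentials are made up for illustration.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def search_url(query, api_key, engine_id, start=1):
    """Build the request URL for one page of results, beginning at index start."""
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    return API_ENDPOINT + "?" + urlencode(params)

def extract_links(response):
    """Return the result links from a decoded JSON response (a dict)."""
    return [item["link"] for item in response.get("items", [])]
```

Fetching each returned link and extracting the data you need is then a separate scraping step per site.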
I am trying to get all the posts for a page by using
https://graph.facebook.com/PAGE_ID/feed
And it works like a charm. I can get all the info for each post except the like count.
The feed does return "likes" for each post, but it only shows the like info for the first 25 likes, so I cannot tell the total like count of a post.
The closest solution I found on the net is to set "summary=1" when requesting info of a post, e.g.
https://graph.facebook.com/POST_ID/likes?summary=1
This will return a summary field that shows the like count of this post, which is exactly what I need.
However, if this is the only way to solve the problem, I have to make an additional network request for each post just to get the like count. I could originally finish the job with only ONE network request, but now I have to make 1+N network requests (where N is the number of posts in the page feed).
I think I must be missing something. FB must have some way to embed the like count in the feed info. In the FB app or website, all posts show their like counts immediately; surely they are not making N additional network requests to get the like count for each post.
Hope someone can help. Thanks a lot in advance.
Finally, I found there is a way to get the like/comment counts for each post while pulling the feed without making further network requests:
/url/feed?fields=likes.summary(1).limit(0)
Isn't it great?
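Here is a rough Python sketch of using that fields parameter and reading the counts from the decoded response. The helper names are made up, and the response shape (a summary object with a total_count field under each post's likes) is my assumption about what the API returns, so check it against a real response.

```python
from urllib.parse import urlencode

def feed_url(page_id, access_token):
    """Build a feed URL that asks for the like summary but no individual likes."""
    query = urlencode({
        "fields": "likes.summary(1).limit(0)",
        "access_token": access_token,
    })
    return f"https://graph.facebook.com/{page_id}/feed?{query}"

def like_counts(feed):
    """Map post id -> like count from a decoded feed response (a dict).

    Assumes each post carries likes.summary.total_count; missing values count as 0.
    """
    return {
        post["id"]: post.get("likes", {}).get("summary", {}).get("total_count", 0)
        for post in feed.get("data", [])
    }
```

This keeps everything in the single feed request, so the 1+N pattern from the question is avoided.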
I am attempting to create goal funnels in GA for dynamic asp.net based pages. The funnel currently looks as follows:
/
/market_home.aspx
/Category.aspx
/product.aspx
/Cart.aspx
/Checkout.aspx
/OrderReview.aspx
/Confirmation.aspx
The market_home, Category and product pages are dynamic and will contain various parameters, e.g.:
/market_home.aspx?id=1
/Category.aspx?id=1
/product.aspx?id=1
I am using regular expression as my match type (I have tried head match as well). Two of my market home pages are still not being captured; it is only 2 out of 18.
I can't seem to figure out why it catches some, but not all of the traffic.
I also am not capturing incoming/outgoing traffic that is not at the start of the funnel. In other words, those visitors being captured in the funnel appear to complete the entire thing from start to finish. There are no visitors dropping out in the middle anywhere, which I can't believe.
The beginning of the URL will not change.
Any ideas what could be wrong?
I've got the same problem; I even asked about it a couple of days ago: Using regexp in Google Analytics Goal Funnel steps
I believe the thing is that regular expressions don't work properly in funnel steps. My solution is to generate the same virtual pageview on every dynamically generated page and use it in the funnel. Good practice is to create a separate profile for it and filter those virtual pageviews out of the main profile to avoid distorting the data.
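If you want to rule out the patterns themselves first, a small offline check like the Python sketch below lists which URLs a step pattern fails to match. Python's regex engine is not identical to the one GA uses, the helper name is made up, and the sample URLs (including the differently cased one) are hypothetical, so treat it only as a sanity check.

```python
import re

def unmatched(pattern, urls):
    """Return the URLs that the given funnel step pattern does not match."""
    rx = re.compile(pattern)
    return [url for url in urls if not rx.search(url)]

# Hypothetical sample data: the second URL differs only in letter case.
sample_urls = ["/market_home.aspx?id=1", "/Market_Home.aspx?id=2"]
missing = unmatched(r"^/market_home\.aspx", sample_urls)
```

Running the real 18 market home URLs through it would show whether the two uncaptured pages differ from the rest in some way the pattern does not cover.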