Recursive, Non-Dynamic (Refreshable) Web API via Power BI

I am trying to write a recursive web API call in PBI to collect all 27,515 records; the OData feed has a limit of 1,000 rows per request. I need this data to be refreshable in the PBI service, so the 28 requests cannot be formulated in M code in a dynamic way: PBI only allows static (non-dynamic) data sources for scheduled refresh in the service. Below I share two pieces of M code: 1. one that is considered a dynamic data source (not what I need, but it pulls all 27,515 records correctly), and 2. one that is a static data source (which returns an incorrect count of 19,000 records, but is the type of data source I need for refresh).
Noteworthy: the initial API call returns a table named "d" (captioned below) with two rows. One row is titled "results" and contains all of the data (1,000 rows) I need per request; the second row is titled "__next" and holds the next API URL with an embedded skiptoken derived from the current call's data. This skiptoken tells the API which rows to skip so that the next request doesn't deliver data we have already collected.
Table d, Initial Table
M Code for Dynamic Data Source: This dynamic data source is pulling the correct number of records in 28 requests (up to 1,000 records per request) totaling 27,515 rows.
= List.Generate(
    () => Json.Document(Web.Contents("https://my_instance/odata/v2/Table?$format=JSON&$paging=snapshot"))[d],
    each Record.HasFields(_, "results") = true,
    each try Json.Document(Web.Contents(_[__next]))[d] otherwise [df = [__next = "dummy_variable"]]
)
M Code for Static Data Source: This static data source is the type I need for refreshing in the PBI service (I confirmed it does refresh there), but it returns an incorrect number of rows: 19,000 versus 27,515. It makes 19 requests instead of the needed 28. I believe the error lies in the Query portion, where I attempt to call the next API URL with the skiptoken from the previous request.
= List.Generate(
    () => Json.Document(Web.Contents("https://my_instance/odata/v2/Table?$format=JSON&$paging=snapshot"))[d],
    each Record.HasFields(_, "results") = true,
    each try Json.Document(Web.Contents("https://my_instance/odata/v2/Table?$format=JSON&$paging=snapshot", [Query=[q=_[__next]]]))[d] otherwise [df = [__next = "dummy_variable"]]
)
Does anyone see an error in the static code for iteratively calling each new request from the table [d], which has a row labeled [results] (all the data) and another row labeled [__next] (the next URL with the skiptoken from the previous API call)?

To be clear: in Web.Contents the base URL must be static, but you can freely use dynamic components in the optional RelativePath argument (as in this simple example function). That is how you can build dynamic web API queries that still refresh in the service without hitting the dynamic-data-source error you are seeing:
(current_page as text) =>
let
    data = Web.Contents(
        "https://my_instance/api/v2/endpoint", // static!
        [
            RelativePath = "?page=" & current_page // dynamic!
        ]
    )
in
    data
So if you can split out the relative path and query of your __next value and feed them into such a function, it will be fine for automatic refreshes in the Power BI service.
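Purely to illustrate that split (a sketch in Python, not Power Query, using a hypothetical __next value): keep the base URL fixed and carve everything after it into the relative path and query, which are the pieces that would feed RelativePath and Query.
from urllib.parse import urlsplit, parse_qs

# Hypothetical __next value returned by the OData feed (the skiptoken is made up)
next_url = "https://my_instance/odata/v2/Table?$format=JSON&$skiptoken=1000"

# The base URL stays static, just like the first argument to Web.Contents
BASE = "https://my_instance/odata/v2/"

parts = urlsplit(next_url)
relative_path = parts.path[len(urlsplit(BASE).path):]            # -> "Table"
query = {k: v[0] for k, v in parse_qs(parts.query).items()}      # -> {"$format": "JSON", "$skiptoken": "1000"}

print(relative_path, query)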

Related

Google Analytics: segment discrepancy between API and web reporting

I've had an Analytics Reporting API integration running for a while now, and unfiltered view results from the API match the web reporting. The issue I'm seeing is when adding a segment to the API report request: the web reporting frequently returns different values than the API for a handful of the segment/view_id combinations. I'm looking for recommended settings to review here to understand what is causing the discrepancy, as I'm not sure if this is a program code/API issue, a web reporting issue, or a configuration issue for the segment/view_id.
Notes:
When incorrect, the web reporting numbers for sessions average about 10% higher than what the API returns.
A single segment is applied to many view_ids we manage; a high percentage (~80%) show the discrepancy, and the remainder match.
The modified and created dates for this segment are 5 months old per the web interface, so a configuration change within the segment is not causing the discrepancy.
We've compared 2018 YTD to rule out a time-lag data update as an issue.
Segments appear to be linked at our master account level and applied to the accounts we manage.
Currently using v4 of the Analytics Reporting API for .NET (C#).
Current Questions:
Could this be a setting in how a particular segment was created?
Why would some segment/view_ids match and others not?
Is there an account, property, or view_id permission/configuration setting to review as it relates to applying segments?
Any help or insights on what to review here would be helpful.
Forgot the code snippet:
var segmentDimension = new Dimension { Name = "ga:segment" };
var DefaultReportRequest = new ReportRequest
{
    DateRanges = new List<DateRange> { dateRange },
    Dimensions = new List<Dimension> { date, SourceMedium, Campaign, AdContent, Keyword },
    Metrics = new List<Metric> { sessions, Users, NewUsers, Bounces, pageViews, SessionDuration, Goal01Completion, Goal02Completion, Goal03Completion, Goal04Completion },
    ViewId = v_id,
    PageSize = 10000
};
if (!(segmentId == ""))
{
    DefaultReportRequest.Dimensions.Add(segmentDimension);
    Google.Apis.AnalyticsReporting.v4.Data.Segment segment = new Google.Apis.AnalyticsReporting.v4.Data.Segment() { SegmentId = segmentId };
    DefaultReportRequest.Segments = new List<Google.Apis.AnalyticsReporting.v4.Data.Segment> { segment };
}
var getReportsRequest5 = new GetReportsRequest
{
    ReportRequests = new List<ReportRequest> { DefaultReportRequest }
};
var batchRequest5 = reportingService.Reports.BatchGet(getReportsRequest5);
var response5 = batchRequest5.Execute();
Thanks in advance for your help,
Mike
Update 2:
After reviewing this further: the API call always pulls a single day of data ("yesterday"). When the web reporting pulls that same single day, the numbers match. When the web reporting pulls a range of dates around that day (e.g., +/- 3 days), the numbers no longer match. It seems like sampling could be in play here, but the web reports we are running indicate 100% of sessions in both pulls. I think the question is how to determine which is more accurate: a single day or a range of dates. Has anyone investigated this? I've reproduced it on several of our view_ids.
Thanks,
Mike
Update 3 (resolution):
Turns out the issue was with how the segment was created and applied to web reporting. The segment was scoped at the user level, meaning aggregated values would change based on the time frame selected. The desired state was having the filters apply to a single day, making a session-scoped segment a better fit than a user-scoped one, as it constrained the segment to the session.
Thanks all,
Mike
Without knowing too much about the details of the segments and views, the first thing I'd like to confirm with you is that you're aware of sampling in GA.
Unless they're all 360 accounts, you'll be subjected to sampling depending on the sessions you're returning for 2018 YTD. Note, sampling is based on sessions on the property level, not view level.
Another thing you can do in your code is check whether the sampling percentage reported in the API response matches the web version. On the web version, the sampling info is shown here: https://i.stack.imgur.com/hcPGD.png
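For what it's worth, here is a minimal sketch (my own illustration in Python, not the poster's C# code) of reading that sampling metadata from a v4 batchGet response that has already been parsed into a dict; samplesReadCounts and samplingSpaceSizes are the relevant fields.
# Minimal sketch: inspect sampling metadata in a GA Reporting API v4 response.
# `report_response` is assumed to be the batchGet JSON response parsed into a Python dict.
def sampling_info(report_response):
    for i, report in enumerate(report_response.get("reports", [])):
        data = report.get("data", {})
        reads = data.get("samplesReadCounts")
        sizes = data.get("samplingSpaceSizes")
        if not reads or not sizes:
            print(f"report {i}: unsampled (based on 100% of sessions)")
            continue
        for read, size in zip(reads, sizes):
            pct = 100.0 * int(read) / int(size)
            print(f"report {i}: sampled, {pct:.1f}% of sessions ({read} of {size})")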

Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database

I am attempting to create a database of Digital Object Identifier (DOI) found on the internet.
By manually searching the CommonCrawl Index Server I have obtained some promising results.
However, I wish to develop a programmatic solution.
This may mean my process only needs to read the index files, not the underlying WARC data files.
The manual steps I wish to automate are these:
1. For each currently available CommonCrawl index collection:
2. I search "Search a url in this collection: (Wildcards -- Prefix: http://example.com/* Domain: *.example.com)", e.g. link.springer.com/*
3. This returns almost 6 MB of JSON data containing approximately 22K unique DOIs.
How can I browse all available CommonCrawl indexes instead of searching for specific URLs?
From reading the API documentation for CommonCrawl I cannot see how I can browse all the indexes to extract all DOIs for all domains.
UPDATE
I found this example Java code https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java which shows how to access a Common Crawl dataset.
However, when I run it, I receive this exception:
"main" org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>common-crawl/crawl-data/CC-MAIN-2016-26/segments/1466783399106.96/warc/CC-MAIN-20160624154959-00160-ip-10-164-35-72.ec2.internal.warc.gz</Key><RequestId>1FEFC14E80D871DE</RequestId><HostId>yfmhUAwkdNeGpYPWZHakSyb5rdtrlSMjuT5tVW/Pfu440jvufLuuTBPC25vIPDr4Cd5x4ruSCHQ=</HostId></Error>
In fact, every file I try to read results in the same error. Why is that?
What are the correct Common Crawl URIs for their datasets?
The dataset location changed more than a year ago; see the announcement. However, many examples and libraries still contain the old pointers. You can access the index files for all crawls back to 2013 at s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/cdx-00xxx.gz - replace YYYY-WW with the year and week of the crawl and expand xxx to 000-299 to get all 300 index parts. New crawl data is announced on the Common Crawl group, or read more about how to access the data.
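As a rough sketch (my addition, with a placeholder collection and part number), one way to pull a single index part anonymously and scan it in Python looks like this; note that each part is large.
import gzip
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access to the public commoncrawl bucket
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Placeholder collection and part number; expand cdx-00000 ... cdx-00299 for all 300 parts
key = "cc-index/collections/CC-MAIN-2018-13/indexes/cdx-00000.gz"
s3.download_file("commoncrawl", key, "cdx-00000.gz")

with gzip.open("cdx-00000.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        # Each line is "<SURT key> <timestamp> <JSON metadata>"; filter for domains of interest
        if "springer" in line:
            print(line.rstrip())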
To get the example code to work replace lines 24 and 25 with:
String fn = "crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
S3Object f = s3s.getObject("commoncrawl", fn, null, null, null, null, null, null);
Also note that the Common Crawl group has an updated example.

Dynamics AX 2009 AIF Tables

Background
I have an issue where, roughly once a month, the AIFQueueManager table is populated with ~150 records relating to messages that were sent to AX over 6 months ago (where they "successfully failed", i.e. errored due to violation of business rules, but returned an exception as expected).
Question
What tables are involved in the AIF inbound message process, and in what order do events occur? e.g. the XML file is picked up and recorded in the AifDocumentLog, the data is extracted and added to the AifQueueManager and AifGatewayQueue tables, records from there are then inserted into the AifMessageLog, etc.
Thanks in advance.
There are four main AIF classes. I will be talking about the inbound side only, focusing on the included file system adapter and flat XML files. I hope this makes things a little less hazy.
AIFGatewayReceiveService - Uses adapters/channels to read messages in from different sources, and dumps them in the AifGatewayQueue table
AIFInboundProcessingService - This processes the AifGatewayQueue table data and sends to the Ax[Document] classes
AIFOutboundProcessingService - This is the inverse of #2. It creates XMLs with relevant metadata
AIFGatewaySendService - This is the inverse of #1, where it uses adapters/channels to send messages out to different locations from the AifGatewayQueue
For #1
So #1 basically fills the AifGatewayQueue, which is just a queue of work. It loops through all of your channels and then finds the relevant adapter by ClassId. The adapters are classes that implement AifIntegrationAdapter and AifReceiveAdapter if you wanted to make your own custom one. When it loops over the different channels, it then loops over each "message" and tries to receive it into the queue.
If it can't process the file for some reason, it catches the exceptions and records them in the SysExceptionTable [Basic>Periodic>Application Integration Framework>Exceptions]. These messages are scraped from the infolog, and are generated mostly by the receive adapter, which would be AifFileSystemReceiveAdapter for my example.
For #2
So #2 is processing the inbound messages sitting in the queue (ready/inprocess). The AifRequestProcessor\processServiceRequest does the work.
From this method, it will call:
Various calls to Classes\AifMessageManager, which puts records in the AifMessageLog and the AifDocumentLog.
This key line: responseMessage = AifRequestProcessor::executeServiceOperation(message, endpointActionPolicy); which actually does the operation against the Ax[Document] classes by eventually getting to AifDispatcher::callServiceMethod(...)
It gets the return XML and packages that into an AifMessage called responseMessage and returns that where it may be logged. It also takes that return value, and if there is a response channel, it submits that back into the AifGatewayQueue
AifQueueManager is actually cleared and populated on the fly by calling AifQueueManager::createQueueManagerData();.

StructureGroup Details using the Content Delivery/Broker API

I am trying to get all the structure groups published in a given publication using the PublicationID. I expected to get the structure groups back from a StructureGroupCriteria by passing the root Structure Group TCM ID, but I am getting page IDs instead (I am expecting SGs).
Now I am trying to loop through the list and get the details of each structure group. I did not find any (.NET) API to get these details, and the query is returning only pages.
What I have done so far (which works, using StructureGroupCriteria) returns a list of page IDs instead of SG IDs:
PublicationCriteria pubCriteria = new PublicationCriteria(pubID);
// Root StructureGroup TCM ID -- tcm:45-3-4
StructureGroupCriteria sgCriteria = new StructureGroupCriteria("tcm:45-3-4", true);
Criteria allSGsInPub = CriteriaFactory.And(pubCriteria, sgCriteria);
Query allSGs = new Query(allSGsInPub);
string[] sgInfo = allSGs.ExecuteQuery();
Response.Write("Total : " + sgInfo.Length);
foreach (string sgid in sgInfo)
{
    // HOW DO I get the Structure Group Details here
    //TCMURI sgURI = new TCMURI(sgid);
}
Q #1: How do I get all the structure groups and the details of each individual structure group? (It may be something simple; I am just not able to find the right API.)
Q #2: How can I get all the structure groups using ItemTypeCriteria sgCriteria = new ItemTypeCriteria(4); // 4 is the SG item type
When I tried this option, the query executed successfully but returned no results. Is this the expected behavior, and should we always use StructureGroupCriteria instead of ItemTypeCriteria?
The reason for this approach is that I want to avoid the root StructureGroup ID, which the code above requires. But at the moment none of the approaches return StructureGroup information; I always get page information.
Tridion Version: 2011 SP1, .NET API.
Note: When I publish, I check the "publish SG info" checkbox and publishing completes successfully. On the Broker DB side, I can see the information in the taxonomy table as well.
I was playing with the OData service and accidentally found that I can get all my structure group information from the OData web service.
/cd_webservice/odata.svc/StructureGroups?$filter=PublicationId%20eq%2045
Also, the results are returning child structure groups with a depth parameter.
Just to clarify: using the Broker API it is not feasible to get the structure groups (my original question). However, the workaround is to use the OData service to get the structure groups.
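For illustration only (the host name is a placeholder, and the property names are what I'd expect from the 2011 OData service, so treat them as assumptions), the call can be scripted like this:
import requests

# Placeholder host; point this at your Content Delivery web service deployment
BASE = "http://cd-host:8080/cd_webservice/odata.svc"

resp = requests.get(
    BASE + "/StructureGroups",
    params={"$filter": "PublicationId eq 45"},
    headers={"Accept": "application/json"},  # ask the OData v2 service for JSON
    timeout=30,
)
resp.raise_for_status()

# OData v2 JSON responses wrap the entries in d/results
for sg in resp.json()["d"]["results"]:
    print(sg.get("Id"), sg.get("Title"), sg.get("Depth"))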
I don't think you will get Structure Groups returned by the Query object.
According to the documentation, when you publish Structure Group information the Structure Group hierarchy is published to the Content Delivery side where it is stored as a taxonomy.
Have you tried using the Taxonomy APIs to get the information you need?

Google Analytics Realtime Sandbox Environment

I am looking for a way to set up a Google Analytics sandbox environment that will allow me to test my custom JS code in near real time.
My app will be using custom variables for advanced segmentation, and I would like to test multiple scenarios quickly, as opposed to setting up a dummy GA account and waiting a whole day to confirm the test.
Thanks
Great question.
For GA, server updates occur every four hours, and after every sixth such update, the entire set is recalculated, which means a 24-hour lag from code change to reliable feedback. This delay also applies to most customizations to the GA Browser (e.g., "custom filters").
So if you are going to use GA as your web metrics system, and you expect to actually rely on those data, then a test rig is essential.
For me, it's useful to group test systems for client-side analytics under two rubrics: (i) complete, self-contained (closed-loop) systems; or (ii) simpler automated data pulls from the production system (by "production system" here I mean GA's system, not the Site whose pages the GA code is tracking).
For the latter, just add this line to each page of your Site that contains the GA tracking code, just below '__trackPageview()':
pageTracker._setLocalRemoteServerMode();
That line will cause a copy of each transaction line to be logged to your server's activity log -- so in essence, you get the data captured by GA in real time. That's all you need to do to capture the data; to parse it, you can use, for instance, any of the excellent open source web log analyzers like AWStats, or roll your own.
This is simple and reliable -- but all it can do is tell you (in real time) "does the analytics code I just implemented on pages served by my production server actually work?"
Usually, that's not good enough--you would rather know if your code will work before it's on your production server. To do that, you need to simulate the production environment and find a way to access in real-time the data GA collects.
This kind of test rig is a little more involved, but still not difficult.
In sum, it requires these steps:
host/serve the ga.js and the tracking pixel locally;
log the __utm.gif requests (in the GA data flow, each request corresponds to one logged transaction); and
parse the headers into some convenient human-readable form.
If you want more detail than that (ie, a step-by-step implementation), here it is:
I. Hosting/Serving the GA Script (& automating updates)
To do that, you can create a small shell script like this one to wget the latest ga.js version into your local directory (replacing the extant version it finds there).
#!/bin/sh
rm /My_Sites/sitename.com/analytics/ga.js
cd /My_Sites/sitename.com/analytics/
wget http://www.google-analytics.com/ga.js
chmod 644 /My_Sites/sitename.com/analytics/ga.js
cd ${OLDPWD}
exit 0;
(Thanks to AskApache.com, which provided the original motivation and config details to do this in a production context.)
II. Create __utm.gif file
This is just a transparent 1x1-pixel GIF image, which you will place in your Site directory (it doesn't matter where; it just needs to match the location referenced in your pages).
III. Log the __utm.gif Requests
For a testing protocol in which you are the source of the client-side activity (e.g., you want to verify the cross-browser fidelity of some event-tracking code you've added to a page on your Site, so you automate 5000 clicks on the button you just wired up, serving the page from your dev server set up for this purpose), it's probably simplest to just log the Request Headers, because it's in those headers that the GA script directs the client to gather various data from the DOM, from the location bar (url), and from prior http headers, and append them to a request for a resource on the GA server (__utm.gif, which is just a 1x1 transparent pixel).
For this type of protocol, I use the Firefox addon LiveHTTPHeaders. You install it like any other Firefox addon; a few mouse clicks is all. Next, open it and click the "Generator" tab. From this window you can see the actual requests in real time. At the bottom of the window is a 'save' button to store the log. I find it easier to configure LiveHTTPHeaders to log only the __utm.gif requests; to do that, just click the 'Edit' tab and create a simple filter to exclude everything except these particular gif images (using the check boxes and the large text box to the right).
Other kinds of test protocols require you to work from your Server Activity Logs; in that case just add this line to each page of your Site, just below __trackPageview():
pageTracker._setLocalRemoteServerMode();
IV. Parse those logged requests so you can actually read them
So now your log will contain individual transaction lines, each one of which is a string appended to an HTTP request for the GA tracking pixel. This string is just a concatenation of key-value pairs; each key begins with the letters "utm" (probably for "urchin tracker"). Each of these parameters corresponds to a variable that you see in the GA Dashboard (here's a complete list and description of them). This is all you need to know to build a parser. In more detail:
First, here's a sanitized __utm.gif request (the entries in your LiveHTTPHeaders log):
http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
This is my parser (in Python):
# regular expression module imported
import re

pattern = r'\&{1,2}'
pat_obj = re.compile(pattern)

# split the gif request on the '&' character
# (which GA originally used to concatenate each piece to build the request)
# (here, I've bound the sanitized __utm.gif request shown above to the variable 'gfx')
gfx1 = pat_obj.split(gfx)

# create a look-up table to map a descriptive name to each gif request parameter
# (note, this isn't the entire list, which I've linked to above)
keys = "utmje utmsc utmsr utmac utmcc utmcn utmcr utmcs utmdt utme utmfl utmhn utmn utmp utmr utmul utmwv"
values = "java_enabled screen_color_depth screen_resolution account_string cookies campaign_session_new repeat_campaign_visit language_encoding page_title event_tracking_data flash_version host_name GIF_req_unique_id page_request referral_url browser_language gatc_version"
keys = keys.strip().split()
values = values.strip().split()

# create the look-up table
GIF_REQUEST_PARAMS = dict(zip(keys, values))

# parse each request parameter and map the parameter name to a descriptive name:
pattern = r'(utm\w{1,2})=(.*?)$'
pat_obj = re.compile(pattern)
for itm in gfx1:
    m = pat_obj.search(itm)
    if m:
        fmt = '{0:25} {1:10}'
        print(fmt.format(GIF_REQUEST_PARAMS[m.group(1)], m.group(2)))
The result looks like this:
gatc_version              1         
GIF_req_unique_id         1669045322
language_encoding         UTF-8     
screen_resolution         1280x800  
screen_color_depth        24-bit    
browser_language          en-us     
java_enabled              1         
flash_version             10.0%20r45
campaign_session_new      1         
page_title                Position%20Listings%20%7C%20Linden%20Lab
host_name                 lindenlab.hrmdirect.com
referral_url              http://lindenlab.com/employment
page_request              /employment/openings.php?sort=da
account_string            UA-XXXXXX-X
cookies
To avoid making this longer still, I left out the cookies' value. It obviously requires a separate parsing step, though it's virtually identical to the step I just showed. Again, each request represents a single transaction, so you can store them as you need to.
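If it's useful, here is a rough sketch of that extra step (my own addition, not part of the original parser): URL-decode the utmcc value and split out the individual GA cookies it carries.
from urllib.parse import unquote

# `utmcc_value` is assumed to be the raw (still URL-encoded) value of the utmcc parameter
# pulled out of the __utm.gif request shown above
def parse_utmcc(utmcc_value):
    decoded = unquote(utmcc_value)        # yields "__utma=...;+__utmb=...;+__utmc=...;+__utmz=...;+"
    cookies = {}
    for chunk in decoded.split(";+"):
        chunk = chunk.strip().rstrip(";")
        if "=" in chunk:
            name, value = chunk.split("=", 1)   # keep the full __utmz payload intact
            cookies[name] = value
    return cookies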
