I found this app in Splunkbase called 3D Graph Network Topology Visualization (GitHub link) and I've added it to my Splunk environment.
I'm performing a simple search over a large amount of data and using this visualization tool. However, the visualization sometimes takes a long time to load the final results, so I did some digging.
Here is my SPL
index=test source=mysource
| dedup myaddress myneighboraddress
| stats values(vendor), count by myaddress, myneighboraddress
| table myaddress myneighboraddress count
Edit: I've taken @warren's suggestions (below) and updated my search. Although this does improve my search time, I don't believe the core issue was with my search.
index=test source=mysource
| fields - _raw
| fields myaddress myneighboraddress vendor
| stats values(vendor) as vendor count by myaddress, myneighboraddress
After checking the Job Inspector, I've noticed my search only takes from 1 second up to 5 seconds. However, depending on the total events returned (limited with the | head command), I noticed lengthy load times before the visualization actually displays anything. After my search is complete, while the visualization is processing the graph, my browser also shows the popup saying "Page Unresponsive: you can wait for it to become responsive or exit the page. Do you want to wait?" because of how long the process takes.
I've read the README and confirmed my chrome://gpu shows "Hardware accelerated" for the WebGL entry. Additionally, I confirmed my chrome://settings has the option "Use hardware acceleration when available" enabled. I have noticed that when my search is complete and the visualizer is trying to finish processing, my GPU usage stays near 0% until everything is finally loaded in (and even then it's choppy, with a dropped frame rate). I know there is a ticket for this, but it was last updated in 2020.
So my questions are:
Are there certain benchmarks for hardware?
How dependent on a good GPU is the visualization to build out the graph?
Are there benchmarks on total events/nodes for the visualization results posted somewhere? I've seen the example images provided in the README that show approximately 2k and 6k total events. Is anything above 10k events/nodes out of the question?
Are there any ways to speed up the initial loading of the visual results after a search has been performed?
I've performed my own benchmark tests, and would like to know if this is normal or if I'm doing something incorrectly. My test measures only the time from when my search completes to when the visualizer finishes processing the graph. I used | head X at the end of my search (shown above), where X is the value in the first column below:
| head of X (total nodes) | Total seconds to load (after search completes) |
| --- | --- |
| 20,000 | 42 |
| 15,000 | 21 |
| 10,000 | 12 |
| 8,000 | 10 |
| 6,000 | 7 |
| 5,000 | 4.7 |
| 4,000 | 2.5 |
See if this simplified version helps you at all (dedup is rarely the proper tool to use, and it's almost always pointless to run before stats in the manner you're doing):
index=test source=mysource vendor=* myaddress=* myneighboraddress=*
| fields - _raw
| fields myaddress myneighboraddress vendor
| stats count by myaddress myneighboraddress
Depending on the size of your events, dropping _raw can be a huge performance improvement. Likewise for keeping only the fields you care about.
I removed the values(vendor) since your | table was immediately discarding it.
If you want it left in, do this stats line instead:
| stats values(vendor) as vendors count by myaddress myneighboraddress
Currently, it seems the simple answers (at the time of this post) are the following:
There are no benchmarks (that I can find).
Even though GPU acceleration was enabled in my browser, it's unclear what benchmarks exist for specific GPU hardware.
There are no benchmarks posted anywhere indicating how many total nodes the app can display (in a reasonable time).
It's unknown how to speed up the initial loading of the visual.
My webpage scores 90+ on the desktop version, yet its Field Data test result shows "does not pass", while the same page on mobile, with a 70+ speed score, is marked as "Passed".
What's the criteria here, and what else is needed to pass the test on the desktop version? Here is the page on which I'm performing the test: Blog Page
Note: This page's speed score has been 90+ for about 2 months. Moreover, if anyone can offer guidance on improving page speed on mobile in WordPress using the DIVI builder, that would be helpful.
Although 6 items show in "Field Data", only three of them actually count towards your Core Web Vitals assessment:
First Input Delay (FID)
Largest Contentful Paint (LCP)
Cumulative Layout Shift (CLS)
You will notice that they are denoted with a blue marker.
On mobile all 3 of them pass, despite a lower overall performance score.
However on Desktop your LCP occurs at 3.6 seconds average, which is not a pass (it needs to be within 2.5 seconds).
That is why you do not pass on Desktop but do on mobile.
At a glance this appears to be something with your font (sorry, I'm not at a PC to test properly), causing a late switch-out. I could be wrong; as I said, I haven't had a chance to test, so you need to investigate using Dev Tools etc.
Bear in mind that the score you see (95+ on Desktop, 75+ on mobile) is part of a synthetic test performed each time you run Page Speed Insights and has no bearing on your Field Data or Origin Summary.
The data in the "Field Data" (and Origin Summary) is real world data, gathered from browsers, so they can be far apart if you have a problem at a particular screen size (for example) etc. that is not picked up in a synthetic test.
Field Data passes or fails a website based on historical data:
"Field Data — Over the previous 28-day collection period, field data shows that this page does not pass the Core Web Vitals assessment."
So if you have made recent changes to your website to improve your score, you need to wait at least a month so that Field Data shows results based on the newer data.
https://developers.google.com/speed/docs/insights/v5/about#distribution
I'm logging some custom metrics in Application insights using the TelemetryClient.TrackMetric method in .NET, and I've noticed that occasionally some of the events are duplicated when I view them in the Azure portal.
I've drilled into the data, and the duplicate events have the same itemId and timestamp, but if I show the ingestion time by adding | extend ingestionTime = ingestion_time() to the query then I can see that the ingestion times are different.
This GitHub issue indicates that this behavior is expected, as AI uses at-least-once delivery.
I plot these metrics in charts in the Azure portal using a sum aggregation, however these duplicates are creating trust issues with the charts as the duplicates are simply treated as two separate events.
Is there a way to de-dupe the events based on itemId before plotting the data in the Azure portal?
Update
A more specific example:
I'm running an algorithm, triggered by an event, which results in a reward. The algorithm may be triggered several dozen times a day, and the reward is a positive or negative floating point value. It logs the reward each time to Application Insights as a custom metric (called say custom-reward), along with some additional properties for data splitting.
In the Azure portal I'm creating a simple chart by going to Application Insights -> Metrics and customising the chart. I select my custom-reward metric in the Metric dropdown, and select Sum as the aggregation. I may or may not apply splitting. I save the chart to my dashboard.
This simple chart gives me a nice way of monitoring the system to make sure nothing unexpected is happening, and the Sum value in the bottom left of the chart allows me to quickly see whether the sum of the rewards is positive or negative over the chart's range, and by how much.
However, on occasion I've been surprised by the result (say over the last 12 hours the sum of the rewards was surprisingly negative), and on closer inspection I discovered that a few large negative results have been duplicated. Further investigation shows this has been happening with other events, but with smaller results I tend not to notice.
I'm not that familiar with the advanced querying bit of Application Insights, I actually just used it for the first time today to dig into the events. But it does sound like there might be something I can do there to create a query that I can then plot, with the results deduped?
Update 2
I've managed to make progress with this thanks to the tips from @JohnGardner, so I'll mark that as the answer. I've deduped and plotted the results by adding the following line to the query:
| summarize timestamp=any(timestamp), value=any(value), name=any(name), customDimensions=any(customDimensions) by itemId
Update 3
Adding the following line to the query allowed me to split on custom data (in this case splitting by algorithm ID):
| extend algorithmId = tostring(customDimensions.["algorithm-id"])
With that line added, when you select "Chart" in the query results, algorithmId now shows up as an option in the split dropdown. After that you can click "Pin to dashboard". You lose the handy "sum over the time period" indicator in the bottom left of the chart which you get via the simple "Metrics" chart, however I'm sure I'll be able to recreate that in other ways.
If you are doing your own queries, you would generally use something like summarize or make-series to do this deduping for a chart. You wouldn't normally plot individual items unless you are looking at a very small time range.
So instead of something like
summarize count() ...
you could do
summarize dcount(itemId) ...
Or you might add a "fake" summarize by itemId to a query that didn't need one before, to coalesce multiple rows into just one, using any(x) to grab a single row's value for each column for each itemId.
But it really depends on what you are doing in your specific query. If you were using something like sum(itemCount) to also deal with sampling, you now have other odd cases, where the at-least-once delivery might have duplicated sampled items. (Updating your question to add a specific query and a hypothetical result would likely lead to a more specific answer.)
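For example, a "fake" summarize of this kind could feed the sum-of-rewards chart described in the question. This is only a sketch: the customMetrics table and the custom-reward name come from the question, but the exact query is an assumption and hasn't been run against real data:

```kusto
customMetrics
| where name == "custom-reward"
// collapse at-least-once duplicates: keep one row per itemId
| summarize timestamp = any(timestamp), value = any(value),
            customDimensions = any(customDimensions) by itemId
// then aggregate the deduped rows for charting
| summarize sum(value) by bin(timestamp, 1h)
| render timechart
```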
The goal flow report in my Google Analytics account shows some strange sampling behavior. While I can usually select up to a month of data before sampling starts, it seems to be different for the goal flow report.
As soon as I select more than one day of data, the data set used gets smaller very fast. At three days the report is based on only 50% of the sessions, which, according to Analytics, comes to only 35 sessions.
Has anyone experienced a similar behavior of sampling although only very small data-sets are used?
Sampling is induced when your request is calculation-intensive; there's no guaranteed point at which it trips.
Goal flow complexity will increase exponentially as you add goals, so even a low number of goals might make this report demand a lot of processing.
Meanwhile you'll find that most of the standard reports can cover large periods of time without sampling; they are pre-aggregated, so it's very cheap to load them.
If you want to know more about sampling, see here:
https://stackoverflow.com/a/37386181/5815149
I'm trying to pull out the (unique) visitor count for a certain directory using three different methods:
* with a profile
* using a dynamic advanced segment
* using a custom report filter
On a smaller site the three methods give the same result. But on the large site (>5M visits/month) I get a big discrepancy between the profile on one hand and the advanced segment and filter on the other. This might be because of sampling, but the difference is smaller when it comes to pageviews. Is the estimation of visitors worse, and the discrepancy bigger, when using sampled data? Also, when extracting data from the API (using filters or profiles) I still get DIFFERENT data even when GA doesn't indicate that the data is sampled, i.e. when I'm supposedly looking at unsampled data.
Another strange thing is that the pageviews are higher in the profile than in the filter, while the visitor count is higher for the filter than for the profile. I also applied a filter to the profile to force it to use sampled data, and I again get results quite similar to the filter and segment data.
|  | profile | filter | segment | filter#profile |
| --- | --- | --- | --- | --- |
| unique | 25550 | 37778 | 36433 | 37971 |
| pageviews | 202761 | 184130 | n/a | 202761 |
What I'm trying to achieve is a way to get somewhat accurate data on unique visitors when I've run out of profiles to use.
More data with discrepancies can be found in this google docs: https://docs.google.com/spreadsheet/ccc?key=0Aqzq0UJQNY0XdG1DRFpaeWJveWhhdXZRemRlZ3pFb0E
Google Analytics (free version) tracks only 10 million page interactions [0] (pageviews and events; any tracker method that starts with "track" is an interaction) per month [1], so presumably the data for your larger site is already heavily sampled (I guess each of your 5 million visitors has more than two interactions) [2]. Ad hoc reports use at most 1 million data points, so you have a sample of a sample. Naturally, aggregated values suffer more from smaller sample sizes.
And I'm pretty sure the data limits apply to API access too (Google says there is "no assurance that the excess hits will be processed"), so for the large site the API returns sampled (or incomplete) data too; you cannot really be looking at unsampled data.
As for the differences, I'd say that different ad hoc reports use different samples, so you end up with different results. With GA you shouldn't rely too much on absolute numbers anyway; look more for general trends.
[1] Analytics Premium tracks 50 million interactions per month (and has support from Google) but comes at 150,000 USD per year.
[2] Google suggests using "_setSampleRate()" on large sites to make sure you have sampled data for each day of the month, instead of random hit-or-miss once you exceed the data limits.
Data limits:
http://support.google.com/analytics/bin/answer.py?hl=en&answer=1070983
setSampleRate:
https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiBasicConfiguration#_gat.GA_Tracker_._setSampleRate
Yes, the sampled data is less accurate, especially with visitor counts.
I've also seen them miss 500k pageviews over two days, only to have them appear in their reporting a few days later. It also doesn't surprise me to see different results from different interfaces. The quality of Google Analytics has diminished, even as they have tried to become more real-time. It appears that their codebase is inconsistent across APIs, and their algorithms are all over the map.
I usually stick with the same metrics and reporting methods, so that my results remain comparable to one another. I also run GA in tandem with Gaug.es, as a validation and sanity check. With that extra data, I choose the reporting method in GA that I am most confident with and I rely on that exclusively.
Due to faulty hardware, statistics generated over a 2 week period were significantly higher than normal (10000 times higher than normal).
After moving the application to a new server, the problem rectified itself. The issue I have is that there are 2 weeks of stats that are clearly wrong.
I have checked the raw impressions table for the affected fortnight and it seems to be correct (i.e. stats per banner per day match the average for the previous month). Looking at the intermediate & summary impressions tables, the values are inflated.
I understand from the openx forum (link text) it's possible to regenerate stats from the raw data but it will only regenerate stats per hour, meaning regenerating stats for 2 weeks would be very time consuming.
Is there another, more efficient way to regenerate the stats from the raw data for the affected fortnight?
Have a look at this link as it appears to have a solution you may find helpful. The solution is similar to the one you posted in your question, but it appears that this one has been modified to make it easier to use. Other than using regenerateAdServerStatistics.php, I do not know of another option for regenerating the statistics you need.
> I understand from the openx forum (link text) it's possible to regenerate stats from the raw data but it will only regenerate stats per hour, meaning regenerating stats for 2 weeks would be very time consuming
We have solved this problem on our installation by creating a wrapper shell script for regenerateAdServerStatistics.php with dateStart & dateEnd arguments for situations like the one you mention. It's used to:
1. regenerate statistics for a specific day (all hours; takes ~2h)
2. run normal maintenance to keep today's stats updated
3. go to step (1) as long as the day processed < dateEnd
To be honest the script is somewhat more complex, as we also need to import raw data from our data warehouse for each day to be processed, because the "live" data is kept in an in-memory database, but that's kind of outside this post's context.
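As a minimal sketch of such a wrapper, here is a day-by-day loop over the affected fortnight. The script name regenerateAdServerStatistics.php is real, but its exact arguments and path vary by OpenX version, so the per-day call is left as a dry-run echo placeholder, and the hard-coded dates are just an example:

```shell
#!/bin/sh
# Sketch of a day-by-day stats regeneration wrapper (dry run).
# Replace the echo below with the real per-day invocation, e.g.
#   php regenerateAdServerStatistics.php <args for "$d">
# followed by a normal maintenance run to keep today's stats updated.
start_day=1
end_day=14
day=$start_day
dates=""
while [ "$day" -le "$end_day" ]; do
    d=$(printf '2010-03-%02d' "$day")
    dates="$dates $d"
    echo "would regenerate stats for $d"   # stand-in for the PHP call
    day=$((day + 1))
done
```

Each iteration handles one day (all 24 hours), so a fortnight is 14 sequential runs; slow, but it can be left unattended.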