While creating an instance I add the metadata "metering.server_group":"corey-group", and checking with nova show confirms it is applied. But when I check the Gnocchi resource using gnocchi resource show --type instance ${instance-id}, the attribute server_group is None at the beginning; only after a while does it get applied (always on the hour, e.g. 07:00, 08:00...). I have no idea what is happening. I think this issue causes Gnocchi to aggregate over incorrect datasets, so I spent some time troubleshooting it.
First of all, here are the attributes of the Gnocchi resource as stored in the database:
MariaDB [(none)]> use gnocchi
MariaDB [gnocchi]> select * from resource_type where name='instance';
# check its tablename, ex: rt_xxxxxx
MariaDB [gnocchi]> select * from rt_xxxxxx where display_name='corey-vm';
+----------------+---------------------+-----------+--------------------------------------+-------------------------+------------------+---+
| display_name | host | image_ref | flavor_id | server_group | id | flavor_name |
+----------------+---------------------+-----------+--------------------------------------+-------------------------+------------------+---+
| corey-vm | corey-test-com-001 | NULL | 26e46b4c-23bd-4224-a609-29bd3094a18e | NULL | xxxxxx | corey-flavor |
+----------------+---------------------+-----------+--------------------------------------+-------------------------+------------------+---+
As you can see, the column server_group should be corey-group, but it is always NULL right after the instance is created, and it seems Ceilometer only updates the resource once per hour, on the hour.
I added some logging to ceilometer/publisher/gnocchi.py and found that it actually updates the resource every minute, but the variable resource_extra contains server_group only on the hour; that is why it is None in the beginning.
Here are some excerpts from the logs:
2020-11-09 11:59:15 DEBUG ceilometer.publisher.gnocchi Resource {'host': u'test-com-002', 'display_name': u'vm-001', 'flavor_id': u'xxx', 'flavor_name': u'xxx'} publish_samples /usr/lib/python2.7/site-packages/ceilometer/publisher/gnocchi.py:345
2020-11-09 12:00:15 DEBUG ceilometer.publisher.gnocchi Resource {'host': u'test-com-002', 'display_name': u'vm-001', 'flavor_name': u'xxx', 'server_group': 'corey-group'} publish_samples /usr/lib/python2.7/site-packages/ceilometer/publisher/gnocchi.py:345
2020-11-09 12:01:15 DEBUG ceilometer.publisher.gnocchi Resource {'host': u'test-com-002', 'display_name': u'vm-001', 'flavor_id': u'xxx', 'flavor_name': u'xxx'} publish_samples /usr/lib/python2.7/site-packages/ceilometer/publisher/gnocchi.py:345
But I am stuck at this point; I can't understand why the variable resource_extra doesn't contain server_group every time. What exactly causes this? (Running on Queens.)
I would appreciate any ideas.
Update 09/11/2020
After some days of troubleshooting, I still can't find the root cause.
But I found a command line to apply server_group manually, which helps me avoid Gnocchi aggregating over incorrect datasets.
Here it is:
gnocchi resource update --type instance -a server_group:corey-group ${resource_id}
Update 11/11/2020
I tried grepping for the integer 3600 and changing it to 300, but nothing changed. Below is what I've tried:
/etc/ceilometer/ceilometer.conf
[compute]
resource_cache_expiry = 300
ceilometer/compute/discovery.py
cfg.IntOpt('resource_cache_expiry',
default=300,
ceilometer/publisher/zaqar.py
DEFAULT_TTL = 300
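The resource_cache_expiry option suggests the compute agent caches instance metadata for a fixed period. As a rough, hypothetical illustration (this is not Ceilometer's actual code), a time-based cache only re-reads attributes after its TTL expires, so an attribute added after the entry was cached only shows up at the next refresh:

```python
import time

class TTLCache:
    """Toy time-based cache: values are re-fetched only after ttl seconds."""
    def __init__(self, ttl, fetch):
        self.ttl = ttl          # seconds before a cached entry goes stale
        self.fetch = fetch      # function that retrieves the fresh value
        self._store = {}        # key -> (value, timestamp)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or now - entry[1] >= self.ttl:
            value = self.fetch(key)
            self._store[key] = (value, now)
            return value
        return entry[0]         # cached (possibly outdated) value

# An attribute cached at t=0 without server_group stays invisible
# until the TTL elapses, even though the source already has it.
cache = TTLCache(ttl=3600, fetch=lambda k: {"server_group": "corey-group"})
cache._store["vm"] = ({"server_group": None}, 0)    # stale view cached at t=0
print(cache.get("vm", now=1800))   # still the stale value, cache not expired
print(cache.get("vm", now=3600))   # refreshed value with server_group set
```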
Update 12/11/2020
I can't reproduce this issue on Pike.
Maybe you can refer to the following discussions:
Heat autoscaling with gnocchi based aodh alarms requires use of naive instance_discovery_method setting with ceilometer compute agents?
According to the reference, try changing the default instance_discovery_method from "libvirt_metadata" to "naive" in the Ceilometer config file, like this:
[compute]
instance_discovery_method = naive
Switching to "naive" resolves this issue; however, it obviously generates extra load on the Nova API for metadata retrieval.
Related
I can write a query in Application Insights that gives me a percentage as a scalar. I want to create an alert if that percentage is > X. How can this be done using log-based alerts?
Basically, I have a lot of machines that send telemetry to application insights. Sometimes they log some exceptions. I send MachineName in customDimensions for all the logs. So I can get the names of all the machines that sent logs in last 24 hours. The exceptions are also sent with MachineName in customDimensions. When a particular error is raised by more than X% machines in last 24 hours, I want to raise an alert.
The way to write alert logic is using 'Number of Results' which cannot be used for this since it automatically adds '|count' to the query. The other way is using 'Metric Measurement', which I am guessing should help me raise an alert like this but I'm unable to figure out how.
I can get the total machine count by this query:
let num_machines = traces
| summarize by tostring(customDimensions["MachineName"])
| count;
I can get the number of machines that reported an exception like this:
let num_error_machines = exceptions
| where customDimensions["Message"] contains "ExceptionXRaised"
| summarize by tostring(customDimensions["MachineName"])
| count;
Finally, I can get the percentage of machines that raised the issue like this:
print toscalar(num_error_machines)*100/toscalar(num_machines)
I am not sure how to use this result to raise an alert with Metric Measurement. The query needs to be modified somehow to produce an AggregatedValue and use bin(), but I am not sure whether that is possible or what that query would look like.
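The percentage itself is simple set arithmetic over distinct machine names; as a plain Python sketch (with made-up machine names) of what the two Kusto counts compute:

```python
# Machines that sent any trace in the window (distinct MachineName values).
all_machines = {"m1", "m2", "m3", "m4"}
# Machines that logged the exception in the same window.
error_machines = {"m1", "m3"}

# Same arithmetic as toscalar(num_error_machines)*100/toscalar(num_machines).
percent = len(error_machines) * 100 / len(all_machines)
print(percent)            # 50.0

THRESHOLD = 40            # alert threshold X, in percent
print(percent > THRESHOLD)  # True -> the alert should fire
```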
Sorry for the late reply. I tested this on my side and indeed ran into many problems.
I found that an alert rule can't monitor a percentage result directly; it only supports the number of query results and Metric Measurement. So I think you may have to give up the percentage and alert on num_error_machines instead, as in the screenshot below.
Please note that you can't append ";" at the end of the query, or it will give an error like "The request had some invalid properties".
I was wondering if anyone can tell me how to get the action log of an instance using openstacksdk or novaclient. While getting the action log, I also want to get the flavor attached to it. Please see the attached picture.
I actually got the action log using this novaclient module:
novaclient.v2.instance_action.InstanceAction
but it shows me very few details and not the flavor id that I need. The fields it shows are:
action, instance_uuid, message, project_id, request_id, start_time and user_id
I hope anyone can tell me how to get it.
I don't think it is possible to get the flavor id from the action list / server event list.
Openstack does not keep a database record of what each request did, or a historic record of the instance states. So you would need to resort to trawling the logs for the request-id ... which is OK for forensics, but does not scale. (And I don't know if the flavor is in the log messages.)
Of course, you could use the APIs (novaclient, openstacksdk) to get the current flavor for the instance, given its instance id. But that isn't exactly what you want.
It is possible to record historical information using Gnocchi + Ceilometer or similar, but you would need to have set this up already.
I'm using Azure Resource Graph, which uses the Kusto language to query Azure resources, and I am confused about how to create my own objects via the dynamic keyword from existing ones. The example below shows that I'm just trying to assign the same value to the dynamic object osDisk as to disk, but it fails with InvalidQuery. What am I doing wrong?
where type =~ 'Microsoft.Compute/virtualmachines'
| extend disk = properties.storageProfile.osDisk
| extend osDisk = dynamic({"osdisk" : properties.storageProfile.osDisk})
| project disk, osDisk
Error
Please provide below info when asking for support: timestamp = 2019-07-20T01:55:46.6283092Z, correlationId = 297ad2ed-81f2-49b3-86b2-5f38e2394923. (Code: BadRequest) Query is invalid. Please refer to the documentation for the Azure Resource Graph service and fix the error before retrying. (Code: InvalidQuery)
Removing the dynamic line returns results properly.
try using pack(): https://learn.microsoft.com/en-us/azure/kusto/query/packfunction
print disk = "disk_value", properties = dynamic({"storageProfile":{"osDisk":"osDisk_value"}})
| project disk, osDisk = pack("osDisk", properties.storageProfile.osDisk)
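If it helps to see the shape pack() produces, here is the same transformation in plain Python (reusing the placeholder values from the answer's print statement): it builds a one-key property bag from a value pulled out of a nested record, which dynamic() cannot do because dynamic() only accepts a constant literal:

```python
# The nested record, as in the print statement above.
properties = {"storageProfile": {"osDisk": "osDisk_value"}}

# pack("osDisk", <expression>) builds a property bag at query time;
# in Python terms that is just a dict keyed by "osDisk".
osDisk = {"osDisk": properties["storageProfile"]["osDisk"]}
print(osDisk)   # {'osDisk': 'osDisk_value'}
```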
I want to join pageViews that are coming from the AppInsights browser SDK, to the request on the backend. I don't see a foreign key that makes sense, is there one OOTB? or do I need to code something to join them together?
To add context, I am interested in pageView duration by cloudRoleInstance (server), but cloudRoleInstance is only available on requests.
I tried the following, and it did not work; I suppose the operation IDs are not the same.
pageViews
| join (requests) on operation_Id
You can join by Operation ID (operation_Id).
Here is the query which returns all documents for a particular operation_Id:
union *
| where timestamp > ago(1d)
| where operation_Id == "<operation_id>"
I was interested in exactly the same thing and this is how I ended up solving it:
Set a "cloud_RoleInstance" cookie for each response from the server so that the client javascript would know which role instance sent the last response.
Add a TelemetryInitializer to the client-side Application Insights instance which pulls the RoleInstance cookie and adds it as data to the telemetry collected client-side.
*The reason I did it this way instead of joining on operationId, as the other answer says, is because operationId seemed to span many requests on the server, sometimes over the course of half an hour. Maybe that is because of the way our Single Page Application is set up, but operationId just wasn't working for me.
Code
BaseController.cs::BeginExecute (We have our own BaseController which all other controllers inherit from)
var roleInstanceCookie = requestContext.HttpContext.Response.Cookies.Get("cloud_RoleInstance");
roleInstanceCookie.Value = Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CurrentRoleInstance.Id;
requestContext.HttpContext.Response.Cookies.Set(roleInstanceCookie);
ApplicationInsights.js (This contains our AI snippet that loads AI, currently using version 2.3.1 of the JS SDK)
// ... initialization snippet ...
appInsights.addTelemetryInitializer((envelope) => {
envelope.data.cloud_RoleInstance = getCookie("cloud_RoleInstance");
});
The cloud_RoleInstance will then end up in the customDimensions column of your PageViews in Application Insights
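The same cookie round-trip can be sketched with Python's stdlib http.cookies (the role instance id here is a hypothetical value; the real code above is C# and JavaScript):

```python
from http.cookies import SimpleCookie

# Server side: attach the role instance to the response as a cookie.
response_cookie = SimpleCookie()
response_cookie["cloud_RoleInstance"] = "WebRole_IN_0"   # hypothetical instance id
set_cookie_header = response_cookie.output(header="Set-Cookie:")
print(set_cookie_header)

# Client side: parse the cookie back and tag telemetry with it,
# mirroring what the telemetry initializer does.
request_cookie = SimpleCookie()
request_cookie.load("cloud_RoleInstance=WebRole_IN_0")
telemetry = {"name": "PageView"}
telemetry["cloud_RoleInstance"] = request_cookie["cloud_RoleInstance"].value
print(telemetry)
```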
I am new to the usage of for ***'Address use ***, and I was wondering what the limitations of this usage are. So I created the following procedure:
with System;
with Ada.Text_IO; use Ada.Text_IO;

procedure Letshack (A : System.Address) is
   My_String : String (1 .. 100000);
   for My_String'Address use A;
begin
   Put (My_String);
end Letshack;
This raises an EXCEPTION_ACCESS_VIOLATION, while the same code with a String of length 100 does not. Moreover, if I don't use an arbitrary integer address, this code works properly.
So, what are the limitations of for ***'Address use *** usage?
P.S.: I am using Ada 95, but any information is welcome.
Edit:
I understand part of the behavior now; this is what I suppose.
When you start your program, a certain stack is allocated, and you can write and read within it. Indeed, I wrote to the 5th byte using an integer address:
Real Addresses |----------------------------| Virtual Addresses
        0x48000|Stack Origin                |0x00
               |                            |
               |                            |
               |                            |
               |                            |
               |End of Stack                |
  0x48000+range|----------------------------|0x00+range
And you get EXCEPTION_ACCESS_VIOLATION if you go outside the stack. That seems strange for a "strong" language, if it is right, because it means you can overwrite your own stack and cause misbehavior.
Finally found the behavior.
When you start your program, the addresses you use are virtual ones within a page.
The part of the system that handles virtual addresses maps a certain amount of memory allocated to your process, a constant size that depends on your system, as the following schema shows:
Real Addresses |----------------------------| Virtual Addresses
        0x48000|Begin of the virtual address|0x00
               |range                       |
               |                            |
               |                            |
               |End of the virtual address  |
               |range                       |
  0x48000+range|----------------------------|0x00+range
You can access anything within that range without allocating a variable in it. For example, on my Windows machine this size is 4096 bytes, according to the variable si.dwPageSize in <windows.h>.
I tested that my String can be 4096 bytes long but not 4097 bytes. I still need to test this on my embedded system, but this seems close to the truth.
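For a quick cross-check of the 4096-byte figure (which came from si.dwPageSize above), the OS page size can also be read from Python's stdlib:

```python
import mmap

# The OS page size; commonly 4096 bytes on x86 Windows and Linux,
# but it can differ (e.g. some ARM systems use larger pages).
print(mmap.PAGESIZE)
```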
If you have ensured that you have allocated 100_000 consecutive characters in a readable part of memory starting at A, then it should work.
If A is the address of another Ada entity, it is not supposed to work.
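The answer's condition can be demonstrated with Python's ctypes (an analogy to the Ada overlay, not Ada itself): mapping a typed view onto an address is only safe when that many bytes are actually allocated there:

```python
import ctypes

# Allocate 100 readable bytes and note their address.
buf = ctypes.create_string_buffer(b"hello", 100)
addr = ctypes.addressof(buf)

# Overlay a 100-byte view at that address -- safe, the memory is allocated.
view = (ctypes.c_char * 100).from_address(addr)
print(bytes(view[:5]))   # the bytes written into the buffer

# Overlaying a much larger view at the same address is accepted by the
# language, but reading past the allocation is undefined behavior --
# the same situation as the 100000-character Ada overlay above.
too_big = (ctypes.c_char * 100000).from_address(addr)
```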