Host name resolution failed for 'Microsoft.Network' - vnet

I cannot perform any operation on the VNet, such as VNet integration. It is failing in North Europe as well.
Failed to save subnet 'apps'. Error: 'Host name resolution failed for 'Microsoft.Network'; cannot fulfill the request.'

I see the same issue.
It might be related to the current Azure failure affecting ARM:
Starting at 08:45 UTC on 04 Jun 2020, a subset of customers may
experience issues with resource creation for services that depend on
the Azure Resource Manager (ARM) platform.
Engineers have identified a fix and have started to deploy to affected
regions. Customers may continue to experience latency or failures
while creating resources for some services. The next update will be
provided in 60 minutes, or as events warrant.
This message was last updated at 14:33 UTC on 04 June 2020
You can check the updated Azure services status here - https://status.azure.com/en-us/status
EDIT: The Azure team has announced that the issue is now resolved.

How to troubleshoot unresponsive ASP .NET Web API on IIS?

I am having an issue with a Web API hosted on IIS. Here are the details.
Environment: ASP.NET MVC, IIS, SQL Server
The Web API is hosted on a separate server and load balanced across 2 servers with BIG-IP.
The API works fine in PROD, but in lower environments it becomes unresponsive or very slow. Sometimes a request takes 5 minutes to process; at other times it never completes. The issue occurs only once or twice a month. An app pool recycle or IIS restart didn’t seem to help, but after rebooting the server it works fine and processes the same request in 10 to 20 seconds.
Authentication is required for every request (i.e. a 401 followed by a 200). When the issue occurs, the IIS log contains only the 401 entry. The issue seems to affect only 1 server, because everything works again after restarting only that server.
While the issue is occurring, new API requests are still routed to the affected server. Why might those requests not be sent to the other server, which is free?
The system logs were fine. At the time of the issue, CPU utilization was 3%, and memory usage and other metrics looked good on the server. The IIS configuration settings are the same as in PROD and look good.
What tools can be used to monitor IIS apps and the server? Are there any free tools? Any help troubleshooting this issue would be great.
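One free way to narrow down the "401 with no following 200" pattern is to summarize the IIS logs directly. The snippet below is only a minimal sketch, assuming the logs are in the W3C extended format with a `#Fields:` directive that includes `date`, `time` and `sc-status`. The function names are made up for illustration and are not part of any IIS tooling.

```python
from collections import Counter


def summarize_iis_log(lines):
    """Count responses per (date, hour, status) in a W3C extended log."""
    fields = []
    per_hour = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive names the space-separated columns that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            continue  # other directives, or data before any #Fields line
        row = dict(zip(fields, line.split()))
        hour = row.get("time", "")[:2]
        per_hour[(row.get("date"), hour, row.get("sc-status"))] += 1
    return per_hour


def hours_with_only_401(per_hour):
    """Return (date, hour) buckets that logged a 401 but never a 200."""
    statuses = {}
    for (date, hour, status), _count in per_hour.items():
        statuses.setdefault((date, hour), set()).add(status)
    return sorted(
        bucket for bucket, seen in statuses.items()
        if "401" in seen and "200" not in seen
    )
```

Running `hours_with_only_401(summarize_iis_log(open(path)))` against the log of each server would show which hours, and which node, stopped completing the authenticated request.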
Some errors on server:
Activation of app Microsoft.Windows.Cortana_cw5n1h2txyewy!CortanaUI failed with error: This app can't be activated by the Built-in Administrator. See the Microsoft-Windows-TWinUI/Operational log for additional information.
The Open Procedure for service "BITS" in DLL "C:\Windows\System32\bitsperf.dll" failed. Performance data for this service will not be available. The first four bytes (DWORD) of the Data section contains the error code.
Windows cannot load the extensible counter DLL ASP.NET_2.0.50727. The first four bytes (DWORD) of the Data section contains the Windows error code.
Cryptographic Services failed while processing the OnIdentity() call in the System Writer Object.
UPDATE:
DebugDiag2 Analysis - CrashHangAnalysis report.
Is this causing deadlock?
Thread ID   Total CPU Time   Entry Point for Thread
2           00:00:00.031     ntdll!RtlReleaseSRWLockExclusive+2200
0           00:00:00.030     w3wp+2e50
1           00:00:00.000     nativerd!DllGetClassObject+24680
3           00:00:00.000     ntdll!RtlReleaseSRWLockExclusive+2200
4           00:00:00.000     w3tp!THREAD_POOL::CreateThreadPool+350
4 Threads (40% of all threads) have this same call stack.
Note: Grouping of identical threads can be disabled in the 'Preferences' tab of the Analysis Options
Thread 2 - System ID 4704
Entry point ntdll!RtlReleaseSRWLockExclusive+2200
Create time 9/20/2021 1:00:34 PM
Time spent in user mode 0 Days 00:00:00.000
Time spent in kernel mode 0 Days 00:00:00.031
This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required.
Thread 3 - System ID 2576
Entry point ntdll!RtlReleaseSRWLockExclusive+2200
Create time 9/20/2021 1:00:34 PM
Time spent in user mode 0 Days 00:00:00.000
Time spent in kernel mode 0 Days 00:00:00.000
This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required.
Thread 8 - System ID 2112
Entry point ntdll!RtlReleaseSRWLockExclusive+2200
Create time 9/20/2021 1:00:34 PM
Time spent in user mode 0 Days 00:00:00.000
Time spent in kernel mode 0 Days 00:00:00.000
This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required.
Thread 9 - System ID 5832
Entry point ntdll!RtlReleaseSRWLockExclusive+2200
Create time 9/20/2021 1:01:04 PM
Time spent in user mode 0 Days 00:00:00.000
Time spent in kernel mode 0 Days 00:00:00.000
This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required.
Instruction Address Source
[0x7ffadb029444] ntdll!NtWaitForWorkViaWorkerFactory+14
[0x7ffadaf9eb4e] ntdll!RtlReleaseSRWLockExclusive+296e
[0x7ffada7184d4] kernel32!BaseThreadInitThunk+14
[0x7ffadafd1781] ntdll!RtlUserThreadStart+21

Azure VM update management Non-Compliant VMs

What is the reason for an Azure VM to become non-compliant after an update deployment?
Is it because critical or security updates are missing?
Does having Other updates missing cause a VM to become non-compliant?
I could not find any documentation that clarifies this; I came to this conclusion only from the current statistics. If there is documentation that confirms it, I would appreciate a pointer.
Yes, it may be because some critical or security updates are missing. Before you deploy software updates to your machines, review the update compliance assessment results for enabled machines. For each software update, its compliance state is recorded, and after the evaluation is complete it is collected and forwarded in bulk to Azure Monitor logs.
On a Windows machine, the compliance scan runs every 12 hours by default and is initiated within 15 minutes of the Log Analytics agent for Windows being restarted. The assessment data is then forwarded to the workspace and refreshes the Updates table. Before and after update installation, an update compliance scan is performed to identify missing updates, but the results are not used to update the assessment data in the table. It is important to review the recommendations on how to configure the Windows Update client.
After reviewing the compliance results, the software update deployment phase is the process of deploying software updates. To install updates, schedule a deployment that aligns with your release schedule and service window.
After the deployment is complete, review the process to determine the success of the update deployment by machine or target group. See "Check deployment status".

MariaDB has stopped responding - [ERROR] mysqld got signal 6

The MariaDB service suddenly stopped responding. It had been running continuously for more than 5 months without any issues. When we checked the MariaDB service status at the time of the incident ( service mariadb status ), it showed as active (running). But we could not log in to the MariaDB server; each login attempt just hung without any response. All our web applications also failed to communicate with the MariaDB service. We also checked max_used_connections, and it was below the maximum value.
When we went through the logs, we saw the error below (it was triggered at the time of the incident).
210623 2:00:19 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.34-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=139
max_threads=752
thread_count=72
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1621655 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4c008501e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4c458a7d30 thread_stack 0x49000
2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:
--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x55e1838d5ab0 created in file btr0sea.cc line 191
a writer (thread id 139966610978560) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1145
Last time write locked in file btr0sea.cc line 1218
We could not even stop the MariaDB service using the normal stop command ( service MariaDB stop ). But we were able to forcefully kill the MariaDB process, and after that we could bring the MariaDB service back online.
What could be the reason for this failure? If you have faced similar issues, please share your experience and what actions you took to prevent such failures in the future. Your feedback is much appreciated.
Our Environment Details are as follows
Operating system: Red Hat Enterprise Linux 7
Mariadb version: 10.2.34-MariaDB-log MariaDB Server
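The memory estimate printed in the crash report can be reproduced by hand. The snippet below is a small sketch of that arithmetic. Note that sort_buffer_size does not appear in the log excerpt, so the function takes it as a parameter, and the second helper only backs out the combined per-thread buffer size implied by the reported total.

```python
def max_memory_kib(key_buffer_size, read_buffer_size, sort_buffer_size, max_threads):
    """key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads, in K bytes."""
    total_bytes = key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
    return total_bytes // 1024


def implied_per_thread_bytes(total_kib, key_buffer_size, max_threads):
    """Back out (read_buffer_size + sort_buffer_size) from a reported total."""
    return (total_kib * 1024 - key_buffer_size) / max_threads


# Values taken from the log above; the result is approximate because the
# reported total is rounded to whole K bytes.
per_thread = implied_per_thread_bytes(1621655, 67108864, 752)
print(f"read + sort buffers per thread: about {per_thread / 1024 / 1024:.2f} MiB")
```

With max_threads=752, each additional MiB of per-thread buffer adds 752 MiB to this worst-case figure, which is why the message suggests decreasing the variables if the total is too high for the host.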
I also face this issue on an AWS instance (c5a.4xlarge) hosting my database.
Server version: 10.5.11-MariaDB-1:10.5.11+maria~focal
It has already happened 3 times, at irregular intervals. As in your case, it is not possible to stop the service; I have to reboot the machine to get it working again.
The logs at restart suggest that some tables crashed and should be repaired.

Azure cosmos changefeed Processor options

Changefeed Processor options are well described here -
I have a few questions about them:
1. leaseRenewInterval: Suppose an instance could not renew its lease within 17s (the default lease renew interval). Will the lease be removed from that instance? Or will the feed wait until leaseExpirationInterval before removing the lease, giving the instance a chance to reacquire it within 60s?
2. Does leaseRenew happen after a checkpoint by default, or are the two independent? I.e., can leaseRenew happen on a separate thread after leaseRenewInterval while another thread is still working on a batch?
3. We have seen the error: failed to checkpoint for owner 'null' with continuation token. How can this happen? Why can the owner become null?
4. We have also seen the exception LeaseLostException. Can this happen even if the pod/instance is not down? We do not expect any load balancing, as there is only 1 physical partition, but we want our system to be fault tolerant, so we run multiple instances, all but one of which are always waiting to acquire the lease.
5. In a few cases we can see 3 pods/instances holding the lease for the same physical partition at the same time; in other words, they acquired the same lease. (We have at most 1 physical partition: the TTL for documents is 3 days and storage is small, so we do not expect more than 1 physical partition.) How can this happen?
EDITS:
Current Settings:
leaseRenewInterval : 17s
leaseAcquireInterval: 13s
leaseExpirationInterval: 60s
feedPollDelay: 2s [only this is not the default]
ChangeFeed Processor version:
We are using the following dependency in our Maven build:
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-cosmos</artifactId>
<version>4.8.0</version>
</dependency>
So, I can assume the CFP version is 4.8.0
1. Leases that are not renewed are not removed by the current instance. Other instances can "think" that the lease was not renewed because the current owner crashed, so they will "steal" it. This normally happens when the lease is not accessed/updated before the expiration time.
2. Independent. There could be no checkpoints (no new changes) and the lease would still get renewed.
3. That sounds like the lease was released and a checkpoint was then attempted. I am not sure which CFP version you are using or what your interval configuration is.
4. Are you customizing any of the intervals? If so, that could lead to a lease being lost (detected as expired by another instance).
5. Same question as before; this could happen either during load balancing or because leases are being detected as expired.
Please share which CFP version you are using and what the options are. Normally, unless you are very certain of what you are doing, I don't recommend changing any of the intervals.
EDIT: Based on the new information: I am not familiar with the Java CFP, but when the number of instances is higher than the number of leases, load balancing a lease across other instances, while not ideal, shouldn't be a problem, because the lease will still be owned and processed by 1 machine. The only recommendation I'd make is to try the latest Maven package version. There are CFP fixes in newer versions (https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-java-v4#4140-2021-04-06), so try 4.15.0.
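To make the renew/expire/steal behaviour described above concrete, here is a toy model in Python. It is only a sketch of the idea, not the azure-cosmos SDK's actual implementation: the `Lease` class and the `renew` and `try_acquire` helpers are hypothetical names, and only the 60s expiration value comes from the settings in the question.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_EXPIRATION = 60.0  # seconds, leaseExpirationInterval from the question


@dataclass
class Lease:
    owner: Optional[str] = None
    last_renewed: float = 0.0

    def is_expired(self, now: float, expiration: float = LEASE_EXPIRATION) -> bool:
        # An unowned lease, or one not touched within the expiration
        # window, looks available to every other instance.
        return self.owner is None or now - self.last_renewed >= expiration


def renew(lease: Lease, instance: str, now: float) -> bool:
    """Owner refreshes the timestamp; False models a lost lease."""
    if lease.owner != instance:
        return False
    lease.last_renewed = now
    return True


def try_acquire(lease: Lease, instance: str, now: float) -> bool:
    """Another instance takes the lease only once it looks expired."""
    if lease.owner == instance:
        return True
    if lease.is_expired(now):
        lease.owner = instance
        lease.last_renewed = now
        return True
    return False
```

In this model, an owner that misses one 17s renewal keeps the lease; it is taken over only if no renewal lands for the full 60s, after which the old owner's next renewal fails. That failed renewal is the situation in which a lost-lease error would surface even though the old instance never went down.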

Statsd dying silently

Statsd, when started by Chef, is dying. I believe I have isolated the problem away from Chef, as the init script Chef calls is doing what it is supposed to. I have turned debug on for statsd, and the following are the last messages in the log before it dies:
15 Oct 11:17:39 - reading config file: /etc/statsd/config.js
15 Oct 11:17:39 - server is up
15 Oct 11:17:39 - DEBUG: Loading backend: ./backends/graphite
I am absolutely stumped; nothing in /var/log/messages, nothing in the error log. Any idea if statsd requires certain services up and running?
StatsD doesn't require any specific services to run. You can also set dumpMessages to true in the config to have it log all messages that are incoming to get a picture of what's happening. Does it send any data to Graphite during the time it's up? If you continue having problems, there is also #statsd on Freenode IRC where a lot of people are idling who know a thing or two about StatsD.
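A quick way to check whether the daemon is still accepting traffic, and to have something show up when dumpMessages is enabled, is to send it a test counter over UDP. This is a minimal sketch, assuming StatsD listens on its default UDP port 8125 on the local machine; the metric name is made up for illustration.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter datagram: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Send one counter increment to StatsD over UDP and return the payload."""
    payload = format_counter(name, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # "healthcheck.ping" is a hypothetical metric name.
    send_counter("healthcheck.ping")
```

Because UDP is fire-and-forget, the send succeeds even if nothing is listening, so the check is whether the message appears in the StatsD log or the metric reaches Graphite.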
I had a very similar issue. I found this page very informative. I took only the part to package my statsd with its backends (changed package files a bit) and installed the created package to make it a daemon.
Hope this helps.
