Intel VT-x: How NMI is delivered to guest OS

I am running a guest OS on KVM with Intel VT-x. I am trying to understand how an NMI is handled while the guest is running in non-root mode. Does Intel VT-x automatically deliver the NMI to the guest, or does the Linux kernel's KVM subsystem send the NMI to the vCPU? Although I have registered an NMI handler in the guest OS, I only see the host's NMI handler trigger during non-root execution.

This is a partial answer to the question. I can describe what the processor does when an NMI occurs, but I don't know what KVM does.
If the NMI Exiting control is 0 and an NMI arrives while in VMX non-root mode, the NMI is delivered to the guest via the guest's IDT.
If the NMI Exiting control is 1, an NMI causes a VM exit. [Intel SDM, volume 3, section 24.6.1, table 24-5]
Probably KVM sets this control to 1. In that case, the processor does not automatically process the NMI; it is up to KVM to decide how to handle it when the VM exit occurs. It may deliver the NMI to the host through the host IDT, or it may inject it into the guest.
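To make the second option concrete: KVM exposes NMI injection to userspace through the KVM_NMI ioctl (guarded by the KVM_CAP_USER_NMI capability). Below is a minimal hedged sketch; inject_guest_nmi is a hypothetical helper name, but the ioctl itself is the real KVM API:

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* vcpu_fd: file descriptor obtained via ioctl(vm_fd, KVM_CREATE_VCPU, id) */
    static int inject_guest_nmi(int vcpu_fd)
    {
        /* KVM_NMI queues an NMI for this vCPU. On VT-x, KVM performs the
         * injection on a subsequent VM entry via the VM-entry
         * interruption-information field, so the guest handles it through
         * its own IDT (vector 2). */
        return ioctl(vcpu_fd, KVM_NMI);
    }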

Related

Wake-up over LAN

By sending the so-called Magic Packet, I'm trying to wake up a PC over the LAN, with no luck.
Target PC BIOS settings: Power Management > Wake on LAN: LAN only.
Target PC NIC settings: Power Management > Allow this device to wake the computer: CHECKED.
Target PC NIC settings: Power Management > Only allow a magic packet to wake the computer: CHECKED.
Target PC NIC settings: Advanced > System Idle Power Saver: DISABLED.
Target PC NIC settings: Advanced > Wake on Magic Packet: ENABLED.
Target PC NIC settings: Advanced > Wake on Pattern Match: ENABLED.
The packet does arrive at the destination; I have tried ports 0, 1, 7, and 9.
What else should I be looking at?
The problem turned out to be the Windows updater: it was constantly running a setup process that repeatedly failed to install the 20H2 update, an infinite try-and-fail loop. Killing that setup in Task Manager and disabling automatic updates solved the problem. Further investigation showed that the faulty setup did not allow the PC to enter the S0 power state, but its NIC (and BIOS) firmware can perform WoL from S0 only. Note: both the motherboard and the BIOS are proprietary, so I can't add anything else about the system except my frustration with Dell.
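For reference, the magic packet itself is trivial: 6 bytes of 0xFF followed by 16 repetitions of the target MAC, usually sent as a UDP broadcast to port 7 or 9. A minimal sketch of a sender (the MAC below is a placeholder):

    #include <arpa/inet.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char mac[6] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55}; /* placeholder */
        unsigned char pkt[102];

        memset(pkt, 0xff, 6);                     /* synchronization stream */
        for (int i = 0; i < 16; i++)              /* 16 copies of the target MAC */
            memcpy(pkt + 6 + i * 6, mac, 6);

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

        struct sockaddr_in dst = {
            .sin_family = AF_INET,
            .sin_port = htons(9),
            .sin_addr.s_addr = htonl(INADDR_BROADCAST),
        };
        sendto(s, pkt, sizeof(pkt), 0, (struct sockaddr *)&dst, sizeof(dst));
        close(s);
        return 0;
    }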

How does the CPU distribute data from the network?

I'm learning network communications and am already familiar with the TCP/IP networking layers (physical, data link, ... and application) and how data moves through these nodes. But I have some questions about what happens inside a machine when data is received by a Network Interface Card (NIC).
Questions:
How does the CPU know that data from another machine has arrived?
How does the CPU inform the OS that data from another machine has arrived?
How does the OS know which application the data is for?
Please give me a deep explanation of this topic, or recommend some useful materials to make it clear.
To give you a general view from the Linux side (it should be similar for other OSes):
The packets arrive at the NIC. These packets are copied into circular queues in RAM via DMA. The arrival of packets generates an interrupt to let the system know that there are packets in RAM. Corresponding to the interrupt, there is an interrupt handler routine registered with the operating system by the network driver. (To keep things simple, I won't talk about softirqs.) Each CPU has a poll function whose job is to harvest packets from these queues and pass them on to the upper network layers; a sketch of this pattern follows the answers below. So, answering your queries:
How does the CPU know that data from another machine has arrived?
When the interrupt occurs and the poll loop is not already running on the CPU, the OS (via the network driver) asks the CPU to start the poll loop to harvest the packets.
How does the CPU inform the OS that data from another machine has arrived?
The CPU doesn't need to inform the OS. The OS knows when the interrupt occurs, because the interrupt handler is part of the network driver, which is part of the OS. In fact, in a way, the OS tells the CPU to start harvesting packets.
How does the OS know which application the data is for?
The communication is done via sockets, each of which has a port number. Arriving packets carry a destination port number, which guides the OS in delivering the packet to the required application.
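Here is the promised sketch of the interrupt-then-poll (NAPI) pattern described above, for a hypothetical driver; the mynic_* names and the private struct are made up, but napi_schedule(), napi_complete_done(), and the handler signatures are the real Linux kernel API:

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    struct mynic_priv {                 /* hypothetical per-device state */
        struct napi_struct napi;
        /* DMA descriptor rings, register mappings, ... */
    };

    static irqreturn_t mynic_irq(int irq, void *data)
    {
        struct mynic_priv *priv = data;

        /* Typically the driver masks further RX interrupts here (a
         * device-specific register write), then defers the real work
         * to softirq context. */
        napi_schedule(&priv->napi);
        return IRQ_HANDLED;
    }

    static int mynic_poll(struct napi_struct *napi, int budget)
    {
        int done = 0;

        /* Harvest up to 'budget' packets from the RX DMA ring here,
         * passing each one up the stack, e.g. with
         * napi_gro_receive(napi, skb); 'done' counts packets processed. */

        if (done < budget) {
            /* Ring drained: leave polling mode and re-enable the RX IRQ. */
            napi_complete_done(napi, done);
        }
        return done;
    }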

NIC behaviour in bridged adapter mode

I was always under the impression that a NIC has a unique MAC address, and that if an incoming packet matches that MAC, the NIC picks the packet up and sends it to the kernel.
Recently I installed VirtualBox (host: Ubuntu, guest OS: Ubuntu) and configured the network option in "bridged adapter" mode (the MAC is randomly chosen). The VM acts like an independent machine: the guest OS has its own MAC address and public IP.
I have observed that packets sent on the wire from the VM carry the virtual MAC, and the same holds for incoming packets.
1) Do NICs allow sending network packets with a MAC different from the physical MAC? And for incoming packets, is it OK to pick up packets whose MAC does not match the physical MAC? (As I understand it, this is only possible in promiscuous mode.)
2) Isn't this a security violation? What about flooding the Internet by allocating more MACs, by creating multiple VM instances on many machines?
3) If the MAC is chosen randomly, there is a possibility that it will match some other network device's MAC. How is this addressed?
Thank You,
Gopinath.
1) Regular NICs operate at Layer 1; it is the responsibility of the OS (and the respective kernel-space or user-space drivers) to provide a valid Ethernet frame, using (if needed) the vendor-assigned MAC address stored (for reference) in the network card's memory. Whether the frame comes from the host OS or from a guest OS (through a virtual switch in the hypervisor) is irrelevant. The situation becomes slightly different in the case of NFV and smart NICs, but not by much.
The whole point of virtualization is that you shouldn't be able to tell the difference between running your OS on a virtual server and on a standalone machine next to your host (whether looking from inside the system or from the outside).
2) No, security doesn't get worse because of this. As mentioned in the previous point, the situation would be similar if you put one physical host next to another. And from a security point of view, it's easier to flood the local network with packets carrying forged source MACs than to instantiate the same number of VMs.
3) Collisions affect the local network the same way as with regular hosts. The possibility is always there, but the probability is extremely low: with the unicast and locally-administered bits fixed, there are 46 bits to choose a random address from. I've seen only one such collision, and only because the MACs were set manually, not picked randomly.
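For illustration, a minimal sketch of how a locally administered unicast MAC can be generated, as hypervisors do for virtual NICs; the bit twiddling on the first octet is what fixes the two reserved bits:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        uint8_t mac[6];
        srand((unsigned)time(NULL));      /* toy RNG; real code would use a CSPRNG */
        for (int i = 0; i < 6; i++)
            mac[i] = (uint8_t)(rand() & 0xff);

        mac[0] &= 0xfe;                   /* clear I/G bit: unicast */
        mac[0] |= 0x02;                   /* set U/L bit: locally administered */

        printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
               mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
        return 0;
    }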

Get SR-IOV Virtual Function counters

Suppose I've got SR-IOV passthrough enabled on the host with two Virtual Functions, and I'm running two QEMU/KVM VMs with libvirt, each connected to one VF. Is there any way to see the VF counters on the host (such as RX/TX packets)?
I've tried using ethtool -S to see stats, but I can only see the global counters of the physical function.
I found an SR-IOV counters plugin for OpenStack Ceilometer, but it's a Mellanox plugin and uses proprietary drivers on the guest VMs.
Any help would be appreciated.
When you enable VFs on a host, the VFs are initially bound to a host kernel network driver module, so they appear as ethNN devices, letting you query stats. When you then attach a VF to a guest using PCI device assignment, the VF is unbound from the host kernel driver, so the ethNN device goes away on the host. It is thus impossible to query network stats for that VF on the host.
The only way to achieve that is to not use PCI device assignment, and instead associate the VF with the guest using MACVTAP in direct mode. This is not quite as high-performance as PCI assignment, but it is still pretty decent thanks to the virtio-net design, and it leaves the NIC visible on the host so you can monitor traffic.
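A minimal sketch of the libvirt interface definition for MACVTAP in direct (passthrough) mode, assuming the VF shows up on the host as eth4 (the device name is a placeholder):

    <interface type='direct'>
      <source dev='eth4' mode='passthrough'/>
      <model type='virtio'/>
    </interface>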

ixgbe: setting the number of RX/TX queues

I want to set the number of RX/TX queues used by an Intel 10G NIC. Let me explain why:
I am using an Intel 10G NIC of type X520, on a Dell R720 system. I am using ixgbe version 3.6.7-k. The kernel is Ubuntu's 3.2.0-59.
I am running my network application on 4 of the 24 cores on the machine. Currently the NIC is using Flow Director, so I've got 24 TX and RX queues, while most of the IRQs end up running on the 4 cores running the application.
However, I see that some IRQs still fire for the other 20 queues (this probably happens because Flow Director samples about 20% of the traffic, so some traffic goes through regular RSS). I don't want any IRQ to run on the other 20 cores, as they are doing a different task that is hurt by the IRQ load.
I tried setting the affinity of the interrupts to only the 4 cores I use, but this does not work well with Flow Director. I guess a better approach would be to use only 4 RX/TX queues and assign them to the dedicated cores. But I couldn't find a way to set the number of RX/TX queues in the ixgbe driver (though this is quite simple with other 10G drivers I am familiar with, such as Broadcom's bnx2x).
Any idea?
This is not possible with the version of ixgbe (currently 3.19.1-k) in the latest Linux kernel source (as of 3.18.0-rc1).
You need to grab the latest ixgbe driver (currently 3.22.3) from e1000.sf.net, which supports the RSS parameter. From modinfo ixgbe:
parm: RSS:Number of Receive-Side Scaling Descriptor Queues, default 0=number of cpus (array of int)
So if you have one ixgbe NIC and want 4 queues, you'll need to add a line like this to modprobe.conf (or equivalent in your distro):
options ixgbe RSS=4
Then you'll want to set the /proc/irq/*/smp_affinity CPU mask for whichever IRQs in /proc/interrupts match your NIC.
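For example (a hedged sketch: the IRQ numbers 40-43 are placeholders; check /proc/interrupts for the real ones on your system):

    # pin the four queue IRQs to cores 0-3 (the masks are hex CPU bitmaps)
    echo 1 > /proc/irq/40/smp_affinity
    echo 2 > /proc/irq/41/smp_affinity
    echo 4 > /proc/irq/42/smp_affinity
    echo 8 > /proc/irq/43/smp_affinity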
Some versions of the ixgbe driver included in the Linux kernel (since 2013, kernel 3.9, ixgbe version "3.11.33-k") can change the RSS (queue) count at runtime, even without the RSS module option. The ethtool tool changes parameters of network cards, and it has options to change channels:
ethtool -l|--show-channels devname
ethtool -L|--set-channels devname [rx N] [tx N] [other N] [combined N]

-l --show-channels
       Queries the specified network device for the numbers of channels
       it has. A channel is an IRQ and the set of queues that can
       trigger that IRQ.

-L --set-channels
       Changes the numbers of channels of the specified network device.

       rx N        Changes the number of channels with only receive queues.
       tx N        Changes the number of channels with only transmit queues.
       other N     Changes the number of channels used only for other
                   purposes, e.g. link interrupts or SR-IOV co-ordination.
       combined N  Changes the number of multi-purpose channels.
Check the current channel (RSS, queue) count of ixgbe eth1 with ethtool -l eth1, and change it with ethtool -L eth1 combined 4 or ethtool -L eth1 rx 2 tx 2.
Implemented in drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3442
static const struct ethtool_ops ixgbe_ethtool_ops = {
    ...
    .get_channels = ixgbe_get_channels,
    .set_channels = ixgbe_set_channels,
    ...
};
ixgbe_get_channels: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3127
ixgbe_set_channels to change adapter->ring_feature[RING_F_RSS].limit: http://elixir.free-electrons.com/linux/v4.12/source/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c#L3164
Implemented since version 3.9 of the Linux kernel (around 2013):
* http://elixir.free-electrons.com/linux/v3.9/ident/ixgbe_get_channels
* https://patchwork.ozlabs.org/patch/211119/ "[RFC,v2,09/10] ixgbe: Add support for displaying the number of Tx/Rx channels"
* https://patchwork.ozlabs.org/patch/211120/ "[RFC,v2,10/10] ixgbe: Add support for set_channels ethtool operation"
