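Earlier articles gave a basic picture of KubeVirt. This article looks at how networking for KubeVirt virtual machines is implemented.

Pod networking

KubeVirt is built on Kubernetes CRDs. Each KubeVirt virtual machine corresponds to one VMI object and one pod, and Kubernetes already standardizes pod networking through CNI, so it is worth reviewing pod networking before turning to KubeVirt VM networking.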

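Pod and container

A Kubernetes pod is a logical group of containers. A pod can contain several application containers plus one built-in sandbox container.

How kubelet creates a pod

When kubelet creates the containers of a pod, it creates the sandbox container first and the application containers afterwards.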

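hostNetwork and CNI

Pod networking is set up during sandbox creation. kubelet calls the CRI to create the sandbox container; the CRI implementation checks whether the pod uses hostNetwork, and if it does not, it first invokes the CNI plugin to initialize the network (creating the network devices and allocating an IP).

flannel host-gateway

The examples in this article assume that the CNI is flannel in host-gateway (host-gw) mode.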

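KubeVirt networking

Components involved in KubeVirt networking

As covered in earlier articles, creating a virtual machine on KubeVirt only requires creating a VMI (Virtual Machine Instance) object. virt-controller then creates a pod from the information in the VMI; this article calls it the VMI pod. The VMI pod contains the KubeVirt component virt-launcher together with the virtualization components libvirtd and qemu. VM networking mainly involves virt-launcher and virt-handler, which is deployed as a DaemonSet.

The following is based on kubevirt@0.49.0.

Source code analysis

Take the following VMI YAML as an example: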

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  annotations:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - masquerade: {}
        name: default
    # ...
  networks:
  - name: default
    pod: {}
  # ...

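Two fields of the VMI are closely tied to networking:

spec.domain.devices.interfaces: defines how the interface is attached to the guest. Supported types are bridge, slirp, masquerade, sriov and macvtap (exactly one per interface); this article covers only bridge and masquerade.

spec.networks: defines the source network the VM connects to. Supported types are pod and multus (exactly one per network); this article covers only pod.

The name fields of the two entries must match.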
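The name-matching rule can be illustrated with a minimal Go sketch. The `Interface` and `Network` types and the `unmatchedInterfaces` helper are hypothetical stand-ins carrying only the `Name` field; they are not KubeVirt's API types or its validation code.

```go
package main

import "fmt"

// Interface and Network are simplified, hypothetical stand-ins for the
// entries of spec.domain.devices.interfaces and spec.networks.
type Interface struct{ Name string }
type Network struct{ Name string }

// unmatchedInterfaces returns the names of interfaces for which no
// network with the same name exists.
func unmatchedInterfaces(ifaces []Interface, nets []Network) []string {
	known := make(map[string]bool, len(nets))
	for _, n := range nets {
		known[n.Name] = true
	}
	var missing []string
	for _, i := range ifaces {
		if !known[i.Name] {
			missing = append(missing, i.Name)
		}
	}
	return missing
}

func main() {
	ifaces := []Interface{{Name: "default"}, {Name: "secondary"}}
	nets := []Network{{Name: "default"}}
	// "secondary" has no matching network, so it is reported.
	fmt.Println(unmatchedInterfaces(ifaces, nets))
}
```

In the example YAML both entries are named default, so such a check would report nothing.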

Based on this YAML, we now look at how virt-handler and virt-launcher implement KubeVirt networking at the source level.

virt-handler

After a VMI object is created, virt-handler's entry point for network initialization is vmUpdateHelperDefault:

// pkg/virt-handler/vm.go

func (d *VirtualMachineController) vmUpdateHelperDefault(origVMI *v1.VirtualMachineInstance, domainExists bool) error {

/*...*/

if !vmi.IsRunning() && !vmi.IsFinal() {

/*...*/

if err := d.setupNetwork(vmi); err != nil {

return fmt.Errorf("failed to configure vmi network: %w", err)

}

/*...*/

}

/*...*/

}

// pkg/virt-handler/vm.go

func (d *VirtualMachineController) setupNetwork(vmi *v1.VirtualMachineInstance) error {

/*...*/

return d.netConf.Setup(vmi, isolationRes.Pid(), func() error {

if requiresDeviceClaim {

if err := d.claimDeviceOwnership(rootMount, "vhost-net"); err != nil {

return neterrors.CreateCriticalNetworkError(fmt.Errorf("failed to set up vhost-net device, %s", err))

}

}

return nil

})

}

// pkg/network/setup/netconf.go

func (c *NetConf) Setup(vmi *v1.VirtualMachineInstance, launcherPid int, preSetup func() error) error {

/*...*/

err := ns.Do(func() error {

// Run phase 1 of the network setup

return netConfigurator.SetupPodNetworkPhase1(launcherPid)

})

if err != nil {

return fmt.Errorf("setup failed, err: %w", err)

}

/*...*/

}

// pkg/network/setup/network.go

func (n *VMNetworkConfigurator) SetupPodNetworkPhase1(pid int) error {

launcherPID := &pid

nics, err := n.getPhase1NICs(launcherPID)

if err != nil {

return err

}

for _, nic := range nics {

if err := nic.PlugPhase1(); err != nil {

return fmt.Errorf("failed plugging phase1 at nic '%s': %w", nic.podInterfaceName, err)

}

}

return nil

}

SetupPodNetworkPhase1 has two steps: getPhase1NICs collects the NIC information, then PlugPhase1 is executed for each NIC.

getPhase1NICs

First, getPhase1NICs:

// pkg/network/setup/network.go

func (v VMNetworkConfigurator) getPhase1NICs(launcherPID *int) ([]podNIC, error) {

/*...*/

for i, _ := range v.vmi.Spec.Networks {

nic, err := newPhase1PodNIC(v.vmi, &v.vmi.Spec.Networks[i], v.handler, v.cacheFactory, launcherPID)

if err != nil {

return nil, err

}

nics = append(nics, *nic)

}

return nics, nil

}

func newPhase1PodNIC(vmi *v1.VirtualMachineInstance, network *v1.Network, handler netdriver.NetworkHandler, cacheFactory cache.InterfaceCacheFactory, launcherPID *int) (*podNIC, error) {

// Look up the pod NIC from spec.domain.devices.interfaces and spec.networks

podnic, err := newPodNIC(vmi, network, handler, cacheFactory, launcherPID)

if err != nil {

return nil, err

}

if launcherPID == nil {

return nil, fmt.Errorf("missing launcher PID to construct infra configurators")

}

// infraConfigurator is only initialized for bridge and masquerade

if podnic.vmiSpecIface.Bridge != nil {

podnic.infraConfigurator = infraconfigurators.NewBridgePodNetworkConfigurator(

podnic.vmi,

podnic.vmiSpecIface,

generateInPodBridgeInterfaceName(podnic.podInterfaceName),

*podnic.launcherPID,

podnic.handler)

} else if podnic.vmiSpecIface.Masquerade != nil {

podnic.infraConfigurator = infraconfigurators.NewMasqueradePodNetworkConfigurator(

podnic.vmi,

podnic.vmiSpecIface,

generateInPodBridgeInterfaceName(podnic.podInterfaceName),

podnic.vmiSpecNetwork,

*podnic.launcherPID,

podnic.handler)

}

return podnic, nil

}

newPodNIC looks up the pod NIC from spec.domain.devices.interfaces and spec.networks. Note that for a pod network the returned pod NIC name defaults to eth0; for a multus network with multus.default=false the name is net%d, where %d is the ordinal of that multus network in spec.networks.

PlugPhase1

Next, PlugPhase1:

// pkg/network/setup/podnic.go

func (l *podNIC) PlugPhase1() error {

// SR-IOV NICs need no handling here

if l.vmiSpecIface.SRIOV != nil {

return nil

}

/*...*/

// As seen above, only bridge and masquerade initialize this field,

// so every other interface type returns here

if l.infraConfigurator == nil {

return nil

}

if err := l.infraConfigurator.DiscoverPodNetworkInterface(l.podInterfaceName); err != nil {

return err

}

dhcpConfig := l.infraConfigurator.GenerateNonRecoverableDHCPConfig()

if dhcpConfig != nil {

log.Log.V(4).Infof("The generated dhcpConfig: %s", dhcpConfig.String())

if err := l.cacheFactory.CacheDHCPConfigForPid(getPIDString(l.launcherPID)).Write(l.podInterfaceName, dhcpConfig); err != nil {

return fmt.Errorf("failed to save DHCP configuration: %w", err)

}

}

domainIface := l.infraConfigurator.GenerateNonRecoverableDomainIfaceSpec()

if domainIface != nil {

log.Log.V(4).Infof("The generated libvirt domain interface: %+v", *domainIface)

if err := l.storeCachedDomainIface(*domainIface); err != nil {

return fmt.Errorf("failed to save libvirt domain interface: %w", err)

}

}

/*...*/

// preparePodNetworkInterface must be called *after* the Generate

// methods since it mutates the pod interface from which those

// generator methods get their info from.

if err := l.infraConfigurator.PreparePodNetworkInterface(); err != nil {

log.Log.Reason(err).Error("failed to prepare pod networking")

return errors.CreateCriticalNetworkError(err)

}

/*...*/

}

In PlugPhase1, interfaces of type sriov, slirp or macvtap in spec.domain.devices.interfaces get almost no handling. For bridge and masquerade, all the work goes through the infraConfigurator interface, which is defined as follows:

// pkg/network/infraconfigurators/common.go

type PodNetworkInfraConfigurator interface {

DiscoverPodNetworkInterface(podIfaceName string) error

PreparePodNetworkInterface() error

GenerateNonRecoverableDomainIfaceSpec() *api.Interface

GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig

}
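The order in which PlugPhase1 calls these four methods matters: the two Generate methods read from the pod interface, and PreparePodNetworkInterface mutates it, so it has to run last. A minimal sketch of that ordering follows; the `recorder` type and the shortened method names are hypothetical and only record calls, they do not touch any network device.

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for a PodNetworkInfraConfigurator.
// It only records the order in which its methods are invoked.
type recorder struct{ calls []string }

func (r *recorder) Discover() error {
	r.calls = append(r.calls, "Discover")
	return nil
}

func (r *recorder) GenerateDHCPConfig() { r.calls = append(r.calls, "GenerateDHCPConfig") }

func (r *recorder) GenerateDomainIfaceSpec() { r.calls = append(r.calls, "GenerateDomainIfaceSpec") }

func (r *recorder) Prepare() error {
	r.calls = append(r.calls, "Prepare")
	return nil
}

// plugPhase1 mirrors the call order of PlugPhase1: discover, generate
// the cached data, and only then mutate the pod interface.
func plugPhase1(r *recorder) error {
	if err := r.Discover(); err != nil {
		return err
	}
	r.GenerateDHCPConfig()
	r.GenerateDomainIfaceSpec()
	return r.Prepare()
}

func main() {
	r := &recorder{}
	if err := plugPhase1(r); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.calls)
}
```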

Both bridge and masquerade implement PodNetworkInfraConfigurator; the implementations are covered below.

bridge

DiscoverPodNetworkInterface

// pkg/network/infraconfigurators/bridge.go

func (b *BridgePodNetworkConfigurator) DiscoverPodNetworkInterface(podIfaceName string) error {

// Look up the pod NIC link device in the VMI pod by interface name

link, err := b.handler.LinkByName(podIfaceName)

if err != nil {

log.Log.Reason(err).Errorf("failed to get a link for interface: %s", podIfaceName)

return err

}

b.podNicLink = link

// Read the IP addresses from the link device

addrList, err := b.handler.AddrList(b.podNicLink, netlink.FAMILY_V4)

if err != nil {

log.Log.Reason(err).Errorf("failed to get an ip address for %s", podIfaceName)

return err

}

if len(addrList) == 0 {

// No IP on the pod NIC: mark IPAM as disabled

b.ipamEnabled = false

} else {

// An IP was found: mark IPAM as enabled

b.podIfaceIP = addrList[0]

b.ipamEnabled = true

// Record the routes of the pod NIC

if err := b.learnInterfaceRoutes(); err != nil {

return err

}

}

// Derive the tap device name from the pod NIC name (eth0 gives tap0)

b.tapDeviceName = virtnetlink.GenerateTapDeviceName(podIfaceName)

// Try to read the user-specified MAC address from vmi.spec.domain.devices.interfaces;

// if none is set (as in the VMI YAML above), fall back to the pod NIC's MAC address

b.vmMac, err = virtnetlink.RetrieveMacAddressFromVMISpecIface(b.vmiSpecIface)

if err != nil {

return err

}

if b.vmMac == nil {

b.vmMac = &b.podNicLink.Attrs().HardwareAddr

}

return nil

}

GenerateNonRecoverableDHCPConfig

// pkg/network/infraconfigurators/bridge.go

func (b *BridgePodNetworkConfigurator) GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig {

// If the pod NIC has no IP, return early with IPAM disabled

if !b.ipamEnabled {

return &cache.DHCPConfig{IPAMDisabled: true}

}

dhcpConfig := &cache.DHCPConfig{

MAC: *b.vmMac,

IPAMDisabled: !b.ipamEnabled,

IP: b.podIfaceIP,

}

// If the pod NIC has an IP and routes, add the routes to the DHCP config

if b.ipamEnabled && len(b.podIfaceRoutes) > 0 {

log.Log.V(4).Infof("got to add %d routes to the DhcpConfig", len(b.podIfaceRoutes))

b.decorateDhcpConfigRoutes(dhcpConfig)

}

return dhcpConfig

}

// Use the gateway and the qualifying pod NIC routes as the DHCP configuration

func (b *BridgePodNetworkConfigurator) decorateDhcpConfigRoutes(dhcpConfig *cache.DHCPConfig) {

log.Log.V(4).Infof("the default route is: %s", b.podIfaceRoutes[0].String())

dhcpConfig.Gateway = b.podIfaceRoutes[0].Gw

if len(b.podIfaceRoutes) > 1 {

dhcpRoutes := virtnetlink.FilterPodNetworkRoutes(b.podIfaceRoutes, dhcpConfig)

dhcpConfig.Routes = &dhcpRoutes

}

}

GenerateNonRecoverableDomainIfaceSpec

// pkg/network/infraconfigurators/bridge.go

// Build an interface object from the MAC address

func (b *BridgePodNetworkConfigurator) GenerateNonRecoverableDomainIfaceSpec() *api.Interface {

return &api.Interface{

MAC: &api.MAC{MAC: b.vmMac.String()},

}

}

PreparePodNetworkInterface

The three functions above only prepare data; PreparePodNetworkInterface is where the core logic lives.

// pkg/network/infraconfigurators/bridge.go

func (b *BridgePodNetworkConfigurator) PreparePodNetworkInterface() error {

// Bring the pod NIC down first

if err := b.handler.LinkSetDown(b.podNicLink); err != nil {

log.Log.Reason(err).Errorf("failed to bring link down for interface: %s", b.podNicLink.Attrs().Name)

return err

}

// If IPAM is enabled (the pod NIC has an IP)

if b.ipamEnabled {

// Remove the IP from the pod NIC

err := b.handler.AddrDel(b.podNicLink, &b.podIfaceIP)

if err != nil {

log.Log.Reason(err).Errorf("failed to delete address for interface: %s", b.podNicLink.Attrs().Name)

return err

}

// Rename the pod NIC, create a dummy NIC with the original name,

// and assign the original IP to the dummy NIC

if err := b.switchPodInterfaceWithDummy(); err != nil {

log.Log.Reason(err).Error("failed to switch pod interface with a dummy")

return err

}

// Set arp_ignore=1 to avoid

// the dummy interface being seen by Duplicate Address Detection (DAD).

// Without this, some VMs will lose their ip address after a few

// minutes.

if err := b.handler.ConfigureIpv4ArpIgnore(); err != nil {

log.Log.Reason(err).Errorf("failed to set arp_ignore=1")

return err

}

}

// Give the pod NIC a random MAC address

if _, err := b.handler.SetRandomMac(b.podNicLink.Attrs().Name); err != nil {

return err

}

// Create a bridge device

if err := b.createBridge(); err != nil {

return err

}

tapOwner := netdriver.LibvirtUserAndGroupId

if util.IsNonRootVMI(b.vmi) {

tapOwner = strconv.Itoa(util.NonRootUID)

}

// Create a tap device with virt-chroot and attach it to the bridge

err := createAndBindTapToBridge(b.handler, b.tapDeviceName, b.bridgeInterfaceName, b.launcherPID, b.podNicLink.Attrs().MTU, tapOwner, b.vmi)

if err != nil {

log.Log.Reason(err).Errorf("failed to create tap device named %s", b.tapDeviceName)

return err

}

// Bring the pod NIC back up

if err := b.handler.LinkSetUp(b.podNicLink); err != nil {

log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.podNicLink.Attrs().Name)

return err

}

// Disable MAC learning on the pod NIC

if err := b.handler.LinkSetLearningOff(b.podNicLink); err != nil {

log.Log.Reason(err).Errorf("failed to disable mac learning for interface: %s", b.podNicLink.Attrs().Name)

return err

}

return nil

}

// pkg/network/infraconfigurators/bridge.go

func (b *BridgePodNetworkConfigurator) switchPodInterfaceWithDummy() error {

originalPodInterfaceName := b.podNicLink.Attrs().Name

newPodInterfaceName := virtnetlink.GenerateNewBridgedVmiInterfaceName(originalPodInterfaceName)

dummy := &netlink.Dummy{LinkAttrs: netlink.LinkAttrs{Name: originalPodInterfaceName}}

// Rename the pod NIC first (e.g. eth0 becomes eth0-nic)

err := b.handler.LinkSetName(b.podNicLink, newPodInterfaceName)

if err != nil {

log.Log.Reason(err).Errorf("failed to rename interface : %s", b.podNicLink.Attrs().Name)

return err

}

// Refresh the in-memory podNicLink

b.podNicLink, err = b.handler.LinkByName(newPodInterfaceName)

if err != nil {

log.Log.Reason(err).Errorf("failed to get a link for interface: %s", newPodInterfaceName)

return err

}

// Create a dummy NIC with the original name (e.g. eth0)

err = b.handler.LinkAdd(dummy)

if err != nil {

log.Log.Reason(err).Errorf("failed to create dummy interface : %s", originalPodInterfaceName)

return err

}

// Assign the original pod NIC IP to the dummy NIC

err = b.handler.AddrReplace(dummy, &b.podIfaceIP)

if err != nil {

log.Log.Reason(err).Errorf("failed to replace original IP address to dummy interface: %s", originalPodInterfaceName)

return err

}

return nil

}

// pkg/network/infraconfigurators/bridge.go

func (b *BridgePodNetworkConfigurator) createBridge() error {

// Create a bridge device

bridge := &netlink.Bridge{

LinkAttrs: netlink.LinkAttrs{

Name: b.bridgeInterfaceName,

},

}

err := b.handler.LinkAdd(bridge)

if err != nil {

log.Log.Reason(err).Errorf("failed to create a bridge")

return err

}

// Attach the pod NIC to the bridge

err = b.handler.LinkSetMaster(b.podNicLink, bridge)

if err != nil {

log.Log.Reason(err).Errorf("failed to connect interface %s to bridge %s", b.podNicLink.Attrs().Name, bridge.Name)

return err

}

// Bring the bridge up

err = b.handler.LinkSetUp(bridge)

if err != nil {

log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.bridgeInterfaceName)

return err

}

// Build a fake IP: 169.254.75.1%d/32,

// where %d is the index of the interface in spec.domain.devices.interfaces;

// with a single NIC this is 169.254.75.10/32

addr := virtnetlink.GetFakeBridgeIP(b.vmi.Spec.Domain.Devices.Interfaces, b.vmiSpecIface)

fakeaddr, _ := b.handler.ParseAddr(addr)

// Assign the fake IP to the bridge

if err := b.handler.AddrAdd(bridge, fakeaddr); err != nil {

log.Log.Reason(err).Errorf("failed to set bridge IP")

return err

}

// Disable TX checksum offload on the bridge

if err = b.handler.DisableTXOffloadChecksum(b.bridgeInterfaceName); err != nil {

log.Log.Reason(err).Error("failed to disable TX offload checksum on bridge interface")

return err

}

return nil

}
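The naming conventions mentioned in the comments of the bridge path (eth0 gives tap0, eth0 is renamed to eth0-nic, and the fake bridge IP follows 169.254.75.1%d/32) can be sketched as plain string helpers. The function names below are hypothetical, and the tap helper assumes the pod interface name has a three-character prefix such as eth or net; treat this as an illustration of the examples in the comments, not a copy of the virtnetlink helpers.

```go
package main

import "fmt"

// tapName derives the tap device name from the pod interface name.
// Assumption: the first three characters are a prefix like "eth".
func tapName(podIface string) string {
	if len(podIface) < 3 {
		return "tap" + podIface
	}
	return "tap" + podIface[3:]
}

// renamedPodNIC is the name the original pod NIC gets after the rename.
func renamedPodNIC(podIface string) string {
	return podIface + "-nic"
}

// fakeBridgeIP builds the fake bridge address from the interface index.
func fakeBridgeIP(ifaceIndex int) string {
	return fmt.Sprintf("169.254.75.1%d/32", ifaceIndex)
}

func main() {
	fmt.Println(tapName("eth0"), renamedPodNIC("eth0"), fakeBridgeIP(0))
}
```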

masquerade

DiscoverPodNetworkInterface

// pkg/network/infraconfigurators/masquerade.go

func (b *MasqueradePodNetworkConfigurator) DiscoverPodNetworkInterface(podIfaceName string) error {

// Look up the pod NIC

link, err := b.handler.LinkByName(podIfaceName)

if err != nil {

log.Log.Reason(err).Errorf("failed to get a link for interface: %s", podIfaceName)

return err

}

b.podNicLink = link

// Compute the VM's IPv4 address and gateway address.

// The default IPv4 network is 10.0.2.0/24;

// if vmi.spec.networks.pod.vmNetworkCIDR is set, it takes precedence

if err := b.computeIPv4GatewayAndVmIp(); err != nil {

return err

}

// Check whether IPv6 is enabled on the pod NIC

ipv6Enabled, err := b.handler.IsIpv6Enabled(podIfaceName)

if err != nil {

log.Log.Reason(err).Errorf(ipVerifyFailFmt, podIfaceName)

return err

}

if ipv6Enabled {

// Compute the VM's IPv6 address and gateway address.

// The default IPv6 network is fd10:0:2::/120;

// if vmi.spec.networks.pod.vmIPv6NetworkCIDR is set, it takes precedence

if err := b.discoverIPv6GatewayAndVmIp(); err != nil {

return err

}

}

return nil

}
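How a gateway and a VM address can be derived from the masquerade CIDR is sketched below with the standard net package. The sketch assumes the gateway is the first address of the network and the VM gets the address right after it; the helper names are hypothetical and this is not the body of computeIPv4GatewayAndVmIp.

```go
package main

import (
	"fmt"
	"net"
)

// nextIP returns a copy of ip incremented by one.
func nextIP(ip net.IP) net.IP {
	out := make(net.IP, len(ip))
	copy(out, ip)
	for i := len(out) - 1; i >= 0; i-- {
		out[i]++
		if out[i] != 0 {
			break
		}
	}
	return out
}

// gatewayAndVMIP derives the gateway (first address) and the VM address
// (second address) from a CIDR. The "first/second" choice is an assumption.
func gatewayAndVMIP(cidr string) (net.IP, net.IP, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, nil, err
	}
	gateway := nextIP(network.IP)
	vm := nextIP(gateway)
	if !network.Contains(vm) {
		return nil, nil, fmt.Errorf("network %s is too small", cidr)
	}
	return gateway, vm, nil
}

func main() {
	for _, cidr := range []string{"10.0.2.0/24", "fd10:0:2::/120"} {
		gw, vm, err := gatewayAndVMIP(cidr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(cidr, "gateway:", gw, "vm:", vm)
	}
}
```

With the default IPv4 network this yields 10.0.2.1 as the gateway, the address that later ends up on the in-pod bridge.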

GenerateNonRecoverableDHCPConfig

// pkg/network/infraconfigurators/masquerade.go

// masquerade caches no DHCP config in phase 1 (it is computed in phase 2)

func (b *MasqueradePodNetworkConfigurator) GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig {

return nil

}

GenerateNonRecoverableDomainIfaceSpec

// pkg/network/infraconfigurators/masquerade.go

// Nothing to do for masquerade

func (b *MasqueradePodNetworkConfigurator) GenerateNonRecoverableDomainIfaceSpec() *api.Interface {

return nil

}

PreparePodNetworkInterface

// pkg/network/infraconfigurators/masquerade.go

func (b *MasqueradePodNetworkConfigurator) PreparePodNetworkInterface() error {

// Create a bridge device

if err := b.createBridge(); err != nil {

return err

}

tapOwner := netdriver.LibvirtUserAndGroupId

if util.IsNonRootVMI(b.vmi) {

tapOwner = strconv.Itoa(util.NonRootUID)

}

// Create a tap device with virt-chroot and attach it to the bridge

tapDeviceName := virtnetlink.GenerateTapDeviceName(b.podNicLink.Attrs().Name)

err := createAndBindTapToBridge(b.handler, tapDeviceName, b.bridgeInterfaceName, b.launcherPID, b.podNicLink.Attrs().MTU, tapOwner, b.vmi)

if err != nil {

log.Log.Reason(err).Errorf("failed to create tap device named %s", tapDeviceName)

return err

}

// Create the IPv4 NAT rules with nftables or iptables

err = b.createNatRules(iptables.ProtocolIPv4)

if err != nil {

log.Log.Reason(err).Errorf("failed to create ipv4 nat rules for vm error: %v", err)

return err

}

ipv6Enabled, err := b.handler.IsIpv6Enabled(b.podNicLink.Attrs().Name)

if err != nil {

log.Log.Reason(err).Errorf(ipVerifyFailFmt, b.podNicLink.Attrs().Name)

return err

}

if ipv6Enabled {

// Create the IPv6 NAT rules with nftables or iptables

err = b.createNatRules(iptables.ProtocolIPv6)

if err != nil {

log.Log.Reason(err).Errorf("failed to create ipv6 nat rules for vm error: %v", err)

return err

}

}

return nil

}

// pkg/network/infraconfigurators/masquerade.go

func (b *MasqueradePodNetworkConfigurator) createBridge() error {

// The bridge gets a fixed MAC address: 02:00:00:00:00:00

mac, err := net.ParseMAC(link.StaticMasqueradeBridgeMAC)

if err != nil {

return err

}

// Create the bridge

bridge := &netlink.Bridge{

LinkAttrs: netlink.LinkAttrs{

Name: b.bridgeInterfaceName,

MTU: b.podNicLink.Attrs().MTU,

HardwareAddr: mac,

},

}

err = b.handler.LinkAdd(bridge)

if err != nil {

log.Log.Reason(err).Errorf("failed to create a bridge")

return err

}

// Bring the bridge up

if err := b.handler.LinkSetUp(bridge); err != nil {

log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.bridgeInterfaceName)

return err

}

// Assign the previously computed VM gateway address to the bridge

if err := b.handler.AddrAdd(bridge, b.vmGatewayAddr); err != nil {

log.Log.Reason(err).Errorf("failed to set bridge IP")

return err

}

ipv6Enabled, err := b.handler.IsIpv6Enabled(b.podNicLink.Attrs().Name)

if err != nil {

log.Log.Reason(err).Errorf(ipVerifyFailFmt, b.podNicLink.Attrs().Name)

return err

}

if ipv6Enabled {

// If IPv6 is enabled, also assign the IPv6 gateway address to the bridge

if err := b.handler.AddrAdd(bridge, b.vmGatewayIpv6Addr); err != nil {

log.Log.Reason(err).Errorf("failed to set bridge IPv6")

return err

}

}

// Disable TX checksum offload on the bridge

if err = b.handler.DisableTXOffloadChecksum(b.bridgeInterfaceName); err != nil {

log.Log.Reason(err).Error("failed to disable TX offload checksum on bridge interface")

return err

}

return nil

}

// pkg/network/infraconfigurators/masquerade.go

func (b *MasqueradePodNetworkConfigurator) createNatRules(protocol iptables.Protocol) error {

// Enable IPv4/IPv6 forwarding inside the pod

err := b.handler.ConfigureIpForwarding(protocol)

if err != nil {

log.Log.Reason(err).Errorf("failed to configure ip forwarding")

return err

}

// Set up the NAT rules with nftables, falling back to iptables

if b.handler.NftablesLoad(protocol) == nil {

return b.createNatRulesUsingNftables(protocol)

} else if b.handler.HasNatIptables(protocol) {

return b.createNatRulesUsingIptables(protocol)

}

return fmt.Errorf("Couldn't configure ip nat rules")

}

virt-launcher

virt-launcher's entry point for network initialization is SyncVirtualMachine (virt-launcher exposes it as a gRPC endpoint; the caller is the virt-handler process):

// pkg/virt-launcher/virtwrap/cmd-server/server.go

func (l *Launcher) SyncVirtualMachine(_ context.Context, request *cmdv1.VMIRequest) (*cmdv1.Response, error) {

/*...*/

if _, err := l.domainManager.SyncVMI(vmi, l.allowEmulation, request.Options); err != nil {

log.Log.Object(vmi).Reason(err).Errorf("Failed to sync vmi")

response.Success = false

response.Message = getErrorMessage(err)

return response, nil

}

/*...*/

}

// pkg/virt-launcher/virtwrap/manager.go

func (l *LibvirtDomainManager) SyncVMI(vmi *v1.VirtualMachineInstance, allowEmulation bool, options *cmdv1.VirtualMachineOptions) (*api.DomainSpec, error) {

/*...*/

dom, err := l.virConn.LookupDomainByName(domain.Spec.Name)

if err != nil {

// We need the domain but it does not exist, so create it

if domainerrors.IsNotFound(err) {

domain, err = l.preStartHook(vmi, domain, false)

/*...*/

}

/*...*/

}

/*...*/

}

// pkg/virt-launcher/virtwrap/manager.go

func (l *LibvirtDomainManager) preStartHook(vmi *v1.VirtualMachineInstance, domain *api.Domain, generateEmptyIsos bool) (*api.Domain, error) {

/*...*/

err = netsetup.NewVMNetworkConfigurator(vmi, l.networkCacheStoreFactory).SetupPodNetworkPhase2(domain)

if err != nil {

return domain, fmt.Errorf("preparing the pod network failed: %v", err)

}

/*...*/

}

// pkg/network/setup/network.go

func (n *VMNetworkConfigurator) SetupPodNetworkPhase2(domain *api.Domain) error {

nics, err := n.getPhase2NICs(domain)

if err != nil {

return err

}

for _, nic := range nics {

if err := nic.PlugPhase2(domain); err != nil {

return fmt.Errorf("failed plugging phase2 at nic '%s': %w", nic.podInterfaceName, err)

}

}

return nil

}

Phase 2, handled by virt-launcher, also has two steps: getPhase2NICs collects the pod NIC information, then nic.PlugPhase2 is executed for each NIC.

getPhase2NICs

// pkg/network/setup/network.go

func (v VMNetworkConfigurator) getPhase2NICs(domain *api.Domain) ([]podNIC, error) {

nics := []podNIC{}

if len(v.vmi.Spec.Domain.Devices.Interfaces) == 0 {

return nics, nil

}

for i, _ := range v.vmi.Spec.Networks {

nic, err := newPhase2PodNIC(v.vmi, &v.vmi.Spec.Networks[i], v.handler, v.cacheFactory, domain)

if err != nil {

return nil, err

}

nics = append(nics, *nic)

}

return nics, nil

}

// pkg/network/setup/podnic.go

func newPhase2PodNIC(vmi *v1.VirtualMachineInstance, network *v1.Network, handler netdriver.NetworkHandler, cacheFactory cache.InterfaceCacheFactory, domain *api.Domain) (*podNIC, error) {

podnic, err := newPodNIC(vmi, network, handler, cacheFactory, nil)

if err != nil {

return nil, err

}

podnic.dhcpConfigurator = podnic.newDHCPConfigurator()

podnic.domainGenerator = podnic.newLibvirtSpecGenerator(domain)

return podnic, nil

}

PlugPhase2

// pkg/network/setup/podnic.go

func (l *podNIC) PlugPhase2(domain *api.Domain) error {

precond.MustNotBeNil(domain)

// SR-IOV: return immediately

if l.vmiSpecIface.SRIOV != nil {

return nil

}

if err := l.domainGenerator.Generate(); err != nil {

log.Log.Reason(err).Critical("failed to create libvirt configuration")

}

// Only bridge and masquerade enter this branch

if l.dhcpConfigurator != nil {

dhcpConfig, err := l.dhcpConfigurator.Generate()

if err != nil {

log.Log.Reason(err).Errorf("failed to get a dhcp configuration for: %s", l.podInterfaceName)

return err

}

log.Log.V(4).Infof("The imported dhcpConfig: %s", dhcpConfig.String())

if err := l.dhcpConfigurator.EnsureDHCPServerStarted(l.podInterfaceName, *dhcpConfig, l.vmiSpecIface.DHCPOptions); err != nil {

log.Log.Reason(err).Criticalf("failed to ensure dhcp service running for: %s", l.podInterfaceName)

panic(err)

}

}

return nil

}

// pkg/network/dhcp/configurator.go

func (d *configurator) EnsureDHCPServerStarted(podInterfaceName string, dhcpConfig cache.DHCPConfig, dhcpOptions *v1.DHCPOptions) error {

if dhcpConfig.IPAMDisabled {

return nil

}

dhcpStartedFile := d.getDHCPStartedFilePath(podInterfaceName)

_, err := os.Stat(dhcpStartedFile)

if os.IsNotExist(err) {

// Start the DHCP server

if err := d.handler.StartDHCP(&dhcpConfig, d.advertisingIfaceName, dhcpOptions); err != nil {

return fmt.Errorf("failed to start DHCP server for interface %s", podInterfaceName)

}

newFile, err := os.Create(dhcpStartedFile)

if err != nil {

return fmt.Errorf("failed to create dhcp started file %s: %s", dhcpStartedFile, err)

}

newFile.Close()

}

return nil

}
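EnsureDHCPServerStarted uses a marker file to make the start idempotent: the server is started only when the file is missing, and the file is created afterwards. The pattern is sketched below in isolation. `ensureStartedOnce` and `countStarts` are hypothetical names, the marker path is made up, and unlike the original the sketch also returns Stat errors other than "not exist".

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureStartedOnce runs start only if the marker file does not exist
// yet, and creates the marker after a successful start.
func ensureStartedOnce(marker string, start func() error) error {
	if _, err := os.Stat(marker); err == nil {
		return nil // marker present: already started
	} else if !os.IsNotExist(err) {
		return err
	}
	if err := start(); err != nil {
		return fmt.Errorf("failed to start: %w", err)
	}
	f, err := os.Create(marker)
	if err != nil {
		return fmt.Errorf("failed to create marker %s: %w", marker, err)
	}
	return f.Close()
}

// countStarts calls ensureStartedOnce twice against a marker in a fresh
// temporary directory and reports how often start actually ran.
func countStarts() (int, error) {
	dir, err := os.MkdirTemp("", "dhcp-marker")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	marker := filepath.Join(dir, "dhcp_started-eth0") // hypothetical file name
	starts := 0
	start := func() error { starts++; return nil }
	for i := 0; i < 2; i++ {
		if err := ensureStartedOnce(marker, start); err != nil {
			return starts, err
		}
	}
	return starts, nil
}

func main() {
	n, err := countStarts()
	fmt.Println("start ran", n, "time(s), err:", err)
}
```

The second call finds the marker and returns without starting anything, which is why repeated sync calls do not spawn additional DHCP servers.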

// pkg/network/driver/common.go

func (h *NetworkUtilsHandler) StartDHCP(nic *cache.DHCPConfig, bridgeInterfaceName string, dhcpOptions *v1.DHCPOptions) error {

/*...*/

// Start an IPv4 DHCP server in a goroutine

go func() {

if err = DHCPServer(

nic.MAC,

nic.IP.IP,

nic.IP.Mask,

bridgeInterfaceName,

nic.AdvertisingIPAddr,

nic.Gateway,

nameservers,

nic.Routes,

searchDomains,

nic.Mtu,

dhcpOptions,

); err != nil {

log.Log.Errorf("failed to run DHCP: %v", err)

panic(err)

}

}()

if nic.IPv6.IPNet != nil {

// Start an IPv6 DHCP server

go func() {

if err = DHCPv6Server(

nic.IPv6.IP,

bridgeInterfaceName,

); err != nil {

log.Log.Reason(err).Error("failed to run DHCPv6")

panic(err)

}

}()

}

return nil

}

Generate is implemented differently for bridge and masquerade:

bridge

// pkg/network/dhcp/bridge.go

func (d *BridgeConfigGenerator) Generate() (*cache.DHCPConfig, error) {

dhcpConfig, err := d.cacheFactory.CacheDHCPConfigForPid(d.launcherPID).Read(d.podInterfaceName)

if err != nil {

return nil, err

}

if dhcpConfig.IPAMDisabled {

return dhcpConfig, nil

}

dhcpConfig.Name = d.podInterfaceName

// As noted in the bridge logic above, the bridge gets a fake IP; fetch it here

fakeBridgeIP := virtnetlink.GetFakeBridgeIP(d.vmiSpecIfaces, d.vmiSpecIface)

fakeServerAddr, _ := netlink.ParseAddr(fakeBridgeIP)

dhcpConfig.AdvertisingIPAddr = fakeServerAddr.IP

newPodNicName := virtnetlink.GenerateNewBridgedVmiInterfaceName(d.podInterfaceName)

podNicLink, err := d.handler.LinkByName(newPodNicName)

if err != nil {

return nil, err

}

// The DHCP MTU matches the pod NIC's MTU

dhcpConfig.Mtu = uint16(podNicLink.Attrs().MTU)

dhcpConfig.Subdomain = d.subdomain

return dhcpConfig, nil

}

masquerade

// pkg/network/dhcp/masquerade.go

func (d *MasqueradeConfigGenerator) Generate() (*cache.DHCPConfig, error) {

dhcpConfig := &cache.DHCPConfig{}

// Look up the pod NIC

podNicLink, err := d.handler.LinkByName(d.podInterfaceName)

if err != nil {

return nil, err

}

dhcpConfig.Name = podNicLink.Attrs().Name

dhcpConfig.Subdomain = d.subdomain

dhcpConfig.Mtu = uint16(podNicLink.Attrs().MTU)

// Get the masquerade IPv4 gateway and VM IP

ipv4Gateway, ipv4, err := virtnetlink.GenerateMasqueradeGatewayAndVmIPAddrs(d.vmiSpecNetwork, iptables.ProtocolIPv4)

if err != nil {

return nil, err

}

dhcpConfig.IP = *ipv4

dhcpConfig.AdvertisingIPAddr = ipv4Gateway.IP.To4()

dhcpConfig.Gateway = ipv4Gateway.IP.To4()

ipv6Enabled, err := d.handler.IsIpv6Enabled(d.podInterfaceName)

if err != nil {

log.Log.Reason(err).Errorf("failed to verify whether ipv6 is configured on %s", d.podInterfaceName)

return nil, err

}

if ipv6Enabled {

// Get the masquerade IPv6 gateway and VM IP

ipv6Gateway, ipv6, err := virtnetlink.GenerateMasqueradeGatewayAndVmIPAddrs(d.vmiSpecNetwork, iptables.ProtocolIPv6)

if err != nil {

return nil, err

}

dhcpConfig.IPv6 = *ipv6

dhcpConfig.AdvertisingIPv6Addr = ipv6Gateway.IP.To16()

}

return dhcpConfig, nil

}

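Beyond the above, the steps that push the network configuration into the VM through libvirtd are not covered here and are left for the reader to explore.

libvirt

Once virt-handler and virt-launcher have prepared the bridge, tap device, DHCP server and other resources, KubeVirt assembles this data into libvirt XML and calls libvirtd to create the VM, which completes the virtual machine.

Summary

The sections above walked through KubeVirt networking at the source level. This section revisits the same flow step by step to make it easier to follow, again with flannel host-gateway mode as the CNI.

pod + bridge

The VMI YAML is: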

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  annotations:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - bridge: {} # note: bridge here
        name: default
    # ...
  networks:
  - name: default
    pod: {}
  # ...

virt-handler

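1. Bring eth0 down and remove its IP (the removed IP is kept in memory).

2. Rename the original eth0 to eth0-nic, create a dummy NIC named eth0, assign the removed IP to the dummy NIC, set arp_ignore, and give eth0-nic a random MAC address.

3. Create a bridge, attach eth0-nic to it, bring the bridge up, assign it a fake IP, and disable TX checksum offload on the bridge.

4. Create the tap device and attach it to the bridge.

5. Bring eth0-nic up and disable MAC address learning on it.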
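For readers who think in iproute2 terms, the five steps can be written down as a rough command plan. The Go sketch below only builds and prints strings; it executes nothing. The commands are approximate manual equivalents, not what virt-handler runs (it uses netlink calls and virt-chroot), and the bridge name k6t-eth0, the pod IP and the `<random-mac>` placeholder are assumptions for illustration.

```go
package main

import "fmt"

// bridgePlan lists approximate iproute2/sysctl/ethtool equivalents of
// the bridge-mode phase 1 steps, in the order the source performs them.
func bridgePlan(podIface, podIP, bridge, tap, fakeIP string) []string {
	nic := podIface + "-nic"
	return []string{
		// step 1: bring the pod NIC down and remove its IP
		"ip link set " + podIface + " down",
		"ip addr del " + podIP + " dev " + podIface,
		// step 2: rename, create the dummy, move the IP, arp_ignore, random MAC
		"ip link set " + podIface + " name " + nic,
		"ip link add " + podIface + " type dummy",
		"ip addr replace " + podIP + " dev " + podIface,
		"sysctl -w net.ipv4.conf.all.arp_ignore=1",
		"ip link set dev " + nic + " address <random-mac>",
		// step 3: bridge with fake IP, TX checksum offload disabled
		"ip link add " + bridge + " type bridge",
		"ip link set " + nic + " master " + bridge,
		"ip link set " + bridge + " up",
		"ip addr add " + fakeIP + " dev " + bridge,
		"ethtool -K " + bridge + " tx off",
		// step 4: tap device attached to the bridge
		"ip tuntap add dev " + tap + " mode tap",
		"ip link set " + tap + " master " + bridge,
		// step 5: bring the NIC up and disable MAC learning
		"ip link set " + nic + " up",
		"bridge link set dev " + nic + " learning off",
	}
}

func main() {
	// All argument values here are illustrative assumptions.
	plan := bridgePlan("eth0", "10.244.1.5/24", "k6t-eth0", "tap0", "169.254.75.10/32")
	for _, cmd := range plan {
		fmt.Println(cmd)
	}
}
```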

virt-launcher

virt-launcher does comparatively little: it only starts a DHCP server from which the VM later obtains its IP.

libvirt

libvirt creates the VM and attaches the tap device to it.

pod + masquerade

The VMI YAML is:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  annotations:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - masquerade: {} # note: masquerade here
        name: default
    # ...
  networks:
  - name: default
    pod: {}
  # ...

virt-handler

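1. Create a bridge with MAC address 02:00:00:00:00:00 and bring it up. By default the gateway address is derived from 10.0.2.0/24 (that is, 10.0.2.1) and assigned to the bridge as its IP. Finally, disable TX checksum offload.

2. Create a tap device and attach it to the bridge.

3. Enable ip_forward and set up NAT with nftables or iptables.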

virt-launcher

It starts a DHCP server from which the VM later obtains its IP.

libvirt

libvirt creates the VM and attaches the tap device to it.

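Other thoughts

The above should give a clearer picture of KubeVirt VM networking. A few personal thoughts to round things off:

Why is VM network setup split between virt-handler and virt-launcher, i.e. the phase 1 and phase 2 described in the official documentation? Could it not all be done in virt-launcher?

My take: this is most likely a security decision. Doing the privileged part of the setup in virt-handler lets virt-launcher run with as few privileges as possible. The official documentation says: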

The virt-launcher is an untrusted component of KubeVirt (since it wraps the libvirt process that will run third party workloads). As a result, it must be run with as little privileges as required. As of now, the only capability required by virt-launcher to configure networking is the CAP_NET_ADMIN.

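From the analysis above, a packet travelling from the host NIC to the VM passes through: host physical NIC -> CNI bridge -> host-side veth -> pod-namespace veth -> bridge created by virt-handler -> tap device in the pod network namespace -> NIC inside the VM. This path is clearly long and may be a challenge for workloads that need high network performance.

My take: as noted at the start, CNI is only involved for pods that do not use hostNetwork. If the VMI pod used hostNetwork, the CNI bridge, the host-side veth and the pod-namespace veth would drop out and the path would be much shorter. With hostNetwork, however, network security has to be considered first, the virt-handler and virt-launcher code would probably need changes, and VM IP allocation would also need a solution.

This article uses flannel as the pod network CNI throughout. Modifying the CNI to implement a network model like the following might be closer to ideal (not verified or analyzed in depth; for reference only):

Judging by the VMI field definitions, SR-IOV is supported (I have not verified this in practice), and in theory SR-IOV can improve VM network performance considerably. It has limitations, though: it must be enabled in the BIOS and the host NIC must support it. Even then, the number of VFs is usually small, from a dozen to a few dozen, while a Kubernetes node can run 110 pods by default, so with many small VMs the VFs will clearly run out. If the workload only needs large VMs, for example a single-digit number of VMs per host, SR-IOV should be a good choice.

This article is also published on the WeChat official account 卡巴斯; feel free to follow.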
