Using ElasticSearch/FluentD/Kibana containers for Windows container logging

The previous article used EFK (ElasticSearch/FluentD/Kibana) as standalone executables installed on Windows, which is suboptimal in production environments. Each layer of the EFK stack can instead run as a container in the Kubernetes cluster.

The walkthrough below guides you through building a solution consisting of the following:

  1. A standalone ElasticSearch service running on a Windows/Linux node (I was unable to make ES run reliably as a container with persistent storage)
  2. An ElasticSearch service of type ExternalName, which allows both FluentD and Kibana to find the ES instance via a DNS query
  3. A Kibana container which connects to ES and is exposed via a NodePort service
  4. A FluentD DaemonSet

Install ElasticSearch

I used Windows Server 2019 as the host for Elasticsearch, and installation is simple and straightforward. Once it is installed, create an elasticsearch service in Kubernetes of type ExternalName which points to the name of the Windows machine hosting the ES installation. I put all logging components into the kube-logging namespace. YAML for both is below. In my case the Windows VM is called utilityvm and is automatically registered in my internal DNS zone on Azure.

apiVersion: v1
kind: Namespace
metadata:
  name: kube-logging
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-service
  namespace: kube-logging
spec:
  type: ExternalName
  externalName: utilityvm.kubernetes.my
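
A minimal check, assuming the two resources above are saved together as es-external.yaml (the filename is arbitrary) and that ES is configured to listen on the network interface: apply the manifest and confirm ElasticSearch answers on the external name from the master node (the root endpoint returns cluster information if ES is up).

gregory@master1:~$ kubectl apply -f es-external.yaml
gregory@master1:~$ kubectl get svc -n kube-logging
gregory@master1:~$ curl http://utilityvm.kubernetes.my:9200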

Install Kibana container

My cluster consists of only 2 nodes: 1 Linux master and 1 Windows worker. There is no Windows image for Kibana, so it has to run on the master node. YAML is below, taking into account that it must run on Linux only and can tolerate the master role. The YAML also contains a service of type NodePort to expose Kibana for consumption on port 30080.

apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: kube-logging
  labels:
    app: kibana
spec:
  type: NodePort
  ports:
  - port: 5601
    nodePort: 30080
  selector:
    app: kibana
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: kube-logging
  labels:
    app: kibana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      tolerations:
         - key: node-role.kubernetes.io/master
           operator: "Equal"
           effect: "NoSchedule"
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:7.6.2
        env:
          - name: ELASTICSEARCH_HOSTS
            value: http://elasticsearch-service:9200
        ports:
        - containerPort: 5601
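
A quick way to deploy and verify, assuming the manifest above is saved as kibana.yaml (the filename is arbitrary):

gregory@master1:~$ kubectl apply -f kibana.yaml
gregory@master1:~$ kubectl -n kube-logging get pods -o wide
gregory@master1:~$ kubectl -n kube-logging logs deployment/kibana --tail=20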

At this point you should be able to connect to http://master1:30080 (your master node) and see the Kibana interface.

Configure fluentd daemonset

As of the time of writing this article, fluentd only provides a Windows image for the SAC channel 1903. My nodes run Windows Server 2019 since Kubernetes 1.18+ inexplicably does not support SAC channels as of right now, so I cannot use that image and have to rebuild the fluentd image locally. Download the fluentd repo, navigate to the folder v1.10/windows, replace the Dockerfile with the file below, and build it with docker build . -t fluent/fluentd:local

FROM mcr.microsoft.com/windows/servercore:ltsc2019
RUN powershell -Command "Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))"
RUN choco install -y ruby --version 2.6.5.1 --params "'/InstallDir:C:\ruby26'" \
&& choco install -y msys2 --params "'/NoPath /NoUpdate /InstallDir:C:\ruby26\msys64'"
RUN refreshenv \
&& ridk install 2 3
RUN echo gem: --no-document >> C:\ProgramData\gemrc  \
&& gem install cool.io -v 1.5.4 --platform ruby \
&& gem install oj -v 3.3.10 \
&& gem install json -v 2.2.0 \
&& gem install fluentd -v 1.10.1 \
&& gem install win32-service -v 1.0.1 \
&& gem install win32-ipc -v 0.7.0 \
&& gem install win32-event -v 0.6.3 \
&& gem install windows-pr -v 1.2.6 
RUN powershell -Command "Remove-Item -Force C:\ruby26\lib\ruby\gems\2.6.0\cache\*.gem; Remove-Item -Recurse -Force 'C:\ProgramData\chocolatey'"
COPY fluent.conf /fluent/conf/fluent.conf
ENV FLUENTD_CONF="fluent.conf"
EXPOSE 24224 5140
ENTRYPOINT ["cmd", "/k", "fluentd", "-c", "C:\\fluent\\conf\\fluent.conf"]

Fluentd relies on a configuration file to tell the runtime how to parse and format log entries. Thanks to Mike Kock's article, the following is a slight adaptation to my scenario. The configuration file below parses Windows pod logs, appends some Kubernetes-specific information, and forwards entries to ES.

apiVersion: v1
data:
  fluentd.conf: |
    <match fluent.**>
      @type null
    </match>
    #Target Logs (ex:nginx)
    <source>
      @type tail
      @id in_tail_container_logs
      path /var/log/containers/*.log
      pos_file /var/log/containers/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head false
      format json
      time_format %Y-%m-%dT%H:%M:%S.%N%Z
    </source>
    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
    </filter>
    <filter kubernetes.**>
      @type grep
      <exclude>
        key log
        pattern /Reply/
      </exclude>
    </filter>
    <match kubernetes.**>
        @type elasticsearch
        @id out_es
        @log_level info
        include_tag_key true
        host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
        port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
        scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
        ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'false'}"
        #user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
        #password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
        reload_connections "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS'] || 'true'}"
        logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'k8log'}"
        logstash_format true
        type_name fluentd
        request_timeout 20s
        reload_on_failure true
        reconnect_on_error true
        with_transporter_log true
        <buffer>
          flush_thread_count "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_THREAD_COUNT'] || '1'}"
          flush_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL'] || '10s'}"
          chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '2M'}"
          queue_limit_length "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '32'}"
          retry_max_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_RETRY_MAX_INTERVAL'] || '30'}"
          retry_forever true
        </buffer>
    </match>
kind: ConfigMap
metadata:
  name: fluentd-configmap
  namespace: kube-logging

The actual fluentd runtime is deployed as a DaemonSet within Kubernetes to pull logs from the local host and send them to ES based on the config file above. YAML for the deployment is below. Please note that the image has to be prebuilt on the node ahead of time for this to work. The YAML also contains definitions for a service account, which is used to connect to the API server and pull Kubernetes-specific information that is inserted into each log entry as additional (Kubernetes-specific) fields.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-logging
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluentd
  namespace: kube-logging
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: kube-logging
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
     app: fluentd
  template:
    metadata:
      labels:
       app: fluentd
    spec:
      serviceAccount: fluentd
      serviceAccountName: fluentd
      nodeSelector:
        beta.kubernetes.io/os: windows
      containers:
      - name: fluentd
        volumeMounts:
         - name: config-volume
           mountPath: "c:\\fluent\\conf\\K8\\"
         - name: varlog
           mountPath: /var/log
         - name: progdatacontainers
           mountPath: /ProgramData/docker/containers
        #FluentD only supplies an image for 1903 and above, so if running anything below that you have to create the Dockerfile yourself and build it
        image: fluent/fluentd:local
        command: ["cmd"]
        args: ["/c", "gem install fluent-plugin-elasticsearch fluent-plugin-kubernetes_metadata_filter &", "fluentd", "-c", "C:\\fluent\\conf\\K8\\fluentd.conf"]
        #args: ["/c", "gem install fluent-plugin-elasticsearch &", "fluentd", "-c", "C:\\fluent\\conf\\K8\\fluentd.conf"]
        env:
          - name:  FLUENT_ELASTICSEARCH_HOST
            value: "elasticsearch-service.kube-logging.svc.cluster.local"
          - name:  FLUENT_ELASTICSEARCH_PORT
            value: "9200"
          - name: FLUENT_ELASTICSEARCH_SCHEME
            value: "http"
          - name: FLUENTD_SYSTEMD_CONF
            value: disable
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
      volumes:
       - name: config-volume
         configMap:
          name: fluentd-configmap
       - name: varlog
         hostPath:
          path: /var/log
       - name: progdatacontainers
         hostPath:
          path: /ProgramData/docker/containers
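
Assuming the ConfigMap is saved as fluentd-configmap.yaml and the RBAC objects plus DaemonSet above as fluentd-daemonset.yaml (filenames are arbitrary), deploy and watch the rollout:

gregory@master1:~$ kubectl apply -f fluentd-configmap.yaml -f fluentd-daemonset.yaml
gregory@master1:~$ kubectl -n kube-logging rollout status daemonset/fluentd
gregory@master1:~$ kubectl -n kube-logging logs daemonset/fluentd --tail=20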

If everything is deployed properly you should see the following in your kube-logging namespace:

gregory@master1:~$ k get all -n kube-logging
NAME                          READY   STATUS    RESTARTS   AGE
pod/fluentd-zcxj9             1/1     Running   0          31m
pod/kibana-699b99d996-vkd27   1/1     Running   3          44h

NAME                            TYPE           CLUSTER-IP      EXTERNAL-IP               PORT(S)          AGE
service/elasticsearch-service   ExternalName   <none>          utilityvm.kubernetes.my   <none>           44h
service/kibana                  NodePort       10.103.68.166   <none>                    5601:30080/TCP   44h

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
daemonset.apps/fluentd   1         1         1       1            1           beta.kubernetes.io/os=windows   25h

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kibana   1/1     1            1           44h

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/kibana-699b99d996   1         1         1       44h

And the following in the default namespace:

gregory@master1:~$ k get all
NAME                                READY   STATUS    RESTARTS   AGE
pod/win-webserver-8d8dcb548-mrnvp   1/1     Running   2          116m
pod/win-webserver-8d8dcb548-vnxzq   1/1     Running   2          116m

NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   45h

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/win-webserver   2/2     2            2           44h

NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/win-webserver-8d8dcb548   2         2         2       44h

Configure Kibana to live stream logs

Go to the Kibana web UI (http://master1:30080 in my case), click the gear icon at the bottom, then Index Patterns/Create index pattern. Type k8log-* in Define index pattern.

Choose @timestamp as your Time field and click “Create”.

You will see all fields available in the logs coming from fluentd, and specifically the Kubernetes-specific ones (like kubernetes.container_name, kubernetes.labels.app, etc.).

To stream logs, click Logs on the left side and go to the Settings tab. Here you define which indices you want to appear in the streaming log and which fields you want shown on screen. My example is below.

Save the settings and switch to the Stream tab, which will output live logs along with the fields you requested.

Setting up logging for Windows pods in Kubernetes

There is currently no well-supported centralized logging solution for Windows pods in Kubernetes outside of the built-in offerings from the big cloud providers. The instructions below show how to set up logging for Windows pods in a stand-alone Kubernetes cluster.

The following software will be used:

  1. FluentD Windows service
  2. ElasticSearch service
  3. Kibana UI front-end to display logs

The instructions build upon the locally deployed Kubernetes cluster outlined in an earlier post. Cluster config is below:

gregory@master1:~$ k get nodes                                 
NAME         STATUS   ROLES    AGE   VERSION
master1      Ready    master   18d   v1.17.4
winworker1   Ready    <none>   18d   v1.17.4

The overall architecture of the logging solution consists of the following moving parts:

  1. Docker service configured with the fluentd logging driver
  2. FluentD service which parses logs and sends them to ElasticSearch
  3. Kibana UI to query logs

Install ElasticSearch

Install ElasticSearch on any Windows node to aggregate logs from the fluentd service. Accept the defaults for installation.
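
A quick sanity check after installation (assuming the default HTTP port 9200): querying the root endpoint from the ES host should return cluster name and version information.

PS C:\> Invoke-RestMethod http://localhost:9200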

Install FluentD service

The Docker daemon (service) running on the Windows worker nodes needs to be configured with the fluentd logging driver to send data to the fluentd service.

To install fluentd, download the binaries and install them per the instructions at https://www.fluentd.org/download

Once installed, modify the configuration file under C:\opt\td-agent\etc\td-agent\td-agent.conf to contain the following entry. Replace the host value (localhost) with the hostname of the machine where you installed ElasticSearch in the previous step.

<source>
  @type forward
</source>
<match *>
  @type elasticsearch
  <inject>
    time_key          @timestamp
    time_format       %Y%m%dT%H%M%S%z
  </inject>
  host localhost
  port 9200
  index_name fluentd
  logstash_format true
  flush_interval 5s
</match>

Install fluentd as a service by starting the Td-agent command prompt and executing:

> fluentd --reg-winsvc i
> fluentd --reg-winsvc-fluentdopt '-c C:/opt/td-agent/etc/td-agent/td-agent.conf -o C:/opt/td-agent/td-agent.log'

Restart the fluentd service:

> restart-service fluentdwinsvc

Configure docker daemon

Configure dockerd to use the fluentd logging driver on the Windows nodes. Edit the file C:\ProgramData\docker\config\daemon.json to have the following content. Replace utilityvm with the hostname where you installed fluentd in the previous step, and restart the Docker service after the change (see below).

{
   "log-driver": "fluentd",
   "log-opts": {
     "fluentd-address": "utilityvm:24224"
   }
 }
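
For example, from an elevated PowerShell prompt on the Windows node:

PS C:\> Restart-Service docker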

Install kibana

Install Kibana on Windows to get a nice UI for ElasticSearch. Modify the file kibana.yml under the \config subfolder and add/change the following parameters:

server.host: 0.0.0.0

elasticsearch.hosts: ["<hostname/port of your elasticsearch host>"]

Once installed, launch kibana.bat, which starts a service listening on port 5601 by default. You can then access the server on port 5601 to get the Kibana interface.
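
A quick check that Kibana came up and can reach ElasticSearch (assuming the default port): the status API should report an overall state of green.

PS C:\> Invoke-RestMethod http://localhost:5601/api/status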

Once in the UI, add an index pattern for logstash as below.

Once the index is added you can look at logs in the Logs tab and configure it to show real-time data from the logstash* index. The Windows containers are configured to output their container name and a random number every 5 seconds, so you will be able to see this information streaming live.

Onboarding Windows nodes to Kubernetes cluster

Below are step-by-step instructions for onboarding Windows nodes to a Kubernetes cluster. For the cluster master I used Ubuntu 18.04 (the Kubernetes control plane is still a UNIX-only setup and probably will stay that way forever). For Windows worker nodes I used Windows Server 1909 images (but any version of Windows Server 2019 and up can be used instead). I run my cluster in Azure but did not use Azure CNI, so the steps can be replicated with on-prem clusters as well.

Install single control-plane cluster

  1. Create an Ubuntu VM in Azure and download the Kubernetes binaries required for installation of the control plane. I will use the kubeadm tool both for setting up the cluster and for onboarding Windows nodes to the cluster (master1 server).
  2. Install docker on the master1 server.
  3. The Flannel pod network plugin will be used for pods, hence an additional parameter should be passed to the kubeadm tool (--pod-network-cidr=10.244.0.0/16). On master1 run sysctl net.bridge.bridge-nf-call-iptables=1
  4. Initialize the single control-plane cluster by running kubeadm init --pod-network-cidr=10.244.0.0/16 on the master1 node
  5. Copy the last line of the installation output used for joining nodes to the cluster. In my case it’s (kubeadm join 10.0.0.4:6443 --token k54f1t.5rr385g1upol2njr --discovery-token-ca-cert-hash sha256:a4994328cc8b51386101983a4f860cbd08de95c56e7714b252b6ea7d13cf6d9d)
  6. Execute the following to copy the config file kubectl uses to access your cluster
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
  7. Install the flannel pod network plugin (kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml)
  8. Verify that your cluster is healthy by executing kubectl get nodes. Your master node should read as Ready
  9. Follow the instructions here to configure flannel to allow Windows nodes to join

Add Windows nodes

  1. Nodes need to be able to talk to each other by name, so make sure DNS works. If you are in Azure you can set up a private DNS zone, associate it with the Virtual Network, and enable Auto-Registration.
PS C:\Users\cloudadmin> resolve-dnsname master1.kubernetes.my

Name                                           Type   TTL   Section    IPAddress
----                                           ----   ---   -------    ---------
master1.kubernetes.my                          A      10    Answer     10.0.0.4
Azure private DNS registration
  2. Install Windows Server version 2019+. I use an image of Windows Server 1909 with containers from the Azure marketplace. It shall automatically register its name with the private zone.
    Set the default DNS suffix to your private zone name (kubernetes.my for me):
    Set-DnsClientGlobalSetting -SuffixSearchList "kubernetes.my"
  3. Download the Windows Kubernetes tools and expand them to a local folder.
Invoke-WebRequest https://github.com/kubernetes-sigs/sig-windows-tools/archive/master.zip -OutFile master.zip
Expand-Archive .\master.zip -DestinationPath .
  4. Modify the file called Kubeclustervxlan.json under (sig-windows-tools-master\kubeadm\v1.15.0). Values for the object called ControlPlane shall be modified to point to your master1 server and use the token which was copied earlier. Change the username to the username you use on the master1 node as well. Also make sure your default Ethernet adapter is in fact called Ethernet (Get-NetAdapter); if it’s not, modify the line "InterfaceName":"Ethernet" to whatever the adapter name is. Modify the Source object to point to the same version of Kubernetes as the master1 node is running. Modify the CRI item in the configuration file to change the Pause image to the multi-arch image shown below, since the default pause image does not support the 1909 base OS. My complete file is below; modify it with your relevant entries.
 {
    "Cri" : {
        "Name" : "dockerd",
        "Images" : {
            "Pause" : "mcr.microsoft.com/oss/kubernetes/pause:1.3.0",
            "Nanoserver" : "mcr.microsoft.com/windows/nanoserver:1809",
            "ServerCore" : "mcr.microsoft.com/windows/servercore:ltsc2019"
        }
    },
    "Cni" : {
        "Name" : "flannel",
        "Source" : [{ 
            "Name" : "flanneld",
            "Url" : "https://github.com/coreos/flannel/releases/download/v0.11.0/flanneld.exe"
            }
        ],
        "Plugin" : {
            "Name": "vxlan"
        },
        "InterfaceName" : "Ethernet 2"
    },
    "Kubernetes" : {
        "Source" : {
            "Release" : "1.17.4",
            "Url" : "https://dl.k8s.io/v1.17.4/kubernetes-node-windows-amd64.tar.gz"
        },
        "ControlPlane" : {
            "IpAddress" : "master1",
            "Username" : "gregory",
            "KubeadmToken" : "c5pi79.39te6ro1fnufx5jt",
            "KubeadmCAHash" : "sha256:a4994328cc8b51386101983a4f860cbd08de95c56e7714b252b6ea7d13cf6d9d"
        },
        "KubeProxy" : {
            "Gates" : "WinOverlay=true"
        },
        "Network" : {
            "ServiceCidr" : "10.96.0.0/12",
            "ClusterCidr" : "10.244.0.0/16"
        }
    },
    "Install" : {
        "Destination" : "C:\\ProgramData\\Kubernetes"
    }
}
  5. Execute the PowerShell script under the kubeadm folder and pass the location of the modified configuration file: .\KubeCluster.ps1 -ConfigFile .\v1.15.0\Kubeclustervxlan.json -install
  6. Open the generated public key of the SSH cert (called id_rsa.pub under the .ssh folder) and copy its contents. Add the contents to the file .ssh/authorized_keys on the master1 node.
  7. Reboot the computer after a successful install
  8. Once the computer comes back, execute the script again, now with the -join parameter, to join the node to the cluster: .\KubeCluster.ps1 -ConfigFile .\v1.15.0\Kubeclustervxlan.json -join
  9. If everything went through with no errors you should see the node joined to the Kubernetes cluster and in Ready state
root@master1:~# k get nodes -o wide
NAME         STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION     CONTAINER-RUNTIME
master1      Ready    master   142m    v1.17.4   10.0.0.4      <none>        Ubuntu 18.04.4 LTS          5.0.0-1032-azure   docker://19.3.6
winworker1   Ready    <none>   2m20s   v1.17.4   10.0.0.5      <none>        Windows Server Datacenter   10.0.18363.720     docker://19.3.5


  10. You can now schedule Windows containers and verify they work. The example below creates a deployment with 2 pods which output random numbers to STDOUT

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      containers:
      - command:
        - powershell.exe
        - -command
        - while ($true) { "[{0}] [{2}] {1}" -f (Get-Date),(Get-Random),$env:COMPUTERNAME;
          Start-Sleep 5}
        image: mcr.microsoft.com/windows/servercore:1909
        imagePullPolicy: IfNotPresent
        name: windowswebserver
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      nodeSelector:
        beta.kubernetes.io/os: windows
      restartPolicy: Always
status: {}
PS C:\Users\cloudadmin> kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
win-webserver-fffd4486f-cmdgx   1/1     Running   0          34m
win-webserver-fffd4486f-rp96t   1/1     Running   0          34m
PS C:\Users\cloudadmin> kubectl logs win-webserver-fffd4486f-cmdgx
[3/25/2020 12:48:07 AM] [WIN-WEBSERVER-F] 1105704259
[3/25/2020 12:48:12 AM] [WIN-WEBSERVER-F] 356015894
[3/25/2020 12:48:17 AM] [WIN-WEBSERVER-F] 1136900039
[3/25/2020 12:48:22 AM] [WIN-WEBSERVER-F] 111352898
[3/25/2020 12:48:27 AM] [WIN-WEBSERVER-F] 593146587
[3/25/2020 12:48:32 AM] [WIN-WEBSERVER-F] 1438304716
[3/25/2020 12:48:37 AM] [WIN-WEBSERVER-F] 1357778278

Azure DevOps as workflow automation for service management

Azure DevOps makes a good fit for situations where you need a workflow management service for common tasks required by a service management process. The example below showcases the process of setting up a workflow for a hypothetical Rename VM task requested by a service management tool.

The scenario being automated is a request to rename a VM in Azure, which is currently unsupported by the native control plane and requires a set of manual/semi-automated steps by personnel.

The entire process is documented in detail below. The basic steps are:

  • Run PowerShell to export the current VM to a template file
  • Delete the original VM
  • Verify the validity of the generated template
  • Deploy the template

Powershell

Traditionally, rename-VM tasks are accomplished by removing the original VM while preserving its disks and NIC and then recreating a new VM as close as possible to the original. This approach is suboptimal since a lot of metadata about the original VM is lost (for example host caching for disks, tags, extensions, etc.). The approach taken below instead relies on pulling the current resource schema for the VM (its ARM template) and redeploying it with a new name. The lines that adjust the storage profile and remove osProfile are required to account for situations when the VM was created from a marketplace image. The output of the PowerShell script is a template file with sanitized inputs, ready to be recreated with a custom name.

[CmdletBinding()]
param (
      [Parameter(Mandatory = $true)] [string] $vmName,
      [Parameter(Mandatory = $true)] [string] $resourceGroupName,
      [Parameter(Mandatory = $true)] [string] $newVMName
)
$ErrorActionPreference = "Stop"
$resource = Get-AzVM -ResourceGroupName $resourceGroupName -VMName $vmName 
Export-AzResourceGroup -ResourceGroupName $resource.ResourceGroupName -Resource $resource.Id -IncludeParameterDefaultValue -IncludeComments -Path .\template.json -Force
$resource | Stop-AzVM -Force
$resource | Remove-AzVM -Force
$templateTextFile = [System.IO.File]::ReadAllText(".\template.json")
$TemplateObject = ConvertFrom-Json $templateTextFile -AsHashtable
$TemplateObject.resources.properties.storageProfile.osDisk.createOption = "Attach"
$TemplateObject.resources.properties.storageProfile.Remove("imageReference")
$TemplateObject.resources.properties.storageProfile.osDisk.Remove("name")
$TemplateObject.resources.properties.Remove("osProfile")
$TemplateObject | ConvertTo-Json -Depth 50 | Out-File (".\template.json")
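
Assuming the script above is saved as Rename-AzVM.ps1 (a hypothetical name), a manual invocation looks like this; the same three parameters are later supplied by the build pipeline at queue time:

PS C:\> .\Rename-AzVM.ps1 -vmName VM1 -resourceGroupName temp -newVMName VM2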

Azure DevOps

Create a classic build pipeline (until YAML build pipelines allow UI editing, I would personally stay away from them).

  • Add the following variables (vmName, newVMName, resourceGroupName) to the build pipeline, which identify the VM name, the new VM name, and the resource group of the VM being worked on. Allow setting those variables at queue time.
  • Add an Azure PowerShell task to execute the script file mentioned above, pass the parameters set above to it, and make sure it’s set to PowerShell Core

  • Add an Azure Resource Group Deployment task to verify the validity of the generated template (deployment mode set to Validation only). Please note the highlighted parameters below.

  • Add another Azure Resource Group Deployment task to perform the actual rename. Settings are the same as in the previous step, except the deployment mode shall be set to Incremental

This completes the build pipeline. You can test it manually by providing values for the 3 parameters directly from the Azure DevOps UI.

Integration with service management

Azure DevOps provides a REST API to perform actions against the service. Documentation is available here.

To call the API you first need to generate a PAT token for your own or a service account by going to Azure DevOps and choosing PAT. The only permission needed is Build - Read & Execute.

To invoke a build via the API, call a URI similar to the following: https://dev.azure.com/artisticcheese/blog/_apis/build/builds?api-version=5.1. Below is the POST body of the request, identifying the build definition by ID and the parameters which will be passed to the build at queue time.

{
"definition":
{
	"id":16
},
"parameters": "{\"vmName\": \"VM1\",	\"newVMName\": \"VM2\",	\"resourceGroupName\": \"temp\"}"
}
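
A minimal PowerShell sketch of queuing the build with this body (the organization/project URI and definition ID come from the examples above; the PAT value is a placeholder):

$pat   = "<your PAT>"
$token = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(":$pat"))
$body  = '{"definition": {"id": 16}, "parameters": "{\"vmName\": \"VM1\", \"newVMName\": \"VM2\", \"resourceGroupName\": \"temp\"}"}'
$build = Invoke-RestMethod -Method Post -Uri "https://dev.azure.com/artisticcheese/blog/_apis/build/builds?api-version=5.1" `
         -Headers @{ Authorization = "Basic $token" } -ContentType "application/json" -Body $body
# The returned build object contains a URL which can be polled for build status
$build.url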

The response to the build request also contains a link which the front-end service can call to get the status of the build.

Azure Private Link in action

The Azure networking team just introduced a preview of Azure Private Link (https://azure.microsoft.com/en-us/blog/announcing-azure-private-link/). It promises to bring previously unavailable functionality for bridging the networking gap between PaaS and VNETs, as well as between VNETs in different tenants/subscriptions.

There are 2 distinct use cases for Private Link:

  1. Private Link for accessing Azure PaaS Services
  2. Private Link to Private Link Service connections for connectivity across tenants and subscriptions, and even across VNETs with overlapping IP address space

Private Link for accessing Azure PaaS Services

Traditionally, if you wanted to access PaaS services securely from within a VNET you’d need to enable a VNET service endpoint, which in turn enables routing of requests from within your VNET directly to your PaaS service. The PaaS service sees your requests coming from the private IP range of your VNET, as opposed to a public IP address before the enablement. You still go through the public IP of the PaaS service as a result, just not routed through the edge.

The Private Link solution creates an endpoint with a local IP address on your subnet through which you can access your PaaS service. You will in fact see a Network Interface resource created with an associated IP address once you enable this resource.

From a networking point of view it is similar to a reverse NAT.

An example is below, where I created a storage account called privatelinkMSDN which does not have VNET integration, so by default it denies all connections to blobs externally or internally.

Accessing a blob externally produces an HTTP error, as expected, due to IP filtering on the storage account.

Trying to resolve the name externally produces the external IP address of the service:

PS C:\Users\174181> resolve-dnsname privatelinkmsdn.privatelink.blob.core.windows.net -Type A

Name                                              Type   TTL   Section    NameHost
----                                              ----   ---   -------    --------
privatelinkmsdn.privatelink.blob.core.windows.net CNAME  53    Answer     blob.bl5prdstr09a.store.core.windows.net

Name       : blob.bl5prdstr09a.store.core.windows.net
QueryType  : A
TTL        : 52
Section    : Answer
IP4Address : 40.71.240.16

Creating the Private Endpoint is not covered here since it’s well documented by Microsoft. The end result is shown below. The following resources are created as a result of creating a Private Endpoint:

  1. A DNS zone named privatelink.blob.core.windows.net with a record pointing to your Private Endpoint
  2. The Private Endpoint itself
  3. A Network Interface resource associated with the Private Endpoint
  4. A private IP address associated with the Network Interface

While externally this URL resolves to an external IP address, resolving the same name within the VNET delegates resolution to the private DNS zone and returns the internal IP address of the NIC, which provides access to the image in the blob as expected.

PS C:\Users\cloudadmin> resolve-dnsname privatelinkmsdn.privatelink.blob.core.windows.net -Type A                                                                                             
Name                                           Type   TTL   Section    IPAddress
----                                           ----   ---   -------    ---------
privatelinkmsdn.privatelink.blob.core.windows. A      1800  Answer     10.1.0.4

Private Link Service connection

The initial configuration I’m working with is described below:

  1. Azure Tenant 1 (suvalian.com), which is associated with Subscription 2. This is the hypothetical ISV which provides services to tenant 2 below (VDI, for example). Subscription 2 contains a VNET called MSDN-VNET with the 10.1.0.0/16 address space.
  2. Azure Tenant 2 (nttdata.com), which is associated with Subscription 1. This is the customer who would like to privately connect to your services. Subscription 1 contains a VNET called NTT-VNET with the 10.1.0.0/16 address space (note that it’s the same address space as the VNET in Subscription 2).

There is no trust between the 2 tenants (that is, there are no guest accounts in either directory from the other directory), so essentially these are completely separate Azure environments.

Traditionally, to connect from tenant 2 to tenant 1 you’d have to either:

  1. Expose your services via a public IP address with restrictive NSG rules on it (poor security and additional cost due to ingress traffic charges)
  2. Create VNET-to-VNET connectivity via VPN gateways (costly, cannot have overlapping IP address space, cumbersome to set up and administer)
  3. Create VNET peering between the VNETs (cannot have overlapping IP address space)

The solution consists of the parts depicted in the image below:

In Subscription 2 you create:

  • A Private Link Service (PLS) which will be used as the endpoint connection target for your customers
  • A Network Interface resource with an IP address which will be used for NAT (10.1.2.5)
  • A Standard Load Balancer with a load balancing rule
  • A backend pool with IIS (10.1.1.4) which you want to make accessible to your customer

In Subscription 1 you create:

  • A Private Endpoint which will connect to the PLS in Subscription 2
  • A Network Interface with an IP (10.1.0.4) which will be used for connectivity to the PLS

Client 1, living in Subscription 1, can connect to the IIS resource in Subscription 2 via the IP 10.1.0.4. IIS is configured to respond with information about the client connecting to it. Opening a web page on 10.1.0.4 serves a page from the IIS web server identifying that the HTTP connection originates from 10.1.2.5:

PS C:\Users\cloudadmin> (Invoke-WebRequest http://10.1.0.4/).Content
REMOTE_ADDR 10.1.2.5

Azure Lighthouse vs guest tenant management

Traditionally, if you had to manage a customer's environment you had 2 choices:

  1. Ask the customer to add your account from your tenant as a guest user in their Azure Active Directory and assign specific RBAC roles to it on resources afterwards
  2. Have the customer create an account for you in their tenant. You’d have to maintain 2 different username/passwords as a result and log on/off in the portal for each tenant

Traditional approach

For demo purposes the following are the initial input parameters:

  • MSDN subscription called “Customer Subscription” (8211cd03-4f97-4ee6-af42-38cad1387992) in the “suvalian.com” tenant (c0de79f3-23e2-4f18-989e-d173e1d403d6).
  • I want to manage this subscription from my main tenant nttdata.com with the account 174181@nttdata.com
  • Add your account ID to a role in the customer's subscription
  • An email is dispatched with an invitation which I have to accept via the included link
  • Once the invitation is accepted I can see the new tenant available for me to switch to in the portal
  • Switching to the tenant allows me to view the managed subscription

Problems with the traditional approach:

  1. Requires end-user interaction to accept the invitation to manage the customer's environment
  2. Can only invite individual team members and not groups
  3. The partner has to switch between tenants to manage each environment (cannot, for example, see all VMs from all managed tenants, or execute a single Azure Automation runbook across all tenants)
  4. The customer has to deal with user lifecycle management, that is, remove or add users any time something changes on the partner side

Lighthouse approach

The new way of managing this process is outlined below.

You can onboard a customer either through the Azure Marketplace or via an ARM deployment. I will be using an ARM deployment below, since one has to be an Azure MSP partner to publish to the marketplace.

JSON files for this post are located here.

You need to gather the following information before onboarding a customer:

  1. Tenant ID of your MSP Azure AD
  2. Principal ID of your MSP Azure AD group
  3. Role definition ID, which is set by Azure and available here

For my specific requirements the values are below: the role definition ID is Contributor, which has an ID of b24988ac-6180-42a0-ab88-20f7382dd24c; the group ID is e361eaed-1a02-4b06-9e12-04417f6e2a46 from tenant 65e4e06f-f263-4c1f-becb-90deb8c2d9ff.

{
      "$schema": "https://schema.management.azure.com/schemas/2018-05-01/subscriptionDeploymentParameters.json#",
      "contentVersion": "1.0.0.0",
      "parameters": {
            "mspName": {
                  "value": "NTTData Consulting"
            },
            "mspOfferDescription": {
                  "value": "Managed Services"
            },
            "managedByTenantId": {
                  "value": "65e4e06f-f263-4c1f-becb-90deb8c2d9ff"
            },
            "authorizations": {
                  "value": [
                        {
                              "principalId": "e361eaed-1a02-4b06-9e12-04417f6e2a46",
                              "principalIdDisplayName": "Hyperscale Team",
                              "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c"
                        }
                  ]
            }
      }
}

I deploy from Cloud Shell since it already logs me into the correct tenant. Switch to the correct subscription before running the ARM deployment:

PS /home/gregory> Select-AzSubscription -SubscriptionId 8211cd03-4f97-4ee6-af42-38cad1387992

Name                                     Account                                         SubscriptionName                               Environment                                    TenantId
----                                     -------                                         ----------------                               -----------                                    --------
Customer Subscription (8211cd03-4f97-4e… MSI@50342                                       Customer Subscription                          AzureCloud                                     fb172512-c74c-4f0d-bb83-3e70586312d5

PS /home/gregory> New-AzDeployment -Name "MSP" -Location 'Central US' -TemplateFile ./template.json -TemplateParameterFile ./template.parameters.json
DeploymentName          : MSP
Location                : centralus
ProvisioningState       : Succeeded
Timestamp               : 9/3/19 3:24:26 PM
Mode                    : Incremental
TemplateLink            :
Parameters              :
                          Name                   Type                       Value
                          =====================  =========================  ==========
                          mspName                String                     NTTData Consulting
                          mspOfferDescription    String                     Managed Services
                          managedByTenantId      String                     65e4e06f-f263-4c1f-becb-90deb8c2d9ff
                          authorizations         Array                      [
                            {
                              "principalId": "e361eaed-1a02-4b06-9e12-04417f6e2a46",
                              "principalIdDisplayName": "Hyperscale Team",
                              "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c"
                            }
                          ]

Outputs                 :
                          Name              Type                       Value
                          ================  =========================  ==========
                          mspName           String                     Managed by NTTData Consulting
                          authorizations    Array                      [
                            {
                              "principalId": "e361eaed-1a02-4b06-9e12-04417f6e2a46",
                              "principalIdDisplayName": "Hyperscale Team",
                              "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c"
                            }
                          ]

DeploymentDebugLogLevel :

Log in to your customer environment and check that you now see “NTTData Consulting” under Service providers.
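
You can also verify from the command line; a sketch using the Az.ManagedServices module, run in the customer's subscription context:

PS /home/gregory> Get-AzManagedServicesDefinition
PS /home/gregory> Get-AzManagedServicesAssignment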

Now, if you want to add additional access (like access to a second subscription), you can do it right from the portal without the need for an ARM deployment. For example, below I’m adding access to a specific resource group in a separate subscription to be managed by the MSP.

In my MSP panel I can now see both access to the entire subscription and access to a specific resource group in another.

You will be able to see resources in the portal just as if your account were part of the customer's tenant.

For example, I added tags to an existing storage account and it appears as if I were a guest account in the customer's AD.

Automation at scale in Azure with PowerShell Azure Functions

Code for the article below is located at https://github.com/artisticcheese/artisticcheesecontainer/tree/master/MetadataFunction

My task was to execute a certain script inside a large number of VMs (700+) on a periodic schedule to pull metadata information from the Azure Instance Metadata Service (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service). This data is available ONLY within a running VM and there is no way to access it any other way. Specifically, I needed data about Scheduled Events (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events), which informs a VM whether an Azure-initiated reboot is pending in one way or another (detailed info at https://docs.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events#query-for-events).

Microsoft provides a solution called “Azure Scheduled Events Service” (https://github.com/microsoft/AzureScheduledEventsService) which has severe drawbacks, namely:

  1. You have to download and install the service on all machines
  2. It relies on the Invoke-RestMethod cmdlet to query the metadata service, which is not supported on PowerShell 2.0, so by default it will not run on Windows 2008
  3. It only runs on Windows, obviously, so none of the UNIX machines are covered
  4. It logs data into the local Application log, which is not very useful since you now have to figure out how to centralize and query this information
  5. There is no centralized alerting on those events, as a result of point 4 above

My solution, outlined below, relies on Azure resources to install/maintain/query/alert on health events without the need for dedicated agents.

The solution consists of the following moving parts:

  1. Azure PowerShell function
  2. Azure Storage queue
  3. Azure Log Analytics workspace
  4. Azure Monitor

The general flow is below.

An Azure PowerShell function, executed on a timer or via an HTTP request, populates the storage queue with all VM names in the subscription, their resource groups, and the power state of each machine.

The Azure App Service plan the PowerShell function is hosted on has a scale-out condition to jump to 8 instances upon seeing the storage queue being populated, which in turn provides around 160 concurrently executing workers.

A second Azure PowerShell function is bound to the storage queue and spins up upon the presence of queue messages. It reads a queue message, pulls the VM, checks its operating system, and based on that executes either a shell or a PowerShell script to query the metadata service via Invoke-AzVMRunCommand.

Upon success or error, the script writes the returned data to the Log Analytics workspace.

Azure Monitor is set up to act upon an Azure Log Analytics query.

Details

Create the Function App which will host the 2 functions mentioned above. An example is below. Don’t use a Consumption plan, since it does not scale well with PowerShell; choose at least an S2 size so you can use the multiprocessor capabilities to scale out locally, in addition to scaling the App Service out based on the queue.

Go to the storage account which was created and create 2 queues: one to hold messages and one for message rejects (poison).

Copy the connection string from this storage account; it will be required for function setup.

Create a Log Analytics workspace to hold the messages.

Record the values of the Workspace ID as well as the primary key to be used later in the function.

Update local.settings.json in your function folder to contain the settings you copied earlier. My example is below:

{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;AccountName=mymetadatafuncta57d;AccountKey=9/jxdL3jdsrKED+ddQHByebGkzozxiLHrNeRUrvGWhO8//dzGm9m184n0VymQBTBlkfzIPkbx1+nTSXA/6HlZQ==",
    "FUNCTIONS_WORKER_RUNTIME": "powershell",
    "LogAnalyticsWorkspaceID": "02f2eb14-85d2-4069-9a1a-6b8cd91d783c",
    "LogAnalyticsSharedKey": "D0P2Z9D4U3k8xJFLzBnLg/Ns3oyEsEj4ivVxq5buGQN5BtYND/nleWGfrsc5SD6wajW/SbtqpvvgWCjQCfPdlw==",
    "QueueName": "metadataservicequeue",
    "FUNCTIONS_EXTENSION_VERSION": "~2",
    "WEBSITE_NODE_DEFAULT_VERSION": "10.14.1"
  }
}

Deploy the function to Azure from VS Code.
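
If you prefer the command line, Azure Functions Core Tools can run the project locally and publish it; the app name below is a placeholder for whatever you created earlier.

PS C:\MetadataFunction> func start
PS C:\MetadataFunction> func azure functionapp publish <function-app-name>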

Once the function is deployed, try to execute PopulateQueueWithVMNamesHTTP. You should expect a failure, since the function does not yet have the necessary permissions to access Azure resources.

2019-08-20T21:07:27.528 [Information] INFORMATION: getting Queue Account info
2019-08-20T21:07:28.062 [Information] INFORMATION: getting all VM Account info
2019-08-20T21:07:29.804 [Error] ERROR: No account found in the context. Please login using Connect-AzAccount.
Microsoft.Azure.WebJobs.Script.Rpc.RpcException : Result: ERROR: No account found in the context. Please login using Connect-AzAccount.
Exception: No account found in the context. Please login using Connect-AzAccount.

Assign a system-assigned identity to your Function App by going to the Identity option under Platform features.

Add the identity to the Reader and Virtual Machine Contributor roles in the subscription. The Reader role is needed to pull the list of all VMs in the subscription, and the Virtual Machine Contributor role is needed to execute scripts on the VMs; a sketch of scripting this is below.
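
A sketch of the same role assignments with Az PowerShell (the object ID is the function's managed identity principal ID; both IDs are placeholders):

PS /home/gregory> New-AzRoleAssignment -ObjectId <principal-object-id> -RoleDefinitionName "Reader" -Scope "/subscriptions/<subscription-id>"
PS /home/gregory> New-AzRoleAssignment -ObjectId <principal-object-id> -RoleDefinitionName "Virtual Machine Contributor" -Scope "/subscriptions/<subscription-id>"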

You should now see successful output with details of what queue messages were created:

2019-08-20T21:25:46  Welcome, you are now connected to log-streaming service. The default timeout is 2 hours. Change the timeout with the App Setting SCM_LOGSTREAM_TIMEOUT (in seconds). 
2019-08-20T21:25:49.448 [Information] Executing 'Functions.PopulateQueueWithVMNamesHTTP' (Reason='This function was programmatically called via the host APIs.', Id=3d49429c-63c9-4b8e-998b-d05514863f09)
2019-08-20T21:25:55.744 [Information] INFORMATION: PowerShell HTTP trigger function processed a request.
2019-08-20T21:25:55.761 [Information] INFORMATION: getting Storage Account info
2019-08-20T21:25:57.910 [Information] INFORMATION: getting Queue Account info
2019-08-20T21:25:58.183 [Information] INFORMATION: getting all VM Account info
2019-08-20T21:26:01.662 [Information] INFORMATION: Generating queue messages
2019-08-20T21:26:01.766 [Information] INFORMATION: Loop finished
2019-08-20T21:26:01.770 [Information] INFORMATION: Added 1 count {
"VMName" : "GregDesktop",
"ResourceGroup": "DEVTESTLAB-RG",
"State" : "VM running"
} to queue 1 records process
2019-08-20T21:26:01.920 [Information] Executed 'Functions.PopulateQueueWithVMNamesHTTP' (Succeeded, Id=3d49429c-63c9-4b8e-998b-d05514863f09)

You should also see this queue message in your storage account.

If you monitor the logs for MetadataFunction you’ll see it wake up and process the messages posted in the queue:

019-08-20T23:12:07.244 [Information] INFORMATION: Finished executing Invoke-AzureRMCommand with parameters GregDesktop, DEVTESTLAB-RG, VM running, return is {"DocumentIncarnation":0,"Events":[]} )
2019-08-20T23:12:07.255 [Information] INFORMATION: Outputing following to Log Analytics [
    {
        "Return" : "{\"DocumentIncarnation\":0,\"Events\":[]}",
        "VMName" : "GregDesktop",
        "ResourceGroup" : "DEVTESTLAB-RG"

    }
]
2019-08-20T23:12:07.588 [Trace] PROGRESS: Reading response stream... (Number of bytes read: 0)
2019-08-20T23:12:07.589 [Trace] PROGRESS: Reading web response completed. (Number of bytes read: 0)
2019-08-20T23:12:07.596 [Information] OUTPUT: 200
2019-08-20T23:12:07.644 [Information] Executed 'Functions.MetadataFunction' (Succeeded, Id=21111100-7a23-4374-93f1-9dfa5df76011)

You’ll also see the output posted to the Log Analytics workspace in a custom log called MetaDataLog.

You can then set up alerting on scheduled redeploy events by executing the Kusto query below and tying an Azure Monitor action to it:

MetaDataLog_CL
| project VMName_s, TimeGenerated,  ResourceGroup, Return_s
| summarize arg_max(TimeGenerated, *) by VMName_s
| where Return_s contains "Redeploy"
| order by TimeGenerated desc 

Notes:

  1. The Consumption plan is impossible to use due to the poor scalability of PowerShell on the single-core instances the Consumption plan provides. I was unable to use it in any form or capacity until I switched to an App Service plan instead. (https://docs.microsoft.com/en-us/azure/azure-functions/functions-reference-powershell#concurrency)
  2. Increase the value of the PSWorkerInProcConcurrencyUpperBound setting to increase concurrency, since the function is not CPU or IO bound. Mine is set to 20.
  3. In the App Service plan, also configure a scale out/in rule to scale the number of instances based on the size of the queue. Mine is set to 8, so once the application is triggered you get 160 instances of PowerShell executing in parallel.
  4. The project consists of 2 functions to populate the queue: one is HTTP triggered and the other executes on a timer.