SDefinITive

by jhasensio

Antrea Observability Part 3: Dashboards for metrics and logs

In the previous post we learnt how to install some important tools to provide observability including Grafana, Prometheus, Loki and Fluent-Bit.

Importing precreated Dashboards and Datasources in Grafana

The task of creating powerful data insights from metrics and logs through awesome visual dashboards is by far the most complex and time consuming task in the observability discipline. To avoid the creation of dashboard from scratch each time you deploy your observability stack, there is a method to import precreated dashboards very easily as long as someone already put the effort in creating your desired dashboard for you. A dashboard in Grafana is represented by a JSON object, which stores plenty of metadata including includes properties, size, placement, template variables, panel queries, etc.

If you remember from previous post here, we installed Grafana with datasource and dashboard sidecars enabled. As you can guess these sidecars are just auxiliary containers whose task is to watch for the existence of configmap in current namespace with a particular label. As soon as a new matching configmap appears the sidecars inject dynamically the extracted configuration and create a new datasource or dashboard.

I have created my own set of dashboards and datasources that you can reuse if you wish. To do so just clone the git repository as shown here.

git clone https://github.com/jhasensio/antrea.git
Cloning into 'antrea'...
remote: Enumerating objects: 309, done.
remote: Counting objects: 100% (309/309), done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 309 (delta 135), reused 282 (delta 109), pack-reused 0
Receiving objects: 100% (309/309), 1.05 MiB | 6.67 MiB/s, done.
Resolving deltas: 100% (135/135), done.

First we will create the Datasources, so navigate to antrea/GRAFANA/datasources/ folder and list the content.

ls -alg
total 16
drwxrwxr-x 2 jhasensio 4096 Feb 27 18:48 .
drwxrwxr-x 4 jhasensio 4096 Feb 27 18:37 ..
-rw-rw-r-- 1 jhasensio  571 Feb 27 18:37 datasources.yaml

In this case, the Datasource is defined through a yaml file. The format is fully defined at official documentation here is very easy to identify the different fields as you can see below.

cat datasources.yaml
datasources:
 datasources.yaml:
   apiVersion: 1
   datasources:
   # Loki Datasource using distributed deployment (otherwise use port 3100)
    - name: loki
      type: loki
      uid: loki
      url: http://loki-gateway.loki.svc.cluster.local:80
      access: proxy
      version: 1
      editable: false
   # Prometheus Datasource marked as default
    - name: prometheus
      type: prometheus
      uid: prometheus
      url: http://prometheus-service.monitoring.svc.cluster.local:8080
      access: proxy
      isDefault: true
      version: 1
      editable: false

Now all that we need to do is to generate a configmap or a secret from the yaml file using kubectl create with –from-file keyword. Since the datasources might contain sensitive information is it better to use a secret instead of a configmap to leverage the encoding capabilities of the secret object. The following command will create a generic secret importing the data contained in the datasources.yaml all existing yaml files in current directory and will use the filename stripping .yaml extension as prefix for the configmaps names followed by -datasource-cm suffix.

kubectl create secret generic -n grafana –from-file=datasources.yaml datasources-secret
secret/datasources-secret created

Next step is to label the secret object using the label and key we defined when we deployed Grafana.

kubectl label secret datasources-secret -n grafana grafana_datasource=1
secret/datasources-secret labeled

Now we have to do a similar procedure with the dashboards. From the path where you cloned the git repo before, navigate to antrea/GRAFANA/dashboards/ and list the contents. You should see the json files that define the dashboards we want to import.

ls -alg
total 176
drwxrwxr-x 2 jhasensio  4096 Feb 27 18:37 .
drwxrwxr-x 4 jhasensio  4096 Feb 27 18:37 ..
-rw-rw-r-- 1 jhasensio 25936 Feb 27 18:37 1-agent-process-metrics.json
-rw-rw-r-- 1 jhasensio 28824 Feb 27 18:37 2-agent-ovs-metrics-and-logs.json
-rw-rw-r-- 1 jhasensio 15384 Feb 27 18:37 3-agent-conntrack-and-proxy-metrics.json
-rw-rw-r-- 1 jhasensio 39998 Feb 27 18:37 4-agent-network-policy-metrics-and-logs.json
-rw-rw-r-- 1 jhasensio 17667 Feb 27 18:37 5-antrea-agent-logs.json
-rw-rw-r-- 1 jhasensio 31910 Feb 27 18:37 6-controller-metrics-and-logs.json

Now we will create the configmaps in the grafana namespace and label it. We can use a single-line command to do it recursively leveraging xargs command as shown below.

ls -1 *.json | sed ‘s/\.[^.]*$//’ | xargs -n 1 -I {arg} kubectl create configmap -n grafana –from-file={arg}.json {arg}-dashboard-cm
configmap/1-agent-process-metrics-dashboard-cm created
configmap/2-agent-ovs-metrics-and-logs-dashboard-cm created
configmap/3-agent-conntrack-and-proxy-metrics-dashboard-cm created
configmap/4-agent-network-policy-metrics-and-logs-dashboard-cm created
configmap/5-antrea-agent-logs-dashboard-cm created
configmap/6-controller-metrics-and-logs-dashboard-cm created

And now tag the created configmap objects using the label and key we defined when we deploy grafana using the command below.

kubectl get configmaps -n grafana | grep dashboard-cm | awk ‘{print $1}’ | xargs -n1 -I{arg} kubectl label cm {arg} -n grafana grafana_dashboard=1
configmap/1-agent-process-metrics-dashboard-cm labeled
configmap/2-agent-ovs-metrics-and-logs-dashboard-cm labeled
configmap/3-agent-conntrack-and-proxy-metrics-dashboard-cm labeled
configmap/4-agent-network-policy-metrics-and-logs-dashboard-cm labeled
configmap/5-antrea-agent-logs-dashboard-cm labeled
configmap/6-controller-metrics-and-logs-dashboard-cm labeled

Feel free to explore the created configmaps and secrets to verify if they have proper labels before moving to Grafana UI using following command.

kubectl get cm,secret -n grafana –show-labels | grep -E “NAME|dashboard-cm|datasource”
NAME                                                             DATA   AGE     LABELS
configmap/1-agent-process-metrics-dashboard-cm                   1      5m32s   grafana_dashboard=1
configmap/2-agent-ovs-metrics-and-logs-dashboard-cm              1      5m31s   grafana_dashboard=1
configmap/3-agent-conntrack-and-proxy-metrics-dashboard-cm       1      5m31s   grafana_dashboard=1
configmap/4-agent-network-policy-metrics-and-logs-dashboard-cm   1      5m31s   grafana_dashboard=1
configmap/5-antrea-agent-logs-dashboard-cm                       1      5m31s   grafana_dashboard=1
configmap/6-controller-metrics-and-logs-dashboard-cm             1      5m31s   grafana_dashboard=1
NAME                                   TYPE                 DATA   AGE   LABELS
secret/datasources-secret              Opaque               1      22m   grafana_datasource=1

If the sidecars did their job of watching for secrets with corresponding labels and importing their configuration into Grafana UI, we should now be able to access Grafana UI and see the imported datasources.

Grafana imported Datasources through labeled configmaps

If you navigate to the Dashboards section you should now see the six imported dashboards as well.

Grafana imported Dashboards through labeled configmaps

Now we have imported the datasources and dashboards we can move into Grafana UI to explore the created visualizations, but before moving jumping to UI, lets dive a little bit into log processing to understand how the logs are properly formatted and pushed ultimately to Grafana dashboards.

Parsing logs with FluentBit

As we seen in previous post here, Fluent-bit is a powerful log shipper that can be used to push any log produced by Antrea components. The formatting or parsing is the process to extract useful information within the raw logs to achieve better understanding and filtering capabilites of the overall information. This might be a tough task that require some time and attention and understanding of how a regular expression work. The following sections will show how to obtain the desired log formatting configuration using a methodical approach.

1 Install FluentBit in a Linux box

The first step is to isntall fluent-bit. The quickest way is by using the provided shell script in the official page.

curl https://raw.githubusercontent.com/fluent/fluent-bit/master/install.sh | sh
... skipped
Need to get 29.9 MB of archives.
After this operation, 12.8 MB of additional disk space will be used.
Get:1 https://packages.fluentbit.io/ubuntu/focal focal/main amd64 fluent-bit amd64 2.0.9 [29.9 MB]
Fetched 29.9 MB in 3s (8,626 kB/s)     
(Reading database ... 117559 files and directories currently installed.)
Preparing to unpack .../fluent-bit_2.0.9_amd64.deb ...
Unpacking fluent-bit (2.0.9) over (2.0.6) ...
Setting up fluent-bit (2.0.9) ...

Installation completed. Happy Logging!

Once the script is completed, the binary will be placed into the /opt/fluent-bit/bin path. Launch fluent-bit to check if is working. An output like the one shown below should be seen.

/opt/fluent-bit/bin/fluent-bit
Fluent Bit v2.0.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/03/13 19:15:50] [ info] [fluent bit] version=2.0.9, commit=, pid=1943703
...

2 Obtain a sample of the target log

Next step is to get a sample log file of the intended system we want to process logs from. This sampe log will be used as the input for fluent-bit. As an example, let’s use the network policy log generated by Antrea Agent when enableLogging spec is set to true. Luckily, the log format is fully docummented here and we will use that as a base for our parser. We can easily identify some fields within the log line such as timestamp, RuleName, Action, SourceIP, SourcePort and so on.

cat /var/log/antrea/networkpolicy/np.log
2023/02/14 11:23:54.727335 AntreaPolicyIngressRule AntreaClusterNetworkPolicy:acnp-acme-allow FrontEnd_Rule Allow 14000 10.34.6.14 46868 10.34.6.5 8082 TCP 60

3 Create the the fluent-bit config file

Once we have a sample log file with targeted log entries create a fluent-bit.conf file that will be used to process our log file. In the INPUT section we are using the tail input plugin that will read the file specified in the path keyword below. After tail module read the file, the fluent-bit pipeline will send the data to the antreanetworkpolicy custom parser that we will create in next section. Last, the output section will send the result to the standard output (console).

vi fluent-bit.conf
[INPUT]
    name tail
    path /home/jhasensio/np.log
    tag antreanetworkpolicy
    read_from_head true
    parser antreanetworkpolicy
    path_key on

[OUTPUT]
    Name  stdout
    Match *

4 Create the parser

The parser is by far the most tricky part since you need to use regular expression to create matching patterns in order to extract the values for the desired fields according to the position of the values within the log entry. You can use rubular that is a great website to play with regular expressions and also provides an option to create a permalink to share the result of the parsing. I have created a permalink here that is also commented in the parser file below that can be used for this purpose to understand and play with regex using a sample log line as an input. The basic idea is to choose a name for the intended fields and use a regular expression to match with the log position that shows the value for that field. Note we end up with a pretty long regex expression to extract all the fields.

vi parser.conf
[PARSER]
    Name antreanetworkpolicy
    Format regex
    # https://rubular.com/r/gCTJfLIkeioOgO
    Regex ^(?<date>[^ ]+) (?<time>[^ ]+) (?<ovsTableName>[^ ]+) (?<antreaNativePolicyReference>[^ ]+) (?<rulename>[^ ]+) (?<action>[^( ]+) (?<openflowpriority>[^ ]+) (?<sourceip>[^ ]+) (?<sourceport>[^ ]+) (?<destinationip>[^ ]+) (?<destinationport>[^ ]+) (?<protocol>[^ ]+) (?<packetLength>.*)$

4 Test filter and parser

Now that the config file and the parser are prepared, it is time to test the parser to check if is working as expected. Sometime you need to iterate over the parser configuration till you get the desired outcome. Use following command to test fluent-bit.

/opt/fluent-bit/bin/fluent-bit -c fluent-bit.conf –parser parser.conf -o stdout -v
Fluent Bit v2.0.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/03/13 19:26:01] [ info] Configuration:
[2023/03/13 19:26:01] [ info]  flush time     | 1.000000 seconds
[2023/03/13 19:26:01] [ info]  grace          | 5 seconds
... skipp
[2023/03/13 19:26:01] [ info] [output:stdout:stdout.0] worker #0 started
[0] antreanetworkpolicy: [1678731961.923762559, {"on"=>"/home/jhasensio/np.log", "date"=>"2023/02/14", "time"=>"11:23:54.727335", "ovsTableName"=>"AntreaPolicyIngressRule", "antreaNativePolicyReference"=>"AntreaClusterNetworkPolicy:acnp-acme-allow", "rulename"=>"FrontEnd_Rule", "action"=>"Allow", "openflowpriority"=>"14000", "sourceip"=>"10.34.6.14", "sourceport"=>"46868", "destinationip"=>"10.34.6.5", "destinationport"=>"8082", "protocol"=>"TCP", "packetLength"=>"60"}]
[2023/03/13 19:26:02] [debug] [task] created task=0x7f6835e4bbc0 id=0 OK
[2023/03/13 19:26:02] [debug] [output:stdout:stdout.1] task_id=0 assigned to thread #0
[2023/03/13 19:26:02] [debug] [out flush] cb_destroy coro_id=0
[2023/03/13 19:26:02] [debug] [task] destroy task=0x7f6835e4bbc0 (task_id=0)

As you can see in the highlighted section the log entry has been sucessfully processed and a JSON document has been generated and sent to the console mapping the defined field names with the corresponding values matched in the log according to regex capturing process.

Repeat this procedure for all the different log formats that you plan to process until you get the desired results. Once done is time to push the configuration into kubernetes and run FluentBit as a daemonSet. The following values.yaml can be used to inject the fluent-bit configuration via Helm to process logs generated by Antrea controller, agents, openvSwitch and NetworkPolicy logs. Note that, apart from the regex, is also important to tag the logs with significant labels in order to get the most of the dashboards and achieve good filtering capabilities at Grafana.

vi values.yaml
# kind -- DaemonSet or Deployment
kind: DaemonSet

env: 
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush {{ .Values.flush }}
        Log_Level {{ .Values.logLevel }}
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.metricsPort }}
        Health_Check On

  ## https://docs.fluentbit.io/manual/pipeline/inputs
  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/antrea-agent*.log
        Tag antreaagent
        parser antrea
        Mem_Buf_Limit 5MB
    
    [INPUT]
        Name tail
        Path /var/log/containers/antrea-controller*.log
        Tag antreacontroller
        parser antrea
        Mem_Buf_Limit 5MB

    [INPUT]
        Name tail
        Path /var/log/antrea/networkpolicy/np*.log
        Tag antreanetworkpolicy
        parser antreanetworkpolicy
        Mem_Buf_Limit 5MB

    [INPUT]
        Name tail
        Path /var/log/antrea/openvswitch/ovs*.log
        Tag ovs
        parser ovs
        Mem_Buf_Limit 5MB
  
  ## https://docs.fluentbit.io/manual/pipeline/filters
  filters: |
    [FILTER]
        Name kubernetes
        Match antrea
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On

    [FILTER]
        Name record_modifier
        Match *
        Record podname ${HOSTNAME}
        Record nodename ${NODE_NAME}

  ## https://docs.fluentbit.io/manual/pipeline/outputs
  outputs: |
    [OUTPUT]
        Name loki
        Match antreaagent
        Host loki-gateway.loki.svc
        Port 80
        Labels job=fluentbit-antrea, agent_log_category=$category
        Label_keys $log_level, $nodename
    
    [OUTPUT]
        Name loki
        Match antreacontroller
        Host loki-gateway.loki.svc
        Port 80
        Labels job=fluentbit-antrea-controller, controller_log_category=$category
        Label_keys $log_level, $nodename
  
    [OUTPUT]
        Name loki
        Match antreanetworkpolicy
        Host loki-gateway.loki.svc
        Port 80
        Labels job=fluentbit-antrea-netpolicy
        Label_keys $nodename, $action
  
    [OUTPUT]
        Name loki
        Match ovs
        Host loki-gateway.loki.svc
        Port 80
        Labels job=fluentbit-antrea-ovs
        Label_keys $nodename, $ovs_log_level, $ovs_category

    ## https://docs.fluentbit.io/manual/pipeline/parsers
  customParsers: |
    [PARSER]
        Name antrea
        Format regex
        # https://rubular.com/r/04kWAJU1E3e20U
        Regex ^(?<timestamp>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) ((?<log_level>[^ ]?)(?<code>\d\d\d\d)) (?<time>[^ ]+) (.*) ((?<category>.*)?(.go.*\])) (?<message>.*)$
   
    [PARSER]    
        Name antreanetworkpolicy
        Format regex
        # https://rubular.com/r/gCTJfLIkeioOgO
        Regex ^(?<date>[^ ]+) (?<time>[^ ]+) (?<ovsTableName>[^ ]+) (?<antreaNativePolicyReference>[^ ]+) (?<ruleName>[^ ]+) (?<action>[^( ]+) (?<openflowPriority>[^ ]+) (?<sourceIP>[^ ]+) (?<sourcePort>[^ ]+) (?<destinationIP>[^ ]+) (?<destinationPort>[^ ]+) (?<protocol>[^ ]+) (?<packetLength>.*)$

    [PARSER]    
        Name ovs
        Format regex
        Regex ^((?<date>[^ ].*)\|(?<log_code>.*)\|(?<ovs_category>.*)\|(?<ovs_log_level>.*))\|(?<message>\w+\s.*$)

Find a copy of values.yaml in my github here. Once your fluent-bit configuration is done, you can upgrade the fluent-bit release through helm using the following command to apply the new configuration.

helm upgrade –values values.yaml fluent-bit fluent/fluent-bit -n fluent-bit
Release "fluent-bit" has been upgraded. Happy Helming!
NAME: fluent-bit
LAST DEPLOYED: Tue Mar 14 17:27:23 2023
NAMESPACE: fluent-bit
STATUS: deployed
REVISION: 32
NOTES:
Get Fluent Bit build information by running these commands:

export POD_NAME=$(kubectl get pods --namespace fluent-bit -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace fluent-bit port-forward $POD_NAME 2020:2020
curl http://127.0.0.1:202

Exploring Dashboards

Now we understand how log shipping process work its time to jump into grafana to navigate through the different precreated dashboards that will represent not only prometheus metrics but also some will be helpful to explore logs and extract some metrics from them.

All the dashboards use variables that allow you to create more interactive and dynamic dashboards. Instead of hard-coding things like agent-name, instance or log-level, you can use variables in their place that are displayed as dropdown lists at the top of the dashboard. These dropdowns make it easy to filter the data being displayed in your dashboard.

Dashboard 1: Agent Process Metrics

The first dashboard provide a general view of observed kubernetes cluster. This dashboard can be used to understand some important metrics across the time such as CPU and memory usage, kubernetes API usage, filedescriptor and so on. I have tried to extract the most significative out of the full set of metrics as documented here but feel free to add extra metrics you are interested in.

Dashboard 1 Agent Process Metrics

Just to see graphs in action reacting to cluster conditions, lets play with CPU usage as an example. An easy way to impact CPU usage is by creating some activity in the cluster to stress CPU. Lets create a new deployment that will be used for this purpose with following command.

kubectl create deployment www –image=httpd –replicas=1
deployment.apps/www created

Now lets try to stress the antrea agent CPU a little bit by calling quite aggresively to the Kubernetes API. To do this we will use a loop to scale the deployment randomly to a number between 1 and 20 replicas and waiting a random number between 1 and 500 milliseconds between API calls.

while true; do kubectl scale deployment www –replicas=$[($RANDOM % 20) + 1]; sleep .$[ ( $RANDOM % 500 ) + 1 ]s ; done
deployment.apps/www scaled
deployment.apps/www scaled
deployment.apps/www scaled
deployment.apps/www scaled
... Stop with CTRL+C

As you can see as soon as we start to inject the CPU usage of the antrea agent running in the nodes is fairly impacted as the agent need to reprogram the ovs and plumb the new replica pods to the network.

Antrea Agent CPU Usage Panel

On the other hand, the KubeAPI process is heavily impacted because is in charge to process all the API calls we are sending.

Kubernetes API CPU Usage Panel

Another interesting panel is below and titld as Logs per Second. This panel uses all the logs received from the nodes via fluent-bit and received by Loki to calculate the logs per second rate. As soon as we scale in/out there is a lot of activity in the agents that is generating a big amount of logs. This can be useful as an indication of current cluster activity.

Antrea Logs per Second panel

Explore other panels and play with variables such as instance to filter out the displayed information.

Dashboard 2: Agent OpenvSwitch (OVS) Metrics and Logs

The second dashboard will help us to understand what is happening behind the scenes in relation to the OpenvSwitch component. OpenvSwitch is an open source OpenFlow capable virtual switch used to create the SDN solution that will provide connectivity between the nodes and pods in this kubernetes scenario.

OpenvSwitch uses a Pipeline Model that is explained in the Antrea IO website here and is represented in the following diagram. (Note: It seems that the diagram is a little bit outdated since the actual Table IDs does not correspond to the IDs represented in official diagram).

Antrea OVS Pipeline (Note:Table IDs are outdated)

The tables are used to store the flow information at different stages of the transit of the packet. As an example, lets take the SpoofGuardTable (correspond to TableID 1 in current implementation). The SpoofGuardTable which map to ID 1 is responsible for preventing IP and ARP spoofing from local Pods.

We can easily explore its content, for example, display the pods running on a given node (k8s-worker01)

kubectl get pod -A -o wide | grep -E “NAME|worker-01”
jhasensio@forty-two:~/ANTREA$ kubectl get pod -A -o wide | grep -E "NAME|worker-01"
NAMESPACE           NAME                                           READY   STATUS    RESTARTS        AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
acme-fitness        payment-5ffb9c8d65-g8qb2                       1/1     Running   0               2d23h   10.34.6.179   k8s-worker-01         <none>           <none>
fluent-bit          fluent-bit-9n6nq                               1/1     Running   2 (4d13h ago)   6d8h    10.34.6.140   k8s-worker-01         <none>           <none>
kube-system         antrea-agent-hbc86                             2/2     Running   0               47h     10.113.2.15   k8s-worker-01         <none>           <none>
kube-system         coredns-6d4b75cb6d-mmscv                       1/1     Running   2 (4d13h ago)   48d     10.34.6.9     k8s-worker-01         <none>           <none>
kube-system         coredns-6d4b75cb6d-sqg4z                       1/1     Running   1 (20d ago)     48d     10.34.6.7     k8s-worker-01         <none>           <none>
kube-system         kube-proxy-fvqbr                               1/1     Running   1 (20d ago)     48d     10.113.2.15   k8s-worker-01         <none>           <none>
load-gen            locust-master-67bdb5dbd4-ngtw7                 1/1     Running   0               2d22h   10.34.6.184   k8s-worker-01         <none>           <none>
load-gen            locust-worker-6c5f87b5c8-pgzng                 1/1     Running   0               2d22h   10.34.6.185   k8s-worker-01         <none>           <none>
logs                logs-pool-0-4                                  1/1     Running   0               6d18h   10.34.6.20    k8s-worker-01         <none>           <none>
loki                loki-canary-7kf7l                              1/1     Running   0               15h     10.34.6.213   k8s-worker-01         <none>           <none>
loki                loki-logs-drhbm                                2/2     Running   0               6d18h   10.34.6.22    k8s-worker-01         <none>           <none>
loki                loki-read-2                                    1/1     Running   0               6d9h    10.34.6.139   k8s-worker-01         <none>           <none>
loki                loki-write-1                                   1/1     Running   0               15h     10.34.6.212   k8s-worker-01         <none>           <none>
minio-operator      minio-operator-868fc4755d-q42nc                1/1     Running   1 (20d ago)     35d     10.34.6.3     k8s-worker-01         <none>           <none>
vmware-system-csi   vsphere-csi-node-42d8j                         3/3     Running   7 (18d ago)     35d     10.113.2.15   k8s-worker-01         <none>           <none>

Take the Antrea Agent pod name and dump the content of the SpoofGuardTable of the OVS container in that particular Node by issuing the following command.

kubectl exec -n kube-system antrea-agent-hbc86 -c antrea-ovs — ovs-ofctl dump-flows br-int table=1 –no-stats
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=2,arp_spa=10.34.6.1,arp_sha=12:2a:4a:31:e2:43 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=10,arp_spa=10.34.6.9,arp_sha=16:6d:5b:d2:e6:86 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=183,arp_spa=10.34.6.185,arp_sha=d6:c2:ae:76:9d:25 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=140,arp_spa=10.34.6.140,arp_sha=4e:b4:15:50:90:3e actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=139,arp_spa=10.34.6.139,arp_sha=c2:07:8f:35:8b:17 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=4,arp_spa=10.34.6.3,arp_sha=7a:d9:fd:e7:2d:af actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=182,arp_spa=10.34.6.184,arp_sha=7a:5e:26:ef:76:07 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=21,arp_spa=10.34.6.20,arp_sha=fa:e6:b9:f4:b1:e7 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=23,arp_spa=10.34.6.22,arp_sha=3a:17:f0:79:49:24 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=177,arp_spa=10.34.6.179,arp_sha=92:76:e0:a4:f4:57 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=8,arp_spa=10.34.6.7,arp_sha=8e:cc:4d:06:80:ed actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=210,arp_spa=10.34.6.212,arp_sha=7e:34:08:3d:86:c5 actions=resubmit(,2)
 cookie=0x7010000000000, table=1, priority=200,arp,in_port=211,arp_spa=10.34.6.213,arp_sha=9a:a8:84:d5:74:70 actions=resubmit(,2)
 cookie=0x7000000000000, table=1, priority=0 actions=resubmit(,2)

This SpoofGuard table contains the port of the pod, the IP address and the associated MAC and is automatically populated upon pod creation. The OVS metrics are labeled by table ID so its easy to identify the counters for any given identifier. Using the filter variable at the top of the screen select the ovs_table 1. As you can see from the graph the value for each of the table remains steady which mean there has not been recent pod creation/deletion activity in the cluster in the interval of observation. Also you can see how the value is almost the same in all the worker nodes which means the pods are evenly distributed across the worker. As expected the control plane node has less fewer entries.

Again, as we want to see dashboards in action, we can generate some counter variations easily scaling in/out a deployment. Using the www deployment we created before, scale now to 100 replicas. The pods should be scheduled among the worker nodes and the Table 1 will be populated with corresponding entries. Lets give a try.

kubectl scale deployment www –replicas=100
deployment.apps/www scaled

The counter for this particular table now increases rapidly and the visualization graph shows now updated values.

Agent OVS Flow Count SpoofGuard Table ID1

Similarly, there is another ovs table which is the EndpointDNAT (Table11) that program DNAT rules to reach endpoints behind a clusterIP service. Using the variable filters at the top of the dashboard select ovs_table 1 and 11 in the same graph and select only the instance k8s-worker-01. Note how the EndpointDNAT (11) hasn’t change at all during the scale out process and has remained steady at the value 101 in this particular case.

Agent OVS Flow Count SpoofGuard Table ID1 + Table ID11 at Worker 01

If you expose now the deployment under test, all the pod replicas will be used as endpoints which means the overall endpoints should be incremented by exactly one hundred.

kubectl expose deployment www –port=28888
service/www exposed

Now the www service should have 100 endpoints as showed below and correspondindly, the counter of the Table11 which contains the information to reach the endpoints of the cluster will be also incremented by the same amount.

kubectl get endpoints www
NAME   ENDPOINTS                                                         AGE
www    10.34.1.59:28888,10.34.1.60:28888,10.34.1.61:28888 + 97 more...   47s

You can check in any of the antrea agent container the entries that the creation of the service has created using the following ovs command line at the antrea-ovs container.

kubectl exec -n kube-system antrea-agent-hbc86 -c antrea-ovs — ovs-ofctl dump-flows br-int table=11 –no-stats | grep 28888
 cookie=0x7030000000000, table=11, priority=200,tcp,reg3=0xa22049a,reg4=0x270d8/0x7ffff actions=ct(commit,table=12,zone=65520,nat(dst=10.34.4.154:28888),exec(load:0x1->NXM_NX_CT_MARK[4],move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
 cookie=0x7030000000000, table=11, priority=200,tcp,reg3=0xa2202da,reg4=0x270d8/0x7ffff actions=ct(commit,table=12,zone=65520,nat(dst=10.34.2.218:28888),exec(load:0x1->NXM_NX_CT_MARK[4],move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
 cookie=0x7030000000000, table=11, priority=200,tcp,reg3=0xa22013c,reg4=0x270d8/0x7ffff actions=ct(commit,table=12,zone=65520,nat(dst=10.34.1.60:28888),exec(load:0x1->NXM_NX_CT_MARK[4],move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))

... skipped

And consequently the graph shows the increment by 100 endpoints that correspond to the ClusterIP www service exposition backed by the 100-replica deployment we created.

Table1 and Table11 in Worker01

Note also how this particular table 11 in charge of Endpoints is synced accross all the nodes because every pod in the cluster should be able to reach any of the endpoints through the clusterIP service. This can be verified changing to All instances and displaying only the ovs_table11. All the nodes in the cluster shows the same value for this particular table.

These are just two examples of tables that are automatically programmed and might be useful to visualize the entries of the OVS switch. Feel free to explore any other panels playing with filters within this dashboard.

As a bonus, you can see at the bottom of the dashboard there is a particular panel that shows the OVS logs. Having all the logs from all the workers in a single place is a great advantage for troubleshooting.

The logs can be filtered out using the instance, ovs_log_level and ovs_category variables at the top of the dashboard. The filters might be very useful to focus only on the relevant information we want to display. Note that, by default, the log level of the OVS container is set to INFO, however, you can increase the log level for debugging purposes to get higher verbosity. (Note: As with any other logging subsystem, increasing log level to DEBUG can adversely impact performance so be careful). To check the current status of the logs

kubectl exec -n kube-system antrea-agent-hbc86 -c antrea-ovs — ovs-appctl vlog/list
                 console    syslog    file
                 -------    ------    ------
backtrace          OFF        ERR       INFO
bfd                OFF        ERR       INFO
bond               OFF        ERR       INFO
bridge             OFF        ERR       INFO
bundle             OFF        ERR       INFO
bundles            OFF        ERR       INFO
cfm                OFF        ERR       INFO
collectors         OFF        ERR       INFO
command_line       OFF        ERR       INFO
connmgr            OFF        ERR       INFO
conntrack          OFF        ERR        DBG
conntrack_tp       OFF        ERR       INFO
coverage           OFF        ERR       INFO
ct_dpif            OFF        ERR       INFO
daemon             OFF        ERR       INFO
... <skipped>
vlog               OFF        ERR       INFO
vswitchd           OFF        ERR       INFO
xenserver          OFF        ERR       INFO

If you want to increase the error level in any of the categories use following command ovs-appctl command. The syntax of the command specifies the subsystem, the log target (console, file or syslog) and the log level where dbg is the maximum level and info is the default for the file logging. As you can guess dbg must be enabled only during a troubleshooting session and disabled afterwards to avoid performance issues. This command is applied only to an specified agent. To restore the configuration of all the subsystems just use the keyworkd ANY instead of specifying a subsystem.

kubectl exec -n kube-system antrea-agent-hbc86 -c antrea-ovs -- ovs-appctl vlog/set dpif:file:info

If you want to repeat and change in all the agents accross the cluster use the following command to add recursiveness through xargs command.

kubectl get pods -n kube-system | grep antrea-agent | awk '{print $1}' | xargs -n1 -I{arg} kubectl exec -n kube-system {arg} -c antrea-ovs -- ovs-appctl vlog/set dpif:file:dbg

Go back to the grafana dashboard and explore some of the entries. As an example the creation of a new pod produces the following log output. As mentioned earlier in this post, the original message has been parsed by FluentBit to extract the relevant fields and is formatted by Grafana to gain some human readability.

Dashboard 3: Agent Conntrack and Proxy Metrics

The next dashboard is the conntrack and is helpful to identify the current sessions and therefore the actual activity in terms of traffic in the cluster. Conntrack is part of every Linux stack and allows the kernel to track all the network connections (a.k.a flows) in order to identify all the packets that belong to a particular flow and provide a consistent treatment. This is useful for allowing or denying packets as part of a stateful firewall subsystem and also for some network translations mechanisms that require this connection tracking to work properly.

The conntrack table has its a maximum size and is important to monitor that the limits are not surpassed to avoid issues. Lets use the traffic generator we installed in the previous post to see how the metrics reacts to new traffic coming into the system. Access the Dashboard Number 3 in Grafana UI. The Conntrack section shows the number of entries (namely connections or flows) in the conntrack table. The second panel calculates the percentage of flows using the Prometheus antrea_agent_conntrack_max_connection_track metric and is useful to monitor how close we are to the limit of the node in terms of established connections.

Conntrack Connection Count

To play around with this metric lets inject some traffic using the Load Generator tool we installed in previous post here. Open the Load Generator and inject some traffic by clicking on New Test and select 5000 users with a spawn rate of 100 users/second and http://frontend.acme-fitness.svc as target endpoint to sent our traffic to. Next, click Start swarming button.

Locust Load-Generator

After some time we can see how we have reached a fair number of request per second and there are no failures.

Locust Load Generator Statistics

Narrow the dispaly interval at the right top corner to show the last 5 or 15 minutes and, after a while you can see how the traffic injection produces a change in the number of connections.

Conntrack entries by node

The Conntrack percentage also follow the same pattern and show a noticeable increase.

Conntrack entries percentage over Conntrack Max

The Percentage is still far from the limit. Let’s push a little bit more simulating a L4 DDoS attack in next section.

Detecting a L4 DDoS attack

To push the networking conntrack stack to the limit, let’s attempt a L4 DDoS attack using the popular hping3 tool. This tool can be useful to generate a big amount of traffic against a target (namely victim) system. We will run it in kubernetes as a deployment to be able to scale in/out to inject even more traffic to see if we can reach the limits of the system. The following manifest will be used to spin up the ddos-atacker pods. Modify the variables to match with the target you want to try. The below ddos-attacker.yaml file will instruct kubernetes to spin up a single replica deployment that will inject traffic to our acme-fitness frontend application listening at port 80. The hping3 command will flood as much TCP_SYN segments as it can send without waiting for confirmation.

vi ddos-attacker.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ddos-attacker
  name: ddos-attacker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ddos-attacker
  template:
    metadata:
      labels:
        app: ddos-attacker
    spec:
      containers:
      - image: sflow/hping3
        name: hping3
        env:
        # Set TARGET_HOST/TARGET_PORT variables to your intended target
        # For Pod use  pod-ip-address.namespace.pod.cluster-domain.local 
        #  Example: 10-34-4-65.acme-fitness.pod.cluster.local
        # For svc use svc_name.namespace.svc.cluster.local
        #  Example : frontend.acme-fitness.svc.cluster.local
        # Flood Mode, use with caution! 
        - name: TARGET_HOST
          value: "frontend.acme-fitness.svc.cluster.local"
          #value: "10-34-1-13.acme-fitness.pod.cluster.local"
        - name: TARGET_PORT
          # if pod use port 3000, if svc, then 80
          value: "80"
        # -S --flood initiates DDoS SYN Flooding attack
        command: ["/usr/sbin/hping3"]
        args: [$(TARGET_HOST), "-p", $(TARGET_PORT), "-S", "--flood"]

Once the manifest is applied the deployment will spin up a single replica. Hovering at the graphic of the top panel you can see the number of connections generated by the hping3 execution. As you can see there are two workers that shows a count at around 70K connections. The hping3 will be able to generate up to 65K connections as per TCP limitations. As you can see below, the panel at the top shows an increase in the connections in two of the workers.

Antrea Conntrack Connection count during single pod ddos attack

As you can guess, these two affected workers must have something to do with the attacker and the victim pods. You can easily verify where are both of them currently scheduled using the command below that in fact, frontend pod (victim) is running on k8s-worker-02 node whereas ddos-attacker pod is running on k8s-worker-06 which makes senses according to the observed behaviour.

kubectl get pods -A -o wide | grep -E “NAME|frontend|ddos”
NAMESPACE              NAME                                           READY   STATUS    RESTARTS       AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
acme-fitness           frontend-6cd56445-bv5vf                        1/1     Running   0              55m     10.34.2.180   k8s-worker-02         <none>           <none>
default                ddos-attacker-6f4688cffd-9rxdf                 1/1     Running   0              2m36s   10.34.1.175   k8s-worker-06         <none>           <none>

Additionaly, the panel below shows the Conntrack connections percentage. Hovering the mouse over the graph you can display a table with actual values for all the measurements represnted in the graphic. The two workers involved in the attack are reaching an usage of 22% of the total available connections.

Antrea Conntrack Connection Percentage during single pod ddos attack

Now scale the ddos deployment to spin up more replicas. Adjust the replica count according to your cluster size. In my case I am using a 6-worker cluster so I will scale out to 5 replicas that will be more than enough.

kubectl scale deployment ddos-attacker –replicas=5
deployment.apps/ddos-attacker scaled

The graph now show how the count usage reach the 100% in k8s-worker-02 that is the node where the victim is running on. The amount of traffic sent to the pod is likely to cause a denial of service since the Antrea Conntrack table will have issues to accomodate new incoming connections.

Antrea Conntrack Connection Percentage reaching 100% after multiple pod ddos attack

The number of connections (panel at the top of the dashboards) shows an overall of +500K connections at node k8s-worker-02 which exceeds the maximum configured conntrack table.

Antrea Conntrack Connection Count after multiple pod ddos attack

As you can imagine, the DDoS attack is also impacting the CPU usage at each of the antrea-agent process. If you go back to dashboard 1 you can check how CPU usage is heavily impacted with values peaking at 150% of CPU usage and how the k8s-worker-02 node is specially struggling during the attack.

CPU Agent Usage during ddos attack

Now stop the attack and wide the display vistualization interval to visualize 6 hours range. If you look retrospectively is easy to identify the ocurrence of the DDoS attacks attemps in the timeline.

DDoS Attack identification

Dashboard 4: Network Policy and Logs

The next dashboard is related with Network Policies and is used to understand what is going on in this topic by means of the related metrics and logs. If there are no policies at all in your cluster you will see a boring dashboard like the one below showing the idle state.

Network policies and logs

As docummented in the Antrea.io website, Antrea supports not only standard K8s NetworkPolicies to secure ingress/egress traffic for Pods but also adds some extra CRDs to provide the administrator with more control over security within the cluster. Using the acme-fitness application we installed in previous post we will play with some policies to see how they are reflected in our grafana dashboard. Lets create firstly an Allow ClusterNetworkPolicy to control traffic going from frontend to catalog microservice that are part of the acme-fitness application under test. See comments in the yaml for further explanation

vi acme-fw-policy_Allow.yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: acnp-acme-allow
spec:
  # The policy with the highest precedence (the smallest numeric priority value) is enforced first. 
  priority: 10
 # Application Tier
  tier: application
 # Enforced at pods matching service=catalog at namespace acme-fitness
  appliedTo:
    - podSelector:
        matchLabels:
          service: catalog
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: acme-fitness
 # Application Direction. Traffic coming from frontend pod at acme-fitness ns
  ingress:
    - action: Allow
      from:
        - podSelector:
            matchLabels:
              service: frontend
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: acme-fitness
 # Application listening at TCP 8082
      ports:
        - protocol: TCP
          port: 8082
      name: Allow-From-FrontEnd
 # Rule hits will be logged in the worker 
      enableLogging: true

Once you apply the above yaml you can verify the new Antrea Cluster Network Policy (in short acnp).

kubectl get acnp
NAME              TIER          PRIORITY   DESIRED NODES   CURRENT NODES   AGE
acnp-acme-allow   application   10         1               1               22s

Return to the cluster and you will see some changes showing up.

Note the policy is calculated and effective only in the worker where the appliedTo matching pods actually exist. In this case the catalog pod lives on the k8s-worker-01 as reflected in the dasboard and verified by the output shown below.

kubectl get pod -n acme-fitness -o wide | grep -E “NAME|catalog”
NAME                              READY   STATUS    RESTARTS      AGE   IP            NODE            NOMINATED NODE   READINESS GATES
catalog-958b9dc7c-brljn           1/1     Running   2 (42m ago)   60m   10.34.6.182   k8s-worker-01   <none>           <none>

If you scale the catalog deployment, then the pods will be scheduled accross more nodes in the cluster and the networkpolicy will be enforced in multiple nodes.

kubectl scale deployment -n acme-fitness catalog –replicas=6
deployment.apps/catalog scaled

Note how, after scaling, the controller recalculates the network policy and it pushed it to 6 of the nodes.

kubectl get acnp
NAME              TIER          PRIORITY   DESIRED NODES   CURRENT NODES   AGE
acnp-acme-allow   application   10         6               6               6m46s

This can be seen in the dashboard as well. Now use the locust tool mentioned earlier to inject some traffic to see network policy hits.

Network Policy Dashboard showing Ingress Network Policies and rule hits

Now change the policy to deny the traffic using Reject action using the following manifest.

vi acme-fw-policy_Reject.yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: acnp-acme-reject
spec:
  # The policy with the highest precedence (the smallest numeric priority value) is enforced first. 
  priority: 10
 # Application Tier
  tier: application
 # Enforced at pods matching service=catalog at namespace acme-fitness
  appliedTo:
    - podSelector:
        matchLabels:
          service: catalog
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: acme-fitness
 # Application Direction. Traffic coming from frontend pod at acme-fitness ns
  ingress:
    - action: Reject
      from:
        - podSelector:
            matchLabels:
              service: frontend
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: acme-fitness
 # Application listening at TCP 8082
      ports:
        - protocol: TCP
          port: 8082
      name: Reject-from-Frontend
 # Rule hits will be logged in the worker 
      enableLogging: true

The Reject action will take precedence over the existing ACNP with Allow action and the hit counter begin to increment progressively.

Feel free to try with egress direction and with drop action to see how the values are populated. A little bit below, you can find a panel with the formatted log entries that are being pushed to Loki by fluent-bit. Expand any entry to see all the fields and values.

Network Policy Log sample

You can use the filters at the top of the dashboard to filter using source or destination IP. You can enter the whole IP or just a some octect since the matching is done using a regex to find the at any part of the IPv4. As an example, the following filter will display logs and logs derived metric where the source IP contains .180 and the destination IP contains .85.

Once the data is entered all the logs are filtered to display only matching entries. This not only affect to the Log Analysis panel but also to the rest of panels derived from the logs and can be useful to drill down and focus only on desired conversations within the cluster.

Log Analysis entries after filtering by SRC/DST

Last but not least, there are two more sections that analyze the received logs and extract some interesting metrics such as Top DestinationIP, Top SourceIPs, Top Converstion, Top Hits by rule and so on. These analytics can be useful to identify most top talkers in the cluster as well as identifying traffic that might be accidentaly denied. Remember this analytics are also affected by the src/dst filters mentioned above.

Allowed Traffic Analytics

Dashboard 5: Antrea Agent Logs

Next dashboard is the Antrea Agent Logs. This dashboard is purely generated from the logs that the Antrea Agent produces at each of the k8s nodes. All the panels can be filtered out using the variables like the log level and the log_category to obtain the desired set of logs and avoid other noisy and less relevant logs.

Antrea Agent Logs Dashboard

As an example lets see the trace left by Antrea when a new node is scheduled in a worker node. First use the log_category filter to select pod_configuration and server subsystems only. Next use following command to create a new 2-replica deployment using apache2 image.

kubectl create deployment www –image=httpd –replicas=2
deployment.apps/www created

Note how the new deployment has produced some logs in the pod_configuration and server log categorie subsystem specifically in k8s-worker-01 and k8s-worker-02

Verify the pods has been actually scheduled in k8s-worker-01 and k8s-worker-02

kubectl get pod -o wide | grep -E “NAME|www”
NAME                       READY   STATUS    RESTARTS     AGE     IP            NODE            NOMINATED NODE   READINESS GATES
www-86d9694b49-m85xl       1/1     Running   0            5m26s   10.34.2.94    k8s-worker-02   <none>           <none>
www-86d9694b49-r6kwq       1/1     Running   0            5m26s   10.34.6.187   k8s-worker-01   <none>           <none>

The logs panel shows the activity created when the kubelet request IP connectivity to the new pod (CmdAdd).

If you need to go deeper in the logs there’s always an option to increase the log verbosity. By default the log is set to the minimum or zero.

kubectl get pods -n kube-system | grep antrea-agent | awk ‘{print $1}’ | xargs -n1 -I{arg} kubectl exec -n kube-system {arg} -c antrea-agent — antctl log-level
0
0
0
0
0
0
0

But, for troubleshooting or research purposes you can change log level to 4 which is the maximum verbosity. You can do it using following single-line command in all the nodes in the cluster.

kubectl get pods -n kube-system | grep antrea-agent | awk '{print $1}' | xargs -n1 -I{arg} kubectl exec -n kube-system {arg} -c antrea-agent -- antctl log-level 4

Note the command is executed silently, so repeat previous command without the log-level keyword to check if the new settings has been applied and you should receive an output of 4 in each of iterations. Now delete the previous www deployment and recreate it again. As you can see in the Antrea Logs panel the same operation now produces much more logs with enriched information.

Increasing the log level can be useful to see conntrack information. If you are interested in flow information, increasing the log verbosity temporarily might be a good option. Lets give a try. Expose the previous www deployment using below command.

kubectl expose deployment www –port=80
service/www exposed

And now create a new curl pod to generate some traffic towards the created service to see if to locate the conntrack flows.

kubectl run mycurlpod –image=curlimages/curl -i –tty — sh
If you don't see a command prompt, try pressing enter.
/ $ / $ curl www.default.svc
<html><body><h1>It works!</h1></body></html>

Using the log_category filter at the top select the conntrack_connections to focus only on flow relevant logs.

For further investigation you can click on the arrow next to the Panel Title and then click on Explore to drill down.

Change the query and add filter to get only logs containing “mycurlpod” and “DestionationPort:80” which corresponds with the flow we are looking for.

And you will get the log entries matching the search filter entered as shown in the following screen.

You can copy the raw log entry for further analysis or forensic / reporting purposes. Its quite easy to identigy Source/Destination IP Addresses, ports and other interesting metadata such as matching ingress and egress security policies affecting this particular traffic.

FlowKey:{SourceAddress:10.34.6.190 DestinationAddress:10.34.2.97 Protocol:6 SourcePort:46624 DestinationPort:80} OriginalPackets:6 OriginalBytes:403 SourcePodNamespace:default SourcePodName:mycurlpod DestinationPodNamespace: DestinationPodName: DestinationServicePortName:default/www: DestinationServiceAddress:10.109.122.40 DestinationServicePort:80 IngressNetworkPolicyName: IngressNetworkPolicyNamespace: IngressNetworkPolicyType:0 IngressNetworkPolicyRuleName: IngressNetworkPolicyRuleAction:0 EgressNetworkPolicyName: EgressNetworkPolicyNamespace: EgressNetworkPolicyType:0 EgressNetworkPolicyRuleName: EgressNetworkPolicyRuleAction:0 PrevPackets:0 PrevBytes:0 ReversePackets:4 ReverseBytes:486 PrevReversePackets:0 PrevReverseBytes:0 TCPState:TIME_WAIT PrevTCPState:}

Once your investigation is done, do not forget to restore the log level to zero to contain the log generation using following filter.

kubectl get pods -n kube-system | grep antrea-agent | awk '{print $1}' | xargs -n1 -I{arg} kubectl exec -n kube-system {arg} -c antrea-agent -- antctl log-level 0

These are just a couple of examples to explore the logs sent by Fluent-Bit shipper in each of the workers. Feel free to explore the dashboards using the embedded filters to see how affect to the displayed information.

Dashboard 6: Antrea Controller Logs

The dashboard 6 displays the activity related to Controller component of Antrea. The controller main function at this moment is to take care of NetworkPolicy implementation. For that reason unless there is networkpolicy changes you won’t see any log or metric expect the Controller Process CPU Seconds which should remain very low (under 2% of usage).

If you want to see some metrics in action you can generate some activity just by pushing some networkpolicies. As an example create a manifest to create cluster network policies and networkpolicies.

vi dummy_policies.yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: test-acnp
  namespace: acme-fitness
spec:
  priority: 10
  tier: application
  appliedTo:
    - podSelector:
        matchLabels:
          service: catalog
  ingress:
    - action: Reject
      from:
        - podSelector:
            matchLabels:
              service: frontend
      ports:
        - protocol: TCP
          port: 8082
      name: AllowFromFrontend
      enableLogging: false
---
apiVersion: crd.antrea.io/v1alpha1
kind: NetworkPolicy
metadata:
  name: test-np
  namespace: acme-fitness
spec:
  priority: 3
  tier: application
  appliedTo:
    - podSelector:
        matchLabels:
          service: catalog
  ingress:
    - action: Allow
      from:
        - podSelector:
            matchLabels:
              service: frontend
      ports:
        - protocol: TCP
          port: 8082
      name: AllowFromFrontend
      enableLogging: true

Using a simple loop like the one below, create and delete recursively the same policy waiting a random number of time between iterations

while true; do kubectl apply -f dummy_policies.yaml; sleep .$[ ( $RANDOM % 500 ) + 1 ]s; kubectl delete -f dummy_policies.yaml; done
clusternetworkpolicy.crd.antrea.io/test-acnp created
networkpolicy.crd.antrea.io/test-np created
clusternetworkpolicy.crd.antrea.io "test-acnp" deleted
networkpolicy.crd.antrea.io "test-np" deleted
clusternetworkpolicy.crd.antrea.io/test-acnp created
networkpolicy.crd.antrea.io/test-np created
clusternetworkpolicy.crd.antrea.io "test-acnp" deleted
networkpolicy.crd.antrea.io "test-np" deleted

Keep the loop running for several minutes and return to grafana UI to check how the panels are populated.

As in the other dashboards, at the bottom you can find the related logs that are being generated by the network policy creation / deletion activity pushed to the cluster.

Stop the loop by pressing CTRL C. As in the others systems generating logs, you can increase the log level by using the same method mentioned earlier. You can try to configure the antrea controller to the maximum verbosity which is 4.

kubectl exec -n kube-system antrea-controller-74799d4774-g6cxt -c antrea-controller -- antctl log-level 4

Repeat the same loop and you will find a dramatic increase in number of messages being logged. At the Logs Per Second Panel at the upper right corner you can notice an increasing in the logs per second rate from 6 to a steady rate of +120 logs per second.

When finished do not forget to restore log level to 0

kubectl exec -n kube-system antrea-controller-74799d4774-g6cxt -c antrea-controller -- antctl log-level 0

This concludes this series of Observability on Antrea. I hope it has served to shed light on a crucial component in any Kubernetes cluster such as the CNI and that it allows not only to properly understand the operation from the perspective of networks, but also to enhance the capabilities of analysis and troubleshooting.

Antrea Observability Part 2: Installing Grafana, Prometheus, Loki and Fluent-Bit

Who does not love to watch a nice dashboard full of colors? Observing patterns and real-time metrics in a time series might make us sit in front of a screen as if hypnotized for hours. But apart from the inherent beauty of dashboards, they provide observability, which is a crucial feature for understanding the performance of our applications and also a very good tool for predicting future behavior and fixing existing problems.

There is a big ecosystem out there with plenty of tools to create a logging pipeline that collect, parse, process, enrich, index, analyze and visualize logs. In this post we will focus on a combination that is gaining popularity for log Analysis that is based on FluentBit, Loki and Grafana as shown below. On the other hand we will use Prometheus for metric collection.

Opensource Observability Stack

Let’s build the different blocks starting by the visualization tool.

Installing Grafana as visualization platform

Grafana is a free software based on Apache 2.0 license, which allows us to visualize data collected from various sources such as Prometheus, InfluxDB or Telegraph, tools that collect data from our infrastructure, such as CPU usage, memory, or network traffic of a virtual machine, a Kubernetes cluster, or each of its containers.

The real power of Grafana lies in the flexibility to create as many dashboards as we need with very smart visualization grapsh where we can format this data and represent it as we want. We will use Grafana as main tool for adding observatility capabilities to Antrea which is the purpose of this series of posts.

To carry out the installation of grafana we will rely on the official helm charts. The first step, therefore, would be to add the grafana repository so that helm can access it.

helm repo add grafana https://grafana.github.io/helm-charts

Once the repository has been added we can broswe it. We will use the latest available release of chart to install the version 9.2.4 of Grafana.

helm search repo grafana
NAME                                    CHART VERSION   APP VERSION             DESCRIPTION                                       
grafana/grafana                         6.43.5          9.2.4                   The leading tool for querying and visualizing t...
grafana/grafana-agent-operator          0.2.8           0.28.0                  A Helm chart for Grafana Agent Operator           
grafana/enterprise-logs                 2.4.2           v1.5.2                  Grafana Enterprise Logs                           
grafana/enterprise-logs-simple          1.2.1           v1.4.0                  DEPRECATED Grafana Enterprise Logs (Simple Scal...
grafana/enterprise-metrics              1.9.0           v1.7.0                  DEPRECATED Grafana Enterprise Metrics             
grafana/fluent-bit                      2.3.2           v2.1.0                  Uses fluent-bit Loki go plugin for gathering lo...
grafana/loki                            3.3.2           2.6.1                   Helm chart for Grafana Loki in simple, scalable...
grafana/loki-canary                     0.10.0          2.6.1                   Helm chart for Grafana Loki Canary                
grafana/loki-distributed                0.65.0          2.6.1                   Helm chart for Grafana Loki in microservices mode 
grafana/loki-simple-scalable            1.8.11          2.6.1                   Helm chart for Grafana Loki in simple, scalable...
grafana/loki-stack                      2.8.4           v2.6.1                  Loki: like Prometheus, but for logs.              
grafana/mimir-distributed               3.2.0           2.4.0                   Grafana Mimir                                     
grafana/mimir-openshift-experimental    2.1.0           2.0.0                   Grafana Mimir on OpenShift Experiment             
grafana/oncall                          1.0.11          v1.0.51                 Developer-friendly incident response with brill...
grafana/phlare                          0.1.0           0.1.0                   🔥 horizontally-scalable, highly-available, mul...
grafana/promtail                        6.6.1           2.6.1                   Promtail is an agent which ships the contents o...
grafana/rollout-operator                0.1.2           v0.1.1                  Grafana rollout-operator                          
grafana/synthetic-monitoring-agent      0.1.0           v0.9.3-0-gcd7aadd       Grafana's Synthetic Monitoring application. The...
grafana/tempo                           0.16.3          1.5.0                   Grafana Tempo Single Binary Mode                  
grafana/tempo-distributed               0.27.5          1.5.0                   Grafana Tempo in MicroService mode                
grafana/tempo-vulture                   0.2.1           1.3.0                   Grafana Tempo Vulture - A tool to monitor Tempo...

Any helm chart includes configuration options to customize the setup by passing a configuration file that helm will use when deploying our release. We can research in the documentation to understand what all this possible helm chart values really means and how affect the final setup. Sometimes it is useful to get a file with all the default configuration values and personalize as requiered. To get the default values associated with a helm chart just use the following command.

helm show values grafana/grafana > default_values.yaml

Based on the default_values.yaml we will create a customized and reduced version and we will save in a new values.yaml file with some modified values for our custom configuration. You can find the full values.yaml here. The first section enables data persistence by creating a PVC that will use the vsphere-sc storageClass we created in this previous post to leverage vSphere Container Native Storage capabilities to provision persistent volumes. Adjust the storageClassName as per your setup.

vi values.yaml
# Enable Data Persistence
persistence:
  type: pvc
  enabled: true
  storageClassName: vsphere-sc
  accessModes:
    - ReadWriteOnce
  size: 10Gi

The second section enables the creation of sidecars containers that allow us to import grafana configurations such as datasources or dashboards through configmaps, this will be very useful to deploy Grafana fully configured in an automated way without user intervention through the graphical interface. With this settings applied, any configmap in the grafana namespace labeled with grafana_dashboard=1 will trigger the import of dashboard. Similarly, any configmaps labeled with grafana_datasource=1 will trigger the import of the grafana datasource.

vi values.yaml (Sidecars Section)
# SideCars Section
# Enable Sidecars containers creationfor dashboards and datasource import via configmaps
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    labelValue: "1"
  datasources:
    enabled: true
    label: grafana_datasource
    labelValue: "1"
    

The last section defines how to expose Grafana graphical interface externally. We will use a kubernetes service type LoadBalancer for this purpose. In my case I will use AVI as the ingress solution for our cluster so the load balancer will be created in the service engine. Feel free to use any other external LoadBalancer solution if you want.

vi values.yaml (service expose section)
# Define how to expose the service
service:
  enabled: true
  type: LoadBalancer
  port: 80
  targetPort: 3000
  portName: service

The following command creates the namespace grafana and installs the grafana/grafana chart named grafana in the grafana namespace taking the values.yaml configuration file as the input. After successful deployment, the installation gives you some hints for accessing the application, e.g. how to get the credentials, which are stored in a secret k8s object. Ignore any warning about PSP you might get.

helm install grafana –create-namespace grafana -n grafana grafana/grafana -f values.yaml
Release "grafana" has been installed. Happy Helming!
NAME: grafana
LAST DEPLOYED: Mon Dec 26 18:42:05 2022
NAMESPACE: grafana
STATUS: deployed
REVISION: 2
NOTES:
1. Get your 'admin' user password by running:

   kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:

   grafana.grafana.svc.cluster.local

   Get the Grafana URL to visit by running these commands in the same shell:
     export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
     kubectl --namespace grafana port-forward $POD_NAME 3000

3. Login with the password from step 1 and the username: admin

As explained in the notes after helm installation, the first step is to get the plaintext password that will be used to authenticate the default admin username in the Grafana UI.

kubectl get secret –namespace grafana grafana -o jsonpath=”{.data.admin-password}” | base64 –decode ; echo
wFCT81uGC7ij5Sv1rTIuf2CwQa5Y9xkGQSixDKOx

Veryfing Grafana Installation

Before moving to the Grafana UI let’s explore created kubernetes resources and their status.

kubectl get all -n grafana
NAME                           READY   STATUS    RESTARTS   AGE
pod/grafana-7d95c6cf8c-pg5dw   3/3     Running   0          24m

NAME              TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
service/grafana   LoadBalancer   10.100.220.164   10.113.3.106   80:32643/TCP   24m

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana   1/1     1            1           24m

NAME                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/grafana-7d95c6cf8c   1         1         1       24m

The chart has created a deployment with 3 grafana replica pods that are in running status. Note how the LoadBalancer service has already allocated the external IP address 10.113.3.106 to provide outside reachability. As mentioned earlier, if you have a LoadBalancer solution such as AVI with his AKO operator deployed in your setup you will see that a new Virtual Service has been created and it’s ready to use as shown below:

Now you can open your browser and type the IP Address. AVI also register in its internal DNS the new LoadBalancer objects that the developer creates in kubernetes. In this specific setup an automatic FQDN is created and grafana should be available from your browser at http://grafana.grafana.avi.sdefinitive.net. As specified in the LoadBalancer section in the values.yaml at deployment type, the grafana GUI will be exposed on port 80. For security purpose is strongly recommended to use a Secure Ingress object instead if you are planning to deploy in production.

Grafana GUI Welcome Page

This way we would have finished the installation of Grafana visualization tool. Let’s move now to install another important piece in observability in charge of retrieving metrics which is Prometheus.

Prometheus for metric collection

Prometheus was created to monitor highly dynamic environments, so over the past years it has become the mainstream monitoring tool of choice in container and microservices world. Modern devops is becoming more and more complex to handle manually and there is a need for automation. Imagine a complex infrastructure with loads of servers distributed over many locations and you have no insight of what is happening in terms of errors, latency, usage and so on. In a modern architecture there are more things than can go wrong when you have tons of dynamic and ephemeral services and applications and any of them can crash and cause failure of other services. This is why is crucial to avoid manual intervention and allow the administrator to quickly identify and fix any potential problem or degradation of the system.

The prometheus architecture is represented in the following picture that has been taken from the official prometheus.io website.

Prometheus architecture and its ecosystem

The heart componentes of the prometheus server are listed below

  • Data Retrieval Worker.- responsible for fetching time series data from a particular data source, such as a web server or a database, and converting it into the Prometheus metric format.
  • Time Series Data Base (TSDB).- used to store, manage, and query time-series data.
  • HTTP Server.- responsible for exposing the Prometheus metrics endpoint, which provides access to the collected metrics data for monitoring and alerting purposes.

The first step would be enabling Prometheus

Installing Prometheus

Prometheus requires access to Kubernetes API resources for service discovery, access to the Antrea Metrics Listener and some configuration to instruct the Data Retrieval Worker to scrape the required metrics in both Agent and Controller components. There are some manifests in the Antrea website ready to use to save some time with all the scraping and job configurations of Prometheus. Lets apply the provided manifest as a first step.

kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-prometheus.yml
namespace/monitoring created
serviceaccount/prometheus created
secret/prometheus-service-account-token created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
configmap/prometheus-server-conf created
deployment.apps/prometheus-deployment created
service/prometheus-service create

As you can see in the output the manifest include all required kubernetes objects including permissions, configurations and lastly the prometheus server itself. The manifest deploy all the resources in a dedicated monitoring namespace.

kubectl get all -n monitoring
NAME                                         READY   STATUS    RESTARTS   AGE
pod/prometheus-deployment-57d7b4c6bc-jx28z   1/1     Running   0          42s

NAME                         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/prometheus-service   NodePort   10.101.177.47   <none>        8080:30000/TCP   42s

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-deployment   1/1     1            1           42s

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-deployment-57d7b4c6bc   1         1         1       42s

Feel free to explore all created resources. As you can tell there is a new service is running as a NodePort type so you should be able to reach the Prometheus server using any of your workers IP addresses that listen at the static 30000 port. Alternatively you can always use port-forward method to redirect a local port to the service listening at 8080. Open a browser to verify you can access to the HTTP Server component of Prometheus.

Prometheus configuration will retrieve not only Antrea related metrics but also some built in kubernetes metrics. Just for fun type “api” in the search box and you will see dozens of metrics available.

Prometheus Server

Now we are sure Prometheus Server is running and is able to scrape metrics succesfully, lets move into our area of interest that is Antrea CNI.

Enabling Prometheus Metrics in Antrea

The first step is to configure Antrea to generate Prometheus metrics. As we explain in the previous post here we are using Helm to install Antrea so the better way to change the configuration of the Antrea setup is by using the values.yaml file and redeploying the helem chart. As you can see we are enabling also the FlowExporter featuregate. This is a mandatory setting to allow conntrack flows related metrics to get updated. Edit the values.yaml

vi values.yaml
# -- Container image to use for Antrea components.
image:
  tag: "v1.10.0"

enablePrometheusMetrics: true

featureGates:
   FlowExporter: true
   AntreaProxy: true
   TraceFlow: true
   NodePortLocal: true
   Egress: true
   AntreaPolicy: true

Deploy a new release of antrea helm chart taking the values.yaml file as in input. Since the chart is already deployed, now we need to use upgrade keyword instead of install as did the first time.

helm upgrade -f values.yaml antrea antrea/antrea -n kube-system
Release "antrea" has been upgraded. Happy Helming!
NAME: antrea
LAST DEPLOYED: Wed Jan 18 13:53:56 2023
NAMESPACE: kube-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
The Antrea CNI has been successfully installed

You are using version 1.10.0

For the Antrea documentation, please visit https://antrea.io

Now with antctl command verify the feature gates has been enabled as expected

antctl get featuregates
Antrea Agent Feature Gates
FEATUREGATE              STATUS         VERSION   
Traceflow                Enabled        BETA      
AntreaIPAM               Disabled       ALPHA     
Multicast                Disabled       ALPHA     
AntreaProxy              Enabled        BETA      
Egress                   Enabled        BETA      
EndpointSlice            Disabled       ALPHA     
ServiceExternalIP        Disabled       ALPHA     
AntreaPolicy             Enabled        BETA      
Multicluster             Disabled       ALPHA     
FlowExporter             Enabled        ALPHA     
NetworkPolicyStats       Enabled        BETA      
NodePortLocal            Enabled        BETA      

Antrea Controller Feature Gates
FEATUREGATE              STATUS         VERSION   
AntreaPolicy             Enabled        BETA      
NetworkPolicyStats       Enabled        BETA      
NodeIPAM                 Disabled       ALPHA     
Multicluster             Disabled       ALPHA     
Egress                   Enabled        BETA      
Traceflow                Enabled        BETA      
ServiceExternalIP        Disabled       ALPHA     

As seen before Prometheus has some built-in capabilities to browse and visualize metrics however we want this metric to be consumed from the powerful Grafana that we have installed earlier. Lets access to the Grafana console and click on the Gear icon to add the new Prometheus Datasource.

Grafana has been deployed in the same cluster in a different namespace. The prometheus URL is derived from <service>.<namespace>.svc.<port>.domain. In our case the URL is http://prometheus-service.monitoring.svc:8080. If your accessing Prometheus from a different cluster ensure you use the FQDN adding the corresponding domain at the end of the URL (by default cluster.local).

Click on Save & Test blue button at the bottom of the screen and you should see a message indicating the Prometheus server is reachable and working as expected.

Now click on the compass button to verify that Antrea metrics are being populated and are reachable from Grafana for visualization. Select new added Prometheus as Datasource at the top.

Pick up any of the available antrea agent metrics (I am using here antrea_agent_local_pod_count as an example) and you should see the visualization of the gathered metric values in the graph below.

That means Prometheus datasource is working and Antrea is populating metrics successfully. Let’s move now into the next piece Loki.

Installing Loki as log aggregator platform

In the last years the area of log management has been clearly dominated by the Elastic stack becoming the de-facto standard whereas Grafana has maintained a strong position in terms of visualization of metrics from multiple data sources, among which prometheus stands out.

Lately a very popular alternative for log management is Grafana Loki. The Grafana Labs website describes Loki as a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream.

For that reason we will use Loki as a solution for aggregating the logs we got from kubernetes pods. We will focus on Antrea related pods but it can be used with a wider scope for the rest of applications ruunning in your cluster.

In the same way that we did in the Grafana installation, we will use the official helm chart of the product to proceed with the installation. This time it is not necessary to install a new helm repository because the Loki chart is already included in the grafana repo. As we did with Grafana helm chart, the first step will be to obtain the configuration file associated to the helm chart that we will use to be able to customize our installation.

helm show values grafana/loki > default_values.yaml

Using this file as a reference, create a reduced and customized values.yaml file with some modified configuration. As a reminder, any setting not explicitly mentioned in the reduced values.yaml file will take the default values. Find the values.yaml file I am using here.

For a production solution it is highly recommended to install Loki using the scalable architecture. The scalable architecture requires a managed object store such as AWS S3 or Google Cloud Storage but, if you are planning to use it on-premises, a very good choice is to use a self-hosted store solution such as the popular MinIO. There is a previous post explaining how to deploy a a MinIO based S3-like storage platform based on vSAN here. In case you are going with MinIO Operator, as a prerequisite before installing Loki, you would need to perform following tasks.

The following script will

  • Create a new MinIO tenant. We will use a new tenant called logs in the namespace logs.
  • Obtain the S3 endpoint that will be used to interact with your S3 storage via API. I am using here the internal ClusterIP endpoint but feel free to use the external FQDN if you want to expose it externally. In that case you would use the built-in kubernetes naming convention for pods and services as explained here and, that it would be something like minio.logs.svc.cluster.local.
  • Obtain the AccessKey and SecretAccessKey. By default a set of credentials are generated upon tenant creation. You can extract them from the corresponding secrets easily or just create a new set of credentials using Minio Tenant Console GUI.
  • Create the required buckets. You can use console or mc tool as well.

Considering you are using MinIO operator, the following script will create required tenant and buckets. Copy and Paste the contents in the console or create a sh file and execute it using bash. Adjust the tenant settings in terms of servers, drives and capacity match with your environment. All the tenant objects will be placed in the namespace logs.

vi create-loki-minio-tenant.sh
#!/bin/bash

TENANT=logs
BUCKET1=chunks
BUCKET2=ruler
BUCKET3=admin

# CREATE NAMESPACE AND TENANT 6 x 4 drives for raw 50 G, use Storage Class SNA
# -------------------------------------------------------------------------------
kubectl create ns $TENANT
kubectl minio tenant create $TENANT --servers 6 --volumes 24 --capacity 50G --namespace $TENANT --storage-class vsphere-sna --expose-minio-service --expose-console-service

# EXTRACT CREDENTIALS FROM CURRENT TENANT AND CREATE SECRET 
# ---------------------------------------------------------
echo "MINIO_S3_ENDPOINT=https://minio.${TENANT}.svc.cluster.local" > s3vars.env
echo "MINIO_S3_BUCKET1=${BUCKET1}" >> s3vars.env
echo "MINIO_S3_BUCKET2=${BUCKET2}" >> s3vars.env
echo "MINIO_S3_BUCKET3=${BUCKET3}" >> s3vars.env
echo "SECRET_KEY=$(kubectl get secrets -n ${TENANT} ${TENANT}-user-1 -o jsonpath="{.data.CONSOLE_SECRET_KEY}" | base64 -d)" >> s3vars.env
echo "ACCESS_KEY=$(kubectl get secrets -n ${TENANT} ${TENANT}-user-1 -o jsonpath="{.data.CONSOLE_ACCESS_KEY}" | base64 -d)" >> s3vars.env

kubectl create secret generic -n $TENANT loki-s3-credentials --from-env-file=s3vars.env

Once the tenant is created we can proceed with bucket creation. You can do it manually via console or mc client or using following yaml file used to define a job that will create the required buckets as shown here. Basically it will wait untill the tenant is initialized and then it will created the three required buckets as per the secret loki-s3-credentials injected variables.

vi create-loki-minio-buckets-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: create-loki-minio-buckets
  namespace: logs
spec:
  template:
    spec:
      containers:
      - name: mc
       # loki-s3-credentials contains $ACCESS_KEY, $SECRET_KEY, $MINIO_S3_ENDPOINT, $MINIO_S3_BUCKET1-3
        envFrom:
        - secretRef:
            name: loki-s3-credentials
        image: minio/mc
        command: 
          - sh
          - -c
          - ls /tmp/error > /dev/null 2>&1 ; until [[ "$?" == "0" ]]; do sleep 5; echo "Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs"; mc alias set s3 $(MINIO_S3_ENDPOINT) $(ACCESS_KEY) $(SECRET_KEY) --insecure; done && mc mb s3/$(MINIO_S3_BUCKET1) --insecure; mc mb s3/$(MINIO_S3_BUCKET2) --insecure; mc mb s3/$(MINIO_S3_BUCKET3) --insecure
      restartPolicy: Never
  backoffLimit: 4

Verify job execution is completed displaying the logs created by the pod in charge of completing the defined job.

kubectl logs -n logs create-loki-minio-buckets
mc: <ERROR> Unable to initialize new alias from the provided credentials. The Access Key Id you provided does not exist in our records.          
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                             
Added `s3` successfully.                                                                                                                         
Bucket created successfully `s3/chunks`.
Bucket created successfully `s3/ruler`.
Bucket created successfully `s3/admin`.

Now the S3 storage requirement is fullfiled. Lets move into the values.yaml file that will be used as the configuration source for our Loki deployment. The first section provides some general configuration options including the data required to access the shared S3 store. Replace s3 attributes with your particular settings.

vi values.yaml
loki:
  auth_enabled: false
  storage_config:
    boltdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/index_cache
      resync_interval: 5s
      shared_store: s3
  compactor:
    working_directory: /var/loki/compactor
    shared_store: s3
    compaction_interval: 5m
  storage:
    bucketNames:
      chunks: chunks
      ruler: ruler
      admin: admin
    type: s3
    s3:
      s3: 
      endpoint: https://minio.logs.svc.cluster.local:443
      region: null
      secretAccessKey: YDLEu99wPXmAAFyQcMzDwDNDwzF32GnS8HhHBuoD
      accessKeyId: ZKYLET51JWZ8LXYYJ0XP
      s3ForcePathStyle: true
      insecure: true
      
  # 
  querier:
    max_concurrent: 4096
  #
  query_scheduler:
    max_outstanding_requests_per_tenant: 4096

# Configuration for the write
# <continue below...>

Note. If you used the instructions above to create the MinIO Tenant you can extract the S3 information from the plaintext s3vars.env variable. You can also extract from the secret logs-user-1. Remember to delete the s3vars.env file after usage as it may reveal sensitive information.

cat s3vars.env
MINIO_S3_ENDPOINT=https://minio.logs.svc.cluster.local
MINIO_S3_BUCKET1=chunks
MINIO_S3_BUCKET2=ruler
MINIO_S3_BUCKET3=admin
SECRET_KEY=YDLEu99wPXmAAFyQcMzDwDNDwzF32GnS8HhHBuoD
ACCESS_KEY=ZKYLET51JWZ8LXYYJ0XP

When object storage is configured, the helm chart configures Loki to deploy read and write targets in high-availability fashion running 3 replicas of each independent process. It will use a storageClass able to provide persistent volumes to avoid losing data in case of the failure of the application. Again, I am using here a storage class called vsphere-sc that is backed by vSAN and accesed by a CSI driver. If you want to learn how to provide data persistence using vSphere and vSAN check a previous post here.

vi values.yaml (Storage and General Section)

# Configuration for the write
write:
  persistence:
    # -- Size of persistent disk
    size: 10Gi
    storageClass: vsphere-sc
# Configuration for the read node(s)
read:
  persistence:
    # -- Size of persistent disk
    size: 10Gi
    storageClass: vsphere-sc
    # -- Selector for persistent disk

# Configuration for the Gateway
# <continue below..>

Additionally, the chart installs the gateway component which is an NGINX that exposes Loki’s API and automatically proxies requests to the correct Loki components (read or write in our scalable setup). If you want to reach Loki from the outside (e.g. other clusters) you must expose it using any kubernetes methods to gain external reachability. In this example I am using a LoadBalancer but feel free to explore further options in the defaults_values.yaml such as a secure Ingress. Remember when the gateway is enabled, the visualization tool (Grafana) as well as the log shipping agents (Fluent-Bit) should be configured to use the gateway as endpoint.

vi values.yaml (Gateway Section)

# Configuration for the Gateway
gateway:
  # -- Specifies whether the gateway should be enabled
  enabled: true
  # -- Number of replicas for the gateway
  service:
    # -- Port of the gateway service
    port: 80
    # -- Type of the gateway service
    type: LoadBalancer
  # Basic auth configuration
  basicAuth:
    # -- Enables basic authentication for the gateway
    enabled: false
    # -- The basic auth username for the gateway

The default chart will install another complementary components in Loki called canary and backend. Loki canary component is fully described here. Basically it is used to audit the log-capturing performance of Loki by generating artificial log lines.

Once the values.yaml file is completed we can proceed with the installation of the helm chart using following command. I am installing loki in the a namespace named loki.

helm install loki –create-namespace loki -n loki grafana/loki -f values.yaml
Release "loki" has been installed. Happy Helming!
NAME: loki
LAST DEPLOYED: Wed Feb 14 12:18:46 2023
NAMESPACE: loki
STATUS: deployed
REVISION: 1
NOTES:
***********************************************************************
 Welcome to Grafana Loki
 Chart version: 4.4.1
 Loki version: 2.7.2
***********************************************************************

Installed components:
* grafana-agent-operator
* gateway
* read
* write
* backend

Now we are done with Loki installation in a scalable and distributed architecture and backed by a MinIO S3 storage, let’s do some verifications to check everything is running as expected.

Verifying Loki Installation

As a first step, explore the kubernetes objects that the loki chart has created.

kubectl get all -n loki
NAME                                               READY   STATUS    RESTARTS   AGE
pod/loki-backend-0                                 1/1     Running   0          2m13s
pod/loki-backend-1                                 1/1     Running   0          2m49s
pod/loki-backend-2                                 1/1     Running   0          3m37s
pod/loki-canary-4xkfw                              1/1     Running   0          5h42m
pod/loki-canary-crxwt                              1/1     Running   0          5h42m
pod/loki-canary-mq79f                              1/1     Running   0          5h42m
pod/loki-canary-r76pz                              1/1     Running   0          5h42m
pod/loki-canary-rclhj                              1/1     Running   0          5h42m
pod/loki-canary-t55zt                              1/1     Running   0          5h42m
pod/loki-gateway-574476d678-vkqc7                  1/1     Running   0          5h42m
pod/loki-grafana-agent-operator-5555fc45d8-rcs59   1/1     Running   0          5h42m
pod/loki-logs-25hvr                                2/2     Running   0          5h42m
pod/loki-logs-6rnmt                                2/2     Running   0          5h42m
pod/loki-logs-72c2w                                2/2     Running   0          5h42m
pod/loki-logs-dcwkb                                2/2     Running   0          5h42m
pod/loki-logs-j6plp                                2/2     Running   0          5h42m
pod/loki-logs-vgqqb                                2/2     Running   0          5h42m
pod/loki-read-598f8c5cd5-dqtqt                     1/1     Running   0          2m59s
pod/loki-read-598f8c5cd5-fv6jq                     1/1     Running   0          2m18s
pod/loki-read-598f8c5cd5-khmzw                     1/1     Running   0          3m39s
pod/loki-write-0                                   1/1     Running   0          93s
pod/loki-write-1                                   1/1     Running   0          2m28s
pod/loki-write-2                                   1/1     Running   0          3m33s

NAME                            TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)             AGE
service/loki-backend            ClusterIP      10.98.167.78     <none>         3100/TCP,9095/TCP   5h42m
service/loki-backend-headless   ClusterIP      None             <none>         3100/TCP,9095/TCP   5h42m
service/loki-canary             ClusterIP      10.97.139.4      <none>         3500/TCP            5h42m
service/loki-gateway            LoadBalancer   10.111.235.35    10.113.3.104   80:30251/TCP        5h42m
service/loki-memberlist         ClusterIP      None             <none>         7946/TCP            5h42m
service/loki-read               ClusterIP      10.99.220.81     <none>         3100/TCP,9095/TCP   5h42m
service/loki-read-headless      ClusterIP      None             <none>         3100/TCP,9095/TCP   5h42m
service/loki-write              ClusterIP      10.102.132.138   <none>         3100/TCP,9095/TCP   5h42m
service/loki-write-headless     ClusterIP      None             <none>         3100/TCP,9095/TCP   5h42m

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/loki-canary   6         6         6       6            6           <none>          5h42m
daemonset.apps/loki-logs     6         6         6       6            6           <none>          5h42m

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/loki-gateway                  1/1     1            1           5h42m
deployment.apps/loki-grafana-agent-operator   1/1     1            1           5h42m
deployment.apps/loki-read                     3/3     3            3           5h42m

NAME                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/loki-gateway-574476d678                  1         1         1       5h42m
replicaset.apps/loki-grafana-agent-operator-5555fc45d8   1         1         1       5h42m
replicaset.apps/loki-read-598f8c5cd5                     3         3         3       3m40s
replicaset.apps/loki-read-669c9d7689                     0         0         0       5h42m
replicaset.apps/loki-read-6c7586fdc7                     0         0         0       11m

NAME                            READY   AGE
statefulset.apps/loki-backend   3/3     5h42m
statefulset.apps/loki-write     3/3     5h42mNAME                                               

A simple test you can do to verify gateway status is by “curling” the API endpoint exposed in the allocated external IP. The OK response would indicate the service is up and ready to receive requests.

curl http://10.113.3.104/ ; echo
OK

Now access grafana console and try to add Loki as a DataSource. Click on the Engine button at the bottom of the left bar and then click on Add Data Source blue box.

Adding Loki as Datasource in Grafana (1)

Click on Loki to add the required datasource type.

Adding Loki as Datasource in Grafana (2)

In this setup Grafana and Loki has been deployed in the same cluster so we can use as URL the internal FQDN corresponding to the loki-gateway ClusterIP service. In case you are accesing from the outside you need to change that to the external URL (e.g. http://loki-gateway.loki.avi.sdefinitive.net in my case).

Adding Loki as Datasource in Grafana (3)

Click on “Save & test” button and if the attempt of Grafana to reach Loki API endpoint is successful, it should display the green tick as shown below.

Another interesting verification would be to check if the MinIO S3 bucket is getting the logs as expected. Open the MinIO web console and access to the chunks bucket wich is the target to write the logs Loki receives. You should see how the bucket is receiving new objects and how the size is increasing.

You may wonder who is sending this data to Loki at this point since we have not setup any log shipper yet. The reason behind is that, by default, when you deploy Loki using the official chart, a sub-chart with Grafana Agent is also installed to enable self monitoring. Self monitoring settings determine whether Loki should scrape it’s own logs. It will create custom resources to define how to scrape it’s own logs. If you are curious about it explore (i.e kubectl get) GrafanaAgent, LogsInstance, and PodLogs CRDs objects created in the Loki namespace to figure out how this is actually pushing self-monitoring logs into Loki.

To verify what are this data being pushed into MinIO S3 bucket, you can explore Loki Datasource through Grafana. Return to the Grafana GUI and try to show logs related to a loki component such as the loki-gateway pod. Click on the compass icon at the left to explore the added datasource. Now filter using job as key label and select the name of the loki/loki-gateway as value label as shown below. Click on Run Query blue button on the top right corner next to see what happens.

Displaying Loki logs at Grafana (1)

Et voila! If everything is ok you should see how logs are successfully being shipped to Loki by its internal self monitoring Grafana agent.

Displaying Loki logs at Grafana (2)

Now that our log aggregator seems to be ready let’s move into the log shipper section.

Installing Fluent-Bit

Fluent-Bit is an log shipper based in a open source software designed to cover highly distributed environments that demand high performance but keeping a very light footprint.

The main task of Fluent-bit in this setup is watch for changes in any interesting log file and send any update in that file to Loki in a form of a new entry. We will focus on Antrea related logs only so far but you can extend the input of Fluent-bit to a wider scope in order to track other logs of your OS.

Again we will rely in helm to proceed with the installation. This time we need to add a new repository maintaned by fluent.

helm repo add fluent https://fluent.github.io/helm-charts

Explore the repo to see what is.

helm search repo fluent
NAME                    CHART VERSION   APP VERSION     DESCRIPTION                                       
fluent/fluent-bit       0.22.0          2.0.8           Fast and lightweight log processor and forwarde...
fluent/fluentd          0.3.9           v1.14.6         A Helm chart for Kubernetes                       

As we did before, create a reference yaml file with default configuration values of the chart.

helm show values fluent/fluent-bit > default_values.yaml

Using default_values.yaml as a template, create a new values.yaml file that will contain the desired configuration. The main piece of the values.yaml file resides on the config section. You can customize how the logs will be treated in a secuencial fashion creating a data pipeline scheme as depicted here.

FluentBit DataPipeline

The full documentation is maintained in the Fluent-Bit website here, but in a nutshell the main subsections we will use to achieve our goal are:

  • SERVICE.- The service section defines global properties of the service, including additional parsers to adapt the data found in the logs.
  • INPUT.- The input section defines a source that is associated to an input plugin. Depending on the selected input plugin you will have extra configuration keys. In this case we are using the tail input plugin that capture any new line of the watched files (Antrea logs in this case). This section is also used to tag the captured data for classification purposes in later stages.
  • PARSER.- This section is used to format or parse any information present on records such as extracting fields according to the position of the information in the log record.
  • FILTER.- The filter section defines a filter that is associated with a filter plugin. In this case we will use the kubernetes filter to be able enrich our log files with Kubernetes metadata.
  • OUTPUT.- The output section section specify a destination that certain records should follow after a Tag match. We would use here Loki as target.

We will use following values.yaml file. A more complex values file including parsing and regex can be found in an specific section of the next post here.

vi values.yaml
env: 
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush {{ .Values.flush }}
        Log_Level {{ .Values.logLevel }}
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.metricsPort }}
        Health_Check On

  ## https://docs.fluentbit.io/manual/pipeline/inputs
  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/antrea*.log
        multiline.parser docker, cri
        Tag antrea.*
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
        
  ## https://docs.fluentbit.io/manual/pipeline/filters
  ## First filter Uses a kubernetes filter plugin. Match antrea tag. Use K8s Parser
  ## Second filter enriches log entries with hostname and node name 
  filters: |
    [FILTER]
        Name kubernetes
        Match antrea.*
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        
    [FILTER]
        Name record_modifier
        Match antrea.*
        Record podname ${HOSTNAME}
        Record nodename ${NODE_NAME}

  ## https://docs.fluentbit.io/manual/pipeline/outputs
  ## Send the matching data to loki adding a label
  outputs: |
    [OUTPUT]
        Name loki
        Match antrea.*
        Host loki-gateway.loki.svc
        Port 80
        Labels job=fluentbit-antrea

Create the namespace fluent-bit where all the objects will be placed.

kubectl create ns fluent-bit

And now proceed with fluent-bit chart installation.

helm install fluent-bit -n fluent-bit fluent/fluent-bit -f values.yaml
NAME: fluent-bit
LAST DEPLOYED: Wed Jan 11 19:21:20 2023
NAMESPACE: fluent-bit
STATUS: deployed
REVISION: 1
NOTES:
Get Fluent Bit build information by running these commands:

export POD_NAME=$(kubectl get pods --namespace fluent-bit -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace fluent-bit port-forward $POD_NAME 2020:2020
curl http://127.0.0.1:2020

Verifying Fluent-Bit installation

As suggested by the highlighted output of the previous chart installlation you can easily try to reach the fluent-bit API that is listening at port TCP 2020 using a port-forward. Issue the port-forward action and try to curl to see if the service is accepting the GET request. The output indicates the service is ready and you get some metadata such as flags, and version associated with the running fluent-bit pod.

curl localhost:2020 -s | jq
jhasensio@forty-two:~/ANTREA$ curl localhost:2020 -s | jq
{
  "fluent-bit": {
    "version": "2.0.8",
    "edition": "Community",
    "flags": [
      "FLB_HAVE_IN_STORAGE_BACKLOG",
      "FLB_HAVE_CHUNK_TRACE",
      "FLB_HAVE_PARSER",
      "FLB_HAVE_RECORD_ACCESSOR",
      "FLB_HAVE_STREAM_PROCESSOR",
      "FLB_HAVE_TLS",
      "FLB_HAVE_OPENSSL",
      "FLB_HAVE_METRICS",
      "FLB_HAVE_WASM",
      "FLB_HAVE_AWS",
      "FLB_HAVE_AWS_CREDENTIAL_PROCESS",
      "FLB_HAVE_SIGNV4",
      "FLB_HAVE_SQLDB",
      "FLB_LOG_NO_CONTROL_CHARS",
      "FLB_HAVE_METRICS",
      "FLB_HAVE_HTTP_SERVER",
      "FLB_HAVE_SYSTEMD",
      "FLB_HAVE_FORK",
      "FLB_HAVE_TIMESPEC_GET",
      "FLB_HAVE_GMTOFF",
      "FLB_HAVE_UNIX_SOCKET",
      "FLB_HAVE_LIBYAML",
      "FLB_HAVE_ATTRIBUTE_ALLOC_SIZE",
      "FLB_HAVE_PROXY_GO",
      "FLB_HAVE_JEMALLOC",
      "FLB_HAVE_LIBBACKTRACE",
      "FLB_HAVE_REGEX",
      "FLB_HAVE_UTF8_ENCODER",
      "FLB_HAVE_LUAJIT",
      "FLB_HAVE_C_TLS",
      "FLB_HAVE_ACCEPT4",
      "FLB_HAVE_INOTIFY",
      "FLB_HAVE_GETENTROPY",
      "FLB_HAVE_GETENTROPY_SYS_RANDOM"
    ]
  }
}

Remember the fluent-bit process needs access to the logs that are generated on every single node, that means you will need daemonSet object that will run a local fluent-bit pod in each of the eligible nodes across the cluster.

kubectl get all -n fluent-bit
NAME                   READY   STATUS    RESTARTS   AGE
pod/fluent-bit-8s72h   1/1     Running   0          9m20s
pod/fluent-bit-lwjrn   1/1     Running   0          9m20s
pod/fluent-bit-ql5gp   1/1     Running   0          9m20s
pod/fluent-bit-wkgnh   1/1     Running   0          9m20s
pod/fluent-bit-xcpn9   1/1     Running   0          9m20s
pod/fluent-bit-xk7vc   1/1     Running   0          9m20s

NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/fluent-bit   ClusterIP   10.111.248.240   <none>        2020/TCP   9m20s

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/fluent-bit   6         6         6       6            6           <none>          9m20s

You can also display the logs that any of the pods generates on booting. Note how the tail input plugin only watches matching files according to the regex (any file with name matching antrea*.log in /var/log/containers/ folder).

kubectl logs -n fluent-bit fluent-bit-8s72h
Fluent Bit v2.0.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/01/11 18:21:32] [ info] [fluent bit] version=2.0.8, commit=9444fdc5ee, pid=1
[2023/01/11 18:21:32] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/01/11 18:21:32] [ info] [cmetrics] version=0.5.8
[2023/01/11 18:21:32] [ info] [ctraces ] version=0.2.7
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] initializing
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] multiline core started
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2023/01/11 18:21:32] [ info] [output:loki:loki.0] configured, hostname=loki-gateway.loki.svc:80
[2023/01/11 18:21:32] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/01/11 18:21:32] [ info] [sp] stream processor started
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529048 watch_fd=1 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-agent-9dadd3c909f9471408ebf569c0d8f2622bedd572ef7a982bfe71a7f3cd6010d0.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=532180 watch_fd=2 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-agent-fd6cfdb5a18c77e66403e66e3a16e2f577d213cd010bdf09f863e22d897194a8.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=532181 watch_fd=3 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-ovs-5344c17989a14d5773ae75e4403c12939c34b2ca53fb5a09951d8fd953cea00d.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529094 watch_fd=4 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-ovs-cb343ab16cc1d9a718b938be8a889196fd93134f63c9f9da6c53a2ff291f25f5.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528986 watch_fd=5 name=/var/log/containers/antrea-agent-b4tfl_kube-system_install-cni-583b2d7380e3dc9cff9c3a05870c7997747d9751c075707bd182d1d0a0ec5e9b.log

Now we are sure the fluent-bit is working properly, the last step is to check if we actually are receiving the logs in Loki using Grafana to retrieve ingested logs. Remember in the fluent-bit output configuration we labeled the logs using job=fluentbit-antrea and we will use that as input to filter our interesting logs. Click on the compass icon at the left ribbon and use populate the label filter with mentioned label (key and value).

Exploring Antrea logs sent to Loki with Grafana

Generate some activity in the antrea agents, for example, as soon as you create a new pod and the CNI should provide the IP Address and it will write a corresponding log indicating a new IP address has been Allocated. Let’s try to locate this exact string in any antrea log. To do so, just press on the Code button, to write down a custom filter by hand.

Code Option for manual queries creation

Type the following filter to find any log with a label job=fluentbit-antrea that also contains the string “Allocated”.

{job="fluentbit-antrea"} |= "Allocated" 

Press Run Query blue button at the right top corner and you should be able to display any matching log entry as shown below.

Exploring Antrea logs sent to Loki with Grafana (2)

Feel feel to explore further the log to see the format and different labels and fields as shown below

Exploring Antrea logs sent to Loki with Grafana (2)

This concludes this post. If you followed it you should now have the required tools up and running to gain observability. This is just the first step though. For any given observability solution, the real effort come in the Day 2 when you need to figure out what are your KPI according to your particular needs and how to visualize them in the proper way. Next post here will continue diving in dashboards and log analysis. Stay tuned!

Antrea Observability Part 1: Installing Antrea and Test Application

Antrea is an opensource Kubernetes Networking and Security project maintaned by VMware that implements a Container Network Interface (CNI) to provide network connectivity and security for pod workloads. It has been designed with flexibility and high performance in mind and is based on Open vSwitch (OVS), a very mature project that stands out precisely because of these characteristics.

Antrea creates a SDN-like architecture separating the data plane (which is based in OVS) and the control plane. To do so, Antrea installs an agent component in each of the worker nodes to program the OVS datapath whereas a central controller running on the control plane node is be in charge of centralized tasks such as calculating network policies. The following picture that is available at main Antrea site depicts how it integrates with a kubernetes cluster.

Antrea high-level architecture

This series of post will focus in how to provide observability capabilities to this CNI using different tools. This first article will explain how to install Antrea and some related management stuff and also how to create a testbed environment based on a microservices application and a load-generator tool that will be used as reference example throughout the next articles. Now we have a general overview of what Antrea is, let’s start by installing Antrea through Helm.

Installing Antrea using Helm

To start with Antrea, the first step is to create a kubernetes cluster. The process of setting up a kubernetes cluster is out of scope of this post. By the way, this series of articles are based on kubernetes vanilla but feel free to try with another distribution such as VMware Tanzu or Openshift. If you are using kubeadm to build your kubernetes cluster you need to start using –pod-network-cidr=<CIDR> parameter to enable the NodeIpamController in kubernetes, alternatively you can leverage some built-in Antrea IPAM capabilities.

The easiest way to verify that NodeIPAMController is operating in your existing cluster is by checking if the flag –-allocate-node-cidrs is set to true and if there is a cluster-cidr configured. Use the following command for verification.

kubectl cluster-info dump | grep cidr
                            "--allocate-node-cidrs=true",
                            "--cluster-cidr=10.34.0.0/16",

Is important to mention that in kubernetes versions prior to 1.24, the CNI plugin could be managed by kubelet and there was a requirement to ensure the kubelet was started with network-plugin=cni flag enabled, however in newer versions is the container runtime instead of the kubelet who is in charge of managing the CNI plugin. In my case I am using containerd 1.6.9 as container runtime and Kubernetes version 1.24.8.

To install the Antrea CNI you just need to apply the kubernetes manifest that specifies all the resources needed. The latest manifest is generally available at the official antrea github site here.

Alternatively, VMware also maintains a Helm chart available for this purpose. I will install Antrea using Helm here because it has some advantages. For example, in general when you enable a new featuregate you must manually restart the pods to actually applying new settings but, when using Helm, the new changes are applied and pods restarted as part of the new release deployment. There is however an important consideration on updating CRDs when using Helm. Make sure to check this note before updating Antrea.

Lets start with the installation process. Considering you have Helm already installed in the OS of the server you are using to manage your kubernetes cluster (otherwise complete installation as per official doc here), the first step would be to add the Antrea repository source as shown below.

helm repo add antrea https://charts.antrea.io
"antrea" has been added to your repositories

As a best practice update the repo before installing to ensure you work with the last available version of the chart.

helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "antrea" chart repository

You can also explore the contents of the added repository. Note there is not only a chart for Antrea itself but also other interesting charts that will be explore in next posts. At the time of writing this post the last Antrea version is 1.10.0.

helm search repo antrea
NAME                    CHART VERSION   APP VERSION     DESCRIPTION                                
antrea/antrea           1.10.0          1.10.0          Kubernetes networking based on Open vSwitch
antrea/flow-aggregator  1.10.0          1.10.0          Antrea Flow Aggregator                     
antrea/theia            0.4.0           0.4.0           Antrea Network Flow Visibility    

When using Helm, you can customize the installation of your chart by means of templates in the form of yaml files that contain all configurable settings. To figure out what settings are available for a particular chart a good idea is to write in a default_values.yaml file that would contain all the accepted values using the helm show values command as you can see here.

helm show values antrea/antrea >> default_values.yaml

Using the default_values.yaml file as a reference, you can change any of the default values to meet your requirements. Note that, as expected for a default configuration file, any configuration not explicitily referenced in the values file will use default settings. We will use a very simplified version of the values file with just a simple setting to specifiy the desired tag or version for our deployment. We will add extra configuration later when enabling some features. Create a simple values.yaml file with this content.

vi values.yaml
# -- Container image to use for Antrea components.
image:
  tag: "v1.10.0"

And now install the new chart using helm install and the created values.yaml file as input.

helm install antrea -n kube-system antrea/antrea -f values.yaml
NAME: antrea
LAST DEPLOYED: Fri Jan 13 18:52:48 2023
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Antrea CNI has been successfully installed

You are using version 1.10.0

For the Antrea documentation, please visit https://antrea.io

After couple of minutes you should be able to see the pods in Running state. The Antrea agents are deployed using a daemonSet and thus they run an independent pod on every single node in the cluster. On the other hand the Antrea controller component is installed as a single replica deployment and may run by default on any node.

kubectl get pods -n kube-system -o wide
NAME                                          READY   STATUS    RESTARTS        AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
antrea-agent-52cjk                            2/2     Running   0               86s   10.113.2.17   k8s-worker-03         <none>           <none>
antrea-agent-549ps                            2/2     Running   0               86s   10.113.2.10   k8s-contol-plane-01   <none>           <none>
antrea-agent-5kvhb                            2/2     Running   0               86s   10.113.2.15   k8s-worker-01         <none>           <none>
antrea-agent-6p856                            2/2     Running   0               86s   10.113.2.19   k8s-worker-05         <none>           <none>
antrea-agent-f75b7                            2/2     Running   0               86s   10.113.2.16   k8s-worker-02         <none>           <none>
antrea-agent-m6qtc                            2/2     Running   0               86s   10.113.2.18   k8s-worker-04         <none>           <none>
antrea-agent-zcnd7                            2/2     Running   0               86s   10.113.2.20   k8s-worker-06         <none>           <none>
antrea-controller-746dcd98d4-c6wcd            1/1     Running   0               86s   10.113.2.10   k8s-contol-plane-01   <none>           <none>

Your nodes should now transit to Ready status as a result of the CNI sucessfull installation.

kubectl get nodes
NAME                  STATUS   ROLES           AGE   VERSION
k8s-contol-plane-01   Ready    control-plane   46d   v1.24.8
k8s-worker-01         Ready    <none>          46d   v1.24.8
k8s-worker-02         Ready    <none>          46d   v1.24.8
k8s-worker-03         Ready    <none>          46d   v1.24.8
k8s-worker-04         Ready    <none>          46d   v1.24.8
k8s-worker-05         Ready    <none>          46d   v1.24.8
k8s-worker-06         Ready    <none>          46d   v1.24.8

Jump into any of the nodes using ssh. If Antrea CNI plugin is successfully installed you should see a new configuration file in the /etc/cni/net.d directory with a content similiar to this.

cat /etc/cni/net.d/10-antrea.conflist
{
    "cniVersion":"0.3.0",
    "name": "antrea",
    "plugins": [
        {
            "type": "antrea",
            "ipam": {
                "type": "host-local"
            }
        }
        ,
        {
            "type": "portmap",
            "capabilities": {"portMappings": true}
        }
        ,
        {
            "type": "bandwidth",
            "capabilities": {"bandwidth": true}
        }
    ]
}

As a last verification, spin up any new pod to check if the CNI, which is supposed to provide connectivity for the pod, is actually working as expected. Let’s run a simple test app called kuard.

kubectl run –restart=Never –image=gcr.io/kuar-demo/kuard-amd64:blue kuard pod/kuard created
pod/kuard created

Check the status of new created pod and verify if it is running and if an IP has been assigned.

kubectl get pod kuard -o wide
NAME    READY   STATUS    RESTARTS   AGE    IP           NODE            NOMINATED NODE   READINESS GATES
kuard   1/1     Running   0          172m   10.34.4.26   k8s-worker-05   <none>           <none>

Now forward the port the container is listening to a local port.

kubectl port-forward kuard 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Next open your browser at http://localhost:8080 and you should reach the kuard application that shows some information about the pod that can be used for test and troubleshooting purposes.

So far so good. It seems our CNI plugin is working as expected lets install extra tooling to interact with the CNI component.

Installing antctl tool to interact with Antrea

Antrea includes a nice method to interact with the CNI through cli commands using a tool called antctl. Antctl can be used in controller mode or in agent mode. For controller mode you can run the command externally or from whithin the controller pod using regular kubectl exec to issue the commands you need. For using antctl in agent mode you must run the commands from within an antrea agent pod.

To install antctl simply copy and paste the following commands to download a copy of the antctl prebuilt binary for your OS.

TAG=v1.10.0
curl -Lo ./antctl "https://github.com/antrea-io/antrea/releases/download/$TAG/antctl-$(uname)-x86_64" 
chmod +x ./antctl 
sudo mv ./antctl /usr/local/bin/antctl 

Now check installed antctl version.

antctl version
antctlVersion: v1.10.0
controllerVersion: v1.10.0

When antctl run out-of-cluster (for controller-mode only) it will look for your kubeconfig file to gain access to the antrea components. As an example, you can issue following command to check overall status.

antctl get controllerinfo
POD                                            NODE          STATUS  NETWORK-POLICIES ADDRESS-GROUPS APPLIED-TO-GROUPS CONNECTED-AGENTS
kube-system/antrea-controller-746dcd98d4-c6wcd k8s-worker-01 Healthy 0                0              0                 4

We can also display the activated features (aka featuregates) by using antctl get featuregates. To enable certain functionalities such as exporting flows for analytics or antrea proxy we need to change the values.yaml file and deploy a new antrea release through helm as we did before.

antctl get featuregates
Antrea Agent Feature Gates
FEATUREGATE              STATUS         VERSION   
FlowExporter             Disabled       ALPHA     
NodePortLocal            Enabled        BETA      
AntreaPolicy             Enabled        BETA      
AntreaProxy              Enabled        BETA      
Traceflow                Enabled        BETA      
NetworkPolicyStats       Enabled        BETA      
EndpointSlice            Disabled       ALPHA     
ServiceExternalIP        Disabled       ALPHA     
Egress                   Enabled        BETA      
AntreaIPAM               Disabled       ALPHA     
Multicast                Disabled       ALPHA     
Multicluster             Disabled       ALPHA     

Antrea Controller Feature Gates
FEATUREGATE              STATUS         VERSION   
AntreaPolicy             Enabled        BETA      
Traceflow                Enabled        BETA      
NetworkPolicyStats       Enabled        BETA      
NodeIPAM                 Disabled       ALPHA     
ServiceExternalIP        Disabled       ALPHA     
Egress                   Enabled        BETA      
Multicluster             Disabled       ALPHA    

On the other hand we can use the built-in antctl utility inside each of the antrea agent pods. For example, using kubectl get pods, obtain the name of the antrea-agent pod running on the node in which the kuard application we created before has been scheduled. Now open a shell using the following command (use your unique antrea-agent pod name for this).

kubectl exec -ti -n kube-system antrea-agent-6p856 — bash
root@k8s-worker-05:/# 

The prompt indicates you are inside the k8s-worker-05 node. Now you can interact with the antrea agent with several commands, as an example, get the information of the kuard podinterface using this command.

root@k8s-worker-05:/# antctl get podinterface kuard
NAMESPACE NAME  INTERFACE-NAME IP         MAC               PORT-UUID                            OF-PORT CONTAINER-ID
default   kuard kuard-652adb   10.34.4.26 46:b5:b9:c5:c6:6c 99c959f1-938a-4ee3-bcda-da05c1dc968a 17      4baee3d2974 

We will explore further advanced options with antctl later in this series of posts. Now the CNI is ready lets install a full featured microservices application.

Installing Acme-Fitness App

Throughout this series of posts we will use a reference kubernetes application to help us with some of the examples. We have chosen the popular acme-fitness created by Google because it is a good representation of a polyglot application based on microservices that includes some typical services of a e-commerce app such as front-end, catalog, cart, payment. We will use a version maintained by VMware that is available here. The following picture depicts how the microservices that conform the acme-fitness application are communication each other.

The first step is to clone the repo to have a local version of the git locally using the following command:

git clone https://github.com/vmwarecloudadvocacy/acme_fitness_demo.git
Cloning into 'acme_fitness_demo'...
remote: Enumerating objects: 765, done.
remote: Total 765 (delta 0), reused 0 (delta 0), pack-reused 765
Receiving objects: 100% (765/765), 1.01 MiB | 1.96 MiB/s, done.
Resolving deltas: 100% (464/464), done.

In order to get some of the database microservices running, we need to setup in advance some configurations (basically credentials) that will be injected as secrets into kubernetes cluster. Move to the cloned kubernetes-manifest folder and create a new file that will contain the required configurations that will be used as required credentials for some of the database microservices. Remember the password must be Base64 encoded. In my case I am using passw0rd so the base64 encoded form results in cGFzc3cwcmQK as shown below.

vi acme-fitness-secrets.yaml
# SECRETS FOR ACME-FITNESS (Plain text password is "passw0rd" in this example)
apiVersion: v1
data:
  password: cGFzc3cwcmQK
kind: Secret
metadata:
  name: cart-redis-pass
type: Opaque
---
apiVersion: v1
data:
  password: cGFzc3cwcmQK
kind: Secret
metadata:
  name: catalog-mongo-pass
type: Opaque
---
apiVersion: v1
data:
  password: cGFzc3cwcmQK
kind: Secret
metadata:
  name: order-postgres-pass
type: Opaque
---
apiVersion: v1
data:
  password: cGFzc3cwcmQK
kind: Secret
metadata:
  name: users-mongo-pass
type: Opaque
---
apiVersion: v1
data:
  password: cGFzc3cwcmQK
kind: Secret
metadata:
  name: users-redis-pass
type: Opaque

Now open the manifest named point-of-sales.yaml and set the value of the FRONTEND_HOST variable as per your particular setup. In this deployment we will install the acme-fitness application in a namespace called acme-fitness so the FQDN would be frontend.acme-fitness.svc.cluster.local. Adjust accordingly if you are using a domain different than cluster.local.

vi point-of-sales-total.yaml
      labels:
        app: acmefit
        service: pos
    spec:
      containers:
      - image: gcr.io/vmwarecloudadvocacy/acmeshop-pos:v0.1.0-beta
        name: pos
        env:
        - name: HTTP_PORT
          value: '7777'
        - name: DATASTORE
          value: 'REMOTE'
        - name: FRONTEND_HOST
          value: 'frontend.acme-fitness.svc.cluster.local'
        ports:
        - containerPort: 7777
          name: pos

Now create the namespace acme-fitness that is where we will place all microservices and related objects.

kubectl create ns acme-fitness

And now apply all the manifest in current directory kubernetes-manifest using the namespace as keyword as shown here to ensure all objects are place in this particular namespace. Ensure the acme-fitness-secrets.yaml manifest we created in previous step is also placed in the same directory and check if you get in the output the confirmation of the new secrets being created.

kubectl apply -f . -n acme-fitness
service/cart-redis created
deployment.apps/cart-redis created
service/cart created
deployment.apps/cart created
configmap/catalog-initdb-config created
service/catalog-mongo created
deployment.apps/catalog-mongo created
service/catalog created
deployment.apps/catalog created
service/frontend created
deployment.apps/frontend created
service/order-postgres created
deployment.apps/order-postgres created
service/order created
deployment.apps/order created
service/payment created
deployment.apps/payment created
service/pos created
deployment.apps/pos created
secret/cart-redis-pass created
secret/catalog-mongo-pass created
secret/order-postgres-pass created
secret/users-mongo-pass created
secret/users-redis-pass created
configmap/users-initdb-config created
service/users-mongo created
deployment.apps/users-mongo created
service/users-redis created
deployment.apps/users-redis created
service/users created
deployment.apps/users created

Wait a couple of minutes to allow workers to pull the images of the containers and you should see all the pods running. In case you find any pod showing a non-running status simply delete it and wait for kubernetes to renconcile the state till you see everything up and running.

kubectl get pods -n acme-fitness
NAME                              READY   STATUS      RESTARTS   AGE
cart-76bb4c586-blc2m              1/1     Running     0          2m45s
cart-redis-5cc665f5bd-zrjwh       1/1     Running     0          3m57s
catalog-958b9dc7c-qdzbt           1/1     Running     0          2m27s
catalog-mongo-b5d4bfd54-c2rh5     1/1     Running     0          2m13s
frontend-6cd56445-8wwws           1/1     Running     0          3m57s
order-584d9c6b44-fwcqv            1/1     Running     0          3m56s
order-postgres-8499dcf8d6-9dq8w   1/1     Running     0          3m57s
payment-5ffb9c8d65-g8qb2          1/1     Running     0          3m56s
pos-5995956dcf-xxmqr              1/1     Running     0          108s
users-67f9f4fb85-bntgv            1/1     Running     0          3m55s
users-mongo-6f57dbb4f-kmcl7       1/1     Running     0          45s
users-mongo-6f57dbb4f-vwz64       0/1     Completed   0          3m56s
users-redis-77968854f4-zxqgc      1/1     Running     0          3m56s

Now check the created services. The entry point of the acme-fitness application resides on the service named frontend that points to the frontend microservice. This service is a LoadBalancer type so, in case you have an application in your cluster to expose LoadBalancer type services, you should receive an external IP that will be used to reach our application externally. If this is not your case you can always access using the corresponding NodePort for the service. In this particular case the service is exposed in the dynamic 31967 port within the NodePort port range as shown below.

kubectl get svc -n acme-fitness
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)          AGE
cart             ClusterIP      10.96.58.243     <none>         5000/TCP         4m38s
cart-redis       ClusterIP      10.108.33.107    <none>         6379/TCP         4m39s
catalog          ClusterIP      10.106.250.32    <none>         8082/TCP         4m38s
catalog-mongo    ClusterIP      10.104.233.60    <none>         27017/TCP        4m38s
frontend         LoadBalancer   10.111.94.177    10.113.3.110   80:31967/TCP     4m38s
order            ClusterIP      10.108.161.14    <none>         6000/TCP         4m37s
order-postgres   ClusterIP      10.99.98.123     <none>         5432/TCP         4m38s
payment          ClusterIP      10.99.108.248    <none>         9000/TCP         4m37s
pos              NodePort       10.106.130.147   <none>         7777:30431/TCP   4m37s
users            ClusterIP      10.106.170.197   <none>         8083/TCP         4m36s
users-mongo      ClusterIP      10.105.18.58     <none>         27017/TCP        4m37s
users-redis      ClusterIP      10.103.94.131    <none>         6379/TCP         4m37s

I am using the AVI Kubernetes Operator in my setup to watch por LoadBalancer type services, so we should be able to access the site using the IP address or the allocated name, that according to my particular settings here will be http://frontend.acme-fitness.avi.sdefinitive.net. If you are not familiar with AVI Ingress solution and how to deploy it you can check a series related posts here that explain step by step how to integrate this powerful enterprise solution with your cluster.

Acme-Fitness application Home Page

Now that the application is ready let’s deploy a traffic generator to inject some traffic.

Generating traffic using Locust testing tool

A simple but powerful way to generate synthetic traffic is using the Locust tool written in Python. An interesting advantage of Locust is that it has a distributed architecture that allows running multiple tests on multiple workers. It also includes a web interface that allows to start and parameterize the load test. Locust allows you to define complex test scenarios that are described through a locustfile.py. Thankfully the acme-fitness repository includes here a load-generator folder that contains instructions and a ready-to-use locustfile.py to fully simulate traffic and users according to the architecture of the acme-fitness application.

We can install the locust application in kubernetes easily. The first step is to create the namespace in which the load-gen application will be installed

kubectl create ns load-gen

As usual, if we want to inject configuration into kubernetes we can use a configmap object containing the mentioned the required settings (in this case in a form of locustfile.py file). Be careful because creating the .py file into a cm using kubectl create from-file might cause some formatting errors. I have created a functional configmap in yaml format file that you can use directly using following command.

kubectl apply -n load-gen -f https://raw.githubusercontent.com/jhasensio/antrea/main/LOCUST/acme-locustfile-cm.yaml
configmap/locust-configmap created

Once the configmap is created you can apply the following manifest that includes services and deployments to install the Locust load-testing tool in a distributed architecture and taking as input the locustfile.py file coded in a configmap named locust-configmap. Use following command to deploy Locust.

kubectl apply -n load-gen -f https://raw.githubusercontent.com/jhasensio/antrea/main/LOCUST/locust.yaml
deployment.apps/locust-master created
deployment.apps/locust-worker created
service/locust-master created
service/locust-ui created

You can inspect the created kubernetes objects in the load-gen namespace. Note how there is a LoadBalancer service that will be used to reach the Locust application.

kubectl get all -n load-gen
NAME                                 READY   STATUS    RESTARTS   AGE
pod/locust-master-67bdb5dbd4-ngtw7   1/1     Running   0          81s
pod/locust-worker-6c5f87b5c8-pgzng   1/1     Running   0          81s
pod/locust-worker-6c5f87b5c8-w5m6h   1/1     Running   0          81s

NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
service/locust-master   ClusterIP      10.110.189.42   <none>         5557/TCP,5558/TCP,8089/TCP   81s
service/locust-ui       LoadBalancer   10.106.18.26    10.113.3.111   80:32276/TCP                 81s

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/locust-master   1/1     1            1           81s
deployment.apps/locust-worker   2/2     2            2           81s

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/locust-master-67bdb5dbd4   1         1         1       81s
replicaset.apps/locust-worker-6c5f87b5c8   2         2         2       81s

Open a browser to access to Locust User Interface using the allocated IP at http://10.113.3.111 or the FQDN that in my case corresponds to http://locust-ui.load-gen.avi.sdefinitive.net and you will reach the following web site. If you don’t have any Load Balancer solution installed just go the allocated port (32276 in this case) using any of your the IP addresses of your nodes.

Locust UI

Now populate the Host text box using to start the test against the acme-fitness application. You can use the internal URL that will be reachable at http://frontend.acme-fitness.svc.cluster.local or the external name http://frontend.acme-fitness.avi.sdefinitive.net. Launch the test and observe how the Request Per Second counter is increasing progressively.

If you need extra load simply scale the locust-worker deployment and you would get more pods acting as workers available to generate traffic

We have the cluster up and running, the application and the load-generator ready. It’s time to start the journey to add observability using extra tools. Be sure to check next part of this series. Stay tuned!

Antrea Observability Part 0: Quick Start Guide

In this series of posts I have tried to create a comprehensive guide with a good level of detail around the installation of Antrea as CNI along with complimentary mainstream tools necessary for the visualization of metrics and logs. The guide is composed of the following modules:

If you are one of those who like the hard-way and like to understand how different pieces work together or if you are thinking in deploying in a more demanding enviroment that requires scalability and performance, then I highly recommend taking the long path and going through the post series in which you will find some cool stuff to gain understanding of this architecture.

Alternatively, if you want to take the short path, just continue reading the following section to deploy Antrea and related observability stack using a very simple architecture that is well suited for demo and non-production environments.

Quick Setup Guide

A basic requirement before moving forwared is to have a kubernetes cluster up and running. If you are reading this guide you probably are already familiar with how to setup a kubernetes cluster so I won’t spent time describing the procedure. Once the cluster is ready and you can interact with it via kubectl command you are ready to go. The first step is to get the Antrea CNI installed with some particular configuration enabled. The installation of Antrea is very straightforward. As docummented in Antrea website you just need to apply the yaml manifest as shown below and you will end up with your CNI deployed and a fully functional cluster prepared to run pod workloads.

kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea.yml
customresourcedefinition.apiextensions.k8s.io/antreaagentinfos.crd.antrea.io created
customresourcedefinition.apiextensions.k8s.io/antreacontrollerinfos.crd.antrea.io created
customresourcedefinition.apiextensions.k8s.io/clustergroups.crd.antrea.io created
<skipped>

mutatingwebhookconfiguration.admissionregistration.k8s.io/crdmutator.antrea.io created
validatingwebhookconfiguration.admissionregistration.k8s.io/crdvalidator.antrea.io created

Once Antrea is deployed, edit the configmap that contains the configuration settings in order to enable some required features related to observability.

kubectl edit configmaps -n kube-system antrea-config

Scroll down and locate FlowExporter setting and ensure is set to true and the line is uncomment to take effect in the configuration.

# Enable flowexporter which exports polled conntrack connections as IPFIX flow records from each
# agent to a configured collector.
      FlowExporter: true

Locate also the enablePrometheusMetrics keyword and ensure is enabled and uncommented as well. This setting should be enabled by default but double check just in case.

# Enable metrics exposure via Prometheus. Initializes Prometheus metrics listener.
    enablePrometheusMetrics: true

Last step would be to restart the antrea agent daemonSet to ensure the new settings are applied.

kubectl rollout restart ds/antrea-agent -n kube-system

Now we are ready to go. Clone my git repo here that contains all required stuff to proceed with the observability stack installation.

git clone https://github.com/jhasensio/antrea.git
Cloning into 'antrea'...
remote: Enumerating objects: 155, done.
remote: Counting objects: 100% (155/155), done.
remote: Compressing objects: 100% (86/86), done.
remote: Total 155 (delta 75), reused 145 (delta 65), pack-reused 0
Receiving objects: 100% (155/155), 107.33 KiB | 1.18 MiB/s, done.
Resolving deltas: 100% (75/75), done

Create the observability namespace that will be used to place all the needed kubernetes objects.

kubectl create ns observability
namespace/observability created

Navigate to the antrea/antrea-observability-quick-start folder and now install all the required objects by applying recursively the yaml files in the folder tree using the mentioned namespace. As you can tell from the output below, the manifest will deploy a bunch of tools including fluent-bit, loki and grafana with precreated dashboards and datasources.

kubectl apply -f tools/ –recursive -n observability
clusterrole.rbac.authorization.k8s.io/fluent-bit created
clusterrolebinding.rbac.authorization.k8s.io/fluent-bit created
configmap/fluent-bit created
daemonset.apps/fluent-bit created
service/fluent-bit created
serviceaccount/fluent-bit created
clusterrole.rbac.authorization.k8s.io/grafana-clusterrole created
clusterrolebinding.rbac.authorization.k8s.io/grafana-clusterrolebinding created
configmap/grafana-config-dashboards created
configmap/grafana created
configmap/1--agent-process-metrics-dashboard-cm created
configmap/2--agent-ovs-metrics-and-logs-dashboard-cm created
configmap/3--agent-conntrack-metrics-dashboard-cm created
configmap/4--agent-network-policy-metrics-and-logs-dashboard-cm created
configmap/5--antrea-agent-logs-dashboard-cm created
configmap/6--controller-metrics-and-logs-dashboard-cm created
configmap/loki-datasource-cm created
configmap/prometheus-datasource-cm created
deployment.apps/grafana created
role.rbac.authorization.k8s.io/grafana created
rolebinding.rbac.authorization.k8s.io/grafana created
secret/grafana created
service/grafana created
serviceaccount/grafana created
configmap/loki created
configmap/loki-runtime created
service/loki-memberlist created
serviceaccount/loki created
service/loki-headless created
service/loki created
statefulset.apps/loki created
clusterrole.rbac.authorization.k8s.io/prometheus-antrea created
serviceaccount/prometheus created
secret/prometheus-service-account-token created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
configmap/prometheus-server-conf created
deployment.apps/prometheus-deployment created
service/prometheus-service created

The grafana UI is exposed through a LoadBalancer service-type object. If you have a load balancer solution deployed in your cluster, use the allocated external IP Address, otherwise use the dynamically allocated NodePort if you are able to reach the kubernetes nodes directly or ultimately use the port-forward from your administrative console as shown below.

kubectl port-forward -n observability services/grafana 8888:80
Forwarding from 127.0.0.1:8888 -> 3000

Open your browser at localhost 8888 and you will access the Grafana login page as depicted in following picture.

Grafana Login Page

Enter the default username admin. The password is stored in a secret object with a superstrong password only for demo purposes that you can easly decode.

kubectl get secrets -n observability grafana -o jsonpath=”{.data.admin-password}” | base64 -d ; echo
passw0rd

Now you should reach the home page of Grafana application with a similar aspect of what you see below. By default Grafana uses dark theme but you can easily change. You just have to click on the settings icon at the left column > Preferences and select Light as UI Thene. Find at the end of the post some dashboard samples with both themes.

Grafana home page

Now, click on the General link at the top left corner and you should be able to reach the six precreated dashboards in the folder.

Grafana Dashboard browsing

Unless your have a productive cluster running hundreds of workloads, you will find some of the panels empty because there is no metrics nor logs to display yet. Just to add some fun to the dashboards I have created a script to simulate a good level of activity in the cluster. The script is also included in the git repo so just execute it and wait for some minutes.

bash simulate_activity.sh
 This simulator will create some activity in your cluster using current context
 It will spin up some client and a deployment-backed ClusterIP service based on apache application
 After that it will create random AntreaNetworkPolicies and AntreaClusterNetworkPolicies and it will generate some traffic with random patterns
 It will also randomly scale in and out the deployment during execution. This is useful for demo to see all metrics and logs showing up in the visualization tool

For more information go to https://sdefinitive.net

   *** Starting activity simulation. Press CTRL+C to stop job ***   
....^CCTRL+C Pressed...

Cleaning temporary objects, ACNPs, ANPs, Deployments, Services and Pods
......
Cleaning done! Thanks for using it!

After some time, you can enjoy the colorful dashboards and logs specially designed to get some key indicators from Antrea deployment. Refer to Part 3 of this post series for extra information and use cases. Find below some supercharged dashboards and panels as a sample of what you will get.

Dashboard 1: Agent Process Metrics
Dashboard 2: Agent OVS Metrics and Logs
Dashboard 3: Agent Conntrack Metrics
Dashboard 4: Agent Network Policy Metric and Logs (1/2) (Light theme)
Dashboard 4: Agent Network Policy Metric and Logs (2/2) (Light theme)
Dashboard 5: Antrea Agent Logs
Dashboard 6: Controller Metrics and Logs (1/2)

Keep reading next post in this series to get more information around dashboards usage and how to setup all the tools with data persistence and with a high-available and distributed architecture.

Preparing vSphere for Kubernetes Persistent Storage (3/3): MinIO S3 Object Storage using vSAN SNA

Modern Kubernetes-based applications are built with the idea of being able to offer features such as availability, scalability and replication natively and agnosticly to the insfrasturure they are running on. This philosophy questions the need to duplicate those same functions from the infrastructure, especially at the storage layer. As an example, we could set up a modern storage application like MinIO to provide an S3 object storage platform with its own protection mechanisms (e.g., erasure code) using vSAN as underlay storage infrastructure that in turn have some embedded data redundancy and availability mechanisms.

The good news is that when using vSAN, we can selectively choose which features we want to enable in our storage layer through Storage Policy-Based Management (SPBM). A special case is the so-called vSAN SNA (Shared-Nothing Architecture) that basically consists of disabling any redundancy in the vSAN-based volumes and rely on the application itself (e.g. MinIO) to meet any data redundany and fault tolerance requirements.

In this post we will go through the setup of MinIO over the top of vSAN using a Shared-Nothing Architecture.

Defining vSAN SNA (Shared-Nothing Architecture)

As mentioned, VMware vSAN provides a lot of granularity to tailor the storage service through SPBM policies. These policies can control where tha data is going to be physically placed, how the data is going to be protected in case a failure occurs, what is the performance in term of IOPS and so on in order to guarantee the required level of service to meet your use case requirements.

The setup of the vSAN cluster is out of scope of this walk through so, assuming vSAN cluster is configured and working properly, open the vCenter GUI and go to Policies and Profiles > VM Storage Policies and click on CREATE to start the definition of a new Storage Policy.

vSAN Storage Policy Creation (1)

Pick up a significative name. I am using here vSAN-SNA MINIO as you can see below

vSAN Storage Policy Creation (2)

Now select the Enable rules for “vSAN” storage that is exactly what we are trying to configure.

vSAN Storage Policy Creation (3)

Next, configure the availability options. Ensure you select “No data redundancy with host affinity” here that is the main configuration setting if you want to create a SNA architecture in which all the data protection and availabilty will rely on the upper storage application (MinIO in this case).

vSAN Storage Policy Creation (4)

Select the Storage Rules as per your preferences. Remember we are trying to avoid overlapping features to gain in performance and space efficiency, so ensure you are not duplication any task that MinIO is able to provide including data encryption.

vSAN Storage Policy Creation (5)

Next select the Object space reservation and other settings. I am using here Thin provisioning and default values for the rest of the settings.

vSAN Storage Policy Creation (6)

Select any compatible vsanDatastore that you have in your vSphere infrastructure.

vSAN Storage Policy Creation (6)

After finishing the creation of the Storage Policy, it is time to define a StorageClass attached to the created vSAN policy. This configuration assumes that you have your kubernetes cluster already integrated with vSphere CNS services using the CSI Driver. If this is not the case you can follow this previous post before proceeding. Create a yaml file with following content.

vi vsphere-sc-sna.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: vsphere-sc-sna
provisioner: csi.vsphere.vmware.com
parameters:
  datastoreurl: "ds:///vmfs/volumes/vsan:529c9fd4d68b174b-1af2d7a4b1b22457/"
  storagepolicyname: "vSAN-SNA MINIO"
# csi.storage.k8s.io/fstype: "ext4" #Optional Parameter

Once the yaml manifest is created simply apply it using kubectl.

kubectl apply -f vsphere-sc-sna.yaml 

Now create a new PVC that uses the storageclass defined. Remember that in general, when using an storageClass there is no need to precreate the PVs because the CSI Driver will provision for you dinamically without any storage administrator preprovisioning.

vi pvc_sna.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vsphere-pvc-sna
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi
  storageClassName: vsphere-sc-sna

Capture the allocated name pvc-9946244c-8d99-467e-b444-966c392d3bfa for the new created PVC.

kubectl get pvc vsphere-pvc-sna
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
vsphere-pvc-sna   Bound    pvc-9946244c-8d99-467e-b444-966c392d3bfa   500Mi      RWO            vsphere-sc-sna   14s

Access to Cluster > Monitor > vSAN Virtual Objects and filter out using the previously captured pvc name in order to view the physical placement that is being used.

PVC virtual object

The redundancy scheme here is RAID 0 with a single vmdk disk which means in practice that vSAN is not providing any level of protection or performance gaining which is exactly what we are trying to achieve in this particular Shared-Nothing Architecture.

PVC Physical Placement

Once the SNA architecture is defined we can proceed with MinIO Operator installation.

Installing MinIO Operator using Krew

MinIO is an object storage solution with S3 API support and is a popular alternative for providing an Amazon S3-like service in a private cloud environment. In this example we will install MinIO Operator that allows us to deploy and manage different MinIO Tenants in the same cluster. According to the kubernetes official documentation, a kubernetes operator is just a software extension to Kubernetes that make use of custom resources to manage applications and their components following kubernetes principles.

Firstly, we will install a MinIO plugin using the krew utility. Krew is a plugin manager that complements kubernetes CLI tooling and is very helpful for installing and updating plugins that act as add-ons over kubectl command .

To install krew just copy and paste the following command line. Basically that set of commands check the OS to pick the corresponding krew prebuilt binary and then downloads, extracts and finally installs krew utility.

(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)

The binary will be installed in the $HOME/.krew/bin directory. Update your PATH environment variable to include this new directory. In my case I am using bash.

echo 'export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"' >> $HOME/.bashrc

Now make the changes effective by restarting your shell. The easiest way is using exec to reinitialize the existing session and force to read new values.

exec $SHELL

If everything is ok, we should be able to list the installed krew plugins. The only available plugin by default is the krew plugin itself.

kubectl krew list
PLUGIN    VERSION
krew      v0.4.3

Now we should be able to install the desired MinIO plugin just using kubectl krew install as shown below.

kubectl krew install minio
Updated the local copy of plugin index.
Installing plugin: minio
Installed plugin: minio
\
 | Use this plugin:
 |      kubectl minio
 | Documentation:
 |      https://github.com/minio/operator/tree/master/kubectl-minio
 | Caveats:
 | \
 |  | * For resources that are not in default namespace, currently you must
 |  |   specify -n/--namespace explicitly (the current namespace setting is not
 |  |   yet used).
 | /
/
WARNING: You installed plugin "minio" from the krew-index plugin repository.
   These plugins are not audited for security by the Krew maintainers.
   Run them at your own risk.

Once the plugin is installed we can initialize the MinIO operator just by using the kubectl minio init command as shown below.

kubectl minio init
namespace/minio-operator created
serviceaccount/minio-operator created
clusterrole.rbac.authorization.k8s.io/minio-operator-role created
clusterrolebinding.rbac.authorization.k8s.io/minio-operator-binding created
customresourcedefinition.apiextensions.k8s.io/tenants.minio.min.io created
service/operator created
deployment.apps/minio-operator created
serviceaccount/console-sa created
secret/console-sa-secret created
clusterrole.rbac.authorization.k8s.io/console-sa-role created
clusterrolebinding.rbac.authorization.k8s.io/console-sa-binding created
configmap/console-env created
service/console created
deployment.apps/console created
-----------------

To open Operator UI, start a port forward using this command:

kubectl minio proxy -n minio-operator

-----------------

The MinIO Operator installation has created several kubernetes resources within the minio-operator namespace, we can explore the created services.

kubectl get svc -n minio-operator
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)             AGE
console    ClusterIP      10.110.251.111   <none>         9090/TCP,9443/TCP   34s
operator   ClusterIP      10.111.75.168    <none>         4222/TCP,4221/TCP   34s

A good idea to ease the console access instead of using port-forwarding as suggested in the output of the MinIO Operator installation is exposing the operator console externally using any of the well known kubernetes methods. As an example, the following manifest creates a LoadBalancer object that will provide external reachability to the MinIO Operator console web site.

vi minio-console-lb.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: minio-operator
  name: minio
  namespace: minio-operator
spec:
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: console
  type: LoadBalancer

Apply above yaml file using kubectl apply and check if your Load Balancer component is reporting the allocated External IP address. I am using AVI with its operator (AKO) to capture the LoadBalancer objects and program the external LoadBalancer but feel free to use any other ingress solution of your choice.

kubectl get svc minio -n minio-operator
NAME    TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
minio   LoadBalancer   10.97.25.202   10.113.3.101   9090:32282/TCP   3m

Before trying to access the console, you need to get the token that will be used to authenticate the access to the MinIO Operator GUI.

kubectl get secret console-sa-secret -n minio-operator -o jsonpath=”{.data.token}” | base64 -d ; echo
eyJhbGciOi <redacted...> 

Now you can easily access to the external IP address that listens in the port 9090 and you should reach the MinIO Operator web that looks like the image below. Use the token as credential to access the console.

MinIO Console Access

The first step is to create a tenant. A tenant, as the wizard states, is just the logical structure to represent a MinIO deployment. A tenant can have different size and configurations from other tenants, even a different storage class. Click on the CREATE TENANT button to start the process.

MinIO Operator Tenant Creation

We will use the name archive, and will place the tenant in the namespace of the same name that we can create from the console itself if it does not exist previously in our target cluster. Make sure you select StorageClass vsphere-sc-sna that has been previously created for this specific purpose.

In the Capacity Section we can select the number of servers, namely the kubernetes nodes that will run a pod, and the number of drives per server, namely the persistent volumes mounted in each pod, that will architecture this software defined storage layer. According to the settings shown below, is easy to do the math to calculate that achieving a Size of 200 Gbytes using 6 servers with 4 drives each you would need 200/(6×4)=8,33 G per drive.

The Erasure Code Parity will set the level of redundancy and availability we need and it’s directly related with overall usable capacity. As you can guess, the more the protection, the more the wasting of storage. In this case, the selected EC:2 will tolerate a single server failure and the usable capacity would be 166,7 Gb. Feel free to change the setting and see how the calculation is refreshed in the table at the right of the window.

MinIO Tenant Creation Dialogue (Settings)

The last block of this configuration section allows you to the assigned resources in terms of CPU and memory that each pod will request to run.

MinIO Tenant Creation Dialogue (Settings 2)

Click on Configure button to define whether we want to expose the MinIO services (S3 endpoint and tenant Console) to provide outside access.

MinIO Tenant Creation Dialogue (Services)

Another important configuration decision is whether we want to use TLS or not to access the S3 API endpoint. By default the tenant and associated buckets uses TLS and autogenerated certs for this purpose.

MinIO Tenant Creation Dialogue (Security)

Other interesting setting you can optionally enable is the Audit Log to store your log transactions into a database. This can be useful for troubleshooting and security purposes. By default the Audit Log is enabled.

The monitoring section will provide allow you to get metrics and dashboards of your tenant by deploying a Prometheus Server to scrape some relevant metrics associated with your service. By default the service is also enabled.

Feel free to explore the rest of settings to change other advanced parameters such as encryption (disabled by default) or authentication (by default using local credentials but integrable with external authentication systems). As soon as you click on the Create button at the bottom right corner you will launch the new tenant and a new window will appear with a set of the credentials associated with the new tenant. Save it in a JSON file or write it down for later usage because there is no way to display it afterwards.

New Tenant Credentials

The aspect of the json file is you decide to download it is shown below.

{
  "url":"https://minio.archive.svc.cluster.local:443",
  "accessKey":"Q78mz0vRlnk6melG",
  "secretKey":"6nWtL28N1eyVdZdt16CIWyivUh3PB5Fp",
  "api":"s3v4",
  "path":"auto"
}

We are done with our first MinIO tenant creation process. Let’s move into the next part to inspect the created objects.

Inspecting Minio Operator kubernetes resources

Once the tenant is created the Minio Operator Console will show it up with real-time information. There are some task to be done under the hood to complete process such as deploying pods, pulling container images, creating pvcs and so on so it will take some time to have it ready. The red circle and the absence of statistics indicates the tenant creation is still in process.

Tenant Administration

After couple of minutes, if you click on the tenant Tile you will land in the tenant administration page from which you will get some information about the current state, the Console and Endpoint URLs, number of drives and so on.

If you click on the YAML button at the top you will see the YAML file. Although the GUI can be useful to take the first steps in the creation of a tenant, when you are planning to do it in an automated fashion and the best approach is to leverage the usage of yaml files to declare the tenant object that is basically a CRD object that the MinIO operator watches to reconcile the desired state of the tenants in the cluster.

In the Configuration section you can get the user (MINIO_ROOT_USER) and password (MINIO_ROOT_PASSWORD). Those credentials can be used to access the tenant console using the corresponding endpoint.

The external URL https://archive-console.archive.avi.sdefinitive.net:9443 can be used to reach our archive tenant console in a separate GUI like the one shown below. Another option available from the Minio Operator GUI is using the Console button at the top. Using this last method will bypass the authentication.

If you click on the Metrics you can see some interesting stats related to the archive tenant as shown below.

Move into kubernetes to check the created pods. As expected a set of 6 pods are running and acting as the storage servers of our tenant. Aditionally there is other complementary pods also running in the namespace for monitoring and logging

kubectl get pod -n archive
NAME                                      READY   STATUS    RESTARTS        AGE
archive-log-0                             1/1     Running   0               5m34s
archive-log-search-api-5568bc5dcb-hpqrw   1/1     Running   3 (5m17s ago)   5m32s
archive-pool-0-0                          1/1     Running   0               5m34s
archive-pool-0-1                          1/1     Running   0               5m34s
archive-pool-0-2                          1/1     Running   0               5m34s
archive-pool-0-3                          1/1     Running   0               5m33s
archive-pool-0-4                          1/1     Running   0               5m33s
archive-pool-0-5                          1/1     Running   0               5m33s
archive-prometheus-0                      2/2     Running   0               2m33s

If you check the volume of one of the pod you can tell how each server (pod) is mounting four volumes as specified upon tenant creation.

kubectl get pod -n archive archive-pool-0-0 -o=jsonpath=”{.spec.containers[*].volumeMounts}” | jq
[
  {
    "mountPath": "/export0",
    "name": "data0"
  },
  {
    "mountPath": "/export1",
    "name": "data1"
  },
  {
    "mountPath": "/export2",
    "name": "data2"
  },
  {
    "mountPath": "/export3",
    "name": "data3"
  },
  {
    "mountPath": "/tmp/certs",
    "name": "archive-tls"
  },
  {
    "mountPath": "/tmp/minio-config",
    "name": "configuration"
  },
  {
    "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
    "name": "kube-api-access-xdftb",
    "readOnly": true
  }
]

Correspondingly each of the volumes should be associated to a PVC bounded to a PV that has been dynamically created at the infrastructre storage layer through the storageClass. If you remember the calculation we did above, the size that each pvc should be 200/ (6 servers x 4) = 8,33 Gi that is aprox the capacity (8534Mi) of the 24 PVCs displayed below.

kubectl get pvc -n archive
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
archive-log-archive-log-0                 Bound    pvc-3b3b0738-d4f1-4611-a059-975dc44823ef   5Gi        RWO            vsphere-sc       33m
archive-prometheus-archive-prometheus-0   Bound    pvc-c321dc0e-1789-4139-be52-ec0dbc25211a   5Gi        RWO            vsphere-sc       30m
data0-archive-pool-0-0                    Bound    pvc-5ba0d4b5-e119-49b2-9288-c5556e86cdc1   8534Mi     RWO            vsphere-sc-sna   33m
data0-archive-pool-0-1                    Bound    pvc-ce648e61-d370-4725-abe3-6b374346f6bb   8534Mi     RWO            vsphere-sc-sna   33m
data0-archive-pool-0-2                    Bound    pvc-4a7198ce-1efd-4f31-98ed-fc5d36ebb06b   8534Mi     RWO            vsphere-sc-sna   33m
data0-archive-pool-0-3                    Bound    pvc-26567625-982f-4604-9035-5840547071ea   8534Mi     RWO            vsphere-sc-sna   33m
data0-archive-pool-0-4                    Bound    pvc-c58e8344-4462-449f-a6ec-7ece987e0b67   8534Mi     RWO            vsphere-sc-sna   33m
data0-archive-pool-0-5                    Bound    pvc-4e22d186-0618-417b-91b8-86520e37b3d2   8534Mi     RWO            vsphere-sc-sna   33m
data1-archive-pool-0-0                    Bound    pvc-bf497569-ee1a-4ece-bb2d-50bf77b27a71   8534Mi     RWO            vsphere-sc-sna   33m
data1-archive-pool-0-1                    Bound    pvc-0e2b1057-eda7-4b80-be89-7c256fdc3adc   8534Mi     RWO            vsphere-sc-sna   33m
data1-archive-pool-0-2                    Bound    pvc-f4d1d0ff-e8ed-4c6b-a053-086b7a1a049b   8534Mi     RWO            vsphere-sc-sna   33m
data1-archive-pool-0-3                    Bound    pvc-f8332466-bd49-4fc8-9e2e-3168c307d8db   8534Mi     RWO            vsphere-sc-sna   33m
data1-archive-pool-0-4                    Bound    pvc-0eeae90d-46e7-4dba-9645-b2313cd382c9   8534Mi     RWO            vsphere-sc-sna   33m
data1-archive-pool-0-5                    Bound    pvc-b7d7d1c1-ec4c-42ba-b925-ce02d56dffb0   8534Mi     RWO            vsphere-sc-sna   33m
data2-archive-pool-0-0                    Bound    pvc-08642549-d911-4384-9c20-f0e0ab4be058   8534Mi     RWO            vsphere-sc-sna   33m
data2-archive-pool-0-1                    Bound    pvc-638f9310-ebf9-4784-be87-186ea1837710   8534Mi     RWO            vsphere-sc-sna   33m
data2-archive-pool-0-2                    Bound    pvc-aee4d1c0-13f7-4d98-8f83-c047771c4576   8534Mi     RWO            vsphere-sc-sna   33m
data2-archive-pool-0-3                    Bound    pvc-06cd11d9-96ed-4ae7-a9bc-3b0c52dfa884   8534Mi     RWO            vsphere-sc-sna   33m
data2-archive-pool-0-4                    Bound    pvc-e55cc7aa-8a3c-463e-916a-8c1bfc886b99   8534Mi     RWO            vsphere-sc-sna   33m
data2-archive-pool-0-5                    Bound    pvc-64948a13-bdd3-4d8c-93ad-155f9049eb36   8534Mi     RWO            vsphere-sc-sna   33m
data3-archive-pool-0-0                    Bound    pvc-565ecc61-2b69-45ce-9d8b-9abbaf24b829   8534Mi     RWO            vsphere-sc-sna   33m
data3-archive-pool-0-1                    Bound    pvc-c61d45da-d7da-4675-aafc-c165f1d70612   8534Mi     RWO            vsphere-sc-sna   33m
data3-archive-pool-0-2                    Bound    pvc-c941295e-3e3d-425c-a2f0-70ee1948b2f0   8534Mi     RWO            vsphere-sc-sna   33m
data3-archive-pool-0-3                    Bound    pvc-7d7ce3b1-cfeb-41c8-9996-ff2c3a7578cf   8534Mi     RWO            vsphere-sc-sna   33m
data3-archive-pool-0-4                    Bound    pvc-c36150b1-e404-4ec1-ae61-6ecf58f055e1   8534Mi     RWO            vsphere-sc-sna   33m
data3-archive-pool-0-5                    Bound    pvc-cd847c84-b5e1-4fa4-a7e9-a538f9424dbf   8534Mi     RWO            vsphere-sc-sna   33m

Everything looks good so let’s move into the tenant console to create a new bucket.

Creating a S3 Bucket

Once the tenant is ready, you can use S3 API to create buckets and to push data into them. When using MinIO Operator setup you can also use the Tenant GUI as shown below. Access the tenant console using the external URL or simply jump from the Operator Console GUI and you will reach the following page.

A bucket in S3 Object Storage is similar to a Folder in a traditional filesystem and its used to organice the pieces of data (objects). Create a new bucket with the name my-bucket.

Accessing S3 using MinIO Client (mc)

Now lets move into the CLI to check how can we interact via API with the tenant through the minio client (mc) tool. To install it just issue below commands.

curl https://dl.min.io/client/mc/release/linux-amd64/mc \
  --create-dirs \
  -o $HOME/minio-binaries/mc

chmod +x $HOME/minio-binaries/mc
export PATH=$PATH:$HOME/minio-binaries/

As a first step declare a new alias with the definition of our tenant. Use the tenant endpoint and provide the accesskey and the secretkey you capture at the time of tenant creation. Remember we are using TLS and self-signed certificates so the insecure flag is required.

mc alias set minio-archive https://minio.archive.avi.sdefinitive.net Q78mz0vRlnk6melG 6nWtL28N1eyVdZdt16CIWyivUh3PB5Fp –insecure
Added `minio-archive` successfully.

The info keyword can be used to show relevant information of our tenant such as the servers (pods) and the status of the drives (pvc) and the network. Each of the servers will listen in the port 9000 to expose the storage access externally.

mc admin info minio-archive –insecure
●  archive-pool-0-0.archive-hl.archive.svc.cluster.local:9000
   Uptime: 52 minutes 
   Version: 2023-01-02T09:40:09Z
   Network: 6/6 OK 
   Drives: 4/4 OK 
   Pool: 1

●  archive-pool-0-1.archive-hl.archive.svc.cluster.local:9000
   Uptime: 52 minutes 
   Version: 2023-01-02T09:40:09Z
   Network: 6/6 OK 
   Drives: 4/4 OK 
   Pool: 1

●  archive-pool-0-2.archive-hl.archive.svc.cluster.local:9000
   Uptime: 52 minutes 
   Version: 2023-01-02T09:40:09Z
   Network: 6/6 OK 
   Drives: 4/4 OK 
   Pool: 1

●  archive-pool-0-3.archive-hl.archive.svc.cluster.local:9000
   Uptime: 52 minutes 
   Version: 2023-01-02T09:40:09Z
   Network: 6/6 OK 
   Drives: 4/4 OK 
   Pool: 1

●  archive-pool-0-4.archive-hl.archive.svc.cluster.local:9000
   Uptime: 52 minutes 
   Version: 2023-01-02T09:40:09Z
   Network: 6/6 OK 
   Drives: 4/4 OK 
   Pool: 1

●  archive-pool-0-5.archive-hl.archive.svc.cluster.local:9000
   Uptime: 52 minutes 
   Version: 2023-01-02T09:40:09Z
   Network: 6/6 OK 
   Drives: 4/4 OK 
   Pool: 1

Pools:
   1st, Erasure sets: 2, Drives per erasure set: 12

0 B Used, 1 Bucket, 0 Objects
24 drives online, 0 drives offline

You can list the existing buckets using mc ls command as shown below.

mc ls minio-archive –insecure
[2023-01-04 13:56:39 CET]     0B my-bucket/

As a quick test to check if we can able to write a file into the S3 bucket, following command will create a dummy 10Gb file using fallocate.

fallocate -l 10G /tmp/10Gb.file

Push the 10G file into S3 using following mc cp command

mc cp /tmp/10Gb.file minio-archive/my-bucket –insecure
/tmp/10Gb.file:                        10.00 GiB / 10.00 GiB ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.52 MiB/s 3m18s

While the transfer is still running you can verify at the tenant console that some stats are being populated in the metrics section.

Additionally you can also display some metrics dashboards in the Traffic tab.

Once the transfer is completed, if you list again to see the contents of the bucket you should see something like this.

mc ls minio-archive/my-bucket –insecure
[2023-01-04 17:20:14 CET]  10GiB STANDARD 10Gb.file

Consuming Minio S3 backend from Kubernetes Pods

Now that we know how to install the MinIO Operator, how to create a Tenant and how to create a bucket it is time to understand how to consume the MinIO S3 backend from a regular Pod. The access to the data in a S3 bucket is done generally through API calls so we need a method to create an abstraction layer that allow the data to be mounted in the OS filesystem as a regular folder or drive in a similar way to a NFS share.

This abstraction layer in Linux is implemented by FUSE (Filesystem in Userspace) which, in a nutshell is a user-space program able to mount a file system that appears to the operating system as if it were a regular folder. There is an special project called s3fs-fuse that allow precisely a Linux OS to mount a S3 bucket via FUSE while preserving the native object format.

As a container is essentially a Linux OS, we just need to figure out how to take advantage of this and use it in a kubernetes environment.

Option 1. Using a custom image

The first approach consists on creating a container to access the remote minio bucket presenting it to the container image as a regular filesystem. The following diagram depicts a high level overview of the intended usage.

S3 Mounter Container

As you can guess s3fs-fuse is not part of any core linux distribution so the first step is to create a custom image containing the required software that will allow our container to mount the S3 bucket as a regular file system. Let’s start by creating a Dockerfile that will be used as a template to create the custom image. We need to specify also the command we want to run when the container is spinned up. Find below the Dockerfile to create the custom image with some explanatory comments.

vi Dockerfile
# S3 MinIO mounter sample file 
# 
# Use ubuntu 22.04 as base image
FROM ubuntu:22.04

# MinIO S3 bucket will be mounted on /var/s3 as mouting point
ENV MNT_POINT=/var/s3

# Install required utilities (basically s3fs is needed here). Mailcap install MIME types to avoid some warnings
RUN apt-get update && apt-get install -y s3fs mailcap

# Create the mount point in the container Filesystem
RUN mkdir -p "$MNT_POINT"

# Some optional housekeeping to get rid of unneeded packages
RUN apt-get autoremove && apt-get clean

# 1.- Create a file containing the credentials used to access the Minio Endpoint. 
#     They are defined as variables that must be passed to the container ideally using a secret
# 2.- Mount the S3 bucket in /var/s3. To be able to mount the filesystem s3fs needs
#      - credentials injected as env variables through a Secret and written in s3-passwd file
#      - S3 MinIO Endpoint (passed as env variable through a Secret)
#      - S3 Bucket Name (passed as env variable through a Secret)
#      - other specific keywords for MinIO such use_path_request_style and insecure SSL
# 3.- Last tail command to run container indefinitely and to avoid completion

CMD echo "$ACCESS_KEY:$SECRET_KEY" > /etc/s3-passwd && chmod 600 /etc/s3-passwd && \
    /usr/bin/s3fs "$MINIO_S3_BUCKET" "$MNT_POINT" -o passwd_file=/etc/s3-passwd \
    -o url="$MINIO_S3_ENDPOINT" \
    -o use_path_request_style -o no_check_certificate -o ssl_verify_hostname=0 && \
    tail -f /dev/null

Once the Dockerfile is defined we can build and push the image to your registry. Feel free to use mine here if you wish. If you are using dockerhub instead of a private registry you would need to have an account and login before proceeding.

docker login
Login with your Docker ID to push and pull images from Docker Hub. If you don't have a Docker ID, head over to https://hub.docker.com to create one.
Username: jhasensio
Password: <yourpasswordhere>
WARNING! Your password will be stored unencrypted in /home/jhasensio/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

Now build your image using Dockerfile settings.

docker build –rm -t jhasensio/s3-mounter .
Sending build context to Docker daemon  18.94kB
Step 1/6 : FROM ubuntu:22.04
22.04: Pulling from library/ubuntu
677076032cca: Pull complete 
Digest: sha256:9a0bdde4188b896a372804be2384015e90e3f84906b750c1a53539b585fbbe7f
Status: Downloaded newer image for ubuntu:22.04
 ---> 58db3edaf2be
Step 2/6 : ENV MNT_POINT=/var/s3
 ---> Running in d8f0c218519d
Removing intermediate container d8f0c218519d
 ---> 6a4998c5b028
Step 3/6 : RUN apt-get update && apt-get install -y s3fs mailcap
 ---> Running in a20f2fc03315
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
... <skipped>

Step 4/6 : RUN mkdir -p "$MNT_POINT"
 ---> Running in 78c5f3328988
Removing intermediate container 78c5f3328988
 ---> f9cead3b402f
Step 5/6 : RUN apt-get autoremove && apt-get clean
 ---> Running in fbd556a42ea2
Reading package lists...
Building dependency tree...
Reading state information...
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.
Removing intermediate container fbd556a42ea2
 ---> 6ad6f752fecc
Step 6/6 : CMD echo "$ACCESS_KEY:$SECRET_KEY" > /etc/s3-passwd && chmod 600 /etc/s3-passwd &&     /usr/bin/s3fs $MINIO_S3_BUCKET $MNT_POINT -o passwd_file=/etc/s3-passwd -o use_path_request_style -o no_check_certificate -o ssl_verify_hostname=0 -o url=$MINIO_S3_ENDPOINT  &&     tail -f /dev/null
 ---> Running in 2c64ed1a3c5e
Removing intermediate container 2c64ed1a3c5e
 ---> 9cd8319a789d
Successfully built 9cd8319a789d
Successfully tagged jhasensio/s3-mounter:latest

Now that you have your image built it’s time to push it to the dockerhub registry using following command

docker push jhasensio/s3-mounter
Using default tag: latest
The push refers to repository [docker.io/jhasensio/s3-mounter]
2c82254eb995: Pushed 
9d49035aff15: Pushed 
3ed41791b66a: Pushed 
c5ff2d88f679: Layer already exists 
latest: digest: sha256:ddfb2351763f77114bed3fd622a1357c8f3aa75e35cc66047e54c9ca4949f197 size: 1155

Now you can use your custom image from your pods. Remember this image has been created with a very specific purpose of mounting an S3 bucket in his filesystem, the next step is to inject the configuration and credentials needed into the running pod in order to sucess in the mouting process. There are different ways to do that but the recommended method is passing variables through a secret kubernetes object. Lets create a file with the required environment variables. The variables shown here were captured at tenant creation time and depends on your specific setup and tenant and bucket selected names.

vi s3bucket.env
MINIO_S3_ENDPOINT=https://minio.archive.svc.cluster.local:443
MINIO_S3_BUCKET=my-bucket
ACCESS_KEY=Q78mz0vRlnk6melG
SECRET_KEY=6nWtL28N1eyVdZdt16CIWyivUh3PB5Fp

Now create the secret object using the previous environment file as source as shown below.

kubectl create secret generic s3-credentials –from-env-file=s3bucket.env
secret/s3-credentials created

Verify the content. Note the variables appears as Base64 coded once the secret is created.

kubectl get secrets s3-credentials -o yaml
apiVersion: v1
data:
  ACCESS_KEY: UTc4bXowdlJsbms2bWVsRw==
  MINIO_S3_BUCKET: bXktYnVja2V0
  MINIO_S3_ENDPOINT: aHR0cHM6Ly9taW5pby5hcmNoaXZlLnN2Yy5jbHVzdGVyLmxvY2FsOjQ0Mw==
  SECRET_KEY: Nm5XdEwyOE4xZXlWZFpkdDE2Q0lXeWl2VWgzUEI1RnA=
kind: Secret
metadata:
  creationTimestamp: "2023-02-10T09:33:53Z"
  name: s3-credentials
  namespace: default
  resourceVersion: "32343427"
  uid: d0628fff-0438-41e8-a3a4-d01ee20f82d0
type: Opaque

If you need to be sure about your coded variables just revert the process to double check if the injected data is what you expect. For example taking MINIO_S3_ENDPOINT as an example use following command to get the plain text of any given secret data entry.

kubectl get secrets s3-credentials -o jsonpath=”{.data.MINIO_S3_ENDPOINT}” | base64 -d ; echo
https://minio.archive.svc.cluster.local:443

Now that the configuration secret is ready we can proceed with the pod creation. Note you would need privilege access to be able to access the fuse kernel module that is needed to access the S3 bucket

vi s3mounterpod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: s3-mounter
spec:
  containers:
    - image: jhasensio/s3-mounter
      name: s3-mounter
      imagePullPolicy: Always
      # Privilege mode needed to access FUSE kernel module
      securityContext:
        privileged: true
      # Injecting credentials from secret (must be in same ns)
      envFrom:
      - secretRef:
          name: s3-credentials
      # Mounting host devfuse is required
      volumeMounts:
      - name: devfuse
        mountPath: /dev/fuse
  volumes:
    - name: devfuse
      hostPath:
        path: /dev/fuse

Open an interactive session to the pod for further exploration.

kubectl exec s3-mounter -ti — bash
root@s3-mounter:/# 

Verify the secret injected variables are shown as expected

root@s3-mounter:/# env | grep -E “MINIO|KEY”
ACCESS_KEY=Q78mz0vRlnk6melG
SECRET_KEY=6nWtL28N1eyVdZdt16CIWyivUh3PB5Fp
MINIO_S3_ENDPOINT=https://minio.archive.svc.cluster.local:443
MINIO_S3_BUCKET=my-bucket

Also verify the mount in the pod filesystem

root@s3-mounter:/# mount | grep s3
s3fs on /var/s3 type fuse.s3fs (rw,nosuid,nodev,relatime,user_id=0,group_id=0)

You should be able to read any existing data in the s3 bucket. Note how you can list 10Gb.file we created earlier using the mc client.

root@s3-mounter:~# ls -alg /var/s3
total 10485765
drwx------ 1 root           0 Jan  1  1970 .
drwxr-xr-x 1 root        4096 Jan 04 17:00 ..
-rw-r----- 1 root 10737418240 Jan 04 17:53 10Gb.file

Create a new file to verify if you can write as well.

root@s3-mounter:~# touch /var/s3/file_from_s3_mounter_pod.txt

The new file should appear in the bucket

root@s3-mounter:/var/s3# ls -alg
total 10485766
drwx------ 1 root           0 Jan  1  1970 .
drwxr-xr-x 1 root        4096 Jan 04 17:00 ..
-rw-r----- 1 root 10737418240 Jan 04 17:53 10Gb.file
-rw-r--r-- 1 root           0 Jan 04 17:59 file_from_s3_mounter_pod.txt

And it should be accesible using the mc client as seen below.

mc ls minio-archive/my-bucket –insecure
[2023-01-04 17:53:34 CET]  10GiB STANDARD 10Gb.file
[2023-01-04 17:59:22 CET]     0B STANDARD file_from_s3_mounter_pod.txt

These are great news but, is that approach really useful? That means that to be able to reach an S3 bucket we need to prepare a custom image with our intended application and add the s3fs to ensure we are able to mount a MinIO S3 filesystem. That sounds pretty rigid so lets explore other more flexible options to achieve similar results.

Option 2. Using a sidecar pattern

Once we have prepared the custom image and have verified everything is working as expected we can give another twist. As you probably know there is a common pattern available in kubernetes that involves running an additional container (a.k.a sidecar) alongside the main container in a pod. The sidecar container provides additional functionality, such as logging, monitoring, or networking, to the main container. In our case the additional container will be in charge of mouting the MinIO S3 filesystem allowing the main container to focus on its core responsibility and offload this storage related tasks to the sidecar. The following picture depicts the intended arquictecture.

Pod using Sidecar Pattern with S3-Mounter

Lets try to create a simple pod to run apache to provide access to an S3 backed filesystem via web. The first step is to create a custom httpd.conf file to serve html document in a custom path at “/var/www/html”. Create a file like shown below:

vi httpd.conf
ServerRoot "/usr/local/apache2"
ServerName "my-apache"
Listen 80
DocumentRoot "/var/www/html"
LoadModule mpm_event_module modules/mod_mpm_event.so
LoadModule authz_core_module modules/mod_authz_core.so
LoadModule unixd_module modules/mod_unixd.so
LoadModule autoindex_module modules/mod_autoindex.so

<Directory "/var/www/html">
  Options Indexes FollowSymLinks
  AllowOverride None
  Require all granted
</Directory>

Now create a configmap object from the created file to easily inject the custom configuration into the pod.

kubectl create cm apache-config –from-file=httpd.conf
configmap/apache-config created

Find below a sample sidecar pod. The mountPropagation spec does the trick here. If you are interested, there is a deep dive around Mount Propagation in this blog. The bidirectional mountPropagation Bidirectional allow any volume mounts created by the container to be propagated back to the host and to all containers of all pods that use the same volume that is exactly what we are trying to achieve here. The s3-mounter sidecar container will mount the S3 bucket and will propagate to the host as a local mount that in turn will be available to the main apache pod as a regular hostPath type persistent volume.

vi web-server-sidecar-s3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-server-sidecar-s3
spec:
  containers:
      ######
      # Main apache container
      ####
    - name: apache2
      image: httpd:2.4.55
      securityContext:
        privileged: true
      ports:
        - containerPort: 80
          name: http-web
      volumeMounts:
      # Mount custom apache config file from volume
        - name: apache-config-volume
          mountPath: /tmp/conf
      # Mount S3 bucket into web server root. (bidirectional
        - name: my-bucket
          mountPath: /var/www
          mountPropagation: Bidirectional
      # Copy custom httpd.conf extracted from configmap in right path and run httpd
      command: ["/bin/sh"]
      args: ["-c", "cp /tmp/conf/httpd.conf /usr/local/apache2/conf/httpd.conf && /usr/local/bin/httpd-foreground"]
      ######
      # Sidecar Container to mount MinIO S3 Bucket
      ####
    - name: s3mounter
      image: jhasensio/s3-mounter
      imagePullPolicy: Always
      securityContext:
        privileged: true
      envFrom:
        - secretRef:
            name: s3-credentials
      volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
      # Mount S3 bucket  (bidirectional allow sharing between containers)
        - name: my-bucket
          mountPath: /var/s3
          mountPropagation: Bidirectional
     # Safely umount filesystem before stopping
      lifecycle:
          preStop:
            exec:
              command: ["/bin/sh","-c","fusermount -u /var/s3"]
  volumes:
    - name: devfuse
      hostPath:
        path: /dev/fuse
    # Host filesystem
    - name: my-bucket
      hostPath:
        path: /mnt/my-bucket
    # Safely umount filesystem before stopping
    - name: apache-config-volume
      configMap:
        name: apache-config

Once the pod is ready try to list the contents of the /var/s3 folder at s3mounter sidecar container using follwing command (note we need to specify the name of the container with -c keyword to reach the intented one). The contents of the folder should be listed.

kubectl exec web-server-sidecar-s3 -c s3mounter — ls /var/s3/
10Gb.file
file_from_s3_mounter_pod.txt
html

Repeat the same for the apache2 container. The contents of the folder should be listed as well.

kubectl exec web-server-sidecar-s3 -c apache2 — ls /var/www/
10Gb.file
file_from_s3_mounter_pod.txt
html

Also in the worker filesystem the mounted S3 should be available. Extract the IP of the worker in which the pod is running using the .status.hostIP and try to list the contents of the local hostpath /mnt/my-bucket. You can jump into the worker IP or using a single command via ssh remote execution as seen below:

ssh $(kubectl get pod web-server-sidecar-s3 -o jsonpath=”{.status.hostIP}”) sudo ls /mnt/my-bucket
10Gb.file
file_from_s3_mounter_pod.txt
html

Now create some dummy files in the worker filesystem and in the html folder. This html folder is the configured root directory in apache2 to serve html documents.

ssh $(kubectl get pod web-server-sidecar-s3 -o jsonpath="{.status.hostIP}") sudo touch /mnt/my-bucket/html/file{01..10}.txt

And now both pods should be able to list the new files as shown below from apache2.

kubectl exec web-server-sidecar-s3 -c apache2 — ls /var/www/html
file01.txt
file02.txt
file03.txt
file04.txt
file05.txt
file06.txt
file07.txt
file08.txt
file09.txt
file10.txt

Apache2 will serve the content of the html folder so lets try to verify we can actually access the files through http. First forward the apache2 pod port listening at port 80 to a local port 8080 via kubectl port-forward command.

kubectl port-forward web-server-sidecar-s3 8080:80 &
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80

And now try to reach the apache2 web server. According to the httpd.conf it should display any file in the folder. Using curl you can verify this is working as displayed below

curl localhost:8080
Handling connection for 8080
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /</title>
 </head>
 <body>
<h1>Index of /</h1>
<ul><li><a href="file01.txt"> file01.txt</a></li>
<li><a href="file02.txt"> file02.txt</a></li>
<li><a href="file03.txt"> file03.txt</a></li>
<li><a href="file04.txt"> file04.txt</a></li>
<li><a href="file05.txt"> file05.txt</a></li>
<li><a href="file06.txt"> file06.txt</a></li>
<li><a href="file07.txt"> file07.txt</a></li>
<li><a href="file08.txt"> file08.txt</a></li>
<li><a href="file09.txt"> file09.txt</a></li>
<li><a href="file10.txt"> file10.txt</a></li>
</ul>
</body></html>

If you want to try from a browser you should also access the S3 bucket.

Now we have managed to make the s3-mounter pod independent of a regular pod using a sidecar pattern to deploy our application. Regardless this configuration may fit with some requirements it may not scale very well in other environments. For example, if you create another sidecar pod in the same worker node trying to access the very same bucket you will end up with an error generated by the new s3-mounter when trying to mount bidirecctionally a non-empty volume (basically because it is already mounted by the other pod) as seen below.

kubectl logs web-server-sidecar-s3-2 -c s3mounter
s3fs: MOUNTPOINT directory /var/s3 is not empty. if you are sure this is safe, can use the 'nonempty' mount option.

So basically you must use a different hostPath such /mnt/my-bucket2 to avoid above error which implies an awareness and management of existing hostPaths which sounds weird and not very scalable. This is where the third approach come into play.

Option 3. Using a DaemonSet to create a shared volume

This option relies on a DaemonSet object type. According to official documentation A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.

This way we can think in the bucket as a s3-backed storage provider with a corresponding persistent volume mounted as HostPath in all the workers. This approach is ideal to avoid any error when multiple pods try to mount the same bucket using the same host filesystem path. The following picture depicts the architecture with a 3 member cluster.

S3 Backend access using shared hostpath volumes with DaemonSet mounter

The DaemonSet object will look like the one shown below. We have added some lifecycle stuff to ensure the S3 filesystem is properly unmounted whe the pod is stopped.

vi s3-mounter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: minio-s3-my-bucket
  name: minio-s3-my-bucket
spec:
  selector:
    matchLabels:
       app: minio-s3-my-bucket
  template:
    metadata:
      labels:
        app: minio-s3-my-bucket
    spec:
      containers:
      - name: s3fuse
        image: jhasensio/s3-mounter
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh","-c","fusermount -u /var/s3"]
        securityContext:
          privileged: true
        envFrom:
        - secretRef:
            name: s3-credentials
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - name: my-bucket
          mountPath: /var/s3
          mountPropagation: Bidirectional
      volumes:
      - name: devfuse
        hostPath:
          path: /dev/fuse
      - name: my-bucket
        hostPath:
          path: /mnt/my-bucket

Apply the above manifest and verify the daemonset is properly deployed and is ready and running in all the workers in the cluster.

kubectl get daemonsets.apps
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
minio-s3-my-bucket   6         6         6       6            6           <none>          153m

Verify you can list the contents of the S3 bucket

kubectl exec minio-s3-my-bucket-fzcb2 — ls /var/s3
10Gb.file
file_from_s3_mounter_pod.txt
html

Now that the daemonSet is properly running we should be able to consume the /mnt/my-bucket path in the worker filesystem as a regular hostPath volume. Let’s create the same pod we used previously as an single container pod. Remember to use Bidirectional mountPropagation again.

vi apache-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: apachepod
spec:
  containers:
    - name: web-server
      image: httpd:2.4.55
      securityContext:
        privileged: true
      ports:
        - containerPort: 80
          name: http-web
      volumeMounts:
        - name: apache-config-volume
          mountPath: /tmp/conf
        - name: my-bucket
          mountPath: /var/www
          mountPropagation: Bidirectional
      command: ["/bin/sh"]
      args: ["-c", "cp /tmp/conf/httpd.conf /usr/local/apache2/conf/httpd.conf && /usr/local/bin/httpd-foreground"]
  volumes:
    - name: apache-config-volume
      configMap:
        name: apache-config
    - name: my-bucket
      hostPath: 
        path: /mnt/my-bucket

Try to list the contents of the volume pointing to the hostPath /mnt/my-bucket that in turn points to the /var/s3 folder used by the daemonset controlled pod to mount the s3 bucket.

kubectl exec apachepod — ls /var/www
10Gb.file
file_from_s3_mounter_pod.txt
html

Repeat the port-forward to try to reach the apache2 web server.

kubectl port-forward apachepod 8080:80 &
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80

And you should see again the contents of /var/www/html folder via http as shown below.

curl localhost:8080
Handling connection for 8080
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /</title>
 </head>
 <body>
<h1>Index of /</h1>
<ul><li><a href="file01.txt"> file01.txt</a></li>
<li><a href="file02.txt"> file02.txt</a></li>
<li><a href="file03.txt"> file03.txt</a></li>
<li><a href="file04.txt"> file04.txt</a></li>
<li><a href="file05.txt"> file05.txt</a></li>
<li><a href="file06.txt"> file06.txt</a></li>
<li><a href="file07.txt"> file07.txt</a></li>
<li><a href="file08.txt"> file08.txt</a></li>
<li><a href="file09.txt"> file09.txt</a></li>
<li><a href="file10.txt"> file10.txt</a></li>
</ul>
</body></html>

It works!

Bonus: All together now. It’s time for automation

So far we have explored different methods to install and consume an S3 backend over a vSAN enabled infrastructure. Some of this methods are intented to be done by a human, in fact, access to the console is very interesting for learning and for performing maintenance and observability tasks on day 2 where a graphical interface might be crucial.

However, in the real world and especially when using kubernetes, it would be very desirable to be able to industrialize all these provisioning tasks without any mouse “click”. Next you will see a practical case for provisioning via CLI and therefore 100% automatable.

To recap, the steps we need to complete the provisioning of a S3 bucket from scratch given a kubernetes cluster with MinIO operator installed are summarized below

  • Create a namespace to place all MinIO tenant related objects
  • Create a MinIO tenant using MinIO plugin (or apply an existing yaml manifiest containing the Tenant CRD and rest of required object)
  • Extract credentials for the new tenant and inject them into a kubernetes secret object
  • Create the MinIO S3 bucket via a kubernetes job
  • Deploy Daemonset to expose the S3 backend as a shared volume

The following script can be used to automate everything. It will only require two parameters namely Tenant and Bucket name. In the script they are statically set but it would be very easy to pass both parameters as an input.

Sample Script to Create MinIO Tenant and share S3 bucket through a Bidirectional PV
#/bin/bash

# INPUT PARAMETERS
# ----------------
TENANT=test
BUCKET=mybucket

# STEP 1 CREATE NAMESPACE AND TENANT
# --------------------------------
kubectl create ns $TENANT
kubectl minio tenant create $TENANT --servers 2 --volumes 4 --capacity 10G --namespace $TENANT --storage-class vsphere-sc-sna

# STEP 2 DEFINE VARIABLES, EXTRACT CREDENTIALS FROM CURRENT TENANT AND PUT TOGETHER IN A SECRET 
# ----------------------------------------------------------------------------------------------
echo "MINIO_S3_ENDPOINT=https://minio.${TENANT}.svc.cluster.local" > s3bucket.env
echo "MINIO_S3_BUCKET=${BUCKET}" >> s3bucket.env
echo "SECRET_KEY=$(kubectl get secrets -n ${TENANT} ${TENANT}-user-1 -o jsonpath="{.data.CONSOLE_SECRET_KEY}" | base64 -d)" >> s3bucket.env
echo "ACCESS_KEY=$(kubectl get secrets -n ${TENANT} ${TENANT}-user-1 -o jsonpath="{.data.CONSOLE_ACCESS_KEY}" | base64 -d)" >> s3bucket.env

kubectl create secret generic s3-credentials --from-env-file=s3bucket.env

# STEP 3 CREATE BUCKET USING A JOB
----------------------------------
kubectl apply -f create-s3-bucket-job.yaml


# STEP 4 WAIT FOR S3 BUCKET CREATION JOB TO SUCCESS
---------------------------------------------------
kubectl wait pods --for=jsonpath='{.status.phase}'=Succeeded -l job-name=create-s3-bucket-job --timeout=10m


# STEP 5 DEPLOY DAEMONSET TO SHARE S3 BUCKET AS A PV
----------------------------------------------------
kubectl apply -f s3-mounter-daemonset.yaml

Note the Step 2 requires the MinIO operator deployed in the cluster and also the krew MinIO plugin installed. Alternatively, you can also use the –output to dry-run the tenant creation command and generate an output manifest yaml file that can be exported an used later without the krew plugin installed. The way to generate the tenant file is shown below.

kubectl minio tenant create $TENANT --servers 2 --volumes 4 --capacity 10G --namespace $TENANT --storage-class vsphere-sc --output > tenant.yaml

Is also worth to mention we have deciced to create the MinIO bucket through a job. This is a good choice to perform task in a kubernetes cluster. As you can tell looking into the manifest content below, the command used in the pod in charge of the job includes some simple logic to ensure the task is retried in case of error whilst the tenant is still spinning up and will run until the bucket creation is completed succesfully. The manifest that define the job is shown below.

vi create-s3-bucket-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: create-s3-bucket-job
spec:
  template:
    spec:
      containers:
      - name: mc
       # s3-credentials secret must contain $ACCESS_KEY, $SECRET_KEY, $MINIO_S3_ENDPOINT, $MINIO_S3_BUCKET 
        envFrom:
        - secretRef:
            name: s3-credentials
        image: minio/mc
        command: 
          - sh
          - -c
          - ls /tmp/error > /dev/null 2>&1 ; until [[ "$?" == "0" ]]; do sleep 5; echo "Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs"; mc alias set s3 $(MINIO_S3_ENDPOINT) $(ACCESS_KEY) $(SECRET_KEY) --insecure; done && mc mb s3/$(MINIO_S3_BUCKET) --insecure
      restartPolicy: Never
  backoffLimit: 4

Once the job is created you can observe up to three diferent errors of the MinIO client trying to complete the bucket creation. The logic will reattempt until completion as shown below:

kubectl logs -f create-s3-bucket-job-qtvx2
mc: <ERROR> Unable to initialize new alias from the provided credentials. Get "https://minio.my-minio-tenant.svc.cluster.local/probe-bucket-sign-ubatm9xin2a
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                             
mc: <ERROR> Unable to initialize new alias from the provided credentials. Get "https://minio.my-minio-tenant.svc.cluster.local/probe-bucket-sign-n09ttv9pnfw
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                             
mc: <ERROR> Unable to initialize new alias from the provided credentials. Server not initialized, please try again.                              
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                             
mc: <ERROR> Unable to initialize new alias from the provided credentials. Server not initialized, please try again.                              
...skipped                                                                          
mc: <ERROR> Unable to initialize new alias from the provided credentials. The Access Key Id you provided does not exist in our records.          
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                                                                                                        
mc: <ERROR> Unable to initialize new alias from the provided credentials. The Access Key Id you provided does not exist in our records.          
... skipped         
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                             
mc: <ERROR> Unable to initialize new alias from the provided credentials. The Access Key Id you provided does not exist in our records.          
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                             
Added `s3` successfully.                                                                                                                         
Bucket created successfully `s3/mybucket`.

Once the job is completed, the wait condition (status.phase=Suceeded) will be met and the script will continue to next step that consists on the deployment of the DaemonSet. Once the DaemonSet is ready you should be able to consume the hostPath type PV that points to the S3 bucket (/mnt/my-bucket) in this case from any regular pod. Create a test file in the pod mount folder (/var/s3).

kubectl exec pods/minio-s3-my-bucket-nkgw2 -ti -- touch /var/s3/test.txt

Now you can spin a simple sleeper pod that is shown below for completeness. Don’t forget to add the Bidirectional mountPropagation spec.

vi sleeper_pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: s3-sleeper-test
spec:
  containers:
    - name: sleeper
      image: ubuntu
      securityContext:
        privileged: true
      volumeMounts:
        - name: my-bucket
          mountPath: /data
          mountPropagation: Bidirectional
      command: ["sleep", "infinity"]
  volumes:
    - name: my-bucket
      hostPath: 
        path: /mnt/my-bucket

We should be able to list de S3 bucket objects under /data pod folder.

kubectl exec pods/s3-sleeper-test -ti — ls -alg /data
total 5
drwx------ 1 root    0 Jan  1  1970 .
drwxr-xr-x 1 root 4096 Feb 13 10:16 ..
-rw-r--r-- 1 root    0 Feb 13 10:13 test.txt

We are done!!!

This has been a quite lenghty series of articles. We have learn how to install vSphere CSI Driver, how to enable vSAN File Services and, in this post, how to setup MinIO over the top of a vSAN enabled infrastructure and how to consume from kubernetes pods using different approaches. Hope you can find useful.

Preparing vSphere for Kubernetes Persistent Storage (2/3): Enabling vSAN File Services

In the previous article we walked through the preparation of an upstream k8s cluster to take advantage of the converged storage infrastructure that vSphere provides by using a CSI driver that allows the pod to consume the vSAN storage in the form of Persistent Volumes created automatically through an special purpose StorageClass.

With the CSI driver we would have most of the persistent storage needs for our pods covered, however, in some particular cases it is necessary that multiple pods can mount and read/write the same volume simultaneously. This is basically defined by the Access Mode specification that is part of the PV/PVC definition. The typical Access Modes available in kubernetes are:

  • ReadWriteOnce – Mount a volume as read-write by a single node
  • ReadOnlyMany – Mount the volume as read-only by many nodes
  • ReadWriteMany – Mount the volume as read-write by many nodes

In this article we will focus on the Access Mode ReadWriteMany (RWX) that allow a volume to be mounted simultaneously in read-write mode for multiple pods running in different kubernetes nodes. This access mode is tipically supported by a network file sharing technology such as NFS. The good news are this is not a big deal if we have vSAN because, again, we can take advantage of this wonderful technology to enable the built-in file services and create shared network shares in a very easy and integrated way.

Enabling vSAN File Services

The procedure for doing this is described below. Let’s move into vSphere GUI for a while. Access to your cluster and go to vSAN>Services. Now click on ENABLE blue button at the bottom of the File Service Tile.

Enabling vSAN file Services

The first step will be to select the network on which the service will be deployed. In my case it will select a specific PortGroup in the subnet 10.113.4.0/24 and the VLAN 1304 with name VLAN-1304-NFS.

Enable File Services. Select Network

This action will trigger the creation of the necessary agents in each one of the hosts that will be in charge of serving the shared network resources via NFS. After a while we should be able to see a new Resource Group named ESX Agents with four new vSAN File Service Node VMs.

vSAN File Service Agents

Once the agents have been deployed we can access to the services configuration and we will see that the configuration is incomplete because we haven’t defined some important service parameters yet. Click on Configure Domain button at the botton of the File Service tile.

vSAN File Service Configuration Tile

The first step is to define a domain that will host the names of the shared services. In my particular case I will use vsanfs.sdefinitive.net as domain name. Any shared resource will be reached using this domain.

vSAN File Service Domain Configuration

Before continuing, it is important that our DNS is configured with the names that we will give to the four file services agents needed. In my case I am using fs01, fs02, fs03 and fs04 as names in the domain vsanfs.sdefinitive.net and the IP addresses 10.113.4.100-103. Additionally we will have to indicate the IP of the DNS server used in the resolution, the netmask and the default gateway as shown below.

vSAN File Services Network Configuration

In the next screen we will see the option to integrate with AD, at the moment we can skip it because we will consume the network services from pods.

vSAN File Service integration with Active Directory

Next you will see the summary of all the settings made for a last review before proceeding.

vSAN File Service Domain configuration summary

Once we press the FINISH green button the network file services will be completed and ready to use.

Creating File Shares from vSphere

Once the vSAN File Services have been configured we should be able to create network shares that will be eventually consumed in the form of NFS type volumes from our applications. To do this we must first provision the file shares according our preferences. Go to File Services and click on ADD to create a new file share.

Adding a new NFS File Share

The file share creation wizard allow us to specify some important parameters such as the name of our shared service, the protocol (NFS) used to export the file share, the NFS version (4.1 and 3), the Storage Policy that the volume will use and, finally other quota related settings such as the size and warning threshold for our file share.

Settings for vSAN file share

Additionally we can set a add security by means of a network access control policy. In our case we will allow any IP to access the shared service so we select the option “Allow access from any IP” but feel free to restrict access to certain IP ranges in case you need it.

Network Access Control for vSAN file share

Once all the parameters have been set we can complete the task by pressing the green FINISH button at the bottom right side of the window.

vSAN File Share Configuration Summary

Let’s inspect the created file share that will be seen as another vSAN Virtual Object from the vSphere administrator perspective.

Inpect vSAN File Share

If we click on the VIEW FILE SHARE we could see the specific configuration of our file share. Write down the export path (fs01.vsanfs.sdefinitive.net:/vsanfs/my-nfs-share) since it will be used later as an specification of the yaml manifest that will declare the corresponding persistent volume kubernetes object.

Displaying vSAN File Share parameters

From an storage administrator perspective we are done. Now we will see how to consume it from the developer perspective through native kubernetes resources using yaml manifest.

Consuming vSAN Network File Shares from Pods

A important requirement to be able to mount nfs shares is to have the necesary software installed in the worker OS, otherwise the mounting process will fail. If you are using a Debian’s familly Linux distro such as Ubuntu, the installation package that contains the necessary binaries to allow nfs mounts is nfs-common. Ensure this package is installed before proceeding. Issue below command to meet the requirement.

sudo apt-get install nfs-common

Before proceeding with creation of PV/PVCs, it is recommended to test connectivity from the workers as a sanity check. The first basic test would be pinging to the fqdn of the host in charge of the file share as defined in the export path of our file share captured earlier. Ensure you can also ping to the rest of the nfs agents defined (fs01-fs04).

ping fs01.vsanfs.sdefinitive.net
PING fs01.vsanfs.sdefinitive.net (10.113.4.100) 56(84) bytes of data.
64 bytes from 10.113.4.100 (10.113.4.100): icmp_seq=1 ttl=63 time=0.688 ms

If DNS resolution and connectivity is working as expected we are safe to mount the file share in any folder in your filesystem. Following commands show how to mount the file share using NFS 4.1 by using the export part associated to our file share. Ensure the mount point (/mnt/my-nfs-share in this example) exists before proceeding. If not so create in advance using mkdir as usual.

mount -t nfs4 -o minorversion=1,sec=sys fs01.vsanfs.sdefinitive.net:/vsanfs/my-nfs-share /mnt/my-nfs-share -v
mount.nfs4: timeout set for Fri Dec 23 21:30:09 2022
mount.nfs4: trying text-based options 'minorversion=1,sec=sys,vers=4,addr=10.113.4.100,clientaddr=10.113.2.15'

If the mounting is sucessfull you should be able to access the share at the mount point folder and even create a file like shown below.

/ # cd /mnt/my-nfs-share
/ # touch hola.txt
/ # ls

hola.txt

Now we are safe to jump into the manifest world to define our persistent volumes and attach them to the desired pod. First declare the PV object using ReadWriteMany as accessMode and specify the server and export path of our network file share.

Note we will use here a storageClassName specification using an arbitrary name vsan-nfs. Using a “fake” or undefined storageClass is supported by kubernetes and is tipically used for binding purposes between the PV and the PVC which is exactly our case. This is a requirement to avoid that our PV resource ends up using the default storage class which in this particular scenario would not be compatible with the ReadWriteMany access-mode required for NFS volumes.

vi nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  storageClassName: vsan-nfs
  capacity:
    storage: 500Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: fs01.vsanfs.sdefinitive.net
    path: "/vsanfs/my-nfs-share"

Apply the yaml and verify the creation of the PV. Note we are using rwx mode that allow access to the same volume from different pods running in different nodes simultaneously.

kubectl get pv nfs-pv
NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
nfs-pv   500Mi      RWX            Retain           Bound    default/nfs-pvc   vsan-nfs                 60s

Now do the same for the PVC pointing to the PV created. Note we are specifiying the same storageClassName to bind the PVC with the PV. The accessMode must be also consistent with PV definition and finally, for this example we are claiming 500 Mbytes of storage.

vi nfs-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-pvc
spec:
  storageClassName: vsan-nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Mi

As usual verify the status of the pvc resource. As you can see the pvc is bound state as expected.

kubectl get pvc nfs-pvc
NAME      STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-pvc   Bound    nfs-pv   500Mi      RWX            vsan-fs        119s

Then attach the volume to a regular pod using following yaml manifest as shown below. This will create a basic pod that will run an alpine image that will mount the nfs pvc in the /my-nfs-share container’s local path . Ensure the highlighted claimName specification of the volume matches with the PVC name defined earlier.

vi nfs-pod1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nfs-pod1
spec:
  containers:
  - name: alpine
    image: "alpine"
    volumeMounts:
    - name: nfs-vol
      mountPath: "/my-nfs-share"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: nfs-vol
      persistentVolumeClaim:
        claimName: nfs-pvc

Apply the yaml using kubectl apply and try to open a shell session to the container using kubectl exec as shown below.

kubectl exec nfs-pod1 -it -- sh

We should be able to access the network share, list any existing files to check if you are able to write new files as shown below.

/ # touch /my-nfs-share/hola.pod1
/ # ls /my-nfs-share
hola.pod1  hola.txt

The last test to check if actually multiple pods running in different nodes can read and write the same volume simultaneously would be creating a new pod2 that mounts the same volume. Ensure that both pods are scheduled in different nodes for a full verification of the RWX access-mode.

vi nfs-pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nfs-pod2
spec:
  containers:
  - name: alpine
    image: "alpine"
    volumeMounts:
    - name: nfs-vol
      mountPath: "/my-nfs-share"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: nfs-vol
      persistentVolumeClaim:
        claimName: nfs-pvc

In the same manner apply the manifest file abouve to spin up the new pod2 and try to open a shell.

kubectl exec nfs-pod2 -it -- sh

Again, we should be able to access the network share, list existing files and also to create new files.

/ # touch /my-nfs-share/hola.pod2
/ # ls /my-nfs-share
hola.pod1  hola.pod2  hola.txt

In this article we have learnt how to enable vSAN File Services and how to consume PV in RWX. In the next post I will explain how to leverage MinIO technology to provide an S3 like object based storage on the top of vSphere for our workloads. Stay tuned!

Preparing vSphere for Kubernetes Persistent Storage (1/3): Installing vSphere CSI Driver

It is very common to relate Kubernetes with terms like stateless and ephemeral. This is because, when a container is terminated, any data stored in it during its lifetime is lost. This behaviour might be acceptable in some scenarios, however, very often, there is a need to store persistently data and some static configurations. As you can imagine data persistence is an essential feature for running stateful applications such as databases and fileservers.

Persistence enables data to be stored outside of the container and here is when persitent volumes come into play. This post will explain later what a PV is but, in a nutshell, a PV is a Kubernetes resource that allows data to be retained even if the container is terminated, and also allows the data to be accessed by multiple containers or pods if needed.

On the other hand, is important to note that, generally speaking, a Kubernetes cluster does not exist on its own but depends on some sort of underlying infrastructure, that means that it would be really nice to have some kind of connector between the kubernetes control plane and the infrastructure control plane to get the most of it. For example, this dialogue may help the kubernetes scheduler to place the pods taking into account the failure domain of the workers to achieve better availability of an application, or even better, when it comes to storage, the kubernetes cluster can ask the infrastructure to use the existing datastores to meet any persistent storage requirement. This concept of communication between a kubernetes cluster and the underlying infrastructure is referred as a Cloud Provider or Cloud Provider Interface.

In this particular scenario we are running a kubernetes cluster over the top of vSphere and we will walk through the process of setting up the vSphere Cloud Provider. Once the vSphere Cloud Provider Interface is set up you can take advantage of the Container Native Storage (CNS) vSphere built-in feature. The CNS allows the developer to consume storage from vSphere on-demand on a fully automated fashion while providing to the storage administrator visibility and management of volumes from vCenter UI. Following picture depicts a high level diagram of the integration.

Kubernetes and vSphere Cloud Provider

It is important to note that this article is not based on any specific kubernetes distributions in particular. In the case of using Tanzu on vSphere, some of the installation procedures are not necessary as they come out of the box when enabling the vSphere with Tanzu as part of an integrated solution.

Installing vSphere Container Storage Plugin

The Container Native Storage feature is realized by means of a Container Storage Plugin, also called a CSI driver. This CSI runs in a Kubernetes cluster deployed in vSphere and will be responsilbe for provisioning persistent volumes on vSphere datastores by interacting with vSphere control plane (i.e. vCenter). The plugin supports both file and block volumes. Block volumes are typically used in more specialized cases where low-level access to the storage is required, such as for databases or other storage-intensive workloads whereas file volumes are more commonly used in kubernetes because they are more flexible and easy to manage. This guide will focus on the file volumes but feel free to explore extra volume types and supported functionality as documented here.

Before proceeding, if you want to interact with vCenter via CLI instead of using the GUI a good helper would be govc that is a tool designed to be a user friendly CLI alternative to the vCenter GUI and it is well suited for any related automation tasks. The easiest way to install it is using the govc prebuilt binaries on the releases page. The following command will install it automatically and place the binary in the /usr/local/bin path.

curl -L -o - "https://github.com/vmware/govmomi/releases/latest/download/govc_$(uname -s)_$(uname -m).tar.gz" | sudo tar -C /usr/local/bin -xvzf - govc

To facilitate the use of govc, we can create a file to set some environment variables to avoid having to enter the URL and credentials each time. A good practice is to obfuscate the credentials using a basic Base64 encoding algorithm. Following command show how to code any string using this mechanism.

echo “passw0rd” | base64
cGFzc3cwcmQK

Get the Base64 encoded string of your username and password as shown above and now edit a file named govc.env and set the following environment variables replacing with your particular data.

vi govc.env
export GOVC_URL=vcsa.cpod-vcn.az-mad.cloud-garage.net
export GOVC_USERNAME=$(echo <yourbase64encodedusername> | base64 -d)
export GOVC_PASSWORD=$(echo <yourbase64encodedpasssord> | base64 -d)
export GOVC_INSECURE=1

Once the file is created you can actually set the variables using source command.

source govc.env

If everything is ok you can should be able to use govc command without any further parameters. As an example, try a simple task such as browsing your inventory to check if you can access to your vCenter and authentication has succeded.

govc ls
/cPod-VCN/vm
/cPod-VCN/network
/cPod-VCN/host
/cPod-VCN/datastore

Step 1: Prepare vSphere Environment

According to the deployment document mentioned earlier, one of the requirements is enabling the UUID advanced property in all Virtual Machines that conform the cluster that is going to consume the vSphere storage through the CSI.

Since we already have the govc tool installed and operational we can take advantage of it to do it programmatically instead of using the vsphere graphical interface which is always more laborious and costly in time, especially if the number of nodes in our cluster is very high. The syntax to enable the mentioned advanced property is shown below.

govc vm.change -vm 'vm_inventory_path' -e="disk.enableUUID=1"

Using ls command and pointing to the right folder, we can see the name of the VMs that have been placed in the folder of interest. In my setup the VMs are placed under cPod-VCN/vm/k8s folder as you can see in the following output.

govc ls /cPod-VCN/vm/k8s
/cPod-VCN/vm/k8s/k8s-worker-06
/cPod-VCN/vm/k8s/k8s-worker-05
/cPod-VCN/vm/k8s/k8s-worker-03
/cPod-VCN/vm/k8s/k8s-control-plane-01
/cPod-VCN/vm/k8s/k8s-worker-04
/cPod-VCN/vm/k8s/k8s-worker-02
/cPod-VCN/vm/k8s/k8s-worker-01

Now that we know the VMs that conform our k8s cluster you can issue the following command to set the disk-enableUUID VM property one by one. Another smarter approach (specially if the number of worker nodes is high of if you need to automate this task) is taking advantage of some linux helpers to create “single line commands”. See below how you can do it chaining the govc output along with the powerful xargs command to easily issue the same command recursively for all ocurrences.

govc ls /cPod-VCN/vm/k8s | xargs -n1 -I{arg} govc vm.change -vm {arg} -e="disk.enableUUID=1"

This should enable the UUID advanced parameter in all the listed vms and we should be ready to take next step.

Step 2: Install Cloud Provider Interface

Once this vSphere related tasks has been completed, we can move to Kubernetes to install the Cloud Provider Interface. First of all, is worth to mention that the vSphere cloud-controller-manager (the element in charge of installing the required components that conforms the Cloud Provider) relies the well-known kubernetes taint node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule to mark the kubelet as not initialized before proceeding with cloud provider installation. Generally speaking a taint is just a node’s property in a form of a label that is typically used to ensure that nodes are properly configured before they are added to the cluster, and to prevent issues caused by nodes that are not yet ready to operate normally. Once the node is fully initialized, the label can be removed to restoring normal operation. The procedure to taint all the nodes of your cluster in a row, using a single command is shown below.

kubectl get nodes | grep Ready | awk ‘{print $1}’ | xargs -n1 -I{arg} kubectl taint node {arg} node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
node/k8s-contol-plane-01 tainted
node/k8s-worker-01 tainted
node/k8s-worker-02 tainted
node/k8s-worker-03 tainted
node/k8s-worker-04 tainted
node/k8s-worker-05 tainted
node/k8s-worker-06 tainted

Once the cloud-controller-manager initializes this node, the kubelet removes this taint. Verify the taints are configured by using regular kubectl commads and some of the parsing and filtering capabilities that jq provides as showed below.

kubectl get nodes -o json | jq ‘[.items[] | {name: .metadata.name, taints: .spec.taints}]’
{
    "name": "k8s-worker-01",
    "taints": [
      {
        "effect": "NoSchedule",
        "key": "node.cloudprovider.kubernetes.io/uninitialized",
        "value": "true"
      }
    ]
  },
<skipped...>

Once the nodes are properly tainted we can install the vSphere cloud-controller-manager. Note CPI is tied to the specific kubernetes version we are running. In this particular case I am running k8s version 1.24. Get the corresponding manifest from the official cloud-provider-vsphere github repository using below commands.

VERSION=1.24
wget https://raw.githubusercontent.com/kubernetes/cloud-provider-vsphere/master/releases/v$VERSION/vsphere-cloud-controller-manager.yaml

Now edit the downloaded yaml file and locate the section where a Secret object named vsphere-cloud-secret is declared. Change the highlighted lines to match your environment settings. Given the fact this intents to be a lab environment and for the sake of simplicity, I am using a full-rights administrator account for this purpose. Make sure you should follow best practiques and create minimum privileged service accounts if you plan to use it in a production environment. Find here the full procedure to set up specific roles and permissions.

vi vsphere-cloud-controller-manager.yaml (Secret)
apiVersion: v1
kind: Secret
metadata:
  name: vsphere-cloud-secret
  labels:
    vsphere-cpi-infra: secret
    component: cloud-controller-manager
  namespace: kube-system
  # NOTE: this is just an example configuration, update with real values based on your environment
stringData:
  172.25.3.3.username: "[email protected]"
  172.25.3.3.password: "<useyourpasswordhere>"

In the same way, locate a ConfigMap object called vsphere-cloud-config and change relevant settings to match your environment as showed below:

vi vsphere-cloud-controller-manager.yaml (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: vsphere-cloud-config
  labels:
    vsphere-cpi-infra: config
    component: cloud-controller-manager
  namespace: kube-system
data:
  # NOTE: this is just an example configuration, update with real values based on your environment
  vsphere.conf: |
    # Global properties in this section will be used for all specified vCenters unless overriden in VirtualCenter section.
    global:
      port: 443
      # set insecureFlag to true if the vCenter uses a self-signed cert
      insecureFlag: true
      # settings for using k8s secret
      secretName: vsphere-cloud-secret
      secretNamespace: kube-system

    # vcenter section
    vcenter:
      vcsa:
        server: 172.25.3.3
        user: "[email protected]"
        password: "<useyourpasswordhere>"
        datacenters:
          - cPod-VCN

Now that our configuration is completed we are ready to install the controller that will be in charge of establishing the communication between our vSphere based infrastructure and our kubernetes cluster.

kubectl apply -f vsphere-cloud-controller-manager.yaml
serviceaccount/cloud-controller-manager created 
secret/vsphere-cloud-secret created 
configmap/vsphere-cloud-config created 
rolebinding.rbac.authorization.k8s.io/servicecatalog.k8s.io:apiserver-authentication-reader created 
clusterrolebinding.rbac.authorization.k8s.io/system:cloud-controller-manager created
clusterrole.rbac.authorization.k8s.io/system:cloud-controller-manager created 
daemonset.apps/vsphere-cloud-controller-manager created

If everything goes as expected, we should now see a new pod running in the kube-system namespace. Verify the running state just by showing the created pod using kubectl as shown below.

kubectl get pod -n kube-system vsphere-cloud-controller-manager-wtrjn -o wide
NAME                                     READY   STATUS    RESTARTS        AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
vsphere-cloud-controller-manager-wtrjn   1/1     Running   1 (5s ago)   5s   10.113.2.10   k8s-contol-plane-01   <none>           <none>

Step 3: Installing Container Storage Interface (CSI Driver)

Before moving further, it is important to establish the basic kubernetes terms related to storage. The following list summarizes the main resources kubernetes uses for this specific purpose.

  • Persistent Volume: A PV is a kubernetes object used to provision persistent storage for a pod in the form of volumes. The PV can be provisioned manually by an administrator and backed by physical storage in a variety of formats such as local storage on the host running the pod or external storage such as NFS, or it can also be dinamically provisioned interacting with an storage provider through the use of a CSI (Compute Storage Interface) Driver.
  • Persistent Volume Claim: A PVC is the developer’s way of defining a storage request. Just as the definition of a pod involves a computational request in terms of cpu and memory, a pvc will be related to storage parameters such as the size of the volume, the type of data access or the storage technology used.
  • Storage Class: a StorageClass is another Kubernetes resource related to storage that allows you to point a storage resource whose configuration has been defined previously using a class created by the storage administrator. Each class can be related to a particular CSI driver and have a configuration profile associated with it such as class of service, deletion policy or retention.

To sum up, in general, to have persistence in kubernetes you need to create a PVC which later will be consumed by a pod. The PVC is just a request for storage bound to a particular PV, however, if your define a Storage Class, you don’t have to worry about PV provisioning, the StorageClass will create the PV on the fly on your behalf interacting via API with the storage infrastructure.

In the particular case of the vSphere CSI Driver, when a PVC requests storage, the driver will translate the instructions declared in the Kubernetes object into a API request that vCenter will be able to understand. vCenter will then instruct the creation of vSphere cloud native storage (i.e a PV in a form of a native vsphere vmdk) that will be attached to the VM running the Kubernetes node and then attached to the pod itself. One extra benefit is that vCenter will report information about the container volumes in the vSphere client to allow the administrator to have an integrated storage management view.

Let’s deploy the CSI driver then. The first step is to create a new namespace that we will use to place the CSI related objects. To do this we use kubectl as showed below:

kubectl create ns vmware-system-csi

Now create a config file that will be used to authenticate the cluster against vCenter. As mentioned we are using here a full-rights administrator account but it is recommended to use a service account with specific associated roles and permissions. Also, for the sake of simplicity, I am not verifying vCenter SSL presented certificate but it is strongly recommended to import vcenter certificates to enhance communications security. Replace the highligthed lines to match with your own environment as shown below.

vi csi-vsphere.conf
[Global]
cluster-id = "cluster01"
cluster-distribution = "Vanilla"
# ca-file = <ca file path> # optional, use with insecure-flag set to false
# thumbprint = "<cert thumbprint>" # optional, use with insecure-flag set to false without providing ca-file
[VirtualCenter "vcsa.cpod-vcn.az-mad.cloud-garage.net"]
insecure-flag = "true"
user = "[email protected]"
password = "<useyourpasswordhere>"
port = "443"
datacenters = "<useyourvsphereDChere>"

In order to inject the configuration and credential information into kubernetes we will use a secret object that will use the config file as source. Use following kubectl command to proceed.

kubectl create secret generic vsphere-config-secret --from-file=csi-vsphere.conf -n vmware-system-csi

And now it is time to install the driver itself. As usual we will use a manifest that will install the latest version available that at the moment of writing this post is 2.7.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v2.7.0/manifests/vanilla/vsphere-csi-driver.yaml

If you inspect the state of the installed driver you will see that two replicas of the vsphere-csi-controller deployment remain in a pending state. This is because the deployment by default is set to spin up 3 replicas but also has a policy to be scheduled only in control plane nodes along with an antiaffinity rule to avoid two pods running on the same node. That means that with a single control plane node the maximum number of replicas in running state would be one. On the other side a daemonSet will also spin a vsphere-csi-node in every single node.

kubectl get pod -n vmware-system-csi
NAME                                      READY   STATUS    RESTARTS        AGE
vsphere-csi-controller-7589ccbcf8-4k55c   0/7     Pending   0               2m25s
vsphere-csi-controller-7589ccbcf8-kbc27   0/7     Pending   0               2m27s
vsphere-csi-controller-7589ccbcf8-vc5d5   7/7     Running   0               4m13s
vsphere-csi-node-42d8j                    3/3     Running   2 (3m25s ago)   4m13s
vsphere-csi-node-9npz4                    3/3     Running   2 (3m28s ago)   4m13s
vsphere-csi-node-kwnzs                    3/3     Running   2 (3m24s ago)   4m13s
vsphere-csi-node-mb4ss                    3/3     Running   2 (3m26s ago)   4m13s
vsphere-csi-node-qptpc                    3/3     Running   2 (3m24s ago)   4m13s
vsphere-csi-node-sclts                    3/3     Running   2 (3m22s ago)   4m13s
vsphere-csi-node-xzglp                    3/3     Running   2 (3m27s ago)   4m13s

You can easily adjust the number of replicas in the vsphere-csi-controller deployment that just by editing the kubernetes resource and set the number of replicas to one. The easiest way to do it is shown below.

kubectl scale deployment -n vmware-system-csi vsphere-csi-controller --replicas=1

Step 4: Creating StorageClass and testing persistent storage

Now that our CSI driver is up and running let’s create a storageClass that will point to the infrastructure provisioner to create the PVs for us on-demand. Before proceeding with storageClass definition lets take a look at the current datastore related information in our particular scenario. We can use the vSphere GUI for this but again, an smarter way is using govc to obtain some relevant information of our datastores that we will use afterwards.

govc datastore.info
Name:        vsanDatastore
  Path:      /cPod-VCN/datastore/vsanDatastore
  Type:      vsan
  URL:       ds:///vmfs/volumes/vsan:529c9fd4d68b174b-1af2d7a4b1b22457
  Capacity:  4095.9 GB
  Free:      3328.0 GB
Name:        nfsDatastore
  Path:      /cPod-VCN/datastore/nfsDatastore
  Type:      NFS
  URL:       ds:///vmfs/volumes/f153c0aa-c96d23c2/
  Capacity:  1505.0 GB
  Free:      1494.6 GB
  Remote:    172.25.3.1:/data/Datastore

We want our volumes to use the vSAN storage as our persistent storage. To do so, use the vsanDataStore associated URL to instruct the CSI to create the persistent volumes in the desired datastore. You can create as many storageClasses as required, each of them with particular parametrization such as the datastore backend, the storage policy or the filesystem type. Additionally, as part of the definition of our Storage class, we are adding an annotation to declare this class as default. That means any PVC without an explicit storageClass specification will use this one as default.

vi vsphere-sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: vsphere-sc
  annotations: storageclass.kubernetes.io/is-default-class=true
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "vSAN Default Storage Policy"  
  datastoreurl: "ds:///vmfs/volumes/vsan:529c9fd4d68b174b-1af2d7a4b1b22457/"
# csi.storage.k8s.io/fstype: "ext4" #Optional Parameter

Once the yaml manifest is created simply apply it using kubectl.

kubectl apply -f vsphere-sc.yaml 

As a best practique, always verify the status of any created object to see if everything is correct. Ensure the StorageClass is followed by “(default)” which means the annotation has been correctly applied and this storageClass will be used by default.

kubectl get storageclass
NAME                        PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
vsphere-sc (default)        csi.vsphere.vmware.com   Delete          Immediate              false                  10d

As mentioned above the StorageClass allows us to abstract from the storage provider so that the developer can dynamically request a volume without the intervention of an storage administrator. The following manifest would allow you to create a volume using the newly created vsphere-cs class.

vi vsphere-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vsphere-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: vsphere-sc

Apply the created yaml file using kubectl command as shown below

kubectl apply -f vsphere-pvc.yaml 

And verify the creation of the PVC and the current status using kubectl get pvc command line.

kubectl get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
vsphere-pvc   Bound    pvc-8da817f0-8231-4d27-9452-188a8ef4f144   5Gi        RWO            vsphere-sc     13s

Note how the new PVC is bound to a volume that has been automatically created by means of the CSI Driver without any user intervention. If you explore the PV using kubectl you would see.

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS   REASON   AGE
pvc-8da817f0-8231-4d27-9452-188a8ef4f144   5Gi        RWO            Delete           Bound    default/vsphere-pvc   vsphere-sc              37s

When using vSphere CSI Driver another cool benefit is that you have integrated management so you can access the vsphere console to verify the creation of the volume according to the capacity parameters and access policy configured as shown in the figure below

Inspecting PVC from vSphere GUI

You can drill if you go to Cluster > Monitor > vSAN Virtual Objects. Filter out using the volume name assigned to the PVC to get a cleaner view of your interesting objects.

vSAN Virtual Objects

Now click on VIEW PLACEMENT DETAILS to see the Physical Placement for the persistent volume we have just created.

Persistent Volume Physical Placement

The Physical Placement window shows how the data is actually placed in vmdks that resides in different hosts (esx03 and esx04) of the vSAN enabled cluster creating a RAID 1 strategy for data redundancy that uses a third element as a witness placed in esx01. This policy is dictated by the “VSAN Default Storage Policy” that we have attached to our just created StorageClass. Feel free to try different StorageClasses bound to different vSAN Network policies to fit your requirements in terms of availability, space optimization, encryption and so on.

This concludes the article. In the next post I will explain how to enable vSAN File Services in kubernetes to cover a particular case in which many different pods running in different nodes need to access the same volume simultaneously. Stay tuned!

AVI for K8s Part 10: Customizing Infrastructure Settings

Without a doubt, the integration provided by AKO provides fantastic automation capabilities that accelerate the roll-out of kubernetes-based services through an enterprise-grade ADC. Until now the applications created in kubernetes interacting with the kubernetes API through resources such as ingress or loadbalancer had their realization on the same common infrastructure implemented through the data layer of the NSX Advanced Load Balancer, that is, the service engines. However, it could be interesting to gain some additional control over the infrastructure resources that will ultimately implement our services. For example, we may be interested in that certain services use premium high-performance resources or a particular high availability scheme or even a specific placement in the network for regulatory security aspects for some critical business applications. On the other hand other less important or non productive services could use a less powerful and/or highly oversubscribed resources.

The response of the kubernetes community to cover this need for specific control of the infrastructure for services in kubernetes has materialized in project called Gateway API. Gateway API (formerly known as Service API) is an open source project that brings up a collection of resources such as GatewayClass, Gateway, HTTPRoute, TCPRoute… etc that is being adopted by many verdors and have broad industry support. If you want to know more about Gateway API you can explore the official project page here.

Before the arrival of Gateway API, AVI used annotations to express extra configuration but since the Gateway API is more standard and widely adopted method AVI has included the support for this new API since version 1.4.1 and will probably become the preferred method to express this configuration.

On the other hand AKO supports networking/v1 ingress API, that was released for general availability starting with Kubernetes version 1.19. Specifically AKO supports IngressClass and DefaultIngressClass networking/v1 ingress features.

The combination of both “standard” IngressClass along with Gateway API resources is the foundation to add the custom infrastructure control. When using Ingress resources we can take advantage of the existing IngressClasses objects whereas when using LoadBalancer resources we would need to resort to the Gateway API.

Exploring AviInfraSettings CRD for infrastructure customization

On startup, AKO automatically detects whether ingress-class API is enabled/available in the cluster it is operating in. If the ingress-class api is enabled, AKO switches to use the IngressClass objects, instead of the previously long list of custom annotation whenever you wanted to express custom configuration.

If your kubernetes cluster supports IngressClass resources you should be able to see the created AVI ingressclass object as shown below. It is a cluster scoped resource and receives the name avi-lb that point to the AKO ingress controller. Note also that the object receives automatically an annotation ingressclass.kubernetes.io/is-default-class set to true. This annotation will ensure that new Ingresses without an ingressClassName specified will be assigned this default IngressClass. 

kubectl get ingressclass -o yaml
apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: IngressClass
  metadata:
    annotations:
      ingressclass.kubernetes.io/is-default-class: "true"
      meta.helm.sh/release-name: ako-1622125803
      meta.helm.sh/release-namespace: avi-system
    creationTimestamp: "2021-05-27T14:30:05Z"
    generation: 1
    labels:
      app.kubernetes.io/managed-by: Helm
    name: avi-lb
    resourceVersion: "11588616"
    uid: c053a284-6dba-4c39-a8d0-a2ea1549e216
  spec:
    controller: ako.vmware.com/avi-lb
    parameters:
      apiGroup: ako.vmware.com
      kind: IngressParameters
      name: external-lb

A new AKO CRD called AviInfraSetting will help us to express the configuration needed in order to achieve segregation of virtual services that might have properties based on different underlying infrastructure components such as Service Engine Group, network names among others. The general AviInfraSetting definition is showed below.

apiVersion: ako.vmware.com/v1alpha1
kind: AviInfraSetting
metadata:
  name: my-infra-setting
spec:
  seGroup:
    name: compact-se-group
  network:
    names:
      - vip-network-10-10-10-0-24
    enableRhi: true
  l7Settings:
    shardSize: MEDIUM

As showed in the below diagram the Ingress object will define an ingressClassName specification that points to the IngressClass object. In the same way the IngressClass object will define a series of parameters under spec section to refer to the AviInfraSetting CRD.

AVI Ingress Infrastructure Customization using IngressClasses

For testing purposes we will use the hello kubernetes service. First create the deployment, service and ingress resource using yaml file below. It is assumed that an existing secret named hello-secret is already in place to create the secure ingress service.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello-kubernetes
        image: paulbouwer/hello-kubernetes:1.7
        ports:
        - containerPort: 8080
        env:
        - name: MESSAGE
          value: "MESSAGE: Critical App Running here!!"
---
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: hello
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello
  labels:
    app: hello
    app: gslb
spec:
  tls:
  - hosts:
    - kuard.avi.iberia.local
    secretName: hello-secret
  rules:
    - host: hello.avi.iberia.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: hello
              port:
                number: 8080

After pushing the declarative file above to the Kubernetes API by using kubectl apply command you should be able to access to the application service just by browsing to the host, in my case https://hello.avi.iberia.local. I have created a custom message by passing an environment variable in the deployment definition with the text you can read below.

Now that we have our test service ready, let’s start testing each of the customization options for the infrastructure.

seGroup

The first parameter we are testing is in charge of selecting the Service Engine Group. Remember Service Engines (e.g. our dataplane) are created with a set of attributes inherited from a Service Engine Group, which contains the definition of how the SEs should be sized, placed, and made highly available. The seGroup parameter defines the service Engine Group that will be used by services that points to this particular AviInfraSettings CRD object. By default, any Ingress or LoadBalancer objects created by AKO will use the SE-Group as specified in the values.yaml that define general AKO configuration.

In order for AKO to make use of this configuration, the first step is to create a new Service Engine Group definition in the controller via GUI / API. In this case, let’s imagine that this group of service engines will be used by applications that demand an active-active high availability mode in which the services will run on at least two service engines to minimize the impact in the event of a failure of one of them. From AVI GUI go to Infrastructure > Service Engine Group > Create. Assign a new name such as SE-Group-AKO-CRITICAL that will be used by the AviInfraSettings object later and configure the Active/Active Elastic HA scheme with a minimum of 2 Service Engine by Virtual Service in the Scale per Virtual Service Setting as shown below:

Active-Active SE Group definition for critical Virtual Service

Now we will create the AviInfraSetting object with the following specification. Write below content in a file and apply it using kubectl apply command.

apiVersion: ako.vmware.com/v1alpha1
kind: AviInfraSetting
metadata:
  name: critical.infra
spec:
  seGroup:
    name: SE-Group-AKO-CRITICAL

Once created you can verify your new AviInfraSetting-type object by exploring the resource using kubectl commands. In this case our new created object is named critical.infra. To show the complete object definition use below kubectl get commands as usual:

kubectl get AviInfraSetting critical.infra -o yaml
apiVersion: ako.vmware.com/v1alpha1
kind: AviInfraSetting
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ako.vmware.com/v1alpha1","kind":"AviInfraSetting","metadata":{"annotations":{},"name":"critical.infra"},"spec":{"seGroup":{"name":"SE-Group-AKO-CRITICAL"}}}
  creationTimestamp: "2021-05-27T15:12:50Z"
  generation: 1
  name: critical.infra
  resourceVersion: "11592607"
  uid: 27ef1502-5a91-4244-a23b-96bb8ffd9a6e
spec:
  seGroup:
    name: SE-Group-AKO-CRITICAL
status:
  status: Accepted

Now we want to attach this infra setup to our ingress service. To do so, we need to create our IngressClass object first. This time, instead of writing a new yaml file and applying, we will use the stdin method as shown below. After the EOF string you can press enter and the pipe will send the content of the typed yaml file definition to the kubectl apply -f command. An output message should confirm that the new IngressClass object has been successfully created.

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: critical-ic
spec:
  controller: ako.vmware.com/avi-lb
  parameters:
    apiGroup: ako.vmware.com
    kind: AviInfraSetting
    name: critical.infra
EOF


Output:
ingressclass.networking.k8s.io/critical-ingress-class createdku

Once we have created an IngressClass that maps to the critical.infra AviInfraSetting object, it’s time to instruct the Ingress object that defines the external access to our application to use that particular ingress class. Simply edit the existing Ingress object previously created and add the corresponging ingressClass definition under the spec section.

kubectl edit ingress hello 
apiVersion: networking.k8s.io/v1
kind: Ingress
<< Skipped >>
spec:
  ingressClassName: critical-ic
  rules:
  - host: hello.avi.iberia.local
    http:
<< Skipped >>

Once applied AKO will send an API call to the AVI Controller to reconcile the new expressed desired state. That also might include the creation of new Service Engines elements in the infrastructure if there were not any previous active Service Engine in that group as in my case. In this particular case a new pair of Service Engine must be created to fullfill the Active/Active high-availability scheme as expressed in the Service Engine definition. You can tell how a new Shared object with the name S1-AZ1–Shared-L7-critical-infra-4 has been created and will remain as unavailable until the cloud infrastructure (vCenter in my case) complete the automated service engine creation.

c

This image has an empty alt attribute; its file name is image-74.png
New Parent L7 VS shared object and service Engine Creation

After some minutes you should be able to see the new object in a yellow state that will become green eventually after some time. The yellow color can be an indication of the VS dependency with an early created infrastructure Service Engines as in our case. Note how our VS rely on two service engines as stated in the Service Engine Group definition fo HA.

New Parent L7 VS shared object with two Service Engine in Active/Active architecture

The hello Ingress object shows a mapping with the critical-ic ingressClass object we defined for this particular service.

kubectl get ingress
NAME    CLASS         HOSTS                    ADDRESS        PORTS     AGE
ghost   avi-lb        ghost.avi.iberia.local   10.10.15.162   80, 443   9d
hello   critical-ic   hello.avi.iberia.local   10.10.15.161   80, 443   119m
httpd   avi-lb        httpd.avi.iberia.local   10.10.15.166   80, 443   6d
kuard   avi-lb        kuard.avi.iberia.local   10.10.15.165   80, 443   6d5h

network

Next configuration element we can customize as part of the AviInfraSetting definition is the network. This can help to determine which network pool will be used for a particular group of services in our kubernetes cluster. As in previous examples, to allow the DevOps operator to consume certain AVI related settings, we need to define it first as part of the AVI infrastructure operator role.

To create a new FrontEnd pool to expose our new services simply define a new network and allocate some IPs for Service Engine and Virtual Service Placement. In my case the network has been automatically discovered as part of the cloud integration with vSphere. We just need to define the corresponding static pools for both VIPs and Service Engines to allow the internal IPAM to assign IP addresses when needed.

New network definition for custom VIP placement

Once the new network is defined, we can use the AviInfraSetting CRD to point to the new network name. In my case the name the assigned name is REGA_EXT_UPLINKB_3016. Since the CRD object is already created the easiest way to change this setting is simply edit and add the new parameter under spec section as shown below:

kubectl edit aviinfrasetting critical.infra 
apiVersion: ako.vmware.com/v1alpha1
kind: AviInfraSetting
  name: critical.infra
spec:
  seGroup:
    name: SE-Group-AKO-CRITICAL
  network:
    names:
    - REGA_EXT_UPLINKB_3016

After writing and exiting from the editor (vim by default) the new file is applied with the changes. You can see in the AVI Controller GUI how the new config change is reflected with the engine icon in the Analytics page indicating the VS has received a new configuration. If you expand the CONFIG_UPDATE event you can see how a new change in the network has ocurred and now the VS will used the 10.10.16.224 IP address to be reachable from the outside.

Change of VIP Network trhough AKO as seen in AVI GUI

NOTE.- In my case, after doing the change I noticed the Ingress object will still showed the IP Address assigned at the creation time and the new real value wasn’t updated.

kubectl get ingress
NAME    CLASS         HOSTS                    ADDRESS        PORTS     AGE
ghost   avi-lb        ghost.avi.iberia.local   10.10.15.162   80, 443   9d
hello   critical-ic   hello.avi.iberia.local   10.10.15.161   80, 443   119m
httpd   avi-lb        httpd.avi.iberia.local   10.10.15.166   80, 443   6d
kuard   avi-lb        kuard.avi.iberia.local   10.10.15.165   80, 443   6d5h

If this is your case, simple delete and recreate the ingress object with the corresponding ingress-class and you should see the new IP populated

kubectl get ingress
NAME    CLASS         HOSTS                    ADDRESS        PORTS     AGE
ghost   avi-lb        ghost.avi.iberia.local   10.10.15.162   80, 443   9d
hello   critical-ic   hello.avi.iberia.local   10.10.16.224   80, 443   119m
httpd   avi-lb        httpd.avi.iberia.local   10.10.15.166   80, 443   6d
kuard   avi-lb        kuard.avi.iberia.local   10.10.15.165   80, 443   6d5h

enableRhi

As mentioned before, AVI is able to place the same Virtual Service in different Service Engines. This is very helpful for improving the high-availiablity of the application by avoiding a single point of failure and also for scaling out our application to gain extra capacity. AVI has a native AutoScale capability that selects a primary service engine within a group that is in charge of coordinating the distribution of the virtual service traffic among the other SEs where a particular Virtual Service is also active.

Whilst the native AutoScaling method is based on L2 redirection, an alternative and more scalable and efficient method for scaling a virtual service is to rely on Border Gateway Protocol (BGP) and specifically in a feature that is commonly named as route health injection (RHI) to provide equal cost multi-path (ECMP) reachability from the upstream router towards the application. Using Route Health Injection (RHI) with ECMP for virtual service scaling avoids the extra burden on the primary SE to coordinate the scaled out traffic among the SEs.

To leverage this feature, as in previous examples, is it a pre-task of the LoadBalancer and/or network administrator to define the network peering with the underlay BGP network. case so you need to select a local Autonomous System Number (5000 in my case) and declare the IP of the peers that will be used to establish BGP sessions to interchange routing information to reach the corresponding Virtual Services IP addresses. The upstream router in this case in 10.10.16.250 and belongs to ASN 6000 so an eBGP peering would be in place.

The following diagram represent the topology I am using here to implement the required BGP setup.

AVI network topology to enable BGP RHI for L3 Scaling Out using ECMP

You need to define a BGP configuration at AVI Controller with some needed attributes as shown in the following table

SettingValueComment
BGP AS5000Local ASN number used for eBGP
Placement NetworkREGA_EXT_UPLINKB_3016Network used to reach external BGP peers
IPv4 Prefix10.10.16.0/24Subnet that will be used for external announces
IPv4 Peer10.10.16.250IP address of the external BGP Peer
Remote AS6000Autonomous System Number the BGP peer belongs to
Multihop0TTL Setting for BGP control traffic. Adjust if the peer is located some L3 hops beyond
BFDEnabledBidirectional Forwarding Detection mechanism
Advertise VIPEnabledAnnounce allocated VS VIP as Route Health Injection
Advertise SNATEnabledAnnounce allocated Service Engine IP address used as source NAT. Useful in one arm deployments to ensure returning traffic from backends.
BGP Peering configuration to enable RHI

The above configuration settings are shown in the following configuration screen at AVI Controller:

AVI Controller BGP Peering configuration

As a reference I am using in my example a Cisco CSR 1000V as external upstream router that will act as BGP neigbor. The upstream router needs to know in advance the IP addresses of the neighbors in order to configure the BGP peering statements. Some BGP implementations has the capability to define dynamic BGP peering using a range of IP addresses and that fits very well with an autoscalable fabric in which the neighbors might appears and dissappears automatically as the traffic changes. You would also need to enable the ECMP feature adjusting the maximum ECMP paths to the maximum SE configured in your Service Engine Group. Below you can find a sample configuration leveraging the BGP Dynamic Neighbor feature and BFD for fast convergence.

!!! enable BGP using dynamic neighbors

router bgp 6000
 bgp log-neighbor-changes
 bgp listen range 10.10.16.192/27 peer-group AVI-PEERS
 bgp listen limit 32
 neighbor AVI-PEERS peer-group
 neighbor AVI-PEERS remote-as 5000
 neighbor AVI-PEERS fall-over bfd
 !
 address-family ipv4
  neighbor AVI-PEERS activate
  maximum-paths eibgp 10
 exit-address-family
!
!! Enable BFD for fast convergence
interface gigabitEthernet3
   ip address 10.10.16.250 255.255.255.0
   no shutdown
   bfd interval 50 min_rx 50 multiplier 5

As you can see below, once the AVI controller configuration is completed you should see the neighbor status by issuing the show ip bgp summary command. The output is shown below. Notice how two dynamic neighborships has been created with 10.10.16.192 and 10.10.16.193 which correspond to the allocated IP addresses for the two new service engines used to serve our hello Virtual Service. Note also in the State/PfxRcd column that no prefixes has been received yet.

csr1000v-ecmp-upstream#sh ip bgp summary
BGP router identifier 22.22.22.22, local AS number 6000
BGP table version is 2, main routing table version 2
1 network entries using 248 bytes of memory
1 path entries using 136 bytes of memory
1/1 BGP path/bestpath attribute entries using 280 bytes of memory
1 BGP AS-PATH entries using 40 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 704 total bytes of memory
BGP activity 1/0 prefixes, 1/0 paths, scan interval 60 secs

Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
*10.10.16.192   4         5000       2       4        2    0    0 00:00:11        0
*10.10.16.193   4         5000       2       4        2    0    0 00:00:11        0
* Dynamically created based on a listen range command
Dynamically created neighbors: 2, Subnet ranges: 2

BGP peergroup AVI-PEERS listen range group members:
  10.10.16.192/27

Total dynamically created neighbors: 2/(32 max), Subnet ranges: 1

If you want to check or troubleshoot the BGP from the Service Engine, you can always use the CLI to see the runtime status of the BGP peers. Since this is a distributed architecture, the BGP daemon runs locally on each of the service engine that conform the Service Engine Group. To access to the service engine login in in the AVI Controller via SSH and open a shell session.

admin@10-10-10-33:~$ shell
Login: admin
Password: <password>

Now attach the desired service engine. I am picking one of the recently created service engines

attach serviceengine s1_ako_critical-se-idiah
Warning: Permanently added '[127.1.0.7]:5097' (ECDSA) to the list of known hosts.

Avi Service Engine

Avi Networks software, Copyright (C) 2013-2017 by Avi Networks, Inc.
All rights reserved.

Version:      20.1.5
Date:         2021-04-15 07:08:29 UTC
Build:        9148
Management:   10.10.10.46/24                 UP
Gateway:      10.10.10.1                     UP
Controller:   10.10.10.33                    UP


The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php

Use the ip netns command to show the network namespace within the Service Engine

admin@s1-ako-critical-se-idiah:~$ ip netns
avi_ns1 (id: 0)

And now open a bash shell session in the correspoding network namespace. In this case we are using the default avi_ns1 network namespace at the Service Engine. The prompt should change after entering the proper credentials

admin@s1-ako-critical-se-idiah:~$ sudo ip netns exec avi_ns1 bash
[sudo] password for admin: <password> 
root@s1-ako-critical-se-idiah:/home/admin#

Open a session to the internal FRR-based BGP router daemon issuing a netcat localhost bgpd command as shown below

root@s1-ako-critical-se-idiah:/home/admin# netcat localhost bgpd

Hello, this is FRRouting (version 7.0).
Copyright 1996-2005 Kunihiro Ishiguro, et al.


User Access Verification

▒▒▒▒▒▒"▒▒Password: avi123

s1-ako-critical-se-idiah>

Use enable command to gain privileged access and show running configuration. The AVI Controller has created automatically a configuration to peer with our external router at 10.10.16.250. Some route maps to filter inbound and outbound announces has also been populated as seen below

s1-ako-critical-se-idiah# show run
show run

Current configuration:
!
frr version 7.0
frr defaults traditional
!
hostname s1-ako-critical-se-idiah
password avi123
log file /var/lib/avi/log/bgp/avi_ns1_bgpd.log
!
!
!
router bgp 5000
 bgp router-id 2.61.174.252
 no bgp default ipv4-unicast
 neighbor 10.10.16.250 remote-as 6000
 neighbor 10.10.16.250 advertisement-interval 5
 neighbor 10.10.16.250 timers connect 10
 !
 address-family ipv4 unicast
  neighbor 10.10.16.250 activate
  neighbor 10.10.16.250 route-map PEER_RM_IN_10.10.16.250 in
  neighbor 10.10.16.250 route-map PEER_RM_OUT_10.10.16.250 out
 exit-address-family
!
!
ip prefix-list def-route seq 5 permit 0.0.0.0/0
!
route-map PEER_RM_OUT_10.10.16.250 permit 10
 match ip address 1
 call bgp_properties_ebgp_rmap
!
route-map bgp_community_rmap permit 65401
!
route-map bgp_properties_ibgp_rmap permit 65400
 match ip address prefix-list snat_vip_v4-list
 call bgp_community_rmap
!
route-map bgp_properties_ibgp_rmap permit 65401
 call bgp_community_rmap
!
route-map bgp_properties_ebgp_rmap permit 65400
 match ip address prefix-list snat_vip_v4-list
 call bgp_community_rmap
!
route-map bgp_properties_ebgp_rmap permit 65401
 call bgp_community_rmap
!
line vty
!
end

To verify neighbor status use show bgp summary command

s1-ako-critical-se-idiah# sh bgp summary
sh bgp summary

IPv4 Unicast Summary:
BGP router identifier 2.61.174.252, local AS number 5000 vrf-id 0
BGP table version 6
RIB entries 0, using 0 bytes of memory
Peers 1, using 22 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.10.16.250    4       6000      29      25        0    0    0 00:23:29            0

Total number of neighbors 1

Note that by default the AVI BGP implementation filters any prefix coming from the external upstream router, therefore BGP is mainly used to inject RHI to allow outside world to gain VS reachability. Once the network is ready we can use the enableRhi setting in our custom AviInfraSetting object to enable this capability. Again the easiest way is by editing the existing our critical.infra AviInfraSetting object using kubectl edit as shown below

kubectl edit AviInfraSetting critical.infra
apiVersion: ako.vmware.com/v1alpha1
kind: AviInfraSetting
metadata:
  name: critical.infra
spec:
  network:
    enableRhi: true
    names:
    - REGA_EXT_UPLINKB_3016
  seGroup:
    name: SE-Group-AKO-CRITICAL

Before applying the new configuration, enable console messages (term mon) in case you are accesing the external router by SSH and activate the debugging of any ip routing table changes using debug ip routing you would be able to see the route announcements received by the upstream router. Now apply the above setting by editing the critical.infra AviInfraSetting CRD object.

csr1000v-ecmp-upstream#debug ip routing
IP routing debugging is on
*May 27 16:42:50.115: RT: updating bgp 10.10.16.224/32 (0x0) [local lbl/ctx:1048577/0x0]  :
    via 10.10.16.192   0 1048577 0x100001

*May 27 16:42:50.115: RT: add 10.10.16.224/32 via 10.10.16.192, bgp metric [20/0]
*May 27 16:42:50.128: RT: updating bgp 10.10.16.224/32 (0x0) [local lbl/ctx:1048577/0x0]  :
    via 10.10.16.193   0 1048577 0x100001
    via 10.10.16.192   0 1048577 0x100001

*May 27 16:42:50.129: RT: closer admin distance for 10.10.16.224, flushing 1 routes
*May 27 16:42:50.129: RT: add 10.10.16.224/32 via 10.10.16.193, bgp metric [20/0]
*May 27 16:42:50.129: RT: add 10.10.16.224/32 via 10.10.16.192, bgp metric [20/0]

As you can see above new messages appears indicating a new announcement of VIP network at 10.10.16.224/32 has been received by both 10.10.16.193 and 10.10.16.192 neighbors and the event showing the new equal paths routs has been installed in the routing table. In fact, if you check the routing table for this particular prefix.

csr1000v-ecmp-upstream#sh ip route 10.10.16.224
Routing entry for 10.10.16.224/32
  Known via "bgp 6000", distance 20, metric 0
  Tag 5000, type external
  Last update from 10.10.16.192 00:00:46 ago
  Routing Descriptor Blocks:
  * 10.10.16.193, from 10.10.16.193, 00:00:46 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 5000
      MPLS label: none
    10.10.16.192, from 10.10.16.192, 00:00:46 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 5000
      MPLS label: none

You can even see the complete IP routing table with a more familiar command as shown below:

csr1000v-ecmp-upstream#show ip route
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       a - application route
       + - replicated route, % - next hop override, p - overrides from PfR

Gateway of last resort is 10.10.15.1 to network 0.0.0.0

B*    0.0.0.0/0 [20/0] via 10.10.15.1, 00:48:21
      10.0.0.0/8 is variably subnetted, 5 subnets, 2 masks
C        10.10.16.0/24 is directly connected, GigabitEthernet3
B        10.10.16.224/32 [20/0] via 10.10.16.194, 00:02:10
                         [20/0] via 10.10.16.193, 00:02:10
                         [20/0] via 10.10.16.192, 00:02:10
L        10.10.16.250/32 is directly connected, GigabitEthernet3

Remember we also enabled the Bidirectional Forwarding Detection (BFD) in our BGP peering configuration. The BFD protocol is a simple hello mechanism that detects failures in a network. Hello packets are sent at a specified, regular interval. A neighbor failure is detected when the routing device stops receiving a reply after a specified interval. BFD works with a wide variety of network environments and topologies and is used in combination with BGP to provide faster failure detection. The current status of the BFD neighbors can be also seen in the upstream router console. A Local and Remote Discrimination ID (LD/RD column) are asigned to uniquely indentify the BFD peering across the network.

csr1000v-ecmp-upstream#show bfd neighbors

IPv4 Sessions
NeighAddr            LD/RD         RH/RS     State     Int
10.10.16.192         4104/835186103  Up        Up        Gi3
10.10.16.193         4097/930421219  Up        Up        Gi3

To verify the Route Health Injection works as expect we can now manually scale out our service to create an aditional Service Engine, that means the hello application should now be active and therefore reachable through three different equal cost paths. Hover the mouse over the Parent Virtual Service of the hello application and press the Scale Out button

Manual Scale out for the parent VS

A new window pops up indicating a new service engine is being created to complete the manual Scale Out operation.

Scaling Out progress

After a couple of minutes you should see how the service is now running in three independent Service which means we have increased the overall capacity for our service engine.

Scaling out a VS to three Service Engines

At the same time, in the router console we can see a set of events indicating the new BGP and BFD peering creation with new Service Engine at 10.10.16.194. After just one second a new route is announced through this new peering and also installed in the routing table.

csr1000v-ecmp-upstream#
*May 27 17:07:02.515: %BFD-6-BFD_SESS_CREATED: BFD-SYSLOG: bfd_session_created, neigh 10.10.16.194 proc:BGP, idb:GigabitEthernet3 handle:3 act
*May 27 17:07:02.515: %BGP-5-ADJCHANGE: neighbor *10.10.16.194 Up
*May 27 17:07:02.531: %BFDFSM-6-BFD_SESS_UP: BFD-SYSLOG: BFD session ld:4098 handle:3 is going UP

*May 27 17:07:03.478: RT: updating bgp 10.10.16.224/32 (0x0) [local lbl/ctx:1048577/0x0]  :
    via 10.10.16.194   0 1048577 0x100001
    via 10.10.16.193   0 1048577 0x100001
    via 10.10.16.192   0 1048577 0x100001

*May 27 17:07:03.478: RT: closer admin distance for 10.10.16.224, flushing 2 routes
*May 27 17:07:03.478: RT: add 10.10.16.224/32 via 10.10.16.194, bgp metric [20/0]
*May 27 17:07:03.478: RT: add 10.10.16.224/32 via 10.10.16.193, bgp metric [20/0]
*May 27 17:07:03.478: RT: add 10.10.16.224/32 via 10.10.16.192, bgp metric [20/0]

In we inject some traffic in the VS we could verify how this mechanism is distributing traffic acrross the three Service Engine. Note that the upstream router uses a 5-tuple (Source IP, Destination IP, Source Port, Destination Port and Protocol) hashing in its selection algorithm to determine the path among the available equal cost paths for any given new flow. That means any flow will be always sticked to the same path, or in other words, you need some network entropy if you want to achieve a fair distribution scheme among available paths (i.e Service Engines).

Traffic distribution across Service Engine using GUI Analytics

Our new resulting topology is shown in the following diagram. Remember you can add extra capacing by scaling out again the VS using the manual method as described before or even configure the AutoRebalance to automatically adapt to the traffic or Service Engine health conditions.

Resulting BGP topology after manual Scale Out operation

shardSize

A common problem with traditional LoadBalancers deployment methods is that, for each new application (Ingress), a new separate VIP is created, resulting in a large number of routable addresses being required. You can find also more conservative approaches with a single VIP for all Ingresses but this way also may have their own issues related to stability and scaling.

AVI proposes a method to are automatically shard new ingress across a small number of VIPs offering best of both methods of deployment. The number of shards is configurable according to the shardSize. The shardSize defines the number of VIPs that will be used for the new ingresses and are described in following list:

  • LARGE: 8 shared VIPs
  • MEDIUM: 4 shared VIPs
  • SMALL: 1 shared VIP
  • DEDICATED: 1 non-shared Virtual Service

If not specified it uses the shardSize value provided in the values.yaml that by default is set to LARGE. The decision of selecting one of these sizes for Shard virtual service depends on the size of the Kubernetes cluster’s ingress requirements. It is recommended to always go with the highest possible Shard virtual service number that is(LARGE) to take into consideration future growing. Note, you need to adapt the number of available IPs for new services to match with the configured shardSize. For example you cannnot use a pool of 6 IPs for a LARGE shardSize since a minimum of eight would be required to create the set of needed Virtual Services to share the VIPs for new ingress. If the lenght of the available pool is less than the shardSize some of the ingress would fail. Let’s go through the different settings and check how it changes the way AKO creates the parent objects.

kubectl edit AviInfraSetting critical.infra
apiVersion: ako.vmware.com/v1alpha1
kind: AviInfraSetting
metadata:
  name: critical.infra
spec:
  network:
    enableRhi: true
    names:
    - REGA_EXT_UPLINKB_3016
  seGroup:
    name: SE-Group-AKO-CRITICAL
  l7Settings:
    shardSize: LARGE

To test how the ingress are distributed to the shared Virtual Services I have created a simple script that creates a loop to produce dummy ingress services over for a given ClusterIP service. The script is available here. Let’s create a bunch of 20 new ingresses to see how it works.

./dummy_ingress.sh 20 apply hello
ingress.networking.k8s.io/hello1 created
service/hello1 created
ingress.networking.k8s.io/hello2 created
service/hello2 created
<skipped>
...
service/hello20 created
ingress.networking.k8s.io/hello20 created

Using kubectl and some filtering and sorting keywords to display only the relevant information you can see in the output below how AVI Controller uses up to eight different VS/IPs ranging from 10.10.16.225 to 10.10.16.232 to accomodate the created ingress objects.

kubectl get ingress --sort-by='.status.loadBalancer.ingress[0].ip' -o='custom-columns=HOSTNAME:.status.loadBalancer.ingress[0].hostname,AVI-VS-IP:.status.loadBalancer.ingress[0].ip'
HOSTNAME                   AVI-VS-IP
hello14.avi.iberia.local   10.10.16.225
hello1.avi.iberia.local    10.10.16.225
hello9.avi.iberia.local    10.10.16.225
hello2.avi.iberia.local    10.10.16.226
hello17.avi.iberia.local   10.10.16.226
hello3.avi.iberia.local    10.10.16.227
hello16.avi.iberia.local   10.10.16.227
hello4.avi.iberia.local    10.10.16.228
hello11.avi.iberia.local   10.10.16.228
hello19.avi.iberia.local   10.10.16.228
hello10.avi.iberia.local   10.10.16.229
hello18.avi.iberia.local   10.10.16.229
hello5.avi.iberia.local    10.10.16.229
hello13.avi.iberia.local   10.10.16.230
hello6.avi.iberia.local    10.10.16.230
hello20.avi.iberia.local   10.10.16.231
hello15.avi.iberia.local   10.10.16.231
hello8.avi.iberia.local    10.10.16.231
hello7.avi.iberia.local    10.10.16.232
hello12.avi.iberia.local   10.10.16.232

As you can see in the AVI GUI, up to eight new VS has been created that will be used to distribute the new ingresses

Shared Virtual Services using a LARGE shardSize (8 shared VS)
Virtual Service showing several pools that uses same VIP

Now let’s change the AviInfraSetting object and set the shardSize to MEDIUM. You would probably need to reload the AKO controller to make this new change. Once done you can see how the distribution has now changed and the ingress are being distributed to a set of four VIPs ranging now from 10.10.16.225 to 10.10.16.228.

kubectl get ingress --sort-by='.status.loadBalancer.ingress[0].ip' -o='custom-columns=HOSTNAME:.status.loadBalancer.ingress[0].hostname,AVI-VS-IP:.status.loadBalancer.ingress[0].ip'
HOSTNAME                   AVI-VS-IP
hello14.avi.iberia.local   10.10.16.225
hello10.avi.iberia.local   10.10.16.225
hello5.avi.iberia.local    10.10.16.225
hello1.avi.iberia.local    10.10.16.225
hello18.avi.iberia.local   10.10.16.225
hello9.avi.iberia.local    10.10.16.225
hello6.avi.iberia.local    10.10.16.226
hello17.avi.iberia.local   10.10.16.226
hello13.avi.iberia.local   10.10.16.226
hello2.avi.iberia.local    10.10.16.226
hello16.avi.iberia.local   10.10.16.227
hello12.avi.iberia.local   10.10.16.227
hello7.avi.iberia.local    10.10.16.227
hello3.avi.iberia.local    10.10.16.227
hello15.avi.iberia.local   10.10.16.228
hello11.avi.iberia.local   10.10.16.228
hello4.avi.iberia.local    10.10.16.228
hello20.avi.iberia.local   10.10.16.228
hello8.avi.iberia.local    10.10.16.228
hello19.avi.iberia.local   10.10.16.228

You can verify how now only four Virtual Services are available and only four VIPs will be used to expose our ingresses objects.

Shared Virtual Services using a MEDIUM shardSize (4 shared VS)

The smaller the shardSize, the higher the density of ingress per VIP as you can see in the following screenshot

Virtual Service showing a higher number of pools that uses same VIP

If you use the SMALL shardSize then you would see how all the applications will use a single external VIP.

kubectl get ingress --sort-by='.status.loadBalancer.ingress[0].ip' -o='custom-columns=HOSTNAME:.status.loadBalancer.ingress[0].hostname,AVI-VS-IP:.status.loadBalancer.ingress[0].ip'
HOSTNAME                   AVI-VS-IP
hello1.avi.iberia.local    10.10.16.225
hello10.avi.iberia.local   10.10.16.225
hello11.avi.iberia.local   10.10.16.225
hello12.avi.iberia.local   10.10.16.225
hello13.avi.iberia.local   10.10.16.225
hello14.avi.iberia.local   10.10.16.225
hello15.avi.iberia.local   10.10.16.225
hello16.avi.iberia.local   10.10.16.225
hello17.avi.iberia.local   10.10.16.225
hello18.avi.iberia.local   10.10.16.225
hello19.avi.iberia.local   10.10.16.225
hello2.avi.iberia.local    10.10.16.225
hello20.avi.iberia.local   10.10.16.225
hello3.avi.iberia.local    10.10.16.225
hello4.avi.iberia.local    10.10.16.225
hello5.avi.iberia.local    10.10.16.225
hello6.avi.iberia.local    10.10.16.225
hello7.avi.iberia.local    10.10.16.225
hello8.avi.iberia.local    10.10.16.225
hello9.avi.iberia.local    10.10.16.225

You can verify how now a single Virtual Service is available and therefore a single VIPs will be used to expose our ingresses objects.

Shared Virtual Services using a SMALL shardSize (1 single shared VS)

The last option for the shardSize is DEDICATED, that, in fact disable the VIP sharing and creates a new VIP for any new ingress object. First delete the twenty ingresses/services we created before using the same script but now with the delete keyword as shown below.

./dummy_ingress.sh 20 delete hello
service "hello1" deleted
ingress.networking.k8s.io "hello1" deleted
service "hello2" deleted
ingress.networking.k8s.io "hello2" deleted
service "hello3" deleted
<Skipped>
...
service "hello20" deleted
ingress.networking.k8s.io "hello20" deleted

Now let’s create five new ingress/services using again the custom script.

./dummy_ingress.sh 5 apply hello
service/hello1 created
ingress.networking.k8s.io/hello1 created
service/hello2 created
ingress.networking.k8s.io/hello2 created
service/hello3 created
ingress.networking.k8s.io/hello3 created
service/hello4 created
ingress.networking.k8s.io/hello4 created
service/hello5 created
ingress.networking.k8s.io/hello5 created

As you can see, now a new IP address is allocated for any new service so there is no VIP sharing in place

kubectl get ingress --sort-by='.status.loadBalancer.ingress[0].ip' -o='custom-columns=HOSTNAME:.status.loadBalancer.ingress[0].hostname,AVI-VS-IP:.status.loadBalancer.ingress[0].ip'
HOSTNAME                  AVI-VS-IP
hello5.avi.iberia.local   10.10.16.225
hello1.avi.iberia.local   10.10.16.226
hello4.avi.iberia.local   10.10.16.227
hello3.avi.iberia.local   10.10.16.228
hello2.avi.iberia.local   10.10.16.229

You can verify in the GUI how a new VS is created. The name used for the VS indicates this is using a dedicated sharing scheme for this particular ingress.

Virtual Services using a DEDICATED shardSize (1 new dedicated VS per new ingress)

Remember you can use custom AviInfraSetting objects option to selectively set the shardSize according to your application needs.

Gateway API for customizing L4 LoadBalancer resources

As mentioned before, to provide some customized information for a particular L4 LoadBalancer resource we need to switch to the services API. To allow AKO to use Gateway API we need to enable it using one of the configuration settings in the values.yaml file that is used by the helm chart we use to deploy the AKO component.

servicesAPI: true 
# Flag that enables AKO in services API mode: https://kubernetes-sigs.github.io/service-apis/

Set the servicesAPI flag to true and redeploy the AKO release. You can use this simple ako_reload.sh script that you can find here to delete and recreate the existing ako release automatically after changing the above flag

./ako_reload.sh
"ako" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ako" chart repository
Update Complete. ⎈Happy Helming!⎈
release "ako-1621926773" uninstalled
NAME: ako-1622125803
LAST DEPLOYED: Thu May 27 14:30:05 2021
NAMESPACE: avi-system
STATUS: deployed
REVISION: 1

The AKO implementation uses following Gateway API resources

  • GatewayClass.– used to aggregate a group of Gateway object. Used to point to some specific parameters of the load balancing implementation. AKO identifies GatewayClasses that points to ako.vmware.com/avi-lb as the controller.
  • Gateway.- defines multiple services as backends. It uses matching labels to select the Services that need to be implemented in the actual load balancing solution

Above diagram summarizes the different objects and how they map togheter:

AVI LoadBalancer Infrastructure Customization using Gateway API and labels matching

Let’s start by creating a new GatewayClass type object as defined in the following yaml file. Save in a yaml file or simply paste the following code using stdin.

cat <<EOF | kubectl apply -f -
apiVersion: networking.x-k8s.io/v1alpha1
kind: GatewayClass
metadata:
  name: critical-gwc
spec:
  controller: ako.vmware.com/avi-lb
  parametersRef:
    group: ako.vmware.com
    kind: AviInfraSetting
    name: critical.infra
EOF

Output: 
gatewayclass.networking.x-k8s.io/critical-gateway-class created

Now define Gateway object including the labels we will use to select the application we are using as backend. Some backend related parameters such as protocol and port needs to be defined. The gatewayClassName defined previously is also referred using the spec.gatewayClassName key.

cat <<EOF | kubectl apply -f -
apiVersion: networking.x-k8s.io/v1alpha1
kind: Gateway
metadata:
  name: avi-alb-gw
  namespace: default
spec: 
  gatewayClassName: critical-gwc    
  listeners: 
  - protocol: TCP 
    port: 80 
    routes: 
      selector: 
       matchLabels: 
        ako.vmware.com/gateway-namespace: default 
        ako.vmware.com/gateway-name: avi-alb-gw
      group: v1 
      kind: Service
EOF

Output:
gateway.networking.x-k8s.io/avi-alb-gw created

As soon as we create the GW resource, AKO will call the AVI Controller to create this new object even when there are now actual services associated yet. In the AVI GUI you can see how the service is created and it takes the name of the gateway resource. This is a namespace scoped resource so you should be able to create the same gateway definition in a different namespace. A new IP address has been selected from the AVI IPAM as well.

Virtual Service for Gateway resource used for LoadBalancer type objects.

Now we can define the LoadBalancer service. We need to add the corresponding labels as they are used to link the backend to the gateway. Use the command below that also includes the deployment declaration for our service.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: hello
  labels:
    ako.vmware.com/gateway-namespace: default 
    ako.vmware.com/gateway-name: avi-alb-gw
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: hello
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello-kubernetes
        image: paulbouwer/hello-kubernetes:1.7
        ports:
        - containerPort: 8080
        env:
        - name: MESSAGE
          value: "MESSAGE: Running in port 8080!!"
EOF

Once applied, the AKO will translate the new changes in the Gateway API related objects and will call the AVI API to patch the corresponding Virtual Service object according to the new settings. In this case the Gateway external IP is allocated as seen in the following output.

kubectl get service
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
hello        LoadBalancer   10.106.124.101   10.10.16.227   80:30823/TCP   9s
kubernetes   ClusterIP      10.96.0.1        <none>         443/TCP        99d

You can explore the AVI GUI to see how the L4 Load Balancer object has been realized in a Virtual Service.

L4 Virtual Service realization using AKO and Gateway API

And obviously we can browse to the external IP address and check if the service is actually running and is reachable from the outside.

An important benefit of this is the ability to share the same external VIP for exposing different L4 services outside. You can easily add a new listener definition that will expose the port TCP 8080 and will point to the same backend hello application as shown below:

cat <<EOF | kubectl apply -f -
apiVersion: networking.x-k8s.io/v1alpha1
kind: Gateway
metadata:
  name: avi-alb-gw
  namespace: default
spec: 
  gatewayClassName: critical-gwc    
  listeners: 
  - protocol: TCP 
    port: 8080 
    routes: 
      selector: 
       matchLabels: 
        ako.vmware.com/gateway-namespace: default 
        ako.vmware.com/gateway-name: avi-alb-gw
      group: v1 
      kind: Service
  - protocol: TCP 
    port: 80 
    routes: 
      selector: 
       matchLabels: 
        ako.vmware.com/gateway-namespace: default 
        ako.vmware.com/gateway-name: avi-alb-gw
      group: v1 
      kind: Service
EOF

Describe the new gateway object to see the status of the resource

kubectl describe gateway avi-alb-gw
Name:         avi-alb-gw
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  networking.x-k8s.io/v1alpha1
Kind:         Gateway
Metadata:
<Skipped>
Status:
  Addresses:
    Type:   IPAddress
    Value:  10.10.16.227
  Conditions:
    Last Transition Time:  1970-01-01T00:00:00Z
    Message:               Waiting for controller
    Reason:                NotReconciled
    Status:                False
    Type:                  Scheduled
  Listeners:
    Conditions:
      Last Transition Time:  2021-06-03T08:30:11Z
      Message:
      Reason:                Ready
      Status:                True
      Type:                  Ready
    Port:                    8080
    Protocol:
    Conditions:
      Last Transition Time:  2021-06-03T08:30:11Z
      Message:
      Reason:                Ready
      Status:                True
      Type:                  Ready
    Port:                    80
    Protocol:
Events:                      <none>

And the kubectl get services shows the same external IP address is being shared

kubectl get services
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)          AGE
hello        LoadBalancer   10.106.124.101   10.10.16.227   80:31464/TCP     35m
hello8080    LoadBalancer   10.107.152.148   10.10.16.227   8080:31990/TCP   7m23s

The AVI GUI represents the new Virtual Service object with two different Pool Groups as shown in the screen capture below

AVI Virtual Service object representation of the Gateway resource

And you can see how the same Virtual Service is proxying both 8080 and 80 TCP ports simultaneously

Virtual Service object including two listeners

It could be interesting for predictibility reasons to be able to pick an specific IP address from the available range instead of use the AVI IPAM automated allocation process. You can spec the desired IP address by including the spec.addresses definition as part of the gateway object configuration. To change the IPAddress a complete gateway recreation is required. First delete the gateway

kubectl delete gateway avi-alb-gw
gateway.networking.x-k8s.io "avi-alb-gw" deleted

And now recreate it adding the addresses definition as shown below

cat <<EOF | kubectl apply -f -
apiVersion: networking.x-k8s.io/v1alpha1
kind: Gateway
metadata:
  name: avi-alb-gw
  namespace: default
spec: 
  gatewayClassName: critical-gwc
  addresses:
  - type: IPAddress
    value: 10.10.16.232  
  listeners: 
  - protocol: TCP 
    port: 8080 
    routes: 
      selector: 
       matchLabels: 
        ako.vmware.com/gateway-namespace: default 
        ako.vmware.com/gateway-name: avi-alb-gw
      group: v1 
      kind: Service
  - protocol: TCP 
    port: 80 
    routes: 
      selector: 
       matchLabels: 
        ako.vmware.com/gateway-namespace: default 
        ako.vmware.com/gateway-name: avi-alb-gw
      group: v1 
      kind: Service
EOF

From the AVI GUI you can now see how the selected IP Address as been configured in our Virtual Service that maps with the Gateway kubernetes resource.

L4 Virtual Service realization using AKO and Gateway API

This concludes this article. Stay tuned for new content.

AVI for K8s Part 9: Customizing Ingress Pools using HTTPRule CRDs

In the previous article we went through the different available options to add extra customization for our delivered applications using the HostRule CRD on the top of the native kubernetes objects.

Now it’s time to explore another interesint CRD called HTTPRule that can be used as a complimentary object that dictates the treatment applied to the traffic sent towards the backend servers. We will tune some key properties to control configuration settings such as load balancing algorithm, persistence, health-monitoring or re-encryption.

Exploring the HTTPRule CRD

The HTTPRule CRD general definition looks like this:

apiVersion: ako.vmware.com/v1alpha1
kind: HTTPRule
metadata:
   name: my-http-rule
   namespace: purple-l7
spec:
  fqdn: foo.avi.internal
  paths:
  - target: /foo
    healthMonitors:
    - my-health-monitor-1
    - my-health-monitor-2
    loadBalancerPolicy:
      algorithm: LB_ALGORITHM_CONSISTENT_HASH
      hash: LB_ALGORITHM_CONSISTENT_HASH_SOURCE_IP_ADDRESS
    tls: ## This is a re-encrypt to pool
      type: reencrypt # Mandatory [re-encrypt]
      sslProfile: avi-ssl-profile
      destinationCA:  |-
        -----BEGIN CERTIFICATE-----
        [...]
        -----END CERTIFICATE-----

In following sections we will decipher this specifications one by one to understand how affects to the behaviour of the load balancer. As a very first step we will need a testbed application in a form of a secure ingress object. I will use this time the kuard application that is useful for testing and troubleshooting. You can find information about kuard here.

kubectl create deployment kuard --image=gcr.io/kuar-demo/kuard-amd64:1 --replicas=6

Now expose the application creating a ClusterIP service listening in port 80 and targeting the port 8080 that is the one used by kuard.

kubectl expose deployment kuard --port=80 --target-port=8080
service/kuard exposed

The secure ingress definition requires a secret resource in kubernetes. An easy way to generate the required cryptographic stuff is by using a simple i created and availabe here. Just copy the script, make it executable and launch it as shown below using your own data.

./create_secret.sh ghost /C=ES/ST=Madrid/CN=kuard.avi.iberia.local default

If all goes well you should have a new kubernetes secret tls object that you can verify by using kubectcl commands as shown below

kubectl describe secret kuard-secret
Name:         kuard-secret
Namespace:    default
Labels:       <none>
Annotations:  <none>

Type:  kubernetes.io/tls

Data
====
tls.crt:  574 bytes
tls.key:  227 bytes

Create a secure ingress yaml definition including the certificate, name, ports and rest of relevant specifications.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kuard
  labels:
    app: kuard
    app: gslb
spec:
  tls:
  - hosts:
    - kuard.avi.iberia.local
    secretName: kuard-secret
  rules:
    - host: kuard.avi.iberia.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: kuard
              port:
                number: 80

If everything went well, you will see the beatiful graphical representation of the declared ingress state in the AVI Controller GUI

And we can check the app is up and running by loading the page at https://ghost.avi.iberia.local in my case.

Now we are ready to go so let’s starting playing with the CRD definitions.

healthMonitors

Health Monitoring is a key element in a Application Delivery Controller system because is a subsystem that takes care of the current status of the real servers that eventually will respond the client requests. The health monitor can operate at different levels, it could be just a simple L3 ICMP echo request to check if the backend server is alive or it could be a L4 TCP SYN to verify the server is listening in a specific TCP port or even it can be a L7 HTTP probe to check if the server is responding with certain especific body content or headers. Sometimes it might be interesting to add some extra verification to ensure our system is responding as expected or even controlling the health-status of a particular server on environmental variables such as Time of Day or Day of Week. The health-monitor can use python, perl or shell scripting to create very sophisticated health-monitors.

To test how it works I have created a very simple script that will parse the server response and will decide if the server is healthy. To do so I will send a curl and will try to match (grep) an specific string within the server response. If the script returns any data it is considered ALIVE whereas if there is no data returned the system will declare the server as DOWN. For this especific case, just as an example I will use the Health Monitor to exclude certain worker nodes in kubernetes in a rudimentary way based on the IP included in the response that the kuard application sent. In this case, I will consider that only the servers running at any IP starting with 10.34.3 will be considered ALIVE.

Navitage to Templates > Health Monitor > CREATE and create a new Health Monitor that we will call KUARD-SHELL

Remember all the DataPlane related tasks are performed from the Service Engines including the health-monitoring. So It’s always a good idea to verify manually from the Service Engine if the health-monitor is working as expected. Let’s identify the Service Engine that is realizing our Virtual Service

Log in into the AVI controller CLI and connect to the service engine using attach command

[admin:10-10-10-33]: > attach serviceengine s1_ako-se-kmafa

Discover the network namespace id that usually is avi_ns1

admin@s1-ako-se-vyrvl:~$ ip netns
avi_ns1 (id: 0)

Open a bash shell in the especific namespace. The admin password would be required

admin@s1-ako-se-vyrvl:~$ sudo ip netns exec avi_ns1 bash
[sudo] password for admin:
root@s1-ako-se-kmafa:/home/admin#

From this shell you can now mimic the health-monitoring probes to validate your actual server health manually and for script debugging. Get the IP address assigned to your pods using kubectl get pods and check the reachability and actual responses as seen by the Service Engines.

kubectl get pod -o custom-columns="NAME:metadata.name,IP:status.podIP" -l app=kuard
NAME                   IP
kuard-96d556c9-k2bfd   10.34.1.131
kuard-96d556c9-k6r99   10.34.3.204
kuard-96d556c9-nwxhm   10.34.3.205

In my case I have selected 10.34.3.206 that been assigned to one of the kuard application pods. Now curl to the application to see the actual server response as shown below:

curl -s http://10.34.3.204:8080
<!doctype html>

<html lang="en">
<head>
  <meta charset="utf-8">

  <title>KUAR Demo</title>

  <link rel="stylesheet" href="/static/css/bootstrap.min.css">
  <link rel="stylesheet" href="/static/css/styles.css">

  <script>
var pageContext = {"hostname":"kuard-96d556c9-g9sb4","addrs":["10.34.3.206"],"version":"v0.8.1-1","versionColor":"hsl(18,100%,50%)","requestDump":"GET / HTTP/1.1\r\nHost: 10.34.3.204:8080\r\nAccept: */*\r\nUser-Agent: curl/7.69.0","requestProto":"HTTP/1.1","requestAddr":"10.10.14.22:56830"}
  </script>
</head>

<... skipped> 
</html>

Using the returned BODY section you can now define your own health-monitor. In this example, we want to declare alive only to pods running in the worker node whose allocated podCIDR matches with 10.34.3.0/24. So an simple way to do it is by using grep and try to find a match with the “10.34.3” string.

root@s1-ako-se-kmafa:/home/admin# curl -s http://10.34.3.204:8080 | grep "10.34.3"
var pageContext = {"hostname":"kuard-96d556c9-g9sb4","addrs":["10.34.3.206"],"version":"v0.8.1-1","versionColor":"hsl(18,100%,50%)","requestDump":"GET / HTTP/1.1\r\nHost: 10.34.3.204:8080\r\nAccept: */*\r\nUser-Agent: curl/7.69.0","requestProto":"HTTP/1.1","requestAddr":"10.10.14.22:56968"}

You can also verify if this there is no answer for pods at any other podCIDR that does not start from 10.10.3. Take 10.34.1.130 as the pod IP and you should not see any output.

root@s1-ako-se-kmafa:/home/admin# curl -s http://10.34.1.131:8080 | grep "10.10.3"
<NO OUTPUT RECEIVED>

Now we have done some manual validation we are safe to go and using IP and PORT as input variables we can now formulate our simple custom-health monitor using the piece of code below.

#!/bin/bash
curl http://$IP:$PORT | grep "10.34.3"

Paste the above script in the Script Code section of our custom KUARD-SHELL Health-Monitor

And now push the configuration to the HTTPRule CRD adding above lines and pushing to Kubernetes API using kubectl apply as usual.

apiVersion: ako.vmware.com/v1alpha1
kind: HTTPRule
metadata:
   name: kuard
   namespace: default
spec:
  fqdn: kuard.avi.iberia.local
  paths:
  - target: /
    healthMonitors:
    - KUARD-SHELL

As a first step, verify in the Pool Server configuration how the new Health Monitor has been configured.

Navigate to Server Tab within our selected pool and you should see an screen like the shown below. According to our custom health-monitor only pods running at 10.34.3.X are declared as green whereas pods running in any other podCIDR will be shown as red (dead).

Now let’s can scale our replicaset to eight replicas to see if the behaviour is consistent.

kubectl scale deployment kuard --replicas=8
deployment.apps/kuard scaled

ahora se muestra y bla, bla

That example illustrate how you can attach a custom health-monitor to influence the method to verify of the backend servers using sophisticated scripting.

loadBalancerPolicy

The heart of a load balancer is its ability to effectively distribute traffic across the available healthy servers. AVI provides a number of algorithms, each with characteristics that may be best suited each different use case. Currently, the following values are supported for load balancer policy:

  • LB_ALGORITHM_CONSISTENT_HASH
  • LB_ALGORITHM_CORE_AFFINITY
  • LB_ALGORITHM_FASTEST_RESPONSE
  • LB_ALGORITHM_FEWEST_SERVERS
  • LB_ALGORITHM_FEWEST_TASKS
  • LB_ALGORITHM_LEAST_CONNECTIONS
  • LB_ALGORITHM_LEAST_LOAD
  • LB_ALGORITHM_NEAREST_SERVER
  • LB_ALGORITHM_RANDOM
  • LB_ALGORITHM_ROUND_ROBIN
  • LB_ALGORITHM_TOPOLOGY

A full description of existing load balancing algorithms and how they work is available here.

The default algorithm is the Least Connection who takes into account the number of existing connections in each of the servers to make a decision about the next request. To verify the operation of the current LB algorithm you can use a simple single line shell script and some text processing. This is an example for the kuard application but adapt it according to your needs and expected servers response.

while true; do echo "Response received from POD at " $(curl -k https://kuard.avi.iberia.local -s | grep "addrs" | awk -F ":" '/1/ {print $3}' | awk -F "," '/1/ {print $1}'); sleep 1; done
Response received from POD at  ["10.34.3.42"]
Response received from POD at  ["10.34.3.42"]
Response received from POD at  ["10.34.3.42"]
Response received from POD at  ["10.34.3.42"]
Response received from POD at  ["10.34.3.42"]
Response received from POD at  ["10.34.3.42"]

As you can see the response is been received always from the same server that is running, in this case, at 10.34.3.42. Now we will try to change it to LS_ALGORITHM_ROUND_ROBIN to see how it work

kubectl edit HTTPRule kuard

apiVersion: ako.vmware.com/v1alpha1
kind: HTTPRule
metadata:
  name: kuard
  namespace: default
spec:
  fqdn: kuard.avi.iberia.local
  paths:
  - target: / 
    healthMonitors:
    - KUARD-SHELL
    loadBalancerPolicy:
      algorithm: LB_ALGORITHM_ROUND_ROBIN

If you repeat the same test you can now see how the responses are now being distributed in a round robin fashion across all the existing backend servers (i.e pods).

while true; do echo "Response received from POD at " $(curl -k https://kuard.avi.iberia.local -s | grep addrs | awk -F ":" '/1/ {print $3}' | awk -F "," '/1/ {print $1}'); sleep 1; done
Response received from POD at  ["10.34.3.204"]
Response received from POD at  ["10.34.3.208"]
Response received from POD at  ["10.34.3.207"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.204"]
Response received from POD at  ["10.34.3.208"]
Response received from POD at  ["10.34.3.207"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.204"]
Response received from POD at  ["10.34.3.208"]
Response received from POD at  ["10.34.3.207"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.204"]
Response received from POD at  ["10.34.3.208"]
Response received from POD at  ["10.34.3.207"]

An easy way to verify the traffic distribution is using AVI Analytics. Click on Server IP Address and you should see how the client request are being distributed evenly across the available servers following the round-robin algorithm.

You can play with other available methods to select the best algorithm according to your needs.

applicationPersistence

HTTPRule CRD can also be used to express application persistence for our application. Session persistence ensures that, at least for the duration of the session or amount of time, the client will reconnect with the same server. This is especially important when servers maintain session information locally. There are diferent options to ensure the persistence. You can find a full description of available Server Persistence options in AVI here.

We will use the method based on HTTP Cookie to achieve the required persistence. With this persistence method, AVI Service Engines (SEs) will insert an HTTP cookie into a server’s first response to a client. Remember to use HTTP cookie persistence, no configuration changes are required on the back-end servers. HTTP persistence cookies created by AVI have no impact on existing server cookies or behavior.

Let’s create our own profile. Navigate to Templates > Profiles > Persistence > CREATE and define the COOKIE-PERSISTENCE-PROFILE. The cookie name is an arbitrary name. I will use here MIGALLETA as the cookie name as shown below:

Edit the HTTPRule to push the configuration to our Pool as shown below:

kubectl edit HTTPRule kuard

apiVersion: ako.vmware.com/v1alpha1
kind: HTTPRule
metadata:
  name: kuard
  namespace: default
spec:
  fqdn: kuard.avi.iberia.local
  paths:
  - target: / 
    healthMonitors:
    - KUARD-SHELL
    loadBalancerPolicy:
      algorithm: LB_ALGORITHM_ROUND_ROBIN
    applicationPersistence: COOKIE-PERSISTENCE-PROFILE

The AVI GUI shows how the new configuration has been succesfully applied to our Pool.

To verify how the cookie-based persistence works lets do some tests with curl. Although the browsers will use the received cookie for subsequent requests during session lifetime, the curl client implementation does not reuse this cookie received information. That means the Server Persistence will not work as expected unless you reuse the cookie received. In fact if you repeat the same test we used to verify the LoadBalancer algorithm you will see the same round robin in action.

while true; do echo "Response received from POD at " $(curl -k https://kuard.avi.iberia.local -s | grep addrs | awk -F ":" '/1/ {print $3}' | awk -F "," '/1/ {print $1}'); sleep 1; done
Response received from POD at  ["10.34.3.208"]
Response received from POD at  ["10.34.3.207"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.204"]
Response received from POD at  ["10.34.3.208"]
Response received from POD at  ["10.34.3.207"]

We need to save the received cookie and then reuse it during the session. To save the received cookies from the AVI LoadBalancer just use the following command that will write the cookies in the mycookie file

curl -k https://kuard.avi.iberia.local -c mycookie

As expected, the server has sent a cookie with the name MIGALLETA and some encrypted payload that contains the back-end server IP address and port. The payload is encrypted with AES-256. When a client makes a subsequent HTTP request, it includes the cookie, which the SE uses to ensure the client’s request is directed to the same server and theres no need to maintain in memory session tables in the Service Engines. To show the actual cookie just show the content of the mycookie file.

cat mycookie
# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

kuard.avi.iberia.local  FALSE   /       TRUE    0       MIGALLETA       029390d4b1-d684-4e4e2X85YaqGAIGwwilIc5zjXcplMYncHJMGZRVobEXRvqRWOuM7paLX4al2rWwQ5IJB8

Now repeat the same loop but note that now the curl command has been modified to send the cookie contents with the –cookie option as shown below.

while true; do echo "Response received from POD at " $(curl -k https://kuard.avi.iberia.local --cookie MIGALLETA=029390d4b1-d684-4e4e2X85YaqGAIGwwilIc5zjXcplMYncHJMGZRVobEXRvqRWOuM7paLX4al2rWwQ5IJB8 -s | grep addrs | awk -F ":" '/1/ {print $3}' | awk -F "," '/1/ {print $1}'); sleep 1; done
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.205"]
Response received from POD at  ["10.34.3.205"]

The server persistence is now achieved. You can easily verify it using the AVI Analytics as shown below:

Just select a transaction. Note the Persistence Used is displayed as true and a Persistence Session ID has been assigned indicating this session will persist in the same backend server.

Now click on the View All Headers and you should be able to see the cookie received from the client and sent to the end server. The service engine decodes the payload content to persist the session with the original backend server.

tls

The tls setting is used to express the reencryption of the traffic between the Load Balancer and the backend servers. This can be used in a environments in which clear communications channels are not allowed to meet regulatory requirements such as PCI/DSS. To try this out, we will change the application and we will prepare an application that uses HTTPS as transport protocol in the ServiceEngine-to-pod segment.

We will create a custom docker image based on Apache httpd server and we will enable TLS and use our own certificates. As a first step is create the cryptographic stuff needed to enable HTTPS. Create a private key then a Certificate Signing Request and finally self-signed the request using the private key to produce a X509 public certificate. The steps are shown below:

# Generate Private Key and save in server.key file
openssl ecparam -name prime256v1 -genkey -noout -out server.key
# Generate a Cert Signing Request using a custom Subject and save into server.csr file
openssl req -new -key server.key -out server.csr -subj /C=ES/ST=Madrid/CN=server.internal.lab
# Self-Signed the CSR and create a X509 cert in server.crt
openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt

Now get the apache configuration file using the following command that runs a temporary docker image and execute a command to get the default httpd.conf and saves it to a local my-httpd.conf file.

docker run --rm httpd:2.4 cat /usr/local/apache2/conf/httpd.conf > my-httpd.conf

Edit my-httpd.conf and uncomment the /usr/local/apache2/conf/httpd.conf by removing the hash symbol at the beginning of the following lines:

...
LoadModule socache_shmcb_module modules/mod_socache_shmcb.so
...
LoadModule ssl_module modules/mod_ssl.so
...
Include conf/extra/httpd-ssl.conf
...

Create a simple Dockerfile to COPY the created certificates server.crt and server.key into /usr/local/apache2/conf/ as well as the custom config file with SSL enabling options.

FROM httpd:2.4
COPY ./my-httpd.conf /usr/local/apache2/conf/httpd.conf
COPY ./server.crt /usr/local/apache2/conf
COPY ./server.key /usr/local/apache2/conf

Build the new image. Use your own Docker Hub id and login first using docker login to interact with Docker hub using CLI. In this case my docker hub is jhasensio and bellow image is publicly available if you want to reuse it.

sudo docker build -t jhasensio/httpd:2.4 .
Sending build context to Docker daemon  27.14kB
Step 1/4 : FROM httpd:2.4
 ---> 39c2d1c93266
Step 2/4 : COPY ./my-httpd.conf /usr/local/apache2/conf/httpd.conf
 ---> fce9c451f72e
Step 3/4 : COPY ./server.crt /usr/local/apache2/conf
 ---> ee4f1a446b78
Step 4/4 : COPY ./server.key /usr/local/apache2/conf
 ---> 48e828f52951
Successfully built 48e828f52951
Successfully tagged jhasensio/httpd:2.4

Login into your docker account and Push to docker.

sudo docker push jhasensio/httpd:2.4
The push refers to repository [docker.io/jhasensio/httpd]
e9cb228edc5f: Pushed
9afaa685c230: Pushed
66eaaa491246: Pushed
98d580c48609: Mounted from library/httpd
33de34a890b7: Mounted from library/httpd
33c6c92714e0: Mounted from library/httpd
15fd28211cd0: Mounted from library/httpd
02c055ef67f5: Mounted from library/httpd
2.4: digest: sha256:230891f7c04854313e502e2a60467581569b597906318aa88b243b1373126b59 size: 1988

Now you can use the created image as part of you deployment. Create a deployment resource as usual using below yaml file. Note the Pod will be listening in port 443

apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpd
  labels:
    app: httpd
spec:
  replicas: 3
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
      - name: httpd
        image: jhasensio/httpd:2.4
        ports:
        - containerPort: 443

Now create the service ClusterIP and expose it using an secure ingress object. An existing tls object called httpd-secret object must exist in kubernetes to get this configuration working. You can generate this secret object using a simple script available here.

apiVersion: v1
kind: Service
metadata:
  name: httpd
spec:
  ports:
   - name: https
     port: 443
     targetPort: 443
  type: ClusterIP
  selector:
    app: httpd
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: httpd
  labels:
    app: httpd
    app: gslb
spec:
  tls:
  - hosts:
    - httpd.avi.iberia.local
    secretName: httpd-secret
  rules:
    - host: httpd.avi.iberia.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: httpd
              port:
                number: 443

Verify the pod IP assignment using kubectl get pod and some filtering as shown below

kubectl get pod -o custom-columns="NAME:metadata.name,IP:status.podIP" -l app=httpd
NAME                     IP
httpd-5cffd677d7-clkmm   10.34.3.44
httpd-5cffd677d7-hr2q8   10.34.1.39
httpd-5cffd677d7-qtjcw   10.34.1.38

Create a new HTTPRule object in a yaml file and apply it using kubectl apply command. Note we have changed the application to test TLS reencryption so a new FQDN is needed to link the HTTPRule object with the new application. It’s a good idea to change the healthMonitor to System-HTTPS instead of the default System-HTTP. We can refer also to our own SSL Profile that will define the TLS negotiation and cypher suites.

apiVersion: ako.vmware.com/v1alpha1
kind: HTTPRule
metadata:
  name: httpd
  namespace: default
spec:
  fqdn: httpd.avi.iberia.local
  paths:
  - target: /
    tls:
      type: reencrypt
      sslProfile: CUSTOM_SSL_PROFILE
    healthMonitors:
    - System-HTTPS

Now we will verify if our httpd pods are actually using https to serve the content. A nice trick to troubleshoot inside the pod network is using a temporary pod with a prepared image that contains required network tools preinstalled. An example of this images is the the netshoot image available here. The following command creates a temporary pod and execute a bash session for troubleshooting purposes. The pod will be removed as soon as you exit from the ad-hoc created shell.

kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash

Now you can test the pod from inside the cluster to check if our SSL setup is actually working as expected. Using curl from the temporary shell try to connect to one of the allocated pod IPs.

bash-5.1# curl -k https://10.34.3.44 -v
*   Trying 10.34.3.44:443...
* Connected to 10.34.3.44 (10.34.3.44) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=ES; ST=Madrid; CN=server.internal.lab
*  start date: May 27 10:54:48 2021 GMT
*  expire date: May 27 10:54:48 2022 GMT
*  issuer: C=ES; ST=Madrid; CN=server.internal.lab
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
> GET / HTTP/1.1
> Host: 10.34.3.44
> User-Agent: curl/7.75.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Thu, 27 May 2021 12:25:02 GMT
< Server: Apache/2.4.48 (Unix) OpenSSL/1.1.1d
< Last-Modified: Mon, 11 Jun 2007 18:53:14 GMT
< ETag: "2d-432a5e4a73a80"
< Accept-Ranges: bytes
< Content-Length: 45
< Content-Type: text/html
<
<html><body><h1>It works!</h1></body></html>
* Connection #0 to host 10.34.3.44 left intact

Above you can verify how the server is listening on port 443 and the certificate information presented during TLS handshaking corresponds to our configuration. This time TLS1.3 has been used to establish the secure negotiation and AES_256_GCM_SHA384 cypher suite has used for encryption. Generate some traffic to the https://httpd.avi.iberia.local url and it should display the default apache webpage as displayed below:

Select one of the transactions.This time, according to the configured SSL custom profile the traffic is using TLS1.2 as shown below:

To check how our custom HTTPRule has changed the Pool configuration just navigate to Applications > Pool > Edit Pool: S1-AZ1–default-httpd.avi.iberia.local_-httpd. The Enable SSL and the selected SSL Profile has now been set to the new values as per the HTTPRule.

You can even specify a custom CA in case you are using CA issued certificates to validate backend server identity. We are not testing this because is pretty straightforward.

destinationCA:  |-
        -----BEGIN CERTIFICATE-----
        [...]
        -----END CERTIFICATE-----

That concludes this article. Hope you have found useful to influence how the AVI loadbalancer handle the pool configuration to fit your application needs. Now it’s time to explore in the next article how we can take control of some AVI Infrastructure parameters using a new CRD: AviInfraSettings.

AVI for K8s Part 8: Customizing L7 Virtual Services using HostRule CRDs

Till now we have been used standard API kubernetes resources such as deployments, replicasets, secrets, services, ingresses… etc to define all the required configurations for the integrated Load Balancing services that AVI Service Engines eventually provides. Very oftenly the native K8s API is not rich enough to have a corresponding object to configure advance configuration in the external integrated system (e.g. the external Load Balancer in our case) and this is when the Custom Resource Definition or CRD come into scene. A CRD is a common way to extend the K8s API with aditional custom schemas. The AKO operator supports some CRD objects to define extra configuracion that allows the end user to customize even more the service. Another common method to personalize the required configuration is through the use of annotations or even matchlabels, however, the usage of CRD is a best approach since, among other benefits, can be integrated with the RBAC native policies in k8s to add extra control and access to these new custom resources.

This guide is based on the testbed scenario represented in above figure and uses a single kubernetes cluster and a single AVI controller. Antrea is selected as CNI inside the Kubernetes cluster.

AKO uses two categories of CRD

  • Layer 7 CRDs.- provides customization for L7 Ingress resources
    • HostRule CRD.- provides extra settings to configure the Virtual Service
    • HTTPRule CRD.- provides extra settings to customize the Pool or backend associated objects
  • Infrastructure CRDs.- provides extra customization for Infrastructure
    • AviInfraSetting CRD.- Defines L4/L7 infrastructure related parameters such as Service Engine Groups, VIP Network… etc)

This article will cover in detail the first of them which is the HostRule CRD. The subsequent articles of this series go through HTTPRule and AviInfraSetting CRD.

Upgrading existing AKO Custom Resource Definitions

As mentioned in previous articles we leverage helm3 to install and manages AKO related packages. Note that when we perform a release upgrade, helm3 does not upgrade the CRDs. So, whenever you upgrade a release, run the following command to ensure you are getting the last version of CRD:

helm template ako/ako --version 1.4.2 --include-crds --output-dir $HOME
wrote /home/ubuntu/ako/crds/networking.x-k8s.io_gateways.yaml
wrote /home/ubuntu/ako/crds/ako.vmware.com_hostrules.yaml
wrote /home/ubuntu/ako/crds/ako.vmware.com_aviinfrasettings.yaml
wrote /home/ubuntu/ako/crds/ako.vmware.com_httprules.yaml
wrote /home/ubuntu/ako/crds/networking.x-k8s.io_gatewayclasses.yaml
wrote /home/ubuntu/ako/templates/serviceaccount.yaml
wrote /home/ubuntu/ako/templates/secret.yaml
wrote /home/ubuntu/ako/templates/configmap.yaml
wrote /home/ubuntu/ako/templates/clusterrole.yaml
wrote /home/ubuntu/ako/templates/clusterrolebinding.yaml
wrote /home/ubuntu/ako/templates/statefulset.yaml
wrote /home/ubuntu/ako/templates/tests/test-connection.yaml

Once you have downloaded, just apply them using kubectl apply command.

kubectl apply -f $HOME/ako/crds/
customresourcedefinition.apiextensions.k8s.io/aviinfrasettings.ako.vmware.com created
Warning: resource customresourcedefinitions/hostrules.ako.vmware.com is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
customresourcedefinition.apiextensions.k8s.io/hostrules.ako.vmware.com configured
Warning: resource customresourcedefinitions/httprules.ako.vmware.com is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
customresourcedefinition.apiextensions.k8s.io/httprules.ako.vmware.com configured
customresourcedefinition.apiextensions.k8s.io/gatewayclasses.networking.x-k8s.io created
customresourcedefinition.apiextensions.k8s.io/gateways.networking.x-k8s.io created

Once upgraded, relaunch AKO using the values.yaml according to your setup. For our testbed scenario I will use a set of values you can find here.

Exploring the HostRule CRD

Let’s start with the HostRule CRD one that is used to provide extra configuracion for the Virtual Host properties. The virtual host is a logical construction for hosting multiple FQDNs on a single virtual service definition. This allows one VS to share some resources and properties among multiple Virtual Hosts. The CRD object as any other kubernetes resource is configured using declarative yaml files and it looks this:

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: my-host-rule
  namespace: red
spec:
  virtualhost:
    fqdn: foo.com # mandatory
    enableVirtualHost: true
    tls: # optional
      sslKeyCertificate:
        name: avi-ssl-key-cert
        type: ref
      sslProfile: avi-ssl-profile
      termination: edge
    httpPolicy: 
      policySets:
      - avi-secure-policy-ref
      overwrite: false
    datascripts:
    - avi-datascript-redirect-app1
    wafPolicy: avi-waf-policy
    applicationProfile: avi-app-ref
    analyticsProfile: avi-analytics-ref
    errorPageProfile: avi-errorpage-ref

Before going through the different settings to check how they affect to the Virtual Service configuration we need to create an application for testing. I will create a secure Ingress object to expose the a deployment that will run the Ghost application. Ghost is a quite popular app and one of the most versatile open source content management systems. First define the deployment, this time using imperative commands.

kubectl create deployment ghost --image=ghost --replicas=3
deployment.apps/ghost created

Now expose the application in port 2368 which is the port used by the ghost application.

kubectl expose deployment ghost --port=2368 --target-port=2368
service/ghost exposed

The secure ingress definition requires a secret resource in kubernetes. An easy way to generate the required cryptographic stuff is by using a simple script including the Openssl commands created and availabe here. Just copy the script, make it executable and launch it as shown below using your own data.

./create_secret.sh ghost /C=ES/ST=Madrid/CN=ghost.avi.iberia.local default

If all goes well you should have a new kubernetes secret tls object that you can verify by using kubectcl commands as shown below

kubectl describe secret ghost
Name:         ghost-secret
Namespace:    default
Labels:       <none>
Annotations:  <none>

Type:  kubernetes.io/tls

Data
====
tls.crt:  570 bytes
tls.key:  227 bytes

Now we can specify the ingress yaml definition including the certificate, name, ports and other relevant attributes.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ghost
  labels:
    app: ghost
    app: gslb
spec:
  tls:
  - hosts:
    - ghost.avi.iberia.local
    secretName: ghost-secret
  rules:
    - host: ghost.avi.iberia.local
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: ghost
              port:
                number: 2368

In a few seconds after applying you should see the beatiful graphical representation of the declared ingress state in the AVI Controller GUI.

Virtual Service representation

And naturally, we can check the if the ghost application is up and running by loading the web interface at https://ghost.avi.iberia.local in my case.

Click on the lock Non secure icon in the address bar of the brower and show the certificate. Verify the ingress is using the secret we created that should corresponds to our kubernetes

Now let’s play with the CRD definitions.

enableVirtualHost

The first setting is very straightforward and basically is used as a flag to change the administrative status of the Virtual Service. This is a simple way to delegate the actual status of the ingress service to the kubernetes administrator. To create the HostRule you need to create a yaml file with the following content and apply it using kubectl apply command. The new HostRule object will be named ghost.

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    fqdn: ghost.avi.iberia.local
    enableVirtualHost: true

Once our HostRule resource is created you can explore the actual status by using regular kubectl command line as shown below

kubectl get HostRule ghost -o yaml
apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ako.vmware.com/v1alpha1","kind":"HostRule","metadata":{"annotations":{},"name":"ghost","namespace":"default"},"spec":{"virtualhost":{"enableVirtualHost":true,"fqdn":"ghost.avi.iberia.local"}}}
  creationTimestamp: "2021-05-19T17:51:09Z"
  generation: 1
  name: ghost
  namespace: default
  resourceVersion: "10590334"
  uid: 6dd06a15-33c2-4c9c-970e-ed5a21a81ce6
spec:
  virtualhost:
    enableVirtualHost: true
    fqdn: ghost.avi.iberia.local
status:
  status: Accepted

Now it’s time to toogle the enableVirtualHost key and set to false to see how it affect to our external Virtual Service in the AVI load balancer. The easiest way is using kubectl edit that will launch your preferred editor (typically vim) to change the definition on the fly.

kubectl edit HostRule ghost
apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ako.vmware.com/v1alpha1","kind":"HostRule","metadata":{"annotations":{},"name":"ghost","namespace":"default"},"spec":{"virtualhost":{"enableVirtualHost":true,"fqdn":"ghost.avi.iberia.local"}}}
  creationTimestamp: "2021-05-19T17:51:09Z"
  generation: 11
  name: ghost
  namespace: default
  resourceVersion: "12449394"
  uid: 6dd06a15-33c2-4c9c-970e-ed5a21a81ce6
spec:
  virtualhost:
    enableVirtualHost: false
    fqdn: ghost.avi.iberia.local
status:
  status: Accepted

Save the file using the classical <Esc>:wq! sequence if you are using vim editor and now you can check verify in th AVI GUI how this affect to the status of the Virtual Service.

w

If you click in the Virtual Service and then click on the pencil icon you can see the Enabled toogle is set to OFF as shown below:

sslKeyCertificate

AKO integration uses the cryptographic information stored in the standard secret kubernetes object and automatically pushes that information to the AVI controller using API calls according to the secure ingress specification. If you want to override this setting you can use the sslKeyCertificate key as part of the HostRule specification to provide alternative information that will be used for the associated ingress object. You can specify both the name of the certificate and also the sslProfile to influence the SSL negotiation parameters.

Till now, the AKO has been translating standard kubernetes objects such as ingress, secrets, deployments into AVI configuration items, in other words, AKO was automated all the required configuration in the AVI controller on our behalf. Generally speaking, when using CRD, the approach is slightly different. Now the AVI Load Balancer administrator must create in advance the required configuration objects to allow the kubernetes administrator to consume the defined policies and configurations as they are defined.

Let’s create this required configuration items from the AVI GUI. First we will check the available system certificates. Navigate to Templates > Security > SSL/TLS Certificates. We will use the System-Default-Cert-EC this time.

Similarly now navigate to Templates > Security > SSL/TLS Profile and create a new SSL Profile. Just for testing purposes select only insecure version such as SSL 3.0 and TLS 1.0 as the TLS version used during TLS handshake

Once the required configuration items are precreated in the AVI GUI you can reference them in the associated yaml file. Use kubectl apply -f to push the new configuration to the HostRule object.

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    fqdn: ghost.avi.iberia.local
    enableVirtualHost: true
    tls:
      sslKeyCertificate:
        name: System-Default-Cert-EC
        type: ref
      sslProfile: CUSTOM_SSL_PROFILE
      termination: edge

If you navigate to the ghost Virtual Service in the AVI GUI you can verify in the SSL Settings section how the new configuration has been successfully applied.

Additionally, if you use a browser and open the certificate you can see how the Virtual Service is using the System-Default-Cert-EC we have just configured by the HostRule CRD.

To verify the TLS handshaking according to the SSL Profile specification just use curl. Notice how the output shows TLS version 1.0 has been used to establish the secure channel.

curl -vi -k https://ghost.avi.iberia.local
*   Trying 10.10.15.162:443...
* TCP_NODELAY set
* Connected to ghost.avi.iberia.local (10.10.15.162) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.0 (IN), TLS handshake, Certificate (11):
* TLSv1.0 (IN), TLS handshake, Server key exchange (12):
* TLSv1.0 (IN), TLS handshake, Server finished (14):
* TLSv1.0 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.0 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.0 (OUT), TLS handshake, Finished (20):
* TLSv1.0 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.0 / ECDHE-ECDSA-AES256-SHA
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=System Default EC Cert
*  start date: Feb 23 19:02:20 2021 GMT
*  expire date: Feb 21 19:02:20 2031 GMT
*  issuer: CN=System Default EC Cert
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
> GET / HTTP/1.1
> Host: ghost.avi.iberia.local
> User-Agent: curl/7.67.0
> Accept: */*

httpPolicy

The AVI HTTP Policy is a feature that allow advanced customization of network layer security, HTTP security, HTTP requests, and HTTP responses. A policy may be used to control security, client request attributes, or server response attributes. Policies are comprised of matches and actions. If the match condition is satisfied then AVI performs the corresponding action.

A full description of the power of the httpPolicy is available in the AVI documentation page here.

The configuration of an httpPolicy is not as easy as the other AVI configuration elementes because part is the httppolicyset object is not shared neither explicitly populated in the AVI GUI which means you can only share policy sets and attach multiple policy sets to a VS through the CLI/API.

By default, AKO creates a HTTP Policy Set using the API when a new ingress object is created and it will be unique to the VS but, as mentioned, is not shown in the GUI as part of the VS definition.

Let’s try to define the http policy. Open a SSH connection to the AVI Controller IP address, log in and launch the AVI CLI Shell by issuing a shell command. The prompt will change indicating you are now accessing the full AVI CLI. Now configure the new httpPolicySet This time we will create a network policy to add traffic control for security purposes. Let define a rule with a MATCH statement that matches any request with a header equals to ghost.avi.iberia.local and if the condition is met the associated ACTION will be a rate-limit allowing only up to ten connections per second. For any traffic out of contract AVI will send a local response with a 429 code. To configure just paste the below command lines.

[admin:10-10-10-33]: configure httppolicyset MY_RATE_LIMIT
 http_security_policy
  rules
   name RATE_LIMIT_10CPS
   index 0
    match
     host_hdr 
      match_criteria HDR_EQUALS
      match_case insensitive 
      value ghost.avi.iberia.local
	  exit
	 exit
	action
	 action http_security_action_rate_limit
	 rate_profile
	  rate_limiter
	   count 10
	   period 1
	   burst_sz 0
	   exit
	  action
	   type rl_action_local_rsp
	   status_code http_local_response_status_code_429
	   exit
	  exit
	exit
   exit
  exit
 save

After saving a summary page is displayed indicating the resulting configuration


[admin:10-10-10-33]: httppolicyset>  save
+------------------------+----------------------------------------------------
| Field                  | Value                                              
+------------------------+----------------------------------------------------
| uuid                   | httppolicyset-6a33d859-d823-4748-9701-727fa99345b5 
| name                   | MY_RATE_LIMIT                                      
| http_security_policy   |                                                    
|   rules[1]             |                                                    
|     name               | RATE_LIMIT_10CPS                                   
|     index              | 0                                                  
|     enable             | True                                               
|     match              |                                                    
|       host_hdr         |                                                    
|         match_criteria | HDR_EQUALS                                         
|         match_case     | INSENSITIVE                                        
|         value[1]       | ghost.avi.iberia.local                             
|     action             |                                                    
|       action           | HTTP_SECURITY_ACTION_RATE_LIMIT                    
|       rate_profile     |                                                    
|         rate_limiter   |                                                    
|           count        | 10                                                 
|           period       | 1 sec                                              
|           burst_sz     | 0                                                  
|         action         |                                                    
|           type         | RL_ACTION_LOCAL_RSP                                
|           status_code  | HTTP_LOCAL_RESPONSE_STATUS_CODE_429                
| is_internal_policy     | False                                              
| tenant_ref             | admin                                              
+------------------------+----------------------------------------------------+

You can also interrogate the AVI API navigating to the following URL https://site1-avi.regiona.iberia.local/api/httppolicyset?name=MY_RATE_LIMIT. To make this work you need to open first a session to the AVI GUI in the same browser to authorize the API access requests.

Now that the httppolicyset configuration item is created you can simply attach to the Virtual Service using the HostRule object as previously explained. Edit the yaml file and apply the new configuration or edit inline using the kubect edit comand.

kubectl edit HostRule ghost
apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    enableVirtualHost: true
    fqdn: ghost.avi.iberia.local
    tls:
      sslKeyCertificate:
        name: System-Default-Cert-EC
        type: ref
      sslProfile: CUSTOM_SSL_PROFILE
      termination: edge
    httpPolicy: 
      policySets:
      - MY_RATE_LIMIT
      overwrite: false

As soon as you apply the new configure navigate to the Virtual Service configuration and click on the Policies tab. A dropdown menu appears now that shows the default Policy Set applied by AKO along with the new custom Policy set named MY_RATE_LIMIT.

AKO currently creates an httppolicyset that uses objects on the SNI child virtual services to route traffic based on host/path matches. These rules are always at a lower index than the httppolicyset objects specified in the CRD object. If you want to overwrite all httppolicyset objects on a SNI virtual service with the ones specified in the HostRule CRD, set the overwrite flag to True.

To check if our rate-limiting is actually working you can use the Apache Bench tool to inject some traffic into the virtual service. The below command will sent three million of request with a concurrency value set to 100.

ab -c 100 -n 300000 https://ghost.avi.iberia.local/

Try to access to the ghost application using your browser while the test is still progress and you are likely to receive a 429 Too Many Request error code indicating the rate limiting is working as expected.

You can also verify the rate-limiter in action using the AVI Anlytics. In this case the graph below clearly shows how AVI is sending a 4XX response for the 98,7% of most of the requests.

If you want to check even deeper click on any of the 429 response and you will verify how the RATE_LIMIT_10CPS rule is the responsible for this response sent to the client.

dataScript

DataScripts is the most powerful mechanism to add extra security and customization to our virtual services. They are comprised of any number of function or method calls which can be used to inspect and act on traffic flowing through a virtual service. DataScript’s functions are exposed via Lua libraries and grouped into modules: string, vs, http, pool, ssl and crypto. You can find dozens of samples at the AVI github site here.

For this example, we will use a simple script to provide message signature using a Hash-based Message Authenticaton code (HMAC) mechanism. This is a common method to add extra security for a RESTful API service by signing your message based on a shared secret between the client and the service. For the sake of simplicity we will use an easy script that will extract the host header of the server response and will generate a new header with the computed hashing of this value. We will use the SHA algorithm to calculate the hashing.

Again, in order to attach the Datascript to our application we need to precreate the corresponding configuration item in the AVI controller using any method (GUI, API, CLI). This time we will use the GUI. Navigate to Templates > Scripts > DataScripts > CREATE. Scroll down to the HTTP Response Event Script section and put above script that extract the Host header and then create a new http header named SHA1-hash with the computed SHA hashing applied to the host header value.

Select a proper name for this script since we need to reference it from the HostRule CRD object. I have named Compute_Host_HMAC. Now edit again the HostRule and apply the new configuration.

kubectl edit HostRule ghost

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    enableVirtualHost: true
    fqdn: ghost.avi.iberia.local
    tls:
      sslKeyCertificate:
        name: System-Default-Cert-EC
        type: ref
      sslProfile: CUSTOM_SSL_PROFILE
      termination: edge
    httpPolicy: 
      policySets:
      - MY_RATE_LIMIT
      overwrite: false
    datascripts:
    - Compute_Host_HMAC

Once the HostRule has been modified we can verify in the AVI GUI how the new DataScript has been applied. Edit the Virtual Service and go to Policies > DataScripts

To check if the datascript is working as expected browse to the ghost application to generate some traffic and open one of the transactions

Now click on View All Headers link at the right of the screen to inspect the Headers and you should see in the Headers sent to server section how a new custom header named SHA1-hash has been sent containing the computed value of the host header found in the request as expected according to the DataScript function.

wafPolicy

Web Application Firewall is an additional L7 security layer to protect applications from external attackers. AVI uses a very powerful software WAF solution and provides scalable security, threat detection, and application protection. WAF policy is based on a specific set of rules that protects the application. 

Apart from the classical technices that uses signature matching to check if the attacker is trying to exploit a known vulnerability by a known attack and technique, AVI also uses a sophisticated technology called iWAF that take advantage of Artificial Intelligence models with the goal of clasifying the good traffic from the potentially dangerous using unsupervised machine learning models. This modern method is very useful not only to alleviate the operative burden that tuning a WAF policy using legacy methods implies but also to mitigate false positives ocurrences.

As seen with previous examples we need to predefine a base WAF Policy in AVI to allow the kubernetes admins to consume or reference them by using the HostRule CRD corresponding specification. Let’s create the WAF Policy. Navigate now to Templates > WAF > WAF Profile > CREATE

Here we can assign GHOST-WAF-PROFILE as the WAF Profile name and other general settings for our polity such as HTTP versions, Methods, Content-Types, Extensions… etc. I will using default settings so far.

Now we can create our WAF Policy from Templates > WAF > WAF Policy > CREATE. Again we will use default settings and we keep the Detection Mode (just flag the suspicious requests) instead of Enforcement (sent a 403 response code and deny access). We will use GHOST-WAF-POLICY as the WAF Policy name and it will be referenced in the HostRule definition.

Now that all the preconfiguration tasks has been completed we are ready to attach the WAF policy by using the HostRule CRD corresponding setting. Edit the existing HostRule object and modify accordingly as shown below:

kubectl edit HostRule ghost

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    enableVirtualHost: true
    fqdn: ghost.avi.iberia.local
    tls:
      sslKeyCertificate:
        name: System-Default-Cert-EC
        type: ref
      sslProfile: CUSTOM_SSL_PROFILE
      termination: edge
    httpPolicy: 
      policySets:
      - MY_RATE_LIMIT
      overwrite: false
    datascripts:
    - Compute_Host_HMAC
    wafPolicy: GHOST-WAF-POLICY

As soon as the new settings are applied a new shield icon appears next to the green rounded symbol that represents the Virtual Service object in the AVI GUI. The shield confirms that a WAF policy has been attached to this particular VS.

If you navigate to the Virtual Service properties you can verify the WAF Policy has been configured as expected.

Just to play a little bit with the Positive Learning Module of the AVI iWAF feature, click on the Learning tab and drag the button to enable the WAF Learning. A warning indicating a Learning Group is required appears as shown below.

Click on Create Learning Group to create this configuration that allows the system to create the baseline of the known-good behaviour according to the learnt traffic.

Assign a new name such as Ghost-PSM-Group and enable the Learning Group checkbox. Left the rest settings as default.

Return to the previous screen and set some settings that allows to speed up the learning process as shown below:

The parameters we are tuning above are specifically

  • Sampling = 100 % .- All requests are analyzed and used to the train the ML model
  • Update_interval = 1 .- Time for the Service Engine to collect data before sending to the learning module
  • min_hits_to_learn = 10 .- Specify the minimun number of ocurrences to consider a relevant hit. A lower value allow learning to happen faster. Default is 10k.

WAF Policy is ready to learn and will autopromote rules according to observed traffic. In a production environment it can take some time to have enough samples to consider a good policy. To produce some traffic and get our applications quite ready before goint to production stage it’s recomended to perform an active scanning over our web application. We will use here one of the most popular scanner which is OWASP Zed Attack Proxy (a.k.a ZAP Proxy). You can find more information of this tool at the zap official site here. Using command line as shown below perform a quick scanning over our application.

/usr/share/zaproxy/zap.sh -cmd -quickurl https://ghost.avi.iberia.local -quickprogress
Found Java version 11.0.11-ea
Available memory: 7966 MB
Using JVM args: -Xmx1991m
Ignoring legacy log4j.properties file, backup already exists.
Accessing URL
Using traditional spider
Active scanning
[===============     ] 79% /

After some scanning we can see explore the discovered locations (paths) that our web application uses. These discovered locations will be used to understand how a legitimate traffic should behave and will be the input for our AI based classifier. By using the right amount of data this would help the system to gain accurary to clasiffy known-good behaviour from anomalies and take action without any manual tweaking of the WAF Policies.

Additionally, the ZAP active scanner attempts to find potential vulnerabilities by using known attacks against the specified target. From the analytics page you can now see the WAF Tags that are associated to the scanner activities and are used to classify the different attack techniques observed.

You can also see how the active scanner attempts matches with specific WAF signatures.

And if you want to go deeper you can pick up one of the flagged requests to see the specific matching signature.

And also you can create exceptions for instance in case of false positives if needed.

Note how a new WAF tab is now available as part of the embedded analytics. If you click on it you can see some insights related to the WAF attack historical trends as well as more specific details.

Lastly, enable the enforcement mode in our WAF Policy.

Open one of the FLAGGED requests detected during the active scanning while in Detection mode and replicate the same request from the browser. In this case I have chosen one of the observed XSS attack attempts using the above URL. If you try to navigate to the target, the WAF engine now will block the access and will generate a 403 Forbidden response back to the client as shown below

applicationProfile

The application profiles are used to change the behavior of virtual services, based on application type. By default the system uses the System-Secure-HTTP with common settings including SSL Everywhere feature set that ensures you use the best-in-class security methods for HTTPS such includin HSTS, Securing Cookies, X-Forwarded-Proto among other.

To test how we can use the application profiles from the HostRule CRD object, preconfigure a new Application Profile that we will call GHOST-HTTPS-APP-PROFILE. As an example I am tuning here the compression setting and checking the Remove Accept Encoding Header for the traffic sent to the backend server. This is a method for offloading the content compression to the AVI Load Balancer in order to alleviate the processing at the end server.

Push the configuration to the Virtual Service by adding the relevant information to our HostRule CRD using kubectl edit command as shown:

kubectl edit HostRule ghost

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    enableVirtualHost: true
    fqdn: ghost.avi.iberia.local
    tls:
      sslKeyCertificate:
        name: System-Default-Cert-EC
        type: ref
      sslProfile: CUSTOM_SSL_PROFILE
      termination: edge
    httpPolicy: 
      policySets:
      - MY_RATE_LIMIT
      overwrite: false
    datascripts:
    - Compute_Host_HMAC
    wafPolicy: GHOST-WAF-POLICY
    applicationProfile: GHOST-HTTPS-APP-PROFILE

As soon as the new configuration is pushed to the CRD, AKO will patch the VS with the new Application Profile setting as you can verify in the GUI.

Generate some traffic, select a recent transaction and show All Headers. Now you can see how the compression related settings specified in the Accept-Encoding header received from client are now suppresed and rewritten by a identity value meaning no encoding.

analyticsProfile

Since each application is different, it may be necessary to modify the analytics profile to set the threshold for satisfactory client experience or omit certain errors or to configure an external collector to send the analytic to. This specific setting can be attached to any of the applications deployed from kubernetes. As in previous examples, we need to preconfigure the relevant items to be able to reference them from the CRD. In that case we can create a custom analyticsProfile that we will call GHOST-ANALYTICS-PROFILE navigating to Templates > Profiles > Analytics. Now define an external source to send our logs via syslog.

As usual, edit the custom ghost HostRule object and add the corresponding lines.

kubectl edit HostRule ghost

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    enableVirtualHost: true
    fqdn: ghost.avi.iberia.local
    tls:
      sslKeyCertificate:
        name: System-Default-Cert-EC
        type: ref
      sslProfile: CUSTOM_SSL_PROFILE
      termination: edge
    httpPolicy: 
      policySets:
      - MY_RATE_LIMIT
      overwrite: false
    datascripts:
    - Compute_Host_HMAC
    wafPolicy: GHOST-WAF-POLICY
    applicationProfile: GHOST-HTTPS-APP-PROFILE
    analyticsProfile: GHOST-ANALYTICS-PROFILE

Once done, the Virtual Service will populate the Analytic Profile setting as per our HostRule specification as shown below:

If you have access to the syslog server you can see how AVI is now streaming via Syslog the transactions. Notice the amount of metrics AVI analytics produces as seen below. You can use a BI tool of your choice to further processing and dashboarding with great granularity.

May 26 17:40:01 AVI-CONTROLLER S1-AZ1--ghost.avi.iberia.local[0]: {"adf":false,"significant":0,"udf":false,"virtualservice":"virtualservice-cd3bab2c-1091-4d31-956d-ef0aee80bbc6","report_timestamp":"2021-05-26T17:40:01.744394Z","service_engine":"s1-ako-se-kmafa","vcpu_id":0,"log_id":5082,"client_ip":"10.10.15.128","client_src_port":45254,"client_dest_port":443,"client_rtt":1,"ssl_version":"TLSv1.2","ssl_cipher":"ECDHE-ECDSA-AES256-GCM-SHA384","sni_hostname":"ghost.avi.iberia.local","http_version":"1.0","method":"HEAD","uri_path":"/","user_agent":"avi/1.0","host":"ghost.avi.iberia.local","etag":"W/\"5c4a-r+7DuBoSJ6ifz7nS1cluKPsY5VI\"","persistent_session_id":3472328296598305370,"response_content_type":"text/html; charset=utf-8","request_length":83,"cacheable":true,"http_security_policy_rule_name":"RATE_LIMIT_10CPS","http_request_policy_rule_name":"S1-AZ1--default-ghost.avi.iberia","pool":"pool-6bf69a45-7f07-4dce-8e4e-7081136b31bb","pool_name":"S1-AZ1--default-ghost.avi.iberia.local_-ghost","server_ip":"10.34.3.3","server_name":"10.34.3.3","server_conn_src_ip":"10.10.14.20","server_dest_port":2368,"server_src_port":38549,"server_rtt":2,"server_response_length":288,"server_response_code":200,"server_response_time_first_byte":52,"server_response_time_last_byte":52,"response_length":1331,"response_code":200,"response_time_first_byte":52,"response_time_last_byte":52,"compression_percentage":0,"compression":"NO_COMPRESSION_CAN_BE_COMPRESSED","client_insights":"","request_headers":65,"response_headers":2060,"request_state":"AVI_HTTP_REQUEST_STATE_SEND_RESPONSE_HEADER_TO_CLIENT","all_request_headers":{"User-Agent":"avi/1.0","Host":"ghost.avi.iberia.local","Accept":"*/*"},"all_response_headers":{"Content-Type":"text/html; charset=utf-8","Content-Length":23626,"Connection":"close","X-Powered-By":"Express","Cache-Control":"public, max-age=0","ETag":"W/\"5c4a-r+7DuBoSJ6ifz7nS1cluKPsY5VI\"","Vary":"Accept-Encoding","Date":"Wed, 26 May 2021 17:40:02 GMT","Strict-Transport-Security":"max-age=31536000; includeSubDomains"},"headers_sent_to_server":{"X-Forwarded-For":"10.10.15.128","Host":"ghost.avi.iberia.local","Connection":"keep-alive","User-Agent":"avi/1.0","Accept":"*/*","X-Forwarded-Proto":"https","SHA1-hash":"8d8bc49ef49ac3b70a059c85b928953690700a6a"},"headers_received_from_server":{"X-Powered-By":"Express","Cache-Control":"public, max-age=0","Content-Type":"text/html; charset=utf-8","Content-Length":23626,"ETag":"W/\"5c4a-r+7DuBoSJ6ifz7nS1cluKPsY5VI\"","Vary":"Accept-Encoding","Date":"Wed, 26 May 2021 17:40:02 GMT","Connection":"keep-alive","Keep-Alive":"timeout=5"},"server_connection_reused":true,"vs_ip":"10.10.15.162","waf_log":{"status":"PASSED","latency_request_header_phase":210,"latency_request_body_phase":477,"latency_response_header_phase":19,"latency_response_body_phase":0,"rules_configured":true,"psm_configured":true,"application_rules_configured":false,"allowlist_configured":false,"allowlist_processed":false,"rules_processed":true,"psm_processed":true,"application_rules_processed":false},"request_id":"Lv-scVJ-2Ivw","servers_tried":1,"jwt_log":{"is_jwt_verified":false}}

Use your favourite log collector tool to extract the different fields contained in the syslog file and you can get nice graphics very easily. As an example, using vRealize Log Insights you can see the syslog events sent by AVI via syslog over the time.

This other example shows the average backend server RTT over time grouped by server IP (i.e POD).

Or even this one that shows the percentage of requests accross the pods.

errorPageProfile

The last configurable parameter so far is the errorPageProfile which can be use to produce custom error page responses adding relevant information that might be used to trace issues or simply to provide cool error page to your end users. As with previous settings, the first step is to preconfigure the custom error Page Profile using the GUI. Navitate to Templates > Error Page and create a new profile that we will call GHOST-ERROR-PAGE.

We will create a custom page to warn the users they are trying to access the website using a forbidden method. When this happens a 403 Code is generated and a customized web page can be returned to the user. I have used one cool page that displays the forbidden city. The HTML code is available here.

Once the Error Page profile has been created now is time to reference to customize our application using the HostRule CRD as shown below

kubectl edit HostRule ghost

apiVersion: ako.vmware.com/v1alpha1
kind: HostRule
metadata:
  name: ghost
  namespace: default
spec:
  virtualhost:
    enableVirtualHost: true
    fqdn: ghost.avi.iberia.local
    tls:
      sslKeyCertificate:
        name: System-Default-Cert-EC
        type: ref
      sslProfile: CUSTOM_SSL_PROFILE
      termination: edge
    httpPolicy: 
      policySets:
      - MY_RATE_LIMIT
      overwrite: false
    datascripts:
    - Compute_Host_HMAC
    wafPolicy: GHOST-WAF-POLICY
    applicationProfile: GHOST-HTTPS-APP-PROFILE
    analyticsProfile: GHOST-ANALYTICS-PROFILE
    errorPageProfile: GHOST-ERROR-PAGE

Verify the new configuration has been successfuly applied to our Virtual Service.

Now repeat the XSS attack attempt as shown in the wafPolicy section above and you can see how a beatufil custom message appears instead of the bored static 403 Forbidden shown before.

This finish this article. The next one will cover the customization of the backend/pool associated with our application and how you can also influence in the LoadBalancing algorithm, persistence, reencryption and other funny stuff.

« Older posts

© 2024 SDefinITive

Theme by Anders NorenUp ↑