Who does not love to watch a nice dashboard full of colors? Observing patterns and real-time metrics in a time series might make us sit in front of a screen as if hypnotized for hours. But apart from the inherent beauty of dashboards, they provide observability, which is a crucial feature for understanding the performance of our applications and also a very good tool for predicting future behavior and fixing existing problems.

There is a big ecosystem out there with plenty of tools to create a logging pipeline that collects, parses, processes, enriches, indexes, analyzes and visualizes logs. In this post we will focus on a combination that is gaining popularity for log analysis, based on Fluent Bit, Loki and Grafana as shown below. We will also use Prometheus for metric collection.

Opensource Observability Stack

Let’s build the different blocks starting by the visualization tool.

Installing Grafana as the visualization platform

Grafana is open-source software (licensed under AGPLv3) which allows us to visualize data collected from various sources such as Prometheus, InfluxDB or Telegraf, tools that collect data from our infrastructure, such as CPU usage, memory, or network traffic of a virtual machine, a Kubernetes cluster, or each of its containers.

The real power of Grafana lies in the flexibility to create as many dashboards as we need, with very smart visualization graphs where we can format this data and represent it as we want. We will use Grafana as the main tool for adding observability capabilities to Antrea, which is the purpose of this series of posts.

To carry out the installation of Grafana we will rely on the official helm charts. The first step, therefore, is to add the grafana repository so that helm can access it.

helm repo add grafana https://grafana.github.io/helm-charts

Once the repository has been added we can browse it. We will use the latest available release of the chart to install version 9.2.4 of Grafana.

helm search repo grafana
NAME                                    CHART VERSION   APP VERSION             DESCRIPTION                                       
grafana/grafana                         6.43.5          9.2.4                   The leading tool for querying and visualizing t...
grafana/grafana-agent-operator          0.2.8           0.28.0                  A Helm chart for Grafana Agent Operator           
grafana/enterprise-logs                 2.4.2           v1.5.2                  Grafana Enterprise Logs                           
grafana/enterprise-logs-simple          1.2.1           v1.4.0                  DEPRECATED Grafana Enterprise Logs (Simple Scal...
grafana/enterprise-metrics              1.9.0           v1.7.0                  DEPRECATED Grafana Enterprise Metrics             
grafana/fluent-bit                      2.3.2           v2.1.0                  Uses fluent-bit Loki go plugin for gathering lo...
grafana/loki                            3.3.2           2.6.1                   Helm chart for Grafana Loki in simple, scalable...
grafana/loki-canary                     0.10.0          2.6.1                   Helm chart for Grafana Loki Canary                
grafana/loki-distributed                0.65.0          2.6.1                   Helm chart for Grafana Loki in microservices mode 
grafana/loki-simple-scalable            1.8.11          2.6.1                   Helm chart for Grafana Loki in simple, scalable...
grafana/loki-stack                      2.8.4           v2.6.1                  Loki: like Prometheus, but for logs.              
grafana/mimir-distributed               3.2.0           2.4.0                   Grafana Mimir                                     
grafana/mimir-openshift-experimental    2.1.0           2.0.0                   Grafana Mimir on OpenShift Experiment             
grafana/oncall                          1.0.11          v1.0.51                 Developer-friendly incident response with brill...
grafana/phlare                          0.1.0           0.1.0                   🔥 horizontally-scalable, highly-available, mul...
grafana/promtail                        6.6.1           2.6.1                   Promtail is an agent which ships the contents o...
grafana/rollout-operator                0.1.2           v0.1.1                  Grafana rollout-operator                          
grafana/synthetic-monitoring-agent      0.1.0           v0.9.3-0-gcd7aadd       Grafana's Synthetic Monitoring application. The...
grafana/tempo                           0.16.3          1.5.0                   Grafana Tempo Single Binary Mode                  
grafana/tempo-distributed               0.27.5          1.5.0                   Grafana Tempo in MicroService mode                
grafana/tempo-vulture                   0.2.1           1.3.0                   Grafana Tempo Vulture - A tool to monitor Tempo...

Any helm chart includes configuration options to customize the setup by passing a configuration file that helm will use when deploying our release. We can check the documentation to understand what all these possible helm chart values really mean and how they affect the final setup. Sometimes it is useful to get a file with all the default configuration values and personalize it as required. To get the default values associated with a helm chart just use the following command.

helm show values grafana/grafana > default_values.yaml

Based on the default_values.yaml we will create a customized and reduced version and save it in a new values.yaml file with some modified values for our custom configuration. You can find the full values.yaml here. The first section enables data persistence by creating a PVC that will use the vsphere-sc storageClass we created in this previous post to leverage vSphere Container Native Storage capabilities to provision persistent volumes. Adjust the storageClassName as per your setup.

vi values.yaml
# Enable Data Persistence
persistence:
  type: pvc
  enabled: true
  storageClassName: vsphere-sc
  accessModes:
    - ReadWriteOnce
  size: 10Gi

The second section enables the creation of sidecar containers that allow us to import Grafana configurations such as datasources or dashboards through configmaps. This is very useful to deploy Grafana fully configured in an automated way, without user intervention through the graphical interface. With these settings applied, any configmap in the grafana namespace labeled with grafana_dashboard=1 will trigger the import of a dashboard. Similarly, any configmap labeled with grafana_datasource=1 will trigger the import of a Grafana datasource (see the example right after the snippet below).

vi values.yaml (Sidecars Section)
# SideCars Section
# Enable Sidecar containers creation for dashboards and datasource import via configmaps
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard
    labelValue: "1"
  datasources:
    enabled: true
    label: grafana_datasource
    labelValue: "1"
    

The last section defines how to expose the Grafana graphical interface externally. We will use a Kubernetes service of type LoadBalancer for this purpose. In my case I will use AVI as the ingress solution for the cluster, so the load balancer will be created in the service engine. Feel free to use any other external LoadBalancer solution if you want.

vi values.yaml (service expose section)
# Define how to expose the service
service:
  enabled: true
  type: LoadBalancer
  port: 80
  targetPort: 3000
  portName: service

The following command creates the namespace grafana and installs the grafana/grafana chart as a release named grafana in that namespace, taking the values.yaml configuration file as the input. After a successful deployment, the installation gives you some hints for accessing the application, e.g. how to get the credentials, which are stored in a Kubernetes secret object. Ignore any warning about PSP you might get.

helm install grafana grafana/grafana --create-namespace -n grafana -f values.yaml
Release "grafana" has been installed. Happy Helming!
NAME: grafana
LAST DEPLOYED: Mon Dec 26 18:42:05 2022
NAMESPACE: grafana
STATUS: deployed
REVISION: 2
NOTES:
1. Get your 'admin' user password by running:

   kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:

   grafana.grafana.svc.cluster.local

   Get the Grafana URL to visit by running these commands in the same shell:
     export POD_NAME=$(kubectl get pods --namespace grafana -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
     kubectl --namespace grafana port-forward $POD_NAME 3000

3. Login with the password from step 1 and the username: admin

As explained in the notes after helm installation, the first step is to get the plaintext password that will be used to authenticate the default admin username in the Grafana UI.

kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
wFCT81uGC7ij5Sv1rTIuf2CwQa5Y9xkGQSixDKOx

Verifying Grafana Installation

Before moving to the Grafana UI let's explore the created kubernetes resources and their status.

kubectl get all -n grafana
NAME                           READY   STATUS    RESTARTS   AGE
pod/grafana-7d95c6cf8c-pg5dw   3/3     Running   0          24m

NAME              TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
service/grafana   LoadBalancer   10.100.220.164   10.113.3.106   80:32643/TCP   24m

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana   1/1     1            1           24m

NAME                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/grafana-7d95c6cf8c   1         1         1       24m

The chart has created a deployment with a single grafana pod running three containers (the Grafana server plus the dashboard and datasource sidecars), all in Running status. Note how the LoadBalancer service has already allocated the external IP address 10.113.3.106 to provide outside reachability. As mentioned earlier, if you have a LoadBalancer solution such as AVI with its AKO operator deployed in your setup, you will see that a new Virtual Service has been created and is ready to use as shown below:

Now you can open your browser and type the IP address. AVI also registers in its internal DNS the new LoadBalancer objects that the developer creates in kubernetes. In this specific setup an automatic FQDN is created and Grafana should be available from your browser at http://grafana.grafana.avi.sdefinitive.net. As specified in the LoadBalancer section of the values.yaml at deployment time, the Grafana GUI will be exposed on port 80. For security purposes it is strongly recommended to use a secure Ingress object instead if you are planning to deploy in production.

Grafana GUI Welcome Page

With this we have finished the installation of the Grafana visualization tool. Let's move on now to install another important piece of the observability stack, in charge of retrieving metrics: Prometheus.

Prometheus for metric collection

Prometheus was created to monitor highly dynamic environments, and over the past years it has become the mainstream monitoring tool of choice in the container and microservices world. Modern devops is becoming more and more complex to handle manually, so there is a need for automation. Imagine a complex infrastructure with loads of servers distributed over many locations and no insight into what is happening in terms of errors, latency, usage and so on. In a modern architecture there are more things that can go wrong when you have tons of dynamic and ephemeral services and applications, and any of them can crash and cause the failure of other services. This is why it is crucial to avoid manual intervention and allow the administrator to quickly identify and fix any potential problem or degradation of the system.

The Prometheus architecture is represented in the following picture, taken from the official prometheus.io website.

Prometheus architecture and its ecosystem

The core components of the Prometheus server are listed below:

  • Data Retrieval Worker.- responsible for fetching time series data from a particular data source, such as a web server or a database, and converting it into the Prometheus metric format (a short example follows this list).
  • Time Series Database (TSDB).- used to store, manage, and query time-series data.
  • HTTP Server.- responsible for exposing the Prometheus metrics endpoint, which provides access to the collected metrics data for monitoring and alerting purposes.
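
To give an idea of what that metric format looks like, below is an illustrative snippet in the Prometheus text exposition format, using one of the Antrea metrics we will query later in this post (the HELP text and the value are made up for the example):

# HELP antrea_agent_local_pod_count Number of Pods on the local Node managed by the Antrea Agent
# TYPE antrea_agent_local_pod_count gauge
antrea_agent_local_pod_count 12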

The first step is to get the Prometheus server up and running.

Installing Prometheus

Prometheus requires access to Kubernetes API resources for service discovery, access to the Antrea Metrics Listener, and some configuration to instruct the Data Retrieval Worker to scrape the required metrics in both the Agent and Controller components. There are some manifests on the Antrea website ready to use that save some time with all the scraping and job configuration of Prometheus. Let's apply the provided manifest as a first step.

kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-prometheus.yml
namespace/monitoring created
serviceaccount/prometheus created
secret/prometheus-service-account-token created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
configmap/prometheus-server-conf created
deployment.apps/prometheus-deployment created
service/prometheus-service created

As you can see in the output, the manifest includes all the required kubernetes objects including permissions, configuration and, lastly, the Prometheus server itself. The manifest deploys all the resources in a dedicated monitoring namespace.

kubectl get all -n monitoring
NAME                                         READY   STATUS    RESTARTS   AGE
pod/prometheus-deployment-57d7b4c6bc-jx28z   1/1     Running   0          42s

NAME                         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/prometheus-service   NodePort   10.101.177.47   <none>        8080:30000/TCP   42s

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-deployment   1/1     1            1           42s

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-deployment-57d7b4c6bc   1         1         1       42s

Feel free to explore all the created resources. As you can tell, the new service is running as a NodePort type, so you should be able to reach the Prometheus server using any of your worker IP addresses on the static port 30000. Alternatively, you can always use the port-forward method to redirect a local port to the service listening at 8080, as shown below. Open a browser to verify you can access the HTTP Server component of Prometheus.
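
For reference, a quick port-forward against the prometheus-service created by the manifest would look like this (then browse to http://localhost:8080):

kubectl port-forward -n monitoring svc/prometheus-service 8080:8080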

The Prometheus configuration will retrieve not only Antrea-related metrics but also some built-in kubernetes metrics. Just for fun, type "api" in the search box and you will see dozens of metrics available.

Prometheus Server

Now that we are sure the Prometheus Server is running and able to scrape metrics successfully, let's move into our area of interest, which is the Antrea CNI.

Enabling Prometheus Metrics in Antrea

The first step is to configure Antrea to generate Prometheus metrics. As explained in the previous post here, we are using Helm to install Antrea, so the best way to change the configuration of the Antrea setup is to edit the values.yaml file and redeploy the helm chart. As you can see, we are also enabling the FlowExporter feature gate. This is a mandatory setting to allow conntrack-flow related metrics to get updated. Edit the values.yaml

vi values.yaml
# -- Container image to use for Antrea components.
image:
  tag: "v1.10.0"

enablePrometheusMetrics: true

featureGates:
   FlowExporter: true
   AntreaProxy: true
   TraceFlow: true
   NodePortLocal: true
   Egress: true
   AntreaPolicy: true

Deploy a new release of the antrea helm chart taking the values.yaml file as an input. Since the chart is already deployed, we now need to use the upgrade keyword instead of install as we did the first time.

helm upgrade -f values.yaml antrea antrea/antrea -n kube-system
Release "antrea" has been upgraded. Happy Helming!
NAME: antrea
LAST DEPLOYED: Wed Jan 18 13:53:56 2023
NAMESPACE: kube-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
The Antrea CNI has been successfully installed

You are using version 1.10.0

For the Antrea documentation, please visit https://antrea.io

Now, with the antctl command, verify that the feature gates have been enabled as expected.

antctl get featuregates
Antrea Agent Feature Gates
FEATUREGATE              STATUS         VERSION   
Traceflow                Enabled        BETA      
AntreaIPAM               Disabled       ALPHA     
Multicast                Disabled       ALPHA     
AntreaProxy              Enabled        BETA      
Egress                   Enabled        BETA      
EndpointSlice            Disabled       ALPHA     
ServiceExternalIP        Disabled       ALPHA     
AntreaPolicy             Enabled        BETA      
Multicluster             Disabled       ALPHA     
FlowExporter             Enabled        ALPHA     
NetworkPolicyStats       Enabled        BETA      
NodePortLocal            Enabled        BETA      

Antrea Controller Feature Gates
FEATUREGATE              STATUS         VERSION   
AntreaPolicy             Enabled        BETA      
NetworkPolicyStats       Enabled        BETA      
NodeIPAM                 Disabled       ALPHA     
Multicluster             Disabled       ALPHA     
Egress                   Enabled        BETA      
Traceflow                Enabled        BETA      
ServiceExternalIP        Disabled       ALPHA     
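
With the feature gates in place, Prometheus should be scraping both the Antrea Agent and Controller endpoints. As a quick sanity check, and assuming the NodePort exposure on port 30000 shown earlier (replace the placeholder with the IP address of any of your worker nodes), you can query the Prometheus targets API and verify that the antrea scrape jobs are reported as healthy:

curl -s http://<worker-node-ip>:30000/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'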

As seen before, Prometheus has some built-in capabilities to browse and visualize metrics; however, we want these metrics to be consumed from the much more powerful Grafana that we installed earlier. Let's access the Grafana console and click on the gear icon to add the new Prometheus datasource.

Grafana has been deployed in the same cluster, in a different namespace. The Prometheus URL follows the pattern <service>.<namespace>.svc:<port>. In our case the URL is http://prometheus-service.monitoring.svc:8080. If you are accessing Prometheus from a different cluster, ensure you use the FQDN by adding the corresponding domain after the svc part (by default cluster.local).
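
Before saving, a quick way to confirm that URL is reachable from inside the cluster is to run a throwaway pod against the Prometheus health endpoint. The curlimages/curl image used here is just an example; any image with curl available will do.

kubectl run curl-test -n grafana --rm -it --restart=Never --image=curlimages/curl --command -- curl -s http://prometheus-service.monitoring.svc:8080/-/healthy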

Click on the Save & Test blue button at the bottom of the screen and you should see a message indicating the Prometheus server is reachable and working as expected.

Now click on the compass button to verify that Antrea metrics are being populated and are reachable from Grafana for visualization. Select the newly added Prometheus as the datasource at the top.

Pick any of the available antrea agent metrics (I am using antrea_agent_local_pod_count as an example) and you should see the visualization of the gathered metric values in the graph below.
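
If you prefer to type queries by hand, a couple of simple PromQL expressions over that same metric would be, for example (purely illustrative aggregations):

antrea_agent_local_pod_count
sum(antrea_agent_local_pod_count)
sum by (instance) (antrea_agent_local_pod_count)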

That means the Prometheus datasource is working and Antrea is populating metrics successfully. Let's move now into the next piece: Loki.

Installing Loki as the log aggregation platform

In recent years the area of log management has been clearly dominated by the Elastic stack, which has become the de-facto standard, whereas Grafana has maintained a strong position in terms of visualization of metrics from multiple data sources, among which Prometheus stands out.

Lately a very popular alternative for log management is Grafana Loki. The Grafana Labs website describes Loki as a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream.

For that reason we will use Loki as the solution for aggregating the logs we get from kubernetes pods. We will focus on Antrea-related pods, but it can be used with a wider scope for the rest of the applications running in your cluster.

In the same way that we did for the Grafana installation, we will use the official helm chart of the product to proceed with the installation. This time it is not necessary to add a new helm repository because the Loki chart is already included in the grafana repo. As we did with the Grafana helm chart, the first step will be to obtain the configuration file associated with the helm chart, which we will use to customize our installation.

helm show values grafana/loki > default_values.yaml

Using this file as a reference, create a reduced and customized values.yaml file with some modified configuration. As a reminder, any setting not explicitly mentioned in the reduced values.yaml file will take the default values. Find the values.yaml file I am using here.

For a production solution it is highly recommended to install Loki using the scalable architecture. The scalable architecture requires a managed object store such as AWS S3 or Google Cloud Storage but, if you are planning to use it on-premises, a very good choice is a self-hosted store solution such as the popular MinIO. There is a previous post explaining how to deploy a MinIO-based S3-like storage platform on top of vSAN here. If you go with the MinIO Operator, you need to perform the following tasks as a prerequisite before installing Loki.

The following script will

  • Create a new MinIO tenant. We will use a new tenant called logs in the namespace logs.
  • Obtain the S3 endpoint that will be used to interact with your S3 storage via API. I am using here the internal ClusterIP endpoint, but feel free to use the external FQDN if you want to expose it externally. In that case you would use the built-in kubernetes naming convention for pods and services as explained here, and it would be something like minio.logs.svc.cluster.local.
  • Obtain the AccessKey and SecretAccessKey. By default a set of credentials is generated upon tenant creation. You can extract them from the corresponding secrets easily or just create a new set of credentials using the MinIO Tenant Console GUI.
  • Create the required buckets. You can use console or mc tool as well.

Assuming you are using the MinIO operator, the following script will create the required tenant and buckets. Copy and paste the contents in the console or create a .sh file and execute it using bash. Adjust the tenant settings in terms of servers, drives and capacity to match your environment. All the tenant objects will be placed in the namespace logs.

vi create-loki-minio-tenant.sh
#!/bin/bash

TENANT=logs
BUCKET1=chunks
BUCKET2=ruler
BUCKET3=admin

# CREATE NAMESPACE AND TENANT 6 x 4 drives for raw 50 G, use Storage Class SNA
# -------------------------------------------------------------------------------
kubectl create ns $TENANT
kubectl minio tenant create $TENANT --servers 6 --volumes 24 --capacity 50G --namespace $TENANT --storage-class vsphere-sna --expose-minio-service --expose-console-service

# EXTRACT CREDENTIALS FROM CURRENT TENANT AND CREATE SECRET 
# ---------------------------------------------------------
echo "MINIO_S3_ENDPOINT=https://minio.${TENANT}.svc.cluster.local" > s3vars.env
echo "MINIO_S3_BUCKET1=${BUCKET1}" >> s3vars.env
echo "MINIO_S3_BUCKET2=${BUCKET2}" >> s3vars.env
echo "MINIO_S3_BUCKET3=${BUCKET3}" >> s3vars.env
echo "SECRET_KEY=$(kubectl get secrets -n ${TENANT} ${TENANT}-user-1 -o jsonpath="{.data.CONSOLE_SECRET_KEY}" | base64 -d)" >> s3vars.env
echo "ACCESS_KEY=$(kubectl get secrets -n ${TENANT} ${TENANT}-user-1 -o jsonpath="{.data.CONSOLE_ACCESS_KEY}" | base64 -d)" >> s3vars.env

kubectl create secret generic -n $TENANT loki-s3-credentials --from-env-file=s3vars.env

Once the tenant is created we can proceed with bucket creation. You can do it manually via the console or the mc client, or use the following yaml file, which defines a job that will create the required buckets as shown here. Basically it waits until the tenant is initialized and then creates the three required buckets using the variables injected from the loki-s3-credentials secret.

vi create-loki-minio-buckets-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: create-loki-minio-buckets
  namespace: logs
spec:
  template:
    spec:
      containers:
      - name: mc
       # loki-s3-credentials contains $ACCESS_KEY, $SECRET_KEY, $MINIO_S3_ENDPOINT, $MINIO_S3_BUCKET1-3
        envFrom:
        - secretRef:
            name: loki-s3-credentials
        image: minio/mc
        command: 
          - sh
          - -c
          - ls /tmp/error > /dev/null 2>&1 ; until [[ "$?" == "0" ]]; do sleep 5; echo "Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs"; mc alias set s3 $(MINIO_S3_ENDPOINT) $(ACCESS_KEY) $(SECRET_KEY) --insecure; done && mc mb s3/$(MINIO_S3_BUCKET1) --insecure; mc mb s3/$(MINIO_S3_BUCKET2) --insecure; mc mb s3/$(MINIO_S3_BUCKET3) --insecure
      restartPolicy: Never
  backoffLimit: 4

Verify the job execution is completed by displaying the logs created by the pod in charge of completing the defined job.

kubectl logs -n logs job/create-loki-minio-buckets
mc: <ERROR> Unable to initialize new alias from the provided credentials. The Access Key Id you provided does not exist in our records.          
Attempt to connect with MinIO failed. Attempt to reconnect in 5 secs                                                                             
Added `s3` successfully.                                                                                                                         
Bucket created successfully `s3/chunks`.
Bucket created successfully `s3/ruler`.
Bucket created successfully `s3/admin`.

Now the S3 storage requirement is fulfilled. Let's move to the values.yaml file that will be used as the configuration source for our Loki deployment. The first section provides some general configuration options including the data required to access the shared S3 store. Replace the s3 attributes with your particular settings.

vi values.yaml
loki:
  auth_enabled: false
  storage_config:
    boltdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/index_cache
      resync_interval: 5s
      shared_store: s3
  compactor:
    working_directory: /var/loki/compactor
    shared_store: s3
    compaction_interval: 5m
  storage:
    bucketNames:
      chunks: chunks
      ruler: ruler
      admin: admin
    type: s3
    s3:
      s3: 
      endpoint: https://minio.logs.svc.cluster.local:443
      region: null
      secretAccessKey: YDLEu99wPXmAAFyQcMzDwDNDwzF32GnS8HhHBuoD
      accessKeyId: ZKYLET51JWZ8LXYYJ0XP
      s3ForcePathStyle: true
      insecure: true
      
  # 
  querier:
    max_concurrent: 4096
  #
  query_scheduler:
    max_outstanding_requests_per_tenant: 4096

# Configuration for the write
# <continue below...>

Note. If you used the instructions above to create the MinIO tenant you can extract the S3 information from the plaintext s3vars.env file. You can also extract it from the secret logs-user-1 created by the operator, or from the loki-s3-credentials secret created by the script (see the commands after the output below). Remember to delete the s3vars.env file after usage as it may reveal sensitive information.

cat s3vars.env
MINIO_S3_ENDPOINT=https://minio.logs.svc.cluster.local
MINIO_S3_BUCKET1=chunks
MINIO_S3_BUCKET2=ruler
MINIO_S3_BUCKET3=admin
SECRET_KEY=YDLEu99wPXmAAFyQcMzDwDNDwzF32GnS8HhHBuoD
ACCESS_KEY=ZKYLET51JWZ8LXYYJ0XP
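
Since the tenant-creation script above also stored those values in the loki-s3-credentials secret, you can recover them at any time (for example, after deleting s3vars.env) with something like:

kubectl get secret -n logs loki-s3-credentials -o jsonpath='{.data.ACCESS_KEY}' | base64 -d ; echo
kubectl get secret -n logs loki-s3-credentials -o jsonpath='{.data.SECRET_KEY}' | base64 -d ; echo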

When object storage is configured, the helm chart configures Loki to deploy the read and write targets in a high-availability fashion, running 3 replicas of each independent process. It will use a storageClass able to provide persistent volumes to avoid losing data in case of a failure of the application. Again, I am using here a storage class called vsphere-sc that is backed by vSAN and accessed through a CSI driver. If you want to learn how to provide data persistence using vSphere and vSAN check a previous post here.

vi values.yaml (Storage and General Section)

# Configuration for the write
write:
  persistence:
    # -- Size of persistent disk
    size: 10Gi
    storageClass: vsphere-sc
# Configuration for the read node(s)
read:
  persistence:
    # -- Size of persistent disk
    size: 10Gi
    storageClass: vsphere-sc
    # -- Selector for persistent disk

# Configuration for the Gateway
# <continue below..>

Additionally, the chart installs the gateway component, which is an NGINX that exposes Loki's API and automatically proxies requests to the correct Loki components (read or write in our scalable setup). If you want to reach Loki from the outside (e.g. other clusters) you must expose it using any of the kubernetes methods to gain external reachability. In this example I am using a LoadBalancer, but feel free to explore further options in the default_values.yaml such as a secure Ingress. Remember that when the gateway is enabled, the visualization tool (Grafana) as well as the log shipping agents (Fluent Bit) should be configured to use the gateway as the endpoint.

vi values.yaml (Gateway Section)

# Configuration for the Gateway
gateway:
  # -- Specifies whether the gateway should be enabled
  enabled: true
  # -- Number of replicas for the gateway
  service:
    # -- Port of the gateway service
    port: 80
    # -- Type of the gateway service
    type: LoadBalancer
  # Basic auth configuration
  basicAuth:
    # -- Enables basic authentication for the gateway
    enabled: false
    # -- The basic auth username for the gateway

The default chart also installs other complementary Loki components called canary and backend. The Loki canary component is fully described here. Basically it is used to audit the log-capturing performance of Loki by generating artificial log lines.

Once the values.yaml file is completed we can proceed with the installation of the helm chart using the following command. I am installing Loki in a namespace named loki.

helm install loki grafana/loki --create-namespace -n loki -f values.yaml
Release "loki" has been installed. Happy Helming!
NAME: loki
LAST DEPLOYED: Wed Feb 14 12:18:46 2023
NAMESPACE: loki
STATUS: deployed
REVISION: 1
NOTES:
***********************************************************************
 Welcome to Grafana Loki
 Chart version: 4.4.1
 Loki version: 2.7.2
***********************************************************************

Installed components:
* grafana-agent-operator
* gateway
* read
* write
* backend

Now that we are done with the Loki installation in a scalable and distributed architecture backed by MinIO S3 storage, let's do some verifications to check that everything is running as expected.

Verifying Loki Installation

As a first step, explore the kubernetes objects that the loki chart has created.

kubectl get all -n loki
NAME                                               READY   STATUS    RESTARTS   AGE
pod/loki-backend-0                                 1/1     Running   0          2m13s
pod/loki-backend-1                                 1/1     Running   0          2m49s
pod/loki-backend-2                                 1/1     Running   0          3m37s
pod/loki-canary-4xkfw                              1/1     Running   0          5h42m
pod/loki-canary-crxwt                              1/1     Running   0          5h42m
pod/loki-canary-mq79f                              1/1     Running   0          5h42m
pod/loki-canary-r76pz                              1/1     Running   0          5h42m
pod/loki-canary-rclhj                              1/1     Running   0          5h42m
pod/loki-canary-t55zt                              1/1     Running   0          5h42m
pod/loki-gateway-574476d678-vkqc7                  1/1     Running   0          5h42m
pod/loki-grafana-agent-operator-5555fc45d8-rcs59   1/1     Running   0          5h42m
pod/loki-logs-25hvr                                2/2     Running   0          5h42m
pod/loki-logs-6rnmt                                2/2     Running   0          5h42m
pod/loki-logs-72c2w                                2/2     Running   0          5h42m
pod/loki-logs-dcwkb                                2/2     Running   0          5h42m
pod/loki-logs-j6plp                                2/2     Running   0          5h42m
pod/loki-logs-vgqqb                                2/2     Running   0          5h42m
pod/loki-read-598f8c5cd5-dqtqt                     1/1     Running   0          2m59s
pod/loki-read-598f8c5cd5-fv6jq                     1/1     Running   0          2m18s
pod/loki-read-598f8c5cd5-khmzw                     1/1     Running   0          3m39s
pod/loki-write-0                                   1/1     Running   0          93s
pod/loki-write-1                                   1/1     Running   0          2m28s
pod/loki-write-2                                   1/1     Running   0          3m33s

NAME                            TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)             AGE
service/loki-backend            ClusterIP      10.98.167.78     <none>         3100/TCP,9095/TCP   5h42m
service/loki-backend-headless   ClusterIP      None             <none>         3100/TCP,9095/TCP   5h42m
service/loki-canary             ClusterIP      10.97.139.4      <none>         3500/TCP            5h42m
service/loki-gateway            LoadBalancer   10.111.235.35    10.113.3.104   80:30251/TCP        5h42m
service/loki-memberlist         ClusterIP      None             <none>         7946/TCP            5h42m
service/loki-read               ClusterIP      10.99.220.81     <none>         3100/TCP,9095/TCP   5h42m
service/loki-read-headless      ClusterIP      None             <none>         3100/TCP,9095/TCP   5h42m
service/loki-write              ClusterIP      10.102.132.138   <none>         3100/TCP,9095/TCP   5h42m
service/loki-write-headless     ClusterIP      None             <none>         3100/TCP,9095/TCP   5h42m

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/loki-canary   6         6         6       6            6           <none>          5h42m
daemonset.apps/loki-logs     6         6         6       6            6           <none>          5h42m

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/loki-gateway                  1/1     1            1           5h42m
deployment.apps/loki-grafana-agent-operator   1/1     1            1           5h42m
deployment.apps/loki-read                     3/3     3            3           5h42m

NAME                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/loki-gateway-574476d678                  1         1         1       5h42m
replicaset.apps/loki-grafana-agent-operator-5555fc45d8   1         1         1       5h42m
replicaset.apps/loki-read-598f8c5cd5                     3         3         3       3m40s
replicaset.apps/loki-read-669c9d7689                     0         0         0       5h42m
replicaset.apps/loki-read-6c7586fdc7                     0         0         0       11m

NAME                            READY   AGE
statefulset.apps/loki-backend   3/3     5h42m
statefulset.apps/loki-write     3/3     5h42m

A simple test you can do to verify the gateway status is "curling" the API endpoint exposed on the allocated external IP. The OK response indicates the service is up and ready to receive requests.

curl http://10.113.3.104/ ; echo
OK
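
Beyond the readiness check, you can also query the Loki HTTP API through the gateway. For example, listing the currently known label names should return a small JSON document that will grow as soon as logs start being ingested:

curl -s http://10.113.3.104/loki/api/v1/labels | jq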

Now access the Grafana console and try to add Loki as a datasource. Click on the gear icon at the bottom of the left bar and then click on the Add data source blue box.

Adding Loki as Datasource in Grafana (1)

Click on Loki to add the required datasource type.

Adding Loki as Datasource in Grafana (2)

In this setup Grafana and Loki have been deployed in the same cluster, so we can use as the URL the internal FQDN corresponding to the loki-gateway ClusterIP service. In case you are accessing from the outside, you need to change that to the external URL (e.g. http://loki-gateway.loki.avi.sdefinitive.net in my case).

Adding Loki as Datasource in Grafana (3)

Click on “Save & test” button and if the attempt of Grafana to reach Loki API endpoint is successful, it should display the green tick as shown below.

Another interesting verification is to check whether the MinIO S3 bucket is getting the logs as expected. Open the MinIO web console and access the chunks bucket, which is the target where Loki writes the logs it receives. You should see how the bucket is receiving new objects and how its size is increasing.

You may wonder who is sending this data to Loki at this point, since we have not set up any log shipper yet. The reason is that, by default, when you deploy Loki using the official chart, a sub-chart with the Grafana Agent is also installed to enable self-monitoring. The self-monitoring settings determine whether Loki should scrape its own logs, and they create custom resources to define how to do it. If you are curious about it, explore (i.e. kubectl get) the GrafanaAgent, LogsInstance, and PodLogs CRD objects created in the Loki namespace to figure out how this is actually pushing self-monitoring logs into Loki, as shown right after this paragraph.
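
As a quick sketch, the following command lists those self-monitoring objects; the exact resource names may vary slightly depending on the chart version.

kubectl get grafanaagents,logsinstances,podlogs -n loki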

To verify what data is being pushed into the MinIO S3 bucket, you can explore the Loki datasource through Grafana. Return to the Grafana GUI and try to show logs related to a Loki component such as the loki-gateway pod. Click on the compass icon at the left to explore the added datasource. Now filter using job as the label key and select loki/loki-gateway as the label value, as shown below. Then click on the Run query blue button at the top right corner to see what happens.
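
For reference, the LogQL query that the builder generates in this case is just a label selector like the following:

{job="loki/loki-gateway"}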

Displaying Loki logs at Grafana (1)

Et voila! If everything is ok you should see how logs are successfully being shipped to Loki by its internal self monitoring Grafana agent.

Displaying Loki logs at Grafana (2)

Now that our log aggregator seems to be ready let’s move into the log shipper section.

Installing Fluent-Bit

Fluent Bit is an open-source log shipper designed for highly distributed environments that demand high performance while keeping a very light footprint.

The main task of Fluent Bit in this setup is to watch for changes in any interesting log file and send any update in that file to Loki in the form of a new entry. We will focus on Antrea-related logs only so far, but you can extend the input of Fluent Bit to a wider scope in order to track other logs of your OS.

Again we will rely on helm to proceed with the installation. This time we need to add a new repository maintained by fluent.

helm repo add fluent https://fluent.github.io/helm-charts

Explore the repo to see what it contains.

helm search repo fluent
NAME                    CHART VERSION   APP VERSION     DESCRIPTION                                       
fluent/fluent-bit       0.22.0          2.0.8           Fast and lightweight log processor and forwarde...
fluent/fluentd          0.3.9           v1.14.6         A Helm chart for Kubernetes                       

As we did before, create a reference yaml file with default configuration values of the chart.

helm show values fluent/fluent-bit > default_values.yaml

Using default_values.yaml as a template, create a new values.yaml file that will contain the desired configuration. The main piece of the values.yaml file resides in the config section. You can customize how the logs will be treated in a sequential fashion, creating a data pipeline scheme as depicted here.

FluentBit DataPipeline

The full documentation is maintained in the Fluent-Bit website here, but in a nutshell the main subsections we will use to achieve our goal are:

  • SERVICE.- The service section defines global properties of the service, including additional parsers to adapt the data found in the logs.
  • INPUT.- The input section defines a source that is associated with an input plugin. Depending on the selected input plugin you will have extra configuration keys. In this case we are using the tail input plugin, which captures any new line of the watched files (Antrea logs in this case). This section is also used to tag the captured data for classification purposes in later stages.
  • PARSER.- This section is used to format or parse any information present in records, such as extracting fields according to the position of the information in the log record.
  • FILTER.- The filter section defines a filter that is associated with a filter plugin. In this case we will use the kubernetes filter to be able to enrich our log files with Kubernetes metadata.
  • OUTPUT.- The output section specifies a destination that certain records should follow after a tag match. We will use Loki as the target here.

We will use the following values.yaml file. A more complex values file including parsing and regex can be found in a specific section of the next post here.

vi values.yaml
env: 
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush {{ .Values.flush }}
        Log_Level {{ .Values.logLevel }}
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.metricsPort }}
        Health_Check On

  ## https://docs.fluentbit.io/manual/pipeline/inputs
  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/antrea*.log
        multiline.parser docker, cri
        Tag antrea.*
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
        
  ## https://docs.fluentbit.io/manual/pipeline/filters
  ## First filter Uses a kubernetes filter plugin. Match antrea tag. Use K8s Parser
  ## Second filter enriches log entries with hostname and node name 
  filters: |
    [FILTER]
        Name kubernetes
        Match antrea.*
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        
    [FILTER]
        Name record_modifier
        Match antrea.*
        Record podname ${HOSTNAME}
        Record nodename ${NODE_NAME}

  ## https://docs.fluentbit.io/manual/pipeline/outputs
  ## Send the matching data to loki adding a label
  outputs: |
    [OUTPUT]
        Name loki
        Match antrea.*
        Host loki-gateway.loki.svc
        Port 80
        Labels job=fluentbit-antrea

Create the namespace fluent-bit where all the objects will be placed.

kubectl create ns fluent-bit

And now proceed with fluent-bit chart installation.

helm install fluent-bit -n fluent-bit fluent/fluent-bit -f values.yaml
NAME: fluent-bit
LAST DEPLOYED: Wed Jan 11 19:21:20 2023
NAMESPACE: fluent-bit
STATUS: deployed
REVISION: 1
NOTES:
Get Fluent Bit build information by running these commands:

export POD_NAME=$(kubectl get pods --namespace fluent-bit -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace fluent-bit port-forward $POD_NAME 2020:2020
curl http://127.0.0.1:2020

Verifying Fluent-Bit installation

As suggested by the highlighted output of the previous chart installation, you can easily try to reach the fluent-bit API, which is listening on TCP port 2020, using a port-forward. Issue the port-forward action and try to curl to see if the service is accepting the GET request. The output indicates the service is ready, and you get some metadata such as flags and the version associated with the running fluent-bit pod.

curl localhost:2020 -s | jq
{
  "fluent-bit": {
    "version": "2.0.8",
    "edition": "Community",
    "flags": [
      "FLB_HAVE_IN_STORAGE_BACKLOG",
      "FLB_HAVE_CHUNK_TRACE",
      "FLB_HAVE_PARSER",
      "FLB_HAVE_RECORD_ACCESSOR",
      "FLB_HAVE_STREAM_PROCESSOR",
      "FLB_HAVE_TLS",
      "FLB_HAVE_OPENSSL",
      "FLB_HAVE_METRICS",
      "FLB_HAVE_WASM",
      "FLB_HAVE_AWS",
      "FLB_HAVE_AWS_CREDENTIAL_PROCESS",
      "FLB_HAVE_SIGNV4",
      "FLB_HAVE_SQLDB",
      "FLB_LOG_NO_CONTROL_CHARS",
      "FLB_HAVE_METRICS",
      "FLB_HAVE_HTTP_SERVER",
      "FLB_HAVE_SYSTEMD",
      "FLB_HAVE_FORK",
      "FLB_HAVE_TIMESPEC_GET",
      "FLB_HAVE_GMTOFF",
      "FLB_HAVE_UNIX_SOCKET",
      "FLB_HAVE_LIBYAML",
      "FLB_HAVE_ATTRIBUTE_ALLOC_SIZE",
      "FLB_HAVE_PROXY_GO",
      "FLB_HAVE_JEMALLOC",
      "FLB_HAVE_LIBBACKTRACE",
      "FLB_HAVE_REGEX",
      "FLB_HAVE_UTF8_ENCODER",
      "FLB_HAVE_LUAJIT",
      "FLB_HAVE_C_TLS",
      "FLB_HAVE_ACCEPT4",
      "FLB_HAVE_INOTIFY",
      "FLB_HAVE_GETENTROPY",
      "FLB_HAVE_GETENTROPY_SYS_RANDOM"
    ]
  }
}

Remember the fluent-bit process needs access to the logs that are generated on every single node; that means you will need a DaemonSet object that runs a local fluent-bit pod on each of the eligible nodes across the cluster.

kubectl get all -n fluent-bit
NAME                   READY   STATUS    RESTARTS   AGE
pod/fluent-bit-8s72h   1/1     Running   0          9m20s
pod/fluent-bit-lwjrn   1/1     Running   0          9m20s
pod/fluent-bit-ql5gp   1/1     Running   0          9m20s
pod/fluent-bit-wkgnh   1/1     Running   0          9m20s
pod/fluent-bit-xcpn9   1/1     Running   0          9m20s
pod/fluent-bit-xk7vc   1/1     Running   0          9m20s

NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/fluent-bit   ClusterIP   10.111.248.240   <none>        2020/TCP   9m20s

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/fluent-bit   6         6         6       6            6           <none>          9m20s

You can also display the logs that any of the pods generates on booting. Note how the tail input plugin only watches files matching the configured pattern (any file with a name matching antrea*.log in the /var/log/containers/ folder).

kubectl logs -n fluent-bit fluent-bit-8s72h
Fluent Bit v2.0.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/01/11 18:21:32] [ info] [fluent bit] version=2.0.8, commit=9444fdc5ee, pid=1
[2023/01/11 18:21:32] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/01/11 18:21:32] [ info] [cmetrics] version=0.5.8
[2023/01/11 18:21:32] [ info] [ctraces ] version=0.2.7
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] initializing
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] multiline core started
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2023/01/11 18:21:32] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2023/01/11 18:21:32] [ info] [output:loki:loki.0] configured, hostname=loki-gateway.loki.svc:80
[2023/01/11 18:21:32] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/01/11 18:21:32] [ info] [sp] stream processor started
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529048 watch_fd=1 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-agent-9dadd3c909f9471408ebf569c0d8f2622bedd572ef7a982bfe71a7f3cd6010d0.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=532180 watch_fd=2 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-agent-fd6cfdb5a18c77e66403e66e3a16e2f577d213cd010bdf09f863e22d897194a8.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=532181 watch_fd=3 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-ovs-5344c17989a14d5773ae75e4403c12939c34b2ca53fb5a09951d8fd953cea00d.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529094 watch_fd=4 name=/var/log/containers/antrea-agent-b4tfl_kube-system_antrea-ovs-cb343ab16cc1d9a718b938be8a889196fd93134f63c9f9da6c53a2ff291f25f5.log
[2023/01/11 18:21:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528986 watch_fd=5 name=/var/log/containers/antrea-agent-b4tfl_kube-system_install-cni-583b2d7380e3dc9cff9c3a05870c7997747d9751c075707bd182d1d0a0ec5e9b.log

Now that we are sure fluent-bit is working properly, the last step is to check whether we are actually receiving the logs in Loki, using Grafana to retrieve the ingested logs. Remember that in the fluent-bit output configuration we labeled the logs using job=fluentbit-antrea, and we will use that as input to filter our interesting logs. Click on the compass icon at the left ribbon and populate the label filter with the mentioned label (key and value).

Exploring Antrea logs sent to Loki with Grafana

Generate some activity in the antrea agents; for example, as soon as you create a new pod, the CNI provides the IP address and writes a corresponding log indicating a new IP address has been Allocated. Let's try to locate this exact string in any antrea log. To do so, just press the Code button to write a custom filter by hand.

Code Option for manual queries creation

Type the following filter to find any log with a label job=fluentbit-antrea that also contains the string “Allocated”.

{job="fluentbit-antrea"} |= "Allocated" 

Press the Run query blue button at the top right corner and you should be able to display any matching log entry as shown below.

Exploring Antrea logs sent to Loki with Grafana (2)

Feel free to explore the log further to see the format and the different labels and fields as shown below

Exploring Antrea logs sent to Loki with Grafana (3)
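
If you want to work with those fields directly in LogQL, and assuming the default json line_format of the Fluent Bit loki output plugin, a query like the one below should extract the record keys (including the podname and nodename fields added by the record_modifier filter) so you can filter on them:

{job="fluentbit-antrea"} |= "Allocated" | json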

This concludes this post. If you followed it, you should now have the required tools up and running to gain observability. This is just the first step though. For any given observability solution, the real effort comes on Day 2, when you need to figure out what your KPIs are according to your particular needs and how to visualize them in the proper way. The next post here will continue diving into dashboards and log analysis. Stay tuned!