The need for integration.
"Breaking systems on purpose."
This is the first thing on many people's minds when they are asked about Chaos Engineering.
While that is absolutely true, the core principle of Chaos Engineering is running thoughtful, planned experiments on your systems to reveal weaknesses and ultimately make those systems more reliable.
From that perspective, Chaos Engineering is an important tool in the Site Reliability Engineering toolbox. But it is not the only tool.
Most SRE practitioners strive towards an intelligent and automated incident response process: your Monitoring Platform and Incident Response System make well-informed decisions before automatically remediating an issue or alerting the right people or systems.
The emphasis here is on the words 'well-informed'.
What if you are running a Chaos Engineering experiment that causes some interesting issues in your application? Your Monitoring Platform picks up on the application's behavior and creates an incident in your Incident Response system, which in turn pages one of your on-call engineers.
Only for them to find out you are running a GameDay...
Would your Monitoring Platform make a different decision based on the information that a Chaos Engineering experiment is being run? I'd like to think it would.
The case for integration seems pretty clear...
The bump in the road.
If you've ever been to our Chaos Engineering bootcamp or attended our webinars, you will have noticed we work a lot with the Gremlin platform as our tool of choice for Chaos Engineering.
Alongside that, we also partner with Instana for their Monitoring and Observability platform.
Both platforms offer a SaaS based control plane in combination with smart agents that are deployed on your systems.
Gremlin has a webhook feature that allows easy integration with almost any other system.
Instana, in turn, has an API endpoint that is capable of ingesting external events.
Unfortunately, that endpoint is only available on the host API (i.e. at the agent level), so connecting to it from Gremlin's Control Plane is a bit of a struggle.
Solving the challenge.
When you run the Instana agent on Kubernetes, it is deployed as a DaemonSet, so an agent runs on every node. Each agent publishes the host API on port 42699, which listens for HTTP requests.
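If you want to see what that host API returns, you can query it from a shell on one of the nodes (or from inside an instana-agent pod). A quick sketch, assuming the default agent port; the exact version number will of course differ:
$ curl -s http://localhost:42699/
{"version":"1.1.587"}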
Exposing that API to the outside world (i.e. outside of our k8s cluster) requires solving two challenges:
- The host API does not require authentication, as normally it is only accessible on the host system itself.
- The host API does not encrypt the connection; it uses plain-text HTTP.
This blog describes an easy-to-implement solution: placing an NGINX reverse proxy in front of the Instana agent host API. The NGINX proxy requires an HTTPS connection from the outside world and authenticates incoming requests. The end result looks very similar to this:
Before you start.
Before you deploy, make sure the following prerequisites are met:
- You have the following command line tools installed: kubectl, htpasswd and base64.
- You have access to your full SSL certificate chain and private key, or you are able to generate them for this specific setup and have the certificate signed by a known CA.
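If you still need to generate a private key and have a certificate issued, something along the following lines works with openssl (a sketch; the hostname below is the example one used throughout this post, so replace it with your own):
$ openssl req -new -newkey rsa:2048 -nodes \
    -keyout ssl.key -out ssl.csr \
    -subj "/CN=demo-instana-api.nuaware.training"
Send the resulting ssl.csr to your CA and store the returned full certificate chain as ssl.cert.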
Preparing the NGINX configuration.
Using your favorite editor, create a file called default.conf with the following contents:
server {
    #listen 80;
    listen 443 ssl;
    server_name demo-instana-api.nuaware.training;

    ssl_certificate /etc/nginx/certs/ssl.cert;
    ssl_certificate_key /etc/nginx/certs/ssl.key;

    # Default NGINX welcome page, useful as a first connectivity check.
    location / {
        root /usr/share/nginx/html;
        index index.html index.htm;
    }

    # Everything under /api/ is authenticated and proxied to the Instana agent host API.
    location /api/ {
        auth_basic "Welcome to Area51";
        auth_basic_user_file /etc/nginx-auth/.htpasswd;
        proxy_pass http://gremlin-instana-api:42699/;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}
Make sure you adjust server_name to match your environment.
Securing NGINX with https and basic authentication.
As mentioned, we will use SSL-encrypted HTTPS between the Gremlin Control Plane and our proxy. Make sure you have your full certificate chain in a file called ssl.cert and your private key in a file called ssl.key. If you choose different filenames, you will need to adjust the default.conf file accordingly, as well as the upcoming kubectl commands.
We will use basic authentication for this setup. It isn't the most secure way of authenticating, but since the connection is also encrypted with HTTPS, we use it here for simplicity.
For this to work, we need to manually create the .htpasswd file with a username and a password.
Second, we need to encode the username and password as a base64-encoded string, which we will need later when authenticating the Gremlin Control Plane to the proxy.
Use the following commands to achieve this (and make sure you store the base64 encoded string until we need it!):
$ export USER_NAME="api-user"
$ export PASSWORD="<your_password>"
$ htpasswd -cb .htpasswd $USER_NAME $PASSWORD
Adding password for user api-user
$ echo -n $USER_NAME:$PASSWORD | base64
WW91IGZ1bm55IGd1eSwgdHJ5aW5nIHRvIGRlY29kZSB0aGlzIHN0cmluZyB0byBzZWUgaWYgdGhlcmUncyBhbiBhY3R1YWwgcGFzc3dvcmQgaW4gaXQgOi0p
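If you want to double-check the string (it later becomes the value of the Authorization header, prefixed with "Basic "), decoding it should return the exact username:password pair you started with:
$ echo -n "<your_base64_string>" | base64 --decode
api-user:<your_password>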
Kubernetes secrets and configMaps.
Next, we are going to add the certificate files, the credential file and the NGINX configuration to our Kubernetes cluster. We'll store the NGINX configuration in a configMap and the rest in secrets:
$ kubectl create configmap gremlin-instana-proxy-config --from-file=default.conf -n instana-agent
$ kubectl create secret generic gremlin-instana-proxy-certs --from-file=ssl.key --from-file=ssl.cert -n instana-agent
$ kubectl create secret generic gremlin-instana-proxy-credentials --from-file=.htpasswd -n instana-agent
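As a quick, optional sanity check, listing the objects in the instana-agent namespace should show all three:
$ kubectl get configmap,secret -n instana-agent | grep gremlin-instana-proxy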
Deploying to k8s.
Once that's done, we can deploy the Kubernetes manifest to our cluster. You can find the manifest here:
https://github.com/nuaware/gremlin-instana-proxy/blob/main/gremlin-instana-proxy.yaml
You can either make a local copy, or deploy it directly from GitHub using the following command:
$ kubectl apply -f https://raw.githubusercontent.com/nuaware/gremlin-instana-proxy/main/gremlin-instana-proxy.yaml -n instana-agent
This will deploy the following resources:
- deployment.apps/gremlin-instana-proxy
- service/gremlin-instana-proxy
- service/gremlin-instana-api
Once those are up and running, find the EXTERNAL-IP address that was assigned, using the following command:
$ kubectl get svc gremlin-instana-proxy -n instana-agent
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
gremlin-instana-proxy LoadBalancer 10.100.251.128 a03d2041841f24c30b1cb8554af8da84-836459315.eu-central-1.elb.amazonaws.com 443:31131/TCP 3m23s
Validate that the EXTERNAL-IP is linked to the domain name used in your NGINX configuration and your certificates (either directly or as a CNAME record).
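A quick way to verify the DNS side is to resolve the hostname from your NGINX configuration and compare the result with the EXTERNAL-IP (or ELB hostname) shown above. A sketch, using the example hostname from this post:
$ dig +short demo-instana-api.nuaware.training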
Configuring the Gremlin Webhook.
In the Gremlin Web Application, click on the account icon in the top right (next to "Halt All Attacks"), make sure you are using the correct Team, navigate to "Team Settings", open the "Webhooks" tab and click "New Webhook".
Fill in the form as shown below, paying particular attention to the Request URL, the Custom Headers and the JSON payload:
- Request URL: has to be formed as follows: https://<hostname>.<domainname>/api/com.instana.plugin.generic.event
- The Custom Headers: use "Authorization" as the key; the value is "Basic " followed by the base64-encoded string we created earlier.
- The JSON payload: Gremlin can substitute a number of variables into the payload (see Gremlin's webhook documentation), while the Instana agent accepts the event parameters described in Instana's documentation. The payload we use in this example:
{
    "title": "Gremlin attack: ${ATTACK_TYPE} Status ${STATUS}",
    "severity": "5",
    "text": "Team: ${TEAM_ID} Attack ID: ${ATTACK_ID} Attack Type: ${ATTACK_TYPE} Status: ${STATUS}"
}
Hit Save and we are ready to test!
Testing and troubleshooting.
Try firing a random attack at your Kubernetes cluster and watch the events stream into your Instana dashboard. If everything is set up correctly, you should see several events come in, looking similar to this:
If you don't see any events come in, these three steps can help you troubleshoot:
- Open the URL for the proxy directly in a browser, just the hostname without any trailing path (e.g. https://<hostname>.<domainname>/). You should see the NGINX default page. Verify that the connection is reported as secure to validate that your certificates are set up correctly.
- Add /api to the URL. This should require you to authenticate with the username and password for the proxy. If authentication is set up correctly, the page should display a version number in the following JSON format: {"version":"1.1.587"} (the curl sketch below performs the same checks from the command line).
- If the webserver doesn't return the version number, there is probably a connectivity problem between the proxy pod and the Instana agent pods. Verify the deployment of the gremlin-instana-api service.
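If you prefer the command line over a browser, the same checks (plus a manual test event that bypasses Gremlin completely) can be done with curl. This is a sketch, assuming the username, password, hostname and payload format from the earlier steps:
$ curl -u "api-user:<your_password>" https://<hostname>.<domainname>/api/
{"version":"1.1.587"}
$ curl -u "api-user:<your_password>" -X POST \
    -H "Content-Type: application/json" \
    -d '{"title": "Manual test event", "severity": "5", "text": "Posted with curl through the proxy"}' \
    https://<hostname>.<domainname>/api/com.instana.plugin.generic.event
If the second call returns without an error, a "Manual test event" should show up in your Instana events stream within a few seconds.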
What’s next?
This specific example focuses solely on getting those events into Instana. Once they are in there, there are a multitude of scenarios in which we could use that information to make our on-call lives much, much better.
We could, for example, replace our NGINX proxy with a Python app that requests details about the attack and the targets from the Gremlin API and also feeds those into Instana.
From there we could start correlating infrastructure and applications to the incoming Gremlin events, visualising which parts of your infrastructure are impacted by the experiment and, even more importantly, whether your applications are still operational.
Or maybe you have set up incident creation to happen automatically. You could then decide to lower the severity of the incident, as you know it is a controlled experiment and not an actual outage.
There’s an interesting blog post from Instana about context-rich alerting which might be of help as well:
https://www.instana.com/blog/when-time-is-of-the-essence-you-need-context-rich-alerts/
And of course, if you need help, the people at Nuaware can help you move along this journey!
