Advanced monitoring and telemetry

What kind of advanced monitoring and telemetry can be implemented in a cluster? Currently, the baseline metrics are resource utilization for nodes and containers, but it would be great to learn what can be done with additional services in the application layer.

Using the MedStack API, it is possible to gather information about the different services running inside a cluster. We currently use the /tasks endpoint (https://support.medstack.co/reference/listtasks) to monitor the containers for each service and to alert if one of them shuts down and is not able to come back up.

The API response for that endpoint contains a list of tasks, which can be treated as containers for this purpose. Each task is attached to a service through the field service_id, and multiple tasks can be linked to a single service. The field desired_state denotes the state the task is supposed to be in. The field "status" contains an object with the field "state", which tells us the current state of that task. So, to monitor our services for containers that are unable to start, or for cases where the cluster is unable to assign a node to a container, we look for tasks where desired_state is equal to "running" and "state" is different from "running".
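As a rough illustration of that filter, here is a minimal Ruby sketch. It assumes the response body has already been fetched and parsed into an array of hashes. Only service_id, desired_state, and status.state come from the description above; the "id" field, the sample payload, and the state values other than "running" are made up for the example.

```ruby
require "json"

# Returns the tasks that are supposed to be running but currently are not.
# Assumes `tasks` is an array of hashes, as JSON.parse would produce.
def unhealthy_tasks(tasks)
  tasks.select do |task|
    current_state = task.dig("status", "state")
    task["desired_state"] == "running" && current_state != "running"
  end
end

# Groups the unhealthy tasks by the service they are attached to.
def unhealthy_by_service(tasks)
  unhealthy_tasks(tasks).group_by { |task| task["service_id"] }
end

# Hypothetical sample payload, not a real API response.
SAMPLE_RESPONSE = <<~JSON
  [
    {"id": "t1", "service_id": "svc-a", "desired_state": "running",  "status": {"state": "running"}},
    {"id": "t2", "service_id": "svc-a", "desired_state": "running",  "status": {"state": "pending"}},
    {"id": "t3", "service_id": "svc-b", "desired_state": "running",  "status": {"state": "failed"}},
    {"id": "t4", "service_id": "svc-b", "desired_state": "shutdown", "status": {"state": "shutdown"}}
  ]
JSON

if __FILE__ == $PROGRAM_NAME
  tasks = JSON.parse(SAMPLE_RESPONSE)
  unhealthy_by_service(tasks).each do |service_id, failing|
    puts "#{service_id}: #{failing.length} task(s) not running"
  end
end
```

Task t4 is not flagged because it is not supposed to be running; only tasks whose desired and actual states disagree are reported.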

To accomplish this, we have a Ruby service running inside the cluster that checks all tasks every minute. If we find a task that is not behaving as it should, we fetch the details of its service through the API and send an email to our team with the name of the cluster, the name of the service, and timestamps.
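The shape of such a checker might look like the sketch below. This is not our actual service. The API calls and the email delivery are passed in as callables (fetch_tasks, fetch_service, and notify are hypothetical names), so nothing here assumes a particular HTTP client, authentication scheme, or mail library. The "name" field on the service object is also an assumption.

```ruby
# Runs one monitoring pass: fetch tasks, find the failing ones, and send
# one notification per affected service. Returns the messages it sent.
#
# fetch_tasks   - callable returning the parsed task list (hypothetical)
# fetch_service - callable taking a service ID and returning the parsed
#                 service object, assumed to include a "name" field
# notify        - callable taking the message text (e.g. sends an email)
def check_once(cluster_name:, fetch_tasks:, fetch_service:, notify:, now: Time.now)
  failing = fetch_tasks.call.select do |task|
    task["desired_state"] == "running" && task.dig("status", "state") != "running"
  end

  failing.group_by { |task| task["service_id"] }.map do |service_id, tasks|
    service = fetch_service.call(service_id)
    message = format(
      "[%s] cluster=%s service=%s failing_tasks=%d",
      now.getutc.strftime("%Y-%m-%dT%H:%M:%SZ"),
      cluster_name,
      service["name"],
      tasks.length
    )
    notify.call(message)
    message
  end
end

# Repeats the check forever, sleeping between passes (one minute by default).
def monitor(interval_seconds: 60, **options)
  loop do
    check_once(**options)
    sleep interval_seconds
  end
end
```

Passing the collaborators in keeps the check itself easy to test with canned data, and swapping email for another channel only means supplying a different notify callable.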

This could be improved by waiting for multiple failed checks on a single service, which would limit false positives during deployments, or by using a service like PagerDuty instead of only sending emails.
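One way to sketch the "multiple failed checks" idea in Ruby is a small counter per service that only reports a service once it has failed a given number of checks in a row. The class name FailureTracker and the default threshold of 3 are illustrative choices, not something we run today.

```ruby
# Tracks consecutive failed checks per service and reports a service only
# when it reaches the threshold, so a brief blip during a deployment does
# not trigger an alert.
class FailureTracker
  def initialize(threshold: 3)
    @threshold = threshold
    @counts = Hash.new(0)
  end

  # Records the result of one check. `failing_service_ids` lists the
  # services that had at least one unhealthy task in this check.
  # Returns the service IDs that reached the threshold on this check.
  def record(failing_service_ids)
    failing = failing_service_ids.uniq

    # A service that passed this check starts over from zero.
    @counts.delete_if { |service_id, _count| !failing.include?(service_id) }

    failing.select do |service_id|
      @counts[service_id] += 1
      # Equality (not >=) so an ongoing outage alerts once, not every minute.
      @counts[service_id] == @threshold
    end
  end
end
```

With a one-minute check interval and a threshold of 3, a service would need to be unhealthy for roughly three consecutive checks before anyone is notified. The returned IDs can then be handed to whichever notifier is in use.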

Marked as answered by Marcus Polini

Hi, I followed this approach using https://support.medstack.co/reference/listtasks, but I still see errors like the ones you describe. How can I resolve them completely?