diff --git a/grafana/README.md b/grafana/README.md
index b129b6a..e0e9fe5 100644
--- a/grafana/README.md
+++ b/grafana/README.md
@@ -1,4 +1,6 @@
 # Modules for Grafana alerts and dashboards
+
+
 
 ## Alerting
 
@@ -7,77 +9,244 @@ Please check documentation about Grafana alerting [here](https://itdoc.schwarz/x
 
 The Terraform modules are separated per resource type, check README in each module directory for spefic examples.
 Below is example for alerts using „**Prometheus/Thanos**“ datasorce and sending notification to „**Google Chat**“.
+## Authentication
+Set Grafana credentials as Terraform variables:
 
-```hcl title="main.tf"
- # Datasource
+```bash
+export TF_VAR_grafana_url="https://grafana.example.com"
+export TF_VAR_grafana_username="admin"
+export TF_VAR_grafana_password="super-secret"
+```
+
+These credentials are used by all modules to authenticate with the Grafana API.
+
+---
+
+## Directory Structure
+
+Organize alerts, templates, and Terraform code as follows:
+
+```
+.
+├── alerts/
+│   ├── infra/
+│   │   ├── loki/
+│   │   │   └── alert-loki.yaml
+│   │   └── thanos/
+│   │       └── alert-thanos.yaml
+│   ├── oncall/
+│   │   └── alert-oncall.yaml
+│   └── heartbeats/
+│       └── alert-heartbeat.yaml
+├── templates/
+│   └── myteam/
+│       └── gchat-message.tmpl
+└── main.tf
+```
+
+- **Alerts**: YAML files defining rule groups (`apiVersion: 1, groups: [...]`).
+- **Templates**: Notification templates for Google Chat contact points.
+- **Terraform code**: References modules and binds everything together.
+
+---
+
+## Defining Secrets
+
+Datasource URLs and credentials should be stored in Terraform variables, not hardcoded.
+
+**Example: Environment Variables**
+
+```bash
+export TF_VAR_thanos_coin_prd_url="https://thanos.example.com"
+export TF_VAR_thanos_coin_prd_user="reader"
+export TF_VAR_thanos_coin_prd_pass="password"
+
+export TF_VAR_loki_coin_prd_url="https://loki.example.com"
+export TF_VAR_loki_coin_prd_user="reader"
+export TF_VAR_loki_coin_prd_pass="password"
+
+export TF_VAR_opsgenie_api_key="xxxxxx"
+```
+
+---
+
+## Module Usage
+
+### Datasources
+
+Define multiple datasources (Prometheus, Loki, etc.) with unique keys for URL/username/password:
+
+```hcl
 module "datasource" {
   source = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/datasource?ref=main"
-  datasource_name = "Thanos - Myteam"
-  datasource_url = var.datasource_url
-  datasource_username = var.datasource_username
-  datasource_password = var.datasource_password
-}
-# Alert Receiver / Contact Point
-module "google-chat-contact-point" {
+  datasources = {
+    Thanos-Common-Infra-PRD = {
+      type       = "prometheus"
+      url_key    = "thanos_coin_prd"
+      user_key   = "thanos_coin_prd"
+      pass_key   = "thanos_coin_prd"
+      is_default = true
+    }
+    Loki-Common-Infra-PRD = {
+      type     = "loki"
+      url_key  = "loki_coin_prd"
+      user_key = "loki_coin_prd"
+      pass_key = "loki_coin_prd"
+    }
+  }
+
+  datasource_urls = {
+    thanos_coin_prd = var.thanos_coin_prd_url
+    loki_coin_prd   = var.loki_coin_prd_url
+  }
+
+  datasource_users = {
+    thanos_coin_prd = var.thanos_coin_prd_user
+    loki_coin_prd   = var.loki_coin_prd_user
+  }
+
+  datasource_passwords = {
+    thanos_coin_prd = var.thanos_coin_prd_pass
+    loki_coin_prd   = var.loki_coin_prd_pass
+  }
+}
+```
+
+### Contact Points
+
+**Google Chat**
+
+Each Google Chat space is configured as a contact point:
+
+```hcl
+module "gchat-contact-point-coin" {
   source = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/contact-point-gchat?ref=main"
-  google-chat-url = var.google_chat_url
-  contact-point-name = "gchat"
+  gchat_url          = var.gchat_url_coin
+  contact_point_name = "gchat-coin"
+  templates_dir      = "templates/coin"
+  template_prefix    = "coin-"
+  disable_provenance = true
 }
+```
 
+**OpsGenie**
 
-# Alert Rule Folders
+OpsGenie contact points use API keys:
+
+```hcl
+module "opsgenie-contact-point" {
+  source             = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/contact-point-opsgenie?ref=main"
+  contact_point_name = "opsgenie-dev"
+  opsgenie_api_key   = var.opsgenie_api_key
+}
+```
+
+### Alert Folders
+
+Organize alerts in Grafana folders for logical separation:
+
+```hcl
 module "alert-folder" {
-  source = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/alert-folder?ref=main"
-  alert-folder = "Alerts"
+  source       = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/alert-folder?ref=main"
+  alert-folder = "Common-Infra-Alerts"
 }
+```
 
+### Notification Policies
 
-# Template for messages
-module "message-templates" {
-  source = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/message-template?ref=main"
-  templates_dir = "templates"
-  disable_provenance = true
-}
+Map folders to contact points (e.g., send “Common-Infra-Alerts” to Google Chat):
 
-# Notification policies
+```hcl
 module "notification-policy" {
   source = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/notification-policy?ref=main"
-
-  default_contact_point_uid = module.google-chat-contact-point.contact_name
+  default_contact_point_uid = module.gchat-contact-point-coin.contact_point_name
   group_by = ["alertname"]
   folder_policies = {
-    "Alerts" = module.google-chat-contact-point.contact_name
+    "Common-Infra-Alerts"        = module.gchat-contact-point-coin.contact_point_name
+    "Common-Infra-OnCall-Alerts" = module.opsgenie-contact-point.contact_name
   }
 }
+```
 
-# Alert definition
-module "alerting" {
-  source = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/alerts?ref=main"
+### Alert Definitions
 
-  alerts_dir = "alerts"
-  default_datasource_uid = module.datasource.datasource_uid
-  default_receiver = module.google-chat-contact-point.contact_name
-  default_folder_uid = module.alert-folder.folder_uid
-  default_interval_seconds = 60
-  disable_provenance = true
+Alert rules are defined in YAML and applied via the module:
+
+```hcl
+module "alerting-coin" {
+  source             = "git::https://commerce-platform.git.onstackit.cloud/commerce-platform-public/terraform-modules//grafana/alerts?ref=main"
+  alerts_dir         = "alerts/common-infra/thanos"
+  datasource_uid     = module.datasource.datasource_uids["Thanos-Common-Infra-PRD"]
+  folder_uid         = module.alert-folder.folder_uid
+  receiver           = module.gchat-contact-point-coin.contact_point_name
+  disable_provenance = true
 }
 ```
-With this configuration you need to place your notification templates into `templates` folder and alert definitions in YAML format to `alerts` folder in same directoty where `main.tf`is located.
-You can example for both in [examples](./examples/) folder.
-For this example you need export your secret variables e.g. with lookup in Secret Manager.
-```sh
-export TF_VAR_grafana_url=""
-export TF_VAR_grafana_username="admin"
-export TF_VAR_grafana_password="xxxxxxx"
-export TF_VAR_google_chat_url="https://chat.googleapis.com/v1/spaces/xxxxx"
-export TF_VAR_datasource_url="https://xxxxx.stackit.cloud/instances/xxxxxx"
-export TF_VAR_datasource_username="stackit9_xxxxx"
-export TF_VAR_datasource_password="xxxxxx"
+---
+
+## Alert YAML Format
+
+Alerts are defined in Grafana's YAML format. The easiest way to get a starting example is to define an alert in the Grafana UI and then export it with the „Export rules“ button.
+However, make sure to remove the fields that are not needed because the Terraform module logic provides them automatically:
+
+- `datasourceUid` (set by the `alerts` module)
+- `notification_settings` (set by the `notification-policy` module)
+
+Each file must have `apiVersion: 1` and define groups:
+
+```yaml
+apiVersion: 1
+groups:
+  - name: infra-alerts
+    interval: 1m
+    rules:
+      - uid: pod-restart-alert
+        title: "Pod Restart Count High"
+        condition: C
+        data:
+          - refId: A
+            relativeTimeRange:
+              from: 600
+              to: 0
+            model:
+              expr: increase(kube_pod_container_status_restarts_total{}[5m]) > 3
+              instant: true
+              refId: A
+          - refId: C
+            datasourceUid: __expr__
+            model:
+              conditions:
+                - evaluator:
+                    params: [0]
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params: [C]
+                  reducer:
+                    type: last
+                  type: query
+              expression: A
+              type: threshold
+        noDataState: OK
+        execErrState: Error
+        for: 5m
+        annotations:
+          description: "Pod is restarting too often"
 ```
 
-## Dashboard
-TODO
+---
+
+
+## Updating Alerts or Contact Points
+
+- Add new YAML files under `alerts/` for additional rules.
+- Add new modules in `main.tf` for new datasources or contact points.
+- Run `terraform apply` to sync changes to Grafana.
+
+
+You can find examples for alerts and templates in the [examples](./examples/) folder.
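
The README in this diff exports `TF_VAR_*` values and references them as `var.<name>`, but the matching declarations in the root module are not part of the change. A minimal sketch of what they might look like, assuming the variable names simply mirror the exports; the `description` text and the `sensitive` flags are illustrative assumptions, not taken from the repository, and `sensitive` requires Terraform 0.14 or later:

```hcl
# Sketch only: these declarations are assumed, not copied from the repository.
# Terraform fills each variable from the environment variable TF_VAR_<name>.

variable "grafana_url" {
  type        = string
  description = "Base URL of the Grafana instance (assumed wording)"
}

variable "grafana_username" {
  type = string
}

variable "grafana_password" {
  type      = string
  sensitive = true # redacts the value in plan and apply output
}

# Datasource secrets referenced by the "datasource" module call.
variable "thanos_coin_prd_url" {
  type = string
}

variable "thanos_coin_prd_user" {
  type = string
}

variable "thanos_coin_prd_pass" {
  type      = string
  sensitive = true
}

# Key referenced by the "opsgenie-contact-point" module call.
variable "opsgenie_api_key" {
  type      = string
  sensitive = true
}
```

The `loki_coin_prd_*` variables and `gchat_url_coin` would presumably follow the same pattern. Variables without a `default` make `terraform plan` prompt for a value (or fail in non-interactive runs) when the corresponding export is missing, which surfaces a forgotten secret early.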