
YACE Deployment Guide — Pitfalls & Working Recipe

Lessons learned from deploying YACE (Yet Another CloudWatch Exporter) v0.61.2 with chart 0.38.0 to EKS.

What YACE Is

Pulls AWS CloudWatch metrics → exposes Prometheus-format /metrics. Cheaper than the official cloudwatch_exporter because it batches GetMetricData calls.

Architecture (background scrape with cached /metrics, not scrape-on-request)

YACE runs a background scraper at scraping-interval (default 300s). /metrics returns cached results — Prometheus scrape doesn't trigger CloudWatch calls.

[CloudWatch] ←─── background goroutine (300s) ─── [YACE]
↓ /metrics (cached)
[Prometheus scrape every 15s]
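
The decoupling in the diagram above can be sketched in a few lines. This is a minimal illustration, not YACE's actual code: CachedExporter, tick, and fake_fetch are hypothetical names, and the clock is passed in explicitly so the behaviour is deterministic.

```python
from typing import Callable, Dict


class CachedExporter:
    """Serves the last background-scrape result instead of calling upstream per request."""

    def __init__(self, fetch: Callable[[], Dict[str, float]], interval_s: float = 300.0):
        self._fetch = fetch            # hypothetical stand-in for the CloudWatch calls
        self._interval_s = interval_s  # plays the role of scraping-interval
        self._cache: Dict[str, float] = {}
        self._last_refresh = float("-inf")

    def tick(self, now: float) -> None:
        """Called by the background loop; refreshes at most once per interval."""
        if now - self._last_refresh >= self._interval_s:
            self._cache = self._fetch()
            self._last_refresh = now

    def metrics(self) -> Dict[str, float]:
        """What /metrics would return: a copy of the cache, no upstream call."""
        return dict(self._cache)


calls = {"n": 0}


def fake_fetch() -> Dict[str, float]:
    # Counts upstream calls so the effect of caching is visible.
    calls["n"] += 1
    return {"aws_sqs_approximate_number_of_messages_visible_maximum": 7.0}


exporter = CachedExporter(fake_fetch, interval_s=300.0)
exporter.tick(now=0.0)                # first background scrape
for second in range(15, 300, 15):     # Prometheus scraping every 15s
    exporter.tick(now=float(second))  # interval not elapsed: no upstream call
    exporter.metrics()
exporter.tick(now=300.0)              # second background scrape
```

In this sketch, nineteen reads of metrics() between the two refreshes cost no upstream calls; only the two tick() calls that cross the interval do.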

Pitfalls (with fixes)

1. Image registry double-prepend

The chart's image values default to registry: ghcr.io. Setting repository: <ECR>/v1-ti-yace results in ghcr.io/<ECR>/v1-ti-yace:<tag> → 400 Bad Request.

Fix: split registry + repository:

image:
  registry: <ECR-host>
  repository: v1-ti-yace
  tag: 0.61.2-test

Setting registry: "" produces a leading / → InvalidImageName. Don't.
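
The three outcomes in this pitfall follow from plain string joining. The sketch below assumes the chart builds the image reference as registry/repository:tag, which is inferred from the failures described here rather than read from the chart template; image_ref is a hypothetical helper.

```python
def image_ref(registry: str, repository: str, tag: str) -> str:
    # Assumed template shape: "<registry>/<repository>:<tag>"
    return f"{registry}/{repository}:{tag}"


ECR = "<ECR-host>"  # placeholder, as in the values snippet

# Default registry kept, full ECR path put into repository: double prepend.
double = image_ref("ghcr.io", f"{ECR}/v1-ti-yace", "0.61.2-test")

# Empty registry: the separator is still emitted, leaving a leading slash.
empty = image_ref("", "v1-ti-yace", "0.61.2-test")

# Registry and repository split correctly.
split = image_ref(ECR, "v1-ti-yace", "0.61.2-test")

for ref in (double, empty, split):
    print(ref)
```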

2. searchTags requires the tag to exist

If your AWS resources don't carry a Name tag (most don't unless you set them), searchTags: [{key: Name, ...}] discovers zero resources → No tagged resources made it through filtering.

Fix: use dimensionNameRequirements on the dimension AWS itself emits:

dimensionNameRequirements: [QueueName] # SQS
dimensionNameRequirements: [DBInstanceIdentifier] # RDS
dimensionNameRequirements: [LoadBalancer] # ALB

Then filter at Prometheus scrape time with metric_relabel_configs if you want a subset.

3. Schema fields are version-pinned

includeLinkedAccounts: false is a v0.62+ field. v0.61.2 logs field includeLinkedAccounts not found in type config.Job and then skips the job (or applies defaults) instead of failing, so the mistake is easy to miss. Always pin both chart and app version, and check the release notes for the YACE version, not the chart version.

4. IAM sqs:ListQueues cannot be resource-scoped

Splitting the SQS statement is mandatory:

statement { sid="SQSList"; actions=["sqs:ListQueues"]; resources=["*"] }
statement { sid="SQSRead"; actions=["sqs:GetQueueAttributes"]; resources=[<arn>] }
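
For reference, the same split can be rendered as an IAM policy JSON document. This is only a sketch of the two statements above: QUEUE_ARN is the same placeholder, the Allow effect is assumed, and any other permissions the exporter needs are out of scope here.

```python
import json

QUEUE_ARN = "<arn>"  # placeholder, as in the snippet above

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # ListQueues cannot be scoped to a resource, so it needs "*"
            "Sid": "SQSList",
            "Effect": "Allow",
            "Action": ["sqs:ListQueues"],
            "Resource": ["*"],
        },
        {
            # Per-queue reads can stay scoped to the queue ARN
            "Sid": "SQSRead",
            "Effect": "Allow",
            "Action": ["sqs:GetQueueAttributes"],
            "Resource": [QUEUE_ARN],
        },
    ],
}

print(json.dumps(policy, indent=2))
```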

5. iam:ListAccountAliases warning

Not strictly needed, but YACE calls it on startup to populate the account_alias label. Without it you get a warning log and no functional impact. Adding it is cleaner.

6. Metric name suffix from statistics

Statistics get appended to metric name:

  • statistics: [Maximum] → aws_sqs_approximate_number_of_messages_visible_maximum
  • statistics: [Average] → aws_sqs_approximate_number_of_messages_visible_average
  • statistics: [Maximum, Average] → both metrics emitted

Querying without the suffix returns nothing.
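
To predict the name to query, the mapping can be approximated as below. This is a simplified sketch, not YACE's actual naming code: prom_name is a hypothetical helper, and it has only been checked against the SQS names listed above, so other namespaces or metric names may differ.

```python
import re


def snake(text: str) -> str:
    # Insert "_" where a lowercase letter or digit is followed by an uppercase letter,
    # then lowercase everything.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text).lower()


def prom_name(namespace: str, metric: str, statistic: str) -> str:
    # "AWS/SQS" -> "aws_sqs", then metric and statistic are appended in snake_case.
    return "_".join([snake(namespace.replace("/", "_")), snake(metric), snake(statistic)])


names = [
    prom_name("AWS/SQS", "ApproximateNumberOfMessagesVisible", stat)
    for stat in ("Maximum", "Average")
]
for name in names:
    print(name)
```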

7. nilToZero: true is worth it

Without it, queues with no traffic emit no sample → Grafana stat panel shows "No data". With it, emits 0 → flat line.

Minimal Working Config

# helm values
image:
  registry: <your-registry>
  repository: yet-another-cloudwatch-exporter
  tag: <version>
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: <irsa-role>
service:
  port: 5000
config: |
  apiVersion: v1alpha1
  sts-region: <region>
  discovery:
    jobs:
      - type: AWS/SQS
        regions: [<region>]
        dimensionNameRequirements: [QueueName]
        period: 300
        metrics:
          - name: ApproximateNumberOfMessagesVisible
            statistics: [Maximum]
            nilToZero: true

Cost Tuning

Monthly cost = metrics × statistics × (3600/period) × 24 × 30 × $0.01/1000, where metrics is resources × metric names and period is in seconds.

  • 1 region, 30 queues, 3 metrics (visible/age/sent), 5min: ~$26/month
  • 10 regions, same: ~$260/month
  • Cut to period: 600 (10min): half the cost, lose nothing for backlog alerting
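
The formula is easy to turn into a small calculator for comparing settings. This is a sketch under the formula's own assumptions (a 30-day month, $0.01 per 1,000 metrics requested); monthly_cost_usd is a hypothetical helper, and the number of statistics per metric is a parameter because the examples above do not state it.

```python
def monthly_cost_usd(
    resources: int,
    metric_names: int,
    statistics: int,
    period_s: int,
    regions: int = 1,
    usd_per_1000: float = 0.01,
) -> float:
    # One request per metric and statistic, once per period, for a 30-day month.
    requests_per_hour = 3600 / period_s
    metrics = resources * metric_names * regions
    requested = metrics * statistics * requests_per_hour * 24 * 30
    return requested * usd_per_1000 / 1000


one_stat = monthly_cost_usd(30, 3, 1, 300)     # 30 queues, 3 metrics, 1 statistic, 5min
three_stats = monthly_cost_usd(30, 3, 3, 300)  # same, 3 statistics per metric
ten_min = monthly_cost_usd(30, 3, 3, 600)      # same, period: 600

print(round(one_stat, 2), round(three_stats, 2), round(ten_min, 2))
```

With these inputs, one statistic per metric comes to about $7.78 a month and three statistics to about $23.33; doubling the period halves either figure, and the total scales linearly with the number of regions.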

Verify End-to-End

# 1. YACE has data
kubectl exec deploy/yace -- wget -qO- http://localhost:5000/metrics | grep aws_sqs_ | head

# 2. Prometheus scraped successfully
curl -G prometheus:9090/api/v1/query --data-urlencode 'query=up{job="yace"}'

# 3. Metric stored
curl -G prometheus:9090/api/v1/query --data-urlencode 'query=aws_sqs_approximate_number_of_messages_visible_maximum'
