Configuration¶
Learn how to configure the PipeOps Kubernetes Agent for your specific environment and requirements. This guide covers all configuration options and best practices.
📋 Configuration Overview¶
The PipeOps Agent can be configured through multiple methods:
- Environment variables - For runtime configuration
- Configuration files - For persistent settings
- Helm values - For Kubernetes deployments
- Command-line flags - For one-time overrides
🔧 Configuration Methods¶
Primary method - Create /etc/pipeops/config.yaml:
# Complete configuration example
cluster:
name: "production-cluster"
region: "us-west-2"
environment: "production"
tags:
team: "platform"
cost-center: "engineering"
agent:
token: "${PIPEOPS_TOKEN}"
endpoint: "https://api.pipeops.io"
log_level: "info"
heartbeat_interval: "30s"
max_retries: 3
timeout: "60s"
monitoring:
enabled: true
namespace: "pipeops-system"
retention_days: 30
prometheus:
enabled: true
port: 9090
scrape_interval: "15s"
grafana:
enabled: true
port: 3000
admin_password: "${GRAFANA_ADMIN_PASSWORD}"
loki:
enabled: true
port: 3100
retention_period: "30d"
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
security:
tls:
enabled: true
cert_file: "/etc/ssl/certs/pipeops.crt"
key_file: "/etc/ssl/private/pipeops.key"
rbac:
enabled: true
service_account: "pipeops-agent"
networking:
tunnel:
enabled: true
port: 8443
keepalive: "30s"
proxy:
http_proxy: "${HTTP_PROXY}"
https_proxy: "${HTTPS_PROXY}"
no_proxy: "${NO_PROXY}"
Quick configuration - Set environment variables:
# Core configuration
export PIPEOPS_TOKEN="your-cluster-token"
export PIPEOPS_CLUSTER_NAME="production-cluster"
export PIPEOPS_LOG_LEVEL="info"
export PIPEOPS_ENDPOINT="https://api.pipeops.io"
# Monitoring configuration
export PIPEOPS_MONITORING_ENABLED="true"
export PIPEOPS_PROMETHEUS_ENABLED="true"
export PIPEOPS_GRAFANA_ENABLED="true"
# Resource limits
export PIPEOPS_CPU_LIMIT="500m"
export PIPEOPS_MEMORY_LIMIT="512Mi"
# Security settings
export PIPEOPS_TLS_ENABLED="true"
export PIPEOPS_RBAC_ENABLED="true"
Kubernetes deployments - Create values.yaml:
# Helm chart values
agent:
image:
repository: pipeops/agent
tag: "latest"
pullPolicy: IfNotPresent
cluster:
name: "production-cluster"
region: "us-west-2"
environment: "production"
token: "your-cluster-token"
logLevel: "info"
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
monitoring:
enabled: true
namespace: pipeops-system
prometheus:
enabled: true
retention: "30d"
resources:
requests:
cpu: 100m
memory: 128Mi
grafana:
enabled: true
adminPassword: "secure-password"
persistence:
enabled: true
size: 10Gi
loki:
enabled: true
persistence:
enabled: true
size: 50Gi
security:
serviceAccount:
create: true
name: pipeops-agent
rbac:
create: true
tls:
enabled: true
secretName: pipeops-tls
networking:
tunnel:
enabled: true
port: 8443
service:
type: ClusterIP
port: 8080
Override configuration - Use CLI flags:
# Start with custom configuration
pipeops-agent start \
--cluster-name="test-cluster" \
--log-level="debug" \
--endpoint="https://staging.pipeops.io" \
--monitoring-enabled=false
# Deploy with specific settings
pipeops-agent deploy \
--namespace="custom-namespace" \
--replicas=3 \
--cpu-limit="1000m" \
--memory-limit="1Gi"
⚙️ Configuration Sections¶
Cluster Configuration¶
Configure cluster identification and metadata:
cluster:
name: "production-cluster" # Required: Unique cluster identifier
region: "us-west-2" # Optional: AWS/GCP/Azure region
environment: "production" # Optional: Environment label
provider: "aws" # Optional: Cloud provider
version: "1.25" # Optional: Kubernetes version
tags: # Optional: Custom labels
team: "platform"
cost-center: "engineering"
project: "main-app"
Agent Configuration¶
Core agent behavior settings:
agent:
token: "${PIPEOPS_TOKEN}" # Required: Cluster authentication token
endpoint: "https://api.pipeops.io" # Control plane endpoint
log_level: "info" # Logging level: debug, info, warn, error
heartbeat_interval: "30s" # Health check frequency
max_retries: 3 # Connection retry attempts
timeout: "60s" # Request timeout
user_agent: "PipeOps-Agent/2.1.0" # Custom user agent
# Advanced settings
buffer_size: 1024 # Message buffer size
worker_count: 5 # Concurrent workers
batch_size: 100 # Batch processing size
Monitoring Configuration¶
Comprehensive monitoring stack setup:
monitoring:
enabled: true # Enable monitoring stack
namespace: "pipeops-system" # Kubernetes namespace
retention_days: 30 # Data retention period
prometheus:
enabled: true
port: 9090
scrape_interval: "15s" # Metrics collection frequency
evaluation_interval: "15s" # Rule evaluation frequency
retention: "30d" # Metric retention period
storage_size: "50Gi" # Storage allocation
# External labels for federation
external_labels:
cluster: "production"
region: "us-west-2"
# Additional scrape configs
additional_scrape_configs: |
- job_name: 'custom-app'
static_configs:
- targets: ['app:8080']
grafana:
enabled: true
port: 3000
admin_password: "${GRAFANA_ADMIN_PASSWORD}"
# Persistence
persistence:
enabled: true
size: "10Gi"
storage_class: "fast-ssd"
# Additional dashboards
dashboards:
default:
kubernetes-overview:
url: https://grafana.com/api/dashboards/7249/revisions/1/download
# Data sources
datasources:
prometheus:
url: "http://prometheus:9090"
access: proxy
loki:
enabled: true
port: 3100
retention_period: "30d"
# Storage configuration
storage:
type: "filesystem" # filesystem, s3, gcs
filesystem:
directory: "/loki/chunks"
# Log ingestion limits
limits:
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
Security Configuration¶
Security and access control settings:
security:
# TLS Configuration
tls:
enabled: true
cert_file: "/etc/ssl/certs/pipeops.crt"
key_file: "/etc/ssl/private/pipeops.key"
ca_file: "/etc/ssl/certs/ca.crt"
verify_client: true
min_version: "1.2" # Minimum TLS version
# RBAC Configuration
rbac:
enabled: true
service_account: "pipeops-agent"
cluster_role: "pipeops-agent"
# Custom permissions
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps"]
verbs: ["get", "list", "watch", "create", "update"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
# Network policies
network_policy:
enabled: true
ingress:
- from:
- namespaceSelector:
matchLabels:
name: pipeops-system
egress:
- to: [] # Allow all outbound traffic
# Pod security
pod_security:
security_context:
run_as_non_root: true
run_as_user: 1000
fs_group: 2000
seccomp_profile:
type: RuntimeDefault
Networking Configuration¶
Network and connectivity settings:
networking:
# Secure tunnel configuration
tunnel:
enabled: true
port: 8443
keepalive: "30s"
compression: true
# Connection settings
dial_timeout: "10s"
idle_timeout: "300s"
max_idle_conns: 100
# Proxy configuration
proxy:
http_proxy: "${HTTP_PROXY}"
https_proxy: "${HTTPS_PROXY}"
no_proxy: "${NO_PROXY}"
# DNS configuration
dns:
nameservers:
- "8.8.8.8"
- "8.8.4.4"
search_domains:
- "cluster.local"
- "svc.cluster.local"
# Service mesh integration
service_mesh:
enabled: false
type: "istio" # istio, linkerd, consul
mtls_mode: "strict"
Resource Configuration¶
Resource limits and requests:
resources:
# Agent resources
requests:
cpu: "250m"
memory: "256Mi"
ephemeral-storage: "1Gi"
limits:
cpu: "500m"
memory: "512Mi"
ephemeral-storage: "2Gi"
# Monitoring stack resources
monitoring:
prometheus:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "1Gi"
grafana:
requests:
cpu: "50m"
memory: "64Mi"
limits:
cpu: "200m"
memory: "256Mi"
loki:
requests:
cpu: "50m"
memory: "128Mi"
limits:
cpu: "200m"
memory: "512Mi"
🎯 Environment-Specific Configurations¶
Development Environment¶
# dev-config.yaml
cluster:
name: "dev-cluster"
environment: "development"
agent:
log_level: "debug"
heartbeat_interval: "10s"
monitoring:
enabled: true
retention_days: 7
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "250m"
memory: "256Mi"
Staging Environment¶
# staging-config.yaml
cluster:
name: "staging-cluster"
environment: "staging"
agent:
log_level: "info"
heartbeat_interval: "30s"
monitoring:
enabled: true
retention_days: 14
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
Production Environment¶
# production-config.yaml
cluster:
name: "production-cluster"
environment: "production"
agent:
log_level: "warn"
heartbeat_interval: "30s"
max_retries: 5
monitoring:
enabled: true
retention_days: 90
security:
tls:
enabled: true
rbac:
enabled: true
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "1Gi"
🔐 Secret Management¶
Using Kubernetes Secrets¶
Create secrets for sensitive configuration:
# Create secret for agent token
kubectl create secret generic pipeops-token \
--from-literal=token="your-cluster-token" \
-n pipeops-system
# Create TLS secret
kubectl create secret tls pipeops-tls \
--cert=path/to/tls.crt \
--key=path/to/tls.key \
-n pipeops-system
# Create Grafana admin password secret
kubectl create secret generic grafana-admin \
--from-literal=password="secure-grafana-password" \
-n pipeops-system
Reference secrets in configuration:
agent:
token:
secretKeyRef:
name: pipeops-token
key: token
security:
tls:
secretName: pipeops-tls
monitoring:
grafana:
admin_password:
secretKeyRef:
name: grafana-admin
key: password
Using External Secret Management¶
✅ Configuration Validation¶
Validate Configuration¶
# Validate configuration file
pipeops-agent config validate --config=/etc/pipeops/config.yaml
# Test connection with configuration
pipeops-agent config test --config=/etc/pipeops/config.yaml
# Show resolved configuration
pipeops-agent config show --config=/etc/pipeops/config.yaml
Configuration Schema¶
The agent supports JSON Schema validation:
# Download configuration schema
curl -O https://schemas.pipeops.io/agent/v2/config.schema.json
# Validate against schema
jsonschema -i config.yaml config.schema.json
🔧 Advanced Configuration¶
Custom Resource Definitions¶
Define custom monitoring targets:
# Custom CRD example
apiVersion: monitoring.pipeops.io/v1
kind: ServiceMonitor
metadata:
name: custom-app-monitor
spec:
selector:
matchLabels:
app: custom-app
endpoints:
- port: metrics
interval: 30s
path: /metrics
Plugin Configuration¶
Configure agent plugins:
plugins:
enabled:
- "network-monitor"
- "cost-optimizer"
- "security-scanner"
network-monitor:
config:
check_interval: "60s"
endpoints:
- "https://api.example.com"
- "https://db.example.com:5432"
cost-optimizer:
config:
scan_interval: "24h"
optimization_mode: "balanced" # aggressive, balanced, conservative
security-scanner:
config:
scan_interval: "12h"
vulnerability_threshold: "medium"
📝 Configuration Best Practices¶
1. Use Environment-Specific Configs¶
# Directory structure
/etc/pipeops/
├── config.yaml # Base configuration
├── environments/
│ ├── dev.yaml # Development overrides
│ ├── staging.yaml # Staging overrides
│ └── production.yaml # Production overrides
└── secrets/
├── tokens.yaml # Secret references
└── certificates/ # TLS certificates
2. Configuration Inheritance¶
# Base config.yaml
defaults: &defaults
agent:
heartbeat_interval: "30s"
timeout: "60s"
monitoring:
enabled: true
# Environment-specific configs
development:
<<: *defaults
agent:
log_level: "debug"
production:
<<: *defaults
agent:
log_level: "warn"
security:
tls:
enabled: true
3. Version Control Integration¶
# Git hooks for configuration validation
#!/bin/bash
# .git/hooks/pre-commit
pipeops-agent config validate --config=config.yaml
if [ $? -ne 0 ]; then
echo "Configuration validation failed"
exit 1
fi
🏥 Health Checks & Readiness¶
Truthful Readiness Probes¶
By default, the agent's /ready endpoint always returns 200 OK if the server is running (backward compatible). For production environments, you can enable truthful readiness probes that reflect the actual connection state.
When enabled, the /ready endpoint will:
- Return 200 OK when agent is connected and healthy
- Return 503 Service Unavailable when agent is disconnected from control plane
- Cause Kubernetes to stop routing traffic to unhealthy pods
- Trigger automatic pod replacement if connection cannot be restored
agent:
# Enable truthful readiness probes
truthfulReadinessProbe: true
# Optional: Configure heartbeat intervals
heartbeat:
intervalConnected: 30 # Heartbeat every 30s when connected
intervalReconnecting: 60 # Heartbeat every 60s when reconnecting (less noise)
intervalDisconnected: 15 # Heartbeat every 15s when disconnected (faster recovery)
Health Endpoints¶
The agent exposes multiple health check endpoints:
| Endpoint | Purpose | Behavior |
|---|---|---|
/health |
Liveness probe | Always returns 200 OK if server is running |
/ready |
Readiness probe | Returns 200 OK (or 503 if truthful mode + disconnected) |
/api/health/detailed |
Detailed health | Comprehensive health information with metrics |
/metrics |
Prometheus metrics | Metrics including connection state and heartbeat status |
Example Responses¶
Normal mode (backward compatible):
{
"status": "ready",
"timestamp": "2024-11-05T17:30:00Z",
"mode": "backward_compatible",
"connected": true,
"registered": true,
"connection_state": "connected",
"note": "Set PIPEOPS_TRUTHFUL_READINESS_PROBE=true for truthful readiness checks"
}
Truthful mode (healthy):
{
"status": "ready",
"healthy": true,
"connected": true,
"registered": true,
"connection_state": "connected",
"last_heartbeat": "2024-11-05T17:29:45Z",
"time_since_heartbeat": "15s",
"timestamp": "2024-11-05T17:30:00Z"
}
Truthful mode (unhealthy):
{
"status": "not_ready",
"healthy": false,
"connected": false,
"registered": true,
"connection_state": "disconnected",
"consecutive_failures": 5,
"time_since_heartbeat": "2m30s",
"reason": "agent not connected to control plane",
"timestamp": "2024-11-05T17:30:00Z"
}
Monitoring Connection Health¶
Use the detailed health endpoint for comprehensive diagnostics:
# Check detailed health
curl http://localhost:8080/api/health/detailed | jq .
# Check Prometheus metrics
curl http://localhost:8080/metrics | grep pipeops_agent
Key metrics:
- pipeops_agent_connection_state - Current connection state (0=disconnected, 3=connected)
- pipeops_agent_heartbeat_success_total - Total successful heartbeats
- pipeops_agent_heartbeat_failure_total - Total failed heartbeats
- pipeops_agent_unhealthy_duration_seconds - Time spent in unhealthy state
🔍 Troubleshooting Configuration¶
Common Configuration Issues¶
Agent fails to start with configuration errors
Check configuration syntax:
Common issues: - Invalid YAML syntax - Missing required fields - Incorrect data types - Invalid enum values
Secret references not resolving
Verify secret exists:
Check RBAC permissions:
Monitoring stack not starting
Check resource limits:
Verify storage class:
Debug Configuration Loading¶
Enable debug logging to see configuration loading:
Look for these log messages:
- Loading configuration from file
- Resolving environment variables
- Validating configuration schema
- Configuration loaded successfully
Next Steps: Architecture Guide | Advanced Monitoring