Frequently Asked Questions¶
Security & Gateway Proxy¶
Q: What does the agent do by default?¶
By default, the PipeOps agent provides secure admin access only:
- Establishes WebSocket tunnel to PipeOps control plane
- Enables secure cluster management without inbound firewall rules
- Does NOT expose your cluster externally
- Does NOT install monitoring components (when installed via manifest)
- Does NOT sync ingresses or register routes
Q: Is ingress sync (Gateway Proxy) enabled by default?¶
NO. For security reasons, the PipeOps Gateway Proxy feature is DISABLED by default (as of v1.x).
The agent will NOT expose your cluster externally unless you explicitly enable it with:
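(The enable_ingress_sync key is the documented switch; where it lives in your configuration file depends on how you deployed the agent, so treat this as a sketch.)

```yaml
# Opt in to ingress sync / gateway proxy (disabled by default)
enable_ingress_sync: true
```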
When disabled, the agent logs an "Ingress sync disabled" message at startup (see "How to check if ingress sync is enabled" under Advanced Configuration).
When enabled, the agent automatically:
- Detects if the cluster is private (no public LoadBalancer on ingress-nginx)
- Starts the ingress watcher
- Syncs all existing ingresses to the control plane
- Watches for new/updated/deleted ingresses and syncs them in real-time
You'll see:
{"level":"info","msg":"Ingress sync enabled - monitoring ingresses for gateway proxy"}
{"level":"info","msg":"Initializing gateway proxy detection..."}
{"level":"info","msg":"Private cluster detected - using tunnel routing"}
{"level":"info","msg":"Starting ingress watcher for gateway proxy"}
{"cluster_uuid":"...","ingress_count":4,"msg":"Syncing ingresses with controller"}
Q: When should I enable Gateway Proxy?¶
Enable the PipeOps Gateway Proxy only if you want to:
- Expose services in private clusters without VPN
- Use custom domains for cluster services
- Provide external access to applications via PipeOps gateway
For secure admin access only, keep it disabled (default).
Q: How does cluster detection work?¶
When gateway proxy is enabled, the agent checks the ingress-nginx-controller service:
- LoadBalancer with an External IP → public cluster → uses direct routing
- NodePort or no External IP → private cluster → uses tunnel routing
Note: Cluster detection only happens when enable_ingress_sync: true is set.
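To preview what the agent will decide, you can inspect the same service yourself (the ingress-nginx namespace is an assumption; adjust if your controller lives elsewhere):

```bash
# The TYPE and EXTERNAL-IP columns are what the detection logic keys on
kubectl get svc ingress-nginx-controller -n ingress-nginx
```

A TYPE of LoadBalancer with a populated EXTERNAL-IP means the agent treats the cluster as public and uses direct routing; NodePort or a missing external IP means private and tunnel routing.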
Installation & Component Auto-Install¶
Q: Why aren't monitoring components installed when I use Helm?¶
The agent's auto-installation feature is disabled by default for Helm and Kubernetes manifest deployments. This is intentional because we assume you're deploying to an existing cluster that may already have monitoring tools like Prometheus, Grafana, or Loki installed.
To enable auto-installation with Helm:
```bash
helm install pipeops-agent ./helm/pipeops-agent \
  --set agent.pipeops.token="your-token" \
  --set agent.cluster.name="my-cluster" \
  --set agent.autoInstallComponents=true   # Enable auto-install
```
Why this design?
- Fresh installations (bash script): auto-install is enabled for quick setup
- Existing clusters (Helm/K8s manifests): auto-install is disabled to prevent conflicts
- You keep full control based on your environment
See the Component Installation Behavior section for more details.
Q: What happens after ingress sync (when enabled)?¶
Routes are registered with the control plane:
- Private clusters: traffic routes through the WebSocket tunnel to the agent
- Public clusters: traffic routes directly to the LoadBalancer IP (3-5x faster)
The agent sends:
- routing_mode: "tunnel" or "direct"
- public_endpoint: LoadBalancer IP (if available) or empty
- cluster_uuid: Cluster identifier
- Ingress rules (host, path, service, port, TLS, annotations)
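For illustration, the registration data might look roughly like this when rendered as YAML; the nesting and any field names not listed above are assumptions:

```yaml
# Sketch only - structure and field names beyond those documented above are assumptions
cluster_uuid: "example-cluster-uuid"   # cluster identifier
routing_mode: "tunnel"                 # or "direct" for public clusters
public_endpoint: ""                    # LoadBalancer IP if available, otherwise empty
ingresses:                             # assumed wrapper for the synced ingress rules
  - host: app.example.com
    path: /
    service: my-app
    port: 80
    tls: true
    annotations:
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
```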
Agent Health & Heartbeat¶
Q: Why don't I see heartbeat/ping logs?¶
Heartbeats ARE running - they just log at different levels:
- Success: DEBUG level (not visible in INFO logs)
- Failure: WARN/ERROR level (visible)
You'll see heartbeat failures like:
{"error":"failed to send heartbeat: WebSocket not connected","level":"warning","msg":"Heartbeat failed, retrying with backoff..."}
But successful heartbeats are silent at INFO level.
Q: How can I confirm the agent is healthy?¶
Check for these INFO-level logs:
- Registration
- Connection state
- Ingress sync (if enabled)
- Prometheus metrics (if monitoring enabled)
If you see these periodically, the agent is healthy.
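A quick way to spot-check these from the command line (the pipeops namespace and pipeops-agent deployment name are assumptions; adjust to your install):

```bash
# Look for the key INFO-level events in the last hour of agent logs
kubectl logs -n pipeops deployment/pipeops-agent --since=1h \
  | grep -Ei 'register|connect|ingress sync|prometheus'
```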
Q: How often does the agent send heartbeats?¶
Every 30 seconds to match control plane expectations.
If heartbeat fails, it retries with exponential backoff (5s, 10s, 30s) up to 3 attempts.
Monitoring Stack¶
Q: Why do I keep seeing "Discovered Prometheus service" every 30 seconds?¶
This is normal and expected when the monitoring stack is enabled. The agent:
- Sends a heartbeat to the control plane every 30 seconds
- Each heartbeat includes monitoring information (Prometheus URL, credentials, etc.)
- To get this information, the agent discovers the Prometheus service dynamically
- Logs at INFO level when successfully discovered
Why dynamic discovery? Different Kubernetes distributions (K3s, managed clusters, vanilla K8s) deploy Prometheus with different service names. The agent detects the actual service name and port automatically.
Other services (Grafana, Loki) are discovered once at startup because they don't need to be included in heartbeat messages.
Note: This only happens if you've installed the monitoring stack. If you haven't enabled monitoring, you won't see these logs.
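To see which Prometheus service the agent would discover, list the services in the monitoring namespace (pipeops-monitoring is the namespace referenced elsewhere in this FAQ; yours may differ):

```bash
# Service names vary by distribution; the agent matches whatever is actually deployed
kubectl get svc -n pipeops-monitoring | grep -i prometheus
```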
Q: Can I disable these periodic logs?¶
Not directly, but you can:
- Reduce log level to WARN (hides INFO logs)
- Filter logs in your monitoring system (Loki/Grafana)
The logs themselves are harmless and indicate healthy monitoring.
Q: Why is only Prometheus logged repeatedly?¶
Because only Prometheus information is sent with each heartbeat (every 30 seconds) to the control plane. This allows the control plane to:
- Access Prometheus metrics via the tunnel or directly
- Monitor cluster health without polling
- Get real-time access credentials
Other services:
- Grafana: accessed via ingress proxy (discovered once at startup)
- Loki: logs forwarded by Promtail (no agent involvement)
Technical detail: The log appears in internal/components/manager.go::discoverPrometheusService() which is called by GetMonitoringInfo() on every heartbeat cycle.
Region Detection¶
Q: How does the agent detect region?¶
Detection order:
1. Node labels (most reliable): topology.kubernetes.io/region and provider-specific labels
2. Provider ID: aws://, gce://, azure://, etc.
3. Metadata service: AWS IMDSv2, GCP metadata, Azure IMDS
4. Local environment detection: K3s, kind, minikube, Docker Desktop
5. GeoIP detection: for bare-metal/on-premises clusters
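To see what the first two steps have to work with, you can inspect your nodes directly:

```bash
# Region label, if any node carries one
kubectl get nodes --show-labels | grep -o 'topology.kubernetes.io/region=[^,]*' | sort -u

# Provider ID prefixes (aws://, gce://, azure://, or empty on bare metal)
kubectl get nodes -o jsonpath='{range .items[*]}{.spec.providerID}{"\n"}{end}'
```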
Q: What if region can't be detected?¶
Defaults:
- Provider: "bare-metal" or "on-premises"
- Region: "on-premises" or "agent-managed"
- Registry Region: "us" (unless GeoIP detects Europe)
Q: How is registry region determined?¶
For cloud providers:
- EU regions (eu-west-1, eu-central-1, etc.) → "eu"
- All other regions → "us"
For bare-metal/on-premises:
- GeoIP: Europe + Africa → "eu"
- GeoIP: Other continents → "us"
- No GeoIP: "us" (default)
Troubleshooting¶
Agent not connecting to control plane¶
Check the agent logs for WebSocket connection or authentication errors.
Solutions:
1. Verify PIPEOPS_API_URL is correct
2. Check AGENT_TOKEN is valid
3. Ensure network connectivity to control plane
4. Check firewall rules (WebSocket requires outbound HTTPS)
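For a quick outbound connectivity test from inside the cluster (curlimages/curl is just a convenient public image; replace the placeholder with your actual PIPEOPS_API_URL):

```bash
# Verifies that pods can reach the control plane over outbound HTTPS
kubectl run pipeops-net-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sv https://<your-pipeops-api-url>
```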
Ingress sync not working¶
Check that enable_ingress_sync: true is set, then look for the "Ingress sync enabled" and ingress watcher messages shown earlier in this FAQ.
If the logs instead indicate a public cluster, your cluster has a public LoadBalancer and ingress sync is disabled (direct routing is used instead).
Monitoring stack not starting¶
Check:
1. Storage class available: kubectl get storageclass
2. Helm installed: helm version
3. CRDs installed: kubectl get crd | grep monitoring.coreos.com
4. Namespace exists: kubectl get ns pipeops-monitoring
WebSocket disconnections¶
```
{
  "error": "websocket: close 1006 (abnormal closure): unexpected EOF"
}
```

```
{
  "msg": "Attempting to reconnect to WebSocket",
  "base_delay": "4s",
  "jitter": "892ms",
  "total_delay": "4.892s",
  "next_delay": "8s"
}
```
This is normal - network hiccups, control plane restarts, etc. The agent auto-reconnects and re-registers.
Reconnection Behavior:
- Uses exponential backoff with jitter (±25%)
- Maximum retry delay: 15 seconds (caps at 15s after 6 failures)
- Typical reconnection time: 15-45 seconds for control plane outages
- Brief network blips (<5s): reconnects in ~1 second
Advanced Configuration¶
Enable verbose heartbeat logging¶
Set the agent log level to DEBUG, either in your configuration file or via an environment variable:
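The exact key and variable names depend on how the agent was deployed, so treat this as a sketch only:

```yaml
# configuration file - the key name here is an assumption
log_level: debug
```

```bash
# environment variable - the name here is an assumption; check your chart values or manifest
export LOG_LEVEL=debug
```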
How to enable ingress sync¶
Add to your configuration file:
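```yaml
# Opt in to ingress sync / gateway proxy (disabled by default)
enable_ingress_sync: true
```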
Or set via environment variable:
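(The variable name below is an assumption; check your deployment manifest for the exact name.)

```bash
export ENABLE_INGRESS_SYNC=true   # variable name is an assumption
```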
Then restart the agent:
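(Namespace and deployment name are assumptions; adjust to your install.)

```bash
kubectl rollout restart deployment/pipeops-agent -n pipeops
```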
How to check if ingress sync is enabled¶
Check agent logs on startup:
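(Namespace and deployment name are assumptions; adjust to your install.)

```bash
kubectl logs -n pipeops deployment/pipeops-agent | grep -i "ingress sync"
```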
Output examples:
- "Ingress sync disabled" = NOT exposing cluster
- "Ingress sync enabled" = Monitoring and exposing ingresses
Disable gateway proxy (force direct routing)¶
When ingress sync is enabled, the gateway proxy automatically detects your cluster type: private clusters use tunnel routing, while public clusters with LoadBalancers use direct routing. To turn the feature off entirely, set enable_ingress_sync: false (the default).
Custom Prometheus discovery interval¶
Not currently configurable - hard-coded to 30 seconds. However, discovery results are now cached for 5 minutes to reduce log frequency.
Log Levels Guide¶
| Level | What you see |
|---|---|
| ERROR | Critical failures only |
| WARN | Failures with retry, missing configs |
| INFO | Startup, registration, sync, discoveries (default) |
| DEBUG | Heartbeats, WebSocket messages, detailed flow |
| TRACE | Raw HTTP/WebSocket traffic, internal state |
Recommended: INFO (default) for production, DEBUG for troubleshooting.