GET /v1/agents/{agent_id}/health

Check the health status of a deployed agent.

Request

GET https://api.run-agent.ai/v1/agents/{agent_id}/health
Authorization: Bearer YOUR_API_KEY

Path Parameters

agent_id
string
required

The unique identifier of the agent

Response

status
string
required

Overall health status: healthy, degraded, or unhealthy

checks
object
required

Individual health check results, grouped into agent, dependencies, and resources (see the response examples below)

version
string

Current agent version

uptime
number

Uptime in seconds

last_request
string

Timestamp in ISO 8601 format (UTC); appears in the healthy agent example below

warnings
array of strings

Human-readable warning messages; appears in the degraded agent example below

Examples

Basic Health Check

curl https://api.run-agent.ai/v1/agents/agent-123/health \
  -H "Authorization: Bearer YOUR_API_KEY"
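
The same request can be made from Python. The following is a minimal sketch using only the standard library; the names `BASE_URL`, `build_health_request`, and `get_agent_health` are illustrative assumptions, not part of an official client.

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL taken from the request shown above.
BASE_URL = "https://api.run-agent.ai/v1"


def build_health_request(agent_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for an agent's health endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}/agents/{quote(agent_id, safe='')}/health",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def get_agent_health(agent_id: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for non-2xx responses.
    """
    request = build_health_request(agent_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `get_agent_health("agent-123", "YOUR_API_KEY")` would return the parsed response body as a dictionary.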

Response Examples

Healthy Agent

{
  "status": "healthy",
  "checks": {
    "agent": {
      "status": "healthy",
      "response_time_ms": 45
    },
    "dependencies": {
      "openai_api": "healthy",
      "database": "healthy"
    },
    "resources": {
      "memory_usage_percent": 65,
      "cpu_usage_percent": 20
    }
  },
  "version": "1.2.3",
  "uptime": 3600,
  "last_request": "2024-01-01T12:00:00Z"
}

Degraded Agent

{
  "status": "degraded",
  "checks": {
    "agent": {
      "status": "healthy",
      "response_time_ms": 150
    },
    "dependencies": {
      "openai_api": "healthy",
      "database": "slow"
    },
    "resources": {
      "memory_usage_percent": 85,
      "cpu_usage_percent": 75
    }
  },
  "version": "1.2.3",
  "uptime": 7200,
  "warnings": ["High memory usage", "Database latency detected"]
}
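
To show how a client might read these responses, here is a small sketch that collects everything not reported as healthy. The function name `find_problems` is hypothetical, and the field layout is assumed to match the two examples above.

```python
def find_problems(health: dict) -> list:
    """Collect human-readable problems from a parsed health response."""
    problems = []
    checks = health.get("checks", {})

    # The agent check carries its own status field.
    agent_status = checks.get("agent", {}).get("status")
    if agent_status not in (None, "healthy"):
        problems.append(f"agent check is {agent_status}")

    # Each dependency maps a name to a state string such as "healthy" or "slow".
    for name, state in checks.get("dependencies", {}).items():
        if state != "healthy":
            problems.append(f"dependency {name} is {state}")

    # Warnings, when present, are already human-readable.
    problems.extend(health.get("warnings", []))
    return problems


degraded = {
    "status": "degraded",
    "checks": {
        "agent": {"status": "healthy", "response_time_ms": 150},
        "dependencies": {"openai_api": "healthy", "database": "slow"},
        "resources": {"memory_usage_percent": 85, "cpu_usage_percent": 75},
    },
    "warnings": ["High memory usage", "Database latency detected"],
}

print(find_problems(degraded))
# ['dependency database is slow', 'High memory usage', 'Database latency detected']
```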

Health Check Logic

The overall status is determined as follows:

  1. Healthy: All checks pass
  2. Degraded: Some checks show warnings but agent is functional
  3. Unhealthy: Critical checks fail
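
One way to act on the three statuses above is to map them to process exit codes for a cron job or a Nagios-style check. This is a sketch under that assumption; the 0/1/2/3 values follow the common OK/WARNING/CRITICAL/UNKNOWN plugin convention and are not defined by this API.

```python
# Exit codes in the OK / WARNING / CRITICAL / UNKNOWN convention (an assumption,
# not something the health endpoint itself specifies).
EXIT_CODES = {"healthy": 0, "degraded": 1, "unhealthy": 2}
UNKNOWN_EXIT_CODE = 3


def status_to_exit_code(health: dict) -> int:
    """Translate the overall status of a health response into an exit code."""
    status = health.get("status")
    if not isinstance(status, str):
        return UNKNOWN_EXIT_CODE
    return EXIT_CODES.get(status, UNKNOWN_EXIT_CODE)
```

A wrapper script could then call `sys.exit(status_to_exit_code(health))` so that a scheduler treats degraded and unhealthy agents differently.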

Monitoring Integration

Automated Monitoring

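import time

import requests

def send_alert(message):
    # Placeholder: replace with your own alerting integration (email, chat, pager)
    print(message)

def monitor_agent(agent_id, interval=60):
    while True:
        try:
            response = requests.get(
                f"https://api.run-agent.ai/v1/agents/{agent_id}/health",
                headers={"Authorization": "Bearer YOUR_API_KEY"},
                timeout=10,  # do not let a stalled connection block the loop
            )
            response.raise_for_status()  # report HTTP errors as failed checks

            health = response.json()

            if health['status'] != 'healthy':
                send_alert(f"Agent {agent_id} is {health['status']}")

        except Exception as e:
            send_alert(f"Health check failed: {e}")

        time.sleep(interval)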
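
The loop above sends an alert on every poll while the agent is not healthy. A possible variation, sketched here with a hypothetical `StatusTracker` class, reports only when the status changes, which keeps a long degraded period from producing one alert per interval.

```python
from typing import Optional


class StatusTracker:
    """Remember the last seen status and report only transitions."""

    def __init__(self) -> None:
        self.last_status: Optional[str] = None

    def update(self, agent_id: str, status: str) -> Optional[str]:
        """Return an alert message if one is warranted, otherwise None."""
        previous, self.last_status = self.last_status, status
        if previous is None:
            # First observation: stay quiet unless the agent starts out unwell.
            return None if status == "healthy" else f"Agent {agent_id} is {status}"
        if status != previous:
            return f"Agent {agent_id} changed from {previous} to {status}"
        return None
```

Inside the monitoring loop, the result of `tracker.update(agent_id, health['status'])` would be passed to the alerting hook only when it is not `None`.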

Prometheus Integration

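# Expose metrics for Prometheus
import requests
from prometheus_client import Gauge

agent_health = Gauge('agent_health_status', 'Agent health status', ['agent_id'])
memory_usage = Gauge('agent_memory_usage', 'Memory usage percentage', ['agent_id'])

# Numeric encoding of the status field so it can be graphed and alerted on
STATUS_VALUE = {'healthy': 1, 'degraded': 0.5, 'unhealthy': 0}

def get_agent_health(agent_id):
    # Fetch and decode the health response for one agent
    response = requests.get(
        f"https://api.run-agent.ai/v1/agents/{agent_id}/health",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        timeout=10,
    )
    return response.json()

def update_metrics(agent_id):
    health = get_agent_health(agent_id)

    # Unrecognized statuses are reported as 0 instead of raising KeyError
    agent_health.labels(agent_id=agent_id).set(STATUS_VALUE.get(health['status'], 0))

    memory = health['checks']['resources']['memory_usage_percent']
    memory_usage.labels(agent_id=agent_id).set(memory)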

Best Practices

  1. Regular Monitoring: Check health every 30-60 seconds
  2. Set Alerts: Alert on status changes
  3. Track Trends: Monitor resource usage over time
  4. Implement Retries: Handle temporary network issues
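
The retry recommendation above could be implemented with exponential backoff. The sketch below is one possible approach; `fetch_with_retries` and its parameters are illustrative names. It retries only on `OSError`, which is the base class of the connection and timeout errors raised by both `urllib` and `requests`.

```python
import time
from typing import Callable, Tuple, Type


def fetch_with_retries(
    fetch: Callable[[], dict],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it succeeds or the attempts are used up.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    The last failure is re-raised so the caller can alert on it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Here `fetch` is any zero-argument callable that performs the health request and returns the parsed body, for example a `lambda` wrapping the request code from the monitoring example. Passing `sleep` as a parameter keeps the function easy to test without real delays.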
