Scrape Uptime
The Scrape Uptime page provides real-time monitoring of Prometheus scrape target health across your Kubernetes clusters. It helps you identify and troubleshoot metric collection issues by showing which targets are being scraped successfully and which are failing.
Overview
Scrape Uptime allows you to:
- Monitor the health status of all Prometheus scrape targets
- View uptime statistics and target availability
- Identify failing scrape targets quickly
- Group targets by job for better organization
- Track target labels and endpoints
Getting Started
Accessing Scrape Uptime
- Navigate to Alerts in the main menu
- Select Scrape Uptime from the Prometheus section
- Choose a cluster from the dropdown
Understanding the Dashboard
The Scrape Uptime page displays:
Summary Statistics
- Total Targets: Total number of scrape targets configured
- Healthy: Number of targets successfully being scraped (UP status)
- Unhealthy: Number of targets failing to be scraped (DOWN status)
- Uptime: Overall percentage of healthy targets
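As a rough illustration of how these four numbers relate, the sketch below derives them from a list of targets shaped like the `activeTargets` array returned by the Prometheus `/api/v1/targets` HTTP API. How the Scrape Uptime page computes them internally is not documented here. The function name `summarize_targets`, the one-decimal rounding, and the choice to count every non-"up" health value as unhealthy are assumptions made for the example.

```python
from typing import Dict, List


def summarize_targets(targets: List[dict]) -> Dict[str, float]:
    """Derive Total Targets, Healthy, Unhealthy and Uptime from a target list."""
    total = len(targets)
    # Assumption: only health == "up" counts as healthy; "down" and
    # "unknown" are both treated as unhealthy.
    healthy = sum(1 for t in targets if t.get("health") == "up")
    unhealthy = total - healthy
    # Avoid division by zero when no targets are configured.
    uptime = round(100.0 * healthy / total, 1) if total else 0.0
    return {"total": total, "healthy": healthy, "unhealthy": unhealthy, "uptime": uptime}


# Hypothetical sample data for illustration only.
sample_targets = [
    {"scrapeUrl": "http://10.0.0.1:8080/metrics", "health": "up"},
    {"scrapeUrl": "http://10.0.0.2:9100/metrics", "health": "up"},
    {"scrapeUrl": "http://10.0.0.3:9100/metrics", "health": "down"},
    {"scrapeUrl": "http://10.0.0.4:10250/metrics/cadvisor", "health": "up"},
]

summary = summarize_targets(sample_targets)
print(summary)
```

With three of four sample targets up, the Uptime value in this sketch is the healthy count divided by the total, expressed as a percentage.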
Target Details
Each scrape target shows:
- Status Icon: Visual indicator (green checkmark for UP, red error for DOWN)
- Endpoint: The URL being scraped for metrics
- Labels: Kubernetes labels associated with the target
- Health Chip: Current status (UP/DOWN)
Target Grouping
Targets are automatically grouped by their job label. Typical groups include:
- kube-state-metrics: Kubernetes cluster state metrics
- node-exporter: Node-level system metrics
- cadvisor: Container metrics
- Custom exporters: Application-specific metrics
Each group displays the number of targets it contains.
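The grouping described above can be sketched as a simple bucket-by-label pass. This is only an illustration, assuming each target carries a `labels` dictionary as in the Prometheus targets API; the helper name `group_by_job` and the "unknown" fallback bucket are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_job(targets: List[dict]) -> Dict[str, List[dict]]:
    """Group targets by the value of their `job` label."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for target in targets:
        # Assumption: targets without a job label go into an "unknown" bucket.
        job = target.get("labels", {}).get("job", "unknown")
        groups[job].append(target)
    return dict(groups)


# Hypothetical sample data for illustration only.
sample_targets = [
    {"labels": {"job": "node-exporter", "instance": "10.0.0.2:9100"}, "health": "up"},
    {"labels": {"job": "node-exporter", "instance": "10.0.0.3:9100"}, "health": "down"},
    {"labels": {"job": "kube-state-metrics", "instance": "10.0.0.1:8080"}, "health": "up"},
]

# Print each group with the number of targets it contains.
for job, members in sorted(group_by_job(sample_targets).items()):
    print(f"{job}: {len(members)} target(s)")
```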
Health Status
UP Status
A target shows UP status when:
- Prometheus successfully connects to the endpoint
- Metrics are being scraped without errors
- The target responds within the timeout period
DOWN Status
A target shows DOWN status when:
- The endpoint is unreachable
- Authentication fails
- The target times out
- Network connectivity issues occur
- The service is not running
Common Scrape Targets
Kubernetes System Targets
kube-state-metrics
Provides cluster-level metrics about Kubernetes objects:
- Pod states and conditions
- Deployment status
- Node information
- Resource quotas
node-exporter
Exposes hardware and OS metrics:
- CPU usage
- Memory utilization
- Disk I/O
- Network statistics
cadvisor
Reports container-level metrics:
- Container CPU and memory usage
- Network traffic per container
- Filesystem usage
Application Targets
Custom application exporters expose:
- Application-specific metrics
- Business metrics
- Custom performance indicators
Troubleshooting
Target is DOWN
If a target shows DOWN status:
- Check Service Availability
  - Verify the service is running:
    kubectl get pods -n <namespace>
  - Check service endpoints:
    kubectl get endpoints -n <namespace>
- Verify Network Connectivity
  - Ensure the Prometheus pod can reach the target
  - Check network policies and firewall rules
- Review Service Configuration
  - Confirm the service is exposing metrics on the correct port
  - Verify the metrics path (usually /metrics)
- Check Authentication
  - Ensure proper ServiceAccount permissions
  - Verify TLS certificates if using HTTPS
- Review Prometheus Logs
    kubectl logs -n monitoring prometheus-xxx
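When working through these checks, it can help to see the error Prometheus itself recorded for each failing target. The sketch below assumes you have the JSON body of a Prometheus `/api/v1/targets` response and lists every target whose health is not "up" together with its `lastError` field. The helper name `failing_targets`, the sample addresses, and the sample error text are hypothetical.

```python
import json
from typing import List, Tuple


def failing_targets(response_body: str) -> List[Tuple[str, str, str]]:
    """Return (job, scrapeUrl, lastError) for every target that is not up."""
    payload = json.loads(response_body)
    failures = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            job = target.get("labels", {}).get("job", "unknown")
            failures.append((job, target.get("scrapeUrl", ""), target.get("lastError", "")))
    return failures


# Hypothetical response body, trimmed to the fields used above.
sample_body = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node-exporter"}, "scrapeUrl": "http://10.0.0.2:9100/metrics",
         "health": "up", "lastError": ""},
        {"labels": {"job": "my-app"}, "scrapeUrl": "http://10.0.0.9:8080/metrics",
         "health": "down", "lastError": "connection refused"},
    ]},
})

for job, url, error in failing_targets(sample_body):
    print(f"{job} {url}: {error}")
```

The recorded error usually points to which of the checks above to start with, for example a refused connection versus a timeout or a TLS failure.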
All Targets are DOWN
If all targets show DOWN status:
- Check Prometheus Status
  - Verify Prometheus is running
  - Check Prometheus pod logs for errors
- Verify Service Discovery
  - Ensure Kubernetes service discovery is configured
  - Check ServiceMonitor or PodMonitor resources
- Review RBAC Permissions
  - Confirm Prometheus has proper ClusterRole permissions
  - Verify ServiceAccount bindings
Intermittent Failures
For targets that alternate between UP and DOWN:
- Check Resource Limits
  - Target pod may be experiencing CPU/memory throttling
  - Review pod resource requests and limits
- Network Stability
  - Look for network congestion or packet loss
  - Check for DNS resolution issues
- Scrape Timeout
  - Target may be slow to respond
  - Consider increasing scrape timeout in Prometheus config
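One way to tell an intermittent target apart from one that simply went down is to count how often its status flips within a window. The following is a minimal sketch, assuming you have a sequence of `up` samples (1 for UP, 0 for DOWN), one per scrape. The function names and the default threshold of three flips are assumptions, not part of the product.

```python
from typing import Sequence


def count_transitions(samples: Sequence[int]) -> int:
    """Count how many times consecutive samples differ (UP/DOWN flips)."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_flapping(samples: Sequence[int], threshold: int = 3) -> bool:
    """Assumption: `threshold` or more flips in the window means flapping."""
    return count_transitions(samples) >= threshold


# Hypothetical `up` samples (1 = UP, 0 = DOWN), one per scrape interval.
steady = [1, 1, 1, 0, 0, 0, 0, 0]
flappy = [1, 0, 1, 1, 0, 1, 0, 1]

print(count_transitions(steady), is_flapping(steady))
print(count_transitions(flappy), is_flapping(flappy))
```

A target with a single flip has most likely failed outright and fits the "Target is DOWN" checks, while many flips in a short window suggest the resource, network, or timeout causes listed here.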
Best Practices
Monitoring Strategy
- Regular Checks: Review scrape uptime daily to catch issues early
- Set Alerts: Create alerts for when targets go down
- Track Trends: Monitor uptime percentage over time
- Document Targets: Maintain documentation of what each target monitors
Target Configuration
- Consistent Labeling: Use consistent labels across targets for easier filtering
- Meaningful Names: Use descriptive job names that indicate the target's purpose
- Health Endpoints: Ensure all services expose a /metrics endpoint
- Timeout Settings: Configure appropriate scrape intervals and timeouts
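For the timeout setting in particular, Prometheus requires that a job's scrape timeout not exceed its scrape interval. The sketch below checks that relationship for a pair of duration strings. It is a simplified illustration: the helper names are hypothetical, and it only handles single-unit durations such as "10s" or "1m", not compound values like "1m30s".

```python
import re

# Seconds per supported unit. Assumption: only these four units are needed.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def to_seconds(duration: str) -> float:
    """Convert a single-unit duration such as "10s" or "1m" to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_fits_interval(scrape_interval: str, scrape_timeout: str) -> bool:
    """True when the timeout is no longer than the interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)


print(timeout_fits_interval("30s", "10s"))
print(timeout_fits_interval("15s", "1m"))
```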
Performance Optimization
- Scrape Intervals: Balance between data freshness and system load
- Metric Cardinality: Avoid high-cardinality labels that increase storage
- Target Count: Monitor the number of targets to prevent overload
- Resource Allocation: Ensure Prometheus has adequate resources
Integration with Alerts
Use Scrape Uptime data to:
- Create Alert Rules: Set up alerts for target failures
    up == 0
- Monitor SLOs: Track uptime as part of service level objectives
- Incident Response: Quickly identify affected services during outages
- Capacity Planning: Use uptime trends to plan infrastructure changes
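To make the SLO item concrete, the sketch below computes availability as the fraction of scrapes in which a target was up and compares it with a target ratio. It assumes a list of `up` samples (1 or 0) collected over the SLO window. The function names and the 0.99 default are hypothetical; inside Prometheus, a comparable ratio can be obtained with `avg_over_time` applied to the `up` metric.

```python
from typing import Sequence


def availability(samples: Sequence[int]) -> float:
    """Fraction of samples in which the target was UP (1)."""
    return sum(samples) / len(samples) if samples else 0.0


def meets_slo(samples: Sequence[int], slo: float = 0.99) -> bool:
    """Assumption: the SLO is a simple minimum availability ratio."""
    return availability(samples) >= slo


# Hypothetical window: 200 scrapes, 1 of which failed.
window = [1] * 199 + [0]
print(f"availability={availability(window):.4f} meets_slo={meets_slo(window)}")
```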
Refresh and Updates
- Auto-refresh: Data is fetched automatically each time you select a cluster
- Manual Refresh: Click the "Refresh" button to get the latest status
- Last Refreshed: Timestamp shows when data was last updated
Next Steps
- Configure Alert Rules for target failures
- Set up Probes for additional health checks
- Use PromQL Playground to query target metrics
- Configure Alert Routing for notifications