Scaling VictoriaMetrics for High-Cardinality Observability: A Production Optimization Journey

Published on August 5, 2025 by Andrey Gubarev

When observability tools like Coroot start tracking network connections at the container level, the resulting metric cardinality can quickly overwhelm your time series database. We recently faced this challenge in production, where network monitoring generated over 155,000 time series that brought our VictoriaMetrics cluster to its knees.

This article documents our systematic optimization journey, showing how strategic scaling and configuration tuning cut our vmselect footprint by 40% and total infrastructure cost by roughly 25% while improving query performance.

What is VictoriaMetrics and why does high cardinality matter?

VictoriaMetrics is a high-performance time series database that excels at handling large volumes of metrics. However, like all time series databases, it faces challenges when dealing with high-cardinality metrics—metrics with many unique label combinations.

High cardinality becomes problematic because each unique combination of labels creates a separate time series. When monitoring network connections, labels like source IP, destination IP, and port create exponential combinations, rapidly multiplying the number of series.
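
As a rough illustration, a single connection-tracking metric fans out like this (the label names below are simplified for the example, not Coroot's exact schema):

container_net_tcp_active_connections{container="api", destination="10.0.3.17:5432"} 1
container_net_tcp_active_connections{container="api", destination="10.0.3.18:5432"} 1
container_net_tcp_active_connections{container="worker", destination="203.0.113.9:443"} 3

# e.g. 500 containers, each talking to ~300 distinct destinations,
# already yields ~150,000 series from a single metric name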

The Challenge: Network Observability at Scale

Our production environment uses Coroot for deep application observability, including container-level network tracking. This visibility comes at a cost—the sheer volume of network connection data creates massive metric cardinality.

Initial Metric Analysis

A cardinality analysis revealed the scope of our challenge:

  • container_net_tcp_successful_connects_total: 79,725 series
  • container_net_tcp_active_connections: 38,214 series
  • container_net_latency_seconds: 37,657 series
  • container_net_tcp_listen_info: 10,513 series

Total across these four metrics: ~166,100 time series

These metrics track every TCP connection, including ephemeral client ports and public internet connections, creating an explosion of unique label combinations.
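
One way to reproduce this kind of per-metric breakdown is VictoriaMetrics' TSDB status endpoint (a sketch, assuming vmselect is port-forwarded to localhost:8481 as shown in the next section):

curl -s "http://localhost:8481/select/0/prometheus/api/v1/status/tsdb?topN=10" \
  | jq '.data.seriesCountByMetricName'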

How do you analyze VictoriaMetrics performance bottlenecks?

Before optimization, we needed systematic analysis tools. Here’s our troubleshooting methodology:

Performance Analysis Commands

Port-forward vmselect for direct API access:

kubectl port-forward -n coroot svc/vmselect-coroot-vm 8481:8481

Test query response times:

# Basic performance test
curl -w "Response Time: %{time_total}s" -s \
  "http://localhost:8481/select/0/prometheus/api/v1/query?query=up"

# Check for slow queries
curl -s http://localhost:8481/metrics | grep vm_slow_queries_total

# Review performance metrics
curl -s http://localhost:8481/select/0/prometheus/api/v1/status/top_queries

Analyze metric cardinality:

# Count high-cardinality network metrics
curl -s "http://localhost:8481/select/0/prometheus/api/v1/query?query=container_net_tcp_successful_connects_total" 
  | jq '.data.result | length'
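
Pulling every series just to count them is itself expensive at this cardinality; a lighter alternative (same port-forward assumptions) is to let the database do the counting:

curl -s "http://localhost:8481/select/0/prometheus/api/v1/query?query=count(container_net_tcp_successful_connects_total)" \
  | jq -r '.data.result[0].value[1]'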

This analysis revealed query timeouts and resource exhaustion patterns that guided our optimization strategy.

Strategic Optimization Phases

Phase 1: Initial Scale-Up (Reactive Approach)

Our first response was traditional vertical scaling:

# vmselect scaling
resources:
  requests:
    memory: 4Gi → 8Gi
  limits:
    memory: 16Gi → 32Gi

# Query parameter increases
search.maxSamplesPerQuery: "50000000" → "100000000"

Results: Temporary relief, but unsustainable resource consumption.

Phase 2: Distributed Architecture (Scale-Out)

We shifted to horizontal scaling with more, smaller nodes:

vmstorage:
  replicaCount: 8 → 16
  storage: 256Gi → 96Gi per node
  memory: 20Gi → 16Gi per node

Benefits:

  • Better fault tolerance (16 smaller failure domains vs 8 large ones)
  • Improved query parallelization
  • More efficient resource utilization
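
With the new topology in place, a quick sanity check that all 16 vmstorage replicas are running (resource and label names assume the VictoriaMetrics operator's defaults for a VMCluster named coroot-vm, matching the vmselect service used earlier):

kubectl -n coroot get sts vmstorage-coroot-vm
kubectl -n coroot get pods -l app.kubernetes.io/name=vmstorage -o wide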

Phase 3: Configuration Tuning

Fine-tuned parameters based on workload analysis:

vmselect:
  search.maxSamplesPerQuery: "500000000"  # 5x increase
  search.maxMemoryPerQuery: "8GB"
  search.maxConcurrentRequests: "32"
  search.maxQueryDuration: "60s"

vmstorage:
  storage.cacheSizeIndexDBIndexBlocks: "6GB"
  storage.cacheSizeIndexDBDataBlocks: "3GB"
  storage.cacheSizeStorageTSID: "2GB"
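
These are plain vmselect/vmstorage command-line flags; if the cluster is managed through the VictoriaMetrics operator, they are typically passed as extraArgs on the VMCluster resource. A sketch (not our exact manifest):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: coroot-vm
  namespace: coroot
spec:
  vmselect:
    extraArgs:
      search.maxSamplesPerQuery: "500000000"
      search.maxMemoryPerQuery: "8GB"
      search.maxConcurrentRequests: "32"
      search.maxQueryDuration: "60s"
  vmstorage:
    extraArgs:
      storage.cacheSizeIndexDBIndexBlocks: "6GB"
      storage.cacheSizeIndexDBDataBlocks: "3GB"
      storage.cacheSizeStorageTSID: "2GB"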

Phase 4: Data Management Optimization

Reduced retention and optimized ingestion:

spec:
  retentionPeriod: "1" → "3d"  # a bare "1" means 1 month; reduced to 3 days

vminsert:
  maxInsertRequestSize: "134217728"  # 128MB (4x increase)
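
To confirm the retention change actually landed on the running cluster (assuming an operator-managed VMCluster named coroot-vm):

kubectl -n coroot get vmcluster coroot-vm -o jsonpath='{.spec.retentionPeriod}{"\n"}'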

Why did we choose a distributed storage approach?

The decision to scale from 8 to 16 storage nodes wasn’t just about capacity—it was about resilience and performance:

Performance Benefits:

  • Query parallelization across more nodes
  • Reduced memory pressure per node
  • Better cache hit rates with smaller working sets

Operational Benefits:

  • Faster recovery from node failures (smaller data volumes to replicate)
  • More flexible resource allocation
  • Reduced blast radius of individual node issues

Resource Efficiency:

  • 16Gi memory × 16 nodes = 256Gi total (up from 160Gi, traded for lower per-node pressure)
  • 96Gi storage × 16 nodes = 1.5TB total (vs 2TB before)
  • 25% storage reduction while maintaining performance

Coroot Agent Optimization

Network metric explosion required source-level optimization:

env:
  # Filter ephemeral client ports
  - name: EPHEMERAL_PORT_RANGE
    value: "32768-65535"

  # Exclude health check endpoints
  - name: EXCLUDE_HTTP_REQUESTS_BY_PATH
    value: "/health,/healthz,/ready,/metrics,/prometheus"

  # Reduce scraping frequency
  - name: SCRAPE_INTERVAL
    value: "60s"

  # Disable expensive features
  - name: DISABLE_LOG_PARSING
    value: "true"

These changes reduced network metric cardinality by ~40% without losing critical observability.
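
To verify the drop, a hedged spot check of the total network-series count before and after the agent rollout (reusing the vmselect port-forward):

curl -s -G "http://localhost:8481/select/0/prometheus/api/v1/query" \
  --data-urlencode 'query=count({__name__=~"container_net_.*"})' \
  | jq -r '.data.result[0].value[1]'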

What were the final optimization results?

Resource Utilization

Before optimization:

  • vmselect: 5 replicas × 32Gi memory = 160Gi total
  • vmstorage: 8 replicas × 20Gi memory = 160Gi total
  • Total memory: 320Gi

After optimization:

  • vmselect: 3 replicas × 32Gi memory = 96Gi total
  • vmstorage: 16 replicas × 16Gi memory = 256Gi total
  • Total memory: 352Gi (slight increase for better performance)

Key Improvement: 40% reduction in vmselect resources while handling higher query complexity.

Query Performance

  • Direct API queries: < 1 second (vs previous timeouts)
  • Cache hit rate: > 60%
  • No timeout errors in application logs
  • Resource utilization: < 80% of limits under normal load

Cost Optimization

  • Retention reduction: 1 month → 3 days (90% storage cost reduction)
  • Total infrastructure cost: ~25% reduction
  • Improved query reliability eliminated alerting noise and operational overhead

Monitoring and Maintenance

Performance Monitoring Commands

# Resource utilization
kubectl top pods -n coroot -l app.kubernetes.io/name=vmselect

# Cache performance
curl -s http://localhost:8481/metrics | grep -E "vm_cache.*total"

# Query performance
curl -s http://localhost:8481/metrics | grep vm_slow_queries_total
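
The cache counters above are cumulative, so a single snapshot only gives a rough lifetime ratio. As a quick sketch, this aggregates vm_cache_requests_total and vm_cache_misses_total across all cache types on the vmselect instance you are port-forwarded to:

curl -s http://localhost:8481/metrics \
  | awk '/^vm_cache_requests_total/ {req += $2}
         /^vm_cache_misses_total/  {miss += $2}
         END {printf "aggregate hit rate: %.1f%%\n", 100 * (1 - miss / req)}'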

Key Performance Indicators

  • Query response time: < 2 seconds for 95th percentile
  • Cache hit rate: > 60% for IndexDB blocks
  • Resource utilization: < 80% memory usage under normal load
  • Error rate: < 0.1% failed queries
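
These thresholds can be codified as alerts. A minimal vmalert/Prometheus-style rule sketch for the slow-query signal (it assumes the cluster components' own /metrics are scraped into the TSDB; the threshold and labels are placeholders to tune):

groups:
  - name: victoriametrics-kpis
    rules:
      - alert: VMSlowQueries
        expr: increase(vm_slow_queries_total[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VictoriaMetrics is logging slow queries"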

Lessons Learned

What worked well:

  1. Systematic analysis before optimization prevented guesswork
  2. Horizontal scaling provided better resilience than vertical scaling
  3. Source-level optimization (Coroot configuration) had the highest impact
  4. Iterative approach allowed safe rollbacks and gradual improvements

What we’d do differently:

  1. Start with cardinality analysis rather than reactive scaling
  2. Implement retention policies earlier in the process
  3. Monitor cache hit rates as a leading performance indicator
  4. Document troubleshooting commands from day one

Why this approach scales:

This optimization methodology applies beyond VictoriaMetrics to any high-cardinality observability challenge. The key is balancing resource efficiency with operational requirements through systematic analysis and iterative improvement.


Need help on DevOps or observability? Reach out at andrey@andreygubarev.com.