ClickHouse for Observability: Building Cost-Effective Metrics & Logs Pipelines
Replace costly metrics stores with ClickHouse: a practical architecture for high-cardinality metrics and logs, retention tiers, and query/rollup patterns.
Why your observability bill is out of control — and how ClickHouse fixes it
If you’re an SRE, platform engineer, or developer running production services in 2026, you’ve probably felt the sting: metrics and logs licensing, ingestion egress, and long-term retention costs spiraling out of control. High-cardinality telemetry (per-user, per-request, per-feature labels) is the usual culprit — it explodes series counts, storage, and query cost in purpose-built metric stores.
This article shows a proven, practical architecture to replace expensive metrics stores with ClickHouse for both metrics and logs at scale. You’ll get concrete ingestion patterns, table schemas, TTL and tiering strategies, pre-aggregation patterns for dashboards and alerts, and real-world trade-offs for 2026 observability operations.
Quick summary — the big ideas
- ClickHouse as long-term store + OLAP backend: Keep real-time short-term systems for scraping/alerts; use ClickHouse for long-term, high-cardinality analytics and dashboards.
- Hot-warm-cold storage: Use MergeTree TTLs and S3-backed disk policies to move old raw data to cheaper cold object storage while keeping rollups hot.
- Pre-aggregate aggressively: Materialized views with Summing/AggregatingMergeTree and daily/hourly rollups reduce query costs for dashboards.
- Ingest via streaming: Use Vector or Kafka to buffer/transform, then batch-insert to ClickHouse (Kafka engine, Buffer engine, or HTTP bulk inserts).
- Cardinality hygiene: Label filtering, LowCardinality types, dictionaries, and tag hashing are essential to prevent explosion.
The 2026 context you need to know
ClickHouse’s continued growth (notably a major funding milestone in late 2025 that accelerated ecosystem investment) has pushed rapid feature maturity: richer disk policies with S3 tiers, improved ingest engines, built-in Keeper for clustering, and tighter integrations with cloud-native ingestion tools. At the same time, managed metrics services became noticeably costlier in 2024–2026, which makes re-architecting telemetry storage a priority for cost-conscious teams.
"In late 2025 and early 2026, teams shifted more budget from managed TSDBs to OLAP backends and object storage for long-term telemetry retention." — industry signals and vendor adoption trends
High-level architecture
Design the pipeline in three layers: Ingest, Storage & Processing, and Query/Alerting. Each layer has components tuned for throughput, cost, and query patterns.
1) Ingest layer — durable, transformable streaming
- Agents & shippers: Vector (vector.dev), Fluent Bit / Fluentd for logs, Prometheus exporters and remote_write adapters for metrics.
- Stream buffer: Kafka (or managed Pub/Sub) for smoothing bursts and enabling replay.
- Transform & normalize: Use Vector or Kafka Streams to flatten JSON, drop noisy tags, and map labels to typed columns.
- Write targets: ClickHouse Kafka Engine or HTTP bulk inserts. For highest throughput, use the native TCP protocol with batch inserts.
Why Vector + Kafka?
By 2026 Vector is widely adopted as a fast, resource-efficient transform/shipper that supports schema enforcement, backpressure, and batching. Kafka remains the standard for durable buffering. This combination lets you: 1) pre-aggregate or drop unnecessary tags, 2) guarantee at-least-once delivery, and 3) scale ingestion independently of ClickHouse cluster capacity.
Example Ingest Flow
- Prometheus exporters -> Prometheus (short-term), and remote_write -> vector -> kafka/topic(metrics)
- Fluent Bit -> vector -> kafka/topic(logs)
- ClickHouse consumer reads kafka topics into raw tables, transforms via materialized views.
Schema design and ClickHouse table patterns
ClickHouse gives you flexibility. For observability, follow these principles:
- Store raw events as compact as possible (JSON fields parsed to typed columns where you query frequently).
- Use LowCardinality(String) for label columns whose values repeat from a bounded set (service names, region, instance_type); reserve hashing for genuinely high-cardinality labels.
- Partition by time (usually to the day) and choose a primary key that supports common query patterns: (date, service, metric_name, tags_hash).
- Pre-aggregate into rollup tables at 1m/5m/1h granularities using Aggregating/SummingMergeTree to make dashboards cheap and fast.
Raw metrics table example
CREATE TABLE metrics_raw
(
ts DateTime64(3),
service String,
metric_name String,
labels Nested(key String, value String),
value Float64,
labels_hash UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)
ORDER BY (toDate(ts), service, metric_name, labels_hash)
TTL ts + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;
Notes: store parsed labels as a Nested type for occasional queries, but populate frequently queried labels into top-level columns. The labels_hash is a normalized fingerprint (e.g., SipHash) of the label set to allow grouping while avoiding explosion of unique composite strings.
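A sketch of that fingerprint, assuming labels arrive as the parallel key/value arrays of the Nested column: sorting the key=value pairs before hashing makes the hash stable regardless of label order.

```sql
-- Order-insensitive label fingerprint: sort key=value pairs, then hash.
SELECT sipHash64(arrayStringConcat(
    arraySort(arrayMap((k, v) -> concat(k, '=', v), labels.key, labels.value)),
    ','
)) AS labels_hash
FROM metrics_raw;
```

In practice you would compute this once at ingest time (e.g. in Vector) and write it into the labels_hash column, rather than recomputing it per query.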
Rollup table example (5m)
CREATE TABLE metrics_5m
(
interval_start DateTime,
service String,
metric_name String,
labels_hash UInt64,
count SimpleAggregateFunction(sum, UInt64),
sum SimpleAggregateFunction(sum, Float64),
min SimpleAggregateFunction(min, Float64),
max SimpleAggregateFunction(max, Float64)
)
-- SummingMergeTree would also sum the min/max columns on merge;
-- AggregatingMergeTree with SimpleAggregateFunction merges each column correctly.
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMMDD(interval_start)
ORDER BY (toDate(interval_start), service, metric_name, labels_hash)
TTL interval_start + INTERVAL 365 DAY;
Populate metrics_5m with a Materialized View that ingests from metrics_raw and aggregates 5m buckets. Dashboards should query the rollup for queries older than the hot window.
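A minimal sketch of that materialized view, assuming the metrics_raw and metrics_5m definitions above (the view name is illustrative):

```sql
-- Continuously roll raw points into 5-minute buckets as they are inserted.
CREATE MATERIALIZED VIEW metrics_5m_mv TO metrics_5m AS
SELECT
    toStartOfFiveMinutes(ts) AS interval_start,
    service,
    metric_name,
    labels_hash,
    count() AS count,
    sum(value) AS sum,
    min(value) AS min,
    max(value) AS max
FROM metrics_raw
GROUP BY interval_start, service, metric_name, labels_hash;
```

The view only sees rows from each new insert, so a single 5-minute bucket accumulates as multiple partial rows that the engine merges in the background — which is why dashboard queries should still aggregate over interval_start rather than read rows verbatim.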
Ingest best practices & sample ClickHouse ingestion patterns
At scale, how you write to ClickHouse matters as much as schema design.
- Batch writes: Send large batches (10k–100k rows) to reduce per-insert overhead.
- Use Buffer or Kafka engines: The Buffer engine absorbs spikes; the Kafka engine pairs well with consumer groups.
- Dedup when necessary: Use ReplacingMergeTree with a version column to remove duplicates when you have at-least-once delivery.
- Backpressure: Vector and Kafka enable backpressure; avoid direct unbuffered writes from many agents.
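A sketch of a Buffer table in front of metrics_raw; the flush thresholds below are illustrative and should be tuned to your batch sizes:

```sql
-- Writes land here; ClickHouse flushes to metrics_raw when a threshold
-- is crossed: Buffer(db, table, num_layers,
--   min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
CREATE TABLE metrics_buffer AS metrics_raw
ENGINE = Buffer(currentDatabase(), metrics_raw,
    16, 10, 100, 10000, 1000000, 10000000, 100000000);
```

Agents insert into metrics_buffer instead of metrics_raw. Note the trade-off: the Buffer engine smooths spikes but holds rows in memory, so an unclean shutdown can lose unflushed data — which is why Kafka remains the durable path.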
Kafka engine consumer example
CREATE TABLE metrics_kafka
(
ts DateTime64(3),
service String,
metric_name String,
value Float64,
labels_hash UInt64
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092', kafka_topic_list = 'metrics', kafka_group_name = 'ch-consumer', kafka_format = 'JSONEachRow';
-- Labels are parsed and labels_hash computed upstream (e.g. in Vector),
-- so the JSONEachRow payload already maps onto typed columns.
CREATE MATERIALIZED VIEW to_metrics_raw TO metrics_raw AS
SELECT ts, service, metric_name, value, labels_hash
FROM metrics_kafka;
Retention & tiering — hot, warm, cold
The most powerful cost optimization is to keep raw, high-cardinality data hot only for a short window, then move either aggregated data or compressed raw events to cheaper storage.
- Hot (0–7 days): Raw events in local NVMe or SSD (fast query, low latency).
- Warm (7–90 days): Keep rollups and sampled raw events on cheaper HDD or slower nodes.
- Cold (90+ days): Move compressed raw files or partitioned ClickHouse data to S3/object storage using disk policies and TTL ... TO DISK, or export to Parquet on S3 and delete from ClickHouse.
Example TTL and move-to-disk
ALTER TABLE metrics_raw
MODIFY TTL
ts + INTERVAL 7 DAY TO VOLUME 'warm',   -- new parts start on the 'hot' volume
ts + INTERVAL 90 DAY TO DISK 's3_archive',
ts + INTERVAL 365 DAY DELETE;
ClickHouse supports disk policies where volumes map to local disks and S3. Use compression codecs like ZSTD(3) for colder data to reduce storage costs while keeping read performance reasonable for ad-hoc historical analysis.
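For example, a heavier codec can be applied per column (the column and codec choice here are illustrative):

```sql
-- Recompress the value column with stronger ZSTD; applies to new parts,
-- and to existing parts as they are re-merged or explicitly optimized.
ALTER TABLE metrics_raw MODIFY COLUMN value Float64 CODEC(ZSTD(3));
```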
Query strategies for dashboards & alerts
Ad-hoc SELECTs across high-cardinality raw tables are expensive. Use these proven strategies:
- Prefer rollups: Grafana dashboards should query metrics_1m/5m/1h tables depending on time range.
- Query pruning: Partition/primary-key your tables so queries include time and service filters to hit fewer parts.
- Approximate functions: Use uniqCombined64 / quantileTDigest for cardinality and percentile estimations in dashboards where exact values aren't required.
- Materialize alert conditions: For frequent alerts, precompute indicators and write them to a lightweight table that Alertmanager or your alerting engine can poll.
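A sketch of the "materialize alert conditions" pattern; the table name, indicator name, and window are illustrative, and the INSERT would be driven on a schedule (cron, or a refreshable materialized view on recent ClickHouse versions):

```sql
-- Small, hot table the alerting engine polls.
CREATE TABLE alert_indicators
(
    evaluated_at DateTime,
    service String,
    indicator LowCardinality(String),
    value Float64
)
ENGINE = MergeTree
ORDER BY (service, indicator, evaluated_at)
TTL evaluated_at + INTERVAL 1 DAY;

-- Recompute the indicator over the last 5 minutes of raw data.
INSERT INTO alert_indicators
SELECT now(), service, 'p95_http_latency', quantileTDigest(0.95)(value)
FROM metrics_raw
WHERE metric_name = 'http_request_latency'
  AND ts >= now() - INTERVAL 5 MINUTE
GROUP BY service;
```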
Sample dashboard query (5m rollup)
SELECT
interval_start,
-- aggregate even on rollups: background merges may not have collapsed all rows yet
sum(sum) / sum(count) AS avg_value
FROM metrics_5m
WHERE metric_name = 'http_request_latency' AND service = 'checkout' AND interval_start BETWEEN now() - INTERVAL 7 DAY AND now()
GROUP BY interval_start
ORDER BY interval_start;
Handling logs in ClickHouse
ClickHouse can be an excellent back-end for logs when combined with smart ingestion and indexing:
- Store structured fields as columns: timestamp, service, severity, trace_id, span_id, user_id (if needed and privacy-compliant).
- Keep raw message compressed: Store the full message as a compressed String or in a separate cold archive to allow occasional retrievals.
- Use tokenbf_v1 / ngrambf_v1 indexes: ClickHouse provides bloom-filter based data-skipping indexes for fast token and substring search on specific fields.
- Retention policy: Keep 7–14 days of raw logs hot for troubleshooting and 90–365 days of structured indexes/aggregates for business analytics.
Log table example
CREATE TABLE logs_raw
(
ts DateTime64(3),
service String,
level LowCardinality(String),
trace_id Nullable(String),
message String CODEC(ZSTD(3)),
INDEX message_tokens message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)
ORDER BY (toDate(ts), service, level);
Use a separate lightweight "log_index" table that stores parsed fields and references to S3 objects or chunk ids for raw archives — this pattern keeps frequent queries cheap while retaining full-fidelity data off disk.
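Assuming a tokenbf_v1 index on the message column as described above, a query shaped like the following can skip granules instead of scanning every message (the search term is illustrative):

```sql
-- hasToken matches whole tokens, which is exactly what tokenbf_v1 indexes.
SELECT ts, service, message
FROM logs_raw
WHERE service = 'checkout'
  AND ts >= now() - INTERVAL 1 DAY
  AND hasToken(message, 'timeout')
ORDER BY ts DESC
LIMIT 100;
```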
Cardinality control: the single most important operational discipline
High-cardinality labels are the reason metric stores fail economically. Adopt a deliberate approach:
- Inventory label cardinality — export top label values and decide which to keep.
- Apply relabeling and drop labels at the agent level for noisy fields (container_id, request_id).
- Hash sparse labels into a single labels_hash column when you need composite equality but not label-level slicing.
- Use LowCardinality(String) for repeatable strings to reduce memory and index size.
- Convert some fields to dictionaries (ClickHouse dictionaries) for cheap lookups rather than repeated strings in the table.
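Step one above — inventorying label cardinality — can be run directly against recent raw data; a sketch (the window and limit are illustrative):

```sql
-- Distinct values per label key over the last day; high counts are
-- candidates for dropping at the agent or hashing into labels_hash.
SELECT
    key,
    uniqCombined64(value) AS distinct_values
FROM metrics_raw
ARRAY JOIN labels.key AS key, labels.value AS value
WHERE ts >= now() - INTERVAL 1 DAY
GROUP BY key
ORDER BY distinct_values DESC
LIMIT 20;
```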
Security, compliance and operational notes
- Encryption: Use TLS for client-server traffic, and S3 server-side encryption or client-side encryption for cold buckets.
- Access control: Enable ClickHouse RBAC and integrate with LDAP/OIDC for team separation of duties.
- Auditability: Use query logging and separate audit tables; purge according to compliance timelines.
- Backup & restore: Snapshot partitions to S3 and keep a catalog of exported Parquet files for disaster recovery.
Operational trade-offs
ClickHouse is extremely cost-efficient for high-cardinality analytics, but it’s not a drop-in replacement for a short-term scraping TSDB like Prometheus when it comes to scrape cycles, local alerting, and instantaneous rule evaluation. The recommended approach in 2026 is hybrid:
- Keep Prometheus or Cortex for short-term scraping and immediate alerts.
- Use ClickHouse as the long-term store and analytics engine, feeding it via remote_write adapters or Kafka.
- For extremely low-latency alerting, precompute indicators in ClickHouse and stream them to the alerting system.
Migration checklist — from managed metrics to ClickHouse
- Map your existing metric labels and measure cardinality.
- Design raw and rollup schemas; choose retention windows.
- Build Vector/Prometheus remote_write adapters into Kafka.
- Start dual-writing: keep current system as source-of-truth while writing to ClickHouse.
- Run dashboards in shadow mode against ClickHouse and tune rollups.
- Cut over queries and alerts incrementally, keeping a rollback path.
Measuring success — key KPIs
- Cost per GB/month for retained observability data (target: reduce by 50% vs managed service).
- Dashboard latency (95th percentile): target < 2s for commonly used views.
- Alert reliability: false positives / false negatives should remain stable or improve.
- Storage efficiency (compression ratio + cardinality reduction).
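The storage-efficiency KPI can be read straight from system tables; a sketch:

```sql
-- Compression ratio and on-disk footprint per table, from active parts.
SELECT
    table,
    sum(data_uncompressed_bytes) / sum(data_compressed_bytes) AS compression_ratio,
    formatReadableSize(sum(data_compressed_bytes)) AS on_disk
FROM system.parts
WHERE active AND database = currentDatabase()
GROUP BY table
ORDER BY sum(data_compressed_bytes) DESC;
```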
Real-world example — an ecommerce platform
A 2025–2026 migration example: an ecommerce company sent Prometheus remote_write to Kafka, used Vector to normalize labels, consumed into ClickHouse raw tables, and created 1m and 5m rollups. They retained raw events for 14 days, rollups for 365 days and archived monthly raw partitions to S3. Result: 60% lower monthly observability cost, 5x faster multi-dimensional queries for service-level KPIs, and stable alerting by materializing the most important signals into a lightweight alerts table.
Actionable checklist — get started in 30 days
- Deploy a single-node ClickHouse for staging and a Kafka topic for telemetry.
- Install Vector and configure: Prometheus remote_write -> Kafka; Fluent Bit -> Kafka.
- Create a raw metrics table, and a 5m rollup materialized view.
- Dual-write dashboards (existing service + ClickHouse) and compare query costs and latency.
- Iterate: add TTLs, disk policies, and cardinality filters; measure cost savings.
Key takeaways
- ClickHouse is a cost-effective OLAP backend for high-cardinality observability when paired with streaming ingestion and aggressive rollups.
- Control cardinality early — relabel aggressively, use low_cardinality types, and dictionaries.
- Use hot-warm-cold tiers with TTLs and S3 to reduce long-term storage costs without losing analytics capability.
- Precompute rollups for dashboards and alerts to keep queries cheap and fast.
Further reading & tools in the ecosystem (2026)
- Vector (vector.dev) — universal telemetry router/transformation
- ClickHouse Kafka engine & Buffer engine — ingestion patterns
- Grafana ClickHouse datasource plugin — dashboarding
- Prometheus remote_write adapters for ClickHouse — community projects
Final thoughts & next step
Moving metrics and logs to ClickHouse is not a silver bullet — it requires disciplined schema design and operational practices. But for teams facing exploding observability bills and needing flexible, high-cardinality analytics in 2026, ClickHouse offers a pragmatic, battle-tested path to dramatic cost savings and query performance.
Ready to prototype? Start with a staging ClickHouse cluster, a Kafka topic, and Vector in front of your exporters. If you want a vetted reference architecture and Terraform/IaC templates tuned for cost and scale, we publish tested patterns and starter kits that accelerate migrations.
Call to action
Get the opensoftware.cloud ClickHouse Observability Starter Pack: Terraform + Vector pipelines + ClickHouse schema templates. Deploy in a day, validate in production, and cut your observability costs — visit our resources page or contact our engineering team for a migration review and cost projection.