Realtime Hosting Metrics with ClickHouse: Pipeline, Schemas, and Cost Controls
Blueprint for ingesting high-cardinality hosting telemetry into ClickHouse — pipelines, schema patterns, and cost controls for realtime metrics.
If your hosting telemetry pipeline is drowning in high-cardinality metrics, unpredictable costs, and slow queries, you need a blueprint that treats ClickHouse as more than a target store: it must be the engine of an end-to-end realtime telemetry platform. This guide delivers pragmatic schema patterns, a robust ingest pipeline, and cost-control strategies you can apply in production in 2026.
Why ClickHouse for hosting telemetry in 2026?
ClickHouse continues to be a dominant choice for telemetry and observability workloads because it offers columnar OLAP performance at extreme ingest rates and sub-second analytical queries. The project's growth — including a major funding round that accelerated cloud and operator ecosystems in late 2025 — means improved managed offerings and richer tooling for telemetry pipelines (Bloomberg, Dina Bass, 2025).
High-level architecture: source to realtime queries
At a glance, a resilient realtime telemetry pipeline for hosting metrics looks like this:
- Sources: containers, serverless functions, web servers, load balancers, agents (Prometheus exporters, host agents)
- Shippers: Vector / Fluent Bit / Logstash / Fluentd (lightweight edge), or direct SDKs for structured telemetry
- Streaming layer: Kafka / Redpanda / Pulsar (buffers bursts and provides durability)
- ClickHouse ingestion: Kafka engine & Materialized Views or HTTP/Bulk inserts to MergeTree tables
- Downstream: rollups, aggregated tables, dashboards, alerting
Why use a streaming layer?
Buffering with Kafka or Redpanda decouples producers from ClickHouse and provides backpressure handling. For high-cardinality, bursty hosting telemetry (e.g., millions of events per second during a deployment), this separation prevents failed inserts and enables controlled bulk writes to ClickHouse.
Ingest patterns: reliable, scalable, and low-latency
Two production-proven patterns work well:
1) Kafka engine + Materialized View -> MergeTree
Use ClickHouse's Kafka table engine to consume topics and a materialized view to insert parsed events into a MergeTree table. This pattern gives you at-least-once delivery semantics and backpressure isolation.
-- Kafka source table
-- One String column holding the raw JSON payload; kafka_format is required
CREATE TABLE telemetry_kafka (
    value String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'telemetry',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'JSONAsString';
-- Parse JSON and insert into MergeTree
CREATE MATERIALIZED VIEW mv_telemetry TO telemetry_raw AS
SELECT
    parseDateTime64BestEffort(JSONExtractString(value, 'ts')) AS event_time,
    JSONExtractUInt(value, 'site_id') AS site_id,
    JSONExtractUInt(value, 'host_id') AS host_id,
    JSONExtractString(value, 'path') AS path,
    JSONExtractUInt(value, 'status') AS status_code,
    JSONExtractUInt(value, 'latency_ms') AS latency_ms,
    JSONExtract(value, 'tags', 'Array(String)') AS tags
FROM telemetry_kafka;
This approach supports high parallelism and lets ClickHouse apply its merge logic efficiently.
2) Bulk HTTP / Native Insert with batched clients
For smaller fleets or serverless producers that can batch, use ClickHouse's HTTP or native protocol with batched JSON/CSV inserts. Batches of 5–50k rows per insert provide a good balance of latency and throughput.
Example with curl (JSONEachRow):
curl -sS 'http://clickhouse:8123/?query=INSERT%20INTO%20telemetry_raw%20FORMAT%20JSONEachRow' \
-H 'Content-Type: application/json' --data-binary @batch.json
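The same batched insert can be issued from any HTTP client. A minimal Python sketch using only the standard library (the host, table name, and batch sizes here are illustrative, not prescriptive):

```python
import json
import urllib.parse
import urllib.request

def to_jsoneachrow(rows):
    """Serialize a batch of dicts to ClickHouse JSONEachRow (NDJSON) bytes."""
    return ("\n".join(json.dumps(r, separators=(",", ":")) for r in rows) + "\n").encode()

def insert_batch(rows, table="telemetry_raw", base_url="http://clickhouse:8123/"):
    """POST one batch; callers should buffer roughly 5k-50k rows before flushing."""
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    req = urllib.request.Request(
        base_url + "?query=" + urllib.parse.quote(query),
        data=to_jsoneachrow(rows),
        headers={"Content-Type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 on success
```

The serialization and the flush threshold are the parts worth getting right: one compact NDJSON body per request keeps ClickHouse forming large parts instead of churning through tiny inserts.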
Schema design: optimizing for high-cardinality telemetry
High-cardinality telemetry (unique path+query strings, request IDs, session IDs) demands a schema that optimizes for common queries while controlling storage and index overhead. Apply these principles:
- Partition by time at a granularity that matches your retention and query patterns (daily or hourly).
- Order by columns you filter by most (site_id, host_id, event_time). The ORDER BY key enables efficient range scans.
- Use LowCardinality for repeated strings like site names, host names, status codes — but avoid it for highly unique fields.
- Externalize very high-cardinality dimensions (long URLs, full user agents) into separate storage or dictionaries and reference them by ID.
- Design rollup/aggregate tables for standard query shapes (per-minute, per-host latency percentiles).
Example schema for hosting telemetry
CREATE TABLE telemetry_raw (
event_time DateTime64(3),
site_id UInt32,
host_id UInt32,
path_id UInt64, -- reference to paths table
status_code UInt16,
latency_ms UInt32,
client_hash UInt64, -- hashed client IP
tags Array(String),
raw_path String -- optional, consider TTL to drop or move
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (site_id, host_id, event_time)
SETTINGS index_granularity = 8192;
Notes:
- path_id: store canonicalized paths in a separate dictionary table to reduce cardinality in the raw table.
- client_hash: store client IP as a hash for privacy and to reduce index pressure.
- raw_path: keep unindexed raw data only when necessary and apply TTL/move rules.
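One way to derive the client_hash column is a salted, keyed hash truncated to 64 bits; this sketch uses stdlib blake2b, and the salt value is a placeholder you would manage as a secret:

```python
import hashlib

def client_hash(ip: str, salt: bytes = b"per-deployment-secret") -> int:
    """Hash a client IP to a salted 64-bit int for the client_hash column.

    Keying the hash prevents trivial rainbow-table reversal of the small
    IPv4 space; rotating the salt breaks long-term linkability if your
    privacy policy requires it.
    """
    h = hashlib.blake2b(ip.encode(), key=salt, digest_size=8)
    return int.from_bytes(h.digest(), "big")  # fits ClickHouse UInt64
```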
Dimension tables and dictionaries
For high-cardinality string dimensions (URL paths, user agents), use a two-table pattern:
- Paths table: maps path -> path_id (INSERT-if-not-exists, upsert logic via materialized view or external service)
- Telemetry raw table: stores path_id (UInt64) instead of full path strings
This reduces MergeTree index size and accelerates group-by queries over paths.
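The insert-if-not-exists mapping can live client-side. A hypothetical sketch: deriving path_id deterministically from a 64-bit content hash lets independent producers agree on ids without coordination, and each process flushes only the pairs it has not seen before:

```python
import hashlib

class PathDictionary:
    """Client-side cache for the path -> path_id mapping (illustrative)."""

    def __init__(self):
        self._known = {}   # path -> path_id
        self.pending = []  # (path_id, path) rows awaiting insert into the paths table

    def path_id(self, path: str) -> int:
        pid = self._known.get(path)
        if pid is None:
            # Deterministic 64-bit id: no central sequence, no coordination.
            digest = hashlib.blake2b(path.encode(), digest_size=8).digest()
            pid = int.from_bytes(digest, "big")
            self._known[path] = pid
            self.pending.append((pid, path))  # flush in batches later
        return pid
```

Duplicate (path_id, path) rows from different producers are harmless if the paths table uses ReplacingMergeTree or equivalent dedup on read.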
Low-latency query techniques
To keep interactive queries fast on large high-cardinality datasets, use a combination of these features:
- Projections and pre-aggregations: build projections for common query shapes (e.g., per-minute latency percentiles per host).
- Materialized views for rollups: maintain minute and hourly aggregates to avoid scanning raw data for dashboards.
- Skip indexes: add tokenbf_v1 indexes for token matching on string columns, and bloom_filter indexes where equality filters are common (e.g., on path_id).
- Sampling: for exploratory queries, use SAMPLE to reduce scanned rows for quick trends.
-- Example: minute-level rollup table. AggregatingMergeTree stores partial
-- aggregate states that merge correctly; a plain SummingMergeTree would
-- corrupt a percentile column by summing it during background merges.
CREATE TABLE telemetry_minute (
    minute DateTime,
    site_id UInt32,
    host_id UInt32,
    path_id UInt64,
    requests AggregateFunction(count),
    p95_latency_state AggregateFunction(quantileTDigest(0.95), UInt32)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(minute)
ORDER BY (site_id, host_id, path_id, minute);

CREATE MATERIALIZED VIEW mv_telemetry_minute TO telemetry_minute AS
SELECT
    toStartOfMinute(event_time) AS minute,
    site_id,
    host_id,
    path_id,
    countState() AS requests,
    quantileTDigestState(0.95)(latency_ms) AS p95_latency_state
FROM telemetry_raw
GROUP BY minute, site_id, host_id, path_id;
Cost control strategies for high-ingest environments
High ingest rates mean fast-growing storage and CPU use. Control costs with a combination of lifecycle rules, tiered storage, and compute governance:
1) Tiered storage and TTL policies
Use ClickHouse storage policies to create hot/warm/cold tiers. Move older parts to cheaper object storage (S3) and keep recent parts on NVMe. Apply column and table TTLs to either delete or move rows/columns automatically.
-- New parts land on the policy's default ('hot') volume; TTL clauses only
-- describe the moves away from it.
ALTER TABLE telemetry_raw MODIFY TTL
    event_time + INTERVAL 7 DAY TO VOLUME 'warm',
    event_time + INTERVAL 30 DAY TO VOLUME 'cold',
    event_time + INTERVAL 400 DAY DELETE;
By moving full data objects (parts) to S3 and keeping only active windows local, you reduce expensive block storage while retaining queryability for cold ranges.
2) Downsampling and multi-resolution retention
Keep raw, high-cardinality rows for a short window (e.g., 7–14 days). Maintain minute-level aggregates for 90 days and hourly/monthly aggregates for longer retention. This reduces long-term storage and query cost.
Suggested retention matrix:
- Raw events: 7 days
- Per-minute aggregates: 90 days
- Per-hour aggregates: 1 year
- Top-k / sampled aggregates: multi-year (for trends)
3) Compression and codecs
Choose codecs per column. Monotonic numerics (timestamps, counters) compress well with Delta or DoubleDelta chained with ZSTD; repetitive strings benefit from LZ4 or ZSTD. ClickHouse column-level compression settings let you tune size vs CPU trade-offs.
ALTER TABLE telemetry_raw MODIFY COLUMN tags Array(String) CODEC(ZSTD(3));
4) Control CPU and memory usage
Enforce user profiles and quotas to limit per-query memory and threads. Limit expensive functions (quantileExact over raw data) and route heavy analytic jobs to read-only replica clusters or an offline analytics cluster.
CREATE USER analytics_user SETTINGS max_memory_usage = 20000000000, max_threads = 4; -- ~20 GB per query, 4 threads
5) Efficient indexing and indexes for high-cardinality columns
Don't add expensive primary keys that cause wide parts. Instead, use compact ORDER BY keys and add sparse skip indexes (bloom filters) for string columns that you equality filter frequently. Skip indexes reduce scanned bytes dramatically for selective filters.
Operational best practices
Use these operational controls to maintain a stable, cost-effective ClickHouse telemetry system.
Monitoring ClickHouse itself
- Use system tables (system.parts, system.merges, system.metrics, system.asynchronous_metrics) to observe ingestion pressure and merge backlog.
- Track number of parts per partition — high part counts increase overhead and storage fragmentation.
- Monitor query_log and trace slow queries to add appropriate projections or materialized views.
Autoscaling and cloud-managed options
In 2026, managed ClickHouse cloud offerings are mature and include autoscaling, managed tiered storage, and integrated backup/restore. When running self-managed clusters, use the ClickHouse Operator on Kubernetes for lifecycle automation and shard placement.
Backups and disaster recovery
Store backups of dictionary and dimension tables outside ClickHouse (S3) and keep metadata snapshots for restore. Use incremental backup strategies to reduce egress costs.
Design patterns for high-cardinality tags
Tags are the primary source of cardinality explosions. Use these patterns:
- Canonicalization: normalize query strings and strip ephemeral parameters before insertion.
- Tag sampling: selectively store rare or debug-only tags only when sampling is enabled.
- Tag dictionary: assign tag ids and store arrays of tag_id in raw rows. Join with a tag metadata table for descriptive queries.
- Dynamically promote hot tags: maintain fast-access indexes for frequently queried tags and aggregate rest into an "other" bucket.
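The canonicalization step above can be as small as a single function. A sketch using stdlib urllib; the list of ephemeral parameters is an assumption you would tune per fleet:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Parameters that explode cardinality without analytic value (illustrative set).
EPHEMERAL_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "session_id", "request_id"}

def canonicalize_path(url: str) -> str:
    """Drop ephemeral query parameters and sort the survivors so that
    equivalent requests collapse to a single dictionary entry."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in EPHEMERAL_PARAMS)
    return urlunsplit(("", "", parts.path, urlencode(kept), ""))
```

Running this in the shipper (e.g., a Vector transform) rather than in ClickHouse keeps the raw table's path dictionary small from the start.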
Example: complete flow with Vector + Kafka + ClickHouse
Vector is a lightweight, high-throughput shipper. This example sketches a production flow:
- Producers emit structured JSON telemetry to Vector
- Vector batches, compresses, and writes to Kafka topic 'telemetry'
- ClickHouse consumes Kafka topic via Kafka engine
- Materialized view parses and writes into MergeTree raw table
- Background materialized views maintain minute aggregates
# vector.toml (snippet)
[sources.http]
type = "http_server"   # named "http" in older Vector releases
address = "0.0.0.0:9000"

[sinks.kafka]
type = "kafka"
inputs = ["http"]
bootstrap_servers = "kafka:9092"
topic = "telemetry"
compression = "gzip"
encoding.codec = "json"
batch.timeout_secs = 0.5   # flush every 500 ms
batch.max_events = 50000
Sizing and cost estimation
Estimate storage and CPU by estimating average row size and retention. Example calculation for budgeting:
- Avg row size (after compression): 200 bytes
- Ingest rate: 100k rows/sec -> ~8.6 billion rows/day
- Daily raw storage: 8.6B rows * 200 B ≈ 1.7 TB/day (compressed)
- 7-day raw retention ≈ 12 TB; with rollups and tiering, long-term cost can drop by an order of magnitude
These numbers show why aggressive downsampling and tiered storage are essential for predictable costs.
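The arithmetic above is worth wrapping in a small helper so budget scenarios stay consistent (decimal TB; inputs are whatever your fleet measures):

```python
def daily_storage_tb(rows_per_sec: float, avg_row_bytes: float) -> float:
    """Compressed bytes written per day, in decimal TB."""
    return rows_per_sec * 86_400 * avg_row_bytes / 1e12

def retention_footprint_tb(rows_per_sec: float, avg_row_bytes: float,
                           retention_days: int) -> float:
    """Steady-state footprint of a retention window, ignoring merge overhead."""
    return daily_storage_tb(rows_per_sec, avg_row_bytes) * retention_days

# 100k rows/sec at 200 compressed bytes/row:
# daily_storage_tb(100_000, 200)        -> ~1.7 TB/day
# retention_footprint_tb(100_000, 200, 7) -> ~12 TB for the 7-day hot window
```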
Realtime analytics patterns and examples
Common realtime queries you should pre-plan:
- Per-host error rate over last 5 minutes
- P95 latency per-site per-minute
- Top slowest endpoints in last hour
Create pre-aggregated tables or projections for each to keep dashboard refreshes sub-second. For example:
-- Projection example: ClickHouse picks it automatically for matching queries
ALTER TABLE telemetry_raw ADD PROJECTION p95_per_minute
(
    SELECT
        site_id,
        host_id,
        toStartOfMinute(event_time) AS minute,
        quantileTDigest(0.95)(latency_ms)
    GROUP BY site_id, host_id, minute
);
ALTER TABLE telemetry_raw MATERIALIZE PROJECTION p95_per_minute;
2026 trends and what to plan for
As ClickHouse matures, several trends shape telemetry architectures:
- Cloud-first operators: Managed ClickHouse offerings with autoscaling reduce ops burden; plan for hybrid hot/warm storage policies.
- Streaming as the default: Producers and shipper ecosystems like Vector and Redpanda are becoming standard for telemetry ingestion.
- Edge pre-aggregation: To control cardinality and egress, more teams pre-aggregate at the edge (containers or serverless VMs), sending metrics instead of raw events.
- Privacy and hashing: Hashing/Pseudonymization of client identifiers is standard practice for telemetry across hosted environments.
ClickHouse's recent momentum and investment in cloud capabilities mean teams can expect better managed features for tiered storage, autoscaling, and improved operator ecosystems in 2026. Plan your architecture to take advantage of these improvements while retaining control of cardinality and retention.
Actionable checklist
- Partition by time and choose ORDER BY for your common filters (site_id, host_id, event_time).
- Externalize high-cardinality strings to dictionary tables and reference by ID.
- Use Kafka/Redpanda as a buffer for resiliency and backpressure handling.
- Implement minute/hour rollups and TTLs: raw for 7–14 days, aggregates longer.
- Apply compression codecs and per-column settings to optimize size vs CPU.
- Use skip indexes and projections for low-latency dashboards.
- Enforce quotas and user profiles to limit runaway queries.
Closing recommendations
Designing a ClickHouse-based telemetry system for hosting metrics in 2026 is about trade-offs: ingest latency vs long-term storage cost, cardinality vs query speed, and flexibility vs governance. Start by protecting your hot window — optimize raw ingestion for a short retention and build efficient, query-optimized rollups for everything else.
If you're migrating from an existing time-series or analytics store, do a phased approach: dual-write for a week, compare query latency and cost, then flip your dashboards and roll up retention policies gradually.
Next steps
Want a hands-on migration plan or a cost projection tailored to your fleet? Contact our team for a free architecture review and a 30-day ClickHouse hosting trial tuned for high-cardinality telemetry. We’ll analyze your ingest patterns, recommend partitioning and retention, and provide ready-to-deploy Vector/Kafka/ClickHouse configurations.
Call to action: Reach out for a free ClickHouse telemetry audit and cost forecast — get a production-ready pipeline and schema blueprint in 48 hours.