
Querying CDC Streams

Some CDC use cases require querying the log table periodically, at short intervals. One way to do that is to perform partition scans, where you don’t specify the partition (in this case, the stream) that you want to query, for example:

SELECT * FROM ks.t_scylla_cdc_log;

Although partition scans are convenient, they require the read coordinator to contact the entire cluster, not just a small set of replicas defined by the replication factor.

The recommended alternative is to query each stream separately:

SELECT * FROM ks.t_scylla_cdc_log WHERE "cdc$stream_id" = 0x365fd1a9ae34373954529ac8169dfb93;

With this approach you can, for instance, build a distributed CDC consumer in which each consumer node queries only the streams that are replicated to Scylla nodes in its proximity. This allows efficient, concurrent querying of streams without the strain that a partition scan puts on a single coordinator node.
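For illustration, here is a minimal sketch of such a per-stream query using the Python driver (the scylla-driver package, which shares the cassandra-driver API); the contact point and the stream ID are placeholders:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # placeholder contact point
session = cluster.connect()

# Query a single CDC stream instead of scanning the whole log table.
query = session.prepare(
    'SELECT * FROM ks.t_scylla_cdc_log WHERE "cdc$stream_id" = ?')
stream_id = bytes.fromhex('365fd1a9ae34373954529ac8169dfb93')
for row in session.execute(query, (stream_id,)):
    print(row)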

Caution

The tables mentioned in the following sections, system_distributed.cdc_generation_timestamps and system_distributed.cdc_streams_descriptions_v2, were introduced in Scylla 4.4. We highly recommend upgrading to 4.4 for efficient CDC usage. The last section explains how to run the examples below in Scylla 4.3.

If you use CDC in Scylla 4.3 and your application constantly queries CDC log tables and uses the old description table to learn about new generations and stream IDs, you should upgrade your application before upgrading to 4.4. The upgraded application should dynamically switch from the old description table to the new description tables when the cluster is upgraded from 4.3 to 4.4. The last section presents an example algorithm that the application can follow.

We highly recommend using the newest releases of our client CDC libraries (Java CDC library, Go CDC library, Rust CDC library). They take care of correctly querying the stream description tables and they handle the upgrade procedure for you.

Learning about available streams

To query the log table without performing partition scans, you need to know which streams to look at. For this you can use the system_distributed.cdc_generation_timestamps and system_distributed.cdc_streams_descriptions_v2 tables.

Example: querying the CDC description table

  1. Retrieve the timestamp of the currently operating CDC generation from the cdc_generation_timestamps table. If you have a multi-node cluster, query the table with QUORUM or ALL consistency level so you don’t miss any entry:

    CONSISTENCY QUORUM;
    SELECT time FROM system_distributed.cdc_generation_timestamps WHERE key = 'timestamps';
    

    The query can return multiple entries:

     time
    ---------------------------------
     2020-03-25 16:05:29.484000+0000
     2020-03-25 12:44:43.006000+0000
    
    (2 rows)
    

    Take the highest one. In our case this is 2020-03-25 16:05:29.484000+0000.

  2. Retrieve the list of stream IDs in the current CDC generation from the cdc_streams_descriptions_v2 table. Unfortunately, to use the timestamp in a WHERE clause, you have to adjust its format slightly by removing the last three zeros before the +. In our case, the modified timestamp is 2020-03-25 16:05:29.484+0000:

    CONSISTENCY QUORUM;
    SELECT streams FROM system_distributed.cdc_streams_descriptions_v2 WHERE time = '2020-03-25 16:05:29.484+0000';
    

    The result consists of a number of rows (most likely returned in multiple pages, unless you turned off paging), each row containing a list of stream IDs, such as:

     streams
    --------------------------------------------------------------------------------------------------------------
     {0x7ffe0c687fcce86e0783343730000001, 0x80000000000000010d9ee5f1f4000001, 0x800555555555555653e250f2d8000001}
     {0x807ae73e07dbd4122e32d36e08000011, 0x80800000000000001facbbb618000011, 0x80838c6b76e19a1bc3581db310000011}
     {0x80838c6b76e19a1c6da83d4d14000021, 0x80855555555555566d556a0a18000021, 0x808aaaaaaaaaaaabf1008f4120000021}
     {0x80c5343222b6eee636e3ed42d0000031, 0x80c5555555555556efd251b0b8000031, 0x80caaaaaaaaaaaabb9bde28998000031}
     ...
    
    (256 rows)
    

    Save all stream IDs returned by the query. When we ran the example, the query returned 256 * 3 = 768 stream IDs.

  3. Use the obtained stream IDs to query your CDC log tables:

    CREATE TABLE ks.t (pk int, ck int, v int, primary key (pk, ck)) WITH cdc = {'enabled': true};
    INSERT INTO ks.t (pk, ck, v) values (0, 0, 0);
    SELECT * FROM ks.t_scylla_cdc_log WHERE "cdc$stream_id" = 0x7ffe0c687fcce86e0783343730000001;
    SELECT * FROM ks.t_scylla_cdc_log WHERE "cdc$stream_id" = 0x80000000000000010d9ee5f1f4000001;
    ...
    

    Each change will be present in exactly one of these streams. When we ran the example, it was:

    SELECT * FROM ks.t_scylla_cdc_log WHERE "cdc$stream_id" = 0xced00000000000009663c8dc500005a1;
    
     cdc$stream_id                      | cdc$time                             | cdc$batch_seq_no | cdc$deleted_v | cdc$end_of_batch | cdc$operation | cdc$ttl | ck | pk | v
    ------------------------------------+--------------------------------------+------------------+---------------+------------------+---------------+---------+----+----+---
     0xced00000000000009663c8dc500005a1 | 7a370e64-819f-11eb-c419-1f717873d8fa |                0 |          null |             True |             2 |    null |  0 |  0 | 0
    
    (1 rows)
    

Query all streams to read the entire CDC log.
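Putting the three steps together, the following is a minimal sketch using the Python driver; the contact point is a placeholder and error handling is omitted. It reads the currently operating generation, collects its stream IDs, and queries the CDC log of ks.t stream by stream:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])  # placeholder contact point
session = cluster.connect()

# Step 1: the currently operating generation has the highest timestamp.
rows = session.execute(SimpleStatement(
    "SELECT time FROM system_distributed.cdc_generation_timestamps"
    " WHERE key = 'timestamps'",
    consistency_level=ConsistencyLevel.QUORUM))
generation = max(row.time for row in rows)

# Step 2: collect all stream IDs of that generation.
rows = session.execute(SimpleStatement(
    "SELECT streams FROM system_distributed.cdc_streams_descriptions_v2"
    " WHERE time = %s",
    consistency_level=ConsistencyLevel.QUORUM), (generation,))
stream_ids = [sid for row in rows for sid in row.streams]

# Step 3: query the CDC log of ks.t stream by stream.
query = session.prepare(
    'SELECT * FROM ks.t_scylla_cdc_log WHERE "cdc$stream_id" = ?')
for sid in stream_ids:
    for change in session.execute(query, (sid,)):
        print(change)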

Reacting to topology changes

As explained in CDC Stream Generations, the set of used CDC stream IDs changes whenever you bootstrap a new node. You should then query the CDC description table to read the new set of stream IDs and the corresponding timestamp.

If you’re periodically querying streams and you don’t want to miss any writes sent to the old generation, you should query its streams at least once after the old generation stops operating (which happens when the new generation starts operating).

Keep in mind that time is relative: every node has its own clock. Therefore you should make sure that the old generation stops operating from the point of view of every node in the cluster before you query it one last time and start querying the new generation.
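For example, a consumer that remembers the generation timestamp it is currently using can periodically check whether a newer one has appeared. A minimal sketch using the Python driver (function names are ours):

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

def newest_generation(session):
    # All generation timestamps; the operating generation has the highest one.
    rows = session.execute(SimpleStatement(
        "SELECT time FROM system_distributed.cdc_generation_timestamps"
        " WHERE key = 'timestamps'",
        consistency_level=ConsistencyLevel.QUORUM))
    return max(row.time for row in rows)

def new_generation_appeared(session, current_generation):
    # True if a node was bootstrapped and a newer generation exists.
    return newest_generation(session) > current_generation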

Example: switching streams

Suppose that cdc_generation_timestamps contains the following entries:

 time
---------------------------------
 2020-03-25 16:05:29.484000+0000
 2020-03-25 12:44:43.006000+0000

(2 rows)

The currently operating generation’s timestamp is 2020-03-25 16:05:29.484000+0000 — the highest one in the above list. You’ve been periodically querying all streams in this generation. In the meantime, a new node is bootstrapped, hence a new generation appears:

 time
---------------------------------
 2020-03-25 17:21:45.360000+0000
 2020-03-25 16:05:29.484000+0000
 2020-03-25 12:44:43.006000+0000

(3 rows)

You should keep querying streams from generation 2020-03-25 16:05:29.484000+0000 until you have made sure that every node’s clock has moved past 2020-03-25 17:21:45.360000+0000. One way to do that is to connect to each node and use the now() function:

$ cqlsh 127.0.0.1
Connected to  at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select totimestamp(now()) from system.local;

 system.totimestamp(system.now())
----------------------------------
  2020-03-25 17:24:34.104000+0000

(1 rows)
cqlsh>
$ cqlsh 127.0.0.4
Connected to  at 127.0.0.4:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select totimestamp(now()) from system.local;

 system.totimestamp(system.now())
----------------------------------
  2020-03-25 17:24:42.038000+0000

(1 rows)

and so on. After you make sure that every node uses the new generation, you can query streams from the previous generation one last time, and then switch to querying streams from the new generation.
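The same check can be automated. The sketch below, using the Python driver, connects to each node separately (whitelisting it so that system.local is read from that node) and verifies that its clock has moved past the new generation’s timestamp; the node addresses and function name are placeholders:

from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

def all_clocks_past(nodes, new_generation_time):
    for node in nodes:
        # Restrict the connection to this node so the query is served by it.
        cluster = Cluster([node],
                          load_balancing_policy=WhiteListRoundRobinPolicy([node]))
        session = cluster.connect()
        row = session.execute(
            "SELECT totimestamp(now()) AS t FROM system.local").one()
        cluster.shutdown()
        if row.t <= new_generation_time:
            return False
    return True

# Once this returns True, query the old generation's streams one last time,
# then switch to the new generation's streams, e.g.:
# all_clocks_past(['127.0.0.1', '127.0.0.4'], new_generation_time)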

Differences in Scylla 4.3

In Scylla 4.3 the tables cdc_generation_timestamps and cdc_streams_descriptions_v2 don’t exist. Instead there is the cdc_streams_descriptions table. To retrieve all generation timestamps, instead of querying the time column of cdc_generation_timestamps using a single-partition query (i.e. using WHERE key = 'timestamps'), you would query the time column of cdc_streams_descriptions with a full range scan (without specifying a single partition):

SELECT time FROM system_distributed.cdc_streams_descriptions;

To retrieve a generation’s stream IDs, you query the streams column of cdc_streams_descriptions as follows:

SELECT streams FROM system_distributed.cdc_streams_descriptions WHERE time = '2020-03-25 16:05:29.484+0000';

Unlike in cdc_streams_descriptions_v2, all of a generation’s stream IDs are stored in a single row.
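For illustration, the same lookup against the 4.3 table, sketched with the Python driver (the function name is ours):

def latest_generation_4_3(session):
    # In 4.3 every generation is a separate partition, so list the timestamps
    # with a full range scan and pick the highest one.
    rows = session.execute(
        "SELECT time FROM system_distributed.cdc_streams_descriptions")
    latest = max(row.time for row in rows)
    # All of a generation's stream IDs are stored in a single row.
    row = session.execute(
        "SELECT streams FROM system_distributed.cdc_streams_descriptions"
        " WHERE time = %s", (latest,)).one()
    return latest, row.streams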

Scylla 4.3 to Scylla 4.4 upgrade

If you didn’t enable CDC on any table while using Scylla 4.3 or earlier, you don’t need to understand this section. Simply upgrade to 4.4 (we recommend doing it as soon as you can) and implement your application to query streams as described above.

However, if you use CDC with Scylla 4.3 and your application is periodically querying the old cdc_streams_descriptions table, you should upgrade your application before upgrading the cluster to Scylla 4.4.

The upgraded application should understand both the old cdc_streams_descriptions table and the new cdc_generation_timestamps and cdc_streams_descriptions_v2 tables. It should smoothly transition from querying the old table to querying the new tables as the cluster upgrades.

When Scylla upgrades from 4.3 to 4.4, it attempts to copy the descriptions of all existing generations from the old table to the new tables. This copying procedure may take a while. Until it finishes, your application should keep using the old table and switch as soon as it detects that the procedure has finished. To detect that, query the system.cdc_local table: if it contains a row with key = 'rewritten', the procedure has finished; otherwise it is still in progress.

It is possible to disable the rewriting procedure. In that case only the latest generation is inserted into the new tables, and your application should act accordingly (it shouldn’t wait for the 'rewritten' row to appear but should start using the new tables immediately). We do not recommend disabling the rewriting procedure and have purposely left how to do it undocumented. This option exists only for emergencies and should be used only with the assistance of a qualified Scylla engineer.

In fresh Scylla 4.4 clusters (not upgraded from a previous version) the old description table does not exist. The application should therefore check for its existence; if the table is absent, the application should use the new tables immediately.

With the above considerations in mind, the application should behave as follows. When it wants to learn if there are new generations:

  1. Check if the system_distributed.cdc_streams_descriptions table exists. If not, proceed to query the new tables.

  2. Otherwise, check if system.cdc_local contains a row with key = 'rewritten'. If yes, proceed to query the new tables.

  3. Otherwise, query the old table; the rewriting procedure is still in progress. Repeat step 2 after a few seconds; by then the rewriting may have finished.
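A minimal sketch of this check using the Python driver; it detects the old table through the driver’s schema metadata and assumes that key is the partition key of system.cdc_local, as the check above suggests (function names are ours):

import time

OLD_TABLE = 'cdc_streams_descriptions'

def should_use_new_tables(session):
    # Step 1: if the old description table does not exist (fresh 4.4 cluster),
    # go straight to the new tables.
    ks = session.cluster.metadata.keyspaces.get('system_distributed')
    if ks is None or OLD_TABLE not in ks.tables:
        return True
    # Step 2: if the rewriting procedure has finished, use the new tables.
    row = session.execute(
        "SELECT key FROM system.cdc_local WHERE key = 'rewritten'").one()
    if row is not None:
        return True
    # Step 3: the rewrite is still in progress; keep using the old table.
    return False

def wait_until_new_tables(session, poll_interval=5):
    # Keep using the old table and re-check every few seconds (steps 2-3).
    while not should_use_new_tables(session):
        time.sleep(poll_interval)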

You may also decide that it’s safe to switch to the new tables even though not all generations have been copied from the old table. This may be the case if your application is interested only in the latest changes in the latest generation (for example, because it queries the CDC log tables in near-real time and has already seen all past changes). In this case, the application may check that the latest generation’s timestamp is present in cdc_generation_timestamps and if it is, start using the new tables immediately.

Note that after upgrading the cluster to 4.4, all new generations (which are created when bootstrapping new nodes) appear only in the new tables. After upgrading your application and your cluster, and ensuring that either all generations have been rewritten to the new tables or that you’re not interested in the data from old generations, it is safe to remove the old description table.

Note

We highly recommend using the newest releases of our client CDC libraries (Java CDC library, Go CDC library, Rust CDC library). They take care of correctly querying the stream description tables and they handle the upgrade procedure for you.
