ScyllaDB University LIVE, FREE Virtual Training Event | March 21
Register for Free
ScyllaDB Documentation Logo Documentation
  • Server
  • Cloud
  • Tools
    • ScyllaDB Manager
    • ScyllaDB Monitoring Stack
    • ScyllaDB Operator
  • Drivers
    • CQL Drivers
    • DynamoDB Drivers
  • Resources
    • ScyllaDB University
    • Community Forum
    • Tutorials
Download
ScyllaDB Docs ScyllaDB Open Source ScyllaDB Architecture ScyllaDB SSTable Format ScyllaDB SSTable - 3.x SSTables 3.0 Index File Format

Caution

You're viewing documentation for a previous version. Switch to the latest stable version.

SSTables 3.0 Index File Format¶

The SSTables index file, together with the SSTables summary file , provides a way to efficiently locate a partition with a specific key or some position within a partition in the SSTables data file.

Broadly, the index file lists the keys in the data file in order, giving for each key its position in the data file. The summary file, which is held in memory, contains samples of these keys, pointing to their position in the index file. So to efficiently search for a specific key, we use the summary to find a (relatively short) range of positions in the index file which may hold this key, then read this range from the index file and look for the specific key. The details on how to further pinpoint specific columns inside very long partitions are explained below.

The Index File Layout¶

Building Blocks¶

The building blocks of the new index format are same as those of the new data format. This includes varints, optional values and delta encoding.

Description¶

The index file format changes in v3.0 from the old format are moderate compared to how significantly has changed the data file layout. Still, there are some new fields and entities that are worth a detailed look. Such fields are the offsets array added to index_entry and end_open_marker structure added to promoted_index_block.

The index file is a long sequence of entries:

struct index_file {
    struct index_entry entries[];
};

struct index_entry {
    be16 key_length;
    char key[key_length];
    varint position; // decoded into a 64-bit integer
    varint promoted_index_length; // decoded into a 32-bit integer
    byte promoted_index[promoted_index_length];
};

Here, key is the SSTable partition key, and position is the position of that partition in the data file.

For large partitions, finding a specific range of columns requires more than just starting position to be efficient. For those, the so-called promoted index is stored along. It samples partition at roughly column_index_size_in_kb (by default, 64 KB) blocks and gives the clustering prefixes range for each block.

The reason for the strange name promoted index is that it originally started as a separately stored column index, until in Cassandra 1.2, when it was “promoted” to live inside the SSTables index file, reducing the number of seeks required to find a given column from 3 to 2.

The promoted_index_length stores the number of bytes right after this field and up to the end of the current index_entry. If promoted_index_length is 0, there is no promoted index for this entry and nothing follows it.

The structure of promoted index (when promoted_index_length != 0) is as follows:

struct promoted_index {
    varint partition_header_length; //decoded into a 64-bit integer
    struct deletion_time deletion_time;
    varint promoted_index_blocks_count; // decoded into a 32-bit integer
    struct promoted_index_block blocks[promoted_index_blocks_count];
    be32 offsets[promoted_index_blocks_count];
}

The first field, partition_header_length, stores the length of serialized partition key, partition tombstone and static row, if present, in the data file. This allows to skip over the partition prefix straight to the first clustered row.

The deletion time structure is described in SSTables 3.0 data file format article. It is simply a copy of the partition deletion_time, either live or a partition tombstone. We need this copy inside the index, because normally the deletion_time of a partition appears in its very beginning in the data file, but with the promoted index we intend to skip directly to the middle of a very large partition, so we don’t want to also read the beginning of the partition just to read its partition tombstone.

After deletion time mark, the number of promoted index blocks is stored as a varint. It can be zero for small partitions or >= 2 (apparently, it makes no sense to store a single promoted index block).

Promoted Index Blocks¶

The promoted index block looks like:

struct promoted_index_block {
    struct clustering_prefix first_name;
    struct clustering_prefix last_name;
    varint offset;
    varint delta_width;
    byte end_open_marker_present;
    optional<struct deletion_time> end_open_marker;
};

Clustering Prefixes¶

Here, first_name and last_name are the serialised prefixes of clustering columns marking the boundaries of the block. They are stored as follows:

struct clustering_prefix {
    byte kind;
    optional<be16> size;
    struct clustering_block clustering_blocks[];
};

The first byte, kind, stores the serialised ordinal of the value from the bound_kind enum (see SSTables 3.0 data file format) for the enum definition). It can be either 4 (for CLUSTERING) when corresponds to a row or one of 0, 1 and 2 (for EXCL_END_BOUND, INCL_START_BOUND or EXCL_END_INCL_START_BOUNDARY accordingly) for a range tombstone marker.

The next field, size, is only present for range tombstone markers, i.e., when kind != 4. It contains the actual number of values in the clustering prefix (the prefix length). For rows, it is not needed as rows have all the clustering columns values (full clustering prefix) so the count is taken from the table schema definition.

The sequence of clustering blocks is serialised in exactly the same way as clustering blocks in rows or range tombstone markers in data files. You can refer to the detailed description here

Other Promoted Index Blocks Fields¶

The offset field stores the start of the block relative to the beginning of the current partition in the data file.

The width field is a delta-encoded value representing the length of the current promoted index block in the index file. The size of promoted index blocks is a configurable value set through the column_index_size_in_kb option in the config file. It defaults to 64KB and this is the base value that width delta is taken from.

As of today, the base value is not stored along in the index file and always equals 64KB == 65536.

Note that since the actual promoted index block size can be configured to be less than 64K, it can so happen that delta-encoding width would result in a negative value. Because of that, the stored value should be treated as signed.

The next byte after width called end_open_marker_present is effectively a boolean which tells us whether the following end_open_marker structure is serialised or not. If end_open_marker_present is 1, end_open_marker should be read.

The end_open_marker structure is present in case if the end bound of the current promoted index block falls in between two range tombstone markers denoting a range. In other words, that means that the block of unfiltereds in the data file that corresponds to this promoted index block contains a range tombstone start marker but its counterpart (range tombstone end marker) lies somewhere past that block. It is set for a range tombstone marker of either the INCL_START_BOUND or the EXCL_END_INCL_START_BOUNDARY kind and contains the open bound deletion time.

The end_open_marker has to do with how range tombstones are organised in SSTables 3.0. and is closely connected with the merging logic of storage engine code. That code expects all the data it reads from SSTables to come strictly ordered and it expects range tombstone markers to always come in pairs (the combined “boundary” type marker is just treated as two markers - an end marker for the previous RT followed immediately by a start marker for a new RT. So when a slice of data is being read, it may contain an end RT marker without the corresponding start RT marker. In order to properly handle this, reader code returns a start RT marker corresponding to the slice start bound with the deletion time taken from the end_open_marker structure see: SSTableIterator.java.handlePreSliceData

The higher-level reason for this design is that C* 3.0 storage design aims to be as iterator-based as possible and as such doesn’t re-arrange the data it reads (and without start RT marker, such re-arrangement would be due).

Lastly, there is an offsets array which stores the offsets for all the promoted index blocks. Its size is the same as the value stored in promoted_index_blocks_count. The first offset is always 0, and others are the offsets of corresponding promoted index blocks from the beginning of promoted_index.blocks.

It is possible to read those offsets (commonly referred to as “offsets map”) directly without reading the entire promoted index. This, in turn, allows searching through the promoted index using binary search which results in O(log N) reads instead of O (N).

References¶

The following code parts in Cassandra deal with index serialisation/de-serialisation:

  • ColumnIndex.java, buildRowIndex

  • RowIndexEntry.java

  • IndexInfo.java

Additional Information¶

The offset map reduces the complexity of search in the promoted index from O(N) to O(log N) using the offsets map. There are some better alternatives that should further improve the performance of index-related operations using BTree-like structures. They are not released yet but planned for Cassandra 4. See CASSANDRA-9754.

Copyright

© 2016, The Apache Software Foundation.

Apache®, Apache Cassandra®, Cassandra®, the Apache feather logo and the Apache Cassandra® Eye logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Was this page helpful?

PREVIOUS
SSTables 3.0 Summary File Format
NEXT
SSTable 3.0 Format in ScyllaDB
  • Create an issue
  • Edit this page

On this page

  • SSTables 3.0 Index File Format
    • The Index File Layout
      • Building Blocks
      • Description
      • Promoted Index Blocks
      • Clustering Prefixes
      • Other Promoted Index Blocks Fields
    • References
      • Additional Information
ScyllaDB Open Source
  • 6.2
    • master
    • 6.2
    • 6.1
    • 6.0
    • 5.4
    • 5.2
    • 5.1
  • Getting Started
    • Install ScyllaDB
      • Launch ScyllaDB on AWS
      • Launch ScyllaDB on GCP
      • Launch ScyllaDB on Azure
      • ScyllaDB Web Installer for Linux
      • Install ScyllaDB Linux Packages
      • Install scylla-jmx Package
      • Run ScyllaDB in Docker
      • Install ScyllaDB Without root Privileges
      • Air-gapped Server Installation
      • ScyllaDB Housekeeping and how to disable it
      • ScyllaDB Developer Mode
    • Configure ScyllaDB
    • ScyllaDB Configuration Reference
    • ScyllaDB Requirements
      • System Requirements
      • OS Support by Linux Distributions and Version
      • Cloud Instance Recommendations
      • ScyllaDB in a Shared Environment
    • Migrate to ScyllaDB
      • Migration Process from Cassandra to ScyllaDB
      • ScyllaDB and Apache Cassandra Compatibility
      • Migration Tools Overview
    • Integration Solutions
      • Integrate ScyllaDB with Spark
      • Integrate ScyllaDB with KairosDB
      • Integrate ScyllaDB with Presto
      • Integrate ScyllaDB with Elasticsearch
      • Integrate ScyllaDB with Kubernetes
      • Integrate ScyllaDB with the JanusGraph Graph Data System
      • Integrate ScyllaDB with DataDog
      • Integrate ScyllaDB with Kafka
      • Integrate ScyllaDB with IOTA Chronicle
      • Integrate ScyllaDB with Spring
      • Shard-Aware Kafka Connector for ScyllaDB
      • Install ScyllaDB with Ansible
      • Integrate ScyllaDB with Databricks
      • Integrate ScyllaDB with Jaeger Server
      • Integrate ScyllaDB with MindsDB
    • Tutorials
  • ScyllaDB for Administrators
    • Administration Guide
    • Procedures
      • Cluster Management
      • Backup & Restore
      • Change Configuration
      • Maintenance
      • Best Practices
      • Benchmarking ScyllaDB
      • Migrate from Cassandra to ScyllaDB
      • Disable Housekeeping
    • Security
      • ScyllaDB Security Checklist
      • Enable Authentication
      • Enable and Disable Authentication Without Downtime
      • Creating a Custom Superuser
      • Generate a cqlshrc File
      • Reset Authenticator Password
      • Enable Authorization
      • Grant Authorization CQL Reference
      • Certificate-based Authentication
      • Role Based Access Control (RBAC)
      • Encryption: Data in Transit Client to Node
      • Encryption: Data in Transit Node to Node
      • Generating a self-signed Certificate Chain Using openssl
      • Configure SaslauthdAuthenticator
    • Admin Tools
      • Nodetool Reference
      • CQLSh
      • Admin REST API
      • Tracing
      • ScyllaDB SStable
      • ScyllaDB Types
      • SSTableLoader
      • cassandra-stress
      • SSTabledump
      • SSTableMetadata
      • ScyllaDB Logs
      • Seastar Perftune
      • Virtual Tables
      • Reading mutation fragments
      • Maintenance socket
      • Maintenance mode
      • Task manager
    • ScyllaDB Monitoring Stack
    • ScyllaDB Operator
    • ScyllaDB Manager
    • Upgrade Procedures
      • ScyllaDB Versioning
      • ScyllaDB Open Source Upgrade
      • ScyllaDB Open Source to ScyllaDB Enterprise Upgrade
      • ScyllaDB Image
      • ScyllaDB Enterprise
    • System Configuration
      • System Configuration Guide
      • scylla.yaml
      • ScyllaDB Snitches
    • Benchmarking ScyllaDB
    • ScyllaDB Diagnostic Tools
  • ScyllaDB for Developers
    • Develop with ScyllaDB
    • Tutorials and Example Projects
    • Learn to Use ScyllaDB
    • ScyllaDB Alternator
    • ScyllaDB Drivers
      • ScyllaDB CQL Drivers
      • ScyllaDB DynamoDB Drivers
  • CQL Reference
    • CQLSh: the CQL shell
    • Appendices
    • Compaction
    • Consistency Levels
    • Consistency Level Calculator
    • Data Definition
    • Data Manipulation
      • SELECT
      • INSERT
      • UPDATE
      • DELETE
      • BATCH
    • Data Types
    • Definitions
    • Global Secondary Indexes
    • Expiring Data with Time to Live (TTL)
    • Functions
    • Wasm support for user-defined functions
    • JSON Support
    • Materialized Views
    • Non-Reserved CQL Keywords
    • Reserved CQL Keywords
    • Service Levels
    • ScyllaDB CQL Extensions
  • Alternator: DynamoDB API in Scylla
    • Getting Started With ScyllaDB Alternator
    • ScyllaDB Alternator for DynamoDB users
  • Features
    • Lightweight Transactions
    • Global Secondary Indexes
    • Local Secondary Indexes
    • Materialized Views
    • Counters
    • Change Data Capture
      • CDC Overview
      • The CDC Log Table
      • Basic operations in CDC
      • CDC Streams
      • CDC Stream Generations
      • Querying CDC Streams
      • Advanced column types
      • Preimages and postimages
      • Data Consistency in CDC
    • Workload Attributes
  • ScyllaDB Architecture
    • Data Distribution with Tablets
    • ScyllaDB Ring Architecture
    • ScyllaDB Fault Tolerance
    • Consistency Level Console Demo
    • ScyllaDB Anti-Entropy
      • ScyllaDB Hinted Handoff
      • ScyllaDB Read Repair
      • ScyllaDB Repair
    • SSTable
      • ScyllaDB SSTable - 2.x
      • ScyllaDB SSTable - 3.x
    • Compaction Strategies
    • Raft Consensus Algorithm in ScyllaDB
    • Zero-token Nodes
  • Troubleshooting ScyllaDB
    • Errors and Support
      • Report a ScyllaDB problem
      • Error Messages
      • Change Log Level
    • ScyllaDB Startup
      • Ownership Problems
      • ScyllaDB will not Start
      • ScyllaDB Python Script broken
    • Upgrade
      • Inaccessible configuration files after ScyllaDB upgrade
    • Cluster and Node
      • Handling Node Failures
      • Failure to Add, Remove, or Replace a Node
      • Failed Decommission Problem
      • Cluster Timeouts
      • Node Joined With No Data
      • NullPointerException
      • Failed Schema Sync
    • Data Modeling
      • ScyllaDB Large Partitions Table
      • ScyllaDB Large Rows and Cells Table
      • Large Partitions Hunting
      • Failure to Update the Schema
    • Data Storage and SSTables
      • Space Utilization Increasing
      • Disk Space is not Reclaimed
      • SSTable Corruption Problem
      • Pointless Compactions
      • Limiting Compaction
    • CQL
      • Time Range Query Fails
      • COPY FROM Fails
      • CQL Connection Table
    • ScyllaDB Monitor and Manager
      • Manager and Monitoring integration
      • Manager lists healthy nodes as down
    • Installation and Removal
      • Removing ScyllaDB on Ubuntu breaks system packages
  • Knowledge Base
    • Upgrading from experimental CDC
    • Compaction
    • Consistency in ScyllaDB
    • Counting all rows in a table is slow
    • CQL Query Does Not Display Entire Result Set
    • When CQLSh query returns partial results with followed by “More”
    • Run ScyllaDB and supporting services as a custom user:group
    • Customizing CPUSET
    • Decoding Stack Traces
    • Snapshots and Disk Utilization
    • DPDK mode
    • Debug your database with Flame Graphs
    • How to Change gc_grace_seconds for a Table
    • Gossip in ScyllaDB
    • Increase Permission Cache to Avoid Non-paged Queries
    • How does ScyllaDB LWT Differ from Apache Cassandra ?
    • Map CPUs to ScyllaDB Shards
    • ScyllaDB Memory Usage
    • NTP Configuration for ScyllaDB
    • Updating the Mode in perftune.yaml After a ScyllaDB Upgrade
    • POSIX networking for ScyllaDB
    • ScyllaDB consistency quiz for administrators
    • Recreate RAID devices
    • How to Safely Increase the Replication Factor
    • ScyllaDB and Spark integration
    • Increase ScyllaDB resource limits over systemd
    • ScyllaDB Seed Nodes
    • How to Set up a Swap Space
    • ScyllaDB Snapshots
    • ScyllaDB payload sent duplicated static columns
    • Stopping a local repair
    • System Limits
    • How to flush old tombstones from a table
    • Time to Live (TTL) and Compaction
    • ScyllaDB Nodes are Unresponsive
    • Update a Primary Key
    • Using the perf utility with ScyllaDB
    • Configure ScyllaDB Networking with Multiple NIC/IP Combinations
  • Reference
    • AWS Images
    • Azure Images
    • GCP Images
    • Configuration Parameters
    • Glossary
    • Limits
    • API Reference (BETA)
    • Metrics (BETA)
  • ScyllaDB FAQ
  • Contribute to ScyllaDB
Docs Tutorials University Contact Us About Us
© 2025, ScyllaDB. All rights reserved. | Terms of Service | Privacy Policy | ScyllaDB, and ScyllaDB Cloud, are registered trademarks of ScyllaDB, Inc.
Last updated on 08 May 2025.
Powered by Sphinx 7.4.7 & ScyllaDB Theme 1.8.6