ScyllaDB and Spark integration¶

Simple ScyllaDB-Spark integration example¶

This is an example of how to create a very simple Spark application that uses ScyllaDB to store its data. The application reads people’s names and ages from one table and writes the names of the adults to another. It will also print the number of adults and the total number of people in the database.

Prerequisites¶

  • ScyllaDB

  • sbt

Prepare ScyllaDB¶

First, we need to create the keyspace and tables in which the data processed by the example application will be stored.

Launch ScyllaDB and connect to it using cqlsh. The following commands create a new keyspace for our tests and make it the current one. Note that the replication factor is set to 3; on a single-node test cluster, use 1 instead.

CREATE KEYSPACE spark_example WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};
USE spark_example;

Then, create the tables for both the input and the output data:

CREATE TABLE persons (name TEXT PRIMARY KEY, age INT);
CREATE TABLE adults (name TEXT PRIMARY KEY);

Lastly, the database needs to contain some input data for our application to process:

INSERT INTO persons (name, age) VALUES ('Anne', 34);
INSERT INTO persons (name, age) VALUES ('John', 47);
INSERT INTO persons (name, age) VALUES ('Elisabeth', 89);
INSERT INTO persons (name, age) VALUES ('George', 52);
INSERT INTO persons (name, age) VALUES ('Amy', 17);
INSERT INTO persons (name, age) VALUES ('Jack', 16);
INSERT INTO persons (name, age) VALUES ('Treebeard', 36421);

Prepare the application¶

With a database containing all the necessary tables and data, it is time to write the example application. Create a directory scylla-spark-example that will contain all the source code and build configuration.

The first important file is build.sbt, which should be created in the project’s root directory. It contains the application metadata: name, version, and dependencies.

name := "scylla-spark-example-simple"
version := "1.0"
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
        "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M1",
        "org.apache.spark" %% "spark-catalyst" % "1.5.0" % "provided"
    )

Next, we need to enable the sbt-assembly plugin. Create a directory named project and, inside it, a file plugins.sbt with the following content:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")

The steps above cover all the build configuration; what is left is the actual logic of the application. Create the file src/main/scala/ScyllaSparkExampleSimple.scala:

import org.apache.spark.{SparkContext,SparkConf}
import com.datastax.spark.connector._

object ScyllaSparkExampleSimple {
    def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf())

        val persons = sc.cassandraTable("spark_example", "persons")

        val adults = persons.filter(_.getInt("age") >= 18).map(n => Tuple1(n.getString("name")))
        adults.saveToCassandra("spark_example", "adults")

        val out = "Adults: %d\nTotal: %d\n".format(adults.count(), persons.count())
        println(out)
    }
}
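Before wiring in Spark, it may help to see the same transformation on plain Scala collections: a minimal sketch (not part of the original example) using the sample rows inserted above, with filter and map standing in for their RDD counterparts.

```scala
// Plain-Scala sketch of the job's logic (no Spark involved): filter the
// sample rows inserted earlier and keep only the names of adults.
object AdultsSketch {
  val persons: Seq[(String, Int)] = Seq(
    ("Anne", 34), ("John", 47), ("Elisabeth", 89), ("George", 52),
    ("Amy", 17), ("Jack", 16), ("Treebeard", 36421))

  // Same shape as the RDD pipeline: filter on age, then project the name.
  val adults: Seq[String] = persons.filter(_._2 >= 18).map(_._1)

  def main(args: Array[String]): Unit =
    println("Adults: %d\nTotal: %d".format(adults.size, persons.size))
}
```

Running it prints the same counts the Spark job is expected to report (5 adults out of 7 people); the RDD version simply distributes the same filter and map over the cluster.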

Since we don’t want to hardcode any ScyllaDB or Spark connection details in the application, we also need an additional configuration file, spark-scylla.conf.

spark.master local
spark.cassandra.connection.host 127.0.0.1
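Note that spark.master local runs the job inside a single local JVM, so a standalone master and worker won’t actually be used. To submit to a standalone cluster instead, point the master setting at the cluster’s URL; a sketch, assuming a master started on localhost with the default port:

```
spark.master spark://localhost:7077
spark.cassandra.connection.host 127.0.0.1
```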

Now it is time to build the application and create a self-contained jar file that can be submitted to Spark. To do that, execute the command:

sbt assembly

This will download all the necessary dependencies, build the example, and create the output jar file at target/scala-2.10/scylla-spark-example-simple-assembly-1.0.jar.
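If sbt assembly stops with "deduplicate: different file contents" errors (common when building fat jars with Spark dependencies), a merge strategy in build.sbt usually resolves it. A sketch, assuming the conflicts are in META-INF; adjust the patterns to the files sbt actually reports:

```scala
// Hypothetical sbt-assembly merge strategy for build.sbt; the exact
// patterns depend on which files sbt reports as conflicting.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```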

Download and run Spark¶

The next step is to get Spark running. Pre-built binaries can be downloaded from the Spark downloads page. Make sure to choose release 1.5.0. Since we are going to use it with ScyllaDB, the Hadoop version doesn’t matter.

Once the download has finished, unpack the archive and in its root directory, execute the following command to start Spark Master:

./sbin/start-master.sh -h localhost

The Spark Web UI should now be available at http://localhost:8080. The URL that workers use to connect to the master is spark://localhost:7077.

With the master running, the only thing left for a minimal Spark deployment is to start a worker. This can be done with the following command:

./sbin/start-slave.sh spark://localhost:7077

Run application¶

The application is built, Spark is up, and ScyllaDB has all the necessary tables created and contains the input data for our example. This means that we are ready to run the application. Make sure that ScyllaDB is running and execute (still in the Spark directory) the following command:

./bin/spark-submit --properties-file /path/to/scylla-spark-example/spark-scylla.conf \
    --class ScyllaSparkExampleSimple /path/to/scylla-spark-example/target/scala-2.10/scylla-spark-example-simple-assembly-1.0.jar

spark-submit will output a fair amount of logs and debug information, but among them there should be a message from the application:

Adults: 5
Total: 7

You can also connect to ScyllaDB with cqlsh and use the following query to see the results of our example in the database:

SELECT * FROM spark_example.adults;

Expected output (rows are returned in token order of the partition key, not insertion or alphabetical order):

 name
-----------
 Treebeard
 Elisabeth
    George
      John
      Anne

Based on http://www.planetcassandra.org/getting-started-with-apache-spark-and-cassandra/ and http://koeninger.github.io/spark-cassandra-example/#1.

RoadTrip example¶

This is a short guide explaining how to run the spark-tests Spark example application with ScyllaDB.

Prerequisites¶

  • ScyllaDB

  • Maven

  • Git

Get the source code¶

You can get the source code of this example by cloning the following repository:

https://github.com/jsebrien/spark-tests

Disable Apache Cassandra¶

spark-tests is configured to launch an embedded Cassandra, which is not what we want to achieve here. The following patch disables Cassandra. Save the diff to a file and apply it with, for example, git apply --ignore-whitespace.

diff --git a/src/main/java/blog/hashmade/spark/util/CassandraUtil.java b/src/main/java/blog/hashmade/spark/util/CassandraUtil.java
index 37bbc2e..bfe5517 100644
--- a/src/main/java/blog/hashmade/spark/util/CassandraUtil.java
+++ b/src/main/java/blog/hashmade/spark/util/CassandraUtil.java
@@ -14,7 +14,7 @@ public final class CassandraUtil {
    }

    static Session startCassandra() throws Exception {
-       EmbeddedCassandraServerHelper.startEmbeddedCassandra();
+       //EmbeddedCassandraServerHelper.startEmbeddedCassandra();
        Thread.sleep(EMBEDDED_CASSANDRA_SERVER_WAITING_TIME);
        Cluster cluster = new Cluster.Builder().addContactPoints("localhost")
                .withPort(9142).build();

Update connector¶

spark-tests uses Spark Cassandra Connector 1.1.0, which is too old for our purposes: before 1.3.0 the connector used Thrift as well as CQL, and Thrift won’t work with ScyllaDB. Updating the example isn’t very complicated and can be accomplished by applying the following patch:

diff --git a/pom.xml b/pom.xml
index 673e22b..1245ffc 100644
--- a/pom.xml
+++ b/pom.xml
@@ -142,7 +142,7 @@
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
-           <version>1.1.0</version>
+           <version>1.3.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
@@ -157,7 +157,7 @@
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
-           <version>1.1.0</version>
+           <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.cassandraunit</groupId>
@@ -173,18 +173,18 @@
        <dependency>
            <groupId>com.datastax.cassandra</groupId>
            <artifactId>cassandra-driver-core</artifactId>
-           <version>2.1.2</version>
+           <version>2.1.7.1</version>
        </dependency>
        <!-- Datastax -->
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.10</artifactId>
-           <version>1.1.0-beta2</version>
+           <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector-java_2.10</artifactId>
-           <version>1.1.0-beta2</version>
+           <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>net.sf.supercsv</groupId>
diff --git a/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java b/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java
index 1027e42..190eb3d 100644
--- a/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java
+++ b/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java
@@ -43,8 +43,7 @@ public class DatastaxSparkTest {
                .setAppName("DatastaxTests")
                .set("spark.executor.memory", "1g")
                .set("spark.cassandra.connection.host", "localhost")
-               .set("spark.cassandra.connection.native.port", "9142")
-               .set("spark.cassandra.connection.rpc.port", "9171");
+               .set("spark.cassandra.connection.port", "9142");
        SparkContext ctx = new SparkContext(conf);
        SparkContextJavaFunctions functions = CassandraJavaUtil.javaFunctions(ctx);
        CassandraJavaRDD<CassandraRow> rdd = functions.cassandraTable("roadtrips", "roadtrip");

Build the example¶

The example can be built with Maven:

mvn compile

Start ScyllaDB¶

The application will try to connect to ScyllaDB on the custom port 9142. That’s why, when starting ScyllaDB, an additional flag is needed to make sure that’s the port it listens on (alternatively, you can change all occurrences of 9142 to 9042 in the example source code).

scylla --native-transport-port=9142

Run the application¶

With the example compiled and ScyllaDB running, all that is left to be done is to actually run the application:

mvn exec:java

ScyllaDB limitations¶

  • ScyllaDB needs Spark Cassandra Connector 1.3.0 or later.

  • ScyllaDB doesn’t populate the system.size_estimates table, so the connector cannot perform automatic input split sizing optimally.
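When automatic split sizing is off, the split size can be set explicitly in the Spark configuration. A sketch; note that the property name varies by connector version (older 1.x releases used spark.cassandra.input.split.size, measured in rows, while later versions use spark.cassandra.input.split.size_in_mb), so check the reference configuration for your connector release:

```
spark.cassandra.input.split.size 100000
```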

For more compatibility information, check the ScyllaDB status page.
