
Scylla and Spark integration¶

Simple Scylla-Spark integration example¶

This is an example of how to create a very simple Spark application that uses Scylla to store its data. The application reads people’s names and ages from one table and writes the names of the adults to another. It will also print the number of adults and the total number of people in the database.

Prerequisites¶

  • Scylla

  • sbt

Prepare Scylla¶

First, we need to create the keyspace and tables in which the data processed by the example application will be stored.

Launch Scylla and connect to it using cqlsh. The following commands will create a new keyspace for our tests and make it the current one.

CREATE KEYSPACE spark_example WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1};
USE spark_example;

Then, tables both for input and output data need to be created:

CREATE TABLE persons (name TEXT PRIMARY KEY, age INT);
CREATE TABLE adults (name TEXT PRIMARY KEY);

Lastly, the database needs to contain some input data for our application to process:

INSERT INTO persons (name, age) VALUES ('Anne', 34);
INSERT INTO persons (name, age) VALUES ('John', 47);
INSERT INTO persons (name, age) VALUES ('Elisabeth', 89);
INSERT INTO persons (name, age) VALUES ('George', 52);
INSERT INTO persons (name, age) VALUES ('Amy', 17);
INSERT INTO persons (name, age) VALUES ('Jack', 16);
INSERT INTO persons (name, age) VALUES ('Treebeard', 36421);
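Before wiring Spark in, the transformation the job performs can be sketched in plain Scala (no Spark involved), using the same sample rows inserted above. This is purely illustrative and not part of the application itself:

```scala
// Plain-Scala sketch of the job's logic, using the sample data above.
case class Person(name: String, age: Int)

val persons = List(
  Person("Anne", 34), Person("John", 47), Person("Elisabeth", 89),
  Person("George", 52), Person("Amy", 17), Person("Jack", 16),
  Person("Treebeard", 36421))

// Keep only adults, exactly what the Spark job will do with RDDs.
val adults = persons.filter(_.age >= 18).map(_.name)

println("Adults: %d\nTotal: %d".format(adults.size, persons.size))
// Adults: 5
// Total: 7
```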

Prepare the application¶

With a database containing all the necessary tables and data, it is now time to write our example application. Create a directory scylla-spark-example, which will contain all source code and build configuration.

The first important file is build.sbt, which should be created in the project’s main directory. It contains all the application metadata, including name, version, and dependencies.

name := "scylla-spark-example-simple"
version := "1.0"
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
        "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M1",
        "org.apache.spark" %% "spark-catalyst" % "1.5.0" % "provided"
    )

Then we need to enable the sbt-assembly plugin. Create a directory named project and, inside it, a file plugins.sbt with the following content:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")

The steps above cover the build configuration; what is left is the actual logic of the application. Create the file src/main/scala/ScyllaSparkExampleSimple.scala:

import org.apache.spark.{SparkContext,SparkConf}
import com.datastax.spark.connector._

object ScyllaSparkExampleSimple {
    def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf())

        val persons = sc.cassandraTable("spark_example", "persons")

        val adults = persons.filter(_.getInt("age") >= 18).map(n => Tuple1(n.getString("name")))
        adults.saveToCassandra("spark_example", "adults")

        val out = "Adults: %d\nTotal: %d\n".format(adults.count(), persons.count())
        println(out)
    }
}

Since we don’t want to hardcode any information about Scylla or Spark in the application, we will also need an additional configuration file, spark-scylla.conf.

spark.master local
spark.cassandra.connection.host 127.0.0.1
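The file uses the simple format expected by spark-submit’s --properties-file option: one setting per line, with the key and value separated by whitespace. The stdlib-only sketch below shows how such a file is interpreted; it is an illustration of the format, not how Spark actually loads it:

```scala
// Illustrative parser for the key/value format used by spark-scylla.conf.
val text = """spark.master local
             |spark.cassandra.connection.host 127.0.0.1""".stripMargin

val props: Map[String, String] = text.split("\n").map { line =>
  // Each line is a key, whitespace, then the value (which may itself
  // contain spaces, hence the limit of 2 on split).
  val Array(key, value) = line.trim.split("\\s+", 2)
  key -> value
}.toMap

println(props("spark.cassandra.connection.host"))
// 127.0.0.1
```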

Now it is time to build the application and create a self-contained jar file that we will be able to send to Spark. To do that, execute the command:

sbt assembly

It will download all necessary dependencies, build our example, and create an output jar file in target/scala-2.10/scylla-spark-example-simple-assembly-1.0.jar.

Download and run Spark¶

The next step is to get Spark running. Pre-built binaries can be downloaded from the Apache Spark downloads page. Make sure to choose release 1.5.0. Since we are going to use it with Scylla, the Hadoop version doesn’t matter.

Once the download has finished, unpack the archive and in its root directory, execute the following command to start Spark Master:

./sbin/start-master.sh -h localhost

The Spark Web UI should now be available at http://localhost:8080. The Spark URL that workers use to connect is spark://localhost:7077.

With the master running, the only thing left for a minimal Spark deployment is to start a worker. This can be done with the following command:

./sbin/start-slave.sh spark://localhost:7077

Run application¶

The application is built, Spark is up, and Scylla has all the necessary tables created and contains the input data for our example. This means that we are ready to run the application. Make sure that Scylla is running and execute (still in the Spark directory) the following command:

./bin/spark-submit --properties-file /path/to/scylla-spark-example/spark-scylla.conf \
    --class ScyllaSparkExampleSimple /path/to/scylla-spark-example/target/scala-2.10/scylla-spark-example-simple-assembly-1.0.jar

spark-submit will output some logs and debug information, but among them, there should be a message from the application:

Adults: 5
Total: 7

You can also connect to Scylla with cqlsh and see the results of our example in the database using the following query:

SELECT * FROM spark_example.adults;

Expected output:

 name
-----------
 Treebeard
 Elisabeth
    George
      John
      Anne

Based on http://www.planetcassandra.org/getting-started-with-apache-spark-and-cassandra/ and http://koeninger.github.io/spark-cassandra-example/#1.

RoadTrip example¶

This is a short guide explaining how to run the spark-tests Spark example application with Scylla.

Prerequisites¶

  • Scylla

  • Maven

  • Git

Get the source code¶

You can get the source code of this example by cloning the following repository:

https://github.com/jsebrien/spark-tests

Disable Apache Cassandra¶

spark-tests is configured to launch Cassandra, which is not what we want here. The following patch disables Cassandra. It can be applied, for example, with git apply --ignore-whitespace - (reading the patch from standard input).

diff --git a/src/main/java/blog/hashmade/spark/util/CassandraUtil.java b/src/main/java/blog/hashmade/spark/util/CassandraUtil.java
index 37bbc2e..bfe5517 100644
--- a/src/main/java/blog/hashmade/spark/util/CassandraUtil.java
+++ b/src/main/java/blog/hashmade/spark/util/CassandraUtil.java
@@ -14,7 +14,7 @@ public final class CassandraUtil {
    }

    static Session startCassandra() throws Exception {
-       EmbeddedCassandraServerHelper.startEmbeddedCassandra();
+       //EmbeddedCassandraServerHelper.startEmbeddedCassandra();
        Thread.sleep(EMBEDDED_CASSANDRA_SERVER_WAITING_TIME);
        Cluster cluster = new Cluster.Builder().addContactPoints("localhost")
                .withPort(9142).build();

Update connector¶

spark-tests uses Spark Cassandra Connector version 1.1.0, which is too old for our purposes. Before 1.3.0, the connector used Thrift as well as CQL, and Thrift won’t work with Scylla. Updating the example isn’t very complicated and can be accomplished by applying the following patch:

diff --git a/pom.xml b/pom.xml
index 673e22b..1245ffc 100644
--- a/pom.xml
+++ b/pom.xml
@@ -142,7 +142,7 @@
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
-           <version>1.1.0</version>
+           <version>1.3.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
@@ -157,7 +157,7 @@
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
-           <version>1.1.0</version>
+           <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.cassandraunit</groupId>
@@ -173,18 +173,18 @@
        <dependency>
            <groupId>com.datastax.cassandra</groupId>
            <artifactId>cassandra-driver-core</artifactId>
-           <version>2.1.2</version>
+           <version>2.1.7.1</version>
        </dependency>
        <!-- Datastax -->
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.10</artifactId>
-           <version>1.1.0-beta2</version>
+           <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector-java_2.10</artifactId>
-           <version>1.1.0-beta2</version>
+           <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>net.sf.supercsv</groupId>
diff --git a/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java b/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java
index 1027e42..190eb3d 100644
--- a/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java
+++ b/src/main/java/blog/hashmade/spark/DatastaxSparkTest.java
@@ -43,8 +43,7 @@ public class DatastaxSparkTest {
                .setAppName("DatastaxTests")
                .set("spark.executor.memory", "1g")
                .set("spark.cassandra.connection.host", "localhost")
-               .set("spark.cassandra.connection.native.port", "9142")
-               .set("spark.cassandra.connection.rpc.port", "9171");
+               .set("spark.cassandra.connection.port", "9142");
        SparkContext ctx = new SparkContext(conf);
        SparkContextJavaFunctions functions = CassandraJavaUtil.javaFunctions(ctx);
        CassandraJavaRDD<CassandraRow> rdd = functions.cassandraTable("roadtrips", "roadtrip");

Build the example¶

The example can be built with Maven:

mvn compile

Start Scylla¶

The application will try to connect to Scylla on the custom port 9142, so an additional flag is needed when starting Scylla to make sure that’s the port it listens on (alternatively, you can change all occurrences of 9142 to 9042 in the example source code).

scylla --native-transport-port=9142

Run the application¶

With the example compiled and Scylla running, all that is left is to actually run the application:

mvn exec:java

Scylla limitations¶

  • Scylla needs Spark Cassandra Connector 1.3.0 or later.

  • Scylla doesn’t populate system.size_estimates, and therefore the connector won’t be able to perform automatic split sizing optimally.
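Because system.size_estimates is not populated, it may help to set the input split size explicitly rather than relying on automatic sizing. The option name depends on the connector version — roughly, spark.cassandra.input.split.size (a row count) in older releases and spark.cassandra.input.split.size_in_mb in later ones; treat these names as assumptions and verify them against your connector’s documentation. The setting can be added to spark-scylla.conf like any other property:

```
spark.cassandra.input.split.size_in_mb 64
```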

For more compatibility information, see the Scylla status page.

© 2025, ScyllaDB. All rights reserved.
Last updated on 08 May 2025.