Cassandra Integration with Spark SQL




What is Cassandra?

Apache Cassandra is an open-source, distributed, NoSQL (Not Only SQL) database management system designed to handle large amounts of data (big data) across commodity servers. Cassandra has a peer-to-peer architecture, which means each node can act as both master and slave, so there is no single point of failure.

In Hive we use databases, whereas Cassandra uses keyspaces.

In a single-node cluster, the replication class is defined as SimpleStrategy.

In a multi-node cluster, the replication class is defined as NetworkTopologyStrategy.
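For example, a keyspace on a multi-node cluster might be created as follows (the keyspace name multiks and the datacenter name dc1 are placeholders; use the datacenter names reported by nodetool status):

cqlsh > create keyspace multiks with replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};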

How to enter the Cassandra shell and create a keyspace and table?
After installing Cassandra on your cluster, run the cqlsh command to enter the Cassandra shell.

cqlsh > DESCRIBE KEYSPACES;
cqlsh > create keyspace Sparkwrd8 with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh > use Sparkwrd8;
cqlsh > create table emp (empid int primary key, ename text, esal int);
cqlsh > insert into emp (empid, ename, esal) values (101, 'Hari', 10000);
cqlsh > select * from emp where empid = 101;

After executing the SELECT command, we get the row as output.
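The output should look something like this (exact formatting depends on the cqlsh version):

 empid | ename | esal
-------+-------+-------
   101 |  Hari | 10000

(1 rows)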

cqlsh > select * from emp where esal = 10000;
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Predicates on non-primary-key columns (esal) are
not yet supported for non secondary index queries."

The above error is a Cassandra indexing issue: the WHERE clause of our SELECT query filters on a non-primary-key column, so the server rejects the request as invalid. The simple resolution in Cassandra is to create a secondary index on that column.



cqlsh > create index empindex on emp(esal);
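
With the index in place, the earlier query should now succeed instead of raising an error:

cqlsh > select * from emp where esal = 10000;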

How to connect Cassandra with Spark:

We connect to Cassandra from the Spark shell using the command below.

$ spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host=127.0.0.1

Without the connector package, the Cassandra-related classes are not available in the shell. Version mismatches are also a common source of errors; here we use Spark 2.x (or above) to stay compatible with this connector version.

Here we use the Spark Cassandra connector inside the spark-shell.

Import the connector classes:

scala> import com.datastax.spark.connector._
scala> import com.datastax.spark.connector.cql._
scala> import org.apache.spark.sql.cassandra._
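
With the connector imported, we can read the emp table into Spark. The following is a minimal sketch assuming Spark 2.x and the keyspace created above (unquoted CQL identifiers are stored lowercase, so the keyspace is read as sparkwrd8):

scala> // Read the Cassandra table as a DataFrame via the connector's data source
scala> val empDF = spark.read
     |   .format("org.apache.spark.sql.cassandra")
     |   .options(Map("keyspace" -> "sparkwrd8", "table" -> "emp"))
     |   .load()
scala> empDF.show()
scala> // Register it as a temp view to query it with Spark SQL
scala> empDF.createOrReplaceTempView("emp")
scala> spark.sql("select ename, esal from emp where esal >= 10000").show()
scala> // Alternatively, use the RDD API provided by com.datastax.spark.connector._
scala> val empRDD = sc.cassandraTable("sparkwrd8", "emp")
scala> empRDD.first()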

Summary: The above steps cover Cassandra integration with Spark on a single-node cluster.