animesh.sharma.pandit0_178934 asked ·

Data modeling: choosing a primary key (random vs. hashing certain columns)

I am trying out data modeling with Cassandra and I am unsure what to choose as my primary key. My table looks like this:


CREATE TABLE mykeyspace.mytable (
    id UUID,
    A text,
    B text,
    C text,
    D text,
    -- ... other columns
    PRIMARY KEY (id)
);

I have introduced an id column in my table and made it the primary key so that querying by id is fast, as most of my queries will be by id.


The problem I am facing is that the set of columns (A, B, C, D) uniquely identifies the data. When a create request arrives with a set of (A, B, C, D) values that already exists, it should not create a new record; instead it should return a response with the id of the existing record and suggest that the client use that id to update the record.

I am currently generating the id randomly. Below are the approaches I thought of to solve the problem:

  1. The first approach I thought of was to hash the four columns to generate the id. That would solve the problem, but I am skeptical about how the data would be distributed if I derive the id from a hash of the four columns.
  2. The second approach I thought of was to create a secondary index on the (A, B, C, D) columns. Here I am a bit skeptical about searching through a secondary index before every insertion.


Which of these approaches is more suitable for this data model, or is there another approach?

data modellingschemadesigntable

1 Answer

Erick Ramirez answered ·

@animesh.sharma.pandit0_178934 Since A, B, C, and D together uniquely identify each partition in the table, you should use them as the partition key. In your table definition, it would look like this:

CREATE TABLE mykeyspace.mytable (
    A text, B text, C text, D text,
    -- ... other columns
    PRIMARY KEY ((A, B, C, D))
);

Note that all four columns are enclosed in an extra pair of parentheses, which marks them together as the composite partition key. Cheers!

1 comment

Each of them is text, and some of them can be long and can be updated. Do you still think using all four as the partition key is a good idea? Every client query would then need to supply all four fields. Don't you think having an id would simplify querying?
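
For reference, the first approach from the question (deriving the id from the four columns) can be sketched on the client side as below. This is only an illustration, assuming a Python client: the NAMESPACE value and the record_id helper are hypothetical names, not part of the original schema. It uses a name-based (version 5) UUID so that the same (A, B, C, D) values always map to the same id. Since Cassandra's partitioner (Murmur3 by default) hashes the partition key again to place the row, a derived id should spread across the cluster much like a random one.

```python
import uuid

# Hypothetical, fixed namespace for this table. Any UUID works,
# as long as it never changes once ids have been issued.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "mytable.mykeyspace.example")


def record_id(a: str, b: str, c: str, d: str) -> uuid.UUID:
    """Derive a deterministic id from the four identifying columns."""
    # Length-prefix each value so that ("ab", "c") and ("a", "bc")
    # do not produce the same concatenated name.
    name = "".join(f"{len(value)}:{value}" for value in (a, b, c, d))
    return uuid.uuid5(NAMESPACE, name)


if __name__ == "__main__":
    first = record_id("alpha", "beta", "gamma", "delta")
    second = record_id("alpha", "beta", "gamma", "delta")
    print(first == second)  # same inputs give the same id
```

One caveat that follows from the comment above: if any of A, B, C, or D is later updated, the derived id changes with it, so the record would have to be rewritten under the new id.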
