Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

guruprasadb avatar image
guruprasadb asked ·

How can I partition a table so that I don't end up with 900M rows?

Hello,

The data model i am working on is as below with different tables for same data data set for satisfying different kinds of query. The data mainly stores event data of some campaigns sent out on multiple channels like email, web, mobile app, sms etc. Events can include page visists, email opens, link clicks etc for different subscribers.

Table 1:

(enterprise_id int, domain_id text, campaign_id int, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, ........) (many more columns not part of primary key)

PRIMARY KEY ((enterprise_id,campaign_id),domain_id, event_category, event_action, datetime, subscription_id))

CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)

Keys and Data size for Table 1) : I have partition key as enterprise_id + campaign_id . Each enterprise can have several campaigns . The datastore may have data for few hundred campaigns. Each campaign can have upto 2-3 million records. Hence there may be 3000 partitions across 100 enterprises and each partition having 2-3 miilion records.

Cassandra Queries: Query always with partition key + primary key including the datetime field. The subscription id is included in primary key to keep each record unique as we can have multiple records with similar values for rest of the keys in primary key. enterprise_id +campaign_id is always available as a filter in the queries.

Table 2:

(enterprise_id int, domain_id text, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, campaign_id int........) (many more columns not part of primary key)

PRIMARY KEY (enterprise_id, domain_id, event_category, event_action, datetime, subscription_id))

CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)

Keys and Data size for Table 2) : I have partition key as enterprise_id only. Each enterprise can have several campaigns . May be few hundred campaigns. Each campaign can have upto 2-3 Mn records. In this case the partition is quite big with data for all campaigns in a single partition. can have upto 800 - 900 million entries

Cassandra Queries: Query always with partition key + primary key upto datetime. The subscription id is included in primary key to keep each record unique as we can have multiple records with similar values for rest of the keys in primary key. In this case, data has to be queries across campaigns and we may not have campaign_id as a filter in the queries.

Table 3:

(enterprise_id int, subscription_id text, domain_id text, event_category text, event_action text, datetime timestamp, event_label text, campaign_id int........) (many more columns not part of primary key)

PRIMARY KEY (enterprise_id, subscription_id, domain_id, event_category, event_action, datetime, ))

CLUSTERING ORDER BY ( subscription_id DESC, domain_id DESC, event_category DESC, event_action DESC, datetime DESC,)

Keys and Data size for Table 3) : I have partition key as enterprise_id. Each enterprise can have several campaigns . May be few hundred campaigns. Each campaign can have upto 2-3 Mn records. In this case the partition is quite big with data for all campaigns in a single partition. can have upto 800 -900 million entries

Cassandra Queries: Query always with partition key + primary key as subscription_id only. Should be able to query directly on enterprise_id + subscription_id.

My Queries:

1) Size of data on each partition: With Table 2) and Table 3) i may end up with more than 800 -900 million rows per partiton. As per my reading it is not ok to have so many entries per partition. How can i achieve my use case in this scenario? Even if i create multiple partitions based on some data like a week_number (1-52 in a year), the query will need to query across all partitions and end up using a IN clause with all week numbers which is as good as scanning all data.

2) [Follow up questions posted in #11016 and #10978]

3) [Follow up question posted in #11009]

data modeling
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered ·

The first rule of data modeling is to design one table for each application query. In your case, it appears as though you have designed the tables first and didn't take the app queries into account.

The reason I say this is that you have used enterprise_id as the partition key when you only have 100 enterprises. This means that regardless of the size of your table, you will only ever have 100 partitions in it.

I doubt that your application would ever have a query like:

Get the domain and event details for all subscriptions for enterprise X

because to me that's how you've modelled the tables. A more realistic app query would be:

How many link clicks were there for campaign A5678 sent to emails?

A sample data model for this query would look like:

CREATE TABLE events_by_campaign (
  campaign_id text,
  channel text,
  link_clicks int,
  enterprise_id text,
  ...
  PRIMARY KEY ( (campaign_id), channel )
)

With this table, it wouldn't matter if you had ONE BILLION campaigns. What matters is that each campaign (partition) would have less than 10 rows (email, web, mobile app, SMS, etc).

The equivalent CQL for the hypothetical query I posted above is:

SELECT link_clicks FROM events_by_campaign \
  WHERE campaign_id = 'A5678' \
  AND channel = 'email';

Note that if campaign_id is not unique across all enterprises then you can partition the table based on the enterprise_id and campaign_id:

  PRIMARY KEY ( (enterprise_id, campaign_id), channel )

As a side note, I have created new posts for the other questions you'v asked. This is a Q&A forum so we prefer to post an answer for each question. Otherwise, it gets too difficult to manage really long posts. Cheers!

Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.