Hello,
The data model i am working on is as below with different tables for same data data set for satisfying different kinds of query. The data mainly stores event data of some campaigns sent out on multiple channels like email, web, mobile app, sms etc. Events can include page visists, email opens, link clicks etc for different subscribers.
Table 1:
(enterprise_id int, domain_id text, campaign_id int, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, ........) (many more columns not part of primary key)
PRIMARY KEY ((enterprise_id,campaign_id),domain_id, event_category, event_action, datetime, subscription_id))
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 1) : I have partition key as enterprise_id + campaign_id . Each enterprise can have several campaigns . The datastore may have data for few hundred campaigns. Each campaign can have upto 2-3 million records. Hence there may be 3000 partitions across 100 enterprises and each partition having 2-3 miilion records.
Cassandra Queries: Query always with partition key + primary key including the datetime field. The subscription id is included in primary key to keep each record unique as we can have multiple records with similar values for rest of the keys in primary key. enterprise_id +campaign_id is always available as a filter in the queries.
Table 2:
(enterprise_id int, domain_id text, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, domain_id, event_category, event_action, datetime, subscription_id))
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 2) : I have partition key as enterprise_id only. Each enterprise can have several campaigns . May be few hundred campaigns. Each campaign can have upto 2-3 Mn records. In this case the partition is quite big with data for all campaigns in a single partition. can have upto 800 - 900 million entries
Cassandra Queries: Query always with partition key + primary key upto datetime. The subscription id is included in primary key to keep each record unique as we can have multiple records with similar values for rest of the keys in primary key. In this case, data has to be queries across campaigns and we may not have campaign_id as a filter in the queries.
Table 3:
(enterprise_id int, subscription_id text, domain_id text, event_category text, event_action text, datetime timestamp, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, subscription_id, domain_id, event_category, event_action, datetime, ))
CLUSTERING ORDER BY ( subscription_id DESC, domain_id DESC, event_category DESC, event_action DESC, datetime DESC,)
Keys and Data size for Table 3) : I have partition key as enterprise_id. Each enterprise can have several campaigns . May be few hundred campaigns. Each campaign can have upto 2-3 Mn records. In this case the partition is quite big with data for all campaigns in a single partition. can have upto 800 -900 million entries
Cassandra Queries: Query always with partition key + primary key as subscription_id only. Should be able to query directly on enterprise_id + subscription_id.
My Queries:
1) Size of data on each partition: With Table 2) and Table 3) i may end up with more than 800 -900 million rows per partiton. As per my reading it is not ok to have so many entries per partition. How can i achieve my use case in this scenario? Even if i create multiple partitions based on some data like a week_number (1-52 in a year), the query will need to query across all partitions and end up using a IN clause with all week numbers which is as good as scanning all data.
2) [Follow up questions posted in #11016 and #10978]
3) [Follow up question posted in #11009]