Hi there,
Say I have a table with two fields (marketid: text, email: Set<text>), where marketid is the primary key. The table is populated by an application that collects email addresses that satisfy certain conditions. Under normal circumstances, the email field will only hold tens or hundreds of items, but in some extreme situations it could hold tens or hundreds of millions. The total size could then reach hundreds of MB, possibly close to a GB.
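For reference, here is a minimal sketch of the table as described above. The table name `table1` is only a placeholder, and I'm assuming no other columns or table options matter for the question.

```cql
-- Sketch of the current model: one row per market, all emails in one set.
-- "table1" is a placeholder name, not the real table name.
CREATE TABLE table1 (
    marketid text PRIMARY KEY,
    email    set<text>
);
```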
My questions are:
1. Say I have an entry whose set holds 500 MB of emails. When I append another email to this set, how bad is the performance of that operation? And how about reading the data? I know it will read the entire record, but how does that compare to reading a text field holding 500 MB of data?
2. If the above scenario does impact performance, is there a way to limit the number of items the table will accept?
* Is it possible to set the limit when the table is created?
* Is it possible to query the size of the set before appending?
thanks!
Version information:
[cqlsh 6.0.0 | Cassandra 3.11.3.5124 | CQL spec 3.4.4 | Native protocol v65]
==== update ====
@smadhavan thank you for your answer!
So it seems that using a collection is not recommended?
Following your suggestion, let's say we have these two data models: one uses a collection, and the other uses the alternative you mentioned in your original answer. For the primary key 'abc', both hold 1,000,000 records.
table1 with Set collection:
marketid | email |
abc | ['email1@email.com', 'email2@email.com', 'email3@email.com', ..., 'email1000000@email.com'] |
sdf | ['sdf1@sdf.com', 'fhga@gah.com'] |
table2 with the alternative data model:
marketid | email |
abc | email1@email.com |
abc | email2@email.com |
sdf | sdf1@sdf.com |
abc | email3@email.com |
sdf | fhga@gah.com |
abc | email4@email.com |
... | ... |
abc | email1000000@email.com |
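To make the second model concrete, this is roughly how I picture its schema. It is only a sketch: I'm assuming marketid stays the partition key and email becomes a clustering column, which may not be exactly what the answer intended.

```cql
-- Sketch of the alternative model: one row per (marketid, email) pair.
-- Assumption: email is a clustering column, so all emails for a market
-- live in the same partition and duplicates overwrite each other.
CREATE TABLE table2 (
    marketid text,
    email    text,
    PRIMARY KEY (marketid, email)
);
```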
For writing, if I want to upsert another email 'email1000001@email.com' to 'abc' in the above tables, which one has better performance?
For reading, if I want to get all emails with marketid 'abc' from either table, which one has better performance?
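In CQL terms, these are the statements I'm comparing. This is a sketch that assumes table1 is declared with `email set<text>` and table2 with `PRIMARY KEY (marketid, email)`; the table names are the placeholders used above.

```cql
-- Write: add one more email for market 'abc'
UPDATE table1 SET email = email + {'email1000001@email.com'} WHERE marketid = 'abc';
INSERT INTO table2 (marketid, email) VALUES ('abc', 'email1000001@email.com');

-- Read: fetch every email for market 'abc'
SELECT email FROM table1 WHERE marketid = 'abc';
SELECT email FROM table2 WHERE marketid = 'abc';
```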
If table2 wins in either case, does that mean there is no need to use a Set collection at all?
thanks!