question

gduan2000 avatar image
gduan2000 asked gduan2000 commented

How do I limit the number of items in a CQL collection?

hi there,

say I have a table that contains two fields (marketid: text, email: Set<text>), market is the primary key. This table is populated by an application that collects email addresses that satisfies certain conditions. Under normal circumstances, the email field will only get tens or hundreds of items., but under some extreme situations, it could get tens or hundreds of millions of records. The total size could reach hundreds of MB probably close to GB in size.

my questions are,

1. say if I have an entry with a set of 500MB emails, when I append another email to this set, how bad the performance of this operation is? How about reading the data, I know it'll read the entire record, but how does it compare to a text field with 500MB data?

2. If the above scenario will impact the performance, is there a way to set the limit of items that the table could accept?

* is it possible to set the limit when the table is created?

* is it possible to query the size of items before append?

thanks!

Version information:

[cqlsh 6.0.0 | Cassandra 3.11.3.5124 | CQL spec 3.4.4 | Native protocol v65]

==== update ====

@smadhavan thank you for your answer!

So it seems to me that using Collection is not recommended?

With your suggestion, let's say we have these 2 data models, one is using collection, other is using the alternative you mentioned in your original answer, for primary key 'abc' they both have 1,000,000 records,

table1 with Set collection:

marketid email
abc ['email1@email.com', 'email2@email.com', 'email3@email.com', ....'email1000000@email.com']
sdf ['sdf1@sdf.com', 'fhga@gah.com']

table2 with the alternative data model:

marketid email
abc email1@email.com
abc email2@email.com
sdf sdf1@sdf.com
abc email3@email.com
sdf fhga@gah.com
abc email4@email.com
... ...
abc email1000000@emai.com

For writing, if I want to upsert another email 'email1000001@email.com' to 'abc' in the above tables, which one has better performance?

For reading, if I want to get all emails with marketid 'abc' from either table, which one has better performance?

If table2 wins in either case, does it mean there is no need to introduce Set collection at all?

thanks!

collections
4 comments
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

smadhavan avatar image smadhavan ♦ commented ·
@gduan2000 , could you please update your question to post the version of Apache Cassandra or DataStax Enterprise (DSE) that you're using here? Thanks!
0 Likes 0 ·
gduan2000 avatar image gduan2000 smadhavan ♦ commented ·
@smadhavan thanks for your comments, just updated the version information.
0 Likes 0 ·
smadhavan avatar image smadhavan ♦ gduan2000 commented ·

@gduan2000, does the single partition records size fall within the 5 to 10MB maximum range for a given marketid (e.g. "abc" in this example) even with the 1 million e-mails? All things being equal, the alternate model that I proposed earlier will be faster.

In either case, I would simply suggest you test this yourselves with your existing hardware specifications, Cassandra cluster topology and your own complete schema/data model using the available performance benchmarking tools. You could read about the same in an article here.

I once again wanted to point you to the noteworthy characteristics of Collections here.

0 Likes 0 ·
Show more comments
smadhavan avatar image
smadhavan answered smadhavan edited

@gduan2000, I am afraid it is not possible to restrict/limit the number of items in a Set collection column type during the table creation. You could achieve that in the application side by reading the list for a given `marketid` partition, calculating the size and then route it accordingly (i.e. either deny the request with some custom message or allow it to be ingested).

Another alternative is to design your data model in such a way that you're not leveraging the collection type at all.

CREATE TABLE IF NOT EXISTS <your_table> (
  marketid text,
  resource_type text,
  resource_value text,
  <other_columns>
  PRIMARY KEY ((marketid, resource_type), resource_value)
);

and then have the values such as the below example,

marketid = 'abc' AND resource_type = 'email' AND resource_value = 'something@somewhere.com'

By doing this way, you don't have to worry about performance in reading a collection and updating it, etc., And, when you need to read all of the `email`, you could simply issue a query like below,

SELECT <identifiers_columns> FROM <your_table> WHERE marketid = ? AND resource_type = 'email';

Ideally, in a data modeling scenario, we recommend that your entire partition size is in the range of 5-10MB in size for scale and speed. I also want to take this opportunity to introduce a wonderful resource that has helped with data modeling in Cassandra very effectively and I hope you would see value in this. Website can be accessed here. Other helpful documentation in updating a `Set` collection type is here for your reference. I hope this gives you an idea of how to design and data model for higher efficiency and scale.

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

gduan2000 avatar image gduan2000 commented ·

Thanks @smadhavan, based on your comments I added a couple more questions to the original post. Please take a look.

0 Likes 0 ·
Erick Ramirez avatar image
Erick Ramirez answered gduan2000 commented

It is not possible to limit the number of elements in a CQL collection. For this reason, our general recommendation is to not use CQL collections to store unbounded data because it won't perform well.

CQL collections are intended to store small amounts of data such as a person's contact numbers (mobile, home phone, work phone). When reading the contents of a collection, the whole collection must be read in order to access just one element.

In your case where the application query is "get all emails for a particular market", you need to design a table accordingly by using the market ID as the partition key and the email address as the clustering key:

CREATE TABLE emails_by_marketid (
    marketid text,
    email text,
    ...
    PRIMARY KEY (marketid, email)
)

To respond to your follow up question directly, the table schema has no impact to write performance. The only relevant point is read performance. Cheers!

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

gduan2000 avatar image gduan2000 commented ·
@Erick Ramirez thank you for the clarification that there is no impact on write performance in either case. I thought appending to Set will take more resources especially when the set gets bigger and bigger, but, it is good to know it doesn't impact on Write.
0 Likes 0 ·