question

jean.andre_185656 avatar image
jean.andre_185656 asked Erick Ramirez edited

Why is my query working with partial partition key?

I heard and learned that a query cannot be made with a partial partition key, all columns from the partition key must be provided otherwise it fails. But, in this case, it is not, the query does not fail and the return is right. However I've to admit that the case is particular because, it is a JUnit (in memory database) test.... Please, let me know where I'm wrong or if it is a bug or a new feature ? ;)

So, here are the version of the component I'm using.

        <dependency>
            <groupId>com.datastax.cassandra</groupId>
            <artifactId>cassandra-driver-extras</artifactId>
            <cassandra-driver-extras.version>3.10.2</cassandra-driver-extras.version>
        </dependency>
        <dependency>
            <groupId>com.datastax.cassandra</groupId>
            <artifactId>cassandra-driver-core</artifactId>
            <cassandra-driver.version>3.10.2</cassandra-driver.version>
        </dependency>
        <dependency>
            <groupId>org.cassandraunit</groupId>
            <artifactId>cassandra-unit-spring</artifactId>
            <cassandra-unit.version>3.11.2.0</cassandra-unit.version>
        </dependency>

Here is the model:

CREATE TABLE IF NOT EXISTS test (
    company_id text,
    location_id text,
    user_id text,
    date_time timestamp,
    type text,
    id uuid,
    ...
    type text,
    external_id text,
    PRIMARY KEY ((company_id, location_id), user_id, date_time, type, id)
) WITH comment = 'Test for secondary index and partition key';

Here are the data

INSERT INTO test (company_id, location_id, user_id, date_time, type, id, status, external_id)
VALUES ('5-company_id', '1-location_id', '1-user_id', 951822000000, 'TRAINING_START', 43026b3e-5746-4e78-a9c3-7635ad032bb0, 'QUEUED', 'The-External-Id')
IF NOT EXISTS;

INSERT INTO test (company_id, location_id, user_id, date_time, type, id, status, external_id)
VALUES ('5-company_id', '2-location_id', '1-user_id', 951908400000, 'SHIFT_START', ba001e98-07a6-44ed-b191-810c10faf059, 'IMPORTED', 'The-External-Id')
IF NOT EXISTS;

Query is:

final Select select = QueryBuilder.select().from("test")
        .where(eq(PKEY_COMPANY_ID, companyId))
        .and(eq("external_id", externalId))
        .allowFiltering()
        .limit(7);
final List<Test> Tests = select(select, Test.class);

Log shown:

16:50:45.930 [main] DEBUG c.w.cassandra.dao.test.TestDAO - findByExternalId - companyId: 5-company_id - externalId: The-External-Id
16:50:45.930 [main] DEBUG c.w.cassandra.dao.test.TestDAO - findByExternalId - Query string: SELECT * FROM test WHERE company_id=? AND external_id=? LIMIT 7 ALLOW FILTERING;
16:50:45.933 [main] DEBUG c.w.cassandra.dao.test.TestDAO - findByExternalId - result from DB: 2 item(s)
17:29:18.995 [main] DEBUG c.w.cassandra.dao.test.TestDAO - Test@1306429814 -> id: ba001e98-07a6-44ed-b191-810c10faf059 - companyId: 5-company_id - locationId: 2-location_id - userId: 1-user_id - datetime: Wed Mar 01 06:00:00 EST 2000 - type: SHIFT_START - status: IMPORTED
17:29:18.995 [main] DEBUG c.w.cassandra.dao.test.TestDAO - Test@1216863787 -> id: 43026b3e-5746-4e78-a9c3-7635ad032bb0 - companyId: 5-company_id - locationId: 1-location_id - userId: 1-user_id - datetime: Tue Feb 29 06:00:00 EST 2000 - type: TRAINING_START - status: QUEUED

So, how it is possible ?

I have just made a test under DataStax Studio and it is working too.... and it triggers other questions related to in that case. How to be sure that all nodes have been queried and how to be sure about the order of the data, and how to be sure that we have newest data if one node did not answer or no need to answer due to the limit ? Then, does the secondary index is the safest way ?

Thank you for your answer.

data modelingcql
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered mcboatface.boat_186845 Suspended commented

Relevant information

You posted a lot of information but the only relevant items are the primary key:

    PRIMARY KEY ((company_id, location_id), user_id, date_time, type, id)

and the query:

SELECT * FROM test WHERE company_id=? AND external_id=? LIMIT 7 ALLOW FILTERING;

Answer

The simple answer to your question is that the query works because you have used the ALLOW FILTERING clause (.allowFiltering()).

When you use the ALLOW FILTERING clause, you can query the table by filtering on any column that is not indexed. Since you are querying without specifying the partition key, it requires a full table scan and will affect the performance of your cluster.

For this reason, consider the use of the ALLOW FILTERING clause in development environments only. You should never use it in your application code for production.

There are some use cases where a secondary index is appropriate but our general recommendation is that you should create a table for every application query for optimal performance since the one-table-per-query approach scales as your dataset and cluster grows.

For more info, see When to use (and not use) an index.

Other details

As a side note, I noted your use of IF NOT EXISTS when inserting data. If you are not already aware, conditional writes (compare-and-set, also known as lightweight transactions) are very expensive and you should only use them when necessary.

Conditional writes require a read-before-write since Cassandra needs to read the data to test the condition before it can write. Lightweight transactions involve 4 round trips and are costly if you are using them unnecessarily.

For more info, see Lightweight transactions in Cassandra. Cheers!

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

jean.andre_185656 avatar image
jean.andre_185656 answered David Jones-Gilardi commented

Oh thank very much for your quick answer. Finally yesterday, as newbie in that tech., I made some search on how Cassandra handles those requests (late in the night). I found this answer. Now, I have a better understanding. So my other question is, does Cassandra will still have the same slowness if the partition is fully provided but the search is still made will "allow filtering" for a clustering column (user_id, type, date_time, ...) ?

So, what is the most performant in high volume between those 2 ways ?

1)

final Select select = QueryBuilder.select().from("test")
        .where(eq(PKEY_COMPANY_ID, companyId))
        .and(eq(PKEY_LOCATION_ID, locationId))
        .and(eq(PKEY_USER_ID, userId))
        .allowFiltering();
final List<Test> tests = select(select, Test.class);

Versus

2)

final Select select = QueryBuilder.select().from("test")
        .where(eq(PKEY_COMPANY_ID, companyId))
        .and(eq(PKEY_LOCATION_ID, locationId)); 
final List<Test> tests = select(select, Test.class);

return tests.stream()
        .parallel()
        .filter(filterOnUser)
        .collect(Collectors.toList()


I understand that duplicating the data by providing another table is the best approach with Cassandra to hit the best speed performance, but do we really want to duplicate 300 000 millions of records multiply by 3 in our real situation because there is a need for special queries when all the most queries provide the full partition key ? I'm hesitating...but if it needs....we will do it. And if we do not want any "allow filtering" then we have to multiply by 10 !!


But I rather prefer the case 1) as we use the pagination based on that sample from DataStax. It makes the things simpler for now.


Thank you for the good advice about "IF NOT EXIST". It is only for JUnit test, as the CQL file to populate the DB for JUnit is written by hand and as I do copy/paste, I do not want to modify my previous test case scenario by modifying existing data (just want to prevent error).


So thank you again for your answer and have a good day.

3 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Cedrick Lunven avatar image Cedrick Lunven ♦ commented ·

Sorry but simply put if you need `.allowFiltering();` your data model is not correct (you need to replicate data in another table or use secondary index for low cardinality.

To end you Query with the build terminate with .build();

Here are some sample code and code to help you migrate to latest version of driver:

https://github.com/DataStax-Examples/java-cassandra-driver-from3x-to4x/blob/master/example-3x/src/main/java/com/datastax/samples/SampleCode3x_CRUD_01_Simple.java

Enjoy !

0 Likes 0 ·
Cedrick Lunven avatar image Cedrick Lunven ♦ Cedrick Lunven ♦ commented ·

oups the .build() is only for latest versio not 3.x

0 Likes 0 ·
David Jones-Gilardi avatar image David Jones-Gilardi ♦ Cedrick Lunven ♦ commented ·

Yes, to add to this you should NEVER use ALLOW FILTERING in any production sense. It will not scale and you will end up a bad experience for your database and your application. As @Cedrick Lunven said, if you feel compelled to do this you really need to take a look at your data model and work to remove any cases where it exists.

0 Likes 0 ·
smadhavan avatar image
smadhavan answered jean.andre_185656 commented

Fantastic job @jean.andre_185656 for posting all the details which will help the party on the other side to understand your problem in detail!

I see that you're using an older 3.x version of DataStax Java Driver 3.10.2 version and I wanted to point to our latest Unified DataStax Java Driver for Apache Cassandra™ 4.x series which has got quite a lot of improvements and features to explore. If you are starting your development new, this will be the recommended version to leverage. Here is the Maven repositories for reference.

Here is a blog that talks about ALLOW FILTERING and why it is to be avoided in first place with pros and cons in addition to what Erick has already explained.

Hope that helps!


About your followup question above, for the 1st case, I don't understand why would you require adding ALLOW FILTERING since you're already providing the full composite partition key (company_id, location_id) and the 1st clustering key user_id of the table already. If you could provide additional context here by a comment or by updating your question, it would make it easier for folks to understand the goal and recommend a solution.

p/s You would require to provide ALLOW FILTERING only if you are wanting to use one of the clustering keys without providing the preceding clustering keys in the order in which they were specified during table creation. For e.g. if you want to execute a query like below,

SELECT ... FROM [keyspace_name.]test WHERE company_id = ? AND location_id = ? AND id = ?

notice you are skipping user_id, date_time, and type clustering columns prior to using id last clustering column, you should specify ALLOW FILTERING without which it won't execute. It might be okay to leverage the usage of allow filtering within a given partition if that specific partition doesn't contain a large number of rows (or) is not a wide/large partition going over the recommended total size of the partition of a couple MBs.

2 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

jean.andre_185656 avatar image jean.andre_185656 commented ·

Thank you very much. You are totally right for the user id, I've some tests with and without and what is a little bit funny is about how to terminate the Select without having providing a allowFiltering or limit statement !?

final Select select = QueryBuilder.select().from("Test")
        .where(eq(PKEY_COMPANY_ID, companyId))
        .and(eq(PKEY_LOCATION_ID, locationId))
        .and(eq(PKEY_USER_ID, userId))
        .limit(4);

versus

final Select select = QueryBuilder.select().from("Test")
        .where(eq(PKEY_COMPANY_ID, companyId))
        .and(eq(PKEY_LOCATION_ID, locationId))
        .and(eq(PKEY_USER_ID, userId))
        .allowFiltering();


So, it is a detail, now I looking for to create an secondary index, but my tests show I cannot create index on 2 columns.

STMT_COMPANYID_EXTERNALID_IDX = "CREATE INDEX IF NOT EXISTS" +
        " company_id_external_id_idx ON test" 
        " (company_id, external_id)";


0 Likes 0 ·
jean.andre_185656 avatar image jean.andre_185656 jean.andre_185656 commented ·

Then, I look for Custom index and documentation on CustomIndex like SASI from this page, tell us that is not recommend to go with in Production. So any other solution ?

0 Likes 0 ·