anson asked Erick Ramirez commented

Clarification on pagination internals

I am continuing the question I asked previously: https://community.datastax.com/questions/9978/how-pagination-is-done-internally.html#

Since the comment section has a 1000-character limit, I thought I would ask this separately.

Based on your answer, if I use an unbounded query like "select column from images", it has to go through ALL the tables, as in the normal Cassandra read path, and only after that are the first 1000 records (assuming my paging size is 1000) returned. Is that right?

What I found is that when I run the query "select column from images LIMIT 1000" in cqlsh, it times out, which is expected.

But in the Java driver, if I do the following (forgive any syntax problems; it is only partial code):


import com.datastax.driver.core.*;

// Java driver 3.x API; assumes an already-open Session named "session"
Statement statement = new SimpleStatement("SELECT column FROM images");
statement.setFetchSize(1000);  // page size
String savingPageState = null;
while (true) {
    if (savingPageState != null) {
        // resume from where the previous page ended
        statement.setPagingState(PagingState.fromString(savingPageState));
    }
    ResultSet result = session.execute(statement);
    // process the rows of the current page here
    PagingState pagingState = result.getExecutionInfo().getPagingState();
    if (pagingState == null) {
        break;  // no more pages
    }
    savingPageState = pagingState.toString();
}
           

Here I am executing the statement with the saved paging state in a while loop, and even though the query is unbounded, I was able to fetch 1000 records at a time. If the same read path were followed, i.e. scanning through all the tables, it should have timed out, right?

Could you clarify?

cassandradriver

1 Answer

Erick Ramirez answered Erick Ramirez commented

No, it isn't guaranteed to time out. It depends on many factors, such as the number of partitions in the table and the size of the cluster. Cheers!


anson commented ·

So basically, even if I page through 1000 records at a time using an unbounded query against a cluster of, say, 1 million rows, it essentially scans through all those tables (containing a total of 1 million rows) and only then takes the first 1000 rows. Am I right?

anson commented ·

@Erick Ramirez

Erick Ramirez ♦♦ commented (in reply to anson) ·

To be clear, Cassandra isn't designed for full table scans because it solves a different problem: its goal is to provide really fast retrieval of a partition at internet scale.

When you run unbounded queries, you're really running OLAP workloads instead of OLTP workloads. That is why you need other tools such as Spark to run analytics queries against a C* cluster: the Spark connector is smart enough to parallelise the query into small ranges, so only small amounts of data are retrieved at a time. Cheers!
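To illustrate the "small ranges" idea, here is a minimal sketch (not the Spark connector's actual implementation) that splits the full Murmur3 token ring into contiguous sub-ranges and builds one bounded query per range. It assumes the cluster uses the default Murmur3Partitioner; the partition key name `id` and the class `TokenRangeSplitter` are hypothetical, and `column` and `images` are taken from the question. A real connector would also align the splits with the token ranges each node owns.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class TokenRangeSplitter {
    // Token bounds of the Murmur3Partitioner (assumed partitioner)
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /** Splits the full token ring into n contiguous {start, end} ranges. */
    public static List<long[]> split(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // BigInteger avoids overflow: the ring is wider than a signed long
        BigInteger width = MAX.subtract(MIN).divide(BigInteger.valueOf(n));
        List<long[]> ranges = new ArrayList<>();
        BigInteger start = MIN;
        for (int i = 0; i < n; i++) {
            // The last range absorbs the rounding remainder
            BigInteger end = (i == n - 1) ? MAX : start.add(width);
            ranges.add(new long[] {start.longValueExact(), end.longValueExact()});
            start = end;
        }
        return ranges;
    }

    /** Builds the CQL for one range; "id" is a hypothetical partition key. */
    public static String queryFor(long[] range, boolean first) {
        // The first range includes its lower bound; later ranges exclude it
        // so that adjacent ranges never overlap.
        String lower = first ? "token(id) >= " : "token(id) > ";
        return "SELECT column FROM images WHERE " + lower + range[0]
                + " AND token(id) <= " + range[1];
    }

    public static void main(String[] args) {
        List<long[]> ranges = split(4);
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println(queryFor(ranges.get(i), i == 0));
        }
    }
}
```

Each generated query touches only one slice of the ring, so the slices can be run independently (and in parallel) rather than as one unbounded scan.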

anson commented (in reply to Erick Ramirez ♦♦) ·

Yes, that I understand.

So basically, even if we use pagination (1000 at a time) on an unbounded query, it retrieves the first 1000 rows only AFTER a full table/node scan. Right?

@Erick Ramirez
