In this use-case, need to fetch the records as the tables are designed as parent-child relation. Having the value and need to query that value on parent-table, collect it's child until the child not there in the table, query this to the child-table and fetch the children of all the parents. We have spark (2 node cluster) and DB (12 node cluster)
Parent element count = approx 8k (extracted on the basis of the given value)~30G
Child element count = approx 11k (extracted on the basis of 8k values)~412G
Total records = approx 76k (extracted on the basis of 11k+8k=19k values)~126G
Current entire code is taking 2.1 hrs to run and collect the data.
executor-memory: 30GB and driver-memory: 10GB
In the first two tables, parentaui is secondary index. As per your suggestion and raised here CASSANDRA-10050 , secondary indexes are not that good.
Tried couple of cases here-
1. Using datasets- we saw cached read is slower.
2. Using rdd- still trying the approaches.
Code took 2.1 hours:
1. Create a queue(in parent-child retrieval, order is important) and set(elements will be unique).
2. Filter rdd at reading time in where clause for parentaui (secondary index filter) as given specifically [Recursive till the queue not ended as need to retrieve the children of children] ---> Around 8k extracted.
3. After retrieving the values from step2, querying to child table based on where clause parentaui (secondary index filter) ----> Around 11k extracted
4. After collecting all the records, querying to the final table based on where clause on primary key and store the records ----> Around 76k extracted (based on 11k+8k=19k)
Intention is to reduce the processing time.
Attaching code for reference. (Doing rdd.collect().toSet operation to collect the items for the further retrieval)child_retrieval.txt
Please let me know in case, need for further details, I believe we can able to optimize the time for this sort of use-cases.