I am reading a Cassandra table using PySpark. My Spark job is failing because a few large partitions in Cassandra do not fit into Spark executor memory. I was under the impression that input.split.size would divide such large partitions into 64 MB chunks, creating several Spark partitions from one large Cassandra partition, but this does not seem to be happening. From my research I understand that the connector groups Cassandra data into roughly 64 MB splits: if Cassandra partitions are smaller than 64 MB, several of them are combined into one Spark partition close to 64 MB in size; but if a Cassandra partition is larger than 64 MB, the input split size does not apply, and that large Cassandra partition becomes one equally large Spark partition.
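For reference, this is roughly how I am reading the table. The keyspace/table names are placeholders for my real ones, and the split-size option name is what I believe the Spark Cassandra Connector uses (it may be `spark.cassandra.input.split.size_in_mb` on older connector versions):

```python
# Sketch of my read path (assumes the Spark Cassandra Connector jar is on the
# classpath and a Cassandra cluster is reachable; "my_keyspace"/"my_table"
# are placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-read")
    # Ask the connector to target ~64 MB of Cassandra data per Spark partition.
    .config("spark.cassandra.input.split.sizeInMB", "64")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .load()
)

# Even with the split size set, partitions backed by a single oversized
# Cassandra partition still show up as single oversized Spark partitions.
print(df.rdd.getNumPartitions())
```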
How do I deal with this? What can I do to break large Cassandra partitions into smaller Spark partitions?