
phofegger_148429 asked:

Spark job fails with BusyPoolException: Pool is busy (no available connection and the queue has reached its max size 256)

Hi,

We have a DataAnalytics cluster running Cassandra 3.10 / Spark 2.4.4 / Mesos 1.6.2.

A certain job that runs every day has been causing the following error on Cassandra for the past two days:

Job aborted due to stage failure: Task 21 in stage 28.0 failed 4 times, most recent failure: Lost task 21.3 in stage 28.0 (TID 1507, host2, executor 3): com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: host1:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [host1] Pool is busy (no available connection and the queue has reached its max size 256)),host2:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [host2] Pool is busy (no available connection and the queue has reached its max size 256)), host3:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [host3] Pool is busy (no available connection and the queue has reached its max size 256)), host5:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

Is this a known issue? Can I tune some properties?

Many thanks in advance.

sparkconnector

Comment: Do you use a custom consistency level?

Russell Spitzer answered:

This is an issue where a point upgrade of the driver changed the behavior of executeAsync inside the C* connector. As detailed in SPARKC-503, the call previously blocked when the pool was full, but in the newer driver version it no longer blocks and instead instantly returns a failed future. That means that if the application launches more simultaneous queries than the queue can hold, it throws an error instead of blocking.

To fix the issue there are basically two paths. First, you can upgrade to a version of the Spark Cassandra Connector greater than v2.0.4, which will automatically retry the busy-pool exceptions (you must use the SCC default retry policy for this). This may be required, since there are versions of the driver which do not automatically resize the pool correctly.
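
For example, if the job is built with sbt, bumping the connector dependency could look like the sketch below. The coordinates are the standard ones for the connector; the version number is only illustrative (any release newer than 2.0.4 carries the retry behavior), and you should pick the release line that matches your Spark version.

    // build.sbt (illustrative version; choose the connector release line that
    // matches your Spark version, e.g. the 2.4.x connector for Spark 2.4)
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.1"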

Second, you can adjust the parameters around the driver's connection pool size so that it is greater than the theoretical maximum number of queries executed simultaneously by the app. On an executor JVM, the maximum number of queries executed is basically:

Number of Cores * concurrent.reads (when using a joinWithCassandraTable operation)

or

Number of Cores * concurrent.writes
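
As a purely illustrative calculation (the numbers are assumptions, not values taken from this cluster): with 8 cores per executor and concurrent.writes set to 5, a single executor can have up to 8 * 5 = 40 requests in flight, so each executor's connection pool plus its queue must be able to absorb at least 40 outstanding requests to avoid the BusyPoolException.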

You can balance this with the total number of connections allowed using the following parameters:

SCC 2.5 - spark.cassandra.connection.localConnectionsPerExecutor
SCC 2.4 - spark.cassandra.connection.connections_per_executor_max

In both cases Spark tries to estimate the right number of connections given your number of executor cores, but this sometimes does not work for certain resource managers.
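
A minimal sketch of setting the pool size explicitly, assuming SCC 2.4 (the property name is the one listed above; the host name comes from the error in the question, and the value of 8 is only an illustration to be sized against cores * concurrent reads/writes):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val poolConf = new SparkConf()
      .setAppName("daily-analytics-job") // placeholder name, not the original job's
      .set("spark.cassandra.connection.host", "host1") // host taken from the error message
      // SCC 2.4 property from the list above; on SCC 2.5 use
      // spark.cassandra.connection.localConnectionsPerExecutor instead.
      .set("spark.cassandra.connection.connections_per_executor_max", "8") // illustrative value

    val spark = SparkSession.builder().config(poolConf).getOrCreate()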


alex.ott answered:

In most cases, the BusyPoolException is a sign that Cassandra can't keep up with the queries. For writes, I would start by decreasing spark.cassandra.output.concurrent.writes (see the configuration reference). For reads, I would tune spark.cassandra.input.readsPerSec or spark.cassandra.input.throughputMBPerSec.
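
A minimal sketch of what that throttling could look like, using the property names from this answer (the host name comes from the error in the question, and the numeric values are only illustrative starting points, not tuned recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val throttledConf = new SparkConf()
      .set("spark.cassandra.connection.host", "host1") // host taken from the error message
      .set("spark.cassandra.output.concurrent.writes", "2") // fewer concurrent write batches per task
      .set("spark.cassandra.input.readsPerSec", "2000") // throttle reads issued per second
      .set("spark.cassandra.input.throughputMBPerSec", "50") // cap read throughput

    val spark = SparkSession.builder().config(throttledConf).getOrCreate()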
