I'm seeing the same issue as in Question #1147 when enabling client authentication from LCM. I've even tried manually updating cassandra.yaml to set client_encryption_options.require_client_auth=true and then restarting dse, datastax-agent and opscenterd, with no luck. I've tried every combination I could think of: generating my own CA and certificates, and signing the certs with either the CA generated by LCM or my own. Nothing worked. I also followed the instructions here, here and here, to name just a few, and no joy. It was working perfectly until I updated the server and client keystore and truststore passwords. I changed them because JConsole was failing to connect and I thought an overly complex password was being parsed badly (I had to escape the $ sign in datastax-agent-env.sh for it to be read correctly). Since then I have re-created the keystore and truststore several times with new passwords, and nothing works.
All the nodes communicate with each other without problems with client auth enabled, JMX client auth enabled and node-to-node client auth enabled. I can see all the nodes and stats in OpsCenter, and the agents show a green checkmark. However, the LCM configuration job fails every time with a NoHostAvailable error when I run a configuration with client auth enabled for client encryption. The Clusters/Cluster/Configure command in LCM runs fine up to the point where it fails with this message:
2020-04-29 22:22:40,880Z [opscenterd] INFO: Received milestone from node name="cassandra03" ssh-management-address="10.220.2.103" node-id="0f91243f-c0af-4914-9baa-42ebc21e0f9b" message="Remote execution is now complete. Closing the SSH connection." job-id="7600e9ea-e2a6-4d30-bea9-d9e1d6e8b315" (opscd-pool-0)
2020-04-29 22:22:40,892Z [opscenterd] INFO: Received check message="Verifying the password for DSE user 'cassandra'" job-id="7600e9ea-e2a6-4d30-bea9-d9e1d6e8b315" (async-thread-macro-9)
2020-04-29 22:22:40,919Z [opscenterd] INFO: configure job finished for node name="cassandra03" ssh-management-address="10.220.2.103" node-id="0f91243f-c0af-4914-9baa-42ebc21e0f9b" (async-thread-macro-6)
2020-04-29 22:23:00,001Z [Appcuarium_DSE_Cluster] INFO: Starting scheduled best-practice job 47e19b6d-5f44-4dca-85d7-e4bba25a566a (MainThread)
2020-04-29 22:23:00,002Z [Appcuarium_DSE_Cluster] INFO: Starting scheduled best-practice job c89aa6fc-07c4-46de-8382-ba70ebafa623 (MainThread)
2020-04-29 22:23:00,008Z [Appcuarium_DSE_Cluster] INFO: Starting scheduled best-practice job e251bb7f-ca64-4c3d-8ed0-3ec0497442ec (MainThread)
2020-04-29 22:23:00,018Z [Appcuarium_DSE_Cluster] INFO: Scheduled best-practice job 47e19b6d-5f44-4dca-85d7-e4bba25a566a finished (MainThread)
2020-04-29 22:23:00,060Z [Appcuarium_DSE_Cluster] INFO: Scheduled best-practice job e251bb7f-ca64-4c3d-8ed0-3ec0497442ec finished (MainThread)
2020-04-29 22:23:00,166Z [Appcuarium_DSE_Cluster] INFO: Scheduled best-practice job c89aa6fc-07c4-46de-8382-ba70ebafa623 finished (MainThread)
2020-04-29 22:23:01,872Z [opscenterd] ERROR: Can't get a cassandra connection for cassandra user. Target host may be down or CQL port may be blocked by a firewall. Consider setting a larger cassandra_connection_timeout property value. username="cassandra" cassandra_connection_timeout="20000" (async-thread-macro-9)
2020-04-29 22:23:01,883Z [opscenterd] ERROR: Configure job 7600e9ea-e2a6-4d30-bea9-d9e1d6e8b315 failed! (async-thread-macro-9)
There is no error logged in /var/log/datastax-agent/agent.log at the time of the error in the OpsCenter log, but there is one in /var/log/cassandra/system.log:
INFO [CoreThread-2] 2020-04-29 22:23:01,847 NoSpamLogger.java:95 - Unexpected exception during request; channel = [id: 0xecf8c802, L:/10.220.2.103:9042 ! R:/10.220.0.100:39570]
javax.net.ssl.SSLHandshakeException: null cert chain
    at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1566)
    at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:545)
    at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:819)
    at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:783)
    at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:626)
    at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:294)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1275)
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1177)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1221)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:474)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEpollEvents(EpollTPCEventLoopGroup.java:956)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:924)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:501)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
    at org.apache.cassandra.utils.concurrent.InlinedThreadLocalThread.run(InlinedThreadLocalThread.java:251)
Caused by: javax.net.ssl.SSLHandshakeException: null cert chain
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
    at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1667)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:333)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:321)
    at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:2011)
    at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:233)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1082)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:1015)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:1012)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1504)
    at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1435)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1343)
    ... 21 common frames omitted
From the error above I understand that OpsCenter is refusing the certificate that the node is sending, but I could be wrong. OpsCenter is running on 10.220.0.100.
There's no firewall whatsoever for now as I'm configuring the cluster. The cassandra password works perfectly with client auth disabled. Also, cqlsh, dsetool, nodetool and nodesync work as expected with SSL and auth.
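One way to narrow down which side aborts the handshake is a small probe run from the OpsCenter host against the node's CQL port, once without and once with a client certificate. The stack trace above comes from the node's ServerHandshaker.clientCertificate, and as far as I can tell JSSE reports "null cert chain" there when the connecting client sends no certificate at all, so comparing the two attempts should show whether that is what is happening. This is only a minimal sketch using the Python standard library; the function names and the PEM file names in the usage comments are placeholders, not anything OpsCenter or LCM provides.

```python
import socket
import ssl


def make_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context; every argument is an optional PEM path."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Node certificates may not list the IP as a SAN, so skip hostname matching.
    ctx.check_hostname = False
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    else:
        ctx.verify_mode = ssl.CERT_NONE  # assumption: only testing client auth
    if cert_file:
        # Raises ssl.SSLError if the key does not belong to the certificate.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def probe(host, port, ctx, timeout=5.0):
    """Attempt one TLS connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw) as tls:
                # With TLS 1.3 a rejected client certificate can surface only
                # on the first read, so wait briefly for an alert or a close.
                tls.settimeout(2.0)
                try:
                    data = tls.recv(1)
                except socket.timeout:
                    # The CQL server sends nothing first, so silence means
                    # the connection is still open and was accepted.
                    return True, "handshake accepted (%s)" % tls.version()
                if data == b"":
                    return False, "server closed the connection after the handshake"
                return True, "handshake accepted (%s)" % tls.version()
    except (ssl.SSLError, OSError) as exc:
        return False, "%s: %s" % (type(exc).__name__, exc)


# Example usage (placeholder file names, adjust to your own PEM files):
#   print(probe("10.220.2.103", 9042, make_context("ca.pem")))
#   print(probe("10.220.2.103", 9042,
#               make_context("ca.pem", "client.pem", "client.key")))
```

If the attempt without a client certificate fails while the one with a certificate succeeds, the node is behaving as configured and the question becomes which certificate, if any, the failing client presents.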
It's been 3 days now without luck, any help is really appreciated, I'm out of ideas :/
UPDATE: @Erick Ramirez Thanks for your response. I will provide more context for some parts of your input. Sorry for posting this as a response to myself, but my reply exceeded the 1000-character limit for a comment on your answer.
"This means you will also need to import the OpsCenter certificate into the truststore of every DSE node by following this procedure."
I followed the procedure you linked (it's also linked in my comment), and the other link is another one I had already looked into while trying to solve this issue. The client (OpsCenter) certificates are imported into the truststore of the DSE nodes, and I can connect successfully from any node via cqlsh with SSL and client authentication. The OpsCenter truststore also contains the nodes' certificates and the CA. As mentioned, OpsCenter shows the status of the nodes correctly and can communicate with the agents without issues. It is LCM that triggers that error message and fails when I try to deploy the configuration with client_auth checked on the client encryption options tab for cassandra.yaml.
"You will also need to configure cqlsh to connect to the nodes with client-to-node encryption enabled"
It is configured, and I can start a CQL shell to any node from any node. To test the connection and certificates, I also installed the cqlsh package on the OpsCenter node, extracted the private key from the OpsCenter certificate keystore and added the settings to ~/.cassandra/cqlshrc for user@opscenter. With that I can connect from OpsCenter to any node in the cluster. This is why I'm so confused about LCM complaining about the TLS handshake: the certificate used for this connection is the same one that is imported into the nodes' truststore, so the nodes should accept it.
These are the contents of the cqlshrc file:
[authentication]
username = cassandra
password = redacted
[ssl]
certfile = /var/lib/opscenter/ssl/lcm/cluster_151c8e74-c4c7-40ed-8bf3-680207880945.crt
validate = true
userkey = /home/redacted/opscenter.key
usercert = /home/redacted/signing_request.crt_signed
[connection]
hostname=10.220.2.103
And here is the result:
user@opscenter:~$ sudo cqlsh-6.8.0/bin/cqlsh --ssl --debug
Using CQL driver: <module 'cassandra' from '/home/user/cqlsh-6.8.0/bin/../zipfiles/cassandra-driver-internal-only-3.21.0.post0+20200211.zip/cassandra-driver-3.21.0.post0+20200211/cassandra/__init__.py'>
Using connect timeout: 5 seconds
Using 'utf-8' encoding
Using ssl: True
Using DSEPlainTextAuthProvider
Connected to DSE Cluster at 10.220.2.103:9042.
[cqlsh 6.8.0 | DSE 6.8.0 | CQL spec 3.4.5 | DSE protocol v2]
Use HELP for help.
cassandra@cqlsh>
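For anyone who wants to double-check the same files independently of cqlsh, the [ssl] section of the cqlshrc can be read and the key/certificate pair loaded with the Python standard library. This is only a sketch: the helper names are made up for illustration, the private key is assumed to be an unencrypted PEM file, and it only validates the PEM material that cqlsh uses. It says nothing about which keystore or certificate LCM itself presents when it connects.

```python
import configparser
import os
import ssl


def read_ssl_section(cqlshrc_path):
    """Return the [ssl] section of a cqlshrc file as a plain dict."""
    # interpolation=None so that '%' in other values is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(os.path.expanduser(cqlshrc_path)):
        raise FileNotFoundError(cqlshrc_path)
    return dict(parser["ssl"]) if parser.has_section("ssl") else {}


def check_client_material(settings):
    """Return a list of problems found with certfile, usercert and userkey."""
    problems = []
    for key in ("certfile", "usercert", "userkey"):
        path = settings.get(key)
        if not path:
            problems.append("%s is not set" % key)
        elif not os.path.isfile(path):
            problems.append("%s does not exist: %s" % (key, path))
    if problems:
        return problems

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # certfile is what cqlsh uses to validate the node's certificate.
        ctx.load_verify_locations(cafile=settings["certfile"])
    except ssl.SSLError as exc:
        problems.append("certfile is not a usable PEM file: %s" % exc)
    try:
        # Fails if the private key does not match the client certificate.
        ctx.load_cert_chain(settings["usercert"], settings["userkey"])
    except ssl.SSLError as exc:
        problems.append("usercert/userkey do not load as a pair: %s" % exc)
    return problems


# Example usage:
#   for problem in check_client_material(read_ssl_section("~/.cassandra/cqlshrc")):
#       print(problem)
```

An empty list means the three files exist and load cleanly, which matches the successful cqlsh session above and points the investigation back at what LCM sends rather than at these files.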
Any ideas what else could be happening? Thanks.