appcuarium asked Erick Ramirez edited

OpsCenter can't connect to Cassandra when client auth is enabled in LCM

I'm seeing the same issue as in Question #1147 when enabling client authentication from LCM. I've even tried manually updating cassandra.yaml to set client_encryption_options.require_client_auth=true and then restarting dse, datastax-agent and opscenterd, with no luck. I've tried every combination possible: generating my own CA and certificates, and signing the certs with either the CA generated by LCM or my own. Nothing worked. I also followed the instructions here, here and here, to name just a few, and no joy. It was working perfectly until I updated the server and client keystore and truststore passwords. I changed them because JConsole was failing to connect and I thought an overly complex password was being parsed badly (I had to escape the $ sign in datastax-agent-env.sh for it to be read correctly). Since then I have re-created the keystore and truststore several times with new passwords, and nothing works.

All the nodes communicate with each other perfectly with client auth enabled, JMX client auth enabled and node-to-node client auth enabled. I can see all the nodes and stats in OpsCenter, and the agents show a green checkmark. However, the LCM configuration task fails every time with a NoHostAvailable error when I run a configuration with client auth enabled for client encryption. Running the Clusters/Cluster/Configure command from LCM works fine up to the point where it fails with this message:

2020-04-29 22:22:40,880Z [opscenterd]  INFO: Received milestone from node name="cassandra03" ssh-management-address="10.220.2.103" node-id="0f91243f-c0af-4914-9baa-42ebc21e0f9b" message="Remote execution is now complete. Closing the SSH connection." job-id="7600e9ea-e2a6-4d30-bea9-d9e1d6e8b315" (opscd-pool-0)
2020-04-29 22:22:40,892Z [opscenterd]  INFO: Received check message="Verifying the password for DSE user 'cassandra'" job-id="7600e9ea-e2a6-4d30-bea9-d9e1d6e8b315" (async-thread-macro-9)
2020-04-29 22:22:40,919Z [opscenterd]  INFO: configure job finished for node name="cassandra03" ssh-management-address="10.220.2.103" node-id="0f91243f-c0af-4914-9baa-42ebc21e0f9b" (async-thread-macro-6)
2020-04-29 22:23:00,001Z [Appcuarium_DSE_Cluster]  INFO: Starting scheduled best-practice job 47e19b6d-5f44-4dca-85d7-e4bba25a566a (MainThread)
2020-04-29 22:23:00,002Z [Appcuarium_DSE_Cluster]  INFO: Starting scheduled best-practice job c89aa6fc-07c4-46de-8382-ba70ebafa623 (MainThread)
2020-04-29 22:23:00,008Z [Appcuarium_DSE_Cluster]  INFO: Starting scheduled best-practice job e251bb7f-ca64-4c3d-8ed0-3ec0497442ec (MainThread)
2020-04-29 22:23:00,018Z [Appcuarium_DSE_Cluster]  INFO: Scheduled best-practice job 47e19b6d-5f44-4dca-85d7-e4bba25a566a finished (MainThread)
2020-04-29 22:23:00,060Z [Appcuarium_DSE_Cluster]  INFO: Scheduled best-practice job e251bb7f-ca64-4c3d-8ed0-3ec0497442ec finished (MainThread)
2020-04-29 22:23:00,166Z [Appcuarium_DSE_Cluster]  INFO: Scheduled best-practice job c89aa6fc-07c4-46de-8382-ba70ebafa623 finished (MainThread)
2020-04-29 22:23:01,872Z [opscenterd] ERROR: Can't get a cassandra connection for cassandra user. Target host may be down or CQL port may be blocked by a firewall. Consider setting a larger cassandra_connection_timeout property value. username="cassandra" cassandra_connection_timeout="20000" (async-thread-macro-9)
2020-04-29 22:23:01,883Z [opscenterd] ERROR: Configure job 7600e9ea-e2a6-4d30-bea9-d9e1d6e8b315 failed! (async-thread-macro-9)

No error is logged in /var/log/datastax-agent/agent.log at the time of the error in the OpsCenter log, but there is one in /var/log/cassandra/system.log:

INFO  [CoreThread-2] 2020-04-29 22:23:01,847  NoSpamLogger.java:95 - Unexpected exception during request; channel = [id: 0xecf8c802, L:/10.220.2.103:9042 ! R:/10.220.0.100:39570]
javax.net.ssl.SSLHandshakeException: null cert chain
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1566)
at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:545)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:819)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:783)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:626)
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:294)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1275)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1177)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1221)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:474)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEpollEvents(EpollTPCEventLoopGroup.java:956)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:924)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:501)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
at org.apache.cassandra.utils.concurrent.InlinedThreadLocalThread.run(InlinedThreadLocalThread.java:251)
Caused by: javax.net.ssl.SSLHandshakeException: null cert chain
at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1667)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:333)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:321)
at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:2011)
at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:233)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1082)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:1015)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:1012)
at java.security.AccessController.doPrivileged(Native Method)
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1504)
at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1435)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1343)
... 21 common frames omitted

From the error above, I understand that OpsCenter is refusing the certificate that the node is sending, but I could be wrong. OpsCenter is sitting on 10.220.0.100.

There's no firewall whatsoever for now as I'm configuring the cluster. The cassandra password works perfectly with client auth disabled. Also, cqlsh, dsetool, nodetool and nodesync work as expected with SSL and auth.

It's been 3 days now without luck, any help is really appreciated, I'm out of ideas :/

UPDATE: @Erick Ramirez Thanks for your response. I will provide more context for some parts of your input. Sorry for publishing this as a response to myself but I've exceeded 1000 chars as a response to your answer.

"This means you will also need to import the OpsCenter certificate into the truststore of every DSE node by following this procedure."

I followed the procedure you linked (it's also linked in my comment), and the other link is one I had already looked into for this issue. The client certificates (OpsCenter) are imported into the truststore of the DSE nodes, and I can connect successfully from any node via cqlsh with SSL and client authentication. The OpsCenter truststore also contains the nodes' certificates and the CA. As mentioned, OpsCenter shows the status of the nodes correctly and can communicate with the agents without issues. It is LCM that produces that error message and fails when I try to deploy the configuration with client_auth checked on the client encryption options tab for cassandra.yaml.

"You will also need to configure cqlsh to connect to the nodes with client-to-node encryption enabled"

It is configured, and I can start a CQL shell to any node from any node. To test the connection and certificates, I also installed the cqlsh package on the OpsCenter node, extracted the private key from the OpsCenter certificate keystore and added the settings to ~/.cassandra/cqlshrc for user@opscenter. I can connect from OpsCenter to any node in the cluster, which is why I'm so confused that LCM complains about the TLS handshake: the certificate used for this connection is the same one that is imported into the nodes' truststore, so the nodes should accept it.

These are the contents of the cqlshrc file:

[authentication]
username = cassandra
password = redacted

[ssl]
certfile = /var/lib/opscenter/ssl/lcm/cluster_151c8e74-c4c7-40ed-8bf3-680207880945.crt
validate = true
userkey = /home/redacted/opscenter.key
usercert = /home/redacted/signing_request.crt_signed

[connection]
hostname=10.220.2.103

And here is the result:

user@opscenter:~$ sudo cqlsh-6.8.0/bin/cqlsh --ssl --debug
Using CQL driver: <module 'cassandra' from '/home/user/cqlsh-6.8.0/bin/../zipfiles/cassandra-driver-internal-only-3.21.0.post0+20200211.zip/cassandra-driver-3.21.0.post0+20200211/cassandra/__init__.py'>
Using connect timeout: 5 seconds
Using 'utf-8' encoding
Using ssl: True
Using DSEPlainTextAuthProvider
Connected to DSE Cluster at 10.220.2.103:9042.
[cqlsh 6.8.0 | DSE 6.8.0 | CQL spec 3.4.5 | DSE protocol v2]
Use HELP for help.
cassandra@cqlsh>

Any ideas what else could be happening? Thanks.

lifecycle manager

Erick Ramirez ♦♦ commented ·

@appcuarium My apologies for missing the bit about LCM. I'm not sure what's happening to your cluster yet.

Could you confirm that you're using the certificates that LCM has generated? If so, what happens when you disable two-way SSL in LCM -- is LCM able to manage the cluster/run jobs?

appcuarium Erick Ramirez ♦♦ commented ·

Hi @Erick Ramirez, thanks for the input. For the nodes, I'm using the certificates that LCM generated and deployed when I enabled (1) node-to-node and (2) client-to-node encryption. I then (3) generated a JKS keystore on the OpsCenter machine as per the documentation, exported the public certificate to a file, signed it with the same RootCA that LCM generated in steps 1 and 2, imported the certificate into the nodes' server and client truststores, and modified cluster.conf on the OpsCenter machine so that the [cassandra] section uses the keystore holding the client certificate from step 3. If I disable two-way SSL, LCM can communicate with the nodes without issues. The RootCA is already in the truststore of the nodes, as LCM deployed it in steps 1 and 2.


1 Answer

Erick Ramirez answered appcuarium commented

@appcuarium If I understood you correctly, the problem is that OpsCenter cannot connect to the database as a client, specifically with this error in opscenterd.log:

2020-04-29 22:23:01,872Z [opscenterd] ERROR: Can't get a cassandra connection for cassandra user. Target host may be down or CQL port may be blocked by a firewall. Consider setting a larger cassandra_connection_timeout property value. username="cassandra" cassandra_connection_timeout="20000" (async-thread-macro-9)

According to the system.log on node 10.220.2.103, the connection from OpsCenter server (10.220.0.100) fails because of a problem with the certificate:

INFO  [CoreThread-2] 2020-04-29 22:23:01,847  NoSpamLogger.java:95 - Unexpected exception during request; channel = [id: 0xecf8c802, L:/10.220.2.103:9042 ! R:/10.220.0.100:39570]
javax.net.ssl.SSLHandshakeException: null cert chain
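As a reading aid for that log line, the channel field can be unpacked with a few lines of Python. This is only a sketch: the name `parse_channel` is made up here, and it assumes Netty's usual notation, where `L:` is the local address (the node) and `R:` is the remote address (the connecting client). Since the full trace in the question passes through `ServerHandshaker.clientCertificate`, the exception is raised on the node, and "null cert chain" is what the JDK reports when client authentication is required and the client presents no certificate at all.

```python
import re

# Matches Netty's channel description, for example
# "[id: 0xecf8c802, L:/10.220.2.103:9042 ! R:/10.220.0.100:39570]".
# "-" separates the addresses while the channel is active, "!" once it is closed.
CHANNEL_RE = re.compile(
    r"L:/(?P<local_host>[\d.]+):(?P<local_port>\d+)\s*[-!]\s*"
    r"R:/(?P<remote_host>[\d.]+):(?P<remote_port>\d+)"
)


def parse_channel(log_line):
    """Return ((local_host, local_port), (remote_host, remote_port)) or None."""
    match = CHANNEL_RE.search(log_line)
    if match is None:
        return None
    local = (match["local_host"], int(match["local_port"]))
    remote = (match["remote_host"], int(match["remote_port"]))
    return local, remote


line = (
    "INFO  [CoreThread-2] 2020-04-29 22:23:01,847  NoSpamLogger.java:95 - "
    "Unexpected exception during request; channel = "
    "[id: 0xecf8c802, L:/10.220.2.103:9042 ! R:/10.220.0.100:39570]"
)
local, remote = parse_channel(line)
print("node (server side of the handshake):", local)
print("client that failed the handshake:", remote)
```

Here the remote side is 10.220.0.100, the OpsCenter server, connecting to CQL port 9042 on the node.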

When you enable two-way SSL authentication on the DSE nodes, you need to import the client certificates into the truststore on every DSE node. A "client" is any app which connects to the database on CQL port 9042.

In this scenario, the OpsCenter server and agents are also clients because they're connecting to the database to store and retrieve data such as keyspace/table metrics. This means you will also need to import the OpsCenter certificate into the truststore of every DSE node by following this procedure.

You will also need to configure cqlsh to connect to the nodes with client-to-node encryption enabled. I suggest doing this first since it's an easier way of confirming whether SSL is configured correctly on the DSE nodes. Cheers!
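If cqlsh is not at hand, a rough equivalent of that check can be sketched with Python's standard `ssl` module. The function names and the file paths in the usage comment are placeholders, not anything OpsCenter or DSE provides, and Python needs the certificate and key in PEM form rather than a JKS keystore. Hostname checking is switched off on the assumption that node certificates may not carry matching names.

```python
import socket
import ssl


def build_client_context(ca_file=None, cert_file=None, key_file=None):
    """Build a TLS client context; a client cert is presented only if given."""
    # With ca_file=None the system default CA bundle is used instead.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Assumption: verify the certificate chain only, not the hostname.
    context.check_hostname = False
    if cert_file is not None:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def probe_handshake(host, port, context, timeout=5.0):
    """Attempt a TLS handshake and return a short description of the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw) as tls:
                return "handshake ok, protocol " + str(tls.version())
    except ssl.SSLError as exc:
        return "TLS failure: " + str(exc)
    except OSError as exc:
        return "connection failure: " + str(exc)


# Example usage (placeholder PEM paths):
# with_cert = build_client_context("rootca.pem", "opscenter.pem", "opscenter.key")
# without_cert = build_client_context("rootca.pem")
# print(probe_handshake("10.220.2.103", 9042, with_cert))
# print(probe_handshake("10.220.2.103", 9042, without_cert))
```

Running it once with and once without the client certificate shows whether the node really insists on one. Note that with TLS 1.3 a rejected or missing client certificate may only surface on the first read after the handshake, so a successful handshake on its own is not conclusive there.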


Erick Ramirez ♦♦ commented ·

@appcuarium After re-reading your original post and writing notes along the way, I've realised that I've missed several things you've already tried and I'm at a loss as to why it doesn't work for you other than something broke in your LCM configuration that's preventing it from working.

The nature of SSL is that the errors are cryptic or obfuscated to discourage malicious attackers but it also makes it difficult to troubleshoot. These things are typically environmental but I'll do my best and talk to the OpsCenter engineers for ideas. Cheers!

P.S. I've spent my whole weekend trying to replicate the problem but had no luck so I don't know what's broken in your deployment.

appcuarium Erick Ramirez ♦♦ commented ·

@Erick Ramirez Thanks for all the effort. I also think something is broken in the LCM or OpsCenter configuration, maybe a ghost record in the DB that gets in the way. The easiest fix would be to re-create the cluster from scratch (VMs included), but if this happened again I would like to know where the error is so I can fix it quickly in production. Maybe I could try re-installing only OpsCenter. I forgot to mention that system resources are encrypted, but only those; the configuration files are not encrypted because of the limitations described here. Thanks.

markc appcuarium commented ·

@appcuarium Are you using config encryption?

https://docs.datastax.com/en/security/6.7/security/secEncryptProperties.html

There is a known issue with LCM: if you "backfill" an SSL cert and have config encryption on, LCM will re-generate the certs. Where I've seen this before, though, clients lost connectivity too, because LCM clobbered the original SSL artifacts.

I'm assuming the md5sums across all the keystores look OK? It sounds as if they would, since you say you still have connectivity.
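One way to run that checksum comparison without `md5sum` is sketched below in Python. It assumes the keystore and truststore files have already been copied somewhere readable, and the helper names are made up for this example. Which files are supposed to match depends on the deployment: copies of one shared truststore normally should, while per-node keystores would not.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(paths):
    """Map each MD5 hex digest to the list of files that share it."""
    groups = {}
    for path in paths:
        groups.setdefault(md5_of_file(path), []).append(str(path))
    return groups


# Example usage (placeholder paths for copies collected from each node):
# for checksum, files in group_by_checksum(
#         ["node1/truststore.jks", "node2/truststore.jks"]).items():
#     print(checksum, files)
```

More than one group for files that should be identical means at least one copy was changed or regenerated.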

appcuarium Erick Ramirez ♦♦ commented ·

@Erick Ramirez It looks like the options in the [cassandra] section of /etc/opscenter/clusters/clustername.conf are being ignored? I modified the ssl_keystore and ssl_truststore options there to point to a non-existent file, and the reported error is still the same: javax.net.ssl.SSLHandshakeException: null cert chain. I would have expected it to change to something like "keystore not found" somewhere in the opscenterd, dse or datastax-agent log files. I thought it was worth mentioning.
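For anyone repeating this experiment, a small Python sketch like the one below can confirm what the file on disk actually says before and after the change. It assumes the cluster configuration is plain INI, as the [cassandra] section suggests, and the function name is made up. It cannot show whether opscenterd or LCM reads those options for this particular connection, which is the open question.

```python
import configparser
import os


def check_ssl_paths(conf_path, section="cassandra",
                    options=("ssl_keystore", "ssl_truststore")):
    """Return {option: (configured_path, file_exists)} for options that are set."""
    # interpolation=None so that "%" characters in values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(conf_path)  # a missing file is silently skipped
    results = {}
    if not parser.has_section(section):
        return results
    for option in options:
        value = parser.get(section, option, fallback=None)
        if value is not None:
            results[option] = (value, os.path.isfile(value))
    return results


# Example usage, with the path from the comment above:
# print(check_ssl_paths("/etc/opscenter/clusters/clustername.conf"))
```

An empty result means the file or the section was not found, or neither option is set in it.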
