ratna53_110276 asked Erick Ramirez commented

How can I get the status of a node from OpsCenter API?

I am writing an automated script that stops all the nodes and then starts them again, and at the end of the script I try to get the status of each node from the OpsCenter API. But the status is not updated properly when we stop the DB: I wait 90 seconds and then stop the OpsCenter agent, yet the agent still has not reported the DB status as down. Is there any way to make the agent send the DB status to OpsCenter before we stop the agent?

We are not using poll_period in our address.yaml, so we are presumably using the default value.

I am using the curl command below to retrieve the DB status.

curl -s -H 'opscenter-session: {session_id}' 'http://{opscenter_host}:8090/{cluster_name}/nodes'

Thanks

Ratna Kumar

opscenterapi

1 Answer

Erick Ramirez answered Erick Ramirez commented

@ratna53_110276 It looks like you were making the correct OpsCenter API call but perhaps just didn't know how to interpret the output.

Here's an example where I attempt to retrieve information about one of the nodes in my cluster:

$ curl -L http://10.1.2.3:8888/community/nodes/10.1.2.45
{
  "node_ip": "10.1.2.45",
  "last_seen": 0,
  "dse_config_encryption": false,
  "mode": "normal",
  ...
}

Note that I have formatted the output to make it easier to read. The 2 important properties in the output above are last_seen and mode. In the example where the node is up and operational:

  • last_seen is 0, meaning the node is currently up and has not been detected as down
  • mode is normal

When DSE is shut down on the node, we get a different output:

$ curl -L http://10.1.2.3:8888/community/nodes/10.1.2.45
{
  "node_ip": "10.1.2.45",
  "last_seen": 1588056113,
  "dse_config_encryption": false,
  "mode": null,
  ...
}

The last_seen property is a Unix timestamp, that is, the number of seconds since the epoch (January 1, 1970). In the case above, 1588056113 is equivalent to April 28, 2020 6:41am GMT. The mode is null because DSE is down and the agent can't get DSE's state.
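If you want to double-check a last_seen value from a script, something like this minimal Python sketch should do it. It uses only the standard library, and the value is the one from the output above:

```python
from datetime import datetime, timezone

# last_seen value taken from the example output above
last_seen = 1588056113

# Interpret it as seconds since the Unix epoch, in UTC (GMT)
seen_at = datetime.fromtimestamp(last_seen, tz=timezone.utc)
print(seen_at.isoformat())  # 2020-04-28T06:41:53+00:00
```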

For your purposes, what you need to check for is a non-zero value for the last_seen property to determine if a node is down. Cheers!
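As a rough illustration, here is how that check could look in Python. This is only a sketch: the function names node_is_down and fetch_node are my own, the URL layout and the opscenter-session header are copied from the curl commands in this thread, and the two sample documents are trimmed copies of the outputs above.

```python
import json
import urllib.request


def node_is_down(node: dict) -> bool:
    """Return True when OpsCenter reports a non-zero last_seen for the node."""
    return node["last_seen"] != 0


def fetch_node(base_url: str, cluster: str, node_ip: str, session_id: str = "") -> dict:
    """GET {base_url}/{cluster}/nodes/{node_ip} and decode the JSON body.

    Hypothetical helper: adjust the URL, port and authentication to your setup.
    """
    request = urllib.request.Request(f"{base_url}/{cluster}/nodes/{node_ip}")
    if session_id:
        # Same header as in the curl command from the question
        request.add_header("opscenter-session", session_id)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Trimmed copies of the two example outputs, so this runs without a cluster
up = json.loads('{"node_ip": "10.1.2.45", "last_seen": 0, "mode": "normal"}')
down = json.loads('{"node_ip": "10.1.2.45", "last_seen": 1588056113, "mode": null}')

print(node_is_down(up))    # False
print(node_is_down(down))  # True
```

Against a real cluster you would call something like node_is_down(fetch_node("http://10.1.2.3:8888", "community", "10.1.2.45")) instead of using the sample documents.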

UPDATE - The status is not updated immediately because several things have to happen when you are restarting DSE on the nodes:

  • DSE has to go through the shutdown procedure (usually takes several seconds depending on the size of the node)
  • a 15-second window when the agent reconnects to the DSE/Cassandra instance
  • the time the agent spends waiting for the JMX connection to time out (10 seconds)
  • the time it takes for the DSE connection attempt to time out (30 seconds)
  • DSE startup sequence -- this can vary widely depending on what workloads are configured, node density, size of Search cores, etc
  • the time it takes for the agent to report back to OpsCenter

All this time is cumulative, but note that it does _NOT_ mean that every step listed above comes into play -- it depends on what the agent is doing at the time. The 2 biggest contributors to the turnaround are the DSE shutdown and startup sequences, and no two nodes will have identical times.
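Since the delay varies from node to node, a script is probably better off polling until the status changes than sleeping for a fixed 60 or 90 seconds. Below is a minimal Python sketch of that idea. The name wait_until_down is my own, the 180-second timeout and 5-second interval are arbitrary values I picked (not OpsCenter defaults), and fetch_node stands for whatever function you use to call the nodes endpoint and decode the JSON.

```python
import time
from typing import Callable


def wait_until_down(
    fetch_node: Callable[[], dict],
    timeout: float = 180.0,   # assumed upper bound, tune for your nodes
    interval: float = 5.0,    # assumed pause between polls
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until last_seen is non-zero; return False if the timeout expires."""
    deadline = clock() + timeout
    while True:
        node = fetch_node()
        if node["last_seen"] != 0:
            return True   # OpsCenter now reports the node as down
        if clock() >= deadline:
            return False  # still reported as up when time ran out
        sleep(interval)


# Simulated responses so the sketch runs without a cluster:
# the node is reported as up twice, then as down.
responses = iter([
    {"last_seen": 0, "mode": "normal"},
    {"last_seen": 0, "mode": "normal"},
    {"last_seen": 1588056113, "mode": None},
])
pauses = []
result = wait_until_down(lambda: next(responses), sleep=pauses.append)
print(result, len(pauses))  # True 2
```

The sleep and clock parameters are only there so the loop can be exercised without real waiting. To wait for a node to come back up, the same loop works with the condition flipped to last_seen == 0.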

It's important to note that scripting a shutdown/startup of DSE can be problematic because you cannot assume that a node has shut down cleanly, nor that it has started up cleanly. You need to monitor/watch the logs while you're working on a node to make sure nothing goes wrong. Cheers!

ratna53_110276 commented ·

@Erick Ramirez: I am able to interpret the output, but when we stop the database and try to retrieve its status from OpsCenter, the information has still not been updated in OpsCenter after 60 to 90 seconds.

So is there any way we can push the DB status to OpsCenter within 60 or 90 seconds, so that there is no need to wait longer than that to retrieve the DB status from the last_seen property (0 or epoch)?

Erick Ramirez ♦♦ replied to ratna53_110276 ·

@ratna53_110276 I've updated my answer. Cheers!
