Author: Sucwinder Bassi
Original publish date: August 31, 2017
I like to use the analogy of gold mining when thinking about obtaining information from a diagnostic tarball, mainly because a diagnostic tarball is a gold mine of information, and this blog article will hopefully provide some useful insight into what's available. So grab your hard hat, pickaxe and torch, and get ready to strike it rich with information. Like any good blog post, rather than making one long list, I've broken the article down into these separate sections:
- General information
- OpsCenter and agent information
- Node information
- Example Scenarios
I've laid it out this way so if for example you have a problem with OpsCenter you can just refer to the 'OpsCenter and agent information' section and not have to read through the whole blog again. Think mining again - you don't need to blast through the opening again with TNT every time you dig further into another section of the gold mine.
To generate a diagnostic tarball you'll need to install OpsCenter, then click Help > Diagnostics and select Download. This will create a file called diagnostics.tar.gz, and when you untar it you'll see this file and these folders:

cluster_info.json
nodes/
opscenterd/
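If you'd rather script the untar step, a minimal sketch using Python's standard tarfile module (the tarball and folder names are taken from above) could look like this:

```python
import tarfile
from pathlib import Path

def extract_diagnostics(tarball="diagnostics.tar.gz", dest="diagnostics"):
    """Extract the diagnostic tarball and return the top-level entries.

    You should see cluster_info.json plus the nodes/ and opscenterd/ folders.
    """
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in Path(dest).iterdir())
```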
A good place to start digging is cluster_info.json. I find this information particularly useful:
- bdp_version - This is the version of DSE.
- Cassandra version
- Cluster cores
- Cluster OS and version
- DC count / Keyspace count / Column family count
- OpsCenter version
- OpsCenter OS and version
- OpsCenter server RAM
If the values for CPU and RAM are below recommended production settings there is a good chance the cluster will not be as performant as an appropriately specced cluster. Time for another good analogy - if the size of your shovel is too small you'll be digging for a long time.
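To pull those fields out programmatically, here is a minimal sketch; bdp_version is described above, but the other key names are assumptions that may differ between OpsCenter versions:

```python
import json

# bdp_version (the DSE version) is documented above; the other key names
# are assumptions and may vary between OpsCenter versions.
KEYS_OF_INTEREST = ["bdp_version", "cassandra_version", "opscenter_version"]

def summarise_cluster_info(path="cluster_info.json"):
    """Return just the fields we care about from cluster_info.json."""
    with open(path) as f:
        info = json.load(f)
    return {key: info.get(key, "<missing>") for key in KEYS_OF_INTEREST}
```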
For information specific to a node I look at the opscenterd/node_info.json file. Some useful nuggets of information include:
- IP and host name of the node
- JVM version
- Keyspace sizes
- Load on the node
There are also some thread pool statistics that look like nodetool tpstats output.
OpsCenter and agent information
For OpsCenter server problems, refer to the OpsCenter log file.
For OpsCenter server and datastax-agent configuration issues you can check the settings in these files:

opscenterd/
  conf.json
  clusters/<cluster_name>.conf
For OpsCenter repair service issues take a look at these files:

opscenterd/
  repair_service.log
  repair_service.json
repair_service.json shows time to completion and if the repair service is active. Contrary to popular belief the repair service doesn't fix anything. It just streams data to synchronise the data on your nodes.
For OpsCenter datastax-agent issues refer to the agent.log on the nodes, which can be found here:

nodes/<node_ip>/logs/opsagent/agent.log
You can also take a look at GC activity by referring to the gc.log.
Node information
In the old days of mining, miners used to take canaries into the mine with them to detect noxious gases such as carbon monoxide. If the canary died, it was a pretty good indicator that something wasn't right. The same analogy can be applied to the nodes folder. If all is well you should see folders named after the IP of each node. However, if you find files rather than folders for some node IPs, or no folders at all for some nodes, this is a good indicator of a potential problem. The missing diagnostic information could be a result of network problems, or you may need to restart the datastax-agent on the node. Another thing to note: if downloading the tarball times out, try increasing the default value of the diagnostic_tarball_download_timeout option in <cluster_name>.conf. Increasing the default value is recommended for DSE multi-instance clusters or for slower machines and connections.
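As a rough illustration, the override might look like the fragment below; the section name and the value shown are assumptions, so check the OpsCenter configuration reference for your version:

```ini
[agents]
# Seconds to wait for agents to deliver their part of the tarball.
# Section placement and this value are assumptions - verify against
# the OpsCenter docs for your version.
diagnostic_tarball_download_timeout = 300
```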
Inside each nodes/<node_ip> folder you'll find these files:
- agent_version.json - Displays the datastax-agent version.
- agent-metrics.json - Refer to this file if you have a problem obtaining metrics from this node.
- blockdev_report - Shows filesystem information.
- java_heap.json - Displays heap and non-heap memory used.
- java_system_properties.json - Shows lots of useful Java information, including the Java version and class path.
- machine-info.json - Displays the architecture and memory on the node.
- os-info.json - Shows the OS and version.
- process_limits - Refer to this file if you have an OS resource issue.
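Following the canary analogy, a quick way to spot sickly nodes is to check each nodes/<node_ip> folder for the files listed above. This is a sketch, assuming the folder layout described in this post:

```python
import os

# Files we'd expect in every nodes/<node_ip>/ folder, from the list above.
EXPECTED_FILES = [
    "agent_version.json", "agent-metrics.json", "blockdev_report",
    "java_heap.json", "java_system_properties.json",
    "machine-info.json", "os-info.json", "process_limits",
]

def missing_node_files(nodes_dir="nodes"):
    """Return {node_ip: [missing items]} for every entry under nodes/."""
    report = {}
    for entry in sorted(os.listdir(nodes_dir)):
        path = os.path.join(nodes_dir, entry)
        if not os.path.isdir(path):
            # A bare file where a node folder should be is itself a red flag.
            report[entry] = ["<not a folder>"]
            continue
        missing = [name for name in EXPECTED_FILES
                   if not os.path.exists(os.path.join(path, name))]
        if missing:
            report[entry] = missing
    return report
```

An empty report means every node folder is present and complete; anything else points at the agent or network problems described above.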
There is so much useful information in the nodes folder that it can be a little overwhelming. It's like entering a cave and then realising you're in King Solomon's mines, and you even have those geode rocks we all like, the ones containing crystals that you get from tourist shops. The trouble is you really don't know what you're going to do with a geode when you bring it home. However, if you spend a little time understanding what gems lie inside, you can take advantage of your rather fortunate position.
- You can ignore this folder because it's deprecated in later versions.
- address.yaml - Useful to confirm the stomp_interface, which is the IP of the OpsCenter server.
- location.json - Displays the path to the cassandra.yaml and dse.yaml.
- The dse file is created for package installs and can be used to confirm which services are active (Spark, Solr, Graph etc.).
- Counting the folders in the solr folder reveals the number of Solr cores.
- The folder contains the
- Contains Spark configuration including the
- describe_cluster - Displays the cluster name and partitioner used.
- describe_schema - Shows the current keyspace and table schema.
- ring - Useful to determine which node is running the master process and the status of the nodes.
- sparkmaster - Will be updated to use
- Contains logs for Cassandra, OpsCenter agent, Solr and Spark.
- output.log - Shows the log output from the last startup sequence.
- agent.log - Useful log to interrogate for DataStax agent issues.
- tomcat - Contains the catalina log files.
- master - Contains the master.log.
- spark-jobserver - Contains the spark-jobserver log.
- worker - Contains the worker.log.
The nodetool folder is the biggest haul of them all. Even the US Federal Reserve would be envious of the gold bars of information found in this section. Fortunately for us, it's no Fort Knox and the information is easily accessible.
- cfstats - Shows statistics for tables. A high SSTable count could indicate a compaction problem, unless the table uses LCS, which tends to have higher SSTable counts. For space problems refer to space used and space used by snapshots. For write or read performance issues refer to the respective write or read latency output. A node struggling to keep up with writes will show counts for Dropped Mutations. The Compacted partition maximum bytes output is also a good place to check for large partitions. Finally, the tombstones per slice information is very useful for identifying tables with a large number of tombstones.
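Those cfstats red flags are easy to scan for mechanically. Below is a rough sketch that flags tables by SSTable count and maximum partition size; the line labels are based on typical nodetool cfstats output, and the thresholds are illustrative rather than official guidance:

```python
import re

def cfstats_warnings(cfstats_text, sstable_limit=20,
                     partition_limit=100 * 1024 * 1024):
    """Flag tables with a high SSTable count or a large maximum partition.

    Thresholds are illustrative defaults, not official DataStax guidance.
    """
    warnings = []
    table = None
    for line in cfstats_text.splitlines():
        line = line.strip()
        m = re.match(r"Table(?: \(index\))?: (.+)", line)
        if m:
            table = m.group(1)
        m = re.match(r"SSTable count: (\d+)", line)
        if m and table and int(m.group(1)) > sstable_limit:
            warnings.append((table, "sstable count", int(m.group(1))))
        m = re.match(r"Compacted partition maximum bytes: (\d+)", line)
        if m and table and int(m.group(1)) > partition_limit:
            warnings.append((table, "max partition bytes", int(m.group(1))))
    return warnings
```

Remember the LCS caveat above: a table using LCS will legitimately trip a naive SSTable-count threshold.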
- compactionhistory - Shows the history of compaction operations. Useful for investigating compaction issues.
- compactionstats - Provides statistics about a compaction. Useful for investigating compaction issues.
- describecluster - Shows the name, snitch, partitioner and schema version of a cluster. Can be used to identify schema disagreements when there are unreachable nodes.
- getcompactionthroughput - Prints the throughput cap (in MB/s) for compaction in the system.
- getstreamthroughput - Prints the MB/s outbound throughput limit for streaming in the system.
- gossipinfo - Provides the gossip information for the cluster. Can be useful to confirm all the nodes can see each other.
- info - Some useful node information, including uptime, disk storage (load) information, and heap and off-heap memory used.
- netstats - Shows network information about the host. High counts for read repair can indicate the repair process hasn't been run for a while.
- proxyhistograms - Provides a cumulative histogram of network statistics and shows the full read/write request latency recorded by the coordinator. The output can be compared to other nodes to determine if requests encounter a slow node.
- ring - Shows node status and information about the ring. The output shows all tokens and can be quite big if you're using vnodes.
- status - Displays the cluster layout, including DC names, whether nodes are up or down, load on disk, tokens, host IDs and rack information.
- statusbinary - Prints the status of the native transport. Also shown in the output of nodetool info.
- statusthrift - Prints the status of the Thrift server. Also shown in the output of nodetool info.
- tpstats - Provides momentary usage statistics of thread pools. Similar output is written to the system log when longer GC pauses occur. This isn't always a problem, but it can help to indicate which thread pool is most active. Any thread pools with All time blocked threads, a high number of pending threads, or dropped counts indicate a problem that should be investigated, and the output can help point you in the right direction. For example, dropped counts for Mutation would indicate a problem with writes; you could then refer to the nodetool cfstats output and look for Dropped Mutations, which will show the table or tables that are dropping mutations.
- version - Displays the version of Cassandra on the node.
- ntpstat - Shows the network time synchronisation status.
- ntptime - Reads kernel time variables.
- NTP configuration is often overlooked, and yet time synchronisation is pretty fundamental for a distributed database.
- cpu.json - Shows CPU activity. This can help determine if a high CPU issue is related to user, system or iowait activity.
- disk_space.json - Shows disk usage information. A good place to confirm whether the node is running low on disk space.
- disk.json - Similar output to the Linux iostat command; can help identify disk performance issues.
- load_avg.json - A high load average can indicate the node is being overloaded. An idle node would have a load average of 0. Any running process either using or waiting for CPU cycles adds 1 to the load average.
- memory.json - A breakdown of the memory used, including used, free and cache. The output of nodetool info also provides memory usage.
- index_size.json - Displays the size of the Solr index.
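When comparing these metrics across many nodes, loading every os-metrics JSON file in one go saves a lot of clicking. The per-file structure varies between versions, so this sketch just maps file name to parsed JSON for ad-hoc inspection rather than assuming any particular schema:

```python
import json
import os

def load_os_metrics(node_dir):
    """Map file name -> parsed JSON for a node's os-metrics folder.

    No per-file schema is assumed; inspect the values interactively.
    """
    metrics = {}
    metrics_dir = os.path.join(node_dir, "os-metrics")
    for name in sorted(os.listdir(metrics_dir)):
        if name.endswith(".json"):
            with open(os.path.join(metrics_dir, name)) as f:
                metrics[name] = json.load(f)
    return metrics
```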
Example Scenarios
A node is reported as down in the cluster.
The first place to start is the nodetool status output from any node in the cluster. If a node is down, the nodetool status output found in nodes/<node_ip>/nodetool/status will show a DN status for the node, for example:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  127.0.0.1  47.66 KB  1       33.3%  aaa1b7c1-6049-4a08-ad3e-3697a0e30e10  rack1
UN  127.0.0.2  47.67 KB  1       33.3%  1848c369-4306-4874-afdf-5c1e95b8732e  rack1
DN  127.0.0.3  47.67 KB  1       33.3%  49578bf1-728f-438d-b1c1-d8dd644b6f7f  rack1
The output above shows node 127.0.0.3 is down. Once you've identified which node is down, you can check the system.log for that node; it will hopefully contain more information to help you determine why the node went down. You could also run nodetool describecluster, as it performs an actual RPC call to all nodes, which can reveal unreachable nodes and any schema disagreements.
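Picking the DN lines out of a pile of status files is easy to automate. This sketch relies on the output format shown above, where data lines start with a two-letter status/state code (U/D for Up/Down, then N/L/J/M for Normal/Leaving/Joining/Moving):

```python
def down_nodes(status_text):
    """Return the addresses of nodes reported as down in nodetool status output.

    Data lines start with a two-letter status/state code such as UN or DN.
    """
    down = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("DN", "DL", "DJ", "DM"):
            down.append(parts[1])
    return down
```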
Poor performance on a node.
If a node is experiencing poor performance we need to determine if the node is CPU bound, running out of memory, experiencing an IO problem or something more obscure like running out of OS resources.
For CPU issues, you can refer to cluster_info.json to determine the number of cluster cores. If the node has sufficient CPU cores, you can then check nodes/<node_ip>/os-metrics/cpu.json to see if the CPU is pegged.
For memory issues, as before, you can check cluster_info.json or nodes/<node_ip>/machine-info.json to determine the amount of RAM available. Assuming there is sufficient memory, you can refer to the output of nodetool info to see how much heap and off-heap memory is in use. You may be able to resolve the issue by tuning the JVM and increasing the amount of memory allocated to the heap. It is also worth checking the system.log for garbage collection (GC) activity. Long GC pauses or lots of GC activity would suggest the node is being overloaded, and further investigation is needed to check what is running on the node. You could refer to the nodetool tpstats output, which will show you which thread pools are most active.
For IO problems: if you're experiencing read or write latency and the nodetool tpstats output shows read and/or write (mutation) threads have pending activity, this could indicate an IO problem. The nodetool tpstats output is momentary, so checking the same output in the system.log for climbing and sustained pending threads can be useful. Furthermore, you can refer to the nodes/<node_ip>/os-metrics/disk.json file for output similar to iostat, which can help to identify an IO issue.
Finally, if you're seeing messages in the system.log referring to the number of open files, then you may have an OS resource issue. You can check the OS resources allocated to a node by referring to nodes/<node_ip>/process_limits. For the latest OS resource limits, JVM settings and disk tuning advice, refer to the DataStax recommended production settings for your version of DSE.
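To check the open-files limit across every node quickly, the sketch below pulls the soft limit out of a process_limits file. It assumes the file mirrors typical /proc/<pid>/limits output, with the soft and hard limit columns following the limit name:

```python
def open_files_limit(process_limits_text):
    """Extract the soft 'Max open files' limit from a process_limits file.

    Assumes the /proc/<pid>/limits layout: limit name, soft limit,
    hard limit, units. Returns None if the line is absent or unlimited.
    """
    for line in process_limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line[len("Max open files"):].split()
            if not fields or fields[0] == "unlimited":
                return None
            return int(fields[0])
    return None
```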
This blog post is not all-encompassing, and even after nearly 4 years of DSE support I'm still finding new things in a diagnostic tarball. We tend not to document in detail what is collected in a diagnostic tarball because our developers continue to make improvements and the content collected changes from version to version. However, this blog article should help to address that problem by acting as a torch in a dark cave and enlightening (pun intended) readers about the information available in a diagnostic tarball.