ted.petersson_164115 asked jim.dickinson_187342 answered

How do I revert changes to a pod in CrashLoopBackOff status?

I'm testing the cass-operator and have a very simple 3-node cluster setup.

Then I try to give the operator a serverImage configuration that does not work, e.g.

serverImage: "cassandra:3.11"
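
For context, here's roughly where that setting sits in a minimal CassandraDatacenter manifest. This is a sketch, not my exact manifest: the names cluster1 and dc1 match the pod names below, and the other field values are illustrative:

  apiVersion: cassandra.datastax.com/v1beta1
  kind: CassandraDatacenter
  metadata:
    name: dc1
  spec:
    clusterName: cluster1
    serverType: cassandra
    size: 3
    serverImage: "cassandra:3.11"   # stock image instead of a cass-operator-supported one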

Now my pods (and statefulset) look like this:

> kubectl -n cass-operator get pods,statefulset
NAME                                 READY   STATUS             RESTARTS   AGE
pod/cass-operator-78884f4f84-lmkbj   1/1     Running            0          18m
pod/cluster1-dc1-default-sts-0       2/2     Running            0          18m
pod/cluster1-dc1-default-sts-1       2/2     Running            0          18m
pod/cluster1-dc1-default-sts-2       1/2     CrashLoopBackOff   6          7m26s

NAME                                        READY   AGE
statefulset.apps/cluster1-dc1-default-sts   2/3     18m

The statefulset tries to roll out the new image to pod #2, but it fails!

And the operator status is

 > kubectl -n cass-operator describe cassdc
  Cassandra Operator Progress:  Updating

So even though I try to fall back (apply an old config without the failing image), the operator is "stuck" in the Updating state and will not "feed" the old (working) image down to the statefulset...

> kubectl -n cass-operator describe statefulset |grep Image
    Image:      datastax/cass-config-builder:1.0.0  <--- Init container
    Image:      cassandra:3.11  <--- non-working image - not "fallback:ed"
    Image:      busybox  <--- I'm using a sidecar image as well, that's why its 2 containers/pod

So HOW can I force the operator to ignore its updating state and feed the correct image to the statefulset?

I want to do this without deleting the statefulset (otherwise all pods would be deleted and traffic would be lost).

/BR Ted


Erick Ramirez answered Erick Ramirez commented

This looks more like a Kubernetes configuration issue than a problem with the cass-operator. CrashLoopBackOff means k8s is repeatedly attempting to restart the container after it keeps crashing.

You'll need to investigate the root cause and resolve it if you don't want to delete the statefulset. I'd recommend running kubectl describe on the problematic pod and reviewing the events for clues. Also get the logs with kubectl logs so you can review them for a possible cause.
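
For example, using the pod name from your listing above (the -c flag selects the container, since these pods run two containers; I'm assuming the main Cassandra container is named cassandra):

  kubectl -n cass-operator describe pod cluster1-dc1-default-sts-2
  kubectl -n cass-operator logs cluster1-dc1-default-sts-2 -c cassandra
  kubectl -n cass-operator logs cluster1-dc1-default-sts-2 -c cassandra --previous

The --previous flag returns the logs from the last terminated instance of the container, which is usually where the actual crash shows up.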

I don't believe there's a way to force configuration changes. I'm going to reach out to the authors of the operator internally and request them to respond. Cheers!


I'm testing an in-service software rollback scenario, so the pod failing to start is deliberate. I don't want it to start; I want to roll back to the previous image (without affecting the other pods currently taking traffic).

Is this possible somehow?


There's a concept of a canary upgrade like in this example:

  # Using canaryUpgrade will limit config changes that directly impact the
  # underlying StatefulSets resources (which is most of them) to only updating
  # the first StatefulSet / rack. Users can use this to test configuration
  # changes before rolling them out to the whole cluster.
  canaryUpgrade: false

But I don't think there's a rollback facility. Cheers!


Maybe failback is a better name for what I'm testing.

The scenario is to perform an in-service upgrade, but something fails (a faulty image) during the pod rollout... Then an in-service failback should be performed to fall back to the previous (working) image.


Rollback, failback, revert -- I understood your intent the first time. :)

jim.dickinson_187342 answered

We added a feature to fix this without mucking around with kubectl on the underlying k8s resources. It's in Cass Operator v1.2.0, but we need to document it better.

To fix a rack's configuration after a bad change, list the rack under forceUpgradeRacks in the CassandraDatacenter configuration like this:

  forceUpgradeRacks:
    - rack1

The operator will rewrite the StatefulSet's configuration and cause the pods to cycle.
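
For example, assuming the CassandraDatacenter from the question is named dc1 (and noting that the rack in that cluster appears to be named default, judging by the cluster1-dc1-default-sts StatefulSet name), one way to apply this would be to edit the resource:

  kubectl -n cass-operator edit cassandradatacenter dc1

and, alongside reverting serverImage to the previous working value, add something like this to the spec:

  spec:
    forceUpgradeRacks:
      - default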
