question

Erick Ramirez asked ·

PerSSTableIndexWriter.java:211 - Rejecting value (size 1.938KiB, maximum 1.000KiB) for column X (analyzed false) at Y SSTable

What does this log message mean?

INFO  [CompactionExecutor:9] 2020-09-17 14:54:56,789 PerSSTableIndexWriter.java:211 - Rejecting value (size 1.036KiB, maximum 1.000KiB) for column hugecolumn (analyzed false) at /path/to/data/ks/table_name/md-5-big SSTable.

1 Answer

Erick Ramirez answered ·

The log message comes from the PerSSTableIndexWriter class, which indicates that the table has an SSTable Attached Secondary Index (SASI). The message means that one or more terms couldn't be added to the index because they are larger than OnDiskIndexBuilder.MAX_TERM_SIZE (1024 bytes in Cassandra 3.11.8).

Definition

An index term is typically a word in a string so the phrase:

May the force be with you

may be tokenised into terms be, force, may, the, with, and you if a SASI analyser is defined on the index. Otherwise, the term is equal to the full value of the indexed CQL column.
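
To make the distinction concrete, here is a rough Python sketch of the two cases. It is not SASI's actual analyser: the terms() helper and its lowercase-and-split-on-whitespace tokenisation are assumptions made purely for illustration.

```python
def terms(value, analyzed):
    """Return the index terms for a column value (illustrative only).

    This is a hypothetical helper, not part of Cassandra or any driver.
    """
    if not analyzed:
        # Without an analyser, the whole column value is a single term.
        return [value]
    # With an analyser, the value is broken into words. Here we simply
    # lowercase, split on whitespace and de-duplicate.
    return sorted(set(value.lower().split()))


phrase = "May the force be with you"
print(terms(phrase, analyzed=True))   # several short terms
print(terms(phrase, analyzed=False))  # one term: the full value
```

The point is that an analysed index stores many short terms, whereas a non-analysed index stores the entire value as one term, so a long value is far more likely to hit the size limit.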

Example

Here is what I did to replicate the issue so I can illustrate the problem.

Step A1 - Create the following table:

CREATE TABLE massivestrings (
    key text PRIMARY KEY,
    hugecolumn text
);

Step A2 - Insert a really long string:

INSERT INTO massivestrings (key, hugecolumn) VALUES ('random','This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating. This sentence keeps repeating.');

Step A3 - Create an index on the hugecolumn column:

CREATE CUSTOM INDEX hugecolumn_index ON massivestrings (hugecolumn) USING 'org.apache.cassandra.index.sasi.SASIIndex';

As Cassandra tries to index the partition with key = 'random', the following message is logged in the system.log:

INFO  [CompactionExecutor:51] 2020-09-17 18:21:06,415 PerSSTableIndexWriter.java:211 - Rejecting value (size 1.028KiB, maximum 1.000KiB) for column hugecolumn (analyzed false) at /home/ubuntu/apache-cassandra-3.11.4/data/data/stackoverflow/massivestrings-b33f0910f8a011eaa8600966255cfe1c/md-6-big SSTable.

Further attempts to insert partitions with large values result in a failure to index the column value:

INFO  [MutationStage-2] 2020-09-17 18:14:52,105 TrieMemIndex.java:86 - Can't add term of column hugecolumn to index for key: random, term size 1.028KiB, max allowed size 1.000KiB, use analyzed = true (if not yet set) for that column.

Cause

As I stated above, the maximum term size allowed in SASI is 1024 bytes. ASCII characters are encoded as 1 byte each so the maximum term length is 1024 characters.

Other Unicode characters take more than one byte. For example, most CJK characters (Chinese, Japanese, Korean) take 3 bytes each in UTF-8, so the maximum term length is 341 CJK characters.
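
If you want to check a value on the client side before writing it, a minimal Python sketch along these lines can help. The constant mirrors the 1024-byte limit quoted above. The helper names utf8_size() and fits_in_sasi_term() are made up for this example, and treating a value of exactly 1024 bytes as too large is a conservative assumption on my part, not something confirmed here.

```python
# 1024 bytes, as quoted above for OnDiskIndexBuilder.MAX_TERM_SIZE (Cassandra 3.11.8)
MAX_TERM_SIZE = 1024


def utf8_size(value):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def fits_in_sasi_term(value):
    """Hypothetical check: True if the value is safely below the limit.

    Uses a strict comparison so that a value of exactly 1024 bytes is
    flagged; that boundary handling is an assumption.
    """
    return utf8_size(value) < MAX_TERM_SIZE


ascii_value = "a" * 1000   # 1 byte per character
cjk_ok = "中" * 341         # 3 bytes per character
cjk_too_big = "中" * 342

for v in (ascii_value, cjk_ok, cjk_too_big):
    print(len(v), "chars,", utf8_size(v), "bytes, fits:", fits_in_sasi_term(v))
```

The character count alone is not enough to tell whether a value will be indexed. What matters is the encoded size in bytes.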

Solution

Depending on your use case, consider using the standard SASI analyser so the column value gets tokenised.

You will need to drop the existing index and create a new SASI index. For example:

Step B1 - Drop the hugecolumn_index index:

DROP INDEX hugecolumn_index;

Step B2 - Create a new index that uses an analyzer:

CREATE CUSTOM INDEX hugecolumn_contains_idx
ON stackoverflow.massivestrings (hugecolumn)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
  'analyzed': 'true',
  'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
  'mode': 'CONTAINS'};

For more info, see Using a SSTable Attached Secondary Index (SASI). Cheers!
