I need to change my schema so that the data is distributed evenly across the partitions. Currently one node holds about 10 times more data than the others. prfl_id is effectively the primary key: two other columns are part of the PK, but they almost never change (only when another field gets a new version). Here is what my schema looks like:
CREATE TABLE IF NOT EXISTS user_data(
    prfl_id bigint,
    type_cd text,
    ver_nb bigint,
    txn_dtl_tx text,
    cre_ts timestamp,
    cre_usr_id text,
    last_txn_ts timestamp,
    PRIMARY KEY (prfl_id, type_cd, ver_nb)
);
I can find out the range and distribution of prfl_id, so based on that I would like to use the ByteOrderedPartitioner for this table. Could you point me to an example?
Because of the way the data is accessed, I can't add a dummy column to distribute the data, so it has to be a ByteOrderedPartitioner. Since the source data is in Hive and prfl_id barely changes, I can work out the best ranges ahead of time. Even if the distribution changes over time, those changes are small. I am aware that this is an anti-pattern, but the client only has the prfl_id to query the table with, so changing the schema would not work in this case. The other point is that the data can be loaded from Spark very fast, so if the ranges change we could reload the data in 30 minutes.
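To make the "work out the ranges ahead of time" part concrete, here is a rough sketch of what I have in mind. It takes a sample of prfl_id values (hard-coded here; in practice they would come from Hive) and picks boundaries so that each range holds roughly the same number of ids. The function names and the sample values are made up for illustration. The hex encoding is my assumption of how a bigint key would look as a byte-ordered token (8 bytes, big-endian), and I am assuming non-negative ids, since negative values would sort after positive ones under unsigned byte comparison.

```python
import struct


def balanced_split_points(prfl_ids, num_ranges):
    """Return num_ranges - 1 boundary ids that cut the sorted ids
    into chunks of (nearly) equal size."""
    if num_ranges < 1:
        raise ValueError("num_ranges must be at least 1")
    ordered = sorted(prfl_ids)
    if not ordered:
        return []
    n = len(ordered)
    # Boundary i is the first id of chunk i when the ids are cut evenly.
    return [ordered[(i * n) // num_ranges] for i in range(1, num_ranges)]


def to_hex_token(prfl_id):
    """Hex of the 8-byte big-endian encoding of a bigint.
    Assumption: this matches how the key bytes would be ordered."""
    return struct.pack(">q", prfl_id).hex()


# Hypothetical, heavily skewed sample of prfl_id values.
sample_ids = [1, 2, 3, 4, 5, 6, 7, 8, 1000, 2000, 3000, 4000]

splits = balanced_split_points(sample_ids, 4)
tokens = [to_hex_token(s) for s in splits]

for split, token in zip(splits, tokens):
    print(split, token)
```

With evenly spaced boundaries over the id range, most of this sample would land in the first range; picking boundaries from the sorted ids puts three ids in each of the four ranges instead.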
Basically, I can't find the syntax for specifying a ByteOrderedPartitioner in the table schema. Can somebody point me to an example?
Thanks