Bringing together the Apache Cassandra experts from the community and DataStax.

Asked by mishra.anurag643_153409, edited by Erick Ramirez

md5 hashing fails on complex data types in PySpark for UDT columns ingested from Cassandra

I am trying to calculate a hash over an entire row using the md5 function in PySpark. A few columns in my PySpark DataFrame have complex data types. These are UDT columns in Cassandra, and my requirement is to calculate the md5 over the entire row regardless of the column types.

For example:

    col: array (nullable = true)
     |-- element: struct (containsNull = true)

and:

    col: array (nullable = true)
     |-- element: array (containsNull = true)

When I try to calculate the md5 over the entire row, it throws an error with the message below:

**argument 28 requires (array<string> or string) type, however, '`col`' is of array<array<string>> type**

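Code to calculate the md5:

    from pyspark.sql.functions import concat_ws, md5

    def prepare_data_md5(data):
        """Prepare the data with an md5 column.

        :param data: input DataFrame object
        :return: output DataFrame object
        """
        # Note: concat_ws(sep, *cols) treats its first argument as the
        # separator, so the first column name is used as the separator here.
        return data.withColumn("hash", md5(concat_ws(*data.columns)))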

1. Is there another hash function I could use that also works for complex data types?

2. Is there a library available in PySpark or Python for flattening complex data types, so that I could calculate the md5 over the flattened DataFrame?

spark-cassandra-connector

1 Answer

Erick Ramirez answered

My initial reaction to your post is that you have a Python issue, not an issue with Cassandra, UDTs or the connector.

It would help us immensely if you provide background information on what you are trying to achieve and what the MD5 has to do with Cassandra or the connector. Cheers!
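To illustrate why this looks like a Python/PySpark question rather than a connector one: the error in the question comes from `concat_ws`, which only accepts string or array<string> arguments, so nested values need to be serialized to a string before hashing. Below is a minimal sketch of that idea in plain Python, not a confirmed fix for the poster's schema. The helper name `row_md5` and the sample row are hypothetical, and JSON is just one possible serialization.

```python
import hashlib
import json


def row_md5(values):
    """Return an MD5 hex digest for one row given as a list of column values.

    Nested lists, tuples and dicts are serialized to JSON first, so the
    digest covers complex values as well as plain strings and numbers.
    """
    payload = json.dumps(
        values,
        sort_keys=True,          # stable key order for dict (map) values
        separators=(",", ":"),   # no whitespace, so formatting cannot vary
        default=str,             # fallback for dates, decimals, bytes, ...
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Hypothetical row: a string column, an array<array<string>> column and a
# struct-like column represented as a dict.
row = ["id-1", [["a", "b"], ["c"]], {"street": "Main", "zip": None}]
digest = row_md5(row)
print(digest)
```

In PySpark, a function like this could presumably be wrapped in a `udf` and applied to `struct(*data.columns)`. Alternatively, the built-in functions can likely do the same job without a Python UDF, for example `md5(to_json(struct(*data.columns)))`, since `to_json` turns a struct column (including nested arrays and structs) into a string that `md5` accepts. Both are assumptions to verify against your Spark version and data, in particular how nulls are represented in the serialized form.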
