I am trying to compute an MD5 hash over an entire row in PySpark. Several columns in my DataFrame have complex data types (they correspond to UDT columns in Cassandra), and my requirement is to compute the MD5 over the entire row regardless of the column types.
For example:

```
col: array (nullable = true)
 |    |-- element: struct (containsNull = true)

col: array (nullable = true)
 |    |-- element: array (containsNull = true)
```
When I try to compute the MD5 over the entire row, it fails with the following error:

```
argument 28 requires (array<string> or string) type, however, '`col`' is of array<array<string>> type
```
Code to compute the MD5:

```python
from pyspark.sql.functions import concat_ws, md5

def prepare_data_md5(data):
    """
    Prepare the data with an md5 column.
    :param data: input DataFrame object
    :return: output DataFrame object
    """
    # concat_ws expects a separator as its first argument,
    # followed by the columns to concatenate.
    return data.withColumn("hash", md5(concat_ws("|", *data.columns)))
```
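To illustrate what I mean by hashing a row irrespective of type, here is the idea sketched in plain Python (outside Spark, using only the standard library; the row dict below is made up for illustration): serialize the whole row, including nested values, to a canonical string and hash that, instead of concatenating columns directly.

```python
import hashlib
import json

def row_md5(row: dict) -> str:
    # Serialize the whole row (including nested lists/structs) to a
    # canonical JSON string; sort_keys makes the result deterministic
    # regardless of key order.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Hypothetical row with a nested array<array<string>>-like column.
row = {"id": 1, "col": [["a", "b"], ["c"]]}
print(row_md5(row))
```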
1. Is there some other hash function I could use that also works for complex data types?
2. Is there some library available in PySpark or Python for flattening complex data types, so that I could compute the MD5 over the flattened DataFrame?
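Regarding question 2, the kind of flattening I have in mind could look like this in plain Python (a hypothetical helper I wrote for illustration, not a Spark or library function): recursively render nested values as flat strings so that ordinary string concatenation and hashing become possible.

```python
def flatten_value(value) -> str:
    """Recursively render nested lists/dicts as a single flat string."""
    if isinstance(value, dict):
        # Sort keys so the rendering is deterministic.
        inner = ",".join(f"{k}:{flatten_value(v)}" for k, v in sorted(value.items()))
        return "{" + inner + "}"
    if isinstance(value, list):
        return "[" + ",".join(flatten_value(v) for v in value) + "]"
    return str(value)

# Nested array<array<string>>-like value flattened to one string.
print(flatten_value([["a", "b"], ["c"]]))  # -> [[a,b],[c]]
```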