question

czarR avatar image
czarR asked Erick Ramirez answered

Is there an alternative to DataFrame column explode() method?

Hi, I have a Python dataframe that I am trying to use explode on a column "transactions_details" that contains array of elements. It takes very long to return data. Is there an alternate for for explode

My current output:

    userId  username    transactions
    44782   eyuan       object
                        details: [{"expiration": "2022-09-08T00:00:00", "externalItemId":               null,                           "from_sitelocationId": 1821, "itemDescription": "Avastin 2.5MG PINE SILICONE FREE", "itemId": 50679, "lot": "59529", "ndcCode": "50242-0060-01", "ndcCode10Digit": "50242-060-01", "ndcDesc": "Avastin 2.5MG PINE SILICONE FREE", "qty": 1, "to_sitelocationId": 0},{"expiration": "2022-09-08T00:00:00", "externalItemId": null, "from_sitelocationId": 1821, "itemDescription": "Avastin 2.5MG PINE SILICONE FREE", "itemId": 50679, "lot": "59530", "ndcCode": "50243-0060-01", "ndcCode10Digit": "50243-060-01", "ndcDesc": "Avastin 2.5MG PINE SILICONE FREE", "qty": 1, "to_sitelocationId": 0}]
                        modifiedByUsername: null
                        transactionType: 6
                        transactionTypeName: "Use"

My desired output:

    userid        username   expiration           externalItemId  from_sitelocationId       itemDescription            itemId   lot    ndcCode        ndcCode10Digit    ndcDesc                           qty  to_sitelocationId   
    44782         722        2022-09-08T00:00:00       null            1821               Avastin 2.5MG PINE SILICONE  50679    59529  50242-0060-01   50242-060-01     Avastin 2.5MG PINE SILICONE FREE   1    0       
    44782         722        2022-09-08T00:00:00       null            1821               Avastin 2.5MG PINE SILICONE  50679    59530  50243-0060-01   50243-060-01     Avastin 2.5MG PINE SILICONE FREE   1    0  

Here is my code:

df2 = df1.withColumn("transactions_details.itemid", explode(col("transactions.details.itemid")))\
.withColumn("transactions_details.expiration", explode(col("transactions.details.expiration")))\
.withColumn("transactions_details.externalitemid", explode(col("transactions.details.externalitemid")))\
.withColumn("transactions_details.from_sitelocationid", explode(col("transactions.details.from_sitelocationid")))\
.withColumn("transactions_details.itemDescription", explode(col("transactions.details.itemDescription")))\
.withColumn("transactions_details.lot", explode(col("transactions.details.lot")))\
.withColumn("transactions_details.ndcCode", explode(col("transactions.details.ndcCode")))\
.withColumn("transactions_details.ndcCode10Digit", explode(col("transactions.details.ndcCode10Digit")))\
.withColumn("transactions_details.ndcDesc", explode(col("transactions.details.ndcDesc")))\
.withColumn("transactions_details.qty", explode(col("transactions.details.qty")))\
.withColumn("transactions_details.to_sitelocationid", explode(col("transactions.details.to_sitelocationid")))\
.drop("transactions")
display(df2)
spark-cassandra-connector
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered

I haven't worked with the explode() method before and I don't fully understand the underlying implementation but my initial reaction is that you have a lot of nested calls to the method. Do you really have arrays upon arrays of data? If so, that would make it expensive to process and would account for the slow performance.

In any case, I'm going to reach out to other engineers and get them to respond to your question directly. Cheers!

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.