question

drabuna avatar image
drabuna asked drabuna commented

Cassandra social network design

Intro

I'm getting into Cassandra a bit more and creating a simple social network (similar to Instagram/Facebook) as an experiment project.

The general idea is that you can "follow"/"friend" people and see their updates in a feed. When using an application, you should be able to scroll through the feed (updates sorted by time) and paginate it.

In my case I'm sticking with the fan-out approach - whenever an update is being published, it will be inserted into the feeds of friends/followers.

Here's what my hypothetical schema looks like:

CREATE TABLE feed_by_time (
    user_id uuid,
    time timestamp,
    author_id uuid,
    post_id uuid,
    PRIMARY KEY  (user_id, time, author_id)
) WITH CLUSTERING ORDER BY (time DESC);

With this schema, I should be able to get any users' feed easily and in sorted order, as well as paginate it further, and each users' feed will reside in its own partition, so I will not fetch from multiple nodes. I am a bit limited by the potential size of each partition, but let's assume it's not a problem right now.

Problems and concerns

Now, the actual problem - how should unfollow/unfriend functionality be implemented? It assumes that all the posts of a specific "author" should disappear from your feed.

I know that with the feed_by_time table I will not be able to delete all the posts of a specific "author" so I can have another one like:

CREATE TABLE feed_by_author (
    user_id uuid,
    author_id uuid,
    post_id uuid,
    time timestamp,
    PRIMARY KEY ((user_id, author_id), post_id)
);

And I can write to both tables at the same time, when a new post is published.

When I unfollow someone, I can do something like

  1. SELECT * FROM feed_by_author WHERE user_id="user_id" AND author_id="unfriended_id"
  2. Extract time for the query result and run A LOT of DELETE queries towards feed_by_time
  3. Run one big DELETE to clean-up feed_by_author (in this case, it will remove the entire partition)

My worry is mostly around the second point. I read that I should avoid such deletes since they are extremely costly due to tombstones. So, should I be worried, and are there more efficient/correct ways of doing this?

In traditional RDBMS I would consider having a deleted column, and setting its value to 1. As far as I understand deletes and updates will result in a similar "slow-down"?

As a follow up to this question - say, I re-follow/re-friend someone. This will result in a lot of INSERT into the feed_by_time user-specific partition since we will be effectively merging the entire author's feed into the user's feed. The time value on each post can vary a lot, so we can be inserting it in the "middle" of the partition, is it costly in terms of performance? Haven't found much information regarding that.

It would be great to get a comment from someone who is using Cassandra in production on a large scale project, with 100k+ users - how much does deletes REALLY affect performance? Is it a huge hit to the performance? If I expect a decent rate of deletes, should Cassandra still be a consideration, or should I look into alternative DBs?

Thanks a lot!

data modelingtombstones
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered drabuna commented

Specifically on tombstones, Alain Rodriguez from The Last Pickle has an excellent blog post discussing the implications of use cases which have a high number of deletes -- About Deletes and Tombstones in Cassandra.

Hopefully someone else here can help you with how to model your data. Cheers!

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

drabuna avatar image drabuna commented ·

Hi, I've read it a couple of times before, and it does cover the principle of how tombstones work and explain the potential effect on the system.

Unfortunately, it doesn't cover the practical side of things, what is the actual decrease in reading/writing performance with a high frequency of deletes.

It would be great to hear about it from the active users of Cassandra :)

0 Likes 0 ·