I'm getting into Cassandra a bit more and creating a simple social network (similar to Instagram/Facebook) as an experiment project.
The general idea is that you can "follow"/"friend" people and see their updates in a feed. When using an application, you should be able to scroll through the feed (updates sorted by time) and paginate it.
In my case I'm sticking with the fan-out approach - whenever an update is being published, it will be inserted into the feeds of friends/followers.
Here's what my hypothetical schema looks like:
CREATE TABLE feed_by_time ( user_id uuid, time timestamp, author_id uuid, post_id uuid, PRIMARY KEY (user_id, time, author_id) ) WITH CLUSTERING ORDER BY (time DESC);
With this schema, I should be able to get any users' feed easily and in sorted order, as well as paginate it further, and each users' feed will reside in its own partition, so I will not fetch from multiple nodes. I am a bit limited by the potential size of each partition, but let's assume it's not a problem right now.
Problems and concerns
Now, the actual problem - how should unfollow/unfriend functionality be implemented? It assumes that all the posts of a specific "author" should disappear from your feed.
I know that with the
feed_by_time table I will not be able to delete all the posts of a specific "author" so I can have another one like:
CREATE TABLE feed_by_author ( user_id uuid, author_id uuid, post_id uuid, time timestamp, PRIMARY KEY ((user_id, author_id), post_id) );
And I can write to both tables at the same time, when a new post is published.
When I unfollow someone, I can do something like
SELECT * FROM feed_by_author WHERE user_id="user_id" AND author_id="unfriended_id"
timefor the query result and run A LOT of
- Run one big
feed_by_author(in this case, it will remove the entire partition)
My worry is mostly around the second point. I read that I should avoid such deletes since they are extremely costly due to tombstones. So, should I be worried, and are there more efficient/correct ways of doing this?
In traditional RDBMS I would consider having a
deleted column, and setting its value to 1. As far as I understand deletes and updates will result in a similar "slow-down"?
As a follow up to this question - say, I re-follow/re-friend someone. This will result in a lot of
INSERT into the
feed_by_time user-specific partition since we will be effectively merging the entire author's feed into the user's feed. The
time value on each post can vary a lot, so we can be inserting it in the "middle" of the partition, is it costly in terms of performance? Haven't found much information regarding that.
It would be great to get a comment from someone who is using Cassandra in production on a large scale project, with 100k+ users - how much does deletes REALLY affect performance? Is it a huge hit to the performance? If I expect a decent rate of deletes, should Cassandra still be a consideration, or should I look into alternative DBs?
Thanks a lot!