Intro
I'm getting into Cassandra a bit more and creating a simple social network (similar to Instagram/Facebook) as an experiment project.
The general idea is that you can "follow"/"friend" people and see their updates in a feed. When using an application, you should be able to scroll through the feed (updates sorted by time) and paginate it.
In my case I'm sticking with the fan-out approach - whenever an update is being published, it will be inserted into the feeds of friends/followers.
Here's what my hypothetical schema looks like:
CREATE TABLE feed_by_time ( user_id uuid, time timestamp, author_id uuid, post_id uuid, PRIMARY KEY (user_id, time, author_id) ) WITH CLUSTERING ORDER BY (time DESC);
With this schema, I should be able to get any users' feed easily and in sorted order, as well as paginate it further, and each users' feed will reside in its own partition, so I will not fetch from multiple nodes. I am a bit limited by the potential size of each partition, but let's assume it's not a problem right now.
Problems and concerns
Now, the actual problem - how should unfollow/unfriend functionality be implemented? It assumes that all the posts of a specific "author" should disappear from your feed.
I know that with the feed_by_time
table I will not be able to delete all the posts of a specific "author" so I can have another one like:
CREATE TABLE feed_by_author (
user_id uuid,
author_id uuid,
post_id uuid,
time timestamp,
PRIMARY KEY ((user_id, author_id), post_id)
);
And I can write to both tables at the same time, when a new post is published.
When I unfollow someone, I can do something like
SELECT * FROM feed_by_author WHERE user_id="user_id" AND author_id="unfriended_id"
- Extract
time
for the query result and run A LOT ofDELETE
queries towardsfeed_by_time
- Run one big
DELETE
to clean-upfeed_by_author
(in this case, it will remove the entire partition)
My worry is mostly around the second point. I read that I should avoid such deletes since they are extremely costly due to tombstones. So, should I be worried, and are there more efficient/correct ways of doing this?
In traditional RDBMS I would consider having a deleted
column, and setting its value to 1. As far as I understand deletes and updates will result in a similar "slow-down"?
As a follow up to this question - say, I re-follow/re-friend someone. This will result in a lot of INSERT
into the feed_by_time
user-specific partition since we will be effectively merging the entire author's feed into the user's feed. The time
value on each post can vary a lot, so we can be inserting it in the "middle" of the partition, is it costly in terms of performance? Haven't found much information regarding that.
It would be great to get a comment from someone who is using Cassandra in production on a large scale project, with 100k+ users - how much does deletes REALLY affect performance? Is it a huge hit to the performance? If I expect a decent rate of deletes, should Cassandra still be a consideration, or should I look into alternative DBs?
Thanks a lot!