Reddit's empire no longer founded on a flawed algorithm

About a month ago on January 12 2014, reddit made a seemingly minor change to its hot algorithm. This algorithm is responsible for determining which stories appear at the top of the front page and each individual subreddit. You can see why this algorithm is crucial to the company as it decides which submissions live and prosper as well as which ones wither and die.

The algorithm was often criticized as being flawed, the most recent occurence happening on December 9 2013 (reddit thread). Reddit's response was always that there's was more to it and that the "flaw" was by-design.

The hot sort might seem a bit weird, but it's intentionally set up that way.

- bsimpson63

You can see most of the responses this criticism on this GitHub issue.

So why did they suddenly change? What was the impact? Let's look at the code.

Old algorithm
cpdef double _hot(long ups, long downs, double date):  
    """The hot formula. Should match the equivalent function in postgres."""
    s = score(ups, downs)
    order = log10(max(abs(s), 1))
    if s > 0:
        sign = 1
    elif s < 0:
        sign = -1
    else:
        sign = 0
    seconds = date - 1134028003
    return round(order + sign * seconds / 45000, 7)
New algorithm
cpdef double _hot(long ups, long downs, double date):  
    """The hot formula. Should match the equivalent function in postgres."""
    s = score(ups, downs)
    order = log10(max(abs(s), 1))
    if s > 0:
        sign = 1
    elif s < 0:
        sign = -1
    else:
        sign = 0
    seconds = date - 1134028003
    return round(sign * order + seconds / 45000, 7)

Notice the difference?

-    return round(order + sign * seconds / 45000, 7)
+    return round(sign * order + seconds / 45000, 7)

Let's take a look at the impact on a couple examples.

Impact of score

The score (upvotes - downvotes) is used to compute the order value.

                                 old            new
# score > 0
_hot(1, 0, 1262304000)           2850.0         2850.0  
_hot(1000, 500, 1262304000)      2852.69897     2852.69897

# score < 0
_hot(0, 1, 1262304000)          -2851.0         2850.0  
_hot(1000, 1500, 1262304000)    -2848.30103     2847.30103

# score == 0
_hot(1000, 1000, 1262304000)     0.0            2850.0  
score > 0

We notice right away that submissions with a positive score haven't changed at all. It makes sense since the change only moved where sign was applied and in the case of positive scores, the sign is always 1.

score < 0

Here we notice a drastic change in the results. Submissions that previously had a negative ranking now have a positive one. Furthermore, the lower the score, the lower the submission will be ranked. That was not the case before. Yes, that's right, older submissions with a negative score would rank higher with the old algorithm than newer submissions with the same score. In practice this wasn't such a problem because the interesting, positive submissions would still be ranked before them anyway.

score == 0

Finally, when the score was 0, all submissions would be ranked at 0, no matter what. With the new algorithm, the score has no impact on the rank, only it's age as we'll see next.

Impact of age

1262304000 and 1353107345 are simply timestamps represented in seconds since epoch. Higher is more recent.

                                 old            new
# score > 0
_hot(1000, 500, 1262304000)      2852.69897     2852.69897  
_hot(1000, 500, 1353107345)      4870.69897     4870.69897

# score < 0
_hot(1000, 1500, 1262304000)    -2848.30103     2847.30103  
_hot(1000, 1500, 1353107345)    -4866.30103     4865.30103

# score == 0
_hot(1000, 1000, 1262304000)     0.0            2850.0  
_hot(1000, 1000, 1353107345)     0.0            4868.0  
score > 0

Again, no change here for the same reasons.

score < 0

As before, the rankings are now positive with the new algorithm. But the biggest change is that newer submissions now rank higher than older ones. You'd think that would have been the case before, but submissions with a negative scores would quickly disappear from the front page with the previous flawed algorithm. Now, these stories have at least a chance to show up if their score isn't too abysmal.

score == 0

The rankings for submissions with a score of 0 serve as the baseline for submissions with a similar date. Furthermore, there's now a way to compare the rank of these submissions.

Bird's-eye view

If we distant ourselves from these results, this is what we see:

old algorithm:  
  positive submissions
  neutral (score == 0) submissions
  negative submissions

new algorithm:  
  positive submissions within a time frame
  neutral submissions within a time frame
  negative submissions within a time farme
  positive submissions within an older time frame
  ...

Of course in practice things aren't as tidy as this, but the point to remember is that neutral and negative submissions can now appear before older positive submissions. You're still unlikely to see submissions with a score of 0 on your front page. However, if you look at the next pages or visit a less active subreddit, at least now the positively ranked submissions won't drown out the more recent, less well received submissions.

Further reading