Predicting Twitter IDs and self-quoting tweets

⚠️ this blog post is a draft, and requires editing.

Much like recursion, self-reference is a neat concept that never fails to tickle hackers and tinkerers. In particular, it's somewhat common practice to show off technical skills by crafting self-referential content on web platforms: this is most often done by abusing APIs in entertaining ways. YouTube is a great example of this:

  1. This Video Has 26,974,620 Views
  2. YouTube URL in the video

As of Twitter, Diomidis Spinellis posted the first ever self-referential (s.r.) Tweet back in 2009.

Anecdotically, the tweet caused some issues as the Twitter codebase was engineered around retweets as DAGs.

What makes s.r. Tweets fundamentally hard is that there's no way at all to edit a Tweet after creating it. No API quirks. No workarounds. The only way you're ever gonna get a s.r. Tweet is if you reverse-engineer Twitter's ID generation system. By comparing IDs of many different Tweets based on their timestamps, Diomidis has a good estimate of the ID burn rate. He wrote a blog post to describe his approach.

Mind you, this was in 2009, when Twitter used auto-incremented 32-bit IDs on MySql. Things have gotten harder.

The bad news is that Twitter's exponential growth in popularity has forced engineers to increase the ID generation space to 64 bits and to develop a custom ID generation service called Snowflake, which is way more complicated than auto-increment. The good news is that Twitter open-sourced Snowflake, which makes my work so much easier. Snowflake logically partitions the 64-bit space into four separate pieces of information as follows:

// SystemVerilog description of Twitter Snowflakes.

typedef struct packed {
    // Reserved for future uses.
    bit reserved;
    // Timestamp with millisecond-precision and
    // Twitter-specific epoch (1288834974657).
    bit [41:0] timestamp;
    bit [5:0] datacenter_id;
    bit [5:0] datacenter_worker_id;
    // Autoincremented ID for same-millisecond Tweets.
    bit [12:0] sequence_id;
} snowflake;

Predicting Snowflakes requires that you guess correctly every single one of the four fields. I will use a script to post Tweets at short intervals and possibly use the collected data to iteratively refine my predictions. The script will stop once we have correctly predicted a Snowflake.

Time to write some utilities for working with Snowflake.

TWITTER_EPOCH_MILLIS = 1288834974657

class Snowflake(NamedTuple):
    dtime: datetime
    worker_id: int
    seq_number: int

    def __repr__(self):
        return 'Snowflake<{}/{}/{}/{}>'.format(self.twitter_timestamp(), self.worker_id >> 5, self.worker_id & 0b11111, self.seq_number)

    def from_id(cls, id: int):
        # 12 bits with offset 0.
        seq_number = id & (2**12-1)
        # 10 bits (5 datacenter, 5 actual worker id) with offset 12.
        worker_id = (id >> 12) & (2**10-1)
        # 41 bits (1 reserved for future purposes according to spec) with offset 22.
        timestamp = ((id >> 22) + TWITTER_EPOCH_MILLIS) / 1000
        dtime = datetime.fromtimestamp(timestamp)
        return Snowflake(dtime, worker_id, seq_number)

    def to_id(self):
        return (self.twitter_timestamp() << 22) | (self.worker_id << 12) | (self.seq_number)

    def twitter_timestamp(self):
        return int(round(self.dtime.timestamp() * 1000 - TWITTER_EPOCH_MILLIS))

    def status_url(self, username):
        return '{}/{}'.format(username, self.to_id())

Snowflake.from_id allows us to reverse-engineer the Snowflake primitives from raw Tweet IDs. I then posted a bunch of Tweets to collect data.

The Snowflake model reserves 12 bits for a "sequence ID", which is auto-incremented whenever the server must generate two IDs at the same timestamp (which, at Twitter's scale, happens quite often). Intuitively, seq. IDs are much more predictable during low-traffic hours, when you're unlikely to compete with other users for low single-digit seq. IDs.

Twitter traffic by hour and day

Clearly it's easier to predict auto-incremented IDs when traffic is low, so I ran my script at night.

# Incidence rate of sequence IDs. Data collected over 274 Tweets.

0:      ++++++++++++++++++++++++++++++ 29.56%
1:      ++++++++++++++++++++++++ 24.45%
2:      ++++++++++++++++++ 18.25%
3:      ++++++++ 7.66%
4:      +++++ 5.11%
5:      +++ 2.92%
6:      ++++ 4.38%
7:      ++ 2.19%
8:      ++ 1.82%
9:      ++ 1.82%
>= 10:  ++ 1.82%

As one would assume, sequence IDs. By guessing Seq ID 0 every time, I would guess it right ~29.56% of the time. I've thought about looking for correlation between worker IDs and sequence IDs, but load balancers most probably remove any correlation whatsoever. There's surely some correlation between data center ID and sequence ID, but a VPN I wouldn't be able to choose what data center my request hits. In the end, it seemed like 29.56% was the best hit rate I could get on sequence IDs.

Data center and worker IDs

All my requests would get the same data center ID (11) with over 99% accuracy. Load balancing across intra- data center workers, however, complicates things. I wasn't able to find any pattern in the worker IDs generation, so I simply chose the most common. That gave a hit rate of 8%. Ouch.

On latency

Finally, I had to reliably predict at what exact timestamp the Twitter servers would generate the ID of my Tweet. This is, in and itself, not hard at all. Network latency is quite predictable. However, I was facing three problems:

  • 300/daily rate limiting. Not nearly enough for a serious statical analysis.
  • Millisecond-precision.
  • No apparent correlation between latency and worker IDs. This is expected, as workers in the same data center share low-latency network connections with each other.

I decided to keep track of the last N requests and track the round-to-id time delta.

At a first glace you might think that taking the average would be a good idea. I did too. Unfortunately, the expected value is arguably the worst choices for this use case. The data is right-skewed and has quite a few outliers. The median scores better and achieves a hit rate of 6.93%, compared to the abysimal 1.09% of the mean. I thought that it was enough. Just wait it out, right?

Not quite. As I've stated before, 300 hits are very, very few.

Can we do better?

Say hello to the log-normal distribution.

The log-normal distribution is well known in the industry to realistically describe server response times under consistent load. We have good reasons to assume log-normality and

This is what I finally came up with:

def predict_snowflake(now: datetime, history: List[AttemptRecord]) -> Snowflake:
    if len(history) == 0:
        return Snowflake(now, 0, 0)
    if len(history) < 3:
        ping = statistics.median([attempt.actual.dtime.timestamp() -
                                  attempt.dtime.timestamp() for attempt in history])
        log_ping_times = [math.log(attempt.actual.dtime.timestamp() -
                                   attempt.dtime.timestamp()) for attempt in history]
        lognormal_mu = statistics.mean(log_ping_times)
        lognormal_sigma = statistics.stdev(log_ping_times)
        ping = math.e ** (lognormal_mu - lognormal_sigma**2)
    most_popular_workers = Counter(
        [attempt.actual.worker_id for attempt in history]).most_common()
    # We always try with `seq_number == 0`.
    return Snowflake(now + timedelta(seconds=ping), most_popular_workers[0][0], 0)

About the sequence ID

The Snowflake model reserves 12 bits for a "sequence ID", which is auto-incremented whenever the server must generate two IDs at the same timestamp (which, at Twitter's scale, happens quite often).

At last, here it is: