Japanese version (with more detailed explanations and digressions)
In polling operations on timelines in Snowflake-ID order, it is a common optimization technique to skip statuses with IDs lower than or equal to the ID of the latest status using the `since_id` parameter of the API, on the assumption that the IDs are chronologically ordered. In fact, however, the ordering of Snowflake IDs is not complete. Instead, they are only roughly sorted: statuses posted within a second of one another are only guaranteed to be within a second of one another in the ID space. To avoid missing statuses, rewind the `since_id` parameter value of the subsequent request so that the timestamp part of the ID is at least one second earlier than that of the latest status.
Suppose your app is polling a timeline for new Tweets using the Twitter API, that is, the app periodically fetches the timeline and processes new Tweets as they appear. This is a common operation typically seen in client apps that display a timeline to an end user... oh, no, third-party clients have been banned by Twitter. Well, another example is bot apps that respond to a timeline in real time... ugh, Twitter has declared a battle against the bots. Anyway, we are going to discuss this not-so-common-at-least-among-Twitter's-friends-anymore operation called polling... for fun, maybe? Though you will have to pay ~$100/mo. for this.
A naïve approach to polling a timeline would be to fetch a constant number of the latest Tweets in every request and filter out Tweets that have already been processed in previous requests on the client's side. This would cause the same Tweets to be repeatedly returned by the API, wasting network bandwidth and the Tweet cap.
The alternative way recommended by Twitter is to use the `since_id` parameter of the API. The parameter tells the API to only return Tweets with IDs higher than the specified value. Since Tweet IDs are said to be chronologically ordered, it is plausible that you can safely skip the redundant Tweets by setting the parameter to the ID of the latest Tweet the app has fetched before. At least, that's what the Twitter documentation claims.
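For concreteness, the common approach looks roughly like the following Python sketch; `fetch_timeline` and `process` are hypothetical placeholders for the API call and your app's handling, and only the `since_id` parameter comes from the actual API:

```python
import time

def poll_with_since_id(fetch_timeline, process, interval_s=60):
    """Common polling loop: remember the highest ID seen and pass it as since_id."""
    latest_id = None
    while True:
        # fetch_timeline is assumed to return a list of Tweet dicts, newest first.
        tweets = fetch_timeline(since_id=latest_id)
        for tweet in reversed(tweets):  # process oldest first
            process(tweet)
            latest_id = max(latest_id or 0, tweet["id"])
        time.sleep(interval_s)
```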
In fact, that reasoning has a small hole in it. By examining the design of Tweet IDs, you will notice an edge case around their ordering and see that the common approach mentioned earlier would, on rare occasions, miss Tweets that should otherwise be processed, making the timeline "leaky". This article demonstrates how the problem of timeline leaks occurs and proposes an improved approach to polling timelines.
Tweet IDs are generated by Snowflake, an internal service developed by Twitter, which is used to generate IDs for other kinds of resources like users as well. The goal of Snowflake is to generate IDs in parallel across machines in an uncoordinated manner, to keep up with the scale of Twitter.
Because of its uncoordinated nature, Snowflake cannot guarantee the chronological ordering of the IDs, as described in the README of its GitHub repository. However, it still guarantees the IDs to be roughly sorted: in Twitter's words, "we're promising 1s, but shooting for 10's of ms", meaning that "tweets posted within a second of one another will be within a second of one another in the id space too", according to the Twitter blog article Announcing Snowflake.
To achieve this rough ordering, Snowflake puts a timestamp in the most significant bits (`id >> 22`) of the generated ID, making sure that the IDs are ordered by the time of their generation according to the respective machine's clock. The timestamp is of millisecond precision and is based on a custom epoch, which is `twepoch = 1288834974657` milliseconds after the Unix epoch.
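As a quick illustration of that layout, the timestamp can be recovered from a Tweet ID with a couple of lines of Python (the function name is made up for this article, not part of any API):

```python
from datetime import datetime, timezone

TWEPOCH_MS = 1288834974657  # Snowflake's custom epoch, in Unix milliseconds

def snowflake_to_unix_ms(tweet_id: int) -> int:
    """Millisecond-precision Unix time stored in the top bits of a Snowflake ID."""
    return (tweet_id >> 22) + TWEPOCH_MS

# Example: convert an arbitrary, made-up Tweet ID into a UTC datetime.
posted_at = datetime.fromtimestamp(
    snowflake_to_unix_ms(1585000000000000000) / 1000, tz=timezone.utc
)
```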
Now consider the `since_id`-based polling again. Because the ordering is only approximate, a Tweet posted around the same time as the latest Tweet your app has seen can be assigned a lower ID, for example when it is generated on a Snowflake machine whose clock lags slightly behind. If the next request then sets the `since_id` parameter to the ID of that latest Tweet, the out-of-order Tweet falls below the threshold and will never be returned, so the app silently misses it.
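To make the edge case concrete, here is a small Python sketch; the bit layout follows the description above, while the machine IDs and the 700 ms clock skew are made-up numbers for illustration:

```python
TWEPOCH_MS = 1288834974657

def make_snowflake(unix_time_ms: int, machine_id: int = 0, sequence: int = 0) -> int:
    """Assemble an ID from a 41-bit timestamp, 10-bit machine ID and 12-bit sequence."""
    return ((unix_time_ms - TWEPOCH_MS) << 22) | (machine_id << 12) | sequence

t = 1_700_000_000_000                                  # some instant, in Unix ms
seen_id = make_snowflake(t, machine_id=1)              # latest Tweet the app has processed
late_id = make_snowflake(t + 200 - 700, machine_id=2)  # posted 200 ms later, clock 700 ms behind

# The newer Tweet sorts *before* the one already seen,
# so a request with since_id=seen_id will never return it.
assert late_id < seen_id
```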
The previous section showed that the common polling approach may miss Tweets that should otherwise be fetched, due to the lack of a complete ordering of Snowflake IDs. However, the rough ordering guarantee gives us an upper bound on how far out of order the IDs can be, and we can exploit it to make the polling leak-free.
Now, let's pretend that we have a Tweet whose ID's timestamp is one second (`k = 1000` milliseconds in the pseudo code below) earlier than that of the latest Tweet we have fetched. Given the rough ordering guarantee, its ID could be used as the `since_id` parameter value to catch every future Tweet. Of course, we don't necessarily have such a Tweet, but we can calculate a (hypothetical) Snowflake ID with the same property using the following pseudo code (let `latest_id` be the ID of the latest Tweet):
```
k = 1000
timestamp = latest_id >> 22
since_id =
    if timestamp <= k then
        /* Keep the timestamp from going negative */
        latest_id
    else
        /* Subtract one because the `since_id` parameter is exclusive */
        ((timestamp - k) << 22) - 1
```
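The same calculation as runnable Python, as one possible rendering of the pseudo code (the function name is ours):

```python
def rewound_since_id(latest_id: int, k: int = 1000) -> int:
    """Hypothetical Snowflake ID whose timestamp is k milliseconds behind latest_id's."""
    timestamp = latest_id >> 22
    if timestamp <= k:
        # Keep the timestamp field from going negative; only relevant for IDs
        # minted within the first second after twepoch.
        return latest_id
    # Subtract one because the since_id parameter is exclusive.
    return ((timestamp - k) << 22) - 1
```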
The `since_id` value shown here is sufficient to prevent the problem of timeline leaks. However, as the pseudo ID value is at least one second lower than `latest_id` in the ID space, this would cause the same Tweets to be repeatedly returned by the API, especially when the speed of the timeline is low. This is redundant when we know that the latest Tweet was posted more than a second before the last request.
Let's improve the earlier pseudo code by setting the `since_id` parameter to `latest_id` as is if its timestamp is more than `k` milliseconds older than the time of the last request. Letting `retrieved_at` be the millisecond-precision Unix time of the last request, the code will look like the following:
```
twepoch = 1288834974657
k = 1000

clamp(x, lower, upper) = max(lower, min(upper, x))
time2sf(unix_time_ms) = max(0, unix_time_ms - twepoch) << 22

/* Same calculation as the earlier code.
 * This value may be used when the local clock is behind Twitter's. */
timestamp = latest_id >> 22
lower =
    if timestamp <= k then
        latest_id
    else
        ((timestamp - k) << 22) - 1

since_id = clamp(time2sf(retrieved_at - k) - 1, lower, latest_id)
```
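Again, one possible Python rendering of the pseudo code above (function names are ours, not part of any API):

```python
TWEPOCH_MS = 1288834974657

def clamp(x: int, lower: int, upper: int) -> int:
    return max(lower, min(upper, x))

def time_to_snowflake(unix_time_ms: int) -> int:
    """Smallest Snowflake ID whose timestamp corresponds to unix_time_ms."""
    return max(0, unix_time_ms - TWEPOCH_MS) << 22

def improved_since_id(latest_id: int, retrieved_at: int, k: int = 1000) -> int:
    """Rewind from the time of the last request, but never below the earlier
    calculation (in case the local clock is behind Twitter's) and never above
    latest_id."""
    timestamp = latest_id >> 22
    lower = latest_id if timestamp <= k else ((timestamp - k) << 22) - 1
    return clamp(time_to_snowflake(retrieved_at - k) - 1, lower, latest_id)
```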
Note that this code still fetches duplicate Tweets, which your app needs to handle, although the number of duplicates is now minimal.
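One simple way to handle the duplicates, assuming only that your app can spare a bounded amount of memory, is to keep a set of recently processed IDs (a sketch, not part of the article's experiment code):

```python
from collections import OrderedDict

class SeenTweets:
    """Bounded memory of recently processed Tweet IDs, used to skip duplicates."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._ids = OrderedDict()

    def add(self, tweet_id: int) -> bool:
        """Record an ID; return True if it was new, False if already processed."""
        if tweet_id in self._ids:
            return False
        self._ids[tweet_id] = None
        if len(self._ids) > self.capacity:
            self._ids.popitem(last=False)  # evict the oldest remembered ID
        return True
```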
In this article, we have demonstrated the problem of timeline leaks, where client apps polling a timeline may miss Tweets that should otherwise be fetched due to the incomplete ordering of Snowflake IDs, and proposed an improved polling approach that rewinds the `since_id` parameter to account for it.
However, the reasoning is purely theoretical so far. The experiment directory contains experimental code to see if the timeline leaks actually occur in the Twitter API.
See COPYING.md for the copyright notice and license of this article and the experiment code.