Why APPROX_PERCENTILE in Apache IoTDB Returns 30 Instead of 25 for [10, 20, 30, 40]
Understanding the Behavior of APPROX_PERCENTILE in Apache IoTDB
When building latency or performance reports, calculating the median (50th percentile) is a standard requirement. However, if you are querying a small dataset with an even number of elements—such as [10.0, 20.0, 30.0, 40.0]—using APPROX_PERCENTILE(latency, 0.5) in Apache IoTDB might return 30.0 instead of the expected mathematical median of 25.0.
This article explains why this discrepancy occurs, the underlying algorithm IoTDB uses, and how to get the conventional median for your reports.
Why Doesn't It Interpolate to 25.0?
The discrepancy comes down to the difference between exact continuous percentiles and approximate percentiles designed for big data.
1. The Purpose of APPROX_PERCENTILE
In time-series databases like Apache IoTDB, datasets can easily scale to billions of rows. An exact percentile generally needs every value to be available for sorting or selection, so its memory and time costs grow with the size of the dataset. To avoid this, the APPROX_PERCENTILE function relies on an approximation algorithm (typically based on T-Digest or a similar sketch data structure).
These algorithms sacrifice exact mathematical interpolation on tiny datasets in exchange for high performance and bounded memory usage on massive datasets.
2. How the Quantile Rank is Calculated
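In standard statistics, when you have an even number of samples ($N$), the median is calculated by averaging the two middle values of the sorted data (at 1-based positions $N/2$ and $N/2 + 1$). For [10.0, 20.0, 30.0, 40.0], those are 20.0 and 30.0, giving $(20.0 + 30.0) / 2 = 25.0$.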
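The averaging rule can be checked with a few lines of Python. This is a minimal sketch of the textbook definition only; the function name `conventional_median` is made up for illustration and is not part of IoTDB.

```python
def conventional_median(values):
    """Textbook median: average the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return float(ordered[mid])
    # Even count: 1-based positions N/2 and N/2 + 1 are
    # zero-based indexes mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2.0


print(conventional_median([10.0, 20.0, 30.0, 40.0]))  # 25.0
```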
However, approximation algorithms and discrete percentile formulas often map the percentile $p$ directly to a specific rank without interpolating. For example, using a common ceiling-based rank formula:
Rank = ceil(p * N) = ceil(0.5 * 4) = 2
How that rank is read matters. As a one-based rank it selects the lower middle value (20.0); as a zero-based index it selects the upper middle value (30.0), which matches the result described here. Either way, the output is one of the stored values, never a point between them. On top of the indexing convention, the binning logic of the T-Digest centroid allocation determines which centroid's representative value is chosen. With only 4 data points, the internal digest structure has too little data to smooth across, so the function returns the upper middle value (30.0) or the nearest cluster representative rather than an interpolated 25.0.
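The ceiling-based rank formula can be read as either a one-based rank or a zero-based index, and the two readings pick different elements. The sketch below shows both for the example data. It illustrates the arithmetic only and does not claim to reproduce IoTDB's internal implementation; `nearest_rank_candidates` is a hypothetical helper name.

```python
import math


def nearest_rank_candidates(values, p):
    """Return ceil(p * N) and the element it selects under two index conventions."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty sequence is undefined")
    rank = math.ceil(p * n)
    # Read as a one-based rank (clamped so p = 0 still selects the first element).
    as_one_based = ordered[max(rank, 1) - 1]
    # Read as a zero-based index (clamped so p = 1 still selects the last element).
    as_zero_based = ordered[min(rank, n - 1)]
    return rank, as_one_based, as_zero_based


rank, as_one_based, as_zero_based = nearest_rank_candidates([10.0, 20.0, 30.0, 40.0], 0.5)
print(rank, as_one_based, as_zero_based)  # 2 20.0 30.0
```

Neither reading produces 25.0, because a rank-based lookup can only return a value that is actually present in the data.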
How to Get the Conventional Median (25.0) in IoTDB
If your report requires exact mathematical percentiles (with linear interpolation for even datasets), you have a few options depending on your IoTDB version and dataset size.
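To make "linear interpolation" concrete, here is a minimal Python sketch of a continuous percentile. It assumes the common convention of placing the percentile at fractional position p * (N - 1) in the sorted data (the convention usually associated with PERCENTILE_CONT in SQL engines). The function name is illustrative, and whether a given IoTDB function follows this convention should be verified separately.

```python
def interpolated_percentile(values, p):
    """Continuous percentile: interpolate linearly between neighbouring sorted values."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    # Fractional zero-based position of the requested percentile.
    position = p * (len(ordered) - 1)
    lower = int(position)  # floor, since position is never negative
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


print(interpolated_percentile([10.0, 20.0, 30.0, 40.0], 0.5))  # 25.0
```

For the example data the position is 1.5, halfway between 20.0 and 30.0, which yields the conventional median of 25.0.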
Option 1: Use the Exact PERCENTILE Function
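If you are dealing with smaller datasets where exact precision is more important than query speed, check whether your version of IoTDB supports an exact PERCENTILE or PERCENTILE_CONT (continuous percentile) function. A function that follows the continuous-percentile convention interpolates linearly between the two middle values, whereas a discrete exact percentile returns an actual data value, so confirm the behavior on a small dataset such as [10.0, 20.0, 30.0, 40.0] before relying on it: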
SELECT batch_id,
       PERCENTILE(latency, 0.5) AS exact_p50
FROM percentile_latency
GROUP BY batch_id;
Option 2: Calculate the Median Manually Using Window Functions
If an exact percentile function is not available in your specific IoTDB table model version, you can compute the exact median using standard SQL window functions (like ROW_NUMBER() and COUNT()) to find and average the middle values:
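-- Number the rows of each batch in ascending latency order
-- and attach the batch's total row count to every row.
WITH ordered_latency AS (
    SELECT batch_id,
           latency,
           ROW_NUMBER() OVER (PARTITION BY batch_id ORDER BY latency) AS row_num,
           COUNT(*) OVER (PARTITION BY batch_id) AS total_count
    FROM percentile_latency
)
-- Keep the middle row (odd count) or the two middle rows (even count)
-- and average them: for 4 rows, (4 + 1) / 2.0 = 2.5 selects rows 2 and 3.
SELECT batch_id,
       AVG(latency) AS conventional_median
FROM ordered_latency
WHERE row_num IN (FLOOR((total_count + 1) / 2.0), CEIL((total_count + 1) / 2.0))
GROUP BY batch_id;
Summary: When to Use Each Function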
- Use APPROX_PERCENTILE when querying massive time-series datasets where speed and resource efficiency are critical, and a result that can land on a neighboring sample for tiny inputs (e.g., getting 30 instead of 25) is acceptable.
- Use exact methods (like PERCENTILE or manual SQL CTEs) for small-batch reporting, financial metrics, or SLA validation where mathematical precision is strictly required.
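As a sanity check on the manual approach in Option 2, the row-selection rule from its WHERE clause (FLOOR and CEIL of (total_count + 1) / 2.0) can be replayed in plain Python for odd and even counts. This is a sketch that mirrors the query's arithmetic only; the helper names are hypothetical.

```python
import math


def middle_rows(total_count):
    """1-based row numbers kept by the FLOOR/CEIL filter in the Option 2 query."""
    half = (total_count + 1) / 2.0
    return math.floor(half), math.ceil(half)


def median_from_rows(values):
    """Average the rows selected by middle_rows, as AVG(latency) does in the query."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("median of an empty sequence is undefined")
    low, high = middle_rows(len(ordered))
    # ROW_NUMBER() starts at 1, so enumerate from 1 as well.
    picked = [v for row_num, v in enumerate(ordered, start=1) if row_num in (low, high)]
    return sum(picked) / len(picked)


print(middle_rows(4), median_from_rows([10.0, 20.0, 30.0, 40.0]))  # (2, 3) 25.0
print(middle_rows(5), median_from_rows([1.0, 2.0, 3.0, 4.0, 5.0]))  # (3, 3) 3.0
```

With an even count the filter keeps two rows and averages them; with an odd count both expressions collapse to the same row, so the average is that single middle value.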