When studying the field of Sorting Algorithms, many Computer Science courses stop at some very famous ones, like Quicksort and Merge Sort. They are very efficient, performing at O(n log n) time, and can be applied to any data that can be compared. In fact, real-world sorting implementations, like CPython’s Timsort and Powersort are derived from those simpler ones, while adding some pretty advanced techniques.

Most people stop there, for many reasons. Perhaps lack of curiosity. Perhaps we just had to move on after seeing too many algorithms that achieve the same thing. Perhaps people are convinced those are the best we can achieve. After all, if there was a faster algorithm, it would be well known, right?

Turns out that there’s a whole class of alternative sorting algorithms that are asymptotically faster than the well known algorithms. And some of them are really simple. And we can beat Python’s sorted (i.e. Powersort) in our benchmark with them*!

What is this?

Today we are going to focus on Counting Sort, an algorithm from the class of Distribution Sorts. This class is especially interesting because it distributes elements into other data structures and then gathers them back afterwards, instead of comparing elements to each other.

If you studied Computer Science, you may remember that there are some algorithms that trade memory-efficiency for time-efficiency (see: memoization). That’s similar to what Distribution Sorts do. For instance, Counting Sort uses a big array to store every element, in addition to an output array, in case you’re not sorting in-place.

The catch is: Counting Sort and others only work for arrays of non-negative integers. So, if you can map your objects clealy into non-negative integers beforehand, it should work. Given that restriction the complexity of Counting Sort is as follows.

  • Time: O(n + k)
  • Space: O(n + k)

Where n is the length of the array, and k is the maximum element of the array.

Counting Sort

The basic idea is that you allocate a big array with size k (or list in our case) where each index i represents the values from the input array, and each stored value v represents the number of occurences of each element. After we populate this array, we just need to iterate it and populate our output array with each element i times its occurrences v.

My implementation of Counting Sort is defined as follows.

from typing import List

def countingsort(lst: List[int]) -> List[int]:
    """Simple counting sort implementation assuming non-negative integers"""
    # Handle empty list
    if not lst:
        return []
    
    # Find the maximum value
    max_val = max(lst)

    # Initialize occurrences list
    # Using list multiplication for better performance
    occurs = [0] * (max_val + 1)
    
    # Count occurrences
    for i in lst:
        occurs[i] += 1
    
    # Build output list
    out = []
    for i, v in enumerate(occurs):
        # If there's at least one occurrence of i
        if v > 0:
            # Add the element i, v times
            out += [i] * v
    
    return out

This program should return the same result as the sorted function, as long as the input contains only non-negative integers. What might differ between them is the execution time and memory usage.

Let’s go through this algorithm and analyze its performance in various situations.

Analysis

The memory usage of this program can be determined from verifying that the occurs list has size v where v is the largest element in the input, and that the output list has size n, size it has the same size as the input.

The time complexity isn’t as straightforward to analyze because we’re using Python’s efficient list multiplication. Let’s assume that appending occurrences in the last loop takes constant time.

We can see that computing the max value is an iteration over n elements, while counting occurrences takes k iterations, and building the output takes k iterations as well, resulting in 2k + n.

As we can see, this algorithm is insensitive to factors such the sorted input, likely because it doesn’t make any comparison.

So, the best case for it is where the largest value is not too high, i.e. the lower k is, the more this algorithm approximates linear time/space.

The worst case is when k is too high, particularly when it is much bigger than the other elements.

Benchmark

I decided to compare our Python Counting Sort (CS) code to CPython’s implementation of Powersort from the sorted in Python 3.13. The experiments consist of 10 executions of each function given lists with random non-negative integers. I have varied the list size from 64 to 16,777,216, and range of numbers (k) from 100 to 100,000.

It is expected for CS to underperform, since our code is implemented using Python code, while Python’s builtin functions are implemented in C. However, we found some surprising results.

Very Small range (0-100)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       55.9μs          6.1μs           1.7KB      0.6KB      0.11x      sorted()
128      87.3μs          9.1μs           2.3KB      1.1KB      0.10x      sorted()
256      120.8μs         18.5μs          3.3KB      2.1KB      0.15x      sorted()
512      121.0μs         33.2μs          5.8KB      4.1KB      0.27x      sorted()
1024     123.6μs         68.4μs          9.9KB      12.1KB     0.55x      sorted()
2048     168.9μs         140.2μs         18.7KB     24.0KB     0.83x      sorted()
4096     226.0μs         301.7μs         36.8KB     48.0KB     1.34x      countingsort
8192     347.8μs         551.7μs         70.5KB     95.8KB     1.59x      countingsort
16384    580.1μs         1.1ms           131.9KB    191.5KB    1.95x      countingsort
32768    2.9ms           2.3ms           294.9KB    382.7KB    0.80x      sorted()
65536    12.6ms          4.5ms           551.9KB    765.6KB    0.35x      sorted()
131072   30.5ms          8.9ms           1.1MB      1.5MB      0.29x      sorted()
262144   67.7ms          18.1ms          2.1MB      3.0MB      0.27x      sorted()
524288   141.4ms         36.6ms          4.3MB      6.0MB      0.26x      sorted()
1048576  297.1ms         73.2ms          8.5MB      12.0MB     0.25x      sorted()
2097152  591.8ms         147.5ms         17.1MB     23.9MB     0.25x      sorted()
4194304  1.19s           300.0ms         34.2MB     47.8MB     0.25x      sorted()
8388608  2.39s           610.4ms         68.4MB     95.7MB     0.26x      sorted()
16777216 4.79s           1.23s           136.7MB    191.4MB    0.26x      sorted()

Small range (0-1000)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       329.6μs         6.9μs           9.9KB      0.6KB      0.02x      sorted()
128      365.7μs         10.3μs          11.6KB     1.1KB      0.03x      sorted()
256      475.2μs         19.6μs          14.7KB     2.1KB      0.04x      sorted()
512      609.3μs         32.7μs          20.7KB     4.1KB      0.05x      sorted()
1024     788.2μs         71.6μs          29.3KB     12.1KB     0.09x      sorted()
2048     984.1μs         146.2μs         42.9KB     24.1KB     0.15x      sorted()
4096     1.3ms           335.7μs         63.4KB     48.1KB     0.25x      sorted()
8192     1.3ms           745.9μs         96.9KB     96.1KB     0.56x      sorted()
16384    1.8ms           1.4ms           166.0KB    192.1KB    0.80x      sorted()
32768    2.3ms           2.8ms           290.0KB    384.0KB    1.25x      countingsort
65536    3.7ms           6.0ms           545.7KB    767.9KB    1.64x      countingsort
131072   5.7ms           11.9ms          1.1MB      1.5MB      2.08x      countingsort
262144   13.4ms          24.3ms          2.2MB      3.0MB      1.82x      countingsort
524288   91.0ms          49.2ms          4.2MB      6.0MB      0.54x      sorted()
1048576  260.0ms         99.6ms          8.4MB      12.0MB     0.38x      sorted()
2097152  653.7ms         202.5ms         17.1MB     24.0MB     0.31x      sorted()
4194304  1.23s           408.7ms         33.9MB     48.0MB     0.33x      sorted()
8388608  2.52s           818.4ms         66.5MB     96.0MB     0.33x      sorted()
16777216 5.30s           1.67s           135.6MB    191.9MB    0.32x      sorted()

Medium range (0-10000)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       3.4ms           6.8μs           80.2KB     0.6KB      0.00x      sorted()
128      3.3ms           8.8μs           82.7KB     1.1KB      0.00x      sorted()
256      3.9ms           21.0μs          87.1KB     2.1KB      0.01x      sorted()
512      3.6ms           41.3μs          95.5KB     4.1KB      0.01x      sorted()
1024     4.3ms           72.7μs          113.0KB    12.1KB     0.02x      sorted()
2048     4.8ms           151.5μs         146.2KB    24.1KB     0.03x      sorted()
4096     6.7ms           317.6μs         199.9KB    48.1KB     0.05x      sorted()
8192     8.0ms           705.1μs         293.7KB    96.1KB     0.09x      sorted()
16384    10.0ms          1.5ms           425.9KB    192.1KB    0.15x      sorted()
32768    11.7ms          3.1ms           618.0KB    384.1KB    0.27x      sorted()
65536    12.8ms          6.6ms           913.6KB    768.1KB    0.52x      sorted()
131072   14.8ms          13.6ms          1.4MB      1.5MB      0.91x      sorted()
262144   19.4ms          28.7ms          2.4MB      3.0MB      1.48x      countingsort
524288   28.3ms          59.6ms          4.8MB      6.0MB      2.11x      countingsort
1048576  47.1ms          124.3ms         9.2MB      12.0MB     2.64x      countingsort
2097152  82.5ms          260.5ms         16.8MB     24.0MB     3.16x      countingsort
4194304  601.5ms         537.1ms         35.6MB     48.0MB     0.89x      sorted()
8388608  1.90s           1.09s           68.1MB     96.0MB     0.58x      sorted()
16777216 4.41s           2.23s           129.3MB    192.0MB    0.51x      sorted()

Large range (0-100000)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       32.0ms          10.9μs          780.9KB    0.6KB      0.00x      sorted()
128      32.5ms          11.4μs          780.7KB    1.1KB      0.00x      sorted()
256      32.7ms          18.7μs          789.2KB    2.1KB      0.00x      sorted()
512      33.0ms          34.5μs          799.1KB    4.1KB      0.00x      sorted()
1024     33.5ms          72.0μs          816.8KB    12.1KB     0.00x      sorted()
2048     34.2ms          152.8μs         854.8KB    24.1KB     0.00x      sorted()
4096     35.7ms          324.0μs         923.2KB    48.1KB     0.01x      sorted()
8192     39.1ms          696.9μs         1.0MB      96.1KB     0.02x      sorted()
16384    44.9ms          1.5ms           1.3MB      192.1KB    0.03x      sorted()
32768    54.9ms          3.2ms           1.8MB      384.1KB    0.06x      sorted()
65536    72.1ms          6.7ms           2.6MB      768.1KB    0.09x      sorted()
131072   91.8ms          14.4ms          3.7MB      1.5MB      0.16x      sorted()
262144   110.3ms         30.6ms          5.4MB      3.0MB      0.28x      sorted()
524288   122.1ms         65.5ms          7.6MB      6.0MB      0.54x      sorted()
1048576  143.4ms         138.8ms         12.3MB     12.0MB     0.97x      sorted()
2097152  176.6ms         298.4ms         19.8MB     24.0MB     1.69x      countingsort
4194304  247.6ms         645.2ms         36.6MB     48.0MB     2.61x      countingsort
8388608  403.5ms         1.39s           74.1MB     96.0MB     3.43x      countingsort
16777216 693.8ms         2.93s           145.2MB    192.0MB    4.22x      countingsort

As you can see, CS had much lower execution times than sorted in some very specific cases, while also using less memory with inputs of size 1000 or less. These results suggest that there’s a certain ratio of size per range where CS is particularly efficient. Unfortunately though, this means that the use cases for CS should be even more limited. For this particular implementation, at least.

Inevitably, I think, we should go deeper and explore an implementation that’s, comparably to sorted, implemented and compiled in a performant language.

Rust Implementation

The next step would be for me to test sorted against an implementation in either C or Rust, since the two languages are the most straightforward for creating efficient compiled packages for Python.

C was one of the first languages in my education, but something about Python’s instructions for C modules put me off. Rust’s Maturin seems much easier to use, on the other hand, so I went with Rust.

I actually don’t know any Rust yet, and thought of implementing CS in it myself while doing a tutorial. Maybe another day: I decided to use an existing CS package and just import it from Python. With a little help from the docs plus my smart automated assistant, I’ve reached the following code for this module. I just needed to build it with maturin develop -r and it got compiled, optimized, and placed in my virtualenv.

use pyo3::prelude::*;
use pyo3::types::PyList;
use counting_sort::CountingSort;

#[pyfunction]
fn counting_sort(lst: &Bound<'_, PyList>) -> PyResult<Vec<usize>> {
    let vec: Vec<usize> = lst.extract()?;
    match vec.iter().cnt_sort() {
        Ok(sorted_vec) => Ok(sorted_vec),
        Err(_) => Err(pyo3::exceptions::PyValueError::new_err("Failed to sort")),
    }
}

/// A Python module implemented in Rust.
#[pymodule]
fn counting_sort_rust(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(counting_sort, m)?)?;
    Ok(())
}

Now we can use the same benchmark to test it against Python’s Powersort.

Rust Benchmark

This time we went with different range increments, because the results were the same starting at k = 500.

Very Small range (0-100)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       3.4μs           5.0μs           0.6KB      0.6KB      1.49x      countingsort
128      5.0μs           8.4μs           1.1KB      1.1KB      1.67x      countingsort
256      6.7μs           14.6μs          2.1KB      2.1KB      2.19x      countingsort
512      8.5μs           35.9μs          4.1KB      4.1KB      4.24x      countingsort
1024     18.5μs          64.1μs          8.1KB      12.1KB     3.47x      countingsort
2048     40.0μs          150.5μs         16.1KB     24.0KB     3.76x      countingsort
4096     53.1μs          277.5μs         32.1KB     48.0KB     5.23x      countingsort
8192     105.7μs         589.5μs         64.1KB     95.8KB     5.58x      countingsort
16384    214.8μs         1.2ms           128.1KB    191.5KB    5.43x      countingsort
32768    412.7μs         2.3ms           256.1KB    382.7KB    5.61x      countingsort
65536    813.6μs         4.6ms           512.1KB    765.6KB    5.68x      countingsort
131072   1.7ms           9.3ms           1.0MB      1.5MB      5.58x      countingsort
262144   3.6ms           18.8ms          2.0MB      3.0MB      5.14x      countingsort
524288   7.5ms           38.1ms          4.0MB      6.0MB      5.10x      countingsort
1048576  15.0ms          77.2ms          8.0MB      12.0MB     5.14x      countingsort
2097152  30.1ms          154.3ms         16.0MB     23.9MB     5.13x      countingsort
4194304  61.9ms          311.4ms         32.0MB     47.8MB     5.03x      countingsort
8388608  122.4ms         620.7ms         64.0MB     95.7MB     5.07x      countingsort
16777216 245.5ms         1.25s           128.0MB    191.4MB    5.09x      countingsort

Very Small range (0-200)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       4.4μs           4.1μs           0.6KB      0.6KB      0.92x      sorted()
128      3.7μs           8.4μs           1.1KB      1.1KB      2.30x      countingsort
256      6.2μs           14.8μs          2.1KB      2.1KB      2.40x      countingsort
512      8.6μs           39.5μs          4.1KB      4.1KB      4.57x      countingsort
1024     16.6μs          65.0μs          8.1KB      12.1KB     3.92x      countingsort
2048     27.8μs          147.0μs         16.1KB     24.1KB     5.29x      countingsort
4096     57.0μs          290.0μs         32.1KB     48.1KB     5.09x      countingsort
8192     159.4μs         616.6μs         64.1KB     96.0KB     3.87x      countingsort
16384    279.7μs         1.3ms           128.1KB    191.9KB    4.48x      countingsort
32768    481.8μs         2.5ms           256.1KB    383.3KB    5.21x      countingsort
65536    816.6μs         5.0ms           512.1KB    766.8KB    6.18x      countingsort
131072   1.7ms           10.2ms          1.0MB      1.5MB      5.85x      countingsort
262144   3.7ms           20.5ms          2.0MB      3.0MB      5.54x      countingsort
524288   7.6ms           41.5ms          4.0MB      6.0MB      5.49x      countingsort
1048576  16.3ms          83.4ms          8.0MB      12.0MB     5.11x      countingsort
2097152  30.7ms          166.7ms         16.0MB     24.0MB     5.43x      countingsort
4194304  63.6ms          344.1ms         32.0MB     47.9MB     5.41x      countingsort
8388608  123.0ms         675.9ms         64.0MB     95.8MB     5.50x      countingsort
16777216 254.9ms         1.39s           128.0MB    191.7MB    5.46x      countingsort

Very Small range (0-300)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       9.5μs           6.0μs           0.9KB      0.6KB      0.63x      sorted()
128      9.8μs           8.0μs           1.5KB      1.1KB      0.81x      sorted()
256      24.4μs          15.4μs          3.3KB      2.1KB      0.63x      sorted()
512      34.2μs          31.6μs          6.3KB      4.1KB      0.92x      sorted()
1024     66.7μs          68.3μs          12.0KB     12.1KB     1.02x      countingsort
2048     132.1μs         152.1μs         24.6KB     24.1KB     1.15x      countingsort
4096     236.4μs         319.7μs         48.3KB     48.1KB     1.35x      countingsort
8192     548.2μs         630.3μs         98.0KB     96.0KB     1.15x      countingsort
16384    1.2ms           1.5ms           191.9KB    191.9KB    1.26x      countingsort
32768    1.9ms           2.6ms           387.0KB    383.7KB    1.39x      countingsort
65536    3.7ms           5.4ms           772.0KB    767.3KB    1.46x      countingsort
131072   7.3ms           10.9ms          1.5MB      1.5MB      1.49x      countingsort
262144   15.1ms          22.1ms          3.0MB      3.0MB      1.46x      countingsort
524288   30.7ms          44.2ms          6.0MB      6.0MB      1.44x      countingsort
1048576  61.2ms          100.3ms         12.1MB     12.0MB     1.64x      countingsort
2097152  123.5ms         181.3ms         24.2MB     24.0MB     1.47x      countingsort
4194304  250.3ms         363.3ms         48.4MB     47.9MB     1.45x      countingsort
8388608  493.6ms         724.5ms         96.8MB     95.9MB     1.47x      countingsort
16777216 986.7ms         1.46s           193.5MB    191.8MB    1.48x      countingsort

Very Small range (0-400)
--------------------------------------------------
Size     CountingSort    sorted()        CS_Mem     Sort_Mem   Speedup    Winner
--------------------------------------------------------------------------------
64       13.7μs          4.2μs           1.3KB      0.6KB      0.30x      sorted()
128      20.3μs          10.3μs          2.5KB      1.1KB      0.51x      sorted()
256      34.9μs          15.4μs          4.5KB      2.1KB      0.44x      sorted()
512      64.1μs          39.9μs          8.9KB      4.1KB      0.62x      sorted()
1024     129.8μs         83.6μs          17.7KB     12.1KB     0.64x      sorted()
2048     257.4μs         150.6μs         36.5KB     24.1KB     0.58x      sorted()
4096     509.9μs         311.1μs         73.3KB     48.1KB     0.61x      sorted()
8192     1.1ms           694.1μs         144.6KB    96.1KB     0.65x      sorted()
16384    2.0ms           1.4ms           289.2KB    192.0KB    0.71x      sorted()
32768    3.9ms           2.8ms           579.5KB    383.7KB    0.73x      sorted()
65536    7.9ms           5.7ms           1.1MB      767.4KB    0.73x      sorted()
131072   15.6ms          11.4ms          2.3MB      1.5MB      0.73x      sorted()
262144   31.8ms          23.0ms          4.5MB      3.0MB      0.72x      sorted()
524288   63.4ms          47.0ms          9.0MB      6.0MB      0.74x      sorted()
1048576  127.4ms         95.1ms          18.1MB     12.0MB     0.75x      sorted()
2097152  254.6ms         191.3ms         36.1MB     24.0MB     0.75x      sorted()
4194304  510.0ms         384.3ms         72.2MB     48.0MB     0.75x      sorted()
8388608  1.02s           777.0ms         144.5MB    95.9MB     0.76x      sorted()
16777216 2.14s           1.56s           288.8MB    191.8MB    0.73x      sorted()

As we can see, Rust CS dominated the results up to range 300, reaching close to 6.2X Speedup. At higher ranges, this version of CS lose performance due to memory access overhead, perhaps. Interestingly, its the memory usage is comparable to sorted’s, but growing faster as the range of elements grow.

Conclusion

In this post, we have seen an analysis and benchmark of Counting Sort. I hope it was informative. My hope was that we could leave this with clearer use cases for this algorithm, but what remains is the conventional wisdom of using it when your maximum element is not large.