The Mystery of Inconsistent String Collation on macOS

If you have ever run a C, C++, or Python application on macOS and noticed that string sorting or collation behaves differently on your local machine compared to a CI/CD environment (like GitHub Actions), you are not alone. A particularly baffling issue occurs when using the POSIX locale library: on some macOS machines, the collation of en_US.UTF-8 is case-sensitive (as expected), while on others, it acts completely case-insensitively, treating "apple" and "Apple" as identical.

This article dives into why this happens, how macOS handles locales under the hood, and how you can ensure consistent, cross-platform string collation in your applications.

Understanding the Symptom: A Tale of Two macOS Versions

Consider a standard C program that uses wcsxfrm to transform wide strings for collation comparison:

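#include <stdio.h>
#include <locale.h>
#include <wchar.h>

void perform_test(void) {
    /* Each buffer receives the transformed (collation-key) form of its input. */
    const wchar_t *in1 = L"a";
    wchar_t out1[10];
    wcsxfrm(out1, in1, 10);

    const wchar_t *in2 = L"A";
    wchar_t out2[10];
    wcsxfrm(out2, in2, 10);

    /* 0 means both inputs received identical collation keys. */
    printf("wcscmp result: %d\n", wcscmp(out1, out2));
}

int main(void) {
    /* Without setlocale the program stays in the "C" locale. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) {
        fprintf(stderr, "en_US.UTF-8 is not available\n");
        return 1;
    }
    perform_test();
    return 0;
}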

Depending on the macOS version, you will get drastically different results:

  • On macOS 15.7 (GitHub Actions): The transformed strings differ, so wcscmp returns a non-zero value (case-sensitive collation).
  • On macOS 15.5 (local M1 Mac): The transformed strings are identical (\x38\x01\x02), so wcscmp returns 0 (case-insensitive collation).

This is not a language bug. Because Python's locale.strxfrm wraps the system's C library, Python developers experience the exact same discrepancy.
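
The same check can be sketched in Python. This is a minimal illustration, not a definitive diagnostic: the helper name xfrm_keys is hypothetical, and it assumes the en_US.UTF-8 locale is installed (it returns None otherwise). It only uses locale.setlocale and locale.strxfrm from the standard library.

```python
import locale


def xfrm_keys(strings, locale_name="en_US.UTF-8"):
    """Map each string to its locale.strxfrm key under locale_name.

    Hypothetical helper: returns None when the locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return {s: locale.strxfrm(s) for s in strings}
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting


if __name__ == "__main__":
    keys = xfrm_keys(["a", "A"])
    if keys is None:
        print("en_US.UTF-8 is not available")
    else:
        print("keys:", {s: ascii(k) for s, k in keys.items()})
        print("case distinguished:", keys["a"] != keys["A"])
```

If the last line prints False, the host is giving "a" and "A" identical collation keys, which matches the case-insensitive behavior described above.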

The Root Cause: macOS libc and Localedef Evolution

The root of this issue lies in the macOS system C library (libc) and in the locale data Apple ships with it (the compiled form of locale definitions, which is what a tool like localedef produces).

1. Historical Limitations of macOS Locales

Historically, macOS (built on BSD) has had notoriously limited support for POSIX locales compared to GNU/Linux (glibc). For many years, macOS only supported basic UTF-8 character classification (LC_CTYPE), while collation (LC_COLLATE) defaulted to simple ASCII-based byte-wise comparison or extremely simplified tables, regardless of the active locale.

2. The Transition to Modern Collation

In recent releases (such as macOS 14 Sonoma and macOS 15 Sequoia), Apple has been actively updating its POSIX locale implementations. In newer minor versions (like macOS 15.7), Apple introduced a more compliant, multi-level collation weight generator. This generator correctly assigns distinct primary, secondary, and tertiary weights to characters, allowing it to differentiate between lowercase and uppercase letters (e.g., "a" vs "A") while maintaining proper alphabetical order.

In older minor versions (like macOS 15.5), the collation tables or the wcsxfrm engine were buggy, incomplete, or configured to collapse case distinctions entirely for certain locales. Because the behavior depends on the minor version, two environments that appear to run the same operating system can sort strings differently.
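
To make the idea of multi-level weights concrete, here is a toy sketch in Python. It is not Apple's implementation and does not use real CLDR weights: the function sort_key and its levels parameter are invented for illustration, and it only handles ASCII letters. The primary level identifies the base letter, the secondary level would carry accents (always zero here), and the tertiary level carries case.

```python
def sort_key(word, levels=3):
    """Toy multi-level collation key for ASCII letters (illustrative only)."""
    primary = [ord(ch.lower()) for ch in word]             # base letter
    secondary = [0 for _ in word]                          # accents: none in ASCII
    tertiary = [1 if ch.isupper() else 0 for ch in word]   # case: lowercase first
    # Keys compare level by level: all primaries first, then secondaries, etc.
    return tuple([primary, secondary, tertiary][:levels])


if __name__ == "__main__":
    words = ["banana", "Apple", "apple"]
    print(sorted(words, key=sort_key))                     # three levels
    print(sort_key("a", levels=1) == sort_key("A", levels=1))
```

With all three levels, "a" and "A" get different keys but still sort next to each other. With levels=1 the keys are equal, which is the kind of collapse the older behavior resembles.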

How to Achieve Consistent Collation

If your application needs consistent sorting across developer machines, CI/CD runners, and production servers, avoid depending on the host OS's POSIX locale, especially on macOS. Instead, use one of the following alternatives.

Solution 1: Use ICU (International Components for Unicode) in C/C++

The industry standard for Unicode collation is ICU. Given the same ICU version, it behaves identically across macOS, Linux, and Windows because it uses its own bundled collation data from CLDR (the Unicode Common Locale Data Repository) rather than the host's locale tables.

Here is how you can perform case-sensitive and locale-aware collation using ICU in C:

#include <unicode/ucol.h>
#include <unicode/ustring.h>
#include <stdio.h>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    UCollator *coll = ucol_open("en_US", &status);
    
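    if (U_FAILURE(status)) {
        /* u_errorName turns the ICU status code into a readable name. */
        fprintf(stderr, "Failed to open collator: %s\n", u_errorName(status));
        return 1;
    }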

    UChar str1[] = {0x61, 0}; // "a"
    UChar str2[] = {0x41, 0}; // "A"

    UCollationResult result = ucol_strcoll(coll, str1, -1, str2, -1);
    
    if (result == UCOL_LESS) {
        printf("'a' is less than 'A'\n");
    } else if (result == UCOL_GREATER) {
        printf("'a' is greater than 'A'\n");
    } else {
        printf("They are equal\n");
    }

    ucol_close(coll);
    return 0;
}

Solution 2: Use PyICU in Python

If you are developing in Python, avoid the standard locale module for critical collation. Install the PyICU package, which wraps the ICU library:

import icu

collator = icu.Collator.createInstance(icu.Locale("en_US"))

# Consistent sorting regardless of the host OS
words = ["apple", "Apple", "banana"]
sorted_words = sorted(words, key=collator.getCollationKey)
print(sorted_words) # Output: ['apple', 'Apple', 'banana']

Solution 3: Standardize the CI/CD Environment

If you absolutely must use system locales and cannot introduce third-party dependencies like ICU, ensure that your local development environment and your CI/CD runners are operating on the exact same minor version of macOS. However, keep in mind that this is a fragile workaround, as future macOS updates could easily alter collation weights again.
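
If you go this route, it can help to log what each machine actually does so that a mismatch is caught early. The sketch below is one possible approach, not an established tool: the function collation_fingerprint and its probe strings are assumptions made for illustration. It records the macOS version via platform.mac_ver (None on other systems) together with the order the system locale produces for a few probe strings.

```python
import locale
import platform


def collation_fingerprint(locale_name="en_US.UTF-8", probes=("a", "A", "b", "B")):
    """Summarize the host's collation behavior as a dict for CI logs.

    Hypothetical helper; "order" is None when the locale is not installed.
    """
    info = {"macos": platform.mac_ver()[0] or None, "locale": locale_name}
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        info["order"] = None
        return info
    try:
        info["order"] = sorted(probes, key=locale.strxfrm)
        info["case_distinguished"] = locale.strxfrm("a") != locale.strxfrm("A")
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting
    return info


if __name__ == "__main__":
    print(collation_fingerprint())
```

Printing this dictionary at the start of a test run on both the local machine and the CI runner makes it easy to compare the two and see whether they agree.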

Summary

Inconsistent string collation on macOS is caused by ongoing updates to Apple's BSD-derived libc locale data. Minor OS versions (such as macOS 15.5 vs 15.7) can apply completely different sorting rules and collation weights. To write portable, robust software, decouple your application from the operating system's locale system and adopt ICU for all Unicode-sensitive collation and sorting tasks.