Palindrome as an Interview Question

Palindrome often comes up as an interview question. In this post I will investigate 3 implementations and their relative performance.

What is a palindrome? Wikipedia says A palindrome is a word, number, phrase, or other sequence of characters which reads the same backward as forward, such as madam or racecar.

It is quite common to ask an interview candidate to implement a function to check if a given string is a palindrome. In this blog post I will show 4 different implementations. While these implementations might not handle all use cases identically, they satisfy a set of testcases to make sure they all work correctly based on a common interpretation.

Simple

One typical implementation is to iterate over the half of the input string and check if the character at index i equals the character at index n - i - 1, where 'n' denotes the length of the input string.

Here is a reference implementation:

public static bool IsPalindromeSimple(string input)
{
    if (input is null)
        return true;
    int i = 0;
    while (i < input.Length / 2)
    {
        if (input[i] != input[input.Length - i - 1])
            return false;
        i++;
    }
    return true;
}

In case the length of the input string is an even number, all characters are checked. In case of odd number, the middle character is not checked, but as it is a single character left from both ends, it reads the same character from backwards and forwards.

Is it correct? Mostly: one can argue that a null input is not a palindrome, hence an exception should be thrown. A rather different question could be: what characters are allowed in the input string? In this case the implementation assumes that the input string is Unicode UTF-16 encoded. It will not handle strings with smileys correctly, one would need to use Rune to handle those.

Spans

While the previous implementation works correctly on my test inputs, it requires a great care from the developer. There is quite a lot of indexing into arrays involved as well division by integer, which both offers plenty of opportunities to introduce off-by-one errors. In C# a more modern implementation could be using Span<T> instead:

public static bool IsPalindromeSpan(ReadOnlySpan<char> input)
{
    while (input.Length > 1)
    {
        if (input[0] != input[^1])
            return false;
        input = input.Slice(1, input.Length - 2);
    }
    return true;
}

Besides the implementation being shorter and easier to read, it has less opportunities for off-by-one errors. Note, that it is always comparing the first and the last character of the available span of chars. When comparison fails it returns false, otherwise it slices the span further by removing the first and last characters, until only a single character is left.

Such an implementation on a coding interview proves that the candidate is up-to-date with latest C# features. It is also less text to type, inherently handles null inputs by using Span<T> as the input parameter.

List Pattern

C# 11 introduced a new pattern for handling lists. It is called pattern matching. It is a new feature of C# that allows to match a list of values against a pattern.

public static bool IsPalindromeListPattern(ReadOnlySpan<char> input) => input switch
{
    [] => true,
    [var _] => true,
    [var first, .. var remaining, var last] => first == last && IsPalindromeListPattern(remaining),
};

The code above matches the empty input and the single character input with true, while the general case matches the equality of the first and the last character and recursively calls itself on the remaining characters.

The performance of list pattern matching is worse than the one with Span, as well as this is a preview feature at the time of writing this post, hence this solution is excluded from the benchmarks.

Vectorization

One might notice, that palindrome task lends itself for vectorization using single instruction, multiple data: SIMD. Modern CPUs includes SIMD instructions to improve performance. While the code presented below works well on the CPU of the machines I use, an absolutely correct implementation should check the available size of the SIMD registers. Then it could load a corresponding implementation, for example by using branching by abstraction.

public static readonly Vector256<byte> shuffle =
        new Vector<byte>(new byte[] { 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1,
                                      14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 }).AsVector256();

public static bool IsPalindromeVectorized(ReadOnlySpan<char> input)
{
    var data = MemoryMarshal.AsBytes(input);
    while (data.Length > 32)
    {
        Vector256<byte> vBegin = new Vector<byte>(data).AsVector256();
        Vector256<byte> vEnd = new Vector<byte>(data.Slice(data.Length - 32)).AsVector256();
        vEnd = Avx2.Shuffle(vEnd, shuffle);
        vEnd = Avx2.Permute2x128(vEnd, vEnd, 1);
        vEnd = Avx2.CompareEqual(vBegin, vEnd);
        if (!Avx2.TestC(vEnd, Vector256<byte>.AllBitsSet))
            return false;

        var remaining = data.Length - 64;
        data = data.Slice(32, remaining < 0 ? 0 : remaining);
    }

    while (data.Length > 2)
    {
        if (data[1] != data[data.Length - 1] || data[0] != data[data.Length - 2])
            return false;
        data = data.Slice(2, data.Length - 4);
    }
    return true;
}

The idea of the code above is that it converts the byte representation of the character sequence, so that it is comparable at the beginning and at the end of the input string.The code above uses two vectors that are 256 bit length (32 bytes). The first vector is loaded from the beginning of the input string, while the second vector is loaded from the end of it. A char represents 2 bytes, for example 'a' == 97 represented as [0x00, 0x61] in bytes. In case of an input string 'abba', it is represented as [0x00, 0x61, 0x00, 0x62, 0x00, 0x62, 0x00, 0x61] the code will manipulate the vectors in such a way that a [0x00, 0x61, 0x00, 0x62] == [0x00, 0x61, 0x00, 0x62] comparison can be made. To achieve this, shuffles the bytes in the second vector to invert the order of each pair of bytes. After shuffling Permute2x128 method is used to switch the bytes in the high/low lane. This is required because shuffle works only on a single lane. Next the two bytes vectors are compared, and branches to either return false or take the next slice of the remaining data. Finally, another while loop is used to handle spans that are less than the size of the vector.

As the vectorized loop can compare 32 bytes at a time, this solution compares 16 characters.

Note, that this solution does not handle UTF-8 encoded string correctly. One should use Rune to handle them.

Vectorization with ASCII

An interview candidate might ask if we only need to handle ASCII characters. For now, let's assume we may only receive ASCII input. In this case, every other byte of the char will be zero. The implementation below builds upon this realization using vectorization. It reads 64 bytes into 2-2 vectors at the beginning and at the end of the input string. Then it blends the two vectors in a way that the 'unused' 0 bytes are overwritten by the values of the second vector. The rest of this implementation completes the same way as the previous vectorization method.

public static readonly Vector256<byte> shuffle =
    new Vector<byte>(new byte[] { 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1,
                                      14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 }).AsVector256();

public static readonly Vector256<byte> blendingMask =
    new Vector<byte>(new byte[] { 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0,
                                  255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0 }).AsVector256();

public static bool IsPalindromeVectorizedAscii(ReadOnlySpan<char> input)
{
    var data = MemoryMarshal.AsBytes(input);
    while (data.Length > 64)
    {
        Vector256<byte> vBegin = new Vector<byte>(data).AsVector256();
        Vector256<byte> vBegin2 = new Vector<byte>(data.Slice(32)).AsVector256();
        vBegin2 = Avx2.ShiftLeftLogical128BitLane(vBegin2, 1);
        vBegin = Avx2.BlendVariable(vBegin2, vBegin, blendingMask);

        Vector256<byte> vEnd = new Vector<byte>(data.Slice(data.Length - 32)).AsVector256();
        Vector256<byte> vEnd2 = new Vector<byte>(data.Slice(data.Length - 64)).AsVector256();
        vEnd2 = Avx2.ShiftLeftLogical128BitLane(vEnd2, 1);
        vEnd = Avx2.BlendVariable(vEnd2, vEnd, blendingMask);

        vEnd = Avx2.Shuffle(vEnd, shuffle);
        vEnd = Avx2.Permute2x128(vEnd, vEnd, 1);
        vEnd = Avx2.CompareEqual(vBegin, vEnd);
        if (!Avx2.TestC(vEnd, Vector256<byte>.AllBitsSet))
            return false;

        var remaining = data.Length - 128;
        data = data.Slice(64, remaining < 0 ? 0 : remaining);
    }

    while (data.Length > 2)
    {
        if (data[0] != data[data.Length - 2])
            return false;
        data = data.Slice(2, data.Length - 4);
    }
    return true;
}

Conclusion and Benchmarks

Does it worth the extra work for the vectorization? It depends on the input string. I benchmarked all implementations with palindrome input strings. The input strings vary by their length from 3 chars to 2048 chars.

Results with shorter input strings:

|          Method |             TestData |      Mean |     Error |    StdDev |
|---------------- |--------------------- |----------:|----------:|----------:|
|          Simple | aabbc(...)cbbaa [32] | 13.928 ns | 0.1216 ns | 0.1015 ns |
|            Span | aabbc(...)cbbaa [32] | 25.476 ns | 0.5358 ns | 0.7685 ns |
|      Vectorized | aabbc(...)cbbaa [32] |  5.707 ns | 0.1386 ns | 0.1361 ns |
| VectorizedAscii | aabbc(...)cbbaa [32] | 50.392 ns | 0.8853 ns | 0.9472 ns |
|          Simple |                  aba |  1.102 ns | 0.0253 ns | 0.0249 ns |
|            Span |                  aba |  2.272 ns | 0.0601 ns | 0.0532 ns |
|      Vectorized |                  aba |  4.964 ns | 0.0798 ns | 0.0746 ns |
| VectorizedAscii |                  aba |  4.435 ns | 0.0719 ns | 0.1098 ns |
|          Simple |             abcddcba |  3.935 ns | 0.0318 ns | 0.0297 ns |
|            Span |             abcddcba |  7.644 ns | 0.1152 ns | 0.1021 ns |
|      Vectorized |             abcddcba | 21.973 ns | 0.3228 ns | 0.3842 ns |
| VectorizedAscii |             abcddcba | 16.487 ns | 0.3619 ns | 0.4023 ns |

Results with longer input strings:

|          Method |             TestData |        Mean |     Error |    StdDev |
|---------------- |--------------------- |------------:|----------:|----------:|
|          Simple |  aabb(...)bbaa [128] |    62.84 ns |  0.535 ns |  0.500 ns |
|            Span |  aabb(...)bbaa [128] |   106.89 ns |  1.736 ns |  1.624 ns |
|      Vectorized |  aabb(...)bbaa [128] |    17.16 ns |  0.278 ns |  0.247 ns |
| VectorizedAscii |  aabb(...)bbaa [128] |    16.49 ns |  0.342 ns |  0.433 ns |
|          Simple | aabb(...)bbaa [2048] |   867.16 ns |  2.934 ns |  2.450 ns |
|            Span | aabb(...)bbaa [2048] | 1,423.26 ns | 27.409 ns | 32.629 ns |
|      Vectorized | aabb(...)bbaa [2048] |   240.88 ns |  4.726 ns |  6.145 ns |
| VectorizedAscii | aabb(...)bbaa [2048] |   189.94 ns |  3.672 ns |  3.435 ns |

First observation to make, is that the span based implementation is slower compared to the Simple one. To understand the reasons behind this, one would need to check the JITTED code. One reason could be related to bound checking. Faster Bound Checks blog post should give an idea on how to investigate further. The second observation is that the vectorized solutions are slower on small input strings but gains significant performance gains on larger input strings. To get an optimal solution, one could replace the second loop of the vectorized implementation a loop presented in the Simple solution, but I leave this exercises for the reader.