Laszlo - Regex Unleashed

Regex Unleashed

06/05/2022

8 minutes

In the series of these posts, I take a regular task, and implement it multiple times in an iteratively manner, while also improving performance.

The task is the following, given an input string with placeholders in it, and a map of placeholders to actual values. Replace the placeholders in the input string with the actual text values from the map and return the final string. For example, given input string Hello [%placeholder%]! and map [%placeholder%] => World should return Hello World!. The placeholder's [%placeholder%] part is referred as the key, and World part as the value in this post.

To further clarify the task, these constraints are also true:

placeholder keys are delimited with [% at the beginning and %] at the end
placeholders are recursive (a placeholder's value may contain another (or the same) placeholder's key, but it should be processed recursively)
input text contains no placeholder that are embedded by other placeholders (such as hello [%outer[%inner%]example%])
the key of a placeholder contain letters and underscores
the same fixed set of placeholders are used on multiple input strings
placeholders are used at most once per input text

To place the task in a real-life example, imagine a system handling order requests. For each request the system needs to create an e-mail based on a template, a pdf containing billing information, a pdf for shipping information, notification for end users etc. The same set of order parameters (placeholder values) need to be replaced in the template (input string) of the e-mail, pdf, notification etc.

In the series of these posts, I present multiple, iterative implementations. The first part of the series established a common set of unit tests to be passed by all discussed implementations. It also presented four solutions using regular expressions to achieve to goal of this task. The second part the series discussed non regular expression based solutions to further improve the CPU and memory utilization. In this final post of the series, I weaken the constraint about recursive processing of placeholder keys in placeholder values.

In this post the constraint of not processing recursive placeholders is inverted: the application should process recursive placeholders. If a placeholder is matched in the input string, and a corresponding placeholder also found in the placeholders' map, we replace the key in the input string with the value of the placeholder. However, this placeholder value might contain a key for another placeholder (or the same placeholder). The application shall keep replacing these placeholders as long as it does not conform an infinite loop.

For example the input string of Hello [%placeholder%] and a map of [%placeholder%] => World[%punctuation%], [%punctuation%] => !should return a string: Hello World!.

Notice, that a placeholder's value might contain its own placeholder key. Processing such a placeholder in a recursive manner would put the application on an infinite loop. In practice an unprepared application would run out of memory and crash. Hence processing such placeholders is limited to processing them once in this post. The exact behavior required is captured by the unit tests. Note, that the recursive implementations use the RecursiveInputOutputData() method to provide the list of test cases. Finally, notice that while there is a constraint that an input string contains placeholders at most once, the same is not true for placeholder values.

In this post two implementations are presented. Recursive 0 is an enhancement of Solution 0 detailed in the first post of the series, while Recursive 7 is an enhancement of Solution 7 discussed in the second post of the series.

Attached code samples may be too large for a blog post to present effectively, the actual code is linked from GitHub.

Recursive 0

Recursive 0 implementation is based on regular expressions. It consists of two methods, Substitute() method is part of the public API, which invokes the private SubstituteImpl() method with the input arguments and an empty HashSet<T>. SubstituteImpl() is a recursive method which does not have a clear stopping condition. It creates a new regular expression and uses the Matches() method to iterate over all matches. When a match is found (and not yet visited), string replacement is used to assemble a new string. However, to replace keys in placeholder values, it invokes SubstituteImpl() recursively passing the previously matched placeholder's value. To avoid infinite loops, it appends every placeholder key to the hashset, and removes them after the recursive operation. Removing is necessary to enable processing the same placeholder that appears later on again. This way the Hashset<T> only contains placeholder keys that are replaced in the current chain of recursions.

The benchmarks show that this implementation is slower to Solution 0 as well it uses more memory:

|     Method |     Mean |    Error |   StdDev |  Gen 0 | Allocated |
|----------- |---------:|---------:|---------:|-------:|----------:|
| Recursive0 | 20.87 us | 0.526 us | 1.475 us | 9.3689 |     29 KB |

Recursive 7

Recursive 7 implementation takes Solution 7 and it extends it with these methods:

ProcessInnerReplacements()
UpdateEntry()
ShouldUpdate()

All of these methods are invoked by the constructor, and their purpose is to unroll recursive placeholder values. The rest of the methods are the same as in Solution 7, which means the actual substitution method runs exactly as long as it runs for Solution 7. Unrolling the recursive placeholders happen once per a new instance of this class is created. This is benefit when multiple input strings are substituted with the same set of placeholders. On the other hand, the construction of the class costs more, as it has to unroll the placeholders and create the internal hashcode based dictionary. The source code attached is for demonstration purpose, in production one might want to handle the case, when the generated hashcodes collide. Let me discuss what each of these helper methods do:

ProcessInnerReplacements() iterates over the input placeholder map and creates a new map by unrolling each placeholder. It also recalculates a new total size of placeholder values, which is later used by the Substitute() method to allocate a large enough buffer for the result string.

UpdateEntry() unrolls each placeholder value. It is a recursive method, this time with a clear stopping condition. It uses the ShouldUpdate() helper method to determine if the remaining placeholder value still contains other placeholder keys. If it does, it recursively calls itself to update the matched placeholder and the remaining part of the placeholder value. A recursive implementation is used to simplify the maintenance of the visited collection, which helps to break the infinite loop of placeholder key-values.

ShouldUpdate() is a helper method which checks if a given placeholder value contains further placeholder keys. Returns true if the value contains another placeholder key, otherwise the method returns false. It also has 2 out parameters which returns the location of the match in the input span as well the substitute value.

Another edge case that a productionized version of this code might want to handle is where the allocated buffer for replacing placeholder values is too small. In such case the code should allocate a larger buffer to continue the execution.

The instantiation of this type takes significantly longer compared to Recursive 0's constructor, as well as uses more memory. Benchmarking only the Substitute() method would not reflect a true comparison of these implementations.

Substitute()

|        Method |       Mean |    Error |    StdDev |   Gen 0 | Allocated |
|-------------- |-----------:|---------:|----------:|--------:|----------:|
|    Recursive7 |   340.7 ns |  6.83 ns |   8.88 ns |  0.3543 |      1 KB |

Ctor

|        Method |     Mean |     Error |    StdDev |   Gen 0 | Allocated |
|-------------- |---------:|----------:|----------:|--------:|----------:|
| Recursive7Map | 4.259 us | 0.0834 us | 0.1547 us | 21.2746 |     66 KB |

Notice, that this solution has significant performance gain compared to Recursive 0 solution. Although the initial memory allocation in the constructor is higher, over a course of 3 invocations of the Substitute() method, it gets amortized. From the mean execution time's point of view using Recursive 7 has a clear advantage already after the initial Substitute() method call.

Conclusion

With this post I have covered two solutions for the above task with recursive placeholder values. Throughout this series of posts, the goal has been to gradually improve the performance both for the mean execution time and for the allocated memory as well.

The first part of this series has used regular expression to achieve the above task, while the second post has used a custom implementation to replace the placeholder keys.

In this part I have relaxed a constraint on placeholder values by allowing to contain placeholder keys. This post has presented two implementations, one based on regular expressions, and one based on custom placeholder matching. It has also compared these solutions from performance perspective and explained a reasoning on their key characteristics.

regex placeholder performance

Hello, I am Laszlo

.NET developer