During DebConf 2024 in Busan, the poor state of maintenance and performance of Lintian was mentioned (again).
I have been affected by long check times, especially as packages grow in size and Lintian only uses one core of my otherwise beefy machines. I took a very short look at how Lintian works under the hood, studying only the Python.pm module:
sub visit_installed_files {
    my ($self, $item) = @_;

    # .pyc/.pyo (compiled Python files)
    # skip any file installed inside a __pycache__ directory
    # - we have a separate check for that directory.
    $self->pointed_hint('package-installs-python-bytecode', $item->pointer)
      if $item->name =~ /\.py[co]$/
      && $item->name !~ m{/__pycache__/};

    # __pycache__ (directory for pyc/pyo files)
    $self->pointed_hint('package-installs-python-pycache-dir', $item->pointer)
      if $item->is_dir
      && $item->name =~ m{/__pycache__/};

    # ...
}
So far so good. A lot of regular expressions everywhere, and we know Perl is a nice language for writing regexes. But Perl cannot keep up as files pile up (not in this module in particular, but overall).
What would it take to do the same kind of checks in Rust, and in parallel?
I have created a small project named rust-benchmark-parallel-regex to test how fast it would go, and if regex matching could be scaled to multiple cores.
Here is the main result as a graph:
The good news: it scales pretty well in parallel! (16 cores)
Parallelism starts paying off at around 1,000 items, but the benchmarked function only matches two patterns. With a more complex function like the one in the Python.pm module, I expect the threshold to be even lower.
Note that we have to use a little trick and compile the regexes per thread, instead of compiling a regex only once and sharing it between threads. The reason is that when the only thing you do is regex matching, and the haystack is small (as is the case with file paths), the bottleneck becomes acquiring the shared regex from multiple threads.
This is well explained in the article titled Contention on multi-threaded regex matching (and its Reddit thread).
Another thing I did was to use the regex-automata crate, which allows building a single matcher from multiple patterns at once, instead of composing multiple regular regexes.
The code:

use std::sync::OnceLock;

use regex_automata::meta::Regex;

macro_rules! match_any_oncelock {
    ($e:expr, $($patterns:expr),*) => {{
        static REGEX: OnceLock<Regex> = OnceLock::new();
        REGEX
            .get_or_init(|| Regex::new_many(&[$($patterns),*]).unwrap())
            .is_match($e)
    }};
}

macro_rules! match_any {
    ($e:expr, $($patterns:expr),*) => {{
        thread_local!(
            static REGEX: Regex = Regex::new_many(&[$($patterns),*]).unwrap()
        );
        REGEX.with(|re| re.is_match($e))
    }};
}

fn is_match_oncelock(input: &str) -> bool {
    match_any_oncelock!(input, "^foo", "bar$")
}

fn is_match_thread_local(input: &str) -> bool {
    match_any!(input, "^foo", "bar$")
}
The use of macros reduces the verbosity of the code, to make it almost as concise as in Perl (but not write-only).
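To sketch how the parallel side could look, here is a minimal dependency-free version using scoped threads from the standard library (a real project would more likely use rayon); `is_match` is a hypothetical stand-in for the regex matching discussed above:

```rust
use std::thread;

// Hypothetical matcher: plain string checks stand in
// for the compiled regex patterns.
fn is_match(path: &str) -> bool {
    path.ends_with(".pyc") || path.ends_with(".pyo")
}

/// Split the paths into one chunk per thread and count
/// matches in parallel, summing the per-thread counts.
fn count_matches_parallel(paths: &[String], threads: usize) -> usize {
    let chunk = paths.len().div_ceil(threads).max(1);
    thread::scope(|s| {
        paths
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().filter(|p| is_match(p)).count()))
            .collect::<Vec<_>>() // spawn all threads before joining any
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum()
    })
}
```

Since file-path checks are independent of each other, this kind of embarrassingly parallel split is all the coordination the checks themselves need.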
This is far from a rewrite of Lintian in Rust, but it gives me hope. If other Rustaceans want to join, maybe we could create something? lintian-ng?