Parsing bash/shell
I have been avoiding #629247 for quite a while. Not because I think we couldn't use a better shell parser, but because I dreaded having to write the parser. Of course, #629247 blocks about 16 bugs and that number will only increase, so "someone" has to solve it eventually... Unfortunately, that "someone" is likely to be "me". So...
I managed to scribble down the following Perl snippet. It does a decent job of splitting lines into "words" (which may or may not contain spaces, newlines, quotes etc.). It currently tokenizes "<<EOF" constructs (heredocs?) like any other input, which it should not. Also, it does not allow one to distinguish between "EOF" and " EOF" (the former ends the heredoc, the latter doesn't).
Another defect is that it does not tokenize all operators (like ">&"). Probably all I need is a list of them and all the "special cases" (Example: ">&" can optionally take numbers on both sides, like ">&2" or "2>&1").
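As a rough idea of what such a list could look like, here is a small sketch that is separate from the tokenizer itself. It assumes the POSIX sh redirection operators only (bash extras such as "&>" or "<<<" are left out) and that the token has already been isolated, so it starts with the optional file descriptor. The name parse_redirect is hypothetical.

```perl
use strict;
use warnings;

# Assumption: only the POSIX sh redirection operators are covered.
my $REDIRECT = qr/
    \A
    (\d*)                          # optional fd number on the left
    (<<-|<<|>>|<&|>&|<>|>\||<|>)   # operator; longer alternatives first
    (.*)                           # whatever follows: fd number, "-" or a word
    \z
/xs;

# Hypothetical helper: split a token such as "2>&1" into
# (left fd, operator, right-hand side).  Returns an empty list
# if the token does not start with a redirection.
sub parse_redirect {
    my ($token) = @_;
    return unless $token =~ $REDIRECT;
    return ($1, $2, $3);
}

for my $token ('2>&1', '>&2', '>>log', 'echo') {
    if (my ($fd, $op, $rest) = parse_redirect($token)) {
        print "$token: fd='$fd' op='$op' rest='$rest'\n";
    } else {
        print "$token: not a redirection\n";
    }
}
```

Listing the longer operators first matters: otherwise "<<-" would be matched as "<" followed by "<-".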
It does not always appear to terminate (I think EOF + unclosed quote triggers this). If you try it out and notice something funny, please let me know.
You can also find an older version of it in bug #629247, along with the output it produced at that time (that version used " instead of - as the token marker).
#!/usr/bin/perl

use strict;
use warnings;

use Text::ParseWords qw(quotewords);

my $opregex;
{
    my $tmp = join( "|", map { quotemeta $_ } qw (&& || | ; ));
    # Match & but not >& or <&
    # - Actually, it should eventually match those, but not right now.
    $tmp .= '|(?<![\>\<])\&';
    $opregex = qr/$tmp/ox;
}

my @tokens = ();
my $lno;
while (my $line = <>) {
    chomp $line;
    next if $line =~ m/^\s*(?:\#|$)/;
    $lno = $. unless defined $lno;
    while ($line =~ s,\\$,,) {
        $line .= "\n" . <>;
        chomp $line;
    }
    $line =~ s/^\s++//;
    $line =~ s/\s++$//;
    # Ignore empty lines (again, via "$empty \ $empty"-constructs)
    next if $line =~ m/^\s*(?:\#|$)/;
    my @it = quotewords ($opregex, 'delimiters', $line);
    if (!@it) {
        # This happens if the line has unbalanced quotes, so pop another
        # line and redo the loop.
        $line .= "\n" . <>;
        redo;
    }
    foreach my $orig (@it) {
        my @l;
        $orig =~ s,",\\\\",g;
        @l = quotewords (qr/\s++/, 1, $orig);
        pop @l unless defined $l[-1] && $l[-1] ne '';
        shift @l if $l[0] eq '';
        push @tokens, map { s,\\\\",",g; $_ } @l;
    }
    print "Line $lno: -" . join ("- -", map { s/\n/\\n/g; $_ } @tokens ) . "-\n";
    @tokens = ();
    $lno = undef;
}
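Both `$line .= "\n" . <>` statements in the script above append without checking whether `<>` actually returned a line. At end of input an unclosed quote can then never be completed, which may well be the cause of the hang mentioned earlier. Below is a minimal sketch of a guard. The helper read_logical_line is hypothetical, it takes an explicit filehandle rather than using `<>`, and it ignores backslash continuations to keep things short.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper (not part of the script above): read one "logical"
# line from $fh, pulling in further physical lines while the quotes are
# unbalanced.  Returns undef at end of input and dies if the input ends
# inside an open quote, instead of retrying forever.
sub read_logical_line {
    my ($fh) = @_;
    my $line = <$fh>;
    return undef unless defined $line;
    chomp $line;
    while (1) {
        # quotewords returns an empty list for unbalanced quotes; an
        # empty line also yields an empty list, so accept that case.
        my @words = quotewords(qr/\s+/, 1, $line);
        last if @words || $line eq '';
        my $next = <$fh>;
        die "unterminated quote at end of input\n" unless defined $next;
        chomp $next;
        $line .= "\n" . $next;
    }
    return $line;
}

# Usage: read from an in-memory "file" whose last line has an open quote.
my $input = "echo \"a\nb\" c\nls\necho \"open\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";
while (defined(my $line = eval { read_logical_line($fh) })) {
    (my $shown = $line) =~ s/\n/\\n/g;
    print "logical line: $shown\n";
}
print "stopped: $@" if $@;
```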
Here is a little example script and the "tokenization" of that script (no, the example script is not supposed to be useful).
$ cat test
#!/bin/sh
for p in *; do
if [ -d "$p" ];then continue;elif
[ -f "$p" ]
then echo "$p is a file";fi
done
$ ./test.pl test
Line 3: -for- -p- -in- -*- -;- -do-
Line 4: -if- -[- --d- -"$p"- -]- -;- -then- -continue- -;- -elif-
Line 5: -[- --f- -"$p"- -]-
Line 6: -then- -echo- -"$p is a file"- -;- -fi-
Line 7: -done-
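For the heredoc problem mentioned at the top, one possible approach is to stop tokenizing once a "<<WORD" operator has been seen and to collect raw lines until one is exactly equal to WORD. The sketch below assumes the operator and the unquoted delimiter have already been extracted. The helper read_heredoc_body is hypothetical and not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the body of a heredoc.
#   $fh   - filehandle positioned on the first line after the command
#   $op   - "<<" or "<<-"
#   $word - the (unquoted) delimiter, e.g. "EOF"
# Returns a reference to the list of body lines.
sub read_heredoc_body {
    my ($fh, $op, $word) = @_;
    my @body;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        # "<<-" strips leading tabs (not spaces) from every line,
        # including the line holding the delimiter.
        $line =~ s/\A\t+// if $op eq '<<-';
        # Exact comparison: "EOF" ends the heredoc, " EOF" does not.
        return \@body if $line eq $word;
        push @body, $line;
    }
    die "heredoc delimiter '$word' not found\n";
}

# Usage: " EOF" (with a leading space) is part of the body.
my $input = " EOF\nhello\nEOF\necho after\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";
my $body = read_heredoc_body($fh, '<<', 'EOF');
print "body line: '$_'\n" for @$body;
```

Comparing whole lines with `eq` rather than a regex is what keeps "EOF" and " EOF" apart.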
