|
1
|
# Fossil grep vs POSIX grep |
|
2
|
|
|
3
|
As of Fossil 2.7, there is a `grep` command which acts roughly like |
|
4
|
POSIX's `grep -E` over all historical versions of one or more managed files. |
|
5
|
This document explains the commonalities and divergences between [POSIX |
|
6
|
`grep`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html) |
|
7
|
and Fossil `grep`. |
|
8
|
|
|
9
|
|
|
10
|
## Options |
|
11
|
|
|
12
|
Fossil `grep` implements about half of the options specified for |
|
13
|
POSIX `grep`: |
|
14
|
|
|
15
|
| Option | Meaning |
|
16
|
|--------|------------------------------------------------------------- |
|
17
|
| `-c` | report the count of matches rather than the matched text |
|
18
|
| `-i` | ignore case in matches |
|
19
|
| `-l` | list a checkin ID prefix for matching historical versions of the file |
|
20
|
| `-q` | no output; return only a status code indicating the success of the match |
|
21
|
| `-s` | suppress error output about missing files |
|
22
|
| `-v` | invert the sense of the match |
|
23
|
|
|
24
|
That leaves several divergences at the option level from POSIX `grep`: |
|
25
|
|
|
26
|
* You cannot give more than one pattern, as with `grep -e` or |
|
27
|
`grep -f`. |
|
28
|
|
|
29
|
* There is no equivalent of `grep -F` to do literal fixed-string |
|
30
|
matches only. |
|
31
|
|
|
32
|
* There is no `-x` option for doing a whole-line match. |
|
33
|
|
|
34
|
* `fossil grep -l` does not do precisely the same thing as POSIX |
|
35
|
`grep -l`: it lists checkin ID prefixes, not file names. |
|
36
|
|
|
37
|
* Fossil always gives the line number in its output, which is to say |
|
38
|
that it acts like `grep -n`. There is no way to disable the line |
|
39
|
number in `fossil grep` output. |
|
40
|
|
|
41
|
Patches to remove those limitations will be thoughtfully considered. |
|
42
|
|
|
43
|
Fossil `grep` doesn’t support any of the GNU and BSD `grep` extensions. |
|
44
|
For instance, it doesn’t support the common `-R` extension to POSIX, |
|
45
|
which would presumably search a subtree of managed files. If Fossil does |
|
46
|
one day get this feature, it would have a different option letter, since |
|
47
|
`-R` in Fossil has a different meaning, by convention. Until then, you |
|
48
|
can get the same effect on systems with a POSIX shell like so: |
|
49
|
|
|
50
|
$ fossil grep COMMAND: $(fossil ls src) |
|
51
|
|
|
52
|
If you run that in a check-out of the [Fossil self-hosting source |
|
53
|
repository][fshsr], that returns the first line of the built-in |
|
54
|
documentation for each Fossil command, across all historical versions. |
|
55
|
|
|
56
|
Fossil `grep` has extensions relative to these other `grep` standards, |
|
57
|
such as `--verbose` to print each checkin ID considered, regardless of |
|
58
|
whether it matches. This one is noteworthy here because the behavior |
|
59
|
used to be under `-v` before we reassigned it to give POSIX-like `grep |
|
60
|
-v` behavior. |
|
61
|
|
|
62
|
[fshsr]: ./selfhost.wiki |
|
63
|
|
|
64
|
|
|
65
|
## Regular Expression Dialect |
|
66
|
|
|
67
|
Fossil contains a built-in regular expression engine implementing a |
|
68
|
subset of the [POSIX extended regular expression][ere] dialect: |
|
69
|
|
|
70
|
[ere]: https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended |
|
71
|
|
|
72
|
| Atom | Meaning |
|
73
|
|---------|------------------------------------------------------------- |
|
74
|
| `X*` | zero or more occurrences of X |
|
75
|
| `X+` | one or more occurrences of X |
|
76
|
| `X?` | zero or one occurrences of X |
|
77
|
| `X{p,q}`| between p and q occurrences of X, inclusive |
|
78
|
| `(X)` | match X |
|
79
|
| <tt>X\|Y</tt>| X or Y |
|
80
|
| `^X` | X occurring at the beginning of a line |
|
81
|
| `X$` | X occurring at the end of a line |
|
82
|
| `.` | Match any single character |
|
83
|
| `\c` | Character `c` where `c` is one of <tt>{}()[]\|\*+?.\\</tt> |
|
84
|
| `\c` | C-language escapes for `c` in `afnrtv`. ex: `\t` or `\n` |
|
85
|
| `\uXXXX`| Where XXXX is exactly 4 hex digits, Unicode value XXXX |
|
86
|
| `\xXX` | Where XX is exactly 2 hex digits, Unicode value XX |
|
87
|
| `[abc]` | Any single character from the set `abc` |
|
88
|
| `[^abc]`| Any single character not in the set `abc` |
|
89
|
| `[a-z]` | Any single character in the range `a-z` |
|
90
|
| `[^a-z]`| Any single character not in the range `a-z` |
|
91
|
| `\b` | Word boundary |
|
92
|
| `\w` | Word character: `[A-Za-z0-9_]` |
|
93
|
| `\W` | Non-word character: `[^A-Za-z0-9_]` |
|
94
|
| `\d` | Digit: `[0-9]` |
|
95
|
| `\D` | Non-digit: `[^0-9]` |
|
96
|
| `\s` | Whitespace character: `[ \t\r\n\v\f]` |
|
97
|
| `\S` | Non-whitespace character: `[^ \t\r\n\v\f]` |
|
98
|
|
|
99
|
There are several restrictions in Fossil `grep` relative to a fully |
|
100
|
POSIX compatible regular expression engine. Among them are: |
|
101
|
|
|
102
|
* There is currently no support for POSIX character classes such as |
|
103
|
`[:lower:]`. |
|
104
|
|
|
105
|
* The values of `p` and `q` in the "`{p,q}`" syntax can be no greater |
|
106
|
than 999. This is because the NFA that is used for regular expression |
|
107
|
matching is proportional in size to the largest p or q value, and hence |
|
108
|
allowing arbitrarily large values could result in a DoS attack. |
|
109
|
|
|
110
|
* Fossil `grep` does not currently attempt to take your operating |
|
111
|
system's locale settings into account when doing this match. Since |
|
112
|
Fossil has no way to mark a given file as having a particular |
|
113
|
encoding, Fossil `grep` assumes the input files are in UTF-8 format. |
|
114
|
|
|
115
|
This means Fossil `grep` will not work correctly if the files in |
|
116
|
your repository are in an encoding that is not backwards-compatible |
|
117
|
with ASCII, such as UTF-16. Partially compatible encodings such as |
|
118
|
ISO 8859 should work with Fossil `grep` as long as you stick to |
|
119
|
their ASCII-compatible subset. |
|
120
|
|
|
121
|
The Fossil `grep` language is not a strict subset of POSIX extended |
|
122
|
regular expressions. Some of the features documented above are |
|
123
|
well-understood extensions to it, such as the "word" features `\b`, `\w` |
|
124
|
and `\W`. |
|
125
|
|
|
126
|
Fossil `grep` is based on the Unicode engine from [SQLite's FTS5 |
|
127
|
feature][fts5]. This means it does do things like Unicode-aware case |
|
128
|
folding. Therefore, it is usually a user error to attempt to substitute |
|
129
|
`[a-z]` for a lack of the POSIX character class `[:lower:]` if you are |
|
130
|
grepping over pretty much any human written language other than English. |
|
131
|
Use `fossil grep -i` instead, which does Unicode case folding. |
|
132
|
|
|
133
|
[fts5]: https://www.sqlite.org/fts5.html |
|
134
|
|
|
135
|
|
|
136
|
## Algorithm Details |
|
137
|
|
|
138
|
Fossil `grep` uses a [nondeterministic finite automaton][nfa] for |
|
139
|
matching, so the performance is bounded by ***O(nm)*** where ***n*** is |
|
140
|
the size of the regular expression and ***m*** is the size of the input |
|
141
|
string. |
|
142
|
|
|
143
|
[nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton |
|
144
|
|
|
145
|
In order to avoid [ReDoS attacks][redos], the Fossil regular expression |
|
146
|
matcher was purposely written to avoid [implementation choices][rei] |
|
147
|
which have the potential to require exponential evaluation time. This |
|
148
|
constrains the possible feature set we can support in the Fossil `grep` |
|
149
|
dialect. For instance, we are unlikely to ever add support for |
|
150
|
[backtracking][bt]. |
|
151
|
|
|
152
|
[redos]: https://en.wikipedia.org/wiki/ReDoS |
|
153
|
[rei]: https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times |
|
154
|
[bt]: https://en.wikipedia.org/wiki/Backtracking |
|
155
|
|
|
156
|
The `X{p,q}` operator expands to `p` copies of `X` followed by `q-p` |
|
157
|
copies of `X?` before RE evaluation. The ***O(nm)*** performance bound |
|
158
|
above remains true for this case, but realize that it applies to the RE |
|
159
|
*after* this expansion, not to the form as given by the user. In other |
|
160
|
words, as `q-p` increases, so does the RE evaluation time. |
|
161
|
|