Fossil SCM

fossil-scm / www / grep.md
1
# Fossil grep vs POSIX grep
2
3
As of Fossil 2.7, there is a `grep` command which acts roughly like
4
POSIX's `grep -E` over all historical versions of one or more managed files.
5
This document explains the commonalities and divergences between [POSIX
6
`grep`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html)
7
and Fossil `grep`.
8
9
10
## Options
11
12
Fossil `grep` implements about half of the options specified for
13
POSIX `grep`:
14
15
| Option | Meaning
16
|--------|-------------------------------------------------------------
17
| `-c` | report the count of matches rather than the matched text
18
| `-i` | ignore case in matches
19
| `-l` | list a checkin ID prefix for matching historical versions of the file
20
| `-q` | no output; return only a status code indicating the success of the match
21
| `-s` | suppress error output about missing files
22
| `-v` | invert the sense of the match
23
24
That leaves several divergences at the option level from POSIX `grep`:
25
26
* You cannot give more than one pattern, as with `grep -e` or
27
`grep -f`.
28
29
* There is no equivalent of `grep -F` to do literal fixed-string
30
matches only.
31
32
* There is no `-x` option for doing a whole-line match.
33
34
* `fossil grep -l` does not do precisely the same thing as POSIX
35
`grep -l`: it lists checkin ID prefixes, not file names.
36
37
* Fossil always gives the line number in its output, which is to say
38
that it acts like `grep -n`. There is no way to disable the line
39
number in `fossil grep` output.
40
41
Patches to remove those limitations will be thoughtfully considered.
42
43
Fossil `grep` doesn’t support any of the GNU and BSD `grep` extensions.
44
For instance, it doesn’t support the common `-R` extension to POSIX,
45
which would presumably search a subtree of managed files. If Fossil does
46
one day get this feature, it would have a different option letter, since
47
`-R` in Fossil has a different meaning, by convention. Until then, you
48
can get the same effect on systems with a POSIX shell like so:
49
50
$ fossil grep COMMAND: $(fossil ls src)
51
52
If you run that in a check-out of the [Fossil self-hosting source
53
repository][fshsr], that returns the first line of the built-in
54
documentation for each Fossil command, across all historical versions.
55
56
Fossil `grep` has extensions relative to these other `grep` standards,
57
such as `--verbose` to print each checkin ID considered, regardless of
58
whether it matches. This one is noteworthy here because the behavior
59
used to be under `-v` before we reassigned it to give POSIX-like `grep
60
-v` behavior.
61
62
[fshsr]: ./selfhost.wiki
63
64
65
## Regular Expression Dialect
66
67
Fossil contains a built-in regular expression engine implementing a
68
subset of the [POSIX extended regular expression][ere] dialect:
69
70
[ere]: https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended
71
72
| Atom | Meaning
73
|---------|-------------------------------------------------------------
74
| `X*` | zero or more occurrences of X
75
| `X+` | one or more occurrences of X
76
| `X?` | zero or one occurrences of X
77
| `X{p,q}`| between p and q occurrences of X, inclusive
78
| `(X)` | match X
79
| <tt>X\|Y</tt>| X or Y
80
| `^X` | X occurring at the beginning of a line
81
| `X$` | X occurring at the end of a line
82
| `.` | Match any single character
83
| `\c` | Character `c` where `c` is one of <tt>{}()[]\|\*+?.\\</tt>
84
| `\c` | C-language escapes for `c` in `afnrtv`. ex: `\t` or `\n`
85
| `\uXXXX`| Where XXXX is exactly 4 hex digits, Unicode value XXXX
86
| `\xXX` | Where XX is exactly 2 hex digits, Unicode value XX
87
| `[abc]` | Any single character from the set `abc`
88
| `[^abc]`| Any single character not in the set `abc`
89
| `[a-z]` | Any single character in the range `a-z`
90
| `[^a-z]`| Any single character not in the range `a-z`
91
| `\b` | Word boundary
92
| `\w` | Word character: `[A-Za-z0-9_]`
93
| `\W` | Non-word character: `[^A-Za-z0-9_]`
94
| `\d` | Digit: `[0-9]`
95
| `\D` | Non-digit: `[^0-9]`
96
| `\s` | Whitespace character: `[ \t\r\n\v\f]`
97
| `\S` | Non-whitespace character: `[^ \t\r\n\v\f]`
98
99
There are several restrictions in Fossil `grep` relative to a fully
100
POSIX compatible regular expression engine. Among them are:
101
102
* There is currently no support for POSIX character classes such as
103
`[:lower:]`.
104
105
* The values of `p` and `q` in the "`{p,q}`" syntax can be no greater
106
than 999. This is because the NFA that is used for regular expression
107
matching is proportional in size to the largest p or q value, and hence
108
allowing arbitrarily large values could result in a DoS attack.
109
110
* Fossil `grep` does not currently attempt to take your operating
111
system's locale settings into account when doing this match. Since
112
Fossil has no way to mark a given file as having a particular
113
encoding, Fossil `grep` assumes the input files are in UTF-8 format.
114
115
This means Fossil `grep` will not work correctly if the files in
116
your repository are in an encoding that is not backwards-compatible
117
with ASCII, such as UTF-16. Partially compatible encodings such as
118
ISO 8859 should work with Fossil `grep` as long as you stick to
119
their ASCII-compatible subset.
120
121
The Fossil `grep` language is not a strict subset of POSIX extended
122
regular expressions. Some of the features documented above are
123
well-understood extensions to it, such as the "word" features `\b`, `\w`
124
and `\W`.
125
126
Fossil `grep` is based on the Unicode engine from [SQLite's FTS5
127
feature][fts5]. This means it does do things like Unicode-aware case
128
folding. Therefore, it is usually a user error to attempt to substitute
129
`[a-z]` for a lack of the POSIX character class `[:lower:]` if you are
130
grepping over pretty much any human written language other than English.
131
Use `fossil grep -i` instead, which does Unicode case folding.
132
133
[fts5]: https://www.sqlite.org/fts5.html
134
135
136
## Algorithm Details
137
138
Fossil `grep` uses a [nondeterministic finite automaton][nfa] for
139
matching, so the performance is bounded by ***O(nm)*** where ***n*** is
140
the size of the regular expression and ***m*** is the size of the input
141
string.
142
143
[nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
144
145
In order to avoid [ReDoS attacks][redos], the Fossil regular expression
146
matcher was purposely written to avoid [implementation choices][rei]
147
which have the potential to require exponential evaluation time. This
148
constrains the possible feature set we can support in the Fossil `grep`
149
dialect. For instance, we are unlikely to ever add support for
150
[backtracking][bt].
151
152
[redos]: https://en.wikipedia.org/wiki/ReDoS
153
[rei]: https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
154
[bt]: https://en.wikipedia.org/wiki/Backtracking
155
156
The `X{p,q}` operator expands to `p` copies of `X` followed by `q-p`
157
copies of `X?` before RE evaluation. The ***O(nm)*** performance bound
158
above remains true for this case, but realize that it applies to the RE
159
*after* this expansion, not to the form as given by the user. In other
160
words, as `q-p` increases, so does the RE evaluation time.
161

Keyboard Shortcuts

Open search /
Next entry (timeline) j
Previous entry (timeline) k
Open focused entry Enter
Show this help ?
Toggle theme Top nav button