Fossil SCM

fossil-scm / www / grep.md

Source Rendered

Blame History Raw 161 lines

1	`# Fossil grep vs POSIX grep`
2
3	As of Fossil 2.7, there is a `grep` command which acts roughly like
4	POSIX's `grep -E` over all historical versions of one or more managed files.
5	`This document explains the commonalities and divergences between [POSIX`
6	`grep`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html)
7	and Fossil `grep`.
8
9
10	`## Options`
11
12	Fossil `grep` implements about half of the options specified for
13	POSIX `grep`:
14
15	`\| Option \| Meaning`
16	`\|--------\|-------------------------------------------------------------`
17	\| `-c` \| report the count of matches rather than the matched text
18	\| `-i` \| ignore case in matches
19	\| `-l` \| list a checkin ID prefix for matching historical versions of the file
20	\| `-q` \| no output; return only a status code indicating the success of the match
21	\| `-s` \| suppress error output about missing files
22	\| `-v` \| invert the sense of the match
23
24	That leaves several divergences at the option level from POSIX `grep`:
25
26	* You cannot give more than one pattern, as with `grep -e` or
27	`grep -f`.
28
29	* There is no equivalent of `grep -F` to do literal fixed-string
30	`matches only.`
31
32	* There is no `-x` option for doing a whole-line match.
33
34	* `fossil grep -l` does not do precisely the same thing as POSIX
35	`grep -l`: it lists checkin ID prefixes, not file names.
36
37	`* Fossil always gives the line number in its output, which is to say`
38	that it acts like `grep -n`. There is no way to disable the line
39	number in `fossil grep` output.
40
41	`Patches to remove those limitations will be thoughtfully considered.`
42
43	Fossil `grep` doesn’t support any of the GNU and BSD `grep` extensions.
44	For instance, it doesn’t support the common `-R` extension to POSIX,
45	`which would presumably search a subtree of managed files. If Fossil does`
46	`one day get this feature, it would have a different option letter, since`
47	`-R` in Fossil has a different meaning, by convention. Until then, you
48	`can get the same effect on systems with a POSIX shell like so:`
49
50	`$ fossil grep COMMAND: $(fossil ls src)`
51
52	`If you run that in a check-out of the [Fossil self-hosting source`
53	`repository][fshsr], that returns the first line of the built-in`
54	`documentation for each Fossil command, across all historical versions.`
55
56	Fossil `grep` has extensions relative to these other `grep` standards,
57	such as `--verbose` to print each checkin ID considered, regardless of
58	`whether it matches. This one is noteworthy here because the behavior`
59	used to be under `-v` before we reassigned it to give POSIX-like `grep
60	-v` behavior.
61
62	`[fshsr]: ./selfhost.wiki`
63
64
65	`## Regular Expression Dialect`
66
67	`Fossil contains a built-in regular expression engine implementing a`
68	`subset of the [POSIX extended regular expression][ere] dialect:`
69
70	`[ere]: https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended`
71
72	`\| Atom \| Meaning`
73	`\|---------\|-------------------------------------------------------------`
74	\| `X*` \| zero or more occurrences of X
75	\| `X+` \| one or more occurrences of X
76	\| `X?` \| zero or one occurrences of X
77	\| `X{p,q}`\| between p and q occurrences of X, inclusive
78	\| `(X)` \| match X
79	`\| <tt>X\\|Y</tt>\| X or Y`
80	\| `^X` \| X occurring at the beginning of a line
81	\| `X$` \| X occurring at the end of a line
82	\| `.` \| Match any single character
83	\| `\c` \| Character `c` where `c` is one of <tt>{}()[]\\|\*+?.\\</tt>
84	\| `\c` \| C-language escapes for `c` in `afnrtv`. ex: `\t` or `\n`
85	\| `\uXXXX`\| Where XXXX is exactly 4 hex digits, Unicode value XXXX
86	\| `\xXX` \| Where XX is exactly 2 hex digits, Unicode value XX
87	\| `[abc]` \| Any single character from the set `abc`
88	\| `[^abc]`\| Any single character not in the set `abc`
89	\| `[a-z]` \| Any single character in the range `a-z`
90	\| `[^a-z]`\| Any single character not in the range `a-z`
91	\| `\b` \| Word boundary
92	\| `\w` \| Word character: `[A-Za-z0-9_]`
93	\| `\W` \| Non-word character: `[^A-Za-z0-9_]`
94	\| `\d` \| Digit: `[0-9]`
95	\| `\D` \| Non-digit: `[^0-9]`
96	\| `\s` \| Whitespace character: `[ \t\r\n\v\f]`
97	\| `\S` \| Non-whitespace character: `[^ \t\r\n\v\f]`
98
99	There are several restrictions in Fossil `grep` relative to a fully
100	`POSIX compatible regular expression engine. Among them are:`
101
102	`* There is currently no support for POSIX character classes such as`
103	`[:lower:]`.
104
105	* The values of `p` and `q` in the "`{p,q}`" syntax can be no greater
106	`than 999. This is because the NFA that is used for regular expression`
107	`matching is proportional in size to the largest p or q value, and hence`
108	`allowing arbitrarily large values could result in a DoS attack.`
109
110	* Fossil `grep` does not currently attempt to take your operating
111	`system's locale settings into account when doing this match. Since`
112	`Fossil has no way to mark a given file as having a particular`
113	encoding, Fossil `grep` assumes the input files are in UTF-8 format.
114
115	This means Fossil `grep` will not work correctly if the files in
116	`your repository are in an encoding that is not backwards-compatible`
117	`with ASCII, such as UTF-16. Partially compatible encodings such as`
118	ISO 8859 should work with Fossil `grep` as long as you stick to
119	`their ASCII-compatible subset.`
120
121	The Fossil `grep` language is not a strict subset of POSIX extended
122	`regular expressions. Some of the features documented above are`
123	well-understood extensions to it, such as the "word" features `\b`, `\w`
124	and `\W`.
125
126	Fossil `grep` is based on the Unicode engine from [SQLite's FTS5
127	`feature][fts5]. This means it does do things like Unicode-aware case`
128	`folding. Therefore, it is usually a user error to attempt to substitute`
129	`[a-z]` for a lack of the POSIX character class `[:lower:]` if you are
130	`grepping over pretty much any human written language other than English.`
131	Use `fossil grep -i` instead, which does Unicode case folding.
132
133	`[fts5]: https://www.sqlite.org/fts5.html`
134
135
136	`## Algorithm Details`
137
138	Fossil `grep` uses a [nondeterministic finite automaton][nfa] for
139	`matching, so the performance is bounded by *O(nm)* where *n* is`
140	`the size of the regular expression and *m* is the size of the input`
141	`string.`
142
143	`[nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton`
144
145	`In order to avoid [ReDoS attacks][redos], the Fossil regular expression`
146	`matcher was purposely written to avoid [implementation choices][rei]`
147	`which have the potential to require exponential evaluation time. This`
148	constrains the possible feature set we can support in the Fossil `grep`
149	`dialect. For instance, we are unlikely to ever add support for`
150	`[backtracking][bt].`
151
152	`[redos]: https://en.wikipedia.org/wiki/ReDoS`
153	`[rei]: https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times`
154	`[bt]: https://en.wikipedia.org/wiki/Backtracking`
155
156	The `X{p,q}` operator expands to `p` copies of `X` followed by `q-p`
157	copies of `X?` before RE evaluation. The *O(nm)* performance bound
158	`above remains true for this case, but realize that it applies to the RE`
159	`after this expansion, not to the form as given by the user. In other`
160	words, as `q-p` increases, so does the RE evaluation time.
161

Fossil SCM

Keyboard Shortcuts