<title>Defense Against Robots</title>

A typical Fossil website can have billions and billions of pages,
and many of those pages (for example diffs and annotations and tarballs)
can be expensive to compute.
If a robot walks a Fossil-generated website,
it can present a crippling bandwidth and CPU load.
A "robots.txt" file can help, but in practice, most robots these
days ignore the robots.txt file, so it won't help much.

A Fossil website is intended to be used
interactively by humans, not walked by robots. This article
describes the techniques used by Fossil to try to welcome human
users while keeping out robots.

<h2>Defenses Are Enabled By Default</h2>

In the latest implementations of Fossil, most robot defenses are
enabled by default. You can probably get by with standing up a
public-facing Fossil instance in the default configuration. But
you can also customize the defenses to serve your particular needs.

<h2>Customizing Anti-Robot Defenses</h2>

Admin users can configure robot defenses on the
"Robot Defense Settings" page (/setup_robot).
That page is accessible (to Admin users) from the default menu bar
by clicking on the "Admin" menu choice, then selecting the
"Robot-Defense" link from the list.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user". For random passers-by on the internet
(and for robots) that user is "nobody". The "anonymous" user is also
available for humans who do not wish to identify themselves. The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against robots, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks. A robot
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull zip
archives, etc. by visiting links, because there are no links.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

But requiring a login, even an anonymous login, can be annoying.
Fossil provides other techniques for blocking robots which
are less cumbersome to humans.

<h2>Automatic Hyperlinks Based on UserAgent and Javascript</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run Javascript.

The UserAgent string is a text identifier that is included in the header
of most HTTP requests that identifies the specific maker and version of
the browser (or robot) that generated the request. Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT. The third
example is the robot used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the last two identify the requester as a robot.
Note that the UserAgent string is completely under the control
of the requester and so a malicious robot can forge a UserAgent
string that makes it look like a human. But most robots want
to "play nicely" on the internet and are quite open
about the fact that they are a robot. And so the UserAgent string
provides a good first-guess about whether or not a request originates
from a human or a robot.
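
This kind of first guess can be sketched with a simple substring check. The function and marker list below are invented for illustration and are not Fossil's actual classifier:

```javascript
// Hypothetical sketch of a UserAgent "first guess" -- not Fossil's
// actual logic.  Requests whose UserAgent contains a well-known robot
// marker are treated as robots; everything else is presumed to come
// from a human-operated browser.
const ROBOT_MARKERS = ["bot", "crawler", "spider", "wget", "curl"];

function looksLikeRobot(userAgent) {
  const ua = String(userAgent || "").toLowerCase();
  return ROBOT_MARKERS.some((marker) => ua.includes(marker));
}

console.log(looksLikeRobot("Wget/1.12 (openbsd4.9)"));                    // true
console.log(looksLikeRobot(
  "Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0"));  // false
```

As the text above notes, such a check is only a first guess: a malicious robot can defeat it simply by sending a browser-like UserAgent string.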

The [/help/auto-hyperlink|auto-hyperlink] setting, shown as
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>" on
the Robot Defense Settings page,
can be set to "UserAgent only" or "UserAgent and Javascript" or "off".
If the UserAgent string looks like a human and not a robot, then
Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability
is omitted from the user permissions. This setting gives humans easy
access to the hyperlinks while preventing robots
from walking the billions of pages on a typical Fossil site.

If the setting is "UserAgent only" (2), then the hyperlinks are simply
enabled and that is all. But if the setting is "UserAgent and Javascript" (1),
then the hyperlinks are not enabled directly.
Instead, the HTML code that is generated contains anchor tags ("&lt;a&gt;")
with "href=" attributes that point to [/honeypot] rather than the correct
link. JavaScript code is added to the end of the page that goes back and
fills in the correct "href=" attributes of
the anchor tags with the true hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against robots that forge a human-looking
UserAgent string. Most robots do not bother to run JavaScript and
so to the robot the decoy anchor tag will be useless. But all modern
web browsers implement JavaScript, so hyperlinks will show up
normally for human users.
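
The honeypot technique can be sketched as follows. This is an illustration only: the function and the "realHref" field are invented for the example, and Fossil's generated markup and script differ in detail. Anchors are modeled as plain objects so that the sketch runs outside a browser; in a real page the loop would walk the document's anchor elements instead:

```javascript
// Illustrative sketch of the honeypot-link scheme (invented names;
// not Fossil's generated code).  Every anchor initially points at
// /honeypot and carries its true target elsewhere; a script that runs
// at the end of the page copies the true target into href.
function enableHyperlinks(anchors) {
  for (const a of anchors) {
    if (a.href === "/honeypot" && a.realHref) {
      a.href = a.realHref;  // robots that skip JavaScript never see this
    }
  }
}

const anchors = [
  { href: "/honeypot", realHref: "/timeline" },
  { href: "/honeypot", realHref: "/info/trunk" },
];
enableHyperlinks(anchors);
console.log(anchors.map((a) => a.href));  // [ '/timeline', '/info/trunk' ]
```

A robot that harvests "href=" attributes from the raw HTML collects only /honeypot links, while a browser that runs the page's scripts ends up with the true targets.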

If the [/help/auto-hyperlink|"auto-hyperlink"] setting is (1)
"<b>Enable hyperlinks using User-Agent and/or Javascript</b>",
then there are two additional sub-settings that control when
hyperlinks are enabled.

The first sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags. The default value for this
delay is 10 milliseconds. The idea here is that a robot will try to
interpret the links on the page immediately, and will not wait for delayed
scripts to be run, and thus will never enable the true links.

The second sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mousedown" or "mousemove" event has been detected on the
&lt;body&gt; element of the page. The thinking here is that robots will not be
simulating mouse motion and so no mouse events will ever occur and
hence the hyperlinks will never become enabled for robots.
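
Taken together, the two sub-settings amount to a gate like the following. This is a simplified, synchronous sketch with invented names; the real script uses a timer and DOM event listeners on the page body:

```javascript
// Simplified sketch (invented names) of the two sub-settings:
// hyperlinks are enabled only after at least one mouse event has been
// seen AND the configured delay has elapsed since the page loaded.
function shouldEnableLinks(mouseEventSeen, elapsedMs, delayMs = 10) {
  return mouseEventSeen && elapsedMs >= delayMs;
}

console.log(shouldEnableLinks(false, 1000));  // false: no mouse events seen
console.log(shouldEnableLinks(true, 5));      // false: delay not yet elapsed
console.log(shouldEnableLinks(true, 50));     // true: human-looking session
```

A robot that neither runs JavaScript nor simulates mouse motion fails both conditions, so the true link targets are never written into the page it sees.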

See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>Do Not Allow Robot Access To Certain Pages</h2>

The [/help/robot-restrict|robot-restrict setting] is a comma-separated
list of GLOB patterns for pages for which robot access is prohibited.
The default value is:

<blockquote><pre>
timelineX,diff,annotate,fileage,file,finfo,reports,tree,hexdump,download
</pre></blockquote>

Each entry corresponds to the first path element on the URI for a
Fossil-generated page. If Fossil does not know for certain that the
HTTP request is coming from a human, then any attempt to access one of
these pages brings up a JavaScript-powered captcha. The user has to
click the accept button on the captcha once, and that sets a cookie allowing
the user to continue surfing without interruption for 15 minutes or so
before being presented with another captcha.

Some path elements have special meanings:

  *  <b>timelineX &rarr;</b>
     This means a subset of /timeline/ pages that are considered
     "expensive". The exact definition of which timeline pages are
     expensive and which are not is still the subject of active
     experimentation and is likely to change by the time you read this
     text. The idea is that anybody (including robots) can see a timeline
     of the most recent changes, but timelines of long-ago changes or that
     contain lists of file changes or other harder-to-compute values are
     prohibited.

  *  <b>zip &rarr;</b>
     The special "zip" keyword also matches "/tarball/" and "/sqlar/".

  *  <b>zipX &rarr;</b>
     This is like "zip" in that it restricts access to "/zip/", "/tarball/",
     and "/sqlar/" but with exceptions:<ol type="a">
     <li><p> If the [/help/robot-zip-leaf|robot-zip-leaf] setting is
     true, then tarballs of leaf check-ins are allowed. This permits
     URLs that attempt to download the latest check-in on trunk or
     from a named branch, for example.
     <li><p> If a check-in has a tag that matches the GLOB list in
     [/help/robot-zip-tag|robot-zip-tag], then tarballs of that
     check-in are allowed. This allows check-ins tagged with
     "release" or "allow-robots" (for example) to be downloaded
     without restriction.
     </ol>
     The "zipX" restriction is not in the default robot-restrict setting.
     This is something you might want to add, depending on your needs.

  *  <b>diff &rarr;</b>
     This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that
     is primarily about showing the difference between two check-ins or two
     file versions.

  *  <b>annotate &rarr;</b>
     This also matches /blame/ and /praise/.

Other special keywords may be added in the future.
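
The overall check can be sketched like this. The sketch is not Fossil's implementation: it treats entries as literal first-path-element names rather than GLOB patterns, and it models only the "zip" special keyword:

```javascript
// Simplified sketch (not Fossil's implementation) of checking a
// request URI against the robot-restrict list.  Entries are treated
// as literal first-path-element names here, although the real setting
// holds GLOB patterns, and only the "zip" special keyword is modeled.
const ROBOT_RESTRICT = ["timelineX", "diff", "annotate", "fileage",
                        "file", "finfo", "reports", "tree", "hexdump",
                        "download"];

function isRestricted(uriPath, restrictList = ROBOT_RESTRICT) {
  const first = uriPath.replace(/^\//, "").split("/")[0];
  return restrictList.some((entry) =>
    entry === "zip"
      ? ["zip", "tarball", "sqlar"].includes(first)  // "zip" covers all three
      : entry === first);
}

console.log(isRestricted("/diff/abc/def"));          // true
console.log(isRestricted("/doc/trunk/README.md"));   // false
console.log(isRestricted("/tarball/xyz", ["zip"]));  // true
```

A restricted page is not refused outright; as described above, the visitor is shown a captcha and, once it is answered, gets a cookie that suppresses further captchas for a while.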

The default [/help/robot-restrict|robot-restrict]
setting has been shown in practice to do a good job of keeping
robots from consuming all available CPU and bandwidth while
still allowing humans access to the full power of the site without
having to be logged in.

One possible enhancement is to add "zipX" to the
[/help/robot-restrict|robot-restrict] setting,
enable [/help/robot-zip-leaf|robot-zip-leaf],
and configure [/help/robot-zip-tag|robot-zip-tag].
Do this if you find that robots downloading lots of
obscure tarballs is causing load issues on your site.

<h2>Anti-robot Exception RegExps</h2>

The [/help/robot-exception|robot-exception setting], shown as
<b>Exceptions to anti-robot restrictions</b>, is a list of
[/re_rules|regular expressions], one per line, that match
URIs that will bypass the captcha and allow robots full access. The
intent of this setting is to allow automated build scripts
to download specific tarballs of project snapshots.

The recommended value for this setting allows robots to use URIs of the
following form:

<blockquote>
<b>https://</b><i>DOMAIN</i><b>/tarball/release/</b><i>HASH</i><b>/</b><i>NAME</i><b>.tar.gz</b>
</blockquote>

The <i>HASH</i> part of this URL can be any valid
[./checkin_names.wiki|check-in name]. The link works as long as that
check-in is tagged with the "release" symbolic tag. In this way,
robots are permitted to download tarballs (and ZIP archives) of official
releases, but not every intermediate check-in between releases. Humans
who are willing to click the captcha can still download whatever they
want, but robots are blocked by the captcha. This prevents aggressive
robots from downloading tarballs of every historical check-in of your
project, once per day, which many robots these days seem eager to do.

For example, on the Fossil project itself, this URL will work, even for
robots:

<blockquote>
https://fossil-scm.org/home/tarball/release/version-2.27/fossil-scm.tar.gz
</blockquote>

But the next URL will not work for robots because check-in 3bbd18a284c8bd6a
is not tagged as a "release":

<blockquote>
https://fossil-scm.org/home/tarball/release/3bbd18a284c8bd6a/fossil-scm.tar.gz
</blockquote>

The second URL will work for humans, just not robots.
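
A regular expression of roughly the following shape would admit URIs of the recommended form. The pattern is an assumption for illustration; the value actually recommended by Fossil may differ:

```javascript
// Assumed illustration of a robot-exception pattern: permit robots to
// fetch release tarballs at /tarball/release/HASH/NAME.tar.gz while
// every other URI still goes through the captcha.
const releaseTarball = /^\/tarball\/release\/[^/]+\/[^/]+\.tar\.gz$/;

console.log(releaseTarball.test(
  "/tarball/release/version-2.27/fossil-scm.tar.gz"));  // true
console.log(releaseTarball.test(
  "/tarball/version-2.27/fossil-scm.tar.gz"));          // false
```

Note that matching such a pattern only bypasses the captcha; whether the tarball is actually served still depends on the named check-in carrying the "release" tag, as described above.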

<h2>The Ongoing Struggle</h2>

Fossil currently does a good job of providing easy access to humans
while keeping out troublesome robots. However, robots
continue to grow more sophisticated, requiring ever more advanced
defenses. This "arms race" is unlikely to ever end. The developers of
Fossil will continue to try to improve the robot defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the robot
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].
