<title>Defense Against Robots</title>

A typical Fossil website can have billions and billions of pages,
and many of those pages (for example diffs and annotations and tarballs)
can be expensive to compute.
If a robot walks a Fossil-generated website,
it can present a crippling bandwidth and CPU load.
A "robots.txt" file can help, but in practice, most robots these
days ignore the robots.txt file, so it won't help much.

A Fossil website is intended to be used
interactively by humans, not walked by robots. This article
describes the techniques used by Fossil to try to welcome human
users while keeping out robots.

<h2>Defenses Are Enabled By Default</h2>

In the latest implementations of Fossil, most robot defenses are
enabled by default. You can probably get by with standing up a
public-facing Fossil instance in the default configuration. But
you can also customize the defenses to serve your particular needs.

<h2>Customizing Anti-Robot Defenses</h2>

Admin users can configure robot defenses on the
"Robot Defense Settings" page (/setup_robot).
That page is accessible (to Admin users) from the default menu bar
by clicking on the "Admin" menu choice, then selecting the
"Robot-Defense" link from the list.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user". For random passers-by on the internet
(and for robots) that user is "nobody". The "anonymous" user is also
available for humans who do not wish to identify themselves. The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against robots, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks. A robot
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull zip
archives, etc. by visiting links, because there are no links.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

But requiring a login, even an anonymous login, can be annoying.
Fossil provides other techniques for blocking robots that
are less cumbersome to humans.

<h2>Automatic Hyperlinks Based on UserAgent and Javascript</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run Javascript.

The UserAgent string is a text identifier that is included in the header
of most HTTP requests that identifies the specific maker and version of
the browser (or robot) that generated the request. Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT. The third
example is the robot used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the last two identify the requester as a robot.
Note that the UserAgent string is completely under the control
of the requester, and so a malicious robot can forge a UserAgent
string that makes it look like a human. But most robots want
to "play nicely" on the internet and are quite open
about the fact that they are robots. And so the UserAgent string
provides a good first guess about whether a request originates
from a human or a robot.

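Fossil's real UserAgent test lives in its C sources; as a rough illustration
of the idea only, a first-guess classifier can simply look for substrings that
self-identifying robots typically include. The function name and hint list
below are invented for this sketch and are not Fossil's actual code:

```javascript
// Illustrative sketch only -- not Fossil's actual classifier.
// Guess whether a UserAgent string belongs to a robot by looking for
// telltale substrings that polite robots usually include.
function looksLikeRobot(userAgent) {
  const hints = ["bot", "crawl", "spider", "wget", "curl", "fetch"];
  const ua = userAgent.toLowerCase();
  return hints.some((h) => ua.includes(h));
}
```

Run against the four examples above, this heuristic classifies the Firefox
and MSIE strings as human and the Googlebot and Wget strings as robots.
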
The [/help/auto-hyperlink|auto-hyperlink] setting, shown as
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>" on
the Robot Defense Settings page,
can be set to "UserAgent only" or "UserAgent and Javascript" or "off".
If the UserAgent string looks like a human and not a robot, then
Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability
is omitted from the user permissions. This setting gives humans easy
access to the hyperlinks while preventing robots
from walking the billions of pages on a typical Fossil site.

If the setting is "UserAgent only" (2), then the hyperlinks are simply
enabled and that is all. But if the setting is "UserAgent and Javascript" (1),
then the hyperlinks are not enabled directly.
Instead, the HTML code that is generated contains anchor tags ("<a>")
with "href=" attributes that point to [/honeypot] rather than the correct
link. JavaScript code is added to the end of the page that goes back and
fills in the correct "href=" attributes of
the anchor tags with the true hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against robots that forge a human-looking
UserAgent string. Most robots do not bother to run JavaScript, and
so to a robot the decoy anchor tags are useless. But all modern
web browsers implement JavaScript, so hyperlinks show up
normally for human users.

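The fill-in step can be sketched as follows. This is not the script Fossil
actually emits; the "data-href" attribute and the function name are invented
for illustration, on the assumption that each decoy anchor carries its true
target in a data attribute:

```javascript
// Illustrative sketch only -- not the script Fossil actually generates.
// Each anchor is assumed to be emitted with href="/honeypot" and its
// real target stashed in a "data-href" attribute.
function fillHyperlinks(anchors) {
  for (const a of anchors) {
    if (a.dataset && a.dataset.href) {
      a.href = a.dataset.href;   // swap the decoy for the real target
    }
  }
}

// In a real page this might run from a script at the end of the body:
//   window.addEventListener("load", () =>
//     fillHyperlinks(document.querySelectorAll("a[data-href]")));
```

A robot that does not execute the script sees only the /honeypot decoys.
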
If the [/help/auto-hyperlink|"auto-hyperlink"] setting is (1)
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>",
then there are two additional sub-settings that control when
hyperlinks are enabled.

The first sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags. The default value for this
delay is 10 milliseconds. The idea here is that robots will try to
interpret the links on the page immediately, and will not wait for delayed
scripts to be run, and thus will never enable the true links.

The second sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mousedown" or "mousemove" event has been detected on the
<body> element of the page. The thinking here is that robots will not be
simulating mouse motion, and so no mouse events will ever occur and
hence the hyperlinks will never become enabled for robots.

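Both sub-settings amount to deferring the href-filling step until there is
some evidence of a human. A hedged sketch of the combined logic (the function
and its arguments are invented for this illustration, not Fossil's API):

```javascript
// Illustrative sketch only -- not Fossil's actual page script.
// Defer "activate" (the routine that fills in real href= values) by a
// delay, and optionally until a mouse event is seen on the body.
function armHyperlinks(body, delayMs, requireMouse, activate) {
  let armed = false;
  const start = () => {
    if (armed) return;              // arm at most once
    armed = true;
    setTimeout(activate, delayMs);  // sub-setting 1: delay before filling href=
  };
  if (requireMouse) {
    // sub-setting 2: wait for evidence of a real mouse
    body.addEventListener("mousedown", start);
    body.addEventListener("mousemove", start);
  } else {
    start();
  }
}
```

With requireMouse set, a robot that never synthesizes mouse events never
reaches the activation step at all.
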
See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>Do Not Allow Robot Access To Certain Pages</h2>

The [/help/robot-restrict|robot-restrict setting] is a comma-separated
list of GLOB patterns for pages for which robot access is prohibited.
The default value is:

<blockquote><pre>
timelineX,diff,annotate,fileage,file,finfo,reports,tree,hexdump,download
</pre></blockquote>

Each entry corresponds to the first path element of the URI for a
Fossil-generated page. If Fossil does not know for certain that the
HTTP request is coming from a human, then any attempt to access one of
these pages brings up a JavaScript-powered captcha. The user has to
click the accept button on the captcha once, and that sets a cookie allowing
the user to continue surfing without interruption for 15 minutes or so
before being presented with another captcha.

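The matching rule can be sketched as follows. This is only an illustration
of comma-separated GLOB matching against the first path element of a URI,
not Fossil's actual C implementation, and the function name is invented:

```javascript
// Illustrative sketch only -- not Fossil's actual matcher.
// Test the first path element of a request URI against a comma-separated
// list of GLOB patterns, in the spirit of the robot-restrict setting.
function robotRestricted(restrictSetting, uriPath) {
  const path = uriPath.split("?")[0];                  // drop any query string
  const first = path.replace(/^\//, "").split("/")[0]; // first path element
  return restrictSetting.split(",").some((glob) => {
    // translate the GLOB to an anchored regex: "*" => ".*", "?" => "."
    const re = new RegExp("^" +
      glob.replace(/[.+^${}()|[\]\\]/g, "\\$&")
          .replace(/\*/g, ".*")
          .replace(/\?/g, ".") + "$");
    return re.test(first);
  });
}
```

So, with the default setting above, a request for /diff or /file/src/main.c
would trigger the captcha, while /wiki/home would not.
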
Some path elements have special meanings:

  *  <b>timelineX →</b>
     This means a subset of /timeline/ pages that are considered
     "expensive". The exact definition of which timeline pages are
     expensive and which are not is still the subject of active
     experimentation and is likely to change by the time you read this
     text. The idea is that anybody (including robots) can see a timeline
     of the most recent changes, but timelines of long-ago changes, or
     timelines that contain lists of file changes or other
     harder-to-compute values, are prohibited.

  *  <b>zip →</b>
     The special "zip" keyword also matches "/tarball/" and "/sqlar/".

  *  <b>zipX →</b>
     This is like "zip" in that it restricts access to "/zip/", "/tarball/",
     and "/sqlar/", but with exceptions:<ol type="a">
     <li><p> If the [/help/robot-zip-leaf|robot-zip-leaf] setting is
     true, then tarballs of leaf check-ins are allowed. This permits
     URLs that attempt to download the latest check-in on trunk or
     from a named branch, for example.
     <li><p> If a check-in has a tag that matches the GLOB list in
     [/help/robot-zip-tag|robot-zip-tag], then tarballs of that
     check-in are allowed. This allows check-ins tagged with
     "release" or "allow-robots" (for example) to be downloaded
     without restriction.
     </ol>
     The "zipX" restriction is not in the default robot-restrict setting.
     This is something you might want to add, depending on your needs.

  *  <b>diff →</b>
     This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that
     is primarily about showing the difference between two check-ins or two
     file versions.

  *  <b>annotate →</b>
     This also matches /blame/ and /praise/.

Other special keywords may be added in the future.

The default [/help/robot-restrict|robot-restrict]
setting has been shown in practice to do a good job of keeping
robots from consuming all available CPU and bandwidth while
still allowing humans access to the full power of the site without
having to be logged in.

One possible enhancement is to add "zipX" to the
[/help/robot-restrict|robot-restrict] setting,
enable [/help?cmd=robot-zip-leaf|robot-zip-leaf],
and configure [/help?cmd=robot-zip-tag|robot-zip-tag].
Do this if you find that robots downloading lots of
obscure tarballs is causing load issues on your site.

<h2>Anti-robot Exception RegExps</h2>

The [/help/robot-exception|robot-exception setting], shown as
<b>Exceptions to anti-robot restrictions</b> on the
Robot Defense Settings page, is a list of
[/re_rules|regular expressions], one per line, that match
URIs that will bypass the captcha and allow robots full access. The
intent of this setting is to allow automated build scripts
to download specific tarballs of project snapshots.

The recommended value for this setting allows robots to use URIs of the
following form:

<blockquote>
<b>https://</b><i>DOMAIN</i><b>/tarball/release/</b><i>HASH</i><b>/</b><i>NAME</i><b>.tar.gz</b>
</blockquote>

The <i>HASH</i> part of this URL can be any valid
[./checkin_names.wiki|check-in name]. The link works as long as that
check-in is tagged with the "release" symbolic tag. In this way,
robots are permitted to download tarballs (and ZIP archives) of official
releases, but not every intermediate check-in between releases. Humans
who are willing to click the captcha can still download whatever they
want, but robots are blocked by the captcha. This prevents aggressive
robots from downloading tarballs of every historical check-in of your
project, once per day, which many robots these days seem eager to do.

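The bypass check amounts to testing the request URI against each regular
expression in the setting, one per line. A sketch of that idea follows; the
helper name and the sample pattern are illustrative assumptions, not the
exact recommended setting value:

```javascript
// Illustrative sketch only -- not Fossil's actual implementation.
// A URI that matches any non-blank line of the setting bypasses the captcha.
function robotException(exceptionSetting, uri) {
  return exceptionSetting.split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .some((line) => new RegExp(line).test(uri));
}

// A sample pattern in the spirit of the recommended value, admitting
// release-tarball URIs of the form shown above:
const sample = "^/tarball/release/[^/]+/[^/]+\\.tar\\.gz$";
```

The server still applies its own checks (such as the "release" tag test)
after the regular-expression match; the exception list only decides whether
the captcha is skipped.
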
For example, on the Fossil project itself, this URL will work, even for
robots:

<blockquote>
https://fossil-scm.org/home/tarball/release/version-2.27/fossil-scm.tar.gz
</blockquote>

But the next URL will not work for robots because check-in 3bbd18a284c8bd6a
is not tagged as a "release":

<blockquote>
https://fossil-scm.org/home/tarball/release/3bbd18a284c8bd6a/fossil-scm.tar.gz
</blockquote>

The second URL will work for humans, just not robots.

<h2>The Ongoing Struggle</h2>

Fossil currently does a good job of providing easy access to humans
while keeping out troublesome robots. However, robots
continue to grow more sophisticated, requiring ever more advanced
defenses. This "arms race" is unlikely to ever end. The developers of
Fossil will continue to try to improve the robot defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the robot
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].