<title>Defense Against Robots</title>

A typical Fossil website can have billions and billions of pages,
and many of those pages (for example diffs, annotations, and tarballs)
can be expensive to compute.
If a robot walks a Fossil-generated website,
it can impose a crippling bandwidth and CPU load.
A "robots.txt" file can help, but in practice most robots
now ignore the robots.txt file, so it is of little use.

A Fossil website is intended to be used
interactively by humans, not walked by robots. This article
describes the techniques used by Fossil to try to welcome human
users while keeping out robots.

<h2>Defenses Are Enabled By Default</h2>

In the latest implementations of Fossil, most robot defenses are
enabled by default. You can probably get by with standing up a
public-facing Fossil instance in the default configuration. But
you can also customize the defenses to serve your particular needs.

<h2>Customizing Anti-Robot Defenses</h2>

Admin users can configure robot defenses on the
"Robot Defense Settings" page (/setup_robot).
That page is accessible (to Admin users) from the default menu bar
by clicking the "Admin" menu entry, then selecting the
"Robot-Defense" link from the list.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user". For random passers-by on the internet
(and for robots) that user is "nobody". The "anonymous" user is also
available for humans who do not wish to identify themselves. The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against robots, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks. A robot
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull zip
archives, etc. by visiting links, because there are no links.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

But requiring a login, even an anonymous login, can be annoying.
Fossil provides other techniques for blocking robots that
are less cumbersome for humans.

<h2>Automatic Hyperlinks Based on UserAgent and Javascript</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run Javascript.

The UserAgent string is a text identifier, included in the header of
most HTTP requests, that identifies the specific maker and version of
the browser (or robot) that generated the request. Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT. The third
example is the robot used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the last two identify the requester as a robot.
Note that the UserAgent string is completely under the control
of the requester, so a malicious robot can forge a UserAgent
string that makes it look like a human. But most robots want
to "play nicely" on the internet and are quite open
about the fact that they are robots. And so the UserAgent string
provides a good first guess about whether a request originates
from a human or a robot.
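
That first guess can be sketched in a few lines of JavaScript. The
marker list below is purely illustrative (it is not Fossil's actual
heuristic), but it shows the general shape of a UserAgent-based test:

```javascript
// Sketch of a UserAgent first-guess classifier.
// The substring list is illustrative; Fossil's real heuristic differs.
function looksLikeRobot(userAgent) {
  const robotMarkers = ["bot", "crawl", "spider", "wget", "curl"];
  const ua = userAgent.toLowerCase();
  // Treat the request as robotic if any marker appears anywhere
  // in the UserAgent string; otherwise assume a human browser.
  return robotMarkers.some((marker) => ua.includes(marker));
}
```

A classifier of this kind errs on the side of treating unknown agents
as human, which is one reason the Javascript-based defenses described
in the rest of this section exist as a second layer.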

The [/help/auto-hyperlink|auto-hyperlink] setting, shown as
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>" on
the Robot Defense Settings page,
can be set to "UserAgent only" or "UserAgent and Javascript" or "off".
If the UserAgent string looks like a human and not a robot, then
Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability
is omitted from the user permissions. This setting gives humans easy
access to the hyperlinks while preventing robots
from walking the billions of pages on a typical Fossil site.

If the setting is "UserAgent only" (2), then the hyperlinks are simply
enabled and that is all. But if the setting is "UserAgent and Javascript" (1),
then the hyperlinks are not enabled directly.
Instead, the HTML code that is generated contains anchor tags ("&lt;a&gt;")
with "href=" attributes that point to [/honeypot] rather than the correct
link. JavaScript code is added to the end of the page that goes back and
fills in the correct "href=" attributes of
the anchor tags with the true hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against robots that forge a human-looking
UserAgent string. Most robots do not bother to run JavaScript, and
so to the robot the decoy anchor tag is useless. But all modern
web browsers implement JavaScript, so hyperlinks will show up
normally for human users.
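
The mechanism can be sketched as follows. The "data-href" attribute
name is an assumption for illustration (Fossil's generated markup
differs in detail), and a plain object stands in for a DOM anchor so
the sketch is self-contained:

```javascript
// Sketch: anchors are emitted pointing at /honeypot, with the real
// target stashed elsewhere; a script run later swaps the real target in.
// (The data-href convention here is illustrative, not Fossil's markup.)
function enableHyperlinks(anchors) {
  for (const a of anchors) {
    if (a.dataset && a.dataset.href) {
      a.href = a.dataset.href;  // replace the decoy with the true target
    }
  }
}
```

In a browser this would run over the page's anchor elements after load;
a robot that never executes the script only ever sees /honeypot.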

If the [/help/auto-hyperlink|"auto-hyperlink"] setting is (1)
"<b>UserAgent and Javascript</b>",
then there are two additional sub-settings that control when
hyperlinks are enabled.

The first sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags. The default value for this
delay is 10 milliseconds. The idea here is that a robot will try to
interpret the links on the page immediately, and will not wait for delayed
scripts to be run, and thus will never enable the true links.

The second sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mousedown" or "mousemove" event has been detected on the
&lt;body&gt; element of the page. The thinking here is that robots will not be
simulating mouse motion, so no mouse events will ever occur and
hence the hyperlinks will never become enabled for robots.
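
Taken together, the two sub-settings amount to gating the same href
fix-up behind a timer and behind the first mouse event. A schematic
sketch (plain callbacks stand in for setTimeout() and DOM event
listeners; the names are illustrative, not Fossil's):

```javascript
// Sketch of the two gates: the href fix-up runs only after (a) the
// short delay has elapsed and (b) at least one mouse event was seen.
// Plain function calls stand in for setTimeout()/addEventListener().
function makeGatedEnabler(fixupLinks) {
  const state = { delayDone: false, mouseSeen: false, enabled: false };
  function maybeEnable() {
    if (state.delayDone && state.mouseSeen && !state.enabled) {
      state.enabled = true;
      fixupLinks();            // set the real href= targets, exactly once
    }
  }
  return {
    onDelayElapsed() { state.delayDone = true; maybeEnable(); },
    onMouseEvent()   { state.mouseSeen = true; maybeEnable(); },
  };
}
```

A robot that neither waits out the timer nor synthesizes mouse events
never reaches the fixupLinks() call.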

See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>Do Not Allow Robot Access To Certain Pages</h2>

The [/help/robot-restrict|robot-restrict setting] is a comma-separated
list of GLOB patterns for pages for which robot access is prohibited.
The default value is:

<blockquote><pre>
timelineX,diff,annotate,fileage,file,finfo,reports,tree,hexdump,download
</pre></blockquote>

Each entry corresponds to the first path element on the URI for a
Fossil-generated page. If Fossil does not know for certain that the
HTTP request is coming from a human, then any attempt to access one of
these pages brings up a JavaScript-powered captcha. The user has to
click the accept button on the captcha once, and that sets a cookie allowing
the user to continue surfing without interruption for 15 minutes or so
before being presented with another captcha.
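
That first-path-element test can be sketched as follows. The
GLOB-to-RegExp translation here is deliberately minimal (only "*" and
"?"), and the special keywords such as "timelineX" are not modeled:

```javascript
// Sketch: decide whether a request path hits a robot-restricted page.
// Only the first path element is compared against the GLOB list.
// (Simplified: Fossil's GLOBs and special keywords differ in detail.)
function isRestricted(robotRestrict, uriPath) {
  const first = uriPath.split("/").filter(Boolean)[0] || "";
  return robotRestrict.split(",").some((glob) => {
    // translate a minimal GLOB ('*' and '?') into an anchored RegExp
    const re = new RegExp("^" + glob.trim()
      .replace(/[.+^${}()|[\]\\]/g, "\\$&")
      .replace(/\*/g, ".*")
      .replace(/\?/g, ".") + "$");
    return re.test(first);
  });
}
```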

Some path elements have special meanings:

* <b>timelineX →</b>
This means a subset of /timeline/ pages that are considered
"expensive". The exact definition of which timeline pages are
expensive and which are not is still the subject of active
experimentation and is likely to change by the time you read this
text. The idea is that anybody (including robots) can see a timeline
of the most recent changes, but timelines of long-ago changes, or
timelines that contain lists of file changes or other harder-to-compute
values, are prohibited.

* <b>zip →</b>
The special "zip" keyword also matches "/tarball/" and "/sqlar/".

* <b>zipX →</b>
This is like "zip" in that it restricts access to "/zip/", "/tarball/",
and "/sqlar/", but with exceptions:<ol type="a">
<li><p> If the [/help/robot-zip-leaf|robot-zip-leaf] setting is
true, then tarballs of leaf check-ins are allowed. This permits
URLs that attempt to download the latest check-in on trunk or
from a named branch, for example.
<li><p> If a check-in has a tag that matches the GLOB list in
[/help/robot-zip-tag|robot-zip-tag], then tarballs of that
check-in are allowed. This allows check-ins tagged with
"release" or "allow-robots" (for example) to be downloaded
without restriction.
</ol>
The "zipX" restriction is not in the default robot-restrict setting.
This is something you might want to add, depending on your needs.

* <b>diff →</b>
This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that
is primarily about showing the difference between two check-ins or two
file versions.

* <b>annotate →</b>
This also matches /blame/ and /praise/.

Other special keywords may be added in the future.

The default [/help/robot-restrict|robot-restrict]
setting has been shown in practice to do a good job of keeping
robots from consuming all available CPU and bandwidth while
still allowing humans access to the full power of the site without
having to be logged in.

One possible enhancement is to add "zipX" to the
[/help/robot-restrict|robot-restrict] setting,
and enable [/help/robot-zip-leaf|robot-zip-leaf]
and configure [/help/robot-zip-tag|robot-zip-tag].
Do this if you find that robots downloading lots of
obscure tarballs is causing load issues on your site.

<h2>Anti-robot Exception RegExps</h2>

The [/help/robot-exception|robot-exception setting], shown as
<b>Exceptions to anti-robot restrictions</b>, is a list of
[/re_rules|regular expressions], one per line, that match
URIs that will bypass the captcha and allow robots full access. The
intent of this setting is to allow automated build scripts
to download specific tarballs of project snapshots.

The recommended value for this setting allows robots to use URIs of the
following form:

<blockquote>
<b>https://</b><i>DOMAIN</i><b>/tarball/release/</b><i>HASH</i><b>/</b><i>NAME</i><b>.tar.gz</b>
</blockquote>

The <i>HASH</i> part of this URL can be any valid
[./checkin_names.wiki|check-in name]. The link works as long as that
check-in is tagged with the "release" symbolic tag. In this way,
robots are permitted to download tarballs (and ZIP archives) of official
releases, but not every intermediate check-in between releases. Humans
who are willing to click the captcha can still download whatever they
want, but robots are blocked by the captcha. This prevents aggressive
robots from downloading tarballs of every historical check-in of your
project, once per day, which many robots these days seem eager to do.

For example, on the Fossil project itself, this URL will work, even for
robots:

<blockquote>
https://fossil-scm.org/home/tarball/release/version-2.27/fossil-scm.tar.gz
</blockquote>

But the next URL will not work for robots because check-in 3bbd18a284c8bd6a
is not tagged as a "release":

<blockquote>
https://fossil-scm.org/home/tarball/release/3bbd18a284c8bd6a/fossil-scm.tar.gz
</blockquote>

The second URL will work for humans, just not robots.
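
The effect of an exception can be sketched with an illustrative
pattern. This is not the recommended setting value verbatim, and note
that matching only bypasses the captcha; whether the tarball is
actually served still depends on the check-in carrying the "release"
tag, as described above:

```javascript
// Sketch: a robot-exception style pattern permitting release tarballs.
// Illustrative only -- the actual recommended setting value differs.
const releaseTarball = /^\/tarball\/release\/[^\/]+\/[^\/]+\.tar\.gz$/;

// A request URI matching an exception pattern skips the captcha.
function bypassesCaptcha(uriPath) {
  return releaseTarball.test(uriPath);
}
```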

<h2>The Ongoing Struggle</h2>

Fossil currently does a good job of providing easy access to humans
while keeping out troublesome robots. However, robots
continue to grow more sophisticated, requiring ever more advanced
defenses. This "arms race" is unlikely to ever end. The developers of
Fossil will continue to try to improve the robot defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the robot
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].