<title>Defense Against Robots</title>

A typical Fossil website can have billions and billions of pages,
and many of those pages (for example diffs and annotations and tarballs)
can be expensive to compute.
If a robot walks a Fossil-generated website,
it can present a crippling bandwidth and CPU load.
A "robots.txt" file can help, but in practice, most robots these
days ignore the robots.txt file, so it won't help much.

A Fossil website is intended to be used
interactively by humans, not walked by robots. This article
describes the techniques used by Fossil to try to welcome human
users while keeping out robots.

<h2>Defenses Are Enabled By Default</h2>

In the latest implementations of Fossil, most robot defenses are
enabled by default. You can probably get by with standing up a
public-facing Fossil instance in the default configuration. But
you can also customize the defenses to serve your particular needs.

<h2>Customizing Anti-Robot Defenses</h2>

Admin users can configure robot defenses on the
"Robot Defense Settings" page (/setup_robot).
That page is accessible (to Admin users) from the default menu bar
by clicking on the "Admin" menu choice, then selecting the
"Robot-Defense" link from the list.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user". For random passers-by on the internet
(and for robots) that user is "nobody". The "anonymous" user is also
available for humans who do not wish to identify themselves. The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against robots, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks. A robot
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull zip
archives, etc. by visiting links, because there are no links.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

But requiring a login, even an anonymous login, can be annoying.
Fossil provides other techniques for blocking robots that
are less cumbersome to humans.

<h2>Automatic Hyperlinks Based on UserAgent and Javascript</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run Javascript.

The UserAgent string is a text identifier that is included in the header
of most HTTP requests that identifies the specific maker and version of
the browser (or robot) that generated the request. Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT. The third
example is the robot used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the last two identify the requester as a robot.
Note that the UserAgent string is completely under the control
of the requester, and so a malicious robot can forge a UserAgent
string that makes it look like a human. But most robots want
to "play nicely" on the internet and are quite open
about the fact that they are robots. And so the UserAgent string
provides a good first guess about whether a request originates
from a human or a robot.

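Fossil's real UserAgent test lives in its C sources; as a rough illustration
of the idea only, a first-guess classifier can simply look for substrings that
self-identifying robots typically include. The function name and hint list
below are invented for this sketch and are not Fossil's actual code:

```javascript
// Illustrative sketch only -- not Fossil's actual classifier.
// Guess whether a UserAgent string belongs to a robot by looking for
// telltale substrings that polite robots usually include.
function looksLikeRobot(userAgent) {
  const hints = ["bot", "crawl", "spider", "wget", "curl", "fetch"];
  const ua = userAgent.toLowerCase();
  return hints.some((h) => ua.includes(h));
}
```

Run against the four examples above, this heuristic classifies the Firefox
and MSIE strings as human and the Googlebot and Wget strings as robots.
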
The [/help/auto-hyperlink|auto-hyperlink] setting, shown as
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>" on
the Robot Defense Settings page,
can be set to "UserAgent only" or "UserAgent and Javascript" or "off".
If the UserAgent string looks like a human and not a robot, then
Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability
is omitted from the user permissions. This setting gives humans easy
access to the hyperlinks while preventing robots
from walking the billions of pages on a typical Fossil site.

If the setting is "UserAgent only" (2), then the hyperlinks are simply
enabled and that is all. But if the setting is "UserAgent and Javascript" (1),
then the hyperlinks are not enabled directly.
Instead, the HTML code that is generated contains anchor tags ("<a>")
with "href=" attributes that point to [/honeypot] rather than the correct
link. JavaScript code is added to the end of the page that goes back and
fills in the correct "href=" attributes of
the anchor tags with the true hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against robots that forge a human-looking
UserAgent string. Most robots do not bother to run JavaScript, and
so to a robot the decoy anchor tags are useless. But all modern
web browsers implement JavaScript, so hyperlinks show up
normally for human users.

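The fill-in step can be sketched as follows. This is not the script Fossil
actually emits; the "data-href" attribute and the function name are invented
for illustration, on the assumption that each decoy anchor carries its true
target in a data attribute:

```javascript
// Illustrative sketch only -- not the script Fossil actually generates.
// Each anchor is assumed to be emitted with href="/honeypot" and its
// real target stashed in a "data-href" attribute.
function fillHyperlinks(anchors) {
  for (const a of anchors) {
    if (a.dataset && a.dataset.href) {
      a.href = a.dataset.href;   // swap the decoy for the real target
    }
  }
}

// In a real page this might run from a script at the end of the body:
//   window.addEventListener("load", () =>
//     fillHyperlinks(document.querySelectorAll("a[data-href]")));
```

A robot that does not execute the script sees only the /honeypot decoys.
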
If the [/help/auto-hyperlink|"auto-hyperlink"] setting is (1)
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>",
then there are two additional sub-settings that control when
hyperlinks are enabled.

The first sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags. The default value for this
delay is 10 milliseconds. The idea here is that robots will try to
interpret the links on the page immediately, and will not wait for delayed
scripts to be run, and thus will never enable the true links.

The second sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mousedown" or "mousemove" event has been detected on the
<body> element of the page. The thinking here is that robots will not be
simulating mouse motion, and so no mouse events will ever occur and
hence the hyperlinks will never become enabled for robots.

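Both sub-settings amount to deferring the href-filling step until there is
some evidence of a human. A hedged sketch of the combined logic (the function
and its arguments are invented for this illustration, not Fossil's API):

```javascript
// Illustrative sketch only -- not Fossil's actual page script.
// Defer "activate" (the routine that fills in real href= values) by a
// delay, and optionally until a mouse event is seen on the body.
function armHyperlinks(body, delayMs, requireMouse, activate) {
  let armed = false;
  const start = () => {
    if (armed) return;              // arm at most once
    armed = true;
    setTimeout(activate, delayMs);  // sub-setting 1: delay before filling href=
  };
  if (requireMouse) {
    // sub-setting 2: wait for evidence of a real mouse
    body.addEventListener("mousedown", start);
    body.addEventListener("mousemove", start);
  } else {
    start();
  }
}
```

With requireMouse set, a robot that never synthesizes mouse events never
reaches the activation step at all.
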
See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>Do Not Allow Robot Access To Certain Pages</h2>

The [/help/robot-restrict|robot-restrict setting] is a comma-separated
list of GLOB patterns for pages for which robot access is prohibited.
The default value is:

<blockquote><pre>
timelineX,diff,annotate,fileage,file,finfo,reports,tree,hexdump,download
</pre></blockquote>

Each entry corresponds to the first path element of the URI for a
Fossil-generated page. If Fossil does not know for certain that the
HTTP request is coming from a human, then any attempt to access one of
these pages brings up a JavaScript-powered captcha. The user has to
click the accept button on the captcha once, and that sets a cookie allowing
the user to continue surfing without interruption for 15 minutes or so
before being presented with another captcha.

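The matching rule can be sketched as follows. This is only an illustration
of comma-separated GLOB matching against the first path element of a URI,
not Fossil's actual C implementation, and the function name is invented:

```javascript
// Illustrative sketch only -- not Fossil's actual matcher.
// Test the first path element of a request URI against a comma-separated
// list of GLOB patterns, in the spirit of the robot-restrict setting.
function robotRestricted(restrictSetting, uriPath) {
  const path = uriPath.split("?")[0];                  // drop any query string
  const first = path.replace(/^\//, "").split("/")[0]; // first path element
  return restrictSetting.split(",").some((glob) => {
    // translate the GLOB to an anchored regex: "*" => ".*", "?" => "."
    const re = new RegExp("^" +
      glob.replace(/[.+^${}()|[\]\\]/g, "\\$&")
          .replace(/\*/g, ".*")
          .replace(/\?/g, ".") + "$");
    return re.test(first);
  });
}
```

So, with the default setting above, a request for /diff or /file/src/main.c
would trigger the captcha, while /wiki/home would not.
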
Some path elements have special meanings:

  *  <b>timelineX →</b>
     This means a subset of /timeline/ pages that are considered
     "expensive". The exact definition of which timeline pages are
     expensive and which are not is still the subject of active
     experimentation and is likely to change by the time you read this
     text. The idea is that anybody (including robots) can see a timeline
     of the most recent changes, but timelines of long-ago changes, or
     timelines that contain lists of file changes or other
     harder-to-compute values, are prohibited.

  *  <b>zip →</b>
     The special "zip" keyword also matches "/tarball/" and "/sqlar/".

  *  <b>zipX →</b>
     This is like "zip" in that it restricts access to "/zip/", "/tarball/",
     and "/sqlar/", but with exceptions:<ol type="a">
     <li><p> If the [/help/robot-zip-leaf|robot-zip-leaf] setting is
     true, then tarballs of leaf check-ins are allowed. This permits
     URLs that attempt to download the latest check-in on trunk or
     from a named branch, for example.
     <li><p> If a check-in has a tag that matches the GLOB list in
     [/help/robot-zip-tag|robot-zip-tag], then tarballs of that
     check-in are allowed. This allows check-ins tagged with
     "release" or "allow-robots" (for example) to be downloaded
     without restriction.
     </ol>
     The "zipX" restriction is not in the default robot-restrict setting.
     This is something you might want to add, depending on your needs.

  *  <b>diff →</b>
     This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that
     is primarily about showing the difference between two check-ins or two
     file versions.

  *  <b>annotate →</b>
     This also matches /blame/ and /praise/.

Other special keywords may be added in the future.

The default [/help/robot-restrict|robot-restrict]
setting has been shown in practice to do a good job of keeping
robots from consuming all available CPU and bandwidth while
still allowing humans access to the full power of the site without
having to be logged in.

One possible enhancement is to add "zipX" to the
[/help/robot-restrict|robot-restrict] setting,
enable [/help?cmd=robot-zip-leaf|robot-zip-leaf],
and configure [/help?cmd=robot-zip-tag|robot-zip-tag].
Do this if you find that robots downloading lots of
obscure tarballs is causing load issues on your site.

<h2>Anti-robot Exception RegExps</h2>

The [/help/robot-exception|robot-exception setting], shown as
<b>Exceptions to anti-robot restrictions</b> on the
Robot Defense Settings page, is a list of
[/re_rules|regular expressions], one per line, that match
URIs that will bypass the captcha and allow robots full access. The
intent of this setting is to allow automated build scripts
to download specific tarballs of project snapshots.

The recommended value for this setting allows robots to use URIs of the
following form:

<blockquote>
<b>https://</b><i>DOMAIN</i><b>/tarball/release/</b><i>HASH</i><b>/</b><i>NAME</i><b>.tar.gz</b>
</blockquote>

The <i>HASH</i> part of this URL can be any valid
[./checkin_names.wiki|check-in name]. The link works as long as that
check-in is tagged with the "release" symbolic tag. In this way,
robots are permitted to download tarballs (and ZIP archives) of official
releases, but not every intermediate check-in between releases. Humans
who are willing to click the captcha can still download whatever they
want, but robots are blocked by the captcha. This prevents aggressive
robots from downloading tarballs of every historical check-in of your
project, once per day, which many robots these days seem eager to do.

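The bypass check amounts to testing the request URI against each regular
expression in the setting, one per line. A sketch of that idea follows; the
helper name and the sample pattern are illustrative assumptions, not the
exact recommended setting value:

```javascript
// Illustrative sketch only -- not Fossil's actual implementation.
// A URI that matches any non-blank line of the setting bypasses the captcha.
function robotException(exceptionSetting, uri) {
  return exceptionSetting.split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .some((line) => new RegExp(line).test(uri));
}

// A sample pattern in the spirit of the recommended value, admitting
// release-tarball URIs of the form shown above:
const sample = "^/tarball/release/[^/]+/[^/]+\\.tar\\.gz$";
```

The server still applies its own checks (such as the "release" tag test)
after the regular-expression match; the exception list only decides whether
the captcha is skipped.
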
For example, on the Fossil project itself, this URL will work, even for
robots:

<blockquote>
https://fossil-scm.org/home/tarball/release/version-2.27/fossil-scm.tar.gz
</blockquote>

But the next URL will not work for robots because check-in 3bbd18a284c8bd6a
is not tagged as a "release":

<blockquote>
https://fossil-scm.org/home/tarball/release/3bbd18a284c8bd6a/fossil-scm.tar.gz
</blockquote>

The second URL will work for humans, just not robots.

<h2>The Ongoing Struggle</h2>

Fossil currently does a good job of providing easy access to humans
while keeping out troublesome robots. However, robots
continue to grow more sophisticated, requiring ever more advanced
defenses. This "arms race" is unlikely to ever end. The developers of
Fossil will continue to try to improve the robot defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the robot
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].