<title>Defense Against Robots</title>

A typical Fossil website can have billions and billions of pages,
and many of those pages (for example diffs and annotations and tarballs)
can be expensive to compute.
If a robot walks a Fossil-generated website,
it can present a crippling bandwidth and CPU load.
A "robots.txt" file can help, but in practice, most robots these
days ignore the robots.txt file, so it won't help much.

A Fossil website is intended to be used
interactively by humans, not walked by robots. This article
describes the techniques used by Fossil to try to welcome human
users while keeping out robots.

<h2>Defenses Are Enabled By Default</h2>

In the latest implementations of Fossil, most robot defenses are
enabled by default. You can probably get by with standing up a
public-facing Fossil instance in the default configuration. But
you can also customize the defenses to serve your particular needs.

<h2>Customizing Anti-Robot Defenses</h2>

Admin users can configure robot defenses on the
"Robot Defense Settings" page (/setup_robot).
That page is accessible (to Admin users) from the default menu bar
by clicking on the "Admin" menu choice, then selecting the
"Robot-Defense" link from the list.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user". For random passers-by on the internet
(and for robots) that user is "nobody". The "anonymous" user is also
available for humans who do not wish to identify themselves. The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against robots, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks. A robot
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull zip
archives, etc. by visiting links, because there are no links.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

But requiring a login, even an anonymous login, can be annoying.
Fossil provides other techniques for blocking robots that
are less cumbersome for humans.

<h2>Automatic Hyperlinks Based on UserAgent and Javascript</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run JavaScript.

The UserAgent string is a text identifier that is included in the header
of most HTTP requests and that identifies the specific maker and version of
the browser (or robot) that generated the request. Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT. The third
example is the robot used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the last two identify the requester as a robot.
Note that the UserAgent string is completely under the control
of the requester and so a malicious robot can forge a UserAgent
string that makes it look like a human. But most robots want
to "play nicely" on the internet and are quite open
about the fact that they are a robot. And so the UserAgent string
provides a good first guess about whether a request originates
from a human or a robot.

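As a rough illustration of this first-guess test, the Python sketch below classifies the four example strings listed above. The function name <b>looks_human</b> and the particular prefix and keyword lists are assumptions made for this example only. They are not Fossil's actual rules, which are defined in the Fossil source code and may differ.

```python
# Hypothetical first-guess classifier; NOT Fossil's actual heuristic.
ROBOT_HINTS = ("bot", "spider", "crawl")    # assumed robot keywords
BROWSER_PREFIXES = ("Mozilla/", "Opera/")   # assumed browser prefixes


def looks_human(user_agent: str) -> bool:
    """Guess whether a UserAgent string belongs to an interactive browser."""
    lowered = user_agent.lower()
    if any(hint in lowered for hint in ROBOT_HINTS):
        return False  # the requester openly identifies itself as a robot
    # Tools such as wget do not claim to be a browser at all.
    return user_agent.startswith(BROWSER_PREFIXES)


EXAMPLES = [
    "Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Wget/1.12 (openbsd4.9)",
]

if __name__ == "__main__":
    for ua in EXAMPLES:
        print("human" if looks_human(ua) else "robot", "<-", ua)
```

A forged UserAgent string passes a test like this one trivially, which is why the result is only a first guess.
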
The [/help/auto-hyperlink|auto-hyperlink] setting, shown as
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>" on
the Robot Defense Settings page,
can be set to "UserAgent only" or "UserAgent and Javascript" or "off".
If the UserAgent string looks like it comes from a human and not a robot, then
Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability
is omitted from the user permissions. This setting gives humans easy
access to the hyperlinks while preventing robots
from walking the billions of pages on a typical Fossil site.

If the setting is "UserAgent only" (2), then the hyperlinks are simply
enabled and that is all. But if the setting is "UserAgent and Javascript" (1),
then the hyperlinks are not enabled directly.
Instead, the HTML code that is generated contains anchor tags ("&lt;a&gt;")
with "href=" attributes that point to [/honeypot] rather than the correct
link. JavaScript code is added to the end of the page that goes back and
fills in the correct "href=" attributes of
the anchor tags with the true hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against robots that forge a human-looking
UserAgent string. Most robots do not bother to run JavaScript and
so to the robot the anchor tags, which still point to /honeypot, will be
useless. But all modern
web browsers implement JavaScript, so hyperlinks will show up
normally for human users.

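The effect of the decoy "href=" attributes on a robot that does not run JavaScript can be sketched in Python. The helper <b>render_anchor</b>, the <b>NaiveCrawler</b> class, and the "data-href" attribute name are assumptions for this illustration. The markup that Fossil actually generates may be different.

```python
from html import escape
from html.parser import HTMLParser


def render_anchor(target: str, label: str) -> str:
    """Emit an anchor whose href is a decoy. The real target is parked in a
    data attribute (attribute name assumed) for a script to restore later."""
    return '<a href="/honeypot" data-href="{}">{}</a>'.format(
        escape(target, quote=True), escape(label)
    )


class NaiveCrawler(HTMLParser):
    """Collects href values the way a robot without JavaScript would."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


page = "\n".join(
    render_anchor(target, label)
    for target, label in [("/timeline", "Timeline"), ("/tree", "Files")]
)
crawler = NaiveCrawler()
crawler.feed(page)
print(crawler.links)  # every link the crawler sees is the decoy
```

A browser that runs the page's script would copy each stored target back into "href=", so only clients that skip the script are left with the decoy.
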
If the [/help/auto-hyperlink|auto-hyperlink] setting is
"UserAgent and Javascript" (1),
then two additional sub-settings control when
hyperlinks are enabled.

The first sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags. The default value for this
delay is 10 milliseconds. The idea here is that a robot will try to
interpret the links on the page immediately, and will not wait for delayed
scripts to be run, and thus will never enable the true links.

The second sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mousedown" or "mousemove" event has been detected on the
&lt;body&gt; element of the page. The thinking here is that robots will not be
simulating mouse motion and so no mouse events will ever occur and
hence the hyperlinks will never become enabled for robots.

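The two sub-settings can be modeled as a small decision function. This Python sketch is a simplification: the function name, the parameter names, and the rule that both conditions must hold when both are active are assumptions. The real logic runs as JavaScript in the browser.

```python
def links_enabled(elapsed_ms: float, mouse_event_seen: bool,
                  require_mouse: bool, delay_ms: float = 10) -> bool:
    """Model of when the true href values would be filled in.

    Assumes, for illustration only, that the delay must have expired and,
    if mouse events are required, that at least one has been seen.
    """
    if elapsed_ms < delay_ms:
        return False  # the delay has not expired yet
    if require_mouse and not mouse_event_seen:
        return False  # no "mousedown" or "mousemove" has been seen
    return True


# A robot that reads the page at load time and never moves a mouse:
print(links_enabled(elapsed_ms=0, mouse_event_seen=False, require_mouse=True))
# A human who moves the mouse half a second after the page loads:
print(links_enabled(elapsed_ms=500, mouse_event_seen=True, require_mouse=True))
```
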
See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>Do Not Allow Robot Access To Certain Pages</h2>

The [/help/robot-restrict|robot-restrict setting] is a comma-separated
list of GLOB patterns for pages for which robot access is prohibited.
The default value is:

<blockquote><pre>
timelineX,diff,annotate,fileage,file,finfo,reports,tree,hexdump,download
</pre></blockquote>

Each entry corresponds to the first path element on the URI for a
Fossil-generated page. If Fossil does not know for certain that the
HTTP request is coming from a human, then any attempt to access one of
these pages brings up a JavaScript-powered captcha. The user has to
click the accept button on the captcha once, and that sets a cookie allowing
the user to continue surfing without interruption for 15 minutes or so
before being presented with another captcha.

Some path elements have special meanings:

* <b>timelineX &rarr;</b>
This means a subset of /timeline/ pages that are considered
"expensive". The exact definition of which timeline pages are
expensive and which are not is still the subject of active
experimentation and is likely to change by the time you read this
text. The idea is that anybody (including robots) can see a timeline
of the most recent changes, but timelines of long-ago changes, or timelines that
contain lists of file changes or other harder-to-compute values, are
prohibited.

* <b>zip &rarr;</b>
The special "zip" keyword also matches "/tarball/" and "/sqlar/".

* <b>zipX &rarr;</b>
This is like "zip" in that it restricts access to "/zip/", "/tarball/"
and "/sqlar/" but with exceptions:<ol type="a">
<li><p> If the [/help/robot-zip-leaf|robot-zip-leaf] setting is
true, then tarballs of leaf check-ins are allowed. This permits
URLs that attempt to download the latest check-in on trunk or
from a named branch, for example.
<li><p> If a check-in has a tag that matches the GLOB list in
[/help/robot-zip-tag|robot-zip-tag], then tarballs of that
check-in are allowed. This allows check-ins tagged with
"release" or "allow-robots" (for example) to be downloaded
without restriction.
</ol>
The "zipX" restriction is not in the default robot-restrict setting.
This is something you might want to add, depending on your needs.

* <b>diff &rarr;</b>
This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that
is primarily about showing the difference between two check-ins or two
file versions.

* <b>annotate &rarr;</b>
This also matches /blame/ and /praise/.

Other special keywords may be added in the future.

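The matching rule described above (first path element compared against the comma-separated list, with some keywords covering extra pages) can be approximated in Python. This is a sketch under stated assumptions: <b>is_restricted</b> and <b>KEYWORD_ALIASES</b> are names invented for the example, Python's fnmatch stands in for Fossil's GLOB matching, the alias table lists only the pages named above, and the "timelineX" and "zipX" keywords are not modeled because they depend on more than the path.

```python
from fnmatch import fnmatchcase

DEFAULT_ROBOT_RESTRICT = (
    "timelineX,diff,annotate,fileage,file,finfo,reports,tree,hexdump,download"
)

# Extra page names covered by the special keywords (not exhaustive).
KEYWORD_ALIASES = {
    "zip": ("tarball", "sqlar"),
    "diff": ("vdiff", "fdiff", "vpatch"),
    "annotate": ("blame", "praise"),
}


def first_path_element(path: str) -> str:
    """Return the first path element of a repository-relative URI path."""
    return path.split("?", 1)[0].strip("/").split("/", 1)[0]


def is_restricted(path: str, setting: str = DEFAULT_ROBOT_RESTRICT) -> bool:
    """Approximate whether an unverified client would be shown the captcha."""
    page = first_path_element(path)
    for pattern in (p.strip() for p in setting.split(",")):
        if not pattern:
            continue
        if fnmatchcase(page, pattern):
            return True  # direct GLOB match on the page name
        if page in KEYWORD_ALIASES.get(pattern, ()):
            return True  # page is covered by a special keyword
    return False


print(is_restricted("/vdiff?from=a&to=b"))   # covered by the "diff" keyword
print(is_restricted("/timeline"))            # plain timeline is not listed
```
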
The default [/help/robot-restrict|robot-restrict]
setting has been shown in practice to do a good job of keeping
robots from consuming all available CPU and bandwidth while
still allowing humans access to the full power of the site without
having to be logged in.

One possible enhancement is to add "zipX" to the
[/help/robot-restrict|robot-restrict] setting,
and enable [/help/robot-zip-leaf|robot-zip-leaf]
and configure [/help/robot-zip-tag|robot-zip-tag].
Do this if you find that robots downloading lots of
obscure tarballs is causing load issues on your site.

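The two "zipX" exceptions can be sketched as a single predicate. In this Python illustration the function name and its parameters are hypothetical, the GLOB list is assumed to be separated by commas or whitespace, and fnmatch is used as a stand-in for Fossil's own GLOB matching.

```python
import re
from fnmatch import fnmatchcase
from typing import Iterable


def zipx_allows_robot(is_leaf: bool, tags: Iterable[str],
                      robot_zip_leaf: bool, robot_zip_tag: str) -> bool:
    """Return True if a robot may fetch an archive of this check-in."""
    if robot_zip_leaf and is_leaf:
        return True  # exception (a): leaf check-ins are allowed
    patterns = [p for p in re.split(r"[,\s]+", robot_zip_tag) if p]
    # exception (b): any tag on the check-in matches the GLOB list
    return any(fnmatchcase(tag, pat) for tag in tags for pat in patterns)


# An old, non-leaf check-in that carries the "release" tag:
print(zipx_allows_robot(False, ["release", "version-2.27"],
                        robot_zip_leaf=True,
                        robot_zip_tag="release,allow-robots"))
```
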
<h2>Anti-robot Exception RegExps</h2>

The [/help/robot-exception|robot-exception setting], shown under the name
<b>Exceptions to anti-robot restrictions</b>, is a list of
[/re_rules|regular expressions], one per line, that match
URIs that will bypass the captcha and allow robots full access. The
intent of this setting is to allow automated build scripts
to download specific tarballs of project snapshots.

The recommended value for this setting allows robots to use URIs of the
following form:

<blockquote>
<b>https://</b><i>DOMAIN</i><b>/tarball/release/</b><i>HASH</i><b>/</b><i>NAME</i><b>.tar.gz</b>
</blockquote>

The <i>HASH</i> part of this URL can be any valid
[./checkin_names.wiki|check-in name]. The link works as long as that
check-in is tagged with the "release" symbolic tag. In this way,
robots are permitted to download tarballs (and ZIP archives) of official
releases, but not every intermediate check-in between releases. Humans
who are willing to click the captcha can still download whatever they
want, but robots are blocked by the captcha. This prevents aggressive
robots from downloading tarballs of every historical check-in of your
project, once per day, which many robots these days seem eager to do.

For example, on the Fossil project itself, this URL will work, even for
robots:

<blockquote>
https://fossil-scm.org/home/tarball/release/version-2.27/fossil-scm.tar.gz
</blockquote>

But the next URL will not work for robots because check-in 3bbd18a284c8bd6a
is not tagged as a "release":

<blockquote>
https://fossil-scm.org/home/tarball/release/3bbd18a284c8bd6a/fossil-scm.tar.gz
</blockquote>

The second URL will work for humans, just not robots.

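To make the URI shape concrete, the Python sketch below tests URLs against a pattern of the form shown above. The pattern is hypothetical and is written in Python's regular expression dialect. It is not the recommended value of the setting, and Fossil's own regular expression syntax (see [/re_rules]) differs in places.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern for the /tarball/release/HASH/NAME.tar.gz shape.
RELEASE_TARBALL = re.compile(r"/tarball/release/[^/]+/[^/]+\.tar\.gz$")


def matches_exception(url: str) -> bool:
    """True if the URL path has the /tarball/release/HASH/NAME.tar.gz shape."""
    return RELEASE_TARBALL.search(urlsplit(url).path) is not None


tagged = "https://fossil-scm.org/home/tarball/release/version-2.27/fossil-scm.tar.gz"
untagged = "https://fossil-scm.org/home/tarball/release/3bbd18a284c8bd6a/fossil-scm.tar.gz"
other = "https://fossil-scm.org/home/tarball/trunk/fossil-scm.tar.gz"

for url in (tagged, untagged, other):
    print(matches_exception(url), url)
```

A pattern like this cannot tell the two example URLs apart, since both have the same shape. As described above, the second one fails for robots because the check-in is not tagged "release", a test that Fossil makes when it serves the request.
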
<h2>The Ongoing Struggle</h2>

Fossil currently does a good job of providing easy access to humans
while keeping out troublesome robots. However, robots
continue to grow more sophisticated, requiring ever more advanced
defenses. This "arms race" is unlikely to ever end. The developers of
Fossil will continue to try to improve the robot defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the robot
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].
