Fossil SCM
Improvements to the anti-robot documentation page.
Commit: 8d3281267d9a38fc9d8f8a44b837cbdc8120898b9b573cdd08f9bd0af6271cd2
Parent: 819da69d7b6d37d…
2 files changed: src/robot.c (+1, -1), www/antibot.wiki (+116, -31)
| --- src/robot.c | ||
| +++ src/robot.c | ||
| @@ -280,11 +280,11 @@ | ||
| 280 | 280 | ** to anti-robot defenses and should be allowed through. For |
| 281 | 281 | ** example, to allow robots to download tarballs or ZIP archives |
| 282 | 282 | ** for named versions and releases, you could use an expression like |
| 283 | 283 | ** this: |
| 284 | 284 | ** |
| 285 | -** ^/(tarball|zip)\\b*\\b(version-|release)\\b | |
| 285 | +** ^/tarball/(version-[0-9.]+|release)/ | |
| 286 | 286 | ** |
| 287 | 287 | ** This setting can hold multiple regular expressions, one |
| 288 | 288 | ** regular expression per line. The input URL is exempted from |
| 289 | 289 | ** anti-robot defenses if any of the multiple regular expressions |
| 290 | 290 | ** matches. |
| 291 | 291 |
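As a quick sanity check, the revised exemption expression in the hunk above can be exercised against some sample request URIs. This sketch uses Python's `re` module; Fossil ships its own regular-expression engine, so treating the two as equivalent is an assumption, though it should hold for a pattern this simple. The sample URIs are made up.

```python
import re

# The revised exemption expression from the hunk above.  Checked here
# with Python's re module; Fossil uses its own regexp engine, but for
# a simple pattern like this the behavior should be the same.
exempt = re.compile(r"^/tarball/(version-[0-9.]+|release)/")

# URIs the pattern exempts from anti-robot defenses (sample values):
assert exempt.search("/tarball/release/abc123/project.tar.gz")
assert exempt.search("/tarball/version-2.27/project.tar.gz")

# URIs that remain subject to the defenses:
assert not exempt.search("/tarball/trunk/project.tar.gz")
assert not exempt.search("/zip/release/abc123/project.zip")
```

Note that, unlike the old expression, the revised one matches `/tarball/` paths only; a `/zip/` request is no longer exempted.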
| --- www/antibot.wiki | ||
| +++ www/antibot.wiki | ||
| @@ -1,17 +1,25 @@ | ||
| 1 | 1 | <title>Defense Against Robots</title> |
| 2 | 2 | |
| 3 | -A typical Fossil website can have millions of pages, and many of | |
| 4 | -those pages (for example diffs and annotations and tarballs) can | |
| 5 | -be expensive to compute. | |
| 3 | +A typical Fossil website can have billions and billions of pages, | |
| 4 | +and many of those pages (for example diffs and annotations and tarballs) | |
| 5 | +can be expensive to compute. | |
| 6 | 6 | If a robot walks a Fossil-generated website, |
| 7 | 7 | it can present a crippling bandwidth and CPU load. |
| 8 | 8 | |
| 9 | 9 | A Fossil website is intended to be used |
| 10 | 10 | interactively by humans, not walked by robots. This article |
| 11 | 11 | describes the techniques used by Fossil to try to welcome human |
| 12 | 12 | users while keeping out robots. |
| 13 | + | |
| 14 | +<h2>Setting Up Anti-Robot Defenses</h2> | |
| 15 | + | |
| 16 | +Admin users can configure robot defenses on the | |
| 17 | +"Robot Defense Settings" page (/setup_robot). | |
| 18 | +That page is accessible (to Admin users) from the default menu bar | |
| 19 | +by clicking on the "Admin" menu choice, then selecting the | |
| 20 | +"Robot-Defense" link from the list. | |
| 13 | 21 | |
| 14 | 22 | <h2>The Hyperlink User Capability</h2> |
| 15 | 23 | |
| 16 | 24 | Every Fossil web session has a "user". For random passers-by on the internet |
| 17 | 25 | (and for robots) that user is "nobody". The "anonymous" user is also |
| @@ -36,11 +44,11 @@ | ||
| 36 | 44 | |
| 37 | 45 | But requiring a login, even an anonymous login, can be annoying. |
| 38 | 46 | Fossil provides other techniques for blocking robots which |
| 39 | 47 | are less cumbersome to humans. |
| 40 | 48 | |
| 41 | -<h2>Automatic Hyperlinks Based on UserAgent</h2> | |
| 49 | +<h2>Automatic Hyperlinks Based on UserAgent and Javascript</h2> | |
| 42 | 50 | |
| 43 | 51 | Fossil has the ability to selectively enable hyperlinks for users |
| 44 | 52 | that lack the <b>Hyperlink</b> capability based on their UserAgent string in the |
| 45 | 53 | HTTP request header and on the browser's ability to run Javascript. |
| 46 | 54 | |
| @@ -68,21 +76,22 @@ | ||
| 68 | 76 | to "play nicely" on the internet and are quite open |
| 69 | 77 | about the fact that they are a robot. And so the UserAgent string |
| 70 | 78 | provides a good first-guess about whether or not a request originates |
| 71 | 79 | from a human or a robot. |
| 72 | 80 | |
| 73 | -In Fossil, under the Admin/Robot-Defense menu, there is a setting entitled | |
| 74 | -"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>". | |
| 75 | -If this setting is set to "UserAgent only" or "UserAgent and Javascript", | |
| 76 | -and if the UserAgent string looks like a human and not a robot, then | |
| 81 | +The [/help?cmd=auto-hyperlink|auto-hyperlink] setting, shown as | |
| 82 | +"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>" on | |
| 83 | +the Robot Defense Settings page, | |
| 84 | +can be set to "UserAgent only" or "UserAgent and Javascript" or "off". | |
| 85 | +If the UserAgent string looks like a human and not a robot, then | |
| 77 | 86 | Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability |
| 78 | 87 | is omitted from the user permissions. This setting gives humans easy |
| 79 | 88 | access to the hyperlinks while preventing robots |
| 80 | 89 | from walking the billions of pages on a typical Fossil site. |
| 81 | 90 | |
| 82 | -If the setting is "UserAgent only", then the hyperlinks are simply | |
| 83 | -enabled and that is all. But if the setting is "UserAgent and Javascript", | |
| 91 | +If the setting is "UserAgent only" (2), then the hyperlinks are simply | |
| 92 | +enabled and that is all. But if the setting is "UserAgent and Javascript" (1), | |
| 84 | 93 | then the hyperlinks are not enabled directly. |
| 85 | 94 | Instead, the HTML code that is generated contains anchor tags ("<a>") |
| 86 | 95 | with "href=" attributes that point to [/honeypot] rather than the correct |
| 87 | 96 | link. JavaScript code is added to the end of the page that goes back and |
| 88 | 97 | fills in the correct "href=" attributes of |
| @@ -92,30 +101,14 @@ | ||
| 92 | 101 | UserAgent string. Most robots do not bother to run JavaScript and |
| 93 | 102 | so to the robot the empty anchor tag will be useless. But all modern |
| 94 | 103 | web browsers implement JavaScript, so hyperlinks will show up |
| 95 | 104 | normally for human users. |
| 96 | 105 | |
| 97 | -<h2>Further Defenses</h2> | |
| 98 | - | |
| 99 | -Recently (as of this writing, in the spring of 2013) the Fossil server | |
| 100 | -on the SQLite website ([http://www.sqlite.org/src/]) has been hit repeatedly | |
| 101 | -by Chinese robots that use forged UserAgent strings to make them look | |
| 102 | -like normal web browsers and which interpret JavaScript. We do not | |
| 103 | -believe these attacks to be nefarious since SQLite is public domain | |
| 104 | -and the attackers could obtain all information they ever wanted to | |
| 105 | -know about SQLite simply by cloning the repository. Instead, we | |
| 106 | -believe these "attacks" are coming from "script kiddies". But regardless | |
| 107 | -of whether or not malice is involved, these attacks do present | |
| 108 | -an unnecessary load on the server which reduces the responsiveness of | |
| 109 | -the SQLite website for well-behaved and socially responsible users. | |
| 110 | -For this reason, additional defenses against | |
| 111 | -robots have been put in place. | |
| 112 | - | |
| 113 | -On the Admin/Robot-Defense page of Fossil, just below the | |
| 114 | -"<b>Enable hyperlinks using User-Agent and/or Javascript</b>" | |
| 115 | -setting, there are now two additional sub-settings that can be optionally | |
| 116 | -enabled to control hyperlinks. | |
| 106 | +If the [/help?cmd=auto-hyperlink|"auto-hyperlink"] setting is (2) | |
| 107 | +"<b>Enable hyperlinks using User-Agent and/or Javascript</b>", | |
| 108 | +then there are two additional sub-settings that control when | |
| 109 | +hyperlinks are enabled. | |
| 117 | 110 | |
| 118 | 111 | The first new sub-setting is a delay (in milliseconds) before setting |
| 119 | 112 | the "href=" attributes on anchor tags. The default value for this |
| 120 | 113 | delay is 10 milliseconds. The idea here is that a robot will try to |
| 121 | 114 | interpret the links on the page immediately, and will not wait for delayed |
| @@ -130,13 +123,105 @@ | ||
| 130 | 123 | |
| 131 | 124 | See also [./loadmgmt.md|Managing Server Load] for a description |
| 132 | 125 | of how expensive pages can be disabled when the server is under heavy |
| 133 | 126 | load. |
| 134 | 127 | |
| 128 | +<h2>Do Not Allow Robot Access To Certain Pages</h2> | |
| 129 | + | |
| 130 | +The [/help?cmd=robot-restrict|robot-restrict setting] is a comma-separated | |
| 131 | +list of GLOB patterns for pages for which robot access is prohibited. | |
| 132 | +The default value is: | |
| 133 | + | |
| 134 | +<blockquote><pre> | |
| 135 | +timelineX,diff,annotate,zip,fileage,file,finfo,reports | |
| 136 | +</pre></blockquote> | |
| 137 | + | |
| 138 | +Each entry corresponds to the first path element on the URI for a | |
| 139 | +Fossil-generated page. If Fossil does not know for certain that the | |
| 140 | +HTTP request is coming from a human, then any attempt to access one of | |
| 141 | +these pages brings up a javascript-powered captcha. The user has to | |
| 142 | +click the accept button on the captcha once, and that sets a cookie allowing | |
| 143 | +the user to continue surfing without interruption for 15 minutes or so | |
| 144 | +before being presented with another captcha. | |
| 145 | + | |
| 146 | +Some path elements have special meanings: | |
| 147 | + | |
| 148 | + * <b>timelineX →</b> | |
| 149 | + This means a subset of /timeline/ pages that are considered | |
| 150 | + "expensive". The exact definition of which timeline pages are | |
| 151 | + expensive and which are not is still the subject of active | |
| 152 | + experimentation and is likely to change by the time you read this | |
| 153 | + text. The idea is that anybody (including robots) can see a timeline | |
| 154 | + of the most recent changes, but timelines of long-ago changes or ones that | |
| 155 | + contain lists of file changes or other harder-to-compute values are | |
| 156 | + prohibited. | |
| 157 | + | |
| 158 | + * <b>zip →</b> | |
| 159 | + The special "zip" keyword also matches "/tarball/" and "/sqlar/". | |
| 160 | + | |
| 161 | + * <b>diff →</b> | |
| 162 | + This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that | |
| 163 | + is primarily about showing the difference between two check-ins or two | |
| 164 | + file versions. | |
| 165 | + | |
| 166 | + * <b>annotate →</b> | |
| 167 | + This also matches /blame/ and /praise/. | |
| 168 | + | |
| 169 | +Other special keywords may be added in the future. | |
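The matching rule described above (take the first path element of the URI, apply the special keywords, then check against the comma-separated GLOB list) can be sketched roughly as follows. All names here are hypothetical; Fossil's real implementation lives in src/robot.c and differs in detail, and the "timelineX" expensive-timeline heuristic is omitted for simplicity.

```python
from fnmatch import fnmatch

# Rough sketch (hypothetical names): decide whether the first path
# element of a URI falls under the robot-restrict GLOB list.
RESTRICT = "timelineX,diff,annotate,zip,fileage,file,finfo,reports".split(",")

# Special keywords, per the list above.  The "timelineX" heuristic for
# expensive timeline pages is omitted here.
ALIASES = {
    "tarball": "zip", "sqlar": "zip",
    "vdiff": "diff", "fdiff": "diff", "vpatch": "diff",
    "blame": "annotate", "praise": "annotate",
}

def robot_restricted(uri: str) -> bool:
    # First path element of the request URI.
    first = uri.lstrip("/").split("/", 1)[0]
    # Fold page names covered by a special keyword onto that keyword.
    first = ALIASES.get(first, first)
    # Any matching GLOB pattern triggers the captcha for robots.
    return any(fnmatch(first, pat) for pat in RESTRICT)

assert robot_restricted("/vdiff/a/b")            # folded onto "diff"
assert robot_restricted("/tarball/trunk/p.tar.gz")  # folded onto "zip"
assert not robot_restricted("/wiki/home")        # not in the list
```

A request that `robot_restricted` flags would get the captcha page unless it matches one of the exception regular expressions described below on this page.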
| 170 | + | |
| 171 | +The default [/help?cmd=robot-restrict|robot-restrict] | |
| 172 | +setting has been shown in practice to do a great job of keeping | |
| 173 | +robots from consuming all available CPU and bandwidth while | |
| 174 | +still allowing humans access to the full power of the site without | |
| 175 | +having to be logged in. | |
| 176 | + | |
| 177 | + | |
| 178 | +<h2>Anti-robot Exception RegExps</h2> | |
| 179 | + | |
| 180 | +The [/help?cmd=robot-exception|robot-exception setting], shown as | |
| 181 | +<b>Exceptions to anti-robot restrictions</b>, is a list of | |
| 182 | +[/re_rules|regular expressions], one per line, that match | |
| 183 | +URIs that will bypass the captcha and allow robots full access. The | |
| 184 | +intent of this setting is to allow automated build scripts | |
| 185 | +to download specific tarballs of project snapshots. | |
| 186 | + | |
| 187 | +The recommended value for this setting allows robots to use URIs of the | |
| 188 | +following form: | |
| 189 | + | |
| 190 | +<blockquote> | |
| 191 | +<b>https://</b><i>DOMAIN</i><b>/tarball/release/</b><i>HASH</i><b>/</b><i>NAME</i><b>.tar.gz</b> | |
| 192 | +</blockquote> | |
| 193 | + | |
| 194 | +The <i>HASH</i> part of this URL can be any valid | |
| 195 | +[./checkin_names.wiki|check-in name]. The link works as long as that | |
| 196 | +check-in is tagged with the "release" symbolic tag. In this way, | |
| 197 | +robots are permitted to download tarballs (and ZIP archives) of official | |
| 198 | +releases, but not every intermediate check-in between releases. Humans | |
| 199 | +who are willing to click the captcha can still download whatever they | |
| 200 | +want, but robots are blocked by the captcha. This prevents aggressive | |
| 201 | +robots from downloading tarballs of every historical check-in of your | |
| 202 | +project, once per day, which many robots these days seem eager to do. | |
| 203 | + | |
| 204 | +For example, on the Fossil project itself, this URL will work, even for | |
| 205 | +robots: | |
| 206 | + | |
| 207 | +<blockquote> | |
| 208 | +https://fossil-scm.org/home/tarball/release/version-2.27/fossil-scm.tar.gz | |
| 209 | +</blockquote> | |
| 210 | + | |
| 211 | +But the next URL will not work for robots because check-in 3bbd18a284c8bd6a | |
| 212 | +is not tagged as a "release": | |
| 213 | + | |
| 214 | +<blockquote> | |
| 215 | +https://fossil-scm.org/home/tarball/release/3bbd18a284c8bd6a/fossil-scm.tar.gz | |
| 216 | +</blockquote> | |
| 217 | + | |
| 218 | +The second URL will work for humans, just not robots. | |
| 219 | + | |
| 135 | 220 | <h2>The Ongoing Struggle</h2> |
| 136 | 221 | |
| 137 | -Fossil currently does a very good job of providing easy access to humans | |
| 222 | +Fossil currently does a good job of providing easy access to humans | |
| 138 | 223 | while keeping out troublesome robots. However, robots |
| 139 | 224 | continue to grow more sophisticated, requiring ever more advanced |
| 140 | 225 | defenses. This "arms race" is unlikely to ever end. The developers of |
| 141 | 226 | Fossil will continue to try to improve the robot defenses of Fossil so |
| 142 | 227 | check back from time to time for the latest releases and updates. |
| 143 | 228 |