Fossil SCM

Update the defense-against-robots documentation to align with current behavior.

Commit: c9082b29711617c0ca50634dd8acc7bfb58255d26b88b3a1d62e3fee5c8e7305
Parent: df337eb61c8e632…
2 files changed:
  www/antibot.wiki: +65 -66
  www/mkindex.tcl: +1 -1
--- www/antibot.wiki
+++ www/antibot.wiki
@@ -1,44 +1,43 @@
-<title>Defense Against Spiders</title>
-
-The website presented by a Fossil server has many hyperlinks.
-Even a modest project can have millions of pages in its
-tree, and many of those pages (for example diffs and annotations
-and ZIP archives of older check-ins) can be expensive to compute.
-If a spider or bot tries to walk a website implemented by
-Fossil, it can present a crippling bandwidth and CPU load.
-
-The website presented by a Fossil server is intended to be used
-interactively by humans, not walked by spiders. This article
+<title>Defense Against Robots</title>
+
+A typical Fossil website can have billions of pages, and many of
+those pages (for example diffs and annotations and tarballs) can
+be expensive to compute.
+If a robot walks a Fossil-generated website,
+it can present a crippling bandwidth and CPU load.
+
+A Fossil website is intended to be used
+interactively by humans, not walked by robots. This article
 describes the techniques used by Fossil to try to welcome human
-users while keeping out spiders.
+users while keeping out robots.
 
 <h2>The Hyperlink User Capability</h2>
 
 Every Fossil web session has a "user". For random passers-by on the internet
-(and for spiders) that user is "nobody". The "anonymous" user is also
+(and for robots) that user is "nobody". The "anonymous" user is also
 available for humans who do not wish to identify themselves. The difference
 is that "anonymous" requires a login (using a password supplied via
 a CAPTCHA) whereas "nobody" does not require a login.
 The site administrator can also create logins with
 passwords for specific individuals.
 
 Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
 do not see most Fossil-generated hyperlinks. This is
-a simple defense against spiders, since [./caps/#ucat | the "nobody"
+a simple defense against robots, since [./caps/#ucat | the "nobody"
 user category] does not have this capability by default.
 Users must log in (perhaps as
-"anonymous") before they can see any of the hyperlinks. A spider
+"anonymous") before they can see any of the hyperlinks. A robot
 that cannot log into your Fossil repository will be unable to walk
 its historical check-ins, create diffs between versions, pull zip
-archives, etc. by visiting links, because they aren't there.
+archives, etc. by visiting links, because there are no links.
 
 A text message appears at the top of each page in this situation to
 invite humans to log in as anonymous in order to activate hyperlinks.
 
-Because this required login step is annoying to some,
-Fossil provides other techniques for blocking spiders which
+But requiring a login, even an anonymous login, can be annoying.
+Fossil provides other techniques for blocking robots which
 are less cumbersome to humans.
 
 <h2>Automatic Hyperlinks Based on UserAgent</h2>
 
 Fossil has the ability to selectively enable hyperlinks for users
@@ -45,11 +44,11 @@
 that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
 HTTP request header and on the browsers ability to run Javascript.
 
 The UserAgent string is a text identifier that is included in the header
 of most HTTP requests that identifies the specific maker and version of
-the browser (or spider) that generated the request. Typical UserAgent
+the browser (or robot) that generated the request. Typical UserAgent
 strings look like this:
 
 <ul>
 <li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
 <li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
@@ -57,92 +56,92 @@
 <li> Wget/1.12 (openbsd4.9)
 </ul>
 
 The first two UserAgent strings above identify Firefox 19 and
 Internet Explorer 8.0, both running on Windows NT. The third
-example is the spider used by Google to index the internet.
+example is the robot used by Google to index the internet.
 The fourth example is the "wget" utility running on OpenBSD.
 Thus the first two UserAgent strings above identify the requester
-as human whereas the second two identify the requester as a spider.
+as human whereas the second two identify the requester as a robot.
 Note that the UserAgent string is completely under the control
-of the requester and so a malicious spider can forge a UserAgent
-string that makes it look like a human. But most spiders truly
-seem to desire to "play nicely" on the internet and are quite open
-about the fact that they are a spider. And so the UserAgent string
+of the requester and so a malicious robot can forge a UserAgent
+string that makes it look like a human. But most robots want
+to "play nicely" on the internet and are quite open
+about the fact that they are a robot. And so the UserAgent string
 provides a good first-guess about whether or not a request originates
-from a human or a spider.
-
-In Fossil, under the Admin/Access menu, there is a setting entitled
-"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>".
-If this setting is enabled, and if the UserAgent string looks like a
-human and not a spider, then Fossil will enable hyperlinks even if
-the <b>Hyperlink</b> capability is omitted from the user permissions. This setting
-gives humans easy access to the hyperlinks while preventing spiders
-from walking the millions of pages on a typical Fossil site.
-
-But the hyperlinks are not enabled directly with the setting above.
+from a human or a robot.
+
+In Fossil, under the Admin/Robot-Defense menu, there is a setting entitled
+"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>".
+If this setting is set to "UserAgent only" or "UserAgent and Javascript",
+and if the UserAgent string looks like a human and not a robot, then
+Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability
+is omitted from the user permissions. This setting gives humans easy
+access to the hyperlinks while preventing robots
+from walking the billions of pages on a typical Fossil site.
+
+If the setting is "UserAgent only", then the hyperlinks are simply
+enabled and that is all. But if the setting is "UserAgent and Javascript",
+then the hyperlinks are not enabled directly.
 Instead, the HTML code that is generated contains anchor tags ("<a>")
-without "href=" attributes. Then, JavaScript code is added to the
-end of the page that goes back and fills in the "href=" attributes of
-the anchor tags with the hyperlink targets, thus enabling the hyperlinks.
+with "href=" attributes that point to [/honeypot] rather than the correct
+link. JavaScript code is added to the end of the page that goes back and
+fills in the "href=" attributes of
+the anchor tags with the true hyperlink targets, thus enabling the hyperlinks.
 This extra step of using JavaScript to enable the hyperlink targets
-is a security measure against spiders that forge a human-looking
-UserAgent string. Most spiders do not bother to run JavaScript and
-so to the spider the empty anchor tag will be useless. But all modern
+is a security measure against robots that forge a human-looking
+UserAgent string. Most robots do not bother to run JavaScript and
+so the robot only ever sees the honeypot link. But all modern
 web browsers implement JavaScript, so hyperlinks will show up
 normally for human users.
 
 <h2>Further Defenses</h2>
 
 Recently (as of this writing, in the spring of 2013) the Fossil server
 on the SQLite website ([http://www.sqlite.org/src/]) has been hit repeatedly
-by Chinese spiders that use forged UserAgent strings to make them look
+by Chinese robots that use forged UserAgent strings to make them look
 like normal web browsers and which interpret JavaScript. We do not
 believe these attacks to be nefarious since SQLite is public domain
 and the attackers could obtain all information they ever wanted to
 know about SQLite simply by cloning the repository. Instead, we
 believe these "attacks" are coming from "script kiddies". But regardless
 of whether or not malice is involved, these attacks do present
 an unnecessary load on the server which reduces the responsiveness of
 the SQLite website for well-behaved and socially responsible users.
 For this reason, additional defenses against
-spiders have been put in place.
+robots have been put in place.
 
-On the Admin/Access page of Fossil, just below the
-"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>"
+On the Admin/Robot-Defense page of Fossil, just below the
+"<b>Enable hyperlinks using User-Agent and/or Javascript</b>"
 setting, there are now two additional sub-settings that can be optionally
 enabled to control hyperlinks.
 
-The first sub-setting waits to run the
-JavaScript that sets the "href=" attributes on anchor tags until after
-at least one "mouseover" event has been detected on the <body>
-element of the page. The thinking here is that spiders will not be
-simulating mouse motion and so no mouseover events will ever occur and
-hence the hyperlinks will never become enabled for spiders.
-
-The second new sub-setting is a delay (in milliseconds) before setting
-the "href=" attributes on anchor tags. The default value for this
-delay is 10 milliseconds. The idea here is that a spider will try to
-render the page immediately, and will not wait for delayed scripts
-to be run, thus will never enable the hyperlinks.
-
-These two sub-settings can be used separately or together. If used together,
-then the delay timer does not start until after the first mouse movement
-is detected.
+The first new sub-setting is a delay (in milliseconds) before setting
+the "href=" attributes on anchor tags. The default value for this
+delay is 10 milliseconds. The idea here is that a robot will try to
+interpret the links on the page immediately, and will not wait for delayed
+scripts to be run, and thus will never enable the true links.
+
+The second sub-setting waits to run the
+JavaScript that sets the "href=" attributes on anchor tags until after
+at least one "mousedown" or "mousemove" event has been detected on the
+<body> element of the page. The thinking here is that robots will not be
+simulating mouse motion and so no mouse events will ever occur and
+hence the hyperlinks will never become enabled for robots.
 
 See also [./loadmgmt.md|Managing Server Load] for a description
 of how expensive pages can be disabled when the server is under heavy
 load.
 
 <h2>The Ongoing Struggle</h2>
 
 Fossil currently does a very good job of providing easy access to humans
-while keeping out troublesome robots and spiders. However, spiders and
-bots continue to grow more sophisticated, requiring ever more advanced
+while keeping out troublesome robots. However, robots
+continue to grow more sophisticated, requiring ever more advanced
 defenses. This "arms race" is unlikely to ever end. The developers of
-Fossil will continue to try improve the spider defenses of Fossil so
+Fossil will continue to try to improve the robot defenses of Fossil so
 check back from time to time for the latest releases and updates.
 
-Readers of this page who have suggestions on how to improve the spider
+Readers of this page who have suggestions on how to improve the robot
 defenses in Fossil are invited to submit your ideas to the Fossil Users
 forum:
 [https://fossil-scm.org/forum].
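The honeypot-plus-JavaScript activation scheme added in the diff above can be sketched roughly as follows. This is an illustrative sketch based only on the documentation text, not Fossil's actual code: the `data-real-href` attribute, the function names, and the mousemove-then-delay wiring (the text also mentions "mousedown") are assumptions made for the example.

```javascript
// Sketch of the "UserAgent and Javascript" mode described above:
// anchors are emitted pointing at /honeypot, and a script later
// swaps in the true target.  Names here are hypothetical.

// Pure helper: prefer the true target when one is present,
// otherwise leave the honeypot href in place.
function resolveHref(currentHref, realTarget) {
  return realTarget ? realTarget : currentHref;
}

// Browser-only wiring: after the first mouse event on <body>,
// wait delayMs, then rewrite every anchor still pointing at
// the honeypot.
function activateLinks(delayMs) {
  const fill = () => {
    document.querySelectorAll('a[href="/honeypot"]').forEach((a) => {
      a.setAttribute(
        "href",
        resolveHref(a.getAttribute("href"), a.getAttribute("data-real-href"))
      );
    });
  };
  document.body.addEventListener(
    "mousemove",
    () => setTimeout(fill, delayMs),
    { once: true }
  );
}

// A robot that never runs scripts, or never moves a mouse, only
// ever sees /honeypot; a human browser gets the real targets.
if (typeof document !== "undefined") {
  activateLinks(10); // 10 ms, the default delay mentioned above
}
```

The two sub-settings described in the diff map onto the two gates in this sketch: the `setTimeout` delay defeats robots that evaluate scripts but do not wait, and the one-shot mouse listener defeats robots that never simulate pointer input.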
--- www/mkindex.tcl
+++ www/mkindex.tcl
@@ -14,11 +14,11 @@
 aboutcgi.wiki {How CGI Works In Fossil}
 aboutdownload.wiki {How The Download Page Works}
 adding_code.wiki {Adding New Features To Fossil}
 adding_code.wiki {Hacking Fossil}
 alerts.md {Email Alerts And Notifications}
-antibot.wiki {Defense against Spiders and Bots}
+antibot.wiki {Defense against Spiders and Robots}
 backoffice.md {The "Backoffice" mechanism of Fossil}
 backup.md {Backing Up a Remote Fossil Repository}
 blame.wiki {The Annotate/Blame Algorithm Of Fossil}
 blockchain.md {Is Fossil A Blockchain?}
 branching.wiki {Branching, Forking, Merging, and Tagging}