to dirprocess: robots.txt file                                            rev 12 aug 2021

Category: websites



.......................................................
summary of robots.txt file: 

  Anyone can see the file by appending /robots.txt to the domain -
    so anyone can see what pages you do or don’t want crawled.
    Don’t use it to hide private user information.

  Best practice: list the location of any sitemaps associated
    with the domain at the bottom of the robots.txt file.

  IS case-sensitive.

  Can use comments: '#' starts a comment, on its own line or trailing a directive line.

  Use a blank line between each directive group (each User-agent block).

  Order of the groups doesn't matter; a bot follows the most granular
    rule that matches it (see the sketch below).
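
  A minimal sketch pulling these rules together; the domain, paths,
    and the 'ExampleBot' name are placeholders:

  # comments start with '#' and may also trail a directive line
  User-agent: *
  Disallow: /downloads/
  Allow: /downloads/free/    # more granular rule, so it wins regardless of order
                             # paths are case-sensitive: /Downloads/ is not /downloads/

  User-agent: ExampleBot
  Disallow: /

  Sitemap: https://www.DOMAIN.TLD/my-sitemap.xml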


robots.txt checker:
  do a web search, there are several.

Crawl-delay:
  https://yoast.com/ultimate-guide-robots-txt/#crawl-delay-directive
  https://www.contentkingapp.com/academy/robotstxt/faq/crawl-delay-10/
  crawl-delay: 10 - milliseconds or seconds? Seconds, per most sources.
  Googlebot does not acknowledge crawl-delay; have to set the crawl rate
    in their Search Console instead. pbbbbfft
  600 seconds = 10 minutes
  crawl-delay of 10 seconds = 8,640 pages a day (86,400 seconds per day / 10).
  (Placement example below.)
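
  Crawl-delay belongs inside a user-agent group; a minimal sketch,
    using Bingbot as an example of a crawler that honors it:

  User-agent: Bingbot
  Crawl-delay: 10    # at most one request every 10 seconds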

Info about robots.txt: (These are all good, each has some unique useful details.)
  https://www.robotstxt.org/robotstxt.html
  https://www.cloudflare.com/learning/bots/what-is-robots.txt/
  https://moz.com/learn/seo/robotstxt
  https://yoast.com/ultimate-guide-robots-txt/
  https://yoast.com/wordpress-robots-txt-example/
  https://en.wikipedia.org/wiki/Robots_exclusion_standard
  https://ahrefs.com/blog/robots-txt/



.......................................................
Robots lists:

  https://www.robotstxt.org/db.html
  http://www.botsvsbrowsers.com/


.......................................................
Bad crawling bots: 
   They are mostly analytics aggregators; their data is useful mainly
     to the people/companies who subscribe to them.
   Their request volume eats up server resources and bandwidth.
   If not blocked from your website outright, these bots tend to
     obey the crawl-delay directive in robots.txt.
   More than half of web traffic comes from robots, not from real users.
   [21 jun 2018]

  Blocking them gets you
    less spam
    safer website
    less stolen content
    lower bandwidth use

  Can block in robots.txt, by user-agent string - but bots can ignore it.
  Can block at the server level, in the .htaccess file (sketch below).
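
  A minimal .htaccess sketch, assuming Apache 2.4 with mod_setenvif
    and mod_authz_core enabled; the bot names are just examples:

  # Tag any request whose user-agent matches a listed bot name (case-insensitive)
  BrowserMatchNoCase "AhrefsBot|SemrushBot|MJ12Bot" bad_bot
  # Refuse tagged requests at the server, before any CMS code runs
  <RequireAll>
      Require all granted
      Require not env bad_bot
  </RequireAll>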


......................... 
blocking the bad bots:
  These are the bots I'm currently blocking, based on the info
    at simtechdev.com (url below), plus some I'm seeing
    in site logs.
    See canonical file in folder.

# This is the basic stuff I would have on all sites.
# Crawl-delay only takes effect inside a user-agent group, so it goes here:

User-agent: *
Crawl-delay: 20
Disallow: /cgi-bin

User-agent: AhrefsBot
Disallow: /

User-agent: AspiegelBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: MauiBot
Disallow: /

User-agent: MJ12Bot
Disallow: /

User-agent: oBot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: SEMrushBot
Disallow: /

# These blocks disallow the Facebook crawler; drop them if your site is
# active on fb. Matching is on the user-agent token, so 'facebookexternalhit'
# covers both of its full user-agent strings.
User-agent: facebookexternalhit
Disallow: /

User-agent: Facebook
Disallow: /

Sitemap: https://www.DOMAIN.TLD/my-sitemap.xml


.........................
Can combine several user-agents into one block, like this:

User-agent: AhrefsBot
User-agent: Neevabot 
User-agent: SemrushBot
Disallow: /

.......................................................
resources:

  https://simtechdev.com/blog/good-and-bad-bots-to-control-to-save-server-resources-and-improve-performance/
    Explains about the bots.
    Lists user-agent string and summary of bad and good bots.

  * List of 1800 bad bots
    and very good info.
    https://tab-studio.com/en/blocking-robots-on-your-page/
    [2017]
    List last updated 2017; latest comments dec 2018, promising
      an updated list. Nothing since then.

  * Robots.txt
    jun 2018
    https://www.deepcrawl.com/knowledge/technical-seo-library/robots-txt/

  * Robots.txt for SEO: The Ultimate Guide
    jan 2021
    https://www.contentkingapp.com/academy/robotstxt/

  * The ultimate guide to robots.txt
    mar 2021
    https://yoast.com/ultimate-guide-robots-txt/

  * 14 Common Issues with the Robots.txt File in SEO (and How to Avoid Them)
    apr 2020
    https://www.seoclarity.net/blog/understanding-robots-txt



.........................
bad bot problems:

  * Bot Attacks: You are not alone…
    apr 2021
    https://medium.com/expedia-group-tech/bot-attacks-you-are-not-alone-d8b3290342bd

  * What is bot management? | How bot managers work
    https://www.cloudflare.com/en-gb/learning/bots/what-is-bot-management/



......................... 
specific bots:

  * Facebook Crawler
      Crawls the HTML of an app or website that was shared on Facebook 
        via copying and pasting the link or by a Facebook social plugin. 
        The crawler gathers, caches, and displays information about
        the app or website such as its title, description, and thumbnail image.
      You may want to allow the facebook crawler if your site is
        active on fb (sketch below).
      User agent strings are
        facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
        facebookexternalhit/1.1
        TibetSun is getting one called just 'Facebook'.
      https://developers.facebook.com/docs/sharing/webmasters/crawler
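
      If you block most bots wholesale, the Facebook crawler can be
        allowed back with an empty Disallow; a minimal sketch:

      User-agent: facebookexternalhit
      Disallow: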

  * oBot
      "oBot is the web crawling bot of the Content Security Division of 
       IBM Germany Research & Development GmbH. ... [results in a]
       database that is made available to our customers in several content 
       filtering products."
      https://www.reddit.com/r/bigseo/comments/jigeg5/obot_do_you_know_this_bot/
      Original info at http://filterdb.iss.net/crawler/
        but it has a bad cert; didn't open it. [31 mar 2021]

  * SEMrushbot
    - https://dmjcomputerservices.com/blog/blocking-semrushbot-from-website/



_______________________________________________________
begin 29 mar 2021
-- 0 --