to dirprocess: bots - info         rev 28 jun 2021

Category: websites


.......................................................
summaries:

  See info about bad crawling bots in process-web-bots-info.html
  How to block them in the robots.txt file: process-web-bots-block_with_robots_txt.html
  However, since they are by definition 'bad', some bots ignore the robots.txt file.
  They can be blocked in the .htaccess file.

  "optimize good ones by altering robots.txt;
   block bad ones by IP address in .htaccess."

  * What is bot traffic, and how to stop it?
      https://www.cloudflare.com/learning/bots/what-is-a-bot/
      https://www.cloudflare.com/learning/bots/what-is-bot-traffic/

.......................................................
Bad crawling bots: 
   They are analytics aggregators; their data is mostly useful to
     the people/companies who subscribe to them.
   Their request volume eats up too much server capacity and bandwidth.
   If not blocked from your website outright, these bots tend to
      obey the Crawl-delay directive in robots.txt.
   More than half of web traffic comes from robots, not from real users.
   [21 jun 2018]
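
  A quick way to check what delay a robots.txt file actually asks of a
  given bot is Python's standard urllib.robotparser module. This is only
  a sketch: the robots.txt content and the 10-second value are made up
  for illustration, and SemrushBot is used just as an example name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow one crawler down, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: SemrushBot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "SemrushBot"))
    print(crawl_delay_for(ROBOTS_TXT, "Googlebot"))
```

  Crawl-delay is only a request; a bot that ignores robots.txt will
  ignore this too.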

  Blocking them gets you
    less spam
    a safer website
    less stolen content
    lower bandwidth use

  Can block in robots.txt, by user agent string - but bots can ignore it.
  Can block at server level, in the .htaccess file.
  Can filter them out of reports with Google Analytics - but that doesn't block them. (?)
  Can block with proxy like Cloudflare.
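
  The first two options above can be sketched as a small Python script
  that writes out the rules for a list of bots. Assumptions: the server
  is Apache with mod_rewrite enabled, blocking is by user agent string
  rather than IP address, and the bot names are placeholders
  (ExampleBadBot is made up). Check the output before pasting it
  into a live site.

```python
import re

# Placeholder names; replace with user agent substrings seen in your own logs.
BAD_BOTS = ["SemrushBot", "ExampleBadBot"]


def robots_txt_rules(bots):
    """Ask each named bot to stay out of the whole site (bots may ignore this)."""
    blocks = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "\n\n".join(blocks) + "\n"


def htaccess_rules(bots):
    """Return mod_rewrite rules that answer 403 Forbidden to matching user agents."""
    # Escape each name so characters like '.' are matched literally.
    pattern = "|".join(re.escape(bot) for bot in bots)
    return (
        "RewriteEngine On\n"
        f"RewriteCond %{{HTTP_USER_AGENT}} ({pattern}) [NC]\n"
        "RewriteRule .* - [F,L]\n"
    )


if __name__ == "__main__":
    print(robots_txt_rules(BAD_BOTS))
    print(htaccess_rules(BAD_BOTS))
```

  The [NC] flag makes the match case-insensitive, so very short names
  are risky: a pattern like "obot" would also match any user agent
  containing "robot".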


.......................................................
sources:

  * How to block bad website bots and spiders With .htaccess tweaks 
  
      https://www.seoblog.com/block-bots-spiders-htaccess/
      mar 2018
      best one - 
        explains about the different kinds of bots.
        blocking with robots.txt
        blocking with .htaccess (on apache)
          examples of different syntax.
          explanations of the syntax!


  - Pre-made lists:
      https://www.robotstxt.org/db.html
      http://www.botsvsbrowsers.com/
      Trouble is these are all old. The bots change. But they give an idea anyway:
       https://pastebin.com/5Hw9KZnW
         jun 2012
       https://tab-studio.com/en/blocking-robots-on-your-page/
         dec 2017
       https://stackoverflow.com/questions/27431228/how-to-block-bad-bots-in-htaccess
         2014, 2018

  - How to Identify Robots with Apache Logs
      https://www.sumologic.com/insight/apache-logs-identifying-robots/
      may 2019
      (I've just been checking my awstats and getting user agent strings from there)


  https://simtechdev.com/blog/good-and-bad-bots-to-control-to-save-server-resources-and-improve-performance/
    Explains about the bots.
    Lists user-agent string and summary of bad and good bots.

  * List of 1800 bad bots
    and very good info.
    https://tab-studio.com/en/blocking-robots-on-your-page/
    [2017]
    List last updated 2017; latest comments dec 2018, promising
      an updated list. nothing since then.
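
  Since the pre-made lists go stale, the user agent strings can also be
  pulled straight from the Apache access log and counted. A rough
  sketch, assuming the log is in Apache's "combined" format (user agent
  as the last quoted field); the sample lines and IP addresses are
  invented for illustration.

```python
import re
from collections import Counter

# In the combined log format the user agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Invented sample lines, standing in for a real access log.
SAMPLE_LOG = [
    '203.0.113.5 - - [28/Jun/2021:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"',
    '203.0.113.5 - - [28/Jun/2021:10:00:02 +0000] "GET /a HTTP/1.1" 200 734 '
    '"-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"',
    '198.51.100.7 - - [28/Jun/2021:10:00:05 +0000] "GET /b HTTP/1.1" 200 901 '
    '"-" "facebookexternalhit/1.1"',
]


def count_user_agents(log_lines):
    """Count requests per user agent string; lines that don't parse are skipped."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for agent, hits in count_user_agents(SAMPLE_LOG).most_common():
        print(hits, agent)
```

  For a real log, pass the open file instead of SAMPLE_LOG, e.g.
  count_user_agents(open("access.log")) where the path is whatever
  your host uses.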


.........................
bad bot problems:

  * Bot Attacks: You are not alone…
    apr 2021
    https://medium.com/expedia-group-tech/bot-attacks-you-are-not-alone-d8b3290342bd

  * What is bot management? | How bot managers work
    https://www.cloudflare.com/en-gb/learning/bots/what-is-bot-management/

......................... 
specific bots:

  * Facebook Crawler
      Crawls the HTML of an app or website that was shared on Facebook 
        via copying and pasting the link or by a Facebook social plugin. 
        The crawler gathers, caches, and displays information about
        the app or website such as its title, description, and thumbnail image.
      You may want to allow the Facebook crawler if your site is active on Facebook.
      User agent strings are
        facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
        facebookexternalhit/1.1
        TibetSun is getting one called just 'Facebook'.
      https://developers.facebook.com/docs/sharing/webmasters/crawler

  * oBot
      "oBot is the web crawling bot of the Content Security Division of 
       IBM Germany Research & Development GmbH. ... [results in a]
       database that is made available to our customers in several content 
       filtering products."
      https://www.reddit.com/r/bigseo/comments/jigeg5/obot_do_you_know_this_bot/
      Original info at http://filterdb.iss.net/crawler/
        but it has bad cert; didn't open it. [31 mar 2021]

  * SEMrushbot
    - https://dmjcomputerservices.com/blog/blocking-semrushbot-from-website/
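
  To sort log entries using the notes above, a user agent string can be
  checked against the bots listed here. A minimal sketch: treating a
  bare "Facebook" string as the Facebook crawler is an assumption (it
  is easy to spoof), and oBot is left out because its user agent string
  is not recorded in these notes.

```python
# Prefix taken from the Facebook user agent strings listed in these notes.
FACEBOOK_PREFIX = "facebookexternalhit"

# Lowercase substrings to block; matched case-insensitively.
BLOCK_TOKENS = ("semrushbot",)


def classify_user_agent(user_agent):
    """Return 'allow', 'block', or 'unknown' for a raw user agent string."""
    ua = user_agent.strip().lower()
    # Assumption: a user agent of just 'Facebook' is also the Facebook crawler.
    if ua.startswith(FACEBOOK_PREFIX) or ua == "facebook":
        return "allow"
    if any(token in ua for token in BLOCK_TOKENS):
        return "block"
    return "unknown"


if __name__ == "__main__":
    for agent in ("facebookexternalhit/1.1", "Facebook", "SemrushBot/7~bl"):
        print(agent, "->", classify_user_agent(agent))
```

  Anything that comes back 'unknown' still needs a look by hand.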



_______________________________________________________
begin 28 jun 2021
-- 0 --