to dirprocess: robots.txt file rev 12 aug 2021
Category: websites
.......................................................
summary of robots.txt file:
Anyone can see the file by adding /robots.txt to the domain in the url -
so anyone can see which pages you do or don’t want crawled,
so don’t use it to hide private user information.
Best practice: list the location of any sitemaps
associated with this domain at the bottom of the robots.txt file.
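IS case-sensitive: the file name (robots.txt, all lowercase) and the paths in rules. Directive names and user-agent names are not.
Can use comments (#) anywhere: on their own line or at the end of a line.
Use blank line between each group of directives (each User-agent block).
Order doesn't matter for Googlebot and other longest-match parsers: the bot will follow the more granular (longest matching) rule. Some older parsers use the first matching rule instead.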
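A minimal sketch of the "more granular rule wins" behaviour, assuming the longest-match rule described in RFC 9309 (longest matching path wins; on a tie, allow wins). The function name and the sample rules are made up for illustration, and wildcards (* and $) are left out. Not every crawler works this way.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, path_prefix) pairs for one user-agent
    group, e.g. ("disallow", "/blog/"). Hypothetical helper, not a library API.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule path matches nothing
        is_allow = directive.lower() == "allow"
        # Longer prefix wins; on equal length, allow beats disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Same rules in both orders: the result does not depend on order.
rules = [("disallow", "/blog/"), ("allow", "/blog/public/")]
for ordering in (rules, rules[::-1]):
    print(is_allowed(ordering, "/blog/public/post.html"),
          is_allowed(ordering, "/blog/drafts/post.html"))
```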
robots.txt checker:
do a web search, there are several.
Crawl-delay:
https://yoast.com/ultimate-guide-robots-txt/#crawl-delay-directive
https://www.contentkingapp.com/academy/robotstxt/faq/crawl-delay-10/
crawl-delay: 10 - milliseconds or seconds? most say seconds.
Googlebot does not acknowledge crawl-delay; have to use their console. pbbbbfft
600 seconds = 10 minutes
crawl-delay of 10 seconds = at most 8,640 pages a day (86,400 seconds in a day / 10)
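The arithmetic as a small sketch: a day has 86,400 seconds, so one fetch every N seconds caps a bot at 86,400 / N pages a day. The function name is made up for illustration, and it assumes the bot reads the value as seconds and fetches one page per interval.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_pages_per_day(crawl_delay_seconds):
    """Upper bound on pages one bot can fetch per day at a given crawl-delay."""
    if crawl_delay_seconds <= 0:
        raise ValueError("crawl-delay must be a positive number of seconds")
    return int(SECONDS_PER_DAY // crawl_delay_seconds)


for delay in (10, 20, 600):
    print(f"crawl-delay {delay:>3} s -> at most {max_pages_per_day(delay):,} pages a day")
```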
Info about robots.txt: (These are all good, each has some unique useful details.)
https://www.robotstxt.org/robotstxt.html
https://www.cloudflare.com/learning/bots/what-is-robots.txt/
https://moz.com/learn/seo/robotstxt
https://yoast.com/ultimate-guide-robots-txt/
https://yoast.com/wordpress-robots-txt-example/
https://en.wikipedia.org/wiki/Robots_exclusion_standard
https://ahrefs.com/blog/robots-txt/
.......................................................
Robots lists:
https://www.robotstxt.org/db.html
http://www.botsvsbrowsers.com/
.......................................................
Bad crawling bots:
They are analytics aggregators; their data is mostly useful to
the people/companies who subscribe to them.
Their request volume eats up too much server resources and bandwidth.
If not blocked from accessing your website, these bots tend to
obey the crawl-delay directive in robots.txt.
More than half of web traffic comes from robots, not from real users.
[21 jun 2018]
Blocking them gets you
less spam
safer website
less stolen content
lower bandwidth
Can block in robots.txt, by user agent string - but bots can ignore it.
Can block at server level, in .htaccess file.
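A rough sketch of the server-level option: a Python helper that prints mod_rewrite lines for .htaccess, answering 403 Forbidden to any request whose User-Agent contains one of the listed names. This assumes Apache with mod_rewrite enabled; the function name and the bot list are placeholders, not taken from a canonical file. Try it on a test site first.

```python
import re


def htaccess_block_rules(bot_names):
    """Build mod_rewrite lines that answer 403 to the given user agents."""
    if not bot_names:
        raise ValueError("need at least one bot name")
    # Escape each name so it is matched literally inside the regex.
    pattern = "|".join(re.escape(name) for name in bot_names)
    return "\n".join([
        "<IfModule mod_rewrite.c>",
        "RewriteEngine On",
        # [NC] = case-insensitive match on the User-Agent header
        f"RewriteCond %{{HTTP_USER_AGENT}} ({pattern}) [NC]",
        # "-" = no rewrite, [F] = 403 Forbidden, [L] = stop processing
        "RewriteRule .* - [F,L]",
        "</IfModule>",
    ])


print(htaccess_block_rules(["AhrefsBot", "MJ12Bot", "SEMrushBot"]))
```

Unlike robots.txt, this does not depend on the bot choosing to obey.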
.........................
blocking the bad bots:
These are the bots i'm currently blocking, according to the info
at simtechdev.com (url below), and including some i'm seeing
in site logs.
See canonical file in folder.
# This is the basic stuff i would have on all sites:
User-agent: *
Crawl-delay: 20
Disallow: /cgi-bin
User-agent: AhrefsBot
Disallow: /
User-agent: AspiegelBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: MauiBot
Disallow: /
User-agent: MJ12Bot
Disallow: /
User-agent: oBot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: SEMrushBot
Disallow: /
# you may want to allow the facebook crawler if your site is active on fb;
# if so, leave out these two groups.
# robots.txt matches on the bot name, not the full user agent string.
User-agent: facebookexternalhit
Disallow: /
User-agent: Facebook
Disallow: /
Sitemap: https://www.DOMAIN.TLD/my-sitemap.xml
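One way to sanity-check a file like this before uploading it is Python's standard-library urllib.robotparser. A minimal sketch, using a shortened copy of the rules with Crawl-delay placed inside the User-agent: * group (this parser ignores a Crawl-delay that comes before any User-agent line). "SomeOtherBot" and the example paths are made up. This parser applies the first matching rule rather than the most specific one, so real crawlers may differ in edge cases. site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 20
Disallow: /cgi-bin
User-agent: AhrefsBot
Disallow: /
Sitemap: https://www.DOMAIN.TLD/my-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "https://www.DOMAIN.TLD"
# Blocked bot: nothing on the site may be fetched.
print(parser.can_fetch("AhrefsBot", site + "/index.html"))
# Any other bot falls under the * group: only /cgi-bin is off limits.
print(parser.can_fetch("SomeOtherBot", site + "/index.html"))
print(parser.can_fetch("SomeOtherBot", site + "/cgi-bin/form.pl"))
# Crawl-delay and sitemap as this parser read them.
print(parser.crawl_delay("SomeOtherBot"))
print(parser.site_maps())
```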
.........................
can combine like this:
User-agent: AhrefsBot
User-agent: Neevabot
User-agent: SemrushBot
Disallow: /
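A small sketch of building a combined group like this from a list of names, so the list can be kept in one place. The function name is made up for illustration; the check at the end uses Python's standard-library urllib.robotparser, whose matching may differ in detail from what real crawlers do.

```python
from urllib.robotparser import RobotFileParser


def combined_group(bot_names, rule="Disallow: /"):
    """Return one robots.txt group: a User-agent line per bot, then one shared rule."""
    lines = [f"User-agent: {name}" for name in bot_names]
    lines.append(rule)
    return "\n".join(lines)


group = combined_group(["AhrefsBot", "Neevabot", "SemrushBot"])
print(group)

# Check that every listed bot is blocked and an unlisted bot is not.
parser = RobotFileParser()
parser.parse(group.splitlines())
for bot in ("AhrefsBot", "Neevabot", "SemrushBot", "SomeOtherBot"):
    print(bot, parser.can_fetch(bot, "/any/page.html"))
```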
.......................................................
resources:
https://simtechdev.com/blog/good-and-bad-bots-to-control-to-save-server-resources-and-improve-performance/
Explains about the bots.
Lists user-agent string and summary of bad and good bots.
* List of 1800 bad bots
and very good info.
https://tab-studio.com/en/blocking-robots-on-your-page/
[2017]
List last updated 2017; latest comments dec 2018, promising
an updated list. nothing since then.
* Robots.txt
jun 2018
https://www.deepcrawl.com/knowledge/technical-seo-library/robots-txt/
* Robots.txt for SEO: The Ultimate Guide
jan 2021
https://www.contentkingapp.com/academy/robotstxt/
* The ultimate guide to robots.txt
mar 2021
https://yoast.com/ultimate-guide-robots-txt/
* 14 Common Issues with the Robots.txt File in SEO (and How to Avoid Them)
apr 2020
https://www.seoclarity.net/blog/understanding-robots-txt
.........................
bad bot problems:
* Bot Attacks: You are not alone…
apr 2021
https://medium.com/expedia-group-tech/bot-attacks-you-are-not-alone-d8b3290342bd
* What is bot management? | How bot managers work
https://www.cloudflare.com/en-gb/learning/bots/what-is-bot-management/
.........................
specific bots:
* Facebook Crawler
Crawls the HTML of an app or website that was shared on Facebook
via copying and pasting the link or by a Facebook social plugin.
The crawler gathers, caches, and displays information about
the app or website such as its title, description, and thumbnail image.
You may want to allow the facebook crawler if your site is active on fb.
User agent strings are
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
facebookexternalhit/1.1
TibetSun is getting one called just 'Facebook'.
https://developers.facebook.com/docs/sharing/webmasters/crawler
* oBot
"oBot is the web crawling bot of the Content Security Division of
IBM Germany Research & Development GmbH. ... [results in a]
database that is made available to our customers in several content
filtering products."
https://www.reddit.com/r/bigseo/comments/jigeg5/obot_do_you_know_this_bot/
Original info at http://filterdb.iss.net/crawler/
but it has bad cert; didn't open it. [31 mar 2021]
* SEMrushbot
- https://dmjcomputerservices.com/blog/blocking-semrushbot-from-website/
_______________________________________________________
begin 29 mar 2021
-- 0 --