We've had an Apache2 server behaving oddly lately, with long loading times and capacity warnings. Checking the logs showed that "SemrushBot" was crawling all sites without any rate limit, at maximum bandwidth. It appears to be some kind of marketing bot, and trash like that has to go. The bot also often appears not to respect the robots.txt file, so we need a different method.
[Solution]
There are several approaches. Filtering by IP is not very useful, as the bot may connect from IP blocks that we're not aware of. Currently we're blocking it by its User-Agent string in .htaccess, using the snippet found here: http://badbots.vps.tips/info/semrushbot
Code:
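# Bad bots filter code
# provided by http://badbots.vps.tips
# Flag any request whose User-Agent contains "SemrushBot" (case-insensitive)
SetEnvIfNoCase User-Agent "SemrushBot" bad_bots
# Apache 2.2 access syntax; on Apache 2.4 this requires mod_access_compat
<Limit GET POST HEAD>
# Allow is evaluated first, then Deny, so a matching Deny wins
Order Allow,Deny
Allow from all
# Reject flagged requests with 403 Forbidden
Deny from env=bad_bots
</Limit>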
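The snippet above uses the older Order/Allow/Deny syntax. On Apache 2.4 without mod_access_compat loaded, something like the following should do the same job using the newer Require directives. This is only a sketch and not part of the badbots.vps.tips snippet; it assumes mod_setenvif and mod_authz_core are enabled and that AllowOverride permits FileInfo and AuthConfig for the directory.

```apache
# Sketch for Apache 2.4 (assumes mod_setenvif and mod_authz_core are loaded)
# Flag any request whose User-Agent contains "SemrushBot" (case-insensitive)
SetEnvIfNoCase User-Agent "SemrushBot" bad_bots

# Grant access to everyone except requests carrying the bad_bots flag
<RequireAll>
    Require all granted
    Require not env bad_bots
</RequireAll>
```

Unlike the <Limit GET POST HEAD> version, this applies to every request method. To check that it works, send a request with a User-Agent containing "SemrushBot" and confirm the server answers with 403 Forbidden.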