Preventing web scraping is essential for website administrators. Beyond content theft, the server resources consumed by crawlers can be costly for many site owners.
This article presents a fundamental approach: analyzing Nginx logs to identify high-frequency IPs, then blocking them from accessing your site.
Identifying Scraper IPs
Execute the following command:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -n 10
Where:
access.log: your Nginx access log file
This command extracts the client IP (the first field of each log line), counts requests per IP, and lists the ten most active addresses. An IP that generates a large number of requests in a short timeframe is likely a scraper.
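Total request volume can be misleading for long-running logs, so a variation worth sketching counts requests per IP per minute instead. This assumes the default combined log format, where the fourth field holds the timestamp (e.g. [02/Jan/2026:13:05:21):

awk '{print substr($4, 2, 17), $1}' access.log | sort | uniq -c | sort -nr | head -n 10

Here substr($4, 2, 17) trims the leading bracket and keeps the timestamp down to the minute, so the output lists the busiest IP-minute pairs first.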
Blocking IP Access
Create a blockip.conf file in your Nginx configuration directory to manage blocked IPs.
Add scraper IPs in this format:
deny IP;
Parameters:
deny: Nginx access control directive for restricting server access
IP: the target IP address (supports both IPv4 and IPv6)
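For concreteness, a minimal blockip.conf might look like the following; the addresses are placeholders drawn from the reserved documentation ranges, not real scrapers:

# blockip.conf
deny 192.0.2.10;        # a single IPv4 address
deny 203.0.113.0/24;    # an IPv4 range in CIDR notation
deny 2001:db8::1;       # an IPv6 address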
Include this configuration in your http, server, or location block:
include blockip.conf;
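As a sketch, placing the include in the http block of nginx.conf applies the rules to every server; the path below is an assumption, so use wherever you saved the file:

http {
    include /etc/nginx/blockip.conf;  # assumed path to the file created above

    server {
        listen 80;
        server_name example.com;
        # ... the rest of your server configuration
    }
}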
Reload (or restart) Nginx to apply the changes.
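A typical apply sequence, using standard nginx command-line flags:

nginx -t          # validate the configuration before applying it
nginx -s reload   # graceful reload; does not drop active connections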
Blocked IPs will receive 403 Forbidden responses.
Take care not to block legitimate search engine spiders or CDN back-to-origin requests, which can also produce high request counts; verify who owns a suspicious IP before denying it.
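One common verification technique is a reverse DNS lookup: a genuine Googlebot address, for instance, resolves to a googlebot.com hostname. The address below is illustrative:

host 66.249.66.1
# a real Googlebot address resolves to something like:
# crawl-66-249-66-1.googlebot.com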
Advanced Blocking Patterns
# Block single IP
deny IP;
# Allow single IP
allow IP;
# Block all IPs
deny all;
# Allow all IPs
allow all;
# Block IP range
deny IP/24;
Combination example:
# Whitelist specific IP while blocking others
allow IP;
deny all;
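These directives are evaluated top to bottom and the first match wins, which is why allow must precede deny all. A sketch of applying this pattern to protect a single path (the address and path are placeholders):

location /admin/ {
    allow 198.51.100.7;   # the one trusted address
    deny  all;            # everyone else receives 403 Forbidden
}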