Preventing web scraping is a common concern for website administrators. Beyond content theft, the server resources consumed by aggressive crawlers can be a significant cost for many site owners.

This article presents a fundamental approach: analyzing Nginx logs to identify high-frequency IPs, then blocking them from accessing your site.

Identifying Scraper IPs

Execute the following command:

awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -n 10

Where:

  • access.log: your Nginx access log file (commonly /var/log/nginx/access.log)

This pipeline extracts the client IP from each log line, counts the requests per IP, and lists the ten most active addresses. An IP issuing an unusually large number of requests in a short window is a likely scraper.
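The count above spans the entire log. To focus on a burst of traffic, you can first narrow the log to a time window. The sketch below assumes the default combined log format, where the timestamp follows the client IP, and uses an illustrative date; adjust the pattern to the window you want to inspect:

# Top 10 IPs within a single minute of traffic (timestamp is illustrative)
grep '12/Mar/2024:10:05' access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 10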

Blocking IP Access

Create a blockip.conf file in your Nginx configuration directory to manage blocked IPs.

Add one line per scraper IP, in this format (a sample file follows the parameter list):

deny IP;

Parameters:

  • deny: Nginx access control directive for restricting server access
  • IP: the address to block; a single IPv4 or IPv6 address, or a CIDR range
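For instance, a minimal blockip.conf might look like the following; the addresses come from reserved documentation ranges and are shown purely for illustration:

# blockip.conf - one deny directive per blocked address
deny 192.0.2.10;
deny 198.51.100.25;
deny 2001:db8::1;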

Include this configuration in your http, server, or location block:

include blockip.conf;
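As a sketch, assuming a standard layout under /etc/nginx (adjust the path to wherever blockip.conf actually lives), placing the include at the top of the http block applies it to every virtual host:

http {
    # Apply the block list to all server blocks below
    include /etc/nginx/blockip.conf;

    server {
        listen 80;
        server_name example.com;
        # ... remaining server configuration
    }
}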

Reload Nginx to apply the changes; a full restart is not required.
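For example, assuming the nginx binary is on your PATH (both commands may require root privileges):

nginx -t          # check the configuration for syntax errors first
nginx -s reload   # reload worker processes without dropping connections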

Blocked IPs will receive 403 Forbidden responses.

Before adding an IP to the block list, make sure it is not a search engine spider or a CDN back-to-origin address: blocking a legitimate spider hurts your search ranking, and blocking your CDN's origin-fetch IPs breaks delivery of cached content.
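One common way to verify a suspected search engine spider is a reverse DNS lookup followed by a forward confirmation; the address below is illustrative:

# Reverse-resolve the suspect IP (address shown is illustrative)
host 66.249.66.1
# A genuine Googlebot resolves to a *.googlebot.com or *.google.com
# hostname; forward-resolve that hostname and confirm it maps back
# to the same IP before trusting it
host crawl-66-249-66-1.googlebot.com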

Advanced Blocking Patterns

# Block a single IP (192.0.2.x addresses are from a reserved
# documentation range, used here purely for illustration)
deny 192.0.2.10;

# Allow a single IP
allow 192.0.2.20;

# Block all IPs
deny all;

# Allow all IPs
allow all;

# Block an IP range (CIDR notation: network address plus prefix length)
deny 192.0.2.0/24;

Combination example:

# Whitelist one address while blocking everyone else.
# Nginx evaluates allow/deny directives in order; the first match
# wins, so the allow must appear before deny all.
allow 192.0.2.10;
deny all;
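A slightly fuller sketch, assuming a hypothetical /admin/ area that should only be reachable from an internal network (the 10.0.0.0/8 range is an assumption for illustration):

location /admin/ {
    allow 10.0.0.0/8;   # internal network only (illustrative range)
    deny  all;          # everyone else receives 403 Forbidden
}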