Python Web Crawlers

What Is a Crawler?

As I understand it, a crawler is essentially a technical means of targeting content publicly presented on the internet, analyzing how that content is transmitted and displayed, and then parsing and extracting the core, valuable data from it.

Example 1:

  • If I want to quickly understand Xiaohongshu’s daily content recommendation logic, I can start by examining its homepage — the content is presented in a list format, where each card corresponds to a single post.
  • I can use a crawler to fetch the core information of each card (such as the title and cover image), and then combine big data analysis or deep learning methods to mine the value of this content, or even reverse-infer my own user profile on the Xiaohongshu platform.

Example 2:

  • Sometimes you encounter a website whose UI design is extremely unfriendly but whose data is irreplaceable. Meanwhile, the site provides no official API, so there is no legitimate way to obtain the data source, making it impossible to build your own web frontend or mobile app for more efficient and intuitive analysis.
  • In such cases, crawlers can be used, under legal and compliant conditions, to scrape the structured and regular data presented on the frontend of the website and store it locally (e.g., in files or an SQLite database; see the sketch after this list).
  • Later, this data can be integrated into your own web projects (either Python Web + Jinja2 in a non-separated frontend/backend architecture, or packaged as a RESTful API using Spring Boot and connected to a Vue/React frontend — the specific tech stack is flexible and not the core issue).
  • Combined with customized UI design, this can greatly enhance the brain’s efficiency in analyzing data and help make more accurate decisions.
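
As a taste of that SQLite storage step, here is a minimal sketch; the table name and the rows are made up for illustration (in practice they would come from your parser):

python
import sqlite3

# Hypothetical scraped rows (title, url), standing in for real parser output
rows = [
    ("Post A", "https://example.com/a"),
    ("Post B", "https://example.com/b"),
]

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS posts (title TEXT, url TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", rows)
conn.commit()
conn.close()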

This tool, therefore, is what we call a crawler.


Why Use Python

  • Fundamentally, crawlers simulate user access to websites or mobile apps through automation, and then extract specified content according to predefined rules (this is the basic definition of a crawler; more advanced crawlers may integrate deep learning for autonomous data learning and extraction, which we won’t discuss here).
  • From a technical standpoint, any programming language can be used to develop crawlers — Java, JavaScript, etc., all have this capability.
  • However, there are critical differences in practice: the crawler ecosystem in Java or JavaScript is not nearly as mature as Python’s. Python has libraries such as Beautiful Soup, which are powerful, mature, and extremely easy to use, significantly reducing development costs.
  • More importantly, Python’s syntax is concise and easy to understand. Functionality that might take dozens of lines in Java can often be achieved in a few lines of Python using third-party libraries. Crawling tasks are mostly script-level needs rather than large-scale engineering projects: they rarely require service clusters or strict engineering standards, and a script of fewer than 50 lines can often complete the entire cycle of crawling, parsing, and storing data (see the sketch after this list).
  • Therefore, although crawling is not limited to Python, Python has become the industry’s mainstream choice due to its ecosystem advantages and development efficiency.
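
To make that 50-line claim concrete, here is a minimal fetch-parse-store sketch. The URL and the "card" markup are placeholders, not a real site:

python
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute a page you are allowed to crawl
resp = requests.get("https://example.com/list", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# Assume each item is a link inside a hypothetical "card" element
rows = [(a.text.strip(), a["href"]) for a in soup.select(".card a[href]")]

with open("items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link"])
    writer.writerows(rows)

print(f"Saved {len(rows)} items")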

Core Tools Commonly Used in Python Crawlers

Tool                | Function                                  | Use Case
--------------------|-------------------------------------------|-------------------------------
requests            | Send HTTP/HTTPS requests                  | Static pages, API data
BeautifulSoup4      | Parse HTML/XML and extract data           | Static page parsing
lxml                | High-performance HTML/XML parsing         | Complex page parsing
Scrapy              | Crawler framework (request/parse/store)   | Large-scale structured crawls
Selenium/Playwright | Simulate browser behavior                 | Dynamically rendered pages

  • The basic crawling workflow starts with sending a network request using requests.

    • Watch out for anti-scraping measures: a missing Cookie or Content-Type header, failed authentication, and similar gaps can all cause error responses. In such cases, inspect the real browser request and replicate the missing data in Python.
    • You can use Edge or Chrome, open DevTools (F12), inspect request headers, and copy the required headers into your Python script.
  • After obtaining the data:

    • If it’s RESTful API data, you’ll typically receive JSON (the same payload consumed by frontend frameworks or by mobile clients via Retrofit2); a minimal example follows this list.
    • If it’s directly rendered web content, it will mostly be HTML. At this point, you analyze the HTML structure and extract the core data based on consistent patterns, store it, and further analyze the business logic behind the data.
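
Consuming such a JSON API from Python is usually just a couple of calls. A minimal sketch against a hypothetical endpoint:

python
import requests

# Hypothetical JSON endpoint, for illustration only
resp = requests.get("https://api.example.com/posts", timeout=10)
resp.raise_for_status()   # fail loudly on 4xx/5xx instead of parsing an error page
data = resp.json()        # deserialize the JSON body into Python objects

for item in data:
    print(item.get("title"))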

Practical Case

  • Recently I’ve been quite bored at home, and legal things don’t interest me much.
  • Among porn, gambling, and drugs — drugs are obviously off the table.
  • As for gambling, I’m working on another project: using Python Web + Bootstrap to build a single-player Texas Hold’em and chess game. Playing offline board games at home is actually quite fun.
  • So for crawling, the only remaining illegal gray area is pornography.
  • I thought: why not use a crawler to scrape information from porn websites and download their cover images?

Note: To run this case, your network must be able to access Google. You may use Shadowsocks, V2RayNG, ClashX, etc., or refer to router-level solutions.

The website I chose to crawl is: JavBus

  • Their URL format is absurdly simple:

    • https://www.javbus.com/KBI-001
  • Domain + serial number = resource page.

The cover image section:

html
<div class="col-md-9 screencap">
    <a class="bigImage" href="/pics/cover/6qah_b.jpg">
      <img src="/pics/cover/6qah_b.jpg" title="...">
    </a>
</div>

The sample image section:

html
<div id="sample-waterfall">
  <a class="sample-box" href="https://pics.dmm.co.jp/...-1.jpg">
    <div class="photo-frame">
      <img src="/pics/sample/6qah_1.jpg">
    </div>
  </a>
  ...
</div>

This structure is basically a paradise for crawler beginners. There’s no need for Selenium or Playwright — a simple requests + BeautifulSoup setup is enough.
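
To see how directly those structures map to selectors, here is the extraction logic run against the exact HTML snippets quoted above (the full script later does the same against the live pages):

python
from bs4 import BeautifulSoup

# The cover and sample markup quoted above, inlined for demonstration
html = """
<div class="col-md-9 screencap">
  <a class="bigImage" href="/pics/cover/6qah_b.jpg">
    <img src="/pics/cover/6qah_b.jpg" title="...">
  </a>
</div>
<div id="sample-waterfall">
  <a class="sample-box" href="https://pics.dmm.co.jp/...-1.jpg">
    <div class="photo-frame"><img src="/pics/sample/6qah_1.jpg"></div>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("a.bigImage img")["src"])          # cover image path
print([a["href"] for a in soup.select("a.sample-box")])  # full-size sample URLs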

If I didn’t crawl this, it would be against the laws of nature.


Python Code

Config File

series_image_cover_config.py

python
series = "KBI"
start_code = 1
end_code = 100

Main Logic (Parsing & Downloading)

series_image_cover_download.py

python
import os
import requests
from bs4 import BeautifulSoup

host = "https://www.javbus.com"

def image_download(names, only_folder_image=False):
    for name in names:
        URL = host + "/" + name
        SAVE_DIR = "./{}".format(name)

        # Headers copied from a real browser request via DevTools (F12).
        # DevTools also shows HTTP/2 pseudo-headers (:authority, :method,
        # :path, :scheme); those are not real request headers and are
        # omitted here.
        HEADERS = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "accept-encoding": "gzip, deflate, br, zstd",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
            "cookie": "PHPSESSID=nnelcojatbmsjh2s02i4m0fno4; existmag=mag; _tea_utm_cache_10000007=undefined",
            "dnt": "1",
            "priority": "u=0, i",
            "sec-ch-ua": '"Chromium";v="140", "Not=A?Brand";v="24", "Microsoft Edge";v="140"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"macOS"',
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "none",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/140.0.0.0 Safari/537.36 Edg/140.0.0.0"
        }

        os.makedirs(SAVE_DIR, exist_ok=True)

        session = requests.Session()
        response = session.get(URL, headers=HEADERS)
        soup = BeautifulSoup(response.text, "html.parser")

        title_tag = soup.find("title")
        title = title_tag.text.strip() if title_tag else "unknown_title"
        print("页面标题:", title)

        cover_tag = soup.select_one("a.bigImage img")
        if cover_tag:
            cover_url = cover_tag["src"]
            if cover_url.startswith("/"):
                cover_url = host + cover_url
            ext = os.path.splitext(cover_url)[1]
            # Save the cover as folder.<ext> so file managers and media
            # centers pick it up as the directory thumbnail
            cover_path = os.path.join(SAVE_DIR, f"folder{ext}")
            
            cover_headers = HEADERS.copy()
            cover_headers["referer"] = URL
            
            r = session.get(cover_url, headers=cover_headers, stream=True)
            with open(cover_path, "wb") as f:
                for chunk in r.iter_content(1024):
                    f.write(chunk)
            print("封面图已下载:", cover_path)

        if not only_folder_image:
            sample_tags = soup.select("a.sample-box")
            if not sample_tags:
                print("没有样本图")
            else:
                for idx, a_tag in enumerate(sample_tags, 1):
                    img_url = a_tag.get("href")
                    if not img_url:
                        continue
                    ext = os.path.splitext(img_url)[1]
                    img_path = os.path.join(SAVE_DIR, f"sample_{idx}{ext}")
                    r = session.get(img_url, headers=HEADERS, stream=True)
                    with open(img_path, "wb") as f:
                        for chunk in r.iter_content(1024):
                            f.write(chunk)
                    print(f"样本图 {idx} 已下载:", img_path) 

if __name__ == "__main__":
    from series_image_cover_config import series, start_code, end_code

    # Build every code in the series, e.g. KBI-001 through KBI-100
    names = [f"{series}-{code:03d}" for code in range(start_code, end_code + 1)]
    image_download(names)

Run:

shell
python ./series_image_cover_download.py

  • This method crawls an entire series of serial numbers.
  • A “serial number” here is similar to a product line, like Samsung’s Galaxy S, Z Fold, or Z Flip series.
  • Each movie is a combination of the series and a number, e.g., KBI-002.

Downloading All Detail Images for Specific Movies

special_image_detail_config.py

python
videos = [
  {"name": "KBI-001"},
  {"name": "KBI-002"}
]

special_image_detail_download.py

python
import os
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import urllib3

from special_image_detail_config import videos

# Suppress the InsecureRequestWarning noise from unverified HTTPS requests
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

host = "https://www.javbus.com"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}

async def download_image_cover(session: aiohttp.ClientSession, name: str, series: str):
    URL = f"{host}/{name}"
    SAVE_DIR = f"./{series}"
    os.makedirs(SAVE_DIR, exist_ok=True)

    try:
        async with session.get(URL, timeout=aiohttp.ClientTimeout(total=10), ssl=False) as response:
            if response.status != 200:
                print(f"[{name}] 页面不存在,跳过")
                return
        
            html = await response.text()
            soup = BeautifulSoup(html, "html.parser")

            title = soup.find("title")
            cover_tag = soup.select_one("a.bigImage img")

            if not cover_tag:
                print(f"[{name}] 无封面,跳过")
                return

            cover_url = cover_tag["src"]
            if cover_url.startswith("/"):
                cover_url = host + cover_url

            ext = os.path.splitext(cover_url)[1]
            cover_path = os.path.join(SAVE_DIR, f"{name}{ext}")

            async with session.get(
                cover_url, 
                headers={"Referer": URL, **HEADERS}, 
                ssl=False
            ) as img_response:
                if img_response.status == 200:
                    with open(cover_path, "wb") as f:
                        f.write(await img_response.read())
                    
                    print(f"[{name}] 下载成功 → {cover_path}")
                    print(f"[{name}] 标题:{title.text.strip() if title else '无标题'}")
                else:
                    print(f"[{name}] 图片下载失败,状态码:{img_response.status}")

    except asyncio.TimeoutError:
        print(f"[{name}] request timed out, skipping")
    except Exception as e:
        print(f"[{name}] error: {e}")
        return

async def start_to_download():
    names = [video["name"] for video in videos]
    
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(
        connector=connector,
        headers=HEADERS
    ) as session:

        # Derive the folder name from each code's series prefix (e.g. "KBI")
        tasks = [download_image_cover(session, name, name.split("-")[0]) for name in names]
        
        batch_size = 10
        for i in range(0, len(tasks), batch_size):
            batch = tasks[i:i+batch_size]
            await asyncio.gather(*batch)
            await asyncio.sleep(0.5)  # throttle so the traffic looks less like a scripted scraper

if __name__ == "__main__":
    asyncio.run(start_to_download())

Run:

shell
python ./special_image_detail_download.py

About Video Downloads

  • Sites like JavBus usually provide magnet links. You can use tools like Motrix, but download speeds for such content are often terrible.

  • Two solutions:

    1. Use premium services like Quark Cloud or Xunlei (local downloads are usually fine).
    2. Platforms like MissAV stream via M3U8, which can be downloaded using the tools described in my other article on M3U8 video downloading; a bare-bones sketch follows.
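
The gist of M3U8 downloading is just fetching the playlist and concatenating its segments. A minimal sketch, assuming an unencrypted media playlist at a placeholder URL (real streams often add encryption, where ffmpeg or a dedicated downloader is the better tool):

python
import requests
from urllib.parse import urljoin

playlist_url = "https://example.com/video/index.m3u8"  # placeholder URL
resp = requests.get(playlist_url, timeout=10)
resp.raise_for_status()

# Segment URIs are the non-comment lines of the playlist
segments = [line.strip() for line in resp.text.splitlines()
            if line.strip() and not line.startswith("#")]

with open("video.ts", "wb") as out:
    for seg in segments:
        seg_url = urljoin(playlist_url, seg)  # resolve relative segment paths
        out.write(requests.get(seg_url, timeout=10).content)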

Important Disclaimer

  • If anyone other than myself reads this article and uses these techniques to commit actions that seriously violate the law, please do not drag me into it.

Just something casual. Hope you like it.