Trafilatura: Discover and Extract Text Data on the Web by adbar
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
About Trafilatura: Discover and Extract Text Data on the Web
From the project's README at github.com/adbar/trafilatura. Lightly cleaned for readability; for the full source see the upstream repo.
Introduction
Trafilatura is a Python package and command-line tool designed to gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data. It includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction.
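As a rough illustration of the "raw HTML into structured data" step, here is a minimal sketch that calls the library's `extract` function on an HTML string that is already in memory. The helper names `extract_main_text` and `extract_as_json` and the `SAMPLE_HTML` document are made up for this example, and the sketch assumes the `trafilatura` package is installed (for instance via `pip install trafilatura`). It does not cover crawling or downloading.

```python
# Hypothetical sample document, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Example article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Example article</h1>
      <p>This paragraph stands in for the main content of a web page.
      A real page would carry much more text than this short sample.</p>
    </article>
    <footer>Copyright notice and other boilerplate</footer>
  </body>
</html>
"""


def extract_main_text(html: str):
    """Return the extracted main text as a string, or None if nothing is found.

    The import happens inside the function so that the dependency is only
    required when extraction is actually run.
    """
    import trafilatura

    return trafilatura.extract(html)


def extract_as_json(html: str):
    """Return text plus available metadata serialized as a JSON string, or None."""
    import trafilatura

    return trafilatura.extract(html, output_format="json", with_metadata=True)
```

Calling `extract_main_text(SAMPLE_HTML)` returns either a plain-text string or `None`. Very short documents like this sample may yield `None`, so callers should check the result before using it.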
Health score breakdown
6-dimension composite. See methodology for formula and weights.
Adoption signals
Real-world usage data, pulled from each registry. The bigger the numbers, the more battle-tested the project.
| Signal | Value | Source |
|---|---|---|
| GitHub stars | 5.9k | github.com/adbar/trafilatura |
| GitHub forks | 365 | github.com/adbar/trafilatura |
| PyPI downloads (last month) | 7376k | trafilatura |
Release & maintenance
Is this project actively maintained, or about to die? Check the recency of last commit and last release.
| Signal | Value | Details |
|---|---|---|
| Project age | 7.1 years | since Apr 2019 |
| Last commit | 8 months ago | Sep 12, 2025 |
| Releases shipped | 39 | last: 1.4 years ago |
| Funding links | 2 | declared by maintainers |
Self-hosting cost across providers
Detected requirements: 4GB RAM, 40GB disk minimum. The table lists one low-cost plan per provider. Of these, only the Hetzner CAX11 meets both minimums; the other three plans offer 1GB RAM and 25GB disk.
| Provider | Plan | Specs (vCPU · RAM · disk) | Monthly |
|---|---|---|---|
| Hetzner | CAX11 | 2c · 4GB · 40GB | $4.13 USD |
| Vultr | VC2 | 1c · 1GB · 25GB | $5 USD |
| Linode | Nanode 1GB | 1c · 1GB · 25GB | $5.12 USD |
| DigitalOcean | Basic Regular 1GB | 1c · 1GB · 25GB | $6 USD |
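The requirement check behind this table can be sketched in a few lines. The snippet below copies the four rows from the table, filters them against the detected 4GB RAM / 40GB disk minimum, and picks the cheapest plan that qualifies. The `Plan` class and function names are hypothetical, chosen only for this illustration; this is not the directory's actual selection logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Plan:
    provider: str
    name: str
    ram_gb: int
    disk_gb: int
    monthly_usd: float


# Rows copied from the table above.
PLANS = [
    Plan("hetzner", "CAX11", 4, 40, 4.13),
    Plan("vultr", "VC2", 1, 25, 5.00),
    Plan("linode", "Nanode 1GB", 1, 25, 5.12),
    Plan("digitalocean", "Basic Regular 1GB", 1, 25, 6.00),
]

# Detected requirements stated in the text.
MIN_RAM_GB = 4
MIN_DISK_GB = 40


def meets_requirement(plan: Plan) -> bool:
    """True when the plan satisfies both the RAM and the disk minimum."""
    return plan.ram_gb >= MIN_RAM_GB and plan.disk_gb >= MIN_DISK_GB


def cheapest_qualifying(plans: List[Plan]) -> Optional[Plan]:
    """Return the lowest-priced plan that meets the requirement, or None."""
    qualifying = [p for p in plans if meets_requirement(p)]
    return min(qualifying, key=lambda p: p.monthly_usd, default=None)
```

With the rows as listed, `cheapest_qualifying(PLANS)` selects the CAX11 plan, and the three 1GB plans are filtered out by `meets_requirement`.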
What people say on Hacker News
- Trafilatura: A tool and library to gather text and metadata on the Web
- Show HN: Snitchmd – Cloudflare-protected URLs into clean Markdown via Docker
- Show HN: Snapbyte – personalized email digests from HN/Reddit/Lobsters
- Show HN: ObsidianLinkBot – Telegram bot that saves articles as Obsidian notes
Ready to self-host Trafilatura: Discover and Extract Text Data on the Web?
Spin up a Hetzner CAX11 (4GB RAM, 40GB disk) for $4.13/mo and follow the project's official install docs.
Data last refreshed May 7, 2026.
Similar open-source projects
Projects in our directory that replace the same SaaS or share topics with Trafilatura: Discover and Extract Text Data on the Web.