Trafilatura: Discover and Extract Text Data on the Web by adbar

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Topics: web-scraping, text-extraction, nlp, html2text, text-mining, crawler, text-cleaning, text-preprocessing, article-extractor, readability, scraping, news-crawler
Verdict: 74/100 health · $4.13/mo cheapest (hetzner) · 2/5 setup difficulty · Last release 1.4 years ago

Self-host Trafilatura: Discover and Extract Text Data on the Web on hetzner CAX11 for $4.13/mo.

Health score: 74/100 (6-dimension composite)
Self-hosts from: $4.13/mo (hetzner · CAX11)
Difficulty: 2/5 (Docker + read README)
GitHub stars: 5.9k (365 forks)

About Trafilatura: Discover and Extract Text Data on the Web

From the project's README at github.com/adbar/trafilatura. Lightly cleaned for readability; for the full source see the upstream repo.

Introduction

Trafilatura is a cutting-edge Python package and command-line tool designed to gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data. It includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction
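As a rough illustration of the "raw HTML into structured, meaningful data" step, the minimal sketch below passes an HTML string to the package's `extract` function. The sample HTML and the helper name `extract_main_text` are made up for this example, and the import guard is only there so the snippet still runs where the package is not installed (`pip install trafilatura`).

```python
# Minimal sketch: main-text extraction from an HTML string with Trafilatura.
# Assumption: the package is installed; if not, the helper returns None.
try:
    from trafilatura import extract
except ImportError:  # package missing in this environment
    extract = None

# Made-up sample page, used for illustration only.
SAMPLE_HTML = """
<html>
  <head><title>Sample article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Sample article</h1>
      <p>This paragraph stands in for the main text of a web page. It is long
      enough to resemble real article content rather than navigation links.</p>
      <p>A second paragraph gives the extractor a little more body text to
      work with, which is what a typical news or blog page would contain.</p>
    </article>
    <footer>Copyright notice and other boilerplate</footer>
  </body>
</html>
"""


def extract_main_text(html):
    """Return the extracted main text, or None if nothing could be extracted."""
    if extract is None:
        return None
    # extract() takes an HTML string and returns a string, or None when it
    # finds no usable content.
    return extract(html)


if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

For live pages, the package also provides `fetch_url` to download a document before extraction; the project's own documentation covers output formats and command-line usage.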

Health score breakdown

6-dimension composite. See methodology for formula and weights.

activity: 80
maturity: 100
community: 84
security: 70
sustainability: 88
adoption: 33
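The formula and weights are not given on this page, so the sketch below only computes a plain unweighted mean of the six dimension scores as a point of comparison. It comes out near 75.8, slightly above the reported 74, which is consistent with the composite using non-uniform weights; the actual weighting is an unknown here, not something this snippet reproduces.

```python
from statistics import mean

# Dimension scores as listed in the breakdown above.
scores = {
    "activity": 80,
    "maturity": 100,
    "community": 84,
    "security": 70,
    "sustainability": 88,
    "adoption": 33,
}

# Assumption: equal weights. The real composite uses weights described in the
# site's methodology, which are not shown on this page.
unweighted = mean(scores.values())

if __name__ == "__main__":
    print(f"Unweighted mean: {unweighted:.1f} (reported composite: 74)")
```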

Adoption signals

Real-world usage data, pulled from each registry. The bigger the numbers, the more battle-tested the project.

Signal | Value | Source
GitHub stars | 5.9k | github.com/adbar/trafilatura
GitHub forks | 365 | github.com/adbar/trafilatura
PyPI downloads (last month) | 7376k | trafilatura

Release & maintenance

Is this project actively maintained, or about to die? Check the recency of last commit and last release.

Project age: 7.1 years (since Apr 2019)
Last commit: 8 months ago (Sep 12, 2025)
Releases shipped: 39 (last: 1.4 years ago)
Funding links: 2 (declared by maintainers)

Self-hosting cost across providers

Detected requirements: 4GB RAM, 40GB disk minimum. Cheapest plan per provider that meets the requirement.

Provider | Plan | Specs | Monthly
hetzner | CAX11 | 2c · 4GB · 40GB | $4.13 USD
vultr | VC2 | 1c · 1GB · 25GB | $5 USD
linode | Nanode 1GB | 1c · 1GB · 25GB | $5.12 USD
digitalocean | Basic Regular 1GB | 1c · 1GB · 25GB | $6 USD
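The selection rule stated above (cheapest plan that meets the detected requirement) can be sketched as a filter followed by a minimum. The figures are copied from the table; the `Plan` type and field names are hypothetical. With the rows as listed, only the hetzner CAX11 meets the 4GB RAM and 40GB disk thresholds, while the other three rows are 1GB/25GB plans.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    provider: str
    name: str
    cores: int
    ram_gb: int
    disk_gb: int
    usd_per_month: float


# Rows copied from the provider table above.
PLANS = [
    Plan("hetzner", "CAX11", 2, 4, 40, 4.13),
    Plan("vultr", "VC2", 1, 1, 25, 5.00),
    Plan("linode", "Nanode 1GB", 1, 1, 25, 5.12),
    Plan("digitalocean", "Basic Regular 1GB", 1, 1, 25, 6.00),
]

# Detected requirements stated on the page.
MIN_RAM_GB = 4
MIN_DISK_GB = 40

# Keep only plans that satisfy both thresholds, then pick the cheapest.
qualifying = [
    p for p in PLANS if p.ram_gb >= MIN_RAM_GB and p.disk_gb >= MIN_DISK_GB
]
cheapest = min(qualifying, key=lambda p: p.usd_per_month)

if __name__ == "__main__":
    print(f"{cheapest.provider} {cheapest.name}: ${cheapest.usd_per_month}/mo")
```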

Ready to self-host Trafilatura: Discover and Extract Text Data on the Web?

Spin up a hetzner CAX11 (4GB RAM, 40GB disk) for $4.13/mo and follow the project's official install docs.

Data last refreshed May 7, 2026.

Data refreshes every 30 minutes.