How much does it cost to self-host Trafilatura: Discover and Extract Text Data on the Web?

Trafilatura: Discover and Extract Text Data on the Web can be self-hosted starting at $4.13/mo on hetzner CAX11. Detected requirements: 4GB RAM, 40GB disk.

Is Trafilatura: Discover and Extract Text Data on the Web actively maintained?

Trafilatura: Discover and Extract Text Data on the Web has a composite health score of 74/100 across activity, maturity, community, security, sustainability, and adoption. See /methodology/ for the formula.

Trafilatura: Discover and Extract Text Data on the Web by adbar

Python · Apache-2.0

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

web-scraping · text-extraction · nlp · html2text · text-mining · crawler · text-cleaning · text-preprocessing · article-extractor · readability · scraping · news-crawler

Verdict: 74/100 health · $4.13/mo cheapest (hetzner) · 2/5 setup difficulty · Last release 1.6 years ago


Health score: 74/100 (6-dim composite)

Self-hosts from: $4.13/mo (hetzner · CAX11)

Difficulty: 2/5 (Docker + read README)

GitHub stars: 5.9k (365 forks)

About Trafilatura: Discover and Extract Text Data on the Web

From the project's README at github.com/adbar/trafilatura. Lightly cleaned for readability; for the full source see the upstream repo.

Introduction

Trafilatura is a cutting-edge Python package and command-line tool designed to gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data. It includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction.
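As a rough illustration of the "raw HTML into structured data" step described in the README excerpt, here is a minimal sketch built on the package's `extract` and `fetch_url` functions. The helper names `page_to_json` and `url_to_json` are made up for this example, and the import guard is only there so the snippet still loads when the package is not installed.

```python
from typing import Optional

try:
    # Entry points provided by the trafilatura package (pip install trafilatura).
    from trafilatura import extract, fetch_url
except ImportError:
    # Keep the sketch importable when the package is absent.
    extract = fetch_url = None


def page_to_json(html: str) -> Optional[str]:
    """Extract the main content of an HTML string as a JSON string.

    Hypothetical helper: returns None when no usable content is found.
    """
    if extract is None:
        raise RuntimeError("trafilatura is not installed")
    return extract(html, output_format="json")


def url_to_json(url: str) -> Optional[str]:
    """Download a page and extract it; None if either step yields nothing.

    Hypothetical helper wrapping fetch_url, which returns None on failure.
    """
    if fetch_url is None:
        raise RuntimeError("trafilatura is not installed")
    downloaded = fetch_url(url)
    if downloaded is None:
        return None
    return page_to_json(downloaded)
```

Calling `url_to_json` with a page address of your choice should give back a JSON string, or `None` when the download or the extraction produced nothing. Other output formats named in the project description can be requested through the same `output_format` parameter.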

Health score breakdown

6-dimension composite. See methodology for formula and weights.

Dimensions: activity, maturity (100), community, security, sustainability, adoption.

Adoption signals

Real-world usage data, pulled from each registry. The bigger the numbers, the more battle-tested the project.

Signal	Value	Source
GitHub stars	5.9k	github.com/adbar/trafilatura
GitHub forks	365	github.com/adbar/trafilatura
PyPI downloads (last month)	8.4M	trafilatura

Release & maintenance

Is this project actively maintained, or about to die? Check how recent the last commit and the last release are.

Project age	7.2 years	since Apr 2019
Last commit	10 months ago	Sep 12, 2025
Releases shipped	39	last: 1.6 years ago
Funding links	2	declared by maintainers
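The recency check can be sketched as simple date arithmetic. This is only an illustration: the commit date comes from the table above, while the reference date `as_of` and the threshold `STALE_AFTER_MONTHS` are hypothetical values chosen for the example, not figures published on this page.

```python
from datetime import date


def whole_months_between(earlier: date, later: date) -> int:
    """Count full calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the most recent month is not complete yet
    return max(months, 0)


last_commit = date(2025, 9, 12)  # from the table above
as_of = date(2026, 7, 15)        # hypothetical reference date (assumption)
STALE_AFTER_MONTHS = 12          # hypothetical staleness threshold (assumption)

age_in_months = whole_months_between(last_commit, as_of)
is_stale = age_in_months >= STALE_AFTER_MONTHS
print(f"last commit {age_in_months} months ago, stale: {is_stale}")
```

With the assumed reference date this yields 10 months, in line with the "10 months ago" shown in the table.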

Self-hosting cost across providers

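Detected requirements: 4GB RAM and 40GB disk minimum. The table lists the cheapest plan per provider. Of the plans shown, only the hetzner CAX11 meets the detected requirements; the 1GB · 25GB plans from vultr, linode, and digitalocean fall below them.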

Provider	Plan	Specs (cores · RAM · disk)	Monthly
hetzner	CAX11	2c · 4GB · 40GB	$4.13 USD
vultr	VC2	1c · 1GB · 25GB	$5 USD
linode	Nanode 1GB	1c · 1GB · 25GB	$5.12 USD
digitalocean	Basic Regular 1GB	1c · 1GB · 25GB	$6 USD
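To make the "cheapest plan that meets the requirement" rule concrete, here is a small sketch that filters the rows of the table against the detected minimums. The `Plan` class and the `cheapest_meeting` function are hypothetical names for this illustration; the figures are copied from the table.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Plan:
    provider: str
    name: str
    ram_gb: int
    disk_gb: int
    usd_per_month: float


# Rows copied from the provider table above.
PLANS = [
    Plan("hetzner", "CAX11", 4, 40, 4.13),
    Plan("vultr", "VC2", 1, 25, 5.00),
    Plan("linode", "Nanode 1GB", 1, 25, 5.12),
    Plan("digitalocean", "Basic Regular 1GB", 1, 25, 6.00),
]


def cheapest_meeting(
    plans: Iterable[Plan], min_ram_gb: int, min_disk_gb: int
) -> Optional[Plan]:
    """Return the lowest-priced plan satisfying both minimums, or None."""
    eligible = [
        p for p in plans
        if p.ram_gb >= min_ram_gb and p.disk_gb >= min_disk_gb
    ]
    return min(eligible, key=lambda p: p.usd_per_month, default=None)


best = cheapest_meeting(PLANS, min_ram_gb=4, min_disk_gb=40)
print(best)
```

Applied to the listed rows with the 4GB RAM and 40GB disk minimums, the filter leaves only the hetzner CAX11 plan at $4.13 per month.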