adbarba's comments

Concerning tooling, I'd say you have two different worlds, JavaScript and Python, each with its own series of tools to tackle such tasks. It's not easy to compare them directly because the software environments differ, and I haven't had a chance to test the JS tools thoroughly.

For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.

[1]: https://github.com/mozilla/readability


Regarding content extraction, it's more accurate than newspaper3k (especially for languages other than English) and it returns more information: metadata, text, and comments. It works out of the box in most cases, so there's no need to write a dedicated scraper for a given website, which saves time. If you only care about 2-3 websites and are willing to write and maintain scraping scripts, then bs4/lxml/whatever is also fine.
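To make the "metadata, text, and comments" part concrete, here is a minimal sketch, assuming a reasonably recent trafilatura release. `extract_record` and its `extractor` parameter are hypothetical names for this example only; the single library call used is `trafilatura.extract`, and the exact fields in the JSON output may vary between versions.

```python
import json
from typing import Callable, Optional


def extract_record(html: str,
                   extractor: Optional[Callable[[str], Optional[str]]] = None) -> Optional[dict]:
    """Return the extraction result as a dict, or None if nothing was extracted.

    `extractor` is a hypothetical hook so the function can be tried with a stub;
    by default it calls trafilatura.extract with JSON output.
    """
    if extractor is None:
        # Imported lazily so the sketch can be exercised without the package installed.
        import trafilatura

        def extractor(doc: str) -> Optional[str]:
            # JSON output bundles the main text with whatever metadata and
            # comments were found (field names depend on the version).
            return trafilatura.extract(doc, output_format="json", include_comments=True)

    raw = extractor(html)
    return json.loads(raw) if raw else None
```

You would typically feed it the HTML you downloaded yourself and then pick the fields you need from the returned dict.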

It also features functions and a command-line interface to collect data on your own (say, finding recent news using feeds). So in the end it's not merely about text extraction but also about text discovery.
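As a rough illustration of what "finding recent news using feeds" boils down to, here is a standard-library sketch that pulls article links out of an RSS 2.0 or Atom document. This is not the package's implementation (it ships its own feed helpers); `links_from_feed` is a hypothetical name, and real feeds are messier than this handles.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def links_from_feed(feed_xml: str) -> list[str]:
    """Collect article URLs from an RSS 2.0 or Atom feed, in document order."""
    root = ET.fromstring(feed_xml)
    links = []

    # RSS 2.0: <rss><channel><item><link>URL</link></item>...
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())

    # Atom: <feed><entry><link href="URL"/></entry>...
    for entry in root.iter(f"{ATOM_NS}entry"):
        for link in entry.findall(f"{ATOM_NS}link"):
            # A missing rel attribute means "alternate", i.e. the article itself.
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))

    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(links))
```

The resulting URLs can then be downloaded and passed to the extraction step one by one.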


Author here, nice to see the package on HN's front page this morning, and thanks for the kind words! I just created an account to participate in the discussion and I'll try to answer your questions.


I’ve been using this package and like it a lot.

One problem I’d like to find a solution for is how to get past cookie pop-ups when scraping a website. I’ve not found a satisfactory packaged solution for this. Clearly a tough problem in general, but I wondered if people have found good libs to help with this. I’ve heard of solutions involving Playwright etc.


Thanks! Here is what I put together in the docs. Basically, you could preprocess/render/filter the webpages with the software of your choice and then pass the result to trafilatura: https://trafilatura.readthedocs.io/en/latest/troubleshooting...
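A minimal sketch of that "render first, extract second" idea, assuming Playwright as the rendering tool (any other would do). `extract_rendered`, `render_with_playwright`, and `default_extract` are hypothetical names for this example; the only library calls used are Playwright's sync API and `trafilatura.extract`. Note that rendering alone does not dismiss a consent banner: clicking it away needs a site-specific selector, which is left out here.

```python
from typing import Callable, Optional


def render_with_playwright(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily: Playwright is only one possible choice of renderer.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # A site-specific click on the consent button would go here (assumption).
        html = page.content()
        browser.close()
    return html


def default_extract(html: str) -> Optional[str]:
    """Pass already-rendered HTML to trafilatura."""
    import trafilatura

    return trafilatura.extract(html)


def extract_rendered(url: str,
                     render: Callable[[str], str] = render_with_playwright,
                     extract: Callable[[str], Optional[str]] = default_extract) -> Optional[str]:
    """Render (or otherwise preprocess) a page, then hand the HTML to the extractor."""
    html = render(url)
    if not html:
        return None
    return extract(html)
```

Keeping the renderer pluggable means the same function works whether the HTML comes from a headless browser, a cache, or a custom filtering step.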

