Soup for Squeak
19 January, 2009
Zulq Alam has been working on Soup, a Squeak port of Beautiful Soup, the tolerant HTML/XML parser written in Python. A parser like this is extremely useful when you need to scrape data from a web page. He has recently announced a working release and given some examples of its usage.
Zulq notes that there’s still plenty of work to do on this port:
- No attempt is made to deal with different character sets and encodings.
- The parser will not convert entity or character references.
- The parser will not accept options such as whether to convert entities, which entities to convert, what to parse, etc.
- The parser will only do HTML; there are no configurations for other XML flavours yet.
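To make the entity limitation concrete, here is a minimal Python sketch of what converting entity and character references means. It is not code from Soup or Beautiful Soup; it only uses `html.unescape` from the Python standard library, and the helper name `convert_references` is a hypothetical one chosen for illustration.

```python
from html import unescape


def convert_references(text: str) -> str:
    """Replace HTML entity and character references with the characters they stand for.

    Hypothetical helper for illustration only; it simply wraps the
    standard library's html.unescape.
    """
    return unescape(text)


# Named entities (&amp;, &lt;, &gt;) and numeric character references
# (&#233; decimal, &#x41; hexadecimal) are all converted.
raw = "caf&#233; &amp; bar &lt;b&gt; &#x41;"
print(convert_references(raw))  # prints: café & bar <b> A
```

A parser that skips this step hands back the raw text, so scraped strings would still contain sequences such as `&amp;` until they are converted separately.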
He adds that the project repository is globally writable, and he looks forward to your feedback and contributions.