Soup for Squeak

19 January, 2009


Zulq Alam has been working on Soup, a Squeak port of Beautiful Soup, the tolerant HTML/XML parser written in Python that is extremely useful when you need to scrape data from a web page. He recently announced a working release and gave some examples of its usage.

Zulq notes that there’s still plenty of work to do on this port:

  • No attempt is made to deal with different character sets and encodings.
  • The parser will not convert entity or character references.
  • The parser will not accept options such as whether to convert entities, which entities to convert, what to parse, etc.
  • The parser will only do HTML; there are no configurations for other XML flavours yet.
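
To make the second limitation concrete, here is a small sketch in Python (the language of the original Beautiful Soup) of what converting entity and character references means. It uses only the standard library's html.unescape and is not Soup's API; the helper name decode_references and the sample string are made up for illustration.

```python
from html import unescape


def decode_references(text):
    """Replace entity and character references with the characters they name.

    Hypothetical helper for illustration only; it simply wraps the
    standard library's html.unescape.
    """
    return unescape(text)


# Made-up sample: text as a parser that leaves references untouched
# might hand it back.
raw = "Fish &amp; chips &#169; 2009"

print(raw)                     # references still escaped
print(decode_references(raw))  # references replaced by real characters
```

Until the port handles this itself, text extracted with Soup would presumably need a similar post-processing step on the Squeak side.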

He adds that the project repository is globally writable, and he looks forward to your feedback and contributions.
