Spiderz
Scarily easy spidering
What is it?
Spiderz is a very simple Ruby gem for spidering websites. Here is a short example that creates a sitemap:
spider = Spiderz.new "http://mysite.com"
spider.success do |url, doc|
  title = (doc / "title").text.strip
  puts "<a href='#{url}' >#{title}</a>"
end
spider.crawl "/"
Here is another that finds 404s:
spider = Spiderz.new "http://mysite.com"
spider.failure do |url|
  puts "Failed to load: #{url}"
end
spider.crawl "/"
CALLBACKS
The idea behind Spiderz is to provide very simple callbacks that can be customized with blocks:
- ‘success’ for when a page is retrieved and parsed successfully
- ‘failure’ for when a page is not retrieved or parsed successfully
- ‘started’ for when spidering starts
- ‘completed’ for when spidering finishes
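To show how block-based callbacks like the four above fit together, here is a minimal, self-contained sketch. `CallbackDemo` is a hypothetical stand-in, not the Spiderz source. It assumes each callback is registered by calling a method of the same name with a block, as the examples earlier do for `success` and `failure`. Calling `started` and `completed` with no block arguments is also an assumption.

```ruby
# Hypothetical stand-in, NOT the Spiderz implementation: it only shows how
# callbacks named started/success/failure/completed can be wired to blocks.
class CallbackDemo
  EVENTS = [:started, :success, :failure, :completed].freeze

  def initialize
    @handlers = {}
  end

  # Define one registration method per event, e.g. demo.success { |url, doc| ... }
  EVENTS.each do |event|
    define_method(event) do |&block|
      @handlers[event] = block
      self
    end
  end

  # Simulate a crawl over pre-fetched results instead of doing real HTTP.
  # `pages` maps each URL to a document, or to nil when loading failed.
  def crawl(pages)
    fire(:started)
    pages.each do |url, doc|
      doc ? fire(:success, url, doc) : fire(:failure, url)
    end
    fire(:completed)
  end

  private

  # Call the registered block for an event, if there is one.
  def fire(event, *args)
    handler = @handlers[event]
    handler.call(*args) if handler
  end
end

log = []
demo = CallbackDemo.new
demo.started   { log << "started" }
demo.success   { |url, doc| log << "ok #{url}" }
demo.failure   { |url| log << "failed #{url}" }
demo.completed { log << "completed" }
demo.crawl({ "/" => "<html></html>", "/missing" => nil })
puts log
```

Events without a registered block are simply ignored, so you only need to supply the callbacks you care about.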
You can also override the default skip behaviour. By default, Spiderz follows all internal links that are not mailto links or bookmarks.
For more info, read the source.
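As a rough illustration of the default skip rule described above, the sketch below decides whether a link would be followed. `skip_link?` is a hypothetical helper written for this README, not a Spiderz method. It treats "internal" as "same host as the site root" and "bookmark" as a link starting with `#`, and both are assumptions. How a custom rule is plugged into Spiderz is not shown here; check the source for that.

```ruby
require "uri"

# Hypothetical helper, not part of Spiderz: returns true when a link
# should be skipped under the default rule (external, mailto, or bookmark).
def skip_link?(href, site_root)
  return true if href.nil? || href.empty?
  return true if href.start_with?("#")   # bookmark on the current page
  return true if href =~ /\Amailto:/i    # mail link

  root   = URI.parse(site_root)
  target = URI.join(site_root, href)     # resolve relative links against the root
  target.host != root.host               # skip anything on another host
rescue URI::Error
  true                                   # links that cannot be parsed are skipped
end

root = "http://mysite.com"
["/about", "#top", "mailto:me@mysite.com", "http://elsewhere.example/"].each do |href|
  puts "#{href}: #{skip_link?(href, root) ? 'skip' : 'follow'}"
end
```

A stricter or looser rule (for example, also skipping links to PDF files) would only need another early `return true` line.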
REQUIREMENTS
- Hpricot
INSTALL
sudo gem install spiderz
Spiderz is open source and available on GitHub.