June 3, 2008

httpsnapshot

Version: 0.5.5
2005.05.12
Author: Shi-ichiro HARA (sinara@blade.nagaokaut.ac.jp)
Synopsis

This program downloads web pages. For example, to fetch http://ruby-lang.org/ and save it in the directory ruby-dir, run:

ruby httpsnapshot.rb http://ruby-lang.org/ ruby-dir
All images on the page are then downloaded as well, and the links are rewritten for local browsing.

If you also want the pages linked from that page, pass a depth as the third argument; pages are then fetched recursively down to that depth (the default is 0).

ruby httpsnapshot.rb http://ruby-lang.org/ ruby-dir 2
Pages on other sites, or in directories above that of the first page, are not fetched. When httpsnapshot.rb is used as a library, however, you can control which pages are fetched.
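
The default restriction described above (same site, and not above the directory of the first page) can be sketched with Ruby's standard uri library. This is only an illustration: in_scope? is a hypothetical helper, not a method of httpsnapshot.rb, and the real program may decide differently in edge cases.

```ruby
require "uri"

# Hypothetical helper (not part of httpsnapshot.rb): true when +link+ is on
# the same host and port as +start+ and lies at or below the directory of
# the start page.
def in_scope?(start, link)
  s = URI.parse(start)
  l = URI.join(start, link)                 # resolve relative links
  return false unless l.host == s.host && l.port == s.port
  base_dir = s.path.sub(%r{[^/]*\z}, "")    # directory part of the start path
  base_dir = "/" if base_dir.empty?
  l.path.start_with?(base_dir)
end

puts in_scope?("http://ruby-lang.org/en/index.html", "doc/a.html")
puts in_scope?("http://ruby-lang.org/en/index.html", "../ja/")
```

The first call resolves to a page under /en/ and is accepted; the second resolves to /ja/, which is outside the start directory, and is rejected.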

Installation

Place httpsnapshot.rb in a directory from which Ruby can load it.

Usage

httpsnapshot.rb [OPTIONS] []
Options

-f config_file
Read the configuration file.
-p Proxy
Set the HTTP proxy server.
-h
Show help.
-t
Get text files only.
-a
Output text files converted from HTML files.
-c Cache_dir
Set the cache directory of original HTML files. The default is the subdirectory .snap-cache.
-g
Only get files and do not translate HTML files. HTML files are saved in the cache directory.
-U
Do not read local cache.
-w
Only rewrite local files previously fetched with the -g option.
-d Num
Get pages recursively to depth Num. (Same as the third command-line argument; this option takes precedence.)
-D
Get pages depth-first.
-x
Get files even if they are not under the start directory.
-i
Get src targets even if they are outside the start site.
-j
Get href targets even if they are outside the start site.
-V
Set terse mode.
-E
Erase cache files before the session.
-e
Erase cache files after the session.
-M size
Limit the maximum file size (in bytes).
-m size
Limit the minimum file size (in bytes).
-y filter-file
Filter links according to filter-file. For example:

# filter-file
:href-allow,ignorecase
.html?$
.shtml?$
:href-deny,ignorecase
^foo.html$
^bar.html$
This allows .htm, .html, .shtm and .shtml files and denies foo.html and bar.html. src-allow and src-deny can be set in the same way.
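
To make the allow/deny semantics concrete, here is a minimal sketch of how such a filter file could be parsed and applied. It is not the parser used by httpsnapshot.rb: parse_filter and href_allowed? are hypothetical names, and the sketch assumes that patterns are matched against the file name, that a link must match at least one allow pattern, and that it must match no deny pattern.

```ruby
# The example filter file from above.
FILTER_TEXT = <<~'EOS'
  # filter-file
  :href-allow,ignorecase
  .html?$
  .shtml?$
  :href-deny,ignorecase
  ^foo.html$
  ^bar.html$
EOS

# Hypothetical parser: returns { "href-allow" => [Regexp, ...], ... }.
def parse_filter(text)
  rules = Hash.new { |h, k| h[k] = [] }
  name, flags = nil, 0
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")   # skip blanks and comments
    if line.start_with?(":")                        # section header, e.g. ":href-allow,ignorecase"
      name, *opts = line[1..-1].split(",")
      flags = opts.include?("ignorecase") ? Regexp::IGNORECASE : 0
    elsif name
      rules[name] << Regexp.new(line, flags)        # pattern line for the current section
    end
  end
  rules
end

# Assumed semantics: allowed when some allow pattern matches (or there are
# no allow patterns) and no deny pattern matches.
def href_allowed?(rules, filename)
  allow = rules.fetch("href-allow", [])
  deny  = rules.fetch("href-deny", [])
  (allow.empty? || allow.any? { |re| re =~ filename }) &&
    deny.none? { |re| re =~ filename }
end

rules = parse_filter(FILTER_TEXT)
puts href_allowed?(rules, "index.html")
puts href_allowed?(rules, "foo.html")
```

With these rules, index.html is accepted and foo.html is rejected. The same approach would apply to src-allow and src-deny sections.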

Configuration File

proxy: http://proxy.xxx.yyy.com:8080
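
The configuration file shown above is a plain "key: value" line. As a rough illustration of reading such a file, the sketch below splits each line at the first colon. read_simple_config is a hypothetical stand-in; the format accepted by the real HttpSnapSession.read_config may differ, and the handling of blank lines and # comments here is an assumption.

```ruby
# Hypothetical reader for "key: value" lines (not the real read_config).
def read_simple_config(text)
  config = {}
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")   # assumed: blanks and comments are ignored
    key, value = line.split(/\s*:\s*/, 2)           # split at the first colon only
    config[key] = value if value
  end
  config
end

config = read_simple_config("proxy: http://proxy.xxx.yyy.com:8080\n")
puts config["proxy"]
```

Splitting with a limit of 2 keeps the colons inside the URL intact.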
Using as a library

httpsnapshot.rb defines two classes, HttpSnapSession and HttpSnap. With them, you can write programs that control which web pages are fetched.

By default, httpsnapshot.rb runs the following code:

#!/usr/local/bin/ruby
require "httpsnapshot"
require "getopts"
getopts("UuxhtvVgwae", "c:", "d:", "p:", "f:.httpsnapshotrc")

target, savedir = ARGV.shift, ARGV.shift
($OPT_h or !savedir) and usage()
savedir.sub!(/\/$/, '') #/
depth = Integer($OPT_d || ARGV.shift)

$config = HttpSnapSession.read_config($OPT_f) if $OPT_f

proxy_s = $OPT_p || $config["proxy"] || ENV["HTTPSNAPSHOT_PROXY"]
if proxy_s && !proxy_s.empty?
  proxy_s.sub!(/^(:?http:\/\/)?/, 'http://')
  proxy_s.concat ":8080" if proxy_s !~ /:\d+$/
  proxy = URI.parse(proxy_s)
  proxy, proxy_port = proxy.host, proxy.port
end

session = HttpSnapSession.new(target, savedir, depth)
session.cache_dir = $OPT_c if $OPT_c
session.verbose = $OPT_v || !$OPT_V
session.read_local = !$OPT_U || $OPT_u
session.proxy, session.proxy_port = proxy, proxy_port
session.only_get = true if $OPT_g || $OPT_a
session.output_format = "text" if $OPT_a

unless $OPT_x
  session.add_href_filter do |snap, psnap|
    snap.upper_dir?(session.root) || snap.path =~ /\bcgi-bin\b/
  end
end

if $OPT_t || $OPT_a
  session.add_href_filter do |snap, psnap| snap.guess_text?; end
  session.add_src_filter do |snap, psnap| snap.guess_text?; end
end

if $OPT_w
  session.build_table
  session.rewrite
else
  session.start
end

if $OPT_e
  session.cache_clear
end

puts "First File: #{session.start_file}" if session.first
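
The proxy handling in the listing above (take the proxy string from -p, the configuration file or HTTPSNAPSHOT_PROXY, add http:// when it is missing, and default the port to 8080) can be isolated as a small function. The following is a sketch only: normalize_proxy is a hypothetical name, and it returns a new value instead of modifying the string in place as the listing does.

```ruby
require "uri"

# Hypothetical helper mirroring the proxy handling in the listing above.
# Returns [host, port], or nil when no proxy is given.
def normalize_proxy(proxy_s)
  return nil if proxy_s.nil? || proxy_s.empty?
  s = proxy_s.sub(/\A(?:http:\/\/)?/, "http://")   # add the scheme when missing
  s += ":8080" unless s =~ /:\d+\z/                # default port, as in the listing
  uri = URI.parse(s)
  [uri.host, uri.port]
end

host, port = normalize_proxy("proxy.xxx.yyy.com")
puts "#{host}:#{port}"
```

A caller could then assign the pair to session.proxy and session.proxy_port, as the listing does.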
Reference

httpsnapshot
