User documentation for AWW

The Academic Web Watch (AWW) controls a selection of web robots that collect information about how web sites change over time. It stores this information and can export it in various ways.

The included robots are few and simple, but it is easy to add more, depending on what information should be collected. AWW is mainly a Command and Control Centre (CCC) that executes tasks and stores the results. The initial goal is not to provide a lot of functionality, but rather to make something that is easily extensible.

The project was created by the SourceForge user fivecode, to be used in his master's project at the University of Oslo.

A screenshot of AWW's graphical user interface

AWW is open source: it can be used and modified by anyone, and it is licensed under the GNU Affero General Public License.

The project information, executable files, this documentation, etc. are available from sourceforge.net at this address: http://sourceforge.net/projects/wwatch/

Installation

Download the latest version of AWW here: http://sourceforge.net/projects/wwatch/files/

The download consists of a .tar.gz file that must be unpacked.

You can run AWW directly after unpacking:

python aww.py

Useful flags are:

-c provide a command to be executed
-d daemon mode (Starts the scheduler, but nothing else. Output is written to file.)
-cmd do not start the GUI, only a command line
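
As an example, the following invocations start the command line interface without the GUI, start daemon mode, and run a single command (the quoted command string is only an illustration):

python aww.py -cmd
python aww.py -d
python aww.py -c "bot list"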

For easy startup something like this could be added to your .bashrc file:

alias aww='python ~/<INSTALLATION-FOLDER>/AWW/aww/aww.py'

If you intend to import AWW and use it in other Python programs, you can install it the standard Python way:

sudo python setup.py install

If you do not have administrator privileges, you can install to the home directory:

python setup.py install --user

Output from AWW is written to the folder $HOME/aww_data

The explanation above is based on Linux systems. AWW may run just fine on other platforms, but this has not been tested.

Dependencies

AWW has been developed with Python versions 2.6 and 2.7. The main framework makes use of the following standard Python libraries, which are included in most Python distributions:

  • sys
  • os
  • time
  • datetime
  • webbrowser
  • Tkinter
  • tkFileDialog
  • tkMessageBox
  • tkSimpleDialog
  • tkFont
  • re
  • subprocess
  • sqlite3
  • threading

AWW is likely to run on other Python 2 versions, but using version 3 will require modification of the program. In addition, sqlite3 is named sqlite in some Python 2 versions. This can be solved by simply changing the import statements involving sqlite in the code.
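
If you prefer not to edit every import by hand, a fallback import like the following sketch can be used wherever the code imports sqlite3 (this is an illustration, not how AWW's code is currently written):

try:
    import sqlite3
except ImportError:
    # Some Python 2 distributions ship the module under the old name sqlite.
    import sqlite as sqlite3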

The robots included with AWW make use of some additional standard libraries:

  • urlparse
  • urllib2
  • hashlib
  • socket
  • random
  • robotparser
  • urllib

The robots also require two libraries that are not in the standard distribution of Python:

  • bs4
  • lxml

Visualizations are meant to use the non-standard library scitools.

If the dependencies for the robots are not installed, errors may be thrown when AWW is started. This can be avoided by disabling the robots that use the missing dependencies, which is done by removing the name of the robot from the list bot_list in the file __init__.py in the folder robots, as sketched below.
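
The following sketch shows what such a bot_list might look like; the exact contents of the file and the module names in your copy of AWW may differ:

# robots/__init__.py (illustrative sketch)
# A robot whose dependencies are missing can be disabled by
# leaving its name out of this list.
bot_list = ['httpcodes', 'upcheck', 'findsitemap']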

Running AWW

With AWW, you can command the included robots, or add your own. The program can handle scheduling of the robots' execution, and export of the collected data.

As mentioned in the Installation section, you can run AWW directly after unpacking. Simply navigate to the folder containing the file aww.py and type:

python aww.py

Command line interface

Here follows a summary of valid commands for AWW's command line interface.

Command group - Robots

bot list
    list available robots
bot run_default <bot name>+
    run robot, possibly with URLs from a task with frequency not set
bot run_url <bot name> <url>+
    run a robot once, with specific URLs
bot run_with_task_urls <bot name> <task name>
    run a robot once, with URLs belonging to given task

New robots are automatically imported on start-up. The command bot list prints the robots that have been successfully imported. When activating a robot, it is possible to pass a list of URLs, specify a task name for URL look-up, or run the robot without any arguments.
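
As an example, the following commands list the robots, run the included httpcodes robot with a specific URL, and run it without explicit URLs (the URL shown is only an illustration):

bot list
bot run_url httpcodes http://example.org/
bot run_default httpcodes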

Command group - Visualization

viz list
    list visualizations
viz <dataset name> <viz name>
    open a visualization
viz browser <dataset name> <viz name>
    open a visualization in the default browser
viz export <dataset name> <viz name>
    export visualization in png format

The viz commands are not yet meant for real use. They are only included to make it convenient to create visualization extensions at a later time. The commands can be run, but only to display a demo visualization that does not make use of any datasets.
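
As an illustration, the demo visualization described later in this documentation could be opened like this; since the demo ignores data, the dataset name used here is an arbitrary placeholder, and the lower-case visualization name is an assumption:

viz list
viz some_dataset surface
viz browser some_dataset surface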

Command group - Dataset

set list
    list datasets
set peek <set name>
    print 10 entries from a dataset
set export <file-format> <set name>+
    export dataset to file (html/txt/sql)
set truncate <set name>
    delete all data in a set

Datasets are stored in the database during the execution of the web robots. The robots are free to create new datasets when needed. As a convention, a set's name should always start with the name of the robot that uses it, followed by an underscore, and then something describing its contents.

Datasets can be accessed by robots and visualizations through a programming hook. They can also be exported to several file formats for further processing or storage.
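
Following the naming convention above, inspecting and exporting data collected by the httpcodes robot could look like this (the dataset name illustrates the convention and is not guaranteed to exist in your database):

set list
set peek httpcodes_results
set export txt httpcodes_results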

Command group - Tasks

task list
    list tasks
task run <task name>+
    execute the command belonging to a task
task info <task name>*
    list URLs in a task
task create <task name> <command>
    create a task
task frequency <task name> <minute hour dom month>
    set frequency for execution of a task (you can use 'manual' and 'default')
task add_url <task name> <url>+
    add a URL to a task
task import_urls <task name> <filename>
    import URLs from file
task remove_url <task name> <url>+
    remove a URL from a task
task remove_task <task name>
    delete a task, and its collection of URLs

Tasks are created manually. They contain a task name, a command, and an execution frequency. It is also possible to add URLs to a task, regardless of what command the task contains. If the task is to activate a robot, the URLs will be passed along to the robot. Adding URLs to a task can also be done by importing them from a file.

No execution frequency is assigned to a task when it is created. It must be specified manually afterwards, in a syntax similar to that of cron on Linux machines: minute, hour, day of month, and month are typed as integers separated by whitespace, as in the example below.
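
A possible session could look like this; the task name, the command string, and the timing values (minute 30, hour 2, day of month 1, month 6) are illustrations only, and how the command argument must be delimited may differ in practice:

task create example_check bot run_with_task_urls upcheck example_check
task add_url example_check http://example.org/
task frequency example_check 30 2 1 6
task run example_check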

Command group - Scheduler

scheduler start
    automatically iterate and execute tasks
scheduler stop
    stop automatic execution of tasks

The scheduler will be started automatically when the -d flag is used on startup. It can also be started from the user interfaces.

Command group - Miscellaneous

gui
    open the gui
help
    display list of available commands
quit
    exit the program (Ctrl-D)

The help command prints a list of all commands, along with an explanation for each command. The gui command opens the graphical interface and freezes the shell until it is closed again. The quit command exits the command loop; if no command interfaces are open and the scheduler is not running, the main module will also exit.

Graphical user interface

In the previous section the commands were sorted into groups. Most of those groups also appear as separate panels in the layout of the Graphical User Interface (GUI). The groups are: robots, visualizations, datasets, tasks, scheduler.

As an example of how to use the GUI, one can select the name of a web robot in the drop-down menu of the Robots panel, enter some URLs in the text field next to it, and press Run robot.

To see the results of the robot execution, use the functionality in the Datasets panel. Select a dataset that begins with the name of the robot that was used, and press Export dataset. The output will be written to a file in the folder $HOME/aww_data/output. The filename used can be found in the text panel at the bottom of the GUI.

To get a better understanding of the remaining components in the GUI, please compare them to the various commands explained in the previous section.

A screenshot of the controls in the GUI

AWW’s robots

Httpcodes
A simple robot named Httpcodes was created. All it does is make HTTP requests for the URLs that are passed to it, and register the HTTP reply codes. Httpcodes was used while developing the CCC, to help make clear what functions are needed for communication between the framework and the robots.
Upcheck
Upcheck's purpose is to find out whether web pages have changed. It does not use a crawler, but is handed a set of specific web page URLs. Its functionality is much simpler than Dublincrawl's: all it does is download pages from the web and store information about them. Below follows a detailed description.
Upcheck_ext
Some processing of Upcheck's datasets is required before they are exported, but it should not occur every time Upcheck is activated. A separate module was created for this purpose. It is not a robot in itself, but it is implemented in a similar way, to give users access to it via AWW's interface.
Dublincrawl

Dublincrawl makes use of the robot Plaincrawl to crawl web sites. It supplies a custom function of its own that counts the number and types of Dublin Core tags present in the downloaded web pages. Below are more details about the robot.

Dublincrawl gives a complex picture of an entire site. It is the more bot-like of the two main robots, in that it acts on its own and processes web content in a sophisticated way. It is not merely an automated download of one specific web resource at specific intervals.

Urlgen
A robot named Urlgen was built with the purpose of generating random IP addresses. The details are described in the implementation chapter. The word URL in the robot's name can be misleading; it remains from an early stage of the design, when it was hoped that random URLs could be obtained instead of IP addresses.
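
As a rough illustration of the idea only (this is not Urlgen's actual code), a random IPv4 address can be generated like this:

import random

def random_ipv4():
    # Four random octets joined with dots; this sketch does not
    # filter out reserved or private address ranges.
    return '.'.join(str(random.randint(0, 255)) for _ in range(4))

print random_ipv4()
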
Url_list_refiner
The list of sitemap URLs generated by the robot Findsitemap is likely to contain broken links. Url_list_refiner was developed to make HTTP requests and store the URLs for which a reply is received. The resulting set of URLs can then be used for more time-efficient data collection.
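
The essence of that filtering step can be sketched as follows; this illustration uses urllib2 directly and is not Url_list_refiner's actual code:

import urllib2

def responding_urls(urls, timeout=10):
    # Return the subset of urls for which an HTTP reply was received.
    alive = []
    for url in urls:
        try:
            reply = urllib2.urlopen(url, timeout=timeout)
            reply.close()
            alive.append(url)
        except Exception:
            # No usable reply; note that error status codes raise an
            # exception here, so such URLs are also skipped in this sketch.
            pass
    return alive
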
Plaincrawl
To provide crawling functionality for the robot Dublincrawl, another robot was created, named Plaincrawl. By default it simply crawls web sites and stores discovered URLs in the database, but it can be passed Python functions via a setter function, in order to add processing of downloaded web content.
Findsitemap

The other main robot, Upcheck, can be used with URLs pointing to many kinds of web content. In the data collection for this project it is used to monitor updates of sitemaps, so a list of URLs pointing to sitemaps must be acquired. For this purpose another robot was created, named Findsitemap. It takes a list of URLs and attempts to collect a list of sitemap URLs, by searching the robots.txt files related to the addresses in the given list.

Findsitemap works by downloading the robots.txt file corresponding to each of the URLs that are passed to it. A robots.txt file sometimes contains a Sitemap attribute. The sitemap URLs registered with this attribute are what Findsitemap collects and stores.

Although not all robots.txt files contain references to sitemaps, the ones that do often include several. This technique can therefore produce as many sitemap URLs as, or even more than, the number of URLs passed to the robot.
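
The core of the technique can be sketched as follows; this is an illustration only, not Findsitemap's actual code:

import urllib2
import urlparse

def sitemaps_from_robots_txt(url, timeout=10):
    # Download the robots.txt of the site behind the given URL and
    # return the sitemap URLs listed in it.
    robots_url = urlparse.urljoin(url, '/robots.txt')
    sitemaps = []
    try:
        reply = urllib2.urlopen(robots_url, timeout=timeout)
        for line in reply:
            # Lines have the form "Sitemap: http://example.org/sitemap.xml"
            if line.lower().startswith('sitemap:'):
                sitemaps.append(line.split(':', 1)[1].strip())
        reply.close()
    except Exception:
        pass  # robots.txt missing or unreachable; return what we have
    return sitemaps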

AWW’s visualizations

No visualizations are included with AWW at this point, except for a demonstration module.

Visualization: Surface
The demonstration module is named Surface. It does not make use of any datasets. It only contains a few lines of code, obtained from the EasyViz documentation (https://code.google.com/p/scitools/wiki/EasyvizDocumentation#Visualization_of_Scalar_Fields). The code results in a scalar field being drawn, based on generated numbers.
A screenshot of the demo visualization *surface*, opened via the command line instead of through the GUI.

A screenshot of the demo visualization *surface*, opened in the GUI.

Visualizations can also be opened in the GUI window, or in the system’s default browser, but this implies that they will first be written to disk.

Extending the functionality

There are two obvious ways to extend the functionality of AWW: adding more web robots, so that it can collect new types of information, and creating visualizations for the collected data.

Adding robots means adding Python files to the folder robots. Adding visualizations means adding Python files to the folder visualizations. Extensions that are executed by AWW can be written in any programming language and perform almost any type of task, but the communication between AWW and the extensions must happen through a Python file.

Create a .py file, and place it in the appropriate folder. This file can contain all the functionality of the new extension, or it could, for example, make calls via the system command line to access functionality elsewhere.

Two mandatory functions must be implemented. For robots they are named aww_bot_register() and aww_run(). For visualizations they are named aww_viz_register() and aww_visualize().

For extensions to be discovered by AWW, the name of the Python file with the mandatory functions must be listed in the __init__.py file of the folder robots or visualizations.

Further examples of how these functions can be implemented can be found by looking at the source code, but here follows the complete code for one of the simplest robots included.

#!/usr/bin/env python

"""
.. module:: httpcodes
   :synopsis: This module's purpose is to save reply codes
   for HTTP requests.

   When aww_run is called, an HTTP request is made for each
   of the URLs that are supplied as arguments. The HTTP
   status code of the reply is stored to the database.

.. moduleauthor:: fivecode <fivecode{a}users.sourceforge.net>
"""

# import some standard python packages
import time, datetime
import urllib2, hashlib
import hook
# import some of aww's functionality for web robots
from robot_tools import Robot_info, Dataset_info, HttpError
from robot_tools import get_timestamp, make_well_formed
from robot_tools import polite_open, get_standard_fields

bot_name = 'httpcodes'

def retrieve_code(url):
    """
    Obtain and store an HTTP reply code for the URL given as
    parameter.

    If the HTTP request results in a redirect, an exception
    will be raised. This is just a test bot, and the
    redirect will not be followed.

    Of the 3 functions in this robot, this is the only
    one that is not required in all robots for AWW.

    :param url: A complete URL to use for a HTTP request
    """
    try:
        file1 = polite_open(url)
        if not file1: # open failed
            return None
        result = get_timestamp(), url, file1.getcode()
        file1.close()
        rowid = hook.dataset_insert(bot_name, \
                                        'results', \
                                        result)
    except HttpError, e:
        if e.http_code == 301 or e.http_code == 302 or \
            e.http_code == 303 or e.http_code == 307:
            print bot_name +' received redirect'
            print 'This test robot does not follow redirects.'
        else:
            print 'HttpError ({0}) received: {1}'.\
                format(e.http_code, e.args)
    except Exception, e:
        print bot_name + ' received exception: ' + str(e)

def aww_run(hosts=None):
    """
    A function with this signature must be present in all
    robots for AWW.

    This function may be different in different
    robots, but should always handle the hosts-argument
    for the cases None, not None, and list

    :param hosts: URLs that can be used by this robot.
    """
    if hosts is None:
        hosts = hook.get_default_urls(bot_name)
        if hosts is None:
            print 'Could not acquire host URLs'
            return
    if type(hosts) is not list:
        aww_run([hosts]) # convert hosts to a list
    else:
        for url in hosts: # for all URLs given
            retrieve_code(make_well_formed(url))

def aww_bot_register():
    """
    A function with this signature must be present in all
    robots for AWW.

    Its purpose is to register the robot's name,
    description, and datasets in the main framework.
    """
    bot_info = Robot_info(bot_name, \
                              'Stores HTTP reply codes')
    # SET 1
    set1 = Dataset_info('results', '(rowid, timestamp, '
                        + 'url, reply_code)')
    set1.add_field('timestamp', 'text', \
                       'Time of execution')
    set1.add_field('start_url', 'text', \
                       'URL provided by the user')
    set1.add_field('reply_code', 'text', \
                       'Reply code from HTTP header')
    bot_info.add_dataset(set1)
    hook.bot_register(bot_info)

Troubleshooting

For further information about the software, please refer to the project’s web pages on Sourceforge: http://sourceforge.net/projects/wwatch