A framework for controlling web robots.
The program’s controller relays communication between the view and the model. The functions here give a summary of what the program can do.
Ideally, all output should happen through this function, or through log_write() in the model. This function redirects the output, based on which mode the program is running in.
Parameters: | some_str – Output string |
---|
Get the names and descriptions of all registered robots.
Returns: | dictionary – Robot names and descriptions |
---|
Imports the program code for the specified robot, if its name is registered in the database, and calls its aww_run() function.
Parameters: | bot_name – The robot to run |
---|
Imports the program code for the specified robot, if its name is registered in the database, and calls its aww_run() function, with the given URLs as an argument.
Parameters: |
|
---|
Acquires the URLs associated with the given task. Imports the program code for the specified robot, if its name is registered in the database, and calls its aww_run() function, with the URLs as an argument.
Parameters: |
|
---|
Writes a dataset to file in the output folder, using set_name and the current day as the filename.
The Excel format (xlsx) was originally intended to be supported here, but was left out, because it appears to require additional libraries. More info here: http://www.python-excel.org/
Parameters: |
|
---|
Prints 10 entries from a dataset.
Parameters: | set_name – The name of the dataset |
---|
Get the names and descriptions of all the datasets.
Returns: | dictionary – Entries with dataset names and corresponding descriptions |
---|
Stops the scheduler, and exits the program.
If the string arg is found in the list argv, the trailing elements, up to the next string starting with a hyphen (-), are returned.
Parameters: |
|
---|---|
Returns: | string – the argument |
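The argument lookup described above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function name get_arg, joining the trailing elements with spaces, and returning None when arg is absent are all assumptions.

```python
def get_arg(arg, argv):
    """Return the elements following `arg` in `argv`, joined by spaces.

    Collection stops at the next element that starts with a hyphen.
    Returning None when `arg` is absent is an assumption of this sketch.
    """
    if arg not in argv:
        return None
    values = []
    for element in argv[argv.index(arg) + 1:]:
        if element.startswith("-"):
            break
        values.append(element)
    return " ".join(values)
```

For example, get_arg("-t", ["prog", "-t", "my", "task", "-f", "daily"]) collects "my" and "task" and stops at "-f".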
Opens the graphical interface.
Receives a command as a string, parses it, to retrieve a command object, then executes the function described by that command object.
Parameters: | cmd – A string containing a command |
---|
Instead of opening a user interface, the scheduler is started directly. In addition output is redirected to the log file.
Confirms whether the task scheduler is running.
Returns: | boolean – True if the scheduler is running |
---|
Starts the task scheduler.
Stops the task scheduler.
Prints two strings, starting the second one at the given offset.
Parameters: |
|
---|
Delete all entries from a table
Parameters: | table_name – The name of the table |
---|
For every task created there exists a (possibly empty) list of URLs. This function appends URLs to such lists.
Parameters: |
|
---|
Creates a task, and adds it to the table tasks. A task is a tuple containing a task name, execution frequency, and a command to be executed.
Parameters: |
|
---|
Get information about all the tasks.
Returns: | list of lists – the lists are on the form (bot_name, task_name, frequency) |
---|
Retrieves a list of URLs stored for this task.
Parameters: | task_name – The name of the task |
---|---|
Returns: | list – URLs belonging to the given task, or None, if the list is empty |
Reads URLs from a text file, and saves them to a task. In the text file each line should contain one URL.
Parameters: |
|
---|
Delete a task from the database.
Parameters: | task_name – The name of the task |
---|
Removes a URL from the list of URLs belonging to a task.
Parameters: |
|
---|
This function retrieves the command string belonging to the given task. It then parses it with help of the commandline module, to get a command object, then executes the function described by that command object.
Parameters: | task_name – The name of the task |
---|
If the given task exists, its frequency is set to the specified value.
Parameters: |
|
---|
Imports the program code for the specified visualization, if its name is registered in the database, and calls its aww_run() function.
Parameters: |
|
---|---|
Returns: | string – filename of exported graphics |
Get the names and descriptions of all registered visualizations.
Returns: | dictionary – Visualization names and descriptions |
---|
Anything concerning SQL happens here, via SQLite.
(SQLite supports the data types: null, integer, real, text, blob, but in the current implementation only text and integer are used.)
Most database-related functions contain exception handling, mostly with generic exceptions. Catching generic exceptions can be considered bad practice, but the goal here is a robust program. Therefore the worst-case scenario should be that a function returns None, rather than raising an exception.
Determine whether information about a robot exists in the database.
Parameters: | bot_name – The name of the robot |
---|---|
Returns: | boolean – True if the robot was found |
Get the names and descriptions of all registered robots.
Returns: | dictionary – Robot names and descriptions |
---|
All tasks have a list of URLs. This function returns the URLs of the default task for the given robot.
(The default task is the first task found where execution frequency equals ‘not set’)
Suggestion for improvement: Tasks are no longer associated with robots, so this function is no longer useful. URLs must be given by specifying a task, or by typing them in manually. This function should probably be removed.
Parameters: | bot_name – The name of the robot |
---|---|
Returns: | list – default URLs for a robot |
Saves the name and a description of a robot in database.
(Datasets for the robots are created through other functions.)
Parameters: | bot_info – instance of robots.robot_tools.Robot_info |
---|
Creates a table in the database named on the form bot_name_set_name, then adds information about the dataset to the table datasets
Parameters: |
|
---|
Determine whether a dataset exists in the database. This is different from the function table_exists(). Here we only go through the list of datasets returned by datasets.get().
Parameters: | set_name – The name of the dataset |
---|---|
Returns: | boolean – True if the dataset exists |
Add a description of an existing dataset.
Parameters: |
|
---|
Write a list of tuples to a html file.
Parameters: |
|
---|
Write a table (as SQL) to file.
Suggestion for improvement: param dataset is not used, so remove it
Parameters: |
|
---|
Write a list of tuples to a text file.
Parameters: |
|
---|
Write a list of tuples to a text file.
TODO: This function has not been implemented
Parameters: |
|
---|
Get the names and descriptions of all the datasets.
Returns: | dictionary – Entries with dataset names and corresponding descriptions |
---|
This returns a connection to the database. This function should be called every time the database is accessed, to avoid race conditions between threads, and because days could potentially pass between function calls.
Returns: | A pysqlite connection object |
---|
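A minimal sketch of the connection-per-access pattern described above, using Python's standard sqlite3 module. The names get_connection and count_rows, and the db_path parameter, are hypothetical; the real module presumably knows its own database location.

```python
import sqlite3


def get_connection(db_path):
    """Open and return a fresh connection to the SQLite file at db_path."""
    return sqlite3.connect(db_path)


def count_rows(db_path, table_name):
    """Count the rows in a table, using a connection opened just for this call.

    Table names cannot be bound as SQL parameters, so the name is
    interpolated; it must come from a trusted source.
    """
    conn = get_connection(db_path)
    try:
        cursor = conn.execute('SELECT COUNT(*) FROM "%s"' % table_name)
        return cursor.fetchone()[0]
    finally:
        # Closing promptly keeps connections from being shared across threads.
        conn.close()
```

Opening a short-lived connection per call means no connection object is shared between the scheduler thread and the user interface.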
Returns the argument, possibly with a (nr)-suffix, to make sure we do not overwrite an existing file.
Suggestion for improvement of code: should not need to take the parameter file_extension, but rather work with the complete filename contained in the parameter wanted_name.
Parameters: |
|
---|---|
Returns: | string – the wanted file name, but possibly modified |
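The suffixing behaviour can be sketched roughly as below. The function name free_filename, the folder parameter, and the assumption that file_extension includes the leading dot are illustrative choices, not taken from the original code.

```python
import os


def free_filename(folder, wanted_name, file_extension):
    """Return a file name that does not yet exist in `folder`.

    If wanted_name + file_extension is taken, a (nr) suffix is inserted
    before the extension and incremented until the name is free.
    """
    candidate = wanted_name + file_extension
    nr = 1
    while os.path.exists(os.path.join(folder, candidate)):
        candidate = "%s(%d)%s" % (wanted_name, nr, file_extension)
        nr += 1
    return candidate
```

The check and the later write are separate steps, so another process could still claim the name in between; for a single-user program that is usually acceptable.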
This returns a string, that is a path to the program’s main folder. There is a folder hierarchy inside it, but anything written to disk by the program ends up somewhere within this folder. The main folder should automatically be located at the bottom level of the user’s home directory.
Returns: | string – the folder path |
---|
Returns a folder within the main folder, where exported datasets and visualizations are stored.
Returns: | string – the folder path |
---|
The model is not meant to print output for users to see, but can write to a log file instead.
Parameters: | some_str – The text that should be written to file |
---|
This is called automatically on startup, to ensure that all robots are available. This happens by going through the variable bot_list, in the __init__ file of the robots sub-package, and adding any unknown robots.
This is called automatically on startup, to ensure that all visualizations are available. This happens by going through the variable viz_list, in the __init__ file of the visualizations sub-package, and adding any unknown visualizations.
This is called automatically on startup. It creates a file for the database, if missing, as well as all the tables required for basic program functionality.
Check for existence of a value in a table.
Parameters: |
|
---|---|
Returns: | boolean – True if the value was found |
Writes the contents of the table corresponding to set_name to filename.
Suggestion for improvement: Should check that the filename is free.
Parameters: |
|
---|
Checks if a table exists in the database
Parameters: | table_name – The name of the table |
---|---|
Returns: | boolean – True if the table was found |
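One common way to perform this check in SQLite is to query the built-in sqlite_master catalogue. The sketch below assumes that approach and a hypothetical signature that takes the connection as an argument; the original implementation may differ.

```python
import sqlite3


def table_exists(conn, table_name):
    """Return True if a table with the given name exists in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None
```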
This function returns all the data in the set. For sets of large size the data should be extracted in another way.
Suggestion for improvement: Could we return a subset defined by a time interval?
Parameters: | table_name – The name of the table |
---|---|
Returns: | list – A list with all the tuples in the table |
Retrieves the column names for the given table.
Suggestion for improvement: it could instead return a dictionary including column descriptions, but descriptions are not saved in the system
Parameters: | dataset_name – The name of the table |
---|---|
Returns: | list – A list with all the column names in the table |
Insertion of multiple tuples into database. The tuples must all contain the same number of elements. If values for all table columns are not provided, then column names must also be specified.
Parameters: |
|
---|
Insertion of single tuples into database. If values for all columns are not provided, then column names must also be specified.
The function name contains the word special because the function returns the rowid of the inserted entry. (This also means that it can only take single entries, and not lists of entries.)
Parameters: |
|
---|
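With Python's sqlite3 module, the rowid of a newly inserted row is available as the cursor's lastrowid attribute. The following sketch shows a single-tuple insert built on that; the function name insert_special and its parameter order are assumptions.

```python
import sqlite3


def insert_special(conn, table_name, values, column_names=None):
    """Insert one tuple and return the rowid of the new entry.

    column_names may be omitted only if `values` covers every column.
    Table and column names are interpolated and must be trusted.
    """
    placeholders = ", ".join("?" for _ in values)
    columns = ""
    if column_names is not None:
        columns = " (%s)" % ", ".join('"%s"' % name for name in column_names)
    sql = 'INSERT INTO "%s"%s VALUES (%s)' % (table_name, columns, placeholders)
    cursor = conn.execute(sql, tuple(values))
    conn.commit()
    return cursor.lastrowid
```

Because lastrowid only reflects the most recent insert, the same approach does not extend to lists of entries, which matches the restriction stated above.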
Check if there are any entries present in a table.
Parameters: | table_name – The name of the table |
---|---|
Returns: | boolean – True if the table is empty |
Parameters: | dataset_name – The name of the table |
---|---|
Returns: | int – number of rows in table |
Retrieves 10 entries from a table, to give an impression of the table’s structure and content.
Parameters: | set_name – The name of the table |
---|---|
Returns: | list of tuples – Rows from the given table |
Returns one tuple from table_name. Order of entries is not considered. The entry is removed from the table.
When crawling large collections of URLs, this type of functionality makes it possible to use the database as a queue.
Parameters: | table_name – The name of the table |
---|---|
Returns: | tuple – The first entry found |
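A rough sketch of such a pop operation, assuming an ordinary SQLite table with an implicit rowid. The name pop_entry, the connection parameter, and returning None for an empty table are assumptions of this example.

```python
import sqlite3


def pop_entry(conn, table_name):
    """Fetch one row from the table, delete it, and return it as a tuple.

    No ordering is requested, so which row comes back is up to SQLite.
    Returns None when the table is empty.
    """
    row = conn.execute(
        'SELECT rowid, * FROM "%s" LIMIT 1' % table_name
    ).fetchone()
    if row is None:
        return None
    conn.execute('DELETE FROM "%s" WHERE rowid = ?' % table_name, (row[0],))
    conn.commit()
    # Drop the leading rowid so the caller only sees the table's own columns.
    return row[1:]
```

A crawler can call this in a loop until it returns None, treating the table as a work queue that survives restarts.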
Delete all content from a table.
Parameters: | table_name – The name of the table |
---|
For every task created there exists a (possibly empty) list of URLs. This function appends URLs to such lists.
Parameters: |
|
---|
Creates a task, and adds it to the table tasks. A task is a tuple containing a task name, execution frequency, and a command to be executed.
Parameters: |
|
---|
Determine whether information about a task exists in the database.
Parameters: | task_name – The name of the task |
---|---|
Returns: | boolean – True if the task was found |
Get information about all the tasks.
Returns: | list of lists – the lists are on the form (bot_name, task_name, frequency) |
---|
Retrieves a list of URLs stored for this task.
Parameters: | task_name – The name of the task |
---|---|
Returns: | list – URLs belonging to the given task, or None, if the list is empty |
Reads URLs from a text file, and saves them to a task. In the text file each line should contain one URL.
Parameters: |
|
---|---|
Returns: | int – number of urls added |
Delete a task from the database.
Parameters: | task_name – The name of the task |
---|
Removes a URL from the list of URLs belonging to a task.
Parameters: |
|
---|
If the given task exists, its frequency is set to the specified value.
Parameters: |
|
---|
Determine whether information about a visualization exists in the database.
Parameters: | viz_name – The name of the visualization |
---|---|
Returns: | boolean – True if the visualization is found |
Get the names and descriptions of all registered visualizations.
Returns: | dictionary – Visualization names and descriptions |
---|
Save name and description of a visualization in database.
Parameters: |
|
---|
This module enables scheduling of tasks. It utilizes, and is utilized by, the controller module. When start_scheduler() is run, a loop is entered, which periodically checks the current time against the execution frequencies of the tasks, and executes any tasks with a matching execution time.
This function returns one of Python's datetime objects, representing the next point in time that the given frequency describes.
Parameters: | freq – A time frequency on the form: ‘minute hour dom month’ (* means every) |
---|---|
Returns: | datetime.datetime – a point in time described by the freq parameter |
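One simple way to resolve a frequency string of this form is to step forward minute by minute until all four fields match. The sketch below does that; the name next_time, the optional now parameter, treating the current minute as a valid match, and giving up after 366 days are all assumptions, and the original may compute the result differently.

```python
from datetime import datetime, timedelta


def next_time(freq, now=None):
    """Return the next datetime matching 'minute hour dom month'.

    A field of '*' matches every value. The search starts at the current
    minute and returns None if nothing matches within 366 days.
    """
    minute, hour, dom, month = freq.split()
    if now is None:
        now = datetime.now()
    candidate = now.replace(second=0, microsecond=0)
    limit = candidate + timedelta(days=366)

    def matches(field, value):
        return field == "*" or int(field) == value

    while candidate < limit:
        if (matches(month, candidate.month)
                and matches(dom, candidate.day)
                and matches(hour, candidate.hour)
                and matches(minute, candidate.minute)):
            return candidate
        candidate += timedelta(minutes=1)
    return None
```

The brute-force search is slow compared with computing each field directly, but it is easy to verify and handles month lengths without special cases.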
Confirms whether the task scheduler is running.
Returns: | boolean – True if the scheduler is running |
---|
Compares two datetime.datetime objects, to see if they are equal.
Parameters: |
|
---|---|
Returns: | boolean – True if the objects are equal, with a precision level of minutes |
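Minute-level comparison can be done by discarding seconds and microseconds before comparing. This is a small sketch with a hypothetical function name, same_minute.

```python
from datetime import datetime


def same_minute(first, second):
    """Return True if two datetimes fall within the same calendar minute."""
    return (first.replace(second=0, microsecond=0)
            == second.replace(second=0, microsecond=0))
```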
Iterates over the task queue and returns all tasks that should be executed during the current minute.
Returns: | list – A list of tasks |
---|
Prints all tasks in the task queue.
Checks if the current minute has changed since the last time the function was called. If we are in a new minute, all tasks are retrieved from the model, and their timestamps are regenerated.
Loops forever. Refreshes the task queue every minute.
Creates a new thread and uses it to start the scheduling loop.
Stops the task scheduler.
It breaks the scheduling loop by negating a boolean, so that run() returns in another thread.
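The start, run and stop behaviour described above can be sketched with a background thread and a boolean flag. The original uses module-level functions; wrapping the state in a class named Scheduler, the tick callback, and the short polling interval are assumptions made to keep the example self-contained.

```python
import threading
import time


class Scheduler:
    """Runs a callback repeatedly on a background thread until stopped."""

    def __init__(self, tick, interval=0.01):
        self._tick = tick          # called once per loop iteration
        self._interval = interval  # seconds to sleep between iterations
        self._running = False
        self._thread = None

    def run(self):
        # The loop ends as soon as stop() flips the flag.
        while self._running:
            self._tick()
            time.sleep(self._interval)

    def start(self):
        if self._running:
            return
        self._running = True
        self._thread = threading.Thread(target=self.run, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def is_running(self):
        return self._running
```

In the real scheduler the tick would refresh the task queue and execute due tasks, and the interval would be much longer than in this sketch.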
The command line provides functionality by parsing commands, and their arguments, and then calling appropriate functions in controller.py.
When parsing a command returns a Command object, a function referred to by it is called. Ideally that will be a function in the controller module, but often it is a private function in this module, where arguments to the command are sorted out, before calling the appropriate function in the controller.
Information about a single command
A pointer to this function is passed to the readline module to enable tab completion.
Builds the list of commands. Each Command object contains its own description, arguments, and which function to call.
Loops and parses commands. When the loop ends control is returned, which should cause the program to exit.
Goes through the list of possible commands and compares them with the parameter user_input. If a matching command is found, it is returned.
Parameters: | user_input – The text to be used for comparison |
---|---|
Returns: | Command – A command matching the input |
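A minimal sketch of command lookup, assuming that a command matches when its name equals the first word of the input. The Command fields shown here, the find_command name, and the matching rule itself are illustrative guesses rather than the module's actual design.

```python
class Command:
    """Information about a single command (fields are hypothetical)."""

    def __init__(self, name, description, function):
        self.name = name
        self.description = description
        self.function = function


def find_command(commands, user_input):
    """Return the first Command whose name equals the first input word."""
    words = user_input.split()
    if not words:
        return None
    for command in commands:
        if command.name == words[0]:
            return command
    return None
```

Once a Command is found, the caller would invoke command.function with the remaining words as arguments.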
The GUI module has little functionality, only graphical components that relay input to functions in the controller module.
Instantiating this class opens the graphical interface. This is normally done by the function open_gui().
Updates scheduler info, and drop down menus.
Creates an instance of the class AWW_GUI, and redirects I/O to it.
The hook is meant to be imported by extensions that require access to functionality in the model. Usually this means robots or visualizations.
Often parameters are given as extension_name and set_name, which are later concatenated into a table name before a function in the model is called.
Register information about a robot in the database.
Parameters: | bot_info – instance of robots.robot_tools.Bot_info |
---|
Check for existence of at least one instance of value in the column column_name in a dataset.
Parameters: |
|
---|---|
Returns: | boolean – True if the value was found at least once |
Retrieve the entire dataset as a list of tuples.
This function may not be suitable for large datasets.
Parameters: |
|
---|---|
Returns: | list – a list of tuples containing dataset entries |
Insert one or more tuples. If column_names is None, then the content of the tuples must be ordered the same way as when the dataset was created.
Parameters: |
|
---|
This function takes only one tuple for insertion. The function is special in that it returns an integer, representing the resulting rowid of the insertion.
Parameters: |
|
---|---|
Returns: | int – rowid of the resulting table entry |
Check whether there are any entries in a dataset.
Parameters: |
|
---|---|
Returns: | boolean – True if there is nothing in the dataset. |
Gets an entry from the dataset. The entry is deleted from the dataset at the same time.
Parameters: |
|
---|---|
Returns: | tuple – an entry from the dataset, or None |
All tasks have a (possibly empty) list of URLs. This function returns the URLs of the default task for the given robot.
(The default task is the first task found where execution frequency equals ‘not set’)
Suggestion for improvement: Tasks are no longer associated to robots. This function is not useful. URLs must be given by specifying a task, or typing them in manually.
Parameters: | bot_name – The name of the robot |
---|---|
Returns: | list – default URLs for a robot |
Delete all entries from a dataset.
Parameters: |
|
---|
Save name and description of a visualization in database.
Parameters: |
|
---|
The robots sub-package contains modules representing web robots. In addition it contains a module named robot_tools, with functionality meant to be used by web robots. That way, functionality does not have to be re-created every time a new robot is made.
By default this handler is added to HTTP-open requests. It throws exceptions instead of following redirects. The exceptions can be caught in order to handle redirects in a different place.
Holds information about a dataset. Fields is a list of lists on the form [[name, type, description],...]
Custom exception for passing on HTTP-related information
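The redirect handler and the custom exception can be sketched together using Python 3's urllib.request. The class names NoRedirectHandler and HttpError and the exception's fields are assumptions; the original module may be written against a different urllib version.

```python
import urllib.request


class HttpError(Exception):
    """Carries HTTP-related information (fields are hypothetical)."""

    def __init__(self, code, url):
        super().__init__("HTTP %s: %s" % (code, url))
        self.code = code
        self.url = url


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Raises an HttpError instead of following a redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # The caller can catch this and decide whether to visit newurl.
        raise HttpError(code, newurl)


def build_opener():
    """Build an opener that refuses to follow redirects automatically."""
    return urllib.request.build_opener(NoRedirectHandler())
```

A robot would catch HttpError around its open call, inspect the code and url fields, and handle the redirect target on its own terms.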
Holds information about a robot. An object of this class can be passed to the model when a robot is registered in the database.
This handler should be added to all openers for requests made with AWW’s robots. It registers the traffic in global variables, and waits if the traffic load is high.
If conditions are not right, it waits a while; if that does not help, it throws an exception.
If there is only one thread, waiting is a good option. If multithreading is introduced, this functionality might cause race conditions.
Writes the given string to debug_info.txt if the dlevel parameter is smaller than or equal to the value of the global variable debug_level.
Parameters: |
|
---|
Returns bytes downloaded during the current minute. Indirectly causes counters to reset if current_minute has changed.
Returns: | int – No. of bytes downloaded during current minute |
---|
Returns outgoing HTTP requests during the current minute. Indirectly causes counters to reset if current_minute has changed.
Returns: | int – No. of requests during current minute |
---|
Get the number of HTTP requests that have been made for a specific hostname during the current minute.
Parameters: | url – A URL used for lookup |
---|---|
Returns: | int – No. of HTTP requests to the given hostname during the current minute |
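The per-minute bookkeeping can be sketched as a counter that resets itself whenever the minute changes. The original keeps this state in global variables; the MinuteCounter class, its method names, and the injectable clock are assumptions chosen to make the sketch testable.

```python
import time
from urllib.parse import urlparse


class MinuteCounter:
    """Counts requests per hostname, resetting when the minute changes."""

    def __init__(self, clock=time.time):
        self._clock = clock   # returns seconds since the epoch
        self._minute = None
        self._hosts = {}

    def _refresh(self):
        # Any access in a new minute clears the counters.
        minute = int(self._clock() // 60)
        if minute != self._minute:
            self._minute = minute
            self._hosts = {}

    def register(self, url):
        """Record one request to the hostname of the given URL."""
        self._refresh()
        host = urlparse(url).hostname
        self._hosts[host] = self._hosts.get(host, 0) + 1

    def requests_to(self, url):
        """Return the number of requests to this URL's host this minute."""
        self._refresh()
        return self._hosts.get(urlparse(url).hostname, 0)
```

As in the description above, the reset happens indirectly: counters are only cleared when one of the methods is called in a new minute.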
Returns a list of lists with information about columns that datasets usually will include: timestamp, url, http_status
Get current date and time as a string.
Returns: | string – Current time, on the format <YYYY_MM_DD HH:MM:SS> |
---|
In Excel, time is sometimes represented as the number of days since the beginning of the year 1900, plus a fraction describing the time of day. This function returns the current time in that format.
Returns: | float – Current date and time, based on Excel’s format |
---|
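A sketch of the conversion, assuming the common convention of counting days from 30 December 1899. That epoch makes the result agree with Excel's serial numbers for dates from March 1900 onward, because Excel treats 1900 as a leap year. The function name excel_time and the optional moment parameter are hypothetical.

```python
from datetime import datetime

# Day zero for Excel-style serial dates under the convention described above.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_time(moment=None):
    """Return `moment` (default: now) as days since the Excel epoch."""
    if moment is None:
        moment = datetime.now()
    delta = moment - EXCEL_EPOCH
    # Whole days plus the elapsed fraction of the current day.
    return delta.days + delta.seconds / 86400.0
```

For instance, noon on 1 January 2000 becomes 36526.5: the integer part is the date and the fraction is the time of day.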
Considers traffic load, but not robots.txt. Useful, for example, for downloading robots.txt files, and some other resources located at the web root (which is sometimes disallowed in robots.txt).
Aside from not consulting the robots.txt file, this function makes all the same considerations as polite_open() before making the HTTP request.
Parameters: | url – The HTTP address to request |
---|---|
Returns: | file object – The file that was retrieved |
This function is used for logging the number of HTTP requests to individual hostnames during the current minute.
Parameters: | url – The URL to update |
---|
Ensures that the given URL starts with http://. If this function is called with a relative path, the result will not make sense.
Parameters: | url – A URL to be checked, and possibly modified |
---|---|
Returns: | string – A valid representation of the given URL |
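A minimal sketch of the scheme check. The function name ensure_http is hypothetical, and leaving https:// URLs untouched is an assumption; the description above only mentions http://.

```python
def ensure_http(url):
    """Prefix the URL with http:// unless it already has an HTTP scheme."""
    url = url.strip()
    if url.startswith("http://") or url.startswith("https://"):
        return url
    return "http://" + url
```

As noted above, a relative path such as "images/logo.png" would become "http://images/logo.png", which is not meaningful, so callers should resolve relative links first.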
A version of polite_open() that uses fewer resources by only retrieving the head of the web page returned for the url argument, if any.
Parameters: | url – The address to use for the HTTP request |
---|---|
Returns: | file object – The header that was retrieved |
Considers robots.txt. Uses an opener that considers traffic load, as well as includes the user agent string. Sleeps for high traffic locally and remotely. Might create race conditions when multi-threading.
This function will raise HttpErrors (defined in this module) and possibly other exceptions. These should be handled by all robots that use the function.
Parameters: | url – The HTTP address to request |
---|---|
Returns: | file object – The file that was retrieved |
Prints the values of some traffic related global values to stdout.
We measure our local load in bytes downloaded and HTTP requests made. Both are reset once a minute (if the get functions are called).
Consults the robots.txt file for the given URL, and confirms whether a request can be sent to that address.
Parameters: | url – The web address to check |
---|---|
Returns: | boolean – True if the URL can be downloaded |
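The robots.txt check can be illustrated with the standard library's urllib.robotparser. To keep the example runnable without network access, this sketch takes the robots.txt content as a list of lines, which differs from the documented function that takes only a URL; the name can_fetch and the user_agent default are assumptions.

```python
from urllib import robotparser


def can_fetch(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules allow fetching `url`."""
    parser = robotparser.RobotFileParser()
    # parse() accepts the file's lines directly, avoiding a download here.
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

A robot working against a live site would instead call the parser's set_url() and read() methods to download the site's robots.txt before asking can_fetch().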
Returns True if the urls have the same host, otherwise False.
Parameters: |
|
---|---|
Returns: | boolean – True if the hostname is the same for both URLs |
Attempts to extract the hostname from the specified URL.
Parameters: | url – A URL to analyze |
---|---|
Returns: | string – The hostname of the given URL |
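Hostname extraction and the same-host comparison can both be built on urllib.parse.urlparse. The function names get_hostname and same_host are hypothetical, and the original code may extract hostnames in another way.

```python
from urllib.parse import urlparse


def get_hostname(url):
    """Return the lower-cased hostname of a URL, or None if it has none."""
    return urlparse(url).hostname


def same_host(url_a, url_b):
    """Return True if both URLs have the same (non-empty) hostname."""
    host_a = get_hostname(url_a)
    host_b = get_hostname(url_b)
    return host_a is not None and host_a == host_b
```

Note that urlparse only recognises a hostname when the URL includes a scheme or starts with "//", so scheme-less input should be normalised first.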
The visualization sub-package is meant to contain visualizations for collected datasets. In addition, it contains a module named visualization_tools, with functionality that can be useful when creating visualizations.
Uses functionality from the easyviz package to convert a file.
If png_filename is specified as several files (using * notation), an animated gif should be made.
Parameters: |
|
---|