HaExcel

Home

Motivation for HaExcel

Refact Spreadsheet

Download


HaExcel picture.
HaExcel is a framework to manipulate, transform and query spreadsheets. It is implemented in Haskell and consists of the following parts:
Library
A generic/reusable library to map spreadsheets into relational database models and back: This library contains an algebraic data type to model a (generic) spreadsheet and functions to transform it into a relational model and vice versa. Such functions implement the refinement rules introduced in the paper From Spreadsheets to Relational Databases and Back. The library includes two code generator functions: one that produces the SQL code to create and populate the database, and a function that generates Excel/Gnumeric code to map the database back into a spreadsheet. A MySQL database can also be created and manipulated using this library under HaskellDB. The API of HaExcel can be downloaded here.
Front-ends
A front-end to read spreadsheets in the Excel and Gnumeric formats: The front-end reads spreadsheets in portable XML documents using the UMinho Haskell Libraries. We reuse the spatial logic algorithms from the UCheck project to discover the tables stored in the spreadsheet.
Tools
Two spreadsheet tools: A batch (compilable from the sources) and an online tool (here) that allows the users to read, transform, refactor and query spreadsheets.

To aid the presentation of the framework, we present a spreadsheet that stores information about a housing renting system, gathering information about clients, owners and rents. It also stores prices and dates of renting. The name of each column gives a clear idea of the information it represents. The 'totalDays' column is represented by a formulas which subtracting the column 'rentFinish' to 'rentStart'. The 'total rent' column is also represented by a formula that multiplies the total number of days of renting by the rent per day value.

A spreadsheet example.


The spreadsheet back-end generates a refactored spreadsheet in this format. Formulas of the original spreadsheet are migrated to the new one. For the above example the resulting spreadsheet is shown in the next figure.

A spreadsheet example.

Moreover an additional formula is generated for each table in order to ensure the uniqueness of its primary key. Thus the resulting spreadsheet will warn the user if he introduces in the spreadsheet an already existing value for the primary key (see figure below).

A spreadsheet refactored.


The SQL back-end generates SQL code which creates the database according to the derived RDB schema. The relational database schema (here represented by an entity-relation diagram) is as follows.

A relational schema.

The returned RDB schema defines a 3NF database, consisting of four tables. Furthermore, it produces SQL code to insert the migrated data in the database corresponding to SQL 'insert' instructions with the migrated data as argument. Because some values of the spreadsheet are defined through formulas, we generate also SQL triggers, that models the spreadsheet formulas, which are used to update the database and guarantee its integrity. Next, we present the SQL code to create and populate the 'Client' table as well as one of the generated triggers induced by the two formulas of the example.

CREATE table tbl8 (
clientNo VARCHAR(256),
cName VARCHAR(256),
PRIMARY KEY (clientNo));

INSERT INTO tbl8 VALUES
("cr76","john"),
("cr56","aline");
create trigger ssformulas before insert on renting
for each row begin
set new.totalDays = new.rentFinish - new.rentStart;
set new.total_rent = new.rentPerDay * new.totalDays;
end;

The complete SQL code to create the database and insert the migrated data can be obtained from this link.


Experimental Results We have used the HaExcel to refactor real spreadsheets taken from the large EUSES Spreadsheet Corpus. We have produced optimized (non redundant) and more modular spreadsheets obeying to the third normal form. Moreover we generate additional formulas which guarantee that each row in each table follows the 3NF, that is, there are no two rows with the same primary key.


Jácome Cunha
Last update: October, 2008

Valid CSS! Valid XHTML 1.0 Transitional