Jun 29, 2011

Greece's Public Sector merges its data: how it was done

Kallikratis is the codename of the largest project conceived in the recent years (2010-2011) of Greece's public sector digitalisation / computerisation effort. It was founded by the Ministry of Interior and Decentralization.

The goal? To migrate the different datasets spread among different applications and databases within the public sector's infrastructure.
That amounts to millions of rows of citizens' and public organizations' data, distributed among different applications, structures, databases etc., that had to be migrated into a homogeneous, valid, normalized and commonly accepted data structure. In other words, an ETL (Extract, Transform, Load) process was required.

IMHO the talks and negotiations to define the details of the final homogeneous structure for each dataset were the hardest part, requiring cooperative thinking and strategizing among different tech companies and the ministry.

Datasets were categorized by domain: for example citizens' demographic data, the public sector's and prefectures' economic / budgeting data, document management systems' data, and the list goes on.

The final process took place at the beginning of the year and lasted about a month: the merge happened in the first five days, with corrections following until the end of January.

The project had two vital requirements:

1. It had to be fast (execution, implementation, debugging, and the ability to change how things work on site without altering or recompiling code, enabling tech support to operate it)
2. It had to be open source
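
To illustrate the "change how things work on site without recompiling" requirement, here is a minimal sketch in plain Python. It is not how Pentaho does it and not our actual code; the field names, the JSON layout and the `apply_mapping` helper are all hypothetical. The idea is simply that the source-to-target column mapping lives in an editable definition rather than in the program itself.

```python
import json

# Hypothetical mapping definition. In practice this would sit in a file that
# tech support can edit on site; it is inlined here to keep the sketch runnable.
MAPPING_JSON = """
{
  "target_fields": {
    "first_name": "FNAME",
    "last_name": "LNAME",
    "tax_id": "AFM"
  }
}
"""

def apply_mapping(row, mapping):
    """Rename legacy columns to the common field names listed in the mapping."""
    return {
        target: row.get(source, "").strip()
        for target, source in mapping["target_fields"].items()
    }

mapping = json.loads(MAPPING_JSON)
legacy_row = {"FNAME": " Maria ", "LNAME": "Papadopoulou", "AFM": "012345678"}
common_row = apply_mapping(legacy_row, mapping)
print(common_row)
```

Supporting a legacy application that names its columns differently then means editing the mapping, not the code.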


The plan was to extract and transform the data into a common XML file, validate it against the corresponding XSD and finally load it into the new environment. We ended up with numerous transform processes and one generic load process that accepted the transformed XML files as input.
The codename of the process was Datapair. Thumbs up for the... original name I came up with are welcome (sarcasm ensues) :)
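
The transform / validate / load plan described above can be sketched roughly as follows. This is a toy illustration in plain Python, not the Pentaho transformations we actually ran. The `citizens` structure, the field names and the SQLite target are assumptions made up for the example, and the `validate` function is only a stand-in for real XSD validation, which would need a schema-aware XML library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical common structure agreed for one dataset.
REQUIRED_FIELDS = ("first_name", "last_name", "tax_id")

def transform(rows):
    """Build the common XML document from rows already mapped to common names."""
    root = ET.Element("citizens")
    for row in rows:
        citizen = ET.SubElement(root, "citizen")
        for field in REQUIRED_FIELDS:
            ET.SubElement(citizen, field).text = row[field]
    return ET.tostring(root, encoding="unicode")

def validate(xml_text):
    """Stand-in for XSD validation: each citizen needs every required, non-empty field."""
    root = ET.fromstring(xml_text)
    errors = []
    for i, citizen in enumerate(root.findall("citizen")):
        for field in REQUIRED_FIELDS:
            if not citizen.findtext(field):
                errors.append(f"citizen {i}: missing {field}")
    return errors

def load(xml_text, conn):
    """Generic load step: it only sees the common XML, never the source systems."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citizen "
        "(first_name TEXT, last_name TEXT, tax_id TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        tuple(c.findtext(f) for f in REQUIRED_FIELDS)
        for c in root.findall("citizen")
    ]
    conn.executemany("INSERT INTO citizen VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
xml_text = transform(
    [{"first_name": "Maria", "last_name": "Papadopoulou", "tax_id": "012345678"}]
)
errors = validate(xml_text)
loaded = load(xml_text, conn) if not errors else 0
print(errors, loaded)
```

The point of the split is the same as in the real project: each source gets its own transform, while a single load step can be reused for all of them because it depends only on the common XML.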

Various APIs / platforms were investigated, among them CloverETL, Talend and Pentaho.
For reasons that I would rather not go into here and now, we chose the Pentaho solution
(briefly, CloverETL came at a cost for advanced features, while Talend's performance was found subpar, at least at that time).

In the end everything worked better than expected, and I am especially happy that our tech support department was able to learn and finally operate the Pentaho Data Integration suite. That was a great relief, enabling more people to take part in the process.


Someone could consider this (and I mean the project as a whole, from conceptualization to implementation) a default practice or a standardized solution, but for Greece's software industry it was, well... groundbreaking.

Expect more articles to come, further analysing Pentaho's components, as a minimal tribute to the platform that did the job for us.

Extra thank you for reaching the end of this post :)

