MAKI: Tools for web data knowledge extraction

Use case: Public procurement data from diverse European countries    

The current economic crisis and the corruption scandals in public institutions uncovered by the media in Europe have provoked a growing concern about the use of public resources. The lack of transparency, the difficulty of accessing public information and the impossibility of identifying the players in the public procurement circuit make the analysis of public spending a detective operation. Fortunately, recent platforms are pushing governments to implement transparency measures that include the disclosure of data through the web.

In theory, open data would simplify the task of experts in the analysis of public spending. It would additionally permit citizens to use this data to improve their knowledge of public spending. However, current solutions are far from this ideal situation. Leaving aside the political aspects involved in the disclosure of public spending data, there is a significant number of technical and operational issues:

  • The lack of a common standard for web data formats makes it impossible to maintain a single solution that can deal with these data.
  • Inconsistent (or non-existent) APIs make it extremely difficult to access the data in an organized and consistent way.
  • Poor definition of data types, leading to dirty, malformed and inconsistent data at the very source.

These issues mean that every country discloses its public spending data in a completely different way. Additionally, data access is not designed to be machine-oriented through the use of APIs. This makes it necessary to use ad-hoc crawlers in order to extract the information contained in the source webs. This task is time-consuming and requires programmers who can deal with technical aspects such as JavaScript, AJAX, pop-ups and other artefacts commonly used in current web pages.
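To illustrate the kind of ad-hoc extraction each HTML source requires, the sketch below parses a tender listing with Python's standard-library `html.parser`. The page markup, URLs and field layout are entirely hypothetical: every national portal uses different markup, so a parser like this must be rewritten per source, and pages that rely on JavaScript or AJAX additionally require a headless browser rather than static parsing.

```python
from html.parser import HTMLParser

# Hypothetical fragment of a national procurement portal; real portals
# each use different (and often messier) markup.
SAMPLE_PAGE = """
<table class="tenders">
  <tr><td><a href="/tender/101">Road maintenance</a></td><td>2014-03-01</td></tr>
  <tr><td><a href="/tender/102">IT services</a></td><td>2014-03-05</td></tr>
</table>
"""

class TenderParser(HTMLParser):
    """Collects (href, title) pairs from anchor tags in the listing."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.current_href = None
        self.tenders = []  # accumulated (href, title) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only text inside an <a> element is a tender title here.
        if self.in_link and data.strip():
            self.tenders.append((self.current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

parser = TenderParser()
parser.feed(SAMPLE_PAGE)
print(parser.tenders)
# → [('/tender/101', 'Road maintenance'), ('/tender/102', 'IT services')]
```

Even this toy example shows why the task is labour-intensive: the tag names, attribute conventions and pagination scheme all change from one portal to the next.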

See Technical Report UCAM-CL-TR-881 (or arXiv)

Extracted Data

The case study aimed to extract the data from 35 European countries (37 considering TED and its archive). Each data source corresponds to the web page used by public institutions to offer their existing procurement data to the citizens. A summary can be found in the table here.

From the original plan we discarded 9 data sources. Liechtenstein, Germany and Luxembourg already publish all their related data using the Tenders Electronic Daily (TED). Among the processed countries, three groups can be observed according to the format in which they make their data available: XML, HTML and CSV (or XLS). Only Romania uses CSV and XLS files. The remaining countries do not provide alternative data formats apart from HTML. See the list of countries in the table below.
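The three source formats naturally suggest one parser per format, with HTML left to the per-source crawlers discussed above. The dispatch sketch below is a minimal illustration under assumed data layouts; the element and column names are invented, not the project's actual schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical per-format parsers; field names are illustrative only.

def parse_xml(text):
    """Extract tender titles from a (hypothetical) XML feed."""
    root = ET.fromstring(text)
    return [{"title": t.findtext("title")} for t in root.iter("tender")]

def parse_csv(text):
    """Read a (hypothetical) CSV export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# HTML has no generic parser: each portal needs an ad-hoc crawler.
PARSERS = {"xml": parse_xml, "csv": parse_csv}

def load(fmt, text):
    return PARSERS[fmt](text)

print(load("xml", "<tenders><tender><title>Road works</title></tender></tenders>"))
print(load("csv", "title\nIT services\n"))
```

Structured formats like XML and CSV can thus share generic machinery, which is why the HTML-only countries dominate the development cost.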

The size of each database depends on the amount of information extracted from the original data source. In some cases there is a substantial amount of textual data to be stored; in other cases the mapping is more extensive, or the number of available entries varies depending on the data source.

The figure on the right shows the number of entries per database. There is large variability in the number of publications. If we compute the average size per entry by dividing the database size by the number of entries, a change in the ranking can be observed. Most of the databases remain under 6 KBytes. Currently the five largest average entry sizes are TED (11.6 KBytes), NO (10.7 KBytes), IE (9.5 KBytes), SK (9 KBytes) and HR (8.6 KBytes). This result may indicate larger textual content in the records of these databases.
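The average-size computation described above is simply total database size over entry count. The sketch below reproduces two of the quoted averages for illustration; the entry counts are hypothetical placeholders, not the project's real figures.

```python
# Average entry size = database size / number of entries.
# Entry counts below are invented so that the averages match the
# TED and NO figures quoted in the text; they are not real data.
db_stats = {
    "TED": {"size_kb": 1_160_000, "entries": 100_000},
    "NO":  {"size_kb": 107_000,   "entries": 10_000},
}

def avg_entry_kb(stats):
    """Return average entry size in KBytes per database."""
    return {name: s["size_kb"] / s["entries"] for name, s in stats.items()}

print(avg_entry_kb(db_stats))
# → {'TED': 11.6, 'NO': 10.7}
```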

Code Repository

GitLab project webknowledge - procurement:


Processed databases per country

SQL snapshots of the relational databases created from the raw data mentioned above. Details of the data can be found in Section 10 of Technical Report UCAM-CL-TR-881 (or arXiv).

Dataset request

In order to access the available datasets, please fill in the form below: