MAKI: Tools for web data knowledge extraction

Can we automate web data knowledge extraction?     

Knowledge extraction is a major task in many companies and research projects that need data available on the web in order to store it, analyse it or sell it to third parties. This task requires an understanding of the data layout and of what has to be extracted. In some cases, metadata or data model descriptions may help to understand the structure of the data. Unfortunately, this information is not available in most cases.

Most knowledge extraction is done with ad-hoc solutions designed from scratch. These solutions normally comprise the acquisition, parsing, transformation and storage of the data. The process is usually carried out by developers with a certain programming background. The developers have to deal with two different problems: the technical complexity of parsing a given document and understanding the semantics of the information contained in that document. Unfortunately, certain areas of knowledge may require subject-matter expertise to identify these semantics. The developers therefore normally work in conjunction with an expert, who may not have any technical background. This forces the developers to spend precious time absorbing the knowledge of the expert.

Apart from the technical difficulties of the aforementioned elements, the most difficult task is extracting the knowledge from the expert. We can consider the expert the keystone of the whole process. In the MAKI project, web knowledge extraction follows an expert-centric methodology: the whole knowledge extraction task is designed around the expert and his/her knowledge. From data acquisition to knowledge extraction, the expert is assisted by a set of tools that guide him/her through the process with minimal intervention from the developers. Our methodology has the following features:

  • Expert-centric design: the domain knowledge expert is the main actor of the web knowledge extraction process and a full extraction pipeline can be driven by the expert.
  • Machine-assisted: in many cases, data can be significantly complex (optional fields, repetitive structures, hierarchies, etc.). Machine learning solutions assist experts in the process, making it possible to enhance and simplify the whole task.
  • Reusable: many of the tasks and subtasks that comprise knowledge extraction are repetitive. The definition of a framework can support the definition of common and reusable routines.
  • Generic solution: an ad-hoc solution is difficult to maintain over time. A black-box approach, where the behaviour of the system depends only on a set of inputs and outputs, reduces the whole problem to a pipeline design (see the sketch after this list). This improves the maintainability of the code and makes it possible to focus the effort on improving or creating black boxes.
  • Configurable and extensible: web data exhibits complex and diverse structure. MAKI provides extension points for developers that allow easy adaptation to new use cases.
  • Format independent: any data acquisition strategy must be independent of the format of the incoming data (HTML, XML, CSV, TXT, etc.). PDFs are out of the scope of the MAKI project.
  • Database independent: the acquisition process must be independent of the destination database and therefore transparent to the expert.
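
To make the black-box idea concrete, here is a minimal Python sketch; the stage names and toy bodies are illustrative assumptions, not part of the MAKI code base. It shows how an extraction task reduces to composing a list of stages, each depending only on its input and output:

    # Hypothetical sketch of the black-box pipeline idea: each stage depends
    # only on its input and output, so the whole extraction task reduces to
    # composing a list of stages.
    from typing import Callable, Iterable

    Stage = Callable[[object], object]

    def run_pipeline(stages: Iterable[Stage], data: object) -> object:
        """Feed the output of each black box into the next one."""
        for stage in stages:
            data = stage(data)
        return data

    # Illustrative stages with toy bodies; a real pipeline would plug in the
    # crawling, transformation and parsing components described below.
    def acquire(url):
        return "T-001,Town Council,12500"   # pretend we fetched a document from the URL

    def to_fields(raw):
        return raw.split(",")               # pretend transformation into fields

    def to_record(fields):
        return dict(zip(["id", "buyer", "price"], fields))

    print(run_pipeline([acquire, to_fields, to_record], "https://example.org/data"))

Improving the system then amounts to swapping in a better implementation of a single stage, without touching the rest of the pipeline.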

MAKI framework overview

An overview of the MAKI framework is shown below. See Technical Report UCAM-CL-TR-881 (or arXiv) for a full description of the MAKI architecture, including a case study on public procurement data.

  • Crawling: The crawling component collects the data from the original data sources for later processing. The crawled elements are stored in a secondary storage system such as a database or a file system (see the first sketch after this list).
  • Transformation: The crawled elements may appear in different data formats (HTML, XML, JSON, TXT, CSV, etc.), and in many cases the file to be processed is compressed. In some cases, such as XML, the data structure is self-descriptive, which simplifies parsing and therefore data extraction. The transformation component converts raw data extracted from the original data source into an XML file containing the same information. To perform this transformation, the component uses a dictionary input created by the structure analysis component (see the second sketch after this list).
  • Structure analysis: File formats such as HTML, CSV or TXT may have structured content even if the original type is not self-descriptive. The structure analysis component identifies the structure of the elements contained in these files by analysing dataset samples. The final result is a dictionary describing the elements that have to be treated as pieces of information during the transformation process. This module is language and format independent, which allows it to be reused without ad-hoc coding.
  • Knowledge extraction: The final destination of the crawled information is a database. The schema of this database may differ from that of the original crawled data and requires a logical transformation. Because the crawled elements may have been obtained from different data sources, in different languages and formats, human intervention is necessary to drive the conversion between the original raw data and the final database destination. Knowledge extraction is a guided process in which a human expert uses the structured files produced by the transformation component to define a mapping between the original data and the destination database.
  • Parsing: The parsing component uses the expert-defined mapping to populate the destination database. It maps the semantics of the original data source to the common format represented in the target database. This component is completely reusable and only depends on the input data and the mapping defined by the expert to carry out the data transformation (see the third sketch after this list).
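
A minimal sketch of the crawling step, assuming plain HTTP sources and a local directory as the secondary storage; the URL is illustrative and the code is not the MAKI implementation:

    # Minimal crawling sketch: fetch documents from their original sources
    # and store the raw responses in a secondary storage (here, a directory).
    import hashlib
    import pathlib
    import urllib.request

    def crawl(urls, out_dir="crawled"):
        out = pathlib.Path(out_dir)
        out.mkdir(exist_ok=True)
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                raw = resp.read()
            # Name each file by a hash of its URL so re-crawls overwrite cleanly.
            name = hashlib.sha1(url.encode("utf-8")).hexdigest()
            (out / name).write_bytes(raw)

    # crawl(["https://example.org/tenders.xml"])   # illustrative URL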
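The second sketch illustrates a dictionary-driven transformation on a small CSV sample; the dictionary format and the field names are simplified assumptions (the real format is described in the technical report):

    # Transformation sketch: a dictionary produced by structure analysis names
    # the fields of the raw data that carry information, and the transformation
    # emits a uniform XML file containing those fields.
    import csv
    import io
    import xml.etree.ElementTree as ET

    # Illustrative dictionary: field name -> column index in the raw CSV.
    dictionary = {"tender_id": 0, "buyer": 1, "price": 2}

    raw_csv = "T-001,Town Council,12500\nT-002,Health Board,99000\n"

    root = ET.Element("records")
    for row in csv.reader(io.StringIO(raw_csv)):
        record = ET.SubElement(root, "record")
        for field, index in dictionary.items():
            ET.SubElement(record, field).text = row[index]

    print(ET.tostring(root, encoding="unicode"))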
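Finally, the third sketch shows how an expert-defined mapping could drive the parsing step; sqlite3, the table schema and the column names are assumptions chosen for illustration only:

    # Parsing sketch: the expert-defined mapping drives the conversion from
    # the transformed XML into the destination database.
    import sqlite3
    import xml.etree.ElementTree as ET

    # Mapping defined by the domain expert: XML element -> database column.
    mapping = {"tender_id": "id", "buyer": "buyer_name", "price": "amount"}

    def parse_into_db(xml_text, db_path=":memory:"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS tenders (id TEXT, buyer_name TEXT, amount TEXT)")
        for record in ET.fromstring(xml_text).iter("record"):
            row = {col: record.findtext(elem) for elem, col in mapping.items()}
            conn.execute(
                "INSERT INTO tenders (id, buyer_name, amount) VALUES (:id, :buyer_name, :amount)",
                row,
            )
        conn.commit()
        return conn

    xml_text = ("<records><record><tender_id>T-001</tender_id>"
                "<buyer>Town Council</buyer><price>12500</price></record></records>")
    conn = parse_into_db(xml_text)
    print(conn.execute("SELECT * FROM tenders").fetchall())

Because the component only reads the input XML and the mapping, changing the destination schema only requires the expert to update the mapping, not the code.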


MAKI: Web Data Knowledge Extraction. University of Cambridge Computer Laboratory, Technical Report UCAM-CL-TR-881 (also available on arXiv), 2016.


Source code repository 

GitLab project webknowledge:

GitLab webknowledge subproject procurement:

Readme files alongside the project description and code provide detailed instructions. See the brief instructions here.

Use case: public procurement data

More details on the public procurement data use case can be found in the use case section.


The research is funded by the H2020 DIGIWHIST project (645852).

Contact Email

Please email any questions to