Thesaurus-Based Query Rewriting for Handling Semantic Heterogeneity in Database Integration

Integration of data sources is the process of combining two or more data sources so that the data they contain can be accessed simultaneously [1]. In data source integration, data may originate from different places or applications, and is therefore potentially heterogeneous in format, structure, syntax, and semantics [2]. Heterogeneity can occur at the schema level or the instance data level [1]. This paper focuses on semantic heterogeneity at both levels. Semantic diversity at the schema level concerns name conflicts caused by synonyms, hyponyms, hypernyms, and polysemy, whereas semantic diversity at the instance data level concerns only name conflicts caused by synonyms.


I. Introduction
Research on handling the diversity of data sources has a long history. Query rewriting is one of the proposed methods [3]: an original query is rewritten into a new one by adjusting the concepts or terminology used in each data source [3]. Among the several approaches to query rewriting, one is ontology-based query rewriting [3][9], in which an ontology represents the schema of each data source [3]. Ontology-based query rewriting requires a global ontology as a mediator for identifying the data source schemas [3]. Building a global ontology requires a reference ontology that identifies the connections between existing concepts [6]; such a reference is usually specific to a particular problem domain and contains concepts and relations that follow a specific standard [6]. The main problem is that not every problem domain has a reference ontology [6]. In domains without one, the global ontology is created from developer knowledge, which can produce ambiguity [6].
This paper proposes a query rewriting method that uses a thesaurus to identify the schema of a data source. With this approach, a global schema is not necessary; instead, identification is performed through schema matching using the thesaurus and n-gram similarity. The process is shown in Fig. 1.
Handling semantic heterogeneity remains a challenge for researchers. Several methods have been proposed to address it, one of which is query rewriting: a query is rewritten into a new one using a selected schema. Semantic query rewriting needs a framework for identifying the connections among data source schemas, and these connections are used as the basis for schema selection. Ontology is the model most often used in such cases, but the lack of an ontology is a common and significant problem. This paper therefore describes an alternative, thesaurus-assisted framework for identifying semantic links.

Keywords: database integration, semantic heterogeneity, query rewriting, thesaurus

This paper consists of several sections. Section II (Method) describes query extraction, schema matching, keyword enrichment, and query generation. Section III (Results and Discussion) presents the analysis and test results. Section IV concludes the paper.

II. Method
The query extraction process identifies the schema implied by the user query [10]. The user query is divided into three parts: the domain schema, the property schema, and the keyword [3]. In the relational model, the domain represents a table name, the property represents an attribute name, and the keyword represents a data value. The domain and property schemas are processed in the schema matching stage, while the keyword is processed in the keyword enrichment stage. An example of query extraction results is shown in Fig. 2.
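As a minimal sketch of this step, the following splits a user query into its three parts. The input format `domain(property, ...): keyword` and the function name are assumptions for illustration, not the paper's actual syntax:

```python
def extract_query(user_query: str) -> dict:
    """Split a query like 'pasien(nama, kelamin): pria' into domain schema,
    property schemas, and keywords (input format is assumed)."""
    head, _, keyword_part = user_query.partition(":")
    domain, _, prop_part = head.partition("(")
    properties = [p.strip() for p in prop_part.rstrip(")").split(",") if p.strip()]
    keywords = [k.strip() for k in keyword_part.split(",") if k.strip()]
    return {"domain": domain.strip(), "properties": properties, "keywords": keywords}

print(extract_query("pasien(nama, kelamin): pria"))
# → {'domain': 'pasien', 'properties': ['nama', 'kelamin'], 'keywords': ['pria']}
```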
Schema matching is needed to select the data source whose schema is most similar to the user query schema. The selection considers both semantic and syntactic similarity and consists of five stages: schema extraction, get source schema, schema enrichment, string matching, and schema selection. Fig. 3 shows the proposed schema matching process.
Schema extraction is the pre-stage of schema matching. Its purpose is to extract the schema from each data source, including the data source name, table names, table relations, attribute names, and attribute data types. The extracted schema is stored in a schema repository. Fig. 4 shows an example of the schema extraction results.
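A sketch of schema extraction using an in-memory SQLite database as a stand-in data source; a production tool would instead read the catalog (e.g. information_schema) of the real database, and the table used here is hypothetical:

```python
import sqlite3

# Stand-in data source: an in-memory SQLite database with one example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pasien (id INTEGER PRIMARY KEY, nama TEXT, kelamin TEXT)")

def extract_schema(conn) -> dict:
    """Return {table_name: [(attribute_name, data_type), ...]} for every table."""
    repo = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        repo[table] = [(c[1], c[2]) for c in cols]  # (name, declared type)
    return repo

print(extract_schema(conn))
# → {'pasien': [('id', 'INTEGER'), ('nama', 'TEXT'), ('kelamin', 'TEXT')]}
```

The resulting dictionary plays the role of the schema repository described above.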
Get source schema retrieves the data source schemas produced by the schema extraction stage. The retrieved schemas are then compared against the schemas generated by the enrichment stage to compute syntactic similarity values; this calculation is performed in the string matching stage.
Schema enrichment expands the user query schema that will be compared with the data source schemas in the string matching stage, by adding synonyms, hyponyms, and hypernyms. The purpose of these additions is to identify data sources that are semantically related. In this paper, synonyms, hyponyms, and hypernyms are identified with a thesaurus, namely WordNet. Words in WordNet are organized into sets of synonyms (synsets) [11]. Each synset is closely related to other synsets through semantic relationships such as synonymy, hyponymy, hypernymy, and antonymy, and a hierarchy tree can be derived from these synset relations. Fig. 5 illustrates the synset connections. A synonym is identified as a word located in the same synset; a hyponym is found by searching the synsets below it; and a hypernym is found by searching the synsets above it [12]. Fig. 6 shows the proposed schema enrichment process, and sample enrichment results are shown in Table I.
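The enrichment step can be sketched as follows. The real system queries WordNet synsets via NLTK; here a tiny hand-built dictionary (with hypothetical entries) stands in for WordNet so the sketch is self-contained:

```python
# Tiny stand-in thesaurus (hypothetical entries) playing the role of WordNet.
THESAURUS = {
    "patient": {
        "synonyms": ["sufferer"],
        "hyponyms": ["inpatient", "outpatient"],
        "hypernyms": ["person"],
    },
}

def enrich_schema(term: str) -> set:
    """Expand a user-query term with its synonyms, hyponyms, and hypernyms."""
    entry = THESAURUS.get(term, {})
    expanded = {term}
    for relation in ("synonyms", "hyponyms", "hypernyms"):
        expanded.update(entry.get(relation, []))
    return expanded

print(sorted(enrich_schema("patient")))
# → ['inpatient', 'outpatient', 'patient', 'person', 'sufferer']
```

With NLTK, the same expansion would iterate over `wordnet.synsets(term)` and collect lemma names from each synset and its `hyponyms()` and `hypernyms()`.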
After the data source schemas have been obtained and the user query has been enriched, the next stage is string matching: computing a similarity value between each pair of schemas, represented as strings [13]. This value is the basis for deciding which schema will be used in the query. The calculation is performed between the domain schema and the table names, and between the property schema and the attribute names.
The string matching technique used in this paper is n-gram similarity, which can be used to compare multiple strings. An n-gram is a sequence of n characters; two strings are compared by counting the n-grams they share, and the similarity of the two strings is then computed with the Jaccard coefficient. Fig. 7 shows an example of the n-gram similarity calculation [14].
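A minimal implementation of n-gram similarity with the Jaccard coefficient, using character bigrams (n = 2); the example strings are illustrative:

```python
def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of a string (bigrams by default)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient over the two strings' n-gram sets."""
    ga, gb = ngrams(a.lower(), n), ngrams(b.lower(), n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

# 'kode' → {ko, od, de}, 'code' → {co, od, de}; shared 2 of 4 → 0.5
print(ngram_similarity("kode", "code"))   # → 0.5
print(ngram_similarity("pasien", "pasien"))  # → 1.0
```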
Schema selection finds the data source structure that best matches the user query schema, based on the highest similarity value. String matching and schema selection are performed consecutively: the calculation and selection for the table must be done first. Besides reducing the amount of string matching, this ordering also decreases selection errors caused by homonyms. An example of schema selection is presented in Fig. 8.
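The table-first selection described above can be sketched as follows. The schema dictionary and the bigram similarity are stand-ins for the paper's repository and matching step:

```python
def bigram_sim(a: str, b: str) -> float:
    """Jaccard coefficient over character bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 1.0

def select_schema(user_domain, user_properties, source_schemas):
    """Pick the table closest to the user's domain schema first, then map
    each user property to the most similar attribute of that table only."""
    best_table = max(source_schemas, key=lambda t: bigram_sim(user_domain, t))
    mapping = {p: max(source_schemas[best_table], key=lambda a: bigram_sim(p, a))
               for p in user_properties}
    return best_table, mapping

schemas = {"pasien": ["nama", "kelamin", "pekerjaan"], "penyakit": ["kode", "nama"]}
print(select_schema("pasien", ["nama", "kelamin"], schemas))
# → ('pasien', {'nama': 'nama', 'kelamin': 'kelamin'})
```

Restricting the attribute search to the winning table is what reduces both the number of comparisons and homonym-induced errors.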
Semantic heterogeneity at the instance data level occurs because the same entity may be saved under different names, and this diversity affects the completeness of the integrated data. The problem is solved by keyword enrichment: synonyms of the keyword are added so that information is integrated not only for the keyword as entered but also for its synonyms.
Synonym identification is performed with the WordNet thesaurus by collecting the words located in the same synset as the keyword. Fig. 9 shows the proposed keyword enrichment process.
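A sketch of keyword enrichment; the synonym table below is a hypothetical stand-in for WordNet synsets (the 'pria-lelaki' and 'pasien-penderita' pairs come from the paper's test cases):

```python
# Hypothetical synonym table standing in for WordNet synsets.
SYNONYMS = {"pria": ["lelaki"], "pasien": ["penderita"]}

def enrich_keyword(keyword: str) -> list:
    """Return the keyword together with its synonyms, so the generated
    query matches every surface form of the same value."""
    return [keyword] + SYNONYMS.get(keyword, [])

print(enrich_keyword("pria"))  # → ['pria', 'lelaki']
```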
Query generation builds a query from the schema and keywords produced by the schema matching and keyword enrichment processes [15]. In this research, the query is built with the SELECT statement of SQL (Structured Query Language). SELECT lists the attributes of the table to be displayed, FROM specifies the table name, and WHERE, ORDER BY, GROUP BY, and HAVING are optional clauses that represent conditions on the data.
Query generation focuses on three parts: SELECT, FROM, and WHERE. From the user query perspective, SELECT represents the property schema, FROM represents the domain schema, and WHERE represents the keyword. Fig. 10 shows an example of the generated query.
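Putting the pieces together, a sketch of the generation step, where the enriched keywords are OR-ed in the WHERE clause (the function signature and example values are assumptions):

```python
def generate_query(table, attributes, keyword_terms, search_attr):
    """Build a SELECT statement: attributes fill SELECT, the table fills FROM,
    and the keyword plus its synonyms are OR-ed in WHERE."""
    select = ", ".join(attributes)
    where = " OR ".join(f"{search_attr} = '{t}'" for t in keyword_terms)
    return f"SELECT {select} FROM {table} WHERE {where}"

print(generate_query("pasien", ["nama", "kelamin"], ["pria", "lelaki"], "kelamin"))
# → SELECT nama, kelamin FROM pasien WHERE kelamin = 'pria' OR kelamin = 'lelaki'
```

In practice the values should be passed as bound parameters rather than interpolated into the string, to avoid SQL injection; plain string building is used here only to make the generated query visible.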

III. Results and Discussions
To validate the proposal, a tool called SQRe (Semantic Query Rewriting) was built and experiments were performed. SQRe was developed with the CodeIgniter framework (PHP based) and uses the NLTK library (Python) as a WordNet API. The experiments integrated two databases from different health information systems; both data sources exhibit semantic diversity at the schema level and the instance data level. The first database schema is shown in Fig. 11, and the second in Fig. 12.
The test was performed with 5 user queries against the heterogeneity types of the 2 data sources. Table III shows the results. The model could handle semantic heterogeneity in database integration for pairs such as 'pria-lelaki', 'pasien-penderita', 'pekerjaan-profesi', and 'kelamin-gender'. However, queries 3 and 5 failed. Query 3 failed because the matching method cannot handle a schema name consisting of more than one word, such as "kode penyakit". Query 5 failed because of the limited synset data: for 'aktivitas-profesi', the word connection could not be identified.

IV. Conclusion
This study has introduced an alternative method for handling semantic heterogeneity in database integration: thesaurus-based query rewriting. Semantic heterogeneity at the schema level is handled by identifying synonyms, hyponyms, and hypernyms of each user query term and comparing the results with each data source schema; the comparison technique used is n-gram similarity. Semantic heterogeneity at the instance data level is handled by identifying synonyms of the keywords, which are then used in keyword enrichment.
The proposed method can be optimized in further research. The sets of synonyms, hyponyms, and hypernyms can be reduced to simplify the calculation. Moreover, schema selection can be extended with analysis of the metadata and instance data of each data source. This is expected to improve both the speed and the accuracy of the query rewriting process.

Acknowledgement
This research was supported by Politeknik Negeri Bali and Macquarie University, Sydney. We thank our colleagues from both institutions and everyone who contributed to the completion of this paper.