A Review of Accessing Big Data with Significant Ontologies


I. Introduction
Accessing and managing information in big data scenarios is extremely difficult because of the multiple dimensions of big data: (1) Volume, which concerns the size of the data, especially non-traditional data that can amount to terabytes within minutes. (2) Variety, which refers to the data types. (3) Velocity, which refers to data streams such as social media. (4) Value, which refers to the valuable information hidden in non-traditional data.
Ontology-based data access (OBDA) is a promising paradigm for accessing these massive amounts of accumulated data and for designing effective data access platforms [1]. Figure 1 shows the components of an OBDA system: 1) an ontology that represents a conceptual view of the data for a domain of interest; 2) a mapping layer that resolves the differences between the basic elements managed by the data sources and the elements managed by the ontology; 3) the data sources, which are the repositories used in organizations by different services and applications [1] [2] [3] [4]. An OBDA system thus behaves as a form of information integration that replaces the global schema with a general, ontology-based, end-user-oriented query interface over diverse data sources. The ontology, together with the corresponding mappings to the data sources, provides the documentation needed to collect the correct data to return to the client.
OBDA specifications focus on query answering, to ensure that the considered queries receive the same answers for all possible extensions of the data sources [4]. The life cycle of an OBDA system starts when end-users pass their SPARQL queries through a visual interface to the ontology layer, without any knowledge of the actual structure of the data. The query is first rewritten with respect to the ontology, using the description logic that underlies it. The resulting query is then rewritten again with respect to the mapping assertions over the data sources to obtain the query answer. In this scenario, end-users and domain experts can access big data without asking IT experts.
OBDA is a recently proposed approach that provides a conceptual view of relational data sources. It addresses the problem of direct access to big data by giving end-users an ontology that sits between users and sources and is connected to the data via mappings. We introduce the languages used to represent ontologies and the mapping assertions from which query answers are derived from the sources. Query answering is divided into two steps: (i) ontology rewriting, in which the query is rewritten with respect to the ontology into a new query; (ii) mapping rewriting, in which the query obtained from the previous step is reformulated over the data sources using the mapping assertions. In this survey, we study earlier work by other researchers in the fields of ontologies, mappings, and query answering over data sources.
To make this idea clearer, let us assume that the ontology T is given by a set of axioms expressed in a description logic (DL), that D is a relational database compatible with the data sources S, and that M is a set of mapping assertions, each of the form $\phi(\vec{x}) \rightarrow \psi(\vec{x})$, where $\phi(\vec{x})$ is a query over S that returns rows of values for $\vec{x}$, and $\psi(\vec{x})$ is a query over T whose free variables are from $\vec{x}$ [2]. Later in this review, we will see how the ontology and the mappings, taken as inputs, help end-users compute a query that can be executed over the data sources.
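A mapping assertion of this form can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any cited system: the `emp` table, the predicates `Researcher` and `hasName`, and the helper `apply_mapping` are all hypothetical names chosen for the example, and SQLite stands in for the data sources S.

```python
import sqlite3

# Hypothetical source schema S: one table of employees (illustrative names).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1, "Ada", "R&D"), (2, "Bob", "Sales")],
)

# One mapping assertion phi(x) -> psi(x):
#   phi is an SQL query over S returning rows of values for x = (id, name);
#   psi is a list of ontology atoms whose variables all come from x.
mapping = {
    "phi": "SELECT id, name FROM emp WHERE dept = 'R&D' ORDER BY id",
    "vars": ("id", "name"),
    "psi": [("Researcher", ("id",)), ("hasName", ("id", "name"))],
}


def apply_mapping(conn, m):
    """Instantiate the atoms of psi with every row returned by phi."""
    facts = []
    for row in conn.execute(m["phi"]):
        binding = dict(zip(m["vars"], row))
        for predicate, args in m["psi"]:
            facts.append((predicate, tuple(binding[a] for a in args)))
    return facts


facts = apply_mapping(conn, mapping)
print(facts)
```

Here the facts are materialized only to make the semantics of the assertion visible; an OBDA system as described in this review would typically keep them virtual and rewrite queries instead.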

II. Motivation
Over uniform data sources, query results can be retrieved within seconds or minutes. Over differing sources, end-users need to collaborate with skilled IT experts to develop queries that retrieve the required data. In this scenario, the turnaround time between asking a question and retrieving the results may be days or more. The challenge, then, is how end-users and domain experts can access big data without asking IT experts.
OBDA is a recently proposed approach that addresses the problem of direct access to data. It integrates data from several sources and avoids the bottleneck by automating the query translation process. OBDA can be considered a virtual approach that tells us exactly where the data reside. OBDA also addresses structural heterogeneity, in which different information systems store their data in different structures, and semantic heterogeneity, which concerns the content of information items and their intended meaning [5].
Several features of a successful OBDA implementation lead us to believe that it is the right approach for end-users to access big data [2][4][5]:
• Ontologies: The objective of an ontology in an OBDA system is to describe the domain by classifying and categorizing the elements it contains.
• Mapping assertions: The ontology plays an important role in information integration, because it brings together information in different formats. To support data integration, mappings connect the ontology with the data sources.
• Query answering: The database queries used in OBDA are typically conjunctive queries in first-order logic. These queries fall into two categories: (i) instance queries (IQs), which ask for the instances of a single concept over the OBDA specification; (ii) unions of conjunctive queries (UCQs), which ask for the answers to a set of conjunctive queries over the OBDA specification.
For end-users to create value from rapidly growing data, OBDA also has the following properties: (1) It is declarative, so neither end-users nor IT experts need to write special-purpose program code. (2) Relational databases can remain as they are, so there is no need to move large and complex data sets. (3) OBDA adapts as the data scale, so data retrieval remains stable. (4) OBDA hides the complexity of the data sources from the end-users. (5) The relationship between the ontology concepts and the data sources provides a means for experts (database administrators) to make their knowledge available to the end-user.

A. Data Sources and Big Data
Data sources can be classified as structured or unstructured. The term "structured data" refers to data stored in an identifiable structure of rows and columns and organized so that human readers can search it by the types of its content. The term "unstructured data" refers to data with no identifiable common structure, such as videos, emails, documents, and texts, each of which has its own structure or format.
Big data is an expression that refers to a collection of enormous and complex data sets generated and accumulated at three levels: first, the employees in companies who enter data into computer systems; second, the users who generate data, possibly incorrect, by signing up to websites such as Facebook, a level that is larger in magnitude than the first; and third, the data accumulated from machines (satellites, sensors, robots, etc.). Together, the three levels produce big data, which has three main characteristics: volume, velocity, and variety. However, [6] adds one more characteristic, value. The justification is that a great deal of information is hidden in larger bodies of non-traditional data, so the challenge is to identify what is valuable and then to transform and extract the relevant data for analysis [7].

B. Ontology Rules
Ontologies are structural frameworks for organizing information, expressed as a formal definition of the types, properties, and interrelationships of the entities that exist in some domain. Ontologies also take on additional tasks, as discussed in the following sections.

1) Content Explication
Single ontology approaches [2][5] [8], shown in Figure 2, use one global ontology that provides a shared vocabulary: all information sources are related to this global ontology, which is mapped to the local data sources for information retrieval. This approach is not effective if one information source has a different view of the domain. It is also sensitive to changes in the information sources, because any change implies changes to the global ontology and to the mappings to the data sources.
Multi ontology approaches [2][5] [8], shown in Figure 3: 1) Each information source is described by its own ontology. 2) Each source ontology can be developed without regard to other sources or their ontologies. 3) This can simplify the integration task. 4) Comparing different source ontologies is difficult because of the lack of a common vocabulary.
Hybrid ontology approaches [2][5] [8] build the source ontologies from a global shared vocabulary to make them comparable. In Figure 4: 1) The semantics of each source is described by its own ontology. 2) Adding new sources requires no modification of the mappings or the shared vocabulary. 3) It is extremely hard to reuse existing ontologies, because all source ontologies must refer to the shared vocabulary.

2) Ontology knowledge
Description logics are logics specifically designed to represent structured knowledge about a domain that is composed of objects and structured into: (i) concepts, which correspond to classes and denote sets of objects; (ii) roles, which correspond to (binary) relationships and denote binary relations on objects.
The Web Ontology Language (OWL) is a richer vocabulary description language for describing properties and classes. The formal underpinning of OWL is based on description logics (DLs), knowledge representation formalisms with well-understood computational properties [9]. A DL ontology consists of a terminological box (TBox) and an assertion box (ABox). The TBox describes a system in terms of a controlled vocabulary, such as a set of classes and properties. The ABox contains statements expressed in the vocabulary that the TBox defines. Together, the TBox and the ABox form the knowledge base (KB).
DLs are a family of logics concerned with knowledge representation. A DL is a decidable fragment of first-order logic (FOL) associated with a set of automatic reasoning procedures. The basic constructs of a DL are the notion of a concept and the notion of a relationship. Complex concept and relationship expressions can be built from atomic concepts and relationships using suitable constructors [4] [9]. Since the ontology is a model of (some aspect of) the world, it can introduce vocabulary relevant to the domain with a specific meaning (semantics). For example, the statement "a happy cat owner owns a cat and all cats he cares for are healthy" can be formalized in a suitable description logic. The best-known description logics are given in [10], where assertions of the form R(a, b) describe the data. Further details can be found in [3][4][10] [11]. Figure 5 shows an example of a DL knowledge base.
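As an illustration, the cat-owner sentence can be written as a single concept inclusion. The concept and role names below (HappyCatOwner, Cat, Healthy, owns, caresFor) are naming assumptions made for this sketch, and the second line gives the corresponding first-order reading.

```latex
\[
\mathsf{HappyCatOwner} \sqsubseteq
  \exists\,\mathsf{owns}.\mathsf{Cat} \;\sqcap\; \forall\,\mathsf{caresFor}.\mathsf{Healthy}
\]
\[
\forall x\,\bigl(\mathsf{HappyCatOwner}(x) \rightarrow
  \exists y\,(\mathsf{owns}(x,y) \wedge \mathsf{Cat}(y)) \;\wedge\;
  \forall z\,(\mathsf{caresFor}(x,z) \rightarrow \mathsf{Healthy}(z))\bigr)
\]
```

The existential restriction captures "owns a cat" and the universal restriction captures "everything he cares for is healthy"; a data-level statement such as owns(a, b) would be an assertion of the form R(a, b).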

C. Mapping
The purpose of a mapping is to reconcile the heterogeneity that arises from differently designed schemas, even when the people or organizations involved model the same domain. These problems mostly occur between the mediated schema and the schemas of the data sources. In Figure 5, schema mappings describe the relation under which instances of the mediated schema are consistent with the current instances of the data sources [12]. Here I(G) (respectively I(S_i)) denotes the set of possible instances of the mediated schema G (respectively of the source S_i).
A mapping specifies all possible instances of the mediated schema G given the instances of the sources $I(S_1), I(S_2), \ldots, I(S_n)$. In other words, a mapping assertion specifies the semantic relationship between elements of a DL TBox ontology and elements of the data sources [4].
Many OBDA studies have focused on understanding which ontology and mapping languages allow query answering to be performed, taking into account inconsistency and redundancy in OBDA mappings [3]. Query execution can be performed if (1) the ontology is expressed in an ontology language of the DL-Lite family, and (2) the mappings are of one of the following types: (a) Global-as-View (GAV), in which the mediated schema is defined as a set of views over the data sources, so that the mapping goes from entities in the global ontology to entities in the original sources; (b) Local-as-View (LAV), in which the data sources are defined as views over the mediated schema, so that the mapping goes from entities in the original sources to the global ontology; (c) GLAV, the combination of the two.
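The GAV case can be sketched as follows: because each ontology predicate is defined as a view over the sources, a conjunctive query over the ontology can be unfolded into a single SQL query by substituting the views and joining on shared variables. This is a simplified sketch under stated assumptions, not the algorithm of any cited platform; the `staff` table, the predicates, and the `unfold` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff (sid INTEGER, fullname TEXT, role TEXT);
INSERT INTO staff VALUES (1, 'Ada', 'researcher'), (2, 'Bob', 'clerk');
""")

# GAV: each ontology predicate is a view (an SQL query) over the sources.
# Columns c0, c1, ... give the argument positions of the predicate.
gav_views = {
    "Researcher": "SELECT sid AS c0 FROM staff WHERE role = 'researcher'",
    "hasName": "SELECT sid AS c0, fullname AS c1 FROM staff",
}


def unfold(answer_vars, atoms, views):
    """Unfold a conjunctive query over the ontology into one SQL query."""
    first_seen = {}  # variable -> "alias.column" of its first occurrence
    from_parts, conditions = [], []
    for i, (predicate, args) in enumerate(atoms):
        alias = f"a{i}"
        from_parts.append(f"({views[predicate]}) AS {alias}")
        for pos, var in enumerate(args):
            column = f"{alias}.c{pos}"
            if var in first_seen:
                # A repeated variable becomes a join condition.
                conditions.append(f"{first_seen[var]} = {column}")
            else:
                first_seen[var] = column
    select = ", ".join(first_seen[v] for v in answer_vars)
    sql = f"SELECT {select} FROM " + ", ".join(from_parts)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# q(n) :- Researcher(x), hasName(x, n)
sql = unfold(["n"], [("Researcher", ("x",)), ("hasName", ("x", "n"))], gav_views)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

Unfolding is this direct only for GAV; under LAV the sources are views over the mediated schema, so answering a query requires rewriting it using views rather than simple substitution.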
Mapping analysis in OBDA aims to provide the designer with services that help produce a well-founded OBDA specification, so two important points should be considered. (1) A mapping M is inconsistent with respect to the ontology O and the source schema S when the retrieved data lead to an inconsistent OBDA specification even though the instance of S is non-empty. In other words, either no data are retrieved or the retrieved data are mismatched.

IV. Methodology of OBDA
In Figure 6, the processing of the query obtained from the end-user via the visual query system is divided into two steps: (i) ontology rewriting, in which the query is rewritten with respect to the ontology into a new query; (ii) mapping rewriting, in which the query obtained is reformulated over the data sources using the mapping assertions [14]. The specification of an OBDA system is a triple J = (O, S, M), where O is the description logic TBox ontology, S is a source schema with integrity constraints, and M is a mapping between the two that consists of assertions of the form $\phi(\vec{x}) \rightarrow \psi(\vec{x})$, where $\phi(\vec{x})$ is a query over the sources and $\psi(\vec{x})$ is a query over the ontology [11] [13]. We denote the signature of O by $\Sigma_O$ and its description logic language by $L_O$, while S has the signature $\Sigma_S$ and the language $L_S$; the arity of $\vec{x}$ is the number of arguments that the query passes. The mapping assertions m ∈ M have the form of equation (3). The ontology O is as follows. In words, O specifies that Asians and Africans are Humans, that an Asian cannot be African, and that every Human has a Name and is located in a Location that has a Code. Moreover, every Code has a Name. The mapping M between O and S is as follows: [...] The semantics of the OBDA specification J with respect to S is legal if [...], where [...] is a set of facts over $\Sigma_S$. In other words, for each S a legal instance always exists. In equation (5), every mapping assertion will denote the existential arguments in the head (m) [
Fig. 5. Semantics of schema mappings

A. Evaluation
The main aim of ontology rewriting is to solve the problem of answering the query that comes from the end-user. The idea is to transform the given query and the TBox into an expanded query that contains all the relevant information captured in the TBox, and then to evaluate the expanded query over the ABox alone. The expanded query takes the form of a union of conjunctive queries (UCQ), which avoids keeping large ABoxes in memory [16].
Another issue is the size of the rewritten query over the ontology, which depends on the size of the TBox and of the original query. The resulting UCQ may contain hundreds or thousands of queries, which degrades the performance of information retrieval.
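The rewriting idea can be sketched for the simplest case, where the TBox contains only inclusions between atomic concepts. This is a deliberately restricted illustration and not a full DL-Lite rewriting algorithm: the concept names follow the Asian/African/Human example used in this section, while the individuals, the ABox contents, and the helper functions are assumptions, and the evaluator handles only queries whose atoms share one variable.

```python
from itertools import product

# TBox: atomic concept inclusions (sub, sup), e.g. Asian is subsumed by Human.
tbox = [("Asian", "Human"), ("African", "Human")]

# ABox: concept assertions about individuals (hypothetical data).
abox = {("Asian", "li"), ("African", "amara"), ("Human", "eve")}


def specializations(concept, tbox):
    """All concepts entailed to be subsumed by `concept`, itself included."""
    found, frontier = {concept}, [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in tbox:
            if sup == current and sub not in found:
                found.add(sub)
                frontier.append(sub)
    return found


def rewrite(cq, tbox):
    """Expand a CQ (a list of unary atoms) into a UCQ (a list of CQs)."""
    choices = [
        [(c, var) for c in sorted(specializations(concept, tbox))]
        for concept, var in cq
    ]
    return [list(combo) for combo in product(*choices)]


def evaluate(ucq, abox):
    """Answer single-variable CQs over the ABox alone, ignoring the TBox."""
    individuals = {ind for _, ind in abox}
    answers = set()
    for cq in ucq:
        for ind in individuals:
            if all((concept, ind) in abox for concept, _ in cq):
                answers.add(ind)
    return answers


# q(x) :- Human(x)
ucq = rewrite([("Human", "x")], tbox)
answers = evaluate(ucq, abox)
print(ucq)
print(answers)
```

Because `rewrite` takes a product over the specializations of every atom, the number of conjunctive queries grows multiplicatively with the number of atoms, which is one way to see why the rewritten query can become very large.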
Two types of problems may appear in an OBDA system: (1) Syntactic problems, which are absent when the ontology TBox, represented in the DL-Lite family, is correctly formulated and the mapping assertions contain no misspellings. (2) Semantic problems, which are absent from the ontology when it contains no unsatisfiable concepts, roles, or attributes. For the mapping, a mapping assertion m ∈ M is semantically anomalous if the answer to either the head query of m or the body query of m is empty. If the body of the assertion (an SQL query over the database) is empty, then the assertion m is useless; if the head of the assertion (a conjunctive query) is empty and the body is not, the assertion may lead to a contradiction [15].
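The "useless assertion" case can be checked mechanically, as in the following sketch. It assumes a hypothetical `person` table and hypothetical head concepts, and it tests emptiness only against one concrete database instance, which is weaker than an analysis over all legal instances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER, continent TEXT);
INSERT INTO person VALUES (1, 'Asia'), (2, 'Africa');
""")

# Hypothetical mapping assertions: body (SQL over the sources) -> head concept.
mappings = [
    ("SELECT id FROM person WHERE continent = 'Asia'", "Asian"),
    ("SELECT id FROM person WHERE continent = 'Atlantis'", "Atlantean"),
]


def useless_mappings(conn, mappings):
    """Return the head concepts of assertions whose body returns no rows."""
    flagged = []
    for body, head in mappings:
        if conn.execute(body).fetchone() is None:
            flagged.append(head)
    return flagged


flagged = useless_mappings(conn, mappings)
print(flagged)
```

A check of this kind only reports candidates for the designer to review, since a body that is empty today may return rows once the sources are updated.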

B. Table Summary
In this section, we discuss the OBDA systems covered in this review. First, we compare different systems that use OBDA for the integration of heterogeneous information sources. We compare their ontology languages as well as the way they connect the ontology with the sources via mappings. Table 1 shows that the ontology is formulated using the DL-Lite family [17][18] [19] [20] or the DL behind OWL, as shown in (1) [14][21] [22][23] [24]. Table 1 also shows that most of the presented platforms use GAV mapping rewriting. In addition, it lists the methodology with which each system implements the OBDA specification and some important points that shed light on how these systems derive data from the sources.
In Table 2, we discuss how mappings connect to the information sources, as follows: (1) The straightforward approach connects the ontology to the data schema by producing a one-to-one copy of the structure of the database and encoding it in a language that makes automated reasoning possible. (2) The definition approach does not correspond to the structure of the database; terms are linked to the information only through their definitions. (3) Structure enrichment combines the two previous approaches, covering both the structure and the information source. (4) Meta-annotation adds semantic information to information sources that are present on the World Wide Web [5].
Table 3 summarizes the standard languages and the query models used in this review. In the GAV approach, the ontology is defined as a set of views over the data sources. In the GLAV approach, each mapping rule is represented by a conjunctive query written over the global schema, associated with a conjunctive query written over the source schemas. R2RML is a mapping language that connects relational databases to RDF datasets through logical tables that retrieve data from the input database. The standard languages also include the DL-Lite family and OWL 2 QL ontology languages, which have a formally defined meaning [3][4] [11][13] [25]. Table 3 also shows that query answering can use unions of conjunctive queries (UCQs) [3] or standalone conjunctive queries (CQs) over the ontology.
Table 3. Standard languages and query models: GAV for mapping assertions; GLAV for mapping assertions; conjunctive queries (CQs) [11] [13].

VI. Conclusion
A promising OBDA system is able to solve many challenges related to end-user data access, especially for big data. The approach presented here answers queries in two steps: (i) ontology rewriting; (ii) mapping rewriting over the data sources. A successful OBDA implementation can solve the problem of accessing big data as follows: (1) Neither end-users nor IT experts need to write special-purpose code. (2) The data can be left in the relational database. (3) It provides a flexible query language that suits end-users. (4) The ontology hides the complexity of the source schema from the end-user. (5) The database experts' knowledge becomes available to end-users through the relationship between the ontology and the sources established by the mappings.
From this survey, we have found that most researchers' efforts study how to extract implicit knowledge from big data by using ontologies and declarative mappings between the data and the ontology schemas. Researchers have also introduced platforms based on OBDA systems, both existing and under construction, that give end-users the ability to access big data by writing queries through visual interfaces.