Keywords: similarity-based operations, similarity join, similarity grouping, data integration
The research field of data integration is of growing practical importance, especially given the increasing availability of huge amounts of data from more and more source systems. Current research in this field includes approaches to resolving conflicts on the data level, the problem addressed in this thesis. Dealing with discrepancies in data is still a major challenge; it is relevant, for instance, when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. The corresponding operations usually cannot be based on equality of values alone, because identifiers that are valid across system boundaries exist only in rare cases. Using other attribute values is problematic, because erroneous data and varying conventions for representing information are common in this field. Therefore, such operations have to be based on the similarity of data objects and values. The concept of similarity is itself problematic with regard to its usage and the foundations of its semantics. Successful applications often take a very specific view of similarity measures and predicates, reflecting a narrow focus on the context of similarity in the given scenario. Providing similarity-based operations for data integration purposes requires a broader view of similarity, one that can accommodate, for instance, a number of generic and tailor-made similarity measures useful in a given data integration system. This thesis addresses these problems by providing similarity-based operations within a small, generic framework. Similarity-based selection, join, and grouping operations are discussed with regard to their general semantics and special aspects of the underlying similarity relations. Corresponding algorithms suitable for data processing are described for materialised and virtual integration scenarios. Implementations are given and evaluated to demonstrate the applicability and efficiency of the proposed approaches.
On the predicate level, the thesis focuses on string similarity, specifically similarity based on the Levenshtein (edit) distance. The efficient processing of similarity-based operations depends mainly on the efficient evaluation of similarity predicates; this is illustrated for string similarity using index support in materialised and pre-selection in virtual data integration scenarios.
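To make the notion of an edit-distance predicate concrete, the following minimal sketch shows how a similarity join over two lists of strings might be expressed. It is an illustration only and not the implementation evaluated in the thesis: the function names `edit_distance`, `similar`, and `similarity_join`, the threshold parameter `k`, and the length-difference check used as a cheap pre-selection step are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows of the table as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similar(a: str, b: str, k: int) -> bool:
    """Hypothetical similarity predicate: edit distance of at most k."""
    # Assumed pre-selection: strings whose lengths differ by more than k
    # cannot be within distance k, so the expensive check is skipped.
    if abs(len(a) - len(b)) > k:
        return False
    return edit_distance(a, b) <= k


def similarity_join(left: Sequence[str], right: Sequence[str],
                    k: int) -> List[Tuple[str, str]]:
    """Nested-loop join that pairs values satisfying the predicate."""
    return [(l, r) for l in left for r in right if similar(l, r, k)]


if __name__ == "__main__":
    pairs = similarity_join(["Jon Smith", "Alice"],
                            ["John Smith", "Alicia", "Bob"], k=2)
    print(pairs)
```

The nested loop compares every pair of values, which is only practical for small inputs. The index support and pre-selection techniques mentioned above address exactly this cost, by reducing the number of pairs for which the full distance has to be computed.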